| 1 | +# LocalAI Backend Architecture |
| 2 | + |
| 3 | +This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks. |
| 8 | + |
| 9 | +## Architecture Components |
| 10 | + |
| 11 | +### 1. Protocol Definition (`backend.proto`) |
| 12 | + |
| 13 | +The `backend.proto` file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services. |
| 14 | + |
| 15 | +#### Core Services |
| 16 | + |
| 17 | +- **Text Generation**: `Predict`, `PredictStream` for LLM inference |
| 18 | +- **Embeddings**: `Embedding` for text vectorization |
| 19 | +- **Image Generation**: `GenerateImage` for stable diffusion and image models |
| 20 | +- **Audio Processing**: `AudioTranscription`, `TTS`, `SoundGeneration` |
| 21 | +- **Video Generation**: `GenerateVideo` for video synthesis |
| 22 | +- **Object Detection**: `Detect` for computer vision tasks |
| 23 | +- **Vector Storage**: `StoresSet`, `StoresGet`, `StoresFind` for RAG operations |
| 24 | +- **Reranking**: `Rerank` for document relevance scoring |
| 25 | +- **Voice Activity Detection**: `VAD` for audio segmentation |
| 26 | + |
| 27 | +#### Key Message Types |
| 28 | + |
| 29 | +- **`PredictOptions`**: Comprehensive configuration for text generation |
| 30 | +- **`ModelOptions`**: Model loading and configuration parameters |
| 31 | +- **`Result`**: Standardized response format |
| 32 | +- **`StatusResponse`**: Backend health and memory usage information |
| 33 | + |
| 34 | +### 2. Multi-Language Dockerfiles |
| 35 | + |
| 36 | +The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages: |
| 37 | + |
| 38 | +- `Dockerfile.python` |
| 39 | +- `Dockerfile.golang` |
| 40 | +- `Dockerfile.llama-cpp` |
| 41 | + |
| 42 | +### 3. Language-Specific Implementations |
| 43 | + |
| 44 | +#### Python Backends (`python/`) |
| 45 | +- **transformers**: Hugging Face Transformers framework |
| 46 | +- **vllm**: High-performance LLM inference |
| 47 | +- **mlx**: Apple Silicon optimization |
| 48 | +- **diffusers**: Stable Diffusion models |
| 49 | +- **Audio**: bark, coqui, faster-whisper, kitten-tts |
| 50 | +- **Vision**: mlx-vlm, rfdetr |
| 51 | +- **Specialized**: rerankers, chatterbox, kokoro |
| 52 | + |
| 53 | +#### Go Backends (`go/`) |
| 54 | +- **whisper**: OpenAI Whisper speech recognition, written in Go on top of the GGML C++ backend (whisper.cpp) |
| 55 | +- **stablediffusion-ggml**: Stable Diffusion, written in Go on top of the GGML C++ backend |
| 56 | +- **huggingface**: Hugging Face model integration |
| 57 | +- **piper**: Text-to-speech synthesis in Go with C bindings to rhasspy/piper |
| 58 | +- **bark-cpp**: Bark TTS models in Go with C++ bindings |
| 59 | +- **local-store**: Vector storage backend |
| 60 | + |
| 61 | +#### C++ Backends (`cpp/`) |
| 62 | +- **llama-cpp**: Llama.cpp integration |
| 63 | +- **grpc**: gRPC utilities and helpers |
| 64 | + |
| 65 | +## Hardware Acceleration Support |
| 66 | + |
| 67 | +### CUDA (NVIDIA) |
| 68 | +- **Versions**: CUDA 11.x, 12.x |
| 69 | +- **Features**: cuBLAS, cuDNN, TensorRT optimization |
| 70 | +- **Targets**: x86_64, ARM64 (Jetson) |
| 71 | + |
| 72 | +### ROCm (AMD) |
| 73 | +- **Features**: HIP, rocBLAS, MIOpen |
| 74 | +- **Targets**: AMD GPUs with ROCm support |
| 75 | + |
| 76 | +### Intel |
| 77 | +- **Features**: oneAPI, Intel Extension for PyTorch |
| 78 | +- **Targets**: Intel GPUs, XPUs, CPUs |
| 79 | + |
| 80 | +### Vulkan |
| 81 | +- **Features**: Cross-platform GPU acceleration |
| 82 | +- **Targets**: Windows, Linux, Android, macOS |
| 83 | + |
| 84 | +### Apple Silicon |
| 85 | +- **Features**: MLX framework, Metal Performance Shaders |
| 86 | +- **Targets**: M1/M2/M3 Macs |
| 87 | + |
| 88 | +## Backend Registry (`index.yaml`) |
| 89 | + |
| 90 | +The `index.yaml` file serves as a central registry for all available backends, providing: |
| 91 | + |
| 92 | +- **Metadata**: Name, description, license, icons |
| 93 | +- **Capabilities**: Hardware targets and optimization profiles |
| 94 | +- **Tags**: Categorization for discovery |
| 95 | +- **URLs**: Source code and documentation links |
| 96 | + |
| 97 | +## Building Backends |
| 98 | + |
| 99 | +### Prerequisites |
| 100 | +- Docker with multi-architecture support |
| 101 | +- Appropriate hardware drivers (CUDA, ROCm, etc.) |
| 102 | +- Build tools (make, cmake, compilers) |
| 103 | + |
| 104 | +### Build Commands |
| 105 | + |
| 106 | +Example build commands using Docker: |
| 107 | + |
| 108 | +```bash |
| 109 | +# Build Python backend |
| 110 | +docker build -f backend/Dockerfile.python \ |
| 111 | + --build-arg BACKEND=transformers \ |
| 112 | + --build-arg BUILD_TYPE=cublas12 \ |
| 113 | + --build-arg CUDA_MAJOR_VERSION=12 \ |
| 114 | + --build-arg CUDA_MINOR_VERSION=0 \ |
| 115 | + -t localai-backend-transformers . |
| 116 | + |
| 117 | +# Build Go backend |
| 118 | +docker build -f backend/Dockerfile.golang \ |
| 119 | + --build-arg BACKEND=whisper \ |
| 120 | + --build-arg BUILD_TYPE=cpu \ |
| 121 | + -t localai-backend-whisper . |
| 122 | + |
| 123 | +# Build C++ backend |
| 124 | +docker build -f backend/Dockerfile.llama-cpp \ |
| 125 | + --build-arg BACKEND=llama-cpp \ |
| 126 | + --build-arg BUILD_TYPE=cublas12 \ |
| 127 | + -t localai-backend-llama-cpp . |
| 128 | +``` |
| 129 | + |
| 130 | +For ARM64/Mac builds, Docker cannot be used; build with the Makefile in the respective backend directory instead. |
| 131 | + |
| 132 | +### Build Types |
| 133 | + |
| 134 | +- **`cpu`**: CPU-only optimization |
| 135 | +- **`cublas11`**: CUDA 11.x with cuBLAS |
| 136 | +- **`cublas12`**: CUDA 12.x with cuBLAS |
| 137 | +- **`hipblas`**: ROCm with rocBLAS |
| 138 | +- **`intel`**: Intel oneAPI optimization |
| 139 | +- **`vulkan`**: Vulkan-based acceleration |
| 140 | +- **`metal`**: Apple Metal optimization |
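The Docker commands above differ only in the Dockerfile, the `BACKEND` and `BUILD_TYPE` build arguments, and the image tag. As a rough sketch, a small helper can assemble the command and reject build types that are not in the list above. The function name `docker_build_argv` and its parameters are hypothetical and not part of LocalAI; only flags already shown in this README are used.

```python
import shlex

# Build types listed in this README. Per the note above, ARM64/Mac builds
# go through the backend Makefile, so `metal` may not apply to Docker builds.
BUILD_TYPES = {"cpu", "cublas11", "cublas12", "hipblas", "intel", "vulkan", "metal"}


def docker_build_argv(dockerfile, backend, build_type, tag, extra_build_args=None):
    """Return the argv list for a `docker build` call shaped like the examples above.

    Hypothetical helper for illustration; it only assembles arguments and
    does not run Docker.
    """
    if build_type not in BUILD_TYPES:
        raise ValueError(f"unknown BUILD_TYPE: {build_type!r}")
    build_args = {"BACKEND": backend, "BUILD_TYPE": build_type}
    build_args.update(extra_build_args or {})
    argv = ["docker", "build", "-f", dockerfile]
    for key, value in build_args.items():
        argv += ["--build-arg", f"{key}={value}"]
    argv += ["-t", tag, "."]
    return argv


if __name__ == "__main__":
    argv = docker_build_argv(
        "backend/Dockerfile.python",
        "transformers",
        "cublas12",
        "localai-backend-transformers",
        {"CUDA_MAJOR_VERSION": "12", "CUDA_MINOR_VERSION": "0"},
    )
    # Print a copy-pasteable shell command.
    print(shlex.join(argv))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.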
| 141 | + |
| 142 | +## Backend Development |
| 143 | + |
| 144 | +### Creating a New Backend |
| 145 | + |
| 146 | +1. **Choose Language**: Select Python, Go, or C++ based on requirements |
| 147 | +2. **Implement Interface**: Implement the gRPC service defined in `backend.proto` |
| 148 | +3. **Add Dependencies**: Create appropriate requirements files |
| 149 | +4. **Configure Build**: Set up Dockerfile and build scripts |
| 150 | +5. **Register Backend**: Add entry to `index.yaml` |
| 151 | +6. **Test Integration**: Verify gRPC communication and functionality |
| 152 | + |
| 153 | +### Backend Structure |
| 154 | + |
| 155 | +``` |
| 156 | +backend-name/ |
| 157 | +├── backend.py/go/cpp # Main implementation |
| 158 | +├── requirements.txt # Dependencies |
| 159 | +├── Dockerfile # Build configuration |
| 160 | +├── install.sh # Installation script |
| 161 | +├── run.sh # Execution script |
| 162 | +├── test.sh # Test script |
| 163 | +└── README.md # Backend documentation |
| 164 | +``` |
| 165 | + |
| 166 | +### Required gRPC Methods |
| 167 | + |
| 168 | +At minimum, backends must implement: |
| 169 | +- `Health()` - Service health check |
| 170 | +- `LoadModel()` - Model loading and initialization |
| 171 | +- `Predict()` - Main inference endpoint |
| 172 | +- `Status()` - Backend status and metrics |
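To make the shape of this minimal contract concrete, here is a sketch that deliberately leaves gRPC out. A real backend implements the service generated from `backend.proto` and is served over gRPC; the `Reply` class, the `EchoBackend` class, the method arguments, and the status strings below are hypothetical stand-ins, not types from `backend.proto`.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    """Hypothetical stand-in for a response message; not a type from backend.proto."""
    success: bool = True
    message: str = ""


class EchoBackend:
    """Toy backend exposing the four required methods as plain Python methods."""

    def __init__(self):
        self.model = None  # set by LoadModel

    def Health(self):
        # Liveness only: answers even before a model is loaded.
        return Reply(message="OK")

    def LoadModel(self, model_path):
        if not model_path:
            return Reply(success=False, message="no model path given")
        self.model = model_path
        return Reply(message=f"loaded {model_path}")

    def Predict(self, prompt):
        # Inference requires a loaded model; the "model" here just upper-cases.
        if self.model is None:
            return Reply(success=False, message="model not loaded")
        return Reply(message=prompt.upper())

    def Status(self):
        # Assumed state names, for illustration only.
        state = "READY" if self.model else "UNINITIALIZED"
        return Reply(message=state)
```

The point of the sketch is the ordering constraint: `Health` and `Status` must work at any time, while `Predict` is only meaningful after a successful `LoadModel`.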
| 173 | + |
| 174 | +## Integration with LocalAI Core |
| 175 | + |
| 176 | +Backends communicate with LocalAI core through gRPC: |
| 177 | + |
| 178 | +1. **Service Discovery**: Core discovers available backends |
| 179 | +2. **Model Loading**: Core requests model loading via `LoadModel` |
| 180 | +3. **Inference**: Core sends requests via `Predict` or specialized endpoints |
| 181 | +4. **Streaming**: Core handles streaming responses for real-time generation |
| 182 | +5. **Monitoring**: Core tracks backend health and performance |
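The sequence above can be sketched as a single in-process function. This is an assumption-laden illustration rather than how LocalAI core is implemented: `StubBackend`, `run_request`, and the dictionary-based lookup are hypothetical, and the stub replaces a gRPC connection with direct method calls.

```python
class StubBackend:
    """Hypothetical in-process stand-in for a backend reachable over gRPC."""

    def __init__(self):
        self.loaded = False

    def Health(self):
        return True

    def LoadModel(self, model):
        self.loaded = True
        return True

    def PredictStream(self, prompt):
        # Yield one chunk per word, the way a streaming RPC yields partial results.
        if not self.loaded:
            raise RuntimeError("model not loaded")
        for word in prompt.split():
            yield word


def run_request(backends, name, model, prompt):
    """Walk through lookup, health check, model loading, and streamed inference."""
    backend = backends[name]          # discovery: look the backend up by name
    if not backend.Health():          # monitoring: refuse unhealthy backends
        raise RuntimeError(f"backend {name!r} is unhealthy")
    if not backend.LoadModel(model):  # model loading
        raise RuntimeError(f"backend {name!r} failed to load {model!r}")
    chunks = []
    for chunk in backend.PredictStream(prompt):  # inference, consumed as a stream
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    print(run_request({"stub": StubBackend()}, "stub", "model.bin", "hello from core"))
```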
| 183 | + |
| 184 | +## Performance Optimization |
| 185 | + |
| 186 | +### Memory Management |
| 187 | +- **Model Caching**: Efficient model loading and caching |
| 188 | +- **Batch Processing**: Optimize for multiple concurrent requests |
| 189 | +- **Memory Pinning**: GPU memory optimization for CUDA/ROCm |
| 190 | + |
| 191 | +### Hardware Utilization |
| 192 | +- **Multi-GPU**: Support for tensor parallelism |
| 193 | +- **Mixed Precision**: FP16/BF16 for memory efficiency |
| 194 | +- **Kernel Fusion**: Optimized CUDA/ROCm kernels |
| 195 | + |
| 196 | +## Troubleshooting |
| 197 | + |
| 198 | +### Common Issues |
| 199 | + |
| 200 | +1. **gRPC Connection**: Verify backend service is running and accessible |
| 201 | +2. **Model Loading**: Check model paths and dependencies |
| 202 | +3. **Hardware Detection**: Ensure appropriate drivers and libraries |
| 203 | +4. **Memory Issues**: Monitor GPU memory usage and model sizes |
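For the first item, a quick way to tell "nothing is listening" apart from "the service is misbehaving" is a plain TCP connection attempt. The helper below is a minimal sketch using only the Python standard library; the name `port_is_open` is hypothetical, and the host and port in the usage line are placeholders for wherever the backend was started.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address; substitute the address the backend actually listens on.
    print(port_is_open("127.0.0.1", 50051))
```

A `True` result only shows that something accepts TCP connections on that port. It does not confirm that the listener speaks gRPC or that the backend's `Health` call succeeds.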
| 204 | + |
| 205 | +## Contributing |
| 206 | + |
| 207 | +When contributing to the backend system: |
| 208 | + |
| 209 | +1. **Follow Protocol**: Implement the exact gRPC interface |
| 210 | +2. **Add Tests**: Include comprehensive test coverage |
| 211 | +3. **Document**: Provide clear usage examples |
| 212 | +4. **Optimize**: Consider performance and resource usage |
| 213 | +5. **Validate**: Test across different hardware targets |