Commit e35ad56

chore(docs): add backends README
Signed-off-by: Ettore Di Giacinto <[email protected]>
1 parent 3be8b2d commit e35ad56

1 file changed, +213 -0 lines changed

backend/README.md

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# LocalAI Backend Architecture

This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.

## Overview

LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.

## Architecture Components

### 1. Protocol Definition (`backend.proto`)

The `backend.proto` file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.

#### Core Services

- **Text Generation**: `Predict`, `PredictStream` for LLM inference
- **Embeddings**: `Embedding` for text vectorization
- **Image Generation**: `GenerateImage` for Stable Diffusion and other image models
- **Audio Processing**: `AudioTranscription`, `TTS`, `SoundGeneration`
- **Video Generation**: `GenerateVideo` for video synthesis
- **Object Detection**: `Detect` for computer vision tasks
- **Vector Storage**: `StoresSet`, `StoresGet`, `StoresFind` for RAG operations
- **Reranking**: `Rerank` for document relevance scoring
- **Voice Activity Detection**: `VAD` for audio segmentation

#### Key Message Types

- **`PredictOptions`**: Comprehensive configuration for text generation
- **`ModelOptions`**: Model loading and configuration parameters
- **`Result`**: Standardized response format
- **`StatusResponse`**: Backend health and memory usage information

34+
### 2. Multi-Language Dockerfiles
35+
36+
The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:
37+
38+
- `Dockerfile.python`
39+
- `Dockerfile.golang`
40+
- `Dockerfile.llama-cpp`
41+
### 3. Language-Specific Implementations

#### Python Backends (`python/`)
- **transformers**: Hugging Face Transformers framework
- **vllm**: High-performance LLM inference
- **mlx**: Apple Silicon optimization
- **diffusers**: Stable Diffusion models
- **Audio**: bark, coqui, faster-whisper, kitten-tts
- **Vision**: mlx-vlm, rfdetr
- **Specialized**: rerankers, chatterbox, kokoro

#### Go Backends (`go/`)
- **whisper**: OpenAI Whisper speech recognition, written in Go on top of the GGML C++ backend (whisper.cpp)
- **stablediffusion-ggml**: Stable Diffusion, written in Go on top of a GGML C++ backend
- **huggingface**: Hugging Face model integration
- **piper**: Text-to-speech synthesis, written in Go with C bindings to rhasspy/piper
- **bark-cpp**: Bark TTS models, written in Go with C++ bindings
- **local-store**: Vector storage backend

#### C++ Backends (`cpp/`)
- **llama-cpp**: llama.cpp integration
- **grpc**: gRPC utilities and helpers

## Hardware Acceleration Support

### CUDA (NVIDIA)
- **Versions**: CUDA 11.x, 12.x
- **Features**: cuBLAS, cuDNN, TensorRT optimization
- **Targets**: x86_64, ARM64 (Jetson)

### ROCm (AMD)
- **Features**: HIP, rocBLAS, MIOpen
- **Targets**: AMD GPUs with ROCm support

### Intel
- **Features**: oneAPI, Intel Extension for PyTorch
- **Targets**: Intel GPUs, XPUs, CPUs

### Vulkan
- **Features**: Cross-platform GPU acceleration
- **Targets**: Windows, Linux, Android, macOS

### Apple Silicon
- **Features**: MLX framework, Metal Performance Shaders
- **Targets**: M1/M2/M3 Macs

## Backend Registry (`index.yaml`)

The `index.yaml` file serves as a central registry for all available backends, providing:

- **Metadata**: Name, description, license, icons
- **Capabilities**: Hardware targets and optimization profiles
- **Tags**: Categorization for discovery
- **URLs**: Source code and documentation links

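As a rough illustration of the four groups listed above, the sketch below models one registry entry as a Python dictionary and reports which expected fields are absent. The key names (`name`, `description`, `license`, `icon`, `capabilities`, `tags`, `urls`), the helper `missing_fields`, and all values are assumptions made for this example; the authoritative schema is whatever `index.yaml` actually uses.

```python
# Hypothetical registry entry. Key names and values are illustrative
# assumptions, not the actual index.yaml schema.
EXAMPLE_ENTRY = {
    "name": "example-backend",
    "description": "Placeholder backend used only for this sketch",
    "license": "MIT",
    "icon": "https://example.invalid/icon.png",
    "capabilities": {"default": "cpu", "nvidia": "cublas12"},
    "tags": ["text-to-text", "example"],
    "urls": ["https://example.invalid/source"],
}

# Field groups mirroring the list in this README.
REQUIRED_GROUPS = {
    "metadata": ("name", "description", "license", "icon"),
    "capabilities": ("capabilities",),
    "tags": ("tags",),
    "urls": ("urls",),
}


def missing_fields(entry):
    """Return the sorted list of expected keys that are absent or empty."""
    missing = []
    for keys in REQUIRED_GROUPS.values():
        for key in keys:
            if not entry.get(key):
                missing.append(key)
    return sorted(missing)
```

A check like this could run in CI before a new entry is merged, so that incomplete registry entries are caught early.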
## Building Backends

### Prerequisites
- Docker with multi-architecture support
- Appropriate hardware drivers (CUDA, ROCm, etc.)
- Build tools (make, cmake, compilers)

### Build Commands

Example build commands using Docker:

```bash
# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .
```

For ARM64/Mac builds, Docker cannot be used; use the Makefile in the respective backend directory instead.

### Build Types

- **`cpu`**: CPU-only optimization
- **`cublas11`**: CUDA 11.x with cuBLAS
- **`cublas12`**: CUDA 12.x with cuBLAS
- **`hipblas`**: ROCm with rocBLAS
- **`intel`**: Intel oneAPI optimization
- **`vulkan`**: Vulkan-based acceleration
- **`metal`**: Apple Metal optimization

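The Docker commands in this section all follow one pattern: a Dockerfile, a `BACKEND` build argument, a `BUILD_TYPE` build argument, and an image tag. The sketch below composes that pattern and rejects build types that are not in the list above. The helper name `docker_build_command` and its parameters are hypothetical; it only assembles the argument list and does not invoke Docker.

```python
import shlex

# Build types as listed in this README.
BUILD_TYPES = ("cpu", "cublas11", "cublas12", "hipblas", "intel", "vulkan", "metal")


def docker_build_command(dockerfile, backend, build_type, extra_args=None):
    """Compose a `docker build` argument list shaped like the README examples."""
    if build_type not in BUILD_TYPES:
        raise ValueError(f"unknown build type: {build_type!r}")
    cmd = [
        "docker", "build",
        "-f", f"backend/{dockerfile}",
        "--build-arg", f"BACKEND={backend}",
        "--build-arg", f"BUILD_TYPE={build_type}",
    ]
    # Optional extra build args, e.g. CUDA_MAJOR_VERSION for cublas builds.
    for key, value in (extra_args or {}).items():
        cmd += ["--build-arg", f"{key}={value}"]
    cmd += ["-t", f"localai-backend-{backend}", "."]
    return cmd


if __name__ == "__main__":
    cmd = docker_build_command(
        "Dockerfile.python", "transformers", "cublas12",
        {"CUDA_MAJOR_VERSION": "12", "CUDA_MINOR_VERSION": "0"},
    )
    # Print a shell-quoted command line that can be reviewed before running.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting concerns.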
## Backend Development

### Creating a New Backend

1. **Choose Language**: Select Python, Go, or C++ based on requirements
2. **Implement Interface**: Implement the gRPC service defined in `backend.proto`
3. **Add Dependencies**: Create appropriate requirements files
4. **Configure Build**: Set up Dockerfile and build scripts
5. **Register Backend**: Add entry to `index.yaml`
6. **Test Integration**: Verify gRPC communication and functionality

### Backend Structure

```
backend-name/
├── backend.py/go/cpp    # Main implementation
├── requirements.txt     # Dependencies
├── Dockerfile           # Build configuration
├── install.sh           # Installation script
├── run.sh               # Execution script
├── test.sh              # Test script
└── README.md            # Backend documentation
```

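A new backend directory can be checked against this layout with a few lines of Python. The sketch below is one possible check, not part of LocalAI. It assumes that `backend.py/go/cpp` means one of `backend.py`, `backend.go`, or `backend.cpp`, and that `requirements.txt` is only needed when the main file is `backend.py`; both readings are assumptions, and `check_layout` is a hypothetical helper.

```python
from pathlib import Path

# Files from the layout above that are expected regardless of language.
COMMON_FILES = ("Dockerfile", "install.sh", "run.sh", "test.sh", "README.md")
# Assumed expansion of `backend.py/go/cpp`.
MAIN_FILES = ("backend.py", "backend.go", "backend.cpp")


def check_layout(backend_dir):
    """Return a list of human-readable problems found in a backend directory."""
    root = Path(backend_dir)
    problems = [f"missing {name}" for name in COMMON_FILES
                if not (root / name).is_file()]
    if not any((root / name).is_file() for name in MAIN_FILES):
        problems.append("missing main implementation (backend.py/go/cpp)")
    # Assumption: only Python backends need a requirements.txt.
    if (root / "backend.py").is_file() and not (root / "requirements.txt").is_file():
        problems.append("missing requirements.txt")
    return problems
```

An empty list means the directory matches the layout as interpreted here.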
### Required gRPC Methods

At minimum, backends must implement:
- `Health()` - Service health check
- `LoadModel()` - Model loading and initialization
- `Predict()` - Main inference endpoint
- `Status()` - Backend status and metrics

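To make the minimum surface concrete, here is a toy in-process class with the four method names listed above, plus a helper that reports which of them an object lacks. It deliberately uses no gRPC: a real backend would implement the servicer generated from `backend.proto` and work with its request and response messages. The class `EchoBackend`, the helper `missing_methods`, and the argument and return shapes are all simplifying assumptions.

```python
REQUIRED_METHODS = ("Health", "LoadModel", "Predict", "Status")


class EchoBackend:
    """Toy backend: no gRPC transport and no real model."""

    def __init__(self):
        self.model = None

    def Health(self):
        return {"ok": True}

    def LoadModel(self, model_path):
        # A real backend would load weights here; the sketch stores the path.
        self.model = model_path
        return {"success": True}

    def Predict(self, prompt):
        if self.model is None:
            raise RuntimeError("LoadModel must be called before Predict")
        return {"message": f"echo: {prompt}"}

    def Status(self):
        return {"state": "READY" if self.model else "UNINITIALIZED"}


def missing_methods(backend):
    """Names from REQUIRED_METHODS that `backend` does not provide."""
    return [m for m in REQUIRED_METHODS
            if not callable(getattr(backend, m, None))]
```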
## Integration with LocalAI Core

Backends communicate with LocalAI core through gRPC:

1. **Service Discovery**: Core discovers available backends
2. **Model Loading**: Core requests model loading via `LoadModel`
3. **Inference**: Core sends requests via `Predict` or specialized endpoints
4. **Streaming**: Core handles streaming responses for real-time generation
5. **Monitoring**: Core tracks backend health and performance

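The order of these steps can be sketched with a stand-in backend that records each call. This is only a model of the sequence: service discovery is not represented, nothing here uses gRPC, and `RecordingBackend` and `serve_request` are hypothetical names with simplified signatures.

```python
class RecordingBackend:
    """Stand-in for a backend process; records the order of calls."""

    def __init__(self):
        self.calls = []

    def Health(self):
        self.calls.append("Health")
        return True

    def LoadModel(self, model):
        self.calls.append("LoadModel")
        return True

    def PredictStream(self, prompt):
        self.calls.append("PredictStream")
        # Yield the reply piece by piece, as a streaming RPC would.
        for token in prompt.split():
            yield token

    def Status(self):
        self.calls.append("Status")
        return "READY"


def serve_request(backend, model, prompt):
    """Drive one request through health check, loading, streaming, status."""
    if not backend.Health():
        raise RuntimeError("backend is not healthy")
    if not backend.LoadModel(model):
        raise RuntimeError(f"could not load {model}")
    tokens = list(backend.PredictStream(prompt))  # consume the whole stream
    state = backend.Status()                      # monitoring step
    return tokens, state
```

Consuming the generator with `list` stands in for the core forwarding each streamed chunk to the client as it arrives.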
## Performance Optimization

### Memory Management
- **Model Caching**: Efficient model loading and caching
- **Batch Processing**: Optimize for multiple concurrent requests
- **Memory Pinning**: GPU memory optimization for CUDA/ROCm

### Hardware Utilization
- **Multi-GPU**: Support for tensor parallelism
- **Mixed Precision**: FP16/BF16 for memory efficiency
- **Kernel Fusion**: Optimized CUDA/ROCm kernels

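The memory benefit of FP16/BF16 follows from simple arithmetic: each parameter takes 2 bytes instead of the 4 bytes used by FP32. The sketch below estimates memory for the weights alone; it ignores activations, caches, and framework overhead, and the 7-billion-parameter figure in the usage note is just an example size.

```python
# Bytes per parameter for common floating-point formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gb(n_params, dtype):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9
```

For a model with 7 billion parameters, this gives 28.0 GB in FP32 and 14.0 GB in FP16 or BF16, which is why half precision often decides whether a model fits on a single GPU.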
## Troubleshooting

### Common Issues

1. **gRPC Connection**: Verify the backend service is running and accessible
2. **Model Loading**: Check model paths and dependencies
3. **Hardware Detection**: Ensure appropriate drivers and libraries are installed
4. **Memory Issues**: Monitor GPU memory usage and model sizes

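For the first item, a quick way to tell "nothing is listening" apart from a gRPC-level problem is a plain TCP connection attempt. The sketch below uses only the Python standard library. It does not speak gRPC, so a `True` result shows that the port accepts connections, not that the backend is healthy. The helper name `backend_reachable` is hypothetical, and the host and port to test depend on how the backend was started.

```python
import socket


def backend_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

If this returns `False`, check that the backend process is running and bound to the expected address before looking at model or driver issues.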
## Contributing

When contributing to the backend system:

1. **Follow Protocol**: Implement the exact gRPC interface
2. **Add Tests**: Include comprehensive test coverage
3. **Document**: Provide clear usage examples
4. **Optimize**: Consider performance and resource usage
5. **Validate**: Test across different hardware targets
