Better strategy for GPU offload #520
Merged
+30
−4
In a hybrid GPU/CPU situation, the decision whether to offload model weights residing in RAM to the GPU to perform matrix multiplications is a tricky business. On the master branch (and also in mainline `llama.cpp`) a simple heuristic is used: if the batch size is >= 32 and the operation is supported, it is offloaded to the GPU. This heuristic comes from experience with dense models (but even then, the correct decision will depend on the speed of the CPU, the GPU, and the PCI-E bandwidth). The heuristic is definitely not meaningful for MoE models. In a MoE model with $N_{\rm tot}$ total routed experts and $N_A$ active experts, the matrix multiplication for each expert will contain, on average, $\frac{N_A}{N_{\rm tot}} N_b$ tokens, where $N_b$ is the batch (or rather, u-batch) size. For a model such as DeepSeek-R1/V3 with $N_A = 8, N_{\rm tot} = 256$, a batch size of 32 will result in a single token per expert on average, so offloading gigabytes of data to the GPU does not make sense at all.
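For concreteness, plugging the DeepSeek-R1/V3 numbers into that ratio at the current threshold gives

$$\frac{N_A}{N_{\rm tot}}\, N_b = \frac{8}{256}\cdot 32 = 1 \ \text{token per expert on average}.$$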
This PR adds the above consideration. MoE matrix multiplications will only be offloaded if

$$N_b \ge \frac{N_{\rm tot}}{N_A}\, N_{\rm min},$$

where $N_{\rm min}$ is the minimum batch size for dense models (hard-coded to 32 on the main branch). To allow for setup/model-specific adjustment, a compile-time option is added that allows changing $N_{\rm min}$ via `GGML_CUDA_MIN_BATCH_OFFLOAD`. The default value of `GGML_CUDA_MIN_BATCH_OFFLOAD` is left at 32. With this, MoE matrix multiplications will not get offloaded for DeepSeek-R1/V3 unless the batch size is $\ge 1024$. For Qwen3-235B-A22B the minimum batch size for offload becomes 512 tokens.

As a reminder, in addition to this PR, in `ik_llama.cpp` GPU offload can be disabled via `-op 26,0,27,0,29,0`.
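To make the rule concrete, here is a minimal C++ sketch of the decision. The helper names (`moe_offload_min_batch`, `should_offload`) are made up for illustration and are not the actual functions touched by this PR; only the condition they encode comes from the description above.

```cpp
// Sketch of the offload rule: a MoE matrix multiplication is only offloaded
// when N_b >= (N_tot / N_A) * N_min, dense ops keep the old N_min threshold.
#include <cstdint>
#include <cstdio>

#ifndef GGML_CUDA_MIN_BATCH_OFFLOAD
#define GGML_CUDA_MIN_BATCH_OFFLOAD 32   // N_min: dense-model threshold
#endif

// Minimum u-batch size at which offloading a MoE matmul is allowed.
static int64_t moe_offload_min_batch(int64_t n_expert_tot, int64_t n_expert_active) {
    return n_expert_tot * GGML_CUDA_MIN_BATCH_OFFLOAD / n_expert_active;
}

// Dense ops use N_min directly; MoE ops scale it by N_tot / N_A.
static bool should_offload(int64_t n_batch, bool is_moe,
                           int64_t n_expert_tot, int64_t n_expert_active) {
    const int64_t min_batch = is_moe
        ? moe_offload_min_batch(n_expert_tot, n_expert_active)
        : GGML_CUDA_MIN_BATCH_OFFLOAD;
    return n_batch >= min_batch;
}

int main() {
    // DeepSeek-R1/V3: 256 routed experts, 8 active -> offload only for N_b >= 1024
    std::printf("DeepSeek-R1/V3 : %lld\n", (long long) moe_offload_min_batch(256, 8));
    // Qwen3-235B-A22B: 128 routed experts, 8 active -> 512
    std::printf("Qwen3-235B-A22B: %lld\n", (long long) moe_offload_min_batch(128, 8));
    // DeepSeek-Lite: 64 routed experts, 6 active -> 341
    std::printf("DeepSeek-Lite  : %lld\n", (long long) moe_offload_min_batch(64, 6));
    // A dense op at u-batch 32 is still offloaded under the unchanged rule.
    std::printf("dense, N_b=32  : %d\n", should_offload(32, /*is_moe=*/false, 0, 0));
    return 0;
}
```

Integer division here reproduces the 341-token threshold quoted for DeepSeek-Lite below.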
As a quick example, the following tables contain `llama-bench` results for PP-4096 using `IQ4_KS` quantized DeepSeek-Lite, with all experts left on the CPU.

On the main branch we get this:

With this PR we get this:
We see massively better performance for small u-batch sizes (important for a more fluid interaction with the LLM, as not all prompts are so long). For this model offload kicks in at `64/6*32 = 341` tokens, so for batch sizes of 512 and above the two results are the same.

If I change `GGML_CUDA_MIN_BATCH_OFFLOAD` to 64, the minimum batch size for offload becomes 682 tokens, and we get this result:

We see that for my setup, even batches of 512 tokens are better left on the CPU (for this specific quantization type).
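For reference, both DeepSeek-Lite thresholds follow directly from the offload condition above with its 64 routed / 6 active experts:

$$\frac{N_{\rm tot}}{N_A}\, N_{\rm min} = \frac{64}{6}\cdot 32 \approx 341, \qquad \frac{64}{6}\cdot 64 \approx 682.$$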
Please play with this PR and let me know if it is useful to get merged.