
ikawrakow (Owner)

In a hybrid GPU/CPU situation, the decision whether to offload model weights residing in RAM to the GPU to perform matrix multiplications is a tricky business. On the main branch (and also in mainline llama.cpp) a simple heuristic is used: if the batch size is >= 32 and the operation is supported, it is offloaded to the GPU. This heuristic comes from experience with dense models (and even then, the correct decision depends on the speed of the CPU, the GPU, and the PCI-E bandwidth).

This heuristic is definitely not meaningful for MoE models. In a MoE model with $N_{\rm tot}$ total routed experts and $N_A$ active experts, the matrix multiplication for each expert will involve, on average, $(N_A/N_{\rm tot})\, N_b$ tokens, where $N_b$ is the batch (or rather, u-batch) size. For a model such as DeepSeek-R1/V3 with $N_A = 8$, $N_{\rm tot} = 256$, a batch size of 32 results in a single token per expert on average, so offloading gigabytes of data to the GPU makes no sense at all.

This PR takes the above into account. MoE matrix multiplications are only offloaded if

$$N_b \ge \frac{N_{\rm tot}}{N_A} N_{\rm min}$$

where $N_{\rm min}$ is the minimum batch size for dense models (hard-coded to 32 on the main branch). To allow for setup/model-specific adjustment, a compile-time option is added that allows changing $N_{\rm min}$ via

cmake -DGGML_CUDA_MIN_BATCH_OFFLOAD=new_value ...

The default value for GGML_CUDA_MIN_BATCH_OFFLOAD is left at 32. With this, MoE matrix multiplications will not get offloaded for DeepSeek-R1/V3 unless the batch size is $\ge 256/8 \times 32 = 1024$. For Qwen3-235B-A22B (128 experts, 8 active) the minimum batch size for offload becomes 512 tokens.
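
To make the rule concrete, here is a minimal C++ sketch of the decision implied by the formula above; this is not the actual ik_llama.cpp code, and the function and variable names (`should_offload`, `n_batch`, `n_expert_tot`, `n_expert_used`) are made up for illustration:

```cpp
#include <cstdint>

#ifndef GGML_CUDA_MIN_BATCH_OFFLOAD
#define GGML_CUDA_MIN_BATCH_OFFLOAD 32   // N_min, overridable via cmake -DGGML_CUDA_MIN_BATCH_OFFLOAD=...
#endif

// Decide whether a matrix multiplication on weights kept in RAM should be
// offloaded to the GPU for the current u-batch.
//   n_batch       : number of tokens in the u-batch (N_b)
//   n_expert_tot  : total number of routed experts (N_tot), 1 for dense tensors
//   n_expert_used : number of active experts (N_A), 1 for dense tensors
static bool should_offload(int64_t n_batch, int64_t n_expert_tot, int64_t n_expert_used) {
    // N_b >= (N_tot / N_A) * N_min, rearranged to avoid integer division:
    return n_batch * n_expert_used >= (int64_t) GGML_CUDA_MIN_BATCH_OFFLOAD * n_expert_tot;
}
```

For dense tensors this reduces to the old n_batch >= 32 rule; for DeepSeek-R1/V3 (256 experts, 8 active) it gives a threshold of 1024 tokens, and for DeepSeek-Lite (64 experts, 6 active) roughly 341.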

As a reminder, in addition to this PR, GPU offload in ik_llama.cpp can also be disabled per operation via -op 26,0,27,0,29,0.
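
For illustration, a hypothetical invocation could look like the following (the binary name and model path are placeholders, and this assumes the tool being run accepts the -op argument):

```
./bin/llama-server -m /models/some-moe-model.gguf -op 26,0,27,0,29,0
```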

As a quick example, the following tables contain llama-bench results for PP-4096 using IQ4_KS quantized DeepSeek-Lite, with all experts left on the CPU.

On the main branch we get this:

| model | params | n_ubatch | fa | mla | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_KS | 15.76 B | 128 | 1 | 3 | 1 | 1 | pp4096 | 344.75 ± 1.52 |
| deepseek2 16B IQ4_KS | 15.76 B | 256 | 1 | 3 | 1 | 1 | pp4096 | 604.47 ± 10.39 |
| deepseek2 16B IQ4_KS | 15.76 B | 512 | 1 | 3 | 1 | 1 | pp4096 | 973.29 ± 14.90 |
| deepseek2 16B IQ4_KS | 15.76 B | 1024 | 1 | 3 | 1 | 1 | pp4096 | 1427.88 ± 9.06 |
| deepseek2 16B IQ4_KS | 15.76 B | 2048 | 1 | 3 | 1 | 1 | pp4096 | 1804.31 ± 70.77 |
| deepseek2 16B IQ4_KS | 15.76 B | 4096 | 1 | 3 | 1 | 1 | pp4096 | 1878.12 ± 139.24 |

With this PR we get this:

| model | params | n_ubatch | fa | mla | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_KS | 15.76 B | 128 | 1 | 3 | 1 | 1 | pp4096 | 723.34 ± 2.93 |
| deepseek2 16B IQ4_KS | 15.76 B | 256 | 1 | 3 | 1 | 1 | pp4096 | 955.96 ± 3.76 |
| deepseek2 16B IQ4_KS | 15.76 B | 512 | 1 | 3 | 1 | 1 | pp4096 | 974.72 ± 12.17 |
| deepseek2 16B IQ4_KS | 15.76 B | 1024 | 1 | 3 | 1 | 1 | pp4096 | 1410.79 ± 20.59 |
| deepseek2 16B IQ4_KS | 15.76 B | 2048 | 1 | 3 | 1 | 1 | pp4096 | 1838.61 ± 2.46 |
| deepseek2 16B IQ4_KS | 15.76 B | 4096 | 1 | 3 | 1 | 1 | pp4096 | 2071.28 ± 37.94 |

We see massively better performance for small u-batch sizes (important for a more fluid interaction with the LLM, as not all prompts are that long). For this model offload kicks in at $64/6 \times 32 \approx 341$ tokens, so for batch sizes of 512 and above the two results are the same.

If I change GGML_CUDA_MIN_BATCH_OFFLOAD to 64, the minimum batch size for offload becomes $64/6 \times 64 \approx 682$ tokens, and we get this result:

| model | params | n_ubatch | fa | mla | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_KS | 15.76 B | 128 | 1 | 3 | 1 | 1 | pp4096 | 737.72 ± 7.27 |
| deepseek2 16B IQ4_KS | 15.76 B | 256 | 1 | 3 | 1 | 1 | pp4096 | 968.12 ± 5.75 |
| deepseek2 16B IQ4_KS | 15.76 B | 512 | 1 | 3 | 1 | 1 | pp4096 | 1081.28 ± 28.45 |
| deepseek2 16B IQ4_KS | 15.76 B | 1024 | 1 | 3 | 1 | 1 | pp4096 | 1428.79 ± 3.19 |
| deepseek2 16B IQ4_KS | 15.76 B | 2048 | 1 | 3 | 1 | 1 | pp4096 | 1844.95 ± 9.59 |
| deepseek2 16B IQ4_KS | 15.76 B | 4096 | 1 | 3 | 1 | 1 | pp4096 | 2052.55 ± 78.42 |

We see that for my setup, even batches of 512 tokens are better left on the CPU (for this specific quantization type).

Please play with this PR and let me know whether it is useful and worth merging.

quasar-of-mikus commented Jun 11, 2025

Looks good for setups like mine where PCIe bandwidth is low and prompt length is short.

128 GB DDR4-3200, dual channel
2x 3090, PCIe 3.0 x8/x8
DeepSeek-V3-0324-IQ1_S_R4.gguf
Default value GGML_CUDA_MIN_BATCH_OFFLOAD=32

For an existing context of 1400 tokens plus an added prompt of 34 tokens, the difference was waiting a mere 3 seconds instead of 23 seconds until the first tokens appeared:
Main: ~1.5 t/s pp
PR: 9-10 t/s pp

PR build: cdcb324 (3743):

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mla | amb | ts | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.55 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 10.61 ± 0.34 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 13.31 ± 0.16 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 17.58 ± 0.20 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 19.66 ± 0.08 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 21.24 ± 0.10 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.75 ± 0.37 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.01 ± 0.59 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 165.89 ± 0.63 |

Main, build 3f54b49 (3742); note the very low speeds for pp32 to pp256:

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mla | amb | ts | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.40 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 1.89 ± 0.01 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 3.69 ± 0.01 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 7.44 ± 0.01 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 14.47 ± 0.03 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 27.94 ± 0.10 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.96 ± 0.18 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.27 ± 0.25 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 166.23 ± 0.19 |

ikawrakow (Owner, Author)

Here is the above data illustrated in a graph:

[Figure: batch_strategy — the llama-bench data above plotted as t/s vs. u-batch size]

ikawrakow (Owner, Author)

I also took the liberty of plotting @quasar-of-mikus's data:

[Figure: batch_strategy1 — @quasar-of-mikus's data plotted as t/s vs. batch size, main vs. PR]

We see that in this case the performance at 512 tokens is better on the main branch. With the default value of GGML_CUDA_MIN_BATCH_OFFLOAD=32 the MoE matrix multiplications are done on the CPU for a batch of 512 tokens, and in this case that is slower than offloading to the GPU. So, @quasar-of-mikus will likely benefit from a lower value, e.g. -DGGML_CUDA_MIN_BATCH_OFFLOAD=16, which brings the offload threshold for this model down to exactly 512 tokens.

ikawrakow merged commit b57bd86 into main on Jun 12, 2025
quasar-of-mikus commented Jun 12, 2025

On my setup and with this model, a lower value of -DGGML_CUDA_MIN_BATCH_OFFLOAD=16 brought the performance back at pp512, resulting in an overall improvement (at least with this level of granularity) 👍

| test | main (old heuristic) t/s | PR, =32 t/s | PR, =16 t/s |
| --- | --- | --- | --- |
| pp16 | 7.81 ± 0.40 | 7.81 ± 0.55 | 7.72 ± 0.49 |
| pp32 | 1.89 ± 0.01 | 10.61 ± 0.34 | 10.71 ± 0.05 |
| pp64 | 3.69 ± 0.01 | 13.31 ± 0.16 | 13.72 ± 0.19 |
| pp128 | 7.44 ± 0.01 | 17.58 ± 0.20 | 17.61 ± 0.25 |
| pp256 | 14.47 ± 0.03 | 19.66 ± 0.08 | 19.73 ± 0.13 |
| → pp512 | 27.94 ± 0.10 | 21.24 ± 0.10 | 27.94 ± 0.20 |
| pp1024 | 52.96 ± 0.18 | 52.75 ± 0.37 | 52.92 ± 0.30 |
| pp2048 | 97.27 ± 0.25 | 97.01 ± 0.59 | 97.12 ± 0.54 |
| pp4096 | 166.23 ± 0.19 | 165.89 ± 0.63 | 165.97 ± 0.92 |
