
ikawrakow (Owner)

In a hybrid GPU/CPU situation, the decision whether to offload model weights residing in RAM to the GPU to perform matrix multiplications is a tricky business. On the main branch (and also in mainline llama.cpp) a simple heuristic is used: if the batch size is >= 32 and the operation is supported, it is offloaded to the GPU. This heuristic comes from experience with dense models (and even then, the correct decision depends on the speed of the CPU, the GPU, and the PCI-E bandwidth).

This heuristic is definitely not meaningful for MoE models. In a MoE model with $N_{\rm tot}$ total routed experts and $N_A$ active experts, the matrix multiplication for each expert will involve, on average, $(N_A/N_{\rm tot})\, N_b$ tokens, where $N_b$ is the batch (or rather, u-batch) size. For a model such as DeepSeek-R1/V3 with $N_A = 8$, $N_{\rm tot} = 256$, a batch size of 32 results in a single token per expert on average, so offloading gigabytes of data to the GPU makes no sense at all.

This PR takes the above into account. MoE matrix multiplications are only offloaded if

$$N_b \ge \frac{N_{\rm tot}}{N_A} N_{\rm min}$$

where $N_{\rm min}$ is the minimum batch size for dense models (hard-coded to 32 on the main branch). To allow for setup/model-specific adjustment, a compile-time option is added that allows changing $N_{\rm min}$ via

cmake -DGGML_CUDA_MIN_BATCH_OFFLOAD=new_value ...

The default value for GGML_CUDA_MIN_BATCH_OFFLOAD is left at 32. With this, MoE matrix multiplications will not get offloaded for DeepSeek-R1/V3 unless the batch size is $\ge 256/8 \times 32 = 1024$. For Qwen3-235B-A22B (128 experts, 8 active) the minimum batch size for offload becomes 512 tokens.
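
To make the rule concrete, here is a minimal C++ sketch of the decision implied by the formula above; this is not the actual ik_llama.cpp code, and the function and variable names (`should_offload`, `n_batch`, `n_expert_tot`, `n_expert_used`) are made up for illustration:

```cpp
#include <cstdint>

#ifndef GGML_CUDA_MIN_BATCH_OFFLOAD
#define GGML_CUDA_MIN_BATCH_OFFLOAD 32   // N_min, overridable via cmake -DGGML_CUDA_MIN_BATCH_OFFLOAD=...
#endif

// Decide whether a matrix multiplication on weights kept in RAM should be
// offloaded to the GPU for the current u-batch.
//   n_batch       : number of tokens in the u-batch (N_b)
//   n_expert_tot  : total number of routed experts (N_tot), 1 for dense tensors
//   n_expert_used : number of active experts (N_A), 1 for dense tensors
static bool should_offload(int64_t n_batch, int64_t n_expert_tot, int64_t n_expert_used) {
    // N_b >= (N_tot / N_A) * N_min, rearranged to avoid integer division:
    return n_batch * n_expert_used >= (int64_t) GGML_CUDA_MIN_BATCH_OFFLOAD * n_expert_tot;
}
```

For dense tensors this reduces to the old n_batch >= 32 rule; for DeepSeek-R1/V3 (256 experts, 8 active) it gives a threshold of 1024 tokens, and for DeepSeek-Lite (64 experts, 6 active) roughly 341.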

As a reminder, in addition to this PR, GPU offload in ik_llama.cpp can also be disabled per operation via -op 26,0,27,0,29,0.
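
For illustration, a hypothetical invocation could look like the following (the binary name and model path are placeholders, and this assumes the tool being run accepts the -op argument):

```
./bin/llama-server -m /models/some-moe-model.gguf -op 26,0,27,0,29,0
```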

As a quick example, the following tables contain llama-bench results for PP-4096 using IQ4_KS quantized DeepSeek-Lite, with all experts left on the CPU.

On the main branch we get this:

| model | params | n_ubatch | fa | mla | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_KS | 15.76 B | 128 | 1 | 3 | 1 | 1 | pp4096 | 344.75 ± 1.52 |
| deepseek2 16B IQ4_KS | 15.76 B | 256 | 1 | 3 | 1 | 1 | pp4096 | 604.47 ± 10.39 |
| deepseek2 16B IQ4_KS | 15.76 B | 512 | 1 | 3 | 1 | 1 | pp4096 | 973.29 ± 14.90 |
| deepseek2 16B IQ4_KS | 15.76 B | 1024 | 1 | 3 | 1 | 1 | pp4096 | 1427.88 ± 9.06 |
| deepseek2 16B IQ4_KS | 15.76 B | 2048 | 1 | 3 | 1 | 1 | pp4096 | 1804.31 ± 70.77 |
| deepseek2 16B IQ4_KS | 15.76 B | 4096 | 1 | 3 | 1 | 1 | pp4096 | 1878.12 ± 139.24 |

With this PR we get this:

| model | params | n_ubatch | fa | mla | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_KS | 15.76 B | 128 | 1 | 3 | 1 | 1 | pp4096 | 723.34 ± 2.93 |
| deepseek2 16B IQ4_KS | 15.76 B | 256 | 1 | 3 | 1 | 1 | pp4096 | 955.96 ± 3.76 |
| deepseek2 16B IQ4_KS | 15.76 B | 512 | 1 | 3 | 1 | 1 | pp4096 | 974.72 ± 12.17 |
| deepseek2 16B IQ4_KS | 15.76 B | 1024 | 1 | 3 | 1 | 1 | pp4096 | 1410.79 ± 20.59 |
| deepseek2 16B IQ4_KS | 15.76 B | 2048 | 1 | 3 | 1 | 1 | pp4096 | 1838.61 ± 2.46 |
| deepseek2 16B IQ4_KS | 15.76 B | 4096 | 1 | 3 | 1 | 1 | pp4096 | 2071.28 ± 37.94 |

We see massively better performance for small u-batch sizes (important for a more fluid interaction with the LLM, as not all prompts are that long). For this model offload kicks in at $64/6 \times 32 \approx 341$ tokens, so for batch sizes of 512 and above the two results are the same.

If I change GGML_CUDA_MIN_BATCH_OFFLOAD to 64, the minimum batch size for offload becomes $64/6 \times 64 \approx 682$ tokens, and we get this result:

| model | params | n_ubatch | fa | mla | rtr | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_KS | 15.76 B | 128 | 1 | 3 | 1 | 1 | pp4096 | 737.72 ± 7.27 |
| deepseek2 16B IQ4_KS | 15.76 B | 256 | 1 | 3 | 1 | 1 | pp4096 | 968.12 ± 5.75 |
| deepseek2 16B IQ4_KS | 15.76 B | 512 | 1 | 3 | 1 | 1 | pp4096 | 1081.28 ± 28.45 |
| deepseek2 16B IQ4_KS | 15.76 B | 1024 | 1 | 3 | 1 | 1 | pp4096 | 1428.79 ± 3.19 |
| deepseek2 16B IQ4_KS | 15.76 B | 2048 | 1 | 3 | 1 | 1 | pp4096 | 1844.95 ± 9.59 |
| deepseek2 16B IQ4_KS | 15.76 B | 4096 | 1 | 3 | 1 | 1 | pp4096 | 2052.55 ± 78.42 |

We see that for my setup, even batches of 512 tokens are better left on the CPU (for this specific quantization type).

Please play with this PR and let me know whether it is useful and worth merging.

quasar-of-mikus commented Jun 11, 2025

Looks good for setups like mine where PCIe bandwidth is low and prompt length is short.

128 GB DDR4-3200, dual channel
2x 3090, PCIe 3.0 x8/x8
DeepSeek-V3-0324-IQ1_S_R4.gguf
Default value GGML_CUDA_MIN_BATCH_OFFLOAD=32

For an existing context of 1400 tokens plus an added prompt of 34 tokens, the difference was waiting a mere 3 seconds instead of 23 seconds until the first tokens appeared:
Main: ~1.5 t/s pp
PR: 9-10 t/s pp

PR build: cdcb324 (3743):

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mla | amb | ts | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.55 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 10.61 ± 0.34 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 13.31 ± 0.16 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 17.58 ± 0.20 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 19.66 ± 0.08 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 21.24 ± 0.10 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.75 ± 0.37 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.01 ± 0.59 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 165.89 ± 0.63 |

Main, build 3f54b49 (3742); note the very low speeds for pp32 to pp256:

| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mla | amb | ts | mmap | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.40 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 1.89 ± 0.01 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 3.69 ± 0.01 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 7.44 ± 0.01 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 14.47 ± 0.03 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 27.94 ± 0.10 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.96 ± 0.18 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.27 ± 0.25 |
| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 166.23 ± 0.19 |

ikawrakow (Owner, Author)

Here is the above data illustrated in a graph:

[Figure: batch_strategy — the llama-bench data above plotted as t/s vs. u-batch size]

ikawrakow (Owner, Author)

I also took the liberty of plotting @quasar-of-mikus's data:

[Figure: batch_strategy1 — @quasar-of-mikus's data plotted as t/s vs. batch size, main vs. PR]

We see that in this case the performance at 512 tokens is better on the main branch. With the default value of GGML_CUDA_MIN_BATCH_OFFLOAD=32 the MoE matrix multiplications are done on the CPU for a batch of 512 tokens, and in this case that is slower than offloading to the GPU. So, @quasar-of-mikus will likely benefit from a lower value, e.g. -DGGML_CUDA_MIN_BATCH_OFFLOAD=16, which brings the offload threshold for this model down to exactly 512 tokens.

ikawrakow merged commit b57bd86 into main on Jun 12, 2025
quasar-of-mikus commented Jun 12, 2025

On my setup and with this model, a lower value of -DGGML_CUDA_MIN_BATCH_OFFLOAD=16 brought the performance back at pp512, resulting in an overall improvement (at least with this level of granularity) 👍

| test | main (old heuristic) t/s | PR, =32 t/s | PR, =16 t/s |
| --- | --- | --- | --- |
| pp16 | 7.81 ± 0.40 | 7.81 ± 0.55 | 7.72 ± 0.49 |
| pp32 | 1.89 ± 0.01 | 10.61 ± 0.34 | 10.71 ± 0.05 |
| pp64 | 3.69 ± 0.01 | 13.31 ± 0.16 | 13.72 ± 0.19 |
| pp128 | 7.44 ± 0.01 | 17.58 ± 0.20 | 17.61 ± 0.25 |
| pp256 | 14.47 ± 0.03 | 19.66 ± 0.08 | 19.73 ± 0.13 |
| → pp512 | 27.94 ± 0.10 | 21.24 ± 0.10 | 27.94 ± 0.20 |
| pp1024 | 52.96 ± 0.18 | 52.75 ± 0.37 | 52.92 ± 0.30 |
| pp2048 | 97.27 ± 0.25 | 97.01 ± 0.59 | 97.12 ± 0.54 |
| pp4096 | 166.23 ± 0.19 | 165.89 ± 0.63 | 165.97 ± 0.92 |
