Conversation
@SageMoore SageMoore commented Aug 26, 2025

Purpose

This PR adds support for Dual-Batch Overlap (DBO) in vLLM. In its current state it is only enabled when a user provides the --enable-microbatching flag. Furthermore, it is only used when all DP groups are running full-decode batches. This PR supports running DBO with full cudagraphs, which is essential for minimizing CPU overhead and getting performance from this feature.

To implement Dual-Batch Overlap (DBO), at a high level, we split the batch into two microbatches and then use two threads and two CUDA streams, one for communication and one for computation, to overlap the dispatch and combine all-to-all kernels of one microbatch with the compute kernels of the other microbatch.

When microbatching is enabled and supported, the GPUModelRunner splits the batch into two token_slices. These token_slices are then passed into the attention metadata builders during _prepare_inputs to generate one attention metadata object per microbatch. When actually running the model, the model runner spawns two microbatching threads that communicate with each other through a UBatchContext. Each of these threads then runs self.model with the appropriate attention metadata.
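For illustration, the splitting step might look like the sketch below. The helper name `split_into_microbatches` and the even-split policy are assumptions for this example, not the PR's actual code (the real split also has to account for padding):

```python
# Hypothetical sketch of splitting a batch's token range into two microbatch
# slices. The actual helper names and split policy in the PR may differ.
def split_into_microbatches(num_tokens: int) -> tuple[slice, slice]:
    # First microbatch takes the first half (rounded up); second takes the rest.
    mid = (num_tokens + 1) // 2
    return slice(0, mid), slice(mid, num_tokens)

first, second = split_into_microbatches(10)
print(first, second)  # slice(0, 5, None) slice(5, 10, None)
```

Each resulting slice would then get its own attention metadata object built during _prepare_inputs.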

Without any additional modifications to the code, this would just result in one microbatch running to completion before the other microbatch starts. In order to get overlap, we've added a "yield" call that can be inserted into the all-to-all kernels to interleave the two microbatches. The yield_and_switch_from_compute_to_comm function yields the CPU from the current thread (thread A) to the other microbatching thread (thread B). Once thread A has resumed execution, either because thread B yielded the CPU or finished its execution, it swaps over to the communication stream and starts dispatching kernels there. yield_and_switch_from_comm_to_compute behaves similarly but in the opposite direction: it swaps from the communication stream to the compute stream.

Both GPU and CPU events are used to synchronize all of this. That said, it is absolutely critical that only one microbatching thread is running at a time (the other must be waiting on an event), and it is equally critical that both microbatches run exactly the same number of yields.
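The CPU side of this hand-off can be modeled as a toy two-thread ping-pong. This is a minimal sketch, not the PR's UBatchContext: GPU events and CUDA stream switches are omitted, and all names here are made up for the example. It shows why both threads must perform the same number of yields:

```python
import threading

# Toy model of the CPU hand-off: two microbatch threads take turns, each
# yielding to the other at its "all-to-all" points. Exactly one thread runs
# at a time; the other waits on its event.
class ToyUBatchContext:
    def __init__(self):
        self.events = [threading.Event(), threading.Event()]
        self.events[0].set()  # microbatch 0 runs first

    def yield_to_other(self, my_id: int):
        # Clear our own event before waking the peer so its later set()
        # for us can't be lost, then sleep until the peer yields back.
        self.events[my_id].clear()
        self.events[1 - my_id].set()
        self.events[my_id].wait()

log = []

def run_microbatch(ctx: ToyUBatchContext, my_id: int):
    ctx.events[my_id].wait()          # wait for our turn
    log.append((my_id, "compute"))
    ctx.yield_to_other(my_id)         # e.g. before the dispatch all-to-all
    log.append((my_id, "comm"))
    ctx.yield_to_other(my_id)         # e.g. before the combine all-to-all
    log.append((my_id, "done"))
    ctx.events[1 - my_id].set()       # release the peer so it can finish

ctx = ToyUBatchContext()
threads = [threading.Thread(target=run_microbatch, args=(ctx, i)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Phases strictly alternate between the two microbatches:
print(log)  # [(0, 'compute'), (1, 'compute'), (0, 'comm'), (1, 'comm'), (0, 'done'), (1, 'done')]
```

If one thread performed an extra yield here, it would wait on an event its peer never sets, which is the CPU analogue of the deadlock the "same number of yields" rule prevents.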

Test Plan

In general, my test plan was to run lm_eval with deepseek-ai/DeepSeek-V2-Lite. We've also run numerous times with R1 in a multi-node setup and verified that lm_eval produces reasonable output.

Non-DBO Runs

Eager

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --enforce-eager

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3567|±  |0.0277|
|     |       |strict-match    |     5|exact_match|↑  |0.3533|±  |0.0276|

Default

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency g2 vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3700|±  |0.0279|
|     |       |strict-match    |     5|exact_match|↑  |0.3667|±  |0.0279|

DBO Runs

Eager

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency g2 vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --enforce-eager --enable-microbatching --microbatching-token-threshold 4

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3800|±  |0.0281|
|     |       |strict-match    |     5|exact_match|↑  |0.3767|±  |0.0280|

Full cudagraphs

Command

VLLM_ALL2ALL_BACKEND=deepep_low_latency g2 vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --compilation_config '{"cudagraph_mode": "full_decode_only"}' --enable-microbatching --microbatching-token-threshold 4

Result
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3733|±  |0.0280|
|     |       |strict-match    |     5|exact_match|↑  |0.3700|±  |0.0279|

LucasWilkinson and others added 30 commits May 22, 2025 20:51
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
Comment on lines +698 to +700
parallel_group.add_argument(
"--dbo-decode-token-threshold",
**parallel_kwargs["dbo_decode_token_threshold"])
Collaborator:

What is the future plan for this argument? Will we add a separate --dbo-prefill-token-threshold? Could there be one argument instead?

Contributor Author:

Yep, we are planning to add a prefill version of this argument.

Comment on lines +531 to +533
fused_out_buffer = SharedResizableBuffer()
workspace13_buffer = SharedResizableBuffer()
workspace2_buffer = SharedResizableBuffer()
Collaborator:

Why do we need this?

Contributor Author:

Just a general memory footprint reduction. Primarily targeting cudagraphs, though.

Comment on lines 230 to 234
# if we are using mrope
if positions.ndim == 2:
sliced_positions = positions[:, tokens_slice]
else:
sliced_positions = positions[tokens_slice]
Collaborator:

What is the mrope interaction? Could we add a comment explaining it here?

Contributor Author (@SageMoore, Sep 15, 2025):

It's largely just that mrope adds an additional dimension to the positions tensor, so we need to slice along the lower dimension. I'll add a comment.
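The slicing described above can be sketched in plain Python (lists standing in for torch tensors; `slice_positions` is a hypothetical name for this example, not the PR's code):

```python
# Illustrative sketch: with M-RoPE, positions are 2-D (one row per rotary
# section), so the token slice must be applied along the last dimension.
def slice_positions(positions, tokens_slice):
    # Stand-in for the positions.ndim == 2 check on a real tensor.
    is_two_dim = bool(positions) and isinstance(positions[0], list)
    if is_two_dim:
        return [row[tokens_slice] for row in positions]
    return positions[tokens_slice]

# 1-D (standard RoPE): one position per token.
print(slice_positions([0, 1, 2, 3], slice(1, 3)))        # [1, 2]
# 2-D (M-RoPE): e.g. three rows of per-token positions.
print(slice_positions([[0, 1, 2, 3]] * 3, slice(1, 3)))  # [[1, 2], [1, 2], [1, 2]]
```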

Comment on lines 73 to 78
# Sanity Check that the existing padding isn't giving us an empty second
# ubatch. Abort if so
if is_second_ubatch_empty(num_tokens_unpadded, num_tokens_padded):
logger.debug("Aborting ubatching %s %s", num_tokens_unpadded,
num_tokens_padded)
should_ubatch = False
Collaborator:

Is this something that's expected to happen sometimes, and that's OK? If not, I think this should be a warning instead.

And then could you add a bit more detail to the log, e.g.

Suggested change
# Sanity Check that the existing padding isn't giving us an empty second
# ubatch. Abort if so
if is_second_ubatch_empty(num_tokens_unpadded, num_tokens_padded):
logger.debug("Aborting ubatching %s %s", num_tokens_unpadded,
num_tokens_padded)
should_ubatch = False
# Sanity Check that the existing padding isn't giving us an empty second
# ubatch. Abort if so
if is_second_ubatch_empty(num_tokens_unpadded, num_tokens_padded):
logger.warning("Empty second µbatch detected: unpadded tokens: %s, padded tokens: %s",
num_tokens_unpadded,
num_tokens_padded)
should_ubatch = False

Contributor Author:

It is expected to happen and isn't necessarily a bug when it does. I find the debug log really helpful when debugging miscellaneous padding issues. We can certainly take it out, though.
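For context, the condition under discussion can be sketched as below. This is a hypothetical reimplementation under the assumption that the two microbatches split the padded token count roughly in half; the PR's actual `is_second_ubatch_empty` may differ:

```python
# Hypothetical sketch (assumed split policy, not the PR's exact code): if the
# first microbatch's half of the padded token count already covers every real
# (unpadded) token, the second microbatch would contain only padding.
def is_second_ubatch_empty(num_tokens_unpadded: int, num_tokens_padded: int) -> bool:
    first_ubatch_size = (num_tokens_padded + 1) // 2
    return first_ubatch_size >= num_tokens_unpadded

print(is_second_ubatch_empty(3, 8))  # True: 4 slots in ubatch 0 cover all 3 real tokens
print(is_second_ubatch_empty(5, 8))  # False: the 5th real token lands in ubatch 1
```

When this fires, microbatching is simply aborted for the step, which is why it is not treated as an error.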

Signed-off-by: Sage Moore <[email protected]>
Collaborator @tlrmchlsmth left a comment:

Left a few minor comments; otherwise I think this is ready to land. Maybe a little rough around the edges with the model runner changes, but it will be great to have this landed on main, especially as we have a prefill DBO PR ready to be reviewed as soon as this one lands.

Signed-off-by: Sage Moore <[email protected]>

mergify bot commented Sep 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @SageMoore.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 15, 2025
@mergify mergify bot removed the needs-rebase label Sep 15, 2025

mergify bot commented Sep 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @SageMoore.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 16, 2025
Collaborator

tlrmchlsmth commented Sep 16, 2025

I thought the kernels-moe-test failures were due to VLLM_USE_PRECOMPILED=1 not picking up the changes from #24054, but that was from 3 days ago

Could it be a real problem? @elvircrn, @dougbtv

AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_mul_fp8_quant_deep_gemm_cuda'

Edit: Confirmed it's picking up old binaries.

@mergify mergify bot removed the needs-rebase label Sep 16, 2025
@tlrmchlsmth tlrmchlsmth merged commit 5679399 into vllm-project:main Sep 16, 2025
54 checks passed
@NihalPotdar:

Hey! Quick question - do you have any performance numbers for this change?

Mainly wondering about the efficiency of the communication-computation overlap strategy in the PR.

Labels: documentation, ready, speculative-decoding, v1