[Hybrid Allocator] Support Pipeline Parallel #23974
Conversation
Code Review
This pull request introduces support for pipeline parallelism where different stages can have varying layer compositions. The core change involves refactoring the KV cache configuration logic to first determine a global grouping strategy based on the entire model's layers, and then applying this to each worker. This is a solid approach that enhances flexibility for pipeline parallel setups. The associated tests are comprehensive and cover various scenarios including TP, PP, and hybrid models.
I've identified one potential issue in the new logic that could lead to a crash in an edge case involving workers with no KV cache layers. A fix is suggested below. Overall, the changes are well-structured and move the project in the right direction.
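As a rough illustration of the described flow, here is a minimal standalone sketch (hypothetical helper and layer names, not the actual API changed in this PR): the groups are derived once from the whole model's layers, and each PP-stage worker then keeps only the layers it actually holds in each group.

```python
# Rough sketch with hypothetical names (not the actual vLLM API): derive the
# KV cache groups from the full model, then restrict them per worker.

def apply_global_groups(global_groups: list[list[str]],
                        worker_layers: set[str]) -> list[list[str]]:
    """Restrict globally defined KV cache groups to one worker's layers."""
    return [[name for name in group if name in worker_layers]
            for group in global_groups]

# Toy model: 2 full-attention + 4 sliding-window layers over 2 PP stages.
global_groups = [["full.0", "full.1"], ["sw.0", "sw.2"], ["sw.1", "sw.3"]]

stage0 = apply_global_groups(global_groups, {"full.0", "sw.0", "sw.1"})
stage1 = apply_global_groups(global_groups, {"full.1", "sw.2", "sw.3"})
print(stage0)  # [['full.0'], ['sw.0'], ['sw.1']] -- one layer per group
print(stage1)  # [['full.1'], ['sw.2'], ['sw.3']]
```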
Thanks for the fix @heheda12345. I'm not very familiar with how this part of the code works, but it makes sense to me. I like that it's cleaner and that you added detailed comments.
Purpose
Different PP stages can have different layers, so the ratio of full-attention to sliding-window layers (full:sw) differs between stages. For example, in #23883 stage 0 has 10 full & 11 sw layers, while stage 1 has 11 full & 10 sw layers. This PR supports this case by generating the KV cache groups based on the full:sw ratio of the full model.
Fixes #23883
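For illustration, a standalone sketch with made-up data structures (not code from this PR): the per-stage full:sw counts differ, but the counts over the whole model are what the KV cache groups are generated from.

```python
# Count layer types per PP stage and over the whole model for the #23883
# setup described above (hypothetical data layout, for illustration only).
from collections import Counter

stage_layers = {
    0: ["full"] * 10 + ["sw"] * 11,
    1: ["full"] * 11 + ["sw"] * 10,
}

per_stage = {stage: Counter(layers) for stage, layers in stage_layers.items()}
global_count = Counter(t for layers in stage_layers.values() for t in layers)

print(per_stage)     # stage 0 -> 10 full / 11 sw, stage 1 -> 11 full / 10 sw
print(global_count)  # 21 full / 21 sw -> the ratio used to build the groups
```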
UPD 2025/09/10
We need to be careful about how to partition the layers into groups. In the PP case, say we have 2 full-attention layers (full.0, full.1) and 4 sliding-window layers (sw.0-sw.3), where stage 0 holds (full.0, sw.0, sw.1) and stage 1 holds (full.1, sw.2, sw.3).
We should have 3 groups: (full.0, full.1), (sw.0, sw.2), (sw.1, sw.3)
It can't be (full.0, full.1), (sw.0, sw.1), (sw.2, sw.3): in that case the 3 groups in stage 0 would be (full.0), (sw.0, sw.1), and an empty group, which would be padded to (full.0, padding), (sw.0, sw.1), (padding, padding) to keep the number of layers in each group the same, wasting memory. To avoid this, we assign `layers[i::num_groups]` to the i-th group instead of `layers[i * group_size : (i + 1) * group_size]`.
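A quick standalone sketch of the difference between the two splits (made-up variable names, not the code in this PR):

```python
# Compare the contiguous split with the interleaved split for the toy model
# above. Stage 0 holds sw.0/sw.1 and stage 1 holds sw.2/sw.3.
sw_layers = ["sw.0", "sw.1", "sw.2", "sw.3"]
num_groups = 2
group_size = len(sw_layers) // num_groups

# Contiguous: layers[i * group_size : (i + 1) * group_size]
contiguous = [sw_layers[i * group_size:(i + 1) * group_size] for i in range(num_groups)]
# -> [['sw.0', 'sw.1'], ['sw.2', 'sw.3']]: stage 0's second sw group is empty
#    and must be padded; likewise stage 1's first sw group.

# Interleaved: layers[i::num_groups]
interleaved = [sw_layers[i::num_groups] for i in range(num_groups)]
# -> [['sw.0', 'sw.2'], ['sw.1', 'sw.3']]: every stage owns exactly one layer
#    from each group, so no padding is needed.

print("contiguous :", contiguous)
print("interleaved:", interleaved)
```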
I've updated the comments in kv_cache_interface and will update the related figures in the design doc in a follow-up PR.
Test Plan
Test Result
2-3: pass locally