
@heheda12345 heheda12345 commented Aug 30, 2025

Purpose

Different PP stages can hold different sets of layers, so the full:sw (full-attention : sliding-window) ratio can differ between stages. For example, #23883 has 10 full & 11 sw layers in stage 0, and 11 full & 10 sw layers in stage 1. This PR supports that case by generating the KV cache groups based on the full:sw ratio of the whole model.
Fixes #23883

Update 2025/09/10
We need to be careful about how to partition the layers into groups.
In the PP case, say we have:

  • stage 0: full.0, sw.0, sw.1
  • stage 1: full.1, sw.2, sw.3

We should have 3 groups: (full.0, full.1), (sw.0, sw.2), (sw.1, sw.3).
It can't be (full.0, full.1), (sw.0, sw.1), (sw.2, sw.3), because then the 3 groups on stage 0 would be (full.0), (sw.0, sw.1), (empty group), which would be padded to (full.0, padding), (sw.0, sw.1), (padding, padding) to keep the number of layers in each group the same, wasting memory. To avoid this, we assign layers[i::num_groups] to the i-th group instead of layers[i * group_size : (i + 1) * group_size], as in the sketch below.
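
A minimal sketch (plain Python, not the actual vLLM grouping code; the layer names are just the hypothetical ones from the example above) showing the two splits and why the interleaved one avoids empty groups on either stage:

```python
# Sliding-window layers in global order: sw.0/sw.1 live on stage 0, sw.2/sw.3 on stage 1.
sw_layers = ["sw.0", "sw.1", "sw.2", "sw.3"]
num_groups = 2
group_size = len(sw_layers) // num_groups

# Contiguous split: each group is confined to one stage, so stage 0 ends up with
# (sw.0, sw.1) plus an empty group that would need padding.
contiguous = [sw_layers[i * group_size:(i + 1) * group_size] for i in range(num_groups)]
# -> [['sw.0', 'sw.1'], ['sw.2', 'sw.3']]

# Interleaved split: each group takes one layer from every stage, so both stages
# see one layer per group and no padding is needed.
interleaved = [sw_layers[i::num_groups] for i in range(num_groups)]
# -> [['sw.0', 'sw.2'], ['sw.1', 'sw.3']]

print(contiguous)
print(interleaved)
```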

I've updated the comments in kv_cache_interface and will update the related figures in the design doc in a follow-up PR.

Test Plan

  1. vllm serve -pp 2 BAAI/bge-multilingual-gemma2 --enforce-eager
  2. several unit tests for TP / PP
  3. add unit test for attention free
  4. re-enable BAAI/bge-multilingual-gemma2 PP test

Test Result

  1. Can launch the server.
  2-3. Pass locally.
  4. Let's wait for CI.


Signed-off-by: Chen Zhang <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for pipeline parallelism where different stages can have varying layer compositions. The core change involves refactoring the KV cache configuration logic to first determine a global grouping strategy based on the entire model's layers, and then applying this to each worker. This is a solid approach that enhances flexibility for pipeline parallel setups. The associated tests are comprehensive and cover various scenarios including TP, PP, and hybrid models.

I've identified one potential issue in the new logic that could lead to a crash in an edge case involving workers with no KV cache layers. A fix is suggested below. Overall, the changes are well-structured and move the project in the right direction.

Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>

@maxdebayser maxdebayser left a comment


Thanks for the fix @heheda12345. I'm not very familiar with the inner workings of this part of the code, but it makes sense to me. I like that it's cleaner and that you added detailed comments.


mergify bot commented Sep 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 1, 2025
@heheda12345 heheda12345 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 2, 2025
@mergify mergify bot removed the needs-rebase label Sep 2, 2025
Signed-off-by: Chen Zhang <[email protected]>
@heheda12345 heheda12345 enabled auto-merge (squash) September 2, 2025 17:08
Signed-off-by: Chen Zhang <[email protected]>
@mergify mergify bot added the tpu Related to Google TPUs label Sep 2, 2025

mergify bot commented Sep 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 8, 2025
@heheda12345

I'm waiting for [buildkite/ci/pr/distributed-tests-2-gpus](https://buildkite.com/vllm/ci/builds/29500#01991676-e49f-4d44-b9cd-f2e5b8295673) to be fixed on main.

@mergify mergify bot removed the needs-rebase label Sep 10, 2025
@heheda12345 heheda12345 merged commit 8e5cdcd into vllm-project:main Sep 14, 2025
47 checks passed
@heheda12345 heheda12345 deleted the hybrid_pp branch September 14, 2025 22:55
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Sep 15, 2025
xuechendi added a commit to vllm-project/vllm-gaudi that referenced this pull request Sep 15, 2025
[PR #23974](vllm-project/vllm#23974) updated the
vllm.v1.core.kv_cache_utils.get_kv_cache_configs api.

Signed-off-by: Chendi Xue <[email protected]>
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025
cboss6 pushed a commit to cboss6/vllm that referenced this pull request Sep 16, 2025
langc23 pushed a commit to zte-riscv/vllm that referenced this pull request Sep 23, 2025
slokesha pushed a commit to slokesha/vllm-gaudi that referenced this pull request Sep 24, 2025
[PR #23974](vllm-project/vllm#23974) updated the
vllm.v1.core.kv_cache_utils.get_kv_cache_configs api.

Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: slokesha <[email protected]>
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1

Successfully merging this pull request may close these issues.

[Bug]: KV cache specs are not equal accross rank