[Hybrid Allocator] Support Pipeline Parallel #23974
Conversation
Code Review
This pull request introduces support for pipeline parallelism where different stages can have varying layer compositions. The core change involves refactoring the KV cache configuration logic to first determine a global grouping strategy based on the entire model's layers, and then applying this to each worker. This is a solid approach that enhances flexibility for pipeline parallel setups. The associated tests are comprehensive and cover various scenarios including TP, PP, and hybrid models.
I've identified one potential issue in the new logic that could lead to a crash in an edge case involving workers with no KV cache layers. A fix is suggested below. Overall, the changes are well-structured and move the project in the right direction.
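As a rough illustration of the described flow, here is a minimal standalone sketch (hypothetical helper and layer names, not the actual API changed in this PR): the groups are derived once from the whole model's layers, and each PP-stage worker then keeps only the layers it actually holds in each group.

```python
# Rough sketch with hypothetical names (not the actual vLLM API): derive the
# KV cache groups from the full model, then restrict them per worker.

def apply_global_groups(global_groups: list[list[str]],
                        worker_layers: set[str]) -> list[list[str]]:
    """Restrict globally defined KV cache groups to one worker's layers."""
    return [[name for name in group if name in worker_layers]
            for group in global_groups]

# Toy model: 2 full-attention + 4 sliding-window layers over 2 PP stages.
global_groups = [["full.0", "full.1"], ["sw.0", "sw.2"], ["sw.1", "sw.3"]]

stage0 = apply_global_groups(global_groups, {"full.0", "sw.0", "sw.1"})
stage1 = apply_global_groups(global_groups, {"full.1", "sw.2", "sw.3"})
print(stage0)  # [['full.0'], ['sw.0'], ['sw.1']] -- one layer per group
print(stage1)  # [['full.1'], ['sw.2'], ['sw.3']]
```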
Thanks for the fix @heheda12345. I'm not very familiar with how this part of the code works, but it makes sense to me. I like that it's cleaner and that you added detailed comments.
Purpose
Different PP stages can have different layers, so the ratio of full-attention to sliding-window layers (full:sw) differs between stages. For example, in #23883 stage 0 has 10 full & 11 sw layers, while stage 1 has 11 full & 10 sw layers. This PR supports this case by generating the KV cache groups based on the full:sw ratio of the full model.
Fixes #23883
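For illustration, a standalone sketch with made-up data structures (not code from this PR): the per-stage full:sw counts differ, but the counts over the whole model are what the KV cache groups are generated from.

```python
# Count layer types per PP stage and over the whole model for the #23883
# setup described above (hypothetical data layout, for illustration only).
from collections import Counter

stage_layers = {
    0: ["full"] * 10 + ["sw"] * 11,
    1: ["full"] * 11 + ["sw"] * 10,
}

per_stage = {stage: Counter(layers) for stage, layers in stage_layers.items()}
global_count = Counter(t for layers in stage_layers.values() for t in layers)

print(per_stage)     # stage 0 -> 10 full / 11 sw, stage 1 -> 11 full / 10 sw
print(global_count)  # 21 full / 21 sw -> the ratio used to build the groups
```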
UPD 2025/09/10
We need to be careful about how to partition the layers into groups. In the PP case, say we have 2 full-attention layers (full.0, full.1) and 4 sliding-window layers (sw.0-sw.3), where stage 0 holds (full.0, sw.0, sw.1) and stage 1 holds (full.1, sw.2, sw.3).
We should have 3 groups: (full.0, full.1), (sw.0, sw.2), (sw.1, sw.3)
It can't be (full.0, full.1), (sw.0, sw.1), (sw.2, sw.3): in that case the 3 groups in stage 0 would be (full.0), (sw.0, sw.1), and an empty group, which would be padded to (full.0, padding), (sw.0, sw.1), (padding, padding) to keep the number of layers in each group the same, wasting memory. To avoid this, we assign `layers[i::num_groups]` to the i-th group instead of `layers[i * group_size : (i + 1) * group_size]`.
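A quick standalone sketch of the difference between the two splits (made-up variable names, not the code in this PR):

```python
# Compare the contiguous split with the interleaved split for the toy model
# above. Stage 0 holds sw.0/sw.1 and stage 1 holds sw.2/sw.3.
sw_layers = ["sw.0", "sw.1", "sw.2", "sw.3"]
num_groups = 2
group_size = len(sw_layers) // num_groups

# Contiguous: layers[i * group_size : (i + 1) * group_size]
contiguous = [sw_layers[i * group_size:(i + 1) * group_size] for i in range(num_groups)]
# -> [['sw.0', 'sw.1'], ['sw.2', 'sw.3']]: stage 0's second sw group is empty
#    and must be padded; likewise stage 1's first sw group.

# Interleaved: layers[i::num_groups]
interleaved = [sw_layers[i::num_groups] for i in range(num_groups)]
# -> [['sw.0', 'sw.2'], ['sw.1', 'sw.3']]: every stage owns exactly one layer
#    from each group, so no padding is needed.

print("contiguous :", contiguous)
print("interleaved:", interleaved)
```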
I've updated the comments in kv_cache_interface and will update the related figures in the design doc in a follow-up PR.
Test Plan
Test Result
2-3: pass locally