-
-
Notifications
You must be signed in to change notification settings - Fork 10.2k
[Chore] Minor simplification for non-PP path #24810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Woosuk Kwon <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request provides a nice simplification to the execute_model
method in GPUModelRunner
. By moving the broadcast_pp_output
flag to the __init__
method and restructuring the logic to handle the common (non-broadcast) and rare (broadcast) paths separately, the code becomes much clearer and easier to follow. This refactoring also correctly handles kv_connector_output
for pooling models and appears to fix a latent bug where logits
were not updated on non-last pipeline parallel ranks after being broadcast. The changes are well-contained and improve both readability and correctness. I have no further suggestions.
model_output_broadcast_data = { | ||
"logits": logits.contiguous(), | ||
} if logits is not None else {} | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a thought, could we put the else
logic in a different function to be even less intrusive to the common path?
Signed-off-by: Woosuk Kwon <[email protected]>
…to loader * 'loader' of https://github.com/dsxsteven/vllm_splitPR: (123 commits) [Hybrid Allocator] Support Pipeline Parallel (vllm-project#23974) [Spec Decoding]Support Spec Decoding Metrics in DP Mode (vllm-project#24049) [Chore] Remove ipex_ops warning (vllm-project#24835) Force use C++17 globally to avoid compilation error (vllm-project#24823) [Benchmarks] Throw usage error when using dataset-name random and dataset-path together (vllm-project#24819) fix type of sampling rate for encode_base64 (vllm-project#24826) [Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement (vllm-project#24783) [Misc] Improve `s3_utils` type hints with `BaseClient` (vllm-project#24825) [Multi Modal][Performance] Fused Q,K's apply_rope into one (vllm-project#24511) [Chore] Minor simplification for non-PP path (vllm-project#24810) [Minor] Simplify duplicative device check for cuda (vllm-project#24793) Remove redundant assignment in xfer_buffers, This is a little fix (vllm-project#24732) [CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again (vllm-project#24771) [Doc]: fix typos in various files (vllm-project#24798) [Misc] Correct an outdated comment. (vllm-project#24765) [CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe (vllm-project#24750) [Core][Multimodal] Cache `supports_kw` (vllm-project#24773) [Kernels][DP/EP] Optimize Silu Kernel for R1 (vllm-project#24054) [Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization (vllm-project#24757) [Doc]: Remove 404 hyperlinks (vllm-project#24785) ...
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: bbartels <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: bruceszchen <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: bruceszchen <[email protected]>
Currently,
broadcast_pp_output
complicates the logic in the common non-PP path.This PR simplifies the logic a bit.