[Kernels][DP/EP] Optimize Silu Kernel for R1 #24054
Conversation
Code Review
This pull request introduces a new CUDA kernel for silu_mul_fp8_quant
to replace the existing Triton implementation, aiming for better performance. The changes include the new CUDA kernel, its C++ bindings, and updates to tests and benchmarks to compare against the old implementation, which is preserved as a baseline. While the overall approach is sound, the review identified several critical correctness issues in the new CUDA kernel related to parameter handling, as well as high-severity maintainability problems such as dead code, code duplication, and style violations. These issues should be addressed to ensure the correctness and long-term health of the codebase.
```
// quant params
float fp8_min, float fp8_max) {
  static constexpr float EPS = 1e-10;
```
The kernel uses a hardcoded `EPS` value, ignoring the `eps` parameter passed to the host function `silu_mul_fp8_quant_deep_gemm_cuda`. This is a correctness bug. The `eps` value should be passed to this kernel as an argument and used instead of the hardcoded constant.

This will require changes in:
- The kernel signature, to accept `eps`.
- The kernel body, to use the `eps` argument (e.g., `float y_max = eps;` on line 265).
- The host launcher `silu_mul_fp8_quant_deep_gemm_cuda`, to pass `eps` to the kernel.
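A minimal sketch of that plumbing, with hypothetical names and a simplified per-element body rather than the real per-group quantization logic; the point is only that `eps` flows from the host launcher into the kernel as a runtime argument:

```cuda
#include <cuda_runtime.h>

// Hypothetical simplified kernel: eps is now a runtime argument instead of
// a hardcoded constant. (The real kernel computes a per-group scale; this
// sketch only illustrates how eps flows through.)
__global__ void silu_mul_quant_sketch(const float* y, float* out, int n,
                                      float fp8_min, float fp8_max,
                                      float eps) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float y_max = eps;  // was: static constexpr float EPS = 1e-10;
  y_max = fmaxf(y_max, fabsf(y[i]));
  out[i] = fminf(fmaxf(y[i] / y_max, fp8_min), fp8_max);
}

// Hypothetical host launcher: forwards the Python-level eps to the kernel.
void silu_mul_quant_launch(const float* y, float* out, int n, double eps,
                           cudaStream_t stream) {
  constexpr float kFp8Min = -448.f, kFp8Max = 448.f;  // e4m3 finite range
  int threads = 256, blocks = (n + threads - 1) / threads;
  silu_mul_quant_sketch<<<blocks, threads, 0, stream>>>(
      y, out, n, kFp8Min, kFp8Max, static_cast<float>(eps));
}
```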
@LucasWilkinson Is this OK?
Maybe we should remove eps/fp8_max/fp8_min from the Python API?
(APIServer pid=1) (EngineCore_0 pid=283) RuntimeError: Worker failed with error ''Keyword argument NUM_WARPS was specified but unrecognised'', please check the stack trace above for the root cause
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611] a2q, a2q_scale = silu_mul_fp8_quant_deep_gemm(workspace1,
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611] File "/opt/vllm-source/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py", line 206, in silu_mul_fp8_quant_deep_gemm
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611] _silu_mul_fp8_quant_deep_gemm[grid](
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611] File "/opt/vllm/lib64/python3.12/site-packages/triton/runtime/jit.py", line 390, in <lambda>
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(APIServer pid=1) (EngineCore_7 pid=304) (VllmWorker pid=368) ERROR 09-03 01:11:27 [multiproc_executor.py:611] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It looks like the Triton kernel is still hooked up to the actual layer.
Made a branch on the nm repo that hooks up the kernel: 376129e. I still have an IMA: RuntimeError: CUDA driver error: an illegal memory access was encountered. I'm trying to build it locally.
@elvircrn would a coredump help? https://blog.vllm.ai/2025/08/11/cuda-debugging.html

@robertgshaw2-redhat would you be able to enable these environment variables?
@LucasWilkinson Yes, a coredump would help, thanks!
@robertgshaw2-redhat @LucasWilkinson Can `H` ever not be divisible by 128?
/gemini-review
I've received your request for a new review. Please note that this is a command for the system to initiate a new review, and I will not be providing a conversational response here. The review process will begin shortly.
LGTM, great job and very thorough work!
@elvircrn the CI build failures seem related to this PR.
@LucasWilkinson The pre-commit check failure is unrelated to this PR.
Used this for reference: a5b84f1
Sounds good - let's wait for distributed-tests-2-gpus and entrypoints-integration-test-api-server to finish, and then we can request a force merge.
@tlrmchlsmth @LucasWilkinson CI is done.
Purpose
The purpose of this PR is to replace the Triton silu implementation with a faster CUDA version.
Here's a benchmark slice:
This was achieved by launching additional CUDA blocks and parallelizing over the T dimension. This means that we end up launching no-op threads and that the parallelization factor is now an additional tunable parameter.
To understand the impact of the parallelization factor, see the following graphs for E <= 9:
For E=32, we have:




16x seems like a parallelization factor that works well for most configurations, so it was chosen as the default.
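A rough sketch of that launch strategy (hypothetical names and a placeholder body, not the actual vLLM kernel): the grid gains a second dimension equal to the parallelization factor, and threads whose token index runs past `T` simply do nothing.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: gridDim.y is the tunable parallelization factor.
// Each block row (blockIdx.y) covers a strided slice of the T dimension.
__global__ void silu_over_t_sketch(const float* y, float* out, int T) {
  const long e = blockIdx.x;  // expert index
  for (int t = blockIdx.y * blockDim.x + threadIdx.x; t < T;
       t += gridDim.y * blockDim.x) {
    // The real kernel does silu(gate) * up plus fp8 quantization per token;
    // this placeholder copy just keeps the sketch compilable.
    out[e * T + t] = y[e * T + t];
  }
  // Threads that never enter the loop are the no-op threads mentioned above:
  // with a large factor and a small T, many launch and exit immediately.
}

void launch_over_t(const float* y, float* out, int E, int T, int factor) {
  dim3 grid(E, factor);  // factor = 16 is the default chosen above
  silu_over_t_sketch<<<grid, 256>>>(y, out, T);
}
```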
Test Plan
Given `y` of shape `(E, T, 2H)` as input, the function is expected to work for all `GROUP_SIZE=128` and `H` divisible by 128.
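As a concrete reading of that contract, a small hypothetical host-side shape check (illustrative only, not taken from the PR):

```cuda
#include <cassert>
#include <cstdio>

// Hypothetical guard for the contract above: input is (E, T, 2H) with
// GROUP_SIZE = 128 and H divisible by 128. Not the actual vLLM check.
constexpr long kGroupSize = 128;

bool shapes_ok(long E, long T, long last_dim) {
  if (E <= 0 || T <= 0 || last_dim <= 0 || last_dim % 2 != 0) return false;
  long H = last_dim / 2;  // last dimension holds the two H-sized halves
  return H % kGroupSize == 0;
}

int main() {
  assert(shapes_ok(32, 16, 2 * 1024));  // H = 1024: divisible by 128
  assert(!shapes_ok(32, 16, 2 * 100));  // H = 100: rejected
  puts("shape checks passed");
}
```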
Test Result
From main: