
Conversation

@MatthewBonanni (Contributor) commented Sep 16, 2025

Purpose

The Blackwell test is broken on main. This PR fixes the underlying issue: the dtype was changing between tests, causing an assertion failure in the fused MoE buffer. Now the buffer is rebuilt if the dtype (or device) changes.
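For illustration, a minimal sketch of the failure mode (the class and names here are illustrative, not the actual vLLM code): a size-only cache check silently reuses a buffer that was allocated with a stale dtype.

import torch
from math import prod

class NaiveSharedBuffer:

    def __init__(self):
        self.buffer = None

    def get(self, shape, device, dtype):
        shape_numel = prod(shape)
        # Only the size is checked -- a dtype or device change between
        # calls is ignored, so the stale buffer is reused.
        if self.buffer is None or self.buffer.numel() < shape_numel:
            self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)
        return self.buffer[:shape_numel].view(*shape)

buf = NaiveSharedBuffer()
a = buf.get((4, 4), torch.device("cpu"), torch.float16)
b = buf.get((4, 4), torch.device("cpu"), torch.float32)
assert b.dtype == torch.float32, f"stale dtype: {b.dtype}"  # fails without the dtype check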

Test Plan

pytest tests/kernels/moe/test_nvfp4_moe.py

Test Result

Test passes



@MatthewBonanni changed the title [CI][Bugfix] Fix broken Blackwell test → [CI][Bugfix] Fix failing Blackwell test on Sep 16, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request fixes a bug in the Blackwell test where a shared buffer was not being recreated when the dtype changed between tests. The fix correctly adds checks for device and dtype to the buffer reallocation logic.

However, the implementation of SharedResizableBuffer.get is not thread-safe. Since the buffers are shared as class attributes, concurrent requests can lead to a race condition, causing memory corruption. I've added a critical review comment with a suggestion to add a lock to ensure thread safety.

Comment on lines 509 to 511
if (self.buffer is None or self.buffer.numel() < shape_numel or
        self.buffer.device != device or self.buffer.dtype != dtype):
    self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)

critical

While this change correctly handles buffer recreation for different dtypes or devices in sequential runs (like tests), it is not thread-safe.

The SharedResizableBuffer instances are used as class attributes in FusedMoEModularKernel, making them singletons shared across all MoE layers. If multiple threads execute MoE layers concurrently, they can race to check and reallocate the buffer in get(), leading to memory corruption or incorrect results.

To ensure thread safety, the read-modify-write operation on self.buffer must be atomic. Please protect it with a threading.Lock.

This would involve:

  1. Adding import threading at the top of the file.
  2. Initializing a lock in SharedResizableBuffer.__init__: self._lock = threading.Lock().
  3. Wrapping the logic in get() with the lock: with self._lock: ....

Example of the final SharedResizableBuffer class:

import threading
from math import prod
...

class SharedResizableBuffer:

    def __init__(self):
        self.buffer = None
        self._lock = threading.Lock()

    def get(self, shape: tuple[int, ...], device: torch.device,
            dtype: torch.dtype):
        with self._lock:
            shape_numel = prod(shape)
            if (self.buffer is None or self.buffer.numel() < shape_numel or
                self.buffer.device != device or self.buffer.dtype != dtype):
                self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)
            return self.buffer[:shape_numel].view(*shape)

Since these changes are outside the diff, I cannot provide a direct code suggestion. This is a critical issue that should be addressed to prevent race conditions in production.
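For illustration, a minimal usage sketch of the locked class above (the shapes, dtypes, and loop counts are assumptions, not a test from the PR): concurrent callers can exercise get() without observing a mismatched dtype, because the check-and-view happens under the lock.

import threading
import torch

buf = SharedResizableBuffer()  # the locked version sketched above

def worker(dtype):
    for _ in range(100):
        t = buf.get((8, 8), torch.device("cpu"), dtype)
        # The view is created while the lock is held, so its dtype always
        # matches the request, even if another thread reallocates the
        # shared buffer afterward (the old storage stays alive via t).
        assert t.dtype == dtype

threads = [threading.Thread(target=worker, args=(dt,))
           for dt in (torch.float16, torch.float32) * 4]
for th in threads:
    th.start()
for th in threads:
    th.join()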

@LucasWilkinson (Collaborator) left a comment

Ya seems fine; I guess the worst case is we just end up back in the previous behavior of just allocating every time 👍 thanks for the fix!

@LucasWilkinson added the ready (ONLY add when PR is ready to merge/full CI is needed) label Sep 16, 2025
@mgoin enabled auto-merge (squash) September 16, 2025 21:38
@zhuohan123 disabled auto-merge September 16, 2025 22:54
@zhuohan123 merged commit d119fc8 into vllm-project:main Sep 16, 2025
45 of 46 checks passed
@njhill (Member) commented Sep 16, 2025

@MatthewBonanni do you know how this was introduced in the first place? Did that test group run on the PR in question? Is it that it only fails intermittently?

@MatthewBonanni deleted the fix_cache_dtype branch September 17, 2025 01:18
@MatthewBonanni (Contributor, Author) commented

@njhill It looks like this was introduced by the DBO PR, #23693. The Blackwell test was not run on that PR: https://buildkite.com/vllm/ci/builds/30923

@njhill (Member) commented Sep 17, 2025

> @njhill It looks like this was introduced by the DBO PR, #23693. The Blackwell test was not run on that PR: https://buildkite.com/vllm/ci/builds/30923

Thanks @MatthewBonanni, I guess that means we should potentially adjust the scoping of the Blackwell tests.
