-
-
Notifications
You must be signed in to change notification settings - Fork 10.2k
[CI][Bugfix] Fix failing Blackwell test #24993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Matthew Bonanni <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request fixes a bug in the Blackwell test where a shared buffer was not being recreated when the dtype changed between tests. The fix correctly adds checks for device and dtype to the buffer reallocation logic.
However, the implementation of SharedResizableBuffer.get
is not thread-safe. Since the buffers are shared as class attributes, concurrent requests can lead to a race condition, causing memory corruption. I've added a critical review comment with a suggestion to add a lock to ensure thread safety.
if (self.buffer is None or self.buffer.numel() < shape_numel or | ||
self.buffer.device != device or self.buffer.dtype != dtype): | ||
self.buffer = torch.empty(shape_numel, device=device, dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this change correctly handles buffer recreation for different dtypes or devices in sequential runs (like tests), it is not thread-safe.
The SharedResizableBuffer
instances are used as class attributes in FusedMoEModularKernel
, making them singletons shared across all MoE layers. If multiple threads execute MoE layers concurrently, they can race to check and reallocate the buffer in get()
, leading to memory corruption or incorrect results.
To ensure thread safety, the read-modify-write operation on self.buffer
must be atomic. Please protect it with a threading.Lock
.
This would involve:
- Adding
import threading
at the top of the file. - Initializing a lock in
SharedResizableBuffer.__init__
:self._lock = threading.Lock()
. - Wrapping the logic in
get()
with the lock:with self._lock: ...
.
Example of the final SharedResizableBuffer
class:
import threading
from math import prod
...
class SharedResizableBuffer:
def __init__(self):
self.buffer = None
self._lock = threading.Lock()
def get(self, shape: tuple[int, ...], device: torch.device,
dtype: torch.dtype):
with self._lock:
shape_numel = prod(shape)
if (self.buffer is None or self.buffer.numel() < shape_numel or
self.buffer.device != device or self.buffer.dtype != dtype):
self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)
return self.buffer[:shape_numel].view(*shape)
Since these changes are outside the diff, I cannot provide a direct code suggestion. This is a critical issue that should be addressed to prevent race conditions in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya seems fine; I guess the worst case is we just end up back in the previous behavior of just allocating every time 👍 thanks for the fix!
Signed-off-by: Matthew Bonanni <[email protected]>
@MatthewBonanni do you know how this was introduced in the first place? Did that test group run on the PR in question? is it that it only fails intermittently? |
@njhill It looks like this was introduced by the DBO PR, #23693 . The blackwell test was not run on that PR: https://buildkite.com/vllm/ci/builds/30923 |
Thanks @MatthewBonanni, I guess that means we should potentially adjust the scoping of the blackwell tests. |
Purpose
Blackwell test is broken on main. This PR fixes the underlying issue. The dtype was changing between tests, causing an assertion failure in the fused MoE buffer. Now, the buffer is rebuilt if the dtype (or device) changes
Test Plan
pytest tests/kernels/moe/test_nvfp4_moe.py
Test Result
Test passes
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.