
Conversation

@MatthewBonanni (Contributor) commented Sep 16, 2025

Purpose

The Blackwell test is broken on main. This PR fixes the underlying issue: the dtype was changing between tests, causing an assertion failure in the fused MoE buffer. Now the buffer is rebuilt if the dtype (or device) changes.
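For illustration, a minimal sketch of the failure mode (the class and names here are illustrative, not the actual vLLM code): a size-only cache check silently reuses a buffer that was allocated with a stale dtype.

import torch
from math import prod

class NaiveSharedBuffer:

    def __init__(self):
        self.buffer = None

    def get(self, shape, device, dtype):
        shape_numel = prod(shape)
        # Only the size is checked -- a dtype or device change between
        # calls is ignored, so the stale buffer is reused.
        if self.buffer is None or self.buffer.numel() < shape_numel:
            self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)
        return self.buffer[:shape_numel].view(*shape)

buf = NaiveSharedBuffer()
a = buf.get((4, 4), torch.device("cpu"), torch.float16)
b = buf.get((4, 4), torch.device("cpu"), torch.float32)
assert b.dtype == torch.float32, f"stale dtype: {b.dtype}"  # fails without the dtype check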

Test Plan

pytest tests/kernels/moe/test_nvfp4_moe.py

Test Result

Test passes



@MatthewBonanni changed the title [CI][Bugfix] Fix broken Blackwell test → [CI][Bugfix] Fix failing Blackwell test on Sep 16, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request fixes a bug in the Blackwell test where a shared buffer was not being recreated when the dtype changed between tests. The fix correctly adds checks for device and dtype to the buffer reallocation logic.

However, the implementation of SharedResizableBuffer.get is not thread-safe. Since the buffers are shared as class attributes, concurrent requests can lead to a race condition, causing memory corruption. I've added a critical review comment with a suggestion to add a lock to ensure thread safety.

Comment on lines 509 to 511
if (self.buffer is None or self.buffer.numel() < shape_numel or
        self.buffer.device != device or self.buffer.dtype != dtype):
    self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)

critical

While this change correctly handles buffer recreation for different dtypes or devices in sequential runs (like tests), it is not thread-safe.

The SharedResizableBuffer instances are used as class attributes in FusedMoEModularKernel, making them singletons shared across all MoE layers. If multiple threads execute MoE layers concurrently, they can race to check and reallocate the buffer in get(), leading to memory corruption or incorrect results.

To ensure thread safety, the read-modify-write operation on self.buffer must be atomic. Please protect it with a threading.Lock.

This would involve:

  1. Adding import threading at the top of the file.
  2. Initializing a lock in SharedResizableBuffer.__init__: self._lock = threading.Lock().
  3. Wrapping the logic in get() with the lock: with self._lock: ....

Example of the final SharedResizableBuffer class:

import threading
from math import prod
...

class SharedResizableBuffer:

    def __init__(self):
        self.buffer = None
        self._lock = threading.Lock()

    def get(self, shape: tuple[int, ...], device: torch.device,
            dtype: torch.dtype):
        with self._lock:
            shape_numel = prod(shape)
            if (self.buffer is None or self.buffer.numel() < shape_numel or
                self.buffer.device != device or self.buffer.dtype != dtype):
                self.buffer = torch.empty(shape_numel, device=device, dtype=dtype)
            return self.buffer[:shape_numel].view(*shape)

Since these changes are outside the diff, I cannot provide a direct code suggestion. This is a critical issue that should be addressed to prevent race conditions in production.
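For illustration, a minimal usage sketch of the locked class above (the shapes, dtypes, and loop counts are assumptions, not a test from the PR): concurrent callers can exercise get() without observing a mismatched dtype, because the check-and-view happens under the lock.

import threading
import torch

buf = SharedResizableBuffer()  # the locked version sketched above

def worker(dtype):
    for _ in range(100):
        t = buf.get((8, 8), torch.device("cpu"), dtype)
        # The view is created while the lock is held, so its dtype always
        # matches the request, even if another thread reallocates the
        # shared buffer afterward (the old storage stays alive via t).
        assert t.dtype == dtype

threads = [threading.Thread(target=worker, args=(dt,))
           for dt in (torch.float16, torch.float32) * 4]
for th in threads:
    th.start()
for th in threads:
    th.join()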

@LucasWilkinson (Collaborator) left a comment

Ya seems fine; I guess the worst case is we just end up back in the previous behavior of just allocating every time 👍 thanks for the fix!

@LucasWilkinson added the ready (ONLY add when PR is ready to merge/full CI is needed) label Sep 16, 2025
@mgoin enabled auto-merge (squash) September 16, 2025 21:38
@zhuohan123 disabled auto-merge September 16, 2025 22:54
@zhuohan123 merged commit d119fc8 into vllm-project:main Sep 16, 2025
45 of 46 checks passed
@njhill (Member) commented Sep 16, 2025

@MatthewBonanni do you know how this was introduced in the first place? Did that test group run on the PR in question? Is it that it only fails intermittently?

@MatthewBonanni deleted the fix_cache_dtype branch September 17, 2025 01:18
@MatthewBonanni (Contributor, Author) commented

@njhill It looks like this was introduced by the DBO PR, #23693. The Blackwell test was not run on that PR: https://buildkite.com/vllm/ci/builds/30923

@njhill (Member) commented Sep 17, 2025

> @njhill It looks like this was introduced by the DBO PR, #23693. The Blackwell test was not run on that PR: https://buildkite.com/vllm/ci/builds/30923

Thanks @MatthewBonanni, I guess that means we should potentially adjust the scoping of the Blackwell tests.
