[None][chore] remove circular dependency between model engine and cuda graph runner #7572
base: main
Conversation
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
/bot run |
PR_Github #17812 [ run ] triggered by Bot |
📝 Walkthrough
Refactors CUDAGraphRunner to a dependency-injected, engine-agnostic component with expanded APIs carrying speculative-decoding context and metadata. Updates ModelEngine to wire new parameters and draft-token CUDA buffers. Migrates unit tests to a new create_mock_cuda_graph_runner helper and adapts capture/replay call signatures with an added boolean flag.

Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant ME as ModelEngine
participant GR as CUDAGraphRunner
participant FW as forward_fn
participant Attn as AttentionMetadata
participant Spec as SpecMetadata
Note over ME,GR: New flow passes is_spec_decode, metadata, draft tokens
ME->>GR: maybe_get_cuda_graph(batch, iter_counter, is_spec_decode, Attn, Spec?, draft_tokens_cuda)
GR-->>ME: (can_use_graph, Attn', Spec')
alt needs capture
ME->>GR: needs_capture(batch_size, is_spec_decode)
GR-->>ME: bool
alt true
ME->>GR: capture(batch_size, is_spec_decode, FW, initial_inputs)
GR->>FW: forward(**initial_inputs) during capture
FW-->>GR: outputs (captured)
GR-->>ME: capture complete
end
end
alt can_use_graph
ME->>GR: replay(batch_size, is_spec_decode, current_inputs)
GR->>GR: update static tensors, set draft_len from Spec'
GR-->>ME: logits (replayed)
else not eligible
ME->>FW: forward(**current_inputs)
FW-->>ME: logits
end
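In code, the flow in the diagram corresponds roughly to the sketch below. The call shapes are taken from the diagram itself, so treat the parameter names and ordering as assumptions rather than the exact ModelEngine implementation.

# Minimal sketch of the engine-side flow above; names mirror the sequence
# diagram and are assumptions about the final API, not the exact code.
def run_one_iteration(graph_runner, forward_fn, batch, iter_counter,
                      is_spec_decode, attn_metadata, spec_metadata,
                      draft_tokens_cuda, batch_size, initial_inputs,
                      current_inputs):
    can_use_graph, attn_metadata, spec_metadata = graph_runner.maybe_get_cuda_graph(
        batch, iter_counter, is_spec_decode, attn_metadata, spec_metadata,
        draft_tokens_cuda)

    if can_use_graph:
        # Capture once per (batch_size, draft_len) key; later iterations replay.
        if graph_runner.needs_capture(batch_size, is_spec_decode):
            graph_runner.capture(batch_size, is_spec_decode, forward_fn,
                                 initial_inputs)
        return graph_runner.replay(batch_size, is_spec_decode, current_inputs)

    # Not eligible for the CUDA graph path: run the eager forward instead.
    return forward_fn(**current_inputs)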
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)
416-423: Fix int64→int32 dtype mismatch in CUDA-graph path (will crash at replay). cuda_graph_runner allocates static input_ids/position_ids as int32; current inputs use default int (arange → int64). torch.Tensor.copy_ requires matching dtypes and will error at replay. Cast both to int32 before capture/replay.
Apply this diff:
- inputs = {
-     "input_ids": input_ids,
-     "position_ids": position_ids,
-     "attn_metadata": attn_metadata,
- }
+ inputs = {
+     "input_ids": input_ids.to(torch.int32),
+     "position_ids": position_ids.to(torch.int32),
+     "attn_metadata": attn_metadata,
+ }

Also applies to: 429-429
tests/unittest/_torch/modeling/test_modeling_phi3.py (1)
322-329: Align dtypes with CUDA-graph static buffers (int32). position_ids from arange default to int64; runner uses int32 buffers, causing copy_ dtype mismatch at replay. Cast inputs before capture/replay.
- inputs = {
-     "input_ids": input_ids,
-     "position_ids": position_ids,
-     "attn_metadata": attn_metadata,
- }
+ inputs = {
+     "input_ids": input_ids.to(torch.int32),
+     "position_ids": position_ids.to(torch.int32),
+     "attn_metadata": attn_metadata,
+ }

Also applies to: 335-335
tests/unittest/_torch/modeling/test_modeling_mllama.py (1)
429-436: Prevent dtype mismatch with CUDA-graph int32 inputs. Ensure input_ids/position_ids are int32 before capture/replay to match runner’s static tensors.
- inputs = {
-     "input_ids": input_ids,
-     "position_ids": position_ids,
-     "attn_metadata": attn_metadata,
- }
+ inputs = {
+     "input_ids": input_ids.to(torch.int32),
+     "position_ids": position_ids.to(torch.int32),
+     "attn_metadata": attn_metadata,
+ }

Also applies to: 442-442
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)
327-334: Cast position_ids to int32 for CUDA-graph replay. Runner’s static tensors are int32; arange yields int64. Cast before capture to avoid copy_ errors.
- inputs = {
-     "input_ids": input_ids,
-     "position_ids": position_ids,
-     "attn_metadata": attn_metadata,
- }
+ inputs = {
+     "input_ids": input_ids.to(torch.int32),
+     "position_ids": position_ids.to(torch.int32),
+     "attn_metadata": attn_metadata,
+ }

Also applies to: 340-340
tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)
322-329: Match int32 expectations in CUDA-graph path. Cast inputs to int32 to align with cuda_graph_runner’s buffers.
- inputs = {
-     "input_ids": input_ids,
-     "position_ids": position_ids,
-     "attn_metadata": attn_metadata,
- }
+ inputs = {
+     "input_ids": input_ids.to(torch.int32),
+     "position_ids": position_ids.to(torch.int32),
+     "attn_metadata": attn_metadata,
+ }

Also applies to: 335-335
tests/unittest/_torch/modeling/test_modeling_qwen.py (1)
85-93: Python 3.8 compatibility: avoid built-in generics. dict[str, Any] requires Python 3.9+. Tests target 3.8+. Use typing.Dict.
-from typing import Any
+from typing import Any, Dict
@@
-def reduce_qwen_config(mem_for_full_model: int, config_dict: dict[str, Any]):
+def reduce_qwen_config(mem_for_full_model: int, config_dict: Dict[str, Any]):

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
400-406: Fix dtype of index tensors (must be torch.long).
gather_ids_cuda and previous_pos_indices_cuda are later used for tensor indexing (e.g., logits[gather_ids]), which requires LongTensor indices in PyTorch. Using torch.int (int32) risks runtime errors.

Apply:

- self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
-                                    dtype=torch.int,
-                                    device='cuda')
- self.previous_pos_indices_cuda = torch.empty(
-     (self.max_num_tokens, ), dtype=torch.int, device='cuda')
+ self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
+                                    dtype=torch.long,
+                                    device='cuda')
+ self.previous_pos_indices_cuda = torch.empty(
+     (self.max_num_tokens, ), dtype=torch.long, device='cuda')

Note: previous_batch_indices_cuda (Line 444) is also used as an index and should be torch.long for consistency. See additional snippet below.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
225-225: Inconsistent return value from replay method.
The replay method returns output_ref, which is a callable (weak reference), but the return type annotation suggests it should return Optional[torch.Tensor]. The method should either call the reference or update the return type.

Call the weak reference to get the actual tensor:

- return output_ref
+ return output_ref() if output_ref else None
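For callers, the practical consequence is that replay() can hand back None once the weak reference dies. A defensive pattern, purely illustrative and using the replay signature shown elsewhere in this PR, would be:

# Illustrative only: handle a possibly-invalidated weak-ref result from replay().
logits = graph_runner.replay(batch_size, is_spec_decode, current_inputs)
if logits is None:
    # Captured output was garbage-collected; fall back to an eager forward pass.
    logits = forward_fn(**current_inputs)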
🧹 Nitpick comments (11)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)
406-408: Release CUDA graph resources after test.
Call graph_runner.clear() after use to free the graph’s memory pool.

Example (near test end):

if graph_runner is not None:
    graph_runner.clear()

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)
312-314: Exercise the CUDA-graph path in this test.
scenario.use_cuda_graph is never set to True here, so the CUDA-graph code isn’t exercised. Consider parameterizing to run both paths.
tests/unittest/_torch/helpers.py (1)
171-186: Helper factory looks good; add return type and optional knobs. Annotate the return type and expose use_mrope/enable_attention_dp to avoid future test drift.

-def create_mock_cuda_graph_runner(batch_size: int):
+def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
     return CUDAGraphRunner(
         use_cuda_graph=True,
         cuda_graph_padding_enabled=False,
         supported_batch_sizes=[batch_size],
         max_supported_batch_size=batch_size,
         max_batch_size=batch_size,
         max_beam_width=1,
         max_draft_len=0,
         use_mrope=False,
         spec_config=None,
         cuda_graph_mem_pool=None,
         enable_attention_dp=False,
         mapping=Mapping(),
         dist=None,
         kv_cache_manager_key=ResourceManagerType.KV_CACHE_MANAGER)

Optionally:

-def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
+def create_mock_cuda_graph_runner(
+    batch_size: int,
+    *,
+    use_mrope: bool = False,
+    enable_attention_dp: bool = False,
+) -> CUDAGraphRunner:
@@
-    use_mrope=False,
+    use_mrope=use_mrope,
@@
-    enable_attention_dp=False,
-    mapping=Mapping(),
+    enable_attention_dp=enable_attention_dp,
+    mapping=Mapping(enable_attention_dp=enable_attention_dp),

tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)
25-25: Fix E402: keep imports at top-of-file.
Move this import to the top with the other imports to satisfy Ruff E402.

+from _torch.helpers import create_mock_cuda_graph_runner
@@
-from _torch.helpers import create_mock_cuda_graph_runner

tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)
320-321: Minor: avoid magic number for batch size in tests.
Consider a local graph_bs = 1 to avoid repeating the literal and ease future edits.

343-344: Optional: assert non-None replay output.
replay returns an Optional (weak-ref). Add a quick assert logits is not None for clearer failures.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)
402-402: Minor: avoid magic number for batch size.
Define a local graph_bs = 1 and reuse it in capture/replay calls.

422-423: Optional: guard against None from replay.
Add assert logits is not None before comparisons to make weak-ref issues obvious.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)
328-329: Minor: avoid magic number for batch size.
Use a local graph_bs = 1 and reuse it.

350-351: Optional: assert non-None replay output.
Add assert logits is not None to surface weak-ref invalidation early.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
412-413: Use instance attribute for clarity.
Prefer self.spec_config.max_draft_len over the shadowed spec_config.

- self.max_draft_len = spec_config.max_draft_len
+ self.max_draft_len = self.spec_config.max_draft_len
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (10 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
tests/unittest/_torch/helpers.py (2 hunks)
tests/unittest/_torch/modeling/test_modeling_exaone4.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_llama.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_mistral.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_mixtral.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_mllama.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_nemotron.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_phi3.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_qwen.py (3 hunks)
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tests/unittest/_torch/modeling/test_modeling_mllama.py
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
tests/unittest/_torch/modeling/test_modeling_exaone4.py
tests/unittest/_torch/helpers.py
tests/unittest/_torch/modeling/test_modeling_llama.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/modeling/test_modeling_mistral.py
tests/unittest/_torch/modeling/test_modeling_mixtral.py
tests/unittest/_torch/modeling/test_modeling_phi3.py
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
tests/unittest/_torch/modeling/test_modeling_nemotron.py
tests/unittest/_torch/modeling/test_modeling_qwen.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tests/unittest/_torch/modeling/test_modeling_mllama.py
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
tests/unittest/_torch/modeling/test_modeling_exaone4.py
tests/unittest/_torch/helpers.py
tests/unittest/_torch/modeling/test_modeling_llama.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/modeling/test_modeling_mistral.py
tests/unittest/_torch/modeling/test_modeling_mixtral.py
tests/unittest/_torch/modeling/test_modeling_phi3.py
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
tests/unittest/_torch/modeling/test_modeling_nemotron.py
tests/unittest/_torch/modeling/test_modeling_qwen.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tests/unittest/_torch/modeling/test_modeling_mllama.py
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
tests/unittest/_torch/modeling/test_modeling_exaone4.py
tests/unittest/_torch/helpers.py
tests/unittest/_torch/modeling/test_modeling_llama.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/modeling/test_modeling_mistral.py
tests/unittest/_torch/modeling/test_modeling_mixtral.py
tests/unittest/_torch/modeling/test_modeling_phi3.py
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
tests/unittest/_torch/modeling/test_modeling_nemotron.py
tests/unittest/_torch/modeling/test_modeling_qwen.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/unittest/_torch/modeling/test_modeling_mllama.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/unittest/_torch/modeling/test_modeling_mllama.py
🧬 Code graph analysis (13)
tests/unittest/_torch/modeling/test_modeling_mllama.py (2)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (2)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)
tests/unittest/_torch/modeling/test_modeling_exaone4.py (3)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
forward
(79-87)forward
(2211-2313)
tests/unittest/_torch/helpers.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
CUDAGraphRunner
(22-316)tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
ResourceManagerType
(47-52)tensorrt_llm/mapping.py (1)
Mapping
(32-513)
tests/unittest/_torch/modeling/test_modeling_llama.py (2)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
CUDAGraphRunner
(22-316)needs_capture
(139-141)replay
(196-225)
tests/unittest/_torch/modeling/test_modeling_mistral.py (2)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)
tests/unittest/_torch/modeling/test_modeling_mixtral.py (2)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)
tests/unittest/_torch/modeling/test_modeling_phi3.py (2)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (2)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)
tests/unittest/_torch/modeling/test_modeling_nemotron.py (3)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
forward
(79-87)forward
(2211-2313)
tests/unittest/_torch/modeling/test_modeling_qwen.py (3)
tests/unittest/_torch/helpers.py (1)
create_mock_cuda_graph_runner
(171-186)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
capture
(143-194)replay
(196-225)tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
forward
(79-87)forward
(2211-2313)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8)
tensorrt_llm/mapping.py (1)
Mapping
(32-513)tensorrt_llm/_torch/attention_backend/interface.py (2)
AttentionMetadata
(39-347)create_cuda_graph_metadata
(288-328)tensorrt_llm/_torch/distributed/communicator.py (3)
MPIDist
(98-145)tp_size
(46-47)tp_allgather
(138-139)tensorrt_llm/_torch/expert_statistic.py (2)
ExpertStatistic
(10-98)set_iter
(32-36)tensorrt_llm/_torch/modules/multi_stream_utils.py (1)
with_multi_stream
(26-32)tensorrt_llm/_torch/speculative/interface.py (2)
SpecMetadata
(122-217)create_cuda_graph_metadata
(181-192)tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
ResourceManagerType
(47-52)tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
ScheduledRequests
(18-39)can_run_cuda_graph
(31-32)
🪛 Ruff (0.12.2)
tests/unittest/_torch/modeling/test_modeling_exaone4.py
25-25: Module level import not at top of file
(E402)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
43-43: Undefined name DecodingBaseConfig
(F821)
72-72: Undefined name Request
(F821)
🔇 Additional comments (21)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)
7-7: Good switch to factory helper. Decoupling tests from CUDAGraphRunner construction via create_mock_cuda_graph_runner improves isolation and avoids circular deps.

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)
7-7: Factory-based runner import looks good. Keeps tests engine-agnostic and matches the new API surface.

tests/unittest/_torch/modeling/test_modeling_mllama.py (1)
6-6: Nice: helper import unifies CUDA-graph setup across tests.

tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)
6-6: Good move to create_mock_cuda_graph_runner.

tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)
6-6: Helper import LGTM.

tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)
355-357: CUDA graph capture/replay usage matches new API. Capture with the added boolean flag and subsequent replay look correct.
Also applies to: 364-364

tests/unittest/_torch/modeling/test_modeling_qwen.py (1)
20-20: Import of helper factory looks correct. Aligned with new factory-based setup.

tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)
7-7: LGTM: switched to factory helper. Importing create_mock_cuda_graph_runner decouples tests from the runner class.
335-337: LGTM: updated capture signature correctly. is_spec_decode=False here is appropriate for pure decoding.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)
10-10: LGTM: switched to factory helper. Keeps tests aligned with the new API surface.
416-416: LGTM: capture API use matches new signature.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)
7-7: LGTM: switched to factory helper.
343-345: LGTM: capture API updated correctly.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
460-474: LGTM: runner decoupled and fully parameterized. Constructor wiring looks correct and removes the prior dependency.
2258-2264: Confirmed safe: draft_tokens_cuda=None is tolerated. maybe_get_cuda_graph only references draft_tokens_cuda when spec_metadata is non-null, so passing None for non-speculative paths is ignored and causes no errors.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (6)
32-49: Well-designed dependency injection refactor. Great work converting from engine-centric initialization to explicit dependency injection! This change successfully breaks the circular dependency by accepting individual parameters rather than the full engine object. The use of keyword-only arguments ensures clarity at call sites.
77-85: Enhanced API with speculative decoding context. The updated maybe_get_cuda_graph signature properly extends support for speculative decoding by accepting is_spec_decode, spec_metadata, and draft_tokens_cuda parameters. This provides the necessary context for graph eligibility decisions.
101-111: Distributed batch size synchronization logic looks correct. The multi-GPU synchronization properly gathers batch information across TP ranks and validates consistency before allowing CUDA graph execution. This ensures all ranks are in sync.
127-135: Proper metadata creation for CUDA graphs. The code correctly creates graph-specific metadata for both attention and speculative decoding, properly handling the draft tokens buffer assignment.
168-171: Conditional mrope position deltas handling. Good use of the use_mrope flag to conditionally include position deltas in the static tensors.
266-271: Comprehensive dummy request configuration. The padding dummy request creation properly includes all necessary parameters for speculative decoding support including max_num_draft_tokens, use_mrope, and max_beam_width.
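As a side note on the 101-111 comment above, the all-rank agreement it describes can be pictured with a small sketch. dist.tp_allgather is the MPIDist method listed in this review's code-graph section; the payload layout and function name here are assumptions, not the actual implementation.

# Rough sketch of the all-ranks agreement check; illustrative only.
def all_ranks_agree(dist, batch_size: int, can_run_graph: bool) -> bool:
    # Each TP rank contributes its (batch_size, eligibility) pair.
    gathered = dist.tp_allgather([batch_size, can_run_graph])
    # Only use the CUDA graph when every rank reports the same batch size
    # and every rank is eligible.
    return all(bs == batch_size and ok for bs, ok in gathered)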
max_beam_width: int,
max_draft_len: int,
use_mrope: bool,
spec_config: Optional["DecodingBaseConfig"],
Fix undefined type hint DecodingBaseConfig
The type hint for spec_config references DecodingBaseConfig, which is not imported. This will cause a runtime error if the type hint is evaluated.
Add the missing import at the top of the file:
from ..modules.multi_stream_utils import with_multi_stream
from ..speculative import SpecMetadata
+from ..speculative.base import DecodingBaseConfig
from ..utils import make_weak_ref, piecewise_cuda_graph
📝 Committable suggestion
spec_config: Optional["DecodingBaseConfig"],

# at the top of tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
from ..modules.multi_stream_utils import with_multi_stream
from ..speculative import SpecMetadata
from ..speculative.base import DecodingBaseConfig
from ..utils import make_weak_ref, piecewise_cuda_graph
🧰 Tools
🪛 Ruff (0.12.2)
43-43: Undefined name DecodingBaseConfig
(F821)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around line 43, the type
hint DecodingBaseConfig is referenced but not imported; add an import for
DecodingBaseConfig at the top of the file from the module where
DecodingBaseConfig is defined (for example the decoding module under
tensorrt_llm._torch.pyexecutor), or alternatively guard the annotation with
typing.TYPE_CHECKING or keep it as a string literal to avoid runtime evaluation.
self.graph_metadata: Dict[Tuple[int, int], Dict[str, Any]] = {}
self.memory_pool = engine._cuda_graph_mem_pool
self.memory_pool = cuda_graph_mem_pool
self.padding_dummy_request: Optional["Request"] = None
Fix undefined type hint Request
The type hint for padding_dummy_request references Request, which is not imported. This will cause a runtime error if the type hint is evaluated.
Add the missing import at the top of the file:
from ..utils import make_weak_ref, piecewise_cuda_graph
from .resource_manager import ResourceManager, ResourceManagerType
+from .request import Request
from .scheduler import ScheduledRequests
📝 Committable suggestion
self.padding_dummy_request: Optional["Request"] = None

from ..utils import make_weak_ref, piecewise_cuda_graph
from .resource_manager import ResourceManager, ResourceManagerType
from .request import Request
from .scheduler import ScheduledRequests
🧰 Tools
🪛 Ruff (0.12.2)
72-72: Undefined name Request
(F821)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around line 72, the type
hint Optional["Request"] references Request but it's not imported; add the
missing import at the top of the file to resolve the undefined name (e.g., add a
relative import such as "from .request import Request" or the correct absolute
path where Request is defined) so the type is available when type hints are
evaluated.
if hasattr(self,
           'iter_counter') and ExpertStatistic.set_iter(iter_counter):
💡 Verification agent
🧩 Analysis chain
Verify the attribute check for iter_counter
The code checks for the iter_counter attribute using hasattr(self, 'iter_counter'), but this attribute is never set in the __init__ method. This appears to be dead code unless the attribute is dynamically added elsewhere.
🏁 Script executed:
#!/bin/bash
# Search for where iter_counter might be set on CUDAGraphRunner instances
rg -n "\.iter_counter\s*=" --type=py
Length of output: 205
🏁 Script executed:
#!/bin/bash
# Locate CUDAGraphRunner definition and inspect its __init__
rg -nP "class\s+CUDAGraphRunner" -n -t py tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
rg -nP "def\s+__init__" -n -C5 -t py tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
Length of output: 563
Initialize or remove the iter_counter check
The hasattr(self, 'iter_counter') guard in tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py:95–96 is always false: self.iter_counter is never defined on CUDAGraphRunner. Either add self.iter_counter = 0 in __init__ or remove these lines.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 95-96 the
hasattr(self, 'iter_counter') check is dead because CUDAGraphRunner never
defines self.iter_counter; either initialize self.iter_counter = 0 in the class
__init__ (or where instances are constructed) so the guard can be true and
ExpertStatistic.set_iter receives a valid value, or remove the hasattr check and
the conditional entirely if iteration counting is not needed—implement whichever
fix matches intended behavior and keep code and tests consistent.
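For illustration only, option (a) from the comment above (owning the counter on the runner) could look like the standalone sketch below; the class and method names are hypothetical and stand in for the real CUDAGraphRunner.

# Standalone sketch: initialize the counter in __init__ so no hasattr() guard
# is needed later. Names here are illustrative, not the actual class.
class RunnerSketch:
    def __init__(self) -> None:
        # Always present, so callers can rely on it existing.
        self.iter_counter = 0

    def maybe_get_cuda_graph(self, iter_counter: int) -> None:
        # Record the engine-provided iteration before any statistics hook runs.
        self.iter_counter = iter_counter


runner = RunnerSketch()
runner.maybe_get_cuda_graph(3)
assert runner.iter_counter == 3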
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
key = (batch_size, draft_len)
Consider null-safety for spec_config access
The code accesses self.spec_config.max_draft_len without checking if spec_config is None, despite it being Optional in the constructor.
Add a null check or provide a default:
- draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
+ draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
📝 Committable suggestion
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
key = (batch_size, draft_len)
draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
key = (batch_size, draft_len)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 116-117, the
code accesses self.spec_config.max_draft_len but spec_config is Optional; guard
against None by computing draft_len = self.spec_config.max_draft_len if
(is_spec_decode and self.spec_config is not None) else 0 (or use
getattr(self.spec_config, "max_draft_len", 0) when is_spec_decode is True), then
build key = (batch_size, draft_len); ensure the null-check covers only when
is_spec_decode is true so behavior remains unchanged otherwise.
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
    draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
    return (batch_size, draft_len) not in self.graph_outputs
Apply the same null-safety fix in needs_capture
- draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
+ draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
📝 Committable suggestion
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
    draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
    return (batch_size, draft_len) not in self.graph_outputs

def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
    draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
    return (batch_size, draft_len) not in self.graph_outputs
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 139 to 141,
needs_capture currently assumes self.spec_config and self.graph_outputs are
non-null; make it null-safe by computing draft_len = (getattr(self.spec_config,
"max_draft_len", 0) if is_spec_decode else 0) and checking membership against a
safe container (e.g., graph_outputs = self.graph_outputs or set()), then return
(batch_size, draft_len) not in graph_outputs so the method won't throw if
spec_config or graph_outputs is None.
# here we don't need to use context since cuda graph capture didn't run kernel.
# maybe we need a cleaner way to do this.
outputs = self.cuda_graph_runner.replay(batch_size, inputs)
outputs = self.cuda_graph_runner.replay(
    batch_size, self.enable_spec_decode, inputs)
else:
Wrap replay in MoE iterator context on capture path as well.
The else-branch replays under MoeLoadBalancerIterContext, but the capture-branch replay does not. Replay runs kernels and should be consistently wrapped to keep MoE iteration state correct.
- # here we don't need to use context since cuda graph capture didn't run kernel.
- # maybe we need a cleaner way to do this.
- outputs = self.cuda_graph_runner.replay(
- batch_size, self.enable_spec_decode, inputs)
+ # Replay runs kernels; keep MoE iteration state consistent.
+ with MoeLoadBalancerIterContext(moe_load_balancer):
+ outputs = self.cuda_graph_runner.replay(
+ batch_size, self.enable_spec_decode, inputs)
📝 Committable suggestion
# here we don't need to use context since cuda graph capture didn't run kernel.
# maybe we need a cleaner way to do this.
outputs = self.cuda_graph_runner.replay(batch_size, inputs)
outputs = self.cuda_graph_runner.replay(
    batch_size, self.enable_spec_decode, inputs)
else:
# Replay runs kernels; keep MoE iteration state consistent.
with MoeLoadBalancerIterContext(moe_load_balancer):
    outputs = self.cuda_graph_runner.replay(
        batch_size, self.enable_spec_decode, inputs)
else:
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 2302 to 2306, the
replay call on the capture path is not wrapped in MoeLoadBalancerIterContext
whereas the else-branch wraps replay inside that context; wrap the capture-path
replay call in the same MoeLoadBalancerIterContext used in the else branch
(enter context before calling self.cuda_graph_runner.replay and exit after), so
both paths run replay under the MoE iterator context and preserve consistent MoE
iteration state.
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping
🛠️ Refactor suggestion
Add required NVIDIA Apache-2.0 header
Per repo guidelines, prepend the 2025 NVIDIA Apache-2.0 header to all .py files.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping
🤖 Prompt for AI Agents
In tests/unittest/_torch/helpers.py around lines 6 to 9, the file is missing the
required 2025 NVIDIA Apache-2.0 header; prepend the official NVIDIA Apache-2.0
header block (with year 2025 and appropriate license text per repo guidelines)
at the very top of the file before any imports, preserving encoding and line
endings, and ensure the header matches other repository files exactly.
SKIP_EXAONE4_HF_ACCURACY_TEST = True

from _torch.helpers import create_mock_engine
from _torch.helpers import create_mock_cuda_graph_runner |
🛠️ Refactor suggestion
Add required NVIDIA Apache-2.0 header
This file is missing the mandated header.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion
from _torch.helpers import create_mock_cuda_graph_runner

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from _torch.helpers import create_mock_cuda_graph_runner |
🧰 Tools
🪛 Ruff (0.12.2)
25-25: Module level import not at top of file
(E402)
🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_exaone4.py around line 25, the
file is missing the required NVIDIA Apache-2.0 header; add the mandated
multi-line Apache-2.0 license header (including copyright notice and SPDX
identifier) at the top of the file following the project's header format so the
file begins with the standard NVIDIA Apache-2.0 license block.
Qwen2ForCausalLM, Qwen2ForProcessRewardModel)
# yapf: enable
from _torch.helpers import create_mock_engine
from _torch.helpers import create_mock_cuda_graph_runner |
🛠️ Refactor suggestion
Add required NVIDIA Apache-2.0 header
Apply the standard header at the top.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion
from _torch.helpers import create_mock_cuda_graph_runner

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from _torch.helpers import create_mock_cuda_graph_runner
# …rest of the file…
🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_qwen.py around line 20, the file
is missing the required NVIDIA Apache-2.0 license header; add the standard
NVIDIA Apache-2.0 header block at the very top of the file (before any imports)
matching the project's header template, ensuring correct copyright year/owner
and SPDX identifier.
graph_runner = create_mock_cuda_graph_runner(
    1) if scenario.use_cuda_graph else None
Fix dtype mismatch for position_ids with CUDA graphs
CUDAGraphRunner allocates position_ids as int32. Here gen_position_ids defaults to int64, causing copy_ dtype mismatch during replay. Force int32.
- gen_position_ids = [
- torch.arange(input_ids.size(-1),
- input_ids.size(-1) + gen_input_ids.size(-1))
- ]
+ gen_position_ids = [
+ torch.arange(
+ input_ids.size(-1),
+ input_ids.size(-1) + gen_input_ids.size(-1),
+ dtype=torch.int32,
+ )
+ ]
Also applies to: 282-283, 290-290
🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_qwen.py around lines 267-268
(also apply same change at 282-283 and 290), when scenario.use_cuda_graph is
true the CUDAGraphRunner allocates position_ids as int32 but gen_position_ids
currently defaults to int64 causing a dtype mismatch on copy_ during replay;
update the test to force gen_position_ids to torch.int32 (or call
.to(torch.int32)) whenever graph_runner is created/used so the generated
position_ids match the CUDA graph's int32 dtype.
PR_Github #17812 [ run ] completed with state |
Signed-off-by: junq <[email protected]>
c3143c4 to b727284
/bot run |
PR_Github #19305 [ run ] triggered by Bot |
PR_Github #19305 [ run ] completed with state |
/bot run |
PR_Github #19310 [ run ] triggered by Bot |
PR_Github #19310 [ run ] completed with state |
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
/bot run |
PR_Github #19948 [ run ] triggered by Bot |
PR_Github #19948 [ run ] completed with state |
Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
def __init__(
    self,
    *,
    use_cuda_graph: bool,
What does a CUDAGraphRunner do if CUDA graph is disabled?
use_cuda_graph: bool,
cuda_graph_padding_enabled: bool,
supported_batch_sizes: list[int],
max_supported_batch_size: int,
What is the difference between max_supported_batch_size and max_batch_size?
Signed-off-by: junq <[email protected]>
attn_metadata: AttentionMetadata,
spec_metadata: Optional[SpecMetadata],
draft_tokens_cuda: torch.Tensor,
) -> Tuple[bool, Optional[AttentionMetadata], Optional[SpecMetadata]]:
It seems the return type here should be Optional[Tuple[AttentionMetadata, SpecMetadata]].
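For reference, the alternative signature being suggested would read roughly as follows. This is only a sketch, with string annotations standing in for the real imports; it is not the actual method in cuda_graph_runner.py.

# Sketch of the suggested return type; illustrative only.
from typing import Optional, Tuple


def maybe_get_cuda_graph(
    self,
    batch: "ScheduledRequests",
    iter_counter: int,
    is_spec_decode: bool,
    attn_metadata: "AttentionMetadata",
    spec_metadata: Optional["SpecMetadata"],
    draft_tokens_cuda: "torch.Tensor",
) -> Optional[Tuple["AttentionMetadata", Optional["SpecMetadata"]]]:
    """Return the graph metadata when a CUDA graph can be used, else None."""
    ...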
/bot run |
padding_size = padded_batch_size - batch_size
if padding_size + batch.batch_size > engine.batch_size:
if padding_size + batch.batch_size > self.max_batch_size:
Maybe we should rename the batch_size on engine to max_batch_size?
PR_Github #19955 [ run ] triggered by Bot |
PR_Github #19955 [ run ] completed with state |
Signed-off-by: junq <[email protected]>
/bot run |
PR_Github #20021 [ run ] triggered by Bot |
PR_Github #20021 [ run ] completed with state |
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.