
Conversation

QiJune
Collaborator

@QiJune QiJune commented Sep 5, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced CUDA Graph execution with support for speculative decoding and draft tokens.
    • More configurable graph behavior (batch sizing, padding, memory pool) with improved multi-GPU awareness.
  • Refactor
    • Decoupled the CUDA graph runner from the model engine with explicit, injectable dependencies.
    • Streamlined graph capture/replay logic and caching with draft-length awareness.
  • Tests
    • Updated unit tests to use a mock graph-runner helper and the new capture/replay API (extra boolean flag).

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
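
For example, a pre-merge run restricted to a single stage with fail-fast disabled can be requested by combining the flags documented above:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast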

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@QiJune
Collaborator Author

QiJune commented Sep 5, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17812 [ run ] triggered by Bot

Contributor

coderabbitai bot commented Sep 5, 2025

📝 Walkthrough


Refactors CUDAGraphRunner to a dependency-injected, engine-agnostic component with expanded APIs carrying speculative-decoding context and metadata. Updates ModelEngine to wire new parameters and draft-token CUDA buffers. Migrates unit tests to a new create_mock_cuda_graph_runner helper and adapts capture/replay call signatures with an added boolean flag.

Changes

Cohort / File(s) Summary
CUDA Graph Runner Refactor
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
Replaces engine-centric constructor with explicit keyword args; expands methods to pass/return speculative-decoding context and metadata; introduces per-(batch_size,draft_len) graph keys; revises capture/replay, padding, and memory pool handling; integrates Mapping/MPIDist/ResourceManager dependencies.
Model Engine Integration
tensorrt_llm/_torch/pyexecutor/model_engine.py
Wires new CUDAGraphRunner constructor parameters; adds/initializes draft token CUDA buffers; propagates is_spec_decode, attn/spec metadata and draft tokens through maybe_get_cuda_graph, needs_capture, capture, replay.
Test Helper Update
tests/unittest/_torch/helpers.py
Removes mock engine; adds create_mock_cuda_graph_runner(batch_size) returning configured CUDAGraphRunner using Mapping/kv_cache_manager_key; updates defaults (no padding, beam_width=1, max_draft_len=0).
Modeling Tests Migration
tests/unittest/_torch/modeling/test_modeling_exaone4.py, .../test_modeling_llama.py, .../test_modeling_llama_min_latency.py, .../test_modeling_mistral.py, .../test_modeling_mixtral.py, .../test_modeling_mllama.py, .../test_modeling_nemotron.py, .../test_modeling_phi3.py, .../test_modeling_qwen.py, .../test_modeling_qwen_moe.py
Replace CUDAGraphRunner+mock-engine with create_mock_cuda_graph_runner; gate usage by scenario.use_cuda_graph; update capture/replay signatures to include extra boolean argument; minor import adjustments.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant ME as ModelEngine
  participant GR as CUDAGraphRunner
  participant FW as forward_fn
  participant Attn as AttentionMetadata
  participant Spec as SpecMetadata

  Note over ME,GR: New flow passes is_spec_decode, metadata, draft tokens
  ME->>GR: maybe_get_cuda_graph(batch, iter_counter, is_spec_decode, Attn, Spec?, draft_tokens_cuda)
  GR-->>ME: (can_use_graph, Attn', Spec')

  alt needs capture
    ME->>GR: needs_capture(batch_size, is_spec_decode)
    GR-->>ME: bool
    alt true
      ME->>GR: capture(batch_size, is_spec_decode, FW, initial_inputs)
      GR->>FW: forward(**initial_inputs) during capture
      FW-->>GR: outputs (captured)
      GR-->>ME: capture complete
    end
  end

  alt can_use_graph
    ME->>GR: replay(batch_size, is_spec_decode, current_inputs)
    GR->>GR: update static tensors, set draft_len from Spec'
    GR-->>ME: logits (replayed)
  else not eligible
    ME->>FW: forward(**current_inputs)
    FW-->>ME: logits
  end
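
A rough Python-level sketch of the same flow (a sketch only: the argument order follows the diagram above, and forward_fn, initial_inputs, and current_inputs are placeholder names rather than the exact variables in model_engine.py):

can_use_graph, attn_metadata, spec_metadata = graph_runner.maybe_get_cuda_graph(
    batch, iter_counter, is_spec_decode, attn_metadata, spec_metadata,
    draft_tokens_cuda)

if can_use_graph:
    if graph_runner.needs_capture(batch_size, is_spec_decode):
        # One-time capture per (batch_size, draft_len) key.
        graph_runner.capture(batch_size, is_spec_decode, forward_fn, initial_inputs)
    # Replay copies the current inputs into the static tensors and launches the graph.
    outputs = graph_runner.replay(batch_size, is_spec_decode, current_inputs)
else:
    # Not graph-eligible: fall back to an eager forward pass.
    outputs = forward_fn(**current_inputs)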

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • byshiue
  • hypdeb
  • amukkara

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

416-423: Fix int64→int32 dtype mismatch in CUDA-graph path (will crash at replay).

cuda_graph_runner allocates static input_ids/position_ids as int32; current inputs use default int (arange → int64). torch.Tensor.copy_ requires matching dtypes and will error at replay. Cast both to int32 before capture/replay.

Apply this diff:

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 429-429

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

322-329: Align dtypes with CUDA-graph static buffers (int32).

position_ids from arange default to int64; runner uses int32 buffers, causing copy_ dtype mismatch at replay. Cast inputs before capture/replay.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 335-335

tests/unittest/_torch/modeling/test_modeling_mllama.py (1)

429-436: Prevent dtype mismatch with CUDA-graph int32 inputs.

Ensure input_ids/position_ids are int32 before capture/replay to match runner’s static tensors.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 442-442

tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)

327-334: Cast position_ids to int32 for CUDA-graph replay.

Runner’s static tensors are int32; arange yields int64. Cast before capture to avoid copy_ errors.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 340-340

tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)

322-329: Match int32 expectations in CUDA-graph path.

Cast inputs to int32 to align with cuda_graph_runner’s buffers.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 335-335

tests/unittest/_torch/modeling/test_modeling_qwen.py (1)

85-93: Python 3.8 compatibility: avoid built-in generics

dict[str, Any] requires Python 3.9+. Tests target 3.8+. Use typing.Dict.

-from typing import Any
+from typing import Any, Dict
@@
-def reduce_qwen_config(mem_for_full_model: int, config_dict: dict[str, Any]):
+def reduce_qwen_config(mem_for_full_model: int, config_dict: Dict[str, Any]):
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

400-406: Fix dtype of index tensors (must be torch.long).

gather_ids_cuda and previous_pos_indices_cuda are later used for tensor indexing (e.g., logits[gather_ids]), which requires LongTensor indices in PyTorch. Using torch.int (int32) risks runtime errors.

Apply:

-            self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
-                                               dtype=torch.int,
-                                               device='cuda')
-            self.previous_pos_indices_cuda = torch.empty(
-                (self.max_num_tokens, ), dtype=torch.int, device='cuda')
+            self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
+                                               dtype=torch.long,
+                                               device='cuda')
+            self.previous_pos_indices_cuda = torch.empty(
+                (self.max_num_tokens, ), dtype=torch.long, device='cuda')

Note: previous_batch_indices_cuda (Line 444) is also used as an index and should be torch.long for consistency. See additional snippet below.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

225-225: Inconsistent return value from replay method

The replay method returns output_ref which is a callable (weak reference), but the return type annotation suggests it should return Optional[torch.Tensor]. The method should either call the reference or update the return type.

Call the weak reference to get the actual tensor:

-        return output_ref
+        return output_ref() if output_ref else None
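
For context, a Python weak reference is itself a callable that must be invoked to obtain the referent, which is why the dereference above is needed. A minimal standard-library illustration (not the TRT-LLM make_weak_ref helper):

import weakref

class Output:
    pass

out = Output()
out_ref = weakref.ref(out)   # the reference object is callable
assert out_ref() is out      # calling it dereferences; it returns None once collected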
🧹 Nitpick comments (11)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

406-408: Release CUDA graph resources after test.

Call graph_runner.clear() after use to free the graph’s memory pool.

Example (near test end):

if graph_runner is not None:
    graph_runner.clear()
tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

312-314: Exercise the CUDA-graph path in this test.

scenario.use_cuda_graph is never set to True here, so the CUDA-graph code isn’t exercised. Consider parameterizing to run both paths.
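
A sketch of one way to cover both paths, assuming a pytest-style test; the Scenario stand-in below is hypothetical and the real scenario/test structure may differ:

from dataclasses import dataclass

import pytest

@dataclass
class Scenario:  # stand-in for the test's real scenario type (hypothetical)
    use_cuda_graph: bool

@pytest.mark.parametrize("use_cuda_graph", [False, True])
def test_covers_both_paths(use_cuda_graph):
    scenario = Scenario(use_cuda_graph=use_cuda_graph)
    # ... run the existing HF-vs-TRT-LLM comparison with this scenario ...
    assert scenario.use_cuda_graph is use_cuda_graph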

tests/unittest/_torch/helpers.py (1)

171-186: Helper factory looks good; add return type and optional knobs

Annotate the return type and expose use_mrope/enable_attention_dp to avoid future test drift.

-def create_mock_cuda_graph_runner(batch_size: int):
+def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
     return CUDAGraphRunner(
         use_cuda_graph=True,
         cuda_graph_padding_enabled=False,
         supported_batch_sizes=[batch_size],
         max_supported_batch_size=batch_size,
         max_batch_size=batch_size,
         max_beam_width=1,
         max_draft_len=0,
         use_mrope=False,
         spec_config=None,
         cuda_graph_mem_pool=None,
         enable_attention_dp=False,
         mapping=Mapping(),
         dist=None,
         kv_cache_manager_key=ResourceManagerType.KV_CACHE_MANAGER)

Optionally:

-def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
+def create_mock_cuda_graph_runner(
+    batch_size: int,
+    *,
+    use_mrope: bool = False,
+    enable_attention_dp: bool = False,
+) -> CUDAGraphRunner:
@@
-        use_mrope=False,
+        use_mrope=use_mrope,
@@
-        enable_attention_dp=False,
-        mapping=Mapping(),
+        enable_attention_dp=enable_attention_dp,
+        mapping=Mapping(enable_attention_dp=enable_attention_dp),
tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)

25-25: Fix E402: keep imports at top-of-file

Move this import to the top with the other imports to satisfy Ruff E402.

+from _torch.helpers import create_mock_cuda_graph_runner
@@
-from _torch.helpers import create_mock_cuda_graph_runner
tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)

320-321: Minor: avoid magic number for batch size in tests.

Consider a local graph_bs = 1 to avoid repeating the literal and ease future edits.


343-344: Optional: assert non-None replay output.

replay returns an Optional (weak-ref). Add a quick assert logits is not None for clearer failures.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)

402-402: Minor: avoid magic number for batch size.

Define a local graph_bs = 1 and reuse it in capture/replay calls.


422-423: Optional: guard against None from replay.

Add assert logits is not None before comparisons to make weak-ref issues obvious.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)

328-329: Minor: avoid magic number for batch size.

Use a local graph_bs = 1 and reuse it.


350-351: Optional: assert non-None replay output.

Add assert logits is not None to surface weak-ref invalidation early.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

412-413: Use instance attribute for clarity.

Prefer self.spec_config.max_draft_len over the shadowed spec_config.

-            self.max_draft_len = spec_config.max_draft_len
+            self.max_draft_len = self.spec_config.max_draft_len
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 25389c9 and c3143c4.

📒 Files selected for processing (13)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (10 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
  • tests/unittest/_torch/helpers.py (2 hunks)
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_llama.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mistral.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mllama.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_phi3.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_qwen.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
🧬 Code graph analysis (13)
tests/unittest/_torch/modeling/test_modeling_mllama.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_exaone4.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tests/unittest/_torch/helpers.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
  • CUDAGraphRunner (22-316)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • ResourceManagerType (47-52)
tensorrt_llm/mapping.py (1)
  • Mapping (32-513)
tests/unittest/_torch/modeling/test_modeling_llama.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
  • CUDAGraphRunner (22-316)
  • needs_capture (139-141)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_mistral.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_mixtral.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_phi3.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_nemotron.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tests/unittest/_torch/modeling/test_modeling_qwen.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8)
tensorrt_llm/mapping.py (1)
  • Mapping (32-513)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • AttentionMetadata (39-347)
  • create_cuda_graph_metadata (288-328)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • MPIDist (98-145)
  • tp_size (46-47)
  • tp_allgather (138-139)
tensorrt_llm/_torch/expert_statistic.py (2)
  • ExpertStatistic (10-98)
  • set_iter (32-36)
tensorrt_llm/_torch/modules/multi_stream_utils.py (1)
  • with_multi_stream (26-32)
tensorrt_llm/_torch/speculative/interface.py (2)
  • SpecMetadata (122-217)
  • create_cuda_graph_metadata (181-192)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • ResourceManagerType (47-52)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • can_run_cuda_graph (31-32)
🪛 Ruff (0.12.2)
tests/unittest/_torch/modeling/test_modeling_exaone4.py

25-25: Module level import not at top of file

(E402)

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py

43-43: Undefined name DecodingBaseConfig

(F821)


72-72: Undefined name Request

(F821)

🔇 Additional comments (21)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

7-7: Good switch to factory helper.

Decoupling tests from CUDAGraphRunner construction via create_mock_cuda_graph_runner improves isolation and avoids circular deps.

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

7-7: Factory-based runner import looks good.

Keeps tests engine-agnostic and matches the new API surface.

tests/unittest/_torch/modeling/test_modeling_mllama.py (1)

6-6: Nice: helper import unifies CUDA-graph setup across tests.

tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)

6-6: Good move to create_mock_cuda_graph_runner.

tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)

6-6: Helper import LGTM.

tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)

355-357: CUDA graph capture/replay usage matches new API

Capture with the added boolean flag and subsequent replay look correct.

Also applies to: 364-364

tests/unittest/_torch/modeling/test_modeling_qwen.py (1)

20-20: Import of helper factory looks correct

Aligned with new factory-based setup.

tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)

7-7: LGTM: switched to factory helper.

Importing create_mock_cuda_graph_runner decouples tests from the runner class.


335-337: LGTM: updated capture signature correctly.

is_spec_decode=False here is appropriate for pure decoding.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)

10-10: LGTM: switched to factory helper.

Keeps tests aligned with the new API surface.


416-416: LGTM: capture API use matches new signature.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)

7-7: LGTM: switched to factory helper.


343-345: LGTM: capture API updated correctly.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

460-474: LGTM: runner decoupled and fully parameterized.

Constructor wiring looks correct and removes the prior dependency.


2258-2264: Confirmed safe: draft_tokens_cuda=None is tolerated
maybe_get_cuda_graph only references draft_tokens_cuda when spec_metadata is non-null, so passing None for non-speculative paths is ignored and causes no errors.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (6)

32-49: Well-designed dependency injection refactor

Great work converting from engine-centric initialization to explicit dependency injection! This change successfully breaks the circular dependency by accepting individual parameters rather than the full engine object. The use of keyword-only arguments ensures clarity at call sites.


77-85: Enhanced API with speculative decoding context

The updated maybe_get_cuda_graph signature properly extends support for speculative decoding by accepting is_spec_decode, spec_metadata, and draft_tokens_cuda parameters. This provides the necessary context for graph eligibility decisions.


101-111: Distributed batch size synchronization logic looks correct

The multi-GPU synchronization properly gathers batch information across TP ranks and validates consistency before allowing CUDA graph execution. This ensures all ranks are in sync.


127-135: Proper metadata creation for CUDA graphs

The code correctly creates graph-specific metadata for both attention and speculative decoding, properly handling the draft tokens buffer assignment.


168-171: Conditional mrope position deltas handling

Good use of the use_mrope flag to conditionally include position deltas in the static tensors.


266-271: Comprehensive dummy request configuration

The padding dummy request creation properly includes all necessary parameters for speculative decoding support including max_num_draft_tokens, use_mrope, and max_beam_width.

max_beam_width: int,
max_draft_len: int,
use_mrope: bool,
spec_config: Optional["DecodingBaseConfig"],
Contributor

⚠️ Potential issue

Fix undefined type hint DecodingBaseConfig

The type hint for spec_config references DecodingBaseConfig which is not imported. This will cause a runtime error if the type hint is evaluated.

Add the missing import at the top of the file:

 from ..modules.multi_stream_utils import with_multi_stream
 from ..speculative import SpecMetadata
+from ..speculative.base import DecodingBaseConfig
 from ..utils import make_weak_ref, piecewise_cuda_graph
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
spec_config: Optional["DecodingBaseConfig"],
# at the top of tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
from ..modules.multi_stream_utils import with_multi_stream
from ..speculative import SpecMetadata
from ..speculative.base import DecodingBaseConfig
from ..utils import make_weak_ref, piecewise_cuda_graph
🧰 Tools
🪛 Ruff (0.12.2)

43-43: Undefined name DecodingBaseConfig

(F821)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around line 43, the type
hint DecodingBaseConfig is referenced but not imported; add an import for
DecodingBaseConfig at the top of the file from the module where
DecodingBaseConfig is defined (for example the decoding module under
tensorrt_llm._torch.pyexecutor), or alternatively guard the annotation with
typing.TYPE_CHECKING or keep it as a string literal to avoid runtime evaluation.
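
A sketch of the TYPE_CHECKING alternative mentioned above, which resolves the forward reference for linters without adding a runtime import (the import path mirrors the guess above and should be confirmed against where DecodingBaseConfig actually lives):

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Imported for static analysis only; nothing is imported at runtime.
    from ..speculative.base import DecodingBaseConfig  # assumed path

class CUDAGraphRunner:
    def __init__(self, *, spec_config: Optional["DecodingBaseConfig"] = None) -> None:
        self.spec_config = spec_config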

self.graph_metadata: Dict[Tuple[int, int], Dict[str, Any]] = {}
-self.memory_pool = engine._cuda_graph_mem_pool
+self.memory_pool = cuda_graph_mem_pool
self.padding_dummy_request: Optional["Request"] = None
Contributor

⚠️ Potential issue

Fix undefined type hint Request

The type hint for padding_dummy_request references Request which is not imported. This will cause a runtime error if the type hint is evaluated.

Add the missing import at the top of the file:

 from ..utils import make_weak_ref, piecewise_cuda_graph
 from .resource_manager import ResourceManager, ResourceManagerType
+from .request import Request
 from .scheduler import ScheduledRequests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self.padding_dummy_request: Optional["Request"] = None
from ..utils import make_weak_ref, piecewise_cuda_graph
from .resource_manager import ResourceManager, ResourceManagerType
from .request import Request
from .scheduler import ScheduledRequests
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Undefined name Request

(F821)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around line 72, the type
hint Optional["Request"] references Request but it's not imported; add the
missing import at the top of the file to resolve the undefined name (e.g., add a
relative import such as "from .request import Request" or the correct absolute
path where Request is defined) so the type is available when type hints are
evaluated.

Comment on lines +95 to +98
if hasattr(self,
'iter_counter') and ExpertStatistic.set_iter(iter_counter):
Contributor

💡 Verification agent

🧩 Analysis chain

Verify the attribute check for iter_counter

The code checks for iter_counter attribute using hasattr(self, 'iter_counter'), but this attribute is never set in the __init__ method. This appears to be dead code unless the attribute is dynamically added elsewhere.


🏁 Script executed:

#!/bin/bash
# Search for where iter_counter might be set on CUDAGraphRunner instances
rg -n "\.iter_counter\s*=" --type=py

Length of output: 205


🏁 Script executed:

#!/bin/bash
# Locate CUDAGraphRunner definition and inspect its __init__
rg -nP "class\s+CUDAGraphRunner" -n -t py tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
rg -nP "def\s+__init__" -n -C5 -t py tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py

Length of output: 563


Initialize or remove the iter_counter check
The hasattr(self, 'iter_counter') guard in tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py:95–96 is always false—self.iter_counter is never defined on CUDAGraphRunner. Either add self.iter_counter = 0 in __init__ or remove these lines.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 95-96 the
hasattr(self, 'iter_counter') check is dead because CUDAGraphRunner never
defines self.iter_counter; either initialize self.iter_counter = 0 in the class
__init__ (or where instances are constructed) so the guard can be true and
ExpertStatistic.set_iter receives a valid value, or remove the hasattr check and
the conditional entirely if iteration counting is not needed—implement whichever
fix matches intended behavior and keep code and tests consistent.
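
A minimal sketch of the first option (initialize the counter so the guard is meaningful):

# In CUDAGraphRunner.__init__, so the attribute always exists:
self.iter_counter = 0

# The hasattr() check then always passes and the guard can be simplified to:
if ExpertStatistic.set_iter(iter_counter):
    ...  # existing body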

Comment on lines +116 to +119
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
key = (batch_size, draft_len)
Contributor

⚠️ Potential issue

Consider null-safety for spec_config access

The code accesses self.spec_config.max_draft_len without checking if spec_config is None, despite it being Optional in the constructor.

Add a null check or provide a default:

-        draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
+        draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
key = (batch_size, draft_len)
draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
key = (batch_size, draft_len)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 116-117, the
code accesses self.spec_config.max_draft_len but spec_config is Optional; guard
against None by computing draft_len = self.spec_config.max_draft_len if
(is_spec_decode and self.spec_config is not None) else 0 (or use
getattr(self.spec_config, "max_draft_len", 0) when is_spec_decode is True), then
build key = (batch_size, draft_len); ensure the null-check covers only when
is_spec_decode is true so behavior remains unchanged otherwise.

Comment on lines +139 to +143
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
return (batch_size, draft_len) not in self.graph_outputs
Contributor

⚠️ Potential issue

Apply the same null-safety fix in needs_capture

-        draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
+        draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
return (batch_size, draft_len) not in self.graph_outputs
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
return (batch_size, draft_len) not in self.graph_outputs
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 139 to 141,
needs_capture currently assumes self.spec_config and self.graph_outputs are
non-null; make it null-safe by computing draft_len = (getattr(self.spec_config,
"max_draft_len", 0) if is_spec_decode else 0) and checking membership against a
safe container (e.g., graph_outputs = self.graph_outputs or set()), then return
(batch_size, draft_len) not in graph_outputs so the method won't throw if
spec_config or graph_outputs is None.

Comment on lines 2302 to 2394
# here we don't need to use context since cuda graph capture didn't run kernel.
# maybe we need a cleaner way to do this.
-outputs = self.cuda_graph_runner.replay(batch_size, inputs)
+outputs = self.cuda_graph_runner.replay(
+    batch_size, self.enable_spec_decode, inputs)
else:
Contributor

⚠️ Potential issue

Wrap replay in MoE iterator context on capture path as well.

The else-branch replays under MoeLoadBalancerIterContext, but the capture-branch replay does not. Replay runs kernels and should be consistently wrapped to keep MoE iteration state correct.

-                    # here we don't need to use context since cuda graph capture didn't run kernel.
-                    # maybe we need a cleaner way to do this.
-                    outputs = self.cuda_graph_runner.replay(
-                        batch_size, self.enable_spec_decode, inputs)
+                    # Replay runs kernels; keep MoE iteration state consistent.
+                    with MoeLoadBalancerIterContext(moe_load_balancer):
+                        outputs = self.cuda_graph_runner.replay(
+                            batch_size, self.enable_spec_decode, inputs)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# here we don't need to use context since cuda graph capture didn't run kernel.
# maybe we need a cleaner way to do this.
outputs = self.cuda_graph_runner.replay(batch_size, inputs)
outputs = self.cuda_graph_runner.replay(
batch_size, self.enable_spec_decode, inputs)
else:
# Replay runs kernels; keep MoE iteration state consistent.
with MoeLoadBalancerIterContext(moe_load_balancer):
outputs = self.cuda_graph_runner.replay(
batch_size, self.enable_spec_decode, inputs)
else:
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 2302 to 2306, the
replay call on the capture path is not wrapped in MoeLoadBalancerIterContext
whereas the else-branch wraps replay inside that context; wrap the capture-path
replay call in the same MoeLoadBalancerIterContext used in the else branch
(enter context before calling self.cuda_graph_runner.replay and exit after), so
both paths run replay under the MoE iterator context and preserve consistent MoE
iteration state.

Comment on lines +6 to +9
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping

Contributor

🛠️ Refactor suggestion

Add required NVIDIA Apache-2.0 header

Per repo guidelines, prepend the 2025 NVIDIA Apache-2.0 header to all .py files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping
🤖 Prompt for AI Agents
In tests/unittest/_torch/helpers.py around lines 6 to 9, the file is missing the
required 2025 NVIDIA Apache-2.0 header; prepend the official NVIDIA Apache-2.0
header block (with year 2025 and appropriate license text per repo guidelines)
at the very top of the file before any imports, preserving encoding and line
endings, and ensure the header matches other repository files exactly.

SKIP_EXAONE4_HF_ACCURACY_TEST = True

-from _torch.helpers import create_mock_engine
+from _torch.helpers import create_mock_cuda_graph_runner
Contributor

🛠️ Refactor suggestion

Add required NVIDIA Apache-2.0 header

This file is missing the mandated header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from _torch.helpers import create_mock_cuda_graph_runner
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from _torch.helpers import create_mock_cuda_graph_runner
🧰 Tools
🪛 Ruff (0.12.2)

25-25: Module level import not at top of file

(E402)

🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_exaone4.py around line 25, the
file is missing the required NVIDIA Apache-2.0 header; add the mandated
multi-line Apache-2.0 license header (including copyright notice and SPDX
identifier) at the top of the file following the project's header format so the
file begins with the standard NVIDIA Apache-2.0 license block.

Qwen2ForCausalLM, Qwen2ForProcessRewardModel)
# yapf: enable
-from _torch.helpers import create_mock_engine
+from _torch.helpers import create_mock_cuda_graph_runner
Contributor

🛠️ Refactor suggestion

Add required NVIDIA Apache-2.0 header

Apply the standard header at the top.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from _torch.helpers import create_mock_cuda_graph_runner
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from _torch.helpers import create_mock_cuda_graph_runner
# …rest of the file…
🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_qwen.py around line 20, the file
is missing the required NVIDIA Apache-2.0 license header; add the standard
NVIDIA Apache-2.0 header block at the very top of the file (before any imports)
matching the project's header template, ensuring correct copyright year/owner
and SPDX identifier.

Comment on lines +267 to +268
graph_runner = create_mock_cuda_graph_runner(
1) if scenario.use_cuda_graph else None
Contributor

⚠️ Potential issue

Fix dtype mismatch for position_ids with CUDA graphs

CUDAGraphRunner allocates position_ids as int32. Here gen_position_ids defaults to int64, causing copy_ dtype mismatch during replay. Force int32.

-        gen_position_ids = [
-            torch.arange(input_ids.size(-1),
-                         input_ids.size(-1) + gen_input_ids.size(-1))
-        ]
+        gen_position_ids = [
+            torch.arange(
+                input_ids.size(-1),
+                input_ids.size(-1) + gen_input_ids.size(-1),
+                dtype=torch.int32,
+            )
+        ]

Also applies to: 282-283, 290-290

🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_qwen.py around lines 267-268
(also apply same change at 282-283 and 290), when scenario.use_cuda_graph is
true the CUDAGraphRunner allocates position_ids as int32 but gen_position_ids
currently defaults to int64 causing a dtype mismatch on copy_ during replay;
update the test to force gen_position_ids to torch.int32 (or call
.to(torch.int32)) whenever graph_runner is created/used so the generated
position_ids match the CUDA graph's int32 dtype.

@tensorrt-cicd
Collaborator

PR_Github #17812 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13336 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
@QiJune
Collaborator Author

QiJune commented Sep 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19305 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19305 [ run ] completed with state FAILURE

@QiJune
Collaborator Author

QiJune commented Sep 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19310 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19310 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14499 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
@QiJune
Collaborator Author

QiJune commented Sep 25, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19948 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19948 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15018 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
def __init__(
self,
*,
use_cuda_graph: bool,
Collaborator

What is the use of a CUDAGraphRunner if cuda graph is disabled?

use_cuda_graph: bool,
cuda_graph_padding_enabled: bool,
supported_batch_sizes: list[int],
max_supported_batch_size: int,
Collaborator

What is the difference between max_supported_batch_size and max_batch_size?

Signed-off-by: junq <[email protected]>
attn_metadata: AttentionMetadata,
spec_metadata: Optional[SpecMetadata],
draft_tokens_cuda: torch.Tensor,
) -> Tuple[bool, Optional[AttentionMetadata], Optional[SpecMetadata]]:
Collaborator

It seems the return type here should be Optional[Tuple[AttentionMetadata, SpecMetadata]].
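
That reading would make the tail of the signature look roughly like this (a sketch of the suggestion, not the merged code), so callers would test the result for None instead of unpacking a leading boolean:

) -> Optional[Tuple[AttentionMetadata, SpecMetadata]]: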

@QiJune
Collaborator Author

QiJune commented Sep 25, 2025

/bot run


padding_size = padded_batch_size - batch_size
-if padding_size + batch.batch_size > engine.batch_size:
+if padding_size + batch.batch_size > self.max_batch_size:
Collaborator

Maybe we should rename the batch_size on engine to max_batch_size?

@tensorrt-cicd
Collaborator

PR_Github #19955 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19955 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #15025 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
@QiJune
Collaborator Author

QiJune commented Sep 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20021 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20021 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15080 completed with status: 'FAILURE'
