
Conversation

QiJune
Collaborator

@QiJune QiJune commented Sep 5, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced CUDA Graph execution with support for speculative decoding and draft tokens.
    • More configurable graph behavior (batch sizing, padding, memory pool) with improved multi-GPU awareness.
  • Refactor
    • Decoupled the CUDA graph runner from the model engine with explicit, injectable dependencies.
    • Streamlined graph capture/replay logic and caching with draft-length awareness.
  • Tests
    • Updated unit tests to use a mock graph-runner helper and the new capture/replay API (extra boolean flag).

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
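
For example, a pre-merge run restricted to a single stage with fail-fast disabled can be requested by combining the flags documented above:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast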

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@QiJune
Collaborator Author

QiJune commented Sep 5, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17812 [ run ] triggered by Bot

Contributor

coderabbitai bot commented Sep 5, 2025

📝 Walkthrough


Refactors CUDAGraphRunner to a dependency-injected, engine-agnostic component with expanded APIs carrying speculative-decoding context and metadata. Updates ModelEngine to wire new parameters and draft-token CUDA buffers. Migrates unit tests to a new create_mock_cuda_graph_runner helper and adapts capture/replay call signatures with an added boolean flag.

Changes

Cohort / File(s) Summary
CUDA Graph Runner Refactor
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
Replaces engine-centric constructor with explicit keyword args; expands methods to pass/return speculative-decoding context and metadata; introduces per-(batch_size,draft_len) graph keys; revises capture/replay, padding, and memory pool handling; integrates Mapping/MPIDist/ResourceManager dependencies.
Model Engine Integration
tensorrt_llm/_torch/pyexecutor/model_engine.py
Wires new CUDAGraphRunner constructor parameters; adds/initializes draft token CUDA buffers; propagates is_spec_decode, attn/spec metadata and draft tokens through maybe_get_cuda_graph, needs_capture, capture, replay.
Test Helper Update
tests/unittest/_torch/helpers.py
Removes mock engine; adds create_mock_cuda_graph_runner(batch_size) returning configured CUDAGraphRunner using Mapping/kv_cache_manager_key; updates defaults (no padding, beam_width=1, max_draft_len=0).
Modeling Tests Migration
tests/unittest/_torch/modeling/test_modeling_exaone4.py, .../test_modeling_llama.py, .../test_modeling_llama_min_latency.py, .../test_modeling_mistral.py, .../test_modeling_mixtral.py, .../test_modeling_mllama.py, .../test_modeling_nemotron.py, .../test_modeling_phi3.py, .../test_modeling_qwen.py, .../test_modeling_qwen_moe.py
Replace CUDAGraphRunner+mock-engine with create_mock_cuda_graph_runner; gate usage by scenario.use_cuda_graph; update capture/replay signatures to include extra boolean argument; minor import adjustments.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant ME as ModelEngine
  participant GR as CUDAGraphRunner
  participant FW as forward_fn
  participant Attn as AttentionMetadata
  participant Spec as SpecMetadata

  Note over ME,GR: New flow passes is_spec_decode, metadata, draft tokens
  ME->>GR: maybe_get_cuda_graph(batch, iter_counter, is_spec_decode, Attn, Spec?, draft_tokens_cuda)
  GR-->>ME: (can_use_graph, Attn', Spec')

  alt needs capture
    ME->>GR: needs_capture(batch_size, is_spec_decode)
    GR-->>ME: bool
    alt true
      ME->>GR: capture(batch_size, is_spec_decode, FW, initial_inputs)
      GR->>FW: forward(**initial_inputs) during capture
      FW-->>GR: outputs (captured)
      GR-->>ME: capture complete
    end
  end

  alt can_use_graph
    ME->>GR: replay(batch_size, is_spec_decode, current_inputs)
    GR->>GR: update static tensors, set draft_len from Spec'
    GR-->>ME: logits (replayed)
  else not eligible
    ME->>FW: forward(**current_inputs)
    FW-->>ME: logits
  end
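
A rough Python-level sketch of the same flow (a sketch only: the argument order follows the diagram above, and forward_fn, initial_inputs, and current_inputs are placeholder names rather than the exact variables in model_engine.py):

can_use_graph, attn_metadata, spec_metadata = graph_runner.maybe_get_cuda_graph(
    batch, iter_counter, is_spec_decode, attn_metadata, spec_metadata,
    draft_tokens_cuda)

if can_use_graph:
    if graph_runner.needs_capture(batch_size, is_spec_decode):
        # One-time capture per (batch_size, draft_len) key.
        graph_runner.capture(batch_size, is_spec_decode, forward_fn, initial_inputs)
    # Replay copies the current inputs into the static tensors and launches the graph.
    outputs = graph_runner.replay(batch_size, is_spec_decode, current_inputs)
else:
    # Not graph-eligible: fall back to an eager forward pass.
    outputs = forward_fn(**current_inputs)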

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • byshiue
  • hypdeb
  • amukkara

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

416-423: Fix int64→int32 dtype mismatch in CUDA-graph path (will crash at replay).

cuda_graph_runner allocates static input_ids/position_ids as int32; current inputs use default int (arange → int64). torch.Tensor.copy_ requires matching dtypes and will error at replay. Cast both to int32 before capture/replay.

Apply this diff:

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 429-429

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

322-329: Align dtypes with CUDA-graph static buffers (int32).

position_ids from arange default to int64; runner uses int32 buffers, causing copy_ dtype mismatch at replay. Cast inputs before capture/replay.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 335-335

tests/unittest/_torch/modeling/test_modeling_mllama.py (1)

429-436: Prevent dtype mismatch with CUDA-graph int32 inputs.

Ensure input_ids/position_ids are int32 before capture/replay to match runner’s static tensors.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 442-442

tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)

327-334: Cast position_ids to int32 for CUDA-graph replay.

Runner’s static tensors are int32; arange yields int64. Cast before capture to avoid copy_ errors.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 340-340

tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)

322-329: Match int32 expectations in CUDA-graph path.

Cast inputs to int32 to align with cuda_graph_runner’s buffers.

-                inputs = {
-                    "input_ids": input_ids,
-                    "position_ids": position_ids,
-                    "attn_metadata": attn_metadata,
-                }
+                inputs = {
+                    "input_ids": input_ids.to(torch.int32),
+                    "position_ids": position_ids.to(torch.int32),
+                    "attn_metadata": attn_metadata,
+                }

Also applies to: 335-335

tests/unittest/_torch/modeling/test_modeling_qwen.py (1)

85-93: Python 3.8 compatibility: avoid built-in generics

dict[str, Any] requires Python 3.9+. Tests target 3.8+. Use typing.Dict.

-from typing import Any
+from typing import Any, Dict
@@
-def reduce_qwen_config(mem_for_full_model: int, config_dict: dict[str, Any]):
+def reduce_qwen_config(mem_for_full_model: int, config_dict: Dict[str, Any]):
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

400-406: Fix dtype of index tensors (must be torch.long).

gather_ids_cuda and previous_pos_indices_cuda are later used for tensor indexing (e.g., logits[gather_ids]), which requires LongTensor indices in PyTorch. Using torch.int (int32) risks runtime errors.

Apply:

-            self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
-                                               dtype=torch.int,
-                                               device='cuda')
-            self.previous_pos_indices_cuda = torch.empty(
-                (self.max_num_tokens, ), dtype=torch.int, device='cuda')
+            self.gather_ids_cuda = torch.empty((self.max_num_tokens, ),
+                                               dtype=torch.long,
+                                               device='cuda')
+            self.previous_pos_indices_cuda = torch.empty(
+                (self.max_num_tokens, ), dtype=torch.long, device='cuda')

Note: previous_batch_indices_cuda (Line 444) is also used as an index and should be torch.long for consistency. See additional snippet below.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

225-225: Inconsistent return value from replay method

The replay method returns output_ref which is a callable (weak reference), but the return type annotation suggests it should return Optional[torch.Tensor]. The method should either call the reference or update the return type.

Call the weak reference to get the actual tensor:

-        return output_ref
+        return output_ref() if output_ref else None
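
For context, a Python weak reference is itself a callable that must be invoked to obtain the referent, which is why the dereference above is needed. A minimal standard-library illustration (not the TRT-LLM make_weak_ref helper):

import weakref

class Output:
    pass

out = Output()
out_ref = weakref.ref(out)   # the reference object is callable
assert out_ref() is out      # calling it dereferences; it returns None once collected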
🧹 Nitpick comments (11)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

406-408: Release CUDA graph resources after test.

Call graph_runner.clear() after use to free the graph’s memory pool.

Example (near test end):

if graph_runner is not None:
    graph_runner.clear()
tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

312-314: Exercise the CUDA-graph path in this test.

scenario.use_cuda_graph is never set to True here, so the CUDA-graph code isn’t exercised. Consider parameterizing to run both paths.
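
A sketch of one way to cover both paths, assuming a pytest-style test; the Scenario stand-in below is hypothetical and the real scenario/test structure may differ:

from dataclasses import dataclass

import pytest

@dataclass
class Scenario:  # stand-in for the test's real scenario type (hypothetical)
    use_cuda_graph: bool

@pytest.mark.parametrize("use_cuda_graph", [False, True])
def test_covers_both_paths(use_cuda_graph):
    scenario = Scenario(use_cuda_graph=use_cuda_graph)
    # ... run the existing HF-vs-TRT-LLM comparison with this scenario ...
    assert scenario.use_cuda_graph is use_cuda_graph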

tests/unittest/_torch/helpers.py (1)

171-186: Helper factory looks good; add return type and optional knobs

Annotate the return type and expose use_mrope/enable_attention_dp to avoid future test drift.

-def create_mock_cuda_graph_runner(batch_size: int):
+def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
     return CUDAGraphRunner(
         use_cuda_graph=True,
         cuda_graph_padding_enabled=False,
         supported_batch_sizes=[batch_size],
         max_supported_batch_size=batch_size,
         max_batch_size=batch_size,
         max_beam_width=1,
         max_draft_len=0,
         use_mrope=False,
         spec_config=None,
         cuda_graph_mem_pool=None,
         enable_attention_dp=False,
         mapping=Mapping(),
         dist=None,
         kv_cache_manager_key=ResourceManagerType.KV_CACHE_MANAGER)

Optionally:

-def create_mock_cuda_graph_runner(batch_size: int) -> CUDAGraphRunner:
+def create_mock_cuda_graph_runner(
+    batch_size: int,
+    *,
+    use_mrope: bool = False,
+    enable_attention_dp: bool = False,
+) -> CUDAGraphRunner:
@@
-        use_mrope=False,
+        use_mrope=use_mrope,
@@
-        enable_attention_dp=False,
-        mapping=Mapping(),
+        enable_attention_dp=enable_attention_dp,
+        mapping=Mapping(enable_attention_dp=enable_attention_dp),
tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)

25-25: Fix E402: keep imports at top-of-file

Move this import to the top with the other imports to satisfy Ruff E402.

+from _torch.helpers import create_mock_cuda_graph_runner
@@
-from _torch.helpers import create_mock_cuda_graph_runner
tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)

320-321: Minor: avoid magic number for batch size in tests.

Consider a local graph_bs = 1 to avoid repeating the literal and ease future edits.


343-344: Optional: assert non-None replay output.

replay returns an Optional (weak-ref). Add a quick assert logits is not None for clearer failures.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)

402-402: Minor: avoid magic number for batch size.

Define a local graph_bs = 1 and reuse it in capture/replay calls.


422-423: Optional: guard against None from replay.

Add assert logits is not None before comparisons to make weak-ref issues obvious.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)

328-329: Minor: avoid magic number for batch size.

Use a local graph_bs = 1 and reuse it.


350-351: Optional: assert non-None replay output.

Add assert logits is not None to surface weak-ref invalidation early.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

412-413: Use instance attribute for clarity.

Prefer self.spec_config.max_draft_len over the shadowed spec_config.

-            self.max_draft_len = spec_config.max_draft_len
+            self.max_draft_len = self.spec_config.max_draft_len
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 25389c9 and c3143c4.

📒 Files selected for processing (13)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (10 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
  • tests/unittest/_torch/helpers.py (2 hunks)
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_llama.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mistral.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_mllama.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_phi3.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_qwen.py (3 hunks)
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
  • tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py
  • tests/unittest/_torch/modeling/test_modeling_exaone4.py
  • tests/unittest/_torch/helpers.py
  • tests/unittest/_torch/modeling/test_modeling_llama.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_mistral.py
  • tests/unittest/_torch/modeling/test_modeling_mixtral.py
  • tests/unittest/_torch/modeling/test_modeling_phi3.py
  • tests/unittest/_torch/modeling/test_modeling_qwen_moe.py
  • tests/unittest/_torch/modeling/test_modeling_nemotron.py
  • tests/unittest/_torch/modeling/test_modeling_qwen.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/unittest/_torch/modeling/test_modeling_mllama.py
🧬 Code graph analysis (13)
tests/unittest/_torch/modeling/test_modeling_mllama.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_exaone4.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tests/unittest/_torch/helpers.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
  • CUDAGraphRunner (22-316)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • ResourceManagerType (47-52)
tensorrt_llm/mapping.py (1)
  • Mapping (32-513)
tests/unittest/_torch/modeling/test_modeling_llama.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
  • CUDAGraphRunner (22-316)
  • needs_capture (139-141)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_mistral.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_mixtral.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_phi3.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (2)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tests/unittest/_torch/modeling/test_modeling_nemotron.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tests/unittest/_torch/modeling/test_modeling_qwen.py (3)
tests/unittest/_torch/helpers.py (1)
  • create_mock_cuda_graph_runner (171-186)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
  • capture (143-194)
  • replay (196-225)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
  • forward (79-87)
  • forward (2211-2313)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (8)
tensorrt_llm/mapping.py (1)
  • Mapping (32-513)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • AttentionMetadata (39-347)
  • create_cuda_graph_metadata (288-328)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • MPIDist (98-145)
  • tp_size (46-47)
  • tp_allgather (138-139)
tensorrt_llm/_torch/expert_statistic.py (2)
  • ExpertStatistic (10-98)
  • set_iter (32-36)
tensorrt_llm/_torch/modules/multi_stream_utils.py (1)
  • with_multi_stream (26-32)
tensorrt_llm/_torch/speculative/interface.py (2)
  • SpecMetadata (122-217)
  • create_cuda_graph_metadata (181-192)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • ResourceManagerType (47-52)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • ScheduledRequests (18-39)
  • can_run_cuda_graph (31-32)
🪛 Ruff (0.12.2)
tests/unittest/_torch/modeling/test_modeling_exaone4.py

25-25: Module level import not at top of file

(E402)

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py

43-43: Undefined name DecodingBaseConfig

(F821)


72-72: Undefined name Request

(F821)

🔇 Additional comments (21)
tests/unittest/_torch/modeling/test_modeling_llama_min_latency.py (1)

7-7: Good switch to factory helper.

Decoupling tests from CUDAGraphRunner construction via create_mock_cuda_graph_runner improves isolation and avoids circular deps.

tests/unittest/_torch/modeling/test_modeling_phi3.py (1)

7-7: Factory-based runner import looks good.

Keeps tests engine-agnostic and matches the new API surface.

tests/unittest/_torch/modeling/test_modeling_mllama.py (1)

6-6: Nice: helper import unifies CUDA-graph setup across tests.

tests/unittest/_torch/modeling/test_modeling_qwen_moe.py (1)

6-6: Good move to create_mock_cuda_graph_runner.

tests/unittest/_torch/modeling/test_modeling_mixtral.py (1)

6-6: Helper import LGTM.

tests/unittest/_torch/modeling/test_modeling_exaone4.py (1)

355-357: CUDA graph capture/replay usage matches new API

Capture with the added boolean flag and subsequent replay look correct.

Also applies to: 364-364

tests/unittest/_torch/modeling/test_modeling_qwen.py (1)

20-20: Import of helper factory looks correct

Aligned with new factory-based setup.

tests/unittest/_torch/modeling/test_modeling_nemotron.py (2)

7-7: LGTM: switched to factory helper.

Importing create_mock_cuda_graph_runner decouples tests from the runner class.


335-337: LGTM: updated capture signature correctly.

is_spec_decode=False here is appropriate for pure decoding.

tests/unittest/_torch/modeling/test_modeling_mistral.py (2)

10-10: LGTM: switched to factory helper.

Keeps tests aligned with the new API surface.


416-416: LGTM: capture API use matches new signature.

tests/unittest/_torch/modeling/test_modeling_llama.py (2)

7-7: LGTM: switched to factory helper.


343-345: LGTM: capture API updated correctly.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

460-474: LGTM: runner decoupled and fully parameterized.

Constructor wiring looks correct and removes the prior dependency.


2258-2264: Confirmed safe: draft_tokens_cuda=None is tolerated
maybe_get_cuda_graph only references draft_tokens_cuda when spec_metadata is non-null, so passing None for non-speculative paths is ignored and causes no errors.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (6)

32-49: Well-designed dependency injection refactor

Great work converting from engine-centric initialization to explicit dependency injection! This change successfully breaks the circular dependency by accepting individual parameters rather than the full engine object. The use of keyword-only arguments ensures clarity at call sites.


77-85: Enhanced API with speculative decoding context

The updated maybe_get_cuda_graph signature properly extends support for speculative decoding by accepting is_spec_decode, spec_metadata, and draft_tokens_cuda parameters. This provides the necessary context for graph eligibility decisions.


101-111: Distributed batch size synchronization logic looks correct

The multi-GPU synchronization properly gathers batch information across TP ranks and validates consistency before allowing CUDA graph execution. This ensures all ranks are in sync.


127-135: Proper metadata creation for CUDA graphs

The code correctly creates graph-specific metadata for both attention and speculative decoding, properly handling the draft tokens buffer assignment.


168-171: Conditional mrope position deltas handling

Good use of the use_mrope flag to conditionally include position deltas in the static tensors.


266-271: Comprehensive dummy request configuration

The padding dummy request creation properly includes all necessary parameters for speculative decoding support including max_num_draft_tokens, use_mrope, and max_beam_width.

max_beam_width: int,
max_draft_len: int,
use_mrope: bool,
spec_config: Optional["DecodingBaseConfig"],
Contributor

⚠️ Potential issue

Fix undefined type hint DecodingBaseConfig

The type hint for spec_config references DecodingBaseConfig which is not imported. This will cause a runtime error if the type hint is evaluated.

Add the missing import at the top of the file:

 from ..modules.multi_stream_utils import with_multi_stream
 from ..speculative import SpecMetadata
+from ..speculative.base import DecodingBaseConfig
 from ..utils import make_weak_ref, piecewise_cuda_graph
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
spec_config: Optional["DecodingBaseConfig"],
# at the top of tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
from ..modules.multi_stream_utils import with_multi_stream
from ..speculative import SpecMetadata
from ..speculative.base import DecodingBaseConfig
from ..utils import make_weak_ref, piecewise_cuda_graph
🧰 Tools
🪛 Ruff (0.12.2)

43-43: Undefined name DecodingBaseConfig

(F821)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around line 43, the type
hint DecodingBaseConfig is referenced but not imported; add an import for
DecodingBaseConfig at the top of the file from the module where
DecodingBaseConfig is defined (for example the decoding module under
tensorrt_llm._torch.pyexecutor), or alternatively guard the annotation with
typing.TYPE_CHECKING or keep it as a string literal to avoid runtime evaluation.
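
A sketch of the TYPE_CHECKING alternative mentioned above, which resolves the forward reference for linters without adding a runtime import (the import path mirrors the guess above and should be confirmed against where DecodingBaseConfig actually lives):

from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Imported for static analysis only; nothing is imported at runtime.
    from ..speculative.base import DecodingBaseConfig  # assumed path

class CUDAGraphRunner:
    def __init__(self, *, spec_config: Optional["DecodingBaseConfig"] = None) -> None:
        self.spec_config = spec_config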

self.graph_metadata: Dict[Tuple[int, int], Dict[str, Any]] = {}
-self.memory_pool = engine._cuda_graph_mem_pool
+self.memory_pool = cuda_graph_mem_pool
self.padding_dummy_request: Optional["Request"] = None
Contributor

⚠️ Potential issue

Fix undefined type hint Request

The type hint for padding_dummy_request references Request which is not imported. This will cause a runtime error if the type hint is evaluated.

Add the missing import at the top of the file:

 from ..utils import make_weak_ref, piecewise_cuda_graph
 from .resource_manager import ResourceManager, ResourceManagerType
+from .request import Request
 from .scheduler import ScheduledRequests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self.padding_dummy_request: Optional["Request"] = None
from ..utils import make_weak_ref, piecewise_cuda_graph
from .resource_manager import ResourceManager, ResourceManagerType
from .request import Request
from .scheduler import ScheduledRequests
🧰 Tools
🪛 Ruff (0.12.2)

72-72: Undefined name Request

(F821)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around line 72, the type
hint Optional["Request"] references Request but it's not imported; add the
missing import at the top of the file to resolve the undefined name (e.g., add a
relative import such as "from .request import Request" or the correct absolute
path where Request is defined) so the type is available when type hints are
evaluated.

Comment on lines +95 to +98
if hasattr(self,
'iter_counter') and ExpertStatistic.set_iter(iter_counter):
Contributor

💡 Verification agent

🧩 Analysis chain

Verify the attribute check for iter_counter

The code checks for iter_counter attribute using hasattr(self, 'iter_counter'), but this attribute is never set in the __init__ method. This appears to be dead code unless the attribute is dynamically added elsewhere.


🏁 Script executed:

#!/bin/bash
# Search for where iter_counter might be set on CUDAGraphRunner instances
rg -n "\.iter_counter\s*=" --type=py

Length of output: 205


🏁 Script executed:

#!/bin/bash
# Locate CUDAGraphRunner definition and inspect its __init__
rg -nP "class\s+CUDAGraphRunner" -n -t py tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
rg -nP "def\s+__init__" -n -C5 -t py tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py

Length of output: 563


Initialize or remove the iter_counter check
The hasattr(self, 'iter_counter') guard in tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py:95–96 is always false—self.iter_counter is never defined on CUDAGraphRunner. Either add self.iter_counter = 0 in __init__ or remove these lines.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 95-96 the
hasattr(self, 'iter_counter') check is dead because CUDAGraphRunner never
defines self.iter_counter; either initialize self.iter_counter = 0 in the class
__init__ (or where instances are constructed) so the guard can be true and
ExpertStatistic.set_iter receives a valid value, or remove the hasattr check and
the conditional entirely if iteration counting is not needed—implement whichever
fix matches intended behavior and keep code and tests consistent.
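
A minimal sketch of the first option (initialize the counter so the guard is meaningful):

# In CUDAGraphRunner.__init__, so the attribute always exists:
self.iter_counter = 0

# The hasattr() check then always passes and the guard can be simplified to:
if ExpertStatistic.set_iter(iter_counter):
    ...  # existing body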

Comment on lines +116 to +119
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
key = (batch_size, draft_len)
Contributor

⚠️ Potential issue

Consider null-safety for spec_config access

The code accesses self.spec_config.max_draft_len without checking if spec_config is None, despite it being Optional in the constructor.

Add a null check or provide a default:

-        draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
+        draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
key = (batch_size, draft_len)
draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
key = (batch_size, draft_len)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 116-117, the
code accesses self.spec_config.max_draft_len but spec_config is Optional; guard
against None by computing draft_len = self.spec_config.max_draft_len if
(is_spec_decode and self.spec_config is not None) else 0 (or use
getattr(self.spec_config, "max_draft_len", 0) when is_spec_decode is True), then
build key = (batch_size, draft_len); ensure the null-check covers only when
is_spec_decode is true so behavior remains unchanged otherwise.

Comment on lines +139 to +143
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
return (batch_size, draft_len) not in self.graph_outputs
Contributor

⚠️ Potential issue

Apply the same null-safety fix in needs_capture

-        draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
+        draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
draft_len = self.spec_config.max_draft_len if is_spec_decode else 0
return (batch_size, draft_len) not in self.graph_outputs
def needs_capture(self, batch_size: int, is_spec_decode: bool) -> bool:
draft_len = (self.spec_config.max_draft_len if self.spec_config else 0) if is_spec_decode else 0
return (batch_size, draft_len) not in self.graph_outputs
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py around lines 139 to 141,
needs_capture currently assumes self.spec_config and self.graph_outputs are
non-null; make it null-safe by computing draft_len = (getattr(self.spec_config,
"max_draft_len", 0) if is_spec_decode else 0) and checking membership against a
safe container (e.g., graph_outputs = self.graph_outputs or set()), then return
(batch_size, draft_len) not in graph_outputs so the method won't throw if
spec_config or graph_outputs is None.

Comment on lines 2302 to 2394
# here we don't need to use context since cuda graph capture didn't run kernel.
# maybe we need a cleaner way to do this.
-outputs = self.cuda_graph_runner.replay(batch_size, inputs)
+outputs = self.cuda_graph_runner.replay(
+    batch_size, self.enable_spec_decode, inputs)
else:
Contributor

⚠️ Potential issue

Wrap replay in MoE iterator context on capture path as well.

The else-branch replays under MoeLoadBalancerIterContext, but the capture-branch replay does not. Replay runs kernels and should be consistently wrapped to keep MoE iteration state correct.

-                    # here we don't need to use context since cuda graph capture didn't run kernel.
-                    # maybe we need a cleaner way to do this.
-                    outputs = self.cuda_graph_runner.replay(
-                        batch_size, self.enable_spec_decode, inputs)
+                    # Replay runs kernels; keep MoE iteration state consistent.
+                    with MoeLoadBalancerIterContext(moe_load_balancer):
+                        outputs = self.cuda_graph_runner.replay(
+                            batch_size, self.enable_spec_decode, inputs)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# here we don't need to use context since cuda graph capture didn't run kernel.
# maybe we need a cleaner way to do this.
outputs = self.cuda_graph_runner.replay(batch_size, inputs)
outputs = self.cuda_graph_runner.replay(
batch_size, self.enable_spec_decode, inputs)
else:
# Replay runs kernels; keep MoE iteration state consistent.
with MoeLoadBalancerIterContext(moe_load_balancer):
outputs = self.cuda_graph_runner.replay(
batch_size, self.enable_spec_decode, inputs)
else:
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/model_engine.py around lines 2302 to 2306, the
replay call on the capture path is not wrapped in MoeLoadBalancerIterContext
whereas the else-branch wraps replay inside that context; wrap the capture-path
replay call in the same MoeLoadBalancerIterContext used in the else branch
(enter context before calling self.cuda_graph_runner.replay and exit after), so
both paths run replay under the MoE iterator context and preserve consistent MoE
iteration state.

Comment on lines +6 to +9
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping

Contributor

🛠️ Refactor suggestion

Add required NVIDIA Apache-2.0 header

Per repo guidelines, prepend the 2025 NVIDIA Apache-2.0 header to all .py files.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import CUDAGraphRunner
from tensorrt_llm._torch.pyexecutor.resource_manager import ResourceManagerType
from tensorrt_llm.mapping import Mapping
🤖 Prompt for AI Agents
In tests/unittest/_torch/helpers.py around lines 6 to 9, the file is missing the
required 2025 NVIDIA Apache-2.0 header; prepend the official NVIDIA Apache-2.0
header block (with year 2025 and appropriate license text per repo guidelines)
at the very top of the file before any imports, preserving encoding and line
endings, and ensure the header matches other repository files exactly.

SKIP_EXAONE4_HF_ACCURACY_TEST = True

-from _torch.helpers import create_mock_engine
+from _torch.helpers import create_mock_cuda_graph_runner
Contributor

🛠️ Refactor suggestion

Add required NVIDIA Apache-2.0 header

This file is missing the mandated header.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from _torch.helpers import create_mock_cuda_graph_runner
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from _torch.helpers import create_mock_cuda_graph_runner
🧰 Tools
🪛 Ruff (0.12.2)

25-25: Module level import not at top of file

(E402)

🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_exaone4.py around line 25, the
file is missing the required NVIDIA Apache-2.0 header; add the mandated
multi-line Apache-2.0 license header (including copyright notice and SPDX
identifier) at the top of the file following the project's header format so the
file begins with the standard NVIDIA Apache-2.0 license block.

Qwen2ForCausalLM, Qwen2ForProcessRewardModel)
# yapf: enable
-from _torch.helpers import create_mock_engine
+from _torch.helpers import create_mock_cuda_graph_runner
Contributor

🛠️ Refactor suggestion

Add required NVIDIA Apache-2.0 header

Apply the standard header at the top.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from _torch.helpers import create_mock_cuda_graph_runner
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from _torch.helpers import create_mock_cuda_graph_runner
# …rest of the file…
🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_qwen.py around line 20, the file
is missing the required NVIDIA Apache-2.0 license header; add the standard
NVIDIA Apache-2.0 header block at the very top of the file (before any imports)
matching the project's header template, ensuring correct copyright year/owner
and SPDX identifier.

Comment on lines +267 to +268
graph_runner = create_mock_cuda_graph_runner(
1) if scenario.use_cuda_graph else None
Contributor

⚠️ Potential issue

Fix dtype mismatch for position_ids with CUDA graphs

CUDAGraphRunner allocates position_ids as int32. Here gen_position_ids defaults to int64, causing copy_ dtype mismatch during replay. Force int32.

-        gen_position_ids = [
-            torch.arange(input_ids.size(-1),
-                         input_ids.size(-1) + gen_input_ids.size(-1))
-        ]
+        gen_position_ids = [
+            torch.arange(
+                input_ids.size(-1),
+                input_ids.size(-1) + gen_input_ids.size(-1),
+                dtype=torch.int32,
+            )
+        ]

Also applies to: 282-283, 290-290

🤖 Prompt for AI Agents
In tests/unittest/_torch/modeling/test_modeling_qwen.py around lines 267-268
(also apply same change at 282-283 and 290), when scenario.use_cuda_graph is
true the CUDAGraphRunner allocates position_ids as int32 but gen_position_ids
currently defaults to int64 causing a dtype mismatch on copy_ during replay;
update the test to force gen_position_ids to torch.int32 (or call
.to(torch.int32)) whenever graph_runner is created/used so the generated
position_ids match the CUDA graph's int32 dtype.

@tensorrt-cicd
Collaborator

PR_Github #17812 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13336 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
@QiJune
Collaborator Author

QiJune commented Sep 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19305 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19305 [ run ] completed with state FAILURE

@QiJune
Collaborator Author

QiJune commented Sep 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19310 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19310 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14499 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
@QiJune
Collaborator Author

QiJune commented Sep 25, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #19948 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19948 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15018 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
Signed-off-by: junq <[email protected]>
def __init__(
self,
*,
use_cuda_graph: bool,
Collaborator

What is the use of a CUDAGraphRunner if cuda graph is disabled?

use_cuda_graph: bool,
cuda_graph_padding_enabled: bool,
supported_batch_sizes: list[int],
max_supported_batch_size: int,
Collaborator

What is the difference between max_supported_batch_size and max_batch_size?

Signed-off-by: junq <[email protected]>
attn_metadata: AttentionMetadata,
spec_metadata: Optional[SpecMetadata],
draft_tokens_cuda: torch.Tensor,
) -> Tuple[bool, Optional[AttentionMetadata], Optional[SpecMetadata]]:
Collaborator

It seems the return type here should be Optional[Tuple[AttentionMetadata, SpecMetadata]].
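
That reading would make the tail of the signature look roughly like this (a sketch of the suggestion, not the merged code), so callers would test the result for None instead of unpacking a leading boolean:

) -> Optional[Tuple[AttentionMetadata, SpecMetadata]]: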

@QiJune
Collaborator Author

QiJune commented Sep 25, 2025

/bot run


padding_size = padded_batch_size - batch_size
-if padding_size + batch.batch_size > engine.batch_size:
+if padding_size + batch.batch_size > self.max_batch_size:
Collaborator

Maybe we should rename the batch_size on engine to max_batch_size?

@tensorrt-cicd
Collaborator

PR_Github #19955 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #19955 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #15025 completed with status: 'FAILURE'

Signed-off-by: junq <[email protected]>
@QiJune
Collaborator Author

QiJune commented Sep 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20021 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20021 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15080 completed with status: 'FAILURE'
