📢 Announcing the TensorRT LLM Software Architecture Roadmap (August 2025 Update) #7834
QiJune announced in Announcements
Hi everyone,
The TensorRT LLM team recently held a software architecture review to align on our strategy for the future. We're excited to share the key updates and our forward-looking roadmap with the community.
Our mission is to re-architect TensorRT LLM into a robust, modular, and highly extensible platform that enables rapid innovation while delivering production-grade stability. Our work is focused on four key initiatives: SW Architecture & Modularity, Compute Efficiency, System-Level Optimization, and Exploration of next-generation ideas.
Here’s a look at our recent progress and where we're headed next.
🚀 Recent Updates & Accomplishments
We've been focused on improving stability, modularity, and performance throughout the stack.
Feature Combination
We've made significant progress in ensuring features work together reliably. You can view the latest compatibility status in our updated Feature Combination Matrix.
Modularization
The request fetching logic has been refactored out of `PyExecutor` into a new `ExecutorRequestQueue` class for better modularity and easier maintenance. (feat: Refactor the fetching request logic #5786, [None][feat] Add support of scheduling attention dp request #6246, [None][opt] Add batch wait timeout in fetching requests #6923)

System-Level Optimization
The `LlmResponse` class has been re-implemented in pure Python to reduce serialization/deserialization overhead, significantly speeding up inter-process communication. (Re-implement LlmResponse in Python to reduce host overhead of pybind #5224)

🗺️ The Road Ahead: Our Future Focus
Our future work is aimed at building a more flexible, powerful, and easy-to-use platform.
Next-Gen KV Cache Manager
We are designing a new KV cache manager from the ground up to accelerate feature development and improve maintainability.
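For readers unfamiliar with paged KV cache management, here is a minimal sketch of the core bookkeeping such a manager performs. All names here are hypothetical illustrations, not the actual TensorRT LLM design: a fixed pool of token blocks is handed out to sequences on demand and returned to a free list when a sequence finishes.

```python
class BlockAllocator:
    """Hypothetical sketch of a paged KV cache pool: fixed-size blocks of
    `tokens_per_block` token slots are handed out per sequence and returned
    to a free list when the sequence completes."""

    def __init__(self, num_blocks: int, tokens_per_block: int):
        self.tokens_per_block = tokens_per_block
        self._free = list(range(num_blocks))      # free block ids
        self._owned: dict[int, list[int]] = {}    # seq_id -> block ids

    def allocate(self, seq_id: int, num_tokens: int) -> list[int]:
        # Number of blocks needed, rounded up to whole blocks.
        needed = -(-num_tokens // self.tokens_per_block)
        if needed > len(self._free):
            raise RuntimeError("KV cache pool exhausted")
        blocks = [self._free.pop() for _ in range(needed)]
        self._owned.setdefault(seq_id, []).extend(blocks)
        return blocks

    def free(self, seq_id: int) -> None:
        # Return all of a finished sequence's blocks to the pool.
        self._free.extend(self._owned.pop(seq_id, []))
```

Real managers layer much more on top of this (block reuse across sequences, prefix sharing, eviction), which is exactly where a ground-up redesign can pay off in maintainability.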
Unified Sampler Architecture
Our goal is to create a unified sampler that offers the peak performance of the C++ TRTLLM Sampler with the developer-friendly, extensible architecture of the Torch Sampler. We will build upon the Torch Sampler codebase and methodically integrate the high-performance kernels and advanced features from the C++ sampler.
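To make the "unified sampler" idea concrete, here is a hedged sketch of what such an interface could look like. The class names and method signatures are illustrative assumptions, not the TRTLLM API: the executor loop depends only on an abstract `sample()` contract, so a kernel-backed implementation can be swapped in behind the same extensible surface.

```python
import math
import random
from abc import ABC, abstractmethod


class Sampler(ABC):
    """Hypothetical unified interface: subclasses override sample(), and a
    fused, kernel-backed implementation can replace a pure-Python one
    without changing the calling code."""

    @abstractmethod
    def sample(self, logits: list[float]) -> int: ...


class GreedySampler(Sampler):
    def sample(self, logits: list[float]) -> int:
        # Argmax over the logits.
        return max(range(len(logits)), key=logits.__getitem__)


class TemperatureSampler(Sampler):
    def __init__(self, temperature: float, seed: int = 0):
        self.temperature = temperature
        self._rng = random.Random(seed)

    def sample(self, logits: list[float]) -> int:
        # Temperature-scaled softmax, then a categorical draw.
        scaled = [l / self.temperature for l in logits]
        m = max(scaled)
        weights = [math.exp(s - m) for s in scaled]
        return self._rng.choices(range(len(logits)), weights=weights)[0]
```

The design point is that the performance-critical kernels live behind the interface, while users extend behavior by subclassing, mirroring the goal of combining the C++ sampler's speed with the Torch Sampler's extensibility.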
Ray Integration
To better support advanced workloads like Reinforcement Learning and multi-step agents, we are integrating Ray for orchestration. (#7520)
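As a rough illustration of the orchestration pattern Ray enables, the sketch below uses plain Python (with hypothetical names) to fan RL-style rollout prompts out to several generation workers and gather results. With Ray, each worker would become a remote actor (e.g. via `@ray.remote`) running in its own process or node, and the gather step would collect futures.

```python
from concurrent.futures import ThreadPoolExecutor


class GenerationWorker:
    """Stand-in for an LLM engine instance; under Ray this would be a
    remote actor rather than a local object."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id

    def generate(self, prompt: str) -> str:
        # Placeholder for a real decode step.
        return f"worker{self.worker_id}:{prompt[::-1]}"


def rollout(workers: list[GenerationWorker], prompts: list[str]) -> list[str]:
    # Fan prompts out round-robin and collect completions concurrently,
    # as an RL rollout phase would.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [
            pool.submit(workers[i % len(workers)].generate, p)
            for i, p in enumerate(prompts)
        ]
        return [f.result() for f in futures]
```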
General Refactoring
We believe these initiatives will significantly enhance TensorRT LLM's capabilities and user experience. We welcome your feedback, ideas, and contributions. Please share your thoughts in the comments below!