Conversation

@omera-nv omera-nv commented Jul 8, 2025

Description

This PR adds several checks that ensure trtllm-bench does not hang indefinitely if an exception is thrown during inference:

  • asyncio tasks are checked for errors when done, and the manager is stopped if an error occurs (a minimal sketch of these checks follows this list).
  • The number of perf items is checked against the number of requests submitted.
  • CancelledErrors are caught for a cleaner traceback.
  • Added an assert in py_executor.py to check for sampling failures.
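
A minimal sketch of how the first three checks might fit together. The names (`submit_request`, `run_benchmark`, `perf_items`) are illustrative assumptions, not the actual trtllm-bench code, and the py_executor.py sampling assert is not shown:

```python
import asyncio

async def submit_request(request_id: int) -> dict:
    """Placeholder for one inference request; raises if inference fails."""
    await asyncio.sleep(0)
    return {"request_id": request_id, "latency_ms": 1.0}

async def run_benchmark(num_requests: int) -> list[dict]:
    tasks = [asyncio.create_task(submit_request(i)) for i in range(num_requests)]
    perf_items: list[dict] = []
    try:
        for finished in asyncio.as_completed(tasks):
            # Check each task for errors when it is done: awaiting it
            # re-raises any exception, so the benchmark fails fast
            # instead of hanging while waiting for missing results.
            perf_items.append(await finished)
    except asyncio.CancelledError:
        # Catch cancellation explicitly for a cleaner traceback.
        print("benchmark cancelled")
        raise
    except Exception as exc:
        # On any error, cancel the remaining work (stop the "manager"
        # in the real code) rather than waiting on it forever.
        for task in tasks:
            task.cancel()
        raise RuntimeError("inference failed, stopping benchmark") from exc
    # The number of perf items must match the number of requests submitted.
    assert len(perf_items) == num_requests, (
        f"expected {num_requests} perf items, got {len(perf_items)}")
    return perf_items

if __name__ == "__main__":
    print(asyncio.run(run_benchmark(4)))
```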

omera-nv added 3 commits July 8, 2025 13:40
Signed-off-by: Omer Ullman Argov <[email protected]>
Signed-off-by: Omer Ullman Argov <[email protected]>
Signed-off-by: Omer Ullman Argov <[email protected]>
@omera-nv omera-nv requested review from a team as code owners July 8, 2025 10:51
@omera-nv omera-nv requested a review from achartier July 8, 2025 10:51

omera-nv commented Jul 8, 2025

/bot run

@tensorrt-cicd

PR_Github #11287 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11287 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8347 completed with status: 'SUCCESS'

@omera-nv omera-nv merged commit d6d2ab2 into NVIDIA:main Jul 9, 2025
3 checks passed
zhou-yuxin pushed a commit to zhou-yuxin/TensorRT-LLM that referenced this pull request Jul 15, 2025