[Feat] add vllm:avg_prefill_comp_speed to Prometheus #24299

chickeyton · 2025-09-05T07:05:17Z

Purpose

Fixes #20962. In response to the requested TTFT Routing feature, and related to (but not dependent on) production-stack PR vllm-project/production-stack#670, this PR adds a metric, vllm:avg_prefill_comp_speed, which reports the average prefill computation speed of requests for the next phase of TTFT Routing.

The definition of Average Prefill Computation Speed:

def amount(num_prefix_tokens, num_cached_tokens):
    """Computes number of Q•K dot products in prefill with prefix caching by trapezoid area formula"""
    top = num_cached_tokens + 1
    bottom = num_prefix_tokens
    height = num_prefix_tokens - num_cached_tokens
    return (top + bottom) * height / 2

speed_history = []
for request in _100_recent_requests_with_first_token:
    speed_history.append(amount(request.num_prefix_tokens, request.num_cached_tokens) / request.prefill_time)
avg_prefill_comp_speed = avg(speed_history)
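
For reference, a minimal runnable sketch of the definition above. The trapezoid formula is the arithmetic series (c + 1) + (c + 2) + ... + n for n prompt tokens with c cached, assuming causal attention where the token at position i attends to i keys. The class name PrefillSpeedWindow and its methods are hypothetical and are not taken from this PR's diff; the zero-time guard and the bounded deque only mirror what the review comments describe.

```python
from collections import deque


def prefill_amount(num_prefix_tokens, num_cached_tokens):
    """Number of Q.K dot products for the uncached part of a prompt.

    Equals (c + 1) + (c + 2) + ... + n, written as a trapezoid area.
    """
    top = num_cached_tokens + 1
    bottom = num_prefix_tokens
    height = num_prefix_tokens - num_cached_tokens
    return (top + bottom) * height / 2


class PrefillSpeedWindow:
    """Hypothetical helper: mean speed over the most recent `maxlen` requests."""

    def __init__(self, maxlen=100):
        # A bounded deque drops the oldest entry once `maxlen` is reached.
        self._speeds = deque(maxlen=maxlen)

    def record(self, num_prefix_tokens, num_cached_tokens, prefill_time):
        # Skip requests with a non-positive prefill time to avoid dividing by zero.
        if prefill_time <= 0:
            return
        amount = prefill_amount(num_prefix_tokens, num_cached_tokens)
        self._speeds.append(amount / prefill_time)

    def average(self):
        # Report 0.0 until at least one request has been recorded (an assumption).
        if not self._speeds:
            return 0.0
        return sum(self._speeds) / len(self._speeds)
```

For example, a 10-token prompt with 4 cached tokens gives prefill_amount(10, 4) == 45, which matches 5 + 6 + 7 + 8 + 9 + 10.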

Test Plan

1. Pull the vLLM source code with this PR included.
2. cd to the source directory, then install from source:

uv pip install -e .

3. Start a vLLM instance on port 8080:

vllm serve Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8080 --gpu-memory-utilization 0.8

4. cd to <vllm source>/vllm/benchmarks and run benchmark_serving.py:

python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8080 \
    --dataset-name=random \
    --model Qwen/Qwen3-0.6B \
    --seed 12345

5. After benchmark_serving.py has finished, query the /metrics HTTP endpoint:

curl http://localhost:8080/metrics  | grep vllm:avg_prefill_comp_speed

Test Result

If everything works, output similar to the following is expected after step 5:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
100 45016  100 45016    0     0  1916k      0 --:--:-- --:--:-- --:--:-- 1998k
# HELP vllm:avg_prefill_comp_speed Avg. prefill computation speed of the 100 most recent finished requests.
# TYPE vllm:avg_prefill_comp_speed gauge
vllm:avg_prefill_comp_speed{engine="0",model_name="Qwen/Qwen3-0.6B"} 4.226412238196155e+06
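
To check the value programmatically rather than with grep, something like the following sketch could be used. It assumes the plain Prometheus text exposition format shown above, with no trailing timestamp on sample lines; read_gauge is a hypothetical helper, not part of vLLM.

```python
def read_gauge(metrics_text, name):
    """Return {label_string: value} for every sample line of metric `name`."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(name):
            continue
        # Reject longer metric names that merely share this prefix.
        if line[len(name):len(name) + 1] not in ("{", " "):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        series, _, value = line.rpartition(" ")
        samples[series[len(name):]] = float(value)
    return samples


sample = """\
# HELP vllm:avg_prefill_comp_speed Avg. prefill computation speed of the 100 most recent finished requests.
# TYPE vllm:avg_prefill_comp_speed gauge
vllm:avg_prefill_comp_speed{engine="0",model_name="Qwen/Qwen3-0.6B"} 4.226412238196155e+06
"""
values = read_gauge(sample, "vllm:avg_prefill_comp_speed")
```

Here values maps the label set of each series to its gauge value, so a test script can assert that the metric is present and positive.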

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: chickeyton <[email protected]>

gemini-code-assist

Code Review

This pull request introduces a new Prometheus metric, vllm:avg_prefill_comp_speed, to track the average prefill computation speed. The implementation correctly calculates this metric based on the number of prompt tokens and cached tokens for finished requests, using a trapezoid formula to quantify the computation amount. The changes are well-contained and span across the engine, output processor, and metrics modules to propagate the necessary data and state. I've found one critical issue regarding missing imports that will cause a runtime error. Please see my specific comment for details.

vllm/v1/engine/llm_engine.py

chickeyton · 2025-09-09T03:12:21Z

/gemini review

gemini-code-assist · 2025-09-09T03:20:36Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

chickeyton · 2025-09-09T03:23:19Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a new Prometheus metric, vllm:avg_prefill_comp_speed, to monitor the average prefill computation speed. The implementation correctly calculates this metric for finished requests, using a trapezoid formula to determine the computational 'amount' and considering prefix caching. The changes are well-structured, passing a history of speeds via a deque from the engine to IterationStats, where the average is computed. The code handles edge cases, such as insufficient history or zero prefill time, to prevent errors. Overall, the implementation is robust and a valuable addition for monitoring performance.

chickeyton · 2025-09-09T03:30:35Z

This is a new metric required by TTFT Routing. Please comment, @ywang96 @DarkLight1337.

DarkLight1337 · 2025-09-09T03:33:22Z

cc @markmc

DarkLight1337

Is it necessary to keep a sliding window of the history to get the average? IIRC Prometheus can handle this already, perhaps @markmc could help elaborate on this
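
To illustrate the alternative raised in this question, here is a rough sketch of the usual pattern: the server exports only a cumulative sum and a cumulative count, and the average over a window is derived from the difference between two scrapes. This is what a PromQL expression of the form rate(x_sum[5m]) / rate(x_count[5m]) computes on the Prometheus side. The names SumCountPair and windowed_mean are hypothetical, and the sketch reflects neither this PR's code nor the linked response. Note that the window here is time-based rather than a fixed number of requests.

```python
class SumCountPair:
    """Hypothetical stand-in for a pair of monotonically increasing counters."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Only cumulative totals are kept; no per-request history is stored.
        self.total += value
        self.count += 1

    def snapshot(self):
        # What a single scrape of the two counters would see.
        return (self.total, self.count)


def windowed_mean(older, newer):
    """Mean of the observations made between two scrapes, or None if there were none."""
    d_sum = newer[0] - older[0]
    d_count = newer[1] - older[1]
    return d_sum / d_count if d_count > 0 else None


pair = SumCountPair()
pair.observe(10.0)
pair.observe(20.0)
first_scrape = pair.snapshot()
pair.observe(30.0)
pair.observe(50.0)
second_scrape = pair.snapshot()
mean_between_scrapes = windowed_mean(first_scrape, second_scrape)
```

With these inputs mean_between_scrapes is 40.0, the mean of the two observations made after the first scrape.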

markmc · 2025-09-10T10:17:24Z

Is it necessary to keep a sliding window of the history to get the average? IIRC Prometheus can handle this already, perhaps @markmc could help elaborate on this

See my response to a separate proposal for a request-count-based sliding window: #22480 (comment)

chickeyton · 2025-09-25T03:58:45Z

This PR is closed as it is no longer needed by TTFT Routing.

add vllm:avg_prefill_comp_speed to Prometheus

5c86b27

Signed-off-by: chickeyton <[email protected]>

chickeyton requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners September 5, 2025 07:05

chickeyton marked this pull request as draft September 5, 2025 07:05

mergify bot added the v1 label Sep 5, 2025

gemini-code-assist bot reviewed Sep 5, 2025

chickeyton changed the title ~~add vllm:avg_prefill_comp_speed to Prometheus~~ [Feat] add vllm:avg_prefill_comp_speed to Prometheus Sep 5, 2025

chickeyton added 5 commits September 5, 2025 15:14

import Deque

0099f5a

Signed-off-by: chickeyton <[email protected]>

import deque

0a11f89

Signed-off-by: chickeyton <[email protected]>

fix review issues

90c3aa3

Signed-off-by: chickeyton <[email protected]>

fix codecheck issues

02e508d

Signed-off-by: chickeyton <[email protected]>

Merge branch 'main' into request_prefill_comp_speed

4533ed8

chickeyton marked this pull request as ready for review September 8, 2025 10:04

chickeyton added 5 commits September 9, 2025 10:05

fix codecheck issues

176bbee

Signed-off-by: chickeyton <[email protected]>

fix codecheck issues

d10bbc7

Signed-off-by: chickeyton <[email protected]>

fix codecheck issues

9bd06cb

Signed-off-by: chickeyton <[email protected]>

Merge branch 'main' into request_prefill_comp_speed

87d3f61

fix codecheck issues

f5ef982

Signed-off-by: chickeyton <[email protected]>

gemini-code-assist bot reviewed Sep 9, 2025

chickeyton mentioned this pull request Sep 9, 2025

[Feat][Router] Add TTFT Routing vllm-project/production-stack#670

Draft

3 tasks

DarkLight1337 reviewed Sep 9, 2025

chickeyton added 2 commits September 9, 2025 12:48

update the way of finding prefill computation amount

c921b28

Signed-off-by: chickeyton <[email protected]>

update the way of finding prefill computation amount

1e42eae

Signed-off-by: chickeyton <[email protected]>

chickeyton closed this Sep 25, 2025
