
@chickeyton chickeyton commented Sep 5, 2025

Purpose

Fix #20962. As an answer to the requested TTFT Routing feature, and related to (but not dependent on) production-stack PR vllm-project/production-stack#670, this PR adds a metric, vllm:avg_prefill_comp_speed, the average prefill computation speed of requests, for the next phase of TTFT Routing.

The definition of Average Prefill Computation Speed:

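from collections import deque

def amount(num_prefix_tokens, num_cached_tokens):
    """Computes the number of Q•K dot products in prefill with prefix caching, using the trapezoid area formula."""
    top = num_cached_tokens + 1                     # keys attended by the first uncached query
    bottom = num_prefix_tokens                      # keys attended by the last query
    height = num_prefix_tokens - num_cached_tokens  # number of uncached queries
    return (top + bottom) * height / 2

# Keep only the 100 most recent speeds; older entries are dropped automatically.
speed_history = deque(maxlen=100)
for request in _100_recent_requests_with_first_token:
    if request.prefill_time > 0:  # skip requests that would divide by zero
        speed_history.append(amount(request.num_prefix_tokens, request.num_cached_tokens) / request.prefill_time)
avg_prefill_comp_speed = sum(speed_history) / len(speed_history) if speed_history else 0.0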
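
As a sanity check on the definition above, the trapezoid formula can be compared with a direct count. This is a minimal sketch that assumes causal attention, where the query at 1-based position k attends to k keys; `amount_brute_force` is a hypothetical helper introduced only for this comparison and is not part of the PR.

```python
def amount(num_prefix_tokens, num_cached_tokens):
    """Trapezoid-area count of Q.K dot products, as defined in the description."""
    top = num_cached_tokens + 1
    bottom = num_prefix_tokens
    height = num_prefix_tokens - num_cached_tokens
    return (top + bottom) * height / 2


def amount_brute_force(num_prefix_tokens, num_cached_tokens):
    """Hypothetical helper: add up the keys attended by each uncached query."""
    # Query positions num_cached_tokens+1 .. num_prefix_tokens (1-based),
    # each attending to as many keys as its own position (causal mask assumed).
    return sum(range(num_cached_tokens + 1, num_prefix_tokens + 1))


# The two counts agree, including the fully cached case (height == 0).
for n, c in [(8, 0), (8, 3), (1024, 512), (5, 5)]:
    assert amount(n, c) == amount_brute_force(n, c)
```

The agreement follows from the arithmetic-series sum: the terms run from `num_cached_tokens + 1` to `num_prefix_tokens`, so the trapezoid area is exact rather than an approximation under this assumption.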

Test Plan

  1. Pull the vLLM source code with this PR included.
  2. cd to the source directory, then install from source:
uv pip install -e .
  3. Start a vLLM instance (the port is set explicitly so that it matches the URLs used in the following steps):
vllm serve Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8080 --gpu-memory-utilization 0.8
  4. cd to <vllm source>/vllm/benchmarks and run benchmark_serving.py:
python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8080 \
    --dataset-name=random \
    --model Qwen/Qwen3-0.6B \
    --seed 12345
  5. After benchmark_serving.py finishes, query the /metrics HTTP endpoint:
curl http://localhost:8080/metrics | grep vllm:avg_prefill_comp_speed

Test Result

If everything goes well, output similar to the following is expected after step 5:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--
100 45016  100 45016    0     0  1916k      0 --:--:-- --:--:-- --:--:-- 1998k
# HELP vllm:avg_prefill_comp_speed Avg. prefill computation speed of the 100 most recent finished requests.
# TYPE vllm:avg_prefill_comp_speed gauge
vllm:avg_prefill_comp_speed{engine="0",model_name="Qwen/Qwen3-0.6B"} 4.226412238196155e+06
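
For consumers that want the value programmatically rather than through grep, the sample line above can be parsed with the standard library. This is only a sketch: `read_gauge` is a hypothetical helper, and its simple pattern assumes label values contain no `}` character and that no timestamp follows the value.

```python
import re

# Sample text copied from the expected output above.
SAMPLE = """# HELP vllm:avg_prefill_comp_speed Avg. prefill computation speed of the 100 most recent finished requests.
# TYPE vllm:avg_prefill_comp_speed gauge
vllm:avg_prefill_comp_speed{engine="0",model_name="Qwen/Qwen3-0.6B"} 4.226412238196155e+06
"""


def read_gauge(metrics_text, name):
    """Hypothetical helper: map each label string of `name` to its float value."""
    pattern = re.compile(r"^" + re.escape(name) + r"(\{[^}]*\})?\s+(\S+)$")
    samples = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP and TYPE comment lines
            continue
        match = pattern.match(line)
        if match:
            samples[match.group(1) or ""] = float(match.group(2))
    return samples


speeds = read_gauge(SAMPLE, "vllm:avg_prefill_comp_speed")
```

Each key in `speeds` is the raw label string, so a deployment with several engines would yield one entry per engine.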

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new Prometheus metric, vllm:avg_prefill_comp_speed, to track the average prefill computation speed. The implementation correctly calculates this metric based on the number of prompt tokens and cached tokens for finished requests, using a trapezoid formula to quantify the computation amount. The changes are well-contained and span across the engine, output processor, and metrics modules to propagate the necessary data and state. I've found one critical issue regarding missing imports that will cause a runtime error. Please see my specific comment for details.

@chickeyton chickeyton changed the title add vllm:avg_prefill_comp_speed to Prometheus [Feat] add vllm:avg_prefill_comp_speed to Prometheus Sep 5, 2025
@chickeyton chickeyton marked this pull request as ready for review September 8, 2025 10:04
@chickeyton
Author

/gemini review

Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@chickeyton
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new Prometheus metric, vllm:avg_prefill_comp_speed, to monitor the average prefill computation speed. The implementation correctly calculates this metric for finished requests, using a trapezoid formula to determine the computational 'amount' and considering prefix caching. The changes are well-structured, passing a history of speeds via a deque from the engine to IterationStats, where the average is computed. The code handles edge cases, such as insufficient history or zero prefill time, to prevent errors. Overall, the implementation is robust and a valuable addition for monitoring performance.
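
The review summary above mentions a deque-backed history and guards for insufficient history and zero prefill time. The PR diff is not shown in this conversation, so the following is only a sketch of that idea: the class name `PrefillSpeedWindow`, its methods, and the choice to report 0.0 for an empty history are all assumptions, not the PR's actual code.

```python
from collections import deque


class PrefillSpeedWindow:
    """Hypothetical sliding window over the most recent per-request speeds."""

    def __init__(self, max_requests=100):
        # A bounded deque drops the oldest speed once the window is full.
        self._speeds = deque(maxlen=max_requests)

    def record(self, num_prefix_tokens, num_cached_tokens, prefill_time):
        if prefill_time <= 0:
            return  # guard: a zero prefill time would divide by zero
        height = num_prefix_tokens - num_cached_tokens
        amount = (num_cached_tokens + 1 + num_prefix_tokens) * height / 2
        self._speeds.append(amount / prefill_time)

    def average(self):
        if not self._speeds:
            return 0.0  # assumption: report 0.0 when there is no history yet
        return sum(self._speeds) / len(self._speeds)


window = PrefillSpeedWindow(max_requests=2)
window.record(8, 3, 0.0)  # skipped by the zero-time guard
window.record(8, 0, 2.0)  # amount 36 over 2 s
window.record(8, 3, 3.0)  # amount 30 over 3 s
window.record(4, 0, 1.0)  # amount 10 over 1 s; evicts the oldest entry
```

Note that this window is bounded by request count, not by time, so on an idle server the reported average can describe arbitrarily old requests.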

@chickeyton
Author

This is a new metric required by TTFT Routing. Please comment, @ywang96 @DarkLight1337.

@DarkLight1337
Member

cc @markmc

Member

@DarkLight1337 DarkLight1337 left a comment

Is it necessary to keep a sliding window of the history to get the average? IIRC Prometheus can handle this already, perhaps @markmc could help elaborate on this

@markmc
Member

markmc commented Sep 10, 2025

> Is it necessary to keep a sliding window of the history to get the average? IIRC Prometheus can handle this already, perhaps @markmc could help elaborate on this

See my response to separate proposal for a request-count based sliding window: #22480 (comment)
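
The linked response is not reproduced in this thread. As background for the question, a common Prometheus pattern is to export two monotonically increasing totals and let the server derive a time-windowed average from their increase between scrapes (in PromQL, roughly a ratio of two `rate()` expressions). The sketch below imitates that arithmetic in plain Python; the counter names and the `windowed_average` helper are hypothetical, and this is not necessarily what was proposed in #22480.

```python
class Totals:
    """Two monotonically increasing totals, as an exporter might hold them."""

    def __init__(self):
        self.comp_amount_total = 0.0      # hypothetical counter name
        self.prefill_seconds_total = 0.0  # hypothetical counter name

    def observe(self, amount, prefill_time):
        self.comp_amount_total += amount
        self.prefill_seconds_total += prefill_time

    def snapshot(self):
        # Stands in for one scrape of the /metrics endpoint.
        return (self.comp_amount_total, self.prefill_seconds_total)


def windowed_average(earlier, later):
    """Hypothetical helper: speed over the interval between two scrapes."""
    d_amount = later[0] - earlier[0]
    d_seconds = later[1] - earlier[1]
    return d_amount / d_seconds if d_seconds > 0 else None


totals = Totals()
totals.observe(36.0, 2.0)
scrape_1 = totals.snapshot()
totals.observe(30.0, 3.0)
totals.observe(10.0, 1.0)
scrape_2 = totals.snapshot()
speed = windowed_average(scrape_1, scrape_2)  # (30 + 10) / (3 + 1)
```

This ratio weights each request by its prefill time, so it is not identical to the mean of per-request speeds defined in the PR description; the window is also chosen at query time instead of being fixed at 100 requests.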

@chickeyton
Author

chickeyton commented Sep 25, 2025

This PR is closed because it is no longer needed by TTFT Routing.

@chickeyton chickeyton closed this Sep 25, 2025
Successfully merging this pull request may close these issues.

[RFC][FEATURE]: TTFT Routing