[Feat] add vllm:avg_prefill_comp_speed to Prometheus #24299
Conversation
Code Review
This pull request introduces a new Prometheus metric, `vllm:avg_prefill_comp_speed`, to track the average prefill computation speed. The implementation correctly calculates this metric based on the number of prompt tokens and cached tokens for finished requests, using a trapezoid formula to quantify the computation amount. The changes are well contained and span the engine, output processor, and metrics modules to propagate the necessary data and state. I've found one critical issue regarding missing imports that will cause a runtime error. Please see my specific comment for details.
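As a rough illustration of the trapezoid idea described in the review (a sketch only; `prefill_comp_amount` and its parameters are hypothetical names, not the PR's actual code), the per-request computation amount can be modeled as the sum of per-position attention costs over the non-cached part of the prompt:

```python
def prefill_comp_amount(num_prompt_tokens: int, num_cached_tokens: int) -> float:
    """Approximate prefill compute for one request.

    Prefill position i attends to roughly i earlier tokens, so the cost of
    computing positions num_cached_tokens+1 .. num_prompt_tokens is the sum
    of an arithmetic series -- the area of a trapezoid.
    """
    n, c = num_prompt_tokens, num_cached_tokens
    return (c + 1 + n) * (n - c) / 2.0
```

Dividing this amount by the observed prefill wall time of the request then gives the per-request speed that feeds the average.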
/gemini review
Warning: Gemini encountered an error creating the review. You can try again by commenting `/gemini review`.
/gemini review
Code Review
This pull request introduces a new Prometheus metric, `vllm:avg_prefill_comp_speed`, to monitor the average prefill computation speed. The implementation correctly calculates this metric for finished requests, using a trapezoid formula to determine the computational 'amount' and considering prefix caching. The changes are well-structured, passing a history of speeds via a `deque` from the engine to `IterationStats`, where the average is computed. The code handles edge cases, such as insufficient history or zero prefill time, to prevent errors. Overall, the implementation is robust and a valuable addition for monitoring performance.
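A minimal sketch of the aggregation pattern described above, assuming a bounded per-engine history (the names `MAX_SPEED_HISTORY`, `record_finished_request`, and `avg_prefill_comp_speed` are illustrative, not the PR's actual identifiers):

```python
from collections import deque

MAX_SPEED_HISTORY = 1000  # assumed window size; the PR's constant may differ

prefill_speed_history: deque[float] = deque(maxlen=MAX_SPEED_HISTORY)


def record_finished_request(comp_amount: float, prefill_time_s: float) -> None:
    """Append one speed sample, skipping degenerate (zero) prefill times."""
    if prefill_time_s > 0.0:
        prefill_speed_history.append(comp_amount / prefill_time_s)


def avg_prefill_comp_speed() -> float:
    """Average over the current window; 0.0 while there is no history yet."""
    if not prefill_speed_history:
        return 0.0
    return sum(prefill_speed_history) / len(prefill_speed_history)
```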
This is a new metric required by TTFT Routing; please comment @ywang96 @DarkLight1337
cc @markmc
Is it necessary to keep a sliding window of the history to get the average? IIRC Prometheus can handle this already; perhaps @markmc could help elaborate on this.
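For context on this suggestion: the usual Prometheus pattern is to export two monotonic counters and let the query side compute the ratio, so no in-engine sliding window is needed. A sketch with `prometheus_client` (the counter names here are hypothetical, not metrics vLLM actually exposes):

```python
from prometheus_client import Counter

# Hypothetical metric names, for illustration only.
prefill_comp_total = Counter(
    "vllm:prefill_comp_amount_total",
    "Cumulative prefill computation amount over finished requests.")
prefill_time_total = Counter(
    "vllm:prefill_time_seconds_total",
    "Cumulative prefill wall time over finished requests.")


def observe_finished_request(comp_amount: float, prefill_time_s: float) -> None:
    """Record one finished request; the averaging is left to PromQL."""
    prefill_comp_total.inc(comp_amount)
    prefill_time_total.inc(prefill_time_s)

# Example PromQL for a time-windowed average speed:
#   rate(vllm:prefill_comp_amount_total[5m]) / rate(vllm:prefill_time_seconds_total[5m])
```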
See my response to a separate proposal for a request-count-based sliding window: #22480 (comment)
This PR is closed as it is no longer needed by TTFT Routing.
Purpose
Fixes #20962, an answer to the requested feature TTFT Routing and a companion to (but not dependent on) production-stack PR vllm-project/production-stack#670. This PR adds a metric, `vllm:avg_prefill_comp_speed`, the average prefill computation speed of requests, for the next phase of TTFT Routing.
The definition of Average Prefill Computation Speed:
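The formula itself was not captured in this copy of the description; based on the review comments above, it is roughly of the following trapezoid form (a hedged reconstruction, not necessarily the PR's exact expression), with $n$ prompt tokens, $c$ prefix-cached tokens, and prefill time $t$ per finished request:

$$
\text{speed} = \frac{A}{t}, \qquad
A = \sum_{i=c+1}^{n} i = \frac{(n + c + 1)(n - c)}{2}
$$

The reported metric is then the average of this per-request speed over recently finished requests.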
Test Plan
1. Install vLLM from source: `uv pip install -e .`
2. Go to `<vllm source>/vllm/benchmarks` and run `benchmark_serving.py`.
3. After `benchmark_serving.py` is finished, query the `/metrics` HTTP API: `curl http://localhost:8080/metrics | grep vllm:avg_prefill_comp_speed`
Test Result
If everything goes well, a printout similar to the following is expected after the final step:
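The original printout is not reproduced here; as a shape reference only, a gauge with this name would appear in the standard Prometheus text exposition roughly like the lines below (the HELP text, label set, and value are placeholders, not measured results):

```
# HELP vllm:avg_prefill_comp_speed Average prefill computation speed of finished requests.
# TYPE vllm:avg_prefill_comp_speed gauge
vllm:avg_prefill_comp_speed{model_name="<served model name>"} 1.234567e+06
```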
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.