D82005826: [vllm][gptoss] pass toolcall turn to kv cache mgr #24787
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a toolcall_turn parameter to track turn numbers in tool-calling conversations, enabling more detailed prefix cache statistics. The parameter is correctly passed through the necessary layers to the KV cache manager, where it's used to collect stats for subsequent tool-call turns. The implementation is sound. I have one minor style suggestion to align with Python best practices.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Juechen Liu <[email protected]>
@@ -56,6 +56,7 @@ def generate(
     lora_request: Optional[LoRARequest] = None,
     trace_headers: Optional[Mapping[str, str]] = None,
     priority: int = 0,
+    toolcall_turn: Optional[int] = None,
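For context, here is a minimal sketch of how a KV cache manager could use such a turn number to split prefix-cache hit stats between the first and subsequent tool-call turns. All names here (`PrefixCacheStats`, `KVCacheManagerSketch`, `record_hit`) are hypothetical illustrations, not the actual vLLM implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrefixCacheStats:
    # Hypothetical counters mirroring the stats described in this PR.
    hits: int = 0
    toolcall_non_1st_turn_hits: int = 0


class KVCacheManagerSketch:
    def __init__(self) -> None:
        self.prefix_cache_stats = PrefixCacheStats()

    def record_hit(self, num_cached_tokens: int,
                   toolcall_turn: Optional[int] = None) -> None:
        # Count every prefix-cache hit; additionally bucket hits that
        # occur on tool-call turns after the first one (turn > 0).
        self.prefix_cache_stats.hits += num_cached_tokens
        if toolcall_turn is not None and toolcall_turn > 0:
            self.prefix_cache_stats.toolcall_non_1st_turn_hits += num_cached_tokens


mgr = KVCacheManagerSketch()
mgr.record_hit(128, toolcall_turn=0)  # first turn: not bucketed separately
mgr.record_hit(912, toolcall_turn=3)  # later turn: bucketed
```

The key point is that only the manager needs the turn number; requests without one (`toolcall_turn=None`) behave exactly as before.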
I don't think it's a good idea to expose "toolcall" as a concept to the engine; it's better to let the engine focus on single-turn scheduling while we influence its scheduling behavior through setting priority.
Is it possible to prevent this? Like introducing some API-server-level stats?
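One way to realize this alternative, sketched with hypothetical names (not actual vLLM code): keep a per-conversation turn counter at the API-server layer and aggregate the cache stats there, so the engine never sees a "toolcall" concept at all:

```python
from collections import defaultdict


class ApiServerToolcallStats:
    """Hypothetical API-server-level tracker; an illustrative sketch only."""

    def __init__(self) -> None:
        self._turns: dict[str, int] = defaultdict(int)
        self.non_1st_turn_hits = 0

    def on_request(self, conversation_id: str, cached_tokens: int) -> int:
        # Each request for a conversation advances its turn counter;
        # cache hits from turn > 0 are aggregated server-side.
        turn = self._turns[conversation_id]
        self._turns[conversation_id] += 1
        if turn > 0:
            self.non_1st_turn_hits += cached_tokens
        return turn


stats = ApiServerToolcallStats()
stats.on_request("conv-1", cached_tokens=0)    # turn 0: not counted
stats.on_request("conv-1", cached_tokens=912)  # turn 1: counted
```

This keeps the engine API unchanged and confines the multi-turn bookkeeping to the serving layer, at the cost of the server needing a way to attribute cached-token counts back to conversations.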
Makes sense, thanks! Let me close this PR for now and have more discussions.
Summary:
Pass the tool-call turn to the engine so that we can collect cache stats for turn > 0 in following changes.
Test Plan:
Run vLLM locally with debug logging; we can see that the turn is successfully passed in:
(EngineCore_0 pid=910425) INFO 09-09 00:55:16 [kv_cache_manager.py:185] ======kingsmad: current turn is request.toolcall_turn=3
(EngineCore_0 pid=910425) INFO 09-09 00:55:16 [kv_cache_manager.py:186] ======kingsmad: writing stats as hits self.prefix_cache_stats.toolcall_non_1st_turn_hits=912
Reviewers: @yeqcharlotte
Subscribers:
Tasks:
Tags: