
[Feature]: Add Moving Average Statistics for Better Performance Monitoring #22480

@NumberWan


🚀 The feature, motivation and pitch

Currently, vLLM's Prometheus metrics are built from real-time, event-based stats such as IterationStats and RequestStateStats. While useful for immediate monitoring, they make it hard to get a clear, stable view of the server's long-term or recent average performance without building an external data processing and aggregation layer.

WHY:

IterationStats: emitted per scheduling step (not per request), so it is too frequent for request-level tracking
RequestStateStats: captures only request start events; it lacks completion data
FinishedRequestStats: records completed requests but does not maintain historical aggregates
LoRAStats: unrelated to performance tracking (it handles Low-Rank Adaptation metrics)

Gap: no sliding-window aggregation for request performance metrics exists.

For example, a user wants to monitor the average latency or throughput of the last 100 finished requests to understand the server's typical performance, rather than just the real-time fluctuations of each iteration or individual request. Existing metrics are not designed to provide this kind of rolling average.

I propose adding a new metrics class, for example, SlidingWindowStats, to track and expose rolling average statistics for key performance indicators. This would allow users to directly monitor the server's performance trends with a smoother, more representative metric.

This new class could track metrics such as:

Average Latency: The average time from request start to finish over the last N completed requests.

Average Throughput: The average number of tokens processed per second over the last N completed requests.

Average Time To First Token (TTFT): The average time until the first token is generated for the last N requests.

The value of N could be configurable to allow users to adjust the window size for the moving average.
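For concreteness, here is a minimal sketch of what these rolling averages would compute, assuming each finished request yields a (latency, token count, TTFT) sample. RequestSample and its field names are illustrative, not vLLM's actual API:

from collections import deque
from dataclasses import dataclass

@dataclass
class RequestSample:
    """Hypothetical per-request record; field names are illustrative."""
    latency_s: float   # end-to-end latency (start to finish) in seconds
    num_tokens: int    # tokens generated for the request
    ttft_s: float      # time to first token in seconds

def rolling_averages(window):
    """Average the three proposed indicators over the last N finished requests."""
    n = len(window)
    if n == 0:
        return {"avg_latency_s": 0.0, "avg_throughput_tok_s": 0.0, "avg_ttft_s": 0.0}
    total_latency = sum(s.latency_s for s in window)
    return {
        "avg_latency_s": total_latency / n,
        # tokens per second of request time, aggregated across the window
        "avg_throughput_tok_s": sum(s.num_tokens for s in window) / max(total_latency, 1e-9),
        "avg_ttft_s": sum(s.ttft_s for s in window) / n,
    }

window = deque(maxlen=100)  # N = 100; the deque evicts the oldest sample automatically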

Suggested Solution

Implement a sliding-window metric tracker that maintains, over a configurable request window (e.g., the last N requests):

  1. Average latency (prefill + decoding)
  2. Throughput (tokens/sec)
  3. Request completion time

Proposed Interface

from collections import deque

from prometheus_client import Gauge

class SlidingWindowStats:
    def __init__(self, window_size: int = 100):
        self.window_size = window_size
        self.request_metrics = deque(maxlen=window_size)

    def add_request(self, metrics: "RequestMetrics"):
        # Append the newest request; deque(maxlen=...) evicts the oldest,
        # giving "last N requests" semantics for free.
        self.request_metrics.append(metrics)
        # ... recompute rolling statistics over the window here

# Exposed as Prometheus metrics
sliding_window_latency = Gauge(
    "vllm:sliding_window_avg_latency_ms",
    "Average latency over last N requests",
    ["window_size"],
)
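One possible wiring, purely illustrative: refresh the gauge whenever a request finishes. The on_request_finished hook and the avg_latency_ms() helper on SlidingWindowStats are hypothetical, not existing vLLM code:

stats = SlidingWindowStats(window_size=100)

def on_request_finished(metrics):
    # Hypothetical hook: called once per finished request by the stats logger.
    stats.add_request(metrics)
    sliding_window_latency.labels(window_size=str(stats.window_size)).set(
        stats.avg_latency_ms()  # hypothetical helper averaging the window
    )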

Currently, users have to write their own scripts to collect and process the FinishedRequestStats data and calculate a moving average. This adds complexity and requires extra infrastructure, which could be avoided if vLLM provides this functionality out-of-the-box.
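For illustration, a rough sketch of that external workaround, assuming the user scrapes the server's /metrics endpoint and reads an e2e latency histogram such as vllm:e2e_request_latency_seconds. Note that even this delta-based approach yields a time window, not a last-N-requests window:

import time
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"  # assumed vLLM server address

def scrape_latency_totals():
    """Return (sum, count) of the e2e latency histogram from one scrape."""
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    total = count = 0.0
    for family in text_string_to_metric_families(body):
        if family.name == "vllm:e2e_request_latency_seconds":
            for sample in family.samples:
                if sample.name.endswith("_sum"):
                    total += sample.value
                elif sample.name.endswith("_count"):
                    count += sample.value
    return total, count

prev_sum, prev_count = scrape_latency_totals()
while True:
    time.sleep(15)
    cur_sum, cur_count = scrape_latency_totals()
    if cur_count > prev_count:
        # Average latency of the requests that finished between the two scrapes.
        print(f"avg latency: {(cur_sum - prev_sum) / (cur_count - prev_count):.3f}s")
    prev_sum, prev_count = cur_sum, cur_count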

Alternatives

Prometheus Native Functions
We evaluated using built-in Prometheus functions like:
avg_over_time(finished_request_latency_seconds[5m])

However, this approach has critical limitations:
Operates on fixed time windows rather than request-count windows
Cannot guarantee inclusion of exactly N requests (may include partial/fewer requests)
Vulnerable to traffic spikes/dips distorting the average
Impossible to implement "last 100 requests" semantics

Recording Rules

We attempted to create recording rules that aggregate finished_request_latency_seconds:

groups:
  - name: vllm_sliding_window   # hypothetical group name
    rules:
      - record: vllm:request_latency:avg_100
        expr: avg by (job) (finished_request_latency_seconds)

This fails because:

Aggregates over all requests in metric history (not sliding window)
No mechanism to limit to most recent N requests
Causes metric cardinality explosion

Conclusion:

Existing Prometheus features fundamentally cannot implement request-count-based sliding windows because:
The time-series model lacks request-ordering guarantees
There are no stateful circular-buffer semantics
Old data points cannot be evicted individually as new ones arrive

Additional context

This feature would significantly enhance vLLM's monitoring capabilities, making it easier for users to assess the stability and performance of their deployments in a production environment. It would provide a more direct and actionable metric for performance tuning and capacity planning.


CC List:

@ywang96 @DarkLight1337
