
[Feature]: Add Moving Average Statistics for Better Performance Monitoring #22480

@NumberWan


🚀 The feature, motivation and pitch

Currently, vLLM's Prometheus metrics are built from real-time, event-based stats such as IterationStats and RequestStateStats. While useful for immediate monitoring, they make it hard to get a clear, stable view of the server's long-term or recent average performance without building an external data processing and aggregation layer.

WHY:

IterationStats: emitted per scheduling step (not per request), so it is too frequent for request-level tracking
RequestStateStats: captures only request start events; it lacks completion data
FinishedRequestStats: records completed requests but does not maintain historical aggregates
LoRAStats: unrelated to performance tracking (it handles Low-Rank Adaptation metrics)

Gap: no sliding-window aggregation for request performance metrics exists.

For example, a user wants to monitor the average latency or throughput of the last 100 finished requests to understand the server's typical performance, rather than just the real-time fluctuations of each iteration or individual request. Existing metrics are not designed to provide this kind of rolling average.

I propose adding a new metrics class, for example, SlidingWindowStats, to track and expose rolling average statistics for key performance indicators. This would allow users to directly monitor the server's performance trends with a smoother, more representative metric.

This new class could track metrics such as:

Average Latency: The average time from request start to finish over the last N completed requests.

Average Throughput: The average number of tokens processed per second over the last N completed requests.

Average Time To First Token (TTFT): The average time until the first token is generated for the last N requests.

The value of N could be configurable to allow users to adjust the window size for the moving average.
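For concreteness, here is a minimal sketch of what these rolling averages would compute, assuming each finished request yields a (latency, token count, TTFT) sample. RequestSample and its field names are illustrative, not vLLM's actual API:

from collections import deque
from dataclasses import dataclass

@dataclass
class RequestSample:
    """Hypothetical per-request record; field names are illustrative."""
    latency_s: float   # end-to-end latency (start to finish) in seconds
    num_tokens: int    # tokens generated for the request
    ttft_s: float      # time to first token in seconds

def rolling_averages(window):
    """Average the three proposed indicators over the last N finished requests."""
    n = len(window)
    if n == 0:
        return {"avg_latency_s": 0.0, "avg_throughput_tok_s": 0.0, "avg_ttft_s": 0.0}
    total_latency = sum(s.latency_s for s in window)
    return {
        "avg_latency_s": total_latency / n,
        # tokens per second of request time, aggregated across the window
        "avg_throughput_tok_s": sum(s.num_tokens for s in window) / max(total_latency, 1e-9),
        "avg_ttft_s": sum(s.ttft_s for s in window) / n,
    }

window = deque(maxlen=100)  # N = 100; the deque evicts the oldest sample automatically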

Suggested Solution

Implement a sliding-window metric tracker that maintains, over a configurable request window (e.g., the last N requests):

  1. Average latency (prefill + decoding)
  2. Throughput (tokens/sec)
  3. Request completion time

Proposed Interface

from collections import deque

from prometheus_client import Gauge

class SlidingWindowStats:
    def __init__(self, window_size: int = 100):
        self.window_size = window_size
        self.request_metrics = deque(maxlen=window_size)

    def add_request(self, metrics: "RequestMetrics"):
        # Append the newest request; deque(maxlen=...) evicts the oldest,
        # giving "last N requests" semantics for free.
        self.request_metrics.append(metrics)
        # ... recompute rolling statistics over the window here

# Exposed as Prometheus metrics
sliding_window_latency = Gauge(
    "vllm:sliding_window_avg_latency_ms",
    "Average latency over last N requests",
    ["window_size"],
)
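One possible wiring, purely illustrative: refresh the gauge whenever a request finishes. The on_request_finished hook and the avg_latency_ms() helper on SlidingWindowStats are hypothetical, not existing vLLM code:

stats = SlidingWindowStats(window_size=100)

def on_request_finished(metrics):
    # Hypothetical hook: called once per finished request by the stats logger.
    stats.add_request(metrics)
    sliding_window_latency.labels(window_size=str(stats.window_size)).set(
        stats.avg_latency_ms()  # hypothetical helper averaging the window
    )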

Currently, users have to write their own scripts to collect and process the FinishedRequestStats data and calculate a moving average. This adds complexity and requires extra infrastructure, which could be avoided if vLLM provides this functionality out-of-the-box.
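For illustration, a rough sketch of that external workaround, assuming the user scrapes the server's /metrics endpoint and reads an e2e latency histogram such as vllm:e2e_request_latency_seconds. Note that even this delta-based approach yields a time window, not a last-N-requests window:

import time
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"  # assumed vLLM server address

def scrape_latency_totals():
    """Return (sum, count) of the e2e latency histogram from one scrape."""
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    total = count = 0.0
    for family in text_string_to_metric_families(body):
        if family.name == "vllm:e2e_request_latency_seconds":
            for sample in family.samples:
                if sample.name.endswith("_sum"):
                    total += sample.value
                elif sample.name.endswith("_count"):
                    count += sample.value
    return total, count

prev_sum, prev_count = scrape_latency_totals()
while True:
    time.sleep(15)
    cur_sum, cur_count = scrape_latency_totals()
    if cur_count > prev_count:
        # Average latency of the requests that finished between the two scrapes.
        print(f"avg latency: {(cur_sum - prev_sum) / (cur_count - prev_count):.3f}s")
    prev_sum, prev_count = cur_sum, cur_count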

Alternatives

Prometheus Native Functions
We evaluated using built-in Prometheus functions like:
avg_over_time(finished_request_latency_seconds[5m])

However, this approach has critical limitations:
Operates on fixed time windows rather than request-count windows
Cannot guarantee inclusion of exactly N requests (may include partial/fewer requests)
Vulnerable to traffic spikes/dips distorting the average
Impossible to implement "last 100 requests" semantics

Recording Rules

We attempted to create recording rules that aggregate finished_request_latency_seconds:

groups:
  - name: vllm_sliding_window   # hypothetical group name
    rules:
      - record: vllm:request_latency:avg_100
        expr: avg by (job) (finished_request_latency_seconds)

This fails because:

Aggregates over all requests in metric history (not sliding window)
No mechanism to limit to most recent N requests
Causes metric cardinality explosion

Conclusion:

Existing Prometheus features fundamentally cannot implement request-count-based sliding windows because:
The time-series model lacks request-ordering guarantees
There are no stateful circular-buffer semantics
Old data points cannot be evicted individually as new ones arrive

Additional context

This feature would significantly enhance vLLM's monitoring capabilities, making it easier for users to assess the stability and performance of their deployments in a production environment. It would provide a more direct and actionable metric for performance tuning and capacity planning.


CC List:

@ywang96 @DarkLight1337
