* [5. Configure the ScaledObject](#5-configure-the-scaledobject)
* [6. Test Autoscaling](#6-test-autoscaling)
* [7. Scale down to zero](#7-scale-down-to-zero)
* [8. Cleanup](#8-cleanup)
* [Additional Resources](#additional-resources)
---
> **Note**: This tutorial only covers request autoscaling for non-disaggregated prefill deployments.
## Prerequisites
* A working vLLM deployment on Kubernetes (see [01-minimal-helm-installation](01-minimal-helm-installation.md))
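Before continuing, it may help to confirm the baseline deployment is healthy. This is a minimal sketch, assuming the deployment name used later in this tutorial (`vllm-llama3-deployment-vllm`) and the `default` namespace; adjust to match your installation:

```bash
# Confirm the vLLM deployment from the earlier tutorial exists and is Ready
kubectl get deployment vllm-llama3-deployment-vllm -n default

# Confirm its pods are Running
kubectl get pods -n default | grep vllm
```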
### 5. Configure the ScaledObject
The following `ScaledObject` configuration is provided in `tutorials/assets/values-20-keda.yaml`. Review its contents:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment-vllm
  minReplicaCount: 1
  maxReplicaCount: 2
  pollingInterval: 15
  cooldownPeriod: 360
  triggers:
    - type: prometheus
      metadata:
        # ... Prometheus serverAddress, query, and threshold are set in the asset file
```
Apply the ScaledObject:
```bash
cd ../tutorials
kubectl apply -f assets/values-20-keda.yaml
```
This tells KEDA to watch the Prometheus queue-length metric every 15 seconds and keep the deployment between 1 and 2 replicas, waiting for the 360-second cooldown before scaling back down. You can verify that KEDA registered the ScaledObject as shown below.
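As a quick sanity check (not part of the tutorial's asset files), you can confirm that KEDA admitted the ScaledObject and created its backing HPA; `keda-hpa-<scaledobject-name>` is KEDA's default HPA naming convention, so adjust if your installation differs:

```bash
# The ScaledObject should report READY=True once KEDA can reach Prometheus
kubectl get scaledobject vllm-scaledobject -n default

# KEDA drives scaling through an HPA it creates for the ScaledObject
kubectl get hpa keda-hpa-vllm-scaledobject -n default
```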
---
### 7. Scale Down to Zero
Sometimes you want to scale down to zero replicas when there's no traffic. This is a unique capability of KEDA compared to Kubernetes' HPA, which always maintains at least one replica. Scale-to-zero is particularly useful for:
* **Cost optimization**: Eliminate resource usage during idle periods
* **Resource efficiency**: Free up GPU resources for other workloads
* **Cold start scenarios**: Scale up only when requests arrive
We provide this capability through a dual-trigger configuration. To enable it, modify `tutorials/assets/values-20-keda.yaml` as follows:
```yaml
# KEDA ScaledObject for vLLM deployment with scale-to-zero capability
# This configuration enables automatic scaling of vLLM pods based on queue length metrics
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: vllm-llama3-deployment-vllm
  minReplicaCount: 0  # Allow scaling down to zero
  maxReplicaCount: 2
  # How often KEDA should check the metrics (in seconds)
  pollingInterval: 15
  # How long to wait before scaling down after scaling up (in seconds)
  # ... the cooldown and the dual-trigger configuration continue in the asset file
```
Once the deployment has scaled down to zero, send a request to trigger a scale-up:

```bash
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```
You should initially get an HTTP 503 error saying the service is temporarily unavailable. Within a few minutes, however, a fresh pod should be brought up and the same query should succeed.
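To observe the scale-up as it happens, you can watch the deployment in a second terminal. This is a sketch that assumes the deployment name from the ScaledObject above and the `default` namespace:

```bash
# Watch the replica count move from 0 back to 1 as KEDA reacts to the queued request
kubectl get deployment vllm-llama3-deployment-vllm -n default -w

# In another terminal, watch the new pod start up
kubectl get pods -n default -w
```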
**Expected behavior:**
* **Scale down**: Pods terminate when there's no traffic and no queued requests
* **Scale up**: New pods start when requests arrive, even from zero replicas
* **Cold start delay**: The first request after scale-to-zero experiences a delay while a new pod initializes (one way to measure it is sketched below)
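As an optional illustration (a sketch, not part of the tutorial's assets), the loop below measures the cold-start delay by repeating a small completion request until the gateway stops returning 503; the endpoint and model name are taken from the curl example above:

```bash
# Poll the completions endpoint once per second and report how long the cold start took
start=$(date +%s)
until curl -s -o /dev/null -w "%{http_code}" http://localhost:30080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hi", "max_tokens": 1}' \
    | grep -q 200; do
  sleep 1
done
echo "Cold start took $(( $(date +%s) - start )) seconds"
```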
---
### 8. Cleanup
To remove KEDA configuration and observability components:
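A minimal sketch of that cleanup, assuming KEDA and the Prometheus stack were installed with Helm into the `keda` and `monitoring` namespaces (adjust release and namespace names to match your installation):

```bash
# Remove the ScaledObject (and the HPA that KEDA created for it)
kubectl delete -f assets/values-20-keda.yaml

# Uninstall KEDA and the monitoring stack if they were installed only for this tutorial
helm uninstall keda -n keda
helm uninstall prometheus -n monitoring
```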