Production LLM Inference with vLLM on Kubernetes
An end-to-end guide to deploying high-throughput LLM inference using vLLM, NVIDIA MIG, and Kubernetes scheduling constraints in enterprise environments.
By Keith Rose
Introduction
Deploying large language models at scale requires more than a docker run command. This guide covers a production-grade stack using vLLM as the inference engine, NVIDIA Multi-Instance GPU (MIG) for hardware isolation, and Kubernetes for orchestration.
Architecture Overview
┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│ Ingress/Gateway │────▶│ vLLM Service │────▶│ vLLM Pod (GPU)  │
│ (Rate limiting) │     │ (Load Bal)   │     │ (MIG slice)     │
└─────────────────┘     └──────────────┘     └─────────────────┘
                                                      │
                            ┌─────────────────────────┘
                            ▼
                   ┌─────────────────┐
                   │   Shared PVC    │
                   │ (Model weights) │
                   └─────────────────┘
Model Preparation
Download the model weights from Hugging Face. vLLM loads the Hugging Face checkpoint format directly, so no conversion is needed unless you quantize. This example uses Meta-Llama-3-70B-Instruct:
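# Using Hugging Face CLI (gated model: accept the license and log in first)
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct \
  --local-dir /models/llama-3-70b \
  --local-dir-use-symlinks False

# Quantize to AWQ for memory efficiency (optional).
# This step runs the AWQ search and saves the result under awq_cache/;
# serving 4-bit weights still requires a quantized checkpoint that vLLM can load.
python -m awq.entry --model_path /models/llama-3-70b \
  --w_bit 4 --q_group_size 128 \
  --run_awq --dump_awq awq_cache/llama-3-70b-w4-g128.pt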
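To see why quantization matters for slice sizing, the weight footprint can be estimated from the parameter count alone. The following is a rough sketch, not a measurement: it assumes a nominal 70e9 parameters and ignores AWQ scale/zero-point overhead, KV cache, and activations.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Nominal parameter count (assumption for illustration only).
PARAMS_70B = 70e9

fp16_gib = weight_memory_gib(PARAMS_70B, 16)
awq4_gib = weight_memory_gib(PARAMS_70B, 4)

print(f"FP16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {awq4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights come to roughly 130 GiB and the 4-bit weights to roughly 33 GiB, before any memory is reserved for the KV cache.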
Kubernetes Deployment
Node Preparation — MIG Configuration
# NVIDIA Device Plugin config for MIG
# The MIG strategy is set under "flags" in the device plugin config schema.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  config.json: |
    {
      "version": "v1",
      "flags": {
        "migStrategy": "mixed"
      },
      "sharing": {
        "timeSlicing": {}
      }
    }
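A quick capacity check helps before sizing the node pool. The sketch below assumes the limit from NVIDIA's MIG documentation of three 2g.20gb instances per A100-80GB, and uses the two-slices-per-pod request from the Deployment in the next section. It ignores the constraint that both slices of a pod must sit on the same node, so treat the result as a lower bound.

```python
import math

# Assumption: an A100-80GB can be partitioned into at most three 2g.20gb instances.
SLICES_PER_GPU = 3


def gpus_needed(replicas: int, slices_per_pod: int,
                slices_per_gpu: int = SLICES_PER_GPU) -> int:
    """Lower bound on physical GPUs needed to back the requested MIG slices."""
    return math.ceil(replicas * slices_per_pod / slices_per_gpu)


min_gpus = gpus_needed(replicas=2, slices_per_pod=2)
max_gpus = gpus_needed(replicas=8, slices_per_pod=2)
print(f"GPUs at 2 replicas: {min_gpus}, at 8 replicas: {max_gpus}")
```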
vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.4
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            # To serve AWQ weights, point --model at the quantized
            # checkpoint and add: --quantization awq
            - --model
            - /models/llama-3-70b
            - --tensor-parallel-size
            - "2"
            - --max-model-len
            - "8192"
            - --gpu-memory-utilization
            - "0.92"
          resources:
            limits:
              # Caveat: NVIDIA's MIG documentation lists no GPU-to-GPU P2P
              # between MIG instances, so verify that tensor parallelism
              # across two slices works on your driver/NCCL stack.
              nvidia.com/mig-2g.20gb: "2"
          volumeMounts:
            - name: model-storage
              mountPath: /models
          ports:
            - name: http
              containerPort: 8000
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-weights-pvc
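Once the pods are ready, the server speaks the OpenAI-compatible HTTP API. The following is a minimal client sketch using only the standard library. It assumes the Service name `vllm-llama-70b-svc` from this guide and that no `--served-model-name` is set, in which case the model is addressed by its `--model` path. The helper names are illustrative.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


req = build_completion_request(
    "http://vllm-llama-70b-svc:8000",  # in-cluster Service DNS name
    "/models/llama-3-70b",             # defaults to the --model value
    "Explain MIG in one sentence.",
)
# Run from a pod inside the cluster:
# print(complete(req))
```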
Service and Autoscaling
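apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-70b-svc
  labels:
    app: vllm-llama-70b  # lets the ServiceMonitor selector match this Service
spec:
  selector:
    app: vllm-llama-70b
  ports:
    - name: http  # referenced by the ServiceMonitor endpoint
      port: 8000
      targetPort: 8000
  type: ClusterIP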
---
# Custom Metrics HPA using Prometheus + vLLM metrics
# (requires a custom metrics adapter, such as prometheus-adapter, to expose the metric)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm:gpu_cache_usage_perc
        target:
          type: AverageValue
          # The gauge reports a 0-1 fraction, so 75% is 0.75 (750m), not 75.
          averageValue: "750m"
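The replica count the HPA settles on follows the formula in the Kubernetes documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max bounds. The sketch below illustrates that arithmetic. It assumes the KV cache gauge is a 0 to 1 fraction with a target of 0.75, and it omits stabilization windows and scaling policies.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float,
                     min_replicas: int = 2, max_replicas: int = 8,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for an AverageValue target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance: the HPA leaves the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 0.90, 0.75))  # cache pressure above target: scale up
print(desired_replicas(4, 0.78, 0.75))  # within tolerance: no change
print(desired_replicas(8, 0.95, 0.75))  # capped at maxReplicas
print(desired_replicas(4, 0.30, 0.75))  # low usage: scale down to the floor
```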
Performance Benchmarks
| Batch Size | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|---|---|---|---|
| 1 | 12,400 | 42 | 18 |
| 8 | 11,800 | 312 | 22 |
| 32 | 9,200 | 1,180 | 35 |
| 64 | 6,100 | 1,920 | 58 |
Measured on 2×A100-80GB (MIG 2g.20gb slices) with AWQ 4-bit quantization.
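The decode column is easier to reason about per request. Assuming the decode figures are aggregate tokens per second across the whole batch (the table does not state this explicitly), dividing by batch size gives the rate each sequence sees and the implied time between tokens:

```python
# (batch size, aggregate decode tok/s) taken from the table above.
ROWS = [(1, 42), (8, 312), (32, 1180), (64, 1920)]


def per_sequence_stats(batch_size: int, decode_tok_s: float) -> tuple[float, float]:
    """Return (tokens/s per sequence, milliseconds between tokens)."""
    per_seq = decode_tok_s / batch_size
    return per_seq, 1000.0 / per_seq


stats = {batch: per_sequence_stats(batch, decode) for batch, decode in ROWS}
for batch, (per_seq, itl_ms) in stats.items():
    print(f"batch {batch:>2}: {per_seq:6.2f} tok/s per sequence, {itl_ms:5.1f} ms/token")
```

Under that assumption, aggregate throughput rises about 46× from batch 1 to batch 64 while the per-sequence rate drops from 42 to 30 tok/s.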
Observability Stack
vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics on the serving port, so no separate exporter is required. Scrape it with a ServiceMonitor:
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      app: vllm-llama-70b
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Key alerts:
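- KV cache approaching exhaustion: vllm:gpu_cache_usage_perc > 0.95
- Unacceptable latency (p95 TTFT above 2 s; the metric is a histogram, so alert on a quantile): histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 2
- Inference stall (this also fires when there is no incoming traffic): rate(vllm:prompt_tokens_total[5m]) == 0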
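For a quick manual check without Prometheus, the same KV cache threshold can be evaluated directly against the text returned by /metrics. This is a minimal sketch: the sample payload is hypothetical, and the parser assumes sample lines of the form `name{labels} value` with no trailing timestamp.

```python
# Hypothetical excerpt of a /metrics response, for illustration only.
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="/models/llama-3-70b"} 0.97
vllm:num_requests_running{model_name="/models/llama-3-70b"} 12.0
"""


def parse_samples(text: str, metric_name: str) -> list[float]:
    """Collect the values of every sample line for one metric name."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            values.append(float(value))
    return values


def kv_cache_alert(text: str, threshold: float = 0.95) -> bool:
    """True if any series of the KV cache gauge exceeds the threshold."""
    return any(v > threshold for v in parse_samples(text, "vllm:gpu_cache_usage_perc"))


print(kv_cache_alert(SAMPLE))
```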
Security Considerations
- Network Policies: Restrict egress from vLLM pods to only model registries and telemetry endpoints
- RBAC: Use dedicated ServiceAccounts with minimal permissions
- Model Signing: Verify model checksums at pod init via initContainers
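The checksum step in the last item could be sketched as a small script run by an initContainer. The manifest format (relative path mapped to an expected SHA-256 hex digest) and the function names are assumptions for illustration; the demo at the end uses a temporary directory as a stand-in for the mounted model volume.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight shards need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are missing or whose digest differs."""
    bad = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_file(target) != expected.lower():
            bad.append(rel_path)
    return bad


# Demo with a stand-in file instead of real weights.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shard = root / "weights.bin"
    shard.write_bytes(b"stand-in for a weight shard")
    manifest = {"weights.bin": sha256_file(shard)}
    ok_result = find_mismatches(root, manifest)
    shard.write_bytes(b"tampered")
    tampered_result = find_mismatches(root, manifest)

print(ok_result, tampered_result)
```

In an initContainer, the script would exit with a non-zero status when the mismatch list is non-empty, which keeps the vLLM container from starting. A plain checksum only detects corruption or substitution relative to the manifest, so the manifest itself must come from a trusted source.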