Production LLM Inference with vLLM on Kubernetes

Introduction

Deploying large language models at scale requires more than a docker run command. This guide covers a production-grade stack using vLLM as the inference engine, NVIDIA Multi-Instance GPU (MIG) for hardware isolation, and Kubernetes for orchestration.

Architecture Overview

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Ingress/Gateway│────▶│ vLLM Service │────▶│  vLLM Pod (GPU) │
│  (Rate limiting)│     │  (Load Bal)  │     │  (MIG slice)    │
└─────────────────┘     └──────────────┘     └─────────────────┘
                                                        │
                              ┌─────────────────────────┘
                              ▼
                     ┌─────────────────┐
                     │   Shared PVC    │
                     │ (Model weights) │
                     └─────────────────┘
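
To make the request path in the diagram concrete, here is a minimal client sketch that targets the vLLM Service from inside the cluster. It assumes the Service name and port used later in this guide, the default namespace, and that the served model name equals the --model path (vLLM's default when no served model name is set). The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed in-cluster address: Service name and port from this guide.
BASE_URL = "http://vllm-llama-70b-svc:8000"


def build_completion_request(
    prompt: str,
    max_tokens: int = 64,
    model: str = "/models/llama-3-70b",  # assumed: served name defaults to --model
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 30.0) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling complete("Hello") only works from a pod that can resolve the Service; external clients would go through the ingress or gateway instead.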

Model Preparation

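Download the weights from Hugging Face. vLLM loads the Hugging Face checkpoint format directly, so no conversion step is needed. This example uses Meta-Llama-3-70B-Instruct, a gated model that requires an accepted license and an access token: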

# Using Hugging Face CLI
huggingface-cli download meta-llama/Meta-Llama-3-70B-Instruct \
  --local-dir /models/llama-3-70b \
  --local-dir-use-symlinks False

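# Optional: run the AWQ search (llm-awq) for 4-bit weights. This dumps search results only;
# export a quantized checkpoint and point --model at it before serving with vLLM.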
python -m awq.entry --model_path /models/llama-3-70b \
  --w_bit 4 --q_group_size 128 \
  --run_awq --dump_awq awq_cache/llama-3-70b-w4-g128.pt
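
Whether quantization is worth the extra step comes down to simple arithmetic on the weight footprint. The sketch below is a rough estimate only: it assumes a nominal 70 billion parameters and ignores quantization metadata (scales and zero points), activations, and runtime overhead.

```python
PARAMS_70B = 70_000_000_000  # nominal parameter count (an approximation)


def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate weight storage: parameters times bits per weight, in bytes."""
    return num_params * bits_per_weight // 8


fp16_bytes = weight_bytes(PARAMS_70B, 16)
awq4_bytes = weight_bytes(PARAMS_70B, 4)

print(f"FP16 weights:  {fp16_bytes / 1e9:.0f} GB")
print(f"4-bit weights: {awq4_bytes / 1e9:.0f} GB")
```

Under these assumptions, 4-bit quantization reduces the weights to a quarter of their FP16 size, which is what makes a 70B model a candidate for smaller GPU partitions at all.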

Kubernetes Deployment

Node Preparation — MIG Configuration

# NVIDIA Device Plugin config for MIG
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  config.yaml: |
    version: v1
    flags:
      migStrategy: mixed
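
With the mixed strategy, each MIG profile is advertised as its own extended resource, such as the nvidia.com/mig-2g.20gb resource requested in the Deployment. As a simplified sketch, the following estimates how many instances of one profile fit on a single A100-80GB, assuming 7 compute slices and 80 GB of memory per GPU. Real MIG placement rules are stricter than this, so treat the result as an upper bound; the function names are hypothetical.

```python
import re


def parse_mig_profile(profile: str) -> tuple[int, int]:
    """Split a profile such as '2g.20gb' into (compute_slices, memory_gb)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"not a MIG profile name: {profile!r}")
    return int(match.group(1)), int(match.group(2))


def max_instances(profile: str, compute_slices: int = 7, memory_gb: int = 80) -> int:
    """Upper bound on instances of one profile per GPU (assumed A100-80GB limits)."""
    slices, mem = parse_mig_profile(profile)
    return min(compute_slices // slices, memory_gb // mem)


def resource_name(profile: str) -> str:
    """Extended resource name exposed under the 'mixed' MIG strategy."""
    return f"nvidia.com/mig-{profile}"


for name in ("1g.10gb", "2g.20gb", "3g.40gb", "7g.80gb"):
    print(resource_name(name), max_instances(name))
```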

vLLM Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.4
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /models/llama-3-70b
            - --tensor-parallel-size
            - "2"
            - --max-model-len
            - "8192"
            - --gpu-memory-utilization
            - "0.92"
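          resources:
            limits:
              # Two 2g.20gb slices provide 40 GB in total. MIG does not support GPU-to-GPU
              # peer-to-peer transfers, so validate tensor parallelism on this layout first;
              # requesting whole GPUs (nvidia.com/gpu) avoids the restriction.
              nvidia.com/mig-2g.20gb: "2"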
          volumeMounts:
            - name: model-storage
              mountPath: /models
          ports:
            - containerPort: 8000
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-weights-pvc
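
The --max-model-len and --gpu-memory-utilization flags interact with the memory left over after the weights are loaded. The sketch below estimates KV cache capacity under stated assumptions: Llama-3-70B's published shape (80 layers, 8 KV heads, head dimension 128), an FP16 KV cache, 35 GB of 4-bit weights, and 40 GB of total memory across the two slices. It ignores activations, quantization metadata, and CUDA context overhead, so real capacity will be lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(total_mem_bytes: int, utilization_pct: int,
                      weight_bytes: int, per_token_bytes: int) -> int:
    """Tokens that fit in the memory budget left after loading the weights."""
    budget = total_mem_bytes * utilization_pct // 100
    return max(0, (budget - weight_bytes) // per_token_bytes)


# Assumed model shape for Llama-3-70B (grouped-query attention).
PER_TOKEN = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

capacity = kv_token_capacity(
    total_mem_bytes=40_000_000_000,  # two 2g.20gb slices, nominal
    utilization_pct=92,              # --gpu-memory-utilization 0.92
    weight_bytes=35_000_000_000,     # assumed 4-bit weights
    per_token_bytes=PER_TOKEN,
)
print(f"{PER_TOKEN} bytes per token, about {capacity} cacheable tokens")
```

If the estimated capacity comes out below --max-model-len, a single full-length request cannot fit in the cache, which suggests either a smaller context limit or more memory per replica.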

Service and Autoscaling

apiVersion: v1
kind: Service
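metadata:
  name: vllm-llama-70b-svc
  labels:
    app: vllm-llama-70b   # matched by the ServiceMonitor selector
spec:
  selector:
    app: vllm-llama-70b
  ports:
    - name: http          # referenced by the ServiceMonitor endpoint
      port: 8000
      targetPort: 8000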
  type: ClusterIP
---
# Custom Metrics HPA using Prometheus + vLLM metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
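        metric:
          # Requires a custom metrics adapter (for example prometheus-adapter) that exposes this metric
          name: vllm:gpu_cache_usage_perc
        target:
          type: AverageValue
          # The gauge ranges from 0 to 1, so 750m means 75% KV cache usage
          averageValue: "750m"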
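
To reason about how this HPA reacts, it helps to apply the scaling rule from the Kubernetes documentation by hand: desired replicas equal the current replicas times the ratio of current to target metric value, rounded up. The sketch below assumes the default 10% tolerance and the replica bounds from the manifest above; it leaves out stabilization windows and scaling policies.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int = 2,
                     max_replicas: int = 8, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: ceil(current * currentValue / targetValue), clamped."""
    ratio = current_value / target_value
    # Inside the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Average KV cache usage of 0.90 against a 0.75 target, starting from 2 replicas.
print(desired_replicas(2, 0.90, 0.75))
```

Because KV cache usage saturates at 1.0, the ratio against a 0.75 target never exceeds about 1.33, so each evaluation can grow the Deployment by at most a third.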

Performance Benchmarks

Batch Size   Prefill (tok/s)   Decode (tok/s)   TTFT (ms)
1            12,400            42               18
8            11,800            312              22
32           9,200             1,180            35
64           6,100             1,920            58

Measured on 2×A100-80GB (MIG 2g.20gb slices) with AWQ 4-bit quantization.
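
The table shows aggregate throughput, but users experience per-request speed. Assuming the decode column is the total across all sequences in the batch (the table does not say so explicitly), dividing by batch size gives the rate each request sees, and its inverse gives the time between tokens.

```python
# (batch size, aggregate decode tok/s) taken from the table above.
BENCHMARKS = [(1, 42), (8, 312), (32, 1180), (64, 1920)]


def per_sequence_rate(batch_size: int, decode_tps: float) -> float:
    """Decode tokens per second seen by one request, assuming an even split."""
    return decode_tps / batch_size


def inter_token_latency_ms(batch_size: int, decode_tps: float) -> float:
    """Average time between tokens for one request, in milliseconds."""
    return 1000.0 / per_sequence_rate(batch_size, decode_tps)


for batch, tps in BENCHMARKS:
    rate = per_sequence_rate(batch, tps)
    gap = inter_token_latency_ms(batch, tps)
    print(f"batch {batch:>2}: {rate:6.2f} tok/s per request, {gap:5.1f} ms per token")
```

Under that assumption, going from batch size 1 to 64 raises aggregate decode throughput roughly 46-fold while each request slows from 42 to 30 tokens per second.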

Observability Stack

vLLM's OpenAI-compatible server exposes Prometheus metrics at /metrics on the serving port, so no separate exporter is needed. Scrape them with a ServiceMonitor:

# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      app: vllm-llama-70b
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Key alerts:

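vllm:gpu_cache_usage_perc > 0.95 - approaching KV cache exhaustion
histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 2 - p95 time to first token above 2 s (the metric is a histogram, so alert on a quantile)
rate(vllm:generation_tokens_total[5m]) == 0 and vllm:num_requests_running > 0 - inference stall (requests in flight but no tokens produced)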
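
For a quick check without Prometheus, the same gauge can be read straight from a pod's /metrics output. The following is a minimal sketch of a parser for the Prometheus text format; it assumes samples carry no trailing timestamp, and the sample payload and function names are hypothetical.

```python
def parse_gauge(metrics_text: str, name: str) -> list[float]:
    """Return every sample value for `name` found in Prometheus text output."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            values.append(float(value))
    return values


def kv_cache_alert(values: list[float], threshold: float = 0.95) -> bool:
    """True if any series exceeds the KV cache usage threshold."""
    return any(v > threshold for v in values)


# Hypothetical payload in the shape of a /metrics response.
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="/models/llama-3-70b"} 0.42
vllm:num_requests_running{model_name="/models/llama-3-70b"} 3.0
"""

usage = parse_gauge(SAMPLE, "vllm:gpu_cache_usage_perc")
print(usage, kv_cache_alert(usage))
```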

Security Considerations

Network Policies: Restrict egress from vLLM pods to only model registries and telemetry endpoints
RBAC: Use dedicated ServiceAccounts with minimal permissions
Model Integrity: Verify model checksums in an initContainer before the vLLM container starts
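
The checksum verification could run as a small script in an initContainer. The sketch below assumes a manifest that maps relative file names to expected SHA-256 digests; the manifest format, how it is distributed, and the function names are all hypothetical. Note that checksums detect corruption or tampering only when the manifest itself comes from a trusted source.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weight shards stay out of memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(model_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or whose digest differs."""
    failures = []
    for name, expected in manifest.items():
        path = model_dir / name
        if not path.is_file() or sha256_file(path) != expected.lower():
            failures.append(name)
    return failures
```

An initContainer would call verify_manifest on the mounted model directory and exit with a non-zero status if the returned list is not empty, which keeps the vLLM container from starting on unverified weights.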