Labs

78 labs · Azure AKS · Feb–Dec 2026 · 4 hrs/day · Every Helm chart committed. Each lab is one namespace, one concept, fully instrumented.

github.com/bytes1inger/ai-infra-labs-2026
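
A minimal sketch of the one-lab-one-namespace convention above, assuming a charts/<lab> layout in the repo; the chart path, release name, and namespace scheme here are placeholders, not the repo's actual layout.

```python
# Sketch: install one lab's chart into its own namespace (Helm 3 on the PATH).
# Chart directory, release name, and namespace scheme are placeholders.
import subprocess

def deploy_lab(lab: str, chart_dir: str = "charts") -> None:
    """Install lab NN (e.g. 'l06') into namespace 'lab-NN', creating it if needed."""
    namespace = f"lab-{lab[1:]}"           # one lab == one namespace
    subprocess.run(
        [
            "helm", "upgrade", "--install", lab, f"{chart_dir}/{lab}",
            "--namespace", namespace,
            "--create-namespace",          # create the namespace on first install
            "--wait",                      # block until resources report ready
        ],
        check=True,
    )

if __name__ == "__main__":
    deploy_lab("l06")
```
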
PERMANENT BASELINE · L06 Phi-3 Mini 3.8B · fp16 · T4 16GB
Concurrency   Tokens/sec   p95 TTFT   GPU Util
1             TBD          TBD        TBD
4             TBD          TBD        TBD
8             TBD          TBD        TBD
16            TBD          TBD        TBD

Updated when L06 is complete. These numbers are the control group for all subsequent labs.
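
How those cells could be filled, as a hedged sketch: a fixed-concurrency async client against vLLM's OpenAI-compatible /v1/completions endpoint with streaming on, recording time to first streamed chunk per request and aggregate tokens/sec. The endpoint URL, prompt, and request counts are placeholders, not the final L06 load-test contract.

```python
# Sketch of a fixed-concurrency throughput / TTFT probe against a vLLM
# OpenAI-compatible server. URL, prompt, and request counts are placeholders.
import asyncio
import time

import aiohttp

URL = "http://vllm.lab-06.svc.cluster.local:8000/v1/completions"   # placeholder endpoint
PAYLOAD = {
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "Explain the KV cache in one paragraph.",
    "max_tokens": 256,
    "stream": True,
}

async def one_request(session: aiohttp.ClientSession, ttfts: list, tokens: list) -> None:
    start = time.perf_counter()
    first = None
    count = 0
    async with session.post(URL, json=PAYLOAD) as resp:
        async for line in resp.content:                 # SSE lines: b"data: {...}\n"
            if not line.startswith(b"data:") or b"[DONE]" in line:
                continue
            if first is None:
                first = time.perf_counter() - start     # time to first streamed chunk
            count += 1                                  # ~one chunk per token in this sketch
    if first is not None:
        ttfts.append(first)
    tokens.append(count)

async def run(concurrency: int, requests_per_worker: int = 5) -> None:
    ttfts: list = []
    tokens: list = []
    t0 = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        async def worker():
            for _ in range(requests_per_worker):
                await one_request(session, ttfts, tokens)
        await asyncio.gather(*(worker() for _ in range(concurrency)))
    wall = time.perf_counter() - t0
    p95 = sorted(ttfts)[int(0.95 * (len(ttfts) - 1))]
    print(f"c={concurrency:2d}  tok/s={sum(tokens) / wall:7.1f}  p95 TTFT={p95 * 1000:6.0f} ms")

if __name__ == "__main__":
    for c in (1, 4, 8, 16):
        asyncio.run(run(c))
```
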

Phase 1 — Foundation

Weeks 1–6
# Name Status
L01 The Instrument Panel ✓ Done
L02 GPU Metrics from Zero ○ Planned · sketch below
L03 Alert Before It Breaks ○ Planned
L04 Loki Query Language for LLM ○ Planned
L05 vLLM Standing Up ○ Planned
L06 The Load Test Contract ○ Planned
L07 The Scheduler's View ○ Planned
L08 Node Affinity vs Taints ○ Planned
L09 Model Weights as Infrastructure ○ Planned
L10 Secrets & Workload Identity ○ Planned
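
One hedged sketch for the L02 territory: reading per-GPU utilization from dcgm-exporter through the Prometheus HTTP API. The in-cluster Prometheus URL is a placeholder; DCGM_FI_DEV_GPU_UTIL is dcgm-exporter's standard utilization gauge, and the pod/gpu labels assume the usual Kubernetes label mapping.

```python
# Sketch: read per-GPU utilization from Prometheus scraping dcgm-exporter.
# The Prometheus URL is a placeholder; the metric name is dcgm-exporter's standard gauge.
import requests

PROM = "http://prometheus.monitoring.svc.cluster.local:9090"   # placeholder

def gpu_utilization() -> dict:
    """Return {'<pod>/gpu<idx>': utilization %} from the most recent scrape."""
    resp = requests.get(
        f"{PROM}/api/v1/query",
        params={"query": "DCGM_FI_DEV_GPU_UTIL"},
        timeout=5,
    )
    resp.raise_for_status()
    utilization = {}
    for series in resp.json()["data"]["result"]:
        labels = series["metric"]
        key = f"{labels.get('pod', '?')}/gpu{labels.get('gpu', '?')}"
        utilization[key] = float(series["value"][1])            # value is [timestamp, "string"]
    return utilization

if __name__ == "__main__":
    for gpu, util in gpu_utilization().items():
        print(f"{gpu}: {util:.0f}%")
```
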

Phase 2 — GPU Internals

Weeks 7–14
# Name Status
L11 The KV Cache Budget ○ Planned · worked example below
L12 Eviction and Swap ○ Planned
L13 Continuous Batching Deep Dive ○ Planned
L14 Chunked Prefill ○ Planned
L15 fp16 vs AWQ vs GPTQ ○ Planned
L16 Fitting Larger Models on Smaller GPUs ○ Planned
L17 Prefix Caching ○ Planned
L18 Speculative Decoding ○ Planned
L19 Tensor Parallelism on 2xT4 ○ Planned
L20 Pipeline Parallelism ○ Planned
L21 max_model_len Tuning ○ Planned
L22 Throughput vs Latency Mode ○ Planned
L23 T4 vs A100: The Cost-Performance Cliff ○ Planned
L24 Right-sizing GPU SKU ○ Planned
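
A worked back-of-envelope for L11, assuming Phi-3-mini's published config (32 layers, 32 KV heads, head dim 96) and an fp16 cache; treat those numbers as assumptions to re-check against the model's config.json.

```python
# Back-of-envelope KV-cache budget (L11). Config values are assumptions taken from
# Phi-3-mini's published config; verify against config.json before trusting them.
LAYERS = 32          # num_hidden_layers (assumed)
KV_HEADS = 32        # num_key_value_heads (assumed)
HEAD_DIM = 96        # hidden_size 3072 / 32 heads (assumed)
BYTES = 2            # fp16

def kv_bytes_per_token() -> int:
    # 2x for K and V, per layer, per KV head, per head dim, per element byte
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES

def tokens_in_budget(gpu_gb: float, weights_gb: float, util: float = 0.9) -> int:
    """Tokens the KV cache can hold after weights, at vLLM's gpu_memory_utilization."""
    cache_bytes = (gpu_gb * util - weights_gb) * 1024**3
    return int(cache_bytes // kv_bytes_per_token())

if __name__ == "__main__":
    print(f"KV per token: {kv_bytes_per_token() / 1024:.0f} KiB")      # 393,216 B ≈ 384 KiB
    # ~3.8B params * 2 bytes ≈ 7.6 GB of fp16 weights (rough)
    print(f"Tokens in cache on a 16 GB T4: ~{tokens_in_budget(16, 7.6):,}")
```
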

Phase 3 — Autoscaling

Weeks 15–22
# Name Status
L25 HPA on CPU Workloads ○ Planned
L26 HPA on Custom Metrics ○ Planned
L27 KEDA ScaledObject Basics ○ Planned
L28 KEDA on LLM Queue Depth ○ Planned · sketch below
L29 GPU Node Cold Start: Full Timeline ○ Planned
L30 Warm Pool Strategy ○ Planned
L31 Scale to Zero with KEDA ○ Planned
L32 The Warm Replica Floor ○ Planned
L33 Predictive vs Reactive Scaling ○ Planned
L34 Scaling Lag Measurement ○ Planned
L35 CPU Tier vs GPU Tier Scaling ○ Planned
L36 Request Queue Architecture ○ Planned
L37 NGINX Ingress for LLM APIs ○ Planned
L38 Canary Routing for Model Versions ○ Planned
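
For L28, one hedged sketch: the request gateway mirrors its queue depth into a Prometheus gauge via prometheus_client, and a KEDA ScaledObject using the prometheus scaler can then target that series. Metric name and port are placeholders; the ScaledObject YAML itself belongs to the lab.

```python
# Sketch: expose LLM request queue depth as a Prometheus gauge so a KEDA
# prometheus-scaler ScaledObject can target it. Metric name and port are placeholders.
import random
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge(
    "llm_gateway_queue_depth",   # placeholder series name for the ScaledObject query
    "Requests accepted by the gateway but not yet dispatched to vLLM",
)

class RequestQueue:
    """Toy FIFO that mirrors its length into the gauge on every change."""
    def __init__(self) -> None:
        self._items = []

    def put(self, item) -> None:
        self._items.append(item)
        QUEUE_DEPTH.set(len(self._items))

    def get(self):
        item = self._items.pop(0)
        QUEUE_DEPTH.set(len(self._items))
        return item

if __name__ == "__main__":
    start_http_server(9100)          # serves /metrics on :9100 for Prometheus to scrape
    queue = RequestQueue()
    while True:                      # synthetic load so the gauge moves during the lab
        if random.random() < 0.6:
            queue.put(object())
        elif queue._items:
            queue.get()
        time.sleep(0.5)
```
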

Phase 4 — RAG Systems

Weeks 23–30
# Name Status
L39 Qdrant on AKS ○ Planned
L40 Indexing at Scale ○ Planned
L41 Embedding Service on CPU ○ Planned
L42 Embedding Service on GPU ○ Planned
L43 End-to-End RAG: First Contact ○ Planned
L44 RAG Latency Decomposition ○ Planned · sketch below
L45 top-K Tuning ○ Planned
L46 HNSW Tuning for Recall vs Speed ○ Planned
L47 Qdrant vs pgvector ○ Planned
L48 Weaviate for Hybrid Search ○ Planned
L49 RAG Under Load ○ Planned
L50 RAG + KEDA: Scaling the Right Tier ○ Planned
L51 Reranking Pipeline ○ Planned
L52 Prefix Caching for RAG System Prompts ○ Planned
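
For L44, a hedged sketch of stage-level timing: wrap each hop (embed, vector search, generate) in its own timer so p95 can be attributed to a tier instead of to "RAG" as a whole. The three stage functions are hypothetical stand-ins for the real services.

```python
# Sketch: per-stage latency decomposition for a RAG request (L44).
# embed(), retrieve() and generate() are hypothetical stand-ins for the real services.
import time
from contextlib import contextmanager

timings: dict = {}

@contextmanager
def stage(name: str):
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(name, []).append(time.perf_counter() - t0)

def embed(text):                     # stand-in: CPU/GPU embedding service call
    time.sleep(0.005)
    return [0.0] * 384

def retrieve(vector, top_k=5):       # stand-in: Qdrant / pgvector query
    time.sleep(0.010)
    return [f"chunk-{i}" for i in range(top_k)]

def generate(question, chunks):      # stand-in: vLLM call, usually the dominant stage
    time.sleep(0.200)
    return "stub answer"

def answer(question: str) -> str:
    with stage("embed"):
        vector = embed(question)
    with stage("retrieve"):
        chunks = retrieve(vector)
    with stage("generate"):
        return generate(question, chunks)

if __name__ == "__main__":
    for _ in range(20):
        answer("What does the KV cache budget lab measure?")
    for name, xs in timings.items():
        xs = sorted(xs)
        print(f"{name:9s} p95 = {xs[int(0.95 * (len(xs) - 1))] * 1000:.1f} ms")
```
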

Phase 5 — Advanced Serving

Weeks 31–36
# Name Status
L53 LoRA Adapters in vLLM ○ Planned
L54 Multi-LoRA Memory Budget ○ Planned
L55 LoRA Hot-Swap Benchmark ○ Planned
L56 Adapter Routing Service ○ Planned · sketch below
L57 Triton Model Repository ○ Planned
L58 Triton Dynamic Batching ○ Planned
L59 Triton Ensemble Pipeline ○ Planned
L60 vLLM vs Triton: Structured Benchmark ○ Planned
L61 OpenAI-Compatible Gateway ○ Planned
L62 Model A/B Testing ○ Planned
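
For L56, a hedged sketch of the routing idea: vLLM's OpenAI-compatible server can expose LoRA adapters as named models registered at startup, so per-tenant routing reduces to choosing the model string before forwarding. The tenant map, adapter names, and upstream URL are placeholders.

```python
# Sketch: route tenants to LoRA adapters by rewriting the OpenAI 'model' field (L56).
# Tenant map, adapter names, and upstream URL are placeholders.
import requests

UPSTREAM = "http://vllm.lab-56.svc.cluster.local:8000/v1/chat/completions"  # placeholder
ADAPTERS = {                       # tenant -> adapter name registered with vLLM at startup
    "tenant-a": "support-lora",
    "tenant-b": "legal-lora",
}
BASE_MODEL = "microsoft/Phi-3-mini-4k-instruct"

def complete(tenant: str, messages: list) -> str:
    model = ADAPTERS.get(tenant, BASE_MODEL)   # unknown tenants fall back to the base model
    resp = requests.post(
        UPSTREAM,
        json={"model": model, "messages": messages, "max_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(complete("tenant-a", [{"role": "user", "content": "Reset my password."}]))
```
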

Phase 6 — Reliability & Chaos

Weeks 37–40
# Name Status
L63 Probe Design for LLM Workloads ○ Planned
L64 PodDisruptionBudget in Practice ○ Planned
L65 GPU OOM Injection ○ Planned · sketch below
L66 CrashLoopBackOff Simulation ○ Planned
L67 Node Drain: GPU Pod Rescheduling ○ Planned
L68 Resource Starvation ○ Planned
L69 NetworkPolicy Chaos ○ Planned
L70 Storage Latency Injection ○ Planned
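
For L65, a hedged sketch of the injection itself: allocate fp16 chunks on the target GPU until CUDA raises OOM, then hold them while watching how the serving pod and its probes react. The chunk size is arbitrary, and how the script reaches the GPU (exec into the pod, time-slicing, a sidecar) is left to the lab.

```python
# Sketch: GPU OOM injection (L65). Allocate fp16 chunks until CUDA raises OOM,
# then hold them to keep the pressure on while probes and alerts are observed.
# Chunk size is arbitrary; run it wherever it can see the target GPU.
import torch

def hog_gpu(chunk_mb: int = 256):
    hoard = []
    elements_per_chunk = chunk_mb * 1024 * 1024 // 2      # fp16 = 2 bytes per element
    while True:
        try:
            hoard.append(torch.empty(elements_per_chunk, dtype=torch.float16, device="cuda"))
            held_gib = torch.cuda.memory_allocated() / 1024**3
            print(f"held {held_gib:.1f} GiB")
        except torch.cuda.OutOfMemoryError:
            print("CUDA OOM reached; holding allocations")
            return hoard

if __name__ == "__main__":
    hog_gpu()
```
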

Phase 7 — FinOps

Weeks 41–43
# Name Status
L71 GPU Cost Attribution by Namespace ○ Planned
L72 Idle GPU Waste Quantification ○ Planned
L73 Spot GPU Nodepool ○ Planned
L74 Spot Reclaim Simulation ○ Planned
L75 Spot vs On-Demand Cost Model ○ Planned · worked example below
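
For L75, the shape of the comparison as a hedged sketch: effective spot cost has to carry reclaim overhead (lost minutes per eviction on top of the cold start), so break-even depends on the observed reclaim rate, not just the sticker discount. Every number below is a placeholder, not an Azure price.

```python
# Sketch of the L75 cost model. Every number here is a placeholder, not an Azure quote.
def effective_spot_cost(
    on_demand_hr: float,      # on-demand $/hr for the GPU SKU (placeholder)
    spot_discount: float,     # e.g. 0.7 == 70% cheaper (placeholder)
    reclaims_per_day: float,  # observed eviction rate (placeholder)
    cold_start_min: float,    # minutes until a replacement node is serving again
) -> float:
    """$ per useful hour on spot, charging reclaim downtime against useful capacity."""
    spot_hr = on_demand_hr * (1 - spot_discount)
    lost_hours_per_day = reclaims_per_day * cold_start_min / 60
    useful_hours_per_day = 24 - lost_hours_per_day
    return spot_hr * 24 / useful_hours_per_day

if __name__ == "__main__":
    on_demand = 0.90          # placeholder $/hr
    spot = effective_spot_cost(on_demand, spot_discount=0.7, reclaims_per_day=2, cold_start_min=12)
    print(f"on-demand ${on_demand:.2f}/hr  vs  spot effective ${spot:.2f}/useful-hr")
```
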

Phase 8 — Capstone

Weeks 44–45
# Name Status
L76 The Full Stack: Assembly ○ Planned
L77 Production Hardening ○ Planned
L78 The Year in Numbers ○ Planned