2026 Lab Series
github.com/bytes1inger/ai-infra-labs-2026
Labs
78 labs · Azure AKS · Feb–Dec 2026 · 4 hrs/day · Every Helm chart committed. Each lab is one namespace, one concept, fully instrumented.
PERMANENT BASELINE · L06 Phi-3 Mini 3.8B · fp16 · T4 16GB
| Concurrency | Tokens/sec | p95 TTFT | GPU Util |
|---|---|---|---|
| 1 | — | — | — |
| 4 | — | — | — |
| 8 | — | — | — |
| 16 | — | — | — |
To be filled in once L06 completes. These numbers are the control group for every subsequent lab.
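The table's three measured columns can be reduced from per-request samples once the load test runs. A minimal sketch of that reduction, assuming each request records its time-to-first-token, completion token count, and duration (the names `RequestSample` and `summarize` are illustrative, not from the repo):

```python
import math
from dataclasses import dataclass

@dataclass
class RequestSample:
    ttft_s: float      # time to first token, seconds
    tokens: int        # completion tokens generated
    duration_s: float  # total request duration, seconds

def p95(values):
    """Nearest-rank 95th percentile of a list of samples."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def summarize(samples, wall_clock_s):
    """Reduce one concurrency level to the baseline table's columns.

    Aggregate tokens/sec divides total completion tokens by the run's
    wall-clock duration, so it reflects batching, not per-request speed.
    """
    total_tokens = sum(s.tokens for s in samples)
    return {
        "tokens_per_sec": total_tokens / wall_clock_s,
        "p95_ttft_s": p95([s.ttft_s for s in samples]),
    }
```

GPU utilization comes from DCGM/Prometheus rather than the client, which is why it is not computed here.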
Phase 1 — Foundation
Weeks 1–6
| # | Name | Status |
|---|---|---|
| L01 | The Instrument Panel | ✓ Done |
| L02 | GPU Metrics from Zero | ○ Planned |
| L03 | Alert Before It Breaks | ○ Planned |
| L04 | Loki Query Language for LLM | ○ Planned |
| L05 | vLLM Standing Up | ○ Planned |
| L06 | The Load Test Contract | ○ Planned |
| L07 | The Scheduler's View | ○ Planned |
| L08 | Node Affinity vs Taints | ○ Planned |
| L09 | Model Weights as Infrastructure | ○ Planned |
| L10 | Secrets & Workload Identity | ○ Planned |
Phase 2 — GPU Internals
Weeks 7–14
| # | Name | Status |
|---|---|---|
| L11 | The KV Cache Budget | ○ Planned |
| L12 | Eviction and Swap | ○ Planned |
| L13 | Continuous Batching Deep Dive | ○ Planned |
| L14 | Chunked Prefill | ○ Planned |
| L15 | fp16 vs AWQ vs GPTQ | ○ Planned |
| L16 | Fitting Larger Models on Smaller GPUs | ○ Planned |
| L17 | Prefix Caching | ○ Planned |
| L18 | Speculative Decoding | ○ Planned |
| L19 | Tensor Parallelism on 2xT4 | ○ Planned |
| L20 | Pipeline Parallelism | ○ Planned |
| L21 | max_model_len Tuning | ○ Planned |
| L22 | Throughput vs Latency Mode | ○ Planned |
| L23 | T4 vs A100: The Cost-Performance Cliff | ○ Planned |
| L24 | Right-sizing GPU SKU | ○ Planned |
Phase 3 — Autoscaling
Weeks 15–22
| # | Name | Status |
|---|---|---|
| L25 | HPA on CPU Workloads | ○ Planned |
| L26 | HPA on Custom Metrics | ○ Planned |
| L27 | KEDA ScaledObject Basics | ○ Planned |
| L28 | KEDA on LLM Queue Depth | ○ Planned |
| L29 | GPU Node Cold Start: Full Timeline | ○ Planned |
| L30 | Warm Pool Strategy | ○ Planned |
| L31 | Scale to Zero with KEDA | ○ Planned |
| L32 | The Warm Replica Floor | ○ Planned |
| L33 | Predictive vs Reactive Scaling | ○ Planned |
| L34 | Scaling Lag Measurement | ○ Planned |
| L35 | CPU Tier vs GPU Tier Scaling | ○ Planned |
| L36 | Request Queue Architecture | ○ Planned |
| L37 | NGINX Ingress for LLM APIs | ○ Planned |
| L38 | Canary Routing for Model Versions | ○ Planned |
Phase 4 — RAG Systems
Weeks 23–30
| # | Name | Status |
|---|---|---|
| L39 | Qdrant on AKS | ○ Planned |
| L40 | Indexing at Scale | ○ Planned |
| L41 | Embedding Service on CPU | ○ Planned |
| L42 | Embedding Service on GPU | ○ Planned |
| L43 | End-to-End RAG: First Contact | ○ Planned |
| L44 | RAG Latency Decomposition | ○ Planned |
| L45 | top-K Tuning | ○ Planned |
| L46 | HNSW Tuning for Recall vs Speed | ○ Planned |
| L47 | Qdrant vs pgvector | ○ Planned |
| L48 | Weaviate for Hybrid Search | ○ Planned |
| L49 | RAG Under Load | ○ Planned |
| L50 | RAG + KEDA: Scaling the Right Tier | ○ Planned |
| L51 | Reranking Pipeline | ○ Planned |
| L52 | Prefix Caching for RAG System Prompts | ○ Planned |
Phase 5 — Advanced Serving
Weeks 31–36
| # | Name | Status |
|---|---|---|
| L53 | LoRA Adapters in vLLM | ○ Planned |
| L54 | Multi-LoRA Memory Budget | ○ Planned |
| L55 | LoRA Hot-Swap Benchmark | ○ Planned |
| L56 | Adapter Routing Service | ○ Planned |
| L57 | Triton Model Repository | ○ Planned |
| L58 | Triton Dynamic Batching | ○ Planned |
| L59 | Triton Ensemble Pipeline | ○ Planned |
| L60 | vLLM vs Triton: Structured Benchmark | ○ Planned |
| L61 | OpenAI-Compatible Gateway | ○ Planned |
| L62 | Model A/B Testing | ○ Planned |
Phase 6 — Reliability & Chaos
Weeks 37–40
| # | Name | Status |
|---|---|---|
| L63 | Probe Design for LLM Workloads | ○ Planned |
| L64 | PodDisruptionBudget in Practice | ○ Planned |
| L65 | GPU OOM Injection | ○ Planned |
| L66 | CrashLoopBackOff Simulation | ○ Planned |
| L67 | Node Drain: GPU Pod Rescheduling | ○ Planned |
| L68 | Resource Starvation | ○ Planned |
| L69 | NetworkPolicy Chaos | ○ Planned |
| L70 | Storage Latency Injection | ○ Planned |
Phase 7 — FinOps
Weeks 41–43
| # | Name | Status |
|---|---|---|
| L71 | GPU Cost Attribution by Namespace | ○ Planned |
| L72 | Idle GPU Waste Quantification | ○ Planned |
| L73 | Spot GPU Nodepool | ○ Planned |
| L74 | Spot Reclaim Simulation | ○ Planned |
| L75 | Spot vs On-Demand Cost Model | ○ Planned |
Phase 8 — Capstone
Weeks 44–45
| # | Name | Status |
|---|---|---|
| L76 | The Full Stack: Assembly | ○ Planned |
| L77 | Production Hardening | ○ Planned |
| L78 | The Year in Numbers | ○ Planned |