Labs

78 labs · Azure AKS · Feb–Dec 2026 · 4 hrs/day · Every Helm chart committed. Each lab is one namespace, one concept, fully instrumented.

github.com/bytes1inger/ai-infra-labs-2026
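
A minimal sketch of the one-lab-one-namespace convention above, assuming a charts/<lab> layout in the repo; the chart path, release name, and namespace scheme here are placeholders, not the repo's actual layout.

```python
# Sketch: install one lab's chart into its own namespace (Helm 3 on the PATH).
# Chart directory, release name, and namespace scheme are placeholders.
import subprocess

def deploy_lab(lab: str, chart_dir: str = "charts") -> None:
    """Install lab NN (e.g. 'l06') into namespace 'lab-NN', creating it if needed."""
    namespace = f"lab-{lab[1:]}"           # one lab == one namespace
    subprocess.run(
        [
            "helm", "upgrade", "--install", lab, f"{chart_dir}/{lab}",
            "--namespace", namespace,
            "--create-namespace",          # create the namespace on first install
            "--wait",                      # block until resources report ready
        ],
        check=True,
    )

if __name__ == "__main__":
    deploy_lab("l06")
```
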
PERMANENT BASELINE · L06 Phi-3 Mini 3.8B · fp16 · T4 16GB
Concurrency   Tokens/sec   p95 TTFT   GPU Util
1             TBD          TBD        TBD
4             TBD          TBD        TBD
8             TBD          TBD        TBD
16            TBD          TBD        TBD

Updated when L06 is complete. These numbers are the control group for all subsequent labs.
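
How those cells could be filled, as a hedged sketch: a fixed-concurrency async client against vLLM's OpenAI-compatible /v1/completions endpoint with streaming on, recording time to first streamed chunk per request and aggregate tokens/sec. The endpoint URL, prompt, and request counts are placeholders, not the final L06 load-test contract.

```python
# Sketch of a fixed-concurrency throughput / TTFT probe against a vLLM
# OpenAI-compatible server. URL, prompt, and request counts are placeholders.
import asyncio
import time

import aiohttp

URL = "http://vllm.lab-06.svc.cluster.local:8000/v1/completions"   # placeholder endpoint
PAYLOAD = {
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "Explain the KV cache in one paragraph.",
    "max_tokens": 256,
    "stream": True,
}

async def one_request(session: aiohttp.ClientSession, ttfts: list, tokens: list) -> None:
    start = time.perf_counter()
    first = None
    count = 0
    async with session.post(URL, json=PAYLOAD) as resp:
        async for line in resp.content:                 # SSE lines: b"data: {...}\n"
            if not line.startswith(b"data:") or b"[DONE]" in line:
                continue
            if first is None:
                first = time.perf_counter() - start     # time to first streamed chunk
            count += 1                                  # ~one chunk per token in this sketch
    if first is not None:
        ttfts.append(first)
    tokens.append(count)

async def run(concurrency: int, requests_per_worker: int = 5) -> None:
    ttfts: list = []
    tokens: list = []
    t0 = time.perf_counter()
    async with aiohttp.ClientSession() as session:
        async def worker():
            for _ in range(requests_per_worker):
                await one_request(session, ttfts, tokens)
        await asyncio.gather(*(worker() for _ in range(concurrency)))
    wall = time.perf_counter() - t0
    p95 = sorted(ttfts)[int(0.95 * (len(ttfts) - 1))]
    print(f"c={concurrency:2d}  tok/s={sum(tokens) / wall:7.1f}  p95 TTFT={p95 * 1000:6.0f} ms")

if __name__ == "__main__":
    for c in (1, 4, 8, 16):
        asyncio.run(run(c))
```
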

Phase 1 — Foundation

Weeks 1–6
# Name Status
L01 The Instrument Panel ✓ Done
L02 GPU Metrics from Zero ○ Planned · sketch below
L03 Alert Before It Breaks ○ Planned
L04 Loki Query Language for LLM ○ Planned
L05 vLLM Standing Up ○ Planned
L06 The Load Test Contract ○ Planned
L07 The Scheduler's View ○ Planned
L08 Node Affinity vs Taints ○ Planned
L09 Model Weights as Infrastructure ○ Planned
L10 Secrets & Workload Identity ○ Planned
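
One hedged sketch for the L02 territory: reading per-GPU utilization from dcgm-exporter through the Prometheus HTTP API. The in-cluster Prometheus URL is a placeholder; DCGM_FI_DEV_GPU_UTIL is dcgm-exporter's standard utilization gauge, and the pod/gpu labels assume the usual Kubernetes label mapping.

```python
# Sketch: read per-GPU utilization from Prometheus scraping dcgm-exporter.
# The Prometheus URL is a placeholder; the metric name is dcgm-exporter's standard gauge.
import requests

PROM = "http://prometheus.monitoring.svc.cluster.local:9090"   # placeholder

def gpu_utilization() -> dict:
    """Return {'<pod>/gpu<idx>': utilization %} from the most recent scrape."""
    resp = requests.get(
        f"{PROM}/api/v1/query",
        params={"query": "DCGM_FI_DEV_GPU_UTIL"},
        timeout=5,
    )
    resp.raise_for_status()
    utilization = {}
    for series in resp.json()["data"]["result"]:
        labels = series["metric"]
        key = f"{labels.get('pod', '?')}/gpu{labels.get('gpu', '?')}"
        utilization[key] = float(series["value"][1])            # value is [timestamp, "string"]
    return utilization

if __name__ == "__main__":
    for gpu, util in gpu_utilization().items():
        print(f"{gpu}: {util:.0f}%")
```
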

Phase 2 — GPU Internals

Weeks 7–14
# Name Status
L11 The KV Cache Budget ○ Planned · worked example below
L12 Eviction and Swap ○ Planned
L13 Continuous Batching Deep Dive ○ Planned
L14 Chunked Prefill ○ Planned
L15 fp16 vs AWQ vs GPTQ ○ Planned
L16 Fitting Larger Models on Smaller GPUs ○ Planned
L17 Prefix Caching ○ Planned
L18 Speculative Decoding ○ Planned
L19 Tensor Parallelism on 2xT4 ○ Planned
L20 Pipeline Parallelism ○ Planned
L21 max_model_len Tuning ○ Planned
L22 Throughput vs Latency Mode ○ Planned
L23 T4 vs A100: The Cost-Performance Cliff ○ Planned
L24 Right-sizing GPU SKU ○ Planned
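
A worked back-of-envelope for L11, assuming Phi-3-mini's published config (32 layers, 32 KV heads, head dim 96) and an fp16 cache; treat those numbers as assumptions to re-check against the model's config.json.

```python
# Back-of-envelope KV-cache budget (L11). Config values are assumptions taken from
# Phi-3-mini's published config; verify against config.json before trusting them.
LAYERS = 32          # num_hidden_layers (assumed)
KV_HEADS = 32        # num_key_value_heads (assumed)
HEAD_DIM = 96        # hidden_size 3072 / 32 heads (assumed)
BYTES = 2            # fp16

def kv_bytes_per_token() -> int:
    # 2x for K and V, per layer, per KV head, per head dim, per element byte
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES

def tokens_in_budget(gpu_gb: float, weights_gb: float, util: float = 0.9) -> int:
    """Tokens the KV cache can hold after weights, at vLLM's gpu_memory_utilization."""
    cache_bytes = (gpu_gb * util - weights_gb) * 1024**3
    return int(cache_bytes // kv_bytes_per_token())

if __name__ == "__main__":
    print(f"KV per token: {kv_bytes_per_token() / 1024:.0f} KiB")      # 393,216 B ≈ 384 KiB
    # ~3.8B params * 2 bytes ≈ 7.6 GB of fp16 weights (rough)
    print(f"Tokens in cache on a 16 GB T4: ~{tokens_in_budget(16, 7.6):,}")
```
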

Phase 3 — Autoscaling

Weeks 15–22
# Name Status
L25 HPA on CPU Workloads ○ Planned
L26 HPA on Custom Metrics ○ Planned
L27 KEDA ScaledObject Basics ○ Planned
L28 KEDA on LLM Queue Depth ○ Planned · sketch below
L29 GPU Node Cold Start: Full Timeline ○ Planned
L30 Warm Pool Strategy ○ Planned
L31 Scale to Zero with KEDA ○ Planned
L32 The Warm Replica Floor ○ Planned
L33 Predictive vs Reactive Scaling ○ Planned
L34 Scaling Lag Measurement ○ Planned
L35 CPU Tier vs GPU Tier Scaling ○ Planned
L36 Request Queue Architecture ○ Planned
L37 NGINX Ingress for LLM APIs ○ Planned
L38 Canary Routing for Model Versions ○ Planned
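
For L28, one hedged sketch: the request gateway mirrors its queue depth into a Prometheus gauge via prometheus_client, and a KEDA ScaledObject using the prometheus scaler can then target that series. Metric name and port are placeholders; the ScaledObject YAML itself belongs to the lab.

```python
# Sketch: expose LLM request queue depth as a Prometheus gauge so a KEDA
# prometheus-scaler ScaledObject can target it. Metric name and port are placeholders.
import random
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge(
    "llm_gateway_queue_depth",   # placeholder series name for the ScaledObject query
    "Requests accepted by the gateway but not yet dispatched to vLLM",
)

class RequestQueue:
    """Toy FIFO that mirrors its length into the gauge on every change."""
    def __init__(self) -> None:
        self._items = []

    def put(self, item) -> None:
        self._items.append(item)
        QUEUE_DEPTH.set(len(self._items))

    def get(self):
        item = self._items.pop(0)
        QUEUE_DEPTH.set(len(self._items))
        return item

if __name__ == "__main__":
    start_http_server(9100)          # serves /metrics on :9100 for Prometheus to scrape
    queue = RequestQueue()
    while True:                      # synthetic load so the gauge moves during the lab
        if random.random() < 0.6:
            queue.put(object())
        elif queue._items:
            queue.get()
        time.sleep(0.5)
```
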

Phase 4 — RAG Systems

Weeks 23–30
# Name Status
L39 Qdrant on AKS ○ Planned
L40 Indexing at Scale ○ Planned
L41 Embedding Service on CPU ○ Planned
L42 Embedding Service on GPU ○ Planned
L43 End-to-End RAG: First Contact ○ Planned
L44 RAG Latency Decomposition ○ Planned · sketch below
L45 top-K Tuning ○ Planned
L46 HNSW Tuning for Recall vs Speed ○ Planned
L47 Qdrant vs pgvector ○ Planned
L48 Weaviate for Hybrid Search ○ Planned
L49 RAG Under Load ○ Planned
L50 RAG + KEDA: Scaling the Right Tier ○ Planned
L51 Reranking Pipeline ○ Planned
L52 Prefix Caching for RAG System Prompts ○ Planned
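
For L44, a hedged sketch of stage-level timing: wrap each hop (embed, vector search, generate) in its own timer so p95 can be attributed to a tier instead of to "RAG" as a whole. The three stage functions are hypothetical stand-ins for the real services.

```python
# Sketch: per-stage latency decomposition for a RAG request (L44).
# embed(), retrieve() and generate() are hypothetical stand-ins for the real services.
import time
from contextlib import contextmanager

timings: dict = {}

@contextmanager
def stage(name: str):
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(name, []).append(time.perf_counter() - t0)

def embed(text):                     # stand-in: CPU/GPU embedding service call
    time.sleep(0.005)
    return [0.0] * 384

def retrieve(vector, top_k=5):       # stand-in: Qdrant / pgvector query
    time.sleep(0.010)
    return [f"chunk-{i}" for i in range(top_k)]

def generate(question, chunks):      # stand-in: vLLM call, usually the dominant stage
    time.sleep(0.200)
    return "stub answer"

def answer(question: str) -> str:
    with stage("embed"):
        vector = embed(question)
    with stage("retrieve"):
        chunks = retrieve(vector)
    with stage("generate"):
        return generate(question, chunks)

if __name__ == "__main__":
    for _ in range(20):
        answer("What does the KV cache budget lab measure?")
    for name, xs in timings.items():
        xs = sorted(xs)
        print(f"{name:9s} p95 = {xs[int(0.95 * (len(xs) - 1))] * 1000:.1f} ms")
```
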

Phase 5 — Advanced Serving

Weeks 31–36
# Name Status
L53 LoRA Adapters in vLLM ○ Planned
L54 Multi-LoRA Memory Budget ○ Planned
L55 LoRA Hot-Swap Benchmark ○ Planned
L56 Adapter Routing Service ○ Planned · sketch below
L57 Triton Model Repository ○ Planned
L58 Triton Dynamic Batching ○ Planned
L59 Triton Ensemble Pipeline ○ Planned
L60 vLLM vs Triton: Structured Benchmark ○ Planned
L61 OpenAI-Compatible Gateway ○ Planned
L62 Model A/B Testing ○ Planned
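
For L56, a hedged sketch of the routing idea: vLLM's OpenAI-compatible server can expose LoRA adapters as named models registered at startup, so per-tenant routing reduces to choosing the model string before forwarding. The tenant map, adapter names, and upstream URL are placeholders.

```python
# Sketch: route tenants to LoRA adapters by rewriting the OpenAI 'model' field (L56).
# Tenant map, adapter names, and upstream URL are placeholders.
import requests

UPSTREAM = "http://vllm.lab-56.svc.cluster.local:8000/v1/chat/completions"  # placeholder
ADAPTERS = {                       # tenant -> adapter name registered with vLLM at startup
    "tenant-a": "support-lora",
    "tenant-b": "legal-lora",
}
BASE_MODEL = "microsoft/Phi-3-mini-4k-instruct"

def complete(tenant: str, messages: list) -> str:
    model = ADAPTERS.get(tenant, BASE_MODEL)   # unknown tenants fall back to the base model
    resp = requests.post(
        UPSTREAM,
        json={"model": model, "messages": messages, "max_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(complete("tenant-a", [{"role": "user", "content": "Reset my password."}]))
```
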

Phase 6 — Reliability & Chaos

Weeks 37–40
# Name Status
L63 Probe Design for LLM Workloads ○ Planned
L64 PodDisruptionBudget in Practice ○ Planned
L65 GPU OOM Injection ○ Planned · sketch below
L66 CrashLoopBackOff Simulation ○ Planned
L67 Node Drain: GPU Pod Rescheduling ○ Planned
L68 Resource Starvation ○ Planned
L69 NetworkPolicy Chaos ○ Planned
L70 Storage Latency Injection ○ Planned
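
For L65, a hedged sketch of the injection itself: allocate fp16 chunks on the target GPU until CUDA raises OOM, then hold them while watching how the serving pod and its probes react. The chunk size is arbitrary, and how the script reaches the GPU (exec into the pod, time-slicing, a sidecar) is left to the lab.

```python
# Sketch: GPU OOM injection (L65). Allocate fp16 chunks until CUDA raises OOM,
# then hold them to keep the pressure on while probes and alerts are observed.
# Chunk size is arbitrary; run it wherever it can see the target GPU.
import torch

def hog_gpu(chunk_mb: int = 256):
    hoard = []
    elements_per_chunk = chunk_mb * 1024 * 1024 // 2      # fp16 = 2 bytes per element
    while True:
        try:
            hoard.append(torch.empty(elements_per_chunk, dtype=torch.float16, device="cuda"))
            held_gib = torch.cuda.memory_allocated() / 1024**3
            print(f"held {held_gib:.1f} GiB")
        except torch.cuda.OutOfMemoryError:
            print("CUDA OOM reached; holding allocations")
            return hoard

if __name__ == "__main__":
    hog_gpu()
```
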

Phase 7 — FinOps

Weeks 41–43
# Name Status
L71 GPU Cost Attribution by Namespace ○ Planned
L72 Idle GPU Waste Quantification ○ Planned
L73 Spot GPU Nodepool ○ Planned
L74 Spot Reclaim Simulation ○ Planned
L75 Spot vs On-Demand Cost Model ○ Planned · worked example below
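
For L75, the shape of the comparison as a hedged sketch: effective spot cost has to carry reclaim overhead (lost minutes per eviction on top of the cold start), so break-even depends on the observed reclaim rate, not just the sticker discount. Every number below is a placeholder, not an Azure price.

```python
# Sketch of the L75 cost model. Every number here is a placeholder, not an Azure quote.
def effective_spot_cost(
    on_demand_hr: float,      # on-demand $/hr for the GPU SKU (placeholder)
    spot_discount: float,     # e.g. 0.7 == 70% cheaper (placeholder)
    reclaims_per_day: float,  # observed eviction rate (placeholder)
    cold_start_min: float,    # minutes until a replacement node is serving again
) -> float:
    """$ per useful hour on spot, charging reclaim downtime against useful capacity."""
    spot_hr = on_demand_hr * (1 - spot_discount)
    lost_hours_per_day = reclaims_per_day * cold_start_min / 60
    useful_hours_per_day = 24 - lost_hours_per_day
    return spot_hr * 24 / useful_hours_per_day

if __name__ == "__main__":
    on_demand = 0.90          # placeholder $/hr
    spot = effective_spot_cost(on_demand, spot_discount=0.7, reclaims_per_day=2, cold_start_min=12)
    print(f"on-demand ${on_demand:.2f}/hr  vs  spot effective ${spot:.2f}/useful-hr")
```
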

Phase 8 — Capstone

Weeks 44–45
# Name Status
L76 The Full Stack: Assembly ○ Planned
L77 Production Hardening ○ Planned
L78 The Year in Numbers ○ Planned