Code Crunch Labs · Tier II · Sub-brand · Agents · 24 weeks · semester · GPL-3.0

Crunch Agents.

Twenty-four weeks on the modern AI engineering stack. The agent loop as a discipline. Retrieval as engineering. Memory as a context-budget cache. Multi-agent orchestration graphs. Local inference on Ollama, vLLM, and NeMo. Tool surfaces over the Model Context Protocol. Open-Claude-compatible (OpenClaw) runtimes. Eval-in-prod. Free, forever.

24 weeks · Program length
864 hrs · Total workload
24 + 1 · Labs + capstone
$0 · Tuition · always

§ I · The Program

Agents that ship.

Crunch Agents is the AI agent systems engineering specialization of the Code Crunch academy — a 2026 redesign for the Crunch Labs tier. It is built for engineers who want to ship LLM-backed products: not notebook demos, but agentic systems with versioned prompts, hybrid retrieval over a private corpus, memory tiers that survive a forty-turn conversation, multi-agent supervisor graphs, and a tool surface served over the Model Context Protocol.

This track is deliberately distinct from C5 (Crunch AI · Data Science). C5 owns classical machine learning through deep learning: feature engineering, gradient boosting, CNNs, transformers as architectures, PyTorch fundamentals. C23, this track, picks up where C5 leaves off: LLMs as components in a system, not as research artifacts. The discipline here is systems engineering on top of models you mostly do not train. By Week 24 you have shipped a multi-agent research assistant served from your own vLLM cluster, instrumented with OpenTelemetry Gen-AI conventions, evaluated with Ragas and a calibrated LLM-as-judge, and survived a chaos drill in which a GPU node dies, a tool is prompt-injected, and a vector index corrupts, all in the same week.

"If C5 makes you a model builder, C23 makes you the engineer who keeps an agentic product alive in production." (Crunch Agents, course README)

§ II · Who It's For

Four engineers, one agent loop.

Agents is opinionated about its audience. C1 (Code Crunch Convos) is the floor — you should be able to read Python comfortably, run Docker without flinching, and survive a Linux shell.

No. 01

The Python Developer

Has shipped Flask, FastAPI, or Django. Has called the OpenAI API. The gap between demo and production is fog — wants the durable patterns: retrieval as engineering, eval as engineering, the agent loop with rigor.

No. 02

The Classical ML Engineer

Came up on scikit-learn, XGBoost, PyTorch. Can train, calibrate, fight a leakage bug. Lacks muscle memory for RAG, MCP, vLLM, agent graphs, tool-use safety, and the operating model of an LLM-backed product.

No. 03

The SRE Crossing Over

Runs platforms — K8s, Prometheus, on-call, autoscaling. The new pager is GPU pools and token budgets. Wants vLLM topology, NeMo Inference, continuous batching, KV-cache economics, and a real runbook for an agentic system.

No. 04

The Researcher Applied

Has a Ph.D. or near-Ph.D. and reads a paper a week. The applied side — testable prompts, observability, retrieval pipelines, cost engineering, the messy reality of shipping — is the gap. C23 closes it without insulting the background.

§ III · Four Phases

From the agent loop to production.

The program runs in four phases of six weeks each, and each phase builds directly on the one before it.

Phase I · Wk. 01—06

Foundations

LLMs as system components. Tokens, context, sampling, structured output. Prompt-as-code with promptfoo and Langfuse. Tool calling across vendors. The agent loop (ReAct, plan-and-execute, reflection). Local inference bring-up on Ollama, llama.cpp, and vLLM.

Phase II · Wk. 07—12

RAG & Memory Systems

Embeddings, chunking, hybrid search, reranking. Vector stores in production — pgvector, Qdrant, Weaviate. Memory tiers: episodic, semantic, procedural. Multimodal RAG. Evaluation as engineering with Ragas and a calibrated LLM-as-judge.

Phase III · Wk. 13—18

Agents & Orchestration

LangGraph state graphs. Mastra and durable execution. The Model Context Protocol in depth, plus the open-Claude-compatible OpenClaw runtime family. Fine-tuning with LoRA / QLoRA. Safety, prompt injection, threat modeling. Observability via OTel Gen-AI conventions.

Phase IV · Wk. 19—24

Production AI & Capstone

vLLM clusters with continuous batching. NeMo Inference and the NVIDIA stack. Cost engineering, model routing, semantic cache. Two capstone sprints. Week 24 chaos drill: GPU node loss, prompt-injection on a tool, vector-index corruption. Postmortem.

§ IV · The Curriculum

Twenty-four weeks, week by week.

Each entry corresponds to a folder in the GitHub repository with lecture notes, exercises, challenges, a quiz, homework, and a mini-project. Detailed acceptance criteria live in the syllabus.

01

The LLM as a System Component

Tokenizer → context → forward pass → sampling · decoder-only transformer at a systems level · open-weights (Llama 4, Qwen 3, Mistral, Gemma 3, DeepSeek) vs closed-weights · reading a model card without falling for the benchmark.

Lab 01

Build llmpick — multi-model recommender

02

Tokens, Context, and Sampling

BPE · SentencePiece · tiktoken · the price of long context · KV cache · temperature, top-p, top-k, min-p · structured output with outlines, guidance, xgrammar · grammar-constrained decoding.

Lab 02

Tokenization explorer + grammar-constrained JSON
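
To make the sampling knobs concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering over a list of logits. It is an illustration written for this page, not the lab's reference solution: the function names are hypothetical, min-p is left out, and real inference engines apply these filters to tensors rather than Python lists.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be > 0, lower is sharper."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=1.0):
    """Keep the top_k likeliest tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    kept, mass = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize the survivors


def sample(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Draw one token id from the filtered, renormalized distribution."""
    dist = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Calling `sample(logits, top_k=1)` reduces to greedy decoding, which is a handy sanity check when comparing against an engine's own sampler.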

03

Prompt Engineering as Engineering

The prompt is code · system/user/assistant roles · few-shot, CoT (honest), self-consistency · jailbreak surface · versioning with promptfoo and Langfuse · spec-then-implement with Claude Code and Cursor.

Lab 03

Promptfoo regression suite — 6 versions, 30 golden examples

04

Tool Calling and Structured Output

Function calling across vendors · open-source equivalents on Llama and Qwen · MCP as the cross-vendor protocol (stdio, SSE, streamable HTTP) · JSON-mode vs grammar-constrained · tool-use as RCE.

Lab 04

4-tool agent on Claude + local Qwen2.5-7B
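
The "tool-use as RCE" framing suggests treating every model-emitted tool call as untrusted input. The sketch below shows one way that might look: an allow-list of tools plus argument checks that run before anything executes. The call shape (`name` and `arguments` keys), the two toy tools, and the `dispatch` helper are all assumptions made for illustration, not any vendor's wire format.

```python
import json


def add(a, b):
    return a + b


def word_count(text):
    return len(text.split())


# Allow-list: tool name -> (callable, {argument name: accepted type(s)}).
ALLOWED_TOOLS = {
    "add": (add, {"a": (int, float), "b": (int, float)}),
    "word_count": (word_count, {"text": str}),
}


def dispatch(raw_call):
    """Validate a model-emitted tool call, and only then execute it."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {"ok": False, "error": "tool call is not valid JSON"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "tool call must be a JSON object"}

    name, args = call.get("name"), call.get("arguments", {})
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return {"ok": False, "error": f"unknown tool: {name!r}"}

    func, schema = ALLOWED_TOOLS[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        return {"ok": False, "error": f"expected arguments {sorted(schema)}"}
    for key, expected in schema.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(args[key], bool) or not isinstance(args[key], expected):
            return {"ok": False, "error": f"argument {key!r} has the wrong type"}

    return {"ok": True, "result": func(**args)}
```

Returning a structured error instead of raising lets the agent loop feed the failure back to the model as an observation.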

05

The Agent Loop

ReAct, plan-and-execute, reflection, self-critique · the "infinite tool-call" failure · step / token / time / cost budgets · claude-agent-sdk, OpenAI Agents SDK, AWS Strands, Google ADK at survey depth.

Lab 05

~150-line ReAct loop, no framework, vs claude-agent-sdk
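
As a rough picture of what step and token budgets do for the "infinite tool-call" failure, here is a small framework-free loop with a scripted stand-in for the model. Everything in it is hypothetical: the action dictionary shape, the `Budget` fields, and both stub models exist only for this sketch, and a real loop would also track time and cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_steps: int = 8
    max_tokens: int = 2000


def run_agent(model, tools, question, budget=Budget()):
    """ReAct-style loop: the model proposes an action, the loop runs it and feeds back the result."""
    transcript = [("question", question)]
    tokens_used = 0
    for step in range(1, budget.max_steps + 1):
        action = model(transcript)  # assumed contract: returns a dict
        tokens_used += action.get("tokens", 0)
        if tokens_used > budget.max_tokens:
            return {"status": "token_budget_exceeded", "steps": step, "answer": None}
        if action["type"] == "final":
            return {"status": "ok", "steps": step, "answer": action["text"]}
        tool = tools.get(action["tool"])
        observation = tool(action["input"]) if tool else f"unknown tool {action['tool']}"
        transcript.append(("action", action))
        transcript.append(("observation", observation))
    return {"status": "step_budget_exceeded", "steps": budget.max_steps, "answer": None}


def scripted_model(transcript):
    # Stand-in for an LLM call: look something up once, then answer.
    if transcript[-1][0] == "observation":
        return {"type": "final", "text": f"The answer is {transcript[-1][1]}", "tokens": 20}
    return {"type": "tool", "tool": "lookup", "input": "capital of France", "tokens": 30}


def looping_model(transcript):
    # Reproduces the failure mode: it never emits a final answer.
    return {"type": "tool", "tool": "lookup", "input": "again", "tokens": 10}


DEMO_TOOLS = {"lookup": lambda query: "Paris" if "France" in query else "no result"}
```

With `scripted_model` the loop finishes in two steps; with `looping_model` it stops at whichever budget runs out first instead of spinning forever.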

06

Local Inference Bring-Up

Ollama for iteration · llama.cpp for portability · vLLM for throughput · SGLang, TensorRT-LLM, TGI · quantization (GGUF, AWQ, GPTQ, bitsandbytes) · speculative decoding · Apple MLX.

Lab 06

Same 7B on Ollama, llama.cpp, vLLM — measure tokens/sec, p95, VRAM

07

Embeddings and Vector Search

Open families — BGE, GTE, jina-embeddings-v3, nomic-embed, E5-Mistral · vendor embeddings · the MTEB leaderboard, read skeptically · ANN indexes (HNSW, IVF, ScaNN).

Lab 07

Three open + one vendor embedding over a legal corpus in pgvector

08

Chunking and Document Processing

Token-window, semantic-paragraph, recursive, sliding-window, late chunking · Unstructured, MinerU, LlamaParse, PyMuPDF · OCR (Tesseract, Surya) · table extraction · chunk size as a hyperparameter.

Lab 08

Chunking A/B harness — MRR, Recall@5, faithfulness delta
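
Two of the metrics named in this lab, MRR and Recall@5, are simple enough to write by hand. The sketch below assumes each query comes with a ranked list of retrieved document ids and a set of relevant ids; the function names and the toy data are illustrative only, and the faithfulness delta is not covered here.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant hit, or 0.0 if nothing relevant was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    return {
        "mrr": sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n,
        f"recall@{k}": sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / n,
    }


# Toy results for one chunking strategy over two queries.
strategy_a = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d9", "d4"], {"d4", "d8"}),  # first relevant hit at rank 3, one of two found
]
```

Running `evaluate` once per chunking strategy over the same query set gives the side-by-side numbers an A/B harness would report.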

09

Reranking and Hybrid Search

BM25 + dense · reciprocal rank fusion · bge-reranker-v2, Cohere, ColBERT late-interaction · text-to-SQL · query rewriting · HyDE.

Lab 09

Cumulative lift: dense → +BM25 → +RRF → +reranker → +HyDE
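
Reciprocal rank fusion is the step that merges the BM25 and dense rankings without having to reconcile their score scales. A minimal sketch follows, assuming each retriever returns an ordered list of document ids; the constant `k=60` is the value commonly used in the literature, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank well ("a" and "c" here) rise above documents only one of them found, which is the behavior the cumulative-lift comparison is meant to expose.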

10

Vector Stores in Production

pgvector, Qdrant, Weaviate, Milvus, Chroma · filtered ANN · backup, replication, rebuild · GraphRAG and knowledge-graph hybrids · agentic-RAG patterns.

Lab 10

Same pipeline across pgvector, Qdrant, Weaviate — ingest, query, recovery

11

Memory Systems and Context Budgeting

Three tiers: episodic, semantic, procedural · summarization (rolling, hierarchical, map-reduce) · eviction (LRU, salience-weighted) · MemGPT / Letta · the "lost in the middle" effect.

Lab 11

40-turn chat with three-tier memory vs no-memory baseline
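
One way to picture "memory as a context-budget cache" is a store that evicts its least salient, oldest entry whenever the token budget is exceeded. The sketch below is only that picture: the `ContextBudget` class, the salience scores, and the whitespace token estimate are assumptions for illustration, and a real system would count tokens with the model's tokenizer and summarize rather than simply drop.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class MemoryItem:
    text: str
    salience: float  # 0.0 = forgettable, 1.0 = must keep
    turn: int        # insertion order, used as the recency tie-break


class ContextBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.items = []
        self._turn = count(1)

    @staticmethod
    def cost(text):
        # Crude estimate: one token per whitespace-separated word (an assumption).
        return len(text.split())

    def used(self):
        return sum(self.cost(item.text) for item in self.items)

    def add(self, text, salience):
        """Store an item, then evict low-salience, older items until the budget fits."""
        self.items.append(MemoryItem(text, salience, next(self._turn)))
        evicted = []
        while self.used() > self.max_tokens and len(self.items) > 1:
            # Never evict the item that was just added.
            victim = min(self.items[:-1], key=lambda item: (item.salience, item.turn))
            self.items.remove(victim)
            evicted.append(victim.text)
        return evicted
```

Setting every salience to the same value makes eviction purely oldest-first, which is a reasonable baseline to compare the salience-weighted policy against.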

12

Multimodal RAG and Evaluation

VLMs — LLaVA, Qwen2-VL, Phi-Vision, InternVL · CLIP / SigLIP · Whisper, Piper, XTTS · SDXL / Flux · Ragas as standard · DeepEval, promptfoo, TruLens · calibrated LLM-as-judge.

Lab 12

Full Ragas suite — faithfulness, context recall, precision, relevancy

13

LangGraph and the Graph Pattern

State graphs, conditional edges, checkpoints, persistence · supervisor, swarm, hierarchical patterns · honest critique of AutoGen and CrewAI · when a state machine beats an agent.

Lab 13

Re-implement Week 5 as a LangGraph with SQLite checkpointer
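
Before reaching for a framework, it can help to see the graph pattern in plain Python: nodes that return state updates, routers that act as conditional edges, and a checkpoint written after every node. The `TinyGraph` class below is a hypothetical teaching sketch and does not mirror LangGraph's API; checkpoints are JSON strings in a list rather than rows in SQLite.

```python
import json

END = "__end__"


class TinyGraph:
    def __init__(self):
        self.nodes = {}
        self.routers = {}

    def add_node(self, name, func, router):
        """func: state -> dict of updates; router: state -> next node name or END."""
        self.nodes[name] = func
        self.routers[name] = router

    def run(self, start, state, checkpoints=None, max_steps=20):
        current, steps = start, 0
        while current != END:
            if steps >= max_steps:
                raise RuntimeError("graph did not reach END within max_steps")
            state = {**state, **self.nodes[current](state)}
            if checkpoints is not None:
                # A serialized snapshot after each node is what makes resuming possible.
                checkpoints.append(json.dumps({"after": current, "state": state}))
            current = self.routers[current](state)
            steps += 1
        return state


def draft(state):
    revisions = state["revisions"] + 1
    return {"draft": f"draft v{revisions}", "revisions": revisions}


def review(state):
    return {"approved": state["revisions"] >= 2}


graph = TinyGraph()
graph.add_node("draft", draft, lambda state: "review")
graph.add_node("review", review, lambda state: END if state["approved"] else "draft")
```

The `review` router is the conditional edge: it loops back to `draft` until the state says the work is approved, and `max_steps` bounds the loop if it never is.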

14

Mastra, Inngest, and TS Agent Stacks

Mastra (TypeScript-first) · Inngest for event-driven durable execution · Trigger.dev · Temporal at scale · Python-first vs TS-first agent stacks, honestly.

Lab 14

Supervisor in Mastra + LangGraph · Inngest-triggered research run

15

MCP — The Cross-Vendor Tool Protocol

MCP server / client, transport modes, tool / resource / prompt primitives · writing MCP servers in Python and TS · the open OpenClaw family — open MCP gateways, self-hosted MCP servers, community Claude-compatible loops · MCP security review.

Lab 15

Two MCP servers (FS + private-corpus search) over stdio and HTTP

16

Fine-Tuning at the Modern Scale

When not to fine-tune (almost always) · LoRA, QLoRA, DoRA · SFT, DPO, ORPO, KTO · NeMo Framework, Axolotl, Unsloth · RLHF/RLAIF at survey depth.

Lab 16

Unsloth SFT of Qwen2.5-7B on a 500-example custom DSL

17

Safety — Injection, Jailbreaks, Filtering

Direct vs indirect prompt injection · jailbreak surface · output filtering (regex, classifier, judge) · Llama Guard, OpenAI Moderation, Perspective · the OWASP LLM Top 10 · tool-use threat modeling.

Lab 17

Red-team your Week 15 MCP agent — 25 attacks, 3 defenses

18

Observability for Agentic Systems

OpenTelemetry Gen-AI semantic conventions · Langfuse (self-hosted), Arize Phoenix, LangSmith, Helicone · token accounting per route/user/model · latency SLOs · eval-on-traces.

Lab 18

Three dashboards: token-by-route, p95-by-step, retrieval over time

19

vLLM in Production

Continuous batching · paged attention · prefix caching · tensor / pipeline parallel · multi-replica vLLM behind LiteLLM · speculative decoding · OpenAI-compatible API.

Lab 19

vLLM + LiteLLM on H100 — concurrency 1/8/32/128 + break-even memo
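
The break-even memo comes down to one comparison: what a million tokens costs on a rented GPU at a measured throughput versus what a vendor charges. The sketch below shows that arithmetic under simplifying assumptions (a single blended token price, no idle or engineering cost). The function names are hypothetical and the numbers in the usage note are placeholders, not measurements.

```python
def self_hosted_cost_per_million(gpu_dollars_per_hour, tokens_per_second, utilization=1.0):
    """Dollars per one million tokens on a GPU that is busy `utilization` of each hour."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(gpu_dollars_per_hour, tokens_per_second, vendor_dollars_per_million):
    """Fraction of each hour the GPU must be busy to match the vendor price.

    A result above 1.0 means self-hosting cannot match the vendor at this throughput.
    """
    full_load_cost = self_hosted_cost_per_million(gpu_dollars_per_hour, tokens_per_second)
    return full_load_cost / vendor_dollars_per_million
```

For example, with placeholder inputs of $3.00 per GPU-hour, 2,500 tokens per second, and a vendor price of $1.00 per million tokens, `break_even_utilization(3.0, 2500, 1.0)` comes out to about one third: the GPU would need to be busy roughly a third of the time to match the vendor.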

20

NeMo Inference and the NVIDIA Stack

NVIDIA NeMo Inference for production serving · TensorRT-LLM kernels · Triton Inference Server · NeMo Guardrails as policy · honest NeMo vs vLLM comparison.

Lab 20

Qwen2.5-14B on NeMo + Triton — benchmark vs vLLM, add a guardrail

21

Cost Engineering and Model Routing

Routing — small for easy, big for hard · semantic cache (GPTCache or self-built pgvector) · prompt compression (LLMLingua) · batched inference · per-feature cost accounting · speculative decoding as a cost lever.

Lab 21

Router: Qwen2.5-7B local vs Claude vendor + 0.92-cosine semantic cache
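
A rough sketch of how the router and the semantic cache might fit together: check the cache first, and only on a miss pick the small or large model. The 0.92 cosine threshold comes from the lab title; everything else is an assumption for illustration, including the toy bag-of-words `embed` function (a real system would call an embedding model), the in-memory list in place of pgvector, and the `is_hard` heuristic.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def get(self, query):
        """Return the cached answer for the nearest stored query, if it is close enough."""
        q = embed(query)
        best = max(self.entries, key=lambda entry: cosine(q, entry[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))


def answer(query, cache, small_model, large_model, is_hard):
    """Cache first; on a miss, route easy queries to the small model and hard ones to the large."""
    cached = cache.get(query)
    if cached is not None:
        return cached, "cache"
    route = "large" if is_hard(query) else "small"
    result = (large_model if route == "large" else small_model)(query)
    cache.put(query, result)
    return result, route
```

Returning the route alongside the answer makes it straightforward to attribute cost per request to the cache, the local model, or the vendor model.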

22

Capstone Sprint A — Retrieval & Memory

Architecture review · hybrid retrieval over a 10 GB private corpus · three-tier memory wiring · Mermaid architecture diagram · 6-page architecture document.

Capstone

Retrieval + memory layers of the agentic research assistant

23

Capstone Sprint B — Agents, MCP, Eval, Serving

Supervisor + retrieval + code + writing agents · MCP tool surface · vLLM cluster deploy · Ragas + calibrated LLM-as-judge on 100-question gold set · OTel traces to Langfuse + Phoenix · cost-tracked routing.

Capstone

End-to-end system shipped as deploy URL or docker compose image

24

Chaos Drill, Eval-in-Prod, On-Call

Eval-in-prod with shadow traffic · blue/green model deploys · canary by user cohort · the on-call runbook for an agentic system · the chaos drill · the postmortem · interview prep.

Chaos Drill

GPU node loss · prompt-injection on a tool · vector-index corruption

§ V · The Toolchain

Open-source first, vendor-aware.

Every primary tool below is open-source or has an open-weights, self-hosted path. Vendor APIs (OpenAI, Anthropic, Gemini, Bedrock) are taught as the production scale path and the frontier-capability path — never as the only path.

  • Orchestration: LangGraph (state graphs · persistence)
  • TS Agents: Mastra (TypeScript-first workflows)
  • Protocol: MCP (cross-vendor tool surface)
  • Local Inference: Ollama (fast iteration · Apple Silicon)
  • Serving: vLLM (continuous batching · paged attention)
  • NVIDIA Stack: NeMo Inference (enterprise serving · Guardrails)
  • Vector Store: Qdrant (Rust · filtered ANN)
  • Routing: LiteLLM (vendor + self-hosted router)
  • Observability: Langfuse (open · self-hostable tracing)
  • RAG Eval: Ragas (faithfulness · context recall)
  • Tracing: OpenTelemetry Gen-AI (the cross-vendor standard)
  • HF Serving: Hugging Face TGI (pragmatic HF-ecosystem path)

§ VI · Skills You Will Carry

What you walk away with.

By the end of Week 24, you are able to do each of the following — credibly, on a real system, in front of a real reviewer.

  • Reason about LLMs as system components — tokens, context, KV cache, sampling — and pick the right knob for the right product symptom.
  • Treat prompts as code: version, diff, test, gate releases on regression with promptfoo and Langfuse.
  • Stand up local inference on commodity hardware with Ollama and llama.cpp, and scale to a multi-GPU vLLM cluster with continuous batching.
  • Operate NVIDIA NeMo Inference, Triton, and NeMo Guardrails; choose between vLLM, SGLang, TGI, and TensorRT-LLM with reasons.
  • Build a retrieval-augmented system end to end — chunking, open embeddings, rerankers, hybrid search, GraphRAG, agentic RAG.
  • Choose, deploy, and operate a vector store (pgvector, Qdrant, Weaviate, Milvus, Chroma) and defend the choice in review.
  • Process real documents — PDFs via Unstructured / MinerU, OCR with Tesseract / Surya, tables and multimodal pages.
  • Design and ship tool-using agents with function calling, JSON-mode, and grammar-constrained decoding over MCP.
  • Orchestrate multi-agent systems with LangGraph and Mastra — supervisor, swarm, hierarchical patterns — and read their failure modes.
  • Build memory tiers — episodic, semantic, procedural — and budget context windows like cache.
  • Fine-tune open-weights models with LoRA / QLoRA on Axolotl or Unsloth; decide when fine-tuning is the wrong answer.
  • Instrument an agentic system with OpenTelemetry Gen-AI semantic conventions and trace every step.
  • Evaluate like an engineer: golden sets, Ragas, DeepEval, promptfoo, calibrated LLM-as-judge, eval-in-prod.
  • Run cost-tracked model routing — small for easy, big for hard — with semantic cache and prompt compression.
  • Threat-model an agentic system; defend against direct and indirect prompt injection; red-team your own product.
  • Run on-call for an LLM-backed product — blue/green, canary, postmortem, the 3 AM runbook.

§ VII · The Capstone

One agent. Shipped, signed, evaluated.

Weeks 22–24 are reserved for a single substantial system — the kind a real product team would scope across a quarter. Architecture document, live deploy, video walkthrough, chaos-drill postmortem.

Capstone Brief

Production Agentic Research Assistant

A multi-agent system — supervisor plus retrieval, code, and writing agents — operating over a 10 GB private corpus with hybrid retrieval, three-tier memory, an MCP tool surface, and self-hosted serving on a vLLM cluster with continuous batching. Cost-tracked routing between a local 7B/13B and a frontier vendor model. Full eval suite. OTel tracing. Red-team report. Chaos-drill postmortem.

  • Supervisor + retrieval-agent + code-agent + writing-agent, orchestrated as a LangGraph state machine with persistence.
  • Hybrid retrieval (BM25 + dense + reranker) over a 10 GB private corpus, with chunking A/B'd and embeddings chosen with reasons.
  • Three memory tiers — episodic, semantic, procedural — surviving multi-turn dialogue with eviction and salience weighting.
  • MCP-server tool surface: filesystem, web, calculator, and one custom domain tool, with input validation and rate limiting.
  • Self-hosted vLLM cluster (continuous batching, paged attention) plus a LiteLLM router and a vendor fallback for hard routes.
  • Full eval suite — Ragas + calibrated LLM-as-judge on a 100-question gold set, plus shadow online eval in production.
  • OpenTelemetry Gen-AI traces flowing to Langfuse and Arize Phoenix, with three dashboards (tokens, latency, retrieval).
  • A red-team report (25 adversarial prompts, defense success before and after hardening) and a written threat model.
  • Chaos drill in a single 4-hour window — GPU node loss, prompt-injection on a tool, vector-index corruption — with postmortem.
  • A 5-minute video walkthrough plus an on-call runbook another engineer can read at 3 AM.

§ VIII · Getting Started

Four steps. Then begin.

The setup is intentionally lightweight. If you have a Linux, macOS, or WSL2 laptop and can run a terminal command, you can begin Week 1 today. No GPU required on day one.

# 1. Clone the curriculum repository
git clone https://github.com/CODE-CRUNCH-WORLDWIDE/C23-CRUNCH-AGENTS.git
cd C23-CRUNCH-AGENTS

# 2. Pull a local model (Apple Silicon / Linux / WSL2)
ollama pull llama3.1                          # or qwen2.5:7b, mistral, gemma3

# 3. Set up a Python venv with uv (fast, reproducible)
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# 4. Open Week 1 README and begin
$EDITOR curriculum/week-01-llm-as-system-component/README.md

Need the hardware budget, rented-GPU recipes, or CPU-only fallbacks? See the README.

§ IX · Frequently Asked

Questions, anticipated.

Do I need a GPU?

Not on day one. A 16 GB laptop carries you through Week 6, and Apple Silicon via Metal carries you further. From Week 7 onward a 24 GB GPU (RTX 3090 / 4090, used or new, or a rented A10 / L4 at ~$0.50–$1.00 per hour) unlocks the local-embedding and reranker labs. Two specific labs (vLLM cluster, NeMo Inference) benefit from H100 access on rented spot instances. Every lab has a CPU-only fallback. Total compute budget across the course on rented GPUs is roughly $30.

§ X · Begin

Twenty-four weeks from now,
you will have shipped an agent.

Open the repository. Read Week 1. The agent loop is yours.