C23 — Crunch Agents · AI Agent Systems Engineering

Crunch Agents.

Twenty-four weeks on the modern AI engineering stack. The agent loop as a discipline. Retrieval as engineering. Memory as a context-budget cache. Multi-agent orchestration graphs. Local inference on Ollama, vLLM, and NeMo. Tool surfaces over the Model Context Protocol. Open-Claude-compatible (OpenClaw) runtimes. Eval-in-prod. Free, forever.

The LLM as a System Component

Tokenizer → context → forward pass → sampling · decoder-only transformer at a systems level · open-weights (Llama 4, Qwen 3, Mistral, Gemma 3, DeepSeek) vs closed-weights · reading a model card without falling for the benchmark.

Lab 01

Build llmpick — multi-model recommender

Tokens, Context, and Sampling

BPE · SentencePiece · tiktoken · the price of long context · KV cache · temperature, top-p, top-k, min-p · structured output with outlines, guidance, xgrammar · grammar-constrained decoding.

Lab 02

Tokenization explorer + grammar-constrained JSON
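
The sampling knobs listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not any inference engine's implementation: the function names are hypothetical, the input is a short list of made-up logits, and real runtimes apply these filters on tensors and in an engine-specific order.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalize(probs, keep):
    """Return {token_id: probability} over the kept ids, rescaled to sum to 1."""
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_k_filter(probs, k):
    """Keep the k most probable token ids."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return _renormalize(probs, keep)

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    keep = [i for i, q in enumerate(probs) if q >= threshold]
    return _renormalize(probs, keep)

# Made-up logits for a four-token vocabulary.
probs = softmax([2.0, 1.0, 0.0, -1.0])
```

Each filter narrows the candidate set before a token is drawn; which one to use, and at what setting, is an empirical question for the workload.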

Prompt Engineering as Engineering

The prompt is code · system/user/assistant roles · few-shot, CoT (honest), self-consistency · jailbreak surface · versioning with promptfoo and Langfuse · spec-then-implement with Claude Code and Cursor.

Lab 03

Promptfoo regression suite — 6 versions, 30 golden examples

Tool Calling and Structured Output

Function calling across vendors · open-source equivalents on Llama and Qwen · MCP as the cross-vendor protocol (stdio, SSE, streamable HTTP) · JSON-mode vs grammar-constrained · tool-use as RCE.

Lab 04

4-tool agent on Claude + local Qwen2.5-7B

The Agent Loop

ReAct, plan-and-execute, reflection, self-critique · the "infinite tool-call" failure · step / token / time / cost budgets · claude-agent-sdk, OpenAI Agents SDK, AWS Strands, Google ADK at survey depth.

Lab 05

~150-line ReAct loop, no framework, vs claude-agent-sdk
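
As a rough picture of what a budgeted loop looks like, here is a minimal sketch with a scripted stand-in for the model. Everything in it is an assumption made for illustration: `Budget`, `run_agent`, `scripted_model`, the action dictionary shape, and the single `add` tool are hypothetical, and a real loop would call an LLM and count tokens from the API response.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    """Step and token ceilings; either one stops the loop."""
    max_steps: int = 8
    max_tokens: int = 4000
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> None:
        self.steps += 1
        self.tokens += tokens

    def exhausted(self) -> bool:
        return self.steps >= self.max_steps or self.tokens >= self.max_tokens

def run_agent(model, tools, task, budget):
    """Alternate model decisions and tool calls until an answer or budget exhaustion."""
    transcript = [f"task: {task}"]
    while not budget.exhausted():
        # Hypothetical action shape: {"tool": ..., "input": ...} or {"answer": ...}
        action = model(transcript)
        budget.charge(action.get("tokens", 0))
        if "answer" in action:
            return action["answer"]
        tool = tools.get(action["tool"])
        observation = tool(action["input"]) if tool else f"unknown tool {action['tool']}"
        transcript.append(f"action: {action['tool']}({action['input']})")
        transcript.append(f"observation: {observation}")
    return "stopped: budget exhausted"

def scripted_model(transcript):
    """Stand-in for an LLM: call the tool once, then answer with its observation."""
    if not any(line.startswith("observation:") for line in transcript):
        return {"tool": "add", "input": "2 3", "tokens": 50}
    return {"answer": transcript[-1].removeprefix("observation: "), "tokens": 20}

tools = {"add": lambda text: str(sum(int(part) for part in text.split()))}
result = run_agent(scripted_model, tools, "add 2 and 3", Budget())
```

A model that never answers hits the step ceiling instead of looping forever, which is the failure the budgets exist to contain.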

Local Inference Bring-Up

Ollama for iteration · llama.cpp for portability · vLLM for throughput · SGLang, TensorRT-LLM, TGI · quantization (GGUF, AWQ, GPTQ, bitsandbytes) · speculative decoding · Apple MLX.

Lab 06

Same 7B on Ollama, llama.cpp, vLLM — measure tokens/sec, p95, VRAM

Embeddings and Vector Search

Open families — BGE, GTE, jina-embeddings-v3, nomic-embed, E5-Mistral · vendor embeddings · the MTEB leaderboard, read skeptically · ANN indexes (HNSW, IVF, ScaNN).

Lab 07

Three open + one vendor embedding over a legal corpus in pgvector

Chunking and Document Processing

Token-window, semantic-paragraph, recursive, sliding-window, late chunking · Unstructured, MinerU, LlamaParse, PyMuPDF · OCR (Tesseract, Surya) · table extraction · chunk size as a hyperparameter.

Lab 08

Chunking A/B harness — MRR, Recall@5, faithfulness delta
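
Two of the retrieval metrics named in this lab are simple enough to sketch directly. The function names and the toy document ids below are hypothetical; the sketch assumes each query comes with a ranked list of retrieved ids and a non-empty set of relevant ids.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant ids that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy data: three queries with made-up ids.
runs = [
    (["a", "b", "c"], {"b"}),   # first hit at rank 2
    (["x", "y", "z"], {"x"}),   # first hit at rank 1
    (["p", "q"], {"r"}),        # no hit
]
mrr = mean_reciprocal_rank(runs)
```

Running the same metrics over each chunking strategy, with retrieval held fixed, is what turns chunk size into a measurable hyperparameter.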

Reranking and Hybrid Search

BM25 + dense · reciprocal rank fusion · bge-reranker-v2, Cohere, ColBERT late-interaction · text-to-SQL · query rewriting · HyDE.

Lab 09

Cumulative lift: dense → +BM25 → +RRF → +reranker → +HyDE
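
Reciprocal rank fusion, the step that merges the BM25 and dense rankings, fits in a few lines. This is a sketch under assumptions: the function name and document ids are hypothetical, and `k=60` is the commonly used smoothing constant rather than a value taken from this curriculum.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) across lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["d1", "d2", "d3"]    # made-up lexical results
dense_ranking = ["d3", "d1", "d4"]   # made-up embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it needs no score calibration between the lexical and dense retrievers.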

Vector Stores in Production

pgvector, Qdrant, Weaviate, Milvus, Chroma · filtered ANN · backup, replication, rebuild · GraphRAG and knowledge-graph hybrids · agentic-RAG patterns.

Lab 10

Same pipeline across pgvector, Qdrant, Weaviate — ingest, query, recovery

Memory Systems and Context Budgeting

Three tiers — episodic, semantic, procedural · summarization (rolling, hierarchical, map-reduce) · eviction (LRU, salience-weighted) · MemGPT / Letta · the "lost in the middle" effect.

Lab 11

40-turn chat with three-tier memory vs no-memory baseline
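
Treating memory as a cache under a context budget implies an eviction policy. The sketch below shows one salience-weighted variant; `MemoryItem`, `fit_to_budget`, the salience scores, and the token counts are all hypothetical, and how salience is actually scored is left open.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    tokens: int       # token count of the item (made-up numbers below)
    salience: float   # hypothetical score in [0, 1]; higher means more worth keeping
    turn: int         # conversation turn at which the item was written

def fit_to_budget(items, budget_tokens):
    """Drop the least salient (then oldest) items until the total fits the token budget."""
    kept = sorted(items, key=lambda m: (m.salience, m.turn))  # eviction candidates first
    total = sum(m.tokens for m in kept)
    while kept and total > budget_tokens:
        evicted = kept.pop(0)
        total -= evicted.tokens
    return sorted(kept, key=lambda m: m.turn)  # restore chronological order

items = [
    MemoryItem("user prefers concise answers", 100, 0.9, 1),
    MemoryItem("small talk about the weather", 200, 0.2, 2),
    MemoryItem("project deadline is Friday", 150, 0.5, 3),
]
kept = fit_to_budget(items, budget_tokens=300)
```

Swapping the sort key to `turn` alone gives an oldest-first policy, which makes the two approaches easy to compare against the no-memory baseline.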

Multimodal RAG and Evaluation

VLMs — LLaVA, Qwen2-VL, Phi-Vision, InternVL · CLIP / SigLIP · Whisper, Piper, XTTS · SDXL / Flux · Ragas as standard · DeepEval, promptfoo, TruLens · calibrated LLM-as-judge.

Lab 12

Full Ragas suite — faithfulness, context recall, precision, relevancy

LangGraph and the Graph Pattern

State graphs, conditional edges, checkpoints, persistence · supervisor, swarm, hierarchical patterns · honest critique of AutoGen and CrewAI · when a state machine beats an agent.

Lab 13

Re-implement Week 5 as a LangGraph with SQLite checkpointer

Mastra, Inngest, and TS Agent Stacks

Mastra (TypeScript-first) · Inngest for event-driven durable execution · Trigger.dev · Temporal at scale · Python-first vs TS-first agent stacks, honestly.

Lab 14

Supervisor in Mastra + LangGraph · Inngest-triggered research run

MCP — The Cross-Vendor Tool Protocol

MCP server / client, transport modes, tool / resource / prompt primitives · writing MCP servers in Python and TS · the open OpenClaw family — open MCP gateways, self-hosted MCP servers, community Claude-compatible loops · MCP security review.

Lab 15

Two MCP servers (FS + private-corpus search) over stdio and HTTP

Fine-Tuning at the Modern Scale

When not to fine-tune (almost always) · LoRA, QLoRA, DoRA · SFT, DPO, ORPO, KTO · NeMo Framework, Axolotl, Unsloth · RLHF/RLAIF at survey depth.

Lab 16

Unsloth SFT of Qwen2.5-7B on a 500-example custom DSL

Safety — Injection, Jailbreaks, Filtering

Direct vs indirect prompt injection · jailbreak surface · output filtering (regex, classifier, judge) · Llama Guard, OpenAI Moderation, Perspective · the OWASP LLM Top 10 · tool-use threat modeling.

Lab 17

Red-team your Week 15 MCP agent — 25 attacks, 3 defenses

Observability for Agentic Systems

OpenTelemetry Gen-AI semantic conventions · Langfuse (self-hosted), Arize Phoenix, LangSmith, Helicone · token accounting per route/user/model · latency SLOs · eval-on-traces.

Lab 18

Three dashboards: token-by-route, p95-by-step, retrieval over time

vLLM in Production

Continuous batching · paged attention · prefix caching · tensor / pipeline parallel · multi-replica vLLM behind LiteLLM · speculative decoding · OpenAI-compatible API.

Lab 19

vLLM + LiteLLM on H100 — concurrency 1/8/32/128 + break-even memo

NeMo Inference and the NVIDIA Stack

NVIDIA NeMo Inference for production serving · TensorRT-LLM kernels · Triton Inference Server · NeMo Guardrails as policy · honest NeMo vs vLLM comparison.

Lab 20

Qwen2.5-14B on NeMo + Triton — benchmark vs vLLM, add a guardrail

Cost Engineering and Model Routing

Routing — small for easy, big for hard · semantic cache (GPTCache or self-built pgvector) · prompt compression (LLMLingua) · batched inference · per-feature cost accounting · speculative decoding as a cost lever.

Lab 21

Router: Qwen2.5-7B local vs Claude vendor + 0.92-cosine semantic cache
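
The semantic cache in this lab comes down to a nearest-neighbor lookup with a similarity cutoff. Here is a minimal in-memory sketch using the 0.92 cosine threshold from the lab title; the `SemanticCache` class is hypothetical, the two-dimensional vectors stand in for real embeddings, and a production version would query pgvector or another index rather than scan a list.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Return a cached response when a new query embedding is close enough to an old one."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        best_score, best_response = 0.0, None
        for cached, response in self.entries:
            score = cosine(embedding, cached)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((list(embedding), response))

cache = SemanticCache(threshold=0.92)
cache.put([1.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05])   # near-duplicate query
miss = cache.get([0.5, 0.5])    # different enough to fall through to a model call
```

The threshold trades cost for correctness: set it lower and more queries skip the model, but more of them get an answer written for a different question.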

Capstone Sprint A — Retrieval & Memory

Architecture review · hybrid retrieval over a 10 GB private corpus · three-tier memory wiring · Mermaid architecture diagram · 6-page architecture document.

Capstone

Retrieval + memory layers of the agentic research assistant

Capstone Sprint B — Agents, MCP, Eval, Serving

Supervisor + retrieval + code + writing agents · MCP tool surface · vLLM cluster deploy · Ragas + calibrated LLM-as-judge on 100-question gold set · OTel traces to Langfuse + Phoenix · cost-tracked routing.

Capstone

End-to-end system shipped as deploy URL or docker compose image

Chaos Drill, Eval-in-Prod, On-Call

Eval-in-prod with shadow traffic · blue/green model deploys · canary by user cohort · the on-call runbook for an agentic system · the chaos drill · the postmortem · interview prep.

Chaos Drill

GPU node loss · prompt-injection on a tool · vector-index corruption

# 1. Clone the curriculum repository
git clone https://github.com/CODE-CRUNCH-WORLDWIDE/C23-CRUNCH-AGENTS.git
cd C23-CRUNCH-AGENTS

# 2. Pull a local model (Apple Silicon / Linux / WSL2)
ollama pull llama3.1   # or qwen2.5:7b, mistral, gemma3

# 3. Set up a Python venv with uv (fast, reproducible)
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# 4. Open Week 1 README and begin
$EDITOR curriculum/week-01-llm-as-system-component/README.md

Agents that ship.

Four engineers, one agent loop.

The Python Developer

The Classical ML Engineer

The SRE Crossing Over

The Researcher Applied

From the agent loop to production.

Foundations

RAG & Memory Systems

Agents & Orchestration

Production AI & Capstone

Twenty-four weeks, week by week.

Open-source first, vendor-aware.

What you walk away with.

One agent. Shipped, signed, evaluated.

Production Agentic Research Assistant

Four commands. Then begin.

Questions, anticipated.

Do I need a GPU?

Do I need C5 (Crunch AI · Data Science) first?

Open-weights or vendor APIs — which side are you on?

How is this different from C5?

Will I be on-call for an LLM product?

What is the OpenClaw / open-MCP ecosystem?

Twenty-four weeks from now, you will have shipped an agent.