Capstone project

Ship a Production AI Agent With Tools, RAG, Guardrails & Evals

Design, build, evaluate, and deploy a real domain agent that uses tools, retrieves from your own knowledge base, respects safety guardrails, and is observable in production.

Estimated effort25–40 hours over 2–3 weeks

Passing score70 / 100

ReviewerReviewed by a senior agent engineer. They will run your repo, hit your eval harness, inspect one trace, and attempt 3 jailbreaks.

The brief

You are the founding AI engineer at a small company. Pick a real workflow that a knowledgeable human currently performs (e.g. triage support tickets, draft fraud-report summaries, prepare investor briefings, write PR responses, qualify sales leads).

Build an agent that does this workflow end-to-end. It must call at least three real tools, retrieve from a knowledge base you ingest yourself, refuse unsafe actions, and produce traceable, auditable runs.

Reviewers will run your agent against a held-out eval set you ship, inspect a captured trace, and try to break your guardrails. The bar is not a demo — it is something a small team could put in front of internal users on Monday.

Deliverables

1
Agent architecture doc
A 1–2 page diagram + write-up of the agent's loop, state, tools, memory, and failure paths.
format: docs/architecture.md with a Mermaid or PNG diagram
2
Knowledge base + RAG pipeline
Ingest at least 200 documents (your own notes, public docs, scraped pages with permission) into a vector store with chunking, embeddings, and retrieval evaluation.
format: scripts/ingest.py + retrieval/eval.json
3
Tool layer
≥3 real tools implemented as Python functions with JSON schemas, including at least one that talks to a real external API and one that mutates state (file write, DB insert, ticket create).
format: agent/tools/*.py + agent/tools/schemas.json
4
Agent loop
ReAct, LangGraph, or equivalent architecture you can defend. Must handle tool errors, loops, and budget exhaustion.
format: agent/loop.py with unit tests
5
Safety guardrails
Input + output guards (denylist, schema validation, LLM-judge), plus a tool allow-list. Documented threat model.
format: agent/guardrails.py + docs/threat_model.md
6
Evaluation harness
≥30 golden test cases with assertions, plus an LLM-judge for open-ended outputs, runnable via one command.
format: evals/golden.jsonl + evals/run.py + evals/report.html
7
Observability
OpenTelemetry traces for every step with input/output attributes; cost + token budget guard; structured logs.
format: agent/tracing.py + observability/README.md
8
Service deployment
FastAPI service in a Docker container exposing POST /ask with auth, /healthz, and a smoke-test script.
format: server/ + Dockerfile + scripts/smoke.sh
9
Engineering report
A 5–7 page PDF covering architecture, eval results, observed failure modes, cost per request, and what you would change with another month.
format: report.pdf with ≥2 figures and a metrics table

Suggested timeline

1. Scope & architecture
Days 1–3
Pick task, write architecture doc, define eval set.
2. RAG + tools
Days 4–8
Ingest KB, implement tools, basic agent loop runs end-to-end.
3. Guardrails
Days 9–11
Input/output guards, tool allow-list, threat model.
4. Evaluation
Days 12–14
Author golden set, build harness, measure baseline.
5. Observability + deploy
Days 15–17
OTel traces, budget guard, FastAPI + Docker.
6. Polish & submit
Days 18–21
Red-team session, final report, demo video.

Graded rubric (100 pts)

Each criterion is scored 0–4. Final score = Σ (score × weight) ÷ 4. You need ≥70 to earn the capstone seal on your transcript.

Task framing & problem fit

10 pts

0
No clear task or user.
1
Toy task with no realistic user.
2
Reasonable task, vague success criteria.
3
Real workflow, explicit success metric, target user persona.
4
Workflow with documented baseline (human time/cost) the agent measurably beats.

Architecture & reasoning loop

15 pts

0
Single LLM call, no loop.
1
Loop exists but cannot recover from tool errors.
2
Loop handles errors and termination conditions.
3
Plan-act-reflect or LangGraph state machine with documented states.
4
Architecture choice defended with an ablation against a simpler baseline.

Tool use & RAG quality

20 pts

0
No real tools; no retrieval.
1
Tools exist but agent rarely picks correctly.
2
3 tools, decent routing, basic RAG.
3
Tool schemas validated; RAG with chunking + retrieval eval; mutations logged.
4
Retrieval eval shows Recall@5 ≥0.8; tool routing accuracy measured on a held-out set.

Safety & guardrails

15 pts

0
None.
1
Denylist only.
2
Input + output guards, allow-list on tools.
3
Threat model documented, ≥3 documented jailbreak attempts blocked.
4
Red-team report with mitigations; PII redaction in logs.

Evaluation rigor

15 pts

0
No eval set.
1
<10 ad-hoc test cases.
2
≥30 cases, automated harness.
3
Mix of deterministic assertions + LLM-judge; reported with CI.
4
Eval gates a CI workflow; regression budgets explicit.

Observability & cost

10 pts

0
No traces, no cost tracking.
1
Print-style logging.
2
Structured logs + token counts.
3
OTel traces per step + per-request USD budget.
4
Trace inspector or dashboard URL provided; cost/request reported.

Deployment & reproducibility

10 pts

0
Runs only on author's laptop.
1
Requirements file but undocumented.
2
Dockerfile + README quickstart.
3
`docker run` brings up /ask and /healthz; smoke test passes.
4
CI builds and smoke-tests the container on every commit.

Report & defensibility

5 pts

0
No report.
1
README-style summary only.
2
Covers what was built.
3
Covers decisions, eval, failure modes, cost.
4
Engineering-postmortem quality, would convince a hiring manager.

Pre-submission checklist

Every item must be true before you submit. Reviewers will spot-check.

Agent ships ≥3 real tools, at least one with state mutation, all with JSON schemas
RAG ingestion script runs end-to-end on a single command and reports chunk count
Retrieval eval reports Recall@5 ≥ 0.7 on a held-out query set
≥30 golden test cases; `make eval` prints PASS X/Y and writes evals/report.html
Guardrails block ≥3 documented unsafe prompts; false-positive rate ≤10% on safe prompts
Every /ask request emits one OTel trace with at least 4 nested spans
Per-request USD budget enforced; cost/request reported in report.pdf
`docker run -p 8080:8080 agent` brings up /ask, /healthz, and passes smoke.sh
Report is 5–7 pages, has ≥2 diagrams, an eval table, and a cost section
Repo has license, pinned requirements, and a 60-second quickstart in README

Stretch goals (bonus 0–10 pts)

Hierarchical agent (manager + sub-agents) with documented hand-off contract
Self-reflection / critic loop with measured quality lift
Multi-tenant: per-user RAG namespace + per-user budget
Public demo URL with rate limits and abuse monitoring
Eval suite runs on every PR and posts a comment with the delta