Ship a Production AI Agent With Tools, RAG, Guardrails & Evals
Design, build, evaluate, and deploy a real domain agent that uses tools, retrieves from your own knowledge base, respects safety guardrails, and is observable in production.
The brief
You are the founding AI engineer at a small company. Pick a real workflow that a knowledgeable human currently performs (e.g. triage support tickets, draft fraud-report summaries, prepare investor briefings, write PR responses, qualify sales leads).
Build an agent that does this workflow end-to-end. It must call at least three real tools, retrieve from a knowledge base you ingest yourself, refuse unsafe actions, and produce traceable, auditable runs.
Reviewers will run your agent against a held-out eval set you ship, inspect a captured trace, and try to break your guardrails. The bar is not a demo — it is something a small team could put in front of internal users on Monday.
Deliverables
- 1. Agent architecture doc
A 1–2 page diagram + write-up of the agent's loop, state, tools, memory, and failure paths.
format: docs/architecture.md with a Mermaid or PNG diagram
- 2. Knowledge base + RAG pipeline
Ingest at least 200 documents (your own notes, public docs, scraped pages with permission) into a vector store with chunking, embeddings, and retrieval evaluation.
format: scripts/ingest.py + retrieval/eval.json
- 3. Tool layer
≥3 real tools implemented as Python functions with JSON schemas, including at least one that talks to a real external API and one that mutates state (file write, DB insert, ticket create).
format: agent/tools/*.py + agent/tools/schemas.json
- 4. Agent loop
ReAct, LangGraph, or equivalent architecture you can defend. Must handle tool errors, loops, and budget exhaustion.
format: agent/loop.py with unit tests
- 5. Safety guardrails
Input + output guards (denylist, schema validation, LLM-judge), plus a tool allow-list. Documented threat model.
format: agent/guardrails.py + docs/threat_model.md
- 6. Evaluation harness
≥30 golden test cases with assertions, plus an LLM-judge for open-ended outputs, runnable via one command.
format: evals/golden.jsonl + evals/run.py + evals/report.html
- 7. Observability
OpenTelemetry traces for every step with input/output attributes; cost + token budget guard; structured logs.
format: agent/tracing.py + observability/README.md
- 8. Service deployment
FastAPI service in a Docker container exposing POST /ask with auth, /healthz, and a smoke-test script.
format: server/ + Dockerfile + scripts/smoke.sh
- 9. Engineering report
A 5–7 page PDF covering architecture, eval results, observed failure modes, cost per request, and what you would change with another month.
format: report.pdf with ≥2 figures and a metrics table
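The tool-layer, agent-loop, and guardrail deliverables fit together roughly as in the sketch below. It is a minimal illustration under stated assumptions, not a reference implementation: `run_agent`, `ToolCall`, `RunResult`, the `policy` callable (a stand-in for the LLM call), and the flat `cost_per_step_usd` are hypothetical names introduced here. It shows one way a loop might survive tool errors, detect repeated calls, enforce a tool allow-list, and stop on budget exhaustion.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class RunResult:
    status: str  # "done", "max_steps", "budget_exhausted", or "loop_detected"
    answer: Optional[str] = None
    history: list = field(default_factory=list)
    spent_usd: float = 0.0


def run_agent(
    policy: Callable[[list], Any],    # stand-in for the LLM: returns a ToolCall or a final answer string
    tools: dict,                      # tool name -> Python callable
    allowed: set,                     # allow-list enforced by the loop, not by the model
    max_steps: int = 8,
    budget_usd: float = 0.05,
    cost_per_step_usd: float = 0.01,  # assumed flat cost; a real loop would price token counts
) -> RunResult:
    result = RunResult(status="max_steps")
    seen = set()
    for _ in range(max_steps):
        # Budget guard: stop before paying for a step we cannot afford.
        if result.spent_usd + cost_per_step_usd > budget_usd:
            result.status = "budget_exhausted"
            return result
        result.spent_usd += cost_per_step_usd

        action = policy(result.history)
        if isinstance(action, str):
            result.status, result.answer = "done", action
            return result

        # Loop guard: the same tool with the same arguments twice is treated as stuck.
        key = (action.name, repr(sorted(action.args.items())))
        if key in seen:
            result.status = "loop_detected"
            return result
        seen.add(key)

        if action.name not in allowed or action.name not in tools:
            observation = {"error": f"tool '{action.name}' is not allowed"}
        else:
            try:
                observation = {"ok": tools[action.name](**action.args)}
            except Exception as exc:
                # Tool failures become observations so the policy can react to them.
                observation = {"error": f"{type(exc).__name__}: {exc}"}
        result.history.append((action, observation))
    return result
```

Returning a status instead of raising keeps every termination path (answer, step limit, budget, loop) visible to the caller, which makes each one straightforward to unit-test and to record in a trace.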
Suggested timeline
- 1. Scope & architecture (Days 1–3)
Pick task, write architecture doc, define eval set.
- 2. RAG + tools (Days 4–8)
Ingest KB, implement tools, basic agent loop runs end-to-end.
- 3. Guardrails (Days 9–11)
Input/output guards, tool allow-list, threat model.
- 4. Evaluation (Days 12–14)
Author golden set, build harness, measure baseline.
- 5. Observability + deploy (Days 15–17)
OTel traces, budget guard, FastAPI + Docker.
- 6. Polish & submit (Days 18–21)
Red-team session, final report, demo video.
Graded rubric (100 pts)
Each criterion is scored 0–4 and carries a weight. Final score = Σ (score × weight) ÷ 4, so the weights sum to 100 and a 4 on every criterion yields 100 points. You need ≥70 to earn the capstone seal on your transcript.
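As a worked example of the scoring formula, the sketch below computes a final score from per-criterion scores. The per-criterion weights are not listed here, so the weights and criterion names in the example are hypothetical placeholders chosen only so that they sum to 100; `final_score` is likewise an illustrative name.

```python
def final_score(scores: dict, weights: dict) -> float:
    """Final score = sum(score * weight) / 4, with each score in 0..4."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    for name, score in scores.items():
        if not 0 <= score <= 4:
            raise ValueError(f"{name}: score must be between 0 and 4")
    return sum(scores[name] * weights[name] for name in scores) / 4


# Hypothetical weights and criterion names (not taken from the rubric); they sum to 100.
weights = {"framing": 10, "architecture": 15, "tools_rag": 15, "guardrails": 15,
           "evaluation": 15, "observability": 10, "deployment": 10, "report": 10}
scores = {"framing": 3, "architecture": 3, "tools_rag": 2, "guardrails": 3,
          "evaluation": 3, "observability": 2, "deployment": 3, "report": 3}

print(final_score(scores, weights))  # 68.75
```

Under these assumed weights, mostly 3s with two 2s lands at 68.75, just short of the 70-point threshold.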
- 0: No clear task or user.
- 1: Toy task with no realistic user.
- 2: Reasonable task, vague success criteria.
- 3: Real workflow, explicit success metric, target user persona.
- 4: Workflow with documented baseline (human time/cost) the agent measurably beats.

- 0: Single LLM call, no loop.
- 1: Loop exists but cannot recover from tool errors.
- 2: Loop handles errors and termination conditions.
- 3: Plan-act-reflect or LangGraph state machine with documented states.
- 4: Architecture choice defended with an ablation against a simpler baseline.

- 0: No real tools; no retrieval.
- 1: Tools exist but agent rarely picks correctly.
- 2: 3 tools, decent routing, basic RAG.
- 3: Tool schemas validated; RAG with chunking + retrieval eval; mutations logged.
- 4: Retrieval eval shows Recall@5 ≥0.8; tool routing accuracy measured on a held-out set.

- 0: None.
- 1: Denylist only.
- 2: Input + output guards, allow-list on tools.
- 3: Threat model documented, ≥3 documented jailbreak attempts blocked.
- 4: Red-team report with mitigations; PII redaction in logs.

- 0: No eval set.
- 1: <10 ad-hoc test cases.
- 2: ≥30 cases, automated harness.
- 3: Mix of deterministic assertions + LLM-judge; reported with CI.
- 4: Eval gates a CI workflow; regression budgets explicit.

- 0: No traces, no cost tracking.
- 1: Print-style logging.
- 2: Structured logs + token counts.
- 3: OTel traces per step + per-request USD budget.
- 4: Trace inspector or dashboard URL provided; cost/request reported.

- 0: Runs only on author's laptop.
- 1: Requirements file but undocumented.
- 2: Dockerfile + README quickstart.
- 3: `docker run` brings up /ask and /healthz; smoke test passes.
- 4: CI builds and smoke-tests the container on every commit.

- 0: No report.
- 1: README-style summary only.
- 2: Covers what was built.
- 3: Covers decisions, eval, failure modes, cost.
- 4: Engineering-postmortem quality, would convince a hiring manager.
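The rubric's Recall@5 threshold can be measured with a few lines of code. The sketch below assumes one common definition (the fraction of a query's relevant document IDs that appear among the top 5 retrieved IDs, averaged over queries); the brief does not say which variant reviewers use, and some teams report a hit rate instead. The function names and document IDs are illustrative.

```python
def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of the relevant document IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list, k: int = 5) -> float:
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Made-up example: two held-out queries with hand-labelled relevant documents.
runs = [
    (["d1", "d7", "d3", "d9", "d2", "d4"], {"d3", "d4"}),  # d4 is ranked 6th, so it is missed
    (["d5", "d8", "d6", "d1", "d0"], {"d5"}),
]
print(mean_recall_at_k(runs))  # 0.75
```

Labelling the relevant IDs by hand for a held-out query set is the expensive part; the metric itself stays this small.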
Pre-submission checklist
Every item must be true before you submit. Reviewers will spot-check.
- Agent ships ≥3 real tools, at least one with state mutation, all with JSON schemas
- RAG ingestion script runs end-to-end on a single command and reports chunk count
- Retrieval eval reports Recall@5 ≥ 0.7 on a held-out query set
- ≥30 golden test cases; `make eval` prints PASS X/Y and writes evals/report.html
- Guardrails block ≥3 documented unsafe prompts; false-positive rate ≤10% on safe prompts
- Every /ask request emits one OTel trace with at least 4 nested spans
- Per-request USD budget enforced; cost/request reported in report.pdf
- `docker run -p 8080:8080 agent` brings up /ask, /healthz, and passes smoke.sh
- Report is 5–7 pages, has ≥2 diagrams, an eval table, and a cost section
- Repo has license, pinned requirements, and a 60-second quickstart in README
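The `PASS X/Y` line in the checklist could come from a harness along these lines. This is a sketch only: the brief does not define the schema of evals/golden.jsonl, so the `input`, `must_contain`, and `must_not_contain` fields are assumptions, as are the `check_case` and `run_golden` names. It covers deterministic assertions and leaves out the LLM-judge and the HTML report.

```python
import json
from typing import Callable, Iterable


def check_case(case: dict, output: str) -> list:
    """Return descriptions of failed assertions (an empty list means the case passed)."""
    failures = []
    text = output.lower()
    for needle in case.get("must_contain", []):
        if needle.lower() not in text:
            failures.append(f"missing: {needle!r}")
    for needle in case.get("must_not_contain", []):
        if needle.lower() in text:
            failures.append(f"forbidden: {needle!r}")
    return failures


def run_golden(lines: Iterable[str], agent: Callable[[str], str]) -> tuple:
    """Run every non-empty JSONL line through the agent and count passing cases."""
    passed, total, report = 0, 0, []
    for line in lines:
        if not line.strip():
            continue
        case = json.loads(line)
        total += 1
        try:
            failures = check_case(case, agent(case["input"]))
        except Exception as exc:
            # An agent crash counts as a failed case rather than aborting the whole run.
            failures = [f"agent raised {type(exc).__name__}: {exc}"]
        if not failures:
            passed += 1
        report.append({"id": case.get("id"), "failures": failures})
    print(f"PASS {passed}/{total}")
    return passed, total, report
```

A `make eval` target might open the JSONL file, pass the file object as `lines`, and exit non-zero when `passed` falls below an agreed threshold, so the same command can gate CI.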
Stretch goals (bonus 0–10 pts)
- Hierarchical agent (manager + sub-agents) with documented hand-off contract
- Self-reflection / critic loop with measured quality lift
- Multi-tenant: per-user RAG namespace + per-user budget
- Public demo URL with rate limits and abuse monitoring
- Eval suite runs on every PR and posts a comment with the delta
