AI Agents Mastery
Capstone project

Ship a Production AI Agent With Tools, RAG, Guardrails & Evals

Design, build, evaluate, and deploy a real domain agent that uses tools, retrieves from your own knowledge base, respects safety guardrails, and is observable in production.

Estimated effort: 25–40 hours over 2–3 weeks
Passing score: 70 / 100
Reviewer: Reviewed by a senior agent engineer. They will run your repo, hit your eval harness, inspect one trace, and attempt 3 jailbreaks.

The brief

You are the founding AI engineer at a small company. Pick a real workflow that a knowledgeable human currently performs (e.g. triage support tickets, draft fraud-report summaries, prepare investor briefings, write PR responses, qualify sales leads).

Build an agent that does this workflow end-to-end. It must call at least three real tools, retrieve from a knowledge base you ingest yourself, refuse unsafe actions, and produce traceable, auditable runs.

Reviewers will run your agent against a held-out eval set you ship, inspect a captured trace, and try to break your guardrails. The bar is not a demo — it is something a small team could put in front of internal users on Monday.

Deliverables

  1. Agent architecture doc

    A 1–2 page diagram + write-up of the agent's loop, state, tools, memory, and failure paths.

    format: docs/architecture.md with a Mermaid or PNG diagram
  2. Knowledge base + RAG pipeline

    Ingest at least 200 documents (your own notes, public docs, scraped pages with permission) into a vector store with chunking, embeddings, and retrieval evaluation.

    format: scripts/ingest.py + retrieval/eval.json
  3. Tool layer

    ≥3 real tools implemented as Python functions with JSON schemas, including at least one that talks to a real external API and one that mutates state (file write, DB insert, ticket create).

    format: agent/tools/*.py + agent/tools/schemas.json
  4. Agent loop

    ReAct, LangGraph, or equivalent architecture you can defend. Must handle tool errors, loops, and budget exhaustion.

    format: agent/loop.py with unit tests
  5. Safety guardrails

    Input + output guards (denylist, schema validation, LLM-judge), plus a tool allow-list. Documented threat model.

    format: agent/guardrails.py + docs/threat_model.md
  6. Evaluation harness

    ≥30 golden test cases with assertions, plus an LLM-judge for open-ended outputs, runnable via one command.

    format: evals/golden.jsonl + evals/run.py + evals/report.html
  7. Observability

    OpenTelemetry traces for every step with input/output attributes; cost + token budget guard; structured logs.

    format: agent/tracing.py + observability/README.md
  8. Service deployment

    FastAPI service in a Docker container exposing POST /ask with auth, /healthz, and a smoke-test script.

    format: server/ + Dockerfile + scripts/smoke.sh
  9. Engineering report

    A 5–7 page PDF covering architecture, eval results, observed failure modes, cost per request, and what you would change with another month.

    format: report.pdf with ≥2 figures and a metrics table
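
One way to read the tool-layer deliverable is sketched below: a state-mutating tool written as a plain Python function, a JSON Schema-style description of its arguments, and a dispatcher whose registry doubles as the tool allow-list. The tool name `create_ticket`, the in-memory `TICKETS` store, and the hand-rolled `validate_args` helper (which covers only a small subset of JSON Schema) are illustrative assumptions, not requirements of the brief. A real submission would likely use a schema-validation library and a real backend.

```python
import json
from typing import Any, Callable

# Hypothetical in-memory store standing in for a real DB or ticketing API.
TICKETS: list[dict[str, Any]] = []


def create_ticket(title: str, priority: str = "normal") -> dict[str, Any]:
    """State-mutating tool: appends a ticket and returns it."""
    ticket = {"id": len(TICKETS) + 1, "title": title, "priority": priority}
    TICKETS.append(ticket)
    return ticket


# Schema the model would be shown (one possible entry for schemas.json).
CREATE_TICKET_SCHEMA = {
    "name": "create_ticket",
    "description": "Create a support ticket.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["title"],
    },
}

_JSON_TYPES = {"string": str, "integer": int, "boolean": bool}


def validate_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Check required keys, unknown keys, types, and enums. Not full JSON Schema."""
    params = schema["parameters"]
    props = params["properties"]
    errors = [f"missing required argument: {k}"
              for k in params.get("required", []) if k not in args]
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected argument: {key}")
            continue
        spec = props[key]
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{key}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: must be one of {spec['enum']}")
    return errors


# The registry is also the allow-list: anything not listed here cannot be called.
REGISTRY: dict[str, tuple[Callable[..., Any], dict[str, Any]]] = {
    "create_ticket": (create_ticket, CREATE_TICKET_SCHEMA),
}


def call_tool(name: str, raw_args: str) -> dict[str, Any]:
    """Validate and dispatch a model-proposed tool call; never raise to the loop."""
    if name not in REGISTRY:
        return {"ok": False, "error": f"tool not allowed: {name}"}
    func, schema = REGISTRY[name]
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON arguments: {exc}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}
    errors = validate_args(schema, args)
    if errors:
        return {"ok": False, "error": "; ".join(errors)}
    return {"ok": True, "result": func(**args)}
```

Returning a structured error instead of raising lets the agent loop feed the message back to the model as an observation, which is one way to satisfy the "must handle tool errors" requirement.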

Suggested timeline

  1. Scope & architecture (Days 1–3)

    Pick task, write architecture doc, define eval set.

  2. RAG + tools (Days 4–8)

    Ingest KB, implement tools, basic agent loop runs end-to-end.

  3. Guardrails (Days 9–11)

    Input/output guards, tool allow-list, threat model.

  4. Evaluation (Days 12–14)

    Author golden set, build harness, measure baseline.

  5. Observability + deploy (Days 15–17)

    OTel traces, budget guard, FastAPI + Docker.

  6. Polish & submit (Days 18–21)

    Red-team session, final report, demo video.

Graded rubric (100 pts)

Each criterion is scored 0–4. Final score = Σ (score × weight) ÷ 4. You need ≥70 to earn the capstone seal on your transcript.

Task framing & problem fit (10 pts)
  • 0: No clear task or user.
  • 1: Toy task with no realistic user.
  • 2: Reasonable task, vague success criteria.
  • 3: Real workflow, explicit success metric, target user persona.
  • 4: Workflow with documented baseline (human time/cost) the agent measurably beats.

Architecture & reasoning loop (15 pts)
  • 0: Single LLM call, no loop.
  • 1: Loop exists but cannot recover from tool errors.
  • 2: Loop handles errors and termination conditions.
  • 3: Plan-act-reflect or LangGraph state machine with documented states.
  • 4: Architecture choice defended with an ablation against a simpler baseline.
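
The loop requirements in this criterion (recover from tool errors, terminate cleanly) can be illustrated with a minimal sketch. It assumes a hypothetical `policy` callable standing in for the LLM, and the names `Action`, `RunResult`, `max_steps`, and `max_repeats` are invented for illustration. It is a bare act-observe loop, not a full ReAct or LangGraph implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Action:
    tool: Optional[str]  # None means "this is the final answer"
    args: dict[str, Any] = field(default_factory=dict)
    answer: str = ""


@dataclass
class RunResult:
    status: str  # "done", "loop_detected", or "budget_exhausted"
    answer: str
    steps: list[dict[str, Any]]


def run_agent(
    policy: Callable[[str, list[dict[str, Any]]], Action],
    tools: dict[str, Callable[..., Any]],
    question: str,
    max_steps: int = 6,
    max_repeats: int = 2,
) -> RunResult:
    """Call tools until the policy answers, repeats itself, or runs out of steps."""
    history: list[dict[str, Any]] = []
    seen: dict[tuple[str, str], int] = {}
    for _ in range(max_steps):
        action = policy(question, history)
        if action.tool is None:
            return RunResult("done", action.answer, history)

        # Loop detection: the same tool with the same arguments too many times.
        key = (action.tool, repr(sorted(action.args.items())))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            return RunResult("loop_detected", "", history)

        # Tool errors (including unknown tool names) become observations,
        # so the policy gets a chance to recover on the next step.
        try:
            observation = tools[action.tool](**action.args)
            ok = True
        except Exception as exc:
            observation = f"{type(exc).__name__}: {exc}"
            ok = False
        history.append({"tool": action.tool, "args": action.args,
                        "ok": ok, "observation": observation})
    return RunResult("budget_exhausted", "", history)
```

Because the three exit paths return distinct `status` values, each one can be covered by a unit test with a scripted policy, with no model call needed.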

Tool use & RAG quality (20 pts)
  • 0: No real tools; no retrieval.
  • 1: Tools exist but agent rarely picks correctly.
  • 2: 3 tools, decent routing, basic RAG.
  • 3: Tool schemas validated; RAG with chunking + retrieval eval; mutations logged.
  • 4: Retrieval eval shows Recall@5 ≥0.8; tool routing accuracy measured on a held-out set.
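
The brief does not define Recall@5, so the sketch below assumes the common definition: for each query, the fraction of its relevant documents that appear in the top 5 retrieved results, averaged over queries. The function names and the shape of the inputs (`results` mapping query IDs to ranked document IDs, `qrels` mapping query IDs to relevant document IDs) are assumptions for illustration. Some teams report a hit-rate variant instead (1 if any relevant document is in the top 5), so state which one you use.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    results: dict[str, list[str]],
    qrels: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Average recall@k over every query in the held-out set."""
    if not qrels:
        raise ValueError("the held-out query set is empty")
    # A query with no retrieved results scores 0 rather than being skipped.
    scores = [recall_at_k(results.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)
```

Scoring missing queries as 0 keeps a retriever from looking better simply because it returned nothing for the hard queries.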

Safety & guardrails (15 pts)
  • 0: None.
  • 1: Denylist only.
  • 2: Input + output guards, allow-list on tools.
  • 3: Threat model documented, ≥3 documented jailbreak attempts blocked.
  • 4: Red-team report with mitigations; PII redaction in logs.
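
As a rough illustration of the guard types named in this criterion, the sketch below shows an input denylist, an output schema check, and email redaction for logs. The patterns, the required output keys (`answer`, `sources`), and the function names are assumptions chosen for the example. A denylist this small would not stop a determined attacker, which is why the rubric also asks for an LLM judge and a documented threat model.

```python
import json
import re

# Illustrative patterns only; a real denylist would come from the threat model.
DENYLIST = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all )?(previous|prior) instructions",
        r"reveal (your )?system prompt",
    )
]

# Deliberately simple email matcher, used only for log redaction.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def check_input(text: str) -> tuple[bool, str]:
    """Input guard: reject prompts that match a denylisted pattern."""
    for pattern in DENYLIST:
        if pattern.search(text):
            return False, f"blocked by denylist: {pattern.pattern}"
    return True, "ok"


def check_output(raw: str, required_keys: tuple[str, ...] = ("answer", "sources")) -> tuple[bool, str]:
    """Output guard: the agent's reply must be a JSON object with the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(data, dict):
        return False, "output must be a JSON object"
    missing = [k for k in required_keys if k not in data]
    if missing:
        return False, f"missing keys: {missing}"
    return True, "ok"


def redact_pii(text: str) -> str:
    """Replace email addresses before the text is written to logs or traces."""
    return EMAIL.sub("[EMAIL]", text)
```

Each guard returns a reason string alongside the verdict so that blocked requests can be logged and later reviewed for false positives.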

Evaluation rigor (15 pts)
  • 0: No eval set.
  • 1: <10 ad-hoc test cases.
  • 2: ≥30 cases, automated harness.
  • 3: Mix of deterministic assertions + LLM-judge; reported with CI.
  • 4: Eval gates a CI workflow; regression budgets explicit.
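
A harness along these lines could cover the deterministic half of the eval. The case format shown (`id`, `input`, and an `assert` object with `type` and `value`) is a hypothetical layout for `evals/golden.jsonl`, and `gate` with its `min_pass_rate` threshold is one assumed way to express a regression budget as a process exit code. The LLM-judge path and the HTML report are left out.

```python
import json
from typing import Callable, Iterable


def check_case(case: dict, output: str) -> bool:
    """Apply one deterministic assertion to the agent's output."""
    kind = case["assert"]["type"]
    expected = case["assert"]["value"]
    if kind == "equals":
        return output.strip() == expected
    if kind == "contains":
        return expected.lower() in output.lower()
    raise ValueError(f"unknown assertion type: {kind}")


def run_evals(lines: Iterable[str], agent: Callable[[str], str]) -> tuple[int, int, list[str]]:
    """Run every JSONL case and return (passed, total, failed case IDs)."""
    passed, total, failures = 0, 0, []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        case = json.loads(line)
        total += 1
        try:
            output = agent(case["input"])
        except Exception:
            failures.append(case["id"])  # an agent crash counts as a failed case
            continue
        if check_case(case, output):
            passed += 1
        else:
            failures.append(case["id"])
    return passed, total, failures


def gate(passed: int, total: int, min_pass_rate: float = 0.9) -> int:
    """Return a process exit code: 0 if the pass rate meets the budget, else 1."""
    print(f"PASS {passed}/{total}")
    return 0 if total > 0 and passed / total >= min_pass_rate else 1
```

In CI, a script could end with `sys.exit(gate(passed, total))` so that a drop below the agreed pass rate fails the workflow.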

Observability & cost (10 pts)
  • 0: No traces, no cost tracking.
  • 1: Print-style logging.
  • 2: Structured logs + token counts.
  • 3: OTel traces per step + per-request USD budget.
  • 4: Trace inspector or dashboard URL provided; cost/request reported.
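
A per-request USD budget can be as small as the sketch below. The class name `CostBudget`, the `BudgetExceeded` exception, and the per-1k-token prices passed in are all assumptions; real prices depend on your provider and model. Tracing is not shown here. In a real service the same numbers would typically also be attached to the request's trace as span attributes.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a request has spent more than its USD budget."""


@dataclass
class CostBudget:
    limit_usd: float
    usd_per_1k_input: float   # assumed price; replace with your provider's rate
    usd_per_1k_output: float  # assumed price; replace with your provider's rate
    spent_usd: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0

    def charge(self, input_tokens: int, output_tokens: int) -> float:
        """Record one LLM call and raise if the running total exceeds the limit."""
        cost = (
            input_tokens / 1000 * self.usd_per_1k_input
            + output_tokens / 1000 * self.usd_per_1k_output
        )
        # Totals are updated before the check so the overage is still reported.
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.spent_usd += cost
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.limit_usd:.4f} budget"
            )
        return cost
```

Creating one `CostBudget` per request and calling `charge` after every model call gives both the enforcement and the cost/request figure the report asks for.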

Deployment & reproducibility (10 pts)
  • 0: Runs only on author's laptop.
  • 1: Requirements file but undocumented.
  • 2: Dockerfile + README quickstart.
  • 3: `docker run` brings up /ask and /healthz; smoke test passes.
  • 4: CI builds and smoke-tests the container on every commit.

Report & defensibility (5 pts)
  • 0: No report.
  • 1: README-style summary only.
  • 2: Covers what was built.
  • 3: Covers decisions, eval, failure modes, cost.
  • 4: Engineering-postmortem quality, would convince a hiring manager.
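
To make the scoring formula at the top of the rubric concrete, here is a small worked example using the eight criterion weights listed above. The function name `final_score` and the sample scores are illustrative, and stretch-goal bonus points are not included.

```python
# Criterion weights as listed in the rubric; they sum to 100.
WEIGHTS = {
    "Task framing & problem fit": 10,
    "Architecture & reasoning loop": 15,
    "Tool use & RAG quality": 20,
    "Safety & guardrails": 15,
    "Evaluation rigor": 15,
    "Observability & cost": 10,
    "Deployment & reproducibility": 10,
    "Report & defensibility": 5,
}


def final_score(scores: dict[str, int]) -> float:
    """Final score = sum(score * weight) / 4, with each score in 0..4."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric criteria")
    for criterion, score in scores.items():
        if not 0 <= score <= 4:
            raise ValueError(f"{criterion}: score must be between 0 and 4")
    return sum(scores[c] * w for c, w in WEIGHTS.items()) / 4


# Hypothetical submission: level 3 everywhere except safety and the report.
example = {c: 3 for c in WEIGHTS}
example["Safety & guardrails"] = 2
example["Report & defensibility"] = 2
print(final_score(example))  # 280 / 4 = 70.0, exactly the passing score
```

Dividing by 4 rescales the 0–4 levels so that a 4 on every criterion yields the full 100 points.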

Pre-submission checklist

Every item must be true before you submit. Reviewers will spot-check.

  • Agent ships ≥3 real tools, at least one with state mutation, all with JSON schemas
  • RAG ingestion script runs end-to-end on a single command and reports chunk count
  • Retrieval eval reports Recall@5 ≥ 0.7 on a held-out query set
  • ≥30 golden test cases; `make eval` prints PASS X/Y and writes evals/report.html
  • Guardrails block ≥3 documented unsafe prompts; false-positive rate ≤10% on safe prompts
  • Every /ask request emits one OTel trace with at least 4 nested spans
  • Per-request USD budget enforced; cost/request reported in report.pdf
  • `docker run -p 8080:8080 agent` brings up /ask, /healthz, and passes smoke.sh
  • Report is 5–7 pages, has ≥2 diagrams, an eval table, and a cost section
  • Repo has license, pinned requirements, and a 60-second quickstart in README
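
The guardrail item in the checklist implies two measurements: how many documented unsafe prompts are blocked, and how often safe prompts are wrongly blocked. A minimal sketch of that calculation follows. It assumes a hypothetical `guard` callable that returns True when a prompt is allowed, and the function names and default thresholds simply mirror the checklist numbers.

```python
from typing import Callable, Sequence


def guardrail_metrics(
    guard: Callable[[str], bool],
    unsafe_prompts: Sequence[str],
    safe_prompts: Sequence[str],
) -> dict[str, float]:
    """Count blocked unsafe prompts and the false-positive rate on safe prompts."""
    if not unsafe_prompts or not safe_prompts:
        raise ValueError("both prompt sets must be non-empty")
    unsafe_blocked = sum(1 for p in unsafe_prompts if not guard(p))
    safe_blocked = sum(1 for p in safe_prompts if not guard(p))
    return {
        "unsafe_blocked": unsafe_blocked,
        "block_rate": unsafe_blocked / len(unsafe_prompts),
        "false_positive_rate": safe_blocked / len(safe_prompts),
    }


def meets_checklist(metrics: dict[str, float], min_blocked: int = 3, max_fpr: float = 0.10) -> bool:
    """True if at least min_blocked unsafe prompts are blocked and FPR <= max_fpr."""
    return (
        metrics["unsafe_blocked"] >= min_blocked
        and metrics["false_positive_rate"] <= max_fpr
    )
```

With only a handful of safe prompts the false-positive rate is very coarse (one wrong block out of ten is already 10%), so a larger safe set gives a more meaningful number.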

Stretch goals (bonus 0–10 pts)

  • Hierarchical agent (manager + sub-agents) with documented hand-off contract
  • Self-reflection / critic loop with measured quality lift
  • Multi-tenant: per-user RAG namespace + per-user budget
  • Public demo URL with rate limits and abuse monitoring
  • Eval suite runs on every PR and posts a comment with the delta