LLM Engineering
Capstone project

Ship a Domain-Tuned LLM End-to-End

Take an open base model, fine-tune it on a domain dataset, align it with preference data, evaluate it on a custom benchmark, and serve it under a measured SLO.

Estimated effort: 30–50 hours over 2–3 weeks
Passing score: 70 / 100
Reviewer: a senior ML engineer playing the role of your hiring manager. They will reproduce your repo and read your report.

The brief

You are the founding LLM engineer at a 5-person startup. The product team needs a small, fast, in-house model that beats general-purpose APIs on one narrow domain. Pick one: legal Q&A, medical note summarisation, code review, financial filings extraction, or customer-support triage.

Your job is to take this from raw data to a running HTTP endpoint, with measured quality and cost, and to defend every architectural choice in a short report.

Treat this as a production deliverable: someone else should be able to clone your repo and reproduce every number in your report with one command.

Deliverables

  1. Curated dataset

    ≥3,000 high-quality (instruction, response) pairs in your chosen domain, with a held-out 10% test split. Document the source, license, and your cleaning pipeline.

    format: data/train.jsonl + data/test.jsonl + data/README.md
  2. Fine-tuned model

    A LoRA-adapted (or fully merged) checkpoint of a 1–8B open base model trained on your dataset, with reproducible training script and config.

    format: out/merged/ + scripts/sft.py + configs/sft.yaml
  3. Aligned model

    A DPO (or RLHF) pass on ≥1,000 preference pairs from the same domain, with a measured win-rate improvement over the SFT-only model (the checklist requires the DPO model to win ≥55% of head-to-head judgments).

    format: out/dpo/ + scripts/dpo.py + evals/winrate.json
  4. Custom evaluation suite

    At least one academic benchmark (e.g. MMLU subset, HumanEval) AND a domain-specific eval of ≥50 hand-graded cases with an LLM-judge harness.

    format: evals/run.py + evals/golden.jsonl + evals/report.html
  5. Production serving

    Model served via vLLM (or equivalent) behind an OpenAI-compatible endpoint, with prefix caching enabled and a measured throughput benchmark.

    format: serve/Dockerfile + serve/start.sh + bench/results.csv
  6. Engineering report

    A 5–8 page PDF covering: problem, dataset choices, base-model choice, training/alignment decisions, evaluation results, latency/throughput/cost numbers, failure modes, and what you would do with another month.

    format: report.pdf with ≥2 figures and a metrics table
  7. Reproducibility

    A single `make all` target (or equivalent) that reproduces the full pipeline end-to-end on one GPU host.

    format: Makefile + README.md + requirements.txt
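
For the dataset deliverable, one way to get a reproducible held-out split is to assign each pair to train or test by hashing its instruction, so the same pair always lands on the same side and exact duplicates cannot straddle the split. The sketch below is a minimal illustration, not a required implementation. The field names `instruction` and `response` and the helper names are assumptions, and a hash-bucket split yields roughly 10% rather than exactly 10%.

```python
import hashlib
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def pair_key(pair: dict) -> str:
    """Stable hash of the normalized instruction (field name is an assumption)."""
    return hashlib.sha256(normalize(pair["instruction"]).encode("utf-8")).hexdigest()


def dedup_and_split(pairs: list[dict], test_fraction: float = 0.10):
    """Drop duplicate instructions, then route each pair by its hash bucket."""
    seen, train, test = set(), [], []
    for pair in pairs:
        key = pair_key(pair)
        if key in seen:
            continue  # duplicate instruction after normalization
        seen.add(key)
        bucket = int(key[:8], 16) / 0x100000000  # deterministic value in [0, 1)
        (test if bucket < test_fraction else train).append(pair)
    return train, test


def write_jsonl(path: str, rows: list[dict]) -> None:
    """Write one JSON object per line, e.g. to data/train.jsonl."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Because the split depends only on the content hash, re-running the pipeline after adding new pairs never moves an existing pair from test into train.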

Suggested timeline

  1. Scope & data
    Days 1–4

    Pick domain, collect/clean dataset, define eval set.

  2. SFT
    Days 5–9

    LoRA fine-tune base model, log loss, qualitative spot-check.

  3. Alignment
    Days 10–13

    Author preference pairs, run DPO, measure win-rate.

  4. Evaluation
    Days 14–16

    Build judge harness, run benchmarks, write report draft.

  5. Serving & bench
    Days 17–19

    Containerise, benchmark, document cost.

  6. Polish & submit
    Days 20–21

    Finalise report, clean repo, record demo video.

Graded rubric (100 pts)

Each criterion is scored 0–4. Final score = Σ (score × weight) ÷ 4. You need ≥70 to earn the capstone seal on your transcript.
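
As a worked example of the formula, the sketch below computes a final score from per-criterion scores, using the weights listed in the rubric that follows. The function name and dictionary layout are illustrative choices only.

```python
# Weights (in points) copied from the rubric headings; they sum to 100.
WEIGHTS = {
    "Data quality and curation": 15,
    "Training & alignment correctness": 20,
    "Evaluation rigor": 20,
    "Serving, latency & cost": 15,
    "Engineering quality & reproducibility": 15,
    "Report & defensibility": 15,
}


def final_score(scores: dict[str, int]) -> float:
    """Final score = sum(score * weight) / 4, with each score in 0..4."""
    if set(scores) != set(WEIGHTS):
        raise KeyError("scores must cover exactly the six rubric criteria")
    for name, score in scores.items():
        if not 0 <= score <= 4:
            raise ValueError(f"{name}: score must be between 0 and 4")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) / 4


# A 3 on every criterion except a 2 on "Evaluation rigor":
# (45 + 60 + 40 + 45 + 45 + 45) / 4 = 70.0, exactly the passing score.
example = {name: 3 for name in WEIGHTS}
example["Evaluation rigor"] = 2
print(final_score(example))
```

A uniform 3 gives 75, so a submission can drop one level on a single criterion and still pass, but not on two of the 20-point criteria.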

Data quality and curation (15 pts)
  • 0: No dataset, or unlicensed / scraped without attribution.
  • 1: Small or noisy dataset, no documented cleaning.
  • 2: Adequate size, basic dedup and filtering, license noted.
  • 3: Well-curated, clear cleaning pipeline, train/test split with no leakage.
  • 4: Curation rivals public research datasets: provenance, quality scoring, leakage audit.
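
A leakage audit can go beyond exact-match dedup by flagging test items that nearly duplicate a training item. The sketch below uses word 5-gram Jaccard similarity. The 0.5 threshold, the n-gram size, and the function names are assumptions to tune for your domain, and the brute-force comparison is only reasonable at the few-thousand-pair scale of this project.

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Set of word n-grams; short texts fall back to one tuple of all tokens."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union; 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def leakage_report(train: list[str], test: list[str],
                   threshold: float = 0.5, n: int = 5):
    """Return (test_index, train_index, similarity) for near-duplicate pairs."""
    train_grams = [ngrams(text, n) for text in train]
    hits = []
    for test_index, text in enumerate(test):
        grams = ngrams(text, n)
        for train_index, other in enumerate(train_grams):
            similarity = jaccard(grams, other)
            if similarity >= threshold:
                hits.append((test_index, train_index, similarity))
    return hits
```

An empty report is the evidence to cite in data/README.md. For much larger corpora, an approximate method such as MinHash would be the usual replacement for the nested loop.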

Training & alignment correctness (20 pts)
  • 0: Training script does not run or loss diverges.
  • 1: Trains but no measurable improvement over base.
  • 2: Clear loss curve, SFT improves over base on at least one metric.
  • 3: SFT + DPO both improve; sensible hyperparameters; gradient clipping, schedule, mixed precision.
  • 4: Defended choice of base model, LoRA rank, β, learning rate; ablation table included.
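
To defend a choice of β it helps to see where it enters the objective. Assuming β here refers to the DPO coefficient, the standard per-pair DPO loss is -log σ(β · [(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w is the chosen and y_l the rejected response. The sketch below evaluates that expression for one pair from summed sequence log-probabilities. It is a scalar illustration with hypothetical argument names, not a training loop.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from sequence log-probs under policy and reference."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logit)) equals softplus(-logit)
    return softplus(-logit)


# At initialisation the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

A larger β makes the same log-ratio margin count for more, so the loss falls faster as the policy moves away from the reference. That is the tradeoff an ablation over β would need to show.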

Evaluation rigor (20 pts)
  • 0: Only vibes-based examples.
  • 1: Ran one off-the-shelf benchmark.
  • 2: Off-the-shelf + small domain eval.
  • 3: Domain eval with LLM-judge, base/SFT/DPO compared side by side.
  • 4: Judge calibrated vs human, confidence intervals, regression on academic benchmarks reported honestly.
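
For the confidence intervals mentioned in the top band, a Wilson score interval is one reasonable choice for a head-to-head win rate, since it behaves well at small sample sizes. The sketch below assumes ties have already been excluded or split before counting wins; the function name is illustrative.

```python
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")
    p = wins / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


for wins, total in [(55, 100), (550, 1000)]:
    low, high = wilson_interval(wins, total)
    print(f"{wins}/{total}: [{low:.3f}, {high:.3f}]")
```

With 100 judgments, a 55% win rate has an interval that still includes 50%, while the same rate over 1,000 judgments does not. Reporting the interval alongside the point estimate makes clear which of the two situations your result is in.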

Serving, latency & cost (15 pts)
  • 0: Model not served.
  • 1: Runs via raw transformers .generate(), no batching.
  • 2: Served via vLLM, basic latency numbers.
  • 3: Throughput + p95 latency benchmarked, prefix caching on, cost per 1k tokens reported.
  • 4: Quantisation / speculative decoding explored with measured tradeoffs.
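
The latency and cost figures can be derived from raw benchmark measurements with a few lines of arithmetic. The sketch below uses a nearest-rank percentile and a simple cost model (GPU hourly price divided by sustained output throughput). The price and throughput in the example are hypothetical placeholders, not measurements, and the model ignores idle time and non-GPU costs.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1k_output_tokens(gpu_usd_per_hour: float,
                              output_tokens_per_second: float) -> float:
    """USD per 1,000 output tokens at a sustained throughput on one GPU."""
    if output_tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = output_tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1000


# Hypothetical inputs: a $2.00/hour GPU sustaining 1,000 output tokens/s.
latencies_ms = [120, 135, 150, 110, 980, 140, 125, 160, 130, 145]
print("p50:", percentile(latencies_ms, 50), "ms")
print("p95:", percentile(latencies_ms, 95), "ms")
print("cost per 1k tokens: $%.6f" % cost_per_1k_output_tokens(2.00, 1000))
```

Computing the cost at each tested concurrency level, rather than only at peak throughput, shows how much of the headline number depends on keeping the server saturated.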

Engineering quality & reproducibility (15 pts)
  • 0: No README; scripts do not run.
  • 1: Runs with manual steps; partial README.
  • 2: Documented setup, dependency pinning.
  • 3: One-command reproduction; seeded; checkpoints versioned.
  • 4: CI runs eval harness; Docker image; clean module boundaries.

Report & defensibility (15 pts)
  • 0: No report.
  • 1: Report exists but is mostly a README.
  • 2: Covers what was built and what worked.
  • 3: Covers decisions, tradeoffs, failure modes, and cost.
  • 4: Reads like a serious engineering postmortem that would convince a hiring manager.

Pre-submission checklist

Every item must be true before you submit. Reviewers will spot-check.

  • Dataset has ≥3,000 train + ≥300 held-out test pairs with no leakage
  • Loss curve PNG committed; final eval loss < 80% of base model's eval loss on the domain set
  • DPO model beats SFT model on ≥55% of head-to-head preference judgments
  • Academic benchmark regression (e.g. MMLU drop) is ≤5 points or explicitly justified
  • vLLM endpoint returns 200 OK on the standard OpenAI client
  • Throughput benchmark run at concurrency 1, 16, and 64, with p50 and p95 latency logged
  • Cost per 1k output tokens calculated and included in report
  • `make all` reproduces every figure and number in the report from a clean checkout
  • Report is 5–8 pages, includes ≥2 figures, no broken metrics tables
  • Repo has a license, a README quickstart, and pinned requirements
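
The numeric thresholds in this checklist can be checked automatically before submission. The sketch below encodes three of them (eval loss ratio, DPO win rate, and academic benchmark regression) as a single gate. The function and argument names are hypothetical, and the inputs are expected to come from your own eval outputs.

```python
def check_submission(base_eval_loss: float, final_eval_loss: float,
                     dpo_win_rate: float,
                     base_benchmark: float, tuned_benchmark: float) -> list[str]:
    """Return a list of failed checklist items; an empty list means all pass."""
    failures = []
    # Final eval loss must be below 80% of the base model's eval loss.
    if not final_eval_loss < 0.80 * base_eval_loss:
        failures.append("eval loss is not below 80% of the base model's loss")
    # DPO model must win at least 55% of head-to-head judgments against SFT.
    if dpo_win_rate < 0.55:
        failures.append("DPO win rate over SFT is below 55%")
    # Academic benchmark drop above 5 points needs an explicit justification.
    if base_benchmark - tuned_benchmark > 5.0:
        failures.append("benchmark drop exceeds 5 points; justify it in the report")
    return failures


if __name__ == "__main__":
    # Placeholder values for illustration only.
    problems = check_submission(2.0, 1.5, 0.60, 45.0, 42.0)
    print("OK" if not problems else "\n".join(problems))
```

Wiring a gate like this into `make all` keeps the report's claims and the checklist in sync whenever the pipeline is re-run.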

Stretch goals (bonus 0–10 pts)

  • 4-bit quantisation with measured quality delta
  • Speculative decoding using a sub-1B draft model
  • Continuous eval in CI — fail the build on a >5-point regression
  • Multi-turn fine-tuning with a chat template you defend
  • Public demo URL with rate limits and request logging