Capstone project

Ship a Domain-Tuned LLM End-to-End

Take an open base model, fine-tune it on a domain dataset, align it with preference data, evaluate it on a custom benchmark, and serve it under a measured SLO.

Estimated effort: 30–50 hours over 2–3 weeks

Passing score: 70 / 100

Reviewer: A senior ML engineer playing the role of your hiring manager. They will reproduce your repo and read your report.

The brief

You are the founding LLM engineer at a 5-person startup. The product team needs a small, fast, in-house model that beats general-purpose APIs on one narrow domain (your choice: legal Q&A, medical note summarisation, code review, financial filings extraction, customer-support triage — pick one).

Your job is to take this from raw data to a running HTTP endpoint, with measured quality and cost, and to defend every architectural choice in a short report.

Treat this as a production deliverable: someone else should be able to clone your repo and reproduce every number in your report with one command.

Deliverables

1. Curated dataset
≥3,000 high-quality (instruction, response) pairs in your chosen domain, with a held-out 10% test split. Document the source, license, and your cleaning pipeline.
Format: data/train.jsonl + data/test.jsonl + data/README.md
2. Fine-tuned model
A LoRA-adapted (or fully merged) checkpoint of a 1–8B open base model trained on your dataset, with reproducible training script and config.
Format: out/merged/ + scripts/sft.py + configs/sft.yaml
3. Aligned model
DPO (or RLHF) pass on ≥1,000 preference pairs from the same domain, with a clear win-rate improvement over the SFT-only model.
Format: out/dpo/ + scripts/dpo.py + evals/winrate.json
4. Custom evaluation suite
At least one academic benchmark (e.g. MMLU subset, HumanEval) AND a domain-specific eval of ≥50 hand-graded cases with an LLM-judge harness.
Format: evals/run.py + evals/golden.jsonl + evals/report.html
5. Production serving
Model served via vLLM (or equivalent) behind an OpenAI-compatible endpoint, with prefix caching enabled and a measured throughput benchmark.
Format: serve/Dockerfile + serve/start.sh + bench/results.csv
6. Engineering report
A 5–8 page PDF covering: problem, dataset choices, base-model choice, training/alignment decisions, evaluation results, latency/throughput/cost numbers, failure modes, and what you would do with another month.
Format: report.pdf with ≥2 figures and a metrics table
7. Reproducibility
A single `make all` target (or equivalent) that reproduces the full pipeline end-to-end on one GPU host.
Format: Makefile + README.md + requirements.txt

Suggested timeline

1. Scope & data
Days 1–4
Pick domain, collect/clean dataset, define eval set.
2. SFT
Days 5–9
LoRA fine-tune base model, log loss, qualitative spot-check.
3. Alignment
Days 10–13
Author preference pairs, run DPO, measure win-rate.
4. Evaluation
Days 14–16
Build judge harness, run benchmarks, write report draft.
5. Serving & bench
Days 17–19
Containerise, benchmark, document cost.
6. Polish & submit
Days 20–21
Finalise report, clean repo, record demo video.

Graded rubric (100 pts)

Each criterion is scored 0–4. Final score = Σ (score × weight) ÷ 4. You need ≥70 to earn the capstone seal on your transcript.
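
As a worked example of the scoring formula, the sketch below computes a final score from six criterion scores. The weights are the point values attached to the six rubric criteria; the criterion keys and the example scores are made up for illustration.

```python
# Point values of the six rubric criteria (they sum to 100).
# The short key names are illustrative, not part of the rubric.
WEIGHTS = {
    "data": 15,
    "training": 20,
    "evaluation": 20,
    "serving": 15,
    "engineering": 15,
    "report": 15,
}


def final_score(scores: dict[str, int]) -> float:
    """Final score = sum(score * weight) / 4, with each score in 0..4."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric criteria")
    for name, score in scores.items():
        if not 0 <= score <= 4:
            raise ValueError(f"{name}: score must be between 0 and 4")
    return sum(scores[name] * WEIGHTS[name] for name in WEIGHTS) / 4


# Hypothetical submission: 3 on everything except evaluation rigor.
example = {
    "data": 3,
    "training": 3,
    "evaluation": 2,
    "serving": 3,
    "engineering": 3,
    "report": 3,
}
print(final_score(example))  # 70.0, exactly the passing score
```

A submission scoring 3 on every criterion would land at 75, so a single criterion at 2 is enough to bring a solid submission down to the pass line.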

Data quality and curation (15 pts)

0: No dataset, or unlicensed / scraped without attribution.
1: Small or noisy dataset, no documented cleaning.
2: Adequate size, basic dedup and filtering, license noted.
3: Well-curated, clear cleaning pipeline, train/test split with no leakage.
4: Curation rivals public research datasets: provenance, quality scoring, leakage audit.
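
A leakage audit can start with something as simple as the sketch below, which flags test examples whose instruction also appears in the training set after light normalisation. The `instruction` field name is an assumption about your JSONL schema, and this only catches exact matches after normalisation; near-duplicates would need fuzzy matching (for example n-gram overlap), which is not shown here.

```python
import hashlib
import json
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def fingerprint(example: dict) -> str:
    """Hash of the normalised instruction ('instruction' is an assumed field name)."""
    return hashlib.sha256(normalise(example["instruction"]).encode("utf-8")).hexdigest()


def find_leaks(train: list[dict], test: list[dict]) -> list[dict]:
    """Return test examples whose instruction fingerprint also occurs in train."""
    train_prints = {fingerprint(ex) for ex in train}
    return [ex for ex in test if fingerprint(ex) in train_prints]


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Tiny in-memory illustration with made-up rows.
train_rows = [
    {"instruction": "Summarise the indemnity clause.", "response": "..."},
    {"instruction": "List the parties to the contract.", "response": "..."},
]
test_rows = [
    {"instruction": "  summarise the   indemnity clause. ", "response": "..."},
    {"instruction": "What is the governing law?", "response": "..."},
]
leaks = find_leaks(train_rows, test_rows)
print(f"{len(leaks)} leaked test example(s)")
```

On real data you would call `load_jsonl` on data/train.jsonl and data/test.jsonl and fail the pipeline if `find_leaks` returns anything.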

Training & alignment correctness (20 pts)

0: Training script does not run or loss diverges.
1: Trains but no measurable improvement over base.
2: Clear loss curve, SFT improves over base on at least one metric.
3: SFT + DPO both improve; sensible hyperparameters; gradient clipping, schedule, mixed precision.
4: Defended choice of base model, LoRA rank, β, learning rate; ablation table included.
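
To reason about what β controls, it can help to look at the standard per-pair DPO objective, sketched below in plain Python. The four arguments are hypothetical summed token log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. The default `beta=0.1` is a common starting value, not something this brief prescribes, and real training would normally use a library implementation rather than this function.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin is how much more the policy prefers the chosen response over the
    rejected one, relative to the reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Shifting probability mass toward the chosen response lowers the loss.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

A larger β makes the same log-probability margin count for more, which in practice keeps the policy closer to the reference model; that is the tradeoff an ablation over β would explore.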

Evaluation rigor (20 pts)

0: Only vibes-based examples.
1: Ran one off-the-shelf benchmark.
2: Off-the-shelf + small domain eval.
3: Domain eval with LLM-judge, base/SFT/DPO compared side by side.
4: Judge calibrated vs human, confidence intervals, regression on academic benchmarks reported honestly.
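
One way to attach a confidence interval to a head-to-head win rate is the Wilson score interval, sketched below. This is one reasonable choice among several (bootstrapping is another), and it assumes ties have already been dropped or resolved before counting; the win counts in the example are hypothetical.

```python
import math


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")
    p = wins / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical result: the DPO model wins 120 of 200 judged comparisons.
low, high = wilson_interval(120, 200)
print(f"win rate 60.0%, 95% CI [{low:.1%}, {high:.1%}]")
```

Reporting the interval alongside the point estimate shows whether a win rate is distinguishable from a coin flip at your sample size, which matters when the number of judged pairs is small.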

Serving, latency & cost (15 pts)

0: Model not served.
1: Runs via raw transformers .generate(), no batching.
2: Served via vLLM, basic latency numbers.
3: Throughput + p95 latency benchmarked, prefix caching on, cost per 1k tokens reported.
4: Quantisation / speculative decoding explored with measured tradeoffs.
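
The latency percentiles and the cost figure can be derived from raw benchmark measurements along the lines of the sketch below. It uses the nearest-rank percentile definition and assumes cost is dominated by a single GPU billed hourly; the latencies, hourly price, and throughput in the example are invented numbers, not targets.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1k_output_tokens(gpu_hourly_usd: float, output_tokens_per_second: float) -> float:
    """USD per 1,000 generated tokens, assuming one GPU billed by the hour."""
    if output_tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = output_tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1000


# Hypothetical per-request latencies in seconds from one benchmark run.
latencies = [0.42, 0.38, 0.51, 0.47, 1.90, 0.44, 0.40, 0.55, 0.39, 0.61]
print("p50:", percentile(latencies, 50))
print("p95:", percentile(latencies, 95))

# Hypothetical pricing and measured throughput.
print("USD per 1k output tokens:", cost_per_1k_output_tokens(1.20, 1500.0))
```

Throughput generally changes with concurrency, so the cost figure should state which concurrency level the tokens-per-second number was measured at.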

Engineering quality & reproducibility (15 pts)

0: No README; scripts do not run.
1: Runs with manual steps; partial README.
2: Documented setup, dependency pinning.
3: One-command reproduction; seeded; checkpoints versioned.
4: CI runs eval harness; Docker image; clean module boundaries.

Report & defensibility (15 pts)

0: No report.
1: Report exists but is mostly a README.
2: Covers what was built and what worked.
3: Covers decisions, tradeoffs, failure modes, and cost.
4: Reads like a serious engineering postmortem; would convince a hiring manager.

Pre-submission checklist

Every item must be true before you submit. Reviewers will spot-check.

Dataset has ≥3,000 train + ≥300 held-out test pairs with no leakage
Loss curve PNG committed; final eval loss < 80% of base model's eval loss on the domain set
DPO model beats SFT model on ≥55% of head-to-head preference judgments
Academic benchmark regression (e.g. MMLU drop) is ≤5 points or explicitly justified
vLLM endpoint returns 200 OK on the standard OpenAI client
Throughput benchmark run at concurrency levels 1 / 16 / 64, with p50 and p95 latency logged for each
Cost per 1k output tokens calculated and included in report
`make all` reproduces every figure and number in the report from a clean checkout
Report is 5–8 pages, includes ≥2 figures, no broken metrics tables
Repo has a license, a README quickstart, and pinned requirements

Stretch goals (bonus 0–10 pts)

4-bit quantisation with measured quality delta
Speculative decoding using a sub-1B draft model
Continuous eval in CI — fail the build on a >5-point regression
Multi-turn fine-tuning with a chat template you defend
Public demo URL with rate limits and request logging
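
For the continuous-eval stretch goal, the regression gate itself can be very small. The sketch below compares current metric scores against a stored baseline and reports any metric that dropped by more than five points. The metric names and the idea of keeping a baseline as a dictionary (for example loaded from a committed JSON file) are assumptions for illustration.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 5.0,
) -> dict[str, float]:
    """Return metrics whose score fell by more than max_drop points."""
    failed = {}
    for name, base_score in baseline.items():
        if name not in current:
            # A metric that silently disappears is treated as a failure.
            failed[name] = float("inf")
            continue
        drop = base_score - current[name]
        if drop > max_drop:
            failed[name] = drop
    return failed


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Print any regressions and return a process exit code (0 = pass, 1 = fail)."""
    failed = regressions(baseline, current)
    for name, drop in failed.items():
        print(f"REGRESSION {name}: dropped {drop:.1f} points")
    return 1 if failed else 0


# Hypothetical scores on a 0-100 scale.
baseline_scores = {"mmlu_subset": 60.0, "domain_eval": 80.0}
current_scores = {"mmlu_subset": 54.0, "domain_eval": 79.0}
exit_code = gate(baseline_scores, current_scores)
print("exit code:", exit_code)
```

In a CI job the script would end with `sys.exit(gate(...))` so that a non-zero code fails the build.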