Ship a Domain-Tuned LLM End-to-End
Take an open base model, fine-tune it on a domain dataset, align it with preference data, evaluate it on a custom benchmark, and serve it under a measured SLO.
The brief
You are the founding LLM engineer at a 5-person startup. The product team needs a small, fast, in-house model that beats general-purpose APIs on one narrow domain (pick one: legal Q&A, medical note summarisation, code review, financial filings extraction, or customer-support triage).
Your job is to take this from raw data to a running HTTP endpoint, with measured quality and cost, and to defend every architectural choice in a short report.
Treat this as a production deliverable: someone else should be able to clone your repo and reproduce every number in your report with one command.
Deliverables
- 1. Curated dataset
≥3,000 high-quality (instruction, response) pairs in your chosen domain, with a held-out 10% test split. Document the source, license, and your cleaning pipeline.
format: data/train.jsonl + data/test.jsonl + data/README.md
- 2. Fine-tuned model
A LoRA-adapted (or fully merged) checkpoint of a 1–8B open base model trained on your dataset, with reproducible training script and config.
format: out/merged/ + scripts/sft.py + configs/sft.yaml
- 3. Aligned model
DPO (or RLHF) pass on ≥1,000 preference pairs from the same domain, with a clear win-rate improvement over the SFT-only model.
format: out/dpo/ + scripts/dpo.py + evals/winrate.json
- 4. Custom evaluation suite
At least one academic benchmark (e.g. MMLU subset, HumanEval) AND a domain-specific eval of ≥50 hand-graded cases with an LLM-judge harness.
format: evals/run.py + evals/golden.jsonl + evals/report.html
- 5. Production serving
Model served via vLLM (or equivalent) behind an OpenAI-compatible endpoint, with prefix caching enabled and a measured throughput benchmark.
format: serve/Dockerfile + serve/start.sh + bench/results.csv
- 6. Engineering report
A 5–8 page PDF covering: problem, dataset choices, base-model choice, training/alignment decisions, evaluation results, latency/throughput/cost numbers, failure modes, and what you would do with another month.
format: report.pdf with ≥2 figures and a metrics table
- 7. Reproducibility
A single `make all` target (or equivalent) that reproduces the full pipeline end-to-end on one GPU host.
format: Makefile + README.md + requirements.txt
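The curated-dataset deliverable asks for a held-out test split with no leakage. One way to get there is to deduplicate on a normalised instruction and assign each pair to a split by a stable hash, so the split does not depend on input order or a random seed. The sketch below is only an illustration: the JSON keys `instruction` and `response`, and the helper names `normalise`, `split_pairs`, `write_jsonl`, and `leaked`, are assumptions rather than part of the brief.

```python
import hashlib
import json


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts compare equal."""
    return " ".join(text.lower().split())


def split_pairs(pairs, test_fraction=0.10):
    """Deduplicate on the normalised instruction, then split by a stable hash."""
    seen, train, test = set(), [], []
    for pair in pairs:
        key = normalise(pair["instruction"])  # assumed key name
        if key in seen:
            continue  # drop exact and whitespace/case duplicates
        seen.add(key)
        # Map the hash to [0, 1): the same instruction always lands in the
        # same split, regardless of input order.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000
        (test if bucket < test_fraction else train).append(pair)
    return train, test


def leaked(train, test):
    """Return test pairs whose normalised instruction also appears in train."""
    train_keys = {normalise(p["instruction"]) for p in train}
    return [p for p in test if normalise(p["instruction"]) in train_keys]


def write_jsonl(path, rows):
    """Write one JSON object per line, e.g. to data/train.jsonl."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
```

A hash-based split yields roughly, not exactly, 10% test pairs, so check the resulting counts against the size requirements before training. Exact-match deduplication also misses paraphrases; a stricter leakage audit would likely need fuzzy or embedding-based matching.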
Suggested timeline
- 1. Scope & data (Days 1–4)
Pick domain, collect/clean dataset, define eval set.
- 2. SFT (Days 5–9)
LoRA fine-tune base model, log loss, qualitative spot-check.
- 3. Alignment (Days 10–13)
Author preference pairs, run DPO, measure win-rate.
- 4. Evaluation (Days 14–16)
Build judge harness, run benchmarks, write report draft.
- 5. Serving & bench (Days 17–19)
Containerise, benchmark, document cost.
- 6. Polish & submit (Days 20–21)
Finalise report, clean repo, record demo video.
Graded rubric (100 pts)
Each criterion is scored 0–4. Final score = Σ (score × weight) ÷ 4. You need ≥70 to earn the capstone seal on your transcript.
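As a worked illustration of the scoring rule, the sketch below computes Σ (score × weight) ÷ 4. The criterion names and weights here are placeholders chosen only so that the weights sum to 100; they are assumptions, not the official rubric weights.

```python
def final_score(scores, weights):
    """Compute sum(score * weight) / 4 for per-criterion scores in 0..4."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    for name, score in scores.items():
        if not 0 <= score <= 4:
            raise ValueError(f"{name}: score must be between 0 and 4")
    return sum(scores[name] * weights[name] for name in scores) / 4


# Hypothetical weights that sum to 100 (NOT taken from the brief).
WEIGHTS = {
    "dataset": 20,
    "training": 20,
    "evaluation": 20,
    "serving": 15,
    "reproducibility": 15,
    "report": 10,
}

all_threes = {name: 3 for name in WEIGHTS}
print(final_score(all_threes, WEIGHTS))  # 75.0
```

With weights that sum to 100, scoring 3 on every criterion gives 75, which clears the 70-point threshold; scoring 4 everywhere gives the full 100.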
- 0
No dataset, or unlicensed / scraped without attribution.
- 1
Small or noisy dataset, no documented cleaning.
- 2
Adequate size, basic dedup and filtering, license noted.
- 3
Well-curated, clear cleaning pipeline, train/test split with no leakage.
- 4
Curation rivals public research datasets — provenance, quality scoring, leakage audit.
- 0
Training script does not run or loss diverges.
- 1
Trains but no measurable improvement over base.
- 2
Clear loss curve, SFT improves over base on at least one metric.
- 3
SFT + DPO both improve; sensible hyperparameters; gradient clipping, schedule, mixed precision.
- 4
Defended choice of base model, LoRA rank, β, learning rate; ablation table included.
- 0
Only vibes-based examples.
- 1
Ran one off-the-shelf benchmark.
- 2
Off-the-shelf + small domain eval.
- 3
Domain eval with LLM-judge, base/SFT/DPO compared side by side.
- 4
Judge calibrated vs human, confidence intervals, regression on academic benchmarks reported honestly.
- 0
Model not served.
- 1
Runs via raw transformers .generate(), no batching.
- 2
Served via vLLM, basic latency numbers.
- 3
Throughput + p95 latency benchmarked, prefix caching on, cost per 1k tokens reported.
- 4
Quantisation / speculative decoding explored with measured tradeoffs.
- 0
No README; scripts do not run.
- 1
Runs with manual steps; partial README.
- 2
Documented setup, dependency pinning.
- 3
One-command reproduction; seeded; checkpoints versioned.
- 4
CI runs eval harness; Docker image; clean module boundaries.
- 0
No report.
- 1
Report exists but is mostly a README.
- 2
Covers what was built and what worked.
- 3
Covers decisions, tradeoffs, failure modes, and cost.
- 4
Reads like a serious engineering postmortem — would convince a hiring manager.
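The top evaluation level in the rubric mentions calibrating the judge against humans and reporting confidence intervals. One possible way to do both is sketched below: Cohen's kappa for judge-versus-human agreement on the hand-graded cases, and a Wilson score interval for a head-to-head win rate. The function names and the `"pass"`/`"fail"` labels are illustrative assumptions, and ties are assumed to have been excluded before counting wins.

```python
import math
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def wilson_interval(wins, total, z=1.96):
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("need 0 <= wins <= total and total > 0")
    p = wins / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "pass", "fail"]
print(cohen_kappa(human, judge))   # 0.0: agreement no better than chance
print(wilson_interval(55, 100))    # roughly (0.45, 0.64)
```

The second example shows why sample size matters: with only 100 comparisons, a 55% win rate has an interval that still includes 50%, so more judgments are needed before claiming a real improvement.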
Pre-submission checklist
Every item must be true before you submit. Reviewers will spot-check.
- Dataset has ≥3,000 train + ≥300 held-out test pairs with no leakage
- Loss curve PNG committed; final eval loss < 80% of base model's eval loss on the domain set
- DPO model beats SFT model on ≥55% of head-to-head preference judgments
- Academic benchmark regression (e.g. MMLU drop) is ≤5 points or explicitly justified
- vLLM endpoint returns 200 OK on the standard OpenAI client
- Throughput bench tested at concurrencies 1 / 16 / 64 with p50, p95 logged
- Cost per 1k output tokens calculated and included in report
- `make all` reproduces every figure and number in the report from a clean checkout
- Report is 5–8 pages, includes ≥2 figures, no broken metrics tables
- Repo has a license, a README quickstart, and pinned requirements
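Two checklist items reduce to simple arithmetic: latency percentiles per concurrency level and cost per 1k output tokens. The sketch below uses a nearest-rank percentile and derives cost from an hourly GPU price and a measured output-token throughput. The helper names, the $2.00 per hour price, and the 1,000 tokens per second figure are made-up placeholders, not measurements.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is in (0, 100]."""
    if not values:
        raise ValueError("need at least one latency sample")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_1k_output_tokens(gpu_usd_per_hour, output_tokens_per_second):
    """Hourly GPU price divided by hourly output tokens, scaled to 1k tokens."""
    if output_tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = output_tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1000


# Placeholder latencies in milliseconds for one concurrency level.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 95))  # 95

# Placeholder price and throughput: $2.00/hour at 1,000 output tokens/second.
print(round(cost_per_1k_output_tokens(2.00, 1000), 6))  # 0.000556
```

Run the same summary once per concurrency level (1, 16, 64) and record which throughput figure the cost uses, since cost per token usually falls as batching improves utilisation.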
Stretch goals (bonus 0–10 pts)
- 4-bit quantisation with measured quality delta
- Speculative decoding using a sub-1B draft model
- Continuous eval in CI — fail the build on a >5-point regression
- Multi-turn fine-tuning with a chat template you defend
- Public demo URL with rate limits and request logging
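For the continuous-eval stretch goal, a CI step could compare the current metrics against a stored baseline and fail when any metric drops by more than 5 points. The sketch below assumes both files are flat JSON objects mapping metric names to scores on a 0–100 scale; the file layout and the function names `regressions` and `main` are hypothetical.

```python
import json


def regressions(baseline, current, max_drop=5.0):
    """Return {metric: drop} for every metric that fell by more than max_drop."""
    failures = {}
    for name, base_value in baseline.items():
        if name not in current:
            failures[name] = float("inf")  # a missing metric counts as a failure
            continue
        drop = base_value - current[name]
        if drop > max_drop:
            failures[name] = drop
    return failures


def main(baseline_path, current_path):
    """Load two metric files and return a process exit code (0 = pass)."""
    with open(baseline_path, encoding="utf-8") as fh:
        baseline = json.load(fh)
    with open(current_path, encoding="utf-8") as fh:
        current = json.load(fh)
    failures = regressions(baseline, current)
    for name, drop in sorted(failures.items()):
        print(f"REGRESSION {name}: dropped {drop:.1f} points")
    return 1 if failures else 0
```

A CI job would call something like `sys.exit(main(baseline_path, current_path))` after the eval harness finishes, so a non-zero exit code fails the build.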
