Designing & Building Large Language Models — Study Manual
World-class, exam-aligned reference for the GACS Diploma in LLM Engineering & Applied AI Architecture. Twenty-plus modules covering transformers, training, fine-tuning, RAG, agents, RLHF, reasoning models, multimodal, deployment, safety, FinOps and career prep.
- 0
Module 0 — Orientation — Start Here: How to Succeed in This Program
Welcome, what's included, technical requirements, time commitment, how to study, how to pass the exam, how to complete the capstone, and how to get help.
Welcome
Welcome to the GACS Diploma in Designing & Building Large Language Models. This program transforms you from a learner into a full-stack AI engineer capable of building real, production-grade LLM systems.
What's included
12 full textbook modules, a 100-question final exam, a capstone project, diploma + transcript + badges, a RAG system build, an API deployment, a chatbot interface, and safety & governance training.
Technical requirements
A computer (Windows, macOS, or Linux), Python 3.10+, VS Code or PyCharm, a GitHub account, a HuggingFace account, and basic terminal knowledge.
Time commitment
Modules: 40–60 hours. Capstone: 20–40 hours. Total: 60–100 hours.
How to study
Read each module fully, complete the lab exercises, build your portfolio as you go, take notes, ask questions, and practice coding constantly.
How to pass the exam
100 questions, 80% required to pass, unlimited attempts, open-book. Treat it as a final review, not a guessing game.
How to complete the capstone
Follow the 11-step build process, then submit your GitHub repo, demo video, model card, and technical report.
How to get help
Support email, community forum, and optional instructor office hours.
- 1
Module 1 — Foundations of Large Language Models
How modern LLMs work end to end: tokens, embeddings, transformer blocks, autoregressive generation, and scaling.
Module 1 — Transformer block at a glance. What an LLM actually is
An LLM is a stack of transformer blocks trained to predict the next token in a sequence. Text is split into subword tokens, each token is mapped to a learned embedding vector, and the stack repeatedly mixes information across positions with self-attention and refines it with feed-forward layers. At inference the model samples one token at a time, feeding its own output back in.
Why scale matters
Larger models, trained on more data with more compute, follow smooth scaling laws: loss decreases predictably as parameters, tokens and FLOPs grow together. Chinchilla showed that for a fixed compute budget you should balance parameters and training tokens rather than dumping everything into one axis.
Core vocabulary
Token, embedding, attention head, context window, perplexity, logits and temperature are the words you need to be fluent in. Most operational decisions — context size, KV cache cost, sampling behavior — trace back to these primitives.
- 2
Module 2 — Tokenization & Embeddings
How raw text becomes numbers and why this layer silently controls cost, quality and multilingual performance.
Module 2 — From raw string to token IDs to embedding vectors. Subword tokenization
Modern LLMs use byte-pair encoding (BPE) or SentencePiece. Rare words are split into common subword pieces, so the model can represent anything without an unbounded vocabulary. A bad tokenizer wastes tokens (and money) on every request.
Embeddings as meaning
Each token ID is mapped to a high-dimensional vector. After training, semantically related tokens end up near each other. The same embedding idea powers retrieval: encode a chunk of text, store the vector, then search by cosine similarity.
Operational impact
Token count drives latency, cost, context limits and KV cache memory. Multilingual quality depends heavily on how well the tokenizer covers the target script. Always measure tokens, not characters.
- 3
Module 3 — Transformer Architecture in Depth
Self-attention, multi-head attention, residuals, LayerNorm and positional encoding — the actual machinery.
Module 3 — Self-attention with Q, K, V projections. Self-attention
Each token produces three projections — query, key and value. Attention scores are softmax(QKᵀ/√dₖ), and the output is a weighted sum of values. This lets every token look at every other token in parallel, which is the core advantage over RNNs.
Multi-head & FFN
Multiple attention heads run in parallel so the model can attend to different relations simultaneously (syntax, coreference, long-range topic). A position-wise feed-forward network then refines each token independently.
Residuals, LayerNorm, positions
Residual connections and LayerNorm keep gradients well-behaved in very deep stacks. Positional encoding (sinusoidal, learned, or rotary/RoPE) injects order because attention itself is permutation-invariant.
- 4
Module 4 — Pre-Training at Scale
Distributed training, optimizers, mixed precision and the engineering reality of multi-GPU runs.
Module 4 — Parameter and state sharding across GPUs. Parallelism strategies
Data parallel splits the batch, tensor parallel splits weight matrices, pipeline parallel splits layers into stages, and FSDP / ZeRO shards parameters, gradients and optimizer states across workers. Real training runs combine several of these.
Optimizer and precision
AdamW with weight decay is the default. Bfloat16 (or fp16 + loss scaling) cuts memory and speeds up matmul on modern accelerators. Gradient checkpointing trades extra compute for much lower activation memory.
Stability
Warmup the learning rate, then decay it (cosine is common). Clip gradients. Watch for loss spikes — they usually mean a bad batch, an LR too high, or numerical drift, not divine intervention.
- 5
Module 5 — Fine-Tuning Techniques for LLMs
Why fine-tuning exists, the five major approaches (Full, LoRA, QLoRA, PEFT, Instruction Tuning), domain adaptation, safety alignment, and how to build and evaluate a real pipeline.
Module 5 — LoRA adds a low-rank update on top of frozen base weights. 5.0 — Why fine-tuning exists
Training an LLM from scratch is expensive, slow, and resource-intensive. Fine-tuning is the practical, cost-effective alternative: specialize a base model, improve domain performance, add capabilities, align behavior, reduce hallucinations, improve safety, and train on proprietary data — all without retraining from zero.
5.1 — What fine-tuning actually does
Fine-tuning nudges existing weights to learn new patterns, adapt to a domain, follow instructions better, and reduce errors. It does NOT replace the base model, retrain from scratch, or change the architecture — it shifts the model in a new direction.
5.2 — The five major types
Full Fine-Tuning updates every weight. LoRA adds small trainable matrices. QLoRA combines LoRA with 4-bit quantization. PEFT is the umbrella category (LoRA, Prefix Tuning, Prompt Tuning, Adapters, BitFit). Instruction Tuning teaches the model to follow instructions. Each has different cost/benefit tradeoffs.
5.3 — Full fine-tuning
Updates every parameter. Pros: maximum performance, deep domain adaptation, best for large datasets. Cons: massive GPU requirements, expensive, slow, and risk of catastrophic forgetting. Used for medical, legal, scientific, and enterprise-specific models.
5.4 — LoRA (Low-Rank Adaptation)
The most popular method. Freeze the base model, inject small trainable low-rank matrices into key projections, and train only those. Fast, cheap, low-GPU, easy to merge/unmerge, and resistant to catastrophic forgetting. Works because most useful weight updates lie in a low-rank subspace.
5.5 — QLoRA
Combines 4-bit quantization of the base model with LoRA adapters on top. Enables fine-tuning of 70B models on a single GPU with minimal VRAM and strong performance. Today's industry standard for cost-effective fine-tuning.
5.6 — PEFT (Parameter-Efficient Fine-Tuning)
Umbrella term covering LoRA, Prefix Tuning, Prompt Tuning, Adapter Tuning, and BitFit. PEFT methods typically train less than 1% of the model's parameters while approaching full fine-tuning quality.
5.7 — Instruction tuning
Teaches the model to follow instructions, respond helpfully, stay on topic, and avoid hallucinations. Common datasets: FLAN, Dolly, Alpaca, OpenHermes, and synthetic instruction sets. Essential for chatbots and assistants.
5.8 — Domain adaptation
Specializes the model for law, medicine, finance, cybersecurity, engineering, or customer support. Requires domain-specific datasets, domain-specific evaluation, and safety alignment matched to the domain's risk profile.
5.9 — Safety alignment
Covers RLHF, Constitutional AI, curated safety datasets, red-teaming, and toxicity filtering. Ensures compliance, ethical behavior, and reduced harm before deployment.
5.10 — The fine-tuning pipeline
A professional pipeline: dataset preparation → tokenization → LoRA/QLoRA setup → training loop → validation → checkpointing → merging adapters → deployment. Each stage has its own failure modes worth monitoring.
5.11 — Evaluation
Evaluate fine-tuned models with perplexity, task accuracy, domain benchmarks, safety tests, and human evaluation. Never rely on a single metric — combine automated evals with human spot-checks.
- 6
Module 6 — Evaluation & Benchmarks
Designing trustworthy evals, avoiding contamination, and reading benchmark results without lying to yourself.
What good evals look like
Hold-out evaluation sets must never appear in training. They should mirror production traffic, include edge cases, and be stable enough to compare model versions over time.
Standard benchmarks
MMLU for broad knowledge and reasoning, GSM8K and MATH for math, HumanEval and MBPP for code, MT-Bench and Arena for chat quality. Always check which version and how it was scored.
Contamination & gaming
Public benchmarks leak into training corpora. Treat single-benchmark wins skeptically; rely on aggregates, private evals and human spot-checks.
- 7
Module 7 — Deployment & Inference Optimization
Turning a trained LLM into a real product: pipelines, GPU vs CPU, quantization, distillation, serving architectures, caching, KV-cache, streaming, batching, and cost optimization.
7.0 — Why deployment matters
Deployment determines speed, cost, capacity, reliability, safety, and whether the model can scale. A model that performs well in training but poorly in deployment is not a usable model. Roughly 90% of real-world LLM cost lives in inference, not training.
7.1 — What inference is
Inference is running the model to generate outputs. Training = learning, inference = using. It must be fast, cheap, reliable, scalable, and safe.
7.2 — The inference pipeline
User request → API gateway → load balancer → inference server → tokenizer → model forward pass → logits → softmax → tokens → detokenizer → response. Every step is an optimization target.
7.3 — GPU vs CPU inference
GPUs are fast, parallel, and built for matrix math, but expensive and supply-constrained. CPUs are cheap, scalable, and easy to deploy but slow and size-limited. Rule of thumb: small models on CPU, medium on either, large on GPU only.
7.4 — Quantization
Reduces weight precision: FP16, BF16, INT8, INT4. Lower memory, faster inference, lower cost — at a small accuracy cost. INT4 is the most popular for deployment: it enables 70B models on a single GPU and 7B models on CPU.
7.5 — Distillation
Train a smaller student model to mimic a larger teacher. Faster, cheaper, smaller footprint — at the cost of some accuracy and reasoning. Used heavily by Mistral, Google, Meta, and Cohere.
7.6 — Model serving architectures
Single-node: simple, cheap, but a single point of failure. Multi-node: scalable, redundant, more complex. Distributed: model split across GPUs/nodes — supports huge models with high throughput but requires specialized frameworks.
7.7 — Load balancing
Distributes requests across servers via round-robin, least-load, latency-based, or weighted strategies. Ensures availability, low latency, and even GPU usage.
7.8 — Caching
Stores embeddings, token sequences, and partial computations. Common types: KV-cache (key/value), prompt cache, and response cache. KV-cache is the single most important.
7.9 — KV-cache explained
During generation, keys and values for past tokens are stored so future tokens don't recompute them. Without KV-cache, 100 tokens = 100 full passes. With it, 100 tokens = 1 full pass + 99 cheap incremental passes. This is why modern LLMs stream quickly.
7.10 — Token streaming
Send tokens to the client as they're generated. Lower perceived latency, better UX, faster feedback. Essential for chatbots, assistants, and real-time systems.
7.11 — Batch inference
Processes multiple requests in one forward pass. Higher throughput, lower per-request cost, better GPU utilization, at the cost of slightly higher latency. Used by every major provider.
7.12 — Serverless LLMs
On-demand, auto-scaling, pay-per-use. No idle cost, but cold starts and limited GPU availability remain real tradeoffs.
7.13 — Inference optimization techniques
Quantization, KV-cache optimization, FlashAttention, TensorRT-LLM, speculative decoding (small draft model proposes tokens, big model verifies in batch), Mixture-of-Experts (activate only parts of the model), and distillation.
7.14 — Cost optimization
Driven by model size, hardware, batch size, quantization, throughput, and latency targets. Levers: smaller models, quantized models, CPU for small models, GPU for large ones, batching, caching, and distillation.
7.15 — Monitoring & logging
Track latency, tokens-per-second, GPU utilization, memory, error rate, safety violations, and user satisfaction. Without monitoring you cannot diagnose regressions or abuse.
7.16 — Failure modes in deployment
OOM, GPU overload, latency spikes, deadlocks, tokenizer mismatch, KV-cache corruption, network failures. Professional systems add auto-restart, auto-scaling, health checks, and redundancy.
7.17 — Deployment readiness checklist
Latency < 300ms, TPS > 50, stable memory, active safety filters, no hallucination spikes, benchmarks passed, red-teaming passed, monitoring active.
- 8
Module 8 — Building Your Own LLM API
Turning an LLM into a real product: gateway, auth, rate limiting, safety filters, load balancing, streaming, logging, monitoring, versioning, scaling, and security.
Module 8 — Minimum viable LLM serving topology. 8.0 — Why every LLM needs an API
Training and deploying isn't enough — usability requires an API. It's what lets websites, mobile apps, internal tools, chatbots, automations, and enterprise systems talk to your model.
8.1 — What an LLM API is
A web endpoint that accepts a prompt + parameters and returns generated text, tokens, metadata, safety flags, and usage stats. Example: POST /v1/generate with prompt, max_tokens, temperature → text, tokens_used, latency_ms, safety.
8.2 — Architecture overview
Client → API gateway → auth → rate limiter → safety filter → load balancer → inference servers → tokenizer → model → detokenizer → response formatter → client. Each layer has a single, well-defined purpose.
8.3 — API gateway
Front door of your LLM. Handles routing, authentication, logging, rate limiting, request validation, and error handling.
8.4 — Authentication
Ensures only authorized callers reach the model. Common methods: API keys, OAuth2, JWT, HMAC signatures. Rotate keys, revoke compromised ones, scope permissions, log usage per key.
8.5 — Rate limiting
Stops abuse, DDoS, runaway cost, and overload. Strategies: requests per minute, tokens per minute, burst caps, sliding windows. Example budget: 100 req/min and 10,000 tokens/min per key.
8.6 — Safety filters
Run before and after the model. Check for toxicity, hate, violence, self-harm, illegal content, sensitive data leaks, and jailbreak attempts. Actions: block, sanitize, replace, warn, log.
8.7 — Load balancing
Distributes traffic across inference servers via round-robin, least-load, latency-based, or weighted routing — for availability, low latency, and even GPU usage.
8.8 — Inference server
Loads the model, tokenizes input, runs the forward pass, applies sampling, streams tokens, returns output. Needs high-end GPU, fast storage, low-latency networking, and KV-cache support.
8.9 — Tokenization layer
Convert text → tokens using the SAME tokenizer, vocabulary, merges, and normalization as training. Tokenizer mismatch = broken model.
8.10 — Sampling methods
Control generation with temperature, top-k, top-p, repetition penalty, and max_tokens. Typical defaults: temperature 0.7, top_k 50, top_p 0.9.
8.11 — Streaming responses
Send tokens as they're generated. Lower perceived latency, better UX, faster feedback. Essential for chatbots, assistants, and real-time systems.
8.12 — Logging
Critical for debugging, monitoring, billing, safety audits, and abuse detection. Log timestamp, user ID, API key, prompt length, tokens generated, latency, safety flags, and errors.
8.13 — Monitoring
Track latency, TPS, GPU utilization, memory, error rate, safety violations, and user satisfaction to ensure reliability, performance, and safety.
8.14 — Error handling
Common errors: invalid API key, rate limit exceeded, unsafe content, model overload, timeout, tokenizer mismatch. Return structured errors with retry_after where applicable.
8.15 — Versioning
Prevents breaking changes. Each version (/v1/generate, /v2/generate) can have its own model, tokenizer, and safety rules. Clients pin a version, you upgrade safely.
8.16 — Scaling strategies
Horizontal, vertical, auto-scaling, sharding, and multi-region. Multi-region gives lower latency, higher availability, and disaster recovery.
8.17 — Security
Protect user data, API keys, model weights, logs, and infrastructure. Encrypt everything, rotate keys, enforce HTTPS, use firewalls, WAF, IAM roles, and audit logs.
- 9
Module 9 — Building a Chatbot Interface
Turning an LLM API into a real product people can use: UI/UX, sessions, memory, safety, RAG, streaming, formatting, error handling, multi-turn reasoning, analytics, deployment, and security.
9.0 — Why chatbots matter
Raw LLM APIs aren't user-friendly. Most users can't write API calls, format JSON, manage tokens, or interpret logs. A chatbot UI hides the complexity and exposes the intelligence.
9.1 — What makes a good chatbot
Fast, clear, helpful, safe, context-aware, reliable, scalable, and beautiful. A chatbot isn't just a UI — it's a full system spanning frontend, backend, memory, and safety.
9.2 — Architecture overview
User → frontend UI → message handler → session manager → input safety filter → RAG/memory retrieval → LLM API → output safety filter → response formatter → frontend UI. Each layer has a single purpose.
9.3 — Frontend UI design
Clean, minimal, responsive, accessible, fast. Core components: input box, send button, history panel, streaming output, loading indicator, error messages, optional settings, model selector, file upload. Use rounded bubbles, timestamps, avatars, subtle animations, markdown rendering, and code blocks.
9.4 — Message handling
Receive, validate, sanitize, send, display. Validate against empty, oversized, unsafe, or unsupported messages. Sanitize HTML, scripts, and malicious payloads before they touch the model.
9.5 — Session management
A session stores conversation history, user preferences, model settings, memory state, and RAG context. Types: stateless, stateful, or hybrid (summarized). Storage: in-memory, Redis, database, or browser local storage.
9.6 — Memory systems
Short-term (recent messages), long-term (stored facts), working (summaries), and RAG (external docs retrieved on demand). Challenges: token limits, drift, hallucinations, and privacy.
9.7 — Safety filters
Run on input (harmful intent, illegal requests, self-harm, violence, hate, jailbreaks) and on output (toxicity, bias, unsafe instructions, misinformation, sensitive data leaks). Protect users, company, and reputation.
9.8 — RAG integration
Embed the query → search the vector DB → retrieve top-k chunks → insert into prompt → call the LLM → generate an answer. Result: accurate, up-to-date, domain-specific responses with citations.
9.9 — Streaming responses
Send tokens to the UI as they're generated. Lower perceived latency, better UX, more natural conversation. Implement with WebSockets, Server-Sent Events, or HTTP chunked responses.
9.10 — Response formatting
Render markdown: headings, bullets, code blocks, tables, bold/italic. Essential for technical answers, tutorials, code explanations, and summaries.
9.11 — Error handling
Handle timeouts, rate limits, invalid input, safety blocks, server overload, and network failures. Never show raw stack traces or internal errors. Always offer a retry path.
9.12 — Multi-turn reasoning
Maintain context across long conversations using history, summarization, memory compression, and careful context-window management.
9.13 — Analytics & logging
Track satisfaction, conversation length, error rate, safety violations, popular topics, token usage, and model performance. Log user ID, timestamp, prompt, response, safety flags, latency, and tokens. Drives UX, safety, performance, and product decisions.
9.14 — Chatbot deployment
Requires API gateway, load balancer, inference servers, logging, monitoring, safety filters, database, and CDN. Deploy to cloud, on-prem, hybrid, or edge depending on latency and compliance needs.
9.15 — Chatbot security
Protect user data, API keys, model weights, logs, and infrastructure. Encrypt everything, sanitize input and output, enforce HTTPS, use WAF, IAM roles, and audit logs.
9.16 — UX principles
A great chatbot feels fast, friendly, helpful, predictable, and trustworthy. Always show typing indicators, stream responses, allow message editing/deletion, show timestamps, and surface clear error messages.
- 10
Module 10 — Retrieval-Augmented Generation (RAG)
Connect the LLM to external knowledge so it stops hallucinating: embeddings, vector DBs, similarity search, chunking, context assembly, RAG prompts, evaluation, failure modes, advanced techniques, security, deployment, and performance.
Module 10 — Standard RAG pipeline. 10.0 — Why RAG exists
LLMs can't see anything after their training cut-off and hallucinate confidently when they don't know. Unacceptable in law, medicine, finance, engineering, and enterprise. RAG fixes both problems.
10.1 — What RAG is
Retrieval-Augmented Generation = LLM + Search + Context Injection. The model retrieves relevant information from an external source (vector DB, doc store, knowledge graph, corporate DB) BEFORE generating. Now the industry standard for enterprise AI.
10.2 — Why RAG works
LLMs are excellent at reasoning but terrible at recall. Databases are excellent at recall but can't reason. RAG combines the strengths of both.
10.3 — Architecture overview
User query → embed query → vector search → retrieve top-k documents → context assembly → LLM prompt → generated answer → response. Each step is critical.
10.4 — Embeddings
Convert text into high-dimensional vectors (768–4096) that capture semantic similarity. Use the SAME embedding model for indexing and querying — mixing models breaks recall completely.
10.5 — Vector databases
Store and search embeddings: FAISS, Pinecone, Weaviate, Milvus, Chroma, Qdrant, pgvector. Need fast similarity search, scalable indexing, metadata filtering, hybrid search, and persistence.
10.6 — Similarity search
Finds documents closest to the query vector using cosine similarity, dot product, or Euclidean distance. Top-k results become the LLM's grounding context.
10.7 — Chunking
Split documents into focused pieces (typically 256/512/1024 tokens) with overlap. Better chunking = better retrieval accuracy, less noise, fewer hallucinations. Chunks must be semantic and clean.
10.8 — Context assembly
Build the final prompt: system role + retrieved chunks + user query + answer format. Forces the LLM to use retrieved knowledge instead of inventing it.
10.9 — RAG prompt engineering
Always instruct the model to answer ONLY from provided context and to say 'I don't know' when the answer isn't there. Dramatically reduces hallucinations.
10.10 — Evaluation
Measure retrieval accuracy, context relevance, answer accuracy, hallucination rate, latency, and coverage. Evaluate retrieval and generation separately so you can fix the right layer.
10.11 — Failure modes
Retrieval failure, bad chunking, embedding drift, context overflow, model ignoring context, latency spikes, missing data in the corpus.
10.12 — Advanced RAG techniques
Multi-vector retrieval, hybrid (keyword + vector) search, re-ranking with a second model, context compression/summarization, Graph-RAG over knowledge graphs, and agentic RAG where the LLM decides what to retrieve.
10.13 — Enterprise RAG
Powers internal knowledge bases, customer support, legal/medical search, financial analysis, engineering manuals, and compliance. Often the only safe way to use LLMs in regulated industries.
10.14 — RAG security
Encrypt embeddings and source documents, enforce access control, keep audit logs, filter PII, and apply data governance. RAG corpora usually contain sensitive corporate data.
10.15 — Deployment
Needs a vector DB cluster, embedding service, RAG orchestrator, LLM API, safety filters, monitoring, and logging. RAG is a full system, not a feature.
10.16 — Performance optimization
Pre-computed embeddings, approximate nearest neighbor (ANN) search, caching, batch retrieval, multi-threaded and GPU-accelerated search. Critical for real-time chatbots.
- 11
Module 11 — AI Safety, Alignment & Governance
Bias, hallucination, prompt injection, jailbreaks, red-teaming, guardrails, and the safety lifecycle.
Module 11 — Safety as a lifecycle, not a checkbox. Core risks
Hallucination, bias, toxicity, privacy leakage, prompt injection, jailbreaks, and misuse for fraud or harm. Each has both technical and policy mitigations.
Mitigation stack
Filter pre-training data, align with RLHF/DPO, add input/output guardrails, ground with retrieval, set refusal policies, monitor in production, and red-team continuously. No single layer is enough.
Governance
Publish a model card with intended use, limitations, evaluations and known risks. Keep an abuse policy, an incident playbook and a rollback plan. Compliance is part of the engineering, not an afterthought.
- 12
Module 12 — Capstone Project: Building a Complete LLM System
The final demonstration of mastery. Students build a full-stack AI product: dataset, fine-tuning, RAG, inference API, chatbot UI, safety, monitoring, model card, technical report, and demo video.
12.0 — Why the capstone matters
Not a theoretical exercise. A real engineering project that simulates the work done in enterprise AI teams, startups, research labs, government AI programs, and consulting firms. Transforms learners into practitioners.
12.1 — Capstone overview
Students build a complete LLM system end-to-end: data pipeline, fine-tuning pipeline, RAG system, inference API, chatbot interface, safety filters, monitoring & analytics, documentation, model card, and demo video.
12.2 — Deliverables
A fine-tuned model (LoRA/QLoRA), a vector database, a RAG pipeline, a production-ready API, a chatbot interface, a safety system, a monitoring dashboard, a model card, a technical report, and a 5–10 minute demo video.
12.3 — Capstone architecture
Runtime: User → Chatbot UI → API Gateway → Input Safety Filter → RAG Retrieval → LLM Inference → Output Safety Filter → Response Formatter → UI. Behind the scenes: Data Pipeline → Fine-Tuning → Model → API → Chatbot → Monitoring.
12.4 — Step 1: Build the dataset
Pick a domain (legal, medical, finance, engineering, support, tech docs). Build a minimum of 500 cleaned, deduplicated, tokenized, documented examples covering instructions, responses, explanations, and safety cases.
12.5 — Step 2: Fine-tune the model
Fine-tune a 7B–13B base with LoRA/QLoRA/PEFT. Deliver training logs, loss curves, checkpoints, and evaluation results. Demonstrate proper hyperparameters, LR schedule, validation, and checkpointing.
12.6 — Step 3: Build the vector database
Chunk, embed, and store at least 100 documents with metadata. Document chunk size (256–1024 tokens), embedding model, and vector DB choice. Build the retrieval pipeline on top.
12.7 — Step 4: Build the RAG pipeline
Implement query embedding, vector search, top-k retrieval, context assembly, and a documented RAG prompt template. Deliver retrieval logs and evaluation results.
12.8 — Step 5: Build the inference API
Auth, rate limiting, logging, error handling, streaming, and safety filters. Required endpoints: /generate, /rag-generate, /health, /metrics.
12.9 — Step 6: Build the chatbot interface
Streaming responses, message history, memory, safety warnings, RAG citations, optional model selector. Clean UI with markdown, code blocks, and clear error messages.
12.10 — Step 7: Implement safety filters
Input filters (harmful intent, illegal requests, self-harm, violence, hate) and output filters (toxicity, bias, unsafe instructions, sensitive data). Deliver safety logs, test results, and a red-team report.
12.11 — Step 8: Monitoring & analytics
Dashboard with latency, TPS, token usage, safety violations, error rate, RAG retrieval accuracy, and (optional) user satisfaction. Real-time + historical metrics with alerts.
12.12 — Step 9: Model card
Document purpose, training data, fine-tuning data, intended use cases, prohibited use cases, risks, limitations, safety measures, and evaluation results. Required for compliance.
12.13 — Step 10: Technical report
10–20 page report covering architecture, data pipeline, fine-tuning, RAG, API, chatbot, safety, monitoring, evaluation results, and lessons learned. Becomes the student's professional portfolio piece.
12.14 — Step 11: Demo video
5–10 minute video demonstrating chatbot usage, RAG retrieval, safety filters, API calls, the monitoring dashboard, and the system architecture. The final deliverable.
12.15 — Evaluation rubric
Technical Quality 40% (model, RAG, API, UX), Safety & Governance 20% (filters, red-teaming, model card, compliance), Documentation 20% (report, diagrams, dataset docs), Presentation 20% (demo, clarity, professionalism).
- 13
Module 13 — Appendix A — Instructor Guide & Teaching Manual
For instructors, TAs, partners, and licensees running cohort-based teaching, corporate training, or institutional workshops.
Purpose
Enables cohort-based teaching, corporate training, instructor-led workshops, and licensing to institutions.
Learning objectives by module
Every module ships with learning outcomes, key concepts, common misconceptions, teaching notes, discussion prompts, lab guidance, and assessment criteria.
Lab instructions
Each module includes lab setup, expected outputs, troubleshooting, and extension tasks for advanced students.
Assessment rubrics
Rubrics for labs, capstone, exam, and optional participation grading.
Suggested pacing
12-week program, 2 modules per week, capstone in weeks 10–12.
Instructor scripts
Guidance on how to introduce each module, explain complex topics, lead discussions, and handle student questions.
Troubleshooting guide
Covers Python issues, GPU issues, tokenization errors, RAG failures, API errors, and safety filter issues.
- 14
Module 14 — Appendix B — Student Workbook & Lab Book
Hands-on exercises for every module: setup, steps, expected outputs, reflection questions, and challenge tasks.
Module-by-module labs
Each module includes 3–6 lab exercises with step-by-step instructions, expected outputs, reflection questions, and 'challenge mode' extension tasks.
Example: Lab 5.2 — Fine-Tune a Model with QLoRA
Steps: load a 7B model → apply QLoRA adapters → train on 200 examples → evaluate perplexity → save the adapter → merge the adapter → test the model. Reflection: what improved, what degraded, what surprised you.
Workbook sections
Notes pages, diagrams, checklists, code snippets, and troubleshooting notes for every module.
- 15
Module 15 — Appendix C — Portfolio Template
A complete GitHub repo + portfolio website template so students ship like professional AI engineers.
GitHub repo structure
/project with /api, /rag, /chatbot, /model, /data, /safety, /monitoring directories, plus README.md, demo.mp4, and model-card.md at the root.
Portfolio website template
Sections: About Me, Skills, Projects, Capstone, Certifications, Contact.
README template
Project overview, architecture diagram, features, installation, usage, API endpoints, safety notes, and license.
- 16
Module 16 — Appendix D — Mastery Checklist
If you can do everything on this list, you are an LLM Engineer.
Foundations
Understand tokens, embeddings, attention, and transformers end-to-end.
Data engineering
Clean, filter, deduplicate, and tokenize datasets confidently.
Training
Train a model, use distributed training, choose optimizers, and run learning-rate schedules.
Fine-tuning
LoRA, QLoRA, PEFT, and instruction tuning — pick and execute the right approach per problem.
RAG
Build a vector DB, chunk and embed documents, retrieve top-k, and assemble context.
Deployment
Build an API with authentication, rate limiting, logging, and monitoring.
Chatbot
Build the UI with streaming, memory, and safety filters.
Safety
Input filtering, output filtering, red-teaming, model cards, governance.
Capstone
Build the full system, document it, present it, deploy it.
- 17
Module 17 — Prompt & Context Engineering
How to elicit the best behavior from an LLM through prompt design, structured outputs, and context packing — the highest-leverage skill in applied AI.
Why prompt engineering still matters
Even with frontier models, prompt quality dominates downstream quality. Small wording, ordering and formatting changes routinely move evaluation scores by 10–30%. Prompt engineering is the cheapest, fastest lever you have before training, fine-tuning, or RAG.
Core techniques
Zero-shot, few-shot, chain-of-thought (CoT), self-consistency, ReAct, tree-of-thoughts, reflexion, and prompt chaining. Each has a use case: CoT for reasoning, few-shot for format conformance, ReAct for tool use, chaining for multi-stage workflows.
Structured outputs
JSON mode, function/tool schemas, regex-constrained decoding, and grammar-based generation (GBNF, Outlines, Instructor). Structured outputs are mandatory for production — never parse free-form text when a schema will do.
Context engineering
Decide what goes into the context window and in what order: system prompt → tools → retrieved docs → conversation history → user message. Recency and primacy bias the model; place the most important instructions at the very top and the most important data at the very bottom.
Prompt injection & defenses
Untrusted text in the context can override your instructions. Defenses: privilege separation, instruction hierarchy, delimiter tagging, output filtering, and never executing tool calls produced from untrusted data without a confirmation step.
Evaluation
Track prompt versions like code. Use a fixed eval set, run A/B comparisons, and log win-rate per prompt. Tools: PromptLayer, Langfuse, Braintrust, OpenAI Evals.
- 18
Module 18 — Agents, Tools & MCP
Going beyond single-turn chat: function calling, ReAct loops, multi-agent systems, the Model Context Protocol, and how to ship autonomous workflows safely.
What an agent actually is
An agent is an LLM in a loop that observes, plans, acts via tools, and reflects. The loop runs until a stopping criterion is met. The model is the brain; tools are the hands; the loop is the nervous system.
Function calling & tool use
Define tools as JSON schemas. The model emits a structured tool call; your runtime executes it and returns the result. Best practices: small focused tools, idempotent operations, explicit error returns, and timeouts on every call.
Agent architectures
ReAct (reason-act-observe), planner-executor, multi-agent debate, hierarchical agents (manager + workers), and swarm patterns. Pick the simplest architecture that solves the task — most production agents are single-loop ReAct.
Model Context Protocol (MCP)
An open standard (Anthropic, 2024) for connecting LLMs to tools, data sources, and prompts. MCP servers expose resources, tools, and prompts; MCP clients (Claude Desktop, Cursor, custom apps) consume them. Learn to write both a server and a client.
Memory & state
Short-term (conversation buffer), long-term (vector store of past interactions), and episodic (structured logs of agent runs). Frameworks: LangGraph, CrewAI, AutoGen, Mastra.
Safety & control
Sandboxed execution, human-in-the-loop approvals for destructive actions, cost ceilings per run, max-iteration limits, and full observability. Never let an agent execute arbitrary code on production infrastructure without sandboxing.
- 19
Module 19 — RLHF, DPO & Modern Alignment
How frontier models are aligned with human preferences: reward modeling, PPO, DPO, GRPO, Constitutional AI, and RLAIF.
Why SFT is not enough
Supervised fine-tuning teaches the model to imitate demonstrations but does not teach it what to prefer when multiple acceptable answers exist. Alignment teaches preference — being helpful, harmless, and honest at the margin.
RLHF pipeline
Three stages: (1) SFT on demonstrations, (2) train a reward model on human preference pairs, (3) optimize the policy against the reward model using PPO. This is the classic InstructGPT/ChatGPT recipe.
Direct Preference Optimization (DPO)
DPO skips the reward model entirely and optimizes the policy directly on preference pairs using a clever reformulation of the RLHF objective. Simpler, more stable, cheaper. Now the default for most open-source alignment work.
GRPO & online RL
Group Relative Policy Optimization (DeepSeek) eliminates the value model by computing advantages relative to a group of sampled responses. Used in DeepSeek-R1 to train reasoning behavior at scale.
Constitutional AI & RLAIF
Replace human labels with AI-generated critiques against a written constitution. Anthropic's Claude was trained this way. Cheaper, more scalable, but only as good as the constitution and critic model.
Evaluating alignment
MT-Bench, AlpacaEval, Chatbot Arena Elo, and red-teaming. Track helpfulness vs. harmlessness as a Pareto frontier — over-aligning makes models refuse benign requests.
- 20
Module 20 — Multimodal Models
Vision, audio, and video: how models extend beyond text and how to build multimodal applications.
Vision-language models
CLIP (contrastive image-text embeddings), BLIP, LLaVA, GPT-4V, Claude 3.5 Vision, Gemini. Architecture: a vision encoder (usually a ViT) projects images into the language model's embedding space.
Audio & speech
Whisper for ASR (speech-to-text), TTS models (ElevenLabs, OpenAI TTS), and speech-to-speech (GPT-4o realtime, Moshi). Latency budgets are tight — sub-300ms for natural conversation.
Image & video generation
Diffusion (Stable Diffusion, FLUX, Imagen), autoregressive image generation (GPT-4o, Gemini Nano Banana), and video (Sora, Veo, Runway). Conditioning: text, image, depth, pose, sketch.
Multimodal RAG
Embed images and text in a shared space (CLIP, Nomic, Voyage Multimodal). Retrieve mixed-modality results. Critical for product catalogs, technical documentation with diagrams, and medical imaging.
Building multimodal apps
Patterns: OCR-then-LLM (cheap, lossy), native VLM (expensive, accurate), and hybrid (VLM for layout, LLM for reasoning). Choose based on document complexity and cost budget.
- 21
Module 21 — Reasoning Models & Test-Time Compute
The 2025 frontier: o1, o3, DeepSeek-R1, and the shift from scaling training to scaling inference.
What a reasoning model is
A model trained to spend additional compute at inference time thinking before answering. The model generates a long internal chain-of-thought (often hidden) and then produces the final answer. Examples: OpenAI o1/o3, DeepSeek-R1, Gemini 2.5 Thinking, Claude Extended Thinking.
Test-time compute scaling
Performance on hard tasks (math, code, science) scales smoothly with thinking tokens. You can trade dollars for IQ at inference time. This is a new scaling axis orthogonal to model size and training data.
How reasoning models are trained
Large-scale RL on verifiable rewards (math problems with known answers, code that passes tests). DeepSeek-R1 showed pure RL from a base model can elicit reasoning behavior without any SFT demonstrations.
Verifiers & process reward models
Train a separate model to score reasoning steps (PRM) or final answers (ORM). Use the verifier to rerank N sampled solutions (best-of-N) or guide search (MCTS).
When to use reasoning models
Math, code, scientific analysis, complex planning, agentic workflows. NOT for: simple chat, summarization, classification — they cost 5–50× more per token and add latency. Route requests dynamically.
- 22
Module 22 — Cost Economics & FinOps for LLMs
Token economics, caching, model routing, and how to keep an LLM product profitable at scale.
Unit economics
Model every feature as $/request: input tokens × input price + output tokens × output price + retrieval cost + infrastructure overhead. Compare against revenue per user. Most failed AI startups lost money per request and didn't know it.
Prompt caching
OpenAI, Anthropic, and Gemini all support prefix caching: identical prompt prefixes are billed at 10–25% of normal cost. Restructure prompts to put stable content (system, tools, documents) first and variable content (user message) last.
Model routing
Use a cheap classifier (or a small LLM) to route requests to the smallest model that can handle them. Patterns: cascade (try cheap, escalate on low confidence), router (classify upfront), and ensemble. Can cut costs 60–90%.
Batch APIs & async
OpenAI and Anthropic Batch APIs run at 50% cost with 24h SLA. Use for evaluations, backfills, classification jobs, and any non-interactive workload.
Prompt & context compression
LLMLingua, summary-buffer memory, and selective retrieval can cut input tokens 5–10× with minimal quality loss. Critical for long-context applications.
Self-hosting vs. API
Self-hosting (vLLM, SGLang on rented GPUs) becomes cheaper than API calls around 100M+ tokens/day for open models. Below that, APIs win on every dimension: cost, latency, reliability, model quality.
- 23
Module 23 — Career & Interview Preparation
How to land an AI engineering role: portfolio, system design interviews, common pitfalls, and the four major career tracks.
The four tracks
(1) Research scientist — builds new architectures, requires PhD or equivalent publications. (2) Applied AI engineer — ships LLM products, the largest and fastest-growing role. (3) ML platform / MLOps — builds the training and inference infrastructure. (4) AI safety / alignment — red-teams, evals, governance. Pick one and specialize.
Portfolio that gets interviews
Three projects beats ten. Each must have: a live demo URL, a clean GitHub repo with README, a one-paragraph problem statement, an architecture diagram, and a section on tradeoffs. Capstone counts as one — build two more in your target domain.
LLM system design interview
Common prompts: 'design ChatGPT', 'design a code assistant', 'design a RAG system for legal docs'. Framework: clarify requirements → estimate scale → draw architecture → discuss data, training, serving, evaluation, safety, cost. Always quantify (tokens/sec, $/request, latency p95).
Common interview topics
Attention mechanism math, why decoder-only won, LoRA vs full fine-tune tradeoffs, RAG failure modes, hallucination mitigation, prompt injection, evaluation strategies, cost optimization. Be able to whiteboard each in 5 minutes.
Resume signals that work
Specific models you've shipped (not 'used AI'), measurable outcomes (latency cut 60%, cost reduced 40%, accuracy improved 12 points), open-source contributions (HuggingFace, vLLM, llama.cpp), and published evaluations or blog posts.
Negotiation & comp
AI engineer comp in 2026: $180k–$400k base in US tier-1 cities, $500k–$1M+ total at frontier labs. Always negotiate. Use levels.fyi, Blind, and competing offers. Equity vests over 4 years — model the expected value.
Ready for the capstone?
100 questions · 80% to pass. Your diploma is auto-issued to your account name on pass.
