GACS Logo
Certifications library
LLM Engineering Diploma · Required pre-exam reading
GACS Master Diploma · LLM Engineering

Designing & Building Large Language Models — Study Manual

World-class, exam-aligned reference for the GACS Diploma in LLM Engineering & Applied AI Architecture. Twenty-plus modules covering transformers, training, fine-tuning, RAG, agents, RLHF, reasoning models, multimodal, deployment, safety, FinOps and career prep.

  1. 0

    Module 0Orientation — Start Here: How to Succeed in This Program

    Welcome, what's included, technical requirements, time commitment, how to study, how to pass the exam, how to complete the capstone, and how to get help.

    Welcome

    Welcome to the GACS Diploma in Designing & Building Large Language Models. This program transforms you from a learner into a full-stack AI engineer capable of building real, production-grade LLM systems.

    What's included

    12 full textbook modules, a 100-question final exam, a capstone project, diploma + transcript + badges, a RAG system build, an API deployment, a chatbot interface, and safety & governance training.

    Technical requirements

    A computer (Windows, macOS, or Linux), Python 3.10+, VS Code or PyCharm, a GitHub account, a HuggingFace account, and basic terminal knowledge.

    Time commitment

    Modules: 40–60 hours. Capstone: 20–40 hours. Total: 60–100 hours.

    How to study

    Read each module fully, complete the lab exercises, build your portfolio as you go, take notes, ask questions, and practice coding constantly.

    How to pass the exam

    100 questions, 80% required to pass, unlimited attempts, open-book. Treat it as a final review, not a guessing game.

    How to complete the capstone

    Follow the 11-step build process, then submit your GitHub repo, demo video, model card, and technical report.

    How to get help

    Support email, community forum, and optional instructor office hours.

  2. 1

    Module 1Foundations of Large Language Models

    How modern LLMs work end to end: tokens, embeddings, transformer blocks, autoregressive generation, and scaling.

    Input tokens Embedding + positional Multi-head self-attention Feed-forward + LayerNorm Transformer block (decoder-only) Residual + LayerNorm wrap attention and the FFN. Stacked N times.
    Module 1 — Transformer block at a glance.

    What an LLM actually is

    An LLM is a stack of transformer blocks trained to predict the next token in a sequence. Text is split into subword tokens, each token is mapped to a learned embedding vector, and the stack repeatedly mixes information across positions with self-attention and refines it with feed-forward layers. At inference the model samples one token at a time, feeding its own output back in.

    Why scale matters

    Larger models, trained on more data with more compute, follow smooth scaling laws: loss decreases predictably as parameters, tokens and FLOPs grow together. Chinchilla showed that for a fixed compute budget you should balance parameters and training tokens rather than dumping everything into one axis.

    Core vocabulary

    Token, embedding, attention head, context window, perplexity, logits and temperature are the words you need to be fluent in. Most operational decisions — context size, KV cache cost, sampling behavior — trace back to these primitives.

  3. 2

    Module 2Tokenization & Embeddings

    How raw text becomes numbers and why this layer silently controls cost, quality and multilingual performance.

    Tokenization pipeline (BPE) "unbelievable" ["un","believ","able"] → token IDs (e.g. [438, 9912, 1124]) → embedding vectors
    Module 2 — From raw string to token IDs to embedding vectors.

    Subword tokenization

    Modern LLMs use byte-pair encoding (BPE) or SentencePiece. Rare words are split into common subword pieces, so the model can represent anything without an unbounded vocabulary. A bad tokenizer wastes tokens (and money) on every request.

    Embeddings as meaning

    Each token ID is mapped to a high-dimensional vector. After training, semantically related tokens end up near each other. The same embedding idea powers retrieval: encode a chunk of text, store the vector, then search by cosine similarity.

    Operational impact

    Token count drives latency, cost, context limits and KV cache memory. Multilingual quality depends heavily on how well the tokenizer covers the target script. Always measure tokens, not characters.

  4. 3

    Module 3Transformer Architecture in Depth

    Self-attention, multi-head attention, residuals, LayerNorm and positional encoding — the actual machinery.

    Self-attention: softmax(QKᵀ / √dₖ) · V Q (query) K (key) V (value) Attention scores
    Module 3 — Self-attention with Q, K, V projections.

    Self-attention

    Each token produces three projections — query, key and value. Attention scores are softmax(QKᵀ/√dₖ), and the output is a weighted sum of values. This lets every token look at every other token in parallel, which is the core advantage over RNNs.

    Multi-head & FFN

    Multiple attention heads run in parallel so the model can attend to different relations simultaneously (syntax, coreference, long-range topic). A position-wise feed-forward network then refines each token independently.

    Residuals, LayerNorm, positions

    Residual connections and LayerNorm keep gradients well-behaved in very deep stacks. Positional encoding (sinusoidal, learned, or rotary/RoPE) injects order because attention itself is permutation-invariant.

  5. 4

    Module 4Pre-Training at Scale

    Distributed training, optimizers, mixed precision and the engineering reality of multi-GPU runs.

    Distributed training (FSDP / data + tensor parallel) GPU 0 shard of params + optimizer GPU 1 shard of params + optimizer GPU 2 shard of params + optimizer GPU 3 shard of params + optimizer all-reduce gradients · sync optimizer states
    Module 4 — Parameter and state sharding across GPUs.

    Parallelism strategies

    Data parallel splits the batch, tensor parallel splits weight matrices, pipeline parallel splits layers into stages, and FSDP / ZeRO shards parameters, gradients and optimizer states across workers. Real training runs combine several of these.

    Optimizer and precision

    AdamW with weight decay is the default. Bfloat16 (or fp16 + loss scaling) cuts memory and speeds up matmul on modern accelerators. Gradient checkpointing trades extra compute for much lower activation memory.

    Stability

    Warmup the learning rate, then decay it (cosine is common). Clip gradients. Watch for loss spikes — they usually mean a bad batch, an LR too high, or numerical drift, not divine intervention.

  6. 5

    Module 5Fine-Tuning Techniques for LLMs

    Why fine-tuning exists, the five major approaches (Full, LoRA, QLoRA, PEFT, Instruction Tuning), domain adaptation, safety alignment, and how to build and evaluate a real pipeline.

    LoRA: W' = W + B·A (rank-r adapter) W (frozen) + B (d×r) · A (r×d) Train only A and B (tiny). Base weights stay frozen.
    Module 5 — LoRA adds a low-rank update on top of frozen base weights.

    5.0 — Why fine-tuning exists

    Training an LLM from scratch is expensive, slow, and resource-intensive. Fine-tuning is the practical, cost-effective alternative: specialize a base model, improve domain performance, add capabilities, align behavior, reduce hallucinations, improve safety, and train on proprietary data — all without retraining from zero.

    5.1 — What fine-tuning actually does

    Fine-tuning nudges existing weights to learn new patterns, adapt to a domain, follow instructions better, and reduce errors. It does NOT replace the base model, retrain from scratch, or change the architecture — it shifts the model in a new direction.

    5.2 — The five major types

    Full Fine-Tuning updates every weight. LoRA adds small trainable matrices. QLoRA combines LoRA with 4-bit quantization. PEFT is the umbrella category (LoRA, Prefix Tuning, Prompt Tuning, Adapters, BitFit). Instruction Tuning teaches the model to follow instructions. Each has different cost/benefit tradeoffs.

    5.3 — Full fine-tuning

    Updates every parameter. Pros: maximum performance, deep domain adaptation, best for large datasets. Cons: massive GPU requirements, expensive, slow, and risk of catastrophic forgetting. Used for medical, legal, scientific, and enterprise-specific models.

    5.4 — LoRA (Low-Rank Adaptation)

    The most popular method. Freeze the base model, inject small trainable low-rank matrices into key projections, and train only those. Fast, cheap, low-GPU, easy to merge/unmerge, and resistant to catastrophic forgetting. Works because most useful weight updates lie in a low-rank subspace.

    5.5 — QLoRA

    Combines 4-bit quantization of the base model with LoRA adapters on top. Enables fine-tuning of 70B models on a single GPU with minimal VRAM and strong performance. Today's industry standard for cost-effective fine-tuning.

    5.6 — PEFT (Parameter-Efficient Fine-Tuning)

    Umbrella term covering LoRA, Prefix Tuning, Prompt Tuning, Adapter Tuning, and BitFit. PEFT methods typically train less than 1% of the model's parameters while approaching full fine-tuning quality.

    5.7 — Instruction tuning

    Teaches the model to follow instructions, respond helpfully, stay on topic, and avoid hallucinations. Common datasets: FLAN, Dolly, Alpaca, OpenHermes, and synthetic instruction sets. Essential for chatbots and assistants.

    5.8 — Domain adaptation

    Specializes the model for law, medicine, finance, cybersecurity, engineering, or customer support. Requires domain-specific datasets, domain-specific evaluation, and safety alignment matched to the domain's risk profile.

    5.9 — Safety alignment

    Covers RLHF, Constitutional AI, curated safety datasets, red-teaming, and toxicity filtering. Ensures compliance, ethical behavior, and reduced harm before deployment.

    5.10 — The fine-tuning pipeline

    A professional pipeline: dataset preparation → tokenization → LoRA/QLoRA setup → training loop → validation → checkpointing → merging adapters → deployment. Each stage has its own failure modes worth monitoring.

    5.11 — Evaluation

    Evaluate fine-tuned models with perplexity, task accuracy, domain benchmarks, safety tests, and human evaluation. Never rely on a single metric — combine automated evals with human spot-checks.

  7. 6

    Module 6Evaluation & Benchmarks

    Designing trustworthy evals, avoiding contamination, and reading benchmark results without lying to yourself.

    What good evals look like

    Hold-out evaluation sets must never appear in training. They should mirror production traffic, include edge cases, and be stable enough to compare model versions over time.

    Standard benchmarks

    MMLU for broad knowledge and reasoning, GSM8K and MATH for math, HumanEval and MBPP for code, MT-Bench and Arena for chat quality. Always check which version and how it was scored.

    Contamination & gaming

    Public benchmarks leak into training corpora. Treat single-benchmark wins skeptically; rely on aggregates, private evals and human spot-checks.

  8. 7

    Module 7Deployment & Inference Optimization

    Turning a trained LLM into a real product: pipelines, GPU vs CPU, quantization, distillation, serving architectures, caching, KV-cache, streaming, batching, and cost optimization.

    7.0 — Why deployment matters

    Deployment determines speed, cost, capacity, reliability, safety, and whether the model can scale. A model that performs well in training but poorly in deployment is not a usable model. Roughly 90% of real-world LLM cost lives in inference, not training.

    7.1 — What inference is

    Inference is running the model to generate outputs. Training = learning, inference = using. It must be fast, cheap, reliable, scalable, and safe.

    7.2 — The inference pipeline

    User request → API gateway → load balancer → inference server → tokenizer → model forward pass → logits → softmax → tokens → detokenizer → response. Every step is an optimization target.

    7.3 — GPU vs CPU inference

    GPUs are fast, parallel, and built for matrix math, but expensive and supply-constrained. CPUs are cheap, scalable, and easy to deploy but slow and size-limited. Rule of thumb: small models on CPU, medium on either, large on GPU only.

    7.4 — Quantization

    Reduces weight precision: FP16, BF16, INT8, INT4. Lower memory, faster inference, lower cost — at a small accuracy cost. INT4 is the most popular for deployment: it enables 70B models on a single GPU and 7B models on CPU.

    7.5 — Distillation

    Train a smaller student model to mimic a larger teacher. Faster, cheaper, smaller footprint — at the cost of some accuracy and reasoning. Used heavily by Mistral, Google, Meta, and Cohere.

    7.6 — Model serving architectures

    Single-node: simple, cheap, but a single point of failure. Multi-node: scalable, redundant, more complex. Distributed: model split across GPUs/nodes — supports huge models with high throughput but requires specialized frameworks.

    7.7 — Load balancing

    Distributes requests across servers via round-robin, least-load, latency-based, or weighted strategies. Ensures availability, low latency, and even GPU usage.

    7.8 — Caching

    Stores embeddings, token sequences, and partial computations. Common types: KV-cache (key/value), prompt cache, and response cache. KV-cache is the single most important.

    7.9 — KV-cache explained

    During generation, keys and values for past tokens are stored so future tokens don't recompute them. Without KV-cache, 100 tokens = 100 full passes. With it, 100 tokens = 1 full pass + 99 cheap incremental passes. This is why modern LLMs stream quickly.

    7.10 — Token streaming

    Send tokens to the client as they're generated. Lower perceived latency, better UX, faster feedback. Essential for chatbots, assistants, and real-time systems.

    7.11 — Batch inference

    Processes multiple requests in one forward pass. Higher throughput, lower per-request cost, better GPU utilization, at the cost of slightly higher latency. Used by every major provider.

    7.12 — Serverless LLMs

    On-demand, auto-scaling, pay-per-use. No idle cost, but cold starts and limited GPU availability remain real tradeoffs.

    7.13 — Inference optimization techniques

    Quantization, KV-cache optimization, FlashAttention, TensorRT-LLM, speculative decoding (small draft model proposes tokens, big model verifies in batch), Mixture-of-Experts (activate only parts of the model), and distillation.

    7.14 — Cost optimization

    Driven by model size, hardware, batch size, quantization, throughput, and latency targets. Levers: smaller models, quantized models, CPU for small models, GPU for large ones, batching, caching, and distillation.

    7.15 — Monitoring & logging

    Track latency, tokens-per-second, GPU utilization, memory, error rate, safety violations, and user satisfaction. Without monitoring you cannot diagnose regressions or abuse.

    7.16 — Failure modes in deployment

    OOM, GPU overload, latency spikes, deadlocks, tokenizer mismatch, KV-cache corruption, network failures. Professional systems add auto-restart, auto-scaling, health checks, and redundancy.

    7.17 — Deployment readiness checklist

    Latency < 300ms, TPS > 50, stable memory, active safety filters, no hallucination spikes, benchmarks passed, red-teaming passed, monitoring active.

  9. 8

    Module 8Building Your Own LLM API

    Turning an LLM into a real product: gateway, auth, rate limiting, safety filters, load balancing, streaming, logging, monitoring, versioning, scaling, and security.

    Serving topology Client API gateway Load balancer Inference workers Rate limits · auth · batching · KV cache · autoscaling
    Module 8 — Minimum viable LLM serving topology.

    8.0 — Why every LLM needs an API

    Training and deploying isn't enough — usability requires an API. It's what lets websites, mobile apps, internal tools, chatbots, automations, and enterprise systems talk to your model.

    8.1 — What an LLM API is

    A web endpoint that accepts a prompt + parameters and returns generated text, tokens, metadata, safety flags, and usage stats. Example: POST /v1/generate with prompt, max_tokens, temperature → text, tokens_used, latency_ms, safety.

    8.2 — Architecture overview

    Client → API gateway → auth → rate limiter → safety filter → load balancer → inference servers → tokenizer → model → detokenizer → response formatter → client. Each layer has a single, well-defined purpose.

    8.3 — API gateway

    Front door of your LLM. Handles routing, authentication, logging, rate limiting, request validation, and error handling.

    8.4 — Authentication

    Ensures only authorized callers reach the model. Common methods: API keys, OAuth2, JWT, HMAC signatures. Rotate keys, revoke compromised ones, scope permissions, log usage per key.

    8.5 — Rate limiting

    Stops abuse, DDoS, runaway cost, and overload. Strategies: requests per minute, tokens per minute, burst caps, sliding windows. Example budget: 100 req/min and 10,000 tokens/min per key.

    8.6 — Safety filters

    Run before and after the model. Check for toxicity, hate, violence, self-harm, illegal content, sensitive data leaks, and jailbreak attempts. Actions: block, sanitize, replace, warn, log.

    8.7 — Load balancing

    Distributes traffic across inference servers via round-robin, least-load, latency-based, or weighted routing — for availability, low latency, and even GPU usage.

    8.8 — Inference server

    Loads the model, tokenizes input, runs the forward pass, applies sampling, streams tokens, returns output. Needs high-end GPU, fast storage, low-latency networking, and KV-cache support.

    8.9 — Tokenization layer

    Convert text → tokens using the SAME tokenizer, vocabulary, merges, and normalization as training. Tokenizer mismatch = broken model.

    8.10 — Sampling methods

    Control generation with temperature, top-k, top-p, repetition penalty, and max_tokens. Typical defaults: temperature 0.7, top_k 50, top_p 0.9.

    8.11 — Streaming responses

    Send tokens as they're generated. Lower perceived latency, better UX, faster feedback. Essential for chatbots, assistants, and real-time systems.

    8.12 — Logging

    Critical for debugging, monitoring, billing, safety audits, and abuse detection. Log timestamp, user ID, API key, prompt length, tokens generated, latency, safety flags, and errors.

    8.13 — Monitoring

    Track latency, TPS, GPU utilization, memory, error rate, safety violations, and user satisfaction to ensure reliability, performance, and safety.

    8.14 — Error handling

    Common errors: invalid API key, rate limit exceeded, unsafe content, model overload, timeout, tokenizer mismatch. Return structured errors with retry_after where applicable.

    8.15 — Versioning

    Prevents breaking changes. Each version (/v1/generate, /v2/generate) can have its own model, tokenizer, and safety rules. Clients pin a version, you upgrade safely.

    8.16 — Scaling strategies

    Horizontal, vertical, auto-scaling, sharding, and multi-region. Multi-region gives lower latency, higher availability, and disaster recovery.

    8.17 — Security

    Protect user data, API keys, model weights, logs, and infrastructure. Encrypt everything, rotate keys, enforce HTTPS, use firewalls, WAF, IAM roles, and audit logs.

  10. 9

    Module 9Building a Chatbot Interface

    Turning an LLM API into a real product people can use: UI/UX, sessions, memory, safety, RAG, streaming, formatting, error handling, multi-turn reasoning, analytics, deployment, and security.

    9.0 — Why chatbots matter

    Raw LLM APIs aren't user-friendly. Most users can't write API calls, format JSON, manage tokens, or interpret logs. A chatbot UI hides the complexity and exposes the intelligence.

    9.1 — What makes a good chatbot

    Fast, clear, helpful, safe, context-aware, reliable, scalable, and beautiful. A chatbot isn't just a UI — it's a full system spanning frontend, backend, memory, and safety.

    9.2 — Architecture overview

    User → frontend UI → message handler → session manager → input safety filter → RAG/memory retrieval → LLM API → output safety filter → response formatter → frontend UI. Each layer has a single purpose.

    9.3 — Frontend UI design

    Clean, minimal, responsive, accessible, fast. Core components: input box, send button, history panel, streaming output, loading indicator, error messages, optional settings, model selector, file upload. Use rounded bubbles, timestamps, avatars, subtle animations, markdown rendering, and code blocks.

    9.4 — Message handling

    Receive, validate, sanitize, send, display. Validate against empty, oversized, unsafe, or unsupported messages. Sanitize HTML, scripts, and malicious payloads before they touch the model.

    9.5 — Session management

    A session stores conversation history, user preferences, model settings, memory state, and RAG context. Types: stateless, stateful, or hybrid (summarized). Storage: in-memory, Redis, database, or browser local storage.

    9.6 — Memory systems

    Short-term (recent messages), long-term (stored facts), working (summaries), and RAG (external docs retrieved on demand). Challenges: token limits, drift, hallucinations, and privacy.

    9.7 — Safety filters

    Run on input (harmful intent, illegal requests, self-harm, violence, hate, jailbreaks) and on output (toxicity, bias, unsafe instructions, misinformation, sensitive data leaks). Protect users, company, and reputation.

    9.8 — RAG integration

    Embed the query → search the vector DB → retrieve top-k chunks → insert into prompt → call the LLM → generate an answer. Result: accurate, up-to-date, domain-specific responses with citations.

    9.9 — Streaming responses

    Send tokens to the UI as they're generated. Lower perceived latency, better UX, more natural conversation. Implement with WebSockets, Server-Sent Events, or HTTP chunked responses.

    9.10 — Response formatting

    Render markdown: headings, bullets, code blocks, tables, bold/italic. Essential for technical answers, tutorials, code explanations, and summaries.

    9.11 — Error handling

    Handle timeouts, rate limits, invalid input, safety blocks, server overload, and network failures. Never show raw stack traces or internal errors. Always offer a retry path.

    9.12 — Multi-turn reasoning

    Maintain context across long conversations using history, summarization, memory compression, and careful context-window management.

    9.13 — Analytics & logging

    Track satisfaction, conversation length, error rate, safety violations, popular topics, token usage, and model performance. Log user ID, timestamp, prompt, response, safety flags, latency, and tokens. Drives UX, safety, performance, and product decisions.

    9.14 — Chatbot deployment

    Requires API gateway, load balancer, inference servers, logging, monitoring, safety filters, database, and CDN. Deploy to cloud, on-prem, hybrid, or edge depending on latency and compliance needs.

    9.15 — Chatbot security

    Protect user data, API keys, model weights, logs, and infrastructure. Encrypt everything, sanitize input and output, enforce HTTPS, use WAF, IAM roles, and audit logs.

    9.16 — UX principles

    A great chatbot feels fast, friendly, helpful, predictable, and trustworthy. Always show typing indicators, stream responses, allow message editing/deletion, show timestamps, and surface clear error messages.

  11. 10

    Module 10Retrieval-Augmented Generation (RAG)

    Connect the LLM to external knowledge so it stops hallucinating: embeddings, vector DBs, similarity search, chunking, context assembly, RAG prompts, evaluation, failure modes, advanced techniques, security, deployment, and performance.

    Retrieval-Augmented Generation User query Embed query Vector search Top-k chunks LLM (with retrieved context)
    Module 10 — Standard RAG pipeline.

    10.0 — Why RAG exists

    LLMs can't see anything after their training cut-off and hallucinate confidently when they don't know. Unacceptable in law, medicine, finance, engineering, and enterprise. RAG fixes both problems.

    10.1 — What RAG is

    Retrieval-Augmented Generation = LLM + Search + Context Injection. The model retrieves relevant information from an external source (vector DB, doc store, knowledge graph, corporate DB) BEFORE generating. Now the industry standard for enterprise AI.

    10.2 — Why RAG works

    LLMs are excellent at reasoning but terrible at recall. Databases are excellent at recall but can't reason. RAG combines the strengths of both.

    10.3 — Architecture overview

    User query → embed query → vector search → retrieve top-k documents → context assembly → LLM prompt → generated answer → response. Each step is critical.

    10.4 — Embeddings

    Convert text into high-dimensional vectors (768–4096) that capture semantic similarity. Use the SAME embedding model for indexing and querying — mixing models breaks recall completely.

    10.5 — Vector databases

    Store and search embeddings: FAISS, Pinecone, Weaviate, Milvus, Chroma, Qdrant, pgvector. Need fast similarity search, scalable indexing, metadata filtering, hybrid search, and persistence.

    10.6 — Similarity search

    Finds documents closest to the query vector using cosine similarity, dot product, or Euclidean distance. Top-k results become the LLM's grounding context.

    10.7 — Chunking

    Split documents into focused pieces (typically 256/512/1024 tokens) with overlap. Better chunking = better retrieval accuracy, less noise, fewer hallucinations. Chunks must be semantic and clean.

    10.8 — Context assembly

    Build the final prompt: system role + retrieved chunks + user query + answer format. Forces the LLM to use retrieved knowledge instead of inventing it.

    10.9 — RAG prompt engineering

    Always instruct the model to answer ONLY from provided context and to say 'I don't know' when the answer isn't there. Dramatically reduces hallucinations.

    10.10 — Evaluation

    Measure retrieval accuracy, context relevance, answer accuracy, hallucination rate, latency, and coverage. Evaluate retrieval and generation separately so you can fix the right layer.

    10.11 — Failure modes

    Retrieval failure, bad chunking, embedding drift, context overflow, model ignoring context, latency spikes, missing data in the corpus.

    10.12 — Advanced RAG techniques

    Multi-vector retrieval, hybrid (keyword + vector) search, re-ranking with a second model, context compression/summarization, Graph-RAG over knowledge graphs, and agentic RAG where the LLM decides what to retrieve.

    10.13 — Enterprise RAG

    Powers internal knowledge bases, customer support, legal/medical search, financial analysis, engineering manuals, and compliance. Often the only safe way to use LLMs in regulated industries.

    10.14 — RAG security

    Encrypt embeddings and source documents, enforce access control, keep audit logs, filter PII, and apply data governance. RAG corpora usually contain sensitive corporate data.

    10.15 — Deployment

    Needs a vector DB cluster, embedding service, RAG orchestrator, LLM API, safety filters, monitoring, and logging. RAG is a full system, not a feature.

    10.16 — Performance optimization

    Pre-computed embeddings, approximate nearest neighbor (ANN) search, caching, batch retrieval, multi-threaded and GPU-accelerated search. Critical for real-time chatbots.

  12. 11

    Module 11AI Safety, Alignment & Governance

    Bias, hallucination, prompt injection, jailbreaks, red-teaming, guardrails, and the safety lifecycle.

    AI safety lifecycle Threat model Pre-train filters RLHF / DPO Red-team Guardrails Monitoring
    Module 11 — Safety as a lifecycle, not a checkbox.

    Core risks

    Hallucination, bias, toxicity, privacy leakage, prompt injection, jailbreaks, and misuse for fraud or harm. Each has both technical and policy mitigations.

    Mitigation stack

    Filter pre-training data, align with RLHF/DPO, add input/output guardrails, ground with retrieval, set refusal policies, monitor in production, and red-team continuously. No single layer is enough.

    Governance

    Publish a model card with intended use, limitations, evaluations and known risks. Keep an abuse policy, an incident playbook and a rollback plan. Compliance is part of the engineering, not an afterthought.

  13. 12

    Module 12Capstone Project: Building a Complete LLM System

    The final demonstration of mastery. Students build a full-stack AI product: dataset, fine-tuning, RAG, inference API, chatbot UI, safety, monitoring, model card, technical report, and demo video.

    12.0 — Why the capstone matters

    Not a theoretical exercise. A real engineering project that simulates the work done in enterprise AI teams, startups, research labs, government AI programs, and consulting firms. Transforms learners into practitioners.

    12.1 — Capstone overview

    Students build a complete LLM system end-to-end: data pipeline, fine-tuning pipeline, RAG system, inference API, chatbot interface, safety filters, monitoring & analytics, documentation, model card, and demo video.

    12.2 — Deliverables

    A fine-tuned model (LoRA/QLoRA), a vector database, a RAG pipeline, a production-ready API, a chatbot interface, a safety system, a monitoring dashboard, a model card, a technical report, and a 5–10 minute demo video.

    12.3 — Capstone architecture

    Runtime: User → Chatbot UI → API Gateway → Input Safety Filter → RAG Retrieval → LLM Inference → Output Safety Filter → Response Formatter → UI. Behind the scenes: Data Pipeline → Fine-Tuning → Model → API → Chatbot → Monitoring.

    12.4 — Step 1: Build the dataset

    Pick a domain (legal, medical, finance, engineering, support, tech docs). Build a minimum of 500 cleaned, deduplicated, tokenized, documented examples covering instructions, responses, explanations, and safety cases.

    12.5 — Step 2: Fine-tune the model

    Fine-tune a 7B–13B base with LoRA/QLoRA/PEFT. Deliver training logs, loss curves, checkpoints, and evaluation results. Demonstrate proper hyperparameters, LR schedule, validation, and checkpointing.

    12.6 — Step 3: Build the vector database

    Chunk, embed, and store at least 100 documents with metadata. Document chunk size (256–1024 tokens), embedding model, and vector DB choice. Build the retrieval pipeline on top.

    12.7 — Step 4: Build the RAG pipeline

    Implement query embedding, vector search, top-k retrieval, context assembly, and a documented RAG prompt template. Deliver retrieval logs and evaluation results.

    12.8 — Step 5: Build the inference API

    Auth, rate limiting, logging, error handling, streaming, and safety filters. Required endpoints: /generate, /rag-generate, /health, /metrics.

    12.9 — Step 6: Build the chatbot interface

    Streaming responses, message history, memory, safety warnings, RAG citations, optional model selector. Clean UI with markdown, code blocks, and clear error messages.

    12.10 — Step 7: Implement safety filters

    Input filters (harmful intent, illegal requests, self-harm, violence, hate) and output filters (toxicity, bias, unsafe instructions, sensitive data). Deliver safety logs, test results, and a red-team report.

    12.11 — Step 8: Monitoring & analytics

    Dashboard with latency, TPS, token usage, safety violations, error rate, RAG retrieval accuracy, and (optional) user satisfaction. Real-time + historical metrics with alerts.

    12.12 — Step 9: Model card

    Document purpose, training data, fine-tuning data, intended use cases, prohibited use cases, risks, limitations, safety measures, and evaluation results. Required for compliance.

    12.13 — Step 10: Technical report

    10–20 page report covering architecture, data pipeline, fine-tuning, RAG, API, chatbot, safety, monitoring, evaluation results, and lessons learned. Becomes the student's professional portfolio piece.

    12.14 — Step 11: Demo video

    5–10 minute video demonstrating chatbot usage, RAG retrieval, safety filters, API calls, the monitoring dashboard, and the system architecture. The final deliverable.

    12.15 — Evaluation rubric

    Technical Quality 40% (model, RAG, API, UX), Safety & Governance 20% (filters, red-teaming, model card, compliance), Documentation 20% (report, diagrams, dataset docs), Presentation 20% (demo, clarity, professionalism).

  14. 13

    Module 13Appendix A — Instructor Guide & Teaching Manual

    For instructors, TAs, partners, and licensees running cohort-based teaching, corporate training, or institutional workshops.

    Purpose

    Enables cohort-based teaching, corporate training, instructor-led workshops, and licensing to institutions.

    Learning objectives by module

    Every module ships with learning outcomes, key concepts, common misconceptions, teaching notes, discussion prompts, lab guidance, and assessment criteria.

    Lab instructions

    Each module includes lab setup, expected outputs, troubleshooting, and extension tasks for advanced students.

    Assessment rubrics

    Rubrics for labs, capstone, exam, and optional participation grading.

    Suggested pacing

    12-week program, 2 modules per week, capstone in weeks 10–12.

    Instructor scripts

    Guidance on how to introduce each module, explain complex topics, lead discussions, and handle student questions.

    Troubleshooting guide

    Covers Python issues, GPU issues, tokenization errors, RAG failures, API errors, and safety filter issues.

  15. 14

    Module 14Appendix B — Student Workbook & Lab Book

    Hands-on exercises for every module: setup, steps, expected outputs, reflection questions, and challenge tasks.

    Module-by-module labs

    Each module includes 3–6 lab exercises with step-by-step instructions, expected outputs, reflection questions, and 'challenge mode' extension tasks.

    Example: Lab 5.2 — Fine-Tune a Model with QLoRA

    Steps: load a 7B model → apply QLoRA adapters → train on 200 examples → evaluate perplexity → save the adapter → merge the adapter → test the model. Reflection: what improved, what degraded, what surprised you.

    Workbook sections

    Notes pages, diagrams, checklists, code snippets, and troubleshooting notes for every module.

  16. 15

    Module 15Appendix C — Portfolio Template

    A complete GitHub repo + portfolio website template so students ship like professional AI engineers.

    GitHub repo structure

    /project with /api, /rag, /chatbot, /model, /data, /safety, /monitoring directories, plus README.md, demo.mp4, and model-card.md at the root.

    Portfolio website template

    Sections: About Me, Skills, Projects, Capstone, Certifications, Contact.

    README template

    Project overview, architecture diagram, features, installation, usage, API endpoints, safety notes, and license.

  17. 16

    Module 16Appendix D — Mastery Checklist

    If you can do everything on this list, you are an LLM Engineer.

    Foundations

    Understand tokens, embeddings, attention, and transformers end-to-end.

    Data engineering

    Clean, filter, deduplicate, and tokenize datasets confidently.

    Training

    Train a model, use distributed training, choose optimizers, and run learning-rate schedules.

    Fine-tuning

    LoRA, QLoRA, PEFT, and instruction tuning — pick and execute the right approach per problem.

    RAG

    Build a vector DB, chunk and embed documents, retrieve top-k, and assemble context.

    Deployment

    Build an API with authentication, rate limiting, logging, and monitoring.

    Chatbot

    Build the UI with streaming, memory, and safety filters.

    Safety

    Input filtering, output filtering, red-teaming, model cards, governance.

    Capstone

    Build the full system, document it, present it, deploy it.

  18. 17

    Module 17Prompt & Context Engineering

    How to elicit the best behavior from an LLM through prompt design, structured outputs, and context packing — the highest-leverage skill in applied AI.

    Why prompt engineering still matters

    Even with frontier models, prompt quality dominates downstream quality. Small wording, ordering and formatting changes routinely move evaluation scores by 10–30%. Prompt engineering is the cheapest, fastest lever you have before training, fine-tuning, or RAG.

    Core techniques

    Zero-shot, few-shot, chain-of-thought (CoT), self-consistency, ReAct, tree-of-thoughts, reflexion, and prompt chaining. Each has a use case: CoT for reasoning, few-shot for format conformance, ReAct for tool use, chaining for multi-stage workflows.

    Structured outputs

    JSON mode, function/tool schemas, regex-constrained decoding, and grammar-based generation (GBNF, Outlines, Instructor). Structured outputs are mandatory for production — never parse free-form text when a schema will do.

    Context engineering

    Decide what goes into the context window and in what order: system prompt → tools → retrieved docs → conversation history → user message. Recency and primacy bias the model; place the most important instructions at the very top and the most important data at the very bottom.

    Prompt injection & defenses

    Untrusted text in the context can override your instructions. Defenses: privilege separation, instruction hierarchy, delimiter tagging, output filtering, and never executing tool calls produced from untrusted data without a confirmation step.

    Evaluation

    Track prompt versions like code. Use a fixed eval set, run A/B comparisons, and log win-rate per prompt. Tools: PromptLayer, Langfuse, Braintrust, OpenAI Evals.

  19. 18

    Module 18Agents, Tools & MCP

    Going beyond single-turn chat: function calling, ReAct loops, multi-agent systems, the Model Context Protocol, and how to ship autonomous workflows safely.

    What an agent actually is

    An agent is an LLM in a loop that observes, plans, acts via tools, and reflects. The loop runs until a stopping criterion is met. The model is the brain; tools are the hands; the loop is the nervous system.

    Function calling & tool use

    Define tools as JSON schemas. The model emits a structured tool call; your runtime executes it and returns the result. Best practices: small focused tools, idempotent operations, explicit error returns, and timeouts on every call.

    Agent architectures

    ReAct (reason-act-observe), planner-executor, multi-agent debate, hierarchical agents (manager + workers), and swarm patterns. Pick the simplest architecture that solves the task — most production agents are single-loop ReAct.

    Model Context Protocol (MCP)

    An open standard (Anthropic, 2024) for connecting LLMs to tools, data sources, and prompts. MCP servers expose resources, tools, and prompts; MCP clients (Claude Desktop, Cursor, custom apps) consume them. Learn to write both a server and a client.

    Memory & state

    Short-term (conversation buffer), long-term (vector store of past interactions), and episodic (structured logs of agent runs). Frameworks: LangGraph, CrewAI, AutoGen, Mastra.

    Safety & control

    Sandboxed execution, human-in-the-loop approvals for destructive actions, cost ceilings per run, max-iteration limits, and full observability. Never let an agent execute arbitrary code on production infrastructure without sandboxing.

  20. 19

    Module 19RLHF, DPO & Modern Alignment

    How frontier models are aligned with human preferences: reward modeling, PPO, DPO, GRPO, Constitutional AI, and RLAIF.

    Why SFT is not enough

    Supervised fine-tuning teaches the model to imitate demonstrations but does not teach it what to prefer when multiple acceptable answers exist. Alignment teaches preference — being helpful, harmless, and honest at the margin.

    RLHF pipeline

    Three stages: (1) SFT on demonstrations, (2) train a reward model on human preference pairs, (3) optimize the policy against the reward model using PPO. This is the classic InstructGPT/ChatGPT recipe.

    Direct Preference Optimization (DPO)

    DPO skips the reward model entirely and optimizes the policy directly on preference pairs using a clever reformulation of the RLHF objective. Simpler, more stable, cheaper. Now the default for most open-source alignment work.

    GRPO & online RL

    Group Relative Policy Optimization (DeepSeek) eliminates the value model by computing advantages relative to a group of sampled responses. Used in DeepSeek-R1 to train reasoning behavior at scale.

    Constitutional AI & RLAIF

    Replace human labels with AI-generated critiques against a written constitution. Anthropic's Claude was trained this way. Cheaper, more scalable, but only as good as the constitution and critic model.

    Evaluating alignment

    MT-Bench, AlpacaEval, Chatbot Arena Elo, and red-teaming. Track helpfulness vs. harmlessness as a Pareto frontier — over-aligning makes models refuse benign requests.

  21. 20

    Module 20Multimodal Models

    Vision, audio, and video: how models extend beyond text and how to build multimodal applications.

    Vision-language models

    CLIP (contrastive image-text embeddings), BLIP, LLaVA, GPT-4V, Claude 3.5 Vision, Gemini. Architecture: a vision encoder (usually a ViT) projects images into the language model's embedding space.

    Audio & speech

    Whisper for ASR (speech-to-text), TTS models (ElevenLabs, OpenAI TTS), and speech-to-speech (GPT-4o realtime, Moshi). Latency budgets are tight — sub-300ms for natural conversation.

    Image & video generation

    Diffusion (Stable Diffusion, FLUX, Imagen), autoregressive image generation (GPT-4o, Gemini Nano Banana), and video (Sora, Veo, Runway). Conditioning: text, image, depth, pose, sketch.

    Multimodal RAG

    Embed images and text in a shared space (CLIP, Nomic, Voyage Multimodal). Retrieve mixed-modality results. Critical for product catalogs, technical documentation with diagrams, and medical imaging.

    Building multimodal apps

    Patterns: OCR-then-LLM (cheap, lossy), native VLM (expensive, accurate), and hybrid (VLM for layout, LLM for reasoning). Choose based on document complexity and cost budget.

  22. 21

    Module 21Reasoning Models & Test-Time Compute

    The 2025 frontier: o1, o3, DeepSeek-R1, and the shift from scaling training to scaling inference.

    What a reasoning model is

    A model trained to spend additional compute at inference time thinking before answering. The model generates a long internal chain-of-thought (often hidden) and then produces the final answer. Examples: OpenAI o1/o3, DeepSeek-R1, Gemini 2.5 Thinking, Claude Extended Thinking.

    Test-time compute scaling

    Performance on hard tasks (math, code, science) scales smoothly with thinking tokens. You can trade dollars for IQ at inference time. This is a new scaling axis orthogonal to model size and training data.

    How reasoning models are trained

    Large-scale RL on verifiable rewards (math problems with known answers, code that passes tests). DeepSeek-R1 showed pure RL from a base model can elicit reasoning behavior without any SFT demonstrations.

    Verifiers & process reward models

    Train a separate model to score reasoning steps (PRM) or final answers (ORM). Use the verifier to rerank N sampled solutions (best-of-N) or guide search (MCTS).

    When to use reasoning models

    Math, code, scientific analysis, complex planning, agentic workflows. NOT for: simple chat, summarization, classification — they cost 5–50× more per token and add latency. Route requests dynamically.

  23. 22

    Module 22Cost Economics & FinOps for LLMs

    Token economics, caching, model routing, and how to keep an LLM product profitable at scale.

    Unit economics

    Model every feature as $/request: input tokens × input price + output tokens × output price + retrieval cost + infrastructure overhead. Compare against revenue per user. Most failed AI startups lost money per request and didn't know it.

    Prompt caching

    OpenAI, Anthropic, and Gemini all support prefix caching: identical prompt prefixes are billed at 10–25% of normal cost. Restructure prompts to put stable content (system, tools, documents) first and variable content (user message) last.

    Model routing

    Use a cheap classifier (or a small LLM) to route requests to the smallest model that can handle them. Patterns: cascade (try cheap, escalate on low confidence), router (classify upfront), and ensemble. Can cut costs 60–90%.

    Batch APIs & async

    OpenAI and Anthropic Batch APIs run at 50% cost with 24h SLA. Use for evaluations, backfills, classification jobs, and any non-interactive workload.

    Prompt & context compression

    LLMLingua, summary-buffer memory, and selective retrieval can cut input tokens 5–10× with minimal quality loss. Critical for long-context applications.

    Self-hosting vs. API

    Self-hosting (vLLM, SGLang on rented GPUs) becomes cheaper than API calls around 100M+ tokens/day for open models. Below that, APIs win on every dimension: cost, latency, reliability, model quality.

  24. 23

    Module 23Career & Interview Preparation

    How to land an AI engineering role: portfolio, system design interviews, common pitfalls, and the four major career tracks.

    The four tracks

    (1) Research scientist — builds new architectures, requires PhD or equivalent publications. (2) Applied AI engineer — ships LLM products, the largest and fastest-growing role. (3) ML platform / MLOps — builds the training and inference infrastructure. (4) AI safety / alignment — red-teams, evals, governance. Pick one and specialize.

    Portfolio that gets interviews

    Three projects beats ten. Each must have: a live demo URL, a clean GitHub repo with README, a one-paragraph problem statement, an architecture diagram, and a section on tradeoffs. Capstone counts as one — build two more in your target domain.

    LLM system design interview

    Common prompts: 'design ChatGPT', 'design a code assistant', 'design a RAG system for legal docs'. Framework: clarify requirements → estimate scale → draw architecture → discuss data, training, serving, evaluation, safety, cost. Always quantify (tokens/sec, $/request, latency p95).

    Common interview topics

    Attention mechanism math, why decoder-only won, LoRA vs full fine-tune tradeoffs, RAG failure modes, hallucination mitigation, prompt injection, evaluation strategies, cost optimization. Be able to whiteboard each in 5 minutes.

    Resume signals that work

    Specific models you've shipped (not 'used AI'), measurable outcomes (latency cut 60%, cost reduced 40%, accuracy improved 12 points), open-source contributions (HuggingFace, vLLM, llama.cpp), and published evaluations or blog posts.

    Negotiation & comp

    AI engineer comp in 2026: $180k–$400k base in US tier-1 cities, $500k–$1M+ total at frontier labs. Always negotiate. Use levels.fyi, Blind, and competing offers. Equity vests over 4 years — model the expected value.

Ready for the capstone?

100 questions · 80% to pass. Your diploma is auto-issued to your account name on pass.

Take the LLM capstone exam