GACS Logo
LLM EngineeringModule 1 / 12
LLME · Module 1Free preview

Foundations of LLMs

History, architecture, transformer basics, scaling laws, and the end-to-end LLM training pipeline.

You're reading Module 1 free. Unlock the full 12-module diploma, the final exam, and a verifiable certificate for $499.

Log in to enroll
AI tutor · included with this diploma
Discuss this module with the AI tutor — chat or talk
Ask questions, get re-explanations, run scenario roleplays, or have it quiz you out loud.

What Is a Large Language Model?

A large language model (LLM) is a neural network — almost always a transformer — trained on hundreds of billions to trillions of text tokens to predict the next token in a sequence. That single objective, applied at scale, produces systems that can summarize, translate, write code, follow instructions, reason in steps, and hold extended dialogue.

What makes them 'large' is not depth alone but the combination of three things: parameter count (billions to trillions), training-data volume (often multiple TB of text), and compute spent during pre-training (measured in GPU-years or PFLOP-days). When all three grow together according to the scaling laws, capability grows predictably — and at certain thresholds, new behaviors emerge that smaller models simply cannot perform.

It is critical to set expectations early: LLMs are statistical pattern engines, not knowledge bases. They hallucinate plausible-sounding falsehoods, their knowledge has a training cutoff, they cost real money per token, and they are sensitive to prompt phrasing. Treating them as oracles is the single most common engineering mistake.

A Short History: From n-grams to GPT

Language modeling began with n-gram statistics in the 1980s and 1990s — counting how often words follow one another. The 2000s brought neural language models (Bengio et al., 2003) that learned dense word embeddings. The 2010s saw RNNs and LSTMs dominate sequence modeling, but they were inherently sequential and struggled with long contexts.

The 2017 'Attention Is All You Need' paper introduced the transformer, replacing recurrence with self-attention. GPT-1 (2018), BERT (2018), GPT-2 (2019), and GPT-3 (2020) demonstrated that scaling transformers produces qualitative leaps. Instruction tuning and RLHF (InstructGPT, 2022; ChatGPT, 2022) turned raw next-token predictors into usable assistants. Since then, open-weight families (LLaMA, Mistral, Qwen, DeepSeek) have made frontier-class models broadly accessible.

Anatomy of an LLM

Every modern LLM, regardless of vendor, decomposes into the same backbone. A tokenizer maps text into integer token IDs. An embedding layer projects those IDs into dense vectors of dimension d_model. A stack of N transformer blocks (typically 12 to 80+) refines those vectors using self-attention and feed-forward layers. A final output head — usually tied to the input embedding matrix — produces logits over the vocabulary, which are converted to a probability distribution via softmax.

Positional information is injected because attention itself is permutation-invariant: sinusoidal encodings, learned embeddings, or modern relative schemes like RoPE and ALiBi all aim to tell the model 'where' each token lives. The context window — the maximum number of tokens the model can attend to at once — is a hard architectural limit set at training time and expensive to extend.

The Transformer Revolution

The transformer's key insight is that any token can directly attend to any other token in one operation, regardless of distance. This solves two RNN problems at once: long-range dependencies are no longer a multi-step path through hidden states, and training can be parallelized across the entire sequence because there is no recurrence.

In practical terms, transformers map cleanly onto GPU/TPU hardware: nearly all the work is large matrix multiplications, which modern accelerators do extremely fast. This is why the same architecture scales from 100M to 1T+ parameters with only modest changes — the bottleneck is no longer the algorithm, it is data and compute.

Scaling Laws

Kaplan et al. (2020) and Hoffmann et al. ('Chinchilla', 2022) showed that LLM loss decreases as a smooth power law in three quantities: model size N, dataset size D, and compute C. Chinchilla refined this to show that for a given compute budget, the loss-optimal allocation puts roughly equal weight on growing model and data — earlier large models like GPT-3 were under-trained relative to their size.

Scaling laws explain why bigger does not always mean better in isolation: a 70B model trained on 200B tokens will usually underperform a 13B model trained on 2T tokens. They also justify the heavy investment in data curation: data quality has roughly the same leverage as compute.

Emergent Capabilities and Their Limits

Past certain scale thresholds, models start solving tasks they could not solve at smaller scale: multi-step arithmetic, instruction following, code synthesis, basic chain-of-thought reasoning. These are called emergent capabilities. They are not magic — they appear because the model has finally learned the underlying skills well enough that performance crosses a useful threshold.

The flip side: emergence does not mean reliability. A model that can solve 60% of grade-school math problems still fails the other 40%, often confidently. Production systems must assume the model is wrong some non-zero fraction of the time and design accordingly — with verification, retries, tool use, and clear UX around uncertainty.

The LLM Engineering Pipeline

Building an LLM end-to-end has well-defined stages. Data: collect, deduplicate, filter, and tokenize. Pre-training: train next-token prediction at scale across many GPUs. Mid/Post-training: instruction tuning on (prompt, response) pairs, then RLHF or DPO on preference data. Evaluation: benchmarks (MMLU, HumanEval, GSM8K), capability probes, and red-teaming. Optimization: quantization, distillation, speculative decoding. Serving: batching, KV caching, request scheduling. Application layer: RAG, tool use, agents, guardrails.

Every later module in this diploma zooms into one of these stages. Knowing the full pipeline is what separates an LLM engineer from someone who only calls an API.

How To Read the Rest of This Diploma

Modules 2–5 cover the architectural foundations: tokenization, embeddings, attention, and the transformer block. Module 6 covers training at scale. Modules 7–8 cover post-training (fine-tuning and alignment). Modules 9–10 cover the application stack — RAG and evaluation. Modules 11–12 cover production: cost, scaling, serving, and deployment.

Work through them in order. Each one assumes you have internalized the previous one. Skipping ahead because 'I already know transformers' is the single most common reason students fail the gated quizzes.

Hello, LLM — generate text with a small open model
pip install torch transformers accelerate
python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain scaling laws in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=160, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
Count parameters and estimate FLOPs
python
def count_params(m):
    return sum(p.numel() for p in m.parameters())

n = count_params(model)
print(f"params: {n/1e6:.1f}M")
# Chinchilla rule of thumb: optimal tokens ≈ 20 × params
print(f"chinchilla-optimal tokens ≈ {20*n/1e9:.1f}B")

Workbook · hands-on exercises

Build the artifact, then self-check against the expected output.

  1. 1
    Param-count audit

    Task — Load three open models of different sizes (e.g. Qwen2.5-0.5B, Llama-3.2-1B, Mistral-7B). Print parameters in millions and the Chinchilla-optimal token budget for each.

    Expected output — A 3-row table: model, params (M), optimal tokens ≈ 20 × params. The 7B row should report ~140B optimal tokens.

    Hint: sum(p.numel() for p in model.parameters())

  2. 2
    Hallucination probe

    Task — Ask a small model (≤1.5B) five questions whose answer is a specific year (e.g. 'Year RoPE was introduced?'). Record correct vs hallucinated answers.

    Expected output — A markdown table with columns question / model answer / ground truth / correct?, and an accuracy score at the bottom.

  3. 3
    Scaling-law plot

    Task — Train a tiny transformer (≤1M params) for 1000 steps on shakespeare.txt at three model sizes. Plot final loss vs params on a log-log axis.

    Expected output — scaling.png showing a downward-sloping line — bigger model, lower loss.

Glossary

Token
An atomic unit of text the model consumes — typically a subword fragment, not a whole word.
Parameter
A learned weight in the network. Modern LLMs range from ~1B to >1T parameters.
Pre-training
The initial, expensive training phase on a large unlabeled text corpus using next-token prediction.
Context window
The maximum number of tokens (input + output) the model can attend to in a single forward pass.
Logits
The raw, unnormalized scores the model outputs for each vocabulary token before softmax.
Softmax
Function that turns logits into a probability distribution summing to 1.
Scaling law
Empirical relationship showing loss decreases as a power law in model size, data size, and compute.
Chinchilla-optimal
A data:parameter ratio (~20 tokens per parameter) that minimizes loss for a fixed compute budget.
Emergent capability
A skill that appears only after a model exceeds a certain scale threshold.
Hallucination
When a model generates plausible-sounding but factually wrong output.

Further reading

  • Vaswani et al., 'Attention Is All You Need' (2017)The original transformer paper. Read sections 3 and 4 carefully — they are the foundation of everything else.
  • Kaplan et al., 'Scaling Laws for Neural Language Models' (2020)Introduced empirical scaling laws for LLMs.
  • Hoffmann et al., 'Training Compute-Optimal Large Language Models' (Chinchilla, 2022)Refined scaling laws and showed prior models were under-trained on data.
  • Andrej Karpathy, 'Let's build GPT: from scratch'Free 2-hour YouTube walkthrough that builds a working transformer in PyTorch. Watch before Module 5.
  • Jay Alammar, 'The Illustrated Transformer'Visual primer; useful if any diagram in this module feels unclear.
  • Anthropic, 'A Mathematical Framework for Transformer Circuits'Advanced. Skim now, return after Module 5.

Module 1 Quiz

20 questions · 80% required to unlock the next module.

  1. 1. What is the canonical training objective of a pre-trained LLM?
  2. 2. Which architecture is the dominant backbone of modern LLMs?
  3. 3. Why are positional encodings required in a transformer?
  4. 4. What did the Chinchilla paper conclude?
  5. 5. Which component converts raw text into integer IDs?
  6. 6. What is the context window?
  7. 7. Hallucination in an LLM refers to:
  8. 8. Which transformation produces the final probability distribution over the vocabulary?
  9. 9. Why are transformers well-suited to GPUs and TPUs?
  10. 10. Emergent capabilities are best described as:
  11. 11. Which paper introduced the transformer architecture?
  12. 12. The output head of an LLM typically produces:
  13. 13. Which is NOT a stage of the LLM engineering pipeline?
  14. 14. An LLM's knowledge cutoff means:
  15. 15. Scaling laws say loss decreases as a function of which three quantities?
  16. 16. Why is RNN training inherently slower than transformer training on long sequences?
  17. 17. Modern LLMs typically tie which two weight matrices?
  18. 18. What does 'instruction tuning' refer to?
  19. 19. Which of these is a hard architectural limit on the model?
  20. 20. Which best describes the correct mental model for LLM output?