What Is a Large Language Model?
A large language model (LLM) is a neural network — almost always a transformer — trained on hundreds of billions to trillions of text tokens to predict the next token in a sequence. That single objective, applied at scale, produces systems that can summarize, translate, write code, follow instructions, reason in steps, and hold extended dialogue.
What makes them 'large' is not depth alone but the combination of three things: parameter count (billions to trillions), training-data volume (often multiple TB of text), and compute spent during pre-training (measured in GPU-years or PFLOP-days). When all three grow together in the proportions that scaling laws prescribe, pre-training loss falls predictably, and at certain thresholds new behaviors emerge that smaller models cannot perform reliably.
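To make "PFLOP-days" concrete, here is a minimal sketch of a compute estimate. It assumes the common rule of thumb C ≈ 6 · N · D FLOPs for training a dense transformer with N parameters on D tokens. Neither the rule of thumb nor the GPT-3-scale figures (175B parameters, roughly 300B training tokens) come from the text above; they are assumptions used for illustration.

```python
# Rough training-compute estimate using the common approximation C ~ 6 * N * D.
# Assumption: about 6 FLOPs per parameter per token (forward + backward pass)
# for a dense transformer; real training runs deviate from this.

SECONDS_PER_DAY = 86_400
PFLOP = 1e15  # floating-point operations in one petaFLOP


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens


def pflop_days(flops: float) -> float:
    """Convert FLOPs to PFLOP-days (one PFLOP/s sustained for one day)."""
    return flops / (PFLOP * SECONDS_PER_DAY)


# Illustrative figures for a GPT-3-scale run (assumed, not from this module).
gpt3_scale_flops = training_flops(n_params=175e9, n_tokens=300e9)
print(f"{gpt3_scale_flops:.2e} FLOPs ~ {pflop_days(gpt3_scale_flops):,.0f} PFLOP-days")
```

The estimate ignores hardware utilization, so wall-clock GPU-years on real clusters will be noticeably higher than the raw FLOP count suggests.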
It is critical to set expectations early: LLMs are statistical pattern engines, not knowledge bases. They hallucinate plausible-sounding falsehoods, their knowledge stops at a training cutoff, they cost real money per token, and their output is sensitive to prompt phrasing. Treating them as oracles is one of the most common engineering mistakes.
A Short History: From n-grams to GPT
Language modeling began with n-gram statistics in the 1980s and 1990s — counting how often words follow one another. The 2000s brought neural language models (Bengio et al., 2003) that learned dense word embeddings. The 2010s saw RNNs and LSTMs dominate sequence modeling, but they were inherently sequential and struggled with long contexts.
The 2017 'Attention Is All You Need' paper introduced the transformer, replacing recurrence with self-attention. GPT-1 (2018), BERT (2018), GPT-2 (2019), and GPT-3 (2020) demonstrated that scaling transformers produces qualitative leaps. Instruction tuning and RLHF (InstructGPT, 2022; ChatGPT, 2022) turned raw next-token predictors into usable assistants. Since then, open-weight families (LLaMA, Mistral, Qwen, DeepSeek) have made frontier-class models broadly accessible.
Anatomy of an LLM
Every modern LLM, regardless of vendor, decomposes into the same backbone. A tokenizer maps text into integer token IDs. An embedding layer projects those IDs into dense vectors of dimension d_model. A stack of N transformer blocks (typically 12 to 80+) refines those vectors using self-attention and feed-forward layers. A final output head (tied to the input embedding matrix in many models, a separate matrix in others) produces logits over the vocabulary, which are converted to a probability distribution via softmax.
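The data flow described above can be sketched end to end in a few lines. Everything in this example is hypothetical and deliberately tiny: a whitespace "tokenizer", a six-word vocabulary, random weights, an identity function standing in for the transformer blocks, and a tied output head. It shows the shapes and the order of operations, not a working model.

```python
import math
import random

# Toy data flow: text -> token IDs -> embeddings -> (blocks) -> logits -> probabilities.
# All names and sizes here are made up for illustration.

VOCAB = ["<unk>", "the", "cat", "sat", "on", "mat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
D_MODEL = 4

rng = random.Random(0)
# Embedding matrix: one row of size D_MODEL per vocabulary entry.
EMBEDDING = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in VOCAB]


def tokenize(text: str) -> list[int]:
    """Map whitespace-separated words to integer IDs (unknown words -> 0)."""
    return [TOKEN_TO_ID.get(word, 0) for word in text.lower().split()]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up one d_model-sized vector per token ID."""
    return [EMBEDDING[i] for i in token_ids]


def transformer_blocks(hidden: list[list[float]]) -> list[list[float]]:
    """Placeholder for the block stack; a real model applies attention + FFN here."""
    return hidden


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(text: str) -> list[float]:
    """Distribution over the vocabulary for the token following `text`."""
    hidden = transformer_blocks(embed(tokenize(text)))
    last = hidden[-1]  # only the final position predicts the next token
    # Tied output head: reuse the embedding matrix to score every vocab entry.
    logits = [sum(h * e for h, e in zip(last, row)) for row in EMBEDDING]
    return softmax(logits)


probs = next_token_probs("the cat sat")
print({tok: round(p, 3) for tok, p in zip(VOCAB, probs)})
```

Because the weights are random, the printed probabilities are meaningless; training is what turns this pipeline into a useful predictor.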
Positional information must be injected because attention on its own ignores token order (strictly, it is permutation-equivariant: shuffling the input tokens only shuffles the outputs). Sinusoidal encodings, learned embeddings, and modern relative schemes like RoPE and ALiBi all aim to tell the model 'where' each token lives. The context window, the maximum number of tokens the model can attend to at once, is a limit set at training time and expensive to extend.
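As one concrete case, the sinusoidal scheme from 'Attention Is All You Need' can be written in a few lines. This is a minimal sketch of that one variant only; RoPE and ALiBi work differently, and the function name is made up for this example.

```python
import math


def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Sinusoidal position vector in the style of 'Attention Is All You Need'.

    Even indices use sine and odd indices use cosine, with wavelengths that
    grow geometrically across the dimensions.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for this sketch")
    encoding = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so i / d_model equals 2k / d_model.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


# Each position gets a distinct vector, which is added to the token embedding.
for pos in range(3):
    print(pos, [round(x, 3) for x in sinusoidal_encoding(pos, d_model=4)])
```

The vector for a position is added element-wise to that token's embedding before the first transformer block, so two identical tokens at different positions enter the stack with different representations.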
The Transformer Revolution
The transformer's key insight is that any token can directly attend to any other token in one operation, regardless of distance. This solves two RNN problems at once: long-range dependencies are no longer a multi-step path through hidden states, and training can be parallelized across the entire sequence because there is no recurrence.
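The idea that every token scores every visible token directly can be shown with a minimal single-head, scaled dot-product attention sketch. It makes two simplifying assumptions: the learned query, key, and value projections are omitted, and a causal mask is applied (each position sees only itself and earlier positions), as in decoder-only LLMs.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of equal-length vectors, one per token position.
    Position t attends to positions 0..t only.
    """
    d_k = len(k[0])
    outputs, weights = [], []
    for t, query in enumerate(q):
        # One dot product per visible key: distance between tokens is irrelevant.
        scores = [
            sum(a * b for a, b in zip(query, k[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Output is the attention-weighted average of the visible value vectors.
        outputs.append([
            sum(w * v[s][j] for s, w in enumerate(attn))
            for j in range(len(v[0]))
        ])
    return outputs, weights


# Three token vectors; learned Q/K/V projections are omitted for brevity.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(x, x, x)
print(attn[2])  # weights the third token assigns to tokens 0, 1, 2
```

The Python loop is only for readability. Real implementations compute all positions at once as a few matrix multiplications, which is where the training-time parallelism comes from.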
In practical terms, transformers map cleanly onto GPU/TPU hardware: nearly all the work is large matrix multiplications, which modern accelerators do extremely fast. This is why the same architecture scales from 100M to 1T+ parameters with only modest changes: the bottleneck is no longer the algorithm but data and compute.
Scaling Laws
Kaplan et al. (2020) and Hoffmann et al. ('Chinchilla', 2022) showed that LLM loss decreases as a smooth power law in three quantities: model size N, dataset size D, and compute C. Chinchilla refined this to show that, for a given compute budget, the loss-optimal allocation grows model size and training tokens in roughly equal proportion (about 20 tokens per parameter in their experiments). By that standard, earlier large models like GPT-3 were under-trained relative to their size.
Scaling laws explain why bigger does not always mean better in isolation: a 70B model trained on 200B tokens will usually underperform a 13B model trained on 2T tokens. They also help justify the heavy investment in data: the number of training tokens matters as much as parameter count, which makes collecting and curating high-quality data a first-order concern.
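The 70B-versus-13B comparison can be checked against the parametric loss form used by Hoffmann et al., L(N, D) = E + A / N^alpha + B / D^beta. The constants below are the fitted values reported in that paper. They are specific to its data and training setup, so treat the numbers as an illustration of the trend rather than a prediction for any particular model.

```python
# Parametric loss from Hoffmann et al. (2022): L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are the fitted values reported in that paper (an assumption here:
# they will not transfer exactly to other datasets or architectures).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


big_undertrained = predicted_loss(70e9, 200e9)
small_welltrained = predicted_loss(13e9, 2e12)
print(f"70B on 200B tokens: {big_undertrained:.3f}")
print(f"13B on 2T tokens:   {small_welltrained:.3f}")
```

Under this fit, the smaller model trained on more tokens reaches the lower loss. The two runs are not compute-matched (the 13B run uses somewhat less than twice the training FLOPs under the 6 · N · D approximation), but the 13B model is also about five times cheaper to serve per token.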
Emergent Capabilities and Their Limits
Past certain scale thresholds, models start solving tasks they could not solve at smaller scale: multi-step arithmetic, instruction following, code synthesis, basic chain-of-thought reasoning. These are called emergent capabilities. They are not magic — they appear because the model has finally learned the underlying skills well enough that performance crosses a useful threshold.
The flip side: emergence does not mean reliability. A model that can solve 60% of grade-school math problems still fails the other 40%, often confidently. Production systems must assume the model is wrong some non-zero fraction of the time and design accordingly — with verification, retries, tool use, and clear UX around uncertainty.
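One way to design for a model that is sometimes wrong is a verify-and-retry loop, sketched below. The `generate` and `verify` callables, the `fake_model` stand-in, and the retry count are all hypothetical; in a real system `generate` would wrap a model call and `verify` would be an independent check such as a parser, a unit test, or a calculator.

```python
from typing import Callable, Optional


def generate_with_verification(
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first model output that passes `verify`, else None.

    `generate` stands in for any model call; `verify` is an independent check
    that does not rely on the model's own confidence.
    """
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if verify(candidate):
            return candidate
    return None  # caller should surface the uncertainty instead of guessing


# Hypothetical stand-in for a model that is wrong on its first try.
_answers = iter(["5", "4"])


def fake_model(prompt: str) -> str:
    return next(_answers)


result = generate_with_verification(
    generate=fake_model,
    verify=lambda text: text.strip() == str(2 + 2),
    prompt="What is 2 + 2? Reply with the number only.",
)
print(result)
```

Returning None instead of the last failed candidate forces the calling code to handle the failure explicitly, for example by falling back to a tool or telling the user the answer could not be verified.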
The LLM Engineering Pipeline
Building an LLM end-to-end has well-defined stages. Data: collect, deduplicate, filter, and tokenize. Pre-training: train next-token prediction at scale across many GPUs. Mid/Post-training: instruction tuning on (prompt, response) pairs, then RLHF or DPO on preference data. Evaluation: benchmarks (MMLU, HumanEval, GSM8K), capability probes, and red-teaming. Optimization: quantization, distillation, speculative decoding. Serving: batching, KV caching, request scheduling. Application layer: RAG, tool use, agents, guardrails.
Every later module in this diploma zooms into one of these stages. Knowing the full pipeline is what separates an LLM engineer from someone who only calls an API.
How To Read the Rest of This Diploma
Modules 2–5 cover the architectural foundations: tokenization, embeddings, attention, and the transformer block. Module 6 covers training at scale. Modules 7–8 cover post-training (fine-tuning and alignment). Modules 9–10 cover the application stack — RAG and evaluation. Modules 11–12 cover production: cost, scaling, serving, and deployment.
Work through them in order. Each one assumes you have internalized the previous one. Skipping ahead because 'I already know transformers' is the single most common reason students fail the gated quizzes.
