Large Language Models (LLMs)¶
Large Language Models (LLMs) are deep learning models trained on vast amounts of text data to understand and generate human-like language. They are based on transformer architectures (introduced in "Attention is All You Need," Vaswani et al., 2017) and have revolutionized natural language processing by achieving state-of-the-art performance on tasks like translation, summarization, question answering, code generation, and reasoning.
LLMs represent a paradigm shift in AI: rather than training task-specific models, a single pre-trained foundation model can be adapted (via fine-tuning or prompting) to thousands of downstream tasks. This chapter provides a deep technical exploration of how LLMs work—from the mathematics of attention to production deployment strategies.
1. The Transformer Architecture¶
The transformer is the foundational architecture behind all modern LLMs. Unlike previous sequence models (RNNs, LSTMs) that process tokens sequentially, transformers process all tokens in parallel using self-attention, enabling massive parallelism during training and capturing long-range dependencies effectively.
Original Transformer (Encoder-Decoder)¶
The original transformer has two components:
- Encoder: Processes input sequence and produces contextualized representations. Used for understanding tasks (BERT, T5 encoder).
- Decoder: Generates output sequence autoregressively (one token at a time). Used for generation tasks (GPT series).
Most modern LLMs use decoder-only architectures (GPT, LLaMA, Claude, Gemini) because autoregressive generation is sufficient for most tasks when the model is large enough.
Transformer Block Components¶
Each transformer block contains:
- Multi-Head Self-Attention: Allows each token to attend to all other tokens in the sequence.
- Feed-Forward Network (FFN): Two-layer MLP applied to each position independently. Typically FFN(x) = GELU(xW₁ + b₁)W₂ + b₂, where the inner dimension is 4× the model dimension.
- Layer Normalization: Stabilizes training by normalizing activations. Modern LLMs use Pre-Norm (normalize before attention/FFN) rather than Post-Norm (original transformer).
- Residual Connections: Skip connections that add the input to the output of each sub-layer, enabling gradient flow through deep networks. Post-Norm computes output = LayerNorm(x + SubLayer(x)); Pre-Norm computes output = x + SubLayer(LayerNorm(x)).
Transformer Variants¶
| Architecture | Direction | Masking | Pre-training Objective | Use Cases | Examples |
|---|---|---|---|---|---|
| Encoder-only | Bidirectional | No causal mask | Masked Language Modeling (MLM) | Classification, NER, embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Left-to-right | Causal mask | Next Token Prediction (NTP) | Text generation, reasoning, chat | GPT-4, Claude, LLaMA, Gemini |
| Encoder-Decoder | Both | Causal in decoder | Span corruption / denoising | Translation, summarization | T5, BART, Flan-T5 |
Pseudocode (Transformer Block)¶
class TransformerBlock:
attention: MultiHeadAttention
ffn: FeedForwardNetwork
norm1: LayerNorm
norm2: LayerNorm
function forward(x: Tensor, mask: Tensor = None) -> Tensor
// Pre-norm architecture (modern LLMs)
// Self-attention with residual connection
normed = self.norm1(x)
attention_output = self.attention(normed, normed, normed, mask)
x = x + attention_output // Residual
// Feed-forward with residual connection
normed = self.norm2(x)
ffn_output = self.ffn(normed)
x = x + ffn_output // Residual
return x
class Transformer:
embedding: EmbeddingLayer
blocks: list[TransformerBlock]
head: LinearLayer // Projects to vocabulary size
function forward(token_ids: Tensor) -> Tensor
x = self.embedding(token_ids)
for block in self.blocks:
x = block(x)
logits = self.head(x) // [batch, seq_len, vocab_size]
return logits
2. Tokenization¶
Tokenization is the process of breaking text into smaller units (tokens) that the model can process. Unlike simple word splitting, modern tokenizers use subword algorithms to handle out-of-vocabulary words, reduce vocabulary size, and enable multilingual support.
Tokenization Algorithms¶
Byte Pair Encoding (BPE)¶
Used by GPT models. Iteratively merges the most frequent pairs of adjacent characters/subwords:
- Start with character-level vocabulary (plus special tokens).
- Count all pairs of adjacent symbols in the training corpus.
- Merge the most frequent pair into a new symbol.
- Repeat until desired vocabulary size (e.g., 50,257 for GPT-2).
Example: "running" → ["run", "ning"] if "ning" is a learned subword.
Key insight: BPE naturally handles common words as single tokens and rare/novel words as sequences of subwords.
WordPiece¶
Used by BERT. Similar to BPE but uses a likelihood-based criterion:
- Merge the pair that maximizes the language model likelihood of the training data.
- Splits words with "##" prefix for continuation subwords: "unbelievable" → ["un", "##believ", "##able"].
SentencePiece¶
Used by T5, LLaMA, PaLM. A language-independent tokenizer that:
- Treats the input as a raw byte stream (no pre-tokenization required).
- Supports both BPE and Unigram language model algorithms.
- Handles whitespace explicitly with the "▁" character.
- Enables truly language-agnostic tokenization (critical for multilingual models).
Byte-level BPE¶
Used by GPT-2+. Operates on raw bytes rather than Unicode characters:
- Vocabulary starts with 256 byte tokens.
- Can represent any text in any language without unknown tokens.
- Avoids character-level preprocessing entirely.
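As a minimal illustration in plain Python (not an actual GPT-2 tokenizer), the byte-level starting point is simply the UTF-8 bytes of the text, so no input can ever be out of vocabulary:

```python
# Byte-level starting point: any string decomposes into bytes 0-255,
# so there is never an unknown token, even for accents, emoji, or rare scripts.
text = "héllo"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)        # "é" becomes two bytes in UTF-8
print(len(byte_tokens))   # 6 bytes for 5 characters
```

BPE merges are then learned on top of these 256 base byte tokens.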
Tokenization Comparison¶
| Algorithm | Vocabulary Size | OOV Handling | Language Support | Used By |
|---|---|---|---|---|
| BPE | 10K-50K | Subword merging | Good | GPT-2, GPT-3, GPT-4 |
| WordPiece | 30K | Subword splitting | Good | BERT, DistilBERT |
| SentencePiece | 32K-128K | Character-level | Excellent (multilingual) | T5, LLaMA, PaLM |
| Byte-level BPE | 50K-100K | Byte fallback | Universal | GPT-2+, LLaMA 2 |
| Character-level | ~100 | Perfect | Universal | Rare (sequences too long) |
Pseudocode (BPE Tokenization)¶
class BPETokenizer:
vocab: dict[str, int] // token → id
merges: list[tuple[str, str]] // learned merge pairs, in priority order
function train(corpus: str, vocab_size: int)
// Initialize vocabulary with individual characters
self.vocab = {char: id for id, char in enumerate(unique_chars(corpus))}
while len(self.vocab) < vocab_size:
// Count all adjacent pairs
pair_counts = count_adjacent_pairs(corpus)
// Find most frequent pair
best_pair = argmax(pair_counts)
// Merge this pair everywhere in corpus
corpus = merge_pair(corpus, best_pair)
// Add merged token to vocabulary
new_token = best_pair[0] + best_pair[1]
self.vocab[new_token] = len(self.vocab)
self.merges.append(best_pair)
function tokenize(text: str) -> list[int]
tokens = list(text) // Start with characters
for merge_pair in self.merges:
tokens = apply_merge(tokens, merge_pair)
return [self.vocab[token] for token in tokens]
function apply_merge(tokens: list[str], pair: tuple[str, str]) -> list[str]
new_tokens = []
i = 0
while i < len(tokens):
if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
new_tokens.append(pair[0] + pair[1]) // Merge
i += 2
else:
new_tokens.append(tokens[i])
i += 1
return new_tokens
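The pseudocode above can be condensed into a runnable sketch. This is a toy word-level BPE (no special tokens, no pre-tokenization, frequency ties broken by first occurrence), just to make the merge loop concrete:

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Represent each word as a tuple of symbols, with frequency counts.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Merge the best pair everywhere in the corpus.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

def tokenize(word, merges):
    tokens = list(word)  # start at character level
    for a, b in merges:  # apply merges in learned priority order
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=3)
print(tokenize("lowest", merges))  # ['lowe', 's', 't']
```

With this tiny corpus, the learned merges are ("l","o"), ("lo","w"), ("low","e"): the common stem fuses into larger units while rare suffixes stay as characters.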
Tokenization Challenges¶
- Token Count Mismatch: Same text can have different token counts across models. "Hello world" might be 2 tokens in one model and 3 in another.
- Special Tokens: Models use special tokens like <|endoftext|>, [CLS], [SEP], <s>, </s> for formatting and control flow.
- Multilingual Efficiency: Languages with non-Latin scripts (Chinese, Japanese, Arabic) often require more tokens per word, making inference more expensive.
- Numbers and Code: Tokenizers often split numbers digit-by-digit and code in unexpected ways, affecting arithmetic and code generation abilities.
- Tokenizer-Model Coupling: A model can only use the tokenizer it was trained with. Changing the tokenizer requires retraining.
3. Embeddings (Vectorization)¶
After tokenization, tokens are converted into dense vector representations (embeddings) that capture semantic meaning. These embeddings exist in a high-dimensional space where geometric relationships encode linguistic relationships.
Embedding Process¶
- Token Embedding: Each token ID maps to a learned embedding vector (typically 768-8192 dimensions). This is simply a lookup in a learned embedding matrix E ∈ ℝ^(vocab_size × d_model).
- Positional Embedding: Adds position information since transformers have no inherent sense of order:
- Learned Positional Embeddings: A separate embedding matrix for positions (GPT, BERT). Limited to a fixed maximum sequence length.
- Sinusoidal Positional Encoding: Uses sine and cosine functions of different frequencies (original transformer). Can generalize to unseen lengths.
- Rotary Positional Embeddings (RoPE): Encodes position by rotating the embedding vector (LLaMA, Mistral). Enables better length extrapolation.
- ALiBi (Attention with Linear Biases): Adds a linear bias to attention scores based on distance (BLOOM). No positional embeddings needed.
- Layer/Segment Embedding: In some models, adds token type information (e.g., sentence A vs. sentence B in BERT).
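Of the schemes above, the sinusoidal encoding is the easiest to write down exactly. A minimal NumPy version of the formula from the original transformer paper (not tied to any particular model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]           # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]          # [1, d_model/2]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(128, 64)
print(pe.shape)  # (128, 64)
```

Each position gets a unique pattern across frequencies, and because the functions are periodic, the encoding is defined for any position, which is why it can generalize to unseen lengths.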
Embedding Properties¶
- Semantic Similarity: Words with similar meanings have similar embeddings. cosine_similarity(embed("king"), embed("queen")) is high.
- Linear Relationships: Word analogies emerge as vector arithmetic: embed("king") - embed("man") + embed("woman") ≈ embed("queen").
- Contextual Embeddings: In transformers, the same word gets different embeddings in different contexts. "bank" in "river bank" vs. "bank account" will have different representations after attention layers.
- Clustering: Embeddings naturally cluster by topic, language, and semantic category in the high-dimensional space.
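Cosine similarity, the standard measure behind these properties, is a one-liner. A sketch with toy 4-dimensional vectors (illustrative values, not embeddings from a real model):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" chosen so that king/queen point in similar directions.
king  = np.array([0.9, 0.8, 0.1, 0.2])
queen = np.array([0.8, 0.9, 0.2, 0.1])
apple = np.array([0.1, 0.2, 0.9, 0.8])

print(cosine_similarity(king, queen))  # close to 1 (semantically similar)
print(cosine_similarity(king, apple))  # much lower (unrelated)
```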
Embedding Dimensions Across Models¶
| Model | Embedding Dimension | Vocabulary Size | Total Embedding Params | Positional Encoding |
|---|---|---|---|---|
| BERT-base | 768 | 30K | ~23M | Learned |
| GPT-2 | 768 | 50,257 | ~39M | Learned |
| GPT-3 (175B) | 12,288 | 50,257 | ~614M | Learned |
| LLaMA-7B | 4,096 | 32,000 | ~131M | RoPE |
| LLaMA-70B | 8,192 | 32,000 | ~262M | RoPE |
| Mistral-7B | 4,096 | 32,000 | ~131M | RoPE |
Pseudocode (Embedding Layer)¶
class EmbeddingLayer:
token_embeddings: Tensor // [vocab_size, d_model]
position_encoder: PositionEncoder // RoPE, learned, or sinusoidal
function forward(token_ids: Tensor) -> Tensor
// Token embeddings via lookup
token_embeds = self.token_embeddings[token_ids] // [batch, seq_len, d_model]
// Add positional information
positions = range(token_ids.shape[1])
output = self.position_encoder(token_embeds, positions)
return output
class RoPE:
// Rotary Positional Embeddings
function forward(x: Tensor, positions: Tensor) -> Tensor
d = x.shape[-1]
// Generate rotation frequencies
freqs = 1.0 / (10000 ** (arange(0, d, 2) / d))
// Apply rotation to pairs of dimensions
angles = positions[:, None] * freqs[None, :]
cos_angles = cos(angles)
sin_angles = sin(angles)
// Rotate embedding pairs
x_rotated = rotate_half(x, cos_angles, sin_angles)
return x_rotated
4. The Attention Mechanism (Deep Dive)¶
Attention is the core innovation of transformers, allowing models to dynamically focus on relevant parts of the input when processing each token. It replaces the fixed-length bottleneck of RNNs with a mechanism that can look at the entire sequence at once.
Scaled Dot-Product Attention¶
The fundamental attention computation:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Where:
- Q (Query): "What am I looking for?" — represents the current token's question to other tokens. Shape: [seq_len, d_k]
- K (Key): "What do I contain?" — represents what each token offers. Shape: [seq_len, d_k]
- V (Value): "What information do I provide?" — the actual content to aggregate. Shape: [seq_len, d_v]
- d_k: Dimension of keys (the scaling factor prevents softmax from saturating into one-hot distributions for large dimensions)
Attention Step by Step¶
- Compute Similarity Scores: QK^T gives a [seq_len × seq_len] matrix of similarity scores between every pair of tokens.
- Scale: Divide by √d_k to keep gradients stable. Without scaling, dot products grow with dimension, causing softmax to produce near-one-hot distributions.
- Apply Causal Mask (decoder): Set future positions to -∞ so they become 0 after softmax (prevents the model from "looking ahead" during generation).
- Softmax: Convert scores to probabilities (attention weights sum to 1 for each query token).
- Weighted Sum: Multiply attention weights by values to produce the output — a contextually-informed representation of each token.
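The five steps above can be traced end to end with a small NumPy sketch (single head, no batching, random inputs):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # steps 1-2: similarity + scale
    mask = np.triu(np.ones(scores.shape, dtype=bool), 1)  # step 3: future positions
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)                             # step 4: rows sum to 1
    return weights @ V, weights                           # step 5: weighted sum

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = causal_attention(Q, K, V)
print(out.shape)                        # (5, 8)
print(np.allclose(w.sum(axis=-1), 1))   # True: each query's weights sum to 1
print(np.allclose(w[0, 1:], 0))         # True: token 0 cannot attend to the future
```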
Multi-Head Attention¶
Instead of a single attention function, use multiple "heads" in parallel, each projecting Q, K, V into a different subspace. This allows the model to simultaneously attend to different types of relationships:
- Head 1: Might learn syntactic dependencies (subject-verb agreement).
- Head 2: Might learn semantic relationships (synonyms, antonyms).
- Head 3: Might learn positional patterns (adjacent tokens).
- Head 4: Might learn long-range coreference (pronoun resolution).
After computing attention for each head, outputs are concatenated and projected:
MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₕ) × W_O
where headᵢ = Attention(QW_Qᵢ, KW_Kᵢ, VW_Vᵢ)
Pseudocode (Complete Attention)¶
function scaled_dot_product_attention(
query: Tensor, // [batch, num_heads, seq_len_q, d_k]
key: Tensor, // [batch, num_heads, seq_len_k, d_k]
value: Tensor, // [batch, num_heads, seq_len_k, d_v]
mask: Tensor = None
) -> tuple[Tensor, Tensor]
d_k = query.shape[-1]
// Step 1: Compute attention scores
scores = matmul(query, key.transpose(-2, -1)) // [batch, heads, seq_q, seq_k]
// Step 2: Scale
scores = scores / sqrt(d_k)
// Step 3: Apply causal mask (if provided)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
// Step 4: Softmax to get attention weights
attention_weights = softmax(scores, dim=-1) // [batch, heads, seq_q, seq_k]
// Step 5: Weighted sum of values
output = matmul(attention_weights, value) // [batch, heads, seq_q, d_v]
return output, attention_weights
class MultiHeadAttention:
num_heads: int
d_model: int
d_k: int // = d_model // num_heads
W_Q: Tensor // [d_model, d_model]
W_K: Tensor // [d_model, d_model]
W_V: Tensor // [d_model, d_model]
W_O: Tensor // [d_model, d_model]
function forward(
query: Tensor, // [batch, seq_len, d_model]
key: Tensor,
value: Tensor,
mask: Tensor = None
) -> Tensor
batch_size = query.shape[0]
// Linear projections
Q = matmul(query, self.W_Q) // [batch, seq_len, d_model]
K = matmul(key, self.W_K)
V = matmul(value, self.W_V)
// Reshape to [batch, num_heads, seq_len, d_k]
Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
// Apply attention
output, weights = scaled_dot_product_attention(Q, K, V, mask)
// Concatenate heads: [batch, seq_len, d_model]
output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
// Final linear projection
output = matmul(output, self.W_O)
return output
Attention Complexity and Optimizations¶
Standard Attention:
- Time Complexity: O(n² × d), where n is sequence length and d is dimension.
- Space Complexity: O(n²) for the attention matrix.
This quadratic scaling is the primary bottleneck for long sequences. Key optimizations:
| Optimization | Complexity | Approach | Used By |
|---|---|---|---|
| Flash Attention | O(n²) time, O(n) memory | Tiling + kernel fusion, avoids materializing full attention matrix | LLaMA 2+, Mistral, most modern LLMs |
| Flash Attention 2 | O(n²) time, O(n) memory | Improved parallelism, better GPU utilization (2× faster than v1) | State of the art |
| Multi-Query Attention (MQA) | O(n² × d/h) | Share K, V across all heads (only Q differs per head) | PaLM, Falcon |
| Grouped-Query Attention (GQA) | O(n² × d×g/h) | Groups of heads share K, V (compromise between MHA and MQA) | LLaMA 2 70B, Mistral |
| Sparse Attention | O(n × √n) | Only attend to local windows + sparse global patterns | Longformer, BigBird |
| Linear Attention | O(n × d²) | Approximate softmax with kernel trick | Performer, Linformer |
| Ring Attention | O(n²/p) per device | Distribute attention across devices for very long sequences | Research |
Types of Attention in LLMs¶
- Self-Attention: Q, K, V all come from the same sequence. Every token attends to every other token. Used in both encoders and decoders.
- Cross-Attention: Q from one sequence (decoder), K and V from another (encoder output). Used in encoder-decoder models for tasks like translation.
- Causal (Masked) Self-Attention: Tokens can only attend to previous tokens and themselves. Enforced by a triangular mask. Essential for autoregressive generation.
5. Pre-training¶
Pre-training is the process of training a large model on massive unlabeled text corpora to learn general language understanding. This is the most expensive phase (millions of dollars, weeks on thousands of GPUs).
Pre-training Objectives¶
Next Token Prediction (Autoregressive / GPT-style)¶
The dominant objective for decoder-only LLMs. Given a sequence of tokens, predict the next token:
Loss = -Σ log P(tokenᵢ | token₁, token₂, ..., tokenᵢ₋₁)
The model learns to model the probability distribution over the entire vocabulary for each position, conditioned on all previous tokens. This simple objective, at sufficient scale, leads to emergent capabilities like reasoning, translation, and code generation.
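Concretely, the loss is just the average cross-entropy between the model's predicted distribution and the actual next token at each position. A NumPy sketch with made-up logits:

```python
import numpy as np

def next_token_loss(logits, targets):
    """logits: [seq_len, vocab_size]; targets[i] is the true token at position i+1."""
    # Log-softmax over the vocabulary at each position.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-probability of the true next token, averaged over positions.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 positions, toy vocabulary of 10
targets = np.array([3, 1, 4, 1])    # the "true" next tokens
print(next_token_loss(logits, targets))
```

With random logits the loss sits near ln(vocab_size) ≈ 2.3, the loss of a uniform guesser; training drives it toward the entropy of the data.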
Masked Language Modeling (BERT-style)¶
Randomly mask 15% of tokens and predict them from context:
Input: "The [MASK] sat on the [MASK]"
Output: "The cat sat on the mat"
Bidirectional — the model sees both left and right context. Better for understanding tasks but cannot generate text autoregressively.
Span Corruption (T5-style)¶
Replace random spans of text with sentinel tokens and predict the original spans:
Input: "The <X> sat on the <Y>"
Output: "<X> cat <Y> mat"
Combines benefits of MLM with sequence-to-sequence format.
Pre-training Data¶
| Dataset | Size | Content | Used By |
|---|---|---|---|
| Common Crawl | ~60TB text | Web pages (filtered) | GPT-3, many others |
| The Pile | 825 GB | Curated mix (academic, code, books, web) | GPT-NeoX, Pythia |
| RedPajama | 1.2T tokens | Open reproduction of LLaMA training data | RedPajama models |
| FineWeb | 15T tokens | High-quality filtered web data | Open-source community |
| StarCoder Data | 783 GB | Source code from GitHub | StarCoder, CodeLLaMA |
| Wikipedia | ~20 GB | Encyclopedia articles | Nearly all models |
| Books3 | ~100 GB | Digitized books | GPT-3 (controversial) |
Data Quality and Curation¶
Data quality is more important than quantity. Key preprocessing steps:
- Deduplication: Remove exact and near-duplicate documents (MinHash, SimHash). Duplicates cause memorization and degrade generalization.
- Quality Filtering: Remove low-quality content (boilerplate, spam, machine-generated text). Use classifiers trained on high-quality data.
- Toxicity Filtering: Remove or reduce harmful, biased, and explicit content.
- Language Identification: Tag and balance content across languages.
- Domain Mixing: Balance data from different domains (web, books, code, academic papers) to ensure broad capabilities.
Training Infrastructure¶
Training a frontier LLM requires massive distributed computing:
- GPT-3 (175B): ~3,640 petaflop-days, estimated $4.6M on cloud GPUs.
- LLaMA-65B: 2,048 A100 GPUs for ~21 days, ~1.4M GPU-hours.
- GPT-4: Estimated 10,000+ A100s for months, cost ~$100M+.
Distributed Training Strategies:
| Strategy | Description | Use Case |
|---|---|---|
| Data Parallelism (DP) | Same model on each GPU, different data batches | Small models that fit in one GPU |
| Tensor Parallelism (TP) | Split individual layers across GPUs | Large layers that don't fit in one GPU |
| Pipeline Parallelism (PP) | Split model layers across GPUs sequentially | Very deep models |
| Fully Sharded Data Parallelism (FSDP) | Shard model parameters, gradients, and optimizer states | Modern default for large models |
| 3D Parallelism | Combine DP + TP + PP | Frontier model training |
6. Scaling Laws¶
Scaling laws describe the predictable relationship between model size, dataset size, compute budget, and model performance. These empirical laws guide how to allocate resources for training.
Kaplan Scaling Laws (OpenAI, 2020)¶
Performance (measured by cross-entropy loss) follows a power law:
L(N) ∝ N^(-0.076) // Loss vs. parameters
L(D) ∝ D^(-0.095) // Loss vs. dataset size
L(C) ∝ C^(-0.050) // Loss vs. compute
Key insight: Scaling up model size is more efficient than scaling up data, suggesting larger models trained on relatively less data are optimal.
Chinchilla Scaling Laws (DeepMind, 2022)¶
Chinchilla challenged Kaplan's findings:
Optimal tokens ≈ 20 × parameters
This means a 10B parameter model should be trained on ~200B tokens. Most models before Chinchilla (including GPT-3) were significantly undertrained relative to their size.
Impact: Led to smaller, better-trained models. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) despite being 4× smaller.
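Using the widely cited approximation C ≈ 6ND (training FLOPs ≈ 6 × parameters × tokens), the Chinchilla rule reduces to a couple of lines of arithmetic:

```python
def chinchilla_optimal_tokens(params):
    # Chinchilla heuristic: roughly 20 training tokens per parameter.
    return 20 * params

def training_flops(params, tokens):
    # Common estimate: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

params = 10e9                                # 10B-parameter model
tokens = chinchilla_optimal_tokens(params)   # ~200B tokens
print(f"{tokens:.0e} tokens")
print(f"{training_flops(params, tokens):.1e} FLOPs")  # 1.2e+22
```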
Modern Scaling Observations¶
- Emergent Abilities: Certain capabilities (multi-step reasoning, code generation) appear suddenly at specific scale thresholds rather than improving gradually.
- Inference Cost Matters: Chinchilla-optimal training doesn't account for inference costs. Smaller, overtrained models (LLaMA approach) may be more practical since inference cost scales with model size.
- Data Walls: The internet has a finite amount of high-quality text. Synthetic data generation and data efficiency techniques are increasingly important.
7. Fine-tuning Techniques¶
Fine-tuning adapts a pre-trained model to specific tasks or behaviors using smaller, curated datasets. Modern fine-tuning ranges from full parameter updates to lightweight adapter methods.
Full Fine-tuning¶
Update all model parameters on task-specific data:
- Pros: Maximum flexibility, best task-specific performance.
- Cons: Requires storing a full copy of the model per task, expensive, risk of catastrophic forgetting.
- Compute: Comparable to pre-training on the fine-tuning dataset.
Parameter-Efficient Fine-Tuning (PEFT)¶
Train only a small subset of parameters, keeping the base model frozen:
LoRA (Low-Rank Adaptation)¶
The most popular PEFT method. Instead of updating weight matrix W directly, decompose the update into two low-rank matrices:
W' = W + ΔW = W + BA
where B ∈ ℝ^(d × r), A ∈ ℝ^(r × d), r << d
- r (rank): Typically 4-64. Controls the expressiveness vs. efficiency trade-off.
- Trainable parameters: Only 2 × d × r per adapted layer (vs. d² for full fine-tuning).
- Savings: For a 7B model, LoRA might train 0.1-1% of parameters.
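The savings are easy to quantify. For one d×d weight matrix at a LLaMA-7B-like dimension (d = 4096 is an assumption here, and r = 8 is a typical rank choice):

```python
def lora_params(d, r):
    # B is [d, r] and A is [r, d]: 2 * d * r trainable parameters per adapted matrix.
    return 2 * d * r

d = 4096          # model dimension (LLaMA-7B-like, for illustration)
r = 8             # LoRA rank
full = d * d      # full fine-tuning of one d×d weight matrix
lora = lora_params(d, r)
print(lora, full, f"{lora / full:.2%}")  # 65536 16777216 0.39%
```

So a rank-8 adapter trains under half a percent of that matrix's parameters, which is why whole-model LoRA runs land in the 0.1-1% range.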
Pseudocode (LoRA):
class LoRALayer:
base_weight: Tensor // [d_out, d_in] (frozen)
lora_A: Tensor // [r, d_in] (trainable)
lora_B: Tensor // [d_out, r] (trainable)
scaling: float // alpha / r
function forward(x: Tensor) -> Tensor
// Original computation + low-rank adaptation
base_output = matmul(x, self.base_weight.T)
lora_output = matmul(matmul(x, self.lora_A.T), self.lora_B.T) * self.scaling
return base_output + lora_output
QLoRA (Quantized LoRA)¶
Combines LoRA with 4-bit quantization of the base model:
- Quantize base model to 4-bit NormalFloat (NF4) format.
- Apply LoRA adapters in full precision (bfloat16).
- Use double quantization (quantize the quantization constants).
- Use paged optimizers to handle memory spikes.
Impact: Fine-tune a 65B model on a single 48GB GPU (vs. ~780GB for full fine-tuning).
Other PEFT Methods¶
| Method | Approach | Trainable Params | Performance |
|---|---|---|---|
| LoRA | Low-rank weight decomposition | 0.1-1% | Excellent |
| QLoRA | LoRA + 4-bit quantization | 0.1-1% | Near LoRA, much less memory |
| Prefix Tuning | Learnable prefix tokens in each layer | ~0.1% | Good for generation |
| Prompt Tuning | Learnable soft prompt tokens (input only) | <0.01% | Good for classification |
| Adapters | Small bottleneck layers inserted between transformer layers | 1-3% | Good, higher overhead |
| IA³ | Learned rescaling of activations | <0.01% | Competitive with LoRA |
Instruction Tuning¶
Fine-tuning on instruction-following datasets to make models helpful and aligned:
- Datasets: FLAN (1.8K tasks), Alpaca (52K instructions), ShareGPT (user conversations), OpenAssistant.
- Format: Each example is an instruction-input-output triple.
- Effect: Transforms a base model (which just predicts next tokens) into a helpful assistant.
Supervised Fine-Tuning (SFT)¶
Training on high-quality demonstrations of desired behavior:
Dataset: [(instruction, ideal_response), ...]
Loss: Cross-entropy between model output and ideal_response
SFT is the first step in the alignment pipeline (before RLHF/DPO).
8. Alignment: RLHF and DPO¶
Alignment ensures models are helpful, harmless, and honest. Raw pre-trained models can generate toxic, biased, or unhelpful content. Alignment techniques steer model behavior toward human preferences.
RLHF (Reinforcement Learning from Human Feedback)¶
The three-step process used by ChatGPT, Claude, and others:
Step 1: Supervised Fine-Tuning (SFT)
- Train on demonstrations of ideal assistant behavior.
- Result: A model that follows instructions but may still produce undesirable outputs.
Step 2: Reward Model Training
- Collect human preference data: show pairs of model responses, humans rank which is better.
- Train a reward model R(prompt, response) → scalar that predicts human preferences.
- Dataset: Thousands of comparison pairs.
Step 3: PPO (Proximal Policy Optimization)
- Use the reward model to further train the SFT model via reinforcement learning.
- The model generates responses, the reward model scores them, and PPO updates the model to increase expected reward.
- A KL divergence penalty prevents the model from diverging too far from the SFT model (prevents reward hacking).
Objective = E[R(prompt, response)] - β × KL(π_RL || π_SFT)
Pseudocode (RLHF Training Loop):
class RLHFTrainer:
policy_model: LLM // Model being trained
reference_model: LLM // Frozen SFT model
reward_model: RewardModel // Trained on human preferences
beta: float = 0.1 // KL penalty coefficient
function train_step(prompts: list[str])
// Generate responses with current policy
responses = self.policy_model.generate(prompts)
// Score with reward model
rewards = self.reward_model.score(prompts, responses)
// Compute KL penalty
policy_logprobs = self.policy_model.log_prob(responses)
ref_logprobs = self.reference_model.log_prob(responses)
kl_penalty = policy_logprobs - ref_logprobs
// Combined objective
objective = rewards - self.beta * kl_penalty
// Update policy with PPO
ppo_update(self.policy_model, objective)
DPO (Direct Preference Optimization)¶
A simpler alternative to RLHF that eliminates the need for a separate reward model:
- Directly optimize the policy model using preference pairs (chosen vs. rejected responses).
- Mathematically equivalent to RLHF with a specific reward model class.
- Much simpler to implement and more stable to train.
Loss_DPO = -log(σ(β × (log π(chosen)/π_ref(chosen) - log π(rejected)/π_ref(rejected))))
DPO Advantages:
- No reward model needed.
- No RL training loop (just supervised learning on preferences).
- More stable and reproducible.
- Lower computational cost.
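The DPO loss can be computed directly from sequence log-probabilities. A NumPy sketch with made-up log-prob values (per-example, not batched):

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Inputs are summed log-probs of each full response under each model."""
    chosen_ratio = policy_chosen - ref_chosen        # log π(chosen)/π_ref(chosen)
    rejected_ratio = policy_rejected - ref_rejected  # log π(rejected)/π_ref(rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    return float(-np.log(1 / (1 + np.exp(-margin))))  # -log σ(margin)

# Made-up log-probs: the policy already prefers the chosen response more than
# the reference model does, so the loss falls below -log σ(0) = log 2.
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0, beta=0.1)
print(loss)
```

The gradient of this loss pushes probability mass toward chosen responses and away from rejected ones, scaled by how wrong the current margin is.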
Constitutional AI (Anthropic)¶
Self-supervised alignment using a set of principles (constitution):
- Model generates responses.
- Model critiques its own responses using the constitution.
- Model revises responses based on critiques.
- Fine-tune on the revised responses.
Reduces reliance on human labelers while maintaining alignment quality.
Comparison¶
| Method | Complexity | Stability | Data Needed | Performance |
|---|---|---|---|---|
| SFT | Low | High | Demonstrations | Good baseline |
| RLHF (PPO) | High | Low (tricky to tune) | Preferences + RL | Best (when tuned well) |
| DPO | Medium | High | Preferences only | Near RLHF |
| Constitutional AI | Medium | Medium | Principles + self-play | Good alignment |
| ORPO | Low | High | Preferences only | Competitive |
9. Inference Optimization¶
Serving LLMs in production requires optimizing for latency, throughput, and cost. Inference is fundamentally different from training — it's memory-bandwidth bound (not compute-bound) and must handle variable-length sequences efficiently.
The KV Cache¶
During autoregressive generation, each new token requires attending to all previous tokens. Without caching, this requires recomputing attention for all previous tokens at each step (O(n²) total work for n tokens).
The KV Cache stores the key and value projections for all previous tokens:
// Without KV cache: recompute K, V for all tokens at each step
// With KV cache: only compute K, V for the new token, append to cache
class KVCache:
keys: list[Tensor] // One per layer: [batch, num_heads, seq_len, d_k]
values: list[Tensor] // One per layer: [batch, num_heads, seq_len, d_v]
function update(layer_idx: int, new_key: Tensor, new_value: Tensor)
self.keys[layer_idx] = concat(self.keys[layer_idx], new_key, dim=2)
self.values[layer_idx] = concat(self.values[layer_idx], new_value, dim=2)
KV Cache Memory: For a 70B-class model, a single 4K-token sequence needs on the order of 10 GB of KV cache at fp16 with full multi-head attention, and a serving batch multiplies that. This is why memory, not compute, is the bottleneck (and why MQA/GQA exist to shrink the cache).
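The cache size follows directly from the shapes. A back-of-the-envelope calculation assuming a LLaMA-70B-like shape (80 layers, d_model = 8192, fp16); the GQA line assumes 8 KV heads of dimension 128, which is an illustrative configuration:

```python
def kv_cache_bytes(n_layers, kv_dim, seq_len, bytes_per_param=2):
    # One K and one V tensor of shape [seq_len, kv_dim] per layer.
    return 2 * n_layers * kv_dim * seq_len * bytes_per_param

# Full multi-head attention: KV dimension equals d_model.
mha_gb = kv_cache_bytes(80, 8192, 4096) / 1e9
print(f"{mha_gb:.1f} GB per sequence")              # 10.7 GB

# Grouped-query attention: only 8 KV heads × head_dim 128 are cached.
gqa_gb = kv_cache_bytes(80, 8 * 128, 4096) / 1e9
print(f"{gqa_gb:.1f} GB with grouped-query attention")  # 1.3 GB
```

Shrinking the cached KV dimension is exactly the lever MQA and GQA pull, at the cost of some attention expressiveness.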
Quantization¶
Reduce model precision to decrease memory usage and increase throughput:
| Precision | Bits per Param | Memory (7B model) | Quality Loss | Speed Gain |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | Baseline | 1× |
| FP16/BF16 | 16 | 14 GB | Negligible | ~2× |
| INT8 | 8 | 7 GB | Minimal | ~2-4× |
| INT4 (GPTQ) | 4 | 3.5 GB | Small | ~4-8× |
| INT4 (AWQ) | 4 | 3.5 GB | Very small | ~4-8× |
| GGUF (mixed) | 2-8 | 2-7 GB | Varies | ~4-10× |
Quantization Methods:
- Post-Training Quantization (PTQ): Quantize after training using calibration data. Methods: GPTQ, AWQ, SqueezeLLM.
- Quantization-Aware Training (QAT): Simulate quantization during training. Higher quality but more expensive.
- GGUF: File format for quantized models used by llama.cpp. Supports mixed-precision quantization (more bits for important layers).
Speculative Decoding¶
Use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with the large model:
- Draft model generates K candidate tokens quickly.
- Large model scores all K tokens in a single forward pass.
- Accept tokens that the large model agrees with; reject and regenerate from the first disagreement.
Speedup: 2-3× with no quality loss (mathematically equivalent to sampling from the large model).
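The accept/reject step can be sketched with a greedy-verification simplification (real speculative decoding uses a rejection-sampling rule over token probabilities that preserves the large model's distribution exactly; this argmax variant only illustrates the control flow):

```python
def verify_draft(draft_tokens, target_predictions):
    """Greedy variant: accept the draft prefix where the large model's
    argmax prediction agrees with the draft model's token."""
    accepted = []
    for drafted, predicted in zip(draft_tokens, target_predictions):
        if drafted == predicted:
            accepted.append(drafted)
        else:
            # First disagreement: take the large model's token and stop.
            accepted.append(predicted)
            break
    return accepted

# Draft proposes 5 tokens; the large model verifies all 5 in ONE forward pass.
draft = [12, 7, 99, 3, 41]
target = [12, 7, 42, 8, 5]   # large model's argmax at each position
print(verify_draft(draft, target))  # [12, 7, 42]: 2 accepted + 1 corrected
```

Every verification pass yields at least one token (the correction), and in the best case K + 1 tokens, which is where the 2-3× speedup comes from.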
Continuous Batching¶
Traditional batching waits for all sequences in a batch to finish. Continuous batching:
- Immediately start processing new requests as old ones finish.
- Dynamically add/remove sequences from the batch.
- Maximizes GPU utilization.
Used by vLLM, TGI, and all modern serving frameworks.
PagedAttention (vLLM)¶
Inspired by OS virtual memory paging:
- KV cache is stored in non-contiguous memory blocks (pages).
- Pages are allocated on demand (no pre-allocation waste).
- Enables sharing KV cache across sequences with common prefixes.
- Reduces memory waste by 60-80% compared to naive allocation.
Inference Serving Systems¶
| System | Key Innovation | Performance | Use Case |
|---|---|---|---|
| vLLM | PagedAttention, continuous batching | Very high throughput | Production serving |
| TGI (HuggingFace) | Tensor parallelism, Flash Attention | High throughput | HuggingFace ecosystem |
| TensorRT-LLM (NVIDIA) | GPU-optimized kernels, FP8 | Lowest latency on NVIDIA | Maximum performance |
| Ollama | Simple local serving, GGUF models | Easy to use | Local development |
| llama.cpp | CPU inference, GGUF quantization | Runs anywhere | Edge, desktop |
| SGLang | RadixAttention, structured generation | Optimized for structured output | Complex pipelines |
10. Popular LLMs¶
Proprietary Models¶
| Model | Organization | Parameters | Context Window | Key Features |
|---|---|---|---|---|
| GPT-4o | OpenAI | ~1.7T (est., MoE) | 128K tokens | Multimodal (text, image, audio), strong reasoning |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | 200K tokens | Constitutional AI, strong coding, long context |
| Gemini 1.5 Pro | Google | Undisclosed (MoE) | 1M tokens | Extremely long context, multimodal |
| o1 / o3 | OpenAI | Undisclosed | 128K-200K | Chain-of-thought reasoning, "thinking" models |
Open-Source / Open-Weight Models¶
| Model | Organization | Parameters | Context | License | Key Features |
|---|---|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | 128K | LLaMA 3 License | Strong general performance, widely adopted |
| Mistral / Mixtral | Mistral AI | 7B / 8×7B MoE | 32K | Apache 2.0 | Efficient, MoE architecture |
| Qwen 2.5 | Alibaba | 0.5B-72B | 128K | Apache 2.0 | Strong multilingual, coding |
| DeepSeek-V2/V3 | DeepSeek | 236B MoE | 128K | MIT | Multi-head latent attention, very efficient MoE |
| Phi-3/4 | Microsoft | 3.8B-14B | 128K | MIT | Small but capable, strong reasoning |
| Gemma 2 | Google | 2B, 9B, 27B | 8K | Gemma License | Efficient, research-friendly |
| CodeLLaMA | Meta | 7B-34B | 16K-100K | LLaMA 2 License | Code-specialized |
| StarCoder 2 | BigCode | 3B-15B | 16K | BigCode License | Code generation, fill-in-the-middle |
Mixture of Experts (MoE)¶
A key architectural innovation for scaling efficiently. Instead of a single large FFN, MoE uses multiple "expert" FFNs with a gating network that routes each token to a subset of experts:
MoE_FFN(x) = Σᵢ G(x)ᵢ × Expertᵢ(x)
where G(x) = TopK(softmax(x × W_gate))
- Sparse activation: Only K experts (typically 2) are activated per token, out of N total (e.g., 8 experts).
- Effective parameters: A naive count gives 8 × 7B = 56B total, but only 2 × 7B = 14B are active per token (shared attention layers mean real totals come in lower).
- Benefits: More total knowledge (parameters) without proportional compute increase.
- Challenges: Load balancing across experts, memory for all expert weights, routing instability.
Example: Mixtral 8×7B has 46.7B total parameters but only 12.9B active per token, making it competitive with much larger dense models.
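The routing equation above can be made concrete for a single token. A minimal sketch (toy experts and a hand-picked gate matrix, purely illustrative — real MoE layers batch this and add load-balancing losses):

```python
import numpy as np

def moe_ffn(x, W_gate, experts, k=2):
    """Top-k MoE layer for one token vector x, mirroring
    MoE_FFN(x) = sum_i G(x)_i * Expert_i(x) with only k experts run."""
    logits = x @ W_gate                     # one score per expert
    top = np.argsort(logits)[-k:]           # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                            # softmax renormalized over top-k
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Four toy "experts" that just scale their input by 1..4.
experts = [lambda x, c=c: c * x for c in (1.0, 2.0, 3.0, 4.0)]
x = np.ones(2)
W_gate = np.array([[0.0, 0.0, 0.0, 5.0],
                   [0.0, 0.0, 0.0, 5.0]])   # gate strongly favors expert 3
y = moe_ffn(x, W_gate, experts)             # ~4.0 * x: expert 3 dominates
```

The sparsity is visible in the loop: only the top-k experts are ever evaluated, which is why active FLOPs scale with k rather than with the total expert count.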
11. Multimodal Models and Diffusion¶
Vision-Language Models¶
Models that process both text and images:
- Architecture: Typically a vision encoder (ViT) + LLM, connected via a projection layer or cross-attention.
- Examples: GPT-4V/4o, Claude 3 (vision), Gemini, LLaVA.
- Capabilities: Image understanding, visual question answering, OCR, diagram interpretation.
Diffusion Models¶
While transformers dominate text, diffusion models have revolutionized image generation:
- Forward Process: Gradually add Gaussian noise to an image over T timesteps until it becomes pure noise.
- Reverse Process: Train a neural network to predict and remove noise at each step, gradually recovering the image.
Forward: x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε, ε ~ N(0, I), where ᾱ_t = ∏ₛ₌₁ᵗ (1-β_s)
Reverse: x_{t-1} = denoise(x_t, t) // Learned denoising
Key models: DALL-E 2/3, Stable Diffusion, Midjourney, Imagen.
Latent Diffusion: Perform diffusion in a compressed latent space (not pixel space) for efficiency. Stable Diffusion uses a VAE to compress images 8× before diffusion.
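The forward process has a closed form that lets you sample x_t directly from x_0 without simulating every intermediate step. A minimal NumPy sketch (DDPM-style linear noise schedule; the array stands in for an image):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t in closed form from the forward process:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
    where abar_t is the cumulative product of (1 - beta_s)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)       # linear schedule over T=1000 steps
x0 = rng.standard_normal((8, 8))            # stand-in for an image
x_T, eps = forward_diffuse(x0, t=999, betas=betas, rng=rng)
# At the final step the signal is almost gone: x_T is nearly pure noise.
```

Training the reverse process amounts to asking a network to predict `eps` from `(x_t, t)` — the closed form above is what makes that training loop cheap, since any timestep can be sampled in one shot.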
Diffusion Transformers (DiT)¶
Replace the U-Net backbone in diffusion models with a transformer:
- Better scaling properties than U-Nets.
- Used by DALL-E 3, Stable Diffusion 3, Sora.
- Enables unified transformer-based architecture for text and images.
12. Challenges and Limitations¶
Hallucination¶
Models generate plausible but factually incorrect information. This is inherent to language models (they model probability distributions over tokens, not truth).
Mitigation:
- RAG (ground responses in retrieved facts).
- Fine-tuning on factual data.
- Self-consistency checking (generate multiple responses, check agreement).
- Confidence calibration and "I don't know" training.
- Citation and source attribution.
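Self-consistency checking is simple to implement as a majority vote over repeated samples. A minimal sketch (the iterator stands in for repeated sampling from an LLM at temperature > 0; thresholds are illustrative):

```python
from collections import Counter

def self_consistent_answer(sample_fn, n=5, min_agreement=0.6):
    """Sample n answers and return the majority one only when agreement
    is high enough; otherwise abstain rather than risk a hallucination."""
    answers = [sample_fn() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / n >= min_agreement else None

# Stand-in for n calls to a model; one sample disagrees with the rest.
samples = iter(["42", "42", "7", "42", "42"])
answer = self_consistent_answer(lambda: next(samples))   # → "42" (4/5 agree)
```

Abstaining (returning `None`) when agreement is low is the point: a confidently wrong answer is worse than a deferral to a human or a retrieval step.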
Context Window Limitations¶
Despite growing context windows (up to 1M+ tokens), challenges remain:
- Lost in the middle: Models attend more to the beginning and end of the context, potentially missing information in the middle.
- Cost: Longer contexts are more expensive (quadratic attention cost).
- Retrieval is often better: For large knowledge bases, RAG outperforms stuffing everything into context.
Safety and Misuse¶
- Prompt injection: Adversarial inputs that override system instructions.
- Jailbreaking: Techniques to bypass safety filters.
- Data poisoning: Corrupting training data to influence model behavior.
- Deepfakes: Generating misleading content.
Cost¶
- Training: GPT-4's training run is estimated at $100M+. Even fine-tuning large models costs thousands of dollars.
- Inference: Serving at scale requires significant GPU infrastructure. Cost per token varies 100× between model sizes.
13. Best Practices for LLM Engineering¶
- Start with prompting: Try prompt engineering before fine-tuning. Many problems are solvable with good prompts.
- Choose the right model size: Smaller models are cheaper and faster. Only use large models when quality demands it. Consider model routing (small model for easy queries, large for hard ones).
- Evaluate systematically: Use benchmarks (MMLU, HumanEval, MT-Bench) AND task-specific evaluations AND human evaluation.
- Monitor in production: Track latency, error rates, user satisfaction, and output quality. Watch for drift.
- Optimize inference: Use quantization, batching, KV caching, and appropriate serving infrastructure.
- Cache aggressively: Semantic caching (similar queries → cached response), exact caching (identical prompts), prefix caching.
- Handle failures gracefully: LLMs are non-deterministic. Implement retries, fallbacks, and output validation.
- Version everything: Model versions, prompt templates, evaluation datasets, fine-tuning data.
- Security first: Implement input validation, output filtering, rate limiting, and prompt injection defenses.
- Keep up with the field: The landscape changes rapidly. New models, techniques, and best practices emerge monthly.
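Of the caching strategies listed above, exact caching is the easiest to add. A minimal sketch (hypothetical class and names; semantic caching would replace the hash lookup with an embedding similarity search):

```python
import hashlib

class PromptCache:
    """Exact-match cache: hash the full prompt and reuse the stored
    completion for identical requests (a toy version of exact caching)."""
    def __init__(self):
        self.store = {}
        self.misses = 0

    def complete(self, prompt, generate):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = generate(prompt)   # only call the LLM on a miss
        return self.store[key]

cache = PromptCache()
fake_llm = lambda p: p.upper()                   # stand-in for a model call
a = cache.complete("hello", fake_llm)
b = cache.complete("hello", fake_llm)            # served from cache, no call
```

In production you would add a TTL and treat sampled (non-deterministic) outputs carefully — caching is safest for deterministic, temperature-0 calls.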