Large Language Models (LLMs)¶
Large Language Models (LLMs) are deep learning models trained on vast amounts of text data to understand and generate human-like language. They are based on transformer architectures (introduced in "Attention is All You Need," Vaswani et al., 2017) and have revolutionized natural language processing by achieving state-of-the-art performance on tasks like translation, summarization, question answering, code generation, and reasoning.
LLMs represent a paradigm shift in AI: rather than training task-specific models, a single pre-trained foundation model can be adapted (via fine-tuning or prompting) to thousands of downstream tasks. This chapter provides a deep technical exploration of how LLMs work—from the mathematics of attention to production deployment strategies.
1. The Transformer Architecture¶
The transformer is the foundational architecture behind all modern LLMs. Unlike previous sequence models (RNNs, LSTMs) that process tokens sequentially, transformers process all tokens in parallel using self-attention, enabling massive parallelism during training and capturing long-range dependencies effectively.
Original Transformer (Encoder-Decoder)¶
The original transformer has two components:
- Encoder: Processes input sequence and produces contextualized representations. Used for understanding tasks (BERT, T5 encoder).
- Decoder: Generates output sequence autoregressively (one token at a time). Used for generation tasks (GPT series).
Most modern LLMs use decoder-only architectures (GPT, LLaMA, Claude, Gemini) because autoregressive generation is sufficient for most tasks when the model is large enough.
Transformer Block Components¶
Each transformer block contains:
- Multi-Head Self-Attention: Allows each token to attend to all other tokens in the sequence.
- Feed-Forward Network (FFN): Two-layer MLP applied to each position independently. Typically FFN(x) = GELU(xW₁ + b₁)W₂ + b₂, where the inner dimension is 4× the model dimension.
- Layer Normalization: Stabilizes training by normalizing activations. Modern LLMs use Pre-Norm (normalize before attention/FFN) rather than Post-Norm (original transformer).
- Residual Connections: Skip connections that add the input to the output of each sub-layer, enabling gradient flow through deep networks. Post-Norm computes output = LayerNorm(x + SubLayer(x)); Pre-Norm computes output = x + SubLayer(LayerNorm(x)).
Transformer Variants¶
| Architecture | Direction | Masking | Pre-training Objective | Use Cases | Examples |
|---|---|---|---|---|---|
| Encoder-only | Bidirectional | No causal mask | Masked Language Modeling (MLM) | Classification, NER, embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Left-to-right | Causal mask | Next Token Prediction (NTP) | Text generation, reasoning, chat | GPT-4, Claude, LLaMA, Gemini |
| Encoder-Decoder | Both | Causal in decoder | Span corruption / denoising | Translation, summarization | T5, BART, Flan-T5 |
Pseudocode (Transformer Block)¶
class TransformerBlock:
attention: MultiHeadAttention
ffn: FeedForwardNetwork
norm1: LayerNorm
norm2: LayerNorm
function forward(x: Tensor, mask: Tensor = None) -> Tensor
// Pre-norm architecture (modern LLMs)
// Self-attention with residual connection
normed = self.norm1(x)
attention_output = self.attention(normed, normed, normed, mask)
x = x + attention_output // Residual
// Feed-forward with residual connection
normed = self.norm2(x)
ffn_output = self.ffn(normed)
x = x + ffn_output // Residual
return x
class Transformer:
embedding: EmbeddingLayer
blocks: list[TransformerBlock]
head: LinearLayer // Projects to vocabulary size
function forward(token_ids: Tensor) -> Tensor
x = self.embedding(token_ids)
for block in self.blocks:
x = block(x)
logits = self.head(x) // [batch, seq_len, vocab_size]
return logits
2. Tokenization¶
Tokenization is the process of breaking text into smaller units (tokens) that the model can process. Unlike simple word splitting, modern tokenizers use subword algorithms to handle out-of-vocabulary words, reduce vocabulary size, and enable multilingual support.
Tokenization Algorithms¶
Byte Pair Encoding (BPE)¶
Used by GPT models. Iteratively merges the most frequent pairs of adjacent characters/subwords:
- Start with character-level vocabulary (plus special tokens).
- Count all pairs of adjacent symbols in the training corpus.
- Merge the most frequent pair into a new symbol.
- Repeat until desired vocabulary size (e.g., 50,257 for GPT-2).
Example: "running" → ["run", "ning"] if "ning" is a learned subword.
Key insight: BPE naturally handles common words as single tokens and rare/novel words as sequences of subwords.
WordPiece¶
Used by BERT. Similar to BPE but uses a likelihood-based criterion:
- Merge the pair that maximizes the language model likelihood of the training data.
- Splits words with "##" prefix for continuation subwords: "unbelievable" → ["un", "##believ", "##able"].
SentencePiece¶
Used by T5, LLaMA, PaLM. A language-independent tokenizer that:
- Treats the input as a raw byte stream (no pre-tokenization required).
- Supports both BPE and Unigram language model algorithms.
- Handles whitespace explicitly with the "▁" character.
- Enables truly language-agnostic tokenization (critical for multilingual models).
Byte-level BPE¶
Used by GPT-2+. Operates on raw bytes rather than Unicode characters:
- Vocabulary starts with 256 byte tokens.
- Can represent any text in any language without unknown tokens.
- Avoids character-level preprocessing entirely.
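As a minimal illustration in plain Python (not an actual GPT-2 tokenizer), the byte-level starting point is simply the UTF-8 bytes of the text, so no input can ever be out of vocabulary:

```python
# Byte-level starting point: any string decomposes into bytes 0-255,
# so there is never an unknown token, even for accents, emoji, or rare scripts.
text = "héllo"
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)        # "é" becomes two bytes in UTF-8
print(len(byte_tokens))   # 6 bytes for 5 characters
```

BPE merges are then learned on top of these 256 base byte tokens.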
Tokenization Comparison¶
| Algorithm | Vocabulary Size | OOV Handling | Language Support | Used By |
|---|---|---|---|---|
| BPE | 10K-50K | Subword merging | Good | GPT-2, GPT-3, GPT-4 |
| WordPiece | 30K | Subword splitting | Good | BERT, DistilBERT |
| SentencePiece | 32K-128K | Character-level | Excellent (multilingual) | T5, LLaMA, PaLM |
| Byte-level BPE | 50K-100K | Byte fallback | Universal | GPT-2+, LLaMA 2 |
| Character-level | ~100 | Perfect | Universal | Rare (sequences too long) |
Pseudocode (BPE Tokenization)¶
class BPETokenizer:
vocab: dict[str, int] // token → id
merges: list[tuple[str, str]] // learned merge pairs, in priority order
function train(corpus: str, vocab_size: int)
// Initialize vocabulary with individual characters
self.vocab = {char: id for id, char in enumerate(unique_chars(corpus))}
while len(self.vocab) < vocab_size:
// Count all adjacent pairs
pair_counts = count_adjacent_pairs(corpus)
// Find most frequent pair
best_pair = argmax(pair_counts)
// Merge this pair everywhere in corpus
corpus = merge_pair(corpus, best_pair)
// Add merged token to vocabulary
new_token = best_pair[0] + best_pair[1]
self.vocab[new_token] = len(self.vocab)
self.merges.append(best_pair)
function tokenize(text: str) -> list[int]
tokens = list(text) // Start with characters
for merge_pair in self.merges:
tokens = apply_merge(tokens, merge_pair)
return [self.vocab[token] for token in tokens]
function apply_merge(tokens: list[str], pair: tuple[str, str]) -> list[str]
new_tokens = []
i = 0
while i < len(tokens):
if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
new_tokens.append(pair[0] + pair[1]) // Merge
i += 2
else:
new_tokens.append(tokens[i])
i += 1
return new_tokens
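The pseudocode above can be condensed into a runnable sketch. This is a toy word-level BPE (no special tokens, no pre-tokenization, frequency ties broken by first occurrence), just to make the merge loop concrete:

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Represent each word as a tuple of symbols, with frequency counts.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Merge the best pair everywhere in the corpus.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

def tokenize(word, merges):
    tokens = list(word)  # start at character level
    for a, b in merges:  # apply merges in learned priority order
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=3)
print(tokenize("lowest", merges))  # ['lowe', 's', 't']
```

With this tiny corpus, the learned merges are ("l","o"), ("lo","w"), ("low","e"): the common stem fuses into larger units while rare suffixes stay as characters.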
Tokenization Challenges¶
- Token Count Mismatch: Same text can have different token counts across models. "Hello world" might be 2 tokens in one model and 3 in another.
- Special Tokens: Models use special tokens like <|endoftext|>, [CLS], [SEP], <s>, </s> for formatting and control flow.
- Multilingual Efficiency: Languages with non-Latin scripts (Chinese, Japanese, Arabic) often require more tokens per word, making inference more expensive.
- Numbers and Code: Tokenizers often split numbers digit-by-digit and code in unexpected ways, affecting arithmetic and code generation abilities.
- Tokenizer-Model Coupling: A model can only use the tokenizer it was trained with. Changing the tokenizer requires retraining.
3. Embeddings (Vectorization)¶
After tokenization, tokens are converted into dense vector representations (embeddings) that capture semantic meaning. These embeddings exist in a high-dimensional space where geometric relationships encode linguistic relationships.
Embedding Process¶
- Token Embedding: Each token ID maps to a learned embedding vector (typically 768-8192 dimensions). This is simply a lookup in a learned embedding matrix E ∈ ℝ^(vocab_size × d_model).
- Positional Embedding: Adds position information since transformers have no inherent sense of order:
- Learned Positional Embeddings: A separate embedding matrix for positions (GPT, BERT). Limited to a fixed maximum sequence length.
- Sinusoidal Positional Encoding: Uses sine and cosine functions of different frequencies (original transformer). Can generalize to unseen lengths.
- Rotary Positional Embeddings (RoPE): Encodes position by rotating the embedding vector (LLaMA, Mistral). Enables better length extrapolation.
- ALiBi (Attention with Linear Biases): Adds a linear bias to attention scores based on distance (BLOOM). No positional embeddings needed.
- Layer/Segment Embedding: In some models, adds token type information (e.g., sentence A vs. sentence B in BERT).
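Of the schemes above, the sinusoidal encoding is the easiest to write down exactly. A minimal NumPy version of the formula from the original transformer paper (not tied to any particular model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]           # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]          # [1, d_model/2]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(128, 64)
print(pe.shape)  # (128, 64)
```

Each position gets a unique pattern across frequencies, and because the functions are periodic, the encoding is defined for any position, which is why it can generalize to unseen lengths.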
Embedding Properties¶
- Semantic Similarity: Words with similar meanings have similar embeddings. cosine_similarity(embed("king"), embed("queen")) is high.
- Linear Relationships: Word analogies emerge as vector arithmetic: embed("king") - embed("man") + embed("woman") ≈ embed("queen").
- Contextual Embeddings: In transformers, the same word gets different embeddings in different contexts. "bank" in "river bank" vs. "bank account" will have different representations after attention layers.
- Clustering: Embeddings naturally cluster by topic, language, and semantic category in the high-dimensional space.
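Cosine similarity, the standard measure behind these properties, is a one-liner. A sketch with toy 4-dimensional vectors (illustrative values, not embeddings from a real model):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" chosen so that king/queen point in similar directions.
king  = np.array([0.9, 0.8, 0.1, 0.2])
queen = np.array([0.8, 0.9, 0.2, 0.1])
apple = np.array([0.1, 0.2, 0.9, 0.8])

print(cosine_similarity(king, queen))  # close to 1 (semantically similar)
print(cosine_similarity(king, apple))  # much lower (unrelated)
```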
Embedding Dimensions Across Models¶
| Model | Embedding Dimension | Vocabulary Size | Total Embedding Params | Positional Encoding |
|---|---|---|---|---|
| BERT-base | 768 | 30K | ~23M | Learned |
| GPT-2 | 768 | 50,257 | ~39M | Learned |
| GPT-3 (175B) | 12,288 | 50,257 | ~614M | Learned |
| LLaMA-7B | 4,096 | 32,000 | ~131M | RoPE |
| LLaMA-70B | 8,192 | 32,000 | ~262M | RoPE |
| Mistral-7B | 4,096 | 32,000 | ~131M | RoPE |
Pseudocode (Embedding Layer)¶
class EmbeddingLayer:
token_embeddings: Tensor // [vocab_size, d_model]
position_encoder: PositionEncoder // RoPE, learned, or sinusoidal
function forward(token_ids: Tensor) -> Tensor
// Token embeddings via lookup
token_embeds = self.token_embeddings[token_ids] // [batch, seq_len, d_model]
// Add positional information
positions = range(token_ids.shape[1])
output = self.position_encoder(token_embeds, positions)
return output
class RoPE:
// Rotary Positional Embeddings
function forward(x: Tensor, positions: Tensor) -> Tensor
d = x.shape[-1]
// Generate rotation frequencies
freqs = 1.0 / (10000 ** (arange(0, d, 2) / d))
// Apply rotation to pairs of dimensions
angles = positions[:, None] * freqs[None, :]
cos_angles = cos(angles)
sin_angles = sin(angles)
// Rotate embedding pairs
x_rotated = rotate_half(x, cos_angles, sin_angles)
return x_rotated
4. The Attention Mechanism (Deep Dive)¶
Attention is the core innovation of transformers, allowing models to dynamically focus on relevant parts of the input when processing each token. It replaces the fixed-length bottleneck of RNNs with a mechanism that can look at the entire sequence at once.
Scaled Dot-Product Attention¶
The fundamental attention computation:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Where:
- Q (Query): "What am I looking for?" — represents the current token's question to other tokens. Shape: [seq_len, d_k]
- K (Key): "What do I contain?" — represents what each token offers. Shape: [seq_len, d_k]
- V (Value): "What information do I provide?" — the actual content to aggregate. Shape: [seq_len, d_v]
- d_k: Dimension of keys (the scaling factor prevents softmax from saturating into one-hot distributions for large dimensions)
Attention Step by Step¶
- Compute Similarity Scores: QK^T gives a [seq_len × seq_len] matrix of similarity scores between every pair of tokens.
- Scale: Divide by √d_k to keep gradients stable. Without scaling, dot products grow with dimension, causing softmax to produce near-one-hot distributions.
- Apply Causal Mask (decoder): Set future positions to -∞ so they become 0 after softmax (prevents the model from "looking ahead" during generation).
- Softmax: Convert scores to probabilities (attention weights sum to 1 for each query token).
- Weighted Sum: Multiply attention weights by values to produce the output — a contextually-informed representation of each token.
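The five steps above can be traced end to end with a small NumPy sketch (single head, no batching, random inputs):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # steps 1-2: similarity + scale
    mask = np.triu(np.ones(scores.shape, dtype=bool), 1)  # step 3: future positions
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)                             # step 4: rows sum to 1
    return weights @ V, weights                           # step 5: weighted sum

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = causal_attention(Q, K, V)
print(out.shape)                        # (5, 8)
print(np.allclose(w.sum(axis=-1), 1))   # True: each query's weights sum to 1
print(np.allclose(w[0, 1:], 0))         # True: token 0 cannot attend to the future
```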
Multi-Head Attention¶
Instead of a single attention function, use multiple "heads" in parallel, each projecting Q, K, V into a different subspace. This allows the model to simultaneously attend to different types of relationships:
- Head 1: Might learn syntactic dependencies (subject-verb agreement).
- Head 2: Might learn semantic relationships (synonyms, antonyms).
- Head 3: Might learn positional patterns (adjacent tokens).
- Head 4: Might learn long-range coreference (pronoun resolution).
After computing attention for each head, outputs are concatenated and projected:
MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₕ) × W_O
where headᵢ = Attention(QW_Qᵢ, KW_Kᵢ, VW_Vᵢ)
Pseudocode (Complete Attention)¶
function scaled_dot_product_attention(
query: Tensor, // [batch, num_heads, seq_len_q, d_k]
key: Tensor, // [batch, num_heads, seq_len_k, d_k]
value: Tensor, // [batch, num_heads, seq_len_k, d_v]
mask: Tensor = None
) -> tuple[Tensor, Tensor]
d_k = query.shape[-1]
// Step 1: Compute attention scores
scores = matmul(query, key.transpose(-2, -1)) // [batch, heads, seq_q, seq_k]
// Step 2: Scale
scores = scores / sqrt(d_k)
// Step 3: Apply causal mask (if provided)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
// Step 4: Softmax to get attention weights
attention_weights = softmax(scores, dim=-1) // [batch, heads, seq_q, seq_k]
// Step 5: Weighted sum of values
output = matmul(attention_weights, value) // [batch, heads, seq_q, d_v]
return output, attention_weights
class MultiHeadAttention:
num_heads: int
d_model: int
d_k: int // = d_model // num_heads
W_Q: Tensor // [d_model, d_model]
W_K: Tensor // [d_model, d_model]
W_V: Tensor // [d_model, d_model]
W_O: Tensor // [d_model, d_model]
function forward(
query: Tensor, // [batch, seq_len, d_model]
key: Tensor,
value: Tensor,
mask: Tensor = None
) -> Tensor
batch_size = query.shape[0]
// Linear projections
Q = matmul(query, self.W_Q) // [batch, seq_len, d_model]
K = matmul(key, self.W_K)
V = matmul(value, self.W_V)
// Reshape to [batch, num_heads, seq_len, d_k]
Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
// Apply attention
output, weights = scaled_dot_product_attention(Q, K, V, mask)
// Concatenate heads: [batch, seq_len, d_model]
output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
// Final linear projection
output = matmul(output, self.W_O)
return output
Attention Complexity and Optimizations¶
Standard Attention:
- Time Complexity: O(n² × d), where n is sequence length and d is dimension.
- Space Complexity: O(n²) for the attention matrix.
This quadratic scaling is the primary bottleneck for long sequences. Key optimizations:
| Optimization | Complexity | Approach | Used By |
|---|---|---|---|
| Flash Attention | O(n²) time, O(n) memory | Tiling + kernel fusion, avoids materializing full attention matrix | LLaMA 2+, Mistral, most modern LLMs |
| Flash Attention 2 | O(n²) time, O(n) memory | Improved parallelism, better GPU utilization (2× faster than v1) | State of the art |
| Multi-Query Attention (MQA) | O(n² × d/h) | Share K, V across all heads (only Q differs per head) | PaLM, Falcon |
| Grouped-Query Attention (GQA) | O(n² × d×g/h) | Groups of heads share K, V (compromise between MHA and MQA) | LLaMA 2 70B, Mistral |
| Sparse Attention | O(n × √n) | Only attend to local windows + sparse global patterns | Longformer, BigBird |
| Linear Attention | O(n × d²) | Approximate softmax with kernel trick | Performer, Linformer |
| Ring Attention | O(n²/p) per device | Distribute attention across devices for very long sequences | Research |
Types of Attention in LLMs¶
- Self-Attention: Q, K, V all come from the same sequence. Every token attends to every other token. Used in both encoders and decoders.
- Cross-Attention: Q from one sequence (decoder), K and V from another (encoder output). Used in encoder-decoder models for tasks like translation.
- Causal (Masked) Self-Attention: Tokens can only attend to previous tokens and themselves. Enforced by a triangular mask. Essential for autoregressive generation.
5. Pre-training¶
Pre-training is the process of training a large model on massive unlabeled text corpora to learn general language understanding. This is the most expensive phase (millions of dollars, weeks on thousands of GPUs).
Pre-training Objectives¶
Next Token Prediction (Autoregressive / GPT-style)¶
The dominant objective for decoder-only LLMs. Given a sequence of tokens, predict the next token:
Loss = -Σ log P(tokenᵢ | token₁, token₂, ..., tokenᵢ₋₁)
The model learns to model the probability distribution over the entire vocabulary for each position, conditioned on all previous tokens. This simple objective, at sufficient scale, leads to emergent capabilities like reasoning, translation, and code generation.
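Concretely, the loss is just the average cross-entropy between the model's predicted distribution and the actual next token at each position. A NumPy sketch with made-up logits:

```python
import numpy as np

def next_token_loss(logits, targets):
    """logits: [seq_len, vocab_size]; targets[i] is the true token at position i+1."""
    # Log-softmax over the vocabulary at each position.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-probability of the true next token, averaged over positions.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 positions, toy vocabulary of 10
targets = np.array([3, 1, 4, 1])    # the "true" next tokens
print(next_token_loss(logits, targets))
```

With random logits the loss sits near ln(vocab_size) ≈ 2.3, the loss of a uniform guesser; training drives it toward the entropy of the data.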
Masked Language Modeling (BERT-style)¶
Randomly mask 15% of tokens and predict them from context:
Input: "The [MASK] sat on the [MASK]"
Output: "The cat sat on the mat"
Bidirectional — the model sees both left and right context. Better for understanding tasks but cannot generate text autoregressively.
Span Corruption (T5-style)¶
Replace random spans of text with sentinel tokens and predict the original spans:
Input: "The <X> sat on the <Y>"
Output: "<X> cat <Y> mat"
Combines benefits of MLM with sequence-to-sequence format.
Pre-training Data¶
| Dataset | Size | Content | Used By |
|---|---|---|---|
| Common Crawl | ~60TB text | Web pages (filtered) | GPT-3, many others |
| The Pile | 825 GB | Curated mix (academic, code, books, web) | GPT-NeoX, Pythia |
| RedPajama | 1.2T tokens | Open reproduction of LLaMA training data | RedPajama models |
| FineWeb | 15T tokens | High-quality filtered web data | Open-source community |
| StarCoder Data | 783 GB | Source code from GitHub | StarCoder, CodeLLaMA |
| Wikipedia | ~20 GB | Encyclopedia articles | Nearly all models |
| Books3 | ~100 GB | Digitized books | GPT-3 (controversial) |
Data Quality and Curation¶
Data quality is more important than quantity. Key preprocessing steps:
- Deduplication: Remove exact and near-duplicate documents (MinHash, SimHash). Duplicates cause memorization and degrade generalization.
- Quality Filtering: Remove low-quality content (boilerplate, spam, machine-generated text). Use classifiers trained on high-quality data.
- Toxicity Filtering: Remove or reduce harmful, biased, and explicit content.
- Language Identification: Tag and balance content across languages.
- Domain Mixing: Balance data from different domains (web, books, code, academic papers) to ensure broad capabilities.
Training Infrastructure¶
Training a frontier LLM requires massive distributed computing:
- GPT-3 (175B): ~3,640 petaflop-days, estimated $4.6M on cloud GPUs.
- LLaMA-65B: 2,048 A100 GPUs for ~21 days, ~1.4M GPU-hours.
- GPT-4: Estimated 10,000+ A100s for months, cost ~$100M+.
Distributed Training Strategies:
| Strategy | Description | Use Case |
|---|---|---|
| Data Parallelism (DP) | Same model on each GPU, different data batches | Small models that fit in one GPU |
| Tensor Parallelism (TP) | Split individual layers across GPUs | Large layers that don't fit in one GPU |
| Pipeline Parallelism (PP) | Split model layers across GPUs sequentially | Very deep models |
| Fully Sharded Data Parallelism (FSDP) | Shard model parameters, gradients, and optimizer states | Modern default for large models |
| 3D Parallelism | Combine DP + TP + PP | Frontier model training |
6. Scaling Laws¶
Scaling laws describe the predictable relationship between model size, dataset size, compute budget, and model performance. These empirical laws guide how to allocate resources for training.
Kaplan Scaling Laws (OpenAI, 2020)¶
Performance (measured by cross-entropy loss) follows a power law:
L(N) ∝ N^(-0.076) // Loss vs. parameters
L(D) ∝ D^(-0.095) // Loss vs. dataset size
L(C) ∝ C^(-0.050) // Loss vs. compute
Key insight: Scaling up model size is more efficient than scaling up data, suggesting larger models trained on relatively less data are optimal.
Chinchilla Scaling Laws (DeepMind, 2022)¶
Chinchilla challenged Kaplan's findings:
Optimal tokens ≈ 20 × parameters
This means a 10B parameter model should be trained on ~200B tokens. Most models before Chinchilla (including GPT-3) were significantly undertrained relative to their size.
Impact: Led to smaller, better-trained models. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) despite being 4× smaller.
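Using the widely cited approximation C ≈ 6ND (training FLOPs ≈ 6 × parameters × tokens), the Chinchilla rule reduces to a couple of lines of arithmetic:

```python
def chinchilla_optimal_tokens(params):
    # Chinchilla heuristic: roughly 20 training tokens per parameter.
    return 20 * params

def training_flops(params, tokens):
    # Common estimate: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

params = 10e9                                # 10B-parameter model
tokens = chinchilla_optimal_tokens(params)   # ~200B tokens
print(f"{tokens:.0e} tokens")
print(f"{training_flops(params, tokens):.1e} FLOPs")  # 1.2e+22
```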
Modern Scaling Observations¶
- Emergent Abilities: Certain capabilities (multi-step reasoning, code generation) appear suddenly at specific scale thresholds rather than improving gradually.
- Inference Cost Matters: Chinchilla-optimal training doesn't account for inference costs. Smaller, overtrained models (LLaMA approach) may be more practical since inference cost scales with model size.
- Data Walls: The internet has a finite amount of high-quality text. Synthetic data generation and data efficiency techniques are increasingly important.
7. Fine-tuning Techniques¶
Fine-tuning adapts a pre-trained model to specific tasks or behaviors using smaller, curated datasets. Modern fine-tuning ranges from full parameter updates to lightweight adapter methods.
Full Fine-tuning¶
Update all model parameters on task-specific data:
- Pros: Maximum flexibility, best task-specific performance.
- Cons: Requires storing a full copy of the model per task, expensive, risk of catastrophic forgetting.
- Compute: Comparable to pre-training on the fine-tuning dataset.
Parameter-Efficient Fine-Tuning (PEFT)¶
Train only a small subset of parameters, keeping the base model frozen:
LoRA (Low-Rank Adaptation)¶
The most popular PEFT method. Instead of updating weight matrix W directly, decompose the update into two low-rank matrices:
W' = W + ΔW = W + BA
where B ∈ ℝ^(d × r), A ∈ ℝ^(r × d), r << d
- r (rank): Typically 4-64. Controls the expressiveness vs. efficiency trade-off.
- Trainable parameters: Only 2 × d × r per adapted layer (vs. d² for full fine-tuning).
- Savings: For a 7B model, LoRA might train 0.1-1% of parameters.
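The savings are easy to quantify. For one d×d weight matrix at a LLaMA-7B-like dimension (d = 4096 is an assumption here, and r = 8 is a typical rank choice):

```python
def lora_params(d, r):
    # B is [d, r] and A is [r, d]: 2 * d * r trainable parameters per adapted matrix.
    return 2 * d * r

d = 4096          # model dimension (LLaMA-7B-like, for illustration)
r = 8             # LoRA rank
full = d * d      # full fine-tuning of one d×d weight matrix
lora = lora_params(d, r)
print(lora, full, f"{lora / full:.2%}")  # 65536 16777216 0.39%
```

So a rank-8 adapter trains under half a percent of that matrix's parameters, which is why whole-model LoRA runs land in the 0.1-1% range.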
Pseudocode (LoRA):
class LoRALayer:
base_weight: Tensor // [d_out, d_in] (frozen)
lora_A: Tensor // [r, d_in] (trainable)
lora_B: Tensor // [d_out, r] (trainable)
scaling: float // alpha / r
function forward(x: Tensor) -> Tensor
// Original computation + low-rank adaptation
base_output = matmul(x, self.base_weight.T)
lora_output = matmul(matmul(x, self.lora_A.T), self.lora_B.T) * self.scaling
return base_output + lora_output
QLoRA (Quantized LoRA)¶
Combines LoRA with 4-bit quantization of the base model:
- Quantize base model to 4-bit NormalFloat (NF4) format.
- Apply LoRA adapters in full precision (bfloat16).
- Use double quantization (quantize the quantization constants).
- Use paged optimizers to handle memory spikes.
Impact: Fine-tune a 65B model on a single 48GB GPU (vs. ~780GB for full fine-tuning).
Other PEFT Methods¶
| Method | Approach | Trainable Params | Performance |
|---|---|---|---|
| LoRA | Low-rank weight decomposition | 0.1-1% | Excellent |
| QLoRA | LoRA + 4-bit quantization | 0.1-1% | Near LoRA, much less memory |
| Prefix Tuning | Learnable prefix tokens in each layer | ~0.1% | Good for generation |
| Prompt Tuning | Learnable soft prompt tokens (input only) | <0.01% | Good for classification |
| Adapters | Small bottleneck layers inserted between transformer layers | 1-3% | Good, higher overhead |
| IA³ | Learned rescaling of activations | <0.01% | Competitive with LoRA |
Instruction Tuning¶
Fine-tuning on instruction-following datasets to make models helpful and aligned:
- Datasets: FLAN (1.8K tasks), Alpaca (52K instructions), ShareGPT (user conversations), OpenAssistant.
- Format: Each example is an instruction-input-output triple.
- Effect: Transforms a base model (which just predicts next tokens) into a helpful assistant.
Supervised Fine-Tuning (SFT)¶
Training on high-quality demonstrations of desired behavior:
Dataset: [(instruction, ideal_response), ...]
Loss: Cross-entropy between model output and ideal_response
SFT is the first step in the alignment pipeline (before RLHF/DPO).
8. Alignment: RLHF and DPO¶
Alignment ensures models are helpful, harmless, and honest. Raw pre-trained models can generate toxic, biased, or unhelpful content. Alignment techniques steer model behavior toward human preferences.
RLHF (Reinforcement Learning from Human Feedback)¶
The three-step process used by ChatGPT, Claude, and others:
Step 1: Supervised Fine-Tuning (SFT)
- Train on demonstrations of ideal assistant behavior.
- Result: A model that follows instructions but may still produce undesirable outputs.
Step 2: Reward Model Training
- Collect human preference data: show pairs of model responses, humans rank which is better.
- Train a reward model R(prompt, response) → scalar that predicts human preferences.
- Dataset: Thousands of comparison pairs.
Step 3: PPO (Proximal Policy Optimization)
- Use the reward model to further train the SFT model via reinforcement learning.
- The model generates responses, the reward model scores them, and PPO updates the model to increase expected reward.
- A KL divergence penalty prevents the model from diverging too far from the SFT model (prevents reward hacking).
Objective = E[R(prompt, response)] - β × KL(π_RL || π_SFT)
Pseudocode (RLHF Training Loop):
class RLHFTrainer:
policy_model: LLM // Model being trained
reference_model: LLM // Frozen SFT model
reward_model: RewardModel // Trained on human preferences
beta: float = 0.1 // KL penalty coefficient
function train_step(prompts: list[str])
// Generate responses with current policy
responses = self.policy_model.generate(prompts)
// Score with reward model
rewards = self.reward_model.score(prompts, responses)
// Compute KL penalty
policy_logprobs = self.policy_model.log_prob(responses)
ref_logprobs = self.reference_model.log_prob(responses)
kl_penalty = policy_logprobs - ref_logprobs
// Combined objective
objective = rewards - self.beta * kl_penalty
// Update policy with PPO
ppo_update(self.policy_model, objective)
DPO (Direct Preference Optimization)¶
A simpler alternative to RLHF that eliminates the need for a separate reward model:
- Directly optimize the policy model using preference pairs (chosen vs. rejected responses).
- Mathematically equivalent to RLHF with a specific reward model class.
- Much simpler to implement and more stable to train.
Loss_DPO = -log(σ(β × (log π(chosen)/π_ref(chosen) - log π(rejected)/π_ref(rejected))))
DPO Advantages:
- No reward model needed.
- No RL training loop (just supervised learning on preferences).
- More stable and reproducible.
- Lower computational cost.
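The DPO loss can be computed directly from sequence log-probabilities. A NumPy sketch with made-up log-prob values (per-example, not batched):

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Inputs are summed log-probs of each full response under each model."""
    chosen_ratio = policy_chosen - ref_chosen        # log π(chosen)/π_ref(chosen)
    rejected_ratio = policy_rejected - ref_rejected  # log π(rejected)/π_ref(rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    return float(-np.log(1 / (1 + np.exp(-margin))))  # -log σ(margin)

# Made-up log-probs: the policy already prefers the chosen response more than
# the reference model does, so the loss falls below -log σ(0) = log 2.
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-13.0, beta=0.1)
print(loss)
```

The gradient of this loss pushes probability mass toward chosen responses and away from rejected ones, scaled by how wrong the current margin is.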
Constitutional AI (Anthropic)¶
Self-supervised alignment using a set of principles (constitution):
- Model generates responses.
- Model critiques its own responses using the constitution.
- Model revises responses based on critiques.
- Fine-tune on the revised responses.
Reduces reliance on human labelers while maintaining alignment quality.
Comparison¶
| Method | Complexity | Stability | Data Needed | Performance |
|---|---|---|---|---|
| SFT | Low | High | Demonstrations | Good baseline |
| RLHF (PPO) | High | Low (tricky to tune) | Preferences + RL | Best (when tuned well) |
| DPO | Medium | High | Preferences only | Near RLHF |
| Constitutional AI | Medium | Medium | Principles + self-play | Good alignment |
| ORPO | Low | High | Preferences only | Competitive |
9. Inference Optimization¶
Serving LLMs in production requires optimizing for latency, throughput, and cost. Inference is fundamentally different from training — it's memory-bandwidth bound (not compute-bound) and must handle variable-length sequences efficiently.
The KV Cache¶
During autoregressive generation, each new token requires attending to all previous tokens. Without caching, this requires recomputing attention for all previous tokens at each step (O(n²) total work for n tokens).
The KV Cache stores the key and value projections for all previous tokens:
// Without KV cache: recompute K, V for all tokens at each step
// With KV cache: only compute K, V for the new token, append to cache
class KVCache:
keys: list[Tensor] // One per layer: [batch, num_heads, seq_len, d_k]
values: list[Tensor] // One per layer: [batch, num_heads, seq_len, d_v]
function update(layer_idx: int, new_key: Tensor, new_value: Tensor)
self.keys[layer_idx] = concat(self.keys[layer_idx], new_key, dim=2)
self.values[layer_idx] = concat(self.values[layer_idx], new_value, dim=2)
KV Cache Memory: For a 70B-class model, a single 4K-token sequence needs on the order of 10 GB of KV cache at fp16 with full multi-head attention, and a serving batch multiplies that. This is why memory, not compute, is the bottleneck (and why MQA/GQA exist to shrink the cache).
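The cache size follows directly from the shapes. A back-of-the-envelope calculation assuming a LLaMA-70B-like shape (80 layers, d_model = 8192, fp16); the GQA line assumes 8 KV heads of dimension 128, which is an illustrative configuration:

```python
def kv_cache_bytes(n_layers, kv_dim, seq_len, bytes_per_param=2):
    # One K and one V tensor of shape [seq_len, kv_dim] per layer.
    return 2 * n_layers * kv_dim * seq_len * bytes_per_param

# Full multi-head attention: KV dimension equals d_model.
mha_gb = kv_cache_bytes(80, 8192, 4096) / 1e9
print(f"{mha_gb:.1f} GB per sequence")              # 10.7 GB

# Grouped-query attention: only 8 KV heads × head_dim 128 are cached.
gqa_gb = kv_cache_bytes(80, 8 * 128, 4096) / 1e9
print(f"{gqa_gb:.1f} GB with grouped-query attention")  # 1.3 GB
```

Shrinking the cached KV dimension is exactly the lever MQA and GQA pull, at the cost of some attention expressiveness.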
Quantization¶
Reduce model precision to decrease memory usage and increase throughput:
| Precision | Bits per Param | Memory (7B model) | Quality Loss | Speed Gain |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | Baseline | 1× |
| FP16/BF16 | 16 | 14 GB | Negligible | ~2× |
| INT8 | 8 | 7 GB | Minimal | ~2-4× |
| INT4 (GPTQ) | 4 | 3.5 GB | Small | ~4-8× |
| INT4 (AWQ) | 4 | 3.5 GB | Very small | ~4-8× |
| GGUF (mixed) | 2-8 | 2-7 GB | Varies | ~4-10× |
Quantization Methods:
- Post-Training Quantization (PTQ): Quantize after training using calibration data. Methods: GPTQ, AWQ, SqueezeLLM.
- Quantization-Aware Training (QAT): Simulate quantization during training. Higher quality but more expensive.
- GGUF: File format for quantized models used by llama.cpp. Supports mixed-precision quantization (more bits for important layers).
Speculative Decoding¶
Use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with the large model:
- Draft model generates K candidate tokens quickly.
- Large model scores all K tokens in a single forward pass.
- Accept tokens that the large model agrees with; reject and regenerate from the first disagreement.
Speedup: 2-3× with no quality loss (mathematically equivalent to sampling from the large model).
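The accept/reject step can be sketched with a greedy-verification simplification (real speculative decoding uses a rejection-sampling rule over token probabilities that preserves the large model's distribution exactly; this argmax variant only illustrates the control flow):

```python
def verify_draft(draft_tokens, target_predictions):
    """Greedy variant: accept the draft prefix where the large model's
    argmax prediction agrees with the draft model's token."""
    accepted = []
    for drafted, predicted in zip(draft_tokens, target_predictions):
        if drafted == predicted:
            accepted.append(drafted)
        else:
            # First disagreement: take the large model's token and stop.
            accepted.append(predicted)
            break
    return accepted

# Draft proposes 5 tokens; the large model verifies all 5 in ONE forward pass.
draft = [12, 7, 99, 3, 41]
target = [12, 7, 42, 8, 5]   # large model's argmax at each position
print(verify_draft(draft, target))  # [12, 7, 42]: 2 accepted + 1 corrected
```

Every verification pass yields at least one token (the correction), and in the best case K + 1 tokens, which is where the 2-3× speedup comes from.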
Continuous Batching¶
Traditional batching waits for all sequences in a batch to finish. Continuous batching:
- Immediately start processing new requests as old ones finish.
- Dynamically add/remove sequences from the batch.
- Maximizes GPU utilization.
Used by vLLM, TGI, and all modern serving frameworks.
PagedAttention (vLLM)¶
Inspired by OS virtual memory paging:
- KV cache is stored in non-contiguous memory blocks (pages).
- Pages are allocated on demand (no pre-allocation waste).
- Enables sharing KV cache across sequences with common prefixes.
- Reduces memory waste by 60-80% compared to naive allocation.
Inference Serving Systems¶
| System | Key Innovation | Performance | Use Case |
|---|---|---|---|
| vLLM | PagedAttention, continuous batching | Very high throughput | Production serving |
| TGI (HuggingFace) | Tensor parallelism, Flash Attention | High throughput | HuggingFace ecosystem |
| TensorRT-LLM (NVIDIA) | GPU-optimized kernels, FP8 | Lowest latency on NVIDIA | Maximum performance |
| Ollama | Simple local serving, GGUF models | Easy to use | Local development |
| llama.cpp | CPU inference, GGUF quantization | Runs anywhere | Edge, desktop |
| SGLang | RadixAttention, structured generation | Optimized for structured output | Complex pipelines |
10. Popular LLMs¶
Proprietary Models¶
| Model | Organization | Parameters | Context Window | Key Features |
|---|---|---|---|---|
| GPT-4o | OpenAI | ~1.7T (est., MoE) | 128K tokens | Multimodal (text, image, audio), strong reasoning |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | 200K tokens | Constitutional AI, strong coding, long context |
| Gemini 1.5 Pro | Google | Undisclosed (MoE) | 1M tokens | Extremely long context, multimodal |
| o1 / o3 | OpenAI | Undisclosed | 128K-200K | Chain-of-thought reasoning, "thinking" models |
Open-Source / Open-Weight Models¶
| Model | Organization | Parameters | Context | License | Key Features |
|---|---|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | 128K | LLaMA 3 License | Strong general performance, widely adopted |
| Mistral / Mixtral | Mistral AI | 7B / 8×7B MoE | 32K | Apache 2.0 | Efficient, MoE architecture |
| Qwen 2.5 | Alibaba | 0.5B-72B | 128K | Apache 2.0 | Strong multilingual, coding |
| DeepSeek-V2/V3 | DeepSeek | 236B MoE | 128K | MIT | Multi-head latent attention, very efficient MoE |
| Phi-3/4 | Microsoft | 3.8B-14B | 128K | MIT | Small but capable, strong reasoning |
| Gemma 2 | Google | 2B, 9B, 27B | 8K | Gemma License | Efficient, research-friendly |
| CodeLLaMA | Meta | 7B-34B | 16K-100K | LLaMA 2 License | Code-specialized |
| StarCoder 2 | BigCode | 3B-15B | 16K | BigCode License | Code generation, fill-in-the-middle |
Mixture of Experts (MoE)¶
A key architectural innovation for scaling efficiently. Instead of a single large FFN, MoE uses multiple "expert" FFNs with a gating network that routes each token to a subset of experts:
MoE_FFN(x) = Σᵢ G(x)ᵢ × Expertᵢ(x)
where G(x) = TopK(softmax(x × W_gate))
- Sparse activation: Only K experts (typically 2) are activated per token, out of N total (e.g., 8 experts).
- Effective parameters: A naive count gives 8 × 7B = 56B total, but only 2 × 7B = 14B are active per token (shared attention layers mean real totals come in lower).
- Benefits: More total knowledge (parameters) without proportional compute increase.
- Challenges: Load balancing across experts, memory for all expert weights, routing instability.
Example: Mixtral 8×7B has 46.7B total parameters but only 12.9B active per token, making it competitive with much larger dense models.
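The routing equation above can be made concrete for a single token. A minimal sketch (toy experts and a hand-picked gate matrix, purely illustrative — real MoE layers batch this and add load-balancing losses):

```python
import numpy as np

def moe_ffn(x, W_gate, experts, k=2):
    """Top-k MoE layer for one token vector x, mirroring
    MoE_FFN(x) = sum_i G(x)_i * Expert_i(x) with only k experts run."""
    logits = x @ W_gate                     # one score per expert
    top = np.argsort(logits)[-k:]           # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                            # softmax renormalized over top-k
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Four toy "experts" that just scale their input by 1..4.
experts = [lambda x, c=c: c * x for c in (1.0, 2.0, 3.0, 4.0)]
x = np.ones(2)
W_gate = np.array([[0.0, 0.0, 0.0, 5.0],
                   [0.0, 0.0, 0.0, 5.0]])   # gate strongly favors expert 3
y = moe_ffn(x, W_gate, experts)             # ~4.0 * x: expert 3 dominates
```

The sparsity is visible in the loop: only the top-k experts are ever evaluated, which is why active FLOPs scale with k rather than with the total expert count.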
11. Multimodal Models and Diffusion¶
Vision-Language Models¶
Models that process both text and images:
- Architecture: Typically a vision encoder (ViT) + LLM, connected via a projection layer or cross-attention.
- Examples: GPT-4V/4o, Claude 3 (vision), Gemini, LLaVA.
- Capabilities: Image understanding, visual question answering, OCR, diagram interpretation.
Diffusion Models¶
While transformers dominate text, diffusion models have revolutionized image generation:
- Forward Process: Gradually add Gaussian noise to an image over T timesteps until it becomes pure noise.
- Reverse Process: Train a neural network to predict and remove noise at each step, gradually recovering the image.
Forward: x_t = √(ᾱ_t) × x_0 + √(1-ᾱ_t) × ε, ε ~ N(0, I), where ᾱ_t = ∏ₛ₌₁ᵗ (1-β_s)
Reverse: x_{t-1} = denoise(x_t, t) // Learned denoising
Key models: DALL-E 2/3, Stable Diffusion, Midjourney, Imagen.
Latent Diffusion: Perform diffusion in a compressed latent space (not pixel space) for efficiency. Stable Diffusion uses a VAE to compress images 8× before diffusion.
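The forward process has a closed form that lets you sample x_t directly from x_0 without simulating every intermediate step. A minimal NumPy sketch (DDPM-style linear noise schedule; the array stands in for an image):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t in closed form from the forward process:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
    where abar_t is the cumulative product of (1 - beta_s)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)       # linear schedule over T=1000 steps
x0 = rng.standard_normal((8, 8))            # stand-in for an image
x_T, eps = forward_diffuse(x0, t=999, betas=betas, rng=rng)
# At the final step the signal is almost gone: x_T is nearly pure noise.
```

Training the reverse process amounts to asking a network to predict `eps` from `(x_t, t)` — the closed form above is what makes that training loop cheap, since any timestep can be sampled in one shot.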
Diffusion Transformers (DiT)¶
Replace the U-Net backbone in diffusion models with a transformer:
- Better scaling properties than U-Nets.
- Used by DALL-E 3, Stable Diffusion 3, Sora.
- Enables unified transformer-based architecture for text and images.
12. Challenges and Limitations¶
Hallucination¶
Models generate plausible but factually incorrect information. This is inherent to language models (they model probability distributions over tokens, not truth).
Mitigation:
- RAG (ground responses in retrieved facts).
- Fine-tuning on factual data.
- Self-consistency checking (generate multiple responses, check agreement).
- Confidence calibration and "I don't know" training.
- Citation and source attribution.
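Self-consistency checking is simple to implement as a majority vote over repeated samples. A minimal sketch (the iterator stands in for repeated sampling from an LLM at temperature > 0; thresholds are illustrative):

```python
from collections import Counter

def self_consistent_answer(sample_fn, n=5, min_agreement=0.6):
    """Sample n answers and return the majority one only when agreement
    is high enough; otherwise abstain rather than risk a hallucination."""
    answers = [sample_fn() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / n >= min_agreement else None

# Stand-in for n calls to a model; one sample disagrees with the rest.
samples = iter(["42", "42", "7", "42", "42"])
answer = self_consistent_answer(lambda: next(samples))   # → "42" (4/5 agree)
```

Abstaining (returning `None`) when agreement is low is the point: a confidently wrong answer is worse than a deferral to a human or a retrieval step.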
Context Window Limitations¶
Despite growing context windows (up to 1M+ tokens), challenges remain:
- Lost in the middle: Models attend more to the beginning and end of the context, potentially missing information in the middle.
- Cost: Longer contexts are more expensive (quadratic attention cost).
- Retrieval is often better: For large knowledge bases, RAG outperforms stuffing everything into context.
Safety and Misuse¶
- Prompt injection: Adversarial inputs that override system instructions.
- Jailbreaking: Techniques to bypass safety filters.
- Data poisoning: Corrupting training data to influence model behavior.
- Deepfakes: Generating misleading content.
Cost¶
- Training: GPT-4's training run is estimated at $100M+. Even fine-tuning large models costs thousands of dollars.
- Inference: Serving at scale requires significant GPU infrastructure. Cost per token varies 100× between model sizes.
13. Best Practices for LLM Engineering¶
- Start with prompting: Try prompt engineering before fine-tuning. Many problems are solvable with good prompts.
- Choose the right model size: Smaller models are cheaper and faster. Only use large models when quality demands it. Consider model routing (small model for easy queries, large for hard ones).
- Evaluate systematically: Use benchmarks (MMLU, HumanEval, MT-Bench) AND task-specific evaluations AND human evaluation.
- Monitor in production: Track latency, error rates, user satisfaction, and output quality. Watch for drift.
- Optimize inference: Use quantization, batching, KV caching, and appropriate serving infrastructure.
- Cache aggressively: Semantic caching (similar queries → cached response), exact caching (identical prompts), prefix caching.
- Handle failures gracefully: LLMs are non-deterministic. Implement retries, fallbacks, and output validation.
- Version everything: Model versions, prompt templates, evaluation datasets, fine-tuning data.
- Security first: Implement input validation, output filtering, rate limiting, and prompt injection defenses.
- Keep up with the field: The landscape changes rapidly. New models, techniques, and best practices emerge monthly.
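Of the caching strategies listed above, exact caching is the easiest to add. A minimal sketch (hypothetical class and names; semantic caching would replace the hash lookup with an embedding similarity search):

```python
import hashlib

class PromptCache:
    """Exact-match cache: hash the full prompt and reuse the stored
    completion for identical requests (a toy version of exact caching)."""
    def __init__(self):
        self.store = {}
        self.misses = 0

    def complete(self, prompt, generate):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self.store:
            self.misses += 1
            self.store[key] = generate(prompt)   # only call the LLM on a miss
        return self.store[key]

cache = PromptCache()
fake_llm = lambda p: p.upper()                   # stand-in for a model call
a = cache.complete("hello", fake_llm)
b = cache.complete("hello", fake_llm)            # served from cache, no call
```

In production you would add a TTL and treat sampled (non-deterministic) outputs carefully — caching is safest for deterministic, temperature-0 calls.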