Large Language Models (LLMs)

Large Language Models (LLMs) are deep learning models trained on vast amounts of text data to understand and generate human-like language. They are based on transformer architectures (introduced in "Attention is All You Need," Vaswani et al., 2017) and have revolutionized natural language processing by achieving state-of-the-art performance on tasks like translation, summarization, question answering, code generation, and reasoning.

LLMs represent a paradigm shift in AI: rather than training task-specific models, a single pre-trained foundation model can be adapted (via fine-tuning or prompting) to thousands of downstream tasks. This chapter provides a deep technical exploration of how LLMs work—from the mathematics of attention to production deployment strategies.


1. The Transformer Architecture

The transformer is the foundational architecture behind all modern LLMs. Unlike previous sequence models (RNNs, LSTMs) that process tokens sequentially, transformers process all tokens in parallel using self-attention, enabling massive parallelism during training and capturing long-range dependencies effectively.

Original Transformer (Encoder-Decoder)

The original transformer has two components:

  • Encoder: Processes input sequence and produces contextualized representations. Used for understanding tasks (BERT, T5 encoder).
  • Decoder: Generates output sequence autoregressively (one token at a time). Used for generation tasks (GPT series).

Most modern LLMs use decoder-only architectures (GPT, LLaMA, Claude, Gemini) because autoregressive generation is sufficient for most tasks when the model is large enough.

Transformer Block Components

Each transformer block contains:

  1. Multi-Head Self-Attention: Allows each token to attend to all other tokens in the sequence.
  2. Feed-Forward Network (FFN): Two-layer MLP applied to each position independently. Typically FFN(x) = GELU(xW₁ + b₁)W₂ + b₂ where the inner dimension is 4× the model dimension.
  3. Layer Normalization: Stabilizes training by normalizing activations. Modern LLMs use Pre-Norm (normalize before attention/FFN) rather than Post-Norm (original transformer).
  4. Residual Connections: Skip connections that add the input to the output of each sub-layer, enabling gradient flow through deep networks. With Pre-Norm: output = x + SubLayer(LayerNorm(x)).

Transformer Variants

| Architecture | Direction | Masking | Pre-training Objective | Use Cases | Examples |
|---|---|---|---|---|---|
| Encoder-only | Bidirectional | No causal mask | Masked Language Modeling (MLM) | Classification, NER, embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Left-to-right | Causal mask | Next Token Prediction (NTP) | Text generation, reasoning, chat | GPT-4, Claude, LLaMA, Gemini |
| Encoder-Decoder | Both | Causal in decoder | Span corruption / denoising | Translation, summarization | T5, BART, Flan-T5 |

Pseudocode (Transformer Block)

class TransformerBlock:
    attention: MultiHeadAttention
    ffn: FeedForwardNetwork
    norm1: LayerNorm
    norm2: LayerNorm

    function forward(x: Tensor, mask: Tensor = None) -> Tensor
        // Pre-norm architecture (modern LLMs)

        // Self-attention with residual connection
        normed = self.norm1(x)
        attention_output = self.attention(normed, normed, normed, mask)
        x = x + attention_output  // Residual

        // Feed-forward with residual connection
        normed = self.norm2(x)
        ffn_output = self.ffn(normed)
        x = x + ffn_output  // Residual

        return x

class Transformer:
    embedding: EmbeddingLayer
    blocks: list[TransformerBlock]
    head: LinearLayer  // Projects to vocabulary size

    function forward(token_ids: Tensor) -> Tensor
        x = self.embedding(token_ids)
        for block in self.blocks:
            x = block(x)
        logits = self.head(x)  // [batch, seq_len, vocab_size]
        return logits

2. Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that the model can process. Unlike simple word splitting, modern tokenizers use subword algorithms to handle out-of-vocabulary words, reduce vocabulary size, and enable multilingual support.

Tokenization Algorithms

Byte Pair Encoding (BPE)

Used by GPT models. Iteratively merges the most frequent pairs of adjacent characters/subwords:

  1. Start with character-level vocabulary (plus special tokens).
  2. Count all pairs of adjacent symbols in the training corpus.
  3. Merge the most frequent pair into a new symbol.
  4. Repeat until desired vocabulary size (e.g., 50,257 for GPT-2).

Example: "running" → ["run", "ning"] if "ning" is a learned subword.

Key insight: BPE naturally handles common words as single tokens and rare/novel words as sequences of subwords.

WordPiece

Used by BERT. Similar to BPE but uses a likelihood-based criterion:

  • Merge the pair that maximizes the language model likelihood of the training data.
  • Splits words with "##" prefix for continuation subwords: "unbelievable" → ["un", "##believ", "##able"].

SentencePiece

Used by T5, LLaMA, PaLM. A language-independent tokenizer that:

  • Treats the input as a raw byte stream (no pre-tokenization required).
  • Supports both BPE and Unigram language model algorithms.
  • Handles whitespace explicitly with the "▁" character.
  • Enables truly language-agnostic tokenization (critical for multilingual models).

Byte-level BPE

Used by GPT-2+. Operates on raw bytes rather than Unicode characters:

  • Vocabulary starts with 256 byte tokens.
  • Can represent any text in any language without unknown tokens.
  • Avoids character-level preprocessing entirely.

Tokenization Comparison

| Algorithm | Vocabulary Size | OOV Handling | Language Support | Used By |
|---|---|---|---|---|
| BPE | 10K-50K | Subword merging | Good | GPT-2, GPT-3, GPT-4 |
| WordPiece | 30K | Subword splitting | Good | BERT, DistilBERT |
| SentencePiece | 32K-128K | Character-level | Excellent (multilingual) | T5, LLaMA, PaLM |
| Byte-level BPE | 50K-100K | Byte fallback | Universal | GPT-2+, LLaMA 2 |
| Character-level | ~100 | Perfect | Universal | Rare (sequences too long) |

Pseudocode (BPE Tokenization)

class BPETokenizer:
    vocab: dict[str, int]         // token → id
    merges: list[tuple[str, str]] // learned merge pairs, in priority order

    function train(corpus: str, vocab_size: int)
        // Initialize vocabulary with individual characters
        self.vocab = {char: id for id, char in enumerate(unique_chars(corpus))}

        while len(self.vocab) < vocab_size:
            // Count all adjacent pairs
            pair_counts = count_adjacent_pairs(corpus)

            // Find most frequent pair
            best_pair = argmax(pair_counts)

            // Merge this pair everywhere in corpus
            corpus = merge_pair(corpus, best_pair)

            // Add merged token to vocabulary
            new_token = best_pair[0] + best_pair[1]
            self.vocab[new_token] = len(self.vocab)
            self.merges.append(best_pair)

    function tokenize(text: str) -> list[int]
        tokens = list(text)  // Start with characters
        for merge_pair in self.merges:
            tokens = apply_merge(tokens, merge_pair)
        return [self.vocab[token] for token in tokens]

    function apply_merge(tokens: list[str], pair: tuple[str, str]) -> list[str]
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
                new_tokens.append(pair[0] + pair[1])  // Merge
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        return new_tokens

Tokenization Challenges

  • Token Count Mismatch: Same text can have different token counts across models. "Hello world" might be 2 tokens in one model and 3 in another.
  • Special Tokens: Models use special tokens like <|endoftext|>, [CLS], [SEP], <s>, </s> for formatting and control flow.
  • Multilingual Efficiency: Languages with non-Latin scripts (Chinese, Japanese, Arabic) often require more tokens per word, making inference more expensive.
  • Numbers and Code: Tokenizers often split numbers digit-by-digit and code in unexpected ways, affecting arithmetic and code generation abilities.
  • Tokenizer-Model Coupling: A model can only use the tokenizer it was trained with. Changing the tokenizer requires retraining.

3. Embeddings (Vectorization)

After tokenization, tokens are converted into dense vector representations (embeddings) that capture semantic meaning. These embeddings exist in a high-dimensional space where geometric relationships encode linguistic relationships.

Embedding Process

  1. Token Embedding: Each token ID maps to a learned embedding vector (typically 768-8192 dimensions). This is simply a lookup in a learned embedding matrix E ∈ ℝ^(vocab_size × d_model).
  2. Positional Embedding: Adds position information since transformers have no inherent sense of order:
    • Learned Positional Embeddings: A separate embedding matrix for positions (GPT, BERT). Limited to a fixed maximum sequence length.
    • Sinusoidal Positional Encoding: Uses sine and cosine functions of different frequencies (original transformer). Can generalize to unseen lengths.
    • Rotary Positional Embeddings (RoPE): Encodes position by rotating the embedding vector (LLaMA, Mistral). Enables better length extrapolation.
    • ALiBi (Attention with Linear Biases): Adds a linear bias to attention scores based on distance (BLOOM). No positional embeddings needed.
  3. Layer/Segment Embedding: In some models, adds token type information (e.g., sentence A vs. sentence B in BERT).

Embedding Properties

  • Semantic Similarity: Words with similar meanings have similar embeddings. cosine_similarity(embed("king"), embed("queen")) is high.
  • Linear Relationships: Word analogies emerge as vector arithmetic: embed("king") - embed("man") + embed("woman") ≈ embed("queen").
  • Contextual Embeddings: In transformers, the same word gets different embeddings in different contexts. "bank" in "river bank" vs. "bank account" will have different representations after attention layers.
  • Clustering: Embeddings naturally cluster by topic, language, and semantic category in the high-dimensional space.
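The similarity and analogy properties above can be checked with plain vector arithmetic. A minimal sketch using toy, hand-picked 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions and are learned, not hand-written):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings; the values are illustrative, not from a real model.
king  = np.array([0.8, 0.6, 0.1, 0.9])
man   = np.array([0.7, 0.1, 0.0, 0.8])
woman = np.array([0.7, 0.1, 0.9, 0.8])
queen = np.array([0.8, 0.6, 1.0, 0.9])

# Analogy as vector arithmetic: king - man + woman ≈ queen
analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # ≈ 1.0 for these toy vectors
```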

Embedding Dimensions Across Models

| Model | Embedding Dimension | Vocabulary Size | Total Embedding Params | Positional Encoding |
|---|---|---|---|---|
| BERT-base | 768 | 30K | ~23M | Learned |
| GPT-2 | 768 | 50,257 | ~39M | Learned |
| GPT-3 (175B) | 12,288 | 50,257 | ~614M | Learned |
| LLaMA-7B | 4,096 | 32,000 | ~131M | RoPE |
| LLaMA-70B | 8,192 | 32,000 | ~262M | RoPE |
| Mistral-7B | 4,096 | 32,000 | ~131M | RoPE |

Pseudocode (Embedding Layer)

class EmbeddingLayer:
    token_embeddings: Tensor    // [vocab_size, d_model]
    position_encoder: PositionEncoder  // RoPE, learned, or sinusoidal

    function forward(token_ids: Tensor) -> Tensor
        // Token embeddings via lookup
        token_embeds = self.token_embeddings[token_ids]  // [batch, seq_len, d_model]

        // Add positional information
        positions = range(token_ids.shape[1])
        output = self.position_encoder(token_embeds, positions)

        return output

class RoPE:
    // Rotary Positional Embeddings
    function forward(x: Tensor, positions: Tensor) -> Tensor
        d = x.shape[-1]
        // Generate rotation frequencies
        freqs = 1.0 / (10000 ** (arange(0, d, 2) / d))
        // Apply rotation to pairs of dimensions
        angles = positions[:, None] * freqs[None, :]
        cos_angles = cos(angles)
        sin_angles = sin(angles)
        // Rotate embedding pairs
        x_rotated = rotate_half(x, cos_angles, sin_angles)
        return x_rotated

4. The Attention Mechanism (Deep Dive)

Attention is the core innovation of transformers, allowing models to dynamically focus on relevant parts of the input when processing each token. It replaces the fixed-length bottleneck of RNNs with a mechanism that can look at the entire sequence at once.

Scaled Dot-Product Attention

The fundamental attention computation:

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

Where:

  • Q (Query): "What am I looking for?" — represents the current token's question to other tokens. Shape: [seq_len, d_k]
  • K (Key): "What do I contain?" — represents what each token offers. Shape: [seq_len, d_k]
  • V (Value): "What information do I provide?" — the actual content to aggregate. Shape: [seq_len, d_v]
  • d_k: Dimension of keys (scaling factor prevents softmax from saturating into one-hot distributions for large dimensions)

Attention Step by Step

  1. Compute Similarity Scores: QK^T gives a [seq_len × seq_len] matrix of similarity scores between every pair of tokens.
  2. Scale: Divide by √d_k to keep gradients stable. Without scaling, dot products grow with dimension, causing softmax to produce near-one-hot distributions.
  3. Apply Causal Mask (decoder): Set future positions to -∞ so they become 0 after softmax (prevents the model from "looking ahead" during generation).
  4. Softmax: Convert scores to probabilities (attention weights sum to 1 for each query token).
  5. Weighted Sum: Multiply attention weights by values to produce the output — a contextually-informed representation of each token.
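The need for the √d_k scaling in step 2 is easy to demonstrate empirically: for random unit-variance queries and keys, the standard deviation of the dot product grows like √d_k, so without scaling the softmax inputs blow up at realistic head dimensions. A quick numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in [16, 256, 4096]:
    # 10,000 random query/key pairs with unit-variance components
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    raw = (q * k).sum(axis=1)        # un-scaled dot products: std ≈ sqrt(d_k)
    scaled = raw / np.sqrt(d_k)      # scaled as in attention:  std ≈ 1
    print(f"d_k={d_k:5d}  raw std={raw.std():7.2f}  scaled std={scaled.std():.2f}")
```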

Multi-Head Attention

Instead of a single attention function, use multiple "heads" in parallel, each projecting Q, K, V into a different subspace. This allows the model to simultaneously attend to different types of relationships:

  • Head 1: Might learn syntactic dependencies (subject-verb agreement).
  • Head 2: Might learn semantic relationships (synonyms, antonyms).
  • Head 3: Might learn positional patterns (adjacent tokens).
  • Head 4: Might learn long-range coreference (pronoun resolution).

After computing attention for each head, outputs are concatenated and projected:

MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₕ) × W_O
where headᵢ = Attention(QW_Qᵢ, KW_Kᵢ, VW_Vᵢ)

Pseudocode (Complete Attention)

function scaled_dot_product_attention(
    query: Tensor,   // [batch, num_heads, seq_len_q, d_k]
    key: Tensor,     // [batch, num_heads, seq_len_k, d_k]
    value: Tensor,   // [batch, num_heads, seq_len_k, d_v]
    mask: Tensor = None
) -> tuple[Tensor, Tensor]
    d_k = query.shape[-1]

    // Step 1: Compute attention scores
    scores = matmul(query, key.transpose(-2, -1))  // [batch, heads, seq_q, seq_k]

    // Step 2: Scale
    scores = scores / sqrt(d_k)

    // Step 3: Apply causal mask (if provided)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    // Step 4: Softmax to get attention weights
    attention_weights = softmax(scores, dim=-1)  // [batch, heads, seq_q, seq_k]

    // Step 5: Weighted sum of values
    output = matmul(attention_weights, value)  // [batch, heads, seq_q, d_v]

    return output, attention_weights

class MultiHeadAttention:
    num_heads: int
    d_model: int
    d_k: int  // = d_model // num_heads

    W_Q: Tensor  // [d_model, d_model]
    W_K: Tensor  // [d_model, d_model]
    W_V: Tensor  // [d_model, d_model]
    W_O: Tensor  // [d_model, d_model]

    function forward(
        query: Tensor,  // [batch, seq_len, d_model]
        key: Tensor,
        value: Tensor,
        mask: Tensor = None
    ) -> Tensor
        batch_size = query.shape[0]

        // Linear projections
        Q = matmul(query, self.W_Q)  // [batch, seq_len, d_model]
        K = matmul(key, self.W_K)
        V = matmul(value, self.W_V)

        // Reshape to [batch, num_heads, seq_len, d_k]
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        // Apply attention
        output, weights = scaled_dot_product_attention(Q, K, V, mask)

        // Concatenate heads: [batch, seq_len, d_model]
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        // Final linear projection
        output = matmul(output, self.W_O)

        return output

Attention Complexity and Optimizations

Standard Attention:

  • Time Complexity: O(n² × d), where n is sequence length and d is dimension.
  • Space Complexity: O(n²) for the attention matrix.

This quadratic scaling is the primary bottleneck for long sequences. Key optimizations:

| Optimization | Complexity | Approach | Used By |
|---|---|---|---|
| Flash Attention | O(n²) time, O(n) memory | Tiling + kernel fusion; avoids materializing the full attention matrix | LLaMA 2+, Mistral, most modern LLMs |
| Flash Attention 2 | O(n²) time, O(n) memory | Improved parallelism, better GPU utilization (~2× faster than v1) | State of the art |
| Multi-Query Attention (MQA) | O(n² × d/h) | Share K, V across all heads (only Q differs per head) | PaLM, Falcon |
| Grouped-Query Attention (GQA) | O(n² × d×g/h) | Groups of heads share K, V (compromise between MHA and MQA) | LLaMA 2 70B, Mistral |
| Sparse Attention | O(n × √n) | Only attend to local windows + sparse global patterns | Longformer, BigBird |
| Linear Attention | O(n × d²) | Approximate softmax with a kernel trick | Performer, Linformer |
| Ring Attention | O(n²/p) per device | Distribute attention across devices for very long sequences | Research |

Types of Attention in LLMs

  • Self-Attention: Q, K, V all come from the same sequence. Every token attends to every other token. Used in both encoders and decoders.
  • Cross-Attention: Q from one sequence (decoder), K and V from another (encoder output). Used in encoder-decoder models for tasks like translation.
  • Causal (Masked) Self-Attention: Tokens can only attend to previous tokens and themselves. Enforced by a triangular mask. Essential for autoregressive generation.

5. Pre-training

Pre-training is the process of training a large model on massive unlabeled text corpora to learn general language understanding. This is the most expensive phase (millions of dollars, weeks on thousands of GPUs).

Pre-training Objectives

Next Token Prediction (Autoregressive / GPT-style)

The dominant objective for decoder-only LLMs. Given a sequence of tokens, predict the next token:

Loss = -Σ log P(tokenᵢ | token₁, token₂, ..., tokenᵢ₋₁)

The model learns a probability distribution over the entire vocabulary at each position, conditioned on all previous tokens. This simple objective, at sufficient scale, leads to emergent capabilities like reasoning, translation, and code generation.
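The loss above is just cross-entropy against the input sequence shifted by one position. A minimal numpy sketch with toy logits over a vocabulary of 5:

```python
import numpy as np

def next_token_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean negative log-likelihood of the target token at each position.

    logits:  [seq_len, vocab_size] model outputs for positions 0..n-1
    targets: [seq_len]             the token at position i+1 (inputs shifted by one)
    """
    # Numerically stable log-softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Toy example: the model is fairly confident in the correct next token at each step
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 2.5, 0.1]])
targets = np.array([0, 1, 3])
print(next_token_loss(logits, targets))  # ≈ 0.33: high probability on the targets
```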

Masked Language Modeling (BERT-style)

Randomly mask 15% of tokens and predict them from context:

Input:  "The [MASK] sat on the [MASK]"
Output: "The cat sat on the mat"

Bidirectional — the model sees both left and right context. Better for understanding tasks but cannot generate text autoregressively.

Span Corruption (T5-style)

Replace random spans of text with sentinel tokens and predict the original spans:

Input:  "The <X> sat on the <Y>"
Output: "<X> cat <Y> mat"

Combines benefits of MLM with sequence-to-sequence format.

Pre-training Data

| Dataset | Size | Content | Used By |
|---|---|---|---|
| Common Crawl | ~60TB text | Web pages (filtered) | GPT-3, many others |
| The Pile | 825 GB | Curated mix (academic, code, books, web) | GPT-NeoX, Pythia |
| RedPajama | 1.2T tokens | Open reproduction of LLaMA training data | RedPajama models |
| FineWeb | 15T tokens | High-quality filtered web data | Open-source community |
| StarCoder Data | 783 GB | Source code from GitHub | StarCoder, CodeLLaMA |
| Wikipedia | ~20 GB | Encyclopedia articles | Nearly all models |
| Books3 | ~100 GB | Digitized books | GPT-3 (controversial) |

Data Quality and Curation

Data quality is more important than quantity. Key preprocessing steps:

  1. Deduplication: Remove exact and near-duplicate documents (MinHash, SimHash). Duplicates cause memorization and degrade generalization.
  2. Quality Filtering: Remove low-quality content (boilerplate, spam, machine-generated text). Use classifiers trained on high-quality data.
  3. Toxicity Filtering: Remove or reduce harmful, biased, and explicit content.
  4. Language Identification: Tag and balance content across languages.
  5. Domain Mixing: Balance data from different domains (web, books, code, academic papers) to ensure broad capabilities.

Training Infrastructure

Training a frontier LLM requires massive distributed computing:

  • GPT-3 (175B): ~3,640 petaflop-days, estimated $4.6M on cloud GPUs.
  • LLaMA-65B: 2,048 A100 GPUs for ~21 days, ~1.0M GPU-hours.
  • GPT-4: Estimated 10,000+ A100s for months, cost ~$100M+.

Distributed Training Strategies:

| Strategy | Description | Use Case |
|---|---|---|
| Data Parallelism (DP) | Same model on each GPU, different data batches | Small models that fit in one GPU |
| Tensor Parallelism (TP) | Split individual layers across GPUs | Large layers that don't fit in one GPU |
| Pipeline Parallelism (PP) | Split model layers across GPUs sequentially | Very deep models |
| Fully Sharded Data Parallelism (FSDP) | Shard model parameters, gradients, and optimizer states | Modern default for large models |
| 3D Parallelism | Combine DP + TP + PP | Frontier model training |

6. Scaling Laws

Scaling laws describe the predictable relationship between model size, dataset size, compute budget, and model performance. These empirical laws guide how to allocate resources for training.

Kaplan Scaling Laws (OpenAI, 2020)

Performance (measured by cross-entropy loss) follows a power law:

L(N) ∝ N^(-0.076)   // Loss vs. parameters
L(D) ∝ D^(-0.095)   // Loss vs. dataset size
L(C) ∝ C^(-0.050)   // Loss vs. compute

Key insight: Scaling up model size is more efficient than scaling up data, suggesting larger models trained on relatively less data are optimal.

Chinchilla Scaling Laws (DeepMind, 2022)

Chinchilla challenged Kaplan's findings:

Optimal tokens ≈ 20 × parameters

This means a 10B parameter model should be trained on ~200B tokens. Most models before Chinchilla (including GPT-3) were significantly undertrained relative to their size.

Impact: Led to smaller, better-trained models. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) despite being 4× smaller.
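Applying the rule of thumb, together with the common C ≈ 6·N·D approximation for training FLOPs (N parameters, D tokens, an estimate from the scaling-law literature), gives a quick budget calculation:

```python
def chinchilla_optimal(params: float) -> tuple[float, float]:
    """Chinchilla rule of thumb: ~20 training tokens per parameter.
    Training compute is approximated with the standard C ≈ 6 * N * D FLOPs."""
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

for n in (7e9, 70e9):
    tokens, flops = chinchilla_optimal(n)
    print(f"{n/1e9:.0f}B params → {tokens/1e12:.2f}T tokens, ~{flops:.1e} training FLOPs")
```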

Modern Scaling Observations

  • Emergent Abilities: Certain capabilities (multi-step reasoning, code generation) appear suddenly at specific scale thresholds rather than improving gradually.
  • Inference Cost Matters: Chinchilla-optimal training doesn't account for inference costs. Smaller, overtrained models (LLaMA approach) may be more practical since inference cost scales with model size.
  • Data Walls: The internet has a finite amount of high-quality text. Synthetic data generation and data efficiency techniques are increasingly important.

7. Fine-tuning Techniques

Fine-tuning adapts a pre-trained model to specific tasks or behaviors using smaller, curated datasets. Modern fine-tuning ranges from full parameter updates to lightweight adapter methods.

Full Fine-tuning

Update all model parameters on task-specific data:

  • Pros: Maximum flexibility, best task-specific performance.
  • Cons: Requires storing a full copy of the model per task, expensive, risk of catastrophic forgetting.
  • Compute: Comparable to pre-training on the fine-tuning dataset.

Parameter-Efficient Fine-Tuning (PEFT)

Train only a small subset of parameters, keeping the base model frozen:

LoRA (Low-Rank Adaptation)

The most popular PEFT method. Instead of updating weight matrix W directly, decompose the update into two low-rank matrices:

W' = W + ΔW = W + BA
where B ∈ ℝ^(d × r), A ∈ ℝ^(r × d), r << d

  • r (rank): Typically 4-64. Controls the expressiveness vs. efficiency trade-off.
  • Trainable parameters: Only 2 × d × r per adapted layer (vs. d² for full fine-tuning).
  • Savings: For a 7B model, LoRA might train only 0.1-1% of the parameters.

Pseudocode (LoRA):

class LoRALayer:
    base_weight: Tensor      // [d_out, d_in] (frozen)
    lora_A: Tensor           // [r, d_in] (trainable)
    lora_B: Tensor           // [d_out, r] (trainable)
    scaling: float           // alpha / r

    function forward(x: Tensor) -> Tensor
        // Original computation + low-rank adaptation
        base_output = matmul(x, self.base_weight.T)
        lora_output = matmul(matmul(x, self.lora_A.T), self.lora_B.T) * self.scaling
        return base_output + lora_output

QLoRA (Quantized LoRA)

Combines LoRA with 4-bit quantization of the base model:

  1. Quantize base model to 4-bit NormalFloat (NF4) format.
  2. Apply LoRA adapters in full precision (bfloat16).
  3. Use double quantization (quantize the quantization constants).
  4. Use paged optimizers to handle memory spikes.

Impact: Fine-tune a 65B model on a single 48GB GPU (vs. ~780GB for full fine-tuning).
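The savings come mostly from the base weights. A back-of-the-envelope sketch covering weights only (full fine-tuning additionally stores gradients and optimizer states, which is where the ~780GB figure comes from):

```python
def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone
    (ignores activations, gradients, optimizer states, and KV cache)."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("NF4 (QLoRA)", 4)]:
    print(f"65B weights @ {name:12s}: {weight_memory_gb(65e9, bits):6.1f} GB")
```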

Other PEFT Methods

| Method | Approach | Trainable Params | Performance |
|---|---|---|---|
| LoRA | Low-rank weight decomposition | 0.1-1% | Excellent |
| QLoRA | LoRA + 4-bit quantization | 0.1-1% | Near LoRA, much less memory |
| Prefix Tuning | Learnable prefix tokens in each layer | ~0.1% | Good for generation |
| Prompt Tuning | Learnable soft prompt tokens (input only) | <0.01% | Good for classification |
| Adapters | Small bottleneck layers inserted between transformer layers | 1-3% | Good, higher overhead |
| IA³ | Learned rescaling of activations | <0.01% | Competitive with LoRA |

Instruction Tuning

Fine-tuning on instruction-following datasets to make models helpful and aligned:

  • Datasets: FLAN (1.8K tasks), Alpaca (52K instructions), ShareGPT (user conversations), OpenAssistant.
  • Format: Each example is an instruction-input-output triple.
  • Effect: Transforms a base model (which just predicts next tokens) into a helpful assistant.

Supervised Fine-Tuning (SFT)

Training on high-quality demonstrations of desired behavior:

Dataset: [(instruction, ideal_response), ...]
Loss: Cross-entropy between model output and ideal_response

SFT is the first step in the alignment pipeline (before RLHF/DPO).


8. Alignment: RLHF and DPO

Alignment ensures models are helpful, harmless, and honest. Raw pre-trained models can generate toxic, biased, or unhelpful content. Alignment techniques steer model behavior toward human preferences.

RLHF (Reinforcement Learning from Human Feedback)

The three-step process used by ChatGPT, Claude, and others:

Step 1: Supervised Fine-Tuning (SFT)

  • Train on demonstrations of ideal assistant behavior.
  • Result: a model that follows instructions but may still produce undesirable outputs.

Step 2: Reward Model Training

  • Collect human preference data: show pairs of model responses; humans rank which is better.
  • Train a reward model R(prompt, response) → scalar that predicts human preferences.
  • Dataset: thousands of comparison pairs.

Step 3: PPO (Proximal Policy Optimization)

  • Use the reward model to further train the SFT model via reinforcement learning.
  • The model generates responses, the reward model scores them, and PPO updates the model to increase expected reward.
  • A KL divergence penalty prevents the model from diverging too far from the SFT model (prevents reward hacking).

Objective = E[R(prompt, response)] - β × KL(π_RL || π_SFT)

Pseudocode (RLHF Training Loop):

class RLHFTrainer:
    policy_model: LLM          // Model being trained
    reference_model: LLM       // Frozen SFT model
    reward_model: RewardModel   // Trained on human preferences
    beta: float = 0.1          // KL penalty coefficient

    function train_step(prompts: list[str])
        // Generate responses with current policy
        responses = self.policy_model.generate(prompts)

        // Score with reward model
        rewards = self.reward_model.score(prompts, responses)

        // Compute KL penalty
        policy_logprobs = self.policy_model.log_prob(responses)
        ref_logprobs = self.reference_model.log_prob(responses)
        kl_penalty = policy_logprobs - ref_logprobs

        // Combined objective
        objective = rewards - self.beta * kl_penalty

        // Update policy with PPO
        ppo_update(self.policy_model, objective)

DPO (Direct Preference Optimization)

A simpler alternative to RLHF that eliminates the need for a separate reward model:

  • Directly optimize the policy model using preference pairs (chosen vs. rejected responses).
  • Mathematically equivalent to RLHF with a specific reward model class.
  • Much simpler to implement and more stable to train.

Loss_DPO = -log σ(β × (log(π(chosen)/π_ref(chosen)) - log(π(rejected)/π_ref(rejected))))

DPO Advantages:

  • No reward model needed.
  • No RL training loop (just supervised learning on preferences).
  • More stable and reproducible.
  • Lower computational cost.
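Because DPO is plain supervised learning, its loss can be computed directly from the summed token log-probabilities of each response under the policy and the frozen reference model. A minimal single-pair sketch:

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed token log-probs."""
    chosen_margin = logp_chosen - ref_logp_chosen        # log π(chosen)/π_ref(chosen)
    rejected_margin = logp_rejected - ref_logp_rejected  # log π(rejected)/π_ref(rejected)
    return float(-np.log(sigmoid(beta * (chosen_margin - rejected_margin))))

# If the policy equals the reference, the loss starts at log(2) ≈ 0.693;
# it drops as the policy prefers the chosen response more than the reference does.
print(dpo_loss(-12.0, -12.0, -12.0, -12.0))  # ≈ 0.693
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))  # < 0.693
```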

Constitutional AI (Anthropic)

Self-supervised alignment using a set of principles (constitution):

  1. Model generates responses.
  2. Model critiques its own responses using the constitution.
  3. Model revises responses based on critiques.
  4. Fine-tune on the revised responses.

Reduces reliance on human labelers while maintaining alignment quality.

Comparison

| Method | Complexity | Stability | Data Needed | Performance |
|---|---|---|---|---|
| SFT | Low | High | Demonstrations | Good baseline |
| RLHF (PPO) | High | Low (tricky to tune) | Preferences + RL | Best (when tuned well) |
| DPO | Medium | High | Preferences only | Near RLHF |
| Constitutional AI | Medium | Medium | Principles + self-play | Good alignment |
| ORPO | Low | High | Preferences only | Competitive |

9. Inference Optimization

Serving LLMs in production requires optimizing for latency, throughput, and cost. Inference is fundamentally different from training — it's memory-bandwidth bound (not compute-bound) and must handle variable-length sequences efficiently.

The KV Cache

During autoregressive generation, each new token requires attending to all previous tokens. Without caching, this requires recomputing attention for all previous tokens at each step (O(n²) total work for n tokens).

The KV Cache stores the key and value projections for all previous tokens:

// Without KV cache: recompute K, V for all tokens at each step
// With KV cache: only compute K, V for the new token, append to cache

class KVCache:
    keys: list[Tensor]    // One per layer: [batch, num_heads, seq_len, d_k]
    values: list[Tensor]  // One per layer: [batch, num_heads, seq_len, d_v]

    function update(layer_idx: int, new_key: Tensor, new_value: Tensor)
        self.keys[layer_idx] = concat(self.keys[layer_idx], new_key, dim=2)
        self.values[layer_idx] = concat(self.values[layer_idx], new_value, dim=2)

KV Cache Memory: For a 70B-class model with full multi-head attention, the FP16 KV cache for a single 4K-token sequence is roughly 10 GB (grouped-query attention shrinks this by the K/V head-sharing factor). Multiply by the batch size and it becomes clear why memory, not compute, is the bottleneck.
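The cache size follows directly from the model shapes: 2 values (K and V) per layer, per KV head, per head dimension, per token. A sketch assuming LLaMA-2-70B-like shapes (80 layers, head dimension 128) at FP16:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_val: int = 2) -> float:
    """Per-sequence KV cache: 2 (K and V) × layers × kv_heads × head_dim × seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val / 1e9

# 70B-class model, 4K-token sequence, FP16:
print(kv_cache_gb(80, 64, 128, 4096))  # full multi-head attention: ~10.7 GB
print(kv_cache_gb(80, 8, 128, 4096))   # GQA with 8 KV heads:       ~1.3 GB
```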

Quantization

Reduce model precision to decrease memory usage and increase throughput:

| Precision | Bits per Param | Memory (7B model) | Quality Loss | Speed Gain |
|---|---|---|---|---|
| FP32 | 32 | 28 GB | Baseline | 1× |
| FP16/BF16 | 16 | 14 GB | Negligible | ~2× |
| INT8 | 8 | 7 GB | Minimal | ~2-4× |
| INT4 (GPTQ) | 4 | 3.5 GB | Small | ~4-8× |
| INT4 (AWQ) | 4 | 3.5 GB | Very small | ~4-8× |
| GGUF (mixed) | 2-8 | 2-7 GB | Varies | ~4-10× |

Quantization Methods:

  • Post-Training Quantization (PTQ): Quantize after training using calibration data. Methods: GPTQ, AWQ, SqueezeLLM.
  • Quantization-Aware Training (QAT): Simulate quantization during training. Higher quality but more expensive.
  • GGUF: File format for quantized models used by llama.cpp. Supports mixed-precision quantization (more bits for important layers).
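Production methods like GPTQ and AWQ are considerably more sophisticated (calibration data, per-group scales, error compensation), but the core round-to-nearest idea can be sketched as symmetric per-tensor INT8 quantization:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)
error = float(np.abs(w - dequantize(q, scale)).max())
print(f"max abs reconstruction error: {error:.4f}")  # bounded by scale / 2
```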

Speculative Decoding

Use a small, fast "draft" model to generate candidate tokens, then verify them in parallel with the large model:

  1. Draft model generates K candidate tokens quickly.
  2. Large model scores all K tokens in a single forward pass.
  3. Accept tokens that the large model agrees with; reject and regenerate from the first disagreement.

Speedup: 2-3× with no quality loss (mathematically equivalent to sampling from the large model).
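A greedy variant of the scheme above fits in a few lines. This sketch treats both models as functions returning the argmax next token and calls the target model once per position for clarity; a real implementation scores all K draft tokens in a single batched forward pass, and for sampling uses the rejection rule that makes the output distribution exactly match the large model's:

```python
def speculative_decode_greedy(draft_next, target_next, prefix, k=4, max_new=32):
    """Greedy speculative decoding sketch.

    draft_next / target_next: callables mapping a token list to the next token.
    With greedy acceptance, the output always matches target-only decoding.
    """
    tokens = list(prefix)
    while len(tokens) < len(prefix) + max_new:
        # 1. Draft model proposes k candidate tokens
        draft = list(tokens)
        for _ in range(k):
            draft.append(draft_next(draft))
        proposals = draft[len(tokens):]

        # 2-3. Verify proposals in order; accept until the first disagreement
        for tok in proposals:
            expected = target_next(tokens)
            if tok == expected:
                tokens.append(tok)
            else:
                tokens.append(expected)  # take the target model's token and re-draft
                break
    return tokens
```

With a perfect draft model every proposal is accepted; with a poor one the loop degrades gracefully to one (target-chosen) token per iteration.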

Continuous Batching

Traditional batching waits for all sequences in a batch to finish. Continuous batching:

  1. Immediately start processing new requests as old ones finish.
  2. Dynamically add/remove sequences from the batch.
  3. Maximizes GPU utilization.

Used by vLLM, TGI, and all modern serving frameworks.

PagedAttention (vLLM)

Inspired by OS virtual memory paging:

  • KV cache is stored in non-contiguous memory blocks (pages).
  • Pages are allocated on demand (no pre-allocation waste).
  • Enables sharing KV cache across sequences with common prefixes.
  • Reduces memory waste by 60-80% compared to naive allocation.
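The page-table bookkeeping behind this can be sketched in a few lines (a sketch of the block-table idea only; real vLLM adds copy-on-write prefix sharing, reference counting, and GPU-side attention kernels that read through the table):

```python
class PagedKVCache:
    """Map each sequence's logical token positions onto fixed-size physical
    blocks allocated on demand from a shared free list."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids
        self.tables = {}                      # seq id -> list of block ids
        self.lengths = {}                     # seq id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full: allocate
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        # Blocks return to the pool the moment a sequence finishes.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated only as tokens arrive, the worst-case waste per sequence is one partially filled block, instead of a full pre-allocated max-length buffer.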

Inference Serving Systems

| System | Key Innovation | Performance | Use Case |
|---|---|---|---|
| vLLM | PagedAttention, continuous batching | Very high throughput | Production serving |
| TGI (HuggingFace) | Tensor parallelism, Flash Attention | High throughput | HuggingFace ecosystem |
| TensorRT-LLM (NVIDIA) | GPU-optimized kernels, FP8 | Lowest latency on NVIDIA | Maximum performance |
| Ollama | Simple local serving, GGUF models | Easy to use | Local development |
| llama.cpp | CPU inference, GGUF quantization | Runs anywhere | Edge, desktop |
| SGLang | RadixAttention, structured generation | Optimized for structured output | Complex pipelines |

Proprietary Models

| Model | Organization | Parameters | Context Window | Key Features |
|---|---|---|---|---|
| GPT-4o | OpenAI | ~1.7T (est., MoE) | 128K tokens | Multimodal (text, image, audio), strong reasoning |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | 200K tokens | Constitutional AI, strong coding, long context |
| Gemini 1.5 Pro | Google | Undisclosed (MoE) | 1M tokens | Extremely long context, multimodal |
| o1 / o3 | OpenAI | Undisclosed | 128K-200K | Chain-of-thought reasoning, "thinking" models |

Open-Source / Open-Weight Models

| Model | Organization | Parameters | Context | License | Key Features |
|---|---|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | 128K | LLaMA 3 License | Strong general performance, widely adopted |
| Mistral / Mixtral | Mistral AI | 7B / 8×7B MoE | 32K | Apache 2.0 | Efficient, MoE architecture |
| Qwen 2.5 | Alibaba | 0.5B-72B | 128K | Apache 2.0 | Strong multilingual, coding |
| DeepSeek-V2/V3 | DeepSeek | 236B / 671B MoE | 128K | MIT | Multi-head latent attention, very efficient MoE |
| Phi-3/4 | Microsoft | 3.8B-14B | 128K | MIT | Small but capable, strong reasoning |
| Gemma 2 | Google | 2B, 9B, 27B | 8K | Gemma License | Efficient, research-friendly |
| CodeLLaMA | Meta | 7B-34B | 16K-100K | LLaMA 2 License | Code-specialized |
| StarCoder 2 | BigCode | 3B-15B | 16K | BigCode License | Code generation, fill-in-the-middle |

Mixture of Experts (MoE)

A key architectural innovation for scaling efficiently. Instead of a single large FFN, MoE uses multiple "expert" FFNs with a gating network that routes each token to a subset of experts:

MoE_FFN(x) = Σᵢ G(x)ᵢ × Expertᵢ(x)
where G(x) = TopK(softmax(x × W_gate))

  • Sparse activation: Only K experts (typically 2) are activated per token, out of N total (e.g., 8 experts).
  • Effective parameters: Total params may be 8×7B = 56B, but active params per token are only 2×7B = 14B.
  • Benefits: More total knowledge (parameters) without proportional compute increase.
  • Challenges: Load balancing across experts, memory for all expert weights, routing instability.

Example: Mixtral 8×7B has 46.7B total parameters but only 12.9B active per token, making it competitive with much larger dense models.
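The routing formula above can be sketched with NumPy. The dimensions here (d=8, 4 experts, top-2) are toy values for illustration, not any real model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2

W_gate = rng.normal(size=(d, n_experts))
# Each "expert" is its own small 2-layer FFN (inner dim 4×d, as in the text).
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(n_experts)]

def moe_ffn(x):
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]             # indices of the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized softmax
    out = np.zeros(d)
    for g, i in zip(gates, top):              # only k of n_experts run
        W1, W2 = experts[i]
        out += g * (np.maximum(x @ W1, 0) @ W2)   # gate-weighted ReLU FFN
    return out

y = moe_ffn(rng.normal(size=d))
```

The sparsity is visible in the loop: compute scales with k, while parameter count scales with n_experts, which is exactly the "more knowledge without proportional compute" trade-off described above.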


11. Multimodal Models and Diffusion

Vision-Language Models

Models that process both text and images:

  • Architecture: Typically a vision encoder (ViT) + LLM, connected via a projection layer or cross-attention.
  • Examples: GPT-4V/4o, Claude 3 (vision), Gemini, LLaVA.
  • Capabilities: Image understanding, visual question answering, OCR, diagram interpretation.

Diffusion Models

While transformers dominate text, diffusion models have revolutionized image generation:

  1. Forward Process: Gradually add Gaussian noise to an image over T timesteps until it becomes pure noise.
  2. Reverse Process: Train a neural network to predict and remove noise at each step, gradually recovering the image.
Forward:  x_t = √(ᾱ_t) × x_0 + √(1−ᾱ_t) × ε,  ε ~ N(0, I),  where ᾱ_t = Π_{s≤t} (1−β_s)
Reverse:  x_{t-1} = denoise(x_t, t)  // Learned denoising
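The closed-form forward process is only a few lines of NumPy. A minimal sketch with a linear β schedule (the schedule values are common DDPM-style defaults, not tied to any particular model):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative product: signal remaining at step t

def add_noise(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating t steps."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return xt, eps                       # eps is the denoiser's training target

rng = np.random.default_rng(0)
x0 = rng.normal(size=(32, 32))           # stand-in for a (normalized) image
xt, eps = add_noise(x0, t=500, rng=rng)
```

Training then amounts to regressing a network's prediction of `eps` from `(xt, t)`; by the final step almost no signal remains (`alpha_bar[T-1]` is near zero), so generation can start from pure noise.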

Key models: DALL-E 2/3, Stable Diffusion, Midjourney, Imagen.

Latent Diffusion: Perform diffusion in a compressed latent space (not pixel space) for efficiency. Stable Diffusion uses a VAE to compress images 8× before diffusion.

Diffusion Transformers (DiT)

Replace the U-Net backbone in diffusion models with a transformer:

  • Better scaling properties than U-Nets.
  • Used by Stable Diffusion 3 and, reportedly, OpenAI's Sora.
  • Enables unified transformer-based architecture for text and images.

12. Challenges and Limitations

Hallucination

Models generate plausible but factually incorrect information. This is inherent to language models (they model probability distributions over tokens, not truth).

Mitigation:

  • RAG (ground responses in retrieved facts).
  • Fine-tuning on factual data.
  • Self-consistency checking (generate multiple responses, check agreement).
  • Confidence calibration and "I don't know" training.
  • Citation and source attribution.

Context Window Limitations

Despite growing context windows (up to 1M+ tokens), challenges remain:

  • Lost in the middle: Models attend more to the beginning and end of the context, potentially missing information in the middle.
  • Cost: Longer contexts are more expensive (quadratic attention cost).
  • Retrieval is often better: For large knowledge bases, RAG outperforms stuffing everything into context.

Safety and Misuse

  • Prompt injection: Adversarial inputs that override system instructions.
  • Jailbreaking: Techniques to bypass safety filters.
  • Data poisoning: Corrupting training data to influence model behavior.
  • Deepfakes: Generating misleading content.

Cost

  • Training: GPT-4's training is estimated to have cost $100M+. Even fine-tuning large models costs thousands of dollars.
  • Inference: Serving at scale requires significant GPU infrastructure. Cost per token varies 100× between model sizes.

13. Best Practices for LLM Engineering

  1. Start with prompting: Try prompt engineering before fine-tuning. Many problems are solvable with good prompts.
  2. Choose the right model size: Smaller models are cheaper and faster. Only use large models when quality demands it. Consider model routing (small model for easy queries, large for hard ones).
  3. Evaluate systematically: Use benchmarks (MMLU, HumanEval, MT-Bench) AND task-specific evaluations AND human evaluation.
  4. Monitor in production: Track latency, error rates, user satisfaction, and output quality. Watch for drift.
  5. Optimize inference: Use quantization, batching, KV caching, and appropriate serving infrastructure.
  6. Cache aggressively: Semantic caching (similar queries → cached response), exact caching (identical prompts), prefix caching.
  7. Handle failures gracefully: LLMs are non-deterministic. Implement retries, fallbacks, and output validation.
  8. Version everything: Model versions, prompt templates, evaluation datasets, fine-tuning data.
  9. Security first: Implement input validation, output filtering, rate limiting, and prompt injection defenses.
  10. Keep up with the field: The landscape changes rapidly. New models, techniques, and best practices emerge monthly.
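Practice 7 (graceful failure handling) is worth making concrete. A minimal sketch of a retry-with-fallback wrapper; the model callables, validator, and names here are hypothetical stand-ins, not any real provider's SDK:

```python
import time

def call_with_fallback(prompt, models, validate, retries=2, backoff=0.5):
    """Try each model in order (e.g. primary, then a cheaper fallback),
    retrying transient failures with exponential backoff and validating
    every output before returning it."""
    last_err = None
    for model in models:
        for attempt in range(retries + 1):
            try:
                out = model(prompt)
                if validate(out):
                    return out
                # Treat bad output like a transient failure and retry.
                raise ValueError("output failed validation")
            except Exception as err:
                last_err = err
                time.sleep(backoff * 2 ** attempt)
    raise RuntimeError(f"all models failed: {last_err}")
```

Output validation sits inside the retry loop deliberately: since LLM output is non-deterministic, a response that fails a schema or safety check is often fixed by simply sampling again before escalating to the fallback model.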