
Performance Engineering

Performance engineering is the practice of designing, measuring, and optimizing software systems to meet performance requirements. It spans the entire software development lifecycle—from architectural decisions to production monitoring. The golden rule: always measure before optimizing—intuition about bottlenecks is wrong more often than it's right.

Performance Metrics

Metric | Description | Typical Target
--- | --- | ---
Latency | Time to process a single request (p50, p95, p99) | API: <100ms p95, Web: <200ms p95
Throughput | Number of operations per unit time (RPS, TPS) | Varies by service
TTFB | Time to First Byte — server processing + network time | <200ms
Apdex | Application Performance Index (0-1 score of user satisfaction) | >0.9
Error rate | Percentage of failed requests | <0.1%
Resource utilization | CPU, memory, disk I/O, network bandwidth usage | CPU <70%, Memory <80%

Why Percentiles Matter More Than Averages

Average latency hides problems. If 99% of requests take 50ms and 1% take 5000ms, the average is ~100ms—which looks fine. But that 1% represents real users with a terrible experience, and they're often your most important users (high-value customers with more data, more complex queries).

Latency distribution example:
  p50 (median):  50ms   — Half of requests are faster than this
  p90:          100ms   — 10% of requests are slower
  p95:          200ms   — 5% are slower (common SLO target)
  p99:          500ms   — 1% are slower (catches tail latency)
  p99.9:       2000ms   — 1 in 1000 (often database timeouts, GC pauses)
  max:         5000ms   — Single worst request (outliers)

Rule of thumb: Set SLOs on p95 or p99, not average.
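Percentiles like the ones in the distribution above can be computed directly from raw latency samples with the standard library; a minimal sketch using synthetic data:

```python
import random
import statistics

random.seed(42)
# Synthetic latency samples (ms): 99% fast requests plus a 1% slow tail
samples = [random.gauss(50, 10) for _ in range(990)] + \
          [random.uniform(500, 5000) for _ in range(10)]

# quantiles(n=100) returns the 99 percentile cut points p1..p99
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
mean = statistics.mean(samples)
print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

The mean lands well above the median here, which is exactly why averages hide the tail.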

Tail latency amplification: In microservices, a single user request may fan out to 10-50 backend services. If each service has p99 = 100ms, the overall p99 is NOT 100ms—it's closer to max(all services). With 50 parallel calls at p99 = 100ms, there's a ~40% chance that at least one exceeds 100ms per request.
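That ~40% figure follows directly from 1 − 0.99^n; a quick check:

```python
# Chance that at least one of n parallel calls lands in the slowest 1%
def p_any_slow(n: int, p_fast: float = 0.99) -> float:
    return 1 - p_fast ** n

for n in (1, 10, 50, 100):
    print(f"{n:>3} parallel calls: {p_any_slow(n):.0%} chance of hitting p99 tail latency")
```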

Web Performance Metrics (Core Web Vitals)

Google's Core Web Vitals measure real user experience and affect search rankings:

Metric | What it Measures | Good | Needs Improvement | Poor
--- | --- | --- | --- | ---
LCP (Largest Contentful Paint) | Loading performance — when the largest visible element renders | ≤2.5s | ≤4.0s | >4.0s
INP (Interaction to Next Paint) | Interactivity — time from user input to visual response | ≤200ms | ≤500ms | >500ms
CLS (Cumulative Layout Shift) | Visual stability — how much the page layout shifts unexpectedly | ≤0.1 | ≤0.25 | >0.25

Improving LCP: Optimize the critical rendering path (reduce render-blocking CSS/JS), preload key resources (<link rel="preload">), use CDN for static assets, optimize images (WebP/AVIF, responsive sizes, lazy loading below-the-fold).

Improving INP: Keep JavaScript execution short (break long tasks with requestIdleCallback), debounce/throttle event handlers, minimize main thread blocking, use web workers for heavy computation.

Improving CLS: Set explicit width/height on images and embeds, use aspect-ratio CSS, avoid injecting content above the fold after page load, use font-display: swap for web fonts.

Profiling

Profiling is the process of measuring where an application spends its time and resources. Always profile before optimizing—the bottleneck is rarely where you expect.

Types of Profiling

Type | What it Measures | Tools
--- | --- | ---
CPU Profiling | Where computation time is spent | cProfile (Python), perf (Linux), pprof (Go), cargo flamegraph (Rust), async-profiler (Java)
Memory Profiling | Memory allocation, leaks, high-water marks | tracemalloc (Python), Valgrind, heaptrack, memory_profiler, jemalloc
I/O Profiling | Disk and network I/O patterns, blocking calls | strace, ltrace, iotop, BPF/eBPF, perf trace
Lock/Contention | Thread contention on locks, synchronization overhead | perf lock, mutrace, async-profiler (Java), tokio-console (Rust)
Allocation Profiling | Where memory allocations happen (separate from leaks) | DHAT (Valgrind), heaptrack, jemalloc prof

CPU Profiling

# Python CPU profiling example
import cProfile
import pstats

def expensive_function():
    result = []
    for i in range(1_000_000):
        result.append(i ** 2)
    return sorted(result, reverse=True)

# Profile the function
profiler = cProfile.Profile()
profiler.enable()
expensive_function()
profiler.disable()

# Print stats sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions

// Go profiling with pprof (built into the standard library)
import (
    "net/http"
    _ "net/http/pprof"  // Register pprof endpoints
)

func main() {
    // Exposes /debug/pprof/ endpoints
    go http.ListenAndServe(":6060", nil)

    // Your application code...
}

// Then analyze:
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30  (CPU)
// go tool pprof http://localhost:6060/debug/pprof/heap                (memory)
// go tool pprof http://localhost:6060/debug/pprof/goroutine           (goroutines)

Memory Profiling

# Python memory profiling with tracemalloc
import tracemalloc

tracemalloc.start()

# Code to profile
data = [dict(index=i, value=i**2) for i in range(100_000)]

# Take a snapshot
snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics('lineno')

print("Top 10 memory allocations:")
for stat in stats[:10]:
    print(stat)

# Compare snapshots to find leaks
snapshot1 = tracemalloc.take_snapshot()
# ... more code ...
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)  # Shows memory growth between snapshots

Common memory issues:
- Memory leaks: Objects that are no longer needed but still referenced (growing collections, unclosed connections, event listener accumulation)
- High allocation rate: Creating and destroying many small objects (GC pressure). Fix: reuse objects, use object pools
- Large objects: Single allocations that are disproportionately large. Fix: streaming, pagination, lazy loading
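A simple object pool addresses the high-allocation-rate case; a minimal sketch built on a thread-safe queue (the 1 MiB buffer size is an arbitrary example):

```python
import queue

class ObjectPool:
    """Reuse expensive-to-create objects instead of reallocating them."""
    def __init__(self, factory, size):
        self._items = queue.Queue()
        for _ in range(size):
            self._items.put(factory())

    def acquire(self):
        return self._items.get()        # Blocks if the pool is exhausted

    def release(self, obj):
        self._items.put(obj)

pool = ObjectPool(lambda: bytearray(1 << 20), size=4)  # Four reusable 1 MiB buffers
buf = pool.acquire()
try:
    buf[:5] = b"hello"                  # Work with the buffer; no new allocation
finally:
    pool.release(buf)                   # Always return it, even on error
```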

Flame Graphs

Flame graphs are a visualization of profiled software, showing the most frequent code paths. The x-axis represents the population of stack traces (wider = more samples), and the y-axis shows stack depth. They make it immediately obvious where time is spent.

# Generate a flame graph on Linux using perf
perf record -g -p <PID> -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

# For Python using py-spy (can attach to running process!)
py-spy record -o profile.svg --pid <PID>
py-spy top --pid <PID>              # Live top-like view

# For Go using pprof
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30

# For Rust using cargo-flamegraph
cargo install flamegraph
cargo flamegraph --bin myapp

# For Node.js
node --prof app.js                  # Generate V8 profile
node --prof-process isolate-*.log   # Convert to readable format
# Or use 0x: npx 0x app.js         # Generates interactive flame graph

Reading a flame graph:
- Width = proportion of total samples (wider = more time)
- Height = call stack depth (taller = deeper nesting)
- Color = typically random (or grouped by module)
- Look for: wide blocks (hot functions), tall narrow towers (deep recursion), flat tops (leaf functions doing work)

Benchmarking

Benchmarking measures the performance of specific code paths or system components under controlled conditions. Unlike profiling (which shows where time is spent), benchmarking measures how fast something is.

Microbenchmarking

# Python benchmarking with timeit
import timeit

# Compare list comprehension vs map
list_comp_time = timeit.timeit(
    '[x**2 for x in range(1000)]',
    number=10000
)
map_time = timeit.timeit(
    'list(map(lambda x: x**2, range(1000)))',
    number=10000
)
print(f"List comprehension: {list_comp_time:.4f}s")
print(f"Map:                {map_time:.4f}s")

// Rust benchmarking with criterion (statistical benchmarking)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
// Criterion runs the benchmark enough times to be statistically significant,
// compares against previous runs, and detects regressions.

Microbenchmarking pitfalls:
- Dead code elimination: Compilers may optimize away your benchmark code. Use black_box() (Rust) or assign results to a variable.
- Warmup: JIT-compiled languages (Java, JavaScript) need warmup iterations.
- Measurement overhead: time.time() itself has overhead. Use dedicated benchmarking tools.
- Cache effects: The first run may be slower (cold cache). Run multiple iterations.
- Context switching: Other processes can affect results. Pin to a CPU core for precise measurements.
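A small harness sidesteps several of these pitfalls at once — a discarded warmup pass, repeated runs, and minimum-of-runs as the noise-resistant estimate — sketched here with the stdlib timeit module:

```python
import timeit

def bench(stmt: str, setup: str = "pass", repeat: int = 5, number: int = 10_000) -> float:
    """Return best-case seconds per execution of stmt."""
    timeit.timeit(stmt, setup=setup, number=number)   # Warmup pass (discarded)
    runs = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(runs) / number   # Minimum of runs is the least noisy estimate

per_call = bench("[x**2 for x in range(100)]")
print(f"{per_call * 1e6:.2f} µs per call")
```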

Database Query Benchmarking

-- PostgreSQL: EXPLAIN ANALYZE shows actual execution time and row counts
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) 
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 100;

-- Key things to look for:
-- Seq Scan vs Index Scan (seq scan on large tables = bad)
-- Nested Loop vs Hash Join (nested loop on large datasets = bad)
-- Actual rows vs planned rows (large discrepancy = stale statistics → ANALYZE)
-- Buffers: shared hit vs read (reads = cache misses = disk I/O)

Load Testing

Load testing validates system performance under expected and peak load conditions. It answers: "Can our system handle the traffic we expect?"

Types of Load Tests

Type | Purpose | Duration | Load Pattern
--- | --- | --- | ---
Smoke test | Verify system works under minimal load | 1-5 min | Baseline (1-5 users)
Load test | Validate performance at expected load | 15-60 min | Normal traffic (target RPS)
Stress test | Find the breaking point | 15-30 min | Ramp up until failure
Soak test | Find memory leaks, connection leaks, degradation | 4-24 hours | Sustained normal load
Spike test | Validate behavior under sudden traffic burst | 10-20 min | Sudden burst to 5-10x normal
Breakpoint test | Find maximum capacity | 30-60 min | Gradually increasing load

Load Testing Tools

Tool | Language | Protocol Support | Strengths
--- | --- | --- | ---
k6 | JavaScript (Go engine) | HTTP, WebSocket, gRPC | Modern, scriptable, CI-friendly, low resource usage
Locust | Python | HTTP, custom protocols | Easy to write tests, distributed, real-time web UI
JMeter | Java | HTTP, JDBC, JMS, SMTP | Feature-rich, GUI, wide protocol support
Gatling | Scala | HTTP, WebSocket | High performance, detailed reports
wrk | C/Lua | HTTP | Lightweight, extremely fast, simple benchmarks
hey | Go | HTTP | Simple CLI tool for quick benchmarks
vegeta | Go | HTTP | Constant-RPS load testing (not concurrent users)

// k6 load test example with realistic scenario
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('error_rate');
const latencyTrend = new Trend('api_latency');

export const options = {
    stages: [
        { duration: '1m', target: 50 },   // Ramp up to 50 users over 1 minute
        { duration: '3m', target: 50 },   // Stay at 50 users for 3 minutes
        { duration: '1m', target: 200 },  // Ramp up to 200 users
        { duration: '3m', target: 200 },  // Stay at 200 users for 3 minutes
        { duration: '1m', target: 0 },    // Ramp down to 0 users
    ],
    thresholds: {
        http_req_duration: ['p(95)<200', 'p(99)<500'],  // 95% < 200ms, 99% < 500ms
        http_req_failed: ['rate<0.01'],                  // Error rate < 1%
        error_rate: ['rate<0.01'],
    },
};

export default function () {
    group('Browse products', () => {
        const listRes = http.get('https://api.example.com/products?page=1');
        check(listRes, {
            'status is 200': (r) => r.status === 200,
            'has products': (r) => JSON.parse(r.body).data.length > 0,
        });
        errorRate.add(listRes.status !== 200);
        latencyTrend.add(listRes.timings.duration);

        sleep(1); // Simulate user reading time

        // View a specific product
        const products = JSON.parse(listRes.body).data;
        if (products.length > 0) {
            const detailRes = http.get(`https://api.example.com/products/${products[0].id}`);
            check(detailRes, { 'product detail 200': (r) => r.status === 200 });
        }
    });

    group('Add to cart', () => {
        const res = http.post('https://api.example.com/cart/items', JSON.stringify({
            product_id: 'prod_123',
            quantity: 1,
        }), { headers: { 'Content-Type': 'application/json' } });

        check(res, { 'added to cart': (r) => r.status === 201 });
    });

    sleep(Math.random() * 3 + 1);  // Random think time 1-4 seconds
}

# Locust load test example
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)  # Random wait between 1-3 seconds

    @task(3)  # Weight: 3x more likely than other tasks
    def view_items(self):
        self.client.get("/api/items")

    @task(1)
    def create_item(self):
        self.client.post("/api/items", json={
            "name": "Test Item",
            "price": 29.99
        })

    def on_start(self):
        """Called when a simulated user starts."""
        self.client.post("/api/auth/login", json={
            "username": "testuser",
            "password": "testpass"
        })

Interpreting Load Test Results

Key signals that indicate problems:

✗ Latency increases linearly with load
  → System is reaching capacity; requests are queuing

✗ Latency increases exponentially with load
  → System is saturated; likely a bottleneck (single lock, single DB connection)

✗ Error rate spikes at a specific load level
  → Resource exhaustion (connection pool, file descriptors, memory)

✗ Throughput plateaus while latency increases
  → System is at maximum capacity; additional requests just wait in queue

✗ Latency is fine under load but degrades over time (soak test)
  → Memory leak, connection leak, log file filling disk, GC pressure increasing

Common Optimization Patterns

Caching

Caching is the single most impactful optimization for most applications. The key challenge is cache invalidation—ensuring cached data stays consistent with the source of truth.

Cache Layer | Latency | Capacity | Examples
--- | --- | --- | ---
CPU L1 cache | ~1 ns | 64 KB | Automatic (hardware)
CPU L3 cache | ~10 ns | 8-64 MB | Automatic (hardware)
Application memory | ~100 ns | GBs | In-process dict/map, LRU cache
Distributed cache | ~1 ms | TBs | Redis, Memcached
CDN edge | ~10 ms | Distributed | CloudFront, Cloudflare
Browser cache | ~0 ms | MBs | Cache-Control headers

Caching strategies:

Strategy | Description | Consistency | Use Case
--- | --- | --- | ---
Cache-aside | App checks cache first; on miss, reads DB, writes to cache | Eventual (TTL-based) | General purpose, default choice
Read-through | Cache automatically fetches from DB on miss | Eventual | Simplifies app code
Write-through | Writes go to cache AND DB simultaneously | Strong | When reads are much more frequent than writes
Write-behind | Writes go to cache; async batch write to DB | Eventual (risk of data loss) | High write throughput
Refresh-ahead | Proactively refresh cache before expiration | Strong (if refresh is fast) | Predictable access patterns

# Cache-aside pattern implementation
import redis
import json

cache = redis.Redis(host='localhost', port=6379)
TTL = 300  # 5 minutes

def get_user(user_id: str) -> dict:
    # 1. Check cache
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # 2. Cache miss — fetch from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if user is None:
        # Cache negative results too (prevent cache stampede on missing data)
        cache.setex(f"user:{user_id}", 60, json.dumps(None))
        return None

    # 3. Write to cache
    cache.setex(f"user:{user_id}", TTL, json.dumps(user))
    return user

def update_user(user_id: str, data: dict):
    # Update database
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # Invalidate cache (don't update — invalidate to avoid race conditions)
    cache.delete(f"user:{user_id}")

Cache stampede prevention: When a popular cache key expires, hundreds of requests simultaneously miss the cache and hit the database. Solutions:
- Locking: Only one request fetches from DB; others wait for the cache to be populated
- Stale-while-revalidate: Serve stale data while one request refreshes in the background
- Probabilistic early expiration: Each request has a small chance of refreshing before the TTL expires
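Probabilistic early expiration can be sketched in a few lines. This uses a plain dict as a stand-in for Redis, with the early-refresh probability weighted by how long recomputation takes (the "XFetch" approach); the key names are illustrative:

```python
import math
import random
import time

_cache = {}  # key -> (value, expiry_timestamp, recompute_seconds)

def cached_fetch(key, ttl, recompute, beta=1.0):
    """Cache read with probabilistic early expiration: each reader may
    refresh slightly before the TTL, spreading refreshes across requests
    instead of letting them all stampede at the moment of expiry."""
    now = time.time()
    entry = _cache.get(key)
    if entry is not None:
        value, expiry, delta = entry
        # -log(random()) > 0, so this adds a random jitter to "now";
        # the costlier the recompute (delta), the earlier refreshes start
        if now - delta * beta * math.log(random.random()) < expiry:
            return value
    start = time.time()
    value = recompute()                    # Cache miss or early refresh
    delta = time.time() - start            # Remember how costly this was
    _cache[key] = (value, time.time() + ttl, delta)
    return value

print(cached_fetch("user:42", ttl=300, recompute=lambda: {"name": "Ada"}))
```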

Connection Pooling

Creating database/HTTP connections is expensive (TCP handshake, TLS handshake, authentication). Connection pooling maintains a reusable set of connections.

# PostgreSQL connection pooling with psycopg2
import psycopg2.pool

# Create a pool of 5-20 connections
pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=5,
    maxconn=20,
    host='localhost',
    dbname='myapp',
    user='app',
    password='secret'
)

def get_user(user_id):
    conn = pool.getconn()       # Get a connection from the pool
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        return cursor.fetchone()
    finally:
        pool.putconn(conn)      # Return connection to the pool (don't close!)

Pool sizing: Too few connections = requests queue waiting for a connection. Too many = database overwhelmed. A good starting point: pool_size = (2 * num_cores) + num_disks on the database server side. On the application side, set pool size equal to the maximum concurrent database operations per process.

The N+1 Query Problem

The most common database performance anti-pattern in ORMs:

# N+1 problem: 1 query to get users + N queries to get each user's orders
users = User.query.all()                      # 1 query
for user in users:
    orders = user.orders                       # N queries (1 per user!)

# Solution: Eager loading (1 query with JOIN or 2 queries with IN)
from sqlalchemy.orm import joinedload

users = User.query.options(joinedload(User.orders)).all()  # 1-2 queries total

# Or explicit JOIN
users = db.session.query(User, Order) \
    .outerjoin(Order) \
    .all()

How to detect N+1: Enable query logging in development. If you see the same query repeated with different parameters, you likely have an N+1 problem. Tools: Django Debug Toolbar, SQLAlchemy echo=True, Rails bullet gem.

Database Query Optimization

Optimization | Description | Impact
--- | --- | ---
Add indexes | B-tree indexes for equality/range queries, GIN for full-text/JSON | 10-1000x faster queries
Composite indexes | Multi-column indexes for common query patterns | Avoids multiple index lookups
Covering indexes | Include all queried columns in the index (no table lookup) | Eliminates random I/O
Query rewriting | Replace subqueries with JOINs, use EXISTS instead of IN for large sets | 2-100x improvement
Pagination | Cursor-based pagination instead of OFFSET (OFFSET scans skipped rows) | Constant time vs linear
Denormalization | Store computed/duplicated data to avoid JOINs | Faster reads, slower writes
Materialized views | Pre-computed query results, refreshed periodically | Instant complex queries

-- Bad: OFFSET pagination (gets slower as page increases)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 10000;
-- Must scan and discard 10,000 rows!

-- Good: Cursor-based pagination (constant time)
SELECT * FROM orders 
WHERE created_at < '2025-01-15T10:30:00Z'  -- cursor from last page
ORDER BY created_at DESC 
LIMIT 20;

-- Index to support this query:
CREATE INDEX idx_orders_created_at ON orders (created_at DESC);
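The same keyset logic is easy to mirror in application code; a sketch against an in-memory list (hypothetical rows standing in for the orders table):

```python
from datetime import datetime, timedelta

# Hypothetical order rows, newest first (stand-in for the orders table)
base = datetime(2025, 1, 15)
rows = [{"id": i, "created_at": base - timedelta(minutes=i)} for i in range(100)]

def page(cursor=None, limit=20):
    """Keyset pagination: filter past the cursor instead of skipping rows."""
    matching = [r for r in rows if cursor is None or r["created_at"] < cursor]
    batch = matching[:limit]                       # rows are already sorted DESC
    next_cursor = batch[-1]["created_at"] if batch else None
    return batch, next_cursor

first, cur = page()                # page 1: ids 0..19
second, _ = page(cursor=cur)       # page 2: ids 20..39, no rows skipped
print(len(first), len(second), first[0]["id"], second[0]["id"])
```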

Async and Non-Blocking I/O

For I/O-bound workloads, async processing dramatically improves throughput by not blocking threads waiting for I/O:

# Synchronous: 10 API calls × 200ms each = 2000ms total
import requests

def fetch_all_sync(urls):
    results = []
    for url in urls:
        resp = requests.get(url)          # Blocks for ~200ms
        results.append(resp.json())
    return results                         # Total: ~2000ms

# Asynchronous: 10 API calls in parallel = ~200ms total
import asyncio
import aiohttp

async def fetch_all_async(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)  # All run concurrently
        return [await r.json() for r in responses]
# Total: ~200ms (limited by slowest single call)

When to use async:
- I/O-bound workloads (API calls, database queries, file I/O)
- High concurrency requirements (thousands of simultaneous connections)
- WebSocket servers, chat applications, real-time systems

When NOT to use async:
- CPU-bound workloads (use multiprocessing or threads instead)
- Simple scripts with sequential logic
- When the added complexity isn't justified by the performance gain

Batch Processing

Group multiple operations into fewer, larger operations:

# Bad: N individual INSERT statements
for user in users:
    db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (user.name, user.email))
# 1000 users = 1000 round trips to database

# Good: Single batch INSERT
values = [(u.name, u.email) for u in users]
db.executemany("INSERT INTO users (name, email) VALUES (%s, %s)", values)
# 1000 users = 1 round trip

# Even better: COPY command (PostgreSQL) for bulk loading
import io
import csv

buffer = io.StringIO()
writer = csv.writer(buffer)
for user in users:
    writer.writerow([user.name, user.email])
buffer.seek(0)
cursor.copy_from(buffer, 'users', columns=('name', 'email'), sep=',')
# 10-100x faster than INSERT for large datasets

Compression

Algorithm | Speed | Ratio | Use Case
--- | --- | --- | ---
gzip | Medium | Good (60-70% reduction) | HTTP responses (universal support)
Brotli | Slower compression, fast decompression | Better than gzip (20-30% smaller) | Static assets, HTTP (modern browsers)
zstd | Very fast | Similar to gzip or better | Logs, backups, inter-service communication
lz4 | Extremely fast | Lower ratio | Real-time compression, databases
snappy | Very fast | Lower ratio | Big data (Hadoop, Kafka, Cassandra)
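zstd, lz4, Brotli, and snappy all require third-party packages, but the speed/ratio trade-off is easy to demonstrate with the stdlib codecs (gzip, bz2, lzma) on a repetitive log-like payload:

```python
import bz2
import gzip
import lzma

# Repetitive payload, similar to logs or JSON — highly compressible
data = b'{"level":"INFO","msg":"request handled","status":200}\n' * 1000

for name, compress in (("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)):
    out = compress(data)
    print(f"{name:>4}: {len(data):>6} -> {len(out):>5} bytes ({len(out) / len(data):.1%})")
```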

Data Serialization Performance

Format | Speed | Size | Schema | Use Case
--- | --- | --- | --- | ---
JSON | Slow | Large | No | REST APIs, human-readable
Protocol Buffers | Fast | Small | Yes (.proto) | gRPC, inter-service communication
MessagePack | Fast | Medium | No | Binary JSON alternative
FlatBuffers | Very fast (zero-copy) | Small | Yes | Games, real-time systems
Avro | Fast | Small | Yes (embedded) | Data pipelines, Kafka

Concurrency and Parallelism

Concept | Description | Python | Go | Rust
--- | --- | --- | --- | ---
Threading | Multiple threads sharing memory | threading (GIL-limited) | goroutines (multiplexed) | std::thread, rayon
Multiprocessing | Multiple processes with separate memory | multiprocessing | N/A (use goroutines) | std::process (rarely needed)
Async I/O | Event loop with non-blocking I/O | asyncio | goroutines + channels | tokio, async-std
Actor model | Message-passing between isolated actors | pykka, ray | goroutines + channels | actix

Python's GIL (Global Interpreter Lock): CPython's GIL allows only one thread to execute Python bytecode at a time. This means:
- CPU-bound: Use multiprocessing (separate processes, no GIL) or concurrent.futures.ProcessPoolExecutor
- I/O-bound: Use asyncio or threading (the GIL is released during I/O waits)
- Alternative: Use C extensions (NumPy, pandas) that release the GIL during computation
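A quick sketch of the CPU-bound case: the same pure-Python work dispatched to a process pool, which bypasses the GIL by running each task in a separate interpreter:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python arithmetic holds the GIL, so threads would not help here
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each task runs in its own process with its own GIL
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_bound, [200_000] * 4))
    print(results)
```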

Garbage Collection and Memory Management

Understanding GC behavior is critical for low-latency applications:

GC Type | Languages | Pause Behavior | Tuning
--- | --- | --- | ---
Mark and Sweep | Python (cycle collector), Go | Stop-the-world pauses | Go: GOGC env var
Generational | Java (G1, ZGC), .NET, Python (reference counting + generational cycle collector) | Short young-gen pauses, occasional major pauses | Java: -Xms, -Xmx, GC algorithm selection
Reference Counting | Python (primary), Swift, Rust (Rc/Arc) | No pauses, but cyclic references need separate handling | Python: gc module for cycle detection
Ownership | Rust | No GC pauses (compile-time memory management) | N/A (deterministic destruction)
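Python's split between reference counting and the cycle collector is easy to observe with the gc module:

```python
import gc

class Node:
    def __init__(self):
        self.other = None

# Build a reference cycle: reference counting alone can never free these
a, b = Node(), Node()
a.other, b.other = b, a
del a, b

# The cycle collector finds and frees the now-unreachable cycle
freed = gc.collect()
print(f"cycle collector freed {freed} objects")
```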

GC tuning for Java:

# Use ZGC for low-latency applications (sub-ms pauses)
java -XX:+UseZGC -Xms4g -Xmx4g -jar app.jar

# GC logging for analysis
java -Xlog:gc*:file=gc.log:time -jar app.jar

Go GC tuning:

# GOGC controls how aggressive GC is (default 100)
# GOGC=100: GC runs when heap doubles since last collection
# GOGC=200: GC runs when heap triples (less frequent GC, more memory)
# GOGC=50: GC runs when heap grows by 50% (more frequent, less memory)
GOGC=200 ./myapp

# GOMEMLIMIT (Go 1.19+): Set soft memory limit
GOMEMLIMIT=4GiB ./myapp

Frontend Performance Optimization

Technique | Impact | Description
--- | --- | ---
Code splitting | High | Load only the JavaScript needed for the current page (React.lazy(), dynamic imports)
Tree shaking | High | Remove unused code at build time (Webpack, Rollup, esbuild)
Image optimization | High | WebP/AVIF format, responsive sizes (srcset), lazy loading (loading="lazy")
Minification | Medium | Reduce JS/CSS file size by removing whitespace, shortening variable names
Bundle analysis | Medium | Identify large dependencies that can be replaced or lazy-loaded
Preloading | Medium | <link rel="preload"> for critical resources, <link rel="prefetch"> for next-page resources
Service Workers | High | Cache assets and API responses for offline access and instant loads
SSR / SSG | High | Server-side rendering or static generation for faster first paint (Next.js, Nuxt, Astro)
103 Early Hints | Medium | Hint critical resources before the full response is ready (HTTP/2 Server Push is deprecated)

Performance Testing Methodology

A systematic approach to performance engineering:

1. DEFINE requirements
   - What are the performance SLOs? (p95 < 200ms, throughput > 1000 RPS)
   - What is the expected traffic pattern? (steady, bursty, seasonal)

2. MEASURE baseline
   - Profile the current system under production-like load
   - Identify the bottleneck (CPU, memory, I/O, network, database)

3. HYPOTHESIZE
   - "Adding an index on user_id will reduce the query from 50ms to 5ms"
   - "Caching the product catalog will reduce API latency by 60%"

4. OPTIMIZE
   - Implement ONE change at a time (otherwise you can't attribute improvements)

5. MEASURE again
   - Run the same benchmark/profile under the same conditions
   - Quantify the improvement

6. ITERATE
   - If target met: done (don't over-optimize)
   - If not: go back to step 3

NEVER skip step 2 (baseline measurement).
NEVER change multiple things at once.

Common Performance Anti-Patterns

Anti-Pattern | Description | Fix
--- | --- | ---
Premature optimization | Optimizing before measuring | Profile first, optimize the actual bottleneck
N+1 queries | Fetching related records one at a time in a loop | Eager loading, JOINs, batch fetching
Unbounded queries | SELECT * without LIMIT or pagination | Always paginate, select only needed columns
Synchronous I/O in hot paths | Blocking on network/disk in request handlers | Use async I/O, background workers, caching
Missing indexes | Full table scans on large tables | Add indexes for common query patterns
Log level too verbose | DEBUG logging in production | Use INFO/WARN in production, DEBUG only when needed
String concatenation in loops | O(n²) string building | Use StringBuilder/join/buffers
Chatty APIs | Multiple round trips for one screen of data | Aggregate endpoints, GraphQL, BFF pattern
Large payloads | Sending more data than the client needs | Sparse fieldsets, pagination, compression
No connection pooling | Creating new DB connections per request | Use connection pools

Optimization Summary Table

Layer | Optimization | Typical Improvement
--- | --- | ---
Network | CDN, compression, HTTP/2, connection reuse | 50-90% latency reduction for static assets
Application | Caching, async I/O, connection pooling, batch processing | 2-100x throughput improvement
Database | Indexes, query optimization, read replicas, materialized views | 10-1000x query speedup
Frontend | Code splitting, image optimization, SSR/SSG, service workers | 2-5x faster page loads
Infrastructure | Auto-scaling, right-sizing, load balancing | Handle 10-100x more traffic