
Performance Engineering

Performance engineering is the practice of designing, measuring, and optimizing software systems to meet performance requirements. It spans the entire software development lifecycle—from architectural decisions to production monitoring. The golden rule: always measure before optimizing—intuition about bottlenecks is wrong more often than it's right.

Performance Metrics

Metric | Description | Typical Target
--- | --- | ---
Latency | Time to process a single request (p50, p95, p99) | API: <100ms p95, Web: <200ms p95
Throughput | Number of operations per unit time (RPS, TPS) | Varies by service
TTFB | Time to First Byte — server processing + network time | <200ms
Apdex | Application Performance Index (0-1 score of user satisfaction) | >0.9
Error rate | Percentage of failed requests | <0.1%
Resource utilization | CPU, memory, disk I/O, network bandwidth usage | CPU <70%, Memory <80%

Why Percentiles Matter More Than Averages

Average latency hides problems. If 99% of requests take 50ms and 1% take 5000ms, the average is ~100ms—which looks fine. But that 1% represents real users with a terrible experience, and they're often your most important users (high-value customers with more data, more complex queries).

Latency distribution example:
  p50 (median):  50ms   — Half of requests are faster than this
  p90:          100ms   — 10% of requests are slower
  p95:          200ms   — 5% are slower (common SLO target)
  p99:          500ms   — 1% are slower (catches tail latency)
  p99.9:       2000ms   — 1 in 1000 (often database timeouts, GC pauses)
  max:         5000ms   — Single worst request (outliers)

Rule of thumb: Set SLOs on p95 or p99, not average.
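Percentiles like the ones in the distribution above can be computed directly from raw latency samples with the standard library; a minimal sketch using synthetic data:

```python
import random
import statistics

random.seed(42)
# Synthetic latency samples (ms): 99% fast requests plus a 1% slow tail
samples = [random.gauss(50, 10) for _ in range(990)] + \
          [random.uniform(500, 5000) for _ in range(10)]

# quantiles(n=100) returns the 99 percentile cut points p1..p99
cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
mean = statistics.mean(samples)
print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

The mean lands well above the median here, which is exactly why averages hide the tail.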

Tail latency amplification: In microservices, a single user request may fan out to 10-50 backend services. If each service has p99 = 100ms, the overall p99 is NOT 100ms—it's closer to max(all services). With 50 parallel calls at p99 = 100ms, there's a ~40% chance that at least one exceeds 100ms per request.
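That ~40% figure follows directly from 1 − 0.99^n; a quick check:

```python
# Chance that at least one of n parallel calls lands in the slowest 1%
def p_any_slow(n: int, p_fast: float = 0.99) -> float:
    return 1 - p_fast ** n

for n in (1, 10, 50, 100):
    print(f"{n:>3} parallel calls: {p_any_slow(n):.0%} chance of hitting p99 tail latency")
```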

Web Performance Metrics (Core Web Vitals)

Google's Core Web Vitals measure real user experience and affect search rankings:

Metric | What it Measures | Good | Needs Improvement | Poor
--- | --- | --- | --- | ---
LCP (Largest Contentful Paint) | Loading performance — when the largest visible element renders | ≤2.5s | ≤4.0s | >4.0s
INP (Interaction to Next Paint) | Interactivity — time from user input to visual response | ≤200ms | ≤500ms | >500ms
CLS (Cumulative Layout Shift) | Visual stability — how much the page layout shifts unexpectedly | ≤0.1 | ≤0.25 | >0.25

Improving LCP: Optimize the critical rendering path (reduce render-blocking CSS/JS), preload key resources (<link rel="preload">), use CDN for static assets, optimize images (WebP/AVIF, responsive sizes, lazy loading below-the-fold).

Improving INP: Keep JavaScript execution short (break long tasks with requestIdleCallback), debounce/throttle event handlers, minimize main thread blocking, use web workers for heavy computation.

Improving CLS: Set explicit width/height on images and embeds, use aspect-ratio CSS, avoid injecting content above the fold after page load, use font-display: swap for web fonts.

Profiling

Profiling is the process of measuring where an application spends its time and resources. Always profile before optimizing—the bottleneck is rarely where you expect.

Types of Profiling

Type | What it Measures | Tools
--- | --- | ---
CPU Profiling | Where computation time is spent | cProfile (Python), perf (Linux), pprof (Go), cargo flamegraph (Rust), async-profiler (Java)
Memory Profiling | Memory allocation, leaks, high-water marks | tracemalloc (Python), Valgrind, heaptrack, memory_profiler, jemalloc
I/O Profiling | Disk and network I/O patterns, blocking calls | strace, ltrace, iotop, BPF/eBPF, perf trace
Lock/Contention | Thread contention on locks, synchronization overhead | perf lock, mutrace, async-profiler (Java), tokio-console (Rust)
Allocation Profiling | Where memory allocations happen (separate from leaks) | DHAT (Valgrind), heaptrack, jemalloc prof

CPU Profiling

# Python CPU profiling example
import cProfile
import pstats

def expensive_function():
    result = []
    for i in range(1_000_000):
        result.append(i ** 2)
    return sorted(result, reverse=True)

# Profile the function
profiler = cProfile.Profile()
profiler.enable()
expensive_function()
profiler.disable()

# Print stats sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions

// Go profiling with pprof (built into the standard library)
import (
    "net/http"
    _ "net/http/pprof"  // Register pprof endpoints
)

func main() {
    // Exposes /debug/pprof/ endpoints
    go http.ListenAndServe(":6060", nil)

    // Your application code...
}

// Then analyze:
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30  (CPU)
// go tool pprof http://localhost:6060/debug/pprof/heap                (memory)
// go tool pprof http://localhost:6060/debug/pprof/goroutine           (goroutines)

Memory Profiling

# Python memory profiling with tracemalloc
import tracemalloc

tracemalloc.start()

# Code to profile
data = [dict(index=i, value=i**2) for i in range(100_000)]

# Take a snapshot
snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics('lineno')

print("Top 10 memory allocations:")
for stat in stats[:10]:
    print(stat)

# Compare snapshots to find leaks
snapshot1 = tracemalloc.take_snapshot()
# ... more code ...
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)  # Shows memory growth between snapshots

Common memory issues:
- Memory leaks: Objects that are no longer needed but still referenced (growing collections, unclosed connections, event listener accumulation)
- High allocation rate: Creating and destroying many small objects (GC pressure). Fix: reuse objects, use object pools
- Large objects: Single allocations that are disproportionately large. Fix: streaming, pagination, lazy loading
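A simple object pool addresses the high-allocation-rate case; a minimal sketch built on a thread-safe queue (the 1 MiB buffer size is an arbitrary example):

```python
import queue

class ObjectPool:
    """Reuse expensive-to-create objects instead of reallocating them."""
    def __init__(self, factory, size):
        self._items = queue.Queue()
        for _ in range(size):
            self._items.put(factory())

    def acquire(self):
        return self._items.get()        # Blocks if the pool is exhausted

    def release(self, obj):
        self._items.put(obj)

pool = ObjectPool(lambda: bytearray(1 << 20), size=4)  # Four reusable 1 MiB buffers
buf = pool.acquire()
try:
    buf[:5] = b"hello"                  # Work with the buffer; no new allocation
finally:
    pool.release(buf)                   # Always return it, even on error
```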

Flame Graphs

Flame graphs are a visualization of profiled software, showing the most frequent code paths. The x-axis represents the population of stack traces (wider = more samples), and the y-axis shows stack depth. They make it immediately obvious where time is spent.

# Generate a flame graph on Linux using perf
perf record -g -p <PID> -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

# For Python using py-spy (can attach to running process!)
py-spy record -o profile.svg --pid <PID>
py-spy top --pid <PID>              # Live top-like view

# For Go using pprof
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30

# For Rust using cargo-flamegraph
cargo install flamegraph
cargo flamegraph --bin myapp

# For Node.js
node --prof app.js                  # Generate V8 profile
node --prof-process isolate-*.log   # Convert to readable format
# Or use 0x: npx 0x app.js         # Generates interactive flame graph

Reading a flame graph:
- Width = proportion of total samples (wider = more time)
- Height = call stack depth (taller = deeper nesting)
- Color = typically random (or grouped by module)
- Look for: wide blocks (hot functions), tall narrow towers (deep recursion), flat tops (leaf functions doing work)

Benchmarking

Benchmarking measures the performance of specific code paths or system components under controlled conditions. Unlike profiling (which shows where time is spent), benchmarking measures how fast something is.

Microbenchmarking

# Python benchmarking with timeit
import timeit

# Compare list comprehension vs map
list_comp_time = timeit.timeit(
    '[x**2 for x in range(1000)]',
    number=10000
)
map_time = timeit.timeit(
    'list(map(lambda x: x**2, range(1000)))',
    number=10000
)
print(f"List comprehension: {list_comp_time:.4f}s")
print(f"Map:                {map_time:.4f}s")

// Rust benchmarking with criterion (statistical benchmarking)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
// Criterion runs the benchmark enough times to be statistically significant,
// compares against previous runs, and detects regressions.

Microbenchmarking pitfalls:
- Dead code elimination: Compilers may optimize away your benchmark code. Use black_box() (Rust) or assign results to a variable.
- Warmup: JIT-compiled languages (Java, JavaScript) need warmup iterations.
- Measurement overhead: time.time() itself has overhead. Use dedicated benchmarking tools.
- Cache effects: The first run may be slower (cold cache). Run multiple iterations.
- Context switching: Other processes can affect results. Pin to a CPU core for precise measurements.
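A small harness sidesteps several of these pitfalls at once — a discarded warmup pass, repeated runs, and minimum-of-runs as the noise-resistant estimate — sketched here with the stdlib timeit module:

```python
import timeit

def bench(stmt: str, setup: str = "pass", repeat: int = 5, number: int = 10_000) -> float:
    """Return best-case seconds per execution of stmt."""
    timeit.timeit(stmt, setup=setup, number=number)   # Warmup pass (discarded)
    runs = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(runs) / number   # Minimum of runs is the least noisy estimate

per_call = bench("[x**2 for x in range(100)]")
print(f"{per_call * 1e6:.2f} µs per call")
```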

Database Query Benchmarking

-- PostgreSQL: EXPLAIN ANALYZE shows actual execution time and row counts
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) 
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 100;

-- Key things to look for:
-- Seq Scan vs Index Scan (seq scan on large tables = bad)
-- Nested Loop vs Hash Join (nested loop on large datasets = bad)
-- Actual rows vs planned rows (large discrepancy = stale statistics → ANALYZE)
-- Buffers: shared hit vs read (reads = cache misses = disk I/O)

Load Testing

Load testing validates system performance under expected and peak load conditions. It answers: "Can our system handle the traffic we expect?"

Types of Load Tests

Type | Purpose | Duration | Load Pattern
--- | --- | --- | ---
Smoke test | Verify system works under minimal load | 1-5 min | Baseline (1-5 users)
Load test | Validate performance at expected load | 15-60 min | Normal traffic (target RPS)
Stress test | Find the breaking point | 15-30 min | Ramp up until failure
Soak test | Find memory leaks, connection leaks, degradation | 4-24 hours | Sustained normal load
Spike test | Validate behavior under sudden traffic burst | 10-20 min | Sudden burst to 5-10x normal
Breakpoint test | Find maximum capacity | 30-60 min | Gradually increasing load

Load Testing Tools

Tool | Language | Protocol Support | Strengths
--- | --- | --- | ---
k6 | JavaScript (Go engine) | HTTP, WebSocket, gRPC | Modern, scriptable, CI-friendly, low resource usage
Locust | Python | HTTP, custom protocols | Easy to write tests, distributed, real-time web UI
JMeter | Java | HTTP, JDBC, JMS, SMTP | Feature-rich, GUI, wide protocol support
Gatling | Scala | HTTP, WebSocket | High performance, detailed reports
wrk | C/Lua | HTTP | Lightweight, extremely fast, simple benchmarks
hey | Go | HTTP | Simple CLI tool for quick benchmarks
vegeta | Go | HTTP | Constant-RPS load testing (not concurrent users)

// k6 load test example with realistic scenario
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('error_rate');
const latencyTrend = new Trend('api_latency');

export const options = {
    stages: [
        { duration: '1m', target: 50 },   // Ramp up to 50 users over 1 minute
        { duration: '3m', target: 50 },   // Stay at 50 users for 3 minutes
        { duration: '1m', target: 200 },  // Ramp up to 200 users
        { duration: '3m', target: 200 },  // Stay at 200 users for 3 minutes
        { duration: '1m', target: 0 },    // Ramp down to 0 users
    ],
    thresholds: {
        http_req_duration: ['p(95)<200', 'p(99)<500'],  // 95% < 200ms, 99% < 500ms
        http_req_failed: ['rate<0.01'],                  // Error rate < 1%
        error_rate: ['rate<0.01'],
    },
};

export default function () {
    group('Browse products', () => {
        const listRes = http.get('https://api.example.com/products?page=1');
        check(listRes, {
            'status is 200': (r) => r.status === 200,
            'has products': (r) => JSON.parse(r.body).data.length > 0,
        });
        errorRate.add(listRes.status !== 200);
        latencyTrend.add(listRes.timings.duration);

        sleep(1); // Simulate user reading time

        // View a specific product
        const products = JSON.parse(listRes.body).data;
        if (products.length > 0) {
            const detailRes = http.get(`https://api.example.com/products/${products[0].id}`);
            check(detailRes, { 'product detail 200': (r) => r.status === 200 });
        }
    });

    group('Add to cart', () => {
        const res = http.post('https://api.example.com/cart/items', JSON.stringify({
            product_id: 'prod_123',
            quantity: 1,
        }), { headers: { 'Content-Type': 'application/json' } });

        check(res, { 'added to cart': (r) => r.status === 201 });
    });

    sleep(Math.random() * 3 + 1);  // Random think time 1-4 seconds
}

# Locust load test example
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)  # Random wait between 1-3 seconds

    @task(3)  # Weight: 3x more likely than other tasks
    def view_items(self):
        self.client.get("/api/items")

    @task(1)
    def create_item(self):
        self.client.post("/api/items", json={
            "name": "Test Item",
            "price": 29.99
        })

    def on_start(self):
        """Called when a simulated user starts."""
        self.client.post("/api/auth/login", json={
            "username": "testuser",
            "password": "testpass"
        })

Interpreting Load Test Results

Key signals that indicate problems:

✗ Latency increases linearly with load
  → System is reaching capacity; requests are queuing

✗ Latency increases exponentially with load
  → System is saturated; likely a bottleneck (single lock, single DB connection)

✗ Error rate spikes at a specific load level
  → Resource exhaustion (connection pool, file descriptors, memory)

✗ Throughput plateaus while latency increases
  → System is at maximum capacity; additional requests just wait in queue

✗ Latency is fine under load but degrades over time (soak test)
  → Memory leak, connection leak, log file filling disk, GC pressure increasing

Common Optimization Patterns

Caching

Caching is the single most impactful optimization for most applications. The key challenge is cache invalidation—ensuring cached data stays consistent with the source of truth.

Cache Layer | Latency | Capacity | Examples
--- | --- | --- | ---
CPU L1 cache | ~1 ns | 64 KB | Automatic (hardware)
CPU L3 cache | ~10 ns | 8-64 MB | Automatic (hardware)
Application memory | ~100 ns | GBs | In-process dict/map, LRU cache
Distributed cache | ~1 ms | TBs | Redis, Memcached
CDN edge | ~10 ms | Distributed | CloudFront, Cloudflare
Browser cache | ~0 ms | MBs | Cache-Control headers

Caching strategies:

Strategy | Description | Consistency | Use Case
--- | --- | --- | ---
Cache-aside | App checks cache first; on miss, reads DB, writes to cache | Eventual (TTL-based) | General purpose, default choice
Read-through | Cache automatically fetches from DB on miss | Eventual | Simplifies app code
Write-through | Writes go to cache AND DB simultaneously | Strong | When reads are much more frequent than writes
Write-behind | Writes go to cache; async batch write to DB | Eventual (risk of data loss) | High write throughput
Refresh-ahead | Proactively refresh cache before expiration | Strong (if refresh is fast) | Predictable access patterns

# Cache-aside pattern implementation
import redis
import json

cache = redis.Redis(host='localhost', port=6379)
TTL = 300  # 5 minutes

def get_user(user_id: str) -> dict:
    # 1. Check cache
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # 2. Cache miss — fetch from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if user is None:
        # Cache negative results too (prevent cache stampede on missing data)
        cache.setex(f"user:{user_id}", 60, json.dumps(None))
        return None

    # 3. Write to cache
    cache.setex(f"user:{user_id}", TTL, json.dumps(user))
    return user

def update_user(user_id: str, data: dict):
    # Update database
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # Invalidate cache (don't update — invalidate to avoid race conditions)
    cache.delete(f"user:{user_id}")

Cache stampede prevention: When a popular cache key expires, hundreds of requests simultaneously miss the cache and hit the database. Solutions:
- Locking: Only one request fetches from DB; others wait for the cache to be populated
- Stale-while-revalidate: Serve stale data while one request refreshes in the background
- Probabilistic early expiration: Each request has a small chance of refreshing before the TTL expires
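Probabilistic early expiration can be sketched in a few lines. This uses a plain dict as a stand-in for Redis, with the early-refresh probability weighted by how long recomputation takes (the "XFetch" approach); the key names are illustrative:

```python
import math
import random
import time

_cache = {}  # key -> (value, expiry_timestamp, recompute_seconds)

def cached_fetch(key, ttl, recompute, beta=1.0):
    """Cache read with probabilistic early expiration: each reader may
    refresh slightly before the TTL, spreading refreshes across requests
    instead of letting them all stampede at the moment of expiry."""
    now = time.time()
    entry = _cache.get(key)
    if entry is not None:
        value, expiry, delta = entry
        # -log(random()) > 0, so this adds a random jitter to "now";
        # the costlier the recompute (delta), the earlier refreshes start
        if now - delta * beta * math.log(random.random()) < expiry:
            return value
    start = time.time()
    value = recompute()                    # Cache miss or early refresh
    delta = time.time() - start            # Remember how costly this was
    _cache[key] = (value, time.time() + ttl, delta)
    return value

print(cached_fetch("user:42", ttl=300, recompute=lambda: {"name": "Ada"}))
```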

Connection Pooling

Creating database/HTTP connections is expensive (TCP handshake, TLS handshake, authentication). Connection pooling maintains a reusable set of connections.

# PostgreSQL connection pooling with psycopg2
import psycopg2.pool

# Create a pool of 5-20 connections
pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=5,
    maxconn=20,
    host='localhost',
    dbname='myapp',
    user='app',
    password='secret'
)

def get_user(user_id):
    conn = pool.getconn()       # Get a connection from the pool
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        return cursor.fetchone()
    finally:
        pool.putconn(conn)      # Return connection to the pool (don't close!)

Pool sizing: Too few connections = requests queue waiting for a connection. Too many = database overwhelmed. A good starting point: pool_size = (2 * num_cores) + num_disks on the database server side. On the application side, set pool size equal to the maximum concurrent database operations per process.

The N+1 Query Problem

The most common database performance anti-pattern in ORMs:

# N+1 problem: 1 query to get users + N queries to get each user's orders
users = User.query.all()                      # 1 query
for user in users:
    orders = user.orders                       # N queries (1 per user!)

# Solution: Eager loading (1 query with JOIN or 2 queries with IN)
from sqlalchemy.orm import joinedload

users = User.query.options(joinedload(User.orders)).all()  # 1-2 queries total

# Or explicit JOIN
users = db.session.query(User, Order) \
    .outerjoin(Order) \
    .all()

How to detect N+1: Enable query logging in development. If you see the same query repeated with different parameters, you likely have an N+1 problem. Tools: Django Debug Toolbar, SQLAlchemy echo=True, Rails bullet gem.

Database Query Optimization

Optimization | Description | Impact
--- | --- | ---
Add indexes | B-tree indexes for equality/range queries, GIN for full-text/JSON | 10-1000x faster queries
Composite indexes | Multi-column indexes for common query patterns | Avoids multiple index lookups
Covering indexes | Include all queried columns in the index (no table lookup) | Eliminates random I/O
Query rewriting | Replace subqueries with JOINs, use EXISTS instead of IN for large sets | 2-100x improvement
Pagination | Cursor-based pagination instead of OFFSET (OFFSET scans skipped rows) | Constant time vs linear
Denormalization | Store computed/duplicated data to avoid JOINs | Faster reads, slower writes
Materialized views | Pre-computed query results, refreshed periodically | Instant complex queries

-- Bad: OFFSET pagination (gets slower as page increases)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 10000;
-- Must scan and discard 10,000 rows!

-- Good: Cursor-based pagination (constant time)
SELECT * FROM orders 
WHERE created_at < '2025-01-15T10:30:00Z'  -- cursor from last page
ORDER BY created_at DESC 
LIMIT 20;

-- Index to support this query:
CREATE INDEX idx_orders_created_at ON orders (created_at DESC);
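The same keyset logic is easy to mirror in application code; a sketch against an in-memory list (hypothetical rows standing in for the orders table):

```python
from datetime import datetime, timedelta

# Hypothetical order rows, newest first (stand-in for the orders table)
base = datetime(2025, 1, 15)
rows = [{"id": i, "created_at": base - timedelta(minutes=i)} for i in range(100)]

def page(cursor=None, limit=20):
    """Keyset pagination: filter past the cursor instead of skipping rows."""
    matching = [r for r in rows if cursor is None or r["created_at"] < cursor]
    batch = matching[:limit]                       # rows are already sorted DESC
    next_cursor = batch[-1]["created_at"] if batch else None
    return batch, next_cursor

first, cur = page()                # page 1: ids 0..19
second, _ = page(cursor=cur)       # page 2: ids 20..39, no rows skipped
print(len(first), len(second), first[0]["id"], second[0]["id"])
```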

Async and Non-Blocking I/O

For I/O-bound workloads, async processing dramatically improves throughput by not blocking threads waiting for I/O:

# Synchronous: 10 API calls × 200ms each = 2000ms total
import requests

def fetch_all_sync(urls):
    results = []
    for url in urls:
        resp = requests.get(url)          # Blocks for ~200ms
        results.append(resp.json())
    return results                         # Total: ~2000ms

# Asynchronous: 10 API calls in parallel = ~200ms total
import asyncio
import aiohttp

async def fetch_all_async(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)  # All run concurrently
        return [await r.json() for r in responses]
# Total: ~200ms (limited by slowest single call)

When to use async:
- I/O-bound workloads (API calls, database queries, file I/O)
- High concurrency requirements (thousands of simultaneous connections)
- WebSocket servers, chat applications, real-time systems

When NOT to use async:
- CPU-bound workloads (use multiprocessing or threads instead)
- Simple scripts with sequential logic
- When the added complexity isn't justified by the performance gain

Batch Processing

Group multiple operations into fewer, larger operations:

# Bad: N individual INSERT statements
for user in users:
    db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (user.name, user.email))
# 1000 users = 1000 round trips to database

# Good: Single batch INSERT
values = [(u.name, u.email) for u in users]
db.executemany("INSERT INTO users (name, email) VALUES (%s, %s)", values)
# 1000 users = 1 round trip

# Even better: COPY command (PostgreSQL) for bulk loading
import io
import csv

buffer = io.StringIO()
writer = csv.writer(buffer)
for user in users:
    writer.writerow([user.name, user.email])
buffer.seek(0)
cursor.copy_from(buffer, 'users', columns=('name', 'email'), sep=',')
# 10-100x faster than INSERT for large datasets

Compression

Algorithm | Speed | Ratio | Use Case
--- | --- | --- | ---
gzip | Medium | Good (60-70% reduction) | HTTP responses (universal support)
Brotli | Slower compression, fast decompression | Better than gzip (20-30% smaller) | Static assets, HTTP (modern browsers)
zstd | Very fast | Similar to gzip or better | Logs, backups, inter-service communication
lz4 | Extremely fast | Lower ratio | Real-time compression, databases
snappy | Very fast | Lower ratio | Big data (Hadoop, Kafka, Cassandra)
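zstd, lz4, Brotli, and snappy all require third-party packages, but the speed/ratio trade-off is easy to demonstrate with the stdlib codecs (gzip, bz2, lzma) on a repetitive log-like payload:

```python
import bz2
import gzip
import lzma

# Repetitive payload, similar to logs or JSON — highly compressible
data = b'{"level":"INFO","msg":"request handled","status":200}\n' * 1000

for name, compress in (("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)):
    out = compress(data)
    print(f"{name:>4}: {len(data):>6} -> {len(out):>5} bytes ({len(out) / len(data):.1%})")
```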

Data Serialization Performance

Format | Speed | Size | Schema | Use Case
--- | --- | --- | --- | ---
JSON | Slow | Large | No | REST APIs, human-readable
Protocol Buffers | Fast | Small | Yes (.proto) | gRPC, inter-service communication
MessagePack | Fast | Medium | No | Binary JSON alternative
FlatBuffers | Very fast (zero-copy) | Small | Yes | Games, real-time systems
Avro | Fast | Small | Yes (embedded) | Data pipelines, Kafka

Concurrency and Parallelism

Concept | Description | Python | Go | Rust
--- | --- | --- | --- | ---
Threading | Multiple threads sharing memory | threading (GIL-limited) | goroutines (multiplexed) | std::thread, rayon
Multiprocessing | Multiple processes with separate memory | multiprocessing | N/A (use goroutines) | std::process (rarely needed)
Async I/O | Event loop with non-blocking I/O | asyncio | goroutines + channels | tokio, async-std
Actor model | Message-passing between isolated actors | pykka, ray | goroutines + channels | actix

Python's GIL (Global Interpreter Lock): CPython's GIL allows only one thread to execute Python bytecode at a time. This means:
- CPU-bound: Use multiprocessing (separate processes, no GIL) or concurrent.futures.ProcessPoolExecutor
- I/O-bound: Use asyncio or threading (the GIL is released during I/O waits)
- Alternative: Use C extensions (NumPy, pandas) that release the GIL during computation
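A quick sketch of the CPU-bound case: the same pure-Python work dispatched to a process pool, which bypasses the GIL by running each task in a separate interpreter:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # Pure-Python arithmetic holds the GIL, so threads would not help here
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Each task runs in its own process with its own GIL
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_bound, [200_000] * 4))
    print(results)
```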

Garbage Collection and Memory Management

Understanding GC behavior is critical for low-latency applications:

GC Type | Languages | Pause Behavior | Tuning
--- | --- | --- | ---
Mark and Sweep | Python (cycle collector), Go | Stop-the-world pauses | Go: GOGC env var
Generational | Java (G1, ZGC), .NET, Python (reference counting + generational cycle collector) | Short young-gen pauses, occasional major pauses | Java: -Xms, -Xmx, GC algorithm selection
Reference Counting | Python (primary), Swift, Rust (Rc/Arc) | No pauses, but cyclic references need separate handling | Python: gc module for cycle detection
Ownership | Rust | No GC pauses (compile-time memory management) | N/A (deterministic destruction)
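Python's split between reference counting and the cycle collector is easy to observe with the gc module:

```python
import gc

class Node:
    def __init__(self):
        self.other = None

# Build a reference cycle: reference counting alone can never free these
a, b = Node(), Node()
a.other, b.other = b, a
del a, b

# The cycle collector finds and frees the now-unreachable cycle
freed = gc.collect()
print(f"cycle collector freed {freed} objects")
```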

GC tuning for Java:

# Use ZGC for low-latency applications (sub-ms pauses)
java -XX:+UseZGC -Xms4g -Xmx4g -jar app.jar

# GC logging for analysis
java -Xlog:gc*:file=gc.log:time -jar app.jar

Go GC tuning:

# GOGC controls how aggressive GC is (default 100)
# GOGC=100: GC runs when heap doubles since last collection
# GOGC=200: GC runs when heap triples (less frequent GC, more memory)
# GOGC=50: GC runs when heap grows by 50% (more frequent, less memory)
GOGC=200 ./myapp

# GOMEMLIMIT (Go 1.19+): Set soft memory limit
GOMEMLIMIT=4GiB ./myapp

Frontend Performance Optimization

Technique | Impact | Description
--- | --- | ---
Code splitting | High | Load only the JavaScript needed for the current page (React.lazy(), dynamic imports)
Tree shaking | High | Remove unused code at build time (Webpack, Rollup, esbuild)
Image optimization | High | WebP/AVIF format, responsive sizes (srcset), lazy loading (loading="lazy")
Minification | Medium | Reduce JS/CSS file size by removing whitespace, shortening variable names
Bundle analysis | Medium | Identify large dependencies that can be replaced or lazy-loaded
Preloading | Medium | <link rel="preload"> for critical resources, <link rel="prefetch"> for next-page resources
Service Workers | High | Cache assets and API responses for offline access and instant loads
SSR / SSG | High | Server-side rendering or static generation for faster first paint (Next.js, Nuxt, Astro)
103 Early Hints | Medium | Hint critical resources before the full response is ready (HTTP/2 Server Push is deprecated)

Performance Testing Methodology

A systematic approach to performance engineering:

1. DEFINE requirements
   - What are the performance SLOs? (p95 < 200ms, throughput > 1000 RPS)
   - What is the expected traffic pattern? (steady, bursty, seasonal)

2. MEASURE baseline
   - Profile the current system under production-like load
   - Identify the bottleneck (CPU, memory, I/O, network, database)

3. HYPOTHESIZE
   - "Adding an index on user_id will reduce the query from 50ms to 5ms"
   - "Caching the product catalog will reduce API latency by 60%"

4. OPTIMIZE
   - Implement ONE change at a time (otherwise you can't attribute improvements)

5. MEASURE again
   - Run the same benchmark/profile under the same conditions
   - Quantify the improvement

6. ITERATE
   - If target met: done (don't over-optimize)
   - If not: go back to step 3

NEVER skip step 2 (baseline measurement).
NEVER change multiple things at once.

Common Performance Anti-Patterns

Anti-Pattern | Description | Fix
--- | --- | ---
Premature optimization | Optimizing before measuring | Profile first, optimize the actual bottleneck
N+1 queries | Fetching related records one at a time in a loop | Eager loading, JOINs, batch fetching
Unbounded queries | SELECT * without LIMIT or pagination | Always paginate, select only needed columns
Synchronous I/O in hot paths | Blocking on network/disk in request handlers | Use async I/O, background workers, caching
Missing indexes | Full table scans on large tables | Add indexes for common query patterns
Log level too verbose | DEBUG logging in production | Use INFO/WARN in production, DEBUG only when needed
String concatenation in loops | O(n²) string building | Use StringBuilder/join/buffers
Chatty APIs | Multiple round trips for one screen of data | Aggregate endpoints, GraphQL, BFF pattern
Large payloads | Sending more data than the client needs | Sparse fieldsets, pagination, compression
No connection pooling | Creating new DB connections per request | Use connection pools

Optimization Summary Table

Layer | Optimization | Typical Improvement
--- | --- | ---
Network | CDN, compression, HTTP/2, connection reuse | 50-90% latency reduction for static assets
Application | Caching, async I/O, connection pooling, batch processing | 2-100x throughput improvement
Database | Indexes, query optimization, read replicas, materialized views | 10-1000x query speedup
Frontend | Code splitting, image optimization, SSR/SSG, service workers | 2-5x faster page loads
Infrastructure | Auto-scaling, right-sizing, load balancing | Handle 10-100x more traffic