Performance Engineering¶
Performance engineering is the practice of designing, measuring, and optimizing software systems to meet performance requirements. It spans the entire SDLC—from architectural decisions to production monitoring. The golden rule: always measure before optimizing—intuition about bottlenecks is wrong more often than it's right.
Performance Metrics¶
| Metric | Description | Typical Target |
|---|---|---|
| Latency | Time to process a single request (p50, p95, p99) | API: <100ms p95, Web: <200ms p95 |
| Throughput | Number of operations per unit time (RPS, TPS) | Varies by service |
| TTFB | Time to First Byte — server processing + network time | <200ms |
| Apdex | Application Performance Index (0-1 score of user satisfaction) | >0.9 |
| Error rate | Percentage of failed requests | <0.1% |
| Resource utilization | CPU, memory, disk I/O, network bandwidth usage | CPU <70%, Memory <80% |
Why Percentiles Matter More Than Averages¶
Average latency hides problems. If 99% of requests take 50ms and 1% take 5000ms, the average is ~100ms—which looks fine. But that 1% represents real users with a terrible experience, and they're often your most important users (high-value customers with more data, more complex queries).
Latency distribution example:
p50 (median): 50ms — Half of requests are faster than this
p90: 100ms — 10% of requests are slower
p95: 200ms — 5% are slower (common SLO target)
p99: 500ms — 1% are slower (catches tail latency)
p99.9: 2000ms — 1 in 1000 (often database timeouts, GC pauses)
max: 5000ms — Single worst request (outliers)
Rule of thumb: Set SLOs on p95 or p99, not average.
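The gap between mean and tail is easy to see directly. A small sketch using Python's `statistics` module on the synthetic 99%/1% distribution from above (the sample values are illustrative):

```python
import statistics

# Synthetic distribution: 99% of requests at 50ms, 1% at 5000ms
latencies_ms = [50.0] * 990 + [5000.0] * 10

mean = statistics.mean(latencies_ms)
q = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = q[49], q[94], q[98]

# The mean (~100ms) looks healthy; p99 exposes the 5-second tail
print(f"mean={mean:.1f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

An SLO on the mean would pass here; an SLO on p99 would correctly fail.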
Tail latency amplification: In microservices, a single user request may fan out to 10-50 backend services. If each service has p99 = 100ms, the overall p99 is NOT 100ms—it's closer to the max across services. With 50 parallel calls, each with a 1% chance of exceeding 100ms, roughly 40% of user requests will contain at least one backend call slower than 100ms.
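That fan-out figure follows from basic probability, assuming the services are independent:

```python
# P(at least one of n independent calls exceeds its service's p99)
def p_any_slow(n_calls: int, p_slow: float = 0.01) -> float:
    return 1 - (1 - p_slow) ** n_calls

for n in (1, 10, 50):
    print(f"{n:>3} parallel calls: {p_any_slow(n):.1%} of user requests hit the tail")
# 50 calls -> ~39.5%
```

The independence assumption is optimistic: correlated slowness (a shared database, GC in a common runtime) makes the real tail worse.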
Web Performance Metrics (Core Web Vitals)¶
Google's Core Web Vitals measure real user experience and affect search rankings:
| Metric | What it Measures | Good | Needs Improvement | Poor |
|---|---|---|---|---|
| LCP (Largest Contentful Paint) | Loading performance — when the largest visible element renders | ≤2.5s | ≤4.0s | >4.0s |
| INP (Interaction to Next Paint) | Interactivity — time from user input to visual response | ≤200ms | ≤500ms | >500ms |
| CLS (Cumulative Layout Shift) | Visual stability — how much the page layout shifts unexpectedly | ≤0.1 | ≤0.25 | >0.25 |
Improving LCP: Optimize the critical rendering path (reduce render-blocking CSS/JS), preload key resources (<link rel="preload">), use CDN for static assets, optimize images (WebP/AVIF, responsive sizes, lazy loading below-the-fold).
Improving INP: Keep JavaScript execution short (break long tasks with requestIdleCallback), debounce/throttle event handlers, minimize main thread blocking, use web workers for heavy computation.
Improving CLS: Set explicit width/height on images and embeds, use aspect-ratio CSS, avoid injecting content above the fold after page load, use font-display: swap for web fonts.
Profiling¶
Profiling is the process of measuring where an application spends its time and resources. Always profile before optimizing—the bottleneck is rarely where you expect.
Types of Profiling¶
| Type | What it Measures | Tools |
|---|---|---|
| CPU Profiling | Where computation time is spent | cProfile (Python), perf (Linux), pprof (Go), cargo flamegraph (Rust), async-profiler (Java) |
| Memory Profiling | Memory allocation, leaks, high-water marks | tracemalloc (Python), Valgrind, heaptrack, memory_profiler, jemalloc |
| I/O Profiling | Disk and network I/O patterns, blocking calls | strace, ltrace, iotop, BPF/eBPF, perf trace |
| Lock/Contention | Thread contention on locks, synchronization overhead | perf lock, mutrace, async-profiler (Java), tokio-console (Rust) |
| Allocation Profiling | Where memory allocations happen (separate from leaks) | DHAT (Valgrind), heaptrack, jemalloc prof |
CPU Profiling¶
# Python CPU profiling example
import cProfile
import pstats

def expensive_function():
    result = []
    for i in range(1_000_000):
        result.append(i ** 2)
    return sorted(result, reverse=True)

# Profile the function
profiler = cProfile.Profile()
profiler.enable()
expensive_function()
profiler.disable()

# Print stats sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions
// Go profiling with pprof (net/http/pprof is in the standard library)
import (
    "net/http"
    _ "net/http/pprof" // Register pprof endpoints
)

func main() {
    // Exposes /debug/pprof/ endpoints
    go http.ListenAndServe(":6060", nil)
    // Your application code...
}

// Then analyze:
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30   (CPU)
// go tool pprof http://localhost:6060/debug/pprof/heap                 (memory)
// go tool pprof http://localhost:6060/debug/pprof/goroutine            (goroutines)
Memory Profiling¶
# Python memory profiling with tracemalloc
import tracemalloc

tracemalloc.start()

# Code to profile
data = [dict(index=i, value=i**2) for i in range(100_000)]

# Take a snapshot
snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics('lineno')
print("Top 10 memory allocations:")
for stat in stats[:10]:
    print(stat)

# Compare snapshots to find leaks
snapshot1 = tracemalloc.take_snapshot()
# ... more code ...
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)  # Shows memory growth between snapshots
Common memory issues:
- Memory leaks: Objects that are no longer needed but still referenced (growing collections, unclosed connections, event listener accumulation)
- High allocation rate: Creating and destroying many small objects (GC pressure). Fix: reuse objects, use object pools
- Large objects: Single allocations that are disproportionately large. Fix: streaming, pagination, lazy loading
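The object-pool fix for high allocation rates can be sketched in a few lines. This is a toy illustration, not a library API; real pools (e.g. connection pools) add locking and health checks:

```python
from collections import deque

class BufferPool:
    """Toy object pool: reuse bytearrays instead of allocating one per request."""
    def __init__(self, size: int, buf_len: int):
        self.buf_len = buf_len
        self._free = deque(bytearray(buf_len) for _ in range(size))

    def acquire(self) -> bytearray:
        # Reuse a pooled buffer (LIFO for cache warmth); allocate only if empty
        return self._free.pop() if self._free else bytearray(self.buf_len)

    def release(self, buf: bytearray) -> None:
        buf[:] = bytes(self.buf_len)  # scrub contents before returning to the pool
        self._free.append(buf)

pool = BufferPool(size=4, buf_len=4096)
buf = pool.acquire()
buf[:5] = b"hello"
pool.release(buf)
assert pool.acquire() is buf  # same object reused — no fresh allocation
```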
Flame Graphs¶
Flame graphs are a visualization of profiled software, showing the most frequent code paths. The x-axis represents the population of stack traces (wider = more samples), and the y-axis shows stack depth. They make it immediately obvious where time is spent.
# Generate a flame graph on Linux using perf
perf record -g -p <PID> -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# For Python using py-spy (can attach to running process!)
py-spy record -o profile.svg --pid <PID>
py-spy top --pid <PID> # Live top-like view
# For Go using pprof
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
# For Rust using cargo-flamegraph
cargo install flamegraph
cargo flamegraph --bin myapp
# For Node.js
node --prof app.js # Generate V8 profile
node --prof-process isolate-*.log # Convert to readable format
# Or use 0x for an interactive flame graph: npx 0x app.js
Reading a flame graph:
- Width = proportion of total samples (wider = more time)
- Height = call stack depth (taller = deeper nesting)
- Color = typically random (or grouped by module)
- Look for: wide blocks (hot functions), tall narrow towers (deep recursion), flat tops (leaf functions doing work)
Benchmarking¶
Benchmarking measures the performance of specific code paths or system components under controlled conditions. Unlike profiling (which shows where time is spent), benchmarking measures how fast something is.
Microbenchmarking¶
# Python benchmarking with timeit
import timeit

# Compare list comprehension vs map
list_comp_time = timeit.timeit(
    '[x**2 for x in range(1000)]',
    number=10000
)
map_time = timeit.timeit(
    'list(map(lambda x: x**2, range(1000)))',
    number=10000
)
print(f"List comprehension: {list_comp_time:.4f}s")
print(f"Map: {map_time:.4f}s")
// Rust benchmarking with criterion (statistical benchmarking)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

// Criterion runs the benchmark enough times to be statistically significant,
// compares against previous runs, and detects regressions.
Microbenchmarking pitfalls:
- Dead code elimination: Compilers may optimize away your benchmark code. Use black_box() (Rust) or assign results to a variable.
- Warmup: JIT-compiled languages (Java, JavaScript) need warmup iterations.
- Measurement overhead: time.time() itself has overhead. Use dedicated benchmarking tools.
- Cache effects: The first run may be slower (cold cache). Run multiple iterations.
- Context switching: Other processes can affect results. Pin to a CPU core for precise measurements.
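Several of these pitfalls (warmup, cold caches, single-shot measurement) can be addressed with a few lines of harness code. A minimal sketch; the function names are my own, not a library API:

```python
import statistics
import time

def bench(fn, *, warmup: int = 3, repeats: int = 20):
    """Minimal harness: warm up first, then report median and best of N runs."""
    for _ in range(warmup):              # prime caches (and JITs, where relevant)
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()      # monotonic, high-resolution clock
        result = fn()
        samples.append(time.perf_counter() - start)
    _keep = result                       # reference the result so the work can't be elided
    return statistics.median(samples), min(samples)

median_s, best_s = bench(lambda: sorted(range(10_000), reverse=True))
print(f"median {median_s * 1e3:.3f} ms, best {best_s * 1e3:.3f} ms")
```

Reporting the median rather than the mean keeps one context-switch-inflated outlier from skewing the result.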
Database Query Benchmarking¶
-- PostgreSQL: EXPLAIN ANALYZE shows actual execution time and row counts
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 100;
-- Key things to look for:
-- Seq Scan vs Index Scan (seq scan on large tables = bad)
-- Nested Loop vs Hash Join (nested loop on large datasets = bad)
-- Actual rows vs planned rows (large discrepancy = stale statistics → ANALYZE)
-- Buffers: shared hit vs read (reads = cache misses = disk I/O)
Load Testing¶
Load testing validates system performance under expected and peak load conditions. It answers: "Can our system handle the traffic we expect?"
Types of Load Tests¶
| Type | Purpose | Duration | Load Pattern |
|---|---|---|---|
| Smoke test | Verify system works under minimal load | 1-5 min | Baseline (1-5 users) |
| Load test | Validate performance at expected load | 15-60 min | Normal traffic (target RPS) |
| Stress test | Find the breaking point | 15-30 min | Ramp up until failure |
| Soak test | Find memory leaks, connection leaks, degradation | 4-24 hours | Sustained normal load |
| Spike test | Validate behavior under sudden traffic burst | 10-20 min | Sudden burst to 5-10x normal |
| Breakpoint test | Find maximum capacity | 30-60 min | Gradually increasing load |
Load Testing Tools¶
| Tool | Language | Protocol Support | Strengths |
|---|---|---|---|
| k6 | JavaScript (Go engine) | HTTP, WebSocket, gRPC | Modern, scriptable, CI-friendly, low resource usage |
| Locust | Python | HTTP, custom protocols | Easy to write tests, distributed, real-time web UI |
| JMeter | Java | HTTP, JDBC, JMS, SMTP | Feature-rich, GUI, wide protocol support |
| Gatling | Scala | HTTP, WebSocket | High performance, detailed reports |
| wrk | C/Lua | HTTP | Lightweight, extremely fast, simple benchmarks |
| hey | Go | HTTP | Simple CLI tool for quick benchmarks |
| vegeta | Go | HTTP | Constant RPS load testing (not concurrent users) |
// k6 load test example with realistic scenario
import http from 'k6/http';
import { check, sleep, group } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('error_rate');
const latencyTrend = new Trend('api_latency');

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // Ramp up to 50 users over 1 minute
    { duration: '3m', target: 50 },   // Stay at 50 users for 3 minutes
    { duration: '1m', target: 200 },  // Ramp up to 200 users
    { duration: '3m', target: 200 },  // Stay at 200 users for 3 minutes
    { duration: '1m', target: 0 },    // Ramp down to 0 users
  ],
  thresholds: {
    http_req_duration: ['p(95)<200', 'p(99)<500'], // 95% < 200ms, 99% < 500ms
    http_req_failed: ['rate<0.01'],                // Error rate < 1%
    error_rate: ['rate<0.01'],
  },
};

export default function () {
  group('Browse products', () => {
    const listRes = http.get('https://api.example.com/products?page=1');
    check(listRes, {
      'status is 200': (r) => r.status === 200,
      'has products': (r) => JSON.parse(r.body).data.length > 0,
    });
    errorRate.add(listRes.status !== 200);
    latencyTrend.add(listRes.timings.duration);
    sleep(1); // Simulate user reading time

    // View a specific product
    const products = JSON.parse(listRes.body).data;
    if (products.length > 0) {
      const detailRes = http.get(`https://api.example.com/products/${products[0].id}`);
      check(detailRes, { 'product detail 200': (r) => r.status === 200 });
    }
  });

  group('Add to cart', () => {
    const res = http.post('https://api.example.com/cart/items', JSON.stringify({
      product_id: 'prod_123',
      quantity: 1,
    }), { headers: { 'Content-Type': 'application/json' } });
    check(res, { 'added to cart': (r) => r.status === 201 });
  });

  sleep(Math.random() * 3 + 1); // Random think time of 1-4 seconds
}
# Locust load test example
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    wait_time = between(1, 3)  # Random wait of 1-3 seconds between tasks

    @task(3)  # Weight: 3x more likely than other tasks
    def view_items(self):
        self.client.get("/api/items")

    @task(1)
    def create_item(self):
        self.client.post("/api/items", json={
            "name": "Test Item",
            "price": 29.99
        })

    def on_start(self):
        """Called when a simulated user starts."""
        self.client.post("/api/auth/login", json={
            "username": "testuser",
            "password": "testpass"
        })
Interpreting Load Test Results¶
Key signals that indicate problems:
✗ Latency increases linearly with load
→ System is reaching capacity; requests are queuing
✗ Latency increases exponentially with load
→ System is saturated; likely a bottleneck (single lock, single DB connection)
✗ Error rate spikes at a specific load level
→ Resource exhaustion (connection pool, file descriptors, memory)
✗ Throughput plateaus while latency increases
→ System is at maximum capacity; additional requests just wait in queue
✗ Latency is fine under load but degrades over time (soak test)
→ Memory leak, connection leak, log file filling disk, GC pressure increasing
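One way to reason about the plateau-plus-rising-latency signal is Little's Law (in-flight requests = throughput × latency). This is an illustrative back-of-the-envelope calculation, not part of any load-testing tool:

```python
def inflight_requests(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W (concurrency = arrival rate * time in system)."""
    return throughput_rps * latency_s

# A system serving 1000 RPS at 50ms mean latency holds ~50 requests in flight.
print(inflight_requests(1000, 0.050))  # prints 50.0

# If throughput is capped at 1000 RPS but 200 requests are in flight,
# mean latency must be ~200ms — the extra 150ms is queueing, not processing.
print(200 / 1000)  # implied latency: 0.2s
```

Once throughput stops growing, any additional offered load converts directly into queueing delay, which is exactly the plateau pattern above.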
Common Optimization Patterns¶
Caching¶
Caching is the single most impactful optimization for most applications. The key challenge is cache invalidation—ensuring cached data stays consistent with the source of truth.
| Cache Layer | Latency | Capacity | Examples |
|---|---|---|---|
| CPU L1 cache | ~1 ns | 64 KB | Automatic (hardware) |
| CPU L3 cache | ~10 ns | 8-64 MB | Automatic (hardware) |
| Application memory | ~100 ns | GBs | In-process dict/map, LRU cache |
| Distributed cache | ~1 ms | TBs | Redis, Memcached |
| CDN edge | ~10 ms | Distributed | CloudFront, Cloudflare |
| Browser cache | ~0 ms | MBs | Cache-Control headers |
Caching strategies:
| Strategy | Description | Consistency | Use Case |
|---|---|---|---|
| Cache-aside | App checks cache first; on miss, reads DB, writes to cache | Eventual (TTL-based) | General purpose, default choice |
| Read-through | Cache automatically fetches from DB on miss | Eventual | Simplifies app code |
| Write-through | Writes go to cache AND DB simultaneously | Strong | When reads are much more frequent than writes |
| Write-behind | Writes go to cache; async batch write to DB | Eventual (risk of data loss) | High write throughput |
| Refresh-ahead | Proactively refresh cache before expiration | Strong (if refresh is fast) | Predictable access patterns |
# Cache-aside pattern implementation (`db` is an illustrative database handle)
import redis
import json

cache = redis.Redis(host='localhost', port=6379)
TTL = 300  # 5 minutes

def get_user(user_id: str) -> dict:
    # 1. Check cache
    cached = cache.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)

    # 2. Cache miss — fetch from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if user is None:
        # Cache negative results too (prevents repeated DB lookups for missing keys)
        cache.setex(f"user:{user_id}", 60, json.dumps(None))
        return None

    # 3. Write to cache
    cache.setex(f"user:{user_id}", TTL, json.dumps(user))
    return user

def update_user(user_id: str, data: dict):
    # Update database
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    # Invalidate cache (don't update — invalidate to avoid race conditions)
    cache.delete(f"user:{user_id}")
Cache stampede prevention: When a popular cache key expires, hundreds of requests simultaneously miss the cache and hit the database. Solutions:
- Locking: Only one request fetches from the DB; others wait for the cache to be populated
- Stale-while-revalidate: Serve stale data while one request refreshes in the background
- Probabilistic early expiration: Each request has a small chance of refreshing before the TTL expires
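The locking variant can be sketched on top of a cache-aside helper. Redis's `SET` with `NX` acts as the per-key mutex (redis-py supports this via `set(..., nx=True, ex=...)`); the lock key name, timeouts, and retry counts here are illustrative choices:

```python
import json
import time

LOCK_TTL = 30  # seconds a refresher may hold the lock (illustrative)

def get_with_stampede_guard(key: str, fetch, cache, ttl: int = 300):
    """Cache-aside where only one caller per key refreshes on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # SET NX succeeds for exactly one caller, making it the sole refresher
    if cache.set(f"lock:{key}", "1", nx=True, ex=LOCK_TTL):
        try:
            value = fetch()                        # the expensive DB call
            cache.setex(key, ttl, json.dumps(value))
            return value
        finally:
            cache.delete(f"lock:{key}")
    # Another caller is refreshing: briefly poll the cache instead of hitting the DB
    for _ in range(10):
        time.sleep(0.05)
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    return fetch()  # fallback if the refresher crashed or is too slow
```

The `ex` on the lock matters: without it, a crashed refresher would block the key forever.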
Connection Pooling¶
Creating database/HTTP connections is expensive (TCP handshake, TLS handshake, authentication). Connection pooling maintains a reusable set of connections.
# PostgreSQL connection pooling with psycopg2
import psycopg2.pool

# Create a pool of 5-20 connections
pool = psycopg2.pool.ThreadedConnectionPool(
    minconn=5,
    maxconn=20,
    host='localhost',
    dbname='myapp',
    user='app',
    password='secret'
)

def get_user(user_id):
    conn = pool.getconn()  # Get a connection from the pool
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        return cursor.fetchone()
    finally:
        pool.putconn(conn)  # Return the connection to the pool (don't close it!)
Pool sizing: Too few connections = requests queue waiting for a connection. Too many = database overwhelmed. A good starting point: pool_size = (2 * num_cores) + num_disks on the database server side. On the application side, set pool size equal to the maximum concurrent database operations per process.
The N+1 Query Problem¶
The most common database performance anti-pattern in ORMs:
# N+1 problem: 1 query to get users + N queries to get each user's orders
users = User.query.all()  # 1 query
for user in users:
    orders = user.orders  # N queries (1 per user!)

# Solution: Eager loading (1 query with JOIN or 2 queries with IN)
users = User.query.options(joinedload(User.orders)).all()  # 1-2 queries total

# Or explicit JOIN
users = db.session.query(User, Order) \
    .outerjoin(Order) \
    .all()
How to detect N+1: Enable query logging in development. If you see the same query repeated with different parameters, you likely have an N+1 problem. Tools: Django Debug Toolbar, SQLAlchemy echo=True, Rails bullet gem.
Database Query Optimization¶
| Optimization | Description | Impact |
|---|---|---|
| Add indexes | B-tree indexes for equality/range queries, GIN for full-text/JSON | 10-1000x faster queries |
| Composite indexes | Multi-column indexes for common query patterns | Avoids multiple index lookups |
| Covering indexes | Include all queried columns in the index (no table lookup) | Eliminates random I/O |
| Query rewriting | Replace subqueries with JOINs, use EXISTS instead of IN for large sets | 2-100x improvement |
| Pagination | Cursor-based pagination instead of OFFSET (OFFSET scans skipped rows) | Constant time vs linear |
| Denormalization | Store computed/duplicated data to avoid JOINs | Faster reads, slower writes |
| Materialized views | Pre-computed query results, refreshed periodically | Instant complex queries |
-- Bad: OFFSET pagination (gets slower as page increases)
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 10000;
-- Must scan and discard 10,000 rows!
-- Good: Cursor-based pagination (constant time)
SELECT * FROM orders
WHERE created_at < '2025-01-15T10:30:00Z' -- cursor from last page
ORDER BY created_at DESC
LIMIT 20;
-- Index to support this query:
CREATE INDEX idx_orders_created_at ON orders (created_at DESC);
Async and Non-Blocking I/O¶
For I/O-bound workloads, async processing dramatically improves throughput by not blocking threads waiting for I/O:
# Synchronous: 10 API calls × 200ms each = ~2000ms total
import requests

def fetch_all_sync(urls):
    results = []
    for url in urls:
        resp = requests.get(url)  # Blocks for ~200ms
        results.append(resp.json())
    return results  # Total: ~2000ms

# Asynchronous: 10 API calls in parallel = ~200ms total
import asyncio
import aiohttp

async def fetch_all_async(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)  # All run concurrently
        return [await r.json() for r in responses]
        # Total: ~200ms (limited by the slowest single call)
When to use async:
- I/O-bound workloads (API calls, database queries, file I/O)
- High concurrency requirements (thousands of simultaneous connections)
- WebSocket servers, chat applications, real-time systems

When NOT to use async:
- CPU-bound workloads (use multiprocessing or threads instead)
- Simple scripts with sequential logic
- When the added complexity isn't justified by the performance gain
Batch Processing¶
Group multiple operations into fewer, larger operations:
# Bad: N individual INSERT statements
for user in users:
    db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (user.name, user.email))
# 1000 users = 1000 round trips to the database

# Good: Single batch INSERT
values = [(u.name, u.email) for u in users]
db.executemany("INSERT INTO users (name, email) VALUES (%s, %s)", values)
# 1000 users = 1 round trip

# Even better: COPY command (PostgreSQL) for bulk loading
import io
import csv

buffer = io.StringIO()
writer = csv.writer(buffer)
for user in users:
    writer.writerow([user.name, user.email])
buffer.seek(0)
cursor.copy_from(buffer, 'users', columns=('name', 'email'), sep=',')
# 10-100x faster than INSERT for large datasets
Compression¶
| Algorithm | Speed | Ratio | Use Case |
|---|---|---|---|
| gzip | Medium | Good (60-70% reduction) | HTTP responses (universal support) |
| Brotli | Slower compression, fast decompression | Better than gzip (20-30% smaller) | Static assets, HTTP (modern browsers) |
| zstd | Very fast | Similar to gzip | Logs, backups, inter-service communication |
| lz4 | Extremely fast | Lower ratio | Real-time compression, databases |
| snappy | Very fast | Lower ratio | Big data (Hadoop, Kafka, Cassandra) |
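The speed-vs-ratio trade-off is easy to measure for the codecs that ship with Python's standard library (gzip and zlib; zstd, lz4, snappy, and Brotli need third-party packages). The payload below is illustrative, and ratios are heavily data-dependent:

```python
import gzip
import zlib

# A repetitive JSON-like payload compresses very well; random data would not
payload = b'{"user_id": 12345, "status": "active"}\n' * 1000

gz = gzip.compress(payload, compresslevel=6)
zl = zlib.compress(payload, level=6)

for name, blob in (("gzip", gz), ("zlib", zl)):
    reduction = 100 * (1 - len(blob) / len(payload))
    print(f"{name}: {len(payload)} -> {len(blob)} bytes ({reduction:.1f}% reduction)")
```

zlib output is slightly smaller than gzip's only because gzip adds a file header; the underlying DEFLATE stream is the same.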
Data Serialization Performance¶
| Format | Speed | Size | Schema | Use Case |
|---|---|---|---|---|
| JSON | Slow | Large | No | REST APIs, human-readable |
| Protocol Buffers | Fast | Small | Yes (.proto) | gRPC, inter-service communication |
| MessagePack | Fast | Medium | No | Binary JSON alternative |
| FlatBuffers | Very fast (zero-copy) | Small | Yes | Games, real-time systems |
| Avro | Fast | Small | Yes (embedded) | Data pipelines, Kafka |
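The text-vs-binary size gap in the table can be illustrated with the standard library alone (the `struct` packing below is a hand-rolled illustration, not a real wire format; schema formats like Protobuf and Avro add field tags and varint encoding on top):

```python
import json
import struct

# 1000 records of two unsigned 32-bit integers each
records = [(i, i * i) for i in range(1000)]

as_json = json.dumps(records).encode()                          # text encoding
as_binary = b"".join(struct.pack("<II", a, b) for a, b in records)  # 8 bytes/record

print(f"JSON: {len(as_json)} bytes, packed binary: {len(as_binary)} bytes")
```

The binary form is a fixed 8 bytes per record; the JSON form grows with digit count and pays for brackets, commas, and whitespace on every record.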
Concurrency and Parallelism¶
| Concept | Description | Python | Go | Rust |
|---|---|---|---|---|
| Threading | Multiple threads sharing memory | threading (GIL-limited) | goroutines (multiplexed) | std::thread |
| Multiprocessing | Multiple processes with separate memory | multiprocessing | N/A (use goroutines) | rayon, tokio::spawn |
| Async I/O | Event loop with non-blocking I/O | asyncio | goroutines + channels | tokio, async-std |
| Actor model | Message-passing between isolated actors | pykka, ray | goroutines + channels | actix |
Python's GIL (Global Interpreter Lock): CPython's GIL allows only one thread to execute Python bytecode at a time. This means:
- CPU-bound: Use multiprocessing (separate processes, no GIL) or concurrent.futures.ProcessPoolExecutor
- I/O-bound: Use asyncio or threading (GIL is released during I/O waits)
- Alternative: Use C extensions (NumPy, pandas) that release the GIL during computation
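The CPU-bound advice above can be sketched with `concurrent.futures` from the standard library; the workload and worker count are illustrative:

```python
import math
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n: int) -> float:
    # Pure-Python arithmetic holds the GIL throughout, so threads can't parallelize it
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    # Separate processes each get their own interpreter (and their own GIL)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_heavy, [500_000] * 4))
    print(f"{len(results)} chunks computed in parallel")
```

Swapping `ProcessPoolExecutor` for `ThreadPoolExecutor` here would give little or no speedup on CPython, which is the GIL limitation in action.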
Garbage Collection and Memory Management¶
Understanding GC behavior is critical for low-latency applications:
| GC Type | Languages | Pause Behavior | Tuning |
|---|---|---|---|
| Mark and Sweep | Python (cycle collector), Go | Go: mostly concurrent with brief stop-the-world phases | Go: GOGC env var |
| Generational | Java (G1, ZGC), .NET, Python (reference counting + generational) | Short young-gen pauses, occasional major pauses | Java: -Xms, -Xmx, GC algorithm selection |
| Reference Counting | Python (primary), Swift, Rust (Arc) | No pauses but cyclic reference issues | Python: gc module for cycle detection |
| Ownership | Rust | No GC pauses (compile-time memory management) | N/A (deterministic destruction) |
GC tuning for Java:
# Use ZGC for low-latency applications (sub-ms pauses)
java -XX:+UseZGC -Xms4g -Xmx4g -jar app.jar
# GC logging for analysis
java -Xlog:gc*:file=gc.log:time -jar app.jar
Go GC tuning:
# GOGC controls how aggressive GC is (default 100)
# GOGC=100: GC runs when heap doubles since last collection
# GOGC=200: GC runs when heap triples (less frequent GC, more memory)
# GOGC=50: GC runs when heap grows by 50% (more frequent, less memory)
GOGC=200 ./myapp
# GOMEMLIMIT (Go 1.19+): Set soft memory limit
GOMEMLIMIT=4GiB ./myapp
Frontend Performance Optimization¶
| Technique | Impact | Description |
|---|---|---|
| Code splitting | High | Load only the JavaScript needed for the current page (React.lazy(), dynamic imports) |
| Tree shaking | High | Remove unused code at build time (Webpack, Rollup, esbuild) |
| Image optimization | High | WebP/AVIF format, responsive sizes (srcset), lazy loading (loading="lazy") |
| Minification | Medium | Reduce JS/CSS file size by removing whitespace, shortening variable names |
| Bundle analysis | Medium | Identify large dependencies that can be replaced or lazy-loaded |
| Preloading | Medium | <link rel="preload"> for critical resources, <link rel="prefetch"> for next-page resources |
| Service Workers | High | Cache assets and API responses for offline access and instant loads |
| SSR / SSG | High | Server-side rendering or static generation for faster first paint (Next.js, Nuxt, Astro) |
| HTTP/2 push / Early Hints 103 | Medium | Send critical resources before the browser requests them |
Performance Testing Methodology¶
A systematic approach to performance engineering:
1. DEFINE requirements
- What are the performance SLOs? (p95 < 200ms, throughput > 1000 RPS)
- What is the expected traffic pattern? (steady, bursty, seasonal)
2. MEASURE baseline
- Profile the current system under production-like load
- Identify the bottleneck (CPU, memory, I/O, network, database)
3. HYPOTHESIZE
- "Adding an index on user_id will reduce the query from 50ms to 5ms"
- "Caching the product catalog will reduce API latency by 60%"
4. OPTIMIZE
- Implement ONE change at a time (otherwise you can't attribute improvements)
5. MEASURE again
- Run the same benchmark/profile under the same conditions
- Quantify the improvement
6. ITERATE
- If target met: done (don't over-optimize)
- If not: go back to step 3
NEVER skip step 2 (baseline measurement).
NEVER change multiple things at once.
Common Performance Anti-Patterns¶
| Anti-Pattern | Description | Fix |
|---|---|---|
| Premature optimization | Optimizing before measuring | Profile first, optimize the actual bottleneck |
| N+1 queries | Fetching related records one at a time in a loop | Eager loading, JOINs, batch fetching |
| Unbounded queries | SELECT * without LIMIT or pagination | Always paginate, select only needed columns |
| Synchronous I/O in hot paths | Blocking on network/disk in request handlers | Use async I/O, background workers, caching |
| Missing indexes | Full table scans on large tables | Add indexes for common query patterns |
| Log-level too verbose | DEBUG logging in production | Use INFO/WARN in production, DEBUG only when needed |
| String concatenation in loops | O(n²) string building | Use StringBuilder/join/buffers |
| Chatty APIs | Multiple round trips for one screen of data | Aggregate endpoints, GraphQL, BFF pattern |
| Large payloads | Sending more data than the client needs | Sparse fieldsets, pagination, compression |
| No connection pooling | Creating new DB connections per request | Use connection pools |
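The string-concatenation row is easy to verify with `timeit`. One caveat: CPython can sometimes resize a uniquely referenced string in place, so the gap is most dramatic on other runtimes (or when multiple references to the string exist); `join` is the portable fix:

```python
import timeit

def concat(parts):
    out = ""
    for p in parts:
        out += p  # may copy the whole accumulated string on each iteration
    return out

parts = ["x"] * 10_000
t_concat = timeit.timeit(lambda: concat(parts), number=50)
t_join = timeit.timeit(lambda: "".join(parts), number=50)
print(f"+= loop: {t_concat:.4f}s, ''.join: {t_join:.4f}s")
```

`join` computes the final length once and copies each piece exactly once, which is what makes it O(n) regardless of runtime.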
Optimization Summary Table¶
| Layer | Optimization | Typical Improvement |
|---|---|---|
| Network | CDN, compression, HTTP/2, connection reuse | 50-90% latency reduction for static assets |
| Application | Caching, async I/O, connection pooling, batch processing | 2-100x throughput improvement |
| Database | Indexes, query optimization, read replicas, materialized views | 10-1000x query speedup |
| Frontend | Code splitting, image optimization, SSR/SSG, service workers | 2-5x faster page loads |
| Infrastructure | Auto-scaling, right-sizing, load balancing | Handle 10-100x more traffic |