Site Reliability Engineering (SRE)¶
Site Reliability Engineering (SRE), pioneered by Google, is a discipline that applies software engineering principles to infrastructure and operations problems. It bridges the gap between development teams (who want to ship fast) and operations teams (who want stability). SRE treats operations work as a software problem—instead of manually managing systems, SREs write code to automate away operational toil.
While monitoring (Chapter 16) covers the observability stack, SRE is about the organizational practices, frameworks, and culture for running reliable systems at scale.
SRE vs. DevOps¶
| Aspect | SRE | DevOps |
|---|---|---|
| Origin | Google (2003) | Community movement (2008) |
| Focus | Reliability as a feature | Breaking silos between dev and ops |
| Approach | Prescriptive (concrete practices, metrics) | Philosophical (culture, automation, sharing) |
| Key metric | Error budget | Deployment frequency, lead time |
| Relationship | Implements DevOps principles with concrete practices | Supplies the principles that SRE implements |
Google's Ben Treynor Sloss: "SRE is what happens when you ask a software engineer to design an operations function."
A useful analogy: DevOps is an interface (a set of principles and values), while SRE is a concrete class that implements that interface. DevOps says "you should automate"; SRE says "you should spend no more than 50% of your time on toil, and here's how to measure and reduce it."
SRE Team Structures¶
| Model | Description | Pros | Cons |
|---|---|---|---|
| Embedded SRE | SREs are members of product teams | Deep product knowledge, tight collaboration | May lose SRE community, inconsistent practices |
| Centralized SRE | Dedicated SRE team supports multiple services | Consistent standards, shared tooling, career growth | Can become a bottleneck, less product context |
| Consulting SRE | SRE team advises product teams, doesn't own services | Scales knowledge broadly, product teams own reliability | Advice may be ignored, less operational depth |
| Platform Engineering | Builds self-service reliability tools and platforms | Scales to many teams, reduces per-team toil | Requires significant investment, can feel disconnected |
Most organizations start with centralized SRE and evolve toward a hybrid model: a central platform/SRE team builds shared tooling (CI/CD, observability, infrastructure), while embedded SREs or reliability-focused engineers within product teams apply those tools to their specific services.
SRE Engagement Model¶
SRE teams cannot support every service in an organization. A common engagement model:
- Self-serve tier: Product teams use SRE-provided tools and runbooks independently
- Consulting tier: SRE provides architecture reviews, production readiness reviews, and guidance
- Embedded tier: SRE is directly involved in operating the service (reserved for critical, high-complexity systems)
Criteria for full SRE engagement typically include: business criticality, service complexity, traffic volume, and the product team's willingness to follow SRE practices (SLOs, error budgets, postmortems).
Service Level Indicators, Objectives, and Agreements¶
The SLI → SLO → SLA hierarchy is the foundation of SRE. It transforms reliability from a vague goal ("make it reliable") into a measurable, actionable framework.
Relationship:
SLI (what you measure) → SLO (what you target) → SLA (what you promise)
Example:
SLI: Availability = successful requests / total requests
SLO: Availability >= 99.95% per month (internal target)
SLA: Availability >= 99.9% per month (customer contract — with credits if breached)
SLIs (Service Level Indicators)¶
Quantitative measures of service behavior. The most important principle: SLIs should reflect the user experience, not system internals. CPU utilization is not an SLI (users don't experience CPU); request latency is.
| SLI | Definition | Example | Measurement Point |
|---|---|---|---|
| Availability | Proportion of successful requests | 99.95% of requests return non-5xx responses | Load balancer access logs |
| Latency | Time to serve a request | 95th percentile response time < 200ms | Application instrumentation |
| Throughput | Rate of successful operations | > 10,000 requests/second sustained | Metrics aggregation |
| Error rate | Proportion of failed requests | < 0.1% of requests result in errors | Application error tracking |
| Freshness | How up-to-date data is | Search index updated within 5 minutes | Pipeline monitoring |
| Correctness | Proportion of correct responses | 99.99% of calculations return correct results | End-to-end validation |
| Durability | Proportion of data retained | 99.999999999% of objects stored are not lost | Storage system metrics |
SLI Specification vs. Implementation:
- Specification: What the SLI measures conceptually (e.g., "the proportion of valid requests served within 200ms")
- Implementation: How you actually measure it (e.g., "count of responses with status < 500 and duration < 200ms at the load balancer, divided by total request count")
The implementation matters because where you measure changes what you see. Measuring latency at the server misses network latency; measuring at the client includes the full user experience but is harder to collect. A common compromise: measure at the load balancer (captures server processing + internal network, but not client-side network).
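The specification/implementation split can be made concrete with a small sketch. Assuming request records with a status code and a server-side duration (hypothetical fields, not any specific log format), the two SLIs above reduce to counting:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # server processing time, as seen at the LB

def availability_sli(requests: list[Request]) -> float:
    """Proportion of responses with status < 500 (the spec above)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 200) -> float:
    """Proportion of requests served within the latency threshold."""
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)
```

Note that the implementation encodes the measurement point implicitly: if `duration_ms` comes from the load balancer, client-side network time is excluded, exactly as discussed above.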
Choosing SLIs by service type:
| Service Type | Primary SLIs |
|---|---|
| User-facing API | Availability, latency (p50, p95, p99), error rate |
| Data pipeline | Freshness (data staleness), correctness, throughput |
| Storage system | Availability, latency, durability |
| Streaming service | Start-up time, rebuffer rate, resolution quality |
| Batch processing | Completion time, success rate, data quality |
SLOs (Service Level Objectives)¶
Target values for SLIs that define "good enough" reliability. SLOs are internal commitments—they represent the reliability level that satisfies users without over-investing in reliability.
SLO Examples:
- "99.9% of API requests will succeed (non-5xx) measured over a rolling 30-day window"
- "95th percentile latency will be under 200ms measured over a rolling 7-day window"
- "99.99% of payment transactions will complete successfully measured monthly"
- "Data pipeline freshness: 99% of records available within 5 minutes of creation"
Setting SLOs:
- Too aggressive (99.999%) → team spends all time on reliability, can't ship features, infrastructure costs explode
- Too relaxed (99%) → users have poor experience, churn increases
- Target the level of reliability users actually need — consider: if you're 99.9% and your upstream dependency is 99%, your extra 0.9% doesn't matter to the user
The nines table — what each level of availability actually means:
| Availability | Downtime/month | Downtime/year | Typical For |
|---|---|---|---|
| 99% (two 9s) | 7.2 hours | 3.65 days | Internal tools, batch systems |
| 99.9% (three 9s) | 43.2 minutes | 8.76 hours | SaaS applications, APIs |
| 99.95% | 21.6 minutes | 4.38 hours | Business-critical services |
| 99.99% (four 9s) | 4.32 minutes | 52.56 minutes | Payment systems, core infrastructure |
| 99.999% (five 9s) | 26.3 seconds | 5.26 minutes | Telecom, medical systems |
Each additional nine is roughly 10x harder and more expensive to achieve. Going from 99.9% to 99.99% might require multi-region redundancy, automated failover, and eliminating all single points of failure. Going from 99.99% to 99.999% might require custom hardware, redundant providers, and near-zero planned maintenance.
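The table's downtime figures follow directly from the availability target; a small helper makes the conversion explicit (using a 30-day month of 43,200 minutes):

```python
def downtime_allowed(availability_pct: float,
                     period_minutes: float = 30 * 24 * 60) -> float:
    """Minutes of downtime allowed per period for an availability target.

    downtime = period × (1 - availability)
    Default period is a 30-day month (43,200 minutes).
    """
    return period_minutes * (1 - availability_pct / 100)
```

For example, `downtime_allowed(99.9)` gives the 43.2 minutes/month in the table, and `downtime_allowed(99.95)` gives 21.6 minutes/month.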
SLO-based alerting: Instead of alerting on symptoms (CPU > 90%), alert on SLO burn rate. The burn rate is how fast you're consuming your error budget:
Burn rate = (error rate observed) / (error rate allowed by SLO)
Example:
SLO: 99.9% availability (0.1% error budget per month)
Current error rate: 1%
Burn rate = 1% / 0.1% = 10x
At 10x burn rate, a month's error budget is consumed in 3 days.
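The burn-rate arithmetic above can be sketched directly (the SLO is passed as a fraction, e.g. 0.999 for 99.9%):

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    allowed_error_rate = 1 - slo   # e.g. SLO 0.999 -> 0.1% allowed
    return observed_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, when is the window's budget gone?"""
    return window_days / rate
```

Reproducing the example: a 1% error rate against a 99.9% SLO is a 10x burn rate, exhausting a 30-day budget in 3 days.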
Multi-window, multi-burn-rate alerting (recommended by the SRE Workbook):
| Severity | Burn Rate | Long Window | Short Window | Action |
|---|---|---|---|---|
| Page (critical) | 14.4x | 1 hour | 5 minutes | Immediate response |
| Page (warning) | 6x | 6 hours | 30 minutes | Investigate soon |
| Ticket | 3x | 3 days | 6 hours | Schedule fix |
The short window confirms the long window isn't a brief spike that already resolved. Both windows must be breaching simultaneously to fire the alert.
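The both-windows rule is a simple conjunction. A minimal sketch (computing the per-window burn rates themselves is assumed to happen elsewhere, e.g. in your metrics backend):

```python
def should_page(long_window_burn: float, short_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold, so a
    brief spike that has already resolved doesn't page anyone."""
    return long_window_burn >= threshold and short_window_burn >= threshold
```

With the critical-page threshold of 14.4x, a sustained problem (both windows hot) pages; a resolved spike (long window still elevated, short window back to normal) does not.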
SLAs (Service Level Agreements)¶
External contracts with customers that include SLOs plus consequences (refunds, credits) if not met. SLAs should always be less aggressive than internal SLOs—you need a buffer.
| | SLO | SLA |
|---|---|---|
| Audience | Internal teams | External customers |
| Consequence of breach | Error budget policy (slow down releases) | Financial penalties (credits, refunds) |
| Aggressiveness | More aggressive (internal target) | More conservative (contractual promise) |
| Measurement | Fine-grained (per-request, rolling window) | Coarser (monthly, aggregated) |
Worked Example: E-Commerce Platform¶
Service: Product Catalog API
SLI (Availability):
Implementation: Count of responses with status < 500, divided by total
responses, measured at the ALB over a rolling 30-day window.
SLI (Latency):
Implementation: Proportion of responses where server processing time
(from ALB metrics) is < 200ms, over a rolling 30-day window.
SLO: 99.95% availability, 99% of requests served under 200ms.
Error Budget:
Availability: 0.05% = ~21.6 minutes of downtime per month
Latency: 1% of requests can exceed 200ms
SLA (to enterprise customers):
99.9% availability, measured monthly. If breached, 10% service credit.
(Note: SLA is 99.9% while SLO is 99.95% — 0.05% buffer)
Error Budgets¶
The error budget is the allowed amount of unreliability. It's derived from the SLO:
Error Budget = 100% - SLO
Example:
SLO = 99.9% availability per month
Error Budget = 0.1% = ~43.2 minutes of downtime per month
At 1 million requests/day, that's ~1,000 failed requests/day allowed
The error budget transforms the dev/ops tension from a cultural conflict into a data-driven discussion:
- Budget remaining: Development velocity is prioritized. Ship features, take risks.
- Budget exhausted: Reliability is prioritized. Freeze feature launches, focus on stability, automation, and reducing technical debt.
This creates a self-regulating system: if a team ships a buggy release that causes an outage, they've consumed error budget and must slow down to improve reliability. If the system is ultra-stable, they have budget to take more risks.
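The budget derivation is simple enough to express directly (SLO given as a fraction, e.g. 0.999 for 99.9%):

```python
def error_budget_fraction(slo: float) -> float:
    """Error budget as a fraction: 100% - SLO."""
    return 1 - slo

def allowed_failures(slo: float, total_requests: int) -> int:
    """How many failed requests the budget permits."""
    return int(total_requests * error_budget_fraction(slo))
```

Reproducing the example: a 99.9% SLO at 1 million requests/day allows roughly 1,000 failed requests per day.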
Error Budget Calculation Examples¶
Availability-based:
SLO: 99.95% over 30 days
Budget: 0.05% of 30 days = 0.05% × 43,200 minutes = 21.6 minutes
If an outage lasted 15 minutes:
Remaining budget = 21.6 - 15 = 6.6 minutes (69% consumed)
⚠️ Need to be cautious for the rest of the month
Latency-based:
SLO: 99% of requests under 200ms over 7 days
Budget: 1% of requests can exceed 200ms
If total requests this week = 10,000,000:
Budget = 100,000 slow requests allowed
Current slow requests = 75,000
Remaining budget = 25,000 (75% consumed)
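Both worked examples reduce to the same two operations, whether the budget is measured in minutes of downtime or in slow requests:

```python
def budget_consumed(used: float, budget: float) -> float:
    """Fraction of the error budget consumed (0.0 to 1.0)."""
    return used / budget

def remaining_budget(budget: float, used: float) -> float:
    """How much budget is left, in the same unit as the inputs."""
    return budget - used
```

The availability example: a 15-minute outage against a 21.6-minute budget is ~69% consumed, with 6.6 minutes remaining. The latency example: 75,000 slow requests against a 100,000-request budget is 75% consumed.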
Error Budget Policy¶
An error budget policy formalizes what happens at different budget consumption levels. It must be agreed upon by both product and SRE leadership before incidents happen—not during.
Error Budget Policy:
> 50% remaining: Full speed ahead. Ship features, experiment.
Standard code review and testing.
25-50% remaining: Increased caution. Additional testing required.
Post-deploy monitoring for 30 minutes.
No deploys on Fridays.
10-25% remaining: Slow down significantly.
Only well-tested, low-risk changes.
Require SRE approval for deploys.
Begin reliability improvement work.
< 10% remaining: Feature freeze.
Only reliability improvements and critical bug fixes.
Incident review if not already done.
Exhausted (0%): Full freeze on all non-reliability changes.
All engineering effort goes to reliability.
Executive escalation.
Post-freeze review required before resuming feature work.
Error budget negotiation: When product teams push back on budget-driven slowdowns, point to the data. The error budget isn't arbitrary—it's derived from the SLO, which is set based on user needs. If the team doesn't want the slowdown, they have two options: accept a less aggressive SLO (more error budget), or invest in reliability to reduce budget consumption.
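A policy like this is usually encoded so deployment tooling can enforce it automatically. A sketch using the sample tiers above (the thresholds come from the example policy, not any standard):

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction of the total) to the
    policy tier from the sample error budget policy above."""
    if budget_remaining > 0.50:
        return "full speed"
    if budget_remaining > 0.25:
        return "increased caution"
    if budget_remaining > 0.10:
        return "slow down"
    if budget_remaining > 0.0:
        return "feature freeze"
    return "full freeze"
```

A CI/CD pipeline could call this before each deploy and, for example, require an extra approval in the "slow down" tier or block the deploy entirely in a freeze.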
Toil¶
Toil is the kind of work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth.
The defining characteristics:
- Manual: A human must perform it (not automated)
- Repetitive: Done over and over, not a one-time task
- Automatable: Could be done by a script or system
- Tactical: Reactive, interrupt-driven, not strategic
- No enduring value: Doesn't permanently improve the service
- Scales with service growth: Doubling traffic doubles the work
Examples of toil:
- Manually restarting failed services
- Manually provisioning user accounts
- Responding to routine alerts that always require the same fix
- Manually scaling infrastructure before expected traffic spikes
- Copy-pasting configuration between environments
- Manually running database migrations
- Manually reviewing and approving routine certificate renewals
Not toil (necessary overhead): meetings, on-call rotations, architectural design, strategic planning, writing postmortems, mentoring.
Google's target: SREs should spend no more than 50% of their time on toil. The remaining 50% should be spent on engineering projects that reduce future toil. If toil exceeds 50%, the team is understaffed or under-automated.
Measuring Toil¶
Track toil by having team members categorize their work:
| Category | Definition | Target |
|---|---|---|
| Toil | Manual, repetitive, automatable operational work | ≤ 50% |
| Engineering | Software development to improve reliability, tools, automation | ≥ 50% |
| Overhead | Meetings, planning, admin, hiring | Track but don't over-optimize |
Measure weekly with simple time tracking. Trend over quarters—if toil percentage is increasing, the team needs to prioritize automation projects.
The Automation Ladder¶
Toil reduction progresses through stages:
Level 0: Manual — Human does everything manually
Level 1: Documented — Runbook exists, human follows steps
Level 2: Scripted — Script performs the task, human triggers it
Level 3: Automated — System triggers script based on conditions
Level 4: Self-service — Users trigger automation themselves (no SRE needed)
Level 5: Self-healing — System detects and fixes issues automatically
Example progression for "scaling a service":
- Level 0: SSH into server, manually add instances
- Level 1: Runbook: "Run terraform apply with instance_count=N"
- Level 2: Script: ./scale-service.sh --count=10
- Level 3: Auto-scaling: Kubernetes HPA scales based on CPU/memory
- Level 4: Self-service: Product team adjusts scaling policy via internal platform UI
- Level 5: Predictive scaling: ML model forecasts traffic and pre-scales
Incident Management¶
A structured approach to handling production incidents. The goal is to restore service first, investigate root causes later.
Incident Severity Levels¶
| Level | Name | Description | Response Time | Response |
|---|---|---|---|---|
| SEV1 | Critical | Complete service outage affecting all users | < 5 minutes | All hands, exec notification, war room, status page |
| SEV2 | Major | Significant degradation, partial outage, data at risk | < 15 minutes | On-call team + escalation, status page update |
| SEV3 | Minor | Minor impact, workaround available | < 1 hour | On-call handles, next business day follow-up |
| SEV4 | Low | Cosmetic, minimal user impact | Next business day | Track in backlog |
Incident Response Process¶
1. DETECT → Automated alerts (preferred) or user reports identify the issue
Goal: Detect before users notice (proactive monitoring)
2. TRIAGE → Assess severity, assign Incident Commander (IC)
Ask: Who is affected? How many? Is data at risk?
Declare incident in Slack/Teams channel
3. MITIGATE → Restore service ASAP (rollback, scale up, failover, feature flag off)
Priority: Mitigate first, root-cause later
"Can we make the bleeding stop, even if we don't know why it's bleeding?"
4. RESOLVE → Fix the underlying issue (deploy fix, repair data, etc.)
Only after service is restored
5. FOLLOW-UP → Postmortem within 48 hours (for SEV1/SEV2)
Track action items to completion
Incident Roles¶
| Role | Responsibilities |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks, manages communication cadence, determines severity, decides when incident is resolved |
| Operations Lead | Executes technical mitigation (rollbacks, infrastructure changes, debugging) |
| Communications Lead | Updates status page, notifies stakeholders, manages internal/external comms, records timeline |
| Subject Matter Experts (SMEs) | Pulled in as needed for specific subsystems (database expert, network expert, etc.) |
IC responsibilities in detail:
- Maintain a clear picture of the current situation
- Assign tasks to specific individuals (never "someone should...")
- Set check-in cadence ("Let's sync every 15 minutes")
- Manage scope and prevent rabbit holes ("That's a good investigation but let's focus on mitigation first")
- Make the call on risky mitigations ("Yes, let's roll back even though we'll lose the last hour of data")
- Declare the incident resolved and schedule the postmortem
Incident Communication¶
Internal communication template (posted every 15-30 minutes in the incident channel):
## Incident Update — [SERVICE] — SEV[X] — [HH:MM UTC]
**Status**: Investigating / Identified / Mitigating / Resolved
**Impact**: [Description of user-facing impact]
**IC**: @person
**Current actions**:
- @alice is investigating database connection pool exhaustion
- @bob is preparing a rollback of the 14:00 deploy
**Next update**: [HH:MM UTC]
External communication (status page update):
[14:05 UTC] Investigating — We are investigating elevated error rates
for the API. Some users may experience timeouts.
[14:20 UTC] Identified — A database migration has been identified as
the cause. We are rolling back the change.
[14:30 UTC] Monitoring — The rollback is complete and error rates are
returning to normal. We are monitoring for stability.
[15:00 UTC] Resolved — This incident has been resolved. All services
are operating normally. A postmortem will be published
within 48 hours.
Incident Walkthrough Example¶
13:55 UTC — Monitoring alert fires: "API error rate > 5% for 5 minutes"
Alert routes to PagerDuty → on-call engineer Alice's phone
14:00 UTC — Alice acknowledges. Opens laptop, checks dashboard.
Error rate is 12% and rising. Creates #incident-api-errors Slack channel.
Posts: "SEV2 incident declared. I'm IC. API error rate at 12%."
14:03 UTC — Alice checks recent deploys: deployment at 13:50 by Bob (PR #1234).
Pulls in Bob as SME.
14:07 UTC — Bob confirms: "The deploy added a new database migration that adds
a NOT NULL column without a default value. All INSERTs are failing."
14:10 UTC — Alice makes the call: "Let's rollback the deploy immediately.
Bob, please initiate rollback."
14:12 UTC — Bob runs: kubectl rollout undo deployment/api-server
Migration rollback: applies down migration via CI/CD
14:18 UTC — Error rate drops from 12% to 0.3% (normal baseline).
Alice posts update: "Mitigation successful. Error rates returning
to normal. Monitoring for 30 minutes before declaring resolved."
14:50 UTC — Alice declares incident resolved.
Schedules postmortem for tomorrow 10am.
Error budget consumed: ~2.2 minutes of equivalent downtime
(18 minutes × 12% error rate).
Blameless Postmortems¶
After every significant incident, conduct a blameless postmortem: a structured review focused on what happened and how to prevent recurrence, not on who caused it. The fundamental belief is that people don't cause incidents—systems that allow single human errors to cause outages are poorly designed.
Blameless Culture¶
Blameless does not mean "accountable-less". People are still accountable for learning and improving. The distinction:
- Blaming: "Bob deployed bad code and caused the outage" → People hide mistakes
- Blameless: "The deployment pipeline didn't catch the migration issue" → People share freely, system improves
If people fear punishment, they'll hide information. If information is hidden, you can't identify systemic issues. If you can't identify systemic issues, they'll recur.
Facilitating a Postmortem¶
- Who attends: IC, operations lead, communications lead, relevant SMEs, engineering manager. Optionally: anyone who wants to learn.
- When: Within 48 hours of the incident (while memories are fresh)
- Duration: 30-60 minutes
- Facilitator: Someone not directly involved in the incident (reduces bias)
Facilitation tips:
- Start by reviewing the timeline together
- Ask "what" and "how" questions, not "why didn't you..." questions
- Rephrase blame: "Why didn't Alice check the migration?" → "What in our process could have caught this migration issue?"
- Ensure all perspectives are heard (the junior engineer who noticed something may have critical context)
- Focus on systemic improvements, not individual behavior
Root Cause Analysis Techniques¶
5 Whys:
Why did the API return errors?
→ Because the database INSERT queries were failing.
Why were INSERT queries failing?
→ Because a new NOT NULL column was added without a default value.
Why was there no default value?
→ Because the migration wasn't tested against production-like data.
Why wasn't the migration tested against production-like data?
→ Because our CI pipeline doesn't include migration testing.
Why doesn't our CI pipeline include migration testing?
→ Because migration testing was never added as a requirement.
Action item: Add migration testing to CI pipeline.
Note: "5 Whys" is a starting point, not a rigid formula. Some root causes need 3 whys, some need 7. And most incidents have multiple contributing factors, not a single root cause.
Contributing factors model (preferred over single root cause): Instead of finding "the" root cause, identify all factors that contributed:
Contributing factors for the API outage:
1. Migration lacked a default value (direct cause)
2. CI pipeline doesn't test migrations (detection gap)
3. No canary deployment for database changes (rollout gap)
4. On-call engineer took 5 minutes to acknowledge (response gap)
5. Rollback procedure wasn't documented (knowledge gap)
Each factor gets its own action item.
Postmortem Template¶
# Incident Postmortem: [Title]
## Summary
Brief description of what happened (2-3 sentences).
## Impact
- Duration: X hours Y minutes
- Users affected: N (or percentage)
- Revenue impact: $X (if applicable)
- Error budget consumed: X%
- Data lost/corrupted: Y records (if applicable)
## Timeline (all times UTC)
- 14:00 — Monitoring alert fires for elevated 5xx rates
- 14:05 — On-call engineer acknowledged, begins investigation
- 14:10 — Recent deploy identified as potential cause
- 14:15 — Root cause identified: bad database migration
- 14:20 — Mitigation: rolled back database migration
- 14:25 — Service restored, error rates return to normal
- 14:50 — Incident declared resolved after monitoring period
## Root Cause
The database migration (PR #1234) added a column with a NOT NULL constraint
without a default value, causing INSERT failures for all new records.
## Contributing Factors
- Migration was not tested against production-like data
- CI pipeline doesn't run migration tests against a full dataset
- No canary deployment for database migrations
- Rollback procedure for migrations was not documented
## What Went Well
- Fast detection (5 min from deploy to alert)
- Fast mitigation (20 min total incident duration)
- Clear incident communication
## What Went Poorly
- No pre-production migration testing
- On-call engineer unfamiliar with migration rollback procedure
- No automated rollback trigger for SLO violations
## Where We Got Lucky
- Low-traffic window reduced user impact
- The migration was reversible (some aren't)
## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add migration testing to CI pipeline | @alice | P1 | 2025-02-15 | TODO |
| Implement canary process for DB migrations | @bob | P1 | 2025-03-01 | TODO |
| Add runbook for migration rollbacks | @carol | P2 | 2025-02-28 | TODO |
| Investigate automated rollback on SLO breach | @dave | P2 | 2025-03-15 | TODO |
## Lessons Learned
[Open discussion points and takeaways]
Action item tracking: The postmortem is only valuable if action items are completed. Assign a specific owner, priority, and due date for each item. Track completion in your project management tool. Review outstanding postmortem action items weekly.
On-Call Practices¶
On-call is a fundamental part of SRE. Best practices ensure it's sustainable, effective, and doesn't burn out engineers.
Rotation Design¶
| Model | Description | Pros | Cons |
|---|---|---|---|
| Weekly | One person on-call for a full week | Predictable schedule, context continuity | Long stretches can be tiring |
| Biweekly | Alternating weeks | More time off between rotations | Less frequent, may lose context |
| Follow-the-sun | Hand off between time zones (US → EU → APAC) | No overnight pages | Requires global team, handoff complexity |
| Primary + Secondary | Primary responds first; secondary is backup | Redundancy, load sharing, mentoring | Requires more people in rotation |
Best practices:
- Minimum 8 people in rotation (for weekly rotations: each person is on-call ~6 weeks/year)
- Maximum 2 incidents per on-call shift (target)
- Compensate on-call with pay or time off
- Primary hands off to secondary if they've been engaged for more than 8 hours
- Require at least one business day between being primary and secondary
Alert Quality¶
Alert fatigue is the enemy of effective on-call. If engineers are paged too frequently or for non-critical issues, they become desensitized and may miss critical alerts.
Every alert should pass the "Is this actionable?" test:
| Alert Quality | Criteria | What to Do |
|---|---|---|
| Good alert | Actionable, time-sensitive, affects users | Keep as page |
| Informational | Useful to know but not time-sensitive | Demote to ticket or dashboard |
| Noisy | Fires frequently, no action needed | Delete or tune threshold |
| Symptom-based | Based on user-visible symptoms (error rate, latency) | Preferred approach |
| Cause-based | Based on internal state (CPU, memory, disk) | Supplement, not primary alerts |
Alert hygiene metrics:
- Pages per on-call shift: target ≤ 2 (not counting false positives)
- False positive rate: target < 5% (pages that required no action)
- Time to acknowledge: < 5 minutes
- Time to mitigate: track p50 and p95
- Proportion of pages outside business hours: should be roughly proportional to the traffic pattern
Runbooks¶
Every alert should link to a runbook. Runbooks are step-by-step procedures for diagnosing and resolving common issues:
# Runbook: API Error Rate > 5%
## Alert
API error rate exceeds 5% for 5 minutes (SEV2)
## Quick Diagnosis
1. Check recent deploys: `kubectl rollout history deployment/api-server`
- If recent deploy: consider rollback (Step A)
2. Check database connectivity: `curl http://api-server:8080/health/db`
- If database down: see Database Runbook
3. Check upstream dependencies: dashboard link [here]
- If dependency is down: enable circuit breaker (Step B)
4. Check for traffic spike: Grafana dashboard [here]
- If traffic spike: scale up (Step C)
## Step A: Rollback Recent Deploy
kubectl rollout undo deployment/api-server
# Wait 2 minutes, verify error rate drops
## Step B: Enable Circuit Breaker
kubectl set env deployment/api-server CIRCUIT_BREAKER_ENABLED=true
# This returns cached/default responses when dependency is down
## Step C: Scale Up
kubectl scale deployment/api-server --replicas=10
# Normal is 3 replicas; 10 handles 3x traffic
## Escalation
If none of the above resolves the issue within 15 minutes:
- Page the API team lead: @alice (PagerDuty)
- Page the database on-call: @bob (PagerDuty)
Capacity Planning¶
Capacity planning ensures your system can handle expected demand with sufficient headroom for unexpected spikes.
Demand Forecasting¶
| Source | Predictability | Example |
|---|---|---|
| Organic growth | Medium (trend-based) | 10% monthly user growth → capacity needs grow proportionally |
| Seasonal patterns | High (historical data) | Black Friday traffic 5x normal, end-of-quarter reporting spikes |
| Launch events | Medium (planned but uncertain magnitude) | New feature launch, marketing campaign, TV ad |
| Viral events | Low (unpredictable) | App goes viral on social media, breaking news |
Capacity Modeling¶
Little's Law (fundamental relationship for queuing systems):
L = λ × W
L = average number of items in the system (concurrent requests)
λ = average arrival rate (requests per second)
W = average time in the system (latency)
Example:
λ = 1,000 requests/sec
W = 200ms = 0.2s
L = 1,000 × 0.2 = 200 concurrent requests
If each server handles 50 concurrent requests, you need ≥ 4 servers.
With 50% headroom: 6 servers.
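The worked example generalizes to a small sizing helper (a sketch; per-server concurrency and the headroom factor are inputs you must measure and choose for your own system):

```python
import math

def servers_needed(arrival_rate: float, latency_s: float,
                   per_server_concurrency: int,
                   headroom: float = 0.5) -> int:
    """Size a fleet with Little's Law: L = lambda x W, plus headroom.

    arrival_rate: requests per second (lambda)
    latency_s: average time in system, in seconds (W)
    per_server_concurrency: concurrent requests one server handles
    headroom: extra capacity fraction (0.5 = 50% above the base need)
    """
    concurrent = arrival_rate * latency_s                       # L = λ × W
    base = math.ceil(concurrent / per_server_concurrency)
    return math.ceil(base * (1 + headroom))
```

Reproducing the example: 1,000 req/s at 200ms latency is 200 concurrent requests, needing 4 servers at 50 concurrent requests each, or 6 with 50% headroom.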
Capacity planning checklist:
1. Measure current usage: CPU, memory, network, connections, request rate
2. Identify the bottleneck: What resource saturates first?
3. Model growth: Linear, exponential, or stepped (based on business plans)
4. Add headroom: Plan for at least 50% above expected peak
5. Plan for failures: If one AZ goes down, can the remaining AZs handle the load?
6. Set alerts: Alert at 70% capacity to trigger scaling or procurement
Load Testing for Capacity¶
Load test in a production-like environment to find actual capacity limits:
Load Test Strategy:
1. Baseline test: Normal traffic level for 30 minutes (establish metrics)
2. Ramp test: Gradually increase to 2x normal over 30 minutes
3. Stress test: Continue ramping until failure (find the breaking point)
4. Soak test: Run at 1.5x normal for 24 hours (find memory leaks, connection leaks)
5. Spike test: Sudden burst to 5x normal (test auto-scaling response time)
Release Engineering¶
Release engineering is the practice of building and deploying software reliably and safely.
Progressive Rollouts¶
| Strategy | Description | Detection Time | Blast Radius |
|---|---|---|---|
| Canary | Deploy to 1-5% of instances, monitor, gradually increase | Minutes to hours | Small (1-5% of traffic) |
| Blue-Green | Deploy to idle environment, switch traffic all at once | Immediate | 100% if not caught in staging |
| Rolling update | Replace instances one at a time | Moderate | Grows over time |
| Feature flags | Code deployed everywhere but gated behind flag | Immediate (toggle off) | Controlled (targeted users) |
Canary deployment flow:
1. Deploy v2 to 1% of instances
2. Monitor for 10 minutes: error rate, latency, business metrics
3. If metrics are healthy: increase to 5%
4. Monitor for 30 minutes
5. If healthy: increase to 25% → 50% → 100%
6. If unhealthy at any stage: roll back to 0%, investigate
Automatic rollback trigger:
- Error rate increases by > 0.1% compared to baseline
- p95 latency increases by > 50ms compared to baseline
- Any custom business metric degrades
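The rollback triggers above can be encoded as a comparison of canary metrics against the baseline (the metric names and dict shape are illustrative assumptions, not a specific tool's API):

```python
def should_rollback(baseline: dict, canary: dict) -> bool:
    """Apply the automatic rollback triggers: error rate up by more
    than 0.1% or p95 latency up by more than 50ms vs. baseline."""
    if canary["error_rate"] - baseline["error_rate"] > 0.001:
        return True
    if canary["p95_latency_ms"] - baseline["p95_latency_ms"] > 50:
        return True
    return False
```

In practice a canary controller (Argo Rollouts, Flagger, or similar) evaluates checks like this at each stage of the 1% → 5% → 25% → 50% → 100% progression.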
Feature Flags¶
Feature flags decouple deployment from release. Code is deployed to all instances but new features are toggled on/off without a deploy:
# Feature flag usage
if feature_flags.is_enabled("new-checkout-flow", user_id=user.id):
return new_checkout_flow(cart)
else:
return legacy_checkout_flow(cart)
Feature flag lifecycle:
1. Development: Flag created, defaults to off
2. Testing: Enabled for internal users and the QA team
3. Canary: Enabled for 1% of production users
4. Rollout: Gradually increase to 100%
5. Cleanup: Remove the flag and the old code path (critical; otherwise it becomes tech debt)
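The gradual-rollout stages are commonly implemented with deterministic bucketing, so a given user stays in (or out of) the rollout as the percentage grows rather than flickering between code paths. A sketch, assuming SHA-256 hashing as the bucketing function (one common approach, not any specific flag library's implementation):

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically bucket a user into a flag's rollout percentage.

    The same (flag, user) pair always hashes to the same bucket in
    [0, 100), so raising `percent` only ever adds users to the rollout.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent
```

Hashing the flag name together with the user ID means different flags roll out to different (uncorrelated) user subsets at the same percentage.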
Change Management¶
Change freezes: During high-risk periods (Black Friday, year-end processing), restrict changes to emergency-only. Define clearly what qualifies as an emergency.
Deploy windows: Some teams restrict deploys to specific hours (e.g., 9am-3pm, no Fridays). This ensures experienced staff are available if issues arise. Counter-argument: smaller, more frequent deploys are safer than large, batched deploys.
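A deploy-window policy like the example above reduces to a small predicate in the deployment pipeline; the hours encoded here are just the illustrative ones from the text:

```python
from datetime import datetime

# Hypothetical deploy-window policy: weekdays 9am-3pm, never on Fridays.

def in_deploy_window(now: datetime) -> bool:
    if now.weekday() >= 4:     # 4 = Friday, 5/6 = weekend
        return False
    return 9 <= now.hour < 15  # 9:00-14:59 local time

print(in_deploy_window(datetime(2024, 3, 5, 10, 30)))  # Tuesday 10:30 -> True
print(in_deploy_window(datetime(2024, 3, 8, 10, 30)))  # Friday -> False
```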
Chaos Engineering¶
Chaos engineering is the practice of intentionally injecting failures into production systems to test resilience and discover weaknesses before they cause real outages.
Principles¶
- Start with a hypothesis: "Our system should handle the loss of one database replica without user-visible impact"
- Define steady state: Normal metrics (error rate, latency, throughput)
- Inject failure: Kill the replica
- Observe: Did the system maintain steady state? How long did recovery take?
- Learn: If the hypothesis failed, fix the system. If it passed, try a harder failure.
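The principles above can be sketched as an experiment skeleton. The metric source and injection hooks here are stand-ins; in practice they would call your monitoring API and a chaos tool such as Chaos Mesh:

```python
# Skeleton of a chaos experiment: measure steady state, inject a
# failure, observe, restore, and check whether the hypothesis held.

def run_experiment(get_error_rate, inject_failure, restore, threshold=0.01):
    """Return True if steady state held during the injected failure."""
    baseline = get_error_rate()            # 1. define steady state
    inject_failure()                       # 2. inject the failure
    try:
        during = get_error_rate()          # 3. observe
    finally:
        restore()                          # always clean up
    return during - baseline <= threshold  # 4. did the hypothesis hold?

# Simulated run: killing a replica bumps error rate only slightly.
state = {"replica_down": False}
error_rate = lambda: 0.008 if state["replica_down"] else 0.002
held = run_experiment(
    get_error_rate=error_rate,
    inject_failure=lambda: state.update(replica_down=True),
    restore=lambda: state.update(replica_down=False),
)
print(held)  # True: 0.008 - 0.002 is within the 0.01 threshold
```

The `finally` block matters: an experiment that fails to clean up after itself becomes the outage it was trying to prevent.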
Failure Injection Patterns¶
| Failure Type | Tools | What You Learn |
|---|---|---|
| Instance termination | Chaos Monkey, LitmusChaos | Auto-scaling, load balancer health checks |
| Network latency | tc (traffic control), Toxiproxy | Timeout handling, circuit breakers |
| Network partition | iptables, Chaos Mesh | Split-brain handling, consistency behavior |
| AZ/zone failure | AWS FIS, Gremlin | Multi-AZ resilience, data replication |
| Dependency failure | Toxiproxy, service mesh fault injection | Graceful degradation, fallback behavior |
| CPU/memory stress | stress-ng, Chaos Mesh | Resource limits, OOM handling, auto-scaling |
| DNS failure | Modify /etc/resolv.conf, DNS poisoning | DNS caching, fallback resolvers |
Game Days¶
A game day is a planned chaos engineering exercise where the team intentionally breaks systems and practices incident response:
```
Game Day Plan:

Objective: Validate that the payment service handles database failover
Participants: SRE team, payment team, database team
Date: Tuesday, 2pm-4pm (low traffic window)
Failure to inject: Force failover of the primary database replica

Hypothesis: Payment service will:
  - Experience < 5 seconds of errors during failover
  - Automatically reconnect to the new primary
  - Suffer no data loss or corruption

Blast radius controls:
  - Only payment-staging environment (not production)
  - Rollback plan: manually promote old primary if failover fails
  - Abort trigger: error rate > 50% for > 30 seconds

Observation:
  - Monitor: payment error rate, latency, database connections
  - Record: timeline of events, actual behavior vs. hypothesis
```
Building organizational confidence: Start with non-production environments. Graduate to production during low-traffic windows. Eventually run experiments during business hours. The goal is to make chaos engineering routine, not scary.
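An abort trigger like the one in the plan above (error rate above 50% sustained for 30 seconds or more) can be evaluated over a stream of metric samples. The sample format here is illustrative:

```python
# Evaluate an abort trigger over (seconds, error_rate) samples.
# The breach must be continuous: a single healthy sample resets it.

def should_abort(samples, threshold=0.5, duration_s=30):
    """True if error rate stays above threshold for duration_s or longer."""
    breach_started = None
    for t, rate in samples:
        if rate > threshold:
            if breach_started is None:
                breach_started = t           # breach begins
            if t - breach_started >= duration_s:
                return True                  # sustained breach: abort
        else:
            breach_started = None            # healthy sample resets it
    return False

samples = [(0, 0.1), (10, 0.8), (20, 0.9), (30, 0.7), (40, 0.85)]
print(should_abort(samples))  # True: above 50% from t=10 through t=40
```

Requiring a sustained breach rather than a single bad sample keeps the game day from aborting on a transient blip, while still bounding how long users (or staging tests) are exposed.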
SRE Organizational Practices¶
Production Readiness Review (PRR)¶
Before a service can go to production (or before SRE takes on operational responsibility), conduct a production readiness review:
```markdown
# Production Readiness Review Checklist

## Reliability
- [ ] SLOs defined and instrumented
- [ ] Error budget policy agreed upon with product team
- [ ] Alert rules configured with runbooks
- [ ] On-call rotation established
- [ ] Incident response procedure documented

## Architecture
- [ ] No single points of failure
- [ ] Graceful degradation when dependencies fail
- [ ] Circuit breakers for external dependencies
- [ ] Timeouts configured for all external calls
- [ ] Rate limiting implemented

## Observability
- [ ] Structured logging to centralized system
- [ ] Metrics exported (request rate, latency, errors, saturation)
- [ ] Distributed tracing enabled
- [ ] Dashboards created for key metrics
- [ ] Health check endpoint (/health, /ready)

## Operations
- [ ] Deployment pipeline with automated rollback
- [ ] Canary or blue-green deployment strategy
- [ ] Rollback tested and documented
- [ ] Capacity plan documented
- [ ] Load tested at 2x expected peak

## Security
- [ ] No hardcoded secrets
- [ ] TLS for all external communication
- [ ] Authentication and authorization configured
- [ ] Dependencies scanned for vulnerabilities

## Data
- [ ] Backup strategy tested (including restore)
- [ ] Data retention policy defined
- [ ] GDPR/compliance requirements met
```
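One checklist item worth unpacking is the health check endpoint. A minimal standard-library sketch, separating liveness (`/health`: is the process alive?) from readiness (`/ready`: can it serve traffic?), since orchestrators probe these differently:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative liveness/readiness endpoints. An orchestrator restarts
# the process on failed liveness, but only stops routing traffic on
# failed readiness (e.g. while warming caches or draining connections).

READY = {"ok": True}  # flip to False while warming up or draining

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self._respond(200, b"ok")
        elif self.path == "/ready":
            if READY["ok"]:
                self._respond(200, b"ready")
            else:
                self._respond(503, b"not ready")
        else:
            self._respond(404, b"not found")

    def _respond(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs

# To run standalone:
#   HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Keeping the two endpoints separate prevents a common failure mode: a service that reports itself "healthy" while still unable to serve requests, causing load balancers to send traffic into errors.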
Service Tiers¶
Not all services require the same level of reliability. Classify services into tiers:
| Tier | Reliability Target | Monitoring | On-Call | Example |
|---|---|---|---|---|
| Tier 1 (Critical) | 99.99% | Real-time alerting, SLO-based | 24/7 dedicated rotation | Payment processing, authentication |
| Tier 2 (Important) | 99.9% | Alerting with business-hours response | Shared on-call rotation | Product catalog, user profiles |
| Tier 3 (Standard) | 99% | Monitoring dashboards, next-business-day | Best-effort | Internal tools, analytics pipelines |
| Tier 4 (Best-effort) | None | Basic monitoring | No on-call | Experimental features, internal prototypes |
Service tiers drive investment decisions: Tier 1 services get multi-region deployment, automated failover, and chaos engineering. Tier 4 services run on a single instance and accept periodic downtime.
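The tier targets translate directly into downtime budgets, which makes the cost of each extra "nine" concrete. A quick calculation over a 30-day month:

```python
# Convert tier reliability targets into allowed downtime per 30-day month.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

tiers = {"Tier 1": 99.99, "Tier 2": 99.9, "Tier 3": 99.0}
for tier, target in tiers.items():
    budget = MINUTES_PER_MONTH * (1 - target / 100)
    print(f"{tier} ({target}%): {budget:.1f} minutes of downtime/month")
```

Tier 1's 99.99% allows roughly 4.3 minutes of downtime per month, which is why it demands automated failover: a human paged at 3am cannot respond that fast. Tier 3's 99% allows over 7 hours, comfortably within next-business-day response.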
Technical Debt Management¶
SRE teams often encounter technical debt that impacts reliability. Track reliability-related debt and prioritize it:
| Category | Example | Impact |
|---|---|---|
| Operational debt | Manual deploy process, no runbooks | Slower incident response, higher toil |
| Architectural debt | Single point of failure, monolithic database | Outage risk, scaling bottleneck |
| Observability debt | Missing metrics, no distributed tracing | Longer time to diagnose issues |
| Testing debt | No load tests, no chaos testing | Unknown failure modes |
Allocate a percentage of engineering capacity (20-30%) specifically for reliability and infrastructure improvements, separate from feature work. Without explicit allocation, reliability work is perpetually deprioritized until the next outage.