Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE), pioneered by Google, is a discipline that applies software engineering principles to infrastructure and operations problems. It bridges the gap between development teams (who want to ship fast) and operations teams (who want stability). SRE treats operations work as a software problem—instead of manually managing systems, SREs write code to automate away operational toil.

While monitoring (Chapter 16) covers the observability stack, SRE is about the organizational practices, frameworks, and culture for running reliable systems at scale.

SRE vs. DevOps

| Aspect | SRE | DevOps |
|--------|-----|--------|
| Origin | Google (2003) | Community movement (2008) |
| Focus | Reliability as a feature | Breaking silos between dev and ops |
| Approach | Prescriptive (concrete practices, metrics) | Philosophical (culture, automation, sharing) |
| Key metric | Error budget | Deployment frequency, lead time |

Relationship: SRE implements DevOps principles with concrete practices.

Google's Ben Treynor Sloss: "SRE is what happens when you ask a software engineer to design an operations function."

A useful analogy: DevOps is an interface (a set of principles and values), while SRE is a concrete class that implements that interface. DevOps says "you should automate"; SRE says "you should spend no more than 50% of your time on toil, and here's how to measure and reduce it."

SRE Team Structures

| Model | Description | Pros | Cons |
|-------|-------------|------|------|
| Embedded SRE | SREs are members of product teams | Deep product knowledge, tight collaboration | May lose SRE community, inconsistent practices |
| Centralized SRE | Dedicated SRE team supports multiple services | Consistent standards, shared tooling, career growth | Can become a bottleneck, less product context |
| Consulting SRE | SRE team advises product teams, doesn't own services | Scales knowledge broadly, product teams own reliability | Advice may be ignored, less operational depth |
| Platform Engineering | Builds self-service reliability tools and platforms | Scales to many teams, reduces per-team toil | Requires significant investment, can feel disconnected |

Most organizations start with centralized SRE and evolve toward a hybrid model: a central platform/SRE team builds shared tooling (CI/CD, observability, infrastructure), while embedded SREs or reliability-focused engineers within product teams apply those tools to their specific services.

SRE Engagement Model

SRE teams cannot support every service in an organization. A common engagement model:

  1. Self-serve tier: Product teams use SRE-provided tools and runbooks independently
  2. Consulting tier: SRE provides architecture reviews, production readiness reviews, and guidance
  3. Embedded tier: SRE is directly involved in operating the service (reserved for critical, high-complexity systems)

Criteria for full SRE engagement typically include: business criticality, service complexity, traffic volume, and the product team's willingness to follow SRE practices (SLOs, error budgets, postmortems).

Service Level Indicators, Objectives, and Agreements

The SLI → SLO → SLA hierarchy is the foundation of SRE. It transforms reliability from a vague goal ("make it reliable") into a measurable, actionable framework.

Relationship:
  SLI (what you measure) → SLO (what you target) → SLA (what you promise)

Example:
  SLI: Availability = successful requests / total requests
  SLO: Availability >= 99.95% per month (internal target)
  SLA: Availability >= 99.9% per month (customer contract — with credits if breached)
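As a quick sketch of the hierarchy above (all counts and targets are illustrative):

```python
# Illustrative sketch: checking an availability SLI against the SLO
# (internal target) and the SLA (contractual promise). Numbers invented.

def availability_sli(successful: int, total: int) -> float:
    """SLI: proportion of successful requests."""
    return successful / total

SLO = 0.9995  # internal target
SLA = 0.999   # customer contract (always looser than the SLO)

sli = availability_sli(successful=9_993_000, total=10_000_000)
print(f"SLI: {sli:.4%}")         # SLI: 99.9300%
print("SLO met:", sli >= SLO)    # SLO met: False (internal target missed)
print("SLA met:", sli >= SLA)    # SLA met: True (contract still intact)
```

Note the gap this illustrates: the team can miss its internal target and still owe customers nothing, which is exactly the buffer an SLA is supposed to provide.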

SLIs (Service Level Indicators)

Quantitative measures of service behavior. The most important principle: SLIs should reflect the user experience, not system internals. CPU utilization is not an SLI (users don't experience CPU); request latency is.

| SLI | Definition | Example | Measurement Point |
|-----|------------|---------|-------------------|
| Availability | Proportion of successful requests | 99.95% of requests return non-5xx responses | Load balancer access logs |
| Latency | Time to serve a request | 95th percentile response time < 200ms | Application instrumentation |
| Throughput | Rate of successful operations | > 10,000 requests/second sustained | Metrics aggregation |
| Error rate | Proportion of failed requests | < 0.1% of requests result in errors | Application error tracking |
| Freshness | How up-to-date data is | Search index updated within 5 minutes | Pipeline monitoring |
| Correctness | Proportion of correct responses | 99.99% of calculations return correct results | End-to-end validation |
| Durability | Proportion of data retained | 99.999999999% of objects stored are not lost | Storage system metrics |

SLI Specification vs. Implementation:

  • Specification: What the SLI measures conceptually (e.g., "the proportion of valid requests served within 200ms")
  • Implementation: How you actually measure it (e.g., "count of responses with status < 500 and duration < 200ms at the load balancer, divided by total request count")

The implementation matters because where you measure changes what you see. Measuring latency at the server misses network latency; measuring at the client includes the full user experience but is harder to collect. A common compromise: measure at the load balancer (captures server processing + internal network, but not client-side network).

Choosing SLIs by service type:

| Service Type | Primary SLIs |
|--------------|--------------|
| User-facing API | Availability, latency (p50, p95, p99), error rate |
| Data pipeline | Freshness (data staleness), correctness, throughput |
| Storage system | Availability, latency, durability |
| Streaming service | Start-up time, rebuffer rate, resolution quality |
| Batch processing | Completion time, success rate, data quality |

SLOs (Service Level Objectives)

Target values for SLIs that define "good enough" reliability. SLOs are internal commitments—they represent the reliability level that satisfies users without over-investing in reliability.

SLO Examples:
  - "99.9% of API requests will succeed (non-5xx) measured over a rolling 30-day window"
  - "95th percentile latency will be under 200ms measured over a rolling 7-day window"
  - "99.99% of payment transactions will complete successfully measured monthly"
  - "Data pipeline freshness: 99% of records available within 5 minutes of creation"

Setting SLOs:

  • Too aggressive (99.999%) → team spends all time on reliability, can't ship features, infrastructure costs explode
  • Too relaxed (99%) → users have poor experience, churn increases
  • Target the level of reliability users actually need — consider: if you're 99.9% and your upstream dependency is 99%, your extra 0.9% doesn't matter to the user

The nines table — what each level of availability actually means:

| Availability | Downtime/month | Downtime/year | Typical For |
|--------------|----------------|---------------|-------------|
| 99% (two 9s) | 7.2 hours | 3.65 days | Internal tools, batch systems |
| 99.9% (three 9s) | 43.2 minutes | 8.76 hours | SaaS applications, APIs |
| 99.95% | 21.6 minutes | 4.38 hours | Business-critical services |
| 99.99% (four 9s) | 4.32 minutes | 52.56 minutes | Payment systems, core infrastructure |
| 99.999% (five 9s) | 25.9 seconds | 5.26 minutes | Telecom, medical systems |

(Monthly figures assume a 30-day month.)

Each additional nine is roughly 10x harder and more expensive to achieve. Going from 99.9% to 99.99% might require multi-region redundancy, automated failover, and eliminating all single points of failure. Going from 99.99% to 99.999% might require custom hardware, redundant providers, and near-zero planned maintenance.
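The table's figures fall out of one line of arithmetic; a minimal sketch, assuming a 30-day month and 365-day year as the table does:

```python
# Convert an availability target into allowed downtime.
# Assumes a 30-day month and a 365-day year, matching the table above.

def downtime_minutes(availability: float, period_minutes: int) -> float:
    """Minutes of downtime permitted by an availability target."""
    return (1 - availability) * period_minutes

MONTH = 30 * 24 * 60    # 43,200 minutes
YEAR = 365 * 24 * 60    # 525,600 minutes

for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    per_month = downtime_minutes(target, MONTH)
    per_year = downtime_minutes(target, YEAR)
    print(f"{target:.3%}  {per_month:8.2f} min/month  {per_year:9.2f} min/year")
```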

SLO-based alerting: Instead of alerting on symptoms (CPU > 90%), alert on SLO burn rate. The burn rate is how fast you're consuming your error budget:

Burn rate = (error rate observed) / (error rate allowed by SLO)

Example:
  SLO: 99.9% availability (0.1% error budget per month)
  Current error rate: 1%
  Burn rate = 1% / 0.1% = 10x

  At 10x burn rate, a month's error budget is consumed in 3 days.

Multi-window, multi-burn-rate alerting (recommended by the SRE Workbook):

| Severity | Burn Rate | Long Window | Short Window | Action |
|----------|-----------|-------------|--------------|--------|
| Page (critical) | 14.4x | 1 hour | 5 minutes | Immediate response |
| Page (warning) | 6x | 6 hours | 30 minutes | Investigate soon |
| Ticket | 3x | 3 days | 6 hours | Schedule fix |

The short window confirms the long window isn't a brief spike that already resolved. Both windows must be breaching simultaneously to fire the alert.
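A minimal sketch of that two-window check (the threshold matches the critical-page row above; the sample error rates are invented, and in practice each window's error rate would come from your metrics backend):

```python
# Multi-window burn-rate check, as described above.
# Error rates per window are illustrative; normally they come from
# a metrics query over that window (e.g. 1h and 5m averages).

SLO_ERROR_BUDGET = 0.001  # 99.9% availability SLO

def burn_rate(error_rate: float) -> float:
    """How fast the error budget is being consumed."""
    return error_rate / SLO_ERROR_BUDGET

def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only if BOTH windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_error_rate) >= threshold and
            burn_rate(short_window_error_rate) >= threshold)

# Sustained incident: both the 1h and 5m windows are burning hot -> page.
print(should_page(0.02, 0.025))   # True
# Spike that already recovered: 1h average still high, 5m window clean.
print(should_page(0.02, 0.0005))  # False
```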

SLAs (Service Level Agreements)

External contracts with customers that include SLOs plus consequences (refunds, credits) if not met. SLAs should always be less aggressive than internal SLOs—you need a buffer.

| | SLO | SLA |
|---|-----|-----|
| Audience | Internal teams | External customers |
| Consequence of breach | Error budget policy (slow down releases) | Financial penalties (credits, refunds) |
| Aggressiveness | More aggressive (internal target) | More conservative (contractual promise) |
| Measurement | Fine-grained (per-request, rolling window) | Coarser (monthly, aggregated) |

Worked Example: E-Commerce Platform

Service: Product Catalog API

SLI (Availability):
  Implementation: Count of responses with status < 500, divided by total
                  responses, measured at the ALB over a rolling 30-day window.

SLI (Latency):
  Implementation: Proportion of responses where server processing time
                  (from ALB metrics) is < 200ms, over a rolling 30-day window.

SLO: 99.95% availability, 99% of requests served under 200ms.

Error Budget:
  Availability: 0.05% = ~21.6 minutes of downtime per month
  Latency: 1% of requests can exceed 200ms

SLA (to enterprise customers):
  99.9% availability, measured monthly. If breached, 10% service credit.
  (Note: SLA is 99.9% while SLO is 99.95% — 0.05% buffer)

Error Budgets

The error budget is the allowed amount of unreliability. It's derived from the SLO:

Error Budget = 100% - SLO

Example:
  SLO = 99.9% availability per month
  Error Budget = 0.1% = ~43.2 minutes of downtime per month
  At 1 million requests/day, that's ~1,000 failed requests/day allowed

The error budget transforms the dev/ops tension from a cultural conflict into a data-driven discussion:

  • Budget remaining: Development velocity is prioritized. Ship features, take risks.
  • Budget exhausted: Reliability is prioritized. Freeze feature launches, focus on stability, automation, and reducing technical debt.

This creates a self-regulating system: if a team ships a buggy release that causes an outage, they've consumed error budget and must slow down to improve reliability. If the system is ultra-stable, they have budget to take more risks.

Error Budget Calculation Examples

Availability-based:

SLO: 99.95% over 30 days
Budget: 0.05% of 30 days = 0.05% × 43,200 minutes = 21.6 minutes

If an outage lasted 15 minutes:
  Remaining budget = 21.6 - 15 = 6.6 minutes (69% consumed)
  ⚠️ Need to be cautious for the rest of the month

Latency-based:

SLO: 99% of requests under 200ms over 7 days
Budget: 1% of requests can exceed 200ms

If total requests this week = 10,000,000:
  Budget = 100,000 slow requests allowed
  Current slow requests = 75,000
  Remaining budget = 25,000 (75% consumed)
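The two calculations above can be reproduced in a few lines (same numbers, purely illustrative):

```python
# Reproducing the availability and latency budget examples above.

def budget_minutes(slo: float, period_minutes: int) -> float:
    """Error budget, expressed as minutes of allowed downtime."""
    return (1 - slo) * period_minutes

# Availability: 99.95% over 30 days
budget = budget_minutes(0.9995, 30 * 24 * 60)     # ~21.6 minutes
outage = 15.0                                     # one 15-minute outage
remaining = budget - outage
print(f"Remaining: {remaining:.1f} min "
      f"({outage / budget:.0%} consumed)")        # 6.6 min, 69% consumed

# Latency: 99% of requests under 200ms over 7 days
total_requests = 10_000_000
slow_budget = int(total_requests * 0.01)          # 100,000 slow requests allowed
slow_seen = 75_000
print(f"Remaining: {slow_budget - slow_seen} slow requests "
      f"({slow_seen / slow_budget:.0%} consumed)")  # 25,000, 75% consumed
```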

Error Budget Policy

An error budget policy formalizes what happens at different budget consumption levels. It must be agreed upon by both product and SRE leadership before incidents happen—not during.

Error Budget Policy:

> 50% remaining:  Full speed ahead. Ship features, experiment.
                   Standard code review and testing.

25-50% remaining: Increased caution. Additional testing required.
                   Post-deploy monitoring for 30 minutes.
                   No deploys on Fridays.

10-25% remaining: Slow down significantly.
                   Only well-tested, low-risk changes.
                   Require SRE approval for deploys.
                   Begin reliability improvement work.

< 10% remaining:  Feature freeze.
                   Only reliability improvements and critical bug fixes.
                   Incident review if not already done.

Exhausted (0%):   Full freeze on all non-reliability changes.
                   All engineering effort goes to reliability.
                   Executive escalation.
                   Post-freeze review required before resuming feature work.

Error budget negotiation: When product teams push back on budget-driven slowdowns, point to the data. The error budget isn't arbitrary—it's derived from the SLO, which is set based on user needs. If the team doesn't want the slowdown, they have two options: accept a less aggressive SLO (more error budget), or invest in reliability to reduce budget consumption.

Toil

Toil is the kind of work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth.

The defining characteristics:

  • Manual: A human must perform it (not automated)
  • Repetitive: Done over and over, not a one-time task
  • Automatable: Could be done by a script or system
  • Tactical: Reactive, interrupt-driven, not strategic
  • No enduring value: Doesn't permanently improve the service
  • Scales with service growth: Doubling traffic doubles the work

Examples of toil:

  • Manually restarting failed services
  • Manually provisioning user accounts
  • Responding to routine alerts that always require the same fix
  • Manually scaling infrastructure before expected traffic spikes
  • Copy-pasting configuration between environments
  • Manually running database migrations
  • Manually reviewing and approving routine certificate renewals

Not toil (necessary overhead): meetings, on-call rotations, architectural design, strategic planning, writing postmortems, mentoring.

Google's target: SREs should spend no more than 50% of their time on toil. The remaining 50% should be spent on engineering projects that reduce future toil. If toil exceeds 50%, the team is understaffed or under-automated.

Measuring Toil

Track toil by having team members categorize their work:

| Category | Definition | Target |
|----------|------------|--------|
| Toil | Manual, repetitive, automatable operational work | ≤ 50% |
| Engineering | Software development to improve reliability, tools, automation | ≥ 50% |
| Overhead | Meetings, planning, admin, hiring | Track but don't over-optimize |

Measure weekly with simple time tracking. Trend over quarters—if toil percentage is increasing, the team needs to prioritize automation projects.
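A minimal sketch of that weekly categorization (the category hours are invented):

```python
# Weekly toil tracking sketch. Hours per category are illustrative;
# in practice they come from each engineer's time log.

week = {"toil": 18, "engineering": 16, "overhead": 6}
total = sum(week.values())

shares = {category: hours / total for category, hours in week.items()}
print(f"Toil: {shares['toil']:.0%}")  # Toil: 45% -- under the 50% ceiling

if shares["toil"] > 0.5:
    print("Toil exceeds 50%: prioritize automation projects")
```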

The Automation Ladder

Reduce toil progressively through these stages:

Level 0: Manual          — Human does everything manually
Level 1: Documented      — Runbook exists, human follows steps
Level 2: Scripted        — Script performs the task, human triggers it
Level 3: Automated       — System triggers script based on conditions
Level 4: Self-service    — Users trigger automation themselves (no SRE needed)
Level 5: Self-healing    — System detects and fixes issues automatically

Example progression for "scaling a service":

  • Level 0: SSH into server, manually add instances
  • Level 1: Runbook: "Run terraform apply with instance_count=N"
  • Level 2: Script: ./scale-service.sh --count=10
  • Level 3: Auto-scaling: Kubernetes HPA scales based on CPU/memory
  • Level 4: Self-service: Product team adjusts scaling policy via internal platform UI
  • Level 5: Predictive scaling: ML model forecasts traffic and pre-scales

Incident Management

A structured approach to handling production incidents. The goal is to restore service first, investigate root causes later.

Incident Severity Levels

| Level | Name | Description | Response Time | Response |
|-------|------|-------------|---------------|----------|
| SEV1 | Critical | Complete service outage affecting all users | < 5 minutes | All hands, exec notification, war room, status page |
| SEV2 | Major | Significant degradation, partial outage, data at risk | < 15 minutes | On-call team + escalation, status page update |
| SEV3 | Minor | Minor impact, workaround available | < 1 hour | On-call handles, next business day follow-up |
| SEV4 | Low | Cosmetic, minimal user impact | Next business day | Track in backlog |

Incident Response Process

1. DETECT    → Automated alerts (preferred) or user reports identify the issue
               Goal: Detect before users notice (proactive monitoring)

2. TRIAGE    → Assess severity, assign Incident Commander (IC)
               Ask: Who is affected? How many? Is data at risk?
               Declare incident in Slack/Teams channel

3. MITIGATE  → Restore service ASAP (rollback, scale up, failover, feature flag off)
               Priority: Mitigate first, root-cause later
               "Can we make the bleeding stop, even if we don't know why it's bleeding?"

4. RESOLVE   → Fix the underlying issue (deploy fix, repair data, etc.)
               Only after service is restored

5. FOLLOW-UP → Postmortem within 48 hours (for SEV1/SEV2)
               Track action items to completion

Incident Roles

| Role | Responsibilities |
|------|------------------|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks, manages communication cadence, determines severity, decides when incident is resolved |
| Operations Lead | Executes technical mitigation (rollbacks, infrastructure changes, debugging) |
| Communications Lead | Updates status page, notifies stakeholders, manages internal/external comms, records timeline |
| Subject Matter Experts (SMEs) | Pulled in as needed for specific subsystems (database expert, network expert, etc.) |

IC responsibilities in detail:

  • Maintain a clear picture of the current situation
  • Assign tasks to specific individuals (never "someone should...")
  • Set check-in cadence ("Let's sync every 15 minutes")
  • Manage scope — prevent rabbit holes ("That's a good investigation but let's focus on mitigation first")
  • Make the call on risky mitigations ("Yes, let's roll back even though we'll lose the last hour of data")
  • Declare incident resolved and schedule postmortem

Incident Communication

Internal communication template (posted every 15-30 minutes in the incident channel):

## Incident Update — [SERVICE] — SEV[X] — [HH:MM UTC]

**Status**: Investigating / Identified / Mitigating / Resolved
**Impact**: [Description of user-facing impact]
**IC**: @person
**Current actions**:
  - @alice is investigating database connection pool exhaustion
  - @bob is preparing a rollback of the 14:00 deploy
**Next update**: [HH:MM UTC]

External communication (status page update):

[14:05 UTC] Investigating — We are investigating elevated error rates
            for the API. Some users may experience timeouts.

[14:20 UTC] Identified — A database migration has been identified as
            the cause. We are rolling back the change.

[14:30 UTC] Monitoring — The rollback is complete and error rates are
            returning to normal. We are monitoring for stability.

[15:00 UTC] Resolved — This incident has been resolved. All services
            are operating normally. A postmortem will be published
            within 48 hours.

Incident Walkthrough Example

13:55 UTC — Monitoring alert fires: "API error rate > 5% for 5 minutes"
            Alert routes to PagerDuty → on-call engineer Alice's phone

14:00 UTC — Alice acknowledges. Opens laptop, checks dashboard.
            Error rate is 12% and rising. Creates #incident-api-errors Slack channel.
            Posts: "SEV2 incident declared. I'm IC. API error rate at 12%."

14:03 UTC — Alice checks recent deploys: deployment at 13:50 by Bob (PR #1234).
            Pulls in Bob as SME.

14:07 UTC — Bob confirms: "The deploy added a new database migration that adds
            a NOT NULL column without a default value. All INSERTs are failing."

14:10 UTC — Alice makes the call: "Let's rollback the deploy immediately.
            Bob, please initiate rollback."

14:12 UTC — Bob runs: kubectl rollout undo deployment/api-server
            Migration rollback: applies down migration via CI/CD

14:18 UTC — Error rate drops from 12% to 0.3% (normal baseline).
            Alice posts update: "Mitigation successful. Error rates returning
            to normal. Monitoring for 30 minutes before declaring resolved."

14:50 UTC — Alice declares incident resolved.
            Schedules postmortem for tomorrow 10am.
            Error budget consumed: ~2.2 minutes of equivalent downtime
            (18 minutes × 12% error rate).

Blameless Postmortems

After every significant incident, conduct a blameless postmortem: a structured review focused on what happened and how to prevent recurrence, not on who caused it. The fundamental belief is that people don't cause incidents—systems that allow single human errors to cause outages are poorly designed.

Blameless Culture

Blameless does not mean unaccountable: people are still accountable for learning and improving. The distinction:

  • Blaming: "Bob deployed bad code and caused the outage" → People hide mistakes
  • Blameless: "The deployment pipeline didn't catch the migration issue" → People share freely, system improves

If people fear punishment, they'll hide information. If information is hidden, you can't identify systemic issues. If you can't identify systemic issues, they'll recur.

Facilitating a Postmortem

  1. Who attends: IC, operations lead, communications lead, relevant SMEs, engineering manager. Optionally: anyone who wants to learn.
  2. When: Within 48 hours of the incident (while memories are fresh)
  3. Duration: 30-60 minutes
  4. Facilitator: Someone not directly involved in the incident (reduces bias)

Facilitation tips:

  • Start by reviewing the timeline together
  • Ask "what" and "how" questions, not "why didn't you..." questions
  • Rephrase blame: "Why didn't Alice check the migration?" → "What in our process could have caught this migration issue?"
  • Ensure all perspectives are heard (the junior engineer who noticed something may have critical context)
  • Focus on systemic improvements, not individual behavior

Root Cause Analysis Techniques

5 Whys:

Why did the API return errors?
  → Because the database INSERT queries were failing.

Why were INSERT queries failing?
  → Because a new NOT NULL column was added without a default value.

Why was there no default value?
  → Because the migration wasn't tested against production-like data.

Why wasn't the migration tested against production-like data?
  → Because our CI pipeline doesn't include migration testing.

Why doesn't our CI pipeline include migration testing?
  → Because migration testing was never added as a requirement.

Action item: Add migration testing to CI pipeline.

Note: "5 Whys" is a starting point, not a rigid formula. Some root causes need 3 whys, some need 7. And most incidents have multiple contributing factors, not a single root cause.

Contributing factors model (preferred over single root cause): Instead of finding "the" root cause, identify all factors that contributed:

Contributing factors for the API outage:
1. Migration lacked a default value (direct cause)
2. CI pipeline doesn't test migrations (detection gap)
3. No canary deployment for database changes (rollout gap)
4. On-call engineer took 5 minutes to acknowledge (response gap)
5. Rollback procedure wasn't documented (knowledge gap)

Each factor gets its own action item.

Postmortem Template

# Incident Postmortem: [Title]

## Summary
Brief description of what happened (2-3 sentences).

## Impact
- Duration: X hours Y minutes
- Users affected: N (or percentage)
- Revenue impact: $X (if applicable)
- Error budget consumed: X%
- Data lost/corrupted: Y records (if applicable)

## Timeline (all times UTC)
- 14:00 — Monitoring alert fires for elevated 5xx rates
- 14:05 — On-call engineer acknowledged, begins investigation
- 14:10 — Recent deploy identified as potential cause
- 14:15 — Root cause identified: bad database migration
- 14:20 — Mitigation: rolled back database migration
- 14:25 — Service restored, error rates return to normal
- 14:50 — Incident declared resolved after monitoring period

## Root Cause
The database migration (PR #1234) added a column with a NOT NULL constraint
without a default value, causing INSERT failures for all new records.

## Contributing Factors
- Migration was not tested against production-like data
- CI pipeline doesn't run migration tests against a full dataset
- No canary deployment for database migrations
- Rollback procedure for migrations was not documented

## What Went Well
- Fast detection (5 min from deploy to alert)
- Fast mitigation (20 min total incident duration)
- Clear incident communication

## What Went Poorly
- No pre-production migration testing
- On-call engineer unfamiliar with migration rollback procedure
- No automated rollback trigger for SLO violations

## Where We Got Lucky
- Low-traffic window reduced user impact
- The migration was reversible (some aren't)

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add migration testing to CI pipeline | @alice | P1 | 2025-02-15 | TODO |
| Implement canary process for DB migrations | @bob | P1 | 2025-03-01 | TODO |
| Add runbook for migration rollbacks | @carol | P2 | 2025-02-28 | TODO |
| Investigate automated rollback on SLO breach | @dave | P2 | 2025-03-15 | TODO |

## Lessons Learned
[Open discussion points and takeaways]

Action item tracking: The postmortem is only valuable if action items are completed. Assign a specific owner, priority, and due date for each item. Track completion in your project management tool. Review outstanding postmortem action items weekly.

On-Call Practices

On-call is a fundamental part of SRE. Best practices ensure it's sustainable, effective, and doesn't burn out engineers.

Rotation Design

| Model | Description | Pros | Cons |
|-------|-------------|------|------|
| Weekly | One person on-call for a full week | Predictable schedule, context continuity | Long stretches can be tiring |
| Biweekly | Alternating weeks | More time off between rotations | Less frequent, may lose context |
| Follow-the-sun | Hand off between time zones (US → EU → APAC) | No overnight pages | Requires global team, handoff complexity |
| Primary + Secondary | Primary responds first; secondary is backup | Redundancy, load sharing, mentoring | Requires more people in rotation |

Best practices:

  • Minimum 8 people in rotation (for weekly: each person is on-call ~6 weeks/year)
  • Maximum 2 incidents per on-call shift (target)
  • Compensate on-call with pay or time off
  • Primary hands off to secondary if they've been engaged for more than 8 hours
  • Require at least one business day between being primary and secondary

Alert Quality

Alert fatigue is the enemy of effective on-call. If engineers are paged too frequently or for non-critical issues, they become desensitized and may miss critical alerts.

Every alert should pass the "Is this actionable?" test:

| Alert Quality | Criteria | What to Do |
|---------------|----------|------------|
| Good alert | Actionable, time-sensitive, affects users | Keep as page |
| Informational | Useful to know but not time-sensitive | Demote to ticket or dashboard |
| Noisy | Fires frequently, no action needed | Delete or tune threshold |
| Symptom-based | Based on user-visible symptoms (error rate, latency) | Preferred approach |
| Cause-based | Based on internal state (CPU, memory, disk) | Supplement, not primary alerts |

Alert hygiene metrics:

  • Pages per on-call shift: Target ≤ 2 (not counting false positives)
  • False positive rate: Target < 5% (pages that required no action)
  • Time to acknowledge: < 5 minutes
  • Time to mitigate: Track p50 and p95
  • Proportion of pages outside business hours: Should be roughly proportional to traffic pattern

Runbooks

Every alert should link to a runbook. Runbooks are step-by-step procedures for diagnosing and resolving common issues:

# Runbook: API Error Rate > 5%

## Alert
API error rate exceeds 5% for 5 minutes (SEV2)

## Quick Diagnosis
1. Check recent deploys: `kubectl rollout history deployment/api-server`
   - If recent deploy: consider rollback (Step A)
2. Check database connectivity: `curl http://api-server:8080/health/db`
   - If database down: see Database Runbook
3. Check upstream dependencies: dashboard link [here]
   - If dependency is down: enable circuit breaker (Step B)
4. Check for traffic spike: Grafana dashboard [here]
   - If traffic spike: scale up (Step C)

## Step A: Rollback Recent Deploy
kubectl rollout undo deployment/api-server
# Wait 2 minutes, verify error rate drops

## Step B: Enable Circuit Breaker
kubectl set env deployment/api-server CIRCUIT_BREAKER_ENABLED=true
# This returns cached/default responses when dependency is down

## Step C: Scale Up
kubectl scale deployment/api-server --replicas=10
# Normal is 3 replicas; 10 handles 3x traffic

## Escalation
If none of the above resolves the issue within 15 minutes:
- Page the API team lead: @alice (PagerDuty)
- Page the database on-call: @bob (PagerDuty)

Capacity Planning

Capacity planning ensures your system can handle expected demand with sufficient headroom for unexpected spikes.

Demand Forecasting

| Source | Predictability | Example |
|--------|----------------|---------|
| Organic growth | Medium (trend-based) | 10% monthly user growth → capacity needs grow proportionally |
| Seasonal patterns | High (historical data) | Black Friday traffic 5x normal, end-of-quarter reporting spikes |
| Launch events | Medium (planned but uncertain magnitude) | New feature launch, marketing campaign, TV ad |
| Viral events | Low (unpredictable) | App goes viral on social media, breaking news |

Capacity Modeling

Little's Law (fundamental relationship for queuing systems):

L = λ × W

L = average number of items in the system (concurrent requests)
λ = average arrival rate (requests per second)
W = average time in the system (latency)

Example:
  λ = 1,000 requests/sec
  W = 200ms = 0.2s
  L = 1,000 × 0.2 = 200 concurrent requests

  If each server handles 50 concurrent requests, you need ≥ 4 servers.
  With 50% headroom: 6 servers.
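The sizing above can be wrapped in a small helper; a sketch, with the function name and the 50% headroom default as assumptions:

```python
import math

def servers_needed(arrival_rate_rps: float, latency_s: float,
                   per_server_concurrency: int,
                   headroom: float = 0.5) -> int:
    """Size a fleet using Little's Law (L = lambda * W) plus headroom."""
    concurrent = arrival_rate_rps * latency_s               # L = λ × W
    base = math.ceil(concurrent / per_server_concurrency)   # minimum fleet
    return math.ceil(base * (1 + headroom))                 # add headroom

# Example from above: 1,000 req/s at 200ms, 50 concurrent per server.
print(servers_needed(1_000, 0.2, 50))  # 6 (4 servers + 50% headroom)
```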

Capacity planning checklist:

  1. Measure current usage: CPU, memory, network, connections, request rate
  2. Identify the bottleneck: What resource saturates first?
  3. Model growth: Linear, exponential, or stepped (based on business plans)
  4. Add headroom: Plan for at least 50% above expected peak
  5. Plan for failures: If one AZ goes down, can the remaining AZs handle the load?
  6. Set alerts: Alert at 70% capacity to trigger scaling or procurement

Load Testing for Capacity

Load test in a production-like environment to find actual capacity limits:

Load Test Strategy:
1. Baseline test: Normal traffic level for 30 minutes (establish metrics)
2. Ramp test: Gradually increase to 2x normal over 30 minutes
3. Stress test: Continue ramping until failure (find the breaking point)
4. Soak test: Run at 1.5x normal for 24 hours (find memory leaks, connection leaks)
5. Spike test: Sudden burst to 5x normal (test auto-scaling response time)

Release Engineering

Release engineering is the practice of building and deploying software reliably and safely.

Progressive Rollouts

| Strategy | Description | Detection Time | Blast Radius |
|----------|-------------|----------------|--------------|
| Canary | Deploy to 1-5% of instances, monitor, gradually increase | Minutes to hours | Small (1-5% of traffic) |
| Blue-Green | Deploy to idle environment, switch traffic all at once | Immediate | 100% if not caught in staging |
| Rolling update | Replace instances one at a time | Moderate | Grows over time |
| Feature flags | Code deployed everywhere but gated behind flag | Immediate (toggle off) | Controlled (targeted users) |

Canary deployment flow:

1. Deploy v2 to 1% of instances
2. Monitor for 10 minutes: error rate, latency, business metrics
3. If metrics are healthy: increase to 5%
4. Monitor for 30 minutes
5. If healthy: increase to 25% → 50% → 100%
6. If unhealthy at any stage: roll back to 0%, investigate

Automatic rollback trigger:
  - Error rate increases by > 0.1% compared to baseline
  - p95 latency increases by > 50ms compared to baseline
  - Any custom business metric degrades
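A sketch of how those triggers might be evaluated (metric names and sample values are assumptions; fetching real metrics from your monitoring backend is out of scope):

```python
# Canary health check mirroring the rollback triggers above.
# Metric dicts are illustrative; real values come from monitoring.

def canary_healthy(baseline: dict, canary: dict,
                   max_error_delta: float = 0.001,   # > 0.1% regression
                   max_p95_delta_ms: float = 50.0) -> bool:
    """Compare canary metrics against the stable baseline."""
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        return False  # error rate regressed beyond threshold
    if canary["p95_ms"] - baseline["p95_ms"] > max_p95_delta_ms:
        return False  # p95 latency regressed beyond threshold
    return True

baseline = {"error_rate": 0.002, "p95_ms": 180}
print(canary_healthy(baseline, {"error_rate": 0.0025, "p95_ms": 190}))  # True
print(canary_healthy(baseline, {"error_rate": 0.004, "p95_ms": 185}))   # False
```

In a real pipeline this check runs at each canary stage; a `False` result triggers the rollback to 0% described above.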

Feature Flags

Feature flags decouple deployment from release. Code is deployed to all instances but new features are toggled on/off without a deploy:

# Feature flag usage
if feature_flags.is_enabled("new-checkout-flow", user_id=user.id):
    return new_checkout_flow(cart)
else:
    return legacy_checkout_flow(cart)

Feature flag lifecycle:

  1. Development: Flag created, defaults to off
  2. Testing: Enabled for internal users, QA team
  3. Canary: Enabled for 1% of production users
  4. Rollout: Gradually increase to 100%
  5. Cleanup: Remove flag and old code path (critical — tech debt otherwise)

Change Management

Change freezes: During high-risk periods (Black Friday, year-end processing), restrict changes to emergency-only. Define clearly what qualifies as an emergency.

Deploy windows: Some teams restrict deploys to specific hours (e.g., 9am-3pm, no Fridays). This ensures experienced staff are available if issues arise. Counter-argument: smaller, more frequent deploys are safer than large, batched deploys.
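A deploy-window policy like the example above is easy to encode as a pipeline guard. A sketch, assuming the example policy of weekdays except Friday, 9am-3pm local time (purely illustrative):

```python
from datetime import datetime

def in_deploy_window(now: datetime) -> bool:
    """True if deploys are allowed: Mon-Thu, 9:00 up to (not including) 15:00."""
    if now.weekday() >= 4:        # 4 = Friday, 5/6 = weekend
        return False
    return 9 <= now.hour < 15

in_deploy_window(datetime(2024, 3, 12, 10, 30))  # Tuesday 10:30 → True
```

A change-freeze calendar can layer on top of the same guard: check a list of freeze date ranges first, then fall through to the weekly window.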

Chaos Engineering

Chaos engineering is the practice of intentionally injecting failures into production systems to test resilience and discover weaknesses before they cause real outages.

Principles

  1. Start with a hypothesis: "Our system should handle the loss of one database replica without user-visible impact"
  2. Define steady state: Normal metrics (error rate, latency, throughput)
  3. Inject failure: Kill the replica
  4. Observe: Did the system maintain steady state? How long did recovery take?
  5. Learn: If the hypothesis failed, fix the system. If it passed, try a harder failure.
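The five steps above form a repeatable loop, which can be sketched as a skeleton. The three callables are placeholders you would wire to real tooling (metric queries, replica termination); nothing here is a specific chaos framework's API:

```python
def run_experiment(measure, inject_failure, within_steady_state) -> bool:
    """Steps 1-4: verify steady state, inject, observe. Returns True if the
    hypothesis held (steady state maintained under failure)."""
    baseline = measure()
    assert within_steady_state(baseline), "abort: system unhealthy before test"
    inject_failure()
    observed = measure()
    return within_steady_state(observed)

# Toy wiring: a "system" whose error rate rises slightly when a replica dies.
state = {"error_rate": 0.001}
hypothesis_held = run_experiment(
    measure=lambda: dict(state),
    inject_failure=lambda: state.update(error_rate=0.002),
    within_steady_state=lambda m: m["error_rate"] < 0.01,
)
# hypothesis_held is True → step 5: try a harder failure next time
```

If `hypothesis_held` comes back `False`, step 5 means fixing the system before re-running, not tightening the steady-state definition to make the test pass.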

Failure Injection Patterns

| Failure Type | Tools | What You Learn |
|---|---|---|
| Instance termination | Chaos Monkey, LitmusChaos | Auto-scaling, load balancer health checks |
| Network latency | tc (traffic control), Toxiproxy | Timeout handling, circuit breakers |
| Network partition | iptables, Chaos Mesh | Split-brain handling, consistency behavior |
| AZ/zone failure | AWS FIS, Gremlin | Multi-AZ resilience, data replication |
| Dependency failure | Toxiproxy, service mesh fault injection | Graceful degradation, fallback behavior |
| CPU/memory stress | stress-ng, Chaos Mesh | Resource limits, OOM handling, auto-scaling |
| DNS failure | Modify /etc/resolv.conf, DNS poisoning | DNS caching, fallback resolvers |
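Dependency-failure injection can also be done in-process before reaching for proxy-level tools. A sketch in the spirit of Toxiproxy but purely local: wrap a dependency call and inject latency or errors, then verify the caller's fallback path fires (all names here are illustrative):

```python
import random
import time

def with_faults(fn, error_rate=0.0, extra_latency_s=0.0, rng=random):
    """Wrap fn so calls may be delayed or fail with an injected error."""
    def wrapped(*args, **kwargs):
        if extra_latency_s:
            time.sleep(extra_latency_s)
        if rng.random() < error_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped

# Force every call to fail to exercise graceful degradation in the caller:
flaky_lookup = with_faults(lambda key: "value", error_rate=1.0)
try:
    result = flaky_lookup("user-42")
except ConnectionError:
    result = "fallback"   # the degradation path we wanted to see
```

The same wrapper with `extra_latency_s` set just above the caller's timeout is a cheap way to test timeout handling before running a full network-level experiment.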

Game Days

A game day is a planned chaos engineering exercise where the team intentionally breaks systems and practices incident response:

Game Day Plan:

Objective: Validate that the payment service handles database failover

Participants: SRE team, payment team, database team
Date: Tuesday, 2pm-4pm (low traffic window)

Failure to inject: Force failover of the primary database replica

Hypothesis: Payment service will:
  - Experience < 5 seconds of errors during failover
  - Automatically reconnect to the new primary
  - Suffer no data loss or corruption

Blast radius controls:
  - Only payment-staging environment (not production)
  - Rollback plan: manually promote old primary if failover fails
  - Abort trigger: error rate > 50% for > 30 seconds

Observation:
  - Monitor: payment error rate, latency, database connections
  - Record: timeline of events, actual behavior vs. hypothesis

Building organizational confidence: Start with non-production environments. Graduate to production during low-traffic windows. Eventually run experiments during business hours. The goal is to make chaos engineering routine, not scary.

SRE Organizational Practices

Production Readiness Review (PRR)

Before a service can go to production (or before SRE takes on operational responsibility), conduct a production readiness review:

# Production Readiness Review Checklist

## Reliability
- [ ] SLOs defined and instrumented
- [ ] Error budget policy agreed upon with product team
- [ ] Alert rules configured with runbooks
- [ ] On-call rotation established
- [ ] Incident response procedure documented

## Architecture
- [ ] No single points of failure
- [ ] Graceful degradation when dependencies fail
- [ ] Circuit breakers for external dependencies
- [ ] Timeouts configured for all external calls
- [ ] Rate limiting implemented

## Observability
- [ ] Structured logging to centralized system
- [ ] Metrics exported (request rate, latency, errors, saturation)
- [ ] Distributed tracing enabled
- [ ] Dashboards created for key metrics
- [ ] Health check endpoint (/health, /ready)

## Operations
- [ ] Deployment pipeline with automated rollback
- [ ] Canary or blue-green deployment strategy
- [ ] Rollback tested and documented
- [ ] Capacity plan documented
- [ ] Load tested at 2x expected peak

## Security
- [ ] No hardcoded secrets
- [ ] TLS for all external communication
- [ ] Authentication and authorization configured
- [ ] Dependencies scanned for vulnerabilities

## Data
- [ ] Backup strategy tested (including restore)
- [ ] Data retention policy defined
- [ ] GDPR/compliance requirements met

Service Tiers

Not all services require the same level of reliability. Classify services into tiers:

| Tier | Reliability Target | Monitoring | On-Call | Example |
|---|---|---|---|---|
| Tier 1 (Critical) | 99.99% | Real-time alerting, SLO-based | 24/7 dedicated rotation | Payment processing, authentication |
| Tier 2 (Important) | 99.9% | Alerting with business-hours response | Shared on-call rotation | Product catalog, user profiles |
| Tier 3 (Standard) | 99% | Monitoring dashboards, next-business-day | Best-effort | Internal tools, analytics pipelines |
| Tier 4 (Best-effort) | None | Basic monitoring | No on-call | Experimental features, internal prototypes |

Service tiers drive investment decisions: Tier 1 services get multi-region deployment, automated failover, and chaos engineering. Tier 4 services run on a single instance and accept periodic downtime.
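The reliability targets above translate directly into downtime budgets, which makes tier discussions concrete:

```python
def downtime_minutes_per_year(availability_percent: float) -> float:
    """Annual downtime budget implied by an availability target."""
    return (1 - availability_percent / 100) * 365 * 24 * 60

round(downtime_minutes_per_year(99.99), 1)  # Tier 1: ~52.6 minutes/year
round(downtime_minutes_per_year(99.9))      # Tier 2: ~526 minutes (~8.8 hours)
round(downtime_minutes_per_year(99.0))      # Tier 3: ~5256 minutes (~3.7 days)
```

Seeing that 99.99% leaves under an hour of downtime per year makes it obvious why Tier 1 justifies multi-region deployment while Tier 3's multi-day budget does not.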

Technical Debt Management

SRE teams often encounter technical debt that impacts reliability. Track reliability-related debt and prioritize it:

| Category | Example | Impact |
|---|---|---|
| Operational debt | Manual deploy process, no runbooks | Slower incident response, higher toil |
| Architectural debt | Single point of failure, monolithic database | Outage risk, scaling bottleneck |
| Observability debt | Missing metrics, no distributed tracing | Longer time to diagnose issues |
| Testing debt | No load tests, no chaos testing | Unknown failure modes |

Allocate a percentage of engineering capacity (20-30%) specifically for reliability and infrastructure improvements, separate from feature work. Without explicit allocation, reliability work is perpetually deprioritized until the next outage.