Site Reliability Engineering (SRE)¶
Site Reliability Engineering (SRE), pioneered by Google, is a discipline that applies software engineering principles to infrastructure and operations problems. It bridges the gap between development teams (who want to ship fast) and operations teams (who want stability). SRE treats operations work as a software problem—instead of manually managing systems, SREs write code to automate away operational toil.
While monitoring (Chapter 16) covers the observability stack, SRE is about the organizational practices, frameworks, and culture for running reliable systems at scale.
SRE vs. DevOps¶
| Aspect | SRE | DevOps |
|---|---|---|
| Origin | Google (2003) | Community movement (2008) |
| Focus | Reliability as a feature | Breaking silos between dev and ops |
| Approach | Prescriptive (concrete practices, metrics) | Philosophical (culture, automation, sharing) |
| Key metric | Error budget | Deployment frequency, lead time |
| Relationship | Implements DevOps principles with concrete practices | Supplies the principles that SRE implements |
Google's Ben Treynor Sloss: "SRE is what happens when you ask a software engineer to design an operations function."
A useful analogy: DevOps is an interface (a set of principles and values), while SRE is a concrete class that implements that interface. DevOps says "you should automate"; SRE says "you should spend no more than 50% of your time on toil, and here's how to measure and reduce it."
SRE Team Structures¶
| Model | Description | Pros | Cons |
|---|---|---|---|
| Embedded SRE | SREs are members of product teams | Deep product knowledge, tight collaboration | May lose SRE community, inconsistent practices |
| Centralized SRE | Dedicated SRE team supports multiple services | Consistent standards, shared tooling, career growth | Can become a bottleneck, less product context |
| Consulting SRE | SRE team advises product teams, doesn't own services | Scales knowledge broadly, product teams own reliability | Advice may be ignored, less operational depth |
| Platform Engineering | Builds self-service reliability tools and platforms | Scales to many teams, reduces per-team toil | Requires significant investment, can feel disconnected |
Most organizations start with centralized SRE and evolve toward a hybrid model: a central platform/SRE team builds shared tooling (CI/CD, observability, infrastructure), while embedded SREs or reliability-focused engineers within product teams apply those tools to their specific services.
SRE Engagement Model¶
SRE teams cannot support every service in an organization. A common engagement model:
- Self-serve tier: Product teams use SRE-provided tools and runbooks independently
- Consulting tier: SRE provides architecture reviews, production readiness reviews, and guidance
- Embedded tier: SRE is directly involved in operating the service (reserved for critical, high-complexity systems)
Criteria for full SRE engagement typically include: business criticality, service complexity, traffic volume, and the product team's willingness to follow SRE practices (SLOs, error budgets, postmortems).
Service Level Indicators, Objectives, and Agreements¶
The SLI → SLO → SLA hierarchy is the foundation of SRE. It transforms reliability from a vague goal ("make it reliable") into a measurable, actionable framework.
Relationship:
SLI (what you measure) → SLO (what you target) → SLA (what you promise)
Example:
SLI: Availability = successful requests / total requests
SLO: Availability >= 99.95% per month (internal target)
SLA: Availability >= 99.9% per month (customer contract — with credits if breached)
SLIs (Service Level Indicators)¶
Quantitative measures of service behavior. The most important principle: SLIs should reflect the user experience, not system internals. CPU utilization is not an SLI (users don't experience CPU); request latency is.
| SLI | Definition | Example | Measurement Point |
|---|---|---|---|
| Availability | Proportion of successful requests | 99.95% of requests return non-5xx responses | Load balancer access logs |
| Latency | Time to serve a request | 95th percentile response time < 200ms | Application instrumentation |
| Throughput | Rate of successful operations | > 10,000 requests/second sustained | Metrics aggregation |
| Error rate | Proportion of failed requests | < 0.1% of requests result in errors | Application error tracking |
| Freshness | How up-to-date data is | Search index updated within 5 minutes | Pipeline monitoring |
| Correctness | Proportion of correct responses | 99.99% of calculations return correct results | End-to-end validation |
| Durability | Proportion of data retained | 99.999999999% of objects stored are not lost | Storage system metrics |
SLI Specification vs. Implementation:
- Specification: What the SLI measures conceptually (e.g., "the proportion of valid requests served within 200ms")
- Implementation: How you actually measure it (e.g., "count of responses with status < 500 and duration < 200ms at the load balancer, divided by total request count")
The implementation matters because where you measure changes what you see. Measuring latency at the server misses network latency; measuring at the client includes the full user experience but is harder to collect. A common compromise: measure at the load balancer (captures server processing + internal network, but not client-side network).
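The specification/implementation split can be made concrete with a small sketch. Assuming request records with a status code and a server-side duration (hypothetical fields, not any specific log format), the two SLIs above reduce to counting:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # server processing time, as seen at the LB

def availability_sli(requests: list[Request]) -> float:
    """Proportion of responses with status < 500 (the spec above)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 200) -> float:
    """Proportion of requests served within the latency threshold."""
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)
```

Note that the implementation encodes the measurement point implicitly: if `duration_ms` comes from the load balancer, client-side network time is excluded, exactly as discussed above.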
Choosing SLIs by service type:
| Service Type | Primary SLIs |
|---|---|
| User-facing API | Availability, latency (p50, p95, p99), error rate |
| Data pipeline | Freshness (data staleness), correctness, throughput |
| Storage system | Availability, latency, durability |
| Streaming service | Start-up time, rebuffer rate, resolution quality |
| Batch processing | Completion time, success rate, data quality |
SLOs (Service Level Objectives)¶
Target values for SLIs that define "good enough" reliability. SLOs are internal commitments—they represent the reliability level that satisfies users without over-investing in reliability.
SLO Examples:
- "99.9% of API requests will succeed (non-5xx) measured over a rolling 30-day window"
- "95th percentile latency will be under 200ms measured over a rolling 7-day window"
- "99.99% of payment transactions will complete successfully measured monthly"
- "Data pipeline freshness: 99% of records available within 5 minutes of creation"
Setting SLOs:
- Too aggressive (99.999%) → team spends all time on reliability, can't ship features, infrastructure costs explode
- Too relaxed (99%) → users have poor experience, churn increases
- Target the level of reliability users actually need — consider: if you're 99.9% and your upstream dependency is 99%, your extra 0.9% doesn't matter to the user
The nines table — what each level of availability actually means:
| Availability | Downtime/month | Downtime/year | Typical For |
|---|---|---|---|
| 99% (two 9s) | 7.2 hours | 3.65 days | Internal tools, batch systems |
| 99.9% (three 9s) | 43.2 minutes | 8.76 hours | SaaS applications, APIs |
| 99.95% | 21.6 minutes | 4.38 hours | Business-critical services |
| 99.99% (four 9s) | 4.32 minutes | 52.56 minutes | Payment systems, core infrastructure |
| 99.999% (five 9s) | 26.3 seconds | 5.26 minutes | Telecom, medical systems |
Each additional nine is roughly 10x harder and more expensive to achieve. Going from 99.9% to 99.99% might require multi-region redundancy, automated failover, and eliminating all single points of failure. Going from 99.99% to 99.999% might require custom hardware, redundant providers, and near-zero planned maintenance.
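The table's downtime figures follow directly from the availability target; a small helper makes the conversion explicit (using a 30-day month of 43,200 minutes):

```python
def downtime_allowed(availability_pct: float,
                     period_minutes: float = 30 * 24 * 60) -> float:
    """Minutes of downtime allowed per period for an availability target.

    downtime = period × (1 - availability)
    Default period is a 30-day month (43,200 minutes).
    """
    return period_minutes * (1 - availability_pct / 100)
```

For example, `downtime_allowed(99.9)` gives the 43.2 minutes/month in the table, and `downtime_allowed(99.95)` gives 21.6 minutes/month.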
SLO-based alerting: Instead of alerting on symptoms (CPU > 90%), alert on SLO burn rate. The burn rate is how fast you're consuming your error budget:
Burn rate = (error rate observed) / (error rate allowed by SLO)
Example:
SLO: 99.9% availability (0.1% error budget per month)
Current error rate: 1%
Burn rate = 1% / 0.1% = 10x
At 10x burn rate, a month's error budget is consumed in 3 days.
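The burn-rate arithmetic above can be sketched directly (the SLO is passed as a fraction, e.g. 0.999 for 99.9%):

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being consumed."""
    allowed_error_rate = 1 - slo   # e.g. SLO 0.999 -> 0.1% allowed
    return observed_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, when is the window's budget gone?"""
    return window_days / rate
```

Reproducing the example: a 1% error rate against a 99.9% SLO is a 10x burn rate, exhausting a 30-day budget in 3 days.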
Multi-window, multi-burn-rate alerting (recommended by the SRE Workbook):
| Severity | Burn Rate | Long Window | Short Window | Action |
|---|---|---|---|---|
| Page (critical) | 14.4x | 1 hour | 5 minutes | Immediate response |
| Page (warning) | 6x | 6 hours | 30 minutes | Investigate soon |
| Ticket | 3x | 3 days | 6 hours | Schedule fix |
The short window confirms the long window isn't a brief spike that already resolved. Both windows must be breaching simultaneously to fire the alert.
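The both-windows rule is a simple conjunction. A minimal sketch (computing the per-window burn rates themselves is assumed to happen elsewhere, e.g. in your metrics backend):

```python
def should_page(long_window_burn: float, short_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Fire only if BOTH windows exceed the burn-rate threshold, so a
    brief spike that has already resolved doesn't page anyone."""
    return long_window_burn >= threshold and short_window_burn >= threshold
```

With the critical-page threshold of 14.4x, a sustained problem (both windows hot) pages; a resolved spike (long window still elevated, short window back to normal) does not.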
SLAs (Service Level Agreements)¶
External contracts with customers that include SLOs plus consequences (refunds, credits) if not met. SLAs should always be less aggressive than internal SLOs—you need a buffer.
| | SLO | SLA |
|---|---|---|
| Audience | Internal teams | External customers |
| Consequence of breach | Error budget policy (slow down releases) | Financial penalties (credits, refunds) |
| Aggressiveness | More aggressive (internal target) | More conservative (contractual promise) |
| Measurement | Fine-grained (per-request, rolling window) | Coarser (monthly, aggregated) |
Worked Example: E-Commerce Platform¶
Service: Product Catalog API
SLI (Availability):
Implementation: Count of responses with status < 500, divided by total
responses, measured at the ALB over a rolling 30-day window.
SLI (Latency):
Implementation: Proportion of responses where server processing time
(from ALB metrics) is < 200ms, over a rolling 30-day window.
SLO: 99.95% availability, 99% of requests served under 200ms.
Error Budget:
Availability: 0.05% = ~21.6 minutes of downtime per month
Latency: 1% of requests can exceed 200ms
SLA (to enterprise customers):
99.9% availability, measured monthly. If breached, 10% service credit.
(Note: SLA is 99.9% while SLO is 99.95% — 0.05% buffer)
Error Budgets¶
The error budget is the allowed amount of unreliability. It's derived from the SLO:
Error Budget = 100% - SLO
Example:
SLO = 99.9% availability per month
Error Budget = 0.1% = ~43.2 minutes of downtime per month
At 1 million requests/day, that's ~1,000 failed requests/day allowed
The error budget transforms the dev/ops tension from a cultural conflict into a data-driven discussion:
- Budget remaining: Development velocity is prioritized. Ship features, take risks.
- Budget exhausted: Reliability is prioritized. Freeze feature launches, focus on stability, automation, and reducing technical debt.
This creates a self-regulating system: if a team ships a buggy release that causes an outage, they've consumed error budget and must slow down to improve reliability. If the system is ultra-stable, they have budget to take more risks.
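The budget derivation is simple enough to express directly (SLO given as a fraction, e.g. 0.999 for 99.9%):

```python
def error_budget_fraction(slo: float) -> float:
    """Error budget as a fraction: 100% - SLO."""
    return 1 - slo

def allowed_failures(slo: float, total_requests: int) -> int:
    """How many failed requests the budget permits."""
    return int(total_requests * error_budget_fraction(slo))
```

Reproducing the example: a 99.9% SLO at 1 million requests/day allows roughly 1,000 failed requests per day.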
Error Budget Calculation Examples¶
Availability-based:
SLO: 99.95% over 30 days
Budget: 0.05% of 30 days = 0.05% × 43,200 minutes = 21.6 minutes
If an outage lasted 15 minutes:
Remaining budget = 21.6 - 15 = 6.6 minutes (69% consumed)
⚠️ Need to be cautious for the rest of the month
Latency-based:
SLO: 99% of requests under 200ms over 7 days
Budget: 1% of requests can exceed 200ms
If total requests this week = 10,000,000:
Budget = 100,000 slow requests allowed
Current slow requests = 75,000
Remaining budget = 25,000 (75% consumed)
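Both worked examples reduce to the same two operations, whether the budget is measured in minutes of downtime or in slow requests:

```python
def budget_consumed(used: float, budget: float) -> float:
    """Fraction of the error budget consumed (0.0 to 1.0)."""
    return used / budget

def remaining_budget(budget: float, used: float) -> float:
    """How much budget is left, in the same unit as the inputs."""
    return budget - used
```

The availability example: a 15-minute outage against a 21.6-minute budget is ~69% consumed, with 6.6 minutes remaining. The latency example: 75,000 slow requests against a 100,000-request budget is 75% consumed.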
Error Budget Policy¶
An error budget policy formalizes what happens at different budget consumption levels. It must be agreed upon by both product and SRE leadership before incidents happen—not during.
Error Budget Policy:
> 50% remaining: Full speed ahead. Ship features, experiment.
Standard code review and testing.
25-50% remaining: Increased caution. Additional testing required.
Post-deploy monitoring for 30 minutes.
No deploys on Fridays.
10-25% remaining: Slow down significantly.
Only well-tested, low-risk changes.
Require SRE approval for deploys.
Begin reliability improvement work.
< 10% remaining: Feature freeze.
Only reliability improvements and critical bug fixes.
Incident review if not already done.
Exhausted (0%): Full freeze on all non-reliability changes.
All engineering effort goes to reliability.
Executive escalation.
Post-freeze review required before resuming feature work.
Error budget negotiation: When product teams push back on budget-driven slowdowns, point to the data. The error budget isn't arbitrary—it's derived from the SLO, which is set based on user needs. If the team doesn't want the slowdown, they have two options: accept a less aggressive SLO (more error budget), or invest in reliability to reduce budget consumption.
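A policy like this is usually encoded so deployment tooling can enforce it automatically. A sketch using the sample tiers above (the thresholds come from the example policy, not any standard):

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction of the total) to the
    policy tier from the sample error budget policy above."""
    if budget_remaining > 0.50:
        return "full speed"
    if budget_remaining > 0.25:
        return "increased caution"
    if budget_remaining > 0.10:
        return "slow down"
    if budget_remaining > 0.0:
        return "feature freeze"
    return "full freeze"
```

A CI/CD pipeline could call this before each deploy and, for example, require an extra approval in the "slow down" tier or block the deploy entirely in a freeze.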
Toil¶
Toil is the kind of work tied to running a production service that is manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly with service growth.
The defining characteristics:
- Manual: A human must perform it (not automated)
- Repetitive: Done over and over, not a one-time task
- Automatable: Could be done by a script or system
- Tactical: Reactive, interrupt-driven, not strategic
- No enduring value: Doesn't permanently improve the service
- Scales with service growth: Doubling traffic doubles the work
Examples of toil:
- Manually restarting failed services
- Manually provisioning user accounts
- Responding to routine alerts that always require the same fix
- Manually scaling infrastructure before expected traffic spikes
- Copy-pasting configuration between environments
- Manually running database migrations
- Manually reviewing and approving routine certificate renewals
Not toil (necessary overhead): meetings, on-call rotations, architectural design, strategic planning, writing postmortems, mentoring.
Google's target: SREs should spend no more than 50% of their time on toil. The remaining 50% should be spent on engineering projects that reduce future toil. If toil exceeds 50%, the team is understaffed or under-automated.
Measuring Toil¶
Track toil by having team members categorize their work:
| Category | Definition | Target |
|---|---|---|
| Toil | Manual, repetitive, automatable operational work | ≤ 50% |
| Engineering | Software development to improve reliability, tools, automation | ≥ 50% |
| Overhead | Meetings, planning, admin, hiring | Track but don't over-optimize |
Measure weekly with simple time tracking. Trend over quarters—if toil percentage is increasing, the team needs to prioritize automation projects.
The Automation Ladder¶
Toil reduction progresses through stages:
Level 0: Manual — Human does everything manually
Level 1: Documented — Runbook exists, human follows steps
Level 2: Scripted — Script performs the task, human triggers it
Level 3: Automated — System triggers script based on conditions
Level 4: Self-service — Users trigger automation themselves (no SRE needed)
Level 5: Self-healing — System detects and fixes issues automatically
Example progression for "scaling a service":
- Level 0: SSH into server, manually add instances
- Level 1: Runbook: "Run terraform apply with instance_count=N"
- Level 2: Script: ./scale-service.sh --count=10
- Level 3: Auto-scaling: Kubernetes HPA scales based on CPU/memory
- Level 4: Self-service: Product team adjusts scaling policy via internal platform UI
- Level 5: Predictive scaling: ML model forecasts traffic and pre-scales
Incident Management¶
A structured approach to handling production incidents. The goal is to restore service first, investigate root causes later.
Incident Severity Levels¶
| Level | Name | Description | Response Time | Response |
|---|---|---|---|---|
| SEV1 | Critical | Complete service outage affecting all users | < 5 minutes | All hands, exec notification, war room, status page |
| SEV2 | Major | Significant degradation, partial outage, data at risk | < 15 minutes | On-call team + escalation, status page update |
| SEV3 | Minor | Minor impact, workaround available | < 1 hour | On-call handles, next business day follow-up |
| SEV4 | Low | Cosmetic, minimal user impact | Next business day | Track in backlog |
Incident Response Process¶
1. DETECT → Automated alerts (preferred) or user reports identify the issue
Goal: Detect before users notice (proactive monitoring)
2. TRIAGE → Assess severity, assign Incident Commander (IC)
Ask: Who is affected? How many? Is data at risk?
Declare incident in Slack/Teams channel
3. MITIGATE → Restore service ASAP (rollback, scale up, failover, feature flag off)
Priority: Mitigate first, root-cause later
"Can we make the bleeding stop, even if we don't know why it's bleeding?"
4. RESOLVE → Fix the underlying issue (deploy fix, repair data, etc.)
Only after service is restored
5. FOLLOW-UP → Postmortem within 48 hours (for SEV1/SEV2)
Track action items to completion
Incident Roles¶
| Role | Responsibilities |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks, manages communication cadence, determines severity, decides when incident is resolved |
| Operations Lead | Executes technical mitigation (rollbacks, infrastructure changes, debugging) |
| Communications Lead | Updates status page, notifies stakeholders, manages internal/external comms, records timeline |
| Subject Matter Experts (SMEs) | Pulled in as needed for specific subsystems (database expert, network expert, etc.) |
IC responsibilities in detail:
- Maintain a clear picture of the current situation
- Assign tasks to specific individuals (never "someone should...")
- Set check-in cadence ("Let's sync every 15 minutes")
- Manage scope and prevent rabbit holes ("That's a good investigation but let's focus on mitigation first")
- Make the call on risky mitigations ("Yes, let's roll back even though we'll lose the last hour of data")
- Declare the incident resolved and schedule the postmortem
Incident Communication¶
Internal communication template (posted every 15-30 minutes in the incident channel):
## Incident Update — [SERVICE] — SEV[X] — [HH:MM UTC]
**Status**: Investigating / Identified / Mitigating / Resolved
**Impact**: [Description of user-facing impact]
**IC**: @person
**Current actions**:
- @alice is investigating database connection pool exhaustion
- @bob is preparing a rollback of the 14:00 deploy
**Next update**: [HH:MM UTC]
External communication (status page update):
[14:05 UTC] Investigating — We are investigating elevated error rates
for the API. Some users may experience timeouts.
[14:20 UTC] Identified — A database migration has been identified as
the cause. We are rolling back the change.
[14:30 UTC] Monitoring — The rollback is complete and error rates are
returning to normal. We are monitoring for stability.
[15:00 UTC] Resolved — This incident has been resolved. All services
are operating normally. A postmortem will be published
within 48 hours.
Incident Walkthrough Example¶
13:55 UTC — Monitoring alert fires: "API error rate > 5% for 5 minutes"
Alert routes to PagerDuty → on-call engineer Alice's phone
14:00 UTC — Alice acknowledges. Opens laptop, checks dashboard.
Error rate is 12% and rising. Creates #incident-api-errors Slack channel.
Posts: "SEV2 incident declared. I'm IC. API error rate at 12%."
14:03 UTC — Alice checks recent deploys: deployment at 13:50 by Bob (PR #1234).
Pulls in Bob as SME.
14:07 UTC — Bob confirms: "The deploy added a new database migration that adds
a NOT NULL column without a default value. All INSERTs are failing."
14:10 UTC — Alice makes the call: "Let's rollback the deploy immediately.
Bob, please initiate rollback."
14:12 UTC — Bob runs: kubectl rollout undo deployment/api-server
Migration rollback: applies down migration via CI/CD
14:18 UTC — Error rate drops from 12% to 0.3% (normal baseline).
Alice posts update: "Mitigation successful. Error rates returning
to normal. Monitoring for 30 minutes before declaring resolved."
14:50 UTC — Alice declares incident resolved.
Schedules postmortem for tomorrow 10am.
Error budget consumed: ~2.2 minutes of equivalent downtime
(18 minutes × 12% error rate).
Blameless Postmortems¶
After every significant incident, conduct a blameless postmortem: a structured review focused on what happened and how to prevent recurrence, not on who caused it. The fundamental belief is that people don't cause incidents—systems that allow single human errors to cause outages are poorly designed.
Blameless Culture¶
Blameless does not mean "accountable-less". People are still accountable for learning and improving. The distinction:
- Blaming: "Bob deployed bad code and caused the outage" → People hide mistakes
- Blameless: "The deployment pipeline didn't catch the migration issue" → People share freely, system improves
If people fear punishment, they'll hide information. If information is hidden, you can't identify systemic issues. If you can't identify systemic issues, they'll recur.
Facilitating a Postmortem¶
- Who attends: IC, operations lead, communications lead, relevant SMEs, engineering manager. Optionally: anyone who wants to learn.
- When: Within 48 hours of the incident (while memories are fresh)
- Duration: 30-60 minutes
- Facilitator: Someone not directly involved in the incident (reduces bias)
Facilitation tips:
- Start by reviewing the timeline together
- Ask "what" and "how" questions, not "why didn't you..." questions
- Rephrase blame: "Why didn't Alice check the migration?" → "What in our process could have caught this migration issue?"
- Ensure all perspectives are heard (the junior engineer who noticed something may have critical context)
- Focus on systemic improvements, not individual behavior
Root Cause Analysis Techniques¶
5 Whys:
Why did the API return errors?
→ Because the database INSERT queries were failing.
Why were INSERT queries failing?
→ Because a new NOT NULL column was added without a default value.
Why was there no default value?
→ Because the migration wasn't tested against production-like data.
Why wasn't the migration tested against production-like data?
→ Because our CI pipeline doesn't include migration testing.
Why doesn't our CI pipeline include migration testing?
→ Because migration testing was never added as a requirement.
Action item: Add migration testing to CI pipeline.
Note: "5 Whys" is a starting point, not a rigid formula. Some root causes need 3 whys, some need 7. And most incidents have multiple contributing factors, not a single root cause.
Contributing factors model (preferred over single root cause): Instead of finding "the" root cause, identify all factors that contributed:
Contributing factors for the API outage:
1. Migration lacked a default value (direct cause)
2. CI pipeline doesn't test migrations (detection gap)
3. No canary deployment for database changes (rollout gap)
4. On-call engineer took 5 minutes to acknowledge (response gap)
5. Rollback procedure wasn't documented (knowledge gap)
Each factor gets its own action item.
Postmortem Template¶
# Incident Postmortem: [Title]
## Summary
Brief description of what happened (2-3 sentences).
## Impact
- Duration: X hours Y minutes
- Users affected: N (or percentage)
- Revenue impact: $X (if applicable)
- Error budget consumed: X%
- Data lost/corrupted: Y records (if applicable)
## Timeline (all times UTC)
- 14:00 — Monitoring alert fires for elevated 5xx rates
- 14:05 — On-call engineer acknowledged, begins investigation
- 14:10 — Recent deploy identified as potential cause
- 14:15 — Root cause identified: bad database migration
- 14:20 — Mitigation: rolled back database migration
- 14:25 — Service restored, error rates return to normal
- 14:50 — Incident declared resolved after monitoring period
## Root Cause
The database migration (PR #1234) added a column with a NOT NULL constraint
without a default value, causing INSERT failures for all new records.
## Contributing Factors
- Migration was not tested against production-like data
- CI pipeline doesn't run migration tests against a full dataset
- No canary deployment for database migrations
- Rollback procedure for migrations was not documented
## What Went Well
- Fast detection (5 min from deploy to alert)
- Fast mitigation (20 min total incident duration)
- Clear incident communication
## What Went Poorly
- No pre-production migration testing
- On-call engineer unfamiliar with migration rollback procedure
- No automated rollback trigger for SLO violations
## Where We Got Lucky
- Low-traffic window reduced user impact
- The migration was reversible (some aren't)
## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add migration testing to CI pipeline | @alice | P1 | 2025-02-15 | TODO |
| Implement canary process for DB migrations | @bob | P1 | 2025-03-01 | TODO |
| Add runbook for migration rollbacks | @carol | P2 | 2025-02-28 | TODO |
| Investigate automated rollback on SLO breach | @dave | P2 | 2025-03-15 | TODO |
## Lessons Learned
[Open discussion points and takeaways]
Action item tracking: The postmortem is only valuable if action items are completed. Assign a specific owner, priority, and due date for each item. Track completion in your project management tool. Review outstanding postmortem action items weekly.
On-Call Practices¶
On-call is a fundamental part of SRE. Best practices ensure it's sustainable, effective, and doesn't burn out engineers.
Rotation Design¶
| Model | Description | Pros | Cons |
|---|---|---|---|
| Weekly | One person on-call for a full week | Predictable schedule, context continuity | Long stretches can be tiring |
| Biweekly | Alternating weeks | More time off between rotations | Less frequent, may lose context |
| Follow-the-sun | Hand off between time zones (US → EU → APAC) | No overnight pages | Requires global team, handoff complexity |
| Primary + Secondary | Primary responds first; secondary is backup | Redundancy, load sharing, mentoring | Requires more people in rotation |
Best practices:
- Minimum 8 people in rotation (for weekly rotations: each person is on-call ~6 weeks/year)
- Maximum 2 incidents per on-call shift (target)
- Compensate on-call with pay or time off
- Primary hands off to secondary if they've been engaged for more than 8 hours
- Require at least one business day between being primary and secondary
Alert Quality¶
Alert fatigue is the enemy of effective on-call. If engineers are paged too frequently or for non-critical issues, they become desensitized and may miss critical alerts.
Every alert should pass the "Is this actionable?" test:
| Alert Quality | Criteria | What to Do |
|---|---|---|
| Good alert | Actionable, time-sensitive, affects users | Keep as page |
| Informational | Useful to know but not time-sensitive | Demote to ticket or dashboard |
| Noisy | Fires frequently, no action needed | Delete or tune threshold |
| Symptom-based | Based on user-visible symptoms (error rate, latency) | Preferred approach |
| Cause-based | Based on internal state (CPU, memory, disk) | Supplement, not primary alerts |
Alert hygiene metrics:
- Pages per on-call shift: target ≤ 2 (not counting false positives)
- False positive rate: target < 5% (pages that required no action)
- Time to acknowledge: < 5 minutes
- Time to mitigate: track p50 and p95
- Proportion of pages outside business hours: should be roughly proportional to the traffic pattern
Runbooks¶
Every alert should link to a runbook. Runbooks are step-by-step procedures for diagnosing and resolving common issues:
# Runbook: API Error Rate > 5%
## Alert
API error rate exceeds 5% for 5 minutes (SEV2)
## Quick Diagnosis
1. Check recent deploys: `kubectl rollout history deployment/api-server`
- If recent deploy: consider rollback (Step A)
2. Check database connectivity: `curl http://api-server:8080/health/db`
- If database down: see Database Runbook
3. Check upstream dependencies: dashboard link [here]
- If dependency is down: enable circuit breaker (Step B)
4. Check for traffic spike: Grafana dashboard [here]
- If traffic spike: scale up (Step C)
## Step A: Rollback Recent Deploy
kubectl rollout undo deployment/api-server
# Wait 2 minutes, verify error rate drops
## Step B: Enable Circuit Breaker
kubectl set env deployment/api-server CIRCUIT_BREAKER_ENABLED=true
# This returns cached/default responses when dependency is down
## Step C: Scale Up
kubectl scale deployment/api-server --replicas=10
# Normal is 3 replicas; 10 handles 3x traffic
## Escalation
If none of the above resolves the issue within 15 minutes:
- Page the API team lead: @alice (PagerDuty)
- Page the database on-call: @bob (PagerDuty)
Capacity Planning¶
Capacity planning ensures your system can handle expected demand with sufficient headroom for unexpected spikes.
Demand Forecasting¶
| Source | Predictability | Example |
|---|---|---|
| Organic growth | Medium (trend-based) | 10% monthly user growth → capacity needs grow proportionally |
| Seasonal patterns | High (historical data) | Black Friday traffic 5x normal, end-of-quarter reporting spikes |
| Launch events | Medium (planned but uncertain magnitude) | New feature launch, marketing campaign, TV ad |
| Viral events | Low (unpredictable) | App goes viral on social media, breaking news |
Capacity Modeling¶
Little's Law (fundamental relationship for queuing systems):
L = λ × W
L = average number of items in the system (concurrent requests)
λ = average arrival rate (requests per second)
W = average time in the system (latency)
Example:
λ = 1,000 requests/sec
W = 200ms = 0.2s
L = 1,000 × 0.2 = 200 concurrent requests
If each server handles 50 concurrent requests, you need ≥ 4 servers.
With 50% headroom: 6 servers.
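The worked example generalizes to a small sizing helper (a sketch; per-server concurrency and the headroom factor are inputs you must measure and choose for your own system):

```python
import math

def servers_needed(arrival_rate: float, latency_s: float,
                   per_server_concurrency: int,
                   headroom: float = 0.5) -> int:
    """Size a fleet with Little's Law: L = lambda x W, plus headroom.

    arrival_rate: requests per second (lambda)
    latency_s: average time in system, in seconds (W)
    per_server_concurrency: concurrent requests one server handles
    headroom: extra capacity fraction (0.5 = 50% above the base need)
    """
    concurrent = arrival_rate * latency_s                       # L = λ × W
    base = math.ceil(concurrent / per_server_concurrency)
    return math.ceil(base * (1 + headroom))
```

Reproducing the example: 1,000 req/s at 200ms latency is 200 concurrent requests, needing 4 servers at 50 concurrent requests each, or 6 with 50% headroom.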
Capacity planning checklist:
1. Measure current usage: CPU, memory, network, connections, request rate
2. Identify the bottleneck: What resource saturates first?
3. Model growth: Linear, exponential, or stepped (based on business plans)
4. Add headroom: Plan for at least 50% above expected peak
5. Plan for failures: If one AZ goes down, can the remaining AZs handle the load?
6. Set alerts: Alert at 70% capacity to trigger scaling or procurement
Load Testing for Capacity¶
Load test in a production-like environment to find actual capacity limits:
Load Test Strategy:
1. Baseline test: Normal traffic level for 30 minutes (establish metrics)
2. Ramp test: Gradually increase to 2x normal over 30 minutes
3. Stress test: Continue ramping until failure (find the breaking point)
4. Soak test: Run at 1.5x normal for 24 hours (find memory leaks, connection leaks)
5. Spike test: Sudden burst to 5x normal (test auto-scaling response time)
Release Engineering¶
Release engineering is the practice of building and deploying software reliably and safely.
Progressive Rollouts¶
| Strategy | Description | Detection Time | Blast Radius |
|---|---|---|---|
| Canary | Deploy to 1-5% of instances, monitor, gradually increase | Minutes to hours | Small (1-5% of traffic) |
| Blue-Green | Deploy to idle environment, switch traffic all at once | Immediate | 100% if not caught in staging |
| Rolling update | Replace instances one at a time | Moderate | Grows over time |
| Feature flags | Code deployed everywhere but gated behind flag | Immediate (toggle off) | Controlled (targeted users) |
Canary deployment flow:
1. Deploy v2 to 1% of instances
2. Monitor for 10 minutes: error rate, latency, business metrics
3. If metrics are healthy: increase to 5%
4. Monitor for 30 minutes
5. If healthy: increase to 25% → 50% → 100%
6. If unhealthy at any stage: roll back to 0%, investigate
Automatic rollback trigger:
- Error rate increases by > 0.1% compared to baseline
- p95 latency increases by > 50ms compared to baseline
- Any custom business metric degrades
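The rollback triggers above can be encoded as a comparison of canary metrics against the baseline (the metric names and dict shape are illustrative assumptions, not a specific tool's API):

```python
def should_rollback(baseline: dict, canary: dict) -> bool:
    """Apply the automatic rollback triggers: error rate up by more
    than 0.1% or p95 latency up by more than 50ms vs. baseline."""
    if canary["error_rate"] - baseline["error_rate"] > 0.001:
        return True
    if canary["p95_latency_ms"] - baseline["p95_latency_ms"] > 50:
        return True
    return False
```

In practice a canary controller (Argo Rollouts, Flagger, or similar) evaluates checks like this at each stage of the 1% → 5% → 25% → 50% → 100% progression.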
Feature Flags¶
Feature flags decouple deployment from release. Code is deployed to all instances but new features are toggled on/off without a deploy:
# Feature flag usage
if feature_flags.is_enabled("new-checkout-flow", user_id=user.id):
return new_checkout_flow(cart)
else:
return legacy_checkout_flow(cart)
Feature flag lifecycle:
1. Development: Flag created, defaults to off
2. Testing: Enabled for internal users and the QA team
3. Canary: Enabled for 1% of production users
4. Rollout: Gradually increase to 100%
5. Cleanup: Remove the flag and the old code path (critical; otherwise it becomes tech debt)
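The gradual-rollout stages are commonly implemented with deterministic bucketing, so a given user stays in (or out of) the rollout as the percentage grows rather than flickering between code paths. A sketch, assuming SHA-256 hashing as the bucketing function (one common approach, not any specific flag library's implementation):

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically bucket a user into a flag's rollout percentage.

    The same (flag, user) pair always hashes to the same bucket in
    [0, 100), so raising `percent` only ever adds users to the rollout.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent
```

Hashing the flag name together with the user ID means different flags roll out to different (uncorrelated) user subsets at the same percentage.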
Change Management¶
Change freezes: During high-risk periods (Black Friday, year-end processing), restrict changes to emergency-only. Define clearly what qualifies as an emergency.
Deploy windows: Some teams restrict deploys to specific hours (e.g., 9am-3pm, no Fridays). This ensures experienced staff are available if issues arise. Counter-argument: smaller, more frequent deploys are safer than large, batched deploys.
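A deploy-window policy like the example above reduces to a small predicate in the deployment pipeline; the hours encoded here are just the illustrative ones from the text:

```python
from datetime import datetime

# Hypothetical deploy-window policy: weekdays 9am-3pm, never on Fridays.

def in_deploy_window(now: datetime) -> bool:
    if now.weekday() >= 4:     # 4 = Friday, 5/6 = weekend
        return False
    return 9 <= now.hour < 15  # 9:00-14:59 local time

print(in_deploy_window(datetime(2024, 3, 5, 10, 30)))  # Tuesday 10:30 -> True
print(in_deploy_window(datetime(2024, 3, 8, 10, 30)))  # Friday -> False
```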
Chaos Engineering¶
Chaos engineering is the practice of intentionally injecting failures into production systems to test resilience and discover weaknesses before they cause real outages.
Principles¶
- Start with a hypothesis: "Our system should handle the loss of one database replica without user-visible impact"
- Define steady state: Normal metrics (error rate, latency, throughput)
- Inject failure: Kill the replica
- Observe: Did the system maintain steady state? How long did recovery take?
- Learn: If the hypothesis failed, fix the system. If it passed, try a harder failure.
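The principles above can be sketched as an experiment skeleton. The metric source and injection hooks here are stand-ins; in practice they would call your monitoring API and a chaos tool such as Chaos Mesh:

```python
# Skeleton of a chaos experiment: measure steady state, inject a
# failure, observe, restore, and check whether the hypothesis held.

def run_experiment(get_error_rate, inject_failure, restore, threshold=0.01):
    """Return True if steady state held during the injected failure."""
    baseline = get_error_rate()            # 1. define steady state
    inject_failure()                       # 2. inject the failure
    try:
        during = get_error_rate()          # 3. observe
    finally:
        restore()                          # always clean up
    return during - baseline <= threshold  # 4. did the hypothesis hold?

# Simulated run: killing a replica bumps error rate only slightly.
state = {"replica_down": False}
error_rate = lambda: 0.008 if state["replica_down"] else 0.002
held = run_experiment(
    get_error_rate=error_rate,
    inject_failure=lambda: state.update(replica_down=True),
    restore=lambda: state.update(replica_down=False),
)
print(held)  # True: 0.008 - 0.002 is within the 0.01 threshold
```

The `finally` block matters: an experiment that fails to clean up after itself becomes the outage it was trying to prevent.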
Failure Injection Patterns¶
| Failure Type | Tools | What You Learn |
|---|---|---|
| Instance termination | Chaos Monkey, LitmusChaos | Auto-scaling, load balancer health checks |
| Network latency | tc (traffic control), Toxiproxy | Timeout handling, circuit breakers |
| Network partition | iptables, Chaos Mesh | Split-brain handling, consistency behavior |
| AZ/zone failure | AWS FIS, Gremlin | Multi-AZ resilience, data replication |
| Dependency failure | Toxiproxy, service mesh fault injection | Graceful degradation, fallback behavior |
| CPU/memory stress | stress-ng, Chaos Mesh | Resource limits, OOM handling, auto-scaling |
| DNS failure | Modify /etc/resolv.conf, DNS poisoning | DNS caching, fallback resolvers |
Game Days¶
A game day is a planned chaos engineering exercise where the team intentionally breaks systems and practices incident response:
```
Game Day Plan:

Objective: Validate that the payment service handles database failover
Participants: SRE team, payment team, database team
Date: Tuesday, 2pm-4pm (low traffic window)
Failure to inject: Force failover of the primary database replica

Hypothesis: Payment service will:
  - Experience < 5 seconds of errors during failover
  - Automatically reconnect to the new primary
  - Suffer no data loss or corruption

Blast radius controls:
  - Only payment-staging environment (not production)
  - Rollback plan: manually promote old primary if failover fails
  - Abort trigger: error rate > 50% for > 30 seconds

Observation:
  - Monitor: payment error rate, latency, database connections
  - Record: timeline of events, actual behavior vs. hypothesis
```
Building organizational confidence: Start with non-production environments. Graduate to production during low-traffic windows. Eventually run experiments during business hours. The goal is to make chaos engineering routine, not scary.
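An abort trigger like the one in the plan above (error rate above 50% sustained for 30 seconds or more) can be evaluated over a stream of metric samples. The sample format here is illustrative:

```python
# Evaluate an abort trigger over (seconds, error_rate) samples.
# The breach must be continuous: a single healthy sample resets it.

def should_abort(samples, threshold=0.5, duration_s=30):
    """True if error rate stays above threshold for duration_s or longer."""
    breach_started = None
    for t, rate in samples:
        if rate > threshold:
            if breach_started is None:
                breach_started = t           # breach begins
            if t - breach_started >= duration_s:
                return True                  # sustained breach: abort
        else:
            breach_started = None            # healthy sample resets it
    return False

samples = [(0, 0.1), (10, 0.8), (20, 0.9), (30, 0.7), (40, 0.85)]
print(should_abort(samples))  # True: above 50% from t=10 through t=40
```

Requiring a sustained breach rather than a single bad sample keeps the game day from aborting on a transient blip, while still bounding how long users (or staging tests) are exposed.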
SRE Organizational Practices¶
Production Readiness Review (PRR)¶
Before a service can go to production (or before SRE takes on operational responsibility), conduct a production readiness review:
```markdown
# Production Readiness Review Checklist

## Reliability
- [ ] SLOs defined and instrumented
- [ ] Error budget policy agreed upon with product team
- [ ] Alert rules configured with runbooks
- [ ] On-call rotation established
- [ ] Incident response procedure documented

## Architecture
- [ ] No single points of failure
- [ ] Graceful degradation when dependencies fail
- [ ] Circuit breakers for external dependencies
- [ ] Timeouts configured for all external calls
- [ ] Rate limiting implemented

## Observability
- [ ] Structured logging to centralized system
- [ ] Metrics exported (request rate, latency, errors, saturation)
- [ ] Distributed tracing enabled
- [ ] Dashboards created for key metrics
- [ ] Health check endpoint (/health, /ready)

## Operations
- [ ] Deployment pipeline with automated rollback
- [ ] Canary or blue-green deployment strategy
- [ ] Rollback tested and documented
- [ ] Capacity plan documented
- [ ] Load tested at 2x expected peak

## Security
- [ ] No hardcoded secrets
- [ ] TLS for all external communication
- [ ] Authentication and authorization configured
- [ ] Dependencies scanned for vulnerabilities

## Data
- [ ] Backup strategy tested (including restore)
- [ ] Data retention policy defined
- [ ] GDPR/compliance requirements met
```
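One checklist item worth unpacking is the health check endpoint. A minimal standard-library sketch, separating liveness (`/health`: is the process alive?) from readiness (`/ready`: can it serve traffic?), since orchestrators probe these differently:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative liveness/readiness endpoints. An orchestrator restarts
# the process on failed liveness, but only stops routing traffic on
# failed readiness (e.g. while warming caches or draining connections).

READY = {"ok": True}  # flip to False while warming up or draining

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self._respond(200, b"ok")
        elif self.path == "/ready":
            if READY["ok"]:
                self._respond(200, b"ready")
            else:
                self._respond(503, b"not ready")
        else:
            self._respond(404, b"not found")

    def _respond(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs

# To run standalone:
#   HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Keeping the two endpoints separate prevents a common failure mode: a service that reports itself "healthy" while still unable to serve requests, causing load balancers to send traffic into errors.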
Service Tiers¶
Not all services require the same level of reliability. Classify services into tiers:
| Tier | Reliability Target | Monitoring | On-Call | Example |
|---|---|---|---|---|
| Tier 1 (Critical) | 99.99% | Real-time alerting, SLO-based | 24/7 dedicated rotation | Payment processing, authentication |
| Tier 2 (Important) | 99.9% | Alerting with business-hours response | Shared on-call rotation | Product catalog, user profiles |
| Tier 3 (Standard) | 99% | Monitoring dashboards, next-business-day | Best-effort | Internal tools, analytics pipelines |
| Tier 4 (Best-effort) | None | Basic monitoring | No on-call | Experimental features, internal prototypes |
Service tiers drive investment decisions: Tier 1 services get multi-region deployment, automated failover, and chaos engineering. Tier 4 services run on a single instance and accept periodic downtime.
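The tier targets translate directly into downtime budgets, which makes the cost of each extra "nine" concrete. A quick calculation over a 30-day month:

```python
# Convert tier reliability targets into allowed downtime per 30-day month.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

tiers = {"Tier 1": 99.99, "Tier 2": 99.9, "Tier 3": 99.0}
for tier, target in tiers.items():
    budget = MINUTES_PER_MONTH * (1 - target / 100)
    print(f"{tier} ({target}%): {budget:.1f} minutes of downtime/month")
```

Tier 1's 99.99% allows roughly 4.3 minutes of downtime per month, which is why it demands automated failover: a human paged at 3am cannot respond that fast. Tier 3's 99% allows over 7 hours, comfortably within next-business-day response.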
Technical Debt Management¶
SRE teams often encounter technical debt that impacts reliability. Track reliability-related debt and prioritize it:
| Category | Example | Impact |
|---|---|---|
| Operational debt | Manual deploy process, no runbooks | Slower incident response, higher toil |
| Architectural debt | Single point of failure, monolithic database | Outage risk, scaling bottleneck |
| Observability debt | Missing metrics, no distributed tracing | Longer time to diagnose issues |
| Testing debt | No load tests, no chaos testing | Unknown failure modes |
Allocate a percentage of engineering capacity (20-30%) specifically for reliability and infrastructure improvements, separate from feature work. Without explicit allocation, reliability work is perpetually deprioritized until the next outage.