System Design¶
System design is the process of conceptualizing and planning a software system's architecture, components, modules, interfaces, and data to satisfy specified requirements. It serves as a blueprint for software development, translating user requirements into a detailed plan for implementation. Design is the phase of the SDLC (Software Development Life Cycle) that shifts the focus from the problem domain to the solution domain: it specifies how to fulfill the requirements captured in the SRS (Software Requirements Specification).
System Design Process¶
Before diving into technical aspects, understanding the systematic approach to system design is crucial:
1. Requirements Gathering¶
Functional Requirements: What the system should do
- User stories and use cases
- Core features and capabilities
- Business rules and workflows
- Data processing requirements
Non-Functional Requirements (NFRs): How the system should perform
- Performance: Response time (< 200ms for API calls), throughput (10,000 RPS)
- Scalability: Handle 10x traffic growth over 3 years
- Availability: 99.99% uptime (52 minutes downtime/year)
- Reliability: Data durability of 99.999999999% (11 nines)
- Security: Encryption at rest and in transit, compliance requirements
- Maintainability: Modular design, clear documentation
2. Back-of-the-Envelope Estimation¶
Critical for understanding scale requirements:
Traffic Estimation:
- Daily Active Users (DAU): 10 million
- Average requests per user per day: 20
- Total daily requests: 200 million
- Requests per second (RPS): 200M / 86,400 ≈ 2,300 RPS
- Peak RPS (assume 3× average): ~7,000 RPS
Storage Estimation:
- New records per day: 1 million
- Average record size: 1 KB
- Daily storage: 1 GB
- Yearly storage: 365 GB
- 5-year storage (with replication factor 3): ~5.5 TB
Bandwidth Estimation:
- Incoming data: 1 GB/day ≈ 12 KB/s
- Outgoing data (assume 10:1 read:write ratio): 10 GB/day ≈ 120 KB/s
Memory Estimation (for caching):
- Cache 20% of daily read requests
- Daily reads: 180 million (90% of 200M)
- Unique cacheable requests (assume 20%): 36 million
- Cache size: 36M × 1 KB = 36 GB
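The arithmetic above can be scripted so the assumptions are explicit and easy to vary. A minimal sketch (the function name and parameters are illustrative; the inputs are the assumed figures from the estimates above):

```python
SECONDS_PER_DAY = 86_400

def estimate(dau, requests_per_user, record_size_kb, new_records_per_day,
             peak_factor=3, replication=3, years=5):
    """Back-of-the-envelope traffic and storage estimates."""
    daily_requests = dau * requests_per_user
    avg_rps = daily_requests / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor
    # 1 KB records: convert KB/day to GB/day, then project out with replication
    daily_storage_gb = new_records_per_day * record_size_kb / 1_000_000
    storage_tb = daily_storage_gb * 365 * years * replication / 1_000
    return {
        "avg_rps": round(avg_rps),
        "peak_rps": round(peak_rps),
        "daily_storage_gb": daily_storage_gb,
        "storage_tb": storage_tb,
    }

est = estimate(dau=10_000_000, requests_per_user=20,
               record_size_kb=1, new_records_per_day=1_000_000)
```

With the figures above this yields roughly 2,300 average RPS, ~7,000 peak RPS, and about 5.5 TB of 5-year storage, matching the hand estimates.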
3. High-Level Design¶
Start with a simple architecture and evolve:
┌─────────┐     ┌───────────────┐     ┌─────────────┐     ┌──────────┐
│ Clients │────▶│ Load Balancer │────▶│ Application │────▶│ Database │
└─────────┘     └───────────────┘     │   Servers   │     └──────────┘
                                      └─────────────┘
Then add components as needed:
┌─────────┐     ┌─────┐     ┌───────────────┐     ┌─────────────┐
│ Clients │────▶│ CDN │────▶│ Load Balancer │────▶│     API     │
└─────────┘     └─────┘     └───────────────┘     │   Gateway   │
                                                  └──────┬──────┘
                    ┌────────────────────────────────────┼────────────────────────────────────┐
                    ▼                                    ▼                                    ▼
             ┌─────────────┐                      ┌─────────────┐                      ┌─────────────┐
             │  Service A  │                      │  Service B  │                      │  Service C  │
             └──────┬──────┘                      └──────┬──────┘                      └──────┬──────┘
                    │                                    │                                    │
             ┌──────▼──────┐                      ┌──────▼──────┐                      ┌──────▼──────┐
             │ Database A  │                      │Cache (Redis)│                      │Message Queue│
             └─────────────┘                      └─────────────┘                      └─────────────┘
4. Deep Dive into Components¶
For each component, consider:
- Technology choice and trade-offs
- Data model and schema design
- API contracts and interfaces
- Failure modes and recovery
- Scaling strategy
Key Aspects of Software Design¶
1. Architectural Design¶
Architectural design defines the overall structure of the system, identifying major components and their relationships. It establishes the foundation upon which the entire software will be built. This high-level view addresses how different parts of the system interact, what technologies will be used, and how the system will be deployed and scaled.
Key Considerations:
- Component Identification: Breaking down the system into logical, cohesive components (e.g., authentication service, payment gateway, data storage layer)
- Communication Patterns: Defining how components communicate (synchronous vs. asynchronous, REST APIs, message queues, event-driven architecture)
- Technology Stack: Selecting appropriate technologies for each component based on requirements, performance needs, and team expertise
- Deployment Architecture: Planning how the system will be deployed (monolithic vs. microservices, on-premises vs. cloud, containerization)
- Scalability Strategy: Designing for future growth (horizontal vs. vertical scaling, stateless vs. stateful components)
- Security Architecture: Defining security boundaries, authentication/authorization mechanisms, and data protection strategies
Common Architectural Patterns:
- Monolithic Architecture: Single, unified application where all components are tightly coupled
- Microservices Architecture: System composed of small, independent services that communicate over well-defined APIs
- Layered Architecture: System organized into horizontal layers (presentation, business logic, data access)
- Event-Driven Architecture: Components communicate through events, enabling loose coupling and asynchronous processing
- Service-Oriented Architecture (SOA): System built from reusable services that communicate via standardized protocols
Monolithic vs. Microservices: Deep Comparison¶
| Aspect | Monolithic | Microservices |
|---|---|---|
| Deployment | Single deployable unit | Independent service deployment |
| Scaling | Scale entire application | Scale individual services |
| Technology | Single tech stack | Polyglot (different stacks per service) |
| Data Management | Shared database | Database per service |
| Team Structure | Feature teams cross modules | Service-owning teams |
| Complexity | Simple initially, complex at scale | Complex initially, manageable at scale |
| Latency | In-process calls (fast) | Network calls (slower) |
| Consistency | ACID transactions easy | Distributed transactions complex |
| Debugging | Easier (single process) | Harder (distributed tracing needed) |
| Failure Isolation | One bug can crash all | Failures isolated to service |
When to Choose Monolithic:
- Small team (< 10 developers)
- Simple domain with clear boundaries
- Rapid prototyping and MVP development
- Strong consistency requirements
- Limited operational expertise
When to Choose Microservices:
- Large teams needing independent deployment
- Complex domains with clear bounded contexts
- Different scaling requirements per component
- Need for technology diversity
- High availability requirements with fault isolation
2. Detailed Design¶
Detailed design focuses on specifying the internal elements of all system components, their attributes, methods, and relationships. It provides the necessary information for developers to implement the design. This phase translates architectural decisions into concrete implementation specifications.
Key Aspects:
- Class/Module Design: Defining classes, modules, or functions with their specific responsibilities, methods, and properties
- Data Structures: Specifying how data will be organized and stored within each component
- Algorithms: Choosing and documenting the algorithms that will be used for critical operations
- Error Handling: Defining how errors will be detected, reported, and handled throughout the system
- State Management: Specifying how component state will be managed, persisted, and synchronized
- Performance Considerations: Identifying performance-critical paths and optimization strategies
- Testing Strategy: Outlining how components will be tested (unit tests, integration tests, etc.)
Design Artifacts:
- Class diagrams showing relationships between classes
- Sequence diagrams illustrating interaction flows
- State diagrams for components with complex state machines
- Database schemas and data models
- API specifications with request/response formats
3. Interface Design¶
Interface design describes how the software components interact with each other and with external systems. It includes user interfaces (UI), application programming interfaces (APIs), and data exchange formats. Well-designed interfaces are crucial for system integration, usability, and maintainability.
Types of Interfaces:
1. User Interfaces (UI):
- Design principles: consistency, clarity, feedback, error prevention
- Responsive design for multiple device types
- Accessibility considerations (WCAG guidelines)
- User experience (UX) flow and interaction patterns
2. Application Programming Interfaces (APIs):
- RESTful APIs: Resource-based, stateless, using HTTP methods
- GraphQL APIs: Query-based, allowing clients to request specific data
- gRPC: High-performance RPC framework using Protocol Buffers
- WebSocket APIs: Real-time, bidirectional communication
- API versioning strategies to maintain backward compatibility
- Authentication and authorization mechanisms (OAuth, API keys, JWT)
- Rate limiting and throttling to prevent abuse
- Documentation standards (OpenAPI/Swagger)
3. Data Exchange Formats:
- JSON: Human-readable, widely supported for web APIs
- XML: Structured data format, often used in enterprise systems
- Protocol Buffers: Efficient binary serialization format
- Message queue formats: AMQP, Kafka message formats
Interface Design Principles:
- Consistency: Similar operations should have similar interfaces
- Simplicity: Interfaces should be easy to understand and use
- Backward Compatibility: Changes should not break existing clients
- Documentation: Clear, comprehensive documentation is essential
- Error Handling: Consistent error response formats and codes
REST vs GraphQL vs gRPC: Deep Comparison¶
| Aspect | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP/1.1, HTTP/2 | HTTP (typically POST) | HTTP/2 |
| Data Format | JSON, XML | JSON | Protocol Buffers (binary) |
| Schema | Optional (OpenAPI) | Required (SDL) | Required (.proto files) |
| Typing | Weak | Strong | Strong |
| Over-fetching | Common problem | Solved (client specifies) | N/A (predefined messages) |
| Under-fetching | Multiple requests needed | Single query | Multiple calls or streaming |
| Caching | HTTP caching easy | Complex (POST requests) | Requires custom solution |
| Real-time | Polling or WebSockets | Subscriptions | Bidirectional streaming |
| Browser Support | Native | Native | Requires grpc-web |
| Learning Curve | Low | Medium | High |
When to Use Each:
- REST: Public APIs, simple CRUD operations, browser-based clients, caching important
- GraphQL: Mobile apps with varied data needs, complex nested data, rapid frontend iteration
- gRPC: Internal microservices, high performance needed, streaming requirements, polyglot environments
API Versioning Strategies¶
1. URL Path Versioning:
GET /api/v1/users/123
GET /api/v2/users/123
- Pros: Clear, explicit, easy to understand
- Cons: URL changes, may break bookmarks
2. Query Parameter Versioning:
GET /api/users/123?version=1
GET /api/users/123?version=2
- Pros: Same URL, optional parameter
- Cons: Easy to forget, caching complications
3. Header Versioning:
GET /api/users/123
Accept: application/vnd.myapi.v1+json
- Pros: Clean URLs, follows HTTP semantics
- Cons: Less visible, harder to test
4. Content Negotiation:
GET /api/users/123
Accept: application/vnd.myapi+json; version=1
- Pros: RESTful, flexible
- Cons: Complex implementation
Best Practice: Use URL path versioning for public APIs (clarity) and header versioning for internal APIs (flexibility).
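As an illustration of header versioning, here is a minimal sketch of extracting the version from the Accept header shown above (the `vnd.myapi` vendor prefix comes from the example; the function name and default are illustrative):

```python
import re

def parse_api_version(accept_header, default=1):
    # "application/vnd.myapi.v2+json" -> 2; anything else falls back to default
    match = re.search(r"application/vnd\.myapi\.v(\d+)\+json", accept_header or "")
    return int(match.group(1)) if match else default
```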
API Design Best Practices¶
Resource Naming:
# Good - nouns, plural, hierarchical
GET /users
GET /users/123
GET /users/123/orders
GET /users/123/orders/456
# Bad - verbs, actions in URL
GET /getUser/123
GET /user/123/getOrders
POST /createUser
HTTP Methods and Status Codes:
| Method | Usage | Success | Error |
|---|---|---|---|
| GET | Retrieve resource | 200 OK | 404 Not Found |
| POST | Create resource | 201 Created | 400 Bad Request |
| PUT | Replace resource | 200 OK | 404 Not Found |
| PATCH | Partial update | 200 OK | 400 Bad Request |
| DELETE | Remove resource | 204 No Content | 404 Not Found |
Pagination:
// Offset-based (simple but slow for large datasets)
GET /users?offset=100&limit=25
// Cursor-based (efficient, consistent)
GET /users?cursor=eyJpZCI6MTAwfQ&limit=25
// Response
{
"data": [...],
"pagination": {
"next_cursor": "eyJpZCI6MTI1fQ",
"has_more": true,
"total_count": 1000
}
}
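The opaque cursors in the example above are base64url-encoded JSON (eyJpZCI6MTAwfQ decodes to {"id":100}). A minimal sketch of cursor encoding and page fetching over an in-memory, id-sorted list (function names are illustrative):

```python
import base64
import json

def encode_cursor(last_id):
    # Cursor is base64url-encoded JSON, e.g. {"id":100} -> "eyJpZCI6MTAwfQ"
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def decode_cursor(cursor):
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))["id"]

def fetch_page(rows, cursor=None, limit=25):
    # rows must be sorted ascending by "id" for the cursor to be stable
    start_id = decode_cursor(cursor) if cursor else 0
    page = [r for r in rows if r["id"] > start_id][:limit]
    has_more = bool(page) and page[-1]["id"] < rows[-1]["id"]
    return {
        "data": page,
        "pagination": {
            "next_cursor": encode_cursor(page[-1]["id"]) if has_more else None,
            "has_more": has_more,
        },
    }
```

Unlike an offset, the cursor stays correct when earlier rows are inserted or deleted between page fetches.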
Error Response Format:
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid request parameters",
"details": [
{
"field": "email",
"message": "Invalid email format"
},
{
"field": "age",
"message": "Must be a positive integer"
}
],
"request_id": "req_abc123",
"documentation_url": "https://api.example.com/docs/errors#VALIDATION_ERROR"
}
}
Idempotency in APIs¶
Idempotency means that making the same request multiple times has the same effect as making it once. This is crucial for handling retries and network failures.
Naturally Idempotent Operations:
- GET, PUT, DELETE are idempotent by design
- Multiple GETs return same result
- Multiple PUTs set same state
- Multiple DELETEs result in deleted state
Making POST Idempotent:
# Client generates unique key
POST /payments
Idempotency-Key: abc123-def456-ghi789
{
"amount": 100,
"currency": "USD"
}
Server Implementation:
def create_payment(request):
    idempotency_key = request.headers.get('Idempotency-Key')

    # Check if we've seen this key before
    existing = cache.get(f"idempotency:{idempotency_key}")
    if existing:
        return existing  # Return cached response

    # Process the payment
    result = process_payment(request.body)

    # Store result with TTL (e.g., 24 hours)
    cache.set(f"idempotency:{idempotency_key}", result, ttl=86400)
    return result
Idempotency Key Best Practices:
- Generate on client side (UUID v4)
- Store with TTL (24-48 hours typical)
- Include in response for debugging
- Handle concurrent requests (use locks)
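The last point, handling concurrent requests with locks, can be sketched as follows. This is a minimal in-process illustration (the in-memory dictionaries stand in for a shared cache with TTL, and the names are illustrative); a real deployment would need a distributed lock or an atomic set-if-absent in the cache:

```python
import threading

_results = {}                     # stands in for a shared response cache
_locks = {}                       # one lock per idempotency key
_locks_guard = threading.Lock()   # protects the _locks dict itself

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def idempotent(key, handler):
    # Concurrent calls with the same key serialize on the per-key lock;
    # the first runs the handler, later ones replay the stored response.
    with _lock_for(key):
        if key in _results:
            return _results[key]
        result = handler()
        _results[key] = result
        return result
```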
4. Data Design¶
Data design involves designing data structures and database schemas to efficiently store, retrieve, and manipulate data within the system. This includes choosing appropriate data models, normalization strategies, indexing, and data access patterns.
Key Considerations:
1. Data Modeling:
- Entity-Relationship (ER) modeling to represent entities and their relationships
- Normalization to reduce data redundancy and improve integrity
- Denormalization strategies for performance optimization
- Data types and constraints to ensure data quality
2. Database Selection:
- Relational Databases (SQL): Structured data with ACID properties (PostgreSQL, MySQL, SQL Server)
- NoSQL Databases:
- Document stores (MongoDB, CouchDB) for flexible schemas
- Key-value stores (Redis, DynamoDB) for high-performance lookups
- Column-family stores (Cassandra, HBase) for wide tables
- Graph databases (Neo4j) for relationship-heavy data
3. Data Access Patterns:
- Read patterns: Query optimization, caching strategies, read replicas
- Write patterns: Transaction handling, write-through vs. write-back caching
- Data partitioning: Horizontal (sharding) and vertical partitioning strategies
- Replication: single-leader (master-slave), multi-leader (master-master), and leaderless replication
4. Data Integrity:
- Primary keys, foreign keys, and unique constraints
- Referential integrity rules
- Data validation at application and database levels
- Backup and recovery strategies
5. Performance Optimization:
- Indexing strategies (B-tree, hash indexes, composite indexes)
- Query optimization techniques
- Connection pooling and connection management
- Caching layers (application-level, database-level, distributed caches)
Database Selection Guide¶
| Database Type | Best For | Examples | Limitations |
|---|---|---|---|
| Relational (SQL) | Complex queries, ACID transactions, structured data | PostgreSQL, MySQL, SQL Server | Scaling writes, schema changes |
| Document Store | Flexible schemas, hierarchical data, rapid development | MongoDB, CouchDB, DynamoDB | Complex joins, transactions |
| Key-Value Store | Caching, sessions, simple lookups | Redis, Memcached, etcd | Complex queries, relationships |
| Wide-Column Store | Time-series, analytics, high write throughput | Cassandra, HBase, ScyllaDB | Complex queries, consistency |
| Graph Database | Social networks, recommendations, fraud detection | Neo4j, Amazon Neptune, JanusGraph | Scaling, complex aggregations |
| Time-Series | Metrics, IoT data, logs | InfluxDB, TimescaleDB, Prometheus | General-purpose queries |
| Search Engine | Full-text search, log analysis | Elasticsearch, Solr, Meilisearch | Primary data store |
Database Indexing Deep Dive¶
Index Types:
1. B-Tree Index (Default):
- Balanced tree structure
- O(log n) lookups
- Good for: equality, range queries, sorting
- Supports: <, <=, =, >=, >, BETWEEN, LIKE 'prefix%'
2. Hash Index:
- O(1) lookups for equality
- Not useful for range queries
- Memory-only in some databases
3. Bitmap Index:
- Efficient for low-cardinality columns
- Good for: gender, status, boolean fields
- Excellent for AND/OR operations
4. GiST/GIN Index (PostgreSQL):
- GiST: Generalized search trees (geometric data, full-text)
- GIN: Generalized inverted index (arrays, JSONB, full-text)
5. Covering Index:
- Includes all columns needed by query
- Avoids table lookup (index-only scan)
Index Design Principles:
-- Composite index: order matters!
CREATE INDEX idx_user_status_created
ON orders(user_id, status, created_at);
-- This index supports:
-- WHERE user_id = 123
-- WHERE user_id = 123 AND status = 'pending'
-- WHERE user_id = 123 AND status = 'pending' AND created_at > '2024-01-01'
-- But NOT efficiently:
-- WHERE status = 'pending' (doesn't use leftmost column)
-- WHERE user_id = 123 AND created_at > '2024-01-01' (skips middle column)
Index Anti-Patterns:
-- Anti-pattern 1: Function on indexed column
-- BAD: Can't use index
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
-- GOOD: Expression index or normalize data
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- Anti-pattern 2: Leading wildcard
-- BAD: Full table scan
SELECT * FROM products WHERE name LIKE '%phone%';
-- GOOD: Use full-text search
SELECT * FROM products WHERE to_tsvector(name) @@ to_tsquery('phone');
-- Anti-pattern 3: OR on different columns
-- BAD: May not use indexes efficiently
SELECT * FROM orders WHERE user_id = 123 OR product_id = 456;
-- GOOD: Use UNION
SELECT * FROM orders WHERE user_id = 123
UNION
SELECT * FROM orders WHERE product_id = 456;
Query Analysis:
-- Always use EXPLAIN ANALYZE
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 123 AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;
-- Look for:
-- - Seq Scan (sequential scan - usually bad for large tables)
-- - Index Scan / Index Only Scan (good)
-- - Rows: estimated vs actual (large difference = stale statistics)
-- - Sort operations (consider adding index for ORDER BY)
Data Normalization vs Denormalization¶
Normalization (Reduce Redundancy):
-- Normalized schema (3NF)
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE
);
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10,2)
);
CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER,
    price DECIMAL(10,2)
);
-- Query requires multiple joins
SELECT u.email, o.total, oi.quantity, p.name
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id;
Denormalization (Optimize Reads):
-- Denormalized for read performance
CREATE TABLE order_summary (
id SERIAL PRIMARY KEY,
user_id INTEGER,
user_email VARCHAR(255), -- Duplicated from users
total DECIMAL(10,2),
item_count INTEGER, -- Pre-calculated
product_names TEXT[], -- Array of product names
created_at TIMESTAMP
);
-- Single table query - much faster
SELECT * FROM order_summary WHERE user_id = 123;
When to Denormalize:
- Read-heavy workloads (10:1 or higher read:write ratio)
- Complex joins hurting performance
- Data rarely changes after creation
- Reporting and analytics queries
- Caching data in a fast-access format
Trade-offs:
| Aspect | Normalized | Denormalized |
|---|---|---|
| Write speed | Faster | Slower (multiple updates) |
| Read speed | Slower (joins) | Faster (single table) |
| Storage | Less | More (redundancy) |
| Data consistency | Easier | Harder (sync required) |
| Schema changes | Easier | Harder |
ACID vs BASE¶
ACID (Traditional Databases):
- Atomicity: Transaction is all-or-nothing
- Consistency: Database moves from valid state to valid state
- Isolation: Concurrent transactions don't interfere
- Durability: Committed data survives failures
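Atomicity is easy to demonstrate with SQLite, which ships with Python. In this sketch (table and account names are illustrative), a transfer that fails between its two writes is rolled back as a unit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    # The with-block is one transaction: commit on success, rollback on error.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        if fail_midway:
            raise RuntimeError("crash between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

try:
    transfer(conn, "alice", "bob", 50, fail_midway=True)
except RuntimeError:
    pass  # the half-finished debit was rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed call, alice still has her full balance; only a successful call moves the money.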
BASE (Distributed Systems):
- Basically Available: System remains available
- Soft State: State may change without input (due to eventual consistency)
- Eventually Consistent: System becomes consistent over time
Isolation Levels (✓ = anomaly can occur at that level):
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| READ UNCOMMITTED | ✓ | ✓ | ✓ | Highest |
| READ COMMITTED | ✗ | ✓ | ✓ | High |
| REPEATABLE READ | ✗ | ✗ | ✓ | Medium |
| SERIALIZABLE | ✗ | ✗ | ✗ | Lowest |
-- Set the isolation level for a transaction
-- (PostgreSQL: after BEGIN; MySQL: issue SET TRANSACTION before starting it)
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Your transaction
COMMIT;
Design Principles¶
- SOLID Principles: Single Responsibility, Open-Closed, Liskov Substitution, Interface Segregation, and Dependency Inversion
- DRY (Don't Repeat Yourself): Avoiding code duplication
- KISS (Keep It Simple, Stupid): Maintaining simplicity in design
- Separation of Concerns: Dividing a system into distinct sections with minimal overlap
SOLID Principles¶
SOLID Stands for:
- S - Single-responsibility Principle
- O - Open-closed Principle
- L - Liskov Substitution Principle
- I - Interface Segregation Principle
- D - Dependency Inversion Principle
S - Single-responsibility Principle¶
The Single Responsibility Principle (SRP) is the first principle of SOLID, a set of five design principles for writing maintainable and scalable software, introduced by Robert C. Martin. SRP states that a class should have only one reason to change, meaning it should have only one responsibility or job.
Key Points:
- Definition: A class should focus on a single task or responsibility, encapsulating only the logic related to that purpose.
- Purpose: To reduce complexity, improve maintainability, and make code easier to understand and modify.
- Responsibility: Refers to a single, well-defined job or purpose (e.g., handling user data, logging, or sending emails).
- Consequence of Violation: If a class has multiple responsibilities, changes to one responsibility may unintentionally affect others, leading to bugs, fragility, and harder maintenance.
Benefits of SRP:
- Improved Maintainability: Changes to one responsibility (e.g., email logic) don't affect unrelated functionality (e.g., user management).
- Easier Testing: Smaller, focused classes are easier to test since they do one thing.
- Better Reusability: Classes with single responsibilities can be reused in other contexts.
- Reduced Coupling: Separating responsibilities minimizes dependencies between unrelated concerns.
How to Apply SRP:
- Identify distinct responsibilities in a class by asking, "What does this class do?"
- If a class does multiple unrelated things, split it into smaller classes.
- Use clear, descriptive names for classes to reflect their single purpose.
- Watch for signs of violation, like classes with too many methods or frequent changes for unrelated reasons.
Common Misconceptions:
- Over-splitting: SRP doesn't mean every class must have one method. A class can have multiple methods as long as they serve the same responsibility.
- Not just classes: SRP applies to functions, modules, or components too, depending on the context.
Real-World Analogy:
Think of a chef in a kitchen. If the chef is responsible for cooking, serving, and cleaning, they're overloaded with responsibilities. If the cooking process changes, it could disrupt serving or cleaning. Instead, assign a chef to cook, a waiter to serve, and a dishwasher to clean; each has one job, making the kitchen more efficient.
Practical Example in Context:
Imagine a web application:
- Bad: A ReportGenerator class that generates reports, formats them as PDF, and emails them.
- Good: Split into ReportGenerator (creates report data), ReportFormatter (formats as PDF), and EmailService (sends emails). Each class changes only when its specific responsibility needs an update.
By adhering to SRP, you create modular, flexible, and easier-to-maintain codebases, aligning with the broader goals of SOLID principles.
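The ReportGenerator split described above can be sketched as follows (class and method bodies are illustrative; plain text stands in for PDF formatting, and an in-memory outbox stands in for real email delivery):

```python
class ReportGenerator:
    """Only builds report data; knows nothing about formatting or delivery."""
    def generate(self, sales):
        return {"total": sum(sales), "count": len(sales)}

class ReportFormatter:
    """Only turns report data into an output format."""
    def format(self, report):
        return f"Sales report: {report['count']} sales, total {report['total']}"

class EmailService:
    """Only delivers messages; swapping providers touches nothing else."""
    def __init__(self):
        self.outbox = []
    def send(self, to, body):
        self.outbox.append((to, body))

report = ReportGenerator().generate([100, 250, 50])
body = ReportFormatter().format(report)
```

A change to the email provider now touches only EmailService; report logic and formatting are untouched.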
O - Open-closed Principle¶
The Open-Closed Principle (OCP) is the second principle of SOLID, a set of design principles for creating maintainable and scalable software, introduced by Robert C. Martin. OCP states that software entities (classes, modules, functions, etc.) should be open for extension but closed for modification.
Key Points:
- Open for Extension: You can add new functionality to a system by extending existing code (e.g., through inheritance, interfaces, or plugins).
- Closed for Modification: The existing code should not need to be changed to accommodate new functionality.
- Purpose: To make systems easier to maintain and extend without introducing bugs or altering tested, stable code.
- Consequence of Violation: Modifying existing code to add new features can introduce errors, break existing functionality, and require extensive retesting.
Benefits of OCP:
- Maintainability: Existing code remains untouched, reducing the risk of introducing bugs.
- Scalability: New features can be added by writing new classes or modules without altering the core system.
- Reusability: Polymorphic designs promote reusable components.
- Reduced Testing Overhead: Only new extensions need testing, not the entire system.
How to Apply OCP:
- Use abstraction (interfaces, abstract classes) to define contracts that new classes can implement.
- Leverage polymorphism to allow different implementations to be used interchangeably.
- Apply design patterns like Strategy, Factory, or Decorator to enable flexible extensions.
- Avoid conditional logic (e.g., if-else chains) for handling different types or behaviors; use polymorphism instead.
Common Misconceptions:
- Not just inheritance: While inheritance is a common way to achieve OCP, composition (e.g., using interfaces or dependency injection) is often preferred for flexibility.
- Not absolute: In some cases, minor modifications to existing code might be practical, but OCP encourages minimizing such changes.
Real-World Analogy:
Think of a power strip with multiple outlets. The strip is “closed” for modification—you don’t need to rewire it to plug in a new device. It’s “open” for extension because you can plug in new devices (like a lamp or charger) without changing the strip itself.
Practical Example in Context:
Imagine a payment processing system:
- Bad: A PaymentProcessor class with if-else statements for each payment type (credit card, PayPal, crypto). Adding a new payment type requires modifying the class.
- Good: Define a PaymentMethod interface with a processPayment method. Each payment type (e.g., CreditCardPayment, PayPalPayment) implements the interface. The PaymentProcessor uses the interface and remains unchanged when new payment methods are added.
By following OCP, you create systems that are easier to extend and maintain, aligning with the broader goals of SOLID principles.
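The PaymentMethod design described above can be sketched as follows (class names follow the example; the return values are illustrative):

```python
from abc import ABC, abstractmethod

class PaymentMethod(ABC):
    @abstractmethod
    def process_payment(self, amount): ...

class CreditCardPayment(PaymentMethod):
    def process_payment(self, amount):
        return f"charged {amount} to card"

class PayPalPayment(PaymentMethod):
    def process_payment(self, amount):
        return f"charged {amount} via PayPal"

class PaymentProcessor:
    """Closed for modification: works with any PaymentMethod, present or future."""
    def pay(self, method: PaymentMethod, amount):
        return method.process_payment(amount)
```

Adding a new payment type means adding a class that implements PaymentMethod; PaymentProcessor is never edited.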
L - Liskov Substitution Principle¶
The Liskov Substitution Principle (LSP) is the third principle of SOLID, introduced by Barbara Liskov. It states that objects of a derived class should be able to replace objects of the base class without introducing errors or unexpected behavior in the system. In other words, a subclass should be substitutable for its superclass without breaking the program.
Key Points:
- Definition: If a program is designed to work with a base class (or interface), it should work correctly with any subclass (or implementation) of that base class without modification.
- Purpose: To ensure that inheritance hierarchies are designed correctly, maintaining the reliability and correctness of the system.
- Consequence of Violation: Violating LSP leads to brittle code, unexpected behavior, or errors when subclasses are used in place of their base classes.
Formal Definition: Let S be a subtype of T. Then, objects of type T in a program can be replaced with objects of type S without altering the correctness of the program.
Key Guidelines for LSP:
- Behavioral Compatibility: Subclasses must honor the contract (behavior) defined by the base class or interface. This includes respecting preconditions, postconditions, and invariants.
- Preconditions: Subclass methods should not require stronger conditions than the base class (e.g., stricter input requirements).
- Postconditions: Subclass methods should not weaken the guarantees provided by the base class (e.g., returning less specific results).
- Invariants: Properties that hold true for the base class must remain true for subclasses.
- Avoid Overriding with Incompatible Behavior: Subclasses shouldn’t throw unexpected exceptions or produce results that break the base class’s expectations.
- Use Interfaces or Abstract Classes Wisely: Define clear contracts that all subclasses can fulfill.
- Refactor Inheritance if Needed: If a subclass cannot fully substitute for its parent, reconsider the inheritance hierarchy or use composition instead.
Benefits of LSP:
- Reliability: Ensures subclasses can be used interchangeably with their base class without causing errors.
- Maintainability: Reduces bugs caused by unexpected behavior in inheritance hierarchies.
- Flexibility: Allows polymorphic code to work with new subclasses without modification.
- Better Design: Encourages well-thought-out class hierarchies and clear contracts.
Common Misconceptions:
- Not just syntactic compatibility: LSP is about behavior, not just matching method signatures. A subclass that compiles but behaves incorrectly still violates LSP.
- Not limited to inheritance: LSP applies to any implementation of an interface or contract, not just class inheritance.
Real-World Analogy:
Think of a power outlet and appliances. If a device (subclass) claims to work with a standard outlet (base class), it should function correctly when plugged in. If a "special" appliance requires a different voltage or connector, it violates the expectation of being substitutable for a standard appliance, causing issues.
Practical Example in Context:
Imagine a payment system:
- Bad: A Payment base class with a process method. A CryptoPayment subclass throws an exception for unsupported currencies, breaking code that expects Payment to always process successfully.
- Good: Define a Payment interface with a process method that all implementations (e.g., CreditCardPayment, CryptoPayment) can fulfill reliably. If CryptoPayment has limitations, handle them within the method (e.g., return an error code) rather than breaking the contract.
Relation to Other SOLID Principles:
- OCP: LSP supports the Open-Closed Principle by ensuring new subclasses can be added without breaking existing code.
- SRP: Classes with single responsibilities are less likely to violate LSP, as they have clearer, more focused contracts.
By adhering to LSP, you create robust inheritance hierarchies and polymorphic systems that are reliable and easier to extend, aligning with the broader goals of SOLID principles.
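The LSP-friendly design described above can be sketched as follows: every implementation honors the same contract by returning a result instead of throwing for unsupported cases (names follow the example; the result format is illustrative):

```python
class Payment:
    def process(self, amount, currency):
        raise NotImplementedError

class CreditCardPayment(Payment):
    def process(self, amount, currency):
        return {"ok": True, "message": f"charged {amount} {currency}"}

class CryptoPayment(Payment):
    SUPPORTED = {"BTC", "ETH"}
    def process(self, amount, currency):
        # Limitation handled inside the contract, not by throwing:
        if currency not in self.SUPPORTED:
            return {"ok": False, "message": f"{currency} not supported"}
        return {"ok": True, "message": f"charged {amount} {currency}"}

def checkout(payment: Payment, amount, currency):
    """Works identically with any substitutable Payment subclass."""
    result = payment.process(amount, currency)
    return result["ok"]
```

checkout never needs to know which subclass it received; no implementation can surprise it with an exception.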
I - Interface Segregation Principle¶
The Interface Segregation Principle (ISP) is the fourth principle of SOLID, introduced by Robert C. Martin. It states that clients should not be forced to depend on interfaces they do not use. In other words, interfaces should be specific to the needs of the client, and classes should not be required to implement methods they don’t need.
Key Points:
- Definition: Break down large, general-purpose interfaces into smaller, more specific ones tailored to the requirements of individual clients.
- Purpose: To reduce coupling, improve cohesion, and prevent classes from being burdened with irrelevant methods, making the system easier to maintain and extend.
- Consequence of Violation: Classes implementing a bloated interface may have to provide empty or meaningless implementations for unused methods, leading to fragile code and maintenance issues.
Benefits of ISP:
- Reduced Coupling: Clients only depend on the specific interfaces they need, not bloated ones.
- Improved Cohesion: Interfaces are more focused, aligning with specific responsibilities.
- Easier Maintenance: Classes implement only relevant methods, reducing unnecessary code and potential errors.
- Better Scalability: New interfaces can be added for new client needs without affecting existing classes.
How to Apply ISP:
- Split Large Interfaces: Break down “fat” interfaces into smaller, more specific ones based on client needs.
- Client-Driven Design: Design interfaces from the perspective of what clients (classes or modules) actually require.
- Avoid Empty Implementations: If a class is forced to implement empty or “not applicable” methods, it’s a sign the interface is too broad.
- Use Composition: Combine multiple small interfaces to compose the functionality a class needs.
Common Misconceptions:
- Not about minimal interfaces: ISP doesn’t mean every interface should have one method. Interfaces should be cohesive and serve a specific purpose.
- Not just for interfaces: While ISP focuses on interfaces in object-oriented programming, the principle applies to any contract (e.g., APIs, modules) that clients depend on.
Real-World Analogy:
Think of a restaurant menu. A general menu listing every dish (vegan, meat, desserts) forces all customers to navigate irrelevant options. Instead, offering separate vegan, carnivore, and dessert menus ensures customers only see what’s relevant to them, simplifying their experience.
Practical Example in Context:
Imagine a printer system:
- Bad: A Printer interface with print, scan, and fax methods. A basic printer that can’t scan or fax must implement empty methods, violating ISP.
- Good: Split into Printable, Scannable, and Faxable interfaces. A basic printer implements only Printable, while an all-in-one printer implements all three. Clients (e.g., a print job manager) only depend on the Printable interface if they need printing.
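A minimal Python sketch of this split, using abstract base classes (the interface and class names mirror the example above and are illustrative):

```python
from abc import ABC, abstractmethod

# Role interfaces carved out of a hypothetical "fat" Printer interface
class Printable(ABC):
    @abstractmethod
    def print_document(self, doc: str) -> None: ...

class Scannable(ABC):
    @abstractmethod
    def scan(self) -> str: ...

class Faxable(ABC):
    @abstractmethod
    def fax(self, doc: str, number: str) -> None: ...

class BasicPrinter(Printable):
    # Implements only what the device can actually do - no empty stubs
    def print_document(self, doc: str) -> None:
        print(f"printing: {doc}")

class AllInOnePrinter(Printable, Scannable, Faxable):
    def print_document(self, doc: str) -> None:
        print(f"printing: {doc}")

    def scan(self) -> str:
        return "scanned-image"

    def fax(self, doc: str, number: str) -> None:
        print(f"faxing {doc} to {number}")

def run_print_job(device: Printable, doc: str) -> None:
    # The client depends only on Printable, so it accepts either printer
    device.print_document(doc)
```

Note that `run_print_job` cannot even see `scan` or `fax`; the client's dependency is exactly as wide as its need.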
Relation to Other SOLID Principles:
- SRP: ISP aligns with the Single Responsibility Principle by ensuring interfaces focus on a single, cohesive purpose.
- LSP: ISP supports Liskov Substitution by ensuring subclasses implement only relevant behaviors, making substitution safer.
- OCP: Smaller interfaces make it easier to extend systems without modifying existing code.
D - Dependency Inversion Principle¶
The Dependency Inversion Principle (DIP) is the fifth and final principle of SOLID, introduced by Robert C. Martin. It states that high-level modules should not depend on low-level modules; both should depend on abstractions. Additionally, abstractions should not depend on details; details should depend on abstractions.
Key Points:
- Definition: Instead of high-level modules (business logic) directly depending on low-level modules (implementation details), both should rely on abstract interfaces or contracts. This decouples the system and makes it more flexible.
- Purpose: To reduce tight coupling between components, making the system easier to extend, test, and maintain.
- Consequence of Violation: Tight coupling to specific implementations makes it hard to change or swap components, leading to rigid and fragile code.
Core Concepts:
- High-Level Modules: Contain business logic or core functionality (e.g., a service orchestrating a process).
- Low-Level Modules: Handle implementation details (e.g., database access, file I/O, or external APIs).
- Abstractions: Interfaces or abstract classes that define contracts without specifying implementation details.
- Inversion: Instead of high-level modules controlling low-level ones directly, both depend on abstractions, “inverting” the traditional dependency flow.
Benefits of DIP:
- Decoupling: High-level and low-level modules are loosely coupled, relying on abstractions rather than concrete implementations.
- Flexibility: You can swap implementations (e.g., switch databases or APIs) without changing high-level code.
- Testability: Dependencies can be mocked or stubbed easily for unit testing.
- Maintainability: Changes in low-level details don’t require modifying high-level logic.
How to Apply DIP:
- Use Interfaces or Abstract Classes: Define abstractions that specify contracts for low-level modules.
- Dependency Injection: Pass dependencies (e.g., via constructor, setter, or framework) rather than hardcoding them in high-level modules.
- Avoid Concrete Dependencies: High-level modules should not instantiate or directly reference low-level classes.
- Design for Abstractions: Ensure both high-level and low-level modules depend on the same abstract contract.
Common Misconceptions:
- Not just dependency injection: While DIP often uses dependency injection, the principle is about depending on abstractions, not the mechanism of injection.
- Not about eliminating dependencies: It’s about managing dependencies through abstractions, not removing them entirely.
Real-World Analogy: Think of a lamp and a power source. Instead of wiring the lamp (high-level module) directly to a specific battery (low-level module), use a standard plug (abstraction). The lamp works with any power source (battery, wall outlet) that fits the plug, without needing to rewire the lamp.
Practical Example in Context:
Imagine a notification system:
- Bad: A NotificationService directly creates and uses an EmailSender class. Switching to SMS or push notifications requires modifying NotificationService.
- Good: Define a NotificationSender interface with a send method. EmailSender, SMSSender, and PushSender implement it. NotificationService depends on the NotificationSender interface, and the specific sender is injected at runtime.
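The "good" design can be sketched in Python with constructor injection. The class names follow the example above; the bodies are placeholders:

```python
from abc import ABC, abstractmethod

# The abstraction both the high-level and low-level modules depend on
class NotificationSender(ABC):
    @abstractmethod
    def send(self, recipient: str, message: str) -> bool: ...

class EmailSender(NotificationSender):
    def send(self, recipient: str, message: str) -> bool:
        # ... SMTP call would go here ...
        return True

class SMSSender(NotificationSender):
    def send(self, recipient: str, message: str) -> bool:
        # ... SMS gateway call would go here ...
        return True

class NotificationService:
    # High-level module: receives its dependency (constructor injection)
    # instead of instantiating a concrete sender itself
    def __init__(self, sender: NotificationSender):
        self.sender = sender

    def notify(self, user: str, message: str) -> bool:
        return self.sender.send(user, message)

# Swapping email for SMS requires no change to NotificationService:
service = NotificationService(SMSSender())
```

In tests, the same seam lets you inject a stub `NotificationSender` and verify behavior without touching a real email or SMS gateway.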
Relation to Other SOLID Principles:
- SRP: DIP supports Single Responsibility by encouraging focused abstractions for specific responsibilities.
- OCP: DIP enables Open-Closed Principle by allowing new implementations to be added without modifying existing code.
- LSP: DIP relies on Liskov Substitution to ensure that any implementation of an abstraction can be used interchangeably.
- ISP: DIP works best with small, specific interfaces (per Interface Segregation) to avoid bloated dependencies.
By adhering to DIP, you create systems that are decoupled, flexible, and easier to maintain, aligning with the broader goals of SOLID principles.
DRY - Don't Repeat Yourself¶
The "Don't Repeat Yourself" (DRY) principle is a fundamental concept in software development aimed at reducing repetition of code and improving maintainability, readability, and efficiency. It states that every piece of knowledge or logic in a system should have a single, unambiguous, authoritative representation. In other words, you should avoid duplicating code or data, as duplication can lead to errors, inconsistencies, and increased maintenance effort.
Key Aspects of DRY
- Single Source of Truth: Each piece of functionality or data should exist in one place. If you need to change something, you update it in one place, and all dependent parts of the system reflect that change.
- Avoid Code Duplication: Repeating code across multiple parts of an application can lead to problems when changes are needed, as you’d have to update every instance of the duplicated code.
- Promotes Modularity: DRY encourages breaking code into reusable functions, modules, or components, making it easier to maintain and scale.
- Applies Beyond Code: DRY can apply to data, configurations, documentation, and even design patterns, ensuring consistency across the system.
Benefits of DRY
- Maintainability: With less duplicated code, updates or bug fixes only need to be applied in one place, reducing the risk of missing spots or introducing inconsistencies.
- Readability: Code is cleaner and easier to understand when logic is centralized rather than scattered.
- Efficiency: Developers spend less time writing repetitive code and more time focusing on new features or improvements.
- Reduced Errors: Duplication increases the chance of errors, especially if changes are made inconsistently across duplicated sections.
When to Use DRY
- When repetition is meaningful: If you’re duplicating code or logic that represents the same concept or behavior, it’s a candidate for DRY.
- When future changes are likely: If a piece of logic or data is likely to change, centralizing it reduces future maintenance effort.
- When it improves clarity: DRY should make code easier to understand, not overly complex.
Potential Pitfalls
- Over-Abstraction: Applying DRY too aggressively can lead to overly complex abstractions that are hard to understand or maintain. For example, creating a highly generalized function to handle slightly different cases might make the code less readable.
- Premature Optimization: Trying to eliminate all duplication early on can lead to overengineering. Sometimes, it’s okay to have minor duplication if the code’s purpose or context might diverge later.
- Context Matters: Not all duplication is bad. If two pieces of code look similar but serve different purposes, unifying them might couple unrelated functionality, making future changes harder.
DRY vs. WET
The opposite of DRY is sometimes humorously called WET ("Write Everything Twice" or "We Enjoy Typing"). WET code is full of duplication, leading to bloated, error-prone, and hard-to-maintain systems.
Practical Tips for Applying DRY
- Refactor Regularly: When you notice duplicated code, refactor it into reusable components like functions, classes, or modules.
- Use Design Patterns: Patterns like Factory, Strategy, or Template Method can help centralize logic and avoid repetition.
- Leverage Tools: Use linters, code analyzers, or IDE features to detect duplication.
- Balance Simplicity and DRY: Ensure that applying DRY doesn’t make the code harder to understand or maintain.
DRY in Different Contexts
- Frontend Development: Use components (e.g., in React or Vue) to reuse UI elements and logic.
- Backend Development: Create reusable services, utilities, or middleware to handle common tasks like authentication or logging.
- DevOps: Use configuration files or Infrastructure as Code (e.g., Terraform) to avoid repeating setup instructions.
- Documentation: Maintain a single source for documentation (e.g., a wiki or README) to avoid outdated or conflicting information.
Example in a Real-World Scenario
Imagine a web app where multiple pages display a user’s profile information. Without DRY, you might repeat the same HTML and logic for rendering the profile on each page. By applying DRY, you create a reusable Profile component that takes user data as input and renders it consistently across the app. If the profile’s design changes, you update the component once, and all pages reflect the change.
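A minimal sketch of this idea in Python (the "component" here is just a function; names and markup are illustrative). The profile markup lives in exactly one place, and every page calls it:

```python
# Single source of truth for rendering a user profile
def render_profile(user: dict) -> str:
    """One authoritative template; every page calls this instead of
    duplicating the markup."""
    return (
        "<div class='profile'>"
        f"<h2>{user['name']}</h2>"
        f"<p>{user['email']}</p>"
        "</div>"
    )

def render_home_page(user: dict) -> str:
    # Reuses the shared component rather than copying the markup
    return "<main>" + render_profile(user) + "</main>"

def render_settings_page(user: dict) -> str:
    return "<main><h1>Settings</h1>" + render_profile(user) + "</main>"
```

If the profile design changes, only `render_profile` is edited; both pages pick up the change automatically.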
Conclusion
The DRY principle is about reducing redundancy to make code more maintainable, scalable, and less error-prone. However, it’s not a rigid rule—use it thoughtfully to balance simplicity and efficiency. When applied correctly, DRY leads to cleaner, more robust systems that are easier to work with over time.
KISS - Keep It Simple, Stupid¶
The KISS principle, which stands for "Keep It Simple, Stupid" (or sometimes "Keep It Short and Simple" or "Keep It Simple and Straightforward"), is a design philosophy that emphasizes simplicity in software development, engineering, and other fields. The core idea is to avoid unnecessary complexity, making systems, code, or designs as simple as possible while still achieving the intended purpose.
Origin
The KISS principle is attributed to Kelly Johnson, a lead engineer at Lockheed Martin’s Skunk Works in the mid-20th century. He reportedly told his team that aircraft designs should be simple enough for an average mechanic to repair with basic tools under combat conditions. The phrase has since been adopted widely in software development, project management, and other disciplines.
Key Aspects of KISS
- Simplicity First: Choose the simplest solution that solves the problem effectively, avoiding overengineering or adding unnecessary features.
- Clarity and Readability: Write code or create designs that are easy to understand, even for someone new to the project.
- Minimize Complexity: Avoid intricate logic, convoluted architectures, or excessive abstractions that make maintenance harder.
- Focus on Essentials: Prioritize core functionality over "nice-to-have" features that add complexity without significant value.
Benefits of KISS
- Easier Maintenance: Simple code or systems are easier to debug, modify, and extend.
- Improved Readability: Straightforward designs are more accessible to developers, reducing onboarding time and miscommunication.
- Faster Development: Simpler solutions often take less time to implement and test.
- Fewer Errors: Complexity increases the likelihood of bugs; simplicity reduces it.
- Scalability: Simple systems are often easier to scale or adapt to new requirements.
KISS in Relation to DRY
The KISS principle complements the DRY (Don’t Repeat Yourself) principle, but the two can sometimes conflict:
- DRY focuses on eliminating duplication by abstracting shared logic, which can sometimes lead to complex abstractions.
- KISS prioritizes simplicity, even if it means tolerating minor duplication to keep the code straightforward.
When applying both, balance is key. For instance, don’t create overly generalized functions to avoid duplication if they make the code harder to understand.
When to Apply KISS
- Early Development: Start with simple solutions to get a working prototype quickly, then refine as needed.
- When Complexity Creeps In: If a system starts becoming hard to understand or maintain, simplify it.
- Team Collaboration: Simple code is easier for teams to work on, especially with varying skill levels.
- User-Facing Systems: Simplicity in UI/UX or APIs improves user experience and adoption.
Potential Pitfalls
- Oversimplification: Making things too simple might ignore edge cases or future scalability needs. For example, a simplistic database schema might not handle growth well.
- Misinterpreting Simplicity: KISS doesn’t mean cutting corners or ignoring best practices; it means finding the simplest effective solution.
- Subjectivity: What’s "simple" can vary between developers. A solution that seems simple to one person might be confusing to another.
Practical Tips for Applying KISS
- Break Down Problems: Divide complex tasks into smaller, manageable pieces.
- Use Clear Naming: Choose variable, function, and class names that clearly describe their purpose.
- Avoid Overengineering: Don’t add features or abstractions "just in case" they’re needed later (see YAGNI - "You Aren’t Gonna Need It").
- Refactor for Simplicity: Regularly review code to simplify overly complex sections.
- Test Early: Simple designs are easier to test, so validate them early to catch issues.
- Leverage Existing Tools: Use libraries or frameworks to handle complex tasks instead of reinventing the wheel.
KISS in Different Contexts
- Software Development: Write concise, focused code with minimal dependencies and clear logic.
- Project Management: Use straightforward plans and avoid overcomplicating workflows with excessive tools or processes.
- UI/UX: Design interfaces with minimal steps to achieve user goals.
- Documentation: Write clear, concise documentation without unnecessary jargon.
Example in a Real-World Scenario
Imagine building a login system for a web app. A KISS approach would involve:
- A simple form with email and password fields.
- A single endpoint (e.g., /login) that validates credentials and returns a token.
- Clear error messages for invalid inputs.
Instead of:
- A complex form with multiple authentication options (e.g., biometrics, OAuth, OTP) before validating the need for them.
- Multiple endpoints for different login scenarios, increasing maintenance.
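The KISS version of the login flow above can be sketched as a single function with one code path. This is a deliberately minimal sketch, assuming an in-memory user store; a real system would use a salted password hash such as bcrypt or argon2 and a proper session framework:

```python
import hashlib
import hmac
import secrets

# Hypothetical in-memory user store: email -> SHA-256 hex of the password.
# (Illustrative only - real systems must use salted, slow password hashing.)
USERS = {"ada@example.com": hashlib.sha256(b"correct-horse").hexdigest()}

def login(email: str, password: str) -> dict:
    """Single endpoint logic: validate credentials, return a token or a clear error."""
    stored = USERS.get(email)
    if stored is None:
        return {"ok": False, "error": "unknown email"}
    supplied = hashlib.sha256(password.encode()).hexdigest()
    # Constant-time comparison avoids leaking information via timing
    if not hmac.compare_digest(stored, supplied):
        return {"ok": False, "error": "wrong password"}
    return {"ok": True, "token": secrets.token_hex(16)}
```

One entry point, two clear error messages, one success path; OAuth, OTP, or biometrics can be layered on later if a real need appears.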
Conclusion
The KISS principle is about prioritizing simplicity to create systems that are easier to build, understand, and maintain. By focusing on what’s essential and avoiding unnecessary complexity, developers can deliver robust, efficient solutions. However, KISS should be balanced with other principles like DRY and YAGNI to ensure the solution remains effective and adaptable. When applied thoughtfully, KISS leads to cleaner code, happier teams, and better user experiences.
YAGNI - You Aren’t Gonna Need It¶
The YAGNI principle, which stands for "You Aren’t Gonna Need It", is a software development guideline that advises against implementing functionality or adding complexity to a system unless it’s explicitly required for the current needs. Originating from Extreme Programming (XP), YAGNI emphasizes focusing on what’s necessary now, avoiding speculative features or overengineering that may never be used. It promotes lean development, reducing wasted effort and keeping systems simpler and more maintainable.
Key Aspects of YAGNI
- Build Only What’s Needed: Implement features, code, or infrastructure only when there’s a clear, immediate requirement, not because they might be useful later.
- Avoid Overengineering: Don’t add complexity (e.g., extra classes, configurations, or optimizations) in anticipation of future needs that may never materialize.
- Focus on Current Requirements: Prioritize delivering working software that meets the present scope, allowing for flexibility to adapt later.
- Iterative Development: YAGNI aligns with agile methodologies, encouraging incremental changes based on actual feedback or requirements.
Benefits of YAGNI
- Reduced Complexity: By avoiding unnecessary features, code remains simpler, easier to understand, and less prone to bugs.
- Faster Development: Focusing on current needs speeds up delivery, as developers aren’t spending time on speculative work.
- Lower Maintenance Costs: Fewer features mean less code to maintain, test, and debug.
- Flexibility: Avoiding premature commitments to specific designs or features makes it easier to adapt to future changes.
- Resource Efficiency: Saves time, effort, and resources that would be spent on unused functionality.
YAGNI in Relation to KISS and DRY
- YAGNI and KISS: Both emphasize simplicity. YAGNI avoids adding unnecessary features, while KISS (Keep It Simple, Stupid) focuses on keeping implementations straightforward. Together, they encourage minimal, clear solutions.
- YAGNI and DRY: DRY (Don’t Repeat Yourself) eliminates duplication, but overzealous DRY can lead to premature abstractions that YAGNI warns against. For example, creating a generalized framework for a feature that’s only used once violates YAGNI, even if it follows DRY.
When to Apply YAGNI
- Unclear Requirements: If a feature is speculative or based on "what if" scenarios, defer it until it’s explicitly needed.
- Early Development: Focus on delivering a minimum viable product (MVP) rather than building for hypothetical future needs.
- Resource-Constrained Projects: When time or budget is limited, prioritize essential functionality.
- Agile Environments: YAGNI fits well with iterative development, where features are added incrementally based on feedback.
Potential Pitfalls
- Underengineering: Overapplying YAGNI can lead to overly simplistic systems that are hard to extend when legitimate requirements arise. For example, ignoring basic scalability needs in a high-growth application.
- Ignoring Foreseeable Needs: If a requirement is highly likely based on clear evidence (e.g., customer feedback or project roadmap), YAGNI doesn’t mean ignoring it entirely.
- Technical Debt: Poorly designed simple solutions might require significant rework later. Balance YAGNI with good design practices.
- Team Misalignment: Some team members might interpret YAGNI as an excuse to cut corners or skip necessary planning.
Practical Tips for Applying YAGNI
- Validate Requirements: Confirm with stakeholders whether a feature is truly needed now.
- Refactor When Needed: If requirements change, refactor the code to accommodate new needs rather than overbuilding upfront.
- Use Agile Practices: Work in short iterations, delivering small, functional increments to validate what’s needed.
- Document Assumptions: Note why certain features were deferred to avoid confusion later.
- Balance with Scalability: Ensure the system is flexible enough to handle likely future needs without locking in speculative features.
- Communicate with Teams: Ensure everyone understands YAGNI to avoid adding unnecessary complexity during development.
YAGNI in Different Contexts
- Software Development: Write only the code needed for current functionality, avoiding speculative optimizations or frameworks.
- UI/UX Design: Build simple, focused interfaces rather than adding every possible user option upfront.
- DevOps: Deploy minimal infrastructure (e.g., a single server) instead of a complex, scalable cluster until traffic justifies it.
- Project Management: Avoid planning for every possible contingency; focus on immediate milestones.
Example in a Real-World Scenario
Imagine developing an e-commerce app. A YAGNI approach would involve:
- Building a basic product catalog and checkout system for the initial launch.
- Skipping features like advanced search filters, multi-currency support, or a recommendation engine until user demand or business needs justify them.
Instead of:
- Implementing a complex system with support for multiple languages, dynamic pricing, and AI-driven recommendations from the start, “just in case” they’re needed.
Conclusion
The YAGNI principle helps developers stay focused on delivering value by avoiding unnecessary work and complexity. By building only what’s required now, teams can ship faster, maintain simpler systems, and remain flexible for future changes. However, YAGNI should be applied thoughtfully, balancing immediate needs with reasonable foresight to avoid technical debt. When combined with KISS and DRY, it fosters efficient, maintainable, and practical software development.
Tools for Software Design¶
- UML (Unified Modeling Language): For creating diagrams that represent system architecture and behavior
- Draw.io, Lucidchart: Diagramming tools for creating flowcharts and architectural diagrams
- Enterprise Architect, Visual Paradigm: Comprehensive modeling tools for software design
- Figma, Sketch: For UI/UX design aspects
Importance of Good Software Design¶
- Maintainability: Well-designed software is easier to maintain and update. Changes can be made with minimal impact on other parts of the system, reducing the risk of introducing bugs and making the codebase more manageable as it grows.
- Scalability: Good design allows the system to grow and handle increased load. By considering scalability from the start, systems can accommodate growth in users, data, and traffic without requiring complete rewrites.
- Reliability: Proper design reduces bugs and system failures. Well-structured code with clear separation of concerns is easier to test, debug, and verify, leading to more stable systems.
- Efficiency: Optimized design improves performance and resource utilization. Thoughtful architectural decisions can significantly impact system performance, reducing latency, improving throughput, and minimizing resource consumption.
- Cost-effectiveness: While good design may take more time initially, it saves significant costs in the long run. Technical debt from poor design accumulates over time, making future changes more expensive and risky.
Distributed Systems¶
A distributed system is a collection of independent computers that appear to users as a single coherent system. These computers, often called nodes, are connected by a network and coordinate their actions by passing messages. Distributed systems enable applications to scale beyond the limitations of a single machine, provide fault tolerance, and allow for geographic distribution of services.
The Eight Fallacies of Distributed Computing¶
Before designing distributed systems, understand these common misconceptions, compiled at Sun Microsystems (the first seven are attributed to Peter Deutsch; the eighth was added later by James Gosling):
- The network is reliable → Connections fail, packets are lost
- Latency is zero → Network calls take time (often 1-100ms)
- Bandwidth is infinite → Data transfer has limits
- The network is secure → All traffic can be intercepted
- Topology doesn't change → Networks are reconfigured constantly
- There is one administrator → Multiple teams manage infrastructure
- Transport cost is zero → Serialization, network I/O have costs
- The network is homogeneous → Different hardware, protocols, versions
Design Implications:
- Always handle network failures gracefully
- Implement timeouts and retries with backoff
- Compress data and batch requests when possible
- Encrypt all traffic (TLS/mTLS)
- Use service discovery for dynamic topologies
- Design for independent deployability
- Consider serialization costs in protocol choice
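The second implication above (timeouts and retries with backoff) can be sketched in a few lines of Python. This is a generic illustration, not tied to any particular RPC library; the function names are illustrative:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky network call with exponential backoff plus jitter.

    Backoff prevents hammering a struggling service; jitter spreads
    retries out so many clients don't all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts - surface the failure
            # Exponential backoff: base, 2x, 4x, ... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))
```

In a real system the caught exception would be the timeout/transport error of your RPC client, and only idempotent operations should be retried blindly.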
Characteristics of Distributed Systems¶
- Concurrency: Multiple components execute simultaneously, requiring coordination and synchronization mechanisms.
- No Global Clock: Different nodes may have different perceptions of time, making it challenging to establish a global ordering of events.
- Independent Failures: Components can fail independently without bringing down the entire system. The system must be designed to handle partial failures gracefully.
- Heterogeneity: Nodes may run different operating systems, use different hardware, or be implemented in different programming languages.
- Transparency: The distributed nature should be hidden from users, who should perceive the system as a single, unified entity.
Challenges in Distributed Systems¶
1. Network Partitions¶
Network partitions occur when network failures cause groups of nodes to be unable to communicate with each other. This creates a split-brain scenario where different parts of the system may make conflicting decisions.
Solutions:
- Consensus Algorithms: Use algorithms like Raft or Paxos to ensure nodes agree on decisions even during network partitions
- Quorum-based Systems: Require a majority of nodes to agree before making decisions
- CAP Theorem Trade-offs: Understand that during partitions, you must choose between consistency and availability
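The quorum idea above is simple arithmetic: with `n` replicas, a majority is `n // 2 + 1`, and any two majorities must overlap in at least one node, which is why a majority partition can still make safe decisions. A tiny sketch:

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica set (e.g., 3 of 5, 2 of 3)."""
    return replicas // 2 + 1

def is_committed(acks: int, replicas: int) -> bool:
    """A write is durable once a majority of replicas has acknowledged it."""
    return acks >= quorum_size(replicas)
```

With 5 replicas the quorum is 3, so the system tolerates 2 failed or partitioned-away nodes while still accepting writes.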
2. Consistency¶
Consistency refers to the property that all nodes in a distributed system see the same data at the same time. Achieving strong consistency across distributed nodes is challenging due to network latency and potential failures.
Consistency Models:
- Strong Consistency: All nodes see the same data simultaneously. Updates are immediately visible to all nodes. This is the simplest model but can impact availability and performance.
- Eventual Consistency: Nodes may temporarily have different views of data, but they will eventually converge to the same state. This improves availability and performance but requires handling temporary inconsistencies.
- Weak Consistency: No guarantees about when consistency will be achieved. Used when performance is more critical than consistency.
- Causal Consistency: Preserves causal relationships between events. If event A causally precedes event B, all nodes will see A before B.
- Session Consistency: Guarantees consistency within a user session, but not necessarily across different sessions.
Implementation Strategies:
- Two-Phase Commit (2PC): Ensures all nodes agree to commit or abort a transaction
- Three-Phase Commit (3PC): Improves on 2PC by reducing blocking scenarios
- Vector Clocks: Track causal relationships between events in distributed systems
- CRDTs (Conflict-free Replicated Data Types): Data structures designed to merge automatically without conflicts
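To make the CRDT idea concrete, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs. Each node increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of message order or duplication. The class is an illustrative sketch:

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node_id -> count observed from that node

    def increment(self, amount: int = 1) -> None:
        # A node only ever increments its own slot
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so merges can happen in any order, any number of times
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Because merge is idempotent, re-delivering the same replication message is harmless - exactly the property that lets replicas reconcile without coordination.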
3. Availability¶
Availability is the proportion of time a system is operational and accessible. In distributed systems, achieving high availability requires redundancy and fault tolerance.
Availability Levels:
- 99% (Two 9s): ~87.6 hours of downtime per year
- 99.9% (Three 9s): ~8.76 hours of downtime per year
- 99.99% (Four 9s): ~52.56 minutes of downtime per year
- 99.999% (Five 9s): ~5.26 minutes of downtime per year
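The downtime figures above follow directly from the availability percentage; a quick sanity check:

```python
def downtime_per_year(availability_pct: float):
    """Return (hours, minutes) of allowed downtime per year
    for a given availability percentage."""
    hours = (1 - availability_pct / 100) * 365 * 24
    return hours, hours * 60
```

For example, 99% allows about 87.6 hours per year, and 99.99% about 52.6 minutes.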
Strategies for High Availability:
- Redundancy: Deploy multiple instances of services across different machines, data centers, or regions
- Health Checks: Continuously monitor service health and automatically route traffic away from unhealthy instances
- Graceful Degradation: Design systems to continue operating with reduced functionality when some components fail
- Circuit Breakers: Prevent cascading failures by stopping requests to failing services
- Retry Mechanisms: Implement intelligent retry strategies with exponential backoff
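The circuit-breaker strategy from the list above can be sketched as a small state machine: after a run of consecutive failures the circuit "opens" and calls fail fast, and after a cooldown one trial call is allowed through (the "half-open" state). This is a minimal, single-threaded illustration with hypothetical parameter names:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of waiting on a dead service
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Failing fast is what stops a cascade: callers spend no threads or timeouts on a service that is known to be down.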
4. Partition Tolerance¶
Partition tolerance is the system's ability to continue operating despite network partitions. This is a fundamental requirement for distributed systems, as network partitions are inevitable.
CAP Theorem:
The CAP theorem states that in a distributed system, you can only guarantee two out of three properties:
- Consistency (C): All nodes see the same data simultaneously
- Availability (A): System remains operational and responds to requests
- Partition Tolerance (P): System continues despite network partitions
Since partition tolerance is essential in distributed systems, the practical choice is between:
- CP Systems: Prioritize consistency and partition tolerance (e.g., traditional databases, distributed locks)
- AP Systems: Prioritize availability and partition tolerance (e.g., DNS, web caches)
CAP Theorem Visualized:

                Consistency (C)
                     /\
                    /  \
               CA  /    \  CP
                  /      \
                 /________\
    Availability (A)  AP  Partition Tolerance (P)
CA: Not practical in distributed systems (partitions happen)
CP: HBase, MongoDB (with majority writes), Zookeeper
AP: Cassandra, CouchDB, DynamoDB (eventual consistency)
Practical CAP Trade-offs:
| Scenario | Choice | Reasoning |
|---|---|---|
| Banking transaction | CP | Must prevent double-spending |
| Social media feed | AP | Stale data acceptable, must always load |
| Inventory count | Depends | CP for checkout, AP for display |
| User sessions | AP | Availability critical, eventual sync OK |
| Leader election | CP | Must have single leader |
| DNS | AP | Stale records OK, must respond |
CAP Theorem Misconceptions:
- "Choose 2 of 3" → More nuanced: during normal operation, you can have all three. CAP applies during partitions.
- Binary choice → Consistency and availability exist on a spectrum (tunable consistency).
- Static decision → Different operations can make different trade-offs.
PACELC Theorem (Extension of CAP):
If there's a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.
| System | P: A/C | E: L/C | Description |
|---|---|---|---|
| DynamoDB | A | L | Always available, low latency, eventual consistency |
| MongoDB | C | C | Consistent, may sacrifice latency |
| Cassandra | A | L | Tunable, defaults to availability |
| MySQL (single) | C | C | Strong consistency, higher latency |
BASE Properties (Alternative to ACID):
- Basically Available: System remains available even during failures
- Soft State: System state may change over time without input
- Eventually Consistent: System will become consistent over time
Eventual Consistency Models:
- Read-Your-Writes: You always see your own writes
- Monotonic Reads: Once you read a value, you never see older values
- Monotonic Writes: Writes by same client applied in order
- Causal Consistency: Causally related operations seen in order
    # Example: tunable consistency with the Python cassandra-driver.
    # The consistency level is set on the statement object.
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    # Write with QUORUM: a majority of replicas must acknowledge
    insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(insert, (user_id, name))

    # Read with ONE: any single replica answers - faster, but may be stale
    select = SimpleStatement(
        "SELECT * FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    row = session.execute(select, (user_id,)).one()
Distributed System Architectures¶
1. Client-Server Architecture¶
The most common distributed architecture where clients request services from servers. Servers provide resources and services, while clients consume them.
Characteristics:
- Centralized control and data management
- Clear separation of concerns
- Easier to secure and manage
- Single point of failure at the server
Variations:
- Multi-tier Architecture: Separates presentation, business logic, and data layers
- Thin Client: Minimal processing on client, most work done on server
- Thick Client: Significant processing on client side
2. Peer-to-Peer (P2P) Architecture¶
All nodes have equal capabilities and responsibilities. Nodes can act as both clients and servers.
Characteristics:
- Decentralized, no single point of failure
- Scalable as nodes can join and leave dynamically
- More complex to manage and secure
- Examples: BitTorrent, blockchain networks, distributed hash tables (DHT)
Types:
- Structured P2P: Nodes organized in a specific topology (e.g., DHT with consistent hashing)
- Unstructured P2P: Nodes connect randomly (e.g., Gnutella)
- Hybrid P2P: Combines centralized and decentralized elements (e.g., Napster)
3. Microservices Architecture¶
System composed of small, independent services that communicate over well-defined APIs. Each service is responsible for a specific business capability.
Characteristics:
- Services are independently deployable and scalable
- Each service can use different technologies
- Services communicate via lightweight protocols (HTTP/REST, gRPC, message queues)
- Fault isolation: failure in one service doesn't bring down the entire system
Benefits:
- Technology diversity: choose best tool for each service
- Independent scaling: scale services based on their specific needs
- Faster development: teams can work independently
- Easier to understand: each service has a focused responsibility
Challenges:
- Service coordination and communication overhead
- Distributed transaction management
- Data consistency across services
- Network latency and reliability
- Operational complexity (monitoring, logging, debugging)
4. Event-Driven Architecture¶
Components communicate through events. When something significant happens, components publish events that other components can subscribe to and react to.
Characteristics:
- Loose coupling between components
- Asynchronous communication
- Highly scalable and responsive
- Event sourcing: store all changes as a sequence of events
Event Patterns:
- Event Notification: Components notify others of events
- Event-Carried State Transfer: Events contain all data needed by subscribers
- Event Sourcing: Store state changes as events, reconstruct state by replaying events
- CQRS (Command Query Responsibility Segregation): Separate read and write models
Technologies:
- Message brokers: Apache Kafka, RabbitMQ, Amazon SQS
- Event streaming: Apache Kafka, AWS Kinesis
- Event stores: EventStore, Apache Pulsar
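The loose coupling these patterns rely on can be sketched with a minimal in-memory event bus; `EventBus` and the event names are illustrative, not tied to any broker listed above:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory publish/subscribe bus (illustrative only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher knows nothing about who consumes the event
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
shipments = []
bus.subscribe("OrderCreated", lambda e: shipments.append(e["order_id"]))
bus.publish("OrderCreated", {"order_id": 42})
```

A real broker adds durability, ordering, and delivery guarantees on top of this dispatch loop; the decoupling idea is the same.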
Distributed System Patterns¶
1. Leader Election¶
In distributed systems, sometimes you need a single node to coordinate activities. Leader election algorithms ensure that exactly one node is selected as the leader.
Use Cases:
- Coordinating distributed transactions
- Managing distributed locks
- Organizing replicated data
- Coordinating cluster-wide operations
Algorithms:
- Bully Algorithm: Highest ID node becomes leader
- Ring Algorithm: Nodes organized in a ring, pass election messages
- Raft: Consensus algorithm that includes leader election
- ZooKeeper-based: use ZooKeeper's ephemeral sequential znodes (or etcd leases) to elect a leader
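As a toy illustration of the Bully algorithm listed above (node IDs and the `alive` flag are simulated, with no real message passing):

```python
class Node:
    def __init__(self, node_id, alive=True):
        self.id = node_id
        self.alive = alive

def start_election(nodes, starter_id):
    """Bully: the starter challenges every higher-ID node; any live
    higher node takes over the election, so the highest live ID wins."""
    higher_alive = [n.id for n in nodes if n.id > starter_id and n.alive]
    if not higher_alive:
        return starter_id  # No response from above: declare victory
    # A responder takes over and repeats the process further up
    return start_election(nodes, min(higher_alive))

# Node 5 has crashed, so node 4 ends up as leader
nodes = [Node(1), Node(2), Node(3), Node(4), Node(5, alive=False)]
leader = start_election(nodes, starter_id=1)
```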
Consensus Algorithms Deep Dive¶
Consensus algorithms allow distributed systems to agree on a single value or sequence of values, even when some nodes fail.
Raft Consensus Algorithm:
Raft is designed to be understandable (unlike Paxos). It separates consensus into three sub-problems:
1. Leader Election:
┌─────────────────────────────────────────────────────────────────┐
│                           RAFT STATES                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│               timeout,             wins majority                │
│              start election        of votes                     │
│ ┌──────────┐ ──────────────▶ ┌───────────┐ ────────▶ ┌────────┐ │
│ │ Follower │                 │ Candidate │           │ Leader │ │
│ └──────────┘ ◀────────────── └───────────┘           └────────┘ │
│      ▲        loses election,      │  ▲                  │      │
│      │        discovers leader     └──┘ timeout:         │      │
│      │                            new election           │      │
│      └───────────────────────────────────────────────────┘      │
│              discovers server with higher term                  │
└─────────────────────────────────────────────────────────────────┘
- Nodes start as followers
- If no heartbeat from leader, become candidate
- Request votes from other nodes
- Majority votes → become leader
- Leader sends heartbeats to prevent new elections
2. Log Replication:
Leader Followers
│ │
│ AppendEntries(entries) │
│──────────────────────────────▶│
│ │
│ Success/Failure │
│◀──────────────────────────────│
│ │
│ (On majority success) │
│ Commit entry │
│ Apply to state machine │
│ │
- Leader receives client requests
- Appends to log, replicates to followers
- Once majority confirms, entry is committed
- Leader notifies followers to apply
3. Safety:
- Only nodes with up-to-date logs can become leader
- Log entries committed in a term are present in future leaders
- State machines apply entries in same order
Raft Example Implementation Pseudocode:
class RaftNode:
    def __init__(self, node_id, peers):
        self.id = node_id
        self.peers = peers          # The other nodes in the cluster
        self.state = 'follower'
        self.current_term = 0
        self.voted_for = None
        self.log = []
        self.commit_index = 0
        self.last_applied = 0

    def on_election_timeout(self):
        self.state = 'candidate'
        self.current_term += 1
        self.voted_for = self.id
        votes = 1  # Vote for self
        for peer in self.peers:
            response = peer.request_vote(
                term=self.current_term,
                candidate_id=self.id,
                last_log_index=len(self.log) - 1,
                last_log_term=self.log[-1].term if self.log else 0
            )
            if response.vote_granted:
                votes += 1
        # Majority of the full cluster (peers plus self)
        if votes > (len(self.peers) + 1) / 2:
            self.become_leader()

    def append_entries(self, entries, leader_commit):
        # Follower receives entries from leader
        self.log.extend(entries)
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
            self.apply_committed_entries()
Paxos vs Raft:
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Difficult | Designed to be simple |
| Leader | Optional (Multi-Paxos) | Required |
| Safety | Proven correct | Proven correct |
| Liveness | Not guaranteed (FLP impossibility) | Not guaranteed (FLP); randomized timeouts make it practical |
| Adoption | Chubby, Spanner | etcd, Consul, TiDB |
Byzantine Fault Tolerance (BFT):
Raft and Paxos handle crash failures but not malicious nodes. For Byzantine faults:
- PBFT (Practical BFT): Tolerates f Byzantine nodes with 3f+1 total
- Use cases: Blockchain, financial systems
- Trade-off: Higher latency, more messages
2. Distributed Locking¶
Distributed locks allow multiple processes to coordinate access to shared resources across a distributed system.
Requirements:
- Mutual exclusion: only one process can hold the lock
- Deadlock freedom: system should not deadlock
- Fault tolerance: lock should be released if holder crashes
- Performance: low latency for acquiring/releasing locks
Implementation Approaches:
- Database-based: Use database transactions to implement locks
- Redis-based: Use Redis with SET NX (set if not exists) and expiration
- ZooKeeper/etcd: Use consensus systems for distributed coordination
- Lease-based: Locks expire after a time period, requiring renewal
Challenges:
- Clock skew affecting expiration times
- Network partitions causing split-brain scenarios
- Performance under high contention
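A sketch of the Redis-style approach: `FakeRedis` is an in-memory stand-in for a real client (with Redis the equivalents are `SET key token NX EX ttl` and an atomic compare-and-delete Lua script), and the random token prevents one process from releasing a lock another now holds:

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for a Redis client (illustrative only)."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl_s):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return False  # Key exists and has not expired
        self.store[key] = (value, time.time() + ttl_s)
        return True

    def compare_and_delete(self, key, value):
        entry = self.store.get(key)
        if entry and entry[0] == value:
            del self.store[key]
            return True
        return False

def acquire_lock(client, resource, ttl_s=10):
    """Try to take the lock; return a release token, or None if held."""
    token = str(uuid.uuid4())
    if client.set_nx_ex(f"lock:{resource}", token, ttl_s):
        return token
    return None

def release_lock(client, resource, token):
    # Only the holder's token may release: avoids deleting a lock
    # someone else acquired after our TTL expired
    return client.compare_and_delete(f"lock:{resource}", token)

r = FakeRedis()
t1 = acquire_lock(r, "invoice:7")
t2 = acquire_lock(r, "invoice:7")  # Second caller is refused
```

The TTL is the lease: if the holder crashes, the lock frees itself, which is exactly the clock-skew hazard the challenges above describe.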
3. Distributed Transactions¶
Transactions that span multiple nodes in a distributed system. Ensuring ACID properties across distributed nodes is complex.
Two-Phase Commit (2PC):
┌────────────┐ ┌────────────┐ ┌────────────┐
│Coordinator │ │Participant1│ │Participant2│
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
│ ───── PREPARE ─────────────────▶│ │
│ ───── PREPARE ────────────────────────────────────────────────────▶
│ │ │
│◀──── VOTE YES ──────────────────│ │
│◀──── VOTE YES ────────────────────────────────────────────────────│
│ │ │
│ ───── COMMIT ──────────────────▶│ │
│ ───── COMMIT ─────────────────────────────────────────────────────▶
│ │ │
│◀──── ACK ───────────────────────│ │
│◀──── ACK ─────────────────────────────────────────────────────────│
- Prepare Phase: Coordinator asks all participants to prepare
- Commit Phase: If all agree, coordinator tells all to commit; otherwise, abort
Limitations:
- Blocking: if coordinator fails, participants may be blocked
- Single point of failure at coordinator
- Poor performance due to multiple round trips
Alternatives:
- Saga Pattern: Long-running transactions broken into smaller, compensatable transactions
- TCC (Try-Confirm-Cancel): Three-phase pattern for distributed transactions
- Eventual Consistency: Accept eventual consistency instead of strong consistency
Saga Pattern Deep Dive¶
A saga is a sequence of local transactions where each transaction updates a single service and publishes events. If any transaction fails, compensating transactions undo the changes.
Choreography-Based Saga:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Order │ │ Inventory │ │ Payment │ │ Shipping │
│ Service │ │ Service │ │ Service │ │ Service │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │
│ OrderCreated │ │ │
│──────────────────▶│ │ │
│ │ │ │
│ │ InventoryReserved │ │
│ │──────────────────▶│ │
│ │ │ │
│ │ │ PaymentProcessed │
│ │ │──────────────────▶│
│ │ │ │
│ ShippingScheduled │ │ │
│◀──────────────────────────────────────────────────────────│
Orchestration-Based Saga:
┌─────────────┐
│ Saga │
│Orchestrator │
└──────┬──────┘
│
┌─────────────────────────────┼─────────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Order │ │ Payment │ │ Shipping │
│ Service │ │ Service │ │ Service │
└─────────────┘ └─────────────┘ └─────────────┘
Saga Implementation Example:
class OrderSaga:
    """Orchestration-based saga for order processing."""
    steps = [
        ('order', 'create_order', 'cancel_order'),
        ('inventory', 'reserve_items', 'release_items'),
        ('payment', 'process_payment', 'refund_payment'),
        ('shipping', 'schedule_shipping', 'cancel_shipping'),
    ]

    def execute(self, order_data):
        completed_steps = []
        try:
            for service, action, compensation in self.steps:
                result = self.call_service(service, action, order_data)
                completed_steps.append((service, compensation, result))
            return {'status': 'success', 'order_id': order_data['id']}
        except SagaStepFailed as e:
            # Compensate in reverse order
            for service, compensation, data in reversed(completed_steps):
                self.call_service(service, compensation, data)
            return {'status': 'failed', 'error': str(e)}

    def call_service(self, service, action, data):
        # Call service via HTTP/gRPC/message queue
        pass
Saga vs 2PC:
| Aspect | Two-Phase Commit | Saga |
|---|---|---|
| Consistency | Strong (ACID) | Eventual |
| Availability | Lower (blocking) | Higher (no global lock) |
| Latency | Higher (2+ round trips) | Lower |
| Complexity | Simpler logic | Complex compensations |
| Rollback | Automatic | Manual (compensating transactions) |
| Isolation | Full | Potential anomalies |
Saga Design Considerations:
- Compensating Transactions: Design every step to be reversible
- Idempotency: Steps may be retried; must be idempotent
- Isolation: May see intermediate states; design for this
- Timeouts: Handle slow or failed services
- Semantic Lock: Use flags to prevent concurrent modifications
CQRS (Command Query Responsibility Segregation)¶
CQRS separates read and write operations into different models, allowing independent optimization.
CQRS Architecture
┌─────────────────────────────────────────────────────────────┐
│ Client │
└─────────────────────────────────────────────────────────────┘
│ │
│ Commands │ Queries
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Command Handler │ │ Query Handler │
│ (Write Model) │ │ (Read Model) │
└──────────┬──────────┘ └──────────┬──────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Write Database │──Sync/Async─▶│ Read Database │
│ (Normalized) │ Events │ (Denormalized) │
└─────────────────────┘ └─────────────────────┘
Write Model (Commands):
- Handles create, update, delete operations
- Validates business rules
- Normalized for consistency
- Optimized for write throughput
Read Model (Queries):
- Handles all read operations
- Denormalized for query performance
- Can have multiple read models for different query patterns
- Eventually consistent with write model
When to Use CQRS:
- Read and write patterns have different requirements
- Read-heavy workloads (10:1 or higher)
- Complex domain with rich business logic
- Need for different scaling strategies
- Event sourcing architecture
CQRS with Event Sourcing:
# Command handling
class OrderCommandHandler:
    def handle_create_order(self, command):
        # Validate
        if not self.inventory.has_stock(command.items):
            raise InsufficientStock()
        # Create event
        event = OrderCreatedEvent(
            order_id=generate_id(),
            customer_id=command.customer_id,
            items=command.items,
            timestamp=now()
        )
        # Store event
        self.event_store.append(event)
        # Publish for read model sync
        self.event_bus.publish(event)

# Query handling
class OrderQueryHandler:
    def get_order_summary(self, order_id):
        # Read from denormalized read database
        return self.read_db.query(
            "SELECT * FROM order_summaries WHERE id = ?",
            order_id
        )

# Read model projector
class OrderProjector:
    def on_order_created(self, event):
        # Update denormalized read model
        self.read_db.execute("""
            INSERT INTO order_summaries
            (id, customer_name, item_count, total, status)
            VALUES (?, ?, ?, ?, ?)
        """, [event.order_id, ...])
Event Sourcing¶
Instead of storing current state, store all changes (events) as the source of truth.
Traditional: Store current state
┌────────────────────────────────────────┐
│ User: {id: 1, name: "John", age: 31} │
└────────────────────────────────────────┘
Event Sourcing: Store all events
┌────────────────────────────────────────┐
│ UserCreated {id: 1, name: "John"} │
│ NameChanged {id: 1, name: "Johnny"} │
│ NameChanged {id: 1, name: "John"} │
│ AgeUpdated {id: 1, age: 30} │
│ AgeUpdated {id: 1, age: 31} │
└────────────────────────────────────────┘
Benefits:
- Complete audit trail
- Temporal queries (state at any point in time)
- Event replay for debugging
- Easy to add new projections
- Natural fit for event-driven systems
Challenges:
- Event schema evolution
- Eventual consistency
- Query complexity (need projections)
- Storage growth
Event Store Implementation:
class EventStore:
    def append(self, aggregate_id, events, expected_version):
        """Append events with optimistic concurrency."""
        with self.db.transaction():
            current_version = self.get_version(aggregate_id)
            if current_version != expected_version:
                raise ConcurrencyError("Aggregate modified")
            for i, event in enumerate(events):
                self.db.execute("""
                    INSERT INTO events
                    (aggregate_id, version, event_type, data, timestamp)
                    VALUES (?, ?, ?, ?, ?)
                """, [
                    aggregate_id,
                    expected_version + i + 1,
                    event.__class__.__name__,
                    serialize(event),
                    now()
                ])

    def get_events(self, aggregate_id, from_version=0):
        """Retrieve events to rebuild state."""
        return self.db.query("""
            SELECT * FROM events
            WHERE aggregate_id = ? AND version > ?
            ORDER BY version
        """, [aggregate_id, from_version])

class Order:
    """Event-sourced aggregate."""
    def __init__(self, events):
        self.id = None
        self.status = None
        self.items = []
        self.version = 0
        for event in events:
            self.apply(event)

    def apply(self, event):
        if isinstance(event, OrderCreated):
            self.id = event.order_id
            self.status = 'created'
            self.items = event.items
        elif isinstance(event, OrderShipped):
            self.status = 'shipped'
        # ... more event handlers
        self.version += 1
4. Service Discovery¶
Mechanism for services to find and communicate with each other in a distributed system.
Patterns:
- Client-Side Discovery: Clients query a service registry and select an instance
- Server-Side Discovery: Load balancer queries service registry and routes requests
Implementations:
- Service Registry: Centralized registry (e.g., Consul, Eureka, etcd)
- DNS-based: Use DNS for service discovery
- Service Mesh: Infrastructure layer handling service-to-service communication (e.g., Istio, Linkerd)
5. Circuit Breaker¶
Prevents cascading failures by stopping requests to a failing service and providing fallback behavior.
States:
- Closed: Normal operation, requests flow through
- Open: Service is failing, requests fail fast without calling service
- Half-Open: Testing if service has recovered, allowing limited requests
Benefits:
- Prevents resource exhaustion from retrying failed requests
- Provides fast failure feedback
- Allows services time to recover
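The three states can be sketched in a few lines; the thresholds and the injected clock are illustrative choices, not a production-ready breaker:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (thresholds are illustrative)."""
    def __init__(self, failure_threshold=3, recovery_timeout=30, clock=time.time):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock            # Injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback=None):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"  # Allow one probe request
            else:
                return fallback           # Fail fast, no network call
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures, (re)opens the circuit
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            return fallback
        self.failures = 0
        self.state = "closed"
        return result
```

Libraries such as resilience4j or pybreaker add per-window failure rates and metrics, but follow this same state machine.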
Distributed Data Management¶
1. Replication¶
Maintaining copies of data across multiple nodes to improve availability, performance, and fault tolerance.
Replication Strategies:
- Master-Slave (Primary-Replica): One master handles writes, replicas handle reads
- Master-Master (Multi-Master): Multiple nodes can handle writes, requiring conflict resolution
- Synchronous Replication: Wait for all replicas to confirm before acknowledging write
- Asynchronous Replication: Acknowledge write immediately, replicate in background
Consistency Models:
- Strong Consistency: All replicas have same data at all times
- Eventual Consistency: Replicas will converge to same state eventually
- Read-Your-Writes: User always sees their own writes
- Monotonic Reads: User never sees older data after seeing newer data
2. Sharding (Partitioning)¶
Dividing data across multiple nodes to distribute load and enable horizontal scaling.
Sharding Strategies:
- Range-based Sharding: Partition by value ranges (e.g., user IDs 1-1000 on shard 1)
- Hash-based Sharding: Use hash function to determine shard (e.g., hash(user_id) % num_shards)
- Directory-based Sharding: Maintain lookup table mapping keys to shards
- Geographic Sharding: Partition by geographic location
Challenges:
- Rebalancing: Moving data when adding/removing shards
- Cross-shard Queries: Queries spanning multiple shards are complex
- Hot Spots: Uneven distribution causing some shards to be overloaded
- Transaction Support: Transactions across shards are difficult
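Hash-based and range-based routing from the list above, as a sketch (the shard count and range boundaries are made up):

```python
import hashlib

NUM_SHARDS = 4

def hash_shard(user_id):
    """Hash-based: even spread, but naive resharding moves most keys."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Range-based: boundaries are explicit, and a hot range can be split
RANGES = [
    (1, 1000, "shard-0"),
    (1001, 2000, "shard-1"),
    (2001, 10**9, "shard-2"),
]

def range_shard(user_id):
    for low, high, shard in RANGES:
        if low <= user_id <= high:
            return shard
    raise KeyError(f"no shard covers id {user_id}")
```

The trade-off is visible in the code: the hash router needs no metadata but couples placement to `NUM_SHARDS`, while the range router needs the `RANGES` table (the directory) but can be rebalanced by editing it.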
3. Distributed Caching¶
Caching data across multiple nodes to improve performance and reduce load on backend systems.
Cache Patterns:
- Cache-Aside (Lazy Loading): Application checks cache, loads from DB if miss, stores in cache
- Write-Through: Write to cache and DB simultaneously
- Write-Back (Write-Behind): Write to cache, write to DB asynchronously
- Refresh-Ahead: Proactively refresh cache before expiration
Distributed Cache Solutions:
- Redis: In-memory data structure store, supports clustering
- Memcached: Simple distributed memory caching system
- Hazelcast: In-memory data grid with distributed caching
- Apache Ignite: Distributed caching and computing platform
Cache Invalidation:
- TTL (Time-To-Live): Cache entries expire after fixed time
- Event-based Invalidation: Invalidate on data change events
- Version-based: Use version numbers to detect stale data
Consistent Hashing¶
Consistent hashing is crucial for distributed caching and sharding. It minimizes data movement when nodes are added or removed.
Problem with Simple Hashing:
# Simple modulo hashing
server = hash(key) % num_servers
# If num_servers changes from 3 to 4:
# hash("user:123") % 3 = 1 → Server 1
# hash("user:123") % 4 = 2 → Server 2 (DIFFERENT!)
# Almost ALL keys get remapped!
Consistent Hashing Solution:
Consistent Hash Ring
0/360°
│
┌────────┴────────┐
╱ ╲
N1 N2
(50°) (120°)
╱ ╲
│ │
270° ────┤ ├──── 180°
│ │
╲ ╱
K1 N3
(240°) (200°)
╲ ╱
└────────┬────────┘
│
240°
Key K1 (hash=240°) → Clockwise → Node N1 (50°)
How It Works:
- Hash nodes and keys to positions on a ring (0-360° or 0-2^32)
- Each key is assigned to the first node clockwise from its position
- Adding/removing a node only affects keys in one segment
Virtual Nodes (Vnodes):
To improve load distribution, each physical node has multiple virtual positions:
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, replicas=150):
        self.ring = {}
        self.sorted_keys = []
        self.replicas = replicas
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Add node with virtual replicas."""
        for i in range(self.replicas):
            key = self._hash(f"{node}:{i}")
            self.ring[key] = node
            bisect.insort(self.sorted_keys, key)

    def remove_node(self, node):
        """Remove node and its virtual replicas."""
        for i in range(self.replicas):
            key = self._hash(f"{node}:{i}")
            del self.ring[key]
            self.sorted_keys.remove(key)

    def get_node(self, key):
        """Find the node responsible for this key."""
        if not self.ring:
            return None
        hash_key = self._hash(key)
        # Binary search for the first ring position clockwise of the key
        idx = bisect.bisect_left(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0  # Wrap around to the first node
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
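To make the benefit concrete, this standalone sketch rebuilds a tiny ring inline and measures how many of 10,000 keys move when a fourth node joins: only the keys the new node takes over are remapped (roughly a quarter), versus nearly all keys under modulo hashing:

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=150):
    """Sorted (position, node) pairs, with virtual nodes per server."""
    return sorted((h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))

def lookup(ring, key):
    positions = [pos for pos, _ in ring]
    # First ring position clockwise of the key's hash (wrapping around)
    idx = bisect.bisect_left(positions, h(key)) % len(ring)
    return ring[idx][1]

keys = [f"user:{i}" for i in range(10000)]
ring3 = build_ring(["A", "B", "C"])
ring4 = build_ring(["A", "B", "C", "D"])
before = {k: lookup(ring3, k) for k in keys}
after = {k: lookup(ring4, k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
# Every moved key lands on the new node D; the rest stay put.
```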
Consistent Hashing Applications:
- Distributed Caches: Redis Cluster, Memcached
- Database Sharding: Cassandra, DynamoDB
- Load Balancing: consistently route a client to the same server without storing per-session affinity state
- Distributed File Systems: Ceph, GlusterFS
Cache Strategies Deep Dive¶
Cache-Aside (Lazy Loading):
def get_user(user_id):
    # 1. Try cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return cached
    # 2. Cache miss - load from DB
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    # 3. Populate cache
    cache.set(f"user:{user_id}", user, ttl=3600)
    return user

def update_user(user_id, data):
    # Update DB
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    # Invalidate cache (or update it)
    cache.delete(f"user:{user_id}")
Write-Through:
def update_user(user_id, data):
    # Write to cache and DB atomically
    with transaction():
        db.execute("UPDATE users SET ... WHERE id = ?", user_id)
        cache.set(f"user:{user_id}", data, ttl=3600)
Write-Behind (Write-Back):
def update_user(user_id, data):
    # Write to cache immediately
    cache.set(f"user:{user_id}", data, ttl=3600)
    # Queue DB write for later
    write_queue.enqueue({
        'table': 'users',
        'id': user_id,
        'data': data
    })

# Background worker
def process_write_queue():
    while True:
        batch = write_queue.dequeue_batch(size=100)
        db.bulk_update(batch)
Cache Stampede Prevention:
import random
import time

def get_with_lock(key, fetch_func, ttl=3600):
    """Prevent cache stampede with distributed lock."""
    value = cache.get(key)
    if value:
        return value
    lock_key = f"lock:{key}"
    # Try to acquire lock
    if cache.set(lock_key, "1", nx=True, ex=10):  # NX = only if not exists
        try:
            # We got the lock - fetch and cache
            value = fetch_func()
            cache.set(key, value, ex=ttl)
            return value
        finally:
            cache.delete(lock_key)
    else:
        # Another process is fetching - wait and retry
        time.sleep(0.1)
        return get_with_lock(key, fetch_func, ttl)

def get_with_probabilistic_refresh(key, fetch_func, ttl=3600, beta=1):
    """Prevent stampede with probabilistic early refresh."""
    data = cache.get_with_ttl(key)
    if not data:
        value = fetch_func()
        cache.set(key, value, ex=ttl)
        return value
    value, remaining_ttl = data
    # Probabilistically refresh before expiration
    # Higher beta = more aggressive refresh
    expiry_gap = ttl - remaining_ttl
    if remaining_ttl - (beta * expiry_gap * random.random()) <= 0:
        # Refresh in background
        async_refresh(key, fetch_func, ttl)
    return value
Bloom Filters¶
A space-efficient probabilistic data structure for membership testing.
Properties:
- Can tell if element is definitely not in set (no false negatives)
- May have false positives (says "maybe in set" when not)
- Cannot delete elements (use Counting Bloom Filter)
- Very space efficient
Use Cases:
- Cache: Check if item might be in cache before expensive lookup
- Database: Check if row might exist before disk read
- Spam filtering: Check if URL might be malicious
- Distributed systems: Check if data might be on a node
import mmh3  # MurmurHash
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def add(self, item):
        for i in range(self.num_hashes):
            index = mmh3.hash(item, i) % self.size
            self.bit_array[index] = 1

    def might_contain(self, item):
        for i in range(self.num_hashes):
            index = mmh3.hash(item, i) % self.size
            if not self.bit_array[index]:
                return False  # Definitely not in set
        return True  # Might be in set

# Usage
bf = BloomFilter(size=1000000, num_hashes=7)
bf.add("user:123")
bf.add("user:456")
bf.might_contain("user:123")  # True
bf.might_contain("user:789")  # Probably False (might be True)
Optimal Parameters:
n = expected number of elements
p = desired false positive rate
m = optimal number of bits = -n * ln(p) / (ln(2)^2)
k = optimal number of hashes = (m/n) * ln(2)
Example: 1 million items, 1% false positive rate
m = ~9.6 million bits (~1.2 MB)
k = 7 hash functions
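The same formulas in code, reproducing the worked example:

```python
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n items at false
    positive rate p: m = -n*ln(p)/ln(2)^2, k = (m/n)*ln(2)."""
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return int(math.ceil(m)), int(round(k))

# 1 million items at a 1% false positive rate
m, k = bloom_params(1_000_000, 0.01)
# m is about 9.6 million bits (~1.2 MB) and k = 7, matching the
# example above
```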
Communication in Distributed Systems¶
1. Remote Procedure Call (RPC)¶
Allows a program to call a procedure on a remote server as if it were a local procedure call.
Characteristics:
- Synchronous communication (blocking)
- Type-safe with code generation
- Efficient binary protocols
- Examples: gRPC, Apache Thrift, CORBA
Challenges:
- Network latency and failures
- Parameter marshalling/unmarshalling
- Version compatibility
- Service discovery
2. Message Queues¶
Asynchronous communication pattern where messages are sent to queues and processed by consumers.
Benefits:
- Decoupling: producers and consumers don't need to know about each other
- Load leveling: smooth out traffic spikes
- Reliability: messages persisted until processed
- Scalability: multiple consumers can process messages
Message Queue Patterns:
- Point-to-Point: One consumer processes each message
- Publish-Subscribe: Multiple consumers receive each message
- Request-Reply: Synchronous pattern over async infrastructure
Technologies:
- RabbitMQ: Feature-rich message broker
- Apache Kafka: Distributed event streaming platform
- Amazon SQS: Managed message queue service
- Apache Pulsar: Cloud-native messaging and streaming
3. REST APIs¶
Representational State Transfer (REST) is an architectural style for designing networked applications.
Principles:
- Stateless: each request contains all information needed
- Resource-based: URLs represent resources
- HTTP methods: GET, POST, PUT, DELETE for operations
- Standard status codes: 200, 404, 500, etc.
Benefits:
- Simple and widely understood
- Works well with HTTP caching
- Language and platform agnostic
- Easy to test and debug
Monitoring and Observability¶
Distributed systems require comprehensive monitoring to understand system behavior and diagnose issues.
Three Pillars of Observability:
1. Metrics: Numerical measurements over time
   - System metrics: CPU, memory, disk, network
   - Application metrics: request rate, error rate, latency
   - Business metrics: revenue, user signups, conversions
2. Logs: Discrete events with timestamps
   - Structured logging: JSON format for easy parsing
   - Log aggregation: centralize logs from all services
   - Log levels: DEBUG, INFO, WARN, ERROR
3. Traces: Request flows across service boundaries
   - Distributed tracing: track requests across services
   - Span: single operation within a trace
   - Trace context propagation: pass trace IDs across services
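Structured logging (pillar 2) can be sketched with the standard library; the field set and the `service` value are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "checkout",  # Illustrative static field
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order created id=%s", 42)
```

One JSON object per line is what aggregation pipelines like the ELK stack expect; parsers index the fields instead of grepping free text.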
Tools:
- Metrics: Prometheus, Grafana, Datadog
- Logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, CloudWatch
- Tracing: Jaeger, Zipkin, AWS X-Ray
Scalability¶
Scalability is the capability of a system to handle a growing amount of work by adding resources. A scalable system can accommodate growth in users, data volume, or traffic without significant degradation in performance or requiring a complete redesign.
Types of Scalability¶
1. Vertical Scaling (Scale Up)¶
Increasing the capacity of existing hardware or software by adding more power (CPU, RAM, storage) to a single machine.
Characteristics:
- Simpler to implement: upgrade existing server
- No code changes typically required
- Limited by hardware maximums
- Single point of failure
- Can be expensive (high-end hardware costs)
When to Use:
- Small to medium applications
- When simplicity is more important than high availability
- Applications that can't be easily distributed
- Legacy systems that are difficult to refactor
Limitations:
- Physical limits: maximum CPU, memory, and I/O capacity
- Cost increases disproportionately with capacity (high-end hardware carries a premium)
- Downtime required for upgrades
- Cannot exceed single machine capabilities
2. Horizontal Scaling (Scale Out)¶
Adding more machines or instances to handle increased load, distributing work across multiple nodes.
Characteristics:
- Virtually unlimited capacity
- Better fault tolerance (no single point of failure)
- Cost-effective (commodity hardware)
- Requires distributed system design
- More complex to manage
When to Use:
- High-traffic applications
- Systems requiring high availability
- Cloud-native applications
- Modern microservices architectures
Challenges:
- State management: stateless services are easier to scale
- Data consistency across nodes
- Load distribution
- Operational complexity
Scalability Dimensions¶
1. Load Scalability¶
Ability to handle increased number of requests or transactions per unit of time.
Metrics:
- Throughput: Number of requests processed per second
- Requests Per Second (RPS): Measure of system capacity
- Transactions Per Second (TPS): For transactional systems
Strategies:
- Add more servers/instances
- Optimize code and database queries
- Implement caching layers
- Use CDN for static content
- Database read replicas for read-heavy workloads
2. Geographic Scalability¶
Ability to serve users from different geographic locations with acceptable latency.
Challenges:
- Network latency increases with distance
- Data sovereignty and compliance requirements
- Replication across regions
- Managing consistency across regions
Solutions:
- Content Delivery Networks (CDN): Cache content closer to users
- Edge Computing: Process requests at edge locations
- Multi-Region Deployment: Deploy services in multiple regions
- Database Replication: Replicate data across regions
- DNS-based Routing: Route users to nearest data center
3. Administrative Scalability¶
Ability to manage and maintain the system as it grows in size and complexity.
Challenges:
- More components to monitor and manage
- Configuration management complexity
- Deployment coordination
- Team coordination and communication
Solutions:
- Infrastructure as Code (IaC): Automate infrastructure provisioning
- Container Orchestration: Kubernetes, Docker Swarm for managing containers
- Service Mesh: Manage service-to-service communication
- Centralized Logging and Monitoring: Unified observability
- Automated Testing and Deployment: CI/CD pipelines
Scalability Patterns¶
1. Stateless Services¶
Services that don't store client state between requests. Each request contains all information needed to process it.
Benefits:
- Easy horizontal scaling: any instance can handle any request
- Simple load balancing
- No session affinity required
- Better fault tolerance
Implementation:
- Store session state in external store (Redis, database)
- Use JWT tokens for authentication (stateless)
- Pass all necessary context in requests
- Avoid server-side session storage
2. Database Scaling¶
Databases often become bottlenecks as systems scale. Multiple strategies exist for scaling databases.
Read Scaling:
- Read Replicas: Create copies of database for read operations
- Read/Write Splitting: Route reads to replicas, writes to master
- Caching: Cache frequently accessed data (Redis, Memcached)
- Materialized Views: Pre-compute expensive queries
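Read/write splitting from the list above can be sketched as a thin router in front of the connection pool; the connection objects here are string stand-ins:

```python
import itertools

class ReadWriteRouter:
    """Route writes to the primary, reads round-robin to replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(self.replicas)  # Reads may be slightly stale
        return self.primary             # INSERT/UPDATE/DELETE/DDL

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
```

A real router must also handle replication lag (e.g. pin a session to the primary right after its own write to preserve read-your-writes).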
Write Scaling:
- Sharding: Partition data across multiple databases
- Database Federation: Separate databases by function
- Denormalization: Reduce joins by duplicating data
- Event Sourcing: Store events instead of current state
Database Types:
- SQL Databases: ACID transactions, complex queries, vertical and horizontal scaling challenges
- NoSQL Databases: Often designed for horizontal scaling, eventual consistency
- Document stores: MongoDB, CouchDB
- Key-value stores: Redis, DynamoDB
- Column stores: Cassandra, HBase
- Graph databases: Neo4j
3. Caching Strategies¶
Caching stores frequently accessed data in fast storage to reduce load on backend systems and improve response times.
Cache Levels:
- Browser Cache: Cache static assets in user's browser
- CDN Cache: Cache content at edge locations
- Application Cache: In-memory cache in application (Redis, Memcached)
- Database Cache: Query result cache, buffer pool
Cache Patterns:
- Cache-Aside: Application manages cache, loads on miss
- Write-Through: Write to cache and database simultaneously
- Write-Back: Write to cache, asynchronously write to database
- Refresh-Ahead: Proactively refresh before expiration
Cache Invalidation:
- TTL (Time-To-Live): Automatic expiration
- Event-based: Invalidate on data changes
- Version-based: Use versions to detect stale data
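The cache-aside pattern combined with TTL invalidation can be sketched as follows; `loader` is a stand-in for a database query, and an in-process dict stands in for Redis or Memcached:

```python
import time

class CacheAside:
    """Cache-aside with TTL: the application checks the cache first
    and loads from the backing store only on a miss."""
    def __init__(self, loader, ttl_seconds=60):
        self.loader = loader          # stand-in for a database query
        self.ttl = ttl_seconds
        self.cache = {}               # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]           # cache hit
        value = self.loader(key)      # cache miss: load from source
        self.cache[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        # Event-based invalidation: call this when the data changes
        self.cache.pop(key, None)
```

The TTL bounds staleness even if an invalidation event is missed, which is why the two strategies are usually combined.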
4. Asynchronous Processing¶
Offload time-consuming tasks to background workers to keep request handlers responsive.
Benefits:
- Faster response times for users
- Better resource utilization
- Ability to handle traffic spikes
- Decoupling of components
Patterns:
- Message Queues: Send tasks to queues, workers process them
- Event-Driven: Publish events, subscribers process asynchronously
- Job Queues: Background job processing (Celery, Sidekiq)
- Stream Processing: Process data streams in real-time (Kafka Streams, Flink)
Use Cases:
- Email sending
- Image/video processing
- Report generation
- Data aggregation and analytics
- Notification delivery
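A minimal sketch of the idea using only the standard library: the request handler enqueues work and returns immediately, while a worker thread drains the queue in the background (a stand-in for a real job queue such as Celery or Sidekiq):

```python
import queue
import threading

task_queue = queue.Queue()
results = []

def worker():
    """Background worker: processes tasks as they arrive."""
    while True:
        task = task_queue.get()
        if task is None:          # sentinel to stop the worker
            break
        # Stand-in for slow work such as sending an email
        results.append(f"sent email to {task}")
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_signup(email):
    task_queue.put(email)         # offload the slow step
    return 'signup accepted'      # respond without waiting
```

The handler's latency is now independent of how long the email send takes, and a burst of signups simply grows the queue rather than blocking users.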
Scalability Bottlenecks¶
Common bottlenecks that limit system scalability and how to address them.
1. Database Bottlenecks¶
Symptoms:
- Slow query performance
- High database CPU/memory usage
- Connection pool exhaustion
- Lock contention
Solutions:
- Query optimization: indexes, query rewriting
- Database connection pooling
- Read replicas for read-heavy workloads
- Caching frequently accessed data
- Database sharding for write scaling
- Consider NoSQL for specific use cases
2. Application Bottlenecks¶
Symptoms:
- High CPU usage
- Memory leaks
- Inefficient algorithms
- Synchronous blocking operations
Solutions:
- Code profiling and optimization
- Algorithm improvements
- Asynchronous processing
- Connection pooling
- Resource pooling (thread pools, object pools)
- Horizontal scaling
3. Network Bottlenecks¶
Symptoms:
- High latency
- Bandwidth saturation
- Network congestion
Solutions:
- CDN for static content
- Compression (gzip, brotli)
- HTTP/2 and HTTP/3 for multiplexing
- Edge computing
- Regional deployment
- Optimize payload sizes
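To see why compression helps on network-bound paths, a small sketch: repetitive JSON, typical of API responses, shrinks dramatically under gzip:

```python
import gzip
import json

# Repetitive JSON payload, typical of a list endpoint
payload = json.dumps(
    [{'id': i, 'status': 'active'} for i in range(1000)]
).encode()

compressed = gzip.compress(payload)

# Text-heavy API responses commonly shrink several-fold with gzip,
# trading a little CPU for much less bandwidth
ratio = len(payload) / len(compressed)
```

Brotli typically compresses slightly better than gzip for web content at the cost of more CPU, which is why CDNs often pre-compress static assets with it.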
4. I/O Bottlenecks¶
Symptoms:
- Slow disk I/O
- High I/O wait times
- Storage capacity limits
Solutions:
- Use SSDs instead of HDDs
- Distributed file systems
- Object storage (S3, GCS)
- Database optimization
- Caching to reduce I/O
Scalability Metrics and Monitoring¶
Key metrics to track system scalability and performance.
Performance Metrics:
- Latency: Time to process a request (p50, p95, p99 percentiles)
- Throughput: Requests per second
- Error Rate: Percentage of failed requests
- Availability: Uptime percentage
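Percentiles matter because averages hide tail latency. A simple nearest-rank computation (monitoring systems may interpolate differently) shows how p50 and p99 diverge when a few requests are slow:

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: the value below which roughly p%
    of samples fall. Simplest variant; no interpolation."""
    ordered = sorted(latencies_ms)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# One slow outlier barely moves the median but dominates the tail
samples = [12, 15, 20, 22, 25, 30, 45, 60, 120, 800]
p50 = percentile(samples, 50)   # typical request
p99 = percentile(samples, 99)   # tail latency
```

This is why SLOs are usually stated against p95 or p99 rather than the mean: users who hit the tail are the ones who notice.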
Resource Metrics:
- CPU Usage: Processor utilization
- Memory Usage: RAM consumption
- Disk I/O: Read/write operations per second
- Network I/O: Bandwidth utilization
Business Metrics:
- User Growth: Number of active users
- Transaction Volume: Business transactions per day
- Data Growth: Database size, storage usage
Monitoring Tools:
- APM (Application Performance Monitoring): New Relic, Datadog, AppDynamics
- Infrastructure Monitoring: Prometheus, Grafana, CloudWatch
- Log Analysis: ELK Stack, Splunk
- Distributed Tracing: Jaeger, Zipkin
Scalability Best Practices¶
- Design for Scale from the Start:
  - Consider scalability requirements early
  - Use scalable architectures (microservices, event-driven)
  - Design stateless services
  - Plan for horizontal scaling
- Measure Before Optimizing:
  - Profile applications to find bottlenecks
  - Monitor key metrics continuously
  - Load test before production
  - Establish performance baselines
- Cache Aggressively:
  - Cache at multiple levels
  - Cache frequently accessed data
  - Use appropriate cache expiration strategies
  - Monitor cache hit rates
- Optimize Database Access:
  - Use indexes effectively
  - Optimize queries
  - Use connection pooling
  - Consider read replicas and sharding
- Use Asynchronous Processing:
  - Offload heavy tasks to background workers
  - Use message queues for decoupling
  - Implement event-driven patterns
- Implement Rate Limiting:
  - Protect against traffic spikes
  - Prevent abuse
  - Ensure fair resource usage
- Plan for Failure:
  - Design for graceful degradation
  - Implement circuit breakers
  - Use health checks and auto-scaling
  - Plan disaster recovery
Rate Limiting Deep Dive¶
Rate limiting controls the rate of requests a client can make to protect services from abuse and ensure fair usage.
Rate Limiting Algorithms¶
1. Token Bucket:
┌─────────────────────────────────────────────────────────────────┐
│ TOKEN BUCKET │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Tokens added at fixed rate (e.g., 10/second) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ BUCKET │ Max capacity (e.g., 100 tokens) │
│ │ ○ ○ ○ ○ ○ │ │
│ │ ○ ○ ○ ○ ○ │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ Request takes 1 token (or N for weighted) │
│ If no tokens → Request rejected │
│ │
└─────────────────────────────────────────────────────────────────┘
- Allows burst traffic up to bucket size
- Smooths out request rate over time
- Most flexible algorithm
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow_request(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        refill_amount = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + refill_amount)
        self.last_refill = now
2. Leaky Bucket:
┌─────────────────────────────────────────────────────────────────┐
│ LEAKY BUCKET │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Requests enter bucket (queue) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ BUCKET │ Fixed queue size │
│ │ ■ ■ ■ ■ ■ │ │
│ │ ■ ■ ■ ■ ■ │ │
│ └──────┬──────┘ │
│ │ ← Leak at constant rate │
│ ▼ │
│ Requests processed at fixed rate │
│ If bucket full → Request rejected │
│ │
└─────────────────────────────────────────────────────────────────┘
- Processes at constant rate regardless of input
- Smooths out bursty traffic
- Can introduce latency (queuing)
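For symmetry with the token bucket above, here is a sketch of the meter-style leaky bucket variant, which tracks a "water level" that drains at a constant rate rather than keeping an actual request queue:

```python
import time

class LeakyBucket:
    """Meter-style leaky bucket: `level` represents queued work,
    draining at a constant rate; requests are rejected when the
    bucket would overflow."""
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # units drained per second
        self.level = 0.0
        self.last_leak = time.time()

    def allow_request(self, amount=1):
        self._leak()
        if self.level + amount <= self.capacity:
            self.level += amount
            return True
        return False

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_leak = now
```

The queue-based variant shown in the diagram instead admits requests into a bounded queue and processes them at the leak rate, which smooths output but adds queuing latency.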
3. Fixed Window Counter:
import time

class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # window_key -> count (expired keys should be evicted periodically)

    def allow_request(self, client_id):
        window_key = int(time.time() / self.window_seconds)
        key = f"{client_id}:{window_key}"
        current = self.counts.get(key, 0)
        if current >= self.limit:
            return False
        self.counts[key] = current + 1
        return True
- Simple to implement
- Problem: Burst at window boundaries (2x limit possible)
4. Sliding Window Log:
import time

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = {}  # client_id -> [timestamps]

    def allow_request(self, client_id):
        now = time.time()
        window_start = now - self.window_seconds
        # Get client's request log
        if client_id not in self.logs:
            self.logs[client_id] = []
        # Remove entries that have fallen out of the window
        self.logs[client_id] = [
            ts for ts in self.logs[client_id]
            if ts > window_start
        ]
        if len(self.logs[client_id]) >= self.limit:
            return False
        self.logs[client_id].append(now)
        return True
- Accurate but memory-intensive
- Stores timestamp of each request
5. Sliding Window Counter (Hybrid):
import time

class SlidingWindowCounter:
    """Combines fixed window efficiency with sliding window accuracy."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = {}
        self.previous_window = {}
        self.current_window_start = 0

    def allow_request(self, client_id):
        now = time.time()
        window_start = int(now / self.window_seconds) * self.window_seconds
        # Rotate windows if needed
        if window_start > self.current_window_start:
            self.previous_window = self.current_window
            self.current_window = {}
            self.current_window_start = window_start
        # Weight the previous window by how much of it still overlaps
        # the sliding window ending now
        elapsed_in_current = now - window_start
        weight = elapsed_in_current / self.window_seconds
        prev_count = self.previous_window.get(client_id, 0)
        curr_count = self.current_window.get(client_id, 0)
        # Estimated request count in the sliding window
        estimated_count = prev_count * (1 - weight) + curr_count
        if estimated_count >= self.limit:
            return False
        self.current_window[client_id] = curr_count + 1
        return True
Distributed Rate Limiting:
# Using Redis for distributed rate limiting
class DistributedRateLimiter:
    def __init__(self, redis_client, limit, window_seconds):
        self.redis = redis_client
        self.limit = limit
        self.window_seconds = window_seconds

    def allow_request(self, client_id):
        key = f"rate_limit:{client_id}"
        # Lua script for an atomic increment-and-expire
        script = """
        local current = redis.call('INCR', KEYS[1])
        if current == 1 then
            redis.call('EXPIRE', KEYS[1], ARGV[1])
        end
        return current
        """
        count = self.redis.eval(script, 1, key, self.window_seconds)
        return count <= self.limit
Back-Pressure and Flow Control¶
Back-pressure prevents system overload by slowing down producers when consumers can't keep up.
Without Back-Pressure:
Producer (1000/s) ──▶ Queue ──▶ Consumer (100/s)
│
▼
Queue grows unbounded
Memory exhaustion
System crash
With Back-Pressure:
Producer (slowed) ◀── Signal ── Queue ──▶ Consumer (100/s)
│
▼
Queue bounded
Producer waits or drops
System stable
Back-Pressure Strategies:
1. Blocking:
from queue import Queue

# Bounded queue - put() blocks when full
queue = Queue(maxsize=1000)

def producer():
    while True:
        item = generate_item()
        queue.put(item)  # Blocks if queue full

def consumer():
    while True:
        item = queue.get()  # Blocks if queue empty
        process(item)
2. Dropping:
from queue import Full

def producer():
    while True:
        item = generate_item()
        try:
            queue.put_nowait(item)  # queue: the bounded Queue above
        except Full:
            metrics.increment('dropped_items')
            # Optionally: log, alert, or sample
3. Buffering with Overflow:
from queue import Queue, Full

class OverflowBuffer:
    def __init__(self, primary_size, overflow_size):
        self.primary = Queue(maxsize=primary_size)
        # DiskQueue is a placeholder for a disk-backed queue implementation
        self.overflow = DiskQueue(max_size=overflow_size)

    def put(self, item):
        try:
            self.primary.put_nowait(item)
        except Full:
            try:
                self.overflow.put(item)  # Spill to disk
            except OverflowError:
                self.drop(item)  # Last resort
4. Load Shedding:
class LoadShedder:
    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def should_accept(self, request):
        # get_system_load(): e.g. CPU or queue utilization, 0.0-1.0
        current_load = self.get_system_load()
        if current_load < self.threshold:
            return True
        # Prioritized shedding
        if request.priority == 'high':
            return current_load < 0.95
        elif request.priority == 'medium':
            return current_load < 0.9
        else:
            return False  # Shed low priority
5. Reactive Streams (Pull-Based):
# Consumer pulls at its own pace
class ReactiveConsumer:
    def __init__(self, source, batch_size=100):
        self.source = source
        self.batch_size = batch_size

    async def consume(self):
        while True:
            # Consumer controls the pace
            items = await self.source.request(self.batch_size)
            await self.process(items)
Auto-Scaling Strategies¶
Horizontal Pod Autoscaler (HPA) - Kubernetes:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-server
  minReplicas: 2
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: 1000
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Scaling Metrics:
| Metric | Best For | Consideration |
|---|---|---|
| CPU | Compute-intensive workloads | Doesn't account for I/O wait |
| Memory | Memory-intensive apps | Can be slow to change |
| RPS | Web servers, APIs | Business-relevant |
| Queue depth | Worker processes | Direct load indicator |
| Custom metrics | Specific business needs | Requires instrumentation |
Predictive Scaling:
import math

class PredictiveScaler:
    def __init__(self, history_days=14):
        self.history = []
        self.history_days = history_days

    def predict_load(self, target_time):
        """Predict load based on historical patterns."""
        # Same day of week, same time
        historical_loads = self.get_historical_loads(
            day_of_week=target_time.weekday(),
            hour=target_time.hour
        )
        # Use moving average with trend
        return self.calculate_prediction(historical_loads)

    def calculate_replicas(self, predicted_load):
        """Calculate needed replicas with buffer."""
        # capacity_per_replica: measured throughput of one replica
        base_replicas = math.ceil(predicted_load / self.capacity_per_replica)
        # Add 20% buffer for unexpected spikes
        return int(base_replicas * 1.2)
Load Balancing¶
Load balancing is the process of distributing incoming network traffic across multiple servers or resources to ensure no single server becomes overwhelmed. Load balancers act as traffic directors, routing client requests to the most appropriate server based on various algorithms and health checks.
Why Load Balancing is Essential¶
- High Availability: If one server fails, traffic is automatically routed to healthy servers, ensuring continuous service availability.
- Performance: Distributes load evenly, preventing any single server from becoming a bottleneck.
- Scalability: Enables horizontal scaling by adding more servers behind the load balancer.
- Flexibility: Allows maintenance and updates without downtime by taking servers out of rotation.
- Geographic Distribution: Can route users to the nearest data center or server for reduced latency.
Types of Load Balancers¶
1. Network Load Balancer (Layer 4)¶
Operates at the transport layer (TCP/UDP) and routes traffic based on IP addresses and ports.
Characteristics:
- Fast and efficient (no application-layer inspection)
- Low latency
- Works with any protocol (TCP, UDP)
- Cannot inspect HTTP headers or content
- No SSL termination (unless using TCP passthrough)
Use Cases:
- High-throughput, low-latency applications
- TCP/UDP-based protocols
- When application-layer inspection isn't needed
- Real-time applications (gaming, streaming)
Examples:
- AWS Network Load Balancer
- F5 BIG-IP
- HAProxy (TCP mode)
2. Application Load Balancer (Layer 7)¶
Operates at the application layer (HTTP/HTTPS) and can inspect and route based on content.
Characteristics:
- Can inspect HTTP headers, URLs, and content
- SSL/TLS termination
- Content-based routing
- Cookie-based session affinity
- More intelligent routing decisions
- Higher latency than Layer 4 (more processing)
Use Cases:
- HTTP/HTTPS applications
- When content-based routing is needed
- Microservices (route based on path)
- When SSL termination is required at load balancer
Examples:
- AWS Application Load Balancer
- NGINX
- HAProxy (HTTP mode)
- Traefik
3. Global Server Load Balancer (GSLB)¶
Distributes traffic across multiple data centers or geographic regions.
Characteristics:
- DNS-based routing
- Geographic distribution
- Disaster recovery
- Route to nearest data center
- Health checks across regions
Use Cases:
- Multi-region deployments
- Disaster recovery
- Reducing latency for global users
- High availability across regions
Examples:
- AWS Route 53
- Cloudflare
- Azure Traffic Manager
- Google Cloud Load Balancing
Load Balancing Algorithms¶
The algorithm determines how the load balancer selects which server to handle each request.
1. Round Robin¶
Distributes requests sequentially across all servers in rotation.
How it Works:
- Request 1 → Server A
- Request 2 → Server B
- Request 3 → Server C
- Request 4 → Server A (cycle repeats)
Advantages:
- Simple and predictable
- Even distribution over time
- No server state required
Disadvantages:
- Doesn't consider server load or capacity
- Uneven if servers have different capabilities
- Doesn't account for request processing time
Use Case: When all servers have similar capacity and handle similar requests.
2. Weighted Round Robin¶
Round robin with weights assigned to servers based on their capacity.
How it Works:
- Server A (weight 3): handles 3 requests
- Server B (weight 2): handles 2 requests
- Server C (weight 1): handles 1 request
- Pattern repeats
Advantages:
- Accounts for server capacity differences
- Simple to configure
- Predictable distribution
Disadvantages:
- Doesn't consider current server load
- Weights must be manually configured
Use Case: When servers have different capacities (CPU, memory, etc.).
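One common way to implement this is the smooth weighted round robin variant (popularized by NGINX), which spreads each server's turns across the cycle instead of bunching them together. A sketch:

```python
class SmoothWeightedRoundRobin:
    """Smooth weighted round robin: each pick, every server gains
    its weight; the highest-credit server is chosen and pays back
    the total weight, interleaving turns smoothly."""
    def __init__(self, servers):
        # servers: list of (name, weight) pairs
        self.servers = [
            {'name': name, 'weight': weight, 'current': 0}
            for name, weight in servers
        ]
        self.total_weight = sum(weight for _, weight in servers)

    def next_server(self):
        for s in self.servers:
            s['current'] += s['weight']
        chosen = max(self.servers, key=lambda s: s['current'])
        chosen['current'] -= self.total_weight
        return chosen['name']
```

With weights A=3, B=1, each cycle of four picks sends three requests to A and one to B, with B's turn placed mid-cycle rather than at the end.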
3. Least Connections¶
Routes requests to the server with the fewest active connections.
How it Works:
- Tracks active connections per server
- New request goes to server with minimum connections
- Continuously updated as connections are established/closed
Advantages:
- Adapts to varying request processing times
- Better for long-lived connections
- More balanced under varying loads
Disadvantages:
- Requires tracking connection state
- Doesn't consider server capacity
- More complex than round robin
Use Case: When requests have varying processing times or long-lived connections.
4. Weighted Least Connections¶
Combines least connections with server weights.
How it Works:
- Calculates: active_connections / weight
- Routes to server with lowest ratio
Advantages:
- Considers both current load and server capacity
- Most balanced algorithm
Disadvantages:
- Most complex to implement
- Requires connection tracking and weights
Use Case: When servers have different capacities and requests have varying processing times.
5. Least Response Time¶
Routes to the server with the lowest average response time.
How it Works:
- Measures response time for each server
- Routes to server with fastest average response time
- Continuously updated based on actual response times
Advantages:
- Routes to fastest servers
- Adapts to server performance changes
- Good for performance optimization
Disadvantages:
- Requires response time measurement
- May cause oscillations if response times are similar
- More overhead than simpler algorithms
Use Case: When optimizing for response time and servers have varying performance.
6. IP Hash¶
Uses a hash of the client's IP address to determine which server handles the request.
How it Works:
- Hash(client_ip) % num_servers → server_index
- Same client IP always maps to same server (if server count unchanged)
Advantages:
- Session affinity without cookies
- Predictable routing
- Simple to implement
Disadvantages:
- Uneven distribution if IPs are not evenly distributed
- Adding/removing servers changes mappings
- Doesn't consider server load
Use Case: When you need session affinity but can't use cookies, or for caching benefits.
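A minimal sketch of the selection step; md5 is used here only for its stable, evenly distributed output, not for security:

```python
import hashlib

def pick_server(client_ip, servers):
    """Map a client IP to a server index via a stable hash.
    The same IP always lands on the same server as long as
    the server list is unchanged."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Because the modulus depends on the server count, adding or removing one server remaps most clients; consistent hashing is the usual fix when that churn is unacceptable.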
7. Random¶
Randomly selects a server for each request.
How it Works:
- Generates random number
- Maps to server index
Advantages:
- Simple implementation
- No state required
- Can work well with many servers
Disadvantages:
- Unpredictable
- May cause uneven distribution in short term
- Doesn't consider server load or capacity
Use Case: Rarely used, mainly for testing or when other algorithms aren't suitable.
Load Balancing Strategies¶
1. Health Checks¶
Load balancers continuously monitor server health to avoid routing traffic to unhealthy servers.
Types of Health Checks:
- Active Health Checks:
  - Load balancer periodically sends requests to servers
  - Checks response status, latency, or content
  - Removes unhealthy servers from rotation
  - Re-adds when healthy
- Passive Health Checks:
  - Monitor actual request responses
  - Track error rates and timeouts
  - Remove servers with high failure rates
Health Check Parameters:
- Interval: How often to check (e.g., every 10 seconds)
- Timeout: Maximum time to wait for response
- Threshold: Number of failures before marking unhealthy
- Path: Endpoint to check (e.g., /health)
- Expected Response: Status code or content to validate
Example Health Check Endpoint:
@app.route('/health')
def health_check():
    # Check database connection
    # Check external dependencies
    # Check disk space, memory
    return {'status': 'healthy'}, 200
2. Session Affinity (Sticky Sessions)¶
Ensures requests from the same client are routed to the same server.
Why Needed:
- Server-side session storage
- In-memory caching
- Stateful applications
- File uploads requiring connection persistence
Implementation Methods:
- Cookie-Based:
  - Load balancer sets cookie with server identifier
  - Subsequent requests include cookie
  - Load balancer routes to same server
- IP Hash:
  - Hash client IP to determine server
  - Same IP always routes to same server
- Custom Headers:
  - Use custom HTTP headers to identify sessions
Trade-offs:
- Pros: Enables stateful applications, better caching
- Cons: Uneven load distribution, harder to scale, server failure affects users
Best Practice: Use stateless services with external session storage (Redis, database) to avoid needing session affinity.
3. SSL/TLS Termination¶
Load balancer handles SSL/TLS encryption/decryption, reducing load on backend servers.
How it Works:
- Client → HTTPS → Load Balancer (decrypts)
- Load Balancer → HTTP → Backend Servers (or HTTPS with different cert)
Advantages:
- Reduces CPU load on backend servers
- Centralized certificate management
- Easier to update certificates
- Can inspect encrypted traffic
Disadvantages:
- Traffic between load balancer and servers may be unencrypted
- Single point of failure for SSL
- Load balancer becomes security boundary
Solution: Use end-to-end encryption (HTTPS to backend) if security is critical.
4. Content-Based Routing¶
Route requests based on URL path, headers, or content.
Examples:
- /api/users/* → User service
- /api/products/* → Product service
- /api/orders/* → Order service
- Route based on Host header for virtual hosting
- Route based on Content-Type header
Use Case: Microservices architecture where different services handle different paths.
Load Balancer Deployment Patterns¶
1. Single Load Balancer¶
One load balancer in front of multiple servers.
Characteristics:
- Simple setup
- Single point of failure
- May become bottleneck at high traffic
Use Case: Small to medium applications, development environments.
2. Multiple Load Balancers (Active-Passive)¶
Primary load balancer handles traffic, backup takes over on failure.
Characteristics:
- High availability
- Backup doesn't handle traffic normally
- Failover time required
Use Case: Medium to large applications requiring high availability.
3. Multiple Load Balancers (Active-Active)¶
Multiple load balancers share traffic load.
Characteristics:
- Better performance (load distributed)
- Higher availability
- More complex configuration
- Requires session synchronization or stateless design
Use Case: High-traffic applications, large-scale systems.
4. DNS Load Balancing¶
Use DNS to distribute traffic across multiple load balancers or servers.
How it Works:
- DNS returns multiple IP addresses
- Clients connect to different IPs
- DNS TTL controls how long clients cache IPs
Advantages:
- Simple, no additional infrastructure
- Geographic distribution possible
- Works with any protocol
Disadvantages:
- Limited control (DNS caching)
- No health checking (unless using advanced DNS)
- Uneven distribution (client DNS caching)
Use Case: Global distribution, simple load balancing needs.
Load Balancing in Cloud Environments¶
AWS Load Balancing¶
Elastic Load Balancer (ELB) Types:
- Application Load Balancer (ALB): Layer 7, HTTP/HTTPS
- Network Load Balancer (NLB): Layer 4, TCP/UDP, high performance
- Classic Load Balancer: Legacy, basic features
Features:
- Auto-scaling integration
- Health checks
- SSL termination
- Sticky sessions
- Cross-zone load balancing
Google Cloud Load Balancing¶
Types:
- HTTP(S) Load Balancing: Global, Layer 7
- TCP/UDP Load Balancing: Regional or global, Layer 4
- Internal Load Balancing: For internal traffic
Features:
- Global anycast IP
- Automatic scaling
- Health checks
- Content-based routing
Azure Load Balancing¶
Types:
- Application Gateway: Layer 7, web application firewall
- Load Balancer: Layer 4, regional
- Traffic Manager: DNS-based, global
Load Balancing Best Practices¶
- Health Checks:
  - Implement comprehensive health check endpoints
  - Configure appropriate intervals and timeouts
  - Monitor health check failures
- Stateless Services:
  - Design services to be stateless
  - Store state in external systems (Redis, database)
  - Avoid session affinity when possible
- Monitoring:
  - Monitor load balancer metrics (requests, latency, errors)
  - Track backend server health
  - Set up alerts for failures
- Security:
  - Use SSL/TLS termination
  - Implement rate limiting
  - Use web application firewalls (WAF)
  - Restrict access to backend servers
- Performance:
  - Choose appropriate algorithm for your use case
  - Enable connection pooling
  - Use HTTP/2 for better multiplexing
  - Consider geographic distribution
- Scalability:
  - Integrate with auto-scaling
  - Plan for traffic growth
  - Test load balancer capacity
  - Use multiple load balancers for high availability
API Gateway¶
An API Gateway is a server that acts as the single entry point for all client requests to backend services. It handles cross-cutting concerns like authentication, rate limiting, and request routing.
API Gateway Responsibilities¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ API GATEWAY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Request │ │ Auth & │ │ Rate │ │ Request │ │
│ │ Routing │ │ Authorization│ │ Limiting │ │ Validation │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Caching │ │ Circuit │ │ Request │ │ Protocol │ │
│ │ │ │ Breaker │ │ Transform │ │ Translation │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Logging │ │ Metrics │ │ Service │ │ SSL/TLS │ │
│ │ & Tracing │ │ │ │ Discovery │ │ Termination │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
API Gateway Patterns¶
1. Simple Gateway:
┌─────────────┐
Clients ────────▶│ API Gateway │────────▶ Services
└─────────────┘
2. Backend for Frontend (BFF):
┌─────────────┐
Mobile ─────────▶│ Mobile BFF │────────┐
└─────────────┘ │
▼
┌─────────────┐ ┌─────────┐
Web ────────────▶│ Web BFF │───▶│Services │
└─────────────┘ └─────────┘
▲
┌─────────────┐ │
IoT ────────────▶│ IoT BFF │────────┘
└─────────────┘
Each client type gets a specialized gateway optimized for its needs.
3. Aggregator Gateway:
import asyncio

class OrderAggregator:
    """Aggregates data from multiple services into single response."""
    async def get_order_details(self, order_id):
        # Fetch the order first; its fields identify the related records
        order = await self.order_service.get(order_id)
        # Then fan out to the remaining services in parallel
        user, payment, shipping = await asyncio.gather(
            self.user_service.get(order.user_id),
            self.payment_service.get(order.payment_id),
            self.shipping_service.get(order.shipping_id),
        )
        # Aggregate into single response
        return {
            'order': order,
            'user': {
                'name': user.name,
                'email': user.email
            },
            'payment': {
                'status': payment.status,
                'amount': payment.amount
            },
            'shipping': {
                'status': shipping.status,
                'tracking': shipping.tracking_number
            }
        }
API Gateway Technologies¶
| Technology | Type | Best For |
|---|---|---|
| Kong | Open Source | Kubernetes, plugin ecosystem |
| AWS API Gateway | Managed | AWS ecosystem, serverless |
| Apigee | Enterprise | Large enterprises, analytics |
| NGINX | Open Source | High performance, flexibility |
| Traefik | Open Source | Kubernetes, auto-discovery |
| Envoy | Service Mesh | Service-to-service, gRPC |
Request/Response Transformation¶
# Gateway transforms legacy API response to modern format
# Legacy service response
legacy_response = {
    "usr_id": "123",
    "usr_nm": "John",
    "crt_dt": "20240101"
}

# Gateway transformation rules
transformation = {
    "user_id": "$.usr_id",
    "user_name": "$.usr_nm",
    "created_at": {
        "source": "$.crt_dt",
        "transform": "date_format('YYYYMMDD', 'ISO8601')"
    }
}

# Transformed response
modern_response = {
    "user_id": "123",
    "user_name": "John",
    "created_at": "2024-01-01T00:00:00Z"
}
Service Mesh¶
A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It makes communication between service instances reliable, fast, and secure.
Service Mesh Architecture¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVICE MESH │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Control Plane │ │ Control Plane │ │
│ │ (Istiod) │ │ (Linkerd) │ │
│ │ │ │ │ │
│ │ - Config │ │ - Identity │ │
│ │ - Certificates │ │ - Routing │ │
│ │ - Discovery │ │ - Metrics │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ │ Configuration │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Data Plane │ │
│ │ │ │
│ │ ┌─────────┐ ┌───────┐ ┌─────────┐ ┌───────┐ │ │
│ │ │Service A│◀│ Proxy │◀──────▶│ Proxy │▶│Service B│ │ │
│ │ │ │ │(Envoy)│ │(Envoy) │ │ │ │ │
│ │ └─────────┘ └───────┘ └───────┘ └─────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Service Mesh Features¶
1. Traffic Management:
# Istio VirtualService - Traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - match:
        - headers:
            end-user:
              exact: jason
      route:
        - destination:
            host: reviews
            subset: v2  # Jason gets v2
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90  # 90% to v1
        - destination:
            host: reviews
            subset: v2
          weight: 10  # 10% to v2 (canary)
2. Mutual TLS (mTLS):
# Istio PeerAuthentication - Enforce mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT  # All traffic must be mTLS
3. Observability:
- Distributed Tracing: Automatic trace propagation
- Metrics: Request latency, error rates, throughput
- Access Logs: Detailed request/response logging
4. Resilience:
# Istio DestinationRule - Circuit breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Service Mesh vs API Gateway¶
| Aspect | API Gateway | Service Mesh |
|---|---|---|
| Position | Edge (north-south) | Internal (east-west) |
| Traffic | External to internal | Service to service |
| Deployment | Centralized | Distributed (sidecars) |
| Protocol | HTTP/REST focused | Any protocol |
| Use Case | Public APIs | Microservices internal |
Content Delivery Network (CDN)¶
A CDN is a geographically distributed network of servers that delivers content to users from the nearest edge location, reducing latency and improving performance.
CDN Architecture¶
Origin Server
│
┌──────────────┴──────────────┐
│ CDN Network │
│ │
┌───────────┼───────────┬───────────┬─────┼───────────┐
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│Edge US│ │Edge EU│ │Edge AP│ │Edge SA│ │Edge AF│
│ West │ │ │ │ │ │ │ │ │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
[Users] [Users] [Users] [Users] [Users]
CDN Caching Strategies¶
Cache-Control Headers:
# Static assets (aggressive caching)
Cache-Control: public, max-age=31536000, immutable
# API responses (short cache)
Cache-Control: public, max-age=60, s-maxage=300
# Private user data (no CDN cache)
Cache-Control: private, no-store
# Stale-while-revalidate (serve stale, fetch fresh in background)
Cache-Control: public, max-age=60, stale-while-revalidate=86400
Cache Key Configuration:
# CDN cache key components
cache_key = {
'url': '/api/products',
'query_params': ['category', 'sort'], # Include in key
'headers': ['Accept-Language'], # Vary by language
'cookies': [], # Ignore cookies
}
# Example cache keys:
# /api/products?category=shoes&sort=price + Accept-Language: en
# /api/products?category=shoes&sort=price + Accept-Language: es
CDN Use Cases¶
1. Static Asset Delivery:
# Origin server configuration
location /static/ {
root /var/www;
expires 1y;
add_header Cache-Control "public, immutable";
add_header CDN-Cache-Control "max-age=31536000";
}
2. Dynamic Content Caching:
@app.route('/api/products')
def get_products():
products = fetch_products()
response = jsonify(products)
response.headers['Cache-Control'] = 'public, max-age=60'
response.headers['Surrogate-Key'] = 'products' # For purging
return response
# Purge when products change
def invalidate_products():
cdn.purge_by_tag('products')
3. Edge Computing:
// Cloudflare Worker - Edge computation
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
const url = new URL(request.url)
// A/B testing at edge
const variant = request.headers.get('Cookie')?.includes('variant=b') ? 'b' : 'a'
// Modify request to appropriate origin
url.pathname = `/${variant}${url.pathname}`
return fetch(url.toString())
}
CDN Providers Comparison¶
| Provider | Strengths | Best For |
|---|---|---|
| Cloudflare | DDoS protection, edge compute | Security-focused, edge logic |
| AWS CloudFront | AWS integration, Lambda@Edge | AWS ecosystem |
| Fastly | Real-time purging, VCL | Dynamic content, instant invalidation |
| Akamai | Global reach, enterprise | Large enterprises, streaming |
| Bunny CDN | Cost-effective | Budget-conscious, simple needs |
Real-Time Communication¶
WebSockets¶
Full-duplex communication channel over a single TCP connection.
┌─────────────────────────────────────────────────────────────────┐
│ WebSocket Lifecycle │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Client Server │
│ │ │ │
│ │ ──── HTTP Upgrade Request ─────────────────────▶│ │
│ │ Upgrade: websocket │ │
│ │ Connection: Upgrade │ │
│ │ Sec-WebSocket-Key: xxx │ │
│ │ │ │
│ │◀──── HTTP 101 Switching Protocols ──────────────│ │
│ │ Upgrade: websocket │ │
│ │ Sec-WebSocket-Accept: yyy │ │
│ │ │ │
│ │◀════════════ WebSocket Connection ════════════▶│ │
│ │ │ │
│ │ ──── Message Frame ────────────────────────────▶│ │
│ │◀──── Message Frame ─────────────────────────────│ │
│ │ ──── Message Frame ────────────────────────────▶│ │
│ │ │ │
│ │ ──── Close Frame ──────────────────────────────▶│ │
│ │◀──── Close Frame ───────────────────────────────│ │
│ │
└─────────────────────────────────────────────────────────────────┘
WebSocket Server Example:
import asyncio
import websockets
import json
class ChatServer:
def __init__(self):
self.clients = set()
self.rooms = {} # room_id -> set of websockets
    async def handler(self, websocket, path=None):  # 'path' is only passed by older websockets versions
# Register client
self.clients.add(websocket)
try:
async for message in websocket:
data = json.loads(message)
await self.process_message(websocket, data)
finally:
# Unregister client
self.clients.discard(websocket)
self.leave_all_rooms(websocket)
async def process_message(self, websocket, data):
action = data.get('action')
if action == 'join':
room_id = data['room_id']
if room_id not in self.rooms:
self.rooms[room_id] = set()
self.rooms[room_id].add(websocket)
elif action == 'message':
room_id = data['room_id']
message = data['message']
await self.broadcast_to_room(room_id, {
'type': 'message',
'content': message,
'sender': data['sender']
})
    async def broadcast_to_room(self, room_id, message):
        if room_id in self.rooms:
            websockets.broadcast(
                self.rooms[room_id],
                json.dumps(message)
            )

    def leave_all_rooms(self, websocket):
        # Remove the disconnecting client from every room it joined
        for members in self.rooms.values():
            members.discard(websocket)
# Start server (websockets.serve is an async context manager)
async def main():
    server = ChatServer()
    async with websockets.serve(server.handler, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
Server-Sent Events (SSE)¶
One-way communication from server to client over HTTP.
from flask import Flask, Response
import json
import time

app = Flask(__name__)
@app.route('/events')
def events():
def generate():
while True:
data = get_latest_data()
yield f"data: {json.dumps(data)}\n\n"
time.sleep(1)
return Response(
generate(),
mimetype='text/event-stream',
headers={
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
)
Long Polling¶
Client repeatedly requests updates, server holds request until data available.
@app.route('/poll')
async def long_poll():
timeout = 30 # seconds
start_time = time.time()
while time.time() - start_time < timeout:
data = await check_for_updates()
if data:
return jsonify(data)
await asyncio.sleep(0.5)
# Timeout - return empty and client reconnects
return jsonify({'status': 'timeout'})
Real-Time Pattern Comparison¶
| Pattern | Direction | Connection | Best For |
|---|---|---|---|
| WebSocket | Bidirectional | Persistent | Chat, gaming, collaboration |
| SSE | Server → Client | Persistent | News feeds, notifications |
| Long Polling | Request-Response | Repeated | Legacy support, simple updates |
| Short Polling | Request-Response | New each time | Infrequent updates |
Scaling Real-Time Systems¶
Challenge: WebSocket connections are stateful and tied to specific servers.
Solution: Pub/Sub with Redis:
import redis.asyncio as redis
class ScalableWebSocketServer:
def __init__(self):
self.redis = redis.from_url("redis://localhost")
self.pubsub = self.redis.pubsub()
self.local_clients = {} # room_id -> set of local websockets
async def subscribe_to_room(self, room_id, websocket):
# Local tracking
if room_id not in self.local_clients:
self.local_clients[room_id] = set()
# Subscribe to Redis channel
await self.pubsub.subscribe(f"room:{room_id}")
self.local_clients[room_id].add(websocket)
async def broadcast_to_room(self, room_id, message):
# Publish to Redis (reaches all server instances)
await self.redis.publish(
f"room:{room_id}",
json.dumps(message)
)
async def redis_listener(self):
"""Listen for messages from other server instances."""
async for message in self.pubsub.listen():
if message['type'] == 'message':
room_id = message['channel'].decode().replace('room:', '')
data = json.loads(message['data'])
# Broadcast to local clients only
if room_id in self.local_clients:
for ws in self.local_clients[room_id]:
await ws.send(json.dumps(data))
Operating Systems Fundamentals¶
Understanding operating system fundamentals is crucial for designing and building efficient, concurrent, and scalable systems. This section covers threads, processes, locks, synchronization, and concurrency—concepts that directly impact system design decisions.
Processes and Threads¶
Processes¶
A process is an instance of a running program with its own memory space, file descriptors, and system resources. Each process is isolated from other processes and has its own address space.
Process Characteristics:
- Isolation: Processes cannot directly access each other's memory
- Resource Ownership: Each process has its own memory, file handles, network connections
- Heavyweight: Creating and switching between processes is expensive
- IPC (Inter-Process Communication): Processes communicate via pipes, sockets, shared memory, message queues
Process States:
- New: Process is being created
- Ready: Process is waiting to be assigned to a CPU
- Running: Process instructions are being executed
- Waiting/Blocked: Process is waiting for an event (I/O, signal)
- Terminated: Process has finished execution
Process Creation:
- Fork: Creates a copy of the current process (Unix/Linux)
- Exec: Replaces current process with a new program
- Clone: More flexible process creation (Linux)
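The fork/exec pattern above can be sketched in a few lines of Python. This is POSIX-only (a Unix/Linux assumption) and uses `sys.executable` as a stand-in for any program you might exec:

```python
import os
import sys

pid = os.fork()  # POSIX only: the child starts as a copy of this process
if pid == 0:
    # Child: exec replaces the copied process image with a new program
    os.execvp(sys.executable, [sys.executable, "-c", "print('hello from child')"])
else:
    # Parent: reap the child and read its exit status
    _, status = os.waitpid(pid, 0)
    exit_code = os.waitstatus_to_exitcode(status)  # Python 3.9+
```

Note that `exec` never returns on success: the child's original code after `execvp` is unreachable, which is why fork and exec are almost always used together.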
Threads¶
A thread is a lightweight unit of execution within a process. Multiple threads share the same memory space and resources of their parent process but have their own stack and registers.
Thread Characteristics:
- Shared Memory: Threads in the same process share memory space
- Lightweight: Creating and switching threads is faster than processes
- Concurrent Execution: Multiple threads can execute simultaneously (on multi-core systems)
- Synchronization Required: Shared memory requires synchronization mechanisms
Thread vs. Process:
| Aspect | Process | Thread |
|---|---|---|
| Memory | Isolated | Shared |
| Creation Cost | High | Low |
| Communication | IPC required | Direct memory access |
| Context Switch | Expensive | Cheap |
| Fault Isolation | High (crash doesn't affect others) | Low (crash affects all threads) |
| Scalability | Limited by process overhead | Better scalability |
Thread Models:
-
User-Level Threads:
- Managed by application/library
- OS doesn't know about threads
- Fast context switching
- Blocking I/O blocks all threads
-
Kernel-Level Threads:
- Managed by OS kernel
- OS schedules threads
- True parallelism on multi-core
- Blocking I/O only blocks one thread
-
Hybrid Model:
- Combination of user and kernel threads
- Multiple user threads mapped to fewer kernel threads
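To make the shared-memory point concrete, here is a small sketch using Python's `threading` module (names are illustrative). All four threads mutate the same `counter`, which is only safe because every increment holds the lock:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:  # threads share `counter`, so updates must be synchronized
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 400000 only because every increment was done under the lock
```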
Concurrency and Parallelism¶
Concurrency¶
Concurrency is the ability of a system to handle multiple tasks at the same time. Tasks may not execute simultaneously but are interleaved, giving the appearance of simultaneous execution.
Characteristics:
- Tasks make progress concurrently
- Execution may be interleaved on a single CPU
- Useful for I/O-bound tasks
- Can improve responsiveness
Example: A web server handling multiple requests concurrently, switching between them while waiting for I/O operations.
Parallelism¶
Parallelism is the simultaneous execution of multiple tasks, typically on multiple CPUs or cores.
Characteristics:
- Tasks execute simultaneously on different CPUs
- Requires multiple processing units
- Useful for CPU-bound tasks
- True simultaneous execution
Example: Processing multiple images simultaneously, each on a different CPU core.
Concurrency vs. Parallelism:
- Concurrency: Dealing with multiple things at once (can be on single core)
- Parallelism: Doing multiple things at once (requires multiple cores)
Concurrency is about structure; parallelism is about execution.
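One way to see concurrency without parallelism: I/O-bound tasks overlap their waits even under Python's GIL. A sketch (timings are approximate):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(i):
    time.sleep(0.1)  # simulated I/O; the GIL is released while sleeping
    return i

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(io_task, range(10)))
elapsed = time.time() - start
# ~0.1s total: the ten waits overlap, vs ~1s if run sequentially
```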
Thread Synchronization¶
When multiple threads access shared resources, synchronization is needed to prevent race conditions and ensure data consistency.
Race Conditions¶
A race condition occurs when the outcome depends on the relative timing of thread execution. Multiple threads access shared data, and the final result depends on the order of execution.
Example:
# Thread 1 and Thread 2 both execute:
counter = counter + 1
# Without synchronization, this can result in:
# Both read counter = 5
# Both increment to 6
# Both write 6 (should be 7)
Critical Section: Code that accesses shared resources and must be executed atomically by only one thread at a time.
Mutual Exclusion (Mutex)¶
A mutex (mutual exclusion) is a synchronization primitive that ensures only one thread can access a critical section at a time.
How Mutex Works:
- Thread acquires lock before entering critical section
- Other threads wait if lock is held
- Thread releases lock after leaving critical section
- Waiting thread can then acquire lock
Properties:
- Mutual Exclusion: Only one thread holds the lock at a time
- Ownership: The lock must be released by the thread that acquired it
- Blocking: Threads wait if the lock is unavailable
Example (Pseudocode):
mutex.lock()
// Critical section
shared_resource += 1
mutex.unlock()
Mutex Types:
-
Binary Semaphore:
- Two states: locked or unlocked
- Basic mutex implementation
-
Reentrant Mutex (Recursive Lock):
- Same thread can acquire lock multiple times
- Must release same number of times
- Prevents self-deadlock
-
Reader-Writer Lock:
- Multiple readers OR one writer
- Optimized for read-heavy workloads
- Writers have exclusive access
Semaphores¶
A semaphore is a synchronization primitive that maintains a counter and allows a specified number of threads to access a resource.
How Semaphores Work:
- Wait (P): Decrement counter, block if counter is 0
- Signal (V): Increment counter, wake waiting thread if any
Types:
-
Binary Semaphore:
- Counter is 0 or 1
- Similar to mutex but can be released by different thread
-
Counting Semaphore:
- Counter can be any non-negative integer
- Limits number of concurrent accesses
- Useful for resource pools
Example Use Case:
# Limit to 5 concurrent database connections
semaphore = Semaphore(5)
def database_operation():
semaphore.acquire() # Wait if 5 already acquired
try:
# Use database connection
db.query(...)
finally:
semaphore.release() # Release for next thread
Condition Variables¶
Condition variables allow threads to wait for a condition to become true and notify other threads when the condition changes.
Operations:
- Wait: Thread waits on condition variable (releases lock, blocks)
- Signal: Wake one waiting thread
- Broadcast: Wake all waiting threads
Use Case: Producer-Consumer pattern
from threading import Condition, Lock

mutex = Lock()
condition = Condition(mutex)
queue = []

def producer(item):
    with condition:              # acquires the underlying mutex
        queue.append(item)
        condition.notify()       # wake one waiting consumer

def consumer():
    with condition:
        while len(queue) == 0:   # re-check in a loop: guards against spurious wakeups
            condition.wait()     # releases the mutex while blocked
        return queue.pop(0)
Monitors¶
A monitor is a synchronization construct that combines mutual exclusion with condition variables. It ensures only one thread executes monitor code at a time.
Characteristics:
- Mutual exclusion built-in
- Condition variables for waiting
- Higher-level abstraction than mutexes
Example: Java's synchronized keyword implements monitor pattern.
Deadlocks¶
A deadlock occurs when two or more threads are blocked forever, waiting for each other to release resources.
Deadlock Conditions (Coffman Conditions)¶
All four must be present for deadlock:
- Mutual Exclusion: Resources cannot be shared
- Hold and Wait: Thread holds resource while waiting for another
- No Preemption: Resources cannot be forcibly taken
- Circular Wait: Circular chain of threads waiting for resources
Example:
Thread 1: Holds Lock A, wants Lock B
Thread 2: Holds Lock B, wants Lock A
→ Deadlock!
Deadlock Prevention¶
-
Lock Ordering:
- Always acquire locks in the same order
- Prevents circular wait
-
Lock Timeout:
- Try to acquire lock with timeout
- Release and retry if timeout
-
Deadlock Detection:
- Periodically check for circular waits
- Kill/restart threads in deadlock
-
Avoid Nested Locks:
- Minimize lock nesting
- Use higher-level synchronization
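Lock ordering can be sketched directly: sort the locks by a stable key before acquiring, so every thread takes them in the same order no matter which order its caller passed them in (a minimal illustration; without the sort, the two threads below could deadlock):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def with_both(locks, action):
    # Acquire in a fixed global order (here: by object id) to break circular wait
    first, second = sorted(locks, key=id)
    with first:
        with second:
            action()

done = []
t1 = threading.Thread(target=with_both, args=([lock_a, lock_b], lambda: done.append(1)))
t2 = threading.Thread(target=with_both, args=([lock_b, lock_a], lambda: done.append(2)))
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)
# Both threads finish despite requesting the locks in opposite orders
```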
Livelock¶
Livelock occurs when threads are not blocked but cannot make progress because they keep responding to each other's actions.
Example: Two threads trying to pass through a narrow doorway, both step aside for the other, neither makes progress.
Lock-Free Programming¶
Lock-free programming uses atomic operations and memory ordering instead of locks to achieve thread safety.
Atomic Operations¶
Operations that complete in a single, indivisible step. Guaranteed to be thread-safe without locks.
Types:
- Read-Modify-Write: Compare-and-swap (CAS), fetch-and-add
- Load/Store: Atomic reads and writes
Example (Compare-and-Swap):
# Atomic increment without lock
def atomic_increment(counter):
while True:
old_value = counter.value
new_value = old_value + 1
if compare_and_swap(counter, old_value, new_value):
break # Success
# Retry if another thread modified value
Lock-Free Data Structures¶
Data structures that don't use locks but remain thread-safe through atomic operations.
Examples:
- Lock-free queues
- Lock-free stacks
- Lock-free hash tables
Advantages:
- No deadlocks
- Better performance (no lock contention)
- Progress guarantee (at least one thread makes progress)
Disadvantages:
- More complex to implement
- Harder to reason about
- May require retry loops
Memory Models and Ordering¶
Memory Ordering¶
Defines the order in which memory operations become visible to other threads.
Ordering Types:
-
Sequential Consistency:
- All threads see same order of operations
- Simplest model, but slowest
-
Relaxed Ordering:
- Minimal ordering guarantees
- Fastest, but hardest to reason about
-
Acquire-Release:
- Acquire: a load that observes a release store also observes all writes made before that release
- Release: a store that publishes all prior writes to whichever thread performs the matching acquire
- Balance between performance and simplicity
-
Memory Barriers (Fences):
- Explicit ordering instructions
- Prevent reordering of memory operations
Example:
// Thread 1
data = 42;
flag.store(true, memory_order_release); // Release
// Thread 2
if (flag.load(memory_order_acquire)) { // Acquire
// Guaranteed to see data == 42
print(data);
}
Concurrency Patterns¶
1. Producer-Consumer¶
One or more producers generate data, one or more consumers process it.
Synchronization:
- Mutex for queue access
- Condition variable for empty/full queue
- Bounded buffer to limit memory
Use Cases:
- Task queues
- Event processing
- Log aggregation
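The pattern maps directly onto Python's `queue.Queue`, which bundles the mutex and condition variables internally (a minimal single-producer, single-consumer sketch using a sentinel for shutdown):

```python
import queue
import threading

tasks = queue.Queue(maxsize=100)  # bounded buffer limits memory
results = []

def producer():
    for i in range(5):
        tasks.put(i)      # blocks if the buffer is full
    tasks.put(None)       # sentinel: signal end of work

def consumer():
    while True:
        item = tasks.get()  # blocks if the buffer is empty
        if item is None:
            break
        results.append(item * 2)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```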
2. Reader-Writer Lock¶
Multiple readers can access simultaneously, but writers need exclusive access.
Optimization:
- Readers don't block each other
- Writers block all readers and other writers
- Better for read-heavy workloads
Implementation:
- Track number of active readers
- Block writers when readers active
- Block readers when writer active
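The implementation outline above can be sketched as a reader-preference lock (note that writers can starve under a constant stream of readers; this is illustrative, not production-grade):

```python
import threading

class ReadWriteLock:
    def __init__(self):
        self._readers = 0
        self._readers_mutex = threading.Lock()  # guards the reader count
        self._write_lock = threading.Lock()     # held while readers or a writer are active

    def acquire_read(self):
        with self._readers_mutex:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()  # first reader blocks writers

    def release_read(self):
        with self._readers_mutex:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()  # last reader admits writers

    def acquire_write(self):
        self._write_lock.acquire()  # exclusive: blocks readers and writers

    def release_write(self):
        self._write_lock.release()
```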
3. Thread Pool¶
Pre-created pool of worker threads that process tasks from a queue.
Benefits:
- Avoid thread creation overhead
- Control resource usage
- Better task scheduling
Components:
- Task queue
- Worker threads
- Thread pool manager
Use Cases:
- Web servers
- Database connection pools
- Parallel processing
4. Actor Model¶
Each actor is an independent entity that processes messages sequentially.
Characteristics:
- Actors communicate via messages
- Each actor processes one message at a time
- No shared state (messages are copied)
- Natural concurrency model
Examples:
- Erlang/Elixir processes
- Akka (Java/Scala)
- Orleans (.NET)
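A toy actor in Python terms — one mailbox, one thread, messages processed strictly in order, state owned by the actor so no locks are needed (a sketch, not a production framework):

```python
import queue
import threading

class Actor:
    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        self._mailbox.put(message)  # fire-and-forget

    def _run(self):
        while True:
            message = self._mailbox.get()  # one message at a time, in order
            if message is None:            # sentinel shuts the actor down
                break
            self._handler(message)

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()

# Usage: only the actor's thread ever touches `received`
received = []
counter = Actor(lambda n: received.append(n))
for n in (1, 2, 3):
    counter.send(n)
counter.stop()
```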
Best Practices for Concurrency¶
-
Minimize Shared State:
- Use immutable data structures
- Prefer message passing over shared memory
- Encapsulate shared state
-
Use Appropriate Synchronization:
- Choose right primitive (mutex, semaphore, etc.)
- Keep critical sections small
- Avoid nested locks
-
Prevent Deadlocks:
- Lock ordering
- Lock timeouts
- Avoid circular dependencies
-
Performance Considerations:
- Lock contention is expensive
- Consider lock-free alternatives
- Use reader-writer locks for read-heavy workloads
- Minimize time in critical sections
-
Testing:
- Concurrency bugs are hard to reproduce
- Use stress testing
- Consider formal verification tools
- Test with different thread counts
-
Documentation:
- Document thread-safety guarantees
- Specify which operations are thread-safe
- Document locking requirements
Operating System Concepts in System Design¶
Understanding OS fundamentals helps in system design:
-
Process vs. Thread Choice:
- Use processes for isolation and fault tolerance
- Use threads for shared-memory parallelism
- Consider containerization (process-like isolation)
-
Synchronization in Distributed Systems:
- OS locks don't work across machines
- Use distributed locks (Redis, ZooKeeper, etcd)
- Consider eventual consistency
-
I/O Models:
- Blocking I/O: Simple but limits concurrency
- Non-blocking I/O: Better for high concurrency
- Async I/O: Event-driven, highly scalable
- Consider epoll/kqueue for high-performance servers
-
Resource Management:
- Connection pooling (like thread pools)
- Rate limiting (like semaphores)
- Circuit breakers (like monitors)
-
Scalability:
- Thread-per-request model doesn't scale
- Use async/event-driven models
- Consider actor model for high concurrency
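The async/event-driven I/O model recommended above can be sketched with `asyncio`, which multiplexes many connections on one thread using epoll/kqueue under the hood (host and port are placeholders):

```python
import asyncio

async def handle(reader, writer):
    data = await reader.readline()  # yields to the event loop while waiting
    writer.write(data.upper())      # echo back, uppercased
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()

# asyncio.run(main())  # one thread can serve thousands of concurrent connections
```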
Tools for Software Design¶
- UML (Unified Modeling Language): For creating diagrams that represent system architecture and behavior. UML provides standardized notation for class diagrams, sequence diagrams, activity diagrams, and deployment diagrams, facilitating communication between team members and stakeholders.
- Draw.io, Lucidchart: Diagramming tools for creating flowcharts and architectural diagrams. These tools support collaboration, version control, and integration with various platforms, making them essential for documenting system designs.
- Enterprise Architect, Visual Paradigm: Comprehensive modeling tools for software design. These tools provide advanced features for UML modeling, code generation, reverse engineering, and team collaboration, suitable for large-scale enterprise projects.
- Figma, Sketch: For UI/UX design aspects. These tools enable designers and developers to create, prototype, and collaborate on user interface designs, ensuring consistency and usability across the application.
- Architecture Decision Records (ADRs): Document architectural decisions and their rationale. Tools like adr-tools help maintain a record of design decisions over time.
- System Design Tools: Tools like C4 Model, Structurizr for creating architecture diagrams at different levels of abstraction (context, container, component, code).
Reliability and Resilience Patterns¶
Circuit Breaker Pattern¶
Prevents cascading failures by stopping requests to a failing service.
┌─────────────────────────────────────────────────────────────────┐
│ CIRCUIT BREAKER STATES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ failures > threshold ┌──────────┐ │
│ │ CLOSED │ ───────────────────────────────▶│ OPEN │ │
│ │ │ │ │ │
│ │ Requests │ │ Requests │ │
│ │ pass │ │ fail │ │
│ │ through │ │ fast │ │
│ └────▲─────┘ └────┬─────┘ │
│ │ │ │
│ │ success timeout │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ HALF-OPEN │ │
│ └───────────────────────────────────│ │ │
│ │ Limited │ │
│ failure │ requests │────┘
│ │ allowed │
│ └──────────────┘
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation:
import time
from enum import Enum
from threading import Lock

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""
    pass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"
class CircuitBreaker:
def __init__(
self,
failure_threshold=5,
recovery_timeout=30,
half_open_requests=3
):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.half_open_requests = half_open_requests
self.state = CircuitState.CLOSED
self.failures = 0
self.last_failure_time = None
self.half_open_successes = 0
self.lock = Lock()
def call(self, func, *args, **kwargs):
with self.lock:
if self.state == CircuitState.OPEN:
if self._should_attempt_reset():
self.state = CircuitState.HALF_OPEN
self.half_open_successes = 0
else:
raise CircuitOpenError("Circuit is OPEN")
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
with self.lock:
if self.state == CircuitState.HALF_OPEN:
self.half_open_successes += 1
if self.half_open_successes >= self.half_open_requests:
self.state = CircuitState.CLOSED
self.failures = 0
else:
self.failures = 0
def _on_failure(self):
with self.lock:
self.failures += 1
self.last_failure_time = time.time()
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.OPEN
elif self.failures >= self.failure_threshold:
self.state = CircuitState.OPEN
def _should_attempt_reset(self):
if self.last_failure_time is None:
return True
return time.time() - self.last_failure_time >= self.recovery_timeout
# Usage
circuit_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order_id):
return circuit_breaker.call(payment_api.process, order_id)
Retry Pattern with Exponential Backoff¶
import random
import time
from functools import wraps
def retry_with_backoff(
max_retries=3,
base_delay=1.0,
max_delay=60.0,
exponential_base=2,
jitter=True,
retryable_exceptions=(Exception,)
):
"""Decorator for retrying with exponential backoff."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
last_exception = None
for attempt in range(max_retries + 1):
try:
return func(*args, **kwargs)
except retryable_exceptions as e:
last_exception = e
if attempt == max_retries:
raise
# Calculate delay with exponential backoff
delay = min(
base_delay * (exponential_base ** attempt),
max_delay
)
# Add jitter to prevent thundering herd
if jitter:
delay = delay * (0.5 + random.random())
time.sleep(delay)
raise last_exception
return wrapper
return decorator
# Usage
@retry_with_backoff(
max_retries=3,
base_delay=1.0,
retryable_exceptions=(ConnectionError, TimeoutError)
)
def fetch_data(url):
response = requests.get(url, timeout=5)
response.raise_for_status()
return response.json()
Bulkhead Pattern¶
Isolates components to prevent failures from spreading.
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

class BulkheadFullError(Exception):
    """Raised when the bulkhead has no capacity within the wait window."""
    pass

class Bulkhead:
"""Isolates resources with separate thread pools and semaphores."""
def __init__(self, name, max_concurrent, max_wait_time=5):
self.name = name
self.semaphore = Semaphore(max_concurrent)
self.executor = ThreadPoolExecutor(max_workers=max_concurrent)
self.max_wait_time = max_wait_time
def execute(self, func, *args, **kwargs):
acquired = self.semaphore.acquire(timeout=self.max_wait_time)
if not acquired:
raise BulkheadFullError(f"Bulkhead '{self.name}' is full")
try:
future = self.executor.submit(func, *args, **kwargs)
return future.result(timeout=self.max_wait_time)
finally:
self.semaphore.release()
# Create separate bulkheads for different services
payment_bulkhead = Bulkhead("payment", max_concurrent=10)
inventory_bulkhead = Bulkhead("inventory", max_concurrent=20)
notification_bulkhead = Bulkhead("notification", max_concurrent=5)
# Usage - failures in payment don't affect inventory
def process_order(order):
# Each uses its own isolated resources
payment_result = payment_bulkhead.execute(payment_service.charge, order)
inventory_result = inventory_bulkhead.execute(inventory_service.reserve, order)
notification_bulkhead.execute(notification_service.send, order)
Timeout Pattern¶
import asyncio
import time
from contextlib import asynccontextmanager

@asynccontextmanager
async def timeout_context(seconds, error_message="Operation timed out"):
    """Async timeout wrapper (uses asyncio.timeout, Python 3.11+)."""
    try:
        async with asyncio.timeout(seconds):
            yield
    except TimeoutError:
        raise TimeoutError(error_message)
# Usage with cascading timeouts
async def get_order_details(order_id):
total_timeout = 5.0
start_time = time.time()
# First call
async with timeout_context(2.0):
order = await order_service.get(order_id)
elapsed = time.time() - start_time
remaining = total_timeout - elapsed
# Second call with remaining time
async with timeout_context(min(remaining, 2.0)):
user = await user_service.get(order.user_id)
return {'order': order, 'user': user}
Graceful Degradation¶
class ProductService:
def __init__(self):
self.recommendation_service = RecommendationService()
self.cache = Redis()
async def get_product_page(self, product_id):
# Core functionality - must succeed
product = await self.get_product(product_id)
# Enhanced features - can degrade
recommendations = await self._get_recommendations_gracefully(product_id)
reviews_summary = await self._get_reviews_gracefully(product_id)
return {
'product': product,
'recommendations': recommendations, # May be empty
'reviews_summary': reviews_summary, # May be cached/empty
}
async def _get_recommendations_gracefully(self, product_id):
try:
return await asyncio.wait_for(
self.recommendation_service.get(product_id),
timeout=0.5
)
except (TimeoutError, ServiceUnavailable):
# Fallback to cached recommendations
cached = await self.cache.get(f"recommendations:{product_id}")
if cached:
return cached
# Final fallback - popular products
return await self.get_popular_products()
except Exception:
return [] # Return empty on unexpected errors
Chaos Engineering¶
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
Principles of Chaos Engineering¶
- Build a Hypothesis Around Steady State Behavior
- Vary Real-World Events: Network issues, server failures, traffic spikes
- Run Experiments in Production: Or production-like environments
- Automate Experiments to Run Continuously
- Minimize Blast Radius: Start small, gradually increase scope
Chaos Experiments¶
1. Network Failures:
# Chaos Monkey for Kubernetes - Network delay
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-delay
spec:
action: delay
mode: one
selector:
namespaces:
- production
labelSelectors:
app: payment-service
delay:
latency: "100ms"
jitter: "50ms"
duration: "5m"
2. Pod Failures:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-failure
spec:
action: pod-kill
mode: one
selector:
namespaces:
- production
labelSelectors:
app: order-service
scheduler:
cron: "*/10 * * * *" # Every 10 minutes
3. Resource Stress:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: cpu-stress
spec:
mode: all
selector:
labelSelectors:
app: api-gateway
stressors:
cpu:
workers: 2
load: 80
duration: "2m"
Chaos Engineering Tools¶
| Tool | Description | Best For |
|---|---|---|
| Chaos Monkey | Netflix's original tool, kills instances | AWS/cloud instances |
| Chaos Mesh | Kubernetes-native chaos | K8s environments |
| Gremlin | Enterprise chaos platform | Production experiments |
| Litmus | CNCF chaos engineering | Kubernetes, cloud-native |
| Toxiproxy | TCP proxy for testing | Network simulation |
Building Chaos Experiments¶
class ChaosExperiment:
    def __init__(self, name, hypothesis, metrics, chaos_client):
        self.name = name
        self.hypothesis = hypothesis
        self.metrics = metrics            # metrics backend queried below
        self.chaos_client = chaos_client  # failure-injection client
        self.steady_state_metrics = None
def capture_steady_state(self):
"""Capture metrics before experiment."""
self.steady_state_metrics = {
'error_rate': self.metrics.get_error_rate(),
'latency_p99': self.metrics.get_latency_p99(),
'throughput': self.metrics.get_throughput(),
}
def inject_failure(self, failure_type, duration):
"""Inject the chaos."""
if failure_type == 'network_delay':
self.chaos_client.add_latency('payment-service', '200ms')
elif failure_type == 'pod_kill':
self.chaos_client.kill_pod('payment-service')
elif failure_type == 'cpu_stress':
self.chaos_client.stress_cpu('payment-service', 80)
def verify_hypothesis(self):
"""Check if system behaved as expected."""
current_metrics = {
'error_rate': self.metrics.get_error_rate(),
'latency_p99': self.metrics.get_latency_p99(),
'throughput': self.metrics.get_throughput(),
}
# Example hypothesis: error rate should stay below 1%
assert current_metrics['error_rate'] < 0.01, \
f"Error rate exceeded threshold: {current_metrics['error_rate']}"
# Latency should not increase more than 2x
assert current_metrics['latency_p99'] < self.steady_state_metrics['latency_p99'] * 2
def run(self, failure_type, duration=60):
self.capture_steady_state()
try:
self.inject_failure(failure_type, duration)
time.sleep(duration)
self.verify_hypothesis()
return {'status': 'passed', 'metrics': self.get_metrics()}
except AssertionError as e:
return {'status': 'failed', 'reason': str(e)}
finally:
self.rollback()
Common System Design Problems¶
URL Shortener (like bit.ly)¶
Requirements:
- Shorten long URLs to short codes
- Redirect short URLs to original
- Custom short codes (optional)
- Analytics (click tracking)
- Scale: 100M URLs created/month, 10:1 read:write ratio
Design:
┌─────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ API GW │────▶│ Shortener │
└─────────┘ └─────────────┘ │ Service │
└──────┬──────┘
│
┌──────────────────────┴──────────────────────┐
│ │
┌─────▼─────┐ ┌───────▼───────┐
│ Cache │ │ Database │
│ (Redis) │ │ (Cassandra) │
└───────────┘ └───────────────┘
Key Decisions:
- ID Generation: Base62 encoding of unique ID (62^7 = 3.5 trillion combinations)
- Database: Cassandra for high write throughput, or DynamoDB
- Caching: Cache popular URLs in Redis (80/20 rule)
import hashlib
import string
class URLShortener:
ALPHABET = string.ascii_letters + string.digits # Base62
def shorten(self, long_url, custom_code=None):
if custom_code:
if self.db.exists(custom_code):
raise CodeExistsError()
code = custom_code
else:
code = self._generate_code(long_url)
self.db.save(code, long_url)
return f"https://short.url/{code}"
    def _generate_code(self, url):
        # Counter-based approach: encode a unique ID (e.g., Snowflake) in Base62.
        # Note: truncating to 7 chars stays collision-free only while IDs fit in 62^7.
        unique_id = self.id_generator.next()
        return self._base62_encode(unique_id)[:7]
def _base62_encode(self, num):
if num == 0:
return self.ALPHABET[0]
result = []
while num:
result.append(self.ALPHABET[num % 62])
num //= 62
return ''.join(reversed(result))
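For redirects, the short code can be decoded back to the numeric ID for a primary-key lookup. A minimal sketch of the encode/decode pair as standalone functions (mirroring the class above; names are illustrative):

```python
import string

ALPHABET = string.ascii_letters + string.digits  # same Base62 alphabet as above

def base62_encode(num):
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num:
        digits.append(ALPHABET[num % 62])
        num //= 62
    return ''.join(reversed(digits))

def base62_decode(code):
    # Inverse of base62_encode: fold each character back into the number
    num = 0
    for char in code:
        num = num * 62 + ALPHABET.index(char)
    return num
```

Decoding makes the redirect path a single ID lookup instead of a secondary-index query on the code string.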
Rate Limiter¶
Requirements:
- Limit requests per user/IP
- Support different limits for different API tiers
- Distributed across multiple servers
- Low latency impact
Design using Sliding Window Counter in Redis:
import time

class DistributedRateLimiter:
def __init__(self, redis_client):
self.redis = redis_client
def is_allowed(self, key, limit, window_seconds):
now = time.time()
window_start = now - window_seconds
# Lua script for atomic sliding window
script = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window_start = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
-- Count current window
local count = redis.call('ZCARD', key)
if count < limit then
-- Add new request
redis.call('ZADD', key, now, now .. ':' .. math.random())
redis.call('EXPIRE', key, ARGV[4])
return 1
else
return 0
end
"""
return bool(self.redis.eval(
script, 1, key, now, window_start, limit, window_seconds
))
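For a single node (or for unit testing without Redis), the same sliding-window idea can be sketched in plain Python. This is an in-process stand-in, not a replacement for the distributed version above:

```python
import time
from collections import deque

class LocalRateLimiter:
    """Sliding window log: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.requests = {}  # key -> deque of accepted-request timestamps

    def is_allowed(self, key, now=None):
        now = time.time() if now is None else now
        window = self.requests.setdefault(key, deque())
        # Evict timestamps that have fallen out of the window
        while window and window[0] <= now - self.window:
            window.popleft()
        if len(window) < self.limit:
            window.append(now)
            return True
        return False
```

The window log gives exact counts at the cost of memory proportional to the limit; the Redis sorted-set version above is the same algorithm made atomic and shared across servers.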
Distributed Cache¶
Requirements:
- High read throughput (millions of reads/second)
- Sub-millisecond latency
- Automatic failover
- Data partitioning
Design:
┌─────────────────────────────────────────────────────────────────┐
│ Distributed Cache Cluster │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │
│ │ │ │ │ │ │ │
│ │ Keys: 0-32 │ │ Keys: 33-65│ │ Keys: 66-99│ │
│ │ │ │ │ │ │ │
│ │ Primary │ │ Primary │ │ Primary │ │
│ │ Replica1 │ │ Replica1 │ │ Replica1 │ │
│ │ Replica2 │ │ Replica2 │ │ Replica2 │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │
│ Consistent Hashing Ring for Key Distribution │
│ │
└─────────────────────────────────────────────────────────────────┘
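The consistent hashing ring named in the diagram can be sketched as follows. Each shard is placed on the ring many times as "virtual nodes" so that keys redistribute evenly when shards join or leave (a minimal sketch, not production code):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Ring is a sorted list of (hash, node); each node appears vnodes times
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ''))
        return self.ring[idx % len(self.ring)][1]
```

With virtual nodes, removing one shard remaps only roughly 1/N of the keys, instead of nearly all of them as with naive `hash(key) % N` sharding.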
Message Queue¶
Requirements:
- At-least-once delivery
- Ordering guarantees (per partition)
- Horizontal scalability
- Consumer groups
Simplified Kafka-like Design:
import time

class Partition:
def __init__(self, partition_id):
self.partition_id = partition_id
self.messages = [] # In practice, append-only log on disk
self.offset = 0
def append(self, message):
self.messages.append({
'offset': self.offset,
'timestamp': time.time(),
'data': message
})
self.offset += 1
return self.offset - 1
class Topic:
def __init__(self, name, num_partitions):
self.name = name
self.partitions = [
Partition(i) for i in range(num_partitions)
]
    def produce(self, key, message):
        # Partition by key hash so all messages for a key stay ordered.
        # (Use a stable hash in practice; Python's built-in hash() is
        # randomized per process for strings.)
        partition_id = hash(key) % len(self.partitions)
        return self.partitions[partition_id].append(message)
class ConsumerGroup:
def __init__(self, group_id, topic):
self.group_id = group_id
self.topic = topic
self.offsets = {} # partition_id -> offset
self.assignments = {} # consumer_id -> [partition_ids]
def consume(self, consumer_id, batch_size=100):
messages = []
for partition_id in self.assignments.get(consumer_id, []):
offset = self.offsets.get(partition_id, 0)
partition = self.topic.partitions[partition_id]
batch = partition.messages[offset:offset + batch_size]
messages.extend(batch)
return messages
def commit(self, partition_id, offset):
self.offsets[partition_id] = offset
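One piece the sketch leaves implicit is how `assignments` gets populated. A round-robin rebalance, similar in spirit to Kafka's assignors, could look like this (illustrative only):

```python
def assign_partitions(consumer_ids, num_partitions):
    """Round-robin assignment of partitions to consumers in a group."""
    assignments = {cid: [] for cid in consumer_ids}
    ordered = sorted(consumer_ids)  # sort for a deterministic result
    for partition_id in range(num_partitions):
        owner = ordered[partition_id % len(ordered)]
        assignments[owner].append(partition_id)
    return assignments
```

Rerunning this whenever a consumer joins or leaves the group is the essence of a rebalance: each partition has exactly one owner, which preserves per-partition ordering.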
Notification System¶
Requirements:
- Push notifications (iOS, Android, Web)
- Email and SMS
- Template management
- Rate limiting per user
- Priority queues
Design:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Notification System │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────────┐ ┌────────────────────────────┐ │
│ │ Trigger │────▶│ Notification │────▶│ Priority Queues │ │
│ │ Service │ │ Service │ │ │ │
│ └────────────┘ └────────────────┘ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ High │ │ Normal │ │ │
│ │ │ (OTP) │ │ (promo) │ │ │
│ │ └────┬────┘ └────┬────┘ │ │
│ └───────┼───────────┼──────┘ │
│ │ │ │
│ ┌──────────────────────────────┴───────────┴──────┐ │
│ │ Workers │ │
│ └───────────┬────────────┬────────────┬──────────┘ │
│ │ │ │ │
│ ┌─────▼────┐ ┌─────▼────┐ ┌─────▼────┐ │
│ │ Push │ │ Email │ │ SMS │ │
│ │ Provider │ │ Provider │ │ Provider │ │
│ │ (FCM/APNs)│ │(SendGrid)│ │ (Twilio) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
class NotificationService:
    def __init__(self, user_service):
        self.user_service = user_service  # looks up per-user channel preferences
        self.template_engine = TemplateEngine()
        self.rate_limiter = RateLimiter()
self.queues = {
'high': PriorityQueue('high'),
'normal': PriorityQueue('normal'),
'low': PriorityQueue('low'),
}
async def send(self, notification):
# Check rate limits
if not self.rate_limiter.allow(notification.user_id, notification.channel):
raise RateLimitExceeded()
# Render template
content = self.template_engine.render(
notification.template_id,
notification.variables
)
# Get user preferences
preferences = await self.user_service.get_preferences(notification.user_id)
# Enqueue for each enabled channel
for channel in notification.channels:
if preferences.is_enabled(channel):
message = {
'user_id': notification.user_id,
'channel': channel,
'content': content,
'metadata': notification.metadata,
}
priority = self._get_priority(notification.type)
self.queues[priority].enqueue(message)
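`_get_priority` is left out above; a plausible mapping from notification type to the three queues might look like this (the type names here are hypothetical):

```python
# Hypothetical notification types mapped to the three queues above
PRIORITY_BY_TYPE = {
    'otp': 'high',             # time-sensitive, user is actively waiting
    'security_alert': 'high',
    'order_update': 'normal',
    'promotion': 'low',        # tolerates delay and batching
}

def get_priority(notification_type):
    # Unknown types default to 'normal' rather than failing the send
    return PRIORITY_BY_TYPE.get(notification_type, 'normal')
```

Keeping the mapping in data rather than code lets product teams retune priorities without a deploy.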
System Design Interview Tips¶
Framework for System Design¶
1. Requirements Clarification (5 minutes):
- Functional requirements
- Non-functional requirements (scale, latency, availability)
- Constraints and assumptions
2. Back-of-Envelope Estimation (5 minutes):
- Traffic estimates (QPS, peak)
- Storage estimates
- Bandwidth estimates
3. High-Level Design (10 minutes):
- Core components
- Data flow
- API design
4. Deep Dive (15 minutes):
- Database schema
- Scaling strategies
- Trade-offs and alternatives
5. Wrap-Up (5 minutes):
- Bottlenecks and mitigations
- Monitoring and alerting
- Future improvements
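The estimation step is plain arithmetic; for the numbers used earlier (10M DAU, 20 requests per user per day, 3x peak factor), a quick sketch:

```python
dau = 10_000_000          # daily active users
requests_per_user = 20
seconds_per_day = 86_400
peak_factor = 3           # assume peak traffic is 3x the average

daily_requests = dau * requests_per_user        # 200 million/day
average_rps = daily_requests / seconds_per_day  # ~2,300 RPS
peak_rps = average_rps * peak_factor            # ~7,000 RPS
```

Rounding aggressively is fine here: the goal is the order of magnitude, which then drives choices like how many servers, shards, or cache nodes to discuss.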
Key Questions to Ask¶
Functional:
- Who are the users?
- What are the main features?
- What operations are supported?
Scale:
- How many users?
- How many requests per second?
- How much data?
Performance:
- What's the expected latency?
- What's the availability requirement?
- What consistency model is needed?
Common Trade-offs¶
| Decision | Option A | Option B |
|---|---|---|
| Consistency | Strong (CP) | Eventual (AP) |
| Storage | SQL (ACID) | NoSQL (scale) |
| Communication | Sync (simple) | Async (resilient) |
| Caching | Cache aside (flexible) | Write-through (consistent) |
| Scaling | Vertical (simple) | Horizontal (near-unlimited) |