System Design¶
System design is the process of conceptualizing and planning a software system's architecture, components, modules, interfaces, and data to satisfy specified requirements. It serves as a blueprint for software development, translating user requirements into a detailed plan for implementation. Design is the phase of the SDLC (Software Development Life Cycle) that shifts the focus from the problem domain to the solution domain: it specifies how to fulfill the requirements captured in the SRS (Software Requirements Specification).
System Design Process¶
Before diving into technical aspects, understanding the systematic approach to system design is crucial:
1. Requirements Gathering¶
Functional Requirements: What the system should do
- User stories and use cases
- Core features and capabilities
- Business rules and workflows
- Data processing requirements
Non-Functional Requirements (NFRs): How the system should perform
- Performance: Response time (< 200ms for API calls), throughput (10,000 RPS)
- Scalability: Handle 10x traffic growth over 3 years
- Availability: 99.99% uptime (52 minutes downtime/year)
- Reliability: Data durability of 99.999999999% (11 nines)
- Security: Encryption at rest and in transit, compliance requirements
- Maintainability: Modular design, clear documentation
2. Back-of-the-Envelope Estimation¶
Critical for understanding scale requirements:
Traffic Estimation:
- Daily Active Users (DAU): 10 million
- Average requests per user per day: 20
- Total daily requests: 200 million
- Requests per second (RPS): 200M / 86,400 ≈ 2,300 RPS
- Peak RPS (assume 3× average): ~7,000 RPS
Storage Estimation:
- New records per day: 1 million
- Average record size: 1 KB
- Daily storage: 1 GB
- Yearly storage: 365 GB
- 5-year storage (with replication factor 3): ~5.5 TB
Bandwidth Estimation:
- Incoming data: 1 GB/day ≈ 12 KB/s
- Outgoing data (assume 10:1 read:write ratio): 10 GB/day ≈ 120 KB/s
Memory Estimation (for caching):
- Cache 20% of daily read requests
- Daily reads: 180 million (90% of 200M)
- Unique cacheable requests (assume 20%): 36 million
- Cache size: 36M × 1 KB = 36 GB
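The arithmetic above can be scripted so the assumptions are explicit and easy to vary. A minimal sketch (the function name and parameters are illustrative; the inputs are the assumed figures from the estimates above):

```python
SECONDS_PER_DAY = 86_400

def estimate(dau, requests_per_user, record_size_kb, new_records_per_day,
             peak_factor=3, replication=3, years=5):
    """Back-of-the-envelope traffic and storage estimates."""
    daily_requests = dau * requests_per_user
    avg_rps = daily_requests / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_factor
    # 1 KB records: convert KB/day to GB/day, then project out with replication
    daily_storage_gb = new_records_per_day * record_size_kb / 1_000_000
    storage_tb = daily_storage_gb * 365 * years * replication / 1_000
    return {
        "avg_rps": round(avg_rps),
        "peak_rps": round(peak_rps),
        "daily_storage_gb": daily_storage_gb,
        "storage_tb": storage_tb,
    }

est = estimate(dau=10_000_000, requests_per_user=20,
               record_size_kb=1, new_records_per_day=1_000_000)
```

With the figures above this yields roughly 2,300 average RPS, ~7,000 peak RPS, and about 5.5 TB of 5-year storage, matching the hand estimates.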
3. High-Level Design¶
Start with a simple architecture and evolve:
┌─────────┐     ┌───────────────┐     ┌─────────────┐     ┌──────────┐
│ Clients │────▶│ Load Balancer │────▶│ Application │────▶│ Database │
└─────────┘     └───────────────┘     │   Servers   │     └──────────┘
                                      └─────────────┘
Then add components as needed:
┌─────────┐     ┌─────┐     ┌───────────────┐     ┌─────────────┐
│ Clients │────▶│ CDN │────▶│ Load Balancer │────▶│     API     │
└─────────┘     └─────┘     └───────────────┘     │   Gateway   │
                                                  └──────┬──────┘
                    ┌────────────────────────────────────┼────────────────────────────────────┐
                    ▼                                    ▼                                    ▼
             ┌─────────────┐                      ┌─────────────┐                      ┌─────────────┐
             │  Service A  │                      │  Service B  │                      │  Service C  │
             └──────┬──────┘                      └──────┬──────┘                      └──────┬──────┘
                    │                                    │                                    │
             ┌──────▼──────┐                      ┌──────▼──────┐                      ┌──────▼──────┐
             │ Database A  │                      │Cache (Redis)│                      │Message Queue│
             └─────────────┘                      └─────────────┘                      └─────────────┘
4. Deep Dive into Components¶
For each component, consider:
- Technology choice and trade-offs
- Data model and schema design
- API contracts and interfaces
- Failure modes and recovery
- Scaling strategy
Key Aspects of Software Design¶
1. Architectural Design¶
Architectural design defines the overall structure of the system, identifying major components and their relationships. It establishes the foundation upon which the entire software will be built. This high-level view addresses how different parts of the system interact, what technologies will be used, and how the system will be deployed and scaled.
Key Considerations:
- Component Identification: Breaking down the system into logical, cohesive components (e.g., authentication service, payment gateway, data storage layer)
- Communication Patterns: Defining how components communicate (synchronous vs. asynchronous, REST APIs, message queues, event-driven architecture)
- Technology Stack: Selecting appropriate technologies for each component based on requirements, performance needs, and team expertise
- Deployment Architecture: Planning how the system will be deployed (monolithic vs. microservices, on-premises vs. cloud, containerization)
- Scalability Strategy: Designing for future growth (horizontal vs. vertical scaling, stateless vs. stateful components)
- Security Architecture: Defining security boundaries, authentication/authorization mechanisms, and data protection strategies
Common Architectural Patterns:
- Monolithic Architecture: Single, unified application where all components are tightly coupled
- Microservices Architecture: System composed of small, independent services that communicate over well-defined APIs
- Layered Architecture: System organized into horizontal layers (presentation, business logic, data access)
- Event-Driven Architecture: Components communicate through events, enabling loose coupling and asynchronous processing
- Service-Oriented Architecture (SOA): System built from reusable services that communicate via standardized protocols
Monolithic vs. Microservices: Deep Comparison¶
| Aspect | Monolithic | Microservices |
|---|---|---|
| Deployment | Single deployable unit | Independent service deployment |
| Scaling | Scale entire application | Scale individual services |
| Technology | Single tech stack | Polyglot (different stacks per service) |
| Data Management | Shared database | Database per service |
| Team Structure | Feature teams cross modules | Service-owning teams |
| Complexity | Simple initially, complex at scale | Complex initially, manageable at scale |
| Latency | In-process calls (fast) | Network calls (slower) |
| Consistency | ACID transactions easy | Distributed transactions complex |
| Debugging | Easier (single process) | Harder (distributed tracing needed) |
| Failure Isolation | One bug can crash all | Failures isolated to service |
When to Choose Monolithic:
- Small team (< 10 developers)
- Simple domain with clear boundaries
- Rapid prototyping and MVP development
- Strong consistency requirements
- Limited operational expertise
When to Choose Microservices:
- Large teams needing independent deployment
- Complex domains with clear bounded contexts
- Different scaling requirements per component
- Need for technology diversity
- High availability requirements with fault isolation
2. Detailed Design¶
Detailed design focuses on specifying the internal elements of all system components, their attributes, methods, and relationships. It provides the necessary information for developers to implement the design. This phase translates architectural decisions into concrete implementation specifications.
Key Aspects:
- Class/Module Design: Defining classes, modules, or functions with their specific responsibilities, methods, and properties
- Data Structures: Specifying how data will be organized and stored within each component
- Algorithms: Choosing and documenting the algorithms that will be used for critical operations
- Error Handling: Defining how errors will be detected, reported, and handled throughout the system
- State Management: Specifying how component state will be managed, persisted, and synchronized
- Performance Considerations: Identifying performance-critical paths and optimization strategies
- Testing Strategy: Outlining how components will be tested (unit tests, integration tests, etc.)
Design Artifacts:
- Class diagrams showing relationships between classes
- Sequence diagrams illustrating interaction flows
- State diagrams for components with complex state machines
- Database schemas and data models
- API specifications with request/response formats
3. Interface Design¶
Interface design describes how the software components interact with each other and with external systems. It includes user interfaces (UI), application programming interfaces (APIs), and data exchange formats. Well-designed interfaces are crucial for system integration, usability, and maintainability.
Types of Interfaces:
1. User Interfaces (UI):
- Design principles: consistency, clarity, feedback, error prevention
- Responsive design for multiple device types
- Accessibility considerations (WCAG guidelines)
- User experience (UX) flow and interaction patterns
2. Application Programming Interfaces (APIs):
- RESTful APIs: Resource-based, stateless, using HTTP methods
- GraphQL APIs: Query-based, allowing clients to request specific data
- gRPC: High-performance RPC framework using Protocol Buffers
- WebSocket APIs: Real-time, bidirectional communication
- API versioning strategies to maintain backward compatibility
- Authentication and authorization mechanisms (OAuth, API keys, JWT)
- Rate limiting and throttling to prevent abuse
- Documentation standards (OpenAPI/Swagger)
3. Data Exchange Formats:
- JSON: Human-readable, widely supported for web APIs
- XML: Structured data format, often used in enterprise systems
- Protocol Buffers: Efficient binary serialization format
- Message queue formats: AMQP, Kafka message formats
Interface Design Principles:
- Consistency: Similar operations should have similar interfaces
- Simplicity: Interfaces should be easy to understand and use
- Backward Compatibility: Changes should not break existing clients
- Documentation: Clear, comprehensive documentation is essential
- Error Handling: Consistent error response formats and codes
REST vs GraphQL vs gRPC: Deep Comparison¶
| Aspect | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP/1.1, HTTP/2 | HTTP (typically POST) | HTTP/2 |
| Data Format | JSON, XML | JSON | Protocol Buffers (binary) |
| Schema | Optional (OpenAPI) | Required (SDL) | Required (.proto files) |
| Typing | Weak | Strong | Strong |
| Over-fetching | Common problem | Solved (client specifies) | N/A (predefined messages) |
| Under-fetching | Multiple requests needed | Single query | Multiple calls or streaming |
| Caching | HTTP caching easy | Complex (POST requests) | Requires custom solution |
| Real-time | Polling or WebSockets | Subscriptions | Bidirectional streaming |
| Browser Support | Native | Native | Requires grpc-web |
| Learning Curve | Low | Medium | High |
When to Use Each:
- REST: Public APIs, simple CRUD operations, browser-based clients, caching important
- GraphQL: Mobile apps with varied data needs, complex nested data, rapid frontend iteration
- gRPC: Internal microservices, high performance needed, streaming requirements, polyglot environments
API Versioning Strategies¶
1. URL Path Versioning:
GET /api/v1/users/123
GET /api/v2/users/123
- Pros: Clear, explicit, easy to understand
- Cons: URL changes, may break bookmarks
2. Query Parameter Versioning:
GET /api/users/123?version=1
GET /api/users/123?version=2
- Pros: Same URL, optional parameter
- Cons: Easy to forget, caching complications
3. Header Versioning:
GET /api/users/123
Accept: application/vnd.myapi.v1+json
- Pros: Clean URLs, follows HTTP semantics
- Cons: Less visible, harder to test
4. Content Negotiation:
GET /api/users/123
Accept: application/vnd.myapi+json; version=1
- Pros: RESTful, flexible
- Cons: Complex implementation
Best Practice: Use URL path versioning for public APIs (clarity) and header versioning for internal APIs (flexibility).
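As an illustration of header versioning, here is a minimal sketch of extracting the version from the Accept header shown above (the `vnd.myapi` vendor prefix comes from the example; the function name and default are illustrative):

```python
import re

def parse_api_version(accept_header, default=1):
    # "application/vnd.myapi.v2+json" -> 2; anything else falls back to default
    match = re.search(r"application/vnd\.myapi\.v(\d+)\+json", accept_header or "")
    return int(match.group(1)) if match else default
```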
API Design Best Practices¶
Resource Naming:
# Good - nouns, plural, hierarchical
GET /users
GET /users/123
GET /users/123/orders
GET /users/123/orders/456
# Bad - verbs, actions in URL
GET /getUser/123
GET /user/123/getOrders
POST /createUser
HTTP Methods and Status Codes:
| Method | Usage | Success | Error |
|---|---|---|---|
| GET | Retrieve resource | 200 OK | 404 Not Found |
| POST | Create resource | 201 Created | 400 Bad Request |
| PUT | Replace resource | 200 OK | 404 Not Found |
| PATCH | Partial update | 200 OK | 400 Bad Request |
| DELETE | Remove resource | 204 No Content | 404 Not Found |
Pagination:
// Offset-based (simple but slow for large datasets)
GET /users?offset=100&limit=25
// Cursor-based (efficient, consistent)
GET /users?cursor=eyJpZCI6MTAwfQ&limit=25
// Response
{
"data": [...],
"pagination": {
"next_cursor": "eyJpZCI6MTI1fQ",
"has_more": true,
"total_count": 1000
}
}
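The opaque cursors in the example above are base64url-encoded JSON (eyJpZCI6MTAwfQ decodes to {"id":100}). A minimal sketch of cursor encoding and page fetching over an in-memory, id-sorted list (function names are illustrative):

```python
import base64
import json

def encode_cursor(last_id):
    # Cursor is base64url-encoded JSON, e.g. {"id":100} -> "eyJpZCI6MTAwfQ"
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def decode_cursor(cursor):
    padded = cursor + "=" * (-len(cursor) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))["id"]

def fetch_page(rows, cursor=None, limit=25):
    # rows must be sorted ascending by "id" for the cursor to be stable
    start_id = decode_cursor(cursor) if cursor else 0
    page = [r for r in rows if r["id"] > start_id][:limit]
    has_more = bool(page) and page[-1]["id"] < rows[-1]["id"]
    return {
        "data": page,
        "pagination": {
            "next_cursor": encode_cursor(page[-1]["id"]) if has_more else None,
            "has_more": has_more,
        },
    }
```

Unlike an offset, the cursor stays correct when earlier rows are inserted or deleted between page fetches.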
Error Response Format:
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid request parameters",
"details": [
{
"field": "email",
"message": "Invalid email format"
},
{
"field": "age",
"message": "Must be a positive integer"
}
],
"request_id": "req_abc123",
"documentation_url": "https://api.example.com/docs/errors#VALIDATION_ERROR"
}
}
Idempotency in APIs¶
Idempotency means that making the same request multiple times has the same effect as making it once. This is crucial for handling retries and network failures.
Naturally Idempotent Operations:
- GET, PUT, DELETE are idempotent by design
- Multiple GETs return same result
- Multiple PUTs set same state
- Multiple DELETEs result in deleted state
Making POST Idempotent:
# Client generates unique key
POST /payments
Idempotency-Key: abc123-def456-ghi789
{
"amount": 100,
"currency": "USD"
}
Server Implementation:
def create_payment(request):
    idempotency_key = request.headers.get('Idempotency-Key')

    # Check if we've seen this key before
    existing = cache.get(f"idempotency:{idempotency_key}")
    if existing:
        return existing  # Return cached response

    # Process the payment
    result = process_payment(request.body)

    # Store result with TTL (e.g., 24 hours)
    cache.set(f"idempotency:{idempotency_key}", result, ttl=86400)
    return result
Idempotency Key Best Practices:
- Generate on client side (UUID v4)
- Store with TTL (24-48 hours typical)
- Include in response for debugging
- Handle concurrent requests (use locks)
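The last point, handling concurrent requests with locks, can be sketched as follows. This is a minimal in-process illustration (the in-memory dictionaries stand in for a shared cache with TTL, and the names are illustrative); a real deployment would need a distributed lock or an atomic set-if-absent in the cache:

```python
import threading

_results = {}                     # stands in for a shared response cache
_locks = {}                       # one lock per idempotency key
_locks_guard = threading.Lock()   # protects the _locks dict itself

def _lock_for(key):
    with _locks_guard:
        return _locks.setdefault(key, threading.Lock())

def idempotent(key, handler):
    # Concurrent calls with the same key serialize on the per-key lock;
    # the first runs the handler, later ones replay the stored response.
    with _lock_for(key):
        if key in _results:
            return _results[key]
        result = handler()
        _results[key] = result
        return result
```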
4. Data Design¶
Data design involves designing data structures and database schemas to efficiently store, retrieve, and manipulate data within the system. This includes choosing appropriate data models, normalization strategies, indexing, and data access patterns.
Key Considerations:
1. Data Modeling:
- Entity-Relationship (ER) modeling to represent entities and their relationships
- Normalization to reduce data redundancy and improve integrity
- Denormalization strategies for performance optimization
- Data types and constraints to ensure data quality
2. Database Selection:
- Relational Databases (SQL): Structured data with ACID properties (PostgreSQL, MySQL, SQL Server)
- NoSQL Databases:
- Document stores (MongoDB, CouchDB) for flexible schemas
- Key-value stores (Redis, DynamoDB) for high-performance lookups
- Column-family stores (Cassandra, HBase) for wide tables
- Graph databases (Neo4j) for relationship-heavy data
3. Data Access Patterns:
- Read patterns: Query optimization, caching strategies, read replicas
- Write patterns: Transaction handling, write-through vs. write-back caching
- Data partitioning: Horizontal (sharding) and vertical partitioning strategies
- Replication: single-leader (master-slave), multi-leader (master-master), and leaderless replication
4. Data Integrity:
- Primary keys, foreign keys, and unique constraints
- Referential integrity rules
- Data validation at application and database levels
- Backup and recovery strategies
5. Performance Optimization:
- Indexing strategies (B-tree, hash indexes, composite indexes)
- Query optimization techniques
- Connection pooling and connection management
- Caching layers (application-level, database-level, distributed caches)
Database Selection Guide¶
| Database Type | Best For | Examples | Limitations |
|---|---|---|---|
| Relational (SQL) | Complex queries, ACID transactions, structured data | PostgreSQL, MySQL, SQL Server | Scaling writes, schema changes |
| Document Store | Flexible schemas, hierarchical data, rapid development | MongoDB, CouchDB, DynamoDB | Complex joins, transactions |
| Key-Value Store | Caching, sessions, simple lookups | Redis, Memcached, etcd | Complex queries, relationships |
| Wide-Column Store | Time-series, analytics, high write throughput | Cassandra, HBase, ScyllaDB | Complex queries, consistency |
| Graph Database | Social networks, recommendations, fraud detection | Neo4j, Amazon Neptune, JanusGraph | Scaling, complex aggregations |
| Time-Series | Metrics, IoT data, logs | InfluxDB, TimescaleDB, Prometheus | General-purpose queries |
| Search Engine | Full-text search, log analysis | Elasticsearch, Solr, Meilisearch | Primary data store |
Database Indexing Deep Dive¶
Index Types:
1. B-Tree Index (Default):
- Balanced tree structure
- O(log n) lookups
- Good for: equality, range queries, sorting
- Supports: <, <=, =, >=, >, BETWEEN, LIKE 'prefix%'
2. Hash Index:
- O(1) lookups for equality
- Not useful for range queries
- Memory-only in some databases
3. Bitmap Index:
- Efficient for low-cardinality columns
- Good for: gender, status, boolean fields
- Excellent for AND/OR operations
4. GiST/GIN Index (PostgreSQL):
- GiST: Generalized search trees (geometric data, full-text)
- GIN: Generalized inverted index (arrays, JSONB, full-text)
5. Covering Index:
- Includes all columns needed by query
- Avoids table lookup (index-only scan)
Index Design Principles:
-- Composite index: order matters!
CREATE INDEX idx_user_status_created
ON orders(user_id, status, created_at);
-- This index supports:
-- WHERE user_id = 123
-- WHERE user_id = 123 AND status = 'pending'
-- WHERE user_id = 123 AND status = 'pending' AND created_at > '2024-01-01'
-- But NOT efficiently:
-- WHERE status = 'pending' (doesn't use leftmost column)
-- WHERE user_id = 123 AND created_at > '2024-01-01' (skips middle column)
Index Anti-Patterns:
-- Anti-pattern 1: Function on indexed column
-- BAD: Can't use index
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
-- GOOD: Expression index or normalize data
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
-- Anti-pattern 2: Leading wildcard
-- BAD: Full table scan
SELECT * FROM products WHERE name LIKE '%phone%';
-- GOOD: Use full-text search
SELECT * FROM products WHERE to_tsvector(name) @@ to_tsquery('phone');
-- Anti-pattern 3: OR on different columns
-- BAD: May not use indexes efficiently
SELECT * FROM orders WHERE user_id = 123 OR product_id = 456;
-- GOOD: Use UNION
SELECT * FROM orders WHERE user_id = 123
UNION
SELECT * FROM orders WHERE product_id = 456;
Query Analysis:
-- Always use EXPLAIN ANALYZE
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 123 AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10;
-- Look for:
-- - Seq Scan (sequential scan - usually bad for large tables)
-- - Index Scan / Index Only Scan (good)
-- - Rows: estimated vs actual (large difference = stale statistics)
-- - Sort operations (consider adding index for ORDER BY)
Data Normalization vs Denormalization¶
Normalization (Reduce Redundancy):
-- Normalized schema (3NF)
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE
);
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255)
);
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    total DECIMAL(10,2)
);
CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER,
    price DECIMAL(10,2)
);
-- Query requires multiple joins
SELECT u.email, o.total, oi.quantity, p.name
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id;
Denormalization (Optimize Reads):
-- Denormalized for read performance
CREATE TABLE order_summary (
id SERIAL PRIMARY KEY,
user_id INTEGER,
user_email VARCHAR(255), -- Duplicated from users
total DECIMAL(10,2),
item_count INTEGER, -- Pre-calculated
product_names TEXT[], -- Array of product names
created_at TIMESTAMP
);
-- Single table query - much faster
SELECT * FROM order_summary WHERE user_id = 123;
When to Denormalize:
- Read-heavy workloads (10:1 or higher read:write ratio)
- Complex joins hurting performance
- Data rarely changes after creation
- Reporting and analytics queries
- Caching data in a fast-access format
Trade-offs:
| Aspect | Normalized | Denormalized |
|---|---|---|
| Write speed | Faster | Slower (multiple updates) |
| Read speed | Slower (joins) | Faster (single table) |
| Storage | Less | More (redundancy) |
| Data consistency | Easier | Harder (sync required) |
| Schema changes | Easier | Harder |
ACID vs BASE¶
ACID (Traditional Databases):
- Atomicity: Transaction is all-or-nothing
- Consistency: Database moves from valid state to valid state
- Isolation: Concurrent transactions don't interfere
- Durability: Committed data survives failures
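Atomicity is easy to demonstrate with SQLite, which ships with Python. In this sketch (table and account names are illustrative), a transfer that fails between its two writes is rolled back as a unit:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    # The with-block is one transaction: commit on success, rollback on error.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        if fail_midway:
            raise RuntimeError("crash between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

try:
    transfer(conn, "alice", "bob", 50, fail_midway=True)
except RuntimeError:
    pass  # the half-finished debit was rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed call, alice still has her full balance; only a successful call moves the money.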
BASE (Distributed Systems):
- Basically Available: System remains available
- Soft State: State may change without input (due to eventual consistency)
- Eventually Consistent: System becomes consistent over time
Isolation Levels (✓ = anomaly can occur at that level):
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| READ UNCOMMITTED | ✓ | ✓ | ✓ | Highest |
| READ COMMITTED | ✗ | ✓ | ✓ | High |
| REPEATABLE READ | ✗ | ✗ | ✓ | Medium |
| SERIALIZABLE | ✗ | ✗ | ✗ | Lowest |
-- Set the isolation level for a transaction
-- (PostgreSQL: after BEGIN; MySQL: issue SET TRANSACTION before starting it)
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Your transaction
COMMIT;
Design Principles¶
- SOLID Principles: Single Responsibility, Open-Closed, Liskov Substitution, Interface Segregation, and Dependency Inversion
- DRY (Don't Repeat Yourself): Avoiding code duplication
- KISS (Keep It Simple, Stupid): Maintaining simplicity in design
- Separation of Concerns: Dividing a system into distinct sections with minimal overlap
SOLID Principles¶
SOLID Stands for:
- S - Single-responsibility Principle
- O - Open-closed Principle
- L - Liskov Substitution Principle
- I - Interface Segregation Principle
- D - Dependency Inversion Principle
S - Single-responsibility Principle¶
The Single Responsibility Principle (SRP) is the first principle of SOLID, a set of five design principles for writing maintainable and scalable software, introduced by Robert C. Martin. SRP states that a class should have only one reason to change, meaning it should have only one responsibility or job.
Key Points:
- Definition: A class should focus on a single task or responsibility, encapsulating only the logic related to that purpose.
- Purpose: To reduce complexity, improve maintainability, and make code easier to understand and modify.
- Responsibility: Refers to a single, well-defined job or purpose (e.g., handling user data, logging, or sending emails).
- Consequence of Violation: If a class has multiple responsibilities, changes to one responsibility may unintentionally affect others, leading to bugs, fragility, and harder maintenance.
Benefits of SRP:
- Improved Maintainability: Changes to one responsibility (e.g., email logic) don't affect unrelated functionality (e.g., user management).
- Easier Testing: Smaller, focused classes are easier to test since they do one thing.
- Better Reusability: Classes with single responsibilities can be reused in other contexts.
- Reduced Coupling: Separating responsibilities minimizes dependencies between unrelated concerns.
How to Apply SRP:
- Identify distinct responsibilities in a class by asking, "What does this class do?"
- If a class does multiple unrelated things, split it into smaller classes.
- Use clear, descriptive names for classes to reflect their single purpose.
- Watch for signs of violation, like classes with too many methods or frequent changes for unrelated reasons.
Common Misconceptions:
- Over-splitting: SRP doesn't mean every class must have one method. A class can have multiple methods as long as they serve the same responsibility.
- Not just classes: SRP applies to functions, modules, or components too, depending on the context.
Real-World Analogy:
Think of a chef in a kitchen. If the chef is responsible for cooking, serving, and cleaning, they're overloaded with responsibilities. If the cooking process changes, it could disrupt serving or cleaning. Instead, assign a chef to cook, a waiter to serve, and a dishwasher to clean; each has one job, making the kitchen more efficient.
Practical Example in Context:
Imagine a web application:
- Bad: A ReportGenerator class that generates reports, formats them as PDF, and emails them.
- Good: Split into ReportGenerator (creates report data), ReportFormatter (formats as PDF), and EmailService (sends emails). Each class changes only when its specific responsibility needs an update.
By adhering to SRP, you create modular, flexible, and easier-to-maintain codebases, aligning with the broader goals of SOLID principles.
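The ReportGenerator split described above can be sketched as follows (class and method bodies are illustrative; plain text stands in for PDF formatting, and an in-memory outbox stands in for real email delivery):

```python
class ReportGenerator:
    """Only builds report data; knows nothing about formatting or delivery."""
    def generate(self, sales):
        return {"total": sum(sales), "count": len(sales)}

class ReportFormatter:
    """Only turns report data into an output format."""
    def format(self, report):
        return f"Sales report: {report['count']} sales, total {report['total']}"

class EmailService:
    """Only delivers messages; swapping providers touches nothing else."""
    def __init__(self):
        self.outbox = []
    def send(self, to, body):
        self.outbox.append((to, body))

report = ReportGenerator().generate([100, 250, 50])
body = ReportFormatter().format(report)
```

A change to the email provider now touches only EmailService; report logic and formatting are untouched.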
O - Open-closed Principle¶
The Open-Closed Principle (OCP) is the second principle of SOLID, a set of design principles for creating maintainable and scalable software, introduced by Robert C. Martin. OCP states that software entities (classes, modules, functions, etc.) should be open for extension but closed for modification.
Key Points:
- Open for Extension: You can add new functionality to a system by extending existing code (e.g., through inheritance, interfaces, or plugins).
- Closed for Modification: The existing code should not need to be changed to accommodate new functionality.
- Purpose: To make systems easier to maintain and extend without introducing bugs or altering tested, stable code.
- Consequence of Violation: Modifying existing code to add new features can introduce errors, break existing functionality, and require extensive retesting.
Benefits of OCP:
- Maintainability: Existing code remains untouched, reducing the risk of introducing bugs.
- Scalability: New features can be added by writing new classes or modules without altering the core system.
- Reusability: Polymorphic designs promote reusable components.
- Reduced Testing Overhead: Only new extensions need testing, not the entire system.
How to Apply OCP:
- Use abstraction (interfaces, abstract classes) to define contracts that new classes can implement.
- Leverage polymorphism to allow different implementations to be used interchangeably.
- Apply design patterns like Strategy, Factory, or Decorator to enable flexible extensions.
- Avoid conditional logic (e.g., if-else chains) for handling different types or behaviors; use polymorphism instead.
Common Misconceptions:
- Not just inheritance: While inheritance is a common way to achieve OCP, composition (e.g., using interfaces or dependency injection) is often preferred for flexibility.
- Not absolute: In some cases, minor modifications to existing code might be practical, but OCP encourages minimizing such changes.
Real-World Analogy:
Think of a power strip with multiple outlets. The strip is “closed” for modification—you don’t need to rewire it to plug in a new device. It’s “open” for extension because you can plug in new devices (like a lamp or charger) without changing the strip itself.
Practical Example in Context:
Imagine a payment processing system:
- Bad: A PaymentProcessor class with if-else statements for each payment type (credit card, PayPal, crypto). Adding a new payment type requires modifying the class.
- Good: Define a PaymentMethod interface with a processPayment method. Each payment type (e.g., CreditCardPayment, PayPalPayment) implements the interface. The PaymentProcessor uses the interface and remains unchanged when new payment methods are added.
By following OCP, you create systems that are easier to extend and maintain, aligning with the broader goals of SOLID principles.
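The PaymentMethod design described above can be sketched as follows (class names follow the example; the return values are illustrative):

```python
from abc import ABC, abstractmethod

class PaymentMethod(ABC):
    @abstractmethod
    def process_payment(self, amount): ...

class CreditCardPayment(PaymentMethod):
    def process_payment(self, amount):
        return f"charged {amount} to card"

class PayPalPayment(PaymentMethod):
    def process_payment(self, amount):
        return f"charged {amount} via PayPal"

class PaymentProcessor:
    """Closed for modification: works with any PaymentMethod, present or future."""
    def pay(self, method: PaymentMethod, amount):
        return method.process_payment(amount)
```

Adding a new payment type means adding a class that implements PaymentMethod; PaymentProcessor is never edited.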
L - Liskov Substitution Principle¶
The Liskov Substitution Principle (LSP) is the third principle of SOLID, introduced by Barbara Liskov. It states that objects of a derived class should be able to replace objects of the base class without introducing errors or unexpected behavior in the system. In other words, a subclass should be substitutable for its superclass without breaking the program.
Key Points:
- Definition: If a program is designed to work with a base class (or interface), it should work correctly with any subclass (or implementation) of that base class without modification.
- Purpose: To ensure that inheritance hierarchies are designed correctly, maintaining the reliability and correctness of the system.
- Consequence of Violation: Violating LSP leads to brittle code, unexpected behavior, or errors when subclasses are used in place of their base classes.
Formal Definition: Let S be a subtype of T. Then, objects of type T in a program can be replaced with objects of type S without altering the correctness of the program.
Key Guidelines for LSP:
- Behavioral Compatibility: Subclasses must honor the contract (behavior) defined by the base class or interface. This includes respecting preconditions, postconditions, and invariants.
- Preconditions: Subclass methods should not require stronger conditions than the base class (e.g., stricter input requirements).
- Postconditions: Subclass methods should not weaken the guarantees provided by the base class (e.g., returning less specific results).
- Invariants: Properties that hold true for the base class must remain true for subclasses.
- Avoid Overriding with Incompatible Behavior: Subclasses shouldn’t throw unexpected exceptions or produce results that break the base class’s expectations.
- Use Interfaces or Abstract Classes Wisely: Define clear contracts that all subclasses can fulfill.
- Refactor Inheritance if Needed: If a subclass cannot fully substitute for its parent, reconsider the inheritance hierarchy or use composition instead.
Benefits of LSP:
- Reliability: Ensures subclasses can be used interchangeably with their base class without causing errors.
- Maintainability: Reduces bugs caused by unexpected behavior in inheritance hierarchies.
- Flexibility: Allows polymorphic code to work with new subclasses without modification.
- Better Design: Encourages well-thought-out class hierarchies and clear contracts.
Common Misconceptions:
- Not just syntactic compatibility: LSP is about behavior, not just matching method signatures. A subclass that compiles but behaves incorrectly still violates LSP.
- Not limited to inheritance: LSP applies to any implementation of an interface or contract, not just class inheritance.
Real-World Analogy:
Think of a power outlet and appliances. If a device (subclass) claims to work with a standard outlet (base class), it should function correctly when plugged in. If a "special" appliance requires a different voltage or connector, it violates the expectation of being substitutable for a standard appliance, causing issues.
Practical Example in Context:
Imagine a payment system:
- Bad: A Payment base class with a process method. A CryptoPayment subclass throws an exception for unsupported currencies, breaking code that expects Payment to always process successfully.
- Good: Define a Payment interface with a process method that all implementations (e.g., CreditCardPayment, CryptoPayment) can fulfill reliably. If CryptoPayment has limitations, handle them within the method (e.g., return an error code) rather than breaking the contract.
Relation to Other SOLID Principles:
- OCP: LSP supports the Open-Closed Principle by ensuring new subclasses can be added without breaking existing code.
- SRP: Classes with single responsibilities are less likely to violate LSP, as they have clearer, more focused contracts.
By adhering to LSP, you create robust inheritance hierarchies and polymorphic systems that are reliable and easier to extend, aligning with the broader goals of SOLID principles.
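The LSP-friendly design described above can be sketched as follows: every implementation honors the same contract by returning a result instead of throwing for unsupported cases (names follow the example; the result format is illustrative):

```python
class Payment:
    def process(self, amount, currency):
        raise NotImplementedError

class CreditCardPayment(Payment):
    def process(self, amount, currency):
        return {"ok": True, "message": f"charged {amount} {currency}"}

class CryptoPayment(Payment):
    SUPPORTED = {"BTC", "ETH"}
    def process(self, amount, currency):
        # Limitation handled inside the contract, not by throwing:
        if currency not in self.SUPPORTED:
            return {"ok": False, "message": f"{currency} not supported"}
        return {"ok": True, "message": f"charged {amount} {currency}"}

def checkout(payment: Payment, amount, currency):
    """Works identically with any substitutable Payment subclass."""
    result = payment.process(amount, currency)
    return result["ok"]
```

checkout never needs to know which subclass it received; no implementation can surprise it with an exception.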
I - Interface Segregation Principle¶
The Interface Segregation Principle (ISP) is the fourth principle of SOLID, introduced by Robert C. Martin. It states that clients should not be forced to depend on interfaces they do not use. In other words, interfaces should be specific to the needs of the client, and classes should not be required to implement methods they don’t need.
Key Points:
- Definition: Break down large, general-purpose interfaces into smaller, more specific ones tailored to the requirements of individual clients.
- Purpose: To reduce coupling, improve cohesion, and prevent classes from being burdened with irrelevant methods, making the system easier to maintain and extend.
- Consequence of Violation: Classes implementing a bloated interface may have to provide empty or meaningless implementations for unused methods, leading to fragile code and maintenance issues.
Benefits of ISP:
- Reduced Coupling: Clients only depend on the specific interfaces they need, not bloated ones.
- Improved Cohesion: Interfaces are more focused, aligning with specific responsibilities.
- Easier Maintenance: Classes implement only relevant methods, reducing unnecessary code and potential errors.
- Better Scalability: New interfaces can be added for new client needs without affecting existing classes.
How to Apply ISP:
- Split Large Interfaces: Break down “fat” interfaces into smaller, more specific ones based on client needs.
- Client-Driven Design: Design interfaces from the perspective of what clients (classes or modules) actually require.
- Avoid Empty Implementations: If a class is forced to implement empty or “not applicable” methods, it’s a sign the interface is too broad.
- Use Composition: Combine multiple small interfaces to compose the functionality a class needs.
Common Misconceptions:
- Not about minimal interfaces: ISP doesn’t mean every interface should have one method. Interfaces should be cohesive and serve a specific purpose.
- Not just for interfaces: While ISP focuses on interfaces in object-oriented programming, the principle applies to any contract (e.g., APIs, modules) that clients depend on.
Real-World Analogy:
Think of a restaurant menu. A general menu listing every dish (vegan, meat, desserts) forces all customers to navigate irrelevant options. Instead, offering separate vegan, carnivore, and dessert menus ensures customers only see what’s relevant to them, simplifying their experience.
Practical Example in Context:
Imagine a printer system:
- Bad: A Printer interface with print, scan, and fax methods. A basic printer that can’t scan or fax must implement empty methods, violating ISP.
- Good: Split into Printable, Scannable, and Faxable interfaces. A basic printer implements only Printable, while an all-in-one printer implements all three. Clients (e.g., a print job manager) only depend on the Printable interface if they need printing.
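A minimal Python sketch of this split, using abstract base classes (the interface and class names mirror the example above and are illustrative):

```python
from abc import ABC, abstractmethod

# Role interfaces carved out of a hypothetical "fat" Printer interface
class Printable(ABC):
    @abstractmethod
    def print_document(self, doc: str) -> None: ...

class Scannable(ABC):
    @abstractmethod
    def scan(self) -> str: ...

class Faxable(ABC):
    @abstractmethod
    def fax(self, doc: str, number: str) -> None: ...

class BasicPrinter(Printable):
    # Implements only what the device can actually do - no empty stubs
    def print_document(self, doc: str) -> None:
        print(f"printing: {doc}")

class AllInOnePrinter(Printable, Scannable, Faxable):
    def print_document(self, doc: str) -> None:
        print(f"printing: {doc}")

    def scan(self) -> str:
        return "scanned-image"

    def fax(self, doc: str, number: str) -> None:
        print(f"faxing {doc} to {number}")

def run_print_job(device: Printable, doc: str) -> None:
    # The client depends only on Printable, so it accepts either printer
    device.print_document(doc)
```

Note that `run_print_job` cannot even see `scan` or `fax`; the client's dependency is exactly as wide as its need.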
Relation to Other SOLID Principles:
- SRP: ISP aligns with the Single Responsibility Principle by ensuring interfaces focus on a single, cohesive purpose.
- LSP: ISP supports Liskov Substitution by ensuring subclasses implement only relevant behaviors, making substitution safer.
- OCP: Smaller interfaces make it easier to extend systems without modifying existing code.
D - Dependency Inversion Principle¶
The Dependency Inversion Principle (DIP) is the fifth and final principle of SOLID, introduced by Robert C. Martin. It states that high-level modules should not depend on low-level modules; both should depend on abstractions. Additionally, abstractions should not depend on details; details should depend on abstractions.
Key Points:
- Definition: Instead of high-level modules (business logic) directly depending on low-level modules (implementation details), both should rely on abstract interfaces or contracts. This decouples the system and makes it more flexible.
- Purpose: To reduce tight coupling between components, making the system easier to extend, test, and maintain.
- Consequence of Violation: Tight coupling to specific implementations makes it hard to change or swap components, leading to rigid and fragile code.
Core Concepts:
- High-Level Modules: Contain business logic or core functionality (e.g., a service orchestrating a process).
- Low-Level Modules: Handle implementation details (e.g., database access, file I/O, or external APIs).
- Abstractions: Interfaces or abstract classes that define contracts without specifying implementation details.
- Inversion: Instead of high-level modules controlling low-level ones directly, both depend on abstractions, “inverting” the traditional dependency flow.
Benefits of DIP:
- Decoupling: High-level and low-level modules are loosely coupled, relying on abstractions rather than concrete implementations.
- Flexibility: You can swap implementations (e.g., switch databases or APIs) without changing high-level code.
- Testability: Dependencies can be mocked or stubbed easily for unit testing.
- Maintainability: Changes in low-level details don’t require modifying high-level logic.
How to Apply DIP:
- Use Interfaces or Abstract Classes: Define abstractions that specify contracts for low-level modules.
- Dependency Injection: Pass dependencies (e.g., via constructor, setter, or framework) rather than hardcoding them in high-level modules.
- Avoid Concrete Dependencies: High-level modules should not instantiate or directly reference low-level classes.
- Design for Abstractions: Ensure both high-level and low-level modules depend on the same abstract contract.
Common Misconceptions:
- Not just dependency injection: While DIP often uses dependency injection, the principle is about depending on abstractions, not the mechanism of injection.
- Not about eliminating dependencies: It’s about managing dependencies through abstractions, not removing them entirely.
Real-World Analogy: Think of a lamp and a power source. Instead of wiring the lamp (high-level module) directly to a specific battery (low-level module), use a standard plug (abstraction). The lamp works with any power source (battery, wall outlet) that fits the plug, without needing to rewire the lamp.
Practical Example in Context:
Imagine a notification system:
- Bad: A NotificationService directly creates and uses an EmailSender class. Switching to SMS or push notifications requires modifying NotificationService.
- Good: Define a NotificationSender interface with a send method. EmailSender, SMSSender, and PushSender implement it. NotificationService depends on the NotificationSender interface, and the specific sender is injected at runtime.
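The "good" design can be sketched in Python with constructor injection. The class names follow the example above; the bodies are placeholders:

```python
from abc import ABC, abstractmethod

# The abstraction both the high-level and low-level modules depend on
class NotificationSender(ABC):
    @abstractmethod
    def send(self, recipient: str, message: str) -> bool: ...

class EmailSender(NotificationSender):
    def send(self, recipient: str, message: str) -> bool:
        # ... SMTP call would go here ...
        return True

class SMSSender(NotificationSender):
    def send(self, recipient: str, message: str) -> bool:
        # ... SMS gateway call would go here ...
        return True

class NotificationService:
    # High-level module: receives its dependency (constructor injection)
    # instead of instantiating a concrete sender itself
    def __init__(self, sender: NotificationSender):
        self.sender = sender

    def notify(self, user: str, message: str) -> bool:
        return self.sender.send(user, message)

# Swapping email for SMS requires no change to NotificationService:
service = NotificationService(SMSSender())
```

In tests, the same seam lets you inject a stub `NotificationSender` and verify behavior without touching a real email or SMS gateway.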
Relation to Other SOLID Principles:
- SRP: DIP supports Single Responsibility by encouraging focused abstractions for specific responsibilities.
- OCP: DIP enables Open-Closed Principle by allowing new implementations to be added without modifying existing code.
- LSP: DIP relies on Liskov Substitution to ensure that any implementation of an abstraction can be used interchangeably.
- ISP: DIP works best with small, specific interfaces (per Interface Segregation) to avoid bloated dependencies.
By adhering to DIP, you create systems that are decoupled, flexible, and easier to maintain, aligning with the broader goals of SOLID principles.
DRY - Don't Repeat Yourself¶
The "Don't Repeat Yourself" (DRY) principle is a fundamental concept in software development aimed at reducing repetition of code and improving maintainability, readability, and efficiency. It states that every piece of knowledge or logic in a system should have a single, unambiguous, authoritative representation. In other words, you should avoid duplicating code or data, as duplication can lead to errors, inconsistencies, and increased maintenance effort.
Key Aspects of DRY
- Single Source of Truth: Each piece of functionality or data should exist in one place. If you need to change something, you update it in one place, and all dependent parts of the system reflect that change.
- Avoid Code Duplication: Repeating code across multiple parts of an application can lead to problems when changes are needed, as you’d have to update every instance of the duplicated code.
- Promotes Modularity: DRY encourages breaking code into reusable functions, modules, or components, making it easier to maintain and scale.
- Applies Beyond Code: DRY can apply to data, configurations, documentation, and even design patterns, ensuring consistency across the system.
Benefits of DRY
- Maintainability: With less duplicated code, updates or bug fixes only need to be applied in one place, reducing the risk of missing spots or introducing inconsistencies.
- Readability: Code is cleaner and easier to understand when logic is centralized rather than scattered.
- Efficiency: Developers spend less time writing repetitive code and more time focusing on new features or improvements.
- Reduced Errors: Duplication increases the chance of errors, especially if changes are made inconsistently across duplicated sections.
When to Use DRY
- When repetition is meaningful: If you’re duplicating code or logic that represents the same concept or behavior, it’s a candidate for DRY.
- When future changes are likely: If a piece of logic or data is likely to change, centralizing it reduces future maintenance effort.
- When it improves clarity: DRY should make code easier to understand, not overly complex.
Potential Pitfalls
- Over-Abstraction: Applying DRY too aggressively can lead to overly complex abstractions that are hard to understand or maintain. For example, creating a highly generalized function to handle slightly different cases might make the code less readable.
- Premature Optimization: Trying to eliminate all duplication early on can lead to overengineering. Sometimes, it’s okay to have minor duplication if the code’s purpose or context might diverge later.
- Context Matters: Not all duplication is bad. If two pieces of code look similar but serve different purposes, unifying them might couple unrelated functionality, making future changes harder.
DRY vs. WET
The opposite of DRY is sometimes humorously called WET ("Write Everything Twice" or "We Enjoy Typing"). WET code is full of duplication, leading to bloated, error-prone, and hard-to-maintain systems.
Practical Tips for Applying DRY
- Refactor Regularly: When you notice duplicated code, refactor it into reusable components like functions, classes, or modules.
- Use Design Patterns: Patterns like Factory, Strategy, or Template Method can help centralize logic and avoid repetition.
- Leverage Tools: Use linters, code analyzers, or IDE features to detect duplication.
- Balance Simplicity and DRY: Ensure that applying DRY doesn’t make the code harder to understand or maintain.
DRY in Different Contexts
- Frontend Development: Use components (e.g., in React or Vue) to reuse UI elements and logic.
- Backend Development: Create reusable services, utilities, or middleware to handle common tasks like authentication or logging.
- DevOps: Use configuration files or Infrastructure as Code (e.g., Terraform) to avoid repeating setup instructions.
- Documentation: Maintain a single source for documentation (e.g., a wiki or README) to avoid outdated or conflicting information.
Example in a Real-World Scenario
Imagine a web app where multiple pages display a user’s profile information. Without DRY, you might repeat the same HTML and logic for rendering the profile on each page. By applying DRY, you create a reusable Profile component that takes user data as input and renders it consistently across the app. If the profile’s design changes, you update the component once, and all pages reflect the change.
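A minimal sketch of this idea in Python (the "component" here is just a function; names and markup are illustrative). The profile markup lives in exactly one place, and every page calls it:

```python
# Single source of truth for rendering a user profile
def render_profile(user: dict) -> str:
    """One authoritative template; every page calls this instead of
    duplicating the markup."""
    return (
        "<div class='profile'>"
        f"<h2>{user['name']}</h2>"
        f"<p>{user['email']}</p>"
        "</div>"
    )

def render_home_page(user: dict) -> str:
    # Reuses the shared component rather than copying the markup
    return "<main>" + render_profile(user) + "</main>"

def render_settings_page(user: dict) -> str:
    return "<main><h1>Settings</h1>" + render_profile(user) + "</main>"
```

If the profile design changes, only `render_profile` is edited; both pages pick up the change automatically.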
Conclusion
The DRY principle is about reducing redundancy to make code more maintainable, scalable, and less error-prone. However, it’s not a rigid rule—use it thoughtfully to balance simplicity and efficiency. When applied correctly, DRY leads to cleaner, more robust systems that are easier to work with over time.
KISS - Keep It Simple, Stupid¶
The KISS principle, which stands for "Keep It Simple, Stupid" (or sometimes "Keep It Short and Simple" or "Keep It Simple and Straightforward"), is a design philosophy that emphasizes simplicity in software development, engineering, and other fields. The core idea is to avoid unnecessary complexity, making systems, code, or designs as simple as possible while still achieving the intended purpose.
Origin
The KISS principle is attributed to Kelly Johnson, a lead engineer at Lockheed Martin’s Skunk Works in the mid-20th century. He reportedly told his team that aircraft designs should be simple enough for an average mechanic to repair with basic tools under combat conditions. The phrase has since been adopted widely in software development, project management, and other disciplines.
Key Aspects of KISS
- Simplicity First: Choose the simplest solution that solves the problem effectively, avoiding overengineering or adding unnecessary features.
- Clarity and Readability: Write code or create designs that are easy to understand, even for someone new to the project.
- Minimize Complexity: Avoid intricate logic, convoluted architectures, or excessive abstractions that make maintenance harder.
- Focus on Essentials: Prioritize core functionality over "nice-to-have" features that add complexity without significant value.
Benefits of KISS
- Easier Maintenance: Simple code or systems are easier to debug, modify, and extend.
- Improved Readability: Straightforward designs are more accessible to developers, reducing onboarding time and miscommunication.
- Faster Development: Simpler solutions often take less time to implement and test.
- Fewer Errors: Complexity increases the likelihood of bugs; simplicity reduces it.
- Scalability: Simple systems are often easier to scale or adapt to new requirements.
KISS in Relation to DRY
The KISS principle complements the DRY (Don’t Repeat Yourself) principle, but the two can sometimes conflict:
- DRY focuses on eliminating duplication by abstracting shared logic, which can sometimes lead to complex abstractions.
- KISS prioritizes simplicity, even if it means tolerating minor duplication to keep the code straightforward.
When applying both, balance is key. For instance, don’t create overly generalized functions to avoid duplication if they make the code harder to understand.
When to Apply KISS
- Early Development: Start with simple solutions to get a working prototype quickly, then refine as needed.
- When Complexity Creeps In: If a system starts becoming hard to understand or maintain, simplify it.
- Team Collaboration: Simple code is easier for teams to work on, especially with varying skill levels.
- User-Facing Systems: Simplicity in UI/UX or APIs improves user experience and adoption.
Potential Pitfalls
- Oversimplification: Making things too simple might ignore edge cases or future scalability needs. For example, a simplistic database schema might not handle growth well.
- Misinterpreting Simplicity: KISS doesn’t mean cutting corners or ignoring best practices; it means finding the simplest effective solution.
- Subjectivity: What’s "simple" can vary between developers. A solution that seems simple to one person might be confusing to another.
Practical Tips for Applying KISS
- Break Down Problems: Divide complex tasks into smaller, manageable pieces.
- Use Clear Naming: Choose variable, function, and class names that clearly describe their purpose.
- Avoid Overengineering: Don’t add features or abstractions "just in case" they’re needed later (see YAGNI - "You Aren’t Gonna Need It").
- Refactor for Simplicity: Regularly review code to simplify overly complex sections.
- Test Early: Simple designs are easier to test, so validate them early to catch issues.
- Leverage Existing Tools: Use libraries or frameworks to handle complex tasks instead of reinventing the wheel.
KISS in Different Contexts
- Software Development: Write concise, focused code with minimal dependencies and clear logic.
- Project Management: Use straightforward plans and avoid overcomplicating workflows with excessive tools or processes.
- UI/UX: Design interfaces with minimal steps to achieve user goals.
- Documentation: Write clear, concise documentation without unnecessary jargon.
Example in a Real-World Scenario
Imagine building a login system for a web app. A KISS approach would involve:
- A simple form with email and password fields.
- A single endpoint (e.g., /login) that validates credentials and returns a token.
- Clear error messages for invalid inputs.
Instead of:
- A complex form with multiple authentication options (e.g., biometrics, OAuth, OTP) before validating the need for them.
- Multiple endpoints for different login scenarios, increasing maintenance.
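The KISS version of the login flow above can be sketched as a single function with one code path. This is a deliberately minimal sketch, assuming an in-memory user store; a real system would use a salted password hash such as bcrypt or argon2 and a proper session framework:

```python
import hashlib
import hmac
import secrets

# Hypothetical in-memory user store: email -> SHA-256 hex of the password.
# (Illustrative only - real systems must use salted, slow password hashing.)
USERS = {"ada@example.com": hashlib.sha256(b"correct-horse").hexdigest()}

def login(email: str, password: str) -> dict:
    """Single endpoint logic: validate credentials, return a token or a clear error."""
    stored = USERS.get(email)
    if stored is None:
        return {"ok": False, "error": "unknown email"}
    supplied = hashlib.sha256(password.encode()).hexdigest()
    # Constant-time comparison avoids leaking information via timing
    if not hmac.compare_digest(stored, supplied):
        return {"ok": False, "error": "wrong password"}
    return {"ok": True, "token": secrets.token_hex(16)}
```

One entry point, two clear error messages, one success path; OAuth, OTP, or biometrics can be layered on later if a real need appears.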
Conclusion
The KISS principle is about prioritizing simplicity to create systems that are easier to build, understand, and maintain. By focusing on what’s essential and avoiding unnecessary complexity, developers can deliver robust, efficient solutions. However, KISS should be balanced with other principles like DRY and YAGNI to ensure the solution remains effective and adaptable. When applied thoughtfully, KISS leads to cleaner code, happier teams, and better user experiences.
YAGNI - You Aren’t Gonna Need It¶
The YAGNI principle, which stands for "You Aren’t Gonna Need It", is a software development guideline that advises against implementing functionality or adding complexity to a system unless it’s explicitly required for the current needs. Originating from Extreme Programming (XP), YAGNI emphasizes focusing on what’s necessary now, avoiding speculative features or overengineering that may never be used. It promotes lean development, reducing wasted effort and keeping systems simpler and more maintainable.
Key Aspects of YAGNI
- Build Only What’s Needed: Implement features, code, or infrastructure only when there’s a clear, immediate requirement, not because they might be useful later.
- Avoid Overengineering: Don’t add complexity (e.g., extra classes, configurations, or optimizations) in anticipation of future needs that may never materialize.
- Focus on Current Requirements: Prioritize delivering working software that meets the present scope, allowing for flexibility to adapt later.
- Iterative Development: YAGNI aligns with agile methodologies, encouraging incremental changes based on actual feedback or requirements.
Benefits of YAGNI
- Reduced Complexity: By avoiding unnecessary features, code remains simpler, easier to understand, and less prone to bugs.
- Faster Development: Focusing on current needs speeds up delivery, as developers aren’t spending time on speculative work.
- Lower Maintenance Costs: Fewer features mean less code to maintain, test, and debug.
- Flexibility: Avoiding premature commitments to specific designs or features makes it easier to adapt to future changes.
- Resource Efficiency: Saves time, effort, and resources that would be spent on unused functionality.
YAGNI in Relation to KISS and DRY
- YAGNI and KISS: Both emphasize simplicity. YAGNI avoids adding unnecessary features, while KISS (Keep It Simple, Stupid) focuses on keeping implementations straightforward. Together, they encourage minimal, clear solutions.
- YAGNI and DRY: DRY (Don’t Repeat Yourself) eliminates duplication, but overzealous DRY can lead to premature abstractions that YAGNI warns against. For example, creating a generalized framework for a feature that’s only used once violates YAGNI, even if it follows DRY.
When to Apply YAGNI
- Unclear Requirements: If a feature is speculative or based on "what if" scenarios, defer it until it’s explicitly needed.
- Early Development: Focus on delivering a minimum viable product (MVP) rather than building for hypothetical future needs.
- Resource-Constrained Projects: When time or budget is limited, prioritize essential functionality.
- Agile Environments: YAGNI fits well with iterative development, where features are added incrementally based on feedback.
Potential Pitfalls
- Underengineering: Overapplying YAGNI can lead to overly simplistic systems that are hard to extend when legitimate requirements arise. For example, ignoring basic scalability needs in a high-growth application.
- Ignoring Foreseeable Needs: If a requirement is highly likely based on clear evidence (e.g., customer feedback or project roadmap), YAGNI doesn’t mean ignoring it entirely.
- Technical Debt: Poorly designed simple solutions might require significant rework later. Balance YAGNI with good design practices.
- Team Misalignment: Some team members might interpret YAGNI as an excuse to cut corners or skip necessary planning.
Practical Tips for Applying YAGNI
- Validate Requirements: Confirm with stakeholders whether a feature is truly needed now.
- Refactor When Needed: If requirements change, refactor the code to accommodate new needs rather than overbuilding upfront.
- Use Agile Practices: Work in short iterations, delivering small, functional increments to validate what’s needed.
- Document Assumptions: Note why certain features were deferred to avoid confusion later.
- Balance with Scalability: Ensure the system is flexible enough to handle likely future needs without locking in speculative features.
- Communicate with Teams: Ensure everyone understands YAGNI to avoid adding unnecessary complexity during development.
YAGNI in Different Contexts
- Software Development: Write only the code needed for current functionality, avoiding speculative optimizations or frameworks.
- UI/UX Design: Build simple, focused interfaces rather than adding every possible user option upfront.
- DevOps: Deploy minimal infrastructure (e.g., a single server) instead of a complex, scalable cluster until traffic justifies it.
- Project Management: Avoid planning for every possible contingency; focus on immediate milestones.
Example in a Real-World Scenario
Imagine developing an e-commerce app. A YAGNI approach would involve:
- Building a basic product catalog and checkout system for the initial launch.
- Skipping features like advanced search filters, multi-currency support, or a recommendation engine until user demand or business needs justify them.
Instead of:
- Implementing a complex system with support for multiple languages, dynamic pricing, and AI-driven recommendations from the start, “just in case” they’re needed.
Conclusion
The YAGNI principle helps developers stay focused on delivering value by avoiding unnecessary work and complexity. By building only what’s required now, teams can ship faster, maintain simpler systems, and remain flexible for future changes. However, YAGNI should be applied thoughtfully, balancing immediate needs with reasonable foresight to avoid technical debt. When combined with KISS and DRY, it fosters efficient, maintainable, and practical software development.
Tools for Software Design¶
- UML (Unified Modeling Language): For creating diagrams that represent system architecture and behavior
- Draw.io, Lucidchart: Diagramming tools for creating flowcharts and architectural diagrams
- Enterprise Architect, Visual Paradigm: Comprehensive modeling tools for software design
- Figma, Sketch: For UI/UX design aspects
Importance of Good Software Design¶
- Maintainability: Well-designed software is easier to maintain and update. Changes can be made with minimal impact on other parts of the system, reducing the risk of introducing bugs and making the codebase more manageable as it grows.
- Scalability: Good design allows the system to grow and handle increased load. By considering scalability from the start, systems can accommodate growth in users, data, and traffic without requiring complete rewrites.
- Reliability: Proper design reduces bugs and system failures. Well-structured code with clear separation of concerns is easier to test, debug, and verify, leading to more stable systems.
- Efficiency: Optimized design improves performance and resource utilization. Thoughtful architectural decisions can significantly impact system performance, reducing latency, improving throughput, and minimizing resource consumption.
- Cost-effectiveness: While good design may take more time initially, it saves significant costs in the long run. Technical debt from poor design accumulates over time, making future changes more expensive and risky.
Distributed Systems¶
A distributed system is a collection of independent computers that appear to users as a single coherent system. These computers, often called nodes, are connected by a network and coordinate their actions by passing messages. Distributed systems enable applications to scale beyond the limitations of a single machine, provide fault tolerance, and allow for geographic distribution of services.
The Eight Fallacies of Distributed Computing¶
Before designing distributed systems, understand these common misconceptions, compiled at Sun Microsystems (the first seven are attributed to Peter Deutsch; the eighth was added later by James Gosling):
- The network is reliable → Connections fail, packets are lost
- Latency is zero → Network calls take time (often 1-100ms)
- Bandwidth is infinite → Data transfer has limits
- The network is secure → All traffic can be intercepted
- Topology doesn't change → Networks are reconfigured constantly
- There is one administrator → Multiple teams manage infrastructure
- Transport cost is zero → Serialization, network I/O have costs
- The network is homogeneous → Different hardware, protocols, versions
Design Implications:
- Always handle network failures gracefully
- Implement timeouts and retries with backoff
- Compress data and batch requests when possible
- Encrypt all traffic (TLS/mTLS)
- Use service discovery for dynamic topologies
- Design for independent deployability
- Consider serialization costs in protocol choice
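The second implication above (timeouts and retries with backoff) can be sketched in a few lines of Python. This is a generic illustration, not tied to any particular RPC library; the function names are illustrative:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky network call with exponential backoff plus jitter.

    Backoff prevents hammering a struggling service; jitter spreads
    retries out so many clients don't all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts - surface the failure
            # Exponential backoff: base, 2x, 4x, ... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay * random.uniform(0.5, 1.0))
```

In a real system the caught exception would be the timeout/transport error of your RPC client, and only idempotent operations should be retried blindly.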
Characteristics of Distributed Systems¶
- Concurrency: Multiple components execute simultaneously, requiring coordination and synchronization mechanisms.
- No Global Clock: Different nodes may have different perceptions of time, making it challenging to establish a global ordering of events.
- Independent Failures: Components can fail independently without bringing down the entire system. The system must be designed to handle partial failures gracefully.
- Heterogeneity: Nodes may run different operating systems, use different hardware, or be implemented in different programming languages.
- Transparency: The distributed nature should be hidden from users, who should perceive the system as a single, unified entity.
Challenges in Distributed Systems¶
1. Network Partitions¶
Network partitions occur when network failures cause groups of nodes to be unable to communicate with each other. This creates a split-brain scenario where different parts of the system may make conflicting decisions.
Solutions:
- Consensus Algorithms: Use algorithms like Raft or Paxos to ensure nodes agree on decisions even during network partitions
- Quorum-based Systems: Require a majority of nodes to agree before making decisions
- CAP Theorem Trade-offs: Understand that during partitions, you must choose between consistency and availability
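The quorum idea above is simple arithmetic: with `n` replicas, a majority is `n // 2 + 1`, and any two majorities must overlap in at least one node, which is why a majority partition can still make safe decisions. A tiny sketch:

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica set (e.g., 3 of 5, 2 of 3)."""
    return replicas // 2 + 1

def is_committed(acks: int, replicas: int) -> bool:
    """A write is durable once a majority of replicas has acknowledged it."""
    return acks >= quorum_size(replicas)
```

With 5 replicas the quorum is 3, so the system tolerates 2 failed or partitioned-away nodes while still accepting writes.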
2. Consistency¶
Consistency refers to the property that all nodes in a distributed system see the same data at the same time. Achieving strong consistency across distributed nodes is challenging due to network latency and potential failures.
Consistency Models:
- Strong Consistency: All nodes see the same data simultaneously. Updates are immediately visible to all nodes. This is the simplest model but can impact availability and performance.
- Eventual Consistency: Nodes may temporarily have different views of data, but they will eventually converge to the same state. This improves availability and performance but requires handling temporary inconsistencies.
- Weak Consistency: No guarantees about when consistency will be achieved. Used when performance is more critical than consistency.
- Causal Consistency: Preserves causal relationships between events. If event A causally precedes event B, all nodes will see A before B.
- Session Consistency: Guarantees consistency within a user session, but not necessarily across different sessions.
Implementation Strategies:
- Two-Phase Commit (2PC): Ensures all nodes agree to commit or abort a transaction
- Three-Phase Commit (3PC): Improves on 2PC by reducing blocking scenarios
- Vector Clocks: Track causal relationships between events in distributed systems
- CRDTs (Conflict-free Replicated Data Types): Data structures designed to merge automatically without conflicts
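To make the CRDT idea concrete, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs. Each node increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of message order or duplication. The class is an illustrative sketch:

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node_id -> count observed from that node

    def increment(self, amount: int = 1) -> None:
        # A node only ever increments its own slot
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so merges can happen in any order, any number of times
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Because merge is idempotent, re-delivering the same replication message is harmless - exactly the property that lets replicas reconcile without coordination.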
3. Availability¶
Availability is the proportion of time a system is operational and accessible. In distributed systems, achieving high availability requires redundancy and fault tolerance.
Availability Levels:
- 99% (Two 9s): ~87.6 hours of downtime per year
- 99.9% (Three 9s): ~8.76 hours of downtime per year
- 99.99% (Four 9s): ~52.56 minutes of downtime per year
- 99.999% (Five 9s): ~5.26 minutes of downtime per year
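The downtime figures above follow directly from the availability percentage; a quick sanity check:

```python
def downtime_per_year(availability_pct: float):
    """Return (hours, minutes) of allowed downtime per year
    for a given availability percentage."""
    hours = (1 - availability_pct / 100) * 365 * 24
    return hours, hours * 60
```

For example, 99% allows about 87.6 hours per year, and 99.99% about 52.6 minutes.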
Strategies for High Availability:
- Redundancy: Deploy multiple instances of services across different machines, data centers, or regions
- Health Checks: Continuously monitor service health and automatically route traffic away from unhealthy instances
- Graceful Degradation: Design systems to continue operating with reduced functionality when some components fail
- Circuit Breakers: Prevent cascading failures by stopping requests to failing services
- Retry Mechanisms: Implement intelligent retry strategies with exponential backoff
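The circuit-breaker strategy from the list above can be sketched as a small state machine: after a run of consecutive failures the circuit "opens" and calls fail fast, and after a cooldown one trial call is allowed through (the "half-open" state). This is a minimal, single-threaded illustration with hypothetical parameter names:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of waiting on a dead service
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Failing fast is what stops a cascade: callers spend no threads or timeouts on a service that is known to be down.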
4. Partition Tolerance¶
Partition tolerance is the system's ability to continue operating despite network partitions. This is a fundamental requirement for distributed systems, as network partitions are inevitable.
CAP Theorem:
The CAP theorem states that in a distributed system, you can only guarantee two out of three properties:
- Consistency (C): All nodes see the same data simultaneously
- Availability (A): System remains operational and responds to requests
- Partition Tolerance (P): System continues despite network partitions
Since partition tolerance is essential in distributed systems, the practical choice is between:
- CP Systems: Prioritize consistency and partition tolerance (e.g., traditional databases, distributed locks)
- AP Systems: Prioritize availability and partition tolerance (e.g., DNS, web caches)
CAP Theorem Visualized:

                Consistency (C)
                     /\
                    /  \
               CA  /    \  CP
                  /      \
                 /________\
    Availability (A)  AP  Partition Tolerance (P)
CA: Not practical in distributed systems (partitions happen)
CP: HBase, MongoDB (with majority writes), Zookeeper
AP: Cassandra, CouchDB, DynamoDB (eventual consistency)
Practical CAP Trade-offs:
| Scenario | Choice | Reasoning |
|---|---|---|
| Banking transaction | CP | Must prevent double-spending |
| Social media feed | AP | Stale data acceptable, must always load |
| Inventory count | Depends | CP for checkout, AP for display |
| User sessions | AP | Availability critical, eventual sync OK |
| Leader election | CP | Must have single leader |
| DNS | AP | Stale records OK, must respond |
CAP Theorem Misconceptions:
- "Choose 2 of 3" → More nuanced: during normal operation, you can have all three. CAP applies during partitions.
- Binary choice → Consistency and availability exist on a spectrum (tunable consistency).
- Static decision → Different operations can make different trade-offs.
PACELC Theorem (Extension of CAP):
If there's a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.
| System | P: A/C | E: L/C | Description |
|---|---|---|---|
| DynamoDB | A | L | Always available, low latency, eventual consistency |
| MongoDB | C | C | Consistent, may sacrifice latency |
| Cassandra | A | L | Tunable, defaults to availability |
| MySQL (single) | C | C | Strong consistency, higher latency |
BASE Properties (Alternative to ACID):
- Basically Available: System remains available even during failures
- Soft State: System state may change over time without input
- Eventually Consistent: System will become consistent over time
Eventual Consistency Models:
- Read-Your-Writes: You always see your own writes
- Monotonic Reads: Once you read a value, you never see older values
- Monotonic Writes: Writes by same client applied in order
- Causal Consistency: Causally related operations seen in order
    # Example: tunable consistency with the Python cassandra-driver.
    # The consistency level is set on the statement object.
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    # Write with QUORUM: a majority of replicas must acknowledge
    insert = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(insert, (user_id, name))

    # Read with ONE: any single replica answers - faster, but may be stale
    select = SimpleStatement(
        "SELECT * FROM users WHERE id = %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    row = session.execute(select, (user_id,)).one()
Distributed System Architectures¶
1. Client-Server Architecture¶
The most common distributed architecture where clients request services from servers. Servers provide resources and services, while clients consume them.
Characteristics:
- Centralized control and data management
- Clear separation of concerns
- Easier to secure and manage
- Single point of failure at the server
Variations:
- Multi-tier Architecture: Separates presentation, business logic, and data layers
- Thin Client: Minimal processing on client, most work done on server
- Thick Client: Significant processing on client side
2. Peer-to-Peer (P2P) Architecture¶
All nodes have equal capabilities and responsibilities. Nodes can act as both clients and servers.
Characteristics:
- Decentralized, no single point of failure
- Scalable as nodes can join and leave dynamically
- More complex to manage and secure
- Examples: BitTorrent, blockchain networks, distributed hash tables (DHT)
Types:
- Structured P2P: Nodes organized in a specific topology (e.g., DHT with consistent hashing)
- Unstructured P2P: Nodes connect randomly (e.g., Gnutella)
- Hybrid P2P: Combines centralized and decentralized elements (e.g., Napster)
3. Microservices Architecture¶
System composed of small, independent services that communicate over well-defined APIs. Each service is responsible for a specific business capability.
Characteristics:
- Services are independently deployable and scalable
- Each service can use different technologies
- Services communicate via lightweight protocols (HTTP/REST, gRPC, message queues)
- Fault isolation: failure in one service doesn't bring down the entire system
Benefits:
- Technology diversity: choose best tool for each service
- Independent scaling: scale services based on their specific needs
- Faster development: teams can work independently
- Easier to understand: each service has a focused responsibility
Challenges:
- Service coordination and communication overhead
- Distributed transaction management
- Data consistency across services
- Network latency and reliability
- Operational complexity (monitoring, logging, debugging)
4. Event-Driven Architecture¶
Components communicate through events. When something significant happens, components publish events that other components can subscribe to and react to.
Characteristics:
- Loose coupling between components
- Asynchronous communication
- Highly scalable and responsive
- Event sourcing: store all changes as a sequence of events
Event Patterns:
- Event Notification: Components notify others of events
- Event-Carried State Transfer: Events contain all data needed by subscribers
- Event Sourcing: Store state changes as events, reconstruct state by replaying events
- CQRS (Command Query Responsibility Segregation): Separate read and write models
Technologies:
- Message brokers: Apache Kafka, RabbitMQ, Amazon SQS
- Event streaming: Apache Kafka, AWS Kinesis
- Event stores: EventStore, Apache Pulsar
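The loose coupling these patterns rely on can be sketched with a minimal in-memory event bus; `EventBus` and the event names are illustrative, not tied to any broker listed above:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory publish/subscribe bus (illustrative only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher knows nothing about who consumes the event
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
shipments = []
bus.subscribe("OrderCreated", lambda e: shipments.append(e["order_id"]))
bus.publish("OrderCreated", {"order_id": 42})
```

A real broker adds durability, ordering, and delivery guarantees on top of this dispatch loop; the decoupling idea is the same.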
Distributed System Patterns¶
1. Leader Election¶
In distributed systems, sometimes you need a single node to coordinate activities. Leader election algorithms ensure that exactly one node is selected as the leader.
Use Cases:
- Coordinating distributed transactions
- Managing distributed locks
- Organizing replicated data
- Coordinating cluster-wide operations
Algorithms:
- Bully Algorithm: Highest ID node becomes leader
- Ring Algorithm: Nodes organized in a ring, pass election messages
- Raft: Consensus algorithm that includes leader election
- ZooKeeper-based: use ZooKeeper's ephemeral sequential znodes (or etcd leases) to elect a leader
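As a toy illustration of the Bully algorithm listed above (node IDs and the `alive` flag are simulated, with no real message passing):

```python
class Node:
    def __init__(self, node_id, alive=True):
        self.id = node_id
        self.alive = alive

def start_election(nodes, starter_id):
    """Bully: the starter challenges every higher-ID node; any live
    higher node takes over the election, so the highest live ID wins."""
    higher_alive = [n.id for n in nodes if n.id > starter_id and n.alive]
    if not higher_alive:
        return starter_id  # No response from above: declare victory
    # A responder takes over and repeats the process further up
    return start_election(nodes, min(higher_alive))

# Node 5 has crashed, so node 4 ends up as leader
nodes = [Node(1), Node(2), Node(3), Node(4), Node(5, alive=False)]
leader = start_election(nodes, starter_id=1)
```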
Consensus Algorithms Deep Dive¶
Consensus algorithms allow distributed systems to agree on a single value or sequence of values, even when some nodes fail.
Raft Consensus Algorithm:
Raft is designed to be understandable (unlike Paxos). It separates consensus into three sub-problems:
1. Leader Election:
┌─────────────────────────────────────────────────────────────────┐
│                           RAFT STATES                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│               timeout,             wins majority                │
│              start election        of votes                     │
│ ┌──────────┐ ──────────────▶ ┌───────────┐ ────────▶ ┌────────┐ │
│ │ Follower │                 │ Candidate │           │ Leader │ │
│ └──────────┘ ◀────────────── └───────────┘           └────────┘ │
│      ▲        loses election,      │  ▲                  │      │
│      │        discovers leader     └──┘ timeout:         │      │
│      │                            new election           │      │
│      └───────────────────────────────────────────────────┘      │
│              discovers server with higher term                  │
└─────────────────────────────────────────────────────────────────┘
- Nodes start as followers
- If no heartbeat from leader, become candidate
- Request votes from other nodes
- Majority votes → become leader
- Leader sends heartbeats to prevent new elections
2. Log Replication:
Leader Followers
│ │
│ AppendEntries(entries) │
│──────────────────────────────▶│
│ │
│ Success/Failure │
│◀──────────────────────────────│
│ │
│ (On majority success) │
│ Commit entry │
│ Apply to state machine │
│ │
- Leader receives client requests
- Appends to log, replicates to followers
- Once majority confirms, entry is committed
- Leader notifies followers to apply
3. Safety:
- Only nodes with up-to-date logs can become leader
- Log entries committed in a term are present in future leaders
- State machines apply entries in same order
Raft Example Implementation Pseudocode:
class RaftNode:
    def __init__(self, node_id, peers):
        self.id = node_id
        self.peers = peers          # The other nodes in the cluster
        self.state = 'follower'
        self.current_term = 0
        self.voted_for = None
        self.log = []
        self.commit_index = 0
        self.last_applied = 0

    def on_election_timeout(self):
        self.state = 'candidate'
        self.current_term += 1
        self.voted_for = self.id
        votes = 1  # Vote for self
        for peer in self.peers:
            response = peer.request_vote(
                term=self.current_term,
                candidate_id=self.id,
                last_log_index=len(self.log) - 1,
                last_log_term=self.log[-1].term if self.log else 0
            )
            if response.vote_granted:
                votes += 1
        # Majority of the full cluster (peers plus self)
        if votes > (len(self.peers) + 1) / 2:
            self.become_leader()

    def append_entries(self, entries, leader_commit):
        # Follower receives entries from leader
        self.log.extend(entries)
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
            self.apply_committed_entries()
Paxos vs Raft:
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Difficult | Designed to be simple |
| Leader | Optional (Multi-Paxos) | Required |
| Safety | Proven correct | Proven correct |
| Liveness | Not guaranteed (FLP impossibility) | Not guaranteed (FLP); randomized timeouts make it practical |
| Adoption | Chubby, Spanner | etcd, Consul, TiDB |
Byzantine Fault Tolerance (BFT):
Raft and Paxos handle crash failures but not malicious nodes. For Byzantine faults:
- PBFT (Practical BFT): Tolerates f Byzantine nodes with 3f+1 total
- Use cases: Blockchain, financial systems
- Trade-off: Higher latency, more messages
2. Distributed Locking¶
Distributed locks allow multiple processes to coordinate access to shared resources across a distributed system.
Requirements:
- Mutual exclusion: only one process can hold the lock
- Deadlock freedom: system should not deadlock
- Fault tolerance: lock should be released if holder crashes
- Performance: low latency for acquiring/releasing locks
Implementation Approaches:
- Database-based: Use database transactions to implement locks
- Redis-based: Use Redis with SET NX (set if not exists) and expiration
- ZooKeeper/etcd: Use consensus systems for distributed coordination
- Lease-based: Locks expire after a time period, requiring renewal
Challenges:
- Clock skew affecting expiration times
- Network partitions causing split-brain scenarios
- Performance under high contention
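A sketch of the Redis-style approach: `FakeRedis` is an in-memory stand-in for a real client (with Redis the equivalents are `SET key token NX EX ttl` and an atomic compare-and-delete Lua script), and the random token prevents one process from releasing a lock another now holds:

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for a Redis client (illustrative only)."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl_s):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return False  # Key exists and has not expired
        self.store[key] = (value, time.time() + ttl_s)
        return True

    def compare_and_delete(self, key, value):
        entry = self.store.get(key)
        if entry and entry[0] == value:
            del self.store[key]
            return True
        return False

def acquire_lock(client, resource, ttl_s=10):
    """Try to take the lock; return a release token, or None if held."""
    token = str(uuid.uuid4())
    if client.set_nx_ex(f"lock:{resource}", token, ttl_s):
        return token
    return None

def release_lock(client, resource, token):
    # Only the holder's token may release: avoids deleting a lock
    # someone else acquired after our TTL expired
    return client.compare_and_delete(f"lock:{resource}", token)

r = FakeRedis()
t1 = acquire_lock(r, "invoice:7")
t2 = acquire_lock(r, "invoice:7")  # Second caller is refused
```

The TTL is the lease: if the holder crashes, the lock frees itself, which is exactly the clock-skew hazard the challenges above describe.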
3. Distributed Transactions¶
Transactions that span multiple nodes in a distributed system. Ensuring ACID properties across distributed nodes is complex.
Two-Phase Commit (2PC):
┌────────────┐ ┌────────────┐ ┌────────────┐
│Coordinator │ │Participant1│ │Participant2│
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
│ ───── PREPARE ─────────────────▶│ │
│ ───── PREPARE ────────────────────────────────────────────────────▶
│ │ │
│◀──── VOTE YES ──────────────────│ │
│◀──── VOTE YES ────────────────────────────────────────────────────│
│ │ │
│ ───── COMMIT ──────────────────▶│ │
│ ───── COMMIT ─────────────────────────────────────────────────────▶
│ │ │
│◀──── ACK ───────────────────────│ │
│◀──── ACK ─────────────────────────────────────────────────────────│
- Prepare Phase: Coordinator asks all participants to prepare
- Commit Phase: If all agree, coordinator tells all to commit; otherwise, abort
Limitations:
- Blocking: if coordinator fails, participants may be blocked
- Single point of failure at coordinator
- Poor performance due to multiple round trips
Alternatives:
- Saga Pattern: Long-running transactions broken into smaller, compensatable transactions
- TCC (Try-Confirm-Cancel): Three-phase pattern for distributed transactions
- Eventual Consistency: Accept eventual consistency instead of strong consistency
Saga Pattern Deep Dive¶
A saga is a sequence of local transactions where each transaction updates a single service and publishes events. If any transaction fails, compensating transactions undo the changes.
Choreography-Based Saga:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Order │ │ Inventory │ │ Payment │ │ Shipping │
│ Service │ │ Service │ │ Service │ │ Service │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │
│ OrderCreated │ │ │
│──────────────────▶│ │ │
│ │ │ │
│ │ InventoryReserved │ │
│ │──────────────────▶│ │
│ │ │ │
│ │ │ PaymentProcessed │
│ │ │──────────────────▶│
│ │ │ │
│ ShippingScheduled │ │ │
│◀──────────────────────────────────────────────────────────│
Orchestration-Based Saga:
┌─────────────┐
│ Saga │
│Orchestrator │
└──────┬──────┘
│
┌─────────────────────────────┼─────────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Order │ │ Payment │ │ Shipping │
│ Service │ │ Service │ │ Service │
└─────────────┘ └─────────────┘ └─────────────┘
Saga Implementation Example:
class OrderSaga:
    """Orchestration-based saga for order processing."""
    steps = [
        ('order', 'create_order', 'cancel_order'),
        ('inventory', 'reserve_items', 'release_items'),
        ('payment', 'process_payment', 'refund_payment'),
        ('shipping', 'schedule_shipping', 'cancel_shipping'),
    ]

    def execute(self, order_data):
        completed_steps = []
        try:
            for service, action, compensation in self.steps:
                result = self.call_service(service, action, order_data)
                completed_steps.append((service, compensation, result))
            return {'status': 'success', 'order_id': order_data['id']}
        except SagaStepFailed as e:
            # Compensate in reverse order
            for service, compensation, data in reversed(completed_steps):
                self.call_service(service, compensation, data)
            return {'status': 'failed', 'error': str(e)}

    def call_service(self, service, action, data):
        # Call service via HTTP/gRPC/message queue
        pass
Saga vs 2PC:
| Aspect | Two-Phase Commit | Saga |
|---|---|---|
| Consistency | Strong (ACID) | Eventual |
| Availability | Lower (blocking) | Higher (no global lock) |
| Latency | Higher (2+ round trips) | Lower |
| Complexity | Simpler logic | Complex compensations |
| Rollback | Automatic | Manual (compensating transactions) |
| Isolation | Full | Potential anomalies |
Saga Design Considerations:
- Compensating Transactions: Design every step to be reversible
- Idempotency: Steps may be retried; must be idempotent
- Isolation: May see intermediate states; design for this
- Timeouts: Handle slow or failed services
- Semantic Lock: Use flags to prevent concurrent modifications
CQRS (Command Query Responsibility Segregation)¶
CQRS separates read and write operations into different models, allowing independent optimization.
CQRS Architecture
┌─────────────────────────────────────────────────────────────┐
│ Client │
└─────────────────────────────────────────────────────────────┘
│ │
│ Commands │ Queries
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Command Handler │ │ Query Handler │
│ (Write Model) │ │ (Read Model) │
└──────────┬──────────┘ └──────────┬──────────┘
│ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Write Database │──Sync/Async─▶│ Read Database │
│ (Normalized) │ Events │ (Denormalized) │
└─────────────────────┘ └─────────────────────┘
Write Model (Commands):
- Handles create, update, delete operations
- Validates business rules
- Normalized for consistency
- Optimized for write throughput
Read Model (Queries):
- Handles all read operations
- Denormalized for query performance
- Can have multiple read models for different query patterns
- Eventually consistent with write model
When to Use CQRS:
- Read and write patterns have different requirements
- Read-heavy workloads (10:1 or higher)
- Complex domain with rich business logic
- Need for different scaling strategies
- Event sourcing architecture
CQRS with Event Sourcing:
# Command handling
class OrderCommandHandler:
    def handle_create_order(self, command):
        # Validate
        if not self.inventory.has_stock(command.items):
            raise InsufficientStock()
        # Create event
        event = OrderCreatedEvent(
            order_id=generate_id(),
            customer_id=command.customer_id,
            items=command.items,
            timestamp=now()
        )
        # Store event
        self.event_store.append(event)
        # Publish for read model sync
        self.event_bus.publish(event)

# Query handling
class OrderQueryHandler:
    def get_order_summary(self, order_id):
        # Read from denormalized read database
        return self.read_db.query(
            "SELECT * FROM order_summaries WHERE id = ?",
            order_id
        )

# Read model projector
class OrderProjector:
    def on_order_created(self, event):
        # Update denormalized read model
        self.read_db.execute("""
            INSERT INTO order_summaries
            (id, customer_name, item_count, total, status)
            VALUES (?, ?, ?, ?, ?)
        """, [event.order_id, ...])
Event Sourcing¶
Instead of storing current state, store all changes (events) as the source of truth.
Traditional: Store current state
┌────────────────────────────────────────┐
│ User: {id: 1, name: "John", age: 31} │
└────────────────────────────────────────┘
Event Sourcing: Store all events
┌────────────────────────────────────────┐
│ UserCreated {id: 1, name: "John"} │
│ NameChanged {id: 1, name: "Johnny"} │
│ NameChanged {id: 1, name: "John"} │
│ AgeUpdated {id: 1, age: 30} │
│ AgeUpdated {id: 1, age: 31} │
└────────────────────────────────────────┘
Benefits:
- Complete audit trail
- Temporal queries (state at any point in time)
- Event replay for debugging
- Easy to add new projections
- Natural fit for event-driven systems
Challenges:
- Event schema evolution
- Eventual consistency
- Query complexity (need projections)
- Storage growth
Event Store Implementation:
class EventStore:
    def append(self, aggregate_id, events, expected_version):
        """Append events with optimistic concurrency."""
        with self.db.transaction():
            current_version = self.get_version(aggregate_id)
            if current_version != expected_version:
                raise ConcurrencyError("Aggregate modified")
            for i, event in enumerate(events):
                self.db.execute("""
                    INSERT INTO events
                    (aggregate_id, version, event_type, data, timestamp)
                    VALUES (?, ?, ?, ?, ?)
                """, [
                    aggregate_id,
                    expected_version + i + 1,
                    event.__class__.__name__,
                    serialize(event),
                    now()
                ])

    def get_events(self, aggregate_id, from_version=0):
        """Retrieve events to rebuild state."""
        return self.db.query("""
            SELECT * FROM events
            WHERE aggregate_id = ? AND version > ?
            ORDER BY version
        """, [aggregate_id, from_version])

class Order:
    """Event-sourced aggregate."""
    def __init__(self, events):
        self.id = None
        self.status = None
        self.items = []
        self.version = 0
        for event in events:
            self.apply(event)

    def apply(self, event):
        if isinstance(event, OrderCreated):
            self.id = event.order_id
            self.status = 'created'
            self.items = event.items
        elif isinstance(event, OrderShipped):
            self.status = 'shipped'
        # ... more event handlers
        self.version += 1
4. Service Discovery¶
Mechanism for services to find and communicate with each other in a distributed system.
Patterns:
- Client-Side Discovery: Clients query a service registry and select an instance
- Server-Side Discovery: Load balancer queries service registry and routes requests
Implementations:
- Service Registry: Centralized registry (e.g., Consul, Eureka, etcd)
- DNS-based: Use DNS for service discovery
- Service Mesh: Infrastructure layer handling service-to-service communication (e.g., Istio, Linkerd)
5. Circuit Breaker¶
Prevents cascading failures by stopping requests to a failing service and providing fallback behavior.
States:
- Closed: Normal operation, requests flow through
- Open: Service is failing, requests fail fast without calling service
- Half-Open: Testing if service has recovered, allowing limited requests
Benefits:
- Prevents resource exhaustion from retrying failed requests
- Provides fast failure feedback
- Allows services time to recover
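The three states can be sketched in a few lines; the thresholds and the injected clock are illustrative choices, not a production-ready breaker:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (thresholds are illustrative)."""
    def __init__(self, failure_threshold=3, recovery_timeout=30, clock=time.time):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock            # Injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback=None):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"  # Allow one probe request
            else:
                return fallback           # Fail fast, no network call
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures, (re)opens the circuit
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            return fallback
        self.failures = 0
        self.state = "closed"
        return result
```

Libraries such as resilience4j or pybreaker add per-window failure rates and metrics, but follow this same state machine.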
Distributed Data Management¶
1. Replication¶
Maintaining copies of data across multiple nodes to improve availability, performance, and fault tolerance.
Replication Strategies:
- Master-Slave (Primary-Replica): One master handles writes, replicas handle reads
- Master-Master (Multi-Master): Multiple nodes can handle writes, requiring conflict resolution
- Synchronous Replication: Wait for all replicas to confirm before acknowledging write
- Asynchronous Replication: Acknowledge write immediately, replicate in background
Consistency Models:
- Strong Consistency: All replicas have same data at all times
- Eventual Consistency: Replicas will converge to same state eventually
- Read-Your-Writes: User always sees their own writes
- Monotonic Reads: User never sees older data after seeing newer data
2. Sharding (Partitioning)¶
Dividing data across multiple nodes to distribute load and enable horizontal scaling.
Sharding Strategies:
- Range-based Sharding: Partition by value ranges (e.g., user IDs 1-1000 on shard 1)
- Hash-based Sharding: Use hash function to determine shard (e.g., hash(user_id) % num_shards)
- Directory-based Sharding: Maintain lookup table mapping keys to shards
- Geographic Sharding: Partition by geographic location
Challenges:
- Rebalancing: Moving data when adding/removing shards
- Cross-shard Queries: Queries spanning multiple shards are complex
- Hot Spots: Uneven distribution causing some shards to be overloaded
- Transaction Support: Transactions across shards are difficult
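Hash-based and range-based routing from the list above, as a sketch (the shard count and range boundaries are made up):

```python
import hashlib

NUM_SHARDS = 4

def hash_shard(user_id):
    """Hash-based: even spread, but naive resharding moves most keys."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Range-based: boundaries are explicit, and a hot range can be split
RANGES = [
    (1, 1000, "shard-0"),
    (1001, 2000, "shard-1"),
    (2001, 10**9, "shard-2"),
]

def range_shard(user_id):
    for low, high, shard in RANGES:
        if low <= user_id <= high:
            return shard
    raise KeyError(f"no shard covers id {user_id}")
```

The trade-off is visible in the code: the hash router needs no metadata but couples placement to `NUM_SHARDS`, while the range router needs the `RANGES` table (the directory) but can be rebalanced by editing it.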
3. Distributed Caching¶
Caching data across multiple nodes to improve performance and reduce load on backend systems.
Cache Patterns:
- Cache-Aside (Lazy Loading): Application checks cache, loads from DB if miss, stores in cache
- Write-Through: Write to cache and DB simultaneously
- Write-Back (Write-Behind): Write to cache, write to DB asynchronously
- Refresh-Ahead: Proactively refresh cache before expiration
Distributed Cache Solutions:
- Redis: In-memory data structure store, supports clustering
- Memcached: Simple distributed memory caching system
- Hazelcast: In-memory data grid with distributed caching
- Apache Ignite: Distributed caching and computing platform
Cache Invalidation:
- TTL (Time-To-Live): Cache entries expire after fixed time
- Event-based Invalidation: Invalidate on data change events
- Version-based: Use version numbers to detect stale data
Consistent Hashing¶
Consistent hashing is crucial for distributed caching and sharding. It minimizes data movement when nodes are added or removed.
Problem with Simple Hashing:
# Simple modulo hashing
server = hash(key) % num_servers
# If num_servers changes from 3 to 4:
# hash("user:123") % 3 = 1 → Server 1
# hash("user:123") % 4 = 2 → Server 2 (DIFFERENT!)
# Almost ALL keys get remapped!
Consistent Hashing Solution:
Consistent Hash Ring
0/360°
│
┌────────┴────────┐
╱ ╲
N1 N2
(50°) (120°)
╱ ╲
│ │
270° ────┤ ├──── 180°
│ │
╲ ╱
K1 N3
(240°) (200°)
╲ ╱
└────────┬────────┘
│
240°
Key K1 (hash=240°) → Clockwise → Node N1 (50°)
How It Works:
- Hash nodes and keys to positions on a ring (0-360° or 0-2^32)
- Each key is assigned to the first node clockwise from its position
- Adding/removing a node only affects keys in one segment
Virtual Nodes (Vnodes):
To improve load distribution, each physical node has multiple virtual positions:
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, replicas=150):
        self.ring = {}
        self.sorted_keys = []
        self.replicas = replicas
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Add node with virtual replicas."""
        for i in range(self.replicas):
            key = self._hash(f"{node}:{i}")
            self.ring[key] = node
            bisect.insort(self.sorted_keys, key)

    def remove_node(self, node):
        """Remove node and its virtual replicas."""
        for i in range(self.replicas):
            key = self._hash(f"{node}:{i}")
            del self.ring[key]
            self.sorted_keys.remove(key)

    def get_node(self, key):
        """Find the node responsible for this key."""
        if not self.ring:
            return None
        hash_key = self._hash(key)
        # Binary search for the first ring position clockwise of the key
        idx = bisect.bisect_left(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0  # Wrap around to the first node
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
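To make the benefit concrete, this standalone sketch rebuilds a tiny ring inline and measures how many of 10,000 keys move when a fourth node joins: only the keys the new node takes over are remapped (roughly a quarter), versus nearly all keys under modulo hashing:

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=150):
    """Sorted (position, node) pairs, with virtual nodes per server."""
    return sorted((h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))

def lookup(ring, key):
    positions = [pos for pos, _ in ring]
    # First ring position clockwise of the key's hash (wrapping around)
    idx = bisect.bisect_left(positions, h(key)) % len(ring)
    return ring[idx][1]

keys = [f"user:{i}" for i in range(10000)]
ring3 = build_ring(["A", "B", "C"])
ring4 = build_ring(["A", "B", "C", "D"])
before = {k: lookup(ring3, k) for k in keys}
after = {k: lookup(ring4, k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
# Every moved key lands on the new node D; the rest stay put.
```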
Consistent Hashing Applications:
- Distributed Caches: Redis Cluster, Memcached
- Database Sharding: Cassandra, DynamoDB
- Load Balancing: consistently route a client to the same server without storing per-session affinity state
- Distributed File Systems: Ceph, GlusterFS
Cache Strategies Deep Dive¶
Cache-Aside (Lazy Loading):
def get_user(user_id):
    # 1. Try cache first
    cached = cache.get(f"user:{user_id}")
    if cached:
        return cached
    # 2. Cache miss - load from DB
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    # 3. Populate cache
    cache.set(f"user:{user_id}", user, ttl=3600)
    return user

def update_user(user_id, data):
    # Update DB
    db.execute("UPDATE users SET ... WHERE id = ?", user_id)
    # Invalidate cache (or update it)
    cache.delete(f"user:{user_id}")
Write-Through:
def update_user(user_id, data):
    # Write to cache and DB atomically
    with transaction():
        db.execute("UPDATE users SET ... WHERE id = ?", user_id)
        cache.set(f"user:{user_id}", data, ttl=3600)
Write-Behind (Write-Back):
def update_user(user_id, data):
    # Write to cache immediately
    cache.set(f"user:{user_id}", data, ttl=3600)
    # Queue DB write for later
    write_queue.enqueue({
        'table': 'users',
        'id': user_id,
        'data': data
    })

# Background worker
def process_write_queue():
    while True:
        batch = write_queue.dequeue_batch(size=100)
        db.bulk_update(batch)
Cache Stampede Prevention:
import random
import time

def get_with_lock(key, fetch_func, ttl=3600):
    """Prevent cache stampede with distributed lock."""
    value = cache.get(key)
    if value:
        return value
    lock_key = f"lock:{key}"
    # Try to acquire lock
    if cache.set(lock_key, "1", nx=True, ex=10):  # NX = only if not exists
        try:
            # We got the lock - fetch and cache
            value = fetch_func()
            cache.set(key, value, ex=ttl)
            return value
        finally:
            cache.delete(lock_key)
    else:
        # Another process is fetching - wait and retry
        time.sleep(0.1)
        return get_with_lock(key, fetch_func, ttl)

def get_with_probabilistic_refresh(key, fetch_func, ttl=3600, beta=1):
    """Prevent stampede with probabilistic early refresh."""
    data = cache.get_with_ttl(key)
    if not data:
        value = fetch_func()
        cache.set(key, value, ex=ttl)
        return value
    value, remaining_ttl = data
    # Probabilistically refresh before expiration
    # Higher beta = more aggressive refresh
    expiry_gap = ttl - remaining_ttl
    if remaining_ttl - (beta * expiry_gap * random.random()) <= 0:
        # Refresh in background
        async_refresh(key, fetch_func, ttl)
    return value
Bloom Filters¶
A space-efficient probabilistic data structure for membership testing.
Properties:
- Can tell if element is definitely not in set (no false negatives)
- May have false positives (says "maybe in set" when not)
- Cannot delete elements (use Counting Bloom Filter)
- Very space efficient
Use Cases:
- Cache: Check if item might be in cache before expensive lookup
- Database: Check if row might exist before disk read
- Spam filtering: Check if URL might be malicious
- Distributed systems: Check if data might be on a node
import mmh3  # MurmurHash
from bitarray import bitarray

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)

    def add(self, item):
        for i in range(self.num_hashes):
            index = mmh3.hash(item, i) % self.size
            self.bit_array[index] = 1

    def might_contain(self, item):
        for i in range(self.num_hashes):
            index = mmh3.hash(item, i) % self.size
            if not self.bit_array[index]:
                return False  # Definitely not in set
        return True  # Might be in set

# Usage
bf = BloomFilter(size=1000000, num_hashes=7)
bf.add("user:123")
bf.add("user:456")
bf.might_contain("user:123")  # True
bf.might_contain("user:789")  # Probably False (might be True)
Optimal Parameters:
n = expected number of elements
p = desired false positive rate
m = optimal number of bits = -n * ln(p) / (ln(2)^2)
k = optimal number of hashes = (m/n) * ln(2)
Example: 1 million items, 1% false positive rate
m = ~9.6 million bits (~1.2 MB)
k = 7 hash functions
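The same formulas in code, reproducing the worked example:

```python
import math

def bloom_params(n, p):
    """Optimal bit count m and hash count k for n items at false
    positive rate p: m = -n*ln(p)/ln(2)^2, k = (m/n)*ln(2)."""
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return int(math.ceil(m)), int(round(k))

# 1 million items at a 1% false positive rate
m, k = bloom_params(1_000_000, 0.01)
# m is about 9.6 million bits (~1.2 MB) and k = 7, matching the
# example above
```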
Communication in Distributed Systems¶
1. Remote Procedure Call (RPC)¶
Allows a program to call a procedure on a remote server as if it were a local procedure call.
Characteristics:
- Synchronous communication (blocking)
- Type-safe with code generation
- Efficient binary protocols
- Examples: gRPC, Apache Thrift, CORBA
Challenges:
- Network latency and failures
- Parameter marshalling/unmarshalling
- Version compatibility
- Service discovery
2. Message Queues¶
Asynchronous communication pattern where messages are sent to queues and processed by consumers.
Benefits:
- Decoupling: producers and consumers don't need to know about each other
- Load leveling: smooth out traffic spikes
- Reliability: messages persisted until processed
- Scalability: multiple consumers can process messages
Message Queue Patterns:
- Point-to-Point: One consumer processes each message
- Publish-Subscribe: Multiple consumers receive each message
- Request-Reply: Synchronous pattern over async infrastructure
Technologies:
- RabbitMQ: Feature-rich message broker
- Apache Kafka: Distributed event streaming platform
- Amazon SQS: Managed message queue service
- Apache Pulsar: Cloud-native messaging and streaming
3. REST APIs¶
Representational State Transfer (REST) is an architectural style for designing networked applications.
Principles:
- Stateless: each request contains all information needed
- Resource-based: URLs represent resources
- HTTP methods: GET, POST, PUT, DELETE for operations
- Standard status codes: 200, 404, 500, etc.
Benefits:
- Simple and widely understood
- Works well with HTTP caching
- Language and platform agnostic
- Easy to test and debug
Monitoring and Observability¶
Distributed systems require comprehensive monitoring to understand system behavior and diagnose issues.
Three Pillars of Observability:
1. Metrics: Numerical measurements over time
   - System metrics: CPU, memory, disk, network
   - Application metrics: request rate, error rate, latency
   - Business metrics: revenue, user signups, conversions
2. Logs: Discrete events with timestamps
   - Structured logging: JSON format for easy parsing
   - Log aggregation: centralize logs from all services
   - Log levels: DEBUG, INFO, WARN, ERROR
3. Traces: Request flows across service boundaries
   - Distributed tracing: track requests across services
   - Span: single operation within a trace
   - Trace context propagation: pass trace IDs across services
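Structured logging (pillar 2) can be sketched with the standard library; the field set and the `service` value are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "checkout",  # Illustrative static field
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order created id=%s", 42)
```

One JSON object per line is what aggregation pipelines like the ELK stack expect; parsers index the fields instead of grepping free text.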
Tools:
- Metrics: Prometheus, Grafana, Datadog
- Logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, CloudWatch
- Tracing: Jaeger, Zipkin, AWS X-Ray
Scalability¶
Scalability is the capability of a system to handle a growing amount of work by adding resources. A scalable system can accommodate growth in users, data volume, or traffic without significant degradation in performance or requiring a complete redesign.
Types of Scalability¶
1. Vertical Scaling (Scale Up)¶
Increasing the capacity of existing hardware or software by adding more power (CPU, RAM, storage) to a single machine.
Characteristics:
- Simpler to implement: upgrade existing server
- No code changes typically required
- Limited by hardware maximums
- Single point of failure
- Can be expensive (high-end hardware costs)
When to Use:
- Small to medium applications
- When simplicity is more important than high availability
- Applications that can't be easily distributed
- Legacy systems that are difficult to refactor
Limitations:
- Physical limits: maximum CPU, memory, and I/O capacity
- Cost increases disproportionately with capacity (high-end hardware carries a premium)
- Downtime required for upgrades
- Cannot exceed single machine capabilities
2. Horizontal Scaling (Scale Out)¶
Adding more machines or instances to handle increased load, distributing work across multiple nodes.
Characteristics:
- Virtually unlimited capacity
- Better fault tolerance (no single point of failure)
- Cost-effective (commodity hardware)
- Requires distributed system design
- More complex to manage
When to Use:
- High-traffic applications
- Systems requiring high availability
- Cloud-native applications
- Modern microservices architectures
Challenges:
- State management: stateless services are easier to scale
- Data consistency across nodes
- Load distribution
- Operational complexity
Scalability Dimensions¶
1. Load Scalability¶
Ability to handle increased number of requests or transactions per unit of time.
Metrics:
- Throughput: Number of requests processed per second
- Requests Per Second (RPS): Measure of system capacity
- Transactions Per Second (TPS): For transactional systems
Strategies:
- Add more servers/instances
- Optimize code and database queries
- Implement caching layers
- Use CDN for static content
- Database read replicas for read-heavy workloads
2. Geographic Scalability¶
Ability to serve users from different geographic locations with acceptable latency.
Challenges:
- Network latency increases with distance
- Data sovereignty and compliance requirements
- Replication across regions
- Managing consistency across regions
Solutions:
- Content Delivery Networks (CDN): Cache content closer to users
- Edge Computing: Process requests at edge locations
- Multi-Region Deployment: Deploy services in multiple regions
- Database Replication: Replicate data across regions
- DNS-based Routing: Route users to nearest data center
3. Administrative Scalability¶
Ability to manage and maintain the system as it grows in size and complexity.
Challenges:
- More components to monitor and manage
- Configuration management complexity
- Deployment coordination
- Team coordination and communication
Solutions:
- Infrastructure as Code (IaC): Automate infrastructure provisioning
- Container Orchestration: Kubernetes, Docker Swarm for managing containers
- Service Mesh: Manage service-to-service communication
- Centralized Logging and Monitoring: Unified observability
- Automated Testing and Deployment: CI/CD pipelines
Scalability Patterns¶
1. Stateless Services¶
Services that don't store client state between requests. Each request contains all information needed to process it.
Benefits:
- Easy horizontal scaling: any instance can handle any request
- Simple load balancing
- No session affinity required
- Better fault tolerance
Implementation:
- Store session state in external store (Redis, database)
- Use JWT tokens for authentication (stateless)
- Pass all necessary context in requests
- Avoid server-side session storage
2. Database Scaling¶
Databases often become bottlenecks as systems scale. Multiple strategies exist for scaling databases.
Read Scaling:
- Read Replicas: Create copies of database for read operations
- Read/Write Splitting: Route reads to replicas, writes to master
- Caching: Cache frequently accessed data (Redis, Memcached)
- Materialized Views: Pre-compute expensive queries
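Read/write splitting from the list above can be sketched as a thin router in front of the connection pool; the connection objects here are string stand-ins:

```python
import itertools

class ReadWriteRouter:
    """Route writes to the primary, reads round-robin to replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(self.replicas)  # Reads may be slightly stale
        return self.primary             # INSERT/UPDATE/DELETE/DDL

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
```

A real router must also handle replication lag (e.g. pin a session to the primary right after its own write to preserve read-your-writes).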
Write Scaling:
- Sharding: Partition data across multiple databases
- Database Federation: Separate databases by function
- Denormalization: Reduce joins by duplicating data
- Event Sourcing: Store events instead of current state
Database Types:
- SQL Databases: ACID transactions, complex queries, vertical and horizontal scaling challenges
- NoSQL Databases: Often designed for horizontal scaling, eventual consistency
- Document stores: MongoDB, CouchDB
- Key-value stores: Redis, DynamoDB
- Column stores: Cassandra, HBase
- Graph databases: Neo4j
3. Caching Strategies¶
Caching stores frequently accessed data in fast storage to reduce load on backend systems and improve response times.
Cache Levels:
- Browser Cache: Cache static assets in user's browser
- CDN Cache: Cache content at edge locations
- Application Cache: In-memory cache in application (Redis, Memcached)
- Database Cache: Query result cache, buffer pool
Cache Patterns:
- Cache-Aside: Application manages cache, loads on miss
- Write-Through: Write to cache and database simultaneously
- Write-Back: Write to cache, asynchronously write to database
- Refresh-Ahead: Proactively refresh before expiration
Cache Invalidation:
- TTL (Time-To-Live): Automatic expiration
- Event-based: Invalidate on data changes
- Version-based: Use versions to detect stale data
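The cache-aside pattern combined with TTL invalidation can be sketched as follows; `loader` is a stand-in for a database query, and an in-process dict stands in for Redis or Memcached:

```python
import time

class CacheAside:
    """Cache-aside with TTL: the application checks the cache first
    and loads from the backing store only on a miss."""
    def __init__(self, loader, ttl_seconds=60):
        self.loader = loader          # stand-in for a database query
        self.ttl = ttl_seconds
        self.cache = {}               # key -> (value, expires_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]           # cache hit
        value = self.loader(key)      # cache miss: load from source
        self.cache[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        # Event-based invalidation: call this when the data changes
        self.cache.pop(key, None)
```

The TTL bounds staleness even if an invalidation event is missed, which is why the two strategies are usually combined.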
4. Asynchronous Processing¶
Offload time-consuming tasks to background workers to keep request handlers responsive.
Benefits:
- Faster response times for users
- Better resource utilization
- Ability to handle traffic spikes
- Decoupling of components
Patterns:
- Message Queues: Send tasks to queues, workers process them
- Event-Driven: Publish events, subscribers process asynchronously
- Job Queues: Background job processing (Celery, Sidekiq)
- Stream Processing: Process data streams in real-time (Kafka Streams, Flink)
Use Cases:
- Email sending
- Image/video processing
- Report generation
- Data aggregation and analytics
- Notification delivery
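A minimal sketch of the idea using only the standard library: the request handler enqueues work and returns immediately, while a worker thread drains the queue in the background (a stand-in for a real job queue such as Celery or Sidekiq):

```python
import queue
import threading

task_queue = queue.Queue()
results = []

def worker():
    """Background worker: processes tasks as they arrive."""
    while True:
        task = task_queue.get()
        if task is None:          # sentinel to stop the worker
            break
        # Stand-in for slow work such as sending an email
        results.append(f"sent email to {task}")
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_signup(email):
    task_queue.put(email)         # offload the slow step
    return 'signup accepted'      # respond without waiting
```

The handler's latency is now independent of how long the email send takes, and a burst of signups simply grows the queue rather than blocking users.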
Scalability Bottlenecks¶
Common bottlenecks that limit system scalability and how to address them.
1. Database Bottlenecks¶
Symptoms:
- Slow query performance
- High database CPU/memory usage
- Connection pool exhaustion
- Lock contention
Solutions:
- Query optimization: indexes, query rewriting
- Database connection pooling
- Read replicas for read-heavy workloads
- Caching frequently accessed data
- Database sharding for write scaling
- Consider NoSQL for specific use cases
2. Application Bottlenecks¶
Symptoms:
- High CPU usage
- Memory leaks
- Inefficient algorithms
- Synchronous blocking operations
Solutions:
- Code profiling and optimization
- Algorithm improvements
- Asynchronous processing
- Connection pooling
- Resource pooling (thread pools, object pools)
- Horizontal scaling
3. Network Bottlenecks¶
Symptoms:
- High latency
- Bandwidth saturation
- Network congestion
Solutions:
- CDN for static content
- Compression (gzip, brotli)
- HTTP/2 and HTTP/3 for multiplexing
- Edge computing
- Regional deployment
- Optimize payload sizes
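To see why compression helps on network-bound paths, a small sketch: repetitive JSON, typical of API responses, shrinks dramatically under gzip:

```python
import gzip
import json

# Repetitive JSON payload, typical of a list endpoint
payload = json.dumps(
    [{'id': i, 'status': 'active'} for i in range(1000)]
).encode()

compressed = gzip.compress(payload)

# Text-heavy API responses commonly shrink several-fold with gzip,
# trading a little CPU for much less bandwidth
ratio = len(payload) / len(compressed)
```

Brotli typically compresses slightly better than gzip for web content at the cost of more CPU, which is why CDNs often pre-compress static assets with it.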
4. I/O Bottlenecks¶
Symptoms:
- Slow disk I/O
- High I/O wait times
- Storage capacity limits
Solutions:
- Use SSDs instead of HDDs
- Distributed file systems
- Object storage (S3, GCS)
- Database optimization
- Caching to reduce I/O
Scalability Metrics and Monitoring¶
Key metrics to track system scalability and performance.
Performance Metrics:
- Latency: Time to process a request (p50, p95, p99 percentiles)
- Throughput: Requests per second
- Error Rate: Percentage of failed requests
- Availability: Uptime percentage
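Percentiles matter because averages hide tail latency. A simple nearest-rank computation (monitoring systems may interpolate differently) shows how p50 and p99 diverge when a few requests are slow:

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: the value below which roughly p%
    of samples fall. Simplest variant; no interpolation."""
    ordered = sorted(latencies_ms)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# One slow outlier barely moves the median but dominates the tail
samples = [12, 15, 20, 22, 25, 30, 45, 60, 120, 800]
p50 = percentile(samples, 50)   # typical request
p99 = percentile(samples, 99)   # tail latency
```

This is why SLOs are usually stated against p95 or p99 rather than the mean: users who hit the tail are the ones who notice.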
Resource Metrics:
- CPU Usage: Processor utilization
- Memory Usage: RAM consumption
- Disk I/O: Read/write operations per second
- Network I/O: Bandwidth utilization
Business Metrics:
- User Growth: Number of active users
- Transaction Volume: Business transactions per day
- Data Growth: Database size, storage usage
Monitoring Tools:
- APM (Application Performance Monitoring): New Relic, Datadog, AppDynamics
- Infrastructure Monitoring: Prometheus, Grafana, CloudWatch
- Log Analysis: ELK Stack, Splunk
- Distributed Tracing: Jaeger, Zipkin
Scalability Best Practices¶
- Design for Scale from the Start:
  - Consider scalability requirements early
  - Use scalable architectures (microservices, event-driven)
  - Design stateless services
  - Plan for horizontal scaling
- Measure Before Optimizing:
  - Profile applications to find bottlenecks
  - Monitor key metrics continuously
  - Load test before production
  - Establish performance baselines
- Cache Aggressively:
  - Cache at multiple levels
  - Cache frequently accessed data
  - Use appropriate cache expiration strategies
  - Monitor cache hit rates
- Optimize Database Access:
  - Use indexes effectively
  - Optimize queries
  - Use connection pooling
  - Consider read replicas and sharding
- Use Asynchronous Processing:
  - Offload heavy tasks to background workers
  - Use message queues for decoupling
  - Implement event-driven patterns
- Implement Rate Limiting:
  - Protect against traffic spikes
  - Prevent abuse
  - Ensure fair resource usage
- Plan for Failure:
  - Design for graceful degradation
  - Implement circuit breakers
  - Use health checks and auto-scaling
  - Plan disaster recovery
Rate Limiting Deep Dive¶
Rate limiting controls the rate of requests a client can make to protect services from abuse and ensure fair usage.
Rate Limiting Algorithms¶
1. Token Bucket:
┌─────────────────────────────────────────────────────────────────┐
│ TOKEN BUCKET │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Tokens added at fixed rate (e.g., 10/second) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ BUCKET │ Max capacity (e.g., 100 tokens) │
│ │ ○ ○ ○ ○ ○ │ │
│ │ ○ ○ ○ ○ ○ │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ Request takes 1 token (or N for weighted) │
│ If no tokens → Request rejected │
│ │
└─────────────────────────────────────────────────────────────────┘
- Allows burst traffic up to bucket size
- Smooths out request rate over time
- Most flexible algorithm
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def allow_request(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        refill_amount = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + refill_amount)
        self.last_refill = now
2. Leaky Bucket:
┌─────────────────────────────────────────────────────────────────┐
│ LEAKY BUCKET │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Requests enter bucket (queue) │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ BUCKET │ Fixed queue size │
│ │ ■ ■ ■ ■ ■ │ │
│ │ ■ ■ ■ ■ ■ │ │
│ └──────┬──────┘ │
│ │ ← Leak at constant rate │
│ ▼ │
│ Requests processed at fixed rate │
│ If bucket full → Request rejected │
│ │
└─────────────────────────────────────────────────────────────────┘
- Processes at constant rate regardless of input
- Smooths out bursty traffic
- Can introduce latency (queuing)
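For symmetry with the token bucket above, here is a sketch of the meter-style leaky bucket variant, which tracks a "water level" that drains at a constant rate rather than keeping an actual request queue:

```python
import time

class LeakyBucket:
    """Meter-style leaky bucket: `level` represents queued work,
    draining at a constant rate; requests are rejected when the
    bucket would overflow."""
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate  # units drained per second
        self.level = 0.0
        self.last_leak = time.time()

    def allow_request(self, amount=1):
        self._leak()
        if self.level + amount <= self.capacity:
            self.level += amount
            return True
        return False

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_leak = now
```

The queue-based variant shown in the diagram instead admits requests into a bounded queue and processes them at the leak rate, which smooths output but adds queuing latency.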
3. Fixed Window Counter:
import time

class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # window_key -> count (expired keys should be evicted periodically)

    def allow_request(self, client_id):
        window_key = int(time.time() / self.window_seconds)
        key = f"{client_id}:{window_key}"
        current = self.counts.get(key, 0)
        if current >= self.limit:
            return False
        self.counts[key] = current + 1
        return True
- Simple to implement
- Problem: Burst at window boundaries (2x limit possible)
4. Sliding Window Log:
import time

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = {}  # client_id -> [timestamps]

    def allow_request(self, client_id):
        now = time.time()
        window_start = now - self.window_seconds
        # Get client's request log
        if client_id not in self.logs:
            self.logs[client_id] = []
        # Remove entries that have fallen out of the window
        self.logs[client_id] = [
            ts for ts in self.logs[client_id]
            if ts > window_start
        ]
        if len(self.logs[client_id]) >= self.limit:
            return False
        self.logs[client_id].append(now)
        return True
- Accurate but memory-intensive
- Stores timestamp of each request
5. Sliding Window Counter (Hybrid):
import time

class SlidingWindowCounter:
    """Combines fixed window efficiency with sliding window accuracy."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = {}
        self.previous_window = {}
        self.current_window_start = 0

    def allow_request(self, client_id):
        now = time.time()
        window_start = int(now / self.window_seconds) * self.window_seconds
        # Rotate windows if needed
        if window_start > self.current_window_start:
            self.previous_window = self.current_window
            self.current_window = {}
            self.current_window_start = window_start
        # Weight the previous window by how much of it still overlaps
        # the sliding window ending now
        elapsed_in_current = now - window_start
        weight = elapsed_in_current / self.window_seconds
        prev_count = self.previous_window.get(client_id, 0)
        curr_count = self.current_window.get(client_id, 0)
        # Estimated request count in the sliding window
        estimated_count = prev_count * (1 - weight) + curr_count
        if estimated_count >= self.limit:
            return False
        self.current_window[client_id] = curr_count + 1
        return True
Distributed Rate Limiting:
# Using Redis for distributed rate limiting
class DistributedRateLimiter:
    def __init__(self, redis_client, limit, window_seconds):
        self.redis = redis_client
        self.limit = limit
        self.window_seconds = window_seconds

    def allow_request(self, client_id):
        key = f"rate_limit:{client_id}"
        # Lua script for an atomic increment-and-expire
        script = """
        local current = redis.call('INCR', KEYS[1])
        if current == 1 then
            redis.call('EXPIRE', KEYS[1], ARGV[1])
        end
        return current
        """
        count = self.redis.eval(script, 1, key, self.window_seconds)
        return count <= self.limit
Back-Pressure and Flow Control¶
Back-pressure prevents system overload by slowing down producers when consumers can't keep up.
Without Back-Pressure:
Producer (1000/s) ──▶ Queue ──▶ Consumer (100/s)
│
▼
Queue grows unbounded
Memory exhaustion
System crash
With Back-Pressure:
Producer (slowed) ◀── Signal ── Queue ──▶ Consumer (100/s)
│
▼
Queue bounded
Producer waits or drops
System stable
Back-Pressure Strategies:
1. Blocking:
from queue import Queue

# Bounded queue - put() blocks when full
queue = Queue(maxsize=1000)

def producer():
    while True:
        item = generate_item()
        queue.put(item)  # Blocks if queue full

def consumer():
    while True:
        item = queue.get()  # Blocks if queue empty
        process(item)
2. Dropping:
from queue import Full

def producer():
    while True:
        item = generate_item()
        try:
            queue.put_nowait(item)  # queue: the bounded Queue above
        except Full:
            metrics.increment('dropped_items')
            # Optionally: log, alert, or sample
3. Buffering with Overflow:
from queue import Queue, Full

class OverflowBuffer:
    def __init__(self, primary_size, overflow_size):
        self.primary = Queue(maxsize=primary_size)
        # DiskQueue is a placeholder for a disk-backed queue implementation
        self.overflow = DiskQueue(max_size=overflow_size)

    def put(self, item):
        try:
            self.primary.put_nowait(item)
        except Full:
            try:
                self.overflow.put(item)  # Spill to disk
            except OverflowError:
                self.drop(item)  # Last resort
4. Load Shedding:
class LoadShedder:
    def __init__(self, threshold=0.8):
        self.threshold = threshold

    def should_accept(self, request):
        # get_system_load(): e.g. CPU or queue utilization, 0.0-1.0
        current_load = self.get_system_load()
        if current_load < self.threshold:
            return True
        # Prioritized shedding
        if request.priority == 'high':
            return current_load < 0.95
        elif request.priority == 'medium':
            return current_load < 0.9
        else:
            return False  # Shed low priority
5. Reactive Streams (Pull-Based):
# Consumer pulls at its own pace
class ReactiveConsumer:
    def __init__(self, source, batch_size=100):
        self.source = source
        self.batch_size = batch_size

    async def consume(self):
        while True:
            # Consumer controls the pace
            items = await self.source.request(self.batch_size)
            await self.process(items)
Auto-Scaling Strategies¶
Horizontal Pod Autoscaler (HPA) - Kubernetes:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-server
  minReplicas: 2
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: 1000
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Scaling Metrics:
| Metric | Best For | Consideration |
|---|---|---|
| CPU | Compute-intensive workloads | Doesn't account for I/O wait |
| Memory | Memory-intensive apps | Can be slow to change |
| RPS | Web servers, APIs | Business-relevant |
| Queue depth | Worker processes | Direct load indicator |
| Custom metrics | Specific business needs | Requires instrumentation |
Predictive Scaling:
import math

class PredictiveScaler:
    def __init__(self, history_days=14):
        self.history = []
        self.history_days = history_days

    def predict_load(self, target_time):
        """Predict load based on historical patterns."""
        # Same day of week, same time
        historical_loads = self.get_historical_loads(
            day_of_week=target_time.weekday(),
            hour=target_time.hour
        )
        # Use moving average with trend
        return self.calculate_prediction(historical_loads)

    def calculate_replicas(self, predicted_load):
        """Calculate needed replicas with buffer."""
        # capacity_per_replica: measured throughput of one replica
        base_replicas = math.ceil(predicted_load / self.capacity_per_replica)
        # Add 20% buffer for unexpected spikes
        return int(base_replicas * 1.2)
Load Balancing¶
Load balancing is the process of distributing incoming network traffic across multiple servers or resources to ensure no single server becomes overwhelmed. Load balancers act as traffic directors, routing client requests to the most appropriate server based on various algorithms and health checks.
Why Load Balancing is Essential¶
- High Availability: If one server fails, traffic is automatically routed to healthy servers, ensuring continuous service availability.
- Performance: Distributes load evenly, preventing any single server from becoming a bottleneck.
- Scalability: Enables horizontal scaling by adding more servers behind the load balancer.
- Flexibility: Allows maintenance and updates without downtime by taking servers out of rotation.
- Geographic Distribution: Can route users to the nearest data center or server for reduced latency.
Types of Load Balancers¶
1. Network Load Balancer (Layer 4)¶
Operates at the transport layer (TCP/UDP) and routes traffic based on IP addresses and ports.
Characteristics:
- Fast and efficient (no application-layer inspection)
- Low latency
- Works with any protocol (TCP, UDP)
- Cannot inspect HTTP headers or content
- No SSL termination (unless using TCP passthrough)
Use Cases:
- High-throughput, low-latency applications
- TCP/UDP-based protocols
- When application-layer inspection isn't needed
- Real-time applications (gaming, streaming)
Examples:
- AWS Network Load Balancer
- F5 BIG-IP
- HAProxy (TCP mode)
2. Application Load Balancer (Layer 7)¶
Operates at the application layer (HTTP/HTTPS) and can inspect and route based on content.
Characteristics:
- Can inspect HTTP headers, URLs, and content
- SSL/TLS termination
- Content-based routing
- Cookie-based session affinity
- More intelligent routing decisions
- Higher latency than Layer 4 (more processing)
Use Cases:
- HTTP/HTTPS applications
- When content-based routing is needed
- Microservices (route based on path)
- When SSL termination is required at load balancer
Examples:
- AWS Application Load Balancer
- NGINX
- HAProxy (HTTP mode)
- Traefik
3. Global Server Load Balancer (GSLB)¶
Distributes traffic across multiple data centers or geographic regions.
Characteristics:
- DNS-based routing
- Geographic distribution
- Disaster recovery
- Route to nearest data center
- Health checks across regions
Use Cases:
- Multi-region deployments
- Disaster recovery
- Reducing latency for global users
- High availability across regions
Examples:
- AWS Route 53
- Cloudflare
- Azure Traffic Manager
- Google Cloud Load Balancing
Load Balancing Algorithms¶
The algorithm determines how the load balancer selects which server to handle each request.
1. Round Robin¶
Distributes requests sequentially across all servers in rotation.
How it Works:
- Request 1 → Server A
- Request 2 → Server B
- Request 3 → Server C
- Request 4 → Server A (cycle repeats)
Advantages:
- Simple and predictable
- Even distribution over time
- No server state required
Disadvantages:
- Doesn't consider server load or capacity
- Uneven if servers have different capabilities
- Doesn't account for request processing time
Use Case: When all servers have similar capacity and handle similar requests.
2. Weighted Round Robin¶
Round robin with weights assigned to servers based on their capacity.
How it Works:
- Server A (weight 3): handles 3 requests
- Server B (weight 2): handles 2 requests
- Server C (weight 1): handles 1 request
- Pattern repeats
Advantages:
- Accounts for server capacity differences
- Simple to configure
- Predictable distribution
Disadvantages:
- Doesn't consider current server load
- Weights must be manually configured
Use Case: When servers have different capacities (CPU, memory, etc.).
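One common way to implement this is the smooth weighted round robin variant (popularized by NGINX), which spreads each server's turns across the cycle instead of bunching them together. A sketch:

```python
class SmoothWeightedRoundRobin:
    """Smooth weighted round robin: each pick, every server gains
    its weight; the highest-credit server is chosen and pays back
    the total weight, interleaving turns smoothly."""
    def __init__(self, servers):
        # servers: list of (name, weight) pairs
        self.servers = [
            {'name': name, 'weight': weight, 'current': 0}
            for name, weight in servers
        ]
        self.total_weight = sum(weight for _, weight in servers)

    def next_server(self):
        for s in self.servers:
            s['current'] += s['weight']
        chosen = max(self.servers, key=lambda s: s['current'])
        chosen['current'] -= self.total_weight
        return chosen['name']
```

With weights A=3, B=1, each cycle of four picks sends three requests to A and one to B, with B's turn placed mid-cycle rather than at the end.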
3. Least Connections¶
Routes requests to the server with the fewest active connections.
How it Works:
- Tracks active connections per server
- New request goes to server with minimum connections
- Continuously updated as connections are established/closed
Advantages:
- Adapts to varying request processing times
- Better for long-lived connections
- More balanced under varying loads
Disadvantages:
- Requires tracking connection state
- Doesn't consider server capacity
- More complex than round robin
Use Case: When requests have varying processing times or long-lived connections.
4. Weighted Least Connections¶
Combines least connections with server weights.
How it Works:
- Calculates: active_connections / weight
- Routes to server with lowest ratio
Advantages:
- Considers both current load and server capacity
- Most balanced algorithm
Disadvantages:
- Most complex to implement
- Requires connection tracking and weights
Use Case: When servers have different capacities and requests have varying processing times.
5. Least Response Time¶
Routes to the server with the lowest average response time.
How it Works:
- Measures response time for each server
- Routes to server with fastest average response time
- Continuously updated based on actual response times
Advantages:
- Routes to fastest servers
- Adapts to server performance changes
- Good for performance optimization
Disadvantages:
- Requires response time measurement
- May cause oscillations if response times are similar
- More overhead than simpler algorithms
Use Case: When optimizing for response time and servers have varying performance.
6. IP Hash¶
Uses a hash of the client's IP address to determine which server handles the request.
How it Works:
- Hash(client_ip) % num_servers → server_index
- Same client IP always maps to same server (if server count unchanged)
Advantages:
- Session affinity without cookies
- Predictable routing
- Simple to implement
Disadvantages:
- Uneven distribution if IPs are not evenly distributed
- Adding/removing servers changes mappings
- Doesn't consider server load
Use Case: When you need session affinity but can't use cookies, or for caching benefits.
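A minimal sketch of the selection step; md5 is used here only for its stable, evenly distributed output, not for security:

```python
import hashlib

def pick_server(client_ip, servers):
    """Map a client IP to a server index via a stable hash.
    The same IP always lands on the same server as long as
    the server list is unchanged."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Because the modulus depends on the server count, adding or removing one server remaps most clients; consistent hashing is the usual fix when that churn is unacceptable.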
7. Random¶
Randomly selects a server for each request.
How it Works:
- Generates random number
- Maps to server index
Advantages:
- Simple implementation
- No state required
- Can work well with many servers
Disadvantages:
- Unpredictable
- May cause uneven distribution in short term
- Doesn't consider server load or capacity
Use Case: Rarely used, mainly for testing or when other algorithms aren't suitable.
Load Balancing Strategies¶
1. Health Checks¶
Load balancers continuously monitor server health to avoid routing traffic to unhealthy servers.
Types of Health Checks:
- Active Health Checks:
  - Load balancer periodically sends requests to servers
  - Checks response status, latency, or content
  - Removes unhealthy servers from rotation
  - Re-adds when healthy
- Passive Health Checks:
  - Monitor actual request responses
  - Track error rates and timeouts
  - Remove servers with high failure rates
Health Check Parameters:
- Interval: How often to check (e.g., every 10 seconds)
- Timeout: Maximum time to wait for response
- Threshold: Number of failures before marking unhealthy
- Path: Endpoint to check (e.g., /health)
- Expected Response: Status code or content to validate
Example Health Check Endpoint:
@app.route('/health')
def health_check():
    # Check database connection
    # Check external dependencies
    # Check disk space, memory
    return {'status': 'healthy'}, 200
2. Session Affinity (Sticky Sessions)¶
Ensures requests from the same client are routed to the same server.
Why Needed:
- Server-side session storage
- In-memory caching
- Stateful applications
- File uploads requiring connection persistence
Implementation Methods:
- Cookie-Based:
  - Load balancer sets cookie with server identifier
  - Subsequent requests include cookie
  - Load balancer routes to same server
- IP Hash:
  - Hash client IP to determine server
  - Same IP always routes to same server
- Custom Headers:
  - Use custom HTTP headers to identify sessions
Trade-offs:
- Pros: Enables stateful applications, better caching
- Cons: Uneven load distribution, harder to scale, server failure affects users
Best Practice: Use stateless services with external session storage (Redis, database) to avoid needing session affinity.
3. SSL/TLS Termination¶
Load balancer handles SSL/TLS encryption/decryption, reducing load on backend servers.
How it Works:
- Client → HTTPS → Load Balancer (decrypts)
- Load Balancer → HTTP → Backend Servers (or HTTPS with different cert)
Advantages:
- Reduces CPU load on backend servers
- Centralized certificate management
- Easier to update certificates
- Can inspect encrypted traffic
Disadvantages:
- Traffic between load balancer and servers may be unencrypted
- Single point of failure for SSL
- Load balancer becomes security boundary
Solution: Use end-to-end encryption (HTTPS to backend) if security is critical.
4. Content-Based Routing¶
Route requests based on URL path, headers, or content.
Examples:
- /api/users/* → User service
- /api/products/* → Product service
- /api/orders/* → Order service
- Route based on Host header for virtual hosting
- Route based on Content-Type header
Use Case: Microservices architecture where different services handle different paths.
Load Balancer Deployment Patterns¶
1. Single Load Balancer¶
One load balancer in front of multiple servers.
Characteristics:
- Simple setup
- Single point of failure
- May become bottleneck at high traffic
Use Case: Small to medium applications, development environments.
2. Multiple Load Balancers (Active-Passive)¶
Primary load balancer handles traffic, backup takes over on failure.
Characteristics:
- High availability
- Backup doesn't handle traffic normally
- Failover time required
Use Case: Medium to large applications requiring high availability.
3. Multiple Load Balancers (Active-Active)¶
Multiple load balancers share traffic load.
Characteristics:
- Better performance (load distributed)
- Higher availability
- More complex configuration
- Requires session synchronization or stateless design
Use Case: High-traffic applications, large-scale systems.
4. DNS Load Balancing¶
Use DNS to distribute traffic across multiple load balancers or servers.
How it Works:
- DNS returns multiple IP addresses
- Clients connect to different IPs
- DNS TTL controls how long clients cache IPs
Advantages:
- Simple, no additional infrastructure
- Geographic distribution possible
- Works with any protocol
Disadvantages:
- Limited control (DNS caching)
- No health checking (unless using advanced DNS)
- Uneven distribution (client DNS caching)
Use Case: Global distribution, simple load balancing needs.
Load Balancing in Cloud Environments¶
AWS Load Balancing¶
Elastic Load Balancer (ELB) Types:
- Application Load Balancer (ALB): Layer 7, HTTP/HTTPS
- Network Load Balancer (NLB): Layer 4, TCP/UDP, high performance
- Classic Load Balancer: Legacy, basic features
Features:
- Auto-scaling integration
- Health checks
- SSL termination
- Sticky sessions
- Cross-zone load balancing
Google Cloud Load Balancing¶
Types:
- HTTP(S) Load Balancing: Global, Layer 7
- TCP/UDP Load Balancing: Regional or global, Layer 4
- Internal Load Balancing: For internal traffic
Features:
- Global anycast IP
- Automatic scaling
- Health checks
- Content-based routing
Azure Load Balancing¶
Types:
- Application Gateway: Layer 7, web application firewall
- Load Balancer: Layer 4, regional
- Traffic Manager: DNS-based, global
Load Balancing Best Practices¶
- Health Checks:
  - Implement comprehensive health check endpoints
  - Configure appropriate intervals and timeouts
  - Monitor health check failures
- Stateless Services:
  - Design services to be stateless
  - Store state in external systems (Redis, database)
  - Avoid session affinity when possible
- Monitoring:
  - Monitor load balancer metrics (requests, latency, errors)
  - Track backend server health
  - Set up alerts for failures
- Security:
  - Use SSL/TLS termination
  - Implement rate limiting
  - Use web application firewalls (WAF)
  - Restrict access to backend servers
- Performance:
  - Choose appropriate algorithm for your use case
  - Enable connection pooling
  - Use HTTP/2 for better multiplexing
  - Consider geographic distribution
- Scalability:
  - Integrate with auto-scaling
  - Plan for traffic growth
  - Test load balancer capacity
  - Use multiple load balancers for high availability
API Gateway¶
An API Gateway is a server that acts as the single entry point for all client requests to backend services. It handles cross-cutting concerns like authentication, rate limiting, and request routing.
API Gateway Responsibilities¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ API GATEWAY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Request │ │ Auth & │ │ Rate │ │ Request │ │
│ │ Routing │ │ Authorization│ │ Limiting │ │ Validation │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Caching │ │ Circuit │ │ Request │ │ Protocol │ │
│ │ │ │ Breaker │ │ Transform │ │ Translation │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Logging │ │ Metrics │ │ Service │ │ SSL/TLS │ │
│ │ & Tracing │ │ │ │ Discovery │ │ Termination │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
API Gateway Patterns¶
1. Simple Gateway:
┌─────────────┐
Clients ────────▶│ API Gateway │────────▶ Services
└─────────────┘
2. Backend for Frontend (BFF):
┌─────────────┐
Mobile ─────────▶│ Mobile BFF │────────┐
└─────────────┘ │
▼
┌─────────────┐ ┌─────────┐
Web ────────────▶│ Web BFF │───▶│Services │
└─────────────┘ └─────────┘
▲
┌─────────────┐ │
IoT ────────────▶│ IoT BFF │────────┘
└─────────────┘
Each client type gets a specialized gateway optimized for its needs.
3. Aggregator Gateway:
import asyncio

class OrderAggregator:
    """Aggregates data from multiple services into single response."""
    async def get_order_details(self, order_id):
        # Fetch the order first; its fields identify the related records
        order = await self.order_service.get(order_id)
        # Then fan out to the remaining services in parallel
        user, payment, shipping = await asyncio.gather(
            self.user_service.get(order.user_id),
            self.payment_service.get(order.payment_id),
            self.shipping_service.get(order.shipping_id),
        )
        # Aggregate into single response
        return {
            'order': order,
            'user': {
                'name': user.name,
                'email': user.email
            },
            'payment': {
                'status': payment.status,
                'amount': payment.amount
            },
            'shipping': {
                'status': shipping.status,
                'tracking': shipping.tracking_number
            }
        }
API Gateway Technologies¶
| Technology | Type | Best For |
|---|---|---|
| Kong | Open Source | Kubernetes, plugin ecosystem |
| AWS API Gateway | Managed | AWS ecosystem, serverless |
| Apigee | Enterprise | Large enterprises, analytics |
| NGINX | Open Source | High performance, flexibility |
| Traefik | Open Source | Kubernetes, auto-discovery |
| Envoy | Service Mesh | Service-to-service, gRPC |
Request/Response Transformation¶
# Gateway transforms legacy API response to modern format
# Legacy service response
legacy_response = {
    "usr_id": "123",
    "usr_nm": "John",
    "crt_dt": "20240101"
}

# Gateway transformation rules
transformation = {
    "user_id": "$.usr_id",
    "user_name": "$.usr_nm",
    "created_at": {
        "source": "$.crt_dt",
        "transform": "date_format('YYYYMMDD', 'ISO8601')"
    }
}

# Transformed response
modern_response = {
    "user_id": "123",
    "user_name": "John",
    "created_at": "2024-01-01T00:00:00Z"
}
Service Mesh¶
A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It makes communication between service instances reliable, fast, and secure.
Service Mesh Architecture¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVICE MESH │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Control Plane │ │ Control Plane │ │
│ │ (Istiod) │ │ (Linkerd) │ │
│ │ │ │ │ │
│ │ - Config │ │ - Identity │ │
│ │ - Certificates │ │ - Routing │ │
│ │ - Discovery │ │ - Metrics │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ │ Configuration │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Data Plane │ │
│ │ │ │
│ │ ┌─────────┐ ┌───────┐ ┌─────────┐ ┌───────┐ │ │
│ │ │Service A│◀│ Proxy │◀──────▶│ Proxy │▶│Service B│ │ │
│ │ │ │ │(Envoy)│ │(Envoy) │ │ │ │ │
│ │ └─────────┘ └───────┘ └───────┘ └─────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Service Mesh Features¶
1. Traffic Management:
# Istio VirtualService - Traffic routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - match:
        - headers:
            end-user:
              exact: jason
      route:
        - destination:
            host: reviews
            subset: v2  # Jason gets v2
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90  # 90% to v1
        - destination:
            host: reviews
            subset: v2
          weight: 10  # 10% to v2 (canary)
2. Mutual TLS (mTLS):
# Istio PeerAuthentication - Enforce mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT  # All traffic must be mTLS
3. Observability:
- Distributed Tracing: Automatic trace propagation
- Metrics: Request latency, error rates, throughput
- Access Logs: Detailed request/response logging
4. Resilience:
# Istio DestinationRule - Circuit breaker
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Service Mesh vs API Gateway¶
| Aspect | API Gateway | Service Mesh |
|---|---|---|
| Position | Edge (north-south) | Internal (east-west) |
| Traffic | External to internal | Service to service |
| Deployment | Centralized | Distributed (sidecars) |
| Protocol | HTTP/REST focused | Any protocol |
| Use Case | Public APIs | Microservices internal |
Content Delivery Network (CDN)¶
A CDN is a geographically distributed network of servers that delivers content to users from the nearest edge location, reducing latency and improving performance.
CDN Architecture¶
Origin Server
│
┌──────────────┴──────────────┐
│ CDN Network │
│ │
┌───────────┼───────────┬───────────┬─────┼───────────┐
│ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│Edge US│ │Edge EU│ │Edge AP│ │Edge SA│ │Edge AF│
│ West │ │ │ │ │ │ │ │ │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
[Users] [Users] [Users] [Users] [Users]
CDN Caching Strategies¶
Cache-Control Headers:
# Static assets (aggressive caching)
Cache-Control: public, max-age=31536000, immutable
# API responses (short cache)
Cache-Control: public, max-age=60, s-maxage=300
# Private user data (no CDN cache)
Cache-Control: private, no-store
# Stale-while-revalidate (serve stale, fetch fresh in background)
Cache-Control: public, max-age=60, stale-while-revalidate=86400
Cache Key Configuration:
# CDN cache key components
cache_key = {
'url': '/api/products',
'query_params': ['category', 'sort'], # Include in key
'headers': ['Accept-Language'], # Vary by language
'cookies': [], # Ignore cookies
}
# Example cache keys:
# /api/products?category=shoes&sort=price + Accept-Language: en
# /api/products?category=shoes&sort=price + Accept-Language: es
CDN Use Cases¶
1. Static Asset Delivery:
# Origin server configuration
location /static/ {
root /var/www;
expires 1y;
add_header Cache-Control "public, immutable";
add_header CDN-Cache-Control "max-age=31536000";
}
2. Dynamic Content Caching:
@app.route('/api/products')
def get_products():
products = fetch_products()
response = jsonify(products)
response.headers['Cache-Control'] = 'public, max-age=60'
response.headers['Surrogate-Key'] = 'products' # For purging
return response
# Purge when products change
def invalidate_products():
cdn.purge_by_tag('products')
3. Edge Computing:
// Cloudflare Worker - Edge computation
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
const url = new URL(request.url)
// A/B testing at edge
const variant = request.headers.get('Cookie')?.includes('variant=b') ? 'b' : 'a'
// Modify request to appropriate origin
url.pathname = `/${variant}${url.pathname}`
return fetch(url.toString())
}
CDN Providers Comparison¶
| Provider | Strengths | Best For |
|---|---|---|
| Cloudflare | DDoS protection, edge compute | Security-focused, edge logic |
| AWS CloudFront | AWS integration, Lambda@Edge | AWS ecosystem |
| Fastly | Real-time purging, VCL | Dynamic content, instant invalidation |
| Akamai | Global reach, enterprise | Large enterprises, streaming |
| Bunny CDN | Cost-effective | Budget-conscious, simple needs |
Real-Time Communication¶
WebSockets¶
Full-duplex communication channel over a single TCP connection.
┌─────────────────────────────────────────────────────────────────┐
│ WebSocket Lifecycle │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Client Server │
│ │ │ │
│ │ ──── HTTP Upgrade Request ─────────────────────▶│ │
│ │ Upgrade: websocket │ │
│ │ Connection: Upgrade │ │
│ │ Sec-WebSocket-Key: xxx │ │
│ │ │ │
│ │◀──── HTTP 101 Switching Protocols ──────────────│ │
│ │ Upgrade: websocket │ │
│ │ Sec-WebSocket-Accept: yyy │ │
│ │ │ │
│ │◀════════════ WebSocket Connection ════════════▶│ │
│ │ │ │
│ │ ──── Message Frame ────────────────────────────▶│ │
│ │◀──── Message Frame ─────────────────────────────│ │
│ │ ──── Message Frame ────────────────────────────▶│ │
│ │ │ │
│ │ ──── Close Frame ──────────────────────────────▶│ │
│ │◀──── Close Frame ───────────────────────────────│ │
│ │
└─────────────────────────────────────────────────────────────────┘
WebSocket Server Example:
import asyncio
import websockets
import json
class ChatServer:
def __init__(self):
self.clients = set()
self.rooms = {} # room_id -> set of websockets
    async def handler(self, websocket, path=None):  # 'path' is only passed by older websockets versions
# Register client
self.clients.add(websocket)
try:
async for message in websocket:
data = json.loads(message)
await self.process_message(websocket, data)
finally:
# Unregister client
self.clients.discard(websocket)
self.leave_all_rooms(websocket)
async def process_message(self, websocket, data):
action = data.get('action')
if action == 'join':
room_id = data['room_id']
if room_id not in self.rooms:
self.rooms[room_id] = set()
self.rooms[room_id].add(websocket)
elif action == 'message':
room_id = data['room_id']
message = data['message']
await self.broadcast_to_room(room_id, {
'type': 'message',
'content': message,
'sender': data['sender']
})
    async def broadcast_to_room(self, room_id, message):
        if room_id in self.rooms:
            websockets.broadcast(
                self.rooms[room_id],
                json.dumps(message)
            )

    def leave_all_rooms(self, websocket):
        # Remove the disconnecting client from every room it joined
        for members in self.rooms.values():
            members.discard(websocket)
# Start server (websockets.serve is an async context manager)
async def main():
    server = ChatServer()
    async with websockets.serve(server.handler, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
Server-Sent Events (SSE)¶
One-way communication from server to client over HTTP.
from flask import Flask, Response
import json
import time

app = Flask(__name__)
@app.route('/events')
def events():
def generate():
while True:
data = get_latest_data()
yield f"data: {json.dumps(data)}\n\n"
time.sleep(1)
return Response(
generate(),
mimetype='text/event-stream',
headers={
'Cache-Control': 'no-cache',
'Connection': 'keep-alive'
}
)
Long Polling¶
Client repeatedly requests updates, server holds request until data available.
@app.route('/poll')
async def long_poll():
timeout = 30 # seconds
start_time = time.time()
while time.time() - start_time < timeout:
data = await check_for_updates()
if data:
return jsonify(data)
await asyncio.sleep(0.5)
# Timeout - return empty and client reconnects
return jsonify({'status': 'timeout'})
Real-Time Pattern Comparison¶
| Pattern | Direction | Connection | Best For |
|---|---|---|---|
| WebSocket | Bidirectional | Persistent | Chat, gaming, collaboration |
| SSE | Server → Client | Persistent | News feeds, notifications |
| Long Polling | Request-Response | Repeated | Legacy support, simple updates |
| Short Polling | Request-Response | New each time | Infrequent updates |
Scaling Real-Time Systems¶
Challenge: WebSocket connections are stateful and tied to specific servers.
Solution: Pub/Sub with Redis:
import redis.asyncio as redis
class ScalableWebSocketServer:
def __init__(self):
self.redis = redis.from_url("redis://localhost")
self.pubsub = self.redis.pubsub()
self.local_clients = {} # room_id -> set of local websockets
async def subscribe_to_room(self, room_id, websocket):
# Local tracking
if room_id not in self.local_clients:
self.local_clients[room_id] = set()
# Subscribe to Redis channel
await self.pubsub.subscribe(f"room:{room_id}")
self.local_clients[room_id].add(websocket)
async def broadcast_to_room(self, room_id, message):
# Publish to Redis (reaches all server instances)
await self.redis.publish(
f"room:{room_id}",
json.dumps(message)
)
async def redis_listener(self):
"""Listen for messages from other server instances."""
async for message in self.pubsub.listen():
if message['type'] == 'message':
room_id = message['channel'].decode().replace('room:', '')
data = json.loads(message['data'])
# Broadcast to local clients only
if room_id in self.local_clients:
for ws in self.local_clients[room_id]:
await ws.send(json.dumps(data))
Operating Systems Fundamentals¶
Understanding operating system fundamentals is crucial for designing and building efficient, concurrent, and scalable systems. This section covers threads, processes, locks, synchronization, and concurrency—concepts that directly impact system design decisions.
Processes and Threads¶
Processes¶
A process is an instance of a running program with its own memory space, file descriptors, and system resources. Each process is isolated from other processes and has its own address space.
Process Characteristics:
- Isolation: Processes cannot directly access each other's memory
- Resource Ownership: Each process has its own memory, file handles, network connections
- Heavyweight: Creating and switching between processes is expensive
- IPC (Inter-Process Communication): Processes communicate via pipes, sockets, shared memory, message queues
Process States:
- New: Process is being created
- Ready: Process is waiting to be assigned to a CPU
- Running: Process instructions are being executed
- Waiting/Blocked: Process is waiting for an event (I/O, signal)
- Terminated: Process has finished execution
Process Creation:
- Fork: Creates a copy of the current process (Unix/Linux)
- Exec: Replaces current process with a new program
- Clone: More flexible process creation (Linux)
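The fork/exec pattern above can be sketched in a few lines of Python. This is POSIX-only (a Unix/Linux assumption) and uses `sys.executable` as a stand-in for any program you might exec:

```python
import os
import sys

pid = os.fork()  # POSIX only: the child starts as a copy of this process
if pid == 0:
    # Child: exec replaces the copied process image with a new program
    os.execvp(sys.executable, [sys.executable, "-c", "print('hello from child')"])
else:
    # Parent: reap the child and read its exit status
    _, status = os.waitpid(pid, 0)
    exit_code = os.waitstatus_to_exitcode(status)  # Python 3.9+
```

Note that `exec` never returns on success: the child's original code after `execvp` is unreachable, which is why fork and exec are almost always used together.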
Threads¶
A thread is a lightweight unit of execution within a process. Multiple threads share the same memory space and resources of their parent process but have their own stack and registers.
Thread Characteristics:
- Shared Memory: Threads in the same process share memory space
- Lightweight: Creating and switching threads is faster than processes
- Concurrent Execution: Multiple threads can execute simultaneously (on multi-core systems)
- Synchronization Required: Shared memory requires synchronization mechanisms
Thread vs. Process:
| Aspect | Process | Thread |
|---|---|---|
| Memory | Isolated | Shared |
| Creation Cost | High | Low |
| Communication | IPC required | Direct memory access |
| Context Switch | Expensive | Cheap |
| Fault Isolation | High (crash doesn't affect others) | Low (crash affects all threads) |
| Scalability | Limited by process overhead | Better scalability |
Thread Models:
-
User-Level Threads:
- Managed by application/library
- OS doesn't know about threads
- Fast context switching
- Blocking I/O blocks all threads
-
Kernel-Level Threads:
- Managed by OS kernel
- OS schedules threads
- True parallelism on multi-core
- Blocking I/O only blocks one thread
-
Hybrid Model:
- Combination of user and kernel threads
- Multiple user threads mapped to fewer kernel threads
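To make the shared-memory point concrete, here is a small sketch using Python's `threading` module (names are illustrative). All four threads mutate the same `counter`, which is only safe because every increment holds the lock:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:  # threads share `counter`, so updates must be synchronized
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 400000 only because every increment was done under the lock
```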
Concurrency and Parallelism¶
Concurrency¶
Concurrency is the ability of a system to handle multiple tasks at the same time. Tasks may not execute simultaneously but are interleaved, giving the appearance of simultaneous execution.
Characteristics:
- Tasks make progress concurrently
- Execution may be interleaved on a single CPU
- Useful for I/O-bound tasks
- Can improve responsiveness
Example: A web server handling multiple requests concurrently, switching between them while waiting for I/O operations.
Parallelism¶
Parallelism is the simultaneous execution of multiple tasks, typically on multiple CPUs or cores.
Characteristics:
- Tasks execute simultaneously on different CPUs
- Requires multiple processing units
- Useful for CPU-bound tasks
- True simultaneous execution
Example: Processing multiple images simultaneously, each on a different CPU core.
Concurrency vs. Parallelism:
- Concurrency: Dealing with multiple things at once (can be on single core)
- Parallelism: Doing multiple things at once (requires multiple cores)
Concurrency is about structure; parallelism is about execution.
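One way to see concurrency without parallelism: I/O-bound tasks overlap their waits even under Python's GIL. A sketch (timings are approximate):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(i):
    time.sleep(0.1)  # simulated I/O; the GIL is released while sleeping
    return i

start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(io_task, range(10)))
elapsed = time.time() - start
# ~0.1s total: the ten waits overlap, vs ~1s if run sequentially
```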
Thread Synchronization¶
When multiple threads access shared resources, synchronization is needed to prevent race conditions and ensure data consistency.
Race Conditions¶
A race condition occurs when the outcome depends on the relative timing of thread execution. Multiple threads access shared data, and the final result depends on the order of execution.
Example:
# Thread 1 and Thread 2 both execute:
counter = counter + 1
# Without synchronization, this can result in:
# Both read counter = 5
# Both increment to 6
# Both write 6 (should be 7)
Critical Section: Code that accesses shared resources and must be executed atomically by only one thread at a time.
Mutual Exclusion (Mutex)¶
A mutex (mutual exclusion) is a synchronization primitive that ensures only one thread can access a critical section at a time.
How Mutex Works:
- Thread acquires lock before entering critical section
- Other threads wait if lock is held
- Thread releases lock after leaving critical section
- Waiting thread can then acquire lock
Properties:
- Mutual Exclusion: Only one thread holds the lock at a time
- Ownership: The lock must be released by the thread that acquired it
- Blocking: Threads wait if the lock is unavailable
Example (Pseudocode):
mutex.lock()
// Critical section
shared_resource += 1
mutex.unlock()
Mutex Types:
-
Binary Semaphore:
- Two states: locked or unlocked
- Basic mutex implementation
-
Reentrant Mutex (Recursive Lock):
- Same thread can acquire lock multiple times
- Must release same number of times
- Prevents self-deadlock
-
Reader-Writer Lock:
- Multiple readers OR one writer
- Optimized for read-heavy workloads
- Writers have exclusive access
Semaphores¶
A semaphore is a synchronization primitive that maintains a counter and allows a specified number of threads to access a resource.
How Semaphores Work:
- Wait (P): Decrement counter, block if counter is 0
- Signal (V): Increment counter, wake waiting thread if any
Types:
-
Binary Semaphore:
- Counter is 0 or 1
- Similar to mutex but can be released by different thread
-
Counting Semaphore:
- Counter can be any non-negative integer
- Limits number of concurrent accesses
- Useful for resource pools
Example Use Case:
# Limit to 5 concurrent database connections
semaphore = Semaphore(5)
def database_operation():
semaphore.acquire() # Wait if 5 already acquired
try:
# Use database connection
db.query(...)
finally:
semaphore.release() # Release for next thread
Condition Variables¶
Condition variables allow threads to wait for a condition to become true and notify other threads when the condition changes.
Operations:
- Wait: Thread waits on condition variable (releases lock, blocks)
- Signal: Wake one waiting thread
- Broadcast: Wake all waiting threads
Use Case: Producer-Consumer pattern
from threading import Condition, Lock

mutex = Lock()
condition = Condition(mutex)
queue = []

def producer(item):
    with condition:              # acquires the underlying mutex
        queue.append(item)
        condition.notify()       # wake one waiting consumer

def consumer():
    with condition:
        while len(queue) == 0:   # re-check in a loop: guards against spurious wakeups
            condition.wait()     # releases the mutex while blocked
        return queue.pop(0)
Monitors¶
A monitor is a synchronization construct that combines mutual exclusion with condition variables. It ensures only one thread executes monitor code at a time.
Characteristics:
- Mutual exclusion built-in
- Condition variables for waiting
- Higher-level abstraction than mutexes
Example: Java's synchronized keyword implements monitor pattern.
Deadlocks¶
A deadlock occurs when two or more threads are blocked forever, waiting for each other to release resources.
Deadlock Conditions (Coffman Conditions)¶
All four must be present for deadlock:
- Mutual Exclusion: Resources cannot be shared
- Hold and Wait: Thread holds resource while waiting for another
- No Preemption: Resources cannot be forcibly taken
- Circular Wait: Circular chain of threads waiting for resources
Example:
Thread 1: Holds Lock A, wants Lock B
Thread 2: Holds Lock B, wants Lock A
→ Deadlock!
Deadlock Prevention¶
-
Lock Ordering:
- Always acquire locks in the same order
- Prevents circular wait
-
Lock Timeout:
- Try to acquire lock with timeout
- Release and retry if timeout
-
Deadlock Detection:
- Periodically check for circular waits
- Kill/restart threads in deadlock
-
Avoid Nested Locks:
- Minimize lock nesting
- Use higher-level synchronization
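Lock ordering can be sketched directly: sort the locks by a stable key before acquiring, so every thread takes them in the same order no matter which order its caller passed them in (a minimal illustration; without the sort, the two threads below could deadlock):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def with_both(locks, action):
    # Acquire in a fixed global order (here: by object id) to break circular wait
    first, second = sorted(locks, key=id)
    with first:
        with second:
            action()

done = []
t1 = threading.Thread(target=with_both, args=([lock_a, lock_b], lambda: done.append(1)))
t2 = threading.Thread(target=with_both, args=([lock_b, lock_a], lambda: done.append(2)))
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)
# Both threads finish despite requesting the locks in opposite orders
```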
Livelock¶
Livelock occurs when threads are not blocked but cannot make progress because they keep responding to each other's actions.
Example: Two threads trying to pass through a narrow doorway, both step aside for the other, neither makes progress.
Lock-Free Programming¶
Lock-free programming uses atomic operations and memory ordering instead of locks to achieve thread safety.
Atomic Operations¶
Operations that complete in a single, indivisible step. Guaranteed to be thread-safe without locks.
Types:
- Read-Modify-Write: Compare-and-swap (CAS), fetch-and-add
- Load/Store: Atomic reads and writes
Example (Compare-and-Swap):
# Atomic increment without lock
def atomic_increment(counter):
while True:
old_value = counter.value
new_value = old_value + 1
if compare_and_swap(counter, old_value, new_value):
break # Success
# Retry if another thread modified value
Lock-Free Data Structures¶
Data structures that don't use locks but remain thread-safe through atomic operations.
Examples:
- Lock-free queues
- Lock-free stacks
- Lock-free hash tables
Advantages:
- No deadlocks
- Better performance (no lock contention)
- Progress guarantee (at least one thread makes progress)
Disadvantages:
- More complex to implement
- Harder to reason about
- May require retry loops
Memory Models and Ordering¶
Memory Ordering¶
Defines the order in which memory operations become visible to other threads.
Ordering Types:
-
Sequential Consistency:
- All threads see same order of operations
- Simplest model, but slowest
-
Relaxed Ordering:
- Minimal ordering guarantees
- Fastest, but hardest to reason about
-
Acquire-Release:
- Acquire: a load that observes a release store also observes all writes made before that release
- Release: a store that publishes all prior writes to whichever thread performs the matching acquire
- Balance between performance and simplicity
-
Memory Barriers (Fences):
- Explicit ordering instructions
- Prevent reordering of memory operations
Example:
// Thread 1
data = 42;
flag.store(true, memory_order_release); // Release
// Thread 2
if (flag.load(memory_order_acquire)) { // Acquire
// Guaranteed to see data == 42
print(data);
}
Concurrency Patterns¶
1. Producer-Consumer¶
One or more producers generate data, one or more consumers process it.
Synchronization:
- Mutex for queue access
- Condition variable for empty/full queue
- Bounded buffer to limit memory
Use Cases:
- Task queues
- Event processing
- Log aggregation
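The pattern maps directly onto Python's `queue.Queue`, which bundles the mutex and condition variables internally (a minimal single-producer, single-consumer sketch using a sentinel for shutdown):

```python
import queue
import threading

tasks = queue.Queue(maxsize=100)  # bounded buffer limits memory
results = []

def producer():
    for i in range(5):
        tasks.put(i)      # blocks if the buffer is full
    tasks.put(None)       # sentinel: signal end of work

def consumer():
    while True:
        item = tasks.get()  # blocks if the buffer is empty
        if item is None:
            break
        results.append(item * 2)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```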
2. Reader-Writer Lock¶
Multiple readers can access simultaneously, but writers need exclusive access.
Optimization:
- Readers don't block each other
- Writers block all readers and other writers
- Better for read-heavy workloads
Implementation:
- Track number of active readers
- Block writers when readers active
- Block readers when writer active
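The implementation outline above can be sketched as a reader-preference lock (note that writers can starve under a constant stream of readers; this is illustrative, not production-grade):

```python
import threading

class ReadWriteLock:
    def __init__(self):
        self._readers = 0
        self._readers_mutex = threading.Lock()  # guards the reader count
        self._write_lock = threading.Lock()     # held while readers or a writer are active

    def acquire_read(self):
        with self._readers_mutex:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()  # first reader blocks writers

    def release_read(self):
        with self._readers_mutex:
            self._readers -= 1
            if self._readers == 0:
                self._write_lock.release()  # last reader admits writers

    def acquire_write(self):
        self._write_lock.acquire()  # exclusive: blocks readers and writers

    def release_write(self):
        self._write_lock.release()
```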
3. Thread Pool¶
Pre-created pool of worker threads that process tasks from a queue.
Benefits:
- Avoid thread creation overhead
- Control resource usage
- Better task scheduling
Components:
- Task queue
- Worker threads
- Thread pool manager
Use Cases:
- Web servers
- Database connection pools
- Parallel processing
4. Actor Model¶
Each actor is an independent entity that processes messages sequentially.
Characteristics:
- Actors communicate via messages
- Each actor processes one message at a time
- No shared state (messages are copied)
- Natural concurrency model
Examples:
- Erlang/Elixir processes
- Akka (Java/Scala)
- Orleans (.NET)
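A toy actor in Python terms — one mailbox, one thread, messages processed strictly in order, state owned by the actor so no locks are needed (a sketch, not a production framework):

```python
import queue
import threading

class Actor:
    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        self._mailbox.put(message)  # fire-and-forget

    def _run(self):
        while True:
            message = self._mailbox.get()  # one message at a time, in order
            if message is None:            # sentinel shuts the actor down
                break
            self._handler(message)

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()

# Usage: only the actor's thread ever touches `received`
received = []
counter = Actor(lambda n: received.append(n))
for n in (1, 2, 3):
    counter.send(n)
counter.stop()
```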
Best Practices for Concurrency¶
-
Minimize Shared State:
- Use immutable data structures
- Prefer message passing over shared memory
- Encapsulate shared state
-
Use Appropriate Synchronization:
- Choose right primitive (mutex, semaphore, etc.)
- Keep critical sections small
- Avoid nested locks
-
Prevent Deadlocks:
- Lock ordering
- Lock timeouts
- Avoid circular dependencies
-
Performance Considerations:
- Lock contention is expensive
- Consider lock-free alternatives
- Use reader-writer locks for read-heavy workloads
- Minimize time in critical sections
-
Testing:
- Concurrency bugs are hard to reproduce
- Use stress testing
- Consider formal verification tools
- Test with different thread counts
-
Documentation:
- Document thread-safety guarantees
- Specify which operations are thread-safe
- Document locking requirements
Operating System Concepts in System Design¶
Understanding OS fundamentals helps in system design:
-
Process vs. Thread Choice:
- Use processes for isolation and fault tolerance
- Use threads for shared-memory parallelism
- Consider containerization (process-like isolation)
-
Synchronization in Distributed Systems:
- OS locks don't work across machines
- Use distributed locks (Redis, ZooKeeper, etcd)
- Consider eventual consistency
-
I/O Models:
- Blocking I/O: Simple but limits concurrency
- Non-blocking I/O: Better for high concurrency
- Async I/O: Event-driven, highly scalable
- Consider epoll/kqueue for high-performance servers
-
Resource Management:
- Connection pooling (like thread pools)
- Rate limiting (like semaphores)
- Circuit breakers (like monitors)
-
Scalability:
- Thread-per-request model doesn't scale
- Use async/event-driven models
- Consider actor model for high concurrency
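The async/event-driven I/O model recommended above can be sketched with `asyncio`, which multiplexes many connections on one thread using epoll/kqueue under the hood (host and port are placeholders):

```python
import asyncio

async def handle(reader, writer):
    data = await reader.readline()  # yields to the event loop while waiting
    writer.write(data.upper())      # echo back, uppercased
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 8888)
    async with server:
        await server.serve_forever()

# asyncio.run(main())  # one thread can serve thousands of concurrent connections
```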
Tools for Software Design¶
- UML (Unified Modeling Language): For creating diagrams that represent system architecture and behavior. UML provides standardized notation for class diagrams, sequence diagrams, activity diagrams, and deployment diagrams, facilitating communication between team members and stakeholders.
- Draw.io, Lucidchart: Diagramming tools for creating flowcharts and architectural diagrams. These tools support collaboration, version control, and integration with various platforms, making them essential for documenting system designs.
- Enterprise Architect, Visual Paradigm: Comprehensive modeling tools for software design. These tools provide advanced features for UML modeling, code generation, reverse engineering, and team collaboration, suitable for large-scale enterprise projects.
- Figma, Sketch: For UI/UX design aspects. These tools enable designers and developers to create, prototype, and collaborate on user interface designs, ensuring consistency and usability across the application.
- Architecture Decision Records (ADRs): Document architectural decisions and their rationale. Tools like adr-tools help maintain a record of design decisions over time.
- System Design Tools: Tools like C4 Model, Structurizr for creating architecture diagrams at different levels of abstraction (context, container, component, code).
Reliability and Resilience Patterns¶
Circuit Breaker Pattern¶
Prevents cascading failures by stopping requests to a failing service.
┌─────────────────────────────────────────────────────────────────┐
│ CIRCUIT BREAKER STATES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ failures > threshold ┌──────────┐ │
│ │ CLOSED │ ───────────────────────────────▶│ OPEN │ │
│ │ │ │ │ │
│ │ Requests │ │ Requests │ │
│ │ pass │ │ fail │ │
│ │ through │ │ fast │ │
│ └────▲─────┘ └────┬─────┘ │
│ │ │ │
│ │ success timeout │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ HALF-OPEN │ │
│ └───────────────────────────────────│ │ │
│ │ Limited │ │
│ failure │ requests │────┘
│ │ allowed │
│ └──────────────┘
│ │
└─────────────────────────────────────────────────────────────────┘
Implementation:
import time
from enum import Enum
from threading import Lock

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are rejected fast."""
    pass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"
class CircuitBreaker:
def __init__(
self,
failure_threshold=5,
recovery_timeout=30,
half_open_requests=3
):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.half_open_requests = half_open_requests
self.state = CircuitState.CLOSED
self.failures = 0
self.last_failure_time = None
self.half_open_successes = 0
self.lock = Lock()
def call(self, func, *args, **kwargs):
with self.lock:
if self.state == CircuitState.OPEN:
if self._should_attempt_reset():
self.state = CircuitState.HALF_OPEN
self.half_open_successes = 0
else:
raise CircuitOpenError("Circuit is OPEN")
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise
def _on_success(self):
with self.lock:
if self.state == CircuitState.HALF_OPEN:
self.half_open_successes += 1
if self.half_open_successes >= self.half_open_requests:
self.state = CircuitState.CLOSED
self.failures = 0
else:
self.failures = 0
def _on_failure(self):
with self.lock:
self.failures += 1
self.last_failure_time = time.time()
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.OPEN
elif self.failures >= self.failure_threshold:
self.state = CircuitState.OPEN
def _should_attempt_reset(self):
if self.last_failure_time is None:
return True
return time.time() - self.last_failure_time >= self.recovery_timeout
# Usage
circuit_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order_id):
return circuit_breaker.call(payment_api.process, order_id)
Retry Pattern with Exponential Backoff¶
import random
import time
from functools import wraps
def retry_with_backoff(
max_retries=3,
base_delay=1.0,
max_delay=60.0,
exponential_base=2,
jitter=True,
retryable_exceptions=(Exception,)
):
"""Decorator for retrying with exponential backoff."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
last_exception = None
for attempt in range(max_retries + 1):
try:
return func(*args, **kwargs)
except retryable_exceptions as e:
last_exception = e
if attempt == max_retries:
raise
# Calculate delay with exponential backoff
delay = min(
base_delay * (exponential_base ** attempt),
max_delay
)
# Add jitter to prevent thundering herd
if jitter:
delay = delay * (0.5 + random.random())
time.sleep(delay)
raise last_exception
return wrapper
return decorator
# Usage
@retry_with_backoff(
max_retries=3,
base_delay=1.0,
retryable_exceptions=(ConnectionError, TimeoutError)
)
def fetch_data(url):
response = requests.get(url, timeout=5)
response.raise_for_status()
return response.json()
Bulkhead Pattern¶
Isolates components to prevent failures from spreading.
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

class BulkheadFullError(Exception):
    """Raised when the bulkhead has no capacity within the wait window."""
    pass

class Bulkhead:
"""Isolates resources with separate thread pools and semaphores."""
def __init__(self, name, max_concurrent, max_wait_time=5):
self.name = name
self.semaphore = Semaphore(max_concurrent)
self.executor = ThreadPoolExecutor(max_workers=max_concurrent)
self.max_wait_time = max_wait_time
def execute(self, func, *args, **kwargs):
acquired = self.semaphore.acquire(timeout=self.max_wait_time)
if not acquired:
raise BulkheadFullError(f"Bulkhead '{self.name}' is full")
try:
future = self.executor.submit(func, *args, **kwargs)
return future.result(timeout=self.max_wait_time)
finally:
self.semaphore.release()
# Create separate bulkheads for different services
payment_bulkhead = Bulkhead("payment", max_concurrent=10)
inventory_bulkhead = Bulkhead("inventory", max_concurrent=20)
notification_bulkhead = Bulkhead("notification", max_concurrent=5)
# Usage - failures in payment don't affect inventory
def process_order(order):
# Each uses its own isolated resources
payment_result = payment_bulkhead.execute(payment_service.charge, order)
inventory_result = inventory_bulkhead.execute(inventory_service.reserve, order)
notification_bulkhead.execute(notification_service.send, order)
Timeout Pattern¶
import asyncio
import time
from contextlib import asynccontextmanager

@asynccontextmanager
async def timeout_context(seconds, error_message="Operation timed out"):
    """Async timeout wrapper (uses asyncio.timeout, Python 3.11+)."""
    try:
        async with asyncio.timeout(seconds):
            yield
    except TimeoutError:
        raise TimeoutError(error_message)
# Usage with cascading timeouts
async def get_order_details(order_id):
total_timeout = 5.0
start_time = time.time()
# First call
async with timeout_context(2.0):
order = await order_service.get(order_id)
elapsed = time.time() - start_time
remaining = total_timeout - elapsed
# Second call with remaining time
async with timeout_context(min(remaining, 2.0)):
user = await user_service.get(order.user_id)
return {'order': order, 'user': user}
Graceful Degradation¶
class ProductService:
def __init__(self):
self.recommendation_service = RecommendationService()
self.cache = Redis()
async def get_product_page(self, product_id):
# Core functionality - must succeed
product = await self.get_product(product_id)
# Enhanced features - can degrade
recommendations = await self._get_recommendations_gracefully(product_id)
reviews_summary = await self._get_reviews_gracefully(product_id)
return {
'product': product,
'recommendations': recommendations, # May be empty
'reviews_summary': reviews_summary, # May be cached/empty
}
async def _get_recommendations_gracefully(self, product_id):
try:
return await asyncio.wait_for(
self.recommendation_service.get(product_id),
timeout=0.5
)
except (TimeoutError, ServiceUnavailable):
# Fallback to cached recommendations
cached = await self.cache.get(f"recommendations:{product_id}")
if cached:
return cached
# Final fallback - popular products
return await self.get_popular_products()
except Exception:
return [] # Return empty on unexpected errors
Chaos Engineering¶
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
Principles of Chaos Engineering¶
- Build a Hypothesis Around Steady State Behavior
- Vary Real-World Events: Network issues, server failures, traffic spikes
- Run Experiments in Production: Or production-like environments
- Automate Experiments to Run Continuously
- Minimize Blast Radius: Start small, gradually increase scope
Chaos Experiments¶
1. Network Failures:
# Chaos Monkey for Kubernetes - Network delay
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-delay
spec:
action: delay
mode: one
selector:
namespaces:
- production
labelSelectors:
app: payment-service
delay:
latency: "100ms"
jitter: "50ms"
duration: "5m"
2. Pod Failures:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-failure
spec:
action: pod-kill
mode: one
selector:
namespaces:
- production
labelSelectors:
app: order-service
scheduler:
cron: "*/10 * * * *" # Every 10 minutes
3. Resource Stress:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
name: cpu-stress
spec:
mode: all
selector:
labelSelectors:
app: api-gateway
stressors:
cpu:
workers: 2
load: 80
duration: "2m"
Chaos Engineering Tools¶
| Tool | Description | Best For |
|---|---|---|
| Chaos Monkey | Netflix's original tool, kills instances | AWS/cloud instances |
| Chaos Mesh | Kubernetes-native chaos | K8s environments |
| Gremlin | Enterprise chaos platform | Production experiments |
| Litmus | CNCF chaos engineering | Kubernetes, cloud-native |
| Toxiproxy | TCP proxy for testing | Network simulation |
Building Chaos Experiments¶
class ChaosExperiment:
    def __init__(self, name, hypothesis, metrics, chaos_client):
        self.name = name
        self.hypothesis = hypothesis
        self.metrics = metrics            # metrics backend queried below
        self.chaos_client = chaos_client  # failure-injection client
        self.steady_state_metrics = None
def capture_steady_state(self):
"""Capture metrics before experiment."""
self.steady_state_metrics = {
'error_rate': self.metrics.get_error_rate(),
'latency_p99': self.metrics.get_latency_p99(),
'throughput': self.metrics.get_throughput(),
}
def inject_failure(self, failure_type, duration):
"""Inject the chaos."""
if failure_type == 'network_delay':
self.chaos_client.add_latency('payment-service', '200ms')
elif failure_type == 'pod_kill':
self.chaos_client.kill_pod('payment-service')
elif failure_type == 'cpu_stress':
self.chaos_client.stress_cpu('payment-service', 80)
def verify_hypothesis(self):
"""Check if system behaved as expected."""
current_metrics = {
'error_rate': self.metrics.get_error_rate(),
'latency_p99': self.metrics.get_latency_p99(),
'throughput': self.metrics.get_throughput(),
}
# Example hypothesis: error rate should stay below 1%
assert current_metrics['error_rate'] < 0.01, \
f"Error rate exceeded threshold: {current_metrics['error_rate']}"
# Latency should not increase more than 2x
assert current_metrics['latency_p99'] < self.steady_state_metrics['latency_p99'] * 2
def run(self, failure_type, duration=60):
self.capture_steady_state()
try:
self.inject_failure(failure_type, duration)
time.sleep(duration)
self.verify_hypothesis()
return {'status': 'passed', 'metrics': self.get_metrics()}
except AssertionError as e:
return {'status': 'failed', 'reason': str(e)}
finally:
self.rollback()
Common System Design Problems¶
URL Shortener (like bit.ly)¶
Requirements:
- Shorten long URLs to short codes
- Redirect short URLs to original
- Custom short codes (optional)
- Analytics (click tracking)
- Scale: 100M URLs created/month, 10:1 read:write ratio
Design:
┌─────────┐ ┌─────────────┐ ┌─────────────┐
│ Client │────▶│ API GW │────▶│ Shortener │
└─────────┘ └─────────────┘ │ Service │
└──────┬──────┘
│
┌──────────────────────┴──────────────────────┐
│ │
┌─────▼─────┐ ┌───────▼───────┐
│ Cache │ │ Database │
│ (Redis) │ │ (Cassandra) │
└───────────┘ └───────────────┘
Key Decisions:
- ID Generation: Base62 encoding of unique ID (62^7 = 3.5 trillion combinations)
- Database: Cassandra for high write throughput, or DynamoDB
- Caching: Cache popular URLs in Redis (80/20 rule)
import hashlib
import string
class URLShortener:
ALPHABET = string.ascii_letters + string.digits # Base62
def shorten(self, long_url, custom_code=None):
if custom_code:
if self.db.exists(custom_code):
raise CodeExistsError()
code = custom_code
else:
code = self._generate_code(long_url)
self.db.save(code, long_url)
return f"https://short.url/{code}"
    def _generate_code(self, url):
        # Counter-based approach: encode a unique ID (e.g., Snowflake) in Base62.
        # Note: truncating to 7 chars stays collision-free only while IDs fit in 62^7.
        unique_id = self.id_generator.next()
        return self._base62_encode(unique_id)[:7]
def _base62_encode(self, num):
if num == 0:
return self.ALPHABET[0]
result = []
while num:
result.append(self.ALPHABET[num % 62])
num //= 62
return ''.join(reversed(result))
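For redirects, the short code can be decoded back to the numeric ID for a primary-key lookup. A minimal sketch of the encode/decode pair as standalone functions (mirroring the class above; names are illustrative):

```python
import string

ALPHABET = string.ascii_letters + string.digits  # same Base62 alphabet as above

def base62_encode(num):
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num:
        digits.append(ALPHABET[num % 62])
        num //= 62
    return ''.join(reversed(digits))

def base62_decode(code):
    # Inverse of base62_encode: fold each character back into the number
    num = 0
    for char in code:
        num = num * 62 + ALPHABET.index(char)
    return num
```

Decoding makes the redirect path a single ID lookup instead of a secondary-index query on the code string.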
Rate Limiter¶
Requirements:
- Limit requests per user/IP
- Support different limits for different API tiers
- Distributed across multiple servers
- Low latency impact
Design using Sliding Window Counter in Redis:
import time

class DistributedRateLimiter:
def __init__(self, redis_client):
self.redis = redis_client
def is_allowed(self, key, limit, window_seconds):
now = time.time()
window_start = now - window_seconds
# Lua script for atomic sliding window
script = """
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window_start = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
-- Count current window
local count = redis.call('ZCARD', key)
if count < limit then
-- Add new request
redis.call('ZADD', key, now, now .. ':' .. math.random())
redis.call('EXPIRE', key, ARGV[4])
return 1
else
return 0
end
"""
return bool(self.redis.eval(
script, 1, key, now, window_start, limit, window_seconds
))
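For a single node (or for unit testing without Redis), the same sliding-window idea can be sketched in plain Python. This is an in-process stand-in, not a replacement for the distributed version above:

```python
import time
from collections import deque

class LocalRateLimiter:
    """Sliding window log: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.requests = {}  # key -> deque of accepted-request timestamps

    def is_allowed(self, key, now=None):
        now = time.time() if now is None else now
        window = self.requests.setdefault(key, deque())
        # Evict timestamps that have fallen out of the window
        while window and window[0] <= now - self.window:
            window.popleft()
        if len(window) < self.limit:
            window.append(now)
            return True
        return False
```

The window log gives exact counts at the cost of memory proportional to the limit; the Redis sorted-set version above is the same algorithm made atomic and shared across servers.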
Distributed Cache¶
Requirements:
- High read throughput (millions of reads/second)
- Sub-millisecond latency
- Automatic failover
- Data partitioning
Design:
┌─────────────────────────────────────────────────────────────────┐
│ Distributed Cache Cluster │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │
│ │ │ │ │ │ │ │
│ │ Keys: 0-32 │ │ Keys: 33-65│ │ Keys: 66-99│ │
│ │ │ │ │ │ │ │
│ │ Primary │ │ Primary │ │ Primary │ │
│ │ Replica1 │ │ Replica1 │ │ Replica1 │ │
│ │ Replica2 │ │ Replica2 │ │ Replica2 │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │
│ Consistent Hashing Ring for Key Distribution │
│ │
└─────────────────────────────────────────────────────────────────┘
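The consistent hashing ring named in the diagram can be sketched as follows. Each shard is placed on the ring many times as "virtual nodes" so that keys redistribute evenly when shards join or leave (a minimal sketch, not production code):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Ring is a sorted list of (hash, node); each node appears vnodes times
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, ''))
        return self.ring[idx % len(self.ring)][1]
```

With virtual nodes, removing one shard remaps only roughly 1/N of the keys, instead of nearly all of them as with naive `hash(key) % N` sharding.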
Message Queue¶
Requirements:
- At-least-once delivery
- Ordering guarantees (per partition)
- Horizontal scalability
- Consumer groups
Simplified Kafka-like Design:
import time

class Partition:
def __init__(self, partition_id):
self.partition_id = partition_id
self.messages = [] # In practice, append-only log on disk
self.offset = 0
def append(self, message):
self.messages.append({
'offset': self.offset,
'timestamp': time.time(),
'data': message
})
self.offset += 1
return self.offset - 1
class Topic:
def __init__(self, name, num_partitions):
self.name = name
self.partitions = [
Partition(i) for i in range(num_partitions)
]
    def produce(self, key, message):
        # Partition by key hash so all messages for a key stay ordered.
        # (Use a stable hash in practice; Python's built-in hash() is
        # randomized per process for strings.)
        partition_id = hash(key) % len(self.partitions)
        return self.partitions[partition_id].append(message)
class ConsumerGroup:
def __init__(self, group_id, topic):
self.group_id = group_id
self.topic = topic
self.offsets = {} # partition_id -> offset
self.assignments = {} # consumer_id -> [partition_ids]
def consume(self, consumer_id, batch_size=100):
messages = []
for partition_id in self.assignments.get(consumer_id, []):
offset = self.offsets.get(partition_id, 0)
partition = self.topic.partitions[partition_id]
batch = partition.messages[offset:offset + batch_size]
messages.extend(batch)
return messages
def commit(self, partition_id, offset):
self.offsets[partition_id] = offset
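One piece the sketch leaves implicit is how `assignments` gets populated. A round-robin rebalance, similar in spirit to Kafka's assignors, could look like this (illustrative only):

```python
def assign_partitions(consumer_ids, num_partitions):
    """Round-robin assignment of partitions to consumers in a group."""
    assignments = {cid: [] for cid in consumer_ids}
    ordered = sorted(consumer_ids)  # sort for a deterministic result
    for partition_id in range(num_partitions):
        owner = ordered[partition_id % len(ordered)]
        assignments[owner].append(partition_id)
    return assignments
```

Rerunning this whenever a consumer joins or leaves the group is the essence of a rebalance: each partition has exactly one owner, which preserves per-partition ordering.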
Notification System¶
Requirements:
- Push notifications (iOS, Android, Web)
- Email and SMS
- Template management
- Rate limiting per user
- Priority queues
Design:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Notification System │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────────┐ ┌────────────────────────────┐ │
│ │ Trigger │────▶│ Notification │────▶│ Priority Queues │ │
│ │ Service │ │ Service │ │ │ │
│ └────────────┘ └────────────────┘ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ High │ │ Normal │ │ │
│ │ │ (OTP) │ │ (promo) │ │ │
│ │ └────┬────┘ └────┬────┘ │ │
│ └───────┼───────────┼──────┘ │
│ │ │ │
│ ┌──────────────────────────────┴───────────┴──────┐ │
│ │ Workers │ │
│ └───────────┬────────────┬────────────┬──────────┘ │
│ │ │ │ │
│ ┌─────▼────┐ ┌─────▼────┐ ┌─────▼────┐ │
│ │ Push │ │ Email │ │ SMS │ │
│ │ Provider │ │ Provider │ │ Provider │ │
│ │ (FCM/APNs)│ │(SendGrid)│ │ (Twilio) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
class NotificationService:
    def __init__(self, user_service):
        self.user_service = user_service  # looks up per-user channel preferences
        self.template_engine = TemplateEngine()
        self.rate_limiter = RateLimiter()
self.queues = {
'high': PriorityQueue('high'),
'normal': PriorityQueue('normal'),
'low': PriorityQueue('low'),
}
async def send(self, notification):
# Check rate limits
if not self.rate_limiter.allow(notification.user_id, notification.channel):
raise RateLimitExceeded()
# Render template
content = self.template_engine.render(
notification.template_id,
notification.variables
)
# Get user preferences
preferences = await self.user_service.get_preferences(notification.user_id)
# Enqueue for each enabled channel
for channel in notification.channels:
if preferences.is_enabled(channel):
message = {
'user_id': notification.user_id,
'channel': channel,
'content': content,
'metadata': notification.metadata,
}
priority = self._get_priority(notification.type)
self.queues[priority].enqueue(message)
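`_get_priority` is left out above; a plausible mapping from notification type to the three queues might look like this (the type names here are hypothetical):

```python
# Hypothetical notification types mapped to the three queues above
PRIORITY_BY_TYPE = {
    'otp': 'high',             # time-sensitive, user is actively waiting
    'security_alert': 'high',
    'order_update': 'normal',
    'promotion': 'low',        # tolerates delay and batching
}

def get_priority(notification_type):
    # Unknown types default to 'normal' rather than failing the send
    return PRIORITY_BY_TYPE.get(notification_type, 'normal')
```

Keeping the mapping in data rather than code lets product teams retune priorities without a deploy.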
System Design Interview Tips¶
Framework for System Design¶
1. Requirements Clarification (5 minutes):
- Functional requirements
- Non-functional requirements (scale, latency, availability)
- Constraints and assumptions
2. Back-of-Envelope Estimation (5 minutes):
- Traffic estimates (QPS, peak)
- Storage estimates
- Bandwidth estimates
3. High-Level Design (10 minutes):
- Core components
- Data flow
- API design
4. Deep Dive (15 minutes):
- Database schema
- Scaling strategies
- Trade-offs and alternatives
5. Wrap-Up (5 minutes):
- Bottlenecks and mitigations
- Monitoring and alerting
- Future improvements
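The estimation step is plain arithmetic; for the numbers used earlier (10M DAU, 20 requests per user per day, 3x peak factor), a quick sketch:

```python
dau = 10_000_000          # daily active users
requests_per_user = 20
seconds_per_day = 86_400
peak_factor = 3           # assume peak traffic is 3x the average

daily_requests = dau * requests_per_user        # 200 million/day
average_rps = daily_requests / seconds_per_day  # ~2,300 RPS
peak_rps = average_rps * peak_factor            # ~7,000 RPS
```

Rounding aggressively is fine here: the goal is the order of magnitude, which then drives choices like how many servers, shards, or cache nodes to discuss.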
Key Questions to Ask¶
Functional:
- Who are the users?
- What are the main features?
- What operations are supported?
Scale:
- How many users?
- How many requests per second?
- How much data?
Performance:
- What's the expected latency?
- What's the availability requirement?
- What consistency model is needed?
Common Trade-offs¶
| Decision | Option A | Option B |
|---|---|---|
| Consistency | Strong (CP) | Eventual (AP) |
| Storage | SQL (ACID) | NoSQL (scale) |
| Communication | Sync (simple) | Async (resilient) |
| Caching | Cache aside (flexible) | Write-through (consistent) |
| Scaling | Vertical (simple) | Horizontal (near-unlimited) |