Databases¶
Databases are the cornerstone of modern software engineering, providing the fundamental infrastructure for persistent data storage, retrieval, and management. Understanding databases deeply—from theoretical foundations to practical implementation details—is essential for building scalable, reliable, and efficient applications. This guide covers relational and non-relational databases, their internal architectures, query optimization, transactions, distributed systems concepts, and best practices for real-world usage.
Table of Contents¶
- Fundamentals of Data Storage
- The Relational Model
- SQL: Structured Query Language
- Database Design and Normalization
- Indexing and Query Optimization
- Transactions and Concurrency Control
- NoSQL Databases
- CAP Theorem and Distributed Databases
- Database Internals
- Sharding, Replication, and Partitioning
- Object-Relational Mappers (ORMs)
- Database Security
- Performance Tuning
- Backup, Recovery, and High Availability
Fundamentals of Data Storage¶
Why Databases Exist¶
Before databases, applications stored data in flat files—simple text or binary files on disk. While this works for trivial cases, it quickly becomes problematic:
- Concurrent Access: Multiple processes reading/writing the same file leads to data corruption.
- Data Integrity: No built-in way to enforce rules (e.g., "email must be unique").
- Query Efficiency: Finding specific records requires scanning entire files.
- Data Relationships: Representing connections between entities is cumbersome.
- Crash Recovery: A power failure mid-write can corrupt the entire file.
Databases solve these problems by providing:
- Structured Storage: Organized data models (tables, documents, graphs).
- Query Languages: Declarative ways to retrieve and manipulate data.
- ACID Guarantees: Reliable transactions even during failures.
- Concurrency Control: Safe multi-user access without corruption.
- Indexing: Fast lookups without full scans.
- Security: Authentication, authorization, and encryption.
Data Models Overview¶
Different applications have different data requirements. The major data models are:
| Model | Structure | Examples | Best For |
|---|---|---|---|
| Relational | Tables with rows and columns | PostgreSQL, MySQL, SQLite | Structured data with complex relationships |
| Document | JSON/BSON documents in collections | MongoDB, CouchDB | Semi-structured, hierarchical data |
| Key-Value | Simple key→value pairs | Redis, DynamoDB, etcd | Caching, session storage, config |
| Column-Family | Wide columns, sparse data | Cassandra, HBase, ScyllaDB | Time-series, analytics, IoT |
| Graph | Nodes and edges | Neo4j, ArangoDB, Amazon Neptune | Social networks, recommendations |
| Time-Series | Timestamped data points | InfluxDB, TimescaleDB, Prometheus | Metrics, monitoring, IoT sensors |
| Vector | High-dimensional embeddings | Pinecone, Milvus, pgvector | AI/ML similarity search |
The Relational Model¶
Theoretical Foundations¶
The relational model, proposed by Edgar F. Codd in 1970, is based on mathematical set theory and predicate logic. It provides a formal framework for organizing data into relations (tables).
Core Concepts¶
Relation (Table): A relation is a set of tuples (rows) that share the same attributes (columns). Mathematically, a relation \(R\) over \(n\) attributes is a subset of the Cartesian product of their domains:
\( R \subseteq D_1 \times D_2 \times \cdots \times D_n \)
Where \(D_i\) is the domain (set of possible values) for attribute \(i\).
Tuple (Row): A single record in a relation, representing one instance of the entity. Tuples are unordered—the relational model doesn't guarantee row order.
Attribute (Column): A named property of a relation with a defined domain. Each attribute has:
- Name: Unique identifier within the relation
- Domain: Set of allowable values (data type + constraints)
- Nullability: Whether NULL values are permitted
Domain: The set of atomic values an attribute can take. Examples:
- INTEGER: {..., -2, -1, 0, 1, 2, ...}
- VARCHAR(255): All strings up to 255 characters
- BOOLEAN: {TRUE, FALSE}
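Some databases let you name a domain once and reuse it across tables. A minimal PostgreSQL sketch (the domain and table names here are illustrative):

```sql
-- Define a reusable domain: a non-negative monetary amount
CREATE DOMAIN money_amount AS DECIMAL(10, 2)
    CHECK (VALUE >= 0);

-- Any column declared with the domain inherits its constraint
CREATE TABLE invoices (
    id SERIAL PRIMARY KEY,
    subtotal money_amount NOT NULL,
    tax money_amount DEFAULT 0
);
```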
Schema: The logical structure defining a relation's attributes and constraints. A schema is metadata—it describes the data without containing actual values.
-- Schema definition
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
department VARCHAR(50),
salary DECIMAL(10, 2) CHECK (salary > 0),
hire_date DATE DEFAULT CURRENT_DATE,
manager_id INTEGER REFERENCES employees(id)
);
Keys and Constraints¶
Keys are fundamental to the relational model, establishing identity and relationships.
Types of Keys¶
Super Key: Any set of attributes that uniquely identifies a tuple. For a table with columns {id, email, name}, super keys include:
- {id}
- {email}
- {id, email}
- {id, name}
- {id, email, name}
Candidate Key: A minimal super key—no proper subset is also a super key. From the above, {id} and {email} are candidate keys.
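Candidate keys that are not chosen as the primary key are typically still enforced, usually with UNIQUE NOT NULL constraints — a sketch:

```sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,              -- chosen candidate key (primary)
    email VARCHAR(255) UNIQUE NOT NULL  -- remaining candidate key, still enforced
);
```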
Primary Key: The chosen candidate key for a relation. It:
- Must be unique across all rows
- Cannot contain NULL values
- Should be immutable (rarely change)
- Is typically indexed automatically
-- Primary key can be single column
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL
);
-- Or composite (multiple columns)
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);
Foreign Key: An attribute (or set) that references a primary key in another table, establishing relationships.
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users(id),
total DECIMAL(10, 2),
created_at TIMESTAMP DEFAULT NOW()
);
Natural Key vs Surrogate Key:
- Natural Key: Uses real-world identifiers (email, SSN, ISBN).
  - Pros: Meaningful, no extra column
  - Cons: May change, privacy concerns, can be complex
- Surrogate Key: System-generated identifier (auto-increment, UUID).
  - Pros: Stable, simple, no business logic
  - Cons: Meaningless, extra storage
-- Surrogate key (recommended for most cases)
CREATE TABLE products (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
sku VARCHAR(50) UNIQUE, -- Natural key as secondary
name VARCHAR(255)
);
Constraint Types¶
Constraints enforce data integrity rules at the database level:
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
-- NOT NULL: Attribute must have a value
name VARCHAR(100) NOT NULL,
-- UNIQUE: No duplicate values (NULLs may be allowed)
email VARCHAR(255) UNIQUE,
-- CHECK: Custom validation rules
age INTEGER CHECK (age >= 18 AND age <= 120),
salary DECIMAL(10, 2) CHECK (salary > 0),
-- DEFAULT: Value when not specified
status VARCHAR(20) DEFAULT 'active',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- FOREIGN KEY with referential actions
department_id INTEGER REFERENCES departments(id)
ON DELETE SET NULL
ON UPDATE CASCADE
);
-- Table-level constraints (for multi-column)
ALTER TABLE employees ADD CONSTRAINT unique_name_dept
UNIQUE (name, department_id);
ALTER TABLE employees ADD CONSTRAINT valid_dates
CHECK (termination_date IS NULL OR termination_date > hire_date);
Referential Integrity Actions:
| Action | On DELETE | On UPDATE |
|---|---|---|
CASCADE |
Delete referencing rows | Update foreign key values |
SET NULL |
Set FK to NULL | Set FK to NULL |
SET DEFAULT |
Set FK to default | Set FK to default |
RESTRICT |
Prevent deletion | Prevent update |
NO ACTION |
Check at end of transaction | Check at end of transaction |
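Referential actions are attached per foreign key. A standalone sketch (simplified versions of the earlier orders/order_items tables) showing CASCADE deletes:

```sql
CREATE TABLE orders (
    id SERIAL PRIMARY KEY
);

CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(id) ON DELETE CASCADE,
    product_id INTEGER,
    PRIMARY KEY (order_id, product_id)
);

-- Deleting an order also deletes its items; no orphaned rows remain
DELETE FROM orders WHERE id = 42;
```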
Relational Algebra¶
Relational algebra provides the theoretical foundation for SQL operations. Understanding it helps optimize queries.
Fundamental Operations¶
Selection (σ): Filters rows based on a predicate.
SELECT * FROM employees WHERE salary > 50000;
Projection (π): Selects specific columns.
SELECT name, salary FROM employees;
Cartesian Product (×): Combines every row of one table with every row of another.
SELECT * FROM employees, departments; -- Rarely useful alone
Union (∪): Combines rows from two relations with the same schema.
SELECT name FROM employees
UNION
SELECT name FROM contractors;
Difference (−): Rows in first relation but not second.
SELECT name FROM all_users
EXCEPT
SELECT name FROM banned_users;
Intersection (∩): Rows in both relations.
SELECT name FROM customers
INTERSECT
SELECT name FROM newsletter_subscribers;
Derived Operations¶
Join (⋈): Combines related rows from two tables.
Natural Join: Joins on all common attributes.
SELECT * FROM employees NATURAL JOIN departments;
Theta Join: Joins with arbitrary condition.
SELECT * FROM employees e, projects p
WHERE e.skill_level >= p.required_level;
Equi-Join: Theta join with equality condition (most common).
SELECT * FROM employees e
JOIN departments d ON e.dept_id = d.id;
Outer Joins: Include non-matching rows.
-- LEFT OUTER JOIN: All rows from left table
SELECT e.name, d.name as dept
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- RIGHT OUTER JOIN: All rows from right table
SELECT e.name, d.name as dept
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.id;
-- FULL OUTER JOIN: All rows from both tables
SELECT e.name, d.name as dept
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.id;
Semi-Join (⋉): Returns rows from left table that have a match in right table.
SELECT * FROM employees e
WHERE EXISTS (
SELECT 1 FROM projects p WHERE p.lead_id = e.id
);
Anti-Join: Returns rows from left table with no match in right table.
SELECT * FROM employees e
WHERE NOT EXISTS (
SELECT 1 FROM projects p WHERE p.lead_id = e.id
);
SQL: Structured Query Language¶
SQL is the standard language for interacting with relational databases. While implementations vary, the core syntax is consistent across systems.
SQL Sublanguages¶
SQL is divided into several sublanguages, each serving a specific purpose:
DDL (Data Definition Language)¶
Defines database structure:
-- CREATE: Create new objects
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
price DECIMAL(10, 2),
category_id INTEGER REFERENCES categories(id)
);
CREATE INDEX idx_products_category ON products(category_id);
CREATE VIEW expensive_products AS
SELECT * FROM products WHERE price > 100;
-- ALTER: Modify existing objects
ALTER TABLE products ADD COLUMN description TEXT;
ALTER TABLE products ALTER COLUMN name TYPE VARCHAR(500);
ALTER TABLE products DROP COLUMN description;
ALTER TABLE products ADD CONSTRAINT positive_price CHECK (price > 0);
-- DROP: Remove objects
DROP TABLE products; -- Fails if referenced
DROP TABLE products CASCADE; -- Drops dependent objects too
DROP INDEX idx_products_category;
-- TRUNCATE: Remove all rows (faster than DELETE)
TRUNCATE TABLE products; -- Cannot be rolled back in some DBs
TRUNCATE TABLE products RESTART IDENTITY CASCADE;
DML (Data Manipulation Language)¶
Manipulates data within tables:
-- INSERT: Add new rows
INSERT INTO products (name, price, category_id)
VALUES ('Laptop', 999.99, 1);
-- Insert multiple rows
INSERT INTO products (name, price, category_id) VALUES
('Mouse', 29.99, 2),
('Keyboard', 79.99, 2),
('Monitor', 299.99, 3);
-- Insert from query
INSERT INTO archived_orders (id, total, archived_at)
SELECT id, total, NOW() FROM orders WHERE created_at < '2024-01-01';
-- UPDATE: Modify existing rows
UPDATE products SET price = price * 1.1 WHERE category_id = 1;
UPDATE products p
SET price = p.price * (1 - d.discount_rate)
FROM discounts d
WHERE p.category_id = d.category_id;
-- DELETE: Remove rows
DELETE FROM products WHERE id = 5;
DELETE FROM orders
WHERE created_at < NOW() - INTERVAL '1 year'
AND status = 'completed';
-- UPSERT: Insert or update on conflict
INSERT INTO products (sku, name, price)
VALUES ('ABC123', 'Widget', 9.99)
ON CONFLICT (sku) DO UPDATE SET
name = EXCLUDED.name,
price = EXCLUDED.price;
DQL (Data Query Language)¶
Retrieves data from the database:
-- Basic SELECT
SELECT name, price FROM products;
-- With conditions
SELECT * FROM products
WHERE price > 100 AND category_id IN (1, 2, 3);
-- Ordering and limiting
SELECT * FROM products
ORDER BY price DESC, name ASC
LIMIT 10 OFFSET 20;
-- Aggregations
SELECT
category_id,
COUNT(*) as product_count,
AVG(price) as avg_price,
MIN(price) as min_price,
MAX(price) as max_price,
SUM(price) as total_value
FROM products
GROUP BY category_id
HAVING COUNT(*) > 5;
-- Joins
SELECT
p.name as product,
c.name as category,
s.name as supplier
FROM products p
JOIN categories c ON p.category_id = c.id
LEFT JOIN suppliers s ON p.supplier_id = s.id
WHERE p.active = true;
DCL (Data Control Language)¶
Controls access permissions:
-- GRANT: Give permissions
GRANT SELECT, INSERT ON products TO app_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO admin;
GRANT USAGE ON SCHEMA reporting TO analyst;
-- REVOKE: Remove permissions
REVOKE DELETE ON products FROM app_user;
REVOKE ALL PRIVILEGES ON sensitive_data FROM public;
-- Role management
CREATE ROLE readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
GRANT readonly TO analyst_user;
TCL (Transaction Control Language)¶
Manages transactions:
-- Start a transaction
BEGIN; -- or START TRANSACTION
-- Make changes
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- Savepoint for partial rollback
SAVEPOINT after_debit;
UPDATE audit_log SET processed = true WHERE transaction_id = 123;
-- Rollback to savepoint if needed
ROLLBACK TO after_debit;
-- Commit or rollback entire transaction
COMMIT; -- or ROLLBACK;
Advanced SQL Features¶
Common Table Expressions (CTEs)¶
CTEs improve readability and enable recursive queries:
-- Non-recursive CTE
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', created_at) as month,
SUM(total) as revenue
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
),
monthly_growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) as growth
FROM monthly_sales
)
SELECT * FROM monthly_growth WHERE growth > 0;
-- Recursive CTE (organizational hierarchy)
WITH RECURSIVE org_tree AS (
-- Base case: top-level employees
SELECT id, name, manager_id, 1 as depth, ARRAY[name] as path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees with managers
SELECT e.id, e.name, e.manager_id, ot.depth + 1, ot.path || e.name
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
WHERE ot.depth < 10 -- Prevent infinite recursion
)
SELECT * FROM org_tree ORDER BY path;
Window Functions¶
Window functions perform calculations across rows related to the current row:
SELECT
name,
department,
salary,
-- Ranking functions
ROW_NUMBER() OVER (ORDER BY salary DESC) as overall_rank,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
DENSE_RANK() OVER (ORDER BY salary DESC) as dense_rank,
NTILE(4) OVER (ORDER BY salary DESC) as quartile,
-- Aggregate functions as windows
SUM(salary) OVER (PARTITION BY department) as dept_total,
AVG(salary) OVER (PARTITION BY department) as dept_avg,
COUNT(*) OVER (PARTITION BY department) as dept_count,
-- Navigation functions
LAG(salary) OVER (ORDER BY hire_date) as prev_salary,
LEAD(salary) OVER (ORDER BY hire_date) as next_salary,
FIRST_VALUE(name) OVER (PARTITION BY department ORDER BY salary DESC) as top_earner,
LAST_VALUE(name) OVER (
PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as lowest_earner,
-- Running totals
SUM(salary) OVER (ORDER BY hire_date ROWS UNBOUNDED PRECEDING) as running_total,
AVG(salary) OVER (ORDER BY hire_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg
FROM employees;
Subqueries¶
Subqueries are queries nested within other queries:
-- Scalar subquery (returns single value)
SELECT
name,
salary,
salary - (SELECT AVG(salary) FROM employees) as diff_from_avg
FROM employees;
-- Table subquery in FROM clause
SELECT dept, avg_salary
FROM (
SELECT department as dept, AVG(salary) as avg_salary
FROM employees
GROUP BY department
) dept_stats
WHERE avg_salary > 50000;
-- Correlated subquery (references outer query)
SELECT e1.name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e1.department
);
-- EXISTS subquery
SELECT c.name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id
AND o.total > 1000
);
-- IN vs EXISTS
-- IN: Better for small subquery results
SELECT * FROM products WHERE category_id IN (SELECT id FROM categories WHERE active);
-- EXISTS: Better for large tables with indexes
SELECT * FROM products p WHERE EXISTS (
SELECT 1 FROM categories c WHERE c.id = p.category_id AND c.active
);
JSON Operations¶
Modern databases support JSON for semi-structured data:
-- PostgreSQL JSON operations
CREATE TABLE events (
id SERIAL PRIMARY KEY,
data JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
INSERT INTO events (data) VALUES
('{"type": "click", "page": "/home", "user": {"id": 123, "name": "Alice"}}');
-- Query JSON fields
SELECT
data->>'type' as event_type, -- Get as text
data->'user'->>'name' as user_name, -- Nested access
data#>>'{user,id}' as user_id, -- Path access
jsonb_typeof(data->'user') as user_type -- Type check
FROM events
WHERE data->>'type' = 'click'
AND (data->'user'->>'id')::int > 100;
-- JSON containment
SELECT * FROM events WHERE data @> '{"type": "click"}';
-- JSON existence
SELECT * FROM events WHERE data ? 'user'; -- Has key
SELECT * FROM events WHERE data ?& array['type', 'page']; -- Has all keys
-- Update JSON
UPDATE events
SET data = data || '{"processed": true}'
WHERE id = 1;
UPDATE events
SET data = jsonb_set(data, '{user,verified}', 'true')
WHERE id = 1;
-- Index JSON for performance
CREATE INDEX idx_events_type ON events ((data->>'type'));
CREATE INDEX idx_events_data ON events USING GIN (data);
Database Design and Normalization¶
Good database design prevents anomalies, reduces redundancy, and improves data integrity.
Normal Forms¶
Normalization is the process of organizing data to reduce redundancy and dependency issues. Each normal form builds on the previous.
First Normal Form (1NF)¶
Requirements:
- Each column contains atomic (indivisible) values
- Each column contains values of a single type
- Each row is unique (has a primary key)
- No repeating groups
Violation:
| OrderID | CustomerName | Products |
|---|---|---|
| 1 | Alice | Laptop, Mouse |
| 2 | Bob | Keyboard |
Fixed (1NF):
| OrderID | CustomerName | Product |
|---|---|---|
| 1 | Alice | Laptop |
| 1 | Alice | Mouse |
| 2 | Bob | Keyboard |
Second Normal Form (2NF)¶
Requirements:
- Must be in 1NF
- No partial dependencies (non-key attributes depend on the entire primary key)
Violation (composite key: OrderID + Product):
| OrderID | Product | CustomerName | Price |
|---|---|---|---|
| 1 | Laptop | Alice | 999 |
| 1 | Mouse | Alice | 29 |
CustomerName depends only on OrderID, not the full key.
Fixed (2NF):
Orders table:
| OrderID | CustomerName |
|---|---|
| 1 | Alice |
OrderItems table:
| OrderID | Product | Price |
|---|---|---|
| 1 | Laptop | 999 |
| 1 | Mouse | 29 |
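The same decomposition expressed as DDL (a sketch; names mirror the tables above):

```sql
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_name VARCHAR(100)
);

CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(order_id),
    product VARCHAR(100),
    price DECIMAL(10, 2),
    PRIMARY KEY (order_id, product)
);
```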
Third Normal Form (3NF)¶
Requirements:
- Must be in 2NF
- No transitive dependencies (non-key attributes depend only on the primary key, not other non-key attributes)
Violation:
| EmployeeID | DepartmentID | DepartmentName |
|---|---|---|
| 1 | 10 | Engineering |
| 2 | 10 | Engineering |
DepartmentName depends on DepartmentID, not EmployeeID.
Fixed (3NF):
Employees table:
| EmployeeID | DepartmentID |
|---|---|
| 1 | 10 |
| 2 | 10 |
Departments table:
| DepartmentID | DepartmentName |
|---|---|
| 10 | Engineering |
Boyce-Codd Normal Form (BCNF)¶
Requirements:
- Must be in 3NF
- For every functional dependency X → Y, X must be a superkey
BCNF is a stricter version of 3NF that handles certain edge cases with overlapping candidate keys.
Violation:
| Student | Subject | Teacher |
|---|---|---|
| Alice | Math | Dr. Smith |
| Bob | Math | Dr. Jones |
| Alice | Physics | Dr. Brown |
If each teacher teaches only one subject, then Teacher → Subject. But Teacher isn't a superkey.
Fixed (BCNF):
StudentTeacher table:
| Student | Teacher |
|---|---|
| Alice | Dr. Smith |
| Bob | Dr. Jones |
| Alice | Dr. Brown |
TeacherSubject table:
| Teacher | Subject |
|---|---|
| Dr. Smith | Math |
| Dr. Jones | Math |
| Dr. Brown | Physics |
Fourth Normal Form (4NF)¶
Requirements:
- Must be in BCNF
- No non-trivial multi-valued dependencies (independent multi-valued facts about the same entity must not be stored in one relation)
Violation:
| Employee | Skill | Language |
|---|---|---|
| Alice | Python | English |
| Alice | Python | Spanish |
| Alice | Java | English |
| Alice | Java | Spanish |
Skills and Languages are independent multi-valued facts about Employee.
Fixed (4NF):
EmployeeSkills:
| Employee | Skill |
|---|---|
| Alice | Python |
| Alice | Java |
EmployeeLanguages:
| Employee | Language |
|---|---|
| Alice | English |
| Alice | Spanish |
Fifth Normal Form (5NF)¶
Requirements:
- Must be in 4NF
- Cannot be further decomposed without losing information (no join dependencies)
5NF handles complex cases where data can only be reconstructed by joining three or more tables.
Denormalization¶
While normalization reduces redundancy, it increases join complexity. Denormalization intentionally adds redundancy for performance:
When to denormalize:
- Read-heavy workloads with complex joins
- Reporting and analytics queries
- Caching frequently accessed combinations
- When consistency can be managed at application level
Techniques:
-- Redundant columns
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
-- Update via trigger or application logic
-- Materialized views
CREATE MATERIALIZED VIEW order_summary AS
SELECT
o.id,
o.total,
c.name as customer_name,
c.email as customer_email
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Refresh periodically
REFRESH MATERIALIZED VIEW order_summary;
-- Pre-computed aggregates
CREATE TABLE daily_sales (
date DATE PRIMARY KEY,
total_orders INTEGER,
total_revenue DECIMAL(12, 2),
avg_order_value DECIMAL(10, 2)
);
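One way to keep such a rollup current is a scheduled upsert — a sketch assuming the orders table from earlier sections:

```sql
INSERT INTO daily_sales (date, total_orders, total_revenue, avg_order_value)
SELECT
    created_at::date,
    COUNT(*),
    SUM(total),
    AVG(total)
FROM orders
WHERE created_at >= CURRENT_DATE
GROUP BY created_at::date
ON CONFLICT (date) DO UPDATE SET
    total_orders = EXCLUDED.total_orders,
    total_revenue = EXCLUDED.total_revenue,
    avg_order_value = EXCLUDED.avg_order_value;
```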
Schema Design Patterns¶
Star Schema (Data Warehousing)¶
Central fact table surrounded by dimension tables:
-- Fact table (measures/metrics)
CREATE TABLE fact_sales (
id SERIAL PRIMARY KEY,
date_id INTEGER REFERENCES dim_date(id),
product_id INTEGER REFERENCES dim_product(id),
customer_id INTEGER REFERENCES dim_customer(id),
store_id INTEGER REFERENCES dim_store(id),
quantity INTEGER,
unit_price DECIMAL(10, 2),
total_amount DECIMAL(12, 2)
);
-- Dimension tables (descriptive attributes)
CREATE TABLE dim_date (
id INTEGER PRIMARY KEY,
full_date DATE,
year INTEGER,
quarter INTEGER,
month INTEGER,
day INTEGER,
day_of_week INTEGER,
is_weekend BOOLEAN
);
CREATE TABLE dim_product (
id INTEGER PRIMARY KEY,
sku VARCHAR(50),
name VARCHAR(255),
category VARCHAR(100),
brand VARCHAR(100)
);
Snowflake Schema¶
Normalized dimension tables (dimensions have sub-dimensions):
-- Normalized product dimension
CREATE TABLE dim_category (
id INTEGER PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE dim_brand (
id INTEGER PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE dim_product (
id INTEGER PRIMARY KEY,
sku VARCHAR(50),
name VARCHAR(255),
category_id INTEGER REFERENCES dim_category(id),
brand_id INTEGER REFERENCES dim_brand(id)
);
Slowly Changing Dimensions (SCD)¶
Handling historical changes in dimension data:
Type 1 - Overwrite:
-- Simply update the record (no history)
UPDATE dim_customer SET address = 'New Address' WHERE id = 123;
Type 2 - Add New Row:
CREATE TABLE dim_customer (
surrogate_key SERIAL PRIMARY KEY,
customer_id INTEGER, -- Natural key
name VARCHAR(100),
address VARCHAR(255),
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN DEFAULT true
);
-- When customer changes address:
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - 1, is_current = false
WHERE customer_id = 123 AND is_current = true;
INSERT INTO dim_customer (customer_id, name, address, effective_date, is_current)
VALUES (123, 'Alice', 'New Address', CURRENT_DATE, true);
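With Type 2, both current and historical states remain queryable — a sketch against the table above:

```sql
-- Current version of each customer
SELECT * FROM dim_customer WHERE is_current;

-- Point-in-time lookup: customer 123 as of a given date
SELECT * FROM dim_customer
WHERE customer_id = 123
  AND effective_date <= DATE '2024-06-01'
  AND (expiration_date IS NULL OR expiration_date >= DATE '2024-06-01');
```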
Type 3 - Add Column:
CREATE TABLE dim_customer (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
current_address VARCHAR(255),
previous_address VARCHAR(255)
);
Indexing and Query Optimization¶
Indexes are data structures that speed up data retrieval at the cost of additional storage and write overhead.
Index Types¶
B-Tree Index (Default)¶
Balanced tree structure, ideal for:
- Equality comparisons (=)
- Range queries (<, >, BETWEEN)
- Prefix matching (LIKE 'abc%')
- Sorting (ORDER BY)
-- Single column index
CREATE INDEX idx_users_email ON users(email);
-- Composite index (column order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at);
-- Query uses the index
SELECT * FROM orders WHERE customer_id = 123 AND created_at > '2024-01-01';
SELECT * FROM orders WHERE customer_id = 123; -- Uses index (leftmost prefix)
SELECT * FROM orders WHERE created_at > '2024-01-01'; -- May NOT use index efficiently
B-Tree Structure:
              [50]
             /    \
      [25, 30]    [75, 90]
     /   |   \    /   |   \
  [10] [26] [35] [60] [80] [95]
- Logarithmic search time: O(log n)
- Balanced: all leaf nodes at same depth
- Sorted: enables range scans
Hash Index¶
Uses hash function for O(1) lookups, but:
- Only supports equality comparisons
- Cannot be used for range queries or sorting
- Not available in all databases
-- PostgreSQL hash index
CREATE INDEX idx_users_id_hash ON users USING HASH (id);
-- Good for: WHERE id = 123
-- Bad for: WHERE id > 100, ORDER BY id
GiST (Generalized Search Tree)¶
For complex data types:
-- Full-text search
CREATE INDEX idx_articles_content ON articles USING GiST (to_tsvector('english', content));
-- Geometric data
CREATE INDEX idx_locations_point ON locations USING GiST (point);
-- Range types
CREATE INDEX idx_events_duration ON events USING GiST (tsrange(start_time, end_time));
GIN (Generalized Inverted Index)¶
For composite values (arrays, JSON, full-text):
-- Array containment
CREATE INDEX idx_posts_tags ON posts USING GIN (tags);
SELECT * FROM posts WHERE tags @> ARRAY['sql', 'databases'];
-- JSONB queries
CREATE INDEX idx_events_data ON events USING GIN (data);
SELECT * FROM events WHERE data @> '{"type": "click"}';
-- Full-text search (better for static data)
CREATE INDEX idx_articles_tsv ON articles USING GIN (to_tsvector('english', content));
BRIN (Block Range Index)¶
For large, naturally ordered data:
-- Time-series data (much smaller than B-tree)
CREATE INDEX idx_logs_timestamp ON logs USING BRIN (timestamp);
-- Ideal when: data is physically sorted by the indexed column
-- Very small index size, but less precise
Partial Index¶
Index only rows matching a condition:
-- Only index active users
CREATE INDEX idx_active_users ON users(email) WHERE active = true;
-- Only index recent orders
CREATE INDEX idx_recent_orders ON orders(customer_id)
WHERE created_at > '2024-01-01';
-- Query must include the condition to use the index
SELECT * FROM users WHERE email = 'test@example.com' AND active = true;
Expression Index¶
Index on computed expressions:
-- Case-insensitive search
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- JSON field
CREATE INDEX idx_events_type ON events((data->>'type'));
SELECT * FROM events WHERE data->>'type' = 'click';
-- Date part
CREATE INDEX idx_orders_month ON orders(DATE_TRUNC('month', created_at));
Query Optimization¶
Understanding EXPLAIN¶
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;
-- Output interpretation:
/*
Seq Scan on orders (cost=0.00..1234.00 rows=100 width=50) (actual time=0.015..12.345 rows=95 loops=1)
Filter: (customer_id = 123)
Rows Removed by Filter: 9905
Planning Time: 0.123 ms
Execution Time: 12.456 ms
*/
Key metrics:
- cost: Estimated units (startup cost..total cost)
- rows: Estimated row count
- actual time: Real execution time (ms)
- loops: Number of iterations
Scan types (best to worst for large tables):
- Index Only Scan: All data from index, no table access
- Index Scan: Use index, fetch rows from table
- Bitmap Index Scan: Build bitmap of matching rows, then fetch
- Seq Scan: Full table scan (may be OK for small tables)
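To watch the planner switch scan types, compare plans before and after adding an index — a sketch (the exact plan and costs will differ per system):

```sql
-- Without an index: the filter forces a sequential scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Seq Scan on orders
--   Filter: (customer_id = 123)

CREATE INDEX idx_orders_customer ON orders(customer_id);
ANALYZE orders;

-- With the index and fresh statistics, the planner can choose an
-- index scan or bitmap index scan instead of scanning every row
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
```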
Common Optimization Techniques¶
Use covering indexes:
-- Index includes all needed columns
CREATE INDEX idx_orders_covering ON orders(customer_id) INCLUDE (total, status);
-- Query satisfied entirely from index
SELECT customer_id, total, status FROM orders WHERE customer_id = 123;
Avoid functions on indexed columns:
-- Bad: function prevents index use
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Good: create expression index, or
SELECT * FROM users WHERE email = 'Alice@example.com'; -- if case-sensitive is OK
-- Or use CITEXT type for case-insensitive text
Use appropriate types:
-- Bad: implicit type conversion
SELECT * FROM orders WHERE id = '123'; -- id is INTEGER
-- Good: matching types
SELECT * FROM orders WHERE id = 123;
Limit result sets:
-- Use LIMIT for pagination
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 40;
-- Better: keyset pagination (for large offsets)
SELECT * FROM orders
WHERE created_at < '2024-01-15 10:30:00'
ORDER BY created_at DESC
LIMIT 20;
Optimize JOINs:
-- Ensure join columns are indexed
CREATE INDEX idx_order_items_order ON order_items(order_id);
-- Join order matters for some databases
-- Put smaller/filtered tables first (optimizer usually handles this)
Statistics and Query Planning¶
-- Update table statistics
ANALYZE orders;
ANALYZE; -- All tables
-- Check statistics
SELECT
schemaname, tablename,
n_live_tup as rows,
n_dead_tup as dead_rows,
last_vacuum, last_autovacuum,
last_analyze, last_autoanalyze
FROM pg_stat_user_tables;
-- View index usage
SELECT
indexrelname as index_name,
idx_scan as times_used,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched
FROM pg_stat_user_indexes
WHERE schemaname = 'public';
Transactions and Concurrency Control¶
Transactions group multiple operations into atomic units, ensuring data consistency even with concurrent access and failures.
ACID Properties¶
Atomicity¶
All operations in a transaction succeed or fail together:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- Both succeed, or both fail
If the system crashes after the first UPDATE but before COMMIT, the entire transaction is rolled back.
Consistency¶
Transactions maintain all database constraints:
-- Constraint: balance >= 0
ALTER TABLE accounts ADD CONSTRAINT positive_balance CHECK (balance >= 0);
BEGIN;
UPDATE accounts SET balance = balance - 1000 WHERE id = 1; -- Would make balance negative
-- The statement fails with a constraint violation and the transaction is aborted
ROLLBACK; -- Required before further commands can run in this session
Isolation¶
Concurrent transactions don't interfere with each other:
-- Transaction A -- Transaction B
BEGIN; BEGIN;
SELECT balance FROM accounts SELECT balance FROM accounts
WHERE id = 1; -- Returns 1000 WHERE id = 1; -- Returns 1000
UPDATE accounts
SET balance = balance - 100
WHERE id = 1;
UPDATE accounts
SET balance = balance + 50
WHERE id = 1;
-- Waits for A to complete (or uses MVCC)
COMMIT;
COMMIT;
-- Final balance: 950 (not 1050 or 900)
Durability¶
Committed transactions survive system failures:
- Data written to disk before COMMIT returns
- Write-Ahead Logging (WAL) ensures recovery
- Replication provides additional durability
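Durability is usually tunable. In PostgreSQL, for example, synchronous_commit controls whether COMMIT waits for the WAL flush — a sketch (weakening it trades a small window of potential data loss for lower latency):

```sql
-- Session-level: let COMMIT return before the WAL is flushed to disk
SET synchronous_commit = off;

-- Inspect the current setting
SHOW synchronous_commit;
```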
Isolation Levels¶
SQL defines four isolation levels, trading consistency for performance:
Read Uncommitted¶
- Allows dirty reads (seeing uncommitted changes)
- Highest performance, lowest consistency
- Rarely used in practice
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
Dirty Read Example:
-- Transaction A -- Transaction B
BEGIN;
UPDATE accounts SET balance = 0
WHERE id = 1;
BEGIN;
SELECT balance FROM accounts
WHERE id = 1; -- Sees 0 (uncommitted!)
ROLLBACK; -- Never committed!
-- B made decisions based on wrong data
Read Committed (PostgreSQL default)¶
- No dirty reads: only see committed changes
- Allows non-repeatable reads
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
Non-Repeatable Read Example:
-- Transaction A -- Transaction B
BEGIN;
SELECT balance FROM accounts
WHERE id = 1; -- Returns 1000
BEGIN;
UPDATE accounts SET balance = 500
WHERE id = 1;
COMMIT;
SELECT balance FROM accounts
WHERE id = 1; -- Returns 500 (different!)
COMMIT;
Repeatable Read (MySQL InnoDB default)¶
- No dirty reads
- No non-repeatable reads: same query returns same rows
- Allows phantom reads (new rows appear)
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
Phantom Read Example:
-- Transaction A -- Transaction B
BEGIN;
SELECT COUNT(*) FROM orders
WHERE customer_id = 1; -- Returns 10
BEGIN;
INSERT INTO orders (customer_id, ...)
VALUES (1, ...);
COMMIT;
SELECT COUNT(*) FROM orders
WHERE customer_id = 1; -- Returns 11 (phantom row!)
COMMIT;
Serializable¶
- Strictest isolation
- Transactions behave as if executed serially
- No anomalies, but lowest concurrency
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
In PostgreSQL, serializable uses Serializable Snapshot Isolation (SSI), which detects conflicts and aborts transactions as needed.
Isolation Levels Summary¶
| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | ✓ | ✓ | ✓ |
| Read Committed | ✗ | ✓ | ✓ |
| Repeatable Read | ✗ | ✗ | ✓ (varies) |
| Serializable | ✗ | ✗ | ✗ |
Concurrency Control Mechanisms¶
Pessimistic Locking¶
Lock resources before access:
-- Exclusive lock (for update)
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE; -- Lock row
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT; -- Release lock
-- Shared lock (for read, allows other reads)
SELECT * FROM accounts WHERE id = 1 FOR SHARE;
-- Skip locked rows (useful for job queues)
SELECT * FROM jobs WHERE status = 'pending'
FOR UPDATE SKIP LOCKED LIMIT 1;
-- No wait (fail immediately if locked)
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;
Lock Types:
- Row-level locks: Lock specific rows (most common)
- Table-level locks: Lock entire tables (for DDL, bulk operations)
- Advisory locks: Application-defined locks
-- Table locks
LOCK TABLE accounts IN EXCLUSIVE MODE;
-- Advisory locks
SELECT pg_advisory_lock(12345); -- Acquire
SELECT pg_advisory_unlock(12345); -- Release
Optimistic Locking¶
Assume no conflicts, detect at commit time:
-- Using version numbers
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(255),
price DECIMAL(10, 2),
version INTEGER DEFAULT 1
);
-- Read with version
SELECT id, name, price, version FROM products WHERE id = 1;
-- Returns: (1, 'Widget', 9.99, 5)
-- Update only if version matches
UPDATE products
SET price = 10.99, version = version + 1
WHERE id = 1 AND version = 5;
-- Check affected rows: if 0, another transaction modified it
-- Application retries with new data
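The read-compare-swap cycle above is usually wrapped in an application-level retry loop. A minimal sketch (SQLite stands in for the products table; `update_price` is a hypothetical helper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL, version INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO products (id, price) VALUES (1, 9.99)")
conn.commit()

def update_price(conn, product_id, new_price, retries=3):
    """Optimistic update: re-read and retry if another writer bumped the version."""
    for _ in range(retries):
        row = conn.execute(
            "SELECT version FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        if row is None:
            return False
        cur = conn.execute(
            "UPDATE products SET price = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_price, product_id, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:  # our version still matched: success
            return True
        # rowcount == 0: a concurrent writer won; loop re-reads the new version
    return False

assert update_price(conn, 1, 10.99)
# A writer still holding the old version loses the race (0 rows affected)
stale = conn.execute("UPDATE products SET price = 0 WHERE id = 1 AND version = 1")
assert stale.rowcount == 0
```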
Multi-Version Concurrency Control (MVCC)¶
MVCC allows readers and writers to not block each other:
How it works:
- Each row has hidden transaction IDs (xmin, xmax)
- Updates create new row versions, don't overwrite
- Readers see consistent snapshot based on their start time
- Old versions cleaned up by VACUUM
-- PostgreSQL: view row versions
SELECT xmin, xmax, * FROM accounts WHERE id = 1;
-- Transaction A (started at t=100)
BEGIN;
SELECT * FROM accounts; -- Sees snapshot as of t=100
-- Transaction B commits update at t=110
-- Transaction A still sees old data (snapshot isolation)
SELECT * FROM accounts; -- Still sees t=100 snapshot
COMMIT;
Benefits:
- Readers never block writers
- Writers rarely block readers
- Better concurrency than pure locking
Costs:
- More storage for row versions
- Need VACUUM to clean up dead rows
- More complex implementation
Deadlocks¶
Deadlocks occur when transactions wait for each other's locks:
-- Transaction A -- Transaction B
BEGIN; BEGIN;
UPDATE accounts SET balance = 100
WHERE id = 1; -- Locks row 1
UPDATE accounts SET balance = 200
WHERE id = 2; -- Locks row 2
UPDATE accounts SET balance = 300
WHERE id = 2; -- Waits for B
UPDATE accounts SET balance = 400
WHERE id = 1; -- Waits for A
-- DEADLOCK! Database detects and aborts one transaction
Prevention strategies:
- Lock ordering: Always acquire locks in consistent order
- Lock timeout: Give up after waiting too long
- Deadlock detection: Database detects and resolves
-- Set lock timeout
SET lock_timeout = '5s';
-- Access resources in consistent order
-- Always lock accounts in ascending ID order
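The lock-ordering rule applies to any locks, not just database rows. A minimal in-process sketch (hypothetical `transfer` helper): sorting the account IDs before acquiring guarantees no cyclic wait, no matter which direction the two transfers run.

```python
import threading

locks = {1: threading.Lock(), 2: threading.Lock()}
balances = {1: 1000, 2: 500}

def transfer(src, dst, amount):
    # Acquire locks in ascending account-id order, regardless of transfer
    # direction, so two concurrent transfers can never deadlock.
    first, second = sorted((src, dst))
    with locks[first]:
        with locks[second]:
            balances[src] -= amount
            balances[dst] += amount

t1 = threading.Thread(target=transfer, args=(1, 2, 100))
t2 = threading.Thread(target=transfer, args=(2, 1, 50))
t1.start(); t2.start()
t1.join(); t2.join()
assert balances == {1: 950, 2: 550}
```

Without the `sorted()` call, each thread could grab its source account's lock first and wait forever for the other, which is exactly the cycle the database's deadlock detector would have to break.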
NoSQL Databases¶
NoSQL ("Not Only SQL") databases emerged to handle data that doesn't fit well in relational models, or to achieve scalability beyond what traditional RDBMS can offer.
Document Databases¶
Store data as JSON-like documents in collections:
Examples: MongoDB, CouchDB, Amazon DocumentDB
// MongoDB document
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "Alice Johnson",
"email": "alice@example.com",
"orders": [
{
"id": "ORD-001",
"items": [
{"product": "Laptop", "price": 999.99, "quantity": 1},
{"product": "Mouse", "price": 29.99, "quantity": 2}
],
"total": 1059.97,
"status": "shipped"
}
],
"preferences": {
"newsletter": true,
"theme": "dark"
}
}
Query examples (MongoDB):
// Find documents
db.users.find({ "email": "alice@example.com" })
db.users.find({ "orders.total": { $gt: 1000 } })
db.users.find({ "preferences.theme": "dark" })
// Aggregation pipeline
db.orders.aggregate([
{ $match: { status: "shipped" } },
{ $group: {
_id: "$customer_id",
totalSpent: { $sum: "$total" },
orderCount: { $sum: 1 }
}},
{ $sort: { totalSpent: -1 } },
{ $limit: 10 }
])
// Update nested documents
db.users.updateOne(
{ "_id": ObjectId("..."), "orders.id": "ORD-001" },
{ $set: { "orders.$.status": "delivered" } }
)
When to use:
- Flexible schemas that evolve
- Hierarchical/nested data
- Rapid prototyping
- Content management
- Catalogs with varying attributes
Trade-offs:
- Limited joins (MongoDB's $lookup exists, but data is typically embedded or fetched in multiple queries)
- Less strict consistency (by default)
- Potential data duplication
Key-Value Stores¶
Simplest NoSQL model: store values by unique keys:
Examples: Redis, Amazon DynamoDB, etcd
# Redis examples
import redis
r = redis.Redis(decode_responses=True)  # return str values, as shown in the comments below
# Basic operations
r.set("user:123:name", "Alice")
r.get("user:123:name") # "Alice"
r.delete("user:123:name")
# Expiration (caching)
r.setex("session:abc123", 3600, "user_data") # Expires in 1 hour
# Data structures
r.hset("user:123", mapping={"name": "Alice", "email": "alice@example.com"})
r.hgetall("user:123") # {"name": "Alice", "email": "..."}
r.lpush("queue:jobs", "job1", "job2")
r.rpop("queue:jobs") # "job1"
r.sadd("tags:post:1", "python", "databases")
r.smembers("tags:post:1") # {"python", "databases"}
r.zadd("leaderboard", {"alice": 100, "bob": 85})
r.zrange("leaderboard", 0, 9, withscores=True) # Top 10
When to use:
- Caching (API responses, sessions)
- Real-time leaderboards
- Rate limiting
- Pub/sub messaging
- Configuration storage
Trade-offs:
- Limited query capabilities
- No complex relationships
- Data usually in memory (cost)
Column-Family Stores¶
Organize data by columns rather than rows, optimized for sparse data and analytics:
Examples: Apache Cassandra, HBase, ScyllaDB
-- Cassandra CQL
CREATE TABLE sensor_data (
sensor_id uuid,
timestamp timestamp,
temperature float,
humidity float,
pressure float,
PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- Write (extremely fast)
INSERT INTO sensor_data (sensor_id, timestamp, temperature)
VALUES (uuid(), toTimestamp(now()), 23.5); -- now() yields a timeuuid; convert for timestamp columns
-- Read recent data for a sensor
SELECT * FROM sensor_data
WHERE sensor_id = ?
AND timestamp > '2024-01-01'
LIMIT 1000;
-- Wide rows: one row per sensor, columns for each reading
-- Efficient for time-series queries on single partition
When to use:
- Time-series data (IoT, metrics)
- Write-heavy workloads
- Data with natural time ordering
- Analytics requiring column-based scans
Trade-offs:
- Limited secondary indexes
- Denormalization required (tables are modeled around queries)
- Little support for ad-hoc queries outside the partition-key design
- Tunable, often eventual, consistency
Graph Databases¶
Store and query relationships as first-class citizens:
Examples: Neo4j, Amazon Neptune, ArangoDB
// Neo4j Cypher query language
// Create nodes and relationships
CREATE (alice:Person {name: 'Alice', age: 30})
CREATE (bob:Person {name: 'Bob', age: 28})
CREATE (acme:Company {name: 'Acme Corp'})
CREATE (alice)-[:FRIENDS_WITH {since: 2020}]->(bob)
CREATE (alice)-[:WORKS_AT {role: 'Engineer'}]->(acme)
CREATE (bob)-[:WORKS_AT {role: 'Designer'}]->(acme)
// Find friends of friends
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH*2]->(fof)
WHERE fof <> alice
RETURN DISTINCT fof.name
// Shortest path
MATCH path = shortestPath(
(alice:Person {name: 'Alice'})-[*]-(bob:Person {name: 'Bob'})
)
RETURN path
// Recommendation: people who work at same company
MATCH (alice:Person {name: 'Alice'})-[:WORKS_AT]->(company)<-[:WORKS_AT]-(colleague)
WHERE colleague <> alice
AND NOT (alice)-[:FRIENDS_WITH]-(colleague)
RETURN colleague.name AS recommendation
When to use:
- Social networks
- Fraud detection
- Knowledge graphs
- Recommendation engines
- Network/infrastructure mapping
Trade-offs:
- Not optimized for bulk analytics
- Scaling can be complex
- Different query paradigm
Time-Series Databases¶
Optimized for timestamped data:
Examples: InfluxDB, TimescaleDB, Prometheus
-- TimescaleDB (PostgreSQL extension)
CREATE TABLE metrics (
time TIMESTAMPTZ NOT NULL,
device_id TEXT,
metric_name TEXT,
value DOUBLE PRECISION
);
-- Convert to hypertable (time-series optimized)
SELECT create_hypertable('metrics', 'time');
-- Insert data
INSERT INTO metrics VALUES
(NOW(), 'sensor-1', 'temperature', 23.5),
(NOW(), 'sensor-1', 'humidity', 65.2);
-- Time-series queries
SELECT
time_bucket('1 hour', time) AS hour,
device_id,
AVG(value) as avg_value,
MIN(value) as min_value,
MAX(value) as max_value
FROM metrics
WHERE time > NOW() - INTERVAL '24 hours'
AND metric_name = 'temperature'
GROUP BY hour, device_id
ORDER BY hour DESC;
-- Continuous aggregates (materialized views for time-series)
CREATE MATERIALIZED VIEW hourly_stats
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS hour,
device_id,
AVG(value) as avg_value
FROM metrics
GROUP BY hour, device_id;
Vector Databases¶
Store and search high-dimensional embeddings for AI/ML:
Examples: Pinecone, Milvus, Weaviate, pgvector
-- pgvector (PostgreSQL extension)
CREATE EXTENSION vector;
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536) -- OpenAI embedding dimension
);
-- Create index for similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
-- Insert with embedding
INSERT INTO documents (content, embedding)
VALUES ('Hello world', '[0.1, 0.2, ...]');
-- Similarity search
SELECT content, embedding <=> '[0.15, 0.25, ...]' AS distance
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ...]'
LIMIT 10;
CAP Theorem and Distributed Databases¶
CAP Theorem¶
In a distributed system, you can only guarantee two of three properties:
- Consistency: All nodes see the same data at the same time
- Availability: Every request receives a response (success/failure)
- Partition Tolerance: System operates despite network failures
Since network partitions are inevitable, the real choice is between:
-
CP (Consistency + Partition Tolerance): Sacrifice availability during partitions. System may refuse requests to maintain consistency.
- Examples: Traditional RDBMS, HBase, MongoDB (with majority write concern)
-
AP (Availability + Partition Tolerance): Sacrifice consistency during partitions. System stays available but may return stale data.
- Examples: Cassandra, DynamoDB, CouchDB
PACELC Theorem¶
Extension of CAP: "If there's a Partition, choose Availability or Consistency; Else, when running normally, choose Latency or Consistency."
- PA/EL: Sacrifice consistency for availability/latency (Cassandra, DynamoDB)
- PC/EC: Prioritize consistency always (Traditional RDBMS)
- PA/EC: Available during partition, consistent otherwise (MongoDB default)
Consistency Models¶
Strong Consistency¶
All reads return the most recent write:
Write(x=1) → Read(x) = 1 (guaranteed)
- Achieved via: Single leader, synchronous replication, consensus protocols
- Trade-off: Higher latency, lower availability
Eventual Consistency¶
If no new updates, eventually all reads return the last updated value:
Write(x=1) → Read(x) = 0 (possible, old value)
→ [time passes]
→ Read(x) = 1 (eventually)
- Achieved via: Asynchronous replication, gossip protocols
- Trade-off: May read stale data
Causal Consistency¶
If operation A causally affects B, everyone sees A before B:
A: Write(x=1)
B: Read(x) returns 1, then Write(y=2)
C: If C sees y=2, C must also see x=1
Read-Your-Writes Consistency¶
Users always see their own updates:
User writes x=1 → User reads x (always returns 1 or newer)
Other user reads x (may return old value)
Consensus Algorithms¶
Distributed systems use consensus algorithms to agree on values:
Paxos¶
Classic consensus algorithm, complex but theoretically sound.
Raft¶
More understandable alternative to Paxos:
- Leader Election: One node becomes leader
- Log Replication: Leader replicates log entries to followers
- Safety: Only nodes with up-to-date logs can become leader
Used by: etcd, Consul, CockroachDB
BASE vs ACID¶
ACID (Traditional RDBMS):
- Atomicity, Consistency, Isolation, Durability
- Strong guarantees, lower availability
BASE (Many NoSQL systems):
- Basically Available: System remains available
- Soft state: State may change over time even without input (e.g., as replicas converge)
- Eventually consistent: Will become consistent eventually
Database Internals¶
Understanding how databases work internally helps with performance tuning and debugging.
Storage Engine Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ Query Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Parser │→│ Optimizer │→│ Executor │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Storage Engine │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Buffer │ │ Transaction │ │ Index │ │
│ │ Manager │ │ Manager │ │ Manager │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Disk Manager ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ ││
│ │ │ Data │ │ Index │ │ Write-Ahead Log │ ││
│ │ │ Files │ │ Files │ │ (WAL) │ ││
│ │ └──────────┘ └──────────┘ └──────────────────────┘ ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
Query Processing Pipeline¶
1. Parsing¶
Converts SQL text into an Abstract Syntax Tree (AST):
SELECT name, salary FROM employees WHERE dept = 'Engineering'
Becomes:
SelectStatement
├── columns: [name, salary]
├── from: employees
└── where: dept = 'Engineering'
2. Semantic Analysis¶
- Resolves table and column names
- Checks permissions
- Validates types
3. Query Optimization¶
The optimizer finds the most efficient execution plan:
Cost-based optimization:
- Estimates cost of different plans
- Uses statistics (row counts, value distributions)
- Considers available indexes
Optimization techniques:
- Predicate pushdown: Filter early to reduce data
- Join reordering: Choose optimal join sequence
- Index selection: Decide which indexes to use
- Subquery flattening: Convert subqueries to joins
-- Original query
SELECT * FROM orders o
WHERE o.customer_id IN (
SELECT id FROM customers WHERE country = 'USA'
);
-- Optimized (flattened)
SELECT o.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';
4. Execution¶
Executes the physical plan using various operators:
- Table Scan: Read all rows from table
- Index Scan: Read rows via index
- Nested Loop Join: For each outer row, scan inner table
- Hash Join: Build hash table, probe with other table
- Merge Join: Sort both tables, merge sorted results
- Sort: Order rows
- Aggregate: Compute SUM, COUNT, etc.
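The hash join in the list above is simple enough to sketch end to end: build a hash table on one input, then probe it with the other. A toy version over dict rows (real executors work on pages and spill oversized hash tables to disk):

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join in O(N + M), versus the nested loop's O(N * M)."""
    table = {}
    for row in build_rows:  # build phase: hash the (usually smaller) input
        table.setdefault(row[build_key], []).append(row)
    joined = []
    for row in probe_rows:  # probe phase: look up matches in the hash table
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"customer_id": 1, "total": 50}, {"customer_id": 1, "total": 75}]
joined = hash_join(customers, orders, "id", "customer_id")
assert [r["name"] for r in joined] == ["Alice", "Alice"]
```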
Buffer Management¶
The buffer pool caches disk pages in memory:
┌─────────────────────────────────────────┐
│ Buffer Pool (RAM) │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │Page1│ │Page2│ │Page3│ │Page4│ ... │
│ └─────┘ └─────┘ └─────┘ └─────┘ │
│ ↑ ↑ │
│ │ frequently accessed │ dirty │
└────│─────────────────────────│──────────┘
│ │
↓ ↓
┌─────────────────────────────────────────┐
│ Disk Storage │
│ ┌─────────────────────────────────┐ │
│ │ Data Files (pages on disk) │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
Page replacement policies:
- LRU (Least Recently Used): Evict oldest accessed page
- Clock: Approximation of LRU with lower overhead
- LRU-K: Consider K most recent accesses
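An LRU buffer pool fits in a few lines using an ordered map (hypothetical `BufferPool`; real pools also track pin counts and dirty flags):

```python
from collections import OrderedDict

class BufferPool:
    """Tiny LRU page cache: recently used pages stay, the oldest is evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page data, oldest first

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # cache hit: mark recently used
            return self.pages[page_id]
        data = read_from_disk(page_id)           # cache miss: fetch from disk
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        return data

reads = []
pool = BufferPool(capacity=2)
fetch = lambda pid: reads.append(pid) or f"page-{pid}"
pool.get(1, fetch); pool.get(2, fetch); pool.get(1, fetch)  # page 1 now MRU
pool.get(3, fetch)   # evicts page 2, the least recently used
pool.get(1, fetch)   # still cached: no disk read
assert reads == [1, 2, 3]
```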
Dirty pages:
- Modified pages marked "dirty"
- Written back to disk by background writer
- Or during checkpoint
Write-Ahead Logging (WAL)¶
WAL ensures durability and crash recovery:
1. Write change to WAL (on disk)
2. Modify page in buffer pool (in memory)
3. Acknowledge transaction commit
4. Later: Write dirty pages to data files (checkpoint)
Recovery process:
- On crash: Data files may be inconsistent
- Read WAL from last checkpoint
- Redo committed transactions
- Undo uncommitted transactions
-- PostgreSQL: WAL configuration
SHOW wal_level; -- minimal, replica, logical
SHOW max_wal_size; -- Maximum WAL size before checkpoint
SHOW checkpoint_timeout; -- Time between checkpoints
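The commit protocol and recovery loop above can be sketched end to end: append and fsync a log record before touching the in-memory state, then rebuild that state by replaying the log after a "crash" (hypothetical `MiniWAL`; real WALs add checkpoints, undo records, and LSNs):

```python
import json
import os
import tempfile

class MiniWAL:
    """Minimal redo-only WAL: a change is durable once its log record
    is flushed, even if the in-memory table is lost in a crash."""
    def __init__(self, path):
        self.path = path
        self.table = {}

    def set(self, key, value):
        with open(self.path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())   # step 1: the record hits disk first
        self.table[key] = value      # step 2: only then mutate in memory

    @classmethod
    def recover(cls, path):
        wal = cls(path)
        with open(path) as log:      # redo: replay records in commit order
            for line in log:
                rec = json.loads(line)
                wal.table[rec["key"]] = rec["value"]
        return wal

path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = MiniWAL(path)
db.set("x", 1)
db.set("x", 2)
recovered = MiniWAL.recover(path)   # simulate a crash: rebuild from the log
assert recovered.table == {"x": 2}
```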
Page Structure¶
Data is stored in fixed-size pages (typically 8KB):
┌─────────────────────────────────────────────────────┐
│ Page Header │
│ ┌─────────────────────────────────────────────────┐│
│ │ LSN │ Checksum │ Flags │ Free Space │ ... ││
│ └─────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────┤
│ Item Pointers (line pointers) │
│ ┌───┐┌───┐┌───┐┌───┐ │
│ │ 1 ││ 2 ││ 3 ││ 4 │→ point to tuple offsets │
│ └───┘└───┘└───┘└───┘ │
├─────────────────────────────────────────────────────┤
│ │
│ Free Space │
│ │
├─────────────────────────────────────────────────────┤
│ Tuples (row data, grows from bottom) │
│ ┌─────────────────────────────────────────────────┐│
│ │ Tuple 4 │ Tuple 3 │ Tuple 2 │ Tuple 1 │ ││
│ └─────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
Heap vs Index-Organized Tables¶
Heap Table (PostgreSQL, default):
- Rows stored in insertion order
- Index points to physical location (ctid)
- Random I/O for index lookups
Index-Organized Table / Clustered Index (MySQL InnoDB, SQL Server):
- Rows stored in primary key order
- Primary key index contains row data
- Secondary indexes point to primary key (not physical location)
-- PostgreSQL: Cluster table by index (one-time reorganization)
CLUSTER employees USING idx_employees_dept;
-- MySQL InnoDB: Table is always clustered by PRIMARY KEY
CREATE TABLE employees (
id INT PRIMARY KEY, -- This IS the clustered index
name VARCHAR(100),
INDEX idx_name (name) -- Secondary index points to id
);
Sharding, Replication, and Partitioning¶
Sharding and replication in distributed databases interact with consistency models, CAP, and quorum; for the distributed-systems perspective (consensus, replication strategies, CRDTs), see Distributed Systems.
Partitioning (Single Database)¶
Divide a table into smaller pieces within one database:
Range Partitioning¶
-- PostgreSQL native partitioning
CREATE TABLE orders (
id SERIAL,
order_date DATE,
customer_id INTEGER,
total DECIMAL(10, 2)
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Queries automatically use correct partition
SELECT * FROM orders WHERE order_date = '2024-06-15';
-- Only scans orders_2024
List Partitioning¶
CREATE TABLE customers (
id SERIAL,
name VARCHAR(100),
country VARCHAR(2)
) PARTITION BY LIST (country);
CREATE TABLE customers_us PARTITION OF customers
FOR VALUES IN ('US');
CREATE TABLE customers_eu PARTITION OF customers
FOR VALUES IN ('DE', 'FR', 'IT', 'ES', 'GB');
CREATE TABLE customers_other PARTITION OF customers
DEFAULT;
Hash Partitioning¶
CREATE TABLE events (
id SERIAL,
user_id INTEGER,
event_type VARCHAR(50),
data JSONB
) PARTITION BY HASH (user_id);
CREATE TABLE events_p0 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Sharding (Multiple Databases)¶
Distribute data across multiple database instances:
┌─────────────────────────────────────────────────────────────┐
│ Application │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Shard Router / Proxy │ │
│ │ (determines which shard to query) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ │ │
┌─────────┘ │ └─────────┐
↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │
│users 1-N │ │users N+1 │ │users 2N+ │
│ │ │ to 2N │ │ │
└──────────┘ └──────────┘ └──────────┘
Sharding strategies:
-
Range-based: Users 1-1M on shard 1, 1M-2M on shard 2
- Pro: Easy range queries
- Con: Hotspots (new users always hit last shard)
-
Hash-based: shard_id = hash(user_id) % num_shards
- Pro: Even distribution
- Con: Range queries require all shards
-
Directory-based: Lookup table maps keys to shards
- Pro: Flexible
- Con: Lookup table is a bottleneck
-
Geographic: Shard by region
- Pro: Data locality
- Con: Uneven load
Challenges:
- Cross-shard queries (joins, aggregations)
- Distributed transactions
- Resharding when adding/removing shards
- Maintaining consistency
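A hash-based router from the strategies above fits in a few lines (hypothetical `shard_for`; md5 rather than Python's built-in `hash()` keeps routing stable across processes and restarts). It also shows the resharding problem: changing the shard count remaps most keys, which is why production systems often use consistent hashing instead.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id: int) -> str:
    """Route a key to a shard: stable hash of the key, modulo shard count."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard
assert shard_for(42) == shard_for(42)
# Keys spread roughly evenly: 1000 users land on all shards
assert len({shard_for(i) for i in range(1000)}) == len(SHARDS)
```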
Replication¶
Copy data across multiple nodes for availability and read scaling:
Single-Leader (Primary-Replica)¶
┌─────────────┐
│ Primary │←── All writes
│ (Leader) │
└─────────────┘
│
│ Replication (sync or async)
↓
┌─────────────┐ ┌─────────────┐
│ Replica 1 │ │ Replica 2 │←── Reads distributed
│ (Follower) │ │ (Follower) │
└─────────────┘ └─────────────┘
Synchronous replication:
- Write waits for replica confirmation
- Strong consistency
- Higher latency, lower throughput
Asynchronous replication:
- Write returns immediately
- Potential data loss on primary failure
- Lower latency, higher throughput
-- PostgreSQL streaming replication (async)
-- On primary:
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 3;
-- Create replication user
CREATE USER replicator REPLICATION LOGIN PASSWORD 'secret';
-- On replica:
-- pg_basebackup -h primary -U replicator -D /var/lib/postgresql/data -P
Multi-Leader¶
Multiple nodes accept writes, sync with each other:
┌─────────────┐ ┌─────────────┐
│ Leader 1 │←────→│ Leader 2 │
│ (Region A) │ │ (Region B) │
└─────────────┘ └─────────────┘
↑ ↑
│ │
Writes A Writes B
Conflict resolution:
- Last-write-wins (timestamp-based)
- Custom merge logic
- User intervention
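Last-write-wins is the simplest resolver to sketch: tag each replica's write with a timestamp and keep the newer one, breaking ties deterministically (hypothetical `lww_merge`; note that LWW silently discards one of two concurrent writes, which may be unacceptable for your data):

```python
def lww_merge(a, b):
    """Last-write-wins: each write is a (timestamp, node_id, value) tuple.
    Tuple comparison keeps the newer timestamp; node_id breaks exact ties."""
    return max(a, b)

replica_a = (110, "region-a", {"name": "Alice J."})
replica_b = (105, "region-b", {"name": "Alice"})
winner = lww_merge(replica_a, replica_b)
assert winner[2] == {"name": "Alice J."}   # region-b's concurrent write is lost
```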
Leaderless¶
Any node accepts reads/writes:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
└─────────┘ └─────────┘ └─────────┘
↑ ↑ ↑
└────────────┼────────────┘
│
Client (writes to multiple)
Quorum reads/writes:
- N = total nodes, W = write quorum, R = read quorum
- If W + R > N, reads see latest write
- Example: N=3, W=2, R=2 (tolerates 1 failure)
Write to nodes 1, 2 (W=2) ✓
Read from nodes 2, 3 (R=2) → at least one has latest
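The overlap guarantee can be checked directly: if W + R > N, every possible read quorum shares at least one node with every possible write quorum, so some node in the read set holds the latest write. A small sketch:

```python
from itertools import combinations

def quorum_overlaps(n, w, r):
    """W + R > N guarantees read and write quorums always intersect."""
    return w + r > n

assert quorum_overlaps(3, 2, 2)       # classic setup: tolerates 1 node failure
assert not quorum_overlaps(3, 1, 1)   # fast, but reads may be stale

# Verify by construction for N=3, W=2, R=2: every 2-node write set
# intersects every 2-node read set.
nodes = {1, 2, 3}
assert all(set(ws) & set(rs)
           for ws in combinations(nodes, 2)
           for rs in combinations(nodes, 2))
```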
Object-Relational Mappers (ORMs)¶
ORMs bridge the gap between object-oriented code and relational databases.
SQLAlchemy (Python)¶
The most popular Python ORM:
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, DateTime
from sqlalchemy.orm import declarative_base, relationship, sessionmaker
from datetime import datetime
# Setup
engine = create_engine('postgresql://user:pass@localhost/db', echo=True)
Base = declarative_base()
Session = sessionmaker(bind=engine)
# Model definitions
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
name = Column(String(100), nullable=False)
email = Column(String(255), unique=True, nullable=False)
created_at = Column(DateTime, default=datetime.utcnow)
# Relationships
orders = relationship('Order', back_populates='user', lazy='select')
def __repr__(self):
return f"<User(name='{self.name}', email='{self.email}')>"
class Order(Base):
__tablename__ = 'orders'
id = Column(Integer, primary_key=True)
user_id = Column(Integer, ForeignKey('users.id'), nullable=False)
total = Column(Integer) # cents
status = Column(String(20), default='pending')
created_at = Column(DateTime, default=datetime.utcnow)
user = relationship('User', back_populates='orders')
items = relationship('OrderItem', back_populates='order', cascade='all, delete-orphan')
# Create tables
Base.metadata.create_all(engine)
# Using the ORM
session = Session()
# Create
user = User(name='Alice', email='alice@example.com')
session.add(user)
session.commit()
# Read
user = session.query(User).filter_by(email='alice@example.com').first()
users = session.query(User).filter(User.name.like('%ali%')).all()
# Update
user.name = 'Alice Johnson'
session.commit()
# Delete
session.delete(user)
session.commit()
# Complex queries
from sqlalchemy import func
# Join
results = session.query(User, Order).join(Order).filter(Order.total > 10000).all()
# Aggregation
stats = session.query(
User.id,
User.name,
func.count(Order.id).label('order_count'),
func.sum(Order.total).label('total_spent')
).outerjoin(Order).group_by(User.id).all()
# Eager loading (prevent N+1)
from sqlalchemy.orm import joinedload, selectinload
users = session.query(User).options(
joinedload(User.orders).joinedload(Order.items)
).all()
# Subqueries
from sqlalchemy import select
# Correlated scalar subquery: each user's largest order total
max_order = select(func.max(Order.total)).where(Order.user_id == User.id).scalar_subquery()
user_max_totals = session.query(User.name, max_order.label('max_order')).all()
# IN subquery: users with at least one large order
rich_users = session.query(User).filter(User.id.in_(
    select(Order.user_id).where(Order.total > 10000)
)).all()
Query Builders vs ORMs¶
Query Builder (more SQL-like):
# SQLAlchemy Core
from sqlalchemy import select, insert, update, delete
stmt = select(users.c.name, users.c.email).where(users.c.id > 10)
result = connection.execute(stmt)
stmt = insert(users).values(name='Bob', email='bob@example.com')
connection.execute(stmt)
When to use what:
| Use Case | Recommendation |
|---|---|
| Simple CRUD with objects | ORM |
| Complex queries, reports | Query builder or raw SQL |
| Bulk operations | Raw SQL or COPY |
| Need full SQL control | Query builder or raw SQL |
| Rapid prototyping | ORM |
ORM Best Practices¶
1. Understand lazy loading:
# N+1 problem
users = session.query(User).all()
for user in users:
print(user.orders) # Each triggers a query!
# Solution: eager loading
users = session.query(User).options(selectinload(User.orders)).all()
2. Use transactions explicitly:
try:
session.begin()
# operations
session.commit()
except Exception:
session.rollback()
raise
finally:
session.close()
# Or use context manager
with Session() as session:
with session.begin():
# operations auto-commit on success, rollback on exception
3. Batch operations:
# Slow: one query per insert
for item in items:
session.add(Item(**item))
session.commit()
# Fast: bulk insert
session.bulk_insert_mappings(Item, items)
session.commit()
# Fastest: raw COPY (PostgreSQL)
from io import StringIO
import csv
buffer = StringIO()
writer = csv.writer(buffer)
writer.writerows([(i['name'], i['value']) for i in items])
buffer.seek(0)
conn = engine.raw_connection()  # raw DBAPI connection (not a context manager)
try:
    with conn.cursor() as cur:
        cur.copy_from(buffer, 'items', columns=('name', 'value'), sep=',')
    conn.commit()
finally:
    conn.close()
Database Security¶
Authentication and Authorization¶
-- Create users with limited privileges
CREATE USER app_user WITH PASSWORD 'secure_password';
CREATE USER readonly_user WITH PASSWORD 'another_password';
-- Grant specific permissions
GRANT CONNECT ON DATABASE myapp TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_user;
-- Read-only access
GRANT CONNECT ON DATABASE myapp TO readonly_user;
GRANT USAGE ON SCHEMA public TO readonly_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly_user;
-- Row-level security (PostgreSQL)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY orders_user_policy ON orders
FOR ALL
TO app_user
USING (user_id = current_setting('app.current_user_id')::INTEGER);
-- Set context before queries
SET app.current_user_id = '123';
SELECT * FROM orders; -- Only sees orders for user 123
SQL Injection Prevention¶
Vulnerable code:
# NEVER DO THIS
query = f"SELECT * FROM users WHERE email = '{user_input}'"
# Input: "'; DROP TABLE users; --"
# Results in: SELECT * FROM users WHERE email = ''; DROP TABLE users; --'
Safe code (parameterized queries):
# SQLAlchemy ORM (automatically parameterized)
user = session.query(User).filter_by(email=user_input).first()
# Raw SQL with parameters
result = connection.execute(
text("SELECT * FROM users WHERE email = :email"),
{"email": user_input}
)
# psycopg2 with parameters
cursor.execute("SELECT * FROM users WHERE email = %s", (user_input,))
Encryption¶
At rest:
-- PostgreSQL: Encrypt column data
CREATE EXTENSION pgcrypto;
-- Store encrypted
INSERT INTO users (email, ssn_encrypted)
VALUES ('alice@example.com', pgp_sym_encrypt('123-45-6789', 'encryption_key'));
-- Read decrypted
SELECT email, pgp_sym_decrypt(ssn_encrypted, 'encryption_key') as ssn
FROM users;
In transit:
# PostgreSQL: Require SSL
# postgresql.conf
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'
# pg_hba.conf: Require SSL for connections
hostssl all all 0.0.0.0/0 scram-sha-256
Connection string:
engine = create_engine(
'postgresql://user:pass@host/db?sslmode=require'
)
Audit Logging¶
-- PostgreSQL: Audit trigger
CREATE TABLE audit_log (
id SERIAL PRIMARY KEY,
table_name TEXT,
operation TEXT,
old_data JSONB,
new_data JSONB,
changed_by TEXT,
changed_at TIMESTAMP DEFAULT NOW()
);
CREATE OR REPLACE FUNCTION audit_trigger()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'DELETE' THEN
INSERT INTO audit_log (table_name, operation, old_data, changed_by)
VALUES (TG_TABLE_NAME, 'DELETE', to_jsonb(OLD), current_user);
RETURN OLD;
ELSIF TG_OP = 'UPDATE' THEN
INSERT INTO audit_log (table_name, operation, old_data, new_data, changed_by)
VALUES (TG_TABLE_NAME, 'UPDATE', to_jsonb(OLD), to_jsonb(NEW), current_user);
RETURN NEW;
ELSIF TG_OP = 'INSERT' THEN
INSERT INTO audit_log (table_name, operation, new_data, changed_by)
VALUES (TG_TABLE_NAME, 'INSERT', to_jsonb(NEW), current_user);
RETURN NEW;
END IF;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER users_audit
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE FUNCTION audit_trigger();
Performance Tuning¶
Identifying Slow Queries¶
-- PostgreSQL: Enable query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1 second
SELECT pg_reload_conf();
-- View slow queries (requires the pg_stat_statements extension)
SELECT
query,
calls,
total_exec_time / 1000 as total_seconds,
mean_exec_time as avg_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Active queries
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
Connection Pooling¶
Database connections are expensive. Use a pool:
# SQLAlchemy connection pooling
engine = create_engine(
'postgresql://user:pass@localhost/db',
pool_size=20, # Maintained connections
max_overflow=10, # Extra connections when busy
pool_timeout=30, # Wait time for connection
pool_recycle=1800, # Recycle connections after 30 min
pool_pre_ping=True # Verify connection health
)
External poolers:
- PgBouncer: Lightweight connection pooler for PostgreSQL
- pgpool-II: Connection pool + load balancing + replication
# pgbouncer.ini
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; pool_mode: transaction, session, or statement
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
Caching Strategies¶
Query cache:
import hashlib
import json
from functools import wraps

import redis

redis_client = redis.Redis()

def cache_query(ttl=300):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Build a key that is stable across processes and restarts.
            # Note: built-in hash() is randomized per interpreter for strings,
            # so it would produce different keys on every restart.
            key_material = repr((args, sorted(kwargs.items())))
            digest = hashlib.sha256(key_material.encode()).hexdigest()
            cache_key = f"{func.__name__}:{digest}"
            # Check cache
            cached = redis_client.get(cache_key)
            if cached is not None:
                return json.loads(cached)
            # Execute query
            result = func(*args, **kwargs)
            # Cache result (must be JSON-serializable)
            redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

@cache_query(ttl=60)
def get_user_orders(user_id):
    # Return plain dicts, not ORM objects, so the result can be serialized
    orders = session.query(Order).filter_by(user_id=user_id).all()
    return [{'id': o.id, 'total': float(o.total)} for o in orders]
Materialized views:
-- Pre-compute expensive aggregations
CREATE MATERIALIZED VIEW user_statistics AS
SELECT
u.id,
u.name,
COUNT(o.id) as order_count,
SUM(o.total) as total_spent,
MAX(o.created_at) as last_order
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;
CREATE UNIQUE INDEX ON user_statistics (id);
-- Refresh periodically (CONCURRENTLY requires the unique index above)
REFRESH MATERIALIZED VIEW CONCURRENTLY user_statistics;
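In application terms, the view above precomputes the equivalent of this aggregation. A sketch over plain dicts standing in for the `users` and `orders` rows (note that, like SQL's `SUM` over zero rows, `total_spent` is `None` for users with no orders, while `COUNT` yields 0):

```python
from collections import defaultdict

def compute_user_statistics(users, orders):
    """Aggregate orders per user, mirroring the user_statistics view."""
    by_user = defaultdict(list)
    for o in orders:
        by_user[o['user_id']].append(o)
    stats = []
    for u in users:
        rows = by_user.get(u['id'], [])   # LEFT JOIN: users with no orders kept
        stats.append({
            'id': u['id'],
            'name': u['name'],
            'order_count': len(rows),
            'total_spent': sum(o['total'] for o in rows) if rows else None,
            'last_order': max((o['created_at'] for o in rows), default=None),
        })
    return stats
```

The point of the materialized view is that this whole loop runs once at refresh time instead of on every read.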
Vacuuming (PostgreSQL)¶
Under MVCC, every UPDATE and DELETE leaves behind dead row versions that must be cleaned up:
-- Manual vacuum
VACUUM orders; -- Reclaim space, update visibility
VACUUM FULL orders; -- Reclaim space, rewrite table (locks!)
VACUUM ANALYZE orders; -- Also update statistics
-- Monitor bloat
SELECT
schemaname, tablename,
n_dead_tup as dead_tuples,
n_live_tup as live_tuples,
round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Autovacuum settings
ALTER TABLE high_churn_table SET (
autovacuum_vacuum_scale_factor = 0.05, -- Vacuum when 5% dead (default 20%)
autovacuum_analyze_scale_factor = 0.02
);
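The scale factor feeds PostgreSQL's documented autovacuum trigger formula: vacuum runs once dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`. A quick sketch of the arithmetic, using the documented defaults (threshold 50, scale factor 0.2):

```python
def autovacuum_trigger_point(reltuples, scale_factor=0.2, base_threshold=50):
    """Dead-tuple count at which autovacuum kicks in for a table."""
    return base_threshold + scale_factor * reltuples

# Defaults: a 1M-row table accumulates ~200,050 dead tuples before autovacuum runs
default = autovacuum_trigger_point(1_000_000)
# With the 0.05 setting above, it runs after ~50,050 instead
tuned = autovacuum_trigger_point(1_000_000, scale_factor=0.05)
```

This is why high-churn tables benefit from a lower scale factor: with the default 20%, a large table can carry enormous bloat before autovacuum ever triggers.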
Backup, Recovery, and High Availability¶
Backup Strategies¶
Logical Backups¶
Export data as SQL or another portable format:
# PostgreSQL pg_dump
pg_dump -h localhost -U user -d mydb > backup.sql
pg_dump -Fc mydb > backup.dump # Custom format (compressed)
pg_dump -Fd mydb -j 4 -f backup_dir # Directory format (parallel)
# Restore
psql -d mydb < backup.sql
pg_restore -d mydb backup.dump
pg_restore -d mydb -j 4 backup_dir # Parallel restore
# MySQL mysqldump
mysqldump -u user -p mydb > backup.sql
mysqldump --single-transaction mydb > backup.sql # Consistent snapshot
Pros:
- Portable between versions
- Selective backup (tables, schemas)
- Human-readable (SQL format)
Cons:
- Slow for large databases
- Restore takes time
Physical Backups¶
Copy data files directly:
# PostgreSQL pg_basebackup
pg_basebackup -h localhost -U replicator -D /backup/base -Fp -Xs -P
# With compression (tar to stdout is incompatible with WAL streaming, so use -X fetch)
pg_basebackup -h localhost -U replicator -D - -Ft -X fetch | gzip > backup.tar.gz
Pros:
- Fast backup and restore
- Includes all data
Cons:
- Version-specific
- Large size
Point-in-Time Recovery (PITR)¶
Restore the database to any point in time by replaying archived WAL:
# Setup: Enable WAL archiving
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /archive/%f'
# Recovery: restore the base backup, then replay WAL
# Create an empty recovery.signal file in the data directory (PostgreSQL 12+)
# and set in postgresql.conf:
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
High Availability Architectures¶
Primary-Standby with Automatic Failover¶
┌─────────────┐           ┌─────────────┐
│   Primary   │──WAL────→│   Standby   │
└─────────────┘           └─────────────┘
       ↑                         ↑
       │                         │
       └────────────┬────────────┘
                    │
            ┌──────────────┐
            │   Failover   │  (Patroni, repmgr, pg_auto_failover)
            │   Manager    │
            └──────────────┘
Tools:
- Patroni: Template for high-availability PostgreSQL
- repmgr: Replication manager for PostgreSQL
- pg_auto_failover: Automated failover for PostgreSQL
Load Balancing¶
              ┌─────────────┐
              │  HAProxy /  │
              │  PgBouncer  │
              └─────────────┘
                     │
      ┌──────────────┼──────────────┐
      ↓              ↓              ↓
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Primary  │   │ Replica1 │   │ Replica2 │
│ (writes) │   │ (reads)  │   │ (reads)  │
└──────────┘   └──────────┘   └──────────┘
# HAProxy configuration
frontend postgres
    mode tcp
    bind *:5432
    default_backend postgres_primary

backend postgres_primary
    mode tcp
    option pgsql-check user haproxy
    server primary 10.0.0.1:5432 check

backend postgres_replicas
    mode tcp
    balance roundrobin
    option pgsql-check user haproxy
    server replica1 10.0.0.2:5432 check
    server replica2 10.0.0.3:5432 check
Disaster Recovery¶
Recovery objectives:
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Daily backups | Hours | 24 hours | Low |
| Continuous WAL archiving | Minutes | Minutes | Medium |
| Synchronous standby | Seconds | Zero | High |
| Multi-region active-active | Near-zero | Near-zero | Very high |
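As a rough worked example of how these objectives trade off, assume backups run every `interval` seconds and a restore streams the backup back at a fixed throughput (illustrative numbers, not benchmarks):

```python
def worst_case_rpo_seconds(backup_interval_s):
    """Worst case: failure just before the next backup loses a full interval."""
    return backup_interval_s

def estimated_rto_seconds(backup_size_gb, restore_gb_per_min):
    """Restore time is dominated by streaming the backup back in."""
    return backup_size_gb / restore_gb_per_min * 60

# Daily backups of a 500 GB database restored at 2 GB/min:
rpo = worst_case_rpo_seconds(24 * 3600)   # up to 86,400 s (24 h) of lost writes
rto = estimated_rto_seconds(500, 2)       # 15,000 s (~4.2 h) to restore
```

Numbers like these are what justify the cost column: moving from daily backups to continuous WAL archiving shrinks the RPO from a day to minutes without changing the backup size at all.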
Backup retention policy:
# Example: Grandfather-father-son rotation
# Daily: Keep 7 days
# Weekly: Keep 4 weeks
# Monthly: Keep 12 months
# Yearly: Keep 7 years
find /backups/daily -mtime +7 -delete
find /backups/weekly -mtime +28 -delete
# etc.
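The rotation above reduces to a small retention check per tier; a sketch assuming backup ages are given in days (the `find -mtime +N -delete` commands implement the same cutoffs):

```python
def keep_backup(tier, age_days):
    """Return True if a backup in the given tier is still within retention."""
    limits = {
        'daily': 7,         # keep 7 days
        'weekly': 28,       # keep 4 weeks
        'monthly': 365,     # keep 12 months
        'yearly': 7 * 365,  # keep 7 years
    }
    return age_days <= limits[tier]
```

Grandfather-father-son works because each tier's retention window is longer than the next tier's backup interval, so there is never a gap in coverage as older backups expire.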
Summary and Best Practices¶
Choosing a Database¶
| Requirement | Recommendation |
|---|---|
| Strong consistency, complex queries | PostgreSQL, MySQL |
| Flexible schema, rapid development | MongoDB |
| High write throughput, time-series | Cassandra, TimescaleDB |
| Caching, sessions, real-time | Redis |
| Graph relationships | Neo4j |
| Full-text search | Elasticsearch, PostgreSQL FTS |
| AI/ML similarity search | pgvector, Pinecone |
General Best Practices¶
- Design schemas thoughtfully: Normalize appropriately, denormalize for performance where needed.
- Use appropriate indexes: Index frequently queried columns, but don't over-index.
- Write efficient queries: Use EXPLAIN, avoid N+1 queries, prefer joins over subqueries.
- Manage connections: Use connection pooling, close connections properly.
- Secure your data: Use parameterized queries, encrypt sensitive data, implement least-privilege access.
- Plan for scale: Consider partitioning, replication, and caching strategies early.
- Back up regularly: Test your backups by actually restoring them.
- Monitor performance: Track slow queries, connection counts, disk usage.
- Keep statistics updated: Run ANALYZE regularly for optimal query plans.
- Use transactions appropriately: Understand isolation levels and their trade-offs.
Performance Checklist¶
- [ ] Queries using indexes (check with EXPLAIN)
- [ ] No N+1 queries (use eager loading)
- [ ] Connection pooling configured
- [ ] Query cache for frequently accessed data
- [ ] Appropriate indexes on foreign keys
- [ ] Statistics up to date (ANALYZE)
- [ ] Dead rows cleaned up (VACUUM)
- [ ] Slow query logging enabled
- [ ] Monitoring dashboards in place