Databases¶
Databases are the cornerstone of modern software engineering, providing the fundamental infrastructure for persistent data storage, retrieval, and management. Understanding databases deeply—from theoretical foundations to practical implementation details—is essential for building scalable, reliable, and efficient applications. This guide covers relational and non-relational databases, their internal architectures, query optimization, transactions, distributed systems concepts, and best practices for real-world usage.
Table of Contents¶
- Fundamentals of Data Storage
- The Relational Model
- SQL: Structured Query Language
- Database Design and Normalization
- Indexing and Query Optimization
- Transactions and Concurrency Control
- NoSQL Databases
- CAP Theorem and Distributed Databases
- Database Internals
- Sharding, Replication, and Partitioning
- Object-Relational Mappers (ORMs)
- Database Security
- Performance Tuning
- Backup, Recovery, and High Availability
Fundamentals of Data Storage¶
Why Databases Exist¶
Before databases, applications stored data in flat files—simple text or binary files on disk. While this works for trivial cases, it quickly becomes problematic:
- Concurrent Access: Multiple processes reading/writing the same file leads to data corruption.
- Data Integrity: No built-in way to enforce rules (e.g., "email must be unique").
- Query Efficiency: Finding specific records requires scanning entire files.
- Data Relationships: Representing connections between entities is cumbersome.
- Crash Recovery: A power failure mid-write can corrupt the entire file.
Databases solve these problems by providing:
- Structured Storage: Organized data models (tables, documents, graphs).
- Query Languages: Declarative ways to retrieve and manipulate data.
- ACID Guarantees: Reliable transactions even during failures.
- Concurrency Control: Safe multi-user access without corruption.
- Indexing: Fast lookups without full scans.
- Security: Authentication, authorization, and encryption.
Data Models Overview¶
Different applications have different data requirements. The major data models are:
| Model | Structure | Examples | Best For |
|---|---|---|---|
| Relational | Tables with rows and columns | PostgreSQL, MySQL, SQLite | Structured data with complex relationships |
| Document | JSON/BSON documents in collections | MongoDB, CouchDB | Semi-structured, hierarchical data |
| Key-Value | Simple key→value pairs | Redis, DynamoDB, etcd | Caching, session storage, config |
| Column-Family | Wide columns, sparse data | Cassandra, HBase, ScyllaDB | Time-series, analytics, IoT |
| Graph | Nodes and edges | Neo4j, ArangoDB, Amazon Neptune | Social networks, recommendations |
| Time-Series | Timestamped data points | InfluxDB, TimescaleDB, Prometheus | Metrics, monitoring, IoT sensors |
| Vector | High-dimensional embeddings | Pinecone, Milvus, pgvector | AI/ML similarity search |
The Relational Model¶
Theoretical Foundations¶
The relational model, proposed by Edgar F. Codd in 1970, is based on mathematical set theory and predicate logic. It provides a formal framework for organizing data into relations (tables).
Core Concepts¶
Relation (Table): A relation is a set of tuples (rows) that share the same attributes (columns). Mathematically, a relation \(R\) over \(n\) attributes is a subset of the Cartesian product of their domains:
\( R \subseteq D_1 \times D_2 \times \cdots \times D_n \)
Where \(D_i\) is the domain (set of possible values) for attribute \(i\).
Tuple (Row): A single record in a relation, representing one instance of the entity. Tuples are unordered—the relational model doesn't guarantee row order.
Attribute (Column): A named property of a relation with a defined domain. Each attribute has:
- Name: Unique identifier within the relation
- Domain: Set of allowable values (data type + constraints)
- Nullability: Whether NULL values are permitted
Domain: The set of atomic values an attribute can take. Examples:
- INTEGER: {..., -2, -1, 0, 1, 2, ...}
- VARCHAR(255): All strings up to 255 characters
- BOOLEAN: {TRUE, FALSE}
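Some databases let you name a domain once and reuse it across tables. A minimal PostgreSQL sketch (the domain and table names here are illustrative):

```sql
-- Define a reusable domain: a non-negative monetary amount
CREATE DOMAIN money_amount AS DECIMAL(10, 2)
    CHECK (VALUE >= 0);

-- Any column declared with the domain inherits its constraint
CREATE TABLE invoices (
    id SERIAL PRIMARY KEY,
    subtotal money_amount NOT NULL,
    tax money_amount DEFAULT 0
);
```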
Schema: The logical structure defining a relation's attributes and constraints. A schema is metadata—it describes the data without containing actual values.
-- Schema definition
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
department VARCHAR(50),
salary DECIMAL(10, 2) CHECK (salary > 0),
hire_date DATE DEFAULT CURRENT_DATE,
manager_id INTEGER REFERENCES employees(id)
);
Keys and Constraints¶
Keys are fundamental to the relational model, establishing identity and relationships.
Types of Keys¶
Super Key: Any set of attributes that uniquely identifies a tuple. For a table with columns {id, email, name}, super keys include:
- {id}
- {email}
- {id, email}
- {id, name}
- {id, email, name}
Candidate Key: A minimal super key—no proper subset is also a super key. From the above, {id} and {email} are candidate keys.
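Candidate keys that are not chosen as the primary key are typically still enforced, usually with UNIQUE NOT NULL constraints — a sketch:

```sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,              -- chosen candidate key (primary)
    email VARCHAR(255) UNIQUE NOT NULL  -- remaining candidate key, still enforced
);
```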
Primary Key: The chosen candidate key for a relation. It:
- Must be unique across all rows
- Cannot contain NULL values
- Should be immutable (rarely change)
- Is typically indexed automatically
-- Primary key can be single column
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL
);
-- Or composite (multiple columns)
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);
Foreign Key: An attribute (or set) that references a primary key in another table, establishing relationships.
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users(id),
total DECIMAL(10, 2),
created_at TIMESTAMP DEFAULT NOW()
);
Natural Key vs Surrogate Key:
- Natural Key: Uses real-world identifiers (email, SSN, ISBN).
  - Pros: Meaningful, no extra column
  - Cons: May change, privacy concerns, can be complex
- Surrogate Key: System-generated identifier (auto-increment, UUID).
  - Pros: Stable, simple, no business logic
  - Cons: Meaningless, extra storage
-- Surrogate key (recommended for most cases)
CREATE TABLE products (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
sku VARCHAR(50) UNIQUE, -- Natural key as secondary
name VARCHAR(255)
);
Constraint Types¶
Constraints enforce data integrity rules at the database level:
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
-- NOT NULL: Attribute must have a value
name VARCHAR(100) NOT NULL,
-- UNIQUE: No duplicate values (NULLs may be allowed)
email VARCHAR(255) UNIQUE,
-- CHECK: Custom validation rules
age INTEGER CHECK (age >= 18 AND age <= 120),
salary DECIMAL(10, 2) CHECK (salary > 0),
-- DEFAULT: Value when not specified
status VARCHAR(20) DEFAULT 'active',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- FOREIGN KEY with referential actions
department_id INTEGER REFERENCES departments(id)
ON DELETE SET NULL
ON UPDATE CASCADE
);
-- Table-level constraints (for multi-column)
ALTER TABLE employees ADD CONSTRAINT unique_name_dept
UNIQUE (name, department_id);
ALTER TABLE employees ADD CONSTRAINT valid_dates
CHECK (termination_date IS NULL OR termination_date > hire_date);
Referential Integrity Actions:
| Action | On DELETE | On UPDATE |
|---|---|---|
CASCADE |
Delete referencing rows | Update foreign key values |
SET NULL |
Set FK to NULL | Set FK to NULL |
SET DEFAULT |
Set FK to default | Set FK to default |
RESTRICT |
Prevent deletion | Prevent update |
NO ACTION |
Check at end of transaction | Check at end of transaction |
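Referential actions are attached per foreign key. A standalone sketch (simplified versions of the earlier orders/order_items tables) showing CASCADE deletes:

```sql
CREATE TABLE orders (
    id SERIAL PRIMARY KEY
);

CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(id) ON DELETE CASCADE,
    product_id INTEGER,
    PRIMARY KEY (order_id, product_id)
);

-- Deleting an order also deletes its items; no orphaned rows remain
DELETE FROM orders WHERE id = 42;
```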
Relational Algebra¶
Relational algebra provides the theoretical foundation for SQL operations. Understanding it helps optimize queries.
Fundamental Operations¶
Selection (σ): Filters rows based on a predicate.
SELECT * FROM employees WHERE salary > 50000;
Projection (π): Selects specific columns.
SELECT name, salary FROM employees;
Cartesian Product (×): Combines every row of one table with every row of another.
SELECT * FROM employees, departments; -- Rarely useful alone
Union (∪): Combines rows from two relations with the same schema.
SELECT name FROM employees
UNION
SELECT name FROM contractors;
Difference (−): Rows in first relation but not second.
SELECT name FROM all_users
EXCEPT
SELECT name FROM banned_users;
Intersection (∩): Rows in both relations.
SELECT name FROM customers
INTERSECT
SELECT name FROM newsletter_subscribers;
Derived Operations¶
Join (⋈): Combines related rows from two tables.
Natural Join: Joins on all common attributes.
SELECT * FROM employees NATURAL JOIN departments;
Theta Join: Joins with arbitrary condition.
SELECT * FROM employees e, projects p
WHERE e.skill_level >= p.required_level;
Equi-Join: Theta join with equality condition (most common).
SELECT * FROM employees e
JOIN departments d ON e.dept_id = d.id;
Outer Joins: Include non-matching rows.
-- LEFT OUTER JOIN: All rows from left table
SELECT e.name, d.name as dept
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;
-- RIGHT OUTER JOIN: All rows from right table
SELECT e.name, d.name as dept
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.id;
-- FULL OUTER JOIN: All rows from both tables
SELECT e.name, d.name as dept
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.id;
Semi-Join (⋉): Returns rows from left table that have a match in right table.
SELECT * FROM employees e
WHERE EXISTS (
SELECT 1 FROM projects p WHERE p.lead_id = e.id
);
Anti-Join: Returns rows from left table with no match in right table.
SELECT * FROM employees e
WHERE NOT EXISTS (
SELECT 1 FROM projects p WHERE p.lead_id = e.id
);
SQL: Structured Query Language¶
SQL is the standard language for interacting with relational databases. While implementations vary, the core syntax is consistent across systems.
SQL Sublanguages¶
SQL is divided into several sublanguages, each serving a specific purpose:
DDL (Data Definition Language)¶
Defines database structure:
-- CREATE: Create new objects
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
price DECIMAL(10, 2),
category_id INTEGER REFERENCES categories(id)
);
CREATE INDEX idx_products_category ON products(category_id);
CREATE VIEW expensive_products AS
SELECT * FROM products WHERE price > 100;
-- ALTER: Modify existing objects
ALTER TABLE products ADD COLUMN description TEXT;
ALTER TABLE products ALTER COLUMN name TYPE VARCHAR(500);
ALTER TABLE products DROP COLUMN description;
ALTER TABLE products ADD CONSTRAINT positive_price CHECK (price > 0);
-- DROP: Remove objects
DROP TABLE products; -- Fails if referenced
DROP TABLE products CASCADE; -- Drops dependent objects too
DROP INDEX idx_products_category;
-- TRUNCATE: Remove all rows (faster than DELETE)
TRUNCATE TABLE products; -- Cannot be rolled back in some DBs
TRUNCATE TABLE products RESTART IDENTITY CASCADE;
DML (Data Manipulation Language)¶
Manipulates data within tables:
-- INSERT: Add new rows
INSERT INTO products (name, price, category_id)
VALUES ('Laptop', 999.99, 1);
-- Insert multiple rows
INSERT INTO products (name, price, category_id) VALUES
('Mouse', 29.99, 2),
('Keyboard', 79.99, 2),
('Monitor', 299.99, 3);
-- Insert from query
INSERT INTO archived_orders (id, total, archived_at)
SELECT id, total, NOW() FROM orders WHERE created_at < '2024-01-01';
-- UPDATE: Modify existing rows
UPDATE products SET price = price * 1.1 WHERE category_id = 1;
UPDATE products p
SET price = p.price * (1 - d.discount_rate)
FROM discounts d
WHERE p.category_id = d.category_id;
-- DELETE: Remove rows
DELETE FROM products WHERE id = 5;
DELETE FROM orders
WHERE created_at < NOW() - INTERVAL '1 year'
AND status = 'completed';
-- UPSERT: Insert or update on conflict
INSERT INTO products (sku, name, price)
VALUES ('ABC123', 'Widget', 9.99)
ON CONFLICT (sku) DO UPDATE SET
name = EXCLUDED.name,
price = EXCLUDED.price;
DQL (Data Query Language)¶
Retrieves data from the database:
-- Basic SELECT
SELECT name, price FROM products;
-- With conditions
SELECT * FROM products
WHERE price > 100 AND category_id IN (1, 2, 3);
-- Ordering and limiting
SELECT * FROM products
ORDER BY price DESC, name ASC
LIMIT 10 OFFSET 20;
-- Aggregations
SELECT
category_id,
COUNT(*) as product_count,
AVG(price) as avg_price,
MIN(price) as min_price,
MAX(price) as max_price,
SUM(price) as total_value
FROM products
GROUP BY category_id
HAVING COUNT(*) > 5;
-- Joins
SELECT
p.name as product,
c.name as category,
s.name as supplier
FROM products p
JOIN categories c ON p.category_id = c.id
LEFT JOIN suppliers s ON p.supplier_id = s.id
WHERE p.active = true;
DCL (Data Control Language)¶
Controls access permissions:
-- GRANT: Give permissions
GRANT SELECT, INSERT ON products TO app_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO admin;
GRANT USAGE ON SCHEMA reporting TO analyst;
-- REVOKE: Remove permissions
REVOKE DELETE ON products FROM app_user;
REVOKE ALL PRIVILEGES ON sensitive_data FROM public;
-- Role management
CREATE ROLE readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
GRANT readonly TO analyst_user;
TCL (Transaction Control Language)¶
Manages transactions:
-- Start a transaction
BEGIN; -- or START TRANSACTION
-- Make changes
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- Savepoint for partial rollback
SAVEPOINT after_debit;
UPDATE audit_log SET processed = true WHERE transaction_id = 123;
-- Rollback to savepoint if needed
ROLLBACK TO after_debit;
-- Commit or rollback entire transaction
COMMIT; -- or ROLLBACK;
Advanced SQL Features¶
Common Table Expressions (CTEs)¶
CTEs improve readability and enable recursive queries:
-- Non-recursive CTE
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', created_at) as month,
SUM(total) as revenue
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY 1
),
monthly_growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) as prev_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) as growth
FROM monthly_sales
)
SELECT * FROM monthly_growth WHERE growth > 0;
-- Recursive CTE (organizational hierarchy)
WITH RECURSIVE org_tree AS (
-- Base case: top-level employees
SELECT id, name, manager_id, 1 as depth, ARRAY[name] as path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees with managers
SELECT e.id, e.name, e.manager_id, ot.depth + 1, ot.path || e.name
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id
WHERE ot.depth < 10 -- Prevent infinite recursion
)
SELECT * FROM org_tree ORDER BY path;
Window Functions¶
Window functions perform calculations across rows related to the current row:
SELECT
name,
department,
salary,
-- Ranking functions
ROW_NUMBER() OVER (ORDER BY salary DESC) as overall_rank,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank,
DENSE_RANK() OVER (ORDER BY salary DESC) as dense_rank,
NTILE(4) OVER (ORDER BY salary DESC) as quartile,
-- Aggregate functions as windows
SUM(salary) OVER (PARTITION BY department) as dept_total,
AVG(salary) OVER (PARTITION BY department) as dept_avg,
COUNT(*) OVER (PARTITION BY department) as dept_count,
-- Navigation functions
LAG(salary) OVER (ORDER BY hire_date) as prev_salary,
LEAD(salary) OVER (ORDER BY hire_date) as next_salary,
FIRST_VALUE(name) OVER (PARTITION BY department ORDER BY salary DESC) as top_earner,
LAST_VALUE(name) OVER (
PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as lowest_earner,
-- Running totals
SUM(salary) OVER (ORDER BY hire_date ROWS UNBOUNDED PRECEDING) as running_total,
AVG(salary) OVER (ORDER BY hire_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_avg
FROM employees;
Subqueries¶
Subqueries are queries nested within other queries:
-- Scalar subquery (returns single value)
SELECT
name,
salary,
salary - (SELECT AVG(salary) FROM employees) as diff_from_avg
FROM employees;
-- Table subquery in FROM clause
SELECT dept, avg_salary
FROM (
SELECT department as dept, AVG(salary) as avg_salary
FROM employees
GROUP BY department
) dept_stats
WHERE avg_salary > 50000;
-- Correlated subquery (references outer query)
SELECT e1.name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e1.department
);
-- EXISTS subquery
SELECT c.name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id
AND o.total > 1000
);
-- IN vs EXISTS
-- IN: Better for small subquery results
SELECT * FROM products WHERE category_id IN (SELECT id FROM categories WHERE active);
-- EXISTS: Better for large tables with indexes
SELECT * FROM products p WHERE EXISTS (
SELECT 1 FROM categories c WHERE c.id = p.category_id AND c.active
);
JSON Operations¶
Modern databases support JSON for semi-structured data:
-- PostgreSQL JSON operations
CREATE TABLE events (
id SERIAL PRIMARY KEY,
data JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
INSERT INTO events (data) VALUES
('{"type": "click", "page": "/home", "user": {"id": 123, "name": "Alice"}}');
-- Query JSON fields
SELECT
data->>'type' as event_type, -- Get as text
data->'user'->>'name' as user_name, -- Nested access
data#>>'{user,id}' as user_id, -- Path access
jsonb_typeof(data->'user') as user_type -- Type check
FROM events
WHERE data->>'type' = 'click'
AND (data->'user'->>'id')::int > 100;
-- JSON containment
SELECT * FROM events WHERE data @> '{"type": "click"}';
-- JSON existence
SELECT * FROM events WHERE data ? 'user'; -- Has key
SELECT * FROM events WHERE data ?& array['type', 'page']; -- Has all keys
-- Update JSON
UPDATE events
SET data = data || '{"processed": true}'
WHERE id = 1;
UPDATE events
SET data = jsonb_set(data, '{user,verified}', 'true')
WHERE id = 1;
-- Index JSON for performance
CREATE INDEX idx_events_type ON events ((data->>'type'));
CREATE INDEX idx_events_data ON events USING GIN (data);
Database Design and Normalization¶
Good database design prevents anomalies, reduces redundancy, and improves data integrity.
Normal Forms¶
Normalization is the process of organizing data to reduce redundancy and dependency issues. Each normal form builds on the previous.
First Normal Form (1NF)¶
Requirements:
- Each column contains atomic (indivisible) values
- Each column contains values of a single type
- Each row is unique (has a primary key)
- No repeating groups
Violation:
| OrderID | CustomerName | Products |
|---|---|---|
| 1 | Alice | Laptop, Mouse |
| 2 | Bob | Keyboard |
Fixed (1NF):
| OrderID | CustomerName | Product |
|---|---|---|
| 1 | Alice | Laptop |
| 1 | Alice | Mouse |
| 2 | Bob | Keyboard |
Second Normal Form (2NF)¶
Requirements:
- Must be in 1NF
- No partial dependencies (non-key attributes depend on the entire primary key)
Violation (composite key: OrderID + Product):
| OrderID | Product | CustomerName | Price |
|---|---|---|---|
| 1 | Laptop | Alice | 999 |
| 1 | Mouse | Alice | 29 |
CustomerName depends only on OrderID, not the full key.
Fixed (2NF):
Orders table:
| OrderID | CustomerName |
|---|---|
| 1 | Alice |
OrderItems table:
| OrderID | Product | Price |
|---|---|---|
| 1 | Laptop | 999 |
| 1 | Mouse | 29 |
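The same decomposition expressed as DDL (a sketch; names mirror the tables above):

```sql
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_name VARCHAR(100)
);

CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(order_id),
    product VARCHAR(100),
    price DECIMAL(10, 2),
    PRIMARY KEY (order_id, product)
);
```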
Third Normal Form (3NF)¶
Requirements:
- Must be in 2NF
- No transitive dependencies (non-key attributes depend only on the primary key, not other non-key attributes)
Violation:
| EmployeeID | DepartmentID | DepartmentName |
|---|---|---|
| 1 | 10 | Engineering |
| 2 | 10 | Engineering |
DepartmentName depends on DepartmentID, not EmployeeID.
Fixed (3NF):
Employees table:
| EmployeeID | DepartmentID |
|---|---|
| 1 | 10 |
| 2 | 10 |
Departments table:
| DepartmentID | DepartmentName |
|---|---|
| 10 | Engineering |
Boyce-Codd Normal Form (BCNF)¶
Requirements:
- Must be in 3NF
- For every functional dependency X → Y, X must be a superkey
BCNF is a stricter version of 3NF that handles certain edge cases with overlapping candidate keys.
Violation:
| Student | Subject | Teacher |
|---|---|---|
| Alice | Math | Dr. Smith |
| Bob | Math | Dr. Jones |
| Alice | Physics | Dr. Brown |
If each teacher teaches only one subject, then Teacher → Subject. But Teacher isn't a superkey.
Fixed (BCNF):
StudentTeacher table:
| Student | Teacher |
|---|---|
| Alice | Dr. Smith |
| Bob | Dr. Jones |
| Alice | Dr. Brown |
TeacherSubject table:
| Teacher | Subject |
|---|---|
| Dr. Smith | Math |
| Dr. Jones | Math |
| Dr. Brown | Physics |
Fourth Normal Form (4NF)¶
Requirements:
- Must be in BCNF
- No non-trivial multi-valued dependencies (independent multi-valued facts about the same entity must not be stored in one relation)
Violation:
| Employee | Skill | Language |
|---|---|---|
| Alice | Python | English |
| Alice | Python | Spanish |
| Alice | Java | English |
| Alice | Java | Spanish |
Skills and Languages are independent multi-valued facts about Employee.
Fixed (4NF):
EmployeeSkills:
| Employee | Skill |
|---|---|
| Alice | Python |
| Alice | Java |
EmployeeLanguages:
| Employee | Language |
|---|---|
| Alice | English |
| Alice | Spanish |
Fifth Normal Form (5NF)¶
Requirements:
- Must be in 4NF
- Cannot be further decomposed without losing information (no join dependencies)
5NF handles complex cases where data can only be reconstructed by joining three or more tables.
Denormalization¶
While normalization reduces redundancy, it increases join complexity. Denormalization intentionally adds redundancy for performance:
When to denormalize:
- Read-heavy workloads with complex joins
- Reporting and analytics queries
- Caching frequently accessed combinations
- When consistency can be managed at application level
Techniques:
-- Redundant columns
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
-- Update via trigger or application logic
-- Materialized views
CREATE MATERIALIZED VIEW order_summary AS
SELECT
o.id,
o.total,
c.name as customer_name,
c.email as customer_email
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Refresh periodically
REFRESH MATERIALIZED VIEW order_summary;
-- Pre-computed aggregates
CREATE TABLE daily_sales (
date DATE PRIMARY KEY,
total_orders INTEGER,
total_revenue DECIMAL(12, 2),
avg_order_value DECIMAL(10, 2)
);
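One way to keep such a rollup current is a scheduled upsert — a sketch assuming the orders table from earlier sections:

```sql
INSERT INTO daily_sales (date, total_orders, total_revenue, avg_order_value)
SELECT
    created_at::date,
    COUNT(*),
    SUM(total),
    AVG(total)
FROM orders
WHERE created_at >= CURRENT_DATE
GROUP BY created_at::date
ON CONFLICT (date) DO UPDATE SET
    total_orders = EXCLUDED.total_orders,
    total_revenue = EXCLUDED.total_revenue,
    avg_order_value = EXCLUDED.avg_order_value;
```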
Schema Design Patterns¶
Star Schema (Data Warehousing)¶
Central fact table surrounded by dimension tables:
-- Fact table (measures/metrics)
CREATE TABLE fact_sales (
id SERIAL PRIMARY KEY,
date_id INTEGER REFERENCES dim_date(id),
product_id INTEGER REFERENCES dim_product(id),
customer_id INTEGER REFERENCES dim_customer(id),
store_id INTEGER REFERENCES dim_store(id),
quantity INTEGER,
unit_price DECIMAL(10, 2),
total_amount DECIMAL(12, 2)
);
-- Dimension tables (descriptive attributes)
CREATE TABLE dim_date (
id INTEGER PRIMARY KEY,
full_date DATE,
year INTEGER,
quarter INTEGER,
month INTEGER,
day INTEGER,
day_of_week INTEGER,
is_weekend BOOLEAN
);
CREATE TABLE dim_product (
id INTEGER PRIMARY KEY,
sku VARCHAR(50),
name VARCHAR(255),
category VARCHAR(100),
brand VARCHAR(100)
);
Snowflake Schema¶
Normalized dimension tables (dimensions have sub-dimensions):
-- Normalized product dimension
CREATE TABLE dim_category (
id INTEGER PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE dim_brand (
id INTEGER PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE dim_product (
id INTEGER PRIMARY KEY,
sku VARCHAR(50),
name VARCHAR(255),
category_id INTEGER REFERENCES dim_category(id),
brand_id INTEGER REFERENCES dim_brand(id)
);
Slowly Changing Dimensions (SCD)¶
Handling historical changes in dimension data:
Type 1 - Overwrite:
-- Simply update the record (no history)
UPDATE dim_customer SET address = 'New Address' WHERE id = 123;
Type 2 - Add New Row:
CREATE TABLE dim_customer (
surrogate_key SERIAL PRIMARY KEY,
customer_id INTEGER, -- Natural key
name VARCHAR(100),
address VARCHAR(255),
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN DEFAULT true
);
-- When customer changes address:
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - 1, is_current = false
WHERE customer_id = 123 AND is_current = true;
INSERT INTO dim_customer (customer_id, name, address, effective_date, is_current)
VALUES (123, 'Alice', 'New Address', CURRENT_DATE, true);
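With Type 2, both current and historical states remain queryable — a sketch against the table above:

```sql
-- Current version of each customer
SELECT * FROM dim_customer WHERE is_current;

-- Point-in-time lookup: customer 123 as of a given date
SELECT * FROM dim_customer
WHERE customer_id = 123
  AND effective_date <= DATE '2024-06-01'
  AND (expiration_date IS NULL OR expiration_date >= DATE '2024-06-01');
```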
Type 3 - Add Column:
CREATE TABLE dim_customer (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
current_address VARCHAR(255),
previous_address VARCHAR(255)
);
Indexing and Query Optimization¶
Indexes are data structures that speed up data retrieval at the cost of additional storage and write overhead.
Index Types¶
B-Tree Index (Default)¶
Balanced tree structure, ideal for:
- Equality comparisons (=)
- Range queries (<, >, BETWEEN)
- Prefix matching (LIKE 'abc%')
- Sorting (ORDER BY)
-- Single column index
CREATE INDEX idx_users_email ON users(email);
-- Composite index (column order matters!)
CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at);
-- Query uses the index
SELECT * FROM orders WHERE customer_id = 123 AND created_at > '2024-01-01';
SELECT * FROM orders WHERE customer_id = 123; -- Uses index (leftmost prefix)
SELECT * FROM orders WHERE created_at > '2024-01-01'; -- May NOT use index efficiently
B-Tree Structure:
              [50]
             /    \
      [25, 30]    [75, 90]
     /   |   \    /   |   \
  [10] [26] [35] [60] [80] [95]
- Logarithmic search time: O(log n)
- Balanced: all leaf nodes at same depth
- Sorted: enables range scans
Hash Index¶
Uses hash function for O(1) lookups, but:
- Only supports equality comparisons
- Cannot be used for range queries or sorting
- Not available in all databases
-- PostgreSQL hash index
CREATE INDEX idx_users_id_hash ON users USING HASH (id);
-- Good for: WHERE id = 123
-- Bad for: WHERE id > 100, ORDER BY id
GiST (Generalized Search Tree)¶
For complex data types:
-- Full-text search
CREATE INDEX idx_articles_content ON articles USING GiST (to_tsvector('english', content));
-- Geometric data
CREATE INDEX idx_locations_point ON locations USING GiST (point);
-- Range types
CREATE INDEX idx_events_duration ON events USING GiST (tsrange(start_time, end_time));
GIN (Generalized Inverted Index)¶
For composite values (arrays, JSON, full-text):
-- Array containment
CREATE INDEX idx_posts_tags ON posts USING GIN (tags);
SELECT * FROM posts WHERE tags @> ARRAY['sql', 'databases'];
-- JSONB queries
CREATE INDEX idx_events_data ON events USING GIN (data);
SELECT * FROM events WHERE data @> '{"type": "click"}';
-- Full-text search (better for static data)
CREATE INDEX idx_articles_tsv ON articles USING GIN (to_tsvector('english', content));
BRIN (Block Range Index)¶
For large, naturally ordered data:
-- Time-series data (much smaller than B-tree)
CREATE INDEX idx_logs_timestamp ON logs USING BRIN (timestamp);
-- Ideal when: data is physically sorted by the indexed column
-- Very small index size, but less precise
Partial Index¶
Index only rows matching a condition:
-- Only index active users
CREATE INDEX idx_active_users ON users(email) WHERE active = true;
-- Only index recent orders
CREATE INDEX idx_recent_orders ON orders(customer_id)
WHERE created_at > '2024-01-01';
-- Query must include the condition to use the index
SELECT * FROM users WHERE email = 'test@example.com' AND active = true;
Expression Index¶
Index on computed expressions:
-- Case-insensitive search
CREATE INDEX idx_users_email_lower ON users(LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- JSON field
CREATE INDEX idx_events_type ON events((data->>'type'));
SELECT * FROM events WHERE data->>'type' = 'click';
-- Date part
CREATE INDEX idx_orders_month ON orders(DATE_TRUNC('month', created_at));
Query Optimization¶
Understanding EXPLAIN¶
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;
-- Output interpretation:
/*
Seq Scan on orders (cost=0.00..1234.00 rows=100 width=50) (actual time=0.015..12.345 rows=95 loops=1)
Filter: (customer_id = 123)
Rows Removed by Filter: 9905
Planning Time: 0.123 ms
Execution Time: 12.456 ms
*/
Key metrics:
- cost: Estimated units (startup cost..total cost)
- rows: Estimated row count
- actual time: Real execution time (ms)
- loops: Number of iterations
Scan types (best to worst for large tables):
- Index Only Scan: All data from index, no table access
- Index Scan: Use index, fetch rows from table
- Bitmap Index Scan: Build bitmap of matching rows, then fetch
- Seq Scan: Full table scan (may be OK for small tables)
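To watch the planner switch scan types, compare plans before and after adding an index — a sketch (the exact plan and costs will differ per system):

```sql
-- Without an index: the filter forces a sequential scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- Seq Scan on orders
--   Filter: (customer_id = 123)

CREATE INDEX idx_orders_customer ON orders(customer_id);
ANALYZE orders;

-- With the index and fresh statistics, the planner can choose an
-- index scan or bitmap index scan instead of scanning every row
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
```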
Common Optimization Techniques¶
Use covering indexes:
-- Index includes all needed columns
CREATE INDEX idx_orders_covering ON orders(customer_id) INCLUDE (total, status);
-- Query satisfied entirely from index
SELECT customer_id, total, status FROM orders WHERE customer_id = 123;
Avoid functions on indexed columns:
-- Bad: function prevents index use
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Good: create expression index, or
SELECT * FROM users WHERE email = 'Alice@example.com'; -- if case-sensitive is OK
-- Or use CITEXT type for case-insensitive text
Use appropriate types:
-- Bad: implicit type conversion
SELECT * FROM orders WHERE id = '123'; -- id is INTEGER
-- Good: matching types
SELECT * FROM orders WHERE id = 123;
Limit result sets:
-- Use LIMIT for pagination
SELECT * FROM orders ORDER BY created_at DESC LIMIT 20 OFFSET 40;
-- Better: keyset pagination (for large offsets)
SELECT * FROM orders
WHERE created_at < '2024-01-15 10:30:00'
ORDER BY created_at DESC
LIMIT 20;
Optimize JOINs:
-- Ensure join columns are indexed
CREATE INDEX idx_order_items_order ON order_items(order_id);
-- Join order matters for some databases
-- Put smaller/filtered tables first (optimizer usually handles this)
Statistics and Query Planning¶
-- Update table statistics
ANALYZE orders;
ANALYZE; -- All tables
-- Check statistics
SELECT
schemaname, tablename,
n_live_tup as rows,
n_dead_tup as dead_rows,
last_vacuum, last_autovacuum,
last_analyze, last_autoanalyze
FROM pg_stat_user_tables;
-- View index usage
SELECT
indexrelname as index_name,
idx_scan as times_used,
idx_tup_read as tuples_read,
idx_tup_fetch as tuples_fetched
FROM pg_stat_user_indexes
WHERE schemaname = 'public';
Transactions and Concurrency Control¶
Transactions group multiple operations into atomic units, ensuring data consistency even with concurrent access and failures.
ACID Properties¶
Atomicity¶
All operations in a transaction succeed or fail together:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- Both succeed, or both fail
If the system crashes after the first UPDATE but before COMMIT, the entire transaction is rolled back.
Consistency¶
Transactions maintain all database constraints:
-- Constraint: balance >= 0
ALTER TABLE accounts ADD CONSTRAINT positive_balance CHECK (balance >= 0);
BEGIN;
UPDATE accounts SET balance = balance - 1000 WHERE id = 1; -- Would make balance negative
-- The statement fails with a constraint violation and the transaction is aborted
ROLLBACK; -- Required before further commands can run in this session
Isolation¶
Concurrent transactions don't interfere with each other:
-- Transaction A -- Transaction B
BEGIN; BEGIN;
SELECT balance FROM accounts SELECT balance FROM accounts
WHERE id = 1; -- Returns 1000 WHERE id = 1; -- Returns 1000
UPDATE accounts
SET balance = balance - 100
WHERE id = 1;
UPDATE accounts
SET balance = balance + 50
WHERE id = 1;
-- Waits for A to complete (or uses MVCC)
COMMIT;
COMMIT;
-- Final balance: 950 (not 1050 or 900)
Durability¶
Committed transactions survive system failures:
- Data written to disk before COMMIT returns
- Write-Ahead Logging (WAL) ensures recovery
- Replication provides additional durability
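Durability is usually tunable. In PostgreSQL, for example, synchronous_commit controls whether COMMIT waits for the WAL flush — a sketch (weakening it trades a small window of potential data loss for lower latency):

```sql
-- Session-level: let COMMIT return before the WAL is flushed to disk
SET synchronous_commit = off;

-- Inspect the current setting
SHOW synchronous_commit;
```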
Isolation Levels¶
SQL defines four isolation levels, trading consistency for performance:
Read Uncommitted¶
- Allows dirty reads (seeing uncommitted changes)
- Highest performance, lowest consistency
- Rarely used in practice
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
Dirty Read Example:
-- Transaction A -- Transaction B
BEGIN;
UPDATE accounts SET balance = 0
WHERE id = 1;
BEGIN;
SELECT balance FROM accounts
WHERE id = 1; -- Sees 0 (uncommitted!)
ROLLBACK; -- Never committed!
-- B made decisions based on wrong data
Read Committed (PostgreSQL default)¶
- No dirty reads: only see committed changes
- Allows non-repeatable reads
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
Non-Repeatable Read Example:
-- Transaction A -- Transaction B
BEGIN;
SELECT balance FROM accounts
WHERE id = 1; -- Returns 1000
BEGIN;
UPDATE accounts SET balance = 500
WHERE id = 1;
COMMIT;
SELECT balance FROM accounts
WHERE id = 1; -- Returns 500 (different!)
COMMIT;
Repeatable Read (MySQL InnoDB default)¶
- No dirty reads
- No non-repeatable reads: same query returns same rows
- Allows phantom reads (new rows appear)
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
Phantom Read Example:
-- Transaction A -- Transaction B
BEGIN;
SELECT COUNT(*) FROM orders
WHERE customer_id = 1; -- Returns 10
BEGIN;
INSERT INTO orders (customer_id, ...)
VALUES (1, ...);
COMMIT;
SELECT COUNT(*) FROM orders
WHERE customer_id = 1; -- Returns 11 (phantom row!)
COMMIT;
Serializable¶
- Strictest isolation
- Transactions behave as if executed serially
- No anomalies, but lowest concurrency
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
In PostgreSQL, serializable uses Serializable Snapshot Isolation (SSI), which detects conflicts and aborts transactions as needed.
Isolation Levels Summary¶
| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
|---|---|---|---|
| Read Uncommitted | ✓ | ✓ | ✓ |
| Read Committed | ✗ | ✓ | ✓ |
| Repeatable Read | ✗ | ✗ | ✓ (varies) |
| Serializable | ✗ | ✗ | ✗ |
Concurrency Control Mechanisms¶
Pessimistic Locking¶
Lock resources before access:
-- Exclusive lock (for update)
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE; -- Lock row
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT; -- Release lock
-- Shared lock (for read, allows other reads)
SELECT * FROM accounts WHERE id = 1 FOR SHARE;
-- Skip locked rows (useful for job queues)
SELECT * FROM jobs WHERE status = 'pending'
FOR UPDATE SKIP LOCKED LIMIT 1;
-- No wait (fail immediately if locked)
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;
Lock Types:
- Row-level locks: Lock specific rows (most common)
- Table-level locks: Lock entire tables (for DDL, bulk operations)
- Advisory locks: Application-defined locks
-- Table locks
LOCK TABLE accounts IN EXCLUSIVE MODE;
-- Advisory locks
SELECT pg_advisory_lock(12345); -- Acquire
SELECT pg_advisory_unlock(12345); -- Release
Optimistic Locking¶
Assume no conflicts, detect at commit time:
-- Using version numbers
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(255),
price DECIMAL(10, 2),
version INTEGER DEFAULT 1
);
-- Read with version
SELECT id, name, price, version FROM products WHERE id = 1;
-- Returns: (1, 'Widget', 9.99, 5)
-- Update only if version matches
UPDATE products
SET price = 10.99, version = version + 1
WHERE id = 1 AND version = 5;
-- Check affected rows: if 0, another transaction modified it
-- Application retries with new data
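The read-compare-swap cycle above is usually wrapped in an application-level retry loop. A minimal sketch (SQLite stands in for the products table; `update_price` is a hypothetical helper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL, version INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO products (id, price) VALUES (1, 9.99)")
conn.commit()

def update_price(conn, product_id, new_price, retries=3):
    """Optimistic update: re-read and retry if another writer bumped the version."""
    for _ in range(retries):
        row = conn.execute(
            "SELECT version FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        if row is None:
            return False
        cur = conn.execute(
            "UPDATE products SET price = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_price, product_id, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:  # our version still matched: success
            return True
        # rowcount == 0: a concurrent writer won; loop re-reads the new version
    return False

assert update_price(conn, 1, 10.99)
# A writer still holding the old version loses the race (0 rows affected)
stale = conn.execute("UPDATE products SET price = 0 WHERE id = 1 AND version = 1")
assert stale.rowcount == 0
```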
Multi-Version Concurrency Control (MVCC)¶
MVCC allows readers and writers to not block each other:
How it works:
- Each row has hidden transaction IDs (xmin, xmax)
- Updates create new row versions, don't overwrite
- Readers see consistent snapshot based on their start time
- Old versions cleaned up by VACUUM
-- PostgreSQL: view row versions
SELECT xmin, xmax, * FROM accounts WHERE id = 1;
-- Transaction A (started at t=100)
BEGIN;
SELECT * FROM accounts; -- Sees snapshot as of t=100
-- Transaction B commits update at t=110
-- Transaction A still sees old data (snapshot isolation)
SELECT * FROM accounts; -- Still sees t=100 snapshot
COMMIT;
Benefits:
- Readers never block writers
- Writers rarely block readers
- Better concurrency than pure locking
Costs:
- More storage for row versions
- Need VACUUM to clean up dead rows
- More complex implementation
Deadlocks¶
Deadlocks occur when transactions wait for each other's locks:
-- Transaction A -- Transaction B
BEGIN; BEGIN;
UPDATE accounts SET balance = 100
WHERE id = 1; -- Locks row 1
UPDATE accounts SET balance = 200
WHERE id = 2; -- Locks row 2
UPDATE accounts SET balance = 300
WHERE id = 2; -- Waits for B
UPDATE accounts SET balance = 400
WHERE id = 1; -- Waits for A
-- DEADLOCK! Database detects and aborts one transaction
Prevention strategies:
- Lock ordering: Always acquire locks in consistent order
- Lock timeout: Give up after waiting too long
- Deadlock detection: Database detects and resolves
-- Set lock timeout
SET lock_timeout = '5s';
-- Access resources in consistent order
-- Always lock accounts in ascending ID order
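The lock-ordering rule applies to any locks, not just database rows. A minimal in-process sketch (hypothetical `transfer` helper): sorting the account IDs before acquiring guarantees no cyclic wait, no matter which direction the two transfers run.

```python
import threading

locks = {1: threading.Lock(), 2: threading.Lock()}
balances = {1: 1000, 2: 500}

def transfer(src, dst, amount):
    # Acquire locks in ascending account-id order, regardless of transfer
    # direction, so two concurrent transfers can never deadlock.
    first, second = sorted((src, dst))
    with locks[first]:
        with locks[second]:
            balances[src] -= amount
            balances[dst] += amount

t1 = threading.Thread(target=transfer, args=(1, 2, 100))
t2 = threading.Thread(target=transfer, args=(2, 1, 50))
t1.start(); t2.start()
t1.join(); t2.join()
assert balances == {1: 950, 2: 550}
```

Without the `sorted()` call, each thread could grab its source account's lock first and wait forever for the other, which is exactly the cycle the database's deadlock detector would have to break.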
NoSQL Databases¶
NoSQL ("Not Only SQL") databases emerged to handle data that doesn't fit well in relational models, or to achieve scalability beyond what traditional RDBMS can offer.
Document Databases¶
Store data as JSON-like documents in collections:
Examples: MongoDB, CouchDB, Amazon DocumentDB
// MongoDB document
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "Alice Johnson",
"email": "alice@example.com",
"orders": [
{
"id": "ORD-001",
"items": [
{"product": "Laptop", "price": 999.99, "quantity": 1},
{"product": "Mouse", "price": 29.99, "quantity": 2}
],
"total": 1059.97,
"status": "shipped"
}
],
"preferences": {
"newsletter": true,
"theme": "dark"
}
}
Query examples (MongoDB):
// Find documents
db.users.find({ "email": "alice@example.com" })
db.users.find({ "orders.total": { $gt: 1000 } })
db.users.find({ "preferences.theme": "dark" })
// Aggregation pipeline
db.orders.aggregate([
{ $match: { status: "shipped" } },
{ $group: {
_id: "$customer_id",
totalSpent: { $sum: "$total" },
orderCount: { $sum: 1 }
}},
{ $sort: { totalSpent: -1 } },
{ $limit: 10 }
])
// Update nested documents
db.users.updateOne(
{ "_id": ObjectId("..."), "orders.id": "ORD-001" },
{ $set: { "orders.$.status": "delivered" } }
)
When to use:
- Flexible schemas that evolve
- Hierarchical/nested data
- Rapid prototyping
- Content management
- Catalogs with varying attributes
Trade-offs:
- Limited joins (MongoDB's $lookup exists, but data is typically embedded or fetched in multiple queries)
- Less strict consistency (by default)
- Potential data duplication
Key-Value Stores¶
Simplest NoSQL model: store values by unique keys:
Examples: Redis, Amazon DynamoDB, etcd
# Redis examples
import redis
r = redis.Redis(decode_responses=True)  # return str values, as shown in the comments below
# Basic operations
r.set("user:123:name", "Alice")
r.get("user:123:name") # "Alice"
r.delete("user:123:name")
# Expiration (caching)
r.setex("session:abc123", 3600, "user_data") # Expires in 1 hour
# Data structures
r.hset("user:123", mapping={"name": "Alice", "email": "alice@example.com"})
r.hgetall("user:123") # {"name": "Alice", "email": "..."}
r.lpush("queue:jobs", "job1", "job2")
r.rpop("queue:jobs") # "job1"
r.sadd("tags:post:1", "python", "databases")
r.smembers("tags:post:1") # {"python", "databases"}
r.zadd("leaderboard", {"alice": 100, "bob": 85})
r.zrange("leaderboard", 0, 9, withscores=True) # Top 10
When to use:
- Caching (API responses, sessions)
- Real-time leaderboards
- Rate limiting
- Pub/sub messaging
- Configuration storage
Trade-offs:
- Limited query capabilities
- No complex relationships
- Data usually in memory (cost)
Column-Family Stores¶
Organize data by columns rather than rows, optimized for sparse data and analytics:
Examples: Apache Cassandra, HBase, ScyllaDB
-- Cassandra CQL
CREATE TABLE sensor_data (
sensor_id uuid,
timestamp timestamp,
temperature float,
humidity float,
pressure float,
PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- Write (extremely fast)
INSERT INTO sensor_data (sensor_id, timestamp, temperature)
VALUES (uuid(), toTimestamp(now()), 23.5); -- now() yields a timeuuid; convert for timestamp columns
-- Read recent data for a sensor
SELECT * FROM sensor_data
WHERE sensor_id = ?
AND timestamp > '2024-01-01'
LIMIT 1000;
-- Wide rows: one row per sensor, columns for each reading
-- Efficient for time-series queries on single partition
When to use:
- Time-series data (IoT, metrics)
- Write-heavy workloads
- Data with natural time ordering
- Analytics requiring column-based scans
Trade-offs:
- Limited secondary indexes
- Denormalization required (tables are modeled around queries)
- Little support for ad-hoc queries outside the partition-key design
- Tunable, often eventual, consistency
Graph Databases¶
Store and query relationships as first-class citizens:
Examples: Neo4j, Amazon Neptune, ArangoDB
// Neo4j Cypher query language
// Create nodes and relationships
CREATE (alice:Person {name: 'Alice', age: 30})
CREATE (bob:Person {name: 'Bob', age: 28})
CREATE (acme:Company {name: 'Acme Corp'})
CREATE (alice)-[:FRIENDS_WITH {since: 2020}]->(bob)
CREATE (alice)-[:WORKS_AT {role: 'Engineer'}]->(acme)
CREATE (bob)-[:WORKS_AT {role: 'Designer'}]->(acme)
// Find friends of friends
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH*2]->(fof)
WHERE fof <> alice
RETURN DISTINCT fof.name
// Shortest path
MATCH path = shortestPath(
(alice:Person {name: 'Alice'})-[*]-(bob:Person {name: 'Bob'})
)
RETURN path
// Recommendation: people who work at same company
MATCH (alice:Person {name: 'Alice'})-[:WORKS_AT]->(company)<-[:WORKS_AT]-(colleague)
WHERE colleague <> alice
AND NOT (alice)-[:FRIENDS_WITH]-(colleague)
RETURN colleague.name AS recommendation
When to use:
- Social networks
- Fraud detection
- Knowledge graphs
- Recommendation engines
- Network/infrastructure mapping
Trade-offs:
- Not optimized for bulk analytics
- Scaling can be complex
- Different query paradigm
Time-Series Databases¶
Optimized for timestamped data:
Examples: InfluxDB, TimescaleDB, Prometheus
-- TimescaleDB (PostgreSQL extension)
CREATE TABLE metrics (
time TIMESTAMPTZ NOT NULL,
device_id TEXT,
metric_name TEXT,
value DOUBLE PRECISION
);
-- Convert to hypertable (time-series optimized)
SELECT create_hypertable('metrics', 'time');
-- Insert data
INSERT INTO metrics VALUES
(NOW(), 'sensor-1', 'temperature', 23.5),
(NOW(), 'sensor-1', 'humidity', 65.2);
-- Time-series queries
SELECT
time_bucket('1 hour', time) AS hour,
device_id,
AVG(value) as avg_value,
MIN(value) as min_value,
MAX(value) as max_value
FROM metrics
WHERE time > NOW() - INTERVAL '24 hours'
AND metric_name = 'temperature'
GROUP BY hour, device_id
ORDER BY hour DESC;
-- Continuous aggregates (materialized views for time-series)
CREATE MATERIALIZED VIEW hourly_stats
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS hour,
device_id,
AVG(value) as avg_value
FROM metrics
GROUP BY hour, device_id;
Vector Databases¶
Store and search high-dimensional embeddings for AI/ML:
Examples: Pinecone, Milvus, Weaviate, pgvector
-- pgvector (PostgreSQL extension)
CREATE EXTENSION vector;
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536) -- OpenAI embedding dimension
);
-- Create index for similarity search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
-- Insert with embedding
INSERT INTO documents (content, embedding)
VALUES ('Hello world', '[0.1, 0.2, ...]');
-- Similarity search
SELECT content, embedding <=> '[0.15, 0.25, ...]' AS distance
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ...]'
LIMIT 10;
CAP Theorem and Distributed Databases¶
CAP Theorem¶
In a distributed system, you can only guarantee two of three properties:
- Consistency: All nodes see the same data at the same time
- Availability: Every request receives a response (success/failure)
- Partition Tolerance: System operates despite network failures
Since network partitions are inevitable, the real choice is between:
-
CP (Consistency + Partition Tolerance): Sacrifice availability during partitions. System may refuse requests to maintain consistency.
- Examples: Traditional RDBMS, HBase, MongoDB (with majority write concern)
-
AP (Availability + Partition Tolerance): Sacrifice consistency during partitions. System stays available but may return stale data.
- Examples: Cassandra, DynamoDB, CouchDB
PACELC Theorem¶
Extension of CAP: "If there's a Partition, choose Availability or Consistency; Else, when running normally, choose Latency or Consistency."
- PA/EL: Sacrifice consistency for availability/latency (Cassandra, DynamoDB)
- PC/EC: Prioritize consistency always (Traditional RDBMS)
- PA/EC: Available during partition, consistent otherwise (MongoDB default)
Consistency Models¶
Strong Consistency¶
All reads return the most recent write:
Write(x=1) → Read(x) = 1 (guaranteed)
- Achieved via: Single leader, synchronous replication, consensus protocols
- Trade-off: Higher latency, lower availability
Eventual Consistency¶
If no new updates, eventually all reads return the last updated value:
Write(x=1) → Read(x) = 0 (possible, old value)
→ [time passes]
→ Read(x) = 1 (eventually)
- Achieved via: Asynchronous replication, gossip protocols
- Trade-off: May read stale data
Causal Consistency¶
If operation A causally affects B, everyone sees A before B:
A: Write(x=1)
B: Read(x) returns 1, then Write(y=2)
C: If C sees y=2, C must also see x=1
Read-Your-Writes Consistency¶
Users always see their own updates:
User writes x=1 → User reads x (always returns 1 or newer)
Other user reads x (may return old value)
Consensus Algorithms¶
Distributed systems use consensus algorithms to agree on values:
Paxos¶
Classic consensus algorithm, complex but theoretically sound.
Raft¶
More understandable alternative to Paxos:
- Leader Election: One node becomes leader
- Log Replication: Leader replicates log entries to followers
- Safety: Only nodes with up-to-date logs can become leader
Used by: etcd, Consul, CockroachDB
BASE vs ACID¶
ACID (Traditional RDBMS):
- Atomicity, Consistency, Isolation, Durability
- Strong guarantees, lower availability
BASE (Many NoSQL systems):
- Basically Available: System remains available
- Soft state: State may change over time even without input (e.g., as replicas converge)
- Eventually consistent: Will become consistent eventually
Database Internals¶
Understanding how databases work internally helps with performance tuning and debugging.
Storage Engine Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ Query Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Parser │→│ Optimizer │→│ Executor │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Storage Engine │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Buffer │ │ Transaction │ │ Index │ │
│ │ Manager │ │ Manager │ │ Manager │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Disk Manager ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ ││
│ │ │ Data │ │ Index │ │ Write-Ahead Log │ ││
│ │ │ Files │ │ Files │ │ (WAL) │ ││
│ │ └──────────┘ └──────────┘ └──────────────────────┘ ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
Query Processing Pipeline¶
1. Parsing¶
Converts SQL text into an Abstract Syntax Tree (AST):
SELECT name, salary FROM employees WHERE dept = 'Engineering'
Becomes:
SelectStatement
├── columns: [name, salary]
├── from: employees
└── where: dept = 'Engineering'
2. Semantic Analysis¶
- Resolves table and column names
- Checks permissions
- Validates types
3. Query Optimization¶
The optimizer finds the most efficient execution plan:
Cost-based optimization:
- Estimates cost of different plans
- Uses statistics (row counts, value distributions)
- Considers available indexes
Optimization techniques:
- Predicate pushdown: Filter early to reduce data
- Join reordering: Choose optimal join sequence
- Index selection: Decide which indexes to use
- Subquery flattening: Convert subqueries to joins
-- Original query
SELECT * FROM orders o
WHERE o.customer_id IN (
SELECT id FROM customers WHERE country = 'USA'
);
-- Optimized (flattened)
SELECT o.* FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';
4. Execution¶
Executes the physical plan using various operators:
- Table Scan: Read all rows from table
- Index Scan: Read rows via index
- Nested Loop Join: For each outer row, scan inner table
- Hash Join: Build hash table, probe with other table
- Merge Join: Sort both tables, merge sorted results
- Sort: Order rows
- Aggregate: Compute SUM, COUNT, etc.
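The hash join in the list above is simple enough to sketch end to end: build a hash table on one input, then probe it with the other. A toy version over dict rows (real executors work on pages and spill oversized hash tables to disk):

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join in O(N + M), versus the nested loop's O(N * M)."""
    table = {}
    for row in build_rows:  # build phase: hash the (usually smaller) input
        table.setdefault(row[build_key], []).append(row)
    joined = []
    for row in probe_rows:  # probe phase: look up matches in the hash table
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"customer_id": 1, "total": 50}, {"customer_id": 1, "total": 75}]
joined = hash_join(customers, orders, "id", "customer_id")
assert [r["name"] for r in joined] == ["Alice", "Alice"]
```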
Buffer Management¶
The buffer pool caches disk pages in memory:
┌─────────────────────────────────────────┐
│ Buffer Pool (RAM) │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │Page1│ │Page2│ │Page3│ │Page4│ ... │
│ └─────┘ └─────┘ └─────┘ └─────┘ │
│ ↑ ↑ │
│ │ frequently accessed │ dirty │
└────│─────────────────────────│──────────┘
│ │
↓ ↓
┌─────────────────────────────────────────┐
│ Disk Storage │
│ ┌─────────────────────────────────┐ │
│ │ Data Files (pages on disk) │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
Page replacement policies:
- LRU (Least Recently Used): Evict oldest accessed page
- Clock: Approximation of LRU with lower overhead
- LRU-K: Consider K most recent accesses
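An LRU buffer pool fits in a few lines using an ordered map (hypothetical `BufferPool`; real pools also track pin counts and dirty flags):

```python
from collections import OrderedDict

class BufferPool:
    """Tiny LRU page cache: recently used pages stay, the oldest is evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page data, oldest first

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # cache hit: mark recently used
            return self.pages[page_id]
        data = read_from_disk(page_id)           # cache miss: fetch from disk
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        return data

reads = []
pool = BufferPool(capacity=2)
fetch = lambda pid: reads.append(pid) or f"page-{pid}"
pool.get(1, fetch); pool.get(2, fetch); pool.get(1, fetch)  # page 1 now MRU
pool.get(3, fetch)   # evicts page 2, the least recently used
pool.get(1, fetch)   # still cached: no disk read
assert reads == [1, 2, 3]
```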
Dirty pages:
- Modified pages marked "dirty"
- Written back to disk by background writer
- Or during checkpoint
Write-Ahead Logging (WAL)¶
WAL ensures durability and crash recovery:
1. Write change to WAL (on disk)
2. Modify page in buffer pool (in memory)
3. Acknowledge transaction commit
4. Later: Write dirty pages to data files (checkpoint)
Recovery process:
- On crash: Data files may be inconsistent
- Read WAL from last checkpoint
- Redo committed transactions
- Undo uncommitted transactions
-- PostgreSQL: WAL configuration
SHOW wal_level; -- minimal, replica, logical
SHOW max_wal_size; -- Maximum WAL size before checkpoint
SHOW checkpoint_timeout; -- Time between checkpoints
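The commit protocol and recovery loop above can be sketched end to end: append and fsync a log record before touching the in-memory state, then rebuild that state by replaying the log after a "crash" (hypothetical `MiniWAL`; real WALs add checkpoints, undo records, and LSNs):

```python
import json
import os
import tempfile

class MiniWAL:
    """Minimal redo-only WAL: a change is durable once its log record
    is flushed, even if the in-memory table is lost in a crash."""
    def __init__(self, path):
        self.path = path
        self.table = {}

    def set(self, key, value):
        with open(self.path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())   # step 1: the record hits disk first
        self.table[key] = value      # step 2: only then mutate in memory

    @classmethod
    def recover(cls, path):
        wal = cls(path)
        with open(path) as log:      # redo: replay records in commit order
            for line in log:
                rec = json.loads(line)
                wal.table[rec["key"]] = rec["value"]
        return wal

path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = MiniWAL(path)
db.set("x", 1)
db.set("x", 2)
recovered = MiniWAL.recover(path)   # simulate a crash: rebuild from the log
assert recovered.table == {"x": 2}
```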
Page Structure¶
Data is stored in fixed-size pages (typically 8KB):
┌─────────────────────────────────────────────────────┐
│ Page Header │
│ ┌─────────────────────────────────────────────────┐│
│ │ LSN │ Checksum │ Flags │ Free Space │ ... ││
│ └─────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────┤
│ Item Pointers (line pointers) │
│ ┌───┐┌───┐┌───┐┌───┐ │
│ │ 1 ││ 2 ││ 3 ││ 4 │→ point to tuple offsets │
│ └───┘└───┘└───┘└───┘ │
├─────────────────────────────────────────────────────┤
│ │
│ Free Space │
│ │
├─────────────────────────────────────────────────────┤
│ Tuples (row data, grows from bottom) │
│ ┌─────────────────────────────────────────────────┐│
│ │ Tuple 4 │ Tuple 3 │ Tuple 2 │ Tuple 1 │ ││
│ └─────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────┘
Heap vs Index-Organized Tables¶
Heap Table (PostgreSQL, default):
- Rows stored in insertion order
- Index points to physical location (ctid)
- Random I/O for index lookups
Index-Organized Table / Clustered Index (MySQL InnoDB, SQL Server):
- Rows stored in primary key order
- Primary key index contains row data
- Secondary indexes point to primary key (not physical location)
-- PostgreSQL: Cluster table by index (one-time reorganization)
CLUSTER employees USING idx_employees_dept;
-- MySQL InnoDB: Table is always clustered by PRIMARY KEY
CREATE TABLE employees (
id INT PRIMARY KEY, -- This IS the clustered index
name VARCHAR(100),
INDEX idx_name (name) -- Secondary index points to id
);
Sharding, Replication, and Partitioning¶
Sharding and replication in distributed databases interact with consistency models, CAP, and quorum; for the distributed-systems perspective (consensus, replication strategies, CRDTs), see Distributed Systems.
Partitioning (Single Database)¶
Divide a table into smaller pieces within one database:
Range Partitioning¶
-- PostgreSQL native partitioning
CREATE TABLE orders (
id SERIAL,
order_date DATE,
customer_id INTEGER,
total DECIMAL(10, 2)
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Queries automatically use correct partition
SELECT * FROM orders WHERE order_date = '2024-06-15';
-- Only scans orders_2024
List Partitioning¶
CREATE TABLE customers (
id SERIAL,
name VARCHAR(100),
country VARCHAR(2)
) PARTITION BY LIST (country);
CREATE TABLE customers_us PARTITION OF customers
FOR VALUES IN ('US');
CREATE TABLE customers_eu PARTITION OF customers
FOR VALUES IN ('DE', 'FR', 'IT', 'ES', 'GB');
CREATE TABLE customers_other PARTITION OF customers
DEFAULT;
Hash Partitioning¶
CREATE TABLE events (
id SERIAL,
user_id INTEGER,
event_type VARCHAR(50),
data JSONB
) PARTITION BY HASH (user_id);
CREATE TABLE events_p0 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events
FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Sharding (Multiple Databases)¶
Distribute data across multiple database instances:
┌─────────────────────────────────────────────────────────────┐
│ Application │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Shard Router / Proxy │ │
│ │ (determines which shard to query) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ │ │
┌─────────┘ │ └─────────┐
↓ ↓ ↓
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Shard 1 │ │ Shard 2 │ │ Shard 3 │
│users 1-N │ │users N+1 │ │users 2N+ │
│ │ │ to 2N │ │ │
└──────────┘ └──────────┘ └──────────┘
Sharding strategies:
-
Range-based: Users 1-1M on shard 1, 1M-2M on shard 2
- Pro: Easy range queries
- Con: Hotspots (new users always hit last shard)
-
Hash-based: shard_id = hash(user_id) % num_shards
- Pro: Even distribution
- Con: Range queries require all shards
-
Directory-based: Lookup table maps keys to shards
- Pro: Flexible
- Con: Lookup table is a bottleneck
-
Geographic: Shard by region
- Pro: Data locality
- Con: Uneven load
Challenges:
- Cross-shard queries (joins, aggregations)
- Distributed transactions
- Resharding when adding/removing shards
- Maintaining consistency
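A hash-based router from the strategies above fits in a few lines (hypothetical `shard_for`; md5 rather than Python's built-in `hash()` keeps routing stable across processes and restarts). It also shows the resharding problem: changing the shard count remaps most keys, which is why production systems often use consistent hashing instead.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(user_id: int) -> str:
    """Route a key to a shard: stable hash of the key, modulo shard count."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard
assert shard_for(42) == shard_for(42)
# Keys spread roughly evenly: 1000 users land on all shards
assert len({shard_for(i) for i in range(1000)}) == len(SHARDS)
```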
Replication¶
Copy data across multiple nodes for availability and read scaling:
Single-Leader (Primary-Replica)¶
┌─────────────┐
│ Primary │←── All writes
│ (Leader) │
└─────────────┘
│
│ Replication (sync or async)
↓
┌─────────────┐ ┌─────────────┐
│ Replica 1 │ │ Replica 2 │←── Reads distributed
│ (Follower) │ │ (Follower) │
└─────────────┘ └─────────────┘
Synchronous replication:
- Write waits for replica confirmation
- Strong consistency
- Higher latency, lower throughput
Asynchronous replication:
- Write returns immediately
- Potential data loss on primary failure
- Lower latency, higher throughput
-- PostgreSQL streaming replication (async)
-- On primary:
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 3;
-- Create replication user
CREATE USER replicator REPLICATION LOGIN PASSWORD 'secret';
-- On replica:
-- pg_basebackup -h primary -U replicator -D /var/lib/postgresql/data -P
Multi-Leader¶
Multiple nodes accept writes, sync with each other:
┌─────────────┐ ┌─────────────┐
│ Leader 1 │←────→│ Leader 2 │
│ (Region A) │ │ (Region B) │
└─────────────┘ └─────────────┘
↑ ↑
│ │
Writes A Writes B
Conflict resolution:
- Last-write-wins (timestamp-based)
- Custom merge logic
- User intervention
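Last-write-wins is the simplest resolver to sketch: tag each replica's write with a timestamp and keep the newer one, breaking ties deterministically (hypothetical `lww_merge`; note that LWW silently discards one of two concurrent writes, which may be unacceptable for your data):

```python
def lww_merge(a, b):
    """Last-write-wins: each write is a (timestamp, node_id, value) tuple.
    Tuple comparison keeps the newer timestamp; node_id breaks exact ties."""
    return max(a, b)

replica_a = (110, "region-a", {"name": "Alice J."})
replica_b = (105, "region-b", {"name": "Alice"})
winner = lww_merge(replica_a, replica_b)
assert winner[2] == {"name": "Alice J."}   # region-b's concurrent write is lost
```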
Leaderless¶
Any node accepts reads/writes:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
└─────────┘ └─────────┘ └─────────┘
↑ ↑ ↑
└────────────┼────────────┘
│
Client (writes to multiple)
Quorum reads/writes:
- N = total nodes, W = write quorum, R = read quorum
- If W + R > N, reads see latest write
- Example: N=3, W=2, R=2 (tolerates 1 failure)
Write to nodes 1, 2 (W=2) ✓
Read from nodes 2, 3 (R=2) → at least one has latest
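The overlap guarantee can be checked directly: if W + R > N, every possible read quorum shares at least one node with every possible write quorum, so some node in the read set holds the latest write. A small sketch:

```python
from itertools import combinations

def quorum_overlaps(n, w, r):
    """W + R > N guarantees read and write quorums always intersect."""
    return w + r > n

assert quorum_overlaps(3, 2, 2)       # classic setup: tolerates 1 node failure
assert not quorum_overlaps(3, 1, 1)   # fast, but reads may be stale

# Verify by construction for N=3, W=2, R=2: every 2-node write set
# intersects every 2-node read set.
nodes = {1, 2, 3}
assert all(set(ws) & set(rs)
           for ws in combinations(nodes, 2)
           for rs in combinations(nodes, 2))
```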
Object-Relational Mappers (ORMs)¶
ORMs bridge the gap between object-oriented code and relational databases.
SQLAlchemy (Python)¶
The most popular Python ORM:
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, DateTime
from sqlalchemy.orm import declarative_base, relationship, sessionmaker
from datetime import datetime
# Setup
engine = create_engine('postgresql://user:pass@localhost/db', echo=True)
Base = declarative_base()
Session = sessionmaker(bind=engine)
# Model definitions
class User(Base):
__tablename__ = 'users'
id = Column(Integer, primary_key=True)
name = Column(String(100), nullable=False)
email = Column(String(255), unique=True, nullable=False)
created_at = Column(DateTime, default=datetime.utcnow)
# Relationships
orders = relationship('Order', back_populates='user', lazy='select')
def __repr__(self):
return f"<User(name='{self.name}', email='{self.email}')>"
class Order(Base):
__tablename__ = 'orders'
id = Column(Integer, primary_key=True)
user_id = Column(Integer, ForeignKey('users.id'), nullable=False)
total = Column(Integer) # cents
status = Column(String(20), default='pending')
created_at = Column(DateTime, default=datetime.utcnow)
user = relationship('User', back_populates='orders')
items = relationship('OrderItem', back_populates='order', cascade='all, delete-orphan')
# Create tables
Base.metadata.create_all(engine)
# Using the ORM
session = Session()
# Create
user = User(name='Alice', email='alice@example.com')
session.add(user)
session.commit()
# Read
user = session.query(User).filter_by(email='alice@example.com').first()
users = session.query(User).filter(User.name.like('%ali%')).all()
# Update
user.name = 'Alice Johnson'
session.commit()
# Delete
session.delete(user)
session.commit()
# Complex queries
from sqlalchemy import func
# Join
results = session.query(User, Order).join(Order).filter(Order.total > 10000).all()
# Aggregation
stats = session.query(
User.id,
User.name,
func.count(Order.id).label('order_count'),
func.sum(Order.total).label('total_spent')
).outerjoin(Order).group_by(User.id).all()
# Eager loading (prevent N+1)
from sqlalchemy.orm import joinedload, selectinload
users = session.query(User).options(
joinedload(User.orders).joinedload(Order.items)
).all()
# Subqueries
from sqlalchemy import select
# Correlated scalar subquery: each user's largest order total
max_order = select(func.max(Order.total)).where(Order.user_id == User.id).scalar_subquery()
user_max_totals = session.query(User.name, max_order.label('max_order')).all()
# IN subquery: users with at least one large order
rich_users = session.query(User).filter(User.id.in_(
    select(Order.user_id).where(Order.total > 10000)
)).all()
Query Builders vs ORMs¶
Query Builder (more SQL-like):
# SQLAlchemy Core
from sqlalchemy import select, insert, update, delete
stmt = select(users.c.name, users.c.email).where(users.c.id > 10)
result = connection.execute(stmt)
stmt = insert(users).values(name='Bob', email='bob@example.com')
connection.execute(stmt)
When to use what:
| Use Case | Recommendation |
|---|---|
| Simple CRUD with objects | ORM |
| Complex queries, reports | Query builder or raw SQL |
| Bulk operations | Raw SQL or COPY |
| Need full SQL control | Query builder or raw SQL |
| Rapid prototyping | ORM |
ORM Best Practices¶
1. Understand lazy loading:
# N+1 problem
users = session.query(User).all()
for user in users:
print(user.orders) # Each triggers a query!
# Solution: eager loading
users = session.query(User).options(selectinload(User.orders)).all()
2. Use transactions explicitly:
try:
session.begin()
# operations
session.commit()
except Exception:
session.rollback()
raise
finally:
session.close()
# Or use context manager
with Session() as session:
with session.begin():
# operations auto-commit on success, rollback on exception
3. Batch operations:
# Slow: one query per insert
for item in items:
session.add(Item(**item))
session.commit()
# Fast: bulk insert
session.bulk_insert_mappings(Item, items)
session.commit()
# Fastest: raw COPY (PostgreSQL)
from io import StringIO
import csv
buffer = StringIO()
writer = csv.writer(buffer)
writer.writerows([(i['name'], i['value']) for i in items])
buffer.seek(0)
conn = engine.raw_connection()  # raw DBAPI connection (not a context manager)
try:
    with conn.cursor() as cur:
        cur.copy_from(buffer, 'items', columns=('name', 'value'), sep=',')
    conn.commit()
finally:
    conn.close()
Database Security¶
Authentication and Authorization¶
-- Create users with limited privileges
CREATE USER app_user WITH PASSWORD 'secure_password';
CREATE USER readonly_user WITH PASSWORD 'another_password';
-- Grant specific permissions
GRANT CONNECT ON DATABASE myapp TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_user;
GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA public TO app_user;
-- Read-only access
GRANT CONNECT ON DATABASE myapp TO readonly_user;
GRANT USAGE ON SCHEMA public TO readonly_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly_user;
-- Row-level security (PostgreSQL)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY orders_user_policy ON orders
FOR ALL
TO app_user
USING (user_id = current_setting('app.current_user_id')::INTEGER);
-- Set context before queries
SET app.current_user_id = '123';
SELECT * FROM orders; -- Only sees orders for user 123
SQL Injection Prevention¶
Vulnerable code:
# NEVER DO THIS
query = f"SELECT * FROM users WHERE email = '{user_input}'"
# Input: "'; DROP TABLE users; --"
# Results in: SELECT * FROM users WHERE email = ''; DROP TABLE users; --'
Safe code (parameterized queries):
# SQLAlchemy ORM (automatically parameterized)
user = session.query(User).filter_by(email=user_input).first()
# Raw SQL with parameters
result = connection.execute(
text("SELECT * FROM users WHERE email = :email"),
{"email": user_input}
)
# psycopg2 with parameters
cursor.execute("SELECT * FROM users WHERE email = %s", (user_input,))
Encryption¶
At rest:
-- PostgreSQL: Encrypt column data
CREATE EXTENSION pgcrypto;
-- Store encrypted
INSERT INTO users (email, ssn_encrypted)
VALUES ('alice@example.com', pgp_sym_encrypt('123-45-6789', 'encryption_key'));
-- Read decrypted
SELECT email, pgp_sym_decrypt(ssn_encrypted, 'encryption_key') as ssn
FROM users;
In transit:
# PostgreSQL: Require SSL
# postgresql.conf
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'
# pg_hba.conf: Require SSL for connections
hostssl all all 0.0.0.0/0 scram-sha-256
Connection string:
engine = create_engine(
'postgresql://user:pass@host/db?sslmode=require'
)
Audit Logging¶
-- PostgreSQL: Audit trigger
CREATE TABLE audit_log (
id SERIAL PRIMARY KEY,
table_name TEXT,
operation TEXT,
old_data JSONB,
new_data JSONB,
changed_by TEXT,
changed_at TIMESTAMP DEFAULT NOW()
);
CREATE OR REPLACE FUNCTION audit_trigger()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'DELETE' THEN
INSERT INTO audit_log (table_name, operation, old_data, changed_by)
VALUES (TG_TABLE_NAME, 'DELETE', to_jsonb(OLD), current_user);
RETURN OLD;
ELSIF TG_OP = 'UPDATE' THEN
INSERT INTO audit_log (table_name, operation, old_data, new_data, changed_by)
VALUES (TG_TABLE_NAME, 'UPDATE', to_jsonb(OLD), to_jsonb(NEW), current_user);
RETURN NEW;
ELSIF TG_OP = 'INSERT' THEN
INSERT INTO audit_log (table_name, operation, new_data, changed_by)
VALUES (TG_TABLE_NAME, 'INSERT', to_jsonb(NEW), current_user);
RETURN NEW;
END IF;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER users_audit
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE FUNCTION audit_trigger();
Performance Tuning¶
Identifying Slow Queries¶
-- PostgreSQL: Enable query logging
ALTER SYSTEM SET log_min_duration_statement = 1000; -- Log queries > 1 second
SELECT pg_reload_conf();
-- View slow queries (requires the pg_stat_statements extension)
SELECT
query,
calls,
total_exec_time / 1000 as total_seconds,
mean_exec_time as avg_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Active queries
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
Connection Pooling¶
Database connections are expensive. Use a pool:
# SQLAlchemy connection pooling
engine = create_engine(
'postgresql://user:pass@localhost/db',
pool_size=20, # Maintained connections
max_overflow=10, # Extra connections when busy
pool_timeout=30, # Wait time for connection
pool_recycle=1800, # Recycle connections after 30 min
pool_pre_ping=True # Verify connection health
)
External poolers:
- PgBouncer: Lightweight connection pooler for PostgreSQL
- pgpool-II: Connection pool + load balancing + replication
# pgbouncer.ini
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; pool_mode: transaction, session, or statement
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
Caching Strategies¶
Query cache:
import hashlib
import json
from functools import wraps

import redis

redis_client = redis.Redis()

def cache_query(ttl=300):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Build a key that is stable across processes and restarts.
            # Note: built-in hash() is randomized per interpreter for strings,
            # so it would produce different keys on every restart.
            key_material = repr((args, sorted(kwargs.items())))
            digest = hashlib.sha256(key_material.encode()).hexdigest()
            cache_key = f"{func.__name__}:{digest}"
            # Check cache
            cached = redis_client.get(cache_key)
            if cached is not None:
                return json.loads(cached)
            # Execute query
            result = func(*args, **kwargs)
            # Cache result (must be JSON-serializable)
            redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

@cache_query(ttl=60)
def get_user_orders(user_id):
    # Return plain dicts, not ORM objects, so the result can be serialized
    orders = session.query(Order).filter_by(user_id=user_id).all()
    return [{'id': o.id, 'total': float(o.total)} for o in orders]
Materialized views:
-- Pre-compute expensive aggregations
CREATE MATERIALIZED VIEW user_statistics AS
SELECT
u.id,
u.name,
COUNT(o.id) as order_count,
SUM(o.total) as total_spent,
MAX(o.created_at) as last_order
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;
CREATE UNIQUE INDEX ON user_statistics (id);
-- Refresh periodically (CONCURRENTLY requires the unique index above)
REFRESH MATERIALIZED VIEW CONCURRENTLY user_statistics;
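In application terms, the view above precomputes the equivalent of this aggregation. A sketch over plain dicts standing in for the `users` and `orders` rows (note that, like SQL's `SUM` over zero rows, `total_spent` is `None` for users with no orders, while `COUNT` yields 0):

```python
from collections import defaultdict

def compute_user_statistics(users, orders):
    """Aggregate orders per user, mirroring the user_statistics view."""
    by_user = defaultdict(list)
    for o in orders:
        by_user[o['user_id']].append(o)
    stats = []
    for u in users:
        rows = by_user.get(u['id'], [])   # LEFT JOIN: users with no orders kept
        stats.append({
            'id': u['id'],
            'name': u['name'],
            'order_count': len(rows),
            'total_spent': sum(o['total'] for o in rows) if rows else None,
            'last_order': max((o['created_at'] for o in rows), default=None),
        })
    return stats
```

The point of the materialized view is that this whole loop runs once at refresh time instead of on every read.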
Vacuuming (PostgreSQL)¶
Under MVCC, every UPDATE and DELETE leaves behind dead row versions that must be cleaned up:
-- Manual vacuum
VACUUM orders; -- Reclaim space, update visibility
VACUUM FULL orders; -- Reclaim space, rewrite table (locks!)
VACUUM ANALYZE orders; -- Also update statistics
-- Monitor bloat
SELECT
schemaname, tablename,
n_dead_tup as dead_tuples,
n_live_tup as live_tuples,
round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Autovacuum settings
ALTER TABLE high_churn_table SET (
autovacuum_vacuum_scale_factor = 0.05, -- Vacuum when 5% dead (default 20%)
autovacuum_analyze_scale_factor = 0.02
);
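The scale factor feeds PostgreSQL's documented autovacuum trigger formula: vacuum runs once dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples`. A quick sketch of the arithmetic, using the documented defaults (threshold 50, scale factor 0.2):

```python
def autovacuum_trigger_point(reltuples, scale_factor=0.2, base_threshold=50):
    """Dead-tuple count at which autovacuum kicks in for a table."""
    return base_threshold + scale_factor * reltuples

# Defaults: a 1M-row table accumulates ~200,050 dead tuples before autovacuum runs
default = autovacuum_trigger_point(1_000_000)
# With the 0.05 setting above, it runs after ~50,050 instead
tuned = autovacuum_trigger_point(1_000_000, scale_factor=0.05)
```

This is why high-churn tables benefit from a lower scale factor: with the default 20%, a large table can carry enormous bloat before autovacuum ever triggers.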
Backup, Recovery, and High Availability¶
Backup Strategies¶
Logical Backups¶
Export data as SQL or another portable format:
# PostgreSQL pg_dump
pg_dump -h localhost -U user -d mydb > backup.sql
pg_dump -Fc mydb > backup.dump # Custom format (compressed)
pg_dump -Fd mydb -j 4 -f backup_dir # Directory format (parallel)
# Restore
psql -d mydb < backup.sql
pg_restore -d mydb backup.dump
pg_restore -d mydb -j 4 backup_dir # Parallel restore
# MySQL mysqldump
mysqldump -u user -p mydb > backup.sql
mysqldump --single-transaction mydb > backup.sql # Consistent snapshot
Pros:
- Portable between versions
- Selective backup (tables, schemas)
- Human-readable (SQL format)
Cons:
- Slow for large databases
- Restore takes time
Physical Backups¶
Copy data files directly:
# PostgreSQL pg_basebackup
pg_basebackup -h localhost -U replicator -D /backup/base -Fp -Xs -P
# With compression (tar to stdout is incompatible with WAL streaming, so use -X fetch)
pg_basebackup -h localhost -U replicator -D - -Ft -X fetch | gzip > backup.tar.gz
Pros:
- Fast backup and restore
- Includes all data
Cons:
- Version-specific
- Large size
Point-in-Time Recovery (PITR)¶
Restore the database to any point in time by replaying archived WAL:
# Setup: Enable WAL archiving
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /archive/%f'
# Recovery: restore the base backup, then replay WAL
# Create an empty recovery.signal file in the data directory (PostgreSQL 12+)
# and set in postgresql.conf:
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
High Availability Architectures¶
Primary-Standby with Automatic Failover¶
┌─────────────┐           ┌─────────────┐
│   Primary   │──WAL────→│   Standby   │
└─────────────┘           └─────────────┘
       ↑                         ↑
       │                         │
       └────────────┬────────────┘
                    │
            ┌──────────────┐
            │   Failover   │  (Patroni, repmgr, pg_auto_failover)
            │   Manager    │
            └──────────────┘
Tools:
- Patroni: Template for high-availability PostgreSQL
- repmgr: Replication manager for PostgreSQL
- pg_auto_failover: Automated failover for PostgreSQL
Load Balancing¶
              ┌─────────────┐
              │  HAProxy /  │
              │  PgBouncer  │
              └─────────────┘
                     │
      ┌──────────────┼──────────────┐
      ↓              ↓              ↓
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Primary  │   │ Replica1 │   │ Replica2 │
│ (writes) │   │ (reads)  │   │ (reads)  │
└──────────┘   └──────────┘   └──────────┘
# HAProxy configuration
frontend postgres
    mode tcp
    bind *:5432
    default_backend postgres_primary

backend postgres_primary
    mode tcp
    option pgsql-check user haproxy
    server primary 10.0.0.1:5432 check

backend postgres_replicas
    mode tcp
    balance roundrobin
    option pgsql-check user haproxy
    server replica1 10.0.0.2:5432 check
    server replica2 10.0.0.3:5432 check
Disaster Recovery¶
Recovery objectives:
- RTO (Recovery Time Objective): Maximum acceptable downtime
- RPO (Recovery Point Objective): Maximum acceptable data loss
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Daily backups | Hours | 24 hours | Low |
| Continuous WAL archiving | Minutes | Minutes | Medium |
| Synchronous standby | Seconds | Zero | High |
| Multi-region active-active | Near-zero | Near-zero | Very high |
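As a rough worked example of how these objectives trade off, assume backups run every `interval` seconds and a restore streams the backup back at a fixed throughput (illustrative numbers, not benchmarks):

```python
def worst_case_rpo_seconds(backup_interval_s):
    """Worst case: failure just before the next backup loses a full interval."""
    return backup_interval_s

def estimated_rto_seconds(backup_size_gb, restore_gb_per_min):
    """Restore time is dominated by streaming the backup back in."""
    return backup_size_gb / restore_gb_per_min * 60

# Daily backups of a 500 GB database restored at 2 GB/min:
rpo = worst_case_rpo_seconds(24 * 3600)   # up to 86,400 s (24 h) of lost writes
rto = estimated_rto_seconds(500, 2)       # 15,000 s (~4.2 h) to restore
```

Numbers like these are what justify the cost column: moving from daily backups to continuous WAL archiving shrinks the RPO from a day to minutes without changing the backup size at all.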
Backup retention policy:
# Example: Grandfather-father-son rotation
# Daily: Keep 7 days
# Weekly: Keep 4 weeks
# Monthly: Keep 12 months
# Yearly: Keep 7 years
find /backups/daily -mtime +7 -delete
find /backups/weekly -mtime +28 -delete
# etc.
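The rotation above reduces to a small retention check per tier; a sketch assuming backup ages are given in days (the `find -mtime +N -delete` commands implement the same cutoffs):

```python
def keep_backup(tier, age_days):
    """Return True if a backup in the given tier is still within retention."""
    limits = {
        'daily': 7,         # keep 7 days
        'weekly': 28,       # keep 4 weeks
        'monthly': 365,     # keep 12 months
        'yearly': 7 * 365,  # keep 7 years
    }
    return age_days <= limits[tier]
```

Grandfather-father-son works because each tier's retention window is longer than the next tier's backup interval, so there is never a gap in coverage as older backups expire.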
Summary and Best Practices¶
Choosing a Database¶
| Requirement | Recommendation |
|---|---|
| Strong consistency, complex queries | PostgreSQL, MySQL |
| Flexible schema, rapid development | MongoDB |
| High write throughput, time-series | Cassandra, TimescaleDB |
| Caching, sessions, real-time | Redis |
| Graph relationships | Neo4j |
| Full-text search | Elasticsearch, PostgreSQL FTS |
| AI/ML similarity search | pgvector, Pinecone |
General Best Practices¶
- Design schemas thoughtfully: Normalize appropriately, denormalize for performance where needed.
- Use appropriate indexes: Index frequently queried columns, but don't over-index.
- Write efficient queries: Use EXPLAIN, avoid N+1 queries, prefer joins over subqueries.
- Manage connections: Use connection pooling, close connections properly.
- Secure your data: Use parameterized queries, encrypt sensitive data, implement least-privilege access.
- Plan for scale: Consider partitioning, replication, and caching strategies early.
- Back up regularly: Test your backups by actually restoring them.
- Monitor performance: Track slow queries, connection counts, disk usage.
- Keep statistics updated: Run ANALYZE regularly for optimal query plans.
- Use transactions appropriately: Understand isolation levels and their trade-offs.
Performance Checklist¶
- [ ] Queries using indexes (check with EXPLAIN)
- [ ] No N+1 queries (use eager loading)
- [ ] Connection pooling configured
- [ ] Query cache for frequently accessed data
- [ ] Appropriate indexes on foreign keys
- [ ] Statistics up to date (ANALYZE)
- [ ] Dead rows cleaned up (VACUUM)
- [ ] Slow query logging enabled
- [ ] Monitoring dashboards in place