
Mathematical Foundations

Number Systems

Computers use different number systems to represent data, each with specific use cases:

  • Binary (Base-2): Uses 0s and 1s (bits). Each bit represents a power of 2 (e.g., 1011 = \(1 \\cdot 2^3 + 0 \\cdot 2^2 + 1 \\cdot 2^1 + 1 \\cdot 2^0 = 8 + 0 + 2 + 1 = 11\) decimal). Binary is the native format for CPU registers, memory, and instructions.
  • Decimal (Base-10): Human-readable, used in user interfaces and financial calculations. Computers convert decimal inputs to binary for processing.
  • Hexadecimal (Base-16): Uses 0-9 and A-F (e.g., 2F = \(2 \\cdot 16^1 + 15 \\cdot 16^0 = 47\) decimal). Compact for representing binary (1 hex digit = 4 bits), common in memory addresses and debugging.
  • Octal (Base-8): Less common, used in legacy systems or Unix permissions (e.g., 777 octal for file permissions).
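As a minimal sketch of these base conversions, Python's built-in `int()` and `format()` can translate between all four systems (names here are illustrative):

```python
# Convert decimal 47 to each base using Python's built-in formatting.
n = 47
binary = format(n, "b")   # "101111"  (32 + 8 + 4 + 2 + 1 = 47)
hexa = format(n, "x")     # "2f"      (2*16 + 15 = 47)
octal = format(n, "o")    # "57"      (5*8 + 7 = 47)

# Parse back from each base with int(text, base).
assert int(binary, 2) == n
assert int(hexa, 16) == n
assert int(octal, 8) == n
print(binary, hexa, octal)
```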

Binary Number System

Basics: Bits, Bytes, and Positional Notation

  • Bits and Grouping:
    • A bit (binary digit) is the smallest unit, representing 0 or 1.
    • Bits are grouped for efficiency:
      • Nibble: 4 bits (e.g., 1011).
      • Byte: 8 bits (e.g., 10110011), the standard unit for memory addressing.
      • Word: Typically 16, 32, or 64 bits, depending on architecture (e.g., 64-bit in modern x86 CPUs).
    • MSB (Most Significant Bit): The leftmost bit, with the highest positional value.
    • LSB (Least Significant Bit): The rightmost bit, with the lowest value.

Conversions

  • Binary to Decimal: Sum the positional values where bits are 1.

  • Decimal to Binary: Repeatedly divide by 2, recording remainders (LSB first).

    • Example: Convert 13 to binary:
      13 ÷ 2 = 6 remainder 1
      6 ÷ 2 = 3 remainder 0
      3 ÷ 2 = 1 remainder 1
      1 ÷ 2 = 0 remainder 1
      Read remainders bottom-up: 1101.
  • Binary to Hex/Octal: Group bits (4 for hex, 3 for octal) and convert each group.

    • Example: Binary 10110011 → group as 1011 0011 → B (hex) and 3 (hex) → B3 hex.
    • Octal: pad to 010 110 011 → 2, 6, 3 → 263 octal.
  • Binary Fractions: Extend positional notation to the right of the binary point with negative powers (e.g., 0.101 = \(1 \\cdot 2^{-1} + 0 \\cdot 2^{-2} + 1 \\cdot 2^{-3} = 0.5 + 0 + 0.125 = 0.625\) decimal).

    • Decimal to binary fraction: Multiply fractional part by 2, record integer part (repeat until 0 or desired precision).
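The multiply-by-2 procedure above can be sketched in a few lines of Python (function name is illustrative):

```python
def frac_to_binary(frac, bits=8):
    """Convert a fraction in [0, 1) to a binary string by repeated doubling:
    multiply by 2, record the integer part, keep the fractional part."""
    digits = []
    for _ in range(bits):
        frac *= 2
        digits.append(str(int(frac)))
        frac -= int(frac)
        if frac == 0:       # terminated exactly
            break
    return "0." + "".join(digits)

print(frac_to_binary(0.625))  # 0.101
```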

Signed Binary Representations

Computers handle negative numbers using fixed bit widths (e.g., 8-bit, 16-bit).

  • Sign-Magnitude: MSB is sign (0=positive, 1=negative); rest is magnitude.
    • Example (4-bit): +3 = 0011, -3 = 1011.
    • Issues: Two zeros (+0=0000, -0=1000), complex arithmetic.
  • One's Complement: Negate by inverting bits.
    • Example (4-bit): +3 = 0011, -3 = ~0011 = 1100.
    • Issues: Still two zeros, end-around carry in addition.
  • Two's Complement (Most Common): Negate by inverting bits and adding 1.
    • Example (4-bit): +3 = 0011, -3 = ~0011 + 1 = 1100 + 1 = 1101.
    • Range (n bits): \(-2^{n-1}\) to \(2^{n-1}-1\) (e.g., 4-bit: -8 to 7).
    • Advantages: Single zero, seamless arithmetic (addition works for signed/unsigned).
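A quick sketch of two's-complement encoding and decoding at a fixed width, using Python's arbitrary-precision integers (helper names are illustrative):

```python
def twos_complement(value, bits=4):
    """Return the two's-complement encoding of value as a bit string."""
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def from_twos_complement(bit_str):
    """Decode a two's-complement bit string back to a signed integer."""
    n = int(bit_str, 2)
    if bit_str[0] == "1":           # MSB set -> negative
        n -= 1 << len(bit_str)
    return n

print(twos_complement(-3, 4))       # 1101
print(from_twos_complement("1101")) # -3
```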

Little-Endian and Big-Endian Representations

  • Big-Endian:

    • The most significant byte (MSB) is stored at the lowest memory address.
    • Matches human-readable order (e.g., writing 1234 from left to right).
    • Example: 32-bit integer 0x12345678 (decimal 305,419,896) in big-endian:
      • Memory layout (addresses increasing left to right): 12 34 56 78.
      • At address 0x1000: 12, 0x1001: 34, 0x1002: 56, 0x1003: 78.
  • Little-Endian:

    • The least significant byte (LSB) is stored at the lowest memory address.
    • Used by x86 and (in its default configuration) ARM.
    • Example: the same 32-bit integer 0x12345678 in little-endian:
      • Memory layout (addresses increasing left to right): 78 56 34 12.
      • At address 0x1000: 78, 0x1001: 56, 0x1002: 34, 0x1003: 12.
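Python's `struct` module makes the two byte orders easy to observe directly:

```python
import struct

value = 0x12345678
big = struct.pack(">I", value)     # ">" = big-endian: MSB first
little = struct.pack("<I", value)  # "<" = little-endian: LSB first

assert big == bytes([0x12, 0x34, 0x56, 0x78])
assert little == bytes([0x78, 0x56, 0x34, 0x12])
print(big.hex(), little.hex())     # 12345678 78563412
```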

Binary Arithmetic in Depth

  • Addition: Bit-by-bit with carry (e.g., 101 + 11 = 1000, carry propagates).
  • Subtraction: Add the two's complement of the subtrahend.
  • Multiplication: Shift and add (e.g., 101 × 10: bit 1 of the multiplier contributes 101 << 1 = 1010, bit 0 contributes nothing, so the result is 1010, i.e., 5 × 2 = 10).
  • Division: Shift and subtract, or use repeated subtraction.
  • Hardware handles these in the ALU, but software can emulate for big integers.
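As a sketch of the software emulation mentioned above, shift-and-add multiplication in plain Python (function name is illustrative):

```python
def multiply(a, b):
    """Shift-and-add multiplication: for each set bit of b,
    add a correspondingly shifted copy of a."""
    result = 0
    while b:
        if b & 1:        # lowest remaining bit of b is set
            result += a
        a <<= 1          # shift a left (multiply by 2)
        b >>= 1          # move to the next bit of b
    return result

print(multiply(5, 2))    # 10
```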

Bitwise Operations

Bitwise operations manipulate individual bits, offering fast, memory-efficient alternatives to arithmetic:

  • AND (&): Sets a bit to 1 if both inputs are 1 (e.g., 101 & 110 = 100).
  • OR (|): Sets a bit to 1 if either input is 1 (e.g., 101 | 110 = 111).
  • XOR (^): Sets a bit to 1 if inputs differ (e.g., 101 ^ 110 = 011).
  • NOT (~): Inverts bits (e.g., ~101 = 010 in 3-bit).
  • Shifts (<<, >>): Move bits left or right (e.g., 101 << 1 = 1010, doubling the value; 101 >> 1 = 010, halving it).

Use Cases:

  • Flags: Store multiple boolean values in one byte (e.g., permissions: read = 001, write = 010).
  • Optimization: Check if a number is odd (x & 1) faster than modulo (x % 2).
  • Masking: Extract specific bits (e.g., value & 0xFF gets the lowest 8 bits).

Floating-Point Arithmetic

IEEE 754 Representation

The IEEE 754 standard defines how floating-point numbers are stored, typically in 32-bit (single precision) or 64-bit (double precision) formats.

  • Components:

    • Sign Bit: 1 bit (0 = positive, 1 = negative).
    • Exponent: Stored with a bias (e.g., 127 for single precision) to represent positive and negative powers of 2.
    • Mantissa (Fraction): The significant digits, normalized to 1.xxxx (implicit leading 1 not stored).
  • Single Precision (32-bit):

    • Sign: 1 bit.
    • Exponent: 8 bits (biased by 127, range -126 to +127).
    • Mantissa: 23 bits (plus implicit 1, giving 24-bit precision).
    • Example:

      3.14 ≈ \(1.57 \\times 2^1\):

      - Sign: 0 (positive).
      - Exponent: 1 + 127 = 128 = 10000000 (binary).
      - Mantissa: fractional bits of 1.57 (normalized), rounded to 23 bits: 10010001111010111000011.
      - Stored as: 0 10000000 10010001111010111000011 (hex 0x4048F5C3).
    
  • Double Precision (64-bit):

    • Sign: 1 bit.
    • Exponent: 11 bits (biased by 1023, range -1022 to +1023).
    • Mantissa: 52 bits (plus implicit 1).
    • Offers higher precision and range but uses more memory and CPU cycles.
  • Special Values:

    • Zero: Exponent and mantissa all 0 (sign bit determines +0 or -0).
    • Infinity: Exponent all 1s, mantissa 0 (e.g., result of 1/0).
    • NaN (Not a Number): Exponent all 1s, non-zero mantissa (e.g., 0/0, sqrt(-1)).
    • Subnormal Numbers: Exponent all 0s, non-zero mantissa, for very small numbers (lose precision).
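A small sketch that reinterprets a float's bits to expose the three fields described above (assumes single precision via Python's `struct`; function name is illustrative):

```python
import struct

def float_fields(x):
    """Unpack x as IEEE 754 single precision and return
    (sign, biased exponent, mantissa) as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31            # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits (biased by 127)
    mantissa = bits & 0x7FFFFF       # low 23 bits (implicit leading 1 not stored)
    return sign, exponent, mantissa

print(float_fields(1.0))             # (0, 127, 0): 1.0 = +1.0 x 2^(127-127)
print(float_fields(float("inf")))    # (0, 255, 0): exponent all 1s, mantissa 0
```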

Floating-Point Operations

The CPU’s FPU performs operations like addition, subtraction, multiplication, division, and square root:

  • Addition/Subtraction: Align exponents (shift smaller number’s mantissa), add/subtract mantissas, normalize result, and adjust exponent.
    • Example: 1.5 + 2.5 = 4.0 involves aligning \(1.5 \\times 2^0\) and \(1.25 \\times 2^1\), adding mantissas, and normalizing.
  • Multiplication: Multiply mantissas, add exponents (subtract bias), normalize.
  • Division: Divide mantissas, subtract exponents, normalize.
  • Rounding: Results are rounded to fit the mantissa (e.g., round-to-nearest, round-to-zero), causing precision loss.

Performance: Floating-point operations are slower than integer ones (e.g., 2-5 cycles for addition vs. 1 for integer addition) due to complex normalization and rounding. SIMD instructions (e.g., SSE/AVX on x86) accelerate vectorized floating-point operations. Why it matters: Operations can amplify precision errors, and their cost impacts performance in math-heavy applications like graphics or machine learning.

Precision and Errors

Floating-point numbers are approximations, leading to errors:

  • Rounding Errors: Finite mantissa bits cause small inaccuracies (e.g., 0.1 + 0.2 = 0.30000000000000004 in double precision).
  • Catastrophic Cancellation: Subtracting nearly equal numbers loses significant digits (e.g., 1.234567 - 1.234566 = 0.000001, but precision may drop).
  • Overflow/Underflow: Results too large/small yield infinity or subnormal numbers.
  • Non-Associativity: (a + b) + c ≠ a + (b + c) due to rounding (e.g., adding many small numbers to a large one).

Mitigation:

  • Use double precision for higher accuracy (but more memory).
  • Use fixed-point or decimal arithmetic (e.g., Python’s decimal, Java’s BigDecimal) for exact calculations (e.g., finance).
  • Rearrange computations to minimize cancellation (e.g., add small numbers first).
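The rounding error and the decimal-arithmetic mitigation are both easy to demonstrate with Python's standard `decimal` module:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
print(0.1 + 0.2)                  # 0.30000000000000004
assert 0.1 + 0.2 != 0.3

# Decimal arithmetic keeps base-10 values exact -- suited to finance.
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```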

Performance Considerations

  • Hardware Support: Modern CPUs have dedicated FPUs or SIMD units (e.g., Intel’s AVX-512 for 16 parallel float operations). Use compiler flags (e.g., -mavx) or intrinsics for optimization.
  • Memory Usage: Single precision (4 bytes) vs. double (8 bytes) affects cache efficiency and bandwidth.
  • Vectorization: Libraries like NumPy use SIMD for fast array operations.
  • Software Fallbacks: Without FPU support (e.g., some embedded systems), floating-point is emulated, slowing performance.

Binary-Coded Decimal (BCD)

BCD encodes decimal digits (0-9) using 4 bits per digit, ensuring exact decimal representation:

  • Representation:
    • Decimal 42 = 0100 0010 (4 = 0100, 2 = 0010).
    • Packed BCD: Two digits per byte (e.g., 01000010 for 42).
    • Unpacked BCD: One digit per byte (e.g., 00000100 00000010).
  • Advantages: Avoids floating-point errors, simplifies decimal display (e.g., calculators).
  • Disadvantages: Less efficient (e.g., 9999 needs 16 bits in BCD vs. 14 in binary).
  • Use Cases: Financial systems, embedded displays, legacy hardware (e.g., x86 BCD instructions like DAA).
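A minimal sketch of packed-BCD encoding, two decimal digits per byte as described above (function name is illustrative):

```python
def to_packed_bcd(n):
    """Encode a non-negative decimal integer as packed BCD bytes,
    two decimal digits per byte (high nibble first)."""
    digits = [int(d) for d in str(n)]
    if len(digits) % 2:                 # pad to an even digit count
        digits.insert(0, 0)
    return bytes((digits[i] << 4) | digits[i + 1]
                 for i in range(0, len(digits), 2))

print(to_packed_bcd(42).hex())    # 42  (one byte, 0x42)
print(to_packed_bcd(1234).hex())  # 1234
```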

Boolean Algebra

Boolean algebra is the mathematical language of digital logic. It operates on binary values (0 = false, 1 = true) using three core operations: AND, OR, and NOT. Every CPU instruction, memory flag, and hardware gate is built from Boolean functions. Mastering it lets you write branchless, cache-friendly, and ultra-fast code, debug low-level systems, and even design custom hardware.

Formal Foundations

A Boolean algebra is a 6-tuple (S, +, ·, ¬, 0, 1) where:

  • S = {0, 1} — the set of truth values
  • + = OR (logical disjunction)
  • · = AND (logical conjunction)
  • ¬ = NOT (complement)
  • 0 = identity element for OR
  • 1 = identity element for AND

The system satisfies commutativity, associativity, distributivity, identity, complement, and absorption laws.

Primitive Operations & Truth Tables

Operation  Symbol           Name         Truth Table
AND        A · B or A & B   Conjunction  00 → 0, 01 → 0, 10 → 0, 11 → 1
OR         A + B or A | B   Disjunction  00 → 0, 01 → 1, 10 → 1, 11 → 1
NOT        ¬A or ~A         Negation     0 → 1, 1 → 0

Derived Gates

Gate Expression When True
NAND ¬(A · B) False only if both true
NOR ¬(A + B) True only if both false
XOR A ⊕ B = (A · ¬B) + (¬A · B) True if inputs differ
XNOR ¬(A ⊕ B) True if inputs are equal

Universal Gates: NAND and NOR can each implement any Boolean function, so an entire circuit can be built from a single gate type. (NAND flash memory is named for the series arrangement of its cells, which resembles a NAND gate, not for this logical universality.)

Complete Boolean Laws

Law Expression
Identity A + 0 = A, A · 1 = A
Dominance A + 1 = 1, A · 0 = 0
Idempotence A + A = A, A · A = A
Complement A + ¬A = 1, A · ¬A = 0
Involution ¬(¬A) = A
Commutative A + B = B + A, A · B = B · A
Associative (A + B) + C = A + (B + C)
Distributive A · (B + C) = (A · B) + (A · C)
Absorption A + (A · B) = A, A · (A + B) = A
De Morgan ¬(A + B) = ¬A · ¬B, ¬(A · B) = ¬A + ¬B
Consensus (A·B) + (¬A·C) + (B·C) = (A·B) + (¬A·C)
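Because the domain is finite, every law in the table can be verified exhaustively. A quick sketch in Python enumerating all truth assignments (function name is illustrative):

```python
from itertools import product

def laws_hold():
    """Exhaustively verify De Morgan, absorption, and consensus
    over every combination of truth values."""
    for a, b, c in product([False, True], repeat=3):
        if (not (a or b)) != ((not a) and (not b)):   # De Morgan 1
            return False
        if (not (a and b)) != ((not a) or (not b)):   # De Morgan 2
            return False
        if (a or (a and b)) != a:                     # absorption
            return False
        lhs = (a and b) or ((not a) and c) or (b and c)   # consensus LHS
        rhs = (a and b) or ((not a) and c)                # consensus RHS
        if lhs != rhs:
            return False
    return True

print(laws_hold())  # True
```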

Canonical Forms

  • Sum of Products (SOP): F = Σ minterms Example: F(A,B) = A¬B + AB = m2 + m3

  • Product of Sums (POS): F = Π maxterms Example: F(A,B) = (A+B)(¬A+B) = M0 · M2

Every Boolean function has a unique canonical SOP and a unique canonical POS representation.

Simplification Techniques

  • Algebraic Reduction
F = A·B + A·¬B + ¬A·B
  = A·(B + ¬B) + ¬A·B
  = A·1 + ¬A·B
  = A + ¬A·B
  = A + B  (since A + ¬A·B = (A + ¬A)·(A + B) = A + B, by distributivity)
  • Karnaugh Map (K-map) For F(A,B,C) with 1s at 001, 011, 110, 111:
A\BC 00 01 11 10
0     0  1  1  0
1     0  0  1  1

→ Group: ¬A·C (pair: 001, 011) + A·B (pair: 110, 111) → F = ¬A·C + A·B

  • Quine-McCluskey: Used in EDA tools for functions with >6 variables.

Boolean Logic in the CPU (ALU)

The ALU contains parallel arrays:

  • AND gate array → bitwise AND
  • OR gate array → bitwise OR
  • XOR gate array → addition, subtraction
  • NOT array → inversion
  • MUX → selects result based on opcode

Example Opcode Table:

Opcode Operation
000 AND
001 OR
010 ADD
110 NOT

Branch conditions are Boolean functions of status flags: e.g., x86's JA (jump if above) is taken when ¬CF · ¬ZF.

Boolean Algebra in Code

Bitwise Tricks

// Is power of 2?
bool is_pow2(uint32_t x) {
    return x > 0 && (x & (x - 1)) == 0;
}

// Swap without temp (XOR trick; guard needed because the
// XOR swap zeroes the value when a and b alias the same object)
void swap(int* a, int* b) {
    if (a != b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}

// Count set bits (Brian Kernighan)
int popcount(uint32_t x) {
    int count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}

Bit Masks & Permissions

#define FLAG_READ  (1 << 0)  // 0b0001
#define FLAG_WRITE (1 << 1)  // 0b0010
#define FLAG_EXEC  (1 << 2)  // 0b0100

uint8_t perms = 0;
perms |= FLAG_READ | FLAG_WRITE;
perms &= ~FLAG_EXEC;

if (perms & FLAG_READ) {
    printf("Read allowed\n");
}

Conditional Optimization

// Before (redundant)
if ((x > 0 && y > 0) || (x > 0 && z > 0)) { ... }

// After (distributive law)
if (x > 0 && (y > 0 || z > 0)) { ... }  // Fewer comparisons

Advanced Applications

Multiplexer (MUX)

Y = (S · A) + (¬S · B)

Used in ALU to select add/subtract path.

D Flip-Flop (Sequential Logic)

D Q(t+1)
0 0
1 1

Built from 6 NAND gates — foundation of registers.

Finite State Machine (FSM)

NextState = f(CurrentState, Input)
Output    = g(CurrentState, Input)

Example: Traffic light controller with Boolean state encoding.

SAT Solvers & Formal Verification

Reduce the circuit to CNF, then search for a satisfying assignment. Tools: Z3, MiniSat, ABC.

Cryptography & Security

  • One-Time Pad: C = P ⊕ K, P = C ⊕ K
  • AES S-box: 8→8 bit nonlinear Boolean function
  • Side-channel resistance: Use Boolean masking

Performance Impact

Operation CPU Cycles (x86) Energy
AND/OR 1 ~0.1 pJ
XOR 1 ~0.15 pJ
Shift 1 ~0.1 pJ
Compare 1–3 ~0.3 pJ

Combinatorics

Combinatorics is the study of counting, arrangement, and selection of objects. It's fundamental to probability, algorithm analysis, and system design. Understanding combinatorial principles helps you calculate complexity, design efficient algorithms, and reason about system scalability.

Fundamental Counting Principles

Multiplication Principle (Rule of Product)

If one operation can be performed in \(m\) ways and a second operation can be performed in \(n\) ways, then the two operations together can be performed in \(m \\times n\) ways.

  • Example: A password has 2 letters (26 choices each) followed by 3 digits (10 choices each).
    • Total passwords: \(26 \\times 26 \\times 10 \\times 10 \\times 10 = 676,000\)
  • Generalization: For \(k\) operations with \(n_1, n_2, \\ldots, n_k\) choices: \(n_1 \\times n_2 \\times \\cdots \\times n_k\)

Addition Principle (Rule of Sum)

If one task can be done in \(m\) ways and another task in \(n\) ways (mutually exclusive), then either task can be done in \(m + n\) ways.

  • Example: Choose a dessert: 5 cakes OR 3 pies = 8 total choices.

Inclusion-Exclusion Principle

For two sets \(A\) and \(B\):

\[|A \\cup B| = |A| + |B| - |A \\cap B|\]
  • Example: In a class of 30 students, 20 know Python, 15 know Java, and 8 know both.
    • Students knowing at least one: \(20 + 15 - 8 = 27\)
  • Three sets: \(|A \\cup B \\cup C| = |A| + |B| + |C| - |A \\cap B| - |A \\cap C| - |B \\cap C| + |A \\cap B \\cap C|\)

Permutations

A permutation is an ordered arrangement of objects. Order matters.

Permutations of n Distinct Objects

The number of ways to arrange \(n\) distinct objects is: $\(P(n, n) = n! = n \\times (n-1) \\times (n-2) \\times \\cdots \\times 2 \\times 1\)$

  • Example: Arrange 5 books on a shelf: \(5! = 120\) ways.

  • Factorial properties:

    • \(0! = 1\) (by definition)
    • \(n! = n \\times (n-1)!\)
    • For large \(n\): \(n! \\approx \\sqrt{2\\pi n} \\left(\\frac{n}{e}\\right)^n\) (Stirling's approximation)

Permutations of n Objects Taken r at a Time

The number of ways to arrange \(r\) objects from \(n\) distinct objects: $\(P(n, r) = \\frac{n!}{(n-r)!} = n \\times (n-1) \\times \\cdots \\times (n-r+1)\)$

  • Example: Choose and arrange 3 winners from 10 contestants: \(P(10, 3) = 10 \\times 9 \\times 8 = 720\)
  • Interpretation: First place: 10 choices, second: 9, third: 8.

Permutations with Repetition

When objects can be repeated: $\(n^r\)$

  • Example: 4-digit PIN with digits 0-9: \(10^4 = 10,000\) possibilities.

Permutations with Identical Objects

When some objects are identical, divide by the factorials of identical groups:

\[\\frac{n!}{n_1! \\times n_2! \\times \\cdots \\times n_k!}\]

where \(n_1 + n_2 + \\cdots + n_k = n\) (counts of identical objects).

  • Example: Arrange letters in "MISSISSIPPI" (1 M, 4 I, 4 S, 2 P):
    • Total: \(\\frac{11!}{1! \\times 4! \\times 4! \\times 2!} = \\frac{39,916,800}{1 \\times 24 \\times 24 \\times 2} = 34,650\)
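The MISSISSIPPI count above can be computed directly with the multinomial formula (function name is illustrative):

```python
from math import factorial

def arrangements(word):
    """Permutations of a word's letters: n! divided by the
    factorial of each identical-letter group's count."""
    counts = {}
    for ch in word:
        counts[ch] = counts.get(ch, 0) + 1
    total = factorial(len(word))
    for c in counts.values():
        total //= factorial(c)
    return total

print(arrangements("MISSISSIPPI"))  # 34650
```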

Combinations

A combination is a selection of objects where order does NOT matter.

Combinations of n Objects Taken r at a Time

The number of ways to choose \(r\) objects from \(n\) distinct objects:

\[C(n, r) = \\binom{n}{r} = \\frac{n!}{r!(n-r)!} = \\frac{P(n, r)}{r!}\]
  • Example: Choose 3 members from 10 people for a committee: \(C(10, 3) = \\frac{10!}{3! \\times 7!} = \\frac{10 \\times 9 \\times 8}{3 \\times 2 \\times 1} = 120\)
  • Key difference from permutations: \(\\binom{10}{3} = 120\) vs \(P(10, 3) = 720\) (order doesn't matter in committees).

Binomial Coefficients

The notation \(\\binom{n}{r}\) is called a binomial coefficient and appears in the binomial theorem:

\[(x + y)^n = \\sum\_{r=0}^{n} \\binom{n}{r} x^{n-r} y^r\]

Properties:

  • Symmetry: \(\\binom{n}{r} = \\binom{n}{n-r}\)
  • Pascal's Identity: \(\\binom{n}{r} = \\binom{n-1}{r-1} + \\binom{n-1}{r}\) (forms Pascal's triangle)
  • Sum: \(\\sum\_{r=0}^{n} \\binom{n}{r} = 2^n\)
  • Boundary: \(\\binom{n}{0} = \\binom{n}{n} = 1\)

Combinations with Repetition

When objects can be selected more than once:

\[\\binom{n+r-1}{r} = \\frac{(n+r-1)!}{r!(n-1)!}\]
  • Example: Choose 5 fruits from 3 types (apples, oranges, bananas), allowing duplicates:
    • \(\\binom{3+5-1}{5} = \\binom{7}{5} = 21\) ways
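Python's `math.comb` handles both plain combinations and, via the stars-and-bars identity above, combinations with repetition (the helper name is illustrative):

```python
from math import comb

def multichoose(n, r):
    """Combinations with repetition: C(n + r - 1, r)."""
    return comb(n + r - 1, r)

print(comb(10, 3))       # 120: committees of 3 from 10 people
print(multichoose(3, 5)) # 21: 5 fruits from 3 types with duplicates
```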

Combinatorial Identities

Vandermonde's Identity:

\[\\binom{m+n}{r} = \\sum\_{k=0}^{r} \\binom{m}{k} \\binom{n}{r-k}\]

Hockey-Stick Identity:

\[\\sum\_{i=r}^{n} \\binom{i}{r} = \\binom{n+1}{r+1}\]

Applications in Computer Science

Algorithm Analysis

  • Subset generation: \(2^n\) subsets of \(n\) elements (exponential complexity).
  • Permutation generation: \(n!\) orderings of a sequence of \(n\) distinct elements (brute-force search over permutations is infeasible).
  • Graph algorithms: \(\\binom{n}{2} = \\frac{n(n-1)}{2}\) possible edges in a graph with \(n\) vertices.

Hash Tables & Collisions

  • Birthday paradox: With \(n\) items and \(m\) buckets, probability of collision:
    • Approximation: \(1 - e^{-\\frac{n^2}{2m}}\)
    • For \(m=365\) (days), \(n=23\) gives ~50% collision probability.
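Both the exact birthday probability and the exponential approximation above are a few lines of Python (function names are illustrative):

```python
import math

def collision_probability(n, m):
    """Exact probability that at least two of n items share one of m
    equally likely buckets (generalized birthday problem)."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1 - p_no_collision

def collision_approx(n, m):
    """The 1 - e^(-n^2 / 2m) approximation from the text."""
    return 1 - math.exp(-n * n / (2 * m))

print(collision_probability(23, 365))  # ~0.507
print(collision_approx(23, 365))       # ~0.516
```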

Cryptography

  • Key space: \(2^n\) possible keys for \(n\)-bit key (e.g., AES-256: \(2^{256}\)).
  • Password strength: \(|\\Sigma|^L\) where \(\\Sigma\) is character set, \(L\) is length.

Database Query Optimization

  • Join orderings: For \(n\) tables there are \(n!\) left-deep join orders (and more still when bushy trees are allowed), so the optimizer uses heuristics.

Code Example: Generating Combinations

def combinations(n, r):
    """Calculate C(n, r) using dynamic programming."""
    # Base cases
    if r > n or r < 0:
        return 0
    if r == 0 or r == n:
        return 1

    # Use Pascal's triangle
    dp = [[0] * (r + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(min(i, r) + 1):
            if j == 0 or j == i:
                dp[i][j] = 1
            else:
                dp[i][j] = dp[i-1][j-1] + dp[i-1][j]
    return dp[n][r]

# Generate all combinations recursively
def generate_combinations(arr, r):
    """Generate all r-combinations of arr."""
    if r == 0:
        return [[]]
    if len(arr) < r:
        return []

    result = []
    # Include first element
    for combo in generate_combinations(arr[1:], r - 1):
        result.append([arr[0]] + combo)
    # Exclude first element
    result.extend(generate_combinations(arr[1:], r))
    return result

Probability

Probability quantifies uncertainty and randomness. It's essential for randomized algorithms, machine learning, system reliability, game theory, and performance modeling. Understanding probability enables you to make informed decisions under uncertainty and design robust systems.

Basic Concepts

Sample Space and Events

  • Sample Space (\(\\Omega\)): The set of all possible outcomes of an experiment.
    • Example: Rolling a die: \(\\Omega = \\{1, 2, 3, 4, 5, 6\\}\)
  • Event (\(E\)): A subset of the sample space.
    • Example: "Rolling an even number": \(E = \\{2, 4, 6\\}\)
  • Elementary Event: An event with a single outcome.
  • Complement: \(\\bar{E} = \\Omega \\setminus E\) (event does NOT occur).

Probability Axioms (Kolmogorov)

For any event \(E\) in sample space \(\\Omega\):

  1. Non-negativity: \(P(E) \\geq 0\)
  2. Normalization: \(P(\\Omega) = 1\)
  3. Additivity: For mutually exclusive events \(E_1, E_2, \\ldots\): $\(P(E_1 \\cup E_2 \\cup \\cdots) = P(E_1) + P(E_2) + \\cdots\)$

Consequences:

  • \(P(\\emptyset) = 0\)
  • \(P(\\bar{E}) = 1 - P(E)\)
  • \(P(E) \\leq 1\) for any event \(E\)

Classical Probability (Equally Likely Outcomes)

If all outcomes are equally likely:

\[P(E) = \\frac{\\text{Number of favorable outcomes}}{\\text{Total number of outcomes}} = \\frac{|E|}{|\\Omega|}\]
  • Example: Probability of rolling a 6: \(P(\\{6\\}) = \\frac{1}{6}\)

Empirical Probability (Frequency)

For repeated experiments:

\[P(E) \\approx \\frac{\\text{Number of times E occurred}}{\\text{Total number of trials}}\]
  • Law of Large Numbers: As trials increase, empirical probability converges to true probability.

Conditional Probability and Independence

Conditional Probability

The probability of event \(A\) given that event \(B\) has occurred:

\[P(A | B) = \\frac{P(A \\cap B)}{P(B)} \\quad \\text{(if } P(B) > 0\\text{)}\]
  • Example: Draw a card from a deck. Given it's red, probability it's a heart:
    • \(P(\\text{Heart} | \\text{Red}) = \\frac{P(\\text{Heart} \\cap \\text{Red})}{P(\\text{Red})} = \\frac{13/52}{26/52} = \\frac{1}{2}\)

Multiplication Rule

\[P(A \\cap B) = P(A | B) \\times P(B) = P(B | A) \\times P(A)\]
  • Generalization: \(P(A_1 \\cap A_2 \\cap \\cdots \\cap A_n) = P(A_1) \\times P(A_2 | A_1) \\times P(A_3 | A_1 \\cap A_2) \\times \\cdots\)

Independence

Events \(A\) and \(B\) are independent if:

\[P(A \\cap B) = P(A) \\times P(B)\]

Equivalently: \(P(A | B) = P(A)\) (knowing \(B\) doesn't affect \(A\)).

  • Example: Tossing two coins: \(P(\\text{First heads} \\cap \\text{Second heads}) = \\frac{1}{2} \\times \\frac{1}{2} = \\frac{1}{4}\)

  • Mutual Independence: Events \(A_1, A_2, \\ldots, A_n\) are mutually independent if:

    $\(P(A\_{i_1} \\cap A\_{i_2} \\cap \\cdots \\cap A\_{i_k}) = P(A\_{i_1}) \\times P(A\_{i_2}) \\times \\cdots \\times P(A\_{i_k})\)$ for any subset \(\\{i_1, i_2, \\ldots, i_k\\}\).

Bayes' Theorem

\[P(A | B) = \\frac{P(B | A) \\times P(A)}{P(B)}\]

Extended form (with partition):

\[P(A_i | B) = \\frac{P(B | A_i) \\times P(A_i)}{\\sum\_{j=1}^{n} P(B | A_j) \\times P(A_j)}\]

where \(\\{A_1, A_2, \\ldots, A_n\\}\) partitions the sample space.

  • Example (Medical Testing):
    • Disease prevalence: \(P(\\text{Disease}) = 0.01\)
    • Test sensitivity: \(P(\\text{Positive} | \\text{Disease}) = 0.95\)
    • Test specificity: \(P(\\text{Negative} | \\text{No Disease}) = 0.90\)
    • Given positive test, probability of disease: $\(P(\\text{Disease} | \\text{Positive}) = \\frac{0.95 \\times 0.01}{0.95 \\times 0.01 + 0.10 \\times 0.99} \\approx 0.0876\)$
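The medical-testing calculation above, written out with the extended form of Bayes' theorem (function name is illustrative):

```python
def bayes_posterior(prior, sensitivity, specificity):
    """P(Disease | Positive): Bayes' theorem over the two-event
    partition {Disease, No Disease}."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity      # false-positive rate
    numerator = p_pos_given_disease * prior
    evidence = numerator + p_pos_given_healthy * (1 - prior)
    return numerator / evidence

print(bayes_posterior(0.01, 0.95, 0.90))  # ~0.0876
```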

Random Variables

A random variable \(X\) is a function that maps outcomes to real numbers: \(X: \\Omega \\rightarrow \\mathbb{R}\).

Discrete Random Variables

Takes countable values (e.g., integers).

Probability Mass Function (PMF):

\[p_X(x) = P(X = x)\]

Properties:

  • \(p_X(x) \\geq 0\) for all \(x\)
  • \(\\sum\_{x} p_X(x) = 1\)

Cumulative Distribution Function (CDF):

\[F_X(x) = P(X \\leq x) = \\sum\_{k \\leq x} p_X(k)\]

Continuous Random Variables

Takes uncountable values (e.g., real numbers).

Probability Density Function (PDF):

\[f_X(x) \\geq 0 \\quad \\text{and} \\quad \\int\_{-\\infty}^{\\infty} f_X(x) \\, dx = 1\]

CDF:

\[F_X(x) = P(X \\leq x) = \\int\_{-\\infty}^{x} f_X(t) \\, dt\]

Relationship:

\(f_X(x) = \\frac{d}{dx} F_X(x)\) (if differentiable).

Expected Value (Mean)

Discrete:

\[E[X] = \\mu = \\sum\_{x} x \\cdot p_X(x)\]

Continuous:

\[E[X] = \\mu = \\int\_{-\\infty}^{\\infty} x \\cdot f_X(x) \\, dx\]

Properties:

  • Linearity: \(E[aX + b] = aE[X] + b\)
  • Additivity: \(E[X + Y] = E[X] + E[Y]\) (always)
  • Independence: \(E[XY] = E[X] \\times E[Y]\) (if \(X\) and \(Y\) are independent)

Variance and Standard Deviation

Variance:

\[\\text{Var}(X) = \\sigma^2 = E[(X - \\mu)^2] = E[X^2] - (E[X])^2\]

Standard Deviation:

\[\\sigma = \\sqrt{\\text{Var}(X)}\]

Properties:

  • \(\\text{Var}(aX + b) = a^2 \\text{Var}(X)\)
  • \(\\text{Var}(X + Y) = \\text{Var}(X) + \\text{Var}(Y)\) (if \(X\) and \(Y\) are independent)

Common Discrete Distributions

Bernoulli Distribution (\(X \\sim \\text{Bernoulli}(p)\)):

  • Models single trial with success probability \(p\).
  • \(P(X = 1) = p\), \(P(X = 0) = 1-p\)
  • \(E[X] = p\), \(\\text{Var}(X) = p(1-p)\)

Binomial Distribution (\(X \\sim \\text{Binomial}(n, p)\)):

  • Number of successes in \(n\) independent Bernoulli trials.
  • \(P(X = k) = \\binom{n}{k} p^k (1-p)^{n-k}\) for \(k = 0, 1, \\ldots, n\)
  • \(E[X] = np\), \(\\text{Var}(X) = np(1-p)\)
  • Example: Number of heads in 10 coin flips: \(X \\sim \\text{Binomial}(10, 0.5)\)

Geometric Distribution (\(X \\sim \\text{Geometric}(p)\)):

  • Number of trials until first success.
  • \(P(X = k) = (1-p)^{k-1} p\) for \(k = 1, 2, \\ldots\)
  • \(E[X] = \\frac{1}{p}\), \(\\text{Var}(X) = \\frac{1-p}{p^2}\)

Poisson Distribution (\(X \\sim \\text{Poisson}(\\lambda)\)):

  • Models rare events over time/space.
  • \(P(X = k) = \\frac{\\lambda^k e^{-\\lambda}}{k!}\) for \(k = 0, 1, 2, \\ldots\)
  • \(E[X] = \\lambda\), \(\\text{Var}(X) = \\lambda\)
  • Example: Number of requests per second to a server.

Common Continuous Distributions

Uniform Distribution (\(X \\sim \\text{Uniform}(a, b)\)):

  • \(f_X(x) = \\frac{1}{b-a}\) for \(a \\leq x \\leq b\), 0 otherwise
  • \(E[X] = \\frac{a+b}{2}\), \(\\text{Var}(X) = \\frac{(b-a)^2}{12}\)

Exponential Distribution (\(X \\sim \\text{Exponential}(\\lambda)\)):

  • Models waiting times between events.
  • \(f_X(x) = \\lambda e^{-\\lambda x}\) for \(x \\geq 0\)
  • \(E[X] = \\frac{1}{\\lambda}\), \(\\text{Var}(X) = \\frac{1}{\\lambda^2}\)
  • Memoryless property: \(P(X > s+t | X > s) = P(X > t)\)

Normal (Gaussian) Distribution (\(X \\sim \\mathcal{N}(\\mu, \\sigma^2)\)):

  • \(f_X(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}} e^{-\\frac{(x-\\mu)^2}{2\\sigma^2}}\)
  • \(E[X] = \\mu\), \(\\text{Var}(X) = \\sigma^2\)
  • Central Limit Theorem: Sum of many independent random variables approximates normal.
  • Standard Normal: \(Z \\sim \\mathcal{N}(0, 1)\) with CDF \(\\Phi(z)\)

Applications in Computer Science

Randomized Algorithms

  • QuickSort pivot selection: Random pivot gives \(O(n \\log n)\) expected time.
  • Monte Carlo methods: Estimate \(\\pi\) by random sampling.
  • Hash functions: Uniform distribution minimizes collisions.

System Reliability

  • Series system: \(P(\\text{System works}) = \\prod\_{i=1}^{n} P(\\text{Component } i \\text{ works})\)
  • Parallel system: \(P(\\text{System works}) = 1 - \\prod\_{i=1}^{n} (1 - P(\\text{Component } i \\text{ works}))\)

Load Balancing

  • Birthday paradox: With \(n\) requests and \(m\) servers, expected collisions help design hash-based load balancers.

Code Example: Monte Carlo Simulation

import random
import math

def estimate_pi(n_trials):
    """Estimate π using Monte Carlo method."""
    inside_circle = 0
    for _ in range(n_trials):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 <= 1:
            inside_circle += 1
    return 4 * inside_circle / n_trials

# Expected value: E[estimate] ≈ π
print(f"π estimate: {estimate_pi(1000000)}")  # ~3.141...

def binomial_pmf(n, p, k):
    """Calculate P(X = k) for X ~ Binomial(n, p)."""
    from math import comb
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

def expected_value_binomial(n, p):
    """E[X] for X ~ Binomial(n, p)."""
    return n * p

def variance_binomial(n, p):
    """Var(X) for X ~ Binomial(n, p)."""
    return n * p * (1 - p)

Statistics

Statistics provides methods to collect, analyze, interpret, and present data. It's crucial for A/B testing, performance monitoring, machine learning evaluation, and data-driven decision making. Understanding statistics enables you to draw valid conclusions from data and avoid common pitfalls.

Descriptive Statistics

Measures of Central Tendency

Mean (Arithmetic Average): $\(\\bar{x} = \\frac{1}{n} \\sum\_{i=1}^{n} x_i\)$

  • Properties: Sensitive to outliers.
  • Weighted mean: \(\\bar{x}\_w = \\frac{\\sum w_i x_i}{\\sum w_i}\)

Median:

  • Middle value when data is sorted (or average of two middle values if \(n\) is even).
  • Properties: Robust to outliers, 50th percentile.

Mode:

  • Most frequently occurring value(s).
  • Useful for categorical data.

Comparison:

  • Symmetric data: Mean ≈ Median ≈ Mode
  • Right-skewed: Mean > Median > Mode
  • Left-skewed: Mean < Median < Mode

Measures of Dispersion

Range: $\(\\text{Range} = \\max(x_i) - \\min(x_i)\)$

Variance (Sample): $\(s^2 = \\frac{1}{n-1} \\sum\_{i=1}^{n} (x_i - \\bar{x})^2\)$

Standard Deviation (Sample): $\(s = \\sqrt{s^2}\)$

Interquartile Range (IQR): $\(\\text{IQR} = Q_3 - Q_1\)$

where \(Q_1\) (25th percentile) and \(Q_3\) (75th percentile) are quartiles.

Coefficient of Variation: $\(\\text{CV} = \\frac{s}{\\bar{x}} \\times 100\\%\)$

(Relative measure of variability)

Percentiles and Quartiles

  • \(p\)-th percentile: Value below which \(p\\%\) of data falls.
  • Quartiles:
    • \(Q_1\): 25th percentile
    • \(Q_2\): 50th percentile (median)
    • \(Q_3\): 75th percentile

Five-Number Summary:

  • Minimum, \(Q_1\), Median, \(Q_3\), Maximum
  • Used in box plots.
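Python's standard `statistics` module computes all of these descriptive measures directly (the sample data here is illustrative):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # 5.0
median = statistics.median(data)  # 4.5 (average of the two middle values)
mode = statistics.mode(data)      # 4  (most frequent value)
stdev = statistics.stdev(data)    # sample standard deviation (n - 1 divisor)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles

print(mean, median, mode, round(stdev, 3))
```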

Probability Distributions in Practice

Sampling Distributions

Sample Mean Distribution:

  • If \(X_1, X_2, \\ldots, X_n\) are i.i.d. with mean \(\\mu\) and variance \(\\sigma^2\):
    • \(E[\\bar{X}] = \\mu\)
    • \(\\text{Var}(\\bar{X}) = \\frac{\\sigma^2}{n}\)
    • Central Limit Theorem: \(\\bar{X} \\approx \\mathcal{N}(\\mu, \\frac{\\sigma^2}{n})\) for large \(n\)

Standard Error

Standard Error of the Mean: $\(\\text{SE} = \\frac{\\sigma}{\\sqrt{n}} \\approx \\frac{s}{\\sqrt{n}}\)$

Measures variability of sample mean across different samples.

Inferential Statistics

Point Estimation

Estimator: A statistic used to estimate a population parameter.

Properties of Good Estimators:

  • Unbiased: \(E[\\hat{\\theta}] = \\theta\)
  • Consistent: \(\\hat{\\theta} \\xrightarrow{p} \\theta\) as \(n \\to \\infty\)
  • Efficient: Minimum variance among unbiased estimators

Common Estimators:

  • Sample mean: \(\\bar{X} = \\frac{1}{n}\\sum X_i\) (unbiased for \(\\mu\))
  • Sample variance: \(S^2 = \\frac{1}{n-1}\\sum(X_i - \\bar{X})^2\) (unbiased for \(\\sigma^2\))

Confidence Intervals

A confidence interval provides a range of plausible values for a parameter with a specified confidence level.

95% Confidence Interval for Mean (Large Sample, Known \(\\sigma\)): $\(\\bar{x} \\pm 1.96 \\times \\frac{\\sigma}{\\sqrt{n}}\)$

95% Confidence Interval for Mean (Large Sample, Unknown \(\\sigma\)): $\(\\bar{x} \\pm t\_{0.025, n-1} \\times \\frac{s}{\\sqrt{n}}\)$

where \(t\_{0.025, n-1}\) is the critical value from \(t\)-distribution.

Interpretation: If we repeat the sampling process many times, 95% of intervals will contain the true parameter.
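A minimal sketch of the large-sample interval, using the z critical value 1.96 from the formula above (for small samples a t critical value is more appropriate; the function name is illustrative):

```python
import math
import statistics

def confidence_interval_95(sample):
    """Large-sample 95% CI for the mean: x-bar +/- 1.96 * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # standard error
    return mean - 1.96 * se, mean + 1.96 * se

lo, hi = confidence_interval_95([10, 12, 11, 13, 12, 11, 10, 12, 13, 11])
print(round(lo, 2), round(hi, 2))
```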

Hypothesis Testing

Null Hypothesis (\(H_0\)): Statement to be tested (often "no effect" or "no difference").

Alternative Hypothesis (\(H_1\) or \(H_a\)): Statement for which the test seeks evidence (typically the effect or difference of interest).

Test Statistic: A value computed from sample data used to decide between \(H_0\) and \(H_1\).

\(p\)-value: Probability of observing data as extreme (or more extreme) than observed, assuming \(H_0\) is true.

Decision Rule:

  • If \(p\)-value < \(\\alpha\) (significance level, e.g., 0.05), reject \(H_0\)
  • Otherwise, fail to reject \(H_0\)

Type I Error (\(\\alpha\)): Rejecting \(H_0\) when it's true (false positive).

Type II Error (\(\\beta\)): Failing to reject \(H_0\) when it's false (false negative).

Power: \(1 - \\beta\) (probability of correctly rejecting \(H_0\)).
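Both error rates can be estimated by simulation: run the test repeatedly with \(H_0\) true to check the false-positive rate stays near \(\\alpha\), and with a real effect present to estimate power. A sketch (the effect size of 0.5 standard deviations is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, trials, alpha = 30, 2000, 0.05

def rejection_rate(true_mean):
    """Fraction of one-sample t-tests of H0: mu = 0 that reject."""
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, 1.0, size=n)
        _, p = stats.ttest_1samp(sample, 0.0)
        rejections += p < alpha
    return rejections / trials

type_i = rejection_rate(0.0)   # H0 true: rate should be ≈ alpha
power = rejection_rate(0.5)    # H0 false (effect = 0.5 sd)
print(type_i, power)
```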

Common Tests

\(z\)-test (Large Sample, Known \(\\sigma\)): $\(z = \\frac{\\bar{x} - \\mu_0}{\\sigma / \\sqrt{n}}\)$

\(t\)-test (Small Sample, Unknown \(\\sigma\)): $\(t = \\frac{\\bar{x} - \\mu_0}{s / \\sqrt{n}}\)$

Under \(H_0\), the statistic follows a \(t\)-distribution with \(n-1\) degrees of freedom.

Chi-Square Test:

  • Tests independence in contingency tables.
  • \(\\chi^2 = \\sum \\frac{(O_i - E_i)^2}{E_i}\) where \(O_i\) = observed, \(E_i\) = expected.
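SciPy's `chi2_contingency` implements this test for contingency tables, computing the expected counts internally. A sketch (the 2×2 counts are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = groups A/B, cols = converted / not
observed = np.array([[30, 70],
                     [45, 55]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
print(expected)  # expected counts under independence
```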

Correlation and Regression

Correlation

Pearson Correlation Coefficient: $\(r = \\frac{\\sum (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum (x_i - \\bar{x})^2 \\sum (y_i - \\bar{y})^2}} = \\frac{\\text{Cov}(X, Y)}{s_X s_Y}\)$

  • Range: \(-1 \\leq r \\leq 1\)
  • \(r = 1\): Perfect positive linear relationship
  • \(r = -1\): Perfect negative linear relationship
  • \(r = 0\): No linear relationship (but may have nonlinear relationship)

Interpretation:

  • \(|r| > 0.7\): Strong correlation
  • \(0.3 < |r| < 0.7\): Moderate correlation
  • \(|r| < 0.3\): Weak correlation

Causation vs. Correlation: Correlation does NOT imply causation.
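The caveat that \(r = 0\) only rules out a *linear* relationship is easy to demonstrate: a perfect parabola has Pearson correlation essentially zero despite exact dependence. A sketch:

```python
import numpy as np
from scipy import stats

x = np.linspace(-1, 1, 101)  # symmetric around 0
y = x ** 2                   # exact nonlinear (quadratic) dependence

r, p = stats.pearsonr(x, y)
print(r)  # ≈ 0: no linear relationship, yet y is fully determined by x
```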

Simple Linear Regression

Models relationship between dependent variable \(Y\) and independent variable \(X\): $\(Y = \\beta_0 + \\beta_1 X + \\epsilon\)$

Least Squares Estimates: $\(\\hat{\\beta_1} = \\frac{\\sum (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum (x_i - \\bar{x})^2} = r \\frac{s_Y}{s_X}\)$

\[\\hat{\\beta_0} = \\bar{y} - \\hat{\\beta_1} \\bar{x}\]

Coefficient of Determination (\(R^2\)): $\(R^2 = r^2 = \\frac{\\text{Explained Variation}}{\\text{Total Variation}}\)$

  • Proportion of variance in \(Y\) explained by \(X\).
  • Range: \(0 \\leq R^2 \\leq 1\)
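The closed-form least-squares estimates above can be checked against `scipy.stats.linregress` on synthetic data (true coefficients and noise level are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)  # true beta0=2, beta1=3

# Closed-form least-squares estimates from the formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

res = stats.linregress(x, y)
print(b0, b1)           # matches scipy's intercept and slope
print(res.rvalue ** 2)  # R^2 = r^2 for simple linear regression
```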

Applications in Computer Science

A/B Testing

  • Hypothesis: \(H_0\): No difference between variants, \(H_1\): Variants differ
  • Metrics: Conversion rate, click-through rate, revenue
  • Statistical significance: \(p\)-value < 0.05
  • Sample size: Power analysis to detect meaningful differences
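A minimal sketch of an A/B conversion-rate comparison using a two-proportion \(z\)-test with a pooled standard error (the counts are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical results: (conversions, visitors) per variant
conv_a, n_a = 200, 5000   # 4.0% conversion
conv_b, n_b = 260, 5000   # 5.2% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided test
print(f"z = {z:.3f}, p = {p_value:.4f}")
```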

Performance Monitoring

  • Latency percentiles: \(p50\) (median), \(p95\), \(p99\)
  • Throughput: Mean requests per second with confidence intervals
  • Anomaly detection: Values beyond \(\\mu \\pm 3\\sigma\) (3-sigma rule)
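A sketch of these monitoring computations on synthetic latency data (the lognormal distribution and thresholds are illustrative; for skewed latencies, percentiles are usually more robust than the 3-sigma rule):

```python
import numpy as np

rng = np.random.default_rng(5)
latencies = rng.lognormal(mean=4.0, sigma=0.5, size=10_000)  # ms, right-skewed

# Latency percentiles
p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")

# 3-sigma rule: flag values outside mu ± 3*sigma
mu, sigma = latencies.mean(), latencies.std(ddof=1)
anomalies = latencies[np.abs(latencies - mu) > 3 * sigma]
print(f"{len(anomalies)} anomalies out of {len(latencies)}")
```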

Machine Learning Evaluation

  • Accuracy: \(\\frac{\\text{TP} + \\text{TN}}{\\text{TP} + \\text{TN} + \\text{FP} + \\text{FN}}\)
  • Precision: \(\\frac{\\text{TP}}{\\text{TP} + \\text{FP}}\)
  • Recall: \(\\frac{\\text{TP}}{\\text{TP} + \\text{FN}}\)
  • F1-Score: \(2 \\times \\frac{\\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}\)
  • Confusion Matrix: TP, TN, FP, FN counts
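The four metrics computed from raw confusion-matrix counts (the counts are made up):

```python
# Hypothetical confusion-matrix counts
tp, tn, fp, fn = 80, 90, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many are real
recall = tp / (tp + fn)      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)
```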

Database Query Optimization

  • Cardinality estimation: Using statistics to estimate result sizes
  • Selectivity: Fraction of rows matching a condition

Code Example: Statistical Analysis

import numpy as np
from scipy import stats

def descriptive_stats(data):
    """Calculate descriptive statistics."""
    return {
        'mean': np.mean(data),
        'median': np.median(data),
        'std': np.std(data, ddof=1),  # Sample std
        'variance': np.var(data, ddof=1),
        'min': np.min(data),
        'max': np.max(data),
        'q1': np.percentile(data, 25),
        'q3': np.percentile(data, 75),
        'iqr': np.percentile(data, 75) - np.percentile(data, 25)
    }

def confidence_interval_mean(data, confidence=0.95):
    """Calculate confidence interval for mean."""
    n = len(data)
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    se = std / np.sqrt(n)

    alpha = 1 - confidence
    t_critical = stats.t.ppf(1 - alpha/2, df=n-1)

    margin = t_critical * se
    return (mean - margin, mean + margin)

def t_test_one_sample(data, mu0):
    """One-sample t-test: H0: μ = μ0 vs H1: μ ≠ μ0."""
    t_stat, p_value = stats.ttest_1samp(data, mu0)
    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'reject_h0': p_value < 0.05
    }

def correlation_analysis(x, y):
    """Calculate correlation and perform test."""
    r, p_value = stats.pearsonr(x, y)
    return {
        'correlation': r,
        'p_value': p_value,
        'significant': p_value < 0.05
    }

# Example: Analyze response times
response_times = np.random.exponential(scale=100, size=1000)
stats_dict = descriptive_stats(response_times)
ci = confidence_interval_mean(response_times)
print(f"Mean: {stats_dict['mean']:.2f}ms")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]ms")

Common Pitfalls

  1. Confusing correlation with causation: Ice cream sales and drowning are correlated (both increase in summer), but one doesn't cause the other.

  2. Multiple comparisons problem: Testing many hypotheses increases chance of false positives. Use Bonferroni correction: \(\\alpha\_{\\text{adjusted}} = \\frac{\\alpha}{n}\).

  3. Selection bias: Sample not representative of population.

  4. Survivorship bias: Only analyzing successful cases, ignoring failures.

  5. P-hacking: Manipulating data or tests until \(p\)-value < 0.05.

  6. Ignoring effect size: Statistical significance doesn't mean practical importance.

  7. Small sample sizes: Leads to unreliable estimates and low power.
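Pitfall 2 is easy to reproduce by simulation: test 20 true null hypotheses at \(\\alpha = 0.05\) repeatedly, and at least one false positive appears far more often than 5% of the time, while the Bonferroni-adjusted threshold restores control (a sketch; counts and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
m, n, trials, alpha = 20, 30, 500, 0.05

any_fp_raw = any_fp_bonf = 0
for _ in range(trials):
    # m independent t-tests, all with H0 true (mean really is 0)
    pvals = np.array([
        stats.ttest_1samp(rng.standard_normal(n), 0.0).pvalue
        for _ in range(m)
    ])
    any_fp_raw += (pvals < alpha).any()        # naive thresholding
    any_fp_bonf += (pvals < alpha / m).any()   # Bonferroni correction

print(any_fp_raw / trials)   # ≈ 1 - 0.95^20 ≈ 0.64
print(any_fp_bonf / trials)  # ≈ alpha = 0.05
```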