
PC Architecture

A PC’s architecture is the blueprint of how its components—CPU, memory, storage, and peripherals—work together to execute your code. At its core, a computer follows the von Neumann architecture, where:

  • CPU (Central Processing Unit): The "brain" that executes instructions.
  • Memory: Stores data and instructions for quick access.
  • Storage: Persists data and programs when the system is off.
  • Input/Output (I/O): Handles communication with devices like keyboards or disks.
  • Bus: Connects components for data transfer.

CPU

At its heart, the CPU runs your code through a repeating loop called the fetch-decode-execute (FDE) cycle (sometimes called the fetch-execute cycle). This is the fundamental process by which machine instructions are carried out:

  • Fetch: The CPU retrieves the next instruction from main memory (RAM) using the Program Counter (PC) register, which holds the memory address of the current instruction. The instruction is loaded into the Instruction Register (IR). This step involves accessing memory via the address bus and data bus.
  • Decode: The Control Unit (CU) interprets the instruction in the IR. It breaks it down into an opcode (operation code, e.g., ADD or LOAD) and operands (data or addresses involved). The CU then signals other CPU components (like the ALU) on what to do.
  • Execute: The instruction is performed. For example:
    • Arithmetic operations (e.g., addition) are handled by the Arithmetic Logic Unit (ALU).
    • Data movement (e.g., loading from memory) updates registers or memory.
    • If the result needs storage, it's written back (sometimes called a "write-back" stage in extended cycles).

After execution, the PC increments to point to the next instruction, and the cycle repeats. In simple CPUs, this is sequential, but modern ones overlap stages for efficiency.

Internal Components of the CPU

  • Control Unit (CU): The "director" that orchestrates the FDE cycle. It generates control signals to coordinate the ALU, registers, and memory access. In modern designs, it's microprogrammed or hardwired for speed.
  • Arithmetic Logic Unit (ALU): Performs computations like addition, subtraction, bitwise operations (AND, OR, XOR), and comparisons. It handles integer and floating-point math via dedicated Floating-Point Units (FPUs) in advanced CPUs.
  • Registers: Ultra-fast storage inside the CPU for temporary data. Types include:
  • General-Purpose Registers (GPRs): Hold variables or intermediate results (e.g., the 16 GPRs in x86-64, such as RAX and RBX).
    • Special Registers: Program Counter (PC), Stack Pointer (SP), Flags (for condition results like zero or overflow).
    • Vector Registers: For SIMD (Single Instruction, Multiple Data) operations, like AVX in Intel CPUs, allowing parallel processing of arrays.

Registers are much faster than cache or RAM (access in ~1 cycle vs. 100+ for RAM), so compilers optimize to keep data here as much as possible.

Advanced Features for Performance

Modern CPUs (e.g., Intel Core, AMD Ryzen, Apple M-series) go beyond basic FDE with optimizations:

  • Clock Speed and Turbo Boost: Measured in GHz (e.g., 5 GHz means 5 billion cycles/second). Higher speeds mean faster execution, but they're limited by heat (TDP - Thermal Design Power). Features like Intel Turbo Boost or AMD Precision Boost dynamically increase speed for short bursts.
  • Cores and Multi-Core Design: A single CPU can have multiple cores (e.g., 16 cores in a high-end desktop CPU). Each core runs its own FDE cycle, enabling parallelism. Hyper-Threading (Intel) or Simultaneous Multithreading (SMT, AMD) allows one core to run two threads by duplicating architectural state such as registers while sharing the execution units.
  • Instruction Pipelining: Divides the FDE into stages (e.g., 5-20 stages in modern CPUs: fetch, decode, execute, memory access, write-back). Instructions overlap, like an assembly line, increasing throughput. However, hazards (data dependencies, branches) can stall the pipeline.
  • Branch Prediction and Speculative Execution: CPUs guess the outcome of branches (e.g., if-statements) using history tables. If wrong, they discard speculative work (a "misprediction penalty"). This was exploited in vulnerabilities like Spectre/Meltdown.
  • Out-of-Order Execution (OoO): Instructions aren't executed strictly in program order. The CPU reorders them to keep units busy, using a Reorder Buffer (ROB) to maintain correctness.
  • Superscalar Architecture: Multiple instructions per cycle (e.g., 4-8 wide in modern CPUs) by having parallel ALUs and decoders.

Instruction Set Architectures (ISAs)

The CPU's "language" is its ISA, defining instructions and registers:

  • CISC (Complex Instruction Set Computing): Like x86 (Intel/AMD), with rich, variable-length instructions (e.g., one instruction for string copy). Good for compatibility but complex hardware.
  • RISC (Reduced Instruction Set Computing): Like ARM or RISC-V, with simple, fixed-length instructions. Simpler hardware and better power efficiency (common in mobile devices), but a given task may require more instructions.

Memory

Memory is the fast, temporary storage that holds your program’s data and instructions while they’re being processed by the CPU. Unlike storage, which is persistent but slow, memory is volatile (data is lost when power is off) and optimized for speed. Understanding memory’s structure, behavior, and interaction with software is crucial for writing high-performance code. Below, we explore memory types, the memory hierarchy, key concepts, and their implications for engineers.

Types of Memory

Memory comes in several forms, each with distinct roles:

  • Registers: Ultra-fast storage inside the CPU (e.g., the 16 general-purpose registers in x86-64 CPUs, like RAX or RBX). They hold temporary data (e.g., loop counters, function arguments) and are accessed in ~0.1-1 nanosecond (ns). Capacity is tiny (e.g., 512 bytes total).
  • Cache Memory: Fast memory on or near the CPU that stores frequently used data to reduce trips to slower RAM. It is divided into levels:
    • L1 Cache: Per-core, split into instruction (I-cache) and data (D-cache), typically 32-64KB per core, ~1-3 ns access.
    • L2 Cache: Per-core or shared, 256KB-2MB, ~10-20 ns access.
    • L3 Cache (LLC - Last Level Cache): Shared across cores, 8-100MB, ~30-50 ns access.
    • Caches use associativity (e.g., 8-way set-associative mapping) and replacement policies like LRU. Misses cause stalls—optimize with spatial/temporal locality.
  • RAM (Random Access Memory): The main system memory, typically 8-128GB in modern PCs, with access times of ~10-100 ns. RAM is volatile and holds running programs and their data, loaded from storage.
  • ROM (Read-Only Memory): Non-volatile, used for firmware (e.g., BIOS/UEFI). Rarely modified by user code but critical for system boot.
  • Specialized Memory (e.g., VRAM in GPUs): Dedicated to specific tasks, like graphics processing. Not covered here but relevant for game developers.

The Memory Hierarchy

Memory is organized in a hierarchy to balance speed, cost, and capacity:

  • Registers: Fastest, smallest (bytes), most expensive per bit.
  • L1/L2/L3 Cache: Fast, medium capacity (KB-MB), on-chip or near CPU.
  • RAM: Slower, larger (GB), off-chip but still fast compared to storage.
  • Storage (SSD/HDD): Slowest (microseconds to milliseconds), massive (TB), cheapest per bit.

This hierarchy creates a trade-off: faster memory is scarcer, so CPUs rely on locality of reference:

  • Temporal Locality: If data is accessed once, it’s likely to be accessed again soon (e.g., loop variables).
  • Spatial Locality: If data at one memory address is accessed, nearby addresses are likely needed (e.g., array elements).

Caches exploit locality by keeping recently used data close to the CPU and by fetching whole cache lines (often with hardware prefetching of likely-next lines). A cache hit (data in cache) is fast; a cache miss (data must come from RAM or storage) causes delays. Caches are managed by hardware (cache controllers) using policies like Least Recently Used (LRU) for eviction and set-associative mapping (e.g., 8-way associativity) to organize data.

Key Memory Concepts

  • Virtual Memory: An abstraction that gives each process its own contiguous address space, mapped to physical memory via page tables. The Memory Management Unit (MMU) translates virtual addresses to physical ones, caching recent translations in the TLB; accessing an unmapped page triggers a page fault, which the OS handles.
  • Memory Alignment: Data should be stored at addresses aligned to the CPU’s word size (e.g., 4-byte boundaries for 32-bit systems). Misaligned access (e.g., reading a 4-byte integer at an odd address) may require multiple CPU cycles, slowing execution.
  • Memory Management: In low-level languages (C/C++), you manually allocate/deallocate memory (e.g., malloc/free). In high-level languages (Java, Python), a garbage collector automatically reclaims unused memory, but this can cause pauses (e.g., stop-the-world pauses in Java’s JVM).
  • Cache Coherence: In multi-core CPUs, each core’s cache must stay consistent with others. Protocols like MESI ensure data integrity but add overhead in multithreaded programs.
  • Memory Bandwidth: The rate at which data moves between RAM and CPU (e.g., 50-100 GB/s in DDR5 RAM). High-bandwidth tasks (e.g., video processing) can saturate this, causing bottlenecks.