Containerization¶
Containerization is often misunderstood as "lightweight Virtual Machines." To an engineer, that is a useful analogy, but it is technically incorrect.
Containerization is not virtualization; it is process isolation.
While a Virtual Machine (VM) simulates hardware to run a full guest OS, a Container simulates an Operating System to run a process. The fundamental difference lies in the abstraction layer: VMs abstract the hardware (CPU, memory, disk, network interfaces), while containers abstract the operating system interface (filesystem view, process table, network stack, user IDs).
1. The Core Illusion: It's Just a Process¶
If you run a container (e.g., docker run -d nginx) and then run ps aux on your host machine, you can actually find the Nginx process running directly on your host kernel. It is not hidden inside a black box file; it is a standard Linux process with a standard PID.
# On the host machine after running: docker run -d nginx
$ ps aux | grep nginx
root 24501 0.0 0.1 8860 5432 ? Ss 10:15 0:00 nginx: master process
www-data 24512 0.0 0.0 9264 2340 ? S 10:15 0:00 nginx: worker process
So why does that process think it has its own file system, IP address, and root user? The answer lies in three Linux Kernel features: Namespaces, Cgroups, and Union File Systems.
These are not container-specific inventions—they are general-purpose kernel primitives that have existed since the 2000s. Container runtimes simply orchestrate them into a cohesive illusion.
2. Namespaces (The Walls)¶
Namespaces manipulate what a process can see. They partition kernel resources such that one set of processes sees one set of resources, while another set sees a different set. Each namespace type isolates a specific global system resource.
The Namespace Syscall Interface¶
Namespaces are created and managed through three primary system calls:
// Create a new process in new namespaces
int clone(int (*fn)(void *), void *stack, int flags, void *arg);
// Move calling process into new namespaces
int unshare(int flags);
// Join an existing namespace
int setns(int fd, int nstype);
The flags parameter specifies which namespaces to create:
| Flag | Namespace | Kernel Version | Purpose |
|---|---|---|---|
| CLONE_NEWPID | PID | 2.6.24 (2008) | Process ID isolation |
| CLONE_NEWNS | Mount | 2.4.19 (2002) | Filesystem mount points |
| CLONE_NEWNET | Network | 2.6.29 (2009) | Network stack isolation |
| CLONE_NEWUTS | UTS | 2.6.19 (2006) | Hostname and domain name |
| CLONE_NEWIPC | IPC | 2.6.19 (2006) | Inter-process communication |
| CLONE_NEWUSER | User | 3.8 (2013) | User and group ID mapping |
| CLONE_NEWCGROUP | Cgroup | 4.6 (2016) | Cgroup root directory |
| CLONE_NEWTIME | Time | 5.6 (2020) | System clock offsets |
PID Namespace¶
The PID namespace isolates the process ID number space. This is fundamental to the "container as isolated system" illusion.
How It Works:
- Each PID namespace has its own process numbering starting from 1
- The first process in a new PID namespace becomes PID 1 (the init process)
- Processes can see processes in their namespace and child namespaces, but not parent namespaces
- PID namespaces form a hierarchy (nested namespaces)
# View namespace from host
$ ls -la /proc/24501/ns/
lrwxrwxrwx 1 root root 0 Jan 26 10:15 pid -> 'pid:[4026532456]'
# The number 4026532456 is the inode number identifying this namespace
# Processes with the same inode share the same namespace
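As a sketch, the symlink targets shown above can be compared programmatically: two processes share a namespace exactly when the inode numbers in their /proc/&lt;pid&gt;/ns/* links match. The helper name below is invented for illustration; in real code the link target would come from os.readlink("/proc/&lt;pid&gt;/ns/pid").

```python
# Hypothetical helper: parse a /proc/<pid>/ns/<type> symlink target,
# e.g. "pid:[4026532456]", into (type, inode). Two processes are in the
# same namespace iff these inode numbers are equal.
def parse_ns_link(target: str) -> tuple:
    ns_type, _, rest = target.partition(":")
    return ns_type, int(rest.strip("[]"))

# Compare two processes' PID namespaces by inode
a = parse_ns_link("pid:[4026532456]")
b = parse_ns_link("pid:[4026532456]")
print(a == b)  # True -> same PID namespace
```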
The PID 1 Problem:
Inside a container, your application becomes PID 1. This is significant because PID 1 has special responsibilities in Unix:
- Signal Handling: PID 1 does not get the kernel's default signal dispositions. SIGTERM and SIGINT are ignored unless the process explicitly installs handlers.
- Zombie Reaping: PID 1 must reap orphaned child processes, or they become zombies.
This is why container images often use init systems like tini or dumb-init as the entrypoint—they properly handle signals and reap zombies.
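A minimal sketch of the two duties tini and dumb-init perform, written in Python purely for illustration (real init shims are small C binaries; the function name here is invented):

```python
import os
import signal

def run_as_init(argv):
    """Spawn argv as a child, forward SIGTERM to it, and reap all children."""
    child = os.fork()
    if child == 0:
        os.execvp(argv[0], argv)       # replace the forked copy with the workload
    # Duty 1: forward termination requests instead of letting PID 1 ignore them
    signal.signal(signal.SIGTERM, lambda sig, frame: os.kill(child, signal.SIGTERM))
    exit_code = 0
    while True:
        try:
            # Duty 2: reap every exiting child, including re-parented orphans
            pid, status = os.wait()
        except ChildProcessError:
            return exit_code           # no children left to reap
        if pid == child:
            exit_code = os.waitstatus_to_exitcode(status)

# Usage sketch: run_as_init(["nginx", "-g", "daemon off;"])
```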
Mount Namespace (MNT)¶
The mount namespace isolates the list of mount points seen by a process. This allows each container to have its own root filesystem.
Key Operations:
- pivot_root(): changes the root filesystem for all processes in the namespace
- chroot(): changes the root directory for the calling process only (weaker isolation)
Modern container runtimes use pivot_root because it's more secure:
// Simplified pivot_root usage
mkdir("/new_root/old_root");
pivot_root("/new_root", "/new_root/old_root");
chdir("/");
umount2("/old_root", MNT_DETACH);
rmdir("/old_root");
Mount Propagation:
Mounts can be configured with different propagation types:
| Type | Behavior |
|---|---|
| MS_SHARED | Mount/unmount events propagate bidirectionally |
| MS_PRIVATE | No propagation (default for containers) |
| MS_SLAVE | Events propagate from master to slave only |
| MS_UNBINDABLE | Cannot be bind-mounted |
Network Namespace (NET)¶
The network namespace provides isolation of the network stack:
- Network devices (interfaces)
- IPv4 and IPv6 protocol stacks
- IP routing tables
- Firewall rules (iptables/nftables)
- Network ports
- /proc/net and /sys/class/net
Creating Network Connectivity:
Containers need a way to communicate with the outside world. This is achieved through virtual ethernet (veth) pairs:
┌─────────────────────────────────────────────────────────────┐
│ HOST │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Container A │ │ Container B │ │
│ │ NET Namespace │ │ NET Namespace │ │
│ │ │ │ │ │
│ │ eth0 │ │ eth0 │ │
│ │ 172.17.0.2 │ │ 172.17.0.3 │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ vethA │ vethB │
│ │ │ │
│ ┌────────┴──────────────────────────┴────────┐ │
│ │ docker0 bridge │ │
│ │ 172.17.0.1 │ │
│ └────────────────────┬───────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────┐ │
│ │ eth0 (Host NIC) │ │
│ │ 192.168.1.100 │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
User Namespace (USER)¶
The user namespace maps user and group IDs inside the namespace to different IDs outside. This is the foundation of rootless containers.
# Mapping configuration (inside container UID 0 → host UID 100000)
$ cat /proc/24501/uid_map
0 100000 65536
# Format: <inside-start> <outside-start> <count>
# UID 0-65535 inside maps to UID 100000-165535 outside
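The mapping arithmetic is simple enough to sketch (hypothetical Python helper; the real translation is done by the kernel when it checks permissions):

```python
# Translate an in-container UID to the host UID using the
# /proc/<pid>/uid_map format: "<inside-start> <outside-start> <count>"
def map_uid(uid_map: str, inside_uid: int) -> int:
    for line in uid_map.strip().splitlines():
        inside, outside, count = map(int, line.split())
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    raise ValueError(f"UID {inside_uid} has no mapping")

print(map_uid("0 100000 65536", 0))    # 100000 (container "root")
print(map_uid("0 100000 65536", 33))   # 100033 (e.g. www-data)
```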
Security Implications:
- Process runs as root (UID 0) inside the container
- If it escapes, it's an unprivileged user (UID 100000) on the host
- Cannot access files owned by real root
- Cannot load kernel modules or mount filesystems
UTS Namespace¶
Allows the container to have its own hostname and NIS domain name:
# Inside container
$ hostname
my-container-hostname
# On host
$ hostname
production-server-01
IPC Namespace¶
Isolates Inter-Process Communication resources:
- System V IPC (message queues, semaphore sets, shared memory segments)
- POSIX message queues
This prevents containers from interfering with each other's shared memory or message passing.
Cgroup Namespace¶
Virtualizes a process's view of the cgroup hierarchy: the process sees its own cgroup as the root, both in /proc/self/cgroup and in cgroupfs mounts:
# Inside container (sees root as its own cgroup)
$ cat /proc/self/cgroup
0::/
# On host (sees full path)
$ cat /proc/24501/cgroup
0::/system.slice/docker-abc123.scope
Time Namespace (Linux 5.6+)¶
Allows per-namespace offsets for the CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks. It does not virtualize the wall clock (CLOCK_REALTIME), so a container cannot be made to "think it's a different year":
# Primary use case: checkpoint/restore (CRIU), so a restored or migrated
# container sees consistent monotonic and boot-time clock readings
3. Cgroups (The Police)¶
Control Groups (cgroups) manipulate what a process can use. While Namespaces hide resources, Cgroups limit them.
Without Cgroups, a containerized process could consume 100% of your Host CPU or RAM, crashing the machine. Cgroups allow you to say: "This group of processes (container) gets max 512MB RAM and 50% of 1 CPU core."
Cgroups V1 vs V2¶
| Aspect | Cgroups V1 | Cgroups V2 |
|---|---|---|
| Hierarchy | Multiple hierarchies (one per controller) | Single unified hierarchy |
| Mount point | /sys/fs/cgroup/<controller>/ | /sys/fs/cgroup/ |
| Process membership | Process can be in different cgroups per controller | Process in exactly one cgroup |
| Interface files | Controller-specific prefixes | Unified naming (cpu.max, memory.max) |
| Default (2024+) | Legacy | Default in modern distros |
Cgroup V2 Controllers¶
CPU Controller:
$ cat /sys/fs/cgroup/docker/abc123/cpu.max
150000 100000
# Format: <quota> <period>
# 150000 µs of CPU time every 100000 µs = 1.5 cores
$ cat /sys/fs/cgroup/docker/abc123/cpu.weight
100
# Relative weight (1-10000, default 100) for CPU time sharing
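The quota/period arithmetic above can be sketched as a small helper (hypothetical function name, shown for illustration):

```python
# Turn a cgroup v2 cpu.max line ("<quota> <period>") into an effective
# core count. A quota of "max" means unlimited.
def effective_cores(cpu_max: str):
    quota, period = cpu_max.split()
    if quota == "max":
        return None                  # no CPU limit configured
    return int(quota) / int(period)

print(effective_cores("150000 100000"))  # 1.5
print(effective_cores("max 100000"))     # None
```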
Memory Controller:
$ cat /sys/fs/cgroup/docker/abc123/memory.max
536870912
# Hard limit in bytes (512MB)
$ cat /sys/fs/cgroup/docker/abc123/memory.current
234881024
# Current usage
$ cat /sys/fs/cgroup/docker/abc123/memory.swap.max
0
# Swap limit (0 = no swap allowed)
I/O Controller:
$ cat /sys/fs/cgroup/docker/abc123/io.max
8:0 rbps=10485760 wbps=10485760 riops=1000 wiops=1000
# Device 8:0: max 10MB/s read/write, 1000 IOPS
PIDs Controller:
$ cat /sys/fs/cgroup/docker/abc123/pids.max
100
# Maximum number of processes (fork bomb protection)
The OOM Killer¶
When a container exceeds its memory limit:
- Kernel tries to reclaim memory (page cache, swap)
- If unsuccessful, triggers OOM (Out of Memory) Killer
- OOM Killer selects the process with the highest oom_score
- Selected process is terminated with SIGKILL
# Check OOM events
$ cat /sys/fs/cgroup/docker/abc123/memory.events
low 0
high 0
max 0
oom 3 # OOM triggered 3 times
oom_kill 3 # 3 processes killed
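Monitoring tools read this file in the same key/value form; a minimal sketch (assumed helper name, input shown without the explanatory comments above):

```python
# Parse cgroup v2 memory.events into a dict and flag OOM kills
def parse_memory_events(text: str) -> dict:
    events = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        events[key] = int(value)
    return events

sample = "low 0\nhigh 0\nmax 0\noom 3\noom_kill 3"
events = parse_memory_events(sample)
print(events["oom_kill"] > 0)  # True -> the container hit its memory limit
```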
Memory Accounting¶
Cgroups track different types of memory:
$ cat /sys/fs/cgroup/docker/abc123/memory.stat
anon 104857600 # Anonymous memory (heap, stack)
file 52428800 # File-backed memory (page cache)
kernel_stack 163840 # Kernel stack
pagetables 524288 # Page tables
shmem 0 # Shared memory
sock 4096 # Socket buffers
4. Union File Systems (The Storage)¶
This is why containers are so fast to start compared to VMs.
- Standard VM: A 10GB disk image is a giant binary blob that must be copied/attached.
- Container: Uses a layered file system (like OverlayFS).
The Layer Model¶
┌─────────────────────────────────────────────┐
│ Container Layer (R/W) │ ← Ephemeral, per-container
│ /var/lib/docker/overlay2/xyz/diff │
├─────────────────────────────────────────────┤
│ Image Layer 3 (R/O) │ ← Your application code
│ sha256:abc123... │
├─────────────────────────────────────────────┤
│ Image Layer 2 (R/O) │ ← apt-get install nginx
│ sha256:def456... │
├─────────────────────────────────────────────┤
│ Image Layer 1 (R/O) │ ← Base OS (Ubuntu)
│ sha256:789ghi... │
└─────────────────────────────────────────────┘
OverlayFS Mechanics¶
OverlayFS (Overlay Filesystem) is the default storage driver for Docker. It presents a unified view of multiple directories:
┌─────────────────────────────────────────────┐
│ Merged (View) │ ← What container sees
│ /merged │
└───────────────────┬─────────────────────────┘
│
┌───────────┴───────────┐
│ │
┌───────┴───────┐ ┌───────┴───────┐
│ UpperDir │ │ LowerDir │
│ (R/W) │ │ (R/O) │
│ /upper │ │ /lower │
└───────────────┘ └───────────────┘
Mount Command:
mount -t overlay overlay \
-o lowerdir=/lower,upperdir=/upper,workdir=/work \
/merged
Copy-on-Write (CoW) Operations¶
| Operation | Behavior |
|---|---|
| Read existing file | Transparent lookup through layers (fast) |
| Modify existing file | Copy entire file to upper layer, then modify |
| Delete file | Create "whiteout" file in upper layer |
| Create new file | Written directly to upper layer |
The Copy-Up Problem:
# If base image has a 1GB log file
# And you append 1 byte to it...
# The ENTIRE 1GB file is copied to the container layer first!
# This is why you should:
# 1. Never modify large files in the container filesystem
# 2. Use volumes for data that changes
# 3. Keep base images minimal
Whiteout Files¶
When you delete a file that exists in a lower layer, OverlayFS creates a special "whiteout" file:
# Delete /etc/config from lower layer
rm /merged/etc/config
# OverlayFS creates:
# /upper/etc/config (character device 0:0)
# This "whiteout" marker tells the filesystem to hide
# the file from the merged view
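The merged-view and whiteout semantics can be modeled with a toy sketch (plain dicts stand in for directory layers; this illustrates the lookup rules, not how the kernel implements them):

```python
# Toy model of OverlayFS lookup: the merged view prefers the upper layer,
# and a whiteout entry hides a lower-layer file entirely.
WHITEOUT = object()  # stands in for the character-device 0:0 marker

def merged_view(lower: dict, upper: dict) -> dict:
    merged = dict(lower)
    for path, content in upper.items():
        if content is WHITEOUT:
            merged.pop(path, None)   # deleted: hidden from the merged view
        else:
            merged[path] = content   # modified or new: upper layer wins
    return merged

lower = {"/etc/config": "from-image", "/bin/app": "v1"}
upper = {"/etc/config": WHITEOUT, "/bin/app": "v2"}   # rm + modify
print(merged_view(lower, upper))  # {'/bin/app': 'v2'}
```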
5. The OCI (Open Container Initiative) Standards¶
The OCI defines three specifications that ensure container interoperability:
OCI Runtime Specification¶
Defines how to run a "filesystem bundle":
container-bundle/
├── config.json # Container configuration
└── rootfs/ # Root filesystem
├── bin/
├── etc/
├── lib/
└── ...
config.json structure:
{
"ociVersion": "1.0.2",
"process": {
"terminal": false,
"user": { "uid": 0, "gid": 0 },
"args": ["nginx", "-g", "daemon off;"],
"env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"],
"cwd": "/"
},
"root": {
"path": "rootfs",
"readonly": false
},
"linux": {
"namespaces": [
{ "type": "pid" },
{ "type": "network" },
{ "type": "mount" },
{ "type": "ipc" },
{ "type": "uts" }
],
"resources": {
"memory": { "limit": 536870912 },
"cpu": { "quota": 150000, "period": 100000 }
}
}
}
OCI Image Specification¶
Defines the format of container images:
Image = Manifest + Config + Layers
Manifest (application/vnd.oci.image.manifest.v1+json):
{
"schemaVersion": 2,
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"digest": "sha256:abc...",
"size": 1234
},
"layers": [
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:def...",
"size": 12345678
}
]
}
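Because every blob is content-addressed, any config or layer can be verified independently of the registry that served it; a sketch (example bytes are invented):

```python
import hashlib

# Verify an OCI content-addressed digest of the form "<algo>:<hex>"
# against the exact bytes of a blob (config or layer).
def verify_digest(blob: bytes, digest: str) -> bool:
    algo, _, expected = digest.partition(":")
    h = hashlib.new(algo)
    h.update(blob)
    return h.hexdigest() == expected

config = b'{"architecture":"amd64"}'
digest = "sha256:" + hashlib.sha256(config).hexdigest()
print(verify_digest(config, digest))        # True
print(verify_digest(b"tampered", digest))   # False
```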
OCI Distribution Specification¶
Defines how images are pushed/pulled from registries (HTTP API).
6. The Runtime Architecture¶
The container ecosystem has multiple layers of runtimes:
┌─────────────────────────────────────────────────────────┐
│ User Interface │
│ (Docker CLI, Podman, nerdctl) │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ Container Engine │
│ (Docker Daemon, Podman) │
│ - Image management │
│ - Network management │
│ - Volume management │
│ - API server │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ High-Level Runtime │
│ (containerd, CRI-O) │
│ - Image pull/push │
│ - Container lifecycle │
│ - Snapshot management │
│ - Execution supervision │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ Low-Level Runtime │
│ (runc, crun, youki, gVisor, Kata) │
│ - Namespace creation │
│ - Cgroup configuration │
│ - Process execution │
│ - OCI runtime-spec implementation │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ Linux Kernel │
│ - Namespaces │
│ - Cgroups │
│ - Seccomp │
│ - Capabilities │
│ - OverlayFS │
└─────────────────────────────────────────────────────────┘
Low-Level Runtimes Comparison¶
| Runtime | Language | Isolation | Use Case |
|---|---|---|---|
| runc | Go | Namespaces | Default, most compatible |
| crun | C | Namespaces | Faster startup, lower memory |
| youki | Rust | Namespaces | Memory safety, modern |
| gVisor | Go | User-space kernel | Strong isolation (sandboxing) |
| Kata | Go | MicroVM | Hardware-level isolation |
| Firecracker | Rust | MicroVM | AWS Lambda, serverless |
7. Container vs. VM: Technical Comparison¶
| Feature | Virtual Machine (VM) | Container |
|---|---|---|
| Abstraction Layer | Hardware (via Hypervisor) | OS (via Kernel primitives) |
| Kernel | Each VM has its own kernel | Shared host kernel |
| Startup Time | 30s-2min (BIOS, kernel boot) | 100ms-1s (process start) |
| Memory Overhead | 500MB-2GB (guest OS) | 5-50MB (process only) |
| Disk Overhead | 10-50GB per VM | Shared layers (MB added) |
| Isolation Strength | Strong (hardware boundary) | Weaker (kernel boundary) |
| Density | 10-50 VMs per host | 100-1000 containers per host |
| Kernel Exploit Risk | Guest kernel only | Host kernel (shared) |
| Syscall Compatibility | Full (own kernel) | Host kernel version dependent |
When to Use VMs vs. Containers¶
Use VMs when:
- Running untrusted workloads (strong isolation required)
- Need different operating systems (Windows + Linux)
- Kernel version requirements differ
- Regulatory compliance requires hardware-level separation
Use Containers when:
- Deploying microservices
- Rapid scaling required
- Resource efficiency is critical
- CI/CD pipelines
- Same OS family across workloads
8. Security Considerations¶
Containers are not secure by default. The shared kernel is both an advantage (efficiency) and a risk (attack surface).
Defense in Depth¶
- User Namespaces: Run containers as non-root on host
- Read-only Root Filesystem: Prevent runtime modifications
- Dropped Capabilities: Remove unnecessary privileges
- Seccomp Profiles: Block dangerous syscalls
- AppArmor/SELinux: Mandatory Access Control
- Network Policies: Isolate container networks
- Image Scanning: Detect vulnerabilities before deployment
Container Escape Vectors¶
| Vector | Mitigation |
|---|---|
| Kernel exploits | Keep kernel patched, use gVisor/Kata |
| Privileged containers | Never use --privileged in production |
| Mounted Docker socket | Never mount /var/run/docker.sock |
| Host path mounts | Restrict to specific, non-sensitive paths |
| CAP_SYS_ADMIN | Drop all unnecessary capabilities |
9. The Evolution: Why Orchestration¶
Because containers are so lightweight, engineers stopped deploying "one server," and started deploying "hundreds of microservices." Managing this manually is impossible.
The Problems at Scale:
- Which host should this container run on?
- How do I ensure 3 copies are always running?
- How do containers find each other? (Service Discovery)
- How do I update without downtime? (Rolling Updates)
- How do I handle host failures? (Self-Healing)
- How do I manage secrets and configuration?
- How do I route external traffic? (Ingress)
This created the need for Kubernetes (K8s).
If Docker is the brick, Kubernetes is the architect.
Essential Engineer's Perspective¶
To master containerization, stop thinking of it as "mini-servers." Start thinking of it as packaging.
You are packaging your application with its entire environment (dependencies, OS config, network rules) so that it runs exactly the same on your laptop as it does on the production server.
The phrase "It works on my machine" is effectively solved by this technology.
The Mental Model:
Traditional Deployment:
App → Installed on Server → Configuration varies → "Works on my machine"
Container Deployment:
App + Dependencies + Config → Immutable Image → Runs anywhere identically
Key Principles:
- Immutability: Don't modify running containers; replace them
- Ephemerality: Containers can be killed and recreated anytime
- Single Process: One container = one process (ideally)
- Statelessness: Store state outside (volumes, databases)
- Declarative Configuration: Define desired state, let tools reconcile