Containerization¶
Containerization is often misunderstood as "lightweight Virtual Machines." To an engineer, that is a useful analogy, but it is technically incorrect.
Containerization is not virtualization; it is process isolation.
While a Virtual Machine (VM) simulates hardware to run a full guest OS, a Container simulates an Operating System to run a process. The fundamental difference lies in the abstraction layer: VMs abstract the hardware (CPU, memory, disk, network interfaces), while containers abstract the operating system interface (filesystem view, process table, network stack, user IDs).
1. The Core Illusion: It's Just a Process¶
If you run a container (e.g., docker run -d nginx) and then run ps aux on your host machine, you can actually find the Nginx process running directly on your host kernel. It is not hidden inside a black box file; it is a standard Linux process with a standard PID.
# On the host machine after running: docker run -d nginx
$ ps aux | grep nginx
root 24501 0.0 0.1 8860 5432 ? Ss 10:15 0:00 nginx: master process
www-data 24512 0.0 0.0 9264 2340 ? S 10:15 0:00 nginx: worker process
So why does that process think it has its own file system, IP address, and root user? The answer lies in three Linux Kernel features: Namespaces, Cgroups, and Union File Systems.
These are not container-specific inventions—they are general-purpose kernel primitives that have existed since the 2000s. Container runtimes simply orchestrate them into a cohesive illusion.
2. Namespaces (The Walls)¶
Namespaces manipulate what a process can see. They partition kernel resources such that one set of processes sees one set of resources, while another set sees a different set. Each namespace type isolates a specific global system resource.
The Namespace Syscall Interface¶
Namespaces are created and managed through three primary system calls:
// Create a new process in new namespaces
int clone(int (*fn)(void *), void *stack, int flags, void *arg);
// Move calling process into new namespaces
int unshare(int flags);
// Join an existing namespace
int setns(int fd, int nstype);
The flags parameter specifies which namespaces to create:
| Flag | Namespace | Kernel Version | Purpose |
|---|---|---|---|
| CLONE_NEWPID | PID | 2.6.24 (2008) | Process ID isolation |
| CLONE_NEWNS | Mount | 2.4.19 (2002) | Filesystem mount points |
| CLONE_NEWNET | Network | 2.6.29 (2009) | Network stack isolation |
| CLONE_NEWUTS | UTS | 2.6.19 (2006) | Hostname and domain name |
| CLONE_NEWIPC | IPC | 2.6.19 (2006) | Inter-process communication |
| CLONE_NEWUSER | User | 3.8 (2013) | User and group ID mapping |
| CLONE_NEWCGROUP | Cgroup | 4.6 (2016) | Cgroup root directory |
| CLONE_NEWTIME | Time | 5.6 (2020) | System clock offsets |
PID Namespace¶
The PID namespace isolates the process ID number space. This is fundamental to the "container as isolated system" illusion.
How It Works:
- Each PID namespace has its own process numbering starting from 1
- The first process in a new PID namespace becomes PID 1 (the init process)
- Processes can see processes in their namespace and child namespaces, but not parent namespaces
- PID namespaces form a hierarchy (nested namespaces)
# View namespace from host
$ ls -la /proc/24501/ns/
lrwxrwxrwx 1 root root 0 Jan 26 10:15 pid -> 'pid:[4026532456]'
# The number 4026532456 is the inode number identifying this namespace
# Processes with the same inode share the same namespace
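As a sketch, the symlink targets shown above can be compared programmatically: two processes share a namespace exactly when the inode numbers in their /proc/&lt;pid&gt;/ns/* links match. The helper name below is invented for illustration; in real code the link target would come from os.readlink("/proc/&lt;pid&gt;/ns/pid").

```python
# Hypothetical helper: parse a /proc/<pid>/ns/<type> symlink target,
# e.g. "pid:[4026532456]", into (type, inode). Two processes are in the
# same namespace iff these inode numbers are equal.
def parse_ns_link(target: str) -> tuple:
    ns_type, _, rest = target.partition(":")
    return ns_type, int(rest.strip("[]"))

# Compare two processes' PID namespaces by inode
a = parse_ns_link("pid:[4026532456]")
b = parse_ns_link("pid:[4026532456]")
print(a == b)  # True -> same PID namespace
```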
The PID 1 Problem:
Inside a container, your application becomes PID 1. This is significant because PID 1 has special responsibilities in Unix:
- Signal Handling: PID 1 does not get the kernel's default signal dispositions. SIGTERM and SIGINT are ignored unless the process explicitly installs handlers.
- Zombie Reaping: PID 1 must reap orphaned child processes, or they become zombies.
This is why container images often use init systems like tini or dumb-init as the entrypoint—they properly handle signals and reap zombies.
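A minimal sketch of the two duties tini and dumb-init perform, written in Python purely for illustration (real init shims are small C binaries; the function name here is invented):

```python
import os
import signal

def run_as_init(argv):
    """Spawn argv as a child, forward SIGTERM to it, and reap all children."""
    child = os.fork()
    if child == 0:
        os.execvp(argv[0], argv)       # replace the forked copy with the workload
    # Duty 1: forward termination requests instead of letting PID 1 ignore them
    signal.signal(signal.SIGTERM, lambda sig, frame: os.kill(child, signal.SIGTERM))
    exit_code = 0
    while True:
        try:
            # Duty 2: reap every exiting child, including re-parented orphans
            pid, status = os.wait()
        except ChildProcessError:
            return exit_code           # no children left to reap
        if pid == child:
            exit_code = os.waitstatus_to_exitcode(status)

# Usage sketch: run_as_init(["nginx", "-g", "daemon off;"])
```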
Mount Namespace (MNT)¶
The mount namespace isolates the list of mount points seen by a process. This allows each container to have its own root filesystem.
Key Operations:
- pivot_root(): changes the root filesystem for all processes in the namespace
- chroot(): changes the root directory for the calling process only (weaker isolation)
Modern container runtimes use pivot_root because it's more secure:
// Simplified pivot_root usage
mkdir("/new_root/old_root");
pivot_root("/new_root", "/new_root/old_root");
chdir("/");
umount2("/old_root", MNT_DETACH);
rmdir("/old_root");
Mount Propagation:
Mounts can be configured with different propagation types:
| Type | Behavior |
|---|---|
| MS_SHARED | Mount/unmount events propagate bidirectionally |
| MS_PRIVATE | No propagation (default for containers) |
| MS_SLAVE | Events propagate from master to slave only |
| MS_UNBINDABLE | Cannot be bind-mounted |
Network Namespace (NET)¶
The network namespace provides isolation of the network stack:
- Network devices (interfaces)
- IPv4 and IPv6 protocol stacks
- IP routing tables
- Firewall rules (iptables/nftables)
- Network ports
- /proc/net and /sys/class/net
Creating Network Connectivity:
Containers need a way to communicate with the outside world. This is achieved through virtual ethernet (veth) pairs:
┌─────────────────────────────────────────────────────────────┐
│ HOST │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Container A │ │ Container B │ │
│ │ NET Namespace │ │ NET Namespace │ │
│ │ │ │ │ │
│ │ eth0 │ │ eth0 │ │
│ │ 172.17.0.2 │ │ 172.17.0.3 │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ vethA │ vethB │
│ │ │ │
│ ┌────────┴──────────────────────────┴────────┐ │
│ │ docker0 bridge │ │
│ │ 172.17.0.1 │ │
│ └────────────────────┬───────────────────────┘ │
│ │ │
│ ┌────────────────────┴───────────────────────┐ │
│ │ eth0 (Host NIC) │ │
│ │ 192.168.1.100 │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
User Namespace (USER)¶
The user namespace maps user and group IDs inside the namespace to different IDs outside. This is the foundation of rootless containers.
# Mapping configuration (inside container UID 0 → host UID 100000)
$ cat /proc/24501/uid_map
0 100000 65536
# Format: <inside-start> <outside-start> <count>
# UID 0-65535 inside maps to UID 100000-165535 outside
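The mapping arithmetic is simple enough to sketch (hypothetical Python helper; the real translation is done by the kernel when it checks permissions):

```python
# Translate an in-container UID to the host UID using the
# /proc/<pid>/uid_map format: "<inside-start> <outside-start> <count>"
def map_uid(uid_map: str, inside_uid: int) -> int:
    for line in uid_map.strip().splitlines():
        inside, outside, count = map(int, line.split())
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    raise ValueError(f"UID {inside_uid} has no mapping")

print(map_uid("0 100000 65536", 0))    # 100000 (container "root")
print(map_uid("0 100000 65536", 33))   # 100033 (e.g. www-data)
```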
Security Implications:
- Process runs as root (UID 0) inside the container
- If it escapes, it's an unprivileged user (UID 100000) on the host
- Cannot access files owned by real root
- Cannot load kernel modules or mount filesystems
UTS Namespace¶
Allows the container to have its own hostname and NIS domain name:
# Inside container
$ hostname
my-container-hostname
# On host
$ hostname
production-server-01
IPC Namespace¶
Isolates Inter-Process Communication resources:
- System V IPC (message queues, semaphore sets, shared memory segments)
- POSIX message queues
This prevents containers from interfering with each other's shared memory or message passing.
Cgroup Namespace¶
Virtualizes a process's view of the cgroup hierarchy: the process sees its own cgroup as the root, both in /proc/self/cgroup and in cgroupfs mounts:
# Inside container (sees root as its own cgroup)
$ cat /proc/self/cgroup
0::/
# On host (sees full path)
$ cat /proc/24501/cgroup
0::/system.slice/docker-abc123.scope
Time Namespace (Linux 5.6+)¶
Allows per-namespace offsets for the CLOCK_MONOTONIC and CLOCK_BOOTTIME clocks. It does not virtualize the wall clock (CLOCK_REALTIME), so a container cannot be made to "think it's a different year":
# Primary use case: checkpoint/restore (CRIU), so a restored or migrated
# container sees consistent monotonic and boot-time clock readings
3. Cgroups (The Police)¶
Control Groups (cgroups) manipulate what a process can use. While Namespaces hide resources, Cgroups limit them.
Without Cgroups, a containerized process could consume 100% of your Host CPU or RAM, crashing the machine. Cgroups allow you to say: "This group of processes (container) gets max 512MB RAM and 50% of 1 CPU core."
Cgroups V1 vs V2¶
| Aspect | Cgroups V1 | Cgroups V2 |
|---|---|---|
| Hierarchy | Multiple hierarchies (one per controller) | Single unified hierarchy |
| Mount point | /sys/fs/cgroup/<controller>/ | /sys/fs/cgroup/ |
| Process membership | Process can be in different cgroups per controller | Process in exactly one cgroup |
| Interface files | Controller-specific prefixes | Unified naming (cpu.max, memory.max) |
| Default (2024+) | Legacy | Default in modern distros |
Cgroup V2 Controllers¶
CPU Controller:
$ cat /sys/fs/cgroup/docker/abc123/cpu.max
150000 100000
# Format: <quota> <period>
# 150000 µs of CPU time every 100000 µs = 1.5 cores
$ cat /sys/fs/cgroup/docker/abc123/cpu.weight
100
# Relative weight (1-10000, default 100) for CPU time sharing
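The quota/period arithmetic above can be sketched as a small helper (hypothetical function name, shown for illustration):

```python
# Turn a cgroup v2 cpu.max line ("<quota> <period>") into an effective
# core count. A quota of "max" means unlimited.
def effective_cores(cpu_max: str):
    quota, period = cpu_max.split()
    if quota == "max":
        return None                  # no CPU limit configured
    return int(quota) / int(period)

print(effective_cores("150000 100000"))  # 1.5
print(effective_cores("max 100000"))     # None
```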
Memory Controller:
$ cat /sys/fs/cgroup/docker/abc123/memory.max
536870912
# Hard limit in bytes (512MB)
$ cat /sys/fs/cgroup/docker/abc123/memory.current
234881024
# Current usage
$ cat /sys/fs/cgroup/docker/abc123/memory.swap.max
0
# Swap limit (0 = no swap allowed)
I/O Controller:
$ cat /sys/fs/cgroup/docker/abc123/io.max
8:0 rbps=10485760 wbps=10485760 riops=1000 wiops=1000
# Device 8:0: max 10MB/s read/write, 1000 IOPS
PIDs Controller:
$ cat /sys/fs/cgroup/docker/abc123/pids.max
100
# Maximum number of processes (fork bomb protection)
The OOM Killer¶
When a container exceeds its memory limit:
- Kernel tries to reclaim memory (page cache, swap)
- If unsuccessful, triggers OOM (Out of Memory) Killer
- OOM Killer selects the process with the highest oom_score
- Selected process is terminated with SIGKILL
# Check OOM events
$ cat /sys/fs/cgroup/docker/abc123/memory.events
low 0
high 0
max 0
oom 3 # OOM triggered 3 times
oom_kill 3 # 3 processes killed
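Monitoring tools read this file in the same key/value form; a minimal sketch (assumed helper name, input shown without the explanatory comments above):

```python
# Parse cgroup v2 memory.events into a dict and flag OOM kills
def parse_memory_events(text: str) -> dict:
    events = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        events[key] = int(value)
    return events

sample = "low 0\nhigh 0\nmax 0\noom 3\noom_kill 3"
events = parse_memory_events(sample)
print(events["oom_kill"] > 0)  # True -> the container hit its memory limit
```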
Memory Accounting¶
Cgroups track different types of memory:
$ cat /sys/fs/cgroup/docker/abc123/memory.stat
anon 104857600 # Anonymous memory (heap, stack)
file 52428800 # File-backed memory (page cache)
kernel_stack 163840 # Kernel stack
pagetables 524288 # Page tables
shmem 0 # Shared memory
sock 4096 # Socket buffers
4. Union File Systems (The Storage)¶
This is why containers are so fast to start compared to VMs.
- Standard VM: A 10GB disk image is a giant binary blob that must be copied/attached.
- Container: Uses a layered file system (like OverlayFS).
The Layer Model¶
┌─────────────────────────────────────────────┐
│ Container Layer (R/W) │ ← Ephemeral, per-container
│ /var/lib/docker/overlay2/xyz/diff │
├─────────────────────────────────────────────┤
│ Image Layer 3 (R/O) │ ← Your application code
│ sha256:abc123... │
├─────────────────────────────────────────────┤
│ Image Layer 2 (R/O) │ ← apt-get install nginx
│ sha256:def456... │
├─────────────────────────────────────────────┤
│ Image Layer 1 (R/O) │ ← Base OS (Ubuntu)
│ sha256:789ghi... │
└─────────────────────────────────────────────┘
OverlayFS Mechanics¶
OverlayFS (Overlay Filesystem) is the default storage driver for Docker. It presents a unified view of multiple directories:
┌─────────────────────────────────────────────┐
│ Merged (View) │ ← What container sees
│ /merged │
└───────────────────┬─────────────────────────┘
│
┌───────────┴───────────┐
│ │
┌───────┴───────┐ ┌───────┴───────┐
│ UpperDir │ │ LowerDir │
│ (R/W) │ │ (R/O) │
│ /upper │ │ /lower │
└───────────────┘ └───────────────┘
Mount Command:
mount -t overlay overlay \
-o lowerdir=/lower,upperdir=/upper,workdir=/work \
/merged
Copy-on-Write (CoW) Operations¶
| Operation | Behavior |
|---|---|
| Read existing file | Transparent lookup through layers (fast) |
| Modify existing file | Copy entire file to upper layer, then modify |
| Delete file | Create "whiteout" file in upper layer |
| Create new file | Written directly to upper layer |
The Copy-Up Problem:
# If base image has a 1GB log file
# And you append 1 byte to it...
# The ENTIRE 1GB file is copied to the container layer first!
# This is why you should:
# 1. Never modify large files in the container filesystem
# 2. Use volumes for data that changes
# 3. Keep base images minimal
Whiteout Files¶
When you delete a file that exists in a lower layer, OverlayFS creates a special "whiteout" file:
# Delete /etc/config from lower layer
rm /merged/etc/config
# OverlayFS creates:
# /upper/etc/config (character device 0:0)
# This "whiteout" marker tells the filesystem to hide
# the file from the merged view
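The merged-view and whiteout semantics can be modeled with a toy sketch (plain dicts stand in for directory layers; this illustrates the lookup rules, not how the kernel implements them):

```python
# Toy model of OverlayFS lookup: the merged view prefers the upper layer,
# and a whiteout entry hides a lower-layer file entirely.
WHITEOUT = object()  # stands in for the character-device 0:0 marker

def merged_view(lower: dict, upper: dict) -> dict:
    merged = dict(lower)
    for path, content in upper.items():
        if content is WHITEOUT:
            merged.pop(path, None)   # deleted: hidden from the merged view
        else:
            merged[path] = content   # modified or new: upper layer wins
    return merged

lower = {"/etc/config": "from-image", "/bin/app": "v1"}
upper = {"/etc/config": WHITEOUT, "/bin/app": "v2"}   # rm + modify
print(merged_view(lower, upper))  # {'/bin/app': 'v2'}
```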
5. The OCI (Open Container Initiative) Standards¶
The OCI defines three specifications that ensure container interoperability:
OCI Runtime Specification¶
Defines how to run a "filesystem bundle":
container-bundle/
├── config.json # Container configuration
└── rootfs/ # Root filesystem
├── bin/
├── etc/
├── lib/
└── ...
config.json structure:
{
"ociVersion": "1.0.2",
"process": {
"terminal": false,
"user": { "uid": 0, "gid": 0 },
"args": ["nginx", "-g", "daemon off;"],
"env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"],
"cwd": "/"
},
"root": {
"path": "rootfs",
"readonly": false
},
"linux": {
"namespaces": [
{ "type": "pid" },
{ "type": "network" },
{ "type": "mount" },
{ "type": "ipc" },
{ "type": "uts" }
],
"resources": {
"memory": { "limit": 536870912 },
"cpu": { "quota": 150000, "period": 100000 }
}
}
}
OCI Image Specification¶
Defines the format of container images:
Image = Manifest + Config + Layers
Manifest (application/vnd.oci.image.manifest.v1+json):
{
"schemaVersion": 2,
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"digest": "sha256:abc...",
"size": 1234
},
"layers": [
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:def...",
"size": 12345678
}
]
}
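Because every blob is content-addressed, any config or layer can be verified independently of the registry that served it; a sketch (example bytes are invented):

```python
import hashlib

# Verify an OCI content-addressed digest of the form "<algo>:<hex>"
# against the exact bytes of a blob (config or layer).
def verify_digest(blob: bytes, digest: str) -> bool:
    algo, _, expected = digest.partition(":")
    h = hashlib.new(algo)
    h.update(blob)
    return h.hexdigest() == expected

config = b'{"architecture":"amd64"}'
digest = "sha256:" + hashlib.sha256(config).hexdigest()
print(verify_digest(config, digest))        # True
print(verify_digest(b"tampered", digest))   # False
```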
OCI Distribution Specification¶
Defines how images are pushed/pulled from registries (HTTP API).
6. The Runtime Architecture¶
The container ecosystem has multiple layers of runtimes:
┌─────────────────────────────────────────────────────────┐
│ User Interface │
│ (Docker CLI, Podman, nerdctl) │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ Container Engine │
│ (Docker Daemon, Podman) │
│ - Image management │
│ - Network management │
│ - Volume management │
│ - API server │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ High-Level Runtime │
│ (containerd, CRI-O) │
│ - Image pull/push │
│ - Container lifecycle │
│ - Snapshot management │
│ - Execution supervision │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ Low-Level Runtime │
│ (runc, crun, youki, gVisor, Kata) │
│ - Namespace creation │
│ - Cgroup configuration │
│ - Process execution │
│ - OCI runtime-spec implementation │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ Linux Kernel │
│ - Namespaces │
│ - Cgroups │
│ - Seccomp │
│ - Capabilities │
│ - OverlayFS │
└─────────────────────────────────────────────────────────┘
Low-Level Runtimes Comparison¶
| Runtime | Language | Isolation | Use Case |
|---|---|---|---|
| runc | Go | Namespaces | Default, most compatible |
| crun | C | Namespaces | Faster startup, lower memory |
| youki | Rust | Namespaces | Memory safety, modern |
| gVisor | Go | User-space kernel | Strong isolation (sandboxing) |
| Kata | Go | MicroVM | Hardware-level isolation |
| Firecracker | Rust | MicroVM | AWS Lambda, serverless |
7. Container vs. VM: Technical Comparison¶
| Feature | Virtual Machine (VM) | Container |
|---|---|---|
| Abstraction Layer | Hardware (via Hypervisor) | OS (via Kernel primitives) |
| Kernel | Each VM has its own kernel | Shared host kernel |
| Startup Time | 30s-2min (BIOS, kernel boot) | 100ms-1s (process start) |
| Memory Overhead | 500MB-2GB (guest OS) | 5-50MB (process only) |
| Disk Overhead | 10-50GB per VM | Shared layers (MB added) |
| Isolation Strength | Strong (hardware boundary) | Weaker (kernel boundary) |
| Density | 10-50 VMs per host | 100-1000 containers per host |
| Kernel Exploit Risk | Guest kernel only | Host kernel (shared) |
| Syscall Compatibility | Full (own kernel) | Host kernel version dependent |
When to Use VMs vs. Containers¶
Use VMs when:
- Running untrusted workloads (strong isolation required)
- Need different operating systems (Windows + Linux)
- Kernel version requirements differ
- Regulatory compliance requires hardware-level separation
Use Containers when:
- Deploying microservices
- Rapid scaling required
- Resource efficiency is critical
- CI/CD pipelines
- Same OS family across workloads
8. Security Considerations¶
Containers are not secure by default. The shared kernel is both an advantage (efficiency) and a risk (attack surface).
Defense in Depth¶
- User Namespaces: Run containers as non-root on host
- Read-only Root Filesystem: Prevent runtime modifications
- Dropped Capabilities: Remove unnecessary privileges
- Seccomp Profiles: Block dangerous syscalls
- AppArmor/SELinux: Mandatory Access Control
- Network Policies: Isolate container networks
- Image Scanning: Detect vulnerabilities before deployment
Container Escape Vectors¶
| Vector | Mitigation |
|---|---|
| Kernel exploits | Keep kernel patched, use gVisor/Kata |
| Privileged containers | Never use --privileged in production |
| Mounted Docker socket | Never mount /var/run/docker.sock |
| Host path mounts | Restrict to specific, non-sensitive paths |
| CAP_SYS_ADMIN | Drop all unnecessary capabilities |
9. The Evolution: Why Orchestration¶
Because containers are so lightweight, engineers stopped deploying "one server," and started deploying "hundreds of microservices." Managing this manually is impossible.
The Problems at Scale:
- Which host should this container run on?
- How do I ensure 3 copies are always running?
- How do containers find each other? (Service Discovery)
- How do I update without downtime? (Rolling Updates)
- How do I handle host failures? (Self-Healing)
- How do I manage secrets and configuration?
- How do I route external traffic? (Ingress)
This created the need for Kubernetes (K8s).
If Docker is the brick, Kubernetes is the architect.
Essential Engineer's Perspective¶
To master containerization, stop thinking of it as "mini-servers." Start thinking of it as packaging.
You are packaging your application with its entire environment (dependencies, OS config, network rules) so that it runs exactly the same on your laptop as it does on the production server.
The phrase "It works on my machine" is effectively solved by this technology.
The Mental Model:
Traditional Deployment:
App → Installed on Server → Configuration varies → "Works on my machine"
Container Deployment:
App + Dependencies + Config → Immutable Image → Runs anywhere identically
Key Principles:
- Immutability: Don't modify running containers; replace them
- Ephemerality: Containers can be killed and recreated anytime
- Single Process: One container = one process (ideally)
- Statelessness: Store state outside (volumes, databases)
- Declarative Configuration: Define desired state, let tools reconcile