Containerization

Containerization is often misunderstood as "lightweight Virtual Machines." That analogy is useful for building intuition, but it is technically incorrect.

Containerization is not virtualization; it is process isolation.

While a Virtual Machine (VM) simulates hardware to run a full guest OS, a Container simulates an Operating System to run a process. The fundamental difference lies in the abstraction layer: VMs abstract the hardware (CPU, memory, disk, network interfaces), while containers abstract the operating system interface (filesystem view, process table, network stack, user IDs).


1. The Core Illusion: It's Just a Process

If you run a container (e.g., docker run -d nginx) and then run ps aux on your host machine, you can actually find the Nginx process running directly on your host kernel. It is not hidden inside a black box file; it is a standard Linux process with a standard PID.

# On the host machine after running: docker run -d nginx
$ ps aux | grep nginx
root     24501  0.0  0.1  8860  5432 ?  Ss   10:15  0:00 nginx: master process
www-data 24512  0.0  0.0  9264  2340 ?  S    10:15  0:00 nginx: worker process

So why does that process think it has its own file system, IP address, and root user? The answer lies in three Linux Kernel features: Namespaces, Cgroups, and Union File Systems.

These are not container-specific inventions—they are general-purpose kernel primitives that have existed since the 2000s. Container runtimes simply orchestrate them into a cohesive illusion.


2. Namespaces (The Walls)

Namespaces manipulate what a process can see. They partition kernel resources such that one set of processes sees one set of resources, while another set sees a different set. Each namespace type isolates a specific global system resource.

The Namespace Syscall Interface

Namespaces are created and managed through three primary system calls:

// Create a new process in new namespaces
int clone(int (*fn)(void *), void *stack, int flags, void *arg);

// Move calling process into new namespaces
int unshare(int flags);

// Join an existing namespace
int setns(int fd, int nstype);

The flags parameter specifies which namespaces to create:

Flag              Namespace   Kernel Version   Purpose
CLONE_NEWPID      PID         2.6.24 (2008)    Process ID isolation
CLONE_NEWNS       Mount       2.4.19 (2002)    Filesystem mount points
CLONE_NEWNET      Network     2.6.29 (2009)    Network stack isolation
CLONE_NEWUTS      UTS         2.6.19 (2006)    Hostname and domain name
CLONE_NEWIPC      IPC         2.6.19 (2006)    Inter-process communication
CLONE_NEWUSER     User        3.8 (2013)       User and group ID mapping
CLONE_NEWCGROUP   Cgroup      4.6 (2016)       Cgroup root directory
CLONE_NEWTIME     Time        5.6 (2020)       System clock offsets

PID Namespace

The PID namespace isolates the process ID number space. This is fundamental to the "container as isolated system" illusion.

How It Works:

  • Each PID namespace has its own process numbering starting from 1
  • The first process in a new PID namespace becomes PID 1 (the init process)
  • Processes can see processes in their namespace and child namespaces, but not parent namespaces
  • PID namespaces form a hierarchy (nested namespaces)

# View namespace from host
$ ls -la /proc/24501/ns/
lrwxrwxrwx 1 root root 0 Jan 26 10:15 pid -> 'pid:[4026532456]'

# The number 4026532456 is the inode number identifying this namespace
# Processes with the same inode share the same namespace

The PID 1 Problem:

Inside a container, your application becomes PID 1. This is significant because PID 1 has special responsibilities in Unix:

  1. Signal Handling: The kernel does not apply default signal dispositions to PID 1, so SIGTERM and SIGINT are ignored unless the process installs handlers explicitly.
  2. Zombie Reaping: PID 1 must reap orphaned child processes, or they become zombies.

This is why container images often use init systems like tini or dumb-init as the entrypoint—they properly handle signals and reap zombies.

Mount Namespace (MNT)

The mount namespace isolates the list of mount points seen by a process. This allows each container to have its own root filesystem.

Key Operations:

  • pivot_root() — Changes the root filesystem for all processes in the namespace
  • chroot() — Changes the root directory for the calling process (weaker isolation)

Modern container runtimes use pivot_root because it's more secure:

// Simplified pivot_root sequence (error handling omitted). Note that glibc
// provides no pivot_root() wrapper; real code uses syscall(SYS_pivot_root, ...).
mkdir("/new_root/old_root", 0700);             // parking spot for the old root
pivot_root("/new_root", "/new_root/old_root"); // swap the root mounts
chdir("/");                                    // step into the new root
umount2("/old_root", MNT_DETACH);              // lazily detach the old root
rmdir("/old_root");                            // remove the parking spot

Mount Propagation:

Mounts can be configured with different propagation types:

Type            Behavior
MS_SHARED       Mount/unmount events propagate bidirectionally
MS_PRIVATE      No propagation (default for containers)
MS_SLAVE        Events propagate from master to slave only
MS_UNBINDABLE   Cannot be bind-mounted

Network Namespace (NET)

The network namespace provides isolation of the network stack:

  • Network devices (interfaces)
  • IPv4 and IPv6 protocol stacks
  • IP routing tables
  • Firewall rules (iptables/nftables)
  • Network ports
  • /proc/net and /sys/class/net

Creating Network Connectivity:

Containers need a way to communicate with the outside world. This is achieved through virtual ethernet (veth) pairs:

┌──────────────────────────────────────────────────────────┐
│                           HOST                           │
│                                                          │
│  ┌─────────────────┐          ┌─────────────────┐        │
│  │   Container A   │          │   Container B   │        │
│  │  NET Namespace  │          │  NET Namespace  │        │
│  │                 │          │                 │        │
│  │  eth0           │          │  eth0           │        │
│  │  172.17.0.2     │          │  172.17.0.3     │        │
│  └────────┬────────┘          └────────┬────────┘        │
│           │ vethA                      │ vethB           │
│    ┌──────┴────────────────────────────┴──────┐          │
│    │              docker0 bridge              │          │
│    │                172.17.0.1                │          │
│    └────────────────────┬─────────────────────┘          │
│                         │                                │
│    ┌────────────────────┴─────────────────────┐          │
│    │             eth0 (Host NIC)              │          │
│    │              192.168.1.100               │          │
│    └──────────────────────────────────────────┘          │
└──────────────────────────────────────────────────────────┘

User Namespace (USER)

The user namespace maps user and group IDs inside the namespace to different IDs outside. This is the foundation of rootless containers.

# Mapping configuration (inside container UID 0 → host UID 100000)
$ cat /proc/24501/uid_map
         0     100000      65536

# Format: <inside-start> <outside-start> <count>
# UID 0-65535 inside maps to UID 100000-165535 outside

Security Implications:

  • Process runs as root (UID 0) inside the container
  • If it escapes, it's an unprivileged user (UID 100000) on the host
  • Cannot access files owned by real root
  • Cannot load kernel modules or mount filesystems

UTS Namespace

Allows the container to have its own hostname and NIS domain name:

# Inside container
$ hostname
my-container-hostname

# On host
$ hostname
production-server-01

IPC Namespace

Isolates Inter-Process Communication resources:

  • System V IPC (message queues, semaphore sets, shared memory segments)
  • POSIX message queues

This prevents containers from interfering with each other's shared memory or message passing.

Cgroup Namespace

Virtualizes the view of /sys/fs/cgroup. A process sees its cgroup as the root:

# Inside container (sees root as its own cgroup)
$ cat /proc/self/cgroup
0::/

# On host (sees full path)
$ cat /proc/24501/cgroup
0::/system.slice/docker-abc123.scope

Time Namespace (Linux 5.6+)

Offsets the monotonic and boot-time clocks (CLOCK_MONOTONIC, CLOCK_BOOTTIME) per namespace. It does not virtualize the wall clock (CLOCK_REALTIME), so a container cannot be made to believe it is a different calendar year.

# Primary use case: checkpoint/restore (CRIU), where a restored process
# must not observe the monotonic clock jumping backwards

3. Cgroups (The Police)

Control Groups (cgroups) manipulate what a process can use. While Namespaces hide resources, Cgroups limit them.

Without Cgroups, a containerized process could consume 100% of your Host CPU or RAM, crashing the machine. Cgroups allow you to say: "This group of processes (container) gets max 512MB RAM and 50% of 1 CPU core."

Cgroups V1 vs V2

Aspect               Cgroups V1                                  Cgroups V2
Hierarchy            Multiple hierarchies (one per controller)   Single unified hierarchy
Mount point          /sys/fs/cgroup/<controller>/                /sys/fs/cgroup/
Process membership   One cgroup per controller                   Exactly one cgroup
Interface files      Controller-specific prefixes                Unified naming (cpu.max, memory.max)
Status (2024+)       Legacy                                      Default in modern distros

Cgroup V2 Controllers

CPU Controller:

$ cat /sys/fs/cgroup/docker/abc123/cpu.max
150000 100000
# Format: <quota> <period>
# 150000 µs of CPU time every 100000 µs = 1.5 cores

$ cat /sys/fs/cgroup/docker/abc123/cpu.weight
100
# Relative weight (1-10000, default 100) for CPU time sharing

Memory Controller:

$ cat /sys/fs/cgroup/docker/abc123/memory.max
536870912
# Hard limit in bytes (512MB)

$ cat /sys/fs/cgroup/docker/abc123/memory.current
234881024
# Current usage

$ cat /sys/fs/cgroup/docker/abc123/memory.swap.max
0
# Swap limit (0 = no swap allowed)

I/O Controller:

$ cat /sys/fs/cgroup/docker/abc123/io.max
8:0 rbps=10485760 wbps=10485760 riops=1000 wiops=1000
# Device 8:0: max 10MB/s read/write, 1000 IOPS

PIDs Controller:

$ cat /sys/fs/cgroup/docker/abc123/pids.max
100
# Maximum number of processes (fork bomb protection)

The OOM Killer

When a container exceeds its memory limit:

  1. Kernel tries to reclaim memory (page cache, swap)
  2. If unsuccessful, triggers OOM (Out of Memory) Killer
  3. OOM Killer selects process with highest oom_score
  4. Selected process is terminated with SIGKILL

# Check OOM events
$ cat /sys/fs/cgroup/docker/abc123/memory.events
low 0
high 0
max 0
oom 3          # OOM triggered 3 times
oom_kill 3     # 3 processes killed

Memory Accounting

Cgroups track different types of memory:

$ cat /sys/fs/cgroup/docker/abc123/memory.stat
anon 104857600           # Anonymous memory (heap, stack)
file 52428800            # File-backed memory (page cache)
kernel_stack 163840      # Kernel stack
pagetables 524288        # Page tables
shmem 0                  # Shared memory
sock 4096                # Socket buffers

4. Union File Systems (The Storage)

This is a large part of why containers are so fast to start, and so cheap to store and distribute, compared to VMs.

  • Standard VM: A 10GB disk image is a giant binary blob that must be copied/attached.
  • Container: Uses a layered file system (like OverlayFS).

The Layer Model

┌─────────────────────────────────────────────┐
│         Container Layer (R/W)               │  ← Ephemeral, per-container
│         /var/lib/docker/overlay2/xyz/diff   │
├─────────────────────────────────────────────┤
│         Image Layer 3 (R/O)                 │  ← Your application code
│         sha256:abc123...                    │
├─────────────────────────────────────────────┤
│         Image Layer 2 (R/O)                 │  ← apt-get install nginx
│         sha256:def456...                    │
├─────────────────────────────────────────────┤
│         Image Layer 1 (R/O)                 │  ← Base OS (Ubuntu)
│         sha256:789ghi...                    │
└─────────────────────────────────────────────┘

OverlayFS Mechanics

OverlayFS is the kernel filesystem behind overlay2, Docker's default storage driver. It presents a unified view of multiple stacked directories:

┌─────────────────────────────────────────────┐
│              Merged (View)                   │  ← What container sees
│              /merged                         │
└───────────────────┬─────────────────────────┘
                    │
        ┌───────────┴───────────┐
        │                       │
┌───────┴───────┐       ┌───────┴───────┐
│   UpperDir    │       │   LowerDir    │
│   (R/W)       │       │   (R/O)       │
│   /upper      │       │   /lower      │
└───────────────┘       └───────────────┘

Mount Command:

mount -t overlay overlay \
  -o lowerdir=/lower,upperdir=/upper,workdir=/work \
  /merged

Copy-on-Write (CoW) Operations

Operation              Behavior
Read existing file     Transparent lookup through layers (fast)
Modify existing file   Entire file copied up to the upper layer, then modified
Delete file            "Whiteout" marker created in upper layer
Create new file        Written directly to upper layer

The Copy-Up Problem:

# If base image has a 1GB log file
# And you append 1 byte to it...
# The ENTIRE 1GB file is copied to the container layer first!

# This is why you should:
# 1. Never modify large files in the container filesystem
# 2. Use volumes for data that changes
# 3. Keep base images minimal

Whiteout Files

When you delete a file that exists in a lower layer, OverlayFS creates a special "whiteout" file:

# Delete /etc/config from lower layer
rm /merged/etc/config

# OverlayFS creates:
# /upper/etc/config (character device 0:0)

# This "whiteout" marker tells the filesystem to hide
# the file from the merged view

5. The OCI (Open Container Initiative) Standards

The OCI defines three specifications that ensure container interoperability:

OCI Runtime Specification

Defines how to run a "filesystem bundle":

container-bundle/
├── config.json      # Container configuration
└── rootfs/          # Root filesystem
    ├── bin/
    ├── etc/
    ├── lib/
    └── ...

config.json structure:

{
  "ociVersion": "1.0.2",
  "process": {
    "terminal": false,
    "user": { "uid": 0, "gid": 0 },
    "args": ["nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "ipc" },
      { "type": "uts" }
    ],
    "resources": {
      "memory": { "limit": 536870912 },
      "cpu": { "quota": 150000, "period": 100000 }
    }
  }
}

OCI Image Specification

Defines the format of container images:

Image = Manifest + Config + Layers

Manifest (application/vnd.oci.image.manifest.v1+json):
{
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc...",
    "size": 1234
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:def...",
      "size": 12345678
    }
  ]
}

OCI Distribution Specification

Defines how images are pushed/pulled from registries (HTTP API).


6. The Runtime Architecture

The container ecosystem has multiple layers of runtimes:

┌─────────────────────────────────────────────────────────┐
│                    User Interface                        │
│              (Docker CLI, Podman, nerdctl)              │
└───────────────────────────┬─────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────┐
│                  Container Engine                        │
│              (Docker Daemon, Podman)                     │
│         - Image management                               │
│         - Network management                             │
│         - Volume management                              │
│         - API server                                     │
└───────────────────────────┬─────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────┐
│               High-Level Runtime                         │
│              (containerd, CRI-O)                         │
│         - Image pull/push                                │
│         - Container lifecycle                            │
│         - Snapshot management                            │
│         - Execution supervision                          │
└───────────────────────────┬─────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────┐
│                Low-Level Runtime                         │
│              (runc, crun, youki, gVisor, Kata)          │
│         - Namespace creation                             │
│         - Cgroup configuration                           │
│         - Process execution                              │
│         - OCI runtime-spec implementation                │
└───────────────────────────┬─────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────┐
│                    Linux Kernel                          │
│         - Namespaces                                     │
│         - Cgroups                                        │
│         - Seccomp                                        │
│         - Capabilities                                   │
│         - OverlayFS                                      │
└─────────────────────────────────────────────────────────┘

Low-Level Runtimes Comparison

Runtime       Language   Isolation           Use Case
runc          Go         Namespaces          Default, most compatible
crun          C          Namespaces          Faster startup, lower memory
youki         Rust       Namespaces          Memory safety, modern
gVisor        Go         User-space kernel   Strong isolation (sandboxing)
Kata          Go         MicroVM             Hardware-level isolation
Firecracker   Rust       MicroVM             AWS Lambda, serverless

7. Container vs. VM: Technical Comparison

Feature                 Virtual Machine (VM)                 Container
Abstraction layer       Hardware (via hypervisor)            OS (via kernel primitives)
Kernel                  Own guest kernel per VM              Shared host kernel
Startup time            30 s-2 min (firmware, kernel boot)   100 ms-1 s (process start)
Memory overhead         500 MB-2 GB (guest OS)               5-50 MB (process only)
Disk overhead           10-50 GB per VM                      Shared layers (MBs added)
Isolation strength      Strong (hardware boundary)           Weaker (kernel boundary)
Density                 10-50 VMs per host                   100-1000 containers per host
Kernel exploit risk     Blast radius limited to the guest    Affects host and all containers
Syscall compatibility   Full (own kernel)                    Tied to host kernel version

When to Use VMs vs. Containers

Use VMs when:

  • Running untrusted workloads (strong isolation required)
  • Need different operating systems (Windows + Linux)
  • Kernel version requirements differ
  • Regulatory compliance requires hardware-level separation

Use Containers when:

  • Deploying microservices
  • Rapid scaling required
  • Resource efficiency is critical
  • CI/CD pipelines
  • Same OS family across workloads

8. Security Considerations

Containers are not secure by default. The shared kernel is both an advantage (efficiency) and a risk (attack surface).

Defense in Depth

  1. User Namespaces: Run containers as non-root on host
  2. Read-only Root Filesystem: Prevent runtime modifications
  3. Dropped Capabilities: Remove unnecessary privileges
  4. Seccomp Profiles: Block dangerous syscalls
  5. AppArmor/SELinux: Mandatory Access Control
  6. Network Policies: Isolate container networks
  7. Image Scanning: Detect vulnerabilities before deployment

Container Escape Vectors

Vector                  Mitigation
Kernel exploits         Keep the kernel patched; use gVisor/Kata for untrusted code
Privileged containers   Never use --privileged in production
Mounted Docker socket   Never mount /var/run/docker.sock
Host path mounts        Restrict to specific, non-sensitive paths
CAP_SYS_ADMIN           Drop all capabilities not strictly required

9. The Evolution: Why Orchestration

Because containers are so lightweight, engineers stopped deploying "one server" and started deploying "hundreds of microservices." Managing that by hand quickly becomes impossible.

The Problems at Scale:

  • Which host should this container run on?
  • How do I ensure 3 copies are always running?
  • How do containers find each other? (Service Discovery)
  • How do I update without downtime? (Rolling Updates)
  • How do I handle host failures? (Self-Healing)
  • How do I manage secrets and configuration?
  • How do I route external traffic? (Ingress)

This created the need for Kubernetes (K8s).

If Docker is the brick, Kubernetes is the architect.


Essential Engineer's Perspective

To master containerization, stop thinking of it as "mini-servers." Start thinking of it as packaging.

You are packaging your application with its entire environment (dependencies, OS config, network rules) so that it runs exactly the same on your laptop as it does on the production server.

The phrase "It works on my machine" is effectively solved by this technology.

The Mental Model:

Traditional Deployment:
  App → Installed on Server → Configuration varies → "Works on my machine"

Container Deployment:
  App + Dependencies + Config → Immutable Image → Runs anywhere identically

Key Principles:

  1. Immutability: Don't modify running containers; replace them
  2. Ephemerality: Containers can be killed and recreated anytime
  3. Single Process: One container = one process (ideally)
  4. Statelessness: Store state outside (volumes, databases)
  5. Declarative Configuration: Define desired state, let tools reconcile