Docker

Docker is a platform that enables developers to build, ship, and run applications in isolated environments called containers. Unlike virtual machines (VMs), which require a full guest operating system and hypervisor, containers share the host's kernel, making them lightweight, fast to start, and resource-efficient.

This chapter provides an in-depth technical exploration of Docker's internals, from the syscall level to production best practices.


1. Historical Context

Containerization has roots predating Docker:

Year  Technology          Innovation
1979  chroot              Filesystem isolation (Unix V7)
2000  FreeBSD Jails       Process isolation + networking
2005  Solaris Zones       Resource controls + virtualization
2006  Process Containers  Google's cgroups (merged into Linux 2.6.24)
2008  LXC                 Linux Containers (namespaces + cgroups)
2013  Docker              Developer-friendly tooling + ecosystem
2015  OCI                 Open Container Initiative standards
2016  containerd          Industry-standard runtime (extracted from Docker)

Docker did not invent containerization—it democratized it by providing:

  • Simple CLI interface
  • Dockerfile build system
  • Docker Hub registry
  • Cross-platform support (Docker Desktop for macOS/Windows)

2. Docker Architecture Deep Dive

Docker follows a client-server architecture with multiple layers:

┌──────────────────────────────────────────────────────────────────┐
│                        Docker Client                              │
│                     (docker CLI, SDKs)                           │
│                                                                   │
│   Commands: docker build | run | push | pull | exec | logs       │
└────────────────────────────┬─────────────────────────────────────┘
                             │ REST API (Unix socket or TCP)
                             │ /var/run/docker.sock
┌────────────────────────────▼─────────────────────────────────────┐
│                      Docker Daemon (dockerd)                      │
│                                                                   │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│
│   │   Images    │ │  Containers │ │  Networks   │ │  Volumes   ││
│   │  Manager    │ │   Manager   │ │   Manager   │ │  Manager   ││
│   └─────────────┘ └─────────────┘ └─────────────┘ └────────────┘│
└────────────────────────────┬─────────────────────────────────────┘
                             │ gRPC API
┌────────────────────────────▼─────────────────────────────────────┐
│                        containerd                                 │
│                                                                   │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│
│   │   Content   │ │  Snapshots  │ │    Tasks    │ │   Events   ││
│   │    Store    │ │   (Storage) │ │ (Lifecycle) │ │   Stream   ││
│   └─────────────┘ └─────────────┘ └──────┬──────┘ └────────────┘│
└──────────────────────────────────────────┼───────────────────────┘
                                           │
┌──────────────────────────────────────────▼───────────────────────┐
│                containerd-shim-runc-v2                            │
│           (per-container process, survives containerd restart)    │
└──────────────────────────────────────────┬───────────────────────┘
                                           │
┌──────────────────────────────────────────▼───────────────────────┐
│                          runc                                     │
│                   (OCI runtime, spawns container)                 │
│                                                                   │
│   1. Parse config.json                                           │
│   2. Set up namespaces (clone() with CLONE_NEW*)                 │
│   3. Configure cgroups                                           │
│   4. Apply seccomp filters                                       │
│   5. pivot_root to container filesystem                          │
│   6. exec() the entrypoint                                       │
│   7. Exit (shim takes over)                                      │
└──────────────────────────────────────────────────────────────────┘

2.1 The Docker Daemon (dockerd)

The daemon is a long-running process that:

  • Listens on /var/run/docker.sock (Unix) or TCP port 2375/2376
  • Manages Docker objects (images, containers, networks, volumes)
  • Handles build requests
  • Interacts with registries

Configuration: /etc/docker/daemon.json

{
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Hard": 65536, "Soft": 65536 }
  },
  "live-restore": true,
  "userland-proxy": false,
  "default-address-pools": [{ "base": "172.17.0.0/16", "size": 24 }]
}
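The default-address-pools entry above controls how many user-defined networks the daemon can allocate. A quick sketch with Python's ipaddress module (the helper name is ours) shows what a /16 base split into /24 subnets yields:

```python
import ipaddress

# Mirror the "default-address-pools" setting: carve the base CIDR
# into per-network subnets of the configured prefix size.
def address_pool(base: str, size: int):
    return list(ipaddress.ip_network(base).subnets(new_prefix=size))

pools = address_pool("172.17.0.0/16", 24)
print(len(pools))   # number of /24 networks available for allocation
print(pools[0])     # first subnet handed to a new bridge network
```

With size 24, each network gets 254 usable addresses and the pool holds 256 networks.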

2.2 containerd

containerd is a CNCF graduated project that provides:

  • Image pull/push to registries
  • Image storage and management
  • Container execution via OCI runtimes
  • Network interface creation (CNI plugins)
  • Snapshot management (storage drivers)

You can interact directly with containerd using ctr or nerdctl:

# Pull image directly with containerd
ctr images pull docker.io/library/nginx:latest

# Create and run container
ctr run --rm -t docker.io/library/nginx:latest my-nginx

# List running tasks
ctr tasks ls

2.3 The Shim Process

The shim (containerd-shim-runc-v2) is critical for:

  1. Daemon Independence: Containers keep running when containerd restarts
  2. STDIO Handling: Keeps stdin/stdout/stderr streams open
  3. Exit Status: Reports container exit code to containerd
  4. Zombie Reaping: Acts as subreaper for container processes

# View shim processes
$ ps aux | grep shim
root  1234  containerd-shim-runc-v2 -namespace moby -id abc123...
root  5678  containerd-shim-runc-v2 -namespace moby -id def456...
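The subreaper role in point 4 can be demonstrated without Docker. A minimal, Linux-only sketch using prctl(PR_SET_CHILD_SUBREAPER) via ctypes: when an intermediate process exits, its orphaned child is re-parented to the nearest subreaper rather than to PID 1, which is how the shim adopts container processes after runc exits:

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # constant from <sys/prctl.h>

# Become a subreaper, as the shim does: orphaned descendants are
# re-parented to us instead of to PID 1.
libc = ctypes.CDLL(None, use_errno=True)
libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0)

r, w = os.pipe()
child = os.fork()
if child == 0:
    # Intermediate process: fork a grandchild, then exit immediately,
    # orphaning it (like runc exiting after starting the container).
    if os.fork() == 0:
        time.sleep(0.2)  # let the intermediate parent die first
        os.write(w, str(os.getppid()).encode())
        os._exit(0)
    os._exit(0)

os.close(w)
os.waitpid(child, 0)              # reap the intermediate process
new_parent = int(os.read(r, 32))  # parent PID as seen by the orphan
print(new_parent == os.getpid())  # the orphan now reports us as parent
```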

2.4 runc

runc is the reference implementation of the OCI runtime specification:

# Manual container creation with runc
$ mkdir -p mycontainer/rootfs
$ docker export $(docker create busybox) | tar -C mycontainer/rootfs -xf -
$ cd mycontainer
$ runc spec  # Creates config.json

# Run the container
$ runc run my-container

What runc does:

  1. Parses OCI config.json
  2. Creates namespaces via clone()
  3. Configures cgroups by writing to /sys/fs/cgroup/
  4. Sets up filesystem mounts
  5. Applies security profiles (seccomp, capabilities, AppArmor)
  6. Calls pivot_root() to change root filesystem
  7. Executes the entrypoint with execve()
  8. Exits—shim takes over as parent process

3. Kernel Isolation Mechanisms

3.1 The clone() System Call

When creating a container, runc uses clone() with namespace flags:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg) {
    printf("Child PID (inside namespace): %d\n", getpid());
    printf("Parent PID (inside namespace): %d\n", getppid());
    execl("/bin/sh", "sh", NULL);
    return 0;
}

int main() {
    char *stack = malloc(STACK_SIZE);
    if (!stack) exit(1);

    // Create new namespaces
    int flags = CLONE_NEWPID |   // New PID namespace
                CLONE_NEWNS  |   // New mount namespace
                CLONE_NEWNET |   // New network namespace
                CLONE_NEWUTS |   // New UTS namespace
                CLONE_NEWIPC |   // New IPC namespace
                SIGCHLD;

    pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, NULL);

    if (pid == -1) {
        perror("clone");
        exit(1);
    }

    printf("Child PID (from parent view): %d\n", pid);
    waitpid(pid, NULL, 0);

    return 0;
}

3.2 Namespace Details

PID Namespace Hierarchy:

Host PID Namespace (Level 0)
├── PID 1 (systemd)
├── PID 1234 (containerd)
├── PID 5678 (Container A's init - seen as PID 1 inside)
│   └── Container A's PID Namespace (Level 1)
│       ├── PID 1 (nginx master)
│       └── PID 2 (nginx worker)
└── PID 9012 (Container B's init - seen as PID 1 inside)
    └── Container B's PID Namespace (Level 1)
        └── PID 1 (python app)

Network Namespace Inspection:

# `ip netns list` shows nothing for containers: Docker keeps its
# netns files in /var/run/docker/netns/, not /var/run/netns/
$ ls /var/run/docker/netns/

# Enter a container's network namespace
$ nsenter --net=/var/run/docker/netns/abc123 ip addr

# From inside container
$ cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes...
  lo: 12345      100    0    0    0     0          0         0  12345...
 eth0: 67890     500    0    0    0     0          0         0  45678...

3.3 Cgroups V2 in Practice

Docker creates cgroup hierarchies under /sys/fs/cgroup/system.slice/:

# Container cgroup location
$ CONTAINER_ID=$(docker ps -q)
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cgroup.procs
1234  # PIDs in this cgroup

# Resource limits
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
536870912  # 512MB

$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
100000 100000  # 100% of 1 CPU

# Current usage
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
134217728  # 128MB currently used

$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
usage_usec 12345678
user_usec 10000000
system_usec 2345678
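Because usage_usec in cpu.stat is cumulative, utilization has to be derived from two samples taken some interval apart. A small sketch (the function name is ours):

```python
# Derive CPU utilization from two cpu.stat samples: usage_usec is
# cumulative, so utilization over an interval is the usage delta
# divided by the wall-clock delta (both in microseconds).
def cpu_percent(usage_usec_t0: int, usage_usec_t1: int, wall_usec: int) -> float:
    return 100.0 * (usage_usec_t1 - usage_usec_t0) / wall_usec

# 1.5 CPU-seconds consumed during a 1-second window => 150%
print(cpu_percent(12_345_678, 13_845_678, 1_000_000))  # 150.0
```

This is the same calculation `docker stats` performs (a value above 100% means more than one CPU in use).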

Resource Limit Flags:

docker run \
  --memory="512m" \              # memory.max
  --memory-swap="1g" \           # memory.swap.max
  --memory-reservation="256m" \  # memory.low (soft limit)
  --cpus="1.5" \                 # cpu.max (150000 100000)
  --cpu-shares="512" \           # cpu.weight (relative)
  --cpuset-cpus="0,1" \          # cpuset.cpus
  --pids-limit="100" \           # pids.max
  --blkio-weight="500" \         # io.weight
  nginx
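The --cpus translation noted above can be sketched in a few lines: cpu.max holds a quota and a period in microseconds, with Docker's default 100 ms period (the helper name is ours):

```python
# Sketch of the flag-to-cgroup translation: --cpus="1.5" becomes
# "150000 100000" in cpu.max (quota and period in microseconds).
def cpus_to_cpu_max(cpus: float, period_usec: int = 100_000) -> str:
    quota = int(cpus * period_usec)
    return f"{quota} {period_usec}"

print(cpus_to_cpu_max(1.5))  # 150000 100000
print(cpus_to_cpu_max(0.5))  # 50000 100000
```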

4. Storage: OverlayFS Deep Dive

4.1 Layer Structure

# View overlay mount for a running container
$ mount | grep overlay
overlay on /var/lib/docker/overlay2/xyz/merged type overlay (rw,...)
  lowerdir=/var/lib/docker/overlay2/abc/diff:
           /var/lib/docker/overlay2/def/diff:
           /var/lib/docker/overlay2/ghi/diff,
  upperdir=/var/lib/docker/overlay2/xyz/diff,
  workdir=/var/lib/docker/overlay2/xyz/work

# Examine layer contents
$ ls /var/lib/docker/overlay2/
abc/  # Layer 1 (base image)
def/  # Layer 2
ghi/  # Layer 3
xyz/  # Container layer
  ├── diff/     # Container's writable layer
  ├── merged/   # Unified view
  ├── work/     # Kernel workspace
  └── link      # Shortened layer identifier
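The lookup order across these directories can be modeled with a toy sketch: the merged view returns a file from upperdir if present, otherwise from the first lowerdir (top to bottom) that has it. Dicts stand in for directories here; this is not the kernel's actual logic:

```python
# Toy model of OverlayFS path resolution in the merged view.
def merged_lookup(path, upper, lowers):
    if path in upper:
        return upper[path]
    for layer in lowers:          # lowerdirs, topmost first
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

upper  = {"/app/config": "edited"}          # container writable layer
lowers = [{"/app/config": "original"},      # image layer 3
          {"/bin/sh": "busybox"}]           # image layer 1
print(merged_lookup("/app/config", upper, lowers))  # edited (upper wins)
print(merged_lookup("/bin/sh", upper, lowers))      # busybox
```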

4.2 Copy-on-Write Internals

// Simplified kernel OverlayFS copy-up logic
int ovl_copy_up(struct dentry *dentry) {
    // 1. Check if file exists in upperdir
    if (exists_in_upper(dentry))
        return 0;  // Already copied

    // 2. Create parent directories in upperdir
    create_parent_dirs(upper_dentry);

    // 3. Copy entire file from lowerdir to upperdir
    //    This is the expensive operation!
    copy_file(lower_path, upper_path);

    // 4. Copy xattrs (extended attributes)
    copy_xattrs(lower_path, upper_path);

    // 5. Set up overlay redirect (opaque marker)
    set_redirect(upper_dentry);

    return 0;
}

Performance Implications:

Operation          Performance  Notes
Read from lower    Native       Direct read, no copy
Read from upper    Native       Direct read
Write new file     Native       Direct write to upper
Modify lower file  SLOW         Full copy-up first
Delete lower file  Fast         Creates whiteout marker

4.3 Whiteout Files and Opaque Directories

# Delete a file from base image
$ docker run --rm -it ubuntu rm /etc/motd

# Inside the container's upperdir:
$ ls -la /var/lib/docker/overlay2/xyz/diff/etc/
c--------- 1 root root 0, 0 Jan 26 10:00 motd  # Character device 0:0 = whiteout

# Delete entire directory
$ docker run --rm -it ubuntu rm -rf /var/cache/

# Creates opaque directory with xattr
$ getfattr -n trusted.overlay.opaque /var/lib/docker/overlay2/xyz/diff/var/cache/
# file: var/lib/docker/overlay2/xyz/diff/var/cache/
trusted.overlay.opaque="y"

4.4 Image Layer Inspection

# View image layers
$ docker image inspect nginx:latest --format '{{json .RootFS.Layers}}' | jq
[
  "sha256:abc123...",  # Base layer
  "sha256:def456...",  # Layer 2
  "sha256:789ghi..."   # Top layer
]

# Examine layer history
$ docker history nginx:latest
IMAGE          CREATED       CREATED BY                                      SIZE
abc123def456   2 weeks ago   CMD ["nginx" "-g" "daemon off;"]               0B
<missing>      2 weeks ago   STOPSIGNAL SIGQUIT                              0B
<missing>      2 weeks ago   EXPOSE 80                                       0B
<missing>      2 weeks ago   ENTRYPOINT ["/docker-entrypoint.sh"]           0B
<missing>      2 weeks ago   COPY 30-tune-worker-processes.sh ... (RUN)     4.62kB
...

# Layer diff contents
$ mkdir -p /tmp/nginx-layers && docker save nginx:latest | tar -xf - -C /tmp/nginx-layers/
$ ls /tmp/nginx-layers/
abc123.../
def456.../
manifest.json

5. Image Format and Registry Protocol

5.1 OCI Image Manifest

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:abc123...",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:layer1...",
      "size": 32654848
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:layer2...",
      "size": 16724
    }
  ],
  "annotations": {
    "org.opencontainers.image.created": "2025-01-26T10:00:00Z"
  }
}
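Every digest in the manifest is content-addressed: it is simply the SHA-256 of the blob's bytes, which clients recompute after download to verify integrity. A minimal sketch:

```python
import hashlib
import json

# Content addressing: a blob's digest is "sha256:<hex of its bytes>",
# recomputed client-side to verify what the registry served.
def blob_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

config_bytes = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
digest = blob_digest(config_bytes)
print(digest.startswith("sha256:"))
print(len(digest.split(":")[1]))  # 64 hex characters
```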

5.2 Image Configuration

{
  "architecture": "amd64",
  "os": "linux",
  "config": {
    "Hostname": "",
    "User": "",
    "ExposedPorts": { "80/tcp": {} },
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin",
      "NGINX_VERSION=1.25.0"
    ],
    "Cmd": ["nginx", "-g", "daemon off;"],
    "Entrypoint": ["/docker-entrypoint.sh"],
    "WorkingDir": "/",
    "Labels": {
      "maintainer": "NGINX Docker Maintainers"
    }
  },
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:uncompressed-layer1...",
      "sha256:uncompressed-layer2..."
    ]
  },
  "history": [
    {
      "created": "2025-01-20T00:00:00Z",
      "created_by": "/bin/sh -c #(nop) ADD file:abc... in /",
      "empty_layer": false
    }
  ]
}
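The diff_ids above combine into chain IDs, which identify each layer stack as a whole: per the OCI image spec, ChainID(L0) = DiffID(L0), and ChainID(Ln) is the SHA-256 of ChainID(Ln-1), a space, and DiffID(Ln). A sketch with placeholder digests:

```python
import hashlib

# OCI chain-ID computation over a list of diff_ids:
# ChainID(L0) = DiffID(L0)
# ChainID(Ln) = sha256(ChainID(Ln-1) + " " + DiffID(Ln))
def chain_ids(diff_ids):
    out = [diff_ids[0]]
    for d in diff_ids[1:]:
        h = hashlib.sha256(f"{out[-1]} {d}".encode()).hexdigest()
        out.append(f"sha256:{h}")
    return out

ids = chain_ids([
    "sha256:" + "aa" * 32,  # placeholder diff_ids, not real digests
    "sha256:" + "bb" * 32,
])
print(ids[0])  # the first chain ID equals the first diff_id
```

This is why two images sharing the same bottom layers can share snapshots on disk: identical prefixes produce identical chain IDs.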

5.3 Registry HTTP API (OCI Distribution)

# 1. Check API version
GET /v2/

# 2. Get manifest
GET /v2/<name>/manifests/<reference>
Accept: application/vnd.oci.image.manifest.v1+json

# 3. Get blob (layer)
GET /v2/<name>/blobs/<digest>

# 4. Push flow:
# a) Check if blob exists
HEAD /v2/<name>/blobs/<digest>

# b) Start upload
POST /v2/<name>/blobs/uploads/

# c) Upload blob
PUT /v2/<name>/blobs/uploads/<uuid>?digest=<digest>
Content-Type: application/octet-stream

# d) Push manifest
PUT /v2/<name>/manifests/<reference>
Content-Type: application/vnd.oci.image.manifest.v1+json

Authentication Flow:

# 1. Initial request returns 401 with WWW-Authenticate header
GET /v2/library/nginx/manifests/latest
< 401 Unauthorized
< WWW-Authenticate: Bearer realm="https://auth.docker.io/token",
    service="registry.docker.io",scope="repository:library/nginx:pull"

# 2. Get token from auth service
GET https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/nginx:pull
> {"token": "eyJhbGc...", "expires_in": 300}

# 3. Retry with token
GET /v2/library/nginx/manifests/latest
Authorization: Bearer eyJhbGc...
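Step 2 requires extracting realm, service, and scope from the WWW-Authenticate challenge to build the token URL. A minimal regex-based parser sketch (the helper name is ours):

```python
import re

# Parse key="value" pairs out of a Bearer challenge header, e.g.
# Bearer realm="...",service="...",scope="..."
def parse_bearer_challenge(header: str) -> dict:
    return dict(re.findall(r'(\w+)="([^"]*)"', header))

hdr = ('Bearer realm="https://auth.docker.io/token",'
       'service="registry.docker.io",scope="repository:library/nginx:pull"')
fields = parse_bearer_challenge(hdr)
print(fields["realm"])
print(fields["scope"])
```

The token request is then `GET {realm}?service={service}&scope={scope}`.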

6. Dockerfile and Build System

6.1 Dockerfile Instructions Deep Dive

Instruction  FS Layer   Notes
FROM         Base       Sets base image
RUN          Yes        Executes commands
COPY         Yes        Copies files from context
ADD          Yes        COPY + URL download + tar extraction
ENV          No*        Sets environment variables
ARG          No         Build-time variables
WORKDIR      No*        Sets working directory
USER         No*        Sets user for subsequent instructions
EXPOSE       No (docs)  Documentation only
VOLUME       No*        Creates mount point
ENTRYPOINT   No         Container entry point
CMD          No         Default arguments
LABEL        No*        Metadata
HEALTHCHECK  No         Health check command
SHELL        No         Default shell
STOPSIGNAL   No         Stop signal
ONBUILD      No         Trigger for child images

*Creates a metadata-only layer, not a filesystem layer

6.2 Build Cache Mechanics

# Layer cache invalidation cascade
FROM python:3.11                    # Cache: Base image digest
WORKDIR /app                        # Cache: Rarely changes
COPY requirements.txt .             # Cache: Invalidates if file changes
RUN pip install -r requirements.txt # Cache: Invalidates if above changed
COPY . .                            # Cache: Invalidates on ANY source change
CMD ["python", "app.py"]            # Cache: Invalidates if above changed

Cache Key Calculation:

  1. Parent layer hash
  2. Instruction string
  3. For COPY/ADD: file content hashes
  4. For RUN: command string only (not execution results!)
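The cache-key rules above can be modeled as a running hash: each key folds in the parent's key, the instruction text, and (for COPY/ADD) the hashes of the copied files. A toy sketch of the idea, not BuildKit's actual algorithm:

```python
import hashlib

# Toy build-cache key: hash of parent key + instruction string +
# (for COPY/ADD) file content hashes. RUN keys off command text only,
# never the command's actual results.
def cache_key(parent_key: str, instruction: str, file_hashes=()) -> str:
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(instruction.encode())
    for fh in file_hashes:
        h.update(fh.encode())
    return h.hexdigest()

base = cache_key("", "FROM python:3.11")
k1 = cache_key(base, "COPY requirements.txt .", ["sha256:filehash1"])
k2 = cache_key(base, "COPY requirements.txt .", ["sha256:filehash2"])
print(k1 != k2)  # changed file content invalidates this and later layers
```

Because each key embeds its parent's key, invalidating one layer invalidates everything below it in the Dockerfile, which is the cascade shown above.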

Cache Invalidation:

# Force rebuild from specific instruction
docker build --no-cache .

# Invalidate from specific stage
docker build --no-cache-filter=build .

6.3 Multi-Stage Builds

Multi-stage builds dramatically reduce final image size:

# Stage 1: Build
FROM golang:1.21 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /app -ldflags="-s -w" .

# Stage 2: Runtime
FROM scratch
# Or: FROM gcr.io/distroless/static-debian12
COPY --from=builder /app /app
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
ENTRYPOINT ["/app"]

Result:

Stage            Contents                    Size
golang:1.21      Go toolchain, libs, source  ~800MB
Final (scratch)  Binary + certs only         ~10MB

6.4 BuildKit (Modern Build Engine)

BuildKit is the next-generation build engine with:

  • Parallel stage execution
  • Efficient layer caching
  • Build secrets
  • SSH forwarding
  • Cache mounts
  • Better cache export/import

Enable BuildKit:

export DOCKER_BUILDKIT=1
# Or in daemon.json: {"features": {"buildkit": true}}

BuildKit-specific Features:

# syntax=docker/dockerfile:1.5

# Mount cache for package managers
FROM python:3.11
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install numpy pandas scikit-learn

# Mount secrets (never stored in layer)
FROM alpine
RUN --mount=type=secret,id=aws_creds \
    cat /run/secrets/aws_creds

# SSH forwarding
FROM alpine
RUN --mount=type=ssh \
    git clone git@github.com:private/repo.git

# Bind mount from context
FROM golang:1.21
RUN --mount=type=bind,source=go.mod,target=go.mod \
    go mod download

Build Command:

docker buildx build \
  --secret id=aws_creds,src=$HOME/.aws/credentials \
  --ssh default \
  --cache-from type=registry,ref=myregistry/myapp:cache \
  --cache-to type=registry,ref=myregistry/myapp:cache \
  --platform linux/amd64,linux/arm64 \
  --push \
  -t myregistry/myapp:latest \
  .

7. Networking Deep Dive

7.1 Network Drivers

Driver   Use Case             Scope        IP Management
bridge   Default, isolated    Single host  Docker IPAM
host     Maximum performance  Single host  Host IP
none     No networking        Single host  None
overlay  Multi-host (Swarm)   Multi-host   Docker IPAM
macvlan  Direct L2 access     Single host  External DHCP/static
ipvlan   L2/L3 without MAC    Single host  External

7.2 Bridge Network Internals

# Default bridge network
$ docker network inspect bridge
[
  {
    "Name": "bridge",
    "Driver": "bridge",
    "IPAM": {
      "Config": [{ "Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1" }]
    },
    "Options": {
      "com.docker.network.bridge.default_bridge": "true",
      "com.docker.network.bridge.name": "docker0"
    }
  }
]

# View bridge on host
$ ip link show docker0
docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    link/ether 02:42:ac:11:00:01 brd ff:ff:ff:ff:ff:ff

$ brctl show docker0
bridge name   bridge id              STP enabled   interfaces
docker0       8000.024242424242      no            veth123abc
                                                   veth456def

7.3 veth Pair Creation

When a container starts:

# Docker creates a veth pair (names here are illustrative; the
# "eth0@if123" form seen in `ip link` output is display notation,
# not a valid interface name)
$ ip link add veth123 type veth peer name ceth123

# One end goes to the container's network namespace, renamed eth0
$ ip link set ceth123 netns <container-ns>
$ nsenter --net=<container-ns> ip link set ceth123 name eth0

# Other end connects to the bridge
$ ip link set veth123 master docker0
$ ip link set veth123 up

# Configure the container's interface
$ nsenter --net=<container-ns> ip addr add 172.17.0.2/16 dev eth0
$ nsenter --net=<container-ns> ip link set eth0 up
$ nsenter --net=<container-ns> ip route add default via 172.17.0.1

7.4 NAT and Port Mapping

Outbound (MASQUERADE):

$ iptables -t nat -L POSTROUTING -n -v
Chain POSTROUTING
pkts bytes target     prot opt in     out     source            destination
1234 56789 MASQUERADE all  --  *      !docker0 172.17.0.0/16    0.0.0.0/0

Inbound (DNAT for -p 8080:80):

$ iptables -t nat -L DOCKER -n -v
Chain DOCKER
pkts bytes target     prot opt in     out     source            destination
 100  5000 DNAT       tcp  --  !docker0 *     0.0.0.0/0        0.0.0.0/0
                      tcp dpt:8080 to:172.17.0.2:80

7.5 DNS Resolution

Docker provides an embedded DNS server for user-defined networks:

# Inside container on user-defined network
$ cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0

# Docker's DNS server at 127.0.0.11 handles:
# 1. Container name resolution (web → 172.18.0.2)
# 2. Service discovery
# 3. Forwards unknown queries to host DNS

8. Security Deep Dive

8.1 Linux Capabilities

Docker drops most capabilities by default:

# Default capabilities kept:
CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW,
SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE,
SYS_CHROOT, KILL, AUDIT_WRITE

# Dangerous capabilities dropped:
SYS_ADMIN, SYS_PTRACE, SYS_MODULE, SYS_RAWIO, SYS_TIME,
SYS_BOOT, NET_ADMIN, SYS_RESOURCE, SYSLOG, ...

Capability Management:

# Drop all, add only needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx

# Check container capabilities
$ cat /proc/1/status | grep Cap
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000

# Decode with capsh
$ capsh --decode=00000000a80425fb
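What capsh --decode does can be reproduced by mapping set bits in the mask to capability numbers from linux/capability.h. Decoding the CapEff value above yields exactly Docker's 14 default capabilities:

```python
# Capability names indexed by bit number, per linux/capability.h.
CAPS = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw",
    "ipc_lock", "ipc_owner", "sys_module", "sys_rawio", "sys_chroot",
    "sys_ptrace", "sys_pacct", "sys_admin", "sys_boot", "sys_nice",
    "sys_resource", "sys_time", "sys_tty_config", "mknod", "lease",
    "audit_write", "audit_control", "setfcap",
]

def decode_caps(mask: int):
    """Return the names of all capabilities whose bit is set in mask."""
    return [name for bit, name in enumerate(CAPS) if mask >> bit & 1]

caps = decode_caps(0x00000000A80425FB)
print(len(caps))          # 14: Docker's default capability set
print("net_raw" in caps)
print("sys_admin" in caps)
```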

8.2 Seccomp Profiles

Docker applies a default seccomp profile blocking ~44 syscalls:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["accept", "accept4", "access", "..."],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["clone"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 2114060288,
          "valueTwo": 0,
          "op": "SCMP_CMP_MASKED_EQ"
        }
    }
  ]
}

Blocked Syscalls Include:

  • reboot, swapon, swapoff
  • mount, umount, pivot_root
  • clock_settime, settimeofday
  • init_module, delete_module
  • acct, kexec_load

Custom Profile:

docker run --security-opt seccomp=/path/to/profile.json myimage

8.3 AppArmor Profiles

Docker generates AppArmor profiles restricting:

profile docker-default flags=(attach_disconnected,mediate_deleted) {
  # Deny writes to sensitive paths
  deny /proc/** w,
  deny /sys/** w,

  # Allow read of specific proc files
  /proc/*/attr/current r,
  /proc/*/mounts r,

  # Network access
  network inet stream,
  network inet6 stream,

  # File capabilities
  capability chown,
  capability dac_override,
  capability net_bind_service,
}

8.4 Rootless Mode

Running Docker without root privileges:

# Install rootless Docker
$ dockerd-rootless-setuptool.sh install

# Configure
$ export DOCKER_HOST=unix:///run/user/1000/docker.sock

# How it works:
# 1. Uses user namespaces (container root → host UID 100000+)
# 2. Uses slirp4netns for networking (userspace TCP/IP)
# 3. Uses fuse-overlayfs for storage

Limitations:

  • Slightly slower networking (userspace stack)
  • Cannot use privileged containers
  • Limited to unprivileged ports (1024 and above) without extra configuration
  • Some storage drivers unavailable

9. Docker Compose

9.1 Compose File Structure

version: "3.9"  # optional; ignored by modern Compose v2

services:
  web:
    build:
      context: ./web
      dockerfile: Dockerfile
      args:
        NODE_ENV: production
    image: myapp-web:${TAG:-latest}
    ports:
      - "8080:3000"
    environment:
      - DATABASE_URL=postgres://db:5432/myapp
    depends_on:
      db:
        condition: service_healthy
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
    networks:
      - frontend
      - backend
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  db:
    image: postgres:15
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
    networks:
      - backend

volumes:
  pgdata:
    driver: local

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true

secrets:
  db_password:
    file: ./secrets/db_password.txt

9.2 Compose Commands

# Start services
docker compose up -d

# View logs
docker compose logs -f web

# Scale service
docker compose up -d --scale web=3

# Execute command in service
docker compose exec web sh

# Stop and remove
docker compose down -v  # -v removes volumes

10. Debugging Containers

10.1 Inspection Commands

# Full container details
docker inspect <container>

# Specific field
docker inspect -f '{{.State.Pid}}' <container>
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container>

# Resource usage
docker stats <container>

# Process list
docker top <container>

# Filesystem changes
docker diff <container>

10.2 Entering Namespaces

# Get container PID
PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)

# Enter all namespaces
nsenter -t $PID -m -u -i -n -p /bin/sh

# Enter specific namespace only
nsenter -t $PID --net ip addr  # Network namespace
nsenter -t $PID --pid --mount ps aux  # PID + mount namespace

10.3 Debug Container (docker debug)

The docker debug command (a Docker Desktop feature) attaches a
toolbox shell to a running container's namespaces, which is useful
for shell-less distroless or scratch images:

# Attach debug container to running container's namespaces
docker debug <container>

# Portable alternative on any engine: share the target's namespaces
docker run --rm -it --pid=container:<container> \
  --network=container:<container> busybox sh

10.4 Analyzing Image Layers

# Tool: dive (interactive layer explorer)
dive nginx:latest

# Manual analysis
mkdir -p /tmp/nginx && docker save nginx:latest | tar -xf - -C /tmp/nginx/
ls /tmp/nginx/
# manifest.json, oci-layout, blobs/sha256/...

11. Performance Tuning

11.1 Image Optimization

# Bad: Large image, poor caching
FROM python:3.11
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt

# Good: Small image, optimal caching
FROM python:3.11-slim AS base
WORKDIR /app

FROM base AS deps
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --no-compile -r requirements.txt

FROM base AS runtime
COPY --from=deps /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=deps /usr/local/bin /usr/local/bin
COPY . .
USER nobody
CMD ["python", "app.py"]

11.2 Resource Limits Best Practices

# Always set memory limits
docker run --memory="512m" --memory-swap="512m"  # Disable swap

# CPU: Use cpus for hard limit, cpu-shares for soft
docker run --cpus="2" --cpu-shares="1024"

# Prevent fork bombs
docker run --pids-limit="100"

# I/O limits for noisy neighbors
docker run --device-read-bps="/dev/sda:10mb" --device-write-bps="/dev/sda:10mb"

11.3 Storage Performance

# Use volumes, not bind mounts, for databases
docker run -v pgdata:/var/lib/postgresql/data postgres

# For build caches, use tmpfs
docker run --tmpfs /tmp:rw,noexec,nosuid,size=1g myapp

# Configure storage driver options
# In daemon.json (note: overlay2.size requires an xfs backing
# filesystem mounted with the pquota option):
{
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.size=20G"
  ]
}

12. Production Checklist

12.1 Image Security

  • [ ] Use minimal base images (distroless, alpine, scratch)
  • [ ] Pin image versions (never use latest in production)
  • [ ] Scan images for vulnerabilities (trivy, grype)
  • [ ] Sign images (cosign, notation)
  • [ ] Run as non-root user
  • [ ] Use multi-stage builds
  • [ ] No secrets in images (use runtime secrets)

12.2 Runtime Security

  • [ ] Set resource limits (memory, CPU, PIDs)
  • [ ] Drop capabilities (--cap-drop=ALL)
  • [ ] Use read-only root filesystem (--read-only)
  • [ ] Enable seccomp profile
  • [ ] No privileged containers
  • [ ] No host namespace sharing
  • [ ] No Docker socket mounting

12.3 Observability

  • [ ] Configure log driver with rotation
  • [ ] Export metrics (cAdvisor, docker stats API)
  • [ ] Health checks defined
  • [ ] Tracing instrumentation

12.4 Build Process

  • [ ] .dockerignore configured
  • [ ] BuildKit enabled
  • [ ] Multi-platform builds if needed
  • [ ] CI/CD pipeline with caching