Docker¶
Docker is a platform that enables developers to build, ship, and run applications in isolated environments called containers. Unlike virtual machines (VMs), which require a full guest operating system and hypervisor, containers share the host's kernel, making them lightweight, fast to start, and resource-efficient.
This chapter provides an in-depth technical exploration of Docker's internals, from the syscall level to production best practices.
1. Historical Context¶
Containerization has roots predating Docker:
| Year | Technology | Innovation |
|---|---|---|
| 1979 | chroot | Filesystem isolation (Unix V7) |
| 2000 | FreeBSD Jails | Process isolation + networking |
| 2005 | Solaris Zones | Resource controls + virtualization |
| 2006 | Process Containers | Google's cgroups (merged into Linux 2.6.24) |
| 2008 | LXC | Linux Containers (namespaces + cgroups) |
| 2013 | Docker | Developer-friendly tooling + ecosystem |
| 2015 | OCI | Open Container Initiative standards |
| 2016 | containerd | Industry-standard runtime (extracted from Docker) |
Docker did not invent containerization—it democratized it by providing:
- Simple CLI interface
- Dockerfile build system
- Docker Hub registry
- Cross-platform support (Docker Desktop for macOS/Windows)
2. Docker Architecture Deep Dive¶
Docker follows a client-server architecture with multiple layers:
┌──────────────────────────────────────────────────────────────────┐
│ Docker Client │
│ (docker CLI, SDKs) │
│ │
│ Commands: docker build | run | push | pull | exec | logs │
└────────────────────────────┬─────────────────────────────────────┘
│ REST API (Unix socket or TCP)
│ /var/run/docker.sock
┌────────────────────────────▼─────────────────────────────────────┐
│ Docker Daemon (dockerd) │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│
│ │ Images │ │ Containers │ │ Networks │ │ Volumes ││
│ │ Manager │ │ Manager │ │ Manager │ │ Manager ││
│ └─────────────┘ └─────────────┘ └─────────────┘ └────────────┘│
└────────────────────────────┬─────────────────────────────────────┘
│ gRPC API
┌────────────────────────────▼─────────────────────────────────────┐
│ containerd │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────────┐│
│ │ Content │ │ Snapshots │ │ Tasks │ │ Events ││
│ │ Store │ │ (Storage) │ │ (Lifecycle) │ │ Stream ││
│ └─────────────┘ └─────────────┘ └──────┬──────┘ └────────────┘│
└──────────────────────────────────────────┼───────────────────────┘
│
┌──────────────────────────────────────────▼───────────────────────┐
│ containerd-shim-runc-v2 │
│ (per-container process, survives containerd restart) │
└──────────────────────────────────────────┬───────────────────────┘
│
┌──────────────────────────────────────────▼───────────────────────┐
│ runc │
│ (OCI runtime, spawns container) │
│ │
│ 1. Parse config.json │
│ 2. Set up namespaces (clone() with CLONE_NEW*) │
│ 3. Configure cgroups │
│ 4. Apply seccomp filters │
│ 5. pivot_root to container filesystem │
│ 6. exec() the entrypoint │
│ 7. Exit (shim takes over) │
└──────────────────────────────────────────────────────────────────┘
2.1 The Docker Daemon (dockerd)¶
The daemon is a long-running process that:
- Listens on /var/run/docker.sock (Unix socket) or TCP port 2375/2376
- Manages Docker objects (images, containers, networks, volumes)
- Handles build requests
- Interacts with registries
Configuration: /etc/docker/daemon.json
{
"storage-driver": "overlay2",
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
},
"default-ulimits": {
"nofile": { "Name": "nofile", "Hard": 65536, "Soft": 65536 }
},
"live-restore": true,
"userland-proxy": false,
"default-address-pools": [{ "base": "172.17.0.0/16", "size": 24 }]
}
2.2 containerd¶
containerd is a CNCF graduated project that provides:
- Image pull/push to registries
- Image storage and management
- Container execution via OCI runtimes
- Network interface creation (CNI plugins)
- Snapshot management (storage drivers)
You can interact directly with containerd using ctr or nerdctl:
# Pull image directly with containerd
ctr images pull docker.io/library/nginx:latest
# Create and run container
ctr run --rm -t docker.io/library/nginx:latest my-nginx
# List running tasks
ctr tasks ls
2.3 The Shim Process¶
The shim (containerd-shim-runc-v2) is critical for:
- Daemon Independence: Containers keep running when containerd restarts
- STDIO Handling: Keeps stdin/stdout/stderr streams open
- Exit Status: Reports container exit code to containerd
- Zombie Reaping: Acts as subreaper for container processes
# View shim processes
$ ps aux | grep shim
root 1234 containerd-shim-runc-v2 -namespace moby -id abc123...
root 5678 containerd-shim-runc-v2 -namespace moby -id def456...
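The subreaper behavior can be demonstrated in a few lines. The sketch below (Python, Linux-only, illustrative rather than the shim's actual implementation) calls prctl with PR_SET_CHILD_SUBREAPER (constant 36 from sys/prctl.h) via ctypes: a middle process exits, and its orphaned child reparents to the subreaper instead of PID 1, exactly as container processes reparent to the shim when runc exits.

```python
import ctypes
import os
import time

PR_SET_CHILD_SUBREAPER = 36  # from <sys/prctl.h>

libc = ctypes.CDLL(None, use_errno=True)
# Mark this process as a subreaper, as containerd-shim-runc-v2 does.
if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
    raise OSError(ctypes.get_errno(), "prctl(PR_SET_CHILD_SUBREAPER) failed")

subreaper = os.getpid()
r, w = os.pipe()                 # grandchild reports its new parent here

child = os.fork()
if child == 0:                   # middle process (plays the role of runc)
    if os.fork() == 0:           # "container" process
        time.sleep(0.2)          # let the middle process exit first
        # Orphaned, so we were reparented to the subreaper, not to PID 1
        os.write(w, b"1" if os.getppid() == subreaper else b"0")
        os._exit(0)
    os._exit(0)                  # middle process exits immediately

os.waitpid(child, 0)             # reap the middle process
reparented = os.read(r, 1) == b"1"
reaped_pid, _ = os.wait()        # reap the reparented grandchild: no zombie left
print("reparented to subreaper:", reparented)
```

Without the prctl call, the orphan would reparent to PID 1 instead, and the shim could never collect the container's exit status.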
2.4 runc¶
runc is the reference implementation of the OCI runtime specification:
# Manual container creation with runc
$ mkdir -p mycontainer/rootfs
$ docker export $(docker create busybox) | tar -C mycontainer/rootfs -xf -
$ cd mycontainer
$ runc spec # Creates config.json
# Run the container
$ runc run my-container
What runc does:
- Parses the OCI config.json
- Creates namespaces via clone()
- Configures cgroups by writing to /sys/fs/cgroup/
- Sets up filesystem mounts
- Applies security profiles (seccomp, capabilities, AppArmor)
- Calls pivot_root() to change the root filesystem
- Executes the entrypoint with execve()
- Exits (the shim takes over as parent process)
3. Kernel Isolation Mechanisms¶
3.1 The clone() System Call¶
When creating a container, runc uses clone() with namespace flags:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
static int child_fn(void *arg) {
printf("Child PID (inside namespace): %d\n", getpid());
printf("Parent PID (inside namespace): %d\n", getppid());
execl("/bin/sh", "sh", NULL);
return 0;
}
int main() {
char *stack = malloc(STACK_SIZE);
if (!stack) exit(1);
// Create new namespaces
int flags = CLONE_NEWPID | // New PID namespace
CLONE_NEWNS | // New mount namespace
CLONE_NEWNET | // New network namespace
CLONE_NEWUTS | // New UTS namespace
CLONE_NEWIPC | // New IPC namespace
SIGCHLD;
pid_t pid = clone(child_fn, stack + STACK_SIZE, flags, NULL);
if (pid == -1) {
perror("clone");
exit(1);
}
printf("Child PID (from parent view): %d\n", pid);
waitpid(pid, NULL, 0);
return 0;
}
3.2 Namespace Details¶
PID Namespace Hierarchy:
Host PID Namespace (Level 0)
├── PID 1 (systemd)
├── PID 1234 (containerd)
├── PID 5678 (Container A's init - seen as PID 1 inside)
│ └── Container A's PID Namespace (Level 1)
│ ├── PID 1 (nginx master)
│ └── PID 2 (nginx worker)
└── PID 9012 (Container B's init - seen as PID 1 inside)
└── Container B's PID Namespace (Level 1)
└── PID 1 (python app)
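Each namespace a process belongs to is visible as a symlink under /proc/&lt;pid&gt;/ns/; two processes share a namespace exactly when the links resolve to the same inode. A minimal sketch (Linux-only) that lists the current process's namespace identifiers:

```python
import os

def namespaces(pid="self"):
    """Map namespace type -> identifier, e.g. 'pid' -> 'pid:[4026531836]'."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

own = namespaces()
for name, ident in own.items():
    print(f"{name:20s} {ident}")

# Two processes share a namespace exactly when these identifiers match;
# inside a container, the container process's "pid" entry differs from
# the host's namespaces(1)["pid"].
```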
Network Namespace Inspection:
# List network namespaces
$ ip netns list
# Docker creates netns in /var/run/docker/netns/
# Enter a container's network namespace
$ nsenter --net=/var/run/docker/netns/abc123 ip addr
# From inside container
$ cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes...
lo: 12345 100 0 0 0 0 0 0 12345...
eth0: 67890 500 0 0 0 0 0 0 45678...
3.3 Cgroups V2 in Practice¶
Docker creates cgroup hierarchies under /sys/fs/cgroup/system.slice/:
# Container cgroup location
$ CONTAINER_ID=$(docker ps -q)
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cgroup.procs
1234 # PIDs in this cgroup
# Resource limits
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.max
536870912 # 512MB
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.max
100000 100000 # 100% of 1 CPU
# Current usage
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current
134217728 # 128MB currently used
$ cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/cpu.stat
usage_usec 12345678
user_usec 10000000
system_usec 2345678
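Because cpu.stat exposes cumulative microsecond counters, utilization must be computed as a delta between two samples. A small sketch of that calculation, using hypothetical sample strings in the format shown above:

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat into a dict of cumulative usec counters."""
    return {key: int(val) for key, val in
            (line.split() for line in text.strip().splitlines())}

def cpu_percent(sample_a, sample_b, interval_usec):
    """CPU utilization between two cpu.stat samples taken interval_usec apart."""
    delta = (parse_cpu_stat(sample_b)["usage_usec"]
             - parse_cpu_stat(sample_a)["usage_usec"])
    return 100.0 * delta / interval_usec

# Hypothetical samples taken one second apart:
t0 = "usage_usec 12345678\nuser_usec 10000000\nsystem_usec 2345678"
t1 = "usage_usec 12845678\nuser_usec 10400000\nsystem_usec 2445678"
print(cpu_percent(t0, t1, 1_000_000))  # 500000 usec busy in 1s -> 50.0
```

This is essentially what docker stats does under the hood, sampling the cgroup files at a fixed interval.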
Resource Limit Flags:
docker run \
--memory="512m" \ # memory.max
--memory-swap="1g" \ # memory.swap.max
--memory-reservation="256m" \ # memory.low (soft limit)
--cpus="1.5" \ # cpu.max (150000 100000)
--cpu-shares="512" \ # cpu.weight (relative)
--cpuset-cpus="0,1" \ # cpuset.cpus
--pids-limit="100" \ # pids.max
--blkio-weight="500" \ # io.weight
nginx
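The flag-to-file mapping above is mostly arithmetic. A sketch of the two most common conversions (100000 usec is Docker's default CFS period; function names here are illustrative, not Docker's internals):

```python
CPU_PERIOD_USEC = 100_000  # Docker's default CFS period

def cpus_to_cpu_max(cpus):
    """--cpus=N -> contents of cgroup v2 cpu.max ('quota period')."""
    return f"{int(cpus * CPU_PERIOD_USEC)} {CPU_PERIOD_USEC}"

def memory_to_bytes(value):
    """--memory='512m' -> byte count written to memory.max."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3}
    suffix = value[-1].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)

print(cpus_to_cpu_max(1.5))     # "150000 100000"
print(memory_to_bytes("512m"))  # 536870912
```

These outputs match the cpu.max and memory.max values read from the container's cgroup earlier in this section.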
4. Storage: OverlayFS Deep Dive¶
4.1 Layer Structure¶
# View overlay mount for a running container
$ mount | grep overlay
overlay on /var/lib/docker/overlay2/xyz/merged type overlay (rw,...)
lowerdir=/var/lib/docker/overlay2/abc/diff:
/var/lib/docker/overlay2/def/diff:
/var/lib/docker/overlay2/ghi/diff,
upperdir=/var/lib/docker/overlay2/xyz/diff,
workdir=/var/lib/docker/overlay2/xyz/work
# Examine layer contents
$ ls /var/lib/docker/overlay2/
abc/ # Layer 1 (base image)
def/ # Layer 2
ghi/ # Layer 3
xyz/ # Container layer
├── diff/ # Container's writable layer
├── merged/ # Unified view
├── work/ # Kernel workspace
└── link # Shortened layer identifier
4.2 Copy-on-Write Internals¶
// Simplified kernel OverlayFS copy-up logic
int ovl_copy_up(struct dentry *dentry) {
// 1. Check if file exists in upperdir
if (exists_in_upper(dentry))
return 0; // Already copied
// 2. Create parent directories in upperdir
create_parent_dirs(upper_dentry);
// 3. Copy entire file from lowerdir to upperdir
// This is the expensive operation!
copy_file(lower_path, upper_path);
// 4. Copy xattrs (extended attributes)
copy_xattrs(lower_path, upper_path);
// 5. Set up overlay redirect (opaque marker)
set_redirect(upper_dentry);
return 0;
}
Performance Implications:
| Operation | Performance | Notes |
|---|---|---|
| Read from lower | Native | Direct read, no copy |
| Read from upper | Native | Direct read |
| Write new file | Native | Direct write to upper |
| Modify lower file | SLOW | Full copy-up first |
| Delete lower file | Fast | Create whiteout marker |
4.3 Whiteout Files and Opaque Directories¶
# Delete a file from base image
$ docker run --rm -it ubuntu rm /etc/motd
# Inside the container's upperdir:
$ ls -la /var/lib/docker/overlay2/xyz/diff/etc/
c--------- 1 root root 0, 0 Jan 26 10:00 motd # Character device 0:0 = whiteout
# Delete entire directory
$ docker run --rm -it ubuntu rm -rf /var/cache/
# Creates opaque directory with xattr
$ getfattr -d /var/lib/docker/overlay2/xyz/diff/var/cache/
trusted.overlay.opaque="y"
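The on-disk encoding can be checked programmatically: a whiteout is a character device with device number 0:0, and an opaque directory carries the trusted.overlay.opaque xattr. A sketch of the classification logic (the stat-based check runs anywhere; the xattr check only returns True on a real overlay upperdir, and reading trusted.* xattrs normally requires root):

```python
import os
import stat
import tempfile

def is_whiteout(st: os.stat_result) -> bool:
    """OverlayFS marks a deleted lower file as a 0:0 character device."""
    return (stat.S_ISCHR(st.st_mode)
            and os.major(st.st_rdev) == 0 and os.minor(st.st_rdev) == 0)

def is_opaque_dir(path: str) -> bool:
    """An opaque directory replaces, rather than merges with, the lower dir."""
    try:
        return os.getxattr(path, "trusted.overlay.opaque") == b"y"
    except OSError:
        return False  # xattr absent, or unreadable without privileges

# Sanity check on an ordinary file: not a whiteout
with tempfile.NamedTemporaryFile() as f:
    print(is_whiteout(os.stat(f.name)))  # False
```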
4.4 Image Layer Inspection¶
# View image layers
$ docker image inspect nginx:latest --format '{{json .RootFS.Layers}}' | jq
[
"sha256:abc123...", # Base layer
"sha256:def456...", # Layer 2
"sha256:789ghi..." # Top layer
]
# Examine layer history
$ docker history nginx:latest
IMAGE CREATED CREATED BY SIZE
abc123def456 2 weeks ago CMD ["nginx" "-g" "daemon off;"] 0B
<missing> 2 weeks ago STOPSIGNAL SIGQUIT 0B
<missing> 2 weeks ago EXPOSE 80 0B
<missing> 2 weeks ago ENTRYPOINT ["/docker-entrypoint.sh"] 0B
<missing> 2 weeks ago COPY 30-tune-worker-processes.sh ... (RUN) 4.62kB
...
# Layer diff contents
$ docker save nginx:latest | tar -xf - -C /tmp/nginx-layers/
$ ls /tmp/nginx-layers/
abc123.../
def456.../
manifest.json
5. Image Format and Registry Protocol¶
5.1 OCI Image Manifest¶
{
"schemaVersion": 2,
"mediaType": "application/vnd.oci.image.manifest.v1+json",
"config": {
"mediaType": "application/vnd.oci.image.config.v1+json",
"digest": "sha256:abc123...",
"size": 7023
},
"layers": [
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:layer1...",
"size": 32654848
},
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
"digest": "sha256:layer2...",
"size": 16724
}
],
"annotations": {
"org.opencontainers.image.created": "2025-01-26T10:00:00Z"
}
}
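Every digest in the manifest is a content address: the sha256 of the exact bytes of the blob. Verifying one is a one-liner, sketched here with a stand-in config blob:

```python
import hashlib
import json

def blob_digest(data: bytes) -> str:
    """OCI content digest: algorithm prefix + hex digest of the raw bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

# Stand-in config blob; a real client hashes the exact bytes it downloaded
config_bytes = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
print(blob_digest(config_bytes))

# Pulls are verifiable end-to-end: if the computed digest does not match
# the digest recorded in the manifest, the blob was corrupted or tampered with.
```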
5.2 Image Configuration¶
{
"architecture": "amd64",
"os": "linux",
"config": {
"Hostname": "",
"User": "",
"ExposedPorts": { "80/tcp": {} },
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin",
"NGINX_VERSION=1.25.0"
],
"Cmd": ["nginx", "-g", "daemon off;"],
"Entrypoint": ["/docker-entrypoint.sh"],
"WorkingDir": "/",
"Labels": {
"maintainer": "NGINX Docker Maintainers"
}
},
"rootfs": {
"type": "layers",
"diff_ids": [
"sha256:uncompressed-layer1...",
"sha256:uncompressed-layer2..."
]
},
"history": [
{
"created": "2025-01-20T00:00:00Z",
"created_by": "/bin/sh -c #(nop) ADD file:abc... in /",
"empty_layer": false
}
]
}
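Note the distinction between rootfs.diff_ids here and the layer digests in the manifest: a diff_id hashes the uncompressed layer tar, while the manifest digest hashes the compressed blob as transferred. A sketch with stand-in bytes (gzip with mtime=0 keeps the output deterministic):

```python
import gzip
import hashlib

def digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

uncompressed_tar = b"stand-in bytes for a layer tar"
compressed_blob = gzip.compress(uncompressed_tar, mtime=0)

diff_id = digest(uncompressed_tar)      # what rootfs.diff_ids records
layer_digest = digest(compressed_blob)  # what the manifest's layers[] records

print(diff_id != layer_digest)  # True: same content, two different addresses
```

This is why the same image has different "IDs" depending on where you look: registries address the compressed blobs, while the local image store addresses the uncompressed layers.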
5.3 Registry HTTP API (OCI Distribution)¶
# 1. Check API version
GET /v2/
# 2. Get manifest
GET /v2/<name>/manifests/<reference>
Accept: application/vnd.oci.image.manifest.v1+json
# 3. Get blob (layer)
GET /v2/<name>/blobs/<digest>
# 4. Push flow:
# a) Check if blob exists
HEAD /v2/<name>/blobs/<digest>
# b) Start upload
POST /v2/<name>/blobs/uploads/
# c) Upload blob
PUT /v2/<name>/blobs/uploads/<uuid>?digest=<digest>
Content-Type: application/octet-stream
# d) Push manifest
PUT /v2/<name>/manifests/<reference>
Content-Type: application/vnd.oci.image.manifest.v1+json
Authentication Flow:
# 1. Initial request returns 401 with WWW-Authenticate header
GET /v2/library/nginx/manifests/latest
< 401 Unauthorized
< WWW-Authenticate: Bearer realm="https://auth.docker.io/token",
service="registry.docker.io",scope="repository:library/nginx:pull"
# 2. Get token from auth service
GET https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/nginx:pull
> {"token": "eyJhbGc...", "expires_in": 300}
# 3. Retry with token
GET /v2/library/nginx/manifests/latest
Authorization: Bearer eyJhbGc...
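Step 1's WWW-Authenticate challenge can be parsed mechanically to build the token URL used in step 2. A sketch of that parsing (pure string handling, no network; function names are illustrative):

```python
import re
from urllib.parse import urlencode

def parse_www_authenticate(header: str) -> dict:
    """Extract key="value" pairs from a Bearer challenge header."""
    assert header.startswith("Bearer ")
    return dict(re.findall(r'(\w+)="([^"]*)"', header))

def token_url(header: str) -> str:
    """Build the auth-service URL a client fetches its token from."""
    fields = parse_www_authenticate(header)
    params = {k: fields[k] for k in ("service", "scope") if k in fields}
    return fields["realm"] + "?" + urlencode(params)

challenge = ('Bearer realm="https://auth.docker.io/token",'
             'service="registry.docker.io",'
             'scope="repository:library/nginx:pull"')
print(token_url(challenge))
```

The returned token is then sent as an Authorization: Bearer header on the retried request, as in step 3.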
6. Dockerfile and Build System¶
6.1 Dockerfile Instructions Deep Dive¶
| Instruction | Layer Created | Build-time | Run-time | Notes |
|---|---|---|---|---|
| FROM | Base | ✓ | | Sets base image |
| RUN | Yes | ✓ | | Executes commands |
| COPY | Yes | ✓ | | Copies files from context |
| ADD | Yes | ✓ | | COPY + URL + tar extraction |
| ENV | No* | ✓ | ✓ | Sets environment variables |
| ARG | No | ✓ | | Build-time variables |
| WORKDIR | No* | ✓ | ✓ | Sets working directory |
| USER | No* | ✓ | ✓ | Sets user for subsequent instructions |
| EXPOSE | No | (docs) | | Documentation only |
| VOLUME | No* | ✓ | | Creates mount point |
| ENTRYPOINT | No | ✓ | | Container entry point |
| CMD | No | ✓ | | Default arguments |
| LABEL | No* | ✓ | ✓ | Metadata |
| HEALTHCHECK | No | ✓ | | Health check command |
| SHELL | No | ✓ | | Default shell |
| STOPSIGNAL | No | ✓ | | Stop signal |
| ONBUILD | No | ✓ | | Trigger for child images |
*Creates metadata layer, not filesystem layer
6.2 Build Cache Mechanics¶
# Layer cache invalidation cascade
FROM python:3.11 # Cache: Base image digest
WORKDIR /app # Cache: Rarely changes
COPY requirements.txt . # Cache: Invalidates if file changes
RUN pip install -r requirements.txt # Cache: Invalidates if above changed
COPY . . # Cache: Invalidates on ANY source change
CMD ["python", "app.py"] # Cache: Invalidates if above changed
Cache Key Calculation:
- Parent layer hash
- Instruction string
- For COPY/ADD: file content hashes
- For RUN: the command string only (not execution results!)
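That cache-key scheme can be sketched as a hash chain: each layer's key mixes the parent key with the instruction string and, for COPY/ADD only, the hashes of the copied files. The function below is illustrative, not Docker's actual implementation:

```python
import hashlib

def cache_key(parent_key: str, instruction: str, file_hashes=()) -> str:
    """Illustrative build-cache key: parent key + instruction (+ file hashes)."""
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(instruction.encode())
    for fh in file_hashes:          # only COPY/ADD contribute content hashes
        h.update(fh.encode())
    return h.hexdigest()

base = cache_key("", "FROM python:3.11")
k1 = cache_key(base, "COPY requirements.txt .", ["sha256:aaa"])
k2 = cache_key(base, "COPY requirements.txt .", ["sha256:bbb"])  # file changed
print(k1 != k2)  # True: changed file content yields a different key

# RUN contributes only its command string, never its execution results:
assert cache_key(k1, "RUN pip install -r requirements.txt") == \
       cache_key(k1, "RUN pip install -r requirements.txt")
```

Because each key includes its parent, any change cascades: every instruction after the first miss rebuilds.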
Cache Invalidation:
# Force rebuild from specific instruction
docker build --no-cache .
# Invalidate from specific stage
docker build --no-cache-filter=build .
6.3 Multi-Stage Builds¶
Multi-stage builds dramatically reduce final image size:
# Stage 1: Build
FROM golang:1.21 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /app -ldflags="-s -w" .
# Stage 2: Runtime
FROM scratch
# Or: FROM gcr.io/distroless/static-debian12
COPY --from=builder /app /app
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
ENTRYPOINT ["/app"]
Result:
| Stage | Contents | Size |
|---|---|---|
| golang:1.21 | Go toolchain, libs, source | ~800MB |
| Final (scratch) | Binary + certs only | ~10MB |
6.4 BuildKit (Modern Build Engine)¶
BuildKit is the next-generation build engine with:
- Parallel stage execution
- Efficient layer caching
- Build secrets
- SSH forwarding
- Cache mounts
- Better cache export/import
Enable BuildKit:
export DOCKER_BUILDKIT=1
# Or in daemon.json: {"features": {"buildkit": true}}
BuildKit-specific Features:
# syntax=docker/dockerfile:1.5
# Mount cache for package managers
FROM python:3.11
RUN --mount=type=cache,target=/root/.cache/pip \
pip install numpy pandas scikit-learn
# Mount secrets (never stored in layer)
FROM alpine
RUN --mount=type=secret,id=aws_creds \
cat /run/secrets/aws_creds
# SSH forwarding
FROM alpine
RUN --mount=type=ssh \
git clone git@github.com:private/repo.git
# Bind mount from context
FROM golang:1.21
RUN --mount=type=bind,source=go.mod,target=go.mod \
go mod download
Build Command:
docker buildx build \
--secret id=aws_creds,src=$HOME/.aws/credentials \
--ssh default \
--cache-from type=registry,ref=myregistry/myapp:cache \
--cache-to type=registry,ref=myregistry/myapp:cache \
--platform linux/amd64,linux/arm64 \
--push \
-t myregistry/myapp:latest \
.
7. Networking Deep Dive¶
7.1 Network Drivers¶
| Driver | Use Case | Scope | IP Management |
|---|---|---|---|
| bridge | Default, isolated | Single host | Docker IPAM |
| host | Maximum performance | Single host | Host IP |
| none | No networking | Single host | None |
| overlay | Multi-host (Swarm) | Multi-host | Docker IPAM |
| macvlan | Direct L2 access | Single host | External DHCP/Static |
| ipvlan | L2/L3 without MAC | Single host | External |
7.2 Bridge Network Internals¶
# Default bridge network
$ docker network inspect bridge
[
{
"Name": "bridge",
"Driver": "bridge",
"IPAM": {
"Config": [{ "Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1" }]
},
"Options": {
"com.docker.network.bridge.default_bridge": "true",
"com.docker.network.bridge.name": "docker0"
}
}
]
# View bridge on host
$ ip link show docker0
docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
link/ether 02:42:ac:11:00:01 brd ff:ff:ff:ff:ff:ff
$ brctl show docker0
bridge name bridge id STP enabled interfaces
docker0 8000.024242424242 no veth123abc
veth456def
7.3 veth Pair Creation¶
When a container starts:
# Docker creates a veth pair (shown as equivalent shell commands;
# dockerd does this via netlink)
$ ip link add veth123 type veth peer name ceth0
# One end moves into the container's network namespace and becomes eth0
$ ip link set ceth0 netns <container-ns>
$ nsenter --net=<container-ns> ip link set ceth0 name eth0
# Other end connects to bridge
$ ip link set veth123 master docker0
$ ip link set veth123 up
# Configure container's interface
$ nsenter --net=<container-ns> ip addr add 172.17.0.2/16 dev eth0
$ nsenter --net=<container-ns> ip link set eth0 up
$ nsenter --net=<container-ns> ip route add default via 172.17.0.1
7.4 NAT and Port Mapping¶
Outbound (MASQUERADE):
$ iptables -t nat -L POSTROUTING -n -v
Chain POSTROUTING
pkts bytes target prot opt in out source destination
1234 56789 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
Inbound (DNAT for -p 8080:80):
$ iptables -t nat -L DOCKER -n -v
Chain DOCKER
pkts bytes target prot opt in out source destination
100 5000 DNAT tcp -- !docker0 * 0.0.0.0/0 0.0.0.0/0
tcp dpt:8080 to:172.17.0.2:80
7.5 DNS Resolution¶
Docker provides an embedded DNS server for user-defined networks:
# Inside container on user-defined network
$ cat /etc/resolv.conf
nameserver 127.0.0.11
options ndots:0
# Docker's DNS server at 127.0.0.11 handles:
# 1. Container name resolution (web → 172.18.0.2)
# 2. Service discovery
# 3. Forwards unknown queries to host DNS
8. Security Deep Dive¶
8.1 Linux Capabilities¶
Docker drops most capabilities by default:
# Default capabilities kept:
CHOWN, DAC_OVERRIDE, FSETID, FOWNER, MKNOD, NET_RAW,
SETGID, SETUID, SETFCAP, SETPCAP, NET_BIND_SERVICE,
SYS_CHROOT, KILL, AUDIT_WRITE
# Dangerous capabilities dropped:
SYS_ADMIN, SYS_PTRACE, SYS_MODULE, SYS_RAWIO, SYS_TIME,
SYS_BOOT, NET_ADMIN, SYS_RESOURCE, SYSLOG, ...
Capability Management:
# Drop all, add only needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx
# Check container capabilities
$ cat /proc/1/status | grep Cap
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
# Decode with capsh
$ capsh --decode=00000000a80425fb
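The hex bitmask from /proc/&lt;pid&gt;/status can also be decoded without capsh: each bit position corresponds to a capability number from linux/capability.h. A sketch that decodes Docker's default effective set:

```python
# Capability names indexed by bit position (from <linux/capability.h>)
CAPS = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID",
    "KILL", "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE",
    "NET_BIND_SERVICE", "NET_BROADCAST", "NET_ADMIN", "NET_RAW",
    "IPC_LOCK", "IPC_OWNER", "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT",
    "SYS_PTRACE", "SYS_PACCT", "SYS_ADMIN", "SYS_BOOT", "SYS_NICE",
    "SYS_RESOURCE", "SYS_TIME", "SYS_TTY_CONFIG", "MKNOD", "LEASE",
    "AUDIT_WRITE", "AUDIT_CONTROL", "SETFCAP", "MAC_OVERRIDE",
    "MAC_ADMIN", "SYSLOG", "WAKE_ALARM", "BLOCK_SUSPEND", "AUDIT_READ",
]

def decode_caps(mask_hex: str):
    """Decode a CapEff-style hex mask into capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAPS) if mask & (1 << bit)]

caps = decode_caps("00000000a80425fb")  # Docker's default effective set
print(len(caps), caps)  # 14 capabilities, matching the default list above
```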
8.2 Seccomp Profiles¶
Docker applies a default seccomp profile blocking ~44 syscalls:
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 1,
"architectures": ["SCMP_ARCH_X86_64"],
"syscalls": [
{
"names": ["accept", "accept4", "access", "..."],
"action": "SCMP_ACT_ALLOW"
},
{
"names": ["clone"],
"action": "SCMP_ACT_ALLOW",
"args": [
{
"index": 0,
"value": 2114060288,
"op": "SCMP_CMP_MASKED_EQ"
}
]
}
]
}
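The magic value 2114060288 in the clone rule above is simply the OR of all the CLONE_NEW* namespace flags; with SCMP_CMP_MASKED_EQ the filter inspects only those bits of clone()'s flags argument, so the rule polices namespace creation specifically. The arithmetic:

```python
# CLONE_NEW* flag values from <linux/sched.h>
CLONE_NEWNS     = 0x00020000
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS    = 0x04000000
CLONE_NEWIPC    = 0x08000000
CLONE_NEWUSER   = 0x10000000
CLONE_NEWPID    = 0x20000000
CLONE_NEWNET    = 0x40000000

mask = (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC
        | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET)
print(mask)  # 2114060288, the value in the seccomp profile above
```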
Blocked Syscalls Include:
- reboot, swapon, swapoff
- mount, umount, pivot_root
- clock_settime, settimeofday
- init_module, delete_module
- acct, kexec_load
Custom Profile:
docker run --security-opt seccomp=/path/to/profile.json myimage
8.3 AppArmor Profiles¶
Docker generates AppArmor profiles restricting:
profile docker-default flags=(attach_disconnected,mediate_deleted) {
# Deny writes to sensitive paths
deny /proc/** w,
deny /sys/** w,
# Allow read of specific proc files
/proc/*/attr/current r,
/proc/*/mounts r,
# Network access
network inet stream,
network inet6 stream,
# File capabilities
capability chown,
capability dac_override,
capability net_bind_service,
}
8.4 Rootless Mode¶
Running Docker without root privileges:
# Install rootless Docker
$ dockerd-rootless-setuptool.sh install
# Configure
$ export DOCKER_HOST=unix:///run/user/1000/docker.sock
# How it works:
# 1. Uses user namespaces (container root → host UID 100000+)
# 2. Uses slirp4netns for networking (userspace TCP/IP)
# 3. Uses fuse-overlayfs for storage
Limitations:
- Slightly slower networking (userspace stack)
- Cannot use privileged containers
- Cannot bind host ports below 1024 without extra configuration (e.g. lowering net.ipv4.ip_unprivileged_port_start)
- Some storage drivers unavailable
9. Docker Compose¶
9.1 Compose File Structure¶
version: "3.9"
services:
web:
build:
context: ./web
dockerfile: Dockerfile
args:
NODE_ENV: production
image: myapp-web:${TAG:-latest}
ports:
- "8080:3000"
environment:
- DATABASE_URL=postgres://db:5432/myapp
depends_on:
db:
condition: service_healthy
deploy:
replicas: 2
resources:
limits:
cpus: "0.5"
memory: 512M
networks:
- frontend
- backend
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
db:
image: postgres:15
volumes:
- pgdata:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
environment:
POSTGRES_PASSWORD_FILE: /run/secrets/db_password
secrets:
- db_password
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 10s
networks:
- backend
volumes:
pgdata:
driver: local
networks:
frontend:
driver: bridge
backend:
driver: bridge
internal: true
secrets:
db_password:
file: ./secrets/db_password.txt
9.2 Compose Commands¶
# Start services
docker compose up -d
# View logs
docker compose logs -f web
# Scale service
docker compose up -d --scale web=3
# Execute command in service
docker compose exec web sh
# Stop and remove
docker compose down -v # -v removes volumes
10. Debugging Containers¶
10.1 Inspection Commands¶
# Full container details
docker inspect <container>
# Specific field
docker inspect -f '{{.State.Pid}}' <container>
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container>
# Resource usage
docker stats <container>
# Process list
docker top <container>
# Filesystem changes
docker diff <container>
10.2 Entering Namespaces¶
# Get container PID
PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)
# Enter all namespaces
nsenter -t $PID -m -u -i -n -p /bin/sh
# Enter specific namespace only
nsenter -t $PID --net ip addr # Network namespace
nsenter -t $PID --pid --mount ps aux # PID + mount namespace
10.3 Debug Container (docker debug, Docker Desktop 4.27+)¶
# Attach debug container to running container's namespaces
docker debug <container>
# With specific image
docker debug --image=busybox <container>
10.4 Analyzing Image Layers¶
# Tool: dive (interactive layer explorer)
dive nginx:latest
# Manual analysis
docker save nginx:latest | tar -xf - -C /tmp/nginx/
ls /tmp/nginx/
# manifest.json, oci-layout, blobs/sha256/...
11. Performance Tuning¶
11.1 Image Optimization¶
# Bad: Large image, poor caching
FROM python:3.11
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
# Good: Small image, optimal caching
FROM python:3.11-slim AS base
WORKDIR /app
FROM base AS deps
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-compile -r requirements.txt
FROM base AS runtime
COPY --from=deps /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=deps /usr/local/bin /usr/local/bin
COPY . .
USER nobody
CMD ["python", "app.py"]
11.2 Resource Limits Best Practices¶
# Always set memory limits
docker run --memory="512m" --memory-swap="512m" # Disable swap
# CPU: Use cpus for hard limit, cpu-shares for soft
docker run --cpus="2" --cpu-shares="1024"
# Prevent fork bombs
docker run --pids-limit="100"
# I/O limits for noisy neighbors
docker run --device-read-bps="/dev/sda:10mb" --device-write-bps="/dev/sda:10mb"
11.3 Storage Performance¶
# Use volumes, not bind mounts, for databases
docker run -v pgdata:/var/lib/postgresql/data postgres
# For build caches, use tmpfs
docker run --tmpfs /tmp:rw,noexec,nosuid,size=1g myapp
# Configure storage driver options
# In daemon.json:
{
"storage-driver": "overlay2",
"storage-opts": [
"overlay2.override_kernel_check=true",
"overlay2.size=20G"
]
}
12. Production Checklist¶
12.1 Image Security¶
- [ ] Use minimal base images (distroless, alpine, scratch)
- [ ] Pin image versions (never use latest in production)
- [ ] Scan images for vulnerabilities (trivy, grype)
- [ ] Sign images (cosign, notation)
- [ ] Run as non-root user
- [ ] Use multi-stage builds
- [ ] No secrets in images (use runtime secrets)
12.2 Runtime Security¶
- [ ] Set resource limits (memory, CPU, PIDs)
- [ ] Drop capabilities (--cap-drop=ALL)
- [ ] Use read-only root filesystem (--read-only)
- [ ] Enable seccomp profile
- [ ] No privileged containers
- [ ] No host namespace sharing
- [ ] No Docker socket mounting
12.3 Observability¶
- [ ] Configure log driver with rotation
- [ ] Export metrics (cAdvisor, docker stats API)
- [ ] Health checks defined
- [ ] Tracing instrumentation
12.4 Build Process¶
- [ ] .dockerignore configured
- [ ] BuildKit enabled
- [ ] Multi-platform builds if needed
- [ ] CI/CD pipeline with caching