
Kubernetes

Container orchestration is the automated management of the lifecycle of hundreds, thousands, or tens of thousands of containers in production environments. It solves the problems that appear when you move from running 1–10 containers on a laptop to running 10,000+ containers across dozens or hundreds of machines.

Kubernetes (K8s) has decisively won the orchestration war; industry surveys consistently find it orchestrating the large majority of containerized production workloads.


1. Core Architecture

1.1 Component Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              CONTROL PLANE                                   │
│  ┌─────────────────────────────────────────────────────────────────────────┐│
│  │                         kube-apiserver                                   ││
│  │  • REST + gRPC API endpoint                                             ││
│  │  • Authentication, Authorization, Admission                             ││
│  │  • etcd client (only component that talks to etcd)                     ││
│  └────────────────────────────────┬────────────────────────────────────────┘│
│                                   │                                          │
│  ┌────────────────────────────────▼────────────────────────────────────────┐│
│  │                              etcd                                        ││
│  │  • Distributed key-value store (Raft consensus)                         ││
│  │  • Source of truth for all cluster state                                ││
│  │  • 3 or 5 nodes for HA                                                  ││
│  └─────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  ┌─────────────────────┐  ┌─────────────────────┐  ┌─────────────────────┐ │
│  │  kube-scheduler     │  │ controller-manager  │  │ cloud-controller    │ │
│  │                     │  │                     │  │                     │ │
│  │  • Watches unbound  │  │  • Node controller  │  │  • Node lifecycle   │ │
│  │    Pods             │  │  • ReplicaSet       │  │  • LoadBalancer     │ │
│  │  • Scores nodes     │  │  • Deployment       │  │  • Routes           │ │
│  │  • Binds Pod→Node   │  │  • StatefulSet      │  │  • Cloud disks      │ │
│  │                     │  │  • Job, CronJob     │  │                     │ │
│  └─────────────────────┘  └─────────────────────┘  └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                              ┌───────┴────────┐
                              │   Network      │
                              └───────┬────────┘
                                      │
┌─────────────────────────────────────┼─────────────────────────────────────────┐
│                               WORKER NODES                                     │
│                                      │                                         │
│   ┌──────────────────────────────────▼──────────────────────────────────────┐ │
│   │                              kubelet                                     │ │
│   │  • Registers node with API server                                        │ │
│   │  • Watches for Pod assignments                                           │ │
│   │  • Manages container lifecycle via CRI                                   │ │
│   │  • Reports node status, pod status                                       │ │
│   │  • Manages volumes via CSI                                               │ │
│   └──────────────────────────────────┬──────────────────────────────────────┘ │
│                                      │ CRI (gRPC)                              │
│   ┌──────────────────────────────────▼──────────────────────────────────────┐ │
│   │                          containerd / CRI-O                              │ │
│   │  • Pulls images                                                          │ │
│   │  • Creates containers via OCI runtime                                    │ │
│   │  • Manages container lifecycle                                           │ │
│   └──────────────────────────────────────────────────────────────────────────┘ │
│                                                                                │
│   ┌──────────────────────────────────────────────────────────────────────────┐ │
│   │                            kube-proxy                                     │ │
│   │  • Maintains network rules (iptables/IPVS/eBPF)                          │ │
│   │  • Implements Service abstraction                                        │ │
│   │  • Load balances traffic to Pod endpoints                                │ │
│   └──────────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────────┘

1.2 Component Details

Component                 Function                      Stateless?  HA Strategy
kube-apiserver            API gateway, auth, admission  Yes         Multiple replicas behind LB
etcd                      Persistent state storage      No          Raft consensus (3 or 5 nodes)
kube-scheduler            Pod placement decisions       Yes         Leader election
kube-controller-manager   Reconciliation loops          Yes         Leader election
cloud-controller-manager  Cloud provider integration    Yes         Leader election
kubelet                   Node agent                    N/A         One per node
kube-proxy                Network rules                 N/A         One per node

2. etcd: The Cluster Brain

2.1 What etcd Stores

Everything in Kubernetes is stored in etcd under /registry/:

/registry/
├── configmaps/
│   └── default/
│       └── my-config
├── deployments/
│   └── default/
│       └── nginx-deployment
├── events/
├── namespaces/
│   ├── default
│   ├── kube-system
│   └── kube-public
├── nodes/
│   ├── node-1
│   └── node-2
├── pods/
│   └── default/
│       ├── nginx-abc123
│       └── nginx-def456
├── secrets/
├── services/
└── ...

2.2 Raft Consensus Protocol

etcd uses Raft for distributed consensus:

┌─────────────────────────────────────────────────────────────────┐
│                     Raft State Machine                           │
│                                                                  │
│   ┌───────────────┐                                             │
│   │    Leader     │  ← Only leader handles writes                │
│   │   (Node 1)    │  ← Replicates to followers                  │
│   └───────┬───────┘                                             │
│           │                                                      │
│     ┌─────┴─────┐                                               │
│     ▼           ▼                                               │
│ ┌────────┐   ┌────────┐                                         │
│ │Follower│   │Follower│                                         │
│ │(Node 2)│   │(Node 3)│                                         │
│ └────────┘   └────────┘                                         │
│                                                                  │
│  Write Path:                                                     │
│  1. Client → Leader                                              │
│  2. Leader appends to local log                                  │
│  3. Leader replicates to followers                               │
│  4. Majority (quorum) acknowledges                               │
│  5. Leader commits entry                                         │
│  6. Leader responds to client                                    │
│                                                                  │
│  Quorum = (N/2) + 1                                             │
│  3 nodes → need 2 for consensus (survives 1 failure)            │
│  5 nodes → need 3 for consensus (survives 2 failures)           │
└─────────────────────────────────────────────────────────────────┘
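The quorum arithmetic above can be sketched in a couple of functions (illustrative, not etcd code):

```go
package main

import "fmt"

// quorum returns the number of members that must acknowledge a write
// for it to commit in a cluster of n members.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many member failures the cluster survives
// while still being able to reach quorum.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("%d members: quorum=%d, survives %d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Note that 4 members need a quorum of 3 and still tolerate only one failure, the same as 3 members; this is why even cluster sizes are avoided.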

2.3 etcd Performance Characteristics

Metric           Recommended         Critical
Disk IOPS        >3000               SSDs required
Disk latency     <10ms p99           >50ms = cluster instability
Network latency  <2ms between nodes  >10ms = election timeouts
Object size      <1MB                >1.5MB rejected
Total DB size    <8GB default        Can increase, but impacts performance

2.4 etcd Operations

# Check cluster health
etcdctl endpoint health --endpoints=https://127.0.0.1:2379

# List all keys
etcdctl get / --prefix --keys-only

# Get specific key
etcdctl get /registry/pods/default/nginx-abc123

# Watch for changes
etcdctl watch /registry/pods --prefix

# Compact history (required for long-running clusters)
etcdctl compaction $(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')

# Defragment (reclaim disk space after compaction)
etcdctl defrag --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

3. API Server Deep Dive

3.1 Request Processing Pipeline

┌────────────────────────────────────────────────────────────────────────────┐
│                        API Server Request Flow                              │
│                                                                             │
│  Client Request                                                             │
│       │                                                                     │
│       ▼                                                                     │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    1. AUTHENTICATION                                 │   │
│  │  • Client certificates (x509)                                        │   │
│  │  • Bearer tokens (ServiceAccount, OIDC)                              │   │
│  │  • Basic auth (removed in v1.19)                                     │   │
│  │  • Webhook token auth                                                │   │
│  │                                                                       │   │
│  │  Result: User identity (username, UID, groups)                       │   │
│  └──────────────────────────────────┬──────────────────────────────────┘   │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    2. AUTHORIZATION                                  │   │
│  │  • RBAC (Role-Based Access Control) ← primary                       │   │
│  │  • ABAC (Attribute-Based)                                           │   │
│  │  • Webhook                                                           │   │
│  │  • Node authorizer (kubelet-specific)                               │   │
│  │                                                                       │   │
│  │  Question: Can user X perform verb Y on resource Z?                  │   │
│  └──────────────────────────────────┬──────────────────────────────────┘   │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    3. ADMISSION CONTROLLERS                          │   │
│  │                                                                       │   │
│  │  ┌─────────────────────┐    ┌─────────────────────┐                  │   │
│  │  │ Mutating Webhooks   │ →  │ Validating Webhooks │                  │   │
│  │  │                     │    │                     │                  │   │
│  │  │ • Modify objects    │    │ • Accept/Reject     │                  │   │
│  │  │ • Inject sidecars   │    │ • Policy enforcement│                  │   │
│  │  │ • Set defaults      │    │ • Security checks   │                  │   │
│  │  └─────────────────────┘    └─────────────────────┘                  │   │
│  └──────────────────────────────────┬──────────────────────────────────┘   │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    4. VALIDATION                                     │   │
│  │  • Schema validation (OpenAPI)                                       │   │
│  │  • Field immutability checks                                        │   │
│  │  • Resource quota checks                                            │   │
│  └──────────────────────────────────┬──────────────────────────────────┘   │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                    5. PERSISTENCE                                    │   │
│  │  • Serialize to protobuf                                            │   │
│  │  • Write to etcd                                                    │   │
│  │  • Return response                                                  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────────────┘

3.2 API Groups and Versions

# Core API (legacy, no group)
/api/v1/namespaces/default/pods/nginx

# Named API groups
/apis/apps/v1/namespaces/default/deployments/nginx
/apis/batch/v1/namespaces/default/jobs/myjob
/apis/networking.k8s.io/v1/namespaces/default/ingresses/myingress

# List all API resources
kubectl api-resources
kubectl api-versions

3.3 Watch Mechanism

Kubernetes uses long-polling watches for efficient state synchronization:

// Controller's informer uses watch
watcher, _ := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
    ResourceVersion: "12345",  // Start watching from this version
})

for event := range watcher.ResultChan() {
    switch event.Type {
    case watch.Added:
        // New pod created
    case watch.Modified:
        // Pod updated
    case watch.Deleted:
        // Pod removed
    case watch.Bookmark:
        // Progress marker (no actual change)
    case watch.Error:
        // Re-list and restart watch
    }
}

Resource Versions:

  • Every object has a resourceVersion (etcd revision)
  • Watches specify starting resourceVersion
  • Allows efficient sync without polling

4. Admission Controllers

4.1 Built-in Admission Controllers

Controller           Type        Function
NamespaceLifecycle   Validating  Prevents ops in terminating namespaces
LimitRanger          Mutating    Applies default resource limits
ServiceAccount       Mutating    Auto-mounts SA tokens
DefaultStorageClass  Mutating    Assigns default storage class
ResourceQuota        Validating  Enforces namespace quotas
PodSecurity          Validating  Enforces Pod Security Standards
NodeRestriction      Validating  Limits kubelet API access
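Built-in controllers are compiled into the API server and toggled with flags. On a kubeadm cluster they would typically be set in the API server's static Pod manifest; the path and surrounding fields below are illustrative:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout; excerpt)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --enable-admission-plugins=NodeRestriction,PodSecurity
        - --disable-admission-plugins=DefaultStorageClass
```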

4.2 Dynamic Admission Webhooks

MutatingWebhookConfiguration:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
  - name: sidecar.example.com
    clientConfig:
      service:
        name: sidecar-injector
        namespace: system
        path: /mutate
      caBundle: <base64-encoded-ca>
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        sidecar-injection: enabled
    failurePolicy: Fail # or Ignore
    sideEffects: None
    admissionReviewVersions: ["v1"]

Webhook Handler Example:

func handleMutate(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    pod := corev1.Pod{}
    if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // Append a sidecar container via JSON Patch
    // (sidecarContainer is defined elsewhere)
    patch := []map[string]interface{}{
        {
            "op":    "add",
            "path":  "/spec/containers/-",
            "value": sidecarContainer,
        },
    }

    patchBytes, err := json.Marshal(patch)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    patchType := admissionv1.PatchTypeJSONPatch

    review.Response = &admissionv1.AdmissionResponse{
        UID:       review.Request.UID,
        Allowed:   true,
        PatchType: &patchType,
        Patch:     patchBytes,
    }

    json.NewEncoder(w).Encode(review)
}

4.3 Policy Engines

Kyverno:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-for-labels
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Pods must have 'app' and 'owner' labels"
        pattern:
          metadata:
            labels:
              app: "?*"
              owner: "?*"

OPA Gatekeeper:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items: { type: string }
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Missing labels: %v", [missing])
        }
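The template above only defines the policy and its parameter schema; it takes effect once a constraint object of the generated kind is created. A minimal sketch (name and label list chosen for illustration):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: pods-must-have-app-and-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    labels: ["app", "owner"]
```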

5. The Reconciliation Loop

5.1 Controller Pattern

Every Kubernetes controller follows this pattern:

func (c *Controller) Run(ctx context.Context) {
    // 1. List all existing objects (initial sync)
    objects, _ := c.lister.List(labels.Everything())
    for _, obj := range objects {
        c.workqueue.Add(obj.GetName())
    }

    // 2. Watch for changes
    go c.informer.Run(ctx.Done())

    // 3. Process work queue
    for c.processNextItem(ctx) {
    }
}

func (c *Controller) processNextItem(ctx context.Context) bool {
    key, shutdown := c.workqueue.Get()
    if shutdown {
        return false
    }
    defer c.workqueue.Done(key)

    // 4. Reconcile
    err := c.reconcile(ctx, key.(string))

    if err != nil {
        // 5. Requeue with exponential backoff
        c.workqueue.AddRateLimited(key)
        return true
    }

    c.workqueue.Forget(key)
    return true
}

func (c *Controller) reconcile(ctx context.Context, name string) error {
    // Get desired state
    desired, err := c.lister.Get(name)
    if errors.IsNotFound(err) {
        return nil // Object deleted, nothing to do
    }
    if err != nil {
        return err
    }

    // Get actual state
    actual, err := c.getActualState(name)
    if err != nil {
        return err
    }

    // Compare and act
    if !reflect.DeepEqual(desired.Spec, actual) {
        return c.update(ctx, desired)
    }

    return nil
}

5.2 Deployment Controller Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                    Deployment Controller                                 │
│                                                                          │
│  User creates Deployment (replicas: 3)                                  │
│       │                                                                  │
│       ▼                                                                  │
│  Deployment Controller watches → sees new Deployment                    │
│       │                                                                  │
│       ▼                                                                  │
│  Creates ReplicaSet with replicas: 3                                    │
│       │                                                                  │
│       ▼                                                                  │
│  ReplicaSet Controller watches → sees new ReplicaSet                    │
│       │                                                                  │
│       ▼                                                                  │
│  Creates 3 Pods (without nodeName)                                      │
│       │                                                                  │
│       ▼                                                                  │
│  Scheduler watches → sees 3 unscheduled Pods                            │
│       │                                                                  │
│       ▼                                                                  │
│  Assigns nodeName to each Pod                                           │
│       │                                                                  │
│       ▼                                                                  │
│  kubelet watches → sees Pods assigned to this node                      │
│       │                                                                  │
│       ▼                                                                  │
│  Starts containers via CRI                                              │
│       │                                                                  │
│       ▼                                                                  │
│  Reports Pod status back to API server                                  │
└─────────────────────────────────────────────────────────────────────────┘
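Each object in this chain records its creator in metadata.ownerReferences, which is how the controllers and garbage collector link the levels together. A sketch of what the generated ReplicaSet's metadata looks like (names and UID are illustrative):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx-5d59d67564        # hash suffix added by the Deployment controller
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
      uid: 9a1b2c3d-0000-0000-0000-000000000000   # illustrative
      controller: true
      blockOwnerDeletion: true
```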

6. The Pod: Kubernetes Atom

6.1 Pod is NOT a Container

A Pod is:

  • A group of 1+ containers sharing:
    • Network namespace (same IP, same localhost)
    • IPC namespace (shared memory)
    • UTS namespace (same hostname)
    • Optionally: PID namespace
  • A scheduling unit (placed together on one node)
  • A lifecycle unit (all containers start/stop together)
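Because the containers share one network namespace, one container can reach another over localhost. A minimal sketch (images and tags illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo
spec:
  containers:
    - name: web
      image: nginx:1.27
      ports:
        - containerPort: 80
    - name: client
      image: curlimages/curl:8.8.0
      # Reaches the nginx container over the shared loopback interface
      command: ["sh", "-c", "sleep 5; curl -s http://localhost:80; sleep infinity"]
```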

6.2 Pod Anatomy

apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
  namespace: default
  labels:
    app: myapp
    version: v1
  annotations:
    prometheus.io/scrape: "true"
spec:
  # Scheduling constraints
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values: ["us-west-1a", "us-west-1b"]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: myapp
            topologyKey: kubernetes.io/hostname
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "database"
      effect: "NoSchedule"

  # Service account
  serviceAccountName: myapp-sa
  automountServiceAccountToken: false

  # Security context (pod-level)
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault

  # DNS configuration
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"

  # Init containers (run sequentially before main containers)
  initContainers:
    - name: init-db
      image: busybox
      command: ["sh", "-c", "until nc -z db 5432; do sleep 2; done"]

  # Main containers
  containers:
    - name: app
      image: myapp:v1.2.3
      imagePullPolicy: IfNotPresent

      # Commands
      command: ["/app/server"]
      args: ["--config=/etc/config/app.yaml"]

      # Environment
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
      envFrom:
        - configMapRef:
            name: app-config

      # Ports
      ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090

      # Resources
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"

      # Probes
      startupProbe:
        httpGet:
          path: /healthz
          port: http
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
        initialDelaySeconds: 0
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: http
        periodSeconds: 5
        failureThreshold: 1

      # Security context (container-level)
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]

      # Volume mounts
      volumeMounts:
        - name: config
          mountPath: /etc/config
          readOnly: true
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /var/cache

    # Sidecar container
    - name: log-shipper
      image: fluent/fluent-bit:latest
      resources:
        requests:
          cpu: "10m"
          memory: "32Mi"
        limits:
          cpu: "50m"
          memory: "64Mi"
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true

  # Volumes
  volumes:
    - name: config
      configMap:
        name: app-config
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir:
        sizeLimit: "100Mi"
    - name: logs
      emptyDir: {}

  # Termination
  terminationGracePeriodSeconds: 30

  # Restart policy
  restartPolicy: Always # Always | OnFailure | Never

6.3 Pod Lifecycle

┌─────────────────────────────────────────────────────────────────────────┐
│                         Pod Lifecycle                                    │
│                                                                          │
│   Pending                                                                │
│      │                                                                   │
│      │  (Scheduled to node)                                             │
│      ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    Container Startup                             │   │
│   │                                                                  │   │
│   │  1. Pull image (if not cached)                                  │   │
│   │  2. Create container                                            │   │
│   │  3. Run init containers (sequentially)                          │   │
│   │  4. Start main containers (in parallel)                         │   │
│   │  5. Execute postStart hooks                                     │   │
│   │  6. Wait for startupProbe to pass                               │   │
│   │  7. Start livenessProbe and readinessProbe                      │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│      │                                                                   │
│      ▼                                                                   │
│   Running ←──────────────────────────────────────────────┐              │
│      │                                                    │              │
│      │  (livenessProbe fails)                            │              │
│      ▼                                                    │              │
│   Container restarts ─────────────────────────────────────┘              │
│      │                                                                   │
│      │  (Pod deleted or node fails)                                     │
│      ▼                                                                   │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                   Termination Sequence                           │   │
│   │                                                                  │   │
│   │  1. Pod marked Terminating                                       │   │
│   │  2. Remove from Service endpoints                                │   │
│   │  3. Execute preStop hook (if defined)                           │   │
│   │  4. Send SIGTERM once the hook completes                        │   │
│   │  5. Wait terminationGracePeriodSeconds                          │   │
│   │  6. Send SIGKILL                                                 │   │
│   │  7. Remove Pod object                                            │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│      │                                                                   │
│      ▼                                                                   │
│   Succeeded / Failed                                                     │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
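A common pattern for the termination sequence is a short preStop sleep: removal from Service endpoints propagates asynchronously, so traffic may still arrive for a moment after termination begins. Sketch (names illustrative):

```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: app
      image: myapp:v1.2.3
      lifecycle:
        preStop:
          exec:
            # Sleep briefly so load balancers stop sending traffic
            # before SIGTERM is delivered to the process.
            command: ["sh", "-c", "sleep 10"]
```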

6.4 Container Types

Type                  When Runs                             Use Case
Init Containers       Before main containers, sequentially  DB migrations, wait for dependencies
Main Containers       Application lifetime, in parallel     Primary workload
Sidecar Containers    Application lifetime, in parallel     Log shipping, proxies, monitoring
Ephemeral Containers  Debug-time only (kubectl debug)       Troubleshooting running pods
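Since Kubernetes v1.28, a sidecar can also be declared natively as an init container with restartPolicy: Always; it then starts before the main containers, keeps running alongside them, and is stopped after them. Sketch (images illustrative):

```yaml
spec:
  initContainers:
    - name: log-shipper
      image: fluent/fluent-bit:3.0
      # Container-level restartPolicy: Always marks this init container
      # as a native sidecar (v1.28+): started first, terminated last.
      restartPolicy: Always
  containers:
    - name: app
      image: myapp:v1.2.3
```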

7. The Scheduler

7.1 Scheduling Phases

┌─────────────────────────────────────────────────────────────────────────┐
│                        Scheduler Pipeline                                │
│                                                                          │
│  Unscheduled Pod enters queue                                           │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                  Phase 1: FILTERING                              │   │
│  │                                                                  │   │
│  │  Eliminate nodes that cannot run the Pod:                        │   │
│  │  • PodFitsResources - enough CPU/memory?                         │   │
│  │  • PodFitsHostPorts - port conflicts?                            │   │
│  │  • NodeSelector - labels match?                                  │   │
│  │  • TaintToleration - tolerates taints?                           │   │
│  │  • NodeAffinity - affinity rules satisfied?                      │   │
│  │  • VolumeBinding - PV available in zone?                         │   │
│  │  • InterPodAffinity - co-location rules?                         │   │
│  │                                                                  │   │
│  │  Input: All nodes                                                │   │
│  │  Output: Feasible nodes                                          │   │
│  └──────────────────────────────────┬──────────────────────────────┘   │
│                                     ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                   Phase 2: SCORING                               │   │
│  │                                                                  │   │
│  │  Rank feasible nodes (0-100 per plugin):                         │   │
│  │  • NodeResourcesFit - prefer balanced utilization                │   │
│  │  • ImageLocality - image already cached?                         │   │
│  │  • InterPodAffinity - prefer co-located pods                     │   │
│  │  • TaintToleration - prefer fewer tolerations needed             │   │
│  │  • NodeAffinity - prefer affinity matches                        │   │
│  │                                                                  │   │
│  │  Final score = Σ (plugin_score × plugin_weight)                  │   │
│  └──────────────────────────────────┬──────────────────────────────┘   │
│                                     ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    Phase 3: BINDING                              │   │
│  │                                                                  │   │
│  │  1. Select highest-scoring node                                  │   │
│  │  2. Reserve resources (optimistic)                               │   │
│  │  3. Run pre-bind plugins (e.g., volume provisioning)            │   │
│  │  4. Update Pod's spec.nodeName                                   │   │
│  │  5. Run post-bind plugins                                        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
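The three phases can be sketched as a toy pipeline (hypothetical pod/node shapes, not the real scheduler-framework API):

```python
def schedule(pod, nodes, filters, scorers):
    """Toy scheduler: drop infeasible nodes, then bind to the top-scoring one."""
    # Phase 1: FILTERING - a node is feasible only if every filter plugin passes
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending

    # Phase 2: SCORING - final score = sum(plugin_score * plugin_weight)
    def final_score(node):
        return sum(plugin(pod, node) * weight for plugin, weight in scorers)

    # Phase 3: BINDING - the real scheduler now sets the Pod's spec.nodeName
    return max(feasible, key=final_score)

# Tiny example: filter on free CPU, prefer nodes that already cache the image
nodes = [
    {"name": "node-a", "free_cpu": 2.0, "has_image": False},
    {"name": "node-b", "free_cpu": 4.0, "has_image": True},
]
fits = lambda pod, node: node["free_cpu"] >= pod["cpu_request"]
image_locality = lambda pod, node: 100 if node["has_image"] else 0
best = schedule({"cpu_request": 1.0}, nodes, [fits], [(image_locality, 1)])
```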

7.2 Scheduling Constraints

Node Selector (simple):

spec:
  nodeSelector:
    disktype: ssd
    zone: us-west-1a

Node Affinity (flexible):

spec:
  affinity:
    nodeAffinity:
      # Hard requirement
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64", "arm64"]
      # Soft preference
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: zone
                operator: In
                values: ["us-west-1a"]

Pod Anti-Affinity (spread replicas):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: redis
          topologyKey: kubernetes.io/hostname

Taints and Tolerations:

# Taint a node (repels pods)
kubectl taint nodes node1 dedicated=database:NoSchedule

# Pod must tolerate to schedule
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "database"
    effect: "NoSchedule"
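The acceptance rule can be sketched as a small function (simplified: it covers only the NoSchedule effect and ignores tolerationSeconds, which applies to NoExecute):

```python
def tolerates(toleration, taint):
    """Does one toleration match one taint? (simplified)"""
    if toleration.get("key") != taint["key"]:
        return False
    # An empty effect on the toleration matches any effect
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True  # Exists ignores the value entirely
    return toleration.get("value") == taint["value"]

def schedulable(pod_tolerations, node_taints):
    """The node repels the pod unless every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )

db_taint = {"key": "dedicated", "value": "database", "effect": "NoSchedule"}
db_toleration = {"key": "dedicated", "operator": "Equal",
                 "value": "database", "effect": "NoSchedule"}
```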

7.3 Resource Management

Requests vs Limits:

Aspect      Requests                  Limits
----------  ------------------------  --------------------
Scheduling  Used for node selection   Not considered
CPU         Guaranteed minimum        Throttled above
Memory      Guaranteed minimum        OOM-killed above
QoS         Determines QoS class      Determines QoS class

QoS Classes:

Class       Criteria                               Eviction Priority
----------  -------------------------------------  ----------------------
Guaranteed  requests == limits for all containers  Lowest (evicted last)
Burstable   At least one request set               Medium
BestEffort  No requests or limits                  Highest (evicted first)
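Classification can be sketched as a small function (simplified to cpu/memory on regular containers; the real rules also consider init and ephemeral containers):

```python
def qos_class(containers):
    """Derive the QoS class from per-container requests/limits dicts."""
    no_requests = not any(c.get("requests") for c in containers)
    no_limits = not any(c.get("limits") for c in containers)
    if no_requests and no_limits:
        return "BestEffort"
    # Guaranteed: every container sets cpu and memory with requests == limits
    if all(
        c.get("requests", {}).get(res) is not None
        and c.get("requests", {}).get(res) == c.get("limits", {}).get(res)
        for c in containers
        for res in ("cpu", "memory")
    ):
        return "Guaranteed"
    return "Burstable"  # something is set, but not fully pinned
```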

8. Services and Networking

8.1 Service Types

┌─────────────────────────────────────────────────────────────────────────┐
│                          Service Types                                   │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                       ClusterIP                                  │   │
│  │                                                                  │   │
│  │  • Internal cluster IP only                                      │   │
│  │  • DNS: my-svc.namespace.svc.cluster.local                      │   │
│  │  • Default type                                                  │   │
│  │                                                                  │   │
│  │  spec:                                                           │   │
│  │    type: ClusterIP                                               │   │
│  │    clusterIP: 10.96.0.100                                       │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                       NodePort                                   │   │
│  │                                                                  │   │
│  │  • Exposes on each node's IP at static port (30000-32767)       │   │
│  │  • Includes ClusterIP                                            │   │
│  │                                                                  │   │
│  │  spec:                                                           │   │
│  │    type: NodePort                                                │   │
│  │    ports:                                                        │   │
│  │    - port: 80                                                    │   │
│  │      nodePort: 30080                                             │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      LoadBalancer                                │   │
│  │                                                                  │   │
│  │  • Provisions cloud load balancer (AWS ELB, GCP LB, etc.)       │   │
│  │  • Includes NodePort and ClusterIP                               │   │
│  │                                                                  │   │
│  │  spec:                                                           │   │
│  │    type: LoadBalancer                                            │   │
│  │    loadBalancerIP: 1.2.3.4  # optional, if supported            │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      ExternalName                                │   │
│  │                                                                  │   │
│  │  • DNS CNAME record, no proxying                                 │   │
│  │  • Useful for external services                                  │   │
│  │                                                                  │   │
│  │  spec:                                                           │   │
│  │    type: ExternalName                                            │   │
│  │    externalName: my.database.example.com                        │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                       Headless                                   │   │
│  │                                                                  │   │
│  │  • No ClusterIP (clusterIP: None)                               │   │
│  │  • DNS returns Pod IPs directly                                  │   │
│  │  • Used with StatefulSets for stable network identity           │   │
│  │                                                                  │   │
│  │  spec:                                                           │   │
│  │    clusterIP: None                                               │   │
│  │    selector:                                                     │   │
│  │      app: postgres                                               │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

8.2 kube-proxy Modes

iptables Mode (default):

# View Service rules
iptables -t nat -L KUBE-SERVICES -n

# Chain KUBE-SERVICES
-A KUBE-SERVICES -d 10.96.0.100/32 -p tcp -m tcp --dport 80 \
   -j KUBE-SVC-XXXXX

# Chain KUBE-SVC-XXXXX (round-robin to endpoints)
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.33333 \
   -j KUBE-SEP-11111
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.50000 \
   -j KUBE-SEP-22222
-A KUBE-SVC-XXXXX -j KUBE-SEP-33333

# Chain KUBE-SEP-11111 (DNAT to Pod)
-A KUBE-SEP-11111 -p tcp -j DNAT --to-destination 172.17.0.2:8080
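The cascading probabilities are how iptables fakes an even split: rule i fires with probability 1/(n-i), so with three endpoints each gets a 1/3 share (1/3; then 2/3 × 1/2; then the unconditional fall-through). A quick simulation (hypothetical endpoint names) illustrates this:

```python
import random

def pick_endpoint(endpoints, rng):
    """Mimic a KUBE-SVC chain: rule i matches with probability 1/(n-i)."""
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        # The final rule has no --probability match: it always fires
        if i == n - 1 or rng.random() < 1.0 / (n - i):
            return ep

rng = random.Random(42)
counts = {"sep1": 0, "sep2": 0, "sep3": 0}
for _ in range(30000):
    counts[pick_endpoint(["sep1", "sep2", "sep3"], rng)] += 1
# each endpoint receives roughly a third of the 30000 connections
```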

IPVS Mode (better performance):

# Enable in kube-proxy config
mode: ipvs
ipvs:
  scheduler: rr  # rr, lc, dh, sh, sed, nq

# View IPVS rules
ipvsadm -Ln
IP Virtual Server version 1.2.1
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  10.96.0.100:80 rr
  -> 172.17.0.2:8080      Masq    1      0          0
  -> 172.17.0.3:8080      Masq    1      0          0
  -> 172.17.0.4:8080      Masq    1      0          0

eBPF Mode (Cilium):

  • No iptables/IPVS
  • Direct socket-level load balancing
  • Better performance, lower latency
  • Requires Cilium CNI

8.3 CNI (Container Network Interface)

Pod-to-Pod Networking Requirements:

  1. Every Pod gets its own IP address
  2. Pods can communicate without NAT
  3. Nodes can communicate with Pods without NAT
  4. The IP a Pod sees for itself is the same IP others use to reach it

Popular CNI Plugins:

Plugin       Network Model  Features
-----------  -------------  ----------------------------------------
Cilium       eBPF           L7 policies, observability, service mesh
Calico       BGP or VXLAN   Network policies, high performance
Flannel      VXLAN/host-gw  Simple, minimal features
AWS VPC CNI  Native VPC     Pod IPs from the VPC, no overlay
Weave        VXLAN          Simple, encrypted

8.4 Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80

Ingress Controllers:

Controller     Maintained By  Features
-------------  -------------  ---------------------------
ingress-nginx  Kubernetes     Most popular, battle-tested
Traefik        Traefik Labs   Auto-discovery, middleware
HAProxy        HAProxy        High performance
Contour        VMware         Envoy-based
AWS ALB        AWS            Native ALB integration
Istio Gateway  Istio          Service mesh integration

9. Storage

9.1 Storage Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        Storage Architecture                              │
│                                                                          │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                    PersistentVolumeClaim (PVC)                   │   │
│  │                                                                  │   │
│  │  • User's storage request                                        │   │
│  │  • Namespace-scoped                                              │   │
│  │  • Specifies size, access modes, storage class                   │   │
│  └──────────────────────────────────┬──────────────────────────────┘   │
│                                     │ binds to                          │
│  ┌──────────────────────────────────▼──────────────────────────────┐   │
│  │                     PersistentVolume (PV)                        │   │
│  │                                                                  │   │
│  │  • Cluster-scoped storage resource                               │   │
│  │  • Provisioned statically or dynamically                         │   │
│  │  • Has specific capacity, access modes, reclaim policy           │   │
│  └──────────────────────────────────┬──────────────────────────────┘   │
│                                     │ backed by                         │
│  ┌──────────────────────────────────▼──────────────────────────────┐   │
│  │                     Storage Backend                              │   │
│  │                                                                  │   │
│  │  • Cloud: AWS EBS, GCP PD, Azure Disk                           │   │
│  │  • On-prem: Ceph, NFS, iSCSI                                    │   │
│  │  • Local: hostPath, local PV                                     │   │
│  └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
│  StorageClass (controls dynamic provisioning)                           │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  apiVersion: storage.k8s.io/v1                                   │   │
│  │  kind: StorageClass                                              │   │
│  │  metadata:                                                       │   │
│  │    name: fast-ssd                                                │   │
│  │  provisioner: kubernetes.io/aws-ebs                              │   │
│  │  parameters:                                                     │   │
│  │    type: io1                                                     │   │
│  │    iopsPerGB: "50"                                               │   │
│  │  reclaimPolicy: Delete                                           │   │
│  │  volumeBindingMode: WaitForFirstConsumer                         │   │
│  │  allowVolumeExpansion: true                                      │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

9.2 Access Modes

Mode              Abbreviation  Description
----------------  ------------  ----------------------------------------
ReadWriteOnce     RWO           Read/write from a single node
ReadOnlyMany      ROX           Read-only from multiple nodes
ReadWriteMany     RWX           Read/write from multiple nodes
ReadWriteOncePod  RWOP          Read/write from a single pod (K8s 1.22+)

9.3 CSI (Container Storage Interface)

# CSI Driver deployment (simplified)
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  attachRequired: true
  podInfoOnMount: false
  volumeLifecycleModes:
    - Persistent
    - Ephemeral

CSI Operations:

  1. CreateVolume - Provision storage
  2. DeleteVolume - Remove storage
  3. ControllerPublishVolume - Attach to node
  4. ControllerUnpublishVolume - Detach from node
  5. NodeStageVolume - Mount to staging path
  6. NodePublishVolume - Bind mount to pod path
  7. NodeUnpublishVolume - Unmount from pod
  8. NodeUnstageVolume - Unmount from staging

10. RBAC (Role-Based Access Control)

10.1 RBAC Components

┌─────────────────────────────────────────────────────────────────────────┐
│                           RBAC Model                                     │
│                                                                          │
│  ┌─────────────┐     ┌───────────────┐     ┌─────────────────────────┐ │
│  │   Subject   │────▶│  RoleBinding  │────▶│   Role / ClusterRole    │ │
│  │             │     │               │     │                         │ │
│  │ • User      │     │ Connects      │     │ Defines permissions:    │ │
│  │ • Group     │     │ subject to    │     │ • API groups            │ │
│  │ • Service   │     │ role          │     │ • Resources             │ │
│  │   Account   │     │               │     │ • Verbs                 │ │
│  └─────────────┘     └───────────────┘     └─────────────────────────┘ │
│                                                                          │
│  Namespace-scoped:                                                       │
│    Role + RoleBinding                                                    │
│                                                                          │
│  Cluster-scoped:                                                         │
│    ClusterRole + ClusterRoleBinding                                      │
│    (or ClusterRole + RoleBinding for namespace-limited access)          │
└─────────────────────────────────────────────────────────────────────────┘

10.2 RBAC Examples

Role (namespace-scoped permissions):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: pod-reader
rules:
  - apiGroups: [""] # Core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]

ClusterRole (cluster-wide or aggregatable):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-reader
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
---
# Aggregated ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-endpoints
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]

RoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: development
subjects:
  - kind: User
    name: jane
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: ci-bot
    namespace: development
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

ClusterRoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admin-binding
subjects:
  - kind: Group
    name: system:masters
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io

10.3 Common Verbs

Verb              HTTP Method      Description
----------------  ---------------  -------------------------
get               GET              Read a single resource
list              GET              Read a collection
watch             GET (streaming)  Watch for changes
create            POST             Create a resource
update            PUT              Replace a resource
patch             PATCH            Partial update
delete            DELETE           Delete a resource
deletecollection  DELETE           Delete multiple resources
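Putting rules and verbs together, the authorizer's core check is an any-rule-matches loop. A sketch (ignoring resourceNames, subresources, and nonResourceURLs):

```python
def allowed(rules, api_group, resource, verb):
    """A request is allowed if any rule matches its group, resource, and verb."""
    def match(wanted, listed):
        return "*" in listed or wanted in listed
    return any(
        match(api_group, rule["apiGroups"])
        and match(resource, rule["resources"])
        and match(verb, rule["verbs"])
        for rule in rules
    )

# The pod-reader Role from section 10.2, as data
pod_reader = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
]
```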

10.4 Debugging RBAC

# Check if user can perform action
kubectl auth can-i create deployments --as=jane
kubectl auth can-i delete pods --as=system:serviceaccount:default:mysa

# List all permissions for user
kubectl auth can-i --list --as=jane

# Impersonate user
kubectl get pods --as=jane --as-group=developers

11. Autoscaling

11.1 Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 100
  metrics:
    # CPU-based
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 500Mi
    # Custom metrics (from Prometheus)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: 1000
    # External metrics (from cloud provider)
    - type: External
      external:
        metric:
          name: sqs_queue_length
          selector:
            matchLabels:
              queue: orders
        target:
          type: Value
          value: 100
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max

HPA Algorithm:

desiredReplicas = ceil(currentReplicas × (currentMetric / desiredMetric))

Example:
  currentReplicas = 3
  currentCPU = 90%
  targetCPU = 70%
  desiredReplicas = ceil(3 × (90/70)) = ceil(3.86) = 4
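As code, including the roughly 10% default tolerance the controller applies before it acts (a sketch of the core formula, not the full algorithm with readiness and missing-metric handling):

```python
import math

def desired_replicas(current, current_metric, target_metric, tolerance=0.1):
    """ceil(currentReplicas * currentMetric / desiredMetric), with dead band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # within tolerance of the target: do nothing
    return math.ceil(current * ratio)
```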

11.2 Vertical Pod Autoscaler (VPA)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: Auto # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4"
          memory: "8Gi"
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits

VPA Modes:

Mode      Behavior
--------  --------------------------------
Off       Recommendations only, no changes
Initial   Apply on pod creation only
Recreate  Evict and recreate pods to apply
Auto      Currently the same as Recreate

11.3 Cluster Autoscaler

# Cluster Autoscaler configuration (typically Helm values)
autoDiscovery:
  clusterName: my-cluster
  tags:
    - k8s.io/cluster-autoscaler/enabled
    - k8s.io/cluster-autoscaler/my-cluster

extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-system-pods: false
  scale-down-enabled: true
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  scale-down-utilization-threshold: 0.5
  max-node-provision-time: 15m
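The scale-down-utilization-threshold knob drives candidate selection: a node whose pods request less than 50% of its allocatable resources may be drained and removed. A sketch of that check (hypothetical node shapes; the real autoscaler also verifies every pod can be rescheduled elsewhere):

```python
def scale_down_candidates(nodes, threshold=0.5):
    """Nodes whose requested/allocatable ratio (cpu or memory, whichever is
    higher) falls below the utilization threshold."""
    return [
        node["name"]
        for node in nodes
        if max(node["requested_cpu"] / node["alloc_cpu"],
               node["requested_mem"] / node["alloc_mem"]) < threshold
    ]

nodes = [
    {"name": "n1", "requested_cpu": 0.5, "alloc_cpu": 4,
     "requested_mem": 1, "alloc_mem": 16},   # lightly used: candidate
    {"name": "n2", "requested_cpu": 3.0, "alloc_cpu": 4,
     "requested_mem": 8, "alloc_mem": 16},   # 75% CPU: stays
]
```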

11.4 KEDA (Kubernetes Event-Driven Autoscaling)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaledobject
spec:
  scaleTargetRef:
    name: app
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0 # Scale to zero!
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-group
        topic: orders
        lagThreshold: "100"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total{app="myapp"}[2m]))
        threshold: "100"

12. Operators and Custom Resources

12.1 CRD (Custom Resource Definition)

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["engine", "size"]
              properties:
                engine:
                  type: string
                  enum: ["postgres", "mysql", "mongodb"]
                version:
                  type: string
                  default: "15"
                size:
                  type: string
                  pattern: "^[0-9]+Gi$"
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 5
                  default: 1
            status:
              type: object
              properties:
                state:
                  type: string
                endpoint:
                  type: string
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Engine
          type: string
          jsonPath: .spec.engine
        - name: Size
          type: string
          jsonPath: .spec.size
        - name: State
          type: string
          jsonPath: .status.state
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
      - db

12.2 Custom Resource Instance

apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
spec:
  engine: postgres
  version: "15"
  size: 100Gi
  replicas: 3
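The openAPIV3Schema in the CRD is what lets the API server reject a bad instance at admission time. The same constraints, sketched in plain Python (not the real validation machinery):

```python
import re

def validate_database_spec(spec):
    """Check a Database spec against the CRD schema's constraints."""
    errors = []
    for field in ("engine", "size"):  # required fields
        if field not in spec:
            errors.append(f"missing required field: {field}")
    if "engine" in spec and spec["engine"] not in ("postgres", "mysql", "mongodb"):
        errors.append(f"engine not in enum: {spec['engine']}")
    if "size" in spec and not re.fullmatch(r"[0-9]+Gi", str(spec["size"])):
        errors.append(f"size does not match ^[0-9]+Gi$: {spec['size']}")
    if not 1 <= spec.get("replicas", 1) <= 5:  # default 1, minimum 1, maximum 5
        errors.append("replicas out of range [1, 5]")
    return errors
```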

12.3 Operator Pattern

// Simplified operator reconciliation
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := r.Log.WithValues("database", req.NamespacedName)

    // 1. Fetch the Database CR
    var db examplev1.Database
    if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Check if StatefulSet exists
    var sts appsv1.StatefulSet
    err := r.Get(ctx, types.NamespacedName{
        Name:      db.Name + "-sts",
        Namespace: db.Namespace,
    }, &sts)

    if errors.IsNotFound(err) {
        // 3. Create StatefulSet
        sts = r.constructStatefulSet(&db)
        if err := r.Create(ctx, &sts); err != nil {
            return ctrl.Result{}, err
        }
        log.Info("Created StatefulSet")
    } else if err != nil {
        // Unexpected read error: return it so the request is retried
        return ctrl.Result{}, err
    }

    // 4. Update status
    db.Status.State = "Running"
    db.Status.Endpoint = fmt.Sprintf("%s.%s.svc:5432", db.Name, db.Namespace)
    if err := r.Status().Update(ctx, &db); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

Popular Operators:

Operator                   Purpose
-------------------------  ----------------------------
cert-manager               TLS certificate management
Prometheus Operator        Monitoring stack
ArgoCD                     GitOps continuous delivery
Crossplane                 Cloud resource provisioning
Strimzi                    Kafka on Kubernetes
Zalando Postgres Operator  PostgreSQL clusters

13. Security Best Practices

13.1 Pod Security Standards

# Enforce restricted profile on namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Security Levels:

Level       Description
----------  --------------------------------------------------
privileged  Unrestricted (only for system workloads)
baseline    Minimally restrictive (prevents known escalations)
restricted  Heavily restricted (security best practices)

13.2 Network Policies

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        # Allow from web frontend
        - podSelector:
            matchLabels:
              app: web
        # Allow from specific namespace
        - namespaceSelector:
            matchLabels:
              name: monitoring
          podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        # Allow to database
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
    - to:
        # Allow DNS
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
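Two semantics are worth internalizing: selecting a pod flips it to default-deny for the listed policyTypes, and a connection is allowed if any peer in any rule matches. A label-and-port-only sketch of single-policy ingress evaluation (namespaces omitted):

```python
def ingress_allowed(policy, src_labels, dst_labels, port):
    """Is this connection allowed by one NetworkPolicy? (labels + port only)"""
    selector = policy["podSelector"]
    if not all(dst_labels.get(k) == v for k, v in selector.items()):
        return True  # policy does not select the destination: no opinion
    for rule in policy["ingress"]:
        peer_ok = any(
            all(src_labels.get(k) == v for k, v in peer.items())
            for peer in rule["from"]
        )
        if peer_ok and port in rule["ports"]:
            return True
    return False  # selected, but no rule matched: default deny

# A stripped-down version of the api-policy above
api_policy = {
    "podSelector": {"app": "api"},
    "ingress": [{"from": [{"app": "web"}], "ports": [8080]}],
}
```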

13.3 Security Checklist

Control Plane:

  • [ ] etcd encrypted at rest
  • [ ] API server audit logging enabled
  • [ ] RBAC enabled (no ABAC)
  • [ ] Anonymous auth disabled
  • [ ] Node authorizer enabled
  • [ ] Admission controllers configured

Workloads:

  • [ ] Run as non-root
  • [ ] Read-only root filesystem
  • [ ] No privilege escalation
  • [ ] Drop all capabilities
  • [ ] Seccomp profile applied
  • [ ] Resource limits set
  • [ ] Network policies defined

Images:

  • [ ] Minimal base images
  • [ ] No latest tags
  • [ ] Vulnerability scanning in CI
  • [ ] Image signing and verification
  • [ ] Private registry with auth

14. Observability Stack

14.1 Metrics Pipeline

┌─────────────────────────────────────────────────────────────────────────┐
│                         Metrics Pipeline                                 │
│                                                                          │
│  Applications                                                            │
│  (expose /metrics)                                                       │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                      Prometheus                                  │   │
│  │  • Scrapes targets (pull model)                                  │   │
│  │  • Stores time-series locally                                    │   │
│  │  • Evaluates alerting rules                                      │   │
│  └──────────────────────────────────┬──────────────────────────────┘   │
│                                     │                                   │
│         ┌───────────────────────────┼───────────────────────────┐      │
│         ▼                           ▼                           ▼      │
│  ┌─────────────┐           ┌─────────────┐           ┌─────────────┐  │
│  │ Thanos/Mimir│           │Alertmanager │           │   Grafana   │  │
│  │             │           │             │           │             │  │
│  │ Long-term   │           │ Routing     │           │ Dashboards  │  │
│  │ storage     │           │ Silencing   │           │ Queries     │  │
│  │ Global view │           │ Notification│           │ Alerts      │  │
│  └─────────────┘           └─────────────┘           └─────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

14.2 Logs Pipeline

┌─────────────────────────────────────────────────────────────────────────┐
│                          Logs Pipeline                                   │
│                                                                          │
│  Containers (stdout/stderr)                                             │
│       │                                                                  │
│       ▼                                                                  │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  Node-level collector (DaemonSet)                                │   │
│  │  • Fluent Bit / Fluentd / Vector                                 │   │
│  │  • Reads from /var/log/containers/                               │   │
│  │  • Enriches with K8s metadata                                    │   │
│  └──────────────────────────────────┬──────────────────────────────┘   │
│                                     │                                   │
│                                     ▼                                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  Log Storage                                                     │   │
│  │  • Loki (lightweight, label-based)                               │   │
│  │  • OpenSearch (full-text search)                                 │   │
│  │  • CloudWatch / Stackdriver                                      │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

14.3 Tracing Pipeline

# OpenTelemetry Collector configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512

    exporters:
      otlp:
        endpoint: tempo:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]

15. Multi-Tenancy Patterns

15.1 Namespace Isolation

# Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "10"
    secrets: "20"
    persistentvolumeclaims: "10"
---
# Limit Ranges (defaults)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "2"
        memory: "4Gi"
    - type: PersistentVolumeClaim
      max:
        storage: 10Gi
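Quota enforcement happens at admission: the apiserver sums what the namespace already uses plus the incoming pod, and rejects the request if any hard limit would be exceeded. A sketch with hypothetical numbers (CPU in cores, memory tracked in Gi for simplicity):

```python
def quota_admits(hard, used, new_pod):
    """Would admitting new_pod keep every tracked quantity within quota?"""
    for key, limit in hard.items():
        if used.get(key, 0) + new_pod.get(key, 0) > limit:
            return False, key  # report which limit would be breached
    return True, None

hard = {"requests.cpu": 10, "requests.memory_gi": 20, "pods": 50}
used = {"requests.cpu": 9.5, "requests.memory_gi": 12, "pods": 30}
ok, _ = quota_admits(hard, used,
                     {"requests.cpu": 0.25, "requests.memory_gi": 1, "pods": 1})
```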

15.2 Hierarchical Namespaces

# Using HNC (Hierarchical Namespace Controller)
apiVersion: hnc.x-k8s.io/v1alpha2
kind: HierarchyConfiguration
metadata:
  name: hierarchy
  namespace: team-a
spec:
  parent: organization
---
# Subnamespace
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-a-dev
  namespace: team-a

16. Disaster Recovery

16.1 Backup Strategies

etcd Backup:

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify backup
etcdctl snapshot status backup.db --write-out=table

# Restore
etcdctl snapshot restore backup.db \
  --data-dir=/var/lib/etcd-restored

Velero (Full Cluster Backup):

# Install Velero
velero install \
  --provider aws \
  --bucket my-backup-bucket \
  --secret-file ./credentials-velero

# Create backup
velero backup create cluster-backup --include-namespaces '*'

# Schedule daily backups at 02:00, retained for 30 days (--ttl 720h)
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 720h

# Restore (add --namespace-mappings old-ns:new-ns to restore elsewhere)
velero restore create --from-backup cluster-backup

16.2 High Availability Checklist

Control Plane:

  • [ ] 3+ API server replicas behind load balancer
  • [ ] 3 or 5 etcd nodes (Raft quorum)
  • [ ] Leader election for scheduler/controller-manager
  • [ ] Spread across availability zones

Worker Nodes:

  • [ ] Multiple nodes per zone
  • [ ] Pod anti-affinity for critical workloads
  • [ ] Pod Disruption Budgets defined
  • [ ] Node auto-repair enabled
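
The anti-affinity and Pod Disruption Budget items above might look like this for a critical workload (a sketch; the app: web label and replica count are illustrative):

```yaml
# Keep at least 2 replicas up during voluntary disruptions (drains, upgrades)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
---
# Anti-affinity fragment for the pod template (spec.template.spec.affinity):
# forces replicas of app=web onto distinct nodes
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web
        topologyKey: kubernetes.io/hostname
```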

Data:

  • [ ] PVs with zone-redundant storage
  • [ ] Application-level replication (databases)
  • [ ] Regular backup testing
  • [ ] Documented recovery procedures
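
"Zone-redundant storage" is provider-specific; on GKE, for example, a regional persistent disk StorageClass might look like this (a sketch assuming the GCE PD CSI driver; other clouds offer equivalent replication options):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd   # synchronously replicated across two zones
volumeBindingMode: WaitForFirstConsumer   # bind after scheduling so zones match
```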

17. Production Architecture Example

┌─────────────────────────────────────────────────────────────────────────────┐
│                        Production Cluster Architecture                       │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐│
│  │                           Control Plane                                 ││
│  │                                                                         ││
│  │   Zone A              Zone B              Zone C                        ││
│  │  ┌──────────┐       ┌──────────┐       ┌──────────┐                   ││
│  │  │API Server│       │API Server│       │API Server│                   ││
│  │  │etcd      │       │etcd      │       │etcd      │                   ││
│  │  └──────────┘       └──────────┘       └──────────┘                   ││
│  │                                                                         ││
│  │                    Load Balancer (internal)                            ││
│  └────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐│
│  │                           Worker Nodes                                  ││
│  │                                                                         ││
│  │   Zone A              Zone B              Zone C                        ││
│  │  ┌──────────┐       ┌──────────┐       ┌──────────┐                   ││
│  │  │ Node 1   │       │ Node 3   │       │ Node 5   │                   ││
│  │  │ Node 2   │       │ Node 4   │       │ Node 6   │                   ││
│  │  └──────────┘       └──────────┘       └──────────┘                   ││
│  │                                                                         ││
│  │  Node pools:                                                           ││
│  │  • General purpose (on-demand)                                         ││
│  │  • Compute optimized (spot/preemptible)                               ││
│  │  • Memory optimized (databases)                                        ││
│  │  • GPU (ML workloads)                                                  ││
│  └────────────────────────────────────────────────────────────────────────┘│
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐│
│  │                         Platform Services                               ││
│  │                                                                         ││
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐              ││
│  │  │ Ingress  │  │ Cert     │  │ External │  │ Secrets  │              ││
│  │  │ (NGINX)  │  │ Manager  │  │ DNS      │  │ (Vault)  │              ││
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘              ││
│  │                                                                         ││
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐              ││
│  │  │Prometheus│  │  Loki    │  │  Tempo   │  │ Grafana  │              ││
│  │  │+ Thanos  │  │          │  │          │  │          │              ││
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘              ││
│  │                                                                         ││
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐              ││
│  │  │  ArgoCD  │  │ Kyverno  │  │ Velero   │  │ Cilium   │              ││
│  │  │ (GitOps) │  │ (Policy) │  │ (Backup) │  │ (CNI)    │              ││
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘              ││
│  └────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘

18. Essential kubectl Commands

# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes

# Debugging pods
kubectl describe pod <pod>
kubectl logs <pod> -c <container> --previous
kubectl exec -it <pod> -- /bin/sh
kubectl debug -it <pod> --image=busybox

# Resource management
kubectl get all -A
kubectl api-resources
kubectl explain pod.spec.containers

# Events and troubleshooting
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning

# RBAC debugging
kubectl auth can-i create pods --as=jane
kubectl auth whoami

# Rollouts
kubectl rollout status deployment/app
kubectl rollout history deployment/app
kubectl rollout undo deployment/app --to-revision=2

# Port forwarding
kubectl port-forward svc/app 8080:80
kubectl port-forward pod/app-abc123 8080:80

# Resource editing
kubectl edit deployment app
kubectl patch deployment app -p '{"spec":{"replicas":3}}'

# Labels and selectors
kubectl get pods -l app=nginx,env=prod
kubectl label pods <pod> version=v2