Kubernetes¶
Container orchestration is the automated management of the lifecycle of hundreds, thousands, or tens of thousands of containers in production environments. It solves the problems that appear when you move from running 1–10 containers on a laptop to running 10,000+ containers across dozens or hundreds of machines.
Kubernetes (K8s) has decisively won the orchestration war: industry surveys (such as the CNCF's annual reports) consistently show it running the large majority of containerized production workloads.
1. Core Architecture¶
1.1 Component Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ kube-apiserver ││
│ │ • REST + gRPC API endpoint ││
│ │ • Authentication, Authorization, Admission ││
│ │ • etcd client (only component that talks to etcd) ││
│ └────────────────────────────────┬────────────────────────────────────────┘│
│ │ │
│ ┌────────────────────────────────▼────────────────────────────────────────┐│
│ │ etcd ││
│ │ • Distributed key-value store (Raft consensus) ││
│ │ • Source of truth for all cluster state ││
│ │ • 3 or 5 nodes for HA ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ kube-scheduler │ │ controller-manager │ │ cloud-controller │ │
│ │ │ │ │ │ │ │
│ │ • Watches unbound │ │ • Node controller │ │ • Node lifecycle │ │
│ │ Pods │ │ • ReplicaSet │ │ • LoadBalancer │ │
│ │ • Scores nodes │ │ • Deployment │ │ • Routes │ │
│ │ • Binds Pod→Node │ │ • StatefulSet │ │ • Cloud disks │ │
│ │ │ │ • Job, CronJob │ │ │ │
│ └─────────────────────┘ └─────────────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌───────┴────────┐
│ Network │
└───────┬────────┘
│
┌─────────────────────────────────────┼─────────────────────────────────────────┐
│ WORKER NODES │
│ │ │
│ ┌──────────────────────────────────▼──────────────────────────────────────┐ │
│ │ kubelet │ │
│ │ • Registers node with API server │ │
│ │ • Watches for Pod assignments │ │
│ │ • Manages container lifecycle via CRI │ │
│ │ • Reports node status, pod status │ │
│ │ • Manages volumes via CSI │ │
│ └──────────────────────────────────┬──────────────────────────────────────┘ │
│ │ CRI (gRPC) │
│ ┌──────────────────────────────────▼──────────────────────────────────────┐ │
│ │ containerd / CRI-O │ │
│ │ • Pulls images │ │
│ │ • Creates containers via OCI runtime │ │
│ │ • Manages container lifecycle │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ kube-proxy │ │
│ │ • Maintains network rules (iptables/IPVS/eBPF) │ │
│ │ • Implements Service abstraction │ │
│ │ • Load balances traffic to Pod endpoints │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────────┘
1.2 Component Details¶
| Component | Function | Stateless? | HA Strategy |
|---|---|---|---|
| kube-apiserver | API gateway, auth, admission | Yes | Multiple replicas behind LB |
| etcd | Persistent state storage | No | Raft consensus (3 or 5 nodes) |
| kube-scheduler | Pod placement decisions | Yes | Leader election |
| kube-controller-manager | Reconciliation loops | Yes | Leader election |
| cloud-controller-manager | Cloud provider integration | Yes | Leader election |
| kubelet | Node agent | N/A | One per node |
| kube-proxy | Network rules | N/A | One per node |
2. etcd: The Cluster Brain¶
2.1 What etcd Stores¶
Everything in Kubernetes is stored in etcd under /registry/:
/registry/
├── configmaps/
│   └── default/
│       └── my-config
├── deployments/
│   └── default/
│       └── nginx-deployment
├── events/
├── namespaces/
│   ├── default
│   ├── kube-system
│   └── kube-public
├── nodes/
│   ├── node-1
│   └── node-2
├── pods/
│   └── default/
│       ├── nginx-abc123
│       └── nginx-def456
├── secrets/
├── services/
└── ...
2.2 Raft Consensus Protocol¶
etcd uses Raft for distributed consensus:
┌─────────────────────────────────────────────────────────────────┐
│ Raft State Machine │
│ │
│ ┌───────────────┐ │
│ │ Leader │ ← Only leader handles writes │
│ │ (Node 1) │ ← Replicates to followers │
│ └───────┬───────┘ │
│ │ │
│ ┌─────┴─────┐ │
│ ▼ ▼ │
│ ┌───────┐ ┌───────┐ │
│ │Follower│ │Follower│ │
│ │(Node 2)│ │(Node 3)│ │
│ └───────┘ └───────┘ │
│ │
│ Write Path: │
│ 1. Client → Leader │
│ 2. Leader appends to local log │
│ 3. Leader replicates to followers │
│ 4. Majority (quorum) acknowledges │
│ 5. Leader commits entry │
│ 6. Leader responds to client │
│ │
│ Quorum = (N/2) + 1 │
│ 3 nodes → need 2 for consensus (survives 1 failure) │
│ 5 nodes → need 3 for consensus (survives 2 failures) │
└─────────────────────────────────────────────────────────────────┘
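The quorum arithmetic above can be checked with a few lines of Python — a minimal sketch, not etcd's actual implementation:

```python
def quorum(n):
    """Votes needed to commit a write in an n-member Raft cluster."""
    return n // 2 + 1

def fault_tolerance(n):
    """Members that can fail while the cluster can still commit writes."""
    return n - quorum(n)

for n in (1, 3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Note that even cluster sizes buy nothing: 4 nodes tolerate only 1 failure, same as 3, which is why 3 or 5 members is the standing recommendation.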
2.3 etcd Performance Characteristics¶
| Metric | Recommended | Critical |
|---|---|---|
| Disk IOPS | >3000 | SSDs required |
| Disk latency | <10ms p99 | >50ms = cluster instability |
| Network latency | <2ms between nodes | >10ms = election timeouts |
| Object size | <1MB | >1.5MB rejected |
| Total DB size | <8GB default | Can increase, but impacts performance |
2.4 etcd Operations¶
# Check cluster health
etcdctl endpoint health --endpoints=https://127.0.0.1:2379
# List all keys
etcdctl get / --prefix --keys-only
# Get specific key
etcdctl get /registry/pods/default/nginx-abc123
# Watch for changes
etcdctl watch /registry/pods --prefix
# Compact history (required for long-running clusters)
etcdctl compaction $(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
# Defragment (reclaim disk space after compaction)
etcdctl defrag --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379
3. API Server Deep Dive¶
3.1 Request Processing Pipeline¶
┌────────────────────────────────────────────────────────────────────────────┐
│ API Server Request Flow │
│ │
│ Client Request │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. AUTHENTICATION │ │
│ │ • Client certificates (x509) │ │
│ │ • Bearer tokens (ServiceAccount, OIDC) │ │
│ │ • Basic auth (deprecated) │ │
│ │ • Webhook token auth │ │
│ │ │ │
│ │ Result: User identity (username, UID, groups) │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 2. AUTHORIZATION │ │
│ │ • RBAC (Role-Based Access Control) ← primary │ │
│ │ • ABAC (Attribute-Based) │ │
│ │ • Webhook │ │
│ │ • Node authorizer (kubelet-specific) │ │
│ │ │ │
│ │ Question: Can user X perform verb Y on resource Z? │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 3. ADMISSION CONTROLLERS │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ Mutating Webhooks │ → │ Validating Webhooks │ │ │
│ │ │ │ │ │ │ │
│ │ │ • Modify objects │ │ • Accept/Reject │ │ │
│ │ │ • Inject sidecars │ │ • Policy enforcement│ │ │
│ │ │ • Set defaults │ │ • Security checks │ │ │
│ │ └─────────────────────┘ └─────────────────────┘ │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 4. VALIDATION │ │
│ │ • Schema validation (OpenAPI) │ │
│ │ • Field immutability checks │ │
│ │ • Resource quota checks │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 5. PERSISTENCE │ │
│ │ • Serialize to protobuf │ │
│ │ • Write to etcd │ │
│ │ • Return response │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
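The five stages can be sketched as a toy pipeline in Python. Everything here (the token check, the RBAC stand-in, the label webhook) is invented for illustration; the real API server stages are pluggable chains, not single functions:

```python
class Denied(Exception):
    pass

def handle_request(req, authenticate, authorize, mutators, validators):
    """Toy model of the API server pipeline: any stage may reject the request."""
    user = authenticate(req)                      # 1. Authentication
    if user is None:
        raise Denied("401 Unauthorized")
    if not authorize(user, req["verb"], req["resource"]):
        raise Denied("403 Forbidden")             # 2. Authorization
    obj = req["object"]
    for mutate in mutators:                       # 3a. Mutating admission may rewrite
        obj = mutate(obj)
    for validate in validators:                   # 3b. Validating admission may reject
        if not validate(obj):
            raise Denied("admission denied")
    return obj                                    # 4/5. Would now be validated + persisted

# Hypothetical stage implementations:
def authenticate(req):
    return "jane" if req.get("token") == "s3cret" else None

def authorize(user, verb, resource):
    return user == "jane" and resource == "pods" and verb in ("get", "create")

def inject_default_label(obj):
    obj["metadata"].setdefault("labels", {}).setdefault("app", "unknown")
    return obj

def require_app_label(obj):
    return "app" in obj["metadata"].get("labels", {})

req = {"token": "s3cret", "verb": "create", "resource": "pods",
       "object": {"metadata": {}}}
obj = handle_request(req, authenticate, authorize,
                     [inject_default_label], [require_app_label])
print(obj["metadata"]["labels"])  # mutating admission defaulted the label
```

Note the ordering matters: mutating webhooks run before validating ones, so a policy can validate the object *after* defaults and sidecars have been injected.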
3.2 API Groups and Versions¶
# Core API (legacy, no group)
/api/v1/namespaces/default/pods/nginx
# Named API groups
/apis/apps/v1/namespaces/default/deployments/nginx
/apis/batch/v1/namespaces/default/jobs/myjob
/apis/networking.k8s.io/v1/namespaces/default/ingresses/myingress
# List all API resources
kubectl api-resources
kubectl api-versions
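A small helper makes the path convention concrete — the legacy core group (empty string) lives under `/api`, named groups under `/apis`:

```python
def api_path(group, version, namespace, resource, name=""):
    """Build the REST path for a namespaced Kubernetes resource."""
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    path = f"{prefix}/namespaces/{namespace}/{resource}"
    return f"{path}/{name}" if name else path

print(api_path("", "v1", "default", "pods", "nginx"))
# /api/v1/namespaces/default/pods/nginx
print(api_path("apps", "v1", "default", "deployments", "nginx"))
# /apis/apps/v1/namespaces/default/deployments/nginx
```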
3.3 Watch Mechanism¶
Kubernetes uses long-polling watches for efficient state synchronization:
// Controller's informer uses watch
watcher, _ := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
    ResourceVersion: "12345", // Start watching from this version
})
for event := range watcher.ResultChan() {
    switch event.Type {
    case watch.Added:
        // New pod created
    case watch.Modified:
        // Pod updated
    case watch.Deleted:
        // Pod removed
    case watch.Bookmark:
        // Progress marker (no actual change)
    case watch.Error:
        // Re-list and restart watch
    }
}
Resource Versions:
- Every object has a resourceVersion (the underlying etcd revision)
- Watches specify a starting resourceVersion
- This allows efficient synchronization without polling
4. Admission Controllers¶
4.1 Built-in Admission Controllers¶
| Controller | Type | Function |
|---|---|---|
| `NamespaceLifecycle` | Validating | Prevents ops in terminating namespaces |
| `LimitRanger` | Mutating | Applies default resource limits |
| `ServiceAccount` | Mutating | Auto-mounts SA tokens |
| `DefaultStorageClass` | Mutating | Assigns default storage class |
| `ResourceQuota` | Validating | Enforces namespace quotas |
| `PodSecurity` | Validating | Enforces Pod Security Standards |
| `NodeRestriction` | Validating | Limits kubelet API access |
4.2 Dynamic Admission Webhooks¶
MutatingWebhookConfiguration:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
  - name: sidecar.example.com
    clientConfig:
      service:
        name: sidecar-injector
        namespace: system
        path: /mutate
      caBundle: <base64-encoded-ca>
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        sidecar-injection: enabled
    failurePolicy: Fail # or Ignore
    sideEffects: None
    admissionReviewVersions: ["v1"]
Webhook Handler Example:
func handleMutate(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    pod := corev1.Pod{}
    if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // Add sidecar container via a JSON Patch
    patch := []map[string]interface{}{
        {
            "op":    "add",
            "path":  "/spec/containers/-",
            "value": sidecarContainer,
        },
    }
    patchBytes, err := json.Marshal(patch)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    patchType := admissionv1.PatchTypeJSONPatch
    review.Response = &admissionv1.AdmissionResponse{
        UID:       review.Request.UID,
        Allowed:   true,
        PatchType: &patchType,
        Patch:     patchBytes,
    }
    json.NewEncoder(w).Encode(review)
}
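The patch the handler returns is a standard JSON Patch (RFC 6902) document. A minimal Python interpreter for the `add` op shows what the API server does with it — a sketch, not the real apply logic; note how `-` as the final path token appends to a list:

```python
def apply_add(doc, path, value):
    """Apply a JSON Patch 'add' operation (RFC 6902) to a dict/list document."""
    *parents, last = path.lstrip("/").split("/")
    target = doc
    for key in parents:           # walk down to the parent container
        target = target[key]
    if isinstance(target, list):
        if last == "-":
            target.append(value)  # '-' means "append to the array"
        else:
            target.insert(int(last), value)
    else:
        target[last] = value
    return doc

pod = {"spec": {"containers": [{"name": "app"}]}}
apply_add(pod, "/spec/containers/-", {"name": "sidecar"})
print([c["name"] for c in pod["spec"]["containers"]])  # ['app', 'sidecar']
```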
4.3 Policy Engines¶
Kyverno:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-for-labels
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Pods must have 'app' and 'owner' labels"
        pattern:
          metadata:
            labels:
              app: "?*"
              owner: "?*"
OPA Gatekeeper:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items: { type: string }
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Missing labels: %v", [missing])
        }
5. The Reconciliation Loop¶
5.1 Controller Pattern¶
Every Kubernetes controller follows this pattern:
func (c *Controller) Run(ctx context.Context) {
    // 1. List all existing objects (initial sync)
    objects, _ := c.lister.List(labels.Everything())
    for _, obj := range objects {
        c.workqueue.Add(obj.GetName())
    }

    // 2. Watch for changes
    go c.informer.Run(ctx.Done())

    // 3. Process work queue
    for c.processNextItem(ctx) {
    }
}

func (c *Controller) processNextItem(ctx context.Context) bool {
    key, shutdown := c.workqueue.Get()
    if shutdown {
        return false
    }
    defer c.workqueue.Done(key)

    // 4. Reconcile
    err := c.reconcile(ctx, key.(string))
    if err != nil {
        // 5. Requeue with exponential backoff
        c.workqueue.AddRateLimited(key)
        return true
    }
    c.workqueue.Forget(key)
    return true
}

func (c *Controller) reconcile(ctx context.Context, name string) error {
    // Get desired state
    desired, err := c.lister.Get(name)
    if errors.IsNotFound(err) {
        return nil // Object deleted, nothing to do
    }

    // Get actual state
    actual, _ := c.getActualState(name)

    // Compare and act
    if !reflect.DeepEqual(desired.Spec, actual) {
        return c.update(ctx, desired)
    }
    return nil
}
5.2 Deployment Controller Flow¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Deployment Controller │
│ │
│ User creates Deployment (replicas: 3) │
│ │ │
│ ▼ │
│ Deployment Controller watches → sees new Deployment │
│ │ │
│ ▼ │
│ Creates ReplicaSet with replicas: 3 │
│ │ │
│ ▼ │
│ ReplicaSet Controller watches → sees new ReplicaSet │
│ │ │
│ ▼ │
│ Creates 3 Pods (without nodeName) │
│ │ │
│ ▼ │
│ Scheduler watches → sees 3 unscheduled Pods │
│ │ │
│ ▼ │
│ Assigns nodeName to each Pod │
│ │ │
│ ▼ │
│ kubelet watches → sees Pods assigned to this node │
│ │ │
│ ▼ │
│ Starts containers via CRI │
│ │ │
│ ▼ │
│ Reports Pod status back to API server │
└─────────────────────────────────────────────────────────────────────────┘
6. The Pod: Kubernetes Atom¶
6.1 Pod is NOT a Container¶
A Pod is:
- A group of 1+ containers sharing:
- Network namespace (same IP, same localhost)
- IPC namespace (shared memory)
- UTS namespace (same hostname)
- Optionally: PID namespace
- A scheduling unit (placed together on one node)
- A lifecycle unit (all containers start/stop together)
6.2 Pod Anatomy¶
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
  namespace: default
  labels:
    app: myapp
    version: v1
  annotations:
    prometheus.io/scrape: "true"
spec:
  # Scheduling constraints
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values: ["us-west-1a", "us-west-1b"]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: myapp
            topologyKey: kubernetes.io/hostname
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "database"
      effect: "NoSchedule"

  # Service account
  serviceAccountName: myapp-sa
  automountServiceAccountToken: false

  # Security context (pod-level)
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault

  # DNS configuration
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"

  # Init containers (run sequentially before main containers)
  initContainers:
    - name: init-db
      image: busybox
      command: ["sh", "-c", "until nc -z db 5432; do sleep 2; done"]

  # Main containers
  containers:
    - name: app
      image: myapp:v1.2.3
      imagePullPolicy: IfNotPresent

      # Commands
      command: ["/app/server"]
      args: ["--config=/etc/config/app.yaml"]

      # Environment
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
      envFrom:
        - configMapRef:
            name: app-config

      # Ports
      ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090

      # Resources
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"

      # Probes
      startupProbe:
        httpGet:
          path: /healthz
          port: http
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
        initialDelaySeconds: 0
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: http
        periodSeconds: 5
        failureThreshold: 1

      # Security context (container-level)
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]

      # Volume mounts
      volumeMounts:
        - name: config
          mountPath: /etc/config
          readOnly: true
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /var/cache

    # Sidecar container
    - name: log-shipper
      image: fluentbit:latest
      resources:
        requests:
          cpu: "10m"
          memory: "32Mi"
        limits:
          cpu: "50m"
          memory: "64Mi"
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true

  # Volumes
  volumes:
    - name: config
      configMap:
        name: app-config
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir:
        sizeLimit: "100Mi"
    - name: logs
      emptyDir: {}

  # Termination
  terminationGracePeriodSeconds: 30

  # Restart policy
  restartPolicy: Always # Always | OnFailure | Never
6.3 Pod Lifecycle¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Pod Lifecycle │
│ │
│ Pending │
│ │ │
│ │ (Scheduled to node) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Container Startup │ │
│ │ │ │
│ │ 1. Pull image (if not cached) │ │
│ │ 2. Create container │ │
│ │ 3. Run init containers (sequentially) │ │
│ │ 4. Start main containers (in parallel) │ │
│ │ 5. Execute postStart hooks │ │
│ │ 6. Wait for startupProbe to pass │ │
│ │ 7. Start livenessProbe and readinessProbe │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Running ←──────────────────────────────────────────────┐ │
│ │ │ │
│ │ (livenessProbe fails) │ │
│ ▼ │ │
│ Container restarts ─────────────────────────────────────┘ │
│ │ │
│ │ (Pod deleted or node fails) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Termination Sequence │ │
│ │ │ │
│ │ 1. Pod marked Terminating │ │
│ │ 2. Remove from Service endpoints │ │
│ │ 3. Execute preStop hooks (parallel with SIGTERM) │ │
│ │ 4. Send SIGTERM to containers │ │
│ │ 5. Wait terminationGracePeriodSeconds │ │
│ │ 6. Send SIGKILL │ │
│ │ 7. Remove Pod object │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Succeeded / Failed │
│ │
└─────────────────────────────────────────────────────────────────────────┘
6.4 Container Types¶
| Type | When Runs | Use Case |
|---|---|---|
| Init Containers | Before main containers, sequentially | DB migrations, wait for dependencies |
| Main Containers | Application lifetime, in parallel | Primary workload |
| Sidecar Containers | Application lifetime, in parallel | Log shipping, proxies, monitoring |
| Ephemeral Containers | Debug-time only (kubectl debug) | Troubleshooting running pods |
7. The Scheduler¶
7.1 Scheduling Phases¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Scheduler Pipeline │
│ │
│ Unscheduled Pod enters queue │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase 1: FILTERING │ │
│ │ │ │
│ │ Eliminate nodes that cannot run the Pod: │ │
│ │ • PodFitsResources - enough CPU/memory? │ │
│ │ • PodFitsHostPorts - port conflicts? │ │
│ │ • NodeSelector - labels match? │ │
│ │ • TaintToleration - tolerates taints? │ │
│ │ • NodeAffinity - affinity rules satisfied? │ │
│ │ • VolumeBinding - PV available in zone? │ │
│ │ • InterPodAffinity - co-location rules? │ │
│ │ │ │
│ │ Input: All nodes │ │
│ │ Output: Feasible nodes │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase 2: SCORING │ │
│ │ │ │
│ │ Rank feasible nodes (0-100 per plugin): │ │
│ │ • NodeResourcesFit - prefer balanced utilization │ │
│ │ • ImageLocality - image already cached? │ │
│ │ • InterPodAffinity - prefer co-located pods │ │
│ │ • TaintToleration - prefer fewer tolerations needed │ │
│ │ • NodeAffinity - prefer affinity matches │ │
│ │ │ │
│ │ Final score = Σ (plugin_score × plugin_weight) │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase 3: BINDING │ │
│ │ │ │
│ │ 1. Select highest-scoring node │ │
│ │ 2. Reserve resources (optimistic) │ │
│ │ 3. Run pre-bind plugins (e.g., volume provisioning) │ │
│ │ 4. Update Pod's spec.nodeName │ │
│ │ 5. Run post-bind plugins │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
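The filter-then-score pipeline reduces to a few lines. The two plugins here are simplified stand-ins for PodFitsResources and ImageLocality, and the node/pod shapes are invented for the sketch:

```python
def schedule(pod, nodes, filters, scorers):
    """Minimal sketch of the two-phase scheduler: filter, then score."""
    # Phase 1: FILTERING — drop nodes that fail any hard predicate
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending
    # Phase 2: SCORING — weighted sum across plugins, pick the best node
    return max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in scorers))

# Hypothetical plugins:
def fits_resources(pod, node):
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]

def image_locality(pod, node):
    return 100 if pod["image"] in node["images"] else 0

nodes = [
    {"name": "n1", "free_cpu": 2.0, "free_mem": 4096, "images": []},
    {"name": "n2", "free_cpu": 1.0, "free_mem": 2048, "images": ["myapp:v1"]},
    {"name": "n3", "free_cpu": 0.1, "free_mem": 512,  "images": ["myapp:v1"]},
]
pod = {"cpu": 0.5, "mem": 1024, "image": "myapp:v1"}
best = schedule(pod, nodes, [fits_resources], [(image_locality, 1.0)])
print(best["name"])  # n2: feasible, and it already has the image cached
```

Phase 3 (binding) is omitted: the real scheduler then writes `spec.nodeName` back through the API server.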
7.2 Scheduling Constraints¶
Node Selector (simple):
spec:
  nodeSelector:
    disktype: ssd
    zone: us-west-1a
Node Affinity (flexible):
spec:
  affinity:
    nodeAffinity:
      # Hard requirement
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64", "arm64"]
      # Soft preference
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: zone
                operator: In
                values: ["us-west-1a"]
Pod Anti-Affinity (spread replicas):
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: redis
          topologyKey: kubernetes.io/hostname
Taints and Tolerations:
# Taint a node (repels pods)
kubectl taint nodes node1 dedicated=database:NoSchedule
# Pod must tolerate to schedule
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "database"
      effect: "NoSchedule"
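Toleration matching can be sketched as follows — simplified: it ignores `tolerationSeconds` and treats only `NoSchedule` as blocking scheduling:

```python
def tolerates(toleration, taint):
    """Does a single toleration match a single taint? (simplified)"""
    # An empty effect in the toleration matches any taint effect
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores value; an empty key matches every taint
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint["value"])

def schedulable(tolerations, taints):
    """A Pod may land on the node only if every NoSchedule taint is tolerated."""
    return all(any(tolerates(t, taint) for t in tolerations)
               for taint in taints if taint["effect"] == "NoSchedule")

taint = {"key": "dedicated", "value": "database", "effect": "NoSchedule"}
ok = {"key": "dedicated", "operator": "Equal",
      "value": "database", "effect": "NoSchedule"}
print(schedulable([ok], [taint]))  # True
print(schedulable([], [taint]))    # False: the taint repels the Pod
```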
7.3 Resource Management¶
Requests vs Limits:
| Aspect | Requests | Limits |
|---|---|---|
| Scheduling | Used for node selection | Not considered |
| CPU | Guaranteed minimum | Throttled above |
| Memory | Guaranteed minimum | OOM killed above |
| QoS | Determines QoS class | Determines QoS class |
QoS Classes:
| Class | Criteria | Eviction Priority |
|---|---|---|
| Guaranteed | requests == limits for all containers | Lowest (last evicted) |
| Burstable | At least one request set | Medium |
| BestEffort | No requests or limits | Highest (first evicted) |
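A rough QoS classifier, simplified from the real rules (which require cpu *and* memory limits, set and equal to requests, in every container for Guaranteed):

```python
def qos_class(containers):
    """Derive a Pod's QoS class from its containers' requests/limits (simplified)."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"   # nothing set anywhere → evicted first
    if all(c.get("requests") and c.get("requests") == c.get("limits")
           for c in containers):
        return "Guaranteed"   # requests == limits everywhere → evicted last
    return "Burstable"        # anything in between

print(qos_class([{"requests": {"cpu": "100m"}, "limits": {"cpu": "100m"}}]))
print(qos_class([{"requests": {"cpu": "100m"}, "limits": {"cpu": "500m"}}]))
print(qos_class([{}]))
```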
8. Services and Networking¶
8.1 Service Types¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Service Types │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ClusterIP │ │
│ │ │ │
│ │ • Internal cluster IP only │ │
│ │ • DNS: my-svc.namespace.svc.cluster.local │ │
│ │ • Default type │ │
│ │ │ │
│ │ spec: │ │
│ │ type: ClusterIP │ │
│ │ clusterIP: 10.96.0.100 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ NodePort │ │
│ │ │ │
│ │ • Exposes on each node's IP at static port (30000-32767) │ │
│ │ • Includes ClusterIP │ │
│ │ │ │
│ │ spec: │ │
│ │ type: NodePort │ │
│ │ ports: │ │
│ │ - port: 80 │ │
│ │ nodePort: 30080 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LoadBalancer │ │
│ │ │ │
│ │ • Provisions cloud load balancer (AWS ELB, GCP LB, etc.) │ │
│ │ • Includes NodePort and ClusterIP │ │
│ │ │ │
│ │ spec: │ │
│ │ type: LoadBalancer │ │
│ │ loadBalancerIP: 1.2.3.4 # optional, if supported │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ExternalName │ │
│ │ │ │
│ │ • DNS CNAME record, no proxying │ │
│ │ • Useful for external services │ │
│ │ │ │
│ │ spec: │ │
│ │ type: ExternalName │ │
│ │ externalName: my.database.example.com │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Headless │ │
│ │ │ │
│ │ • No ClusterIP (clusterIP: None) │ │
│ │ • DNS returns Pod IPs directly │ │
│ │ • Used with StatefulSets for stable network identity │ │
│ │ │ │
│ │ spec: │ │
│ │ clusterIP: None │ │
│ │ selector: │ │
│ │ app: postgres │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
8.2 kube-proxy Modes¶
iptables Mode (default):
# View Service rules
iptables -t nat -L KUBE-SERVICES -n
# Chain KUBE-SERVICES
-A KUBE-SERVICES -d 10.96.0.100/32 -p tcp -m tcp --dport 80 \
-j KUBE-SVC-XXXXX
# Chain KUBE-SVC-XXXXX (round-robin to endpoints)
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.33333 \
-j KUBE-SEP-11111
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.50000 \
-j KUBE-SEP-22222
-A KUBE-SVC-XXXXX -j KUBE-SEP-33333
# Chain KUBE-SEP-11111 (DNAT to Pod)
-A KUBE-SEP-11111 -p tcp -j DNAT --to-destination 172.17.0.2:8080
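The cascading probabilities above (0.33333, then 0.50000, then unconditional) look uneven but produce a uniform choice: with k endpoints still remaining in the chain, the rule fires with probability 1/k. A quick calculation shows every endpoint ends up equally likely:

```python
def chain_probabilities(n):
    """Overall selection probability of each endpoint in an n-rule iptables chain."""
    probs, remaining = [], 1.0
    for k in range(n, 0, -1):
        p = remaining * (1.0 / k)  # this rule matches with probability 1/k
        probs.append(p)
        remaining -= p             # otherwise fall through to the next rule
    return probs

print(chain_probabilities(3))  # each endpoint gets ~1/3 overall
```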
IPVS Mode (better performance):
# Enable in kube-proxy config
mode: ipvs
ipvs:
  scheduler: rr # rr, lc, dh, sh, sed, nq
# View IPVS rules
ipvsadm -Ln
IP Virtual Server version 1.2.1
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  10.96.0.100:80 rr
  -> 172.17.0.2:8080             Masq    1      0          0
  -> 172.17.0.3:8080             Masq    1      0          0
  -> 172.17.0.4:8080             Masq    1      0          0
eBPF Mode (Cilium):
- No iptables/IPVS
- Direct socket-level load balancing
- Better performance, lower latency
- Requires Cilium CNI
8.3 CNI (Container Network Interface)¶
Pod-to-Pod Networking Requirements:
- Every Pod gets its own IP address
- Pods can communicate without NAT
- Nodes can communicate with Pods without NAT
- The IP a Pod sees for itself is the same IP others use to reach it
Popular CNI Plugins:
| Plugin | Network Model | Features |
|---|---|---|
| Cilium | eBPF | L7 policies, observability, service mesh |
| Calico | BGP or VXLAN | Network policies, high performance |
| Flannel | VXLAN/host-gw | Simple, minimal features |
| AWS VPC CNI | Native VPC | Pod IPs from VPC, no overlay |
| Weave | VXLAN | Simple, encrypted |
8.4 Ingress¶
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80
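With overlapping Prefix rules like the two above, the longest matching prefix wins, so /api traffic goes to api-service and everything else falls through to /. A simplified matcher (real Ingress Prefix matching works element-wise on path segments, not raw string prefixes):

```python
def route(path, rules):
    """Pick a backend by longest matching Prefix rule (simplified)."""
    best = None
    for prefix, service in rules:
        # match on exact path or on a path-segment boundary under the prefix
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, service)
    return best[1] if best else None

rules = [("/api", "api-service"), ("/", "web-service")]
print(route("/api/users", rules))   # api-service
print(route("/index.html", rules))  # web-service
```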
Ingress Controllers:
| Controller | Maintained By | Features |
|---|---|---|
| ingress-nginx | Kubernetes | Most popular, battle-tested |
| Traefik | Traefik Labs | Auto-discovery, middleware |
| HAProxy | HAProxy | High performance |
| Contour | VMware | Envoy-based |
| AWS ALB | AWS | Native ALB integration |
| Istio Gateway | Istio | Service mesh integration |
9. Storage¶
9.1 Storage Architecture¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Storage Architecture │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PersistentVolumeClaim (PVC) │ │
│ │ │ │
│ │ • User's storage request │ │
│ │ • Namespace-scoped │ │
│ │ • Specifies size, access modes, storage class │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ binds to │
│ ┌──────────────────────────────────▼──────────────────────────────┐ │
│ │ PersistentVolume (PV) │ │
│ │ │ │
│ │ • Cluster-scoped storage resource │ │
│ │ • Provisioned statically or dynamically │ │
│ │ • Has specific capacity, access modes, reclaim policy │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ backed by │
│ ┌──────────────────────────────────▼──────────────────────────────┐ │
│ │ Storage Backend │ │
│ │ │ │
│ │ • Cloud: AWS EBS, GCP PD, Azure Disk │ │
│ │ • On-prem: Ceph, NFS, iSCSI │ │
│ │ • Local: hostPath, local PV │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ StorageClass (controls dynamic provisioning) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ apiVersion: storage.k8s.io/v1 │ │
│ │ kind: StorageClass │ │
│ │ metadata: │ │
│ │ name: fast-ssd │ │
│ │ provisioner: kubernetes.io/aws-ebs │ │
│ │ parameters: │ │
│ │ type: gp3 │ │
│ │ iopsPerGB: "50" │ │
│ │ reclaimPolicy: Delete │ │
│ │ volumeBindingMode: WaitForFirstConsumer │ │
│ │ allowVolumeExpansion: true │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
9.2 Access Modes¶
| Mode | Abbreviation | Description |
|---|---|---|
| ReadWriteOnce | RWO | Single node read/write |
| ReadOnlyMany | ROX | Multiple nodes read-only |
| ReadWriteMany | RWX | Multiple nodes read/write |
| ReadWriteOncePod | RWOP | Single pod read/write (K8s 1.22+) |
9.3 CSI (Container Storage Interface)¶
# CSI Driver deployment (simplified)
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  attachRequired: true
  podInfoOnMount: false
  volumeLifecycleModes:
    - Persistent
    - Ephemeral
CSI Operations:
- CreateVolume - Provision storage
- DeleteVolume - Remove storage
- ControllerPublishVolume - Attach to node
- ControllerUnpublishVolume - Detach from node
- NodeStageVolume - Mount to staging path
- NodePublishVolume - Bind mount to pod path
- NodeUnpublishVolume - Unmount from pod
- NodeUnstageVolume - Unmount from staging
10. RBAC (Role-Based Access Control)¶
10.1 RBAC Components¶
┌─────────────────────────────────────────────────────────────────────────┐
│ RBAC Model │
│ │
│ ┌─────────────┐ ┌───────────────┐ ┌─────────────────────────┐ │
│ │ Subject │────▶│ RoleBinding │────▶│ Role / ClusterRole │ │
│ │ │ │ │ │ │ │
│ │ • User │ │ Connects │ │ Defines permissions: │ │
│ │ • Group │ │ subject to │ │ • API groups │ │
│ │ • Service │ │ role │ │ • Resources │ │
│ │ Account │ │ │ │ • Verbs │ │
│ └─────────────┘ └───────────────┘ └─────────────────────────┘ │
│ │
│ Namespace-scoped: │
│ Role + RoleBinding │
│ │
│ Cluster-scoped: │
│ ClusterRole + ClusterRoleBinding │
│ (or ClusterRole + RoleBinding for namespace-limited access) │
└─────────────────────────────────────────────────────────────────────────┘
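Once a binding connects a subject to a role, authorization reduces to rule matching. A simplified checker, using rules shaped like the pod-reader Role in the next section (`"*"` is the RBAC wildcard):

```python
def allowed(rules, api_group, resource, verb):
    """Does any rule in a Role grant (apiGroup, resource, verb)? (simplified)"""
    def covers(granted, value):
        return "*" in granted or value in granted
    return any(covers(r["apiGroups"], api_group)
               and covers(r["resources"], resource)
               and covers(r["verbs"], verb)
               for r in rules)

pod_reader = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
]
print(allowed(pod_reader, "", "pods", "list"))    # True
print(allowed(pod_reader, "", "pods", "delete"))  # False: not in any rule
```

RBAC is purely additive: there are no deny rules, so access is granted if *any* rule matches.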
10.2 RBAC Examples¶
Role (namespace-scoped permissions):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: pod-reader
rules:
  - apiGroups: [""] # Core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
ClusterRole (cluster-wide or aggregatable):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-reader
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
---
# Aggregated ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-endpoints
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
RoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: development
subjects:
  - kind: User
    name: jane
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: ci-bot
    namespace: development
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
ClusterRoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admin-binding
subjects:
  - kind: Group
    name: system:masters
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
10.3 Common Verbs¶
| Verb | HTTP Method | Description |
|---|---|---|
| `get` | GET | Read single resource |
| `list` | GET | Read collection |
| `watch` | GET (streaming) | Watch for changes |
| `create` | POST | Create resource |
| `update` | PUT | Replace resource |
| `patch` | PATCH | Partial update |
| `delete` | DELETE | Delete resource |
| `deletecollection` | DELETE | Delete multiple |
10.4 Debugging RBAC¶
# Check if user can perform action
kubectl auth can-i create deployments --as=jane
kubectl auth can-i delete pods --as=system:serviceaccount:default:mysa
# List all permissions for user
kubectl auth can-i --list --as=jane
# Impersonate user
kubectl get pods --as=jane --as-group=developers
11. Autoscaling¶
11.1 Horizontal Pod Autoscaler (HPA)¶
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 100
  metrics:
  # CPU-based
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory-based
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 500Mi
  # Custom metrics (from Prometheus)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
  # External metrics (from cloud provider)
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: 100
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
HPA Algorithm:
desiredReplicas = ceil(currentReplicas × (currentMetric / desiredMetric))
Example:
currentReplicas = 3
currentCPU = 90%
targetCPU = 70%
desiredReplicas = ceil(3 × (90/70)) = ceil(3.86) = 4
11.2 Vertical Pod Autoscaler (VPA)¶
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: Auto  # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
VPA Modes:
| Mode | Behavior |
|---|---|
| Off | Only recommendations, no changes |
| Initial | Apply on pod creation only |
| Recreate | Evict and recreate pods to apply |
| Auto | Currently same as Recreate |
11.3 Cluster Autoscaler¶
# Cluster Autoscaler configuration (typically Helm values)
autoDiscovery:
  clusterName: my-cluster
  tags:
  - k8s.io/cluster-autoscaler/enabled
  - k8s.io/cluster-autoscaler/my-cluster
extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-system-pods: false
  scale-down-enabled: true
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  scale-down-utilization-threshold: 0.5
  max-node-provision-time: 15m
11.4 KEDA (Kubernetes Event-Driven Autoscaling)¶
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaledobject
spec:
  scaleTargetRef:
    name: app
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0  # Scale to zero!
  maxReplicaCount: 100
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: my-group
      topic: orders
      lagThreshold: "100"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: http_requests_total
      query: sum(rate(http_requests_total{app="myapp"}[2m]))
      threshold: "100"
12. Operators and Custom Resources¶
12.1 CRD (Custom Resource Definition)¶
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            required: ["engine", "size"]
            properties:
              engine:
                type: string
                enum: ["postgres", "mysql", "mongodb"]
              version:
                type: string
                default: "15"
              size:
                type: string
                pattern: "^[0-9]+Gi$"
              replicas:
                type: integer
                minimum: 1
                maximum: 5
                default: 1
          status:
            type: object
            properties:
              state:
                type: string
              endpoint:
                type: string
    subresources:
      status: {}
    additionalPrinterColumns:
    - name: Engine
      type: string
      jsonPath: .spec.engine
    - name: Size
      type: string
      jsonPath: .spec.size
    - name: State
      type: string
      jsonPath: .status.state
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
    - db
12.2 Custom Resource Instance¶
apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
spec:
  engine: postgres
  version: "15"
  size: 100Gi
  replicas: 3
12.3 Operator Pattern¶
// Simplified operator reconciliation
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("database", req.NamespacedName)

	// 1. Fetch the Database CR
	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Check if the StatefulSet exists
	var sts appsv1.StatefulSet
	err := r.Get(ctx, types.NamespacedName{
		Name:      db.Name + "-sts",
		Namespace: db.Namespace,
	}, &sts)
	if errors.IsNotFound(err) {
		// 3. Create the StatefulSet
		sts = r.constructStatefulSet(&db)
		if err := r.Create(ctx, &sts); err != nil {
			return ctrl.Result{}, err
		}
		log.Info("Created StatefulSet")
	} else if err != nil {
		// Transient read error: return it so the request is retried with backoff
		return ctrl.Result{}, err
	}

	// 4. Update status
	db.Status.State = "Running"
	db.Status.Endpoint = fmt.Sprintf("%s.%s.svc:5432", db.Name, db.Namespace)
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
12.4 Popular Operators¶
| Operator | Purpose |
|---|---|
| cert-manager | TLS certificate management |
| Prometheus Operator | Monitoring stack |
| ArgoCD | GitOps continuous delivery |
| Crossplane | Cloud resource provisioning |
| Strimzi | Kafka on Kubernetes |
| Zalando Postgres Operator | PostgreSQL clusters |
13. Security Best Practices¶
13.1 Pod Security Standards¶
# Enforce restricted profile on namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Security Levels:
| Level | Description |
|---|---|
| privileged | Unrestricted (only for system workloads) |
| baseline | Minimally restrictive (prevents known escalations) |
| restricted | Heavily restricted (security best practices) |
13.2 Network Policies¶
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    # Allow from web frontend
    - podSelector:
        matchLabels:
          app: web
    # Allow from specific namespace
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    # Allow to database
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  - to:
    # Allow DNS
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
13.3 Security Checklist¶
Control Plane:
- [ ] etcd encrypted at rest
- [ ] API server audit logging enabled
- [ ] RBAC enabled (no ABAC)
- [ ] Anonymous auth disabled
- [ ] Node authorizer enabled
- [ ] Admission controllers configured
Workloads:
- [ ] Run as non-root
- [ ] Read-only root filesystem
- [ ] No privilege escalation
- [ ] Drop all capabilities
- [ ] Seccomp profile applied
- [ ] Resource limits set
- [ ] Network policies defined
Images:
- [ ] Minimal base images
- [ ] No `latest` tags
- [ ] Vulnerability scanning in CI
- [ ] Image signing and verification
- [ ] Private registry with auth
14. Observability Stack¶
14.1 Metrics Pipeline¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Metrics Pipeline │
│ │
│ Applications │
│ (expose /metrics) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Prometheus │ │
│ │ • Scrapes targets (pull model) │ │
│ │ • Stores time-series locally │ │
│ │ • Evaluates alerting rules │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Thanos/Mimir│ │Alertmanager │ │ Grafana │ │
│ │ │ │ │ │ │ │
│ │ Long-term │ │ Routing │ │ Dashboards │ │
│ │ storage │ │ Silencing │ │ Queries │ │
│ │ Global view │ │ Notification│ │ Alerts │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
14.2 Logs Pipeline¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Logs Pipeline │
│ │
│ Containers (stdout/stderr) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Node-level collector (DaemonSet) │ │
│ │ • Fluent Bit / Fluentd / Vector │ │
│ │ • Reads from /var/log/containers/ │ │
│ │ • Enriches with K8s metadata │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Log Storage │ │
│ │ • Loki (lightweight, label-based) │ │
│ │ • OpenSearch (full-text search) │ │
│ │ • CloudWatch / Stackdriver │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
14.3 Tracing Pipeline¶
# OpenTelemetry Collector configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        limit_mib: 512
    exporters:
      otlp:
        endpoint: tempo:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
15. Multi-Tenancy Patterns¶
15.1 Namespace Isolation¶
# Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "10"
    secrets: "20"
    persistentvolumeclaims: "10"
---
# Limit Ranges (defaults)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "4Gi"
  - type: PersistentVolumeClaim
    max:
      storage: 10Gi
15.2 Hierarchical Namespaces¶
# Using HNC (Hierarchical Namespace Controller)
apiVersion: hnc.x-k8s.io/v1alpha2
kind: HierarchyConfiguration
metadata:
  name: hierarchy
  namespace: team-a
spec:
  parent: organization
---
# Subnamespace
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-a-dev
  namespace: team-a
16. Disaster Recovery¶
16.1 Backup Strategies¶
etcd Backup:
# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Verify backup
etcdctl snapshot status backup.db --write-out=table
# Restore
etcdctl snapshot restore backup.db \
--data-dir=/var/lib/etcd-restored
Velero (Full Cluster Backup):
# Install Velero
velero install \
--provider aws \
--bucket my-backup-bucket \
--secret-file ./credentials-velero
# Create backup
velero backup create cluster-backup --include-namespaces '*'
# Schedule periodic backups
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--ttl 720h
# Restore
velero restore create --from-backup cluster-backup
16.2 High Availability Checklist¶
Control Plane:
- [ ] 3+ API server replicas behind load balancer
- [ ] 3 or 5 etcd nodes (Raft quorum)
- [ ] Leader election for scheduler/controller-manager
- [ ] Spread across availability zones
Worker Nodes:
- [ ] Multiple nodes per zone
- [ ] Pod anti-affinity for critical workloads
- [ ] Pod Disruption Budgets defined
- [ ] Node auto-repair enabled
Data:
- [ ] PVs with zone-redundant storage
- [ ] Application-level replication (databases)
- [ ] Regular backup testing
- [ ] Documented recovery procedures
17. Production Architecture Example¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ Production Cluster Architecture │
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐│
│ │ Control Plane ││
│ │ ││
│ │ Zone A Zone B Zone C ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │API Server│ │API Server│ │API Server│ ││
│ │ │etcd │ │etcd │ │etcd │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ Load Balancer (internal) ││
│ └────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐│
│ │ Worker Nodes ││
│ │ ││
│ │ Zone A Zone B Zone C ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │ Node 1 │ │ Node 3 │ │ Node 5 │ ││
│ │ │ Node 2 │ │ Node 4 │ │ Node 6 │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ Node pools: ││
│ │ • General purpose (on-demand) ││
│ │ • Compute optimized (spot/preemptible) ││
│ │ • Memory optimized (databases) ││
│ │ • GPU (ML workloads) ││
│ └────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐│
│ │ Platform Services ││
│ │ ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │ Ingress │ │ Cert │ │ External │ │ Secrets │ ││
│ │ │ (NGINX) │ │ Manager │ │ DNS │ │ (Vault) │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │Prometheus│ │ Loki │ │ Tempo │ │ Grafana │ ││
│ │ │+ Thanos │ │ │ │ │ │ │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │ ArgoCD │ │ Kyverno │ │ Velero │ │ Cilium │ ││
│ │ │ (GitOps) │ │ (Policy) │ │ (Backup) │ │ (CNI) │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ └────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
18. Essential kubectl Commands¶
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes
# Debugging pods
kubectl describe pod <pod>
kubectl logs <pod> -c <container> --previous
kubectl exec -it <pod> -- /bin/sh
kubectl debug -it <pod> --image=busybox
# Resource management
kubectl get all -A
kubectl api-resources
kubectl explain pod.spec.containers
# Events and troubleshooting
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning
# RBAC debugging
kubectl auth can-i create pods --as=jane
kubectl auth whoami
# Rollouts
kubectl rollout status deployment/app
kubectl rollout history deployment/app
kubectl rollout undo deployment/app --to-revision=2
# Port forwarding
kubectl port-forward svc/app 8080:80
kubectl port-forward pod/app-abc123 8080:80
# Resource editing
kubectl edit deployment app
kubectl patch deployment app -p '{"spec":{"replicas":3}}'
# Labels and selectors
kubectl get pods -l app=nginx,env=prod
kubectl label pods <pod> version=v2