Kubernetes¶
Container orchestration is the automated management of the lifecycle of hundreds, thousands, or tens of thousands of containers in production environments. It solves the problems that appear when you move from running 1–10 containers on a laptop to running 10,000+ containers across dozens or hundreds of machines.
Kubernetes (K8s) has decisively won the orchestration war: industry surveys (such as the CNCF's annual reports) consistently show it running the large majority of containerized production workloads.
1. Core Architecture¶
1.1 Component Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ ┌─────────────────────────────────────────────────────────────────────────┐│
│ │ kube-apiserver ││
│ │ • REST + gRPC API endpoint ││
│ │ • Authentication, Authorization, Admission ││
│ │ • etcd client (only component that talks to etcd) ││
│ └────────────────────────────────┬────────────────────────────────────────┘│
│ │ │
│ ┌────────────────────────────────▼────────────────────────────────────────┐│
│ │ etcd ││
│ │ • Distributed key-value store (Raft consensus) ││
│ │ • Source of truth for all cluster state ││
│ │ • 3 or 5 nodes for HA ││
│ └─────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ kube-scheduler │ │ controller-manager │ │ cloud-controller │ │
│ │ │ │ │ │ │ │
│ │ • Watches unbound │ │ • Node controller │ │ • Node lifecycle │ │
│ │ Pods │ │ • ReplicaSet │ │ • LoadBalancer │ │
│ │ • Scores nodes │ │ • Deployment │ │ • Routes │ │
│ │ • Binds Pod→Node │ │ • StatefulSet │ │ • Cloud disks │ │
│ │ │ │ • Job, CronJob │ │ │ │
│ └─────────────────────┘ └─────────────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌───────┴────────┐
│ Network │
└───────┬────────┘
│
┌─────────────────────────────────────┼─────────────────────────────────────────┐
│ WORKER NODES │
│ │ │
│ ┌──────────────────────────────────▼──────────────────────────────────────┐ │
│ │ kubelet │ │
│ │ • Registers node with API server │ │
│ │ • Watches for Pod assignments │ │
│ │ • Manages container lifecycle via CRI │ │
│ │ • Reports node status, pod status │ │
│ │ • Manages volumes via CSI │ │
│ └──────────────────────────────────┬──────────────────────────────────────┘ │
│ │ CRI (gRPC) │
│ ┌──────────────────────────────────▼──────────────────────────────────────┐ │
│ │ containerd / CRI-O │ │
│ │ • Pulls images │ │
│ │ • Creates containers via OCI runtime │ │
│ │ • Manages container lifecycle │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ kube-proxy │ │
│ │ • Maintains network rules (iptables/IPVS/eBPF) │ │
│ │ • Implements Service abstraction │ │
│ │ • Load balances traffic to Pod endpoints │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────────┘
1.2 Component Details¶
| Component | Function | Stateless? | HA Strategy |
|---|---|---|---|
| kube-apiserver | API gateway, auth, admission | Yes | Multiple replicas behind LB |
| etcd | Persistent state storage | No | Raft consensus (3 or 5 nodes) |
| kube-scheduler | Pod placement decisions | Yes | Leader election |
| kube-controller-manager | Reconciliation loops | Yes | Leader election |
| cloud-controller-manager | Cloud provider integration | Yes | Leader election |
| kubelet | Node agent | N/A | One per node |
| kube-proxy | Network rules | N/A | One per node |
2. etcd: The Cluster Brain¶
2.1 What etcd Stores¶
Everything in Kubernetes is stored in etcd under /registry/:
/registry/
├── configmaps/
│   └── default/
│       └── my-config
├── deployments/
│   └── default/
│       └── nginx-deployment
├── events/
├── namespaces/
│   ├── default
│   ├── kube-system
│   └── kube-public
├── nodes/
│   ├── node-1
│   └── node-2
├── pods/
│   └── default/
│       ├── nginx-abc123
│       └── nginx-def456
├── secrets/
├── services/
└── ...
2.2 Raft Consensus Protocol¶
etcd uses Raft for distributed consensus:
┌─────────────────────────────────────────────────────────────────┐
│ Raft State Machine │
│ │
│ ┌───────────────┐ │
│ │ Leader │ ← Only leader handles writes │
│ │ (Node 1) │ ← Replicates to followers │
│ └───────┬───────┘ │
│ │ │
│ ┌─────┴─────┐ │
│ ▼ ▼ │
│ ┌───────┐ ┌───────┐ │
│ │Follower│ │Follower│ │
│ │(Node 2)│ │(Node 3)│ │
│ └───────┘ └───────┘ │
│ │
│ Write Path: │
│ 1. Client → Leader │
│ 2. Leader appends to local log │
│ 3. Leader replicates to followers │
│ 4. Majority (quorum) acknowledges │
│ 5. Leader commits entry │
│ 6. Leader responds to client │
│ │
│ Quorum = (N/2) + 1 │
│ 3 nodes → need 2 for consensus (survives 1 failure) │
│ 5 nodes → need 3 for consensus (survives 2 failures) │
└─────────────────────────────────────────────────────────────────┘
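The quorum arithmetic above can be checked with a few lines of Python — a minimal sketch, not etcd's actual implementation:

```python
def quorum(n):
    """Votes needed to commit a write in an n-member Raft cluster."""
    return n // 2 + 1

def fault_tolerance(n):
    """Members that can fail while the cluster can still commit writes."""
    return n - quorum(n)

for n in (1, 3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Note that even cluster sizes buy nothing: 4 nodes tolerate only 1 failure, same as 3, which is why 3 or 5 members is the standing recommendation.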
2.3 etcd Performance Characteristics¶
| Metric | Recommended | Critical |
|---|---|---|
| Disk IOPS | >3000 | SSDs required |
| Disk latency | <10ms p99 | >50ms = cluster instability |
| Network latency | <2ms between nodes | >10ms = election timeouts |
| Object size | <1MB | >1.5MB rejected |
| Total DB size | <8GB default | Can increase, but impacts performance |
2.4 etcd Operations¶
# Check cluster health
etcdctl endpoint health --endpoints=https://127.0.0.1:2379
# List all keys
etcdctl get / --prefix --keys-only
# Get specific key
etcdctl get /registry/pods/default/nginx-abc123
# Watch for changes
etcdctl watch /registry/pods --prefix
# Compact history (required for long-running clusters)
etcdctl compaction $(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
# Defragment (reclaim disk space after compaction)
etcdctl defrag --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379
3. API Server Deep Dive¶
3.1 Request Processing Pipeline¶
┌────────────────────────────────────────────────────────────────────────────┐
│ API Server Request Flow │
│ │
│ Client Request │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. AUTHENTICATION │ │
│ │ • Client certificates (x509) │ │
│ │ • Bearer tokens (ServiceAccount, OIDC) │ │
│ │ • Basic auth (deprecated) │ │
│ │ • Webhook token auth │ │
│ │ │ │
│ │ Result: User identity (username, UID, groups) │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 2. AUTHORIZATION │ │
│ │ • RBAC (Role-Based Access Control) ← primary │ │
│ │ • ABAC (Attribute-Based) │ │
│ │ • Webhook │ │
│ │ • Node authorizer (kubelet-specific) │ │
│ │ │ │
│ │ Question: Can user X perform verb Y on resource Z? │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 3. ADMISSION CONTROLLERS │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ Mutating Webhooks │ → │ Validating Webhooks │ │ │
│ │ │ │ │ │ │ │
│ │ │ • Modify objects │ │ • Accept/Reject │ │ │
│ │ │ • Inject sidecars │ │ • Policy enforcement│ │ │
│ │ │ • Set defaults │ │ • Security checks │ │ │
│ │ └─────────────────────┘ └─────────────────────┘ │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 4. VALIDATION │ │
│ │ • Schema validation (OpenAPI) │ │
│ │ • Field immutability checks │ │
│ │ • Resource quota checks │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 5. PERSISTENCE │ │
│ │ • Serialize to protobuf │ │
│ │ • Write to etcd │ │
│ │ • Return response │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
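The five stages can be sketched as a toy pipeline in Python. Everything here (the token check, the RBAC stand-in, the label webhook) is invented for illustration; the real API server stages are pluggable chains, not single functions:

```python
class Denied(Exception):
    pass

def handle_request(req, authenticate, authorize, mutators, validators):
    """Toy model of the API server pipeline: any stage may reject the request."""
    user = authenticate(req)                      # 1. Authentication
    if user is None:
        raise Denied("401 Unauthorized")
    if not authorize(user, req["verb"], req["resource"]):
        raise Denied("403 Forbidden")             # 2. Authorization
    obj = req["object"]
    for mutate in mutators:                       # 3a. Mutating admission may rewrite
        obj = mutate(obj)
    for validate in validators:                   # 3b. Validating admission may reject
        if not validate(obj):
            raise Denied("admission denied")
    return obj                                    # 4/5. Would now be validated + persisted

# Hypothetical stage implementations:
def authenticate(req):
    return "jane" if req.get("token") == "s3cret" else None

def authorize(user, verb, resource):
    return user == "jane" and resource == "pods" and verb in ("get", "create")

def inject_default_label(obj):
    obj["metadata"].setdefault("labels", {}).setdefault("app", "unknown")
    return obj

def require_app_label(obj):
    return "app" in obj["metadata"].get("labels", {})

req = {"token": "s3cret", "verb": "create", "resource": "pods",
       "object": {"metadata": {}}}
obj = handle_request(req, authenticate, authorize,
                     [inject_default_label], [require_app_label])
print(obj["metadata"]["labels"])  # mutating admission defaulted the label
```

Note the ordering matters: mutating webhooks run before validating ones, so a policy can validate the object *after* defaults and sidecars have been injected.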
3.2 API Groups and Versions¶
# Core API (legacy, no group)
/api/v1/namespaces/default/pods/nginx
# Named API groups
/apis/apps/v1/namespaces/default/deployments/nginx
/apis/batch/v1/namespaces/default/jobs/myjob
/apis/networking.k8s.io/v1/namespaces/default/ingresses/myingress
# List all API resources
kubectl api-resources
kubectl api-versions
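A small helper makes the path convention concrete — the legacy core group (empty string) lives under `/api`, named groups under `/apis`:

```python
def api_path(group, version, namespace, resource, name=""):
    """Build the REST path for a namespaced Kubernetes resource."""
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    path = f"{prefix}/namespaces/{namespace}/{resource}"
    return f"{path}/{name}" if name else path

print(api_path("", "v1", "default", "pods", "nginx"))
# /api/v1/namespaces/default/pods/nginx
print(api_path("apps", "v1", "default", "deployments", "nginx"))
# /apis/apps/v1/namespaces/default/deployments/nginx
```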
3.3 Watch Mechanism¶
Kubernetes uses long-polling watches for efficient state synchronization:
// Controller's informer uses watch
watcher, _ := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{
    ResourceVersion: "12345", // Start watching from this version
})
for event := range watcher.ResultChan() {
    switch event.Type {
    case watch.Added:
        // New pod created
    case watch.Modified:
        // Pod updated
    case watch.Deleted:
        // Pod removed
    case watch.Bookmark:
        // Progress marker (no actual change)
    case watch.Error:
        // Re-list and restart watch
    }
}
Resource Versions:
- Every object has a resourceVersion (the underlying etcd revision)
- Watches specify a starting resourceVersion
- This allows efficient synchronization without polling
4. Admission Controllers¶
4.1 Built-in Admission Controllers¶
| Controller | Type | Function |
|---|---|---|
| `NamespaceLifecycle` | Validating | Prevents ops in terminating namespaces |
| `LimitRanger` | Mutating | Applies default resource limits |
| `ServiceAccount` | Mutating | Auto-mounts SA tokens |
| `DefaultStorageClass` | Mutating | Assigns default storage class |
| `ResourceQuota` | Validating | Enforces namespace quotas |
| `PodSecurity` | Validating | Enforces Pod Security Standards |
| `NodeRestriction` | Validating | Limits kubelet API access |
4.2 Dynamic Admission Webhooks¶
MutatingWebhookConfiguration:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
  - name: sidecar.example.com
    clientConfig:
      service:
        name: sidecar-injector
        namespace: system
        path: /mutate
      caBundle: <base64-encoded-ca>
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        sidecar-injection: enabled
    failurePolicy: Fail # or Ignore
    sideEffects: None
    admissionReviewVersions: ["v1"]
Webhook Handler Example:
func handleMutate(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    pod := corev1.Pod{}
    if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    // Add sidecar container via a JSON Patch
    patch := []map[string]interface{}{
        {
            "op":    "add",
            "path":  "/spec/containers/-",
            "value": sidecarContainer,
        },
    }
    patchBytes, err := json.Marshal(patch)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    patchType := admissionv1.PatchTypeJSONPatch
    review.Response = &admissionv1.AdmissionResponse{
        UID:       review.Request.UID,
        Allowed:   true,
        PatchType: &patchType,
        Patch:     patchBytes,
    }
    json.NewEncoder(w).Encode(review)
}
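The patch the handler returns is a standard JSON Patch (RFC 6902) document. A minimal Python interpreter for the `add` op shows what the API server does with it — a sketch, not the real apply logic; note how `-` as the final path token appends to a list:

```python
def apply_add(doc, path, value):
    """Apply a JSON Patch 'add' operation (RFC 6902) to a dict/list document."""
    *parents, last = path.lstrip("/").split("/")
    target = doc
    for key in parents:           # walk down to the parent container
        target = target[key]
    if isinstance(target, list):
        if last == "-":
            target.append(value)  # '-' means "append to the array"
        else:
            target.insert(int(last), value)
    else:
        target[last] = value
    return doc

pod = {"spec": {"containers": [{"name": "app"}]}}
apply_add(pod, "/spec/containers/-", {"name": "sidecar"})
print([c["name"] for c in pod["spec"]["containers"]])  # ['app', 'sidecar']
```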
4.3 Policy Engines¶
Kyverno:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-for-labels
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Pods must have 'app' and 'owner' labels"
        pattern:
          metadata:
            labels:
              app: "?*"
              owner: "?*"
OPA Gatekeeper:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items: { type: string }
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Missing labels: %v", [missing])
        }
5. The Reconciliation Loop¶
5.1 Controller Pattern¶
Every Kubernetes controller follows this pattern:
func (c *Controller) Run(ctx context.Context) {
    // 1. List all existing objects (initial sync)
    objects, _ := c.lister.List(labels.Everything())
    for _, obj := range objects {
        c.workqueue.Add(obj.GetName())
    }

    // 2. Watch for changes
    go c.informer.Run(ctx.Done())

    // 3. Process work queue
    for c.processNextItem(ctx) {
    }
}

func (c *Controller) processNextItem(ctx context.Context) bool {
    key, shutdown := c.workqueue.Get()
    if shutdown {
        return false
    }
    defer c.workqueue.Done(key)

    // 4. Reconcile
    err := c.reconcile(ctx, key.(string))
    if err != nil {
        // 5. Requeue with exponential backoff
        c.workqueue.AddRateLimited(key)
        return true
    }
    c.workqueue.Forget(key)
    return true
}

func (c *Controller) reconcile(ctx context.Context, name string) error {
    // Get desired state
    desired, err := c.lister.Get(name)
    if errors.IsNotFound(err) {
        return nil // Object deleted, nothing to do
    }

    // Get actual state
    actual, _ := c.getActualState(name)

    // Compare and act
    if !reflect.DeepEqual(desired.Spec, actual) {
        return c.update(ctx, desired)
    }
    return nil
}
5.2 Deployment Controller Flow¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Deployment Controller │
│ │
│ User creates Deployment (replicas: 3) │
│ │ │
│ ▼ │
│ Deployment Controller watches → sees new Deployment │
│ │ │
│ ▼ │
│ Creates ReplicaSet with replicas: 3 │
│ │ │
│ ▼ │
│ ReplicaSet Controller watches → sees new ReplicaSet │
│ │ │
│ ▼ │
│ Creates 3 Pods (without nodeName) │
│ │ │
│ ▼ │
│ Scheduler watches → sees 3 unscheduled Pods │
│ │ │
│ ▼ │
│ Assigns nodeName to each Pod │
│ │ │
│ ▼ │
│ kubelet watches → sees Pods assigned to this node │
│ │ │
│ ▼ │
│ Starts containers via CRI │
│ │ │
│ ▼ │
│ Reports Pod status back to API server │
└─────────────────────────────────────────────────────────────────────────┘
6. The Pod: Kubernetes Atom¶
6.1 Pod is NOT a Container¶
A Pod is:
- A group of 1+ containers sharing:
- Network namespace (same IP, same localhost)
- IPC namespace (shared memory)
- UTS namespace (same hostname)
- Optionally: PID namespace
- A scheduling unit (placed together on one node)
- A lifecycle unit (all containers start/stop together)
6.2 Pod Anatomy¶
apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
  namespace: default
  labels:
    app: myapp
    version: v1
  annotations:
    prometheus.io/scrape: "true"
spec:
  # Scheduling constraints
  nodeSelector:
    disktype: ssd
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: zone
                operator: In
                values: ["us-west-1a", "us-west-1b"]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: myapp
            topologyKey: kubernetes.io/hostname
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "database"
      effect: "NoSchedule"

  # Service account
  serviceAccountName: myapp-sa
  automountServiceAccountToken: false

  # Security context (pod-level)
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault

  # DNS configuration
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "2"

  # Init containers (run sequentially before main containers)
  initContainers:
    - name: init-db
      image: busybox
      command: ["sh", "-c", "until nc -z db 5432; do sleep 2; done"]

  # Main containers
  containers:
    - name: app
      image: myapp:v1.2.3
      imagePullPolicy: IfNotPresent

      # Commands
      command: ["/app/server"]
      args: ["--config=/etc/config/app.yaml"]

      # Environment
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
      envFrom:
        - configMapRef:
            name: app-config

      # Ports
      ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        - name: metrics
          containerPort: 9090

      # Resources
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"

      # Probes
      startupProbe:
        httpGet:
          path: /healthz
          port: http
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
        initialDelaySeconds: 0
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: http
        periodSeconds: 5
        failureThreshold: 1

      # Security context (container-level)
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]

      # Volume mounts
      volumeMounts:
        - name: config
          mountPath: /etc/config
          readOnly: true
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /var/cache

    # Sidecar container
    - name: log-shipper
      image: fluentbit:latest
      resources:
        requests:
          cpu: "10m"
          memory: "32Mi"
        limits:
          cpu: "50m"
          memory: "64Mi"
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true

  # Volumes
  volumes:
    - name: config
      configMap:
        name: app-config
    - name: tmp
      emptyDir: {}
    - name: cache
      emptyDir:
        sizeLimit: "100Mi"
    - name: logs
      emptyDir: {}

  # Termination
  terminationGracePeriodSeconds: 30

  # Restart policy
  restartPolicy: Always # Always | OnFailure | Never
6.3 Pod Lifecycle¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Pod Lifecycle │
│ │
│ Pending │
│ │ │
│ │ (Scheduled to node) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Container Startup │ │
│ │ │ │
│ │ 1. Pull image (if not cached) │ │
│ │ 2. Create container │ │
│ │ 3. Run init containers (sequentially) │ │
│ │ 4. Start main containers (in parallel) │ │
│ │ 5. Execute postStart hooks │ │
│ │ 6. Wait for startupProbe to pass │ │
│ │ 7. Start livenessProbe and readinessProbe │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Running ←──────────────────────────────────────────────┐ │
│ │ │ │
│ │ (livenessProbe fails) │ │
│ ▼ │ │
│ Container restarts ─────────────────────────────────────┘ │
│ │ │
│ │ (Pod deleted or node fails) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Termination Sequence │ │
│ │ │ │
│ │ 1. Pod marked Terminating │ │
│ │ 2. Remove from Service endpoints │ │
│ │ 3. Execute preStop hooks (parallel with SIGTERM) │ │
│ │ 4. Send SIGTERM to containers │ │
│ │ 5. Wait terminationGracePeriodSeconds │ │
│ │ 6. Send SIGKILL │ │
│ │ 7. Remove Pod object │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Succeeded / Failed │
│ │
└─────────────────────────────────────────────────────────────────────────┘
6.4 Container Types¶
| Type | When Runs | Use Case |
|---|---|---|
| Init Containers | Before main containers, sequentially | DB migrations, wait for dependencies |
| Main Containers | Application lifetime, in parallel | Primary workload |
| Sidecar Containers | Application lifetime, in parallel | Log shipping, proxies, monitoring |
| Ephemeral Containers | Debug-time only (kubectl debug) | Troubleshooting running pods |
7. The Scheduler¶
7.1 Scheduling Phases¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Scheduler Pipeline │
│ │
│ Unscheduled Pod enters queue │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase 1: FILTERING │ │
│ │ │ │
│ │ Eliminate nodes that cannot run the Pod: │ │
│ │ • PodFitsResources - enough CPU/memory? │ │
│ │ • PodFitsHostPorts - port conflicts? │ │
│ │ • NodeSelector - labels match? │ │
│ │ • TaintToleration - tolerates taints? │ │
│ │ • NodeAffinity - affinity rules satisfied? │ │
│ │ • VolumeBinding - PV available in zone? │ │
│ │ • InterPodAffinity - co-location rules? │ │
│ │ │ │
│ │ Input: All nodes │ │
│ │ Output: Feasible nodes │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase 2: SCORING │ │
│ │ │ │
│ │ Rank feasible nodes (0-100 per plugin): │ │
│ │ • NodeResourcesFit - prefer balanced utilization │ │
│ │ • ImageLocality - image already cached? │ │
│ │ • InterPodAffinity - prefer co-located pods │ │
│ │ • TaintToleration - prefer fewer tolerations needed │ │
│ │ • NodeAffinity - prefer affinity matches │ │
│ │ │ │
│ │ Final score = Σ (plugin_score × plugin_weight) │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase 3: BINDING │ │
│ │ │ │
│ │ 1. Select highest-scoring node │ │
│ │ 2. Reserve resources (optimistic) │ │
│ │ 3. Run pre-bind plugins (e.g., volume provisioning) │ │
│ │ 4. Update Pod's spec.nodeName │ │
│ │ 5. Run post-bind plugins │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
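The filter-then-score pipeline reduces to a few lines. The two plugins here are simplified stand-ins for PodFitsResources and ImageLocality, and the node/pod shapes are invented for the sketch:

```python
def schedule(pod, nodes, filters, scorers):
    """Minimal sketch of the two-phase scheduler: filter, then score."""
    # Phase 1: FILTERING — drop nodes that fail any hard predicate
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: the Pod stays Pending
    # Phase 2: SCORING — weighted sum across plugins, pick the best node
    return max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in scorers))

# Hypothetical plugins:
def fits_resources(pod, node):
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]

def image_locality(pod, node):
    return 100 if pod["image"] in node["images"] else 0

nodes = [
    {"name": "n1", "free_cpu": 2.0, "free_mem": 4096, "images": []},
    {"name": "n2", "free_cpu": 1.0, "free_mem": 2048, "images": ["myapp:v1"]},
    {"name": "n3", "free_cpu": 0.1, "free_mem": 512,  "images": ["myapp:v1"]},
]
pod = {"cpu": 0.5, "mem": 1024, "image": "myapp:v1"}
best = schedule(pod, nodes, [fits_resources], [(image_locality, 1.0)])
print(best["name"])  # n2: feasible, and it already has the image cached
```

Phase 3 (binding) is omitted: the real scheduler then writes `spec.nodeName` back through the API server.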
7.2 Scheduling Constraints¶
Node Selector (simple):
spec:
  nodeSelector:
    disktype: ssd
    zone: us-west-1a
Node Affinity (flexible):
spec:
  affinity:
    nodeAffinity:
      # Hard requirement
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64", "arm64"]
      # Soft preference
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: zone
                operator: In
                values: ["us-west-1a"]
Pod Anti-Affinity (spread replicas):
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: redis
          topologyKey: kubernetes.io/hostname
Taints and Tolerations:
# Taint a node (repels pods)
kubectl taint nodes node1 dedicated=database:NoSchedule
# Pod must tolerate to schedule
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "database"
      effect: "NoSchedule"
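Toleration matching can be sketched as follows — simplified: it ignores `tolerationSeconds` and treats only `NoSchedule` as blocking scheduling:

```python
def tolerates(toleration, taint):
    """Does a single toleration match a single taint? (simplified)"""
    # An empty effect in the toleration matches any taint effect
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores value; an empty key matches every taint
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint["value"])

def schedulable(tolerations, taints):
    """A Pod may land on the node only if every NoSchedule taint is tolerated."""
    return all(any(tolerates(t, taint) for t in tolerations)
               for taint in taints if taint["effect"] == "NoSchedule")

taint = {"key": "dedicated", "value": "database", "effect": "NoSchedule"}
ok = {"key": "dedicated", "operator": "Equal",
      "value": "database", "effect": "NoSchedule"}
print(schedulable([ok], [taint]))  # True
print(schedulable([], [taint]))    # False: the taint repels the Pod
```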
7.3 Resource Management¶
Requests vs Limits:
| Aspect | Requests | Limits |
|---|---|---|
| Scheduling | Used for node selection | Not considered |
| CPU | Guaranteed minimum | Throttled above |
| Memory | Guaranteed minimum | OOM killed above |
| QoS | Determines QoS class | Determines QoS class |
QoS Classes:
| Class | Criteria | Eviction Priority |
|---|---|---|
| Guaranteed | requests == limits for all containers | Lowest (last evicted) |
| Burstable | At least one request set | Medium |
| BestEffort | No requests or limits | Highest (first evicted) |
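A rough QoS classifier, simplified from the real rules (which require cpu *and* memory limits, set and equal to requests, in every container for Guaranteed):

```python
def qos_class(containers):
    """Derive a Pod's QoS class from its containers' requests/limits (simplified)."""
    if not any(c.get("requests") or c.get("limits") for c in containers):
        return "BestEffort"   # nothing set anywhere → evicted first
    if all(c.get("requests") and c.get("requests") == c.get("limits")
           for c in containers):
        return "Guaranteed"   # requests == limits everywhere → evicted last
    return "Burstable"        # anything in between

print(qos_class([{"requests": {"cpu": "100m"}, "limits": {"cpu": "100m"}}]))
print(qos_class([{"requests": {"cpu": "100m"}, "limits": {"cpu": "500m"}}]))
print(qos_class([{}]))
```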
8. Services and Networking¶
8.1 Service Types¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Service Types │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ClusterIP │ │
│ │ │ │
│ │ • Internal cluster IP only │ │
│ │ • DNS: my-svc.namespace.svc.cluster.local │ │
│ │ • Default type │ │
│ │ │ │
│ │ spec: │ │
│ │ type: ClusterIP │ │
│ │ clusterIP: 10.96.0.100 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ NodePort │ │
│ │ │ │
│ │ • Exposes on each node's IP at static port (30000-32767) │ │
│ │ • Includes ClusterIP │ │
│ │ │ │
│ │ spec: │ │
│ │ type: NodePort │ │
│ │ ports: │ │
│ │ - port: 80 │ │
│ │ nodePort: 30080 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ LoadBalancer │ │
│ │ │ │
│ │ • Provisions cloud load balancer (AWS ELB, GCP LB, etc.) │ │
│ │ • Includes NodePort and ClusterIP │ │
│ │ │ │
│ │ spec: │ │
│ │ type: LoadBalancer │ │
│ │ loadBalancerIP: 1.2.3.4 # optional, if supported │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ExternalName │ │
│ │ │ │
│ │ • DNS CNAME record, no proxying │ │
│ │ • Useful for external services │ │
│ │ │ │
│ │ spec: │ │
│ │ type: ExternalName │ │
│ │ externalName: my.database.example.com │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Headless │ │
│ │ │ │
│ │ • No ClusterIP (clusterIP: None) │ │
│ │ • DNS returns Pod IPs directly │ │
│ │ • Used with StatefulSets for stable network identity │ │
│ │ │ │
│ │ spec: │ │
│ │ clusterIP: None │ │
│ │ selector: │ │
│ │ app: postgres │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
8.2 kube-proxy Modes¶
iptables Mode (default):
# View Service rules
iptables -t nat -L KUBE-SERVICES -n
# Chain KUBE-SERVICES
-A KUBE-SERVICES -d 10.96.0.100/32 -p tcp -m tcp --dport 80 \
-j KUBE-SVC-XXXXX
# Chain KUBE-SVC-XXXXX (round-robin to endpoints)
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.33333 \
-j KUBE-SEP-11111
-A KUBE-SVC-XXXXX -m statistic --mode random --probability 0.50000 \
-j KUBE-SEP-22222
-A KUBE-SVC-XXXXX -j KUBE-SEP-33333
# Chain KUBE-SEP-11111 (DNAT to Pod)
-A KUBE-SEP-11111 -p tcp -j DNAT --to-destination 172.17.0.2:8080
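The cascading probabilities above (0.33333, then 0.50000, then unconditional) look uneven but produce a uniform choice: with k endpoints still remaining in the chain, the rule fires with probability 1/k. A quick calculation shows every endpoint ends up equally likely:

```python
def chain_probabilities(n):
    """Overall selection probability of each endpoint in an n-rule iptables chain."""
    probs, remaining = [], 1.0
    for k in range(n, 0, -1):
        p = remaining * (1.0 / k)  # this rule matches with probability 1/k
        probs.append(p)
        remaining -= p             # otherwise fall through to the next rule
    return probs

print(chain_probabilities(3))  # each endpoint gets ~1/3 overall
```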
IPVS Mode (better performance):
# Enable in kube-proxy config
mode: ipvs
ipvs:
  scheduler: rr # rr, lc, dh, sh, sed, nq
# View IPVS rules
ipvsadm -Ln
IP Virtual Server version 1.2.1
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  10.96.0.100:80 rr
  -> 172.17.0.2:8080             Masq    1      0          0
  -> 172.17.0.3:8080             Masq    1      0          0
  -> 172.17.0.4:8080             Masq    1      0          0
eBPF Mode (Cilium):
- No iptables/IPVS
- Direct socket-level load balancing
- Better performance, lower latency
- Requires Cilium CNI
8.3 CNI (Container Network Interface)¶
Pod-to-Pod Networking Requirements:
- Every Pod gets its own IP address
- Pods can communicate without NAT
- Nodes can communicate with Pods without NAT
- The IP a Pod sees for itself is the same IP others use to reach it
Popular CNI Plugins:
| Plugin | Network Model | Features |
|---|---|---|
| Cilium | eBPF | L7 policies, observability, service mesh |
| Calico | BGP or VXLAN | Network policies, high performance |
| Flannel | VXLAN/host-gw | Simple, minimal features |
| AWS VPC CNI | Native VPC | Pod IPs from VPC, no overlay |
| Weave | VXLAN | Simple, encrypted |
8.4 Ingress¶
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80
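With overlapping Prefix rules like the two above, the longest matching prefix wins, so /api traffic goes to api-service and everything else falls through to /. A simplified matcher (real Ingress Prefix matching works element-wise on path segments, not raw string prefixes):

```python
def route(path, rules):
    """Pick a backend by longest matching Prefix rule (simplified)."""
    best = None
    for prefix, service in rules:
        # match on exact path or on a path-segment boundary under the prefix
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, service)
    return best[1] if best else None

rules = [("/api", "api-service"), ("/", "web-service")]
print(route("/api/users", rules))   # api-service
print(route("/index.html", rules))  # web-service
```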
Ingress Controllers:
| Controller | Maintained By | Features |
|---|---|---|
| ingress-nginx | Kubernetes | Most popular, battle-tested |
| Traefik | Traefik Labs | Auto-discovery, middleware |
| HAProxy | HAProxy | High performance |
| Contour | VMware | Envoy-based |
| AWS ALB | AWS | Native ALB integration |
| Istio Gateway | Istio | Service mesh integration |
9. Storage¶
9.1 Storage Architecture¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Storage Architecture │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PersistentVolumeClaim (PVC) │ │
│ │ │ │
│ │ • User's storage request │ │
│ │ • Namespace-scoped │ │
│ │ • Specifies size, access modes, storage class │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ binds to │
│ ┌──────────────────────────────────▼──────────────────────────────┐ │
│ │ PersistentVolume (PV) │ │
│ │ │ │
│ │ • Cluster-scoped storage resource │ │
│ │ • Provisioned statically or dynamically │ │
│ │ • Has specific capacity, access modes, reclaim policy │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ backed by │
│ ┌──────────────────────────────────▼──────────────────────────────┐ │
│ │ Storage Backend │ │
│ │ │ │
│ │ • Cloud: AWS EBS, GCP PD, Azure Disk │ │
│ │ • On-prem: Ceph, NFS, iSCSI │ │
│ │ • Local: hostPath, local PV │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ StorageClass (controls dynamic provisioning) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ apiVersion: storage.k8s.io/v1 │ │
│ │ kind: StorageClass │ │
│ │ metadata: │ │
│ │ name: fast-ssd │ │
│ │ provisioner: kubernetes.io/aws-ebs │ │
│ │ parameters: │ │
│ │ type: gp3 │ │
│ │ iopsPerGB: "50" │ │
│ │ reclaimPolicy: Delete │ │
│ │ volumeBindingMode: WaitForFirstConsumer │ │
│ │ allowVolumeExpansion: true │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
9.2 Access Modes¶
| Mode | Abbreviation | Description |
|---|---|---|
| ReadWriteOnce | RWO | Single node read/write |
| ReadOnlyMany | ROX | Multiple nodes read-only |
| ReadWriteMany | RWX | Multiple nodes read/write |
| ReadWriteOncePod | RWOP | Single pod read/write (K8s 1.22+) |
9.3 CSI (Container Storage Interface)¶
# CSI Driver deployment (simplified)
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: ebs.csi.aws.com
spec:
  attachRequired: true
  podInfoOnMount: false
  volumeLifecycleModes:
    - Persistent
    - Ephemeral
CSI Operations:
- CreateVolume - Provision storage
- DeleteVolume - Remove storage
- ControllerPublishVolume - Attach to node
- ControllerUnpublishVolume - Detach from node
- NodeStageVolume - Mount to staging path
- NodePublishVolume - Bind mount to pod path
- NodeUnpublishVolume - Unmount from pod
- NodeUnstageVolume - Unmount from staging
10. RBAC (Role-Based Access Control)¶
10.1 RBAC Components¶
┌─────────────────────────────────────────────────────────────────────────┐
│ RBAC Model │
│ │
│ ┌─────────────┐ ┌───────────────┐ ┌─────────────────────────┐ │
│ │ Subject │────▶│ RoleBinding │────▶│ Role / ClusterRole │ │
│ │ │ │ │ │ │ │
│ │ • User │ │ Connects │ │ Defines permissions: │ │
│ │ • Group │ │ subject to │ │ • API groups │ │
│ │ • Service │ │ role │ │ • Resources │ │
│ │ Account │ │ │ │ • Verbs │ │
│ └─────────────┘ └───────────────┘ └─────────────────────────┘ │
│ │
│ Namespace-scoped: │
│ Role + RoleBinding │
│ │
│ Cluster-scoped: │
│ ClusterRole + ClusterRoleBinding │
│ (or ClusterRole + RoleBinding for namespace-limited access) │
└─────────────────────────────────────────────────────────────────────────┘
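Once a binding connects a subject to a role, authorization reduces to rule matching. A simplified checker, using rules shaped like the pod-reader Role in the next section (`"*"` is the RBAC wildcard):

```python
def allowed(rules, api_group, resource, verb):
    """Does any rule in a Role grant (apiGroup, resource, verb)? (simplified)"""
    def covers(granted, value):
        return "*" in granted or value in granted
    return any(covers(r["apiGroups"], api_group)
               and covers(r["resources"], resource)
               and covers(r["verbs"], verb)
               for r in rules)

pod_reader = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
]
print(allowed(pod_reader, "", "pods", "list"))    # True
print(allowed(pod_reader, "", "pods", "delete"))  # False: not in any rule
```

RBAC is purely additive: there are no deny rules, so access is granted if *any* rule matches.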
10.2 RBAC Examples¶
Role (namespace-scoped permissions):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: pod-reader
rules:
  - apiGroups: [""] # Core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]
ClusterRole (cluster-wide or aggregatable):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-reader
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
---
# Aggregated ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-endpoints
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: "true"
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
RoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: development
subjects:
  - kind: User
    name: jane
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: ci-bot
    namespace: development
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
ClusterRoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-admin-binding
subjects:
  - kind: Group
    name: system:masters
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
10.3 Common Verbs¶
| Verb | HTTP Method | Description |
|---|---|---|
| `get` | GET | Read single resource |
| `list` | GET | Read collection |
| `watch` | GET (streaming) | Watch for changes |
| `create` | POST | Create resource |
| `update` | PUT | Replace resource |
| `patch` | PATCH | Partial update |
| `delete` | DELETE | Delete resource |
| `deletecollection` | DELETE | Delete multiple |
10.4 Debugging RBAC¶
# Check if user can perform action
kubectl auth can-i create deployments --as=jane
kubectl auth can-i delete pods --as=system:serviceaccount:default:mysa
# List all permissions for user
kubectl auth can-i --list --as=jane
# Impersonate user
kubectl get pods --as=jane --as-group=developers
11. Autoscaling¶
11.1 Horizontal Pod Autoscaler (HPA)¶
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 100
  metrics:
  # CPU-based
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory-based
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 500Mi
  # Custom metrics (from Prometheus)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
  # External metrics (from cloud provider)
  - type: External
    external:
      metric:
        name: sqs_queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: 100
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
HPA Algorithm:
desiredReplicas = ceil(currentReplicas × (currentMetric / desiredMetric))
Example:
currentReplicas = 3
currentCPU = 90%
targetCPU = 70%
desiredReplicas = ceil(3 × (90/70)) = ceil(3.86) = 4
11.2 Vertical Pod Autoscaler (VPA)¶
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: Auto  # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
VPA Modes:
| Mode | Behavior |
|---|---|
| Off | Only recommendations, no changes |
| Initial | Apply on pod creation only |
| Recreate | Evict and recreate pods to apply |
| Auto | Currently same as Recreate |
11.3 Cluster Autoscaler¶
# Cluster Autoscaler configuration (typically Helm values)
autoDiscovery:
  clusterName: my-cluster
  tags:
  - k8s.io/cluster-autoscaler/enabled
  - k8s.io/cluster-autoscaler/my-cluster
extraArgs:
  balance-similar-node-groups: true
  skip-nodes-with-system-pods: false
  scale-down-enabled: true
  scale-down-delay-after-add: 10m
  scale-down-unneeded-time: 10m
  scale-down-utilization-threshold: 0.5
  max-node-provision-time: 15m
11.4 KEDA (Kubernetes Event-Driven Autoscaling)¶
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaledobject
spec:
  scaleTargetRef:
    name: app
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0  # Scale to zero!
  maxReplicaCount: 100
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: my-group
      topic: orders
      lagThreshold: "100"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: http_requests_total
      query: sum(rate(http_requests_total{app="myapp"}[2m]))
      threshold: "100"
12. Operators and Custom Resources¶
12.1 CRD (Custom Resource Definition)¶
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            required: ["engine", "size"]
            properties:
              engine:
                type: string
                enum: ["postgres", "mysql", "mongodb"]
              version:
                type: string
                default: "15"
              size:
                type: string
                pattern: "^[0-9]+Gi$"
              replicas:
                type: integer
                minimum: 1
                maximum: 5
                default: 1
          status:
            type: object
            properties:
              state:
                type: string
              endpoint:
                type: string
    subresources:
      status: {}
    additionalPrinterColumns:
    - name: Engine
      type: string
      jsonPath: .spec.engine
    - name: Size
      type: string
      jsonPath: .spec.size
    - name: State
      type: string
      jsonPath: .status.state
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
    - db
12.2 Custom Resource Instance¶
apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
spec:
  engine: postgres
  version: "15"
  size: 100Gi
  replicas: 3
12.3 Operator Pattern¶
// Simplified operator reconciliation
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := r.Log.WithValues("database", req.NamespacedName)

	// 1. Fetch the Database CR
	var db examplev1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Check if the StatefulSet exists
	var sts appsv1.StatefulSet
	err := r.Get(ctx, types.NamespacedName{
		Name:      db.Name + "-sts",
		Namespace: db.Namespace,
	}, &sts)
	if errors.IsNotFound(err) {
		// 3. Create the StatefulSet
		sts = r.constructStatefulSet(&db)
		if err := r.Create(ctx, &sts); err != nil {
			return ctrl.Result{}, err
		}
		log.Info("Created StatefulSet")
	} else if err != nil {
		// Transient read error: return it so the request is retried with backoff
		return ctrl.Result{}, err
	}

	// 4. Update status
	db.Status.State = "Running"
	db.Status.Endpoint = fmt.Sprintf("%s.%s.svc:5432", db.Name, db.Namespace)
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
12.4 Popular Operators¶
| Operator | Purpose |
|---|---|
| cert-manager | TLS certificate management |
| Prometheus Operator | Monitoring stack |
| ArgoCD | GitOps continuous delivery |
| Crossplane | Cloud resource provisioning |
| Strimzi | Kafka on Kubernetes |
| Zalando Postgres Operator | PostgreSQL clusters |
13. Security Best Practices¶
13.1 Pod Security Standards¶
# Enforce restricted profile on namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Security Levels:
| Level | Description |
|---|---|
| privileged | Unrestricted (only for system workloads) |
| baseline | Minimally restrictive (prevents known escalations) |
| restricted | Heavily restricted (security best practices) |
13.2 Network Policies¶
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    # Allow from web frontend
    - podSelector:
        matchLabels:
          app: web
    # Allow from specific namespace
    - namespaceSelector:
        matchLabels:
          name: monitoring
      podSelector:
        matchLabels:
          app: prometheus
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    # Allow to database
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
  - to:
    # Allow DNS
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
13.3 Security Checklist¶
Control Plane:
- [ ] etcd encrypted at rest
- [ ] API server audit logging enabled
- [ ] RBAC enabled (no ABAC)
- [ ] Anonymous auth disabled
- [ ] Node authorizer enabled
- [ ] Admission controllers configured
Workloads:
- [ ] Run as non-root
- [ ] Read-only root filesystem
- [ ] No privilege escalation
- [ ] Drop all capabilities
- [ ] Seccomp profile applied
- [ ] Resource limits set
- [ ] Network policies defined
Images:
- [ ] Minimal base images
- [ ] No `latest` tags
- [ ] Vulnerability scanning in CI
- [ ] Image signing and verification
- [ ] Private registry with auth
14. Observability Stack¶
14.1 Metrics Pipeline¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Metrics Pipeline │
│ │
│ Applications │
│ (expose /metrics) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Prometheus │ │
│ │ • Scrapes targets (pull model) │ │
│ │ • Stores time-series locally │ │
│ │ • Evaluates alerting rules │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┼───────────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Thanos/Mimir│ │Alertmanager │ │ Grafana │ │
│ │ │ │ │ │ │ │
│ │ Long-term │ │ Routing │ │ Dashboards │ │
│ │ storage │ │ Silencing │ │ Queries │ │
│ │ Global view │ │ Notification│ │ Alerts │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
14.2 Logs Pipeline¶
┌─────────────────────────────────────────────────────────────────────────┐
│ Logs Pipeline │
│ │
│ Containers (stdout/stderr) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Node-level collector (DaemonSet) │ │
│ │ • Fluent Bit / Fluentd / Vector │ │
│ │ • Reads from /var/log/containers/ │ │
│ │ • Enriches with K8s metadata │ │
│ └──────────────────────────────────┬──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Log Storage │ │
│ │ • Loki (lightweight, label-based) │ │
│ │ • OpenSearch (full-text search) │ │
│ │ • CloudWatch / Stackdriver │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
14.3 Tracing Pipeline¶
# OpenTelemetry Collector configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        limit_mib: 512
    exporters:
      otlp:
        endpoint: tempo:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
15. Multi-Tenancy Patterns¶
15.1 Namespace Isolation¶
# Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "10"
    secrets: "20"
    persistentvolumeclaims: "10"
---
# Limit Ranges (defaults)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "2"
      memory: "4Gi"
  - type: PersistentVolumeClaim
    max:
      storage: 10Gi
15.2 Hierarchical Namespaces¶
# Using HNC (Hierarchical Namespace Controller)
apiVersion: hnc.x-k8s.io/v1alpha2
kind: HierarchyConfiguration
metadata:
  name: hierarchy
  namespace: team-a
spec:
  parent: organization
---
# Subnamespace
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-a-dev
  namespace: team-a
16. Disaster Recovery¶
16.1 Backup Strategies¶
etcd Backup:
# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Verify backup
etcdctl snapshot status backup.db --write-out=table
# Restore
etcdctl snapshot restore backup.db \
--data-dir=/var/lib/etcd-restored
Velero (Full Cluster Backup):
# Install Velero
velero install \
--provider aws \
--bucket my-backup-bucket \
--secret-file ./credentials-velero
# Create backup
velero backup create cluster-backup --include-namespaces '*'
# Schedule periodic backups
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--ttl 720h
# Restore
velero restore create --from-backup cluster-backup
16.2 High Availability Checklist¶
Control Plane:
- [ ] 3+ API server replicas behind load balancer
- [ ] 3 or 5 etcd nodes (Raft quorum)
- [ ] Leader election for scheduler/controller-manager
- [ ] Spread across availability zones
Worker Nodes:
- [ ] Multiple nodes per zone
- [ ] Pod anti-affinity for critical workloads
- [ ] Pod Disruption Budgets defined
- [ ] Node auto-repair enabled
Data:
- [ ] PVs with zone-redundant storage
- [ ] Application-level replication (databases)
- [ ] Regular backup testing
- [ ] Documented recovery procedures
17. Production Architecture Example¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ Production Cluster Architecture │
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐│
│ │ Control Plane ││
│ │ ││
│ │ Zone A Zone B Zone C ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │API Server│ │API Server│ │API Server│ ││
│ │ │etcd │ │etcd │ │etcd │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ Load Balancer (internal) ││
│ └────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐│
│ │ Worker Nodes ││
│ │ ││
│ │ Zone A Zone B Zone C ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │ Node 1 │ │ Node 3 │ │ Node 5 │ ││
│ │ │ Node 2 │ │ Node 4 │ │ Node 6 │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ Node pools: ││
│ │ • General purpose (on-demand) ││
│ │ • Compute optimized (spot/preemptible) ││
│ │ • Memory optimized (databases) ││
│ │ • GPU (ML workloads) ││
│ └────────────────────────────────────────────────────────────────────────┘│
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐│
│ │ Platform Services ││
│ │ ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │ Ingress │ │ Cert │ │ External │ │ Secrets │ ││
│ │ │ (NGINX) │ │ Manager │ │ DNS │ │ (Vault) │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │Prometheus│ │ Loki │ │ Tempo │ │ Grafana │ ││
│ │ │+ Thanos │ │ │ │ │ │ │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ │ ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││
│ │ │ ArgoCD │ │ Kyverno │ │ Velero │ │ Cilium │ ││
│ │ │ (GitOps) │ │ (Policy) │ │ (Backup) │ │ (CNI) │ ││
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││
│ └────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
18. Essential kubectl Commands¶
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes
# Debugging pods
kubectl describe pod <pod>
kubectl logs <pod> -c <container> --previous
kubectl exec -it <pod> -- /bin/sh
kubectl debug -it <pod> --image=busybox
# Resource management
kubectl get all -A
kubectl api-resources
kubectl explain pod.spec.containers
# Events and troubleshooting
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning
# RBAC debugging
kubectl auth can-i create pods --as=jane
kubectl auth whoami
# Rollouts
kubectl rollout status deployment/app
kubectl rollout history deployment/app
kubectl rollout undo deployment/app --to-revision=2
# Port forwarding
kubectl port-forward svc/app 8080:80
kubectl port-forward pod/app-abc123 8080:80
# Resource editing
kubectl edit deployment app
kubectl patch deployment app -p '{"spec":{"replicas":3}}'
# Labels and selectors
kubectl get pods -l app=nginx,env=prod
kubectl label pods <pod> version=v2