Cloud Computing¶
Cloud computing is the on-demand delivery of computing resources—servers, storage, databases, networking, software, analytics, and intelligence—over the internet ("the cloud") with pay-as-you-go pricing. Instead of owning and maintaining physical data centers and servers, organizations rent access to these resources from a cloud provider.
The National Institute of Standards and Technology (NIST) defines five essential characteristics of cloud computing:
- On-demand self-service: Provision resources automatically without human interaction with the provider.
- Broad network access: Resources available over the network via standard mechanisms (HTTP, APIs).
- Resource pooling: Provider resources are pooled to serve multiple tenants using a multi-tenant model.
- Rapid elasticity: Capabilities can be elastically provisioned and released to scale with demand.
- Measured service: Resource usage is monitored, controlled, and reported, enabling pay-per-use billing.
Cloud Deployment Models¶
Before diving into service models, it's important to understand where the cloud infrastructure lives:
| Model | Description | Use Cases |
|---|---|---|
| Public Cloud | Resources owned and operated by a third-party provider, shared across tenants | Startups, SaaS, variable workloads, rapid prototyping |
| Private Cloud | Dedicated infrastructure for a single organization (on-prem or hosted) | Regulatory compliance, sensitive data, predictable workloads |
| Hybrid Cloud | Combination of public and private, with orchestration between them | Enterprise (burst to public for peak load, sensitive data stays private) |
| Multi-Cloud | Using multiple public cloud providers simultaneously | Vendor lock-in avoidance, best-of-breed services, regulatory requirements |
Hybrid cloud is the most common enterprise model. An organization might run its core banking application on a private cloud for regulatory compliance while using AWS for customer-facing web applications and GCP BigQuery for analytics. The key challenges are data synchronization, identity federation, and consistent networking across environments.
Cloud Service Models¶
Cloud services are categorized into layers based on how much the provider manages versus how much the customer manages:
┌──────────────────────────────────────────────────────────────────┐
│ Responsibility Model │
├────────────┬────────────┬────────────┬──────────────────────────┤
│ On-Premise │ IaaS │ PaaS │ SaaS │
├────────────┼────────────┼────────────┼──────────────────────────┤
│ Apps YOU│ Apps YOU│ Apps YOU│ Apps PROVIDER │
│ Data YOU│ Data YOU│ Data YOU│ Data PROVIDER │
│ Runtime YOU│ Runtime YOU│ Runtime PRO│ Runtime PROVIDER │
│ Middle YOU│ Middle YOU│ Middle PRO│ Middleware PROVIDER │
│ OS YOU│ OS YOU│ OS PRO│ OS PROVIDER │
│ Virtual YOU│ Virtual PRO│ Virtual PRO│ Virtualizatn PROVIDER │
│ Servers YOU│ Servers PRO│ Servers PRO│ Servers PROVIDER │
│ Storage YOU│ Storage PRO│ Storage PRO│ Storage PROVIDER │
│ Network YOU│ Network PRO│ Network PRO│ Networking PROVIDER │
└────────────┴────────────┴────────────┴──────────────────────────┘
YOU = Customer manages PRO = Provider manages
The fundamental trade-off across all service models is control versus operational burden. As you move from IaaS to SaaS, you give up customization and control but gain operational simplicity and reduced staffing needs.
Infrastructure as a Service (IaaS)¶
IaaS provides virtualized computing resources over the internet. The provider manages the physical hardware, networking, and virtualization layer; the customer manages everything from the OS upward. This is the closest model to traditional IT but without the capital expenditure of physical hardware.
| Feature | Description |
|---|---|
| What you get | Virtual machines, networks, storage, firewalls |
| What you manage | OS, middleware, runtime, applications, data |
| Scaling | Manual or auto-scaling of VMs |
| Use cases | Custom environments, legacy app migration (lift-and-shift), dev/test environments |
| Examples | AWS EC2, Google Compute Engine, Azure Virtual Machines, DigitalOcean Droplets |
IaaS is the right choice when you need full control over the OS and runtime environment—for example, running specialized software that requires kernel-level configuration, GPUs for ML training, or legacy applications that can't be easily containerized. The downside is that you're responsible for patching the OS, configuring security groups, managing disk space, and handling instance failures.
# Example: Launching an EC2 instance with AWS CLI
aws ec2 run-instances \
--image-id ami-0abcdef1234567890 \
--instance-type t3.medium \
--key-name my-key-pair \
--security-group-ids sg-0123456789abcdef0 \
--subnet-id subnet-0123456789abcdef0 \
--count 1 \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=my-server}]'
Instance type selection is critical for cost and performance. Cloud providers offer instance families optimized for different workloads:
| Family | Optimized For | Examples (AWS) | Use Cases |
|---|---|---|---|
| General Purpose | Balanced CPU/memory | t3, m6i, m7g | Web servers, app servers, small databases |
| Compute Optimized | High-performance CPUs | c6i, c7g | Batch processing, scientific modeling, gaming |
| Memory Optimized | Large memory footprint | r6i, x2idn | In-memory databases, real-time analytics |
| Storage Optimized | High sequential I/O | i3, d3 | Data warehousing, distributed filesystems |
| Accelerated (GPU) | GPU/FPGA workloads | p4d, g5, inf2 | ML training/inference, video encoding |
| ARM-based (Graviton) | Cost-efficiency | t4g, m7g, c7g | 20-40% better price-performance for compatible workloads |
Platform as a Service (PaaS)¶
PaaS provides a platform allowing customers to develop, run, and manage applications without dealing with infrastructure. The provider manages servers, networking, storage, OS, and runtime. You focus exclusively on your application code and data.
| Feature | Description |
|---|---|
| What you get | Managed runtime, databases, development tools |
| What you manage | Application code and data |
| Scaling | Automatic (usually) |
| Use cases | Web applications, APIs, microservices, rapid prototyping |
| Examples | Heroku, Google App Engine, AWS Elastic Beanstalk, Azure App Service, Railway, Render |
PaaS dramatically reduces time-to-deploy. A developer can push code to a Git repository and have it running in production within minutes, without configuring a single server. The trade-off is reduced flexibility: you're constrained to the runtimes, languages, and configurations the platform supports. If you need a specific Linux kernel version or a custom native library, PaaS may not work.
When PaaS falls short: PaaS platforms impose constraints on execution time, memory, filesystem access, and network configuration. Applications that require long-running background processes, custom binary dependencies, or specific network topologies often outgrow PaaS and need to migrate to containers (CaaS) or IaaS.
Software as a Service (SaaS)¶
SaaS delivers fully managed applications over the internet. The provider manages everything; the customer simply uses the software through a web browser or API.
| Feature | Description |
|---|---|
| What you get | Complete application accessible via browser or API |
| What you manage | Configuration, user data |
| Use cases | Email, CRM, collaboration, productivity |
| Examples | Gmail, Salesforce, Slack, GitHub, Jira, Datadog |
SaaS is the dominant model for business tools. The key consideration for engineering teams is integration: how well does the SaaS product expose APIs, support webhooks, and integrate with your existing toolchain? Data portability and vendor lock-in are significant concerns—can you export your data if you switch providers?
Function as a Service (FaaS) / Serverless¶
FaaS is an event-driven execution model where the provider dynamically manages the allocation of computing resources. You deploy individual functions, and the provider runs them in response to events. There are no servers to provision, manage, or scale—the provider handles everything.
| Feature | Description |
|---|---|
| What you get | Event-driven function execution, automatic scaling to zero |
| What you manage | Function code (and sometimes container images) |
| Scaling | Automatic, scales to zero when idle |
| Billing | Per-invocation and per-duration (e.g., per ms of execution) |
| Limitations | Cold starts, execution time limits (15 min on AWS Lambda), stateless |
| Examples | AWS Lambda, Google Cloud Functions, Azure Functions, Cloudflare Workers |
# AWS Lambda function example (Python)
import json
def handler(event, context):
"""Process an API Gateway event."""
    # API Gateway sends queryStringParameters as null (not {}) when absent
    name = (event.get('queryStringParameters') or {}).get('name', 'World')
return {
'statusCode': 200,
'headers': {'Content-Type': 'application/json'},
'body': json.dumps({'message': f'Hello, {name}!'})
}
Cold starts are the most significant operational concern with FaaS. When a function hasn't been invoked recently, the provider must spin up a new execution environment (download code, initialize runtime, execute initialization code). This adds latency—typically 100ms-2s depending on runtime, memory size, and package size. Mitigation strategies:
- Provisioned concurrency: Keep a minimum number of warm instances (costs more but eliminates cold starts)
- Smaller deployment packages: Minimize dependencies to reduce initialization time
- Choose faster runtimes: Go and Rust cold-start in ~50ms; Python and Node.js in ~200ms; Java/C# in ~1-3s
- Keep initialization outside the handler: Module-level code runs once per cold start, not per invocation
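The last point can be sketched in a minimal handler (illustrative only; the config values are made up for this sketch):

```python
import json
import time

# Module-level code runs once per cold start and is then reused
# across warm invocations of the same execution environment.
_COLD_START = time.monotonic()
CONFIG = {"table": "users"}  # e.g. parsed config, SDK clients, DB connections

def handler(event, context):
    """Reuses module-level state instead of re-initializing on every call."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "table": CONFIG["table"],
            "warm_for_s": round(time.monotonic() - _COLD_START, 3),
        }),
    }
```

Anything expensive (database connections, loading models, reading parameters) belongs at module level so only the first invocation after a cold start pays for it.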
Serverless anti-patterns: Not everything should be serverless. Avoid FaaS for long-running processes (use containers), high-throughput steady-state workloads (dedicated compute is cheaper), or applications requiring local state or filesystem access.
Container as a Service (CaaS)¶
CaaS is the sweet spot between IaaS and PaaS—you package your application in a container (Docker image) and the platform handles orchestration, scaling, networking, and infrastructure management. You control the runtime environment (anything that fits in a container) without managing servers.
| Feature | Description |
|---|---|
| What you get | Container orchestration, networking, auto-scaling, service discovery |
| What you manage | Container images (Dockerfile), application configuration |
| Scaling | Automatic (horizontal pod autoscaling, scale-to-zero for some platforms) |
| Examples | AWS ECS/Fargate, Google Cloud Run, Azure Container Apps, Fly.io |
CaaS is increasingly the default deployment model for production microservices. It provides the flexibility of IaaS (run anything in your container) with the operational simplicity of PaaS (no server management). The two main flavors are:
- Kubernetes-based (EKS, GKE, AKS): Full Kubernetes API, maximum flexibility, higher operational complexity
- Managed container platforms (Fargate, Cloud Run): Simpler abstractions, less control, lower operational burden
# Deploy a container to Google Cloud Run (scales to zero)
gcloud run deploy my-service \
--image gcr.io/my-project/my-app:latest \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 512Mi \
--cpu 1 \
--min-instances 0 \
--max-instances 100 \
--set-env-vars "DATABASE_URL=postgres://..." \
--port 8080
Other Service Models¶
| Model | Description | Examples |
|---|---|---|
| DBaaS (Database as a Service) | Managed database engines with automated backups, patching, and scaling | AWS RDS, Google Cloud SQL, Azure Cosmos DB, PlanetScale, Neon |
| BaaS (Backend as a Service) | Pre-built backend features (auth, storage, push notifications) | Firebase, Supabase, AWS Amplify |
| AIaaS (AI as a Service) | Managed AI/ML models and APIs | OpenAI API, AWS Bedrock, Google Vertex AI, Azure OpenAI Service |
Choosing a Service Model¶
Decision Tree:
Need full OS/kernel control? → IaaS (EC2, Compute Engine)
│ No
▼
Already containerized? → CaaS (Fargate, Cloud Run, EKS)
│ No
▼
Event-driven, short-lived workload? → FaaS (Lambda, Cloud Functions)
│ No
▼
Standard web app/API? → PaaS (Heroku, App Service, Render)
│ No
▼
Just need managed software? → SaaS (Datadog, GitHub, Slack)
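The decision tree above can be expressed as a small helper (a sketch; the criteria names are invented for illustration):

```python
def choose_service_model(*, os_control: bool = False,
                         containerized: bool = False,
                         event_driven: bool = False,
                         standard_web_app: bool = False) -> str:
    """Walk the decision tree top to bottom; the first 'yes' wins."""
    if os_control:
        return "IaaS"   # EC2, Compute Engine
    if containerized:
        return "CaaS"   # Fargate, Cloud Run, EKS
    if event_driven:
        return "FaaS"   # Lambda, Cloud Functions
    if standard_web_app:
        return "PaaS"   # Heroku, App Service, Render
    return "SaaS"       # Datadog, GitHub, Slack
```

Note the ordering matters: a containerized, event-driven workload lands on CaaS because the container question is asked first.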
Major Cloud Providers¶
Amazon Web Services (AWS)¶
The largest cloud provider (approximately 31% market share), AWS offers 200+ services across compute, storage, databases, networking, AI/ML, analytics, and more. AWS's strength is its breadth—there is a managed service for virtually every infrastructure need.
Core Services:
| Category | Service | Description |
|---|---|---|
| Compute | EC2 | Virtual machines (instances) with configurable CPU, memory, storage |
| | Lambda | Serverless functions (FaaS) |
| | ECS/EKS | Container orchestration (Docker/Kubernetes) |
| | Fargate | Serverless containers (no instance management) |
| Storage | S3 | Object storage (unlimited, 11 9s durability) |
| | EBS | Block storage for EC2 (SSD/HDD volumes) |
| | EFS | Managed NFS file system |
| | Glacier | Archival storage (low cost, high retrieval latency) |
| Database | RDS | Managed relational DB (PostgreSQL, MySQL, Oracle, SQL Server) |
| | DynamoDB | Managed NoSQL (key-value/document, single-digit ms latency) |
| | ElastiCache | Managed Redis/Memcached |
| | Aurora | AWS-optimized MySQL/PostgreSQL (up to 5x the throughput of standard MySQL) |
| Networking | VPC | Virtual private cloud (isolated network) |
| | Route 53 | DNS service |
| | CloudFront | CDN |
| | ELB/ALB/NLB | Load balancers (Layer 4 and Layer 7) |
| Security | IAM | Identity and access management |
| | KMS | Key management service |
| | WAF | Web application firewall |
| | Secrets Manager | Secrets storage and rotation |
| Messaging | SQS | Managed message queue |
| | SNS | Pub/sub messaging |
| | EventBridge | Event bus for event-driven architectures |
Google Cloud Platform (GCP)¶
Known for strong data analytics, AI/ML capabilities, and Kubernetes (GKE was the first managed Kubernetes service—Google created Kubernetes). GCP's pricing model is often simpler than AWS's, with sustained-use discounts applied automatically.
Core Services:
| Category | Service | Description |
|---|---|---|
| Compute | Compute Engine | Virtual machines |
| | Cloud Run | Serverless containers (scales to zero) |
| | GKE | Managed Kubernetes |
| | Cloud Functions | Serverless functions |
| Storage | Cloud Storage | Object storage |
| | Persistent Disk | Block storage |
| | Filestore | Managed NFS |
| Database | Cloud SQL | Managed relational DB |
| | Firestore | NoSQL document DB |
| | Cloud Spanner | Globally distributed relational DB (horizontally scalable + ACID) |
| | Bigtable | Wide-column NoSQL (HBase-compatible) |
| Data/AI | BigQuery | Serverless data warehouse (SQL analytics on petabytes) |
| | Vertex AI | Managed ML platform |
| | Pub/Sub | Messaging service |
Microsoft Azure¶
Strong in enterprise and hybrid cloud, tightly integrated with the Microsoft ecosystem (Active Directory, Office 365, .NET). Azure's competitive advantage is enterprise customers who already use Microsoft products.
Core Services:
| Category | Service | Description |
|---|---|---|
| Compute | Virtual Machines | VMs |
| | App Service | PaaS for web apps |
| | AKS | Managed Kubernetes |
| | Azure Functions | Serverless functions |
| Storage | Blob Storage | Object storage |
| | Azure Files | Managed file shares (SMB/NFS) |
| Database | Azure SQL | Managed SQL Server |
| | Cosmos DB | Globally distributed multi-model NoSQL |
| Identity | Azure AD (Entra ID) | Enterprise identity and access management |
Provider Comparison¶
| Dimension | AWS | GCP | Azure |
|---|---|---|---|
| Market share | ~31% | ~12% | ~24% |
| Strengths | Breadth of services, ecosystem | Data/AI, Kubernetes, pricing | Enterprise, hybrid, Microsoft integration |
| Pricing model | Complex, many dimensions | Simpler, sustained discounts | Enterprise agreements, hybrid benefit |
| Global regions | 33+ regions | 40+ regions | 60+ regions |
| Best for | Startups to enterprise, general purpose | Data-intensive, ML, containerized workloads | Enterprise, .NET shops, hybrid cloud |
Cloud-Agnostic Tools¶
To avoid vendor lock-in and manage multi-cloud environments, teams use abstraction layers:
| Tool | Purpose | Description |
|---|---|---|
| Terraform | Infrastructure as Code | Declarative HCL language, provider ecosystem for all major clouds, state management |
| Pulumi | Infrastructure as Code | Real programming languages (Python, TypeScript, Go) instead of DSL, strong typing |
| Crossplane | Kubernetes-native IaC | Manage cloud resources as Kubernetes custom resources (CRDs) |
| Helm | Kubernetes package manager | Template and deploy Kubernetes applications consistently across providers |
# Terraform example: Provision infrastructure across providers
provider "aws" {
region = "us-east-1"
}
resource "aws_instance" "web" {
ami = "ami-0abcdef1234567890"
instance_type = "t3.medium"
tags = {
Name = "web-server"
Environment = "production"
Team = "platform"
}
}
resource "aws_s3_bucket" "assets" {
bucket = "my-app-assets-prod"
tags = {
Environment = "production"
}
}
Cloud Networking¶
Understanding cloud networking is fundamental to deploying secure, scalable applications. Cloud networking virtualizes traditional data center networking concepts and adds cloud-specific constructs.
Virtual Private Cloud (VPC)¶
A VPC is a logically isolated virtual network within the cloud provider's infrastructure. It provides complete control over IP addressing, subnets, routing, and security. Most network-attached resources you launch (VMs, databases, load balancers) run inside a VPC.
┌──────────────────────────VPC (10.0.0.0/16) ──────────────────────┐
│ │
│ ┌─── Availability Zone A ───┐ ┌─── Availability Zone B ───┐ │
│ │ │ │ │ │
│ │ ┌─ Public Subnet ─────┐ │ │ ┌─ Public Subnet ─────┐ │ │
│ │ │ 10.0.1.0/24 │ │ │ │ 10.0.3.0/24 │ │ │
│ │ │ ┌───────┐ ┌──────┐ │ │ │ │ ┌───────┐ ┌──────┐ │ │ │
│ │ │ │ Web-1 │ │ NAT │ │ │ │ │ │ Web-2 │ │ NAT │ │ │ │
│ │ │ └───────┘ └──────┘ │ │ │ │ └───────┘ └──────┘ │ │ │
│ │ └─────────────────────┘ │ │ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ ┌─ Private Subnet ────┐ │ │ ┌─ Private Subnet ────┐ │ │
│ │ │ 10.0.2.0/24 │ │ │ │ 10.0.4.0/24 │ │ │
│ │ │ ┌───────┐ ┌──────┐ │ │ │ │ ┌───────┐ ┌──────┐ │ │ │
│ │ │ │ App-1 │ │ DB-1 │ │ │ │ │ │ App-2 │ │ DB-2 │ │ │ │
│ │ │ └───────┘ └──────┘ │ │ │ │ └───────┘ └──────┘ │ │ │
│ │ └─────────────────────┘ │ │ └─────────────────────┘ │ │
│ └───────────────────────────┘ └───────────────────────────┘ │
│ │
│ ┌── Internet Gateway ──┐ ┌── Route Tables ──┐ │
│ │ Connects VPC to │ │ Public: 0.0.0.0 │ │
│ │ the internet │ │ → IGW │ │
│ └──────────────────────┘ │ Private: 0.0.0.0 │ │
│ │ → NAT Gateway │ │
│ └────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
Key components:
- Subnets: Subdivisions of a VPC's IP range. Public subnets have routes to an internet gateway; private subnets do not (they access the internet through a NAT gateway). Best practice: place application servers and databases in private subnets; only load balancers and bastion hosts in public subnets.
- Internet Gateway (IGW): Allows resources in public subnets to communicate with the internet. It's horizontally scaled, redundant, and highly available—no bandwidth constraints.
- NAT Gateway: Enables resources in private subnets to initiate outbound internet connections (for updates, API calls) without being directly accessible from the internet. NAT gateways are charged per hour and per GB processed—they can become a significant cost for data-intensive workloads.
- Route Tables: Rules that determine where network traffic is directed. Each subnet is associated with a route table. A public subnet's route table has 0.0.0.0/0 → IGW; a private subnet's has 0.0.0.0/0 → NAT Gateway.
- Security Groups: Stateful firewalls at the instance level. Rules specify allowed inbound/outbound traffic by protocol, port, and source/destination.
- Network ACLs (NACLs): Stateless firewalls at the subnet level. Act as a second layer of defense.
- VPC Peering: Connects two VPCs so they can communicate using private IPs, even across regions or accounts. Non-transitive (A↔B and B↔C doesn't mean A↔C).
- Transit Gateway: Hub-and-spoke model for connecting multiple VPCs and on-premises networks. Solves the scaling problem of VPC peering (N VPCs would need N(N-1)/2 peering connections vs N transit gateway attachments).
- VPC Endpoints: Private connections to AWS services (S3, DynamoDB, etc.) that don't traverse the internet. Gateway endpoints (S3, DynamoDB) are free; interface endpoints (most other services) use PrivateLink and cost per hour + per GB.
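Carving the diagram's /16 VPC into /24 subnets is plain CIDR arithmetic, which Python's ipaddress module can verify (a sketch mirroring the addresses in the diagram above):

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))  # 256 possible /24 subnets

# The layout from the diagram: public/private pairs across two AZs.
layout = {
    "public-a":  subnets[1],  # 10.0.1.0/24
    "private-a": subnets[2],  # 10.0.2.0/24
    "public-b":  subnets[3],  # 10.0.3.0/24
    "private-b": subnets[4],  # 10.0.4.0/24
}

for name, cidr in layout.items():
    # 256 addresses per /24 (AWS additionally reserves 5 per subnet)
    print(name, cidr, cidr.num_addresses)
```

Planning CIDR ranges up front matters: subnets can't be resized later, and overlapping ranges block VPC peering.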
Security Groups vs. NACLs¶
| Feature | Security Groups | Network ACLs |
|---|---|---|
| Level | Instance (ENI) | Subnet |
| Statefulness | Stateful (return traffic auto-allowed) | Stateless (must explicitly allow return traffic) |
| Rules | Allow rules only | Allow and deny rules |
| Evaluation | All rules evaluated together | Rules evaluated in order (lowest number first) |
| Default | Deny all inbound, allow all outbound | Allow all inbound and outbound |
Best practice: Use security groups as your primary firewall (they're easier to manage and stateful). Use NACLs as a defense-in-depth measure for subnet-level blocking (e.g., blocking known malicious IP ranges).
DNS and Traffic Routing¶
Cloud DNS services (Route 53, Cloud DNS) do more than resolve domain names—they're intelligent traffic routers:
| Routing Policy | Description | Use Case |
|---|---|---|
| Simple | Single record, single endpoint | Small applications with one server |
| Weighted | Distribute traffic by percentage across endpoints | Canary deployments (95% to v1, 5% to v2) |
| Latency-based | Route to the lowest-latency region | Global applications (serve US users from us-east, EU from eu-west) |
| Failover | Active-passive: route to secondary if primary fails health check | Disaster recovery |
| Geolocation | Route based on user's geographic location | Compliance (EU data stays in EU), localized content |
| Multi-value answer | Return multiple healthy endpoints (client-side load balancing) | Simple HA without a load balancer |
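Weighted routing is easy to reason about as weighted random selection; here is a client-side analogue (a sketch for intuition, not how Route 53 implements it):

```python
import random

def pick_endpoint(weights, rng=random.random):
    """Pick an endpoint with probability proportional to its weight."""
    total = sum(weights.values())
    r = rng() * total
    for endpoint, weight in weights.items():
        r -= weight
        if r < 0:
            return endpoint
    return endpoint  # guard against floating-point edge cases

# Canary deployment: 95% of traffic to v1, 5% to v2
weights = {"app-v1.example.com": 95, "app-v2.example.com": 5}
```

Shifting the canary's share is then just an update to the record weights, with no application change.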
Content Delivery Networks (CDNs)¶
CDNs cache content at edge locations close to users, dramatically reducing latency for static and dynamic content. Major CDNs operate hundreds of points of presence (PoPs) worldwide.
Without CDN:
User (Tokyo) → Origin Server (us-east-1) = ~200ms latency
With CDN:
User (Tokyo) → CDN Edge (Tokyo PoP) = ~10ms latency (cache hit)
User (Tokyo) → CDN Edge (Tokyo PoP) → Origin (us-east-1) = ~210ms (cache miss, then cached)
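The overall payoff depends on the cache hit ratio; expected latency is just a weighted average of the two paths above (numbers taken from the example):

```python
def expected_latency_ms(hit_ratio, hit_ms=10, miss_ms=210):
    """Average latency for the Tokyo user at a given CDN cache hit ratio."""
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms

# At a 95% hit ratio, average latency drops from ~200ms to ~20ms
print(round(expected_latency_ms(0.95), 1))  # → 20.0
```

This is why cache hit ratio is the headline CDN metric: the marginal win from 90% to 99% is larger than it looks.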
| CDN Feature | Description |
|---|---|
| Static caching | HTML, CSS, JS, images cached at edge locations |
| Dynamic acceleration | Optimized routing and persistent connections to origin for dynamic content |
| SSL/TLS termination | Terminate TLS at the edge, reducing origin load |
| DDoS protection | Absorb volumetric attacks at the edge before they reach origin |
| Edge compute | Run code at edge locations (CloudFront Functions, Cloudflare Workers, Vercel Edge Functions) |
| Cache invalidation | Purge specific paths or wildcard patterns when content changes |
CDN providers: CloudFront (AWS), Cloud CDN (GCP), Azure CDN, Cloudflare, Fastly, Akamai.
Load Balancing¶
Cloud providers offer managed load balancers that distribute traffic across multiple backend targets:
| Type | AWS Service | Layer | Use Case |
|---|---|---|---|
| Application LB | ALB | Layer 7 (HTTP/HTTPS) | HTTP routing (path-based, host-based), WebSocket, gRPC |
| Network LB | NLB | Layer 4 (TCP/UDP) | Ultra-low latency, static IPs, millions of RPS |
| Gateway LB | GWLB | Layer 3 | Inline network appliances (firewalls, IDS/IPS) |
ALB routing example: An ALB can route /api/* to your backend service, /static/* to an S3 bucket, and everything else to your frontend service—all from a single endpoint.
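Path-based routing boils down to prefix matching in priority order; a sketch of the rule set just described (target names are illustrative):

```python
# Rules are evaluated in priority order; the first matching prefix wins.
RULES = [
    ("/api/", "backend-service"),
    ("/static/", "s3-assets"),
]

def route(path: str) -> str:
    """Return the target group for a request path."""
    for prefix, target in RULES:
        if path.startswith(prefix):
            return target
    return "frontend-service"  # default action when no rule matches
```

Real ALB rules also match on host headers, HTTP methods, and query strings, but the priority-ordered first-match model is the same.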
Service Mesh¶
For microservices architectures, a service mesh provides infrastructure-level control over service-to-service communication:
Without service mesh: With service mesh (Istio/Linkerd):
┌─────────┐ HTTP ┌─────────┐ ┌─────────┐ ┌──────┐ ┌──────┐ ┌─────────┐
│ Service │ ──────── │ Service │ │ Service │─│Proxy │───│Proxy │─│ Service │
│ A │ │ B │ │ A │ │(Envoy│ │(Envoy│ │ B │
└─────────┘ └─────────┘ └─────────┘ └──────┘ └──────┘ └─────────┘
Sidecar proxies handle: mTLS, retries,
circuit breaking, observability, traffic control
Service meshes provide: mutual TLS (encrypted service-to-service communication), traffic management (canary releases, traffic splitting), observability (distributed tracing, metrics), resilience (retries, circuit breaking, timeouts), and access control (authorization policies).
Identity and Access Management (IAM)¶
IAM is the framework for managing who (identity) can do what (permissions) on which resources in the cloud. Every cloud provider has an IAM system; AWS IAM is the most widely referenced.
Core IAM Concepts¶
- Users: Represent individual people or service accounts. Each has credentials (password, access keys). Should map 1:1 to humans; never share user accounts.
- Groups: Collections of users. Permissions assigned to groups apply to all members. Example: a developers group has read access to production and write access to staging.
- Roles: Identities with permissions that can be assumed by users, services, or external identities. Unlike users, roles don't have permanent credentials—they provide temporary security tokens via AWS STS (Security Token Service).
- Policies: JSON documents that define permissions. Attached to users, groups, or roles.
The Principle of Least Privilege¶
Always grant the minimum permissions necessary to perform a task. This is the single most important IAM principle.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-app-bucket/*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
},
"Bool": {
"aws:MultiFactorAuthPresent": "true"
}
}
},
{
"Effect": "Deny",
"Action": "s3:DeleteObject",
"Resource": "*"
}
]
}
IAM policy evaluation logic: When AWS evaluates a request, it follows this order:
1. Explicit deny — Any explicit deny in any policy wins (overrides everything)
2. Organizations SCPs — Service control policies set the maximum permissions boundary
3. Resource-based policies — Policies attached to resources (S3 bucket policies, etc.)
4. Identity-based policies — Policies attached to the user/role making the request
5. Permissions boundaries — Maximum permissions an identity can have
6. Session policies — Limit permissions for a temporary session
7. Default deny — If nothing explicitly allows the action, it's denied
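A drastically simplified evaluator shows the core rule (explicit deny wins; otherwise an allow is required). This sketch ignores conditions, SCPs, and boundaries:

```python
from fnmatch import fnmatch

def evaluate(statements, action, resource):
    """Simplified identity-policy evaluation: default deny, explicit deny wins."""
    decision = "Deny"  # default deny
    for stmt in statements:
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if not any(fnmatch(action, pattern) for pattern in actions):
            continue
        if not fnmatch(resource, stmt["Resource"]):
            continue
        if stmt["Effect"] == "Deny":
            return "Deny"  # explicit deny overrides any allow
        decision = "Allow"
    return decision

# Statements mirroring the example policy above
statements = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
     "Resource": "arn:aws:s3:::my-app-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "*"},
]
```

Note the asymmetry: a matching Deny short-circuits immediately, while an Allow only changes the default and can still be overridden by a later Deny.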
RBAC vs. ABAC¶
| Approach | Description | Example |
|---|---|---|
| RBAC (Role-Based) | Permissions assigned to roles; users assume roles | database-admin role has full RDS access |
| ABAC (Attribute-Based) | Permissions based on tags/attributes of resources and principals | Users with tag team=data can access resources with tag team=data |
ABAC scales better than RBAC in large organizations. Instead of creating a new role for every team-resource combination, you create policies based on tag matching. However, ABAC requires disciplined tagging—if resources aren't tagged correctly, access control breaks.
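Tag-matching ABAC reduces to comparing an attribute on the principal with one on the resource (a toy sketch; real policies use condition keys like aws:PrincipalTag):

```python
def abac_allows(principal_tags: dict, resource_tags: dict, key: str = "team") -> bool:
    """Allow when principal and resource carry the same value for `key`."""
    value = principal_tags.get(key)
    return value is not None and value == resource_tags.get(key)

# Users tagged team=data may access resources tagged team=data
print(abac_allows({"team": "data"}, {"team": "data"}))   # → True
print(abac_allows({"team": "data"}, {"team": "infra"}))  # → False
print(abac_allows({"team": "data"}, {}))                 # untagged resource: denied
```

The last case illustrates the tagging-discipline caveat: an untagged resource simply falls out of the access model.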
Federation and SSO¶
Federation allows external identities (corporate Active Directory, Google Workspace, Okta) to access cloud resources without creating individual IAM users:
- SAML 2.0: Enterprise standard for SSO. Corporate IdP (Okta, Azure AD) authenticates user, sends SAML assertion to AWS, AWS grants temporary credentials based on mapped role.
- OIDC (OpenID Connect): Modern standard used by GitHub Actions, GitLab CI, and web applications. Allows workloads to assume cloud roles without long-lived secrets.
- AWS IAM Identity Center (SSO): Centralized SSO for multiple AWS accounts, integrates with corporate IdPs.
# GitHub Actions OIDC federation — no AWS access keys needed
jobs:
deploy:
permissions:
id-token: write # Required for OIDC
contents: read
steps:
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-deploy
aws-region: us-east-1
- run: aws s3 sync ./build s3://my-app-bucket/
Service-to-Service Authentication¶
In microservices architectures, services need to authenticate with each other:
| Method | Description | Complexity |
|---|---|---|
| IAM roles | Services assume IAM roles to access cloud resources | Low (cloud-native) |
| Service accounts | Dedicated identities for services (GCP service accounts, K8s service accounts) | Low-Medium |
| mTLS | Mutual TLS: both client and server present certificates | Medium-High (use service mesh) |
| Workload Identity | Map Kubernetes service accounts to cloud IAM roles (IRSA on EKS, Workload Identity on GKE) | Medium |
| SPIFFE/SPIRE | Open standard for service identity | High (most flexible) |
Secrets Management¶
Secrets (API keys, database passwords, TLS certificates) should never be in code, environment variables, or config files in source control:
| Tool | Description | Features |
|---|---|---|
| HashiCorp Vault | Industry-standard secrets manager | Dynamic secrets, encryption as a service, PKI, cloud-agnostic |
| AWS Secrets Manager | Managed secrets with auto-rotation | RDS password rotation, cross-account sharing, $0.40/secret/month |
| AWS SSM Parameter Store | Simpler key-value store | Free tier (standard params), hierarchical organization |
| GCP Secret Manager | GCP-native secrets | IAM integration, versioning, automatic replication |
| External Secrets Operator | Kubernetes operator that syncs secrets from external stores | Bridges Vault/cloud secrets into K8s secrets |
Cloud Security¶
Shared Responsibility Model¶
Cloud security is a shared responsibility between the provider and the customer. The exact boundary depends on the service model:
┌────────────────────────────────────────────────────┐
│ Customer Responsibility │
│ ┌──────────┬───────────┬───────────┬────────────┐ │
│ │ IaaS │ CaaS │ PaaS │ SaaS │ │
│ ├──────────┼───────────┼───────────┼────────────┤ │
│ │ Data │ Data │ Data │ Data access│ │
│ │ Apps │ Container │ App code │ User config│ │
│ │ OS/patch │ images │ │ │ │
│ │ Network │ Cluster │ │ │ │
│ │ config │ config │ │ │ │
│ └──────────┴───────────┴───────────┴────────────┘ │
├────────────────────────────────────────────────────┤
│ Provider Responsibility │
│ Physical security, hardware, hypervisor, │
│ managed service infrastructure, global network │
└────────────────────────────────────────────────────┘
Encryption¶
| Type | Description | AWS Service |
|---|---|---|
| At rest | Data encrypted on disk/storage | KMS, S3 server-side encryption, EBS encryption |
| In transit | Data encrypted over the network (TLS) | ACM (certificate management), ALB TLS termination |
| Client-side | Data encrypted before sending to cloud | AWS Encryption SDK, client-side S3 encryption |
Envelope encryption (used by KMS): A data key encrypts your data, and a master key encrypts the data key. This avoids sending large data blobs to KMS—only the small data key is encrypted/decrypted by KMS. The encrypted data key is stored alongside the encrypted data.
Encryption:
Data → [Data Key] → Encrypted Data
Data Key → [KMS Master Key] → Encrypted Data Key
Store: Encrypted Data + Encrypted Data Key
Decryption:
Encrypted Data Key → [KMS Master Key] → Data Key
Encrypted Data → [Data Key] → Data
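The flow above can be demonstrated end to end with a toy XOR cipher standing in for real encryption (do not use XOR in practice; this only illustrates the key-wrapping shape, where KMS would use AES):

```python
import os

def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR keystream: the same call encrypts and decrypts. Illustration only."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = os.urandom(16)   # stands in for the KMS master key; never leaves KMS
data_key = os.urandom(16)     # generated per object

plaintext = b"customer record"
encrypted_data = toy_cipher(plaintext, data_key)
encrypted_data_key = toy_cipher(data_key, master_key)  # the only "KMS" round-trip

# Store encrypted_data + encrypted_data_key side by side; decrypt later:
recovered_key = toy_cipher(encrypted_data_key, master_key)
recovered = toy_cipher(encrypted_data, recovered_key)
assert recovered == plaintext
```

The point of the pattern is visible in the sizes: only the 16-byte data key ever travels to KMS, regardless of how large the data is.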
Network Security¶
- Web Application Firewall (WAF): Inspects HTTP requests and blocks malicious traffic (SQL injection, XSS, bot traffic). AWS WAF, Cloudflare WAF, Azure WAF.
- DDoS Protection: AWS Shield Standard (free, automatic L3/L4 protection), Shield Advanced (L7 protection, DDoS response team, cost protection).
- VPC Flow Logs: Capture IP traffic metadata flowing through your VPC for auditing and troubleshooting.
- PrivateLink: Access services over private IPs without traversing the internet.
Security Monitoring¶
| Service | Purpose |
|---|---|
| CloudTrail | Logs every API call made in your AWS account (who did what, when, from where) |
| GuardDuty | ML-based threat detection analyzing CloudTrail, VPC Flow Logs, DNS logs |
| Security Hub | Aggregates findings from multiple security services, compliance checks |
| Config | Tracks resource configuration changes, evaluates compliance rules |
| Inspector | Automated vulnerability scanning for EC2 instances and container images |
Cloud Storage Deep Dive¶
Object Storage (S3 / GCS / Blob Storage)¶
Object storage is the fundamental cloud storage primitive. Each object is stored under a key in a bucket, together with its data and metadata. There is no directory hierarchy—the "folders" you see are just key prefixes.
S3 storage classes:
| Class | Durability | Availability | Min Duration | Retrieval | Use Case |
|---|---|---|---|---|---|
| Standard | 11 9s | 99.99% | None | Instant | Frequently accessed data |
| Intelligent-Tiering | 11 9s | 99.9% | None | Instant | Unknown/changing access patterns |
| Standard-IA | 11 9s | 99.9% | 30 days | Instant | Infrequent but rapid access needed |
| One Zone-IA | 11 9s | 99.5% | 30 days | Instant | Reproducible, infrequent data |
| Glacier Instant | 11 9s | 99.9% | 90 days | Instant | Archive with instant access |
| Glacier Flexible | 11 9s | 99.99% | 90 days | 1-12 hours | Archive (long-term backups) |
| Glacier Deep Archive | 11 9s | 99.99% | 180 days | 12-48 hours | Compliance archives, rarely accessed |
Lifecycle policies automatically transition objects between storage classes based on age. Example: Move to Standard-IA after 30 days, Glacier after 90 days, Deep Archive after 365 days, delete after 7 years.
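The example lifecycle above can be expressed as the `LifecycleConfiguration` payload that boto3's `put_bucket_lifecycle_configuration` accepts; the rule ID is a placeholder, and 7 years is approximated as 2,555 days:

```python
# Sketch of the lifecycle rule described above: IA at 30 days, Glacier at 90,
# Deep Archive at 365, delete after ~7 years. Rule ID is illustrative.
lifecycle = {
    "Rules": [{
        "ID": "archive-then-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},   # empty prefix: applies to all objects
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 2555},   # ~7 years
    }]
}
# Applied with: s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle)
```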
S3 performance optimization:
- S3 supports 5,500 GET/s and 3,500 PUT/s per partitioned prefix
- Use random prefixes (UUIDs, hashes) to distribute requests across partitions
- Use multipart upload for objects > 100 MB (required > 5 GB)
- S3 Transfer Acceleration uses CloudFront edge locations for faster uploads from distant locations
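One way to randomize prefixes, sketched below, is to derive a short hash from the object name so keys spread across partitions instead of piling onto one hot prefix (the function name and prefix length are illustrative):

```python
import hashlib

def partitioned_key(name: str, prefix_len: int = 4) -> str:
    # Derive a stable hash prefix from the object name so writes/reads
    # distribute across S3 partitions rather than hitting one prefix.
    digest = hashlib.md5(name.encode()).hexdigest()
    return f"{digest[:prefix_len]}/{name}"
```

The trade-off: hashed prefixes destroy lexicographic listing order, so date-range scans become harder. This technique matters mainly at very high request rates, since S3 also repartitions hot prefixes automatically over time.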
Block Storage (EBS)¶
Block storage provides raw storage volumes that attach to compute instances, behaving like physical hard drives:
| Volume Type | IOPS | Throughput | Use Case |
|---|---|---|---|
| gp3 (General SSD) | 3,000-16,000 | 125-1,000 MB/s | Default for most workloads |
| io2 (Provisioned SSD) | Up to 256,000 | Up to 4,000 MB/s | Databases requiring consistent IOPS |
| st1 (Throughput HDD) | 500 | 500 MB/s | Big data, data warehouses, log processing |
| sc1 (Cold HDD) | 250 | 250 MB/s | Infrequently accessed, lowest cost |
The 12-Factor App¶
The 12-Factor App is a methodology for building modern, cloud-native applications that are portable, scalable, and maintainable. Originally published by Heroku engineers, these principles are foundational for cloud-native development.
| Factor | Principle | Description |
|---|---|---|
| I. Codebase | One codebase, many deploys | One repo per app, deployed to multiple environments (dev, staging, prod) |
| II. Dependencies | Explicitly declare and isolate | Use dependency manifests (requirements.txt, package.json, Cargo.toml). Never rely on system-wide packages |
| III. Config | Store config in the environment | Database URLs, API keys, feature flags → environment variables, not code. Never commit secrets |
| IV. Backing Services | Treat backing services as attached resources | Databases, caches, queues are interchangeable resources identified by URL/credentials. Swapping a local PostgreSQL for Amazon RDS should require only a config change |
| V. Build, Release, Run | Strictly separate build and run stages | Build (compile + bundle), Release (build + config), Run (execute). Every release is immutable and has a unique ID |
| VI. Processes | Execute the app as stateless processes | App processes are stateless and share-nothing. Persistent data lives in backing services (DB, Redis, S3), not in local memory or filesystem |
| VII. Port Binding | Export services via port binding | The app is completely self-contained and exports HTTP (or other) as a service by binding to a port |
| VIII. Concurrency | Scale out via the process model | Scale by running multiple processes (horizontal scaling), not by making a single process larger |
| IX. Disposability | Maximize robustness with fast startup and graceful shutdown | Processes start quickly and shut down gracefully (finish current requests, release resources) |
| X. Dev/Prod Parity | Keep dev, staging, and production as similar as possible | Minimize gaps in time (deploy quickly), personnel (devs who wrote code deploy it), and tools (same backing services everywhere) |
| XI. Logs | Treat logs as event streams | Apps write logs to stdout. The execution environment captures, routes, and aggregates them (e.g., to ELK, CloudWatch, Datadog) |
| XII. Admin Processes | Run admin/management tasks as one-off processes | Database migrations, data fixes, console sessions run as one-off processes in the same environment as the app |
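Factor III (config in the environment) is the one most often violated in practice. A minimal sketch, with illustrative variable names (`DATABASE_URL`, `FEATURE_NEW_CHECKOUT` are not a standard):

```python
import os

class Config:
    # Factor III: all configuration comes from the environment, never the code.
    def __init__(self, env=None):
        env = os.environ if env is None else env
        self.database_url = env["DATABASE_URL"]             # required: fail fast if absent
        self.debug = env.get("DEBUG", "false") == "true"    # optional, with a default
        self.new_checkout = env.get("FEATURE_NEW_CHECKOUT", "off") == "on"
```

Because the app reads only the environment, swapping a local PostgreSQL for RDS (Factor IV) is just a different `DATABASE_URL`—no code change, no rebuild.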
Cloud Architecture Patterns¶
Multi-Region Deployment¶
Deploying applications across multiple geographic regions for high availability, disaster recovery, and reduced latency.
┌─────── Global DNS (Route 53 / Cloud DNS) ───────┐
│ Latency-based routing │
▼ ▼
┌─── US-East Region ───┐ ┌── EU-West Region ───┐
│ ┌─ Load Balancer ─┐ │ │ ┌─ Load Balancer ┐ │
│ └───────┬─────────┘ │ │ └──────┬─────────┘ │
│ ┌───────┴─────────┐ │ │ ┌──────┴──────────┐│
│ │ App Servers │ │ │ │ App Servers ││
│ │ (Auto-scaling) │ │ │ │ (Auto-scaling) ││
│ └───────┬─────────┘ │ │ └──────┬──────────┘│
│ ┌───────┴─────────┐ │ Cross-Region │ ┌──────┴──────────┐│
│ │ Primary DB │◄├──── Replication ───────├──│ Replica DB ││
│ └─────────────────┘ │ │ └─────────────────┘│
└──────────────────────┘ └─────────────────────┘
Strategies:
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Low |
| Pilot Light | 10-30 min | Minutes | $$ | Medium |
| Warm Standby | Minutes | Seconds | $$$ | Medium-High |
| Active-Active | ~0 (automatic) | ~0 | $$$$ | High |
- Active-Passive: One region handles all traffic; the other is a standby for failover. Simpler but wastes resources.
- Active-Active: Both regions serve traffic simultaneously. More complex (requires data synchronization, conflict resolution) but better resource utilization and lower latency.
- Pilot Light: Minimal infrastructure running in the DR region (e.g., database replica). Scale up on failover.
- Warm Standby: Scaled-down version of production running in DR region. Faster failover than pilot light.
Auto-Scaling Strategies¶
Auto-scaling adjusts capacity dynamically based on demand:
| Strategy | Trigger | Latency | Use Case |
|---|---|---|---|
| Reactive (target tracking) | Metric crosses threshold (CPU > 70%, queue depth > 100) | 2-5 min | General workloads |
| Step scaling | Metric enters defined ranges, each triggering a different scaling action | 2-5 min | Workloads with predictable scaling steps |
| Scheduled | Time-based (scale up at 9am, down at 6pm) | None (pre-provisioned) | Predictable traffic patterns (business hours) |
| Predictive | ML-based forecasting from historical patterns | None (pre-provisioned) | Recurring patterns (daily/weekly cycles) |
Scaling best practices:
- Scale out (add instances) aggressively, scale in (remove instances) conservatively
- Set cooldown periods to prevent scaling thrashing
- Use multiple metrics (CPU + request count + queue depth) for more accurate scaling decisions
- Always test scaling by simulating load before relying on it in production
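As a rough sketch of what reactive target tracking computes: capacity scales proportionally with the ratio of observed metric to target. Real implementations layer cooldowns and more conservative scale-in on top of this; the formula below is the general idea, not any provider's exact algorithm:

```python
import math

def desired_capacity(current: int, metric: float, target: float,
                     min_cap: int = 1, max_cap: int = 20) -> int:
    # Proportional target tracking: 4 instances at 90% CPU with a 60% target
    # -> ceil(4 * 90 / 60) = 6 instances. Clamped to the group's min/max.
    new = math.ceil(current * metric / target)
    return max(min_cap, min(max_cap, new))
```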
Deployment Patterns¶
| Pattern | Description | Risk Level | Rollback Speed |
|---|---|---|---|
| Rolling update | Replace instances one at a time | Medium | Medium (re-deploy) |
| Blue-Green | Maintain two identical environments; switch traffic at once | Low | Instant (switch back) |
| Canary | Route small percentage of traffic to new version, gradually increase | Low | Instant (route 0% to canary) |
| Feature flags | New code deployed to all instances but gated behind flags | Very Low | Instant (toggle flag off) |
Blue-Green deployment flow:
1. Blue environment runs current production
2. Deploy new version to Green environment
3. Run smoke tests against Green
4. Switch load balancer to route traffic to Green
5. Monitor for errors; if issues arise, switch back to Blue instantly
6. Decommission Blue (or keep as next deployment target)
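A canary rollout needs a stable way to split traffic. One common approach, sketched below without reference to any particular load balancer, hashes a user ID into a fixed bucket so each user consistently sees the same version while the percentage ramps up:

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    # Hash the user id to a stable bucket in [0, 100); sticky per user,
    # so a given user doesn't flip between versions on each request.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Rolling back is then just setting `canary_percent` to 0, which is why the table above lists canary rollback as instant.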
Resilience Patterns¶
- Circuit Breaker: When a downstream service fails repeatedly, stop calling it for a period (open circuit) instead of accumulating timeouts. After a cooldown, allow a test request (half-open). If it succeeds, close the circuit; if not, reopen.
- Bulkhead: Isolate failures by partitioning resources. If the payment service's thread pool is exhausted, the search service's thread pool is unaffected. Named after ship bulkheads that prevent a hull breach from flooding the entire vessel.
- Retry with exponential backoff: On transient failures, retry with increasing delays (1s, 2s, 4s, 8s) plus jitter (random offset to prevent thundering herd).
- Timeout: Every external call should have a timeout. Without timeouts, a hung dependency can cascade to all callers.
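The circuit breaker's closed → open → half-open state machine can be sketched in a few dozen lines; this is a minimal single-threaded version (production libraries add locking, metrics, and per-exception policies), with the clock injected so it can be tested:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable clock, for testing
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half-open"    # cooldown elapsed: allow one test request
            else:
                raise RuntimeError("circuit open")  # fail fast, no downstream call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed test request, or too many consecutive failures, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        if self.state == "half-open":
            self.state = "closed"           # test request succeeded: recover
        return result
```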
Hybrid Cloud¶
Combines on-premises infrastructure with public cloud services. Common in enterprises with existing data centers, regulatory requirements, or latency-sensitive workloads.
Technologies:
- AWS Outposts: AWS hardware in your data center
- Azure Arc: Manage on-premises, multi-cloud, and edge resources from Azure
- Google Anthos: Run GKE clusters anywhere (on-prem, AWS, Azure)
- VPN / Direct Connect / ExpressRoute: Secure, dedicated connections between on-premises and cloud
Multi-Cloud¶
Using services from multiple cloud providers to avoid vendor lock-in, leverage best-of-breed services, or meet regulatory requirements.
Challenges: Different APIs, pricing models, IAM systems, and networking models. Requires abstraction layers (Terraform, Pulumi, Crossplane) and potentially higher operational complexity. True multi-cloud (running the same workload across providers) is rare; more common is "multi-cloud by choice" (different workloads on different providers).
Serverless Architecture¶
Serverless extends beyond individual functions (FaaS) to entire architectures where the cloud provider manages all infrastructure and you pay only for what you use.
Event-Driven Architecture with Serverless¶
┌───────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐
│ API │────▶│ Lambda │────▶│ DynamoDB │ │ S3 Bucket │
│ Gateway │ │ Function │ │ (storage)│ │ (uploads) │
└───────────┘ └──────────┘ └──────────┘ └──────┬──────┘
│ Event
▼
┌───────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ CloudWatch│────▶│ Lambda │ │ Lambda │◀────│ SQS │
│ Events │ │ (cron) │ │ (process)│ │ Queue │
└───────────┘ └──────────┘ └──────────┘ └──────────┘
Common serverless event sources: API Gateway (HTTP), S3 (file uploads), SQS (queues), SNS (pub/sub), DynamoDB Streams (data changes), CloudWatch Events/EventBridge (scheduled, AWS events), Kinesis (streaming data).
Step Functions / Workflows¶
For complex multi-step workflows that require orchestration, error handling, and state management:
{
"Comment": "Order processing workflow",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
"Next": "ProcessPayment",
"Catch": [{
"ErrorEquals": ["ValidationError"],
"Next": "OrderFailed"
}]
},
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:payment",
"Next": "FulfillOrder",
"Retry": [{
"ErrorEquals": ["PaymentTimeout"],
"IntervalSeconds": 5,
"MaxAttempts": 3,
"BackoffRate": 2.0
}]
},
"FulfillOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:fulfill",
"End": true
},
"OrderFailed": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-failure",
"End": true
}
}
}
Serverless Databases¶
| Database | Type | Scaling | Pricing Model |
|---|---|---|---|
| DynamoDB | NoSQL (key-value/document) | On-demand or provisioned | Per read/write request unit |
| Aurora Serverless v2 | Relational (MySQL/PostgreSQL) | Auto-scales 0.5-128 ACUs | Per ACU-hour |
| Neon | PostgreSQL | Auto-scales, scales to zero | Per compute-hour + storage |
| PlanetScale | MySQL (Vitess) | Auto-scales | Per row read/write |
Cloud Cost Optimization (FinOps)¶
Cloud costs can spiral quickly without discipline. FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending.
Key Principles¶
- Visibility: Tag all resources, set up cost dashboards, use cost allocation reports.
- Optimization: Right-size instances, use reserved/spot/preemptible instances, delete unused resources.
- Governance: Set budgets and alerts, implement approval workflows for expensive resources.
Cost Reduction Strategies¶
| Strategy | Savings Potential | Description |
|---|---|---|
| Right-sizing | 20-40% | Match instance types to actual workload needs. Most instances are over-provisioned |
| Reserved Instances / Committed Use | 30-72% | Commit to 1-3 year usage for significant discounts |
| Savings Plans | 30-72% | More flexible than RIs: commit to $/hour spend, not specific instance types |
| Spot/Preemptible Instances | 60-90% | Use spare capacity at steep discounts for fault-tolerant workloads (batch processing, CI/CD, data pipelines) |
| Auto-scaling | Variable | Scale resources up/down based on demand. Don't pay for idle capacity |
| Storage tiering | 40-80% | Move infrequently accessed data to cheaper storage classes (S3 Glacier, Coldline) |
| Serverless | Variable | Pay only for actual execution time. Ideal for sporadic workloads |
| Scheduled scaling | 20-50% | Turn off dev/test environments during nights and weekends |
| Graviton/ARM instances | 20-40% | ARM-based instances offer better price-performance for compatible workloads |
Tagging strategy — Tags are the foundation of cost allocation. Minimum recommended tags:
| Tag Key | Purpose | Example Values |
|---|---|---|
| Environment | Separate costs by environment | production, staging, development |
| Team | Allocate costs to teams | platform, backend, data, ml |
| Service | Track costs per service | user-api, payment-service, search |
| CostCenter | Map to financial cost centers | eng-001, marketing-002 |
| Owner | Identify responsible person | alice@company.com |
# Example: AWS Cost Explorer CLI query
aws ce get-cost-and-usage \
--time-period Start=2025-01-01,End=2025-01-31 \
--granularity MONTHLY \
--metrics "BlendedCost" "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
Spot Instance Strategies¶
Spot instances offer 60-90% discounts but can be reclaimed with 2 minutes' notice. Effective strategies:
- Diversify instance types: Request multiple instance types in your auto-scaling group; if one type is reclaimed, others may still be available
- Use capacity-optimized allocation: Let the provider choose the instance type with the most available capacity
- Handle interruptions gracefully: Use the 2-minute warning to drain connections and checkpoint work
- Mix on-demand and spot: Run baseline on on-demand, burst on spot (e.g., 30% on-demand, 70% spot)
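On EC2, the interruption notice appears at the instance metadata path `/latest/meta-data/spot/instance-action` roughly two minutes before reclamation. A sketch of a watcher is below; the HTTP fetch is injected as a callable so the logic is testable off-instance (on a real instance it would GET `http://169.254.169.254` + the path):

```python
import json

METADATA_PATH = "/latest/meta-data/spot/instance-action"

def interruption_action(fetch):
    """Return the pending spot action as a dict, or None if nothing is scheduled.

    `fetch(path)` should return the response body as a string, or None when
    the path 404s (i.e., no interruption notice is pending).
    """
    body = fetch(METADATA_PATH)
    if body is None:
        return None
    # Body looks like {"action": "terminate", "time": "2025-01-01T00:00:00Z"}.
    # On receipt: drain in-flight connections and checkpoint work within ~2 min.
    return json.loads(body)
```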
Cloud Migration Strategies¶
The 7 R's framework for migrating workloads to the cloud:
| Strategy | Description | Effort | Risk | When to Use |
|---|---|---|---|---|
| Rehost (Lift & Shift) | Move as-is to cloud VMs | Low | Low | Quick migration, minimal changes |
| Replatform (Lift & Reshape) | Minor optimization during migration | Medium | Low-Medium | Use managed services (RDS instead of self-managed DB) |
| Repurchase | Replace with SaaS product | Low | Medium | On-prem CRM → Salesforce, on-prem email → Gmail |
| Refactor / Re-architect | Rewrite to be cloud-native | High | High | Critical apps that benefit from cloud-native features |
| Retire | Decommission applications no longer needed | Low | Low | Reduce portfolio before migration |
| Retain | Keep on-premises (for now) | None | None | Regulatory, too complex, or recently upgraded |
| Relocate | Move to cloud without changes (VMware on cloud) | Low | Low | VMware environments → VMware Cloud on AWS |
Migration phases:
1. Assessment: Inventory applications, map dependencies, assess cloud readiness
2. Planning: Choose migration strategy per application, design target architecture, build business case
3. Migration: Execute migration in waves, validate functionality
4. Optimization: Right-size resources, implement auto-scaling, optimize costs
Database migration is typically the most complex part. AWS Database Migration Service (DMS) supports heterogeneous migrations (Oracle → PostgreSQL) with the Schema Conversion Tool (SCT). For homogeneous migrations (MySQL → Aurora MySQL), native replication can minimize downtime to seconds.