Cloud Computing

Cloud computing is the on-demand delivery of computing resources—servers, storage, databases, networking, software, analytics, and intelligence—over the internet ("the cloud") with pay-as-you-go pricing. Instead of owning and maintaining physical data centers and servers, organizations rent access to these resources from a cloud provider.

The National Institute of Standards and Technology (NIST) defines five essential characteristics of cloud computing:

  1. On-demand self-service: Provision resources automatically without human interaction with the provider.
  2. Broad network access: Resources available over the network via standard mechanisms (HTTP, APIs).
  3. Resource pooling: Provider resources are pooled to serve multiple tenants using a multi-tenant model.
  4. Rapid elasticity: Capabilities can be elastically provisioned and released to scale with demand.
  5. Measured service: Resource usage is monitored, controlled, and reported, enabling pay-per-use billing.

Cloud Deployment Models

Before diving into service models, it's important to understand where the cloud infrastructure lives:

Model Description Use Cases
Public Cloud Resources owned and operated by a third-party provider, shared across tenants Startups, SaaS, variable workloads, rapid prototyping
Private Cloud Dedicated infrastructure for a single organization (on-prem or hosted) Regulatory compliance, sensitive data, predictable workloads
Hybrid Cloud Combination of public and private, with orchestration between them Enterprise (burst to public for peak load, sensitive data stays private)
Multi-Cloud Using multiple public cloud providers simultaneously Vendor lock-in avoidance, best-of-breed services, regulatory requirements

Hybrid cloud is the most common enterprise model. An organization might run its core banking application on a private cloud for regulatory compliance while using AWS for customer-facing web applications and GCP BigQuery for analytics. The key challenges are data synchronization, identity federation, and consistent networking across environments.

Cloud Service Models

Cloud services are categorized into layers based on how much the provider manages versus how much the customer manages:

┌─────────────────────────────────────────────────────────────────┐
│                       Responsibility Model                      │
├────────────┬────────────┬────────────┬──────────────────────────┤
│ On-Premise │   IaaS     │   PaaS     │  SaaS                    │
├────────────┼────────────┼────────────┼──────────────────────────┤
│ Apps    YOU│ Apps    YOU│ Apps    YOU│ Apps           PROVIDER  │
│ Data    YOU│ Data    YOU│ Data    YOU│ Data           PROVIDER  │
│ Runtime YOU│ Runtime YOU│ Runtime PRO│ Runtime        PROVIDER  │
│ Middle  YOU│ Middle  YOU│ Middle  PRO│ Middleware     PROVIDER  │
│ OS      YOU│ OS      YOU│ OS      PRO│ OS             PROVIDER  │
│ Virtual YOU│ Virtual PRO│ Virtual PRO│ Virtualization PROVIDER  │
│ Servers YOU│ Servers PRO│ Servers PRO│ Servers        PROVIDER  │
│ Storage YOU│ Storage PRO│ Storage PRO│ Storage        PROVIDER  │
│ Network YOU│ Network PRO│ Network PRO│ Networking     PROVIDER  │
└────────────┴────────────┴────────────┴──────────────────────────┘

YOU = Customer manages; PRO = Provider manages

The fundamental trade-off across all service models is control versus operational burden. As you move from IaaS to SaaS, you give up customization and control but gain operational simplicity and reduced staffing needs.

Infrastructure as a Service (IaaS)

IaaS provides virtualized computing resources over the internet. The provider manages the physical hardware, networking, and virtualization layer; the customer manages everything from the OS upward. This is the closest model to traditional IT but without the capital expenditure of physical hardware.

Feature Description
What you get Virtual machines, networks, storage, firewalls
What you manage OS, middleware, runtime, applications, data
Scaling Manual or auto-scaling of VMs
Use cases Custom environments, legacy app migration (lift-and-shift), dev/test environments
Examples AWS EC2, Google Compute Engine, Azure Virtual Machines, DigitalOcean Droplets

IaaS is the right choice when you need full control over the OS and runtime environment—for example, running specialized software that requires kernel-level configuration, GPUs for ML training, or legacy applications that can't be easily containerized. The downside is that you're responsible for patching the OS, configuring security groups, managing disk space, and handling instance failures.

# Example: Launching an EC2 instance with AWS CLI
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.medium \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0 \
  --count 1 \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=my-server}]'

Instance type selection is critical for cost and performance. Cloud providers offer instance families optimized for different workloads:

Family Optimized For Examples (AWS) Use Cases
General Purpose Balanced CPU/memory t3, m6i, m7g Web servers, app servers, small databases
Compute Optimized High-performance CPUs c6i, c7g Batch processing, scientific modeling, gaming
Memory Optimized Large memory footprint r6i, x2idn In-memory databases, real-time analytics
Storage Optimized High sequential I/O i3, d3 Data warehousing, distributed filesystems
Accelerated (GPU) GPU/FPGA workloads p4d, g5, inf2 ML training/inference, video encoding
ARM-based (Graviton) Cost-efficiency t4g, m7g, c7g 20-40% better price-performance for compatible workloads
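
Instance-family choice translates directly into monthly spend. A quick sketch of the arithmetic, using hypothetical on-demand rates (real prices change constantly; check the provider's pricing page):

```python
# Illustrative monthly cost estimate. The hourly rates below are assumed
# placeholder values, NOT real AWS prices.
HOURLY_RATES = {
    "t3.medium": 0.0416,   # general purpose (assumed rate)
    "c6i.large": 0.085,    # compute optimized (assumed rate)
    "r6i.large": 0.126,    # memory optimized (assumed rate)
}

HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(instance_type: str, count: int = 1) -> float:
    """Estimated on-demand cost for `count` instances running 24/7."""
    return round(HOURLY_RATES[instance_type] * HOURS_PER_MONTH * count, 2)

for itype in HOURLY_RATES:
    print(f"{itype}: ${monthly_cost(itype)}/month")
```

The same arithmetic is how reserved instances and savings plans are evaluated: multiply the discounted rate by committed hours and compare against expected on-demand usage.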

Platform as a Service (PaaS)

PaaS provides a platform allowing customers to develop, run, and manage applications without dealing with infrastructure. The provider manages servers, networking, storage, OS, and runtime. You focus exclusively on your application code and data.

Feature Description
What you get Managed runtime, databases, development tools
What you manage Application code and data
Scaling Automatic (usually)
Use cases Web applications, APIs, microservices, rapid prototyping
Examples Heroku, Google App Engine, AWS Elastic Beanstalk, Azure App Service, Railway, Render

PaaS dramatically reduces time-to-deploy. A developer can push code to a Git repository and have it running in production within minutes, without configuring a single server. The trade-off is reduced flexibility: you're constrained to the runtimes, languages, and configurations the platform supports. If you need a specific Linux kernel version or a custom native library, PaaS may not work.

When PaaS falls short: PaaS platforms impose constraints on execution time, memory, filesystem access, and network configuration. Applications that require long-running background processes, custom binary dependencies, or specific network topologies often outgrow PaaS and need to migrate to containers (CaaS) or IaaS.
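
The constraints above shape how PaaS-friendly applications are written: all configuration comes from environment variables and the platform injects the listening port. A minimal sketch using only the Python standard library (names and default port are illustrative):

```python
# Minimal sketch of a PaaS-deployable web service. The platform sets the
# PORT environment variable; nothing here provisions or configures a server.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = int(os.environ.get("PORT", "8080"))  # platform-assigned port

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello from a PaaS\n")

def main() -> None:
    # The platform supervises and restarts this process as needed.
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```

Pushing this file (plus a dependency manifest) to the platform is the entire deployment.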

Software as a Service (SaaS)

SaaS delivers fully managed applications over the internet. The provider manages everything; the customer simply uses the software through a web browser or API.

Feature Description
What you get Complete application accessible via browser or API
What you manage Configuration, user data
Use cases Email, CRM, collaboration, productivity
Examples Gmail, Salesforce, Slack, GitHub, Jira, Datadog

SaaS is the dominant model for business tools. The key consideration for engineering teams is integration: how well does the SaaS product expose APIs, support webhooks, and integrate with your existing toolchain? Data portability and vendor lock-in are significant concerns—can you export your data if you switch providers?
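
Webhook integration is the usual glue between a SaaS product and your own systems, and most providers sign payloads with an HMAC so you can reject forged requests. The sketch below follows GitHub's `X-Hub-Signature-256` format as one representative scheme; other vendors use the same pattern with different header names:

```python
# Sketch: verifying a SaaS webhook payload signed with HMAC-SHA256,
# following the GitHub-style "sha256=<hexdigest>" header format.
import hashlib
import hmac

def verify_webhook(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Return True if the signature header matches the payload."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest is constant-time, preventing timing attacks
    return hmac.compare_digest(expected, signature_header)
```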

Function as a Service (FaaS) / Serverless

FaaS is an event-driven execution model where the provider dynamically manages the allocation of computing resources. You deploy individual functions, and the provider runs them in response to events. There are no servers to provision, manage, or scale—the provider handles everything.

Feature Description
What you get Event-driven function execution, automatic scaling to zero
What you manage Function code (and sometimes container images)
Scaling Automatic, scales to zero when idle
Billing Per-invocation and per-duration (e.g., per ms of execution)
Limitations Cold starts, execution time limits (15 min on AWS Lambda), stateless
Examples AWS Lambda, Google Cloud Functions, Azure Functions, Cloudflare Workers

# AWS Lambda function example (Python)
import json

def handler(event, context):
    """Process an API Gateway event."""
    # queryStringParameters is None (not {}) when there is no query string
    name = (event.get('queryStringParameters') or {}).get('name', 'World')
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps({'message': f'Hello, {name}!'})
    }

Cold starts are the most significant operational concern with FaaS. When a function hasn't been invoked recently, the provider must spin up a new execution environment (download code, initialize runtime, execute initialization code). This adds latency—typically 100ms-2s depending on runtime, memory size, and package size. Mitigation strategies:

  • Provisioned concurrency: Keep a minimum number of warm instances (costs more but eliminates cold starts)
  • Smaller deployment packages: Minimize dependencies to reduce initialization time
  • Choose faster runtimes: Go and Rust cold-start in ~50ms; Python and Node.js in ~200ms; Java/C# in ~1-3s
  • Keep initialization outside the handler: Module-level code runs once per cold start, not per invocation
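
The last point deserves a concrete shape. In the sketch below, `create_client` is a hypothetical stand-in for any expensive setup (a DB connection pool, an SDK client, a model load); placing the call at module level means it runs once per cold start and is reused by every warm invocation:

```python
# Sketch: expensive setup at module scope runs once per execution
# environment, not once per invocation.
import json
import time

def create_client():
    """Simulate slow initialization (e.g., opening a connection pool)."""
    time.sleep(0.1)
    return {"connected": True}

CLIENT = create_client()  # module scope: executed once per cold start

def handler(event, context):
    # Per-invocation work only; CLIENT is already warm here.
    return {"statusCode": 200,
            "body": json.dumps({"client": CLIENT["connected"]})}
```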

Serverless anti-patterns: Not everything should be serverless. Avoid FaaS for long-running processes (use containers), high-throughput steady-state workloads (dedicated compute is cheaper), or applications requiring local state or filesystem access.

Container as a Service (CaaS)

CaaS is the sweet spot between IaaS and PaaS—you package your application in a container (Docker image) and the platform handles orchestration, scaling, networking, and infrastructure management. You control the runtime environment (anything that fits in a container) without managing servers.

Feature Description
What you get Container orchestration, networking, auto-scaling, service discovery
What you manage Container images (Dockerfile), application configuration
Scaling Automatic (horizontal pod autoscaling, scale-to-zero for some platforms)
Examples AWS ECS/Fargate, Google Cloud Run, Azure Container Apps, Fly.io

CaaS is increasingly the default deployment model for production microservices. It provides the flexibility of IaaS (run anything in your container) with the operational simplicity of PaaS (no server management). The two main flavors are:

  • Kubernetes-based (EKS, GKE, AKS): Full Kubernetes API, maximum flexibility, higher operational complexity
  • Managed container platforms (Fargate, Cloud Run): Simpler abstractions, less control, lower operational burden

# Deploy a container to Google Cloud Run (scales to zero)
gcloud run deploy my-service \
  --image gcr.io/my-project/my-app:latest \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --memory 512Mi \
  --cpu 1 \
  --min-instances 0 \
  --max-instances 100 \
  --set-env-vars "DATABASE_URL=postgres://..." \
  --port 8080

Other Service Models

Model Description Examples
DBaaS (Database as a Service) Managed database engines with automated backups, patching, and scaling AWS RDS, Google Cloud SQL, Azure Cosmos DB, PlanetScale, Neon
BaaS (Backend as a Service) Pre-built backend features (auth, storage, push notifications) Firebase, Supabase, AWS Amplify
AIaaS (AI as a Service) Managed AI/ML models and APIs OpenAI API, AWS Bedrock, Google Vertex AI, Azure OpenAI Service

Choosing a Service Model

Decision Tree:

Need full OS/kernel control?           → IaaS (EC2, Compute Engine)
    │ No
    ▼
Already containerized?                  → CaaS (Fargate, Cloud Run, EKS)
    │ No
    ▼
Event-driven, short-lived workload?     → FaaS (Lambda, Cloud Functions)
    │ No
    ▼
Standard web app/API?                   → PaaS (Heroku, App Service, Render)
    │ No
    ▼
Just need managed software?             → SaaS (Datadog, GitHub, Slack)
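
The decision tree can be sketched as a function. The boolean flags are simplifications; real selection also weighs cost, team skills, and lock-in:

```python
# Sketch of the service-model decision tree above, checked top to bottom.
def choose_service_model(needs_os_control: bool = False,
                         containerized: bool = False,
                         event_driven: bool = False,
                         standard_web_app: bool = False) -> str:
    if needs_os_control:
        return "IaaS"    # EC2, Compute Engine
    if containerized:
        return "CaaS"    # Fargate, Cloud Run, EKS
    if event_driven:
        return "FaaS"    # Lambda, Cloud Functions
    if standard_web_app:
        return "PaaS"    # Heroku, App Service, Render
    return "SaaS"        # Datadog, GitHub, Slack

print(choose_service_model(containerized=True))  # → CaaS
```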

Major Cloud Providers

Amazon Web Services (AWS)

The largest cloud provider (approximately 31% market share), AWS offers 200+ services across compute, storage, databases, networking, AI/ML, analytics, and more. AWS's strength is its breadth—there is a managed service for virtually every infrastructure need.

Core Services:

Category Service Description
Compute EC2 Virtual machines (instances) with configurable CPU, memory, storage
Lambda Serverless functions (FaaS)
ECS/EKS Container orchestration (Docker/Kubernetes)
Fargate Serverless containers (no instance management)
Storage S3 Object storage (unlimited, 11 9s durability)
EBS Block storage for EC2 (SSD/HDD volumes)
EFS Managed NFS file system
Glacier Archival storage (low cost, high retrieval latency)
Database RDS Managed relational DB (PostgreSQL, MySQL, Oracle, SQL Server)
DynamoDB Managed NoSQL (key-value/document, single-digit ms latency)
ElastiCache Managed Redis/Memcached
Aurora AWS-optimized MySQL/PostgreSQL (up to 5x the throughput of standard MySQL)
Networking VPC Virtual private cloud (isolated network)
Route 53 DNS service
CloudFront CDN
ELB/ALB/NLB Load balancers (Layer 4 and Layer 7)
Security IAM Identity and access management
KMS Key management service
WAF Web application firewall
Secrets Manager Secrets storage and rotation
Messaging SQS Managed message queue
SNS Pub/sub messaging
EventBridge Event bus for event-driven architectures

Google Cloud Platform (GCP)

Known for strong data analytics, AI/ML capabilities, and Kubernetes (GKE was the first managed Kubernetes service—Google created Kubernetes). GCP's pricing model is often simpler than AWS's, with sustained-use discounts applied automatically.

Core Services:

Category Service Description
Compute Compute Engine Virtual machines
Cloud Run Serverless containers (scales to zero)
GKE Managed Kubernetes
Cloud Functions Serverless functions
Storage Cloud Storage Object storage
Persistent Disk Block storage
Filestore Managed NFS
Database Cloud SQL Managed relational DB
Firestore NoSQL document DB
Cloud Spanner Globally distributed relational DB (horizontally scalable + ACID)
Bigtable Wide-column NoSQL (HBase-compatible)
Data/AI BigQuery Serverless data warehouse (SQL analytics on petabytes)
Vertex AI Managed ML platform
Pub/Sub Messaging service

Microsoft Azure

Strong in enterprise and hybrid cloud, tightly integrated with the Microsoft ecosystem (Active Directory, Office 365, .NET). Azure's competitive advantage is enterprise customers who already use Microsoft products.

Core Services:

Category Service Description
Compute Virtual Machines VMs
App Service PaaS for web apps
AKS Managed Kubernetes
Azure Functions Serverless functions
Storage Blob Storage Object storage
Azure Files Managed file shares (SMB/NFS)
Database Azure SQL Managed SQL Server
Cosmos DB Globally distributed multi-model NoSQL
Identity Azure AD (Entra ID) Enterprise identity and access management

Provider Comparison

Dimension AWS GCP Azure
Market share ~31% ~12% ~24%
Strengths Breadth of services, ecosystem Data/AI, Kubernetes, pricing Enterprise, hybrid, Microsoft integration
Pricing model Complex, many dimensions Simpler, sustained discounts Enterprise agreements, hybrid benefit
Global regions 33+ regions 40+ regions 60+ regions
Best for Startups to enterprise, general purpose Data-intensive, ML, containerized workloads Enterprise, .NET shops, hybrid cloud

Cloud-Agnostic Tools

To avoid vendor lock-in and manage multi-cloud environments, teams use abstraction layers:

Tool Purpose Description
Terraform Infrastructure as Code Declarative HCL language, provider ecosystem for all major clouds, state management
Pulumi Infrastructure as Code Real programming languages (Python, TypeScript, Go) instead of DSL, strong typing
Crossplane Kubernetes-native IaC Manage cloud resources as Kubernetes custom resources (CRDs)
Helm Kubernetes package manager Template and deploy Kubernetes applications consistently across providers

# Terraform example: Provision infrastructure across providers
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.medium"

  tags = {
    Name        = "web-server"
    Environment = "production"
    Team        = "platform"
  }
}

resource "aws_s3_bucket" "assets" {
  bucket = "my-app-assets-prod"

  tags = {
    Environment = "production"
  }
}

Cloud Networking

Understanding cloud networking is fundamental to deploying secure, scalable applications. Cloud networking virtualizes traditional data center networking concepts and adds cloud-specific constructs.

Virtual Private Cloud (VPC)

A VPC is a logically isolated virtual network within the cloud provider's infrastructure. It provides complete control over IP addressing, subnets, routing, and security. Nearly every compute resource you launch in the cloud (VMs, containers, managed databases) runs inside a VPC.

┌───────────────────────── VPC (10.0.0.0/16) ─────────────────────┐
│                                                                  │
│  ┌─── Availability Zone A ───┐    ┌─── Availability Zone B ───┐  │
│  │                           │    │                           │  │
│  │  ┌─ Public Subnet ─────┐  │    │  ┌─ Public Subnet ─────┐  │  │
│  │  │  10.0.1.0/24        │  │    │  │  10.0.3.0/24        │  │  │
│  │  │  ┌───────┐ ┌──────┐ │  │    │  │  ┌───────┐ ┌──────┐ │  │  │
│  │  │  │ Web-1 │ │ NAT  │ │  │    │  │  │ Web-2 │ │ NAT  │ │  │  │
│  │  │  └───────┘ └──────┘ │  │    │  │  └───────┘ └──────┘ │  │  │
│  │  └─────────────────────┘  │    │  └─────────────────────┘  │  │
│  │                           │    │                           │  │
│  │  ┌─ Private Subnet ────┐  │    │  ┌─ Private Subnet ────┐  │  │
│  │  │  10.0.2.0/24        │  │    │  │  10.0.4.0/24        │  │  │
│  │  │  ┌───────┐ ┌──────┐ │  │    │  │  ┌───────┐ ┌──────┐ │  │  │
│  │  │  │ App-1 │ │ DB-1 │ │  │    │  │  │ App-2 │ │ DB-2 │ │  │  │
│  │  │  └───────┘ └──────┘ │  │    │  │  └───────┘ └──────┘ │  │  │
│  │  └─────────────────────┘  │    │  └─────────────────────┘  │  │
│  └───────────────────────────┘    └───────────────────────────┘  │
│                                                                  │
│  ┌── Internet Gateway ──┐    ┌────── Route Tables ──────┐        │
│  │  Connects VPC to     │    │  Public:  0.0.0.0/0      │        │
│  │  the internet        │    │    → Internet Gateway    │        │
│  └──────────────────────┘    │  Private: 0.0.0.0/0      │        │
│                              │    → NAT Gateway         │        │
│                              └──────────────────────────┘        │
└──────────────────────────────────────────────────────────────────┘

Key components:

  • Subnets: Subdivisions of a VPC's IP range. Public subnets have routes to an internet gateway; private subnets do not (they access the internet through a NAT gateway). Best practice: place application servers and databases in private subnets; only load balancers and bastion hosts in public subnets.
  • Internet Gateway (IGW): Allows resources in public subnets to communicate with the internet. It's horizontally scaled, redundant, and highly available—no bandwidth constraints.
  • NAT Gateway: Enables resources in private subnets to initiate outbound internet connections (for updates, API calls) without being directly accessible from the internet. NAT gateways are charged per hour and per GB processed—they can become a significant cost for data-intensive workloads.
  • Route Tables: Rules that determine where network traffic is directed. Each subnet is associated with a route table. A public subnet's route table has 0.0.0.0/0 → IGW; a private subnet's has 0.0.0.0/0 → NAT Gateway.
  • Security Groups: Stateful firewalls at the instance level. Rules specify allowed inbound/outbound traffic by protocol, port, and source/destination.
  • Network ACLs (NACLs): Stateless firewalls at the subnet level. Act as a second layer of defense.
  • VPC Peering: Connects two VPCs so they can communicate using private IPs, even across regions or accounts. Non-transitive (A↔B and B↔C doesn't mean A↔C).
  • Transit Gateway: Hub-and-spoke model for connecting multiple VPCs and on-premises networks. Solves the scaling problem of VPC peering (N VPCs would need N(N-1)/2 peering connections vs N transit gateway attachments).
  • VPC Endpoints: Private connections to AWS services (S3, DynamoDB, etc.) that don't traverse the internet. Gateway endpoints (S3, DynamoDB) are free; interface endpoints (most other services) use PrivateLink and cost per hour + per GB.
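
The Transit Gateway scaling argument is just arithmetic: a full mesh of peered VPCs needs N(N-1)/2 connections, while a transit gateway needs one attachment per VPC:

```python
# Full-mesh VPC peering vs. a hub-and-spoke transit gateway.
def peering_connections(n_vpcs: int) -> int:
    """Peering connections needed for a full mesh of N VPCs."""
    return n_vpcs * (n_vpcs - 1) // 2

def tgw_attachments(n_vpcs: int) -> int:
    """Attachments needed with a transit gateway (one per VPC)."""
    return n_vpcs

for n in (3, 10, 50):
    print(f"{n} VPCs: {peering_connections(n)} peering connections "
          f"vs {tgw_attachments(n)} TGW attachments")
# 10 VPCs already need 45 peering connections; 50 VPCs need 1,225.
```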

Security Groups vs. NACLs

Feature Security Groups Network ACLs
Level Instance (ENI) Subnet
Statefulness Stateful (return traffic auto-allowed) Stateless (must explicitly allow return traffic)
Rules Allow rules only Allow and deny rules
Evaluation All rules evaluated together Rules evaluated in order (lowest number first)
Default Deny all inbound, allow all outbound Allow all inbound and outbound

Best practice: Use security groups as your primary firewall (they're easier to manage and stateful). Use NACLs as a defense-in-depth measure for subnet-level blocking (e.g., blocking known malicious IP ranges).

DNS and Traffic Routing

Cloud DNS services (Route 53, Cloud DNS) do more than resolve domain names—they're intelligent traffic routers:

Routing Policy Description Use Case
Simple Single record, single endpoint Small applications with one server
Weighted Distribute traffic by percentage across endpoints Canary deployments (95% to v1, 5% to v2)
Latency-based Route to the lowest-latency region Global applications (serve US users from us-east, EU from eu-west)
Failover Active-passive: route to secondary if primary fails health check Disaster recovery
Geolocation Route based on user's geographic location Compliance (EU data stays in EU), localized content
Multi-value answer Return multiple healthy endpoints (client-side load balancing) Simple HA without a load balancer
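
A weighted policy's steady-state behavior is easy to simulate. The endpoint names below are hypothetical, and `random.choices` stands in for the resolver-side weighted pick:

```python
# Simulating a 95/5 weighted (canary) routing policy.
import random

ENDPOINTS = [("v1.example.internal", 95), ("v2.example.internal", 5)]

def route(rng: random.Random) -> str:
    """Pick an endpoint with probability proportional to its weight."""
    names = [name for name, _ in ENDPOINTS]
    weights = [weight for _, weight in ENDPOINTS]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducibility
hits = sum(route(rng) == "v2.example.internal" for _ in range(10_000))
print(f"canary (v2) received {hits / 100:.1f}% of requests")  # ~5%
```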

Content Delivery Networks (CDNs)

CDNs cache content at edge locations close to users, dramatically reducing latency for static and dynamic content. Major CDNs operate hundreds of points of presence (PoPs) worldwide.

Without CDN:
  User (Tokyo) → Origin Server (us-east-1) = ~200ms latency

With CDN:
  User (Tokyo) → CDN Edge (Tokyo PoP) = ~10ms latency (cache hit)
  User (Tokyo) → CDN Edge (Tokyo PoP) → Origin (us-east-1) = ~210ms (cache miss, then cached)
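
The payoff depends on the cache hit ratio: expected latency is a weighted average of the hit and miss paths, using the Tokyo numbers above:

```python
# Expected latency as a function of cache hit ratio
# (10 ms on a hit, 210 ms on a miss, as in the example above).
def effective_latency_ms(hit_ratio: float,
                         hit_ms: float = 10,
                         miss_ms: float = 210) -> float:
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms

for ratio in (0.0, 0.9, 0.99):
    print(f"hit ratio {ratio:.0%}: {effective_latency_ms(ratio):.0f} ms")
# A 90% hit ratio already brings average latency from 210 ms down to 30 ms.
```
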

CDN Feature Description
Static caching HTML, CSS, JS, images cached at edge locations
Dynamic acceleration Optimized routing and persistent connections to origin for dynamic content
SSL/TLS termination Terminate TLS at the edge, reducing origin load
DDoS protection Absorb volumetric attacks at the edge before they reach origin
Edge compute Run code at edge locations (CloudFront Functions, Cloudflare Workers, Vercel Edge Functions)
Cache invalidation Purge specific paths or wildcard patterns when content changes

CDN providers: CloudFront (AWS), Cloud CDN (GCP), Azure CDN, Cloudflare, Fastly, Akamai.

Load Balancing

Cloud providers offer managed load balancers that distribute traffic across multiple backend targets:

Type AWS Service Layer Use Case
Application LB ALB Layer 7 (HTTP/HTTPS) HTTP routing (path-based, host-based), WebSocket, gRPC
Network LB NLB Layer 4 (TCP/UDP) Ultra-low latency, static IPs, millions of RPS
Gateway LB GWLB Layer 3 Inline network appliances (firewalls, IDS/IPS)

ALB routing example: An ALB can route /api/* to your backend service, /static/* to an S3 bucket, and everything else to your frontend service—all from a single endpoint.

Service Mesh

For microservices architectures, a service mesh provides infrastructure-level control over service-to-service communication:

Without service mesh:              With service mesh (Istio/Linkerd):

┌─────────┐   HTTP   ┌─────────┐   ┌─────────┐ ┌───────┐   ┌───────┐ ┌─────────┐
│ Service │ ──────── │ Service │   │ Service │─│ Envoy │───│ Envoy │─│ Service │
│    A    │          │    B    │   │    A    │ │ proxy │   │ proxy │ │    B    │
└─────────┘          └─────────┘   └─────────┘ └───────┘   └───────┘ └─────────┘
                                   Sidecar proxies handle: mTLS, retries,
                                   circuit breaking, observability, traffic control

Service meshes provide: mutual TLS (encrypted service-to-service communication), traffic management (canary releases, traffic splitting), observability (distributed tracing, metrics), resilience (retries, circuit breaking, timeouts), and access control (authorization policies).

Identity and Access Management (IAM)

IAM is the framework for managing who (identity) can do what (permissions) on which resources in the cloud. Every cloud provider has an IAM system; AWS IAM is the most widely referenced.

Core IAM Concepts

  • Users: Represent individual people or service accounts. Each has credentials (password, access keys). Should map 1:1 to humans; never share user accounts.
  • Groups: Collections of users. Permissions assigned to groups apply to all members. Example: developers group has read access to production, write access to staging.
  • Roles: Identities with permissions that can be assumed by users, services, or external identities. Unlike users, roles don't have permanent credentials—they provide temporary security tokens via AWS STS (Security Token Service).
  • Policies: JSON documents that define permissions. Attached to users, groups, or roles.

The Principle of Least Privilege

Always grant the minimum permissions necessary to perform a task. This is the single most important IAM principle.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-bucket/*",
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "us-east-1"
        },
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "*"
    }
  ]
}

IAM policy evaluation logic: When AWS evaluates a request, it follows this order:

  1. Explicit deny: Any explicit deny in any policy wins (overrides everything)
  2. Organizations SCPs: Service control policies set the maximum permissions boundary
  3. Resource-based policies: Policies attached to resources (S3 bucket policies, etc.)
  4. Identity-based policies: Policies attached to the user/role making the request
  5. Permissions boundaries: Maximum permissions an identity can have
  6. Session policies: Limit permissions for a temporary session
  7. Default deny: If nothing explicitly allows the action, it's denied
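
A drastically simplified model of that evaluation order makes the precedence concrete. This sketch handles only identity-based Allow/Deny statements with exact-string or bare-`*` matching; SCPs, permissions boundaries, session policies, `Condition` blocks, and ARN wildcard expansion are all omitted:

```python
# Simplified IAM-style evaluation: explicit deny wins, then any matching
# allow, then default deny. Matching is exact-string or bare "*" only.
def is_allowed(policies: list, action: str, resource: str) -> bool:
    statements = [s for p in policies for s in p["Statement"]]

    def matches(stmt) -> bool:
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        return ((action in actions or "*" in actions)
                and (resource in resources or "*" in resources))

    if any(s["Effect"] == "Deny" and matches(s) for s in statements):
        return False   # explicit deny overrides everything
    if any(s["Effect"] == "Allow" and matches(s) for s in statements):
        return True    # an explicit allow grants access
    return False       # default deny
```

Run against a policy like the one above, `s3:GetObject` succeeds while `s3:DeleteObject` is blocked by the explicit Deny even though a broad Allow could exist elsewhere.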

RBAC vs. ABAC

Approach Description Example
RBAC (Role-Based) Permissions assigned to roles; users assume roles database-admin role has full RDS access
ABAC (Attribute-Based) Permissions based on tags/attributes of resources and principals Users with tag team=data can access resources with tag team=data

ABAC scales better than RBAC in large organizations. Instead of creating a new role for every team-resource combination, you create policies based on tag matching. However, ABAC requires disciplined tagging—if resources aren't tagged correctly, access control breaks.
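
The tag-matching rule itself is tiny, which is exactly why ABAC scales. A sketch, failing closed when tags are missing:

```python
# Sketch of ABAC tag matching: access is granted when the principal's
# team tag equals the resource's team tag; no per-team roles needed.
def abac_allows(principal_tags: dict, resource_tags: dict,
                key: str = "team") -> bool:
    # Missing tags fail closed -- undisciplined tagging breaks access,
    # which is the operational risk noted above.
    return (key in principal_tags
            and principal_tags.get(key) == resource_tags.get(key))

assert abac_allows({"team": "data"}, {"team": "data"})
assert not abac_allows({"team": "data"}, {"team": "web"})
assert not abac_allows({}, {"team": "data"})
```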

Federation and SSO

Federation allows external identities (corporate Active Directory, Google Workspace, Okta) to access cloud resources without creating individual IAM users:

  • SAML 2.0: Enterprise standard for SSO. Corporate IdP (Okta, Azure AD) authenticates user, sends SAML assertion to AWS, AWS grants temporary credentials based on mapped role.
  • OIDC (OpenID Connect): Modern standard used by GitHub Actions, GitLab CI, and web applications. Allows workloads to assume cloud roles without long-lived secrets.
  • AWS IAM Identity Center (SSO): Centralized SSO for multiple AWS accounts, integrates with corporate IdPs.

# GitHub Actions OIDC federation — no AWS access keys needed
jobs:
  deploy:
    permissions:
      id-token: write  # Required for OIDC
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy
          aws-region: us-east-1
      - run: aws s3 sync ./build s3://my-app-bucket/

Service-to-Service Authentication

In microservices architectures, services need to authenticate with each other:

Method Description Complexity
IAM roles Services assume IAM roles to access cloud resources Low (cloud-native)
Service accounts Dedicated identities for services (GCP service accounts, K8s service accounts) Low-Medium
mTLS Mutual TLS: both client and server present certificates Medium-High (use service mesh)
Workload Identity Map Kubernetes service accounts to cloud IAM roles (IRSA on EKS, Workload Identity on GKE) Medium
SPIFFE/SPIRE Open standard for service identity High (most flexible)

Secrets Management

Secrets (API keys, database passwords, TLS certificates) should never be hard-coded in source code, baked into config files checked into source control, or scattered across unmanaged environment variables; use a dedicated secrets store:

Tool Description Features
HashiCorp Vault Industry-standard secrets manager Dynamic secrets, encryption as a service, PKI, cloud-agnostic
AWS Secrets Manager Managed secrets with auto-rotation RDS password rotation, cross-account sharing, $0.40/secret/month
AWS SSM Parameter Store Simpler key-value store Free tier (standard params), hierarchical organization
GCP Secret Manager GCP-native secrets IAM integration, versioning, automatic replication
External Secrets Operator Kubernetes operator that syncs secrets from external stores Bridges Vault/cloud secrets into K8s secrets

Cloud Security

Shared Responsibility Model

Cloud security is a shared responsibility between the provider and the customer. The exact boundary depends on the service model:

┌────────────────────────────────────────────────────┐
│              Customer Responsibility                │
│  ┌──────────┬───────────┬───────────┬────────────┐ │
│  │   IaaS   │   CaaS    │   PaaS    │    SaaS    │ │
│  ├──────────┼───────────┼───────────┼────────────┤ │
│  │ Data     │ Data      │ Data      │ Data access│ │
│  │ Apps     │ Container │ App code  │ User config│ │
│  │ OS/patch │ images    │           │            │ │
│  │ Network  │ Cluster   │           │            │ │
│  │ config   │ config    │           │            │ │
│  └──────────┴───────────┴───────────┴────────────┘ │
├────────────────────────────────────────────────────┤
│              Provider Responsibility                │
│  Physical security, hardware, hypervisor,           │
│  managed service infrastructure, global network     │
└────────────────────────────────────────────────────┘

Encryption

Type Description AWS Service
At rest Data encrypted on disk/storage KMS, S3 server-side encryption, EBS encryption
In transit Data encrypted over the network (TLS) ACM (certificate management), ALB TLS termination
Client-side Data encrypted before sending to cloud AWS Encryption SDK, client-side S3 encryption

Envelope encryption (used by KMS): A data key encrypts your data, and a master key encrypts the data key. This avoids sending large data blobs to KMS—only the small data key is encrypted/decrypted by KMS. The encrypted data key is stored alongside the encrypted data.

Encryption:
  Data → [Data Key] → Encrypted Data
  Data Key → [KMS Master Key] → Encrypted Data Key
  Store: Encrypted Data + Encrypted Data Key

Decryption:
  Encrypted Data Key → [KMS Master Key] → Data Key
  Encrypted Data → [Data Key] → Data
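
The flow above can be made concrete with a toy sketch. XOR with a repeating key is NOT real encryption; it only makes the two-key structure visible. In practice KMS performs the master-key operations server-side and never releases the master key:

```python
# Toy illustration of envelope encryption (XOR is a placeholder cipher).
import secrets

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = secrets.token_bytes(32)            # held inside "KMS"

# Encrypt: a fresh data key encrypts the data; the master key wraps the data key.
data = b"customer record"
data_key = secrets.token_bytes(32)
encrypted_data = xor(data, data_key)
encrypted_data_key = xor(data_key, master_key)
stored = (encrypted_data, encrypted_data_key)   # plaintext data key is discarded

# Decrypt: unwrap the data key with the master key, then decrypt the data.
recovered_key = xor(stored[1], master_key)
assert xor(stored[0], recovered_key) == data
```

Only the 32-byte data key ever travels to "KMS" for wrapping/unwrapping; the bulk data never does.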

Network Security

  • Web Application Firewall (WAF): Inspects HTTP requests and blocks malicious traffic (SQL injection, XSS, bot traffic). AWS WAF, Cloudflare WAF, Azure WAF.
  • DDoS Protection: AWS Shield Standard (free, automatic L3/L4 protection), Shield Advanced (L7 protection, DDoS response team, cost protection).
  • VPC Flow Logs: Capture IP traffic metadata flowing through your VPC for auditing and troubleshooting.
  • PrivateLink: Access services over private IPs without traversing the internet.

Security Monitoring

Service Purpose
CloudTrail Logs every API call made in your AWS account (who did what, when, from where)
GuardDuty ML-based threat detection analyzing CloudTrail, VPC Flow Logs, DNS logs
Security Hub Aggregates findings from multiple security services, compliance checks
Config Tracks resource configuration changes, evaluates compliance rules
Inspector Automated vulnerability scanning for EC2 instances and container images

Cloud Storage Deep Dive

Object Storage (S3 / GCS / Blob Storage)

Object storage is the fundamental cloud storage primitive. Each object (data plus metadata) is stored in a bucket under a unique key. There is no true directory hierarchy—the "folders" you see are just key prefixes.

S3 storage classes:

Class Durability Availability Min Duration Retrieval Use Case
Standard 11 9s 99.99% None Instant Frequently accessed data
Intelligent-Tiering 11 9s 99.9% None Instant Unknown/changing access patterns
Standard-IA 11 9s 99.9% 30 days Instant Infrequent but rapid access needed
One Zone-IA 11 9s 99.5% 30 days Instant Reproducible, infrequent data
Glacier Instant 11 9s 99.9% 90 days Instant Archive with instant access
Glacier Flexible 11 9s 99.99% 90 days Minutes-12 hours Archive (long-term backups)
Glacier Deep Archive 11 9s 99.99% 180 days 12-48 hours Compliance archives, rarely accessed

Lifecycle policies automatically transition objects between storage classes based on age. Example: Move to Standard-IA after 30 days, Glacier after 90 days, Deep Archive after 365 days, delete after 7 years.
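
The example policy above can be expressed as a small function. The class names mirror S3's API constants; the thresholds are those of the example, not defaults.

```python
def storage_class_for_age(age_days: int) -> str:
    """Which class the example lifecycle policy keeps an object in."""
    if age_days >= 365 * 7:
        return "DELETED"        # expire after 7 years
    if age_days >= 365:
        return "DEEP_ARCHIVE"   # transition after 1 year
    if age_days >= 90:
        return "GLACIER"        # transition after 90 days
    if age_days >= 30:
        return "STANDARD_IA"    # transition after 30 days
    return "STANDARD"           # fresh objects stay in Standard
```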

S3 performance optimization:

  • S3 supports 5,500 GET/HEAD and 3,500 PUT/COPY/POST/DELETE requests per second per prefix
  • Spread requests across multiple key prefixes to scale beyond per-prefix limits (randomized key names are no longer required—S3 partitions automatically—but distinct prefixes still parallelize throughput)
  • Use multipart upload for objects > 100 MB (required for objects > 5 GB)
  • S3 Transfer Acceleration routes uploads through CloudFront edge locations for faster transfers from distant clients
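
A hypothetical helper for sizing a multipart upload against S3's published limits (5 MB minimum part size except the last part, 10,000 parts maximum per upload):

```python
MIN_PART = 5 * 1024 * 1024    # S3 minimum part size (except last part)
MAX_PARTS = 10_000            # S3 part-count limit per multipart upload

def plan_multipart(object_size: int, part_size: int = 100 * 1024 * 1024):
    """Return (part_size, num_parts) for a multipart upload."""
    # Grow the part size if the object would otherwise need > 10,000 parts.
    part_size = max(part_size, MIN_PART, -(-object_size // MAX_PARTS))
    num_parts = -(-object_size // part_size)  # ceiling division
    return part_size, num_parts
```

For a 1 GiB object with the default 100 MiB parts this plans 11 parts; a 5 TiB object (the S3 maximum) forces larger parts to stay within the 10,000-part limit.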

Block Storage (EBS)

Block storage provides raw storage volumes that attach to compute instances, behaving like physical hard drives:

Volume Type IOPS Throughput Use Case
gp3 (General SSD) 3,000-16,000 125-1,000 MB/s Default for most workloads
io2 (Provisioned SSD) Up to 256,000 Up to 4,000 MB/s Databases requiring consistent IOPS
st1 (Throughput HDD) 500 500 MB/s Big data, data warehouses, log processing
sc1 (Cold HDD) 250 250 MB/s Infrequently accessed, lowest cost

The 12-Factor App

The 12-Factor App is a methodology for building modern, cloud-native applications that are portable, scalable, and maintainable. Originally published by Heroku engineers, these principles are foundational for cloud-native development.

Factor Principle Description
I. Codebase One codebase, many deploys One repo per app, deployed to multiple environments (dev, staging, prod)
II. Dependencies Explicitly declare and isolate Use dependency manifests (requirements.txt, package.json, Cargo.toml). Never rely on system-wide packages
III. Config Store config in the environment Database URLs, API keys, feature flags → environment variables, not code. Never commit secrets
IV. Backing Services Treat backing services as attached resources Databases, caches, queues are interchangeable resources identified by URL/credentials. Swapping a local PostgreSQL for Amazon RDS should require only a config change
V. Build, Release, Run Strictly separate build and run stages Build (compile + bundle), Release (build + config), Run (execute). Every release is immutable and has a unique ID
VI. Processes Execute the app as stateless processes App processes are stateless and share-nothing. Persistent data lives in backing services (DB, Redis, S3), not in local memory or filesystem
VII. Port Binding Export services via port binding The app is completely self-contained and exports HTTP (or other) as a service by binding to a port
VIII. Concurrency Scale out via the process model Scale by running multiple processes (horizontal scaling), not by making a single process larger
IX. Disposability Maximize robustness with fast startup and graceful shutdown Processes start quickly and shut down gracefully (finish current requests, release resources)
X. Dev/Prod Parity Keep dev, staging, and production as similar as possible Minimize gaps in time (deploy quickly), personnel (devs who wrote code deploy it), and tools (same backing services everywhere)
XI. Logs Treat logs as event streams Apps write logs to stdout. The execution environment captures, routes, and aggregates them (e.g., to ELK, CloudWatch, Datadog)
XII. Admin Processes Run admin/management tasks as one-off processes Database migrations, data fixes, console sessions run as one-off processes in the same environment as the app
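
Factor III in practice—a minimal Python sketch reading hypothetical settings from the environment, with safe development defaults:

```python
import os

# Factor III: configuration comes from the environment, never from code.
# DATABASE_URL and DEBUG are hypothetical settings for illustration.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/dev")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"

# Factor IV follows for free: swapping local PostgreSQL for Amazon RDS
# means exporting a different DATABASE_URL, with no code change.
```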

Cloud Architecture Patterns

Multi-Region Deployment

Deploying applications across multiple geographic regions for high availability, disaster recovery, and reduced latency.

                    ┌─────── Global DNS (Route 53 / Cloud DNS) ───────┐
                    │            Latency-based routing                 │
                    ▼                                                  ▼
        ┌─── US-East Region ───┐                        ┌── EU-West Region ───┐
        │  ┌─ Load Balancer ─┐ │                        │  ┌─ Load Balancer ┐ │
        │  └───────┬─────────┘ │                        │  └──────┬─────────┘ │
        │  ┌───────┴─────────┐ │                        │  ┌──────┴──────────┐│
        │  │ App Servers     │ │                        │  │ App Servers     ││
        │  │ (Auto-scaling)  │ │                        │  │ (Auto-scaling)  ││
        │  └───────┬─────────┘ │                        │  └──────┬──────────┘│
        │  ┌───────┴─────────┐ │    Cross-Region        │  ┌──────┴──────────┐│
        │  │ Primary DB      │◄├──── Replication ───────├──│ Replica DB      ││
        │  └─────────────────┘ │                        │  └─────────────────┘│
        └──────────────────────┘                        └─────────────────────┘

Strategies:

Strategy RTO RPO Cost Complexity
Backup & Restore Hours Hours $ Low
Pilot Light 10-30 min Minutes $$ Medium
Warm Standby Minutes Seconds $$$ Medium-High
Active-Active ~0 (automatic) ~0 $$$$ High

  • Active-Passive: One region handles all traffic; the other is a standby for failover. Simpler but wastes resources.
  • Active-Active: Both regions serve traffic simultaneously. More complex (requires data synchronization, conflict resolution) but better resource utilization and lower latency.
  • Pilot Light: Minimal infrastructure running in the DR region (e.g., database replica). Scale up on failover.
  • Warm Standby: Scaled-down version of production running in DR region. Faster failover than pilot light.

Auto-Scaling Strategies

Auto-scaling adjusts capacity dynamically based on demand:

Strategy Trigger Latency Use Case
Reactive (target tracking) Metric crosses threshold (CPU > 70%, queue depth > 100) 2-5 min General workloads
Step scaling Metric enters defined ranges, each triggering a different scaling action 2-5 min Workloads with predictable scaling steps
Scheduled Time-based (scale up at 9am, down at 6pm) None (pre-provisioned) Predictable traffic patterns (business hours)
Predictive ML-based forecasting from historical patterns None (pre-provisioned) Recurring patterns (daily/weekly cycles)

Scaling best practices:

  • Scale out (add instances) aggressively; scale in (remove instances) conservatively
  • Set cooldown periods to prevent scaling thrashing
  • Combine multiple metrics (CPU, request count, queue depth) for more accurate scaling decisions
  • Test scaling by simulating load before relying on it in production
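
The reactive (target-tracking) strategy plus a cooldown can be sketched as follows. The 70% target, 300-second cooldown, and one-at-a-time scale-in rule are illustrative policy choices, not provider defaults.

```python
import time

class TargetTracker:
    """Target-tracking sketch: scale out fast, scale in slowly."""

    def __init__(self, target_cpu: float = 70.0, cooldown_s: float = 300):
        self.target = target_cpu
        self.cooldown = cooldown_s
        self.last_scale = 0.0  # monotonic timestamp of the last action

    def desired_capacity(self, current: int, cpu: float, now=None) -> int:
        now = time.monotonic() if now is None else now
        if now - self.last_scale < self.cooldown:
            return current                 # still cooling down: no change
        # Proportional sizing: capacity scales with load relative to target.
        desired = max(1, round(current * cpu / self.target))
        if desired > current:
            self.last_scale = now
            return desired                 # scale out aggressively
        if desired < current:
            self.last_scale = now
            return current - 1             # scale in one instance at a time
        return current
```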

Deployment Patterns

Pattern Description Risk Level Rollback Speed
Rolling update Replace instances one at a time Medium Medium (re-deploy)
Blue-Green Maintain two identical environments; switch traffic at once Low Instant (switch back)
Canary Route small percentage of traffic to new version, gradually increase Low Instant (route 0% to canary)
Feature flags New code deployed to all instances but gated behind flags Very Low Instant (toggle flag off)

Blue-Green deployment flow:

  1. Blue environment runs current production
  2. Deploy the new version to the Green environment
  3. Run smoke tests against Green
  4. Switch the load balancer to route traffic to Green
  5. Monitor for errors; if issues arise, switch back to Blue instantly
  6. Decommission Blue (or keep it as the next deployment target)
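
The canary pattern in the table above depends on splitting traffic deterministically, so a given user consistently lands on the same version. A hash-based routing sketch:

```python
import hashlib

def route_version(user_id: str, canary_percent: int) -> str:
    """Route ~canary_percent of users to the canary, the rest to stable."""
    # Hashing the user ID (rather than random choice) keeps routing sticky:
    # the same user always sees the same version during the rollout.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Gradually raising canary_percent (1% → 10% → 50% → 100%) while watching error rates implements the progressive rollout; setting it back to 0 is the instant rollback.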

Resilience Patterns

  • Circuit Breaker: When a downstream service fails repeatedly, stop calling it for a period (open circuit) instead of accumulating timeouts. After a cooldown, allow a test request (half-open). If it succeeds, close the circuit; if not, reopen.
  • Bulkhead: Isolate failures by partitioning resources. If the payment service's thread pool is exhausted, the search service's thread pool is unaffected. Named after ship bulkheads that prevent a hull breach from flooding the entire vessel.
  • Retry with exponential backoff: On transient failures, retry with increasing delays (1s, 2s, 4s, 8s) plus jitter (random offset to prevent thundering herd).
  • Timeout: Every external call should have a timeout. Without timeouts, a hung dependency can cascade to all callers.
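
A minimal circuit breaker sketch showing the closed → open → half-open state machine described above (the threshold and timeout values are illustrative):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "open":
            if now - self.opened_at < self.reset_timeout:
                # Fail fast instead of accumulating timeouts downstream.
                raise RuntimeError("circuit open; failing fast")
            self.state = "half-open"       # cooldown over: allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # (re)open the circuit
                self.opened_at = now
            raise
        self.failures = 0
        self.state = "closed"              # success closes the circuit
        return result
```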

Hybrid Cloud

Combines on-premises infrastructure with public cloud services. Common in enterprises with existing data centers, regulatory requirements, or latency-sensitive workloads.

Technologies:

  • AWS Outposts: AWS hardware in your data center
  • Azure Arc: Manage on-premises, multi-cloud, and edge resources from Azure
  • Google Anthos: Run GKE clusters anywhere (on-prem, AWS, Azure)
  • VPN / Direct Connect / ExpressRoute: Secure, dedicated connections between on-premises and cloud

Multi-Cloud

Using services from multiple cloud providers to avoid vendor lock-in, leverage best-of-breed services, or meet regulatory requirements.

Challenges: Different APIs, pricing models, IAM systems, and networking models. Requires abstraction layers (Terraform, Pulumi, Crossplane) and potentially higher operational complexity. True multi-cloud (running the same workload across providers) is rare; more common is "multi-cloud by choice" (different workloads on different providers).

Serverless Architecture

Serverless extends beyond individual functions (FaaS) to entire architectures where the cloud provider manages all infrastructure and you pay only for what you use.

Event-Driven Architecture with Serverless

┌───────────┐     ┌──────────┐     ┌──────────┐     ┌─────────────┐
│ API       │────▶│ Lambda   │────▶│ DynamoDB │     │ S3 Bucket   │
│ Gateway   │     │ Function │     │ (storage)│     │ (uploads)   │
└───────────┘     └──────────┘     └──────────┘     └──────┬──────┘
                                                           │ Event
                                                           ▼
┌───────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ CloudWatch│────▶│ Lambda   │     │ Lambda   │◀────│ SQS      │
│ Events    │     │ (cron)   │     │ (process)│     │ Queue    │
└───────────┘     └──────────┘     └──────────┘     └──────────┘

Common serverless event sources: API Gateway (HTTP), S3 (file uploads), SQS (queues), SNS (pub/sub), DynamoDB Streams (data changes), CloudWatch Events/EventBridge (scheduled, AWS events), Kinesis (streaming data).
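
A minimal Python Lambda handler for the S3 leg of the diagram above. The event shape follows S3's notification format; the per-object processing is a placeholder.

```python
import json
import urllib.parse

def handler(event, context):
    """Triggered by S3 object-created notifications."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event payloads (spaces become '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append(f"s3://{bucket}/{key}")
        # ... placeholder: resize the image, index the document, etc.
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}
```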

Step Functions / Workflows

For complex multi-step workflows that require orchestration, error handling, and state management:

{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
      "Next": "ProcessPayment",
      "Catch": [{
        "ErrorEquals": ["ValidationError"],
        "Next": "OrderFailed"
      }]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:payment",
      "Next": "FulfillOrder",
      "Retry": [{
        "ErrorEquals": ["PaymentTimeout"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }]
    },
    "FulfillOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fulfill",
      "End": true
    },
    "OrderFailed": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-failure",
      "End": true
    }
  }
}
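
The Retry block above waits IntervalSeconds before the first attempt, then multiplies the delay by BackoffRate for each subsequent attempt. A quick sketch of the resulting schedule:

```python
def retry_delays(interval_s: float, backoff_rate: float, max_attempts: int):
    """Delays Step Functions waits between retries for one Retry rule."""
    return [interval_s * backoff_rate ** i for i in range(max_attempts)]
```

For the PaymentTimeout rule above (IntervalSeconds 5, BackoffRate 2.0, MaxAttempts 3), the retries fire after 5, 10, and 20 seconds.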

Serverless Databases

Database Type Scaling Pricing Model
DynamoDB NoSQL (key-value/document) On-demand or provisioned Per read/write request unit
Aurora Serverless v2 Relational (MySQL/PostgreSQL) Auto-scales 0.5-128 ACUs Per ACU-hour
Neon PostgreSQL Auto-scales, scales to zero Per compute-hour + storage
PlanetScale MySQL (Vitess) Auto-scales Per row read/write

Cloud Cost Optimization (FinOps)

Cloud costs can spiral quickly without discipline. FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending.

Key Principles

  1. Visibility: Tag all resources, set up cost dashboards, use cost allocation reports.
  2. Optimization: Right-size instances, use reserved/spot/preemptible instances, delete unused resources.
  3. Governance: Set budgets and alerts, implement approval workflows for expensive resources.

Cost Reduction Strategies

Strategy Savings Potential Description
Right-sizing 20-40% Match instance types to actual workload needs. Most instances are over-provisioned
Reserved Instances / Committed Use 30-72% Commit to 1-3 year usage for significant discounts
Savings Plans 30-72% More flexible than RIs: commit to $/hour spend, not specific instance types
Spot/Preemptible Instances 60-90% Use spare capacity at steep discounts for fault-tolerant workloads (batch processing, CI/CD, data pipelines)
Auto-scaling Variable Scale resources up/down based on demand. Don't pay for idle capacity
Storage tiering 40-80% Move infrequently accessed data to cheaper storage classes (S3 Glacier, Coldline)
Serverless Variable Pay only for actual execution time. Ideal for sporadic workloads
Scheduled scaling 20-50% Turn off dev/test environments during nights and weekends
Graviton/ARM instances 20-40% ARM-based instances offer better price-performance for compatible workloads

Tagging strategy — Tags are the foundation of cost allocation. Minimum recommended tags:

Tag Key Purpose Example Values
Environment Separate costs by environment production, staging, development
Team Allocate costs to teams platform, backend, data, ml
Service Track costs per service user-api, payment-service, search
CostCenter Map to financial cost centers eng-001, marketing-002
Owner Identify responsible person alice@company.com

# Example: AWS Cost Explorer CLI query
aws ce get-cost-and-usage \
  --time-period Start=2025-01-01,End=2025-01-31 \
  --granularity MONTHLY \
  --metrics "BlendedCost" "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE
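
Cost reports are only as accurate as tag coverage. A hypothetical compliance check against the minimum tag set above:

```python
REQUIRED_TAGS = {"Environment", "Team", "Service", "CostCenter", "Owner"}

def untagged(resources: dict) -> list:
    """Given {resource_id: {tag_key: tag_value}}, return the IDs of
    resources missing one or more required cost-allocation tags."""
    return [rid for rid, tags in resources.items()
            if not REQUIRED_TAGS <= set(tags)]
```

Run as a scheduled job, a check like this can flag (or auto-stop) untaggable spend before it pollutes the cost allocation report.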

Spot Instance Strategies

Spot instances offer 60-90% discounts but can be reclaimed with 2 minutes' notice. Effective strategies:

  • Diversify instance types: Request multiple instance types in your auto-scaling group; if one type is reclaimed, others may still be available
  • Use capacity-optimized allocation: Let the provider choose the instance type with the most available capacity
  • Handle interruptions gracefully: Use the 2-minute warning to drain connections and checkpoint work
  • Mix on-demand and spot: Run baseline on on-demand, burst on spot (e.g., 30% on-demand, 70% spot)
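
The baseline/burst split in the last bullet can be computed as follows; the 30% figure and the one-instance on-demand floor are illustrative policy choices, not provider defaults.

```python
def capacity_split(desired: int, on_demand_pct: int = 30):
    """Split desired capacity into an on-demand baseline plus spot burst."""
    # Keep at least one on-demand instance so the service never runs
    # entirely on reclaimable capacity.
    on_demand = max(1, round(desired * on_demand_pct / 100))
    return on_demand, desired - on_demand
```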

Cloud Migration Strategies

The 7 R's framework for migrating workloads to the cloud:

Strategy Description Effort Risk When to Use
Rehost (Lift & Shift) Move as-is to cloud VMs Low Low Quick migration, minimal changes
Replatform (Lift & Reshape) Minor optimization during migration Medium Low-Medium Use managed services (RDS instead of self-managed DB)
Repurchase Replace with SaaS product Low Medium On-prem CRM → Salesforce, on-prem email → Gmail
Refactor / Re-architect Rewrite to be cloud-native High High Critical apps that benefit from cloud-native features
Retire Decommission applications no longer needed Low Low Reduce portfolio before migration
Retain Keep on-premises (for now) None None Regulatory, too complex, or recently upgraded
Relocate Move to cloud without changes (VMware on cloud) Low Low VMware environments → VMware Cloud on AWS

Migration phases:

  1. Assessment: Inventory applications, map dependencies, assess cloud readiness
  2. Planning: Choose a migration strategy per application, design the target architecture, build the business case
  3. Migration: Execute the migration in waves, validate functionality
  4. Optimization: Right-size resources, implement auto-scaling, optimize costs

Database migration is typically the most complex part. AWS Database Migration Service (DMS) supports heterogeneous migrations (Oracle → PostgreSQL) with the Schema Conversion Tool (SCT). For homogeneous migrations (MySQL → Aurora MySQL), native replication can minimize downtime to seconds.