Cloud Computing¶
Cloud computing is the on-demand delivery of computing resources—servers, storage, databases, networking, software, analytics, and intelligence—over the internet ("the cloud") with pay-as-you-go pricing. Instead of owning and maintaining physical data centers and servers, organizations rent access to these resources from a cloud provider.
The National Institute of Standards and Technology (NIST) defines five essential characteristics of cloud computing:
- On-demand self-service: Provision resources automatically without human interaction with the provider.
- Broad network access: Resources available over the network via standard mechanisms (HTTP, APIs).
- Resource pooling: Provider resources are pooled to serve multiple tenants using a multi-tenant model.
- Rapid elasticity: Capabilities can be elastically provisioned and released to scale with demand.
- Measured service: Resource usage is monitored, controlled, and reported, enabling pay-per-use billing.
Cloud Deployment Models¶
Before diving into service models, it's important to understand where the cloud infrastructure lives:
| Model | Description | Use Cases |
|---|---|---|
| Public Cloud | Resources owned and operated by a third-party provider, shared across tenants | Startups, SaaS, variable workloads, rapid prototyping |
| Private Cloud | Dedicated infrastructure for a single organization (on-prem or hosted) | Regulatory compliance, sensitive data, predictable workloads |
| Hybrid Cloud | Combination of public and private, with orchestration between them | Enterprise (burst to public for peak load, sensitive data stays private) |
| Multi-Cloud | Using multiple public cloud providers simultaneously | Vendor lock-in avoidance, best-of-breed services, regulatory requirements |
Hybrid cloud is the most common enterprise model. An organization might run its core banking application on a private cloud for regulatory compliance while using AWS for customer-facing web applications and GCP BigQuery for analytics. The key challenges are data synchronization, identity federation, and consistent networking across environments.
Cloud Service Models¶
Cloud services are categorized into layers based on how much the provider manages versus how much the customer manages:
┌──────────────────────────────────────────────────────────────────┐
│ Responsibility Model │
├────────────┬────────────┬────────────┬──────────────────────────┤
│ On-Premise │ IaaS │ PaaS │ SaaS │
├────────────┼────────────┼────────────┼──────────────────────────┤
│ Apps YOU│ Apps YOU│ Apps YOU│ Apps PROVIDER │
│ Data YOU│ Data YOU│ Data YOU│ Data PROVIDER │
│ Runtime YOU│ Runtime YOU│ Runtime PRO│ Runtime PROVIDER │
│ Middle YOU│ Middle YOU│ Middle PRO│ Middleware PROVIDER │
│ OS YOU│ OS YOU│ OS PRO│ OS PROVIDER │
│ Virtual YOU│ Virtual PRO│ Virtual PRO│ Virtualizatn PROVIDER │
│ Servers YOU│ Servers PRO│ Servers PRO│ Servers PROVIDER │
│ Storage YOU│ Storage PRO│ Storage PRO│ Storage PROVIDER │
│ Network YOU│ Network PRO│ Network PRO│ Networking PROVIDER │
└────────────┴────────────┴────────────┴──────────────────────────┘
YOU = Customer manages PRO = Provider manages
The fundamental trade-off across all service models is control versus operational burden. As you move from IaaS to SaaS, you give up customization and control but gain operational simplicity and reduced staffing needs.
Infrastructure as a Service (IaaS)¶
IaaS provides virtualized computing resources over the internet. The provider manages the physical hardware, networking, and virtualization layer; the customer manages everything from the OS upward. This is the closest model to traditional IT but without the capital expenditure of physical hardware.
| Feature | Description |
|---|---|
| What you get | Virtual machines, networks, storage, firewalls |
| What you manage | OS, middleware, runtime, applications, data |
| Scaling | Manual or auto-scaling of VMs |
| Use cases | Custom environments, legacy app migration (lift-and-shift), dev/test environments |
| Examples | AWS EC2, Google Compute Engine, Azure Virtual Machines, DigitalOcean Droplets |
IaaS is the right choice when you need full control over the OS and runtime environment—for example, running specialized software that requires kernel-level configuration, GPUs for ML training, or legacy applications that can't be easily containerized. The downside is that you're responsible for patching the OS, configuring security groups, managing disk space, and handling instance failures.
# Example: Launching an EC2 instance with AWS CLI
aws ec2 run-instances \
--image-id ami-0abcdef1234567890 \
--instance-type t3.medium \
--key-name my-key-pair \
--security-group-ids sg-0123456789abcdef0 \
--subnet-id subnet-0123456789abcdef0 \
--count 1 \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=my-server}]'
Instance type selection is critical for cost and performance. Cloud providers offer instance families optimized for different workloads:
| Family | Optimized For | Examples (AWS) | Use Cases |
|---|---|---|---|
| General Purpose | Balanced CPU/memory | t3, m6i, m7g | Web servers, app servers, small databases |
| Compute Optimized | High-performance CPUs | c6i, c7g | Batch processing, scientific modeling, gaming |
| Memory Optimized | Large memory footprint | r6i, x2idn | In-memory databases, real-time analytics |
| Storage Optimized | High sequential I/O | i3, d3 | Data warehousing, distributed filesystems |
| Accelerated (GPU) | GPU/FPGA workloads | p4d, g5, inf2 | ML training/inference, video encoding |
| ARM-based (Graviton) | Cost-efficiency | t4g, m7g, c7g | 20-40% better price-performance for compatible workloads |
Platform as a Service (PaaS)¶
PaaS provides a platform allowing customers to develop, run, and manage applications without dealing with infrastructure. The provider manages servers, networking, storage, OS, and runtime. You focus exclusively on your application code and data.
| Feature | Description |
|---|---|
| What you get | Managed runtime, databases, development tools |
| What you manage | Application code and data |
| Scaling | Automatic (usually) |
| Use cases | Web applications, APIs, microservices, rapid prototyping |
| Examples | Heroku, Google App Engine, AWS Elastic Beanstalk, Azure App Service, Railway, Render |
PaaS dramatically reduces time-to-deploy. A developer can push code to a Git repository and have it running in production within minutes, without configuring a single server. The trade-off is reduced flexibility: you're constrained to the runtimes, languages, and configurations the platform supports. If you need a specific Linux kernel version or a custom native library, PaaS may not work.
When PaaS falls short: PaaS platforms impose constraints on execution time, memory, filesystem access, and network configuration. Applications that require long-running background processes, custom binary dependencies, or specific network topologies often outgrow PaaS and need to migrate to containers (CaaS) or IaaS.
Software as a Service (SaaS)¶
SaaS delivers fully managed applications over the internet. The provider manages everything; the customer simply uses the software through a web browser or API.
| Feature | Description |
|---|---|
| What you get | Complete application accessible via browser or API |
| What you manage | Configuration, user data |
| Use cases | Email, CRM, collaboration, productivity |
| Examples | Gmail, Salesforce, Slack, GitHub, Jira, Datadog |
SaaS is the dominant model for business tools. The key consideration for engineering teams is integration: how well does the SaaS product expose APIs, support webhooks, and integrate with your existing toolchain? Data portability and vendor lock-in are significant concerns—can you export your data if you switch providers?
Function as a Service (FaaS) / Serverless¶
FaaS is an event-driven execution model where the provider dynamically manages the allocation of computing resources. You deploy individual functions, and the provider runs them in response to events. There are no servers to provision, manage, or scale—the provider handles everything.
| Feature | Description |
|---|---|
| What you get | Event-driven function execution, automatic scaling to zero |
| What you manage | Function code (and sometimes container images) |
| Scaling | Automatic, scales to zero when idle |
| Billing | Per-invocation and per-duration (e.g., per ms of execution) |
| Limitations | Cold starts, execution time limits (15 min on AWS Lambda), stateless |
| Examples | AWS Lambda, Google Cloud Functions, Azure Functions, Cloudflare Workers |
# AWS Lambda function example (Python)
import json
def handler(event, context):
"""Process an API Gateway event."""
    # API Gateway sends queryStringParameters as null (not {}) when absent
    name = (event.get('queryStringParameters') or {}).get('name', 'World')
return {
'statusCode': 200,
'headers': {'Content-Type': 'application/json'},
'body': json.dumps({'message': f'Hello, {name}!'})
}
Cold starts are the most significant operational concern with FaaS. When a function hasn't been invoked recently, the provider must spin up a new execution environment (download code, initialize runtime, execute initialization code). This adds latency—typically 100ms-2s depending on runtime, memory size, and package size. Mitigation strategies:
- Provisioned concurrency: Keep a minimum number of warm instances (costs more but eliminates cold starts)
- Smaller deployment packages: Minimize dependencies to reduce initialization time
- Choose faster runtimes: Go and Rust cold-start in ~50ms; Python and Node.js in ~200ms; Java/C# in ~1-3s
- Keep initialization outside the handler: Module-level code runs once per cold start, not per invocation
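The last point can be sketched in a minimal handler (illustrative only; the config values are made up for this sketch):

```python
import json
import time

# Module-level code runs once per cold start and is then reused
# across warm invocations of the same execution environment.
_COLD_START = time.monotonic()
CONFIG = {"table": "users"}  # e.g. parsed config, SDK clients, DB connections

def handler(event, context):
    """Reuses module-level state instead of re-initializing on every call."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "table": CONFIG["table"],
            "warm_for_s": round(time.monotonic() - _COLD_START, 3),
        }),
    }
```

Anything expensive (database connections, loading models, reading parameters) belongs at module level so only the first invocation after a cold start pays for it.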
Serverless anti-patterns: Not everything should be serverless. Avoid FaaS for long-running processes (use containers), high-throughput steady-state workloads (dedicated compute is cheaper), or applications requiring local state or filesystem access.
Container as a Service (CaaS)¶
CaaS is the sweet spot between IaaS and PaaS—you package your application in a container (Docker image) and the platform handles orchestration, scaling, networking, and infrastructure management. You control the runtime environment (anything that fits in a container) without managing servers.
| Feature | Description |
|---|---|
| What you get | Container orchestration, networking, auto-scaling, service discovery |
| What you manage | Container images (Dockerfile), application configuration |
| Scaling | Automatic (horizontal pod autoscaling, scale-to-zero for some platforms) |
| Examples | AWS ECS/Fargate, Google Cloud Run, Azure Container Apps, Fly.io |
CaaS is increasingly the default deployment model for production microservices. It provides the flexibility of IaaS (run anything in your container) with the operational simplicity of PaaS (no server management). The two main flavors are:
- Kubernetes-based (EKS, GKE, AKS): Full Kubernetes API, maximum flexibility, higher operational complexity
- Managed container platforms (Fargate, Cloud Run): Simpler abstractions, less control, lower operational burden
# Deploy a container to Google Cloud Run (scales to zero)
gcloud run deploy my-service \
--image gcr.io/my-project/my-app:latest \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 512Mi \
--cpu 1 \
--min-instances 0 \
--max-instances 100 \
--set-env-vars "DATABASE_URL=postgres://..." \
--port 8080
Other Service Models¶
| Model | Description | Examples |
|---|---|---|
| DBaaS (Database as a Service) | Managed database engines with automated backups, patching, and scaling | AWS RDS, Google Cloud SQL, Azure Cosmos DB, PlanetScale, Neon |
| BaaS (Backend as a Service) | Pre-built backend features (auth, storage, push notifications) | Firebase, Supabase, AWS Amplify |
| AIaaS (AI as a Service) | Managed AI/ML models and APIs | OpenAI API, AWS Bedrock, Google Vertex AI, Azure OpenAI Service |
Choosing a Service Model¶
Decision Tree:
Need full OS/kernel control? → IaaS (EC2, Compute Engine)
│ No
▼
Already containerized? → CaaS (Fargate, Cloud Run, EKS)
│ No
▼
Event-driven, short-lived workload? → FaaS (Lambda, Cloud Functions)
│ No
▼
Standard web app/API? → PaaS (Heroku, App Service, Render)
│ No
▼
Just need managed software? → SaaS (Datadog, GitHub, Slack)
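The decision tree above can be expressed as a small helper (a sketch; the criteria names are invented for illustration):

```python
def choose_service_model(*, os_control: bool = False,
                         containerized: bool = False,
                         event_driven: bool = False,
                         standard_web_app: bool = False) -> str:
    """Walk the decision tree top to bottom; the first 'yes' wins."""
    if os_control:
        return "IaaS"   # EC2, Compute Engine
    if containerized:
        return "CaaS"   # Fargate, Cloud Run, EKS
    if event_driven:
        return "FaaS"   # Lambda, Cloud Functions
    if standard_web_app:
        return "PaaS"   # Heroku, App Service, Render
    return "SaaS"       # Datadog, GitHub, Slack
```

Note the ordering matters: a containerized, event-driven workload lands on CaaS because the container question is asked first.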
Major Cloud Providers¶
Amazon Web Services (AWS)¶
The largest cloud provider (approximately 31% market share), AWS offers 200+ services across compute, storage, databases, networking, AI/ML, analytics, and more. AWS's strength is its breadth—there is a managed service for virtually every infrastructure need.
Core Services:
| Category | Service | Description |
|---|---|---|
| Compute | EC2 | Virtual machines (instances) with configurable CPU, memory, storage |
| | Lambda | Serverless functions (FaaS) |
| | ECS/EKS | Container orchestration (Docker/Kubernetes) |
| | Fargate | Serverless containers (no instance management) |
| Storage | S3 | Object storage (unlimited, 11 9s durability) |
| | EBS | Block storage for EC2 (SSD/HDD volumes) |
| | EFS | Managed NFS file system |
| | Glacier | Archival storage (low cost, high retrieval latency) |
| Database | RDS | Managed relational DB (PostgreSQL, MySQL, Oracle, SQL Server) |
| | DynamoDB | Managed NoSQL (key-value/document, single-digit ms latency) |
| | ElastiCache | Managed Redis/Memcached |
| | Aurora | AWS-optimized MySQL/PostgreSQL (up to 5x the throughput of standard MySQL) |
| Networking | VPC | Virtual private cloud (isolated network) |
| | Route 53 | DNS service |
| | CloudFront | CDN |
| | ELB/ALB/NLB | Load balancers (Layer 4 and Layer 7) |
| Security | IAM | Identity and access management |
| | KMS | Key management service |
| | WAF | Web application firewall |
| | Secrets Manager | Secrets storage and rotation |
| Messaging | SQS | Managed message queue |
| | SNS | Pub/sub messaging |
| | EventBridge | Event bus for event-driven architectures |
Google Cloud Platform (GCP)¶
Known for strong data analytics, AI/ML capabilities, and Kubernetes (GKE was the first managed Kubernetes service—Google created Kubernetes). GCP's pricing model is often simpler than AWS's, with sustained-use discounts applied automatically.
Core Services:
| Category | Service | Description |
|---|---|---|
| Compute | Compute Engine | Virtual machines |
| | Cloud Run | Serverless containers (scales to zero) |
| | GKE | Managed Kubernetes |
| | Cloud Functions | Serverless functions |
| Storage | Cloud Storage | Object storage |
| | Persistent Disk | Block storage |
| | Filestore | Managed NFS |
| Database | Cloud SQL | Managed relational DB |
| | Firestore | NoSQL document DB |
| | Cloud Spanner | Globally distributed relational DB (horizontally scalable + ACID) |
| | Bigtable | Wide-column NoSQL (HBase-compatible) |
| Data/AI | BigQuery | Serverless data warehouse (SQL analytics on petabytes) |
| | Vertex AI | Managed ML platform |
| | Pub/Sub | Messaging service |
Microsoft Azure¶
Strong in enterprise and hybrid cloud, tightly integrated with the Microsoft ecosystem (Active Directory, Office 365, .NET). Azure's competitive advantage is enterprise customers who already use Microsoft products.
Core Services:
| Category | Service | Description |
|---|---|---|
| Compute | Virtual Machines | VMs |
| | App Service | PaaS for web apps |
| | AKS | Managed Kubernetes |
| | Azure Functions | Serverless functions |
| Storage | Blob Storage | Object storage |
| | Azure Files | Managed file shares (SMB/NFS) |
| Database | Azure SQL | Managed SQL Server |
| | Cosmos DB | Globally distributed multi-model NoSQL |
| Identity | Azure AD (Entra ID) | Enterprise identity and access management |
Provider Comparison¶
| Dimension | AWS | GCP | Azure |
|---|---|---|---|
| Market share | ~31% | ~12% | ~24% |
| Strengths | Breadth of services, ecosystem | Data/AI, Kubernetes, pricing | Enterprise, hybrid, Microsoft integration |
| Pricing model | Complex, many dimensions | Simpler, sustained discounts | Enterprise agreements, hybrid benefit |
| Global regions | 33+ regions | 40+ regions | 60+ regions |
| Best for | Startups to enterprise, general purpose | Data-intensive, ML, containerized workloads | Enterprise, .NET shops, hybrid cloud |
Cloud-Agnostic Tools¶
To avoid vendor lock-in and manage multi-cloud environments, teams use abstraction layers:
| Tool | Purpose | Description |
|---|---|---|
| Terraform | Infrastructure as Code | Declarative HCL language, provider ecosystem for all major clouds, state management |
| Pulumi | Infrastructure as Code | Real programming languages (Python, TypeScript, Go) instead of DSL, strong typing |
| Crossplane | Kubernetes-native IaC | Manage cloud resources as Kubernetes custom resources (CRDs) |
| Helm | Kubernetes package manager | Template and deploy Kubernetes applications consistently across providers |
# Terraform example: Provision infrastructure across providers
provider "aws" {
region = "us-east-1"
}
resource "aws_instance" "web" {
ami = "ami-0abcdef1234567890"
instance_type = "t3.medium"
tags = {
Name = "web-server"
Environment = "production"
Team = "platform"
}
}
resource "aws_s3_bucket" "assets" {
bucket = "my-app-assets-prod"
tags = {
Environment = "production"
}
}
Cloud Networking¶
Understanding cloud networking is fundamental to deploying secure, scalable applications. Cloud networking virtualizes traditional data center networking concepts and adds cloud-specific constructs.
Virtual Private Cloud (VPC)¶
A VPC is a logically isolated virtual network within the cloud provider's infrastructure. It provides complete control over IP addressing, subnets, routing, and security. Most network-attached resources you launch (VMs, databases, load balancers) run inside a VPC.
┌──────────────────────────VPC (10.0.0.0/16) ──────────────────────┐
│ │
│ ┌─── Availability Zone A ───┐ ┌─── Availability Zone B ───┐ │
│ │ │ │ │ │
│ │ ┌─ Public Subnet ─────┐ │ │ ┌─ Public Subnet ─────┐ │ │
│ │ │ 10.0.1.0/24 │ │ │ │ 10.0.3.0/24 │ │ │
│ │ │ ┌───────┐ ┌──────┐ │ │ │ │ ┌───────┐ ┌──────┐ │ │ │
│ │ │ │ Web-1 │ │ NAT │ │ │ │ │ │ Web-2 │ │ NAT │ │ │ │
│ │ │ └───────┘ └──────┘ │ │ │ │ └───────┘ └──────┘ │ │ │
│ │ └─────────────────────┘ │ │ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ ┌─ Private Subnet ────┐ │ │ ┌─ Private Subnet ────┐ │ │
│ │ │ 10.0.2.0/24 │ │ │ │ 10.0.4.0/24 │ │ │
│ │ │ ┌───────┐ ┌──────┐ │ │ │ │ ┌───────┐ ┌──────┐ │ │ │
│ │ │ │ App-1 │ │ DB-1 │ │ │ │ │ │ App-2 │ │ DB-2 │ │ │ │
│ │ │ └───────┘ └──────┘ │ │ │ │ └───────┘ └──────┘ │ │ │
│ │ └─────────────────────┘ │ │ └─────────────────────┘ │ │
│ └───────────────────────────┘ └───────────────────────────┘ │
│ │
│ ┌── Internet Gateway ──┐ ┌── Route Tables ──┐ │
│ │ Connects VPC to │ │ Public: 0.0.0.0 │ │
│ │ the internet │ │ → IGW │ │
│ └──────────────────────┘ │ Private: 0.0.0.0 │ │
│ │ → NAT Gateway │ │
│ └────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
Key components:
- Subnets: Subdivisions of a VPC's IP range. Public subnets have routes to an internet gateway; private subnets do not (they access the internet through a NAT gateway). Best practice: place application servers and databases in private subnets; only load balancers and bastion hosts in public subnets.
- Internet Gateway (IGW): Allows resources in public subnets to communicate with the internet. It's horizontally scaled, redundant, and highly available—no bandwidth constraints.
- NAT Gateway: Enables resources in private subnets to initiate outbound internet connections (for updates, API calls) without being directly accessible from the internet. NAT gateways are charged per hour and per GB processed—they can become a significant cost for data-intensive workloads.
- Route Tables: Rules that determine where network traffic is directed. Each subnet is associated with a route table. A public subnet's route table has 0.0.0.0/0 → IGW; a private subnet's has 0.0.0.0/0 → NAT Gateway.
- Security Groups: Stateful firewalls at the instance level. Rules specify allowed inbound/outbound traffic by protocol, port, and source/destination.
- Network ACLs (NACLs): Stateless firewalls at the subnet level. Act as a second layer of defense.
- VPC Peering: Connects two VPCs so they can communicate using private IPs, even across regions or accounts. Non-transitive (A↔B and B↔C doesn't mean A↔C).
- Transit Gateway: Hub-and-spoke model for connecting multiple VPCs and on-premises networks. Solves the scaling problem of VPC peering (N VPCs would need N(N-1)/2 peering connections vs N transit gateway attachments).
- VPC Endpoints: Private connections to AWS services (S3, DynamoDB, etc.) that don't traverse the internet. Gateway endpoints (S3, DynamoDB) are free; interface endpoints (most other services) use PrivateLink and cost per hour + per GB.
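Carving the diagram's /16 VPC into /24 subnets is plain CIDR arithmetic, which Python's ipaddress module can verify (a sketch mirroring the addresses in the diagram above):

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vpc.subnets(new_prefix=24))  # 256 possible /24 subnets

# The layout from the diagram: public/private pairs across two AZs.
layout = {
    "public-a":  subnets[1],  # 10.0.1.0/24
    "private-a": subnets[2],  # 10.0.2.0/24
    "public-b":  subnets[3],  # 10.0.3.0/24
    "private-b": subnets[4],  # 10.0.4.0/24
}

for name, cidr in layout.items():
    # 256 addresses per /24 (AWS additionally reserves 5 per subnet)
    print(name, cidr, cidr.num_addresses)
```

Planning CIDR ranges up front matters: subnets can't be resized later, and overlapping ranges block VPC peering.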
Security Groups vs. NACLs¶
| Feature | Security Groups | Network ACLs |
|---|---|---|
| Level | Instance (ENI) | Subnet |
| Statefulness | Stateful (return traffic auto-allowed) | Stateless (must explicitly allow return traffic) |
| Rules | Allow rules only | Allow and deny rules |
| Evaluation | All rules evaluated together | Rules evaluated in order (lowest number first) |
| Default | Deny all inbound, allow all outbound | Allow all inbound and outbound |
Best practice: Use security groups as your primary firewall (they're easier to manage and stateful). Use NACLs as a defense-in-depth measure for subnet-level blocking (e.g., blocking known malicious IP ranges).
DNS and Traffic Routing¶
Cloud DNS services (Route 53, Cloud DNS) do more than resolve domain names—they're intelligent traffic routers:
| Routing Policy | Description | Use Case |
|---|---|---|
| Simple | Single record, single endpoint | Small applications with one server |
| Weighted | Distribute traffic by percentage across endpoints | Canary deployments (95% to v1, 5% to v2) |
| Latency-based | Route to the lowest-latency region | Global applications (serve US users from us-east, EU from eu-west) |
| Failover | Active-passive: route to secondary if primary fails health check | Disaster recovery |
| Geolocation | Route based on user's geographic location | Compliance (EU data stays in EU), localized content |
| Multi-value answer | Return multiple healthy endpoints (client-side load balancing) | Simple HA without a load balancer |
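Weighted routing is easy to reason about as weighted random selection; here is a client-side analogue (a sketch for intuition, not how Route 53 implements it):

```python
import random

def pick_endpoint(weights, rng=random.random):
    """Pick an endpoint with probability proportional to its weight."""
    total = sum(weights.values())
    r = rng() * total
    for endpoint, weight in weights.items():
        r -= weight
        if r < 0:
            return endpoint
    return endpoint  # guard against floating-point edge cases

# Canary deployment: 95% of traffic to v1, 5% to v2
weights = {"app-v1.example.com": 95, "app-v2.example.com": 5}
```

Shifting the canary's share is then just an update to the record weights, with no application change.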
Content Delivery Networks (CDNs)¶
CDNs cache content at edge locations close to users, dramatically reducing latency for static and dynamic content. Major CDNs operate hundreds of points of presence (PoPs) worldwide.
Without CDN:
User (Tokyo) → Origin Server (us-east-1) = ~200ms latency
With CDN:
User (Tokyo) → CDN Edge (Tokyo PoP) = ~10ms latency (cache hit)
User (Tokyo) → CDN Edge (Tokyo PoP) → Origin (us-east-1) = ~210ms (cache miss, then cached)
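The overall payoff depends on the cache hit ratio; expected latency is just a weighted average of the two paths above (numbers taken from the example):

```python
def expected_latency_ms(hit_ratio, hit_ms=10, miss_ms=210):
    """Average latency for the Tokyo user at a given CDN cache hit ratio."""
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms

# At a 95% hit ratio, average latency drops from ~200ms to ~20ms
print(round(expected_latency_ms(0.95), 1))  # → 20.0
```

This is why cache hit ratio is the headline CDN metric: the marginal win from 90% to 99% is larger than it looks.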
| CDN Feature | Description |
|---|---|
| Static caching | HTML, CSS, JS, images cached at edge locations |
| Dynamic acceleration | Optimized routing and persistent connections to origin for dynamic content |
| SSL/TLS termination | Terminate TLS at the edge, reducing origin load |
| DDoS protection | Absorb volumetric attacks at the edge before they reach origin |
| Edge compute | Run code at edge locations (CloudFront Functions, Cloudflare Workers, Vercel Edge Functions) |
| Cache invalidation | Purge specific paths or wildcard patterns when content changes |
CDN providers: CloudFront (AWS), Cloud CDN (GCP), Azure CDN, Cloudflare, Fastly, Akamai.
Load Balancing¶
Cloud providers offer managed load balancers that distribute traffic across multiple backend targets:
| Type | AWS Service | Layer | Use Case |
|---|---|---|---|
| Application LB | ALB | Layer 7 (HTTP/HTTPS) | HTTP routing (path-based, host-based), WebSocket, gRPC |
| Network LB | NLB | Layer 4 (TCP/UDP) | Ultra-low latency, static IPs, millions of RPS |
| Gateway LB | GWLB | Layer 3 | Inline network appliances (firewalls, IDS/IPS) |
ALB routing example: An ALB can route /api/* to your backend service, /static/* to an S3 bucket, and everything else to your frontend service—all from a single endpoint.
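Path-based routing boils down to prefix matching in priority order; a sketch of the rule set just described (target names are illustrative):

```python
# Rules are evaluated in priority order; the first matching prefix wins.
RULES = [
    ("/api/", "backend-service"),
    ("/static/", "s3-assets"),
]

def route(path: str) -> str:
    """Return the target group for a request path."""
    for prefix, target in RULES:
        if path.startswith(prefix):
            return target
    return "frontend-service"  # default action when no rule matches
```

Real ALB rules also match on host headers, HTTP methods, and query strings, but the priority-ordered first-match model is the same.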
Service Mesh¶
For microservices architectures, a service mesh provides infrastructure-level control over service-to-service communication:
Without service mesh: With service mesh (Istio/Linkerd):
┌─────────┐ HTTP ┌─────────┐ ┌─────────┐ ┌──────┐ ┌──────┐ ┌─────────┐
│ Service │ ──────── │ Service │ │ Service │─│Proxy │───│Proxy │─│ Service │
│ A │ │ B │ │ A │ │(Envoy│ │(Envoy│ │ B │
└─────────┘ └─────────┘ └─────────┘ └──────┘ └──────┘ └─────────┘
Sidecar proxies handle: mTLS, retries,
circuit breaking, observability, traffic control
Service meshes provide: mutual TLS (encrypted service-to-service communication), traffic management (canary releases, traffic splitting), observability (distributed tracing, metrics), resilience (retries, circuit breaking, timeouts), and access control (authorization policies).
Identity and Access Management (IAM)¶
IAM is the framework for managing who (identity) can do what (permissions) on which resources in the cloud. Every cloud provider has an IAM system; AWS IAM is the most widely referenced.
Core IAM Concepts¶
- Users: Represent individual people or service accounts. Each has credentials (password, access keys). Should map 1:1 to humans; never share user accounts.
- Groups: Collections of users. Permissions assigned to groups apply to all members. Example: a developers group has read access to production and write access to staging.
- Roles: Identities with permissions that can be assumed by users, services, or external identities. Unlike users, roles don't have permanent credentials—they provide temporary security tokens via AWS STS (Security Token Service).
- Policies: JSON documents that define permissions. Attached to users, groups, or roles.
The Principle of Least Privilege¶
Always grant the minimum permissions necessary to perform a task. This is the single most important IAM principle.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-app-bucket/*",
"Condition": {
"StringEquals": {
"aws:RequestedRegion": "us-east-1"
},
"Bool": {
"aws:MultiFactorAuthPresent": "true"
}
}
},
{
"Effect": "Deny",
"Action": "s3:DeleteObject",
"Resource": "*"
}
]
}
IAM policy evaluation logic: When AWS evaluates a request, it follows this order:
1. Explicit deny — Any explicit deny in any policy wins (overrides everything)
2. Organizations SCPs — Service control policies set the maximum permissions boundary
3. Resource-based policies — Policies attached to resources (S3 bucket policies, etc.)
4. Identity-based policies — Policies attached to the user/role making the request
5. Permissions boundaries — Maximum permissions an identity can have
6. Session policies — Limit permissions for a temporary session
7. Default deny — If nothing explicitly allows the action, it's denied
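A drastically simplified evaluator shows the core rule (explicit deny wins; otherwise an allow is required). This sketch ignores conditions, SCPs, and boundaries:

```python
from fnmatch import fnmatch

def evaluate(statements, action, resource):
    """Simplified identity-policy evaluation: default deny, explicit deny wins."""
    decision = "Deny"  # default deny
    for stmt in statements:
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if not any(fnmatch(action, pattern) for pattern in actions):
            continue
        if not fnmatch(resource, stmt["Resource"]):
            continue
        if stmt["Effect"] == "Deny":
            return "Deny"  # explicit deny overrides any allow
        decision = "Allow"
    return decision

# Statements mirroring the example policy above
statements = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
     "Resource": "arn:aws:s3:::my-app-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "*"},
]
```

Note the asymmetry: a matching Deny short-circuits immediately, while an Allow only changes the default and can still be overridden by a later Deny.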
RBAC vs. ABAC¶
| Approach | Description | Example |
|---|---|---|
| RBAC (Role-Based) | Permissions assigned to roles; users assume roles | database-admin role has full RDS access |
| ABAC (Attribute-Based) | Permissions based on tags/attributes of resources and principals | Users with tag team=data can access resources with tag team=data |
ABAC scales better than RBAC in large organizations. Instead of creating a new role for every team-resource combination, you create policies based on tag matching. However, ABAC requires disciplined tagging—if resources aren't tagged correctly, access control breaks.
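Tag-matching ABAC reduces to comparing an attribute on the principal with one on the resource (a toy sketch; real policies use condition keys like aws:PrincipalTag):

```python
def abac_allows(principal_tags: dict, resource_tags: dict, key: str = "team") -> bool:
    """Allow when principal and resource carry the same value for `key`."""
    value = principal_tags.get(key)
    return value is not None and value == resource_tags.get(key)

# Users tagged team=data may access resources tagged team=data
print(abac_allows({"team": "data"}, {"team": "data"}))   # → True
print(abac_allows({"team": "data"}, {"team": "infra"}))  # → False
print(abac_allows({"team": "data"}, {}))                 # untagged resource: denied
```

The last case illustrates the tagging-discipline caveat: an untagged resource simply falls out of the access model.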
Federation and SSO¶
Federation allows external identities (corporate Active Directory, Google Workspace, Okta) to access cloud resources without creating individual IAM users:
- SAML 2.0: Enterprise standard for SSO. Corporate IdP (Okta, Azure AD) authenticates user, sends SAML assertion to AWS, AWS grants temporary credentials based on mapped role.
- OIDC (OpenID Connect): Modern standard used by GitHub Actions, GitLab CI, and web applications. Allows workloads to assume cloud roles without long-lived secrets.
- AWS IAM Identity Center (SSO): Centralized SSO for multiple AWS accounts, integrates with corporate IdPs.
# GitHub Actions OIDC federation — no AWS access keys needed
jobs:
deploy:
permissions:
id-token: write # Required for OIDC
contents: read
steps:
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-deploy
aws-region: us-east-1
- run: aws s3 sync ./build s3://my-app-bucket/
Service-to-Service Authentication¶
In microservices architectures, services need to authenticate with each other:
| Method | Description | Complexity |
|---|---|---|
| IAM roles | Services assume IAM roles to access cloud resources | Low (cloud-native) |
| Service accounts | Dedicated identities for services (GCP service accounts, K8s service accounts) | Low-Medium |
| mTLS | Mutual TLS: both client and server present certificates | Medium-High (use service mesh) |
| Workload Identity | Map Kubernetes service accounts to cloud IAM roles (IRSA on EKS, Workload Identity on GKE) | Medium |
| SPIFFE/SPIRE | Open standard for service identity | High (most flexible) |
Secrets Management¶
Secrets (API keys, database passwords, TLS certificates) should never be in code, environment variables, or config files in source control:
| Tool | Description | Features |
|---|---|---|
| HashiCorp Vault | Industry-standard secrets manager | Dynamic secrets, encryption as a service, PKI, cloud-agnostic |
| AWS Secrets Manager | Managed secrets with auto-rotation | RDS password rotation, cross-account sharing, $0.40/secret/month |
| AWS SSM Parameter Store | Simpler key-value store | Free tier (standard params), hierarchical organization |
| GCP Secret Manager | GCP-native secrets | IAM integration, versioning, automatic replication |
| External Secrets Operator | Kubernetes operator that syncs secrets from external stores | Bridges Vault/cloud secrets into K8s secrets |
Cloud Security¶
Shared Responsibility Model¶
Cloud security is a shared responsibility between the provider and the customer. The exact boundary depends on the service model:
┌────────────────────────────────────────────────────┐
│ Customer Responsibility │
│ ┌──────────┬───────────┬───────────┬────────────┐ │
│ │ IaaS │ CaaS │ PaaS │ SaaS │ │
│ ├──────────┼───────────┼───────────┼────────────┤ │
│ │ Data │ Data │ Data │ Data access│ │
│ │ Apps │ Container │ App code │ User config│ │
│ │ OS/patch │ images │ │ │ │
│ │ Network │ Cluster │ │ │ │
│ │ config │ config │ │ │ │
│ └──────────┴───────────┴───────────┴────────────┘ │
├────────────────────────────────────────────────────┤
│ Provider Responsibility │
│ Physical security, hardware, hypervisor, │
│ managed service infrastructure, global network │
└────────────────────────────────────────────────────┘
Encryption¶
| Type | Description | AWS Service |
|---|---|---|
| At rest | Data encrypted on disk/storage | KMS, S3 server-side encryption, EBS encryption |
| In transit | Data encrypted over the network (TLS) | ACM (certificate management), ALB TLS termination |
| Client-side | Data encrypted before sending to cloud | AWS Encryption SDK, client-side S3 encryption |
Envelope encryption (used by KMS): A data key encrypts your data, and a master key encrypts the data key. This avoids sending large data blobs to KMS—only the small data key is encrypted/decrypted by KMS. The encrypted data key is stored alongside the encrypted data.
Encryption:
Data → [Data Key] → Encrypted Data
Data Key → [KMS Master Key] → Encrypted Data Key
Store: Encrypted Data + Encrypted Data Key
Decryption:
Encrypted Data Key → [KMS Master Key] → Data Key
Encrypted Data → [Data Key] → Data
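The flow above can be demonstrated end to end with a toy XOR cipher standing in for real encryption (do not use XOR in practice; this only illustrates the key-wrapping shape, where KMS would use AES):

```python
import os

def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR keystream: the same call encrypts and decrypts. Illustration only."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

master_key = os.urandom(16)   # stands in for the KMS master key; never leaves KMS
data_key = os.urandom(16)     # generated per object

plaintext = b"customer record"
encrypted_data = toy_cipher(plaintext, data_key)
encrypted_data_key = toy_cipher(data_key, master_key)  # the only "KMS" round-trip

# Store encrypted_data + encrypted_data_key side by side; decrypt later:
recovered_key = toy_cipher(encrypted_data_key, master_key)
recovered = toy_cipher(encrypted_data, recovered_key)
assert recovered == plaintext
```

The point of the pattern is visible in the sizes: only the 16-byte data key ever travels to KMS, regardless of how large the data is.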
Network Security¶
- Web Application Firewall (WAF): Inspects HTTP requests and blocks malicious traffic (SQL injection, XSS, bot traffic). AWS WAF, Cloudflare WAF, Azure WAF.
- DDoS Protection: AWS Shield Standard (free, automatic L3/L4 protection), Shield Advanced (L7 protection, DDoS response team, cost protection).
- VPC Flow Logs: Capture IP traffic metadata flowing through your VPC for auditing and troubleshooting.
- PrivateLink: Access services over private IPs without traversing the internet.
Security Monitoring¶
| Service | Purpose |
|---|---|
| CloudTrail | Logs every API call made in your AWS account (who did what, when, from where) |
| GuardDuty | ML-based threat detection analyzing CloudTrail, VPC Flow Logs, DNS logs |
| Security Hub | Aggregates findings from multiple security services, compliance checks |
| Config | Tracks resource configuration changes, evaluates compliance rules |
| Inspector | Automated vulnerability scanning for EC2 instances and container images |
Cloud Storage Deep Dive¶
Object Storage (S3 / GCS / Blob Storage)¶
Object storage is the fundamental cloud storage primitive. Each object is stored under a key in a bucket, together with its data and metadata. There is no directory hierarchy—the "folders" you see are just key prefixes.
S3 storage classes:
| Class | Durability | Availability | Min Duration | Retrieval | Use Case |
|---|---|---|---|---|---|
| Standard | 11 9s | 99.99% | None | Instant | Frequently accessed data |
| Intelligent-Tiering | 11 9s | 99.9% | None | Instant | Unknown/changing access patterns |
| Standard-IA | 11 9s | 99.9% | 30 days | Instant | Infrequent but rapid access needed |
| One Zone-IA | 11 9s | 99.5% | 30 days | Instant | Reproducible, infrequent data |
| Glacier Instant | 11 9s | 99.9% | 90 days | Instant | Archive with instant access |
| Glacier Flexible | 11 9s | 99.99% | 90 days | 1-12 hours | Archive (long-term backups) |
| Glacier Deep Archive | 11 9s | 99.99% | 180 days | 12-48 hours | Compliance archives, rarely accessed |
Lifecycle policies automatically transition objects between storage classes based on age. Example: Move to Standard-IA after 30 days, Glacier after 90 days, Deep Archive after 365 days, delete after 7 years.
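The example lifecycle above can be expressed as the `LifecycleConfiguration` payload that boto3's `put_bucket_lifecycle_configuration` accepts; the rule ID is a placeholder, and 7 years is approximated as 2,555 days:

```python
# Sketch of the lifecycle rule described above: IA at 30 days, Glacier at 90,
# Deep Archive at 365, delete after ~7 years. Rule ID is illustrative.
lifecycle = {
    "Rules": [{
        "ID": "archive-then-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},   # empty prefix: applies to all objects
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 2555},   # ~7 years
    }]
}
# Applied with: s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle)
```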
S3 performance optimization:
- S3 supports 5,500 GET/s and 3,500 PUT/s per partitioned prefix
- Use random prefixes (UUIDs, hashes) to distribute requests across partitions
- Use multipart upload for objects > 100 MB (required > 5 GB)
- S3 Transfer Acceleration uses CloudFront edge locations for faster uploads from distant locations
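One way to randomize prefixes, sketched below, is to derive a short hash from the object name so keys spread across partitions instead of piling onto one hot prefix (the function name and prefix length are illustrative):

```python
import hashlib

def partitioned_key(name: str, prefix_len: int = 4) -> str:
    # Derive a stable hash prefix from the object name so writes/reads
    # distribute across S3 partitions rather than hitting one prefix.
    digest = hashlib.md5(name.encode()).hexdigest()
    return f"{digest[:prefix_len]}/{name}"
```

The trade-off: hashed prefixes destroy lexicographic listing order, so date-range scans become harder. This technique matters mainly at very high request rates, since S3 also repartitions hot prefixes automatically over time.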
Block Storage (EBS)¶
Block storage provides raw storage volumes that attach to compute instances, behaving like physical hard drives:
| Volume Type | IOPS | Throughput | Use Case |
|---|---|---|---|
| gp3 (General SSD) | 3,000-16,000 | 125-1,000 MB/s | Default for most workloads |
| io2 (Provisioned SSD) | Up to 256,000 | Up to 4,000 MB/s | Databases requiring consistent IOPS |
| st1 (Throughput HDD) | 500 | 500 MB/s | Big data, data warehouses, log processing |
| sc1 (Cold HDD) | 250 | 250 MB/s | Infrequently accessed, lowest cost |
The 12-Factor App¶
The 12-Factor App is a methodology for building modern, cloud-native applications that are portable, scalable, and maintainable. Originally published by Heroku engineers, these principles are foundational for cloud-native development.
| Factor | Principle | Description |
|---|---|---|
| I. Codebase | One codebase, many deploys | One repo per app, deployed to multiple environments (dev, staging, prod) |
| II. Dependencies | Explicitly declare and isolate | Use dependency manifests (requirements.txt, package.json, Cargo.toml). Never rely on system-wide packages |
| III. Config | Store config in the environment | Database URLs, API keys, feature flags → environment variables, not code. Never commit secrets |
| IV. Backing Services | Treat backing services as attached resources | Databases, caches, queues are interchangeable resources identified by URL/credentials. Swapping a local PostgreSQL for Amazon RDS should require only a config change |
| V. Build, Release, Run | Strictly separate build and run stages | Build (compile + bundle), Release (build + config), Run (execute). Every release is immutable and has a unique ID |
| VI. Processes | Execute the app as stateless processes | App processes are stateless and share-nothing. Persistent data lives in backing services (DB, Redis, S3), not in local memory or filesystem |
| VII. Port Binding | Export services via port binding | The app is completely self-contained and exports HTTP (or other) as a service by binding to a port |
| VIII. Concurrency | Scale out via the process model | Scale by running multiple processes (horizontal scaling), not by making a single process larger |
| IX. Disposability | Maximize robustness with fast startup and graceful shutdown | Processes start quickly and shut down gracefully (finish current requests, release resources) |
| X. Dev/Prod Parity | Keep dev, staging, and production as similar as possible | Minimize gaps in time (deploy quickly), personnel (devs who wrote code deploy it), and tools (same backing services everywhere) |
| XI. Logs | Treat logs as event streams | Apps write logs to stdout. The execution environment captures, routes, and aggregates them (e.g., to ELK, CloudWatch, Datadog) |
| XII. Admin Processes | Run admin/management tasks as one-off processes | Database migrations, data fixes, console sessions run as one-off processes in the same environment as the app |
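Factor III (config in the environment) is the one most often violated in practice. A minimal sketch, with illustrative variable names (`DATABASE_URL`, `FEATURE_NEW_CHECKOUT` are not a standard):

```python
import os

class Config:
    # Factor III: all configuration comes from the environment, never the code.
    def __init__(self, env=None):
        env = os.environ if env is None else env
        self.database_url = env["DATABASE_URL"]             # required: fail fast if absent
        self.debug = env.get("DEBUG", "false") == "true"    # optional, with a default
        self.new_checkout = env.get("FEATURE_NEW_CHECKOUT", "off") == "on"
```

Because the app reads only the environment, swapping a local PostgreSQL for RDS (Factor IV) is just a different `DATABASE_URL`—no code change, no rebuild.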
Cloud Architecture Patterns¶
Multi-Region Deployment¶
Deploying applications across multiple geographic regions for high availability, disaster recovery, and reduced latency.
┌─────── Global DNS (Route 53 / Cloud DNS) ───────┐
│ Latency-based routing │
▼ ▼
┌─── US-East Region ───┐ ┌── EU-West Region ───┐
│ ┌─ Load Balancer ─┐ │ │ ┌─ Load Balancer ┐ │
│ └───────┬─────────┘ │ │ └──────┬─────────┘ │
│ ┌───────┴─────────┐ │ │ ┌──────┴──────────┐│
│ │ App Servers │ │ │ │ App Servers ││
│ │ (Auto-scaling) │ │ │ │ (Auto-scaling) ││
│ └───────┬─────────┘ │ │ └──────┬──────────┘│
│ ┌───────┴─────────┐ │ Cross-Region │ ┌──────┴──────────┐│
│ │ Primary DB │◄├──── Replication ───────├──│ Replica DB ││
│ └─────────────────┘ │ │ └─────────────────┘│
└──────────────────────┘ └─────────────────────┘
Strategies:
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Low |
| Pilot Light | 10-30 min | Minutes | $$ | Medium |
| Warm Standby | Minutes | Seconds | $$$ | Medium-High |
| Active-Active | ~0 (automatic) | ~0 | $$$$ | High |
- Active-Passive: One region handles all traffic; the other is a standby for failover. Simpler but wastes resources.
- Active-Active: Both regions serve traffic simultaneously. More complex (requires data synchronization, conflict resolution) but better resource utilization and lower latency.
- Pilot Light: Minimal infrastructure running in the DR region (e.g., database replica). Scale up on failover.
- Warm Standby: Scaled-down version of production running in DR region. Faster failover than pilot light.
Auto-Scaling Strategies¶
Auto-scaling adjusts capacity dynamically based on demand:
| Strategy | Trigger | Latency | Use Case |
|---|---|---|---|
| Reactive (target tracking) | Metric crosses threshold (CPU > 70%, queue depth > 100) | 2-5 min | General workloads |
| Step scaling | Metric enters defined ranges, each triggering a different scaling action | 2-5 min | Workloads with predictable scaling steps |
| Scheduled | Time-based (scale up at 9am, down at 6pm) | None (pre-provisioned) | Predictable traffic patterns (business hours) |
| Predictive | ML-based forecasting from historical patterns | None (pre-provisioned) | Recurring patterns (daily/weekly cycles) |
Scaling best practices:
- Scale out (add instances) aggressively, scale in (remove instances) conservatively
- Set cooldown periods to prevent scaling thrashing
- Use multiple metrics (CPU + request count + queue depth) for more accurate scaling decisions
- Always test scaling by simulating load before relying on it in production
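As a rough sketch of what reactive target tracking computes: capacity scales proportionally with the ratio of observed metric to target. Real implementations layer cooldowns and more conservative scale-in on top of this; the formula below is the general idea, not any provider's exact algorithm:

```python
import math

def desired_capacity(current: int, metric: float, target: float,
                     min_cap: int = 1, max_cap: int = 20) -> int:
    # Proportional target tracking: 4 instances at 90% CPU with a 60% target
    # -> ceil(4 * 90 / 60) = 6 instances. Clamped to the group's min/max.
    new = math.ceil(current * metric / target)
    return max(min_cap, min(max_cap, new))
```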
Deployment Patterns¶
| Pattern | Description | Risk Level | Rollback Speed |
|---|---|---|---|
| Rolling update | Replace instances one at a time | Medium | Medium (re-deploy) |
| Blue-Green | Maintain two identical environments; switch traffic at once | Low | Instant (switch back) |
| Canary | Route small percentage of traffic to new version, gradually increase | Low | Instant (route 0% to canary) |
| Feature flags | New code deployed to all instances but gated behind flags | Very Low | Instant (toggle flag off) |
Blue-Green deployment flow:
1. Blue environment runs current production
2. Deploy new version to Green environment
3. Run smoke tests against Green
4. Switch load balancer to route traffic to Green
5. Monitor for errors; if issues arise, switch back to Blue instantly
6. Decommission Blue (or keep as next deployment target)
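A canary rollout needs a stable way to split traffic. One common approach, sketched below without reference to any particular load balancer, hashes a user ID into a fixed bucket so each user consistently sees the same version while the percentage ramps up:

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    # Hash the user id to a stable bucket in [0, 100); sticky per user,
    # so a given user doesn't flip between versions on each request.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Rolling back is then just setting `canary_percent` to 0, which is why the table above lists canary rollback as instant.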
Resilience Patterns¶
- Circuit Breaker: When a downstream service fails repeatedly, stop calling it for a period (open circuit) instead of accumulating timeouts. After a cooldown, allow a test request (half-open). If it succeeds, close the circuit; if not, reopen.
- Bulkhead: Isolate failures by partitioning resources. If the payment service's thread pool is exhausted, the search service's thread pool is unaffected. Named after ship bulkheads that prevent a hull breach from flooding the entire vessel.
- Retry with exponential backoff: On transient failures, retry with increasing delays (1s, 2s, 4s, 8s) plus jitter (random offset to prevent thundering herd).
- Timeout: Every external call should have a timeout. Without timeouts, a hung dependency can cascade to all callers.
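The circuit breaker's closed → open → half-open state machine can be sketched in a few dozen lines; this is a minimal single-threaded version (production libraries add locking, metrics, and per-exception policies), with the clock injected so it can be tested:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable clock, for testing
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half-open"    # cooldown elapsed: allow one test request
            else:
                raise RuntimeError("circuit open")  # fail fast, no downstream call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed test request, or too many consecutive failures, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        if self.state == "half-open":
            self.state = "closed"           # test request succeeded: recover
        return result
```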
Hybrid Cloud¶
Combines on-premises infrastructure with public cloud services. Common in enterprises with existing data centers, regulatory requirements, or latency-sensitive workloads.
Technologies:
- AWS Outposts: AWS hardware in your data center
- Azure Arc: Manage on-premises, multi-cloud, and edge resources from Azure
- Google Anthos: Run GKE clusters anywhere (on-prem, AWS, Azure)
- VPN / Direct Connect / ExpressRoute: Secure, dedicated connections between on-premises and cloud
Multi-Cloud¶
Using services from multiple cloud providers to avoid vendor lock-in, leverage best-of-breed services, or meet regulatory requirements.
Challenges: Different APIs, pricing models, IAM systems, and networking models. Requires abstraction layers (Terraform, Pulumi, Crossplane) and potentially higher operational complexity. True multi-cloud (running the same workload across providers) is rare; more common is "multi-cloud by choice" (different workloads on different providers).
Serverless Architecture¶
Serverless extends beyond individual functions (FaaS) to entire architectures where the cloud provider manages all infrastructure and you pay only for what you use.
Event-Driven Architecture with Serverless¶
┌───────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐
│ API │────▶│ Lambda │────▶│ DynamoDB │ │ S3 Bucket │
│ Gateway │ │ Function │ │ (storage)│ │ (uploads) │
└───────────┘ └──────────┘ └──────────┘ └──────┬──────┘
│ Event
▼
┌───────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ CloudWatch│────▶│ Lambda │ │ Lambda │◀────│ SQS │
│ Events │ │ (cron) │ │ (process)│ │ Queue │
└───────────┘ └──────────┘ └──────────┘ └──────────┘
Common serverless event sources: API Gateway (HTTP), S3 (file uploads), SQS (queues), SNS (pub/sub), DynamoDB Streams (data changes), CloudWatch Events/EventBridge (scheduled, AWS events), Kinesis (streaming data).
Step Functions / Workflows¶
For complex multi-step workflows that require orchestration, error handling, and state management:
{
"Comment": "Order processing workflow",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
"Next": "ProcessPayment",
"Catch": [{
"ErrorEquals": ["ValidationError"],
"Next": "OrderFailed"
}]
},
"ProcessPayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:payment",
"Next": "FulfillOrder",
"Retry": [{
"ErrorEquals": ["PaymentTimeout"],
"IntervalSeconds": 5,
"MaxAttempts": 3,
"BackoffRate": 2.0
}]
},
"FulfillOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:fulfill",
"End": true
},
"OrderFailed": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-failure",
"End": true
}
}
}
Serverless Databases¶
| Database | Type | Scaling | Pricing Model |
|---|---|---|---|
| DynamoDB | NoSQL (key-value/document) | On-demand or provisioned | Per read/write request unit |
| Aurora Serverless v2 | Relational (MySQL/PostgreSQL) | Auto-scales 0.5-128 ACUs | Per ACU-hour |
| Neon | PostgreSQL | Auto-scales, scales to zero | Per compute-hour + storage |
| PlanetScale | MySQL (Vitess) | Auto-scales | Per row read/write |
Cloud Cost Optimization (FinOps)¶
Cloud costs can spiral quickly without discipline. FinOps (Financial Operations) is the practice of bringing financial accountability to cloud spending.
Key Principles¶
- Visibility: Tag all resources, set up cost dashboards, use cost allocation reports.
- Optimization: Right-size instances, use reserved/spot/preemptible instances, delete unused resources.
- Governance: Set budgets and alerts, implement approval workflows for expensive resources.
Cost Reduction Strategies¶
| Strategy | Savings Potential | Description |
|---|---|---|
| Right-sizing | 20-40% | Match instance types to actual workload needs. Most instances are over-provisioned |
| Reserved Instances / Committed Use | 30-72% | Commit to 1-3 year usage for significant discounts |
| Savings Plans | 30-72% | More flexible than RIs: commit to $/hour spend, not specific instance types |
| Spot/Preemptible Instances | 60-90% | Use spare capacity at steep discounts for fault-tolerant workloads (batch processing, CI/CD, data pipelines) |
| Auto-scaling | Variable | Scale resources up/down based on demand. Don't pay for idle capacity |
| Storage tiering | 40-80% | Move infrequently accessed data to cheaper storage classes (S3 Glacier, Coldline) |
| Serverless | Variable | Pay only for actual execution time. Ideal for sporadic workloads |
| Scheduled scaling | 20-50% | Turn off dev/test environments during nights and weekends |
| Graviton/ARM instances | 20-40% | ARM-based instances offer better price-performance for compatible workloads |
Tagging strategy — Tags are the foundation of cost allocation. Minimum recommended tags:
| Tag Key | Purpose | Example Values |
|---|---|---|
| Environment | Separate costs by environment | production, staging, development |
| Team | Allocate costs to teams | platform, backend, data, ml |
| Service | Track costs per service | user-api, payment-service, search |
| CostCenter | Map to financial cost centers | eng-001, marketing-002 |
| Owner | Identify responsible person | alice@company.com |
# Example: AWS Cost Explorer CLI query
aws ce get-cost-and-usage \
--time-period Start=2025-01-01,End=2025-01-31 \
--granularity MONTHLY \
--metrics "BlendedCost" "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
Spot Instance Strategies¶
Spot instances offer 60-90% discounts but can be reclaimed with 2 minutes' notice. Effective strategies:
- Diversify instance types: Request multiple instance types in your auto-scaling group; if one type is reclaimed, others may still be available
- Use capacity-optimized allocation: Let the provider choose the instance type with the most available capacity
- Handle interruptions gracefully: Use the 2-minute warning to drain connections and checkpoint work
- Mix on-demand and spot: Run baseline on on-demand, burst on spot (e.g., 30% on-demand, 70% spot)
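On EC2, the interruption notice appears at the instance metadata path `/latest/meta-data/spot/instance-action` roughly two minutes before reclamation. A sketch of a watcher is below; the HTTP fetch is injected as a callable so the logic is testable off-instance (on a real instance it would GET `http://169.254.169.254` + the path):

```python
import json

METADATA_PATH = "/latest/meta-data/spot/instance-action"

def interruption_action(fetch):
    """Return the pending spot action as a dict, or None if nothing is scheduled.

    `fetch(path)` should return the response body as a string, or None when
    the path 404s (i.e., no interruption notice is pending).
    """
    body = fetch(METADATA_PATH)
    if body is None:
        return None
    # Body looks like {"action": "terminate", "time": "2025-01-01T00:00:00Z"}.
    # On receipt: drain in-flight connections and checkpoint work within ~2 min.
    return json.loads(body)
```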
Cloud Migration Strategies¶
The 7 R's framework for migrating workloads to the cloud:
| Strategy | Description | Effort | Risk | When to Use |
|---|---|---|---|---|
| Rehost (Lift & Shift) | Move as-is to cloud VMs | Low | Low | Quick migration, minimal changes |
| Replatform (Lift & Reshape) | Minor optimization during migration | Medium | Low-Medium | Use managed services (RDS instead of self-managed DB) |
| Repurchase | Replace with SaaS product | Low | Medium | On-prem CRM → Salesforce, on-prem email → Gmail |
| Refactor / Re-architect | Rewrite to be cloud-native | High | High | Critical apps that benefit from cloud-native features |
| Retire | Decommission applications no longer needed | Low | Low | Reduce portfolio before migration |
| Retain | Keep on-premises (for now) | None | None | Regulatory, too complex, or recently upgraded |
| Relocate | Move to cloud without changes (VMware on cloud) | Low | Low | VMware environments → VMware Cloud on AWS |
Migration phases:
1. Assessment: Inventory applications, map dependencies, assess cloud readiness
2. Planning: Choose migration strategy per application, design target architecture, build business case
3. Migration: Execute migration in waves, validate functionality
4. Optimization: Right-size resources, implement auto-scaling, optimize costs
Database migration is typically the most complex part. AWS Database Migration Service (DMS) supports heterogeneous migrations (Oracle → PostgreSQL) with the Schema Conversion Tool (SCT). For homogeneous migrations (MySQL → Aurora MySQL), native replication can minimize downtime to seconds.