
Infrastructure as Code (IaC)

Introduction and Evolution

Infrastructure as Code (IaC) represents a fundamental paradigm shift in how we provision, manage, and maintain computing infrastructure. Traditional manual configuration leads to "snowflake" environments—unique, hard-to-reproduce setups prone to errors and drift. IaC solves this by codifying everything, enabling rapid provisioning (minutes vs. days), easy replication (e.g., duplicate environments for new branches), and quick recovery via rollbacks.

Historical Context

The evolution of infrastructure management has progressed through several distinct phases:

Phase 1: Manual Configuration (Pre-2000s) System administrators manually configured each server through interactive sessions. This approach was:

  • Time-consuming and error-prone
  • Impossible to reproduce consistently
  • Dependent on tribal knowledge and documentation that quickly became outdated
  • Prone to producing "snowflake servers," where each machine was unique

Phase 2: Script-Based Automation (2000s) Shell scripts and batch files began automating repetitive tasks:

#!/bin/bash
# Early automation example
apt-get update
apt-get install -y nginx
cp /path/to/config /etc/nginx/nginx.conf
systemctl start nginx
systemctl enable nginx

Limitations included:

  • Scripts were often not idempotent (running twice could cause issues)
  • No state tracking—scripts didn't know what was already done
  • Poor error handling and recovery
  • Environment-specific hardcoding

Phase 3: Configuration Management Tools (2005-2012) Tools like Puppet (2005), Chef (2009), and later Ansible (2012) introduced:

  • Declarative or semi-declarative syntax
  • Idempotent operations
  • Centralized management
  • Resource abstraction

Phase 4: Cloud-Native IaC (2010-Present) The cloud era brought tools designed for provisioning entire infrastructures:

  • AWS CloudFormation (2011): First major cloud-native IaC tool
  • Terraform (2014): Multi-cloud, provider-agnostic approach
  • Pulumi (2018): Real programming languages for infrastructure
  • Crossplane (2018): Kubernetes-native infrastructure management

The Problem IaC Solves

Consider a typical pre-IaC scenario:

  1. Developer requests a new environment for testing
  2. Operations receives ticket, waits in queue (days)
  3. Manual setup through cloud console (hours, error-prone)
  4. Documentation updated (often incomplete or forgotten)
  5. Drift occurs as ad-hoc changes accumulate
  6. Environment becomes irreproducible—nobody knows exact state
  7. Disaster recovery requires heroic manual effort

With IaC:

  1. Developer clones infrastructure code
  2. Modifies parameters for new environment
  3. Runs terraform apply or equivalent
  4. Infrastructure provisions in minutes, identically to production
  5. Changes tracked in version control
  6. Recovery is simply re-running the code

Core Principles of IaC

IaC follows fundamental principles that distinguish it from ad-hoc automation:

1. Idempotence

Definition: An operation is idempotent if applying it multiple times produces the same result as applying it once.

# Idempotent: Running 10 times = running once
desired_state: server exists with 4GB RAM

# NOT idempotent: Running 10 times ≠ running once
action: create a server with 4GB RAM  # Creates 10 servers!

Why It Matters:

  • Safe to retry failed operations
  • Convergence to desired state regardless of current state
  • Enables automated remediation and drift correction

Implementation Strategies:

# Non-idempotent approach
def create_user(username):
    run_command(f"useradd {username}")  # Fails if user exists

# Idempotent approach
def ensure_user(username):
    if not user_exists(username):
        run_command(f"useradd {username}")
    # If user exists, do nothing - same end state

Most IaC tools achieve idempotence through:

  • State comparison: Compare desired vs. current state
  • Resource identification: Use unique identifiers to track resources
  • Conditional execution: Only perform actions when needed
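The three mechanisms above can be combined in a short Python sketch (the `cloud` dict is a stand-in for a real provider API; the resource names are illustrative):

```python
# Idempotent "ensure" sketch: compare desired vs. current state and
# act only on the difference. Running it N times == running it once.
def ensure_server(cloud, name, ram_gb):
    current = cloud.get(name)              # resource identification by unique name
    if current is None:
        cloud[name] = {"ram_gb": ram_gb}   # create only if missing
        return "created"
    if current["ram_gb"] != ram_gb:
        current["ram_gb"] = ram_gb         # correct only the drifted attribute
        return "updated"
    return "unchanged"                     # already converged: do nothing

cloud = {}
ensure_server(cloud, "web-1", 4)   # first run: creates the server
ensure_server(cloud, "web-1", 4)   # second run: no-op (idempotent)
ensure_server(cloud, "web-1", 8)   # drift correction: updates in place
```

Safe retries fall out for free: a failed run can simply be re-executed, and the function converges to the same end state.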

2. Version Control

All infrastructure code belongs in version control (Git), enabling:

Change Tracking:

git log --oneline infrastructure/
a1b2c3d Add auto-scaling to web tier
d4e5f6g Increase RDS instance size for production
g7h8i9j Initial VPC and networking setup

Code Review for Infrastructure:

# Pull request shows exactly what changes
- instance_type: "t3.medium"
+ instance_type: "t3.large"  # Reviewer can assess impact

Branching Strategies:

main (production) ──────────────────────────────────────►
        │
        └── feature/add-cache ──► PR ──► merge
        │
        └── hotfix/security-patch ──► emergency merge

Audit Trail: Every change has author, timestamp, and reason (commit message).

3. Declarative Over Imperative

Declarative ("what"): Define the desired end state; the tool figures out how.

# Terraform (declarative)
resource "aws_instance" "web" {
  count         = 3
  instance_type = "t3.micro"
}
# "I want 3 t3.micro instances to exist"
# Terraform handles: create new, modify existing, or delete excess

Imperative ("how"): Specify step-by-step instructions.

# Ansible (imperative)
- name: Create EC2 instances
  ec2_instance:
    state: present
    instance_type: t3.micro
  loop: "{{ range(3) | list }}"
# "Execute these steps to create instances"

Why Declarative Dominates:

| Aspect           | Declarative                 | Imperative                     |
|------------------|-----------------------------|--------------------------------|
| Complexity       | Tool handles orchestration  | You manage order/dependencies  |
| Idempotence      | Built-in                    | Must be coded                  |
| Drift correction | Automatic convergence       | Manual scripting needed        |
| Learning curve   | Define "what", not "how"    | Need procedural knowledge      |
| Flexibility      | Less (constrained by tool)  | More (full control)            |

4. Immutable Infrastructure

Traditional (mutable): Update servers in place.

Server v1 ──patch──► Server v1.1 ──config──► Server v1.1a ──hotfix──► ???
                    (drift accumulates, state unknown)

Immutable: Replace servers entirely.

Server v1 (discard) ──► Server v2 (fresh) ──► Server v3 (fresh)
                       (known state)          (known state)

Benefits of Immutability:

  • No configuration drift
  • Consistent, tested images
  • Easy rollback (switch to previous version)
  • Better security (no accumulated patches)

Implementation: Build machine images (AMI, Docker) with all configurations baked in. Deploy new instances; destroy old ones.
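A toy Python sketch of that replace-and-discard cycle (the fleet list, image versions, and instance IDs are illustrative, not any real cloud API):

```python
# Immutable rollout sketch: never patch running servers; launch a fresh
# generation from a new image, cut over, and retire the old generation.
def deploy_immutable(fleet, image_version, size):
    old = list(fleet)
    new = [{"image": image_version, "id": f"{image_version}-{i}"}
           for i in range(size)]
    fleet[:] = new   # cut over to the fresh, known-state generation
    return old       # old generation handed back for termination (or rollback)

fleet = [{"image": "v1", "id": "v1-0"}]
retired = deploy_immutable(fleet, "v2", 2)
# fleet now holds only v2 instances; `retired` keeps the v1 generation,
# which can be held briefly to allow instant rollback.
```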

5. Modularity and Reusability

Break infrastructure into composable modules:

infrastructure/
├── modules/
│   ├── networking/          # VPC, subnets, gateways
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── compute/             # EC2, auto-scaling
│   ├── database/            # RDS, replicas
│   └── security/            # IAM, security groups
├── environments/
│   ├── dev/
│   │   └── main.tf          # Uses modules with dev params
│   ├── staging/
│   └── production/

Module Contract:

# modules/networking/variables.tf (inputs)
variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
}

variable "environment" {
  description = "Environment name"
  type        = string
}

# modules/networking/outputs.tf (outputs)
output "vpc_id" {
  description = "ID of created VPC"
  value       = aws_vpc.main.id
}

output "subnet_ids" {
  description = "IDs of created subnets"
  value       = aws_subnet.main[*].id
}

6. Self-Documenting Infrastructure

The code IS the documentation:

# This IS the production infrastructure specification
# Not a wiki page that might be outdated

resource "aws_db_instance" "production" {
  identifier        = "prod-primary-db"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 500

  multi_az                = true  # High availability enabled
  backup_retention_period = 30    # 30 days of backups

  tags = {
    Environment = "production"
    Owner       = "platform-team"
    CostCenter  = "infrastructure"
  }
}

Declarative vs. Imperative: Deep Dive

Understanding this distinction is crucial for choosing the right tool.

Declarative Model

How It Works:

  1. User defines desired state in configuration
  2. Tool reads current state (from cloud API or state file)
  3. Tool computes difference (plan)
  4. Tool executes minimal changes to reach desired state
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Desired State  │    │  Current State  │    │     Plan        │
│  (config file)  │───►│  (API/state)    │───►│  (diff)         │
└─────────────────┘    └─────────────────┘    └────────┬────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │    Execute      │
                                              │  (apply diff)   │
                                              └─────────────────┘

Terraform Example:

# Desired: 3 instances in us-west-2
resource "aws_instance" "web" {
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "web-${count.index}"
  }
}

If current state has 2 instances → Terraform creates 1 more. If current state has 5 instances → Terraform destroys 2. If current state has 3 correct instances → Terraform does nothing.
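That convergence behavior is just arithmetic on a diff; a minimal Python sketch (not Terraform's actual planner):

```python
# Declarative convergence sketch: derive a plan from desired vs. current
# instance counts, mirroring the create/destroy/no-op behavior above.
def plan(desired_count, current_count):
    delta = desired_count - current_count
    if delta > 0:
        return {"create": delta, "destroy": 0}
    if delta < 0:
        return {"create": 0, "destroy": -delta}
    return {"create": 0, "destroy": 0}   # already converged

plan(3, 2)  # {'create': 1, 'destroy': 0} - one instance added
plan(3, 5)  # {'create': 0, 'destroy': 2} - two instances removed
plan(3, 3)  # {'create': 0, 'destroy': 0} - nothing to do
```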

Imperative Model

How It Works:

  1. User defines sequence of operations
  2. Tool executes operations in order
  3. User must handle conditionals and state checking
# Ansible: Procedural steps
- name: Install packages
  apt:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - certbot

- name: Copy configuration
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx

- name: Ensure service running
  service:
    name: nginx
    state: started
    enabled: yes

Hybrid Approaches

Many tools blend both paradigms:

Ansible (primarily imperative with declarative modules):

# Declarative module usage within imperative playbook
- name: Ensure EC2 instance exists
  amazon.aws.ec2_instance:
    state: present              # Declarative: desired state
    name: "my-instance"
    instance_type: t3.micro
    image_id: ami-12345678

Pulumi (declarative intent with programming constructs):

// Declarative resource definition with imperative logic
const instances = [];
for (let i = 0; i < (config.getNumber("instanceCount") || 3); i++) {
    instances.push(new aws.ec2.Instance(`web-${i}`, {
        instanceType: "t3.micro",
        ami: ami.id,
    }));
}

When to Use Each

| Scenario                           | Recommended Approach                       |
|------------------------------------|--------------------------------------------|
| Cloud infrastructure provisioning  | Declarative (Terraform, CloudFormation)    |
| Server configuration               | Imperative (Ansible) or Declarative (Puppet) |
| Complex orchestration workflows    | Imperative (Ansible)                       |
| Kubernetes applications            | Declarative (Helm, Kustomize)              |
| One-time migrations                | Imperative scripts                         |
| Continuous state enforcement       | Declarative                                |

IaC Tool Landscape

Categorization by Purpose

┌─────────────────────────────────────────────────────────────────────────┐
│                        Infrastructure as Code Tools                      │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────────────┐  ┌─────────────────────────────────┐  │
│  │   PROVISIONING              │  │   CONFIGURATION MANAGEMENT       │  │
│  │   (Create infrastructure)   │  │   (Configure systems)            │  │
│  │                             │  │                                   │  │
│  │   • Terraform / OpenTofu    │  │   • Ansible                       │  │
│  │   • Pulumi                  │  │   • Puppet                        │  │
│  │   • AWS CloudFormation      │  │   • Chef                          │  │
│  │   • Azure ARM / Bicep       │  │   • SaltStack                     │  │
│  │   • Google Cloud DM         │  │                                   │  │
│  │   • Crossplane              │  │                                   │  │
│  └─────────────────────────────┘  └─────────────────────────────────┘  │
│                                                                          │
│  ┌─────────────────────────────┐  ┌─────────────────────────────────┐  │
│  │   KUBERNETES-NATIVE         │  │   POLICY & COMPLIANCE            │  │
│  │   (K8s workloads)           │  │   (Governance)                   │  │
│  │                             │  │                                   │  │
│  │   • Helm                    │  │   • Open Policy Agent (OPA)       │  │
│  │   • Kustomize               │  │   • HashiCorp Sentinel            │  │
│  │   • Crossplane              │  │   • Checkov                       │  │
│  │   • ArgoCD / Flux           │  │   • tfsec / Trivy                 │  │
│  └─────────────────────────────┘  └─────────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Quick Comparison Matrix

| Tool           | Type        | Language     | State         | Multi-Cloud | Best For               |
|----------------|-------------|--------------|---------------|-------------|------------------------|
| Terraform      | Declarative | HCL          | External      | Yes         | General provisioning   |
| OpenTofu       | Declarative | HCL          | External      | Yes         | Open-source Terraform  |
| Pulumi         | Declarative | TS/Python/Go | External      | Yes         | Developer-centric IaC  |
| CloudFormation | Declarative | YAML/JSON    | AWS-managed   | AWS only    | AWS-native shops       |
| ARM/Bicep      | Declarative | JSON/Bicep   | Azure-managed | Azure only  | Azure-native shops     |
| Ansible        | Imperative  | YAML         | Stateless     | Yes         | Configuration mgmt     |
| Helm           | Declarative | YAML+Go tmpl | K8s secrets   | K8s only    | K8s app packaging      |
| Crossplane     | Declarative | YAML (K8s)   | K8s           | Yes         | K8s-native infra       |

Terraform

Terraform, developed by HashiCorp, is the leading declarative IaC tool (OpenTofu is its open-source fork, created after HashiCorp relicensed Terraform under the BUSL). It embodies IaC principles by defining infrastructure in HashiCorp Configuration Language (HCL), which is human-readable and versionable.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                         Terraform Architecture                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   ┌──────────────────┐                                                  │
│   │  Configuration   │  .tf files (HCL)                                 │
│   │  Files           │  Define desired state                            │
│   └────────┬─────────┘                                                  │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────┐     ┌──────────────────┐                         │
│   │  Terraform Core  │────►│  State File      │                         │
│   │                  │     │  (.tfstate)      │                         │
│   │  • Parse config  │     │                  │                         │
│   │  • Build graph   │     │  Maps config to  │                         │
│   │  • Plan changes  │     │  real resources  │                         │
│   │  • Apply changes │     │                  │                         │
│   └────────┬─────────┘     └──────────────────┘                         │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                         Providers                                 │  │
│   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │  │
│   │  │  AWS    │  │  Azure  │  │  GCP    │  │ Custom  │  ...       │  │
│   │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘            │  │
│   └───────┼────────────┼────────────┼────────────┼───────────────────┘  │
│           │            │            │            │                       │
│           ▼            ▼            ▼            ▼                       │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    Cloud Provider APIs                           │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

HCL Language Deep Dive

HCL (HashiCorp Configuration Language) is designed specifically for infrastructure definition.

Basic Syntax:

# Block types: resource, data, variable, output, locals, module, provider

# Provider configuration
provider "aws" {
  region = "us-west-2"

  default_tags {
    tags = {
      ManagedBy = "Terraform"
    }
  }
}

# Resource definition
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "WebServer"
  }
}

Variables and Types:

# variables.tf - Input variables
variable "environment" {
  description = "Deployment environment"
  type        = string
  default     = "development"

  validation {
    condition     = contains(["development", "staging", "production"], var.environment)
    error_message = "Environment must be development, staging, or production."
  }
}

variable "instance_config" {
  description = "Instance configuration"
  type = object({
    instance_type = string
    volume_size   = number
    enable_monitoring = bool
  })
  default = {
    instance_type     = "t3.micro"
    volume_size       = 20
    enable_monitoring = false
  }
}

variable "allowed_cidrs" {
  description = "List of allowed CIDR blocks"
  type        = list(string)
  default     = ["10.0.0.0/8"]
}

variable "tags" {
  description = "Resource tags"
  type        = map(string)
  default     = {}
}

Local Values:

locals {
  # Computed values used throughout configuration
  name_prefix = "${var.project}-${var.environment}"

  common_tags = merge(var.tags, {
    Environment = var.environment
    Project     = var.project
    ManagedBy   = "Terraform"
  })

  # Conditional logic
  instance_type = var.environment == "production" ? "t3.large" : "t3.micro"
}

Data Sources (read existing resources):

# Query existing resources
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

data "aws_vpc" "existing" {
  tags = {
    Name = "main-vpc"
  }
}

data "aws_subnets" "existing" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.existing.id]
  }
}

# Use in resources
resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  subnet_id     = data.aws_subnets.existing.ids[0]
  instance_type = "t3.micro"
}

Outputs:

output "instance_ip" {
  description = "Public IP of the instance"
  value       = aws_instance.web.public_ip
}

output "instance_details" {
  description = "Full instance details"
  value = {
    id         = aws_instance.web.id
    public_ip  = aws_instance.web.public_ip
    private_ip = aws_instance.web.private_ip
  }
  sensitive = false
}

Control Flow and Expressions

Count (create multiple similar resources):

variable "instance_count" {
  default = 3
}

resource "aws_instance" "web" {
  count = var.instance_count

  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  tags = {
    Name = "web-${count.index}"  # web-0, web-1, web-2
  }
}

# Reference: aws_instance.web[0], aws_instance.web[1], etc.
# All instances: aws_instance.web[*].public_ip

For_each (create resources from a map/set):

variable "instances" {
  default = {
    web = {
      instance_type = "t3.micro"
      az            = "us-west-2a"
    }
    api = {
      instance_type = "t3.small"
      az            = "us-west-2b"
    }
    worker = {
      instance_type = "t3.medium"
      az            = "us-west-2c"
    }
  }
}

resource "aws_instance" "servers" {
  for_each = var.instances

  ami               = data.aws_ami.ubuntu.id
  instance_type     = each.value.instance_type
  availability_zone = each.value.az

  tags = {
    Name = each.key  # web, api, worker
  }
}

# Reference: aws_instance.servers["web"], aws_instance.servers["api"]

Dynamic Blocks (generate nested blocks):

variable "ingress_rules" {
  default = [
    { port = 80, cidr = "0.0.0.0/0", description = "HTTP" },
    { port = 443, cidr = "0.0.0.0/0", description = "HTTPS" },
    { port = 22, cidr = "10.0.0.0/8", description = "SSH internal" },
  ]
}

resource "aws_security_group" "web" {
  name = "web-sg"

  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = "tcp"
      cidr_blocks = [ingress.value.cidr]
      description = ingress.value.description
    }
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Conditional Expressions:

# Ternary operator
resource "aws_instance" "web" {
  instance_type = var.environment == "production" ? "t3.large" : "t3.micro"

  # Conditional resource creation
  count = var.create_instance ? 1 : 0
}

# Conditional in for_each
resource "aws_eip" "web" {
  for_each = var.environment == "production" ? toset(["primary", "secondary"]) : toset([])

  instance = aws_instance.web[0].id
}

State Management Deep Dive

State is Terraform's mechanism for mapping configuration to real-world resources.

State File Structure (.tfstate):

{
  "version": 4,
  "terraform_version": "1.6.0",
  "serial": 42,
  "lineage": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "outputs": {
    "instance_ip": {
      "value": "54.123.45.67",
      "type": "string"
    }
  },
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "schema_version": 1,
          "attributes": {
            "id": "i-0123456789abcdef0",
            "ami": "ami-0c55b159cbfafe1f0",
            "instance_type": "t3.micro",
            "public_ip": "54.123.45.67",
            "private_ip": "10.0.1.50",
            "tags": {
              "Name": "WebServer"
            }
            // ... many more attributes
          }
        }
      ]
    }
  ]
}
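Because the state file is plain JSON, tooling can inspect it directly. A sketch that lists resource addresses the way `terraform state list` does, assuming the version-4 layout shown above:

```python
import json

# Sketch: parse a version-4 .tfstate document and list the addresses
# of managed resources (data sources have mode "data" and are skipped).
def state_addresses(tfstate_json):
    state = json.loads(tfstate_json)
    return [f"{r['type']}.{r['name']}"
            for r in state.get("resources", [])
            if r.get("mode") == "managed"]

example = ('{"version": 4, "resources": '
           '[{"mode": "managed", "type": "aws_instance", "name": "web"}]}')
state_addresses(example)  # ['aws_instance.web']
```

In practice, prefer `terraform state list`/`terraform show -json` over reading `.tfstate` by hand; the file's internal layout is not a stable public interface.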

Remote State Backends:

# S3 backend with DynamoDB locking (recommended for AWS)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # For state locking
  }
}

# Azure Blob Storage
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "tfstate12345"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

# Google Cloud Storage
terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "terraform/state"
  }
}

# HCP Terraform (formerly Terraform Cloud)
terraform {
  cloud {
    organization = "my-org"
    workspaces {
      name = "my-workspace"
    }
  }
}

State Operations:

# List resources in state
terraform state list

# Show specific resource
terraform state show aws_instance.web

# Move resource (rename or move to module)
terraform state mv aws_instance.web aws_instance.webserver

# Remove resource from state (doesn't destroy actual resource)
terraform state rm aws_instance.web

# Import existing resource into state
terraform import aws_instance.web i-0123456789abcdef0

# Pull remote state locally
terraform state pull > backup.tfstate

# Push local state to remote
terraform state push backup.tfstate

# Force unlock state (use carefully)
terraform force-unlock LOCK_ID

State Locking:

┌─────────────┐     ┌─────────────────────┐     ┌─────────────────┐
│  User A     │────►│  DynamoDB Lock      │◄────│  User B         │
│  terraform  │     │  Table              │     │  terraform      │
│  apply      │     │                     │     │  apply          │
└─────────────┘     │  Lock: User A       │     └─────────────────┘
                    │  ID: abc123         │            │
                    │  Created: 10:00     │            │
                    └─────────────────────┘            │
                                                       ▼
                                              "Error: state locked"
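The locking protocol amounts to a conditional write: acquire only if no lock item exists. A sketch in which an in-memory dict stands in for the DynamoDB table (real locking relies on an atomic conditional PutItem):

```python
# State-lock sketch: DynamoDB-style conditional put. Only one holder may
# create the lock item; concurrent acquirers fail fast with an error.
def acquire_lock(table, state_key, holder):
    if state_key in table:                 # conditional check: item must not exist
        raise RuntimeError(f"Error: state locked by {table[state_key]}")
    table[state_key] = holder              # atomic in DynamoDB via a condition expression
    return True

def release_lock(table, state_key, holder):
    if table.get(state_key) == holder:     # only the holder may release
        del table[state_key]

locks = {}
acquire_lock(locks, "prod/infrastructure.tfstate", "user-a")
# A second `terraform apply` now fails until user-a releases:
# acquire_lock(locks, "prod/infrastructure.tfstate", "user-b")  -> RuntimeError
release_lock(locks, "prod/infrastructure.tfstate", "user-a")
```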

Modules

Modules are reusable packages of Terraform configuration.

Module Structure:

modules/
└── vpc/
    ├── main.tf           # Primary resources
    ├── variables.tf      # Input variables
    ├── outputs.tf        # Output values
    ├── versions.tf       # Provider/Terraform version constraints
    ├── README.md         # Documentation
    └── examples/         # Usage examples
        └── complete/
            └── main.tf

Module Definition (modules/vpc/main.tf):

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.tags, {
    Name = "${var.name}-vpc"
  })
}

resource "aws_subnet" "public" {
  count = length(var.public_subnet_cidrs)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = merge(var.tags, {
    Name = "${var.name}-public-${count.index + 1}"
    Tier = "Public"
  })
}

resource "aws_subnet" "private" {
  count = length(var.private_subnet_cidrs)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(var.tags, {
    Name = "${var.name}-private-${count.index + 1}"
    Tier = "Private"
  })
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(var.tags, {
    Name = "${var.name}-igw"
  })
}

resource "aws_nat_gateway" "main" {
  count = var.enable_nat_gateway ? length(var.public_subnet_cidrs) : 0

  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(var.tags, {
    Name = "${var.name}-nat-${count.index + 1}"
  })
}

resource "aws_eip" "nat" {
  count  = var.enable_nat_gateway ? length(var.public_subnet_cidrs) : 0
  domain = "vpc"

  tags = merge(var.tags, {
    Name = "${var.name}-nat-eip-${count.index + 1}"
  })
}

Module Variables (modules/vpc/variables.tf):

variable "name" {
  description = "Name prefix for resources"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)
  default     = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
}

variable "availability_zones" {
  description = "Availability zones"
  type        = list(string)
}

variable "enable_nat_gateway" {
  description = "Enable NAT Gateway for private subnets"
  type        = bool
  default     = true
}

variable "tags" {
  description = "Tags to apply to resources"
  type        = map(string)
  default     = {}
}

Module Outputs (modules/vpc/outputs.tf):

output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "vpc_cidr" {
  description = "CIDR block of the VPC"
  value       = aws_vpc.main.cidr_block
}

output "public_subnet_ids" {
  description = "IDs of public subnets"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}

output "nat_gateway_ids" {
  description = "IDs of NAT Gateways"
  value       = aws_nat_gateway.main[*].id
}

Using Modules:

# Local module
module "vpc" {
  source = "./modules/vpc"

  name               = "production"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]
  enable_nat_gateway = true

  tags = {
    Environment = "production"
  }
}

# Public registry module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "my-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-west-2a", "us-west-2b", "us-west-2c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
}

# Git repository module
module "vpc" {
  source = "git::https://github.com/org/terraform-modules.git//vpc?ref=v1.2.0"
  # ...
}

# Use module outputs
resource "aws_instance" "web" {
  subnet_id = module.vpc.public_subnet_ids[0]
  # ...
}

Terraform Workflow

Complete Workflow:

# 1. Initialize working directory
terraform init
# Downloads providers, modules, configures backend

# 2. Format code
terraform fmt -recursive
# Rewrites files to canonical format

# 3. Validate configuration
terraform validate
# Checks syntax and internal consistency

# 4. Plan changes
terraform plan -out=tfplan
# Shows what will change, saves plan file

# 5. Review plan output carefully!
# + create, - destroy, ~ update, -/+ replace

# 6. Apply changes
terraform apply tfplan
# Executes the saved plan

# Alternative: plan and apply in one (prompts for confirmation)
terraform apply

# 7. Destroy (when needed)
terraform destroy
# Removes all managed resources

Plan Output Interpretation:

Terraform will perform the following actions:

  # aws_instance.web will be created
  + resource "aws_instance" "web" {
      + ami                          = "ami-0c55b159cbfafe1f0"
      + instance_type                = "t3.micro"
      + id                           = (known after apply)
      + public_ip                    = (known after apply)
    }

  # aws_instance.api will be updated in-place
  ~ resource "aws_instance" "api" {
        id            = "i-0123456789abcdef0"
      ~ instance_type = "t3.micro" -> "t3.small"
    }

  # aws_instance.worker must be replaced
-/+ resource "aws_instance" "worker" {
        ~ ami           = "ami-old123" -> "ami-new456" # forces replacement
        ~ id            = "i-0987654321fedcba0" -> (known after apply)
    }

  # aws_instance.deprecated will be destroyed
  - resource "aws_instance" "deprecated" {
      - id            = "i-todelete123"
      - instance_type = "t2.micro"
    }

Plan: 2 to add, 1 to change, 2 to destroy.
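The summary line can be derived by tallying the action markers; a sketch (note that a resource marked `-/+` counts toward both add and destroy, since the old instance is destroyed and a new one created):

```python
# Sketch: build Terraform's plan summary from a list of action markers.
# A replacement (-/+) contributes one destroy and one add.
def summarize(actions):
    add     = actions.count("+") + actions.count("-/+")
    change  = actions.count("~")
    destroy = actions.count("-") + actions.count("-/+")
    return f"Plan: {add} to add, {change} to change, {destroy} to destroy."

summarize(["+", "~", "-/+", "-"])
# 'Plan: 2 to add, 1 to change, 2 to destroy.'
```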

Lifecycle Management

Control how Terraform manages resources:

resource "aws_instance" "web" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  lifecycle {
    # Create new before destroying old (zero-downtime updates)
    create_before_destroy = true

    # Prevent accidental destruction
    prevent_destroy = true

    # Ignore changes to specific attributes (avoid drift detection)
    ignore_changes = [
      tags["LastModified"],
      user_data,
    ]

    # Custom replacement triggers
    replace_triggered_by = [
      aws_ami.ubuntu.id
    ]
  }
}

# Preconditions and postconditions
resource "aws_instance" "web" {
  instance_type = var.instance_type

  lifecycle {
    precondition {
      condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
      error_message = "Instance type must be t3.micro, t3.small, or t3.medium."
    }

    postcondition {
      condition     = self.public_ip != ""
      error_message = "Instance must have a public IP address."
    }
  }
}

Terraform Best Practices

Project Structure:

terraform-infrastructure/
├── modules/                    # Reusable modules
│   ├── networking/
│   ├── compute/
│   ├── database/
│   └── security/
├── environments/               # Environment-specific configs
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── production/
├── .gitignore
├── .terraform-version          # tfenv version file
└── README.md

.gitignore:

# Local .terraform directories
**/.terraform/*

# .tfstate files
*.tfstate
*.tfstate.*

# Crash log files
crash.log
crash.*.log

# Exclude all .tfvars files, which may contain sensitive data
*.tfvars
*.tfvars.json

# Ignore override files
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# Ignore CLI config files
.terraformrc
terraform.rc

# Ignore lock file for module development
# .terraform.lock.hcl  # Usually commit this for consistent provider versions

Naming Conventions:

# Resources: lowercase with underscores
resource "aws_instance" "web_server" { }
resource "aws_security_group" "allow_https" { }

# Variables: lowercase with underscores
variable "instance_type" { }
variable "enable_monitoring" { }

# Outputs: lowercase with underscores
output "instance_public_ip" { }

# Modules: lowercase with hyphens (directory names)
module "web-cluster" {
  source = "./modules/web-cluster"
}

Pulumi

Pulumi takes a different approach to IaC by allowing you to use general-purpose programming languages (TypeScript, Python, Go, C#, Java) or YAML instead of a domain-specific language.

Philosophy

Traditional IaC DSL (Terraform HCL):
┌─────────────────────────────────────┐
│  resource "aws_instance" "web" {    │
│    ami           = var.ami          │
│    instance_type = "t3.micro"       │
│  }                                  │
└─────────────────────────────────────┘
         Limited expressiveness

Pulumi (Real Programming Languages):
┌─────────────────────────────────────┐
│  const instance = new aws.ec2.     │
│    Instance("web", {                │
│      ami: ami.id,                   │
│      instanceType: "t3.micro",      │
│    });                              │
│                                     │
│  // Use loops, conditionals,        │
│  // functions, classes, packages    │
└─────────────────────────────────────┘
        Full language power

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                          Pulumi Architecture                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   ┌──────────────────┐                                                  │
│   │  Program         │  TypeScript/Python/Go/C#/Java/YAML               │
│   │  (Your Code)     │                                                  │
│   └────────┬─────────┘                                                  │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────┐     ┌──────────────────┐                         │
│   │  Pulumi Engine   │────►│  State Backend   │                         │
│   │                  │     │                  │                         │
│   │  • Deployment    │     │  • Pulumi Cloud  │                         │
│   │  • Diff/Preview  │     │  • S3/Azure/GCS  │                         │
│   │  • Resource Mgmt │     │  • Local file    │                         │
│   └────────┬─────────┘     └──────────────────┘                         │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                    Resource Providers                             │  │
│   │  (Same providers as Terraform - bridged or native)                │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Examples by Language

TypeScript:

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Configuration
const config = new pulumi.Config();
const instanceCount = config.getNumber("instanceCount") || 3;

// Get latest Ubuntu AMI
const ami = aws.ec2.getAmi({
    mostRecent: true,
    owners: ["099720109477"],
    filters: [{
        name: "name",
        values: ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"],
    }],
});

// Create VPC
const vpc = new aws.ec2.Vpc("main", {
    cidrBlock: "10.0.0.0/16",
    enableDnsHostnames: true,
    tags: { Name: "main-vpc" },
});

// Create instances using a loop
const instances: aws.ec2.Instance[] = [];
for (let i = 0; i < instanceCount; i++) {
    instances.push(new aws.ec2.Instance(`web-${i}`, {
        ami: ami.then(a => a.id),
        instanceType: "t3.micro",
        tags: { Name: `web-${i}` },
    }));
}

// Export outputs
export const instanceIds = instances.map(i => i.id);
export const publicIps = instances.map(i => i.publicIp);

Python:

import pulumi
import pulumi_aws as aws

# Configuration
config = pulumi.Config()
instance_count = config.get_int("instanceCount") or 3

# Get latest Ubuntu AMI
ami = aws.ec2.get_ami(
    most_recent=True,
    owners=["099720109477"],
    filters=[{
        "name": "name",
        "values": ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"],
    }]
)

# Create VPC
vpc = aws.ec2.Vpc("main",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    tags={"Name": "main-vpc"}
)

# Create instances using list comprehension
instances = [
    aws.ec2.Instance(f"web-{i}",
        ami=ami.id,
        instance_type="t3.micro",
        tags={"Name": f"web-{i}"}
    )
    for i in range(instance_count)
]

# Export outputs
pulumi.export("instance_ids", [i.id for i in instances])
pulumi.export("public_ips", [i.public_ip for i in instances])
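One subtlety in the `config.get_int("instanceCount") or 3` idiom above: `or` falls back on any falsy value, so an explicitly configured count of 0 silently becomes 3. A plain-Python sketch of the difference (no Pulumi required):

```python
def count_with_or(value):
    # `value or 3`: any falsy value (None, 0) falls back to 3
    return value or 3

def count_with_none_check(value):
    # fall back only when the value is genuinely unset
    return value if value is not None else 3

print(count_with_or(0))             # 3 -- an explicit zero is overridden
print(count_with_none_check(0))     # 0 -- an explicit zero is honored
print(count_with_none_check(None))  # 3 -- unset still gets the default
```

Use the `None` check if a configured value of 0 should be meaningful (e.g. "create no instances in this stack").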

Go:

package main

import (
    "fmt"

    "github.com/pulumi/pulumi-aws/sdk/v6/go/aws/ec2"
    "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
    "github.com/pulumi/pulumi/sdk/v3/go/pulumi/config"
)

func main() {
    pulumi.Run(func(ctx *pulumi.Context) error {
        cfg := config.New(ctx, "")
        instanceCount := cfg.GetInt("instanceCount")
        if instanceCount == 0 {
            instanceCount = 3
        }

        // Create VPC
        vpc, err := ec2.NewVpc(ctx, "main", &ec2.VpcArgs{
            CidrBlock:          pulumi.String("10.0.0.0/16"),
            EnableDnsHostnames: pulumi.Bool(true),
            Tags: pulumi.StringMap{
                "Name": pulumi.String("main-vpc"),
            },
        })
        if err != nil {
            return err
        }

        // Create instances
        var instanceIds pulumi.StringArray
        for i := 0; i < instanceCount; i++ {
            instance, err := ec2.NewInstance(ctx, fmt.Sprintf("web-%d", i), &ec2.InstanceArgs{
                Ami:          pulumi.String("ami-0c55b159cbfafe1f0"), // example AMI ID; in practice look this up (e.g. ec2.LookupAmi)
                InstanceType: pulumi.String("t3.micro"),
            })
            if err != nil {
                return err
            }
            instanceIds = append(instanceIds, instance.ID())
        }

        ctx.Export("vpcId", vpc.ID())
        ctx.Export("instanceIds", instanceIds)
        return nil
    })
}

Advanced Pulumi Features

Component Resources (reusable abstractions):

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

interface WebClusterArgs {
    instanceCount: number;
    instanceType: string;
    vpcId: pulumi.Input<string>;
    subnetIds: pulumi.Input<string>[];
}

class WebCluster extends pulumi.ComponentResource {
    public readonly instances: aws.ec2.Instance[];
    public readonly loadBalancer: aws.lb.LoadBalancer;
    public readonly url: pulumi.Output<string>;

    constructor(name: string, args: WebClusterArgs, opts?: pulumi.ComponentResourceOptions) {
        super("custom:infrastructure:WebCluster", name, {}, opts);

        // Security group
        const sg = new aws.ec2.SecurityGroup(`${name}-sg`, {
            vpcId: args.vpcId,
            ingress: [
                { protocol: "tcp", fromPort: 80, toPort: 80, cidrBlocks: ["0.0.0.0/0"] },
            ],
            egress: [
                { protocol: "-1", fromPort: 0, toPort: 0, cidrBlocks: ["0.0.0.0/0"] },
            ],
        }, { parent: this });

        // Create instances
        this.instances = [];
        for (let i = 0; i < args.instanceCount; i++) {
            this.instances.push(new aws.ec2.Instance(`${name}-instance-${i}`, {
                instanceType: args.instanceType,
                ami: "ami-0c55b159cbfafe1f0",
                subnetId: args.subnetIds[i % args.subnetIds.length],
                vpcSecurityGroupIds: [sg.id],
            }, { parent: this }));
        }

        // Load balancer
        this.loadBalancer = new aws.lb.LoadBalancer(`${name}-lb`, {
            loadBalancerType: "application",
            securityGroups: [sg.id],
            subnets: args.subnetIds,
        }, { parent: this });

        this.url = pulumi.interpolate`http://${this.loadBalancer.dnsName}`;

        this.registerOutputs({
            url: this.url,
        });
    }
}

// Usage
const cluster = new WebCluster("web", {
    instanceCount: 3,
    instanceType: "t3.micro",
    vpcId: vpc.id,
    subnetIds: publicSubnetIds,
});

export const clusterUrl = cluster.url;

Stack References (cross-stack dependencies):

// infrastructure/index.ts (Stack A)
export const vpcId = vpc.id;
export const subnetIds = subnets.map(s => s.id);

// application/index.ts (Stack B)
const infra = new pulumi.StackReference("org/infrastructure/prod");
const vpcId = infra.getOutput("vpcId");
const subnetIds = infra.getOutput("subnetIds");

Pulumi vs Terraform

| Aspect | Pulumi | Terraform |
|--------|--------|-----------|
| Language | General-purpose (TS, Python, Go, etc.) | HCL (domain-specific) |
| Learning curve | Lower for developers | Lower for ops |
| Testing | Standard language testing frameworks | Terratest, custom |
| IDE support | Full (autocomplete, refactoring) | Limited |
| Abstraction | Full OOP (classes, inheritance) | Modules only |
| State | Pulumi Cloud, S3, local | S3, remote backends |
| Provider ecosystem | Same as Terraform (bridged) | Native |

AWS CloudFormation

AWS CloudFormation is Amazon's native IaC service for provisioning AWS resources.

Template Structure

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Complete web application infrastructure'

# Input parameters
Parameters:
  EnvironmentType:
    Description: Environment type
    Type: String
    Default: development
    AllowedValues:
      - development
      - staging
      - production
    ConstraintDescription: Must be development, staging, or production

  InstanceType:
    Description: EC2 instance type
    Type: String
    Default: t3.micro
    AllowedValues:
      - t3.micro
      - t3.small
      - t3.medium

# Conditional logic
Conditions:
  IsProduction: !Equals [!Ref EnvironmentType, production]
  CreateNATGateway: !Or
    - !Equals [!Ref EnvironmentType, staging]
    - !Equals [!Ref EnvironmentType, production]

# Mappings (lookup tables)
Mappings:
  RegionAMI:
    us-east-1:
      HVM64: ami-0123456789abcdef0
    us-west-2:
      HVM64: ami-0fedcba9876543210

# Resources
Resources:
  # VPC
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-vpc'
        - Key: Environment
          Value: !Ref EnvironmentType

  # Public Subnet
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-public-subnet'

  # Internet Gateway
  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-igw'

  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway

  # Security Group
  WebSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTP/HTTPS traffic
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-web-sg'

  # EC2 Instance
  WebInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !If [IsProduction, t3.large, !Ref InstanceType]
      ImageId: !FindInMap [RegionAMI, !Ref 'AWS::Region', HVM64]
      SubnetId: !Ref PublicSubnet
      SecurityGroupIds:
        - !Ref WebSecurityGroup
      Tags:
        - Key: Name
          Value: !Sub '${AWS::StackName}-web'
    DependsOn: AttachGateway

  # Conditional NAT Gateway
  NATGateway:
    Type: AWS::EC2::NatGateway
    Condition: CreateNATGateway
    Properties:
      AllocationId: !GetAtt NATElasticIP.AllocationId
      SubnetId: !Ref PublicSubnet

  NATElasticIP:
    Type: AWS::EC2::EIP
    Condition: CreateNATGateway
    Properties:
      Domain: vpc

# Outputs
Outputs:
  VPCId:
    Description: VPC ID
    Value: !Ref VPC
    Export:
      Name: !Sub '${AWS::StackName}-VPCId'

  InstancePublicIP:
    Description: Public IP of web instance
    Value: !GetAtt WebInstance.PublicIp

  WebsiteURL:
    Description: Website URL
    Value: !Sub 'http://${WebInstance.PublicDnsName}'

Intrinsic Functions

# !Ref - Reference parameter or resource
VpcId: !Ref VPC

# !GetAtt - Get resource attribute
PublicIp: !GetAtt WebInstance.PublicIp

# !Sub - String substitution
Name: !Sub '${AWS::StackName}-${EnvironmentType}-web'

# !Join - Join strings
SecurityGroups: !Join [',', [!Ref SG1, !Ref SG2]]

# !Select - Select from list
AZ: !Select [0, !GetAZs '']

# !Split - Split string into list
Subnets: !Split [',', !Ref SubnetList]

# !If - Conditional
InstanceType: !If [IsProduction, t3.large, t3.micro]

# !Equals, !And, !Or, !Not - Conditions
Condition: !Equals [!Ref Env, production]

# !FindInMap - Lookup in mappings
AMI: !FindInMap [RegionAMI, !Ref 'AWS::Region', HVM64]

# !ImportValue - Import from another stack
VpcId: !ImportValue SharedVPCId

# !Cidr - Generate CIDR blocks
Subnets: !Cidr [!GetAtt VPC.CidrBlock, 4, 8]
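`!Cidr` is easiest to understand by computing one example by hand. A sketch using Python's stdlib `ipaddress` module of what `!Cidr [!GetAtt VPC.CidrBlock, 4, 8]` yields for a 10.0.0.0/16 VPC (the function below mimics CloudFormation's semantics; it is not an AWS API):

```python
import ipaddress

def cidr(block, count, cidr_bits):
    """Mimic !Cidr: carve `count` subnets out of `block`, each with
    `cidr_bits` additional subnet bits (a /16 parent + 8 bits => /24s)."""
    parent = ipaddress.ip_network(block)
    subnets = parent.subnets(new_prefix=parent.prefixlen + cidr_bits)
    return [str(net) for _, net in zip(range(count), subnets)]

print(cidr("10.0.0.0/16", 4, 8))
# ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24', '10.0.3.0/24']
```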

Nested Stacks and Cross-Stack References

# Parent stack using nested stacks
Resources:
  NetworkStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/mybucket/network.yaml
      Parameters:
        Environment: !Ref Environment

  ComputeStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/mybucket/compute.yaml
      Parameters:
        VpcId: !GetAtt NetworkStack.Outputs.VPCId
        SubnetIds: !GetAtt NetworkStack.Outputs.SubnetIds

CloudFormation vs Terraform

| Aspect | CloudFormation | Terraform |
|--------|----------------|-----------|
| Provider | AWS only | Multi-cloud |
| State | AWS-managed | Self-managed or remote |
| Syntax | JSON/YAML | HCL |
| Drift detection | Built-in | `terraform plan` |
| Rollback | Automatic on failure | Manual |
| Cost | Free | Free (HCP Terraform paid) |
| Ecosystem | AWS-native services | Large provider ecosystem |

Ansible

Ansible is an open-source automation tool maintained by Red Hat that excels at configuration management, application deployment, orchestration, and task automation. It is procedural and agentless: ideal for configuration management, and it can be extended to provisioning.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         Ansible Architecture                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   ┌──────────────────┐                                                  │
│   │  Control Node    │  Where Ansible runs                              │
│   │                  │  (your workstation, CI server)                   │
│   │  • Playbooks     │                                                  │
│   │  • Inventory     │                                                  │
│   │  • Modules       │                                                  │
│   └────────┬─────────┘                                                  │
│            │                                                             │
│            │ SSH / WinRM (agentless)                                    │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                      Managed Nodes                                │  │
│   │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐            │  │
│   │  │ Server1 │  │ Server2 │  │ Server3 │  │  ...    │            │  │
│   │  └─────────┘  └─────────┘  └─────────┘  └─────────┘            │  │
│   │                                                                   │  │
│   │  No agents required - just Python and SSH access                 │  │
│   └──────────────────────────────────────────────────────────────────┘  │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Inventory

# inventory/hosts.ini - Static inventory

[webservers]
web1.example.com ansible_host=10.0.1.10
web2.example.com ansible_host=10.0.1.11
web3.example.com ansible_host=10.0.1.12

[dbservers]
db1.example.com ansible_host=10.0.2.10
db2.example.com ansible_host=10.0.2.11

[loadbalancers]
lb1.example.com ansible_host=10.0.0.10

# Group of groups
[production:children]
webservers
dbservers
loadbalancers

# Group variables
[webservers:vars]
http_port=80
max_connections=1000

[dbservers:vars]
db_port=5432

# inventory/hosts.yml - YAML inventory
all:
  children:
    production:
      children:
        webservers:
          hosts:
            web1.example.com:
              ansible_host: 10.0.1.10
              http_port: 80
            web2.example.com:
              ansible_host: 10.0.1.11
        dbservers:
          hosts:
            db1.example.com:
              ansible_host: 10.0.2.10
              db_port: 5432
          vars:
            backup_enabled: true

Playbooks

# deploy-webapp.yml - Complete playbook example
---
- name: Deploy Web Application
  hosts: webservers
  become: true
  gather_facts: true

  vars:
    app_name: myapp
    app_version: "2.1.0"
    app_port: 8080
    app_user: webapp
    deploy_dir: /opt/{{ app_name }}

  vars_files:
    - vars/secrets.yml  # Encrypted with ansible-vault

  pre_tasks:
    - name: Update apt cache
      apt:
        update_cache: yes
        cache_valid_time: 3600
      when: ansible_os_family == "Debian"

  tasks:
    - name: Install required packages
      apt:
        name:
          - nginx
          - python3
          - python3-pip
          - python3-venv
        state: present

    - name: Create application user
      user:
        name: "{{ app_user }}"
        system: yes
        shell: /usr/sbin/nologin
        home: "{{ deploy_dir }}"
        create_home: yes

    - name: Create deployment directory
      file:
        path: "{{ deploy_dir }}"
        state: directory
        owner: "{{ app_user }}"
        group: "{{ app_user }}"
        mode: '0755'

    - name: Deploy application code
      unarchive:
        src: "https://releases.example.com/{{ app_name }}-{{ app_version }}.tar.gz"
        dest: "{{ deploy_dir }}"
        remote_src: yes
        owner: "{{ app_user }}"
        group: "{{ app_user }}"
      notify: Restart application

    - name: Create virtualenv and install Python dependencies
      pip:
        requirements: "{{ deploy_dir }}/requirements.txt"
        virtualenv: "{{ deploy_dir }}/venv"
        virtualenv_command: python3 -m venv

    - name: Configure application
      template:
        src: templates/app-config.yml.j2
        dest: "{{ deploy_dir }}/config.yml"
        owner: "{{ app_user }}"
        group: "{{ app_user }}"
        mode: '0640'
      notify: Restart application

    - name: Deploy systemd service
      template:
        src: templates/app.service.j2
        dest: /etc/systemd/system/{{ app_name }}.service
        mode: '0644'
      notify:
        - Reload systemd
        - Restart application

    - name: Configure nginx reverse proxy
      template:
        src: templates/nginx-site.conf.j2
        dest: /etc/nginx/sites-available/{{ app_name }}
        mode: '0644'
      notify: Reload nginx

    - name: Enable nginx site
      file:
        src: /etc/nginx/sites-available/{{ app_name }}
        dest: /etc/nginx/sites-enabled/{{ app_name }}
        state: link
      notify: Reload nginx

    - name: Ensure services are running
      service:
        name: "{{ item }}"
        state: started
        enabled: yes
      loop:
        - "{{ app_name }}"
        - nginx

  handlers:
    - name: Reload systemd
      systemd:
        daemon_reload: yes

    - name: Restart application
      service:
        name: "{{ app_name }}"
        state: restarted

    - name: Reload nginx
      service:
        name: nginx
        state: reloaded

  post_tasks:
    - name: Verify application is responding
      uri:
        url: "http://localhost:{{ app_port }}/health"
        status_code: 200
      register: health_check
      until: health_check.status == 200
      retries: 5
      delay: 3
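The `retries`/`delay` pair on the health check boils down to a bounded retry loop. A stdlib Python sketch of the behavior (`check` stands in for the HTTP probe; this is not Ansible's actual implementation):

```python
import time

def wait_until_healthy(check, retries=5, delay=3):
    """Re-run `check` up to `retries` times, sleeping `delay` seconds
    between attempts; fail only after the budget is exhausted."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            time.sleep(delay)
    raise RuntimeError("service did not become healthy")

# A probe that succeeds on the third attempt:
calls = iter([False, False, True])
print(wait_until_healthy(lambda: next(calls), delay=0))  # 3
```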

Roles

roles/
└── webserver/
    ├── defaults/           # Default variables (lowest priority)
    │   └── main.yml
    ├── vars/               # Role variables (higher priority)
    │   └── main.yml
    ├── tasks/              # Task files
    │   ├── main.yml        # Entry point
    │   ├── install.yml
    │   ├── configure.yml
    │   └── service.yml
    ├── handlers/           # Handlers
    │   └── main.yml
    ├── templates/          # Jinja2 templates
    │   ├── nginx.conf.j2
    │   └── vhost.conf.j2
    ├── files/              # Static files
    │   └── ssl-params.conf
    ├── meta/               # Role metadata
    │   └── main.yml
    └── README.md

# roles/webserver/tasks/main.yml
---
- name: Include installation tasks
  include_tasks: install.yml

- name: Include configuration tasks
  include_tasks: configure.yml

- name: Include service tasks
  include_tasks: service.yml

# roles/webserver/tasks/install.yml
---
- name: Install nginx
  apt:
    name: nginx
    state: present
  when: ansible_os_family == "Debian"

- name: Install nginx (RHEL)
  yum:
    name: nginx
    state: present
  when: ansible_os_family == "RedHat"

# roles/webserver/handlers/main.yml
---
- name: Restart nginx
  service:
    name: nginx
    state: restarted

- name: Reload nginx
  service:
    name: nginx
    state: reloaded

# Using roles in playbook
---
- name: Configure web servers
  hosts: webservers
  become: true

  roles:
    - role: common
    - role: webserver
      vars:
        nginx_worker_processes: auto
        nginx_worker_connections: 4096
    - role: ssl-certificates
      when: enable_ssl | default(false)

Jinja2 Templates

# templates/nginx-site.conf.j2
upstream {{ app_name }} {
{% for host in groups['webservers'] %}
    server {{ hostvars[host]['ansible_host'] }}:{{ app_port }};
{% endfor %}
}

server {
    listen 80;
    server_name {{ server_name }};

{% if enable_ssl | default(false) %}
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name {{ server_name }};

    ssl_certificate /etc/ssl/certs/{{ app_name }}.crt;
    ssl_certificate_key /etc/ssl/private/{{ app_name }}.key;
{% endif %}

    location / {
        proxy_pass http://{{ app_name }};
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    location /health {
        access_log off;
        return 200 "healthy\n";
    }
}
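To make the `groups['webservers']` loop concrete, here is the upstream block the template renders for the three webservers from the static inventory earlier, sketched in plain Python (Jinja2 performs the equivalent substitution; the host data is the example inventory's):

```python
# Example data from the static inventory and playbook vars above.
hostvars = {
    "web1.example.com": {"ansible_host": "10.0.1.10"},
    "web2.example.com": {"ansible_host": "10.0.1.11"},
    "web3.example.com": {"ansible_host": "10.0.1.12"},
}
groups = {"webservers": list(hostvars)}
app_name, app_port = "myapp", 8080

# What the {% for %} loop expands to:
lines = [f"upstream {app_name} {{"]
for host in groups["webservers"]:
    lines.append(f"    server {hostvars[host]['ansible_host']}:{app_port};")
lines.append("}")

print("\n".join(lines))
# upstream myapp {
#     server 10.0.1.10:8080;
#     server 10.0.1.11:8080;
#     server 10.0.1.12:8080;
# }
```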

Ansible Vault

# Create encrypted file
ansible-vault create secrets.yml

# Edit encrypted file
ansible-vault edit secrets.yml

# Encrypt existing file
ansible-vault encrypt plain-secrets.yml

# Decrypt file
ansible-vault decrypt secrets.yml

# View encrypted file
ansible-vault view secrets.yml

# Run playbook with vault password
ansible-playbook playbook.yml --ask-vault-pass
ansible-playbook playbook.yml --vault-password-file ~/.vault_pass

# secrets.yml (contents shown decrypted; stored encrypted by ansible-vault)
db_password: supersecretpassword
api_key: abc123xyz
ssl_private_key: |
  -----BEGIN PRIVATE KEY-----
  ...
  -----END PRIVATE KEY-----

Best Practices

# Use YAML anchors and aliases for DRY. A playbook's root is a list,
# so define the anchor on the first play and merge it into the others:
- &defaults
  name: Configure webservers
  hosts: webservers
  become: true
  gather_facts: true

- <<: *defaults
  name: Configure dbservers
  hosts: dbservers

# Use blocks for error handling
- name: Deploy with rollback
  block:
    - name: Deploy new version
      # ... deployment tasks
    - name: Run smoke tests
      uri:
        url: "http://localhost/health"
        status_code: 200
  rescue:
    - name: Rollback to previous version
      # ... rollback tasks
  always:
    - name: Send notification
      # ... notification tasks

Helm

Helm is the package manager for Kubernetes, enabling you to define, install, and upgrade complex Kubernetes applications.

Chart Structure

mychart/
├── Chart.yaml          # Chart metadata
├── Chart.lock          # Dependency lock file
├── values.yaml         # Default configuration values
├── values.schema.json  # JSON Schema for values validation
├── templates/          # Template files
│   ├── NOTES.txt       # Post-install notes
│   ├── _helpers.tpl    # Template helpers
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   └── serviceaccount.yaml
├── charts/             # Dependency charts
├── crds/               # Custom Resource Definitions
└── README.md

Chart.yaml

apiVersion: v2
name: myapp
description: A Helm chart for my application
type: application
version: 1.2.3          # Chart version
appVersion: "2.0.0"     # Application version

keywords:
  - web
  - api
  - microservice

home: https://example.com/myapp
sources:
  - https://github.com/example/myapp

maintainers:
  - name: John Doe
    email: john@example.com
    url: https://johndoe.dev

dependencies:
  - name: postgresql
    version: "12.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
  - name: redis
    version: "17.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled

values.yaml

# values.yaml - Default values for myapp

# Number of replicas
replicaCount: 1

image:
  repository: myregistry.io/myapp
  pullPolicy: IfNotPresent
  tag: ""  # Defaults to appVersion

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  create: true
  annotations: {}
  name: ""

podAnnotations: {}
podSecurityContext:
  fsGroup: 1000

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: false
  className: "nginx"
  annotations: {}
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80

nodeSelector: {}
tolerations: []
affinity: {}

# Application-specific configuration
config:
  logLevel: info
  database:
    host: localhost
    port: 5432
    name: myapp
  cache:
    enabled: true
    ttl: 3600

# Feature flags
features:
  newUI: false
  betaAPI: false

# Dependencies
postgresql:
  enabled: true
  auth:
    username: myapp
    database: myapp

redis:
  enabled: false

Templates

# templates/_helpers.tpl - Template helpers
{{/*
Expand the name of the chart.
*/}}
{{- define "myapp.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Create a default fully qualified app name.
*/}}
{{- define "myapp.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "myapp.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Common labels
*/}}
{{- define "myapp.labels" -}}
helm.sh/chart: {{ include "myapp.chart" . }}
{{ include "myapp.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "myapp.selectorLabels" -}}
app.kubernetes.io/name: {{ include "myapp.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "myapp.serviceAccountName" -}}
{{- if .Values.serviceAccount.create }}
{{- default (include "myapp.fullname" .) .Values.serviceAccount.name }}
{{- else }}
{{- default "default" .Values.serviceAccount.name }}
{{- end }}
{{- end }}
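The branching in the `myapp.fullname` helper is easier to follow in plain Python. A sketch of the same logic, including the 63-character Kubernetes name limit (`[:63]` plus `rstrip("-")` is analogous to `trunc 63 | trimSuffix "-"`; the function name is illustrative):

```python
def fullname(release_name, chart_name, fullname_override=None, name_override=None):
    """Sketch of the myapp.fullname template helper."""
    if fullname_override:
        return fullname_override[:63].rstrip("-")
    name = name_override or chart_name
    if name in release_name:  # chart name already part of the release name
        return release_name[:63].rstrip("-")
    return f"{release_name}-{name}"[:63].rstrip("-")

print(fullname("prod", "myapp"))        # prod-myapp
print(fullname("myapp-prod", "myapp"))  # myapp-prod (no duplication)
```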
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        {{- with .Values.podAnnotations }}
        {{- toYaml . | nindent 8 }}
        {{- end }}
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "myapp.serviceAccountName" . }}
      securityContext:
        {{- toYaml .Values.podSecurityContext | nindent 8 }}
      containers:
        - name: {{ .Chart.Name }}
          securityContext:
            {{- toYaml .Values.securityContext | nindent 12 }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          envFrom:
            - configMapRef:
                name: {{ include "myapp.fullname" . }}-config
            - secretRef:
                name: {{ include "myapp.fullname" . }}-secrets
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
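
The `checksum/config` annotation in the deployment template above hashes the rendered ConfigMap into the pod template. Any config change therefore produces a different pod template hash, which makes Kubernetes perform a rolling update even when the image is unchanged. A minimal Python sketch of the mechanism (the config strings are hypothetical):

```python
import hashlib

def config_checksum(rendered_configmap: str) -> str:
    """Mimic Helm's `sha256sum` over the rendered ConfigMap template."""
    return hashlib.sha256(rendered_configmap.encode()).hexdigest()

old = config_checksum("LOG_LEVEL: info\n")
new = config_checksum("LOG_LEVEL: debug\n")

# Different checksum -> different pod template -> rolling update.
assert old != new
assert len(old) == 64  # hex-encoded SHA-256
```

Without this annotation, editing only the ConfigMap would leave the Deployment untouched and running pods would keep the stale configuration until their next restart.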
# templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      targetPort: http
      protocol: TCP
      name: http
  selector:
    {{- include "myapp.selectorLabels" . | nindent 4 }}
# templates/ingress.yaml
{{- if .Values.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.ingress.className }}
  ingressClassName: {{ .Values.ingress.className }}
  {{- end }}
  {{- if .Values.ingress.tls }}
  tls:
    {{- range .Values.ingress.tls }}
    - hosts:
        {{- range .hosts }}
        - {{ . | quote }}
        {{- end }}
      secretName: {{ .secretName }}
    {{- end }}
  {{- end }}
  rules:
    {{- range .Values.ingress.hosts }}
    - host: {{ .host | quote }}
      http:
        paths:
          {{- range .paths }}
          - path: {{ .path }}
            pathType: {{ .pathType }}
            backend:
              service:
                name: {{ include "myapp.fullname" $ }}
                port:
                  number: {{ $.Values.service.port }}
          {{- end }}
    {{- end }}
{{- end }}

Helm Commands

# Repository management
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo nginx

# Install a chart
helm install myrelease ./mychart
helm install myrelease ./mychart -f custom-values.yaml
helm install myrelease ./mychart --set replicaCount=3
helm install myrelease ./mychart --namespace mynamespace --create-namespace

# Upgrade a release
helm upgrade myrelease ./mychart
helm upgrade --install myrelease ./mychart  # Install if not exists

# Rollback
helm rollback myrelease 1  # Rollback to revision 1
helm history myrelease     # View release history

# Uninstall
helm uninstall myrelease

# Template rendering (dry-run)
helm template myrelease ./mychart
helm template myrelease ./mychart --debug  # With debug info

# Validate chart
helm lint ./mychart

# Package chart
helm package ./mychart
helm package ./mychart --version 1.2.3 --app-version 2.0.0

# Pull chart
helm pull bitnami/nginx --untar

# Dependencies
helm dependency update ./mychart
helm dependency build ./mychart

Environment-Specific Values

# values-dev.yaml
replicaCount: 1
image:
  tag: "latest"
ingress:
  enabled: false
resources:
  limits:
    cpu: 200m
    memory: 256Mi

# values-staging.yaml
replicaCount: 2
image:
  tag: "staging"
ingress:
  enabled: true
  hosts:
    - host: staging.example.com
      paths:
        - path: /
          pathType: Prefix

# values-prod.yaml
replicaCount: 3
image:
  tag: "v2.0.0"
ingress:
  enabled: true
  hosts:
    - host: app.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
# Deploy to different environments
helm upgrade --install myapp ./mychart -f values-dev.yaml -n dev
helm upgrade --install myapp ./mychart -f values-staging.yaml -n staging
helm upgrade --install myapp ./mychart -f values-prod.yaml -n prod
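
When several value sources are combined, Helm's precedence is: chart defaults, then `-f` files left to right, then `--set` flags, with nested maps merged key by key (lists are replaced wholesale). A simplified Python model of that map-merge behavior (example values are hypothetical):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Simplified model of Helm's values precedence: later sources win,
    and nested maps are merged key by key rather than replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"replicaCount": 1, "image": {"repository": "myapp", "tag": "latest"}}
prod = {"replicaCount": 3, "image": {"tag": "v2.0.0"}}

effective = deep_merge(defaults, prod)
# image.repository survives from defaults; replicaCount and tag come from prod
assert effective == {"replicaCount": 3,
                     "image": {"repository": "myapp", "tag": "v2.0.0"}}
```

This is why the environment files above only need to state what differs from `values.yaml`: unspecified keys fall through to the chart defaults.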

Crossplane

Crossplane extends Kubernetes to manage cloud infrastructure using Kubernetes-native APIs (Custom Resources).

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                       Crossplane Architecture                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   ┌──────────────────┐                                                  │
│   │  kubectl /       │  Standard K8s tooling                            │
│   │  GitOps (ArgoCD) │                                                  │
│   └────────┬─────────┘                                                  │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                    Kubernetes API Server                          │  │
│   └────────┬─────────────────────────────────────────────────────────┘  │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                      Crossplane Core                              │  │
│   │  • Composition Engine    • Package Manager                       │  │
│   │  • Resource Controllers  • RBAC Integration                      │  │
│   └────────┬─────────────────────────────────────────────────────────┘  │
│            │                                                             │
│            ▼                                                             │
│   ┌──────────────────────────────────────────────────────────────────┐  │
│   │                        Providers                                  │  │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │  │
│   │  │ provider-   │  │ provider-   │  │ provider-   │              │  │
│   │  │ aws         │  │ azure       │  │ gcp         │  ...         │  │
│   │  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │  │
│   └─────────┼────────────────┼────────────────┼──────────────────────┘  │
│             │                │                │                          │
│             ▼                ▼                ▼                          │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    Cloud Provider APIs                           │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Basic Usage

# Install AWS provider
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.47.0

---
# Configure AWS credentials
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: credentials
# Create AWS resources using Kubernetes manifests
apiVersion: ec2.aws.upbound.io/v1beta1
kind: VPC
metadata:
  name: production-vpc
spec:
  forProvider:
    region: us-west-2
    cidrBlock: 10.0.0.0/16
    enableDnsHostnames: true
    enableDnsSupport: true
    tags:
      Name: production-vpc
      Environment: production

---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: Subnet
metadata:
  name: production-public-1
spec:
  forProvider:
    region: us-west-2
    vpcIdRef:
      name: production-vpc
    cidrBlock: 10.0.1.0/24
    availabilityZone: us-west-2a
    mapPublicIpOnLaunch: true
    tags:
      Name: production-public-1

---
apiVersion: rds.aws.upbound.io/v1beta1
kind: Instance
metadata:
  name: production-db
spec:
  forProvider:
    region: us-west-2
    instanceClass: db.t3.medium
    engine: postgres
    engineVersion: "15"
    allocatedStorage: 100
    dbName: myapp
    username: admin
    passwordSecretRef:
      name: db-password
      namespace: default
      key: password
    vpcSecurityGroupIdRefs:
      - name: production-db-sg
    dbSubnetGroupNameRef:
      name: production-db-subnet-group
    publiclyAccessible: false
  writeConnectionSecretToRef:
    name: production-db-connection
    namespace: default

Compositions (Platform Abstractions)

# Define a reusable composition
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xdatabases.example.org
spec:
  group: example.org
  names:
    kind: XDatabase
    plural: xdatabases
  claimNames:
    kind: Database
    plural: databases
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                  enum: [small, medium, large]
                engine:
                  type: string
                  enum: [postgres, mysql]
              required:
                - size
                - engine

---
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: aws-postgres
  labels:
    provider: aws
    engine: postgres
spec:
  compositeTypeRef:
    apiVersion: example.org/v1alpha1
    kind: XDatabase
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            region: us-west-2
            engine: postgres
            engineVersion: "15"
            publiclyAccessible: false
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: spec.size
          toFieldPath: spec.forProvider.instanceClass
          transforms:
            - type: map
              map:
                small: db.t3.small
                medium: db.t3.medium
                large: db.t3.large
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.name
          toFieldPath: spec.forProvider.dbName
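
The `map` transform in the composition above translates the abstract `size` field from the claim into a concrete RDS instance class on the managed resource. A hypothetical Python paraphrase of that patch (error handling is simplified relative to Crossplane's actual behavior):

```python
# Model of a FromCompositeFieldPath patch with a `map` transform:
# copy a field from the composite, translating its value on the way.
SIZE_TO_INSTANCE_CLASS = {
    "small": "db.t3.small",
    "medium": "db.t3.medium",
    "large": "db.t3.large",
}

def apply_map_patch(claim_spec: dict) -> dict:
    """Render part of the managed-resource spec from the claim's spec."""
    size = claim_spec["size"]
    if size not in SIZE_TO_INSTANCE_CLASS:
        # Simplification: a real unmapped value surfaces as a patch error.
        raise ValueError(f"unmapped size: {size}")
    return {"instanceClass": SIZE_TO_INSTANCE_CLASS[size]}

assert apply_map_patch({"size": "medium"}) == {"instanceClass": "db.t3.medium"}
```

The point of the indirection is that developers request `size: medium` without ever learning AWS instance-class names; the platform team can change the mapping in one place.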
# Claim a database (simple interface for developers)
apiVersion: example.org/v1alpha1
kind: Database
metadata:
  name: myapp-db
  namespace: myapp
spec:
  size: medium
  engine: postgres
  compositionSelector:
    matchLabels:
      provider: aws
      engine: postgres
  writeConnectionSecretToRef:
    name: myapp-db-connection

Crossplane vs Terraform

| Aspect | Crossplane | Terraform |
|-----------------------|-------------------------------|---------------------------|
| Runtime | Kubernetes controller | CLI tool |
| State | Kubernetes etcd | External state file |
| Drift correction | Continuous reconciliation | Only on terraform apply |
| GitOps native | Yes (ArgoCD/Flux) | Via CI/CD pipelines |
| Platform abstractions | Compositions | Modules |
| Learning curve | Kubernetes knowledge required | Self-contained |
| Multi-tenancy | Kubernetes RBAC | HCP Terraform workspaces |

Policy as Code

Policy as Code expresses compliance and security rules as executable, version-controlled code, so they can be reviewed like any other change and enforced automatically in CI/CD pipelines.

Open Policy Agent (OPA)

OPA is a general-purpose policy engine that evaluates policies written in the Rego language against structured input, such as a Terraform plan exported as JSON.

# terraform-policies/deny_public_s3.rego
package terraform.analysis

import input as tfplan

# Deny public S3 buckets
deny[msg] {
    resource := tfplan.resource_changes[_]
    resource.type == "aws_s3_bucket"
    resource.change.after.acl == "public-read"
    msg := sprintf("S3 bucket '%s' must not be public", [resource.address])
}

# Require encryption on RDS instances
deny[msg] {
    resource := tfplan.resource_changes[_]
    resource.type == "aws_db_instance"
    not resource.change.after.storage_encrypted
    msg := sprintf("RDS instance '%s' must have storage encryption enabled", [resource.address])
}

# Enforce tagging
deny[msg] {
    resource := tfplan.resource_changes[_]
    required_tags := {"Environment", "Owner", "CostCenter"}
    provided_tags := {tag | resource.change.after.tags[tag]}
    missing := required_tags - provided_tags
    count(missing) > 0
    msg := sprintf("Resource '%s' is missing required tags: %v", [resource.address, missing])
}

# Restrict instance types
deny[msg] {
    resource := tfplan.resource_changes[_]
    resource.type == "aws_instance"
    allowed_types := {"t3.micro", "t3.small", "t3.medium"}
    not allowed_types[resource.change.after.instance_type]
    msg := sprintf("Instance '%s' uses unauthorized type '%s'. Allowed: %v", 
        [resource.address, resource.change.after.instance_type, allowed_types])
}
# Use with Terraform
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
opa eval --data terraform-policies/ --input tfplan.json "data.terraform.analysis.deny"
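
To make the rule logic concrete, here is the first `deny` rule paraphrased in Python against the same JSON structure that `terraform show -json` emits (the sample plan is hypothetical and simplified):

```python
# Python paraphrase of the deny_public_s3 Rego rule above.
def deny_public_s3(tfplan: dict) -> list[str]:
    violations = []
    for rc in tfplan.get("resource_changes", []):
        # "after" can be null for destroyed resources, hence the guards.
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_s3_bucket" and after.get("acl") == "public-read":
            violations.append(f"S3 bucket '{rc['address']}' must not be public")
    return violations

plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"after": {"acl": "public-read"}}},
        {"address": "aws_s3_bucket.data", "type": "aws_s3_bucket",
         "change": {"after": {"acl": "private"}}},
    ]
}
assert deny_public_s3(plan) == ["S3 bucket 'aws_s3_bucket.logs' must not be public"]
```

Rego expresses the same iteration declaratively: each match of the rule body contributes one message to the `deny` set, and an empty set means the plan passes.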

HashiCorp Sentinel

Sentinel is HashiCorp's policy-as-code framework for HCP Terraform.

# sentinel/require-tags.sentinel
import "tfplan/v2" as tfplan

required_tags = ["Environment", "Owner", "CostCenter"]

# Get all resources that support tags
taggable_resources = filter tfplan.resource_changes as _, rc {
    rc.mode is "managed" and
    rc.change.after is not null and
    keys(rc.change.after) contains "tags"
}

# Check each resource for required tags
missing_tags = {}
for taggable_resources as address, rc {
    tags = rc.change.after.tags else {}
    missing = filter required_tags as tag {
        tags[tag] is undefined or tags[tag] is null or tags[tag] is ""
    }
    if length(missing) > 0 {
        missing_tags[address] = missing
    }
}

# Main rule with diagnostic output (print always returns true,
# so chaining it does not change the verdict)
main = rule {
    print("Resources missing required tags:", missing_tags) and
    length(missing_tags) is 0
}
# sentinel/restrict-regions.sentinel
import "tfplan/v2" as tfplan

allowed_regions = ["us-west-2", "us-east-1", "eu-west-1"]

# Find AWS provider configurations
aws_providers = filter tfplan.providers as alias, p {
    p.provider_name is "registry.terraform.io/hashicorp/aws"
}

# Check regions
violations = filter aws_providers as alias, p {
    p.config.region not in allowed_regions
}

main = rule {
    length(violations) is 0
}

Checkov (Static Analysis)

# Scan Terraform files
checkov -d ./terraform --framework terraform

# Scan specific file
checkov -f main.tf

# Output formats
checkov -d . --output json
checkov -d . --output sarif  # For GitHub Advanced Security

# Skip specific checks
checkov -d . --skip-check CKV_AWS_18,CKV_AWS_19

# Custom policies
checkov -d . --external-checks-dir ./custom-policies
# Custom Checkov policy (YAML)
# custom-policies/require_encryption.yaml
metadata:
  name: "Ensure S3 buckets have server-side encryption enabled"
  id: "CUSTOM_AWS_1"
  category: "encryption"

definition:
  and:
    - cond_type: "attribute"
      resource_types:
        - "aws_s3_bucket"
      attribute: "server_side_encryption_configuration"
      operator: "exists"

tfsec / Trivy

# tfsec scan
tfsec ./terraform

# With specific severity
tfsec ./terraform --minimum-severity HIGH

# Trivy (includes tfsec)
trivy config ./terraform

# Output as SARIF for CI/CD
trivy config ./terraform --format sarif --output results.sarif

GitOps and IaC

GitOps manages infrastructure through Git: a repository holds the declarative desired state, and an operator continuously reconciles the running system to match it.

GitOps Principles

┌─────────────────────────────────────────────────────────────────────────┐
│                         GitOps Workflow                                  │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐             │
│   │  Developer   │───►│  Git Repo    │───►│  CI/CD       │             │
│   │  Push Code   │    │  (Source of  │    │  Pipeline    │             │
│   └──────────────┘    │  Truth)      │    └──────┬───────┘             │
│                       └──────────────┘           │                      │
│                              ▲                   │                      │
│                              │                   ▼                      │
│                       ┌──────┴───────┐   ┌──────────────┐             │
│                       │  Reconcile   │◄──│  GitOps      │             │
│                       │  Loop        │   │  Operator    │             │
│                       └──────────────┘   │  (ArgoCD/    │             │
│                                          │  Flux)       │             │
│                                          └──────┬───────┘             │
│                                                 │                      │
│                                                 ▼                      │
│                                          ┌──────────────┐             │
│                                          │  Kubernetes  │             │
│                                          │  Cluster     │             │
│                                          └──────────────┘             │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
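
The heart of the workflow above is the reconcile loop: the operator repeatedly compares the desired state from Git with the actual state in the cluster and applies only the difference. A minimal Python sketch of that loop body (resource names and shapes are hypothetical):

```python
# Sketch of one GitOps reconcile pass: desired state comes from Git,
# actual state from the cluster API; only the diff is acted on.
def reconcile(desired: dict, actual: dict) -> dict:
    """Return the actions needed to converge actual toward desired."""
    to_create = {k: v for k, v in desired.items() if k not in actual}
    to_update = {k: v for k, v in desired.items()
                 if k in actual and actual[k] != v}
    to_delete = [k for k in actual if k not in desired]  # "prune" behavior
    return {"create": to_create, "update": to_update, "delete": to_delete}

desired = {"deploy/myapp": {"replicas": 3}, "svc/myapp": {"port": 80}}
actual = {"deploy/myapp": {"replicas": 2}, "job/old": {}}

actions = reconcile(desired, actual)
assert actions["update"] == {"deploy/myapp": {"replicas": 3}}
assert actions["create"] == {"svc/myapp": {"port": 80}}
assert actions["delete"] == ["job/old"]
```

Because the loop runs continuously, manual changes to the cluster ("drift") are reverted on the next pass; this is what ArgoCD calls `selfHeal` and deletion of unmanaged resources is `prune`.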

ArgoCD with Helm

# ArgoCD Application for Helm chart
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/helm-charts
    targetRevision: main
    path: charts/myapp
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml
      parameters:
        - name: image.tag
          value: "v2.0.0"
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Terraform with GitOps (Atlantis)

# atlantis.yaml - Repository configuration
version: 3
projects:
  - name: production
    dir: environments/production
    workspace: production
    terraform_version: v1.6.0
    autoplan:
      when_modified: ["*.tf", "../modules/**/*.tf"]
      enabled: true
    apply_requirements: [approved, mergeable]

  - name: staging
    dir: environments/staging
    workspace: staging
    terraform_version: v1.6.0
    autoplan:
      when_modified: ["*.tf", "../modules/**/*.tf"]
      enabled: true
# GitHub Actions for Terraform
name: Terraform

on:
  pull_request:
    paths:
      - 'terraform/**'
  push:
    branches:
      - main
    paths:
      - 'terraform/**'

jobs:
  terraform:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0

      - name: Terraform Init
        run: terraform init

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        if: github.event_name == 'pull_request'
        run: terraform plan -no-color
        continue-on-error: true

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: terraform apply -auto-approve

Testing IaC

Terratest (Go-based Testing)

// test/vpc_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCModule(t *testing.T) {
    t.Parallel()

    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "vpc_cidr":    "10.0.0.0/16",
            "environment": "test",
            "name":        "terratest-vpc",
        },
        EnvVars: map[string]string{
            "AWS_DEFAULT_REGION": "us-west-2",
        },
    })

    // Clean up resources after test
    defer terraform.Destroy(t, terraformOptions)

    // Deploy infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Get outputs (assumes the module exposes these outputs)
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    vpcCidr := terraform.Output(t, terraformOptions, "vpc_cidr_block")
    publicSubnetIds := terraform.OutputList(t, terraformOptions, "public_subnet_ids")

    // Validate the VPC exists and has the expected CIDR
    vpc := aws.GetVpcById(t, vpcId, "us-west-2")
    assert.Equal(t, vpcId, vpc.Id)
    assert.Equal(t, "10.0.0.0/16", vpcCidr)

    // Validate subnets
    assert.Equal(t, 3, len(publicSubnetIds))

    // Validate tags set by the module
    assert.Equal(t, "test", vpc.Tags["Environment"])
}

terraform test (Native Testing)

# tests/vpc.tftest.hcl
run "create_vpc" {
  command = apply

  variables {
    vpc_cidr    = "10.0.0.0/16"
    environment = "test"
    name        = "test-vpc"
  }

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block is incorrect"
  }

  assert {
    condition     = aws_vpc.main.enable_dns_hostnames == true
    error_message = "DNS hostnames should be enabled"
  }

  assert {
    condition     = length(aws_subnet.public) == 3
    error_message = "Should create 3 public subnets"
  }
}

run "validate_tags" {
  command = plan

  variables {
    vpc_cidr    = "10.0.0.0/16"
    environment = "production"
    name        = "prod-vpc"
  }

  assert {
    condition     = aws_vpc.main.tags["Environment"] == "production"
    error_message = "Environment tag should be 'production'"
  }
}

Ansible Molecule Testing

# molecule/default/molecule.yml
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: geerlingguy/docker-ubuntu2204-ansible
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
provisioner:
  name: ansible
  inventory:
    host_vars:
      instance:
        ansible_user: root
verifier:
  name: ansible
# molecule/default/converge.yml
---
- name: Converge
  hosts: all
  tasks:
    - name: Include role
      include_role:
        name: webserver
# molecule/default/verify.yml
---
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Check nginx is installed
      package:
        name: nginx
        state: present
      check_mode: true
      register: nginx_check
      failed_when: nginx_check.changed

    - name: Check nginx is running
      service:
        name: nginx
        state: started
      check_mode: true
      register: nginx_service
      failed_when: nginx_service.changed

    - name: Verify nginx responds
      uri:
        url: http://localhost
        status_code: 200
# Run molecule tests
molecule test

# Individual stages
molecule create    # Create test instances
molecule converge  # Run playbook
molecule verify    # Run verification
molecule destroy   # Clean up

IaC Best Practices Summary

Directory Structure

infrastructure/
├── .github/
│   └── workflows/
│       ├── terraform.yml
│       └── ansible.yml
├── terraform/
│   ├── modules/
│   │   ├── networking/
│   │   ├── compute/
│   │   ├── database/
│   │   └── security/
│   ├── environments/
│   │   ├── dev/
│   │   ├── staging/
│   │   └── production/
│   └── tests/
├── ansible/
│   ├── inventory/
│   │   ├── dev/
│   │   ├── staging/
│   │   └── production/
│   ├── roles/
│   ├── playbooks/
│   └── group_vars/
├── helm/
│   └── charts/
│       └── myapp/
├── policies/
│   ├── opa/
│   └── sentinel/
└── docs/
    └── runbooks/

Security Checklist

  • [ ] Never commit secrets to version control
  • [ ] Use secret management (Vault, AWS Secrets Manager, etc.)
  • [ ] Encrypt state files (S3 server-side encryption, etc.)
  • [ ] Apply least privilege to IaC service accounts
  • [ ] Enable state locking to prevent concurrent modifications
  • [ ] Implement policy-as-code for compliance
  • [ ] Scan IaC for security misconfigurations (Checkov, tfsec)
  • [ ] Review infrastructure changes via pull requests
  • [ ] Audit who made what changes and when

Common Anti-Patterns

| Anti-Pattern | Problem | Solution |
|-------------------------|-----------------------|---------------------------------|
| Hardcoded secrets | Security risk | Use secret management tools |
| No state locking | Race conditions | Enable DynamoDB/backend locking |
| Single monolithic state | Large blast radius | Split into multiple states |
| No testing | Unreliable changes | Implement Terratest/Molecule |
| Manual changes | Configuration drift | Enforce IaC-only changes |
| Copy-paste code | Maintenance burden | Use modules/roles |
| No code review | Quality issues | Require PR approvals |
| Ignoring drift | Unknown state | Regular drift detection |

Migration Strategy

For existing infrastructure:

  1. Import: Use terraform import to bring existing resources under management
  2. Document: Build an accurate inventory of the current infrastructure
  3. Incremental: Migrate piece by piece, not all at once
  4. Validate: Compare imported state with actual infrastructure
  5. Test: Run plans to ensure no unexpected changes
# Import existing AWS resources
terraform import aws_vpc.main vpc-0123456789abcdef0
terraform import aws_subnet.public[0] subnet-0123456789abcdef0
terraform import aws_instance.web i-0123456789abcdef0

# Generate configuration for imported resources (Terraform 1.5+:
# declare import blocks, then let plan write the config; note that
# `terraform show` output resembles HCL but is not valid configuration)
terraform plan -generate-config-out=generated.tf

Conclusion

Infrastructure as Code has evolved from simple automation scripts to sophisticated, enterprise-grade tooling that enables organizations to manage complex, multi-cloud environments reliably and securely.

Key Takeaways:

  1. Choose the right tool: Terraform for provisioning, Ansible for configuration, Helm for Kubernetes
  2. Embrace declarative: Prefer declarative approaches for predictability
  3. Version everything: Git is the source of truth
  4. Test thoroughly: Unit tests, integration tests, policy checks
  5. Automate completely: CI/CD pipelines for all infrastructure changes
  6. Security first: Secrets management, least privilege, audit trails

The future of IaC points toward:

  • Platform Engineering: Self-service infrastructure via internal developer platforms
  • GitOps Maturity: Declarative, version-controlled, continuously reconciled
  • AI-Assisted IaC: Automated code generation, drift detection, optimization
  • Multi-Cloud Abstraction: Tools like Crossplane providing unified control planes