
Terraform on Azure: Patterns That Actually Scale

Terraform looks approachable until you're managing 15 subscriptions, three environments, a shared services hub, and a team of six engineers who all have slightly different opinions about how things should be organized. This post covers the patterns that survive contact with that reality.

State Management First

The fastest way to create a Terraform disaster is to treat state as an afterthought. By the time a team realizes their state management strategy is broken, they've usually got drift, conflicting workspaces, or worse — state files checked into git.

Remote State in Azure Storage

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "prod/core-networking.tfstate"
    use_azuread_auth     = true # RBAC authentication, not access keys
  }
}

The storage account for state needs to exist before Terraform runs. Bootstrap it manually or with a one-time script — don't try to manage it with Terraform itself. Lock the storage account down: no public access, private endpoint or service endpoint only, RBAC authentication (not access keys), and soft delete enabled with a 90-day retention.
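
As a reference, that hardening translates into HCL roughly as follows. This is a sketch only: the names reuse the backend example above, and whether you run it as a one-off root module with local state or script the same settings with the CLI is up to you.

# One-off bootstrap, applied once with local state, then left alone
resource "azurerm_resource_group" "state" {
  name     = "rg-terraform-state"
  location = "eastus2"
}

resource "azurerm_storage_account" "tfstate" {
  name                          = "stterraformstate"
  resource_group_name           = azurerm_resource_group.state.name
  location                      = azurerm_resource_group.state.location
  account_tier                  = "Standard"
  account_replication_type      = "GZRS"
  min_tls_version               = "TLS1_2"
  public_network_access_enabled = false # private or service endpoint only
  shared_access_key_enabled     = false # forces RBAC auth, disables access keys

  blob_properties {
    delete_retention_policy {
      days = 90 # soft delete, 90-day retention
    }
  }
}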

State Locking

Azure Storage provides blob lease-based state locking out of the box with the azurerm backend. Don't use a backend that doesn't support locking in team environments — state corruption from concurrent applies is not recoverable without significant effort.

Workspace Strategy

Don't use Terraform workspaces for environment separation in Azure. The mental model breaks down: workspaces share a backend config, which means they share access permissions, and you end up with your prod state accessible from the same context as dev.

Instead, use separate state files per environment, with separate storage accounts per environment if your security posture requires it.

tfstate/
  dev/
    networking.tfstate
    compute.tfstate
    identity.tfstate
  staging/
    networking.tfstate
    ...
  prod/
    networking.tfstate
    ...

Module Structure

The question every team fights over: monorepo vs. module repo, flat vs. nested, one root module per environment vs. layered roots.

What works at scale:

Separate Infrastructure Layers

Split your infrastructure into layers with explicit dependency contracts between them. Each layer has its own state file and its own apply cycle.

infra/
  00-state-bootstrap/   # Storage account, container — applied once manually
  01-core-networking/   # Hub VNet, peerings, DNS, Firewall
  02-identity/          # Entra app registrations, managed identities
  03-platform/          # AKS, App Service Environments, SQL servers
  04-workloads/         # Per-application resources

Higher layers read outputs from lower layers via terraform_remote_state data sources. This means the networking team can apply 01-core-networking without touching workload code, and application teams deploy 04-workloads with read-only access to the core network state rather than the ability to change it.
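
A workload layer consuming networking outputs looks something like this. The output names (hub_rg_name, hub_vnet_name) are placeholders for whatever 01-core-networking actually exports:

data "terraform_remote_state" "networking" {
  backend = "azurerm"

  config = {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate"
    container_name       = "tfstate"
    key                  = "prod/core-networking.tfstate"
  }
}

resource "azurerm_subnet" "workload" {
  name                 = "snet-workload"
  resource_group_name  = data.terraform_remote_state.networking.outputs.hub_rg_name
  virtual_network_name = data.terraform_remote_state.networking.outputs.hub_vnet_name
  address_prefixes     = ["10.1.2.0/24"]
}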

Warning

terraform_remote_state creates coupling between state files. If the lower layer's output structure changes, all dependent layers need updating. Document outputs explicitly and version them carefully.

Reusable Modules for Standard Patterns

Extract anything you deploy more than twice into a module. Good candidates in Azure environments: storage accounts (with consistent security defaults), key vaults, virtual machines, app service plans.

module "storage" {
  source = "../../modules/storage-account"
 
  name                = "stmyapp${var.environment}"
  resource_group_name = azurerm_resource_group.app.name
  location            = var.location
 
  # Module enforces: no public access, TLS 1.2 min, soft delete
  # You can't accidentally create an insecure storage account
}

The module enforces your security baseline. Engineers who use the module can't accidentally create a publicly accessible storage account with no soft delete — those defaults are in the module, not in documentation nobody reads.
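
Inside the module, the baseline is enforced by simply not exposing those settings as variables. A sketch of what modules/storage-account might pin down (the specific attribute choices here are illustrative):

# modules/storage-account/main.tf (sketch)
resource "azurerm_storage_account" "this" {
  name                     = var.name
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "ZRS"

  # Baseline settings, deliberately not variables
  min_tls_version               = "TLS1_2"
  public_network_access_enabled = false

  blob_properties {
    delete_retention_policy {
      days = 30 # soft delete
    }
  }
}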

Naming Conventions

Azure resource names are constrained (length, allowed characters, uniqueness scope), inconsistent between resource types, and permanent. Get the convention right before you have resources in production.

A pattern that works:

{type}-{workload}-{environment}-{region}-{instance}
rg-payments-prod-eus2-001
vnet-hub-shared-eus2-001
stbackupsprodeus2001         # storage accounts: no dashes, lowercase only
kv-secrets-prod-eus2-001

locals {
  name_prefix = "${var.workload}-${var.environment}-${var.location_short}"
}
 
resource "azurerm_resource_group" "main" {
  name     = "rg-${local.name_prefix}"
  location = var.location
}

Lock this down in a locals block in your root module. Engineers should not be typing resource names directly — they should be composing them from variables through a naming function.
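
A sketch of that composition, with an illustrative region abbreviation map:

locals {
  # Extend for the regions you actually use
  location_short_map = {
    eastus2    = "eus2"
    westeurope = "weu"
  }
  location_short = local.location_short_map[var.location]
  name_prefix    = "${var.workload}-${var.environment}-${local.location_short}"

  # Stripped variant for storage accounts: no dashes, lowercase only
  storage_prefix = lower(replace(local.name_prefix, "-", ""))
}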

The Mistakes I See Repeatedly

Hardcoded subscription IDs. Use data "azurerm_subscription" "current" {} and reference data.azurerm_subscription.current.subscription_id. Hardcoded IDs break when code is promoted between environments.
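
A minimal example, including the fully qualified ID since that form is what role assignment scopes expect:

data "azurerm_subscription" "current" {}

locals {
  subscription_id    = data.azurerm_subscription.current.subscription_id
  # "/subscriptions/<guid>" form, usable as a role assignment scope
  subscription_scope = data.azurerm_subscription.current.id
}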

No prevent_destroy on critical resources. Key vaults, storage accounts with data, DNS zones — put lifecycle { prevent_destroy = true } on anything where a terraform destroy would cause an incident.
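
For example, on a DNS zone (names illustrative):

resource "azurerm_dns_zone" "main" {
  name                = "example.com"
  resource_group_name = azurerm_resource_group.main.name

  lifecycle {
    # Any plan that would destroy or replace this resource errors out
    prevent_destroy = true
  }
}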

azurerm_resource_group with depends_on everywhere. Resource groups don't need explicit depends_on chains; referencing the group's name attribute already creates the implicit dependency. Explicit depends_on on resource groups usually indicates a module structure problem.

Everything in one root module. A single main.tf with 200 resources takes 8 minutes to plan and turns every apply into a full environment scan. Split early.

💡 Tip

Run terraform plan in CI on every pull request and post the plan output as a PR comment. Engineers should see exactly what their change does to infrastructure before it merges — not after.

Azure Provider Gotchas

The azurerm provider has some behaviors that will surprise you at least once:

azurerm_key_vault soft delete. Soft delete is enabled by default and cannot be disabled. If you destroy a key vault and try to recreate it with the same name within the retention period, Terraform will fail. Use azurerm_key_vault with purge_protection_enabled = true in prod, and purge deleted vaults in dev pipelines.
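
In prod that looks something like this (names illustrative):

data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "main" {
  name                = "kv-secrets-prod-eus2-001"
  location            = var.location
  resource_group_name = azurerm_resource_group.main.name
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"

  soft_delete_retention_days = 90   # 7-90 days; the name stays reserved this long
  purge_protection_enabled   = true # cannot be turned off once enabled
}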

azurerm_role_assignment scope. Role assignments at subscription scope require the service principal running Terraform to have User Access Administrator (or Owner) on the subscription. That permission is often missing, and plan won't catch it; the failure only surfaces at apply.
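
For instance, this assignment plans cleanly but fails at apply without that permission (the principal_id variable is a placeholder for the grantee's object ID):

data "azurerm_subscription" "current" {}

resource "azurerm_role_assignment" "reader" {
  scope                = data.azurerm_subscription.current.id # subscription scope
  role_definition_name = "Reader"
  principal_id         = var.principal_id # object ID of the grantee
}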

Resource locks. azurerm_management_lock resources will block Terraform destroy operations on the locked resource. If your pipeline destroys and recreates resources (e.g., in ephemeral environments), manage locks carefully or they'll block your cleanup step.
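
For reference, a lock is itself a resource, and the CanNotDelete level is exactly what blocks the destroy:

resource "azurerm_management_lock" "prod_rg" {
  name       = "no-delete"
  scope      = azurerm_resource_group.main.id
  lock_level = "CanNotDelete" # or "ReadOnly"
  notes      = "Blocks deletes, including terraform destroy, until removed"
}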

The Payoff

When Terraform is structured well, infrastructure changes become reviewable, predictable, and repeatable. A new subscription can be bootstrapped from scratch in under an hour. Security baselines are enforced structurally, not by convention. And when something goes wrong — a misconfigured NSG, an accidentally public storage account — the remediation path is a pull request, not a support ticket.

That's the actual value of IaC. Not that it's faster (often it isn't, initially), but that it makes infrastructure auditable and recoverable.