June 12, 2026 · 7 min read

The Terraform module factory: gates, not prompts

Terraform
Platform Engineering
AI Agents

Most Terraform module libraries rot the same way: standards live in a wiki nobody reads, and the module backlog grows faster than the two engineers trusted to write modules. terraform-module-factory attacks this from the other end. Make every standard machine-checkable, then let a coding agent do the work.

The repo is a monorepo of reusable Azure (azurerm) modules and the harness that produces them. The name means what it says: modules are the output, and the engineering went into the machine around them, built so Claude Code can write, test, document, and release a module to the standard of a senior engineer.

Feedback loops, not prompts

The naive way to put an agent on infrastructure code is to paste your conventions into a prompt and hope. That fails for the same reason the wiki fails: prose drifts, and nothing pushes back when it's ignored. The harness stands on the principle stated in docs/harness.md: agent quality is bounded by feedback loops, not prompts. So the repo turns each rule that matters into something a machine can check, block, or verify. Prose remains for what automation can't check yet.

Humans get the same loops: the definition of done for a person and for Claude Code is one command.

One command defines done

make check MODULE=<name> runs seven gates in order: formatting, terraform validate, tflint with the azurerm ruleset, a trivy misconfiguration scan at HIGH/CRITICAL, a check that the README matches the code, mock-provider unit tests, and an init-plus-validate of every example. The result is binary, and every failure prints a specific error.

An agent told to "follow our standards" guesses. An agent that reads "README out of date — run: make docs MODULE=virtual-network" fixes it and moves on. Seven gates' worth of specific errors steer better than any system prompt.

The inner loop. A failing gate stops the run and prints why. CI calls the identical targets, so local green and CI green can't diverge.

CI runs the same Make targets. A workflow computes which modules a commit touched and fans out a matrix; a change to the harness itself re-checks every module. mise pins the tool versions and installs the same builds on laptops and runners, so "works on my machine" has nowhere to live. Kick the loop once yourself:

git clone https://github.com/danielsemerjya/terraform-module-factory.git && cd terraform-module-factory

mise install

make check MODULE=virtual-network
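The module-selection step CI performs before fanning out can be sketched as a pure function over the changed paths. This is an illustration under assumptions, not the repo's workflow: the modules/<name>/ layout is inferred from the release tag format, and the harness path prefixes and the key-vault module name are hypothetical.

```python
def modules_to_check(changed_paths, all_modules,
                     harness_prefixes=("Makefile", "scripts/", ".github/")):
    """Return the sorted module names to check for a set of changed paths.

    harness_prefixes is a hypothetical list of paths that count as
    "the harness itself"; touching any of them re-checks every module.
    """
    touched = set()
    for path in changed_paths:
        if any(path.startswith(prefix) for prefix in harness_prefixes):
            return sorted(all_modules)
        parts = path.split("/")
        # Assumed layout: modules/<name>/<file...>
        if len(parts) >= 3 and parts[0] == "modules" and parts[1] in all_modules:
            touched.add(parts[1])
    return sorted(touched)


ALL_MODULES = ["key-vault", "virtual-network"]  # hypothetical library contents
matrix = modules_to_check(["modules/virtual-network/main.tf"], ALL_MODULES)
```

The returned list is what a CI matrix would iterate over, calling make check MODULE=<name> once per entry.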

The agent can't touch real infrastructure

The scariest sentence in agentic infrastructure work is "the model ran terraform apply". Here it can't. A PreToolUse hook inspects every shell command and hard-blocks apply, destroy, import, state, and taint, including the chained and compound forms a permission deny-list would miss. The repo is plan-level only, for humans and agents alike.
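A minimal sketch of the check such a hook could run, assuming a regex over the raw command string is acceptable. It is not the repo's hook. It deliberately errs toward blocking (a quoted "terraform apply" inside an echo is blocked too), and it does not catch indirection through shell variables or command substitution.

```python
import re

# Subcommands the post lists as hard-blocked.
BLOCKED = ("apply", "destroy", "import", "state", "taint")

# "terraform", then any global flags such as -chdir=DIR, then a blocked
# subcommand. Searching the whole string covers chained forms (&&, ;, |)
# and commands wrapped in bash -c "...".
_PATTERN = re.compile(
    r"\bterraform\b(?:\s+-\S+)*\s+(" + "|".join(BLOCKED) + r")\b"
)


def blocked_subcommand(command: str):
    """Return the blocked terraform subcommand in a shell command, or None."""
    match = _PATTERN.search(command)
    return match.group(1) if match else None
```

Wiring it in is a few lines. Per the Claude Code hooks documentation, a PreToolUse hook reads the pending tool call as JSON on stdin and blocks it by exiting with status 2 and a reason on stderr. Check the current docs before relying on that contract.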

Figure: The agent plans, tests, and documents; apply stays on the other side of the wall.

The local loop also needs no credentials. Unit tests run against mock_provider, so terraform test asserts on planned resources without an Azure subscription. Tests that need real Azure live in tests/integration/ and run in CI over OIDC. The blast radius of a wrong agent move is a red gate, not a destroyed VNet.

Research before code

The most common LLM failure in Terraform is fluent HCL with arguments that don't exist. Provider APIs drift while a model's memory of azurerm stays frozen at training time. terraform validate catches part of the problem, but the downloaded schema says nothing about deprecations or behavior changes.

The harness turns research into a workflow step. A provider-researcher subagent fetches current resource schemas from provider docs and the azurerm changelog before any resource gets written, returns them as structured data (arguments, types, defaults, constraints), and labels anything it couldn't verify as UNVERIFIED rather than guessing. CLAUDE.md forbids writing a resource from memory.
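One way to picture that structured return value is a small record per argument with an explicit verified flag. The class and field names below are hypothetical, and the example values are illustrative rather than fetched from provider docs. The point is that an unconfirmed argument is carried forward with a label instead of a guessed constraint.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArgumentSpec:
    name: str
    type: str
    default: Optional[str] = None
    constraints: str = ""
    verified: bool = False  # True only when confirmed against current docs
    source: str = ""        # where the confirmation came from


@dataclass
class ResourceResearch:
    resource: str
    arguments: list = field(default_factory=list)

    def unverified(self):
        """Names of arguments the researcher could not confirm."""
        return [arg.name for arg in self.arguments if not arg.verified]

    def render(self) -> str:
        lines = [self.resource]
        for arg in self.arguments:
            flag = "" if arg.verified else " [UNVERIFIED]"
            lines.append(f"  {arg.name}: {arg.type}{flag}")
        return "\n".join(lines)


# Illustrative data only: the second argument is left unverified, so its
# constraints stay empty instead of being filled in from memory.
research = ResourceResearch(
    resource="azurerm_virtual_network",
    arguments=[
        ArgumentSpec("name", "string", verified=True, source="provider docs"),
        ArgumentSpec("flow_timeout_in_minutes", "number"),
    ],
)
```

A workflow step can then refuse to proceed while research.unverified() is non-empty, or hand the list to a human.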

Procedures as skills

Team procedures are encoded as Claude Code skills, which makes them runnable runbooks. /new-module scaffolds from the skeleton template, researches every resource, designs the variable and output interface before implementing, writes the required test coverage, fills the basic and complete examples, regenerates docs, and iterates until the gate is green.

The testing standard is specific enough to gate. Every validation block needs a matching expect_failures test, for_each cardinality is asserted, and feature flags are tested both on and off. A test that only echoes an input back proves nothing, so tests assert what the resource actually received.
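The first of those rules is mechanical enough to check with a script. The sketch below is a heuristic, not the repo's gate: it scans HCL with regular expressions and brace counting instead of a real parser, and it checks coverage per variable rather than per validation block. The function names are hypothetical.

```python
import re

_VARIABLE = re.compile(r'^\s*variable\s+"([^"]+)"\s*\{', re.MULTILINE)
_VALIDATION = re.compile(r"^\s*validation\s*\{", re.MULTILINE)
_EXPECT = re.compile(r"expect_failures\s*=\s*\[([^\]]*)\]")
_VAR_REF = re.compile(r"\bvar\.([A-Za-z_][A-Za-z0-9_-]*)")


def _block_body(text: str, open_brace: int) -> str:
    """Return the text between the brace at open_brace and its matching close."""
    depth = 0
    for i in range(open_brace, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[open_brace + 1:i]
    return text[open_brace + 1:]  # unbalanced input: take the rest


def validated_variables(variables_tf: str) -> set:
    """Variables whose block contains at least one validation block."""
    names = set()
    for match in _VARIABLE.finditer(variables_tf):
        body = _block_body(variables_tf, match.end() - 1)
        if _VALIDATION.search(body):
            names.add(match.group(1))
    return names


def covered_variables(tests_hcl: str) -> set:
    """Variables named in any expect_failures list."""
    covered = set()
    for match in _EXPECT.finditer(tests_hcl):
        covered.update(_VAR_REF.findall(match.group(1)))
    return covered


def missing_expect_failures(variables_tf: str, tests_hcl: str) -> list:
    """Validated variables with no expect_failures test, sorted by name."""
    return sorted(validated_variables(variables_tf) - covered_variables(tests_hcl))
```

A gate built on this would fail with the list of uncovered variable names, which is exactly the kind of specific error the loop depends on.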

/upgrade-provider reads the changelog diff between current and target versions before touching code, fixes what the research and the gates surface, and reports a per-module table. A renamed argument inside a module is a fix; a renamed variable in its interface breaks consumers, and the skill flags it for a major bump. /review-module covers what gates can't: tautological validations, stale descriptions, tests that assert nothing. /release-module turns a CHANGELOG's Unreleased section into a per-module tag and a GitHub release with the notes already written.
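The fix-versus-break distinction can be sketched as a comparison of a module's variable interface before and after a change. Only the "renamed variable means major" rule comes from the text above. The minor and patch rules are conventional semver assumptions, the function name is hypothetical, and outputs are ignored to keep the sketch short.

```python
def required_bump(old_vars: dict, new_vars: dict) -> str:
    """Classify an interface change as 'major', 'minor', or 'patch'.

    Each dict maps a variable name to True if it has a default (optional)
    or False if callers must set it. A rename shows up as one removal plus
    one addition, so it is classified as major.
    """
    old_names, new_names = set(old_vars), set(new_vars)
    removed = old_names - new_names
    added = new_names - old_names
    # Assumption: a new required input, or an optional one made required,
    # breaks existing callers just as a removal does.
    newly_required = {n for n in added if not new_vars[n]}
    newly_required |= {
        n for n in old_names & new_names if old_vars[n] and not new_vars[n]
    }
    if removed or newly_required:
        return "major"
    if added:
        return "minor"
    return "patch"


before = {"name": False, "flow_timeout": True}
renamed = {"name": False, "flow_timeout_in_minutes": True}
bump = required_bump(before, renamed)
```

A change that only touches resource arguments inside the module leaves both dicts identical and lands on patch.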

The harness, layer by layer

Eight layers, one rule: if a standard matters, a machine checks it.

Each layer earns its place by doing a job prose can't. The strongest document in the repo isn't documentation at all: templates/module-skeleton is a complete canonical module, and agents copy patterns better than they follow instructions.

Four rules keep the harness honest as it grows. A new standard ships with a new check, or it will drift. The agent may never weaken a gate: tflint ignores, trivy skips, and deleted tests all require human approval. The inner loop stays credential-free and fast, because loop latency is harness UX. And anything CI checks must be a Make target first.

What changes in a platform team's week

The backlog. A module request that used to wait a sprint becomes an afternoon: run /new-module, then review the interface and the diff with every gate already green. Senior time moves from writing HCL to judging design.
Provider upgrades. The azurerm major bump across a thirty-module library is the toil everyone defers. /upgrade-provider researches the breaking changes once, fixes module by module, and hands you a table of what changed and what breaks for consumers.
Review. Gates absorb the mechanical comments (formatting, docs drift, missing tests) before a human looks. What's left for a human is the design.
Standards that stick. A convention that matters becomes a tflint rule or a Makefile script instead of a wiki page. New joiners onboard with make help and six short standards files full of good/bad examples.
Releases consumers can trust. Per-module semver tags and enforced CHANGELOG entries mean teams pin ?ref=modules/<name>/v1.2.0 and get release notes pulled straight from the CHANGELOG.

Steal the structure

Nothing in the harness is Azure-specific. Swap the tflint ruleset, point the researcher at a different provider's docs, replace the skeleton's resources, and the same eight layers run an AWS or GCP module library. To retrofit an existing repo:

Make one command the definition of done and wire every check into it. Specific errors, binary exit.
Pin the toolchain and run the same targets in CI. Local green must equal CI green.
Block state-mutating commands with a hook. The agent's loop stays plan-level.
Make unit tests credential-free with mock providers, so the loop runs anywhere, fast.
Turn each standard into a gate where a tool exists; keep the rest as terse pages with good/bad examples.
Encode procedures as skills, and ground provider knowledge in live docs through a research subagent.

Then treat the harness as a product. A change to it can regress agent behavior with no failing test, so the repo keeps golden tasks ("add a flow-timeout variable to virtual-network, tested and documented") to replay after harness changes. Score each run: research before code, gates intact, docs and tests and CHANGELOG moving together.
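A scoring pass over a replayed golden task can be as small as three booleans. Everything concrete in this sketch is an assumption: the event names, the list of gate configuration files, and the rule that a README, a CHANGELOG, and a .tftest.hcl file must all appear in the diff.

```python
# Hypothetical paths whose modification counts as touching a gate.
GATE_FILES = {".tflint.hcl", "trivy.yaml", "Makefile"}


def score_run(events, changed_files):
    """Score one golden-task replay.

    events is the ordered list of step names the agent took (hypothetical
    vocabulary: "research", "edit", "check"); changed_files is the list of
    paths in the resulting diff.
    """
    research_first = "research" in events and (
        "edit" not in events or events.index("research") < events.index("edit")
    )
    gates_intact = not any(path in GATE_FILES for path in changed_files)
    names = [path.rsplit("/", 1)[-1] for path in changed_files]
    moved_together = (
        "README.md" in names
        and "CHANGELOG.md" in names
        and any(name.endswith(".tftest.hcl") for name in names)
    )
    return {
        "research_before_code": research_first,
        "gates_intact": gates_intact,
        "docs_tests_changelog": moved_together,
    }


good = score_run(
    ["research", "edit", "check"],
    [
        "modules/virtual-network/variables.tf",
        "modules/virtual-network/README.md",
        "modules/virtual-network/CHANGELOG.md",
        "modules/virtual-network/tests/unit.tftest.hcl",
    ],
)
```

Comparing these scores before and after a harness change shows a regression in agent behavior even when every gate still passes.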