Chm.ski Labs

Your Deployment Pipeline Should Not Depend on a Prompt

LLMs are useful for authoring deployment checks. Making a prompt part of the deployment path is a reliability, auditability, and cost mistake.

Kamil Chmielewski · 10 min read · AI
Diagram contrasting a giant runtime prompt with a deterministic deployment pipeline built from repo-local checks and CI gates


Use AI to write the checks. Don't make AI the check.

Recently I reviewed a deployment workflow for a company where the codebase itself was mostly fine. The weird part was everything wrapped around it.

What should have been a small set of deterministic checks had been turned into a giant LLM prompt. The model was supposed to inspect the repository, decide what was actually deployable, infer ports and health checks, interpret security settings, propose fixes, sometimes apply them, and then validate the result.

That kind of thing looks sophisticated right up until you need it to be repeatable.

I am deliberately generalizing the details here because the company does not matter. The pattern does. The examples in this post are based on a real review, not made-up edge cases.

LLMs are useful for turning vague intent into candidate implementations. That makes them good at authoring-time work: drafting scripts, sketching policy, explaining failures, and proposing patches.

It does not make them good release gates.

On a good day, that looks clever.

On a normal day, it is just operational logic expressed in the least testable form available.

If your pipeline needs the same answer every time for the same repository state, stop writing prompt prose and start writing code.

Why This Keeps Happening

A workflow like this is attractive for a few very understandable reasons.

First, it feels fast. Writing a long prompt is easier than building a real tool. It feels more flexible than a script, broad enough to cover messy edge cases, and impressive enough that people can pretend a lot of operational knowledge has been captured very quickly.

Second, there is the AI hype factor. Everyone wants to show up with the new Claude workflow, the new super-prompt, the new clever system that looks like the future. A giant prompt looks modern in a way that ./scripts/deploy-audit never will, even if the script is the thing that will still work next month.

Third, some organizations make this worse by turning tool usage into a goal. If leadership is tracking AI adoption, token usage, or platform usage, people get pushed toward using the model whether it is the right tool or not. Once usage becomes a KPI, bad automation starts looking like good compliance.

So yes, something gets produced quickly.

What usually gets produced is ambiguity.

Then that ambiguity gets promoted from prototype to production, and now the release path depends on a model interpreting the repository correctly at runtime. Same repo, slightly different context window, slightly different wording, slightly different model behavior, and congratulations: you turned deployment logic into a brittle runtime gamble.

Deployment gates are where you want boring software, not runtime interpretation.

The Anti-Pattern

The anti-pattern is simple: take a deterministic workflow and replace explicit checks with broad natural-language instructions.

This is what prompt-driven pseudo-automation usually sounds like:

Inspect the repository and determine which services are deployable.
Infer expected runtime configuration, ports, health checks, security settings,
container hardening requirements, and startup behavior. Identify any problems,
apply fixes when appropriate, build the containers, run validation, and confirm
whether the application is production ready.

That is not a deployment check. That is a wish.

It bundles discovery, policy interpretation, validation, remediation, and final judgment into one probabilistic step. It also hides the real rules. If the pipeline fails, you now need to debug prompt intent, tool availability, model behavior, and repository structure at the same time. Very efficient, if your hobby is prompt archaeology.

Why It Breaks in Practice

The failure modes are not exotic. They are predictable.

They are also not hypothetical. I am describing a real class of operational mistakes I have seen in review.

Service Discovery by Vibes

A model sees a directory with a package.json and decides it is a deployable service. It sees a local database used for development and treats it like part of the production topology. It mistakes helpers, asset pipelines, or one-off scripts for first-class runtime components.

Once that happens, the rest of the audit is garbage. Ports are wrong, health checks are wrong, container expectations are wrong, and the model starts reporting failures for things that were never meant to be deployed in the first place.
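For contrast, service discovery can be made deterministic with one explicit marker. A minimal sketch, assuming a directory is deployable if and only if it contains a Dockerfile; the marker, the `node_modules` exclusion, and the function names are illustrative, not taken from the reviewed workflow:

```shell
# Deterministic service discovery: a directory is deployable if and only if
# it declares itself with a Dockerfile. No inference, no vibes.

count_deployables() {
  # Count Dockerfiles under the repo root, ignoring dependency trees
  # that are never deployment targets.
  find "$1" -name Dockerfile -not -path '*/node_modules/*' | wc -l
}

assert_single_deployable() {
  local n
  n="$(count_deployables "$1")"
  if [ "$n" -ne 1 ]; then
    echo "expected exactly 1 deployable service, found $n" >&2
    return 1
  fi
}
```

The same input produces the same answer every run, which is exactly the property a gate needs and a model cannot promise.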

Security Interpretation by Guesswork

Proxy-aware security settings get confused with in-app TLS termination. Secure cookies behind a reverse proxy get treated like a broken HTTPS configuration. Standard forwarded-header settings get interpreted as a sign that the application is misconfigured.

That is where this stops being annoying and starts being dangerous. A bad fix in a deployment path is worse than a false positive. At least a false positive only wastes your time.

Fake Validation

This one is common and especially stupid.

The model invents a validation step that sounds reasonable but does not match the application contract. It calls a readiness endpoint with placeholder settings even though the endpoint is supposed to verify real downstream dependencies. The check fails, not because the app is broken, but because the check itself is invalid.

You did not automate validation. You automated confusion.

Policy as Prose

A 500-line prompt looks comprehensive. A 150-line script looks small and boring.

The script wins.

The script can be versioned, diffed, tested, linted, cached, and run locally or in CI. The prompt cannot tell you which branch of reasoning changed, which assumption was added, or whether the model silently interpreted one sentence differently this week than last week.

Prompt text is fine for discussion. It is a terrible source of truth.

Too Much Responsibility in One Step

Audit, judgment, remediation, and validation should not all live inside the same model call in the hot path.

That is too much authority to hand to a system whose core feature is generating plausible output under uncertainty.

Useful feature, by the way. Just not for release control.

What Deployment Workflows Actually Need

Deployment and release gates are not creative work. They are control work.

We want:

  • deterministic outcomes
  • explicit inputs and outputs
  • stable failure modes
  • fast execution
  • low cost per run
  • easy auditing
  • minimal runtime dependencies
  • no dependency on an external LLM provider being available
  • behavior that does not drift when a model or prompt changes

And yes, that external dependency part matters. If your deployment path depends on an LLM provider, your release process now inherits that provider's latency, outages, degraded performance, and auth failures.

At the time of writing, the public Claude status page showed 98.94% uptime for the Claude API over the past 90 days. That is roughly 23 hours of downtime per quarter. It may be fine for an assistant. It is not what I want anywhere near a deployment gate.

An LLM is useful when the task is fuzzy, exploratory, or generative. A deployment gate is the opposite.

If you put a model in that path, you are adding hidden dependencies on:

  • prompt quality
  • model behavior
  • token budget
  • context assembly
  • tool auth and availability
  • the model correctly understanding repo-specific exceptions

That is a bad trade for operational logic.

If a task must produce the same answer every time for the same input, it probably belongs in code.

The Right Division of Labor

The better pattern is not “never use AI.” The better pattern is much simpler.

Use AI to create the deterministic system. Do not use AI as the deterministic system.

That means using an LLM for authoring time:

  • drafting checks
  • generating tests
  • turning fuzzy policy into an implementation plan
  • explaining failures from real tools
  • proposing targeted patches

And using deterministic tools at runtime for:

  • deployment checks
  • release gates
  • health validation
  • config validation
  • secret scanning
  • image inspection
  • startup and migration orchestration
  • policy enforcement

The model should consume the output of deterministic tools. It should not pretend to be the deterministic tool.

That distinction matters because only one of those systems is supposed to define reality.

The Compiler Example Makes This Obvious

Compilers are the cleanest example because nobody argues with them for long.

The wrong pattern is asking an LLM to inspect source code and tell you whether an integer was assigned to a string variable. That is just asking the model to cosplay as the compiler.

The correct pattern is:

  1. Run the compiler.
  2. Let it report the exact file, line, and type error.
  3. Feed that output to the LLM.
  4. Ask the LLM to explain or fix the problem.

The compiler is the source of truth.

The LLM is the repair assistant.
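The loop above can be sketched as a tiny wrapper: the deterministic tool alone decides pass or fail, and the model only ever sees the recorded failure output. `gate_then_assist` and the assistant command are hypothetical; substitute any CLI that forwards text to a model.

```shell
# The deterministic command is the source of truth; the assistant only
# reacts to its recorded output and never influences the exit status.

gate_then_assist() {
  local gate_cmd="$1" assist_cmd="$2" log
  log="$(mktemp)"
  # Commands are passed as strings and word-split on purpose (sketch-level).
  if $gate_cmd >"$log" 2>&1; then
    rm -f "$log"
    return 0          # reality says pass; the model is not consulted
  fi
  $assist_cmd <"$log" # model explains or proposes a fix, nothing more
  rm -f "$log"
  return 1            # the gate still fails; the model cannot override it
}

# Example shape (assistant here is just `cat`, standing in for an LLM CLI):
# gate_then_assist "go build ./..." "cat"
```

Note the asymmetry: a green gate never touches the model at all, so the happy path has zero LLM cost and zero LLM dependency.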

Same pattern everywhere else:

  • run the linter, then let the LLM fix lint errors
  • run the test suite, then let the LLM investigate failures
  • run the schema validator, then let the LLM align config with it
  • run the deployment audit, then let the LLM propose a patch that satisfies the audit

In every case, the deterministic tool defines reality first. The model reacts to reality second.

What the Better Pattern Looks Like

This script makes deployment assumptions explicit and testable.

#!/usr/bin/env bash
set -euo pipefail

# Repo-local assertions: each script encodes exactly one explicit rule.
./scripts/assert-single-deployable-service
./scripts/assert-container-port 8080
./scripts/assert-health-endpoint /healthz
./scripts/assert-non-root-user
./scripts/assert-startup-migrations

docker build -t app:test .

cid="$(docker run -d -p 8080:8080 app:test)"
trap 'docker rm -f "$cid" >/dev/null 2>&1 || true' EXIT

# Give the container a bounded window to become healthy instead of
# racing a single curl against startup, then fail loudly.
for _ in $(seq 1 30); do
  if curl -fsS http://127.0.0.1:8080/healthz >/dev/null; then
    exit 0
  fi
  sleep 1
done
echo "health check never passed" >&2
exit 1

None of that is magical. Good.

The point is that each assumption is repo-local, explicit, and reviewable. If a rule changes, you update a script. If the check fails, you know which check failed. If you want AI in the loop, let it summarize the output or generate the patch after the script tells the truth.

That is actual automation.
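For a sense of how small each of those repo-local checks is, here is one possible shape for `./scripts/assert-container-port`. The EXPOSE-based rule is an assumption for illustration; a real repo might read the port from typed config instead:

```shell
# One explicit, greppable rule: the Dockerfile must EXPOSE the expected port.

assert_container_port() {
  local dockerfile="$1" port="$2"
  if ! grep -Eq "^EXPOSE[[:space:]]+${port}([[:space:]]|\$)" "$dockerfile"; then
    echo "FAIL: $dockerfile does not EXPOSE port $port" >&2
    return 1
  fi
}
```

A failing run names the file and the rule, so debugging is a diff, not prompt archaeology.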

A Practical Migration Path Out of Prompt-Heavy Ops

If your team already has a prompt-driven operational workflow, you do not need to rewrite everything in one shot.

Start here:

  1. Identify the parts that should be deterministic.
  2. Extract them into repo-local scripts.
  3. Put them behind a stable command like ./scripts/deploy-audit.
  4. Make CI run the same command.
  5. Reduce the prompt to summarization, explanation, and targeted suggestions.
  6. Remove the prompt from the critical path entirely when possible.

That last part matters.

Keeping the model as a helper is fine. Keeping the model as the gatekeeper is where things go sideways.

The Design Rule Worth Remembering

There is a big difference between these two sentences:

  • AI helped us build our deployment checks.
  • AI is our deployment checks.

The first is modern engineering.

The second is usually a control failure dressed up as innovation.

Final Word

LLMs are excellent at turning ambiguous intent into candidate implementations. That is real value. Use it.

But once the implementation matters operationally, the goal should be boring software: scripts, tests, CI jobs, typed config, explicit contracts, and deterministic tooling.

That is how you reduce risk, cut cost, and avoid the class of failures that only exist because a model was asked to improvise in a place where improvisation was never desirable.

Use AI to write the playbook.

Do not use AI as the playbook during deployment.

Tags: deployment devops llm ai automation ci/cd reliability prompt engineering platform engineering
