Agentic engineering: the grown-up version of vibe coding
Vibe coding is seductive. You describe what you want, the model produces something plausible, you iterate a couple of times, and suddenly you have a working feature without having touched a keyboard much. I get it. I have done it. It is a genuinely useful way to explore ideas quickly. But there is a ceiling on it, and that ceiling is lower than most people expect. The moment your requirements get specific, your codebase gets large, or you need someone else to trust what you have built, the vibes run out. What you are left with is code you do not fully understand, tests that may or may not cover what matters, and a workflow that depends entirely on the model making good guesses. That is not engineering.
What vibe coding actually gets wrong
The problem with pure vibe coding is not the AI — it is the absence of a feedback loop. You put a prompt in, you get code out, you eyeball it and move on. There is no specification to check the output against, no evaluation to tell you whether the model's interpretation of your intent was correct, and no automated test that will catch it when a future change breaks something. The workflow is optimised for speed of initial generation, which turns out to be the least valuable part of software development.
This pattern shows up repeatedly in teams that have adopted AI tooling without changing their process. Someone builds something impressive in an afternoon. Six weeks later, nobody can extend it without breaking something, because the code has the shape of a solution but not the structure. There are no tests, no clear contracts between components, and the person who originally built it cannot explain half of what the model generated. At that point you are not going faster than if you had written it carefully by hand — you are slower, because you are also carrying the debt.
Agentic engineering is the answer to that. It is not a rejection of AI assistance — quite the opposite. It is the recognition that to get the real value out of these tools, you have to apply the same discipline you would apply to any other powerful tool: know what you are asking for, know how to verify you got it, and have a way to catch regressions.
Specification is where it starts
The single biggest improvement you can make to an AI-assisted workflow is to specify what you want before you ask for it. This sounds obvious. In practice almost nobody does it. They open a chat, describe the feature in one sentence, and start iterating. The result is a conversation where half the tokens are spent correcting misunderstandings that a two-paragraph spec would have prevented.
With Claude Code the primary specification mechanism is CLAUDE.md. This is a Markdown file that lives in your project root and gets loaded at the start of every session. It is where you put your architectural constraints, your naming conventions, which files are off-limits, how builds and tests work, and any project-specific context the model needs. Once you have written a good CLAUDE.md, you stop getting suggestions that contradict your architecture. The model is not guessing your intent from scratch every time — it has a specification to work from.
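As a sketch, a minimal CLAUDE.md might look something like this — the specific layout, commands, and conventions below are invented for illustration, not taken from any particular project:

```markdown
# Project context

- TypeScript monorepo: packages under packages/, applications under apps/
- Run tests with `pnpm test`; build with `pnpm build`
- Do not edit anything under packages/legacy/ — it is frozen for migration
- New modules use named exports only; no default exports
- API handlers live in apps/api/src/handlers/ and must validate their input
```

A file this short already rules out whole categories of wrong guesses: the model knows where code goes, how to check its work, and which parts of the tree are out of bounds.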
Beyond CLAUDE.md, writing proper acceptance criteria before you ask the model to build a feature makes an enormous difference. It does not have to be formal. A few sentences describing the expected behaviour, the edge cases that matter, and the definition of done is enough. If you cannot write that down, you are not ready to ask an AI to implement it — and the discipline of writing it forces clarity in your own thinking before the first token is generated.
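To make that concrete, an acceptance note for a hypothetical password-reset feature — invented here purely as an example — might be no more than this:

```markdown
Password reset — acceptance criteria:
- Submitting a registered email sends a reset link valid for 30 minutes
- Unregistered emails get the same success message (no account enumeration)
- A used or expired token shows an error and offers to resend the link
- Done when: all three cases have integration tests and the flow passes CI
```

Four lines, and the model now knows about token expiry, the enumeration edge case, and what "done" means — none of which a one-sentence prompt would have conveyed.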
Evaluation closes the loop
Specification tells the model what to build. Evaluation tells you whether it built it. This is the step most AI-assisted workflows skip entirely, and it is the one that matters most for quality at scale.
Evaluation in an agentic context means having a way to assess the model's output against your intent — automatically, and consistently. The simplest form is a set of test cases that you can run after every generation. More sophisticated approaches use LLM-as-judge patterns, where a second model evaluates the output of the first against a rubric: a scoring guide or explicit set of rules you provide to the judge. The specific tool matters less than the principle: you need a feedback loop that does not depend on you reading the output by hand.
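The simplest form — a fixed case set run after every generation — fits in a few lines. This is a minimal sketch: `generated_slugify` stands in for code the model just wrote, and the case set plays the role of the eval.

```python
def generated_slugify(title: str) -> str:
    # Hypothetical model-generated implementation under evaluation.
    return "-".join(title.lower().split())

# The eval: fixed (input, expected) pairs derived from the spec,
# not from the implementation.
CASES = [
    ("Hello World", "hello-world"),
    ("  padded  ", "padded"),
    ("Already-Slugged", "already-slugged"),
]

def run_evals(fn) -> dict:
    """Score a candidate implementation against the case set."""
    results = {"passed": 0, "failed": []}
    for raw, expected in CASES:
        got = fn(raw)
        if got == expected:
            results["passed"] += 1
        else:
            results["failed"].append((raw, expected, got))
    return results

report = run_evals(generated_slugify)
print(f"{report['passed']}/{len(CASES)} cases passed")
```

The point is not the harness itself but where it sits in the loop: every regeneration gets scored against the same cases, so "looks plausible" is replaced by a number.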
Structured outputs are worth mentioning here too. If you constrain the model to produce JSON that matches a schema, or TypeScript that satisfies a set of types, you get machine-verifiable properties for free. That is not a complete evaluation strategy, but it eliminates a whole class of subtle misalignments between what you asked for and what you got.
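A sketch of what that buys you, using only the standard library — the expected shape here is invented, and in practice you might reach for a full JSON Schema validator rather than a hand-rolled check:

```python
import json

# Expected shape of the model's output: field name -> required type.
EXPECTED = {"title": str, "priority": int, "tags": list}

def parse_structured(raw: str) -> dict:
    """Parse model output and fail loudly on any shape mismatch."""
    data = json.loads(raw)  # malformed JSON fails here, not downstream
    for field, ftype in EXPECTED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], ftype):
            raise ValueError(f"wrong type for {field}")
    return data

task = parse_structured(
    '{"title": "Fix login bug", "priority": 2, "tags": ["auth"]}'
)
print(task["title"])
```

Anything that does not parse or does not match the shape is rejected at the boundary, which is exactly the class of silent misalignment the paragraph above describes.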
What is striking about this is how much it mirrors traditional TDD. You write the spec, you write the eval, and then you let the model write the code to satisfy them. The model is much better at producing correct code when it has a concrete target to hit than when it is working from a vague description. The eval is not a check on the model — it is a gift to the model. It gives it something to aim at.
Testing is not optional just because AI wrote the code
There is a temptation, when using AI-assisted development, to treat testing as something you can defer. The model generated the tests too, right? So they must be fine. That reasoning is backwards. AI-generated tests tend to be optimistic — they test the happy path the model just implemented, not the edge cases that will break it in production. A model that just wrote a function will write tests that pass against that function. That is not testing; that is a tautology.
The tests that actually matter are the ones written from the specification, not from the implementation. Write your test cases before you ask the model to generate code, or at minimum before you review the generated code. Use property-based testing to cover the space of inputs the model did not think about. Run mutation testing to verify that your tests actually catch defects, not just that they pass. Plug the whole thing into a CI pipeline so that no generated change can merge without hitting the quality gate.
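A property-based check can be sketched with nothing but the standard library — in a real project you would likely use a dedicated library such as Hypothesis. The properties below come from a hypothetical spec for a slugify function, not from its implementation: output must be lowercase and must not start or end with a hyphen.

```python
import random
import string

def slugify(title: str) -> str:
    # Implementation under test (assumed model-generated).
    return "-".join(title.lower().split())

def random_title(rng: random.Random) -> str:
    """Generate arbitrary mixes of letters and whitespace."""
    chars = string.ascii_letters + "   "
    return "".join(rng.choice(chars) for _ in range(rng.randint(0, 30)))

def check_properties(trials: int = 500) -> None:
    rng = random.Random(42)  # fixed seed so failures are reproducible
    for _ in range(trials):
        title = random_title(rng)
        slug = slugify(title)
        # Properties from the spec, checked across the whole input space:
        assert slug == slug.lower(), title
        assert not slug.startswith("-") and not slug.endswith("-"), title

check_properties()
print("all properties held")
```

Five hundred random inputs exercise corners — empty strings, runs of spaces, mixed case — that an optimistic happy-path test written against the implementation would never touch.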
This is not more work than vibe coding. It is different work, front-loaded rather than deferred. The payoff is that when you come back to this code in three months — or when a colleague picks it up — there is a foundation to build on. The tests are the executable spec. They tell the next person (or the next AI session) what correct behaviour looks like.
From assistant to autonomous collaborator
Where this is heading, I think, is that the distinction between "you writing code with AI assistance" and "an AI agent building software under your direction" will continue to blur. The patterns of agentic engineering — specify precisely, execute with capable tools, evaluate against intent, test thoroughly — are what make that shift trustworthy rather than chaotic. A model running without a specification is guessing. A model running against a well-defined eval is being held accountable. That is the difference between a productivity tool and a reliable engineering collaborator. The investment in process pays off as the autonomy increases, not the other way around.