An agentic workflow can pass every test it was given and still be wrong about the business it was built for.
The failure comes after the solution is considered finished. The workflow handles the cases in the requirements doc, clears the validation set, gets handed to the client, and runs. Then a few weeks in, it starts making the wrong call on the cases that matter most, and the people relying on it can't say exactly why. Nothing is broken in the way software breaks. The logic is sound, the model is fine, the workflow is doing what it was told. What it was told was never the whole job.
The whole job was never written down. Most of what decides whether an agentic workflow is right for an organization lives in judgment nobody put in the spec. Which exceptions are worth handling. What "good enough" means here. When the workflow should stop and ask a person. Call it the unwritten spec, the part of the requirements that never made it into the requirements, because the people who hold it don't experience it as requirements. They experience it as how the work is done.
That gap is why the embedded AI engineer exists. Not a consultant who runs a discovery phase and leaves, not a vendor who delivers against a spec and moves on, but an engineer who works inside the client's operation, close enough to the actual work to see the decisions that never made it into the documentation, and tunes the workflow against them as they surface. The rest of this piece makes the case that this closeness isn't a service upgrade. It's the only place the work can be done.
Where the handoff breaks
Consider a claims operation at a mid-size insurer that brings in an agentic workflow to triage incoming claims: read the submission, pull the policy details, and route the claim to fast-track, standard review, or escalation. In testing, it works. On a sample of historical claims, it routes the way the senior adjusters would have. The solution is delivered, and for the common claims it keeps doing well.
Then the unusual claims arrive, the ones a senior adjuster handles on instinct. A claim that's technically within policy but fits a pattern the team has learned to treat as fraud. A claim where the right move isn't in any rulebook, but every adjuster on the floor knows to slow down. The workflow routes these the way the documented rules say to, which is to say correctly, and wrong. The documented rules were never where the real decision lived. The senior adjuster's sense of when to distrust a clean-looking claim never got handed over, because nobody thought to write it down, and nobody easily could. It isn't a rule. It's pattern recognition the person applying it would struggle to put into words if you asked. And it surfaces on exactly the cases the validation set never contained, because those cases were never well-described enough to put in a validation set.
Why distance fails
That last point is what breaks the standard delivery model. The obvious fix for a missing requirement is to go gather it. Interview the senior adjusters, ask how they decide, write it down. It doesn't work, and why it doesn't work is the whole argument. The judgment is tacit because the people who have it can't fully retrieve it on demand. Ask an adjuster how they know to distrust a claim and you get a partial, rationalized answer that leaves out most of what they actually do, not because they're hiding it but because they don't have conscious access to it. The judgment shows itself when a real case triggers it, not when someone asks about it in a conference room.
There's a second reason, specific to this generation of automation. What enterprises automated before mostly encoded rules. If the claim is over this amount, route it here. Rules are explicit by construction, which is why they transferred cleanly from people to software. Agentic workflows reach into the decisions the rules never covered, the ones a person was making precisely because no rule fit. That reach is the source of the value and the source of the exposure. The further into human judgment the automation goes, the more it depends on a spec that was never written, and the less a remote team can recover by asking.
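To make the contrast concrete, here is a minimal sketch of the kind of explicit rule earlier automation encoded. It is an illustration only, not any insurer's actual logic: the `Claim` fields, both thresholds, and the function name are assumptions invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    FAST_TRACK = "fast-track"
    STANDARD_REVIEW = "standard review"
    ESCALATION = "escalation"


@dataclass(frozen=True)
class Claim:
    claim_id: str
    amount: float        # claimed amount (hypothetical field)
    within_policy: bool  # outcome of a documented coverage check (hypothetical field)


# Hypothetical thresholds; a real operation would set its own.
FAST_TRACK_LIMIT = 2_000.0
ESCALATION_LIMIT = 25_000.0


def route_by_documented_rules(claim: Claim) -> Route:
    """Route a claim using only conditions that someone wrote down."""
    if not claim.within_policy or claim.amount >= ESCALATION_LIMIT:
        return Route.ESCALATION
    if claim.amount <= FAST_TRACK_LIMIT:
        return Route.FAST_TRACK
    return Route.STANDARD_REVIEW
```

Every branch here is a condition a person could state in a sentence, which is why logic like this moved from people to software without loss. It also has nowhere to put an adjuster's sense that a small, in-policy claim deserves a second look: under these rules that claim is simply fast-tracked.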
So the work can't be specified from outside and handed in. It has to be read from inside, while the workflow runs on real cases, by someone watching where its decisions diverge from what the experienced people would have done. The industry often calls this role the forward deployed engineer. The name matters less than where the person sits.
What the work actually is
An embedded AI engineer is doing something neither side can do alone. The client's own people have the judgment but can't build or tune an agentic system around it. A remote vendor can build the system but doesn't have the judgment and can't acquire it from a distance. The role works only at the intersection, and the intersection exists only on site.
In practice the work is iterative and unglamorous. The workflow makes a call, the engineer sees it diverge from what a senior person would have done, and the two of them reconstruct what the person knew that the system didn't. That knowledge goes into the workflow, the next similar case goes better, and the gap closes one decision at a time. None of it could have come from a requirements document, because the cases that teach it are the ones nobody thought to anticipate. The engineer isn't gathering requirements that were always there to be gathered. The engineer is watching judgment operate and converting it into something the system can hold.
In the claims case, that might mean the workflow learns to hold a clean-looking submission for human review when it matches the shape the adjusters had been quietly treating as suspect, a check that lived in nobody's documentation until someone watched the system route straight past it. Each such case looks small on its own. The compounding across hundreds of them is the difference between a workflow the operators trust and one they keep second-guessing.
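One way such a check might be held by the system is sketched below, under loud assumptions: the submission features, the pattern itself, and the case ID are all invented for illustration, since the real pattern is exactly the thing that has to be observed on site. The point of the sketch is the shape, a named check that records which real case surfaced it and runs before a clean submission is fast-tracked.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Submission:
    claim_id: str
    passes_documented_rules: bool
    days_since_policy_start: int  # hypothetical feature
    prior_claims_last_year: int   # hypothetical feature


@dataclass(frozen=True)
class LearnedCheck:
    name: str
    source_case: str  # the observed case that surfaced this judgment
    matches: Callable[[Submission], bool]


# Hypothetical pattern standing in for whatever the adjusters actually watch for.
LEARNED_CHECKS: list[LearnedCheck] = [
    LearnedCheck(
        name="early-claim-with-history",
        source_case="divergence observed on claim C-1042 (illustrative ID)",
        matches=lambda s: s.days_since_policy_start < 30
        and s.prior_claims_last_year >= 2,
    ),
]


def triage(
    sub: Submission, checks: list[LearnedCheck] = LEARNED_CHECKS
) -> tuple[str, Optional[str]]:
    """Return (route, name of the learned check that fired, if any)."""
    if not sub.passes_documented_rules:
        return "standard review", None
    # Learned checks run before a clean-looking submission is fast-tracked.
    for check in checks:
        if check.matches(sub):
            return "hold for human review", check.name
    return "fast-track", None
```

Keeping the source case next to each check means that anyone who later questions a hold can trace it back to the decision that taught it, rather than to a rule of unknown origin.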
The obvious objection is that you could keep a senior adjuster reviewing the workflow's calls instead, and run the system under permanent supervision. Some operations do exactly that, and as a stopgap it's reasonable. But it isn't capturing the judgment, it's renting it with no end date. The reviewer catches the bad calls without the workflow ever learning why they were bad, so the review never tapers and the cost never comes down. The embedded engineer's job is the opposite. Move the judgment out of the reviewer's head and into the system, case by case, until the oversight that's left is proportional to the risk rather than a standing tax on every decision.
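Oversight that is proportional to risk can be sketched as a review rate that varies by decision instead of a reviewer on every call. The tiers, the amount cutoff, and the `novel` flag below are hypothetical; how a real operation defines risk is its own judgment call.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    claim_id: str
    amount: float
    route: str
    novel: bool  # hypothetical flag: no previously reviewed case resembles this one


def review_rate(decision: Decision) -> float:
    """Fraction of decisions like this one that a person should check."""
    if decision.novel or decision.route == "escalation":
        return 1.0   # always reviewed
    if decision.amount >= 10_000:
        return 0.25  # hypothetical rate for high-value routine claims
    return 0.02      # hypothetical spot-check rate for routine claims


def needs_review(decision: Decision, rng: random.Random) -> bool:
    """Sample whether this particular decision goes to a reviewer."""
    return rng.random() < review_rate(decision)
```

Under permanent supervision every decision has an effective rate of 1.0. As judgment moves into the workflow, fewer cases count as novel and the rates for routine ones can come down, which is the tapering that renting the judgment never delivers.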
Where the cost hides
From an executive seat, putting an engineer inside a client's operation looks like the expensive option. It's a person on the ground, a higher day rate, weeks of someone's time spent watching work the staff could presumably explain in an afternoon. The alternative, build against a clear spec and hand it over, looks cheaper, cleaner, and easier to repeat. On the budget line, it is.
The cost of the handoff model doesn't appear on a budget line. It appears as a workflow that's right most of the time and quietly wrong on the cases that carry the most risk. It appears as a solution that was delivered and then never fully trusted, so the team keeps a manual check in place and the efficiency never fully arrives. Eventually it shows up as a project marked complete that the organization has, in practice, worked around. None of it registers as a failure. It registers as an AI initiative that cost what it was supposed to cost and returned less than it was supposed to return, for reasons no one can quite locate.
This is the same structural blindness that hides every distributed cost. The price of embedding is concentrated, visible, and easy to question in a review. The price of not embedding is spread across the months after delivery, carried by the operators who don't trust the workflow and the outcomes that quietly underperform. One has an owner and a number. The other has neither, until someone asks why last year's AI investment didn't move the metric it was bought to move.
The teams getting durable results from agentic systems are not the ones with the cleanest handoff. They're the ones who put an engineer where the judgment lives and kept them there long enough to learn the part of the job that was never written down. The choice was never between an expensive delivery model and a cheap one. It's between paying to capture the judgment that makes the workflow right, and paying later, less visibly, for the judgment that got left behind.
At Trackmind, we've found that the most successful AI deployments occur when engineers spend time inside the business process itself, watching the work run before they tune the system to it. The part of the job that was never written down doesn't transfer through a specification. It gets learned where it lives, or it gets left behind.
Trackmind embeds engineers inside enterprise AI operations to capture the judgment the spec leaves out.