There is a specific kind of meeting that happens a lot in the AI space right now. A business has identified something they want to automate. They have budget. They have buy-in. They bring in an agency or a consultant, spend a few hours talking through the idea, and leave with a statement of work and a vague sense of optimism.
Six weeks later, the build is underway but the goalposts have moved twice. The original use case turned out to be more complex than expected. A third system nobody mentioned in the first meeting needs to be integrated. The "simple" input documents come in seventeen different formats. The project either stalls, balloons in scope, or gets delivered in a form nobody actually uses.
This is a scoping failure, and it is almost always avoidable.
Start with the failure mode, not the solution
The most common scoping mistake is starting with the solution. "We want an AI agent that reads our inbound emails and routes them to the right team." That is a solution. The underlying problem might be: responses take too long because triage is manual and inconsistent. Or it might be: the wrong team picks up tickets, causing handoffs that delay resolution. Or it might be: one person does all the triage and they are a bottleneck and a single point of failure.
Each of those problems implies a different solution, a different definition of success, and different risks. If you build the routing agent without understanding which problem you are actually solving, you will optimise for the wrong thing.
The question to open with is not "what should the agent do?" It is "what is currently broken, and how do you know?" Walk through the manual process end to end. Find where time is lost, where errors cluster, where people make judgment calls that could be codified. That is where automation creates value.
Define what good looks like before you build anything
Before any technical work starts, you need a concrete answer to: how will we know if this is working? Not "the agent routes emails correctly" but something measurable.
// vague success criteria (avoid these)
"The agent should handle most of the inbound emails."
"It should be accurate."
"It needs to be faster than the current process."
// measurable success criteria (use these)
"90% of inbound emails correctly classified on first pass."
"Median triage time under 90 seconds, P95 under 5 minutes."
"False positive rate below 2% on urgent/billing category."
These numbers become your eval targets. They tell you when to ship, when to iterate, and when to escalate to a human. Without them, you have no way to make an honest assessment of whether the project succeeded, and neither does the team that has to maintain it afterwards.
It is worth being explicit about the floor as well as the ceiling. What is the minimum acceptable performance for this system to be useful at all? An agent that routes 70% of emails correctly might be a meaningful improvement over the current process, or it might just create a second manual process where someone fixes the agent's mistakes. Know which situation you are in before you commit to the build.
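To make that actionable, the targets and the floor can live in the same eval harness. A minimal sketch in Python, assuming a labelled sample of routed emails; the tuple shape, the function name, and the 75% floor are placeholders, not recommendations:

```python
# Minimal eval gate. Thresholds are hypothetical, drawn from the criteria
# above; `results` is assumed to be a list of (predicted, actual,
# triage_seconds) tuples scored against a labelled sample of real emails.
from statistics import median, quantiles

TARGET_ACCURACY = 0.90   # the ship threshold
FLOOR_ACCURACY = 0.75    # below this, the system is not worth running
MAX_MEDIAN_TRIAGE_S = 90
MAX_P95_TRIAGE_S = 300

def evaluate(results):
    accuracy = sum(pred == actual for pred, actual, _ in results) / len(results)
    times = sorted(t for _, _, t in results)
    p50 = median(times)
    p95 = quantiles(times, n=20)[18]  # 19 cut points; index 18 is the 95th percentile
    if accuracy < FLOOR_ACCURACY:
        return "stop: below the useful floor, redesign or abandon"
    if accuracy >= TARGET_ACCURACY and p50 <= MAX_MEDIAN_TRIAGE_S and p95 <= MAX_P95_TRIAGE_S:
        return "ship"
    return "iterate: above the floor, below target"
```

The point is that "ship", "iterate", and "stop" are decided by numbers agreed before the build starts, not by how the demo felt.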
Map every input source before you scope the work
AI automation projects almost always underestimate the variety of real inputs. In a scoping conversation, the person describing the use case is thinking of the clean, representative example. The invoice that arrives as a structured PDF from a known supplier. The email that uses the standard subject line format. The document that matches the template.
Production gets everything else. Before scoping the build, you need to see a realistic sample of actual inputs: not curated examples but a random slice from the last few months. Count the variants. Look for the edge cases. Ask what happens when an input does not match the expected format.
// input audit checklist
- How many distinct input formats exist for this use case?
- What percentage of inputs are "clean" vs requiring manual handling today?
- What are the five most common edge cases your team already handles manually?
- What happens downstream if an input is misclassified or misread?
- Are there any inputs the agent should never touch, regardless of confidence?
The answers to these questions shape the architecture more than any technical decision. A use case where 95% of inputs are clean and misclassification has low cost is a very different project from one where inputs are messy and errors have real consequences. Scoping them the same way is how projects go wrong.
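Even a crude tally of a random slice is revealing. A sketch of the counting step, assuming inputs land as files in a directory (the path is hypothetical, and "format" here is just the file extension; a real audit should also look inside the files, since two PDFs can be structured very differently):

```python
# Tally format variants in a random slice of recent inputs.
# "inbound/last_90_days" is a placeholder for wherever inputs actually land.
import random
from collections import Counter
from pathlib import Path

inputs = [p for p in Path("inbound/last_90_days").iterdir() if p.is_file()]
sample = random.sample(inputs, k=min(500, len(inputs)))

counts = Counter(p.suffix.lower() or "(no extension)" for p in sample)
for fmt, n in counts.most_common():
    print(f"{fmt}: {n} ({n / len(sample):.0%})")
```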
The integration question nobody asks early enough
Every AI automation project connects to existing systems. Sometimes that is straightforward: a REST API with good documentation, a database with a stable schema, a SaaS tool with a proper webhook setup. Often it is not.
The integration work is the part most scoping conversations skip over. It gets a line in the statement of work ("the agent will integrate with your CRM") without any investigation of what that actually involves. Then in week three of the build someone discovers the CRM only exposes read access via the API, write access requires a separate contract with the vendor, and the data model is nothing like what the early conversations assumed.
Before finalising scope, walk through every system the agent will need to read from or write to. For each one, establish: what access exists today, what access will need to be provisioned, who owns that system and needs to be involved, and what the data structure actually looks like in production (not in the documentation, which is frequently out of date).
This work takes a few hours. Skipping it can cost weeks.
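One way to keep this honest is to record the answers somewhere structured rather than in meeting notes. A sketch, with made-up systems and fields:

```python
# The integration map as a plain data structure; the systems, owners,
# and fields are illustrative, not a prescribed schema.
from dataclasses import dataclass

@dataclass
class Integration:
    system: str
    read_access: bool       # verified today, not assumed
    write_access: bool      # does provisioning need a separate contract?
    owner: str              # who must be involved
    schema_verified: bool   # checked against production data, not the docs

integrations = [
    Integration("CRM", read_access=True, write_access=False,
                owner="sales ops", schema_verified=False),
    Integration("ticketing", read_access=True, write_access=True,
                owner="IT", schema_verified=True),
]

# Anything on this list blocks a final scope.
blockers = [i.system for i in integrations
            if not (i.read_access and i.schema_verified)]
```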
Scope the human in the loop, not just the automation
The instinct in AI projects is to scope the automation and treat human review as a fallback. That gets the relationship backwards. The human-in-the-loop path is a first-class feature, not an afterthought.
You need to define: what triggers escalation to a human? Who does it go to? In what form? What does that person need to see to make a decision quickly? How does their decision feed back into the system? Is there a way to flag cases that should improve the model's performance over time?
An agent with a well-designed escalation path can operate reliably at 80% automation rate while a human handles the remaining 20%, and that might be entirely acceptable for the use case. An agent with no escalation path that handles 95% of cases and silently fails on the rest is a worse system, even though the raw automation number looks better.
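The difference between those two systems is often a single explicit branch. A minimal sketch, assuming the classifier exposes a confidence score (the threshold, queue names, and never-automate categories are made up):

```python
# An explicit escalation branch instead of silent failure.
CONFIDENCE_FLOOR = 0.85
NEVER_AUTOMATE = {"legal", "complaints"}  # a human always sees these

def route(email, label, confidence):
    if label in NEVER_AUTOMATE or confidence < CONFIDENCE_FLOOR:
        # Hand the reviewer everything they need to decide quickly, and
        # keep the model's guess so their correction can feed future evals.
        return {"queue": "human_review", "suggested_label": label,
                "confidence": confidence, "email": email}
    return {"queue": label, "email": email}
```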
What a good scope document actually contains
A scope document for an AI automation project should be short enough to read in ten minutes and specific enough that two engineers who have never spoken to each other could independently arrive at roughly the same design. In practice that means:
- A clear statement of the problem being solved and the evidence for it.
- Measurable success criteria with a defined floor.
- A representative sample of real inputs and an explicit list of known edge cases.
- A map of every system integration with current access status.
- A description of the escalation path and who owns it.
- A list of what is explicitly out of scope for this phase.
That last item is underrated. Explicitly naming what is not being built in phase one is one of the most useful things a scope document can do. It gives everyone permission to stay focused, and it makes future conversations about what to build next much easier to have.
Good scoping is not glamorous work. It does not produce anything you can demo. But it is the difference between a project that ships cleanly and one that drifts for months before quietly being abandoned.