
Why most AI automation in Manchester fails before it reaches production

AI automation in Manchester is having a moment. Good. But a disproportionate number of projects get built, demoed, and quietly shelved. Here's what's actually going wrong - and how to tell before you've spent the budget.

Manchester's tech scene has grown up fast. MediaCityUK, the digital cluster around NOMA, the universities spinning out talent - there's genuine substance here. And with that growth has come a wave of agencies selling AI automation: n8n flows, GPT wrappers, Make.com sequences dressed up as AI transformation. Some of it works. A lot of it doesn't survive contact with the real world.

The failure mode is almost always the same: a demo that runs flawlessly on curated inputs, a slide deck that looks compelling, a statement of work signed in January - and then nothing in production by April. Not because the idea was wrong, but because the gap between "works in a notebook" and "runs reliably in your stack" is where most AI projects go to die.

The demo-to-production gap is real and specific

Here's what that gap actually consists of. It's not a vague "scaling challenge." It's a set of concrete, predictable problems:

// What the demo tested:
input: "Please summarise this invoice."

// What production gets:
input: PDF scan, rotated 12°, footer cut off,
       VAT number missing, vendor name in Welsh
Demos use clean, representative inputs. Production gets the edge cases - the ones your team has been quietly handling manually for years because they're awkward. An AI system that hasn't been tested against those inputs will fail on them. Every time. And if you haven't built a way to detect and surface those failures, you won't know until a customer complains or an invoice gets lost.
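
Catching that class of failure doesn't need anything exotic - it needs a validation step between the model and the rest of the pipeline. A minimal sketch in Python, with hypothetical names throughout (extract, review_queue and the field list are stand-ins for your own extraction call and escalation path, not a specific library):

# Validate what the model returned before it moves on, and surface anything
# suspect to a human instead of passing it through silently.
from dataclasses import dataclass
from typing import Callable, Optional

REQUIRED_FIELDS = ("vendor_name", "invoice_date", "total", "vat_number")

@dataclass
class ExtractionResult:
    fields: dict
    confidence: float

def handle_invoice(pdf_bytes: bytes,
                   extract: Callable[[bytes], ExtractionResult],  # your LLM / OCR call
                   review_queue: list,
                   confidence_floor: float = 0.8) -> Optional[dict]:
    result = extract(pdf_bytes)
    missing = [f for f in REQUIRED_FIELDS if not result.fields.get(f)]
    if missing or result.confidence < confidence_floor:
        # Don't guess: record what failed, with enough context to act on it.
        review_queue.append({
            "reason": f"missing fields: {missing}" if missing else "low confidence",
            "confidence": result.confidence,
            "fields": result.fields,
        })
        return None
    return result.fields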

Nobody's talking about evals

When a North West agency pitches you AI automation, ask them one question: what does your evaluation framework look like? Watch the response. If they reach for "we'll test it manually before handover," that's your answer.

Evals - automated test suites that measure your AI system's output quality across a representative dataset - are the single biggest gap between AI projects that work and ones that don't. They're what lets you catch regressions when a model is updated. They're what lets you quantify "this is 94% accurate on document classification" instead of "seems to work great." They're what makes a handover actually mean something.

// eval framework basics

A working eval setup - sketched in code after this list - includes:

  • A curated dataset of real inputs, including known edge cases
  • Expected outputs (or rubrics for scoring) for each input
  • Automated runs on every code change
  • Tracking of score over time - so you catch degradation
  • Failure logging that surfaces what went wrong, not just that something did
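
In code, that can start as small as the sketch below - names are illustrative, and the scoring function is whatever fits your task: exact match, a rubric, or an LLM judge.

# Minimal eval harness: run every case, score it, log failures with context,
# and append a summary so the score trend over time stays visible.
import json
import time
from pathlib import Path
from typing import Callable

def run_evals(dataset_path: str,
              system: Callable[[str], str],        # the pipeline under test
              score: Callable[[str, str], float],  # (output, expected) -> 0..1
              results_log: str = "eval_results.jsonl") -> float:
    cases = [json.loads(line) for line in Path(dataset_path).read_text().splitlines() if line.strip()]
    scores, failures = [], []
    for case in cases:
        output = system(case["input"])
        s = score(output, case["expected"])
        scores.append(s)
        if s < 1.0:
            # Record what went wrong, not just that something did.
            failures.append({"input": case["input"], "expected": case["expected"],
                             "got": output, "score": s})
    mean_score = sum(scores) / len(scores)
    with open(results_log, "a") as f:
        f.write(json.dumps({"ts": time.time(), "mean": mean_score,
                            "cases": len(cases), "failures": failures}) + "\n")
    return mean_score

Wire that into CI so it runs on every change, and fail the build when the mean drops below the baseline you've accepted.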

This isn't exotic. It's the same discipline software engineers apply to any non-deterministic system. The reason it's rare in AI automation projects is that it takes time to build and it's invisible on a demo. Agencies optimise for the demo.

Observability: the thing nobody builds until it's on fire

A production AI system that you can't observe is a liability. You need to know: what inputs is it seeing? What outputs is it producing? Where is it slow? Where is it failing silently - returning a result, but the wrong one?

Standard application monitoring (uptime, latency, error rates) is necessary but nowhere near sufficient. You need LLM-level observability: token usage, prompt traces, output sampling, confidence signals where the model provides them. Tools like Langfuse, Arize, or a well-structured logging layer built around your specific use case. None of this is hard to implement, but it has to be designed in - you can't bolt it on after the fact without significant rework.
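
If you go the structured-logging route, the shape is roughly the sketch below - a thin wrapper around every model call, with illustrative field names and a placeholder for whichever SDK you're using:

# Wrap each model call so prompt, output, latency and token usage land in one
# structured record per call that you can query later.
import json
import logging
import time
import uuid

llm_log = logging.getLogger("llm_calls")

def observed_call(call_model, model: str, prompt: str, **kwargs):
    trace_id = str(uuid.uuid4())
    start = time.monotonic()
    response = call_model(model=model, prompt=prompt, **kwargs)  # your SDK call
    llm_log.info(json.dumps({
        "trace_id": trace_id,
        "model": model,
        "prompt": prompt,
        "output": getattr(response, "text", str(response)),
        "latency_ms": round((time.monotonic() - start) * 1000),
        "usage": str(getattr(response, "usage", "")),
    }))
    return response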

The pattern we see repeatedly in the North West is projects handed over to in-house teams with no instrumentation. The system runs, nobody's quite sure how well, and the first sign of a problem is when something goes badly wrong at the worst moment.

Prompt engineering is not a deployment strategy

If the answer to every failure mode in your project is "we'll tweak the prompt," that's a red flag. Prompt changes should be versioned, tested against your eval set before deployment, and rolled back if they cause regressions. Treating a prompt as an editable config file that anyone can update in prod is how you introduce silent failures at 3am.
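
What that discipline looks like in practice, sketched with illustrative names - prompts live in version control, the active version changes only through a reviewed commit, and promotion is gated on the eval score:

# Prompts as versioned, tested artefacts rather than editable config.
from typing import Callable

PROMPTS = {
    "invoice_summary_v3": "Summarise the key fields of this invoice:\n{document}",
    "invoice_summary_v4": "Extract vendor, date, total and VAT number, then summarise:\n{document}",
}

ACTIVE_PROMPT = "invoice_summary_v3"  # changed only via a reviewed, versioned commit

def render(key: str, **values) -> str:
    return PROMPTS[key].format(**values)

def promote_prompt(candidate: str,
                   eval_score_for: Callable[[str], float],  # runs the eval suite for a prompt key
                   baseline: float) -> str:
    """Promote a new prompt version only if it doesn't regress the eval suite."""
    score = eval_score_for(candidate)
    if score < baseline:
        raise RuntimeError(f"{candidate} scored {score:.2f}, below baseline {baseline:.2f} - not deploying")
    return candidate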

This is especially acute as foundation models update under you. GPT-4o in January behaves differently from GPT-4o in April. A prompt that worked in the demo phase can subtly break weeks later with no code change on your side. Without evals running against the live model, you won't know.
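
The countermeasure is mechanical once an eval suite exists: re-run it against the live model on a schedule and compare against the last accepted baseline, so a silent model update shows up as a score change rather than a customer complaint. A minimal sketch:

# Scheduled drift check: alert when the live model's eval score falls
# meaningfully below the accepted baseline.
from typing import Callable

def check_for_drift(run_evals: Callable[[], float],
                    baseline: float,
                    tolerance: float = 0.02) -> float:
    current = run_evals()
    if current < baseline - tolerance:
        # Alert however your team alerts - the mechanism matters less than having one.
        raise RuntimeError(f"Eval score dropped from {baseline:.2f} to {current:.2f}")
    return current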

What production-ready actually looks like

A production-ready AI automation system has:

  • Evals running in CI
  • Observability covering both infrastructure and model-level behaviour
  • Human-in-the-loop escalation paths for low-confidence outputs
  • Prompt versioning
  • Graceful degradation when the model or API is unavailable (sketched below)
  • Documentation that lets your team understand and maintain it without the agency
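
That graceful-degradation item, sketched with illustrative names: retry transient failures with backoff, and if the model or API stays down, fall back to a deterministic path or a manual queue rather than dropping the request.

# Retry transient failures with exponential backoff; if the model or API
# stays unavailable, take the degraded path instead of failing silently.
import time
from typing import Callable

def call_with_fallback(primary: Callable, fallback: Callable, *args,
                       retries: int = 3, backoff_s: float = 1.0):
    for attempt in range(retries):
        try:
            return primary(*args)
        except Exception:  # deliberately broad in this sketch
            time.sleep(backoff_s * (2 ** attempt))
    # Still failing after retries: degrade rather than drop the request.
    return fallback(*args)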

The documentation point matters particularly if you're working with a Manchester-based agency on a project you'll own long-term. The handover is part of the build. If it's not planned from week one, it won't happen properly at the end.

The AI automation projects that survive in production aren't the ones with the flashiest demos. They're the ones where someone treated the AI component with the same engineering rigour they'd apply to any other critical system dependency.

// ready to build something that actually ships?

We build AI systems that reach production.

Evals, observability, and a handover you can actually use - from day one of the project.

start a conversation →