12 min read
case study · composite / anonymised

Inside an AI automation project: from brief to production in 8 weeks

What does an AI automation project actually look like when it's run as an engineering project rather than a creative one? Here's a week-by-week walkthrough of a recent engagement - composite and anonymised, but mechanically accurate - with a professional services firm in Greater Manchester.

The client: a mid-sized professional services firm in Greater Manchester, around 80 staff. Their problem: inbound client enquiries arrived across three channels (email, a web form, and a legacy phone system that generated transcripts). Each enquiry needed to be categorised, a draft response generated, the right internal owner identified, and a record created in their case management system. Four steps, all manual, taking an average of 22 minutes per enquiry, across roughly 60 enquiries per day.

Their previous attempt at AI automation had involved a no-code platform that got 60% of categorisations right on the demo dataset and around 40% on live traffic. They'd shelved it six months before talking to us.

// project parameters

timeline: 8 weeks to production
scope: triage + draft response + routing + CRM write
constraint: data must not leave UK (professional privilege)
success metric: >90% triage accuracy on live traffic, measured
handover: in-house team owns it at the end

Week by week

Weeks 1–2 · Scoping and data archaeology

Before writing a line of code, we spent two weeks on the data. We pulled 6 months of historical enquiries - 7,400 records - and did a proper analysis of the distribution. Categorisation was the first surprise: what the client called "eight enquiry types" was actually 23 distinct patterns when you looked at the data. Four of them accounted for 71% of volume. The long tail was genuinely varied and would need different handling.
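A rough sketch of that distribution analysis, assuming the historical enquiries are exported to a CSV - the file and column names here are illustrative, not the client's actual schema:

import pandas as pd

enquiries = pd.read_csv("historical_enquiries.csv")   # hypothetical export of the 7,400 records

# Raw label frequency: this is where "eight enquiry types" turned out to be 23 patterns.
counts = enquiries["category_label"].value_counts()
print(f"distinct patterns: {len(counts)}")

# Cumulative share of volume, to see how few categories cover most traffic.
cumulative = (counts / counts.sum()).cumsum()
print(cumulative.head(10).round(3))   # e.g. the top four covering ~71% of volume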

We also found that the "legacy phone transcripts" were much noisier than expected - speech-to-text errors, background noise artefacts, callers speaking over each other. These would need a preprocessing step before hitting any classification model. We flagged this in week one because it affected the timeline. Better then than in week six.
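The cleanup step itself is unglamorous. A minimal sketch of the kind of rules involved - these are illustrative; the real ones were derived from the noise patterns in the client's own transcripts:

import re

def clean_transcript(text: str) -> str:
    # Illustrative cleanup only - not the project's actual preprocessing rules.
    text = re.sub(r"\[(?:inaudible|crosstalk|background noise)[^\]]*\]", " ", text, flags=re.I)
    text = re.sub(r"\b(?:um+|uh+|erm+)\b", " ", text, flags=re.I)   # filler tokens
    text = re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text, flags=re.I)    # stutters and repeated words
    return re.sub(r"\s+", " ", text).strip()                        # collapse whitespace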

Output of this phase: a data quality report, a revised category taxonomy agreed with the client, and a decision to build a thin slice - handle the top four enquiry types only, with a clean escalation path for everything else. This is the scope discipline that separates AI automation projects that ship from ones that don't.

Weeks 2–3 · Eval set construction

We built the eval set before building the system. 400 enquiries, manually labelled with the revised taxonomy, stratified to cover each category and a representative sample of edge cases (ambiguous enquiries, multi-topic enquiries, hostile tone, non-English content). Every label was reviewed by two people and disagreements resolved.
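A minimal sketch of the stratified sampling step, assuming a candidate pool with draft labels already attached - the column names and the per-category floor are illustrative:

import pandas as pd

pool = pd.read_csv("candidate_pool.csv")   # hypothetical: enquiries with draft category labels

# A fixed floor per category so rare-but-important cases (urgent, multi-topic,
# non-English) make it into the 400, rather than sampling purely by volume.
PER_CATEGORY = 20
eval_set = (
    pool.groupby("category", group_keys=False)
        .apply(lambda g: g.sample(min(len(g), PER_CATEGORY), random_state=7))
)
eval_set.to_csv("eval_set_for_double_labelling.csv", index=False)
print(f"{len(eval_set)} enquiries selected for double labelling")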

This took longer than clients expect it to. It also made everything that followed faster and more reliable. The eval set is the ground truth the whole project is measured against. Skipping it or shortcutting it is how you end up with a system that scores 97% on the demo and 60% on live traffic.

We also defined the success metrics at this stage, before seeing any model outputs: ≥90% accuracy on category classification, ≤5% false negative rate on urgent enquiries, draft response quality rated ≥4/5 by reviewers on a blinded sample. Having these agreed upfront means "is this good enough" has a concrete answer.
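In practice that means the metrics are computed the same way on every eval run. A sketch of what that looks like, assuming each run writes one row per eval enquiry - file and column names are stand-ins:

import pandas as pd

results = pd.read_csv("eval_run.csv")   # hypothetical: one row per eval enquiry

accuracy = (results["predicted_category"] == results["true_category"]).mean()

urgent = results[results["true_urgent"].astype(bool)]
urgent_fn_rate = 1 - urgent["predicted_urgent"].astype(bool).mean()   # urgent enquiries the system missed

draft_quality = results["reviewer_score"].dropna().mean()   # blinded 1-5 reviewer ratings

print(f"classification accuracy: {accuracy:.1%}   (target >= 90%)")
print(f"urgent false negative rate: {urgent_fn_rate:.1%}   (target <= 5%)")
print(f"draft quality: {draft_quality:.2f}/5   (target >= 4)")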

Weeks 3–5 · Build and iteration

The architecture: an ingestion layer normalising inputs from all three channels into a standard format; a preprocessing step for transcript cleanup; a classification pipeline using Azure OpenAI (UK South region, satisfying the data residency requirement); a response drafting step with a separate prompt set tuned per category; a routing module that matches the classification output against the firm's directory; and a CRM integration using their existing API.

// simplified pipeline

raw_input
→ normalise(channel: email | form | transcript)
→ preprocess(clean_transcript if needed)
→ classify(category, confidence, urgency_flag)
→ if confidence < 0.82: route_to_human()
→ draft_response(category, tone, template)
→ identify_owner(category, workload_data)
→ write_crm(record, draft, owner, audit_log)
→ notify(owner, slack_webhook)

The confidence threshold at the classification step is the most important single parameter in the system. Below 0.82, the enquiry routes to human review - no automated action taken. We tuned this against the eval set, plotting precision/recall curves and landing at the threshold where false negatives on urgent enquiries dropped below our 5% target. This number isn't magic; it's the output of measurement.
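A sketch of what that tuning looks like, assuming an eval run that records the classifier's confidence alongside the true labels - again, file and column names are stand-ins:

import numpy as np
import pandas as pd

eval_run = pd.read_csv("eval_run_with_confidence.csv")   # hypothetical eval output

def metrics_at(threshold: float) -> dict:
    # Below the threshold the enquiry goes to a human, so it can't become an automated error.
    automated = eval_run[eval_run["confidence"] >= threshold]
    urgent = eval_run[eval_run["true_urgent"].astype(bool)]
    missed = urgent[(urgent["confidence"] >= threshold)
                    & ~urgent["predicted_urgent"].astype(bool)]
    return {
        "threshold": round(threshold, 2),
        "coverage": len(automated) / len(eval_run),
        "accuracy": (automated["predicted_category"] == automated["true_category"]).mean(),
        "urgent_fn_rate": len(missed) / len(urgent),
    }

sweep = pd.DataFrame([metrics_at(t) for t in np.arange(0.50, 0.96, 0.02)])
# Lowest threshold that keeps urgent false negatives at or below 5% - the most
# volume we can automate while still meeting the safety target.
chosen = sweep[sweep["urgent_fn_rate"] <= 0.05].iloc[0]
print(chosen)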

We ran eval against every meaningful change. Not nightly - every commit to the classification or drafting components triggered an eval run. Regressions were caught within hours. Over the three weeks, classification accuracy on the eval set moved: 81% → 87% → 91% → 93%. Each jump has a corresponding change in the changelog.
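The gate itself doesn't need to be elaborate. A sketch of a commit-time regression check, assuming each eval run writes its headline metrics to a JSON file - the paths and keys are illustrative:

import json
import sys

MAX_REGRESSION = 0.01   # fail the commit if accuracy drops by more than one point

with open("eval/baseline.json") as f:
    baseline = json.load(f)
with open("eval/latest.json") as f:
    latest = json.load(f)

if latest["accuracy"] < baseline["accuracy"] - MAX_REGRESSION:
    sys.exit(f"eval regression: {baseline['accuracy']:.1%} -> {latest['accuracy']:.1%}")

print(f"eval ok: {latest['accuracy']:.1%} (baseline {baseline['accuracy']:.1%})")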

Week 6 · Shadow mode

Before touching live workflow, we ran the system in shadow mode for a week - processing real incoming enquiries, generating outputs, but not writing to the CRM or sending notifications. The human team handled everything as normal. We compared our outputs to their decisions.

Shadow mode accuracy on live traffic: 89.3% on classification. Lower than the eval set (93%), which is expected - the eval set, however carefully constructed, isn't perfectly representative. We identified two new enquiry subcategories that had emerged in the previous 8 weeks (a new service the firm had launched, a regulatory change that triggered a cluster of related enquiries). Both were added to the taxonomy and the eval set; accuracy on these after a targeted prompt update moved to 91.1%.
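The comparison itself is straightforward once both sides are logged. A sketch, assuming the system's shadow outputs and the team's actual decisions can be joined on an enquiry ID - names are illustrative:

import pandas as pd

shadow = pd.read_csv("shadow_outputs.csv")   # hypothetical: system decisions, never acted on
human = pd.read_csv("human_decisions.csv")   # what the team actually did

merged = shadow.merge(human, on="enquiry_id", suffixes=("_system", "_human"))
agreement = (merged["category_system"] == merged["category_human"]).mean()
print(f"shadow-mode classification agreement: {agreement:.1%}")

# Where the disagreements cluster - this is how the two new subcategories surfaced.
disagreements = merged[merged["category_system"] != merged["category_human"]]
print(disagreements["category_human"].value_counts().head(5))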

Shadow mode also revealed something unexpected: the routing module was performing well on classification but choosing lower-workload owners at the expense of specialism match. We adjusted the weighting function. This kind of finding only shows up when you run against real data. It doesn't show up in demos.
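To make the fix concrete: the owner score weighs specialism match against current workload, and the adjustment amounted to shifting weight towards specialism. The weights and fields below are made up to illustrate the shape of it:

def owner_score(owner: dict, category: str,
                w_specialism: float = 0.7, w_workload: float = 0.3) -> float:
    # Illustrative weighting only - the real function and weights belong to the client.
    specialism_match = 1.0 if category in owner["specialisms"] else 0.0
    workload_headroom = 1.0 - min(owner["open_cases"] / owner["capacity"], 1.0)
    return w_specialism * specialism_match + w_workload * workload_headroom

owners = [
    {"name": "A", "specialisms": {"employment"}, "open_cases": 9, "capacity": 10},
    {"name": "B", "specialisms": {"property"},   "open_cases": 2, "capacity": 10},
]
best = max(owners, key=lambda o: owner_score(o, "employment"))
print(best["name"])   # "A": specialism match now outweighs B's lighter workload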

Weeks 7–8 · Staged rollout, observability, handover

We enabled live mode for the top two enquiry categories first - 40% of volume - with human review of every automated output for the first three days. Error rate: 6 corrections out of 183 outputs, around 3.3%. We expanded to all four categories by end of week seven.

Observability was built throughout but we used week eight to complete it: a Langfuse dashboard showing daily accuracy metrics, a Slack alert for any run where confidence dropped below threshold across more than 15% of enquiries (a signal that something has changed in the input distribution or the model), full prompt trace logging, and a weekly automated eval run against the live model to catch any model-side drift.
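The drift alert in particular is a small amount of code. A sketch, using a plain Slack incoming webhook - the URL, the daily log export and the field names are placeholders:

import json
import urllib.request
import pandas as pd

CONFIDENCE_THRESHOLD = 0.82
ALERT_FRACTION = 0.15
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

runs = pd.read_csv("todays_runs.csv")   # hypothetical daily export from the trace logs
low_conf_share = (runs["confidence"] < CONFIDENCE_THRESHOLD).mean()

if low_conf_share > ALERT_FRACTION:
    payload = {"text": (f"Triage alert: {low_conf_share:.0%} of today's enquiries are below "
                        f"confidence {CONFIDENCE_THRESHOLD} - possible input drift.")}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)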

Week eight was also where we handed over. Not just the code - the eval dataset, the changelog, the architecture decision record, a runbook for common failure scenarios, and two sessions with their senior developer on how the classification prompt is structured, why the confidence threshold is where it is, and how to update the category taxonomy when new enquiry types emerge. They own it. They can change it. They don't need to call us to run it.

The numbers, four weeks post-launch

91.4% · classification accuracy on live traffic

3.2% · false negative rate on urgent enquiries

6 min · avg handling time (down from 22 min)

17% · of enquiries still routed to human review

That last number - 17% to human review - is worth sitting with. The goal was never 100% automation. The goal was to handle the automatable portion reliably, with clear escalation for the rest. A system that handles 83% of volume correctly and flags the remaining 17% for human attention is a good system. A system that handles 95% of volume and silently gets 5% wrong is a liability.

Eight weeks, Manchester, professional services, real stack, real data, production. That's what AI automation looks like when it's treated as engineering.

// want this for your organisation?

Same process. Your problem. Your stack.

We scope from the data up, build evals before we build the system, and hand over something your team can actually own. Manchester and remote UK engagements.

start the conversation →