
AI Data Cleaning & Deduplication for UK SMEs

AI-powered data cleansing, deduplication, and standardisation. Make your customer + product + supplier records actually trustworthy.

Use case for the AI Implementation Sprint · 4 weeks · From £3,500

In short

Dirty data is the silent tax on every other AI investment. Cleaning it first means everything downstream (segmentation, scoring, automation) actually works. Wingenious’s data cleaning sprint typically removes 20-40% of records as duplicates on a CRM that’s been collecting muck for years, standardises formats (addresses, phone numbers, emails), and installs a prevention layer at point-of-entry.

Delivered as a Quick Win (£1,500–£3,500, 1–3 days) for small automations or an Implementation Sprint (from £8,000, 4 weeks) for production workflows. Priced against scope.

What we clean

  • Duplicate customer / supplier / product records (fuzzy match + confidence-threshold merge)
  • Inconsistent format fields (UK postcodes, phone numbers, company names, currencies)
  • Missing field completion (look up + populate from authoritative sources)
  • Invalid / dead records (bounced emails, struck-off companies, dead phone numbers)

Prevention layer added at the same time so the data doesn’t re-rot.

Why this is almost always the first job

Most SMEs come to us with an AI use case in mind. Six times out of ten, the audit recommends data cleaning as a precursor before that use case is worth building. The pattern is consistent. The CRM has been collecting records for eight or twelve years. Different staff used different naming conventions. Imports from old systems duplicated the file. Sales reps created new records rather than searching for existing ones. Two acquisitions’ worth of customer data got merged in without reconciliation. Today the CRM has 84,000 records and probably 50,000 distinct customers. Reporting is unreliable, segmentation is impossible, and any AI workflow built on top inherits the noise.

The cost of doing nothing is hidden but real. Every time a marketing campaign goes out, some customers receive it three times under different email addresses. Every time the sales team pulls a target list, they call the same business twice in a week, sometimes pitching different offers. Every time finance reconciles aged debtors, they spend an extra two days because the same customer appears twice in slightly different forms. The cumulative drag on a 25-person SME is typically in the order of one and a half full-time equivalents of wasted effort, plus the harder-to-measure damage to customer perception.

What clean data actually looks like

Three properties matter, in order.

  1. Uniqueness. One record per real-world entity. Acme Trading Ltd appears once, not three times under slightly different spellings, capitalisations and trading names.
  2. Consistency. Postcodes in the format the postal service uses. Phone numbers in E.164. Email addresses lower-cased. Company names matched to Companies House where applicable. Country codes in ISO 3166-1 alpha-2. Currency fields tagged with currency.
  3. Completeness on the fields that matter. Not every field needs filling. The cleanse defines the “must-have” set per record type (typically 8 to 12 fields for a customer, 5 to 8 for a product, 6 to 10 for a supplier) and gets those above 95 percent populated. Optional fields are left alone.
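
The consistency rules above can be sketched as small normalisation functions. This is a minimal illustration under stated assumptions, not the production rule set: the function names are hypothetical, the phone handling assumes UK numbers only, and the postcode pattern is a simplified shape check rather than a lookup against a postal address file.

```python
import re

def normalise_email(raw: str) -> str:
    """Trim surrounding whitespace and lower-case the address."""
    return raw.strip().lower()

def normalise_uk_postcode(raw: str) -> str:
    """Upper-case the postcode and put one space before the 3-character inward code."""
    compact = re.sub(r"\s+", "", raw).upper()
    # Simplified shape check (assumption): outward code, then digit + two letters.
    if not re.fullmatch(r"[A-Z]{1,2}[0-9][A-Z0-9]?[0-9][A-Z]{2}", compact):
        raise ValueError(f"not a recognisable UK postcode: {raw!r}")
    return f"{compact[:-3]} {compact[-3:]}"

def normalise_uk_phone(raw: str) -> str:
    """Convert a UK number to E.164 (+44...). Assumes the number is a UK one."""
    digits = re.sub(r"[^\d+]", "", raw)      # keep digits and any '+' sign
    if digits.startswith("+44"):
        national = digits[3:]
    elif digits.startswith("0044"):
        national = digits[4:]
    elif digits.startswith("0"):
        national = digits[1:]
    else:
        raise ValueError(f"cannot interpret as a UK number: {raw!r}")
    national = national.lstrip("0")          # handles the "+44 (0)20 ..." style
    if not national.isdigit() or not 9 <= len(national) <= 10:
        raise ValueError(f"unexpected length for a UK number: {raw!r}")
    return "+44" + national

print(normalise_email("  Jo.Bloggs@Example.CO.UK "))  # jo.bloggs@example.co.uk
print(normalise_uk_postcode("sw1a1aa"))               # SW1A 1AA
print(normalise_uk_phone("020 7946 0000"))            # +442079460000
```

Rejecting values that fail the shape check, rather than guessing, keeps bad records visible so they can be routed to review instead of being silently "fixed".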

A clean dataset is not perfect. It is reliable enough for the downstream systems to trust. The line between “clean enough” and “perfect” is where most SME data projects burn budget; the build is deliberate about stopping on the right side of that line.
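
One way to make “clean enough” measurable is to check the must-have fields against the 95 percent target mentioned above. The sketch below is illustrative only: the field names, the sample records and the function names are assumptions, and it treats empty strings and missing keys as unpopulated.

```python
def completeness(records, must_have):
    """Return, per must-have field, the share of records with a non-empty value."""
    total = len(records)
    shares = {}
    for field in must_have:
        # A value counts as populated if it is non-empty after trimming whitespace.
        filled = sum(1 for r in records if str(r.get(field) or "").strip())
        shares[field] = filled / total if total else 0.0
    return shares

def fields_below_target(records, must_have, target=0.95):
    """List the must-have fields that fall short of the completeness target."""
    shares = completeness(records, must_have)
    return sorted(f for f, share in shares.items() if share < target)

# Hypothetical sample: four customer records, one missing its postcode.
sample = [
    {"name": "Acme Trading Ltd", "postcode": "SW1A 1AA"},
    {"name": "Bright Bakery Ltd", "postcode": "M1 1AE"},
    {"name": "Corner Cafe", "postcode": ""},
    {"name": "Dale Plumbing", "postcode": "LS1 4AP"},
]
print(completeness(sample, ["name", "postcode"]))
print(fields_below_target(sample, ["name", "postcode"]))  # ['postcode']
```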

How AI changes the dedup conversation

Old-school deduplication relied on rule-based matching: exact email match, or exact phone match, or a Levenshtein distance under three. The rules either over-merged (joining genuinely different customers) or under-merged (leaving obvious duplicates apart because one used “Ltd” and the other used “Limited”).
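
Both failure modes of the rule-based approach are easy to reproduce. The sketch below implements plain Levenshtein distance and applies the “under three” rule from the paragraph above; the company names are made up for illustration and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]

def rule_based_match(a: str, b: str, max_distance: int = 3) -> bool:
    """Old-school rule: treat as the same entity if the distance is under three."""
    return levenshtein(a.lower(), b.lower()) < max_distance

# Under-merge: same business, but "Ltd" vs "Limited" costs four edits.
print(levenshtein("acme trading ltd", "acme trading limited"))        # 4
print(rule_based_match("Acme Trading Ltd", "Acme Trading Limited"))   # False

# Over-merge: one letter apart, yet possibly two different customers.
print(levenshtein("smith & co", "smyth & co"))                        # 1
print(rule_based_match("Smith & Co", "Smyth & Co"))                   # True
```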

AI-assisted dedup uses embedding-based similarity on the full record. Two records that read like the same business get a high similarity score even when none of the individual fields exactly match. The score is calibrated against a small labelled sample from your real data, so the threshold reflects your business, not a textbook. The build then sorts merges into three buckets.

  • Auto-merge. Score above the high-confidence threshold (typically 0.95+). Merge happens with the audit trail logged. Reversible inside the 90-day window.
  • Human review. Score in the middle band. The pair appears in a review queue where a staff member ticks merge, do not merge, or “merge but use field X from record A”. Typical review effort is 200 to 600 pairs for a 50,000-record CRM, manageable inside a fortnight.
  • Ignore. Score below the low-confidence threshold. Left alone. The build does not chase phantom matches at the cost of false positives.
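
The scoring and the three buckets can be sketched in a few lines. This is a minimal illustration under stated assumptions: cosine similarity is one common way to compare embeddings, the vectors here are toy values rather than real model output, and the 0.80 lower threshold is a placeholder (only the 0.95 auto-merge figure comes from the text above).

```python
import math
from enum import Enum

class Bucket(Enum):
    AUTO_MERGE = "auto-merge"
    HUMAN_REVIEW = "human review"
    IGNORE = "ignore"

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cannot compare a zero vector")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

def triage(score, auto_merge_at=0.95, ignore_below=0.80):
    """Sort a candidate pair into one of the three buckets by similarity score."""
    if score >= auto_merge_at:
        return Bucket.AUTO_MERGE
    if score < ignore_below:
        return Bucket.IGNORE
    return Bucket.HUMAN_REVIEW

# Toy vectors standing in for record embeddings (hypothetical values).
record_a = [0.9, 0.1, 0.4]
record_b = [0.8, 0.2, 0.5]
score = cosine_similarity(record_a, record_b)
print(round(score, 3), triage(score).value)
```

In practice both thresholds would come out of the calibration step rather than being hard-coded defaults.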

The false-merge rate after tuning is typically below 0.1 percent for customer data, well inside the threshold for any sensible CRM operation.
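
Calibration against a labelled sample can be illustrated as follows. The sketch assumes the false-merge rate is measured as the share of auto-merged pairs that a human labelled as different entities; the text does not define the denominator, so that choice, the function names and the six sample pairs are all assumptions.

```python
def false_merge_rate(labelled_pairs, threshold):
    """Share of pairs at or above the threshold that are NOT the same entity.

    labelled_pairs: iterable of (similarity_score, is_same_entity) tuples.
    Returns 0.0 when nothing would be auto-merged at this threshold.
    """
    merged = [same for score, same in labelled_pairs if score >= threshold]
    if not merged:
        return 0.0
    return sum(1 for same in merged if not same) / len(merged)

def lowest_safe_threshold(labelled_pairs, candidates, max_rate):
    """Lowest candidate threshold whose false-merge rate stays within max_rate."""
    for threshold in sorted(candidates):
        if false_merge_rate(labelled_pairs, threshold) <= max_rate:
            return threshold
    return None  # no candidate is safe enough on this sample

# Hypothetical labelled sample: (score, reviewer said "same entity").
sample = [
    (0.99, True), (0.97, True), (0.96, True),
    (0.93, False), (0.92, True), (0.85, False),
]
print(false_merge_rate(sample, 0.90))  # 0.2
print(false_merge_rate(sample, 0.95))  # 0.0
print(lowest_safe_threshold(sample, [0.85, 0.90, 0.95], max_rate=0.0))  # 0.95
```

A real calibration sample would be far larger than six pairs; a rate as low as 0.1 percent cannot be estimated reliably from a handful of labels.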

The prevention layer matters more than the cleanse

A one-off cleanse without a prevention layer is a sandcastle in a rising tide. Within 18 months the CRM will be back to where it started, because the sources of dirt are still active. Three prevention mechanisms get installed alongside the cleanse.

  • Point-of-entry validation. Forms reject malformed postcodes, phone numbers and emails before the record gets created. Companies House lookup runs on company name and number to enforce the official version.
  • Duplicate-on-create suggestion. When a sales rep creates a new record, the system surfaces likely existing matches before the new one is saved. Most duplicates die at this gate.
  • Periodic sweep. A scheduled job runs weekly or monthly, either on Make.com or as bespoke code built with Claude Code, looking for new duplicates and format drift. Anything above the auto-merge threshold is dealt with quietly; the rest queues for review.
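
The duplicate-on-create suggestion can be sketched with the Python standard library. This is an illustration only: difflib's string similarity stands in for the embedding-based score described earlier, and the function name, the 0.8 cutoff and the sample names are assumptions.

```python
from difflib import SequenceMatcher

def suggest_existing(new_name, existing_names, limit=3, cutoff=0.8):
    """Return up to `limit` existing names that look like the one being created."""
    def key(name):
        # Compare case-insensitively with whitespace collapsed.
        return " ".join(name.lower().split())

    scored = [
        (SequenceMatcher(None, key(new_name), key(name)).ratio(), name)
        for name in existing_names
    ]
    hits = [(score, name) for score, name in scored if score >= cutoff]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [name for _, name in hits[:limit]]

crm_names = ["Acme Trading Ltd", "Acme Trading Limited", "Bright Bakery Ltd"]
matches = suggest_existing("ACME Trading Ltd.", crm_names)
if matches:
    print("Possible existing records:", matches)
```

The form would show these matches before saving, leaving the final decision to create or reuse a record with the person entering it.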

Prevention is the difference between a cleansed CRM staying clean and the cleansing cost recurring every three years.

When data cleaning is not the right first job

Two scenarios.

  1. The data is too small to matter. A CRM with 800 customers and visible duplicates is cleansed faster by hand than by a build. The honest recommendation is a tidy-up afternoon.
  2. The data is about to be replaced. A migration to a new CRM is on the roadmap inside six months. The cleanse runs on the new platform, not the old one, to avoid doing the work twice.

Engagement options

Three shapes.

  1. Quick Win from £1,500. A tightly scoped cleanse on a single source (often the customer table or the supplier list) plus basic prevention. Lands in 1 to 3 days.
  2. AI Implementation Sprint from £8,000, four weeks. Full cleanse on one main source, prevention layer, 30-day stabilisation, review queue training for the team.
  3. Fractional CAIO from £3,500 per month where the data quality job is ongoing rather than one-off, especially after an acquisition or a CRM consolidation.

The cleanse usually pays for itself inside the first two segments of marketing it touches; the prevention layer pays for itself across the next two or three years by keeping the data trustworthy.

What the audit trail looks like

A well-run cleanse leaves the SME with three things: a clean dataset, a documented set of changes, and the ability to reverse any change inside an audit window.

The audit trail typically lives in a small admin tool that exposes every merge made during the cleanse with the source records, the chosen master record, the confidence score, and the merge timestamp. Any merge can be reversed inside the 90-day archive window with one click. The trail is queryable by user, by date and by record type, which is exactly the shape an ISO auditor, an ICO query, or a customer access request would need to interrogate.
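
A merge log with these properties can be modelled quite simply. The sketch below is a hypothetical data shape, not the actual admin tool: the class, field and function names are assumptions, and only the 90-day window and the logged attributes come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

ARCHIVE_WINDOW = timedelta(days=90)

@dataclass(frozen=True)
class MergeLogEntry:
    master_id: str                # record chosen as the surviving master
    source_ids: Tuple[str, ...]   # records merged into the master
    confidence: float             # similarity score at merge time
    merged_at: datetime           # merge timestamp (timezone-aware)
    merged_by: str                # user or "auto"
    record_type: str              # e.g. "customer", "supplier"

def is_reversible(entry: MergeLogEntry, now: datetime,
                  window: timedelta = ARCHIVE_WINDOW) -> bool:
    """A merge can be undone while its source records are still archived."""
    return now - entry.merged_at <= window

def query(log: Iterable[MergeLogEntry], *, user: Optional[str] = None,
          record_type: Optional[str] = None,
          since: Optional[datetime] = None) -> List[MergeLogEntry]:
    """Filter the trail by user, record type and/or earliest merge date."""
    return [
        e for e in log
        if (user is None or e.merged_by == user)
        and (record_type is None or e.record_type == record_type)
        and (since is None or e.merged_at >= since)
    ]

entry = MergeLogEntry("C-001", ("C-417", "C-982"), 0.97,
                      datetime(2024, 1, 1, tzinfo=timezone.utc), "auto", "customer")
print(is_reversible(entry, datetime(2024, 3, 1, tzinfo=timezone.utc)))  # True
print(is_reversible(entry, datetime(2024, 6, 1, tzinfo=timezone.utc)))  # False
```

Passing a longer `window` is one way to represent an extended retention period for regulated records.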

Where the SME operates in a regulated industry, the audit window extends to match the retention requirement. Financial records typically retain for six years; client contract records for the duration of the contract plus a documented period. The cleanse honours the regulatory perimeter rather than forcing a tidy purge that creates a different problem.

The bit that takes longer than expected

Two stages reliably exceed initial estimates.

The first is the human review queue. The queue of medium-confidence pairs that need a person to decide merge-or-not is usually longer than the SME expects. For a 50,000-record CRM, the queue is typically 200 to 600 pairs, each one requiring 20 to 60 seconds of review. The cleanse builds in time for this and provides a streamlined reviewing interface, but the SME team needs to commit the time. Trying to shrink the queue by lowering the auto-merge threshold causes false merges; trying to shrink it by raising the low-confidence threshold causes obvious duplicates to survive.
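
The review commitment follows directly from those figures. A quick back-of-envelope calculation (the function name is hypothetical) shows the range works out at roughly one to ten hours of focused reviewer time, before allowing for interruptions or pairs that need a second opinion.

```python
def review_hours(pairs: int, seconds_per_pair: int) -> float:
    """Total reviewer time in hours for a queue of candidate pairs."""
    return pairs * seconds_per_pair / 3600

best_case = review_hours(200, 20)    # short queue, quick decisions
worst_case = review_hours(600, 60)   # long queue, slower decisions

print(f"{best_case:.1f} to {worst_case:.1f} hours")  # 1.1 to 10.0 hours
```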

The second is the prevention layer rollout. The point-of-entry validation has to be installed on every form, every API endpoint and every import flow that creates records. Larger SMEs often discover three or four import flows nobody had documented; the cleanse surfaces them and the prevention layer extends to cover them. Skipping any one leaves a hole through which duplicates re-enter.

How clean data unlocks the next builds

The cleanse is rarely the end of the story. Most SMEs commission it as the precursor to a build that needed clean data to be worth doing. Three patterns recur.

The first is segmentation. A customer segmentation build on dirty data produces unstable segments that confuse rather than illuminate. The cleanse precedes the segmentation by a few weeks; the segmentation then lands cleanly on the now-trustworthy customer base.

The second is reporting. The leadership team that finally has a trustworthy customer view starts asking questions of the data that they could not previously ask. Lifetime value by acquisition source. Repeat purchase rates by category. Customer concentration risk. An actionable dashboards build is the natural follow-on.

The third is targeted marketing. The marketing team that knows the list is accurate can run campaigns with confidence. Email deliverability improves because the bounce rate drops. Cost per acquisition improves because the targeting is sharper. The cleanse pays back as much through the next quarter of marketing campaigns as through the tidier data itself.

Sector overlays on the dedup work

The four-test framework applies across sectors, but the specific fields that need attention vary. Law firms care about matter codes and client identifiers more than postcodes; the cleanse focuses on conflict-check accuracy. Accountancy practices care about client status, partner allocation and engagement type; the cleanse aligns these across the practice management system. Ecommerce stores care about email and phone hygiene, repeat-customer attribution, and address standardisation for shipping accuracy. Manufacturers care about supplier consolidation, part-number normalisation and bill-of-materials cleanliness.

Why the cleanse stays human in the loop

The temptation, with modern AI tooling, is to let the model dedupe everything in one pass. The temptation is wrong.

Three reasons the cleanse keeps a human in the loop on medium-confidence pairs. The first is that the cost of a false merge is high and asymmetric. Merging two customers who turn out to be different costs the SME more than leaving two duplicates that turn out to be the same. The asymmetry justifies the human review.

The second is that the human review queue produces calibration data. Every pair the human reviews tells the model something about where the threshold should sit. After a few hundred reviews, the thresholds can be re-tuned on evidence because the team has learned what the boundary looks like.

The third is that the human in the loop builds trust. The team that watched the cleanse happen and signed off on the borderline cases is much more confident in the result than the team that woke up to a fait accompli. Trust matters because the cleansed dataset is going to be the foundation for everything downstream.

What the team experiences during a cleanse

Practical view of what the cleanse feels like for the SME team.

Week one is interview and access. Wingenious staff spend time understanding the data shape, the historical sources, the field conventions and the team’s specific concerns. Light read access to the CRM is established. A sample of 1,000 records is exported for the calibration work.

Week two is calibration and rules. The team agrees the auto-merge threshold, the human review threshold, and the field-by-field rules for standardisation. A small set of practice merges run on the sample so the team can see how the engine behaves before it operates at scale.

Week three is the live cleanse. The full dataset is processed. Auto-merges run with the audit trail logged. The human review queue populates. The team picks up the queue (around 200 to 600 pairs for a 50,000-record CRM) and works through it during the week with the streamlined reviewer interface.

Week four is the prevention layer and handover. Forms and import flows get the validation hooks installed. Companies House lookup is wired in. The duplicate-on-create suggestion is configured. Documentation, runbooks and the audit trail are handed over. The 30-day stabilisation window begins.


Sectors where data cleaning lands hardest first: accountants, ecommerce, manufacturing.

FAQ

Questions SME leaders ask.

Won't AI dedup create false matches?

Risk yes, manageable yes. We always run AI dedup with confidence thresholds: high-confidence merges happen automatically, medium-confidence pairs go to a human review queue, and low-confidence pairs are ignored. Every merge is logged and reversible. False-match rate after tuning is typically <0.1% for customer data.

What about my Salesforce / HubSpot / Pipedrive data?

All work. We dedup in place via the platform APIs, or run a one-off cleanse + sync the cleaned data back. Most builds combine a one-time cleanse (typical 20-40% record reduction on an old CRM) with an ongoing prevention layer that catches duplicates at point of entry.

How long does a typical cleanse take?

For a CRM with under 50,000 records, four weeks end to end including the prevention layer. Volume drives timeline more than complexity: 100,000 to 500,000 records typically lands in six to eight weeks. The bottleneck is rarely the AI; it is the human review queue for medium-confidence merges. The Sprint includes the cleanse, the prevention layer, and 30 days of stabilisation; ongoing data quality monitoring sits inside Fractional CAIO if needed.

What data sources beyond the CRM can you clean?

Anywhere you have records that should be unique but probably are not. Common targets: supplier records in your accounting system, product SKUs in your ERP or store, contact lists in your email tool, ticket histories in your helpdesk. The pattern is the same: pull, dedupe with confidence thresholds, push cleaned data back, install prevention at point of entry. A Sprint can cover one source thoroughly or two lighter sources in the same four weeks.

What happens to the records that get merged or deleted?

Nothing is deleted permanently in the first pass. Merges are logged with a reversible audit trail; original records are moved to an archive table for 90 days before final removal. Where regulatory retention applies (financial records, customer contracts), the archive period is extended accordingly. Reversal of any merge is a one-click operation during the audit window. After 90 days, the archive is purged unless legal hold applies.

Next step

Make this real with the Sprint.

One named workflow live in four weeks, so your team gets that time back for higher-value work. Make.com or bespoke code, weekly demo. From £3,500 · 4 weeks.