A nationwide risk query took three days to run. In fraud detection, a slow query is a slow intervention — and citizens pay for the wait.
That single fact is where this story starts. The query itself wasn't exotic: match records across the country, against multiple risk rules, to find the patterns that signal fraud and error. The problem was that running it across the full national dataset took 72 hours. Three days of waiting, every cycle, before anyone could act on a single thing it surfaced.
Why a slow query is a slow intervention
It's tempting to treat query performance as a purely technical concern — something for the engineers to worry about, invisible to everyone else. In fraud and error work, it isn't. The speed of the query is the speed of the response.
When detection takes three days, every part of the chain slows with it. The fraudulent pattern stays live longer. The trail goes colder. The loss compounds while the answer is still being computed. Fraud and error already cost the UK taxpayer somewhere between £55 billion and £81 billion in a single year, by the National Audit Office's estimate, and only a fraction of that is ever detected. When the tooling that does the detecting runs at the speed of days, the gap between what's lost and what's caught only widens. Speed isn't a vanity metric here. It determines how much you can actually act on.
The query wasn't the problem — the foundation was
The instinct, when something runs slowly, is to optimise the thing that's running: rewrite the query, tune the rules, throw more compute at it. Sometimes that helps at the margins. But a job that takes three days at national scale is rarely telling you the query is wrong. It's telling you the foundation underneath it can't serve the query efficiently.
In a large public body, the data feeding a query like this comes from hundreds of separate source systems — in this case, more than 240. Each was built at a different time, for a different purpose, with its own format and its own quirks. When the platform underneath isn't designed for matching at this volume, every run pays the price: data scattered awkwardly, processing that can't parallelise cleanly, the same expensive work repeated because nothing upstream made it cheaper. The query is slow because the foundation makes it slow.
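One way to picture "nothing upstream made it cheaper" is the difference between comparing every record with every other record and grouping records once so that only plausible candidates are compared. The sketch below is purely illustrative and is not a description of the platform in this story: the records, the field layout, the postcode blocking key and the is_match rule are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (record_id, postcode, surname). Illustrative only.
records = [
    (1, "AB1 2CD", "smith"),
    (2, "AB1 2CD", "smith"),
    (3, "AB1 2CD", "jones"),
    (4, "ZZ9 9ZZ", "smith"),
]


def is_match(a, b):
    """Hypothetical risk rule: same postcode and same surname."""
    return a[1] == b[1] and a[2] == b[2]


def naive_pairs(rows):
    """Pair every record with every other: n * (n - 1) / 2 candidates."""
    return list(combinations(rows, 2))


def blocked_pairs(rows, key):
    """Group rows by a blocking key once, then pair only within each group.

    Each group is independent of the others, so groups could be handed
    to separate workers and processed in parallel.
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[key(row)].append(row)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


naive = naive_pairs(records)
blocked = blocked_pairs(records, key=lambda row: row[1])

naive_matches = [p for p in naive if is_match(*p)]
blocked_matches = [p for p in blocked if is_match(*p)]

print(len(naive), len(blocked))          # 6 candidate pairs vs 3
print(naive_matches == blocked_matches)  # True: same answer, less work
```

With four records the saving is trivial. The point is the shape of the cost: the naive approach grows with the square of the record count, while the grouped approach grows with the size of each group.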
What re-engineering the platform actually changed
The fix was to rebuild the foundation the query depends on, not to keep tuning the query in isolation. That meant moving the work onto a platform engineered for public-sector volumes from the start — one designed to hold and process tens of terabytes without falling over, and to run complex matching across the full dataset as a first-class operation rather than an afterthought.
Concretely, that work spanned roughly 45 terabytes of data and multiple risk rules running together. On the old footing, the matching job took 72 hours. On the re-engineered platform, the same job came down to around three hours, a speed-up of roughly 24×, achieved not by simplifying what the query asked but by giving it a foundation that could answer it properly.
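The arithmetic behind those figures is simple enough to check by hand. As a minimal sketch, the snippet below uses only the 72-hour and three-hour run times quoted above. The runs_per_week helper is a hypothetical illustration that assumes runs happen back to back with no gaps, which real scheduling would not allow.

```python
OLD_RUN_HOURS = 72  # run time on the old footing, from the text
NEW_RUN_HOURS = 3   # approximate run time on the re-engineered platform

HOURS_PER_WEEK = 7 * 24


def speedup(before_hours: float, after_hours: float) -> float:
    """Ratio of the old run time to the new run time."""
    return before_hours / after_hours


def runs_per_week(run_hours: float) -> int:
    """Upper bound on complete runs in a week, assuming back-to-back runs."""
    return int(HOURS_PER_WEEK // run_hours)


print(speedup(OLD_RUN_HOURS, NEW_RUN_HOURS))  # 24.0
print(runs_per_week(OLD_RUN_HOURS))           # 2
print(runs_per_week(NEW_RUN_HOURS))           # 56
```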
Why the multiple matters more than the milestone
A 24× improvement reads like a benchmark, but the value isn't in the ratio. It's in what the new speed makes possible. A job that runs in three hours can run overnight and be ready by morning. It can run more often, so findings are fresh rather than stale. It can be iterated on: adjust a rule, re-run, and see the result the same day instead of three days later. Detection stops feeling like an occasional event and becomes something close to routine.
For a technical leader, that's the real lesson, and it generalises well beyond fraud. The 72-hour query stands in for whatever job you're currently waiting days on. The fix is almost never a cleverer version of that job. It's a foundation built for the scale you're actually operating at.
This is a citizen issue, not just an engineering one
It's easy to file all of this under "performance tuning" and move on. But behind a slow risk query is a slower response to real loss — money that should be funding services quietly leaking away while the system computes. Speeding up the foundation isn't about a tidier benchmark. It's about catching things while they still matter, recovering money while it's still recoverable, and doing it without spending more — simply by letting existing systems work at the speed the problem demands.
Where this leaves us
The next time a critical job takes days, the temptation will be to optimise the job. Sometimes that's right. Often it's a sign the foundation underneath was never built for the scale you've grown into. The organisations that move fastest won't be the ones with the cleverest individual queries — they'll be the ones that fixed the platform those queries run on, then let everything on top finally move at the speed of need.
