Building 250 Data Pipelines: Lessons From a Government-Scale Data Platform
250+ pipelines, drawn from 240+ source systems, built from scratch. Scale teaches you things bespoke, one-off work never does.
That's the heart of it. A single, hand-built data pipeline is a tractable problem — most competent engineers can produce one that works. The discipline only reveals itself when you're running hundreds of them, pulling from hundreds of different source systems, each with its own format and its own bad habits, all feeding a platform that other people depend on every day. At that point, the things that felt optional on a single pipeline become the things that decide whether the whole platform survives. Here are the lessons that matter most.
Lesson 1: Bespoke does not scale — patterns do
The first instinct, when building pipeline number three, is to craft each one perfectly for its source. That instinct is fatal at scale. If all 250 pipelines are bespoke, you don't have a platform; you have 250 separate things to understand, fix and maintain, each subtly different from the last. The first engineer who leaves takes irreplaceable knowledge with them.
What works instead is standardised patterns: a small number of well-designed pipeline templates that the vast majority of sources fit into, with variation handled through configuration rather than fresh code. The goal is that any engineer on the team can open any pipeline and immediately understand it, because it looks like all the others. Standardisation feels slower on pipeline three. By pipeline thirty, it's the only reason you're still moving.
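To make "configuration rather than fresh code" concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the platform's actual design: SourceConfig, run_pipeline and the two example sources are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Row = dict[str, object]


@dataclass(frozen=True)
class SourceConfig:
    """Everything that varies between sources lives here, not in code."""
    name: str
    rename: dict[str, str]      # source column -> platform column
    required: tuple[str, ...]   # platform columns that must be non-empty


def run_pipeline(config: SourceConfig, extract: Callable[[], Iterable[Row]]) -> list[Row]:
    """One shared template: extract, rename columns, drop incomplete rows."""
    loaded = []
    for raw in extract():
        row = {config.rename.get(key, key): value for key, value in raw.items()}
        if all(row.get(col) not in (None, "") for col in config.required):
            loaded.append(row)
    return loaded


# Two hypothetical sources: same code path, different configuration.
licences = SourceConfig("licences", {"LIC_NO": "licence_id", "HOLDER": "holder_name"}, ("licence_id",))
permits = SourceConfig("permits", {"id": "permit_id"}, ("permit_id",))

licence_rows = run_pipeline(
    licences,
    lambda: [{"LIC_NO": "A1", "HOLDER": "Ada"}, {"LIC_NO": "", "HOLDER": "Bo"}],
)
permit_rows = run_pipeline(permits, lambda: [{"id": 7, "status": "open"}])
```

Onboarding another source of the same shape means writing another SourceConfig, not another pipeline, so an engineer who understands run_pipeline understands every pipeline built on it.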
Lesson 2: Assume every pipeline will run twice — design for idempotency
At scale, pipelines fail and get retried. A network blip, a timed-out connection, a node that falls over mid-run — over hundreds of pipelines running constantly, partial failures aren't an edge case, they're a daily certainty. The dangerous failure isn't the one that stops; it's the one that half-completes and then runs again.
If a pipeline isn't idempotent — if running it twice produces a different result from running it once — a routine retry can silently double-count records or corrupt the data it was meant to load. So the discipline is to design every pipeline so that running it twice is safe and produces exactly the same end state as running it once. It's unglamorous, it adds work upfront, and it's the single thing that most often separates a platform whose numbers you can trust from one you can't.
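One common way to get this property is to load by a stable key and overwrite rather than append. The sketch below uses Python's built-in sqlite3 module purely for illustration; the payments table, its columns and the load_batch helper are assumptions, and a real warehouse would use its own merge or upsert mechanism.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, rows: list[tuple[str, int]]) -> None:
    """Upsert by primary key inside one transaction, so a retry overwrites instead of appending."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT OR REPLACE INTO payments (payment_id, amount) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount INTEGER)")

batch = [("p1", 100), ("p2", 250)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry of the same batch

count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone()
print(count, total)  # 2 350: the retry did not double-count
```

Two things do the work here: the primary key makes a repeated row replace itself, and the transaction means a run that dies halfway leaves nothing behind for the retry to collide with.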
Lesson 3: The failures that hurt are silent — so monitor for them
A pipeline that crashes loudly is the easy case: it alerts, someone fixes it. The expensive failures are the quiet ones. A pipeline that runs "successfully" but moves zero rows because the source went empty. One that completes but takes four times longer than usual, a sign something upstream is degrading. One that's been failing for a week in a way nobody configured an alert for.
At scale you cannot watch hundreds of pipelines by eye, and you cannot rely on someone noticing. The platform has to watch itself: monitoring not just "did it run?" but "did it move a sensible volume of data, in a sensible time, with sensible results?" — and surface the anomaly before it reaches the people downstream. The rule of thumb is simple and uncompromising: you should learn that a pipeline broke from your monitoring, never from a user asking why the numbers look wrong.
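As a rough sketch of what "sensible volume, in a sensible time" could look like in code, the check below compares a run that reported success against the median of recent runs. The RunStats type, the find_anomalies function and both thresholds are hypothetical choices for this example; appropriate values depend on how variable each source is.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RunStats:
    rows_moved: int
    duration_seconds: float


def find_anomalies(
    current: RunStats,
    history: list[RunStats],
    min_volume_ratio: float = 0.5,    # assumed threshold
    max_duration_ratio: float = 3.0,  # assumed threshold
) -> list[str]:
    """Flag a 'successful' run that looks wrong against recent history (history must be non-empty)."""
    typical_rows = median(run.rows_moved for run in history)
    typical_duration = median(run.duration_seconds for run in history)
    problems = []
    if current.rows_moved == 0:
        problems.append("zero rows moved")
    elif current.rows_moved < typical_rows * min_volume_ratio:
        problems.append("row volume well below typical")
    if current.duration_seconds > typical_duration * max_duration_ratio:
        problems.append("run took much longer than typical")
    return problems


history = [RunStats(1000, 60.0), RunStats(1100, 65.0), RunStats(950, 58.0)]
healthy = find_anomalies(RunStats(1020, 62.0), history)
empty_and_slow = find_anomalies(RunStats(0, 240.0), history)
```

Here healthy is an empty list, while empty_and_slow reports both the zero-row load and the fourfold slowdown, even though neither run raised an error.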
Lesson 4: Source systems change without telling you — handle schema drift gracefully
When you're drawing from 240+ source systems, you don't control any of them. A team that owns one of those sources will, eventually, add a column, rename a field, or change a data type, and they will not tell you first. This is schema drift, and at scale it isn't a possibility, it's a weekly event.
A naive pipeline does one of two bad things when the schema shifts beneath it: it breaks outright or, worse, it keeps running and quietly misinterprets the new shape of the data. The mature approach is to detect schema changes deliberately: validate incoming data against an expected contract, fail safely and loudly when something unexpected arrives rather than guessing, and make adapting to a known change a configuration tweak rather than a re-engineering project. You can't stop source systems from changing. You can make sure your platform notices, and responds on your terms instead of theirs.
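A contract check of this kind can be very small. The following is a minimal sketch, assuming records arrive as Python dictionaries; EXPECTED_CONTRACT, SchemaDriftError and validate_record are hypothetical names, and a production platform would more likely lean on a dedicated schema or data-quality tool.

```python
# Hypothetical contract: expected field names and their Python types.
EXPECTED_CONTRACT = {"case_id": int, "opened_on": str, "status": str}


class SchemaDriftError(Exception):
    """Raised when incoming data does not match the expected contract."""


def validate_record(record: dict, contract: dict[str, type]) -> None:
    """Fail loudly on missing, unexpected or wrongly typed fields instead of guessing."""
    missing = sorted(contract.keys() - record.keys())
    unexpected = sorted(record.keys() - contract.keys())
    wrong_type = sorted(
        name for name, expected in contract.items()
        if name in record and not isinstance(record[name], expected)
    )
    if missing or unexpected or wrong_type:
        raise SchemaDriftError(
            f"missing={missing} unexpected={unexpected} wrong_type={wrong_type}"
        )


# A record that matches the contract passes silently.
validate_record({"case_id": 1, "opened_on": "2024-01-05", "status": "open"}, EXPECTED_CONTRACT)

# A renamed field and a changed type are both reported, not silently absorbed.
try:
    validate_record({"case_id": "1", "opened": "2024-01-05", "status": "open"}, EXPECTED_CONTRACT)
except SchemaDriftError as error:
    drift_message = str(error)
else:
    drift_message = ""
```

Because the contract is plain data, accepting a known change (say, the source renaming opened_on) is an edit to EXPECTED_CONTRACT rather than to the pipeline code.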
Lesson 5: The pipeline is a means — the platform is the point
The final lesson sits above the other four. It's tempting to treat each pipeline as a deliverable in its own right. But no one actually wants 250 pipelines — they want a platform that reliably presents trustworthy, connected data, and the pipelines are just how it gets there. That reframing changes every decision: you optimise for the health of the whole estate, not the cleverness of any one pipeline. Consistency beats local perfection. Boring and reliable beats clever and fragile, every time.
Why this is a Foundations-First story
None of these lessons are exciting. Standardised patterns, idempotency, monitoring, schema-drift handling — this is the plumbing, not the headline. But it's exactly the plumbing that decides whether everything built on top, every report and every model, can be trusted. A platform without this discipline doesn't fail dramatically; it fails quietly, in ways you discover too late, when a number that fed a decision turns out to have been wrong for weeks. Getting the foundations right is what makes everything above them safe to rely on.
Where this leaves us
Scale is an unforgiving teacher, but a clear one. It strips away the comforting idea that careful, bespoke craftsmanship is the answer, and replaces it with something less glamorous and far more durable: discipline, applied consistently, across the whole estate. The teams whose platforms hold under real load aren't the ones who built the cleverest individual pipelines. They're the ones who were disciplined about the boring things, two hundred and fifty times over.
Foundations first. At scale, foundations means discipline.
