
Data Engineering

Schema Evolution: The Unsexy Problem Breaking Your Pipelines

Schema evolution breaks data infrastructure silently. Learn why upstream changes cause downstream failures and what actually prevents them.

Apr 12, 2026 · 10 min read

Schema evolution breaks more data infrastructure than anyone talks about. It's not a glamorous failure. There's no dramatic outage, no CEO asking what happened. A field gets renamed. A column gets added. A data type changes from string to integer. Somewhere downstream, a pipeline quietly starts producing garbage, and nobody notices for weeks.

A data team spends months building reliable infrastructure, everything tested, everything monitored. Then a product team makes a routine database change and three pipelines break in ways that don't trigger alerts. The data team finds out when a business analyst asks why the numbers look wrong.

Most data infrastructure assumes upstream sources are stable. They're not.

The Visibility Gap

The team that owns the source system doesn't know who depends on their data. They're not trying to break things. They're shipping a feature, adding a field their application needs, changing a column type to fix a bug on their end. Routine work.

They have no visibility into the seven downstream pipelines that ingest from their tables. Nobody told them those pipelines exist. The data catalog, if there is one, was populated once and hasn't been updated since. The data team, meanwhile, wrote their pipelines assuming column names don't change and field types stay consistent. Both teams are doing their jobs correctly. The space between them is unmanaged.

Silent Failures Are the Dangerous Ones

Some failures are loud. A pipeline throws an error because a column it expects doesn't exist anymore. Someone gets paged. The problem gets fixed. These are the easy ones.

The dangerous failures are silent.

A field that used to contain customer IDs now contains account IDs. Both are integers. Validation passes. The pipeline runs successfully. The dashboard updates. The numbers look plausible. Three weeks later, someone notices the metrics don't match another report, and the investigation begins.
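This is the failure mode in miniature. A sketch of why it slips through (the field names and validation function here are hypothetical, for illustration only): a structural check confirms the type, and the type never changed.

```python
# Hypothetical illustration: a type-only check cannot distinguish
# customer IDs from account IDs -- both are integers.

def validate(row: dict) -> bool:
    """Structural validation only: the field exists and is an int."""
    return isinstance(row.get("customer_id"), int)

good_row = {"customer_id": 48213}  # a real customer ID
bad_row = {"customer_id": 90017}   # upstream now writes account IDs here

# Both pass. The schema is intact; the meaning is not.
assert validate(good_row)
assert validate(bad_row)
```

Nothing in the pipeline has any reason to object, which is exactly why the error surfaces in a dashboard three weeks later instead of an alert.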

An operations team discovers their capacity planning model has been running on corrupted data for a month. The inventory team changed how they represented warehouse locations, switching from a code to a composite key. The schema didn't change in any way that triggered errors. The model kept running, kept producing forecasts, kept informing decisions. The forecasts were nonsense, but they looked like numbers.

The inventory team had no idea. From their side, everything was working fine.

Documentation Doesn't Scale

The standard advice: document your schemas, notify downstream consumers before making changes, coordinate releases.

This advice sounds reasonable and falls apart immediately at scale. Documentation gets stale the moment someone writes it. "Please notify the data team before making changes" relies on people remembering a process that isn't part of their normal workflow. The product engineer shipping a feature at 4pm on Thursday isn't thinking about a data pipeline they've never heard of.

And if you try coordination meetings, you create a bottleneck. Every schema change requires a review. Product velocity slows. Leadership asks why simple changes take so long. The process gets bypassed for "low-risk" changes. Nothing breaks. The bypass becomes normal.

Then something breaks badly, and someone suggests more process.

What Actually Holds Up

Schema registries that treat data interfaces like APIs. Not documentation. Versioning with teeth. Breaking changes require a version bump. Downstream consumers declare which versions they depend on. When a source system wants to rename a column, the registry knows who gets affected. Notification is automatic, not a process someone has to remember.
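The core mechanism can be sketched in a few lines. This is a minimal toy, not a real registry API (tools like Confluent Schema Registry do this with compatibility rules); every name below is hypothetical. The point is the data structure: consumers declare versioned dependencies, so impact analysis is a lookup, not a memory.

```python
# Minimal sketch of a schema registry with dependency tracking.
# All names (SchemaRegistry, register, depend_on, affected_consumers)
# are illustrative, not a real library's API.

class SchemaRegistry:
    def __init__(self):
        self.versions = {}   # table -> current major version
        self.consumers = {}  # (table, version) -> set of consumer names

    def register(self, table: str, version: int):
        self.versions[table] = version

    def depend_on(self, consumer: str, table: str, version: int):
        """A downstream pipeline declares which version it reads."""
        self.consumers.setdefault((table, version), set()).add(consumer)

    def affected_consumers(self, table: str) -> set:
        """Who breaks if `table` ships a breaking (major) change?"""
        current = self.versions[table]
        return self.consumers.get((table, current), set())

registry = SchemaRegistry()
registry.register("orders", 2)
registry.depend_on("revenue_pipeline", "orders", 2)
registry.depend_on("forecast_model", "orders", 2)

# A proposed breaking change to `orders` now has a known blast radius,
# so notification can be automatic rather than a remembered process.
print(registry.affected_consumers("orders"))
```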

This creates friction that product teams hate. They can't just rename a column anymore. They have to deprecate the old name, add the new one, wait for consumers to migrate, then remove the old one. It slows them down. They're not wrong about that.

Contract testing catches some failures in CI before they reach production. But it only works for structural changes, like removed columns or type mismatches. When the schema stays identical but the meaning shifts, when customer IDs become account IDs, contract tests pass and the data is still wrong.
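A contract test's coverage, and its blind spot, both show up in a small example. This sketch assumes a hand-rolled check against an expected schema (real contracts usually live in Avro, Protobuf, or a data-contract spec file, and the column names here are made up):

```python
# A contract check of the kind that runs in the source system's CI.
# EXPECTED is an illustrative contract, not a real spec format.

EXPECTED = {"customer_id": int, "order_total": float, "created_at": str}

def check_contract(sample_row: dict) -> list:
    """Return a list of violations: missing columns or wrong types."""
    violations = []
    for column, expected_type in EXPECTED.items():
        if column not in sample_row:
            violations.append(f"missing column: {column}")
        elif not isinstance(sample_row[column], expected_type):
            violations.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(sample_row[column]).__name__}")
    return violations

# Structural break: a removed column is caught before production.
assert check_contract({"customer_id": 1, "order_total": 9.99}) \
    == ["missing column: created_at"]

# Semantic break: customer_id now holds account IDs. Still an int.
# The contract passes; the data is still wrong.
assert check_contract({"customer_id": 90017, "order_total": 9.99,
                       "created_at": "2026-04-01"}) == []
```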

Catching semantic drift requires monitoring that tracks distributions, not just schemas. If a field that usually contains values between 1 and 1000 suddenly contains values in the millions, something changed upstream. Most teams don't build this kind of monitoring. It's not obvious you need it until you've already lost three weeks of data.
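The article's "1 to 1000 suddenly in the millions" case can be sketched as a simple range monitor. The threshold and field range here are illustrative assumptions; production versions compare fuller distributions (quantiles, null rates, cardinality), but the shape of the check is the same:

```python
# Sketch of distribution monitoring for semantic drift: compare a batch
# against a historical range instead of a schema. The 5% threshold and
# the 1-1000 range are illustrative, not recommendations.

HISTORICAL_MIN, HISTORICAL_MAX = 1, 1000
ALERT_THRESHOLD = 0.05  # alert if >5% of values fall outside the range

def out_of_range_share(values: list) -> float:
    outliers = [v for v in values
                if not HISTORICAL_MIN <= v <= HISTORICAL_MAX]
    return len(outliers) / len(values)

def drifted(values: list) -> bool:
    return out_of_range_share(values) > ALERT_THRESHOLD

# A normal batch stays quiet.
assert not drifted([5, 120, 999, 42] * 25)

# Upstream change: the field suddenly contains values in the millions.
# The schema is unchanged, but the distribution gives it away.
assert drifted([2_000_000, 3_500_000, 500, 800])
```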

Nobody Owns the Interface

The source system team owns their database. The data team owns their pipelines. The interface between them, the implicit contract that says this column will exist and contain this type of data, belongs to nobody.

Some organizations respond by making data teams responsible for defensive coding. Assume everything will change. Validate constantly. Build pipelines that fail gracefully. This works, but every pipeline gets more complex. Every ingestion job grows layers of checks that exist because nobody upstream is accountable for stability.
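Defensive coding at the pipeline boundary might look like the following. This is one pattern among many, with made-up column names and no specific framework implied: validate every batch on ingest and reject loudly rather than process silently.

```python
# One defensive-coding pattern: validate each batch at the pipeline
# boundary and fail loudly (or quarantine) instead of producing garbage.
# REQUIRED and the column names are illustrative.

REQUIRED = {"warehouse_code": str, "units": int}

class SchemaViolation(Exception):
    pass

def ingest(batch: list) -> list:
    """Check rows before processing; reject the batch on any violation."""
    for i, row in enumerate(batch):
        for column, expected_type in REQUIRED.items():
            if column not in row:
                raise SchemaViolation(f"row {i}: missing {column}")
            if not isinstance(row[column], expected_type):
                raise SchemaViolation(
                    f"row {i}: {column} is {type(row[column]).__name__}, "
                    f"expected {expected_type.__name__}")
    return batch  # safe to hand to the rest of the pipeline

try:
    # Upstream switched warehouse_code from a string to a composite key.
    ingest([{"warehouse_code": ("US", 12), "units": 40}])
except SchemaViolation as e:
    print(f"batch rejected: {e}")
```

The cost the paragraph describes is visible here: every ingestion job carries its own copy of these checks, because nothing upstream guarantees stability.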

Other organizations push responsibility to source system teams. Your database is a product. Downstream consumers are your users. You own interface stability. This requires executive backing and a culture shift that most engineering orgs aren't ready for.

Neither answer is obviously correct. But "nobody owns it" is the default, and the default is how you end up with a month of corrupted data and two teams pointing at each other.

The conversation about schema registries and data contracts usually happens after an incident. Whether it leads to actual change depends on how badly the organization got burned and whether anyone with authority is willing to add friction to product teams.

Usually, the answer is not yet.

Trackmind builds the data infrastructure that holds up when upstream sources change. Learn about our data engineering practice.