Schema Drift: What It Is, Why It Happens, and How to Stop It
Sarah Jenkins · June 14, 2024 · 5 min read
It is a Sunday evening. Your CTO opens the Q3 revenue report, expecting the usual steady growth curve. Instead, they see a jagged line that flatlines for three weeks, then spikes inexplicably.
You pull up the dbt artifact and the source table in Snowflake. The query runs fine. The schema looks consistent. But the dashboard is lying to you.
The culprit? Schema drift. An upstream marketing API quietly changed the data type of the `lead_source` column from `VARCHAR` to `INT` and added a new column, `campaign_id`. Your aggregation logic compared `lead_source` against string values; once integers arrived, nothing matched and the counts fell to zero.
For three weeks, your team was reporting "steady growth" on a dataset that had effectively become garbage. This isn't a hypothetical; it happens to data teams every week.
Schema drift occurs when the structure of your source data changes over time and no longer matches the structure your downstream applications or dashboards expect. In data engineering, schema changes are inevitable; ignoring them is a recipe for downtime.
New columns are added to the source table. Your downstream models might ignore them, or worse, a new column whose name collides with an existing one can shadow or overwrite it in a `SELECT *` or a wide join.
Critical columns are removed or renamed. A query that references a dropped column by name fails loudly, but a renamed or repurposed join key can fail silently: a left join keeps returning rows, only with NULLs where the matches used to be.
A column changes data type (e.g., Integer to String, or Date to Timestamp). This is the silent killer of business logic, often causing aggregation functions to return nulls or zeros.
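The three kinds of drift above can be told apart mechanically by comparing an expected column map against the current one. The following is a minimal sketch, not a definitive implementation: `DriftReport`, `classify_drift`, and the sample column maps are hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DriftReport:
    """Drift between two schemas, grouped by kind."""
    added: dict[str, str] = field(default_factory=dict)
    removed: dict[str, str] = field(default_factory=dict)
    type_changed: dict[str, tuple[str, str]] = field(default_factory=dict)

    @property
    def has_drift(self) -> bool:
        return bool(self.added or self.removed or self.type_changed)


def classify_drift(expected: dict[str, str], actual: dict[str, str]) -> DriftReport:
    """Compare {column: type} maps and classify every difference."""
    report = DriftReport()
    for name, col_type in actual.items():
        if name not in expected:
            report.added[name] = col_type
        elif expected[name].upper() != col_type.upper():
            # Same column, different type: (expected_type, actual_type).
            report.type_changed[name] = (expected[name], col_type)
    for name, col_type in expected.items():
        if name not in actual:
            report.removed[name] = col_type
    return report


# Hypothetical snapshots of the leads table before and after the upstream change.
expected = {"lead_id": "INT", "lead_source": "VARCHAR", "created_at": "DATE"}
actual = {"lead_id": "INT", "lead_source": "INT", "campaign_id": "INT"}

report = classify_drift(expected, actual)
print(report.added)         # {'campaign_id': 'INT'}
print(report.removed)       # {'created_at': 'DATE'}
print(report.type_changed)  # {'lead_source': ('VARCHAR', 'INT')}
```

Note that a rename shows up here as one removed column plus one added column; telling a rename apart from an unrelated drop and add needs extra information, such as column lineage or value profiles.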
Schema drift is rarely malicious. It is usually the result of rapid iteration in the data ecosystem.
When an API provider releases a v2.0 update without a migration guide, or a vendor changes their database schema, your downstream models are left holding the bag. If you rely on a webhook to populate a staging table, a missing field in the payload creates a schema mismatch.
If you are moving data from an on-premise SQL Server to the cloud, manual schema mapping errors are common. A column defined as `DECIMAL(10,2)` in the source might be truncated to `INT` during export, leading to massive rounding errors in financial reports.
The frontend team expects a `user_id` of type UUID, while the backend team is sending a string representation of that ID. Without a shared contract, these drifts propagate through the ETL pipeline unnoticed until a UI bug surfaces.
Legacy systems often lack strict schema enforcement. When a maintenance window forces a schema change on a mainframe, it can cascade through your entire data warehouse overnight.
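To make the `DECIMAL(10,2)` to `INT` migration error described above concrete, here is a small worked example. The amounts are invented for illustration, and the sketch assumes the export truncates toward zero as the text describes; some engines round instead, which changes the size of the error but not the problem.

```python
from decimal import Decimal

# Hypothetical invoice amounts stored as DECIMAL(10,2) in the source system.
amounts = [Decimal("19.99"), Decimal("4.50"), Decimal("0.75"), Decimal("120.49")]

exact_total = sum(amounts, Decimal("0"))

# int() truncates toward zero, mimicking an export that casts each value to INT.
truncated_total = sum(int(a) for a in amounts)

print(exact_total)                    # 145.73
print(truncated_total)                # 143
print(exact_total - truncated_total)  # 2.73
```

The error accumulates row by row, so it grows with table size rather than staying within a fixed tolerance.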
Waiting for a stakeholder to flag a broken dashboard is reactive. Here is how the detection methods compare.
Run `dbt docs generate` and compare the resulting `catalog.json` against the one from a previous run, or review the rendered HTML docs manually.
Pros: Free, granular control.
Cons: Error-prone, resource-intensive, only runs on demand, cannot detect real-time drift.
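The manual comparison can be partly scripted. The sketch below assumes two parsed `catalog.json` files in which each entry under `sources` or `nodes` carries a `columns` mapping with a `type` per column. The `column_types` and `diff_catalogs` helpers, the `analytics` project name in the unique ID, and the inline sample catalogs are assumptions for illustration; in practice you would load the real files with `json.load`.

```python
def column_types(catalog: dict, unique_id: str) -> dict[str, str]:
    """Return {column_name: type} for one source or model in a parsed catalog."""
    entry = catalog.get("sources", {}).get(unique_id) or catalog.get("nodes", {}).get(unique_id)
    if entry is None:
        raise KeyError(f"{unique_id} not found in catalog")
    # Lower-case names so a casing change alone is not reported as drift.
    return {name.lower(): col["type"] for name, col in entry["columns"].items()}


def diff_catalogs(old: dict, new: dict, unique_id: str) -> list[str]:
    """List every column whose presence or type differs between two catalogs."""
    before = column_types(old, unique_id)
    after = column_types(new, unique_id)
    return [
        f"{name}: {before.get(name)} -> {after.get(name)}"
        for name in sorted(before.keys() | after.keys())
        if before.get(name) != after.get(name)
    ]


# Hypothetical unique ID; the project name "analytics" is an assumption.
SOURCE_ID = "source.analytics.marketing_api.leads"

# Trimmed-down stand-ins for last week's and today's catalog.json.
old_catalog = {"sources": {SOURCE_ID: {"columns": {
    "LEAD_ID": {"name": "LEAD_ID", "type": "NUMBER", "index": 1},
    "LEAD_SOURCE": {"name": "LEAD_SOURCE", "type": "TEXT", "index": 2},
}}}}
new_catalog = {"sources": {SOURCE_ID: {"columns": {
    "LEAD_ID": {"name": "LEAD_ID", "type": "NUMBER", "index": 1},
    "LEAD_SOURCE": {"name": "LEAD_SOURCE", "type": "NUMBER", "index": 2},
    "CAMPAIGN_ID": {"name": "CAMPAIGN_ID", "type": "NUMBER", "index": 3},
}}}}

changes = diff_catalogs(old_catalog, new_catalog, SOURCE_ID)
for line in changes:
    print(line)
# campaign_id: None -> NUMBER
# lead_source: TEXT -> NUMBER
```

This still only runs when someone regenerates the docs, so it shares the "on demand" limitation listed above.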
Tools like Valido run continuous queries against your source tables to compare the current schema against the expected schema.
Pros: Real-time alerts, catches drift before it breaks the dashboard, requires no manual review.
Cons: Slight overhead on the warehouse.
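The idea behind automated detection can be sketched in a few lines: read the live schema from the database and compare it with the expected one. This is not how Valido or any particular tool works internally; it is an illustrative stand-in that uses an in-memory SQLite database in place of a warehouse, and `EXPECTED`, `current_schema`, and `check_schema` are hypothetical names. Against a warehouse, the same lookup would typically query `information_schema.columns`.

```python
import sqlite3

# Expected schema for the leads table (an assumption for this example).
EXPECTED = {"lead_id": "INTEGER", "lead_source": "TEXT"}


def current_schema(conn: sqlite3.Connection, table: str) -> dict[str, str]:
    """Read {column: declared_type} from SQLite's table metadata."""
    # PRAGMA arguments cannot be bound as parameters, so `table` must be trusted.
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1]: row[2].upper() for row in rows}


def check_schema(conn: sqlite3.Connection, table: str, expected: dict[str, str]) -> list[str]:
    """Return one human-readable message per detected difference."""
    actual = current_schema(conn, table)
    problems = []
    for name in sorted(expected.keys() | actual.keys()):
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif name not in expected:
            problems.append(f"unexpected column: {name} ({actual[name]})")
        elif actual[name] != expected[name]:
            problems.append(f"type changed: {name} {expected[name]} -> {actual[name]}")
    return problems


# Simulate the drifted source table from the opening story.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (lead_id INTEGER, lead_source INTEGER, campaign_id INTEGER)")

problems = check_schema(conn, "leads", EXPECTED)
for problem in problems:
    print(problem)
# unexpected column: campaign_id (INTEGER)
# type changed: lead_source TEXT -> INTEGER
```

A metadata query like this is cheap compared with scanning the data itself, which is why the warehouse overhead of scheduled checks tends to be small.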
You can configure schema drift checks alongside your dbt project. Start with the source definition and model config that the Valido check will monitor, so that you are alerted immediately if the source table changes.

    # Source definition, e.g. in models/staging/sources.yml
    version: 2

    sources:
      - name: marketing_api
        database: ANALYTICS_DB
        schema: staging
        tables:
          - name: leads
            description: Raw leads from the CRM webhook.
            columns:
              - name: lead_id
                tests:
                  - unique
                  - not_null

    # Model config in dbt_project.yml
    models:
      core:
        staging:
          +materialized: view
          +schema: staging

In the Valido UI, you can apply a specific rule to this source. The following YAML snippet shows how to enforce strict schema adherence:

    valido:
      checks:
        schema_drift_monitor:
          source: marketing_api
          table: leads
          type: strict
          on: fail  # Alert immediately on drift

Don't just detect drift; prevent the damage.
Treat your data contracts like code contracts. Define the schema of your staging models as the contract for your downstream consumers. If the source changes, the contract is broken.
Use a schema registry (like AWS Glue Schema Registry or Confluent Schema Registry) to version your data schemas. This ensures that all consumers of a topic are using the same structure.
Configure your monitoring to alert on the first occurrence of drift, not after data has been processed. A 30-minute lag is too late when you have a 2-hour ETL window.
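One way to treat a data contract like a code contract, and to fail on the first violation rather than after processing, is a gate that validates each batch before it is loaded. The sketch below is a minimal illustration under stated assumptions: the `CONTRACT` fields, `ContractViolation`, and `enforce_contract` are hypothetical, and a production setup would more likely lean on a schema registry or a validation library than on hand-written checks.

```python
import uuid
from typing import Any, Callable


def is_uuid(value: Any) -> bool:
    """True if the value parses as a UUID."""
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False


# The contract: every field a downstream consumer relies on, with its check.
CONTRACT: dict[str, Callable[[Any], bool]] = {
    "user_id": is_uuid,
    "lead_source": lambda v: isinstance(v, str),
}


class ContractViolation(Exception):
    """Raised on the first record that breaks the contract."""


def enforce_contract(batch: list[dict]) -> list[dict]:
    """Return the batch unchanged, or raise before anything is loaded."""
    for i, record in enumerate(batch):
        missing = CONTRACT.keys() - record.keys()
        if missing:
            raise ContractViolation(f"record {i}: missing fields {sorted(missing)}")
        for field_name, is_valid in CONTRACT.items():
            if not is_valid(record[field_name]):
                raise ContractViolation(
                    f"record {i}: bad value for {field_name!r}: {record[field_name]!r}"
                )
    return batch


good = [{"user_id": "123e4567-e89b-12d3-a456-426614174000", "lead_source": "organic"}]
bad = [{"user_id": "123e4567-e89b-12d3-a456-426614174000", "lead_source": 7}]

enforce_contract(good)  # passes silently
try:
    enforce_contract(bad)
except ContractViolation as exc:
    print(exc)  # record 0: bad value for 'lead_source': 7
```

Running the gate at the start of the ETL window means a broken contract stops the load instead of flowing into downstream models.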
| Method | Speed | Accuracy | Cost |
|---|---|---|---|
| Manual Diff | Low | High (if done correctly) | Free |
| dbt Docs Diff | Medium | High | Free |
| Valido / Automated | Real-time | High | Variable |
Stop letting schema changes break your dashboards. Monitor your data contracts in real time and keep your data pipeline in sync.