What is a Model Contract?
dbt Model Contracts is a way to statically define the shape of your data and is defined in a YAML file, specifying the names, data types, and any additional constraints for each column and it is enforced while the model is being built. The contract is the set of guarantees about the shape of the returned dataset. This ensures that downstream consumers of your models have a consistent and stable set of columns to use in their analyses.
Once the contract configuration is enforced, dbt cross-checks the returned dataset from your model against the attributes you have defined in your YAML file. If it encounters a discrepancy — in data types, column names, or number of columns — dbt will promptly raise an error before materializing the model as a table in the database.
Benefits of dbt Model Contracts
dbt model contracts provide a range of benefits that improve the efficiency, effectiveness, and quality of data engineering processes. These benefits include:
- Proactive data governance: Implementing dbt model contracts encourages data teams to take a proactive approach to data governance. By defining contracts, teams can establish a clear understanding of data lineage, dependencies, and quality expectations. This helps ensure compliance with data regulations and promotes a culture of data stewardship within the organization.
- Improved data reliability: dbt model contracts enforce strict expectations about the output structure and data types of data models, ensuring that the output data is reliable and adheres to predefined standards. This increased reliability leads to more accurate insights and better decision-making, as data consumers can trust the data generated by the models.
- Easier debugging and error resolution: dbt model contracts simplify the debugging process by enforcing constraints during the build process. If a build fails due to a constraint violation, data engineers can quickly identify the issue and address it before it impacts downstream processes. This saves time and resources, reducing the potential impact of errors on the organization’s>Reduced costs and improved resource optimization: dbt model contracts can help reduce compute costs by replacing tests with equivalent constraints, which guarantee validation at build time. Constraints are typically more efficient than tests because they are enforced by the database and do not require additional compute resources. By optimizing resource usage, organizations can minimize costs while ensuring the quality and consistency of their data models.
- Increased flexibility and adaptability: DBT model contracts provide a flexible mechanism for defining the shape of the returned dataset, allowing data teams to tailor the output to their specific needs. This adaptability enables organizations to respond to evolving business requirements and make changes to their data models without introducing inconsistencies or errors.
- Streamlined collaboration and knowledge sharing: With dbt model contracts, teams can establish a clear understanding of each model’s structure, purpose, and constraints. This shared knowledge fosters better collaboration and knowledge sharing among team members, making it easier to onboard new members, address issues collectively, and optimize the overall data pipeline.
- Enhanced change management and versioning: dbt model contracts provide a structured way to manage changes in the output of data models. By defining contracts, data teams can easily identify and manage breaking changes, enabling more effective version control and change management processes. This helps maintain the stability and integrity of data pipelines, reducing the risk of unexpected issues arising from changes in the data models.
- Documentation and transparency: Model contracts serve as a form of documentation that outlines the expectations and structure of each model. This documentation can be invaluable for maintaining transparency within the organization and ensuring that stakeholders understand the purpose and limitations of each data model.
Model contracts are supported for SQL models, materialized as tables, views, and incremental (with on_schema_change: append_new_columns ).
dbt version should be >= 1.5 to use model contracts
Example :
Below is the model definition for customers and modified to add contract definition.
The raw file is sourced from
https://github.com/dbt-labs/jaffle_shop/blob/main/models/schema.yml
models:
- name: customers
description: This table has basic information about a customer, as well as some derived facts based on a customer's orders
config:
materialized: table
contract:
enforced: true
columns:
- name: customer_id
description: This is a unique identifier for a customer
data_type: int
constraints:
- type: not_null
tests:
- unique
- not_null
- name: first_name
description: Customer's first name. PII.
data_type: string
- name: last_name
description: Customer's last name. PII.
data_type: string
- name: first_order
description: Date (UTC) of a customer's first order
data_type: date
- name: most_recent_order
description: Date (UTC) of a customer's most recent order
data_type: date
- name: number_of_orders
description: Count of the number of orders a customer has placed
data_type: bigint
- name: customer_lifetime_value
description: Total value (AUD) of a customer's orders
data_type: double
Let’s say your model is defined as:
https://github.com/dbt-labs/jaffle\_shop/blob/main/models/customers.sql models/dim_customers.sql
with customers as (
select * from {{ ref('stg_customers') }}
),
orders as (
select * from {{ ref('stg_orders') }}
),
payments as (
select * from {{ ref('stg_payments') }}
),
customer_orders as (
select
customer_id,
min(order_date) as first_order,
max(order_date) as most_recent_order,
count(order_id) as number_of_orders
from orders
group by customer_id
),
customer_payments as (
select
orders.customer_id,
sum(amount) as total_amount
from payments
left join orders on
payments.order_id = orders.order_id
group by orders.customer_id
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order,
customer_orders.most_recent_order,
customer_orders.number_of_orders,
customer_payments.total_amount as customer_lifetime_value
from customers
left join customer_orders
on customers.customer_id = customer_orders.customer_id
left join customer_payments
on customers.customer_id = customer_payments.customer_id
)
select * from final
In this case, as the column types in your SQL model don’t align with those specified in the YAML contract (line#12), DBT will raise an error. This gives you a chance to resolve the issue before it potentially affects downstream analyses.
When you dbt run -s customers your model, you will recieve below error as datatype is enforced as INT, but the end results has the definition of BIGINT which comes from referenced models.
00:35:12 Finished running 1 table model in 0 hours 0 minutes and 3.35 seconds (3.35s).
00:35:12
00:35:12 Completed with 1 error and 0 warnings:
00:35:12
00:35:12 Compilation Error in model customers (models/customers.sql)
00:35:12 This model has an enforced contract that failed.
00:35:12 Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.
00:35:12
00:35:12 | column_name | definition_type | contract_type | mismatch_reason |
00:35:12 | ----------- | --------------- | ------------- | ------------------ |
00:35:12 | customer_id | bigint | int | data type mismatch |
00:35:12
00:35:12
00:35:12 > in macro assert_columns_equivalent (macros/materializations/models/table/columns_spec_ddl.sql)
00:35:12 > called by macro default__get_assert_columns_equivalent (macros/materializations/models/table/columns_spec_ddl.sql)
00:35:12 > called by macro get_assert_columns_equivalent (macros/materializations/models/table/columns_spec_ddl.sql)
00:35:12 > called by macro databricks__create_table_as (macros/adapters.sql)
00:35:12 > called by macro create_table_as (macros/materializations/models/table/create_table_as.sql)
00:35:12 > called by macro statement (macros/etc/statement.sql)
00:35:12 > called by macro materialization_table_databricks (macros/materializations/table.sql)
00:35:12 > called by model customers (models/customers.sql)
00:35:12
00:35:12 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
How does this compare to dbt test?
Contracts enforces data shape. Test enforces data quality.
The main difference between dbt tests and dbt model contracts is that tests are used to validate the content of a dbt model after it has been built, while contracts are used to define the shape of the returned dataset before the model is built.
Tests are more flexible than contracts and can be used to check a wide variety of things, such as the presence of certain values in a column, the relationship between different columns, or the freshness of the data. Contracts, on the other hand, are limited to validating the column names and data types of a model.
Limitations with dbt Model Contracts
The below limitation exists as of 10/Oct/2023
No Partial Contracts : dbt contracts currently require all columns in a model to be defined. Partial contracts, which would allow only a subset of columns to be defined, are not yet supported but may be introduced in the future.
Model Types: Model contracts can only be used with table or incremental models, not view or ephemeral models.
This article was originally published at https://medium.com/@aradsouza/dbt-model-contracts-e0a6a6e58c0e