dbt: Build and Test Data Models Without the Chaos
Raw data is rarely analysis-ready. Column names clash, duplicates sneak in, data types disagree β and thatβs before you even get to business logic. dbt (Data Build Tool) was built specifically to solve this problem. It gives analytics engineers a disciplined way to write, test, and document SQL transformations that run inside your data warehouse.
This guide walks through the full picture: what dbt is, why teams adopt it, and how to wire up a working transformation pipeline from scratch.
Why dbt Became the Standard for Data Transformation
Traditional ETL workflows required dedicated infrastructure to transform data before it landed in a warehouse. That worked when storage was expensive and compute was slow. Cloud warehouses flipped that equation. Snowflake, BigQuery, Redshift, and Databricks can process enormous datasets on demand β so it made more sense to load raw data first and transform it inside the warehouse.
dbt was designed for exactly this ELT pattern. Instead of orchestrating separate transform jobs, you write plain SQL files, and dbt compiles and runs them against your warehouse. No Java, no Spark clusters, no custom pipeline code.
ELT Pattern with dbt---------------------[Source Systems] --> [Load Raw Data] --> [Warehouse] --> [dbt Transforms] --> [Analytics Layer] APIs Fivetran BigQuery .sql models BI Tools DBs Airbyte Snowflake tests + docs Dashboards Files custom Redshift lineage graphThe result is a codebase of SQL models that are version-controlled, peer-reviewed, tested, and documented β the same engineering discipline applied to data that software teams use for application code.
What You Need Before You Start
dbt does not extract or load data. It assumes your raw data already lives in a warehouse. Before you build anything, make sure you have:
- A running data warehouse (Snowflake, BigQuery, Redshift, PostgreSQL, Databricks, or DuckDB all work)
- Credentials to connect to that warehouse
- Python 3.9 or newer installed
- Basic SQL knowledge β dbt is SQL-native
Install dbt with the adapter that matches your warehouse:
pip install dbt-snowflake # for Snowflakepip install dbt-bigquery # for BigQuerypip install dbt-redshift # for Redshiftpip install dbt-postgres # for PostgreSQLpip install dbt-duckdb # for DuckDB (great for local dev)Setting Up a dbt Project
Initialize a new project with a single command:
dbt init my_projectcd my_projectThis creates a folder structure like this:
my_project/βββ dbt_project.yml # project configβββ models/ # your SQL transformation filesβ βββ staging/ # clean raw sourcesβ βββ marts/ # business-ready tablesβββ seeds/ # static CSV reference dataβββ snapshots/ # slowly changing dimension trackingβββ tests/ # custom data testsβββ macros/ # reusable Jinja snippetsβββ analyses/ # one-off SQL explorationsOpen profiles.yml (usually stored at ~/.dbt/profiles.yml) and add your warehouse connection details. This file stays off version control to protect credentials.
Defining Raw Data Sources
Before you transform anything, you tell dbt where your raw data lives. Create a sources.yml file inside the models/ folder:
version: 2
sources: - name: ecommerce schema: raw tables: - name: orders description: "Raw order records loaded by Fivetran" - name: customers description: "Customer records from the CRM"Now you can reference these tables in models using {{ source('ecommerce', 'orders') }}. dbt tracks this reference and includes it in the lineage graph automatically.
Writing a Staging Model
Staging models are your first transformation layer. They clean and standardize raw data without applying any business logic. Create models/staging/stg_orders.sql:
with source as ( select * from {{ source('ecommerce', 'orders') }}),
renamed as ( select order_id, customer_id, cast(order_date as date) as order_date, status, amount as order_amount_usd from source where order_id is not null)
select * from renamedRun just this model to check it:
dbt run --select stg_ordersBuilding a Mart Model
Mart models sit on top of staging models and answer specific business questions. Create models/marts/monthly_revenue.sql:
with orders as ( select * from {{ ref('stg_orders') }} where status = 'completed'),
aggregated as ( select date_trunc('month', order_date) as month, count(distinct order_id) as total_orders, sum(order_amount_usd) as total_revenue_usd from orders group by 1)
select * from aggregatedorder by monthThe ref() function is how models connect to each other. dbt uses these references to build a DAG (directed acyclic graph) that determines the correct execution order.
DAG Example----------- [source: orders] [source: customers] | | [stg_orders] [stg_customers] \ / [fct_orders_enriched] | [monthly_revenue]Adding Tests
dbt has two types of tests: generic tests (built-in) and singular tests (custom SQL).
Generic tests go in a YAML file alongside your models:
version: 2
models: - name: stg_orders columns: - name: order_id tests: - unique - not_null - name: status tests: - accepted_values: values: ['pending', 'completed', 'cancelled', 'refunded'] - name: customer_id tests: - relationships: to: ref('stg_customers') field: customer_idRun all tests:
dbt testOr target a specific model:
dbt test --select stg_ordersIn 2025, dbt also supports unit tests β you define expected outputs for a model given a mocked input, letting you test transformation logic independently of live data.
Running and Deploying
Build everything in dependency order:
dbt build # runs models, seeds, snapshots, and tests togetherRun only models that have changed since the last run (using dbtβs state comparison):
dbt build --select state:modified+For production, most teams deploy through dbt Cloud, which provides a scheduler, a browser-based IDE, CI/CD integration, and a hosted documentation site. Teams that prefer self-hosting wire dbt jobs into Airflow, Prefect, or Dagster.
Documentation and Lineage
Generate and serve docs locally:
dbt docs generatedbt docs serveThis opens a browser with a searchable catalog of every model, source, column, and test β plus an interactive lineage graph. In dbt Cloud, this is hosted automatically and stays current with every deployment.
Key dbt Concepts at a Glance
| Concept | What It Does |
|---|---|
| Model | A .sql file that defines one transformation |
| Source | A reference to a raw table in your warehouse |
| Ref | Links one model to another, builds the DAG |
| Test | Validates data quality (unique, not_null, custom SQL) |
| Seed | A static CSV file loaded into the warehouse |
| Snapshot | Tracks row-level changes over time (SCD Type 2) |
| Macro | Reusable Jinja code block |
| Materialization | How a model is stored: table, view, incremental, ephemeral |
What Changed in 2025-2026
dbt has moved quickly over the past year. A few things worth knowing:
dbt Fusion β A new Rust-based compiler introduced in 2025 that dramatically speeds up project compilation. Large projects that took minutes now compile in seconds.
Python models β dbt now supports .py model files for transformations that are awkward in SQL β think ML feature engineering, pandas-style reshaping, or calling external APIs mid-pipeline.
Unit testing β Formally added to the dbt spec, letting you write deterministic tests for model logic without touching production data.
MetricFlow integration β Semantic layer definitions now live inside dbt projects, letting BI tools query metrics consistently regardless of which tool is used.
dbt Mesh β Cross-project references that let large organizations split a monolithic dbt project into domain-owned sub-projects that still share data contracts.
dbt has become the default choice for transformation work in modern data stacks because it applies software engineering practices to SQL β version control, testing, documentation, and CI/CD β without requiring analytics engineers to learn a new language. If your team is still managing transformation logic in ad hoc scripts or stored procedures, dbt is worth evaluating seriously.