dbt: Build and Test Data Models Without the Chaos

Raw data is rarely analysis-ready. Column names clash, duplicates sneak in, data types disagree — and that’s before you even get to business logic. dbt (Data Build Tool) was built specifically to solve this problem. It gives analytics engineers a disciplined way to write, test, and document SQL transformations that run inside your data warehouse.

This guide walks through the full picture: what dbt is, why teams adopt it, and how to wire up a working transformation pipeline from scratch.

Why dbt Became the Standard for Data Transformation

Traditional ETL workflows required dedicated infrastructure to transform data before it landed in a warehouse. That worked when storage was expensive and compute was slow. Cloud warehouses flipped that equation. Snowflake, BigQuery, Redshift, and Databricks can process enormous datasets on demand — so it made more sense to load raw data first and transform it inside the warehouse.

dbt was designed for exactly this ELT pattern. Instead of orchestrating separate transform jobs, you write plain SQL files, and dbt compiles and runs them against your warehouse. No Java, no Spark clusters, no custom pipeline code.

ELT Pattern with dbt
---------------------
[Source Systems] --> [Load Raw Data] --> [Warehouse] --> [dbt Transforms] --> [Analytics Layer]
     APIs                Fivetran          BigQuery         .sql models          BI Tools
     DBs                 Airbyte           Snowflake         tests + docs         Dashboards
     Files               custom            Redshift          lineage graph

The result is a codebase of SQL models that are version-controlled, peer-reviewed, tested, and documented — the same engineering discipline applied to data that software teams use for application code.

What You Need Before You Start

dbt does not extract or load data. It assumes your raw data already lives in a warehouse. Before you build anything, make sure you have:

A running data warehouse (Snowflake, BigQuery, Redshift, PostgreSQL, Databricks, or DuckDB all work)
Credentials to connect to that warehouse
Python 3.9 or newer installed
Basic SQL knowledge — dbt is SQL-native

Install dbt with the adapter that matches your warehouse:

pip install dbt-snowflake   # for Snowflake
pip install dbt-bigquery    # for BigQuery
pip install dbt-redshift    # for Redshift
pip install dbt-postgres    # for PostgreSQL
pip install dbt-duckdb      # for DuckDB (great for local dev)

Setting Up a dbt Project

Initialize a new project with a single command:

dbt init my_project
cd my_project

This creates a folder structure like this:

my_project/
├── dbt_project.yml       # project config
├── models/               # your SQL transformation files
│   ├── staging/          # clean raw sources
│   └── marts/            # business-ready tables
├── seeds/                # static CSV reference data
├── snapshots/            # slowly changing dimension tracking
├── tests/                # custom data tests
├── macros/               # reusable Jinja snippets
└── analyses/             # one-off SQL explorations

Open profiles.yml (usually stored at ~/.dbt/profiles.yml) and add your warehouse connection details. This file stays off version control to protect credentials.

Defining Raw Data Sources

Before you transform anything, you tell dbt where your raw data lives. Create a sources.yml file inside the models/ folder:

version: 2

sources:
  - name: ecommerce
    schema: raw
    tables:
      - name: orders
        description: "Raw order records loaded by Fivetran"
      - name: customers
        description: "Customer records from the CRM"

Now you can reference these tables in models using {{ source('ecommerce', 'orders') }}. dbt tracks this reference and includes it in the lineage graph automatically.

Writing a Staging Model

Staging models are your first transformation layer. They clean and standardize raw data without applying any business logic. Create models/staging/stg_orders.sql:

with source as (
    select * from {{ source('ecommerce', 'orders') }}
),

renamed as (
    select
        order_id,
        customer_id,
        cast(order_date as date)   as order_date,
        status,
        amount                     as order_amount_usd
    from source
    where order_id is not null
)

select * from renamed

Run just this model to check it:

dbt run --select stg_orders

Building a Mart Model

Mart models sit on top of staging models and answer specific business questions. Create models/marts/monthly_revenue.sql:

with orders as (
    select * from {{ ref('stg_orders') }}
    where status = 'completed'
),

aggregated as (
    select
        date_trunc('month', order_date)  as month,
        count(distinct order_id)         as total_orders,
        sum(order_amount_usd)            as total_revenue_usd
    from orders
    group by 1
)

select * from aggregated
order by month

The ref() function is how models connect to each other. dbt uses these references to build a DAG (directed acyclic graph) that determines the correct execution order.

DAG Example
-----------
  [source: orders]     [source: customers]
        |                      |
  [stg_orders]          [stg_customers]
        \                     /
         [fct_orders_enriched]
                 |
         [monthly_revenue]

Adding Tests

dbt has two types of tests: generic tests (built-in) and singular tests (custom SQL).

Generic tests go in a YAML file alongside your models:

version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Run all tests:

dbt test

Or target a specific model:

dbt test --select stg_orders

In 2025, dbt also supports unit tests — you define expected outputs for a model given a mocked input, letting you test transformation logic independently of live data.

Running and Deploying

Build everything in dependency order:

dbt build   # runs models, seeds, snapshots, and tests together

Run only models that have changed since the last run (using dbt’s state comparison):

dbt build --select state:modified+

For production, most teams deploy through dbt Cloud, which provides a scheduler, a browser-based IDE, CI/CD integration, and a hosted documentation site. Teams that prefer self-hosting wire dbt jobs into Airflow, Prefect, or Dagster.

Documentation and Lineage

Generate and serve docs locally:

dbt docs generate
dbt docs serve

This opens a browser with a searchable catalog of every model, source, column, and test — plus an interactive lineage graph. In dbt Cloud, this is hosted automatically and stays current with every deployment.

Key dbt Concepts at a Glance

Concept	What It Does
Model	A `.sql` file that defines one transformation
Source	A reference to a raw table in your warehouse
Ref	Links one model to another, builds the DAG
Test	Validates data quality (unique, not_null, custom SQL)
Seed	A static CSV file loaded into the warehouse
Snapshot	Tracks row-level changes over time (SCD Type 2)
Macro	Reusable Jinja code block
Materialization	How a model is stored: table, view, incremental, ephemeral

What Changed in 2025-2026

dbt has moved quickly over the past year. A few things worth knowing:

dbt Fusion — A new Rust-based compiler introduced in 2025 that dramatically speeds up project compilation. Large projects that took minutes now compile in seconds.

Python models — dbt now supports .py model files for transformations that are awkward in SQL — think ML feature engineering, pandas-style reshaping, or calling external APIs mid-pipeline.

Unit testing — Formally added to the dbt spec, letting you write deterministic tests for model logic without touching production data.

MetricFlow integration — Semantic layer definitions now live inside dbt projects, letting BI tools query metrics consistently regardless of which tool is used.

dbt Mesh — Cross-project references that let large organizations split a monolithic dbt project into domain-owned sub-projects that still share data contracts.

dbt has become the default choice for transformation work in modern data stacks because it applies software engineering practices to SQL — version control, testing, documentation, and CI/CD — without requiring analytics engineers to learn a new language. If your team is still managing transformation logic in ad hoc scripts or stored procedures, dbt is worth evaluating seriously.