Data Engineering  /  dbt

πŸ”„ dbt β€” Data Build Tool 23 guides Β· updated 2026

Analytics engineering with SQL β€” models, tests, sources, and Jinja macros that turn raw warehouse tables into trustworthy, documented data products.

How dbt Works Under the Hood

dbt looks simple on the surface: you write SQL files, run a command, and tables appear in your warehouse. But a fair amount of machinery runs between those two steps. Understanding what dbt actually does when you run dbt run makes you a much better user of it β€” and helps you debug things when they go wrong.


The Big Picture

dbt is a transformation framework that sits between your raw data layer and your analytics layer. It does not move data in or out of your warehouse. Everything it does happens inside the warehouse you connect it to.

What dbt does NOT do What dbt DOES do
------------------------ --------------------------------
Extract data from sources Compile SQL models
Load raw data Build a dependency graph (DAG)
Store data externally Execute SQL in correct order
Replace your warehouse Test model outputs
Generate documentation
Track lineage

When you run dbt run, dbt reads your .sql model files, resolves all the {{ ref() }} and {{ source() }} references, compiles the Jinja templates into plain SQL, and then sends that SQL to your warehouse to execute. The warehouse does the actual compute work.


Core Components

Models

A model is a .sql file in your models/ directory. Each file defines one transformation. dbt materializes each model as a table or view in your warehouse (depending on the materialization setting).

-- models/staging/stg_customers.sql
with source as (
select * from {{ source('crm', 'customers') }}
),
renamed as (
select
customer_id,
lower(trim(email)) as email,
first_name || ' ' || last_name as full_name,
cast(created_at as date) as signup_date
from source
where customer_id is not null
)
select * from renamed

The {{ source() }} call tells dbt this model reads from a raw source table. The {{ ref() }} call (used in downstream models) tells dbt this model depends on another model.

Sources

Sources define the raw tables that your models read from. They are declared in YAML files:

version: 2
sources:
- name: crm
schema: raw
tables:
- name: customers
- name: accounts

Declaring sources lets dbt track lineage from raw tables all the way to final marts, and enables freshness checks to detect stale data.

Seeds

Seeds are CSV files stored in the seeds/ directory. dbt loads them into your warehouse as tables. They are useful for reference data that changes infrequently β€” country codes, product categories, status mappings.

seeds/
β”œβ”€β”€ country_codes.csv
└── product_categories.csv

Load them with:

Terminal window
dbt seed

Snapshots

Snapshots track how a row changes over time. If a customer’s status changes from active to churned, a snapshot captures both the old and new states with timestamps. This implements SCD Type 2 (Slowly Changing Dimension) without custom code.

-- snapshots/customer_status_snapshot.sql
{% snapshot customer_status_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='check',
check_cols=['status']
)
}}
select * from {{ source('crm', 'customers') }}
{% endsnapshot %}

Tests

dbt has two test types. Generic tests are declared in YAML and apply common validations:

models:
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: email
tests:
- not_null

Singular tests are .sql files in the tests/ directory that return rows only when something is wrong. If the query returns zero rows, the test passes.

-- tests/assert_positive_revenue.sql
select order_id
from {{ ref('fct_orders') }}
where order_amount_usd < 0

How dbt Builds the DAG

The dependency graph β€” the DAG β€” is the most important internal concept in dbt. It determines which models run before others and enables parallel execution.

dbt builds the DAG automatically by analyzing your ref() and source() calls. You never define the order explicitly β€” dbt figures it out from the code.

Example DAG
-----------
[source: raw.orders] ---------> [stg_orders]
|
[source: raw.customers] --> [stg_customers]
|
[fct_orders]
|
[monthly_revenue]
|
[exec_dashboard]

dbt executes models at the same DAG level in parallel (up to your configured thread count), and waits for all dependencies to complete before running each downstream model.


What Happens When You Run dbt run

Here is what dbt does step by step:

1. Parse project
- Read dbt_project.yml
- Discover all .sql and .yml files in models/
2. Compile Jinja
- Replace {{ source() }} with actual schema.table references
- Replace {{ ref() }} with the correct warehouse object names
- Apply config() blocks (materialization, schema overrides, etc.)
3. Build the DAG
- Analyze all ref() and source() calls
- Determine execution order
4. Execute SQL
- For each model, run the compiled SQL against the warehouse
- Create or replace the table/view as specified by materialization
5. Write artifacts
- manifest.json (DAG + metadata)
- run_results.json (timing, success/failure per model)
- catalog.json (after dbt docs generate)

Materializations: How Models Are Stored

Materialization controls how dbt writes the output of each model to your warehouse.

MaterializationWhat it createsWhen to use it
viewA database view (no data stored)Lightweight models, rarely queried
tableA full table rebuilt each runMost staging and mart models
incrementalAppends/merges only new rowsLarge fact tables, event data
ephemeralA CTE, not stored at allReusable SQL snippets

Set materialization in the model file or in dbt_project.yml:

-- at the top of a model file
{{ config(materialized='incremental', unique_key='order_id') }}

Or in dbt_project.yml for an entire folder:

models:
my_project:
staging:
+materialized: view
marts:
+materialized: table

The YAML Layer

YAML files in dbt serve three purposes: declaring sources, documenting models and columns, and attaching tests. A well-written YAML file for a model looks like this:

version: 2
models:
- name: stg_orders
description: "Cleaned and standardized order records from the raw layer"
columns:
- name: order_id
description: "Unique identifier for each order"
tests:
- unique
- not_null
- name: status
description: "Order lifecycle status"
tests:
- accepted_values:
values: ['pending', 'processing', 'completed', 'cancelled']

This YAML drives both the test suite and the auto-generated documentation site.


dbt CLI Commands Reference

dbt init Create a new project
dbt debug Test your warehouse connection
dbt deps Install packages from packages.yml
dbt seed Load CSV files to the warehouse
dbt compile Compile Jinja SQL without executing
dbt run Execute all models
dbt test Run all data tests
dbt build Run models + seeds + snapshots + tests together
dbt source Check source freshness
dbt docs generate Build the documentation catalog
dbt docs serve Open the documentation site locally
dbt snapshot Run snapshot models
dbt list List resources in the project
dbt parse Parse and validate the project

Jinja Templating Inside SQL

dbt uses Jinja2 as a templating layer on top of SQL. Beyond ref() and source(), Jinja unlocks logic that plain SQL cannot express:

-- Dynamic column selection based on environment
select
order_id,
customer_id,
{% if target.name == 'prod' %}
order_amount_usd,
customer_pii_hash
{% else %}
order_amount_usd,
'REDACTED' as customer_pii_hash
{% endif %}
from {{ ref('stg_orders') }}

Macros extend this further β€” you can write reusable Jinja functions in macros/ and call them across models:

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
({{ column_name }} / 100.0)::numeric(10,2)
{% endmacro %}
-- used in a model:
select
order_id,
{{ cents_to_dollars('amount_cents') }} as order_amount_usd
from {{ ref('stg_orders') }}

2025-2026 Changes to How dbt Works

dbt Fusion compiler β€” A new Rust-based compiler released in 2025 that replaces the Python-based compilation step. Large projects that took 30-60 seconds to compile now compile in under 5 seconds. The interface is unchanged; the speed improvement is automatic.

Unit tests β€” dbt Core 1.8 introduced unit tests that let you define mocked inputs and expected outputs for a model. The test runs against synthetic data, not production data, so you can validate transformation logic before deploying.

unit_tests:
- name: test_status_normalization
model: stg_orders
given:
- input: source('ecommerce', 'orders')
rows:
- {order_id: 1, status: 'COMPLETED', amount: 50}
expect:
rows:
- {order_id: 1, status: 'completed', order_amount_usd: 50}

Python models β€” For transformations that are awkward in SQL, dbt supports .py model files that run inside Snowpark (Snowflake) or BigQuery DataFrames. The model returns a DataFrame, which dbt materializes like any SQL model.

dbt Mesh β€” Large organizations can now split one monolithic dbt project into multiple interconnected projects. Each team owns their models and publishes stable public contracts. Other teams reference these via cross-project ref() calls.


Understanding how dbt works internally β€” the DAG, the compilation step, the materialization options, the YAML layer β€” is what separates teams that just use dbt from teams that use it well. Once you see the machinery, you can make better decisions about model structure, materialization strategy, and test coverage.