How dbt Works Under the Hood
dbt looks simple on the surface: you write SQL files, run a command, and tables appear in your warehouse. But a fair amount of machinery runs between those two steps. Understanding what dbt actually does when you run dbt run makes you a much better user of it β and helps you debug things when they go wrong.
The Big Picture
dbt is a transformation framework that sits between your raw data layer and your analytics layer. It does not move data in or out of your warehouse. Everything it does happens inside the warehouse you connect it to.
What dbt does NOT do What dbt DOES do------------------------ --------------------------------Extract data from sources Compile SQL modelsLoad raw data Build a dependency graph (DAG)Store data externally Execute SQL in correct orderReplace your warehouse Test model outputs Generate documentation Track lineageWhen you run dbt run, dbt reads your .sql model files, resolves all the {{ ref() }} and {{ source() }} references, compiles the Jinja templates into plain SQL, and then sends that SQL to your warehouse to execute. The warehouse does the actual compute work.
Core Components
Models
A model is a .sql file in your models/ directory. Each file defines one transformation. dbt materializes each model as a table or view in your warehouse (depending on the materialization setting).
-- models/staging/stg_customers.sqlwith source as ( select * from {{ source('crm', 'customers') }}),
renamed as ( select customer_id, lower(trim(email)) as email, first_name || ' ' || last_name as full_name, cast(created_at as date) as signup_date from source where customer_id is not null)
select * from renamedThe {{ source() }} call tells dbt this model reads from a raw source table. The {{ ref() }} call (used in downstream models) tells dbt this model depends on another model.
Sources
Sources define the raw tables that your models read from. They are declared in YAML files:
version: 2
sources: - name: crm schema: raw tables: - name: customers - name: accountsDeclaring sources lets dbt track lineage from raw tables all the way to final marts, and enables freshness checks to detect stale data.
Seeds
Seeds are CSV files stored in the seeds/ directory. dbt loads them into your warehouse as tables. They are useful for reference data that changes infrequently β country codes, product categories, status mappings.
seeds/βββ country_codes.csvβββ product_categories.csvLoad them with:
dbt seedSnapshots
Snapshots track how a row changes over time. If a customerβs status changes from active to churned, a snapshot captures both the old and new states with timestamps. This implements SCD Type 2 (Slowly Changing Dimension) without custom code.
-- snapshots/customer_status_snapshot.sql{% snapshot customer_status_snapshot %}
{{ config( target_schema='snapshots', unique_key='customer_id', strategy='check', check_cols=['status'] )}}
select * from {{ source('crm', 'customers') }}
{% endsnapshot %}Tests
dbt has two test types. Generic tests are declared in YAML and apply common validations:
models: - name: stg_customers columns: - name: customer_id tests: - unique - not_null - name: email tests: - not_nullSingular tests are .sql files in the tests/ directory that return rows only when something is wrong. If the query returns zero rows, the test passes.
-- tests/assert_positive_revenue.sqlselect order_idfrom {{ ref('fct_orders') }}where order_amount_usd < 0How dbt Builds the DAG
The dependency graph β the DAG β is the most important internal concept in dbt. It determines which models run before others and enables parallel execution.
dbt builds the DAG automatically by analyzing your ref() and source() calls. You never define the order explicitly β dbt figures it out from the code.
Example DAG-----------
[source: raw.orders] ---------> [stg_orders] |[source: raw.customers] --> [stg_customers] | [fct_orders] | [monthly_revenue] | [exec_dashboard]dbt executes models at the same DAG level in parallel (up to your configured thread count), and waits for all dependencies to complete before running each downstream model.
What Happens When You Run dbt run
Here is what dbt does step by step:
1. Parse project - Read dbt_project.yml - Discover all .sql and .yml files in models/
2. Compile Jinja - Replace {{ source() }} with actual schema.table references - Replace {{ ref() }} with the correct warehouse object names - Apply config() blocks (materialization, schema overrides, etc.)
3. Build the DAG - Analyze all ref() and source() calls - Determine execution order
4. Execute SQL - For each model, run the compiled SQL against the warehouse - Create or replace the table/view as specified by materialization
5. Write artifacts - manifest.json (DAG + metadata) - run_results.json (timing, success/failure per model) - catalog.json (after dbt docs generate)Materializations: How Models Are Stored
Materialization controls how dbt writes the output of each model to your warehouse.
| Materialization | What it creates | When to use it |
|---|---|---|
view | A database view (no data stored) | Lightweight models, rarely queried |
table | A full table rebuilt each run | Most staging and mart models |
incremental | Appends/merges only new rows | Large fact tables, event data |
ephemeral | A CTE, not stored at all | Reusable SQL snippets |
Set materialization in the model file or in dbt_project.yml:
-- at the top of a model file{{ config(materialized='incremental', unique_key='order_id') }}Or in dbt_project.yml for an entire folder:
models: my_project: staging: +materialized: view marts: +materialized: tableThe YAML Layer
YAML files in dbt serve three purposes: declaring sources, documenting models and columns, and attaching tests. A well-written YAML file for a model looks like this:
version: 2
models: - name: stg_orders description: "Cleaned and standardized order records from the raw layer" columns: - name: order_id description: "Unique identifier for each order" tests: - unique - not_null - name: status description: "Order lifecycle status" tests: - accepted_values: values: ['pending', 'processing', 'completed', 'cancelled']This YAML drives both the test suite and the auto-generated documentation site.
dbt CLI Commands Reference
dbt init Create a new projectdbt debug Test your warehouse connectiondbt deps Install packages from packages.ymldbt seed Load CSV files to the warehousedbt compile Compile Jinja SQL without executingdbt run Execute all modelsdbt test Run all data testsdbt build Run models + seeds + snapshots + tests togetherdbt source Check source freshnessdbt docs generate Build the documentation catalogdbt docs serve Open the documentation site locallydbt snapshot Run snapshot modelsdbt list List resources in the projectdbt parse Parse and validate the projectJinja Templating Inside SQL
dbt uses Jinja2 as a templating layer on top of SQL. Beyond ref() and source(), Jinja unlocks logic that plain SQL cannot express:
-- Dynamic column selection based on environmentselect order_id, customer_id, {% if target.name == 'prod' %} order_amount_usd, customer_pii_hash {% else %} order_amount_usd, 'REDACTED' as customer_pii_hash {% endif %}from {{ ref('stg_orders') }}Macros extend this further β you can write reusable Jinja functions in macros/ and call them across models:
-- macros/cents_to_dollars.sql{% macro cents_to_dollars(column_name) %} ({{ column_name }} / 100.0)::numeric(10,2){% endmacro %}
-- used in a model:select order_id, {{ cents_to_dollars('amount_cents') }} as order_amount_usdfrom {{ ref('stg_orders') }}2025-2026 Changes to How dbt Works
dbt Fusion compiler β A new Rust-based compiler released in 2025 that replaces the Python-based compilation step. Large projects that took 30-60 seconds to compile now compile in under 5 seconds. The interface is unchanged; the speed improvement is automatic.
Unit tests β dbt Core 1.8 introduced unit tests that let you define mocked inputs and expected outputs for a model. The test runs against synthetic data, not production data, so you can validate transformation logic before deploying.
unit_tests: - name: test_status_normalization model: stg_orders given: - input: source('ecommerce', 'orders') rows: - {order_id: 1, status: 'COMPLETED', amount: 50} expect: rows: - {order_id: 1, status: 'completed', order_amount_usd: 50}Python models β For transformations that are awkward in SQL, dbt supports .py model files that run inside Snowpark (Snowflake) or BigQuery DataFrames. The model returns a DataFrame, which dbt materializes like any SQL model.
dbt Mesh β Large organizations can now split one monolithic dbt project into multiple interconnected projects. Each team owns their models and publishes stable public contracts. Other teams reference these via cross-project ref() calls.
Understanding how dbt works internally β the DAG, the compilation step, the materialization options, the YAML layer β is what separates teams that just use dbt from teams that use it well. Once you see the machinery, you can make better decisions about model structure, materialization strategy, and test coverage.