How dbt Works Under the Hood

dbt looks simple on the surface: you write SQL files, run a command, and tables appear in your warehouse. But a fair amount of machinery runs between those two steps. Understanding what dbt actually does when you run dbt run makes you a much better user of it — and helps you debug things when they go wrong.

The Big Picture

dbt is a transformation framework that sits between your raw data layer and your analytics layer. It does not move data in or out of your warehouse. Everything it does happens inside the warehouse you connect it to.

What dbt does NOT do         What dbt DOES do
------------------------     --------------------------------
Extract data from sources    Compile SQL models
Load raw data                Build a dependency graph (DAG)
Store data externally        Execute SQL in correct order
Replace your warehouse       Test model outputs
                             Generate documentation
                             Track lineage

When you run dbt run, dbt reads your .sql model files, resolves all the {{ ref() }} and {{ source() }} references, compiles the Jinja templates into plain SQL, and then sends that SQL to your warehouse to execute. The warehouse does the actual compute work.

Core Components

Models

A model is a .sql file in your models/ directory. Each file defines one transformation. dbt materializes each model as a table or view in your warehouse (depending on the materialization setting).

-- models/staging/stg_customers.sql
with source as (
    select * from {{ source('crm', 'customers') }}
),

renamed as (
    select
        customer_id,
        lower(trim(email))         as email,
        first_name || ' ' || last_name  as full_name,
        cast(created_at as date)   as signup_date
    from source
    where customer_id is not null
)

select * from renamed

The {{ source() }} call tells dbt this model reads from a raw source table. The {{ ref() }} call (used in downstream models) tells dbt this model depends on another model.

Sources

Sources define the raw tables that your models read from. They are declared in YAML files:

version: 2

sources:
  - name: crm
    schema: raw
    tables:
      - name: customers
      - name: accounts

Declaring sources lets dbt track lineage from raw tables all the way to final marts, and enables freshness checks to detect stale data.

Seeds

Seeds are CSV files stored in the seeds/ directory. dbt loads them into your warehouse as tables. They are useful for reference data that changes infrequently — country codes, product categories, status mappings.

seeds/
├── country_codes.csv
└── product_categories.csv

Load them with:

dbt seed

Snapshots

Snapshots track how a row changes over time. If a customer’s status changes from active to churned, a snapshot captures both the old and new states with timestamps. This implements SCD Type 2 (Slowly Changing Dimension) without custom code.

-- snapshots/customer_status_snapshot.sql
{% snapshot customer_status_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['status']
    )
}}

select * from {{ source('crm', 'customers') }}

{% endsnapshot %}

Tests

dbt has two test types. Generic tests are declared in YAML and apply common validations:

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: email
        tests:
          - not_null

Singular tests are .sql files in the tests/ directory that return rows only when something is wrong. If the query returns zero rows, the test passes.

-- tests/assert_positive_revenue.sql
select order_id
from {{ ref('fct_orders') }}
where order_amount_usd < 0

How dbt Builds the DAG

The dependency graph — the DAG — is the most important internal concept in dbt. It determines which models run before others and enables parallel execution.

dbt builds the DAG automatically by analyzing your ref() and source() calls. You never define the order explicitly — dbt figures it out from the code.

Example DAG
-----------

[source: raw.orders] ---------> [stg_orders]
                                      |
[source: raw.customers] --> [stg_customers]
                                      |
                               [fct_orders]
                                      |
                            [monthly_revenue]
                                      |
                             [exec_dashboard]

dbt executes models at the same DAG level in parallel (up to your configured thread count), and waits for all dependencies to complete before running each downstream model.

What Happens When You Run `dbt run`

Here is what dbt does step by step:

1. Parse project
   - Read dbt_project.yml
   - Discover all .sql and .yml files in models/

2. Compile Jinja
   - Replace {{ source() }} with actual schema.table references
   - Replace {{ ref() }} with the correct warehouse object names
   - Apply config() blocks (materialization, schema overrides, etc.)

3. Build the DAG
   - Analyze all ref() and source() calls
   - Determine execution order

4. Execute SQL
   - For each model, run the compiled SQL against the warehouse
   - Create or replace the table/view as specified by materialization

5. Write artifacts
   - manifest.json (DAG + metadata)
   - run_results.json (timing, success/failure per model)
   - catalog.json (after dbt docs generate)

Materializations: How Models Are Stored

Materialization controls how dbt writes the output of each model to your warehouse.

Materialization	What it creates	When to use it
`view`	A database view (no data stored)	Lightweight models, rarely queried
`table`	A full table rebuilt each run	Most staging and mart models
`incremental`	Appends/merges only new rows	Large fact tables, event data
`ephemeral`	A CTE, not stored at all	Reusable SQL snippets

Set materialization in the model file or in dbt_project.yml:

-- at the top of a model file
{{ config(materialized='incremental', unique_key='order_id') }}

Or in dbt_project.yml for an entire folder:

models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table

The YAML Layer

YAML files in dbt serve three purposes: declaring sources, documenting models and columns, and attaching tests. A well-written YAML file for a model looks like this:

version: 2

models:
  - name: stg_orders
    description: "Cleaned and standardized order records from the raw layer"
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique
          - not_null
      - name: status
        description: "Order lifecycle status"
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'completed', 'cancelled']

This YAML drives both the test suite and the auto-generated documentation site.

dbt CLI Commands Reference

dbt init          Create a new project
dbt debug         Test your warehouse connection
dbt deps          Install packages from packages.yml
dbt seed          Load CSV files to the warehouse
dbt compile       Compile Jinja SQL without executing
dbt run           Execute all models
dbt test          Run all data tests
dbt build         Run models + seeds + snapshots + tests together
dbt source        Check source freshness
dbt docs generate Build the documentation catalog
dbt docs serve    Open the documentation site locally
dbt snapshot      Run snapshot models
dbt list          List resources in the project
dbt parse         Parse and validate the project

Jinja Templating Inside SQL

dbt uses Jinja2 as a templating layer on top of SQL. Beyond ref() and source(), Jinja unlocks logic that plain SQL cannot express:

-- Dynamic column selection based on environment
select
    order_id,
    customer_id,
    {% if target.name == 'prod' %}
        order_amount_usd,
        customer_pii_hash
    {% else %}
        order_amount_usd,
        'REDACTED' as customer_pii_hash
    {% endif %}
from {{ ref('stg_orders') }}

Macros extend this further — you can write reusable Jinja functions in macros/ and call them across models:

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::numeric(10,2)
{% endmacro %}

-- used in a model:
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as order_amount_usd
from {{ ref('stg_orders') }}

2025-2026 Changes to How dbt Works

dbt Fusion compiler — A new Rust-based compiler released in 2025 that replaces the Python-based compilation step. Large projects that took 30-60 seconds to compile now compile in under 5 seconds. The interface is unchanged; the speed improvement is automatic.

Unit tests — dbt Core 1.8 introduced unit tests that let you define mocked inputs and expected outputs for a model. The test runs against synthetic data, not production data, so you can validate transformation logic before deploying.

unit_tests:
  - name: test_status_normalization
    model: stg_orders
    given:
      - input: source('ecommerce', 'orders')
        rows:
          - {order_id: 1, status: 'COMPLETED', amount: 50}
    expect:
      rows:
          - {order_id: 1, status: 'completed', order_amount_usd: 50}

Python models — For transformations that are awkward in SQL, dbt supports .py model files that run inside Snowpark (Snowflake) or BigQuery DataFrames. The model returns a DataFrame, which dbt materializes like any SQL model.

dbt Mesh — Large organizations can now split one monolithic dbt project into multiple interconnected projects. Each team owns their models and publishes stable public contracts. Other teams reference these via cross-project ref() calls.

Understanding how dbt works internally — the DAG, the compilation step, the materialization options, the YAML layer — is what separates teams that just use dbt from teams that use it well. Once you see the machinery, you can make better decisions about model structure, materialization strategy, and test coverage.