What is dbt?

If you have spent any time around modern data teams, you have heard dbt mentioned constantly. It shows up in job descriptions, conference talks, and nearly every conversation about the “modern data stack.” So what actually is it, and why does it matter?

dbt (Data Build Tool) is an open-source framework that lets analytics engineers and data engineers transform data that is already in a warehouse using SQL. That is the whole thing. It does not extract data from source systems, it does not load data into a warehouse, and it does not visualize anything. It handles one job: turning raw data into clean, tested, documented tables that analytics teams can actually use.

Where dbt Fits in the Data Stack

Modern data pipelines follow an ELT pattern: Extract, Load, Transform. Ingestion tools like Fivetran and Airbyte handle the Extract and Load steps. dbt handles the Transform.

ELT Stack
---------

[Source Systems]
  CRMs, DBs, APIs, Files
         |
         v
[Ingestion Tools]
  Fivetran / Airbyte / custom loaders
         |
         v
[Data Warehouse]
  Snowflake / BigQuery / Redshift / DuckDB
         |
         v
[dbt -- the T in ELT]
  SQL models / tests / docs / lineage
         |
         v
[BI and Analytics]
  Tableau / Looker / Power BI / Notebooks

The reason this arrangement works well is that cloud warehouses have cheap storage and powerful compute. Storing raw data costs almost nothing, and running SQL transformations at scale is exactly what these platforms are optimized for. dbt takes advantage of both.

Core Concepts

Models

A model in dbt is a .sql file. Each file defines one transformation. When you run dbt run, dbt compiles the SQL and creates a table or view in your warehouse for each model.

Here is a simple model that cleans raw order data:

-- models/staging/stg_orders.sql

with source as (
    select * from {{ source('raw', 'orders') }}
),

cleaned as (
    select
        order_id,
        customer_id,
        cast(order_date as date)    as order_date,
        lower(trim(status))         as status,
        amount_cents / 100.0        as order_amount_usd
    from source
    where order_id is not null
)

select * from cleaned

Run it:

dbt run --select stg_orders

dbt creates a view or table called stg_orders in your warehouse.

The ref() Function

Models reference each other using {{ ref('model_name') }}. This is how dbt builds its dependency graph.

-- models/marts/fct_revenue.sql

with orders as (
    select * from {{ ref('stg_orders') }}
    where status = 'completed'
)

select
    date_trunc('month', order_date) as revenue_month,
    sum(order_amount_usd)           as total_revenue
from orders
group by 1

Because fct_revenue references stg_orders via ref(), dbt knows it must run stg_orders first. You never manage execution order manually — dbt handles it from the dependency graph.

Sources

Sources tell dbt where your raw data lives. You define them in YAML:

version: 2

sources:
  - name: raw
    schema: raw_data
    tables:
      - name: orders
        description: "Raw order records loaded by Fivetran"
      - name: customers
        description: "Customer records from the CRM"

In models, you reference sources with {{ source('raw', 'orders') }}. This creates a tracked lineage connection from raw tables all the way to final marts.

Tests

dbt has a built-in testing framework. You define tests in YAML alongside your models:

version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'processing', 'completed', 'cancelled', 'refunded']
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id

Run all tests:

dbt test

Tests that fail show up in the run output with details on which rows violated which constraint. You can plug test results into alerting systems to catch data quality issues before they reach dashboards.

Macros

Macros are reusable Jinja functions stored in the macros/ folder. They eliminate repetitive SQL patterns:

-- macros/safe_divide.sql
{% macro safe_divide(numerator, denominator) %}
    case
        when {{ denominator }} = 0 or {{ denominator }} is null then null
        else {{ numerator }} / {{ denominator }}::float
    end
{% endmacro %}

Use it in any model:

select
    product_id,
    {{ safe_divide('total_revenue', 'total_orders') }} as avg_order_value
from {{ ref('fct_product_revenue') }}

Snapshots

Snapshots track how rows change over time. If a customer’s status changes from active to churned, a snapshot captures both versions with timestamps — this is SCD Type 2 tracking without writing custom history management code.

-- snapshots/customer_snapshot.sql
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}

The DAG: dbt’s Execution Graph

dbt builds a Directed Acyclic Graph (DAG) from your ref() and source() calls. The DAG shows how every piece of data flows from raw sources to final tables.

Example DAG
-----------

[source: raw.orders] --------> [stg_orders]
                                     |
[source: raw.customers] --> [stg_customers]
                                     |
                              [fct_orders] ----------> [monthly_revenue]
                                     |
                              [customer_ltv]

When you run dbt run, dbt walks this graph and executes models in the correct order — running independent models in parallel up to your configured thread limit.

You can view the DAG interactively after running:

dbt docs generate
dbt docs serve

dbt Core vs dbt Cloud

dbt Core is the open-source CLI tool. You install it locally, configure your warehouse connection in profiles.yml, and run everything from the command line. It is free and works with any warehouse that has a dbt adapter.

dbt Cloud is a hosted platform built on top of dbt Core. It adds:

A browser-based IDE for writing and testing models
A job scheduler (with cron-based scheduling and CI triggers)
Hosted documentation site that updates after every run
Collaboration features for teams
Semantic layer integration for metric definitions

Most individual contributors and small teams start with dbt Core. Larger teams or organizations that want managed infrastructure typically use dbt Cloud.

A Simple End-to-End Workflow

Here is what a typical day looks like using dbt:

1. Source data arrives in warehouse (loaded by Fivetran overnight)

2. Engineer writes or updates a model in models/staging/stg_products.sql

3. Test it locally:
   dbt run --select stg_products
   dbt test --select stg_products

4. Commit to git and open a pull request

5. CI pipeline runs dbt build on changed models against a dev schema:
   dbt build --select state:modified+

6. PR gets reviewed, merged, deployed to production

7. Production job runs on schedule:
   dbt build  (runs everything, in dependency order)

8. If tests fail, alerts go to Slack

What dbt Is Not

A common source of confusion: dbt does not replace your warehouse or your ingestion tools. It does not run Python code for ETL jobs (though it does support Python model files for transformations). It does not store data outside your warehouse. It does not connect directly to source systems.

dbt is a transformation framework. Its job is to make the SQL that lives between your raw data and your analytics layer maintainable, testable, and documented.

dbt in 2025-2026: What’s New

The tool has evolved significantly over the past couple of years:

dbt Fusion — A Rust-based compiler that replaced the Python compiler in 2025. Projects that used to take a minute to compile now compile in a few seconds. No changes needed on your end.

Unit tests — Formally introduced in dbt Core 1.8. You can now test transformation logic using mocked inputs without touching production data — the same way developers write unit tests for application code.

dbt Mesh — Designed for large organizations with multiple teams. Instead of one giant dbt project, teams own their own sub-projects and publish stable data contracts. Other teams reference these via cross-project ref() calls.

MetricFlow integration — Define business metrics (revenue, churn rate, customer LTV) once in dbt, and any BI tool that supports the semantic layer can query them consistently. No more “my revenue number doesn’t match yours.”

Python models — For transformations that are awkward in SQL — ML feature pipelines, calling external APIs, complex array manipulations — dbt now supports .py model files that run inside Snowpark (Snowflake) or BigQuery DataFrames.

Why Data Teams Adopt dbt

Before dbt, transformation logic was scattered: some in stored procedures, some in Jupyter notebooks, some in Python scripts with unclear provenance. Nobody could answer “where did this number come from?” without spending hours tracing things manually.

dbt centralizes all transformation logic in one place, makes it version-controlled, adds a testing layer, generates documentation, and produces a visual lineage graph. The SQL itself is not revolutionary — it is the structure around the SQL that matters.

Teams that adopt dbt consistently report faster iteration cycles, fewer data quality incidents, and easier onboarding for new team members because the project structure is predictable and the documentation is automatic.

If you know SQL and work with a cloud data warehouse, dbt is worth learning. The core concepts — models, refs, sources, tests — take an afternoon to understand. Using them well in production takes longer, but the fundamentals are genuinely accessible.