What is dbt?
If you have spent any time around modern data teams, you have heard dbt mentioned constantly. It shows up in job descriptions, conference talks, and nearly every conversation about the “modern data stack.” So what actually is it, and why does it matter?
dbt (Data Build Tool) is an open-source framework that lets analytics engineers and data engineers transform data that is already in a warehouse using SQL. That is the whole thing. It does not extract data from source systems, it does not load data into a warehouse, and it does not visualize anything. It handles one job: turning raw data into clean, tested, documented tables that analytics teams can actually use.
Where dbt Fits in the Data Stack
Modern data pipelines follow an ELT pattern: Extract, Load, Transform. Ingestion tools like Fivetran and Airbyte handle the Extract and Load steps. dbt handles the Transform.
ELT Stack---------
[Source Systems] CRMs, DBs, APIs, Files | v[Ingestion Tools] Fivetran / Airbyte / custom loaders | v[Data Warehouse] Snowflake / BigQuery / Redshift / DuckDB | v[dbt -- the T in ELT] SQL models / tests / docs / lineage | v[BI and Analytics] Tableau / Looker / Power BI / NotebooksThe reason this arrangement works well is that cloud warehouses have cheap storage and powerful compute. Storing raw data costs almost nothing, and running SQL transformations at scale is exactly what these platforms are optimized for. dbt takes advantage of both.
Core Concepts
Models
A model in dbt is a .sql file. Each file defines one transformation. When you run dbt run, dbt compiles the SQL and creates a table or view in your warehouse for each model.
Here is a simple model that cleans raw order data:
-- models/staging/stg_orders.sql
with source as ( select * from {{ source('raw', 'orders') }}),
cleaned as ( select order_id, customer_id, cast(order_date as date) as order_date, lower(trim(status)) as status, amount_cents / 100.0 as order_amount_usd from source where order_id is not null)
select * from cleanedRun it:
dbt run --select stg_ordersdbt creates a view or table called stg_orders in your warehouse.
The ref() Function
Models reference each other using {{ ref('model_name') }}. This is how dbt builds its dependency graph.
-- models/marts/fct_revenue.sql
with orders as ( select * from {{ ref('stg_orders') }} where status = 'completed')
select date_trunc('month', order_date) as revenue_month, sum(order_amount_usd) as total_revenuefrom ordersgroup by 1Because fct_revenue references stg_orders via ref(), dbt knows it must run stg_orders first. You never manage execution order manually — dbt handles it from the dependency graph.
Sources
Sources tell dbt where your raw data lives. You define them in YAML:
version: 2
sources: - name: raw schema: raw_data tables: - name: orders description: "Raw order records loaded by Fivetran" - name: customers description: "Customer records from the CRM"In models, you reference sources with {{ source('raw', 'orders') }}. This creates a tracked lineage connection from raw tables all the way to final marts.
Tests
dbt has a built-in testing framework. You define tests in YAML alongside your models:
version: 2
models: - name: stg_orders columns: - name: order_id tests: - unique - not_null - name: status tests: - accepted_values: values: ['pending', 'processing', 'completed', 'cancelled', 'refunded'] - name: customer_id tests: - not_null - relationships: to: ref('stg_customers') field: customer_idRun all tests:
dbt testTests that fail show up in the run output with details on which rows violated which constraint. You can plug test results into alerting systems to catch data quality issues before they reach dashboards.
Macros
Macros are reusable Jinja functions stored in the macros/ folder. They eliminate repetitive SQL patterns:
-- macros/safe_divide.sql{% macro safe_divide(numerator, denominator) %} case when {{ denominator }} = 0 or {{ denominator }} is null then null else {{ numerator }} / {{ denominator }}::float end{% endmacro %}Use it in any model:
select product_id, {{ safe_divide('total_revenue', 'total_orders') }} as avg_order_valuefrom {{ ref('fct_product_revenue') }}Snapshots
Snapshots track how rows change over time. If a customer’s status changes from active to churned, a snapshot captures both versions with timestamps — this is SCD Type 2 tracking without writing custom history management code.
-- snapshots/customer_snapshot.sql{% snapshot customer_snapshot %}
{{ config( target_schema='snapshots', unique_key='customer_id', strategy='timestamp', updated_at='updated_at' )}}
select * from {{ source('raw', 'customers') }}
{% endsnapshot %}The DAG: dbt’s Execution Graph
dbt builds a Directed Acyclic Graph (DAG) from your ref() and source() calls. The DAG shows how every piece of data flows from raw sources to final tables.
Example DAG-----------
[source: raw.orders] --------> [stg_orders] |[source: raw.customers] --> [stg_customers] | [fct_orders] ----------> [monthly_revenue] | [customer_ltv]When you run dbt run, dbt walks this graph and executes models in the correct order — running independent models in parallel up to your configured thread limit.
You can view the DAG interactively after running:
dbt docs generatedbt docs servedbt Core vs dbt Cloud
dbt Core is the open-source CLI tool. You install it locally, configure your warehouse connection in profiles.yml, and run everything from the command line. It is free and works with any warehouse that has a dbt adapter.
dbt Cloud is a hosted platform built on top of dbt Core. It adds:
- A browser-based IDE for writing and testing models
- A job scheduler (with cron-based scheduling and CI triggers)
- Hosted documentation site that updates after every run
- Collaboration features for teams
- Semantic layer integration for metric definitions
Most individual contributors and small teams start with dbt Core. Larger teams or organizations that want managed infrastructure typically use dbt Cloud.
A Simple End-to-End Workflow
Here is what a typical day looks like using dbt:
1. Source data arrives in warehouse (loaded by Fivetran overnight)
2. Engineer writes or updates a model in models/staging/stg_products.sql
3. Test it locally: dbt run --select stg_products dbt test --select stg_products
4. Commit to git and open a pull request
5. CI pipeline runs dbt build on changed models against a dev schema: dbt build --select state:modified+
6. PR gets reviewed, merged, deployed to production
7. Production job runs on schedule: dbt build (runs everything, in dependency order)
8. If tests fail, alerts go to SlackWhat dbt Is Not
A common source of confusion: dbt does not replace your warehouse or your ingestion tools. It does not run Python code for ETL jobs (though it does support Python model files for transformations). It does not store data outside your warehouse. It does not connect directly to source systems.
dbt is a transformation framework. Its job is to make the SQL that lives between your raw data and your analytics layer maintainable, testable, and documented.
dbt in 2025-2026: What’s New
The tool has evolved significantly over the past couple of years:
dbt Fusion — A Rust-based compiler that replaced the Python compiler in 2025. Projects that used to take a minute to compile now compile in a few seconds. No changes needed on your end.
Unit tests — Formally introduced in dbt Core 1.8. You can now test transformation logic using mocked inputs without touching production data — the same way developers write unit tests for application code.
dbt Mesh — Designed for large organizations with multiple teams. Instead of one giant dbt project, teams own their own sub-projects and publish stable data contracts. Other teams reference these via cross-project ref() calls.
MetricFlow integration — Define business metrics (revenue, churn rate, customer LTV) once in dbt, and any BI tool that supports the semantic layer can query them consistently. No more “my revenue number doesn’t match yours.”
Python models — For transformations that are awkward in SQL — ML feature pipelines, calling external APIs, complex array manipulations — dbt now supports .py model files that run inside Snowpark (Snowflake) or BigQuery DataFrames.
Why Data Teams Adopt dbt
Before dbt, transformation logic was scattered: some in stored procedures, some in Jupyter notebooks, some in Python scripts with unclear provenance. Nobody could answer “where did this number come from?” without spending hours tracing things manually.
dbt centralizes all transformation logic in one place, makes it version-controlled, adds a testing layer, generates documentation, and produces a visual lineage graph. The SQL itself is not revolutionary — it is the structure around the SQL that matters.
Teams that adopt dbt consistently report faster iteration cycles, fewer data quality incidents, and easier onboarding for new team members because the project structure is predictable and the documentation is automatic.
If you know SQL and work with a cloud data warehouse, dbt is worth learning. The core concepts — models, refs, sources, tests — take an afternoon to understand. Using them well in production takes longer, but the fundamentals are genuinely accessible.