
How to Build a Local Analytics Pipeline with dbt and DuckDB

Arkzero Research · Mar 28, 2026 · 7 min read

Last updated Mar 28, 2026

dbt Core and DuckDB together let you build a complete SQL transformation pipeline that runs entirely on your laptop, with no server, no cloud account, and no infrastructure costs. You install two Python packages, point a profiles.yml file at a local database file, write SQL models, and run dbt run. This guide walks through the full setup from scratch, including a working model that transforms CSV data into a clean analytics table.

dbt and DuckDB together solve a problem many analysts face: you want structured SQL transformations with version control, tests, and documentation, but you do not want to set up a cloud data warehouse or manage a running database server. DuckDB runs as an in-process library with no daemon and no connection string. dbt handles the transformation layer. The two tools together deliver a production-style analytics pipeline that runs entirely from a terminal on your laptop.

Why DuckDB Does Not Need a Server

Most relational databases run as separate processes. You start a PostgreSQL server, open a connection over a TCP socket, and send queries. DuckDB works differently. It runs as a shared library loaded directly into whatever process invokes it, whether that is a Python script, a Jupyter notebook, or dbt. There is no server to start, no port to open, and no authentication to configure.

DuckDB stores data in a single file on your filesystem, similar to SQLite, but built for analytical workloads rather than transactional ones. It supports columnar storage, vectorized query execution, and multi-core parallelism out of the box. According to benchmarks on the DuckDB project site, the engine processes hundreds of millions of rows per second on a standard laptop for aggregation-heavy queries, outperforming row-oriented embedded databases like SQLite by a wide margin on GROUP BY and window function workloads.

Beyond raw speed, DuckDB can query CSV, Parquet, JSON, and Apache Arrow files directly without importing them into the database first. For exploratory analysis this matters: you can run a GROUP BY across a 500 MB CSV file in seconds, with no load step at all.

Installing dbt Core and the DuckDB Adapter

You need Python 3.9 or later. Create a virtual environment to isolate project dependencies:

python -m venv dbt-env
source dbt-env/bin/activate       # Windows: dbt-env\Scripts\activate

Install the required packages:

pip install dbt-core dbt-duckdb

The dbt-duckdb adapter bundles DuckDB itself, so no separate DuckDB installation is required. Confirm the setup:

dbt --version

As of early 2026, dbt-core 1.8.x and dbt-duckdb 1.8.x are the stable production releases. The adapter maps all of dbt's standard materialization strategies (table, view, incremental, ephemeral) directly to DuckDB SQL, so existing dbt knowledge transfers without changes.

Configuring Your Connection with profiles.yml

dbt reads connection configuration from a file at ~/.dbt/profiles.yml. Create or edit that file and add a profile for your project:

local_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /absolute/path/to/your/project/analytics.duckdb
      threads: 4

The path key tells DuckDB where to create or find the database file. Use an absolute path to avoid ambiguity when running dbt from different directories. The threads setting controls how many CPU cores DuckDB uses for parallel query execution. Four is a reasonable default on most laptops.

To use an in-memory database that does not write to disk, set path: ':memory:'. This is useful in CI environments where you want tests to run without leaving files behind.
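One way to wire that up is to keep a disk-backed dev target and an in-memory CI target side by side in the same profile. A sketch, with the ci target name as an illustrative choice:

```yaml
local_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /absolute/path/to/your/project/analytics.duckdb
      threads: 4
    ci:
      type: duckdb
      path: ':memory:'
      threads: 4
```

Select it at run time with dbt build --target ci; nothing is written to disk.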

Starting a dbt Project

Navigate to the directory where you want the project and run:

dbt init my_analytics

Select duckdb when prompted for the database adapter. dbt creates the following structure:

my_analytics/
├── dbt_project.yml
├── models/
│   └── example/
├── tests/
├── macros/
└── seeds/

Open dbt_project.yml and set the profile key to match the name you defined in profiles.yml:

name: 'my_analytics'
version: '1.0.0'
profile: 'local_analytics'

Verify the connection from inside the project directory:

dbt debug

A successful run prints each check in sequence and ends with "All checks passed!"

Loading Data with Seeds

The fastest way to get data into your local DuckDB database is through dbt seeds: CSV files that dbt loads directly as tables.

Place a file called orders.csv in the seeds/ directory with columns such as order_id, order_date, customer_id, and revenue. Then run:
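For example, a minimal seeds/orders.csv might look like this (the values are made up for illustration):

```csv
order_id,order_date,customer_id,revenue
1,2026-01-03,101,120.50
2,2026-01-17,102,89.99
3,2026-02-02,101,45.00
```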

dbt seed

dbt infers column types, creates a table in DuckDB, and loads the rows. You can now reference the table in any model using {{ ref('orders') }}.

For larger files or binary formats, DuckDB's file-reading functions work in any SQL query without seeding:

select * from read_parquet('/data/transactions.parquet') limit 100;
select * from read_csv_auto('/data/events.csv') where event_type = 'purchase';

This is useful during exploration before formalizing a data source into a dbt model.
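Once you do formalize a file-backed source, the same function can sit inside a model. A minimal sketch, reusing the Parquet path above and a hypothetical model name:

```sql
-- models/staging/stg_transactions.sql (hypothetical name)
{{ config(materialized='view') }}

select *
from read_parquet('/data/transactions.parquet')
```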

Writing Your First Model

Create models/marts/monthly_revenue.sql:

{{ config(materialized='table') }}

with orders as (
    select * from {{ ref('orders') }}
),

monthly as (
    select
        date_trunc('month', order_date) as month,
        sum(revenue)                    as total_revenue,
        count(order_id)                 as order_count
    from orders
    group by 1
)

select * from monthly order by month

The {{ ref('orders') }} call tells dbt this model depends on the orders seed. dbt builds the full dependency graph before running, so models always execute in the correct order. The config(materialized='table') directive persists the result as a physical table rather than a view, which speeds up downstream queries.

Run the model:

dbt run

dbt compiles the Jinja-templated SQL, executes it against your local DuckDB file, and reports row counts on success. The compiled SQL is stored in target/compiled/, which is useful for debugging when a model produces unexpected results.

For incrementally updated models that only process new rows, use materialized='incremental' with an is_incremental() filter. This is the standard pattern for large append-only tables like event logs or daily transaction records.
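A minimal sketch of that incremental pattern, with illustrative table and column names:

```sql
-- models/marts/daily_events.sql (hypothetical)
{{ config(materialized='incremental', unique_key='event_id') }}

select *
from {{ ref('raw_events') }}

{% if is_incremental() %}
  -- On incremental runs, process only rows newer than what is already loaded.
  where event_date > (select max(event_date) from {{ this }})
{% endif %}
```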

Testing Data Quality

dbt's generic tests validate your data without requiring you to write test SQL from scratch. Add a schema.yml file alongside your model:

models:
  - name: monthly_revenue
    columns:
      - name: month
        tests:
          - not_null
          - unique
      - name: total_revenue
        tests:
          - not_null

Run all tests:

dbt test

dbt generates the assertion SQL for each test, executes it against DuckDB, and reports pass or fail. Custom singular tests go in the tests/ directory as plain SQL files. A test fails if the query returns any rows, so you write a query that describes invalid data rather than valid data.
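For example, a singular test guarding against negative revenue could look like this (the file name is an illustrative choice):

```sql
-- tests/assert_no_negative_revenue.sql
-- The test fails if this query returns any rows.
select *
from {{ ref('monthly_revenue') }}
where total_revenue < 0
```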

To run seeds, models, and tests together in dependency order:

dbt build

This is the standard command for CI pipelines. If any test fails, dbt stops and reports the exact assertion that did not pass.

Generating Project Documentation

dbt generates a browsable documentation site with a lineage graph showing how every model connects to its dependencies:

dbt docs generate
dbt docs serve

A local web server opens at localhost:8080. You can browse each model's compiled SQL, view upstream and downstream dependencies, and see which tests apply to each column. For a solo analyst, this functions as a lightweight data catalog that documents itself as the project grows.

The lineage view is particularly useful when a seed or source table changes and you need to identify every downstream model that will be affected.

When This Stack Makes Sense

The dbt and DuckDB combination suits projects where data fits on one machine and the team wants full SQL control over transformations. DuckDB handles datasets up to several hundred gigabytes on a standard laptop, covering the analytical workloads of most small and medium operations without a cloud warehouse subscription or per-query costs.

The main limitation is that DuckDB does not support concurrent write access to the same database file from multiple processes, which makes it unsuitable for transactional applications. For analytical pipelines run by one person or a small team reading and transforming data, this is rarely a constraint in practice.

If the goal is faster iteration on questions rather than building a maintained transformation pipeline, writing SQL models adds overhead that may not be worth it. Tools like VSLZ let you upload a file and get aggregations, charts, and statistical summaries from a plain English prompt without configuring a profiles.yml or writing Jinja SQL. The dbt and DuckDB approach is the right choice when the transformation logic needs to be versioned, tested, and rerun reliably as source data changes over time.

FAQ

Do I need to install DuckDB separately to use dbt-duckdb?

No. The dbt-duckdb adapter package bundles DuckDB as a dependency. Running pip install dbt-duckdb installs both dbt and DuckDB in a single command. You do not need to download or configure DuckDB separately.

Can dbt with DuckDB handle production-scale data?

DuckDB can process datasets up to several hundred gigabytes on a standard laptop or server, which covers most small-to-medium analytical workloads. For very large datasets or multi-user concurrent write scenarios, a cloud warehouse like BigQuery or Snowflake is more appropriate. DuckDB also integrates with MotherDuck for cloud-based deployments that use the same dbt-duckdb adapter.

How does dbt profiles.yml differ from a standard database connection string?

A profiles.yml file is dbt-specific configuration that maps a profile name to a database target. For DuckDB, the only required field beyond the type is the path to the .duckdb file. This is simpler than a connection string because there is no host, port, username, or password. The file lives at ~/.dbt/profiles.yml and can contain multiple environments (dev, staging, prod) under the same profile name.

Can I read Parquet or JSON files in dbt models with DuckDB?

Yes. DuckDB can read Parquet, CSV, JSON, and Apache Arrow files directly in SQL using the read_parquet(), read_csv_auto(), and read_json_auto() functions. Inside a dbt model you can either call these functions directly in the SQL or wrap the file behind a source() definition. This is useful when raw data lives in files rather than a database table.

What is the difference between dbt seeds and dbt sources?

Seeds are CSV files stored inside your dbt project that dbt loads into the database as tables when you run dbt seed. They are best for small, slow-changing reference data like country codes or product categories. Sources are external tables or views that already exist in your database, defined in a sources.yml file. dbt can run freshness checks on sources and documents them in the lineage graph, but it does not manage their creation or loading.
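A sketch of a sources.yml entry, with illustrative schema and table names (loaded_at_field is what dbt uses for the freshness check):

```yaml
version: 2

sources:
  - name: raw
    schema: main
    tables:
      - name: transactions
        loaded_at_field: loaded_at
        freshness:
          warn_after: {count: 24, period: hour}
```

Models then reference the table as {{ source('raw', 'transactions') }} instead of hardcoding its name.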
