How to Build Data Pipelines with dlt

Arkzero Research · Apr 28, 2026 · 7 min read

Last updated Apr 28, 2026

dlt (data load tool) is a free, open-source Python library that extracts data from APIs, databases, and cloud storage and loads it into any warehouse in minutes. Install it with pip, write a 10-line script, and your data lands in DuckDB or BigQuery automatically — with schema inference, incremental loading, and no infrastructure to manage. In January 2025 the dlt community created 2,400 pipelines per month; by January 2026 that number had grown to 81,000.

dlt (data load tool) is an open-source Python library that moves data from any source to any warehouse with minimal configuration. Install it with pip, write roughly 10 lines of code, and your data lands in DuckDB, BigQuery, Snowflake, or Redshift automatically — complete with schema inference, incremental updates, and no cloud infrastructure to manage.

In January 2025, the dlt community created 2,400 pipelines per month. By January 2026, that number had grown to 81,000, with AI agents now accounting for 91 percent of all new pipelines built on the platform. The growth reflects both direct developer adoption and the tool's fit for automated data workflows where a pipeline needs to run on a schedule without human intervention.

What dlt Does and What It Replaces

Managed ETL connectors like Fivetran and Airbyte charge $300 to $500 or more per month to sync SaaS data from Stripe, HubSpot, or Shopify into your warehouse automatically. dlt replicates most of that functionality in Python code you own and run anywhere: a laptop, a serverless function, an Airflow DAG, or a cron job.

The trade-off is real. Fivetran and Airbyte handle schema drift, connector updates, and retries without your intervention. dlt requires you to write and maintain pipeline code. For teams with a developer or analyst who can write Python, the cost difference is significant. For teams with no technical staff, managed connectors remain the right choice.

What dlt removes entirely is the undifferentiated work: paginating through API responses, detecting new records since the last run, writing data to a typed table, and handling schema evolution when an upstream API changes its response shape. The library's built-in REST API connector and schema inference engine handle all of it.
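
To make that undifferentiated work concrete, here is a minimal plain-Python sketch of the pagination and new-record bookkeeping a hand-written extractor carries. It is not dlt code: FAKE_API, fetch_page, and extract_new_records are hypothetical names, and the in-memory list stands in for a real paginated endpoint.

```python
# Hand-written extraction loop: the bookkeeping a loader library takes over.
# FAKE_API and fetch_page are hypothetical stand-ins for a real paginated API.
FAKE_API = [{"id": i, "updated_at": f"2026-01-{i:02d}"} for i in range(1, 8)]


def fetch_page(page, page_size=3):
    """Return one page of records, or an empty list when past the end."""
    start = (page - 1) * page_size
    return FAKE_API[start:start + page_size]


def extract_new_records(last_seen):
    """Walk every page and keep only records newer than the stored cursor."""
    records, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        records.extend(r for r in batch if r["updated_at"] > last_seen)
        page += 1
    # The new cursor is the largest value seen, or the old one if nothing arrived.
    new_cursor = max((r["updated_at"] for r in records), default=last_seen)
    return records, new_cursor


rows, cursor = extract_new_records("2026-01-04")
```

Every source needs some version of this loop, plus persistence for the cursor between runs, which is the part the library absorbs.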

Installing dlt

dlt requires Python 3.9 or later. Install the base library:

pip install dlt

To write to DuckDB, the most common local development destination:

pip install "dlt[duckdb]"

To write to BigQuery:

pip install "dlt[bigquery]"

Other supported destinations include Snowflake, Redshift, Postgres, MotherDuck, and filesystem targets like S3 and GCS.

Your First Pipeline: GitHub Issues to DuckDB

The fastest way to understand dlt is a working pipeline. This example loads open issues from a public GitHub repository into a local DuckDB file:

import dlt
from dlt.sources.helpers import requests

url = "https://api.github.com/repos/dlt-hub/dlt/issues"
response = requests.get(url)
response.raise_for_status()

pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="duckdb",
    dataset_name="github_data",
)

load_info = pipeline.run(response.json(), table_name="issues")
print(load_info)

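Run this script and dlt creates a local file named github_issues.duckdb in your working directory. Open it with DuckDB's CLI or any SQL client and you will find a github_data.issues table holding the top-level fields of the API response, typed by inference and ready to query. Nested objects such as user are flattened into columns like user__login, and nested lists such as labels land in child tables like issues__labels. Because the script makes a single unpaginated request, it loads only the first page of results (30 items by default on the GitHub API, which also returns pull requests from this endpoint). On a modern laptop this takes about 30 seconds.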
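
The way nested JSON ends up in relational tables can be sketched in a few lines of plain Python. This is a simplified illustration rather than dlt's implementation: the flatten function is hypothetical, the sample issue is made up, and the linking keys dlt adds to child rows are left out. It follows dlt's documented double-underscore naming for flattened columns and child tables.

```python
def flatten(record, parent_table, sep="__"):
    """Split one JSON object into a flat parent row plus child-table rows."""
    row, children = {}, {}

    def walk(obj, prefix):
        for key, value in obj.items():
            name = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(value, dict):
                # Nested object: recurse and prefix the column names.
                walk(value, name)
            elif isinstance(value, list):
                # Nested list: rows go to a child table named parent__field.
                children[f"{parent_table}{sep}{name}"] = [
                    item if isinstance(item, dict) else {"value": item}
                    for item in value
                ]
            else:
                row[name] = value

    walk(record, "")
    return row, children


# Made-up sample shaped like a GitHub issue.
issue = {
    "id": 1,
    "title": "Example",
    "user": {"login": "octocat", "id": 583231},
    "labels": [{"name": "bug"}, {"name": "help wanted"}],
}
row, children = flatten(issue, "issues")
```

Here row carries id, title, user__login, and user__id, while children holds a single issues__labels table with two rows.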

Loading from an Authenticated REST API

Most production pipelines connect to authenticated APIs. dlt's REST API source handles bearer token auth, API key headers, and OAuth without custom code.

This example loads contacts from a CRM API with bearer token authentication:

import dlt
from dlt.sources.rest_api import rest_api_source

pipeline = dlt.pipeline(
    pipeline_name="crm_contacts",
    destination="duckdb",
    dataset_name="crm",
)

source = rest_api_source({
    "client": {
        "base_url": "https://api.yourcrm.com/v1/",
        "auth": {
            "type": "bearer",
            "token": dlt.secrets["crm_api_token"],
        },
    },
    "resources": [
        {
            "name": "contacts",
            "endpoint": "contacts",
            "write_disposition": "merge",
            "primary_key": "id",
        }
    ],
})

load_info = pipeline.run(source)
print(load_info)

The write_disposition: "merge" setting upserts records by primary key on each run, updating existing contacts and inserting new ones. Store secrets in a .dlt/secrets.toml file next to your script, here as a top-level crm_api_token entry, and keep that file out of version control. dlt reads it automatically at runtime; environment variables work as an alternative but are not required.
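
The upsert semantics can be illustrated without a database. The sketch below is plain Python, not dlt's merge implementation (which runs in the destination); merge_by_key and the sample rows are hypothetical.

```python
def merge_by_key(existing, incoming, primary_key="id"):
    """Upsert: replace rows whose key already exists, append the rest."""
    merged = {row[primary_key]: row for row in existing}
    for row in incoming:
        merged[row[primary_key]] = row
    return list(merged.values())


table = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
batch = [
    {"id": 2, "email": "b.new@example.com"},  # existing key: row is updated
    {"id": 3, "email": "c@example.com"},      # new key: row is inserted
]
table = merge_by_key(table, batch)
```

After the call the table still has one row per id, which is why a stable primary key matters for this write disposition.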

Loading from a SQL Database

For teams that want to replicate an operational database into their analytics warehouse, dlt includes a sql_database source that connects to Postgres, MySQL, SQLite, and other databases via SQLAlchemy.

Install the source's dependencies, plus a driver for your database (psycopg2 for Postgres):

pip install "dlt[sql_database]" psycopg2-binary

Then replicate Postgres tables incrementally into BigQuery:

import dlt
from dlt.sources.sql_database import sql_database

pipeline = dlt.pipeline(
    pipeline_name="postgres_to_bq",
    destination="bigquery",
    dataset_name="operations",
)

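source = sql_database(
    "postgresql://user:password@host:5432/mydb",
    schema="public",
    table_names=["orders", "customers", "products"],
)

# sql_database() takes no incremental argument; set the cursor on each table.
for table in ("orders", "customers", "products"):
    source.resources[table].apply_hints(
        incremental=dlt.sources.incremental("updated_at")
    )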

load_info = pipeline.run(source)
print(load_info)

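The incremental cursor loads only rows whose updated_at is later than the highest value seen in the last successful run. dlt stores that cursor state between runs, so re-running the script does not reload rows it has already loaded. With the default append write disposition, a row modified after it was loaded arrives again as a new row; set a primary key and the merge disposition to update it in place instead.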
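
The cursor mechanism can be sketched with the standard library's sqlite3 module. This is an illustration of the idea under simplifying assumptions, not dlt's implementation: the state dict is a hypothetical stand-in for the state dlt persists, load_increment is a made-up helper, and handling of rows that tie on the boundary value is omitted.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2026-04-01"), (2, "2026-04-02"), (3, "2026-04-03")],
)

# Hypothetical stand-in for cursor state persisted between runs.
state = {"last_value": None}


def load_increment(conn, state):
    """Fetch rows past the stored cursor, then advance the cursor."""
    if state["last_value"] is None:
        query, params = "SELECT id, updated_at FROM orders ORDER BY updated_at", ()
    else:
        query = (
            "SELECT id, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at"
        )
        params = (state["last_value"],)
    rows = conn.execute(query, params).fetchall()
    if rows:
        state["last_value"] = rows[-1][1]  # highest updated_at in this batch
    return rows


first_run = load_increment(source, state)    # all three rows
second_run = load_increment(source, state)   # nothing new
source.execute("INSERT INTO orders VALUES (4, '2026-04-04')")
third_run = load_increment(source, state)    # only the new row
```

The second run returns nothing because the stored cursor already covers every row, which is the behavior that keeps scheduled re-runs cheap.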

Switching to BigQuery for Production

One of dlt's most practical features is the ability to develop locally against DuckDB and deploy against BigQuery in production with a one-line change:

pipeline = dlt.pipeline(
    pipeline_name="crm_contacts",
    destination="bigquery",  # was "duckdb"
    dataset_name="crm",
)

To authenticate with BigQuery, create a service account with the BigQuery Data Editor and BigQuery Job User roles, download its JSON key, and copy the project_id, private_key, and client_email values into .dlt/secrets.toml under the [destination.bigquery.credentials] section. The official dlt documentation provides the exact key structure.

This local-to-cloud pattern eliminates the need for a staging warehouse and reduces the cost of iterating on pipeline logic before committing to production infrastructure.
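
One way to avoid editing the script between environments is to read the destination name from the environment. The helper below is a small sketch of that idea; resolve_destination and the PIPELINE_DESTINATION variable are hypothetical names chosen for illustration, not part of dlt.

```python
import os


def resolve_destination(default="duckdb"):
    """Pick the destination name from the environment, defaulting to DuckDB."""
    return os.environ.get("PIPELINE_DESTINATION", default)


# Usage inside a pipeline script (assumes dlt is imported there):
# pipeline = dlt.pipeline(
#     pipeline_name="crm_contacts",
#     destination=resolve_destination(),
#     dataset_name="crm",
# )
```

With this in place, a laptop run uses DuckDB by default and the production job sets PIPELINE_DESTINATION=bigquery.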

Schema Inference and Evolution

One persistent pain when building pipelines manually is handling changes in the upstream API's response shape. A new field appears, an existing field changes type, a nested object gets flattened, and your pipeline breaks silently.

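dlt tracks the schema of every table it manages and detects changes automatically. When a new field appears in the API response, dlt adds a column to the destination table without any intervention. When an incoming value can be coerced to the existing column type (an integer arriving in a float column, for example), dlt loads it without errors. When it cannot, dlt by default writes the value to a separate variant column, named like amount__v_text, rather than failing the load. To make such changes surface as errors instead, set a schema contract in freeze mode, which raises a DataValidationError when data does not fit the existing schema.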
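
The detection step can be pictured as a comparison between each record and a stored column map. The toy below is a plain-Python sketch, not dlt's inference engine: py_type and evolve are hypothetical, the type names are coarse, and the sketch only reports a conflicting column without deciding how to resolve it.

```python
def py_type(value):
    """Map a Python value to a coarse column type name."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "text"


def evolve(schema, record):
    """Update schema in place; return (added columns, conflicting columns)."""
    added, conflicts = [], []
    for name, value in record.items():
        if value is None:
            continue  # nulls carry no type information
        seen = py_type(value)
        if name not in schema:
            schema[name] = seen
            added.append(name)
        elif schema[name] != seen:
            if schema[name] == "double" and seen == "bigint":
                continue  # an integer fits in a float column
            conflicts.append(name)
    return added, conflicts


schema = {}
evolve(schema, {"id": 1, "amount": 9.5})
added, conflicts = evolve(schema, {"id": 2, "amount": 10, "currency": "EUR"})
added2, conflicts2 = evolve(schema, {"id": "abc"})
```

The second record adds a currency column without complaint, while the third is flagged because a string cannot be stored in the integer id column.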

Running on a Schedule

dlt pipelines are plain Python scripts and run anywhere Python runs. The simplest production deployment is a cron job:

# Run the contacts sync every hour
0 * * * * cd /home/user/pipelines && python crm_contacts.py >> /var/log/dlt_crm.log 2>&1

For teams already on Airflow, dlt ships a helper (PipelineTasksGroup in dlt.helpers.airflow_helper) that runs a pipeline's resources as Airflow tasks, and the dlt documentation includes working DAG examples. On Prefect, a pipeline runs inside an ordinary task or flow like any other Python function.
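
Whichever scheduler runs the script, it helps if a failed load produces a log line and a non-zero exit code. The wrapper below is one possible sketch using only the standard library; run_job and sync_contacts are hypothetical names, and the placeholder body marks where a real script would call pipeline.run(source).

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("crm_contacts")


def run_job(job):
    """Run one pipeline callable; return a process exit code for the scheduler."""
    try:
        result = job()
    except Exception:
        # log.exception records the full traceback in the cron log file.
        log.exception("pipeline run failed")
        return 1
    log.info("pipeline run finished: %s", result)
    return 0


def sync_contacts():
    # Placeholder body; a real script would call pipeline.run(source) here.
    return "ok"


# At the script's entry point:
# raise SystemExit(run_job(sync_contacts))
```

A non-zero exit code lets cron mail wrappers or external monitors detect a failed sync instead of relying on someone reading the log.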

What dlt Does Not Handle

dlt covers extraction and loading. Transformation — cleaning, joining, and modeling raw data into analytical tables — is the job of SQL or dbt. A common stack pairs dlt for ingestion with dbt for transformation and a BI tool for reporting. Each layer stays simple and independently replaceable.

dlt also provides no UI. Pipeline state, load history, and schema versions are stored in internal tables within your destination database and queried directly with SQL. If you want to explore the loaded data without writing SQL, tools like VSLZ can connect to a dlt-populated DuckDB or BigQuery dataset and surface insights through a conversational interface.

Getting Started

For any team that needs to move data from an API or database into a warehouse, dlt removes the need for a managed ETL subscription as long as someone on the team can write basic Python. The install takes one command, a working pipeline takes under an hour to build, and moving from local DuckDB development to production BigQuery requires changing one line. At 81,000 pipelines per month and growing, it has become a standard part of the open-source data stack in 2026.

FAQ

What is dlt and what does it do?

dlt (data load tool) is a free, open-source Python library that extracts data from REST APIs, SQL databases, cloud storage, and other sources and loads it into a data warehouse or local database. It handles pagination, authentication, schema inference, incremental loading, and schema evolution automatically. You write a Python script that defines your source and destination, and dlt manages the rest. Supported destinations include DuckDB, BigQuery, Snowflake, Redshift, Postgres, and MotherDuck.

How does dlt compare to Fivetran and Airbyte?

Fivetran and Airbyte are managed services that maintain connectors, handle schema drift, and retry failed syncs without requiring code from you. dlt is a Python library you run yourself — it requires more upfront setup but costs nothing beyond your compute. For teams with a developer or analyst who can write Python, dlt is significantly cheaper. For teams without technical staff, managed connectors are the better fit. dlt is also a better choice when you need to load from a custom internal API or a data source that Fivetran and Airbyte do not support.

Can I use dlt without knowing Python?

dlt requires basic Python to configure and run pipelines. A minimal working pipeline is about 10 to 15 lines of code, and the official documentation includes copy-paste examples for common sources like REST APIs and SQL databases. Someone comfortable with Python scripting can build a production pipeline in under an hour. Teams without any Python knowledge are better served by managed ETL tools that provide a UI-based configuration workflow.

What databases and destinations does dlt support?

dlt supports DuckDB, BigQuery, Snowflake, Redshift, Postgres, MotherDuck, Azure Blob Storage, Amazon S3, Google Cloud Storage, and filesystem destinations. You install a destination-specific extra alongside the base library: for example, `pip install "dlt[duckdb]"` or `pip install "dlt[bigquery]"` (the quotes keep shells such as zsh from interpreting the brackets). The destination is specified as a string parameter when you create a pipeline, and switching between destinations requires only changing that one value.

How do I run a dlt pipeline on a schedule?

dlt pipelines are plain Python scripts and run anywhere Python runs. The simplest scheduling method is a cron job on a Linux server or EC2 instance. On Airflow, dlt provides a helper that runs a pipeline as a task group; on Prefect, a pipeline runs inside an ordinary task or flow. For serverless options, dlt can be packaged as an AWS Lambda function or a Google Cloud Function and triggered on a schedule using EventBridge or Cloud Scheduler. The dlt documentation includes deployment walkthroughs for these patterns.
