
How to Get Started with Dagster

Arkzero Research · Apr 24, 2026 · 7 min read

Last updated Apr 24, 2026

Dagster is an open-source data orchestration platform that models pipelines as software-defined assets rather than task graphs. To get started, install it with pip, define assets as plain Python functions decorated with @asset, and run dagster dev to launch a local UI. Unlike Airflow, Dagster tracks the freshness of each data asset it produces, making lineage and staleness visible at a glance. A working pipeline with a daily schedule takes under 20 minutes to set up from scratch.

What Is Dagster and Why Teams Are Switching

Dagster is an open-source data orchestration platform built around software-defined assets (SDAs). Instead of defining a workflow as a sequence of tasks — the Airflow model — Dagster asks you to define the data objects your pipeline produces: tables, files, ML model artifacts, API responses. The scheduler works backward from those assets to determine what needs to run and when.
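To make "working backward from assets" concrete, here is a minimal standard-library sketch. It does not use Dagster and is not how Dagster is implemented; the asset names and the plan_for helper are hypothetical. Given a target asset, it walks upstream to find everything that must exist first, then orders that subgraph so dependencies run before the assets that read them.

```python
from graphlib import TopologicalSorter

# Hypothetical asset graph: each asset maps to the assets it reads from.
UPSTREAM = {
    "raw_orders": set(),
    "raw_customers": set(),
    "clean_orders": {"raw_orders"},
    "revenue_report": {"clean_orders", "raw_customers"},
    "unrelated_export": {"raw_customers"},
}


def plan_for(target, upstream):
    """Return the assets needed to build `target`, in a valid run order."""
    needed = set()
    stack = [target]
    while stack:  # walk backward from the target through its ancestors
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(upstream[name])
    # Sort only the needed subgraph so dependencies come first.
    subgraph = {name: upstream[name] for name in needed}
    return list(TopologicalSorter(subgraph).static_order())


plan = plan_for("revenue_report", UPSTREAM)
print(plan)
```

Note that unrelated_export never appears in the plan: nothing about the requested asset depends on it, so there is no reason to run it.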

The distinction matters more than it sounds. In a task-based system, you observe whether jobs succeeded or failed. In an asset-based system, you observe whether your data is fresh, stale, or missing. That shift reduces debugging time because the question "why is this dashboard stale?" becomes directly answerable from the UI rather than requiring you to trace task logs.

Dagster's GitHub repository has passed 13,000 stars as of April 2026. With Apache Airflow 2 reaching end of life, many teams that built their pipelines in 2021-2022 are evaluating alternatives. Dagster's own benchmarking data claims that engineers building in Dagster are 2x more productive than teams on Airflow, primarily because lineage and staleness are first-class concepts rather than bolted-on plugins. That is a vendor figure, not an independent measurement.

Prerequisites

Before you start:

  • Python 3.10 or higher
  • pip or uv for package management
  • A terminal and a text editor

No database setup. No Docker. No separate service to manage. For local development, a single dagster dev command starts everything Dagster needs.

Step 1: Install Dagster

pip install dagster dagster-webserver

If you use uv, which installs packages roughly 10-20x faster than pip:

uv pip install dagster dagster-webserver

Verify the install:

dagster --version

You should see output like dagster, version 1.10.x.
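If you would rather check from Python (for example inside a virtual environment where you are unsure which interpreter is active), a small standard-library sketch like the following works. The installed_version helper is a hypothetical name; importlib.metadata is part of Python's standard library.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report on the two packages installed in Step 1.
for name in ("dagster", "dagster-webserver"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```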

Step 2: Scaffold a New Project

Dagster includes a scaffold command that creates the correct directory structure:

dagster project scaffold --name my_pipeline
cd my_pipeline
pip install -e ".[dev]"

The scaffold creates:

my_pipeline/
  my_pipeline/
    __init__.py
    assets.py
    definitions.py
  my_pipeline_tests/
  setup.py
  pyproject.toml

The definitions.py file is the entry point Dagster loads. It wires assets, jobs, schedules, and sensors into a single Definitions object.

Step 3: Define Your First Assets

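The example below uses pandas, which is not a Dagster dependency, so run pip install pandas first. Then open assets.py and replace its contents: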

import pandas as pd
from dagster import asset, AssetExecutionContext

@asset
def raw_sales():
    """Load stand-in "sales" records; the public Titanic CSV serves as sample data."""
    return pd.read_csv(
        "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
    )

@asset
def sales_summary(context: AssetExecutionContext, raw_sales):
    """Summarize the mean rate per segment (here: survival rate per passenger class)."""
    # Pclass plays the role of a customer segment, Survived the role of a conversion flag.
    summary = raw_sales.groupby("Pclass")["Survived"].mean().reset_index()
    summary.columns = ["segment", "rate"]
    context.log.info(f"Summarized {len(raw_sales)} rows into {len(summary)} segments")
    return summary

Two things to notice. First, sales_summary takes raw_sales as a parameter — Dagster reads that argument name and resolves the dependency automatically. No YAML graph definitions, no explicit set_upstream calls. Second, context.log.info writes to Dagster's structured event log, which surfaces in the UI next to run metadata.
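The idea of resolving dependencies from argument names can be sketched in plain Python. This is an illustration of the technique only, not Dagster's implementation; materialize_all and the three toy functions are hypothetical. A parameter whose name matches another function is treated as a dependency, and the functions then run in topological order.

```python
import inspect
from graphlib import TopologicalSorter


def load_numbers():
    return [3, 1, 2]


def sorted_numbers(load_numbers):
    return sorted(load_numbers)


def largest(sorted_numbers):
    return sorted_numbers[-1]


def materialize_all(functions):
    """Run functions in dependency order, wiring inputs by parameter name."""
    by_name = {fn.__name__: fn for fn in functions}
    # A parameter named after another function is treated as a dependency.
    deps = {
        name: [p for p in inspect.signature(fn).parameters if p in by_name]
        for name, fn in by_name.items()
    }
    results = {}
    for name in TopologicalSorter(deps).static_order():
        kwargs = {dep: results[dep] for dep in deps[name]}
        results[name] = by_name[name](**kwargs)
    return results


# The list order does not matter; the graph determines the run order.
results = materialize_all([largest, sorted_numbers, load_numbers])
print(results["largest"])
```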

Step 4: Wire Everything into Definitions

Open definitions.py:

from dagster import Definitions, load_assets_from_modules
from . import assets

defs = Definitions(
    assets=load_assets_from_modules([assets]),
)

load_assets_from_modules scans the module and collects every @asset-decorated function. You can also list assets explicitly if you want finer control over what gets loaded.
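Conceptually, "scan a module and collect decorated functions" looks like the sketch below. It is a simplification built on the standard library: the asset decorator here is a hypothetical stand-in that only tags the function (Dagster's real decorator returns its own definition object), and collect_assets is an invented helper name.

```python
import inspect
import types


def asset(fn):
    """Hypothetical stand-in for a marker decorator: tags the function."""
    fn._is_asset = True
    return fn


def collect_assets(module):
    """Return the names of all tagged functions in `module`, sorted."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if getattr(obj, "_is_asset", False)
    )


# Build a throwaway module in memory so the example is self-contained.
demo = types.ModuleType("demo_assets")


@asset
def raw_events():
    return []


def helper():  # not decorated, so the scan should skip it
    return None


demo.raw_events = raw_events
demo.helper = helper

found = collect_assets(demo)
print(found)
```

The practical consequence is the same as in Dagster: helper functions that live next to your assets are ignored unless you decorate them.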

Step 5: Launch the Local UI

From the project root:

dagster dev

Open http://localhost:3000. The Dagster UI shows an Asset Graph with two nodes — raw_sales and sales_summary — connected by a directed edge.

Click Materialize all to run the full pipeline. Dagster executes raw_sales first, passes its result to sales_summary, and records both outputs as asset materializations. The event log on the right of the run view shows each step's start time, duration, and logged metadata.

The UI also tracks the status of each asset relative to its upstream dependencies. If an upstream asset was materialized three hours ago and a downstream asset has not been re-run since, Dagster flags the downstream asset as stale. This visibility is the core practical difference between asset-based and task-based orchestrators.
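The staleness rule described above can be approximated with timestamps alone. The sketch below is a deliberately simplified model with hypothetical names and sample times; Dagster's own logic considers more than timestamps, so treat this as an illustration of the idea rather than its implementation.

```python
from datetime import datetime, timezone

# Hypothetical graph and last-materialization times (UTC).
UPSTREAM = {"raw_sales": [], "sales_summary": ["raw_sales"], "forecast": ["sales_summary"]}
LAST_MATERIALIZED = {
    "raw_sales": datetime(2026, 4, 24, 9, 0, tzinfo=timezone.utc),
    "sales_summary": datetime(2026, 4, 24, 6, 0, tzinfo=timezone.utc),
    # "forecast" has never been materialized.
}


def status(asset_name, upstream, last_materialized):
    """Classify an asset as 'missing', 'stale', or 'fresh'."""
    mine = last_materialized.get(asset_name)
    if mine is None:
        return "missing"
    for parent in upstream.get(asset_name, ()):
        theirs = last_materialized.get(parent)
        # A parent refreshed after this asset means its inputs have changed.
        if theirs is not None and theirs > mine:
            return "stale"
    return "fresh"


report = {name: status(name, UPSTREAM, LAST_MATERIALIZED) for name in UPSTREAM}
print(report)
```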

Step 6: Add a Recurring Schedule

Edit definitions.py:

from dagster import (
    Definitions,
    load_assets_from_modules,
    define_asset_job,
    ScheduleDefinition,
)
from . import assets

all_assets_job = define_asset_job("all_assets_job")

morning_schedule = ScheduleDefinition(
    job=all_assets_job,
    cron_schedule="0 6 * * *",  # 6 AM UTC daily
)

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    jobs=[all_assets_job],
    schedules=[morning_schedule],
)

Restart dagster dev. Navigate to Automation in the sidebar and toggle the schedule on. The Dagster daemon, which dagster dev starts for you, evaluates the schedule and launches runs without any external cron job or additional infrastructure.
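If cron syntax is unfamiliar, the expression "0 6 * * *" means minute 0 of hour 6, every day. The sketch below computes when such a schedule next fires, using only the standard library. next_daily_run is a hypothetical helper that handles this one daily pattern; it is not a general cron parser.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 6) -> datetime:
    """Next time a '0 <hour> * * *' schedule fires, strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


early = datetime(2026, 4, 24, 5, 30, tzinfo=timezone.utc)
late = datetime(2026, 4, 24, 23, 0, tzinfo=timezone.utc)
print(next_daily_run(early))
print(next_daily_run(late))
```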

Step 7: Add Persistent Storage with an I/O Manager

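By default, Dagster pickles each asset's output to the local filesystem. When you run dagster dev without setting DAGSTER_HOME, that storage lives in a temporary directory that is removed when the process exits. To keep outputs in a location you control, attach a filesystem I/O manager with an explicit base directory: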

from dagster import FilesystemIOManager

defs = Definitions(
    assets=load_assets_from_modules([assets]),
    jobs=[all_assets_job],
    schedules=[morning_schedule],
    resources={
        "io_manager": FilesystemIOManager(base_dir="/tmp/dagster_storage")
    },
)

For production, Dagster supports S3, GCS, BigQuery, Snowflake, and DeltaLake through official integrations. Switching storage backends does not require changing your asset code — you swap the I/O manager in Definitions and the asset functions stay identical.
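The reason asset code stays unchanged is that storage sits behind a small save/load interface. Here is a standard-library sketch of that pattern; InMemoryStore, PickleFileStore, and run are hypothetical names, not Dagster classes. The doubled function plays the role of an asset and never touches storage directly.

```python
import pickle
import tempfile
from pathlib import Path


class InMemoryStore:
    """Keeps values in a dict; everything is lost when the process exits."""

    def __init__(self):
        self._values = {}

    def save(self, key, value):
        self._values[key] = value

    def load(self, key):
        return self._values[key]


class PickleFileStore:
    """Writes each value to <base_dir>/<key>.pkl so it survives restarts."""

    def __init__(self, base_dir):
        self._base = Path(base_dir)
        self._base.mkdir(parents=True, exist_ok=True)

    def save(self, key, value):
        (self._base / f"{key}.pkl").write_bytes(pickle.dumps(value))

    def load(self, key):
        return pickle.loads((self._base / f"{key}.pkl").read_bytes())


def doubled(numbers):
    """The 'asset' logic: it only transforms its input."""
    return [n * 2 for n in numbers]


def run(store):
    """Same pipeline, whichever store is plugged in."""
    store.save("numbers", [1, 2, 3])
    store.save("doubled", doubled(store.load("numbers")))
    return store.load("doubled")


in_memory_result = run(InMemoryStore())
with tempfile.TemporaryDirectory() as tmp:
    on_disk_result = run(PickleFileStore(tmp))
print(in_memory_result, on_disk_result)
```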

Step 8: Connect dbt Models as Assets

If your stack includes dbt, the dagster-dbt integration maps each dbt model to a Dagster asset automatically, giving you end-to-end lineage from raw ingestion through SQL transformations in a single graph.

pip install dagster-dbt

Then in definitions.py:

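from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("/path/to/your/dbt/project")

# The manifest must already exist; generate it by running dbt parse or dbt build.
@dbt_assets(manifest=DBT_PROJECT_DIR / "target/manifest.json")
def my_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Stream dbt events back to Dagster as each model finishes.
    yield from dbt.cli(["build"], context=context).stream()

# In Definitions, add my_dbt_assets to assets and register the resource:
# resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))}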

Each dbt model becomes a node in the same Dagster asset graph alongside your Python ingestion assets. For many data teams, this combined lineage view is the clearest picture they have had of where their data comes from.
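The mapping is possible because dbt's manifest.json already records each model and what it depends on. The sketch below reads a heavily trimmed, hypothetical manifest with the standard library to show the kind of information available. The project name shop and the model names are invented, and model_lineage is an illustrative helper, not part of dagster-dbt.

```python
import json

# Heavily trimmed, hypothetical manifest; real manifests carry many more fields.
MANIFEST_JSON = """
{
  "nodes": {
    "model.shop.stg_orders": {
      "resource_type": "model",
      "name": "stg_orders",
      "depends_on": {"nodes": ["source.shop.raw.orders"]}
    },
    "model.shop.daily_revenue": {
      "resource_type": "model",
      "name": "daily_revenue",
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    },
    "test.shop.not_null_stg_orders_id": {
      "resource_type": "test",
      "name": "not_null_stg_orders_id",
      "depends_on": {"nodes": ["model.shop.stg_orders"]}
    }
  }
}
"""


def model_lineage(manifest):
    """Map each dbt model name to the unique IDs it depends on."""
    return {
        node["name"]: node["depends_on"]["nodes"]
        for node in manifest["nodes"].values()
        if node["resource_type"] == "model"  # skip tests, seeds, snapshots
    }


lineage = model_lineage(json.loads(MANIFEST_JSON))
print(lineage)
```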

Deploying to Production

Dagster can be deployed in two ways. Self-hosting means running the daemon, webserver, and a PostgreSQL database yourself, which is the right choice for teams with data residency requirements or existing Kubernetes infrastructure.

Dagster Cloud Serverless (the managed product is now branded Dagster+) is a managed option that activates in under five minutes with no infrastructure configuration. It is a practical way to run a Dagster pipeline in production before committing to server management. Plans and trial terms change over time, so check the current pricing before counting on a free tier. Dagster Cloud also supports branch deployments, which are isolated pipeline environments per Git branch; they reduce the risk of shipping pipeline changes to production without a staging run.

What Dagster Does Better Than Airflow

When a Dagster pipeline fails midway, the UI shows exactly which asset materialization failed and which downstream assets are now stale. Airflow's task model shows you that a task failed; mapping that failure to the affected data objects generally takes additional tooling or configuration.

Dagster also ships with built-in support for partitioned assets (daily, monthly, or custom key-based). Incremental loading patterns that require external operators in Airflow are first-class primitives in Dagster.
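As a rough picture of what daily partitioning tracks, the standard-library sketch below generates one key per day and works out which days still need a backfill. The helper names and sample dates are hypothetical, and this is not Dagster's partitions API; it only shows the bookkeeping that a partitioned asset automates.

```python
from datetime import date, timedelta


def daily_partition_keys(start: date, end: date):
    """ISO date keys from `start` up to, but not including, `end`."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def missing_partitions(all_keys, materialized):
    """Keys that have no materialization yet, in chronological order."""
    done = set(materialized)
    return [key for key in all_keys if key not in done]


keys = daily_partition_keys(date(2026, 4, 1), date(2026, 4, 6))
backfill = missing_partitions(keys, {"2026-04-01", "2026-04-02", "2026-04-04"})
print(backfill)
```

An incremental load then only processes the keys in backfill instead of rebuilding the whole table.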

For teams running Airflow 2 pipelines that need a migration path in 2026, Dagster provides official tooling to incrementally migrate DAGs. You can observe existing Airflow DAGs from within Dagster without changing a line of Airflow code, then migrate assets one at a time.

If your team works primarily with uploaded files or ad hoc CSV exports rather than scheduled pipelines, VSLZ handles on-demand analysis from a file upload in plain English without any pipeline setup.

Practical Summary

A working Dagster pipeline with a daily schedule takes under 20 minutes to set up from scratch. The asset graph becomes genuinely useful once you have more than three interdependent assets — that is when freshness tracking and lineage visibility start saving real debugging time. After getting the local setup running, the recommended path is Dagster Cloud Serverless for your first production deployment, followed by connecting one dbt project using dagster-dbt to see unified lineage across ingestion and transformation in a single view.

FAQ

What is Dagster used for?

Dagster is an open-source data orchestration platform used to build, schedule, and monitor data pipelines. Unlike Airflow, which models pipelines as task graphs, Dagster models them as software-defined assets — tables, files, ML models — and tracks whether each asset is fresh or stale. It is commonly used for ETL pipelines, dbt orchestration, ML workflow management, and data platform engineering.

How is Dagster different from Apache Airflow?

The core difference is the execution model. Airflow schedules tasks and reports whether they succeeded or failed. Dagster schedules asset materializations and reports whether the resulting data objects are current, stale, or missing. This means Dagster gives you data lineage and freshness visibility out of the box, without additional plugins. Both tools are configured in Python, but Dagster infers dependencies from function arguments instead of having you wire operators together explicitly, and it has built-in support for partitioned assets and branch deployments.

How do I install Dagster?

Run pip install dagster dagster-webserver in a Python 3.10+ environment. Then use dagster project scaffold --name my_project to create a project with the correct directory structure, install it with pip install -e '.[dev]', and launch the local UI with dagster dev. The full setup from zero to a running local pipeline takes under 15 minutes.

Is Dagster free?

Yes. Dagster is open-source (Apache 2.0 license) and free to self-host. Dagster Cloud, the managed SaaS version (now branded Dagster+), is a commercial product; check the current pricing page for trial and entry-level options for small teams. Higher-tier plans add features like SSO, role-based access control, and higher execution limits. Self-hosting requires running the Dagster daemon, webserver, and a PostgreSQL database, but there are no licensing costs.

Can Dagster replace Airflow for existing pipelines?

In many cases, yes. Dagster provides official tooling to incrementally migrate Airflow DAGs — you can observe existing Airflow pipelines from within Dagster without rewriting them, then migrate assets one at a time. With Apache Airflow 2 reaching end of life in 2026, many teams are using the migration tooling to transition to Dagster over several sprints rather than doing a big-bang rewrite.
