Dagster: Modern Data Orchestration with Software-Defined Assets
Dagster is a data orchestrator that organizes pipelines around the data they produce rather than the tasks that run. This tutorial walks through the core concepts using a sales-data pipeline as a running example, from installation to deployment. By the end you will understand assets, resources, IO managers, partitions, schedules, sensors, and data-quality checks.
What Dagster Is
Dagster is an orchestration framework for building, running, and observing data pipelines. It was designed to make pipelines testable, observable, and maintainable as software. Instead of treating a pipeline as a graph of opaque tasks, Dagster encourages you to declare the assets your pipeline produces — tables, files, machine-learning models, dashboards — and lets the framework figure out execution order from the dependencies between them.
An asset is a persistent object in storage that captures some understanding of the world. A daily sales table in a warehouse is an asset. A cleaned Parquet file is an asset. A trained model is an asset. When you describe your pipeline in terms of these outputs, the orchestrator gains a richer model of your system, which improves observability and lineage.
How Dagster Differs from Airflow and Prefect
Airflow and Prefect are primarily task-centric. You define tasks (or flows) and wire up the order in which they execute. The orchestrator knows that task B runs after task A, but it does not inherently know that task A produces a rawsales table and task B consumes it. Lineage and data awareness must be added on top.
Dagster is asset-centric. You declare the assets, and the dependency graph between assets is derived from how they reference each other. This shift has practical consequences:
- Lineage is built in. The asset graph is the lineage graph. You can see which upstream asset feeds which downstream asset directly in the UI.
- Reasoning in terms of data. "Rematerialize the dailysalessummary asset" is a more natural operation than "rerun the transform task and then the load task."
- Partitions are first-class. Each asset can be partitioned (for example, one partition per day), and Dagster tracks which partitions are materialized.
- Built-in data quality. Asset checks let you attach validations to assets and surface failures in the UI.
Dagster still supports task-style execution through ops and jobs (covered later), so you are not forced into the asset model for everything. But for most data pipelines, assets are the recommended starting point.
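To make the idea of a graph "derived from how assets reference each other" concrete, here is a minimal plain-Python sketch. It is not Dagster's implementation: it only imitates the naming convention (a parameter name refers to an upstream function) using the standard library, and the helper names build_graph and run_all are hypothetical.

```python
import inspect
from graphlib import TopologicalSorter


def rawsales():
    # Toy stand-in for an extracted table.
    return [{"orderid": 1, "amount": 10.0}, {"orderid": None, "amount": 5.0}]


def cleanedsales(rawsales):
    # Depends on rawsales purely through the parameter name.
    return [row for row in rawsales if row["orderid"] is not None]


def dailysalessummary(cleanedsales):
    return {"total": sum(row["amount"] for row in cleanedsales)}


def build_graph(functions):
    """Map each function name to the set of parameter names it depends on."""
    return {fn.__name__: set(inspect.signature(fn).parameters) for fn in functions}


def run_all(functions):
    """Run functions in dependency order, passing upstream results by name."""
    by_name = {fn.__name__: fn for fn in functions}
    order = list(TopologicalSorter(build_graph(functions)).static_order())
    results = {}
    for name in order:
        fn = by_name[name]
        kwargs = {p: results[p] for p in inspect.signature(fn).parameters}
        results[name] = fn(**kwargs)
    return order, results


# Declared out of order on purpose: the execution order comes from the graph.
order, results = run_all([dailysalessummary, rawsales, cleanedsales])
print(order)
print(results["dailysalessummary"])
```

Nothing in the sketch says "run cleanedsales after rawsales"; the order falls out of the references alone, and the same mapping doubles as a lineage graph.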
Installation
Dagster runs on Python 3.9 or later. Create a virtual environment and install the core package together with the web UI.
python -m venv .venv
source .venv/bin/activate
pip install dagster dagster-webserver
The dagster package provides the core library and command-line tools. The dagster-webserver package provides the local web UI you will use to inspect and run assets. Integration packages such as dagster-pandas, dagster-duckdb, and dagster-aws are installed separately as needed.
Verify the installation:
dagster --version
A clean project layout for the examples below looks like this:
salespipeline/
    salespipeline/
        __init__.py
        assets.py
        resources.py
    pyproject.toml
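If you prefer to create this skeleton from a script, the following sketch builds it with pathlib. It assumes the nesting described above (a package directory inside a project directory of the same name, with pyproject.toml at the project root). It creates empty files only, and the scaffold function is a hypothetical helper, not part of Dagster.

```python
import tempfile
from pathlib import Path


def scaffold(root, name="salespipeline"):
    """Create the empty project skeleton under root and list what was created."""
    root = Path(root)
    project = root / name
    package = project / name
    package.mkdir(parents=True, exist_ok=True)
    for filename in ("__init__.py", "assets.py", "resources.py"):
        (package / filename).touch()
    (project / "pyproject.toml").touch()
    # Report paths relative to root so the result is easy to compare.
    return sorted(p.relative_to(root).as_posix() for p in project.rglob("*"))


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(tmp)
    print("\n".join(created))
```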
The @asset Decorator and Software-Defined Assets
A software-defined asset is a Python function decorated with @asset. The function computes the asset's value, and its return value is handed to an IO manager for persistence. The function name becomes the asset's key.
# Requires pandas in addition to dagster (pip install pandas).
from dagster import asset
import pandas as pd

@asset
def rawsales() -> pd.DataFrame:
    """Extract raw sales records from a source CSV."""
    return pd.read_csv("https://example.com/sales.csv")
Dependencies between assets are expressed by adding a parameter whose name matches an upstream asset key. Dagster injects the upstream asset's value at runtime.
@asset
def cleanedsales(rawsales: pd.DataFrame) -> pd.DataFrame:
    """Transform: drop nulls and normalize column names."""
    df = rawsales.dropna(subset=["orderid", "amount"])