Dagster Tutorial: Data Orchestration with Software-Defined Assets

# Dagster: Orkestrasi Data Modern dengan Software-Defined Assets Dagster adalah orkestrator data yang menyusun pipeline berdasarkan data yang dihasilkan, bukan berdasarkan task yang dijalankan. Tutor...

By Ruby Abdullah · · tutorial
DagsterData OrchestrationData PipelineETLData EngineeringPython

Dagster: Modern Data Orchestration with Software-Defined Assets

Dagster is a data orchestrator that organizes pipelines around the data they produce rather than the tasks that run. This tutorial walks through the core concepts using a sales-data pipeline as a running example, from installation to deployment. By the end you will understand assets, resources, IO managers, partitions, schedules, sensors, and data-quality checks.

What Dagster Is

Dagster is an orchestration framework for building, running, and observing data pipelines. It was designed to make pipelines testable, observable, and maintainable as software. Instead of treating a pipeline as a graph of opaque tasks, Dagster encourages you to declare the assets your pipeline produces — tables, files, machine-learning models, dashboards — and lets the framework figure out execution order from the dependencies between them.

An asset is a persistent object in storage that captures some understanding of the world. A daily sales table in a warehouse is an asset. A cleaned Parquet file is an asset. A trained model is an asset. When you describe your pipeline in terms of these outputs, the orchestrator gains a richer model of your system, which improves observability and lineage.

How Dagster Differs from Airflow and Prefect

Airflow and Prefect are primarily task-centric. You define tasks (or flows) and wire up the order in which they execute. The orchestrator knows that task B runs after task A, but it does not inherently know that task A produces a rawsales table and task B consumes it. Lineage and data awareness must be added on top.

Dagster is asset-centric. You declare the assets, and the dependency graph between assets is derived from how they reference each other. This shift has practical consequences:

  • Lineage is built in. The asset graph is the lineage graph. You can see which upstream asset feeds which downstream asset directly in the UI.
  • Reasoning in terms of data. "Rematerialize the dailysalessummary asset" is a more natural operation than "rerun the transform task and then the load task."
  • Partitions are first-class. Each asset can be partitioned (for example, one partition per day), and Dagster tracks which partitions are materialized.
  • Built-in data quality. Asset checks let you attach validations to assets and surface failures in the UI.

Dagster still supports task-style execution through ops and jobs (covered later), so you are not forced into the asset model for everything. But for most data pipelines, assets are the recommended starting point.

Installation

Dagster runs on Python 3.9 or later. Create a virtual environment and install the core package together with the web UI.

python -m venv .venv

source .venv/bin/activate

pip install dagster dagster-webserver

The dagster package provides the core library and command-line tools. The dagster-webserver package provides the local web UI you will use to inspect and run assets. Integration packages such as dagster-pandas, dagster-duckdb, and dagster-aws are installed separately as needed.

Verify the installation:

dagster --version

A clean project layout for the examples below looks like this:

salespipeline/

salespipeline/

init.py

assets.py

resources.py

pyproject.toml

The @asset Decorator and Software-Defined Assets

A software-defined asset is a Python function decorated with @asset. The function computes the asset's value, and its return value is handed to an IO manager for persistence. The function name becomes the asset's key.

from dagster import asset

import pandas as pd

@asset

def rawsales() -> pd.DataFrame:

"""Extract raw sales records from a source CSV."""

return pd.readcsv("https://example.com/sales.csv")

Dependencies between assets are expressed by adding a parameter whose name matches an upstream asset key. Dagster injects the upstream asset's value at runtime.

@asset

def cleanedsales(rawsales: pd.DataFrame) -> pd.DataFrame:

"""Transform: drop nulls and normalize column names."""

df = rawsales.dropna(subset=["orderid", "amount"])

Related Articles

dlt Tutorial: Python-First Data Ingestion Pipelines

Membangun Pipeline EL Berbasis Python dengan dlt (data load tool) Sebagian besar tim data menghabiskan waktu yang tidak ...

Complete Prefect Tutorial: Modern Workflow Orchestration for ML

Tutorial Lengkap Prefect: Modern Workflow Orchestration untuk ML Prefect adalah platform workflow orchestration modern y...

Complete Apache Airflow Tutorial: Workflow Orchestration for Data Pipelines

Tutorial Lengkap Apache Airflow: Workflow Orchestration untuk Data Pipelines Apache Airflow adalah platform open-source ...

Pandera Tutorial: Statistical Data Validation for DataFrames

Pandera: Validasi Data Statistik untuk DataFrame pandas dan Polars Pipeline data sering gagal tanpa suara. Sebuah kolom ...