dlt Tutorial: Python-First Data Ingestion Pipelines
Building Python-First EL Pipelines with dlt (data load tool)
Most data teams spend a surprising amount of time on the unglamorous part of the pipeline: pulling data out of an API and landing it in a warehouse with a sane schema. dlt (short for data load tool, from dltHub) is a lightweight Python library built specifically for that job. This tutorial walks through how it works and how it fits alongside the dbt and Dagster tooling you may already be using.
A quick note on naming, because the abbreviation is overloaded: the dlt discussed here is the open-source Python ingestion library from dltHub. It is not Delta Lake (the storage format) and not Databricks' Delta Live Tables, which is also abbreviated DLT. When we say dlt, we mean pip install dlt.
Where dlt Fits in the Modern Data Stack
The modern analytics stack is commonly described as EL + T: Extract and Load raw data first, then Transform it in the warehouse.
Extract + Load is dlt's territory. It reads from a source (REST API, database, file, generator) and writes the data into a destination warehouse, handling schema inference, normalization, and incremental state for you.
Transform is where dbt lives. Once raw tables land in the warehouse, dbt models reshape them into clean, tested, documented marts.
Orchestration is where Dagster (or Airflow, cron, GitHub Actions) lives. It decides when the dlt extract-load runs and when the dbt transform runs, and wires up the dependency between them.
So dlt does not replace dbt or Dagster. It fills the gap they deliberately leave open: getting raw data into the warehouse reliably, in plain Python, without standing up a heavy ingestion platform. Unlike connector platforms such as Fivetran or Airbyte — great when a connector exists, awkward when it does not — dlt lets you write a normal Python function that yields records, decorate it, and get a robust, schema-aware, incremental pipeline. If you can call an API in Python, you can load it with dlt.
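As a minimal sketch of that last claim, the example below starts from a plain generator and hands it to a pipeline. The names fetch_users, SAMPLE_USERS, users_demo, and raw are hypothetical, the sample records stand in for a real API response, and load_users assumes the DuckDB extra (pip install "dlt[duckdb]") is installed.

```python
from typing import Any, Dict, Iterator

# Hypothetical sample records standing in for an API response.
SAMPLE_USERS = [
    {"id": 1, "name": "Ada", "signup": "2024-01-05"},
    {"id": 2, "name": "Grace", "signup": "2024-02-11"},
]


def fetch_users() -> Iterator[Dict[str, Any]]:
    """Plain Python generator: yields one record at a time."""
    for user in SAMPLE_USERS:
        yield user


def load_users():
    """Load the generator's records into a local DuckDB file via dlt."""
    # Imported lazily so fetch_users() stays usable without dlt installed.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="users_demo",  # hypothetical pipeline name
        destination="duckdb",
        dataset_name="raw",          # hypothetical dataset (schema) name
    )
    # dlt infers the table schema from the yielded dictionaries.
    return pipeline.run(
        fetch_users(),
        table_name="users",
        write_disposition="replace",
    )
```

Calling print(load_users()) runs the load and prints a summary of what was written. Passing the generator straight to pipeline.run with a table_name is the shortest path; decorating the function with @dlt.resource expresses the same thing declaratively.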
Installation and Project Setup
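dlt runs anywhere a supported Python 3 interpreter runs. Older releases supported Python 3.8, while recent releases require 3.9 or newer, so check the requirements of the release you install. Install the core package plus the extras for your destination.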
# Core library
pip install dlt
# With a specific destination's dependencies
pip install "dlt[duckdb]" # local analytics, great for development
pip install "dlt[bigquery]" # Google BigQuery
pip install "dlt[postgres]" # PostgreSQL
pip install "dlt[snowflake]" # Snowflake
It is good practice to work inside a virtual environment. dlt reads credentials from .dlt/secrets.toml; when a project is scaffolded with dlt init, the generated .gitignore already excludes that file. Never commit it.
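After installing, a quick sanity check can confirm that the active interpreter is inside a virtual environment and can see the package. This is a small sketch using only the standard library; the helper names in_virtualenv and installed_version are made up for illustration.

```python
import sys
from importlib import metadata
from typing import Optional


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("virtual environment:", in_virtualenv())
    print("dlt version:", installed_version("dlt"))
```

If the second line prints None, the install most likely went into a different environment than the one currently active.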
Core Concepts: Resource, Source, Pipeline
dlt has three building blocks worth internalizing before writing code.
Resource (@dlt.resource): a function that yields data — typically rows or pages of records. Each resource maps to one table in the destination.
Source (@dlt.source): a function that groups several related resources together (for example, all endpoints of one API). It returns the resources it manages.
Pipeline (dlt.pipeline(...)): the object that connects a source to a destination, runs the load, and tracks state (schema, incremental cursors, run history).