dlt Tutorial: Python-First Data Ingestion Pipelines


By Ruby Abdullah · tutorial
dlt · Data Ingestion · ELT · Data Pipeline · Data Engineering · Python

Building Python-First EL Pipelines with dlt (data load tool)

Most data teams spend a surprising amount of time on the unglamorous part of the pipeline: pulling data out of an API and landing it in a warehouse with a sane schema. dlt (short for data load tool, from dltHub) is a lightweight Python library built specifically for that job. This tutorial walks through how it works and how it fits alongside the dbt and Dagster tooling you may already be using.

A quick note on naming, because the abbreviation is overloaded: the dlt discussed here is the open-source Python ingestion library from dltHub. It is not Delta Lake (the storage format), and it is not Databricks' Delta Live Tables, which is also abbreviated DLT. When we say dlt, we mean pip install dlt.

Where dlt Fits in the Modern Data Stack

The modern analytics stack is commonly described as EL + T: Extract and Load raw data first, then Transform it in the warehouse.

  • Extract + Load is dlt's territory. It reads from a source (REST API, database, file, generator) and writes the data into a destination warehouse, handling schema inference, normalization, and incremental state for you.
  • Transform is where dbt lives. Once raw tables land in the warehouse, dbt models reshape them into clean, tested, documented marts.
  • Orchestration is where Dagster (or Airflow, cron, GitHub Actions) lives. It decides when the dlt extract-load runs and when the dbt transform runs, and wires up the dependency between them.

So dlt does not replace dbt or Dagster. It fills the gap they deliberately leave open: getting raw data into the warehouse reliably, in plain Python, without standing up a heavy ingestion platform. Unlike connector platforms such as Fivetran or Airbyte — great when a connector exists, awkward when it does not — dlt lets you write a normal Python function that yields records, decorate it, and get a robust, schema-aware, incremental pipeline. If you can call an API in Python, you can load it with dlt.
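As a minimal sketch of that idea, the snippet below loads records from a plain generator into a local DuckDB file. The sample records, the function names, and the pipeline and dataset names are all hypothetical; a real pipeline would call an API inside the generator instead of reading an in-memory list.

```python
from typing import Any, Dict, Iterator

# Hypothetical in-memory records standing in for an API response.
SAMPLE_USERS = [
    {"id": 1, "name": "Ada", "address": {"city": "London"}},
    {"id": 2, "name": "Grace", "address": {"city": "New York"}},
]


def fetch_users() -> Iterator[Dict[str, Any]]:
    """Yield one record at a time, the way a paginated API client would."""
    for user in SAMPLE_USERS:
        yield user


def load_users():
    """Load the records into DuckDB (assumes `pip install "dlt[duckdb]"`)."""
    # Imported here so fetch_users() can be exercised without dlt installed.
    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="users_demo",  # hypothetical pipeline name
        destination="duckdb",
        dataset_name="raw",  # hypothetical dataset (schema) name
    )
    # dlt infers the table schema from the records; nested dicts such as
    # "address" are flattened into columns like address__city.
    return pipeline.run(
        fetch_users(),
        table_name="users",
        write_disposition="replace",
    )
```

Calling load_users() runs the extract and load steps and returns a load summary that can be printed. Passing the generator directly to pipeline.run is the shortest route; decorating the function as a resource attaches the table name and write disposition to the function itself.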

Installation and Project Setup

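dlt runs anywhere a supported Python 3 runs. Older releases supported Python 3.8, while recent releases require 3.9 or newer, so check the requirement for the version you install. Install the core package plus the extras for your destination.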

# Core library
pip install dlt

# With a specific destination's dependencies
pip install "dlt[duckdb]"     # local analytics, great for development
pip install "dlt[bigquery]"   # Google BigQuery
pip install "dlt[postgres]"   # PostgreSQL
pip install "dlt[snowflake]"  # Snowflake

It is good practice to work inside a virtual environment.

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install "dlt[duckdb]"
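To confirm the environment is set up as intended, a short standard-library check like the one below can help. It is only a sketch: the helper names are made up for illustration, and it reports whether a virtual environment is active and which versions of dlt and duckdb are installed, if any.

```python
import sys
from importlib import metadata
from typing import Optional


def in_virtualenv() -> bool:
    """True when the interpreter's prefix differs from the base installation."""
    return sys.prefix != sys.base_prefix


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("virtualenv active:", in_virtualenv())
print("dlt version:", installed_version("dlt") or "not installed")
print("duckdb version:", installed_version("duckdb") or "not installed")
```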

Scaffolding with dlt init

dlt init generates a starter pipeline wired to a source and a destination. The general form is dlt init <source> <destination>.

# Scaffold the github source with DuckDB as the destination
dlt init github duckdb

This creates a project layout similar to:

.
├── .dlt/
│   ├── config.toml      # non-secret config (dataset names, options)
│   └── secrets.toml     # credentials (gitignored)
├── github_pipeline.py   # your pipeline script
└── requirements.txt

The .dlt/secrets.toml file is added to .gitignore automatically. Never commit it.

Core Concepts: Resource, Source, Pipeline

dlt has three building blocks worth internalizing before writing code.

  • Resource (@dlt.resource): a function that yields data — typically rows or pages of records. Each resource maps to one table in the destination.
  • Source (@dlt.source): a function that groups several related resources together (for example, all endpoints of one API). It returns the resources it manages.
  • Pipeline (dlt.pipeline(...)): the object that connects a source to a destination, runs the load, and tracks state (schema, incremental cursors, run history).
