Delta Lake Tutorial: Reliable Lakehouse Storage with ACID Transactions

By Ruby Abdullah · tutorial
Tags: Delta Lake, Lakehouse, ACID, Data Engineering, Spark, Parquet

Data lakes built on object storage are cheap and scalable, but plain file formats give you no guarantees: a half-finished write can leave readers with corrupt data, and concurrent jobs can clobber each other. Delta Lake fixes this by adding a transaction log on top of Parquet, bringing ACID transactions, schema enforcement, and time travel to the lake. This tutorial walks through Delta Lake from first principles using two practical paths: the lightweight Python deltalake library (no Spark required) and Apache Spark with delta-spark.

What Delta Lake Is

Delta Lake is an open table format. A Delta table is not a single file but a directory that contains:

  • One or more Parquet data files holding the actual rows.
  • A transaction log directory named _delta_log, containing ordered JSON commit files (00000000000000000000.json, 00000000000000000001.json, and so on) plus periodic Parquet checkpoints.

Every change to the table is recorded as an atomic commit in the log. Readers consult the log to learn exactly which Parquet files make up the current version of the table. Because the log is the source of truth, a reader never sees a partially written file, and two writers can detect that they conflicted.
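The conflict detection described above can be illustrated with a minimal sketch. It assumes a local filesystem, where exclusive file creation (open with mode "x") stands in for the atomic "create only if absent" step that a real implementation relies on. The try_commit helper and the action payloads are hypothetical and far simpler than real Delta commits.

```python
import json
import tempfile
from pathlib import Path


def try_commit(log_dir: Path, version: int, actions: list) -> bool:
    """Try to publish `actions` as commit `version`; return False on conflict."""
    commit_file = log_dir / f"{version:020d}.json"
    try:
        # Mode "x" fails if the file already exists, so only one writer
        # can ever create a given version number.
        with open(commit_file, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


log_dir = Path(tempfile.mkdtemp()) / "_delta_log"
log_dir.mkdir()

# Two writers both saw an empty log and race to write version 0.
writer_a_ok = try_commit(log_dir, 0, [{"add": {"path": "part-a.parquet"}}])
writer_b_ok = try_commit(log_dir, 0, [{"add": {"path": "part-b.parquet"}}])
print(writer_a_ok, writer_b_ok)  # True False

# The losing writer re-reads the log and retries at the next version.
writer_b_retry_ok = try_commit(log_dir, 1, [{"add": {"path": "part-b.parquet"}}])
commit_names = sorted(p.name for p in log_dir.iterdir())
print(commit_names)
```

The losing writer never overwrites the winner's commit. It learns that version 0 is taken and can decide whether its change is still valid on top of the new table state.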

This is the core of the lakehouse idea: keep the low cost and openness of a data lake (Parquet on S3, ADLS, GCS, or local disk), but add the reliability and transactional semantics you would expect from a data warehouse.

Why Not Just Parquet?

Plain Parquet is excellent columnar storage, but it has no notion of a table version or a commit. If a job writes ten Parquet files and crashes after five, a reader scanning the directory will pick up an inconsistent mix. There is no atomic "append", no safe concurrent update, no way to roll back, and schema changes are entirely manual. Delta Lake keeps Parquet as the storage layer and layers transactional metadata on top.
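To make the failure mode concrete, here is a small simulation using only the standard library. Empty placeholder files stand in for Parquet files, and one JSON file stands in for a log commit. The file names are hypothetical, and the log format is reduced to bare "add" actions.

```python
import json
import tempfile
from pathlib import Path

table = Path(tempfile.mkdtemp())
log_dir = table / "_delta_log"
log_dir.mkdir()

# Commit 0: two data files are written and then recorded in the log.
first_commit = ["part-0.parquet", "part-1.parquet"]
for name in first_commit:
    (table / name).touch()
commit_lines = [json.dumps({"add": {"path": name}}) for name in first_commit]
(log_dir / "00000000000000000000.json").write_text("\n".join(commit_lines))

# A second job writes some of its files, then "crashes" before committing.
for name in ["part-2.parquet", "part-3.parquet"]:
    (table / name).touch()

# Plain-Parquet reader: scans the directory and also picks up the orphaned files.
scanned = sorted(p.name for p in table.glob("*.parquet"))

# Log-based reader: trusts only files referenced by committed "add" actions.
committed = []
for commit in sorted(log_dir.glob("*.json")):
    for line in commit.read_text().splitlines():
        action = json.loads(line)
        if "add" in action:
            committed.append(action["add"]["path"])

print(scanned)    # all four files, including the two from the crashed job
print(committed)  # only part-0.parquet and part-1.parquet
```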

The Transaction Log in Detail

A typical Delta directory looks like this:

orders/
  _delta_log/
    00000000000000000000.json
    00000000000000000001.json
    00000000000000000002.json
    _last_checkpoint
  part-00000-....snappy.parquet
  part-00001-....snappy.parquet

Each JSON commit describes actions: files added (add), files removed (remove), metadata changes, and protocol changes. Version 0 is the table creation; each subsequent commit increments the version by one. This monotonic version number is what makes time travel possible, and the remove actions are what let VACUUM later reclaim disk space.
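The way add and remove actions combine into a table version can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the actual reader logic. Commits are held in an in-memory list instead of JSON files, the file names are hypothetical, and real actions carry more fields (sizes, statistics, timestamps).

```python
def files_at_version(commits: list, version: int) -> set:
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# commits[i] holds the actions of version i.
commits = [
    # v0: table creation plus the first data file
    [{"metaData": {"id": "orders"}}, {"add": {"path": "part-a.parquet"}}],
    # v1: append
    [{"add": {"path": "part-b.parquet"}}],
    # v2: rewrite that replaces part-a with part-c
    [{"remove": {"path": "part-a.parquet"}}, {"add": {"path": "part-c.parquet"}}],
]

v1_files = sorted(files_at_version(commits, 1))
v2_files = sorted(files_at_version(commits, 2))
print(v1_files)  # ['part-a.parquet', 'part-b.parquet']
print(v2_files)  # ['part-b.parquet', 'part-c.parquet']

# Files that were removed and are absent from the latest version are the
# candidates a cleanup such as VACUUM could eventually delete.
removed = {a["remove"]["path"] for c in commits for a in c if "remove" in a}
vacuum_candidates = sorted(removed - files_at_version(commits, len(commits) - 1))
print(vacuum_candidates)  # ['part-a.parquet']
```

Reading an older version is just a replay that stops early. This is why time travel to version 1 still works only as long as part-a.parquet has not been physically deleted.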

Path 1: The Python deltalake Library (No Spark)

The deltalake package (the delta-rs project) is a native Rust implementation with Python bindings. It reads and writes Delta tables without a JVM or Spark cluster, which makes it ideal for single-node ETL, notebooks, and embedding Delta into existing pandas or PyArrow pipelines.

Installation

pip install "deltalake>=0.18" pandas pyarrow

Creating and Appending to a Table

We will model an orders table throughout this tutorial.

import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "data/orders"

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "customer": ["alice", "bob", "carol"],
        "amount": [120.0, 80.5, 45.0],
        "status": ["paid", "paid", "pending"],
        "region": ["west", "east", "west"],
    }
)

# First write creates the table at version 0.
write_deltalake(table_path, orders, mode="overwrite")

To add more rows without rewriting existing data, use mode="append":

new_orders = pd.DataFrame(
    {
        "order_id": [4, 5],
        "customer": ["dave", "erin"],
        "amount": [200.0, 15.0],
        "status": ["paid", "paid"],
        "region": ["east", "west"],
    }
)

write_deltalake(table_path, new_orders, mode="append")

Reading the Table

dt = DeltaTable(table_path)

# As a pandas DataFrame
df = dt.to_pandas()
