Delta Lake: Reliable Lakehouse Storage with ACID Transactions
Data lakes built on object storage are cheap and scalable, but plain file formats give you no guarantees: a half-finished write can leave readers with corrupt data, and concurrent jobs can clobber each other. Delta Lake fixes this by adding a transaction log on top of Parquet, bringing ACID transactions, schema enforcement, and time travel to the lake. This tutorial walks through Delta Lake from first principles using two practical paths: the lightweight Python deltalake library (no Spark required) and Apache Spark with delta-spark.
What Delta Lake Is
Delta Lake is an open table format. A Delta table is not a single file but a directory that contains:
- One or more Parquet data files holding the actual rows.
- A transaction log directory named _delta_log, containing ordered JSON commit files (00000000000000000000.json, 00000000000000000001.json, and so on) plus periodic Parquet checkpoints.
Every change to the table is recorded as an atomic commit in the log. Readers consult the log to learn exactly which Parquet files make up the current version of the table. Because the log is the source of truth, a reader never sees a partially written file, and two writers can detect that they conflicted.
This is the core of the lakehouse idea: keep the low cost and openness of a data lake (Parquet on S3, ADLS, GCS, or local disk), but add the reliability and transactional semantics you would expect from a data warehouse.
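To make the conflict detection described above concrete, here is a minimal sketch using only the Python standard library. It assumes a local filesystem and uses exclusive file creation so that only one writer can claim a given version number. The helper name try_commit and the trimmed-down add actions are hypothetical, and real implementations are more careful (for example, they typically stage the commit and publish it with an atomic rename or a put-if-absent operation so readers never see a half-written commit file).

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to publish `actions` as commit `version` (hypothetical helper).

    Returns True if this writer claimed the version, False if another
    writer had already committed it.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")  # 20-digit, zero-padded
    try:
        # O_EXCL makes creation fail if the file already exists, so at most
        # one writer can ever create a given version file.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")  # one JSON action per line
    return True


with tempfile.TemporaryDirectory() as table_dir:
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir)

    # Two writers race for version 0; only the first one succeeds.
    writer_a = try_commit(log_dir, 0, [{"add": {"path": "part-a.parquet"}}])
    writer_b = try_commit(log_dir, 0, [{"add": {"path": "part-b.parquet"}}])

    # The losing writer notices the conflict and retries on the next version.
    retry_b = try_commit(log_dir, 1, [{"add": {"path": "part-b.parquet"}}])

    committed = sorted(os.listdir(log_dir))
    print(writer_a, writer_b, retry_b, committed)
```

In this sketch the losing writer simply retries at the next version. A real writer would first read the commit it lost to and check whether that change actually conflicts with its own before retrying.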
Why Not Just Parquet?
Plain Parquet is excellent columnar storage, but it has no notion of a table version or a commit. If a job writes ten Parquet files and crashes after five, a reader scanning the directory will pick up an inconsistent mix. There is no atomic "append", no safe concurrent update, no way to roll back, and schema changes are entirely manual. Delta Lake keeps Parquet as the storage layer and layers transactional metadata on top.
The Transaction Log in Detail
A typical Delta directory looks like this:
orders/
    _delta_log/
        00000000000000000000.json
        00000000000000000001.json
        00000000000000000002.json
        _last_checkpoint
    part-00000-....snappy.parquet
    part-00001-....snappy.parquet
Each JSON commit describes actions: files added (add), files removed (remove), metadata changes, and protocol changes. Version 0 is the table creation; each subsequent commit increments the version by one. This monotonic version number is what makes time travel possible, and the remove actions are what let VACUUM later reclaim disk space.
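The way a reader turns add and remove actions into a list of live files can be sketched with the standard library alone. The example below is a simplified illustration, not how any particular Delta reader is implemented: it ignores checkpoints, metadata and protocol actions, and the extra fields real add actions carry. The function name live_files, the helper write_commit, and the file names are all hypothetical.

```python
import json
import os
import tempfile
from typing import Optional


def write_commit(log_dir: str, version: int, actions: list) -> None:
    """Write a toy commit file with one JSON action per line."""
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))


def live_files(log_dir: str, as_of_version: Optional[int] = None) -> set:
    """Replay JSON commits in order and return the data files that are live."""
    commits = sorted(n for n in os.listdir(log_dir) if n.endswith(".json"))
    files = set()
    for name in commits:
        version = int(name.split(".")[0])
        if as_of_version is not None and version > as_of_version:
            break  # stop replaying once past the requested version
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                if not line.strip():
                    continue
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


with tempfile.TemporaryDirectory() as table_dir:
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir)

    # Version 0 creates the table with two files, version 1 appends one,
    # and version 2 rewrites file "a" into a new file "d".
    write_commit(log_dir, 0, [{"add": {"path": "a.parquet"}},
                              {"add": {"path": "b.parquet"}}])
    write_commit(log_dir, 1, [{"add": {"path": "c.parquet"}}])
    write_commit(log_dir, 2, [{"remove": {"path": "a.parquet"}},
                              {"add": {"path": "d.parquet"}}])

    current = live_files(log_dir)
    at_v0 = live_files(log_dir, as_of_version=0)
    print(sorted(current), sorted(at_v0))
```

Stopping the replay at an earlier version is the essence of time travel. Note that a.parquet is still on disk after version 2 and is only dropped from the live set, which is why old versions stay readable until the removed files are physically deleted.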
Path 1: The Python deltalake Library (No Spark)
The deltalake package (the delta-rs project) is a native Rust implementation with Python bindings. It reads and writes Delta tables without a JVM or Spark cluster, which makes it ideal for single-node ETL, notebooks, and embedding Delta into existing pandas or PyArrow pipelines.
Installation
pip install "deltalake>=0.18" pandas pyarrow
Creating and Appending to a Table
We will model an orders table throughout this tutorial.
import pandas as pd
from deltalake import write_deltalake, DeltaTable

tablepath = "data/orders"

orders = pd.DataFrame(
    {
        "orderid": [1, 2, 3],
        "customer": ["alice", "bob", "carol"],
        "amount": [120.0, 80.5, 45.0],
        "status": ["paid", "paid", "pending"],
        "region": ["west", "east", "west"],
    }
)

# First write creates the table at version 0 (assuming data/orders does not exist yet).
write_deltalake(tablepath, orders, mode="overwrite")
To add more rows without rewriting existing data, use mode="append":
neworders = pd.DataFrame(
    {
        "orderid": [4, 5],
        "customer": ["dave", "erin"],
        "amount": [200.0, 15.0],
        "status": ["paid", "paid"],
        "region": ["east", "west"],
    }
)

# Append adds new data files and a new commit; existing files are untouched.
write_deltalake(tablepath, neworders, mode="append")
Reading the Table
dt = DeltaTable(tablepath)
# As a pandas DataFrame
df = dt.to_pandas()