Apache Iceberg: An Engine-Agnostic Table Format for the Lakehouse
Apache Iceberg is an open table format that turns a directory of Parquet (or ORC/Avro) files on object storage into a real table with snapshots, schema evolution, and safe concurrent writes. What sets it apart from other lakehouse formats is its catalog-centric design and hidden partitioning: the table is defined by a catalog pointer to a metadata tree, and partitioning is a property of the table rather than something queries must spell out. This tutorial works through Iceberg from first principles using PyIceberg as the primary hands-on path, then shows the same table from Spark SQL and notes how DuckDB and Trino read it.
What Iceberg Is and Why a Table Format Matters
A plain data lake is just files in a bucket. Parquet gives you efficient columnar storage, but a folder of Parquet files is not a table: there is no atomic commit, no agreed-upon schema, no way to know which files are current, and no safe way for two jobs to write at once. A reader scanning the directory while a writer is mid-flight can pick up half-written data.
A table format solves this by adding a layer of metadata on top of the data files. That metadata answers three questions precisely:
- Which files make up the table right now (and at any past point in time)?
- What is the schema, and how has it changed?
- How is the data laid out so a query can skip files it does not need?
Iceberg's answer is a tree of immutable metadata files plus a catalog that holds a single pointer to the current top of that tree. Swapping that pointer atomically is what makes a commit. Because every metadata file is immutable, readers always see a consistent snapshot and never observe a partial write.
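The pointer swap described above can be sketched as a compare-and-swap. This is a minimal illustration using only the Python standard library, not how any real Iceberg catalog is implemented: `ToyCatalog`, `CommitConflict`, the table name, and the metadata file names are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when the catalog pointer moved since the writer last read it."""


class ToyCatalog:
    """Hypothetical in-memory catalog holding one metadata pointer per table."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        # Readers resolve the table by following this single pointer.
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        # Compare-and-swap: succeed only if nobody committed in between.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer is no longer {expected!r}")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.swap("db.events", None, "v1.metadata.json")  # create the table

# Two writers start from the same base and each prepare a new immutable metadata file.
base = catalog.current("db.events")
catalog.swap("db.events", base, "v2-writer-a.metadata.json")  # writer A commits first

try:
    catalog.swap("db.events", base, "v2-writer-b.metadata.json")  # writer B is stale
    conflict = False
except CommitConflict:
    conflict = True  # B must re-read the pointer and retry on top of A's commit

print(catalog.current("db.events"), conflict)
```

Because the old metadata file is never modified, a reader that resolved the pointer before writer A committed keeps reading a complete, consistent snapshot.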
How Iceberg Differs From Delta Lake and Hudi
All three are open table formats over columnar files, but they make different design choices:
- Iceberg tracks state through a catalog pointer and a tree of metadata/manifest files. Partitioning is hidden — declared on the table and applied automatically — and Iceberg is explicitly engine-agnostic, with a stable spec implemented by Spark, Flink, Trino, DuckDB, Snowflake, and the pure-Python PyIceberg.
- Delta Lake tracks state through an ordered JSON transaction log (_delta_log) co-located with the data. Partitioning is by physical directory layout, and the ecosystem historically centered on Spark (though delta-rs broadened it).
- Hudi focuses on fast upserts and incremental streams, with a timeline of commits and a strong record-key/primary-key model. It is well suited to CDC-heavy workloads.
The practical takeaway: choose Iceberg when you want multiple engines to read and write the same table through a shared catalog, and when you want partitioning and schema changes to be metadata operations rather than data rewrites.
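To make the hidden-partitioning idea from the comparison above more concrete, here is a rough sketch in plain Python. It assumes a table partitioned by a day transform on a timestamp column; `day_transform`, `write`, `plan_scan`, the in-memory file index, and the file names are hypothetical stand-ins, not Iceberg or PyIceberg APIs.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Stand-in for a day(ts) partition transform declared on the table."""
    return ts.date()


# Hypothetical file index: partition value -> data files in that partition.
files_by_partition = {}


def write(path: str, ts: datetime) -> None:
    # The writer never names a partition; the table's transform derives it.
    files_by_partition.setdefault(day_transform(ts), []).append(path)


write("f1.parquet", datetime(2024, 1, 1, 9, 30))
write("f2.parquet", datetime(2024, 1, 1, 23, 59))
write("f3.parquet", datetime(2024, 1, 2, 0, 5))
write("f4.parquet", datetime(2024, 1, 3, 12, 0))


def plan_scan(ts_from: datetime, ts_to: datetime) -> list:
    """Files that may hold rows with ts_from <= ts < ts_to.

    The query filters on the raw timestamp; the planner applies the same
    transform to the bounds and prunes whole partitions.
    """
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return sorted(
        path
        for day, paths in files_by_partition.items()
        if lo <= day <= hi
        for path in paths
    )


selected = plan_scan(datetime(2024, 1, 1, 22, 0), datetime(2024, 1, 2, 6, 0))
print(selected)
```

The query only mentions the timestamp range, yet the file for January 3 is skipped. Pruning here is per partition, so a file from January 1 with earlier rows is still selected and would be filtered at the row level later.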
Iceberg Architecture
An Iceberg table is a layered tree. From the top down:
Catalog (points to the current metadata file per table)
└─ metadata.json (table schema, partition specs, snapshot list, current snapshot id)
   └─ manifest list (one Avro file per snapshot; lists the manifests in that snapshot)
      └─ manifest files (Avro; each lists data files + per-file stats and partition values)
         └─ data files (Parquet / ORC / Avro holding the actual rows)
Reading top to bottom: