Apache Iceberg: An Engine-Agnostic Table Format for the Lakehouse

Apache Iceberg is an open table format that turns a directory of Parquet (or ORC/Avro) files on object storage into a real table with snapshots, schema evolution, and safe concurrent writes. What sets it apart from other lakehouse formats is its catalog-centric design and hidden partitioning: the table is defined by a catalog pointer to a metadata tree, and partitioning is a property of the table rather than something queries must spell out. This tutorial works through Iceberg from first principles using PyIceberg as the primary hands-on path, then shows the same table from Spark SQL and notes how DuckDB and Trino read it.

What Iceberg Is and Why a Table Format Matters

A plain data lake is just files in a bucket. Parquet gives you efficient columnar storage, but a folder of Parquet files is not a table: there is no atomic commit, no agreed-upon schema, no way to know which files are current, and no safe way for two jobs to write at once. A reader scanning the directory while a writer is mid-flight can pick up half-written data.

A table format solves this by adding a layer of metadata on top of the data files. That metadata answers three questions precisely:

Which files make up the table right now (and at any past point in time)?
What is the schema, and how has it changed?
How is the data laid out so a query can skip files it does not need?

Iceberg's answer is a tree of immutable metadata files plus a catalog that holds a single pointer to the current top of that tree. Swapping that pointer atomically is what makes a commit. Because every metadata file is immutable, readers always see a consistent snapshot and never observe a partial write.
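The pointer swap can be pictured as a compare-and-swap. The sketch below is a toy in-memory model, not PyIceberg's or any real catalog's API: `ToyCatalog`, `CommitConflict`, and the metadata file names are hypothetical, chosen only to show why a writer working from a stale pointer is rejected.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table pointer moved after the writer read it."""


class ToyCatalog:
    """Holds one pointer per table; the swap is the only mutable step."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        # Readers resolve the pointer once and then read immutable files.
        return self._pointers.get(table)

    def swap(self, table, expected, new):
        # Compare-and-swap: succeed only if nobody committed in between.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer is no longer {expected!r}")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.swap("db.events", None, "v1.metadata.json")  # table creation

# Two writers both start from the same base pointer.
base = catalog.current("db.events")
catalog.swap("db.events", base, "v2-writer-a.metadata.json")  # first one wins

try:
    catalog.swap("db.events", base, "v2-writer-b.metadata.json")
    conflict = False
except CommitConflict:
    conflict = True  # the second writer must re-read the pointer and retry

print(catalog.current("db.events"), conflict)
```

In this model the losing writer has corrupted nothing: its new metadata file is simply never referenced. It can re-read the pointer and try again.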

How Iceberg Differs From Delta Lake and Hudi

All three are open table formats over columnar files, but they make different design choices:

Iceberg tracks state through a catalog pointer and a tree of metadata/manifest files. Partitioning is hidden (declared on the table and applied automatically), and Iceberg is explicitly engine-agnostic, with a stable spec implemented by Spark, Flink, Trino, DuckDB, Snowflake, and the pure-Python PyIceberg.
Delta Lake tracks state through an ordered JSON transaction log (the _delta_log directory) co-located with the data. Partitioning is by physical directory layout, and the ecosystem historically centered on Spark (though delta-rs broadened it).
Hudi focuses on fast upserts and incremental streams, with a timeline of commits and a strong record-key/primary-key model. It is well suited to CDC-heavy workloads.

The practical takeaway: choose Iceberg when you want multiple engines to read and write the same table through a shared catalog, and when you want partitioning and schema changes to be metadata operations rather than data rewrites.
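To make hidden partitioning concrete, here is a minimal sketch in plain Python. It is a conceptual model, not Iceberg's implementation: the `day_transform` helper, the `event_day` field, and the file paths are all hypothetical. The point it illustrates is that the query filters on the timestamp column only, and the planner derives the partition range from that filter.

```python
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Derive the partition value from the source column (toy day transform)."""
    return ts.date()


# Hypothetical data files, each tagged with the partition value the writer derived.
data_files = [
    {"path": "data/00001.parquet", "event_day": date(2024, 5, 1)},
    {"path": "data/00002.parquet", "event_day": date(2024, 5, 2)},
    {"path": "data/00003.parquet", "event_day": date(2024, 5, 3)},
]


def plan_scan(files, ts_from: datetime, ts_to: datetime):
    """Keep files whose day partition may hold rows between ts_from and ts_to.

    The caller never mentions event_day; the filter on the timestamp is
    translated into a range of partition values. The range is inclusive on
    both ends, so it may keep an extra file but never drops a needed one.
    """
    day_from = day_transform(ts_from)
    day_to = day_transform(ts_to)
    return [f["path"] for f in files if day_from <= f["event_day"] <= day_to]


selected = plan_scan(
    data_files,
    datetime(2024, 5, 2, 6, 0),
    datetime(2024, 5, 2, 18, 0),
)
print(selected)
```

Because the transform lives in table metadata rather than in the query text, changing it later (say, from daily to hourly) would not require rewriting the queries in this model.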

Iceberg Architecture

An Iceberg table is a layered tree. From the top down:

Catalog (points to the current metadata file per table)
└─ metadata.json (table schema, partition specs, snapshot list, current snapshot id)
   └─ manifest list (one Avro file per snapshot; lists the manifests in that snapshot)
      └─ manifest files (Avro; each lists data files + per-file stats and partition values)
         └─ data files (Parquet / ORC / Avro holding the actual rows)

Reading top to bottom:
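A reader resolves a table in that same order: it asks the catalog for the current metadata file, picks a snapshot, opens that snapshot's manifest list, and uses the per-file stats in each manifest to decide which data files to open. The sketch below models that walk with plain dictionaries standing in for the files at each layer. Every path, the `min_id`/`max_id` stats, and the `plan_files` helper are hypothetical; real manifests are Avro files, and PyIceberg or an engine does this planning for you.

```python
# Toy in-memory stand-ins for the files at each layer (all names hypothetical).
catalog = {"db.events": "metadata/v2.metadata.json"}

metadata_files = {
    "metadata/v2.metadata.json": {
        "current_snapshot_id": 2,
        # snapshot id -> manifest list for that snapshot
        "snapshots": {
            1: "metadata/snap-1.avro",
            2: "metadata/snap-2.avro",
        },
    }
}

# Snapshot 2 reuses the manifest from snapshot 1 and adds a new one.
manifest_lists = {
    "metadata/snap-1.avro": ["metadata/m1.avro"],
    "metadata/snap-2.avro": ["metadata/m1.avro", "metadata/m2.avro"],
}

# Each manifest lists data files together with per-file column stats.
manifests = {
    "metadata/m1.avro": [
        {"path": "data/a.parquet", "min_id": 1, "max_id": 100},
    ],
    "metadata/m2.avro": [
        {"path": "data/b.parquet", "min_id": 101, "max_id": 200},
        {"path": "data/c.parquet", "min_id": 201, "max_id": 300},
    ],
}


def plan_files(table, wanted_id, snapshot_id=None):
    """Walk catalog -> metadata -> manifest list -> manifests, pruning by stats."""
    metadata = metadata_files[catalog[table]]
    if snapshot_id is None:
        snapshot_id = metadata["current_snapshot_id"]
    manifest_list = manifest_lists[metadata["snapshots"][snapshot_id]]
    selected = []
    for manifest_path in manifest_list:
        for entry in manifests[manifest_path]:
            # Skip files whose id range cannot contain the wanted value.
            if entry["min_id"] <= wanted_id <= entry["max_id"]:
                selected.append(entry["path"])
    return selected


current = plan_files("db.events", wanted_id=150)
as_of_first = plan_files("db.events", wanted_id=150, snapshot_id=1)
print(current, as_of_first)
```

Passing an older `snapshot_id` is how this toy model expresses time travel. The same immutable files are reused; only the starting point of the walk changes.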
