Apache Iceberg Tutorial: Open Table Format for the Data Lakehouse


By Ruby Abdullah · tutorial
Apache Iceberg · Lakehouse · Table Format · Data Engineering · PyIceberg · Spark

Apache Iceberg: An Engine-Agnostic Table Format for the Lakehouse

Apache Iceberg is an open table format that turns a directory of Parquet (or ORC/Avro) files on object storage into a real table with snapshots, schema evolution, and safe concurrent writes. What sets it apart from other lakehouse formats is its catalog-centric design and hidden partitioning: the table is defined by a catalog pointer to a metadata tree, and partitioning is a property of the table rather than something queries must spell out. This tutorial works through Iceberg from first principles using PyIceberg as the primary hands-on path, then shows the same table from Spark SQL and notes how DuckDB and Trino read it.

What Iceberg Is and Why a Table Format Matters

A plain data lake is just files in a bucket. Parquet gives you efficient columnar storage, but a folder of Parquet files is not a table: there is no atomic commit, no agreed-upon schema, no way to know which files are current, and no safe way for two jobs to write at once. A reader scanning the directory while a writer is mid-flight can pick up half-written data.

A table format solves this by adding a layer of metadata on top of the data files. That metadata answers three questions precisely:

  • Which files make up the table right now (and at any past point in time)?
  • What is the schema, and how has it changed?
  • How is the data laid out so a query can skip files it does not need?

Iceberg's answer is a tree of immutable metadata files plus a catalog that holds a single pointer to the current top of that tree. Swapping that pointer atomically is what makes a commit. Because every metadata file is immutable, readers always see a consistent snapshot and never observe a partial write.
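The pointer swap can be illustrated with a toy in-memory catalog. This is a minimal sketch of the idea, not Iceberg's implementation: `ToyCatalog`, `commit`, `CommitConflict`, and the metadata file names are hypothetical, and a real catalog would perform the compare-and-swap in a database or metastore rather than under a Python lock.

```python
import threading


class CommitConflict(Exception):
    """Raised when the pointer moved since the writer last read it."""


class ToyCatalog:
    """Hypothetical stand-in for a catalog: one pointer per table."""

    def __init__(self):
        self._pointers = {}            # table name -> current metadata file
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        # Compare-and-swap: move the pointer only if nobody else moved it first.
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer is no longer {expected!r}")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.commit("db.events", expected=None, new="v1.metadata.json")

# Two writers both start from v1 and each prepare a new immutable metadata file.
base = catalog.current("db.events")
catalog.commit("db.events", expected=base, new="v2-writer-a.metadata.json")

try:
    catalog.commit("db.events", expected=base, new="v2-writer-b.metadata.json")
    writer_b_committed = True
except CommitConflict:
    # Writer B has to re-read the pointer and retry on top of writer A's commit.
    writer_b_committed = False
```

Readers that resolved `v1.metadata.json` before the swap keep reading a complete, unchanged snapshot, because nothing in the sketch ever modifies an existing metadata file.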

How Iceberg Differs From Delta Lake and Hudi

All three are open table formats over columnar files, but they make different design choices:

  • Iceberg tracks state through a catalog pointer and a tree of metadata/manifest files. Partitioning is hidden — declared on the table and applied automatically — and Iceberg is explicitly engine-agnostic, with a stable spec implemented by Spark, Flink, Trino, DuckDB, Snowflake, and the pure-Python PyIceberg.
  • Delta Lake tracks state through an ordered JSON transaction log (the _delta_log directory) co-located with the data. Partitioning is by physical directory layout, and the ecosystem historically centered on Spark (though delta-rs broadened it).
  • Hudi focuses on fast upserts and incremental streams, with a timeline of commits and a strong record-key/primary-key model. It is well suited to CDC-heavy workloads.

The practical takeaway: choose Iceberg when you want multiple engines to read and write the same table through a shared catalog, and when you want partitioning and schema changes to be metadata operations rather than data rewrites.
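To make hidden partitioning concrete, here is a small pure-Python sketch of the idea. It is not PyIceberg code: it assumes a table partitioned by a day transform on an `event_ts` column, and `day_transform`, `DataFile`, and `plan_scan` are hypothetical names. The caller filters only on the timestamp, and the planner derives which partition values to read.

```python
from dataclasses import dataclass
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform declared on the table: timestamp -> day."""
    return ts.date()


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_day: date  # recorded in table metadata for each file


# Illustrative files; the paths and dates are made up for this sketch.
FILES = [
    DataFile("data/00001.parquet", date(2024, 5, 1)),
    DataFile("data/00002.parquet", date(2024, 5, 2)),
    DataFile("data/00003.parquet", date(2024, 5, 3)),
]


def plan_scan(files, ts_from: datetime, ts_to: datetime):
    """Keep only files whose partition day can hold rows in [ts_from, ts_to]."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [f.path for f in files if lo <= f.partition_day <= hi]


# The predicate is on event_ts only; no partition column appears in the query.
selected = plan_scan(
    FILES,
    ts_from=datetime(2024, 5, 2, 6, 0),
    ts_to=datetime(2024, 5, 2, 18, 0),
)
print(selected)
```

Because the transform lives in table metadata, changing it later (say, from daily to hourly) would change how new files are tagged without requiring queries to be rewritten.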

Iceberg Architecture

An Iceberg table is a layered tree. From the top down:

Catalog  (points to the current metadata file per table)
└─ metadata.json  (table schema, partition specs, snapshot list, current snapshot id)
   └─ manifest list  (one Avro file per snapshot; lists the manifests in that snapshot)
      └─ manifest files  (Avro; each lists data files + per-file stats and partition values)
         └─ data files  (Parquet / ORC / Avro holding the actual rows)

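Reading top to bottom: the catalog names the table's current metadata.json, which records the current snapshot id; that snapshot's manifest list names its manifest files; and each manifest lists the data files that hold the rows, along with the per-file stats and partition values a query can use to skip files.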
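A minimal sketch of that top-down walk in plain Python may help. It models the layers as dataclasses rather than reading real Avro or JSON files, so every name here (`TableMetadata`, `Snapshot`, `Manifest`, `DataFile`, `plan_files`, and the `min_id`/`max_id` stats) is a hypothetical simplification, not an Iceberg or PyIceberg API.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    min_id: int  # per-file column stats, as a manifest would record them
    max_id: int


@dataclass
class Manifest:
    files: list  # DataFile entries


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # the Manifest objects that make up this snapshot


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict  # snapshot_id -> Snapshot


def plan_files(catalog, table, wanted_id, snapshot_id=None):
    """Return the data files that may contain rows with id == wanted_id."""
    metadata = catalog[table]                        # catalog -> metadata
    if snapshot_id is None:
        snapshot_id = metadata.current_snapshot_id
    snapshot = metadata.snapshots[snapshot_id]       # metadata -> snapshot
    paths = []
    for manifest in snapshot.manifest_list:          # manifest list -> manifests
        for f in manifest.files:                     # manifest -> data files
            if f.min_id <= wanted_id <= f.max_id:    # skip files using stats
                paths.append(f.path)
    return paths


m1 = Manifest([DataFile("a.parquet", 1, 100)])
m2 = Manifest([DataFile("b.parquet", 101, 200)])
metadata = TableMetadata(
    current_snapshot_id=2,
    # Snapshot 2 reuses manifest m1 unchanged and adds m2.
    snapshots={1: Snapshot(1, [m1]), 2: Snapshot(2, [m1, m2])},
)
catalog = {"db.events": metadata}

current = plan_files(catalog, "db.events", wanted_id=150)
as_of_first = plan_files(catalog, "db.events", wanted_id=150, snapshot_id=1)
```

Passing an older `snapshot_id` is the sketch's analogue of reading the table at a past point in time: the same walk runs against an earlier snapshot, so files added later are never visited.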
