Ibis Tutorial: The Portable Python DataFrame API Across Backends

# Ibis: API Dataframe Python yang Portabel di Banyak Backend Ibis adalah library dataframe Python yang memungkinkan Anda menulis kode analitik satu kali lalu menjalankannya di banyak mesin eksekusi,...

By Ruby Abdullah · · tutorial
IbisDataFrameSQLData EngineeringAnalyticsPython

Ibis: The Portable Python Dataframe API Across Many Backends

Ibis is a Python dataframe library that lets you write analytics code once and run it on many execution engines, from a local DuckDB instance on your laptop to BigQuery, Snowflake, or Spark in production. Instead of rewriting pandas logic into SQL when your data outgrows memory, you describe the transformation in Python, and Ibis compiles it to the dialect of whichever backend you connect to. This tutorial walks through the core ideas: deferred execution, backend portability, and a practical end-to-end analytics example.

The Problem Ibis Solves

Most data teams hit the same wall. You prototype an analysis in pandas because it is fast to write and easy to reason about. The dataset grows, no longer fits in memory, and now you have to translate that pandas code into SQL so it can run inside the data warehouse. The two implementations drift apart, bugs creep in during translation, and you maintain two versions of the same logic.

Ibis offers a single dataframe API that sits in front of the engine. You write expressions in Python, and:

  • The same code runs on a local engine (DuckDB) during development and on a warehouse (BigQuery, Snowflake, Postgres) in production.
  • The heavy computation happens inside the engine, close to the data, not in your Python process.
  • You avoid hand-translating pandas to SQL, and you can scale from a laptop to a warehouse by changing only the connection.

The mental model is closer to a query builder than to pandas. You are composing a description of a computation, not eagerly materializing intermediate results.

Deferred (Lazy) Execution

This is the single most important concept in Ibis. When you chain operations such as filter, select, and groupby, nothing is computed. You are building an expression tree. Execution only happens when you explicitly ask for results.

import ibis

con = ibis.duckdb.connect()

This builds an expression. No query runs yet.

orders = con.table("orders")

expr = (

orders

.filter(orders.status == "completed")

.groupby("country")

.aggregate(total=orders.amount.sum())

)

Still nothing has executed. expr is just a description.

print(type(expr)) #

Execution happens here, returning a pandas DataFrame.

df = expr.execute()

The trigger points that actually run the query are:

  • .execute() — returns a pandas DataFrame (the default).
  • .topandas() — explicit pandas output.
  • .topolars() — returns a Polars DataFrame.
  • .topyarrow() — returns a PyArrow Table.
  • .head().execute() — preview a small sample.

Because execution is deferred, Ibis can push the entire pipeline down to the engine as one optimized query. That is what makes it scale: you are not pulling raw data into Python and filtering it there.

Installation

Ibis is installed with one or more backend extras. Each backend you want to use is an optional dependency, which keeps the install lean.

# Core library plus the DuckDB backend (recommended starting point)

pip install 'ibis-framework[duckdb]'

Add other backends as needed

pip install 'ibis-framework[bigquery]'

pip install 'ibis-framework[postgres]'

pip install 'ibis-framework[snowflake]'

pip install 'ibis-framework[polars]'

Multiple backends at once

pip install 'ibis-framework[duckdb,postgres,bigquery]'

Verify the installation:

import ibis

print(ibis.version)

DuckDB is the recommended local backend. It runs in-process, needs no server, and supports the full range of SQL features Ibis compiles to, which makes it ideal for development and testing.

Connecting to Backends

A connection is the object that knows how to talk to an engine. The same expression API works regardless of which connection you build.

import ibis

DuckDB in-memory (great for experiments)

con = ibis.duckdb.connect()

DuckDB persisted to a file

con = ibis.duckdb.connect("analytics.ddb")

URL-style connection (works across backends)

con = ibis.connect("duckdb://analytics.ddb")

For warehouse backends, the connection carries credentials and project details. The expressions you build afterward are identical.

# BigQuery

con = ibis.bigquery.connect(

Related Articles

DuckDB: In-Process Analytical Database for Data Science

DuckDB: Database Analitik In-Process untuk Data Science DuckDB adalah database analitik in-process yang dirancang khusus...

PostgreSQL Advanced for ML Tutorial: Analytics and Feature Engineering

Tutorial 17: PostgreSQL Lanjutan untuk Machine Learning Daftar Isi Pendahuluan Prasyarat Window Functions untuk Rekayasa...

dlt Tutorial: Python-First Data Ingestion Pipelines

Membangun Pipeline EL Berbasis Python dengan dlt (data load tool) Sebagian besar tim data menghabiskan waktu yang tidak ...

Pandera Tutorial: Statistical Data Validation for DataFrames

Pandera: Validasi Data Statistik untuk DataFrame pandas dan Polars Pipeline data sering gagal tanpa suara. Sebuah kolom ...