Ibis: The Portable Python Dataframe API Across Many Backends
Ibis is a Python dataframe library that lets you write analytics code once and run it on many execution engines, from a local DuckDB instance on your laptop to BigQuery, Snowflake, or Spark in production. Instead of rewriting pandas logic into SQL when your data outgrows memory, you describe the transformation in Python, and Ibis compiles it to the dialect of whichever backend you connect to. This tutorial walks through the core ideas: deferred execution, backend portability, and a practical end-to-end analytics example.
The Problem Ibis Solves
Most data teams hit the same wall. You prototype an analysis in pandas because it is fast to write and easy to reason about. The dataset grows, no longer fits in memory, and now you have to translate that pandas code into SQL so it can run inside the data warehouse. The two implementations drift apart, bugs creep in during translation, and you maintain two versions of the same logic.
Ibis offers a single dataframe API that sits in front of the engine. You write expressions in Python, and:
- The same code runs on a local engine (DuckDB) during development and on a warehouse (BigQuery, Snowflake, Postgres) in production.
- The heavy computation happens inside the engine, close to the data, not in your Python process.
- You avoid hand-translating pandas to SQL, and you can scale from a laptop to a warehouse by changing only the connection.
The mental model is closer to a query builder than to pandas. You are composing a description of a computation, not eagerly materializing intermediate results.
Deferred (Lazy) Execution
This is the single most important concept in Ibis. When you chain operations such as filter, select, and groupby, nothing is computed. You are building an expression tree. Execution only happens when you explicitly ask for results.
import ibis
con = ibis.duckdb.connect()
This builds an expression. No query runs yet.
orders = con.table("orders")
expr = (
orders
.filter(orders.status == "completed")
.groupby("country")
.aggregate(total=orders.amount.sum())
)
Still nothing has executed. expr is just a description.
print(type(expr)) #
Execution happens here, returning a pandas DataFrame.
df = expr.execute()
The trigger points that actually run the query are:
.execute()— returns a pandas DataFrame (the default)..topandas()— explicit pandas output..topolars()— returns a Polars DataFrame..topyarrow()— returns a PyArrow Table..head().execute()— preview a small sample.
Because execution is deferred, Ibis can push the entire pipeline down to the engine as one optimized query. That is what makes it scale: you are not pulling raw data into Python and filtering it there.
Installation
Ibis is installed with one or more backend extras. Each backend you want to use is an optional dependency, which keeps the install lean.
# Core library plus the DuckDB backend (recommended starting point)
pip install 'ibis-framework[duckdb]'
Add other backends as needed
pip install 'ibis-framework[bigquery]'
pip install 'ibis-framework[postgres]'
pip install 'ibis-framework[snowflake]'
pip install 'ibis-framework[polars]'
Multiple backends at once
pip install 'ibis-framework[duckdb,postgres,bigquery]'
Verify the installation:
import ibis
print(ibis.version)
DuckDB is the recommended local backend. It runs in-process, needs no server, and supports the full range of SQL features Ibis compiles to, which makes it ideal for development and testing.
Connecting to Backends
A connection is the object that knows how to talk to an engine. The same expression API works regardless of which connection you build.
import ibis
DuckDB in-memory (great for experiments)
con = ibis.duckdb.connect()
DuckDB persisted to a file
con = ibis.duckdb.connect("analytics.ddb")
URL-style connection (works across backends)
con = ibis.connect("duckdb://analytics.ddb")
For warehouse backends, the connection carries credentials and project details. The expressions you build afterward are identical.
# BigQuery
con = ibis.bigquery.connect(