Polars - Ultra-Fast DataFrame Library Complete Tutorial
Introduction
Polars is a fast DataFrame library written in Rust with Python bindings. It builds on the Apache Arrow columnar memory format and a multi-threaded query engine, and is often reported to run 10-100x faster than Pandas, although the actual speedup depends on the data and the query. Polars supports both eager and lazy evaluation, which suits it to exploratory data analysis as well as production data pipelines.
Key advantages of Polars over Pandas:
- Written in Rust with zero-copy interoperability via Apache Arrow
- Multi-threaded execution by default
- Lazy evaluation with query optimization
- Consistent API with no index-based confusion
- Streaming support for out-of-core processing
Prerequisites
- Python 3.8 or higher
- Basic understanding of DataFrames and tabular data
pip install polars
pip install 'polars[all]' # Includes optional dependencies (Excel, database connectors, etc.); quoted so shells such as zsh do not expand the brackets
pip install pandas # For comparison benchmarks
pip install scikit-learn # For ML pipeline examples
Polars Basics
Creating DataFrames
import polars as pl
# From a dictionary
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [30, 25, 35, 28, 32],
    "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
    "salary": [95000, 65000, 105000, 72000, 78000],
    "joindate": ["2020-01-15", "2021-06-01", "2019-03-20", "2022-01-10", "2020-09-05"]
})
# Cast the date column from string to Date
df = df.with_columns(pl.col("joindate").str.to_date())
print(df)
print(f"Shape: {df.shape}")
print(f"Schema: {df.schema}")
print(f"Dtypes: {df.dtypes}")
Reading Data
# CSV
df = pl.read_csv("data.csv")
# Parquet (highly recommended for performance)
df = pl.read_parquet("data.parquet")
# JSON
df = pl.read_json("data.json")
# From Pandas
import pandas as pd
pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
polars_df = pl.from_pandas(pandas_df)
# Back to Pandas when needed
pandas_back = polars_df.to_pandas()
Basic Operations
# Select columns
df.select("name", "salary")
df.select(pl.col("name"), pl.col("salary"))
# Filter rows
df.filter(pl.col("salary") > 80000)
df.filter((pl.col("department") == "Engineering") & (pl.col("age") > 30))
# Sort
df.sort("salary", descending=True)
df.sort(["department", "salary"], descending=[False, True])
# Add/modify columns
df.with_columns(
    (pl.col("salary") * 1.1).alias("salaryafterraise"),
    pl.col("name").str.to_uppercase().alias("nameupper"),
    pl.lit("Active").alias("status"),
)
# Drop columns
df.drop("joindate")
# Rename columns
df.rename({"name": "employeename", "salary": "annualsalary"})
# Descriptive statistics
df.describe()
Lazy vs Eager Evaluation
Polars' lazy evaluation is one of its most powerful features. It builds a query plan that is optimized before execution:
import polars as pl
# EAGER: operations execute immediately
df = pl.read_csv("largefile.csv")
result = df.filter(pl.col("amount") > 100).select("id", "amount").sort("amount")
# LAZY: builds a query plan, optimizes, then executes
result = (