Polars Tutorial: Ultra-Fast DataFrame Library for Data Science


By Ruby Abdullah · tutorial
Polars · DataFrame · Data Science · Python · Pandas Alternative · Performance

Polars - Ultra-Fast DataFrame Library Complete Tutorial

Table of Contents

  • Introduction
  • Prerequisites
  • Polars Basics
  • Lazy vs Eager Evaluation
  • Expressions and Transformations
  • Joins and Combining DataFrames
  • Group By and Aggregations
  • Window Functions
  • Polars vs Pandas Benchmark
  • ML Pipeline Integration
  • Handling Large Datasets with Streaming
  • Best Practices
  • Conclusion

    Introduction

    Polars is a fast DataFrame library written in Rust with Python bindings. It uses the Apache Arrow columnar memory format and a multi-threaded query engine, which can make it 10-100x faster than Pandas, depending on the workload. Polars supports both eager and lazy evaluation, so it suits both exploratory data analysis and production data pipelines.

    Key advantages of Polars over Pandas:

    • Written in Rust with zero-copy interoperability via Apache Arrow
    • Multi-threaded execution by default
    • Lazy evaluation with query optimization
    • Consistent API with no index-based confusion
    • Streaming support for out-of-core processing

    Prerequisites

    • Python 3.8 or higher
    • Basic understanding of DataFrames and tabular data

    pip install polars
    pip install "polars[all]"   # Includes optional dependencies (Excel, database connectors, etc.)
    pip install pandas          # For comparison benchmarks
    pip install scikit-learn    # For ML pipeline examples

    Polars Basics

    Creating DataFrames

    import polars as pl

    # From a dictionary
    df = pl.DataFrame({
        "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "age": [30, 25, 35, 28, 32],
        "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
        "salary": [95000, 65000, 105000, 72000, 78000],
        "joindate": ["2020-01-15", "2021-06-01", "2019-03-20", "2022-01-10", "2020-09-05"],
    })

    # Cast the date column from string to Date
    df = df.with_columns(pl.col("joindate").str.to_date())

    print(df)
    print(f"Shape: {df.shape}")
    print(f"Schema: {df.schema}")
    print(f"Dtypes: {df.dtypes}")

    Reading Data

    # CSV
    df = pl.read_csv("data.csv")

    # Parquet (highly recommended for performance)
    df = pl.read_parquet("data.parquet")

    # JSON
    df = pl.read_json("data.json")

    # From Pandas (the conversion needs pyarrow installed)
    import pandas as pd

    pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    polars_df = pl.from_pandas(pandas_df)

    # Back to Pandas when needed
    pandas_back = polars_df.to_pandas()

    Basic Operations

    # Select columns
    df.select("name", "salary")
    df.select(pl.col("name"), pl.col("salary"))

    # Filter rows
    df.filter(pl.col("salary") > 80000)
    df.filter((pl.col("department") == "Engineering") & (pl.col("age") > 30))

    # Sort
    df.sort("salary", descending=True)
    df.sort(["department", "salary"], descending=[False, True])

    # Add/modify columns
    df.with_columns(
        (pl.col("salary") * 1.1).alias("salary_after_raise"),
        pl.col("name").str.to_uppercase().alias("name_upper"),
        pl.lit("Active").alias("status"),
    )

    # Drop columns
    df.drop("joindate")

    # Rename columns
    df.rename({"name": "employee_name", "salary": "annual_salary"})

    # Descriptive statistics
    df.describe()

    Lazy vs Eager Evaluation

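    Lazy evaluation is one of Polars' most powerful features. Instead of running each operation immediately, it builds a query plan that the optimizer can rewrite before anything executes, for example by pushing filters and column selections down to the file scan: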

    import polars as pl

    # EAGER: operations execute immediately
    df = pl.read_csv("largefile.csv")
    result = df.filter(pl.col("amount") > 100).select("id", "amount").sort("amount")

    # LAZY: builds a query plan, optimizes, then executes
    result = (
