DuckDB: In-Process Analytical Database for Data Science

# DuckDB: Database Analitik In-Process untuk Data Science DuckDB adalah database analitik in-process yang dirancang khusus untuk workload OLAP (Online Analytical Processing). Berbeda dengan database...

By Ruby Abdullah · · tutorial
DuckDBAnalyticsSQLData SciencePython

DuckDB: In-Process Analytical Database for Data Science

DuckDB is an in-process analytical database designed specifically for OLAP (Online Analytical Processing) workloads. Unlike traditional databases such as PostgreSQL or MySQL that require a separate server, DuckDB runs directly within your application process, similar to SQLite but optimized for analytics. In this tutorial, we will explore how to use DuckDB for efficient and powerful data science workflows.

Why DuckDB?

Before diving into the technical details, let us understand why DuckDB has become a popular choice among data scientists:

  • Serverless: No database server installation, configuration, or management needed
  • Columnar storage: Data is stored per column, optimal for analytical queries
  • Vectorized execution: Processes data in batches for maximum performance
  • Query files directly: Can query CSV, Parquet, and JSON without importing first
  • Python integration: Seamless Python API with Pandas and Polars
  • Standard SQL: Supports feature-rich SQL including window functions and CTEs

Installation

Installing DuckDB is straightforward. Simply use pip:

pip install duckdb

To verify the installation:

import duckdb

print(duckdb.version)

If you want to use the DuckDB CLI, you can download the binary from the official website or use:

pip install duckdb[cli]

Getting Started with DuckDB

In-Memory Database

The simplest way to get started is with an in-memory database:

import duckdb

Create an in-memory connection

con = duckdb.connect()

Or explicitly

con = duckdb.connect(database=':memory:')

Run a simple query

result = con.sql("SELECT 42 AS answer, 'Hello DuckDB' AS message")

print(result.fetchall())

[(42, 'Hello DuckDB')]

Persistent Database

To store data permanently, provide a filename:

import duckdb

Create or open a persistent database

con = duckdb.connect('analytics.duckdb')

Create a table

con.sql("""

CREATE TABLE IF NOT EXISTS sales (

id INTEGER PRIMARY KEY,

product VARCHAR,

category VARCHAR,

quantity INTEGER,

price DECIMAL(10, 2),

saledate DATE

)

""")

Insert data

con.sql("""

INSERT INTO sales VALUES

(1, 'Laptop Pro', 'Electronics', 5, 1299.99, '2026-01-15'),

(2, 'Wireless Mouse', 'Accessories', 50, 24.99, '2026-01-16'),

(3, 'Mechanical Keyboard', 'Accessories', 30, 79.99, '2026-01-16'),

(4, '27-inch Monitor', 'Electronics', 10, 449.99, '2026-01-17'),

(5, 'Gaming Headset', 'Accessories', 25, 59.99, '2026-01-18')

""")

Query data

result = con.sql("SELECT FROM sales WHERE category = 'Electronics'")

result.show()

Querying Files Directly Without Import

One of DuckDB's most powerful features is the ability to query files directly.

Querying CSV Files

import duckdb

Query CSV directly

result = duckdb.sql("""

SELECT

FROM readcsvauto('data/transactions.csv')

WHERE total > 1000

ORDER BY transactiondate DESC

LIMIT 10

""")

result.show()

With specific options

result = duckdb.sql("""

SELECT

FROM readcsv('data/transactions.csv',

delim=',',

header=true,

dateformat='%Y-%m-%d'

)

""")

Querying Parquet Files

Parquet is a highly efficient columnar file format. DuckDB reads it natively:

import duckdb

Query a Parquet file

result = duckdb.sql("""

SELECT

category,

COUNT() AS transactioncount,

SUM(total) AS totalrevenue,

AVG(total) AS averageamount

FROM readparquet('data/transactions2026.parquet')

GROUP BY category

ORDER BY totalrevenue DESC

""")

result.show()

Query multiple Parquet files with glob patterns

result = duckdb.sql("""

SELECT

Related Articles