DuckDB: In-Process Analytical Database for Data Science
DuckDB is an in-process analytical database designed specifically for OLAP (Online Analytical Processing) workloads. Unlike traditional databases such as PostgreSQL or MySQL that require a separate server, DuckDB runs directly within your application process, similar to SQLite but optimized for analytics. In this tutorial, we will explore how to use DuckDB for efficient and powerful data science workflows.
Why DuckDB?
Before diving into the technical details, let us understand why DuckDB has become a popular choice among data scientists:
- Serverless: No database server installation, configuration, or management needed
- Columnar storage: Data is stored per column, optimal for analytical queries
- Vectorized execution: Processes data in batches for maximum performance
- Query files directly: Can query CSV, Parquet, and JSON without importing first
- Python integration: Seamless Python API with Pandas and Polars
- Standard SQL: Supports feature-rich SQL including window functions and CTEs
Installation
Installing DuckDB is straightforward. Simply use pip:
pip install duckdb
To verify the installation:
import duckdb
print(duckdb.version)
If you want to use the DuckDB CLI, you can download the binary from the official website or use:
pip install duckdb[cli]
Getting Started with DuckDB
In-Memory Database
The simplest way to get started is with an in-memory database:
import duckdb
Create an in-memory connection
con = duckdb.connect()
Or explicitly
con = duckdb.connect(database=':memory:')
Run a simple query
result = con.sql("SELECT 42 AS answer, 'Hello DuckDB' AS message")
print(result.fetchall())
[(42, 'Hello DuckDB')]
Persistent Database
To store data permanently, provide a filename:
import duckdb
Create or open a persistent database
con = duckdb.connect('analytics.duckdb')
Create a table
con.sql("""
CREATE TABLE IF NOT EXISTS sales (
id INTEGER PRIMARY KEY,
product VARCHAR,
category VARCHAR,
quantity INTEGER,
price DECIMAL(10, 2),
saledate DATE
)
""")
Insert data
con.sql("""
INSERT INTO sales VALUES
(1, 'Laptop Pro', 'Electronics', 5, 1299.99, '2026-01-15'),
(2, 'Wireless Mouse', 'Accessories', 50, 24.99, '2026-01-16'),
(3, 'Mechanical Keyboard', 'Accessories', 30, 79.99, '2026-01-16'),
(4, '27-inch Monitor', 'Electronics', 10, 449.99, '2026-01-17'),
(5, 'Gaming Headset', 'Accessories', 25, 59.99, '2026-01-18')
""")
Query data
result = con.sql("SELECT FROM sales WHERE category = 'Electronics'")
result.show()
Querying Files Directly Without Import
One of DuckDB's most powerful features is the ability to query files directly.
Querying CSV Files
import duckdb
Query CSV directly
result = duckdb.sql("""
SELECT
FROM readcsvauto('data/transactions.csv')
WHERE total > 1000
ORDER BY transactiondate DESC
LIMIT 10
""")
result.show()
With specific options
result = duckdb.sql("""
SELECT
FROM readcsv('data/transactions.csv',
delim=',',
header=true,
dateformat='%Y-%m-%d'
)
""")
Querying Parquet Files
Parquet is a highly efficient columnar file format. DuckDB reads it natively:
import duckdb
Query a Parquet file
result = duckdb.sql("""
SELECT
category,
COUNT() AS transactioncount,
SUM(total) AS totalrevenue,
AVG(total) AS averageamount
FROM readparquet('data/transactions2026.parquet')
GROUP BY category
ORDER BY totalrevenue DESC
""")
result.show()
Query multiple Parquet files with glob patterns
result = duckdb.sql("""
SELECT