Tutorial 17: PostgreSQL Advanced for Machine Learning
Table of Contents
Introduction
PostgreSQL is far more than a simple relational database. With its advanced analytical functions, flexible data types, and extensibility, it serves as a powerful backbone for machine learning workflows. Many ML engineers underestimate what can be done directly in the database layer, often pulling raw data into Python for transformations that PostgreSQL handles more efficiently.
This tutorial covers advanced PostgreSQL techniques that are directly applicable to ML pipelines: feature engineering with window functions, query organization with CTEs, caching computed features with materialized views, storing experiment metadata with JSONB, handling large datasets with partitioning, monitoring query performance, and integrating everything with Python using psycopg2 and SQLAlchemy.
Prerequisites
- PostgreSQL 14+ installed and running
- Python 3.9+ with pip
- Basic SQL knowledge (SELECT, JOIN, GROUP BY)
- Familiarity with ML concepts (features, training data, experiments)
- Install required Python packages:
# Install dependencies
pip install psycopg2-binary sqlalchemy pandas
import psycopg2
import sqlalchemy
import pandas as pd
Window Functions for Feature Engineering
Window functions are one of the most powerful tools for ML feature engineering. They allow you to compute aggregates across rows related to the current row without collapsing the result set — perfect for creating rolling statistics, lag features, and ranking features.
Rolling Statistics
-- Create a sample transactions table
CREATE TABLE transactions (
id SERIAL PRIMARY KEY,
customerid INTEGER NOT NULL,
amount DECIMAL(10, 2) NOT NULL,
category VARCHAR(50),
createdat TIMESTAMP NOT NULL DEFAULT NOW()
);
-- Rolling average transaction amount (last 7 days) per customer
SELECT
id,
customerid,
amount,
createdat,
AVG(amount) OVER (
PARTITION BY customerid
ORDER BY createdat
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) AS rollingavg7d,
STDDEV(amount) OVER (
PARTITION BY customerid
ORDER BY createdat
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) AS rollingstddev7d,
COUNT() OVER (
PARTITION BY customerid
ORDER BY createdat
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) AS txncount7d
FROM transactions
ORDER BY customerid, createdat;
Lag and Lead Features
Lag features are essential for time-series ML models. They capture temporal patterns without data leakage when used correctly.
-- Lag features: previous transaction amounts and time deltas
SELECT
id,
customerid,
amount,
createdat,
LAG(amount, 1) OVER w AS prevamount1,
LAG(amount, 2) OVER w AS prevamount2,
LAG(amount, 3) OVER w AS prevamount3,
amount - LAG(amount, 1) OVER w AS amountdiff,
EXTRACT(EPOCH FROM (
createdat - LAG(createdat, 1) OVER w
)) / 3600.0 AS hourssincelasttxn,
NTILE(10) OVER (
PARTITION BY customerid
ORDER BY amount
) AS amountdecile
FROM transactions
WINDOW w AS (PARTITION BY customerid ORDER BY createdat);