PostgreSQL Advanced for ML Tutorial: Analytics and Feature Engineering

# Tutorial 17: PostgreSQL Lanjutan untuk Machine Learning ## Daftar Isi 1. [Pendahuluan](#pendahuluan) 2. [Prasyarat](#prasyarat) 3. [Window Functions untuk Rekayasa Fitur](#window-functions-untuk-r...

By Ruby Abdullah · · tutorial
PostgreSQLSQLFeature EngineeringAnalyticsPythonSQLAlchemy

Tutorial 17: PostgreSQL Advanced for Machine Learning

Table of Contents

  • Introduction
  • Prerequisites
  • Window Functions for Feature Engineering
  • CTEs for Complex Queries
  • Materialized Views for Feature Caching
  • JSONB for ML Metadata
  • Table Partitioning for Large Datasets
  • Monitoring with pgstat
  • Integration with Python
  • Best Practices
  • Conclusion
  • Introduction

    PostgreSQL is far more than a simple relational database. With its advanced analytical functions, flexible data types, and extensibility, it serves as a powerful backbone for machine learning workflows. Many ML engineers underestimate what can be done directly in the database layer, often pulling raw data into Python for transformations that PostgreSQL handles more efficiently.

    This tutorial covers advanced PostgreSQL techniques that are directly applicable to ML pipelines: feature engineering with window functions, query organization with CTEs, caching computed features with materialized views, storing experiment metadata with JSONB, handling large datasets with partitioning, monitoring query performance, and integrating everything with Python using psycopg2 and SQLAlchemy.

    Prerequisites

    • PostgreSQL 14+ installed and running
    • Python 3.9+ with pip
    • Basic SQL knowledge (SELECT, JOIN, GROUP BY)
    • Familiarity with ML concepts (features, training data, experiments)
    • Install required Python packages:

    # Install dependencies
    

    pip install psycopg2-binary sqlalchemy pandas

    import psycopg2

    import sqlalchemy

    import pandas as pd

    Window Functions for Feature Engineering

    Window functions are one of the most powerful tools for ML feature engineering. They allow you to compute aggregates across rows related to the current row without collapsing the result set — perfect for creating rolling statistics, lag features, and ranking features.

    Rolling Statistics

    -- Create a sample transactions table
    

    CREATE TABLE transactions (

    id SERIAL PRIMARY KEY,

    customerid INTEGER NOT NULL,

    amount DECIMAL(10, 2) NOT NULL,

    category VARCHAR(50),

    createdat TIMESTAMP NOT NULL DEFAULT NOW()

    );

    -- Rolling average transaction amount (last 7 days) per customer

    SELECT

    id,

    customerid,

    amount,

    createdat,

    AVG(amount) OVER (

    PARTITION BY customerid

    ORDER BY createdat

    RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW

    ) AS rollingavg7d,

    STDDEV(amount) OVER (

    PARTITION BY customerid

    ORDER BY createdat

    RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW

    ) AS rollingstddev7d,

    COUNT() OVER (

    PARTITION BY customerid

    ORDER BY createdat

    RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW

    ) AS txncount7d

    FROM transactions

    ORDER BY customerid, createdat;

    Lag and Lead Features

    Lag features are essential for time-series ML models. They capture temporal patterns without data leakage when used correctly.

    -- Lag features: previous transaction amounts and time deltas
    

    SELECT

    id,

    customerid,

    amount,

    createdat,

    LAG(amount, 1) OVER w AS prevamount1,

    LAG(amount, 2) OVER w AS prevamount2,

    LAG(amount, 3) OVER w AS prevamount3,

    amount - LAG(amount, 1) OVER w AS amountdiff,

    EXTRACT(EPOCH FROM (

    createdat - LAG(createdat, 1) OVER w

    )) / 3600.0 AS hourssincelasttxn,

    NTILE(10) OVER (

    PARTITION BY customerid

    ORDER BY amount

    ) AS amountdecile

    FROM transactions

    WINDOW w AS (PARTITION BY customerid ORDER BY createdat);

    Related Articles

    DuckDB: In-Process Analytical Database for Data Science

    DuckDB: Database Analitik In-Process untuk Data Science DuckDB adalah database analitik in-process yang dirancang khusus...

    SQLModel: Modern Python ORM for Type-Safe AI Applications

    SQLModel: ORM Modern Python untuk Aplikasi AI yang Type-Safe Dalam pengembangan aplikasi AI/ML, pengelolaan data di data...

    Feature Engineering Masterclass Tutorial: Feature Techniques for ML

    Tutorial 14: Masterclass Rekayasa Fitur (Feature Engineering) Daftar Isi Pendahuluan Prasyarat Mengapa Rekayasa Fitur Pe...

    Complete BigQuery ML Tutorial: Machine Learning with SQL

    Tutorial Lengkap BigQuery ML: Machine Learning dengan SQL BigQuery ML memungkinkan Anda membuat dan mengeksekusi model m...