PyOD Tutorial: Anomaly and Outlier Detection in Python


By Ruby Abdullah · · tutorial
PyOD · Anomaly Detection · Outlier Detection · Machine Learning · Fraud Detection · Python

Anomaly Detection in Python with PyOD: A Practical Guide

Most real-world datasets contain a small fraction of records that do not behave like the rest: a fraudulent payment, a faulty sensor reading, an intrusion attempt, or a defective unit on a production line. PyOD is a mature, scikit-learn-style library that bundles more than 40 outlier detection algorithms behind one consistent interface, which makes it straightforward to try several methods and compare them. This tutorial walks through the core concepts, the unified API, the main algorithm families, and a realistic end-to-end workflow.

What Is Outlier / Anomaly Detection?

Outlier detection (often used interchangeably with anomaly detection) is the task of identifying observations that deviate so much from the majority that they are likely generated by a different process. Unlike standard classification, you usually have very few labeled anomalies, or none at all, so most of the work is unsupervised: the model learns what "normal" looks like and scores each point by how far it sits from that normal region.

A few important framing points:

  • Anomalies are rare by definition. If 40% of your data is "anomalous," it is probably not an anomaly problem but an imbalanced classification problem.
  • The boundary between normal and abnormal is rarely crisp. Detectors produce a continuous score; you choose a cutoff.
  • "Anomalous" is context-dependent. A transaction of USD 5,000 is normal for one customer and suspicious for another.
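
The last two points can be illustrated with a minimal sketch that uses only the Python standard library. It is not how PyOD scores data: the robust_score helper, the two spending histories, and the cutoff of 3.5 are all hypothetical choices made for illustration. The score is the distance from a customer's median spend, measured in units of the median absolute deviation (MAD).

```python
from statistics import median

def robust_score(value, history):
    """Distance of `value` from the median of `history`, in MAD units."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # Guard against a zero MAD (all history values identical).
    return abs(value - med) / (mad if mad > 0 else 1.0)

# Hypothetical spending histories in USD for two customers.
history_a = [4200, 5100, 4800, 5300, 4900]  # routinely spends around 5,000
history_b = [40, 55, 35, 60, 45]            # routinely spends around 50

amount = 5000
score_a = robust_score(amount, history_a)
score_b = robust_score(amount, history_b)

CUTOFF = 3.5  # illustrative cutoff; where to place it is a judgment call
print(f"customer A: score={score_a:.2f}, flagged={score_a > CUTOFF}")
print(f"customer B: score={score_b:.2f}, flagged={score_b > CUTOFF}")
```

The same USD 5,000 transaction receives a low score for customer A and a very high one for customer B, and the binary decision appears only once a cutoff is applied to the continuous score.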

Real-World Use Cases

  • Fraud detection. Credit card transactions, insurance claims, account takeovers. Fraud patterns shift over time, so unsupervised or semi-supervised detectors complement rule engines.
  • Fault and defect detection. Vibration, temperature, and pressure sensors on industrial equipment; defective items in manufacturing.
  • Network intrusion detection. Unusual traffic volume, port-scanning behaviour, or connection patterns that differ from a learned baseline.
  • Quality control. Out-of-spec measurements in a production batch, or readings that drift outside expected tolerances.
  • Data cleaning. Catching corrupted rows, unit errors, and sensor glitches before they pollute a downstream model.

Installation

PyOD itself is lightweight and depends mainly on NumPy, SciPy, and scikit-learn.

pip install pyod

Some detectors have optional dependencies. The neural models (for example AutoEncoder) require PyTorch, and a few utilities use additional packages:

# Required only for the neural-network based detectors

pip install torch

# Optional, used by some combination/visualization helpers

pip install combo matplotlib

You can confirm the installation from a Python shell:

import pyod

print(pyod.__version__)

The Unified, scikit-learn-Style API

Every detector in PyOD follows the same lifecycle, which is the main reason the library is pleasant to use. Once you learn one model, you know all of them.

The common pieces are:

  • fit(X): learn the model from training data. Detection is typically unsupervised, so you do not pass labels.
  • decision_function(X): return a raw outlier score for new data. Higher means more anomalous.
  • predict(X): return binary labels (0 = inlier, 1 = outlier).
  • predict_proba(X): return the estimated probability of being an outlier. By default these are raw scores rescaled to the range 0 to 1, not statistically calibrated probabilities.
  • decision_scores_: the raw outlier scores computed on the training data during fit.
  • labels_: the binary labels assigned to the training data.
  • threshold_: the score cutoff used to separate inliers from outliers.
  • contamination: the constructor parameter that tells the model the expected proportion of outliers (default 0.1). It is used to set threshold_.

The contamination value is central. PyOD ranks all training points by score and labels the top contamination fraction as outliers. It does not change how scores are computed, only where the cutoff falls. If you set it too high you get false positives; too low and you miss real anomalies.
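
The effect of contamination can be reproduced with NumPy alone. The snippet below is a sketch of the percentile-style thresholding described above, not PyOD's source code; the ten scores and the labels_for helper are hypothetical. The scores stay fixed and only the cutoff moves.

```python
import numpy as np

# Hypothetical outlier scores for 10 training points (higher = more anomalous).
# Two points (2.5 and 4.0) are clearly separated from the rest.
scores = np.array([0.2, 0.3, 0.1, 0.4, 0.25, 0.35, 0.15, 0.3, 2.5, 4.0])

def labels_for(scores, contamination):
    """Place the cutoff at the (1 - contamination) percentile of the scores."""
    threshold = np.percentile(scores, 100 * (1 - contamination))
    return threshold, (scores > threshold).astype(int)

thr_low, labels_low = labels_for(scores, 0.1)
thr_high, labels_high = labels_for(scores, 0.3)

print(f"contamination=0.1: threshold={thr_low:.3f}, flagged={labels_low.sum()}")
print(f"contamination=0.3: threshold={thr_high:.3f}, flagged={labels_high.sum()}")
```

With contamination=0.1 only the point scoring 4.0 is flagged and the 2.5 is missed; with contamination=0.3 three points are flagged, including the unremarkable 0.4. Neither setting matches the two clearly separated points, which is why the value deserves care.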
