Anomaly Detection in Python with PyOD: A Practical Guide
Most real-world datasets contain a small fraction of records that do not behave like the rest: a fraudulent payment, a faulty sensor reading, an intrusion attempt, or a defective unit on a production line. PyOD is a mature, scikit-learn-style library that bundles more than 40 outlier detection algorithms behind one consistent interface, which makes it straightforward to try several methods and compare them. This tutorial walks through the core concepts, the unified API, the main algorithm families, and a realistic end-to-end workflow.
What Is Outlier / Anomaly Detection?
Outlier detection (often used interchangeably with anomaly detection) is the task of identifying observations that deviate so much from the majority that they are likely generated by a different process. Unlike standard classification, you usually have very few labeled anomalies, or none at all, so most of the work is unsupervised: the model learns what "normal" looks like and scores each point by how far it sits from that normal region.
A few important framing points:
- Anomalies are rare by definition. If 40% of your data is "anomalous," it is probably not an anomaly problem but an imbalanced classification problem.
- The boundary between normal and abnormal is rarely crisp. Detectors produce a continuous score; you choose a cutoff.
- "Anomalous" is context-dependent. A transaction of USD 5,000 is normal for one customer and suspicious for another.
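The last two points can be sketched with nothing but the standard library. The snippet below is a minimal illustration, not a PyOD detector: it scores an amount by how far it sits from a customer's own median, measured in units of the median absolute deviation (MAD), and then applies a cutoff. The helper `robust_score`, the customer histories, and the cutoff of 3.5 are all hypothetical values chosen for the example.

```python
from statistics import median


def robust_score(value, history):
    """Distance of `value` from the median of `history`, in MAD units."""
    center = median(history)
    mad = median([abs(x - center) for x in history])
    # Guard against a zero MAD (all history values identical).
    return abs(value - center) / (mad if mad > 0 else 1.0)


# Hypothetical purchase histories (USD) for two customers.
histories = {
    "high_spender": [4200, 5100, 4800, 5600, 4500],
    "low_spender": [40, 55, 35, 60, 45],
}

CUTOFF = 3.5   # illustrative cutoff; in practice you tune this
amount = 5000  # the same transaction amount, judged per customer

# Continuous score first, binary decision second.
scores = {name: robust_score(amount, hist) for name, hist in histories.items()}
flags = {name: score > CUTOFF for name, score in scores.items()}

for name in histories:
    print(f"{name}: score={scores[name]:.2f}, flagged={flags[name]}")
```

The same USD 5,000 payment gets a small score against the first history and a very large one against the second, so only the second is flagged. Moving `CUTOFF` changes which scores count as anomalous without changing the scores themselves.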
Real-World Use Cases
- Fraud detection. Credit card transactions, insurance claims, account takeovers. Fraud patterns shift over time, so unsupervised or semi-supervised detectors complement rule engines.
- Fault and defect detection. Vibration, temperature, and pressure sensors on industrial equipment; defective items in manufacturing.
- Network intrusion detection. Unusual traffic volume, port-scanning behaviour, or connection patterns that differ from a learned baseline.
- Quality control. Out-of-spec measurements in a production batch, or readings that drift outside expected tolerances.
- Data cleaning. Catching corrupted rows, unit errors, and sensor glitches before they pollute a downstream model.
Installation
PyOD itself is lightweight and depends mainly on NumPy, SciPy, and scikit-learn.
pip install pyod
Some detectors have optional dependencies. The neural models (for example AutoEncoder) require PyTorch, and a few utilities use additional packages:
# Required only for the neural-network based detectors
pip install torch
# Optional, used by some combination/visualization helpers
pip install combo matplotlib
You can confirm the installation from a Python shell:
import pyod
print(pyod.__version__)
The Unified, scikit-learn-Style API
Every detector in PyOD follows the same lifecycle, which is the main reason the library is pleasant to use. Once you learn one model, you know all of them.
The common pieces are:
- fit(X): learn the model from training data. Detection is typically unsupervised, so you do not pass labels.
- decision_function(X): return a raw outlier score for new data. Higher means more anomalous.
- predict(X): return binary labels (0 = inlier, 1 = outlier).
- predict_proba(X): return estimated probabilities of being an outlier. These are obtained by rescaling the raw scores, so treat them as approximate rather than rigorously calibrated.
- decision_scores_: the raw outlier scores computed on the training data during fit.
- labels_: the binary labels assigned to the training data.
- threshold_: the score cutoff used to separate inliers from outliers.
- contamination: the constructor parameter that tells the model the expected proportion of outliers (default 0.1). It is used to set threshold_.
The contamination value is central. PyOD ranks all training points by score and labels the top contamination fraction as outliers. It does not change how scores are computed, only where the cutoff falls. If you set it too high you get false positives; too low and you miss real anomalies.