Anomaly Detection#
Anomaly detection means identifying outliers in data — points that deviate significantly from normal patterns.
Outliers are crucial in some problems (e.g., fraud detection, security breaches, disease detection) but may be irrelevant in others.
Examples:
Bank login from unusual locations
Unusual runs scored in an IPL over
Rare disease detection in healthcare datasets
Importance of Outliers#
Outliers indicate unique or abnormal events in a dataset.
Detecting anomalies is often unsupervised, as labels for anomalies are usually not available.
Outliers may represent critical events (fraud, disease) or noise, depending on the context.
Isolation Forest Concept#
Isolation Forest is an unsupervised anomaly detection algorithm.
Uses isolation trees (similar to decision trees) to separate individual points:
Outliers are isolated faster, requiring fewer splits.
Normal points require more splits to isolate.
The anomaly score is calculated using the formula:
\[
s(x, m) = 2^{-\frac{E(h(x))}{c(m)}}
\]
Where:
\(h(x)\) = path length to isolate point \(x\) in a single tree
\(E(h(x))\) = average path length over multiple trees
\(c(m)\) = expected path length for a sample of size \(m\)
Points with scores close to 1 → likely anomalies.
Threshold (e.g., 0.5) is set to classify points as outliers.
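To make the score concrete, here is a small plain-Python sketch of \(c(m)\) and the score \(s(x, m) = 2^{-E(h(x))/c(m)}\); the path lengths passed in are illustrative values, not output from a fitted model:

```python
import math

def c(m):
    """Expected path length of an unsuccessful BST search among m samples."""
    if m <= 1:
        return 0.0
    # Harmonic number H(m-1) approximated by ln(m-1) + Euler-Mascheroni constant
    h = math.log(m - 1) + 0.5772156649
    return 2.0 * h - 2.0 * (m - 1) / m

def anomaly_score(avg_path_len, m):
    """s(x, m) = 2^{-E(h(x)) / c(m)}; values close to 1 suggest an anomaly."""
    return 2.0 ** (-avg_path_len / c(m))

# A point isolated after very few splits scores near 1,
# while a point needing many splits scores well below 0.5.
print(anomaly_score(2.0, 256))   # short average path -> high score
print(anomaly_score(12.0, 256))  # long average path -> low score
```

This is why the 0.5 threshold works: a path length near the expected \(c(m)\) gives a score near \(2^{-1} = 0.5\), so only points isolated much faster than average climb toward 1.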
How Isolation Forest Works#
Randomly select a feature and a split value between its minimum and maximum.
Recursively create nodes until each point is isolated in a leaf.
Points that are isolated in shorter paths → anomalies.
Multiple isolation trees are used for robustness.
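The steps above can be sketched in one dimension as a toy experiment (this is a simplification of the real algorithm, which splits on random features of multivariate data):

```python
import random

def isolation_path_length(x, data, max_depth=50):
    """Count random splits needed before x is alone in its partition."""
    depth = 0
    current = data
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # duplicates cannot be separated further
        split = random.uniform(lo, hi)
        # Keep only the side of the split that contains x
        current = [v for v in current if (v < split) == (x < split)]
        depth += 1
    return depth

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)] + [10.0]
avg = lambda x: sum(isolation_path_length(x, data) for _ in range(50)) / 50
# The far-away point at 10.0 isolates in fewer splits on average
print(avg(10.0), avg(data[0]))
```

Averaging over 50 random trees plays the role of the "multiple isolation trees" robustness step: any single tree's path length is noisy, but the mean separates the outlier cleanly.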
Practical Example#
A healthcare dataset with 2 features (indicating disease) was used.
Steps:
Load dataset.
Fit Isolation Forest (the `contamination` parameter defines the expected proportion of anomalies).
Predict anomalies (`1` = normal, `-1` = outlier).
Visualize outliers on a scatter plot (outliers highlighted in red).
Results: Outliers were clearly separated from normal points, demonstrating Isolation Forest’s effectiveness.
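The original dataset is not reproduced here, so the following sketch uses synthetic two-feature data as a stand-in for the healthcare features; otherwise it follows the steps above with scikit-learn:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in for the two-feature healthcare data described above
normal = rng.normal(loc=[50, 120], scale=[5, 10], size=(200, 2))
outliers = np.array([[90.0, 200.0], [10.0, 40.0]])
X = np.vstack([normal, outliers])

# contamination = expected proportion of anomalies in the data
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
labels = iso.predict(X)  # 1 = normal, -1 = outlier

flagged = np.where(labels == -1)[0]
print(flagged)  # includes the two injected points (indices 200 and 201)
# For the scatter-plot step, color points red where labels == -1.
```

If the anomaly proportion is unknown, `contamination` can be left at its default of `"auto"`, which falls back to the scoring threshold from the original paper.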
Key Points#
Anomaly detection is unsupervised.
Isolation Forest isolates data points rather than clustering.
Outliers are detected based on how quickly they can be separated from the rest of the data.
Useful for fraud detection, cybersecurity, healthcare, and other domains with rare events.
Statistical / Classical Methods#
Z-score / Standard Deviation: detect points that deviate from the mean by more than \(n\) standard deviations.
Modified Z-score: more robust to outliers, using the median and MAD (Median Absolute Deviation).
Grubbs' Test / Dixon's Q Test: statistical tests for single outliers.
Boxplot / IQR Method: points outside `Q1 - 1.5*IQR` or `Q3 + 1.5*IQR`.
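Both rules fit in a few lines of NumPy. The data is a toy sample, and the z cutoff of 2 is deliberately loose for such a small sample (3 is the more common choice on larger data):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95], dtype=float)

# Z-score rule: flag |z| above a cutoff (2 here; 3 is common for larger samples)
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(z_outliers, iqr_outliers)  # both flag 95
```

Note how the single huge value inflates the mean and standard deviation, dragging its own z-score down; that masking effect is exactly why the Modified Z-score above uses the median and MAD instead.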
Distance-based Methods#
k-Nearest Neighbors (kNN) for anomaly detection: anomalies have large distances to their nearest neighbors.
Local Outlier Factor (LOF): measures how isolated a point is compared to its neighbors.
Mahalanobis Distance: measures distance while accounting for correlation between features.
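As an illustration of the LOF entry, here is a minimal scikit-learn sketch on synthetic data (the cluster and the injected point are assumptions for demonstration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# One dense cluster plus a single far-away point
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

# LOF compares each point's local density to that of its neighbors;
# fit_predict returns -1 for points much sparser than their neighborhood
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)
print(np.where(labels == -1)[0])  # indices flagged as outliers
```

The raw LOF values are available afterwards as `lof.negative_outlier_factor_` (more negative means more anomalous), which is useful for ranking rather than hard labeling.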
Clustering-based Methods#
K-Means-based anomaly detection: points far from any cluster centroid are anomalies.
DBSCAN: points labeled as noise (`-1`) are anomalies.
Hierarchical clustering: small isolated clusters or singleton points can be anomalies.
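The DBSCAN noise label is easy to demonstrate; the two clusters and the stray point below are synthetic assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two tight clusters plus one point far from both
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               [[2.5, 2.5]]])

# eps / min_samples control density; points reachable from no dense
# region receive the noise label -1
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])  # noise points, i.e. anomalies
```

Unlike K-Means, DBSCAN never forces the stray point into a cluster, which is why its noise label doubles as an anomaly flag.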
Classification / Supervised Methods#
(Requires labeled data: normal vs. anomaly. One-Class SVM is the partial exception; it is trained on normal examples only.)
Support Vector Machine (SVM) – One-Class SVM: learns the boundary of normal points; points outside it are anomalies.
Random Forest / Isolation Forest: detects anomalies by isolating points that are easier to split (Isolation Forest itself needs no labels).
Gradient Boosting / XGBoost: for anomaly classification when labels are available.
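A One-Class SVM sketch with scikit-learn, training only on assumed-normal synthetic data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, size=(200, 2))  # assumed "normal" data only

# nu upper-bounds the fraction of training points treated as outliers
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

# Score unseen points: 1 = inside the learned boundary, -1 = outside
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
preds = ocsvm.predict(X_new)
print(preds)
```

The `nu` parameter does double duty as a regularizer: a larger value draws a tighter boundary around the training data at the cost of rejecting more of it.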
Neural Network / Deep Learning Methods#
Autoencoders: reconstruct the input; large reconstruction error → anomaly.
Variational Autoencoders (VAE): probabilistic reconstruction; points with low likelihood under the learned distribution → anomaly.
LSTM-based Autoencoders: for time-series anomaly detection.
Generative Adversarial Networks (GANs): identify anomalies as points the generator fails to reproduce well.
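A real autoencoder needs a deep-learning framework, so as a lightweight stand-in the sketch below uses PCA reconstruction error; it is a linear simplification, but it illustrates the same compress-reconstruct-threshold idea on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Correlated data lying near a 1-D line in 2-D space
t = rng.normal(0, 1, size=200)
X = np.column_stack([t, 2 * t + rng.normal(0, 0.1, size=200)])
X = np.vstack([X, [[3.0, -6.0]]])  # a point violating the learned structure

# Compress to 1 component and reconstruct; anomalies reconstruct poorly
pca = PCA(n_components=1).fit(X)
recon = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - recon, axis=1)
print(np.argmax(errors))  # index of the worst-reconstructed point
```

An autoencoder replaces the linear projection with a learned nonlinear encoder/decoder, but the decision rule is identical: threshold the per-sample reconstruction error.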
Probabilistic / Density-based Methods#
Gaussian Mixture Models (GMM): points with low probability under the model are anomalies.
Kernel Density Estimation (KDE): points in low-density regions → anomalies.
Bayesian Networks: probabilistic modeling to detect unusual events.
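The GMM entry in scikit-learn terms, on synthetic two-cluster data with one injected low-density point:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(150, 2)),
               rng.normal(6, 1, size=(150, 2)),
               [[3.0, 15.0]]])  # far from both mixture components

# Fit a 2-component mixture; score_samples returns per-point log-likelihoods
gmm = GaussianMixture(n_components=2, random_state=4).fit(X)
log_density = gmm.score_samples(X)

# Flag the lowest-density 1% as anomalies
threshold = np.percentile(log_density, 1)
print(np.where(log_density <= threshold)[0])
```

KDE follows the same pattern with `sklearn.neighbors.KernelDensity`, swapping the parametric mixture for a nonparametric density estimate.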
Time-Series Specific Methods#
ARIMA / SARIMA residuals: residuals beyond a threshold → anomaly.
Prophet (from Facebook/Meta): detects deviations from the predicted trend.
Twitter AnomalyDetection (R, with Python ports).
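A fitted ARIMA model is more than a sketch needs, so the example below substitutes a centered moving average for the model's predictions; the residual-thresholding logic is the same. The series and the injected spike are synthetic assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
# Smooth seasonal signal plus noise, with one injected spike
t = np.arange(200)
series = np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.1, size=200)
series[120] += 3.0

# Stand-in for model predictions: a centered moving average.
# The residual is the gap between the series and that baseline.
window = 10
trend = np.convolve(series, np.ones(window) / window, mode="same")
resid = series - trend

# Flag residuals beyond 3 standard deviations
threshold = 3 * np.std(resid)
flagged = np.where(np.abs(resid) > threshold)[0]
print(flagged)
```

With a real ARIMA/SARIMA fit (e.g. via `statsmodels`), `trend` would be the model's in-sample predictions and `resid` its residuals; everything after that line is unchanged.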
Ensemble Methods#
Combine multiple anomaly detection models:
Isolation Forest + LOF
Autoencoder + Statistical threshold
Voting / stacking ensemble
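One simple way to combine the first pairing, Isolation Forest + LOF, is rank averaging: convert each detector's scores to ranks and average them, which sidesteps their different score scales. The data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[7.0, 7.0]]])

# In both detectors, a LOWER score means MORE anomalous
iso_scores = IsolationForest(random_state=6).fit(X).score_samples(X)
lof_scores = LocalOutlierFactor(n_neighbors=20).fit(X).negative_outlier_factor_

# argsort of argsort converts scores to ranks; average the two rankings
avg_rank = (np.argsort(np.argsort(iso_scores)) +
            np.argsort(np.argsort(lof_scores))) / 2.0
print(np.argmin(avg_rank))  # most anomalous point under the ensemble
```

A voting ensemble would instead threshold each detector separately and flag points that a majority of detectors agree on; rank averaging is the softer, score-level version of the same idea.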