Anomaly Detection#

  • Anomaly detection means identifying outliers in data — points that deviate significantly from normal patterns.

  • Outliers are crucial in some problems (e.g., fraud detection, security breaches, disease detection) but may be irrelevant in others.

  • Examples:

    • Bank login from unusual locations

    • Unusual runs scored in an IPL over

    • Rare disease detection in healthcare datasets


Importance of Outliers#

  • Outliers indicate unique or abnormal events in a dataset.

  • Detecting anomalies is often unsupervised, as labels for anomalies are usually not available.

  • Outliers may represent critical events (fraud, disease) or noise, depending on the context.


Isolation Forest Concept#

  • Isolation Forest is an unsupervised anomaly detection algorithm.

  • Uses isolation trees (similar to decision trees) to separate individual points:

    • Outliers are isolated faster, requiring fewer splits.

    • Normal points require more splits to isolate.

  • The anomaly score is calculated using the formula:

\[ s(x, m) = 2^{-\frac{E(h(x))}{c(m)}} \]

Where:

  • \(h(x)\) = path length needed to isolate point \(x\) in a single tree

  • \(E(h(x))\) = average path length over multiple trees

  • \(c(m)\) = expected path length for a sample of size \(m\), used to normalize \(E(h(x))\)

  • Points with scores close to 1 → likely anomalies.

  • Threshold (e.g., 0.5) is set to classify points as outliers.
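The score formula above is easy to compute directly. A minimal sketch in plain Python, using the standard harmonic-number approximation for \(c(m)\) (the Euler–Mascheroni constant appears in the approximation \(H(i) \approx \ln(i) + 0.5772\)):

```python
import math

def c(m):
    """Expected path length for a sample of size m (normalization term)."""
    if m <= 1:
        return 0.0
    # Harmonic number approximation: H(i) ~ ln(i) + Euler-Mascheroni constant
    harmonic = math.log(m - 1) + 0.5772156649
    return 2.0 * harmonic - 2.0 * (m - 1) / m

def anomaly_score(avg_path_length, m):
    """s(x, m) = 2 ** (-E(h(x)) / c(m)); scores close to 1 suggest anomalies."""
    return 2.0 ** (-avg_path_length / c(m))
```

Note the behaviour at the threshold: when the average path length equals \(c(m)\), the score is exactly 0.5; much shorter paths push the score toward 1 (anomaly), much longer paths push it toward 0 (normal).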


How Isolation Forest Works#

  1. Randomly select a feature and a split value between its min and max.

  2. Recursively create nodes until each point is isolated in a leaf.

  3. Points isolated with shorter paths → anomalies.

  4. Multiple isolation trees are used for robustness.
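The steps above can be sketched in a few lines of plain Python. This is a toy single-path version (it only tracks the partition containing one query point, not a full tree), but it shows why outliers bottom out after far fewer splits:

```python
import random

def isolation_path_length(x, data, depth=0, max_depth=50):
    """Count the random splits needed before point x is alone (or the cap is hit)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Step 1: pick a random feature and a random split value between its min and max
    f = random.randrange(len(x))
    lo = min(p[f] for p in data)
    hi = max(p[f] for p in data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # Step 2: keep only the side of the split that contains x
    side = [p for p in data if (p[f] < split) == (x[f] < split)]
    # The recursion depth at which x ends up alone is its path length h(x)
    return isolation_path_length(x, side, depth + 1, max_depth)

# Step 4: average the path length over many random "trees" for robustness
random.seed(0)
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)] + [(10.0, 10.0)]
avg = lambda pt: sum(isolation_path_length(pt, data) for _ in range(200)) / 200
# The far-away point (10, 10) is typically isolated in far fewer splits
# than a point inside the central cluster
```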


Practical Example#

  • A healthcare dataset with two disease-related features was used.

  • Steps:

    1. Load dataset.

    2. Fit Isolation Forest (contamination parameter defines proportion of expected anomalies).

    3. Predict anomalies (1 = normal, -1 = outlier).

    4. Visualize outliers on a scatter plot (outliers highlighted in red).

  • Results: Outliers were clearly separated from normal points, demonstrating Isolation Forest’s effectiveness.
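The steps above can be sketched with scikit-learn (assuming it is installed). The two-feature "healthcare-style" data here is synthetic, made up purely for illustration; it is not the dataset from the original example:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic stand-in for the two-feature data: a normal cluster
# plus five planted outliers far from it
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, outliers])

# contamination = expected proportion of anomalies in the data
clf = IsolationForest(contamination=0.05, random_state=42)
labels = clf.fit_predict(X)  # 1 = normal, -1 = outlier
```

For the visualization step, matplotlib's `plt.scatter` can colour points by `labels` so the `-1` points show up in red.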


Key Points#

  • Anomaly detection is unsupervised.

  • Isolation Forest isolates data points rather than clustering.

  • Outliers are detected based on how quickly they can be separated from the rest of the data.

  • Useful for fraud detection, cybersecurity, healthcare, and other domains with rare events.

Statistical / Classical Methods#

  • Z-score / Standard Deviation: Detect points that deviate from the mean by more than n standard deviations.

  • Modified Z-score: More robust to outliers; uses the median and MAD (Median Absolute Deviation).

  • Grubbs’ Test / Dixon’s Q Test: Statistical tests for single outliers.

  • Boxplot / IQR Method: Flags points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
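Both the z-score and IQR rules are a few lines of NumPy. The data and the z-threshold of 2 below are illustrative (3 is also common but can miss outliers in small samples, since a large outlier inflates the standard deviation):

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 95], dtype=float)

# Z-score rule: flag points more than 2 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 2

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
# Both rules flag the value 95 as the outlier here
```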


Distance-based Methods#

  • k-Nearest Neighbors (kNN) for anomaly detection: Anomalies have large distances to their nearest neighbors.

  • Local Outlier Factor (LOF): Measures how isolated a point is compared to its neighbors.

  • Mahalanobis Distance: Measures distance while accounting for correlations between features.
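LOF is available in scikit-learn; a minimal sketch on synthetic data (the cluster and the planted outlier are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A dense cluster plus one distant point
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[5.0, 5.0]]])

# LOF compares each point's local density to that of its neighbors
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = normal
```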


Clustering-based Methods#

  • K-Means-based anomaly detection: Points far from every cluster centroid are anomalies.

  • DBSCAN: Points labeled as noise (-1) are anomalies.

  • Hierarchical clustering: Small isolated clusters or singleton points can be anomalies.
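The DBSCAN approach falls out for free from scikit-learn's implementation, since noise points already get the label -1. A sketch on synthetic data (parameters chosen for this toy data, not a general recommendation):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# One dense cluster plus one isolated point
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)), [[5.0, 5.0]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_  # cluster members get 0, 1, ...; noise points get -1
```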


Classification / Supervised Methods#

(Requires labeled data: normal vs. anomaly)

  • Support Vector Machine (SVM) – One-Class SVM: Learns the boundary of normal points; points outside it are anomalies.

  • Random Forest / Isolation Forest: Detects anomalies by isolating points that are easier to split (Isolation Forest itself is unsupervised and needs no labels).

  • Gradient Boosting / XGBoost: For anomaly classification when labeled data is available.
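One-Class SVM is a partial exception in this section: it trains on normal data only, without anomaly labels. A minimal scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Train on "normal" data only; no anomaly labels are needed
X_train = rng.normal(0, 1, size=(200, 2))

# nu bounds the fraction of training points allowed outside the boundary
oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)

# 1 = inside the learned boundary (normal), -1 = outside (anomaly)
labels = oc.predict([[0.0, 0.0], [6.0, 6.0]])
```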


Neural Network / Deep Learning Methods#

  • Autoencoders: Reconstruct the input; a large reconstruction error → anomaly.

  • Variational Autoencoders (VAE): Probabilistic reconstruction; points with low likelihood under the model → anomaly.

  • LSTM-based Autoencoders: For time-series anomaly detection.

  • Generative Adversarial Networks (GANs): Identify anomalies as points the generator fails to reproduce well.
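A real autoencoder needs a deep-learning framework such as PyTorch, but the reconstruction-error idea can be sketched in plain NumPy with a linear "autoencoder" (a linear autoencoder is equivalent to PCA): encode to a few principal components, decode back, and flag points with large reconstruction error. The data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(200, 5))
X[:, 1] = X[:, 0] * 2.0                   # correlated features give structure to compress
X[-1] = [0.0, 10.0, 0.0, 0.0, 0.0]        # last row breaks the learned correlation

# "Encoder": top-k principal directions; "decoder": project back to input space
mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
encode = Vt[:k]
X_rec = Xc @ encode.T @ encode + mu

# Large reconstruction error => the point does not fit the learned structure
errors = np.linalg.norm(X - X_rec, axis=1)
threshold = errors.mean() + 3 * errors.std()
anomalies = errors > threshold
```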


Probabilistic / Density-based Methods#

  • Gaussian Mixture Models (GMM): Points with low probability under the model are anomalies.

  • Kernel Density Estimation (KDE): Points in low-density regions → anomalies.

  • Bayesian Networks: Probabilistic modeling to detect unusual events.
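The GMM approach is a short scikit-learn script: fit a mixture, then flag points whose log-likelihood under the model is unusually low. The two-cluster data and the 1st-percentile threshold are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated clusters of normal data
X = np.vstack([rng.normal(-3, 0.5, size=(100, 2)),
               rng.normal(3, 0.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# score_samples returns per-point log-likelihood; low values suggest anomalies
test_points = np.vstack([X, [[0.0, 0.0]]])  # (0, 0) lies between the clusters
scores = gmm.score_samples(test_points)
threshold = np.percentile(scores, 1)
flags = scores < threshold
```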


Time-Series Specific Methods#

  • ARIMA / SARIMA residuals: Residuals beyond a threshold → anomaly.

  • Prophet (Facebook Prophet): Detect deviations from predicted trends.

  • Twitter AnomalyDetection (R / Python port)
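The residual idea can be sketched in NumPy with a simple moving-average forecast standing in for a fitted ARIMA model (statsmodels' ARIMA would supply proper residuals); the series and the injected spike are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(t / 10.0) + rng.normal(0, 0.1, size=200)
series[150] += 3.0  # inject a spike at index 150

# Crude "forecast": centered moving average over a short window
window = 10
forecast = np.convolve(series, np.ones(window) / window, mode="same")
residuals = series - forecast

# Residuals beyond a threshold (here 3 standard deviations) are flagged
threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
```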


Ensemble Methods#

  • Combine multiple anomaly detection models:

    • Isolation Forest + LOF

    • Autoencoder + Statistical threshold

    • Voting / stacking ensemble
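The first combination above (Isolation Forest + LOF) can be sketched with a simple agreement vote, flagging a point only when both detectors call it an outlier; the data and contamination settings are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Normal cluster plus two planted outliers
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0], [-8.0, 8.0]]])

iso = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02).fit_predict(X)

# Voting: -1 (outlier) only where both detectors agree, else 1 (normal)
combined = np.where((iso == -1) & (lof == -1), -1, 1)
```

Requiring agreement trades recall for precision; a union vote (`|` instead of `&`) would do the opposite.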
