DBSCAN#

DBSCAN = Density-Based Spatial Clustering of Applications with Noise. It is a clustering algorithm that groups together points that are closely packed and marks points in sparse regions as noise (outliers).


Core Intuition#

Instead of assuming clusters are spherical (like K-Means), DBSCAN finds clusters based on density.

  • Dense regions = clusters.

  • Sparse regions = noise.

  • Works well for arbitrary-shaped clusters.


Key Concepts#

  1. Parameters:

    • \(\epsilon\) (eps): radius of neighborhood.

    • \(\text{minPts}\): minimum number of points required to form a dense region.

  2. Point types:

    • Core point: has at least minPts neighbors within distance \(\epsilon\).

    • Border point: not a core, but lies within \(\epsilon\) of a core point.

    • Noise point: neither core nor border.

  3. Density reachability:

    • A point \(p\) is directly density-reachable from \(q\) if \(p\) is within \(\epsilon\) of \(q\) and \(q\) is a core point.

    • Density-connected: two points are density-connected if there is a chain of density-reachable points linking them.


Workflow#

  1. Pick an unvisited point.

  2. If it is a core point → start a new cluster.

    • Collect all density-reachable points into this cluster.

  3. If it is a border point → assign it to a nearby cluster (if possible).

  4. If it is noise → mark it as outlier.

  5. Repeat until all points are visited.


Mathematical Core#

For a point \(p\):

\[ N_\epsilon(p) = \{ q \in D \ | \ dist(p,q) \leq \epsilon \} \]

If \(|N_\epsilon(p)| \geq \text{minPts}\), then \(p\) is a core point.


Advantages#

  • Detects clusters of arbitrary shape.

  • No need to specify number of clusters (unlike K-Means).

  • Identifies noise points.

Limitations#

  • Sensitive to choice of \(\epsilon\) and minPts.

  • Struggles with clusters of varying densities.

  • Computationally expensive for large datasets (nearest neighbor queries).


Performance metrics for DBSCAN#

Since DBSCAN does not optimize a cost function, evaluation uses:

  • Silhouette score: cohesion vs separation.

  • Davies–Bouldin index.

  • Adjusted Rand Index (ARI) if ground truth labels are available.


Applications#

  • Anomaly detection (fraud, network intrusion).

  • Spatial/geographical clustering (e.g., earthquake epicenters).

  • Image segmentation.

  • Customer segmentation with irregular distributions.

Click here for Sections