DBSCAN#
DBSCAN = Density-Based Spatial Clustering of Applications with Noise. It is a clustering algorithm that groups together points that are closely packed and marks points in sparse regions as noise (outliers).
Core Intuition#
Instead of assuming clusters are spherical (like K-Means), DBSCAN finds clusters based on density.
Dense regions = clusters.
Sparse regions = noise.
Works well for arbitrary-shaped clusters.
Key Concepts#
Parameters:
\(\epsilon\) (eps): radius of neighborhood.
\(\text{minPts}\): minimum number of points required to form a dense region.
Point types:
Core point: has at least
minPtsneighbors within distance \(\epsilon\).Border point: not a core, but lies within \(\epsilon\) of a core point.
Noise point: neither core nor border.
Density reachability:
A point \(p\) is directly density-reachable from \(q\) if \(p\) is within \(\epsilon\) of \(q\) and \(q\) is a core point.
Density-connected: two points are density-connected if there is a chain of density-reachable points linking them.
Workflow#
Pick an unvisited point.
If it is a core point → start a new cluster.
Collect all density-reachable points into this cluster.
If it is a border point → assign it to a nearby cluster (if possible).
If it is noise → mark it as outlier.
Repeat until all points are visited.
Mathematical Core#
For a point \(p\):
If \(|N_\epsilon(p)| \geq \text{minPts}\), then \(p\) is a core point.
Advantages#
Detects clusters of arbitrary shape.
No need to specify number of clusters (unlike K-Means).
Identifies noise points.
Limitations#
Sensitive to choice of \(\epsilon\) and
minPts.Struggles with clusters of varying densities.
Computationally expensive for large datasets (nearest neighbor queries).
Performance metrics for DBSCAN#
Since DBSCAN does not optimize a cost function, evaluation uses:
Silhouette score: cohesion vs separation.
Davies–Bouldin index.
Adjusted Rand Index (ARI) if ground truth labels are available.
Applications#
Anomaly detection (fraud, network intrusion).
Spatial/geographical clustering (e.g., earthquake epicenters).
Image segmentation.
Customer segmentation with irregular distributions.