Hyperparameter Tuning#
\(\epsilon\) (eps): the radius of the neighborhood around each point.
minPts: the minimum number of points required within an \(\epsilon\)-neighborhood for a point to be a core point (i.e. to seed a dense cluster).
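To make the two knobs concrete, here is a minimal sketch of a DBSCAN call with scikit-learn; the data and the eps/min_samples values are purely illustrative placeholders, not recommendations.
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data (illustrative only)
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])

# eps         -> neighborhood radius (epsilon)
# min_samples -> minPts (scikit-learn counts the point itself)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

print(db.labels_)   # cluster labels; -1 marks noise points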
2. Hyperparameter Tuning#
Choosing \(\epsilon\)#
Too small → many points become noise and clusters fragment into many small pieces (overfitting).
Too large → distinct clusters merge together (underfitting, loss of structure).
Method: k-distance graph
Compute distance of each point to its \(k\)-th nearest neighbor (\(k = \text{minPts}\)).
Sort these distances.
Look for an “elbow” (a sharp bend). The distance at the elbow is a good candidate for \(\epsilon\); the code cell at the end of this page builds exactly this plot.
Choosing minPts#
Rule of thumb: \(\text{minPts} \geq d+1\), where \(d\) = data dimension.
Larger datasets: minPts on the order of \(\ln(n)\); both heuristics are sketched in code after this list.
Smaller minPts → risk of tiny spurious clusters.
Larger minPts → risk of treating small real clusters as noise.
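A tiny sketch of these rules of thumb; the constants are heuristics, not firm guidance, and the n and d values below are just example numbers.
import numpy as np

n, d = 300, 2                       # number of samples, data dimension (example values)

min_pts_lower = d + 1               # minimum sensible value
min_pts_common = 2 * d              # common default for low-dimensional data
min_pts_large_n = int(np.log(n))    # heuristic sometimes used for larger datasets

print(min_pts_lower, min_pts_common, min_pts_large_n)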
3. Overfitting vs Underfitting in DBSCAN#
Overfitting (too many small clusters, fragmented)
Caused by very small \(\epsilon\), small minPts.
Symptoms: many clusters, many outliers.
Fix: increase \(\epsilon\) or minPts.
Underfitting (too few clusters, merged into one)
Caused by large \(\epsilon\), large minPts.
Symptoms: dataset collapses into 1 cluster or most points noise.
Fix: decrease \(\epsilon\) or minPts. Both extremes are demonstrated in the sketch below.
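A minimal sketch of both failure modes on the same two-moons data used in the code cell at the end of this page; the eps values (0.05 and 0.5) are illustrative extremes, not recommendations. A very small eps typically fragments the moons into many pieces with lots of noise, while a very large eps typically merges them into one cluster.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# Illustrative extreme values for eps; min_samples fixed at 4 (= 2 * dimension)
for eps in (0.05, 0.5):
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_frac = np.mean(labels == -1)
    print(f"eps={eps}: {n_clusters} clusters, {noise_frac:.0%} noise")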
4. Practical Tuning Workflow#
Standardize data (DBSCAN is distance-based).
Use k-distance plot to select \(\epsilon\).
Choose minPts ≈ dimensionality × 2 (or log(n)).
Run DBSCAN and check metrics (Silhouette, Davies–Bouldin; ARI if ground-truth labels are available).
Adjust iteratively; an end-to-end sketch of this workflow follows this list.
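An end-to-end sketch of the workflow on the same two-moons data; the eps value used here (0.3 after standardization) is an assumed pick read off an elbow plot, not a tuned result.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# 1. Standardize (DBSCAN is distance-based)
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

# 2. k-distance values for the elbow plot (k = minPts)
min_pts = 4                                    # 2 * dimensionality
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])                 # inspect this curve for the elbow

# 3. Run DBSCAN with an eps read off the elbow (assumed value)
labels = DBSCAN(eps=0.3, min_samples=min_pts).fit_predict(X)

# 4. Evaluate on the clustered (non-noise) points only
mask = labels != -1
if mask.sum() > 0 and len(set(labels[mask])) > 1:
    print("Silhouette:", silhouette_score(X[mask], labels[mask]))
print("Clusters:", len(set(labels)) - (1 if -1 in labels else 0),
      "| noise fraction:", np.mean(labels == -1))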
5. Alternatives when DBSCAN fails#
HDBSCAN: a hierarchical extension that handles clusters of varying density and removes the need to pick a single \(\epsilon\).
OPTICS: orders points by reachability distance rather than committing to one fixed \(\epsilon\) (a short sketch appears after the code cell below).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
# Generate data
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
# Compute k-nearest neighbors distances (k = 4, as minPts rule of thumb for 2D)
k = 4
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nbrs.kneighbors(X)   # neighbor indices are not needed here
# Take each point's k-th neighbor distance: column 0 is the point itself (distance 0),
# so column k-1 is the minPts-th point counting the point itself, matching how
# scikit-learn's min_samples counts the query point
k_distances = np.sort(distances[:, k - 1])
# Plot k-distance graph to find epsilon "elbow"
plt.figure(figsize=(7,5))
plt.plot(k_distances)
plt.ylabel(f"{k}-NN distance")
plt.xlabel("Points sorted by distance")
plt.title("k-distance plot (use elbow to choose eps)")
plt.grid(True)
plt.show()
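Finally, a minimal sketch of the OPTICS alternative mentioned in section 5, reusing the same two-moons data; the min_samples value is illustrative, not tuned, and the reachability plot shown is the OPTICS analogue of the k-distance plot above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# OPTICS scans a range of neighborhood radii instead of committing to one eps;
# min_samples plays the same role as minPts in DBSCAN
optics = OPTICS(min_samples=4).fit(X)

labels = optics.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("OPTICS clusters:", n_clusters, "| noise points:", int((labels == -1).sum()))

# Reachability plot: valleys correspond to clusters, peaks to separations/noise
plt.figure(figsize=(7, 4))
plt.plot(optics.reachability_[optics.ordering_])
plt.xlabel("Points in cluster order")
plt.ylabel("Reachability distance")
plt.title("OPTICS reachability plot")
plt.grid(True)
plt.show()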