Cost Function

Cost Function #

DBSCAN does not use a cost function in the way algorithms like K-Means or PCA do. Instead, it is rule-based: clusters are defined by density conditions. Still, we can frame its logic mathematically.

1. Implicit Objective of DBSCAN#

The algorithm seeks a partition of dataset \(D\) into:

Clusters = maximal sets of density-connected points.
Noise = points not assigned to any cluster.

No optimization of a global scalar function happens. Instead, clustering emerges from local density checks.

2. Local Rules (instead of cost function)#

Neighborhood condition:

\[ N_\epsilon(p) = \{ q \in D \ | \ \text{dist}(p,q) \leq \epsilon \} \]

Core condition:

\[ |N_\epsilon(p)| \geq \text{minPts} \]

Cluster condition: Clusters are unions of points that are mutually density-connected.

3. Why no global cost function?#

DBSCAN does not optimize inter-cluster variance or distances.
It is deterministic given \(\epsilon\) and minPts.
Different parameter choices produce different clusterings.

4. How to evaluate DBSCAN output (proxy cost functions)#

Since there’s no internal objective, external metrics are used:

Silhouette coefficient

\[ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \]

where \(a(i)\) = average intra-cluster distance, \(b(i)\) = nearest-cluster distance.

Davies–Bouldin Index: ratio of within-cluster scatter to between-cluster separation.
Adjusted Rand Index (ARI) or Mutual Information if ground-truth labels exist.

Summary

DBSCAN has no explicit cost function.
It uses density-based rules instead of optimization.
Quality of clustering is judged via external evaluation metrics.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

# Prepare datasets
X_moons, y_moons = make_moons(n_samples=300, noise=0.06, random_state=42)
X_blobs, y_blobs = make_blobs(n_samples=300, centers=[(-3, -3), (0, 0), (3, 3)], 
                              cluster_std=[0.3, 1.0, 2.0], random_state=42)

datasets = [
    ("Moons (good for DBSCAN)", X_moons, y_moons, 0.25, 5),
    ("Blobs (varying density)", X_blobs, y_blobs, 0.6, 5),
]

results = []

for name, X, y_true, eps, minPts in datasets:
    db = DBSCAN(eps=eps, min_samples=minPts).fit(X)
    labels = db.labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    
    # Compute evaluation metrics when applicable
    metrics = {}
    metrics['n_clusters'] = n_clusters
    metrics['n_noise'] = n_noise
    metrics['eps'] = eps
    metrics['minPts'] = minPts
    
    if n_clusters >= 2:
        metrics['silhouette'] = silhouette_score(X, labels)
        metrics['davies_bouldin'] = davies_bouldin_score(X, labels)
    else:
        metrics['silhouette'] = None
        metrics['davies_bouldin'] = None
    
    # ARI vs ground truth (tolerates noise label -1)
    metrics['ARI'] = adjusted_rand_score(y_true, labels)
    
    results.append((name, X, labels, metrics))

# Print results
for name, X, labels, metrics in results:
    print(f"Dataset: {name}")
    print(f" eps={metrics['eps']}, minPts={metrics['minPts']}")
    print(f" Number of clusters (excluding noise): {metrics['n_clusters']}")
    print(f" Number of noise points: {metrics['n_noise']}")
    if metrics['silhouette'] is not None:
        print(f" Silhouette score: {metrics['silhouette']:.4f}")
        print(f" Davies-Bouldin index: {metrics['davies_bouldin']:.4f}")
    else:
        print(" Silhouette score: Not computed (requires >= 2 clusters)")
        print(" Davies-Bouldin index: Not computed (requires >= 2 clusters)")
    print(f" Adjusted Rand Index (vs true labels): {metrics['ARI']:.4f}")
    print("-" * 60)

# Plot each dataset separately as required
# Redo visualization with subplots for clarity
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

for ax, (name, X, labels, metrics) in zip(axs, results):
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=35, cmap="tab10")
    ax.set_title(f"{name}\nclusters={metrics['n_clusters']}, noise={metrics['n_noise']}")
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")

plt.tight_layout()
plt.show()

Dataset: Moons (good for DBSCAN)
 eps=0.25, minPts=5
 Number of clusters (excluding noise): 2
 Number of noise points: 0
 Silhouette score: 0.3298
 Davies-Bouldin index: 1.1514
 Adjusted Rand Index (vs true labels): 1.0000
------------------------------------------------------------
Dataset: Blobs (varying density)
 eps=0.6, minPts=5
 Number of clusters (excluding noise): 3
 Number of noise points: 72
 Silhouette score: 0.3689
 Davies-Bouldin index: 1.7966
 Adjusted Rand Index (vs true labels): 0.5813
------------------------------------------------------------

../../../_images/593d9263ece8715834cbf37e4ba7f787417738374dd5ae1d77a412065b933c63.png

Results below.

Moons dataset: 2 clusters, 0 noise. Silhouette 0.3298. Davies–Bouldin 1.1514. ARI 1.0000. Blobs (varying density): 3 clusters, 72 noise points. Silhouette 0.3689. Davies–Bouldin 1.7966. ARI 0.5813.

Interpretation, short:

Silhouette near 0.3 indicates moderate cluster separation.
Lower Davies–Bouldin is better. Moons fit DBSCAN shape assumptions so metrics are strong and ARI is 1.0.
Blobs show many noise points and lower ARI. This reflects DBSCAN’s single global density assumption failing when clusters have different densities.

Would you like the code used or a version where I sweep eps and minPts and show metric trends?