t-SNE#
t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique.
It is mainly used for visualizing high-dimensional data in 2D or 3D.
Unlike PCA, which is linear, t-SNE preserves local structure and reveals clusters in the data.
Use case: Visualizing clusters in datasets like images, word embeddings, or gene expression data.
2. Key Idea#
t-SNE tries to map similar points in high-dimensional space close together in low-dimensional space, and dissimilar points far apart.
Compute pairwise similarities in high-dimensional space:
Convert distances into probabilities using a Gaussian distribution:
\[ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \]
These conditional probabilities are then symmetrized into joint probabilities:
\[ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \]
Compute pairwise similarities in low-dimensional space:
Use a Student-t distribution with 1 degree of freedom (heavy tails):
\[ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \]
Minimize the Kullback-Leibler (KL) divergence between the high- and low-dimensional similarities:
\[ C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \]
Intuition: Points close in high-dimensional space should be close in 2D/3D space.
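The similarity formulas above can be sketched directly in NumPy. This is a toy illustration with a single fixed bandwidth `sigma`; real t-SNE searches a separate \(\sigma_i\) per point to match a target perplexity, and the 2-D layout `Y` here is random rather than optimized:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 toy points in 3-D

# Squared pairwise distances ||x_i - x_j||^2
sq_d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# Conditional similarities p_{j|i} with a fixed Gaussian bandwidth.
# (Real t-SNE tunes sigma_i per point to hit a target perplexity.)
sigma = 1.0
P_cond = np.exp(-sq_d / (2 * sigma ** 2))
np.fill_diagonal(P_cond, 0.0)          # p_{i|i} = 0 by convention
P_cond /= P_cond.sum(axis=1, keepdims=True)

# Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n)
n = X.shape[0]
P = (P_cond + P_cond.T) / (2 * n)

# Low-dimensional similarities q_ij with a Student-t kernel (heavy tails)
Y = rng.normal(size=(n, 2))            # toy, un-optimized 2-D layout
sq_dy = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
Q = 1.0 / (1.0 + sq_dy)
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

print(round(P.sum(), 6), round(Q.sum(), 6))  # both distributions sum to 1
```

Both \(P\) and \(Q\) are probability distributions over point pairs, which is what makes the KL divergence between them well defined.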
3. How t-SNE Works (Step by Step)#
Compute high-dimensional probabilities \(p_{ij}\) representing similarity.
Initialize points in low-dimensional space randomly (\(y_i\)).
Compute low-dimensional probabilities \(q_{ij}\).
Minimize KL divergence using gradient descent.
Iterate until low-dimensional embedding preserves local neighborhoods.
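The steps above are implemented end to end by `sklearn.manifold.TSNE`; a minimal usage sketch on the scikit-learn digits dataset (the parameter values are illustrative, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images flattened to 64-dimensional vectors
X, y = load_digits(return_X_y=True)

# Embed into 2-D for plotting; these settings are illustrative
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (1797, 2)
```

The resulting `X_2d` can be scatter-plotted with the labels `y` as colors to inspect cluster structure.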
4. Important Hyperparameters#
| Hyperparameter | Effect | Typical Values |
|---|---|---|
| `perplexity` | Balances local vs. global structure. Low = focus on small clusters, high = larger neighborhoods | 5–50 |
| `learning_rate` | Step size for gradient descent | 10–1000 |
| `n_iter` | Number of iterations | 1000+ |
| `metric` | Distance metric | 'euclidean', 'cosine', etc. |
⚡ Tip: t-SNE is mostly for visualization, not feature reduction for predictive models.
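To see the perplexity trade-off in practice, the same data can be embedded at different settings; a sketch on a subset of the digits data to keep the runs quick (subset size and perplexity values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:300]  # small subset so the runs stay fast

# Low perplexity emphasizes tight local clusters; higher values
# consider broader neighborhoods of each point.
for perp in (5, 30):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    print(perp, emb.shape)
```

Plotting the two embeddings side by side makes the difference visible: the low-perplexity run tends to fragment the data into many small clumps, the higher one into fewer, broader groups.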
5. Strengths & Limitations#
Strengths#
Captures non-linear structure.
Excellent for visualizing clusters.
Preserves local neighborhoods better than PCA.
Limitations#
Does not preserve global distances.
Sensitive to hyperparameters (perplexity, learning_rate).
Computationally expensive for large datasets.
Embeddings are non-deterministic (different runs may differ unless a random seed is fixed).
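The non-determinism can be pinned down by fixing `random_state`, which makes repeated runs reproducible on the same machine (a sketch; the dataset subset and parameter choices are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:200]

# Same seed -> same initialization and same gradient-descent path,
# so the two embeddings should coincide.
a = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(X)
b = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(X)
print(np.allclose(a, b))
```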
6. Intuition#
Imagine high-dimensional points connected with springs.
t-SNE stretches and squeezes points in 2D so that similar points stay close and dissimilar points are far apart, using a special heavy-tailed distribution to avoid crowding.