t-SNE#

  • t-SNE is a non-linear dimensionality reduction technique.

  • It’s mainly used for visualizing high-dimensional data in 2D or 3D.

  • Unlike PCA (linear), t-SNE preserves local structure and clusters in the data.

Use case: Visualizing clusters in datasets like images, word embeddings, or gene expression data.
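As a quick illustration, here is a minimal sketch using scikit-learn's `TSNE` on the digits dataset (assuming scikit-learn is installed; the subset size and seed are arbitrary choices to keep the demo fast):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 images, 64 dimensions each
X, y = X[:300], y[:300]               # small subset so the demo runs quickly

# Embed the 64-dimensional images into 2D for plotting
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
print(emb.shape)   # (300, 2)
```

Plotting `emb` colored by `y` (e.g. with `matplotlib`) typically shows the ten digit classes as separate clusters.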


2. Key Idea#

t-SNE tries to map similar points in high-dimensional space close together in low-dimensional space, and dissimilar points far apart.

  1. Compute pairwise similarities in high-dimensional space:

    • Convert distances into probabilities using a Gaussian distribution:

    \[ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \]
    • \(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}\)

  2. Compute pairwise similarities in low-dimensional space:

    • Use a Student-t distribution with 1 degree of freedom (heavy tails):

    \[ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \]
  3. Minimize Kullback-Leibler (KL) divergence between high- and low-dimensional similarities:

\[ \text{KL}(P || Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \]
  • Intuition: Points close in high-dimensional space should be close in 2D/3D space.
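The three quantities above can be computed directly with NumPy. This is an illustrative sketch that fixes a single bandwidth `sigma` for all points; real t-SNE tunes each \(\sigma_i\) by binary search so that the conditional distribution matches a target perplexity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))   # toy high-dimensional data: 6 points, 4 dims
n = X.shape[0]
sigma = 1.0                   # fixed bandwidth for illustration only

# Pairwise squared Euclidean distances ||x_i - x_j||^2
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

# Conditional probabilities p_{j|i}: Gaussian kernel, self-similarity excluded
P_cond = np.exp(-D / (2 * sigma**2))
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)

# Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n)
P = (P_cond + P_cond.T) / (2 * n)

# Random low-dimensional embedding and Student-t similarities q_ij
Y = rng.normal(scale=1e-2, size=(n, 2))
Dy = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
W = 1.0 / (1.0 + Dy)          # heavy-tailed kernel (1 + ||y_i - y_j||^2)^(-1)
np.fill_diagonal(W, 0.0)
Q = W / W.sum()

# KL(P || Q), summed over i != j
mask = ~np.eye(n, dtype=bool)
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(f"KL(P||Q) = {kl:.4f}")
```

Both `P` and `Q` sum to 1 over all pairs, so they are valid probability distributions that the KL divergence can compare.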


3. How t-SNE Works (Step by Step)#

  1. Compute high-dimensional probabilities \(p_{ij}\) representing similarity.

  2. Initialize points in low-dimensional space randomly (\(y_i\)).

  3. Compute low-dimensional probabilities \(q_{ij}\).

  4. Minimize KL divergence using gradient descent.

  5. Iterate until low-dimensional embedding preserves local neighborhoods.
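The steps above can be sketched as plain gradient descent in NumPy. This toy version omits the per-point bandwidth search, momentum, and early exaggeration that production implementations use, and fixes a single `sigma`:

```python
import numpy as np

def joint_probabilities(X, sigma=1.0):
    """Step 1: symmetrized Gaussian similarities p_ij (fixed sigma for
    simplicity; real t-SNE tunes sigma_i per point via perplexity)."""
    n = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)
    return (P + P.T) / (2 * n)

def kl_and_grad(Y, P):
    """Steps 3-4: KL(P||Q) and its gradient w.r.t. Y, using
    dC/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j),
    where w_ij = (1 + ||y_i - y_j||^2)^(-1)."""
    Dy = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + Dy)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    mask = ~np.eye(Y.shape[0], dtype=bool)
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    PQ = (P - Q) * W
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return kl, grad

rng = np.random.default_rng(0)
# Two well-separated 5-D clusters as toy input
X = np.vstack([rng.normal(0, 0.3, size=(10, 5)),
               rng.normal(3, 0.3, size=(10, 5))])
P = joint_probabilities(X)
Y = rng.normal(scale=1e-2, size=(20, 2))   # step 2: random initialization

kl_history = []
for _ in range(1000):                      # step 5: iterate
    kl, grad = kl_and_grad(Y, P)
    kl_history.append(kl)
    Y -= 1.0 * grad                        # plain gradient descent, no momentum
print(f"KL: {kl_history[0]:.4f} -> {kl_history[-1]:.4f}")
```

The KL divergence decreases over the iterations as the 2D layout comes to reflect the high-dimensional neighborhoods.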


4. Important Hyperparameters#

| Hyperparameter | Effect | Typical Values |
|----------------|--------|----------------|
| perplexity | Balances local vs. global structure: low values focus on small clusters, high values on larger neighborhoods | 5–50 |
| learning_rate | Step size for gradient descent | 10–1000 |
| n_iter | Number of iterations | 1000+ |
| metric | Distance metric | ‘euclidean’, ‘cosine’, etc. |

Tip: t-SNE is mostly for visualization, not feature reduction for predictive models.
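These hyperparameters map onto the constructor arguments of scikit-learn's `TSNE` (note that recent scikit-learn versions rename `n_iter` to `max_iter`). An illustrative configuration, with arbitrary values chosen for a small demo:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:200]                   # small subset for speed

tsne = TSNE(
    n_components=2,
    perplexity=15,            # must be smaller than the number of samples
    learning_rate=200.0,      # typical values fall in the 10-1000 range
    metric="euclidean",
    init="pca",               # PCA init is usually more stable than random
    random_state=0,           # fix the seed for reproducible embeddings
)
emb = tsne.fit_transform(X)
print(emb.shape)              # (200, 2)
```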


5. Strengths & Limitations#

Strengths#

  • Captures non-linear structure.

  • Excellent for visualizing clusters.

  • Preserves local neighborhoods better than PCA.

Limitations#

  • Does not preserve global distances.

  • Sensitive to hyperparameters (perplexity, learning_rate).

  • Computationally expensive for large datasets.

  • Embeddings are non-deterministic: different runs may produce different layouts unless the random seed is fixed.
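The last point is easy to check: with scikit-learn's `TSNE`, fixing `random_state` makes repeated runs on the same data reproducible (a small sketch, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:150]

# Same seed and same data -> identical embeddings across runs;
# different seeds will generally give different (rotated/shuffled) layouts.
a = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
b = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(np.allclose(a, b))   # True
```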


6. Intuition#

  • Imagine high-dimensional points connected with springs.

  • t-SNE stretches and squeezes points in 2D so that similar points stay close and dissimilar points are far apart, using a special heavy-tailed distribution to avoid crowding.
