Workflow#

t-SNE is an unsupervised method used mainly to reduce high-dimensional data to two or three dimensions for visualization. The workflow can be divided into the following key stages:


1. Data Preprocessing#

  1. Standardize / normalize features

    • t-SNE is sensitive to scale.

    • Common: z-score normalization (mean 0, variance 1).

  2. Optional PCA preprocessing

    • Reduce dimensionality to 30–50 components first.

    • Advantages:

      • Reduces noise

      • Speeds up t-SNE

      • Helps avoid local minima
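The preprocessing stage can be sketched with scikit-learn's `StandardScaler` and `PCA` (the data here is random and stands in for a real dataset; 50 components is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))  # 200 samples, 100 features (stand-in data)

# z-score normalization: mean 0, variance 1 per feature
X_std = StandardScaler().fit_transform(X)

# optional PCA step down to 50 components before running t-SNE
X_pca = PCA(n_components=50, random_state=0).fit_transform(X_std)
print(X_pca.shape)
```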


2. Compute Pairwise Similarities in High-Dimensional Space#

  1. For each point \(x_i\), compute conditional probabilities \(p_{j|i}\) representing its similarity to the other points:

\[ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \]
  2. Symmetrize the probabilities:

\[ p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} \]
  3. The perplexity parameter determines each \(\sigma_i\) (the effective neighborhood size).

Intuition: nearby points get high similarity; distant points get low similarity.
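A minimal numpy sketch of these similarities, using one fixed bandwidth \(\sigma\) for every point (real t-SNE tunes each \(\sigma_i\) by binary search so the distribution's perplexity matches the user-set value):

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """p_{j|i} with a single fixed bandwidth sigma for every point
    (an illustrative simplification of the per-point sigma_i search)."""
    # squared Euclidean distances ||x_i - x_j||^2
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)  # a point is not its own neighbor
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)  # each row sums to 1

X = np.random.default_rng(0).normal(size=(6, 3))
P_cond = conditional_p(X)
P = (P_cond + P_cond.T) / (2 * len(X))  # symmetrized p_{ij}, sums to 1 overall
```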


3. Initialize Low-Dimensional Embedding#

  • Start with a random placement of points in 2D or 3D: \(y_i\).

  • Alternative: PCA initialization (can help convergence).


4. Compute Pairwise Similarities in Low-Dimensional Space#

  • Use a Student-t distribution with 1 degree of freedom:

\[ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \]
  • Heavy tails help avoid the “crowding problem”: moderately distant points can be placed farther apart in the embedding instead of piling up.
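The low-dimensional similarities can be sketched in a few lines of numpy, following the Student-t formula above:

```python
import numpy as np

def low_dim_q(Y):
    """q_{ij} from the Student-t kernel with 1 degree of freedom,
    normalized over all pairs."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq)      # heavy-tailed kernel (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(inv, 0.0)  # exclude i = j
    return inv / inv.sum()

Y = np.random.default_rng(1).normal(size=(5, 2))
Q = low_dim_q(Y)  # a single probability distribution over all pairs
```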


5. Minimize KL Divergence#

  • Objective: match high-D similarities (\(p_{ij}\)) with low-D similarities (\(q_{ij}\)) by minimizing:

\[ \text{KL}(P || Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \]
  • Use gradient descent to update low-D points \(y_i\).

Iterative Process:

  • Gradually move points to reduce KL divergence.

  • Early exaggeration phase: multiplies \(p_{ij}\) temporarily to form tight clusters early in training.

  • Momentum is used to stabilize optimization.
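The gradient of the KL objective with respect to \(y_i\) is \(4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}\). A minimal sketch of the update, leaving out momentum and early exaggeration for brevity (the toy `P` below is a random symmetric distribution, not one computed from real data):

```python
import numpy as np

def kl_step(Y, P, lr=1.0):
    """One plain gradient-descent step on KL(P || Q);
    momentum and early exaggeration are omitted for brevity."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq)                # Student-t kernel
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()
    # grad_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^-1
    PQ = (P - Q) * inv
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return Y - lr * grad

# toy symmetric P summing to 1, random 2D starting positions
rng = np.random.default_rng(0)
P = rng.random((8, 8))
P = P + P.T
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y0 = rng.normal(size=(8, 2))
Y = Y0.copy()
for _ in range(100):
    Y = kl_step(Y, P)
```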


6. Output Low-Dimensional Embedding#

  • After convergence, you get a 2D or 3D representation: \(y_i\).

  • Can visualize clusters and local neighborhoods.
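In practice the whole pipeline is usually run through scikit-learn's `TSNE`; a sketch on the Iris dataset (parameter values are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# standardize, then run the full t-SNE pipeline down to 2D
X = StandardScaler().fit_transform(load_iris().data)
emb = TSNE(n_components=2, perplexity=30.0, init="pca",
           early_exaggeration=12.0, random_state=0).fit_transform(X)
print(emb.shape)  # one 2D point per sample
```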


7. Optional Post-Processing#

  • Color points by labels (if available) for visualization.

  • Annotate clusters, compute centroids, or overlay additional metadata.
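For the coloring step, a matplotlib sketch using a hypothetical embedding and label array as stand-ins (substitute the real t-SNE output and your own labels):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# stand-ins for the real embedding and labels
rng = np.random.default_rng(0)
emb = rng.normal(size=(150, 2))
labels = np.repeat([0, 1, 2], 50)

fig, ax = plt.subplots(figsize=(5, 4))
sc = ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=15)
ax.set_xlabel("t-SNE dim 1")
ax.set_ylabel("t-SNE dim 2")
fig.colorbar(sc, ax=ax, label="class label")
fig.savefig("tsne_embedding.png", dpi=150)
```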


Workflow Summary Diagram#

  1. Data preprocessing → standardization, optional PCA

  2. Compute high-D similarities → probabilities \(p_{ij}\)

  3. Initialize low-D embedding → random or PCA

  4. Compute low-D similarities → Student-t probabilities \(q_{ij}\)

  5. Minimize KL divergence → iterative gradient descent

  6. Output embedding → 2D/3D for visualization

  7. Post-processing / visualization → annotate, color, interpret


Intuition#

  • t-SNE is like folding a high-dimensional sheet of data into 2D:

    • Keep neighbors close

    • Push distant points away

    • Heavy-tailed distribution ensures clusters don’t collapse