Workflow#
t-SNE is an unsupervised technique used mainly for dimensionality reduction for visualization. The workflow can be divided into the following key stages:
1. Data Preprocessing#
Standardize / normalize features
t-SNE is sensitive to scale.
Common: z-score normalization (mean 0, variance 1).
Optional PCA preprocessing
Reduce dimensionality to 30–50 components first.
Advantages:
Reduces noise
Speeds up t-SNE
Helps avoid local minima
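A minimal preprocessing sketch with scikit-learn; the toy array `X` and the component count are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))                    # toy data: 200 samples, 100 features

X_std = StandardScaler().fit_transform(X)          # z-score: mean 0, variance 1 per feature
X_pca = PCA(n_components=50).fit_transform(X_std)  # keep 50 components before t-SNE
```

PCA here is lossy on purpose: discarding low-variance directions removes noise and makes the pairwise-distance computations in the next step much cheaper.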
2. Compute Pairwise Similarities in High-Dimensional Space#
For each point \(x_i\), compute conditional probabilities \(p_{j|i}\) representing its similarity to the other points:
\[ p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)} \]
Symmetrize the probabilities:
\[ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \]
The perplexity parameter determines \(\sigma_i\) (the effective neighborhood size).
Intuition: nearby points get high similarity, distant points get low similarity.
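The high-dimensional similarity computation can be sketched in NumPy. For brevity this uses one fixed \(\sigma\) for every point, whereas real t-SNE binary-searches a \(\sigma_i\) per point to match the target perplexity:

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """p_{j|i}: Gaussian similarities, normalized per row.
    Simplification: a single fixed sigma instead of a per-point sigma_i."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # p_{i|i} = 0 by convention
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)     # each row sums to 1

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                    # toy data
P_cond = conditional_p(X)
P = (P_cond + P_cond.T) / (2 * len(X))          # symmetrized p_{ij}, sums to 1 overall
```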
3. Initialize Low-Dimensional Embedding#
Start with a random placement of points in 2D or 3D: \(y_i\).
Alternative: PCA initialization (can help convergence).
4. Compute Pairwise Similarities in Low-Dimensional Space#
Use a Student-t distribution with 1 degree of freedom:
\[ q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}} \]
The heavy tails help avoid the “crowding problem”: moderately distant points in high dimensions can be placed far apart in the embedding without a large penalty.
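The low-dimensional kernel is just as short in NumPy; the small random initialization of the \(y_i\) below is illustrative:

```python
import numpy as np

def lowdim_q(Y):
    """q_{ij} from a Student-t kernel with 1 degree of freedom,
    normalized over all pairs (not per row, unlike p_{j|i})."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)                  # exclude self-pairs
    return inv / inv.sum()

rng = np.random.default_rng(0)
Y = rng.normal(scale=1e-2, size=(50, 2))        # typical small random 2D init
Q = lowdim_q(Y)
```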
5. Minimize KL Divergence#
Objective: match the high-D similarities \(p_{ij}\) with the low-D similarities \(q_{ij}\) by minimizing the Kullback–Leibler divergence:
\[ C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \]
Use gradient descent to update the low-D points \(y_i\); the gradient has a simple closed form:
\[ \frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^2\right)^{-1} \]
Iterative Process:
Gradually move points to reduce KL divergence.
Early exaggeration phase: temporarily multiplies \(p_{ij}\) (typically by a factor of 4–12) so that tight, well-separated clusters form early in training.
Momentum is used to stabilize optimization.
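Steps 3–5 can be combined into a rough end-to-end sketch: random initialization, Student-t similarities, and KL gradient descent with early exaggeration and momentum. The single fixed \(\sigma\), iteration counts, and learning rate are simplifications, not the reference algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))                     # toy high-D data

# Step 2: high-D similarities (single fixed sigma for brevity;
# real t-SNE tunes sigma_i per point to hit the target perplexity)
D = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
P = np.exp(-D / 2.0)
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)                 # conditional p_{j|i}
P = (P + P.T) / (2 * len(X))                      # symmetric p_{ij}

# Step 3: small random initialization in 2D
Y = rng.normal(scale=1e-2, size=(len(X), 2))
V = np.zeros_like(Y)                              # momentum buffer

for it in range(300):
    exaggeration = 4.0 if it < 50 else 1.0        # early exaggeration phase
    # Step 4: Student-t similarities q_{ij}
    d2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + d2)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()
    # Step 5: gradient of KL(P || Q), then a momentum update
    PQ = (exaggeration * P - Q) * inv
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    V = 0.8 * V - 100.0 * grad                    # momentum 0.8, learning rate 100
    Y += V
```

The `(np.diag(PQ.sum(axis=1)) - PQ) @ Y` trick computes \(\sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}\) for all points at once instead of looping.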
6. Output Low-Dimensional Embedding#
After convergence, you get a 2D or 3D representation: \(y_i\).
Can visualize clusters and local neighborhoods.
7. Optional Post-Processing#
Color points by labels (if available) for visualization.
Annotate clusters, compute centroids, or overlay additional metadata.
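In practice the whole pipeline is a few lines with scikit-learn; a minimal sketch (assuming scikit-learn and matplotlib are installed) that colors points by the iris species labels:

```python
import matplotlib
matplotlib.use("Agg")                             # non-interactive backend
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)             # standardize before t-SNE

emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="viridis", s=15)
plt.title("t-SNE of iris, colored by species")
```

`init="pca"` corresponds to the PCA initialization mentioned in step 3; note that perplexity must be smaller than the number of samples.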
Workflow Summary Diagram#
Data preprocessing → standardization, optional PCA
Compute high-D similarities → probabilities \(p_{ij}\)
Initialize low-D embedding → random or PCA
Compute low-D similarities → \(q_{ij}\)
Minimize KL divergence → iterative gradient descent
Output embedding → 2D/3D for visualization
Post-processing / visualization → annotate, color, interpret
Intuition#
t-SNE is like folding a high-dimensional sheet of data into 2D:
Keep neighbors close
Push distant points away
Heavy-tailed distribution ensures clusters don’t collapse