Hyperparameter Tuning#

1. Key Hyperparameters#

Unlike Random Forest or SVM, PCA has very few hyperparameters. The main ones are:

a) Number of components (n_components)#

  • Definition: The number of principal components to keep (k).

  • Effect:

    • Too small → lose too much information (high reconstruction error).

    • Too large → include noise, less dimensionality reduction benefit.

b) Whitening (whiten)#

  • Definition: Whether to scale the principal components to unit variance.

  • Effect:

    • whiten=True → components are decorrelated and rescaled to unit variance.

    • Useful for feeding into algorithms sensitive to scale (like k-NN, SVM).

c) SVD solver (svd_solver)#

  • Options: auto, full, arpack, randomized

  • Effect: Determines the algorithm used to compute the SVD (mainly affects speed and memory on large datasets).
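
A minimal sketch of setting these in scikit-learn (the data here is a random placeholder; randomized is usually only worthwhile on large matrices):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 50)  # placeholder data: 500 samples, 50 features

# n_components: keep 10 directions; whiten: rescale them to unit variance;
# svd_solver='randomized' trades a little accuracy for speed on big data
pca = PCA(n_components=10, whiten=True, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (500, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept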


2. Hyperparameter Tuning Goal#

For PCA, tuning usually means selecting n_components to balance:

  1. Variance explained: Keep enough components to capture most variance.

  2. Dimensionality reduction: Reduce features as much as possible.


3. Techniques for Tuning n_components#

a) Variance Explained Threshold#

  • Plot the cumulative explained variance ratio and pick the smallest number of components that captures a target fraction of the variance, e.g., 90–95%.

  • Formula:

\[ \text{Explained Variance Ratio} = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j} \]

Where \(\lambda_i\) is the i-th eigenvalue.

Rule of Thumb: Keep the smallest number of components that explains ≥90–95% of variance.
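
A minimal sketch of this rule with scikit-learn (the correlated random matrix is a stand-in for your own standardized data):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 20))  # correlated placeholder data

pca = PCA().fit(X)  # fit with all components to inspect the full spectrum
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the 95% threshold
k = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Keep {k} components ({cumulative[k - 1]:.1%} of variance)")

Note that scikit-learn can also do this selection directly: passing a float such as PCA(n_components=0.95) keeps the smallest number of components whose cumulative explained variance exceeds that fraction.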


b) Cross-Validation (with downstream task)#

If PCA is used before a supervised learning model:

  1. Try different n_components.

  2. Train your model (e.g., logistic regression) on the reduced data.

  3. Evaluate performance (accuracy, F1-score, etc.).

  4. Pick n_components giving best validation performance.

This is more robust because variance alone doesn’t guarantee optimal performance for predictive tasks.
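
One way to implement this loop with scikit-learn's Pipeline and GridSearchCV (the digits dataset and the component grid are placeholders for your own problem):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scale -> PCA -> classifier, so PCA is refit inside every CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

grid = {"pca__n_components": [5, 10, 20, 30, 40]}
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"accuracy={search.best_score_:.3f}")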


c) Scree Plot#

  • Plot eigenvalues (variance per component) vs component index.

  • Look for an “elbow,” the point where adding more components yields diminishing returns.
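
A quick sketch of a scree plot (again on correlated placeholder data):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15)) @ rng.normal(size=(15, 15))  # correlated placeholder data

pca = PCA().fit(X)

# Eigenvalue (variance per component) vs component index; look for the elbow
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, "o-")
plt.xlabel("Component index")
plt.ylabel("Eigenvalue (explained variance)")
plt.title("Scree plot")
plt.show()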


4. Quick Summary Table#

| Hyperparameter | Tuning Strategy | Notes |
|----------------|-----------------|-------|
| n_components | Variance threshold, CV with downstream model, scree plot | Most important |
| whiten | Try True/False | Useful if the model is sensitive to scale |
| svd_solver | Depends on data size | Large datasets → randomized for speed |


Intuition#

  • PCA is unsupervised, so tuning is mostly about information retention.

  • If using PCA as preprocessing, always validate via model performance after dimensionality reduction.

Handle Overfitting/Underfitting#

1. Overfitting vs. Underfitting with PCA#

PCA is an unsupervised dimensionality reduction technique, so overfitting and underfitting are less direct than in supervised learning. They still matter, however, when PCA is used as preprocessing for a model:

  • Overfitting:

    • Happens if you retain too many principal components, including noise.

    • The downstream model learns noise instead of signal.

  • Underfitting:

    • Happens if you retain too few principal components, losing important information.

    • The downstream model cannot capture the true patterns in data.


2. How to Handle Overfitting in PCA#

a) Reduce number of components#

  • Keep only the components that capture the majority of variance (90–95%).

  • Discard minor components which may contain noise.

b) Use Cross-Validation#

  • Evaluate model performance with different n_components.

  • Choose the number of components that optimizes validation performance.

c) Regularization in downstream models#

  • Even with PCA, the model itself can overfit. Apply regularization (L1/L2) after dimensionality reduction.
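
For example, a pipeline that follows PCA with an L2-regularized classifier (the component count and the regularization strength C are illustrative values, not tuned ones):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Smaller C = stronger L2 penalty on the downstream model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("clf", LogisticRegression(penalty="l2", C=0.1, max_iter=1000)),
])
# Fit with pipe.fit(X_train, y_train) on your own data; C can be tuned
# jointly with n_components via GridSearchCV as shown above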


3. How to Handle Underfitting in PCA#

a) Increase number of components#

  • If too little variance is retained, include more principal components.

b) Feature engineering#

  • PCA reduces dimensions but cannot create new signal. Ensure important features are not discarded before PCA.

c) Use non-linear dimensionality reduction#

  • Standard PCA is linear. If patterns are non-linear, try Kernel PCA or t-SNE for better representation.
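
A minimal sketch contrasting the two on data with non-linear structure (the RBF kernel and gamma value here are illustrative choices):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric rings: no straight direction separates them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_linear = PCA(n_components=2).fit_transform(X)  # can only rotate the rings
X_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# In the RBF feature space the first kernel component separates the rings,
# which linear PCA cannot do in the original coordinates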


4. Practical Workflow to Avoid Overfitting/Underfitting#

  1. Standardize data → PCA is sensitive to scale.

  2. Compute explained variance ratio → Decide initial n_components.

  3. Cross-validate downstream model with different n_components.

  4. Monitor performance → Look for improvement plateau to avoid overfitting.

  5. Adjust components → Increase if underfitting, decrease if overfitting.
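
A sketch of steps 3–4, using scikit-learn's validation_curve to spot the performance plateau (the dataset and component grid are placeholders):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

ks = [2, 5, 10, 20, 30, 40, 50]
train_scores, val_scores = validation_curve(
    pipe, X, y, param_name="pca__n_components", param_range=ks, cv=5
)
for k, score in zip(ks, val_scores.mean(axis=1)):
    print(f"k={k:2d}  validation accuracy={score:.3f}")  # look for the plateau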


Intuition#

  • Think of PCA as a filter:

    • Keep enough components to preserve signal → avoid underfitting.

    • Discard components that capture noise → avoid overfitting.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Generate synthetic 3D data
np.random.seed(42)
n_samples = 100
X = np.random.rand(n_samples, 2) * 10
Z_noise = np.random.randn(n_samples) * 0.2  # small noise in the 3rd dimension
# 3rd coordinate is a linear combination of the first two plus noise,
# so the data lies close to a 2-D plane embedded in 3-D
X = np.column_stack((X, 0.5 * X[:, 0] + 0.2 * X[:, 1] + Z_noise))

# Apply PCA with different components
components = [1, 2, 3]
reconstructions = []

for k in components:
    pca = PCA(n_components=k)
    X_reduced = pca.fit_transform(X)
    X_reconstructed = pca.inverse_transform(X_reduced)
    reconstructions.append(X_reconstructed)

# Plot original and reconstructed data
fig = plt.figure(figsize=(18, 5))

titles = ["Underfitting (1 Component)", "Good Fit (2 Components)", "Overfitting (3 Components)"]

for i in range(3):
    ax = fig.add_subplot(1, 3, i+1, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], alpha=0.3, label='Original')
    ax.scatter(reconstructions[i][:, 0], reconstructions[i][:, 1], reconstructions[i][:, 2],
               alpha=0.7, label='Reconstructed')
    ax.set_title(titles[i])
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z')
    ax.view_init(20, 30)
    ax.legend()

plt.show()

What You’ll See in the 3D Plot#

  1. Underfitting (1 component):

    • Reconstruction collapses the data onto a single line (the first principal direction).

    • Large reconstruction error; the planar structure is lost.

  2. Good Fit (2 components):

    • Captures the main plane of the data.

    • Reconstruction is accurate, with minimal loss.

  3. Overfitting (3 components):

    • Keeps the tiny, noisy 3rd direction, so the reconstruction is essentially exact, noise included.

    • No dimensionality reduction benefit; the retained noise can mislead downstream models.