Cross-Validation#
Why do we need Cross-Validation?#
When training ML models, we split data into:
Training set → model learns patterns.
Validation set → used for hyperparameter tuning.
Test set → unseen data, used only at the end to check performance.
If we simply split the data once (say 70% train, 30% test), the result depends on the random split. Cross-validation gives us a more reliable performance estimate by training and validating on multiple splits.
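To see why a single hold-out split is unreliable, here is a minimal sketch (using scikit-learn and a synthetic dataset, both assumptions for illustration) that scores the same model on three different 70/30 splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic dataset, so a single 70/30 split is noisy.
X, y = make_classification(n_samples=200, random_state=0)

scores = []
for seed in (0, 1, 2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

# Accuracy varies with which rows happened to land in the test set.
print(scores)
```

Each seed produces a different accuracy, which is exactly the variance cross-validation averages away.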
Types of Cross-Validation#
1. Leave-One-Out CV (LOOCV)#
Take one data point as validation, rest as training.
Repeat for every data point.
Accuracy = average of all experiments.
✅ Advantage: maximum use of training data. ❌ Disadvantages:
Computationally expensive (n experiments for n records).
High-variance performance estimate (each validation set is a single record).
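The procedure above maps directly onto scikit-learn's `LeaveOneOut` splitter (a sketch on the Iris dataset, chosen here only because it is small):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 records

loo = LeaveOneOut()
# One experiment per record: n = 150 models are trained in total.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

# Each score is 0 or 1 (one held-out record); the mean is the LOOCV accuracy.
print(len(scores), scores.mean())
```

Note the cost: even this tiny dataset requires 150 separate fits.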
2. Leave-P-Out CV#
Instead of leaving 1 record, leave p records as validation.
Train on the rest.
Repeat for all possible combinations.
✅ More flexible than LOOCV. ❌ Impractical for large datasets (combinatorial explosion).
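The combinatorial explosion is easy to quantify: Leave-P-Out needs one experiment for every way of choosing p records out of n, i.e. C(n, p) runs. A quick check with the standard library:

```python
from math import comb

# Number of Leave-P-Out experiments = C(n, p): every way to hold out p records.
for n in (10, 100, 1000):
    print(n, comb(n, 2))
# Even p = 2 already needs ~500,000 runs at n = 1000,
# while LOOCV (p = 1) would need only n = 1000 runs.
```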
3. K-Fold Cross-Validation#
Split dataset into k equal folds.
Train on k-1 folds, validate on the remaining fold.
Repeat k times (each fold used once as validation).
Final score = average of all folds.
✅ Balance between efficiency and reliability. ✅ Most common method in ML. ❌ Validation folds may not preserve the class distribution (in classification).
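The k-fold loop described above is a one-liner with scikit-learn's `KFold` and `cross_val_score` (sketched here on a synthetic dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=42)

# 5 folds: train on 4, validate on the held-out 1, rotate 5 times.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# Final score = average over the 5 fold scores.
print(scores.mean())
```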
4. Stratified K-Fold CV#
Same as K-Fold, but ensures class distribution is preserved in each fold.
Example: if the dataset has 60% positive and 40% negative labels → each fold keeps a ~60:40 ratio.
β Very important for classification tasks with imbalanced data.
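We can verify the 60:40 example directly with `StratifiedKFold` (labels constructed to match the example; features are dummies since only the split matters here):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 60 positive / 40 negative labels, matching the example above.
y = np.array([1] * 60 + [0] * 40)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    # Every validation fold of 20 records keeps the 60:40 ratio:
    # exactly 12 positives and 8 negatives.
    print(y[val_idx].sum(), len(val_idx) - y[val_idx].sum())
```

A plain `KFold` on the same labels could easily put, say, 16 positives in one fold and 8 in another.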
5. Time Series Cross-Validation#
In time series, order matters → we can't randomly shuffle.
Train on past data, validate on future data.
Example:
Train = Day 1–100, Validate = Day 101–120
Train = Day 1–120, Validate = Day 121–140
✅ Used in forecasting, stock prediction, sentiment analysis. ❌ Training size keeps growing → computationally heavier.
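The expanding-window scheme above is what scikit-learn's `TimeSeriesSplit` produces (a sketch with 20 time-ordered observations):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # The training window grows each fold, and validation indices
    # are always strictly in the future of the training indices.
    print(train_idx.min(), train_idx.max(), "->", val_idx.min(), val_idx.max())
```

No fold ever validates on data that precedes its training window, which is the whole point for forecasting.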
🔹 Summary Table#
| CV Type | Works Best For | Pros | Cons |
|---|---|---|---|
| Hold-Out | Large datasets | Simple, fast | High variance, depends on split |
| LOOCV | Very small data | Uses max training data | Very slow, high-variance estimate |
| Leave-P-Out | Small datasets | Flexible | Impractical for big data |
| K-Fold | General ML tasks | Reliable, efficient | Random split may cause imbalance |
| Stratified K-Fold | Classification | Maintains class ratios | Slightly slower |
| Time Series CV | Forecasting/temporal data | Respects time order | Growing training size |
🔹 Why Cross-Validation?#
Provides a stable estimate of model performance.
Helps avoid overfitting by testing model on multiple validation sets.
Essential for hyperparameter tuning (GridSearchCV, RandomizedSearchCV, Bayesian Optimization).
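As a closing sketch, here is how `GridSearchCV` uses k-fold CV internally: every candidate hyperparameter is scored by 5-fold cross-validation rather than by one lucky split (the dataset and the `C` grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, random_state=0)

# Each candidate C is scored by 5-fold CV; the winner is the one
# with the best average fold score, not the best single-split score.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```

`RandomizedSearchCV` and Bayesian optimizers plug into the same `cv=` machinery.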