Common ML Pitfalls & How to Prevent Them#
1. Data Leakage#
What it is: Information from test/future data sneaks into training.
Example: Scaling before splitting, or using “future” features.
✅ Prevention:
Always split before preprocessing.
Use scikit-learn pipelines.
In time-series, only use past data for training.
Definition#
Data leakage happens when information that would not be available at prediction time is used (directly or indirectly) during training.
👉 This gives the model unfair hints, making it look very accurate on validation/test data but fail on real-world unseen data.
Why It’s Dangerous#
Inflates model performance (fake high accuracy).
Leads to overconfidence in the model.
Deployment disaster: model fails when such information isn’t available.
It’s like cheating in an exam with leaked answers → perfect marks in practice, but no real skill.
Types of Data Leakage#
A. Target Leakage#
Features include data that would only be available after the prediction is made.
Example:
Predicting if a patient has diabetes.
Including “insulin prescribed” as a feature.
Problem: prescription decisions depend on knowing the patient has diabetes.
B. Train-Test Contamination#
Test data information accidentally influences training.
Example:
Scaling or feature selection done before splitting dataset into train/test.
The test data indirectly shapes the training process.
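This kind of contamination is easy to demonstrate with a few lines of scikit-learn. A minimal sketch on synthetic data (variable names are illustrative): the "leaky" scaler is fitted on the full dataset, so the test rows shift its mean and standard deviation, and the resulting training matrix differs from the correctly prepared one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))  # toy feature

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: scaler sees the test rows, so test statistics shape training.
leaky = StandardScaler().fit(X)  # fitted on ALL data
X_train_leaky = leaky.transform(X_train)

# Correct: fit on the training split only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_ok = scaler.transform(X_train)
X_test_ok = scaler.transform(X_test)

# The two versions of the training data differ, because the leaky
# scaler's mean/std were influenced by the test split.
print(np.allclose(X_train_leaky, X_train_ok))  # False (in general)
```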
C. Temporal Leakage#
In time-series data, using future information to predict the past.
Example:
Predicting stock price at \(t\).
Accidentally including features from \(t+1\) or later.
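One way to guard against this in practice is chronological cross-validation. A small sketch using scikit-learn's `TimeSeriesSplit`, which guarantees that every training index precedes every test index (the data here is just a toy time-ordered sequence):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 time-ordered observations (row index = time step)
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index:
    # no future data leaks into the past.
    assert train_idx.max() < test_idx.min()
    print("train:", train_idx, "test:", test_idx)
```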
D. Indirect / Proxy Leakage#
When a feature is a disguised form of the target.
Example:
Predicting whether a customer churns.
Including “last month’s customer support ticket closure” → which directly correlates with churn.
Causes of Data Leakage#
Preprocessing the entire dataset before splitting.
Poor feature engineering (using outcome-related variables).
Mismanaged cross-validation (e.g., same patient’s data across train & test).
Temporal misalignment in time-series datasets.
Real-World Examples#
Healthcare: Using “hospital billing code” as a feature when predicting disease → billing code assigned after diagnosis.
Finance: Predicting loan defaults using “late payment flag” → this flag only appears after default happens.
E-commerce: Predicting purchase likelihood using “discount applied” → but discount decisions happen after purchase intent.
How to Detect Data Leakage#
Too-good-to-be-true model performance.
Validation accuracy much higher than real-world deployment.
Suspicious features that seem too correlated with the target.
A single feature dominating the model in feature-importance analysis.
How to Prevent Data Leakage#
Best Practices:
Split first, preprocess later: do the train/test split before scaling, imputing, or feature selection.
Pipelines: use scikit-learn's Pipeline so preprocessing is fitted on training data only and re-applied to test data.
Audit features: ask, "Would I have this feature at prediction time?"
Careful with time series: always split chronologically, not randomly.
Cross-validation grouping: ensure related samples (same patient, same user) are not split across train/test.
Domain expertise: work with subject experts to identify hidden leakage features.
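Several of these practices can be combined in one short sketch (synthetic data; the "30 patients, 4 visits each" grouping is illustrative): a `Pipeline` refits the scaler inside every cross-validation fold, so test folds never influence preprocessing, and `GroupKFold` keeps all samples from one patient in the same fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)
groups = np.repeat(np.arange(30), 4)  # e.g. 30 patients, 4 visits each

# The pipeline refits the scaler inside every CV fold, so test folds
# never shape the preprocessing.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# GroupKFold keeps all samples from one patient in the same fold.
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                         groups=groups)
print(scores.mean())
```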
Analogy#
Training with leakage = student cheating with leaked exam answers.
Deployment = real exam without leaks → the student (model) fails badly.
In summary: Data leakage = using future or unavailable information in training. It’s subtle, dangerous, and often the reason behind “amazing models that collapse in production.”
2. Overfitting#
What it is: Model memorizes noise in training data → poor generalization.
Example: Deep tree that perfectly fits training but fails on test.
✅ Prevention:
Use regularization (L1, L2, dropout).
Collect more data.
Use cross-validation.
Prune complexity (e.g., max depth in decision trees).
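The depth-pruning idea can be seen on noisy synthetic data: an unconstrained decision tree fits the training set perfectly (memorizing the label noise injected by `flip_y`), while capping `max_depth` trades training fit for generalization. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 injects 10% label noise, which a deep tree will memorize.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: fits the training data perfectly, noise included.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Pruned tree: limited depth forces it to learn only broad patterns.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

The deep tree's perfect training score paired with a much lower test score is the classic overfitting signature.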
3. Underfitting#
What it is: Model too simple → misses important patterns.
Example: Using linear regression on complex nonlinear data.
✅ Prevention:
Use more expressive models.
Add features or polynomial terms.
Reduce regularization strength.
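A quick sketch of the "add polynomial terms" fix on synthetic data: a plain linear model cannot capture a quadratic relationship, but adding degree-2 features lets the same linear learner fit the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # clearly nonlinear

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)

print("linear R^2:", linear.score(X, y))  # poor: model too simple
print("poly   R^2:", poly.score(X, y))    # near 1: captures the curve
```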
4. Class Imbalance#
What it is: One class dominates (e.g., 99% normal, 1% fraud).
Example: Classifier predicts “normal” always → high accuracy but useless.
✅ Prevention:
Resample (oversample minority, undersample majority).
Use SMOTE (synthetic data generation).
Choose balanced metrics (F1, ROC-AUC, Precision-Recall).
Apply class weights in algorithms.
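Class weights in particular need no resampling at all. A minimal sketch on synthetic 95/5 imbalanced data: `class_weight="balanced"` reweights the loss inversely to class frequency, and F1 (not accuracy) is used to judge both models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 95% majority / 5% minority, roughly fraud-like proportions
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_tr, y_tr)
# class_weight="balanced" reweights the loss inversely to class frequency,
# so minority-class mistakes cost more during training.
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

print("plain    F1:", f1_score(y_te, plain.predict(X_te)))
print("weighted F1:", f1_score(y_te, weighted.predict(X_te)))
```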
5. Data Drift & Concept Drift#
What it is: Data or relationships change over time.
Example: Customer behavior before vs after COVID.
✅ Prevention:
Monitor model performance regularly.
Retrain periodically.
Use online learning for streaming data.
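Monitoring can start very simply. Below is a crude mean-shift check on one feature (a sketch only; production monitors typically use PSI or Kolmogorov-Smirnov tests, and the function name and threshold here are illustrative):

```python
import numpy as np

def mean_shift_alert(reference, live, z_threshold=3.0):
    """Flag drift when the live-window mean sits more than z_threshold
    standard errors from the reference mean (a crude check; PSI or
    KS tests are more common in production monitors)."""
    ref_mean, ref_std = reference.mean(), reference.std(ddof=1)
    se = ref_std / np.sqrt(len(live))
    z = abs(live.mean() - ref_mean) / se
    return z > z_threshold

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, size=5000)  # training-time distribution
stable = rng.normal(0, 1, size=1000)     # same distribution at serving
drifted = rng.normal(0.5, 1, size=1000)  # behavior shifted post-deployment

print(mean_shift_alert(reference, stable))   # expected: no alert
print(mean_shift_alert(reference, drifted))  # expected: alert fires
```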
6. Multicollinearity#
What it is: Features highly correlated → unstable coefficients.
Example: Predicting salary with both “years of experience” and “months of experience”.
✅ Prevention:
Remove redundant features.
Use regularization (Ridge/Lasso).
Apply PCA for dimensionality reduction.
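Multicollinearity is often diagnosed with the variance inflation factor (VIF). A small sketch computing it by hand for the years/months example (synthetic data; statsmodels also ships a ready-made `variance_inflation_factor` helper):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, i):
    """Variance inflation factor of column i: 1 / (1 - R^2) from
    regressing column i on the remaining columns."""
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
years = rng.uniform(0, 20, size=300)
months = years * 12 + rng.normal(scale=1.0, size=300)  # near-duplicate
age = rng.uniform(20, 65, size=300)                    # independent

X = np.column_stack([years, months, age])
for i, name in enumerate(["years", "months", "age"]):
    print(name, round(vif(X, i), 1))
# "years" and "months" show enormous VIFs (>10 is a common red flag);
# dropping one of them stabilizes the coefficients.
```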
7. Curse of Dimensionality#
What it is: As features grow, data becomes sparse → distance metrics fail.
Example: kNN performs poorly in 1000 dimensions.
✅ Prevention:
Use feature selection.
Apply dimensionality reduction (PCA, t-SNE, UMAP).
Gather more data.
8. Sampling Bias#
What it is: Training data doesn’t represent real-world distribution.
Example: Training only on urban customers → fails on rural customers.
✅ Prevention:
Ensure stratified sampling.
Collect representative datasets.
Be cautious with web-scraped or convenience samples.
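Stratification is a one-argument fix in scikit-learn. A minimal sketch: passing `stratify=y` to `train_test_split` preserves the class ratio in both splits, so a rare class is not accidentally over- or under-represented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=1000, p=[0.9, 0.1])  # 10% minority class
X = rng.normal(size=(1000, 3))

# stratify=y preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())  # both near the overall minority rate
```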
9. Scaling & Normalization Issues#
What it is: Using features with different scales can mislead algorithms.
Example: kNN treating “income ($)” as more important than “age (years)”.
✅ Prevention:
Normalize/standardize features.
Use pipelines to prevent leakage.
Prefer scale-invariant models where possible (e.g., tree-based models).
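The income-vs-age distortion is visible directly in the distances. A small numeric sketch (the feature standard deviations used for scaling are hand-picked for illustration): before scaling, a $30,000 income gap swamps a 35-year age gap; after scaling, the ordering of the neighbors flips.

```python
import numpy as np

# Two comparisons for customer a: (income in $, age in years)
a = np.array([50_000, 25])
b = np.array([50_500, 60])  # similar income, very different age
c = np.array([80_000, 25])  # very different income, same age

# Raw Euclidean distance is dominated by the income scale:
print(np.linalg.norm(a - b))  # small, despite the 35-year age gap
print(np.linalg.norm(a - c))  # huge, driven almost entirely by income

# After dividing each feature by its (assumed) standard deviation,
# age and income contribute comparably and the ordering flips:
scale = np.array([10_000, 10])  # illustrative feature std-devs
print(np.linalg.norm((a - b) / scale))
print(np.linalg.norm((a - c) / scale))
```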
10. Evaluation Pitfalls#
What it is: Using the wrong metric for the problem.
Example: Accuracy in fraud detection (useless if data is imbalanced).
✅ Prevention:
Choose metrics suited to task (F1 for imbalance, RMSE for regression).
Use cross-validation.
Avoid test set reuse (keep a final hold-out set).
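All three practices fit in one short sketch (synthetic imbalanced data): carve off a hold-out set that is never touched during model selection, cross-validate on the rest with an imbalance-appropriate metric, and score the hold-out exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# Keep a final hold-out set that model selection never touches.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(class_weight="balanced")

# Cross-validate on the dev set with a metric suited to imbalance.
f1_scores = cross_val_score(model, X_dev, y_dev, cv=5, scoring="f1")
print("CV F1:", f1_scores.mean())

# Score the hold-out set once, only after all tuning is done.
final = model.fit(X_dev, y_dev)
print("hold-out F1:", f1_score(y_hold, final.predict(X_hold)))
```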