Common ML Pitfalls & How to Prevent Them#
1. Data Leakage#
What it is: Information from test/future data sneaks into training.
Example: Scaling before splitting, or using “future” features.
✅ Prevention:
Always split before preprocessing.
Use scikit-learn pipelines.
In time-series, only use past data for training.
Definition#
Data leakage happens when information that would not be available at prediction time is used (directly or indirectly) during training.
👉 This gives the model unfair hints, making it look very accurate on validation/test data but fail on real-world unseen data.
Why It’s Dangerous#
Inflates model performance (fake high accuracy).
Leads to overconfidence in the model.
Deployment disaster: model fails when such information isn’t available.
It’s like cheating in an exam with leaked answers → perfect marks in practice, but no real skill.
Types of Data Leakage#
A. Target Leakage#
Features include data that would only be available after the prediction is made.
Example:
Predicting if a patient has diabetes.
Including “insulin prescribed” as a feature.
Problem: prescription decisions depend on knowing the patient has diabetes.
B. Train-Test Contamination#
Test data information accidentally influences training.
Example:
Scaling or feature selection done before splitting dataset into train/test.
The test data indirectly shapes the training process.
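This kind of contamination is easy to demonstrate with a few lines of scikit-learn. A minimal sketch on synthetic data (variable names are illustrative): the "leaky" scaler is fitted on the full dataset, so the test rows shift its mean and standard deviation, and the resulting training matrix differs from the correctly prepared one.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))  # toy feature

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: scaler sees the test rows, so test statistics shape training.
leaky = StandardScaler().fit(X)  # fitted on ALL data
X_train_leaky = leaky.transform(X_train)

# Correct: fit on the training split only, then apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_ok = scaler.transform(X_train)
X_test_ok = scaler.transform(X_test)

# The two versions of the training data differ, because the leaky
# scaler's mean/std were influenced by the test split.
print(np.allclose(X_train_leaky, X_train_ok))  # False (in general)
```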
C. Temporal Leakage#
In time-series data, using future information to predict the past.
Example:
Predicting stock price at \(t\).
Accidentally including features from \(t+1\) or later.
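One way to guard against this in practice is chronological cross-validation. A small sketch using scikit-learn's `TimeSeriesSplit`, which guarantees that every training index precedes every test index (the data here is just a toy time-ordered sequence):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 time-ordered observations (row index = time step)
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index:
    # no future data leaks into the past.
    assert train_idx.max() < test_idx.min()
    print("train:", train_idx, "test:", test_idx)
```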
D. Indirect / Proxy Leakage#
When a feature is a disguised form of the target.
Example:
Predicting whether a customer churns.
Including “last month’s customer support ticket closure” → which directly correlates with churn.
Causes of Data Leakage#
Preprocessing the entire dataset before splitting.
Poor feature engineering (using outcome-related variables).
Mismanaged cross-validation (e.g., same patient’s data across train & test).
Temporal misalignment in time-series datasets.
Real-World Examples#
Healthcare: Using “hospital billing code” as a feature when predicting disease → billing code assigned after diagnosis.
Finance: Predicting loan defaults using “late payment flag” → this flag only appears after default happens.
E-commerce: Predicting purchase likelihood using “discount applied” → but discount decisions happen after purchase intent.
How to Detect Data Leakage#
Too-good-to-be-true model performance.
Validation accuracy much higher than real-world deployment.
Suspicious features that seem too correlated with the target.
A single feature dominating the model in feature-importance analysis.
How to Prevent Data Leakage#
Best Practices:
Split first, preprocess later: do the train/test split before scaling, imputing, or feature selection.
Pipelines: use scikit-learn's Pipeline so preprocessing is fitted on training data only and re-applied to test data.
Audit features: ask, "Would I have this feature at prediction time?"
Careful with time series: always split chronologically, not randomly.
Cross-validation grouping: ensure related samples (same patient, same user) are not split across train/test.
Domain expertise: work with subject experts to identify hidden leakage features.
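Several of these practices can be combined in one short sketch (synthetic data; the "30 patients, 4 visits each" grouping is illustrative): a `Pipeline` refits the scaler inside every cross-validation fold, so test folds never influence preprocessing, and `GroupKFold` keeps all samples from one patient in the same fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)
groups = np.repeat(np.arange(30), 4)  # e.g. 30 patients, 4 visits each

# The pipeline refits the scaler inside every CV fold, so test folds
# never shape the preprocessing.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# GroupKFold keeps all samples from one patient in the same fold.
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                         groups=groups)
print(scores.mean())
```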
Analogy#
Training with leakage = student cheating with leaked exam answers.
Deployment = real exam without leaks → the student (model) fails badly.
In summary: Data leakage = using future or unavailable information in training. It’s subtle, dangerous, and often the reason behind “amazing models that collapse in production.”
2. Overfitting#
What it is: Model memorizes noise in training data → poor generalization.
Example: Deep tree that perfectly fits training but fails on test.
✅ Prevention:
Use regularization (L1, L2, dropout).
Collect more data.
Use cross-validation.
Prune complexity (e.g., max depth in decision trees).
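The depth-pruning idea can be seen on noisy synthetic data: an unconstrained decision tree fits the training set perfectly (memorizing the label noise injected by `flip_y`), while capping `max_depth` trades training fit for generalization. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 injects 10% label noise, which a deep tree will memorize.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: fits the training data perfectly, noise included.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Pruned tree: limited depth forces it to learn only broad patterns.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

The deep tree's perfect training score paired with a much lower test score is the classic overfitting signature.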
3. Underfitting#
What it is: Model too simple → misses important patterns.
Example: Using linear regression on complex nonlinear data.
✅ Prevention:
Use more expressive models.
Add features or polynomial terms.
Reduce regularization strength.
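A quick sketch of the "add polynomial terms" fix on synthetic data: a plain linear model cannot capture a quadratic relationship, but adding degree-2 features lets the same linear learner fit the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # clearly nonlinear

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)

print("linear R^2:", linear.score(X, y))  # poor: model too simple
print("poly   R^2:", poly.score(X, y))    # near 1: captures the curve
```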
4. Class Imbalance#
What it is: One class dominates (e.g., 99% normal, 1% fraud).
Example: Classifier predicts “normal” always → high accuracy but useless.
✅ Prevention:
Resample (oversample minority, undersample majority).
Use SMOTE (synthetic data generation).
Choose balanced metrics (F1, ROC-AUC, Precision-Recall).
Apply class weights in algorithms.
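Class weights in particular need no resampling at all. A minimal sketch on synthetic 95/5 imbalanced data: `class_weight="balanced"` reweights the loss inversely to class frequency, and F1 (not accuracy) is used to judge both models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 95% majority / 5% minority, roughly fraud-like proportions
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_tr, y_tr)
# class_weight="balanced" reweights the loss inversely to class frequency,
# so minority-class mistakes cost more during training.
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

print("plain    F1:", f1_score(y_te, plain.predict(X_te)))
print("weighted F1:", f1_score(y_te, weighted.predict(X_te)))
```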
5. Data Drift & Concept Drift#
What it is: Data or relationships change over time.
Example: Customer behavior before vs after COVID.
✅ Prevention:
Monitor model performance regularly.
Retrain periodically.
Use online learning for streaming data.
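Monitoring can start very simply. Below is a crude mean-shift check on one feature (a sketch only; production monitors typically use PSI or Kolmogorov-Smirnov tests, and the function name and threshold here are illustrative):

```python
import numpy as np

def mean_shift_alert(reference, live, z_threshold=3.0):
    """Flag drift when the live-window mean sits more than z_threshold
    standard errors from the reference mean (a crude check; PSI or
    KS tests are more common in production monitors)."""
    ref_mean, ref_std = reference.mean(), reference.std(ddof=1)
    se = ref_std / np.sqrt(len(live))
    z = abs(live.mean() - ref_mean) / se
    return z > z_threshold

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, size=5000)  # training-time distribution
stable = rng.normal(0, 1, size=1000)     # same distribution at serving
drifted = rng.normal(0.5, 1, size=1000)  # behavior shifted post-deployment

print(mean_shift_alert(reference, stable))   # expected: no alert
print(mean_shift_alert(reference, drifted))  # expected: alert fires
```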
6. Multicollinearity#
What it is: Features highly correlated → unstable coefficients.
Example: Predicting salary with both “years of experience” and “months of experience”.
✅ Prevention:
Remove redundant features.
Use regularization (Ridge/Lasso).
Apply PCA for dimensionality reduction.
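Multicollinearity is often diagnosed with the variance inflation factor (VIF). A small sketch computing it by hand for the years/months example (synthetic data; statsmodels also ships a ready-made `variance_inflation_factor` helper):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X, i):
    """Variance inflation factor of column i: 1 / (1 - R^2) from
    regressing column i on the remaining columns."""
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
years = rng.uniform(0, 20, size=300)
months = years * 12 + rng.normal(scale=1.0, size=300)  # near-duplicate
age = rng.uniform(20, 65, size=300)                    # independent

X = np.column_stack([years, months, age])
for i, name in enumerate(["years", "months", "age"]):
    print(name, round(vif(X, i), 1))
# "years" and "months" show enormous VIFs (>10 is a common red flag);
# dropping one of them stabilizes the coefficients.
```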
7. Curse of Dimensionality#
What it is: As features grow, data becomes sparse → distance metrics fail.
Example: kNN performs poorly in 1000 dimensions.
✅ Prevention:
Use feature selection.
Apply dimensionality reduction (PCA, t-SNE, UMAP).
Gather more data.
8. Sampling Bias#
What it is: Training data doesn’t represent real-world distribution.
Example: Training only on urban customers → fails on rural customers.
✅ Prevention:
Ensure stratified sampling.
Collect representative datasets.
Be cautious with web-scraped or convenience samples.
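Stratification is a one-argument fix in scikit-learn. A minimal sketch: passing `stratify=y` to `train_test_split` preserves the class ratio in both splits, so a rare class is not accidentally over- or under-represented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=1000, p=[0.9, 0.1])  # 10% minority class
X = rng.normal(size=(1000, 3))

# stratify=y preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())  # both near the overall minority rate
```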
9. Scaling & Normalization Issues#
What it is: Using features with different scales can mislead algorithms.
Example: kNN treating “income ($)” as more important than “age (years)”.
✅ Prevention:
Normalize/standardize features.
Use pipelines to prevent leakage.
Prefer scale-invariant models where possible (e.g., tree-based models).
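The income-vs-age distortion is visible directly in the distances. A small numeric sketch (the feature standard deviations used for scaling are hand-picked for illustration): before scaling, a $30,000 income gap swamps a 35-year age gap; after scaling, the ordering of the neighbors flips.

```python
import numpy as np

# Two comparisons for customer a: (income in $, age in years)
a = np.array([50_000, 25])
b = np.array([50_500, 60])  # similar income, very different age
c = np.array([80_000, 25])  # very different income, same age

# Raw Euclidean distance is dominated by the income scale:
print(np.linalg.norm(a - b))  # small, despite the 35-year age gap
print(np.linalg.norm(a - c))  # huge, driven almost entirely by income

# After dividing each feature by its (assumed) standard deviation,
# age and income contribute comparably and the ordering flips:
scale = np.array([10_000, 10])  # illustrative feature std-devs
print(np.linalg.norm((a - b) / scale))
print(np.linalg.norm((a - c) / scale))
```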
10. Evaluation Pitfalls#
What it is: Using the wrong metric for the problem.
Example: Accuracy in fraud detection (useless if data is imbalanced).
✅ Prevention:
Choose metrics suited to task (F1 for imbalance, RMSE for regression).
Use cross-validation.
Avoid test set reuse (keep a final hold-out set).
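All three practices fit in one short sketch (synthetic imbalanced data): carve off a hold-out set that is never touched during model selection, cross-validate on the rest with an imbalance-appropriate metric, and score the hold-out exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# Keep a final hold-out set that model selection never touches.
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(class_weight="balanced")

# Cross-validate on the dev set with a metric suited to imbalance.
f1_scores = cross_val_score(model, X_dev, y_dev, cv=5, scoring="f1")
print("CV F1:", f1_scores.mean())

# Score the hold-out set once, only after all tuning is done.
final = model.fit(X_dev, y_dev)
print("hold-out F1:", f1_score(y_hold, final.predict(X_hold)))
```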