Common ML Pitfalls & How to Prevent Them#


1. Data Leakage#

  • What it is: Information from test/future data sneaks into training.

  • Example: Scaling before splitting, or using “future” features.

  • Prevention:

    • Always split before preprocessing.

    • Use scikit-learn pipelines.

    • In time-series, only use past data for training.

Data Leakage: A Deep Dive#

Data leakage deserves a closer look because it is one of the trickiest yet most common mistakes in machine learning.


Definition#

Data leakage happens when information that would not be available at prediction time is used (directly or indirectly) during training.

👉 This gives the model unfair hints, making it look very accurate on validation/test data but fail on real-world unseen data.


Why It’s Dangerous#

  • Inflates model performance (fake high accuracy).

  • Leads to overconfidence in the model.

  • Deployment disaster: model fails when such information isn’t available.

It’s like cheating in an exam with leaked answers → perfect marks in practice, but no real skill.


Types of Data Leakage#

A. Target Leakage#

  • Features include data that would only be available after the prediction is made.

  • Example:

    • Predicting if a patient has diabetes.

    • Including “insulin prescribed” as a feature.

    • Problem: prescription decisions depend on knowing the patient has diabetes.


B. Train-Test Contamination#

  • Test data information accidentally influences training.

  • Example:

    • Scaling or feature selection done before splitting dataset into train/test.

    • The test data indirectly shapes the training process.
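As a minimal sketch of the difference (toy data, using scikit-learn's `StandardScaler`): fitting the scaler before the split lets test-set statistics leak into training, while fitting it after the split, on the training portion only, keeps them out.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))  # toy feature matrix

# Leaky: the scaler sees the whole dataset, so test-set statistics
# shape the transformation applied during training.
leaky_scaler = StandardScaler().fit(X)

# Correct: split first, then fit the scaler on the training portion only.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
clean_scaler = StandardScaler().fit(X_train)

# The fitted means differ: the leaky scaler absorbed test-set information.
print(leaky_scaler.mean_ - clean_scaler.mean_)
```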


C. Temporal Leakage#

  • In time-series data, using future information to predict the past.

  • Example:

    • Predicting stock price at \(t\).

    • Accidentally including features from \(t+1\) or later.
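A small sketch of a leak-proof chronological split, using scikit-learn's `TimeSeriesSplit` on a toy series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series: one row per time step, already in chronological order.
X = np.arange(100).reshape(-1, 1)

# TimeSeriesSplit guarantees every training index precedes every test index,
# so a model predicting time t never sees features from t+1 or later.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    print(train_idx.max(), "<", test_idx.min())  # past-only training data
```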


D. Indirect / Proxy Leakage#

  • When a feature is a disguised form of the target.

  • Example:

    • Predicting whether a customer churns.

    • Including “last month’s customer support ticket closure”, which is effectively a disguised churn signal.


Causes of Data Leakage#

  • Preprocessing the entire dataset before splitting.

  • Poor feature engineering (using outcome-related variables).

  • Mismanaged cross-validation (e.g., same patient’s data across train & test).

  • Temporal misalignment in time-series datasets.


Real-World Examples#

  • Healthcare: Using “hospital billing code” as a feature when predicting disease → billing code assigned after diagnosis.

  • Finance: Predicting loan defaults using “late payment flag” → this flag only appears after default happens.

  • E-commerce: Predicting purchase likelihood using “discount applied” → but the discount is only applied after purchase intent is already known.



How to Detect Data Leakage#

  • Too-good-to-be-true model performance.

  • Validation accuracy much higher than real-world deployment.

  • Suspicious features that seem too correlated with the target.

  • Feature-importance analysis dominated by a single suspicious feature.


How to Prevent Data Leakage#

  • Best Practices:

  1. Split first, preprocess later

    • Do train/test split before scaling, imputing, or feature selection.

  2. Pipelines

    • Use a scikit-learn Pipeline so preprocessing is fit on the training data only and merely applied to the test data.

  3. Audit features

    • Check: Would I have this feature at prediction time?

  4. Careful with time-series

    • Always split chronologically, not randomly.

  5. Cross-validation grouping

    • Ensure related samples (same patient, same user) are not split across train/test.

  6. Domain expertise

    • Work with subject experts to identify hidden leakage features.
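Best practices 2 and 5 can be sketched together (toy data; the patient grouping is an invented illustration): a `Pipeline` re-fits the scaler inside every cross-validation fold, and `GroupKFold` keeps each patient's samples on one side of the split.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = rng.integers(0, 2, size=120)          # toy labels
patients = np.repeat(np.arange(30), 4)    # 30 patients, 4 samples each

# The Pipeline re-fits the scaler inside every CV fold on that fold's
# training data only; GroupKFold keeps each patient's samples together.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5),
                         groups=patients)
print(scores)
```

`GroupKFold` is one option; newer scikit-learn versions also offer `StratifiedGroupKFold` when class balance matters too.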


Analogy#

  • Training with leakage = student cheating with leaked exam answers.

  • Deployment = real exam without leaks → the student (model) fails badly.


In summary: Data leakage = using future or unavailable information in training. It’s subtle, dangerous, and often the reason behind “amazing models that collapse in production.”


2. Overfitting#

  • What it is: Model memorizes noise in training data → poor generalization.

  • Example: Deep tree that perfectly fits training but fails on test.

  • Prevention:

    • Use regularization (L1, L2, dropout).

    • Collect more data.

    • Use cross-validation.

    • Prune complexity (e.g., max depth in decision trees).
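The deep-tree example above can be illustrated on synthetic data with deliberately noisy labels (a sketch, not a benchmark): the unconstrained tree fits training perfectly, while capping `max_depth` trades some training accuracy for generalization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with deliberately noisy labels (flip_y=0.2).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the noise; capping max_depth prunes complexity.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

print("deep   train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("pruned train/test:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```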


3. Underfitting#

  • What it is: Model too simple → misses important patterns.

  • Example: Using linear regression on complex nonlinear data.

  • Prevention:

    • Use more expressive models.

    • Add features or polynomial terms.

    • Reduce regularization strength.
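The linear-on-nonlinear example can be sketched with a toy quadratic target: a straight line misses the curvature entirely, and adding polynomial terms restores the fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)  # nonlinear target

# A straight line misses the curvature; squared terms restore the fit.
linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", linear.score(X, y))
print("poly   R^2:", poly.score(X, y))
```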


4. Class Imbalance#

  • What it is: One class dominates (e.g., 99% normal, 1% fraud).

  • Example: Classifier predicts “normal” always → high accuracy but useless.

  • Prevention:

    • Resample (oversample minority, undersample majority).

    • Use SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic minority samples.

    • Choose balanced metrics (F1, ROC-AUC, Precision-Recall).

    • Apply class weights in algorithms.
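Class weights can be sketched on a synthetic 95/5 split (the numbers are illustrative): `class_weight="balanced"` shifts the decision threshold so the rare class is no longer ignored.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# 95/5 imbalance: always predicting the majority class scores ~95% accuracy.
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Recall on the rare class is the number that actually matters here.
plain_rec = recall_score(y_te, plain.predict(X_te))
weighted_rec = recall_score(y_te, weighted.predict(X_te))
print("plain minority recall:   ", plain_rec)
print("weighted minority recall:", weighted_rec)
```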


5. Data Drift & Concept Drift#

  • What it is: Data or relationships change over time.

  • Example: Customer behavior before vs after COVID.

  • Prevention:

    • Monitor model performance regularly.

    • Retrain periodically.

    • Use online learning for streaming data.


6. Multicollinearity#

  • What it is: Features highly correlated → unstable coefficients.

  • Example: Predicting salary with both “years of experience” and “months of experience”.

  • Prevention:

    • Remove redundant features.

    • Use regularization (Ridge/Lasso).

    • Apply PCA for dimensionality reduction.
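The “years vs. months of experience” example can be sketched with a near-duplicate feature (toy data, illustrative coefficients): ordinary least squares spreads the effect arbitrarily across the redundant pair, while Ridge pulls them toward a smaller, more stable joint solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
years = rng.uniform(0, 20, size=300)
months = years * 12 + rng.normal(scale=0.1, size=300)  # near-duplicate feature
X = np.column_stack([years, months])
y = 2.0 * years + rng.normal(scale=1.0, size=300)      # salary-like target

corr = np.corrcoef(years, months)[0, 1]
print("feature correlation:", corr)  # essentially 1: the features are redundant

# OLS splits the effect across the twin features; Ridge shrinks them jointly.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```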


7. Curse of Dimensionality#

  • What it is: As the number of features grows, data becomes sparse and distance metrics lose meaning.

  • Example: kNN performs poorly in 1000 dimensions.

  • Prevention:

    • Use feature selection.

    • Apply dimensionality reduction (PCA; t-SNE/UMAP mainly for visualization).

    • Gather more data.
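A small sketch of dimensionality reduction with PCA (synthetic data; the hidden 5-D structure is an assumption of the toy setup): 500 points embedded in 100 nominal dimensions collapse back to a handful of components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 points that really live on a 5-D subspace, embedded in 100 dimensions.
latent = rng.normal(size=(500, 5))
embedding = rng.normal(size=(5, 100))
X = latent @ embedding + rng.normal(scale=0.05, size=(500, 100))

# Passing a float keeps just enough components for 99% of the variance.
pca = PCA(n_components=0.99)
X_low = pca.fit_transform(X)
print("reduced shape:", X_low.shape)
```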


8. Sampling Bias#

  • What it is: Training data doesn’t represent real-world distribution.

  • Example: Training only on urban customers → fails on rural customers.

  • Prevention:

    • Ensure stratified sampling.

    • Collect representative datasets.

    • Be cautious with web-scraped or convenience samples.
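The `stratify` option of scikit-learn's `train_test_split` preserves group proportions; a sketch with an invented urban/rural label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
region = np.array(["urban"] * 800 + ["rural"] * 200)  # 80/20 population mix
X = rng.normal(size=(1000, 3))

# stratify preserves the urban/rural proportions in both partitions.
X_tr, X_te, r_tr, r_te = train_test_split(
    X, region, test_size=0.2, stratify=region, random_state=0)
print("test rural share:", np.mean(r_te == "rural"))
```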


9. Scaling & Normalization Issues#

  • What it is: Features on very different scales mislead distance- and gradient-based algorithms.

  • Example: kNN treating “income ($)” as more important than “age (years)”.

  • Prevention:

    • Normalize/standardize features.

    • Use pipelines to prevent leakage.

    • Choose scale-invariant models where possible (e.g., tree-based models).
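A toy illustration of the income-vs-age example (numbers are invented): before scaling, the Euclidean distance is ruled by the $1,000 income gap; after standardizing each column, the 35-year age gap dominates instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy customers as [income ($), age (years)] rows.
X = np.array([[52_000.0, 25.0],
              [51_000.0, 60.0],
              [30_000.0, 40.0],
              [90_000.0, 30.0]])

# Raw Euclidean distance between the first two rows is ruled by the
# $1,000 income gap; the 35-year age gap barely registers.
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardizing each column, the age difference dominates instead.
Xs = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(Xs[0] - Xs[1])
print(raw_dist, scaled_dist)
```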


10. Evaluation Pitfalls#

  • What it is: Using the wrong metric for the problem.

  • Example: Accuracy in fraud detection (useless if data is imbalanced).

  • Prevention:

    • Choose metrics suited to task (F1 for imbalance, RMSE for regression).

    • Use cross-validation.

    • Avoid test set reuse (keep a final hold-out set).
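The fraud-accuracy example in numbers (a toy always-“normal” classifier): accuracy looks excellent while F1 exposes that no fraud is ever caught.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1% fraud: always predicting "normal" (0) looks great on accuracy alone.
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)  # the useless always-normal classifier

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.99
print("F1:", f1_score(y_true, y_pred, zero_division=0))  # 0.0
```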