Workflows#

1. Data Preparation#

Before applying KNN:

  • Collect data: You need labeled training data for classification or regression.

  • Feature scaling: KNN relies on distance metrics, so features should be on the same scale. Use:

    • Min-Max Scaling

    • Standardization (Z-score)

Why scaling matters: Without scaling, a feature with a larger range will dominate the distance calculation.
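The two scalers above can be sketched with NumPy; the toy feature matrix `X` is illustrative only (e.g. age in years vs. income in dollars):

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are features on very
# different scales (age in years, income in dollars).
X = np.array([[25.0, 50_000.0],
              [35.0, 80_000.0],
              [45.0, 120_000.0]])

# Min-Max scaling: maps each feature to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (Z-score): zero mean, unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After either transform, both columns contribute comparably to any distance computed between rows.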


2. Choose the Distance Metric#

Decide how to measure “closeness” between points. Common choices:

| Metric | Formula (2D example) | When to Use |
|---|---|---|
| Euclidean | \(\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}\) | Most common; continuous features |
| Manhattan | \(\lvert x_1 - y_1\rvert + \lvert x_2 - y_2\rvert\) | Grid-like distances, discrete features |
| Minkowski | \(\left(\lvert x_1 - y_1\rvert^p + \lvert x_2 - y_2\rvert^p\right)^{1/p}\), a generalization of Euclidean & Manhattan | Flexibility via the parameter \(p\) |
| Hamming | Count of positions where features differ | Categorical variables |
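The four metrics in the table can each be written in a few lines of NumPy; the vectors `a` and `b` are illustrative:

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: sqrt of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences (grid-like distance).
    return np.sum(np.abs(a - b))

def minkowski(a, b, p):
    # p = 1 reduces to Manhattan, p = 2 to Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def hamming(a, b):
    # Number of positions where the two vectors differ.
    return int(np.sum(a != b))

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
```

For example, `euclidean(a, b)` is 5.0 while `manhattan(a, b)` is 7.0, showing how the choice of metric changes which points count as "close".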


3. Select k#

Decide how many neighbors to consider:

  • Small k → sensitive to noise, high variance (overfitting)

  • Large k → smooths boundaries, may underfit

  • Common practice: test multiple odd k values using cross-validation.
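A minimal sketch of that common practice, assuming scikit-learn and its bundled iris dataset are available (the candidate k list is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation;
# odd values of k avoid ties in binary votes.
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
```

Picking the k with the highest cross-validated score guards against both the high-variance small-k and the underfitting large-k extremes described above.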


4. Compute Distances#

For each new data point \(x_{\text{new}}\):

  1. Calculate the distance to every point in the training set.

  2. Store these distances in a sorted list.
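Both steps can be sketched with NumPy, assuming Euclidean distance and a toy training set:

```python
import numpy as np

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
x_new = np.array([1.5, 1.5])

# Step 1: Euclidean distance from x_new to every training point.
distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))

# Step 2: sort; argsort records which training point each distance belongs to.
order = np.argsort(distances)
sorted_distances = distances[order]
```

Keeping the index order alongside the sorted distances is what lets the next step recover the neighbors' labels.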


5. Identify Nearest Neighbors#

  • Pick the top k closest points from the sorted distance list.

  • These points “vote” for the label (classification) or contribute to the average (regression).
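Given the distances, picking the top k can be sketched as follows; `np.argpartition` finds the k smallest entries without fully sorting the list (the distance values are illustrative):

```python
import numpy as np

distances = np.array([3.2, 0.5, 2.1, 0.9, 4.7])  # one distance per training point
k = 3

# A full sort works too, but argpartition finds the k smallest in linear time.
neighbor_idx = np.argpartition(distances, k)[:k]
```

The resulting indices are then used to look up the neighbors' labels or target values for voting or averaging.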


6. Aggregate Neighbor Information#

  • Classification: Majority vote determines the predicted class.

    • Optional: weighted vote (closer neighbors count more).

  • Regression: Take the mean (or weighted mean) of neighbors’ values.
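Both aggregation rules can be sketched in a few lines; the neighbor labels, values, and distances below are illustrative:

```python
import numpy as np
from collections import Counter

neighbor_labels = np.array([0, 1, 1])           # classification: labels of the k neighbors
neighbor_values = np.array([10.0, 12.0, 14.0])  # regression: targets of the k neighbors
neighbor_dists = np.array([0.5, 1.0, 2.0])

# Classification: majority vote.
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]

# Regression with the optional distance weighting: 1/d weights
# make closer neighbors count more in the mean.
weights = 1.0 / neighbor_dists
predicted_value = np.average(neighbor_values, weights=weights)
```

With these numbers the weighted mean (about 11.14) sits below the plain mean (12.0), because the closest neighbor pulls the prediction toward its value of 10.0.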


7. Assign the Label or Value#

  • Output the predicted class or numerical value for \(x_{\text{new}}\).


8. Evaluate the Model#

  • Use performance metrics:

    • Classification: Accuracy, Precision, Recall, F1-score, Confusion Matrix

    • Regression: MSE, RMSE, MAE, R²

  • Optionally, tune k and/or distance metric to improve results.
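A minimal sketch of computing a few of these metrics, assuming scikit-learn; the true/predicted labels and values are illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Classification metrics.
y_true_cls = [0, 1, 1, 0]
y_pred_cls = [0, 1, 0, 0]
acc = accuracy_score(y_true_cls, y_pred_cls)  # fraction of correct predictions
f1 = f1_score(y_true_cls, y_pred_cls)         # harmonic mean of precision and recall

# Regression metrics.
y_true_reg = [1.0, 2.0, 3.0]
y_pred_reg = [1.5, 2.0, 2.5]
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = mse ** 0.5
```

Re-running an evaluation like this while varying k or the distance metric is the tuning loop the bullet above describes.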


Summary

Prepare data → Scale features → Choose distance metric → Select k → Compute distances → Find nearest neighbors → Aggregate results → Predict output → Evaluate & tune