# Workflows

## 1. Data Preparation
Before applying KNN:
- **Collect data:** you need labeled training data for classification or regression.
- **Feature scaling:** KNN relies on distance metrics, so features should be on the same scale. Use:
  - Min-Max Scaling
  - Standardization (Z-score)

**Why scaling matters:** without scaling, a feature with a larger range will dominate the distance calculation.
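A minimal NumPy sketch of both scalings (the toy feature matrix is hypothetical, with one feature in a small range and one in a large range):

```python
import numpy as np

# Toy feature matrix: column 0 spans [0, 1], column 1 spans thousands
X = np.array([[0.2, 3000.0],
              [0.5, 7000.0],
              [0.9, 1000.0]])

# Min-max scaling: map each feature to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After either transform, both columns contribute comparably to any distance computed between rows.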
## 2. Choose the Distance Metric
Decide how to measure “closeness” between points. Common choices:
| Metric | Formula (2D example) | When to Use |
|---|---|---|
| Euclidean | \(\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}\) | Most common; continuous features |
| Manhattan | \(\lvert x_1 - y_1 \rvert + \lvert x_2 - y_2 \rvert\) | Grid-like distances, discrete features |
| Minkowski | \(\left(\lvert x_1 - y_1 \rvert^p + \lvert x_2 - y_2 \rvert^p\right)^{1/p}\) | Generalizes Euclidean (\(p = 2\)) and Manhattan (\(p = 1\)); flexible via the parameter \(p\) |
| Hamming | Counts the number of differing features | Categorical variables |
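The metrics in the table can be sketched in NumPy for a single pair of points (the example points and categorical vectors are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(x - y))           # 3 + 4 = 7.0
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)  # p=2 gives Euclidean, p=1 Manhattan

# Hamming: count the positions where two categorical vectors differ
a = np.array(["red", "small", "round"])
b = np.array(["red", "large", "round"])
hamming = np.sum(a != b)                    # 1 differing feature
```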
## 3. Select k
Decide how many neighbors to consider:
- Small \(k\) → sensitive to noise, high variance (overfitting)
- Large \(k\) → smooths decision boundaries, may underfit
- Common practice: test multiple odd \(k\) values using cross-validation.
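Assuming scikit-learn is available, cross-validated selection over odd k values might look like this (the Iris dataset is used here purely as a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd k values and keep the one with the best cross-validated accuracy
scores = {}
for k in range(1, 16, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
```

Odd k values avoid ties in binary majority votes; for multi-class problems ties can still occur and are usually broken arbitrarily.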
## 4. Compute Distances
For each new data point \(x_{\text{new}}\):
- Calculate the distance to every point in the training set.
- Store these distances in a sorted list.
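This step can be sketched with NumPy broadcasting (the training points and query point are hypothetical):

```python
import numpy as np

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0], [0.0, 0.0]])
x_new = np.array([1.5, 1.5])

# Euclidean distance from x_new to every training point at once
distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))

# Indices of training points ordered from nearest to farthest
order = np.argsort(distances)
```

This brute-force pass is O(n) per query; for large training sets, tree-based structures (k-d trees, ball trees) are the usual way to avoid scanning every point.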
## 5. Identify Nearest Neighbors
- Pick the top \(k\) closest points from the sorted distance list.
- These points “vote” for the label (classification) or contribute to the average (regression).
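Picking the top k does not actually require a full sort; NumPy's `argpartition` finds the k smallest distances in linear time (the distance array below is hypothetical):

```python
import numpy as np

distances = np.array([3.2, 0.5, 1.7, 0.9, 4.1])
k = 3

# argpartition places the k smallest entries first, in arbitrary order: O(n)
nearest = np.argpartition(distances, k)[:k]

# Order just those k neighbors by their actual distance
nearest = nearest[np.argsort(distances[nearest])]
```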
## 6. Aggregate Neighbor Information
- **Classification:** majority vote determines the predicted class. Optional: weighted vote (closer neighbors count more).
- **Regression:** take the mean (or weighted mean) of the neighbors’ values.
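Both aggregation rules can be sketched as follows (the neighbor labels, values, and distances are hypothetical):

```python
import numpy as np
from collections import Counter

# Labels, target values, and distances of the k nearest neighbors
neighbor_labels = ["A", "B", "A"]
neighbor_values = np.array([2.0, 4.0, 3.0])
neighbor_dists = np.array([0.5, 1.0, 2.0])

# Classification: plain majority vote
predicted_class = Counter(neighbor_labels).most_common(1)[0][0]  # "A"

# Regression: unweighted mean vs. inverse-distance-weighted mean
mean_pred = neighbor_values.mean()                               # 3.0
weights = 1.0 / neighbor_dists                                   # closer => larger weight
weighted_pred = np.sum(weights * neighbor_values) / np.sum(weights)
```

Inverse-distance weighting is one common choice; note it needs a guard (or a small epsilon) if a neighbor's distance can be exactly zero.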
## 7. Assign the Label or Value
Output the predicted class or numerical value for \(x_{\text{new}}\).
## 8. Evaluate the Model
Use performance metrics:
- **Classification:** Accuracy, Precision, Recall, F1-score, Confusion Matrix
- **Regression:** MSE, RMSE, MAE, R²
Optionally, tune \(k\) and/or the distance metric to improve results.
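Assuming scikit-learn is available, the metrics listed above can be computed directly (the predictions below are hypothetical):

```python
from sklearn.metrics import (accuracy_score, f1_score, confusion_matrix,
                             mean_squared_error, mean_absolute_error, r2_score)

# Hypothetical classification results
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
acc = accuracy_score(y_true, y_pred)   # 4 of 5 correct -> 0.8
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # 2x2 matrix of counts

# Hypothetical regression results
v_true = [2.0, 3.0, 5.0]
v_pred = [2.5, 2.5, 5.0]
mse = mean_squared_error(v_true, v_pred)
mae = mean_absolute_error(v_true, v_pred)
r2 = r2_score(v_true, v_pred)
```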
## Summary

1. Prepare data → 2. Scale features → 3. Select k & distance metric → 4. Compute distances → 5. Find nearest neighbors → 6. Aggregate results → 7. Predict output → 8. Evaluate & tune
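The whole workflow condenses to a few lines with scikit-learn, assuming it is available (k = 5 and the Iris dataset are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scale features, then classify by majority vote among the 5 nearest neighbors
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

Putting the scaler inside the pipeline matters: it is fit on the training split only, so no information from the test set leaks into the distance computation.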