2026-04-02 13:40 Tags:
Source Notebook: /Users/liyachen/Documents/fang/UNZIP_FOR_NOTEBOOKS_FINAL/12-K-Nearest-Neighbors/00-KNN-Classification.ipynb
KNN - K Nearest Neighbors - Classification
To understand KNN for classification, we’ll work with a simple dataset representing gene expression levels. Gene expression levels are calculated as the ratio between the expression of the target gene (i.e., the gene of interest) and the expression of one or more reference genes (often housekeeping genes). This dataset is synthetic and specifically designed to show some of the strengths and limitations of using KNN for classification.
More info on gene expression: https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/gene-expression-level
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data
df = pd.read_csv('../DATA/gene_expression.csv')
df.head()
   Gene One  Gene Two  Cancer Present
0       4.3       3.9               1
1       2.5       6.3               0
2       5.7       3.9               1
3       6.1       6.2               0
4       7.4       3.4               1
sns.scatterplot(x='Gene One',y='Gene Two',hue='Cancer Present',data=df,alpha=0.7)
sns.scatterplot(x='Gene One',y='Gene Two',hue='Cancer Present',data=df)
plt.xlim(2,6)
plt.ylim(3,10)
plt.legend(loc=(1.1,0.5))
Train|Test Split and Scaling Data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('Cancer Present',axis=1)
y = df['Cancer Present']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(scaled_X_train,y_train)
KNeighborsClassifier(n_neighbors=1)

Understanding KNN and Choosing K Value
full_test = pd.concat([X_test,y_test],axis=1)
len(full_test)
900
sns.scatterplot(x='Gene One',y='Gene Two',hue='Cancer Present',
                data=full_test,alpha=0.7)
Model Evaluation
y_pred = knn_model.predict(scaled_X_test)

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

accuracy_score(y_test,y_pred)
0.8922222222222222

confusion_matrix(y_test,y_pred)
array([[420,  50],
       [ 47, 383]], dtype=int64)

print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.90      0.89      0.90       470
           1       0.88      0.89      0.89       430

    accuracy                           0.89       900
   macro avg       0.89      0.89      0.89       900
weighted avg       0.89      0.89      0.89       900

Elbow Method for Choosing Reasonable K Values
NOTE: This uses the test set for the hyperparameter selection of K, which risks tuning K to the test set; the full cross-validation grid search later in these notes avoids that.
test_error_rates = []

for k in range(1,30):
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(scaled_X_train,y_train)

    y_pred_test = knn_model.predict(scaled_X_test)
    test_error = 1 - accuracy_score(y_test,y_pred_test)
    test_error_rates.append(test_error)

plt.figure(figsize=(10,6),dpi=200)
plt.plot(range(1,30),test_error_rates,label='Test Error')
plt.legend()
plt.ylabel('Error Rate')
plt.xlabel("K Value")
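The same elbow curve can be built without touching the test set at all, by scoring each K with cross-validation on the training data. A minimal sketch, using a synthetic two-feature dataset in place of the gene expression CSV (which isn't loaded here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the gene expression data
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

cv_error_rates = []
for k in range(1, 30):
    # Scaling lives inside the pipeline, so each CV fold is scaled correctly
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=k))])
    # 5-fold CV accuracy on the training data only -- the test set stays untouched
    scores = cross_val_score(pipe, X, y, cv=5)
    cv_error_rates.append(1 - scores.mean())

print(len(cv_error_rates))
```

Plotting `cv_error_rates` against `range(1, 30)` gives the same kind of elbow plot, but the chosen K can then be fairly evaluated on the held-out test set.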
Full Cross Validation Grid Search for K Value
Creating a Pipeline to find K value
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

param_grid = {
    'knn__n_neighbors': list(range(1, 31))
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

🧠 Step 1: What problem are we actually solving?
You want to do three things together:
- Scale your data
- Train a KNN model
- Find the best k
Now ask yourself:
❓ If I do these separately… what could go wrong?
🚨 Step 2: The wrong way (very common beginner mistake)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)

Looks fine, right?
But think deeper:
❗ Problem: Data leakage
Inside CV, the model is supposed to behave like this:
Fold 1:
train on subset A
validate on subset B
BUT your scaler was fit on ALL X_train already.
So:
The scaler’s mean and standard deviation already include the validation fold, so the model has indirectly “seen” data it is supposed to be validated on.
That’s cheating.
🧠 Step 3: So what should happen instead?
Let’s think like a clean ML pipeline:
For EACH fold:
1. Take training fold
2. Fit scaler ONLY on this fold
3. Transform training fold
4. Train model
5. Transform validation fold using SAME scaler
6. Evaluate
👉 This must happen inside every fold
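The per-fold procedure above can be written out by hand with `KFold`, which is exactly the bookkeeping that `Pipeline` automates. A sketch on synthetic data (the gene expression CSV isn't loaded here), with an arbitrary k=5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    scaler = StandardScaler().fit(X_tr)                 # fit ONLY on the training fold
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(scaler.transform(X_tr), y_tr)               # train on the scaled training fold

    score = knn.score(scaler.transform(X_val), y_val)   # SAME scaler on the validation fold
    fold_scores.append(score)

print(np.mean(fold_scores))
```

Notice the scaler is refit for every fold; the validation fold only ever passes through `transform`, never `fit`.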
🔥 Step 4: This is exactly why Pipeline exists
Now look at your code again:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

This is not just “writing steps”.
This is saying:
“These steps must ALWAYS happen together, in this order, safely.”
⚙️ Step 5: What GridSearchCV does with Pipeline
Now this line:
grid = GridSearchCV(pipe, param_grid, cv=5)

Think of it like:
“Try different k values, and for EACH one, run the ENTIRE pipeline correctly across folds.”
🧪 What actually happens internally
For each k:
For each fold:
train_fold → scaler.fit()
train_fold → scaler.transform()
train_fold → knn.fit()
val_fold → scaler.transform()
val_fold → knn.predict()
🧩 Step 6: Why this syntax (knn__n_neighbors)?
param_grid = {
    'knn__n_neighbors': list(range(1, 31))
}

This looks weird at first.
But think:
Pipeline = multiple layers
Pipeline
├── scaler
└── knn
└── n_neighbors
So:
knn__n_neighbors
means:
“Go inside step knn, change parameter n_neighbors”
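You don't have to guess these `step__parameter` names; a pipeline can list them itself via `get_params()`. A quick check, assuming the same step names as above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Every tunable name follows the <step>__<parameter> pattern
param_names = [p for p in pipe.get_params() if p.startswith('knn__')]
print('knn__n_neighbors' in param_names)
```

If you rename the step (say `('model', KNeighborsClassifier())`), the grid key must change with it, to `model__n_neighbors`.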
🧠 Step 1: What does “best k” actually mean?
After this:
grid.fit(X_train, y_train)

You now have:
grid.best_params_
grid.best_estimator_

👉 Important:
“Best k” = best on cross-validation, not on real unseen data yet
🎯 Step 2: What should we do next?
Ask yourself:
❓ Have we evaluated on true unseen data?
If not → we’re not done
✅ Step 3: Use the best model on test data
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

Then evaluate:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

🧠 Why this step matters
Think of it like:
| Stage | Purpose |
|---|---|
| CV (GridSearch) | choose best k |
| Test set | simulate real-world performance |
👉 If you skip test evaluation → you don’t know if your model generalizes
🔥 Step 4: What’s inside best_estimator_?
This is important conceptually.
best_model is actually:
Pipeline(
    scaler = fitted StandardScaler,
    knn = KNN with best k
)

👉 So when you call:
best_model.predict(X_test)

it automatically does:
X_test → scaler.transform → knn.predict

No extra work needed.
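You can verify this structure by poking at a fitted pipeline's `named_steps`, which exposes each fitted component. A small sketch on synthetic data with an arbitrary k=7 (not the grid-searched model):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=7))
])
pipe.fit(X, y)

# named_steps gives access to each fitted step by its name
print(pipe.named_steps['knn'].n_neighbors)     # the k this pipeline was built with
print(pipe.named_steps['scaler'].mean_.shape)  # per-feature means learned from X
```

The same works on `grid.best_estimator_`, so `best_model.named_steps['knn'].n_neighbors` reveals which k the grid search picked.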
🧩 Step 5: Optional but powerful — inspect performance vs k
You can also analyze:
grid.cv_results_

Example:
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
results[['param_knn__n_neighbors', 'mean_test_score']]

This helps you see whether performance is stable or very sensitive to k.
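Putting it together end to end: fit a small grid on synthetic data (standing in for the gene expression CSV) and inspect the score spread across k to judge stability. The spread computation is just one way to summarize sensitivity:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
grid = GridSearchCV(pipe, {'knn__n_neighbors': list(range(1, 11))}, cv=5)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
scores = results['mean_test_score']
# Large gap between best and worst mean score => performance is sensitive to k
print(scores.max() - scores.min())
```

Plotting `results['param_knn__n_neighbors']` against `mean_test_score` gives a leakage-free counterpart to the earlier elbow plot.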