What Is It?
What Is Model Evaluation?
Model evaluation is the process of measuring how well a machine learning model performs on unseen data. Training a model is only half the job -- if you cannot properly evaluate it, you have no idea whether it will work in the real world.
The fundamental question is: How will this model perform on data it has never seen before? A model that achieves 99% accuracy on training data but only 60% on new data is useless. Model evaluation gives us reliable estimates of real-world performance.
# The evaluation pipeline:
# 1. Split data into training and testing sets
# 2. Train model on training set only
# 3. Evaluate on test set (never seen during training)
# 4. Use proper metrics (not just accuracy!)
# 5. Use cross-validation for robust estimates
# 6. Tune hyperparameters with GridSearchCV
# 7. Final evaluation on held-out test set
What Is Hyperparameter Tuning?
Hyperparameters are settings you choose before training (like K in KNN, max_depth in decision trees, C in SVM). Unlike model parameters (weights, biases) that are learned during training, hyperparameters must be set manually. Hyperparameter tuning is the systematic process of finding the best combination of hyperparameters for your model and data.
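As a quick illustration, the distinction can be seen in a few lines (a minimal sketch using scikit-learn's LogisticRegression on synthetic data; C plays the role of the hyperparameter here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# C is a hyperparameter: chosen BEFORE training
model = LogisticRegression(C=0.1, random_state=42)
model.fit(X, y)

# coef_ and intercept_ are parameters: learned DURING training
print("Hyperparameter C:", model.get_params()['C'])
print("Learned coefficients shape:", model.coef_.shape)
```

Changing C and refitting changes the learned coefficients, but C itself is never adjusted by `fit` -- that is exactly the gap hyperparameter tuning fills.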
Why Does It Matter?
Why Is Proper Model Evaluation Critical?
1. Prevents Costly Mistakes in Production
When Kavita deploys a fraud detection model at a bank, a poorly evaluated model could either miss actual fraud (losing money) or flag legitimate transactions (losing customers). Proper evaluation catches these issues before deployment.
2. Accuracy Alone Is Misleading
If 99% of transactions are legitimate, a model that always predicts "legitimate" gets 99% accuracy but catches zero fraud. Precision, recall, F1-score, and ROC-AUC give the complete picture that accuracy hides.
3. Overfitting Is the Silent Killer
A model that memorizes training data looks perfect on training metrics but fails spectacularly on new data. Without proper train-test splitting and cross-validation, you will never know your model is overfitting until it is too late.
4. Hyperparameter Tuning Can Make or Break a Model
The same algorithm with different hyperparameters can give 70% accuracy or 95% accuracy. Rohit's Random Forest with max_depth=3 might underfit, while max_depth=50 might overfit. GridSearchCV systematically finds the sweet spot.
5. Cross-Validation Gives Reliable Estimates
A single train-test split can give optimistic or pessimistic estimates depending on which points end up in which split. Cross-validation averages over multiple splits, giving a much more reliable performance estimate.
Detailed Explanation
1. Why Accuracy Alone Is Not Enough
Consider an email spam filter with 1000 emails: 950 legitimate, 50 spam.
# Dummy model: always predicts "legitimate"
# Accuracy = 950/1000 = 95% (sounds great!)
# But: catches ZERO spam emails (0% recall on spam)
#
# The problem: accuracy treats all errors equally
# Missing spam (FN) and flagging legitimate (FP) have very different costs
#
# We need metrics that distinguish between these error types
2. Confusion Matrix Deep Dive
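The spam scenario above can be checked directly (a minimal sketch; the numbers assume the 950/50 split and an always-"legitimate" dummy model):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# 950 legitimate (0) and 50 spam (1); dummy model predicts all legitimate
y_true = np.array([0]*950 + [1]*50)
y_pred = np.zeros(1000, dtype=int)

cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[950   0]
           #  [ 50   0]]
print("Accuracy:", (y_true == y_pred).mean())                         # 0.95
print("Spam recall:", recall_score(y_true, y_pred, zero_division=0))  # 0.0
```

95% accuracy, zero spam caught -- the confusion matrix makes the failure visible at a glance.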
#                     Predicted
#                  Positive   Negative
# Actual Positive     TP         FN
# Actual Negative     FP         TN
#
# TP (True Positive): Correctly identified positive
# TN (True Negative): Correctly identified negative
# FP (False Positive): Incorrectly labeled as positive (Type I error)
# FN (False Negative): Incorrectly labeled as negative (Type II error)
#
# From the confusion matrix, we derive ALL evaluation metrics:
3. Precision, Recall, F1-Score
# Precision = TP / (TP + FP)
# "Of everything I predicted positive, what fraction is actually positive?"
# High precision = few false alarms
#
# Recall (Sensitivity) = TP / (TP + FN)
# "Of everything actually positive, what fraction did I detect?"
# High recall = few missed positives
#
# F1 Score = 2 * Precision * Recall / (Precision + Recall)
# Harmonic mean: punishes extreme imbalance between P and R
#
# Specificity = TN / (TN + FP)
# "Of everything actually negative, what fraction did I correctly identify?"
#
# The Precision-Recall Trade-off:
# Lower threshold -> more positive predictions -> higher recall, lower precision
# Higher threshold -> fewer positive predictions -> higher precision, lower recall
4. ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate (recall) vs False Positive Rate at various thresholds.
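A minimal sketch shows how roc_curve sweeps the threshold (the scores are synthetic, chosen for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.8, 0.35, 0.6, 0.7, 0.9])  # model scores

# Each threshold yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  TPR={t:.2f}  FPR={f:.2f}")

print("AUC:", roc_auc_score(y_true, y_score))  # 0.75 for these scores
```

Lowering the threshold moves you up and to the right along the curve; the AUC summarizes all of these points in a single threshold-free number.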
# TPR (True Positive Rate) = Recall = TP / (TP + FN)
# FPR (False Positive Rate) = FP / (FP + TN)
#
# ROC Curve: plot TPR vs FPR at different thresholds
# - Perfect model: goes straight up to (0,1) then right (AUC = 1.0)
# - Random model: diagonal line from (0,0) to (1,1) (AUC = 0.5)
# - Worse than random: below the diagonal (AUC < 0.5)
#
# AUC (Area Under ROC Curve):
# - 1.0: perfect classifier
# - 0.9-1.0: excellent
# - 0.8-0.9: good
# - 0.7-0.8: fair
# - 0.5: no better than random guessing
#
# AUC is threshold-independent: it measures overall model quality
5. Train-Test Split
# Basic approach: split data into training and test sets
# Typical splits: 80/20, 70/30, or 75/25
#
# Rules:
# 1. NEVER train on test data
# 2. NEVER tune hyperparameters using test data
# 3. Use stratify=y to maintain class proportions
# 4. Set random_state for reproducibility
#
# Problem: a single split may not be representative
# Solution: cross-validation
6. K-Fold Cross-Validation
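The fold rotation is easy to see directly from KFold.split (a toy sketch with 10 samples):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # 10 toy samples
kf = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {i+1}: test={test_idx}, train={train_idx}")
# Every sample lands in the test fold exactly once
```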
# K-Fold splits data into K equal parts (folds):
# Fold 1: [TEST] [Train] [Train] [Train] [Train]
# Fold 2: [Train] [TEST] [Train] [Train] [Train]
# Fold 3: [Train] [Train] [TEST] [Train] [Train]
# Fold 4: [Train] [Train] [Train] [TEST] [Train]
# Fold 5: [Train] [Train] [Train] [Train] [TEST]
#
# Each fold serves as test set exactly once
# Final score = average of all K scores
# Typical K: 5 or 10
#
# Stratified K-Fold: preserves class proportions in each fold
# Essential for imbalanced datasets!
#
# Leave-One-Out (LOO): K = N (one sample per fold)
# Most thorough but very slow for large datasets
7. Overfitting vs Underfitting
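Both failure modes can be reproduced in a few lines (a sketch on synthetic make_classification data; the exact scores are illustrative, not guaranteed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
# A stump (depth 1) scores low on both sets (underfitting); an unlimited
# tree memorizes the training set (train=1.00) with a lower test score
```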
# Underfitting (High Bias):
# - Model too simple to capture patterns
# - Low training accuracy AND low test accuracy
# - Fix: more complex model, more features, less regularization
#
# Overfitting (High Variance):
# - Model memorizes training data including noise
# - High training accuracy but LOW test accuracy
# - Fix: simpler model, more data, regularization, pruning
#
# The Bias-Variance Tradeoff:
# Total Error = Bias^2 + Variance + Irreducible Noise
# - Bias: error from oversimplified model (misses patterns)
# - Variance: error from sensitivity to training data (captures noise)
# - Sweet spot: balance between bias and variance
8. Hyperparameter Tuning
# GridSearchCV: exhaustive search over a parameter grid
# - Tries EVERY combination of specified parameters
# - Uses cross-validation to evaluate each combination
# - Guaranteed to find the best within the grid
# - Slow for large grids (exponential combinations)
#
# RandomizedSearchCV: random sampling from parameter distributions
# - Samples N random combinations from parameter distributions
# - Much faster than GridSearchCV for large search spaces
# - May not find the absolute best, but often finds near-optimal
# - Better for initial exploration
#
# Typical workflow:
# 1. RandomizedSearchCV to narrow the range
# 2. GridSearchCV to fine-tune within the narrowed range
9. Learning Curves and Validation Curves
# Learning Curve: training/test score vs training set size
# - If both scores are low: underfitting (need more complex model)
# - If training high, test low: overfitting (need more data or simpler model)
# - If both converge high: good model
#
# Validation Curve: training/test score vs hyperparameter value
# - Left side (simple model): both scores low (underfitting)
# - Right side (complex model): training high, test low (overfitting)
# - Sweet spot: where test score peaks
Code Examples
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
recall_score, f1_score, classification_report)
# Simulate predictions for a disease screening test
np.random.seed(42)
n = 1000
y_true = np.array([1]*50 + [0]*950) # 5% disease prevalence
# Model predictions (imperfect)
y_pred = y_true.copy()
# Introduce errors
y_pred[45:50] = 0 # 5 false negatives (miss 5 out of 50 sick patients)
y_pred[950:960] = 1 # 10 false positives (flag 10 healthy as sick)
cm = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = cm.ravel()
print("Confusion Matrix:")
print(f" TN={TN} FP={FP}")
print(f" FN={FN} TP={TP}")
print(f"\nMetrics (manual calculation):")
print(f" Accuracy = (TP+TN)/(Total) = ({TP}+{TN})/{n} = {(TP+TN)/n:.4f}")
print(f" Precision = TP/(TP+FP) = {TP}/({TP}+{FP}) = {TP/(TP+FP):.4f}")
print(f" Recall = TP/(TP+FN) = {TP}/({TP}+{FN}) = {TP/(TP+FN):.4f}")
print(f" Specificity = TN/(TN+FP) = {TN}/({TN}+{FP}) = {TN/(TN+FP):.4f}")
f1 = 2*TP/(2*TP+FP+FN)
print(f" F1 Score = 2TP/(2TP+FP+FN) = {f1:.4f}")
print(f"\nMetrics (sklearn):")
print(f" Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f" Precision: {precision_score(y_true, y_pred):.4f}")
print(f" Recall: {recall_score(y_true, y_pred):.4f}")
print(f" F1: {f1_score(y_true, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Healthy', 'Sick']))
print("Key Insight: Accuracy is 98.5% but we miss 10% of sick patients!")
print("In medical screening, Recall (90%) is more important than Accuracy.")

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.datasets import make_classification
# Generate data
X, y = make_classification(n_samples=500, n_features=10,
n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Train multiple models
models = {
'Logistic Regression': LogisticRegression(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'SVM': SVC(probability=True, random_state=42)
}
plt.figure(figsize=(10, 8))
colors = ['blue', 'green', 'red']
for (name, model), color in zip(models.items(), colors):
    model.fit(X_train_s, y_train)
    y_proba = model.predict_proba(X_test_s)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, linewidth=2,
             label=f'{name} (AUC = {roc_auc:.3f})')
# Random classifier line
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()
# Print AUC scores
print("AUC Scores:")
for name, model in models.items():
    y_proba = model.predict_proba(X_test_s)[:, 1]
    print(f"  {name:25s}: {roc_auc_score(y_test, y_proba):.4f}")

import numpy as np
from sklearn.model_selection import (KFold, StratifiedKFold, cross_val_score,
cross_validate)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate imbalanced data
X, y = make_classification(n_samples=500, n_features=10,
n_informative=5, weights=[0.9, 0.1],
random_state=42)
print(f"Class distribution: {np.bincount(y)}")
print(f"Class 1 ratio: {np.mean(y):.2%}")
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Regular K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X, y, cv=kf, scoring='f1')
# Stratified K-Fold (preserves class proportions)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"\nRegular K-Fold F1 scores: {np.round(kf_scores, 3)}")
print(f" Mean: {kf_scores.mean():.4f} +/- {kf_scores.std():.4f}")
print(f"Stratified K-Fold F1 scores: {np.round(skf_scores, 3)}")
print(f" Mean: {skf_scores.mean():.4f} +/- {skf_scores.std():.4f}")
# Show class distribution per fold
print(f"\nClass 1 count per fold (Stratified):")
for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    n_pos = np.sum(y[test_idx])
    n_total = len(test_idx)
    print(f"  Fold {i+1}: {n_pos}/{n_total} positive ({n_pos/n_total:.2%})")
# cross_validate gives train scores too
results = cross_validate(model, X, y, cv=skf,
scoring=['accuracy', 'f1', 'roc_auc'],
return_train_score=True)
print(f"\nMultiple metrics (Stratified 5-Fold):")
for metric in ['accuracy', 'f1', 'roc_auc']:
    train = results[f'train_{metric}'].mean()
    test = results[f'test_{metric}'].mean()
    print(f"  {metric:10s}: Train={train:.4f}, Test={test:.4f}, Gap={train-test:.4f}")

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
import time
# Generate data
X, y = make_classification(n_samples=1000, n_features=15,
n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
total_combinations = 1
for values in param_grid.values():
    total_combinations *= len(values)
print(f"Total combinations to try: {total_combinations}")
print(f"With 5-fold CV: {total_combinations * 5} model fits")
# Run GridSearchCV
start = time.time()
grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='f1',
n_jobs=-1, # Use all CPU cores
verbose=0,
return_train_score=True
)
grid.fit(X_train, y_train)
elapsed = time.time() - start
print(f"\nSearch completed in {elapsed:.1f} seconds")
print(f"\nBest parameters: {grid.best_params_}")
print(f"Best CV F1 score: {grid.best_score_:.4f}")
# Evaluate on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print(f"\nTest set results with best model:")
print(classification_report(y_test, y_pred))
# Compare with default parameters
default_model = RandomForestClassifier(random_state=42)
default_model.fit(X_train, y_train)
print(f"Default model test accuracy: {default_model.score(X_test, y_test):.4f}")
print(f"Tuned model test accuracy: {best_model.score(X_test, y_test):.4f}")
print(f"Improvement: {(best_model.score(X_test, y_test) - default_model.score(X_test, y_test)):.4f}")

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.datasets import make_classification
from scipy.stats import randint, uniform
import time
# Generate data
X, y = make_classification(n_samples=1000, n_features=15,
n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Define parameter distributions (not a grid)
param_distributions = {
'n_estimators': randint(50, 500), # Random integer between 50 and 500
'max_depth': [3, 5, 7, 10, 15, 20, None],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', None],
'bootstrap': [True, False]
}
# RandomizedSearchCV: sample 50 random combinations
start = time.time()
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions,
n_iter=50, # Try 50 random combinations
cv=5,
scoring='f1',
n_jobs=-1,
random_state=42,
return_train_score=True
)
random_search.fit(X_train, y_train)
elapsed = time.time() - start
print(f"RandomizedSearchCV completed in {elapsed:.1f} seconds")
print(f"Combinations tried: 50")
print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best CV F1 score: {random_search.best_score_:.4f}")
print(f"Test accuracy: {random_search.best_estimator_.score(X_test, y_test):.4f}")
# Show top 5 parameter combinations
import pandas as pd
results = pd.DataFrame(random_search.cv_results_)
top5 = results.nsmallest(5, 'rank_test_score')[[
'params', 'mean_test_score', 'std_test_score', 'rank_test_score'
]]
print(f"\nTop 5 parameter combinations:")
for _, row in top5.iterrows():
    print(f"  Rank {row['rank_test_score']:.0f}: F1={row['mean_test_score']:.4f} "
          f"+/- {row['std_test_score']:.4f}")

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate data
X, y = make_classification(n_samples=1000, n_features=10,
n_informative=5, random_state=42)
# 1. Learning Curve: score vs training set size
train_sizes, train_scores, test_scores = learning_curve(
RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
X, y, cv=5,
train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy',
n_jobs=-1
)
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
plt.fill_between(train_sizes, train_scores.mean(axis=1) - train_scores.std(axis=1),
train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1, color='blue')
plt.fill_between(train_sizes, test_scores.mean(axis=1) - test_scores.std(axis=1),
test_scores.mean(axis=1) + test_scores.std(axis=1), alpha=0.1, color='red')
plt.plot(train_sizes, train_scores.mean(axis=1), 'b-o', label='Training Score')
plt.plot(train_sizes, test_scores.mean(axis=1), 'r-o', label='Cross-Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.title('Learning Curve')
plt.legend()
plt.grid(True, alpha=0.3)
# 2. Validation Curve: score vs hyperparameter
param_range = [1, 2, 3, 5, 7, 10, 15, 20, None]
train_scores_v, test_scores_v = validation_curve(
RandomForestClassifier(n_estimators=100, random_state=42),
X, y,
param_name='max_depth',
param_range=param_range[:-1], # exclude None for plotting
cv=5,
scoring='accuracy',
n_jobs=-1
)
plt.subplot(1, 2, 2)
plt.plot(param_range[:-1], train_scores_v.mean(axis=1), 'b-o', label='Training Score')
plt.plot(param_range[:-1], test_scores_v.mean(axis=1), 'r-o', label='Cross-Validation Score')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve (max_depth)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Learning Curve Interpretation:")
print(" - If both scores converge high: model is good")
print(" - If gap between train/test is large: overfitting")
print(" - If both scores are low: underfitting")
print("\nValidation Curve Interpretation:")
print(" - Left (low depth): both scores low -> underfitting")
print(" - Right (high depth): train high, test drops -> overfitting")
print("  - Sweet spot: where test score peaks")

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, train_test_split,
StratifiedKFold)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, roc_auc_score,
confusion_matrix)
from sklearn.datasets import make_classification
# Generate realistic dataset
X, y = make_classification(
n_samples=2000, n_features=20, n_informative=10,
n_redundant=5, weights=[0.7, 0.3], random_state=42
)
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {np.bincount(y)}")
# Split into train+val and final test
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.15, random_state=42, stratify=y
)
# Step 1: Baseline model
baseline = RandomForestClassifier(random_state=42)
baseline.fit(X_trainval, y_trainval)
print(f"\nBaseline test accuracy: {baseline.score(X_test, y_test):.4f}")
print(f"Baseline test AUC: {roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]):.4f}")
# Step 2: GridSearchCV tuning
param_grid = {
'n_estimators': [100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2],
'max_features': ['sqrt', 'log2']
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=cv,
scoring='roc_auc',
n_jobs=-1,
return_train_score=True
)
grid.fit(X_trainval, y_trainval)
print(f"\nBest CV AUC: {grid.best_score_:.4f}")
print(f"Best params: {grid.best_params_}")
# Step 3: Final evaluation on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]
print(f"\nFinal Test Results:")
print(f"Accuracy: {best_model.score(X_test, y_test):.4f}")
print(f"AUC: {roc_auc_score(y_test, y_proba):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Step 4: Check for overfitting
cv_train = grid.cv_results_['mean_train_score'][grid.best_index_]
cv_test = grid.cv_results_['mean_test_score'][grid.best_index_]
print(f"\nOverfitting check:")
print(f" CV Train AUC: {cv_train:.4f}")
print(f" CV Test AUC: {cv_test:.4f}")
print(f"  Gap: {cv_train - cv_test:.4f} {'(OK)' if cv_train - cv_test < 0.05 else '(might overfit)'}")
Common Mistakes
Tuning Hyperparameters on the Test Set (Data Leakage)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y)
# WRONG: using test set to choose hyperparameters
best_acc = 0
for depth in [3, 5, 10, 20]:
    model = RandomForestClassifier(max_depth=depth)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)  # Evaluating on test set!
    if acc > best_acc:
        best_acc = acc
        best_depth = depth
# The test set is now "seen" -- the reported accuracy is optimistic

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y)
# CORRECT: use cross-validation on training set for tuning
param_grid = {'max_depth': [3, 5, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train) # Only uses training data!
# Final evaluation on test set (only once!)
final_acc = grid.best_estimator_.score(X_test, y_test)
print(f"Best depth: {grid.best_params_['max_depth']}")
print(f"Final test accuracy: {final_acc:.4f}")
Not Using Stratified Splitting for Imbalanced Data
from sklearn.model_selection import train_test_split
# 5% positive class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# The test set might have 0% or 15% positive -- very unreliable

from sklearn.model_selection import train_test_split, StratifiedKFold
# Always stratify for classification, especially imbalanced data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # Preserves class ratios
)
# For cross-validation, use StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
Using Cross-Validation Score as Final Test Score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"Final model accuracy: {scores.mean():.4f}") # This is NOT the final test score!
# You used ALL the data for cross-validation
# There is no held-out test set for final evaluation

from sklearn.model_selection import cross_val_score, train_test_split
# Hold out a final test set first
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.15, random_state=42, stratify=y
)
# Use CV only on train+val for model selection
scores = cross_val_score(model, X_trainval, y_trainval, cv=5)
print(f"CV estimate: {scores.mean():.4f}")
# Final evaluation on held-out test set
model.fit(X_trainval, y_trainval)
final_score = model.score(X_test, y_test)
print(f"Final test score: {final_score:.4f}")
Ignoring the Bias-Variance Tradeoff
from sklearn.tree import DecisionTreeClassifier
# "My training accuracy is 100%, the model is perfect!"
model = DecisionTreeClassifier() # max_depth=None (default)
model.fit(X_train, y_train)
print(f"Train accuracy: {model.score(X_train, y_train)}") # 1.0
# But test accuracy might be 0.75 -- severe overfitting!

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
# Check both training and cross-validation scores
for depth in [2, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    gap = train_acc - cv_acc
    print(f"depth={str(depth):4s}: train={train_acc:.3f}, CV={cv_acc:.3f}, gap={gap:.3f}")
# Choose the depth where CV score is highest (not training score)
Summary
- Accuracy alone is misleading on imbalanced datasets. A model predicting all negatives on 99% negative data gets 99% accuracy but 0% recall. Always use precision, recall, F1, and ROC-AUC.
- The confusion matrix shows TP, TN, FP, FN. Precision = TP/(TP+FP) (how many positive predictions are correct). Recall = TP/(TP+FN) (how many actual positives are detected).
- F1 score is the harmonic mean of precision and recall: F1 = 2*P*R/(P+R). It is low when either P or R is low. Use it when you need a single metric balancing both.
- ROC curve plots TPR vs FPR at all thresholds. AUC (area under the ROC curve) is threshold-independent: 1.0 = perfect, 0.5 = random. It is often the most useful single number for comparing classifiers.
- K-Fold cross-validation splits data into K parts, trains on K-1 and tests on 1, rotating K times. Stratified K-Fold preserves class proportions. Use K=5 or K=10.
- Overfitting: high training score, low test score (model too complex). Underfitting: both scores low (model too simple). The bias-variance tradeoff governs this.
- GridSearchCV tries every combination of hyperparameters with cross-validation. Exhaustive but slow. RandomizedSearchCV samples random combinations -- faster for large search spaces.
- Always hold out a final test set that is NEVER used during model selection or hyperparameter tuning. Use cross-validation only on the training portion.
- Learning curves (score vs training size) diagnose whether more data would help. Validation curves (score vs hyperparameter) find the optimal complexity.
- The proper workflow: (1) Split data into train+val and test. (2) Use CV on train+val for model selection. (3) Tune with GridSearchCV. (4) Evaluate ONCE on test set.
- For imbalanced datasets, always use stratify=y in train_test_split and StratifiedKFold for cross-validation. Score with F1 or ROC-AUC, not accuracy.