What Is It?
What Is Classification?
In machine learning, classification is the task of predicting a discrete category (a label) for a given input. Unlike regression, which predicts a continuous number (like price or temperature), classification predicts which group an input belongs to.
Examples of classification problems:
- Email: spam or not spam (2 classes)
- Disease diagnosis: positive or negative (2 classes)
- Handwritten digit recognition: 0 through 9 (10 classes)
- Sentiment analysis: positive, negative, or neutral (3 classes)
When there are exactly two classes, it is called binary classification. When there are more than two, it is multi-class classification.
What Is Logistic Regression?
Logistic regression is one of the most fundamental classification algorithms. Despite its name containing "regression", it is used for classification. The name comes from the fact that it uses a regression-like equation internally, but passes the result through a sigmoid function to produce a probability between 0 and 1.
# The core idea:
# Linear Regression: y = w1*x1 + w2*x2 + b (output: any real number)
# Logistic Regression: p = sigmoid(w1*x1 + w2*x2 + b) (output: 0 to 1)

Logistic regression answers the question: "What is the probability that this input belongs to class 1?" If the probability is above a threshold (typically 0.5), we predict class 1; otherwise, class 0.
Why Does It Matter?
Why Learn Logistic Regression?
1. The Foundation of Classification
Logistic regression is the simplest classification algorithm and the first one every ML practitioner should understand. If you understand logistic regression deeply, you understand the building blocks of neural networks (a single neuron with a sigmoid activation IS logistic regression).
2. Widely Used in Industry
Logistic regression is used extensively in real-world applications: credit scoring at banks (will this person default on a loan?), medical diagnosis (does this patient have the disease?), ad click prediction (will this user click this ad?), and churn prediction (will this customer leave?). Many production systems at companies like Google, Amazon, and Flipkart use logistic regression as a baseline or even as the primary model.
3. Interpretable and Fast
Unlike black-box models like neural networks, logistic regression gives you interpretable coefficients. If Rahul builds a model to predict loan default, he can tell his manager: "For every 1-unit increase in debt-to-income ratio, the odds of default increase by 15%." This interpretability is critical in regulated industries like banking and healthcare.
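Rahul's "odds increase by 15%" reading comes from the fact that a 1-unit increase in a feature multiplies the odds p/(1-p) by e^w. A minimal sketch, with a hypothetical weight chosen so the odds ratio works out to exactly 1.15:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical model for default risk: z = w*dti + b.
# The weight is made up here, picked so that e^w = 1.15.
w, b = np.log(1.15), -2.0

for dti in [3.0, 4.0]:
    p = sigmoid(w * dti + b)
    odds = p / (1 - p)
    print(f"dti={dti}: P(default)={p:.4f}, odds={odds:.4f}")

# The ratio between consecutive odds values is e^w = 1.15:
# each 1-unit increase in debt-to-income multiplies the odds by 1.15.
```

This is why practitioners report exp(coefficient) ("odds ratio") rather than the raw coefficient when explaining a logistic regression model.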
4. Great Baseline Model
In any ML project, you should always start with a simple model before trying complex ones. Logistic regression is the standard baseline for classification. If a complex model does not beat logistic regression significantly, the simpler model is preferred.
5. Gateway to Neural Networks
A single neuron in a neural network with a sigmoid activation function is exactly logistic regression. Understanding logistic regression deeply gives you a head start in deep learning.
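That equivalence can be shown in a few lines. The weights, bias, and input below are made up for illustration; the point is that a single sigmoid neuron computes exactly the logistic regression forward pass:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    # One neuron: weighted sum of inputs, then sigmoid activation
    return sigmoid(np.dot(w, x) + b)

# Hypothetical learned parameters and one input
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

p = neuron(x, w, b)  # identical to logistic regression's p = sigmoid(w.x + b)
print(f"P(class 1) = {p:.4f}")
```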
Detailed Explanation
1. From Linear Regression to Logistic Regression
In linear regression, we compute: z = w1*x1 + w2*x2 + ... + wn*xn + b. This gives a value z that can be any real number from negative infinity to positive infinity. But for classification, we need a probability between 0 and 1. The sigmoid function does exactly this transformation.
2. The Sigmoid Function
The sigmoid function (also called the logistic function) maps any real number to a value between 0 and 1:
# Sigmoid function formula:
# sigma(z) = 1 / (1 + e^(-z))
#
# Properties:
# - When z = 0: sigma(0) = 0.5
# - When z >> 0: sigma(z) approaches 1
# - When z << 0: sigma(z) approaches 0
# - Output is always between 0 and 1
# - The curve is S-shaped (hence "sigmoid")
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
return 1 / (1 + np.exp(-z))
z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.axhline(y=0.5, color='r', linestyle='--', label='Threshold = 0.5')
plt.xlabel('z (linear combination)')
plt.ylabel('sigma(z) (probability)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True)
plt.show()

The sigmoid function has a beautiful mathematical property: its derivative can be expressed in terms of itself: sigma'(z) = sigma(z) * (1 - sigma(z)). This makes gradient computation efficient during training.
3. The Logistic Regression Model
Logistic regression combines a linear equation with the sigmoid function:
# Step 1: Compute the linear combination
# z = w1*x1 + w2*x2 + ... + wn*xn + b
# z = W^T * X + b (vector notation)
# Step 2: Apply sigmoid to get probability
# p = sigma(z) = 1 / (1 + e^(-z))
# Step 3: Make prediction
# y_pred = 1 if p >= 0.5 else 0

The model learns the weights w1, w2, ..., wn and bias b during training. These weights determine the decision boundary -- the line (or hyperplane) that separates the two classes.
4. The Decision Boundary
The decision boundary is where p = 0.5, which means z = 0, which means w1*x1 + w2*x2 + b = 0. This is a straight line in 2D (or a hyperplane in higher dimensions).
For two features x1 and x2: w1*x1 + w2*x2 + b = 0 gives us x2 = -(w1/w2)*x1 - (b/w2), which is the equation of a line. Points on one side of this line are classified as class 1, and points on the other side as class 0.
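A quick numeric check of that rearrangement, using hypothetical weights: points constructed to lie on the line w1*x1 + w2*x2 + b = 0 give z = 0 and therefore p = 0.5.

```python
import numpy as np

# Hypothetical weights for features x1, x2, and bias b
w1, w2, b = 1.5, -2.0, 0.5

# Boundary: w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1/w2)*x1 - b/w2
x1 = np.array([0.0, 1.0, 2.0])
x2_boundary = -(w1 / w2) * x1 - b / w2

# Plugging boundary points back in gives z = 0 for every x1
z = w1 * x1 + w2 * x2_boundary + b
print(z)  # all zeros (up to floating point)
print(1 / (1 + np.exp(-z)))  # all 0.5
```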
5. Log Loss (Binary Cross-Entropy)
We cannot use MSE (Mean Squared Error) for logistic regression: composing the sigmoid with a squared-error loss makes the loss surface non-convex, so gradient descent can get stuck in poor local minima. Instead, we use log loss (binary cross-entropy), which is convex for this model:
# Log Loss for a single sample:
# L = -[y * log(p) + (1-y) * log(1-p)]
#
# Where:
# y = actual label (0 or 1)
# p = predicted probability
#
# If y=1: L = -log(p) -> penalizes low p heavily
# If y=0: L = -log(1-p) -> penalizes high p heavily
#
# Total loss over N samples:
# J = -(1/N) * sum[y_i * log(p_i) + (1-y_i) * log(1-p_i)]

Intuition: If the actual label is 1 and we predict p=0.99, the loss is very small (-log(0.99) ≈ 0.01). But if we predict p=0.01, the loss is huge (-log(0.01) ≈ 4.6). The log function penalizes confident wrong predictions very heavily.
6. Gradient Descent for Logistic Regression
We minimize the log loss using gradient descent. The gradients turn out to be elegant:
# Gradient of loss with respect to weights:
# dJ/dw = (1/N) * X^T * (predictions - actuals)
# dJ/db = (1/N) * sum(predictions - actuals)
#
# Update rules:
# w = w - learning_rate * dJ/dw
# b = b - learning_rate * dJ/db
#
# These are exactly the same form as linear regression!

This is remarkable: even though the loss function is different (log loss vs MSE), the gradient update rules have the same form as linear regression. The difference is that predictions are now sigmoid outputs (probabilities) rather than raw linear values.
7. Multi-Class Classification
Logistic regression naturally handles binary classification. For multi-class problems (3+ classes), there are two main strategies:
One-vs-Rest (OvR)
Train K separate binary classifiers, one for each class. Each classifier learns to distinguish its class from all others. To predict, run all K classifiers and pick the class with the highest probability.
# For 3 classes (cat, dog, bird):
# Classifier 1: cat vs (not cat)
# Classifier 2: dog vs (not dog)
# Classifier 3: bird vs (not bird)
# Prediction: class with highest probability among the 3 classifiers

Softmax Regression (Multinomial Logistic Regression)
Instead of K separate classifiers, use a single model with K output neurons, each producing a score. The softmax function converts these scores into probabilities that sum to 1:
# Softmax function:
# P(class_k) = e^(z_k) / sum(e^(z_j) for all j)
#
# Example: scores = [2.0, 1.0, 0.5]
# e^2.0 = 7.39, e^1.0 = 2.72, e^0.5 = 1.65
# sum = 11.76
# probabilities = [7.39/11.76, 2.72/11.76, 1.65/11.76]
# = [0.63, 0.23, 0.14] (sum = 1.0)

8. Evaluation Metrics for Classification
Confusion Matrix
A confusion matrix shows the counts of correct and incorrect predictions for each class:
# Predicted
# Positive Negative
# Actual Positive TP FN
# Actual Negative FP TN
#
# TP (True Positive): Predicted positive, actually positive (correct)
# TN (True Negative): Predicted negative, actually negative (correct)
# FP (False Positive): Predicted positive, actually negative (Type I error)
# FN (False Negative): Predicted negative, actually positive (Type II error)

Metrics Derived from Confusion Matrix
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP) -> Of all positive predictions, how many are correct?
# Recall = TP / (TP + FN) -> Of all actual positives, how many did we catch?
# F1 Score = 2 * (Precision * Recall) / (Precision + Recall) -> Harmonic mean

When to prioritize which metric:
- Precision: When false positives are costly (spam detection -- you do not want to mark a real email as spam)
- Recall: When false negatives are costly (cancer detection -- you do not want to miss a positive case)
- F1 Score: When you need a balance between precision and recall
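The precision/recall tradeoff is controlled by the decision threshold: raising it makes the model more conservative about predicting positive (precision up, recall down). A small sketch with made-up probabilities and labels:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical classifier probabilities and true labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
p      = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.45, 0.2, 0.6, 0.55])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (p >= threshold).astype(int)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={prec:.2f}, recall={rec:.2f}")
```

On this data, moving the threshold from 0.3 to 0.7 trades recall for precision, which is exactly the lever you pull when one type of error is costlier than the other.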
Code Examples
import numpy as np
import matplotlib.pyplot as plt
# Sigmoid function
def sigmoid(z):
return 1 / (1 + np.exp(-z))
# Plot sigmoid
z = np.linspace(-10, 10, 200)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(z, sigmoid(z), 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.fill_between(z, sigmoid(z), 0.5, where=(sigmoid(z) >= 0.5),
alpha=0.2, color='green', label='Predict Class 1')
plt.fill_between(z, sigmoid(z), 0.5, where=(sigmoid(z) < 0.5),
alpha=0.2, color='red', label='Predict Class 0')
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)
# Demonstrate key values
print("Sigmoid values at key points:")
for val in [-10, -5, -2, 0, 2, 5, 10]:
print(f" sigmoid({val:3d}) = {sigmoid(val):.6f}")
# Derivative of sigmoid
def sigmoid_derivative(z):
s = sigmoid(z)
return s * (1 - s)
plt.subplot(1, 2, 2)
plt.plot(z, sigmoid_derivative(z), 'r-', linewidth=2)
plt.xlabel('z')
plt.ylabel("sigmoid'(z)")
plt.title('Sigmoid Derivative (max at z=0)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

import numpy as np
class LogisticRegressionScratch:
def __init__(self, learning_rate=0.01, n_iterations=1000):
self.learning_rate = learning_rate
self.n_iterations = n_iterations
self.weights = None
self.bias = None
self.losses = []
def sigmoid(self, z):
# Clip z to avoid overflow in exp
z = np.clip(z, -500, 500)
return 1 / (1 + np.exp(-z))
def compute_loss(self, y, y_pred):
# Binary cross-entropy (log loss)
epsilon = 1e-15 # avoid log(0)
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
return loss
def fit(self, X, y):
n_samples, n_features = X.shape
self.weights = np.zeros(n_features)
self.bias = 0
for i in range(self.n_iterations):
# Forward pass
z = np.dot(X, self.weights) + self.bias
y_pred = self.sigmoid(z)
# Compute gradients
dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
db = (1 / n_samples) * np.sum(y_pred - y)
# Update parameters
self.weights -= self.learning_rate * dw
self.bias -= self.learning_rate * db
# Track loss
loss = self.compute_loss(y, y_pred)
self.losses.append(loss)
if (i + 1) % 200 == 0:
print(f"Iteration {i+1}/{self.n_iterations}, Loss: {loss:.4f}")
def predict_proba(self, X):
z = np.dot(X, self.weights) + self.bias
return self.sigmoid(z)
def predict(self, X, threshold=0.5):
probabilities = self.predict_proba(X)
return (probabilities >= threshold).astype(int)
# Generate sample data: students who pass (1) or fail (0)
# Features: study_hours, attendance_percentage
np.random.seed(42)
n = 100
study_hours = np.random.uniform(1, 10, n)
attendance = np.random.uniform(40, 100, n)
# Rule: pass if 3*study_hours + 0.5*attendance + noise > 45
noise = np.random.normal(0, 3, n)
score = 3 * study_hours + 0.5 * attendance + noise
y = (score > 45).astype(int)
# Normalize features
X = np.column_stack([
(study_hours - study_hours.mean()) / study_hours.std(),
(attendance - attendance.mean()) / attendance.std()
])
# Train model
model = LogisticRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)
# Predictions
y_pred = model.predict(X)
accuracy = np.mean(y_pred == y)
print(f"\nTraining Accuracy: {accuracy:.2%}")
print(f"Learned weights: {model.weights}")
print(f"Learned bias: {model.bias:.4f}")

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, classification_report)
# Generate dataset: predict if student passes
np.random.seed(42)
n = 200
study_hours = np.random.uniform(1, 10, n)
attendance = np.random.uniform(40, 100, n)
assignment_score = np.random.uniform(0, 100, n)
noise = np.random.normal(0, 5, n)
score = 2.5 * study_hours + 0.3 * attendance + 0.2 * assignment_score + noise
y = (score > 45).astype(int)
X = np.column_stack([study_hours, attendance, assignment_score])
print(f"Dataset: {X.shape[0]} students, {X.shape[1]} features")
print(f"Class distribution: Pass={y.sum()}, Fail={len(y)-y.sum()}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train logistic regression
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)
# Evaluation
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))
# Model coefficients
print(f"\nCoefficients: {model.coef_[0]}")
print(f"Intercept: {model.intercept_[0]:.4f}")
# Predict probabilities for a new student
new_student = scaler.transform([[7.5, 85, 72]])
proba = model.predict_proba(new_student)[0]
print(f"\nNew student (7.5 hrs, 85% attendance, 72 assignment):")
print(f" P(Fail) = {proba[0]:.4f}, P(Pass) = {proba[1]:.4f}")
print(f" Prediction: {'Pass' if model.predict(new_student)[0] == 1 else 'Fail'}")

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Generate 2D data for visualization
np.random.seed(42)
# Class 0: students who fail
n0 = 50
study_0 = np.random.normal(3, 1.2, n0)
attendance_0 = np.random.normal(55, 12, n0)
# Class 1: students who pass
n1 = 50
study_1 = np.random.normal(7, 1.2, n1)
attendance_1 = np.random.normal(80, 12, n1)
X = np.vstack([np.column_stack([study_0, attendance_0]),
np.column_stack([study_1, attendance_1])])
y = np.array([0]*n0 + [1]*n1)
# Scale and train
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(random_state=42)
model.fit(X_scaled, y)
# Create mesh grid for decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 5, X[:, 1].max() + 5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])
grid_scaled = scaler.transform(grid)
Z = model.predict_proba(grid_scaled)[:, 1].reshape(xx.shape)
# Plot
plt.figure(figsize=(10, 7))
plt.contourf(xx, yy, Z, levels=50, cmap='RdYlGn', alpha=0.6)
plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label='Fail', edgecolors='k', s=60)
plt.scatter(X[y==1, 0], X[y==1, 1], c='green', label='Pass', edgecolors='k', s=60)
plt.xlabel('Study Hours')
plt.ylabel('Attendance %')
plt.title('Logistic Regression Decision Boundary')
plt.legend()
plt.colorbar(label='P(Pass)')
plt.show()
print(f"Training accuracy: {model.score(X_scaled, y):.2%}")

import numpy as np
import matplotlib.pyplot as plt
# Log loss for a single sample
def log_loss_single(y_true, y_pred):
epsilon = 1e-15
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
p = np.linspace(0.001, 0.999, 500)
plt.figure(figsize=(10, 5))
# When actual label = 1
loss_when_1 = log_loss_single(1, p)
plt.subplot(1, 2, 1)
plt.plot(p, loss_when_1, 'b-', linewidth=2)
plt.xlabel('Predicted Probability (p)')
plt.ylabel('Loss')
plt.title('Loss when Actual = 1')
plt.grid(True, alpha=0.3)
plt.annotate('Low p = High loss\n(confident wrong prediction)',
xy=(0.05, 3), fontsize=9)
# When actual label = 0
loss_when_0 = log_loss_single(0, p)
plt.subplot(1, 2, 2)
plt.plot(p, loss_when_0, 'r-', linewidth=2)
plt.xlabel('Predicted Probability (p)')
plt.ylabel('Loss')
plt.title('Loss when Actual = 0')
plt.grid(True, alpha=0.3)
plt.annotate('High p = High loss\n(confident wrong prediction)',
xy=(0.6, 3), fontsize=9)
plt.tight_layout()
plt.show()
# Concrete examples
print("Log loss examples:")
print(f" Actual=1, Predicted=0.95 -> Loss: {log_loss_single(1, 0.95):.4f} (good)")
print(f" Actual=1, Predicted=0.50 -> Loss: {log_loss_single(1, 0.50):.4f} (uncertain)")
print(f" Actual=1, Predicted=0.05 -> Loss: {log_loss_single(1, 0.05):.4f} (terrible)")
print(f" Actual=0, Predicted=0.05 -> Loss: {log_loss_single(0, 0.05):.4f} (good)")
print(f" Actual=0, Predicted=0.95 -> Loss: {log_loss_single(0, 0.95):.4f} (terrible)")

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
# Load iris dataset (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
print(f"Classes: {iris.target_names}")
print(f"Features: {iris.feature_names}")
print(f"Shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# One-vs-Rest approach
# Note: the multi_class parameter is deprecated in scikit-learn >= 1.5
# (multinomial is the default there; use OneVsRestClassifier for explicit OvR)
model_ovr = LogisticRegression(multi_class='ovr', random_state=42, max_iter=1000)
model_ovr.fit(X_train_scaled, y_train)
print(f"\nOne-vs-Rest Accuracy: {model_ovr.score(X_test_scaled, y_test):.4f}")
# Softmax (multinomial) approach
model_softmax = LogisticRegression(multi_class='multinomial', random_state=42, max_iter=1000)
model_softmax.fit(X_train_scaled, y_train)
print(f"Softmax Accuracy: {model_softmax.score(X_test_scaled, y_test):.4f}")
# Detailed evaluation for softmax model
y_pred = model_softmax.predict(X_test_scaled)
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# Predict probabilities for a sample
sample = X_test_scaled[0:1]
proba = model_softmax.predict_proba(sample)[0]
print(f"\nSample prediction probabilities:")
for name, prob in zip(iris.target_names, proba):
print(f" {name}: {prob:.4f}")
print(f"Predicted class: {iris.target_names[model_softmax.predict(sample)[0]]}")

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, accuracy_score,
precision_score, recall_score, f1_score)
import matplotlib.pyplot as plt
import seaborn as sns
# Generate imbalanced binary data
np.random.seed(42)
n = 300
# 80% negative class (fail), 20% positive class (pass)
X_neg = np.random.randn(240, 2) + np.array([-1, -1])
X_pos = np.random.randn(60, 2) + np.array([2, 2])
X = np.vstack([X_neg, X_pos])
y = np.array([0]*240 + [1]*60)
# Shuffle
idx = np.random.permutation(len(y))
X, y = X[idx], y[idx]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = LogisticRegression(random_state=42)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f" TN={cm[0][0]}, FP={cm[0][1]}")
print(f" FN={cm[1][0]}, TP={cm[1][1]}")
# Calculate metrics manually
TP, TN, FP, FN = cm[1][1], cm[0][0], cm[0][1], cm[1][0]
print(f"\nManual calculation:")
print(f" Accuracy = (TP+TN)/(TP+TN+FP+FN) = ({TP}+{TN})/({TP}+{TN}+{FP}+{FN}) = {(TP+TN)/(TP+TN+FP+FN):.4f}")
print(f" Precision = TP/(TP+FP) = {TP}/({TP}+{FP}) = {TP/(TP+FP):.4f}")
print(f" Recall = TP/(TP+FN) = {TP}/({TP}+{FN}) = {TP/(TP+FN):.4f}")
precision = TP/(TP+FP)
recall = TP/(TP+FN)
f1 = 2 * precision * recall / (precision + recall)
print(f" F1 Score = 2*P*R/(P+R) = {f1:.4f}")
# Verify with sklearn
print(f"\nsklearn verification:")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f" Precision: {precision_score(y_test, y_pred):.4f}")
print(f" Recall: {recall_score(y_test, y_pred):.4f}")
print(f" F1 Score: {f1_score(y_test, y_pred):.4f}")
# Plot confusion matrix heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=['Fail', 'Pass'],
yticklabels=['Fail', 'Pass'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
# Create realistic student dataset
np.random.seed(42)
n = 500
data = {
'study_hours_per_week': np.random.uniform(1, 15, n),
'attendance_pct': np.random.uniform(30, 100, n),
'assignment_avg': np.random.uniform(20, 100, n),
'previous_gpa': np.random.uniform(1.0, 4.0, n),
'sleep_hours': np.random.uniform(4, 10, n)
}
df = pd.DataFrame(data)
# Generate target: pass (1) or fail (0)
score = (0.2 * df['study_hours_per_week'] +
0.15 * df['attendance_pct'] +
0.1 * df['assignment_avg'] +
2.0 * df['previous_gpa'] +
0.3 * df['sleep_hours'] +
np.random.normal(0, 2, n))
df['passed'] = (score > 17).astype(int)
print("Dataset Preview:")
print(df.head())
print(f"\nClass balance: {df['passed'].value_counts().to_dict()}")
# Prepare features and target
feature_cols = ['study_hours_per_week', 'attendance_pct',
'assignment_avg', 'previous_gpa', 'sleep_hours']
X = df[feature_cols].values
y = df['passed'].values
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Train
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)
# Evaluate
y_pred = model.predict(X_test_s)
print(f"\nTest Accuracy: {model.score(X_test_s, y_test):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))
# Feature importance (coefficients)
print("Feature Importance (coefficients):")
for feature, coef in sorted(zip(feature_cols, model.coef_[0]),
key=lambda x: abs(x[1]), reverse=True):
print(f" {feature:25s}: {coef:+.4f}")
# Predict for specific students
students = {
'Aarav': [10, 90, 85, 3.5, 7],
'Priya': [3, 45, 40, 1.5, 5],
'Kavita': [6, 70, 60, 2.5, 6],
}
print(f"\nPredictions for new students:")
for name, features in students.items():
feat_scaled = scaler.transform([features])
prob = model.predict_proba(feat_scaled)[0]
pred = 'Pass' if model.predict(feat_scaled)[0] == 1 else 'Fail'
print(f" {name}: P(Fail)={prob[0]:.3f}, P(Pass)={prob[1]:.3f} -> {pred}")

Common Mistakes
Not Scaling Features Before Logistic Regression
from sklearn.linear_model import LogisticRegression
# Features with very different scales
# study_hours: 1-10, attendance: 40-100, salary: 10000-100000
X = [[5, 80, 50000], [3, 60, 30000], [8, 90, 70000]]
y = [1, 0, 1]
model = LogisticRegression()
model.fit(X, y)  # salary dominates because of scale

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
X = [[5, 80, 50000], [3, 60, 30000], [8, 90, 70000]]
y = [1, 0, 1]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression()
model.fit(X_scaled, y)  # All features on same scale now

Using Accuracy on Imbalanced Datasets
from sklearn.metrics import accuracy_score
# 95% negative, 5% positive (e.g., fraud detection)
y_test = [0]*950 + [1]*50
y_pred = [0]*1000 # Predicts all as negative
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}") # 95% accuracy!
# But it misses ALL fraud cases!

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_test = [0]*950 + [1]*50
y_pred = [0]*1000
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred, zero_division=0):.2%}")
print(f"Recall: {recall_score(y_test, y_pred):.2%}") # 0%!
print(f"F1 Score: {f1_score(y_test, y_pred, zero_division=0):.2%}")
# Accuracy is misleading here. Use F1, recall, or ROC-AUC instead.

Applying Scaler Fitted on Test Data (Data Leakage)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# WRONG: fitting scaler on entire dataset or on test data
scaler = StandardScaler()
scaler.fit(X) # Leaks test data statistics into training
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# CORRECT: fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform on train
X_test_scaled = scaler.transform(X_test)  # only transform on test

Forgetting to Set random_state for Reproducibility
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Different split every time -> different results every time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # Different each run

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Fixed random state = same split every time
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # Same result every run

Confusing predict() and predict_proba()
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
# Using predict() when you need probabilities for threshold tuning
predictions = model.predict(X_test)
# predictions = [0, 1, 1, 0, 1, ...] (hard labels, threshold locked at 0.5)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
# Use predict_proba() to get probabilities
probabilities = model.predict_proba(X_test) # Shape: (n_samples, n_classes)
p_positive = probabilities[:, 1] # Probability of class 1
# Now you can use any threshold
threshold = 0.3 # Lower threshold = higher recall
y_pred = (p_positive >= threshold).astype(int)

Summary
- Classification predicts discrete categories (labels), unlike regression which predicts continuous numbers. Binary classification has 2 classes; multi-class has 3 or more.
- Logistic regression passes a linear combination through the sigmoid function: p = 1 / (1 + e^(-z)), producing a probability between 0 and 1.
- The sigmoid function maps any real number to (0, 1). At z=0, sigmoid outputs 0.5 (the default decision boundary). Its derivative is sigmoid(z) * (1 - sigmoid(z)).
- Log loss (binary cross-entropy) is the loss function: L = -[y*log(p) + (1-y)*log(1-p)]. It penalizes confident wrong predictions heavily: the loss grows without bound as the predicted probability approaches the wrong extreme.
- The decision boundary is where P(class 1) = 0.5, which corresponds to the linear equation w^T*X + b = 0. Logistic regression creates a linear decision boundary.
- Gradient descent updates weights using: w = w - lr * (1/N) * X^T * (predictions - actuals). The gradient formula is identical in form to linear regression.
- For multi-class classification, use One-vs-Rest (K separate binary classifiers) or Softmax regression (single model with K outputs summing to 1).
- The confusion matrix shows TP, TN, FP, FN. From it: Accuracy = (TP+TN)/total, Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2*P*R/(P+R).
- Use precision when false positives are costly (spam filter). Use recall when false negatives are costly (disease detection). F1 balances both.
- Always scale features before logistic regression. Use StandardScaler fitted only on training data. Fit on train, transform both train and test.
- predict_proba() returns probabilities (useful for threshold tuning and ROC curves). predict() returns hard labels using threshold 0.5 by default.