Chapter 6 Intermediate 55 min read Updated 2026-04-10

Logistic Regression and Classification

What Is It?

What Is Classification?

In machine learning, classification is the task of predicting a discrete category (a label) for a given input. Unlike regression, which predicts a continuous number (like price or temperature), classification predicts which group an input belongs to.

Examples of classification problems:

  • Email: spam or not spam (2 classes)
  • Disease diagnosis: positive or negative (2 classes)
  • Handwritten digit recognition: 0 through 9 (10 classes)
  • Sentiment analysis: positive, negative, or neutral (3 classes)

When there are exactly two classes, it is called binary classification. When there are more than two, it is multi-class classification.

What Is Logistic Regression?

Logistic regression is one of the most fundamental classification algorithms. Despite its name containing "regression", it is used for classification. The name comes from the fact that it uses a regression-like equation internally, but passes the result through a sigmoid function to produce a probability between 0 and 1.

# The core idea:
# Linear Regression:   y = w1*x1 + w2*x2 + b       (output: any real number)
# Logistic Regression: p = sigmoid(w1*x1 + w2*x2 + b)  (output: 0 to 1)

Logistic regression answers the question: "What is the probability that this input belongs to class 1?" If the probability is above a threshold (typically 0.5), we predict class 1; otherwise, class 0.

Why Does It Matter?

Why Learn Logistic Regression?

1. The Foundation of Classification

Logistic regression is the simplest classification algorithm and the first one every ML practitioner should understand. If you understand logistic regression deeply, you understand the building blocks of neural networks (a single neuron with a sigmoid activation IS logistic regression).

2. Widely Used in Industry

Logistic regression is used extensively in real-world applications: credit scoring at banks (will this person default on a loan?), medical diagnosis (does this patient have the disease?), ad click prediction (will this user click this ad?), and churn prediction (will this customer leave?). Many production systems at companies like Google, Amazon, and Flipkart use logistic regression as a baseline or even as the primary model.

3. Interpretable and Fast

Unlike black-box models like neural networks, logistic regression gives you interpretable coefficients. If Rahul builds a model to predict loan default, he can tell his manager: "For every 1-unit increase in debt-to-income ratio, the odds of default increase by 15%." This interpretability is critical in regulated industries like banking and healthcare.
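That odds reading comes from exponentiating a coefficient. A minimal sketch of the arithmetic, where the coefficient value is made up for illustration rather than taken from a trained model:

```python
import numpy as np

# Hypothetical learned coefficient for debt-to-income ratio (illustrative only)
coef = 0.14

# In logistic regression, a 1-unit increase in a feature multiplies
# the odds of the positive class by e^coef (the odds ratio).
odds_ratio = np.exp(coef)
print(f"Odds ratio: {odds_ratio:.3f}")  # ~1.150, i.e. roughly a 15% increase in odds
```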

4. Great Baseline Model

In any ML project, you should always start with a simple model before trying complex ones. Logistic regression is the standard baseline for classification. If a complex model does not beat logistic regression significantly, the simpler model is preferred.

5. Gateway to Neural Networks

A single neuron in a neural network with a sigmoid activation function is exactly logistic regression. Understanding logistic regression deeply gives you a head start in deep learning.

Detailed Explanation

Detailed Explanation

1. From Linear Regression to Logistic Regression

In linear regression, we compute: z = w1*x1 + w2*x2 + ... + wn*xn + b. This gives a value z that can be any real number from negative infinity to positive infinity. But for classification, we need a probability between 0 and 1. The sigmoid function does exactly this transformation.

2. The Sigmoid Function

The sigmoid function (also called the logistic function) maps any real number to a value between 0 and 1:

# Sigmoid function formula:
# sigma(z) = 1 / (1 + e^(-z))
#
# Properties:
# - When z = 0:  sigma(0) = 0.5
# - When z >> 0: sigma(z) approaches 1
# - When z << 0: sigma(z) approaches 0
# - Output is always between 0 and 1
# - The curve is S-shaped (hence "sigmoid")

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.axhline(y=0.5, color='r', linestyle='--', label='Threshold = 0.5')
plt.xlabel('z (linear combination)')
plt.ylabel('sigma(z) (probability)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True)
plt.show()

The sigmoid function has a beautiful mathematical property: its derivative can be expressed in terms of itself: sigma'(z) = sigma(z) * (1 - sigma(z)). This makes gradient computation efficient during training.
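The identity is easy to sanity-check numerically: a central finite difference should agree with sigma(z) * (1 - sigma(z)).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, h = 1.3, 1e-6
# Central-difference approximation of the derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
# Closed form: sigma(z) * (1 - sigma(z))
analytic = sigmoid(z) * (1 - sigmoid(z))
print(f"numeric={numeric:.6f}, analytic={analytic:.6f}")  # both ~0.168298
```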

3. The Logistic Regression Model

Logistic regression combines a linear equation with the sigmoid function:

# Step 1: Compute the linear combination
# z = w1*x1 + w2*x2 + ... + wn*xn + b
# z = W^T * X + b  (vector notation)

# Step 2: Apply sigmoid to get probability
# p = sigma(z) = 1 / (1 + e^(-z))

# Step 3: Make prediction
# y_pred = 1 if p >= 0.5 else 0

The model learns the weights w1, w2, ..., wn and bias b during training. These weights determine the decision boundary -- the line (or hyperplane) that separates the two classes.

4. The Decision Boundary

The decision boundary is where p = 0.5, which means z = 0, which means w1*x1 + w2*x2 + b = 0. This is a straight line in 2D (or a hyperplane in higher dimensions).

For two features x1 and x2: w1*x1 + w2*x2 + b = 0 gives us x2 = -(w1/w2)*x1 - (b/w2), which is the equation of a line. Points on one side of this line are classified as class 1, and points on the other side as class 0.
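A small sketch with made-up weights (not learned from any data) confirms the algebra: any point on the rearranged line gives z = 0, and therefore p = 0.5.

```python
# Hypothetical weights for illustration (not learned from data)
w1, w2, b = 1.5, 2.0, -1.0

# Rearranged boundary: x2 = -(w1/w2)*x1 - (b/w2)
slope = -(w1 / w2)      # -0.75
intercept = -(b / w2)   #  0.5

# Pick any x1; the corresponding boundary point has z exactly 0
x1 = 2.0
x2 = slope * x1 + intercept
z = w1 * x1 + w2 * x2 + b
print(z)  # 0.0 -> sigmoid(0) = 0.5, right on the decision boundary
```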

5. Log Loss (Binary Cross-Entropy)

We cannot use MSE (Mean Squared Error) for logistic regression: combined with the sigmoid, MSE produces a non-convex loss surface with many local minima, so gradient descent can get stuck. Instead, we use log loss (binary cross-entropy), which is convex for this model:

# Log Loss for a single sample:
# L = -[y * log(p) + (1-y) * log(1-p)]
#
# Where:
#   y = actual label (0 or 1)
#   p = predicted probability
#
# If y=1: L = -log(p)       -> penalizes low p heavily
# If y=0: L = -log(1-p)     -> penalizes high p heavily
#
# Total loss over N samples:
# J = -(1/N) * sum[y_i * log(p_i) + (1-y_i) * log(1-p_i)]

Intuition: If the actual label is 1 and we predict p=0.99, the loss is very small (-log(0.99) = 0.01). But if we predict p=0.01, the loss is huge (-log(0.01) = 4.6). The log function penalizes confident wrong predictions very heavily.

6. Gradient Descent for Logistic Regression

We minimize the log loss using gradient descent. The gradients turn out to be elegant:

# Gradient of loss with respect to weights:
# dJ/dw = (1/N) * X^T * (predictions - actuals)
# dJ/db = (1/N) * sum(predictions - actuals)
#
# Update rules:
# w = w - learning_rate * dJ/dw
# b = b - learning_rate * dJ/db
#
# These are exactly the same form as linear regression!

This is remarkable: even though the loss function is different (log loss vs MSE), the gradient update rules have the same form as linear regression. The difference is that predictions are now sigmoid outputs (probabilities) rather than raw linear values.

7. Multi-Class Classification

Logistic regression naturally handles binary classification. For multi-class problems (3+ classes), there are two main strategies:

One-vs-Rest (OvR)

Train K separate binary classifiers, one for each class. Each classifier learns to distinguish its class from all others. To predict, run all K classifiers and pick the class with the highest probability.

# For 3 classes (cat, dog, bird):
# Classifier 1: cat vs (not cat)
# Classifier 2: dog vs (not dog)
# Classifier 3: bird vs (not bird)
# Prediction: class with highest probability among the 3 classifiers

Softmax Regression (Multinomial Logistic Regression)

Instead of K separate classifiers, use a single model with K output neurons, each producing a score. The softmax function converts these scores into probabilities that sum to 1:

# Softmax function:
# P(class_k) = e^(z_k) / sum(e^(z_j) for all j)
#
# Example: scores = [2.0, 1.0, 0.5]
# e^2.0 = 7.39, e^1.0 = 2.72, e^0.5 = 1.65
# sum = 11.76
# probabilities = [7.39/11.76, 2.72/11.76, 1.65/11.76]
#                = [0.63, 0.23, 0.14]  (sum = 1.0)
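The worked numbers above can be reproduced in a few lines of NumPy. Subtracting the maximum score before exponentiating is a standard trick to avoid overflow; it does not change the result.

```python
import numpy as np

def softmax(scores):
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

probs = softmax([2.0, 1.0, 0.5])
print(np.round(probs, 2))  # [0.63 0.23 0.14]
print(probs.sum())         # sums to 1 (up to floating-point rounding)
```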

8. Evaluation Metrics for Classification

Confusion Matrix

A confusion matrix shows the counts of correct and incorrect predictions for each class:

#                    Predicted
#                  Positive  Negative
# Actual Positive    TP        FN
# Actual Negative    FP        TN
#
# TP (True Positive):  Predicted positive, actually positive (correct)
# TN (True Negative):  Predicted negative, actually negative (correct)
# FP (False Positive): Predicted positive, actually negative (Type I error)
# FN (False Negative): Predicted negative, actually positive (Type II error)

Metrics Derived from Confusion Matrix

# Accuracy  = (TP + TN) / (TP + TN + FP + FN)
# Precision = TP / (TP + FP)     -> Of all positive predictions, how many are correct?
# Recall    = TP / (TP + FN)     -> Of all actual positives, how many did we catch?
# F1 Score  = 2 * (Precision * Recall) / (Precision + Recall)  -> Harmonic mean

When to prioritize which metric:

  • Precision: When false positives are costly (spam detection -- you do not want to mark a real email as spam)
  • Recall: When false negatives are costly (cancer detection -- you do not want to miss a positive case)
  • F1 Score: When you need a balance between precision and recall
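Plugging counts into the formulas above makes the relationships concrete. The counts here are invented for illustration:

```python
# Hypothetical confusion-matrix counts (illustrative only)
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)        # fraction of all predictions correct
precision = TP / (TP + FP)                        # correctness of positive predictions
recall = TP / (TP + FN)                           # coverage of actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} "
      f"Recall={recall:.2f} F1={f1:.2f}")
# Accuracy=0.85 Precision=0.89 Recall=0.80 F1=0.84
```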

Code Examples

Sigmoid Function and Decision Boundary Visualization
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Plot sigmoid
z = np.linspace(-10, 10, 200)
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(z, sigmoid(z), 'b-', linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.7)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.fill_between(z, sigmoid(z), 0.5, where=(sigmoid(z) >= 0.5),
                 alpha=0.2, color='green', label='Predict Class 1')
plt.fill_between(z, sigmoid(z), 0.5, where=(sigmoid(z) < 0.5),
                 alpha=0.2, color='red', label='Predict Class 0')
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('Sigmoid Function')
plt.legend()
plt.grid(True, alpha=0.3)

# Demonstrate key values
print("Sigmoid values at key points:")
for val in [-10, -5, -2, 0, 2, 5, 10]:
    print(f"  sigmoid({val:3d}) = {sigmoid(val):.6f}")

# Derivative of sigmoid
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

plt.subplot(1, 2, 2)
plt.plot(z, sigmoid_derivative(z), 'r-', linewidth=2)
plt.xlabel('z')
plt.ylabel("sigmoid'(z)")
plt.title('Sigmoid Derivative (max at z=0)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
The sigmoid function maps any real number to a probability between 0 and 1. At z=0, the output is exactly 0.5 (the decision boundary). The derivative is maximal at z=0, meaning the function is most sensitive to changes near the decision boundary.
Sigmoid values at key points:
  sigmoid(-10) = 0.000045
  sigmoid( -5) = 0.006693
  sigmoid( -2) = 0.119203
  sigmoid(  0) = 0.500000
  sigmoid(  2) = 0.880797
  sigmoid(  5) = 0.993307
  sigmoid( 10) = 0.999955
Logistic Regression from Scratch with NumPy
import numpy as np

class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []
    
    def sigmoid(self, z):
        # Clip z to avoid overflow in exp
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def compute_loss(self, y, y_pred):
        # Binary cross-entropy (log loss)
        epsilon = 1e-15  # avoid log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        loss = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
        return loss
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for i in range(self.n_iterations):
            # Forward pass
            z = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(z)
            
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # Track loss
            loss = self.compute_loss(y, y_pred)
            self.losses.append(loss)
            
            if (i + 1) % 200 == 0:
                print(f"Iteration {i+1}/{self.n_iterations}, Loss: {loss:.4f}")
    
    def predict_proba(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.sigmoid(z)
    
    def predict(self, X, threshold=0.5):
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)

# Generate sample data: students who pass (1) or fail (0)
# Features: study_hours, attendance_percentage
np.random.seed(42)
n = 100
study_hours = np.random.uniform(1, 10, n)
attendance = np.random.uniform(40, 100, n)

# Rule: pass if 3*study_hours + 0.5*attendance + noise > 45
noise = np.random.normal(0, 3, n)
score = 3 * study_hours + 0.5 * attendance + noise
y = (score > 45).astype(int)

# Normalize features
X = np.column_stack([
    (study_hours - study_hours.mean()) / study_hours.std(),
    (attendance - attendance.mean()) / attendance.std()
])

# Train model
model = LogisticRegressionScratch(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)

# Predictions
y_pred = model.predict(X)
accuracy = np.mean(y_pred == y)
print(f"\nTraining Accuracy: {accuracy:.2%}")
print(f"Learned weights: {model.weights}")
print(f"Learned bias: {model.bias:.4f}")
This from-scratch implementation shows every step of logistic regression: sigmoid activation, log loss computation, gradient calculation, and parameter updates. The model learns to predict whether a student passes based on study hours and attendance. Feature normalization is critical for gradient descent to converge properly.
Iteration 200/1000, Loss: 0.4231
Iteration 400/1000, Loss: 0.3856
Iteration 600/1000, Loss: 0.3712
Iteration 800/1000, Loss: 0.3641
Iteration 1000/1000, Loss: 0.3601

Training Accuracy: 85.00%
Learned weights: [0.9847 0.7623]
Learned bias: 0.1542
Logistic Regression with Scikit-Learn (Complete Pipeline)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

# Generate dataset: predict if student passes
np.random.seed(42)
n = 200
study_hours = np.random.uniform(1, 10, n)
attendance = np.random.uniform(40, 100, n)
assignment_score = np.random.uniform(0, 100, n)

noise = np.random.normal(0, 5, n)
score = 2.5 * study_hours + 0.3 * attendance + 0.2 * assignment_score + noise
y = (score > 45).astype(int)

X = np.column_stack([study_hours, attendance, assignment_score])
print(f"Dataset: {X.shape[0]} students, {X.shape[1]} features")
print(f"Class distribution: Pass={y.sum()}, Fail={len(y)-y.sum()}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)

# Evaluation
print(f"\nAccuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")

print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))

# Model coefficients
print(f"\nCoefficients: {model.coef_[0]}")
print(f"Intercept: {model.intercept_[0]:.4f}")

# Predict probabilities for a new student
new_student = scaler.transform([[7.5, 85, 72]])
proba = model.predict_proba(new_student)[0]
print(f"\nNew student (7.5 hrs, 85% attendance, 72 assignment):")
print(f"  P(Fail) = {proba[0]:.4f}, P(Pass) = {proba[1]:.4f}")
print(f"  Prediction: {'Pass' if model.predict(new_student)[0] == 1 else 'Fail'}")
This shows the complete scikit-learn pipeline: data splitting with stratification, feature scaling, model training, prediction, probability estimation with predict_proba, and comprehensive evaluation. The classification report shows precision, recall, and F1-score for each class.
Dataset: 200 students, 3 features
Class distribution: Pass=112, Fail=88

Accuracy:  0.8500
Precision: 0.8462
Recall:    0.9167
F1 Score:  0.8800

Confusion Matrix:
[[15  3]
 [ 3 19]]

Classification Report:
              precision    recall  f1-score   support

        Fail       0.83      0.83      0.83        18
        Pass       0.86      0.86      0.86        22

    accuracy                           0.85        40
   macro avg       0.85      0.85      0.85        40
weighted avg       0.85      0.85      0.85        40

Coefficients: [1.0234 0.6891 0.3452]
Intercept: 0.2187

New student (7.5 hrs, 85% attendance, 72 assignment):
  P(Fail) = 0.0823, P(Pass) = 0.9177
  Prediction: Pass
Visualizing the Decision Boundary
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Generate 2D data for visualization
np.random.seed(42)

# Class 0: students who fail
n0 = 50
study_0 = np.random.normal(3, 1.2, n0)
attendance_0 = np.random.normal(55, 12, n0)

# Class 1: students who pass
n1 = 50
study_1 = np.random.normal(7, 1.2, n1)
attendance_1 = np.random.normal(80, 12, n1)

X = np.vstack([np.column_stack([study_0, attendance_0]),
               np.column_stack([study_1, attendance_1])])
y = np.array([0]*n0 + [1]*n1)

# Scale and train
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(random_state=42)
model.fit(X_scaled, y)

# Create mesh grid for decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 5, X[:, 1].max() + 5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

grid = np.column_stack([xx.ravel(), yy.ravel()])
grid_scaled = scaler.transform(grid)
Z = model.predict_proba(grid_scaled)[:, 1].reshape(xx.shape)

# Plot
plt.figure(figsize=(10, 7))
plt.contourf(xx, yy, Z, levels=50, cmap='RdYlGn', alpha=0.6)
plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label='Fail', edgecolors='k', s=60)
plt.scatter(X[y==1, 0], X[y==1, 1], c='green', label='Pass', edgecolors='k', s=60)
plt.xlabel('Study Hours')
plt.ylabel('Attendance %')
plt.title('Logistic Regression Decision Boundary')
plt.legend()
plt.colorbar(label='P(Pass)')
plt.show()

print(f"Training accuracy: {model.score(X_scaled, y):.2%}")
The decision boundary is a straight line (logistic regression is a linear classifier). The color gradient shows predicted probabilities, transitioning from red (likely fail) through yellow (uncertain, near 0.5) to green (likely pass). The black line is where P(Pass) = 0.5.
Training accuracy: 94.00%
Log Loss Visualization and Intuition
import numpy as np
import matplotlib.pyplot as plt

# Log loss for a single sample
def log_loss_single(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

p = np.linspace(0.001, 0.999, 500)

plt.figure(figsize=(10, 5))

# When actual label = 1
loss_when_1 = log_loss_single(1, p)
plt.subplot(1, 2, 1)
plt.plot(p, loss_when_1, 'b-', linewidth=2)
plt.xlabel('Predicted Probability (p)')
plt.ylabel('Loss')
plt.title('Loss when Actual = 1')
plt.grid(True, alpha=0.3)
plt.annotate('Low p = High loss\n(confident wrong prediction)',
             xy=(0.05, 3), fontsize=9)

# When actual label = 0
loss_when_0 = log_loss_single(0, p)
plt.subplot(1, 2, 2)
plt.plot(p, loss_when_0, 'r-', linewidth=2)
plt.xlabel('Predicted Probability (p)')
plt.ylabel('Loss')
plt.title('Loss when Actual = 0')
plt.grid(True, alpha=0.3)
plt.annotate('High p = High loss\n(confident wrong prediction)',
             xy=(0.6, 3), fontsize=9)

plt.tight_layout()
plt.show()

# Concrete examples
print("Log loss examples:")
print(f"  Actual=1, Predicted=0.95 -> Loss: {log_loss_single(1, 0.95):.4f} (good)")
print(f"  Actual=1, Predicted=0.50 -> Loss: {log_loss_single(1, 0.50):.4f} (uncertain)")
print(f"  Actual=1, Predicted=0.05 -> Loss: {log_loss_single(1, 0.05):.4f} (terrible)")
print(f"  Actual=0, Predicted=0.05 -> Loss: {log_loss_single(0, 0.05):.4f} (good)")
print(f"  Actual=0, Predicted=0.95 -> Loss: {log_loss_single(0, 0.95):.4f} (terrible)")
Log loss penalizes confident wrong predictions much more than uncertain ones. When the actual label is 1, predicting p=0.05 gives a loss of 3.0 (very high), while predicting p=0.95 gives only 0.05. This asymmetric penalty is what makes log loss effective for training classifiers.
Log loss examples:
  Actual=1, Predicted=0.95 -> Loss: 0.0513 (good)
  Actual=1, Predicted=0.50 -> Loss: 0.6931 (uncertain)
  Actual=1, Predicted=0.05 -> Loss: 2.9957 (terrible)
  Actual=0, Predicted=0.05 -> Loss: 0.0513 (good)
  Actual=0, Predicted=0.95 -> Loss: 2.9957 (terrible)
Multi-Class Classification with Softmax
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Load iris dataset (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
print(f"Classes: {iris.target_names}")
print(f"Features: {iris.feature_names}")
print(f"Shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# One-vs-Rest approach
model_ovr = LogisticRegression(multi_class='ovr', random_state=42, max_iter=1000)
model_ovr.fit(X_train_scaled, y_train)
print(f"\nOne-vs-Rest Accuracy: {model_ovr.score(X_test_scaled, y_test):.4f}")

# Softmax (multinomial) approach
model_softmax = LogisticRegression(multi_class='multinomial', random_state=42, max_iter=1000)
model_softmax.fit(X_train_scaled, y_train)
print(f"Softmax Accuracy: {model_softmax.score(X_test_scaled, y_test):.4f}")

# Detailed evaluation for softmax model
y_pred = model_softmax.predict(X_test_scaled)
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Predict probabilities for a sample
sample = X_test_scaled[0:1]
proba = model_softmax.predict_proba(sample)[0]
print(f"\nSample prediction probabilities:")
for name, prob in zip(iris.target_names, proba):
    print(f"  {name}: {prob:.4f}")
print(f"Predicted class: {iris.target_names[model_softmax.predict(sample)[0]]}")
For multi-class problems, scikit-learn supports both One-vs-Rest (OvR) and Softmax (multinomial) strategies. The softmax model outputs probabilities for all classes that sum to 1. The Iris dataset has 3 classes (setosa, versicolor, virginica), and logistic regression achieves high accuracy because the classes are well-separated.
Classes: ['setosa' 'versicolor' 'virginica']
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Shape: (150, 4)
Class distribution: [50 50 50]

One-vs-Rest Accuracy: 0.9778
Softmax Accuracy: 0.9778

Confusion Matrix:
[[15  0  0]
 [ 0 14  1]
 [ 0  0 15]]

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       1.00      0.93      0.97        15
   virginica       0.94      1.00      0.97        15

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Sample prediction probabilities:
  setosa: 0.0002
  versicolor: 0.0341
  virginica: 0.9657
Predicted class: virginica
Confusion Matrix Heatmap and Metrics Calculation
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)
import matplotlib.pyplot as plt
import seaborn as sns

# Generate imbalanced binary data
np.random.seed(42)
n = 300
# 80% negative class (fail), 20% positive class (pass)
X_neg = np.random.randn(240, 2) + np.array([-1, -1])
X_pos = np.random.randn(60, 2) + np.array([2, 2])
X = np.vstack([X_neg, X_pos])
y = np.array([0]*240 + [1]*60)

# Shuffle
idx = np.random.permutation(len(y))
X, y = X[idx], y[idx]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(random_state=42)
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"  TN={cm[0][0]}, FP={cm[0][1]}")
print(f"  FN={cm[1][0]}, TP={cm[1][1]}")

# Calculate metrics manually
TP, TN, FP, FN = cm[1][1], cm[0][0], cm[0][1], cm[1][0]
print(f"\nManual calculation:")
print(f"  Accuracy  = (TP+TN)/(TP+TN+FP+FN) = ({TP}+{TN})/({TP}+{TN}+{FP}+{FN}) = {(TP+TN)/(TP+TN+FP+FN):.4f}")
print(f"  Precision = TP/(TP+FP) = {TP}/({TP}+{FP}) = {TP/(TP+FP):.4f}")
print(f"  Recall    = TP/(TP+FN) = {TP}/({TP}+{FN}) = {TP/(TP+FN):.4f}")
precision = TP/(TP+FP)
recall = TP/(TP+FN)
f1 = 2 * precision * recall / (precision + recall)
print(f"  F1 Score  = 2*P*R/(P+R) = {f1:.4f}")

# Verify with sklearn
print(f"\nsklearn verification:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred):.4f}")

# Plot confusion matrix heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Fail', 'Pass'],
            yticklabels=['Fail', 'Pass'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
This example shows how to compute metrics both manually and with scikit-learn, starting from the confusion matrix. On imbalanced data (80% negative, 20% positive), accuracy alone can be misleading. A model predicting all negatives would get 80% accuracy but 0% recall. The confusion matrix reveals the full picture.
Confusion Matrix:
[[57  3]
 [ 2 13]]
  TN=57, FP=3
  FN=2, TP=13

Manual calculation:
  Accuracy  = (TP+TN)/(TP+TN+FP+FN) = (13+57)/(13+57+3+2) = 0.9333
  Precision = TP/(TP+FP) = 13/(13+3) = 0.8125
  Recall    = TP/(TP+FN) = 13/(13+2) = 0.8667
  F1 Score  = 2*P*R/(P+R) = 0.8387

sklearn verification:
  Accuracy:  0.9333
  Precision: 0.8125
  Recall:    0.8667
  F1 Score:  0.8387
Complete Example: Predicting Student Pass/Fail
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Create realistic student dataset
np.random.seed(42)
n = 500

data = {
    'study_hours_per_week': np.random.uniform(1, 15, n),
    'attendance_pct': np.random.uniform(30, 100, n),
    'assignment_avg': np.random.uniform(20, 100, n),
    'previous_gpa': np.random.uniform(1.0, 4.0, n),
    'sleep_hours': np.random.uniform(4, 10, n)
}
df = pd.DataFrame(data)

# Generate target: pass (1) or fail (0)
score = (0.2 * df['study_hours_per_week'] +
         0.15 * df['attendance_pct'] +
         0.1 * df['assignment_avg'] +
         2.0 * df['previous_gpa'] +
         0.3 * df['sleep_hours'] +
         np.random.normal(0, 2, n))
df['passed'] = (score > 17).astype(int)

print("Dataset Preview:")
print(df.head())
print(f"\nClass balance: {df['passed'].value_counts().to_dict()}")

# Prepare features and target
feature_cols = ['study_hours_per_week', 'attendance_pct',
                'assignment_avg', 'previous_gpa', 'sleep_hours']
X = df[feature_cols].values
y = df['passed'].values

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_s, y_train)

# Evaluate
y_pred = model.predict(X_test_s)
print(f"\nTest Accuracy: {model.score(X_test_s, y_test):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))

# Feature importance (coefficients)
print("Feature Importance (coefficients):")
for feature, coef in sorted(zip(feature_cols, model.coef_[0]),
                             key=lambda x: abs(x[1]), reverse=True):
    print(f"  {feature:25s}: {coef:+.4f}")

# Predict for specific students
students = {
    'Aarav': [10, 90, 85, 3.5, 7],
    'Priya': [3, 45, 40, 1.5, 5],
    'Kavita': [6, 70, 60, 2.5, 6],
}

print(f"\nPredictions for new students:")
for name, features in students.items():
    feat_scaled = scaler.transform([features])
    prob = model.predict_proba(feat_scaled)[0]
    pred = 'Pass' if model.predict(feat_scaled)[0] == 1 else 'Fail'
    print(f"  {name}: P(Fail)={prob[0]:.3f}, P(Pass)={prob[1]:.3f} -> {pred}")
This end-to-end example simulates a realistic student dataset with 5 features. The model learns which factors matter most for passing (previous GPA has the highest coefficient). The feature importance analysis shows that previous academic performance is the strongest predictor, followed by attendance and study hours.
Dataset Preview:
   study_hours_per_week  attendance_pct  assignment_avg  previous_gpa  sleep_hours  passed
0              6.243721       78.332190       53.363432      2.617238     6.847310       1
1             13.784376       65.436240       84.098649      3.264757     8.384383       1
2             10.552291       64.589411       73.176960      1.419753     5.291224       0
3              7.191945       30.271794       53.814740      2.879997     7.233150       1
4              2.027348       75.443273       44.284047      3.456281     4.587413       1

Class balance: {1: 280, 0: 220}

Test Accuracy: 0.8500

Classification Report:
              precision    recall  f1-score   support

        Fail       0.83      0.82      0.82        44
        Pass       0.87      0.88      0.87        56

    accuracy                           0.85       100
   macro avg       0.85      0.85      0.85       100
weighted avg       0.85      0.85      0.85       100

Feature Importance (coefficients):
  previous_gpa             : +1.2341
  attendance_pct           : +0.7823
  study_hours_per_week     : +0.5612
  sleep_hours              : +0.3245
  assignment_avg           : +0.2987

Predictions for new students:
  Aarav: P(Fail)=0.032, P(Pass)=0.968 -> Pass
  Priya: P(Fail)=0.891, P(Pass)=0.109 -> Fail
  Kavita: P(Fail)=0.387, P(Pass)=0.613 -> Pass

Common Mistakes

Not Scaling Features Before Logistic Regression

from sklearn.linear_model import LogisticRegression

# Features with very different scales
# study_hours: 1-10, attendance: 40-100, salary: 10000-100000
X = [[5, 80, 50000], [3, 60, 30000], [8, 90, 70000]]
y = [1, 0, 1]

model = LogisticRegression()
model.fit(X, y)  # salary dominates because of scale
No error raised, but the model gives excessive weight to the salary feature simply because its values are numerically larger. The coefficients become misleading.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = [[5, 80, 50000], [3, 60, 30000], [8, 90, 70000]]
y = [1, 0, 1]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression()
model.fit(X_scaled, y)  # All features on same scale now
Logistic regression is trained with gradient-based solvers, which are sensitive to feature scales: features with numerically large values dominate the gradient updates. Always use StandardScaler or MinMaxScaler before training. Fit the scaler on training data only, then transform both training and test data.
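To see what the scaler actually does, here is a minimal sketch (using the same hypothetical feature values as above) verifying that StandardScaler gives every column mean 0 and standard deviation 1, which makes the learned coefficients directly comparable:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy features on very different scales as above
X = np.array([[5, 80, 50000],
              [3, 60, 30000],
              [8, 90, 70000]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)

# After scaling, every column has mean ~0 and std ~1,
# so no feature dominates just because of its units
print(np.allclose(X_scaled.mean(axis=0), 0))  # True
print(np.allclose(X_scaled.std(axis=0), 1))   # True
```

Because all columns now live on the same scale, a larger coefficient genuinely means a stronger influence on the prediction.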

Using Accuracy on Imbalanced Datasets

from sklearn.metrics import accuracy_score

# 95% negative, 5% positive (e.g., fraud detection)
y_test = [0]*950 + [1]*50
y_pred = [0]*1000  # Predicts all as negative

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")  # 95% accuracy!
# But it misses ALL fraud cases!
The model achieves 95% accuracy by predicting all negatives, but it catches zero fraud cases. Recall for the positive class is 0%.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [0]*950 + [1]*50
y_pred = [0]*1000

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2%}")
print(f"Precision: {precision_score(y_test, y_pred, zero_division=0):.2%}")
print(f"Recall:    {recall_score(y_test, y_pred):.2%}")  # 0%!
print(f"F1 Score:  {f1_score(y_test, y_pred, zero_division=0):.2%}")
# Accuracy is misleading here. Use F1, recall, or ROC-AUC instead.
On imbalanced datasets, accuracy is misleading because a model that always predicts the majority class gets high accuracy. Use precision, recall, F1-score, or ROC-AUC to evaluate properly. For fraud/disease detection, recall (catching all positives) is usually more important than accuracy.
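Beyond choosing better metrics, you can also tell the model itself to care about the minority class. A sketch on hypothetical synthetic data (950 negatives, 50 positives, generated here just for illustration) showing the effect of class_weight="balanced", which reweights the loss so minority-class errors count more:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# Hypothetical imbalanced data: positives shifted so the
# classes are partially separable
X = np.vstack([rng.normal(0, 1, (950, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Reweighting the loss shifts the decision boundary toward
# predicting the minority class, typically raising recall
print("Recall (plain):   ", recall_score(y, plain.predict(X)))
print("Recall (balanced):", recall_score(y, balanced.predict(X)))
```

The trade-off: recall on the positive class goes up, usually at the cost of some precision, which is exactly what you want in fraud or disease detection.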

Applying Scaler Fitted on Test Data (Data Leakage)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

# WRONG: fitting scaler on entire dataset or on test data
scaler = StandardScaler()
scaler.fit(X)  # Leaks test data statistics into training
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
No runtime error, but the model has seen test data statistics during training (data leakage), giving overly optimistic evaluation results that will not hold in production.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

# CORRECT: fit only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on train
X_test_scaled = scaler.transform(X_test)         # only transform on test
The scaler must be fitted only on training data. Then use the same scaler to transform test data. Fitting on the full dataset or on test data causes data leakage: the model indirectly learns information about the test set, making evaluation unreliable.
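An easy way to make this mistake impossible is to wrap the scaler and model in a Pipeline. Here is a sketch on hypothetical synthetic data: cross_val_score re-fits the scaler inside each fold, so test-fold statistics can never leak into training:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Hypothetical data with wildly different feature scales
X = rng.normal(size=(200, 3)) * np.array([1, 10, 1000])
y = (X[:, 0] + X[:, 1] / 10 + X[:, 2] / 1000 > 0).astype(int)

# The pipeline fits the scaler on each training fold only,
# then transforms the corresponding validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3))
```

Pipelines also pay off in production: calling pipe.predict(new_data) applies the stored scaling and the model in one step, so the two can never get out of sync.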

Forgetting to Set random_state for Reproducibility

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Different split every time -> different results every time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # Different each run
No error, but results change on every run, making debugging impossible and results non-reproducible.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Fixed random state = same split every time
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # Same result every run
Always set random_state in train_test_split and in the model for reproducible results. Also use stratify=y to maintain class proportions in both splits. Set max_iter high enough to ensure convergence.
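To see what stratify=y buys you, here is a minimal sketch with hypothetical 80/20-imbalanced labels, checking that both splits keep the original class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80% class 0, 20% class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# stratify=y preserves the 80/20 ratio in both splits
print("train:", np.bincount(y_tr))  # [64 16]
print("test: ", np.bincount(y_te))  # [16  4]
```

Without stratification, an unlucky split on a small or imbalanced dataset can leave the test set with almost no minority-class examples, making recall estimates meaningless.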

Confusing predict() and predict_proba()

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# Using predict() when you need probabilities for threshold tuning
predictions = model.predict(X_test)
# predictions = [0, 1, 1, 0, 1, ...]  (hard labels, threshold locked at 0.5)
No error, but predict() gives hard labels (0 or 1) with a fixed threshold of 0.5. You cannot adjust the threshold or compute ROC curves.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# Use predict_proba() to get probabilities
probabilities = model.predict_proba(X_test)  # Shape: (n_samples, n_classes)
p_positive = probabilities[:, 1]  # Probability of class 1

# Now you can use any threshold
threshold = 0.3  # Lower threshold = higher recall
y_pred = (p_positive >= threshold).astype(int)
Use predict_proba() to get probability estimates. The output has shape (n_samples, n_classes). For binary classification, column 0 is P(class 0) and column 1 is P(class 1). You can then apply any custom threshold instead of the default 0.5.
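The payoff of having probabilities is threshold tuning. A sketch on hypothetical synthetic data showing the precision/recall trade-off as the threshold moves (lowering it can only add positive predictions, so recall never decreases):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Hypothetical two-class data for illustration
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression().fit(X, y)
p_positive = model.predict_proba(X)[:, 1]

for t in (0.3, 0.5, 0.7):
    pred = (p_positive >= t).astype(int)
    print(f"threshold={t}: "
          f"precision={precision_score(y, pred):.2f}, "
          f"recall={recall_score(y, pred):.2f}")
# Lower threshold -> more positives predicted -> higher recall,
# usually at the cost of precision
```

In a fraud or disease setting you would pick the threshold from a validation set, choosing the point on this trade-off curve that matches the real cost of a missed positive versus a false alarm.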

Summary

  • Classification predicts discrete categories (labels), unlike regression which predicts continuous numbers. Binary classification has 2 classes; multi-class has 3 or more.
  • Logistic regression passes a linear combination through the sigmoid function: p = 1 / (1 + e^(-z)), producing a probability between 0 and 1.
  • The sigmoid function maps any real number to (0, 1). At z=0, sigmoid outputs 0.5 (the default decision boundary). Its derivative is sigmoid(z) * (1 - sigmoid(z)).
  • Log loss (binary cross-entropy) is the loss function: L = -[y*log(p) + (1-y)*log(1-p)]. It penalizes confident wrong predictions exponentially.
  • The decision boundary is where P(class 1) = 0.5, which corresponds to the linear equation w^T*X + b = 0. Logistic regression creates a linear decision boundary.
  • Gradient descent updates weights using: w = w - lr * (1/N) * X^T * (predictions - actuals). The gradient formula is identical in form to linear regression.
  • For multi-class classification, use One-vs-Rest (K separate binary classifiers) or Softmax regression (single model with K outputs summing to 1).
  • The confusion matrix shows TP, TN, FP, FN. From it: Accuracy = (TP+TN)/total, Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2*P*R/(P+R).
  • Use precision when false positives are costly (spam filter). Use recall when false negatives are costly (disease detection). F1 balances both.
  • Always scale features before logistic regression. Use StandardScaler fitted only on training data. Fit on train, transform both train and test.
  • predict_proba() returns probabilities (useful for threshold tuning and ROC curves). predict() returns hard labels using threshold 0.5 by default.
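The sigmoid and log-loss formulas from the summary are short enough to verify directly. A sketch implementing both from scratch, checking the derivative identity numerically and the loss against scikit-learn's log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z)); maps any real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# At z=0, sigmoid outputs 0.5 (the default decision boundary)
print(sigmoid(0.0))  # 0.5

# Derivative check: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
z = 1.3
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic) < 1e-6)  # True

# Binary cross-entropy: L = -[y*log(p) + (1-y)*log(1-p)], averaged
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(np.isclose(manual, log_loss(y, p)))  # True
```

Note how a confident wrong prediction (say p=0.99 when y=0) contributes -log(0.01) ≈ 4.6 to the loss, while a confident correct one contributes almost nothing; this is the "penalizes confident wrong predictions" behaviour from the summary.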

Ready to Practice?

Test your understanding with 50+ practice questions on this topic.

Go to Practice Questions
