What Is It?
What is Linear Regression?
Linear regression is one of the simplest and most fundamental Machine Learning algorithms. It predicts a continuous numerical value (like price, temperature, or salary) from one or more input features by finding the best straight line (or hyperplane) that fits the data.
The equation for simple linear regression is: y = mx + b, where y is the predicted value, x is the input feature, m is the slope (how much y changes when x increases by 1), and b is the y-intercept (the value of y when x is 0). The goal of linear regression is to find the values of m and b that make the line fit the data as closely as possible.
For multiple features, the equation becomes: y = w1*x1 + w2*x2 + ... + wn*xn + b, where each feature has its own weight (coefficient) that the model learns.
When to Use Linear Regression
Use linear regression when: (1) The target variable is continuous (price, temperature, score). (2) The relationship between features and target is approximately linear. (3) You need an interpretable model (you can explain which features matter and by how much). It is the first algorithm to try for any regression problem -- if it works well, there is no need for a more complex model.
Why Does It Matter?
Why is Linear Regression Important?
1. The Foundation of All ML
Linear regression introduces the core concepts that apply to ALL ML algorithms: cost functions, optimization (gradient descent), model parameters, training and evaluation. Understanding linear regression deeply makes every other algorithm easier to learn.
2. Surprisingly Powerful
Despite its simplicity, linear regression works well for many real-world problems: predicting house prices, stock trends, sales forecasting, salary estimation, and demand prediction. In Kaggle competitions, linear models often perform within a few percent of much more complex models.
3. Highly Interpretable
Unlike black-box models (neural networks, random forests), linear regression tells you exactly how each feature affects the prediction. If the coefficient for 'area' is 5000, it means each additional square foot adds 5000 to the predicted price. This interpretability is essential in regulated industries (finance, healthcare).
4. The Gateway to Advanced Models
Logistic regression (classification), polynomial regression, ridge/lasso regression, and even neural networks are extensions of linear regression. Master it, and you have a strong foundation for everything that follows.
Detailed Explanation
Simple Linear Regression
Simple linear regression finds the best straight line y = mx + b through a set of data points. "Best" means the line that minimizes the total error between predicted and actual values.
Imagine plotting student study hours (x-axis) against exam scores (y-axis). The data points roughly form an upward trend. Linear regression finds the line that best captures this trend, so you can predict the score for any number of study hours.
Cost Function: Mean Squared Error (MSE)
How do we measure how "good" a line is? We use the Mean Squared Error (MSE):
MSE = (1/n) * sum of (y_actual - y_predicted)^2
For each data point, we compute the difference between the actual value and the predicted value (the error or residual), square it (to make all errors positive and penalize large errors more), and then average across all data points. The goal of training is to find m and b that minimize MSE.
Why square the error? Two reasons: (1) It makes all errors positive (a prediction of 5 too high is as bad as 5 too low). (2) It penalizes large errors more heavily -- an error of 10 contributes 100 to MSE, while an error of 2 contributes only 4.
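To see the squaring effect concretely, here is a minimal sketch (with made-up numbers) comparing how a small and a large error contribute to MSE:

import numpy as np

# Made-up actual vs. predicted values: one small error (2) and one large error (10)
y_actual = np.array([50.0, 50.0])
y_predicted = np.array([48.0, 40.0])

squared_errors = (y_actual - y_predicted) ** 2   # [4, 100]
print("Squared errors:", squared_errors)         # the error of 10 dominates
print("MSE:", squared_errors.mean())             # (4 + 100) / 2 = 52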
Gradient Descent: How the Model Learns
Gradient descent is the optimization algorithm that finds the best m and b. The process:
- Start with random values for m and b.
- Compute the MSE (how bad the current line is).
- Compute the gradient (partial derivatives of MSE with respect to m and b).
- Update: m = m - learning_rate * dMSE/dm, b = b - learning_rate * dMSE/db.
- Repeat until MSE converges (stops decreasing significantly).
The gradients are: dMSE/dm = (-2/n) * sum(x_i * (y_i - y_pred_i)) and dMSE/db = (-2/n) * sum(y_i - y_pred_i).
Think of it as standing in a foggy valley and always walking downhill. The gradient tells you the direction of steepest descent, and the learning rate controls how big your steps are.
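To make one update concrete, here is a minimal sketch of a single gradient descent step on made-up data (the learning rate of 0.1 is an arbitrary choice for illustration):

import numpy as np

# Tiny made-up dataset where y is roughly 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

m, b = 0.0, 0.0   # start from zero (random initialization also works)
lr = 0.1          # learning rate

y_pred = m * x + b                                   # all zeros on the first pass
dm = (-2 / len(x)) * np.sum(x * (y - y_pred))        # gradient of MSE w.r.t. m
db = (-2 / len(x)) * np.sum(y - y_pred)              # gradient of MSE w.r.t. b
m, b = m - lr * dm, b - lr * db
print(f"After one step: m = {m:.3f}, b = {b:.3f}")   # m moves toward the true slope 2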
The Normal Equation (Closed-Form Solution)
For linear regression specifically, there is a direct formula to compute the optimal weights without iterating: w = (X^T X)^(-1) X^T y. This is called the normal equation. It gives the exact optimal solution in one step. scikit-learn uses a variant of this for small to medium datasets. Gradient descent is preferred for very large datasets because the matrix inversion becomes computationally expensive.
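A minimal NumPy sketch of the normal equation on made-up data (in practice np.linalg.lstsq or a library implementation is preferred over an explicit matrix inverse for numerical stability):

import numpy as np

# Made-up data following y = 3x + 7 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3 * x + 7 + rng.normal(0, 1, 50)

# Append a column of ones so the bias is learned as one of the weights
X = np.column_stack([x, np.ones_like(x)])

# Normal equation: w = (X^T X)^(-1) X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(f"slope ~ {w[0]:.2f}, intercept ~ {w[1]:.2f}")  # should land near 3 and 7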
Multiple Linear Regression
When you have multiple features (area, bedrooms, age), the model becomes: y = w1*area + w2*bedrooms + w3*age + b. Each feature gets its own weight. The model learns how much each feature contributes to the prediction. The training process is the same (minimize MSE), but with more parameters to optimize.
Evaluation Metrics
MAE (Mean Absolute Error)
MAE = (1/n) * sum(|y_actual - y_predicted|). Average of absolute errors. Easy to interpret: "on average, predictions are off by X units." Not as sensitive to large errors as MSE.
MSE (Mean Squared Error)
MSE = (1/n) * sum((y_actual - y_predicted)^2). Penalizes large errors heavily. Most commonly used for training (as the loss function).
RMSE (Root Mean Squared Error)
RMSE = sqrt(MSE). In the same units as the target variable. "On average, predictions are off by X units" (but gives more weight to large errors).
R-squared (Coefficient of Determination)
R^2 = 1 - (SS_res / SS_tot), where SS_res = sum of (y - y_pred)^2 and SS_tot = sum of (y - y_mean)^2. It typically ranges from 0 to 1, and can be negative for a model that fits worse than simply predicting the mean. R^2 = 0.85 means "the model explains 85% of the variance in the data." Higher is better: R^2 = 1 means perfect prediction, and R^2 = 0 means the model is no better than predicting the mean.
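All four metrics computed by hand on a small made-up set of predictions (a sketch only; sklearn.metrics returns the same values):

import numpy as np

y_actual = np.array([10.0, 12.0, 15.0, 20.0])   # made-up values
y_pred = np.array([11.0, 12.0, 13.0, 24.0])

mae = np.mean(np.abs(y_actual - y_pred))                # average absolute error
mse = np.mean((y_actual - y_pred) ** 2)                 # average squared error
rmse = np.sqrt(mse)                                     # back in target units
ss_res = np.sum((y_actual - y_pred) ** 2)               # residual sum of squares
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}, R^2={r2:.3f}")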
Assumptions of Linear Regression
- Linearity: The relationship between features and target should be approximately linear.
- Independence: Data points should be independent of each other.
- Homoscedasticity: The variance of errors should be constant across all values of features.
- Normality: Residuals (errors) should be approximately normally distributed.
- No multicollinearity: Features should not be highly correlated with each other.
In practice, mild violations of these assumptions are common and often do not significantly impact performance. But severe violations (like a clearly non-linear relationship) mean linear regression is the wrong model.
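A rough way to sanity-check two of these assumptions on a fitted model is sketched below (synthetic data; it assumes SciPy is available, and the split at x = 5 is an arbitrary choice):

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic, roughly linear data (made up for illustration)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X.squeeze() + 7 + rng.normal(0, 2, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Normality of residuals: Shapiro-Wilk test (large p-value -> consistent with normality)
print(f"Shapiro-Wilk p-value: {stats.shapiro(residuals).pvalue:.3f}")

# Homoscedasticity (rough check): residual spread should be similar for low and high x
low = X.squeeze() < 5
print(f"Residual std (x < 5):  {residuals[low].std():.2f}")
print(f"Residual std (x >= 5): {residuals[~low].std():.2f}")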
Polynomial Regression
If the data has a curved (non-linear) relationship, you can use polynomial regression by adding polynomial features (x^2, x^3, x1*x2). This is still "linear" regression because the model is linear in its parameters (weights), even though it captures non-linear relationships. sklearn's PolynomialFeatures creates these features automatically.
Code Examples
import numpy as np
class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for i in range(self.n_iter):
            y_pred = X @ self.weights + self.bias
            error = y_pred - y
            # Compute gradients
            dw = (1 / n_samples) * (X.T @ error)
            db = (1 / n_samples) * np.sum(error)
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            # Track loss
            mse = np.mean(error ** 2)
            self.losses.append(mse)

    def predict(self, X):
        return X @ self.weights + self.bias
# Generate data: y = 3x + 7 + noise
np.random.seed(42)
X = np.random.uniform(0, 10, (100, 1))
y = 3 * X.squeeze() + 7 + np.random.normal(0, 2, 100)
# Normalize X for better convergence
X_norm = (X - X.mean()) / X.std()
# Train
model = LinearRegressionScratch(learning_rate=0.1, n_iterations=500)
model.fit(X_norm, y)
print(f"Learned weight: {model.weights[0]:.4f}")
print(f"Learned bias: {model.bias:.4f}")
print(f"Final MSE: {model.losses[-1]:.4f}")
print(f"MSE after 10 iterations: {model.losses[9]:.4f}")
print(f"MSE after 100 iterations: {model.losses[99]:.4f}")
print(f"\nTrue equation: y = 3x + 7")
# Predict
test_x = np.array([[5.0]])
test_x_norm = (test_x - X.mean()) / X.std()
print(f"\nPrediction for x=5: {model.predict(test_x_norm)[0]:.2f}")
print(f"Actual (3*5 + 7): 22")fit() method: (1) initializes weights to 0, (2) computes predictions (X @ w + b), (3) computes gradients of MSE with respect to weights and bias, (4) updates parameters. We normalize X first because gradient descent converges much faster with normalized features. After 500 iterations, the model learns approximately w=3 and b=7, matching the true equation y = 3x + 7.from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
# Generate data: y = 3x + 7 + noise
np.random.seed(42)
X = np.random.uniform(0, 10, (100, 1))
y = 3 * X.squeeze() + 7 + np.random.normal(0, 2, 100)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Model parameters
print(f"Slope (coefficient): {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
print(f"Learned equation: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")
print(f"True equation: y = 3.00x + 7.00")
# Predictions and evaluation
y_pred = model.predict(X_test)
print(f"\n=== Evaluation Metrics ===")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"R^2: {r2_score(y_test, y_pred):.4f}")
# Sample predictions
print(f"\n=== Sample Predictions ===")
for i in range(5):
    print(f" x={X_test[i][0]:.2f}: predicted={y_pred[i]:.2f}, actual={y_test[i]:.2f}")

model.coef_ gives the slope (the weight for each feature) and model.intercept_ gives the y-intercept. The model learned approximately y = 3.04x + 6.74, very close to the true y = 3x + 7. R^2 = 0.92 means the model explains 92% of the variance. RMSE = 1.89 means predictions are off by about 1.89 units on average.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Generate synthetic house price data
np.random.seed(42)
n = 200
data = {
    'Area_sqft': np.random.uniform(500, 3000, n).round(0),
    'Bedrooms': np.random.randint(1, 6, n),
    'Age_years': np.random.randint(0, 30, n),
    'Distance_km': np.random.uniform(1, 30, n).round(1)
}
df = pd.DataFrame(data)
# True relationship: price depends on all features
df['Price_Lakhs'] = (
    20 + 0.03 * df['Area_sqft'] + 8 * df['Bedrooms']
    - 0.5 * df['Age_years'] - 0.8 * df['Distance_km']
    + np.random.normal(0, 5, n)
).round(2)
print("=== Dataset ===")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nCorrelations with Price:")
print(df.corr()['Price_Lakhs'].drop('Price_Lakhs').round(3))
# Prepare data
feature_cols = ['Area_sqft', 'Bedrooms', 'Age_years', 'Distance_km']
X = df[feature_cols]
y = df['Price_Lakhs']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Feature importance
print(f"\n=== Feature Coefficients (on scaled data) ===")
for feat, coef in zip(feature_cols, model.coef_):
    direction = 'increases' if coef > 0 else 'decreases'
    print(f" {feat}: {coef:.4f} (1 std increase {direction} price by {abs(coef):.2f} lakhs)")
# Evaluate
y_pred = model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"\n=== Model Performance ===")
print(f"RMSE: {rmse:.2f} lakhs")
print(f"R-squared: {r2:.4f}")
print(f"Interpretation: Model explains {r2*100:.1f}% of price variance")
# Predict a new house
new_house = pd.DataFrame({
    'Area_sqft': [1500], 'Bedrooms': [3], 'Age_years': [5], 'Distance_km': [10]
})
new_house_scaled = scaler.transform(new_house)
pred_price = model.predict(new_house_scaled)[0]
print(f"\n=== New House Prediction ===")
print(f"Features: 1500 sqft, 3 bed, 5 years old, 10 km from center")
print(f"Predicted price: {pred_price:.2f} lakhs")import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
np.random.seed(42)
# Case 1: Strong linear relationship (R^2 near 1)
X1 = np.arange(1, 21).reshape(-1, 1)
y1 = 2 * X1.squeeze() + 3 + np.random.normal(0, 1, 20) # Low noise
# Case 2: Weak relationship (R^2 near 0)
y2 = np.random.normal(25, 10, 20) # Random, no relationship with X
# Case 3: Moderate relationship
y3 = 2 * X1.squeeze() + 3 + np.random.normal(0, 8, 20) # High noise
for name, y in [('Strong (low noise)', y1), ('No relationship', y2), ('Moderate (high noise)', y3)]:
    model = LinearRegression()
    model.fit(X1, y)
    y_pred = model.predict(X1)
    r2 = r2_score(y, y_pred)
    # Manual R^2 calculation
    ss_res = np.sum((y - y_pred) ** 2)  # Residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # Total sum of squares
    r2_manual = 1 - (ss_res / ss_tot)
    print(f"\n{name}:")
    print(f" R^2 = {r2:.4f}")
    print(f" R^2 (manual) = {r2_manual:.4f}")
    print(f" Interpretation: Model explains {r2*100:.1f}% of variance")
    if r2 > 0.8:
        print(f" --> Excellent fit")
    elif r2 > 0.5:
        print(f" --> Moderate fit")
    else:
        print(f" --> Poor fit - linear regression may not be appropriate")

import numpy as np
def gradient_descent_demo(X, y, lr, n_iter):
    n = len(X)
    w, b = 0.0, 0.0
    losses = []
    for i in range(n_iter):
        y_pred = w * X + b
        mse = np.mean((y - y_pred) ** 2)
        losses.append(mse)
        dw = (-2/n) * np.sum(X * (y - y_pred))
        db = (-2/n) * np.sum(y - y_pred)
        w -= lr * dw
        b -= lr * db
    return w, b, losses
# Data: y = 4x + 10 + noise
np.random.seed(42)
X = np.random.uniform(0, 10, 50)
y = 4 * X + 10 + np.random.normal(0, 3, 50)
X_norm = (X - X.mean()) / X.std()
print("=== Testing Different Learning Rates ===")
for lr in [0.001, 0.01, 0.1, 0.5]:
    w, b, losses = gradient_descent_demo(X_norm, y, lr, 200)
    print(f"\nLR = {lr}:")
    print(f" Final MSE: {losses[-1]:.4f}")
    print(f" MSE after 10 steps: {losses[9]:.4f}")
    print(f" MSE after 50 steps: {losses[49]:.4f}")
    if losses[-1] < losses[0]:
        improvement = (1 - losses[-1]/losses[0]) * 100
        print(f" Improvement: {improvement:.1f}%")
    else:
        print(f" WARNING: Loss increased! LR too high.")

print("\n=== Key Insight ===")
print("LR=0.001: Converges but slowly")
print("LR=0.01: Good balance of speed and stability")
print("LR=0.1: Converges fast")
print("LR=0.5: May be too aggressive for some problems")

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
# Non-linear data: y = 2x^2 - 3x + 5 + noise
np.random.seed(42)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 * X.squeeze()**2 - 3 * X.squeeze() + 5 + np.random.normal(0, 1, 50)
# Linear regression (will be bad)
lin_model = LinearRegression()
lin_model.fit(X, y)
y_pred_lin = lin_model.predict(X)
r2_lin = r2_score(y, y_pred_lin)
# Polynomial regression (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
poly_model = LinearRegression()
poly_model.fit(X_poly, y)
y_pred_poly = poly_model.predict(X_poly)
r2_poly = r2_score(y, y_pred_poly)
print("=== Linear vs Polynomial Regression ===")
print(f"\nLinear regression R^2: {r2_lin:.4f}")
print(f"Polynomial (degree 2) R^2: {r2_poly:.4f}")
print(f"\nLinear equation: y = {lin_model.coef_[0]:.2f}x + {lin_model.intercept_:.2f}")
print(f"Polynomial equation: y = {poly_model.coef_[2]:.2f}x^2 + {poly_model.coef_[1]:.2f}x + {poly_model.intercept_:.2f}")
print(f"True equation: y = 2.00x^2 - 3.00x + 5.00")
print(f"\nPolynomial features created: {poly.get_feature_names_out()}")
print(f"\nKey insight: PolynomialFeatures transforms [x] into [1, x, x^2]")
print(f"Then LinearRegression finds the best weights for these features")PolynomialFeatures(degree=2) transforms the feature x into [1, x, x^2]. Linear regression then finds the best weights for these polynomial features, effectively fitting a curve. The polynomial model achieves R^2 near 0.97, closely recovering the true quadratic equation y = 2x^2 - 3x + 5. This approach is powerful but beware of overfitting with high degrees.import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Create realistic salary dataset
np.random.seed(42)
n = 50
experience = np.random.uniform(0, 15, n).round(1)
salary = 25000 + 5000 * experience + np.random.normal(0, 3000, n)
salary = salary.round(0)
df = pd.DataFrame({'Experience_Years': experience, 'Salary': salary})
print("=== Salary Dataset ===")
print(df.describe().round(0))
# Prepare and split
X = df[['Experience_Years']]
y = df['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
print(f"\n=== Model ===")
print(f"Salary = {model.coef_[0]:.0f} * Experience + {model.intercept_:.0f}")
print(f"Interpretation: Each year of experience adds ~{model.coef_[0]:.0f} to salary")
print(f"Starting salary (0 years): ~{model.intercept_:.0f}")
# Evaluate
y_pred = model.predict(X_test)
print(f"\n=== Evaluation ===")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.0f} (average error)")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.0f}")
print(f"R^2: {r2_score(y_test, y_pred):.4f}")
# Predict for new employees
print(f"\n=== Salary Predictions ===")
for exp in [0, 2, 5, 10, 15]:
    pred = model.predict([[exp]])[0]
    print(f" {exp} years experience -> Predicted salary: {pred:,.0f}")

Common Mistakes
Not Scaling Features Before Gradient Descent
import numpy as np
# Features on very different scales
X = np.array([[1, 1000], [2, 2000], [3, 3000]])
y = np.array([10, 20, 30])
# Gradient descent with unscaled features
w = np.zeros(2)
b = 0
lr = 0.01
for i in range(100):
    y_pred = X @ w + b
    dw = (1/3) * X.T @ (y_pred - y)
    w -= lr * dw
# Diverges because feature 2 (1000s) dominates gradients!

import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1, 1000], [2, 2000], [3, 3000]])
y = np.array([10, 20, 30])
# Scale features first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
w = np.zeros(2)
b = 0
lr = 0.1 # Can use larger LR with scaled features
for i in range(100):
    y_pred = X_scaled @ w + b
    dw = (1/3) * X_scaled.T @ (y_pred - y)
    db = (1/3) * np.sum(y_pred - y)
    w -= lr * dw
    b -= lr * db
print(f"Converged: weights = {w.round(4)}, bias = {b:.4f}")

Ignoring Feature Shape for sklearn
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(X, y)  # Error: X must be 2D

from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1) # Make 2D
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(X, y)
print(f"y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}").reshape(-1, 1) to convert to shape (3, 1) meaning 3 samples with 1 feature.Using R-squared Alone to Evaluate the Model
# R^2 = 0.99 looks great, but...
from sklearn.linear_model import LinearRegression
import numpy as np
# Only 3 data points - model fits perfectly but means nothing
X = np.array([[1], [2], [3]])
y = np.array([10, 20, 30])
model = LinearRegression()
model.fit(X, y)
print(f"R^2: {model.score(X, y):.4f}") # 1.0 - misleadingly perfect!from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
# Use enough data and evaluate on TEST set
np.random.seed(42)
X = np.random.uniform(0, 10, (100, 1))
y = 3 * X.squeeze() + 5 + np.random.normal(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"R^2 (test): {r2_score(y_test, y_pred):.4f}")
print(f"MAE (test): {mean_absolute_error(y_test, y_pred):.4f}")
print("Always evaluate on TEST data with multiple metrics!")Using Linear Regression for Classification
from sklearn.linear_model import LinearRegression
import numpy as np
# Trying to predict categories (0/1) with linear regression
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1]) # Binary classification
model = LinearRegression()
model.fit(X, y)
print(model.predict([[3.5]]))  # Output: ~0.75 - not 0 or 1!

from sklearn.linear_model import LogisticRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])
model = LogisticRegression()
model.fit(X, y)
print(f"Prediction: {model.predict([[3.5]])[0]}")
print(f"Probability: {model.predict_proba([[3.5]])[0]}")Summary
- Linear regression predicts continuous values by finding the best line y = mx + b (simple) or y = w1*x1 + w2*x2 + ... + b (multiple) that fits the data.
- The cost function (MSE = mean of squared errors) measures how bad the predictions are. The goal of training is to minimize MSE by finding optimal weights.
- Gradient descent iteratively updates weights: w = w - lr * gradient. The gradient points toward increasing error, so we move opposite to it. Learning rate controls step size.
- The normal equation w = (X^T X)^(-1) X^T y gives the exact optimal solution without iteration. scikit-learn uses this for small-medium datasets.
- scikit-learn's LinearRegression: fit(X_train, y_train) trains, predict(X_test) predicts. coef_ gives feature weights, intercept_ gives the bias term.
- Evaluation metrics: MAE (average absolute error, easy to interpret), MSE (penalizes large errors), RMSE (same units as target), R-squared (proportion of variance explained, 0-1).
- R-squared = 0.85 means the model explains 85% of the variance. Always evaluate on the test set, not training set. Use multiple metrics, not just R-squared.
- Feature scaling is essential for gradient descent but not for the normal equation (sklearn). Always reshape 1D arrays to 2D with .reshape(-1, 1) for sklearn.
- For non-linear data, use PolynomialFeatures to create x^2, x^3 terms and then apply linear regression on the expanded features.
- Assumptions: linearity, independence, constant error variance, normal residuals, no multicollinearity. Check these for reliable results.