What Is Feature Engineering?
Feature engineering is the process of creating new features or transforming existing features to improve the performance of a machine learning model. It is arguably the most important step in the ML pipeline -- the difference between a mediocre model and a competition-winning model often comes down to the quality of features, not the choice of algorithm.
Andrew Ng, a pioneer in AI, famously said: "Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering."
Feature engineering involves two main activities:
- Feature creation: Deriving new features from existing data -- calculating age from date of birth, creating BMI from height and weight, extracting day-of-week from a timestamp.
- Feature selection: Identifying and keeping only the most relevant features, removing redundant or noisy ones that could hurt model performance.
Why Does Feature Engineering Matter More Than Algorithm Choice?
1. Garbage In, Garbage Out
Even the most sophisticated algorithm cannot learn patterns that are not present in the features. If you are predicting house prices but only provide the number of windows, no algorithm can learn the effect of location or size. But if you engineer features like price_per_sqft_in_neighborhood, the simplest linear model can perform well.
2. Real-World Data Is Messy
Raw data is rarely in a form that ML algorithms can use directly. Dates are strings, addresses are text, categories need encoding, and the most informative signals often come from combining multiple raw fields. Feature engineering transforms this raw data into algorithm-friendly numerical features.
3. Domain Knowledge Multiplier
A data scientist who understands the business domain can create features that capture the underlying physics, economics, or behavior. For example, in credit scoring, the ratio of credit used to credit available (utilization ratio) is far more predictive than either number alone. Most algorithms will not reliably discover this ratio on their own from the raw numbers.
4. Reduces Need for Complex Models
Good features can make a simple model outperform a complex one with poor features. A well-engineered linear model is often more interpretable, faster, and more robust than a deep neural network trained on raw features.
5. Kaggle Winners Agree
In competitive machine learning (Kaggle), the top solutions almost always emphasize feature engineering over exotic algorithms. The winning models typically use gradient boosting (XGBoost/LightGBM) with hundreds of carefully engineered features.
Detailed Explanation
1. Creating Features from Existing Data
Mathematical Combinations
Combine existing numerical features to create new ones that capture domain knowledge:
- Ratios: price_per_sqft = price / area, BMI = weight / (height^2), profit_margin = profit / revenue
- Differences: age = current_year - birth_year, tenure = end_date - start_date
- Aggregations: total_purchases = sum of monthly purchases, avg_order_value = total_revenue / order_count
- Interactions: area_x_quality = area * quality_score (captures that large high-quality houses are disproportionately expensive)
2. Handling Datetime Features
Datetime columns contain rich information that ML algorithms cannot use in raw form. Extract multiple features:
- Time components: hour, minute, day, month, year, quarter
- Derived features: day_of_week (0=Monday to 6=Sunday), is_weekend (boolean), is_month_start, is_month_end
- Cyclical encoding: For circular features like hour (23 is close to 0), use sin/cos encoding: sin(2*pi*hour/24) and cos(2*pi*hour/24)
- Time since event: days_since_last_purchase, hours_since_signup
3. Text Features
When your data includes text fields (product descriptions, reviews, names), you can extract numerical features:
- Basic statistics: word_count, char_count, avg_word_length, sentence_count
- Specific patterns: has_exclamation, num_uppercase_words, num_urls, num_mentions
- TF-IDF (Term Frequency - Inverse Document Frequency): Converts text into numerical vectors where each word gets a weight based on how important it is in that document relative to the entire corpus. Common words (the, is, a) get low weights; distinctive words get high weights.
4. Binning Continuous Variables
Sometimes converting a continuous variable into bins (categories) helps capture non-linear relationships:
- Age: 0-18 (minor), 19-25 (young adult), 26-40 (adult), 41-60 (middle-aged), 61+ (senior)
- Income: Low, Medium, High, Very High
Use pd.cut() for equal-width bins or pd.qcut() for equal-frequency bins (quantile-based).
5. Polynomial Features
Polynomial features create higher-order terms and interaction terms automatically. For features [a, b] with degree=2, sklearn generates: [a, b, a^2, ab, b^2]. This allows linear models to capture non-linear relationships.
Use PolynomialFeatures(degree=2, interaction_only=False, include_bias=False). Be careful with high degrees -- the number of features explodes combinatorially.
6. Feature Selection Methods
Filter Methods (Independent of Model)
- Correlation matrix: Remove features that are highly correlated with each other (redundant). If feature A and feature B have correlation > 0.95, remove one.
- Variance threshold: Remove features with very low variance (near-constant). If a feature has the same value for 99% of rows, it is uninformative.
- Mutual information: Measures the dependency between a feature and the target. Higher mutual information means the feature is more informative. Works for both linear and non-linear relationships.
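The variance-threshold filter can be sketched with sklearn's VarianceThreshold; the toy matrix and the 0.01 cutoff below are illustrative choices, not defaults to copy:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: column 0 has the same value in 99 of 100 rows (near-constant);
# columns 1 and 2 vary normally
rng = np.random.default_rng(42)
near_constant = np.ones(100)
near_constant[0] = 0.0
X = np.column_stack([near_constant, rng.normal(size=100), rng.normal(size=100)])

# threshold=0.01 is an illustrative cutoff; tune it for your data
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(f"Column variances: {X.var(axis=0).round(4)}")
print(f"Kept columns: {selector.get_support()}")
print(f"Shape: {X.shape} -> {X_reduced.shape}")
```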
Wrapper Methods (Model-Dependent)
- Recursive Feature Elimination (RFE): Train a model, rank features by importance, remove the least important, repeat until the desired number remains. Sklearn provides RFE and RFECV (with cross-validation to find the optimal count).
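RFECV's behavior can be sketched on synthetic data; make_regression, the LinearRegression estimator, and the sizes below are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic regression task: 10 features, only 4 carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=42)

# RFECV repeats RFE under cross-validation and keeps the feature
# count that maximizes the CV score
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=5)
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")
print(f"Selected feature mask: {rfecv.support_}")
```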
Embedded Methods (Built into Model Training)
- Tree-based feature importance: Random Forest and XGBoost provide feature_importances_ based on how much each feature reduces impurity across all trees.
- L1 regularization (Lasso): Sets unimportant feature coefficients to exactly zero, effectively selecting features during training.
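Lasso-based selection can be sketched on synthetic data; alpha=5.0 below is an illustrative setting -- in practice tune it (e.g. with LassoCV):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic task: 10 features, only 3 informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # Lasso is sensitive to feature scale

# Larger alpha drives more coefficients to exactly zero
lasso = Lasso(alpha=5.0)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)
print(f"Non-zero coefficients: {len(selected)} of {X.shape[1]}")
print(f"Selected feature indices: {selected}")
```

Features with a zero coefficient can simply be dropped; sklearn's SelectFromModel wraps this pattern as a transformer.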
Code Examples
import pandas as pd
import numpy as np
# Sample real estate dataset
np.random.seed(42)
df = pd.DataFrame({
'area_sqft': [1200, 1800, 2500, 900, 3200],
'bedrooms': [2, 3, 4, 1, 5],
'bathrooms': [1, 2, 3, 1, 3],
'year_built': [1990, 2005, 2018, 1985, 2020],
'lot_size_sqft': [4000, 5500, 8000, 3000, 10000],
'price': [2500000, 4500000, 7500000, 1800000, 12000000]
})
# Feature Engineering
df['price_per_sqft'] = df['price'] / df['area_sqft']
df['bed_bath_ratio'] = df['bedrooms'] / df['bathrooms']
df['age'] = 2026 - df['year_built']
df['area_per_bedroom'] = df['area_sqft'] / df['bedrooms']
df['lot_to_house_ratio'] = df['lot_size_sqft'] / df['area_sqft']
df['is_new_construction'] = (df['age'] <= 5).astype(int)
df['total_rooms'] = df['bedrooms'] + df['bathrooms']
df['area_x_rooms'] = df['area_sqft'] * df['total_rooms'] # Interaction
print("Original features: area_sqft, bedrooms, bathrooms, year_built, lot_size_sqft")
print(f"Engineered features: {df.columns.tolist()[6:]}")
print(f"\nSample engineered values:")
print(df[['price_per_sqft', 'age', 'bed_bath_ratio', 'is_new_construction']].to_string())

import pandas as pd
import numpy as np
# Sample e-commerce orders
df = pd.DataFrame({
'order_id': range(1, 8),
'order_date': pd.to_datetime([
'2026-01-15 09:30:00', '2026-03-22 14:15:00',
'2026-06-07 22:45:00', '2026-09-11 06:00:00',
'2026-12-25 18:30:00', '2026-07-04 11:00:00',
'2026-11-01 23:59:00'
]),
'customer_dob': pd.to_datetime([
'1998-05-20', '2001-11-03', '1995-08-14',
'2003-02-28', '1990-07-10', '1999-12-25', '1997-04-01'
]),
'amount': [1500, 3200, 890, 5600, 12000, 2100, 4500]
})
# Extract datetime features
df['hour'] = df['order_date'].dt.hour
df['day_of_week'] = df['order_date'].dt.dayofweek # 0=Mon, 6=Sun
df['day_name'] = df['order_date'].dt.day_name()
df['month'] = df['order_date'].dt.month
df['quarter'] = df['order_date'].dt.quarter
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
df['is_night_order'] = ((df['hour'] >= 22) | (df['hour'] <= 5)).astype(int)
# Cyclical encoding for hour (so 23 is close to 0)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24).round(4)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24).round(4)
# Age from DOB
df['customer_age'] = ((df['order_date'] - df['customer_dob']).dt.days / 365.25).astype(int)
print("Datetime features:")
print(df[['order_date', 'hour', 'day_name', 'is_weekend', 'is_night_order',
'quarter', 'hour_sin', 'hour_cos', 'customer_age']].to_string())

import pandas as pd
import numpy as np
# Sample product reviews
df = pd.DataFrame({
'review': [
'This phone is AMAZING! Best purchase ever!!!',
'Decent product. Works as expected.',
'Terrible quality. Broke after 2 days. DO NOT BUY.',
'Good value for money. Camera could be better though.',
'LOVE IT! Perfect gift for my daughter. Fast delivery!'
],
'rating': [5, 3, 1, 4, 5]
})
# Basic text features
df['word_count'] = df['review'].str.split().str.len()
df['char_count'] = df['review'].str.len()
df['avg_word_length'] = (df['char_count'] / df['word_count']).round(2)
df['exclamation_count'] = df['review'].str.count('!')
df['question_count'] = df['review'].str.count('\\?')
df['uppercase_count'] = df['review'].str.findall(r'\b[A-Z]{2,}\b').str.len()
df['sentence_count'] = df['review'].str.count(r'[.!?]+')
df['has_caps_words'] = (df['uppercase_count'] > 0).astype(int)
print("Text features:")
cols = ['word_count', 'char_count', 'exclamation_count', 'uppercase_count', 'sentence_count', 'rating']
print(df[cols].to_string())
print(f"\nCorrelation between exclamation_count and rating: {df['exclamation_count'].corr(df['rating']):.4f}")
print(f"Correlation between uppercase_count and rating: {df['uppercase_count'].corr(df['rating']):.4f}")

import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Sample data
df = pd.DataFrame({
'age': [22, 35, 48, 19, 62, 41, 28, 55],
'income': [25000, 55000, 75000, 18000, 90000, 65000, 40000, 82000]
})
# Binning age into categories
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 40, 55, 100],
labels=['Young', 'Adult', 'Middle', 'Senior'])
# Quantile-based binning for income (equal number of samples per bin)
df['income_quartile'] = pd.qcut(df['income'], q=4,
labels=['Low', 'Medium', 'High', 'Very High'])
print("Binning results:")
print(df[['age', 'age_group', 'income', 'income_quartile']].to_string())
# Polynomial features (degree=2)
X = df[['age', 'income']].values
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"\nOriginal features: {['age', 'income']}")
print(f"Polynomial features: {poly.get_feature_names_out()}")
print(f"Original shape: {X.shape}")
print(f"Polynomial shape: {X_poly.shape}")
print(f"\nSample row (age=22, income=25000):")
print(f" age=22, income=25000, age^2={22**2}, age*income={22*25000}, income^2={25000**2}")

import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_regression, SelectKBest
from sklearn.datasets import fetch_california_housing
# Load California housing dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target
# Method 1: Correlation with target
print("Correlation with Price (target):")
corr = df.corr()['Price'].drop('Price').sort_values(ascending=False)
for feat, val in corr.items():
print(f" {feat:15s}: {val:+.4f}")
# Method 2: Find highly correlated feature pairs (redundancy)
print("\nHighly correlated feature pairs (>0.7):")
corr_matrix = df.drop('Price', axis=1).corr()
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
if abs(corr_matrix.iloc[i, j]) > 0.7:
print(f" {corr_matrix.columns[i]} <-> {corr_matrix.columns[j]}: {corr_matrix.iloc[i, j]:.4f}")
# Method 3: Mutual information
mi = mutual_info_regression(housing.data, housing.target, random_state=42)
mi_series = pd.Series(mi, index=housing.feature_names).sort_values(ascending=False)
print("\nMutual Information scores (higher = more informative):")
for feat, val in mi_series.items():
print(f" {feat:15s}: {val:.4f}")

import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# Load data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Method 1: Tree-based feature importance
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Random Forest Feature Importance:")
for feat, imp in importances.items():
bar = '#' * int(imp * 50)
print(f" {feat:15s}: {imp:.4f} {bar}")
print(f"\nFull model R2: {rf.score(X_test, y_test):.4f}")
# Method 2: Recursive Feature Elimination
rfe = RFE(estimator=RandomForestRegressor(n_estimators=50, random_state=42),
n_features_to_select=4)
rfe.fit(X_train, y_train)
selected = X.columns[rfe.support_]
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(f"\nRFE selected features: {list(selected)}")
print(f"RFE feature ranking:")
for feat, rank in ranking.items():
marker = ' <-- selected' if rank == 1 else ''
print(f" {feat:15s}: rank {rank}{marker}")
# Compare: full features vs selected features
rf_selected = RandomForestRegressor(n_estimators=100, random_state=42)
rf_selected.fit(X_train[selected], y_train)
print(f"\nSelected features R2: {rf_selected.score(X_test[selected], y_test):.4f}")
print(f"Features reduced: {X.shape[1]} -> {len(selected)}")

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_california_housing
# Load data
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['Price'] = housing.target
# Baseline: raw features
X_raw = df.drop('Price', axis=1)
y = df['Price']
# Feature Engineering
df_eng = df.copy()
df_eng['rooms_per_person'] = df['AveRooms'] / df['AveOccup'] # AveRooms is already per household; dividing by occupants gives rooms per person
df_eng['bedrooms_ratio'] = df['AveBedrms'] / df['AveRooms']
df_eng['households_in_block'] = df['Population'] / df['AveOccup'] # block population / household size ~= household count
df_eng['income_per_room'] = df['MedInc'] / df['AveRooms']
df_eng['income_squared'] = df['MedInc'] ** 2
df_eng['is_coastal'] = ((df['Longitude'] < -122) & (df['Latitude'] < 38)).astype(int)
df_eng['income_x_rooms'] = df['MedInc'] * df['AveRooms']
X_eng = df_eng.drop('Price', axis=1)
print(f"Raw features: {X_raw.shape[1]}")
print(f"Engineered features: {X_eng.shape[1]}")
print(f"New features: {[c for c in X_eng.columns if c not in X_raw.columns]}")
# Split
X_raw_train, X_raw_test, y_train, y_test = train_test_split(
X_raw, y, test_size=0.2, random_state=42)
X_eng_train, X_eng_test, _, _ = train_test_split(
X_eng, y, test_size=0.2, random_state=42)
# Model: Gradient Boosting
model = GradientBoostingRegressor(n_estimators=200, max_depth=5,
learning_rate=0.1, random_state=42)
# Evaluate raw features
model.fit(X_raw_train, y_train)
raw_train_score = model.score(X_raw_train, y_train)
raw_test_score = model.score(X_raw_test, y_test)
print(f"\nRaw features:")
print(f" Train R2: {raw_train_score:.4f}")
print(f" Test R2: {raw_test_score:.4f}")
# Evaluate engineered features
model.fit(X_eng_train, y_train)
eng_train_score = model.score(X_eng_train, y_train)
eng_test_score = model.score(X_eng_test, y_test)
print(f"\nEngineered features:")
print(f" Train R2: {eng_train_score:.4f}")
print(f" Test R2: {eng_test_score:.4f}")
print(f"\nImprovement: {eng_test_score - raw_test_score:+.4f} R2")
# Feature importance for engineered model
importances = pd.Series(model.feature_importances_,
index=X_eng.columns).sort_values(ascending=False)
print(f"\nTop 5 features in engineered model:")
for feat, imp in importances.head(5).items():
print(f" {feat}: {imp:.4f}")

Common Mistakes
Data Leakage: Using Future Information in Features
import pandas as pd
# Predicting whether a customer will churn next month
df = pd.DataFrame({
'customer_id': [1, 2, 3],
'total_purchases_ever': [50, 10, 30], # Includes future purchases!
'account_closed_date': ['2026-06-01', None, '2026-04-15'],
'churned': [1, 0, 1]
})
# 'account_closed_date' leaks the target -- if it is not None, the customer churned!
# 'total_purchases_ever' includes purchases after the prediction date

import pandas as pd
# Only use features available at the time of prediction
df = pd.DataFrame({
'customer_id': [1, 2, 3],
'purchases_last_30_days': [2, 8, 5], # Only past data
'days_since_last_purchase': [45, 3, 20],
'avg_monthly_spend': [500, 1200, 800],
'churned': [1, 0, 1]
})
# All features use only historical data available before the prediction point

Fitting Feature Selection on Full Data (Including Test Set)
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
import numpy as np
X = np.random.randn(1000, 20)
y = np.random.randn(1000)
# WRONG: Fitting on ALL data before splitting
selector = SelectKBest(f_regression, k=5)
X_selected = selector.fit_transform(X, y) # Uses test data!
X_train, X_test, y_train, y_test = train_test_split(
X_selected, y, test_size=0.2, random_state=42)

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
import numpy as np
X = np.random.randn(1000, 20)
y = np.random.randn(1000)
# CORRECT: Split first, then fit selector on training data only
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
selector = SelectKBest(f_regression, k=5)
X_train_sel = selector.fit_transform(X_train, y_train) # Fit on train only
X_test_sel = selector.transform(X_test) # Transform test with same selection

Creating Too Many Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.random.randn(100, 20) # 20 features
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Features: {X_poly.shape[1]}") # Thousands of features!

from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.random.randn(100, 20)
# Option 1: Use degree=2 with interaction_only (no squared terms)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Interaction-only features: {X_poly.shape[1]}")
# Option 2: Manually create specific polynomial features
# based on domain knowledge
X_manual = np.column_stack([X, X[:, 0] * X[:, 1], X[:, 2]**2])
print(f"Manual features: {X_manual.shape[1]}")

Removing Features Based on Low Correlation with Target When Relationship Is Non-Linear
import pandas as pd
import numpy as np
# Feature has strong non-linear relationship but low linear correlation
np.random.seed(42)
df = pd.DataFrame({
'x': np.linspace(-3, 3, 100),
})
df['y'] = df['x'] ** 2 + np.random.randn(100) * 0.5
print(f"Correlation: {df['x'].corr(df['y']):.4f}")
# Low correlation! But x is highly predictive of y (non-linearly)
# WRONG: Removing x because correlation is low

import pandas as pd
import numpy as np
from sklearn.feature_selection import mutual_info_regression
np.random.seed(42)
df = pd.DataFrame({'x': np.linspace(-3, 3, 100)})
df['y'] = df['x'] ** 2 + np.random.randn(100) * 0.5
print(f"Correlation: {df['x'].corr(df['y']):.4f}") # Low
# Use mutual information instead -- captures non-linear relationships
mi = mutual_info_regression(df[['x']], df['y'], random_state=42)
print(f"Mutual Information: {mi[0]:.4f}") # High!
# MI correctly identifies x as informative

Summary
- Feature engineering is the process of creating new features or transforming existing ones to improve ML model performance. It is often more impactful than algorithm selection.
- Create features from domain knowledge: ratios (price_per_sqft), differences (age from DOB), interactions (area * quality), and aggregations (total_purchases). These capture relationships that raw features do not explicitly contain.
- Datetime features are rich: extract hour, day_of_week, month, quarter, is_weekend, and time-since-event. Use sin/cos cyclical encoding for circular features like hour.
- Text features include word_count, char_count, exclamation_count, uppercase words, and TF-IDF scores. Simple text statistics can be surprisingly predictive.
- Binning converts continuous variables into categories using pd.cut() (equal-width) or pd.qcut() (equal-frequency). This helps capture non-linear step-function relationships.
- PolynomialFeatures creates squared and interaction terms automatically. Use degree=2 and be cautious with high degrees (combinatorial explosion). Prefer manual feature creation based on domain knowledge.
- Feature selection methods: correlation matrix (linear relationships with target), mutual information (linear and non-linear), variance threshold (remove near-constant features), RFE (iterative removal), tree-based importance (built into model).
- Fit feature selection on training data only. Fitting on full data (including test set) is data leakage and gives overly optimistic results.
- Data leakage is the most dangerous feature engineering mistake. Never use information that would not be available at prediction time (future data, target-derived features).
- Good features make simple models outperform complex ones. In Kaggle competitions, feature engineering with gradient boosting consistently beats exotic algorithms with raw features.