Practice Questions — Feature Engineering and Feature Selection
Topic-Specific Questions
Question 1
Easy
What is feature engineering and why is it important?
Think about creating new features from existing data to help models learn better.
Feature engineering is the process of creating new features or transforming existing features to improve ML model performance. It is important because: (1) Models can only learn from the features you provide -- if the right patterns are not in the features, no algorithm can find them. (2) Good features can make simple models outperform complex ones. (3) Raw data is often not in a form algorithms can use directly (dates, text, categories).
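As a concrete illustration (the column names and values here are made up), deriving a total-spend column from price and quantity is already feature engineering:

```python
import pandas as pd

# Hypothetical order records, for illustration only
df = pd.DataFrame({'price': [250, 120], 'quantity': [2, 4]})

# New feature derived from existing columns
df['total_spend'] = df['price'] * df['quantity']
print(df['total_spend'].tolist())  # [500, 480]
```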
Question 2
Easy
Give three examples of features you could engineer from a 'date_of_birth' column.
Think about age, zodiac sign, and generational cohort.
1. age = current_date - date_of_birth (in years). 2. birth_month = month extracted from DOB (captures seasonal patterns). 3. is_born_in_90s = binary flag for generational cohort. Other options: birth_day_of_week, birth_quarter, age_group (binned), days_until_next_birthday.
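The first three of these can be sketched in pandas as follows (the birth dates and the fixed "current date" are illustrative, chosen so the output is reproducible):

```python
import pandas as pd

# Hypothetical date_of_birth column
df = pd.DataFrame({'date_of_birth': pd.to_datetime(['1992-06-15', '2001-01-03'])})
today = pd.Timestamp('2026-01-01')  # fixed reference date for reproducibility

df['age'] = (today - df['date_of_birth']).dt.days // 365
df['birth_month'] = df['date_of_birth'].dt.month
df['is_born_in_90s'] = df['date_of_birth'].dt.year.between(1990, 1999).astype(int)
print(df[['age', 'birth_month', 'is_born_in_90s']].values.tolist())
```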
Question 3
Easy
What is the output?
import pandas as pd
df = pd.DataFrame({'height_cm': [170, 165, 180], 'weight_kg': [70, 55, 85]})
df['bmi'] = df['weight_kg'] / (df['height_cm']/100)**2
print(df['bmi'].round(1).tolist())
BMI = weight / height^2 where height is in meters.
[24.2, 20.2, 26.2]
Question 4
Easy
What is the output?
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[2, 3]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly)
print(poly.get_feature_names_out())
Degree 2 creates: original features, squared terms, and interaction terms.
[[2. 3. 4. 6. 9.]]
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
Question 5
Easy
What is the difference between pd.cut() and pd.qcut() in pandas?
One creates equal-width bins, the other creates equal-frequency bins.
pd.cut() creates bins of equal width -- each bin covers the same range of values (e.g., 0-20, 20-40, 40-60). pd.qcut() creates bins of equal frequency -- each bin contains approximately the same number of data points (quantile-based). Use cut() when the ranges matter, use qcut() when you want balanced groups.
Question 6
Easy
What is data leakage in feature engineering? Give one example.
Using information that would not be available at prediction time.
Data leakage occurs when a feature contains information that would not be available at the time you need to make a prediction. Example: Predicting whether a customer will churn next month, but using 'total_lifetime_value' which includes future purchases. The model appears accurate during training but fails in production because the future data is not available when making real predictions.
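The standard defense is the same for any preprocessing step: fit on training data only, then reuse those fitted statistics on the test set. A minimal sketch with a scaler on synthetic data (all data here is randomly generated for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Safe: mean/std are learned from the training split only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)  # test set reuses the training statistics
```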
Question 7
Medium
What is the output?
import pandas as pd
import numpy as np
dates = pd.to_datetime(['2026-01-05', '2026-07-15', '2026-12-25'])
print(dates.month.tolist())
print(dates.dayofweek.tolist())
print((dates.dayofweek >= 5).astype(int).tolist())
month gives 1-12, dayofweek gives 0=Monday to 6=Sunday.
[1, 7, 12]
[0, 2, 4]
[0, 0, 0]
Question 8
Medium
Why do we use sin/cos cyclical encoding for features like hour of the day?
Think about the distance between hour 23 and hour 0.
Hour 23 and hour 0 are only 1 hour apart in reality, but numerically they are 23 units apart. ML algorithms would treat them as very different, which is wrong. Sin/cos encoding maps cyclical values onto a circle:
sin(2*pi*hour/24) and cos(2*pi*hour/24). In this encoding, hour 23 and hour 0 are close together (both near sin=0, cos=1), correctly representing their temporal proximity.
Question 9
Medium
Write code to create a correlation matrix for the Iris dataset features and print all pairs with absolute correlation above 0.8.
Use load_iris(), create a DataFrame, compute .corr(), then loop through the upper triangle.
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
corr_matrix = df.corr()
print("Feature pairs with |correlation| > 0.8:")
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            print(f" {corr_matrix.columns[i]} <-> {corr_matrix.columns[j]}: {corr_matrix.iloc[i, j]:.4f}")
Question 10
Medium
Explain the difference between filter, wrapper, and embedded feature selection methods. Give one example of each.
They differ in whether they are independent of the model, use a model iteratively, or are built into model training.
Filter methods: Evaluate features independently of any model. Fast but may miss feature interactions. Example: correlation with target, mutual information, variance threshold. Wrapper methods: Use a model to evaluate feature subsets iteratively. More accurate but slower. Example: Recursive Feature Elimination (RFE), forward/backward selection. Embedded methods: Feature selection is built into the model training process. Example: Random Forest feature_importances_, Lasso (L1) regularization setting unimportant coefficients to zero.
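One example of each category can be sketched on the Wine dataset (the specific hyperparameters here, e.g. alpha=0.1 and n_features_to_select=5, are illustrative choices; Lasso is used on the class labels purely to show coefficient zeroing):

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# Filter: drops zero-variance features without consulting any model
filt = VarianceThreshold(threshold=0.0).fit(X)

# Wrapper: repeatedly retrains a model to rank and prune features
wrap = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
           n_features_to_select=5).fit(X, y)

# Embedded: the L1 penalty zeroes out coefficients during training itself
emb = Lasso(alpha=0.1).fit(X, y)

print(filt.get_support().sum(), wrap.support_.sum(), (emb.coef_ != 0).sum())
```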
Question 11
Medium
Write code to use SelectKBest with mutual_info_regression to select the top 4 features from the California housing dataset.
Use fetch_california_housing, fit SelectKBest(mutual_info_regression, k=4), and print selected feature names.
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.datasets import fetch_california_housing
import numpy as np
housing = fetch_california_housing()
X, y = housing.data, housing.target
selector = SelectKBest(mutual_info_regression, k=4)
selector.fit(X, y)
scores = selector.scores_
selected_mask = selector.get_support()
selected_features = [housing.feature_names[i] for i in range(len(selected_mask)) if selected_mask[i]]
print("MI scores per feature:")
for name, score in sorted(zip(housing.feature_names, scores), key=lambda x: -x[1]):
    marker = " <-- selected" if name in selected_features else ""
    print(f" {name}: {score:.4f}{marker}")
print(f"\nSelected: {selected_features}")
Question 12
Hard
Shreya is building a credit scoring model. She has features: monthly_income, monthly_expenses, credit_limit, and current_balance. What features should she engineer and why?
Think about ratios and financial metrics that banks use.
Shreya should engineer: (1) savings_ratio = (income - expenses) / income -- measures how much income is saved, indicating financial stability. (2) credit_utilization = current_balance / credit_limit -- the most important factor in credit scoring; high utilization indicates risk. (3) debt_to_income = current_balance / monthly_income -- measures how manageable the debt is relative to earnings. (4) expense_ratio = monthly_expenses / monthly_income -- how much of income goes to expenses. (5) available_credit = credit_limit - current_balance -- raw available credit amount.
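The first three ratios can be computed directly in pandas (the account values below are made up for illustration):

```python
import pandas as pd

# Illustrative account records
df = pd.DataFrame({
    'monthly_income': [80000, 50000],
    'monthly_expenses': [60000, 20000],
    'credit_limit': [200000, 100000],
    'current_balance': [150000, 10000],
})

df['savings_ratio'] = (df['monthly_income'] - df['monthly_expenses']) / df['monthly_income']
df['credit_utilization'] = df['current_balance'] / df['credit_limit']
df['debt_to_income'] = df['current_balance'] / df['monthly_income']
print(df[['savings_ratio', 'credit_utilization', 'debt_to_income']].round(2).to_string())
```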
Question 13
Hard
What is the output?
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[1, 2, 3]])
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
print(f"Shape: {X_poly.shape}")
print(f"Features: {poly.get_feature_names_out()}")
interaction_only=True means no squared terms (x0^2, x1^2, x2^2), only cross-terms.
Shape: (1, 6)
Features: ['x0' 'x1' 'x2' 'x0 x1' 'x0 x2' 'x1 x2']
Question 14
Hard
What is mutual information and why is it better than correlation for feature selection?
Correlation only captures linear relationships. Mutual information captures any dependency.
Mutual information (MI) measures the amount of information that knowing one variable provides about another. It captures any type of statistical dependency (linear, quadratic, sinusoidal, etc.), not just linear relationships. Correlation (Pearson's r) only measures linear relationships. A feature with a perfect quadratic relationship to the target (y = x^2) has near-zero correlation but high mutual information. MI is always non-negative (0 means independence, higher means more dependent). It is more general but slower to compute and requires more data for reliable estimates.
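The y = x^2 case can be demonstrated directly on synthetic data (the sample size and random seed are arbitrary choices):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=1000)
y = x ** 2  # perfect quadratic dependence, but not linear

pearson_r = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=42)[0]

# Correlation is near zero; mutual information is clearly positive
print(f"Pearson r: {pearson_r:.3f}, mutual information: {mi:.3f}")
```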
Question 15
Hard
Write code to apply RFE with cross-validation (RFECV) using a Random Forest on the Wine dataset to find the optimal number of features.
Use RFECV from sklearn.feature_selection with RandomForestClassifier. It automatically finds the best number of features.
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold
wine = load_wine()
X, y = wine.data, wine.target
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,
    cv=StratifiedKFold(5),
    scoring='accuracy',
    min_features_to_select=1
)
rfecv.fit(X, y)
print(f"Optimal features: {rfecv.n_features_}")
print("Selected features:")
for name, selected in zip(wine.feature_names, rfecv.support_):
    if selected:
        print(f" {name}")
print(f"\nBest CV accuracy: {rfecv.cv_results_['mean_test_score'].max():.4f}")
Question 16
Easy
What is the output?
import pandas as pd
df = pd.DataFrame({'name': ['Aarav Singh', 'Priya Sharma', 'Ravi K']})
df['name_length'] = df['name'].str.len()
df['word_count'] = df['name'].str.split().str.len()
print(df[['name_length', 'word_count']].values.tolist())
str.len() counts characters including spaces. str.split().str.len() counts words.
[[11, 2], [12, 2], [6, 2]]
Question 17
Medium
What is target encoding and why can it cause data leakage?
Target encoding replaces a category with the mean target value for that category.
Target encoding replaces each categorical value with the mean of the target variable for that category. For example, city='Mumbai' might be replaced by the average house price in Mumbai. Leakage risk: If computed on the full dataset (including test data), the encoding uses target information from the test set. Even on training data alone, a category with few samples gets an encoding very close to those specific samples' target values, leading to overfitting. Solutions: compute on training data only, use cross-validated target encoding, or add noise/smoothing.
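A leakage-safe version can be sketched by hand: compute smoothed category means on training data only and map them onto new data. The city names, prices, and smoothing strength m=2 below are all illustrative:

```python
import pandas as pd

train = pd.DataFrame({'city': ['Mumbai', 'Mumbai', 'Pune', 'Pune', 'Delhi'],
                      'price': [200, 220, 110, 130, 150]})
test = pd.DataFrame({'city': ['Pune', 'Delhi', 'Goa']})  # 'Goa' is unseen

global_mean = train['price'].mean()
stats = train.groupby('city')['price'].agg(['mean', 'count'])
m = 2  # smoothing strength: small categories shrink toward the global mean
smoothed = (stats['mean'] * stats['count'] + global_mean * m) / (stats['count'] + m)

# Map onto test data; unseen categories fall back to the global mean
test['city_encoded'] = test['city'].map(smoothed).fillna(global_mean).round(2)
print(test)
```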
Question 18
Hard
Write a function that takes a DataFrame with a datetime column and returns 6 extracted features: hour, day_of_week, month, is_weekend, quarter, and cyclical hour encoding (sin and cos).
Use dt accessor for datetime operations. sin/cos encoding: sin(2*pi*hour/24), cos(2*pi*hour/24).
import pandas as pd
import numpy as np
def extract_datetime_features(df, col):
    df = df.copy()
    dt = pd.to_datetime(df[col])
    df['hour'] = dt.dt.hour
    df['day_of_week'] = dt.dt.dayofweek
    df['month'] = dt.dt.month
    df['is_weekend'] = (dt.dt.dayofweek >= 5).astype(int)
    df['quarter'] = dt.dt.quarter
    df['hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24).round(4)
    df['hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24).round(4)
    return df

df = pd.DataFrame({'timestamp': ['2026-01-15 09:30', '2026-06-20 22:00']})
result = extract_datetime_features(df, 'timestamp')
print(result.drop('timestamp', axis=1).to_string())
Mixed & Application Questions
Question 1
Easy
What is the output?
import pandas as pd
df = pd.DataFrame({'price': [100, 200, 150], 'quantity': [5, 3, 8]})
df['total'] = df['price'] * df['quantity']
df['avg_price'] = df['total'] / df['quantity']
print(df['total'].tolist())
print(df['avg_price'].tolist())
total = price * quantity. avg_price = total / quantity = price.
[500, 600, 1200]
[100.0, 200.0, 150.0]
Question 2
Easy
What is TF-IDF and when would you use it?
It converts text into numbers by weighing words based on importance.
TF-IDF (Term Frequency - Inverse Document Frequency) converts text documents into numerical vectors. TF measures how often a word appears in a document. IDF measures how rare the word is across all documents. TF-IDF = TF * IDF. Common words like 'the' get low scores (high TF but low IDF). Distinctive words like 'algorithm' get high scores if they appear often in one document but rarely in others. Use it when you need to convert text data into features for ML models.
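A minimal sketch with sklearn's TfidfVectorizer (the three toy documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

print(X.shape)                  # (number of documents, vocabulary size)
print(sorted(vec.vocabulary_))  # learned vocabulary
```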
Question 3
Easy
What is the output?
import pandas as pd
df = pd.DataFrame({'age': [15, 28, 45, 62, 35]})
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100],
                         labels=['Minor', 'Young', 'Adult', 'Senior'])
print(df['age_group'].tolist())
Bins are (0,18], (18,35], (35,60], (60,100].
['Minor', 'Young', 'Adult', 'Senior', 'Young']
Question 4
Medium
What is the output?
import numpy as np
hour = np.array([0, 6, 12, 18, 23])
hour_sin = np.sin(2 * np.pi * hour / 24).round(4)
hour_cos = np.cos(2 * np.pi * hour / 24).round(4)
print(f"Hour 0: sin={hour_sin[0]}, cos={hour_cos[0]}")
print(f"Hour 23: sin={hour_sin[4]}, cos={hour_cos[4]}")
Hour 0 and hour 23 should have similar sin/cos values.
Hour 0: sin=0.0, cos=1.0
Hour 23: sin=-0.2588, cos=0.9659
Question 5
Medium
Write code to engineer 5 features from a DataFrame with columns 'length', 'width', and 'height' of shipping boxes.
Think about area, volume, aspect ratios, and surface area.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'length': [30, 50, 20, 80],
    'width': [20, 30, 15, 40],
    'height': [15, 25, 10, 35]
})
df['volume'] = df['length'] * df['width'] * df['height']
df['surface_area'] = 2 * (df['length']*df['width'] + df['width']*df['height'] + df['length']*df['height'])
df['aspect_ratio_lw'] = df['length'] / df['width']
df['max_dimension'] = df[['length', 'width', 'height']].max(axis=1)
df['is_cubic'] = ((df['length'] == df['width']) & (df['width'] == df['height'])).astype(int)
print(df[['volume', 'surface_area', 'aspect_ratio_lw', 'max_dimension', 'is_cubic']].to_string())
Question 6
Medium
Rahul is building a model to predict exam scores. He has features: study_hours, sleep_hours, and attendance_percentage. His model R2 is 0.65. Suggest three engineered features that might improve it.
Think about interactions, ratios, and non-linear transformations that make domain sense.
1. study_to_sleep_ratio = study_hours / sleep_hours -- captures the balance between studying and rest. Too much of either can hurt scores. 2. study_x_attendance = study_hours * attendance_percentage -- interaction: studying is more effective if you also attend class. 3. sleep_deficit = max(0, 7 - sleep_hours) -- hours below recommended 7 hours; captures that insufficient sleep hurts performance but extra sleep beyond 7 may not help. Other options: study_hours_squared (diminishing returns), is_high_attendance (binary flag for > 80%).
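The three suggestions above can be sketched in pandas (the student records are made up for illustration):

```python
import pandas as pd
import numpy as np

# Illustrative student records
df = pd.DataFrame({'study_hours': [4, 8, 2],
                   'sleep_hours': [7, 5, 9],
                   'attendance_percentage': [90, 60, 95]})

df['study_to_sleep_ratio'] = (df['study_hours'] / df['sleep_hours']).round(2)
df['study_x_attendance'] = df['study_hours'] * df['attendance_percentage']
df['sleep_deficit'] = np.maximum(0, 7 - df['sleep_hours'])  # hours below 7
print(df.to_string())
```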
Question 7
Medium
What is the output?
from sklearn.feature_selection import VarianceThreshold
import numpy as np
X = np.array([
    [1, 0, 100],
    [2, 0, 200],
    [3, 0, 150],
    [4, 0, 300],
    [5, 0, 250]
])
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Selected shape: {X_selected.shape}")
print(f"Removed feature index: {[i for i, s in enumerate(selector.get_support()) if not s]}")
VarianceThreshold(threshold=0.0) removes features with zero variance (constant features).
Original shape: (5, 3)
Selected shape: (5, 2)
Removed feature index: [1]
Question 8
Hard
Why is feature selection (fit) on the full dataset before train-test split considered data leakage?
Think about what information from the test set leaks into the feature selection process.
When you fit feature selection on the full dataset (including test data), the selector uses statistical properties of the test set (correlations, variance, mutual information) to decide which features to keep. The selected features are optimized for the combined data, including the test set. This means the test set is no longer a truly unseen evaluation -- the model benefits from having features pre-selected to work well on test data. This inflates test metrics, making the model appear better than it truly is on genuinely new data.
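The leak-free pattern mirrors any other preprocessing step: split first, fit the selector on the training split, then transform both. A sketch on the Wine dataset (k=5 is an arbitrary illustrative choice):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Correct order: fit the selector on training data only...
selector = SelectKBest(f_classif, k=5).fit(X_train, y_train)

# ...then apply the same selection to both splits
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)
```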
Question 9
Hard
Write a function that takes a DataFrame with a 'text' column and adds 4 text-based features: word_count, avg_word_length, has_exclamation (binary), and uppercase_ratio (fraction of uppercase letters).
Use str accessor methods: split, len, count, and a lambda for uppercase ratio.
import pandas as pd
def add_text_features(df):
    df = df.copy()
    df['word_count'] = df['text'].str.split().str.len()
    df['avg_word_length'] = df['text'].apply(
        lambda x: sum(len(w) for w in x.split()) / len(x.split())
    ).round(2)
    df['has_exclamation'] = df['text'].str.contains('!').astype(int)
    df['uppercase_ratio'] = df['text'].apply(
        lambda x: sum(1 for c in x if c.isupper()) / len(x)
    ).round(4)
    return df

# Test
df = pd.DataFrame({'text': ['Hello World!', 'this is a test', 'URGENT: Read NOW']})
result = add_text_features(df)
print(result[['word_count', 'avg_word_length', 'has_exclamation', 'uppercase_ratio']].to_string())
Question 10
Hard
Kavitha has a dataset with 50 features. She wants to select features but is unsure whether relationships are linear or non-linear. What strategy should she use?
Use both linear and non-linear methods, then compare results.
Kavitha should use a multi-method strategy: (1) Compute Pearson correlation with the target to find linearly related features. (2) Compute mutual information to find non-linearly related features that correlation might miss. (3) Train a Random Forest and use feature_importances_ which capture both linear and non-linear relationships. (4) Run RFECV to find the optimal subset. (5) Keep features that are consistently ranked high across multiple methods. Features that appear in the top 10 of all methods are strong candidates; features that only appear in one method warrant further investigation.
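Steps (1), (2), and (5) can be sketched on a bundled sklearn dataset (the Diabetes data and the top-5 cutoff are illustrative choices; Kavitha would run this on her own 50 features):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import mutual_info_regression

data = load_diabetes()
X, y = data.data, data.target

# Linear view: absolute Pearson correlation with the target
corr = pd.Series([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])],
                 index=data.feature_names)

# Non-linear view: mutual information with the target
mi = pd.Series(mutual_info_regression(X, y, random_state=42),
               index=data.feature_names)

# Features ranked highly by both methods are strong candidates
top_corr = set(corr.nlargest(5).index)
top_mi = set(mi.nlargest(5).index)
print("Consistent across both methods:", sorted(top_corr & top_mi))
```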
Question 11
Hard
What is the output?
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[1, 2]])
poly3 = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly3.fit_transform(X)
print(f"Features: {X_poly.shape[1]}")
print(poly3.get_feature_names_out())
For 2 features with degree 3: all combinations up to degree 3 minus the bias term.
Features: 9
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2' 'x0^3' 'x0^2 x1' 'x0 x1^2' 'x1^3']
Question 12
Hard
In a Kaggle competition for house price prediction, Arjun uses raw features and gets rank 500. After extensive feature engineering, he reaches rank 50. What types of features likely made the biggest impact?
Think about domain-specific ratios, spatial features, and temporal features.
The biggest impact likely came from: (1) Spatial features: Distance to city center, proximity to schools/hospitals/metro, neighborhood average price (requires joining external data). (2) Interaction features: total_sqft * quality_rating, bedrooms_per_floor, bathroom-to-bedroom ratio. (3) Temporal features: age_of_property, years_since_renovation, price_trend_in_neighborhood. (4) Aggregated features: median_price_in_zipcode, average_sqft_price_in_area (target encoding of location). (5) Log/power transformations: log(price), log(area) for skewed distributions. Top Kaggle solutions in house price competitions consistently show that these domain-driven features matter more than algorithm tuning.
Multiple Choice Questions
MCQ 1
Which of the following is an example of feature engineering?
Answer: B
B is correct. Creating BMI from height and weight is feature engineering -- deriving a new, meaningful feature from existing ones. Training (A), splitting data (C), and tuning hyperparameters (D) are other ML pipeline steps, not feature engineering.
MCQ 2
What does pd.qcut() do?
Answer: B
B is correct. pd.qcut() creates bins where each bin contains approximately the same number of observations (quantile-based). pd.cut() (A) creates equal-width bins. qcut is better for skewed data where equal-width bins would create very unbalanced groups.
MCQ 3
What is data leakage?
Answer: B
B is correct. Data leakage occurs when features contain future information or target-derived information. The model appears accurate during evaluation but fails in production because the leaked information is not available when making real predictions.
MCQ 4
Which method measures both linear and non-linear feature-target relationships?
Answer: C
C is correct. Mutual information captures any statistical dependency between a feature and the target, including non-linear relationships. Pearson correlation (A) only captures linear relationships.
MCQ 5
What does PolynomialFeatures(degree=2, include_bias=False) create for input features [a, b]?
Answer: B
B is correct. With degree=2 and include_bias=False, PolynomialFeatures generates the original features, their squares, and their cross-product: [a, b, a^2, ab, b^2]. Option D would be the result with include_bias=True (includes the constant 1).
MCQ 6
Amit has two features with correlation 0.98. What should he consider doing?
Answer: B
B is correct. Features with 0.98 correlation are nearly identical (redundant). Keeping both adds noise and can cause multicollinearity in linear models without providing additional information. Remove the one that is less correlated with the target or harder to interpret.
MCQ 7
What is Recursive Feature Elimination (RFE)?
Answer: B
B is correct. RFE trains a model (e.g., Random Forest), ranks features by importance, removes the least important one, retrains, and repeats until the desired number of features remains. It is a wrapper method that uses model performance to guide selection.
MCQ 8
Which of the following is an embedded feature selection method?
Answer: C
C is correct. Lasso regularization performs feature selection as part of model training by setting unimportant feature coefficients to exactly zero. This is 'embedded' because selection happens during training. Correlation (A), mutual information (B), and variance threshold (D) are filter methods.
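Lasso's zeroing behavior is easy to see on synthetic data where only two of five features matter (the data, coefficients, and alpha=0.1 below are all illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two columns actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_.round(2))  # coefficients of the three irrelevant features end up at 0
```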
MCQ 9
Why is cyclical encoding (sin/cos) used for hour-of-day instead of raw hour values?
Answer: C
C is correct. Raw hour values treat 0 and 23 as very different (distance of 23), but they are actually 1 hour apart. Sin/cos encoding maps hours onto a circle where 0 and 23 are close. This is important for algorithms using distance (KNN, SVM).
MCQ 10
When should feature selection be performed relative to train-test split?
Answer: B
B is correct. Feature selection must be fit on training data only to prevent data leakage. Then use the same selected features (transform) on the test set. Fitting on the full dataset (A) leaks test set information into the selection process.
MCQ 11
Priya creates 200 engineered features from 20 original features using PolynomialFeatures(degree=3). Her model performance drops. What is the most likely cause?
Answer: B
B is correct. With 200 features (many of which are noise) and potentially limited samples, the model overfits to spurious patterns in the training data. The curse of dimensionality makes the feature space too sparse. She should use feature selection after generating polynomial features, or use a lower degree with interaction_only=True.
MCQ 12
Deepak's Random Forest shows that 'customer_id' is the most important feature for predicting churn. What is happening?
Answer: B
B is correct. Customer ID is a unique identifier with very high cardinality. Tree-based models can create individual splits for each customer, effectively memorizing the training data. This gives high importance but zero generalization. IDs, names, and other unique identifiers should always be removed before training.
MCQ 13
In a text classification task, which feature engineering approach would likely work best?
Answer: B
B is correct. TF-IDF captures the importance of individual words relative to the corpus, while text statistics capture structural properties (length, complexity). Together they provide both semantic and structural information. Raw character codes (A) are meaningless. Word count alone (C) is too limited. Converting to uppercase (D) adds no useful information.
MCQ 14
What is the advantage of mutual_info_regression over Pearson correlation for feature selection?
Answer: C
C is correct. Mutual information captures any type of statistical dependency (linear, quadratic, exponential, etc.), while Pearson correlation only detects linear relationships. A feature with a perfect y=x^2 relationship would have near-zero correlation but high mutual information. MI is slower than correlation (not A) and works for continuous targets (not B).
MCQ 15
What is a 'feature' in machine learning?
Answer: B
B is correct. A feature (also called attribute, variable, or column) is a measurable property of the data that serves as input to the model. Examples: age, income, height, word_count.
MCQ 16
Which of the following is an interaction feature?
Answer: C
C is correct. An interaction feature is created by combining (typically multiplying) two or more features: area * quality_score captures how area and quality jointly affect the target. Squared (A) and log (D) are single-feature transformations. BMI (B) is a ratio, which is a different type of engineered feature.
MCQ 17
Ankit creates 500 polynomial features from 10 original features. His validation accuracy drops from 85% to 72%. What happened?
Answer: B
B is correct. Adding 500 features (mostly noise) to a limited dataset causes overfitting. The model memorizes training data patterns in the high-dimensional space that do not generalize. He should use feature selection after generating polynomial features, use a lower degree, or use regularization.
MCQ 18
Which of the following is feature selection (not feature extraction)?
Answer: B
B is correct. Feature selection chooses a subset of existing features without transforming them. Selecting top 10 by correlation keeps the original features. PCA (A) and autoencoder (D) are feature extraction. Polynomial features (C) is feature creation.
MCQ 19
Neha has a feature 'date_joined' as a string like '2025-03-15'. What should she do before using it in a model?
Answer: B
B is correct. ML models cannot use raw date strings. Converting to datetime and extracting numerical features makes temporal information accessible to the model.
MCQ 20
What is the main advantage of tree-based feature importance over correlation for feature selection?
Answer: B
B is correct. Tree-based importance captures non-linear relationships and interactions between features. Correlation only measures linear pairwise relationships.
MCQ 21
Which sklearn class automatically finds the optimal number of features using cross-validation during RFE?
Answer: B
B is correct. RFECV (Recursive Feature Elimination with Cross-Validation) automatically finds the optimal number of features. RFE (A) requires you to specify the count manually.
MCQ 22
What is VarianceThreshold used for?
Answer: A
A is correct. VarianceThreshold removes features whose variance is below a threshold. Features with zero or near-zero variance carry no useful information.
Coding Challenges
Coding challenges coming soon.