Practice Questions — Ensemble Methods - Boosting (XGBoost, LightGBM, CatBoost)
Topic-Specific Questions
Question 1
Easy
What is ensemble learning? Give a real-world analogy.
Think about combining opinions from multiple people versus relying on one person.
Ensemble learning combines multiple individual models (often weak learners) to produce a single stronger prediction. Analogy: it is like asking 100 doctors for a diagnosis instead of just one. Each doctor makes occasional mistakes, but the majority opinion is usually more reliable than any single doctor's. Similarly, combining many imperfect models produces a more accurate and robust prediction than any single model alone.
Question 2
Easy
What is the difference between bagging and boosting?
One trains models independently, the other trains them sequentially.
Bagging: Trains multiple models independently on random subsets of data, then averages their predictions. Each model has equal weight. Reduces variance (overfitting). Example: Random Forest. Boosting: Trains models sequentially, where each new model focuses on correcting the errors of previous models. Later models may have higher weight. Reduces both bias and variance. Examples: XGBoost, LightGBM, CatBoost.
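The contrast can be sketched with scikit-learn, using RandomForestClassifier for bagging and GradientBoostingClassifier for boosting (synthetic data and untuned settings, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: independent trees on bootstrap samples, predictions averaged
bagging = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
# Boosting: sequential trees, each fitted to the previous ensemble's errors
boosting = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"Bagging (Random Forest) accuracy: {bagging.score(X_test, y_test):.3f}")
print(f"Boosting (GBM) accuracy: {boosting.score(X_test, y_test):.3f}")
```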
Question 3
Easy
What does the learning_rate parameter control in gradient boosting?
It controls how much each tree contributes to the final prediction.
The learning rate (also called eta or shrinkage) controls the contribution of each tree to the ensemble. With learning_rate=0.1, each tree's prediction is multiplied by 0.1 before being added. A smaller learning rate means each tree has less impact, requiring more trees for the same performance but typically generalizing better. Common values: 0.01 to 0.3.
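The shrinkage effect can be illustrated with a toy calculation; there are no real trees here, each "tree output" is just the current residual, an idealized stand-in:

```python
# Each round, a hypothetical perfect tree predicts the remaining residual,
# and the ensemble adds only a learning_rate-sized fraction of it.
target = 10.0
prediction = 0.0          # start from a base prediction of 0
learning_rate = 0.1

for _ in range(50):
    residual = target - prediction             # what the ensemble still gets wrong
    tree_output = residual                     # an idealized tree predicts the residual exactly
    prediction += learning_rate * tree_output  # shrink the tree's contribution

print(round(prediction, 3))  # approaches 10.0 geometrically, never overshooting
```

With a smaller learning rate the same gap closes more slowly, which is why more trees are needed.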
Question 4
Easy
Name three advantages of XGBoost over plain decision trees.
Think about regularization, handling missing values, and combining multiple trees.
1. Ensemble of trees: XGBoost combines hundreds of trees, each correcting the errors of the previous ones, producing much more accurate predictions than a single tree. 2. Regularization: L1 and L2 regularization on leaf weights prevent overfitting, which single decision trees are prone to. 3. Missing value handling: XGBoost learns the optimal direction for missing values at each split, eliminating the need for manual imputation.
Question 5
Easy
What is the output?
from xgboost import XGBClassifier
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])
xgb = XGBClassifier(n_estimators=10, random_state=42, eval_metric='logloss')
xgb.fit(X, y)
print(xgb.predict(np.array([[2, 3], [6, 7]])))
Points near (1,2) and (3,4) should be class 0. Points near (5,6) and (7,8) should be class 1.
[0 1]
Question 6
Medium
Explain how gradient boosting builds trees sequentially. What does each new tree try to learn?
Each tree learns the errors (residuals) of the current ensemble.
Gradient boosting starts with a simple prediction (e.g., the mean). Then: (1) Compute residuals = actual - current prediction. (2) Train a new tree to predict these residuals (the errors). (3) Add the new tree's predictions (multiplied by the learning rate) to the running prediction. (4) Repeat. Each new tree does not predict the original target -- it predicts what the current ensemble got wrong. By adding corrections iteratively, the ensemble gradually improves its predictions.
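The loop above can be sketched from scratch, with scikit-learn regression trees standing in for the booster's base learners (synthetic data; illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # step 0: start from the mean
trees = []
for _ in range(100):
    residuals = y - prediction          # (1) errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # (2) fit the errors
    prediction += learning_rate * tree.predict(X)                # (3) shrunken correction
    trees.append(tree)                  # (4) repeat

print(f"Final training MSE: {np.mean((y - prediction)**2):.4f}")
```

Each tree in `trees` was fitted to residuals, not to `y` itself; the final prediction is the mean plus the sum of all scaled corrections.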
Question 7
Medium
What is the difference between depth-wise and leaf-wise tree growth? Which algorithm uses which?
XGBoost grows all leaves at the same level. LightGBM grows the leaf that reduces loss the most.
Depth-wise (level-wise): Grows all leaves at the same depth before moving to the next level. Produces balanced trees. Used by XGBoost. Leaf-wise: Grows the single leaf that achieves the maximum loss reduction, regardless of depth. Produces unbalanced trees that can be deeper on one side. Used by LightGBM. Leaf-wise growth is typically more accurate per tree (larger loss reduction per split) but can overfit more easily, which is why LightGBM uses num_leaves as the primary complexity control.
Question 8
Medium
Why is XGBoost's native handling of missing values an advantage? How does it work?
XGBoost learns which direction (left or right) to send missing values at each tree split.
When a feature has missing values at a particular split, XGBoost tries sending all missing values to the left child and computes the gain, then tries sending them to the right child and computes the gain. It picks the direction that gives the best gain. This means: (1) You do not need to impute missing values manually. (2) The optimal treatment of missing data is learned from the data itself. (3) Different splits can handle missing values differently (left at one split, right at another). This is better than a single imputation strategy because the best treatment of missingness can vary across the feature space.
Question 9
Medium
Write code to train an XGBoost classifier with early stopping on the breast cancer dataset. Print the number of trees used and the test accuracy.
Use XGBClassifier with n_estimators=500, early_stopping_rounds=20, and eval_set.
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42)
xgb = XGBClassifier(
n_estimators=500,
learning_rate=0.05,
max_depth=5,
eval_metric='logloss',
early_stopping_rounds=20,
random_state=42
)
xgb.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print(f"Trees used: {xgb.best_iteration + 1} out of 500")
print(f"Test accuracy: {xgb.score(X_test, y_test):.4f}")
Question 10
Medium
What are the three main advantages of LightGBM over XGBoost?
Think about speed, tree growth strategy, and handling large datasets.
1. Faster training: Histogram-based splitting is O(n), compared to XGBoost's exact sort-based splitting, which is O(n log n). On large datasets, LightGBM can be 5-10x faster. 2. Leaf-wise growth: Produces more accurate trees per iteration because it always splits the most informative leaf, unlike XGBoost's depth-wise growth, which splits all leaves at the same depth. 3. Memory efficiency: GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) reduce the effective dataset size, allowing LightGBM to handle very large datasets that XGBoost might struggle with.
Question 11
Hard
Explain the concept of regularization in XGBoost. What do reg_alpha (L1) and reg_lambda (L2) regularize?
They regularize the leaf weight values, not the feature weights like in linear models.
reg_lambda (L2) adds a penalty proportional to the square of leaf weights to the objective function. This shrinks leaf values toward zero, preventing any single leaf from having an extreme prediction. It makes the model smoother and more robust. reg_alpha (L1) adds a penalty proportional to the absolute value of leaf weights. This encourages sparsity -- some leaves may get exactly zero weight, effectively pruning parts of the tree. In XGBoost's objective:
Objective = Loss + gamma*T + 0.5*lambda*sum(w^2) + alpha*sum(|w|)
where T is the number of leaves and w are the leaf weights. gamma penalizes tree complexity (more leaves), while lambda and alpha penalize large leaf values.
Question 12
Hard
What is CatBoost's 'ordered boosting' and what problem does it solve?
It addresses target leakage that occurs when computing target statistics for categorical encoding.
In standard gradient boosting with target encoding, the target statistics (e.g., mean target per category) are computed using all training data including the current sample. This creates prediction shift -- a form of target leakage where the model's prediction for a sample is partially based on that sample's own target value. CatBoost's ordered boosting solves this by maintaining a random permutation of the training data. When computing target statistics for sample i, it only uses samples that come before i in the permutation. This ensures no sample's target value leaks into its own encoding, producing more honest estimates and better generalization.
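The permutation idea can be sketched as an expanding-mean target encoding in pandas (a toy table with a single fixed permutation, standing in for CatBoost's several random ones):

```python
import pandas as pd

# Ordered target statistics: encode the category of row i using only the
# target values of rows that come BEFORE i in the permutation.
df = pd.DataFrame({'city': ['A', 'A', 'B', 'A', 'B', 'B'],
                   'target': [1, 0, 1, 1, 0, 1]})
prior = df['target'].mean()  # fallback when a category has no history yet

encoded = []
for i in range(len(df)):
    history = df.iloc[:i]                          # rows before i only
    same_cat = history[history['city'] == df.loc[i, 'city']]
    if len(same_cat) == 0:
        encoded.append(prior)                      # first occurrence: use the prior
    else:
        encoded.append(same_cat['target'].mean())  # never uses row i's own target
df['city_encoded'] = encoded
print(df)
```

Note how the first 'A' row gets the global prior, and later 'A' rows get the mean of earlier 'A' targets only, so no row's own label leaks into its encoding.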
Question 13
Hard
Write code to use GridSearchCV to find the best max_depth and learning_rate for XGBoost on the breast cancer dataset.
Define a param_grid with max_depth (3, 5, 7) and learning_rate (0.01, 0.05, 0.1). Use GridSearchCV with cv=5.
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42)
param_grid = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.05, 0.1]
}
grid = GridSearchCV(
XGBClassifier(n_estimators=200, random_state=42, eval_metric='logloss'),
param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.score(X_test, y_test):.4f}")
Question 14
Hard
Vikram is deciding between XGBoost, LightGBM, and CatBoost for his project. His dataset has 5 million rows, 200 features, and 30 of those features are categorical. Which algorithm should he use and why?
Consider dataset size, number of categorical features, and training speed.
CatBoost is the best choice because: (1) 30 categorical features is a significant portion. CatBoost handles them natively with ordered target encoding, avoiding the need for one-hot encoding (which would create potentially hundreds of sparse features) or label encoding (which creates false ordinal relationships). (2) While LightGBM is generally faster, CatBoost's native categorical handling avoids the preprocessing overhead. (3) On large datasets with many categoricals, CatBoost typically achieves the best accuracy with minimal tuning. Alternative: If training speed is the top priority and categoricals have low cardinality (few unique values), LightGBM with one-hot encoding could be faster.
Question 15
Hard
Explain what subsample and colsample_bytree do in XGBoost and why they help prevent overfitting.
They introduce randomness into each tree, similar to Random Forest's approach.
subsample (0.5-1.0): Each tree is trained on a random fraction of the training data. With subsample=0.8, each tree sees only 80% of the rows, making each tree slightly different and reducing the chance of memorizing specific training patterns. colsample_bytree (0.3-1.0): Each tree only considers a random subset of features. With colsample_bytree=0.8, each tree uses only 80% of the features, forcing the model to learn from different feature combinations. Both introduce stochasticity that prevents the model from overfitting to specific patterns in the full training set. This is the same principle as Random Forest's random subspace method. Combined, they significantly reduce overfitting, especially on small or noisy datasets.
Question 16
Easy
What is the output?
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=100, max_depth=3)
print(xgb.get_params()['max_depth'])
print(xgb.get_params()['n_estimators'])
get_params() returns all hyperparameters as a dictionary.
3
100
Question 17
Medium
Priya has a dataset with 50 numerical features and 10 categorical features (each with 100+ categories). She needs the best accuracy. Which boosting library should she choose and why?
Consider how each library handles high-cardinality categorical features.
CatBoost is the best choice because: (1) 10 categorical features with 100+ categories each would require one-hot encoding for XGBoost/LightGBM, creating 1000+ sparse features. (2) CatBoost handles these natively using ordered target encoding, preserving the information without the sparsity problem. (3) CatBoost's encoding is provably unbiased (no target leakage), unlike manual target encoding. For XGBoost/LightGBM, one-hot encoding high-cardinality categoricals creates very sparse data that can hurt tree-based model performance.
Question 18
Hard
Write code to train XGBoost on the breast cancer dataset with 5-fold cross-validation and print the mean and standard deviation of accuracy scores.
Use cross_val_score with XGBClassifier and cv=5.
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
data = load_breast_cancer()
xgb = XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.1,
random_state=42, eval_metric='logloss')
scores = cross_val_score(xgb, data.data, data.target, cv=5, scoring='accuracy')
print(f"CV Scores: {scores.round(4)}")
print(f"Mean: {scores.mean():.4f} +/- {scores.std():.4f}")
Question 19
Easy
What does n_estimators mean in XGBoost?
It is the number of boosting rounds.
n_estimators is the number of boosting rounds, which equals the number of trees in the ensemble. Each round adds one decision tree that corrects the errors of the previous trees. More trees generally improve performance up to a point, after which they cause overfitting. Use early stopping to find the optimal number automatically.
Question 20
Easy
What is a boosting "round" or "iteration" in XGBoost?
Each round adds one more tree to the ensemble.
One boosting round trains one additional decision tree and adds its predictions (scaled by the learning rate) to the current ensemble. After n_estimators rounds, the ensemble has n_estimators trees.
Mixed & Application Questions
Question 1
Easy
Why is gradient boosting called 'gradient' boosting?
Think about how it uses gradients to find the direction of improvement.
It is called 'gradient' boosting because each new tree is trained on the negative gradient of the loss function (i.e., the residuals). This is equivalent to performing gradient descent in function space: each tree is a step that reduces the overall loss. Just as gradient descent finds the minimum of a function by following the negative gradient, gradient boosting finds the best prediction function by adding trees that follow the negative gradient of the loss.
Question 2
Easy
What is the output?
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=100, max_depth=5)
print(type(xgb))
print(xgb.get_params()['n_estimators'])
print(xgb.get_params()['max_depth'])
XGBClassifier is a class, and get_params() returns the hyperparameters.
<class 'xgboost.sklearn.XGBClassifier'>
100
5
Question 3
Easy
What is early stopping and why is it important for gradient boosting?
It stops training when the model stops improving on validation data.
Early stopping monitors the model's performance on a validation set during training. If the performance does not improve for a specified number of rounds (early_stopping_rounds), training stops. It is important because: (1) It automatically finds the optimal number of trees, preventing overfitting. (2) It saves computation time by not training unnecessary trees. (3) It acts as a form of regularization. Without early stopping, you must guess n_estimators, and too many trees cause overfitting while too few cause underfitting.
Question 4
Medium
What is the output?
from xgboost import XGBClassifier
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])
xgb = XGBClassifier(n_estimators=50, random_state=42, eval_metric='logloss')
xgb.fit(X, y)
probs = xgb.predict_proba(np.array([[4, 5]]))
print(f"Shape: {probs.shape}")
print(f"Class probabilities: {probs.round(3)}")
predict_proba returns probabilities for each class. Point (4,5) is between the two clusters.
Shape: (1, 2)
Class probabilities: [[0.3xx 0.6xx]] (approximately; class 1 has the higher probability since (4,5) is closer to the class 1 region)
Question 5
Medium
Write code to extract and display the top 5 most important features from a trained XGBoost model on the Iris dataset.
Use xgb.feature_importances_ and sort them.
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
xgb = XGBClassifier(n_estimators=100, random_state=42, eval_metric='mlogloss')
xgb.fit(iris.data, iris.target)
importances = xgb.feature_importances_
sorted_idx = np.argsort(importances)[::-1]
print("Feature importance ranking:")
for i, idx in enumerate(sorted_idx):
print(f" {i+1}. {iris.feature_names[idx]}: {importances[idx]:.4f}")
Question 6
Medium
What does scale_pos_weight do in XGBoost and when should you use it?
It handles class imbalance by weighting the minority class.
scale_pos_weight controls the balance of positive and negative weights in the loss function. Setting it to the ratio of negative to positive samples (e.g., 950/50 = 19 for a 95%/5% split) tells XGBoost to weight positive (minority) class errors 19x more than negative class errors. Use it when your dataset has class imbalance and the minority class is important (e.g., fraud detection where frauds are rare but must be caught). Without it, the model may learn to predict the majority class for everything, achieving high accuracy but zero recall on the minority class.
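The rule of thumb can be computed directly (toy label array for a 95%/5% split):

```python
import numpy as np

y = np.array([0] * 950 + [1] * 50)  # 95% negative, 5% positive
n_neg = (y == 0).sum()
n_pos = (y == 1).sum()
scale_pos_weight = n_neg / n_pos     # negatives / positives
print(scale_pos_weight)              # 19.0
```

The resulting value is then passed as `XGBClassifier(scale_pos_weight=scale_pos_weight, ...)`.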
Question 7
Medium
Ananya trains XGBoost with learning_rate=0.5 and 50 trees. Her training accuracy is 99% but test accuracy is 72%. What should she change?
The large gap between training and test accuracy indicates overfitting.
The 27% gap between train (99%) and test (72%) accuracy indicates severe overfitting. Ananya should: (1) Reduce learning_rate to 0.05 and increase n_estimators (with early stopping). (2) Reduce max_depth (e.g., from default 6 to 3-4). (3) Add regularization: increase reg_lambda (L2) or reg_alpha (L1). (4) Increase subsampling: set subsample=0.7 and colsample_bytree=0.7. (5) Increase min_child_weight: prevents splits on very small leaf groups. All these changes reduce model complexity and the training accuracy will drop, but the test accuracy should improve.
Question 8
Hard
On tabular/structured data, gradient boosting often outperforms deep learning. Why?
Think about inductive biases, sample efficiency, and the structure of tabular data.
Gradient boosting outperforms deep learning on tabular data because: (1) Better inductive bias: Trees naturally handle feature interactions, different scales, and missing values. Neural networks require careful feature engineering, scaling, and architecture design. (2) Sample efficiency: Trees learn effectively from thousands of samples. Deep learning typically needs millions. Most real-world tabular datasets have thousands to hundreds of thousands of rows. (3) No spatial/temporal structure: Deep learning excels when data has spatial structure (images -- CNNs) or temporal structure (text, time series -- RNNs/Transformers). Tabular data has no such structure for deep learning to exploit. (4) Regularization: XGBoost has well-understood regularization. Deep learning regularization is harder to tune. (5) Training speed: Boosting trains in minutes vs hours for deep learning.
Question 9
Hard
Write code that trains XGBoost and LightGBM on the same dataset and compares their accuracy and training time.
Use time.time() to measure training duration. Use make_classification for the dataset.
import numpy as np
import time
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=20000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# XGBoost
start = time.time()
xgb = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
random_state=42, eval_metric='logloss')
xgb.fit(X_train, y_train)
xgb_time = time.time() - start
xgb_acc = xgb.score(X_test, y_test)
# LightGBM
start = time.time()
lgbm = lgb.LGBMClassifier(n_estimators=200, num_leaves=31, learning_rate=0.1,
random_state=42, verbose=-1)
lgbm.fit(X_train, y_train)
lgb_time = time.time() - start
lgb_acc = lgbm.score(X_test, y_test)
print(f"XGBoost: accuracy={xgb_acc:.4f}, time={xgb_time:.3f}s")
print(f"LightGBM: accuracy={lgb_acc:.4f}, time={lgb_time:.3f}s")
print(f"LightGBM is {xgb_time/lgb_time:.1f}x faster")
Question 10
Hard
What is the relationship between max_depth and num_leaves in LightGBM? Why is num_leaves more important?
LightGBM uses leaf-wise growth. The number of leaves directly controls tree complexity.
In LightGBM, num_leaves is the primary tree complexity control because LightGBM grows trees leaf-wise (always splitting the leaf with the maximum gain), not depth-wise. A tree with num_leaves=31 can have varying depth -- some branches may be deep while others are shallow. max_depth is secondary and acts as a safety limit. The relationship: a balanced binary tree with depth d has 2^d leaves. So num_leaves should generally be less than 2^max_depth to avoid overfitting. For example, max_depth=7 supports up to 128 leaves, but setting num_leaves=31 constrains the tree to be much simpler. Setting max_depth=-1 (unlimited) and controlling complexity solely through num_leaves is the recommended approach.
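The arithmetic can be checked directly:

```python
# A balanced binary tree of depth d has 2**d leaves, so num_leaves should
# stay well below 2**max_depth. LightGBM's default num_leaves is 31.
for max_depth in (5, 7, 10):
    limit = 2 ** max_depth
    print(f"max_depth={max_depth}: up to {limit} leaves; "
          f"num_leaves=31 {'fits' if 31 < limit else 'exceeds the limit'}")
```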
Question 11
Hard
Write a complete churn prediction pipeline: generate a synthetic dataset with 5 features, engineer 2 new features, train XGBoost with early stopping, and print accuracy, AUC, and top 3 features.
Create features like tenure*monthly_charges and charges_per_tenure. Use classification_report and roc_auc_score.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
np.random.seed(42)
n = 1500
df = pd.DataFrame({
'tenure': np.random.randint(1, 72, n),
'monthly_charges': np.random.normal(60, 20, n).round(2),
'total_charges': np.random.normal(2500, 1200, n).round(2),
'contract_length': np.random.choice([1, 12, 24], n, p=[0.5, 0.3, 0.2]),
'support_calls': np.random.poisson(2, n)
})
churn_prob = (0.3*(df['contract_length']==1) + 0.2*(df['tenure']<12) + 0.1*(df['support_calls']>3)).clip(0,1)
df['churned'] = (np.random.random(n) < churn_prob).astype(int)
# Feature engineering
df['charges_per_tenure'] = df['total_charges'] / (df['tenure'] + 1)
df['tenure_x_contract'] = df['tenure'] * df['contract_length']
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
xgb = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=5,
eval_metric='auc', early_stopping_rounds=20, random_state=42)
xgb.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_pred = xgb.predict(X_test)
y_prob = xgb.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Trees used: {xgb.best_iteration + 1}")
top3 = np.argsort(xgb.feature_importances_)[-3:][::-1]
print("Top 3 features:")
for idx in top3:
print(f" {X.columns[idx]}: {xgb.feature_importances_[idx]:.4f}")
Question 12
Hard
Explain what colsample_bytree, colsample_bylevel, and colsample_bynode do in XGBoost. How do they differ?
Each controls feature subsampling at a different granularity: per tree, per depth level, or per split.
colsample_bytree: Randomly samples a fraction of features once per tree. All splits in that tree use the same feature subset. colsample_bylevel: Randomly samples features at each depth level within a tree. Different depths may use different features. colsample_bynode: Randomly samples features at each individual split. Maximum randomness -- each split considers a different feature subset. They can be combined multiplicatively: if colsample_bytree=0.8 and colsample_bylevel=0.8, each level uses 64% (0.8 * 0.8) of the original features. More aggressive subsampling increases diversity among trees and reduces overfitting, but too much can hurt performance.
Question 13
Easy
Can you use XGBoost for regression problems, or is it only for classification?
Think about XGBRegressor.
XGBoost works for both classification and regression. Use XGBClassifier for classification (predicting categories) and XGBRegressor for regression (predicting continuous values). The only difference is the loss function: log loss for classification, squared error for regression. The tree-building process is the same.
Question 14
Medium
What is the output?
from xgboost import XGBClassifier
import numpy as np
xgb = XGBClassifier(n_estimators=10, random_state=42, eval_metric='logloss')
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])
xgb.fit(X, y)
print(xgb.predict(np.array([[2.5]])))
print(xgb.predict_proba(np.array([[2.5]])).round(3))
2.5 is exactly between the 0 and 1 regions. The model should be uncertain.
[1] (or [0]; it is a borderline case)
Probabilities close to [[0.45 0.55]] (near 50/50, reflecting uncertainty)
Multiple Choice Questions
MCQ 1
Which of the following is a boosting algorithm?
Answer: C
C is correct. XGBoost is a boosting algorithm that trains trees sequentially to correct errors. Random Forest (A) is bagging. K-Means (B) is clustering. PCA (D) is dimensionality reduction.
MCQ 2
What does 'early stopping' prevent in gradient boosting?
Answer: B
B is correct. Early stopping halts training when validation performance stops improving, preventing the model from training too many trees and overfitting to the training data.
MCQ 3
In gradient boosting, what does each new tree try to predict?
Answer: B
B is correct. Each new tree is trained on the residuals (actual - current prediction), meaning it learns to correct the errors of all previous trees combined.
MCQ 4
Which library handles categorical features natively without manual encoding?
Answer: C
C is correct. CatBoost (Categorical Boosting) was specifically designed to handle categorical features natively using ordered target encoding. XGBoost and LightGBM traditionally require manual encoding (recent versions of both add some native categorical support, but CatBoost's handling is the most mature).
MCQ 5
Which parameter should you always pair with a high n_estimators value in XGBoost?
Answer: B
B is correct. With high n_estimators (e.g., 5000), early_stopping_rounds prevents overfitting by stopping training when validation performance plateaus. Without it, all trees are trained regardless of whether they improve performance.
MCQ 6
What is the primary advantage of LightGBM over XGBoost?
Answer: C
C is correct. LightGBM's histogram-based splitting is O(n) vs XGBoost's O(n log n), making it much faster on large datasets. CatBoost (B) is the one with native categorical handling. All algorithms benefit from tuning (D is wrong).
MCQ 7
How does XGBoost handle missing values?
Answer: C
C is correct. At each split, XGBoost tries sending missing values both left and right, picks the direction with the best gain, and saves this decision. Different splits can handle missing values differently, adapting to the data.
MCQ 8
What happens if you set learning_rate=1.0 with many trees?
Answer: B
B is correct. With learning_rate=1.0, each tree's full prediction is added to the ensemble. This is very aggressive -- the model quickly memorizes training data and overfits. Lower learning rates (0.01-0.1) with early stopping are recommended.
MCQ 9
Meera has 95% class 0 and 5% class 1 in her dataset. Which XGBoost parameter helps with this?
Answer: C
C is correct. scale_pos_weight handles class imbalance by weighting positive (minority) class samples higher. Setting it to 95/5 = 19 makes the model pay 19x more attention to the minority class, preventing it from always predicting class 0.
MCQ 10
In LightGBM, which parameter is the PRIMARY control for tree complexity?
Answer: C
C is correct. LightGBM uses leaf-wise tree growth, so num_leaves (default 31) is the primary control for how complex each tree can be. max_depth is secondary (set to -1 by default). This is different from XGBoost where max_depth is primary.
MCQ 11
Why does gradient boosting on tabular data often outperform deep learning?
Answer: B
B is correct. Deep learning excels on data with spatial structure (images/CNNs) or sequential structure (text/RNNs). Tabular data has no such structure. Trees naturally handle mixed feature types, different scales, missing values, and feature interactions without the extensive preprocessing that neural networks require.
MCQ 12
What is the relationship between learning_rate and n_estimators in gradient boosting?
Answer: B
B is correct. Lower learning rate means each tree contributes less, so more trees are needed. The total 'model capacity' is approximately learning_rate * n_estimators. A common strategy: set learning_rate=0.05, n_estimators=5000, and use early stopping to find the optimal tree count.
MCQ 13
Rohan runs XGBoost with subsample=0.7, colsample_bytree=0.8, and colsample_bylevel=0.9. What fraction of features does each split see?
Answer: B
B is correct. subsample controls row sampling (not feature sampling). colsample_bytree and colsample_bylevel are multiplied together for feature sampling: 0.8 * 0.9 = 0.72, meaning each split at each level sees 72% of the original features. Row sampling (subsample=0.7) means each tree trains on 70% of the data, but this is independent of feature sampling.
MCQ 14
What problem does CatBoost's 'ordered boosting' solve?
Answer: B
B is correct. In standard gradient boosting with target encoding, each sample's encoding uses the target values of all samples, including itself (target leakage). Ordered boosting uses random permutations to ensure that when computing target statistics for a sample, only previous samples in the permutation are used. This produces unbiased estimates and better generalization.
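A simplified sketch of the idea (the smoothing constants `prior` and `a` are illustrative; CatBoost's actual implementation averages over multiple permutations):

```python
def ordered_target_stats(categories, targets, permutation, prior=0.5, a=1.0):
    """Encode each sample using ONLY samples that come before it in the
    permutation: (sum_prev + a * prior) / (count_prev + a). A sample's
    own target never leaks into its encoding."""
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in permutation:
        c = categories[i]
        encoded[i] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + targets[i]    # update AFTER encoding
        counts[c] = counts.get(c, 0) + 1
    return encoded

cats = ["red", "red", "blue", "red"]
y = [1, 0, 1, 1]
enc = ordered_target_stats(cats, y, permutation=[0, 1, 2, 3])
print(enc)
```

Note that the first occurrence of each category gets only the prior, since no earlier samples of that category exist in the permutation.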
MCQ 15
Which of the following is an example of bagging?
Answer: B
B is correct. Random Forest is the classic bagging algorithm: it trains many decision trees independently on random bootstrap samples of the data and averages their predictions. XGBoost (A), LightGBM (C), and CatBoost (D) are all boosting algorithms.
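The bagging recipe (bootstrap sample, train independently, average) can be sketched with the simplest possible base learner; real bagging like Random Forest uses decision trees instead of a mean predictor:

```python
import random

def bagged_predict(y, n_models=200, seed=0):
    """Bagging sketch: each 'model' predicts the mean of its own
    bootstrap sample; predictions are averaged at the end."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(y) for _ in y]       # bootstrap: sample with replacement
        preds.append(sum(sample) / len(sample))   # each model trained independently
    return sum(preds) / len(preds)                # aggregate by averaging

print(bagged_predict([2.0, 4.0, 6.0, 8.0]))
```

The key contrast with boosting: no model here sees the errors of any other model, and all models get equal weight in the average.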
MCQ 16
What is the effect of decreasing the learning_rate in gradient boosting from 0.3 to 0.01?
Answer: C
C is correct. A lower learning rate means each tree's prediction is scaled down more (multiplied by 0.01 instead of 0.3). More trees are needed to reach the same model capacity, but the ensemble is more robust and generalizes better. Training is slower (A is wrong) and more trees are needed (B is wrong).
MCQ 17
What is the purpose of reg_lambda (L2 regularization) in XGBoost?
Answer: B
B is correct. reg_lambda adds L2 regularization on leaf weights: lambda * sum(w^2). This shrinks leaf values toward zero, preventing any leaf from making extreme predictions. Higher reg_lambda = more regularization = smoother, more conservative model.
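The shrinkage follows directly from XGBoost's closed-form optimal leaf weight, w* = -G / (H + lambda), where G and H are the summed gradients and Hessians of the samples in the leaf (the gradient values below are toy numbers):

```python
def leaf_weight(grads, hess, reg_lambda):
    # XGBoost's optimal leaf weight: w* = -G / (H + lambda).
    # Larger reg_lambda pushes the weight toward zero.
    G, H = sum(grads), sum(hess)
    return -G / (H + reg_lambda)

g = [-2.0, -1.5, -2.5]   # toy gradients for the samples in one leaf
h = [1.0, 1.0, 1.0]      # toy Hessians
for lam in (0.0, 1.0, 10.0):
    print(lam, leaf_weight(g, h, lam))
```

With lambda=0 the leaf fully fits its gradients (weight 2.0); with lambda=10 the same leaf is damped to below 0.5.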
MCQ 18
What does the term "weak learner" mean in ensemble methods?
Answer: B
B is correct. A weak learner performs only marginally better than random chance. Boosting combines many weak learners (shallow trees) to create a strong learner.
MCQ 19
What is the typical maximum depth of individual trees in gradient boosting?
Answer: B
B is correct. Gradient boosting uses shallow trees (depth 3-10). The ensemble of many shallow trees captures complexity without individual trees overfitting.
MCQ 20
Why does XGBoost use second-order gradients (Hessian) while standard gradient boosting uses only first-order?
Answer: B
B is correct. XGBoost uses a second-order Taylor expansion. The Hessian provides curvature information for better splits and natural regularization.
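For squared loss the effect is easy to see numerically: the second derivative is constant (h = 2), so the second-order Taylor expansion XGBoost optimizes reproduces the true loss exactly, while the first-order expansion does not (the values below are arbitrary toy numbers):

```python
def squared_loss(y, pred):
    return (y - pred) ** 2

y, pred, f = 3.0, 1.0, 0.5   # true label, current prediction, tree output
g = 2 * (pred - y)           # first derivative of the loss w.r.t. pred
h = 2.0                      # second derivative (constant for squared loss)

first_order  = squared_loss(y, pred) + g * f                    # gradient only
second_order = squared_loss(y, pred) + g * f + 0.5 * h * f * f  # XGBoost-style
true_loss    = squared_loss(y, pred + f)
print(first_order, second_order, true_loss)
```

For losses with non-constant curvature (e.g. log loss) the second-order expansion is no longer exact, but it remains a much better local approximation than the gradient alone.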
MCQ 21
Which of the following is TRUE about gradient boosting?
Answer: B
B is correct. Gradient boosting trains trees sequentially, each fitting the residuals of the current ensemble.
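The sequential residual-fitting loop can be sketched end-to-end with depth-1 trees (stumps) on a tiny 1-D dataset; this is a minimal illustration, not how any library implements it:

```python
def fit_stump(X, r):
    """Fit a depth-1 regression tree: pick the threshold that
    minimizes squared error on the residuals r."""
    best = None
    for t in sorted(set(X)):
        left = [ri for xi, ri in zip(X, r) if xi <= t]
        right = [ri for xi, ri in zip(X, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(X, y, n_rounds=50, lr=0.1):
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient of squared loss
        stump = fit_stump(X, resid)                   # each tree fits current residuals
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, X)]
    return pred

X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
pred = boost(X, y)
print(pred)
```

Each round depends on the predictions of all previous rounds, which is why boosting cannot be trivially parallelized across trees the way bagging can.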
MCQ 22
Deepa wants to speed up LightGBM training without losing accuracy. Which parameter should she increase?
Answer: C
C is correct. Increasing min_child_samples prevents very small leaves, reducing complexity and training time while acting as regularization.
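For context, a hedged sketch of a native-API LightGBM parameter dict (the values are illustrative, not tuned; `min_data_in_leaf` is the native name aliased by `min_child_samples` in the sklearn wrapper):

```python
# Illustrative LightGBM parameters (native API names)
params = {
    "objective": "regression",
    "learning_rate": 0.05,
    "num_leaves": 63,          # LightGBM grows leaf-wise; cap the leaf count
    "min_data_in_leaf": 100,   # alias of min_child_samples; larger = fewer, bigger leaves
    "feature_fraction": 0.8,   # analogous to colsample_bytree
    "bagging_fraction": 0.7,   # analogous to subsample
    "bagging_freq": 1,         # perform row bagging every iteration
}
```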
Coding Challenges
Coding challenges coming soon.
Need to Review the Concepts?
Go back to the detailed notes for this chapter.
Read Chapter Notes