Chapter 12 Intermediate 52 Questions

Practice Questions — Dimensionality Reduction: PCA and t-SNE


Topic-Specific Questions

Question 1
Easy
What is dimensionality reduction? Give a simple analogy.
Think of compressing information from many features to fewer features.
Dimensionality reduction transforms high-dimensional data (many features) into lower-dimensional data (fewer features) while preserving as much useful information as possible. Analogy: It is like summarizing a 500-page book into a 10-page summary. You lose some details, but the main ideas are preserved.
Question 2
Easy
What is the curse of dimensionality?
Think about what happens to data density as the number of features increases exponentially.
The curse of dimensionality refers to problems that arise when working with high-dimensional data: (1) Data becomes extremely sparse because the volume of space grows exponentially with dimensions. (2) Distance between points converges to roughly the same value, making distance-based algorithms ineffective. (3) More training data is needed to maintain the same density, which is often impractical.
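The distance-concentration effect can be checked with a small simulation. This is an illustrative sketch (the helper name, point count, and seed are arbitrary choices, not from the chapter):

```python
import numpy as np

def distance_spread(n_dims, n_points=100, seed=0):
    # Sample uniform points in the unit hypercube and return the ratio
    # of the farthest to the nearest pairwise distance.
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, n_dims))
    sq = (X ** 2).sum(axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None)
    d = np.sqrt(d2[~np.eye(n_points, dtype=bool)])  # off-diagonal pairs only
    return d.max() / d.min()

# The max/min ratio collapses toward 1 as dimensionality grows, which is
# why distance-based methods like KNN degrade in high dimensions.
for dims in [2, 10, 100, 1000]:
    print(f"{dims:4d} dims: max/min distance ratio = {distance_spread(dims):.2f}")
```

In 2D the nearest and farthest pairs differ by orders of magnitude; by 1000 dimensions the ratio is close to 1, so "nearest" neighbors are barely nearer than anything else.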
Question 3
Easy
What is the difference between feature selection and feature extraction?
One keeps original features; the other creates new ones.
Feature selection chooses a subset of the original features and discards the rest. The selected features are unchanged and remain interpretable. Feature extraction creates new features by combining existing ones (e.g., PCA creates linear combinations). The new features are typically less interpretable but can capture more information in fewer dimensions.
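The contrast can be seen side by side on Iris. A minimal sketch: SelectKBest with the ANOVA F-score is just one example of a selection method, chosen here for brevity:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Feature selection: keep 2 of the 4 original columns, values unchanged.
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = np.array(iris.feature_names)[selector.get_support()]
print("Selected original features:", list(kept))

# Feature extraction: 2 new columns, each a mix of all 4 originals.
pca = PCA(n_components=2).fit(X)
print("PC1 is built from all 4 features:", pca.components_[0].round(2))
```

The selected columns are copies of original features (still interpretable); each principal component has a non-zero loading on every original feature.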
Question 4
Easy
What is the output?
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_.sum())
If you keep all components, how much variance is explained?
1.0
Question 5
Easy
What is the output?
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]])
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca.shape)
n_components=2 reduces 3D data to 2D.
(4, 2)
Question 6
Easy
Why must you standardize data before applying PCA?
PCA finds directions of maximum variance. What happens if one feature has much larger values?
PCA finds directions of maximum variance. If features have different scales (e.g., income in thousands vs age in years), the high-scale feature dominates the variance and the first principal component essentially becomes that feature alone. Standardizing (zero mean, unit variance) ensures all features contribute equally to the principal components.
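A quick demonstration of the scale problem (the synthetic income/age data and seed are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent features on wildly different scales.
income = rng.normal(50_000, 15_000, 200)   # variance ~ 2.25e8
age = rng.normal(40, 10, 200)              # variance ~ 100
X = np.column_stack([income, age])

# Without scaling, PC1 is essentially the income axis.
raw = PCA(n_components=2).fit(X)
print("Unscaled PC1 loadings (income, age):", raw.components_[0].round(4))
print("Unscaled PC1 variance share:", raw.explained_variance_ratio_[0].round(6))

# After standardizing, both features contribute comparably.
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
print("Scaled variance shares:", scaled.explained_variance_ratio_.round(3))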
Question 7
Medium
What is the output?
from sklearn.decomposition import PCA
import numpy as np

np.random.seed(42)
X = np.random.randn(100, 10)
pca = PCA(n_components=5)
pca.fit(X)
print(f"Components: {pca.n_components_}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2f}")
Random data has its population variance spread equally across dimensions, but PCA keeps the components with the largest sample variances.
Components: 5
Explained variance: 0.62 (approximately; typically around 0.58-0.65). Even for isotropic random data, sampling noise makes the estimated eigenvalues unequal, and PCA keeps the 5 largest of 10, so the total exceeds 0.50.
Question 8
Medium
Explain what pca.components_ represents and what pca.explained_variance_ratio_ represents.
One is about the directions; the other is about how much each direction matters.
pca.components_ is a matrix of shape (n_components, n_features). Each row is a principal component -- a direction in the original feature space. The values (loadings) indicate how much each original feature contributes to that component. pca.explained_variance_ratio_ is an array of length n_components. Each value indicates the proportion of total variance captured by that component. They sum to less than 1.0 (unless you keep all components).
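These shapes and properties can be verified directly on the Wine data (a sketch; the choice of 3 components is arbitrary):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)  # shape (178, 13)
pca = PCA(n_components=3).fit(X)

# One row per component, one column per original feature.
print("components_ shape:", pca.components_.shape)
# One variance fraction per component, sorted in decreasing order.
print("explained_variance_ratio_:", pca.explained_variance_ratio_.round(3))
print("total captured:", pca.explained_variance_ratio_.sum().round(3))
# Each component is a unit-length direction in the original feature space.
print("PC1 norm:", np.linalg.norm(pca.components_[0]).round(6))
```

With only 3 of 13 components kept, the ratios sum to well under 1.0, and each component row has unit length.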
Question 9
Medium
Write code to determine how many PCA components are needed to explain 90% of the variance in the Wine dataset. Use load_wine() and StandardScaler.
Fit PCA with all components, then find where cumulative variance reaches 0.90.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
import numpy as np

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

pca = PCA()
pca.fit(X_scaled)

cum_var = pca.explained_variance_ratio_.cumsum()
n_90 = np.argmax(cum_var >= 0.90) + 1
print(f"Components for 90% variance: {n_90}")
print(f"Original features: {wine.data.shape[1]}")
print(f"Variance at {n_90} components: {cum_var[n_90-1]:.4f}")
Question 10
Medium
What is the perplexity parameter in t-SNE and how does it affect the visualization?
It controls how many neighbors each point considers, affecting local vs global structure.
Perplexity roughly corresponds to the number of effective nearest neighbors each point considers when computing the high-dimensional probability distribution. Low perplexity (5-10): Focuses on very local structure, producing many small, tight clusters. May miss connections between related clusters. High perplexity (30-50): Considers more neighbors, producing fewer and larger clusters that better reflect global structure. May blur fine local details. Typical range: 5-50. Default is 30.
Question 11
Medium
Why does t-SNE not have a transform() method, and what should you use instead for new data?
t-SNE optimizes all points simultaneously. New points cannot be added without re-running.
t-SNE jointly optimizes the positions of ALL data points to minimize KL divergence. Adding a new point would change the optimal positions of all existing points, so you cannot project a new point without re-fitting on the entire dataset. For new data projection, use PCA (linear, has transform()) or UMAP (non-linear, has transform()). Both can be fitted once and then applied to new data.
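The API difference is easy to confirm (a quick check against the sklearn classes; the perplexity value is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_iris().data

# PCA can be fitted once and then project new points.
pca = PCA(n_components=2).fit(X)
print("PCA has transform():", hasattr(pca, "transform"))
print("New point projected to shape:", pca.transform(X[:1]).shape)

# sklearn's TSNE exposes only fit_transform: no transform() for new data.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
print("t-SNE has transform():", hasattr(tsne, "transform"))
```

Because t-SNE lacks `transform()`, it cannot sit in the middle of a train/predict pipeline; PCA (or UMAP) can.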
Question 12
Hard
What is the output?
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(42)
# Highly correlated features: x2 = 2*x1 + noise
x1 = np.random.randn(100)
x2 = 2 * x1 + np.random.randn(100) * 0.1
X = np.column_stack([x1, x2])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
pca.fit(X_scaled)
print(f"Variance: {pca.explained_variance_ratio_.round(4)}")
print(f"PC1 captures: {pca.explained_variance_ratio_[0]:.4f}")
When two features are nearly perfectly correlated, almost all variance lies along one direction.
Variance: [0.9994 0.0006] (approximately)
PC1 captures: 0.9994 (approximately). With x2 = 2*x1 plus small noise, the sample correlation r is about 0.999, and for two standardized features PC1 explains (1 + r)/2 of the variance.
Question 13
Hard
Deepak has a dataset with 500 features for a KNN classifier. Training is slow and accuracy is poor. Explain how PCA can help with both problems.
Think about the curse of dimensionality and how PCA addresses distance computation and noise.
Speed: KNN computes distances between the query point and all training points. With 500 features, each distance calculation spans 500 dimensions; reducing to 50 components with PCA makes each distance computation 10x faster. Accuracy: In 500 dimensions, the curse of dimensionality makes all distances similar (poor discrimination), and many of the 500 features may be noise that adds random distance masking the signal. PCA keeps only the components with the most variance (mostly signal) and discards low-variance components (mostly noise), making distances more meaningful and improving KNN's ability to find true neighbors.
Question 14
Easy
What is the output?
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 0], [0, 1], [1, 1]])
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
print(pca.explained_variance_ratio_.round(4))
3 samples with 2 features reduced to 1 component.
(3, 1)
[0.75] (exactly: the centered covariance matrix has eigenvalues 1/2 and 1/6, and (1/2) / (1/2 + 1/6) = 0.75)
Question 15
Medium
Why is PCA sensitive to outliers? What happens when an outlier is present?
Outliers have extreme values that affect variance calculations.
PCA finds directions of maximum variance. Outliers have extreme values that disproportionately inflate the variance along certain directions. A single outlier can dominate the covariance matrix, causing the first principal component to point toward the outlier rather than capturing the true data structure. The resulting components may represent the outlier's influence rather than the underlying patterns. Solutions: remove outliers before PCA, use robust PCA methods, or use algorithms like t-SNE/UMAP that are less sensitive to outliers.
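The hijacking effect can be shown with a single extreme point. This is a made-up toy example (data along the x-axis plus one outlier far out on the y-axis):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Clean data varies almost entirely along the x-axis.
X = np.column_stack([rng.normal(0, 5, 100), rng.normal(0, 0.1, 100)])

pc1_clean = PCA(n_components=1).fit(X).components_[0]
print("PC1 without outlier:", pc1_clean.round(3))

# A single extreme point in the y direction dominates the covariance.
X_out = np.vstack([X, [0, 100]])
pc1_out = PCA(n_components=1).fit(X_out).components_[0]
print("PC1 with outlier:   ", pc1_out.round(3))
```

Without the outlier PC1 points along x; with it, PC1 rotates to point almost entirely along y, toward the outlier, misrepresenting 100 of the 101 points.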
Question 16
Hard
Write code that applies PCA to the Wine dataset, reconstructs the data using inverse_transform, and computes the reconstruction error for different numbers of components (1, 3, 5, 10, 13).
For each n_components, fit PCA, transform, inverse_transform, and compute MSE between original and reconstructed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

for n in [1, 3, 5, 10, 13]:
    pca = PCA(n_components=n)
    X_pca = pca.fit_transform(X)
    X_recon = pca.inverse_transform(X_pca)
    mse = np.mean((X - X_recon)**2)
    var = pca.explained_variance_ratio_.sum()
    print(f"n={n:2d}: variance={var:.4f}, reconstruction_error={mse:.4f}")
Question 17
Hard
Meera applies PCA and finds that 3 components explain 95% of variance in her 100-feature dataset. What does this tell her about the data structure?
The intrinsic dimensionality is much lower than the observed dimensionality.
This tells Meera that her data has very low intrinsic dimensionality (approximately 3) despite having 100 features. The 100 features are highly redundant -- they are correlated in ways that confine the data to roughly a 3-dimensional subspace. Most of the 100 features can be expressed as linear combinations of just 3 underlying factors. This is common in datasets where many features measure related aspects of the same underlying phenomenon.
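A synthetic version of Meera's situation: generate 100 observed features from just 3 hidden factors plus small noise (all sizes and the noise level here are illustrative), and watch 3 components soak up nearly all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 3 latent factors drive 100 observed features: intrinsic dimension ~ 3.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 100))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 100))

pca = PCA(n_components=10).fit(X)
cum = pca.explained_variance_ratio_.cumsum()
print("Cumulative variance of first 10 PCs:", cum.round(4))
print("First 3 PCs capture:", round(cum[2], 4))
```

Despite 100 columns, the first 3 components explain essentially all the variance; component 4 onward captures only the added noise.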
Question 18
Hard
Write code to apply PCA to reduce the digits dataset from 64 to 30 features, train a KNN classifier on the reduced data, and compare accuracy with training on the full data.
Use train_test_split, StandardScaler, PCA, and KNeighborsClassifier. Remember to fit scaler and PCA on train only.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

# Without PCA
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_sc, y_train)
acc_full = knn.score(X_test_sc, y_test)

# With PCA
pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train_sc)
X_test_pca = pca.transform(X_test_sc)
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
acc_pca = knn_pca.score(X_test_pca, y_test)

print(f"Full features (64): accuracy = {acc_full:.4f}")
print(f"PCA (30 components): accuracy = {acc_pca:.4f}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.4f}")
Question 19
Hard
Why should you NOT apply PCA before training a Random Forest or XGBoost model?
Think about how tree-based models select features and how PCA transforms the feature space.
Tree-based models (Random Forest, XGBoost) handle high dimensions natively by performing implicit feature selection at each split. They can identify which features matter and ignore the rest. PCA transforms features into linear combinations, which: (1) Makes the feature space harder for trees to split on (axis-aligned splits become less effective on rotated data). (2) Destroys interpretability (feature importances now refer to abstract components, not original features). (3) May mix informative features with noise in the same component. Trees are also largely unaffected by the curse of dimensionality and multicollinearity, so PCA usually provides little or no benefit for them.

Mixed & Application Questions

Question 1
Easy
What is the output?
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
3 samples, 3 features reduced to 1 component.
(3, 1)
Question 2
Easy
Can PCA increase the number of features? Why or why not?
What is the maximum number of components PCA can produce?
No. PCA can produce at most min(n_samples, n_features) components, and because the data is mean-centered, at most min(n_samples - 1, n_features) of them carry non-zero variance. Setting n_components above min(n_samples, n_features) raises a ValueError in sklearn. PCA reduces or maintains dimensionality; it never increases it, because the components are derived from the existing feature space.
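The cap is enforced at fit time, as a quick check shows (the array shape here is an arbitrary example):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(10, 4))  # 10 samples, 4 features

# Requesting more components than min(n_samples, n_features) fails.
try:
    PCA(n_components=5).fit(X)
    failed = False
except ValueError as e:
    failed = True
    print("ValueError:", e)

# The most you can keep here is min(10, 4) = 4 components.
print(PCA(n_components=4).fit_transform(X).shape)
```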
Question 3
Easy
What is the output?
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 1], [2, 2], [3, 3]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_.round(4))
All points lie exactly on the line y=x. How many directions have non-zero variance?
[1. 0.]
Question 4
Medium
What is the output?
from sklearn.decomposition import PCA
import numpy as np

pca = PCA(n_components=0.95)
X = np.random.RandomState(42).randn(100, 20)
pca.fit(X)
print(f"Components selected: {pca.n_components_}")
With random data and 20 features, variance is spread evenly. How many components to reach 95%?
Components selected: 19 (approximately 18-19)
Question 5
Medium
Nisha applies t-SNE to her dataset and notices that one cluster appears much larger than the others. She concludes those data points are more spread out in the original space. Is she correct?
t-SNE distorts global properties of the data.
No, Nisha is incorrect. t-SNE does not preserve cluster sizes or inter-cluster distances. A cluster that appears large in the t-SNE plot may be tight in the original space, and vice versa. t-SNE only reliably preserves local neighborhood structure (which points are near which). Cluster sizes, distances between clusters, and shapes in the t-SNE plot should not be interpreted as properties of the original data.
Question 6
Medium
Write code to apply PCA on the Iris dataset and print which original features contribute most to PC1.
After fitting PCA, check pca.components_[0] for PC1 loadings. The feature with the highest absolute loading contributes most.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)
pca.fit(X_scaled)

print("PC1 loadings:")
for name, loading in zip(iris.feature_names, pca.components_[0]):
    print(f"  {name}: {loading:.4f}")

most_important = iris.feature_names[np.argmax(np.abs(pca.components_[0]))]
print(f"\nMost important feature for PC1: {most_important}")
Question 7
Medium
Rohit has 10,000 data points with 500 features. He wants to visualize the data in 2D. Should he run t-SNE directly on the 500-dimensional data?
t-SNE is O(n^2) and slow in high dimensions. What preprocessing step would help?
No. Rohit should first apply PCA to reduce from 500 to 30-50 dimensions, then apply t-SNE on the PCA output to get 2D. This two-step approach is standard because: (1) t-SNE is O(n^2) and very slow in 500 dimensions. (2) PCA removes noise and redundancy, giving t-SNE cleaner input. (3) The first 30-50 PCA components typically capture 95%+ of the variance. This PCA-then-t-SNE pipeline is used by the original t-SNE authors and is the recommended approach.
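The two-step pipeline looks like this in code. A sketch on a small slice of the digits data so it runs quickly; the 300-sample cut and component counts are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Small subsample (300 points) keeps the demo fast; Rohit would use all data.
X = StandardScaler().fit_transform(load_digits().data[:300])  # (300, 64)

# Step 1: PCA to ~30 dimensions removes noise and redundancy.
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Step 2: t-SNE only has to embed 30 dimensions into 2.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print("Final embedding shape:", X_2d.shape)
```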
Question 8
Hard
What is the output?
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(42)
X = np.random.randn(100, 5)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
X_reconstructed = pca.inverse_transform(X_pca)
print(f"Original shape: {X_scaled.shape}")
print(f"Reconstructed shape: {X_reconstructed.shape}")
print(f"Reconstruction error: {np.mean((X_scaled - X_reconstructed)**2):.4f}")
inverse_transform reconstructs the original space from the reduced representation. Information lost by discarding components becomes reconstruction error.
Original shape: (100, 5)
Reconstructed shape: (100, 5)
Reconstruction error: 0.3979 (approximately 0.35-0.45)
Question 9
Hard
Explain why the eigenvectors of the covariance matrix give the principal components, and why eigenvalues correspond to the variance along each component.
Think about what the covariance matrix represents and what eigenvectors are.
The covariance matrix C captures how each pair of features varies together. Its eigenvectors are directions in which the data varies independently (the projections onto them are uncorrelated). For a unit vector v, the variance of the data projected onto v is v^T C v; if v is an eigenvector with Cv = lambda * v, this variance equals lambda. The eigenvector with the largest eigenvalue is therefore the direction of maximum variance. Sorting eigenvectors by eigenvalue (descending) gives PC1, PC2, etc., ordered by decreasing variance.
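This correspondence can be verified numerically: eigendecomposing the covariance matrix by hand reproduces sklearn's PCA (the random correlated dataset below is made up for the check):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated features
Xc = X - X.mean(axis=0)

# Manual route: eigendecompose the (ddof=1) covariance matrix.
cov = np.cov(Xc, rowvar=False)                 # (4, 4)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]              # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# sklearn route: eigenvalues show up as explained_variance_,
# eigenvectors as components_ (identical up to a sign flip per vector).
pca = PCA(n_components=4).fit(X)
print("eigenvalues match explained_variance_:",
      np.allclose(eigvals, pca.explained_variance_))
print("eigenvectors match components_ (up to sign):",
      np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))
```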
Question 10
Hard
Write code that compares PCA and t-SNE on the digits dataset. Compute how well each method separates the 10 digit classes in 2D by calculating the ratio of between-class variance to within-class variance.
Project to 2D with PCA and t-SNE separately. For each, compute the mean per class, overall mean, then between-class variance (sum of squared distances from class means to overall mean) and within-class variance (sum of squared distances from points to their class mean).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

def class_var_ratio(X_2d, y):
    overall_mean = X_2d.mean(axis=0)
    between = 0
    within = 0
    for label in np.unique(y):
        points = X_2d[y == label]
        class_mean = points.mean(axis=0)
        between += len(points) * np.sum((class_mean - overall_mean)**2)
        within += np.sum((points - class_mean)**2)
    return between / within

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
y = digits.target

X_pca = PCA(n_components=2).fit_transform(X_scaled)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_scaled)

print(f"PCA variance ratio: {class_var_ratio(X_pca, y):.4f}")
print(f"t-SNE variance ratio: {class_var_ratio(X_tsne, y):.4f}")
print("Higher ratio = better class separation")
Question 11
Hard
What is UMAP and what are its advantages over t-SNE?
UMAP is newer, faster, and preserves more global structure.
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique that: (1) Is significantly faster than t-SNE, especially on large datasets. (2) Better preserves global structure -- distances between clusters are more meaningful. (3) Has a transform() method for projecting new data (unlike t-SNE). (4) Can produce embeddings in any number of dimensions (useful for feature extraction, not just visualization). (5) Is based on solid mathematical theory (Riemannian geometry and algebraic topology).
Question 12
Hard
Sanjay has a linear regression model that suffers from multicollinearity (highly correlated features). How does PCA solve this, and what trade-off does he accept?
PCA produces uncorrelated components. But what happens to interpretability?
PCA creates principal components that are orthogonal (uncorrelated) by definition. Even if the original features are highly correlated, the PCA components have zero correlation with each other. This completely eliminates multicollinearity, stabilizing the regression coefficients. The trade-off: the original feature names are lost. Instead of interpreting the effect of 'age' or 'income', Sanjay would interpret 'PC1' and 'PC2', which are abstract linear combinations. If interpretability is critical (e.g., medical research, regulatory requirements), PCA may not be acceptable.
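The decorrelation is exact and easy to verify (the three-feature toy dataset below, with x2 nearly duplicating x1, is invented for the demonstration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)   # nearly duplicates x1
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

print("Original feature correlations:")
print(np.corrcoef(X, rowvar=False).round(2))   # x1 and x2 correlate ~0.99

# PCA scores are mutually uncorrelated by construction.
X_pca = PCA(n_components=3).fit_transform(X)
print("Component correlations:")
print(np.corrcoef(X_pca, rowvar=False).round(2))
```

The component correlation matrix is the identity (up to floating-point error), so a regression on the components has no multicollinearity at all.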
Question 13
Medium
What is the output?
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
pca = PCA(n_components=2)
pca.fit(X)
print(f"Explained variance: {pca.explained_variance_ratio_.round(4)}")
print(f"Total: {pca.explained_variance_ratio_.sum():.4f}")
All three features increase together perfectly linearly. How many directions have non-zero variance?
Explained variance: [1. 0.]
Total: 1.0000
Question 14
Easy
What is the difference between PCA and t-SNE in one sentence each?
One is linear for preprocessing, the other is non-linear for visualization.
PCA: A linear method that finds directions of maximum variance, used for dimensionality reduction as preprocessing in ML pipelines. t-SNE: A non-linear method that preserves local neighborhood structure, used exclusively for 2D/3D visualization of high-dimensional data.

Multiple Choice Questions

MCQ 1
What does PCA stand for?
  • A. Primary Component Analyzer
  • B. Principal Component Analysis
  • C. Partial Cluster Algorithm
  • D. Projected Covariance Approximation
Answer: B
B is correct. PCA stands for Principal Component Analysis. It finds principal components (directions of maximum variance) in the data.
MCQ 2
What is the main purpose of dimensionality reduction?
  • A. To increase the number of features
  • B. To reduce features while preserving important information
  • C. To add noise to the data
  • D. To increase model complexity
Answer: B
B is correct. Dimensionality reduction reduces the number of features while preserving as much useful information as possible. It helps with visualization, speed, and reducing overfitting.
MCQ 3
Which of the following is a consequence of the curse of dimensionality?
  • A. Models train faster with more features
  • B. Distance between all points becomes similar
  • C. Accuracy always improves with more features
  • D. Data becomes denser in high dimensions
Answer: B
B is correct. In high-dimensional spaces, the ratio of maximum to minimum distance converges to 1 -- all points appear roughly equidistant. This makes distance-based algorithms (KNN, K-Means) ineffective.
MCQ 4
What does the explained variance ratio in PCA tell you?
  • A. The accuracy of the model
  • B. The proportion of total variance captured by each principal component
  • C. The number of features to remove
  • D. The learning rate for training
Answer: B
B is correct. The explained variance ratio indicates what fraction of the total data variance is captured by each component. If PC1 has ratio 0.72, it captures 72% of the total variance.
MCQ 5
t-SNE is primarily used for:
  • A. Feature extraction for classifiers
  • B. Regression tasks
  • C. Visualization of high-dimensional data in 2D or 3D
  • D. Speeding up training of neural networks
Answer: C
C is correct. t-SNE is designed for visualization. It creates 2D or 3D representations that reveal cluster structure. It should not be used for feature extraction (A) because it has no transform() method and distorts global structure.
MCQ 6
Ananya wants to reduce dimensions before training an SVM. Which technique should she use?
  • A. t-SNE
  • B. PCA
  • C. Random Forest feature importance
  • D. Plotting histograms
Answer: B
B is correct. PCA is the right choice for SVM preprocessing because: it has transform() for new data, creates uncorrelated features, and reduces dimensions while preserving maximum variance. t-SNE (A) has no transform() method and is for visualization only.
MCQ 7
What does PCA(n_components=0.95) do in sklearn?
  • A. Keeps 95 components
  • B. Removes 95% of the features
  • C. Selects the minimum number of components that explain 95% of variance
  • D. Sets the learning rate to 0.95
Answer: C
C is correct. When n_components is a float between 0 and 1, sklearn interprets it as the target explained variance ratio. It automatically selects the fewest components needed to reach that threshold.
MCQ 8
Which statement about t-SNE is FALSE?
  • A. t-SNE preserves local neighborhood structure
  • B. t-SNE has a transform() method for new data
  • C. t-SNE is non-linear
  • D. t-SNE uses KL divergence as its cost function
Answer: B
B is FALSE (and therefore the correct answer). t-SNE does not have a transform() method. It can only produce embeddings for the data it was trained on. All other statements are true.
MCQ 9
Which algorithm should NOT typically be preceded by PCA?
  • A. K-Nearest Neighbors
  • B. Support Vector Machine
  • C. Random Forest
  • D. Logistic Regression
Answer: C
C is correct. Random Forest handles high dimensions natively through implicit feature selection at each split. PCA can actually hurt tree performance by creating abstract components that are harder to split on. KNN (A), SVM (B), and Logistic Regression (D) all benefit from PCA.
MCQ 10
What is the typical recommended range for the perplexity parameter in t-SNE?
  • A. 0.01 to 0.1
  • B. 5 to 50
  • C. 100 to 1000
  • D. 1 to 2
Answer: B
B is correct. Perplexity typically ranges from 5 to 50. Low values emphasize local structure; high values emphasize global structure. The default in sklearn is 30.
MCQ 11
Vikram applies PCA to a 100-feature dataset and finds that the first 10 components explain 98% of the variance. What does this tell him?
  • A. He has 10 important features and 90 useless ones
  • B. The original 100 features are highly correlated, and 10 directions capture almost all the variation
  • C. He should add more features
  • D. PCA has failed because not all variance is captured
Answer: B
B is correct. When 10 components capture 98% of 100-dimensional variance, it means the data has low intrinsic dimensionality. The 100 features are highly redundant (correlated), and the data mostly lives in a 10-dimensional subspace. This is an ideal scenario for PCA -- massive dimensionality reduction with minimal information loss.
MCQ 12
Why does t-SNE use a Student's t-distribution in the low-dimensional space instead of a Gaussian?
  • A. To make the algorithm faster
  • B. To solve the crowding problem -- heavier tails allow moderate distances in low dimensions to model large distances in high dimensions
  • C. Because Gaussian is only for 1D data
  • D. To prevent overfitting
Answer: B
B is correct. In high dimensions, a point can have many equidistant neighbors (they spread out in all directions). In 2D, there is not enough room for all these neighbors. The t-distribution's heavier tails allow moderately distant points in 2D to represent large distances in high dimensions, preventing all points from being crushed together (the crowding problem).
MCQ 13
What is the relationship between PCA and the eigendecomposition of the covariance matrix?
  • A. PCA uses eigenvalues as feature weights
  • B. The eigenvectors of the covariance matrix are the principal components, and eigenvalues represent the variance along each component
  • C. PCA uses eigenvalues to scale the data
  • D. There is no relationship; PCA uses gradient descent
Answer: B
B is correct. PCA decomposes the covariance matrix into eigenvectors (directions of maximum variance = principal components) and eigenvalues (amount of variance in each direction). Sorting eigenvectors by descending eigenvalue gives PC1, PC2, etc. This is the mathematical core of PCA.
MCQ 14
Aisha has 500 samples and 2000 features. She runs PCA. What is the maximum number of non-zero principal components she can get?
  • A. 2000
  • B. 500
  • C. 499
  • D. 1
Answer: C
C is correct. The maximum number of non-zero principal components is min(n_samples, n_features) - 1 = min(500, 2000) - 1 = 499. The -1 is because centering the data (subtracting the mean) reduces the rank by 1. In practice, sklearn may return 500 components but the last one will have zero or near-zero variance.
MCQ 15
Which of the following is TRUE about UMAP compared to t-SNE?
  • A. UMAP is always slower than t-SNE
  • B. UMAP can only produce 2D embeddings
  • C. UMAP has a transform() method and can project new data
  • D. UMAP does not preserve any local structure
Answer: C
C is correct. UMAP has a transform() method, making it usable in ML pipelines for new data. UMAP is generally faster (A is wrong), can produce any-dimensional embeddings (B is wrong), and preserves both local and global structure (D is wrong).
MCQ 16
What happens if you apply PCA to a dataset where all features are completely uncorrelated?
  • A. PCA reduces to 1 component
  • B. PCA components are the same as the original features (rotated), each explaining equal variance
  • C. PCA fails with an error
  • D. PCA removes all features
Answer: B
B is correct. When features are uncorrelated, the covariance matrix is diagonal, so its eigenvectors are the original coordinate axes. After standardization, each component captures roughly equal variance (about 1/n_features each). PCA provides no dimensionality reduction benefit because there is no redundancy to exploit.
MCQ 17
Arjun uses PCA(n_components=50) on a dataset with 1000 features and 200 samples. What is the maximum number of non-zero components he can get?
  • A. 50
  • B. 200
  • C. 199
  • D. 1000
Answer: A
A is correct. He requested 50 components, and n_components caps the output. However, the theoretical maximum for any PCA on this dataset would be min(200, 1000) - 1 = 199 (due to centering). Since 50 < 199, he gets exactly 50 components.
MCQ 18
Which of the following is NOT a benefit of dimensionality reduction?
  • A. Faster model training
  • B. Better visualization of data
  • C. Guaranteed higher model accuracy
  • D. Reduced overfitting
Answer: C
C is correct. Dimensionality reduction does NOT guarantee higher accuracy. While it often improves accuracy by removing noise, it can also hurt accuracy if important information is lost. Benefits include faster training (A), better visualization (B), and reduced overfitting (D).
MCQ 19
What is the computational complexity of computing PCA using eigendecomposition of the covariance matrix?
  • A. O(n)
  • B. O(n * d^2 + d^3) where n=samples, d=features
  • C. O(n^3)
  • D. O(d)
Answer: B
B is correct. PCA has two main steps: computing the covariance matrix O(n*d^2) and eigendecomposition O(d^3). For large n and small d, the covariance step dominates. For small n and large d, truncated SVD is more efficient.

Coding Challenges

Coding challenges coming soon.
