Practice Questions — Mathematics for Machine Learning
Topic-Specific Questions
Question 1
Easy
Compute the dot product of [1, 2, 3] and [4, 5, 6] using NumPy. Show the manual calculation in a comment.
np.dot(a, b) or a @ b. Manual: 1*4 + 2*5 + 3*6.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Manual: 1*4 + 2*5 + 3*6 = 4 + 10 + 18 = 32
print(f"Dot product: {np.dot(a, b)}")
Output: Dot product: 32
Question 2
Easy
Create a 3x3 identity matrix using NumPy and print it.
Use np.eye(3).
import numpy as np
I = np.eye(3)
print(I)
Output:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Question 3
Easy
What is the output?
import numpy as np
A = np.array([[1, 2], [3, 4]])
print(A.T)
.T transposes the matrix: rows become columns.
[[1 3]
 [2 4]]
Question 4
Easy
What does the gradient tell us in Machine Learning?
Think about direction and steepness.
The gradient is a vector of partial derivatives that points in the direction of steepest increase of a function. In ML, we want to minimize the loss function, so we move in the opposite direction of the gradient. The magnitude of the gradient tells us how steep the function is at that point.
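As a small illustrative sketch (the function and numbers here are assumed, not part of the question): for f(w1, w2) = w1^2 + w2^2, stepping against the gradient lowers the function value.

```python
import numpy as np

def f(w):
    # Example loss: f(w1, w2) = w1^2 + w2^2
    return np.sum(w ** 2)

def grad(w):
    # Gradient: vector of partial derivatives [2*w1, 2*w2]
    return 2 * w

w = np.array([3.0, 4.0])
step = w - 0.1 * grad(w)  # move opposite to the gradient
print(f(w), f(step))      # the loss decreases (25.0 -> ~16)
```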
Question 5
Medium
Write NumPy code to multiply two matrices A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]]. What is element [0][1] of the result?
Use A @ B. Element [0][1] = dot product of row 0 of A and column 1 of B.
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B
print(f"A @ B:\n{C}")
print(f"\nElement [0][1] = 1*6 + 2*8 = {C[0, 1]}")
Output:
[[19 22]
 [43 50]]
Element [0][1] = 22
Question 6
Medium
Write code to compute the mean, median, and standard deviation of the array [10, 20, 30, 40, 50, 1000] and explain why mean and median differ significantly.
Use np.mean(), np.median(), np.std(). The value 1000 is an outlier.
import numpy as np
arr = np.array([10, 20, 30, 40, 50, 1000])
print(f"Mean: {np.mean(arr):.1f}")
print(f"Median: {np.median(arr):.1f}")
print(f"Std: {np.std(arr):.1f}")
print("Mean is 191.7, median is 35.0")
print("The outlier 1000 drags the mean up but does not affect the median")
Question 7
Medium
Explain Bayes Theorem with a real-world example. What are prior, likelihood, and posterior?
P(A|B) = P(B|A) * P(A) / P(B). Think of disease testing.
Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B). Prior P(A) = initial belief before seeing evidence. Likelihood P(B|A) = probability of evidence given the hypothesis. Posterior P(A|B) = updated belief after seeing evidence. Example: P(disease) = 0.01 (prior). P(positive test | disease) = 0.95 (likelihood). P(positive test) = 0.06. P(disease | positive test) = 0.95 * 0.01 / 0.06 = 0.158 (posterior). Even with a positive test, there is only 15.8% chance of disease because the disease is rare.
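A quick numeric check of the example above, using the values given in the answer:

```python
prior = 0.01        # P(disease)
likelihood = 0.95   # P(positive test | disease)
evidence = 0.06     # P(positive test), as given in the example
posterior = likelihood * prior / evidence
print(f"P(disease | positive) = {posterior:.3f}")  # 0.158
```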
Question 8
Medium
Implement one step of gradient descent for the function f(x) = x^2. Starting at x = 5 with learning rate 0.1, what is the new x?
Derivative of x^2 is 2x. New x = old_x - lr * gradient.
x = 5.0
learning_rate = 0.1
gradient = 2 * x # derivative of x^2
x_new = x - learning_rate * gradient
print(f"Old x: {x}")
print(f"Gradient at x=5: {gradient}")
print(f"New x: {x_new}")
print(f"f(old x) = {x**2}, f(new x) = {x_new**2}")
Output: New x: 4.0, f(old x) = 25.0, f(new x) = 16.0
Question 9
Medium
What is the output?
import numpy as np
arr = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(np.mean(arr))
print(np.var(arr))
Mean = sum/count. Variance = average of squared deviations from mean.
5.0
4.0
Question 10
Hard
Write code to compute the correlation between two arrays using the formula: corr = cov(X,Y) / (std(X) * std(Y)). Verify with np.corrcoef().
Compute covariance manually, then divide by product of standard deviations.
import numpy as np
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 5])
# Manual calculation
mean_x, mean_y = np.mean(X), np.mean(Y)
cov_xy = np.mean((X - mean_x) * (Y - mean_y))
std_x, std_y = np.std(X), np.std(Y)
corr_manual = cov_xy / (std_x * std_y)
# NumPy verification
corr_numpy = np.corrcoef(X, Y)[0, 1]
print(f"Manual correlation: {corr_manual:.4f}")
print(f"NumPy correlation: {corr_numpy:.4f}")
Question 11
Hard
What are eigenvalues and eigenvectors? Why are they important in ML (specifically PCA)?
Eigenvectors are directions that do not change when a transformation is applied.
For a matrix A, an eigenvector v satisfies A @ v = lambda * v, where lambda is the eigenvalue. The eigenvector's direction is unchanged by the transformation -- only its magnitude changes by factor lambda. In PCA, the eigenvectors of the covariance matrix represent the principal directions of maximum variance in the data. The eigenvalues indicate how much variance is in each direction. PCA keeps the top-k eigenvectors (those with largest eigenvalues) to reduce dimensionality while preserving the most information.
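A minimal sketch of A @ v = lambda * v with np.linalg.eig, using a simple diagonal matrix as an assumed example:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
# Each column of `eigenvectors` is an eigenvector of A
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))  # True: direction is preserved
print(eigenvalues)
```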
Question 12
Hard
Implement gradient descent to find the minimum of f(x) = (x - 5)^2 + 3. Start at x = 0, learning rate = 0.2, run for 20 steps.
Derivative: f'(x) = 2(x - 5). Minimum is at x = 5, f(5) = 3.
x = 0.0
lr = 0.2
for i in range(20):
    grad = 2 * (x - 5)
    x = x - lr * grad
print(f"Final x: {x:.6f} (expected: 5.0)")
print(f"Final f(x): {(x-5)**2 + 3:.6f} (expected: 3.0)")
Question 13
Hard
What is the output?
import numpy as np
A = np.array([[1, 0], [0, 1]])
v = np.array([3, 7])
print(A @ v)
A is the identity matrix.
[3 7]
Question 14
Easy
What is the difference between variance and standard deviation? Which one is in the same units as the original data?
One is the square of the other.
Variance is the average of squared deviations from the mean. Standard deviation is the square root of variance. Standard deviation is in the same units as the original data (e.g., if data is in cm, std is in cm). Variance is in squared units (cm^2). This is why standard deviation is more commonly used and reported.
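A short numeric check of the square/square-root relationship (sample data assumed for illustration):

```python
import numpy as np

heights_cm = np.array([150, 160, 170, 180, 190])  # data in cm
var = np.var(heights_cm)   # in cm^2
std = np.std(heights_cm)   # in cm, same units as the data
print(var, std)            # variance 200.0, std is its square root (~14.14)
```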
Question 15
Medium
Compute the inverse of matrix A = [[2, 1], [5, 3]] using NumPy and verify that A @ A_inv gives the identity matrix.
Use np.linalg.inv(A) and np.allclose() to verify.
import numpy as np
A = np.array([[2, 1], [5, 3]])
A_inv = np.linalg.inv(A)
print(f"A:\n{A}")
print(f"\nA inverse:\n{A_inv}")
print(f"\nA @ A_inv:\n{np.round(A @ A_inv)}")
print(f"Is identity? {np.allclose(A @ A_inv, np.eye(2))}")
Question 16
Easy
Write Python code to calculate the probability of rolling a 6 on a fair die, and the probability of NOT rolling a 6.
P(6) = 1/6. P(not 6) = 1 - P(6).
p_six = 1 / 6
p_not_six = 1 - p_six
print(f"P(rolling 6): {p_six:.4f} ({p_six*100:.2f}%)")
print(f"P(not rolling 6): {p_not_six:.4f} ({p_not_six*100:.2f}%)")
Question 17
Hard
Use Bayes theorem to calculate: If 2% of people have a disease, a test is 95% accurate for sick people and 90% accurate for healthy people, what is P(disease | positive test)?
P(disease)=0.02, P(positive|disease)=0.95, P(positive|healthy)=0.10.
p_disease = 0.02
p_healthy = 0.98
p_pos_disease = 0.95
p_pos_healthy = 0.10
p_pos = p_pos_disease * p_disease + p_pos_healthy * p_healthy
p_disease_pos = (p_pos_disease * p_disease) / p_pos
print(f"P(disease | positive test): {p_disease_pos:.4f}")
print(f"That's only {p_disease_pos*100:.1f}%!")
Question 18
Medium
What is the output?
import numpy as np
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
print(np.dot(a, b))
These vectors are perpendicular (orthogonal).
0
Question 19
Hard
Explain the chain rule in calculus and why it is essential for training neural networks (backpropagation).
If y = f(g(x)), then dy/dx = f'(g(x)) * g'(x). Think about layers in a neural network.
The chain rule states that if y = f(g(x)), then dy/dx = f'(g(x)) * g'(x). In neural networks, the output is a composition of many functions (layers): output = f3(f2(f1(x))). To compute how a weight in layer 1 affects the final loss, we need the chain rule: dL/dw1 = dL/df3 * df3/df2 * df2/df1 * df1/dw1. This chaining of derivatives through layers is called backpropagation. Without the chain rule, we could not train deep networks.
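A tiny numeric illustration (the example function is assumed, not from the original): the chain-rule derivative of f(x) = (3x + 1)^2 matches a finite-difference estimate.

```python
def g(x):
    return 3 * x + 1          # inner function

def f(x):
    return g(x) ** 2          # composition: f(x) = (3x + 1)^2

def f_prime(x):
    return 2 * g(x) * 3       # chain rule: outer'(g(x)) * inner'(x)

x, h = 2.0, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # finite-difference check
print(f_prime(x), round(numeric, 4))       # both 42.0
```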
Question 20
Medium
Write code to compute and display the correlation matrix for three variables: hours_studied, attendance, and marks.
Use np.corrcoef() with the three arrays stacked together.
import numpy as np
hours = np.array([2, 4, 6, 8, 3, 7, 5])
attendance = np.array([60, 70, 85, 95, 65, 90, 80])
marks = np.array([50, 65, 80, 92, 55, 85, 72])
corr = np.corrcoef([hours, attendance, marks])
labels = ['Hours', 'Attend', 'Marks']
print('Correlation Matrix:')
print(f"{'':>10}", end='')
for l in labels: print(f"{l:>10}", end='')
print()
for i, l in enumerate(labels):
    print(f"{l:>10}", end='')
    for j in range(3):
        print(f"{corr[i][j]:>10.4f}", end='')
    print()
Mixed & Application Questions
Question 1
Easy
Create two NumPy vectors a = [3, 4] and b = [1, 2]. Compute and print: a + b, a - b, a * 2, and the dot product.
Use +, -, *, np.dot() for the operations.
import numpy as np
a = np.array([3, 4])
b = np.array([1, 2])
print(f"a + b = {a + b}")
print(f"a - b = {a - b}")
print(f"a * 2 = {a * 2}")
print(f"dot product = {np.dot(a, b)}")
Question 2
Easy
What is the output?
import numpy as np
print(np.mean([10, 20, 30, 40, 50]))
Mean = sum / count.
30.0
Question 3
Medium
Deepak's dataset has two features: temperature (0-50 Celsius) and income (10000-500000 rupees). Why might this cause problems for KNN, and what should he do?
KNN uses distance. Which feature will dominate the distance calculation?
Income (range: 490000) will completely dominate the distance calculation over temperature (range: 50) in KNN, because KNN uses Euclidean distance. A 1-degree temperature difference would be negligible compared to a 1000-rupee income difference. Deepak should apply feature scaling: either StandardScaler (z-score normalization) or MinMaxScaler (scale to 0-1) to bring both features to the same scale.
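A minimal hand-rolled z-score scaling sketch (the array values here are made up for illustration; sklearn's StandardScaler does the same per feature):

```python
import numpy as np

temperature = np.array([25.0, 30.0, 41.0, 18.0])           # spans tens
income = np.array([20000.0, 450000.0, 80000.0, 300000.0])  # spans lakhs

temp_scaled = (temperature - temperature.mean()) / temperature.std()
income_scaled = (income - income.mean()) / income.std()
print(temp_scaled.std(), income_scaled.std())  # both ~1.0: comparable scales now
```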
Question 4
Medium
Write NumPy code to compute the magnitude (length) of the vector [3, 4]. Verify that it equals 5 (Pythagorean theorem).
Magnitude = sqrt(x^2 + y^2). Use np.linalg.norm().
import numpy as np
v = np.array([3, 4])
mag = np.linalg.norm(v)
print(f"Vector: {v}")
print(f"Magnitude: {mag}")
print(f"Manual: sqrt(3^2 + 4^2) = sqrt(9+16) = sqrt(25) = {np.sqrt(9+16)}")
Output: Magnitude: 5.0
Question 5
Hard
Implement gradient descent to minimize f(w1, w2) = w1^2 + w2^2. Start at (5, 5), learning rate 0.1, 30 steps. Print every 5th step.
Gradients: df/dw1 = 2*w1, df/dw2 = 2*w2. Update both simultaneously.
import numpy as np
w = np.array([5.0, 5.0])
lr = 0.1
for i in range(30):
    grad = 2 * w
    w = w - lr * grad
    if i % 5 == 0:
        print(f"Step {i:2d}: w = [{w[0]:.4f}, {w[1]:.4f}], f(w) = {np.sum(w**2):.6f}")
print(f"\nFinal: w = [{w[0]:.6f}, {w[1]:.6f}]")
print(f"Expected: [0, 0]")
Question 6
Hard
What is the output?
import numpy as np
A = np.array([[2, 0], [0, 3]])
v = np.array([1, 1])
print(A @ v)
A is a diagonal matrix. It scales each component independently.
[2 3]
Question 7
Easy
What is the difference between correlation and covariance?
One is bounded [-1, 1], the other is unbounded.
Covariance measures how two variables change together but is unbounded (its value depends on the scale of the data). Correlation is normalized covariance, bounded between -1 and +1, making it easier to interpret. Correlation = Covariance / (std_X * std_Y). A correlation of 0.9 always means a strong positive relationship, regardless of the data's scale.
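A quick sketch of the scale dependence (illustrative numbers): rescaling one variable changes the covariance but leaves the correlation untouched.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])

cov = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov / (x.std() * y.std())

y_rupees = y * 100000  # change of units: scale y up
cov2 = np.mean((x - x.mean()) * (y_rupees - y_rupees.mean()))
corr2 = cov2 / (x.std() * y_rupees.std())
print(cov, cov2)    # covariance grows with the scale of the data
print(corr, corr2)  # correlation is unchanged
```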
Question 8
Hard
Write a Python function that implements the normal equation for linear regression: w = (X^T @ X)^(-1) @ X^T @ y. Test it with simple data.
Use np.linalg.inv() for the inverse, @ for matrix multiplication.
import numpy as np
def normal_equation(X, y):
    return np.linalg.inv(X.T @ X) @ X.T @ y
# Simple data: y = 2*x + 1
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]) # Bias column + feature
y = np.array([3, 5, 7, 9, 11]) # y = 2x + 1
w = normal_equation(X, y)
print(f"Weights: bias={w[0]:.2f}, slope={w[1]:.2f}")
print(f"Equation: y = {w[1]:.2f}x + {w[0]:.2f}")
Output: Weights: bias=1.00, slope=2.00
Question 9
Medium
What is the 68-95-99.7 rule for the normal distribution?
It describes what percentage of data falls within 1, 2, and 3 standard deviations of the mean.
For a normal distribution: approximately 68% of data falls within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. If marks have mean=70 and std=10, then 68% of students score 60-80, 95% score 50-90, and 99.7% score 40-100.
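The rule can be verified empirically with simulated data (a sketch, using mean=70 and std=10 as in the example):

```python
import numpy as np

np.random.seed(0)
marks = np.random.normal(loc=70, scale=10, size=100_000)
for k in (1, 2, 3):
    frac = np.mean(np.abs(marks - 70) < k * 10)
    print(f"within {k} std: {frac:.3f}")  # ~0.683, ~0.954, ~0.997
```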
Question 10
Medium
Write code to simulate flipping a coin 10000 times and verify that the probability of heads approaches 0.5.
Use np.random.choice(['H', 'T'], size=10000).
import numpy as np
np.random.seed(42)
flips = np.random.choice(['H', 'T'], size=10000)
heads = np.sum(flips == 'H')
p_heads = heads / len(flips)
print(f"Total flips: {len(flips)}")
print(f"Heads: {heads}")
print(f"P(Heads): {p_heads:.4f} (expected: 0.5000)")
Multiple Choice Questions
MCQ 1
What is the dot product of [1, 2, 3] and [4, 5, 6]?
Answer: B
B is correct. Dot product = (1*4) + (2*5) + (3*6) = 4 + 10 + 18 = 32. Option A is element-wise multiplication (not summed). The dot product always returns a single scalar.
MCQ 2
What is the transpose of a 3x2 matrix?
Answer: B
B is correct. Transpose swaps rows and columns. A (3 rows x 2 columns) matrix becomes (2 rows x 3 columns). In general, the transpose of an (m x n) matrix is (n x m).
MCQ 3
What is the derivative of f(x) = x^2?
Answer: B
B is correct. Using the power rule: the derivative of x^n is n * x^(n-1). So the derivative of x^2 is 2 * x^1 = 2x. At x=3, the derivative is 6, meaning the function is increasing at a rate of 6.
MCQ 4
In gradient descent, the update rule is w = w - lr * gradient. What happens if the learning rate is too large?
Answer: C
C is correct. A learning rate that is too large causes gradient descent to take steps that are too big, overshooting the minimum and bouncing back and forth (or diverging to infinity). The loss increases instead of decreasing. Typical good learning rates are 0.01 or 0.001.
MCQ 5
What is the shape of the result when multiplying a (3, 4) matrix with a (4, 2) matrix?
Answer: A
A is correct. For matrix multiplication, (m x n) @ (n x p) = (m x p). The inner dimensions (n=4) must match. The result takes the outer dimensions: (3 x 4) @ (4 x 2) = (3 x 2).
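A one-liner check of the shape rule (illustrative sketch):

```python
import numpy as np

A = np.ones((3, 4))
B = np.ones((4, 2))
print((A @ B).shape)  # (3, 2): inner dimension 4 cancels, outer dims remain
```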
MCQ 6
Bayes theorem states P(A|B) = P(B|A) * P(A) / P(B). What is P(A) called?
Answer: C
C is correct. P(A) is the prior -- our initial belief before seeing evidence. P(B|A) is the likelihood -- probability of evidence given our hypothesis. P(A|B) is the posterior -- updated belief after seeing evidence. P(B) is the evidence.
MCQ 7
If a dataset has mean = 100 and standard deviation = 15, what Z-score does a value of 130 have?
Answer: B
B is correct. Z-score = (value - mean) / std = (130 - 100) / 15 = 30 / 15 = 2.0. This means 130 is exactly 2 standard deviations above the mean. According to the 68-95-99.7 rule, about 97.5% of values are below this point.
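The calculation in code (numbers from the question):

```python
value, mean, std = 130, 100, 15
z = (value - mean) / std  # standard deviations above the mean
print(f"Z-score: {z}")    # 2.0
```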
MCQ 8
What does a correlation of -0.95 between two variables indicate?
Answer: C
C is correct. A correlation of -0.95 indicates a very strong negative linear relationship: as one variable increases, the other strongly decreases. The magnitude (0.95) shows the relationship is very strong. Only a correlation near 0 indicates no linear relationship.
MCQ 9
Which measure of central tendency is most robust to outliers?
Answer: B
B is correct. The median (middle value when sorted) is robust to outliers because it only depends on the position, not the magnitude of extreme values. The mean is heavily influenced by outliers. Variance is a measure of spread, not central tendency.
MCQ 10
In PCA, eigenvalues of the covariance matrix represent:
Answer: B
B is correct. Eigenvalues represent the amount of variance in the direction of their corresponding eigenvectors. Larger eigenvalue = more variance explained. Eigenvectors (not eigenvalues) give the direction. PCA keeps components with the largest eigenvalues.
MCQ 11
What is the partial derivative of f(x, y) = x^2 + 3xy with respect to x?
Answer: A
A is correct. To find the partial derivative with respect to x, treat y as a constant. d/dx(x^2) = 2x. d/dx(3xy) = 3y (y is constant). Total: 2x + 3y.
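A numeric sanity check of the partial derivative via finite differences (the evaluation point is an assumed example):

```python
def f(x, y):
    return x ** 2 + 3 * x * y

x, y, h = 2.0, 5.0, 1e-6
numeric = (f(x + h, y) - f(x - h, y)) / (2 * h)  # hold y fixed, vary x
analytic = 2 * x + 3 * y
print(analytic, round(numeric, 4))  # both 19.0
```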
MCQ 12
For matrix multiplication A @ B to be valid, which dimensions must match?
Answer: B
B is correct. For (m x n) @ (n x p), the inner dimensions (number of columns of A = number of rows of B) must match. The result has shape (m x p). The matrices do NOT need to be square or have the same shape.
MCQ 13
The gradient of f(w1, w2) = w1^2 + w2^2 at point (3, 4) is:
Answer: B
B is correct. The gradient is [df/dw1, df/dw2] = [2*w1, 2*w2]. At (3, 4): gradient = [2*3, 2*4] = [6, 8]. This gradient points in the direction of steepest increase. Gradient descent would move in the opposite direction: [-6, -8] (scaled by learning rate).
MCQ 14
P(A) + P(not A) always equals:
Answer: C
C is correct. An event either happens or it does not. P(A) + P(not A) = 1 is one of the fundamental axioms of probability. If P(rain) = 0.3, then P(no rain) = 0.7, and 0.3 + 0.7 = 1.
MCQ 15
Which NumPy function computes the inverse of a matrix?
Answer: B
B is correct. Matrix operations in NumPy are in the np.linalg module. np.linalg.inv(A) computes the inverse. Other useful functions: np.linalg.det(A) for determinant, np.linalg.eig(A) for eigenvalues/eigenvectors, np.linalg.norm(v) for vector magnitude.
MCQ 16
Why is the chain rule important in deep learning?
Answer: B
B is correct. Neural networks are compositions of functions (layers). To train them, we need to know how each weight affects the final loss. The chain rule allows us to compute these gradients by multiplying derivatives through the chain of layers -- this process is called backpropagation.
Coding Challenges
Challenge 1: Linear Algebra Operations Suite
EasyCreate two 3D vectors a = [2, 3, 5] and b = [1, 4, 6]. Compute: (1) their dot product, (2) element-wise product, (3) magnitude of each, (4) cosine similarity. Print each result with a label.
Sample Input
a = [2, 3, 5], b = [1, 4, 6]
Sample Output
Dot product: 44
Element-wise: [2 12 30]
Mag a: 6.16, Mag b: 7.28
Cosine similarity: 0.9804
Use NumPy for all calculations.
import numpy as np
a = np.array([2, 3, 5])
b = np.array([1, 4, 6])
print(f"Dot product: {np.dot(a, b)}")
print(f"Element-wise: {a * b}")
print(f"Mag a: {np.linalg.norm(a):.2f}, Mag b: {np.linalg.norm(b):.2f}")
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {cosine:.4f}")
Challenge 2: Gradient Descent Visualizer
MediumImplement gradient descent to minimize f(x) = (x - 7)^2 + 2. Start at x = 0, use learning rate 0.15, run 25 steps. Print the step number, current x, f(x), and gradient every 5 steps. Verify convergence to x = 7.
Sample Input
x_init = 0, lr = 0.15, steps = 25
Sample Output
Step 0: x=0.0000, f(x)=51.0000, grad=-14.0000
Step 5: x=5.8235, f(x)=3.3841, grad=-2.3530
...
Final: x=6.9991, f(x)=2.0000
Print every 5th step. Round to 4 decimal places.
x = 0.0
lr = 0.15
for i in range(25):
    fx = (x - 7)**2 + 2
    grad = 2 * (x - 7)
    if i % 5 == 0:
        print(f"Step {i:2d}: x={x:.4f}, f(x)={fx:.4f}, grad={grad:.4f}")
    x = x - lr * grad
print(f"\nFinal: x={x:.4f}, f(x)={(x-7)**2 + 2:.4f}")
print(f"Expected: x=7.0000, f(x)=2.0000")
Challenge 3: Bayes Theorem Calculator
MediumWrite a function bayes(p_a, p_b_given_a, p_b_given_not_a) that computes P(A|B) using Bayes theorem. Test with: (1) Disease testing (P(disease)=0.01, P(pos|disease)=0.99, P(pos|healthy)=0.05), (2) Spam detection (P(spam)=0.4, P(word|spam)=0.7, P(word|not_spam)=0.1).
Sample Input
bayes(0.01, 0.99, 0.05)
Sample Output
P(disease | positive test) = 0.1667
P(spam | contains word) = 0.8235
Function must handle any valid probabilities.
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    p_not_a = 1 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return (p_b_given_a * p_a) / p_b
# Test 1: Disease testing
result1 = bayes(0.01, 0.99, 0.05)
print(f"P(disease | positive test) = {result1:.4f}")
# Test 2: Spam detection
result2 = bayes(0.4, 0.7, 0.1)
print(f"P(spam | contains word) = {result2:.4f}")
Challenge 4: Statistics Dashboard
MediumGiven marks of 15 students: [45, 67, 89, 92, 34, 78, 56, 91, 73, 82, 65, 88, 54, 71, 96], compute and display: mean, median, mode (use scipy), variance, std deviation, range, Q1, Q3, IQR. Also identify outliers using the IQR method (below Q1-1.5*IQR or above Q3+1.5*IQR).
Sample Input
marks = [45, 67, 89, 92, 34, 78, 56, 91, 73, 82, 65, 88, 54, 71, 96]
Sample Output
Complete statistics dashboard with outlier detection
Use NumPy. Show clear formatting.
import numpy as np
from scipy import stats
marks = np.array([45, 67, 89, 92, 34, 78, 56, 91, 73, 82, 65, 88, 54, 71, 96])
print('=== Statistics Dashboard ===')
print(f'Mean: {np.mean(marks):.2f}')
print(f'Median: {np.median(marks):.2f}')
print(f'Mode: {stats.mode(marks, keepdims=False).mode}')
print(f'Variance: {np.var(marks):.2f}')
print(f'Std Dev: {np.std(marks):.2f}')
print(f'Range: {np.ptp(marks)}')
Q1 = np.percentile(marks, 25)
Q3 = np.percentile(marks, 75)
IQR = Q3 - Q1
print(f'Q1: {Q1}, Q3: {Q3}, IQR: {IQR}')
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = marks[(marks < lower) | (marks > upper)]
print(f'Outlier bounds: [{lower:.1f}, {upper:.1f}]')
print(f'Outliers: {outliers if len(outliers) > 0 else "None"}')
Challenge 5: Normal Equation for Linear Regression
HardImplement the normal equation w = (X^T X)^(-1) X^T y to solve linear regression. Create data for y = 3x + 7 + noise with 50 points. Add a bias column to X. Compute weights and print the learned equation. Compare with sklearn's LinearRegression.
Sample Input
50 data points from y = 3x + 7 + noise
Sample Output
Normal equation: y = 2.98x + 7.12
sklearn: y = 2.98x + 7.12
Use np.linalg.inv() for the normal equation. Use random_state=42.
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(42)
x = np.random.uniform(0, 10, 50)
y = 3 * x + 7 + np.random.normal(0, 2, 50)
# Normal equation
X = np.column_stack([np.ones(50), x]) # Add bias column
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(f'Normal equation: y = {w[1]:.2f}x + {w[0]:.2f}')
# sklearn comparison
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
print(f'sklearn: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}')
print(f'Match: {np.allclose(w[1], model.coef_[0]) and np.allclose(w[0], model.intercept_)}')
Challenge 6: Multivariate Gradient Descent
HardImplement gradient descent for linear regression with 2 features. Use synthetic data: y = 2*x1 + 3*x2 + 5 + noise. Start weights at [0, 0, 0] (bias, w1, w2). Use MSE as loss function. Run 1000 iterations with lr=0.01. Print the learned weights and compare with the true values.
Sample Input
100 data points, y = 2*x1 + 3*x2 + 5 + noise
Sample Output
Learned: y = 2.01*x1 + 3.02*x2 + 4.98
True: y = 2*x1 + 3*x2 + 5
Normalize features before training. Print loss every 200 steps.
import numpy as np
np.random.seed(42)
n = 100
x1 = np.random.uniform(0, 10, n)
x2 = np.random.uniform(0, 10, n)
y = 2 * x1 + 3 * x2 + 5 + np.random.normal(0, 1, n)
# Normalize features
x1_norm = (x1 - x1.mean()) / x1.std()
x2_norm = (x2 - x2.mean()) / x2.std()
X = np.column_stack([np.ones(n), x1_norm, x2_norm])
w = np.zeros(3)
lr = 0.01
for i in range(1000):
    y_pred = X @ w
    error = y_pred - y
    loss = np.mean(error ** 2)
    gradient = (2 / n) * X.T @ error
    w = w - lr * gradient
    if i % 200 == 0:
        print(f'Step {i}: Loss = {loss:.4f}')
# Undo the normalization to report weights on the original feature scale
w1 = w[1] / x1.std()
w2 = w[2] / x2.std()
b = w[0] - w1 * x1.mean() - w2 * x2.mean()
print(f'\nLearned: y = {w1:.2f}*x1 + {w2:.2f}*x2 + {b:.2f}')
print(f'True: y = 2*x1 + 3*x2 + 5')