Chapter 23: 50 Advanced Questions

Practice Questions — Reinforcement Learning Basics

10 Easy, 11 Medium, 9 Hard

Topic-Specific Questions

Question 1
Easy
What is the output of the following code?
rl_components = ["Agent", "Environment", "State", "Action", "Reward", "Policy"]
for i, comp in enumerate(rl_components, 1):
    print(f"{i}. {comp}")
enumerate with start=1 gives 1-based indexing.
1. Agent
2. Environment
3. State
4. Action
5. Reward
6. Policy
Question 2
Easy
What is the output?
learning_types = {
    "Supervised": "Labeled data",
    "Unsupervised": "No labels",
    "Reinforcement": "Reward signal"
}
for lt, feedback in learning_types.items():
    print(f"{lt}: {feedback}")
Dictionary iteration produces key-value pairs in insertion order.
Supervised: Labeled data
Unsupervised: No labels
Reinforcement: Reward signal
Question 3
Easy
What is the output?
import random
random.seed(42)

epsilon = 0.3
choices = []
for _ in range(10):
    if random.random() < epsilon:
        choices.append("explore")
    else:
        choices.append("exploit")

explore_count = choices.count("explore")
exploit_count = choices.count("exploit")
print(f"Explore: {explore_count}, Exploit: {exploit_count}")
With epsilon=0.3, about 30% of draws fall below 0.3 in expectation. Because the seed is fixed, the result is deterministic: with seed 42, five of the first ten values from random.random() are below 0.3.
Explore: 5, Exploit: 5
Question 4
Easy
What is the output?
gamma_values = [0.0, 0.5, 0.9, 0.99, 1.0]
for g in gamma_values:
    if g == 0:
        desc = "only immediate reward"
    elif g < 0.9:
        desc = "moderate future value"
    elif g < 1.0:
        desc = "high future value"
    else:
        desc = "equal value (may diverge)"
    print(f"gamma={g}: {desc}")
gamma controls how much the agent values future rewards vs immediate ones.
gamma=0.0: only immediate reward
gamma=0.5: moderate future value
gamma=0.9: high future value
gamma=0.99: high future value
gamma=1.0: equal value (may diverge)
Question 5
Medium
What is the output?
import numpy as np

# Q-table for a 3-state, 2-action environment
Q = np.zeros((3, 2))
print(f"Initial Q-table:\n{Q}")

# Update Q(state=0, action=1) using Bellman equation
state, action = 0, 1
reward = 5.0
next_state = 2
alpha = 0.1
gamma = 0.9

best_next_q = np.max(Q[next_state])
td_target = reward + gamma * best_next_q
td_error = td_target - Q[state][action]
Q[state][action] += alpha * td_error

print(f"\nAfter update: Q[{state}][{action}] = {Q[state][action]}")
print(f"TD target: {td_target}")
print(f"TD error: {td_error}")
Q starts at 0. TD target = reward + gamma * max(Q[next_state]) = 5 + 0.9 * 0 = 5.0, so the TD error is also 5.0. The Q-value moves alpha (0.1) of the way toward the target: 0.1 * 5.0 = 0.5.
Initial Q-table:
[[0. 0.]
 [0. 0.]
 [0. 0.]]

After update: Q[0][1] = 0.5
TD target: 5.0
TD error: 5.0
Question 6
Medium
What is the output?
def compute_return(rewards, gamma):
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
    return round(G, 3)

rewards = [1, 1, 1, 10]  # reward at each timestep
print(f"gamma=0.0: G = {compute_return(rewards, 0.0)}")
print(f"gamma=0.5: G = {compute_return(rewards, 0.5)}")
print(f"gamma=0.9: G = {compute_return(rewards, 0.9)}")
print(f"gamma=1.0: G = {compute_return(rewards, 1.0)}")
The return is computed backwards: G = r + gamma * G. Start from the last reward.
gamma=0.0: G = 1.0
gamma=0.5: G = 3.0
gamma=0.9: G = 10.0
gamma=1.0: G = 13.0
Question 7
Medium
What is the output?
import numpy as np

def epsilon_greedy(Q_state, epsilon):
    if np.random.random() < epsilon:
        return np.random.randint(len(Q_state))  # Random
    return np.argmax(Q_state)  # Best

np.random.seed(0)
Q_state = [2.5, 8.1, 1.3, 5.0]  # Q-values for 4 actions

# With epsilon=0 (pure exploitation)
print(f"Q-values: {Q_state}")
print(f"Best action (epsilon=0): {np.argmax(Q_state)}")

# Track exploration vs exploitation over 1000 trials
epsilon = 0.2
actions = [epsilon_greedy(Q_state, epsilon) for _ in range(1000)]
from collections import Counter
counts = Counter(actions)
for a in sorted(counts):
    print(f"Action {a}: {counts[a]} times ({counts[a]/10:.1f}%)")
Action 1 has the highest Q-value (8.1). With epsilon=0.2, ~80% exploit (action 1), ~20% random.
Q-values: [2.5, 8.1, 1.3, 5.0]
Best action (epsilon=0): 1
Action 0: ~50 times (5.0%)
Action 1: ~850 times (85.0%)
Action 2: ~50 times (5.0%)
Action 3: ~50 times (5.0%)
Question 8
Hard
What is the output?
import numpy as np

# Simulate Q-Learning for 5 updates on a single state-action pair
Q = 0.0  # Initial Q-value for one (state, action) pair
alpha = 0.1
gamma = 0.9

# Simulated transitions: (reward, max_next_Q)
transitions = [
    (1.0, 0.0),
    (1.0, 0.1),
    (0.0, 0.5),
    (1.0, 0.8),
    (1.0, 1.0)
]

for i, (reward, max_next_q) in enumerate(transitions):
    td_target = reward + gamma * max_next_q
    td_error = td_target - Q
    Q = Q + alpha * td_error
    print(f"Step {i+1}: R={reward}, maxQ'={max_next_q}, "
          f"target={td_target:.3f}, error={td_error:.3f}, Q={Q:.4f}")
Apply the Q-learning update: Q += alpha * (R + gamma * maxQ' - Q) at each step.
Step 1: R=1.0, maxQ'=0.0, target=1.000, error=1.000, Q=0.1000
Step 2: R=1.0, maxQ'=0.1, target=1.090, error=0.990, Q=0.1990
Step 3: R=0.0, maxQ'=0.5, target=0.450, error=0.251, Q=0.2241
Step 4: R=1.0, maxQ'=0.8, target=1.720, error=1.496, Q=0.3737
Step 5: R=1.0, maxQ'=1.0, target=1.900, error=1.526, Q=0.5263
Each step is a Bellman update: the Q-value gradually increases as the agent observes rewards and higher future values.
Question 9
Easy
What is the difference between exploration and exploitation in RL?
Think about trying new restaurants vs going to your favorite one.
Exploitation means choosing the action that currently has the highest estimated value, using what the agent has already learned. Exploration means trying actions the agent has not tried much, potentially discovering better strategies. The dilemma: too much exploitation and the agent may miss better options; too much exploration and the agent wastes time on suboptimal actions instead of using what it knows. The epsilon-greedy strategy balances this: with probability epsilon, explore (random action); otherwise, exploit (best known action). Epsilon typically decays from 1.0 to around 0.01 during training.
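The epsilon-greedy strategy with decay described above can be sketched as follows (the decay rate 0.995 and floor 0.01 are illustrative choices, not fixed constants):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# Decay epsilon from 1.0 toward a floor of 0.01 over training.
epsilon, decay, min_epsilon = 1.0, 0.995, 0.01
for episode in range(1000):
    epsilon = max(min_epsilon, epsilon * decay)

print(f"epsilon after 1000 episodes: {epsilon:.4f}")  # 0.0100 (floor reached)
```

With epsilon at the floor, the agent still explores 1% of the time, so it never fully stops discovering.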
Question 10
Medium
Why does DQN use a separate target network instead of using the same Q-network for both prediction and target computation?
Think about what happens when the target keeps changing every time you update the network.
If the same network is used for both Q-value prediction and target computation, every weight update changes the target values too. This creates a moving target problem: the network is trying to match targets that shift with every gradient step, causing oscillation or divergence. The target network is a separate copy of the Q-network that is updated less frequently (every N episodes or via soft updates). This provides stable, fixed targets for several training steps, allowing the Q-network to converge toward them before the targets shift again.
Question 11
Hard
What is the Bellman equation in Q-Learning, and why is it the foundation of the algorithm?
Think about how the value of a state-action pair relates to the immediate reward plus future value.
The Bellman equation states: Q(s, a) = R(s,a) + gamma * max_a'(Q(s', a')). It says the value of taking action a in state s equals the immediate reward R plus the discounted value of the best action in the next state s'. This creates a recursive relationship: the value of the current state depends on the values of future states. Q-Learning uses this to iteratively update a Q-table: Q(s,a) += alpha * [R + gamma * max(Q(s',a')) - Q(s,a)]. The term in brackets is the temporal difference (TD) error -- the gap between the target (Bellman estimate) and the current Q-value. By reducing this error over many updates, Q-values converge to the true optimal values.
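The claim that reducing the TD error drives Q-values to a fixed point can be seen on a single transition (a toy setting: constant reward 1, terminal next state so max Q(s',a') = 0):

```python
# The fixed point of Q += alpha * (1 + 0.9*0 - Q) is Q = 1:
# once Q = 1, the TD error is zero and updates stop changing it.
Q, alpha = 0.0, 0.1
for _ in range(200):
    td_error = 1.0 + 0.9 * 0.0 - Q  # Bellman target minus current estimate
    Q += alpha * td_error

print(f"Q after 200 updates: {Q:.4f}")  # Q after 200 updates: 1.0000
```

Each update closes a fraction alpha of the remaining gap, so Q approaches the Bellman fixed point geometrically.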
Question 12
Hard
What is the output?
import torch
import torch.nn as nn

# Simple DQN for CartPole
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )
    
    def forward(self, x):
        return self.net(x)

model = DQN(4, 2)  # CartPole: 4 state dims, 2 actions
params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {params}")

# Test forward pass
state = torch.randn(1, 4)  # Single state
q_values = model(state)
print(f"Q-values shape: {q_values.shape}")
print(f"Best action: {q_values.argmax().item()}")
Count params including biases: Linear(4,64) = 4*64 + 64 = 320; Linear(64,64) = 64*64 + 64 = 4160; Linear(64,2) = 64*2 + 2 = 130. Total = 320 + 4160 + 130.
Total parameters: 4610
Q-values shape: torch.Size([1, 2])
Best action: 0 or 1 (random, untrained)
Question 13
Hard
What is RLHF (Reinforcement Learning from Human Feedback) and how is it used to align LLMs like ChatGPT?
Think about how human preferences can be turned into a reward signal for RL.
RLHF aligns LLMs with human preferences in three steps: (1) Supervised fine-tuning: Train the LLM on high-quality human-written responses. (2) Reward model training: Human raters rank multiple model outputs for the same prompt. These rankings train a reward model that predicts how much a human would prefer a given response. (3) PPO optimization: The LLM (as the RL agent) generates responses (actions), and the reward model provides reward signals. The LLM's policy is optimized using PPO (Proximal Policy Optimization) to maximize the reward while staying close to the original model (KL penalty prevents the model from diverging too far).
Question 14
Easy
What is the output?
rl_apps = ["AlphaGo", "Robotics", "Recommendations", "RLHF", "Data center cooling"]
for i, app in enumerate(rl_apps, 1):
    print(f"{i}. {app}")
5 real-world RL applications.
1. AlphaGo
2. Robotics
3. Recommendations
4. RLHF
5. Data center cooling
Question 15
Medium
What is the output?
import numpy as np

def compare_q_values(Q_before, Q_after, state):
    print(f"Before training: {Q_before[state]}")
    print(f"After training:  {Q_after[state]}")
    print(f"Best action before: {np.argmax(Q_before[state])}")
    print(f"Best action after:  {np.argmax(Q_after[state])}")

Q_before = {0: [0.0, 0.0, 0.0, 0.0]}
Q_after = {0: [0.5, 8.2, 1.3, 0.7]}

compare_q_values(Q_before, Q_after, 0)
Before training, all Q-values are 0. After training, action 1 has the highest value.
Before training: [0.0, 0.0, 0.0, 0.0]
After training: [0.5, 8.2, 1.3, 0.7]
Best action before: 0
Best action after: 1
Question 16
Hard
What is the output?
from collections import deque

buf = deque(maxlen=5)
for i in range(8):
    buf.append(f"exp_{i}")

print(f"Buffer size: {len(buf)}")
print(f"Contents: {list(buf)}")
print(f"Oldest: {buf[0]}")
print(f"Newest: {buf[-1]}")
deque with maxlen=5 drops oldest when full. After 8 pushes, only last 5 remain.
Buffer size: 5
Contents: ['exp_3', 'exp_4', 'exp_5', 'exp_6', 'exp_7']
Oldest: exp_3
Newest: exp_7
Question 17
Medium
What is the difference between on-policy and off-policy RL? Give an example of each.
Think about whether the data-collecting policy matches the policy being learned.
On-policy: The agent learns about the same policy it uses to collect data. The behavior policy and target policy are identical. Example: SARSA uses the actual next action taken. Off-policy: The agent can learn about a different (usually optimal) policy while following an exploratory behavior policy. Example: Q-Learning always uses max Q(s',a') regardless of the action actually taken. Off-policy enables experience replay (reusing old data) and is more sample-efficient. On-policy is generally more stable but requires fresh data for each update.
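The difference shows up directly in the update rules; a minimal side-by-side sketch (illustrative Q-values, not from any specific environment):

```python
def sarsa_update(Q_sa, reward, Q_next_actual, alpha, gamma):
    # On-policy: uses the Q-value of the action the agent actually took next.
    return Q_sa + alpha * (reward + gamma * Q_next_actual - Q_sa)

def q_learning_update(Q_sa, reward, Q_next_all, alpha, gamma):
    # Off-policy: uses the best next action, regardless of what was taken.
    return Q_sa + alpha * (reward + gamma * max(Q_next_all) - Q_sa)

Q_next = [0.2, 1.0, 0.5]   # Q-values of the 3 actions in the next state
taken = 0                  # suppose an exploratory action 0 was actually taken

s = sarsa_update(0.0, 1.0, Q_next[taken], alpha=0.1, gamma=0.9)
q = q_learning_update(0.0, 1.0, Q_next, alpha=0.1, gamma=0.9)
print(f"SARSA: {s:.4f}, Q-Learning: {q:.4f}")
```

SARSA's target follows the exploratory action (0.2), while Q-Learning's target uses the max (1.0), so the two updates diverge whenever exploration picks a non-greedy action.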

Mixed & Application Questions

Question 1
Easy
What is the output?
rl_milestones = [
    ("AlphaGo", "Go", 2016),
    ("AlphaZero", "Chess + Go", 2017),
    ("OpenAI Five", "Dota 2", 2019),
    ("AlphaStar", "StarCraft II", 2019)
]
for name, game, year in rl_milestones:
    print(f"{year}: {name} mastered {game}")
Tuple unpacking in a for loop.
2016: AlphaGo mastered Go
2017: AlphaZero mastered Chess + Go
2019: OpenAI Five mastered Dota 2
2019: AlphaStar mastered StarCraft II
Question 2
Easy
What is the output?
epsilon = 1.0
decay = 0.9
for step in range(6):
    print(f"Step {step}: epsilon = {epsilon:.4f}")
    epsilon *= decay
Multiply epsilon by 0.9 each step. 1.0 * 0.9 = 0.9, * 0.9 = 0.81, etc.
Step 0: epsilon = 1.0000
Step 1: epsilon = 0.9000
Step 2: epsilon = 0.8100
Step 3: epsilon = 0.7290
Step 4: epsilon = 0.6561
Step 5: epsilon = 0.5905
Question 3
Medium
What is the output?
import numpy as np

# Simulate experience replay buffer
buffer = []
for i in range(10):
    experience = (f"s{i}", f"a{i%2}", i * 0.5, f"s{i+1}")
    buffer.append(experience)

print(f"Buffer size: {len(buffer)}")
print(f"First: {buffer[0]}")
print(f"Last: {buffer[-1]}")

# Random sample of 3
np.random.seed(42)
indices = np.random.choice(len(buffer), 3, replace=False)
print(f"\nSampled indices: {sorted(indices)}")
for idx in sorted(indices):
    print(f"  {buffer[idx]}")
10 experiences are stored. np.random.choice with replace=False draws 3 distinct indices.
Buffer size: 10
First: ('s0', 'a0', 0.0, 's1')
Last: ('s9', 'a1', 4.5, 's10')
Sampled indices: three distinct indices, fixed by the seed (NumPy's legacy RandomState is reproducible across runs)
Three experiences from the buffer, printed in sorted index order.
Question 4
Medium
What is the output?
def cartpole_state_meaning(state):
    names = ["Cart Position", "Cart Velocity", "Pole Angle", "Pole Angular Vel"]
    for name, val in zip(names, state):
        print(f"  {name:20s}: {val:+.4f}")

state = [0.0312, -0.1547, 0.0285, 0.3021]
print("CartPole State:")
cartpole_state_meaning(state)

# Check if pole is falling
angle = state[2]
print(f"\nPole angle: {angle:.4f} rad ({abs(angle)*180/3.14159:.1f} degrees)")
print(f"Falling? {abs(angle) > 0.2095}")
CartPole has 4 state variables. The episode ends when angle exceeds ~12 degrees (0.2095 rad).
CartPole State:
  Cart Position       : +0.0312
  Cart Velocity       : -0.1547
  Pole Angle          : +0.0285
  Pole Angular Vel    : +0.3021

Pole angle: 0.0285 rad (1.6 degrees)
Falling? False
Question 5
Hard
What is the output?
import torch
import torch.nn as nn

# Experience replay sampling and DQN loss computation
def compute_dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    
    states = torch.FloatTensor(states)
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(next_states)
    dones = torch.FloatTensor(dones)
    
    # Current Q-values for taken actions
    current_q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    
    # Target Q-values
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        target_q = rewards + gamma * next_q * (1 - dones)
    
    loss = nn.MSELoss()(current_q, target_q)
    return loss

# Dummy networks and batch
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net.load_state_dict(q_net.state_dict())

batch = (
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],  # states
    [0, 1],                                             # actions
    [1.0, 1.0],                                         # rewards
    [[0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8, 0.9]],    # next_states
    [0.0, 0.0]                                          # dones (not done)
)

loss = compute_dqn_loss(q_net, target_net, batch)
print(f"DQN Loss: {loss.item():.6f}")
print(f"Loss > 0: {loss.item() > 0}")
print(f"Loss type: {type(loss).__name__}")
The loss is MSE between predicted Q-values and Bellman targets.
DQN Loss: [some small positive value]
Loss > 0: True
Loss type: Tensor
Question 6
Medium
Nidhi wants to build an RL agent that learns to play a simple board game. Should she use Q-Learning with a Q-table or DQN? What factors should she consider?
Think about the size of the state space and action space.
If the board game has a small, discrete state space (e.g., tic-tac-toe has about 5,500 valid states), a Q-table is simpler, faster, and guaranteed to converge. If the game has a large or continuous state space (e.g., chess has ~10^43 possible positions), a Q-table is impractical and Nidhi should use DQN which can generalize across similar states. Other factors: Q-tables are easier to implement and debug, require no GPU, and are fully interpretable (you can read the exact Q-values). DQN requires more hyperparameter tuning, training data, and compute, but handles complex environments that Q-tables cannot.
Question 7
Hard
Compare Q-Learning and Policy Gradient methods. What are the advantages and disadvantages of each? When would Vikram choose Policy Gradients over Q-Learning?
Think about discrete vs continuous action spaces and the nature of what each method learns.
Q-Learning learns a value function and derives the policy (pick the action with highest Q-value). Pros: sample-efficient, works well with discrete actions. Cons: struggles with continuous action spaces, may converge to suboptimal solutions due to function approximation errors. Policy Gradient methods directly optimize the policy (a probability distribution over actions). Pros: naturally handles continuous action spaces, can learn stochastic policies, more theoretically grounded. Cons: high variance, less sample-efficient, may converge to local optima. Vikram should choose Policy Gradients when: the action space is continuous (e.g., controlling a robot's joint angles), a stochastic policy is needed (e.g., game theory scenarios where randomization is optimal), or when using actor-critic methods like PPO for complex environments.
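The "directly optimize the policy" idea can be made concrete with REINFORCE on a two-armed bandit (a toy sketch: the payout probabilities 0.2 and 0.8, learning rate, and step count are all illustrative):

```python
import math
import random

random.seed(0)

# Softmax policy over two action preferences, updated with REINFORCE:
#   pref[a] += lr * reward * (1 - pi(a))  for the chosen arm
#   pref[b] -= lr * reward * pi(b)        for the other arm
prefs = [0.0, 0.0]
lr = 0.1
true_means = [0.2, 0.8]  # arm 1 pays more on average

def softmax(p):
    z = [math.exp(x) for x in p]
    s = sum(z)
    return [x / s for x in z]

for _ in range(2000):
    pi = softmax(prefs)
    a = 0 if random.random() < pi[0] else 1          # sample from the policy
    reward = 1.0 if random.random() < true_means[a] else 0.0
    for i in range(2):
        grad = (1 - pi[i]) if i == a else -pi[i]     # d log pi / d pref
        prefs[i] += lr * reward * grad

pi = softmax(prefs)
print(f"Final policy: P(arm 0)={pi[0]:.3f}, P(arm 1)={pi[1]:.3f}")
```

No Q-table is ever built: the policy itself (a probability distribution) is pushed toward the rewarded arm.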
Question 8
Hard
Explain Experience Replay in DQN. Why is random sampling from a buffer better than training on consecutive transitions?
Think about what happens when a model trains on correlated data sequences.
Experience Replay stores past transitions (s, a, r, s', done) in a buffer and trains on randomly sampled batches. This is better than consecutive training for three reasons: (1) Breaks temporal correlations: consecutive transitions are highly correlated (similar states, same part of the environment). Neural networks trained on correlated data overfit to recent patterns and forget earlier learning. Random sampling provides diverse, decorrelated training data. (2) Data efficiency: each experience can be reused for multiple training updates instead of being used once and discarded. (3) Stability: the training distribution is smoother (a mix of many experiences) rather than shifting rapidly as the agent moves through different parts of the environment.
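A minimal replay buffer capturing these three properties can be sketched as (class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest dropped when full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation between
        # consecutive transitions and lets each experience be reused.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push(t, t % 2, 1.0, t + 1, False)

print(f"Buffer size: {len(buf)}")      # capped at capacity
print(f"Batch size: {len(buf.sample(8))}")
```

The deque's maxlen gives the sliding window for free: after 150 pushes only the most recent 100 transitions remain.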
Question 9
Easy
What is the output?
rl_vs_others = {
    "Supervised": "Labeled data (x, y)",
    "Unsupervised": "Unlabeled data (x)",
    "Reinforcement": "Reward signal from environment"
}
for paradigm, feedback in rl_vs_others.items():
    print(f"{paradigm:15s} uses: {feedback}")
Three ML paradigms with different types of feedback.
Supervised uses: Labeled data (x, y)
Unsupervised uses: Unlabeled data (x)
Reinforcement uses: Reward signal from environment
Question 10
Medium
What is the output?
def q_learning_update(Q_sa, reward, max_Q_next, alpha, gamma):
    target = reward + gamma * max_Q_next
    td_error = target - Q_sa
    new_Q = Q_sa + alpha * td_error
    return round(new_Q, 4), round(td_error, 4)

# State-action Q-value starts at 0
Q = 0.0
alpha, gamma = 0.1, 0.9

updates = [(1.0, 0.0), (1.0, 0.5), (0.0, 1.0), (1.0, 1.2)]
for reward, max_next_q in updates:
    Q, td_err = q_learning_update(Q, reward, max_next_q, alpha, gamma)
    print(f"R={reward}, maxQ'={max_next_q} -> Q={Q}, TD_err={td_err}")
Apply Bellman update: Q += alpha * (R + gamma * maxQ' - Q) at each step.
R=1.0, maxQ'=0.0 -> Q=0.1, TD_err=1.0
R=1.0, maxQ'=0.5 -> Q=0.235, TD_err=1.35
R=0.0, maxQ'=1.0 -> Q=0.3015, TD_err=0.665
R=1.0, maxQ'=1.2 -> Q=0.4794, TD_err=1.7785
The Q-value increases with each update (the final value sits on a rounding boundary at 0.47935, so the last digit may print as 0.4793 depending on floating-point rounding).
Question 11
Medium
What is the output?
import numpy as np

def evaluate_policy(Q_table):
    actions = ["Up", "Right", "Down", "Left"]
    policy = {}
    for state in Q_table:
        best_action = actions[np.argmax(Q_table[state])]
        policy[state] = best_action
    return policy

# Simulated Q-table for a 2x2 grid
Q = {
    (0,0): [0.1, 0.8, 0.3, 0.1],  # Best: Right
    (0,1): [0.1, 0.1, 0.9, 0.1],  # Best: Down
    (1,0): [0.1, 0.7, 0.1, 0.1],  # Best: Right
    (1,1): [0.0, 0.0, 0.0, 0.0],  # Goal state
}

policy = evaluate_policy(Q)
for state, action in sorted(policy.items()):
    print(f"State {state}: {action}")
argmax selects the action with the highest Q-value at each state. At the goal state all Q-values tie at 0, so argmax returns index 0 (Up).
State (0, 0): Right
State (0, 1): Down
State (1, 0): Right
State (1, 1): Up
Question 12
Easy
What is the difference between a policy and a value function in RL?
One tells the agent what to do, the other tells the agent how good a situation is.
A policy (pi) is the agent's strategy: it maps states to actions, telling the agent what to do in each situation. It can be deterministic (one action per state) or stochastic (probability distribution over actions). A value function V(s) estimates how good a state is by predicting the expected cumulative reward from that state forward. The Q-function Q(s,a) estimates how good a specific action is in a given state. The relationship: a greedy policy selects the action that maximizes the Q-function at each state.
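The relationship in the last sentence can be shown in a few lines (the two-state Q-table here is made up for illustration):

```python
# A Q-function (here a tiny table) and the greedy policy derived from it.
Q = {
    "s0": {"left": 0.2, "right": 0.9},
    "s1": {"left": 0.7, "right": 0.1},
}

def greedy_policy(Q, state):
    # pi(s) = argmax_a Q(s, a): the policy is derived from the Q-function.
    return max(Q[state], key=Q[state].get)

def state_value(Q, state):
    # Under the greedy policy, V(s) = max_a Q(s, a).
    return max(Q[state].values())

print(greedy_policy(Q, "s0"), state_value(Q, "s0"))  # right 0.9
```

This is exactly what tabular Q-Learning does at the end of training: the learned values imply the policy.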
Question 13
Hard
Why is reward shaping important in RL, and what risks does it introduce?
Think about sparse vs dense rewards and how shaping can accidentally change the optimal policy.
Reward shaping adds intermediate rewards to guide the agent toward the goal, solving the sparse reward problem where the agent only receives reward at the goal (making it hard to learn). Example: adding a small reward proportional to distance traveled toward the goal. Risks: (1) Reward hacking: the agent may find unintended ways to maximize shaped rewards without achieving the actual goal. (2) Optimal policy change: poorly designed shaping can change the optimal policy (the agent does what earns shaped rewards instead of the real objective). (3) Human bias: the designer's assumptions about 'good' intermediate states may be wrong. Potential-based reward shaping (PBRS) is a theoretically safe method that preserves the optimal policy while providing denser feedback.
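The PBRS formula F(s, s') = gamma * Phi(s') - Phi(s) can be sketched on a 1-D corridor (the potential function and goal position are illustrative assumptions):

```python
# Potential-based reward shaping: F(s, s') = gamma * Phi(s') - Phi(s).
# Adding F to the environment reward provably preserves the optimal policy.
gamma = 0.9

def phi(state):
    # Illustrative potential: negative distance to a goal at position 10.
    return -abs(10 - state)

def shaped_reward(reward, state, next_state):
    return reward + gamma * phi(next_state) - phi(state)

# Moving toward the goal (5 -> 6) earns a positive shaping bonus...
print(f"{shaped_reward(0.0, 5, 6):+.2f}")
# ...while moving away (5 -> 4) is penalized.
print(f"{shaped_reward(0.0, 5, 4):+.2f}")
```

Because the shaping terms telescope along any trajectory, they cancel out of the comparison between policies, which is why PBRS is safe.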

Multiple Choice Questions

MCQ 1
What type of feedback does an RL agent receive from the environment?
  • A. Labeled training examples
  • B. A reward signal (numerical value)
  • C. Cluster assignments
  • D. Feature importance rankings
Answer: B
B is correct. An RL agent receives a numerical reward signal after taking actions. Positive rewards encourage the action, negative rewards discourage it. This is different from supervised learning (labeled examples) or unsupervised learning (no explicit feedback).
MCQ 2
In epsilon-greedy strategy, what does the agent do when random() < epsilon?
  • A. Takes the best known action
  • B. Takes a random action (explores)
  • C. Stops the episode
  • D. Resets the environment
Answer: B
B is correct. When random() < epsilon, the agent explores by taking a random action. Otherwise (1-epsilon probability), it exploits by taking the action with the highest Q-value. Higher epsilon means more exploration.
MCQ 3
What does the discount factor gamma control in RL?
  • A. The learning rate
  • B. How much the agent values future rewards vs immediate rewards
  • C. The number of actions available
  • D. The size of the neural network
Answer: B
B is correct. Gamma between 0 and 1 determines the importance of future rewards. Gamma=0 means the agent only cares about the immediate reward. Gamma=0.99 means future rewards are nearly as valuable as immediate ones.
MCQ 4
What is a Q-table?
  • A. A neural network for RL
  • B. A lookup table that stores expected cumulative rewards for each state-action pair
  • C. A database of training examples
  • D. A queue of actions to take
Answer: B
B is correct. A Q-table maps (state, action) pairs to their estimated Q-values (expected cumulative rewards). The Q-learning algorithm updates this table using the Bellman equation as the agent gains experience.
MCQ 5
What is the Markov property in an MDP?
  • A. Actions must be reversible
  • B. The future depends only on the current state, not on history
  • C. All rewards must be positive
  • D. The environment must be deterministic
Answer: B
B is correct. The Markov property states that the probability of the next state depends only on the current state and action, not on the sequence of states that preceded it. This memoryless property makes RL tractable.
MCQ 6
What is the TD (Temporal Difference) error in Q-Learning?
  • A. The time taken for one episode
  • B. The difference between the Bellman target and the current Q-value estimate
  • C. The number of steps in an episode
  • D. The change in epsilon over time
Answer: B
B is correct. TD error = [R + gamma * max(Q(s',a'))] - Q(s,a). It is the difference between the target value (from the Bellman equation) and the current estimate. Q-learning updates Q by alpha * TD error, driving the error toward zero.
MCQ 7
Why does DQN use experience replay?
  • A. To speed up the environment simulation
  • B. To break temporal correlations and provide diverse training data
  • C. To increase the number of actions available
  • D. To reduce the model size
Answer: B
B is correct. Consecutive transitions are correlated, causing unstable training. Experience replay stores past transitions in a buffer and trains on random batches, breaking correlations and providing diverse, decorrelated training data.
MCQ 8
What function does env.step(action) return in Gymnasium?
  • A. Only the next state
  • B. next_state, reward, terminated, truncated, info
  • C. reward and done flag only
  • D. The action's Q-value
Answer: B
B is correct. env.step(action) returns a 5-tuple: next_state (new observation), reward (numerical feedback), terminated (episode ended due to environment rules), truncated (episode ended due to time limit), and info (diagnostic information).
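The 5-tuple interface can be illustrated with a minimal hand-rolled environment (this is not the real Gymnasium API, just a toy class with the same return shape):

```python
import random

class ToyEnv:
    """Minimal stand-in for a Gymnasium-style environment."""
    MAX_STEPS = 5  # illustrative time limit that triggers truncation

    def reset(self, seed=None):
        random.seed(seed)
        self.t = 0
        return 0.0, {}  # observation, info

    def step(self, action):
        self.t += 1
        next_state = self.t * 0.1
        reward = 1.0
        terminated = action == 1 and random.random() < 0.2  # environment rule
        truncated = self.t >= self.MAX_STEPS                # time limit
        return next_state, reward, terminated, truncated, {}

env = ToyEnv()
obs, info = env.reset(seed=0)
done, total = False, 0.0
while not done:
    obs, reward, terminated, truncated, info = env.step(0)
    total += reward
    done = terminated or truncated  # episode ends on either flag

print(f"Episode return: {total}")
```

The terminated/truncated split matters for learning: a truncated episode should still bootstrap from the next state, while a terminated one should not.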
MCQ 9
What RL technique is used to align LLMs like ChatGPT with human preferences?
  • A. Q-Learning
  • B. Deep Q-Network
  • C. RLHF (Reinforcement Learning from Human Feedback)
  • D. Monte Carlo Tree Search
Answer: C
C is correct. RLHF trains a reward model from human preference rankings, then uses PPO to optimize the LLM's policy to maximize this learned reward while staying close to the original model. This makes LLMs helpful, harmless, and honest.
MCQ 10
In CartPole, what causes the episode to end (terminate)?
  • A. The agent takes 10 actions
  • B. The pole angle exceeds ~12 degrees or the cart moves too far
  • C. The reward reaches zero
  • D. The agent chooses the same action twice
Answer: B
B is correct. CartPole terminates when the pole angle exceeds +/-12 degrees (0.2095 radians) or the cart position exceeds +/-2.4 units from center. The agent gets reward +1 for every step the pole stays upright, up to a maximum of 500 steps.
MCQ 11
What is the role of the target network in DQN?
  • A. It generates training data
  • B. It provides stable target Q-values by updating less frequently than the main Q-network
  • C. It replaces the Q-network after training
  • D. It computes the loss function
Answer: B
B is correct. The target network is a copy of the Q-network that is updated only periodically (every N episodes). It computes stable target Q-values for the Bellman equation, preventing the moving target problem where the Q-network chases its own changing predictions.
MCQ 12
Why is Q-Learning called a 'model-free' algorithm?
  • A. It does not use any neural networks
  • B. It does not require a model of the environment's transition dynamics
  • C. It works without any rewards
  • D. It does not need a Q-table
Answer: B
B is correct. 'Model-free' means Q-Learning does not need to know the transition probabilities P(s'|s,a) or the reward function R(s,a). It learns directly from experience -- the agent takes actions, observes outcomes, and updates Q-values. Model-based methods, in contrast, learn or use a model of the environment to plan ahead.
MCQ 13
What problem does Q-Learning face when applied to environments with continuous state spaces?
  • A. It runs too slowly
  • B. The Q-table becomes impractically large or infinite since continuous values are rarely revisited
  • C. The rewards become negative
  • D. The agent cannot take actions
Answer: B
B is correct. With continuous states (float values), almost no state is visited twice, so the Q-table has millions of unique entries with values of 0. Learning is impossible because Q-values are never updated. Solutions: discretize the state space (lose precision) or use DQN (neural network as function approximator).
MCQ 14
What is the key difference between on-policy and off-policy RL algorithms?
  • A. On-policy is faster, off-policy is more accurate
  • B. On-policy learns about the policy being followed; off-policy can learn about a different policy than the one generating data
  • C. On-policy uses neural networks; off-policy uses tables
  • D. There is no difference
Answer: B
B is correct. On-policy methods (SARSA, PPO) learn about the same policy that generates actions. Off-policy methods (Q-Learning, DQN) can learn about the optimal policy while following a different exploration policy (like epsilon-greedy). Off-policy is more sample-efficient because it can reuse old data (experience replay).
MCQ 15
Which algorithm defeated the world champion at Go in 2016?
  • A. Deep Q-Network
  • B. AlphaGo (using Monte Carlo Tree Search + deep RL)
  • C. SARSA
  • D. Vanilla Policy Gradient
Answer: B
B is correct. AlphaGo (DeepMind, 2016) defeated Lee Sedol at Go using a combination of deep convolutional neural networks, Monte Carlo Tree Search, supervised learning from human games, and reinforcement learning through self-play. Its successor AlphaZero learned entirely from self-play.
MCQ 16
What does the agent receive from the environment after taking an action?
  • A. A training dataset
  • B. The next state, a reward signal, and a done flag
  • C. The model's weights
  • D. A label for the action
Answer: B
B is correct. After the agent takes an action, the environment returns: the next state (new observation), a reward signal (numerical feedback), and a done flag (whether the episode has ended). This tuple is the fundamental data unit in RL.
MCQ 17
What is the purpose of the replay buffer in DQN?
  • A. To replay the game for the user
  • B. To store past experiences and sample random batches for training
  • C. To store the model weights
  • D. To increase the action space
Answer: B
B is correct. The replay buffer stores past transitions (s, a, r, s', done) and provides random batches for training. This breaks temporal correlations between consecutive transitions and provides diverse training data, which is essential for stable DQN training.
MCQ 18
What RL algorithm learns a Q-table mapping state-action pairs to expected cumulative rewards?
  • A. Linear Regression
  • B. Q-Learning
  • C. K-Means
  • D. Random Forest
Answer: B
B is correct. Q-Learning is a model-free RL algorithm that maintains a Q-table storing the expected cumulative reward for each (state, action) pair. It updates values using the Bellman equation as the agent interacts with the environment.
MCQ 19
What is PPO (Proximal Policy Optimization) and why is it popular?
  • A. A value-based method for small state spaces
  • B. A policy gradient method that constrains policy updates for stable training
  • C. A supervised learning algorithm for classification
  • D. A method for compressing neural networks
Answer: B
B is correct. PPO is a policy gradient method that limits how much the policy can change in each update (via a clipping mechanism). This prevents large, destabilizing updates while still making progress. PPO is used in RLHF for aligning LLMs and is the default choice for many complex RL tasks.
MCQ 20
Why is epsilon typically decayed during Q-learning training?
  • A. To make the model smaller over time
  • B. To gradually shift from exploration to exploitation as the agent learns
  • C. To reduce the learning rate
  • D. To increase the discount factor
Answer: B
B is correct. Early in training, the agent knows little, so high epsilon (lots of exploration) helps discover good strategies. As training progresses and Q-values improve, lower epsilon (more exploitation) lets the agent use what it has learned. A minimum epsilon (e.g., 0.01) ensures some exploration continues.

Coding Challenges

Coding challenges coming soon.
