Chapter 18 · Advanced · 53 Questions

Practice Questions — Recurrent Neural Networks (RNN) and LSTM

9 Easy · 10 Medium · 9 Hard

Topic-Specific Questions

Question 1
Easy
What type of data are RNNs specifically designed to process?
Think about data where order matters.
RNNs are designed for sequential data -- data where the order of elements matters. Examples: text (sequences of words), time series (sequences of measurements), audio (sequences of sound samples), video (sequences of frames), DNA (sequences of nucleotides). The key property is that the meaning depends on the order.
Question 2
Easy
What is the hidden state in an RNN?
It is the RNN's memory of what it has seen so far.
The hidden state is a vector (e.g., 128 dimensions) that represents the RNN's memory of the sequence seen so far. At each time step, the hidden state is updated based on the current input and the previous hidden state. After processing the full sequence, the final hidden state encodes a compressed summary of the entire sequence. It is the mechanism by which information flows from earlier time steps to later ones.
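The update rule can be sketched in a few lines of NumPy (a toy illustration with made-up dimensions and weight names, not Keras internals):

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim, timesteps = 16, 8, 5
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

x = rng.normal(size=(timesteps, input_dim))  # one sequence of 5 steps
h = np.zeros(hidden_dim)                     # initial hidden state (the "memory")

for t in range(timesteps):
    # Each step mixes the current input with the previous memory
    h = np.tanh(x[t] @ W_xh + h @ W_hh + b)

print(h.shape)  # (8,) -- a fixed-size summary of the whole sequence
```

Whatever the sequence length, the final `h` has the same shape, which is exactly why the hidden state works as a compressed summary.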
Question 3
Easy
What is the output shape?
model = tf.keras.Sequential([
    Embedding(10000, 64, input_length=100),
    LSTM(128),
    Dense(1, activation='sigmoid')
])
print(model.output_shape)
The last Dense layer has 1 unit with sigmoid.
(None, 1)
Question 4
Easy
What is the output shape of Bidirectional(LSTM(64))?
layer = Bidirectional(LSTM(64))
# Input: (batch, 50, 32)
# Output: ?
Bidirectional concatenates forward and backward outputs.
(None, 128)
Question 5
Easy
What does pad_sequences produce?
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
padded = pad_sequences(seqs, maxlen=4, padding='post')
print(padded)
padding='post' adds zeros at the end. maxlen=4 limits the length.
[[1, 2, 3, 0], [4, 5, 0, 0], [6, 7, 8, 9]]
Question 6
Easy
What does the Embedding layer do?
It converts integer word indices to dense vectors.
The Embedding layer converts integer word indices into dense floating-point vectors. For example, Embedding(10000, 64) creates a lookup table of 10000 words, each mapped to a 64-dimensional vector. When the input is [42, 315, 18], the layer looks up the 64-dim vector for each index. These vectors are learned during training, so semantically similar words end up with similar vectors. It replaces the need for one-hot encoding, which would be sparse and high-dimensional.
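Mechanically, the layer is just a trainable lookup table. A toy NumPy version with an invented 10-word vocabulary:

```python
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
# The embedding "layer" is a matrix with one row per vocabulary index
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([3, 7, 1])        # a sentence as integer indices
vectors = embedding_matrix[token_ids]  # row lookup: one dense vector per token
print(vectors.shape)  # (3, 4)
```

Training simply adjusts the rows of this matrix by gradient descent, which is how similar words drift toward similar vectors.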
Question 7
Medium
What is the output shape at each layer?
model = tf.keras.Sequential([
    Embedding(5000, 32, input_length=50),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dense(5, activation='softmax')
])
for layer in model.layers:
    print(f"{layer.name}: {layer.output_shape}")
return_sequences=True returns 3D output; False returns 2D.
embedding: (None, 50, 32)
lstm: (None, 50, 64)
lstm_1: (None, 32)
dense: (None, 5)
Question 8
Medium
Explain the vanishing gradient problem in vanilla RNNs. Why does it happen and what is its practical effect?
Gradients are multiplied by the weight matrix at each time step during backpropagation.
During backpropagation through time, gradients flow backward from the output through each time step. At each step, the gradient is multiplied by the weight matrix W_hh. If the largest eigenvalue of W_hh is less than 1, these repeated multiplications cause the gradient to shrink exponentially -- after N steps, the gradient is roughly proportional to (eigenvalue)^N, which approaches zero. Practical effect: the network cannot learn dependencies between events more than ~10-20 steps apart. In a 500-word review, the network cannot connect information from the beginning to the end.
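A quick numeric sketch of that exponential decay (and its mirror image, explosion), using the rough proportionality gradient ~ eigenvalue**N:

```python
# How a repeated multiplicative factor scales the gradient after N steps
for eig in (0.9, 1.0, 1.1):
    scales = [eig ** n for n in (10, 50, 100)]
    print(eig, [f"{s:.3g}" for s in scales])
```

With an eigenvalue of 0.9 the signal is effectively gone after 50 steps, while 1.1 blows up by step 100; only values pinned near 1 survive, which is the property LSTM's cell state is designed to provide.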
Question 9
Medium
What are the three gates in an LSTM? Briefly explain what each does.
Forget, Input, and Output gates.
Forget Gate: Examines the previous hidden state and current input, outputs a value between 0 and 1 for each cell state element. Decides what information to discard from the cell state (0 = forget completely, 1 = keep fully). Input Gate: Decides what new information to store in the cell state. A sigmoid determines which values to update. A tanh creates candidate new values. Together they add new information to memory. Output Gate: Decides what to output from the cell state. A sigmoid determines which parts of the cell state to expose as the hidden state output. The cell state is the internal memory; the output gate controls what the outside world sees.
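The three gates plus the candidate can be written as one NumPy step (a simplified sketch; Keras fuses the same four blocks internally, but the variable names here are invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked weights for the
    forget (f), input (i), candidate (g) and output (o) blocks."""
    z = x @ W + h_prev @ U + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates squashed to (0, 1)
    g = np.tanh(g)                                # candidate values in (-1, 1)
    c = f * c_prev + i * g   # forget old memory, then add new information
    h = o * np.tanh(c)       # output gate controls what is exposed
    return h, c

rng = np.random.default_rng(0)
n_in, n_units = 6, 4
W = rng.normal(size=(n_in, 4 * n_units)) * 0.1
U = rng.normal(size=(n_units, 4 * n_units)) * 0.1
b = np.zeros(4 * n_units)

h, c = np.zeros(n_units), np.zeros(n_units)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Note how `c` is touched only by element-wise multiplication and addition, while `h` is what downstream layers actually see.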
Question 10
Medium
How many parameters does this LSTM layer have?
layer = LSTM(32, input_shape=(10, 16))
layer.build(input_shape=(None, 10, 16))
print(layer.count_params())
LSTM has 4 gates, each with weights for input and hidden state, plus biases.
6272
Question 11
Medium
When should you use a Bidirectional RNN vs a unidirectional RNN?
It depends on whether future context is available.
Use Bidirectional when the complete sequence is available and future context helps: text classification, sentiment analysis, named entity recognition, question answering. The backward pass provides information from later in the sequence that helps understand earlier tokens. Use Unidirectional when future data is not available: real-time prediction (next word prediction), time series forecasting (predicting future values), speech recognition (processing audio as it arrives), any streaming/online application.
Question 12
Hard
What is wrong with this code, and what error will it produce?
model = tf.keras.Sequential([
    Embedding(10000, 64),
    LSTM(64),                          # return_sequences=False
    LSTM(32, return_sequences=True),   # Expects 3D input
    Dense(1, activation='sigmoid')
])
The first LSTM outputs 2D but the second expects 3D.
The first LSTM has return_sequences=False (default), so it outputs 2D shape (batch, 64). The second LSTM expects 3D input (batch, timesteps, features). This raises: ValueError: Input 0 of layer 'lstm_1' is incompatible with the layer: expected ndim=3, found ndim=2. Fix: set return_sequences=True on the first LSTM.
Question 13
Hard
How does GRU differ from LSTM? What are the trade-offs?
GRU has 2 gates instead of 3 and no separate cell state.
LSTM has 3 gates (forget, input, output) and a separate cell state. GRU has 2 gates (update, reset) and no separate cell state -- it merges cell state and hidden state. The update gate in GRU combines the roles of LSTM's forget and input gates. Trade-offs: GRU has ~25% fewer parameters (faster training, less memory), is simpler to implement and understand, and performs comparably to LSTM on most tasks. LSTM may have an edge on tasks requiring very long-range dependencies because its separate cell state provides a more stable memory pathway. In practice, the performance difference is usually small.
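The parameter comparison can be verified with small counting helpers (the function names are ours; the `reset_after` flag mirrors the TensorFlow 2.x Keras default, which adds a second bias per block):

```python
def lstm_params(input_dim, units):
    # 4 blocks (forget, input, candidate, output), each with
    # input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units, reset_after=True):
    # 3 blocks (update, reset, candidate); reset_after doubles the bias
    bias = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + bias)

print(lstm_params(32, 64))                    # 24832
print(gru_params(32, 64))                     # 18816 (Keras default)
print(gru_params(32, 64, reset_after=False))  # 18624 (classic formulation)
```

For these sizes the GRU comes in at roughly 76% of the LSTM's parameters, matching the "~25% fewer" figure above.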
Question 14
Hard
What is the total number of trainable parameters?
model = tf.keras.Sequential([
    Embedding(5000, 32, input_length=100),
    Bidirectional(LSTM(64)),
    Dense(1, activation='sigmoid')
])
print(model.count_params())
Embedding: vocab * dim. Bidirectional LSTM: 2 * LSTM params. Dense: input * output + bias.
Embedding: 5000 * 32 = 160,000. BiLSTM: 2 * 4 * ((32+64)*64 + 64) = 2 * 4 * (6144+64) = 2 * 24,832 = 49,664. Dense: (128*1) + 1 = 129. Total: 209,793.
Question 15
Hard
Why does the cell state update equation C_t = f_t * C_(t-1) + i_t * C_candidate solve the vanishing gradient problem?
Compare multiplication by a weight matrix (vanilla RNN) vs element-wise multiplication and addition (LSTM).
In a vanilla RNN, the hidden state is updated by multiplying by the weight matrix W_hh, which causes gradients to shrink (or explode) exponentially. In LSTM, the cell state is updated through element-wise multiplication (f_t * C_(t-1)) and addition (+ i_t * C_candidate). The forget gate f_t is bounded between 0 and 1 by sigmoid. When f_t is close to 1, the gradient flows through almost unchanged (multiplying by ~1 preserves it). The addition operation has a gradient of 1, meaning it does not cause the gradient to shrink at all. Together, this creates a gradient highway where information (and gradients) can flow across many time steps without vanishing.

Mixed & Application Questions

Question 1
Easy
What is the shape of the Embedding output?
layer = Embedding(10000, 64, input_length=50)
# Input: (batch, 50) -- integer indices
# Output: ?
Each integer is replaced by a 64-dimensional vector.
(None, 50, 64)
Question 2
Easy
What does num_words=10000 mean in Tokenizer?
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
It limits the vocabulary size.
It keeps only the 10,000 most frequent words in the vocabulary. Words that are not in the top 10,000 are replaced with the OOV (out-of-vocabulary) token. This limits the vocabulary size, which reduces the Embedding layer size and prevents rare words from adding noise.
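The indexing convention can be imitated in plain Python (a hedged sketch, not the actual Keras Tokenizer implementation: index 1 is reserved for OOV, frequent words get small indices, and anything ranked at or beyond num_words falls back to OOV):

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    # Most frequent words get the smallest indices; 1 is reserved for OOV
    counts = Counter(w for t in texts for w in t.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def texts_to_sequences(texts, index, num_words):
    # Unknown words and words ranked outside the top num_words map to OOV (1)
    return [[index.get(w, 1) if index.get(w, 1) < num_words else 1
             for w in t.lower().split()] for t in texts]

index = fit_word_index(["I love coding"])
print(texts_to_sequences(["I love dancing"], index, num_words=100))  # [[2, 3, 1]]
```

Here 'dancing' was never seen during fitting, so it maps to the OOV index 1.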
Question 3
Easy
Why do we normalize time series data before feeding it to an LSTM?
Similar to why we normalize images for CNNs.
LSTM gates use sigmoid and tanh activations, which are sensitive to input scale. Large input values cause saturation (sigmoid outputs near 0 or 1, tanh outputs near -1 or 1), leading to vanishing gradients. Normalized data (mean=0, std=1 or range [0,1]) keeps values in a range where activations and gradients are most effective. This leads to faster convergence and better performance. After prediction, denormalize the output to get values in the original scale.
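A minimal NumPy sketch of z-score normalization and the inverse transform (the series values are made up):

```python
import numpy as np

series = np.array([120.0, 135.0, 150.0, 90.0, 180.0])

mean, std = series.mean(), series.std()
normalized = (series - mean) / std  # feed this to the LSTM: mean ~0, std ~1
restored = normalized * std + mean  # denormalize predictions back to real units

print(normalized.round(3))
print(np.allclose(restored, series))  # True
```

The key discipline is to compute `mean` and `std` on the training split only and reuse them for validation, test, and deployment, so the model never sees statistics leaked from future data.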
Question 4
Medium
What does oov_token='<OOV>' handle?
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(["I love coding"])
new_seq = tokenizer.texts_to_sequences(["I love dancing"])
print(new_seq)
'dancing' was not in the training texts.
[[2, 3, 1]] where 1 is the OOV token index (replacing 'dancing').
Question 5
Medium
How does the sliding window technique work for time series?
import numpy as np
data = np.array([10, 20, 30, 40, 50, 60, 70])
SEQ_LEN = 3

X, y = [], []
for i in range(len(data) - SEQ_LEN):
    X.append(data[i:i+SEQ_LEN])
    y.append(data[i+SEQ_LEN])

print(f"X: {np.array(X)}")
print(f"y: {np.array(y)}")
Each window of 3 values predicts the next value.
X: [[10, 20, 30], [20, 30, 40], [30, 40, 50], [40, 50, 60]]
y: [40, 50, 60, 70]
Question 6
Medium
Why does LSTM have approximately 4x the parameters of SimpleRNN with the same number of units?
Count the gates in LSTM.
LSTM has 4 sets of weight matrices (forget gate, input gate, candidate cell, output gate), while SimpleRNN has only 1 set. Each gate in LSTM has its own input-to-hidden weights, hidden-to-hidden weights, and biases -- the same three components as a full SimpleRNN. So LSTM with 64 units is effectively 4 SimpleRNNs with 64 units each, resulting in approximately 4x the parameters.
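The 4x relationship can be checked with the counting formulas (helper names are ours; the values match Question 10 in the topic section):

```python
def simple_rnn_params(input_dim, units):
    # One block: input weights + recurrent weights + bias
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    # Four blocks (forget, input, candidate, output), each SimpleRNN-shaped
    return 4 * simple_rnn_params(input_dim, units)

print(simple_rnn_params(16, 32))  # 1568
print(lstm_params(16, 32))        # 6272
print(lstm_params(16, 32) // simple_rnn_params(16, 32))  # 4
```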
Question 7
Medium
What does truncating='post' do in pad_sequences?
seqs = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
padded = pad_sequences(seqs, maxlen=5, truncating='post')
print(padded)
The sequence is too long and must be truncated to maxlen.
[[1, 2, 3, 4, 5]]
Question 8
Medium
What output shape does return_sequences produce?
import tensorflow as tf

# return_sequences=False (default)
layer_false = tf.keras.layers.LSTM(64)
out_false = layer_false(tf.random.normal((8, 20, 32)))
print(f"False: {out_false.shape}")

# return_sequences=True
layer_true = tf.keras.layers.LSTM(64, return_sequences=True)
out_true = layer_true(tf.random.normal((8, 20, 32)))
print(f"True: {out_true.shape}")
False returns the last state only. True returns state at every step.
False: (8, 64)
True: (8, 20, 64)
Question 9
Hard
Nikhil built an LSTM for sentiment analysis. Training accuracy is 95% but test accuracy is 72%. What should he do?
Large gap between train and test indicates overfitting.
The model is overfitting. Nikhil should: (1) Add Dropout layers between LSTM and Dense layers (0.3-0.5). (2) Add recurrent_dropout parameter to LSTM (e.g., LSTM(64, dropout=0.3, recurrent_dropout=0.2)) to apply dropout to both input and recurrent connections. (3) Reduce model complexity (fewer units, fewer layers). (4) Add L2 regularization to Dense layers. (5) Use a pre-trained embedding (Word2Vec, GloVe) instead of learning from scratch, which requires less data. (6) Reduce the embedding dimension. (7) Consider using a smaller vocabulary (num_words). (8) Ensure the dataset is large enough and not imbalanced.
Question 10
Hard
Why is the cell state update in LSTM based on addition (C_t = f * C_old + i * C_new) rather than concatenation or other operations?
Think about gradient flow during backpropagation.
Addition has a gradient of 1 during backpropagation. When computing dLoss/dC_(t-1), the addition term contributes a gradient that flows through unchanged. This is the key property that prevents vanishing gradients. If concatenation were used, you would need a weight matrix to combine the concatenated values, reintroducing the multiplicative gradient problem. If a different non-linear operation were used, it would compress or distort the gradient. The element-wise multiplication by f_t (forget gate) is bounded [0,1], so it can selectively preserve gradients. Addition plus gated multiplication is the minimal mechanism that provides both selective memory and gradient preservation.
Question 11
Hard
How many parameters does GRU(64, input_shape=(20, 32)) have?
GRU has 3 weight matrices (update, reset, candidate) instead of LSTM's 4.
18,816
Question 12
Hard
Ananya has a dataset where each sample is a patient's vital signs recorded every hour for 48 hours. She wants to predict whether the patient will need ICU admission. How should she structure the LSTM input and output?
Think about the input shape: (batch, timesteps, features) and the classification output.
Input shape should be (batch_size, 48, num_features) where 48 is the number of time steps (hours) and num_features is the number of vital signs (e.g., heart rate, blood pressure, temperature = 3 features). She should normalize each feature independently. The model should end with LSTM(units, return_sequences=False) to get one representation per patient, followed by Dense layers with sigmoid output for binary classification (ICU yes/no). Use binary_crossentropy loss. If data is imbalanced (few ICU admissions), use class_weight in model.fit() and evaluate with precision/recall rather than just accuracy.
Question 13
Hard
What is the effective vocabulary size of this Embedding layer?
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)

model = tf.keras.Sequential([
    Embedding(input_dim=5000, output_dim=64),
    ...
])

# Is input_dim=5000 correct?
With oov_token, the actual vocabulary includes index 0 (padding) and index 1 (OOV).
With oov_token='<OOV>', index 0 is reserved for padding and index 1 for the OOV token, so real words receive indices starting at 2. num_words=5000 makes texts_to_sequences replace any word whose index is 5000 or higher with the OOV index, so the indices the model can actually see are 0-4999, and only 4998 of them correspond to real words. input_dim=5000 is therefore large enough to cover every index, but the effective vocabulary is 4998 real words plus the padding and OOV slots. The real pitfall is the opposite case: when num_words is not set, the maximum index is len(tokenizer.word_index), so input_dim must be len(tokenizer.word_index) + 1 -- forgetting the +1 for padding index 0 causes an index-out-of-range error.

Multiple Choice Questions

MCQ 1
What type of neural network is specifically designed for sequential data?
  • A. CNN
  • B. RNN
  • C. Autoencoder
  • D. GAN
Answer: B
B is correct. RNNs (Recurrent Neural Networks) are designed for sequential data where order matters (text, time series, audio). CNNs are for spatial data (images). Autoencoders and GANs are generative models.
MCQ 2
What does the hidden state in an RNN represent?
  • A. The model's weights
  • B. A compressed summary of the sequence seen so far
  • C. The learning rate
  • D. The number of layers
Answer: B
B is correct. The hidden state is a vector that encodes a compressed summary of all elements processed so far in the sequence. It is updated at each time step and serves as the RNN's memory.
MCQ 3
What problem does LSTM solve that vanilla RNNs cannot?
  • A. Image classification
  • B. The vanishing gradient problem in long sequences
  • C. Unsupervised learning
  • D. Parallel processing
Answer: B
B is correct. LSTM solves the vanishing gradient problem through its cell state and gating mechanisms. This allows it to learn long-range dependencies that vanilla RNNs cannot capture.
MCQ 4
What does pad_sequences do?
  • A. Encrypts the sequences for security
  • B. Makes all sequences the same length by adding zeros
  • C. Converts text to binary
  • D. Removes duplicate sequences
Answer: B
B is correct. pad_sequences adds zeros to shorter sequences and truncates longer sequences so all sequences have the same length. Neural networks require fixed-size input.
MCQ 5
What does the Embedding layer convert?
  • A. Images to text
  • B. Float values to integers
  • C. Integer word indices to dense float vectors
  • D. Dense vectors to sparse vectors
Answer: C
C is correct. The Embedding layer maps each integer index to a dense floating-point vector (e.g., 64 dimensions). This is more efficient and meaningful than one-hot encoding.
MCQ 6
How many gates does an LSTM cell have?
  • A. 1
  • B. 2
  • C. 3
  • D. 4
Answer: C
C is correct. LSTM has 3 gates: forget gate (what to discard), input gate (what to add), and output gate (what to output). There is also a cell candidate computation, but it is not technically a gate.
MCQ 7
What does return_sequences=True do in an LSTM layer?
  • A. Returns the model's weights
  • B. Returns hidden states at every time step, not just the last
  • C. Returns the input sequences unchanged
  • D. Returns the loss at each training step
Answer: B
B is correct. return_sequences=True returns the hidden state at every time step (3D output: batch, timesteps, units). False returns only the last hidden state (2D: batch, units). True is needed when stacking RNN layers.
MCQ 8
What is the main difference between GRU and LSTM?
  • A. GRU is for images, LSTM is for text
  • B. GRU has 2 gates and no separate cell state; LSTM has 3 gates and a cell state
  • C. GRU is always better than LSTM
  • D. LSTM cannot process long sequences
Answer: B
B is correct. GRU has 2 gates (update, reset) and merges cell state with hidden state. LSTM has 3 gates (forget, input, output) and a separate cell state. GRU is simpler with fewer parameters but similar performance.
MCQ 9
What is the input shape required for LSTM layers in Keras?
  • A. (batch_size, features)
  • B. (batch_size, height, width, channels)
  • C. (batch_size, timesteps, features)
  • D. (batch_size, timesteps)
Answer: C
C is correct. LSTM requires 3D input: (batch_size, timesteps, features). Even with a single feature, the features dimension must be present. For text, the Embedding layer converts 2D input to 3D.
MCQ 10
What does Bidirectional(LSTM(64)) output?
  • A. A 64-dimensional vector
  • B. A 128-dimensional vector (64 forward + 64 backward)
  • C. Two separate 64-dimensional vectors
  • D. A 32-dimensional vector (64 / 2)
Answer: B
B is correct. Bidirectional runs a forward LSTM (64 units) and a backward LSTM (64 units), then concatenates their outputs by default: 64 + 64 = 128. The merge_mode parameter can change this behavior.
MCQ 11
Why does LSTM use addition (not multiplication) to update the cell state?
  • A. Addition is computationally cheaper
  • B. Addition preserves gradients during backpropagation, preventing vanishing gradients
  • C. Multiplication would make values too large
  • D. Addition is required by the Keras API
Answer: B
B is correct. The gradient of addition is 1, meaning gradients flow through without shrinking. This is the fundamental mechanism that prevents vanishing gradients in LSTM. Multiplication by the weight matrix (as in vanilla RNN) causes exponential gradient decay.
MCQ 12
For a GRU layer with input_dim=32 and units=64, approximately how many parameters does it have?
  • A. ~6,000
  • B. ~12,000
  • C. ~19,000
  • D. ~25,000
Answer: C
C is correct. GRU has 3 weight blocks. In the classic formulation each block has (input_dim + units) * units + units = (32+64)*64 + 64 = 6,208 parameters, for a total of 3 * 6,208 = 18,624. The Keras default implementation (reset_after=True) adds a second bias per block, giving 18,816, as in Question 11 of the previous section. Either way, approximately 19,000.
MCQ 13
When stacking 3 LSTM layers, which must have return_sequences=True?
  • A. Only the first layer
  • B. Only the last layer
  • C. The first and second layers (all except the last)
  • D. All three layers
Answer: C
C is correct. All layers except the last must use return_sequences=True. Each LSTM expects 3D input (batch, timesteps, features). The first two layers must output the full sequence so the next layer receives 3D input. The last layer can use False for classification.
MCQ 14
Which of the following is NOT a suitable use case for Bidirectional RNNs?
  • A. Text classification
  • B. Named entity recognition
  • C. Real-time stock price prediction
  • D. Sentiment analysis
Answer: C
C is correct. Real-time stock price prediction cannot use a bidirectional RNN because future data is not available at prediction time. The backward RNN would need future data. Bidirectional works when the full sequence is available (text classification, NER, sentiment analysis).
MCQ 15
What does LSTM stand for?
  • A. Long Sequential Token Memory
  • B. Long Short-Term Memory
  • C. Large Scale Training Model
  • D. Linear Sequence Transform Module
Answer: B
B is correct. LSTM stands for Long Short-Term Memory. The name reflects its ability to remember information over long time spans (long-term) while also handling short-term patterns.
MCQ 16
What is the purpose of padding in sequence preprocessing?
  • A. To add noise to the data
  • B. To make all sequences the same length for batch processing
  • C. To encrypt the sequences
  • D. To remove short sequences
Answer: B
B is correct. Neural networks require fixed-size input tensors. Padding adds zeros to shorter sequences so all sequences have the same length, enabling efficient batch processing.
MCQ 17
What activation function do LSTM gates use?
  • A. ReLU
  • B. Softmax
  • C. Sigmoid
  • D. Linear
Answer: C
C is correct. LSTM gates (forget, input, output) use sigmoid activation, which outputs values between 0 and 1. This allows gates to act as soft switches: 0 = completely closed, 1 = completely open. The cell candidate uses tanh.
MCQ 18
How many parameters does Embedding(10000, 64) have?
  • A. 10,000
  • B. 64
  • C. 640,000
  • D. 10,064
Answer: C
C is correct. An Embedding layer creates a lookup table of vocab_size x embedding_dim. 10000 words x 64 dimensions = 640,000 trainable parameters.
MCQ 19
What is the cell state in LSTM?
  • A. The number of cells in the network
  • B. A separate memory pathway that carries information across time steps with minimal modification
  • C. The learning rate for each cell
  • D. The batch size used during training
Answer: B
B is correct. The cell state is a vector that acts as a conveyor belt for information. It flows through time steps with only element-wise multiplication and addition (not matrix multiplication), preserving gradients and enabling long-range memory.
MCQ 20
Which layer in Keras converts word indices to dense vectors?
  • A. Dense
  • B. LSTM
  • C. Embedding
  • D. Flatten
Answer: C
C is correct. The Embedding layer maps integer word indices to dense floating-point vectors. Embedding(10000, 64) maps each of 10000 possible word indices to a 64-dimensional vector.
MCQ 21
What is the difference between padding='pre' and padding='post' in pad_sequences?
  • A. pre adds zeros before the sequence; post adds zeros after
  • B. pre pads with ones; post pads with zeros
  • C. pre is faster; post is more accurate
  • D. pre truncates from the start; post truncates from the end
Answer: A
A is correct. padding='pre' adds zeros at the beginning of shorter sequences: [0, 0, 1, 2, 3]. padding='post' adds zeros at the end: [1, 2, 3, 0, 0]. The default is 'pre'. For LSTMs, 'pre' often works better because the last elements (real data) are closest to the output.
MCQ 22
Why is dropout applied differently during training and inference in LSTM?
  • A. Dropout is not used in LSTMs
  • B. During training, neurons are randomly dropped to prevent overfitting; during inference, all neurons are active for deterministic predictions
  • C. During inference, more neurons are dropped
  • D. Dropout only affects the final Dense layer
Answer: B
B is correct. During training, dropout randomly zeroes neurons to prevent co-adaptation and reduce overfitting. During inference (predict/evaluate), all neurons are active and outputs are scaled to compensate. This ensures predictions are deterministic and use the model's full capacity.
MCQ 23
What does recurrent_dropout do in Keras LSTM?
  • A. Drops entire recurrent layers
  • B. Applies dropout to the recurrent connections (hidden-to-hidden weights)
  • C. Removes time steps from the sequence
  • D. Reduces the number of LSTM units
Answer: B
B is correct. recurrent_dropout applies dropout to the connections between time steps (the recurrent/hidden-to-hidden connections). Regular dropout applies to the input connections. Using both provides better regularization for LSTMs.
MCQ 24
What is the key advantage of GRU over LSTM?
  • A. GRU always achieves higher accuracy
  • B. GRU has fewer parameters (2 gates vs 3), making it faster to train with comparable performance
  • C. GRU can handle longer sequences
  • D. GRU does not suffer from vanishing gradients
Answer: B
B is correct. GRU has approximately 75% of LSTM's parameters (2 gates vs 3, no separate cell state). It trains faster and requires less memory while achieving comparable performance on most tasks. LSTM may have a slight edge on tasks requiring very long-range memory.
MCQ 25
Which of the following is NOT a suitable application for RNNs?
  • A. Sentiment analysis of text
  • B. Image classification of static photos
  • C. Time series forecasting
  • D. Speech recognition
Answer: B
B is correct. Static image classification is best handled by CNNs, not RNNs. RNNs are designed for sequential data where order matters: text, time series, audio, and video. A single static image has no sequential structure.

