Chapter 17 · Advanced · 55 Questions

Practice Questions — Convolutional Neural Networks (CNN) for Computer Vision

10 Easy
11 Medium
11 Hard

Topic-Specific Questions

Question 1
Easy
What is the output shape after this Conv2D layer?
Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1))
With padding='valid' (default): output = input - filter + 1.
(None, 26, 26, 32)
Question 2
Easy
What is the output shape after MaxPooling2D?
# Input shape: (None, 26, 26, 32)
MaxPooling2D((2, 2))
MaxPooling2D(2,2) halves both spatial dimensions.
(None, 13, 13, 32)
Question 3
Easy
How many parameters does a Conv2D(16, (3, 3), input_shape=(28, 28, 1)) layer have?
Parameters = (filter_height x filter_width x input_channels + 1) x num_filters.
160 parameters
Question 4
Easy
Why do CNNs use convolution instead of Dense layers for image data?
Think about parameter sharing and spatial structure.
CNNs use convolution because: (1) Parameter sharing -- a 3x3 filter has only 9 weights but detects the same pattern anywhere in the image, while a Dense layer needs a separate weight for every pixel-to-neuron connection. (2) Spatial locality -- nearby pixels are related; convolution exploits this by looking at local neighborhoods. (3) Translation invariance -- the same filter works regardless of where the pattern appears. A Dense network on a 224x224 image would need millions of parameters per layer and could not generalize across positions.
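The parameter gap can be checked with plain arithmetic. The 224x224x3 input and 64-unit layer width below are illustrative choices for the comparison, not values from the question:

```python
# Parameter count: one 3x3 Conv2D with 64 filters vs. a Dense layer
# mapping every input value to 64 neurons, on a 224x224x3 image.
h, w, c = 224, 224, 3

conv_params = (3 * 3 * c + 1) * 64    # 9 weights per channel + bias, shared across all positions
dense_params = (h * w * c + 1) * 64   # one weight per input value, per neuron

print(conv_params)    # 1792
print(dense_params)   # 9633856
```

The convolutional layer uses roughly 5,000x fewer parameters here, and the same 1,792 weights detect the pattern at every position in the image.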
Question 5
Easy
What is the output shape with padding='same'?
Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3))
'same' padding preserves the spatial dimensions.
(None, 32, 32, 64)
Question 6
Easy
What does max pooling do, and why is it useful?
It takes the maximum value in each window, reducing the feature map size.
Max pooling selects the maximum value from each pooling window (typically 2x2). It reduces the spatial dimensions by half, which: (1) decreases computation and memory, (2) provides a degree of translation invariance (the exact position of a feature within the window does not matter, only that it exists), and (3) forces the network to learn more abstract representations. The depth (number of channels) is not changed by pooling.
Question 7
Medium
What is the output shape after Conv2D(32, (3,3)) on a 28x28x1 input?
Default padding is 'valid'. Output = (input - filter + 1).
(None, 26, 26, 32)
Question 8
Medium
What is the output shape with stride=2?
Conv2D(32, (3, 3), strides=(2, 2), padding='same', input_shape=(28, 28, 1))
With same padding and stride 2, the output is input / stride.
(None, 14, 14, 32)
Question 9
Medium
How many parameters does Conv2D(64, (3, 3), input_shape=(32, 32, 3)) have?
Input has 3 channels. Each filter is 3x3x3.
1,792 parameters
Question 10
Medium
What was the key innovation of ResNet that allowed training networks with 100+ layers?
It adds the input to the output of a block.
ResNet introduced residual connections (skip connections). Instead of learning the full mapping H(x) directly, each block learns a residual function F(x) and outputs F(x) + x: the input x is added to the block's output via a shortcut path. This solves the vanishing gradient problem in very deep networks because gradients can flow directly through the skip connections, bypassing many layers. Without skip connections, networks deeper than ~20 layers often performed worse than shallower ones.
Question 11
Medium
Explain the two phases of transfer learning: (1) training the classification head, and (2) fine-tuning.
Phase 1 freezes pre-trained layers. Phase 2 unfreezes some and uses a low learning rate.
Phase 1: Load a pre-trained model (e.g., VGG16) without its top classification layers. Freeze all pre-trained layers (trainable=False). Add new Dense layers for your task. Train only the new layers with a normal learning rate. This teaches the new layers to use the pre-trained features. Phase 2 (optional): Unfreeze the top few pre-trained layers. Recompile with a very low learning rate (e.g., 1e-5). Train again. This fine-tunes the pre-trained features to better match your specific dataset while preserving most of the learned representations.
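The two phases can be sketched in Keras. This minimal version uses a tiny stand-in base model (two layers, no downloaded weights) so it runs anywhere; with a real pre-trained base like VGG16 the freeze/unfreeze steps are identical:

```python
# Two-phase transfer learning mechanics, with a small stand-in base model.
from tensorflow.keras import layers, models, optimizers

base = models.Sequential([
    layers.Conv2D(8, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
])

# Phase 1: freeze the base and train only the new classification head.
base.trainable = False
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# ... model.fit(...) trains just the head here ...

# Phase 2: unfreeze the base and recompile with a very low learning rate.
base.trainable = True
model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
              loss='sparse_categorical_crossentropy')
# ... model.fit(...) now fine-tunes the whole stack gently ...
```

Note that changing `trainable` only takes effect after recompiling, which is why each phase ends with its own `compile` call.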
Question 12
Medium
What does include_top=False do?
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
print(base.output_shape)
include_top=False removes the final classification Dense layers.
(None, 7, 7, 512)
Question 13
Hard
Calculate the total output dimensions through this CNN:
model = Sequential([
    Conv2D(32, (5, 5), input_shape=(64, 64, 3)),  # valid padding
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3)),
    MaxPooling2D((2, 2)),
    Flatten()
])
print(model.output_shape)
Apply the formula (input - filter + 1) for each Conv2D, then // 2 for each MaxPool.
(None, 12544). Shapes: 64 -> 60 (5x5 valid conv) -> 30 (pool) -> 28 (3x3 valid conv) -> 14 (pool). Flatten: 14 x 14 x 64 = 12,544.
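The layer-by-layer arithmetic can be traced with the two formulas from the hint:

```python
# 'valid' convolution: out = in - kernel + 1; 2x2 max pooling: out = in // 2.
def conv_valid(size, kernel):
    return size - kernel + 1

s = 64
s = conv_valid(s, 5)   # 60 after the 5x5 conv
s //= 2                # 30 after pooling
s = conv_valid(s, 3)   # 28 after the 3x3 conv
s //= 2                # 14 after pooling

flat = s * s * 64      # 14 * 14 * 64
print(flat)            # 12544
```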
Question 14
Hard
Why do CNNs use multiple stacked 3x3 convolutions instead of a single large filter like 7x7?
VGG16 showed this design principle. Think about receptive field, parameters, and non-linearity.
Two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution (and three 3x3s equal one 7x7), but with fewer parameters and more non-linearity. Parameters: one 7x7 filter on C channels = 49C weights. Three 3x3 filters = 3 x 9C = 27C weights (45% fewer). Non-linearity: each 3x3 conv is followed by ReLU, so three stacked convolutions apply three non-linear transformations, making the feature extraction more expressive. This was VGG16's key design insight.
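The weight counts can be verified directly. This version counts full filter banks (C input channels and C output channels per layer, biases ignored), which gives the same ~45% saving as the per-filter figures above; C = 64 is an arbitrary illustrative width:

```python
# Weights only, C input channels -> C output channels per layer.
C = 64
one_7x7 = 7 * 7 * C * C           # a single large filter bank
three_3x3 = 3 * (3 * 3 * C * C)   # three stacked small filter banks

print(one_7x7, three_3x3)         # 200704 110592
saving = 1 - three_3x3 / one_7x7
print(round(saving, 2))           # 0.45 -> ~45% fewer weights
```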
Question 15
Hard
What is the difference between GlobalAveragePooling2D and Flatten? When should you use each?
One averages each feature map to a single value; the other concatenates all values.
Flatten converts a (7, 7, 512) feature map to a (25088,) vector by concatenating all values. GlobalAveragePooling2D averages each feature map to a single value, converting (7, 7, 512) to (512,). Use GlobalAveragePooling2D for transfer learning -- it reduces parameters by ~50x, provides spatial invariance, and acts as a regularizer. Use Flatten when training from scratch on smaller models where you want the Dense layers to learn spatial relationships. For transfer learning, GlobalAveragePooling2D is almost always the better choice.
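The two operations are easy to compare with NumPy on a dummy feature map (batch dimension omitted for simplicity):

```python
import numpy as np

feature_maps = np.random.rand(7, 7, 512)   # one sample's feature maps

flattened = feature_maps.reshape(-1)       # concatenate every value
pooled = feature_maps.mean(axis=(0, 1))    # average each 7x7 map to one number

print(flattened.shape)  # (25088,)
print(pooled.shape)     # (512,)
```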
Question 16
Hard
What is the receptive field of the second Conv2D in this stack?
Conv2D(32, (3, 3), padding='same')  # Layer 1
Conv2D(32, (3, 3), padding='same')  # Layer 2
The second layer sees a 3x3 window of the first layer's output. Each of those was computed from a 3x3 window of the input.
The receptive field of Layer 2 is 5x5 with respect to the original input.
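The same reasoning generalizes to any stack of stride-1 convolutions via the formula used in the MCQ section:

```python
# Receptive field of n stacked k x k convolutions with stride 1:
# rf = 1 + n * (k - 1)
def receptive_field(n_layers, kernel_size):
    return 1 + n_layers * (kernel_size - 1)

print(receptive_field(2, 3))  # 5 -> two 3x3 layers see a 5x5 input window
print(receptive_field(3, 3))  # 7 -> three 3x3 layers match one 7x7 filter
```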
Question 17
Hard
Kavitha has only 200 training images for a 10-class classification problem. How should she approach this with CNNs?
Training a CNN from scratch on 200 images will massively overfit.
With only 200 images (20 per class), Kavitha should: (1) Use transfer learning with a pre-trained model like ResNet50 or MobileNetV2. The pre-trained model already knows general visual features. (2) Freeze the entire pre-trained base and only train a small classification head (GlobalAveragePooling2D -> Dense). (3) Apply aggressive data augmentation (rotation, flip, zoom, shift) to artificially increase the effective training set size. (4) Use strong regularization (Dropout 0.5+). (5) If validation accuracy is good enough, optionally fine-tune the top 1-2 layers with a very low learning rate. Training from scratch would hopelessly overfit with so few images.

Mixed & Application Questions

Question 1
Easy
Does MaxPooling2D have any trainable parameters?
layer = MaxPooling2D((2, 2))
print(layer.count_params())
Max pooling applies a fixed operation (taking the max).
0
Question 2
Easy
What does Flatten do in a CNN?
# Input shape: (None, 4, 4, 64)
Flatten()
# Output shape: ?
Flatten converts multi-dimensional data into a 1D vector.
(None, 1024) because 4 x 4 x 64 = 1024.
Question 3
Easy
What is the channel dimension in a color image?
Color images have Red, Green, and Blue channels.
The channel dimension represents the number of color channels. RGB images have 3 channels (Red, Green, Blue). Grayscale images have 1 channel. In Keras, the input shape is (height, width, channels), so an RGB image of 224x224 has input_shape=(224, 224, 3) and a grayscale 28x28 image has input_shape=(28, 28, 1).
Question 4
Easy
What is the difference between these two models?
# Model A
Conv2D(32, (3, 3), padding='valid', input_shape=(28, 28, 1))  # Output: 26x26

# Model B
Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1))   # Output: ?
'same' padding adds zeros so the output matches the input size.
Model A output: (None, 26, 26, 32). Model B output: (None, 28, 28, 32).
Question 5
Medium
How many parameters does this entire CNN have?
model = Sequential([
    Conv2D(8, (3, 3), input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax')
])
print(model.count_params())
Calculate Conv2D params, MaxPool params, and Dense params separately.
Conv2D: (3x3x1+1) x 8 = 80. MaxPool: 0. Flatten: 0. Dense: after Conv2D the shape is 26x26x8, after pool 13x13x8=1352, so Dense has (1352+1) x 10 = 13530. Total: 13,610.
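Keras can confirm the hand calculation; building the model from the question and calling `count_params()` should report the same total:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(8, (3, 3), input_shape=(28, 28, 1)),  # -> (26, 26, 8), 80 params
    layers.MaxPooling2D((2, 2)),                        # -> (13, 13, 8), 0 params
    layers.Flatten(),                                   # -> (1352,), 0 params
    layers.Dense(10, activation='softmax'),             # (1352 + 1) * 10 = 13,530 params
])
print(model.count_params())  # 13610
```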
Question 6
Medium
Why is VGG16 popular for transfer learning despite being older than ResNet?
Think about simplicity and the quality of learned features.
VGG16 is popular for transfer learning because: (1) Its architecture is very simple (only 3x3 convolutions and 2x2 max pooling), making it easy to understand and modify. (2) Despite its simplicity, its convolutional features are high quality and transfer well to many tasks. (3) Its straightforward block structure makes it easy to identify where to cut the model for transfer learning. (4) It is widely studied and used as a benchmark. ResNet and others may achieve higher accuracy on ImageNet, but VGG16's simplicity makes it an excellent teaching and practical starting point.
Question 7
Medium
What happens if you train with augmented data but evaluate on the same augmented data?
datagen = ImageDataGenerator(rotation_range=30, horizontal_flip=True)
# Using same datagen for both train and validation
train_gen = datagen.flow(X_train, y_train)
val_gen = datagen.flow(X_val, y_val)  # Wrong?
Validation should measure performance on clean, unaugmented data.
The validation accuracy becomes unreliable and inconsistent. Each time you evaluate, the validation images are randomly transformed, so the same model gives different validation scores on different runs. You cannot track true progress or make reliable decisions about early stopping.
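One way to set this up correctly is to give training and validation separate generators, with random transforms only on the training side. The tiny random arrays below are stand-ins so the snippet is self-contained:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative stand-in data (shapes only; not real images).
X_train = np.random.rand(8, 28, 28, 1); y_train = np.zeros(8)
X_val = np.random.rand(4, 28, 28, 1);   y_val = np.zeros(4)

train_datagen = ImageDataGenerator(rotation_range=30, horizontal_flip=True)
val_datagen = ImageDataGenerator()  # no random transforms: clean evaluation

train_gen = train_datagen.flow(X_train, y_train, batch_size=4)
val_gen = val_datagen.flow(X_val, y_val, batch_size=4, shuffle=False)
```

With this split, every evaluation sees the same unmodified validation images, so validation scores are comparable across epochs and runs.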
Question 8
Medium
What does base_model.trainable = False do?
base_model = VGG16(weights='imagenet', include_top=False)
print(f"Before: {len(base_model.trainable_weights)} trainable")

base_model.trainable = False
print(f"After: {len(base_model.trainable_weights)} trainable")
Freezing a model makes all its weights non-trainable.
Before: 26 trainable
After: 0 trainable
Question 9
Medium
What is the difference between stride-based downsampling and pooling-based downsampling?
Both reduce spatial dimensions. How do they differ?
Pooling-based downsampling (MaxPooling2D) applies a fixed operation (max or average) in a window after the convolution. It has no learnable parameters. Stride-based downsampling (Conv2D with strides=2) reduces dimensions during the convolution itself -- the filter skips positions. It does have learnable parameters and combines feature extraction with downsampling in one step. Modern architectures (like ResNet and MobileNet) often prefer strided convolutions over max pooling because they give the network more control over what information is preserved during downsampling.
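Both routes can be compared side by side. Each model below halves 28x28 to 14x14, but in the strided model the downsampling is done by the learnable convolution itself, while the pooling layer contributes zero parameters:

```python
from tensorflow.keras import layers, models

pooled = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),          # fixed max operation, 0 params
])
strided = models.Sequential([
    layers.Conv2D(32, (3, 3), strides=(2, 2), padding='same',
                  input_shape=(28, 28, 1)),  # downsamples while convolving
])

print(pooled.output_shape, strided.output_shape)      # both (None, 14, 14, 32)
print(pooled.count_params(), strided.count_params())  # 320 and 320
```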
Question 10
Hard
Rohit's model shows: training accuracy 99%, validation accuracy 62%. What is happening and how to fix it?
Huge gap between training and validation accuracy indicates a specific problem.
The model is severely overfitting -- it memorized the training data but cannot generalize. Fixes: (1) Add Dropout layers (0.3-0.5) after convolutional and Dense layers. (2) Add data augmentation (rotation, flip, zoom) to increase effective training data. (3) Use transfer learning with a pre-trained model instead of training from scratch. (4) Reduce model complexity (fewer layers, fewer filters). (5) Add L2 regularization. (6) Get more training data if possible. (7) Use EarlyStopping to stop at the point of best validation accuracy.
Question 11
Hard
How do the features learned by early CNN layers differ from those learned by deep layers?
Think about the receptive field size at different depths.
Early layers (e.g., block1 in VGG16) have small receptive fields and learn simple, low-level features: edges (horizontal, vertical, diagonal), color blobs, and simple textures. Middle layers combine these into mid-level features: corners, contours, simple shapes, and recurring texture patterns. Deep layers have large receptive fields and detect high-level, semantic features: eyes, wheels, faces, or specific object parts. The deepest layers before the classification head combine these into class-specific representations. This hierarchical feature extraction is analogous to the human visual cortex.
Question 12
Hard
What are the parameters of this residual block?
def residual_block(x, filters=64):
    shortcut = x
    x = Conv2D(filters, (3, 3), padding='same')(x)  # Layer A
    x = BatchNormalization()(x)                       # Layer B
    x = Activation('relu')(x)
    x = Conv2D(filters, (3, 3), padding='same')(x)  # Layer C
    x = BatchNormalization()(x)                       # Layer D
    x = Add()([x, shortcut])
    x = Activation('relu')(x)
    return x

# Input shape to block: (None, 16, 16, 64)
# How many trainable parameters in this block?
Calculate Conv2D and BatchNorm params separately. Add and Activation have zero params.
Layer A (Conv2D 64 filters, 3x3, 64 input channels): (3x3x64 + 1) x 64 = 36,928. Layer B (BatchNorm, 64 features): 128 trainable (gamma + beta). Layer C (Conv2D same): 36,928. Layer D (BatchNorm): 128. Total trainable: 74,112.
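Rebuilding the block with the functional API confirms the trainable total:

```python
import numpy as np
from tensorflow.keras import layers, Model, Input

inp = Input(shape=(16, 16, 64))
x = layers.Conv2D(64, (3, 3), padding='same')(inp)  # 36,928 params
x = layers.BatchNormalization()(x)                  # 128 trainable (+128 non-trainable)
x = layers.Activation('relu')(x)
x = layers.Conv2D(64, (3, 3), padding='same')(x)    # 36,928 params
x = layers.BatchNormalization()(x)                  # 128 trainable (+128 non-trainable)
x = layers.Add()([x, inp])                          # 0 params
out = layers.Activation('relu')(x)                  # 0 params

model = Model(inp, out)
trainable = sum(int(np.prod(w.shape)) for w in model.trainable_weights)
print(trainable)  # 74112
```

Note that `model.count_params()` would report 74,368 because it also counts the 256 non-trainable BatchNorm statistics (moving mean and variance).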
Question 13
Hard
Compare AlexNet (2012) and ResNet (2015). What fundamental problems did each solve?
One proved deep learning works on GPUs. The other proved very deep networks can be trained.
AlexNet (2012) solved the problem of applying deep learning at scale. It was the first CNN to win ImageNet by a large margin. Key innovations: (1) Used GPUs for training (60x speedup), (2) ReLU activation instead of tanh (faster training), (3) Dropout for regularization, (4) Data augmentation. It proved that deep CNNs with large datasets and GPU training dramatically outperform traditional computer vision. ResNet (2015) solved the degradation problem in very deep networks. Before ResNet, networks deeper than ~20 layers performed worse than shallower ones (not due to overfitting, but difficulty in training). Skip connections (F(x) + x) allowed training 152-layer networks effectively by providing gradient highways during backpropagation.
Question 14
Hard
How does GlobalAveragePooling2D change the shape?
from tensorflow.keras.layers import GlobalAveragePooling2D
# Input: (None, 7, 7, 512)
layer = GlobalAveragePooling2D()
# Output: ?
It averages each feature map (7x7) into a single value.
(None, 512)
Question 15
Hard
Deepa has a dataset of 5,000 Indian road sign images (20 classes). Should she train from scratch or use transfer learning? If transfer learning, which pre-trained model should she choose and why?
5,000 images is moderate. Consider the trade-off between model size and data availability.
Deepa should use transfer learning. With 5,000 images (250 per class), training a deep CNN from scratch would likely overfit. Recommended approach: Use MobileNetV2 or ResNet50 pre-trained on ImageNet. MobileNetV2 is preferred because (1) it is lightweight (3.4M vs 23.5M params for ResNet50), reducing overfitting risk with limited data, (2) it is fast to train and deploy, (3) its features still transfer well. Steps: freeze the base, add GlobalAveragePooling2D + Dense head, train with aggressive data augmentation (rotation, flip, brightness -- reflecting real-world road sign variations), then fine-tune the top few layers with lr=1e-5. With 5,000 images, this should achieve strong accuracy.

Multiple Choice Questions

MCQ 1
What does CNN stand for?
  • A. Complex Neural Network
  • B. Convolutional Neural Network
  • C. Connected Neuron Network
  • D. Computational Node Network
Answer: B
B is correct. CNN stands for Convolutional Neural Network. The key operation is convolution, where filters slide across the input to detect spatial patterns.
MCQ 2
What does a convolution filter detect in an image?
  • A. The color of the image
  • B. Specific spatial patterns like edges, textures, or shapes
  • C. The total number of pixels
  • D. The file format of the image
Answer: B
B is correct. Convolution filters detect specific spatial patterns. Early filters detect simple patterns (edges), while deeper filters detect complex patterns (object parts). The filter weights are learned during training.
MCQ 3
What is the output of MaxPooling2D((2,2)) applied to a 10x10 feature map?
  • A. 5x5
  • B. 10x10
  • C. 8x8
  • D. 20x20
Answer: A
A is correct. MaxPooling2D with pool_size=(2,2) halves both spatial dimensions: 10/2 = 5. The output is 5x5. The number of channels remains unchanged.
MCQ 4
What does 'padding=same' do in a Conv2D layer?
  • A. Applies the same filter to all channels
  • B. Makes all output values the same
  • C. Adds zero-padding so the output spatial dimensions match the input
  • D. Uses the same weights for all positions
Answer: C
C is correct. padding='same' adds zeros around the border of the input so that the output has the same height and width as the input (when stride=1). Without padding ('valid'), the output shrinks.
MCQ 5
What is the purpose of the Flatten layer in a CNN?
  • A. It flattens the image to grayscale
  • B. It converts the 3D feature maps to a 1D vector for Dense layers
  • C. It removes unnecessary features
  • D. It applies a flattening convolution
Answer: B
B is correct. Flatten converts multi-dimensional feature maps (e.g., 4x4x128) into a 1D vector (2048) that can be fed into Dense layers for classification.
MCQ 6
Which CNN architecture introduced residual (skip) connections?
  • A. VGG16
  • B. AlexNet
  • C. ResNet
  • D. LeNet-5
Answer: C
C is correct. ResNet (2015) introduced residual connections where the input is added to the output of a block (F(x) + x). This allowed training networks with 100+ layers by solving the vanishing gradient problem.
MCQ 7
In transfer learning, why do we freeze the pre-trained layers initially?
  • A. To make training faster by using fewer GPU resources
  • B. To preserve the learned features and only train the new classification head
  • C. Because frozen layers use less memory
  • D. Because pre-trained layers cannot be modified
Answer: B
B is correct. Freezing preserves the valuable features learned from millions of images. We only train the new classification layers to adapt those features to our specific task. Without freezing, random gradient updates would destroy the pre-trained representations.
MCQ 8
What is the key advantage of CNNs over Dense networks for image classification?
  • A. CNNs are always faster to train
  • B. CNNs can handle any data type, not just images
  • C. CNNs exploit spatial structure through parameter sharing and local connectivity
  • D. CNNs do not need activation functions
Answer: C
C is correct. CNNs share filter weights across all spatial positions (parameter sharing) and each neuron only connects to a local region (local connectivity). This drastically reduces parameters and allows the model to detect patterns regardless of their position in the image.
MCQ 9
What does include_top=False do when loading a pre-trained model?
  • A. Removes the first convolutional layer
  • B. Removes the final classification Dense layers
  • C. Removes all pooling layers
  • D. Loads the model without pre-trained weights
Answer: B
B is correct. include_top=False removes the top Dense layers used for the original classification task (e.g., ImageNet's 1000 classes). It keeps only the convolutional feature extraction base, which you connect to your own classification layers.
MCQ 10
Which of the following is NOT a data augmentation technique?
  • A. Random rotation
  • B. Horizontal flip
  • C. Feature normalization
  • D. Random zoom
Answer: C
C is correct. Feature normalization (scaling to [0,1] or standardizing) is a preprocessing step, not augmentation. Data augmentation creates modified versions of images (rotation, flip, zoom, shift) to increase training data diversity.
MCQ 11
Why should you use a very low learning rate during fine-tuning?
  • A. To make training faster
  • B. To prevent destroying the pre-trained features with large weight updates
  • C. Because fine-tuning does not need gradient descent
  • D. To increase the batch size
Answer: B
B is correct. Pre-trained weights represent features learned from millions of images. A high learning rate causes large gradient updates that destroy these carefully learned features. A low learning rate (e.g., 1e-5) makes small adjustments that adapt the features without destroying them.
MCQ 12
How many parameters does Conv2D(64, (3,3)) have when the input has 32 channels?
  • A. 576
  • B. 18,432
  • C. 18,496
  • D. 36,928
Answer: C
C is correct. Each filter is 3x3x32 = 288 weights + 1 bias = 289 per filter. With 64 filters: 289 x 64 = 18,496. The input channel depth (32) is part of each filter's dimensions.
MCQ 13
What is the receptive field of three stacked 3x3 convolutions (with stride 1)?
  • A. 3x3
  • B. 5x5
  • C. 7x7
  • D. 9x9
Answer: C
C is correct. One 3x3 conv sees 3x3. Two stacked 3x3 convs see 5x5. Three stacked 3x3 convs see 7x7. The formula is: receptive_field = 1 + n * (kernel_size - 1) = 1 + 3 * 2 = 7.
MCQ 14
GlobalAveragePooling2D converts a (None, 7, 7, 512) tensor to what shape?
  • A. (None, 7, 7, 1)
  • B. (None, 512)
  • C. (None, 25088)
  • D. (None, 3, 3, 512)
Answer: B
B is correct. GlobalAveragePooling2D averages each of the 512 feature maps (7x7 each) into a single value, producing a vector of length 512. Flatten would give (None, 25088) by concatenating all values.
MCQ 15
Which VGG16 design principle was later adopted by many architectures?
  • A. Using only 1x1 convolutions
  • B. Using multiple small (3x3) filters instead of large filters
  • C. Using no pooling layers
  • D. Using only one convolutional block
Answer: B
B is correct. VGG16 showed that stacking small 3x3 filters is better than using large filters (5x5 or 7x7). Two 3x3 filters have the same receptive field as one 5x5 but with fewer parameters and more non-linearity. This principle is now standard in most CNN architectures.
MCQ 16
In what order does a CNN typically process data for image classification?
  • A. Dense -> Flatten -> Conv2D -> Pooling -> Softmax
  • B. Conv2D -> Pooling -> (repeat) -> Flatten -> Dense -> Softmax
  • C. Pooling -> Conv2D -> Dense -> Flatten -> Softmax
  • D. Flatten -> Conv2D -> Pooling -> Dense -> Softmax
Answer: B
B is correct. The standard CNN pipeline: convolutional layers extract features, pooling layers reduce dimensions (this block repeats), Flatten converts to 1D, Dense layers classify, and softmax outputs probabilities. Data flows from spatial feature extraction to classification.
MCQ 17
What does a convolution filter detect in the first layer of a CNN?
  • A. Complete objects like faces and cars
  • B. Simple patterns like edges and corners
  • C. The background color of the image
  • D. The file size of the image
Answer: B
B is correct. The first convolutional layer learns to detect simple, low-level features like horizontal edges, vertical edges, corners, and color blobs. Deeper layers combine these into increasingly complex features.
MCQ 18
What happens to the number of channels after Conv2D(64, (3,3)) on an input with 32 channels?
  • A. Stays at 32
  • B. Becomes 64
  • C. Becomes 32 + 64 = 96
  • D. Becomes 32 x 64 = 2048
Answer: B
B is correct. The number of output channels equals the number of filters specified in Conv2D. Conv2D(64, ...) produces 64 feature maps regardless of the input channel count. The input channels affect the filter depth (each filter is 3x3x32) but not the output depth.
MCQ 19
What is the main purpose of data augmentation?
  • A. To make training faster
  • B. To increase the effective training set size and reduce overfitting
  • C. To improve image resolution
  • D. To convert images to grayscale
Answer: B
B is correct. Data augmentation creates modified versions of training images (rotated, flipped, zoomed) to effectively increase the dataset size. This helps the model learn invariant features and reduces overfitting.
MCQ 20
What is the difference between Max Pooling and Average Pooling?
  • A. Max pooling takes the maximum value; average pooling computes the mean in each window
  • B. Max pooling is faster; average pooling is more accurate
  • C. Max pooling increases dimensions; average pooling decreases them
  • D. Max pooling has learnable parameters; average pooling does not
Answer: A
A is correct. Max pooling selects the maximum value in each window, preserving the strongest activations. Average pooling computes the mean, providing a smoother representation. Both reduce spatial dimensions. Neither has learnable parameters. Max pooling is more commonly used in classification CNNs.
MCQ 21
What does weights='imagenet' mean when loading VGG16?
  • A. The model is trained on images from the internet
  • B. The model is loaded with weights pre-trained on the ImageNet dataset
  • C. The model can only process ImageNet images
  • D. The model has random weights
Answer: B
B is correct. weights='imagenet' loads weights that were pre-trained on the ImageNet dataset (14 million images, 1000 classes). These weights already understand general visual features and can be transferred to new tasks.
MCQ 22
Why does ResNet use batch normalization after each convolution?
  • A. To increase the number of parameters
  • B. To normalize feature maps, enabling stable training of very deep networks
  • C. To replace the activation function
  • D. To reduce the number of filters
Answer: B
B is correct. Batch normalization normalizes feature maps to zero mean and unit variance after each convolution. In very deep networks (50-152 layers), this prevents activation values from growing or shrinking uncontrollably, enabling stable gradient flow and faster convergence.
MCQ 23
What is translation invariance in the context of CNNs?
  • A. The ability to translate images to different languages
  • B. The ability to recognize a pattern regardless of its position in the image
  • C. The ability to resize images automatically
  • D. The ability to process images in any format
Answer: B
B is correct. Translation invariance means a CNN can recognize a cat whether it appears in the top-left, center, or bottom-right of the image. This is achieved through parameter sharing (same filter applied everywhere) and pooling (which provides tolerance to small position shifts).

