What Is It?
What Are Convolutional Neural Networks?
A Convolutional Neural Network (CNN) is a type of deep neural network specifically designed for processing grid-like data, most commonly images. While a regular Dense (fully connected) network treats every pixel as an independent feature, a CNN exploits the spatial structure of images -- the fact that nearby pixels are related and form patterns like edges, textures, and shapes.
# A CNN processes images differently from a Dense network
# Dense network: flatten 28x28 image to 784 features (loses spatial info)
# CNN: process 28x28 image directly, preserving spatial relationships
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
# CNN input keeps the 2D structure
model = tf.keras.Sequential([
Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
MaxPooling2D((2, 2)),
Flatten(),
Dense(64, activation='relu'),
Dense(10, activation='softmax')
])
The key insight of CNNs is the convolution operation: a small filter (also called a kernel) slides across the image, detecting specific patterns. Early layers detect simple features (edges, corners), middle layers combine them into parts (eyes, wheels), and deep layers recognize complete objects (faces, cars). This hierarchical feature extraction is what makes CNNs so powerful for visual tasks.
Why Not Use Dense Networks for Images?
A 224x224 RGB image has 224 x 224 x 3 = 150,528 input values. A single Dense layer with 1024 neurons would need 150,528 x 1024 + 1024 = over 154 million parameters. That is just one layer. The model would be enormous, extremely slow to train, and would massively overfit. CNNs solve this through parameter sharing -- the same small filter (e.g., 3x3 = 9 weights) is applied across the entire image, reducing parameters by orders of magnitude.
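The arithmetic above can be checked directly. This is a minimal sketch comparing one Dense layer against one Conv2D layer with 32 filters of size 3x3 (illustrative layer sizes matching the text):

```python
# Parameter count: one Dense layer vs. one Conv2D layer on a 224x224 RGB image
H, W, C = 224, 224, 3

# Dense: every flattened input connects to every neuron, plus one bias per neuron
inputs = H * W * C                    # 150,528 flattened features
dense_params = inputs * 1024 + 1024   # weights + biases

# Conv2D: 32 filters, each 3x3 across C input channels, plus one bias per filter
conv_params = 32 * (3 * 3 * C + 1)

print(f"Dense layer params:  {dense_params:,}")   # 154,141,696
print(f"Conv2D layer params: {conv_params:,}")    # 896
```

The convolutional layer's count does not grow with image size, which is exactly the parameter-sharing advantage described above.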
Why Does It Matter?
Why Are CNNs Important?
1. State of the Art in Computer Vision
CNNs power almost every modern computer vision system: image classification (is this a cat or dog?), object detection (where are the cars in this image?), facial recognition, medical imaging (detecting tumors from X-rays), autonomous driving, and satellite image analysis. If Aarav wants to work in any field involving visual data, CNNs are fundamental.
2. Translation Invariance
A CNN can recognize a cat whether it appears in the top-left corner, the center, or the bottom-right of an image. Because the same filter scans the entire image, the features it learns are position-independent. A Dense network, in contrast, would need to learn separate patterns for every possible position of every object -- an impossibly complex task.
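Translation invariance can be demonstrated numerically. The sketch below uses a hypothetical naive helper `conv2d_valid` (not a library function) to apply one vertical-edge filter to the same edge placed at two different positions; the peak response is identical:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D convolution (cross-correlation), stride 1."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

edge_filter = np.array([[-1, 0, 1]] * 3)  # 3x3 vertical-edge detector

# The same vertical edge, at two different horizontal positions
img_a = np.zeros((5, 7)); img_a[:, 2:] = 1.0
img_b = np.zeros((5, 7)); img_b[:, 4:] = 1.0

resp_a = conv2d_valid(img_a, edge_filter)
resp_b = conv2d_valid(img_b, edge_filter)
print(resp_a.max(), resp_b.max())  # 3.0 3.0 -- same peak, different location
```

The filter fires with the same strength wherever the edge appears; only the *location* of the peak in the feature map changes.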
3. Spatial Hierarchy of Features
CNNs automatically learn a hierarchy: simple features (edges) in early layers combine into complex features (objects) in deeper layers. This mirrors how the human visual cortex processes information. You do not need to manually engineer features like edge detectors or corner detectors -- the CNN learns them from data.
4. Parameter Efficiency
A 3x3 convolution filter has only 9 learnable weights (plus 1 bias), regardless of the image size. This filter is shared across the entire image, dramatically reducing the number of parameters compared to Dense layers. Fewer parameters means less overfitting, less memory, and faster training.
5. Transfer Learning Revolution
CNNs trained on millions of images (like ImageNet) learn general visual features that transfer to new tasks. Meera can take a VGG16 model pre-trained on 14 million images and fine-tune it for her specific task (e.g., classifying Indian medicinal plants) with only a few hundred labeled images. This makes CNNs practical even with limited data.
Detailed Explanation
1. The Convolution Operation
A convolution applies a small matrix called a filter (or kernel) to an image by sliding it across every position. At each position, the filter elements are multiplied with the overlapping image pixels, and the results are summed to produce a single output value. This output is called a feature map.
# Conceptual example of 3x3 convolution on a 5x5 image
import numpy as np
# 5x5 grayscale image
image = np.array([
[1, 2, 0, 1, 3],
[0, 1, 2, 3, 1],
[1, 0, 1, 0, 2],
[2, 1, 0, 1, 0],
[0, 3, 2, 1, 1]
])
# 3x3 edge detection filter
filter_v = np.array([
[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]
])
# Manual convolution at position (0,0)
patch = image[0:3, 0:3]
result = np.sum(patch * filter_v)
print(f"Convolution at (0,0): {result}") # Sum of element-wise products
# The filter slides across all valid positions
# For a 5x5 image with 3x3 filter: output is 3x3 (no padding)
Key terms:
- Filter/Kernel: A small matrix (typically 3x3 or 5x5) with learnable weights. Each filter detects a specific pattern (horizontal edge, vertical edge, blur, etc.).
- Feature Map: The output of applying one filter across the entire image. If you have 32 filters, you get 32 feature maps.
- Stride: How many pixels the filter moves at each step. Stride=1 moves one pixel at a time. Stride=2 skips every other position, reducing the output size by half.
- Padding: Adding zeros around the image border. padding='same' keeps the output the same size as the input; padding='valid' (default) adds no padding, so the output shrinks.
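The interaction of stride and padding with output size can be checked with a small helper function (a sketch of the standard formulas; Keras computes this internally):

```python
import math

def conv_output_size(n, f, stride=1, padding='valid'):
    """Spatial output size of a convolution along one dimension."""
    if padding == 'valid':
        return math.floor((n - f) / stride) + 1
    if padding == 'same':
        return math.ceil(n / stride)
    raise ValueError(f"unknown padding: {padding}")

print(conv_output_size(28, 3))                  # 26 (valid, stride 1)
print(conv_output_size(28, 3, padding='same'))  # 28 (same, stride 1)
print(conv_output_size(28, 3, stride=2))        # 13 (valid, stride 2)
```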
2. Conv2D in Keras
from tensorflow.keras.layers import Conv2D
# Conv2D(filters, kernel_size, strides, padding, activation)
layer = Conv2D(
filters=32, # Number of filters (output feature maps)
kernel_size=(3, 3), # Size of each filter
strides=(1, 1), # Step size (default is 1)
padding='valid', # 'valid' (no padding) or 'same' (zero padding)
activation='relu', # Activation function
input_shape=(28, 28, 1) # Height, Width, Channels
)
Output shape calculation: for an input of height H, width W, with filter size F, stride S, and padding P:
# With padding='valid' (no padding, P=0):
# Output size = floor((H - F) / S) + 1
# Example: 28x28 input, 3x3 filter, stride 1
# Output = (28 - 3) / 1 + 1 = 26
# So Conv2D(32, (3,3)) on 28x28x1 input gives 26x26x32
# With padding='same':
# Output size = ceil(H / S) (same as input if stride=1)
# Example: 28x28 input, 3x3 filter, stride 1, same padding
# Output = 28x28x32
3. Pooling Layers
Pooling reduces the spatial dimensions (height and width) of feature maps while retaining the most important information. This reduces computation, memory, and helps prevent overfitting.
from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D
# Max Pooling: takes the maximum value in each pool window
# Most common: 2x2 pool with stride 2, reduces dimensions by half
MaxPooling2D(pool_size=(2, 2)) # 26x26 becomes 13x13
# Average Pooling: takes the average value in each pool window
AveragePooling2D(pool_size=(2, 2))
Max Pooling is more common because it retains the strongest activations (most prominent features). If a filter detects a vertical edge, max pooling preserves the strongest edge response regardless of its exact position within the pool window. This adds translation invariance.
Why reduce dimensions? After convolutions, feature maps can be large (e.g., 224x224x64). Pooling reduces them (to 112x112x64), making subsequent layers cheaper to compute. It also forces the network to learn more abstract representations -- instead of knowing exactly where a feature is, the network learns that the feature exists somewhere in a region.
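What max pooling actually computes can be shown in a few lines of numpy. This is a minimal sketch of 2x2 pooling with stride 2 on a 4x4 feature map (the reshape trick assumes dimensions divisible by the pool size):

```python
import numpy as np

# 4x4 feature map
fmap = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 2, 0, 1],
])

# 2x2 max pooling, stride 2: split into 2x2 blocks, take the max of each
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 5]
#  [2 3]]
```

Each output value keeps only the strongest activation in its window, which is why the exact position of a feature within the window no longer matters.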
4. Standard CNN Architecture
The classic pattern is: Convolution -> Activation (ReLU) -> Pooling, repeated several times, followed by Flatten and Dense layers for classification.
model = tf.keras.Sequential([
# Block 1: learn low-level features (edges, textures)
Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)),
Conv2D(32, (3, 3), activation='relu', padding='same'),
MaxPooling2D((2, 2)), # 32x32 -> 16x16
# Block 2: learn mid-level features (parts, patterns)
Conv2D(64, (3, 3), activation='relu', padding='same'),
Conv2D(64, (3, 3), activation='relu', padding='same'),
MaxPooling2D((2, 2)), # 16x16 -> 8x8
# Block 3: learn high-level features (objects)
Conv2D(128, (3, 3), activation='relu', padding='same'),
Conv2D(128, (3, 3), activation='relu', padding='same'),
MaxPooling2D((2, 2)), # 8x8 -> 4x4
# Classification head
Flatten(), # 4x4x128 = 2048
Dense(256, activation='relu'),
tf.keras.layers.Dropout(0.5),
Dense(10, activation='softmax')
])
5. Famous CNN Architectures
LeNet-5 (1998, Yann LeCun)
The first successful CNN, used for digit recognition. Architecture: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output. Only 60,000 parameters. Proved that CNNs could be trained with backpropagation and work for real-world tasks.
AlexNet (2012, Alex Krizhevsky)
Won the ImageNet competition by a massive margin, sparking the deep learning revolution. Key innovations: used ReLU instead of tanh (faster training), Dropout for regularization, trained on GPUs for the first time. 8 layers, 60 million parameters.
VGG16 (2014, Karen Simonyan and Andrew Zisserman)
Showed that depth matters. Used only 3x3 convolutions stacked deep (16 layers). Key insight: two 3x3 convolutions have the same receptive field as one 5x5 but with fewer parameters and more non-linearity. 138 million parameters. Very popular for transfer learning because of its simplicity.
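VGG's key insight can be verified with a quick parameter count. Assuming C input and C output channels and ignoring biases, two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 convolution but with fewer weights:

```python
# Weight count: two stacked 3x3 convs vs. one 5x5 conv (C channels in and out)
C = 64
two_3x3 = 2 * (3 * 3 * C * C)   # 73,728 weights
one_5x5 = 5 * 5 * C * C         # 102,400 weights
print(two_3x3, one_5x5)         # the stacked 3x3 pair uses ~28% fewer weights
```

The stacked version also applies two ReLU non-linearities instead of one, making the network more expressive per parameter.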
ResNet (2015, Kaiming He)
Introduced residual connections (skip connections) that allow training very deep networks (50, 101, even 152 layers). The key innovation: instead of learning F(x), learn F(x) + x (the residual). This solves the vanishing gradient problem in very deep networks. Skip connections let gradients flow directly through shortcut paths. Won ImageNet 2015.
# ResNet-style skip connection (conceptual)
# Instead of: output = F(x)
# ResNet does: output = F(x) + x (add the input to the output)
from tensorflow.keras.layers import Add, Input, Conv2D, BatchNormalization, Activation
from tensorflow.keras import Model
def residual_block(x, filters):
shortcut = x # Save the input
x = Conv2D(filters, (3, 3), padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = Conv2D(filters, (3, 3), padding='same')(x)
x = BatchNormalization()(x)
x = Add()([x, shortcut]) # Add input to output (skip connection)
x = Activation('relu')(x)
return x
6. Transfer Learning
Transfer learning uses a model pre-trained on a large dataset (typically ImageNet with 14 million images) as the starting point for a new task. Instead of training from random weights, you start with weights that already understand general visual features.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras import Model
# Load VGG16 pre-trained on ImageNet, without the top classification layers
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# Freeze the pre-trained layers (do not update their weights)
base_model.trainable = False
# Add custom classification layers
x = Flatten()(base_model.output)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
output = Dense(2, activation='softmax')(x) # 2 classes: cat vs dog
model = Model(inputs=base_model.input, outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
Steps: (1) Load a pre-trained model without the classification head. (2) Freeze the pre-trained layers so their weights are not changed. (3) Add new Dense layers for your specific task. (4) Train only the new layers on your data.
Fine-tuning: After training the new layers, you can unfreeze some of the top pre-trained layers and train them with a very low learning rate. This adapts the pre-trained features to your specific task.
# Fine-tuning: unfreeze the top layers of VGG16
base_model.trainable = True
# Freeze all layers except the last 4
for layer in base_model.layers[:-4]:
layer.trainable = False
# Recompile with a lower learning rate
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
7. Data Augmentation
Data augmentation creates modified versions of training images to increase the effective dataset size and reduce overfitting. The model sees slightly different versions of each image during each epoch.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Create augmentation pipeline
train_datagen = ImageDataGenerator(
rescale=1./255, # Normalize pixel values
rotation_range=20, # Random rotation up to 20 degrees
width_shift_range=0.2, # Random horizontal shift
height_shift_range=0.2, # Random vertical shift
horizontal_flip=True, # Random horizontal flip
zoom_range=0.2, # Random zoom
shear_range=0.15, # Random shear
fill_mode='nearest' # How to fill new pixels after transforms
)
# Validation data should NOT be augmented, only rescaled
val_datagen = ImageDataGenerator(rescale=1./255)
# Load images from directory
train_generator = train_datagen.flow_from_directory(
'data/train',
target_size=(224, 224),
batch_size=32,
class_mode='binary'
)
val_generator = val_datagen.flow_from_directory(
'data/validation',
target_size=(224, 224),
batch_size=32,
class_mode='binary'
)
Important: only augment training data, never validation or test data. Validation and test data should only be rescaled/normalized to accurately measure performance.
Code Examples
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import numpy as np
# Load CIFAR-10 dataset (60,000 32x32 color images, 10 classes)
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
print(f"Training shape: {X_train.shape}") # (50000, 32, 32, 3)
print(f"Test shape: {X_test.shape}") # (10000, 32, 32, 3)
# Normalize to [0, 1]
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
# Build CNN
model = tf.keras.Sequential([
# Block 1
Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)),
BatchNormalization(),
Conv2D(32, (3, 3), activation='relu', padding='same'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.25),
# Block 2
Conv2D(64, (3, 3), activation='relu', padding='same'),
BatchNormalization(),
Conv2D(64, (3, 3), activation='relu', padding='same'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.25),
# Block 3
Conv2D(128, (3, 3), activation='relu', padding='same'),
BatchNormalization(),
Conv2D(128, (3, 3), activation='relu', padding='same'),
BatchNormalization(),
MaxPooling2D((2, 2)),
Dropout(0.25),
# Classification head
Flatten(),
Dense(256, activation='relu'),
BatchNormalization(),
Dropout(0.5),
Dense(10, activation='softmax')
])
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
model.summary()
callbacks = [
EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)
]
history = model.fit(
X_train, y_train,
epochs=100,
batch_size=64,
validation_split=0.1,
callbacks=callbacks
)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest accuracy: {test_acc:.4f}")
# Predict on samples
predictions = model.predict(X_test[:5])
for i in range(5):
pred = class_names[predictions[i].argmax()]
actual = class_names[y_test[i][0]]
print(f"Predicted: {pred}, Actual: {actual}")
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten, Dropout, GlobalAveragePooling2D
from tensorflow.keras import Model
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
# Load VGG16 pre-trained on ImageNet (without classification head)
base_model = VGG16(
weights='imagenet',
include_top=False, # Exclude the top Dense layers
input_shape=(128, 128, 3) # Our input size
)
# Freeze all layers in the base model
base_model.trainable = False
print(f"VGG16 layers: {len(base_model.layers)}")
print(f"VGG16 total params: {base_model.count_params():,}")
print(f"VGG16 trainable after freeze: "
f"{sum(w.numpy().size for w in base_model.trainable_weights)}")
# Add custom classification head
x = base_model.output
x = GlobalAveragePooling2D()(x) # Better than Flatten for transfer learning
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
output = Dense(1, activation='sigmoid')(x) # Binary: cat (0) or dog (1)
model = Model(inputs=base_model.input, outputs=output)
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
# Show trainable vs non-trainable params
trainable_count = sum(tf.size(w).numpy() for w in model.trainable_weights)
non_trainable_count = sum(tf.size(w).numpy() for w in model.non_trainable_weights)
print(f"\nTotal params: {model.count_params():,}")
print(f"Trainable params: {trainable_count:,}")
print(f"Non-trainable (frozen) params: {non_trainable_count:,}")
# Generate dummy data for demonstration (in practice, load real images)
np.random.seed(42)
X_train = np.random.rand(500, 128, 128, 3).astype('float32')
y_train = np.random.randint(0, 2, 500)
X_val = np.random.rand(100, 128, 128, 3).astype('float32')
y_val = np.random.randint(0, 2, 100)
# Phase 1: Train only the custom head
print("\n=== Phase 1: Training classification head ===")
history = model.fit(
X_train, y_train,
epochs=10,
batch_size=32,
validation_data=(X_val, y_val),
callbacks=[EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)]
)
# Phase 2: Fine-tune the top layers of VGG16
print("\n=== Phase 2: Fine-tuning top VGG16 layers ===")
base_model.trainable = True
# Freeze all layers except the last 4 convolutional layers
for layer in base_model.layers[:-4]:
layer.trainable = False
# Recompile with a much lower learning rate
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
loss='binary_crossentropy',
metrics=['accuracy']
)
history_fine = model.fit(
X_train, y_train,
epochs=10,
batch_size=32,
validation_data=(X_val, y_val),
callbacks=[EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)]
)
print(f"\nFinal validation accuracy: {history_fine.history['val_accuracy'][-1]:.4f}")
GlobalAveragePooling2D is preferred over Flatten for transfer learning because it reduces the parameter count and provides spatial invariance.
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import numpy as np
import matplotlib.pyplot as plt
# Load a pre-trained model
model = tf.keras.applications.VGG16(weights='imagenet', include_top=False)
# Create a model that outputs feature maps from different layers
layer_names = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1']
outputs = [model.get_layer(name).output for name in layer_names]
visualization_model = tf.keras.Model(inputs=model.input, outputs=outputs)
# Load and preprocess a sample image
# Using a random image for demonstration
np.random.seed(42)
img = np.random.rand(1, 224, 224, 3).astype('float32') * 255
# Get feature maps
feature_maps = visualization_model.predict(img)
# Display feature maps from each layer
for i, (name, fmap) in enumerate(zip(layer_names, feature_maps)):
print(f"Layer: {name}, Feature map shape: {fmap.shape}")
n_features = min(8, fmap.shape[-1]) # Show first 8 filters
fig, axes = plt.subplots(1, n_features, figsize=(20, 3))
fig.suptitle(f'Feature maps from {name}')
for j in range(n_features):
axes[j].imshow(fmap[0, :, :, j], cmap='viridis')
axes[j].axis('off')
axes[j].set_title(f'Filter {j}')
plt.tight_layout()
plt.savefig(f'feature_maps_{name}.png')
plt.show()
print("\nEarly layers detect simple patterns (edges, colors).")
print("Deeper layers detect complex patterns (textures, parts, objects).")
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import numpy as np
import matplotlib.pyplot as plt
# Create a sample image (for demonstration)
np.random.seed(42)
sample_image = np.random.rand(1, 150, 150, 3).astype('float32')
# Define augmentation
datagen = ImageDataGenerator(
rotation_range=30, # Rotate up to 30 degrees
width_shift_range=0.2, # Shift horizontally up to 20%
height_shift_range=0.2, # Shift vertically up to 20%
horizontal_flip=True, # Flip horizontally
zoom_range=0.2, # Zoom in/out up to 20%
shear_range=0.15, # Shear transformation
brightness_range=[0.8, 1.2],# Brightness variation
fill_mode='nearest' # Fill new pixels
)
# Generate augmented versions
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
fig.suptitle('Original vs Augmented Images', fontsize=16)
# Show original
axes[0][0].imshow(sample_image[0])
axes[0][0].set_title('Original')
axes[0][0].axis('off')
# Generate 9 augmented versions
augmented_iterator = datagen.flow(sample_image, batch_size=1)
for i in range(9):
augmented = next(augmented_iterator)
row = (i + 1) // 5
col = (i + 1) % 5
axes[row][col].imshow(augmented[0])
axes[row][col].set_title(f'Augmented {i+1}')
axes[row][col].axis('off')
plt.tight_layout()
plt.savefig('augmentation_examples.png')
plt.show()
print("Each epoch sees different augmented versions of the same images.")
print("This effectively multiplies the dataset size and reduces overfitting.")
print("\nKey: NEVER augment validation or test data!")
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
# Build a CNN and trace shapes through each layer
model = tf.keras.Sequential([
# Input: 28x28x1 (grayscale image)
Conv2D(16, (3, 3), activation='relu', input_shape=(28, 28, 1)),
# Output: (28-3+1) x (28-3+1) x 16 = 26x26x16
MaxPooling2D((2, 2)),
# Output: 26//2 x 26//2 x 16 = 13x13x16
Conv2D(32, (3, 3), activation='relu'),
# Output: (13-3+1) x (13-3+1) x 32 = 11x11x32
MaxPooling2D((2, 2)),
# Output: 11//2 x 11//2 x 32 = 5x5x32
Conv2D(64, (3, 3), activation='relu'),
# Output: (5-3+1) x (5-3+1) x 64 = 3x3x64
Flatten(),
# Output: 3 * 3 * 64 = 576
Dense(128, activation='relu'),
# Output: 128
Dense(10, activation='softmax')
# Output: 10
])
# Print shapes at each layer
print(f"{'Layer':<25} {'Output Shape':<20} {'Params':>10}")
print("-" * 57)
for layer in model.layers:
name = layer.name
shape = str(layer.output_shape)
params = layer.count_params()
print(f"{name:<25} {shape:<20} {params:>10,}")
print(f"\nTotal parameters: {model.count_params():,}")
With padding='valid' (default): output = input - filter + 1. For MaxPooling2D with pool_size=2: output = input // 2. The number of filters determines the depth (channels) of the output. Understanding shape flow is essential for designing CNNs and debugging shape mismatch errors.
import tensorflow as tf
from tensorflow.keras.applications import VGG16, ResNet50, MobileNetV2
import numpy as np
# Compare three popular pre-trained models
models_info = {
'VGG16': VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3)),
'ResNet50': ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3)),
'MobileNetV2': MobileNetV2(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
}
print(f"{'Model':<15} {'Layers':>8} {'Params':>15} {'Output Shape':<20}")
print("-" * 60)
for name, model in models_info.items():
n_layers = len(model.layers)
n_params = model.count_params()
out_shape = str(model.output_shape)
print(f"{name:<15} {n_layers:>8} {n_params:>15,} {out_shape:<20}")
print("\nGuidelines for choosing:")
print("- VGG16: Simple, great for learning, heavy (528MB)")
print("- ResNet50: Deeper, skip connections, good accuracy (98MB)")
print("- MobileNetV2: Lightweight, fast, ideal for mobile (14MB)")
print("\nAll can be used for transfer learning with include_top=False.")
# Quick test: predict on a random image
test_img = np.random.rand(1, 224, 224, 3).astype('float32')
for name, model in models_info.items():
features = model.predict(test_img, verbose=0)
print(f"\n{name} feature shape: {features.shape}")
All three models are loaded with include_top=False to remove the classification head. The choice depends on your constraint: accuracy (ResNet50), simplicity (VGG16), or efficiency (MobileNetV2).
Common Mistakes
Not Reshaping Input for Conv2D
# MNIST images are (60000, 28, 28) but Conv2D expects (60000, 28, 28, 1)
(X_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model = tf.keras.Sequential([
Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
# ...
])
model.fit(X_train, y_train) # Error: shape mismatch
# Add the channel dimension
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
# Or use np.expand_dims
import numpy as np
X_train = np.expand_dims(X_train, axis=-1).astype('float32') / 255.0
# Now shape is (60000, 28, 28, 1) -- correct for Conv2D
Use reshape(-1, 28, 28, 1) or np.expand_dims to add the channel dimension. For RGB images (already 3 channels), this is not needed.
Augmenting Validation or Test Data
# Applying augmentation to validation data
val_datagen = ImageDataGenerator(
rescale=1./255,
rotation_range=20, # Wrong! Do not augment validation data
horizontal_flip=True, # Wrong!
zoom_range=0.2 # Wrong!
)
val_generator = val_datagen.flow_from_directory('data/val', ...)
# Validation data should ONLY be rescaled
val_datagen = ImageDataGenerator(rescale=1./255)
val_generator = val_datagen.flow_from_directory('data/val', ...)
Using a High Learning Rate for Fine-Tuning
# After unfreezing pre-trained layers, using default learning rate
base_model.trainable = True
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), # Too high!
loss='binary_crossentropy',
metrics=['accuracy']
)
# Use a very low learning rate for fine-tuning
base_model.trainable = True
for layer in base_model.layers[:-4]:
layer.trainable = False # Only unfreeze top layers
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), # 10-100x lower
loss='binary_crossentropy',
metrics=['accuracy']
)
Using Flatten Instead of GlobalAveragePooling2D in Transfer Learning
# Using Flatten with VGG16 output
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = Flatten()(base_model.output) # 7x7x512 = 25,088 features!
x = Dense(256, activation='relu')(x) # 25,088 x 256 = 6.4 million params!
output = Dense(10, activation='softmax')(x)
from tensorflow.keras.layers import GlobalAveragePooling2D
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base_model.output) # Averages to 512 features
x = Dense(256, activation='relu')(x) # 512 x 256 = 131K params -- much better
output = Dense(10, activation='softmax')(x)
GlobalAveragePooling2D averages each feature map into a single value, converting (7, 7, 512) to (512,). Flatten concatenates all values, converting to (25088,). GlobalAveragePooling2D is preferred for transfer learning because it drastically reduces parameters, provides spatial invariance, and acts as a regularizer.
Summary
- CNNs are neural networks designed for grid-like data (especially images). They exploit spatial structure through convolution operations, where small learnable filters slide across the input to detect patterns like edges, textures, and objects.
- A Conv2D layer applies multiple filters to the input. Key parameters: filters (number of feature maps), kernel_size (filter dimensions, typically 3x3), stride (step size), padding ('valid' shrinks output, 'same' preserves size), and activation (usually 'relu').
- Output shape formula for Conv2D with padding='valid': floor((input_size - filter_size) / stride) + 1. For padding='same': ceil(input_size / stride). The number of output channels equals the number of filters.
- Pooling layers (MaxPooling2D, AveragePooling2D) reduce spatial dimensions while retaining important features. MaxPooling2D(2,2) halves both height and width. This reduces computation, memory, and adds translation invariance.
- The standard CNN pattern is: Conv2D -> ReLU -> Conv2D -> ReLU -> MaxPool (repeated), then Flatten -> Dense -> Output. Early layers learn simple features (edges), deep layers learn complex features (objects).
- Famous architectures: LeNet-5 (1998, digit recognition), AlexNet (2012, GPU training, ReLU, Dropout), VGG16 (2014, deep 3x3 convolutions), ResNet (2015, skip connections enabling 100+ layer networks).
- Transfer learning reuses a model pre-trained on a large dataset (ImageNet). Steps: load model without top (include_top=False), freeze layers, add custom classification head, train the head, optionally fine-tune top pre-trained layers with a very low learning rate.
- Data augmentation (rotation, flip, zoom, shift) creates diverse training samples from existing images. Apply only to training data, never to validation or test data. This reduces overfitting and improves generalization.
- Parameter sharing is what makes CNNs efficient. A 3x3 filter has only 9 weights but is applied across the entire image. This is far more efficient than Dense layers, which need a separate weight for every input-output connection.
- Use GlobalAveragePooling2D instead of Flatten when connecting a convolutional base to Dense layers in transfer learning. It reduces parameters dramatically and acts as a regularizer.