Practice Questions — Generative AI - GANs, VAEs, and Diffusion Models
Topic-Specific Questions
Question 1
Easy
What is the output of the following code?
gen_types = ["Text", "Image", "Audio", "Video", "Code"]
for i, t in enumerate(gen_types):
    print(f"{i+1}. {t} Generation")

enumerate gives index-value pairs. Indices start at 0, but we add 1.

1. Text Generation
2. Image Generation
3. Audio Generation
4. Video Generation
5. Code Generation

Question 2
Easy
What is the output?
gan_components = {"Generator": "creates fake data", "Discriminator": "detects fakes"}
for component, role in gan_components.items():
    print(f"{component}: {role}")

Dictionary iteration produces key-value pairs.

Generator: creates fake data
Discriminator: detects fakes

Question 3
Easy
What is the output?
import torch
z = torch.randn(1, 100)
print(f"Noise shape: {z.shape}")
print(f"Mean: {z.mean():.1f}")
print(f"Noise is input to: Generator")

torch.randn samples from N(0,1). Shape is (1, 100).

Noise shape: torch.Size([1, 100])
Mean: 0.0 (approximately)
Noise is input to: Generator

Question 4
Easy
What is the output?
vae_loss_components = ["Reconstruction Loss", "KL Divergence"]
for loss in vae_loss_components:
    print(loss)
print(f"Total components: {len(vae_loss_components)}")

VAE loss has exactly two components.

Reconstruction Loss
KL Divergence
Total components: 2

Question 5
Medium
What is the output?
import torch
import torch.nn as nn
G = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 784),
    nn.Tanh()
)
z = torch.randn(4, 64) # Batch of 4, noise dim 64
fake = G(z)
print(f"Input shape: {z.shape}")
print(f"Output shape: {fake.shape}")
print(f"Output range: [{fake.min():.2f}, {fake.max():.2f}]")

Linear(64,128) then Linear(128,784). Tanh outputs values in [-1, 1].

Input shape: torch.Size([4, 64])
Output shape: torch.Size([4, 784])
Output range: within [-1, 1]; the exact min/max depend on the random weights (with default initialization the values typically sit well inside the bounds)

Question 6
Medium
What is the output?
import torch
def reparameterize(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps
mu = torch.zeros(3) # Mean = 0
log_var = torch.zeros(3) # log(var) = 0, so var = 1, std = 1
torch.manual_seed(42)
z = reparameterize(mu, log_var)
print(f"mu: {mu.tolist()}")
print(f"std: {torch.exp(0.5 * log_var).tolist()}")
print(f"z shape: {z.shape}")
print(f"z is from N(0,1): approximately")

When mu=0 and log_var=0, std=1. z = 0 + 1 * epsilon = epsilon, which is from N(0,1).

mu: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
z shape: torch.Size([3])
z is from N(0,1): approximately

Question 7
Medium
What is the output?
def stable_diffusion_components():
    components = [
        ("CLIP Text Encoder", "Converts text prompt to embedding"),
        ("U-Net Denoiser", "Predicts and removes noise in latent space"),
        ("VAE Decoder", "Converts latent to pixel-space image")
    ]
    for name, role in components:
        print(f" {name}: {role}")
    return len(components)
print("Stable Diffusion Architecture:")
n = stable_diffusion_components()
print(f"Total components: {n}")

Stable Diffusion has exactly 3 main components in its pipeline.

Stable Diffusion Architecture:
 CLIP Text Encoder: Converts text prompt to embedding
 U-Net Denoiser: Predicts and removes noise in latent space
 VAE Decoder: Converts latent to pixel-space image
Total components: 3

Question 8
Medium
What is the output?
def latent_efficiency():
    pixel_size = 512 * 512 * 3  # 512x512 RGB image
    latent_size = 64 * 64 * 4   # 64x64 latent with 4 channels
    ratio = pixel_size / latent_size
    return pixel_size, latent_size, ratio
px, lt, r = latent_efficiency()
print(f"Pixel space: {px:,} values")
print(f"Latent space: {lt:,} values")
print(f"Efficiency gain: {r:.1f}x")

Calculate total values for each space: 512*512*3 vs 64*64*4.

Pixel space: 786,432 values
Latent space: 16,384 values
Efficiency gain: 48.0x

Question 9
Hard
What is the output?
import torch
import torch.nn as nn
class SimpleDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )
    def forward(self, x):
        return self.net(x)
D = SimpleDiscriminator()
# Test with real-looking data (values near 1) and random noise
real_like = torch.ones(2, 784) * 0.8
noise = torch.randn(2, 784)
with torch.no_grad():
    real_out = D(real_like)
    noise_out = D(noise)
print(f"Real-like output shape: {real_out.shape}")
print(f"Output range: [0, 1] (Sigmoid)")
print(f"Before training, outputs are near: 0.5 (random)")

Before training, the discriminator has random weights, so it outputs values near 0.5 for any input.

Real-like output shape: torch.Size([2, 1])
Output range: [0, 1] (Sigmoid)
Before training, outputs are near: 0.5 (random)

Question 10
Hard
What is the output?
import torch
def forward_diffusion(x_0, t, T=1000):
    """Add noise proportional to timestep."""
    beta_t = t / T  # Simplified linear schedule
    alpha_t = 1 - beta_t
    noise = torch.randn_like(x_0)
    x_t = torch.sqrt(torch.tensor(alpha_t)) * x_0 + torch.sqrt(torch.tensor(beta_t)) * noise
    return x_t
x_0 = torch.ones(4) # "Clean" signal
for t in [0, 250, 500, 750, 1000]:
    x_t = forward_diffusion(x_0, t)
    signal_pct = round((1 - t/1000) * 100)
    print(f"t={t:4d} | Signal: {signal_pct:3d}% | Mean: {x_t.mean():.3f} | Std: {x_t.std():.3f}")

At t=0, the signal is pure. As t increases, noise dominates. At t=T, the signal is nearly zero.
The mean decreases from ~1.0 toward ~0.0 and std increases from ~0.0 toward ~1.0 as the timestep increases from 0 to 1000, showing the signal being gradually replaced by noise.
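Since the printed values are stochastic, they can be checked against the expected statistics of this simplified schedule; a small sketch using the same beta = t/T assumption as the question:

```python
import math

# Expected statistics of x_t = sqrt(1 - t/T)*x_0 + sqrt(t/T)*noise with x_0 = 1:
# the mean tracks the signal coefficient, the std tracks the noise coefficient.
T = 1000
for t in [0, 250, 500, 750, 1000]:
    expected_mean = math.sqrt(1 - t / T)  # decays 1.000 -> 0.000
    expected_std = math.sqrt(t / T)       # grows  0.000 -> 1.000
    print(f"t={t:4d} | E[mean]={expected_mean:.3f} | E[std]={expected_std:.3f}")
```

Note the decay is not linear: at t=500 the expected mean is about 0.707, not 0.5, because the coefficients are square roots.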
Question 11
Easy
What is the difference between a discriminative model and a generative model?
Think about what each type of model learns to do.
A discriminative model learns the boundary between classes -- given input data, it predicts a label or category (P(label|data)). Examples: image classifiers, spam detectors. A generative model learns the underlying data distribution and can create new data samples (P(data) or P(data|condition)). Examples: GANs generating images, GPT generating text. Discriminative models answer "what is this?"; generative models answer "create something like this."
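The contrast can be made concrete with a toy sketch (the 1-D dataset, threshold rule, and all numbers here are invented for illustration):

```python
import torch

torch.manual_seed(0)
# Toy 1-D dataset: two classes centered at -2 and +2
class0 = torch.randn(100) - 2.0
class1 = torch.randn(100) + 2.0

# Discriminative view: learn a decision boundary, answer "what is this?"
boundary = (class0.mean() + class1.mean()) / 2  # roughly 0.0

def predict(x):
    return int(x > boundary)

# Generative view: fit the class-1 distribution, then create new samples
mu, std = class1.mean(), class1.std()
new_sample = mu + std * torch.randn(1)  # "create something like this"

print(predict(torch.tensor(3.0)))  # a point near +3 falls on the class-1 side
print(new_sample.shape)
```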
Question 12
Medium
What is mode collapse in GANs, and why does it happen?
Think about what happens when the Generator finds one output that always fools the Discriminator.
Mode collapse occurs when the Generator produces only a small subset of possible outputs instead of the full diversity of the training data. For example, a GAN trained on all 10 digits might only generate '1' and '7'. This happens because the Generator finds a few outputs that consistently fool the Discriminator, so it has no incentive to produce diverse outputs. The Discriminator then adapts to detect those specific outputs, and the Generator switches to a different small set, creating an oscillating cycle rather than convergence.
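The symptom can be illustrated by counting class coverage among generated samples (the labels here are simulated for the sketch, not produced by a real GAN):

```python
import torch

torch.manual_seed(0)
# Simulated class labels of 1000 generated digit images
healthy = torch.randint(0, 10, (1000,))                          # diverse generator
collapsed = torch.tensor([1, 7])[torch.randint(0, 2, (1000,))]   # stuck on '1' and '7'

print(f"Healthy coverage: {healthy.unique().numel()}/10 classes")
print(f"Collapsed coverage: {collapsed.unique().numel()}/10 classes")
```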
Question 13
Hard
Why do diffusion models work in latent space (Stable Diffusion) instead of directly in pixel space? What is the role of the VAE in this architecture?
Think about computational cost and the relationship between pixel space and latent space dimensions.
Pixel-space diffusion operates on full-resolution images (e.g., 512x512x3 = 786,432 dimensions), making each denoising step extremely expensive computationally. Latent diffusion first uses a pre-trained VAE encoder to compress images to a much smaller latent space (64x64x4 = 16,384 dimensions -- 48x smaller). The diffusion process (adding and removing noise) happens entirely in this latent space. After denoising, the VAE decoder converts the clean latent back to a full-resolution image. This makes Stable Diffusion fast enough to run on consumer GPUs while maintaining high image quality.
Question 14
Hard
Explain the reparameterization trick in VAEs. Why is it necessary, and how does it work?
Think about whether gradient descent can backpropagate through a random sampling operation.
The reparameterization trick solves a fundamental problem: backpropagation cannot compute gradients through a random sampling operation (z ~ N(mu, sigma)). The trick separates the stochastic part from the parameters:
z = mu + sigma * epsilon, where epsilon ~ N(0,1) is a fixed random sample treated as a constant. Now z is a deterministic function of mu and sigma (which we want to optimize) plus a fixed noise source. Gradients can flow through the multiplication and addition to update mu and sigma. Without this trick, the encoder's parameters could not be updated via gradient descent.

Question 15
Easy
What is the output?
gan_apps = ["Face generation", "Style transfer", "Super-resolution", "Data augmentation", "Image inpainting"]
print(f"GAN Applications ({len(gan_apps)}):")
for app in gan_apps:
    print(f" - {app}")

Simple iteration over 5 GAN application areas.

GAN Applications (5):
 - Face generation
 - Style transfer
 - Super-resolution
 - Data augmentation
 - Image inpainting

Question 16
Medium
What is the output?
import torch
import torch.nn as nn
activations = {
    "Tanh": nn.Tanh(),
    "Sigmoid": nn.Sigmoid(),
    "LeakyReLU(0.2)": nn.LeakyReLU(0.2)
}
x = torch.tensor([-2.0, 0.0, 2.0])
for name, act in activations.items():
    out = act(x)
    print(f"{name:15s}: [{out[0]:.3f}, {out[1]:.3f}, {out[2]:.3f}]")

Tanh output is in [-1,1], Sigmoid is in [0,1], LeakyReLU allows negative values.

Tanh           : [-0.964, 0.000, 0.964]
Sigmoid        : [0.119, 0.500, 0.881]
LeakyReLU(0.2) : [-0.400, 0.000, 2.000]

Question 17
Medium
What is the difference between conditional and unconditional image generation?
Think about whether the model receives guidance about what to generate.
Unconditional: Generate random images from the learned distribution with no control over content. Example: a GAN producing random faces. Conditional: Generate images based on some input condition -- a text prompt, class label, or another image. Example: Stable Diffusion generating from a text description. Conditioning gives control over the output and is essential for practical applications. Most modern models (Stable Diffusion, DALL-E) are conditional.
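One common conditioning mechanism (a hedged sketch; the dimensions and the concatenation approach are illustrative, not from the notes) is to append a label embedding to the generator's noise vector:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, n_classes, embed_dim = 64, 10, 16
label_embed = nn.Embedding(n_classes, embed_dim)

z = torch.randn(4, noise_dim)
labels = torch.tensor([0, 3, 3, 7])  # desired classes for each sample

unconditional_input = z                                         # noise only
conditional_input = torch.cat([z, label_embed(labels)], dim=1)  # noise + condition

print(unconditional_input.shape)  # torch.Size([4, 64])
print(conditional_input.shape)    # torch.Size([4, 80])
```

Text conditioning in Stable Diffusion works on the same principle, except the condition is a CLIP embedding injected via cross-attention rather than simple concatenation.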
Question 18
Easy
What is the output?
ethical_concerns = ["Deepfakes", "Copyright", "Consent", "Misinformation"]
for concern in ethical_concerns:
    print(f" Concern: {concern}")
print(f"Total: {len(ethical_concerns)}")

4 ethical concerns related to generative AI.

 Concern: Deepfakes
 Concern: Copyright
 Concern: Consent
 Concern: Misinformation
Total: 4

Question 19
Medium
What is the output?
import torch
def noise_schedule(T, schedule_type="linear"):
    if schedule_type == "linear":
        return torch.linspace(1e-4, 0.02, T)
    elif schedule_type == "cosine":
        steps = torch.arange(T + 1) / T
        return 1 - torch.cos(steps * 3.14159 / 2)
betas = noise_schedule(1000, "linear")
print(f"Schedule length: {len(betas)}")
print(f"Beta start: {betas[0]:.6f}")
print(f"Beta end: {betas[-1]:.6f}")
print(f"Beta range: [{betas.min():.6f}, {betas.max():.6f}]")

Linear schedule goes from 1e-4 to 0.02 over 1000 steps.

Schedule length: 1000
Beta start: 0.000100
Beta end: 0.020000
Beta range: [0.000100, 0.020000]

Question 20
Hard
What is the output?
import torch
import torch.nn as nn
class SimpleUNet(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1)
        )
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
model = SimpleUNet()
params = sum(p.numel() for p in model.parameters())
x = torch.randn(1, 1, 8, 8)
out = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {out.shape}")
print(f"Parameters: {params:,}")
print(f"Same shape: {x.shape == out.shape}")

With padding=1 and 3x3 kernels, spatial dimensions are preserved. Each Conv2d has in_channels * out_channels * 9 weights plus out_channels biases.

Input shape: torch.Size([1, 1, 8, 8])
Output shape: torch.Size([1, 1, 8, 8])
Parameters: 37,569 (320 + 18,496 + 18,464 + 289 across the four conv layers)
Same shape: True

Question 21
Hard
Why have diffusion models largely replaced GANs for text-to-image generation despite being slower at inference?
Consider training stability, output diversity, and conditioning capabilities.
Diffusion models replaced GANs for several reasons: (1) Training stability: Diffusion models use a simple MSE loss on noise prediction, while GANs require delicate adversarial balance and are prone to mode collapse. (2) Output diversity: GANs may miss modes (produce limited variety), while diffusion models naturally cover the full distribution. (3) Conditioning: Text conditioning integrates naturally with the iterative denoising process via cross-attention. (4) Latent diffusion made inference fast enough for practical use. The trade-off is speed: GANs need one forward pass, diffusion models need 20-100 steps. This is being addressed by distillation and faster schedulers.
Mixed & Application Questions
Question 1
Easy
What is the output?
models = {
"GAN": 2014,
"VAE": 2013,
"Diffusion": 2020,
"Stable Diffusion": 2022
}
for name, year in models.items():
    print(f"{name}: {year}")

Simple dictionary iteration showing when each model type was introduced.

GAN: 2014
VAE: 2013
Diffusion: 2020
Stable Diffusion: 2022

Question 2
Easy
What is the output?
guidance_scales = [1.0, 7.5, 15.0, 30.0]
for gs in guidance_scales:
    if gs < 3:
        quality = "ignores prompt"
    elif gs <= 12:
        quality = "balanced"
    elif gs <= 20:
        quality = "strong adherence"
    else:
        quality = "oversaturated"
    print(f"Scale {gs:5.1f}: {quality}")

Guidance scale controls prompt adherence. Too low ignores the prompt, too high oversaturates.

Scale   1.0: ignores prompt
Scale   7.5: balanced
Scale  15.0: strong adherence
Scale  30.0: oversaturated

Question 3
Medium
What is the output?
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 784),
    nn.Tanh()
)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# Break down
for i, layer in enumerate(model):
    if hasattr(layer, 'weight'):
        w = layer.weight.numel()
        b = layer.bias.numel()
        print(f"Layer {i}: weight={w:,}, bias={b:,}")

Linear(100,256) has 100*256 weights + 256 biases. Linear(256,784) has 256*784 weights + 784 biases. Total: 25,600 + 256 + 200,704 + 784.

Total parameters: 227,344
Layer 0: weight=25,600, bias=256
Layer 2: weight=200,704, bias=784

Question 4
Medium
What is the output?
def diffusion_steps_quality(steps):
    if steps < 10:
        return "very poor"
    elif steps < 25:
        return "rough preview"
    elif steps < 50:
        return "good quality"
    elif steps < 100:
        return "high quality"
    else:
        return "diminishing returns"
for s in [5, 20, 30, 50, 100, 200]:
    print(f"{s:3d} steps: {diffusion_steps_quality(s)}")

More denoising steps generally means better quality, but with diminishing returns.

  5 steps: very poor
 20 steps: rough preview
 30 steps: good quality
 50 steps: high quality
100 steps: diminishing returns
200 steps: diminishing returns

Question 5
Medium
What is the output?
import torch
# Simulate KL divergence for different distributions
def kl_divergence(mu, log_var):
    """KL(N(mu, var) || N(0, 1))"""
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return kl.item()
# Case 1: Already N(0,1)
kl1 = kl_divergence(torch.zeros(10), torch.zeros(10))
print(f"N(0,1) vs N(0,1): KL = {kl1:.2f}")
# Case 2: Shifted mean
kl2 = kl_divergence(torch.ones(10) * 2, torch.zeros(10))
print(f"N(2,1) vs N(0,1): KL = {kl2:.2f}")
# Case 3: Different variance
kl3 = kl_divergence(torch.zeros(10), torch.ones(10) * 2)
print(f"N(0,e^2) vs N(0,1): KL = {kl3:.2f}")

KL divergence is 0 when the distributions match and increases as they diverge. For case 3: -0.5 * 10 * (1 + 2 - 0 - e^2) = 5 * (e^2 - 3) ≈ 21.95.

N(0,1) vs N(0,1): KL = 0.00
N(2,1) vs N(0,1): KL = 20.00
N(0,e^2) vs N(0,1): KL = 21.95

Question 6
Hard
What is the output?
import torch
import torch.nn as nn
def count_gan_params(noise_dim, hidden_sizes, output_dim):
    g_params = 0
    prev = noise_dim
    for h in hidden_sizes:
        g_params += prev * h + h  # Linear layer: weight + bias
        prev = h
    g_params += prev * output_dim + output_dim
    d_params = 0
    prev = output_dim
    for h in reversed(hidden_sizes):
        d_params += prev * h + h
        prev = h
    d_params += prev * 1 + 1
    return g_params, d_params
g, d = count_gan_params(100, [256, 512], 784)
print(f"Generator: {g:,} params")
print(f"Discriminator: {d:,} params")
print(f"Total GAN: {g+d:,} params")
print(f"G > D: {g > d}")

Calculate each linear layer's params: in_dim * out_dim + out_dim (bias). Generator: 25,856 + 131,584 + 402,192. Discriminator: 401,920 + 131,328 + 257.

Generator: 559,632 params
Discriminator: 533,505 params
Total GAN: 1,093,137 params
G > D: True

Question 7
Medium
Deepa wants to generate synthetic training images for rare medical conditions where real patient data is scarce. Which generative model family would be most appropriate and why?
Consider which model produces the highest quality images and can be conditioned on specific attributes.
Deepa should use a diffusion model (like Stable Diffusion fine-tuned on medical images) or a conditional GAN trained on available medical images. Diffusion models are preferred for their higher image quality, better training stability, and superior mode coverage (they generate more diverse samples, important for rare conditions). She could fine-tune a pre-trained diffusion model on available medical images, conditioning on the specific condition type. Important considerations: she must validate that synthetic images are medically accurate, get approval from medical ethics boards, and ensure the synthetic data does not perpetuate biases in the training set.
Question 8
Hard
Compare GANs, VAEs, and Diffusion Models in terms of training stability, output quality, and mode coverage. When would Arjun choose each?
Think about the trade-offs: sharpness, diversity, ease of training.
GANs: Best output sharpness but unstable training (mode collapse, oscillation). Good for: style transfer, super-resolution, when sharpness matters most. VAEs: Stable training, smooth latent space (great for interpolation), but outputs tend to be blurry. Good for: latent space manipulation, anomaly detection, when you need a structured latent space. Diffusion Models: Best quality AND diversity, stable training, but slower inference (many denoising steps). Good for: high-quality image generation, text-to-image, when quality and diversity both matter. Arjun should choose VAEs for learning latent representations, GANs for real-time style transfer, and diffusion models for text-to-image generation.
Question 9
Hard
What ethical concerns arise from generative AI's ability to create deepfakes, and what technical approaches exist to detect them?
Think about both the harms and the detection techniques.
Key concerns: Misinformation (fake videos of politicians, fabricated news images), fraud (impersonating someone's voice or face), harassment (non-consensual fake intimate images), trust erosion (people stop trusting authentic content). Detection approaches: Spectral analysis (GAN-generated images have consistent frequency artifacts), biological signal detection (fake faces may lack natural blinking patterns or pulse signals), trained classifiers (models trained to distinguish real from generated images), watermarking (embedding invisible signatures in generated content, like C2PA and Google SynthID), and provenance tracking (blockchain-based content authentication). The arms race between generation and detection continues to escalate.
Question 10
Easy
What is the output?
diffusion_steps = {"Preview": 20, "Good": 50, "Best": 100}
for quality, steps in diffusion_steps.items():
    print(f"{quality:8s} quality: {steps} steps")

More denoising steps = higher quality, with diminishing returns.

Preview  quality: 20 steps
Good     quality: 50 steps
Best     quality: 100 steps

Question 11
Medium
What is the output?
def prompt_quality(prompt):
    elements = []
    if any(w in prompt.lower() for w in ["photo", "painting", "render"]):
        elements.append("style")
    if any(w in prompt.lower() for w in ["detailed", "8k", "hd"]):
        elements.append("quality")
    if any(w in prompt.lower() for w in ["lighting", "sunset", "golden"]):
        elements.append("lighting")
    return elements
prompts = [
    "A cat",
    "A cat, oil painting, highly detailed",
    "A cat in golden sunset, photo, 8K, dramatic lighting"
]
for p in prompts:
    elements = prompt_quality(p)
    print(f"Elements: {len(elements)} | {p[:50]}")

Better prompts include style, quality, and lighting specifications. Note that p[:50] truncates the third prompt (52 characters) to its first 50 characters.

Elements: 0 | A cat
Elements: 2 | A cat, oil painting, highly detailed
Elements: 3 | A cat in golden sunset, photo, 8K, dramatic lighti

Question 12
Medium
What is classifier-free guidance (CFG) in Stable Diffusion, and why is it important?
Think about how the model balances following the prompt vs generating natural-looking images.
Classifier-free guidance generates two predictions at each denoising step: one conditioned on the text prompt (conditional) and one without the prompt (unconditional). The final prediction amplifies the difference:
output = unconditional + scale * (conditional - unconditional). A guidance scale of 1.0 uses only the conditional prediction. Higher scales (7-12) amplify the text influence, making images more prompt-faithful. Too-high scales (20+) oversaturate the conditioning, producing artifacts. CFG is important because it gives users a single knob to trade prompt adherence against natural-looking output.

Question 13
Hard
What is the difference between latent diffusion and pixel-space diffusion? Why was latent diffusion a breakthrough?
Think about dimensionality and computational cost.
Pixel-space diffusion operates directly on image pixels (e.g., 512x512x3 = 786,432 dimensions). Each denoising step requires processing this full resolution, making it extremely expensive computationally. Only organizations with massive GPU clusters could train and run these models. Latent diffusion (Rombach et al., 2022) first compresses images to a smaller latent space using a pre-trained VAE (64x64x4 = 16,384 dimensions, 48x smaller), performs diffusion in this compressed space, then decodes back to pixels. This was a breakthrough because it made high-quality image generation accessible: Stable Diffusion can run on a consumer GPU with 8GB VRAM. The quality loss from compression is minimal because the VAE's latent space preserves perceptual content.
Multiple Choice Questions
MCQ 1
What are the two components of a GAN?
Answer: B
B is correct. A GAN consists of a Generator (creates fake data) and a Discriminator (classifies data as real or fake). They compete in an adversarial training process.
MCQ 2
What activation function does a GAN Generator typically use in its output layer?
Answer: C
C is correct. The Generator uses Tanh to output values in [-1, 1], matching the normalized range of training images. The Discriminator uses Sigmoid to output a probability in [0, 1].
MCQ 3
What does the forward process in a diffusion model do?
Answer: B
B is correct. The forward (diffusion) process adds Gaussian noise progressively over T timesteps, gradually destroying the data until it becomes pure random noise. The reverse process then learns to undo this.
MCQ 4
What does the VAE loss function consist of?
Answer: B
B is correct. VAE loss = reconstruction loss (how well the output matches the input) + KL divergence (regularizes the latent distribution toward N(0,1)). Both terms are necessary for a well-functioning VAE.
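The two loss terms can be computed directly; in this sketch the tensors are random stand-ins for a real encoder/decoder's outputs, so only the structure of the computation is meaningful:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(4, 784)         # inputs scaled to [0, 1]
x_recon = torch.rand(4, 784)   # stand-in for decoder output
mu = torch.randn(4, 20)        # stand-in encoder means
log_var = torch.randn(4, 20)   # stand-in encoder log-variances

# Term 1: how well the output matches the input
recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum")
# Term 2: regularize the latent distribution toward N(0, 1)
kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
total = recon_loss + kl_loss

print(kl_loss.item() >= 0)  # KL against N(0,1) is never negative
```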
MCQ 5
What is mode collapse in GANs?
Answer: B
B is correct. Mode collapse occurs when the Generator finds a few outputs that fool the Discriminator and only produces those, ignoring the full diversity of the training data.
MCQ 6
What is the reparameterization trick in VAEs?
Answer: B
B is correct. The reparameterization trick writes z = mu + sigma * epsilon (where epsilon ~ N(0,1)). This separates the stochastic part (epsilon) from the learnable parameters (mu, sigma), allowing gradients to flow through the sampling step during backpropagation.
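A minimal sketch showing that gradients actually reach mu and log_var after reparameterization (the shapes and the loss are arbitrary):

```python
import torch

mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)                      # fixed noise, treated as a constant
z = mu + torch.exp(0.5 * log_var) * eps   # deterministic in mu and log_var

z.sum().backward()
print(mu.grad)                   # tensor([1., 1., 1.])
print(log_var.grad is not None)  # True
```

Sampling z directly with torch.normal(mu, std) would leave both .grad attributes empty, which is exactly the problem the trick solves.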
MCQ 7
Why does Stable Diffusion operate in latent space instead of pixel space?
Answer: B
B is correct. Latent space (64x64x4 = 16,384 values) is 48x smaller than pixel space (512x512x3 = 786,432 values). This makes each denoising step much faster and allows Stable Diffusion to run on consumer GPUs.
MCQ 8
What is the role of CLIP in Stable Diffusion?
Answer: B
B is correct. CLIP (Contrastive Language-Image Pre-training) text encoder converts the text prompt into a numerical embedding. This embedding conditions the U-Net denoiser, guiding it to generate an image that matches the text description.
MCQ 9
What does guidance_scale (classifier-free guidance) control in Stable Diffusion?
Answer: B
B is correct. Guidance scale controls how much the generation is influenced by the text prompt. Higher values (7-12) produce more prompt-faithful images. Too high (20+) causes oversaturation and artifacts. Too low (1) produces images that largely ignore the prompt.
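The combination step itself is one line; in this sketch the two noise predictions are random stand-ins for the U-Net's conditional and unconditional outputs:

```python
import torch

torch.manual_seed(0)
uncond = torch.randn(1, 4)  # stand-in: prediction without the prompt
cond = torch.randn(1, 4)    # stand-in: prediction with the prompt

def cfg(uncond, cond, scale):
    # Amplify the direction the prompt pushes the prediction in
    return uncond + scale * (cond - uncond)

print(torch.allclose(cfg(uncond, cond, 1.0), cond))    # scale 1.0 = conditional only
print(torch.allclose(cfg(uncond, cond, 0.0), uncond))  # scale 0.0 = prompt ignored
```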
MCQ 10
What neural network architecture is typically used in the reverse process of diffusion models?
Answer: B
B is correct. The U-Net architecture (encoder-decoder with skip connections) is used for the denoising step. Skip connections help preserve fine-grained spatial details while the encoder-decoder structure captures both local and global patterns.
MCQ 11
Which GAN variant addresses the training instability caused by the Jensen-Shannon divergence in standard GANs?
Answer: B
B is correct. WGAN replaces the JS divergence with the Wasserstein distance (Earth Mover's distance), which provides smoother gradients even when the real and generated distributions do not overlap. This significantly stabilizes GAN training and reduces mode collapse.
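A sketch of the Wasserstein objectives (the critic scores here are random stand-ins; a real WGAN also needs a Lipschitz constraint on the critic, e.g. weight clipping or gradient penalty):

```python
import torch

torch.manual_seed(0)
d_real = torch.randn(8)  # critic scores on real samples (unbounded, no sigmoid)
d_fake = torch.randn(8)  # critic scores on generated samples

# Critic maximizes E[D(real)] - E[D(fake)]; gradient descent minimizes the negation
critic_loss = -(d_real.mean() - d_fake.mean())
# Generator tries to raise the critic's scores on its fakes
gen_loss = -d_fake.mean()

print(f"critic loss: {critic_loss.item():.3f}")
print(f"generator loss: {gen_loss.item():.3f}")
```

Because the critic outputs unbounded scores rather than probabilities, its gradients stay informative even when real and fake distributions barely overlap.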
MCQ 12
What is the key difference between DALL-E 2 and Stable Diffusion?
Answer: C
C is correct. Stable Diffusion (Stability AI) is open-source and uses latent diffusion (VAE-compressed space). DALL-E 2 (OpenAI) is closed-source and uses CLIP embeddings with a prior model + diffusion decoder. Stable Diffusion's open-source nature has driven widespread adoption and community innovation.
MCQ 13
Why does the Discriminator in a GAN use LeakyReLU instead of ReLU?
Answer: B
B is correct. LeakyReLU (e.g., with slope 0.2 for negative values) prevents the 'dying ReLU' problem where neurons permanently output 0 for all inputs. In the Discriminator, dead neurons would stop gradient flow and halt training. LeakyReLU ensures gradients always flow.
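This can be verified directly: for a negative input, ReLU passes zero gradient while LeakyReLU(0.2) passes 0.2.

```python
import torch
import torch.nn as nn

# Gradient through ReLU at a negative input: blocked entirely
x_relu = torch.tensor([-2.0], requires_grad=True)
nn.ReLU()(x_relu).backward()

# Gradient through LeakyReLU(0.2) at the same input: scaled by the slope
x_leaky = torch.tensor([-2.0], requires_grad=True)
nn.LeakyReLU(0.2)(x_leaky).backward()

print(x_relu.grad.item())   # 0.0
print(x_leaky.grad.item())  # 0.2
```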
MCQ 14
What is the purpose of negative prompts in Stable Diffusion?
Answer: B
B is correct. Negative prompts specify what the model should avoid in the generated image (e.g., 'blurry', 'low quality', 'watermark'). During classifier-free guidance, the model steers the generation away from the negative prompt embedding, improving output quality.
MCQ 15
In a VAE, what does KL divergence measure?
Answer: B
B is correct. KL divergence (Kullback-Leibler) measures how much the encoder's learned distribution Q(z|x) diverges from the standard normal prior P(z) = N(0,1). Minimizing KL divergence regularizes the latent space, making it smooth, continuous, and suitable for sampling new data.
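The closed-form KL per dimension, 0.5 * (mu^2 + var - 1 - ln var), can be evaluated directly (a quick sketch; 10 dimensions assumed to match the examples earlier in this page):

```python
import math

def kl_analytic(mu, var, dims=10):
    # KL(N(mu, var) || N(0, 1)) summed over independent dimensions
    return 0.5 * (mu**2 + var - 1 - math.log(var)) * dims

print(kl_analytic(0.0, 1.0))                  # 0.0  -> matched distributions
print(kl_analytic(2.0, 1.0))                  # 20.0 -> shifted mean
print(round(kl_analytic(0.0, math.e**2), 2))  # 21.95 -> inflated variance
```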
MCQ 16
What does the Generator in a GAN take as input?
Answer: B
B is correct. The Generator takes random noise (typically from a Gaussian distribution) as input and transforms it into fake data that mimics the training distribution. Different noise vectors produce different generated outputs.
Coding Challenges
Coding challenges coming soon.