Practice Questions — Generative AI - GANs, VAEs, and Diffusion Models
Topic-Specific Questions
Question 1
Easy
What is the output of the following code?
gen_types = ["Text", "Image", "Audio", "Video", "Code"]
for i, t in enumerate(gen_types):
    print(f"{i+1}. {t} Generation")

enumerate gives index-value pairs. Indices start at 0, but we add 1.

1. Text Generation
2. Image Generation
3. Audio Generation
4. Video Generation
5. Code Generation

Question 2
Easy
What is the output?
gan_components = {"Generator": "creates fake data", "Discriminator": "detects fakes"}
for component, role in gan_components.items():
    print(f"{component}: {role}")

Dictionary iteration produces key-value pairs.

Generator: creates fake data
Discriminator: detects fakes

Question 3
Easy
What is the output?
import torch
z = torch.randn(1, 100)
print(f"Noise shape: {z.shape}")
print(f"Mean: {z.mean():.1f}")
print(f"Noise is input to: Generator")

torch.randn samples from N(0,1). Shape is (1, 100).

Noise shape: torch.Size([1, 100])
Mean: 0.0 (approximately)
Noise is input to: Generator

Question 4
Easy
What is the output?
vae_loss_components = ["Reconstruction Loss", "KL Divergence"]
for loss in vae_loss_components:
    print(loss)
print(f"Total components: {len(vae_loss_components)}")

VAE loss has exactly two components.

Reconstruction Loss
KL Divergence
Total components: 2

Question 5
Medium
What is the output?
import torch
import torch.nn as nn
G = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 784),
    nn.Tanh()
)
z = torch.randn(4, 64) # Batch of 4, noise dim 64
fake = G(z)
print(f"Input shape: {z.shape}")
print(f"Output shape: {fake.shape}")
print(f"Output range: [{fake.min():.2f}, {fake.max():.2f}]")

Linear(64,128) then Linear(128,784). Tanh outputs values in [-1, 1].

Input shape: torch.Size([4, 64])
Output shape: torch.Size([4, 784])
Output range: within [-1, 1]; the exact min/max depend on the random weights (with default initialization the values typically sit well inside the bounds)

Question 6
Medium
What is the output?
import torch
def reparameterize(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps
mu = torch.zeros(3) # Mean = 0
log_var = torch.zeros(3) # log(var) = 0, so var = 1, std = 1
torch.manual_seed(42)
z = reparameterize(mu, log_var)
print(f"mu: {mu.tolist()}")
print(f"std: {torch.exp(0.5 * log_var).tolist()}")
print(f"z shape: {z.shape}")
print(f"z is from N(0,1): approximately")

When mu=0 and log_var=0, std=1. z = 0 + 1 * epsilon = epsilon, which is from N(0,1).

mu: [0.0, 0.0, 0.0]
std: [1.0, 1.0, 1.0]
z shape: torch.Size([3])
z is from N(0,1): approximately

Question 7
Medium
What is the output?
def stable_diffusion_components():
    components = [
        ("CLIP Text Encoder", "Converts text prompt to embedding"),
        ("U-Net Denoiser", "Predicts and removes noise in latent space"),
        ("VAE Decoder", "Converts latent to pixel-space image")
    ]
    for name, role in components:
        print(f" {name}: {role}")
    return len(components)
print("Stable Diffusion Architecture:")
n = stable_diffusion_components()
print(f"Total components: {n}")

Stable Diffusion has exactly 3 main components in its pipeline.

Stable Diffusion Architecture:
 CLIP Text Encoder: Converts text prompt to embedding
 U-Net Denoiser: Predicts and removes noise in latent space
 VAE Decoder: Converts latent to pixel-space image
Total components: 3

Question 8
Medium
What is the output?
def latent_efficiency():
    pixel_size = 512 * 512 * 3  # 512x512 RGB image
    latent_size = 64 * 64 * 4   # 64x64 latent with 4 channels
    ratio = pixel_size / latent_size
    return pixel_size, latent_size, ratio
px, lt, r = latent_efficiency()
print(f"Pixel space: {px:,} values")
print(f"Latent space: {lt:,} values")
print(f"Efficiency gain: {r:.1f}x")

Calculate total values for each space: 512*512*3 vs 64*64*4.

Pixel space: 786,432 values
Latent space: 16,384 values
Efficiency gain: 48.0x

Question 9
Hard
What is the output?
import torch
import torch.nn as nn
class SimpleDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )
    def forward(self, x):
        return self.net(x)
D = SimpleDiscriminator()
# Test with real-looking data (values near 1) and random noise
real_like = torch.ones(2, 784) * 0.8
noise = torch.randn(2, 784)
with torch.no_grad():
    real_out = D(real_like)
    noise_out = D(noise)
print(f"Real-like output shape: {real_out.shape}")
print(f"Output range: [0, 1] (Sigmoid)")
print(f"Before training, outputs are near: 0.5 (random)")

Before training, the discriminator has random weights, so it outputs values near 0.5 for any input.

Real-like output shape: torch.Size([2, 1])
Output range: [0, 1] (Sigmoid)
Before training, outputs are near: 0.5 (random)

Question 10
Hard
What is the output?
import torch
def forward_diffusion(x_0, t, T=1000):
    """Add noise proportional to timestep."""
    beta_t = t / T  # Simplified linear schedule
    alpha_t = 1 - beta_t
    noise = torch.randn_like(x_0)
    x_t = torch.sqrt(torch.tensor(alpha_t)) * x_0 + torch.sqrt(torch.tensor(beta_t)) * noise
    return x_t
x_0 = torch.ones(4) # "Clean" signal
for t in [0, 250, 500, 750, 1000]:
    x_t = forward_diffusion(x_0, t)
    signal_pct = round((1 - t/1000) * 100)
    print(f"t={t:4d} | Signal: {signal_pct:3d}% | Mean: {x_t.mean():.3f} | Std: {x_t.std():.3f}")

At t=0, the signal is pure. As t increases, noise dominates. At t=T, the signal is nearly zero.
The mean decreases from ~1.0 toward ~0.0 and std increases from ~0.0 toward ~1.0 as the timestep increases from 0 to 1000, showing the signal being gradually replaced by noise.
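Since the printed values are stochastic, they can be checked against the expected statistics of this simplified schedule; a small sketch using the same beta = t/T assumption as the question:

```python
import math

# Expected statistics of x_t = sqrt(1 - t/T)*x_0 + sqrt(t/T)*noise with x_0 = 1:
# the mean tracks the signal coefficient, the std tracks the noise coefficient.
T = 1000
for t in [0, 250, 500, 750, 1000]:
    expected_mean = math.sqrt(1 - t / T)  # decays 1.000 -> 0.000
    expected_std = math.sqrt(t / T)       # grows  0.000 -> 1.000
    print(f"t={t:4d} | E[mean]={expected_mean:.3f} | E[std]={expected_std:.3f}")
```

Note the decay is not linear: at t=500 the expected mean is about 0.707, not 0.5, because the coefficients are square roots.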
Question 11
Easy
What is the difference between a discriminative model and a generative model?
Think about what each type of model learns to do.
A discriminative model learns the boundary between classes -- given input data, it predicts a label or category (P(label|data)). Examples: image classifiers, spam detectors. A generative model learns the underlying data distribution and can create new data samples (P(data) or P(data|condition)). Examples: GANs generating images, GPT generating text. Discriminative models answer "what is this?"; generative models answer "create something like this."
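The contrast can be made concrete with a toy sketch (the 1-D dataset, threshold rule, and all numbers here are invented for illustration):

```python
import torch

torch.manual_seed(0)
# Toy 1-D dataset: two classes centered at -2 and +2
class0 = torch.randn(100) - 2.0
class1 = torch.randn(100) + 2.0

# Discriminative view: learn a decision boundary, answer "what is this?"
boundary = (class0.mean() + class1.mean()) / 2  # roughly 0.0

def predict(x):
    return int(x > boundary)

# Generative view: fit the class-1 distribution, then create new samples
mu, std = class1.mean(), class1.std()
new_sample = mu + std * torch.randn(1)  # "create something like this"

print(predict(torch.tensor(3.0)))  # a point near +3 falls on the class-1 side
print(new_sample.shape)
```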
Question 12
Medium
What is mode collapse in GANs, and why does it happen?
Think about what happens when the Generator finds one output that always fools the Discriminator.
Mode collapse occurs when the Generator produces only a small subset of possible outputs instead of the full diversity of the training data. For example, a GAN trained on all 10 digits might only generate '1' and '7'. This happens because the Generator finds a few outputs that consistently fool the Discriminator, so it has no incentive to produce diverse outputs. The Discriminator then adapts to detect those specific outputs, and the Generator switches to a different small set, creating an oscillating cycle rather than convergence.
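The symptom can be illustrated by counting class coverage among generated samples (the labels here are simulated for the sketch, not produced by a real GAN):

```python
import torch

torch.manual_seed(0)
# Simulated class labels of 1000 generated digit images
healthy = torch.randint(0, 10, (1000,))                          # diverse generator
collapsed = torch.tensor([1, 7])[torch.randint(0, 2, (1000,))]   # stuck on '1' and '7'

print(f"Healthy coverage: {healthy.unique().numel()}/10 classes")
print(f"Collapsed coverage: {collapsed.unique().numel()}/10 classes")
```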
Question 13
Hard
Why do diffusion models work in latent space (Stable Diffusion) instead of directly in pixel space? What is the role of the VAE in this architecture?
Think about computational cost and the relationship between pixel space and latent space dimensions.
Pixel-space diffusion operates on full-resolution images (e.g., 512x512x3 = 786,432 dimensions), making each denoising step extremely expensive computationally. Latent diffusion first uses a pre-trained VAE encoder to compress images to a much smaller latent space (64x64x4 = 16,384 dimensions -- 48x smaller). The diffusion process (adding and removing noise) happens entirely in this latent space. After denoising, the VAE decoder converts the clean latent back to a full-resolution image. This makes Stable Diffusion fast enough to run on consumer GPUs while maintaining high image quality.
Question 14
Hard
Explain the reparameterization trick in VAEs. Why is it necessary, and how does it work?
Think about whether gradient descent can backpropagate through a random sampling operation.
The reparameterization trick solves a fundamental problem: backpropagation cannot compute gradients through a random sampling operation (z ~ N(mu, sigma)). The trick separates the stochastic part from the parameters:
z = mu + sigma * epsilon, where epsilon ~ N(0,1) is a fixed random sample treated as a constant. Now z is a deterministic function of mu and sigma (which we want to optimize) plus a fixed noise source. Gradients can flow through the multiplication and addition to update mu and sigma. Without this trick, the encoder's parameters could not be updated via gradient descent.

Question 15
Easy
What is the output?
gan_apps = ["Face generation", "Style transfer", "Super-resolution", "Data augmentation", "Image inpainting"]
print(f"GAN Applications ({len(gan_apps)}):")
for app in gan_apps:
    print(f" - {app}")

Simple iteration over 5 GAN application areas.

GAN Applications (5):
 - Face generation
 - Style transfer
 - Super-resolution
 - Data augmentation
 - Image inpainting

Question 16
Medium
What is the output?
import torch
import torch.nn as nn
activations = {
    "Tanh": nn.Tanh(),
    "Sigmoid": nn.Sigmoid(),
    "LeakyReLU(0.2)": nn.LeakyReLU(0.2)
}
x = torch.tensor([-2.0, 0.0, 2.0])
for name, act in activations.items():
    out = act(x)
    print(f"{name:15s}: [{out[0]:.3f}, {out[1]:.3f}, {out[2]:.3f}]")

Tanh output is in [-1,1], Sigmoid is in [0,1], LeakyReLU allows negative values.

Tanh           : [-0.964, 0.000, 0.964]
Sigmoid        : [0.119, 0.500, 0.881]
LeakyReLU(0.2) : [-0.400, 0.000, 2.000]

Question 17
Medium
What is the difference between conditional and unconditional image generation?
Think about whether the model receives guidance about what to generate.
Unconditional: Generate random images from the learned distribution with no control over content. Example: a GAN producing random faces. Conditional: Generate images based on some input condition -- a text prompt, class label, or another image. Example: Stable Diffusion generating from a text description. Conditioning gives control over the output and is essential for practical applications. Most modern models (Stable Diffusion, DALL-E) are conditional.
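One common conditioning mechanism (a hedged sketch; the dimensions and the concatenation approach are illustrative, not from the notes) is to append a label embedding to the generator's noise vector:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim, n_classes, embed_dim = 64, 10, 16
label_embed = nn.Embedding(n_classes, embed_dim)

z = torch.randn(4, noise_dim)
labels = torch.tensor([0, 3, 3, 7])  # desired classes for each sample

unconditional_input = z                                         # noise only
conditional_input = torch.cat([z, label_embed(labels)], dim=1)  # noise + condition

print(unconditional_input.shape)  # torch.Size([4, 64])
print(conditional_input.shape)    # torch.Size([4, 80])
```

Text conditioning in Stable Diffusion works on the same principle, except the condition is a CLIP embedding injected via cross-attention rather than simple concatenation.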
Question 18
Easy
What is the output?
ethical_concerns = ["Deepfakes", "Copyright", "Consent", "Misinformation"]
for concern in ethical_concerns:
    print(f" Concern: {concern}")
print(f"Total: {len(ethical_concerns)}")

4 ethical concerns related to generative AI.

 Concern: Deepfakes
 Concern: Copyright
 Concern: Consent
 Concern: Misinformation
Total: 4

Question 19
Medium
What is the output?
import torch
def noise_schedule(T, schedule_type="linear"):
    if schedule_type == "linear":
        return torch.linspace(1e-4, 0.02, T)
    elif schedule_type == "cosine":
        steps = torch.arange(T + 1) / T
        return 1 - torch.cos(steps * 3.14159 / 2)
betas = noise_schedule(1000, "linear")
print(f"Schedule length: {len(betas)}")
print(f"Beta start: {betas[0]:.6f}")
print(f"Beta end: {betas[-1]:.6f}")
print(f"Beta range: [{betas.min():.6f}, {betas.max():.6f}]")

Linear schedule goes from 1e-4 to 0.02 over 1000 steps.

Schedule length: 1000
Beta start: 0.000100
Beta end: 0.020000
Beta range: [0.000100, 0.020000]

Question 20
Hard
What is the output?
import torch
import torch.nn as nn
class SimpleUNet(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1)
        )
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
model = SimpleUNet()
params = sum(p.numel() for p in model.parameters())
x = torch.randn(1, 1, 8, 8)
out = model(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {out.shape}")
print(f"Parameters: {params:,}")
print(f"Same shape: {x.shape == out.shape}")

With padding=1 and 3x3 kernels, spatial dimensions are preserved. Each Conv2d has in_channels * out_channels * 9 weights plus out_channels biases.

Input shape: torch.Size([1, 1, 8, 8])
Output shape: torch.Size([1, 1, 8, 8])
Parameters: 37,569 (320 + 18,496 + 18,464 + 289 across the four conv layers)
Same shape: True

Question 21
Hard
Why have diffusion models largely replaced GANs for text-to-image generation despite being slower at inference?
Consider training stability, output diversity, and conditioning capabilities.
Diffusion models replaced GANs for several reasons: (1) Training stability: Diffusion models use a simple MSE loss on noise prediction, while GANs require delicate adversarial balance and are prone to mode collapse. (2) Output diversity: GANs may miss modes (produce limited variety), while diffusion models naturally cover the full distribution. (3) Conditioning: Text conditioning integrates naturally with the iterative denoising process via cross-attention. (4) Latent diffusion made inference fast enough for practical use. The trade-off is speed: GANs need one forward pass, diffusion models need 20-100 steps. This is being addressed by distillation and faster schedulers.
Mixed & Application Questions
Question 1
Easy
What is the output?
models = {
"GAN": 2014,
"VAE": 2013,
"Diffusion": 2020,
"Stable Diffusion": 2022
}
for name, year in models.items():
    print(f"{name}: {year}")

Simple dictionary iteration showing when each model type was introduced.

GAN: 2014
VAE: 2013
Diffusion: 2020
Stable Diffusion: 2022

Question 2
Easy
What is the output?
guidance_scales = [1.0, 7.5, 15.0, 30.0]
for gs in guidance_scales:
    if gs < 3:
        quality = "ignores prompt"
    elif gs <= 12:
        quality = "balanced"
    elif gs <= 20:
        quality = "strong adherence"
    else:
        quality = "oversaturated"
    print(f"Scale {gs:5.1f}: {quality}")

Guidance scale controls prompt adherence. Too low ignores the prompt, too high oversaturates.

Scale   1.0: ignores prompt
Scale   7.5: balanced
Scale  15.0: strong adherence
Scale  30.0: oversaturated

Question 3
Medium
What is the output?
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 784),
    nn.Tanh()
)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
# Break down
for i, layer in enumerate(model):
    if hasattr(layer, 'weight'):
        w = layer.weight.numel()
        b = layer.bias.numel()
        print(f"Layer {i}: weight={w:,}, bias={b:,}")

Linear(100,256) has 100*256 weights + 256 biases. Linear(256,784) has 256*784 weights + 784 biases. Total: 25,600 + 256 + 200,704 + 784.

Total parameters: 227,344
Layer 0: weight=25,600, bias=256
Layer 2: weight=200,704, bias=784

Question 4
Medium
What is the output?
def diffusion_steps_quality(steps):
    if steps < 10:
        return "very poor"
    elif steps < 25:
        return "rough preview"
    elif steps < 50:
        return "good quality"
    elif steps < 100:
        return "high quality"
    else:
        return "diminishing returns"
for s in [5, 20, 30, 50, 100, 200]:
    print(f"{s:3d} steps: {diffusion_steps_quality(s)}")

More denoising steps generally means better quality, but with diminishing returns.

  5 steps: very poor
 20 steps: rough preview
 30 steps: good quality
 50 steps: high quality
100 steps: diminishing returns
200 steps: diminishing returns

Question 5
Medium
What is the output?
import torch
# Simulate KL divergence for different distributions
def kl_divergence(mu, log_var):
    """KL(N(mu, var) || N(0, 1))"""
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return kl.item()
# Case 1: Already N(0,1)
kl1 = kl_divergence(torch.zeros(10), torch.zeros(10))
print(f"N(0,1) vs N(0,1): KL = {kl1:.2f}")
# Case 2: Shifted mean
kl2 = kl_divergence(torch.ones(10) * 2, torch.zeros(10))
print(f"N(2,1) vs N(0,1): KL = {kl2:.2f}")
# Case 3: Different variance
kl3 = kl_divergence(torch.zeros(10), torch.ones(10) * 2)
print(f"N(0,e^2) vs N(0,1): KL = {kl3:.2f}")

KL divergence is 0 when the distributions match and increases as they diverge. For case 3: -0.5 * 10 * (1 + 2 - 0 - e^2) = 5 * (e^2 - 3) ≈ 21.95.

N(0,1) vs N(0,1): KL = 0.00
N(2,1) vs N(0,1): KL = 20.00
N(0,e^2) vs N(0,1): KL = 21.95

Question 6
Hard
What is the output?
import torch
import torch.nn as nn
def count_gan_params(noise_dim, hidden_sizes, output_dim):
    g_params = 0
    prev = noise_dim
    for h in hidden_sizes:
        g_params += prev * h + h  # Linear layer: weight + bias
        prev = h
    g_params += prev * output_dim + output_dim
    d_params = 0
    prev = output_dim
    for h in reversed(hidden_sizes):
        d_params += prev * h + h
        prev = h
    d_params += prev * 1 + 1
    return g_params, d_params
g, d = count_gan_params(100, [256, 512], 784)
print(f"Generator: {g:,} params")
print(f"Discriminator: {d:,} params")
print(f"Total GAN: {g+d:,} params")
print(f"G > D: {g > d}")

Calculate each linear layer's params: in_dim * out_dim + out_dim (bias). Generator: 25,856 + 131,584 + 402,192. Discriminator: 401,920 + 131,328 + 257.

Generator: 559,632 params
Discriminator: 533,505 params
Total GAN: 1,093,137 params
G > D: True

Question 7
Medium
Deepa wants to generate synthetic training images for rare medical conditions where real patient data is scarce. Which generative model family would be most appropriate and why?
Consider which model produces the highest quality images and can be conditioned on specific attributes.
Deepa should use a diffusion model (like Stable Diffusion fine-tuned on medical images) or a conditional GAN trained on available medical images. Diffusion models are preferred for their higher image quality, better training stability, and superior mode coverage (they generate more diverse samples, important for rare conditions). She could fine-tune a pre-trained diffusion model on available medical images, conditioning on the specific condition type. Important considerations: she must validate that synthetic images are medically accurate, get approval from medical ethics boards, and ensure the synthetic data does not perpetuate biases in the training set.
Question 8
Hard
Compare GANs, VAEs, and Diffusion Models in terms of training stability, output quality, and mode coverage. When would Arjun choose each?
Think about the trade-offs: sharpness, diversity, ease of training.
GANs: Best output sharpness but unstable training (mode collapse, oscillation). Good for: style transfer, super-resolution, when sharpness matters most. VAEs: Stable training, smooth latent space (great for interpolation), but outputs tend to be blurry. Good for: latent space manipulation, anomaly detection, when you need a structured latent space. Diffusion Models: Best quality AND diversity, stable training, but slower inference (many denoising steps). Good for: high-quality image generation, text-to-image, when quality and diversity both matter. Arjun should choose VAEs for learning latent representations, GANs for real-time style transfer, and diffusion models for text-to-image generation.
Question 9
Hard
What ethical concerns arise from generative AI's ability to create deepfakes, and what technical approaches exist to detect them?
Think about both the harms and the detection techniques.
Key concerns: Misinformation (fake videos of politicians, fabricated news images), fraud (impersonating someone's voice or face), harassment (non-consensual fake intimate images), trust erosion (people stop trusting authentic content). Detection approaches: Spectral analysis (GAN-generated images have consistent frequency artifacts), biological signal detection (fake faces may lack natural blinking patterns or pulse signals), trained classifiers (models trained to distinguish real from generated images), watermarking (embedding invisible signatures in generated content, like C2PA and Google SynthID), and provenance tracking (blockchain-based content authentication). The arms race between generation and detection continues to escalate.
Question 10
Easy
What is the output?
diffusion_steps = {"Preview": 20, "Good": 50, "Best": 100}
for quality, steps in diffusion_steps.items():
    print(f"{quality:8s} quality: {steps} steps")

More denoising steps = higher quality, with diminishing returns.

Preview  quality: 20 steps
Good     quality: 50 steps
Best     quality: 100 steps

Question 11
Medium
What is the output?
def prompt_quality(prompt):
    elements = []
    if any(w in prompt.lower() for w in ["photo", "painting", "render"]):
        elements.append("style")
    if any(w in prompt.lower() for w in ["detailed", "8k", "hd"]):
        elements.append("quality")
    if any(w in prompt.lower() for w in ["lighting", "sunset", "golden"]):
        elements.append("lighting")
    return elements
prompts = [
    "A cat",
    "A cat, oil painting, highly detailed",
    "A cat in golden sunset, photo, 8K, dramatic lighting"
]
for p in prompts:
    elements = prompt_quality(p)
    print(f"Elements: {len(elements)} | {p[:50]}")

Better prompts include style, quality, and lighting specifications. Note that p[:50] truncates the third prompt (52 characters) to its first 50 characters.

Elements: 0 | A cat
Elements: 2 | A cat, oil painting, highly detailed
Elements: 3 | A cat in golden sunset, photo, 8K, dramatic lighti

Question 12
Medium
What is classifier-free guidance (CFG) in Stable Diffusion, and why is it important?
Think about how the model balances following the prompt vs generating natural-looking images.
Classifier-free guidance generates two predictions at each denoising step: one conditioned on the text prompt (conditional) and one without the prompt (unconditional). The final prediction amplifies the difference:
output = unconditional + scale * (conditional - unconditional). A guidance scale of 1.0 uses only the conditional prediction. Higher scales (7-12) amplify the text influence, making images more prompt-faithful. Too-high scales (20+) oversaturate the conditioning, producing artifacts. CFG is important because it gives users a single knob to trade prompt adherence against natural-looking output.

Question 13
Hard
What is the difference between latent diffusion and pixel-space diffusion? Why was latent diffusion a breakthrough?
Think about dimensionality and computational cost.
Pixel-space diffusion operates directly on image pixels (e.g., 512x512x3 = 786,432 dimensions). Each denoising step requires processing this full resolution, making it extremely expensive computationally. Only organizations with massive GPU clusters could train and run these models. Latent diffusion (Rombach et al., 2022) first compresses images to a smaller latent space using a pre-trained VAE (64x64x4 = 16,384 dimensions, 48x smaller), performs diffusion in this compressed space, then decodes back to pixels. This was a breakthrough because it made high-quality image generation accessible: Stable Diffusion can run on a consumer GPU with 8GB VRAM. The quality loss from compression is minimal because the VAE's latent space preserves perceptual content.
Multiple Choice Questions
MCQ 1
What are the two components of a GAN?
Answer: B
B is correct. A GAN consists of a Generator (creates fake data) and a Discriminator (classifies data as real or fake). They compete in an adversarial training process.
MCQ 2
What activation function does a GAN Generator typically use in its output layer?
Answer: C
C is correct. The Generator uses Tanh to output values in [-1, 1], matching the normalized range of training images. The Discriminator uses Sigmoid to output a probability in [0, 1].
MCQ 3
What does the forward process in a diffusion model do?
Answer: B
B is correct. The forward (diffusion) process adds Gaussian noise progressively over T timesteps, gradually destroying the data until it becomes pure random noise. The reverse process then learns to undo this.
MCQ 4
What does the VAE loss function consist of?
Answer: B
B is correct. VAE loss = reconstruction loss (how well the output matches the input) + KL divergence (regularizes the latent distribution toward N(0,1)). Both terms are necessary for a well-functioning VAE.
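The two loss terms can be computed directly; in this sketch the tensors are random stand-ins for a real encoder/decoder's outputs, so only the structure of the computation is meaningful:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(4, 784)         # inputs scaled to [0, 1]
x_recon = torch.rand(4, 784)   # stand-in for decoder output
mu = torch.randn(4, 20)        # stand-in encoder means
log_var = torch.randn(4, 20)   # stand-in encoder log-variances

# Term 1: how well the output matches the input
recon_loss = F.binary_cross_entropy(x_recon, x, reduction="sum")
# Term 2: regularize the latent distribution toward N(0, 1)
kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
total = recon_loss + kl_loss

print(kl_loss.item() >= 0)  # KL against N(0,1) is never negative
```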
MCQ 5
What is mode collapse in GANs?
Answer: B
B is correct. Mode collapse occurs when the Generator finds a few outputs that fool the Discriminator and only produces those, ignoring the full diversity of the training data.
MCQ 6
What is the reparameterization trick in VAEs?
Answer: B
B is correct. The reparameterization trick writes z = mu + sigma * epsilon (where epsilon ~ N(0,1)). This separates the stochastic part (epsilon) from the learnable parameters (mu, sigma), allowing gradients to flow through the sampling step during backpropagation.
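A minimal sketch showing that gradients actually reach mu and log_var after reparameterization (the shapes and the loss are arbitrary):

```python
import torch

mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)                      # fixed noise, treated as a constant
z = mu + torch.exp(0.5 * log_var) * eps   # deterministic in mu and log_var

z.sum().backward()
print(mu.grad)                   # tensor([1., 1., 1.])
print(log_var.grad is not None)  # True
```

Sampling z directly with torch.normal(mu, std) would leave both .grad attributes empty, which is exactly the problem the trick solves.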
MCQ 7
Why does Stable Diffusion operate in latent space instead of pixel space?
Answer: B
B is correct. Latent space (64x64x4 = 16,384 values) is 48x smaller than pixel space (512x512x3 = 786,432 values). This makes each denoising step much faster and allows Stable Diffusion to run on consumer GPUs.
MCQ 8
What is the role of CLIP in Stable Diffusion?
Answer: B
B is correct. CLIP (Contrastive Language-Image Pre-training) text encoder converts the text prompt into a numerical embedding. This embedding conditions the U-Net denoiser, guiding it to generate an image that matches the text description.
MCQ 9
What does guidance_scale (classifier-free guidance) control in Stable Diffusion?
Answer: B
B is correct. Guidance scale controls how much the generation is influenced by the text prompt. Higher values (7-12) produce more prompt-faithful images. Too high (20+) causes oversaturation and artifacts. Too low (1) produces images that largely ignore the prompt.
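The combination step itself is one line; in this sketch the two noise predictions are random stand-ins for the U-Net's conditional and unconditional outputs:

```python
import torch

torch.manual_seed(0)
uncond = torch.randn(1, 4)  # stand-in: prediction without the prompt
cond = torch.randn(1, 4)    # stand-in: prediction with the prompt

def cfg(uncond, cond, scale):
    # Amplify the direction the prompt pushes the prediction in
    return uncond + scale * (cond - uncond)

print(torch.allclose(cfg(uncond, cond, 1.0), cond))    # scale 1.0 = conditional only
print(torch.allclose(cfg(uncond, cond, 0.0), uncond))  # scale 0.0 = prompt ignored
```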
MCQ 10
What neural network architecture is typically used in the reverse process of diffusion models?
Answer: B
B is correct. The U-Net architecture (encoder-decoder with skip connections) is used for the denoising step. Skip connections help preserve fine-grained spatial details while the encoder-decoder structure captures both local and global patterns.
MCQ 11
Which GAN variant addresses the training instability caused by the Jensen-Shannon divergence in standard GANs?
Answer: B
B is correct. WGAN replaces the JS divergence with the Wasserstein distance (Earth Mover's distance), which provides smoother gradients even when the real and generated distributions do not overlap. This significantly stabilizes GAN training and reduces mode collapse.
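A sketch of the Wasserstein objectives (the critic scores here are random stand-ins; a real WGAN also needs a Lipschitz constraint on the critic, e.g. weight clipping or gradient penalty):

```python
import torch

torch.manual_seed(0)
d_real = torch.randn(8)  # critic scores on real samples (unbounded, no sigmoid)
d_fake = torch.randn(8)  # critic scores on generated samples

# Critic maximizes E[D(real)] - E[D(fake)]; gradient descent minimizes the negation
critic_loss = -(d_real.mean() - d_fake.mean())
# Generator tries to raise the critic's scores on its fakes
gen_loss = -d_fake.mean()

print(f"critic loss: {critic_loss.item():.3f}")
print(f"generator loss: {gen_loss.item():.3f}")
```

Because the critic outputs unbounded scores rather than probabilities, its gradients stay informative even when real and fake distributions barely overlap.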
MCQ 12
What is the key difference between DALL-E 2 and Stable Diffusion?
Answer: C
C is correct. Stable Diffusion (Stability AI) is open-source and uses latent diffusion (VAE-compressed space). DALL-E 2 (OpenAI) is closed-source and uses CLIP embeddings with a prior model + diffusion decoder. Stable Diffusion's open-source nature has driven widespread adoption and community innovation.
MCQ 13
Why does the Discriminator in a GAN use LeakyReLU instead of ReLU?
Answer: B
B is correct. LeakyReLU (e.g., with slope 0.2 for negative values) prevents the 'dying ReLU' problem where neurons permanently output 0 for all inputs. In the Discriminator, dead neurons would stop gradient flow and halt training. LeakyReLU ensures gradients always flow.
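This can be verified directly: for a negative input, ReLU passes zero gradient while LeakyReLU(0.2) passes 0.2.

```python
import torch
import torch.nn as nn

# Gradient through ReLU at a negative input: blocked entirely
x_relu = torch.tensor([-2.0], requires_grad=True)
nn.ReLU()(x_relu).backward()

# Gradient through LeakyReLU(0.2) at the same input: scaled by the slope
x_leaky = torch.tensor([-2.0], requires_grad=True)
nn.LeakyReLU(0.2)(x_leaky).backward()

print(x_relu.grad.item())   # 0.0
print(x_leaky.grad.item())  # 0.2
```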
MCQ 14
What is the purpose of negative prompts in Stable Diffusion?
Answer: B
B is correct. Negative prompts specify what the model should avoid in the generated image (e.g., 'blurry', 'low quality', 'watermark'). During classifier-free guidance, the model steers the generation away from the negative prompt embedding, improving output quality.
MCQ 15
In a VAE, what does KL divergence measure?
Answer: B
B is correct. KL divergence (Kullback-Leibler) measures how much the encoder's learned distribution Q(z|x) diverges from the standard normal prior P(z) = N(0,1). Minimizing KL divergence regularizes the latent space, making it smooth, continuous, and suitable for sampling new data.
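The closed-form KL per dimension, 0.5 * (mu^2 + var - 1 - ln var), can be evaluated directly (a quick sketch; 10 dimensions assumed to match the examples earlier in this page):

```python
import math

def kl_analytic(mu, var, dims=10):
    # KL(N(mu, var) || N(0, 1)) summed over independent dimensions
    return 0.5 * (mu**2 + var - 1 - math.log(var)) * dims

print(kl_analytic(0.0, 1.0))                  # 0.0  -> matched distributions
print(kl_analytic(2.0, 1.0))                  # 20.0 -> shifted mean
print(round(kl_analytic(0.0, math.e**2), 2))  # 21.95 -> inflated variance
```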
MCQ 16
What does the Generator in a GAN take as input?
Answer: B
B is correct. The Generator takes random noise (typically from a Gaussian distribution) as input and transforms it into fake data that mimics the training distribution. Different noise vectors produce different generated outputs.
Coding Challenges
Coding challenges coming soon.