Chapter 21 Advanced: 50 Questions

Practice Questions — Large Language Models: GPT, BERT, and Beyond

12 Easy
12 Medium
10 Hard

Topic-Specific Questions

Question 1
Easy
What is the output of the following code?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Hello world")
print(len(tokens))
print(tokens)
GPT-2 uses byte-level BPE. Common words may remain as single tokens, and a leading space is folded into the following token, shown with the 'Ġ' marker.
2
['Hello', 'Ġworld']
Question 2
Easy
What is the output?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer.encode("AI is great")
print(type(ids))
print(tokenizer.decode(ids))
encode() returns a list of integers. decode() converts them back to text.
<class 'list'>
AI is great
Question 3
Easy
What is the output?
temperatures = [0.0, 0.5, 1.0, 2.0]
for t in temperatures:
    if t == 0.0:
        print(f"Temp {t}: greedy (deterministic)")
    elif t < 1.0:
        print(f"Temp {t}: sharper distribution")
    elif t == 1.0:
        print(f"Temp {t}: original distribution")
    else:
        print(f"Temp {t}: flatter distribution")
Temperature 0 is greedy, less than 1 sharpens, equal to 1 keeps original, greater than 1 flattens.
Temp 0.0: greedy (deterministic)
Temp 0.5: sharper distribution
Temp 1.0: original distribution
Temp 2.0: flatter distribution
Question 4
Easy
What is the output?
models = {
    "BERT": "encoder-only",
    "GPT": "decoder-only",
    "T5": "encoder-decoder"
}
for name, arch in models.items():
    print(f"{name}: {arch}")
This iterates over a dictionary of model names and their architecture types.
BERT: encoder-only
GPT: decoder-only
T5: encoder-decoder
Question 5
Easy
What is the output?
gpt_versions = [
    ("GPT-1", "117M"),
    ("GPT-2", "1.5B"),
    ("GPT-3", "175B"),
    ("GPT-4", "Multimodal")
]
for name, info in gpt_versions:
    print(f"{name}: {info}")
Simple iteration over a list of tuples showing GPT evolution.
GPT-1: 117M
GPT-2: 1.5B
GPT-3: 175B
GPT-4: Multimodal
Question 6
Medium
What is the output?
import math

def softmax_with_temperature(logits, temperature):
    if temperature == 0:
        result = [0.0] * len(logits)
        result[logits.index(max(logits))] = 1.0
        return result
    scaled = [x / temperature for x in logits]
    exp_scaled = [math.exp(x) for x in scaled]
    total = sum(exp_scaled)
    return [round(x / total, 3) for x in exp_scaled]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 0))
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.5))
Temperature 0 gives all probability to the max logit. Lower temperature sharpens the distribution.
[1.0, 0.0, 0.0]
[0.629, 0.231, 0.14]
[0.844, 0.114, 0.042]
Question 7
Medium
What is the output?
def top_k_filter(probs, k):
    sorted_probs = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)
    top_k = sorted_probs[:k]
    total = sum(p for _, p in top_k)
    return [(idx, round(p/total, 3)) for idx, p in top_k]

probs = [0.4, 0.3, 0.15, 0.1, 0.05]
result = top_k_filter(probs, 3)
for idx, p in result:
    print(f"Token {idx}: {p}")
Top-k keeps only the k highest probability tokens and renormalizes.
Token 0: 0.471
Token 1: 0.353
Token 2: 0.176
Question 8
Medium
What is the output?
def top_p_filter(probs, p):
    sorted_probs = sorted(enumerate(probs), key=lambda x: x[1], reverse=True)
    cumulative = 0
    selected = []
    for idx, prob in sorted_probs:
        cumulative += prob
        selected.append((idx, prob))
        if cumulative >= p:
            break
    return len(selected)

probs = [0.4, 0.3, 0.15, 0.1, 0.05]
print(top_p_filter(probs, 0.5))
print(top_p_filter(probs, 0.7))
print(top_p_filter(probs, 0.9))
Top-p accumulates probabilities from highest to lowest until reaching p.
2
2
4
Question 9
Medium
What is the output?
def count_parameters(total_params, lora_rank, num_layers, hidden_dim):
    # LoRA adds two matrices per layer: A (hidden_dim x rank) and B (rank x hidden_dim)
    lora_params = num_layers * 2 * hidden_dim * lora_rank
    percentage = (lora_params / total_params) * 100
    return lora_params, round(percentage, 2)

# GPT-2 small: 124M params, 12 layers, hidden_dim=768
lora_p, pct = count_parameters(124_000_000, 8, 12, 768)
print(f"LoRA params: {lora_p:,}")
print(f"Percentage: {pct}%")
LoRA adds A (d x r) and B (r x d) matrices to each layer.
LoRA params: 147,456
Percentage: 0.12%
Question 10
Hard
What is the output?
def simulate_bpe(text, num_merges):
    # Start with character-level tokens
    tokens = list(text.replace(" ", "_ ").strip())
    print(f"Initial: {tokens}")
    
    for i in range(num_merges):
        # Count adjacent pairs
        pairs = {}
        for j in range(len(tokens) - 1):
            pair = (tokens[j], tokens[j+1])
            pairs[pair] = pairs.get(pair, 0) + 1
        
        if not pairs:
            break
        
        # Find most frequent pair
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        
        # Merge all occurrences
        new_tokens = []
        j = 0
        while j < len(tokens):
            if j < len(tokens) - 1 and (tokens[j], tokens[j+1]) == best:
                new_tokens.append(merged)
                j += 2
            else:
                new_tokens.append(tokens[j])
                j += 1
        tokens = new_tokens
        print(f"Merge {i+1}: {best} -> '{merged}' | {tokens}")
    
    return tokens

result = simulate_bpe("ab ab ab cd", 2)
print(f"Final: {result}")
BPE iteratively merges the most frequent adjacent pair. Note that replace(" ", "_ ") inserts an underscore but keeps the space, so both '_' and ' ' appear as tokens.
Initial: ['a', 'b', '_', ' ', 'a', 'b', '_', ' ', 'a', 'b', '_', ' ', 'c', 'd']
Merge 1: ('a', 'b') -> 'ab' | ['ab', '_', ' ', 'ab', '_', ' ', 'ab', '_', ' ', 'c', 'd']
Merge 2: ('ab', '_') -> 'ab_' | ['ab_', ' ', 'ab_', ' ', 'ab_', ' ', 'c', 'd']
Final: ['ab_', ' ', 'ab_', ' ', 'ab_', ' ', 'c', 'd']
Three pairs tie at count 3 in the first round; max() returns the first one encountered in insertion order, which is ('a', 'b').
Question 11
Hard
What is the output?
def few_shot_prompt(examples, query):
    prompt = "Classify the sentiment as Positive or Negative.\n\n"
    for text, label in examples:
        prompt += f"Text: {text}\nSentiment: {label}\n\n"
    prompt += f"Text: {query}\nSentiment:"
    return prompt

examples = [
    ("I love this product", "Positive"),
    ("Terrible experience", "Negative"),
    ("Absolutely wonderful", "Positive")
]

prompt = few_shot_prompt(examples, "Not worth the money")
lines = prompt.strip().split("\n")
print(f"Total lines: {len(lines)}")
print(f"Num examples: {len(examples)}")
print(f"Last line: {lines[-1]}")
Count the lines: the instruction line plus a blank line (2), each of the 3 examples contributes 2 text lines plus a blank line (9), and the query adds 2 more, for 13 in total.
Total lines: 13
Num examples: 3
Last line: Sentiment:
Question 12
Easy
What is the fundamental task that LLMs are trained to perform?
Think about what the model predicts at each step during training.
LLMs are trained to perform next-token prediction. Given a sequence of tokens, the model learns to predict the probability distribution over the vocabulary for the next token. During training, the model sees billions of text sequences and adjusts its parameters to minimize the prediction error (cross-entropy loss). At inference time, text is generated by repeatedly predicting and appending the next token.
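The objective can be sketched with a toy bigram model standing in for the network; the corpus and counts below are purely illustrative:

```python
import math

# Toy corpus; a real LLM trains on billions of tokens.
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams to get P(next | current) -- a stand-in for the network's output.
counts = {}
for cur, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(cur, {})
    counts[cur][nxt] = counts[cur].get(nxt, 0) + 1

def next_token_probs(token):
    following = counts[token]
    total = sum(following.values())
    return {t: c / total for t, c in following.items()}

# Training loss is the average negative log-probability of each true next token.
pairs = list(zip(corpus, corpus[1:]))
loss = sum(-math.log(next_token_probs(cur)[nxt]) for cur, nxt in pairs)
print(f"Cross-entropy loss: {loss / len(pairs):.3f}")
```

Generation then works by repeatedly sampling from next_token_probs and appending the result, exactly as described above.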
Question 13
Medium
What is the difference between fine-tuning and prompt engineering? When would Kavitha choose one over the other?
Think about whether model weights change, and what resources each approach needs.
Fine-tuning updates the model's weights by training on task-specific labeled data. It requires a dataset, GPU compute, and ML expertise, but produces a specialized model with high accuracy. Prompt engineering crafts the input text to guide the model's behavior without changing any weights. It requires no training data or compute beyond inference. Kavitha should choose fine-tuning when she has labeled data and needs consistently high accuracy on a specific task (e.g., classifying legal documents). She should choose prompt engineering for quick prototyping, general-purpose tasks, or when labeled data is unavailable.
Question 14
Medium
Why does LoRA freeze the base model weights and add small matrices instead of fine-tuning all parameters?
Think about memory, compute, and the number of parameters being updated.
Full fine-tuning of a model with billions of parameters requires enormous GPU memory (to store model weights, gradients, and optimizer states) and risks catastrophic forgetting. LoRA freezes the base weights and adds small trainable low-rank matrices (rank r, typically 4-16) to specific layers. This reduces trainable parameters from billions to millions (often 0.1-1% of total), drastically lowering memory requirements. The low-rank hypothesis suggests that the weight updates during fine-tuning have a low intrinsic rank, so a low-rank approximation captures most of the adaptation needed.
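A minimal numpy sketch of the idea, with assumed dimensions (d=768, r=8) and the conventional zero initialization for B, which makes the update a no-op at the start of training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 8, 16                 # hidden size, LoRA rank, scaling alpha

W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection, zero-init

# Effective weight in the forward pass: frozen W plus scaled low-rank update.
W_eff = W + (alpha / r) * (A @ B)

trainable = A.size + B.size
print(f"Trainable: {trainable:,} vs frozen: {W.size:,} "
      f"({100 * trainable / W.size:.2f}%)")
```

Because B starts at zero, W_eff equals W initially; only A and B receive gradients during fine-tuning.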
Question 15
Hard
Explain how BERT's Masked Language Modeling differs from GPT's next-token prediction, and why this difference makes BERT better for understanding tasks and GPT better for generation.
Consider what context each model can see when making predictions.
BERT's MLM randomly masks 15% of tokens and predicts them using both left and right context (bidirectional). This means BERT learns deep representations where each token's embedding captures information from the entire sequence. GPT's next-token prediction uses only left context (tokens that came before). Each token can only attend to previous tokens. BERT is better for understanding because bidirectional context gives richer representations for classification, NER, and QA. GPT is better for generation because its left-to-right architecture naturally supports sequential text generation -- it can generate the next token without needing future tokens.
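The difference in visible context comes down to the attention mask; a small sketch (1 = may attend, 0 = blocked):

```python
import numpy as np

seq_len = 4

# GPT: causal mask -- token i may attend only to positions <= i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=int))

# BERT: full mask -- every token attends to every position.
bidirectional = np.ones((seq_len, seq_len), dtype=int)

print("Causal (GPT):")
print(causal)
print("Bidirectional (BERT):")
print(bidirectional)
```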
Question 16
Hard
What is the output?
class SimpleTokenizer:
    def __init__(self):
        self.vocab = {}
        self.next_id = 0
    
    def add_token(self, token):
        if token not in self.vocab:
            self.vocab[token] = self.next_id
            self.next_id += 1
    
    def encode(self, text):
        tokens = text.lower().split()
        return [self.vocab.get(t, -1) for t in tokens]
    
    def decode(self, ids):
        reverse = {v: k for k, v in self.vocab.items()}
        return " ".join(reverse.get(i, "[UNK]") for i in ids)

tok = SimpleTokenizer()
for word in ["hello", "world", "machine", "learning"]:
    tok.add_token(word)

print(tok.encode("hello machine learning"))
print(tok.encode("hello unknown world"))
print(tok.decode([0, 3, 2]))
Tokens get IDs in order of addition. Unknown words get -1.
[0, 2, 3]
[0, -1, 1]
hello learning machine
Question 17
Hard
What are scaling laws for LLMs, and what did the Chinchilla paper reveal about how most LLMs were being trained?
Think about the relationship between model size, data size, and compute budget.
Scaling laws (Kaplan et al., 2020) showed that LLM performance (measured as loss) improves as a power law of three factors: model parameters, dataset size, and compute budget. The relationship is predictable -- you can forecast performance before training. The Chinchilla paper (Hoffmann et al., 2022) revealed that most LLMs were significantly undertrained: for a fixed compute budget, it is better to train a moderately-sized model on much more data than to train an extremely large model on less data. Chinchilla (70B parameters trained on 1.4 trillion tokens) outperformed Gopher (280B parameters trained on 300 billion tokens) despite being 4x smaller.
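Chinchilla's finding is often summarized as the rule of thumb "about 20 training tokens per parameter"; a quick sketch under that assumption:

```python
# Chinchilla rule of thumb: roughly 20 training tokens per parameter.
def chinchilla_optimal_tokens(params_billions, tokens_per_param=20):
    return params_billions * tokens_per_param  # in billions of tokens

# Chinchilla's 70B * 20 = 1.4T matches its actual training set;
# Gopher's 280B would have wanted ~5.6T but was trained on only ~0.3T.
for name, params in [("Chinchilla", 70), ("Gopher", 280)]:
    optimal = chinchilla_optimal_tokens(params)
    print(f"{name} ({params}B params): ~{optimal / 1000:.1f}T tokens optimal")
```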
Question 18
Easy
What is the output?
pipeline_tasks = ["text-generation", "sentiment-analysis", "summarization", "ner", "question-answering"]
for task in pipeline_tasks:
    print(f"pipeline('{task}')")
print(f"Total tasks: {len(pipeline_tasks)}")
Hugging Face pipeline supports multiple NLP tasks.
pipeline('text-generation')
pipeline('sentiment-analysis')
pipeline('summarization')
pipeline('ner')
pipeline('question-answering')
Total tasks: 5
Question 19
Easy
What is the difference between zero-shot, one-shot, and few-shot learning in the context of LLMs?
Think about how many examples are provided in the prompt.
Zero-shot: The model performs a task with only a description, no examples. Example prompt: 'Classify the following as positive or negative: I love this.' One-shot: One example is provided. 'Positive: Great product! Now classify: I love this.' Few-shot: Several examples are provided (typically 2-10). The model learns the task pattern from the examples without any weight updates. Few-shot learning was a breakthrough capability of GPT-3 that emerged at 175B parameters.
Question 20
Medium
What is the output?
def compare_approaches(task, data_available, compute_available):
    if data_available > 1000 and compute_available:
        return "fine-tuning"
    elif data_available > 5:
        return "few-shot prompting"
    elif data_available > 0:
        return "one-shot prompting"
    else:
        return "zero-shot prompting"

scenarios = [
    ("Sentiment", 5000, True),
    ("Translation", 8, False),
    ("Summarization", 0, False),
    ("NER", 1, False)
]

for task, data, compute in scenarios:
    approach = compare_approaches(task, data, compute)
    print(f"{task:15s}: {approach}")
More data + compute = fine-tuning. Less data = prompting with available examples.
Sentiment      : fine-tuning
Translation    : few-shot prompting
Summarization  : zero-shot prompting
NER            : one-shot prompting

Mixed & Application Questions

Question 1
Easy
What is the output?
bert_tasks = ["Classification", "NER", "QA", "Similarity"]
gpt_tasks = ["Text Generation", "Completion", "Chat", "Code"]

print(f"BERT tasks: {len(bert_tasks)}")
print(f"GPT tasks: {len(gpt_tasks)}")
print(f"All tasks: {len(bert_tasks + gpt_tasks)}")
len() counts elements. Concatenating two lists adds their lengths.
BERT tasks: 4
GPT tasks: 4
All tasks: 8
Question 2
Easy
What is the output?
approach = {"fine-tuning": "updates weights", "prompting": "no weight change"}
for method, desc in approach.items():
    print(f"{method}: {desc}")
Dictionary iteration produces key-value pairs.
fine-tuning: updates weights
prompting: no weight change
Question 3
Medium
What is the output?
def estimate_memory_gb(params_billions, bytes_per_param):
    return round(params_billions * bytes_per_param, 1)

# FP32 = 4 bytes, FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes
model_size = 7  # 7B parameter model

for precision, bpp in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    mem = estimate_memory_gb(model_size, bpp)
    print(f"{precision}: {mem} GB")
Memory = parameters * bytes per parameter. 7B * 4 bytes = 28 GB for FP32.
FP32: 28 GB
FP16: 14 GB
INT8: 7 GB
INT4: 3.5 GB
Question 4
Medium
What is the output?
def zero_shot_vs_few_shot(num_examples):
    if num_examples == 0:
        return "zero-shot"
    elif num_examples == 1:
        return "one-shot"
    else:
        return "few-shot"

for n in [0, 1, 3, 5]:
    print(f"{n} examples: {zero_shot_vs_few_shot(n)}")
Zero examples = zero-shot, one = one-shot, more = few-shot.
0 examples: zero-shot
1 examples: one-shot
3 examples: few-shot
5 examples: few-shot
Question 5
Medium
What is the output?
class LoRALayer:
    def __init__(self, in_dim, out_dim, rank):
        self.rank = rank
        self.a_shape = (in_dim, rank)
        self.b_shape = (rank, out_dim)
        self.trainable = in_dim * rank + rank * out_dim
        self.frozen = in_dim * out_dim
    
    def __repr__(self):
        pct = round(100 * self.trainable / (self.frozen + self.trainable), 2)
        return f"LoRA(rank={self.rank}, trainable={self.trainable}, frozen={self.frozen}, {pct}%)"

layer = LoRALayer(768, 768, 8)
print(layer)
A matrix has shape (in, rank) and B has shape (rank, out). Total trainable = in*r + r*out.
LoRA(rank=8, trainable=12288, frozen=589824, 2.04%)
Question 6
Hard
What is the output?
import math

def perplexity(loss):
    return round(math.exp(loss), 2)

# Lower loss = lower perplexity = better model
losses = [4.5, 3.2, 2.1, 1.5]
for loss in losses:
    ppl = perplexity(loss)
    print(f"Loss: {loss} -> Perplexity: {ppl}")
Perplexity = e^(loss). Lower perplexity means the model is less surprised by the data.
Loss: 4.5 -> Perplexity: 90.02
Loss: 3.2 -> Perplexity: 24.53
Loss: 2.1 -> Perplexity: 8.17
Loss: 1.5 -> Perplexity: 4.48
Question 7
Medium
Rohan wants to build a chatbot for his college that answers questions about admission procedures. Should he fine-tune a model or use prompt engineering? Explain your reasoning.
Consider the available data, required accuracy, and development speed.
Rohan should start with prompt engineering using a capable LLM (like GPT-4 or Claude). He can include the admission procedures as context in the prompt (RAG pattern -- Retrieval Augmented Generation). This approach is faster, cheaper, and does not require training data or GPUs. If the chatbot needs to handle very specific edge cases or domain-specific terminology that prompting cannot handle reliably, he can then consider fine-tuning a smaller model on a dataset of actual admission Q&A pairs. The practical approach: prompt engineering first, fine-tuning only if prompting falls short.
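The RAG pattern can be sketched with a toy keyword-overlap retriever; the passages, dates, and retrieval method here are invented for illustration (production systems use embedding similarity):

```python
# Hypothetical admission snippets standing in for a document store.
passages = [
    "Applications open on June 1 and close on July 15.",
    "The admission fee is 500 rupees, payable online.",
    "Hostel rooms are allotted after the first semester.",
]

def retrieve(question, docs):
    # Pick the passage sharing the most words with the question.
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

question = "When do applications close?"
context = retrieve(question, passages)
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The assembled prompt is then sent to the LLM, which answers from the retrieved context instead of its parametric memory.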
Question 8
Hard
What is the difference between BPE and SentencePiece tokenization? Why do multilingual models prefer SentencePiece?
Think about how each handles spaces and different languages.
BPE (Byte-Pair Encoding) requires pre-tokenization -- splitting text by spaces and punctuation before applying subword merges. This assumes space-separated words, which works well for English but poorly for languages like Chinese, Japanese, and Thai that do not use spaces between words. SentencePiece treats the input as a raw byte stream and learns subword tokens directly from the raw text without pre-tokenization. It represents spaces as special characters (like '_'). This makes SentencePiece language-independent -- it works equally well for English, Hindi, Chinese, and any other language.
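A rough sketch of the space-marker idea, character-level only (real SentencePiece also learns subword merges on top of this representation):

```python
# SentencePiece-style space handling: spaces become the '▁' (U+2581) marker,
# so detokenization is a lossless join-and-replace with no language rules.
def sp_encode(text):
    return list(text.replace(" ", "\u2581"))  # character-level for illustration

def sp_decode(tokens):
    return "".join(tokens).replace("\u2581", " ")

tokens = sp_encode("hello world")
print(tokens)
print(sp_decode(tokens) == "hello world")
```

Because no space-based pre-tokenization is assumed, the same encode/decode logic applies unchanged to text with no spaces at all.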
Question 9
Hard
What is the output?
def simulate_autoregressive(vocab, seed_tokens, steps):
    # Simulated next-token prediction (deterministic for demo)
    transitions = {
        "the": "model",
        "model": "predicts",
        "predicts": "the",
        "AI": "is",
        "is": "powerful",
        "powerful": "."
    }
    tokens = list(seed_tokens)
    for _ in range(steps):
        last = tokens[-1]
        next_token = transitions.get(last, "[END]")
        tokens.append(next_token)
    return " ".join(tokens)

print(simulate_autoregressive(None, ["the"], 5))
print(simulate_autoregressive(None, ["AI"], 3))
Each token maps to a fixed next token. Follow the chain for the specified number of steps.
the model predicts the model predicts
AI is powerful .
Question 10
Hard
Why did GPT-3's few-shot learning capability emerge at 175B parameters but not at smaller scales like GPT-2's 1.5B? What does this tell us about LLM capabilities and scale?
Think about emergent abilities and the relationship between scale and new capabilities.
GPT-3 demonstrated emergent abilities -- capabilities that appear suddenly above a certain model size threshold rather than improving gradually. At 175B parameters, GPT-3 could perform tasks from a few examples in the prompt (few-shot learning) without any fine-tuning. GPT-2 at 1.5B could not do this reliably. This tells us that some LLM capabilities are not linear functions of scale -- they are phase transitions that emerge at critical thresholds. Possible explanations: larger models learn more abstract representations, store more factual knowledge, and develop better in-context learning circuits. The implication is that we cannot always predict what capabilities a larger model will have based on smaller models.
Question 11
Easy
What is the output?
hf_components = ["AutoTokenizer", "AutoModel", "pipeline", "Trainer", "TrainingArguments"]
for comp in hf_components:
    print(f"from transformers import {comp}")
print(f"\nTotal imports: {len(hf_components)}")
5 key components of the Hugging Face transformers library.
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import pipeline
from transformers import Trainer
from transformers import TrainingArguments

Total imports: 5
Question 12
Medium
What is the output?
def token_count_estimate(text, chars_per_token=4):
    """Rough estimate: ~4 characters per token for English text."""
    return len(text) // chars_per_token

texts = [
    "Hello world",
    "Machine learning is a subset of artificial intelligence",
    "The quick brown fox jumps over the lazy dog" * 10
]

for text in texts:
    tokens = token_count_estimate(text)
    print(f"Chars: {len(text):4d} | ~Tokens: {tokens:4d} | Text: {text[:40]}...")
Divide character count by 4 for a rough token estimate.
Chars:   11 | ~Tokens:    2 | Text: Hello world...
Chars:   55 | ~Tokens:   13 | Text: Machine learning is a subset of artifici...
Chars:  430 | ~Tokens:  107 | Text: The quick brown fox jumps over the lazy ...
Question 13
Easy
What is the Hugging Face pipeline function and why is it useful?
Think about how much code you need to write to use a pre-trained model.
The Hugging Face pipeline() function is a high-level API that handles tokenization, model inference, and post-processing in a single call. You only need to specify the task (e.g., 'sentiment-analysis', 'text-generation', 'summarization') and optionally a model name. It automatically downloads the model, tokenizes the input, runs inference, and returns human-readable results. Without pipeline, you would need to manually load the tokenizer, tokenize input, run the model forward pass, and post-process logits. Pipeline reduces this from 10+ lines to 2 lines of code.
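A sketch of the three stages that pipeline() bundles, using hypothetical stub functions in place of a real tokenizer and model (a real run downloads both):

```python
import math

# Stand-ins for what pipeline() wires together; all three are fakes.
def tokenize(text):
    return [hash(w) % 1000 for w in text.lower().split()]

def model_forward(token_ids):
    return [0.1, 2.3]  # fake logits for [NEGATIVE, POSITIVE]

def postprocess(logits):
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    label = "POSITIVE" if probs[1] > probs[0] else "NEGATIVE"
    return {"label": label, "score": round(max(probs), 4)}

# pipeline("sentiment-analysis")(text) performs all three steps in one call:
result = postprocess(model_forward(tokenize("I love this library")))
print(result)
```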
Question 14
Hard
What is the output?
def compute_attention_memory(seq_len, hidden_dim, num_heads, dtype_bytes=2):
    """Estimate memory for self-attention computation."""
    # Attention matrix: seq_len x seq_len per head
    attention_memory = seq_len * seq_len * num_heads * dtype_bytes
    # Q, K, V projections
    qkv_memory = 3 * seq_len * hidden_dim * dtype_bytes
    total_mb = (attention_memory + qkv_memory) / (1024 * 1024)
    return round(total_mb, 2)

for seq_len in [512, 2048, 8192, 32768]:
    mem = compute_attention_memory(seq_len, 4096, 32)
    print(f"Seq length {seq_len:6d}: ~{mem:8.2f} MB attention memory")
Attention memory scales quadratically with sequence length (O(n^2)).
Seq length    512: ~   28.00 MB attention memory
Seq length   2048: ~  304.00 MB attention memory
Seq length   8192: ~ 4288.00 MB attention memory
Seq length  32768: ~66304.00 MB attention memory
The seq_len x seq_len attention term grows quadratically, so it dominates at long contexts while the QKV term grows only linearly.

Multiple Choice Questions

MCQ 1
What is the core training objective of GPT models?
  • A. Masked language modeling
  • B. Next-token prediction
  • C. Image classification
  • D. Sentence pair comparison
Answer: B
B is correct. GPT models are trained with a causal language modeling objective: predict the next token given all previous tokens. This autoregressive approach enables text generation.
MCQ 2
Which model architecture does BERT use?
  • A. Decoder-only transformer
  • B. Encoder-only transformer
  • C. Encoder-decoder transformer
  • D. Recurrent neural network
Answer: B
B is correct. BERT uses the encoder part of the transformer architecture. It processes text bidirectionally, allowing each token to attend to all other tokens in the sequence.
MCQ 3
What does temperature = 0 mean in LLM text generation?
  • A. Maximum randomness in output
  • B. Model stops generating
  • C. Greedy decoding -- always picks the most probable token
  • D. Equal probability for all tokens
Answer: C
C is correct. Temperature 0 makes the probability distribution infinitely sharp, always selecting the token with the highest probability. This produces deterministic, repeatable output.
MCQ 4
What is BPE in the context of LLMs?
  • A. A training algorithm for neural networks
  • B. A subword tokenization method that iteratively merges frequent character pairs
  • C. A method to reduce model size
  • D. A technique for data augmentation
Answer: B
B is correct. Byte-Pair Encoding (BPE) starts with individual characters and repeatedly merges the most frequent adjacent pair to build a subword vocabulary. This balances vocabulary size with the ability to handle rare words.
MCQ 5
Which GPT version first demonstrated few-shot learning capabilities?
  • A. GPT-1
  • B. GPT-2
  • C. GPT-3
  • D. GPT-4
Answer: C
C is correct. GPT-3 (175B parameters) was the first to demonstrate strong few-shot learning -- performing tasks from just a few examples in the prompt without any fine-tuning. This was a breakthrough capability enabled by scale.
MCQ 6
What does the Hugging Face pipeline('sentiment-analysis') return?
  • A. Raw model weights
  • B. Token embeddings
  • C. A label (POSITIVE/NEGATIVE) with a confidence score
  • D. A list of tokens
Answer: C
C is correct. The sentiment analysis pipeline returns a dictionary with a label (POSITIVE or NEGATIVE) and a confidence score (0 to 1). It handles tokenization, inference, and post-processing automatically.
MCQ 7
What is the key advantage of top-p (nucleus) sampling over top-k sampling?
  • A. It is faster to compute
  • B. It adapts the number of candidates based on the probability distribution
  • C. It always produces better text
  • D. It requires no probability computation
Answer: B
B is correct. Top-p dynamically adjusts the number of tokens considered. When the model is confident (one token has high probability), fewer candidates are used. When uncertain (flat distribution), more candidates are included. Top-k always considers exactly k tokens regardless of the distribution shape.
MCQ 8
What pre-training task does BERT NOT use?
  • A. Masked Language Modeling
  • B. Next Sentence Prediction
  • C. Next-token prediction
  • D. Both A and B are used by BERT
Answer: C
C is correct. BERT uses Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Next-token prediction is GPT's training objective. BERT predicts masked tokens using bidirectional context, not the next sequential token.
MCQ 9
What does LoRA stand for, and what is its primary benefit?
  • A. Low-Rank Adaptation -- reduces training parameters by adding small trainable matrices
  • B. Large Output Regularization Algorithm -- prevents overfitting
  • C. Layered Optimization for Rapid Acceleration -- speeds up inference
  • D. Lightweight Optimized Retrieval Architecture -- improves search
Answer: A
A is correct. LoRA (Low-Rank Adaptation) freezes the base model and adds small low-rank matrices to transformer layers. This reduces trainable parameters to 0.1-1% of the total, making fine-tuning feasible on consumer GPUs.
MCQ 10
What is the Chinchilla scaling law's key finding?
  • A. Bigger models are always better
  • B. Model size and training data should scale proportionally for optimal performance
  • C. Training for more epochs is better than more data
  • D. Only model architecture matters, not size
Answer: B
B is correct. The Chinchilla paper showed that for a fixed compute budget, the optimal strategy is to balance model size and training data. Many existing LLMs were undertrained -- a smaller model trained on more data can outperform a larger model trained on less data.
MCQ 11
What is the typical learning rate range for fine-tuning BERT on a downstream task?
  • A. 1e-1 to 1e-2
  • B. 2e-5 to 5e-5
  • C. 1e-8 to 1e-7
  • D. 1e0 to 1e1
Answer: B
B is correct. BERT fine-tuning uses a very small learning rate (2e-5 to 5e-5) to make gentle adjustments to the pre-trained weights without destroying the learned representations. Higher rates cause catastrophic forgetting.
MCQ 12
Why does SentencePiece work better than BPE for multilingual models?
  • A. It produces fewer tokens
  • B. It processes raw bytes without requiring space-based pre-tokenization
  • C. It uses a larger vocabulary
  • D. It is faster at inference time
Answer: B
B is correct. SentencePiece treats input as a raw byte stream and learns subword tokens directly. Unlike BPE, it does not assume words are separated by spaces, making it work equally well for languages like Chinese, Japanese, Hindi, and Thai.
MCQ 13
What is the difference between QLoRA and standard LoRA?
  • A. QLoRA uses a higher rank for the LoRA matrices
  • B. QLoRA quantizes the frozen base model to 4-bit, reducing memory by 4x
  • C. QLoRA trains all parameters instead of just LoRA matrices
  • D. QLoRA does not use the transformer architecture
Answer: B
B is correct. QLoRA combines LoRA with 4-bit quantization (NF4 data type) of the frozen base model. The LoRA matrices remain in higher precision for training. This reduces memory by approximately 4x compared to standard LoRA with FP16 base weights, enabling fine-tuning of 65B+ models on a single GPU.
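The memory arithmetic behind that claim, as a sketch covering base weights only (activations, gradients, and the LoRA matrices add more on top):

```python
# Memory for base weights at a given precision: params * bits / 8 bytes.
def base_memory_gb(params_billions, bits_per_param):
    return params_billions * bits_per_param / 8

params = 65  # a 65B model, as in the QLoRA paper
print(f"FP16 base: {base_memory_gb(params, 16):.0f} GB")
print(f"NF4 base:  {base_memory_gb(params, 4):.1f} GB")
```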
MCQ 14
What are 'emergent abilities' in the context of LLMs?
  • A. Abilities that are explicitly programmed into the model
  • B. Abilities that appear suddenly at a certain model scale, not present in smaller models
  • C. Abilities that emerge after fine-tuning
  • D. Abilities related to image processing
Answer: B
B is correct. Emergent abilities are capabilities that are absent in smaller models but suddenly appear when models reach a critical size. Examples include few-shot learning, chain-of-thought reasoning, and arithmetic. These abilities are not explicitly trained -- they emerge from scale.
MCQ 15
When using Hugging Face Trainer for BERT fine-tuning, what does the eval_strategy='epoch' parameter do?
  • A. Evaluates the model only at the end of training
  • B. Evaluates the model on the validation set after each epoch
  • C. Evaluates the model after every training step
  • D. Disables evaluation during training
Answer: B
B is correct. Setting eval_strategy='epoch' runs evaluation on the validation dataset at the end of every training epoch. This lets you monitor performance over time and detect overfitting. Other options include 'steps' (evaluate every N steps) and 'no' (skip evaluation).
MCQ 16
In a LoRA configuration with r=16 and lora_alpha=32, what is the effective scaling factor applied to the LoRA update?
  • A. 16
  • B. 32
  • C. 2 (alpha / r = 32 / 16)
  • D. 0.5 (r / alpha = 16 / 32)
Answer: C
C is correct. The LoRA update is scaled by alpha/r. With lora_alpha=32 and r=16, the scaling factor is 32/16 = 2. This means the LoRA update is multiplied by 2 before being added to the frozen weights. Higher alpha/r ratios amplify the LoRA adaptation.
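A tiny sketch of the scaling arithmetic; the second call just illustrates how the factor shrinks when r grows with alpha held fixed:

```python
# LoRA scales the B @ A update by alpha / r before adding it to frozen weights.
def lora_scale(alpha, r):
    return alpha / r

print(lora_scale(32, 16))  # 2.0 -- the configuration in this question
print(lora_scale(32, 64))  # 0.5 -- same alpha, higher rank, smaller per-rank weight
```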

Coding Challenges

Coding challenges coming soon.
