Chapter 20 Advanced 52 Questions

Practice Questions — Transformers and Attention Mechanism

9 Easy · 10 Medium · 8 Hard

Topic-Specific Questions

Question 1
Easy
What paper introduced the Transformer architecture, and what was its key message?
The paper's title itself conveys the message.
The paper "Attention Is All You Need" (Vaswani et al., 2017, Google) introduced the Transformer. Its key message is that you do not need recurrence (RNNs) or convolution (CNNs) to process sequential data. The attention mechanism alone is sufficient and actually superior: it enables parallel processing, better long-range dependencies, and faster training.
Question 2
Easy
What are Query (Q), Key (K), and Value (V) in the attention mechanism?
Use the restaurant analogy.
Query (Q) is what you are looking for -- the question being asked. Key (K) is the identifier of each available item -- like a label or description. Value (V) is the actual content that will be retrieved. Attention computes similarity between Q and each K, then returns a weighted combination of Vs. Restaurant analogy: Q = your food preference, K = menu descriptions, V = the actual dishes.
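The Q/K/V mechanics can be sketched in a few lines of numpy. This is a minimal illustration with random vectors, not an optimized implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 4))   # 5 values
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)     # (3, 4) (3, 5)
print(w.sum(axis=-1))         # each row of weights sums to 1
```

Each output row is a weighted average of the Value rows, with weights given by how well that query matched each key.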
Question 3
Easy
Why do Transformers need positional encoding?
What information is lost when you remove recurrence?
RNNs inherently process tokens in order, so position information is built into the computation. Transformers process all positions simultaneously with attention, which is permutation-invariant -- it does not care about order. Without positional encoding, 'dog bites man' and 'man bites dog' would produce identical representations. Positional encoding adds position information to the input embeddings so the model knows which word is first, second, etc.
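The original paper's sinusoidal encoding can be sketched in numpy; the formulas PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(...) are from the paper, the rest is illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 512)
print(pe.shape)     # (50, 512): one encoding vector per position
print(pe[0, :4])    # position 0: [0. 1. 0. 1.]
```

The encoding is simply added to the input embeddings, giving each position a distinct signature.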
Question 4
Easy
What does the Hugging Face pipeline produce?
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')
result = sentiment("I love learning about Transformers")
print(result)
sentiment-analysis returns a label and confidence score.
[{'label': 'POSITIVE', 'score': 0.9998}] (approximately)
Question 5
Easy
What is the difference between BERT and GPT in terms of attention direction?
One sees all words, the other sees only past words.
BERT uses bidirectional attention: every token can attend to every other token in the sequence, including tokens that come after it. GPT uses causal (autoregressive) attention: each token can only attend to tokens at the same or earlier positions. BERT sees the full context (ideal for understanding). GPT sees only the past (required for generation, since future tokens have not been generated yet).
Question 6
Easy
What is the shape of attention weights?
# For a sequence of 10 tokens with d_model=512
# Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
# What is the shape of the attention weight matrix?
Each token attends to every other token.
(10, 10)
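This shape can be verified directly; the projection matrices below are random stand-ins for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))         # 10 tokens, d_model = 512
d_k = 64
W_Q = rng.normal(size=(512, d_k))      # toy projection matrices
W_K = rng.normal(size=(512, d_k))
Q, K = X @ W_Q, X @ W_K
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)                    # (10, 10): one weight per token pair
```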
Question 7
Medium
Explain what self-attention computes for the sentence 'The cat sat on the mat'. What does the word 'sat' attend to?
Think about which words are most relevant to understanding 'sat'.
In self-attention for 'sat': the Query vector for 'sat' is compared against Key vectors for all 6 words. The attention scores (after softmax) determine how much each word contributes to the output representation of 'sat'. In a trained model, 'sat' would strongly attend to 'cat' (the subject -- who is sitting), 'on' (the spatial relationship), and 'mat' (the location). It would attend less to 'The' and 'the' (articles carry less information). The output for 'sat' is a weighted average of all Value vectors, with weights proportional to relevance.
Question 8
Medium
Why does the attention formula divide by sqrt(d_k)?
What happens to dot products when vectors have many dimensions?
As the dimension d_k increases, the dot products of Q and K grow in magnitude: for random vectors with unit-variance components, the variance of the dot product grows proportionally to d_k, so typical values scale like sqrt(d_k). Large dot products cause softmax to produce extremely sharp distributions (very close to 0 or 1), which kills gradients during backpropagation. Dividing by sqrt(d_k) normalizes the dot products to have unit variance regardless of the dimension, keeping them in a range where softmax produces meaningful, non-saturated distributions. This ensures stable training.
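A quick numpy experiment makes the scaling effect visible, assuming standard-normal Q and K components:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)                 # 10,000 sample dot products
    # unscaled std grows like sqrt(d_k); scaled std stays near 1
    print(d_k, dots.std().round(1), (dots / np.sqrt(d_k)).std().round(1))
```

The unscaled standard deviation roughly doubles each time d_k quadruples (about 4, 8, 16), while the scaled version stays near 1 at every dimension.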
Question 9
Medium
What is multi-head attention and why is it better than single-head attention?
Multiple attention heads capture different types of relationships.
Multi-head attention runs N parallel attention operations (heads), each with its own learned Q, K, V projection matrices. If d_model=512 and N=8, each head operates on 64 dimensions. The outputs are concatenated and linearly projected back to 512 dimensions. Benefits: (1) Different heads can specialize in different relationship types (syntax, semantics, position, coreference). (2) Multiple subspaces -- a single head can only represent one type of relationship per layer; multiple heads capture many simultaneously. (3) More stable training -- averaging over multiple heads reduces variance.
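The head-splitting bookkeeping is just a reshape and transpose; this sketch omits the learned projections and shows only how 512 dimensions become 8 independent 64-dimensional subspaces:

```python
import numpy as np

def split_heads(X, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, d_head)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    return X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

X = np.arange(10 * 512, dtype=float).reshape(10, 512)
heads = split_heads(X, 8)
print(heads.shape)   # (8, 10, 64): 8 heads, each over a 64-dim subspace

# concatenating the heads back recovers the original layout exactly
merged = heads.transpose(1, 0, 2).reshape(10, 512)
print(np.array_equal(merged, X))   # True
```

Each head then runs scaled dot-product attention independently on its own (seq_len, 64) slice before the results are concatenated and projected by W_O.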
Question 10
Medium
What is the causal attention mask for a sequence of length 4?
import numpy as np
mask = np.tril(np.ones((4, 4)))
print(mask)
np.tril creates a lower triangular matrix.
[[1. 0. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 1. 0.]
 [1. 1. 1. 1.]]
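In practice the mask is applied by setting future positions to -inf before softmax, so they receive exactly zero weight. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))               # raw attention scores
mask = np.tril(np.ones((4, 4)))                # 1 = allowed, 0 = future position
masked = np.where(mask == 1, scores, -np.inf)  # future positions -> -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
# row i has non-zero weights only in columns 0..i;
# row 0 attends solely to itself (weight 1.0)
```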
Question 11
Medium
What is the difference between self-attention and cross-attention in the Transformer decoder?
Where do Q, K, V come from in each case?
In the masked self-attention sub-layer (the decoder's first sub-layer): Q, K, V all come from the decoder's own input (the previously generated tokens). This allows each decoder token to attend to all previous decoder tokens. In the cross-attention sub-layer (the decoder's second sub-layer): Q comes from the decoder, but K and V come from the encoder's output. This is how the decoder reads and uses the encoder's representation of the input. For translation, self-attention helps the decoder maintain coherent output, while cross-attention helps it align with the source language.
Question 12
Hard
How does BERT's pre-training work (Masked Language Modeling)?
BERT randomly masks some input tokens and learns to predict them.
BERT is pre-trained using Masked Language Modeling (MLM): 15% of input tokens are randomly selected, of which 80% are replaced with [MASK], 10% with a random word, and 10% left unchanged. The model must predict the original token for all selected positions. For 'The [MASK] sat on the mat', BERT uses bidirectional context (both 'The' and 'sat on the mat') to predict 'cat'. This forces BERT to build deep bidirectional representations. The second pre-training task is Next Sentence Prediction: given two sentences, predict if the second follows the first in the original text.
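The 15% / 80-10-10 selection scheme can be sketched over token ids. This is a toy corruption routine, not BERT's actual implementation; 103 happens to be BERT-base's [MASK] id, but any placeholder id would do here:

```python
import numpy as np

def mlm_corrupt(tokens, vocab_size, mask_id, rng):
    """Toy BERT-style corruption: select 15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    tokens = tokens.copy()
    selected = rng.random(len(tokens)) < 0.15
    roll = rng.random(len(tokens))
    tokens[selected & (roll < 0.8)] = mask_id                 # 80%: [MASK]
    rand = selected & (roll >= 0.8) & (roll < 0.9)            # 10%: random word
    tokens[rand] = rng.integers(0, vocab_size, size=rand.sum())
    # remaining 10% of selected positions stay unchanged
    return tokens, selected

rng = np.random.default_rng(0)
tokens = rng.integers(0, 30000, size=512)
corrupted, targets = mlm_corrupt(tokens, 30000, mask_id=103, rng=rng)
print(targets.sum())  # roughly 0.15 * 512 ≈ 77 positions to predict
```

The loss is computed only at the selected positions; unselected tokens pass through untouched.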
Question 13
Hard
Why can Transformers be trained much faster than RNNs on the same data?
Think about parallelism and GPU utilization.
RNNs process tokens sequentially: to compute h_t, you need h_(t-1), which needs h_(t-2), and so on. This creates a chain of dependencies that cannot be parallelized. For a sequence of 512 tokens, you need 512 sequential steps. Transformers compute all attention scores in parallel: every token's Q, K, V can be computed simultaneously. The matrix multiplication Q * K^T processes all token pairs at once. On a modern GPU with thousands of cores, this parallelism translates to massive speedups. A 512-token sequence that requires 512 sequential RNN steps can be processed in effectively O(1) depth with a Transformer (one parallel matrix multiply).
Question 14
Hard
How many parameters does the attention mechanism have for one head?
# d_model = 512, num_heads = 8, d_k = d_v = d_model / num_heads = 64
# One attention head has:
# W_Q: d_model x d_k
# W_K: d_model x d_k
# W_V: d_model x d_v
# Total per head = ?
Three weight matrices, each mapping d_model to d_k.
Each head: W_Q = 512 x 64 = 32,768. W_K = 512 x 64 = 32,768. W_V = 512 x 64 = 32,768. Total per head: 98,304. For all 8 heads: 786,432. Plus W_O (512 x 512 = 262,144). Grand total for multi-head attention: 1,048,576 (approximately 1M).
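The arithmetic above checks out in a few lines (bias terms are ignored, as in the question):

```python
d_model, num_heads = 512, 8
d_k = d_model // num_heads                   # 64 dims per head
per_head = 3 * d_model * d_k                 # W_Q, W_K, W_V
all_heads = num_heads * per_head
w_o = d_model * d_model                      # output projection W_O
print(per_head, all_heads, all_heads + w_o)  # 98304 786432 1048576
```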
Question 15
Hard
Why did the Transformer architecture replace RNNs for NLP? Give at least three technical reasons.
Think about parallelization, long-range dependencies, and gradient flow.
Three technical reasons: (1) Parallelization -- RNNs process tokens sequentially (O(n) depth). Transformers process all tokens in parallel (O(1) depth). This makes training on massive datasets practical. (2) Long-range dependencies -- RNN information must flow through O(n) steps to connect distant tokens, losing information. Transformer attention connects any two tokens in O(1) steps. (3) Gradient flow -- RNNs suffer from vanishing/exploding gradients over long sequences, even with LSTM. Transformers use residual connections and direct attention paths that provide O(1) gradient flow. Additionally: (4) Scalability -- Transformers scale better to massive model sizes (billions of parameters) because of efficient matrix operations on GPUs.

Mixed & Application Questions

Question 1
Easy
What is the Hugging Face Transformers library?
It provides pre-trained Transformer models.
Hugging Face Transformers is an open-source library that provides a unified API for thousands of pre-trained Transformer models (BERT, GPT-2, T5, DistilBERT, etc.). It offers high-level pipelines for common tasks (sentiment analysis, NER, QA, text generation) and low-level access to tokenizers and models. It supports both PyTorch and TensorFlow. The Model Hub hosts over 500,000 pre-trained models that can be loaded with a single line of code.
Question 2
Easy
What does softmax do to these scores?
import numpy as np
scores = np.array([2.0, 1.0, 0.5])
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))
Softmax converts scores to probabilities that sum to 1.
[0.629, 0.231, 0.140] (approximately)
Question 3
Easy
What is the attention mask in Transformer models?
It tells the model which tokens are real and which are padding.
The attention mask is a binary tensor (0s and 1s) that tells the model which positions contain real tokens (1) and which are padding (0). When sequences are padded to the same length, the padding tokens should be ignored during attention computation. The attention mask ensures that padding positions receive zero attention weight. Without it, the model would compute attention over meaningless padding tokens, degrading performance.
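How a padding mask zeroes out attention can be shown in numpy; as with causal masking, padded positions are set to -inf before softmax (a sketch, not any library's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))            # 5 positions, last 2 are padding
attention_mask = np.array([1, 1, 1, 0, 0])  # 1 = real token, 0 = padding
masked = np.where(attention_mask == 1, scores, -np.inf)  # broadcast over rows
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights[:, 3:].sum())  # 0.0: padding columns get zero attention weight
```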
Question 4
Medium
What models does each task require?
# Task 1: Classify if an email is spam or not
# Task 2: Generate a product description from keywords
# Task 3: Translate English to Hindi
# Which Transformer variant (BERT/GPT/T5) is best for each?
Classification needs understanding; generation needs autoregression; translation needs both.
Task 1 (Spam classification): BERT (encoder-only). Classification requires understanding the full email bidirectionally. Task 2 (Generate description): GPT (decoder-only). Text generation requires autoregressive generation (producing one token at a time). Task 3 (Translation): T5 (encoder-decoder). Translation has a distinct input (English) and output (Hindi), fitting the encoder-decoder paradigm.
Question 5
Medium
What are residual connections in the Transformer and why are they important?
Similar to skip connections in ResNet.
Residual connections add the input of a sub-layer to its output: output = LayerNorm(x + SubLayer(x)). In the Transformer, each self-attention layer and each feed-forward layer has a residual connection. They are important because: (1) They allow gradients to flow directly through the addition, preventing vanishing gradients in deep networks. (2) They make it easier for the network to learn the identity function (if SubLayer(x) = 0, the output is just x). (3) They stabilize training, especially in very deep models (12+ layers). This is the same principle as ResNet skip connections.
Question 6
Medium
What is wrong with this Hugging Face code?
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
texts = ["Great movie", "Bad film", "Okay"]
results = classifier(texts[0], texts[1], texts[2])  # Wrong way to pass multiple texts
Pipeline expects a list of texts, not separate arguments.
The pipeline expects a single text or a list of texts, not multiple arguments. The correct way is: results = classifier(texts) or results = classifier(["Great movie", "Bad film", "Okay"]). Passing texts as separate positional arguments may cause unexpected behavior or an error.
Question 7
Medium
What learning rate should be used for fine-tuning a pre-trained Transformer, and why?
Much lower than training from scratch.
The standard fine-tuning learning rate for Transformers is 2e-5 to 5e-5 (0.00002 to 0.00005). This is 10-100x smaller than a typical training-from-scratch learning rate (1e-3). The reason: pre-trained weights encode knowledge from billions of tokens. A high learning rate would cause large weight updates that destroy this knowledge (catastrophic forgetting). The low learning rate makes small adjustments that adapt the model to the new task while preserving its pre-trained understanding.
Question 8
Medium
How many attention weights are computed in total for a Transformer with 6 encoder layers, 8 heads each, processing a sequence of 100 tokens?
Each head in each layer computes a full attention matrix.
Each head computes a 100 x 100 attention matrix = 10,000 weights. With 8 heads per layer: 8 x 10,000 = 80,000 per layer. With 6 layers: 6 x 80,000 = 480,000 attention weights total.
Question 9
Hard
What is the computational complexity of self-attention and why does it become a problem for long sequences?
The attention matrix is seq_len x seq_len.
Self-attention has O(n^2 * d) time and O(n^2) memory complexity, where n is sequence length and d is the model dimension. The attention matrix Q * K^T has shape (n, n), requiring n^2 computations. For n=512, this is ~262K entries -- manageable. For n=4096, it is ~16.7M entries. For n=32768, it is ~1 billion entries. This quadratic scaling makes standard Transformers impractical for very long documents, books, or high-resolution images. Solutions include efficient attention variants: Sparse Attention (only attend to nearby tokens), Linear Attention (approximating softmax), Flash Attention (optimized GPU memory access), and Sliding Window Attention (fixed-size local windows).
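The quadratic blow-up is easy to tabulate (assuming 4-byte float32 entries for a single attention matrix, i.e. one head in one layer):

```python
for n in (512, 4096, 32768):
    entries = n * n
    # 4 bytes per float32 entry, per head, per layer
    print(f"n={n}: {entries:,} entries, {entries * 4 / 2**20:.0f} MiB")
```

At n=32768 a single attention matrix already needs 4 GiB; multiply by heads and layers and the memory cost becomes prohibitive, which is what motivates the efficient attention variants above.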
Question 10
Hard
Vikram wants to build a chatbot for customer support. Should he use BERT, GPT, or T5? Explain your reasoning.
A chatbot needs to both understand questions and generate responses.
Vikram should use a GPT-style (decoder-only) model or a T5-style (encoder-decoder) model. BERT alone is not suitable because it cannot generate text (encoder-only). GPT is the natural choice for chatbots because: (1) it generates text autoregressively (one token at a time), which is exactly what conversation requires, (2) modern GPT-based models (like ChatGPT) are specifically designed for dialogue, and (3) the causal attention pattern naturally handles conversation context. T5 could also work by framing each conversation turn as a text-to-text task (input: conversation history, output: response). In practice, the best approach is to fine-tune a pre-trained conversational model (e.g., DialoGPT, Llama, or Mistral) on Vikram's customer support data.
Question 11
Hard
Explain how the Transformer encoder and decoder interact during machine translation.
The encoder processes the source language, the decoder generates the target language using cross-attention.
For translating 'I love coding' to Hindi: (1) The encoder processes the entire English sentence in parallel. Each encoder layer applies self-attention and feed-forward operations. The final encoder output is a rich representation of the source sentence. (2) The decoder generates Hindi tokens one at a time. At each step, it applies: (a) Masked self-attention over previously generated Hindi tokens (cannot see future tokens). (b) Cross-attention where Q comes from the decoder (current Hindi state) and K, V come from the encoder output (English representation). This allows the decoder to 'look at' the relevant English words while generating each Hindi word. (c) Feed-forward network. (3) The decoder predicts 'mujhe', then uses [mujhe] to predict 'coding', then [mujhe, coding] to predict 'pasand', and so on until it generates an end token.
Question 12
Hard
What makes BERT's tokenizer different from simple word splitting? What is WordPiece tokenization?
WordPiece splits words into subword units.
BERT uses WordPiece tokenization, which splits words into subword units from a fixed vocabulary (typically 30,000 tokens). Common words remain whole ('the', 'and', 'cat'), but rare or unknown words are split into known subwords. Example: 'preprocessing' might become ['pre', '##process', '##ing'], where ## indicates continuation of a previous token. 'Unbelievable' might become ['un', '##bel', '##ie', '##va', '##ble']. Benefits: (1) Fixed vocabulary handles any input (no out-of-vocabulary problem). (2) Shared subwords capture morphology ('preprocessing', 'postprocessing' both contain 'processing'). (3) Compact vocabulary -- 30K tokens can represent any English text. Limitation: one word may become multiple tokens, so token-level predictions do not align 1-to-1 with words.
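The inference-time splitting is a greedy longest-match-first search over the vocabulary. The sketch below uses a tiny hand-picked vocabulary for illustration; real WordPiece uses a 30K vocabulary learned from data (plus lowercasing and other normalization):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece style.
    Toy vocabulary; continuation pieces carry the ## prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no matching subword found
    return pieces

vocab = {"pre", "##process", "##ing", "the"}
print(wordpiece_tokenize("preprocessing", vocab))  # ['pre', '##process', '##ing']
print(wordpiece_tokenize("the", vocab))            # ['the']
```

At each position the longest vocabulary entry is consumed first, which is why common stems like '##process' survive intact inside longer words.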

Multiple Choice Questions

MCQ 1
What is the key innovation of the Transformer architecture?
  • A. Using convolutional layers for text
  • B. Processing sequences one token at a time
  • C. Using attention mechanisms instead of recurrence
  • D. Removing all neural network layers
Answer: C
C is correct. The Transformer replaced recurrence (RNNs) with attention mechanisms, enabling parallel processing and better long-range dependencies. This is the key insight of 'Attention Is All You Need'.
MCQ 2
In the attention mechanism, what does the Query (Q) represent?
  • A. The database of all tokens
  • B. What the current token is looking for
  • C. The final output of the model
  • D. The positional encoding
Answer: B
B is correct. The Query represents what the current token is searching for. It is compared against Keys to determine relevance, then the corresponding Values are weighted and summed.
MCQ 3
Which Transformer variant is BERT?
  • A. Decoder-only
  • B. Encoder-only
  • C. Encoder-Decoder
  • D. Neither encoder nor decoder
Answer: B
B is correct. BERT uses only the encoder with bidirectional attention. It processes the full input context simultaneously, making it ideal for understanding tasks (classification, NER, QA).
MCQ 4
What does GPT stand for?
  • A. General Purpose Transformer
  • B. Generative Pre-trained Transformer
  • C. Global Processing Technology
  • D. Gradient Propagation Technique
Answer: B
B is correct. GPT = Generative Pre-trained Transformer. It is pre-trained on next-token prediction and excels at generating text.
MCQ 5
Why is positional encoding needed in Transformers?
  • A. To speed up training
  • B. To add word order information since Transformers have no inherent sense of position
  • C. To reduce memory usage
  • D. To convert words to numbers
Answer: B
B is correct. Transformers process all tokens in parallel with no recurrence, so they have no inherent sense of word order. Positional encoding adds position information to the input embeddings.
MCQ 6
What is the attention formula?
  • A. Attention = Q + K + V
  • B. Attention = softmax(Q * K^T / sqrt(d_k)) * V
  • C. Attention = sigmoid(Q * V) * K
  • D. Attention = Q * K * V
Answer: B
B is correct. Scaled dot-product attention: compute Q*K^T (similarity), scale by sqrt(d_k), apply softmax (normalize to probabilities), multiply by V (weighted sum of values).
MCQ 7
What does multi-head attention achieve that single-head attention cannot?
  • A. Faster computation
  • B. Simultaneously capturing different types of relationships
  • C. Using less memory
  • D. Removing the need for positional encoding
Answer: B
B is correct. Each attention head can specialize in a different type of relationship (syntax, semantics, position). Single-head attention can only capture one type of relationship per layer.
MCQ 8
What is causal masking in GPT?
  • A. Removing random tokens from input
  • B. Preventing each token from attending to future tokens
  • C. Masking the loss for certain tokens
  • D. Hiding the model architecture
Answer: B
B is correct. Causal masking uses a lower-triangular attention mask so each token can only attend to itself and previous tokens. This prevents 'cheating' during generation -- the model cannot look at tokens it has not generated yet.
MCQ 9
What is the recommended learning rate range for fine-tuning a pre-trained Transformer?
  • A. 0.1 to 0.01
  • B. 0.01 to 0.001
  • C. 1e-5 to 5e-5
  • D. 1e-8 to 1e-7
Answer: C
C is correct. Fine-tuning uses a very low learning rate (1e-5 to 5e-5) to make small adjustments to pre-trained weights without destroying the learned knowledge. Higher rates cause catastrophic forgetting.
MCQ 10
Which component connects the encoder and decoder in the original Transformer?
  • A. Self-attention
  • B. Positional encoding
  • C. Cross-attention
  • D. Feed-forward network
Answer: C
C is correct. Cross-attention in the decoder takes Q from the decoder and K, V from the encoder output. This allows the decoder to read and use the encoder's representation of the input.
MCQ 11
What is the computational complexity of self-attention with respect to sequence length n?
  • A. O(n)
  • B. O(n log n)
  • C. O(n^2)
  • D. O(n^3)
Answer: C
C is correct. Self-attention computes an n x n attention matrix (Q * K^T), which requires O(n^2) operations. This quadratic scaling is the main limitation for processing very long sequences.
MCQ 12
How does BERT's Masked Language Modeling work?
  • A. It predicts the next word in a sequence
  • B. It randomly masks 15% of tokens and predicts the original tokens using bidirectional context
  • C. It translates masked words to another language
  • D. It removes stop words and predicts them
Answer: B
B is correct. BERT masks 15% of input tokens (80% replaced with [MASK], 10% with random word, 10% unchanged) and predicts the originals. The bidirectional context (both left and right) is used, which is why BERT builds deep bidirectional representations.
MCQ 13
Why does dividing by sqrt(d_k) in attention prevent training instability?
  • A. It makes the computation faster
  • B. It normalizes dot products to prevent softmax saturation, preserving gradients
  • C. It reduces the number of parameters
  • D. It aligns the dimensions of Q and K
Answer: B
B is correct. Without scaling, large-dimensional dot products produce very large values. Softmax on large values outputs near-0 or near-1, which has near-zero gradients (saturation). Scaling by sqrt(d_k) keeps values in a range where softmax produces meaningful distributions with healthy gradients.
MCQ 14
What is the main advantage of Transformer over LSTM for NLP tasks?
  • A. Transformers have fewer parameters
  • B. Transformers can only process text
  • C. Transformers enable parallelization and better long-range dependencies via O(1) path length
  • D. Transformers do not need GPU
Answer: C
C is correct. Transformers process all tokens in parallel (vs sequential for LSTM) and connect any two tokens in O(1) path length through attention (vs O(n) for LSTM). This enables faster training and better modeling of long-range dependencies.
MCQ 15
What year was the original Transformer paper published?
  • A. 2015
  • B. 2017
  • C. 2019
  • D. 2020
Answer: B
B is correct. 'Attention Is All You Need' was published in 2017 by Vaswani et al. at Google. It introduced the Transformer architecture that now powers virtually all state-of-the-art AI systems.
MCQ 16
What does the softmax function do in attention?
  • A. It normalizes scores into a probability distribution that sums to 1
  • B. It removes negative values
  • C. It doubles the attention scores
  • D. It converts scores to binary values
Answer: A
A is correct. Softmax converts raw attention scores into probabilities between 0 and 1 that sum to 1. Higher scores get larger probabilities. This determines how much each position contributes to the output.
MCQ 17
What is the role of layer normalization in the Transformer?
  • A. It reduces the number of layers
  • B. It normalizes activations within each layer to stabilize training
  • C. It removes certain layers during inference
  • D. It converts all layers to the same type
Answer: B
B is correct. Layer normalization normalizes the activations of each layer to have zero mean and unit variance. This stabilizes training, especially in deep models (12+ layers), by preventing activation values from growing or shrinking uncontrollably.
MCQ 18
What is the T5 model's approach to NLP tasks?
  • A. It uses a CNN backbone for all tasks
  • B. It frames every task as text-to-text: input text, output text
  • C. It only works for translation
  • D. It uses reinforcement learning for all tasks
Answer: B
B is correct. T5 (Text-to-Text Transfer Transformer) frames every NLP task as text-to-text. For classification: input 'classify: This movie is great' -> output 'positive'. For translation: input 'translate: I love AI' -> output 'mujhe AI pasand hai'. This unified framework simplifies multi-task learning.
MCQ 19
How does the feed-forward network in each Transformer layer work?
  • A. It is a single Dense layer with softmax
  • B. It is two Dense layers: first expands dimensions (typically 4x) with ReLU, then projects back to the model dimension
  • C. It is a convolutional layer
  • D. It is an LSTM layer
Answer: B
B is correct. The feed-forward network (FFN) consists of two linear transformations with a ReLU in between: FFN(x) = ReLU(x * W1 + b1) * W2 + b2. Typically W1 expands from d_model to 4*d_model, and W2 projects back to d_model. This provides non-linear transformation capacity.
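The FFN formula maps directly to two matrix multiplies; random weights below stand in for learned parameters:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x @ W1 + b1) @ W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048              # typical 4x expansion
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))     # 10 token representations
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)    # (10, 512): back to d_model
```

The same FFN is applied independently at every position, which is why it adds capacity without breaking the Transformer's parallelism.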
MCQ 20
Which of these tasks can Hugging Face pipeline() perform?
  • A. Only sentiment analysis
  • B. Sentiment analysis, NER, question answering, text generation, and many more
  • C. Only text generation
  • D. Only translation
Answer: B
B is correct. Hugging Face pipeline supports many tasks: sentiment-analysis, ner, question-answering, text-generation, summarization, translation, zero-shot-classification, fill-mask, and more. Each task uses an appropriate pre-trained model.
MCQ 21
What is zero-shot classification?
  • A. Classification with zero training data by using a model that can classify into categories it was never explicitly trained on
  • B. Classification that always predicts zero
  • C. Classification without a neural network
  • D. Classification that requires zero preprocessing
Answer: A
A is correct. Zero-shot classification uses a pre-trained language model to classify text into arbitrary categories without any task-specific training. The model leverages its general language understanding to match text to candidate labels provided at inference time.
MCQ 22
What is the difference between pre-training and fine-tuning a Transformer?
  • A. Pre-training uses labeled data; fine-tuning uses unlabeled data
  • B. Pre-training learns general language knowledge on massive unlabeled text; fine-tuning adapts the model to a specific task with a small labeled dataset
  • C. Pre-training is faster than fine-tuning
  • D. Pre-training and fine-tuning are the same process
Answer: B
B is correct. Pre-training (e.g., masked language modeling for BERT, next token prediction for GPT) teaches the model general language understanding from billions of tokens of unlabeled text. Fine-tuning then adapts this general knowledge to a specific downstream task (sentiment analysis, NER) using a relatively small labeled dataset.
MCQ 23
In a Transformer with d_model=512 and 8 attention heads, what is the dimension per head?
  • A. 512
  • B. 64
  • C. 8
  • D. 4096
Answer: B
B is correct. Each head operates on d_model / num_heads = 512 / 8 = 64 dimensions. The 8 heads each produce 64-dimensional outputs, which are concatenated back to 512 dimensions.
MCQ 24
Why is WordPiece tokenization used instead of word-level tokenization in BERT?
  • A. It is faster to compute
  • B. It handles any input with a fixed vocabulary by splitting rare words into known subwords, eliminating out-of-vocabulary problems
  • C. It produces shorter sequences
  • D. It only works for English
Answer: B
B is correct. WordPiece tokenization splits rare or unknown words into known subword units from a fixed vocabulary (e.g., 30K tokens). 'preprocessing' -> ['pre', '##process', '##ing']. This handles any input text, shares morphological information between related words, and eliminates the out-of-vocabulary problem.
MCQ 25
Which AI models are based on the Transformer architecture?
  • A. Only BERT
  • B. Only GPT
  • C. BERT, GPT, T5, ChatGPT, Claude, Gemini, and virtually all modern language models
  • D. None of the above
Answer: C
C is correct. The Transformer architecture underpins virtually every major AI model since 2018: BERT (Google), GPT series (OpenAI), T5 (Google), ChatGPT, Claude (Anthropic), Gemini (Google), Vision Transformer, and many more. It is the foundation of modern AI.

Coding Challenges

Coding challenges coming soon.
