Practice Questions — Transformers and Attention Mechanism
Topic-Specific Questions
Question 1
Easy
What paper introduced the Transformer architecture, and what was its key message?
The paper's title itself conveys the message.
The paper "Attention Is All You Need" (Vaswani et al., 2017, Google) introduced the Transformer. Its key message is that you do not need recurrence (RNNs) or convolution (CNNs) to process sequential data. The attention mechanism alone is sufficient and actually superior: it enables parallel processing, better long-range dependencies, and faster training.
Question 2
Easy
What are Query (Q), Key (K), and Value (V) in the attention mechanism?
Use the restaurant analogy.
Query (Q) is what you are looking for -- the question being asked. Key (K) is the identifier of each available item -- like a label or description. Value (V) is the actual content that will be retrieved. Attention computes similarity between Q and each K, then returns a weighted combination of Vs. Restaurant analogy: Q = your food preference, K = menu descriptions, V = the actual dishes.
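This lookup can be sketched in a few lines of numpy (random illustrative vectors, not a trained model): compare the Query against each Key, softmax the scores, and return the weighted mix of Values.

```python
import numpy as np

# Illustrative single-query attention (random numbers, not a trained model).
np.random.seed(0)
d_k = 4
q = np.random.randn(d_k)        # Query: what this token is looking for
K = np.random.randn(3, d_k)     # Keys: one identifier per item
V = np.random.randn(3, d_k)     # Values: the content to retrieve

scores = K @ q / np.sqrt(d_k)                     # Q-K similarity
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over items
output = weights @ V                              # weighted mix of Values

print(weights.round(3), output.shape)
```

The weights form a probability distribution over the three items, so the output is a relevance-weighted average of the Value vectors.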
Question 3
Easy
Why do Transformers need positional encoding?
What information is lost when you remove recurrence?
RNNs inherently process tokens in order, so position information is built into the computation. Transformers process all positions simultaneously with attention, which is permutation-invariant -- it does not care about order. Without positional encoding, 'dog bites man' and 'man bites dog' would produce identical representations. Positional encoding adds position information to the input embeddings so the model knows which word is first, second, etc.
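The original paper's sinusoidal encoding can be sketched as follows (a minimal version; learned positional embeddings are a common alternative in later models):

```python
import numpy as np

# Sinusoidal positional encoding sketch: each position gets a unique
# d_model-sized vector built from sines and cosines of varying frequency.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]      # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]        # embedding dimensions
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])   # odd dimensions use cosine
    return pe

pe = positional_encoding(10, 16)
print(pe.shape)   # one position vector per token, added to the embeddings
```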
Question 4
Easy
What does the Hugging Face pipeline produce?
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')
result = sentiment("I love learning about Transformers")
print(result)
sentiment-analysis returns a label and confidence score.
[{'label': 'POSITIVE', 'score': 0.9998}] (approximately)
Question 5
Easy
What is the difference between BERT and GPT in terms of attention direction?
One sees all words, the other sees only past words.
BERT uses bidirectional attention: every token can attend to every other token in the sequence, including tokens that come after it. GPT uses causal (autoregressive) attention: each token can only attend to tokens at the same or earlier positions. BERT sees the full context (ideal for understanding). GPT sees only the past (required for generation, since future tokens have not been generated yet).
Question 6
Easy
What is the shape of attention weights?
# For a sequence of 10 tokens with d_model=512
# Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
# What is the shape of the attention weight matrix?
Each token attends to every other token.
(10, 10)
Question 7
Medium
Explain what self-attention computes for the sentence 'The cat sat on the mat'. What does the word 'sat' attend to?
Think about which words are most relevant to understanding 'sat'.
In self-attention for 'sat': the Query vector for 'sat' is compared against Key vectors for all 6 words. The attention scores (after softmax) determine how much each word contributes to the output representation of 'sat'. In a trained model, 'sat' would strongly attend to 'cat' (the subject -- who is sitting), 'on' (the spatial relationship), and 'mat' (the location). It would attend less to 'The' and 'the' (articles carry less information). The output for 'sat' is a weighted average of all Value vectors, with weights proportional to relevance.
Question 8
Medium
Why does the attention formula divide by sqrt(d_k)?
What happens to dot products when vectors have many dimensions?
As the dimension d_k increases, the dot product of random Q and K vectors grows in spread: its variance is roughly proportional to d_k, so typical magnitudes scale with sqrt(d_k). Large dot products cause softmax to produce extremely sharp distributions (very close to 0 or 1), which kills gradients during backpropagation. Dividing by sqrt(d_k) normalizes the dot products back to unit variance regardless of the dimension, keeping them in a range where softmax produces meaningful, non-saturated distributions. This ensures stable training.
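A quick empirical check of this claim, using random unit-variance vectors (illustrative only):

```python
import numpy as np

# Dot products of random vectors grow in variance with dimension;
# dividing by sqrt(d_k) brings them back to roughly unit variance.
rng = np.random.default_rng(0)
for d_k in (16, 256):
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    dots = (q * k).sum(axis=1)            # 10,000 sample dot products
    print(d_k, dots.var().round(1), (dots / np.sqrt(d_k)).var().round(2))
```

The unscaled variance tracks d_k (about 16 and 256), while the scaled variance stays near 1 in both cases.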
Question 9
Medium
What is multi-head attention and why is it better than single-head attention?
Multiple attention heads capture different types of relationships.
Multi-head attention runs N parallel attention operations (heads), each with its own learned Q, K, V projection matrices. If d_model=512 and N=8, each head operates on 64 dimensions. The outputs are concatenated and linearly projected back to 512 dimensions. Benefits: (1) Different heads can specialize in different relationship types (syntax, semantics, position, coreference). (2) Multiple subspaces -- a single head can only represent one type of relationship per layer; multiple heads capture many simultaneously. (3) More stable training -- averaging over multiple heads reduces variance.
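The split/concat bookkeeping can be sketched at the shape level (no learned projections here, just the reshape that divides d_model across heads):

```python
import numpy as np

# Shape-level sketch of multi-head splitting: 512 dims -> 8 heads x 64 dims.
seq_len, d_model, num_heads = 10, 512, 8
d_k = d_model // num_heads        # 64 dims per head

x = np.random.randn(seq_len, d_model)
heads = x.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)  # (8, 10, 64)
# ... each head would run scaled dot-product attention independently ...
concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)    # (10, 512)

print(heads.shape, concat.shape)
```

Splitting and re-concatenating is lossless; in the real model a final W_O projection mixes the concatenated head outputs.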
Question 10
Medium
What is the causal attention mask for a sequence of length 4?
import numpy as np
mask = np.tril(np.ones((4, 4)))
print(mask)
np.tril creates a lower triangular matrix (use dtype=int in np.ones to print integers).
[[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
Question 11
Medium
What is the difference between self-attention and cross-attention in the Transformer decoder?
Where do Q, K, V come from in each case?
In masked self-attention (the decoder's first sub-layer): Q, K, V all come from the decoder's own input (the previously generated tokens). This allows each decoder token to attend to all previous decoder tokens. In cross-attention (the decoder's second sub-layer): Q comes from the decoder, but K and V come from the encoder's output. This is how the decoder reads and uses the encoder's representation of the input. For translation, self-attention helps the decoder maintain coherent output, while cross-attention helps it align with the source language.
Question 12
Hard
How does BERT's pre-training work (Masked Language Modeling)?
BERT randomly masks some input tokens and learns to predict them.
BERT is pre-trained using Masked Language Modeling (MLM): 15% of input tokens are randomly selected, of which 80% are replaced with [MASK], 10% with a random word, and 10% left unchanged. The model must predict the original token for all selected positions. For 'The [MASK] sat on the mat', BERT uses bidirectional context (both 'The' and 'sat on the mat') to predict 'cat'. This forces BERT to build deep bidirectional representations. The second pre-training task is Next Sentence Prediction: given two sentences, predict if the second follows the first in the original text.
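The 15% / 80-10-10 rule can be sketched as a toy corruption loop (made-up mini vocabulary, not BERT's real tokenizer; run over many tokens so the fractions are visible):

```python
import random

# Toy sketch of BERT's masking rule: select ~15% of tokens; of those,
# 80% -> [MASK], 10% -> random word, 10% -> left unchanged.
random.seed(0)
vocab = ["dog", "ran", "hat", "mat", "cat"]
tokens = ["word"] * 10000                 # placeholder corpus

masked, n_selected = [], 0
for tok in tokens:
    if random.random() < 0.15:            # select this position
        n_selected += 1
        r = random.random()
        if r < 0.8:
            masked.append("[MASK]")       # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(random.choice(vocab))  # 10%: random word
        else:
            masked.append(tok)            # 10%: keep unchanged
    else:
        masked.append(tok)

print(n_selected / len(tokens))             # close to 0.15
print(masked.count("[MASK]") / n_selected)  # close to 0.80
```

The model's loss is computed only at the selected positions, where it must recover the original token from bidirectional context.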
Question 13
Hard
Why can Transformers be trained much faster than RNNs on the same data?
Think about parallelism and GPU utilization.
RNNs process tokens sequentially: to compute h_t, you need h_(t-1), which needs h_(t-2), and so on. This creates a chain of dependencies that cannot be parallelized. For a sequence of 512 tokens, you need 512 sequential steps. Transformers compute all attention scores in parallel: every token's Q, K, V can be computed simultaneously. The matrix multiplication Q * K^T processes all token pairs at once. On a modern GPU with thousands of cores, this parallelism translates to massive speedups. A 512-token sequence that requires 512 sequential RNN steps can be processed in effectively O(1) depth with a Transformer (one parallel matrix multiply).
Question 14
Hard
How many parameters does the attention mechanism have for one head?
# d_model = 512, num_heads = 8, d_k = d_v = d_model / num_heads = 64
# One attention head has:
# W_Q: d_model x d_k
# W_K: d_model x d_k
# W_V: d_model x d_v
# Total per head = ?
Three weight matrices, each mapping d_model to d_k.
Each head: W_Q = 512 x 64 = 32,768. W_K = 512 x 64 = 32,768. W_V = 512 x 64 = 32,768. Total per head: 98,304. For all 8 heads: 786,432. Plus W_O (512 x 512 = 262,144). Grand total for multi-head attention: 1,048,576 (approximately 1M).
Question 15
Hard
Why did the Transformer architecture replace RNNs for NLP? Give at least three technical reasons.
Think about parallelization, long-range dependencies, and gradient flow.
Three technical reasons: (1) Parallelization -- RNNs process tokens sequentially (O(n) depth). Transformers process all tokens in parallel (O(1) depth). This makes training on massive datasets practical. (2) Long-range dependencies -- RNN information must flow through O(n) steps to connect distant tokens, losing information. Transformer attention connects any two tokens in O(1) steps. (3) Gradient flow -- RNNs suffer from vanishing/exploding gradients over long sequences, even with LSTM. Transformers use residual connections and direct attention paths that provide O(1) gradient flow. Additionally: (4) Scalability -- Transformers scale better to massive model sizes (billions of parameters) because of efficient matrix operations on GPUs.
Mixed & Application Questions
Question 1
Easy
What is the Hugging Face Transformers library?
It provides pre-trained Transformer models.
Hugging Face Transformers is an open-source library that provides a unified API for thousands of pre-trained Transformer models (BERT, GPT-2, T5, DistilBERT, etc.). It offers high-level pipelines for common tasks (sentiment analysis, NER, QA, text generation) and low-level access to tokenizers and models. It supports both PyTorch and TensorFlow. The Model Hub hosts over 500,000 pre-trained models that can be loaded with a single line of code.
Question 2
Easy
What does softmax do to these scores?
import numpy as np
scores = np.array([2.0, 1.0, 0.5])
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))
Softmax converts scores to probabilities that sum to 1.
[0.629, 0.231, 0.140] (approximately)
Question 3
Easy
What is the attention mask in Transformer models?
It tells the model which tokens are real and which are padding.
The attention mask is a binary tensor (0s and 1s) that tells the model which positions contain real tokens (1) and which are padding (0). When sequences are padded to the same length, the padding tokens should be ignored during attention computation. The attention mask ensures that padding positions receive zero attention weight. Without it, the model would compute attention over meaningless padding tokens, degrading performance.
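A minimal sketch of how the mask zeroes out padding before softmax (illustrative scores): masked positions are set to negative infinity, so they contribute exactly zero after exponentiation.

```python
import numpy as np

# Padding-mask sketch: masked positions get -inf before softmax,
# so they receive exactly zero attention weight.
scores = np.array([2.0, 1.0, 0.5, 0.3])   # raw scores over 4 positions
attention_mask = np.array([1, 1, 1, 0])   # last position is padding

scores = np.where(attention_mask == 1, scores, -np.inf)
weights = np.exp(scores) / np.exp(scores).sum()
print(weights.round(3))   # padding position gets weight 0
```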
Question 4
Medium
What models does each task require?
# Task 1: Classify if an email is spam or not
# Task 2: Generate a product description from keywords
# Task 3: Translate English to Hindi
# Which Transformer variant (BERT/GPT/T5) is best for each?
Classification needs understanding; generation needs autoregression; translation needs both.
Task 1 (Spam classification): BERT (encoder-only). Classification requires understanding the full email bidirectionally. Task 2 (Generate description): GPT (decoder-only). Text generation requires autoregressive generation (producing one token at a time). Task 3 (Translation): T5 (encoder-decoder). Translation has a distinct input (English) and output (Hindi), fitting the encoder-decoder paradigm.
Question 5
Medium
What are residual connections in the Transformer and why are they important?
Similar to skip connections in ResNet.
Residual connections add the input of a sub-layer to its output: output = LayerNorm(x + SubLayer(x)). In the Transformer, each self-attention layer and each feed-forward layer has a residual connection. They are important because: (1) They allow gradients to flow directly through the addition, preventing vanishing gradients in deep networks. (2) They make it easier for the network to learn the identity function (if SubLayer(x) = 0, the output is just x). (3) They stabilize training, especially in very deep models (12+ layers). This is the same principle as ResNet skip connections.
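A minimal sketch of the post-norm residual pattern (a simplified LayerNorm without learned scale/shift, and a degenerate sub-layer that outputs zeros):

```python
import numpy as np

# Residual + LayerNorm sketch: output = LayerNorm(x + SubLayer(x)).
def layer_norm(x, eps=1e-6):
    return (x - x.mean()) / (x.std() + eps)

def sublayer(x):
    return np.zeros_like(x)   # degenerate sub-layer: "do nothing"

np.random.seed(1)
x = np.random.randn(8)
out = layer_norm(x + sublayer(x))
# With a zero sub-layer, the residual path passes x straight through
# (up to normalization), so a deep stack cannot lose the input signal.
print(out.round(3))
```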
Question 6
Medium
What is wrong with this Hugging Face code?
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
texts = ["Great movie", "Bad film", "Okay"]
results = classifier(texts[0], texts[1], texts[2])  # wrong way to pass multiple texts
Pipeline expects a list of texts, not separate arguments.
The pipeline expects a single text or a list of texts, not multiple arguments. The correct way is:
results = classifier(texts) or results = classifier(["Great movie", "Bad film", "Okay"]). Passing texts as separate positional arguments may cause unexpected behavior or an error.
Question 7
Medium
What learning rate should be used for fine-tuning a pre-trained Transformer, and why?
Much lower than training from scratch.
The standard fine-tuning learning rate for Transformers is 2e-5 to 5e-5 (0.00002 to 0.00005). This is 10-100x smaller than a typical training-from-scratch learning rate (1e-3). The reason: pre-trained weights encode knowledge from billions of tokens. A high learning rate would cause large weight updates that destroy this knowledge (catastrophic forgetting). The low learning rate makes small adjustments that adapt the model to the new task while preserving its pre-trained understanding.
Question 8
Medium
How many attention weights are computed in total for a Transformer with 6 encoder layers, 8 heads each, processing a sequence of 100 tokens?
Each head in each layer computes a full attention matrix.
Each head computes a 100 x 100 attention matrix = 10,000 weights. With 8 heads per layer: 8 x 10,000 = 80,000 per layer. With 6 layers: 6 x 80,000 = 480,000 attention weights total.
Question 9
Hard
What is the computational complexity of self-attention and why does it become a problem for long sequences?
The attention matrix is seq_len x seq_len.
Self-attention has O(n^2 * d) time and O(n^2) memory complexity, where n is sequence length and d is the model dimension. The attention matrix Q * K^T has shape (n, n), requiring n^2 computations. For n=512, this is ~262K entries -- manageable. For n=4096, it is ~16.7M entries. For n=32768, it is ~1 billion entries. This quadratic scaling makes standard Transformers impractical for very long documents, books, or high-resolution images. Solutions include efficient attention variants: Sparse Attention (only attend to nearby tokens), Linear Attention (approximating softmax), Flash Attention (optimized GPU memory access), and Sliding Window Attention (fixed-size local windows).
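The quadratic growth quoted above can be checked directly (float32 memory for one attention matrix, per head and per layer):

```python
# The n x n attention matrix grows quadratically with sequence length.
for n in (512, 4096, 32768):
    entries = n * n
    mb = entries * 4 / 2**20          # float32 bytes -> MiB
    print(n, entries, f"{mb:.0f} MB")
```

At n=512 one matrix is about 1 MB; at n=32768 it is about 4 GB, before multiplying by heads and layers.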
Question 10
Hard
Vikram wants to build a chatbot for customer support. Should he use BERT, GPT, or T5? Explain your reasoning.
A chatbot needs to both understand questions and generate responses.
Vikram should use a GPT-style (decoder-only) model or a T5-style (encoder-decoder) model. BERT alone is not suitable because it cannot generate text (encoder-only). GPT is the natural choice for chatbots because: (1) it generates text autoregressively (one token at a time), which is exactly what conversation requires, (2) modern GPT-based models (like ChatGPT) are specifically designed for dialogue, and (3) the causal attention pattern naturally handles conversation context. T5 could also work by framing each conversation turn as a text-to-text task (input: conversation history, output: response). In practice, the best approach is to fine-tune a pre-trained conversational model (e.g., DialoGPT, Llama, or Mistral) on Vikram's customer support data.
Question 11
Hard
Explain how the Transformer encoder and decoder interact during machine translation.
The encoder processes the source language, the decoder generates the target language using cross-attention.
For translating 'I love coding' to Hindi: (1) The encoder processes the entire English sentence in parallel. Each encoder layer applies self-attention and feed-forward operations. The final encoder output is a rich representation of the source sentence. (2) The decoder generates Hindi tokens one at a time. At each step, it applies: (a) Masked self-attention over previously generated Hindi tokens (cannot see future tokens). (b) Cross-attention where Q comes from the decoder (current Hindi state) and K, V come from the encoder output (English representation). This allows the decoder to 'look at' the relevant English words while generating each Hindi word. (c) Feed-forward network. (3) The decoder predicts 'mujhe', then uses [mujhe] to predict 'coding', then [mujhe, coding] to predict 'pasand', and so on until it generates an end token.
Question 12
Hard
What makes BERT's tokenizer different from simple word splitting? What is WordPiece tokenization?
WordPiece splits words into subword units.
BERT uses WordPiece tokenization, which splits words into subword units from a fixed vocabulary (typically 30,000 tokens). Common words remain whole ('the', 'and', 'cat'), but rare or unknown words are split into known subwords. Example: 'preprocessing' might become ['pre', '##process', '##ing'], where ## indicates continuation of a previous token. 'Unbelievable' might become ['un', '##bel', '##ie', '##va', '##ble']. Benefits: (1) Fixed vocabulary handles any input (no out-of-vocabulary problem). (2) Shared subwords capture morphology ('preprocessing', 'postprocessing' both contain 'processing'). (3) Compact vocabulary -- 30K tokens can represent any English text. Limitation: one word may become multiple tokens, so token-level predictions do not align 1-to-1 with words.
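A toy greedy longest-match splitter in the spirit of WordPiece (the vocabulary here is invented for illustration; real BERT uses a learned ~30K-token vocabulary):

```python
# Toy greedy longest-match subword splitter, WordPiece-style.
# VOCAB is made up for this example, not BERT's real vocabulary.
VOCAB = {"pre", "process", "ing", "##process", "##ing", "un", "##bel"}

def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:                 # try the longest piece first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # ## marks a continuation
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]               # no known piece fits
        start = end
    return pieces

print(wordpiece("preprocessing", VOCAB))   # ['pre', '##process', '##ing']
```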
Multiple Choice Questions
MCQ 1
What is the key innovation of the Transformer architecture?
Answer: C
C is correct. The Transformer replaced recurrence (RNNs) with attention mechanisms, enabling parallel processing and better long-range dependencies. This is the key insight of 'Attention Is All You Need'.
MCQ 2
In the attention mechanism, what does the Query (Q) represent?
Answer: B
B is correct. The Query represents what the current token is searching for. It is compared against Keys to determine relevance, then the corresponding Values are weighted and summed.
MCQ 3
Which Transformer variant is BERT?
Answer: B
B is correct. BERT uses only the encoder with bidirectional attention. It processes the full input context simultaneously, making it ideal for understanding tasks (classification, NER, QA).
MCQ 4
What does GPT stand for?
Answer: B
B is correct. GPT = Generative Pre-trained Transformer. It is pre-trained on next-token prediction and excels at generating text.
MCQ 5
Why is positional encoding needed in Transformers?
Answer: B
B is correct. Transformers process all tokens in parallel with no recurrence, so they have no inherent sense of word order. Positional encoding adds position information to the input embeddings.
MCQ 6
What is the attention formula?
Answer: B
B is correct. Scaled dot-product attention: compute Q*K^T (similarity), scale by sqrt(d_k), apply softmax (normalize to probabilities), multiply by V (weighted sum of values).
MCQ 7
What does multi-head attention achieve that single-head attention cannot?
Answer: B
B is correct. Each attention head can specialize in a different type of relationship (syntax, semantics, position). Single-head attention can only capture one type of relationship per layer.
MCQ 8
What is causal masking in GPT?
Answer: B
B is correct. Causal masking uses a lower-triangular attention mask so each token can only attend to itself and previous tokens. This prevents 'cheating' during generation -- the model cannot look at tokens it has not generated yet.
MCQ 9
What is the recommended learning rate range for fine-tuning a pre-trained Transformer?
Answer: C
C is correct. Fine-tuning uses a very low learning rate (1e-5 to 5e-5) to make small adjustments to pre-trained weights without destroying the learned knowledge. Higher rates cause catastrophic forgetting.
MCQ 10
Which component connects the encoder and decoder in the original Transformer?
Answer: C
C is correct. Cross-attention in the decoder takes Q from the decoder and K, V from the encoder output. This allows the decoder to read and use the encoder's representation of the input.
MCQ 11
What is the computational complexity of self-attention with respect to sequence length n?
Answer: C
C is correct. Self-attention computes an n x n attention matrix (Q * K^T), which requires O(n^2) operations. This quadratic scaling is the main limitation for processing very long sequences.
MCQ 12
How does BERT's Masked Language Modeling work?
Answer: B
B is correct. BERT masks 15% of input tokens (80% replaced with [MASK], 10% with random word, 10% unchanged) and predicts the originals. The bidirectional context (both left and right) is used, which is why BERT builds deep bidirectional representations.
MCQ 13
Why does dividing by sqrt(d_k) in attention prevent training instability?
Answer: B
B is correct. Without scaling, large-dimensional dot products produce very large values. Softmax on large values outputs near-0 or near-1, which has near-zero gradients (saturation). Scaling by sqrt(d_k) keeps values in a range where softmax produces meaningful distributions with healthy gradients.
MCQ 14
What is the main advantage of Transformer over LSTM for NLP tasks?
Answer: C
C is correct. Transformers process all tokens in parallel (vs sequential for LSTM) and connect any two tokens in O(1) path length through attention (vs O(n) for LSTM). This enables faster training and better modeling of long-range dependencies.
MCQ 15
What year was the original Transformer paper published?
Answer: B
B is correct. 'Attention Is All You Need' was published in 2017 by Vaswani et al. at Google. It introduced the Transformer architecture that now powers virtually all state-of-the-art AI systems.
MCQ 16
What does the softmax function do in attention?
Answer: A
A is correct. Softmax converts raw attention scores into probabilities between 0 and 1 that sum to 1. Higher scores get larger probabilities. This determines how much each position contributes to the output.
MCQ 17
What is the role of layer normalization in the Transformer?
Answer: B
B is correct. Layer normalization normalizes the activations of each layer to have zero mean and unit variance. This stabilizes training, especially in deep models (12+ layers), by preventing activation values from growing or shrinking uncontrollably.
MCQ 18
What is the T5 model's approach to NLP tasks?
Answer: B
B is correct. T5 (Text-to-Text Transfer Transformer) frames every NLP task as text-to-text. For classification: input 'classify: This movie is great' -> output 'positive'. For translation: input 'translate: I love AI' -> output 'mujhe AI pasand hai'. This unified framework simplifies multi-task learning.
MCQ 19
How does the feed-forward network in each Transformer layer work?
Answer: B
B is correct. The feed-forward network (FFN) consists of two linear transformations with a ReLU in between: FFN(x) = ReLU(x * W1 + b1) * W2 + b2. Typically W1 expands from d_model to 4*d_model, and W2 projects back to d_model. This provides non-linear transformation capacity.
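The FFN formula above can be sketched in numpy (dimensions from the original paper; the random weights are purely illustrative):

```python
import numpy as np

# Position-wise feed-forward sketch: expand to 4*d_model, ReLU, project back.
d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linears

x = rng.standard_normal((10, d_model))            # 10 token representations
print(ffn(x).shape)   # same shape in and out: (10, 512)
```

The same FFN is applied independently at every position, which is why it is called "position-wise".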
MCQ 20
Which of these tasks can Hugging Face pipeline() perform?
Answer: B
B is correct. Hugging Face pipeline supports many tasks: sentiment-analysis, ner, question-answering, text-generation, summarization, translation, zero-shot-classification, fill-mask, and more. Each task uses an appropriate pre-trained model.
MCQ 21
What is zero-shot classification?
Answer: A
A is correct. Zero-shot classification uses a pre-trained language model to classify text into arbitrary categories without any task-specific training. The model leverages its general language understanding to match text to candidate labels provided at inference time.
MCQ 22
What is the difference between pre-training and fine-tuning a Transformer?
Answer: B
B is correct. Pre-training (e.g., masked language modeling for BERT, next token prediction for GPT) teaches the model general language understanding from billions of tokens of unlabeled text. Fine-tuning then adapts this general knowledge to a specific downstream task (sentiment analysis, NER) using a relatively small labeled dataset.
MCQ 23
In a Transformer with d_model=512 and 8 attention heads, what is the dimension per head?
Answer: B
B is correct. Each head operates on d_model / num_heads = 512 / 8 = 64 dimensions. The 8 heads each produce 64-dimensional outputs, which are concatenated back to 512 dimensions.
MCQ 24
Why is WordPiece tokenization used instead of word-level tokenization in BERT?
Answer: B
B is correct. WordPiece tokenization splits rare or unknown words into known subword units from a fixed vocabulary (e.g., 30K tokens). 'preprocessing' -> ['pre', '##process', '##ing']. This handles any input text, shares morphological information between related words, and eliminates the out-of-vocabulary problem.
MCQ 25
Which AI models are based on the Transformer architecture?
Answer: C
C is correct. The Transformer architecture underpins virtually every major AI model since 2018: BERT (Google), GPT series (OpenAI), T5 (Google), ChatGPT, Claude (Anthropic), Gemini (Google), Vision Transformer, and many more. It is the foundation of modern AI.
Coding Challenges
Coding challenges coming soon.
Need to Review the Concepts?
Go back to the detailed notes for this chapter.
Read Chapter Notes