
Natural Language Processing Online Quiz with Answers


Last updated: February 8, 2026

Note and Disclaimer: All questions, answers, and information provided on this website are intended solely for reference, revision, and knowledge consolidation. We make no guarantee of absolute accuracy, currency, or complete reliability of this material. The content here IS NOT AN OFFICIAL EXAM of any educational institution, university, or certifying body. Users are solely responsible for how they use this information for study, research, or practical application. We accept no legal liability for any errors, damages, or consequences arising from the use of the information on this website.

Welcome to the Natural Language Processing online quiz with answers. This question set helps you review your knowledge in a logical, easy-to-follow way. Choose a question set below to get started. Good luck, and may you come away with plenty of new knowledge.


1. Which NLP task involves converting a sequence of words into a sequence of their corresponding part-of-speech tags?

A. Named Entity Recognition (NER)
B. Part-of-Speech (POS) Tagging
C. Syntactic Parsing
D. Word Sense Disambiguation (WSD)

2. In the context of Transformer architecture, what is the primary function of the Self-Attention mechanism?

A. To apply a positional encoding to input embeddings.
B. To compute the weighted sum of input features based on their relevance to each other.
C. To generate the final output prediction layer.
D. To downsample the sequence length for efficiency.

3. Which NLP technique is most effective for identifying and classifying specific entities like names of people, organizations, and locations in text?

A. Topic Modeling
B. Sentiment Analysis
C. Named Entity Recognition (NER)
D. Machine Translation

4. Why are subword tokenization methods like Byte Pair Encoding (BPE) preferred over simple word tokenization in modern NLP models (e.g., BERT)?

A. BPE requires significantly less computational power during training.
B. BPE effectively handles rare words and Out-Of-Vocabulary (OOV) issues by breaking them into known subwords.
C. BPE eliminates the need for dependency parsing.
D. BPE ensures that every token corresponds to exactly one word.
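
As a rough illustration of how BPE builds its subword vocabulary, the toy routine below repeatedly counts adjacent symbol pairs and merges the most frequent one; rare or unseen words can then be decomposed into these learned subwords instead of becoming OOV tokens. The word counts and the function name are invented for the example.

from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: words is a dict {word: count}; learn merge rules over character sequences."""
    vocab = {tuple(w): c for w, c in words.items()}   # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count                # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # the most frequent pair gets merged
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

print(bpe_merges({"lower": 5, "lowest": 2, "newer": 6, "wider": 3}, num_merges=5))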

5. What is the main limitation of using purely word embeddings (like Word2Vec) that is addressed by contextualized embeddings (like ELMo or BERT)?

A. Word2Vec cannot capture semantic similarity between words.
B. Word2Vec embeddings are static and do not change based on the context of the word’s usage.
C. Word2Vec embeddings require significantly larger training datasets.
D. Word2Vec cannot handle sequences longer than 512 tokens.

6. Which decoding strategy in Neural Machine Translation (NMT) explores all possible sequences generated at each step to find the globally most probable translation?

A. Greedy Search
B. Beam Search with a beam width of 1
C. Exhaustive Search (or Breadth-First Search)
D. Sampling

7. When performing sentiment analysis on product reviews, what specific challenge does Aspect-Based Sentiment Analysis (ABSA) address that standard sentence-level sentiment analysis misses?

A. Determining the emotional intensity of the review.
B. Identifying the overall positive or negative polarity of the entire document.
C. Pinpointing the sentiment expressed towards specific attributes or aspects of the product.
D. Classifying reviews into factual vs. subjective statements.

8. In relation to language modeling, what does the perplexity score fundamentally measure?

A. The computational complexity of the model.
B. How well the probability distribution predicted by the model matches the actual distribution of the test data (lower is better).
C. The number of unique tokens in the vocabulary.
D. The speed at which the model generates text.
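
A small worked example of the perplexity idea, assuming we already have the probabilities the model assigned to each token of a test sequence: perplexity is the exponential of the average negative log-probability, so a model that assigns higher probability to the observed tokens scores lower (better).

import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.25, 0.5, 0.1, 0.4]))   # weaker model -> higher perplexity
print(perplexity([0.6, 0.7, 0.5, 0.8]))    # stronger model -> lower perplexity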

9. Which component in a recurrent neural network (RNN) architecture is designed specifically to regulate the flow of information through the cell state to prevent the vanishing gradient problem?

A. The output layer.
B. The input gate.
C. The forget gate.
D. The hidden state vector.

10. What is the key difference between Syntactic Parsing and Semantic Role Labeling (SRL)?

A. Syntactic Parsing determines the grammatical structure, while SRL determines the predicate-argument structure (who did what to whom).
B. Syntactic Parsing focuses on word meaning, whereas SRL focuses on word order.
C. Syntactic Parsing uses word embeddings, while SRL uses context-free grammars.
D. Syntactic Parsing is used for classification, and SRL is used for generation.

11. In the context of Question Answering (QA) systems based on reading comprehension (like SQuAD), what process is used to identify the exact span of text that answers a given question?

A. Text Summarization
B. Span Prediction
C. Co-reference Resolution
D. Topic Modeling

12. Which metric is crucial for evaluating Machine Translation systems because it measures the overlap of n-grams between the candidate translation and a set of human reference translations?

A. F1 Score
B. BLEU Score
C. Accuracy
D. Jaccard Similarity

13. What is the primary role of the feed-forward network layer within each Transformer block?

A. To apply positional awareness to the tokens.
B. To perform multi-head attention calculations.
C. To process the output of the attention sub-layer independently at each position.
D. To stabilize gradients across the entire sequence.

14. A system that aims to group large collections of documents into meaningful, coherent themes without any predefined labels is primarily utilizing which NLP technique?

A. Supervised Classification
B. Topic Modeling (Unsupervised Clustering)
C. Sequence Labeling
D. Intent Recognition

15. When fine-tuning a large pre-trained language model like BERT for a downstream task, what approach is generally considered most efficient for resource-constrained environments, often involving training only a small fraction of the total parameters?

A. Full fine-tuning of all layers.
B. Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA or Adapter Tuning.
C. Training only the embedding layer.
D. Prompt Engineering only, without any weight updates.

16. What is the primary function of Positional Encoding in the standard Transformer model?

A. To normalize the attention scores.
B. To inject information about the absolute or relative order of tokens in the sequence, as attention itself is order-agnostic.
C. To increase the dimension of the model’s hidden state.
D. To serve as the initial token embedding layer.
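
For reference, a sketch of the sinusoidal positional encoding scheme used in the original Transformer, which produces order-dependent vectors that are added to the token embeddings. The function name and toy sizes are illustrative only.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sin/cos positional encodings as in the original Transformer formulation."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                       # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions use cosine
    return pe

# These vectors are added to token embeddings so order information survives the
# otherwise order-agnostic attention layers.
print(sinusoidal_positional_encoding(seq_len=10, d_model=16).shape)   # (10, 16)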

17. Which natural language generation (NLG) challenge occurs when a model repeatedly generates the same phrase or sentence segment, leading to repetitive and low-quality output?

A. Hallucination
B. Degeneration (or Repetition)
C. Overfitting
D. Catastrophic Forgetting

18. What is the main objective of discourse analysis in advanced NLP applications?

A. To segment text into individual sentences.
B. To analyze the grammatical structure of individual sentences.
C. To study the relationship between sentences and larger units of text structure beyond the sentence level.
D. To identify the sentiment of the first sentence.

19. In the context of transfer learning for NLP, what is the relationship between Masked Language Modeling (MLM) used in BERT and causal language modeling used in GPT?

A. MLM is for sequence-to-sequence tasks, while Causal LM is for classification.
B. MLM trains bidirectionally by predicting masked tokens, while Causal LM trains unidirectionally by predicting the next token in sequence.
C. They are identical training objectives.
D. Causal LM requires supervised input, whereas MLM is unsupervised.

20. Which step in a standard Machine Translation pipeline is typically responsible for handling words that do not appear in the model’s vocabulary during inference?

A. Beam Search
B. Tokenization using BPE or WordPiece
C. Syntactic Parsing
D. Encoder Stack Initialization

21. What is the most direct measure of the effectiveness of a chatbot’s ability to understand user intent and map it to the correct action or response template?

A. Token accuracy
B. Intent Classification Accuracy
C. BLEU score of the generated response
D. Embedding similarity

22. In relation to co-reference resolution, if the sentence is ‘John told Mary that she should leave’, what are the antecedent and the anaphor, respectively, for the pronoun ‘she’?

A. Antecedent: Mary; Anaphor: John
B. Antecedent: John; Anaphor: Mary
C. Antecedent: Mary; Anaphor: she
D. Antecedent: she; Anaphor: Mary

23. Which method is commonly used in the decoder stack of a Transformer model to prevent tokens from attending to future tokens in the target sequence during training?

A. Dropout layers
B. Masked Self-Attention (Look-Ahead Masking)
C. Layer Normalization
D. Bidirectional Context Integration
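
A brief sketch of how such a look-ahead (causal) mask can be applied: disallowed future positions are set to negative infinity before the softmax, so they receive zero attention weight. Shapes and names here are illustrative only.

import numpy as np

def causal_mask(seq_len):
    """Look-ahead mask: position i may attend to positions <= i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Set disallowed (future) positions to -inf before the softmax so they get zero weight."""
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    return weights / weights.sum(-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
print(masked_softmax(scores, causal_mask(4)).round(2))   # upper triangle is all zeros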

24. If a system needs to extract structured data (like date, price, quantity) from unstructured invoices, which specific NLP task is most suitable?

A. Text Generation
B. Information Extraction (IE), specifically Slot Filling or NER.
C. Text Summarization
D. Text Classification

25. What is the purpose of applying Layer Normalization immediately after the Self-Attention and Feed-Forward layers in a standard Transformer block?

A. To introduce non-linearity into the model.
B. To stabilize the learning process by normalizing the summed inputs across the features (hidden dimensions) for each sample.
C. To reduce the computational cost of the attention mechanism.
D. To generate the final output probabilities.

26. When developing an NLP model for formal legal text analysis, which factor related to data collection is often the most critical challenge?

A. The sheer volume of documents available.
B. Obtaining large, high-quality, manually annotated datasets due to domain expertise requirements.
C. The presence of slang and informal language.
D. The lack of available tokenizers for specialized legal jargon.

27. In sequence labeling tasks like POS tagging or NER, the Viterbi algorithm is frequently used because it efficiently finds the single most likely sequence of labels by relying on which property?

A. The independence assumption between adjacent labels (Markov property).
B. The ability to look ahead infinitely.
C. The non-linear nature of the embedding space.
D. The existence of only one possible correct label sequence.
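
To illustrate how the Markov property makes this tractable, here is a toy Viterbi decoder for a two-label tagger: at each step only the best score per previous label needs to be kept, because the transition probability depends only on the immediately preceding label. All probabilities below are made-up example numbers.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely label sequence under first-order Markov (bigram) transitions."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]   # best score ending in each state
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Only the previous label matters: the Markov property the question refers to.
            prev_best = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev_best] * trans_p[prev_best][s] * emit_p[s][obs[t]]
            back[t][s] = prev_best
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "bark": 0.1}, "VERB": {"dogs": 0.1, "bark": 0.7}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))   # ['NOUN', 'VERB']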

28. Which evaluation method is best suited for assessing how well a text summarization model captures the essential information from the source document in a human-readable format, rather than just matching specific words?

A. ROUGE Scores
B. METEOR Score
C. Human evaluation focusing on coherence and factual correctness
D. Perplexity

29. If you are training a language model using the objective of predicting the entire sequence of words in reverse order from the end to the start (e.g., predicting W_t based on W_{t+1}, W_{t+2}, …), which type of language model architecture are you effectively implementing?

A. Autoregressive (Causal) LM
B. Bidirectional LM (like BERT’s MLM setup)
C. Sequence-to-Sequence LM
D. Unidirectional Right-to-Left LM

30. What potential issue arises if a sentiment analysis model, trained predominantly on formal movie reviews, is applied directly to social media posts containing heavy slang and emojis?

A. Catastrophic Overfitting
B. Data Leakage
C. Severe Domain Shift resulting in poor generalization performance.
D. Over-regularization

31. Which of the following techniques is primarily used for reducing the dimensionality of word embeddings while preserving semantic relationships?

A. Principal Component Analysis (PCA) on the embedding matrix
B. Applying a Convolutional Neural Network (CNN) layer
C. Stochastic Gradient Descent (SGD) optimization
D. Using a Recurrent Neural Network (RNN) for sequence generation

32. In Named Entity Recognition (NER), what does the ‘IOB’ tagging scheme stand for?

A. Identify, Organize, Build
B. Inside, Outside, Beginning
C. Initial, Omitted, Boundary
D. Identify, Outline, Base
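
A small illustration of the scheme, assuming entity spans are given as token index ranges: B- marks the first token of an entity, I- marks continuation tokens, and O marks tokens outside any entity. The helper name is invented for the example.

def to_iob(tokens, entities):
    """entities: list of (start_index, end_index_exclusive, type). Returns one IOB tag per token."""
    tags = ["O"] * len(tokens)                     # O = outside any entity
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"                 # B = beginning of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"                 # I = inside (continuation of) an entity
    return tags

tokens = ["Barack", "Obama", "visited", "Hanoi"]
print(to_iob(tokens, [(0, 2, "PER"), (3, 4, "LOC")]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']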

33. Which activation function is most commonly associated with the output layer of a model performing multi-class text classification?

A. ReLU (Rectified Linear Unit)
B. Sigmoid
C. Softmax
D. Tanh (Hyperbolic Tangent)
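
For concreteness, a minimal softmax sketch showing how raw class scores (logits) become a probability distribution over the classes; subtracting the maximum is only for numerical stability and does not change the result.

import numpy as np

def softmax(logits):
    """Map raw class scores to a probability distribution that sums to 1."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])         # logits for 3 classes
probs = softmax(scores)
print(probs.round(3), probs.sum())         # [0.659 0.242 0.099] 1.0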

34. Consider the sentence: ‘The bank is very steep.’ If an NLP system incorrectly tags ‘bank’ as a financial institution instead of a river edge, which specific challenge in NLP is primarily being demonstrated?

A. Tokenization Ambiguity
B. Named Entity Recognition Failure
C. Word Sense Disambiguation (WSD)
D. Part-of-Speech Tagging Error

35. What is the primary function of the ‘Attention Mechanism’ introduced in the Transformer architecture?

A. To strictly process tokens sequentially without parallelism
B. To assign differential importance weights to different input tokens when processing a specific token
C. To replace all embedding layers with a single dense layer
D. To normalize the output probabilities across all layers

36. Which technique involves breaking down text into meaningful units, such as words, punctuation, or subwords, for further processing?

A. Lemmatization
B. Stemming
C. Tokenization
D. Parsing

37. What is the main advantage of using Byte-Pair Encoding (BPE) over simple word-level tokenization, especially for handling Out-of-Vocabulary (OOV) words?

A. BPE always produces exactly one token per word, regardless of complexity.
B. BPE merges frequent character sequences iteratively, allowing rare words to be represented by subword units
C. BPE eliminates the need for pre-training by relying purely on character statistics
D. BPE requires significantly less computational power than word-level tokenization

38. If a system aims to translate a sentence from English to French, which NLP task is this process an example of?

A. Language Modeling
B. Machine Translation (MT)
C. Text Summarization
D. Information Retrieval

39. In the context of modern Neural Machine Translation (NMT), what role does the positional encoding play in the Transformer model?

A. It normalizes the input vector magnitudes to prevent exploding gradients.
B. It injects information about the absolute or relative position of tokens since the self-attention mechanism is permutation invariant.
C. It acts as a residual connection between the encoder and decoder layers.
D. It determines the vocabulary size used for the target language.

40. Which metric is most appropriate for evaluating a system that assigns sentiment scores ranging continuously from -1.0 (very negative) to +1.0 (very positive)?

A. F1 Score
B. Accuracy
C. Mean Squared Error (MSE)
D. BLEU Score

41. Which linguistic level focuses on the structural rules governing how words combine to form phrases and sentences?

A. Morphology
B. Semantics
C. Pragmatics
D. Syntax

42. When analyzing the performance of a spam detection classifier, if the model flags legitimate emails as spam, this error is classified as a:

A. True Negative
B. False Positive
C. True Positive
D. False Negative

43. What is the primary disadvantage of using simple Bag-of-Words (BoW) representation compared to modern embedding methods?

A. BoW requires significantly more computational resources for training.
B. BoW completely fails to capture any semantic or syntactic relationships between words.
C. BoW representation leads to extremely sparse vectors with very high dimensionality.
D. BoW only works for languages without inflectional morphology.

44. In the context of large language models, what does ‘temperature’ control during text generation?

A. The learning rate used during fine-tuning.
B. The speed at which the model processes input tokens.
C. The randomness or creativity of the generated text by scaling the logits before the softmax function.
D. The maximum length of the generated output sequence.
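
A hedged sketch of temperature scaling during sampling: logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the top tokens while high temperatures flatten the distribution. The toy logits and helper name are invented for illustration.

import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by the temperature before softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [3.0, 1.0, 0.2]
print([sample_with_temperature(logits, 0.2, rng) for _ in range(5)])  # low T: almost always token 0
print([sample_with_temperature(logits, 2.0, rng) for _ in range(5)])  # high T: more varied choices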

45. Which NLP task involves identifying and extracting factual triples (Subject-Relation-Object) from text?

A. Co-reference Resolution
B. Relation Extraction
C. Text Summarization
D. Sentiment Analysis

46. What is the core concept underlying ‘Zero-Shot Learning’ in the context of modern LLMs?

A. Training the model exclusively on large unlabeled datasets.
B. The ability to perform a task correctly without seeing any specific labeled examples for that task during fine-tuning.
C. A technique requiring only a single labeled example per class for training.
D. Using only character-level features instead of word embeddings.

47. When dealing with highly inflectional languages (e.g., Turkish, Finnish), which preprocessing step is often more effective than simple stemming for reducing words to a canonical form?

A. Lowercasing
B. Lemmatization
C. Stop Word Removal
D. N-gram Generation

48. In the context of sequence-to-sequence models, what is the role of the ‘Encoder’?

A. To generate the output sequence one token at a time based on the decoder’s context.
B. To process the input sequence and compress its contextual information into a fixed-size context vector (or sequence of vectors).
C. To perform attention weighting across the input sequence only.
D. To directly map input tokens to output tokens without intermediate representation.

49. What does the term ‘Perplexity’ measure when evaluating a language model?

A. The model’s ability to generate diverse and creative outputs.
B. How well the model predicts a sample of text, where lower perplexity indicates better fit.
C. The computational overhead required to run inference.
D. The balance between precision and recall in a classification task.

50. Why is transfer learning crucial in modern NLP?

A. It guarantees that the model will always generalize perfectly to new domains.
B. It allows models trained on massive generic text corpora (pre-training) to be adapted efficiently to specific, data-scarce downstream tasks (fine-tuning).
C. It removes the need for any supervised data collection during the task-specific phase.
D. It ensures that all tokens in the vocabulary have the exact same embedding vector.

51. Which preprocessing step involves replacing words with their base or root forms based on their grammatical function?

A. Stop word removal
B. Stemming
C. Lemmatization
D. Normalization

52. In the BERT model, what is the primary purpose of the Masked Language Modeling (MLM) objective during pre-training?

A. To predict the next sentence in a document sequence.
B. To force the model to learn deep bidirectional representations by reconstructing randomly masked tokens based on their context from both sides.
C. To generate text sequentially, similar to GPT models.
D. To optimize the model for sequence-to-sequence tasks directly.

53. If an NLP system is designed to identify the relationships between clauses and phrases in a sentence, which component is primarily responsible for this structure determination?

A. Syntactic Parser
B. Lexicon Builder
C. Semantic Analyzer
D. Text Vectorizer

54. What is the key difference between Dependency Parsing and Constituency Parsing?

A. Dependency Parsing focuses only on POS tags, while Constituency Parsing focuses on semantic roles.
B. Dependency Parsing models grammatical relationships between individual words (head-dependent), whereas Constituency Parsing groups words into nested phrases (constituents).
C. Dependency Parsing is statistical, and Constituency Parsing is rule-based.
D. Dependency Parsing is used only for machine translation, and Constituency Parsing is for text classification.

55. In evaluating text generation models (like summarization or translation), why is ROUGE used more frequently than BLEU?

A. ROUGE correlates better with human judgment for assessing content overlap in shorter, extractive summaries.
B. ROUGE is computationally cheaper than BLEU.
C. BLEU is designed only for sequential data, while ROUGE is for classification.
D. ROUGE measures fluency, whereas BLEU measures adequacy.

56. When fine-tuning a pre-trained LLM for a downstream task, what is the term for updating only the top classification layer while freezing the weights of the underlying transformer blocks?

A. Full Fine-Tuning
B. Feature Extraction (or Linear Probing)
C. Parameter-Efficient Fine-Tuning (PEFT)
D. Knowledge Distillation

57. What primary challenge does Adversarial Training attempt to address in NLP models?

A. The high computational cost during inference.
B. The model’s vulnerability to small, intentionally crafted perturbations in the input data that cause misclassification.
C. The difficulty in generating coherent long-form text.
D. The bias inherited from the initial training corpus.

58. Which of the following NLP techniques is most closely related to understanding the emotional tone or subjective opinion expressed in a piece of text?

A. Dependency Parsing
B. Part-of-Speech Tagging
C. Sentiment Analysis
D. Term Frequency-Inverse Document Frequency (TF-IDF)

59. If an LLM generates text that is grammatically correct but factually incorrect or nonsensical given the prompt, this phenomenon is often termed:

A. Overfitting
B. Catastrophic Forgetting
C. Hallucination
D. Underfitting

60. What does the ‘Contextualization’ aspect of modern embedding models (like BERT or ELMo) achieve that static embeddings (like Word2Vec) cannot?

A. It reduces the overall vector size without losing information.
B. It generates a unique vector representation for a word based on the specific sentence it appears in.
C. It guarantees that every word has a vector magnitude of 1.
D. It allows for real-time updates during inference.

61. Which preprocessing step in NLP typically involves removing words that carry little semantic meaning, such as ‘the’, ‘a’, and ‘is’, from a text corpus?

A. Stemming
B. Stop word removal
C. Lemmatization
D. Tokenization

62. In the context of Word Embeddings, what is the primary limitation of the traditional Bag-of-Words (BoW) model when compared to modern distributional semantics models?

A. BoW cannot capture word frequency information.
B. BoW fails to account for word order and context.
C. BoW models are too computationally expensive for large corpora.
D. BoW models require complex neural network architectures.

63. Which of the following sequence models is most effective at capturing long-range dependencies in text due to its use of gating mechanisms (input, forget, output gates)?

A. Recurrent Neural Network (RNN)
B. Long Short-Term Memory (LSTM)
C. Convolutional Neural Network (CNN)
D. Transformer Encoder

64. What is the core difference between Stemming and Lemmatization?

A. Stemming handles only nouns, while Lemmatization handles verbs.
B. Stemming is a dictionary-based process, while Lemmatization uses heuristic rules.
C. Stemming reduces words to a crude root, while Lemmatization reduces words to a linguistically valid base form (lemma).
D. Stemming is only used for English, whereas Lemmatization is for inflectional languages.

65. When training a Transformer model for sequence-to-sequence tasks, what is the primary role of the Self-Attention mechanism?

A. To apply positional encoding to input tokens.
B. To compute a weighted sum of input features based on relevance to the current token.
C. To compress the sequence context into a fixed-size vector.
D. To perform non-linear transformations on the output logits.

66. In Named Entity Recognition (NER), which tag schema typically results in shorter sequences of tokens that explicitly define the boundaries and types of entities (e.g., B-PER, I-PER)?

A. IOB (Inside, Outside, Beginning)
B. BIOES (Beginning, Inside, Outside, End, Single)
C. IO (Inside, Outside)
D. BILOU (Beginning, Inside, Last, Outside, Unit)

67. What technique is used to convert word counts into a sparse vector representation where each dimension corresponds to a word in the vocabulary, and the values often represent normalized frequency?

A. Word2Vec Skip-gram
B. TF-IDF Vectorization
C. BERT Tokenization
D. Positional Encoding
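
A compact sketch of TF-IDF weighting without any external library: each document becomes a vector with one dimension per vocabulary word, combining a normalized term frequency with an inverse-document-frequency factor that boosts rare terms. Tokenization here is a naive whitespace split, purely for illustration.

import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF: one dimension per vocabulary word, weighted by term frequency x rarity."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    df = Counter(w for doc in tokenized for w in set(doc))          # document frequency per word
    idf = {w: math.log(len(docs) / df[w]) for w in vocab}           # rarer words get higher weight
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])  # normalized TF times IDF
    return vocab, vectors

vocab, vecs = tfidf_vectors(["the cat sat", "the dog sat", "the dog barked"])
print(vocab)
print([round(x, 2) for x in vecs[0]])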

68. Why are modern Large Language Models (LLMs) often initialized using unsupervised pre-training on massive text datasets?

A. To force the model to learn syntactic rules exclusively.
B. To build general linguistic knowledge and semantic representations efficiently before task-specific fine-tuning.
C. To guarantee perfect factual recall for any domain.
D. To eliminate the need for any downstream labeled data.

69. In Syntactic Parsing, what does a Dependency Parse primarily focus on identifying?

A. The hierarchical structure of clauses based on phrases.
B. The direct relationship (dependency link) between individual words in a sentence.
C. The sequence of parts-of-speech tags for each word.
D. The probability of a sequence of words occurring together.

70. What key innovation distinguishes the Transformer architecture from preceding sequential models like RNNs and LSTMs?

A. The inclusion of residual connections in every layer.
B. The complete reliance on the self-attention mechanism instead of recurrence.
C. The use of Convolutional layers for feature extraction.
D. The necessity for sequential processing of input tokens.

71. Which technique in NLP is crucial for disambiguating the meaning of a word based on its surrounding context, as captured effectively by models like BERT?

A. POS Tagging
B. Word Sense Disambiguation (WSD)
C. N-gram modeling
D. Part-of-Speech Tagging

72. In the process of Machine Translation, what is the primary function of the Encoder in a standard sequence-to-sequence (Seq2Seq) model?

A. To generate the target language sequence token by token.
B. To convert the source sequence into a fixed-dimensional context vector.
C. To apply positional encoding to the source text.
D. To align the source context with the target vocabulary probabilities.

73. What evaluation metric is most suitable for tasks where false positives and false negatives are equally critical, such as spam detection?

A. Accuracy
B. F1-Score
C. Recall
D. Precision
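
As a quick reference for how the F1 score balances both error types, a small sketch computing precision, recall, and their harmonic mean from toy spam-detection labels (1 = spam, 0 = legitimate); the helper name and data are illustrative.

def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall; penalizes both false positives and false negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(f1_score(y_true, y_pred), 3))   # 0.667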

74. If an NLP model is performing poorly on rare or unseen word combinations but well on common phrases, this suggests the model suffers primarily from which issue?

A. Overfitting to the training data.
B. Underfitting the overall complexity.
C. Poor generalization to low-frequency linguistic structures.
D. Insufficient tokenization granularity.

75. Which subfield of NLP focuses on understanding the implied intent and sentiment behind utterances, going beyond simple sentiment polarity?

A. Text Summarization
B. Intent Recognition and Sentiment Analysis
C. Machine Translation
D. Part-of-Speech Tagging

76. In transformer models, what does the term ‘Masked Language Modeling’ (MLM) refer to, as used in BERT’s pre-training objective?

A. Preventing the model from attending to future tokens in the input sequence.
B. Hiding a percentage of input tokens and forcing the model to predict the original masked tokens.
C. Ensuring the model does not generate toxic content during fine-tuning.
D. Limiting the context window size during sequence processing.

77. Which regularization technique is commonly applied during the training of large neural networks for NLP to prevent co-adaptation of neurons?

A. L2 Regularization
B. Early Stopping
C. Dropout
D. Batch Normalization

78. Consider the phrase: ‘The bank approved the loan application.’ Which type of error would result if a simple POS tagger incorrectly labels ‘bank’ as a Noun of Type Organization instead of a Noun of Type Location?

A. Tokenization Error
B. Syntactic Ambiguity Error
C. Lexical Ambiguity Error
D. Morphological Error

79. What is the key characteristic of zero-shot learning in the context of LLMs?

A. The model is trained on a task but tested on a related, but different, task without any new training.
B. The model is trained only on a few examples for the target task.
C. The model performs the task using only the prompt instruction, without any prior fine-tuning on that specific task.
D. The model uses reinforcement learning to adapt its weights immediately.

80. When analyzing the effectiveness of a neural dependency parser, what does measuring the Unlabeled Attachment Score (UAS) signify?

A. The percentage of predicted dependency labels that match the gold standard labels.
B. The percentage of predicted head words (governors) that match the gold standard head words, irrespective of the arc label.
C. The accuracy of predicting the constituent phrases in the sentence structure.
D. The correlation between the predicted sequence and the actual word order.

81. What challenge does the use of subword tokenization methods like Byte Pair Encoding (BPE) primarily aim to mitigate in LLMs?

A. Computational costs associated with attention calculations.
B. The Out-of-Vocabulary (OOV) problem and managing vocabulary size.
C. The inability to capture syntactic structure.
D. Gradient instability during backpropagation.

82. In Information Extraction, what is the key difference between Relation Extraction and Event Extraction?

A. Relation Extraction identifies entities, while Event Extraction identifies actions.
B. Relation Extraction identifies binary relationships between two entities, whereas Event Extraction identifies complex situations involving multiple participants and triggers.
C. Relation Extraction uses deep learning, and Event Extraction uses rule-based systems.
D. Relation Extraction is context-independent, and Event Extraction is context-dependent.

83. When evaluating a Text Summarization model using ROUGE scores, what aspect of the generated summary does the ROUGE-L metric specifically emphasize?

A. The overlap of unigrams between the candidate and reference summaries.
B. The longest common subsequence (LCS) between the candidate and reference summaries.
C. The presence of specific keywords defined in a separate dictionary.
D. The similarity of sentence embeddings between the candidate and reference summaries.
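
To make the ROUGE-L idea concrete, the sketch below computes the longest common subsequence between a candidate and a reference summary with standard dynamic programming, then derives LCS-based precision, recall, and an F-measure. The example sentences are toy data.

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
lcs = lcs_length(reference, candidate)
recall = lcs / len(reference)        # ROUGE-L recall
precision = lcs / len(candidate)     # ROUGE-L precision
print(lcs, round(2 * precision * recall / (precision + recall), 3))   # LCS = 5, F ~ 0.833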

84. What mechanism in the standard Encoder-Decoder architecture (pre-Transformer) was introduced to allow the Decoder to focus dynamically on different parts of the source sentence during generation?

A. Positional Encoding
B. Self-Attention
C. Context Vector Compression
D. Attention Mechanism

85. If a developer wants to build a system that can accurately determine if a product review is ‘Angry,’ ‘Joyful,’ or ‘Frustrated’ (beyond simple positive/negative), which NLP task is most appropriate?

A. Aspect-Based Sentiment Analysis
B. Text Generation
C. Emotion Detection (Fine-grained Sentiment Analysis)
D. Topic Modeling

86. Which architecture is primarily responsible for enabling modern LLMs to process input sentences in parallel rather than sequentially?

A. Gated Recurrent Units (GRU)
B. Recurrent Neural Networks (RNN)
C. The Transformer’s Self-Attention Layer
D. Convolutional Neural Networks (CNN)

87. What is the purpose of applying ‘Beam Search’ during the decoding phase of sequence generation models (e.g., Machine Translation)?

A. To ensure the output sequence is always the single most probable path.
B. To explore multiple promising sequences simultaneously and select the globally best one based on cumulative probability.
C. To enforce diversity by selecting varied token choices at each step.
D. To accelerate training by pruning low-probability branches early.
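
A minimal sketch of the beam search loop, assuming a hypothetical step_fn that returns a next-token distribution for a given prefix: the beam_width best partial sequences by cumulative log-probability are kept at each step (a beam width of 1 reduces to greedy search).

import math

def beam_search(step_fn, start_token, beam_width, max_len):
    """Keep the beam_width best partial sequences by cumulative log-probability at each step."""
    beams = [([start_token], 0.0)]                       # (sequence, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():     # next-token distribution given the prefix
                candidates.append((seq + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Toy next-token model: a fixed distribution regardless of the prefix (illustrative only).
step_fn = lambda seq: {"a": 0.5, "b": 0.3, "c": 0.2}
for seq, score in beam_search(step_fn, "<s>", beam_width=2, max_len=3):
    print(seq, round(score, 3))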

88. Which technique is used to transform sparse, high-dimensional word vectors (like those from Count-based models) into dense, lower-dimensional vectors that capture semantic similarity?

A. One-Hot Encoding
B. Principal Component Analysis (PCA)
C. Word Embedding Models (e.g., Word2Vec)
D. N-gram Frequency Analysis

89. If a classification model performs well on the training data but poorly on the validation data, which issue is most likely present?

A. Underfitting
B. High Variance (Overfitting)
C. High Bias
D. Dataset Shift

90. What role does Positional Encoding play in the Transformer architecture?

A. It stabilizes the gradients during deep network training.
B. It injects information about the relative or absolute position of tokens into the input embeddings, since attention is position-agnostic.
C. It calculates the attention weights between the Query and Key vectors.
D. It performs non-linear activation on the final output logits.

91. What is the primary goal of the Tokenization step in Natural Language Processing (NLP)?

A. To convert words into their base forms by removing inflections.
B. To divide a text into smaller meaningful units, such as words or subwords.
C. To assign a grammatical category (like noun, verb, or adjective) to each word.
D. To determine the overall sentiment (positive, negative, or neutral) of the text.

92. Which technique is used to reduce inflected words to their base or dictionary form (e.g., ‘running’ to ‘run’)?

A. Stop Word Removal
B. Lemmatization
C. N-gram generation
D. Part-of-Speech (POS) Tagging

93. In the context of Bag-of-Words (BoW) models, what information is lost when calculating term frequency (TF)?

A. The frequency of the term across the entire corpus.
B. The importance of the term relative to the corpus.
C. The order or sequence of the words in the document.
D. The presence of specific named entities.

94. What does the Inverse Document Frequency (IDF) component in TF-IDF aim to measure?

A. How frequently a term appears in a single document.
B. How rare or unique a term is across the entire document collection.
C. The grammatical role of the term within a sentence.
D. The semantic similarity between two adjacent words.

95. Which NLP technique is most effective for tasks requiring the understanding of context, polysemy, and subtle semantic relationships within sentences?

A. One-Hot Encoding of words
B. Traditional N-gram modeling
C. Contextual Word Embeddings (e.g., BERT)
D. Term Frequency (TF) calculation

96. In Syntax Parsing, what is the primary output of a Dependency Parser?

A. A hierarchical structure showing constituents (phrases) and their relationships.
B. A sequence of grammatical tags for each word.
C. A graph showing word-to-word grammatical relationships (head-dependent).
D. A sequence of tokens representing the normalized base forms of words.

97. Which component of a pipeline typically performs Named Entity Recognition (NER) to identify and classify proper nouns like names, locations, and organizations?

A. The POS Tagger module.
B. The Syntactic Analyzer module.
C. The Sequence Labeling module.
D. The Stemmer module.

98. What challenge does Word Sense Disambiguation (WSD) primarily address in NLP?

A. Handling morphological variations of words.
B. Determining the correct meaning of a word based on its context when it has multiple meanings.
C. Identifying the sentiment expressed in informal language.
D. Translating text between languages with vastly different grammatical structures.

99. If a neural network model for Machine Translation uses an Encoder-Decoder architecture, what is the main role of the Encoder?

A. Generating the output sequence word by word in the target language.
B. Compressing the input source sentence into a fixed-length context vector.
C. Directly predicting the probability distribution of the next word.
D. Applying attention mechanisms to highlight crucial source tokens.

100. What advancement did the introduction of the Attention Mechanism primarily bring to Sequence-to-Sequence (Seq2Seq) models, particularly in Machine Translation?

A. Eliminating the need for tokenization.
B. Allowing the decoder to focus on relevant parts of the source input at each decoding step.
C. Replacing recurrent units (RNNs) entirely with feed-forward layers.
D. Enabling the model to perform unsupervised learning.

101. Which NLP task is fundamentally about mapping unstructured text to structured data representing relationships between entities?

A. Text Summarization
B. Relation Extraction
C. Text Generation
D. Sentiment Analysis

102. What is the main drawback of using static word embeddings (like Word2Vec or GloVe) compared to contextual embeddings (like ELMo or BERT)?

A. Static embeddings require significantly more training data.
B. Static embeddings assign only one fixed vector to a word regardless of context.
C. Static embeddings cannot handle rare words effectively.
D. Static embeddings are inherently less effective for sequence tasks.

103. In corpus linguistics, what term refers to pairs or sequences of words that occur together more often than expected by chance?

A. Stop Words
B. Collocations
C. Synonyms
D. Co-references

104. Which evaluation metric is most appropriate for a text generation task (like summarization or translation) where the generated output must closely match a reference output based on word overlap?

A. F1-Score
B. Accuracy
C. BLEU Score (Bilingual Evaluation Understudy)
D. Mean Squared Error (MSE)

105. When fine-tuning a large pre-trained language model (PLM) like BERT for a downstream task, what practice is generally employed to preserve general knowledge while adapting to the specific task?

A. Training only the final classification layer and freezing all other PLM layers.
B. Using an extremely high learning rate to rapidly adapt the weights.
C. Training with a significantly lower learning rate than initial pre-training.
D. Removing all attention mechanisms from the pre-trained layers.

106. What issue does Coreference Resolution aim to solve in text understanding?

A. Identifying the tone or emotion expressed by the speaker.
B. Linking all expressions that refer to the same real-world entity in a text.
C. Determining the grammatical subject and object of a verb.
D. Translating idiomatic expressions accurately.

107. Which technique helps mitigate the problem of feature sparsity when representing text in high-dimensional, discrete spaces?

A. Removing punctuation marks.
B. Applying Principal Component Analysis (PCA) to the feature matrix.
C. Using dense, continuous word embeddings.
D. Increasing the maximum document length for feature extraction.

108. What is the primary limitation of Rule-Based Sentiment Analysis systems compared to Machine Learning approaches?

A. They cannot handle complex negation structures accurately.
B. They require massive amounts of labeled training data.
C. They are difficult and time-consuming to scale and maintain across different domains.
D. They are inherently poor at recognizing sarcasm or irony.

109. In the context of language modeling, what is perplexity?

A. The computational cost required to process a sentence.
B. A measure of how well a probability distribution predicts a sample, where lower values indicate better performance.
C. The complexity arising from multiple possible meanings of a word in a sentence.
D. The difficulty in manually labeling a dataset for supervised learning.

110. Which architecture is foundational to modern Transformer models and allows for parallel processing of sequential data?

A. Recurrent Neural Network (RNN)
B. Convolutional Neural Network (CNN)
C. Self-Attention mechanism
D. Long Short-Term Memory (LSTM)

111. What is ‘Zero-Shot Learning’ in the context of large language models (LLMs)?

A. Training the model exclusively on unlabeled data.
B. The ability to perform a task without any task-specific labeled examples, relying only on the prompt.
C. A technique where all parameters are updated during inference.
D. Using only the initial token embedding before processing.

112. When performing Morphological Analysis, what is the main difference between Stemming and Lemmatization?

A. Stemming handles inflections while Lemmatization handles derivation.
B. Stemming is dictionary-based, while Lemmatization is rule-based.
C. Stemming is faster but often produces non-dictionary words, while Lemmatization yields linguistically valid words.
D. Stemming only works for nouns, whereas Lemmatization works for all parts of speech.

113. What is the primary challenge that Chunking (Shallow Parsing) attempts to solve, which full parsing might overlook?

A. Identifying the exact sentiment polarity of each clause.
B. Grouping adjacent words into grammatically related phrases (Noun Phrases, Verb Phrases) without building a full sentence tree.
C. Resolving pronoun coreference across multiple sentences.
D. Generating context-aware word embeddings.

114. In the context of Dialog Systems, what distinguishes a Retrieval-Based model from a Generative model?

A. Retrieval models always require domain-specific ontologies, while generative models do not.
B. Retrieval models select responses from a predefined set, whereas generative models create novel responses.
C. Generative models are restricted to short answers, while retrieval models handle long paragraphs.
D. Retrieval models are generally faster but less context-aware than generative models.

115. Which technique is essential for ensuring that the spatial or sequential relationships between tokens are maintained when processing input sequences in a Transformer model?

A. Using deep residual connections.
B. Applying Layer Normalization after each sub-layer.
C. Incorporating Positional Encodings into the input embeddings.
D. Employing dropout regularization.

116. When dealing with highly domain-specific medical or legal text, what is the most crucial adaptation needed when leveraging a general-purpose LLM?

A. Switching the model architecture from Transformer to RNN.
B. Conducting further pre-training (Domain Adaptive Pre-training) on in-domain corpora.
C. Reducing the vocabulary size drastically to limit noise.
D. Disabling all non-linear activation functions in the model.

117. What concept does the term ‘Bias-Variance Trade-off’ refer to in the context of applying NLP models?

A. The balance between the time taken for training versus inference.
B. The trade-off between simplifying the model (high bias, low variance) and fitting the training data perfectly (low bias, high variance).
C. The balance between using word embeddings versus one-hot encoding.
D. The trade-off between recall and precision in classification tasks.

118. Which phase of NLP typically involves creating a graphical representation of the relationships between subjects, verbs, and objects in a sentence?

A. Lexical Analysis (Tokenization)
B. Syntactic Analysis (Parsing)
C. Morphological Analysis (Stemming/Lemmatization)
D. Pragmatic Analysis (Discourse Processing)

119. What is the fundamental difference between traditional Statistical Machine Translation (SMT) and Neural Machine Translation (NMT)?

A. SMT relies solely on linguistic rules, while NMT uses deep learning.
B. SMT learns discrete phrase-to-phrase translation models, while NMT learns an end-to-end mapping using continuous representations.
C. SMT handles long-range dependencies much better than NMT.
D. NMT requires parallel corpora for training, whereas SMT can use monolingual data.

120. If an NLP application frequently misclassifies short, informal social media posts due to slang and abbreviations, which pre-processing step might be inadequate or missing?

A. Part-of-Speech tagging.
B. Normalization or slang expansion techniques.
C. Stop word removal.
D. TF-IDF weighting.

121. Which of the following techniques is primarily used in Natural Language Processing (NLP) to reduce the dimensionality of word representation by compressing context into fewer features, often utilized for tasks like text classification?

A. Part-of-Speech (POS) Tagging
B. Latent Semantic Analysis (LSA)
C. Named Entity Recognition (NER)
D. Tokenization

122. In the context of transformer models, what is the primary function of the ‘Self-Attention’ mechanism?

A. To generate new tokens sequentially based on previous outputs.
B. To weigh the importance of different words in the input sequence relative to a specific word when computing its representation.
C. To enforce a strict sequential order constraint on all input tokens.
D. To perform basic lexical stemming and lemmatization on the input.

123. Which NLP model architecture is best suited for tasks requiring understanding long-range dependencies across very long texts, such as summarizing entire documents, due to its ability to process tokens in parallel?

A. Recurrent Neural Networks (RNNs) with simple backpropagation.
B. Long Short-Term Memory (LSTM) networks.
C. Transformer architecture.
D. Hidden Markov Models (HMMs).

124. When processing Vietnamese text, which preprocessing step is most crucial for accurately segmenting words, given that spaces separate syllables rather than words, so the boundaries of multi-syllable words are not explicitly marked?

A. Stop Word Removal
B. Word Segmentation (Tokenization)
C. Stemming
D. Lemmatization

125. Consider the sentence: ‘The bank authorized the transaction.’ If we are conducting dependency parsing, what grammatical relationship is most likely established between ‘bank’ and ‘authorized’?

A. Object (OBJ)
B. Adverbial Modifier (ADVMOD)
C. Subject (NSUBJ)
D. Direct Object (DOBJ)

126. What is the primary difference between Word Embeddings (like Word2Vec) and Contextualized Embeddings (like BERT’s output)?

A. Word Embeddings use static vectors, whereas Contextualized Embeddings change based on the surrounding text.
B. Word Embeddings require pre-training on large corpora, while Contextualized Embeddings do not.
C. Word Embeddings capture syntactic information, while Contextualized Embeddings capture only semantic information.
D. Word Embeddings use attention mechanisms, while Contextualized Embeddings use Recurrent layers.

127. In the sequence-to-sequence model for Machine Translation, which component is responsible for processing the source language input into a context vector?

A. The Decoder
B. The Attention Mechanism
C. The Encoder
D. The Output Layer

128. What phenomenon describes a situation where a single word in a language has multiple distinct meanings (e.g., ‘bank’ as a financial institution or a river edge)?

A. Lexical Ambiguity
B. Syntactic Ambiguity
C. Referential Ambiguity
D. Pragmatic Ambiguity

129. Which evaluation metric is most appropriate for assessing the performance of a Named Entity Recognition (NER) system, as it accounts for precision and recall simultaneously?

A. Accuracy
B. Mean Squared Error (MSE)
C. F1 Score
D. BLEU Score

130. What is the purpose of applying ‘Masked Language Modeling’ (MLM) during the pre-training of models like BERT?

A. To predict the next sentence in a sequence.
B. To reconstruct masked tokens based on their surrounding context.
C. To ensure the model only learns context from the left side.
D. To generate novel, grammatically correct sentences.

131. In Machine Reading Comprehension (MRC), which approach typically frames the task as identifying the span of text in the context document that directly answers a given question?

A. Generative MRC
B. Extractive MRC
C. Abstractive MRC
D. Knowledge-based MRC

132. Which technique addresses the issue of representing rare or unseen words in a vocabulary by breaking words down into smaller meaningful units?

A. One-Hot Encoding
B. Subword Tokenization (e.g., BPE or WordPiece)
C. Frequency Filtering
D. Embedding Lookup Tables

133. What fundamental limitation of traditional N-gram language models is overcome by modern neural language models (like LSTMs or Transformers)?

A. Difficulty handling morphology.
B. Inability to capture long-range dependencies effectively due to fixed context window.
C. High computational cost during training.
D. Dependence on large labeled datasets.

134. In text generation using a beam search decoding strategy, what does the beam size parameter control?

A. The maximum length of the generated sequence.
B. The number of candidate sequences kept at each decoding step.
C. The penalty applied to repeated N-grams.
D. The temperature used for sampling probabilities.

135. Which phenomenon in statistical NLP requires methods like smoothing (e.g., Kneser-Ney) to ensure non-zero probability estimates for unseen N-grams?

A. Polysemy
B. Data Sparsity (Zero-Probability Problem)
C. Bag-of-Words bias
D. Overfitting to the training corpus

136. What is the main challenge when using the Bag-of-Words (BoW) representation compared to modern embedding methods?

A. BoW cannot handle large vocabularies.
B. BoW loses crucial word order and semantic similarity information.
C. BoW representation vectors are inherently high-dimensional and sparse.
D. BoW cannot be used for unsupervised tasks.

137. In the context of sentiment analysis using supervised learning, what is the role of ‘Transfer Learning’ when fine-tuning a pre-trained language model like RoBERTa for a specific domain (e.g., legal documents)?

A. It ignores the pre-trained knowledge to ensure domain specificity.
B. It leverages general language understanding acquired during pre-training to accelerate and improve performance on the specific, often smaller, downstream task.
C. It involves training the model from scratch using only the target domain data.
D. It is primarily used for sequence generation tasks, not classification.

138. Which evaluation method is standard for generative tasks (like summarization or translation) where multiple correct or plausible outputs exist, focusing on the overlap between the generated text and human references?

A. F1 Score
B. ROUGE Score
C. Area Under the Curve (AUC)
D. Precision/Recall Curve

139. What is the specific challenge that arises when applying traditional embedding methods (like Word2Vec) directly to morphologically rich languages (like Turkish or German)?

A. The vocabulary becomes too small.
B. The model cannot capture syntactic structure.
C. Inflectional variations lead to high sparsity as many word forms are treated as distinct tokens.
D. The resulting vectors are always binary.

140. Which parsing strategy aims to construct a full syntactic tree of a sentence by integrating local information from sub-trees, often relying on dynamic programming techniques?

A. Transition-based Parsing
B. Constituency Parsing (via CKY Algorithm)
C. Head-Driven Phrase Structure Grammar (HPSG) Parsing
D. Dependency Parsing

141. When training a deep neural network for sequence modeling, what is the specific issue known as ‘vanishing gradients’ associated with?

A. The weights becoming excessively large during training.
B. The gradients becoming infinitesimally small as they propagate backward through many layers, hindering learning of early layers.
C. The network failing to converge to a local minimum.
D. The model learning too quickly and oscillating around the optimal solution.

142. In the context of dialog systems, what differentiates a Goal-Oriented Dialog System from a Chitchat Bot?

A. Goal-Oriented systems use statistical models, while Chitchat bots use rule-based systems.
B. Goal-Oriented systems aim to complete specific tasks (e.g., booking a flight), whereas Chitchat bots focus on open-domain, engaging conversation.
C. Chitchat bots require explicit state tracking, while Goal-Oriented systems do not.
D. Goal-Oriented systems rely solely on retrieval methods, while Chitchat bots use end-to-end generation.

143. Which concept is crucial in modern NLP for handling positional information in sequential data when the standard recurrent structure is replaced by parallel processing, as seen in Transformers?

A. Gating mechanisms
B. Positional Encoding
C. Dropout regularization
D. Backpropagation Through Time (BPTT)

144. What does the concept of ‘Interpretability’ or ‘Explainability’ (XAI) in NLP primarily aim to achieve for deep learning models?

A. To ensure 100% generalization capability across all future datasets.
B. To provide insights into why a model made a specific prediction by identifying influential input features or decision paths.
C. To reduce the computational complexity during inference time.
D. To guarantee that the model’s output is always factually correct.

145. In probabilistic parsing, what is the primary role of the Probabilistic Context-Free Grammar (PCFG) rule probabilities?

A. To determine the order in which rules are applied.
B. To assign a probability score to a fully derived syntactic structure.
C. To define the maximum depth of the parse tree.
D. To identify named entities within the sentence.

146. Which technique focuses on refining word embeddings based on the context they appear in during inference, rather than relying solely on pre-trained static vectors?

A. Stemming
B. Contextualization via fine-tuning or dynamic attention layers
C. TF-IDF calculation
D. Principal Component Analysis (PCA)

147. What is the main drawback of using pure Lexical Semantics models (like WordNet) over modern Distributional Semantics models (like BERT) for understanding word meaning?

A. Lexical models cannot distinguish between polysemous senses of a word.
B. Lexical models are too large to store efficiently.
C. Lexical models require manual curation and struggle to capture subtle, evolving, or domain-specific usages not explicitly defined.
D. Lexical models only work for high-resource languages.

148. In evaluating the quality of machine translation outputs, what does a high BLEU score primarily indicate?

A. The fluency and grammatical correctness of the target sentence.
B. The length of the generated translation compared to the reference.
C. The precision of matching N-grams between the candidate translation and the reference translation(s).
D. The recall of semantic concepts present in the source sentence.
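
For illustration, a toy computation of the clipped (modified) n-gram precision that BLEU combines over several n-gram orders: n-gram counts in the candidate are clipped to the maximum count observed in the reference. The example sentences and helper name are invented.

from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core quantity BLEU averages over several n."""
    cand_ngrams = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / max(sum(cand_ngrams.values()), 1)

candidate = "the cat is on the mat".split()
reference = "the cat sat on the mat".split()
print(round(modified_ngram_precision(candidate, reference, 1), 3))   # unigram precision ~0.833
print(round(modified_ngram_precision(candidate, reference, 2), 3))   # bigram precision 0.6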

149. Which NLP task is typically solved using a process where the input is decomposed into constituent phrases (noun phrases, verb phrases, etc.) to form a hierarchical structure?

A. Sentiment Classification
B. Dependency Parsing
C. Constituency Parsing (Phrase Structure Parsing)
D. Machine Translation Decoding

150. What is the primary function of the softmax activation layer in the output stage of a standard sequence classification model?

A. To introduce non-linearity between layers.
B. To convert the raw scores (logits) into a probability distribution summing to one.
C. To regularize the model weights.
D. To normalize the input features before the first layer.
