Robot AI -- Natural Language Processing

Memory tricks that make tokenization, BERT and LLMs click

From bag-of-words to transformer language models -- these memory tricks lock in the NLP pipeline, text representations, core tasks, and the prompt engineering that makes LLMs useful.

Natural Language Processing

Memory Tricks

Proven Mnemonics & Acronyms -- fast to learn, hard to forget.

NLP Pipeline
NLP PIPELINE -- Tokenize, Normalize, Represent, Model, Evaluate
RAW TEXT TO PREDICTIONS IN FIVE STEPS
Modern LLMs skip most preprocessing -- trained end-to-end with BPE subword tokenization
Tokenization: split text into tokens. Normalization: lowercase, handle punctuation, expand contractions. Optional classic cleanup between Normalize and Represent: stop word removal (drop high-frequency, low-information words) and stemming/lemmatization (reduce words to a root form). Representation: convert tokens to numerical vectors (BoW, TF-IDF, embeddings). Model: ML or DL algorithm. Evaluate: task-specific metric.
Tokenization
Split text into words, subwords, or characters
Normalization
Lowercase, punctuation handling, contraction expansion
Lemmatization
Returns the actual dictionary base form (usually more accurate than stemming)
Modern LLMs
Skip most preprocessing -- end-to-end with BPE tokenization
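
A minimal sketch of the classic preprocessing steps above, using only the Python standard library. The stop-word list and the `crude_stem` suffix stripper are illustrative assumptions, not a real stemmer or lemmatizer; a production pipeline would use a library for both.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def tokenize(text):
    """Normalize by lowercasing, then split into word tokens (apostrophes kept inside words)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def remove_stop_words(tokens):
    """Drop high-frequency, low-information words."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize + normalize, remove stop words, then stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The cats are chasing the mice in the garden"))
# ['cat', 'chas', 'mice', 'garden']
```

Note how the crude stemmer produces "chas" and leaves "mice" alone, where a lemmatizer would return the dictionary forms "chase" and "mouse".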
Text Representation
BOW to TF-IDF to Word2Vec to BERT -- from counting words to understanding meaning
FOUR GENERATIONS OF TEXT REPRESENTATION -- EACH MORE POWERFUL
BERT gives the same word different vectors based on context -- bank in finance vs bank of a river
Bag of Words (BoW): word counts, ignores order and context, sparse. TF-IDF: down-weights words that appear in most documents so rare domain terms count more, still sparse. Word2Vec/GloVe: dense semantic vectors, king - man + woman ≈ queen, one fixed vector per word. BERT / contextual embeddings: the same word gets a different embedding depending on the surrounding context. Most modern NLP systems use contextual embeddings.
BoW
Word counts -- sparse, ignores order and context
TF-IDF
Weights rare words higher -- still sparse but smarter
Word2Vec
Dense semantic vectors -- arithmetic works, but context-free
BERT / contextual
Same word, different vector based on context -- most powerful
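
To make the first two generations concrete, here is a small sketch of BoW and TF-IDF vectors over a three-document toy corpus (the corpus is made up for illustration). It uses the plain unsmoothed IDF, log(N / df); libraries typically apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the bank approved the loan",
    "the river bank flooded",
    "the loan was repaid",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(tokens):
    """Raw word counts, one slot per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(word):
    """Unsmoothed inverse document frequency: log(N / document frequency)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    """Term frequency (count / document length) times IDF."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf(w) for w in vocab]

print(vocab)
print(bow_vector(tokenized[0]))    # [1, 1, 0, 1, 0, 0, 2, 0]
print(tfidf_vector(tokenized[0]))  # "the" scores 0: it appears in every document
```

Both vectors give "bank" the same slot in the finance and river documents, which is exactly the limitation contextual embeddings address.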
BERT vs GPT
BERT reads the whole sentence (bidirectional). GPT reads left-to-right only.
ENCODER FOR UNDERSTANDING -- DECODER FOR GENERATION
BERT is not built to generate text. GPT cannot see the right-hand context. Each is excellent at what it was designed for.
BERT (Encoder-only, Bidirectional): sees the entire sequence at once, trained on masked language modeling, excellent for classification, NER, and extractive question answering, not designed for open-ended text generation. GPT (Decoder-only, Autoregressive): sees only the left context, trained on next-token prediction, excellent for text generation, creative writing, code, conversation. T5/BART: Encoder-Decoder -- well suited to translation and summarization.
BERT encoder-only
Bidirectional, masked LM, understanding tasks: classification, NER, QA
GPT decoder-only
Left-context only, next-token prediction, generation tasks
T5 and BART
Encoder-Decoder -- translation, summarization, seq2seq tasks
Why GPT dominates now
Scale + emergent few-shot abilities -- can handle many BERT-style tasks via prompting
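
One way to picture the difference is the attention mask: row i lists which positions token i may look at (1 = visible, 0 = hidden). This is a simplified sketch with hypothetical helper names; real models apply such masks inside the attention computation.

```python
def bidirectional_mask(n):
    """BERT-style: every token can attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style: token i can attend only to positions 0..i (its left context and itself)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in bidirectional_mask(3):
    print(row)
# [1, 1, 1]
# [1, 1, 1]
# [1, 1, 1]

for row in causal_mask(3):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```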
Tokenization (BPE)
BPE -- Byte Pair Encoding: merge the most frequent adjacent symbol pair, repeat until the target vocabulary size is reached
NO TRUE OUT-OF-VOCABULARY WORDS WITH SUBWORD TOKENIZATION
Rule of thumb: 750 English words are roughly 1000 tokens -- many other languages need more tokens per word
BPE Algorithm: start with a character (or byte) vocabulary, count all adjacent symbol pairs, merge the most frequent pair into a new token, repeat until the target vocabulary size is reached. Result: common words become single tokens, rare words are split into subword units. No true OOV words -- any word can be built from known pieces. GPT models: roughly 50K-100K token vocabulary. Token count affects cost, speed, and context window utilization.
BPE algorithm
Merge the most frequent adjacent symbol pair iteratively until the target vocab size is reached
No OOV words
Any word can be split into known subword units
Token counting
750 English words ~ 1000 tokens. Other languages: more tokens per word.
Vocabulary size
GPT models: 50K-100K token vocabulary
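
The merge loop can be sketched in a few lines. This is a toy trainer over a made-up word-frequency table, with a fixed number of merges standing in for "until the target vocabulary size is reached"; production tokenizers add byte-level handling, pre-tokenization, and much faster data structures.

```python
from collections import Counter

def get_pair_counts(corpus):
    """corpus maps a tuple of symbols (one word) to that word's frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, corpus = learn_bpe(word_freqs, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(corpus)
```

After three merges the frequent word "hug" is a single token, while the rarer "pug" is still split into "p" + "ug".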
Prompt Engineering
CLEAR -- Context, Length, Examples, Ask specifically, Role assignment
BETTER PROMPT = BETTER OUTPUT -- PROMPT ENGINEERING IS A SKILL
Chain-of-Thought: adding "Let's think step by step" often improves accuracy on multi-step reasoning problems
Zero-shot: just ask -- no examples. Few-shot: provide 2-5 examples of the desired format. Chain-of-Thought (CoT): add "Let's think step by step" -- often improves math, logic, and multi-step reasoning. Role prompting: "Act as an expert..." -- calibrates vocabulary, depth, style. Structured output: "Respond in JSON format." RAG: retrieve context to ground the LLM in real facts and reduce hallucination.
Zero-shot
Just ask -- no examples needed for simple clear tasks
Few-shot
Give 2-5 examples -- improves consistency and format
Chain-of-Thought
"Let's think step by step" -- often a large improvement on multi-step reasoning
RAG
Retrieve documents, inject as context -- reduces hallucination
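
The zero-shot, few-shot, role, and chain-of-thought patterns are all just ways of assembling a prompt string. A small sketch follows; `build_prompt` and its "Input:/Output:" layout are hypothetical choices for illustration, not a required format for any particular model.

```python
def build_prompt(task, examples=(), question="", role=None, chain_of_thought=False):
    """Assemble a prompt from an optional role, a task, optional examples, and the question."""
    parts = []
    if role:
        parts.append(f"You are {role}.")             # role prompting
    parts.append(task)                               # clear, specific ask
    for example_input, example_output in examples:   # few-shot examples (none = zero-shot)
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {question}\nOutput:")
    prompt = "\n\n".join(parts)
    if chain_of_thought:
        prompt += " Let's think step by step."       # chain-of-thought trigger
    return prompt

few_shot = build_prompt(
    "Classify the sentiment as positive or negative.",
    examples=[("I loved it", "positive"), ("Terrible service", "negative")],
    question="The food was great",
    role="a careful sentiment analyst",
)
print(few_shot)

cot = build_prompt("Solve the problem.", question="What is 17 * 3?", chain_of_thought=True)
print(cot)
```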
NLP Evaluation
BLEU scores translation -- ROUGE scores summaries -- Perplexity scores language models
DIFFERENT TASKS NEED DIFFERENT EVALUATION METRICS
BLEU measures n-gram overlap -- it misses semantic equivalents like automobile vs car
BLEU: n-gram precision between generated and reference translations, with a penalty for overly short output. Range 0-1 (often reported as 0-100). Correlates reasonably with human judgment across a whole test set but misses semantic equivalents. ROUGE: recall-oriented n-gram overlap for summarization. ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence). Perplexity: how well an LM predicts the test set -- lower is better. BERTScore: uses contextual embeddings for semantic similarity -- often tracks human judgment better than n-gram metrics.
BLEU
N-gram overlap for translation -- misses semantic equivalents
ROUGE
N-gram overlap for summarization -- ROUGE-1, ROUGE-2, ROUGE-L
Perplexity
How surprised LM is by test text -- lower = better LM
BERTScore
Semantic similarity via embeddings -- often closer to human judgment than BLEU/ROUGE
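
A simplified sketch of the ideas behind these metrics: clipped unigram precision (the 1-gram building block of BLEU, without higher-order n-grams or the brevity penalty), ROUGE-1 recall, and perplexity from per-token probabilities. The function names are illustrative; for reported results use an established implementation.

```python
import math
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Share of candidate tokens that also appear in the reference (counts clipped)."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / len(candidate)

def rouge1_recall(candidate, reference):
    """Share of reference tokens that the candidate recovered."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / len(reference)

def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each true token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

reference = "the car is red".split()
candidate = "the automobile is red".split()

print(clipped_unigram_precision(candidate, reference))  # 0.75
print(rouge1_recall(candidate, reference))              # 0.75
print(perplexity([0.25, 0.25, 0.25, 0.25]))             # about 4
```

The candidate means the same thing as the reference, yet both overlap scores drop to 0.75 because "automobile" and "car" do not match as strings. A perplexity of about 4 means the model was, on average, as uncertain as if choosing among 4 equally likely tokens.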
Machine Translation
ENCODE source then ATTEND to relevant parts then DECODE to target language
FROM RULE-BASED TO STATISTICAL TO NEURAL TO TRANSFORMER
Parallel corpora are needed for training -- back-translation helps low-resource languages
History: Rule-based (1950s-80s), Statistical phrase-based (1990s-2010s), Neural (2014 -- seq2seq + attention), Transformer (2017 -- surpassed earlier systems). Modern LLMs are competitive for many language pairs, especially high-resource ones. Parallel corpora: the same content in two languages, needed for training. Back-translation: machine-translate monolingual target-language text back into the source language, then use the synthetic pairs as additional training data.
Rule-based
Hand-crafted grammar rules -- limited coverage
Statistical (SMT)
Phrase-based, learns from parallel corpora
Neural (NMT)
Encoder-decoder + attention -- an early major deep learning breakthrough in NLP
Transformer (2017)
Quickly surpassed RNN-based systems -- now the standard
Information Retrieval
SPARSE uses keywords -- DENSE uses meaning -- HYBRID uses both
BM25 FOR EXACT TERMS AND SEMANTIC SEARCH FOR MEANING -- COMBINE FOR BEST RESULTS
RAG: retrieve top-K chunks and inject into LLM prompt to ground answers in real facts
Sparse retrieval (BM25, TF-IDF): keyword matching -- fast, interpretable, handles exact terms (names, codes). Dense retrieval: semantic embedding similarity -- finds relevant documents even if keywords differ. Two-stage: BM25 narrows candidates (fast), dense re-ranks (accurate). RAG pipeline: embed documents, store in vector DB, retrieve top-K, inject into LLM prompt.
BM25
Keyword matching -- fast, exact terms, still a strong baseline
Dense retrieval
Semantic embeddings -- finds synonyms and paraphrases
Two-stage pipeline
BM25 narrows, dense re-ranks -- best of both worlds
RAG
Retrieve top-K chunks, inject as LLM context, cite sources
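
For the sparse side, here is a compact sketch of BM25 scoring and top-K selection. It assumes whitespace tokenization, the common defaults k1 = 1.5 and b = 0.75, and the non-negative IDF variant log((N - n + 0.5) / (n + 0.5) + 1); the documents are made up. The dense side would swap the scoring function for embedding similarity.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each word appears.
    df = Counter(word for d in tokenized for word in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "error code E42 in payment service",
    "how to fix a flat bicycle tire",
    "payment failed with an unknown error",
]
scores = bm25_scores("E42 error", docs)
top_k = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:2]
print(scores)
print(top_k)  # [0, 2]
```

The exact code "E42" is rare in the corpus, so it gets a high IDF and pushes the first document to the top, which is the kind of exact-term match that sparse retrieval handles well.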
Speech Recognition
ASR = Automatic Speech Recognition -- think Audio Signal to Recognized transcript: convert audio to text
WHISPER IS OPEN SOURCE AND NEAR-HUMAN IN ACCURACY ON ENGLISH
WER = (Substitutions + Deletions + Insertions) / Total reference words -- lower is better
ASR pipeline: audio signal to features (mel spectrogram) to model to text. Traditional: acoustic model + pronunciation dictionary + language model. Modern end-to-end: CTC or attention encoder-decoder. Whisper (OpenAI 2022): trained on 680K hours of multilingual audio, near-human on English, open-source and free to run locally. WER (Word Error Rate): standard evaluation metric.
Traditional ASR
Acoustic model + pronunciation dictionary + language model
End-to-end (CTC)
Directly map audio features to character sequences
Whisper
680K training hours, multilingual, near-human WER, open-source
WER formula
(Substitutions + Deletions + Insertions) / Total reference words
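
The numerator of WER is the word-level edit distance between the reference and the hypothesis. A minimal sketch, assuming whitespace tokenization and no text normalization (real evaluations usually normalize case and punctuation first):

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum word edits divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dp[-1][-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # about 0.333
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.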
Coreference
COREFERENCE -- John told Mary he liked her -- who is he and who is her?
IDENTIFY WHICH WORDS IN A TEXT REFER TO THE SAME REAL-WORLD ENTITY
Winograd Schema Challenge: The trophy does not fit in the suitcase because it is too big -- what is too big?
Coreference resolution identifies expressions referring to the same entity. Types: pronouns (he, she, it), noun phrases (the president), proper names (John). Challenge: requires world knowledge and reasoning. Example: The city council refused the protesters a permit because they feared violence -- they = city council, not protesters. Applications: reading comprehension, information extraction, summarization.
Pronouns
He, she, it, they -- must resolve to named entities
Noun phrases
The president, the company -- track across long documents
Winograd Schema
Tests common-sense reasoning for coreference resolution
Applications
Reading comprehension, information extraction, summarization
Text Generation Decoding
GREEDY picks the top token -- SAMPLING draws tokens in proportion to their probability -- TEMPERATURE controls creativity
TOP-P NUCLEUS SAMPLING IS A COMMON DEFAULT IN PRODUCTION LLM APPLICATIONS
Top-P (nucleus sampling): sample from the smallest set of tokens whose cumulative probability exceeds P
Greedy: always pick the highest-probability token -- deterministic, fast, repetitive. Sampling: draw randomly from the predicted distribution -- diverse but can be incoherent. Temperature: divides the logits before softmax -- below 1 = sharper (conservative), above 1 = flatter (creative). Top-K: sample from the top K tokens only. Top-P (nucleus): sample from the smallest token set with cumulative probability above P -- adapts to the distribution shape, widely used. Repetition penalty: down-weight already-generated tokens.
Greedy
Always pick highest prob -- deterministic but repetitive
Temperature < 1
Sharper distribution -- more conservative, predictable
Temperature > 1
Flatter distribution -- more creative, sometimes incoherent
Top-P (nucleus)
Widely used -- adapts to the actual distribution shape
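
A small sketch of these decoding strategies over a toy next-token distribution. The logits are made up, token ids are just list indices, and the nucleus cutoff here uses "reaches P" (>=) rather than strictly "exceeds P"; implementations differ on such boundary details.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature rescales the logits first."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Always pick the highest-probability token."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p, renormalized."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for token in order:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

def sample_top_p(probs, p=0.9, rng=random):
    """Sample one token from the nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]

logits = [2.0, 1.0, 0.0]
print(softmax(logits, temperature=0.5))  # sharper: more mass on token 0
print(softmax(logits, temperature=2.0))  # flatter: mass spread out

probs = [0.5, 0.3, 0.15, 0.05]
print(greedy(probs))                     # 0
print(sorted(top_p_filter(probs, 0.9)))  # [0, 1, 2]
print(sample_top_p(probs, 0.9, rng=random.Random(0)))
```

With P = 0.9 the unlikely fourth token is cut from the candidate set entirely, while a more peaked distribution would keep fewer tokens. That is the sense in which top-P adapts to the distribution shape.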
© 2026 MemoryTricks 🧠