NLP Memory Tricks -- Free Natural Language Processing Mnemonics

NLP Pipeline

NLP PIPELINE -- Tokenize, Normalize, Represent, Model, Evaluate

RAW TEXT TO PREDICTIONS IN FIVE STEPS

Modern LLMs skip most preprocessing -- trained end-to-end with BPE subword tokenization

Tokenization: split into tokens. Normalization: lowercase, punctuation, contractions. Stop word removal: remove high-frequency low-information words. Stemming/Lemmatization: reduce to root form. Representation: convert to numerical vectors (BoW, TF-IDF, embeddings). Model: ML or DL algorithm. Evaluate: task-specific metric.

Tokenization

Split text into words, subwords, or characters

Normalization

Lowercase, punctuation handling, contraction expansion

Lemmatization

Returns the dictionary base form (lemma), e.g. better -> good -- usually more accurate than stemming, which only strips suffixes

Modern LLMs

Skip most preprocessing -- end-to-end with BPE tokenization
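
The classic steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the stop-word list, the contraction table, and the function names are assumptions made for the example, and the "representation" step is a simple bag of words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

# Illustrative contraction table (an assumption, not exhaustive).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def tokenize(text):
    """Split into word tokens; punctuation is dropped, inner apostrophes kept."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def normalize(tokens):
    """Lowercase each token and expand known contractions."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        out.extend(CONTRACTIONS.get(tok, tok).split())
    return out


def remove_stop_words(tokens):
    """Drop high-frequency, low-information words."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Tokenize -> normalize -> remove stop words -> bag-of-words counts."""
    return Counter(remove_stop_words(normalize(tokenize(text))))


counts = preprocess("The cats don't like the rain. The rain is cold!")
print(counts)
```

The resulting counts would then be fed to a model and scored with a task-specific metric. An LLM pipeline would replace all of these hand-written steps with a learned subword tokenizer.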

Text Representation

BOW to TF-IDF to Word2Vec to BERT -- from counting words to understanding meaning

FOUR GENERATIONS OF TEXT REPRESENTATION -- EACH MORE POWERFUL

BERT gives the same word different vectors depending on context -- bank (finance) vs bank (river)

Bag of Words (BoW): word counts, ignores order and context, sparse. TF-IDF: weights terms that are frequent in a document but rare across the corpus, still sparse. Word2Vec/GloVe: dense semantic vectors, king - man + woman = queen (approximately), one fixed vector per word. BERT / contextual embeddings: same word gets a different embedding based on surrounding context. Most modern NLP systems build on contextual embeddings.

BoW

Word counts -- sparse, ignores order and context

TF-IDF

Weights rare words higher -- still sparse but smarter

Word2Vec

Dense semantic vectors -- arithmetic works, but context-free

BERT / contextual

Same word, different vector based on context -- most powerful
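
To make the first two generations concrete, here is a small sketch that builds BoW and TF-IDF vectors by hand. It assumes the plain, unsmoothed definition (term frequency times log(N / document frequency)); libraries often use smoothed variants, so their numbers will differ. The three toy documents are made up for the example.

```python
import math
from collections import Counter

docs = [
    "the bank approved the loan",
    "the river bank was muddy",
    "the loan rate went up",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def bow_vector(tokens):
    """One count per vocabulary word; mostly zeros for a real vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def idf(word):
    """Inverse document frequency: rarer across the corpus means higher."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)


def tfidf_vector(tokens):
    """Term frequency (relative count) scaled by inverse document frequency."""
    counts = Counter(tokens)
    return [(counts[w] / len(tokens)) * idf(w) for w in vocab]


for doc, tokens in zip(docs, tokenized):
    weights = tfidf_vector(tokens)
    top = max(range(len(vocab)), key=weights.__getitem__)
    print(f"{doc!r}: highest TF-IDF word = {vocab[top]!r}")
```

Note that "the" appears in every document, so its IDF is log(1) = 0 and it drops out, while "bank" still gets one vector position no matter which sense is meant. That limitation is what contextual embeddings address.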

BERT vs GPT

BERT reads the whole sentence (bidirectional). GPT reads left-to-right only.

ENCODER FOR UNDERSTANDING -- DECODER FOR GENERATION

BERT is not designed to generate free-form text. GPT cannot see the right-hand context. Both are excellent at what they do.

BERT (Encoder-only, Bidirectional): sees entire sequence simultaneously, trained on masked language modeling, excellent for classification, NER, question answering, not designed for open-ended text generation. GPT (Decoder-only, Autoregressive): sees only left context, trained on next-token prediction, excellent for text generation, creative writing, code, conversation. T5/BART: Encoder-Decoder -- well suited to translation and summarization.

BERT encoder-only

Bidirectional, masked LM, understanding tasks: classification, NER, QA

GPT decoder-only

Left-context only, next-token prediction, generation tasks

T5 and BART

Encoder-Decoder -- translation, summarization, seq2seq tasks

Why GPT dominates now

Scale + emergent few-shot abilities -- can do BERT tasks via prompting
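
The bidirectional vs left-to-right difference comes down to which positions each token is allowed to attend to. The sketch below shows only that visibility pattern as a 0/1 mask; the function names are made up for illustration, and real models combine this with padding masks and learned attention weights.

```python
def bidirectional_mask(n):
    """Encoder-style (BERT): every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style (GPT): position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat"]
for name, mask in [
    ("bidirectional", bidirectional_mask(len(tokens))),
    ("causal", causal_mask(len(tokens))),
]:
    print(name)
    for tok, row in zip(tokens, mask):
        visible = [t for t, allowed in zip(tokens, row) if allowed]
        print(f"  {tok:>4} attends to {visible}")
```

With the causal mask, "the" sees only itself while "sat" sees all three tokens, which is why next-token prediction can be trained on every position at once.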

Tokenization (BPE)

BPE -- Byte Pair Encoding: merge the most frequent character pair, repeat until vocabulary size reached

NO TRUE OUT-OF-VOCABULARY WORDS WITH SUBWORD TOKENIZATION

Rule of thumb: 750 English words are approximately 1000 tokens -- other languages use more tokens per word

BPE Algorithm: start with a character vocabulary, count all adjacent symbol pairs, merge the most frequent pair into a new token, repeat until the target vocabulary size is reached. Result: common words as single tokens, rare words split into subword units. No true OOV words -- any word can be broken down into known units (in the worst case single characters or bytes). GPT models: roughly 50K-100K token vocabulary (GPT-2 through GPT-4). Token count affects cost, speed, and context window utilization.

BPE algorithm

Merge most frequent character pairs iteratively until target vocab size

No OOV words

Any word can be split into known subword units

Token counting

750 English words ~ 1000 tokens. Other languages: more tokens per word.

Vocabulary size

GPT models: roughly 50K-100K token vocabulary (GPT-2 through GPT-4)
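
The merge loop can be sketched on a toy corpus as follows. This is a character-level teaching version with made-up function names; production tokenizers (for example the byte-level BPE used by GPT models) add pre-tokenization rules and operate on bytes, which this sketch leaves out.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    word_freqs = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


merges, segmented = train_bpe("low low low lower lowest", num_merges=3)
print(merges)
print(segmented)
```

After three merges the frequent word "low" has become a single token, while "lower" and "lowest" are still split into a shared prefix plus leftover characters. That is the common-words-whole, rare-words-split behaviour described above.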

Prompt Engineering

CLEAR -- Context, Length, Examples, Ask specifically, Role assignment

BETTER PROMPT = BETTER OUTPUT -- PROMPT ENGINEERING IS A SKILL

Chain-of-Thought: adding "Let's think step by step" often markedly improves accuracy on reasoning tasks

Zero-shot: just ask -- no examples. Few-shot: provide 2-5 examples of the desired format. Chain-of-Thought (CoT): add "Let's think step by step" -- improves math, logic, multi-step reasoning. Role prompting: "Act as an expert..." -- calibrates vocabulary, depth, style. Structured output: "Respond in JSON format." RAG: retrieve context to ground the LLM in real facts and reduce hallucination.

Zero-shot

Just ask -- no examples needed for simple clear tasks

Few-shot

Give 2-5 examples -- improves consistency and format

Chain-of-Thought

"Let's think step by step" -- often a large improvement on reasoning tasks

RAG

Retrieve documents, inject as context -- reduces hallucination
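
These techniques are all ways of assembling the text sent to the model, so they can be combined in one small helper. The sketch below is illustrative only: `build_prompt` and its Q:/A: layout are assumptions for the example, not a standard API, and it does not call any model.

```python
def build_prompt(question, role=None, examples=(), chain_of_thought=False,
                 json_output=False):
    """Assemble a prompt from optional role, format, and few-shot parts."""
    parts = []
    if role:
        parts.append(f"Act as {role}.")            # role prompting
    if json_output:
        parts.append("Respond in JSON format.")    # structured output
    for example_question, example_answer in examples:  # few-shot examples
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    if chain_of_thought:
        parts.append(f"Q: {question}\nA: Let's think step by step.")
    else:
        parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Zero-shot: just the question.
zero_shot = build_prompt("What is 17 * 6?")

# Few-shot + role + chain-of-thought.
few_shot_cot = build_prompt(
    "What is 17 * 6?",
    role="a patient math tutor",
    examples=[("What is 12 * 3?", "36"), ("What is 9 * 8?", "72")],
    chain_of_thought=True,
)
print(few_shot_cot)
```

For RAG, the retrieved passages would be added as one more part ahead of the question.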

NLP Evaluation

BLEU scores translation -- ROUGE scores summaries -- Perplexity scores language models

DIFFERENT TASKS NEED DIFFERENT EVALUATION METRICS

BLEU measures n-gram overlap -- it misses semantic equivalents like automobile vs car

BLEU: n-gram overlap (precision-oriented) between generated and reference translations. Range 0-1. Correlates with human judgment but misses semantic equivalents. ROUGE: n-gram overlap (recall-oriented) for summarization. ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence). Perplexity: how well an LM predicts a test set -- lower is better. BERTScore: uses embeddings for semantic similarity -- often tracks human judgment better than n-gram metrics.

BLEU

N-gram overlap for translation -- misses semantic equivalents

ROUGE

N-gram overlap for summarization -- ROUGE-1, ROUGE-2, ROUGE-L

Perplexity

How surprised LM is by test text -- lower = better LM

BERTScore

Semantic similarity via embeddings -- often closer to human judgment than BLEU/ROUGE
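
The core of these metrics fits in a few lines. The sketch below computes a BLEU-style clipped n-gram precision, a ROUGE-N recall, and perplexity from per-token probabilities. It assumes a single reference and leaves out the rest of full BLEU (brevity penalty, geometric mean over n = 1..4); the function names are made up for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """BLEU-style: share of candidate n-grams that also occur in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N: share of reference n-grams recovered by the candidate."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def perplexity(token_probs):
    """exp of the average negative log-probability the LM gave each token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


reference = "the car is red".split()
candidate = "the automobile is red".split()
print(clipped_precision(candidate, reference, n=1))  # 3 of 4 unigrams match
print(clipped_precision(candidate, reference, n=2))  # 1 of 3 bigrams match
print(perplexity([0.25, 0.25, 0.25, 0.25]))          # uniform over 4 choices
```

The candidate means the same as the reference, yet "automobile" counts as a miss. This is the semantic-equivalence gap that embedding-based metrics such as BERTScore try to close.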

Machine Translation

ENCODE source then ATTEND to relevant parts then DECODE to target language

FROM RULE-BASED TO STATISTICAL TO NEURAL TO TRANSFORMER

Parallel corpora are needed for training -- back-translation helps low-resource languages

History: Rule-based (1950s-80s), Statistical phrase-based (1990s-2010s), Neural (2014 -- seq2seq + attention), Transformer (2017 -- surpassed all previous). Modern LLMs are competitive for many language pairs. Parallel corpora: same content in two languages, needed for training. Back-translation: translate monolingual target-language text back into the source language with a reverse model, then use the synthetic source paired with the real target as additional training data.

Rule-based

Hand-crafted grammar rules -- limited coverage

Statistical (SMT)

Phrase-based, learns from parallel corpora

Neural (NMT)

Encoder-decoder + attention -- an early major deep-learning breakthrough in NLP

Transformer (2017)

Quickly surpassed RNN-based systems -- now the standard
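
Back-translation is easiest to see as a data-augmentation step. In the sketch below the "reverse model" is a hypothetical stand-in (a five-word English-to-French lookup table invented for the example); in practice it would be a trained target-to-source translation model.

```python
# Hypothetical stand-in for a trained target->source (English->French) model.
TOY_EN_TO_FR = {"the": "le", "cat": "chat", "dog": "chien",
                "sleeps": "dort", "eats": "mange"}


def toy_reverse_translate(target_sentence):
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(TOY_EN_TO_FR.get(w, w) for w in target_sentence.split())


def back_translate(monolingual_target, reverse_model):
    """Create synthetic (source, target) pairs from target-only text."""
    return [(reverse_model(sentence), sentence) for sentence in monolingual_target]


# Scarce real parallel data: (French source, English target).
real_pairs = [("le chat mange", "the cat eats")]

# Plentiful monolingual English text becomes extra training pairs.
synthetic_pairs = back_translate(["the dog sleeps"], toy_reverse_translate)

training_data = real_pairs + synthetic_pairs
print(training_data)
```

The target side of every synthetic pair is genuine human text, so the forward model still learns to produce fluent output even when the synthetic source side is noisy.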

Information Retrieval

SPARSE uses keywords -- DENSE uses meaning -- HYBRID uses both

BM25 FOR EXACT TERMS AND SEMANTIC SEARCH FOR MEANING -- COMBINE FOR BEST RESULTS

RAG: retrieve top-K chunks and inject into LLM prompt to ground answers in real facts

Sparse retrieval (BM25, TF-IDF): keyword matching -- fast, interpretable, handles exact terms (names, codes). Dense retrieval: semantic embedding similarity -- finds relevant documents even if keywords differ. Two-stage: BM25 narrows candidates (fast), dense re-ranks (accurate). RAG pipeline: embed documents, store in vector DB, retrieve top-K, inject into LLM prompt.

BM25

Keyword matching -- fast, exact terms, still a strong baseline

Dense retrieval

Semantic embeddings -- finds synonyms and paraphrases

Two-stage pipeline

BM25 narrows, dense re-ranks -- best of both worlds

RAG

Retrieve top-K chunks, inject as LLM context, cite sources
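
A compact BM25 scorer shows what "sparse" means in practice. This sketch uses the common formulation with k1 = 1.5, b = 0.75 and a +1 inside the log to keep IDF non-negative; those parameter values, whitespace tokenization, and the toy documents are assumptions for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact keyword match, no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k(query, docs, k=2):
    """Indices of the k best-scoring documents (the 'narrowing' stage)."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


docs = [
    "error code E42 in payment module",
    "the car would not start this morning",
    "automobile engine repair guide",
]
print(top_k("E42 error", docs, k=1))   # exact code match
print(bm25_scores("car", docs))        # the 'automobile' document scores 0
```

The exact code "E42" is found immediately, but a query for "car" gives the "automobile" document a score of zero. A dense retriever or re-ranker is what would recover that match in a hybrid setup.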

Speech Recognition

ASR = Automatic Speech Recognition -- Acoustic signal in, transcript out: convert audio to text

WHISPER IS OPEN SOURCE AND REACHES NEAR-HUMAN ACCURACY ON ENGLISH

WER = (Substitutions + Deletions + Insertions) / Total reference words -- lower is better

ASR pipeline: audio signal to features (mel spectrogram) to model to text. Traditional: acoustic model + pronunciation dictionary + language model. Modern end-to-end: CTC or attention encoder-decoder. Whisper (OpenAI 2022): trained on 680K hours of multilingual audio, near-human on English, open-source and free to run locally. WER (Word Error Rate): standard evaluation metric.

Traditional ASR

Acoustic model + pronunciation dictionary + language model

End-to-end (CTC)

Directly map audio features to character sequences

Whisper

680K training hours, multilingual, near-human WER, open-source

WER formula

(Substitutions + Deletions + Insertions) / Total reference words
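
The numerator of the WER formula is the word-level edit distance between reference and hypothesis, so WER can be computed with a standard dynamic-programming table. A minimal sketch, assuming whitespace tokenization and a non-empty reference:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()

    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.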

Coreference

COREFERENCE -- John told Mary he liked her -- who is he and who is her?

IDENTIFY WHICH WORDS IN A TEXT REFER TO THE SAME REAL-WORLD ENTITY

Winograd Schema Challenge: The trophy does not fit in the suitcase because it is too big -- what is too big? (Swap "big" for "small" and the answer flips from trophy to suitcase.)

Coreference resolution identifies expressions referring to the same entity. Types: pronouns (he, she, it), noun phrases (the president), proper names (John). Challenge: requires world knowledge and reasoning. Example: The city council refused the protesters a permit because they feared violence -- they = city council, not protesters. Applications: reading comprehension, information extraction, summarization.

Pronouns

He, she, it, they -- must be resolved to the entity they refer to (their antecedent)

Noun phrases

The president, the company -- track across long documents

Winograd Schema

Tests common-sense reasoning for coreference resolution

Applications

Reading comprehension, information extraction, summarization

Text Generation Decoding

GREEDY picks top token -- SAMPLING picks randomly -- TEMPERATURE controls creativity

TOP-P NUCLEUS SAMPLING IS THE STANDARD IN PRODUCTION LLM APPLICATIONS

Top-P (nucleus sampling): sample from smallest set of tokens whose cumulative probability exceeds P

Greedy: always pick highest probability token -- deterministic, fast, repetitive. Sampling: randomly sample from distribution -- diverse but can be incoherent. Temperature: below 1 = sharper (conservative), above 1 = flatter (creative). Top-K: sample from top K tokens only. Top-P (nucleus): sample from smallest token set with cumulative probability above P -- adapts to distribution shape, most widely used. Repetition penalty: downweight already-generated tokens.

Greedy

Always pick highest prob -- deterministic but repetitive

Temperature < 1

Sharper distribution -- more conservative, predictable

Temperature > 1

Flatter distribution -- more creative, sometimes incoherent

Top-P (nucleus)

Most widely used -- adapts to actual distribution shape
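
The decoding strategies above are all small transformations of one probability distribution over the next token. The sketch below applies them to a hand-written four-token distribution; the function names are made up for the example, and the top-P filter stops as soon as the cumulative probability reaches P (implementations differ slightly on the exact cutoff).

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out everything not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def greedy(probs):
    """Always pick the single most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
print(greedy(probs))
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
print(sample(top_p_filter(probs, p=0.9), rng=random.Random(0)))
```

Top-K always keeps exactly K tokens, whereas top-P keeps few tokens when the model is confident and many when it is not. That is the "adapts to distribution shape" property noted above.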
