Memory tricks that make CNNs, Transformers and LLMs click
From convolutional networks to the attention mechanism that powers ChatGPT -- these memory tricks lock in the deep learning architectures that define modern AI.
Deep Learning
Memory Tricks
Proven Mnemonics & Acronyms — fast to learn, hard to forget.
CNNs
CNN SEES -- Convolutional layers detect features, Pooling shrinks size, FC layer classifies
PIXELS TO EDGES TO SHAPES TO OBJECTS -- LEARNED AUTOMATICALLY
Early layers detect low-level features. Deep layers detect high-level concepts. This hierarchy is learned from data.
Convolutional layer: filters slide across the image detecting local features (edges, textures, shapes). Max pooling: reduces spatial dimensions and adds some tolerance to small translations. Fully connected: flattens and classifies. Key insight: early layers detect edges and corners, deeper layers detect faces and objects. Parameter sharing: the same filter is applied everywhere -- far fewer parameters than a dense network.
Convolutional layer
Learned filters detect local features at every position
Max pooling
Reduces spatial size -- 2x2 pool halves width and height
Fully connected
Flattens and classifies from deep features
Parameter sharing
Same filter applied everywhere -- drastically fewer parameters
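A minimal sketch of the convolve-then-pool steps above, in plain Python with no dependencies. The names `conv2d` and `max_pool2x2`, the toy image, and the hand-written edge filter are assumptions for illustration; a real CNN learns its filter values and would use a library such as PyTorch.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D cross-correlation over nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling; halves height and width."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]


# Toy 5x5 image: dark left columns (0), bright right columns (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical vertical-edge filter; in a trained CNN these 4 weights are learned.
kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, kernel)  # 4x4 map, non-zero only where the edge is
pooled = max_pool2x2(features)    # 2x2 map after pooling
```

The filter has 4 weights reused at all 16 output positions. A dense layer mapping the same 25 inputs to 16 outputs would need 25 x 16 = 400 weights, which is the parameter-sharing saving in miniature.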
RNNs and LSTMs
RNN = hidden state carries MEMORY forward -- LSTM adds gates to control what to remember
VANILLA RNN VANISHES -- LSTM GATES SOLVE IT
LSTM has three gates: forget what to erase, input what to add, output what to expose
RNNs process sequences by maintaining hidden state memory. Vanilla RNN: vanishing gradients over long sequences -- forgets early inputs. LSTM (Long Short-Term Memory): forget gate (what to erase), input gate (what new info to add), output gate (what to expose). GRU: simplified LSTM with two gates -- fewer parameters, comparable performance. Now largely replaced by Transformers for most NLP tasks.
Vanilla RNN
Vanishing gradients over long sequences -- forgets early input
LSTM forget gate
Decides what to erase from cell state
LSTM input gate
Decides what new information to store
LSTM output gate
Decides what to expose as hidden state
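A rough sketch of how the three gates combine in one LSTM step, reduced to a single scalar unit so it runs without any library. The function `lstm_step` and the weight dictionary layout are assumptions made for this example; real LSTMs use weight matrices and vectors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM.

    `w` maps each gate name to hypothetical (w_x, w_h, bias) scalars.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # near 0 erases the old cell state, near 1 keeps it
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate new content
    c = f * c_prev + i * g            # updated cell state (the long-term memory)
    h = o * math.tanh(c)              # hidden state passed to the next step
    return h, c


# Made-up weights: a large forget bias keeps the memory, a large negative one erases it.
keep = {"forget": (0, 0, 10), "input": (0, 0, -10), "output": (0, 0, 10), "candidate": (1, 0, 0)}
erase = dict(keep, forget=(0, 0, -10))

_, c_kept = lstm_step(x=1.0, h_prev=0.0, c_prev=0.5, w=keep)
_, c_erased = lstm_step(x=1.0, h_prev=0.0, c_prev=0.5, w=erase)
```

With the forget gate saturated open, the cell state passes through almost unchanged, which is the path that lets gradients survive long sequences.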
Transformers
Attention Is All You Need -- Google 2017 paper that replaced RNNs for language
ENCODER AND DECODER -- SELF-ATTENTION AND POSITIONAL ENCODING
Positional encoding is required because attention has no inherent sense of word order
No recurrence -- processes the entire sequence in parallel. Self-attention: every token attends to every other. Positional encoding: sine/cosine functions (in the original paper) inject position information (without it: dog bites man = man bites dog). Layer normalization + residual connections. Encoder-only (BERT): understanding tasks. Decoder-only (GPT, Claude): generation. Encoder-Decoder (T5): translation and summarization. Foundation of nearly all modern LLMs.
Self-attention
Every token attends to every other -- global context from layer 1
Positional encoding
Required -- attention has no inherent sense of order
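To make the sine/cosine positional encoding concrete, here is a small sketch following the formula from the original Transformer paper (base 10000, sine on even dimensions, cosine on odd ones). The function name `positional_encoding` and the tiny sizes are choices made for this example.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: even dims use sine, odd dims use cosine."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair (dims 2j, 2j+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=8)
# Each row is added to the token embedding at that position,
# so "dog bites man" and "man bites dog" no longer look identical.
```

Every position gets a distinct vector, which is all attention needs to tell word order apart.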
LLMs
PRETRAIN then FINE-TUNE -- learn everything first, then specialize, then align
PRETRAINING AND SFT AND RLHF -- THREE TRAINING STAGES
Emergent abilities can appear abruptly at scale -- not explicitly trained, hard to predict in advance
Pretraining: train massive transformer on huge text corpus, predict next token, self-supervised (next token IS the label). Supervised Fine-Tuning (SFT): train on curated instruction-response pairs -- teaches instruction following. RLHF: human raters rank outputs, reward model trained, RL fine-tunes LLM to maximize it -- aligns with human values. Emergent abilities: few-shot learning, chain-of-thought, code generation appear suddenly at scale.
Pretraining
Next token prediction on massive corpus -- self-supervised
SFT
Curated instruction-response pairs -- teaches instruction following
RLHF
Human preferences + reward model + RL = aligned LLM
Emergent abilities
Appear suddenly at scale -- few-shot, CoT, code generation
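The phrase "next token IS the label" can be sketched in a few lines: the targets are just the input sequence shifted by one. The toy vocabulary and the helper name `next_token_pairs` are assumptions; real LLMs use subword tokenizers and batch many sequences.

```python
def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the sequence shifted left by one."""
    return token_ids[:-1], token_ids[1:]


# Hypothetical toy vocabulary for illustration only.
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
tokens = [vocab[word] for word in "the cat sat down".split()]

inputs, targets = next_token_pairs(tokens)
# At position t the model sees inputs[: t + 1] and is scored on predicting targets[t].
```

No human wrote these labels; they fall out of the text itself, which is why pretraining scales to huge corpora.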
GANs and Diffusion
GENERATOR fakes it -- DISCRIMINATOR detects fakes -- They compete until the discriminator can no longer tell real from fake
TWO FIGHTERS COMPETING -- OR STEP-BY-STEP DENOISING FROM NOISE
Diffusion models now dominate image generation -- more stable than GANs, better diversity
GANs: Generator produces fake data, Discriminator classifies real vs fake, trained adversarially. Challenges: mode collapse (generator produces only a few types), training instability. Diffusion models: forward process adds noise over T steps, reverse process learns to denoise -- generate by starting from pure noise and denoising. Stable Diffusion, DALL-E 3, Midjourney use diffusion. More stable, better diversity, easier text conditioning.
GAN Generator
Takes random noise, produces synthetic data, tries to fool discriminator
GAN Discriminator
Classifies real vs fake -- trained adversarially
Mode collapse
Generator produces only few output types -- major GAN failure
Diffusion
Denoise from pure noise -- most stable, best quality, dominant now
Embeddings
king - man + woman = queen -- meaning encoded as geometry in vector space
DISCRETE ITEMS AS CONTINUOUS DENSE VECTORS WHERE SIMILAR ITEMS ARE CLOSE
Cosine similarity measures the angle between vectors -- standard for embedding comparison
Embeddings map discrete items (words, images, users) to dense vectors where similar items are geometrically close. Word2Vec/GloVe: static -- same word gets same vector regardless of context. BERT: contextual -- same word gets different vector based on context. Cosine similarity measures closeness. Vector databases store embeddings for semantic search. Foundation of RAG: embed documents, retrieve similar chunks as LLM context.
Word2Vec/GloVe
Static -- same word always same vector, arithmetic works
BERT embeddings
Contextual -- bank in finance vs bank of river = different vectors
Cosine similarity
Angle between vectors -- range -1 to 1, standard metric
Vector databases
Store embeddings for efficient similarity search (RAG)
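A minimal sketch of cosine similarity and a brute-force nearest-neighbor lookup, which is the core operation a vector database speeds up. The 3-dimensional vectors and the helper `nearest` are made up for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings: "cat" and "kitten" point in similar directions, "car" does not.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def nearest(query, store):
    """Return the key whose vector has the highest cosine similarity to the query."""
    return max(store, key=lambda name: cosine_similarity(query, store[name]))


best_match = nearest([1.0, 0.0, 0.0], embeddings)
```

In a RAG setup the query vector would be the embedded user question and the store would hold embedded document chunks.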
LoRA and PEFT
LoRA = Low-Rank Adaptation -- fine-tune roughly 1% of the parameters (often less), keep most of the performance
FREEZE THE BASE, TRAIN A LOW-RANK UPDATE, MERGE BACK FOR ZERO ADDED LATENCY
QLoRA: 4-bit quantization plus LoRA -- fine-tune a 65B model on a single 48GB GPU
LoRA: instead of updating full weight matrix W, add low-rank update DeltaW = B times A where B is d x r and A is r x k with r much smaller than d. Only A and B trained -- 100-10,000x fewer parameters. At inference: merge back into W -- no latency penalty. QLoRA: combine 4-bit quantization with LoRA -- democratized LLM fine-tuning on consumer hardware.
LoRA r parameter
Rank of the update -- r=8-16 typical, controls parameter count
Merge at inference
LoRA weights merged back -- zero added latency
QLoRA
4-bit base model + LoRA adapters -- fine-tune 65B on a single 48GB GPU
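The parameter savings and the merge step can be checked with simple arithmetic. This sketch assumes a square 4096 x 4096 weight matrix and rank 8 as example sizes, and it leaves out the alpha/r scaling factor that LoRA implementations usually apply; the helper names are hypothetical.

```python
def lora_param_counts(d, k, r):
    """Trainable entries for full fine-tuning vs. a rank-r LoRA update."""
    full = d * k            # every entry of W
    lora = d * r + r * k    # B is d x r, A is r x k
    return full, lora


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge(W, B, A):
    """Fold the low-rank update into the frozen weights: W' = W + B A."""
    delta = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


full, lora = lora_param_counts(d=4096, k=4096, r=8)  # example sizes, not from a specific model

# Tiny rank-1 example with made-up numbers.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
B = [[1.0], [2.0]]             # d x r = 2 x 1
A = [[0.5, -0.5]]              # r x k = 1 x 2
W_merged = merge(W, B, A)      # same shape as W, so inference cost is unchanged
```

For these sizes the update has 65,536 trainable entries against 16,777,216 in the full matrix, a 256x reduction for that one layer.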
Vision Transformers
ViT -- split image into PATCHES, treat each patch like a TOKEN, run Transformer
AN IMAGE IS WORTH 16x16 WORDS -- THE PAPER THAT LAUNCHED VISION TRANSFORMERS
Swin Transformer: hierarchical ViT with shifted windows -- efficient for detection and segmentation
ViT (Dosovitskiy et al., 2020): split image into 16x16 pixel patches, flatten each, project to embedding, add position embeddings, run through standard Transformer encoder. Global context from layer 1 -- unlike CNNs that build context gradually. Needs more data than CNNs (weaker inductive biases). Swin: hierarchical ViT with shifted windows, O(n) complexity in image size, a widely used backbone for detection and segmentation.
Position embeddings
Required -- transformer has no inherent sense of patch order
Global attention
Every patch attends to every other -- from the first layer
Swin Transformer
Hierarchical + shifted windows -- O(n), efficient for high-resolution
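The "split into patches, flatten each" step can be sketched on a toy grid. The function `patchify` and the 4x4 single-channel image are assumptions for illustration; a real ViT would follow this with a learned linear projection and position embeddings.

```python
def patchify(image, patch):
    """Split an H x W grid into flattened, non-overlapping patch x patch tiles (row-major)."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    return [
        [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
        for top in range(0, h, patch)
        for left in range(0, w, patch)
    ]


# Toy 4x4 single-channel image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)   # 4 tokens, each of length 4

# Sequence length for a 224x224 image with 16x16 patches.
num_patches_224 = (224 // 16) ** 2
```

A 224x224 image therefore becomes a sequence of 196 patch tokens, and with 3 color channels each flattened patch has 16 x 16 x 3 = 768 values.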
Mixture of Experts
MoE -- MANY experts, SPARSE routing -- only a few experts (often 1-2) active per token
SCALE MODEL CAPACITY WITHOUT SCALING COMPUTE PROPORTIONALLY
Mixtral 8x7B: 47B total parameters, only 13B active per token -- same compute as 13B model
MoE replaces each dense feed-forward layer with N expert networks, routing each token to only a few of them (K, often 1 or 2). Gate network: learned router decides which experts handle each token. Sparse activation: only K of N experts active per token. Mixtral 8x7B: 8 experts, 2 active per token, 47B total params but 13B active compute. Load balancing: auxiliary loss prevents all tokens routing to same expert.
Expert networks
N parallel feed-forward networks replacing one dense layer
Gating network
Learned router -- which K experts handle this token?
Sparse activation
Only K of N active per token -- compute = small dense model
Load balancing
Auxiliary loss prevents collapse to same 1-2 experts
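A small sketch of top-K routing: pick the K highest-scoring experts, normalize their scores with a softmax, and mix only those experts' outputs. Softmax over the selected logits is one common choice, not the only one. The scalar "experts" and the fixed router logits are stand-ins; in a real model the experts are feed-forward networks and the logits come from a learned gate.

```python
import math


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and softmax their logits into mixing weights."""
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[e] for e in top)           # subtract max for numerical stability
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    return sum(weight * experts[e](x) for e, weight in top_k_route(router_logits, k))


# Hypothetical experts: expert s simply multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.3, -0.5, 0.2]     # stand-in for a learned gate's output

routes = top_k_route(logits, k=2)
y = moe_forward(1.0, experts, logits, k=2)
```

Only 2 of the 8 experts run for this token, so compute tracks K while total capacity tracks N.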
Self-Supervised Learning
CREATE your own labels from the data -- predict what was hidden, no human annotation needed
MASK IT OR ROTATE IT OR CROP IT -- PREDICT WHAT WAS HIDDEN TO LEARN REPRESENTATIONS
Foundation models (BERT, GPT, CLIP, Stable Diffusion) are ALL pretrained without manual labels -- self-supervised, or supervised by naturally paired web data (CLIP's image-caption pairs)
Self-supervised creates supervision from data structure itself -- no human labels. NLP: BERT predicts masked tokens, GPT predicts next token. Computer Vision: SimCLR (match two augmented views of same image), MAE (mask 75% of image patches, reconstruct them). Contrastive learning: pull same-sample representations together, push different samples apart. Enables learning from internet-scale unlabeled data.
Masked LM (BERT)
Predict masked tokens using bidirectional context
Next token (GPT)
Predict next token given left context -- autoregressive
SimCLR/DINO
Match augmented views of same image -- contrastive learning (SimCLR) or self-distillation (DINO)
MAE
Mask 75% of image patches, reconstruct -- forces global understanding
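The "predict what was hidden" recipe can be sketched as a masking function that produces both the corrupted input and the labels. This is a simplified version: the 15% ratio matches BERT's masking rate, but BERT's random-replacement and keep-unchanged variants are left out, and `mask_tokens` is a name invented for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens; the hidden originals become the training labels."""
    n_mask = max(1, round(len(tokens) * mask_ratio))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(hidden)}
    return inputs, labels


rng = random.Random(0)  # seeded only so the example is repeatable
sentence = "the cat sat on the mat".split()
inputs, labels = mask_tokens(sentence, mask_ratio=0.15, rng=rng)
# The model sees `inputs` and is trained to recover `labels` at the masked positions.
```

The same function with `mask_ratio=0.75` applied to a list of image patches gives the MAE-style setup described above.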
Deep RL
DQN = Deep Q-Network -- neural net approximates Q-values for large state spaces
EXPERIENCE REPLAY AND TARGET NETWORK -- THE TWO INNOVATIONS THAT MADE DQN WORK
PPO is one of the most widely used Deep RL algorithms -- used for RLHF in ChatGPT and Claude
Deep RL combines RL with neural networks for large or continuous state spaces. DQN: neural network approximates Q(s,a) for all actions. Experience Replay: store transitions, sample randomly -- breaks temporal correlations. Target Network: frozen network for Q-targets -- stabilizes training. Policy Gradient: directly learn policy pi(a|s). PPO (Proximal Policy Optimization): clips policy updates to prevent destructive large steps -- one of the most widely used DRL algorithms, used in RLHF.
Experience Replay
Store transitions, sample randomly -- breaks temporal correlations
Target Network
Frozen network for Q-targets -- stabilizes training
Policy Gradient
Directly learn the policy instead of Q-values
PPO
Clips policy updates -- prevents destructive large steps, used in RLHF
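What "clips policy updates" means can be shown with PPO's clipped surrogate objective for a single sample. Here `ratio` is the new policy's probability of the action divided by the old policy's, and `epsilon=0.2` is a commonly used default; the function name is chosen for this sketch.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0) whose probability already rose by 50%: the gain is capped at 1.2 * A.
capped = ppo_clip_objective(ratio=1.5, advantage=1.0)

# Good action whose probability fell: no clipping, so the policy is still pushed to recover.
uncapped = ppo_clip_objective(ratio=0.5, advantage=1.0)
```

Taking the minimum means the objective never rewards moving the policy further than the clip range, which is what prevents a single batch from causing a destructive update.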
🎯 Exam Favorite
ATTENTION = The model HIGHLIGHTS what matters — like a student with a yellow marker
QUERY · KEY · VALUE — THE ATTENTION TRIANGLE
The attention mechanism — how Transformers read
Attention lets the model focus on relevant parts of the input when processing each word. For every word, it computes a Query (what am I looking for?), compares it to Keys (what does each word offer?), and retrieves Values (the actual content) weighted by relevance. High attention score = highlighted in yellow. This lets "it" in "The trophy didn't fit in the suitcase because it was too big" correctly link to "trophy" not "suitcase."
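The Query/Key/Value description maps directly onto scaled dot-product attention. Below is a dependency-free sketch for a single query; the 2-dimensional vectors are made up, and in a real Transformer the queries, keys, and values are learned projections of the token embeddings.

```python
import math


def softmax(scores):
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)                # the "yellow highlighter" strengths
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Made-up vectors: the query points the same way as the first key.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention([4.0, 0.0], keys, values)
```

The first key matches the query best, so its value dominates the output; the other two contribute only a little.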
🧠 Vivid Story
CNN FILTERS = Sliding a MAGNIFYING GLASS across an image, one patch at a time
CONVOLVE · POOL · REPEAT — THE CNN PIPELINE
How CNNs process images
A CNN slides small filters (like magnifying glasses) across the image, detecting features: edges in layer 1, shapes in layer 2, object parts in layer 3, full objects in deeper layers. This is convolution. Then pooling shrinks the map (keeping the strongest signals) to reduce computation. Stacking many convolution+pool layers builds a hierarchy from pixels to objects. The filters are learned — nobody hand-coded "detect an eye."
🔑 Key Distinction
RNN has MEMORY — reads left to right and remembers. Transformer reads ALL AT ONCE.
SEQUENTIAL vs PARALLEL — THE BIG ARCHITECTURAL DIFFERENCE
RNN vs Transformer — why Transformers won
RNNs process a sequence one step at a time, passing a hidden state forward -- like reading a book one word at a time, updating your memory. Problem: by the time you reach word 100, word 1 is nearly forgotten (vanishing gradient). Transformers process all words simultaneously using attention -- every word sees every other word at once. This parallelism made Transformers dramatically faster to train on GPUs and better at long-range dependencies. Result: RNNs are largely obsolete for NLP.
💡 Concept Anchor
FINE-TUNING = Starting with a TRAINED CHEF and teaching them YOUR recipes
PRE-TRAIN ON EVERYTHING · FINE-TUNE ON YOUR TASK
Pre-training and fine-tuning — the modern deep learning workflow
Pre-training: train a massive model on enormous general data (all of Wikipedia, billions of web pages). It learns language, facts, and reasoning. Fine-tuning: take that pre-trained model and train it a little more on your specific task (medical records, legal documents, customer emails). The chef already knows how to cook — you just teach them your menu. This is why GPT-4, Claude, and BERT work so well with relatively little task-specific data.
📅 Quick Reference
BATCH NORMALIZATION = Resetting the VOLUME between songs so none blare and none whisper
NORMALIZE · STABILIZE · ACCELERATE TRAINING
Batch normalization — why deep networks train faster with it
As data flows through deep network layers, values can explode (very large) or vanish (very small), causing unstable training. Batch normalization normalizes each layer's outputs across the mini-batch to mean=0 and variance=1, then applies a learnable scale and shift, before passing them to the next layer. Like adjusting the volume between every song in a playlist so they're all at the same level. Result: faster training, higher learning rates, and less sensitivity to weight initialization.
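A minimal sketch of the training-time computation for one feature across a mini-batch. The parameter names `gamma` and `beta` follow the usual convention for the learnable scale and shift; the running statistics used at inference time are omitted, and the activation values are made up.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then apply a learnable scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)   # biased variance over the batch
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Made-up activations that are large and off-center.
activations = [100.0, 102.0, 98.0, 104.0, 96.0]
normalized = batch_norm(activations)
```

After normalization the values are centered on 0 with variance close to 1, whatever scale the previous layer produced.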
ENCODER · DECODER · ATTENTION — the three-part architecture
The Transformer architecture — how it actually works
The Transformer has two stacks. Encoder: reads the input sequence, builds rich contextual representations using self-attention (every word attends to every other word). Decoder: generates the output sequence token by token, attending to both the encoder output and its own previous outputs. Attention mechanism: lets any position in the sequence directly influence any other, regardless of distance. This solved the long-range dependency problem that plagued RNNs.
Encoder
Reads full input, builds contextual representations — used alone in BERT
Decoder
Generates output token by token — used alone in GPT
Encoder-Decoder
Full sequence-to-sequence — used in translation (T5, BART)
🐍 Code
# BERT = encoder only (understanding tasks)
# GPT = decoder only (generation tasks)
# T5 = encoder + decoder (translation, summarization)
from transformers import BertModel, GPT2Model

bert = BertModel.from_pretrained("bert-base-uncased")  # encoder-only checkpoint
gpt2 = GPT2Model.from_pretrained("gpt2")  # decoder-only checkpoint
🎯 Exam Favorite
RESIDUAL CONNECTION = SKIP the highway — gradient flows straight through even 100 layers
RESNET SOLVED THE DEEP NETWORK PROBLEM
Residual connections — why networks can now be 100+ layers deep
Before ResNet (2015), networks deeper than ~20 layers degraded -- adding layers made things worse, not better. Residual connections add a shortcut: output = F(x) + x. The network learns the residual (the correction) rather than the full mapping. Even if F(x) learns nothing, the identity x passes through unchanged. This lets gradients flow directly through skip connections, keeping early layers learning. ResNet-152 has 152 layers. Modern LLMs stack dozens to over a hundred transformer blocks -- all using residual connections.
Without residual
Gradient vanishes in deep networks — early layers stop learning
With residual
output = F(x) + x — gradient flows through skip path unchanged
Why it works
Network only needs to learn the residual (correction), not the full mapping
🐍 Code
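import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # F(x): any sub-network whose output has the same shape as x
        self.layers = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.layers(x) + x  # skip connection: output = F(x) + x

# The + x is the residual connection -- the key innovation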
🔑 Key Distinction
DIFFUSION = Learn to DENOISE · Reverse the noise step by step to generate images
FORWARD NOISE → BACKWARD DENOISE → IMAGE
Diffusion models — how DALL-E and Stable Diffusion work
Diffusion models have displaced GANs as the state of the art for image generation. Forward process: gradually add Gaussian noise to an image over many steps (1000 is typical) until it looks like pure static. Reverse process: train a neural network to predict and remove that noise, step by step. At generation time: start with pure noise and apply the learned denoising process step by step (up to 1000 steps, though fast samplers use far fewer) -- a coherent image emerges. More stable to train than GANs, better quality, and supports text conditioning (Stable Diffusion, DALL-E 2, Midjourney).
Forward process
Real image → add noise 1000 steps → pure static
Reverse process
Learn to predict the noise at each step and remove it
Generation
Start from random noise → denoise 1000 steps → coherent image
🐍 Code
# Using Hugging Face diffusers library
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a photorealistic cat on the moon").images[0]
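The forward (noising) half of a diffusion model needs no neural network and can be sketched directly. This assumes a linear noise schedule from 1e-4 to 0.02 over 1000 steps, the setting used in the DDPM paper; the list-of-floats "image" and the function `add_noise` are simplifications for illustration.

```python
import math
import random

T = 1000
# Assumed linear noise schedule: beta grows from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] is the running product of (1 - beta): the share of signal variance left at step t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def add_noise(x0, t, rng):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal * x + sigma * n for x, n in zip(x0, noise)]
    return x_t, noise   # the denoiser is trained to predict `noise` from (x_t, t)


rng = random.Random(0)  # seeded only so the example is repeatable
x0 = [1.0, -1.0, 0.5]   # stand-in for image pixels
x_last, noise_last = add_noise(x0, T - 1, rng)
```

By the final step almost none of the original signal is left, which is why generation can start from pure Gaussian noise.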
💡 Concept Anchor
RLHF = Human thumbs up/down trains a reward model that then trains the LLM
REINFORCE WITH HUMAN FEEDBACK — HOW CHATGPT BECAME HELPFUL
RLHF — the technique that made LLMs actually useful
Base LLMs (pretrained on internet text) are knowledgeable but unhelpful -- they complete text rather than follow instructions. RLHF fixes this in three steps. Step 1 -- supervised fine-tuning on human-written demonstrations, so the model learns to follow instructions. Step 2 -- collect human comparisons of model outputs (which response is better?) and train a reward model to predict those preferences. Step 3 -- use reinforcement learning (PPO) to fine-tune the LLM to maximize the reward. Result: a model that follows instructions, refuses harmful requests, and gives helpful responses. ChatGPT, Claude, and Gemini all use variants of RLHF.
Step 1
Supervised fine-tuning on demonstrations — model learns to follow instructions
Step 2
Reward model trained on human preference comparisons
Step 3
PPO reinforcement learning maximizes reward model score
🐍 Code
# RLHF is not easily reproducible in a snippet
# The three phases conceptually:
# 1. sft_model = finetune(base_llm, instruction_demos)
# 2. reward_model = train(human_preference_data)
# 3. rlhf_model = ppo_train(sft_model, reward_model)
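One piece of the pipeline that does fit in a snippet is the reward model's training signal in phase 2. A common formulation is the pairwise (Bradley-Terry style) loss sketched below; the scalar rewards passed in are placeholders for what a reward model would output, and `preference_loss` is a name chosen for this example.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# Placeholder scores for two responses to the same prompt.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

The loss is small when the reward model scores the human-preferred response higher and large when it ranks the pair the wrong way round, so minimizing it teaches the model to reproduce human rankings.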
Q: What is the attention mechanism and why is it important?
A: Attention allows every position in a sequence to directly influence every other position, regardless of distance. For each token, it computes Query (what am I looking for?), Key (what do I offer?), and Value (my actual content). Attention scores weight how much each position contributes to the current output. This solves the long-range dependency problem that plagued RNNs — a word at position 1 can directly influence position 100 with no degradation.
Q: What is the difference between GPT and BERT?
A: Both are Transformer-based. BERT uses the encoder — trained by predicting masked words using bidirectional context. Best for understanding tasks: classification, NER, question answering. GPT uses the decoder — trained to predict the next word autoregressively, left to right. Best for generation tasks: writing, completing text, chat. BERT = the reader. GPT = the writer.
Q: What is transfer learning and why is it the standard approach?
A: Transfer learning reuses a model trained on one (large) task for another (smaller) task. In deep learning: pre-train on massive data (all of Wikipedia, billions of images), then fine-tune on your specific task with a small labeled dataset. The pre-trained features (edge detectors, grammar patterns, world knowledge) transfer and dramatically reduce the data and compute needed for the target task.
Q: Explain the GAN architecture and its training instability problem.
A: A GAN has two networks: Generator (creates fake data) and Discriminator (distinguishes real from fake). They compete — as the discriminator improves at detecting fakes, the generator must improve at fooling it. Training instability: mode collapse (generator produces limited variety), discriminator winning too fast (generator gets no useful gradient), and non-convergence. Solutions include Wasserstein loss, progressive growing, and spectral normalization. Diffusion models have largely replaced GANs for image generation due to more stable training.
Q: What is a Vision Transformer (ViT) and how does it differ from a CNN?
A: ViT splits an image into fixed-size patches (e.g., 16×16 pixels), treats each patch as a token (like a word in NLP), and processes them with a standard Transformer encoder. CNNs use spatial convolutions that detect local features in the same way at every position -- locality and translation equivariance are inductive biases baked into the architecture. ViTs have far weaker built-in spatial assumptions -- they learn spatial relationships from data. ViTs can outperform CNNs when pre-trained on large datasets but need more data to train from scratch.