Deep Learning Memory Tricks — Free AI Mnemonics

Deep Learning

Memory Tricks

Proven Mnemonics & Acronyms — fast to learn, hard to forget.

CNNs

CNN SEES -- Convolutional layers detect features, Pooling shrinks size, FC layer classifies

PIXELS TO EDGES TO SHAPES TO OBJECTS -- LEARNED AUTOMATICALLY

Early layers detect low-level features. Deep layers detect high-level concepts. This hierarchy is learned from data.

Convolutional layer: filters slide across the image detecting local features (edges, textures, shapes). Max pooling: reduces spatial dimensions and adds tolerance to small translations. Fully connected: flattens the feature maps and classifies. Key insight: early layers detect edges and corners, deeper layers detect faces and objects. Parameter sharing: the same filter is applied at every position -- far fewer parameters than a dense network.

Convolutional layer

Learned filters detect local features at every position

Max pooling

Reduces spatial size -- 2x2 pool with stride 2 halves width and height

Fully connected

Flattens and classifies from deep features

Parameter sharing

Same filter applied everywhere -- drastically fewer parameters
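To make the parameter-sharing and pooling claims concrete, here is a minimal sketch in plain Python. The layer sizes (a 32x32 RGB input, 16 filters of size 3x3) and the small feature map are illustrative assumptions, not values from the cards above.

```python
# Parameter counts: one conv layer vs. one dense layer on the same input.
# All sizes below are illustrative assumptions.
height, width, in_channels = 32, 32, 3
out_channels, kernel = 16, 3

# Conv: each filter has kernel*kernel*in_channels weights plus one bias,
# and the same filter is reused at every spatial position.
conv_params = out_channels * (kernel * kernel * in_channels + 1)

# Dense: every input value connects to every output unit.
dense_inputs = height * width * in_channels
dense_outputs = height * width * out_channels
dense_params = dense_inputs * dense_outputs + dense_outputs


def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on a list-of-lists feature map."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]) - 1, 2)]
        for r in range(0, len(grid) - 1, 2)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)  # 4x4 map becomes 2x2

print(conv_params, dense_params, pooled)
```

Under these assumptions the convolution needs a few hundred parameters while the dense layer producing the same output size needs tens of millions, which is the point of "same filter applied everywhere".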

RNNs and LSTMs

RNN = hidden state carries MEMORY forward -- LSTM adds gates to control what to remember

VANILLA RNN VANISHES -- LSTM GATES SOLVE IT

LSTM has three gates: forget what to erase, input what to add, output what to expose

RNNs process sequences by maintaining hidden state memory. Vanilla RNN: vanishing gradients over long sequences -- forgets early inputs. LSTM (Long Short-Term Memory): forget gate (what to erase), input gate (what new info to add), output gate (what to expose). GRU: simplified LSTM with two gates -- fewer parameters, comparable performance. Now largely replaced by Transformers for most NLP tasks.

Vanilla RNN

Vanishing gradients over long sequences -- forgets early input

LSTM forget gate

Decides what to erase from cell state

LSTM input gate

Decides what new information to store

LSTM output gate

Decides what to expose as hidden state
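The three gates can be sketched as a single LSTM step. This is a simplified illustration with scalar states and hypothetical hand-picked weights (real LSTMs use weight matrices and learn them); the gate names and the `weights` dictionary layout are assumptions made for readability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step with scalar input, hidden state and cell state.

    `w` maps each gate name to hypothetical (w_x, w_h, bias) scalars.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # the new candidate content

    c = f * c_prev + i * g           # updated cell state (long-term memory)
    h = o * math.tanh(c)             # updated hidden state (what is exposed)
    return h, c


weights = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

When the forget gate saturates near 1 and the input gate near 0, the cell state passes through almost unchanged, which is how an LSTM can carry information across many steps.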

Transformers

Attention Is All You Need -- Google 2017 paper that replaced RNNs for language

ENCODER AND DECODER -- SELF-ATTENTION AND POSITIONAL ENCODING

Positional encoding is required because attention has no inherent sense of word order

No recurrence -- processes the entire sequence in parallel. Self-attention: every token attends to every other. Positional encoding: sine/cosine functions in the original paper inject position information (without it: dog bites man = man bites dog). Layer normalization + residual connections around each sub-layer. Encoder-only (BERT): understanding tasks. Decoder-only (GPT, Claude): generation. Encoder-Decoder (T5): translation and summarization. Foundation of nearly all modern LLMs.

Self-attention

Every token attends to every other -- global context from layer 1

Positional encoding

Required -- attention has no inherent sense of order

Encoder-only (BERT)

Bidirectional, understanding tasks: classification, NER, QA

Decoder-only (GPT)

Left-context only, autoregressive generation
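The sine/cosine positional encoding mentioned above can be sketched in a few lines. This follows the formula from the original Transformer paper; the function name and the tiny `d_model` of 8 are assumptions chosen for illustration.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position (d_model assumed even)."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even index
        encoding.append(math.cos(angle))  # odd index
    return encoding


pe0 = positional_encoding(0, 8)
pe1 = positional_encoding(1, 8)
print(pe0)
print(pe1)
```

Each position gets a distinct vector, so adding it to the token embedding lets attention tell "dog bites man" from "man bites dog".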

LLMs

PRETRAIN then FINE-TUNE -- learn everything first, then specialize, then align

PRETRAINING AND SFT AND RLHF -- THREE TRAINING STAGES

Emergent abilities can appear abruptly at scale -- not explicitly trained for, and hard to predict in advance

Pretraining: train a massive transformer on a huge text corpus to predict the next token; self-supervised (the next token IS the label). Supervised Fine-Tuning (SFT): train on curated instruction-response pairs -- teaches instruction following. RLHF: human raters rank outputs, a reward model is trained on those rankings, and RL fine-tunes the LLM to maximize the reward model's score -- aligns outputs with human preferences. Emergent abilities: few-shot learning, chain-of-thought, and code generation can appear abruptly at scale.

Pretraining

Next token prediction on massive corpus -- self-supervised

SFT

Curated instruction-response pairs -- teaches instruction following

RLHF

Human preferences + reward model + RL = aligned LLM

Emergent abilities

Appear suddenly at scale -- few-shot, CoT, code generation
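"The next token IS the label" can be shown directly: one sentence yields several training examples with no human annotation. A minimal sketch, assuming whitespace tokenization (real LLMs use subword tokenizers) and a hypothetical helper name:

```python
def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(" ".join(context), "->", target)
```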

GANs and Diffusion

GENERATOR fakes it -- DISCRIMINATOR detects fakes -- They compete until generator wins

TWO FIGHTERS COMPETING -- OR STEP-BY-STEP DENOISING FROM NOISE

Diffusion models now dominate image generation -- more stable than GANs, better diversity

GANs: Generator produces fake data, Discriminator classifies real vs fake, trained adversarially. Challenges: mode collapse (generator produces only a few types), training instability. Diffusion models: forward process adds noise over T steps, reverse process learns to denoise -- generate by starting from pure noise and denoising. Stable Diffusion, DALL-E 3, Midjourney use diffusion. More stable, better diversity, easier text conditioning.

GAN Generator

Takes random noise, produces synthetic data, tries to fool discriminator

GAN Discriminator

Classifies real vs fake -- trained adversarially

Mode collapse

Generator produces only few output types -- major GAN failure

Diffusion

Denoise from pure noise -- most stable, best quality, dominant now
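The adversarial game can be written as two loss functions. The sketch below uses the standard binary cross-entropy discriminator loss and the commonly used non-saturating generator loss; the probability values are made up for illustration, and no networks are actually trained here.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples labeled 1, generated samples labeled 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator scores fakes as real."""
    return -math.log(d_fake)


# d_real / d_fake are the discriminator's probability-of-real outputs.
confident_d = discriminator_loss(d_real=0.9, d_fake=0.1)  # discriminator winning
fooled_d = discriminator_loss(d_real=0.5, d_fake=0.5)     # cannot tell them apart
print(confident_d, fooled_d, generator_loss(0.1), generator_loss(0.9))
```

The generator's loss falls as the discriminator's output on fakes rises, so the two objectives pull in opposite directions.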

Embeddings

king - man + woman = queen -- meaning encoded as geometry in vector space

DISCRETE ITEMS AS CONTINUOUS DENSE VECTORS WHERE SIMILAR ITEMS ARE CLOSE

Cosine similarity measures the angle between vectors -- standard for embedding comparison

Embeddings map discrete items (words, images, users) to dense vectors where similar items are geometrically close. Word2Vec/GloVe: static -- same word gets same vector regardless of context. BERT: contextual -- same word gets different vector based on context. Cosine similarity measures closeness. Vector databases store embeddings for semantic search. Foundation of RAG: embed documents, retrieve similar chunks as LLM context.

Word2Vec/GloVe

Static -- same word always same vector, arithmetic works

BERT embeddings

Contextual -- bank in finance vs bank of river = different vectors

Cosine similarity

Cosine of the angle between vectors -- range -1 to 1, standard metric

Vector databases

Store embeddings for efficient similarity search (RAG)
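Cosine similarity and the nearest-neighbor lookup a vector database performs can be sketched as follows. The three-dimensional embeddings are invented for illustration; real embeddings have hundreds or thousands of dimensions, and real vector databases use approximate indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, made up for illustration.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.2, 0.9],
}


def most_similar(word):
    """Brute-force search: compare the query against every other stored vector."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(most_similar("cat"))
```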

LoRA and PEFT

LoRA = Low-Rank Adaptation -- train roughly 1% of the parameters (often far fewer), get close to full fine-tuning quality

FREEZE THE BASE, TRAIN A LOW-RANK UPDATE, MERGE BACK FOR ZERO ADDED LATENCY

QLoRA: 4-bit quantization plus LoRA -- fine-tune a 65B model on a single 48 GB GPU

LoRA: instead of updating the full weight matrix W (d x k), add a low-rank update DeltaW = B times A, where B is d x r and A is r x k with r much smaller than min(d, k). Only A and B are trained -- 100-10,000x fewer trainable parameters. At inference: merge DeltaW back into W -- no latency penalty. QLoRA: combine a 4-bit quantized base model with LoRA adapters -- made fine-tuning large LLMs feasible on a single GPU.

LoRA r parameter

Rank of the update -- r=8-16 typical, controls parameter count

Merge at inference

LoRA weights merged back -- zero added latency

QLoRA

4-bit base model + LoRA adapters -- fine-tune 65B on a single 48 GB GPU

PEFT category

Parameter-Efficient Fine-Tuning -- LoRA, prefix tuning, adapters
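The parameter savings and the merge step can be checked with simple arithmetic. In this sketch the layer size (4096 x 4096), rank (8), and the tiny 3 x 3 matrices are illustrative assumptions, and the usual LoRA scaling factor is omitted for brevity.

```python
def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_param_counts(d, k, r):
    """Trainable parameters: full update of W (d x k) vs. LoRA factors B (d x r), A (r x k)."""
    return d * k, r * (d + k)


full, lora = lora_param_counts(d=4096, k=4096, r=8)  # illustrative sizes

# Tiny merge example: W_merged = W + B @ A (scaling factor omitted for brevity).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weights, 3 x 3
B = [[1.0], [0.0], [2.0]]                                 # 3 x 1, trained
A = [[0.5, 0.0, -0.5]]                                    # 1 x 3, trained
delta = matmul(B, A)
W_merged = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

print(full, lora, full // lora)
print(W_merged)
```

Because the merged matrix has the same shape as W, inference runs exactly as it would for the base model, which is where "zero added latency" comes from.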

Vision Transformer

ViT -- split image into PATCHES, treat each patch like a TOKEN, run Transformer

AN IMAGE IS WORTH 16x16 WORDS -- THE PAPER THAT LAUNCHED VISION TRANSFORMERS

Swin Transformer: hierarchical ViT with shifted windows -- efficient for detection and segmentation

ViT (Dosovitskiy et al., 2020): split the image into 16x16 pixel patches, flatten each, project to an embedding, add position embeddings, run through a standard Transformer encoder. Global context from layer 1 -- unlike CNNs that build context gradually. Needs more data than CNNs (weaker built-in inductive biases). Swin: hierarchical ViT with shifted windows, attention cost linear in image size, widely used backbone for detection and segmentation.

Patch size

16x16 pixels typical -- 224x224 image = 196 patches

Position embedding

Required -- transformer has no inherent sense of patch order

Global attention

Every patch attends to every other -- from the first layer

Swin Transformer

Hierarchical + shifted windows -- cost linear in image size, efficient for high-resolution
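The "224x224 image = 196 patches" figure follows from simple arithmetic, sketched below. The helper name and the assumption of a square RGB image are illustrative.

```python
def vit_patch_stats(image_size, patch_size, channels=3):
    """Number of patches and flattened length of each patch for a square image."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim


num_patches, patch_dim = vit_patch_stats(224, 16)
attention_pairs = num_patches ** 2  # global attention compares every patch with every patch

print(num_patches, patch_dim, attention_pairs)
```

The quadratic growth of `attention_pairs` with the number of patches is the cost that windowed designs such as Swin are meant to reduce.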

Mixture of Experts

MoE -- MANY experts, SPARSE routing -- only a few experts (often 1-2) active per token

SCALE MODEL CAPACITY WITHOUT SCALING COMPUTE PROPORTIONALLY

Mixtral 8x7B: 47B total parameters, only 13B active per token -- same compute as 13B model

MoE replaces dense feed-forward layers with N expert networks, routing each token to only a few of them (often 1-2). Gate network: learned router decides which experts handle each token. Sparse activation: only K of N experts active per token. Mixtral 8x7B: 8 experts, 2 active per token, 47B total params but 13B active compute. Load balancing: auxiliary loss prevents all tokens routing to the same expert.

Expert networks

N parallel feed-forward networks replacing one dense layer

Gating network

Learned router -- which K experts handle this token?

Sparse activation

Only K of N active per token -- compute = small dense model

Load balancing

Auxiliary loss prevents collapse to same 1-2 experts
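Top-K routing can be sketched with a toy layer. This is one common variant (softmax over the selected experts' router scores); the eight "experts" here are trivial scaling functions and the router scores are invented, so treat every name and number as an assumption.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    chosen = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items()), gates


# Eight hypothetical "experts": expert i just multiplies its input by (i + 1).
experts = [lambda x, scale=i + 1: scale * x for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up scores for one token
output, gates = moe_layer(3.0, experts, router_logits, k=2)

print(gates, output)
```

Only two of the eight expert functions are ever called for this token, which is why active compute stays far below total parameter count.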

Self-Supervised Learning

CREATE your own labels from the data -- predict what was hidden, no human annotation needed

MASK IT OR ROTATE IT OR CROP IT -- PREDICT WHAT WAS HIDDEN TO LEARN REPRESENTATIONS

Foundation models (BERT, GPT, CLIP, Stable Diffusion) are ALL pretrained without hand-made labels -- CLIP and Stable Diffusion learn from image-caption pairs collected from the web

Self-supervised creates supervision from data structure itself -- no human labels. NLP: BERT predicts masked tokens, GPT predicts next token. Computer Vision: SimCLR (match two augmented views of same image), MAE (mask 75% of image patches, reconstruct them). Contrastive learning: pull same-sample representations together, push different samples apart. Enables learning from internet-scale unlabeled data.

Masked LM (BERT)

Predict masked tokens using bidirectional context

Next token (GPT)

Predict next token given left context -- autoregressive

SimCLR/DINO

Match augmented views of the same image -- contrastive loss in SimCLR, self-distillation in DINO

MAE

Mask 75% of image patches, reconstruct -- forces global understanding
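"Create your own labels" can be illustrated with a simplified masking routine: hide some tokens and keep the originals as the targets. This is a sketch only; BERT's actual recipe masks about 15% of tokens and sometimes substitutes random or unchanged tokens, and the function name, mask rate, and seed here are assumptions.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; the hidden originals become the labels."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels[position] = token  # supervision comes from the data itself
        else:
            masked.append(token)
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_rate=0.3, seed=1)
print(masked)
print(labels)
```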

Deep RL

DQN = Deep Q-Network -- neural net approximates Q-values for large state spaces

EXPERIENCE REPLAY AND TARGET NETWORK -- THE TWO INNOVATIONS THAT MADE DQN WORK

PPO is the dominant Deep RL algorithm -- used for RLHF in ChatGPT and Claude

Deep RL combines RL with neural networks for large or continuous state spaces. DQN: neural network approximates Q(s,a) for all actions. Experience Replay: store transitions, sample randomly -- breaks temporal correlations. Target Network: frozen network for Q-targets -- stabilizes training. Policy Gradient: directly learn policy pi(a|s). PPO (Proximal Policy Optimization): clips policy updates to prevent destructive large steps -- most widely used DRL algorithm, used in RLHF.

Experience Replay

Store transitions, sample randomly -- breaks temporal correlations

Target Network

Frozen network for Q-targets -- stabilizes training

Policy Gradient

Directly learn the policy instead of Q-values

PPO

Clips policy updates -- prevents destructive large steps, used in RLHF
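Two of the formulas behind these cards fit in a few lines: the PPO clipped surrogate for a single sample and the DQN target computed from the frozen target network's Q-values. The default `epsilon` and `gamma` values are common choices rather than values from the text, and the inputs are made up.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """TD target r + gamma * max_a' Q_target(s', a'), using the target network's values."""
    return reward if done else reward + gamma * max(next_q_values)


# ratio = new policy probability / old policy probability for the sampled action.
gain_capped = ppo_clipped_objective(ratio=1.5, advantage=1.0)   # gain limited by the clip
loss_kept = ppo_clipped_objective(ratio=0.5, advantage=-1.0)    # pessimistic branch chosen
target = dqn_target(reward=1.0, next_q_values=[0.5, 2.0, 1.0])

print(gain_capped, loss_kept, target)
```

Taking the minimum means the policy gets no extra credit for moving far from the old policy, which is what discourages destructive large steps.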

🎯 Exam Favorite

ATTENTION = The model HIGHLIGHTS what matters — like a student with a yellow marker

QUERY · KEY · VALUE — THE ATTENTION TRIANGLE

The attention mechanism — how Transformers read

Attention lets the model focus on relevant parts of the input when processing each word. For every word, it computes a Query (what am I looking for?), compares it to Keys (what does each word offer?), and retrieves Values (the actual content) weighted by relevance. High attention score = highlighted in yellow. This lets "it" in "The trophy didn't fit in the suitcase because it was too big" correctly link to "trophy" not "suitcase."
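The Query/Key/Value description corresponds to scaled dot-product attention, sketched here for a single query in plain Python. The two-dimensional vectors are invented for illustration; real models use learned projections and many attention heads.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)  # relevance of each position, sums to 1
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]
output, weights = attention(query, keys, values)

print(weights)
print(output)
```

Positions whose keys align with the query receive the largest weights, the "yellow highlighter" of the mnemonic.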

🧠 Vivid Story

CNN FILTERS = Sliding a MAGNIFYING GLASS across an image, one patch at a time

CONVOLVE · POOL · REPEAT — THE CNN PIPELINE

How CNNs process images

A CNN slides small filters (like magnifying glasses) across the image, detecting features: edges in layer 1, shapes in layer 2, object parts in layer 3, full objects in deeper layers. This is convolution. Then pooling shrinks the map (keeping the strongest signals) to reduce computation. Stacking many convolution+pool layers builds a hierarchy from pixels to objects. The filters are learned — nobody hand-coded "detect an eye."

🔑 Key Distinction

RNN has MEMORY — reads left to right and remembers. Transformer reads ALL AT ONCE.

SEQUENTIAL vs PARALLEL — THE BIG ARCHITECTURAL DIFFERENCE

RNN vs Transformer — why Transformers won

RNNs process a sequence one step at a time, passing a hidden state forward -- like reading a book one word at a time, updating your memory. Problem: by the time you reach word 100, word 1 is nearly forgotten (vanishing gradient). Transformers process all words simultaneously using attention -- every word sees every other word at once. This parallelism made Transformers dramatically faster to train on GPUs and better at long-range dependencies. Result: RNNs are largely obsolete for NLP.

💡 Concept Anchor

FINE-TUNING = Starting with a TRAINED CHEF and teaching them YOUR recipes

PRE-TRAIN ON EVERYTHING · FINE-TUNE ON YOUR TASK

Pre-training and fine-tuning — the modern deep learning workflow

Pre-training: train a massive model on enormous general data (all of Wikipedia, billions of web pages). It learns language, facts, and reasoning. Fine-tuning: take that pre-trained model and train it a little more on your specific task (medical records, legal documents, customer emails). The chef already knows how to cook — you just teach them your menu. This is why GPT-4, Claude, and BERT work so well with relatively little task-specific data.

📅 Quick Reference

BATCH NORMALIZATION = Resetting the VOLUME between songs so none blare and none whisper

NORMALIZE · STABILIZE · ACCELERATE TRAINING

Batch normalization — why deep networks train faster with it

As data flows through deep network layers, values can explode (very large) or vanish (very small), causing unstable training. Batch normalization normalizes each feature of a layer's output across the mini-batch to mean=0 and variance=1, then applies a learned scale and shift before passing it to the next layer. Like adjusting the volume between every song in a playlist so they're all at the same level. Result: faster training, higher learning rates, and less sensitivity to weight initialization.
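The normalization step can be sketched for one feature across one mini-batch. The activation values are made up, and `gamma`, `beta`, and `eps` are the conventional names for the learned scale, learned shift, and numerical-stability constant (here fixed rather than learned).

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then apply scale (gamma) and shift (beta)."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]


activations = [120.0, 80.0, 100.0, 140.0, 60.0]  # made-up values for one feature
normalized = batch_norm(activations)

print(normalized)
```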

⭐ Most Important

TRANSFORMER = ENCODER understands · DECODER generates · ATTENTION connects them

ENCODER · DECODER · ATTENTION — the three-part architecture

The Transformer architecture — how it actually works

The Transformer has two stacks. Encoder: reads the input sequence, builds rich contextual representations using self-attention (every word attends to every other word). Decoder: generates the output sequence token by token, attending to both the encoder output and its own previous outputs. Attention mechanism: lets any position in the sequence directly influence any other, regardless of distance. This solved the long-range dependency problem that held RNNs back.

Encoder

Reads full input, builds contextual representations — used alone in BERT

Decoder

Generates output token by token — used alone in GPT

Encoder-Decoder

Full sequence-to-sequence — used in translation (T5, BART)

🐍 Code

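# BERT = encoder only (understanding tasks)
# GPT  = decoder only (generation tasks)
# T5   = encoder + decoder (translation, summarization)
from transformers import BertModel, GPT2Model, T5ForConditionalGeneration
bert = BertModel.from_pretrained("bert-base-uncased")        # encoder-only
gpt2 = GPT2Model.from_pretrained("gpt2")                     # decoder-only
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")  # encoder-decoder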

🎯 Exam Favorite

RESIDUAL CONNECTION = a SKIP HIGHWAY -- gradient flows straight through, even across 100 layers

RESNET SOLVED THE DEEP NETWORK PROBLEM

Residual connections — why networks can now be 100+ layers deep

Before ResNet (2015), networks deeper than ~20 layers degraded -- adding layers made things worse, not better. Residual connections add a shortcut: output = F(x) + x. The network learns the residual (the correction) rather than the full mapping. Even if F(x) learns nothing, the identity x passes through unchanged. This lets gradients flow directly through skip connections, keeping early layers learning. ResNet-152 has 152 layers. Modern LLMs stack dozens to over a hundred transformer blocks, all using residual connections.

Without residual

Gradient vanishes in deep networks — early layers stop learning

With residual

output = F(x) + x — gradient flows through skip path unchanged

Why it works

Network only needs to learn the residual (correction), not the full mapping

🐍 Code

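import torch.nn as nn
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # F(x): any sub-network whose output shape matches its input shape
        self.layers = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.layers(x) + x  # skip connection
# The + x is the residual connection, the key innovation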

🔑 Key Distinction

DIFFUSION = Learn to DENOISE · Reverse the noise step by step to generate images

FORWARD NOISE → BACKWARD DENOISE → IMAGE

Diffusion models — how DALL-E and Stable Diffusion work

Diffusion models have displaced GANs as the state of the art for image generation. Forward process: gradually add Gaussian noise to an image over many steps (1000 is a common choice) until it looks like pure static. Reverse process: train a neural network to predict and remove that noise, step by step. At generation time: start with pure noise and apply the learned denoising step repeatedly (up to 1000 times, though faster samplers use far fewer steps) until a coherent image emerges. More stable to train than GANs, better quality, and supports text conditioning (Stable Diffusion, DALL-E 2, Midjourney).

Forward process

Real image → add noise 1000 steps → pure static

Reverse process

Learn to predict the noise at each step and remove it

Generation

Start from random noise → denoise 1000 steps → coherent image

🐍 Code

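# Using Hugging Face diffusers library (downloads model weights on first run)
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # assumes a CUDA GPU; CPU works but is very slow
image = pipe("a photorealistic cat on the moon").images[0]
image.save("cat_on_the_moon.png")  # the result is a PIL image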
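The forward (noising) process has a closed form that lets you jump straight to any step. The sketch below assumes the linear noise schedule commonly used with 1000 steps (values from 1e-4 to 0.02) and a made-up four-value "image"; the learned reverse process is not implemented here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]


def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): how much of the original signal survives."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    signal, noise_scale = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


alpha_bars = alpha_bar(linear_beta_schedule(1000))
rng = random.Random(0)
pixels = [0.8, -0.3, 0.5, 0.1]  # a made-up four-"pixel" image
slightly_noisy = add_noise(pixels, 10, alpha_bars, rng)
almost_static = add_noise(pixels, 999, alpha_bars, rng)

print(alpha_bars[0], alpha_bars[-1])
```

Early steps keep almost all of the signal, while by the last step nearly none survives, matching "real image, add noise, pure static".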

💡 Concept Anchor

RLHF = Human thumbs up/down trains a reward model that then trains the LLM

REINFORCEMENT LEARNING FROM HUMAN FEEDBACK -- HOW CHATGPT BECAME HELPFUL

RLHF — the technique that made LLMs actually useful

Base LLMs (pretrained on internet text) are knowledgeable but unhelpful -- they complete text rather than follow instructions. RLHF fixes this: Step 1 -- supervised fine-tuning on human-written demonstrations so the model learns to follow instructions. Step 2 -- collect human comparisons of model outputs (which response is better?) and train a reward model to predict those preferences. Step 3 -- use reinforcement learning (PPO) to fine-tune the LLM to maximize the reward model's score. Result: a model that follows instructions, refuses harmful requests, and gives helpful responses. ChatGPT, Claude, and Gemini all use variants of RLHF.

Step 1

Supervised fine-tuning on demonstrations — model learns to follow instructions

Step 2

Reward model trained on human preference comparisons

Step 3

PPO reinforcement learning maximizes reward model score

🐍 Code

# RLHF is not easily reproducible in a snippet
# The three phases conceptually:
# 1. sft_model = finetune(base_llm, instruction_demos)
# 2. reward_model = train(human_preference_data)
# 3. rlhf_model = ppo_train(sft_model, reward_model)
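Step 2 is the easiest phase to make concrete. A reward model is commonly trained with a pairwise loss that is small when it scores the human-preferred response higher; the sketch below shows that loss for one comparison, with made-up reward values and a hypothetical function name.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)) for one human comparison."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


agrees = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)     # model agrees with the rater
disagrees = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)  # model disagrees
undecided = preference_loss(0.0, 0.0)                                 # equal scores

print(agrees, disagrees, undecided)
```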

🎓 Common Exam Questions

Q: What is the attention mechanism and why is it important?

A: Attention allows every position in a sequence to directly influence every other position, regardless of distance. For each token, it computes Query (what am I looking for?), Key (what do I offer?), and Value (my actual content). Attention scores weight how much each position contributes to the current output. This solves the long-range dependency problem that plagued RNNs — a word at position 1 can directly influence position 100 with no degradation.

Q: What is the difference between GPT and BERT?

A: Both are Transformer-based. BERT uses the encoder — trained by predicting masked words using bidirectional context. Best for understanding tasks: classification, NER, question answering. GPT uses the decoder — trained to predict the next word autoregressively, left to right. Best for generation tasks: writing, completing text, chat. BERT = the reader. GPT = the writer.

Q: What is transfer learning and why is it the standard approach?

A: Transfer learning reuses a model trained on one (large) task for another (smaller) task. In deep learning: pre-train on massive data (all of Wikipedia, billions of images), then fine-tune on your specific task with a small labeled dataset. The pre-trained features (edge detectors, grammar patterns, world knowledge) transfer and dramatically reduce the data and compute needed for the target task.

Q: Explain the GAN architecture and its training instability problem.

A: A GAN has two networks: Generator (creates fake data) and Discriminator (distinguishes real from fake). They compete — as the discriminator improves at detecting fakes, the generator must improve at fooling it. Training instability: mode collapse (generator produces limited variety), discriminator winning too fast (generator gets no useful gradient), and non-convergence. Solutions include Wasserstein loss, progressive growing, and spectral normalization. Diffusion models have largely replaced GANs for image generation due to more stable training.

Q: What is a Vision Transformer (ViT) and how does it differ from a CNN?

A: ViT splits an image into fixed-size patches (e.g., 16×16 pixels), treats each patch as a token (like a word in NLP), and processes them with a standard Transformer encoder. CNNs use spatial convolutions that detect local features and are translation-equivariant; locality and weight sharing are inductive biases baked into the architecture. ViTs have far weaker built-in spatial assumptions and learn spatial relationships from data. ViTs can outperform CNNs when pretrained on large datasets but need more data to train from scratch.
