Memory tricks that make neurons, backprop and deep architectures click
From the single neuron to backpropagation through thousands of layers -- these memory tricks lock in how neural networks are built, trained, and prevented from overfitting.
Neural Networks
Memory Tricks
Proven Mnemonics & Acronyms -- fast to learn, hard to forget.
Foundation
WISE -- Weights, Input, Summation, Excitation (activation) -- what every neuron does
INPUT TIMES WEIGHT THEN SUM THEN ACTIVATE THEN OUTPUT
Without activation functions, stacking layers is pointless -- still just one linear function
Inputs times Weights, summed, plus bias: z = w1x1 + w2x2 + b. The activation function then applies a nonlinearity to z. Without activations, a network of any depth can only represent linear (affine) functions. With nonlinear activations and enough hidden units, a network can approximate any continuous function on a bounded domain (Universal Approximation Theorem). The activation function is what makes deep learning powerful.
Weights
Learned parameters -- how important is each input?
Bias
Allows activation threshold to shift -- also learned
Summation z
z = w1x1 + w2x2 + ... + b -- the dot product of weights and inputs, plus the bias
Activation
Applies nonlinearity -- without it, deep = pointless
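The WISE steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function names and all numeric values are made up for the example. The second half checks the claim that stacking layers without an activation collapses into a single linear function.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    # "No activation": the neuron stays purely linear.
    return z

# Two inputs, two weights, one bias (values chosen arbitrarily).
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1

z = neuron(x, w, b, identity)   # 0.5*1.0 + (-0.25)*2.0 + 0.1
a = neuron(x, w, b, sigmoid)    # same sum, then squashed into (0, 1)

# Without a nonlinearity, two stacked neurons collapse into one linear map:
# w2 * (w1 * x + b1) + b2  ==  (w2 * w1) * x + (w2 * b1 + b2)
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5
stacked = neuron([neuron([4.0], [w1], b1, identity)], [w2], b2, identity)
collapsed = neuron([4.0], [w2 * w1], w2 * b1 + b2, identity)
print(z, a, stacked, collapsed)
```

The stacked and collapsed results agree, which is the whole argument for why a nonlinearity is needed between layers.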
Activation Functions
Silly Tom Really Likes -- Sigmoid, Tanh, ReLU, Leaky ReLU
Sigmoid (Silly): squashes to 0-1; use at the binary output layer, because its gradient vanishes when stacked in hidden layers. Tanh (Tom): output -1 to 1, centered at zero, still vanishes in deep nets. ReLU (Really): max(0,x), THE default for hidden layers, fast, and its gradient does not shrink for positive inputs. Leaky ReLU (Likes): fixes dying ReLU with a small negative slope. Softmax (not in the mnemonic): used at the multi-class output -- converts logits to probabilities summing to 1.
Sigmoid
Binary output layer ONLY -- never in hidden layers
Tanh
Centered at zero -- slightly better than sigmoid, still vanishes
ReLU
max(0,x) -- default for hidden layers, fast, gradient of 1 for positive inputs
Softmax
Multi-class output -- logits to probabilities (sum=1)
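For reference, the five functions above can be written directly from their definitions. This is a plain-Python sketch; the leaky slope of 0.01 is a common choice assumed here, not something the text specifies.

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Squashes into (-1, 1), centered at zero.
    return math.tanh(z)

def relu(z):
    # Passes positive values through, zeroes out the rest.
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    # Like ReLU, but keeps a small slope for negative inputs (slope is assumed).
    return z if z > 0 else slope * z

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(sigmoid(0.0), tanh(0.0), relu(-2.0), leaky_relu(-2.0), probs)
```

The softmax output always sums to 1 and preserves the ordering of the logits, so the largest logit gets the largest probability.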
Architecture
IHO -- Input, Hidden, Output -- every network has these three types of layers
DEPTH = HIDDEN LAYERS AND WIDTH = NEURONS PER LAYER
Deep networks learn hierarchical features -- edges to shapes to objects to concepts
Input layer: one neuron per feature, no computation. Hidden layers: where representations are learned. Output layer: one neuron per class (classification) or a single neuron (regression). Depth allows hierarchical representation learning. Universal Approximation Theorem: even one hidden layer with enough neurons can approximate any continuous function on a bounded domain, though it may need far more neurons than a deeper network would.
Input layer
One neuron per feature -- no computation, just passes data in
Hidden layers
Where representations are learned -- the depth
Output layer
One neuron per class (classification) or one neuron (regression)
Depth vs width
Deep = hierarchical features. Wide = capacity within a layer.
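One way to make depth versus width concrete is to count the learned parameters of a fully connected network. The helper below is a hypothetical illustration, and the two layer layouts are invented for the comparison.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists neurons per layer: input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per receiving neuron
    return total

# 4 input features, 3 output classes, 24 hidden neurons in both cases.
deep_narrow = [4, 8, 8, 8, 3]   # three hidden layers of 8
shallow_wide = [4, 24, 3]       # one hidden layer of 24

print(count_parameters(deep_narrow), count_parameters(shallow_wide))
```

The input layer contributes no parameters of its own, which matches the "no computation, just passes data in" description above.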
Backpropagation
ERROR flows BACKWARDS like a RIVER OF REGRET -- backpropagation in one vivid image
FORWARD PREDICT THEN MEASURE REGRET THEN FLOW BACKWARD THEN ADJUST
Each weight gets its blame assigned via the chain rule -- then gets updated
Forward pass: data flows forward producing a prediction. Loss: measures how wrong the prediction was -- the regret. Backward pass: regret flows BACKWARD through the network via chain rule of calculus, assigning blame (gradient) to each weight. Weight update: w = w - alpha times gradient. Adam optimizer: adaptive per-parameter learning rate -- the standard default (lr=0.001).
Forward pass
Input flows through network -- prediction at output
Loss
Measures how wrong -- the regret (MSE, cross-entropy)
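The forward, regret, backward, adjust cycle can be traced by hand on a single sigmoid neuron with a squared-error loss. This is a toy sketch under stated assumptions: one training example, plain gradient descent rather than Adam, and an arbitrary learning rate of 0.5.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, y, lr):
    # Forward pass: prediction from the current weight and bias.
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2

    # Backward pass: chain rule, one link at a time.
    dloss_da = 2.0 * (a - y)       # how the loss changes with the output
    da_dz = a * (1.0 - a)          # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    grad_w = dloss_dz * x          # blame assigned to the weight
    grad_b = dloss_dz              # blame assigned to the bias

    # Weight update: w = w - alpha * gradient.
    w -= lr * grad_w
    b -= lr * grad_b
    return w, b, loss

w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.5, y=1.0, lr=0.5)
    losses.append(loss)

print(losses[0], losses[-1])
```

The recorded loss shrinks over the 200 steps, which is the regret flowing backward and being reduced a little on every update.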
Regularization
DROP -- Dropout Randomly Omits Percentages of neurons during training
DROPOUT AND BATCH NORM AND L2 AND EARLY STOPPING -- FOUR TOOLS
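Batch Normalization often speeds training dramatically -- a strong default for CNNs and feed-forward networks
Dropout: 20-50% of neurons disabled per training step, which forces redundant representations. At inference all neurons are active: classic dropout scales activations by the keep probability (1-p), while the inverted dropout used by most modern frameworks scales by 1/(1-p) during training so inference needs no rescaling. Batch Normalization: normalizes each feature to mean=0, std=1 over the mini-batch, then applies a learnable scale and shift; usually speeds training and allows higher learning rates. L2 weight decay: penalizes large weights. Early stopping: monitor validation loss and stop when it stops improving.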
Dropout
20-50% of neurons disabled per step -- inference uses all neurons (classic: scale by 1-p at inference; inverted: scale by 1/(1-p) during training instead)
Batch Normalization
Normalize layer inputs -- speeds training, allows higher LR
L2 weight decay
Penalize large weights -- keeps model simple
Early stopping
Stop when validation loss stops improving
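Two of the four tools are easy to sketch in plain Python. The dropout below is the inverted variant (survivors scaled by 1/(1-p) during training, nothing done at inference), which is an assumption about the framework convention rather than something the text fixes. The early-stopping helper and its `patience` parameter are hypothetical names for illustration.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout: drop each unit with probability p while training
    and scale the survivors by 1/(1-p); leave values untouched at inference."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

hidden = [0.4, 1.2, 0.7, 2.0]
print(dropout(hidden, p=0.5, training=True, rng=random.Random(0)))
print(dropout(hidden, p=0.5, training=False))
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.95], patience=2))
```

Scaling during training keeps the expected activation the same in both modes, which is why the inference path can return the input unchanged.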
Hyperparameters
BLAND -- Batch size, Layers, Activation, Neurons, Dropout -- set BEFORE training
NOT LEARNED FROM DATA -- SET BY THE DEVELOPER BEFORE TRAINING BEGINS
Learning rate is the single most important hyperparameter -- start with 0.001 for Adam
Learning rate (most critical), batch size (32-128 typical), number of layers, neurons per layer, activation functions, dropout rate. Tuning methods: random search (often better than grid for same budget), Bayesian optimization. Transfer learning: start from pretrained model -- dramatically helps with small datasets.
Learning rate
Most critical -- too high=diverges, too low=slow. Start 0.001.
Batch size
32-128 typical. Small=noisy. Large=stable.
Tune with
Random search or Bayesian optimization
Transfer learning
Pretrained model + fine-tune = strong results even with small data
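Random search is simple enough to sketch directly. Everything here is illustrative: the search ranges, the dictionary keys, and especially `toy_validation_loss`, which is a made-up stand-in for "train a model with this config and return its validation loss".

```python
import math
import random

def sample_config(rng):
    """Draw one random hyperparameter configuration (ranges are assumptions)."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # sampled on a log scale
        "batch_size": rng.choice([32, 64, 128]),
        "num_layers": rng.randint(1, 4),
        "dropout": rng.uniform(0.2, 0.5),
    }

def random_search(evaluate, n_trials, seed=0):
    """Try n_trials random configs and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_validation_loss(config):
    # Hypothetical objective: pretends the ideal learning rate is 1e-3.
    return abs(math.log10(config["learning_rate"]) + 3.0)

best_config, best_score = random_search(toy_validation_loss, n_trials=50)
print(best_config, best_score)
```

Sampling the learning rate on a log scale is a deliberate choice: values such as 0.0001, 0.001 and 0.01 are then equally likely to be explored.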
Loss Functions
MSE for regression and Cross-Entropy for classification -- match loss to task
WRONG LOSS FUNCTION = MODEL CANNOT LEARN WHAT YOU NEED
Binary cross-entropy with sigmoid. Categorical cross-entropy with softmax. Always pair correctly.
MSE (Mean Squared Error): regression, sensitive to outliers. MAE: regression, robust to outliers. Binary Cross-Entropy: binary classification with sigmoid output. Categorical Cross-Entropy: multi-class classification with softmax output. Loss is optimized during training (must be differentiable). Metric is what you report (can be anything -- accuracy, F1, AUC).
MSE
Regression -- averages squared prediction errors
Binary CE
Binary classification + sigmoid output
Categorical CE
Multi-class + softmax output
Loss vs Metric
Loss: optimized. Metric: reported. They can differ.
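The four losses can be written out from their definitions. This is a plain-Python sketch; the `eps` clamp is an assumed guard against log(0), and the categorical version takes the index of the true class rather than a one-hot vector.

```python
import math

def mse(y_true, y_pred):
    # Mean of squared errors: large errors dominate.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean of absolute errors: each error counts in proportion to its size.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # y_true holds 0/1 labels, p_pred holds sigmoid outputs in (0, 1).
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def categorical_cross_entropy(true_class, probs, eps=1e-12):
    # probs is a softmax output; only the true class's probability matters.
    return -math.log(max(probs[true_class], eps))

targets = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 5.0]   # one prediction is off by 2
print(mse(targets, predictions), mae(targets, predictions))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
print(categorical_cross_entropy(0, [0.5, 0.25, 0.25]))
```

With a single error of 2, MSE comes out larger than MAE because the error is squared before averaging, which is the outlier sensitivity mentioned above.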
Weight Initialization
ZERO INIT = all neurons learn the SAME thing -- symmetry never breaks
XAVIER FOR SIGMOID AND TANH -- HE FOR RELU -- NEVER ZERO
Random initialization breaks symmetry -- every neuron starts differently and learns different features
Zero initialization: catastrophic -- all neurons in a layer stay identical, receive identical gradients, and the network never diversifies (symmetry problem). Xavier (Glorot): weight variance 2/(n_in + n_out), often simplified to a standard deviation of 1/sqrt(n_in) -- designed for sigmoid and tanh. He initialization: standard deviation sqrt(2/n_in) -- designed for ReLU (the factor of 2 accounts for roughly half of the units outputting zero). Batch Normalization reduces sensitivity to initialization.
Zero init
Catastrophic -- symmetry problem -- never use
Xavier init
For sigmoid and tanh activations
He init
For ReLU -- larger variance since half neurons are zeroed
Batch Norm
Reduces sensitivity to initialization choice
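A short sketch of the two schemes, assuming the common normal-distribution formulations: Xavier (Glorot) with standard deviation sqrt(2 / (n_in + n_out)) and He with sqrt(2 / n_in). The helper names are invented for the example.

```python
import math
import random

def xavier_std(n_in, n_out):
    # Glorot/Xavier: balances variance across the forward and backward pass.
    return math.sqrt(2.0 / (n_in + n_out))

def he_std(n_in):
    # He: the factor of 2 compensates for ReLU zeroing about half the units.
    return math.sqrt(2.0 / n_in)

def init_weights(n_in, n_out, std, rng):
    """Return an n_out x n_in matrix of zero-mean Gaussian weights."""
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
layer = init_weights(n_in=200, n_out=50, std=he_std(200), rng=rng)

# Because the weights are random, no two neurons start out identical,
# so symmetry is broken from the first step.
print(xavier_std(200, 50), he_std(200), layer[0][:3])
```

For the same layer, He initialization gives a larger spread than Xavier whenever n_out is greater than zero, which matches the "larger variance" note above.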
Gradient Problems
DEEP networks VANISH or EXPLODE -- residual connections and clipping are the fixes
VANISHING = EARLY LAYERS DON'T LEARN -- EXPLODING = WEIGHTS BLOW UP
ResNet skip connections largely tamed the vanishing gradient problem -- enabled 100+ layer networks
Vanishing gradients: gradients shrink exponentially backward through many layers -- early layers barely update. Caused by sigmoid/tanh and many layers. Fix: ReLU, batch normalization, residual connections (ResNet skip connections -- gradient highway bypasses layers). Exploding gradients: weights blow up. Fix: gradient clipping (cap magnitude), careful initialization.
Vanishing
Sigmoid + many layers = early layers get nearly zero gradient
Exploding
Gradients grow exponentially -- weights blow up, training diverges
Residual connections
Skip connections -- gradient highway from output to early layers
Gradient clipping
Cap gradient magnitude -- essential for RNNs and LSTMs
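Both problems can be shown with a few lines of arithmetic. The vanishing bound below uses the fact that the sigmoid derivative never exceeds 0.25 and, as a simplifying assumption, ignores the weight terms that also appear in the chain rule. The clipping function rescales by the overall gradient norm, which is one common form of clipping.

```python
import math

def sigmoid_chain_bound(num_layers):
    """Upper bound on the product of sigmoid derivatives across layers.

    Each sigmoid derivative is at most 0.25, so n layers contribute
    at most 0.25 ** n (weights are ignored in this simplified view).
    """
    return 0.25 ** num_layers

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

for depth in (2, 5, 10):
    print(depth, sigmoid_chain_bound(depth))

exploding = [3.0, 4.0]               # L2 norm is 5
print(clip_by_norm(exploding, 1.0))  # direction kept, norm capped at 1
```

Clipping by norm keeps the direction of the update and only shortens it, so a single huge gradient cannot throw the weights far away.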
Attention
SPOTLIGHT on the important words -- attention decides what to illuminate at each step
QUERY TIMES KEY = SCORE -- WEIGHTED SUM OF VALUES = CONTEXT
Multi-head attention runs h parallel spotlights capturing different relationship types
For each output position: Query (what am I looking for?), Key (what do I contain?), Value (what do I provide?). Score = Q dot K divided by sqrt(d_k), then softmax for weights, then weighted sum of Values. Self-attention: tokens attend to each other. Causal (decoder): each position attends only to itself and to positions on its LEFT -- future tokens are masked to enable autoregressive generation.
Query
What am I looking for at this position?
Key
What information do I contain?
Value
What information do I provide to the output?
Causal masking
Decoder looks LEFT only (current and earlier tokens) -- enables autoregressive text generation
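The score, softmax, weighted-sum recipe fits in one small function. This is a single-head sketch in plain Python with made-up 2-dimensional vectors; it omits the learned projection matrices that produce Q, K and V in a real model.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # mask future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        context = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # illustrative embeddings
outputs, weights = attention(tokens, tokens, tokens, causal=True)
for row in weights:
    print(row)
```

With `causal=True` the weight matrix is lower triangular: the first token can only attend to itself, and every row still sums to 1.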
Normalization Layers
BATCH normalizes ACROSS samples -- LAYER normalizes WITHIN a sample
BATCH NORM FOR CNNs -- LAYER NORM FOR TRANSFORMERS AND RNNs
Different axis of normalization -- choose based on your architecture type
Batch Normalization: normalize across batch dimension for each feature -- great for CNNs, needs large batch, fails with small batches or variable-length sequences. Layer Normalization: normalize across feature dimension within each sample -- batch-size independent, works with variable lengths, standard for Transformers and RNNs. Both add learnable scale and shift parameters.
Batch Norm
Across batch -- great for CNNs, needs large consistent batch
Layer Norm
Within each sample -- standard for Transformers and RNNs
RMSNorm
Simpler LayerNorm -- used in Llama and modern LLMs
Batch Norm and Layer Norm add
Learnable scale (gamma) and shift (beta) parameters -- RMSNorm typically keeps only the scale
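The difference between the two is only the axis the statistics are taken over. The sketch below treats a batch as a list of rows (samples) with one column per feature. As a simplification, `gamma` and `beta` are single scalars here, whereas real layers learn one pair per feature, and the running statistics that batch norm uses at inference are left out.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch, gamma=1.0, beta=0.0):
    # Statistics per feature (column), computed ACROSS the samples.
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [[gamma * v + beta for v in row] for row in zip(*columns)]

def layer_norm(batch, gamma=1.0, beta=0.0):
    # Statistics per sample (row), computed WITHIN that sample's features.
    return [[gamma * v + beta for v in normalize(row)] for row in batch]

batch = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
print(batch_norm(batch))
print(layer_norm(batch))
```

Layer norm gives the same result for a sample no matter what else is in the batch, which is why it works with a batch size of one.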
Optimizers
SGD then Momentum then RMSProp then Adam -- each fixed a problem with the previous
ADAM = MOMENTUM PLUS ADAPTIVE LEARNING RATE = THE MODERN DEFAULT
AdamW adds proper weight decay -- the standard for fine-tuning large language models
SGD: simple, often oscillates. Momentum: adds inertia, accelerates in consistent directions. RMSProp: adaptive per-parameter learning rate. Adam: combines momentum plus RMSProp -- most popular, works with defaults (lr=0.001, beta1=0.9, beta2=0.999). AdamW: proper weight decay fix -- better generalization, standard for fine-tuning transformers.
SGD
Simple but can oscillate and stall near saddle points
Momentum
Adds inertia -- faster convergence, less oscillation
Adam
Momentum + RMSProp -- works well with default settings
AdamW
Adam + proper weight decay -- standard for transformer fine-tuning
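The "momentum plus adaptive learning rate" idea is visible in a single-parameter Adam step. This sketch follows the standard update with bias correction and the default betas quoted above; the toy objective f(x) = x^2 and the learning rate of 0.01 are choices made for the demonstration.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m is the running mean of gradients (momentum), v the running mean of
    squared gradients (the RMSProp part), t the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 5.
x, m, v = 5.0, 0.0, 0.0
first_x = None
for t in range(1, 2001):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.01)
    if t == 1:
        first_x = x

print(first_x, x)
```

On the first step the update size is almost exactly the learning rate regardless of how large the gradient is, which is the adaptive scaling at work.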