Neural Networks Memory Tricks -- Free AI Mnemonics

Foundation

WISE -- Weights, Input, Summation, Excitation (activation) -- what every neuron does

INPUT TIMES WEIGHT THEN SUM THEN ACTIVATE THEN OUTPUT

Without activation functions, stacking layers is pointless -- still just one linear function

Inputs times Weights, summed plus bias: z = w1x1 + w2x2 + b. The activation function then applies a nonlinearity to z. Without activations, a network of any depth can only represent linear (affine) functions. With nonlinear activations, a network with enough hidden neurons can approximate any continuous function on a bounded input range (Universal Approximation Theorem). The activation function is what makes deep learning powerful.

Weights

Learned parameters -- how important is each input?

Bias

Allows activation threshold to shift -- also learned

Summation z

z = w1x1 + w2x2 + ... + b -- the dot product of weights and inputs, plus the bias

Activation

Applies nonlinearity -- without it, deep = pointless
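
The WISE steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the names neuron, relu, stacked and collapsed are made up for this example, and the numbers are arbitrary. The second half checks the claim that two linear layers without an activation collapse into one linear function.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weights times Inputs, Summation plus bias, then Excitation (activation)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# One neuron: z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, and relu(0.1) = 0.1
out = neuron([1.0, 2.0], [0.5, -0.25], 0.1, relu)

# Two stacked linear "layers" with no activation (arbitrary example values)
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5

def stacked(x):
    return w2 * (w1 * x + b1) + b2

def collapsed(x):
    # The same function written as ONE linear layer
    return (w2 * w1) * x + (w2 * b1 + b2)
```

For any x, stacked(x) and collapsed(x) agree, which is why depth without a nonlinearity adds nothing.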

Activation Functions

Silly Tom Really Likes -- Sigmoid, Tanh, ReLU, Leaky ReLU

SIGMOID = OUTPUT ONLY -- RELU = HIDDEN LAYERS -- SOFTMAX = MULTI-CLASS

Picture Silly Tom Really Liking lemons -- activation functions unlocked forever

Sigmoid (Silly): squashes to 0-1, use ONLY at binary output layer, vanishing gradient kills it in hidden layers. Tanh (Tom): output -1 to 1, centered at zero, still vanishes in deep nets. ReLU (Really): max(0,x), THE default for hidden layers, fast, largely avoids vanishing gradient for positive inputs. Leaky ReLU (Likes): fixes dying ReLU (neurons stuck at zero output) with a small negative slope. Softmax: always at multi-class output -- converts logits to probabilities summing to 1.

Sigmoid

Binary output layer ONLY -- never in hidden layers

Tanh

Centered at zero -- slightly better than sigmoid, still vanishes

ReLU

max(0,x) -- default for hidden layers, fast, far less vanishing than sigmoid or tanh

Softmax

Multi-class output -- logits to probabilities (sum=1)
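
The functions behind Silly Tom Really Likes (plus softmax) are short enough to write out. This is a sketch using only the standard library; the leaky slope of 0.01 is a common choice assumed here, not a value from the text above, and tanh comes straight from math.tanh.

```python
import math

def sigmoid(x):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """max(0, x)."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope (assumed 0.01)."""
    return x if x > 0 else slope * x

def softmax(logits):
    """Converts a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The largest logit gets the largest probability, and the list probs sums to 1.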

Architecture

IHO -- Input, Hidden, Output -- every network has these three types of layers

DEPTH = HIDDEN LAYERS AND WIDTH = NEURONS PER LAYER

Deep networks learn hierarchical features -- edges to shapes to objects to concepts

Input layer: one neuron per feature, no computation. Hidden layers: where representations are learned. Output layer: one per class (classification) or one neuron (regression). Depth allows hierarchical representation learning. Universal Approximation Theorem: even one hidden layer with enough neurons can approximate any continuous function.

Input layer

One neuron per feature -- no computation, just passes data in

Hidden layers

Where representations are learned -- the depth

Output layer

One per class (classification) or one (regression)

Depth vs width

Deep = hierarchical features. Wide = capacity within a layer.
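
Depth and width translate directly into a parameter count for a fully connected network. The helper below is a hypothetical sketch (count_parameters is not from any library) that assumes dense layers with one bias per neuron; the layer sizes are an arbitrary example.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected net, e.g. [inputs, hidden..., outputs]."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output connection
        total += n_out         # one bias per neuron in the receiving layer
    return total

sizes = [4, 8, 8, 3]            # 4 features, two hidden layers of 8, 3 classes
depth = len(sizes) - 2          # number of hidden layers
width = max(sizes[1:-1])        # neurons in the widest hidden layer
n_params = count_parameters(sizes)
```

Here the count is (4*8+8) + (8*8+8) + (8*3+3) = 139 learned parameters; the input layer contributes none because it performs no computation.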

Backpropagation

ERROR flows BACKWARDS like a RIVER OF REGRET -- backpropagation in one vivid image

FORWARD PREDICT THEN MEASURE REGRET THEN FLOW BACKWARD THEN ADJUST

Each weight gets its blame assigned via the chain rule -- then gets updated

Forward pass: data flows forward producing a prediction. Loss: measures how wrong the prediction was -- the regret. Backward pass: regret flows BACKWARD through the network via chain rule of calculus, assigning blame (gradient) to each weight. Weight update: w = w - alpha times gradient. Adam optimizer: adaptive per-parameter learning rate -- the standard default (lr=0.001).

Forward pass

Input flows through network -- prediction at output

Loss

Measures how wrong -- the regret (MSE, cross-entropy)

Chain rule

Carries error signal backward through all layers

Adam

Adaptive per-parameter learning rate -- start lr=0.001
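
The forward, regret, backward, adjust loop can be traced by hand on a single linear neuron. This is a minimal sketch with plain gradient descent (not Adam) and a squared-error loss; train_step and the training values x=2, y=4, lr=0.05 are assumptions chosen for illustration.

```python
def train_step(w, b, x, y, lr):
    """One forward pass, loss, backward pass and weight update for pred = w*x + b."""
    pred = w * x + b                  # forward pass
    loss = (pred - y) ** 2            # the regret (squared error)
    dloss_dpred = 2.0 * (pred - y)    # backward: derivative of loss w.r.t. prediction
    grad_w = dloss_dpred * x          # chain rule: dloss/dw = dloss/dpred * dpred/dw
    grad_b = dloss_dpred * 1.0        # chain rule: dpred/db = 1
    return w - lr * grad_w, b - lr * grad_b, loss  # w = w - alpha * gradient

w, b = 0.0, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=2.0, y=4.0, lr=0.05)
    losses.append(loss)
```

In a deeper network the same chain rule is applied layer by layer, with each layer passing its error signal back to the one before it.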

Regularization

DROP -- Dropout Randomly Omits Percentages of neurons during training

DROPOUT AND BATCH NORM AND L2 AND EARLY STOPPING -- FOUR TOOLS

Batch Normalization often speeds training so much that it is a common default, especially in CNNs

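Dropout: 20-50% of neurons disabled per step (drop rate p), forces redundant representations; at inference all neurons are used, with outputs scaled by the keep probability (1-p). Most modern frameworks use inverted dropout instead: scale by 1/(1-p) during training and leave inference untouched. Batch Normalization: normalizes layer inputs to mean=0, std=1, speeds training dramatically, allows higher learning rates. L2 weight decay: penalizes large weights. Early stopping: monitor validation loss, stop when it rises.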

Dropout

20-50% of neurons disabled per step -- inference uses all, scaled by (1-p), or scale by 1/(1-p) in training instead (inverted dropout)

Batch Normalization

Normalize layer inputs -- speeds training, allows higher LR

L2 weight decay

Penalize large weights -- keeps model simple

Early stopping

Stop when validation loss stops improving
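
Dropout is easy to sketch. The version below is inverted dropout, the variant frameworks such as PyTorch use: survivors are scaled up by 1/(1-p) during training so inference needs no scaling at all. That differs from scaling by (1-p) at inference, but gives the same expected activation. The function name and arguments are hypothetical.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each value with probability p, rescale the rest by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)      # inference: use every neuron, no scaling
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                # fixed seed so the sketch is repeatable
acts = [1.0] * 10000
train_out = dropout(acts, p=0.5, training=True, rng=rng)
eval_out = dropout(acts, p=0.5, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
```

With p=0.5 every training output is either 0.0 or 2.0, and the average stays close to the original 1.0.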

Hyperparameters

BLAND -- Batch size, Layers, Activation, Neurons, Dropout -- set BEFORE training

NOT LEARNED FROM DATA -- SET BY THE DEVELOPER BEFORE TRAINING BEGINS

Learning rate is the single most important hyperparameter -- start with 0.001 for Adam

Learning rate (most critical), batch size (32-128 typical), number of layers, neurons per layer, activation functions, dropout rate. Tuning methods: random search (often better than grid for same budget), Bayesian optimization. Transfer learning: start from pretrained model -- dramatically helps with small datasets.

Learning rate

Most critical -- too high=diverges, too low=slow. Start 0.001.

Batch size

32-128 typical. Small=noisy. Large=stable.

Tune with

Random search or Bayesian optimization

Transfer learning

Pretrained model + fine-tune = often state-of-the-art results with small data
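
Random search is simple enough to sketch in full. Everything here is illustrative: sample_config and random_search are hypothetical helpers, the ranges echo the typical values above, and toy_validation_loss is a made-up stand-in for "train the model and return validation loss". The learning rate is sampled on a log scale, a common assumption since useful values span several orders of magnitude.

```python
import math
import random

def sample_config(rng):
    """Draw one random hyperparameter setting."""
    return {
        "lr": 10 ** rng.uniform(-5, -1),           # log-uniform between 1e-5 and 1e-1
        "batch_size": rng.choice([32, 64, 128]),
        "dropout": rng.uniform(0.2, 0.5),
    }

def random_search(evaluate, n_trials, seed=0):
    """Try n_trials random configs and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def toy_validation_loss(cfg):
    # Made-up objective that happens to prefer lr near 1e-3 and dropout near 0.3
    return (math.log10(cfg["lr"]) + 3) ** 2 + abs(cfg["dropout"] - 0.3)

best_cfg, best_score = random_search(toy_validation_loss, n_trials=50)
```

In real use, evaluate would train a model with the sampled config and return its validation loss.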

Loss Functions

MSE for regression and Cross-Entropy for classification -- match loss to task

WRONG LOSS FUNCTION = MODEL CANNOT LEARN WHAT YOU NEED

Binary cross-entropy with sigmoid. Categorical cross-entropy with softmax. Always pair correctly.

MSE (Mean Squared Error): regression, sensitive to outliers. MAE: regression, robust to outliers. Binary Cross-Entropy: binary classification with sigmoid output. Categorical Cross-Entropy: multi-class classification with softmax output. Loss is optimized during training (must be differentiable). Metric is what you report (can be anything -- accuracy, F1, AUC).

MSE

Regression -- averages squared prediction errors

Binary CE

Binary classification + sigmoid output

Categorical CE

Multi-class + softmax output

Loss vs Metric

Loss: optimized. Metric: reported. They can differ.
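
The four losses above fit in a few lines each. This is a plain-Python sketch for small lists, not a framework API; the eps clamp is an assumption added to avoid log(0), and the example numbers are arbitrary.

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """y_true holds 0/1 labels, p_pred holds sigmoid outputs in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clamp so log never sees 0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def categorical_cross_entropy(true_class, probs, eps=1e-12):
    """probs is a softmax output; true_class is the index of the correct class."""
    return -math.log(max(probs[true_class], eps))

# One outlier (error of 2) moves MSE more than MAE
reg_mse = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # 4/3
reg_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # 2/3
```

Note how binary cross-entropy expects a sigmoid-style probability and categorical cross-entropy expects a softmax-style distribution, which is the pairing rule above.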

Weight Initialization

ZERO INIT = all neurons learn the SAME thing -- symmetry never breaks

XAVIER FOR SIGMOID AND TANH -- HE FOR RELU -- NEVER ZERO

Random initialization breaks symmetry -- every neuron starts differently and learns different features

Zero initialization: catastrophic -- all neurons identical, all gradients identical, network never diversifies (symmetry problem). Xavier: weight std of about 1/sqrt(n_in) (variance 1/n_in; the original form uses 2/(n_in + n_out)) -- designed for sigmoid and tanh. He initialization: weight std of sqrt(2/n_in) (variance 2/n_in) -- designed for ReLU (accounts for half of neurons being zeroed). Batch Normalization reduces sensitivity to initialization.

Zero init

Catastrophic -- symmetry problem -- never use

Xavier init

For sigmoid and tanh activations

He init

For ReLU -- larger variance since half neurons are zeroed

Batch Norm

Reduces sensitivity to initialization choice
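
The two scaling rules are one line each. The sketch below assumes Gaussian weights and the simplified fan-in form of Xavier, std = sqrt(1/n_in); the original Xavier/Glorot formula uses variance 2/(n_in + n_out). The helper names and layer sizes are hypothetical.

```python
import math
import random

def xavier_std(n_in):
    """Simplified Xavier: variance 1/n_in (for sigmoid and tanh)."""
    return math.sqrt(1.0 / n_in)

def he_std(n_in):
    """He: variance 2/n_in (for ReLU, which zeroes about half the neurons)."""
    return math.sqrt(2.0 / n_in)

def init_weights(n_in, n_out, std, rng):
    """Random Gaussian weights break symmetry; returns n_out rows of n_in weights."""
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
W = init_weights(256, 64, he_std(256), rng)
```

He weights come out sqrt(2) times larger than Xavier weights for the same n_in, which is the "larger variance" noted above.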

Gradient Problems

DEEP networks VANISH or EXPLODE -- residual connections and clipping are the fixes

VANISHING = EARLY LAYERS DON'T LEARN -- EXPLODING = WEIGHTS BLOW UP

ResNet skip connections greatly eased the vanishing gradient problem -- enabled 100+ layer networks

Vanishing gradients: gradients shrink exponentially backward through many layers -- early layers barely update. Caused by sigmoid/tanh and many layers. Fix: ReLU, batch normalization, residual connections (ResNet skip connections -- gradient highway bypasses layers). Exploding gradients: weights blow up. Fix: gradient clipping (cap magnitude), careful initialization.

Vanishing

Sigmoid + many layers = early layers get nearly zero gradient

Exploding

Gradients grow exponentially -- weights blow up, training diverges

Residual connections

Skip connections -- gradient highway from output to early layers

Gradient clipping

Cap gradient magnitude -- essential for RNNs and LSTMs
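
Both problems can be shown with small numbers. The sketch below clips by global norm, which is one common way to cap gradient magnitude (clipping each value separately is another); clip_by_norm is a hypothetical helper. The vanishing bound uses the fact that the sigmoid derivative never exceeds 0.25, and it ignores the weights, which also scale the signal.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)            # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

# Exploding: a gradient of norm 5 is scaled down to norm 1, direction preserved
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # about [0.6, 0.8]

# Vanishing: 20 sigmoid layers multiply the signal by at most 0.25 each
vanish_bound = 0.25 ** 20
```

The bound for 20 layers is below 1e-12, which is why early layers behind a stack of sigmoids barely update.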

Attention

SPOTLIGHT on the important words -- attention decides what to illuminate at each step

QUERY TIMES KEY = SCORE -- WEIGHTED SUM OF VALUES = CONTEXT

Multi-head attention runs h parallel spotlights capturing different relationship types

For each output position: Query (what am I looking for?), Key (what do I contain?), Value (what do I provide?). Score = Q dot K divided by sqrt(d_k), then softmax for weights, then weighted sum of Values. Self-attention: tokens attend to each other. Causal (decoder): only attends LEFT -- future tokens masked to enable autoregressive generation.

Query

What am I looking for at this position?

Key

What information do I contain?

Value

What information do I provide to the output?

Causal masking

Decoder looks LEFT only -- enables autoregressive text generation
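
The query, key, value recipe can be written out for a single head in plain Python. This is a readable sketch on nested lists, not an efficient implementation: it leaves out the learned projection matrices and multi-head splitting, and the function names and toy vectors are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # exp(-inf) is 0.0, so masked slots vanish
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention. Q, K, V are lists of vectors (one per token)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Score = Q dot K divided by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask future positions so token i only attends LEFT (j <= i)
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        all_weights.append(weights)
        # Context = weighted sum of Values
        outputs.append([sum(w * v[c] for w, v in zip(weights, V))
                        for c in range(len(V[0]))])
    return outputs, all_weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
out, weights = attention(X, X, X, causal=True)   # self-attention: Q = K = V = X
```

With the causal mask the first token can only attend to itself, so its attention weights are [1, 0, 0] and its output equals its own value vector.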

Normalization Layers

BATCH normalizes ACROSS samples -- LAYER normalizes WITHIN a sample

BATCH NORM FOR CNNs -- LAYER NORM FOR TRANSFORMERS AND RNNs

Different axis of normalization -- choose based on your architecture type

Batch Normalization: normalize across batch dimension for each feature -- great for CNNs, needs large batch, fails with small batches or variable-length sequences. Layer Normalization: normalize across feature dimension within each sample -- batch-size independent, works with variable lengths, standard for Transformers and RNNs. Both add learnable scale and shift parameters.

Batch Norm

Across batch -- great for CNNs, needs large consistent batch

Layer Norm

Within each sample -- standard for Transformers and RNNs

RMSNorm

Simpler LayerNorm -- used in Llama and modern LLMs

Batch Norm and Layer Norm both add

Learnable scale (gamma) and shift (beta) parameters -- RMSNorm typically keeps only the scale
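
The difference between the norms is only which axis the statistics come from. The sketch below treats a batch as a list of rows (one row per sample, one column per feature) and leaves out the learnable parameters, which is the same as fixing gamma=1 and beta=0. It also skips the running averages that batch norm uses at inference. All names are hypothetical.

```python
import math

def normalize(values, eps=1e-5):
    """Shift to mean 0 and scale to (almost exactly) std 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """WITHIN a sample: normalize each row over its features."""
    return [normalize(row) for row in batch]

def batch_norm(batch):
    """ACROSS samples: normalize each feature column over the batch."""
    cols = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*cols)]

def rms_norm(row, eps=1e-5):
    """RMSNorm: divide by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v / rms for v in row]

batch = [[1.0, 10.0], [3.0, 30.0]]   # 2 samples, 2 features
ln = layer_norm(batch)               # each row sums to about 0
bn = batch_norm(batch)               # each column sums to about 0
```

Layer norm gives the same result for a sample no matter what else is in the batch, which is why it suits variable-length sequences.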

Optimizers

SGD then Momentum then RMSProp then Adam -- each fixed a problem with the previous

ADAM = MOMENTUM PLUS ADAPTIVE LEARNING RATE = THE MODERN DEFAULT

AdamW adds proper weight decay -- the standard for fine-tuning large language models

SGD: simple, often oscillates. Momentum: adds inertia, accelerates in consistent directions. RMSProp: adaptive per-parameter learning rate. Adam: combines momentum plus RMSProp -- most popular, works with defaults (lr=0.001, beta1=0.9, beta2=0.999). AdamW: proper weight decay fix -- better generalization, standard for fine-tuning transformers.

SGD

Simple but oscillates and can stall near saddle points

Momentum

Adds inertia -- faster convergence, less oscillation

Adam

Momentum + RMSProp -- works well with default settings

AdamW

Adam + proper weight decay -- standard for transformer fine-tuning
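
Adam's "momentum plus adaptive learning rate" can be sketched for a single parameter. This is a minimal scalar version assuming the usual bias-corrected update; adam_minimize is a hypothetical helper, the objective (w - 3)^2 is a toy, and lr=0.05 is raised from the 0.001 default only so the toy converges in few steps.

```python
import math

def adam_minimize(grad_fn, w, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one scalar parameter and return its final value."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp part: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad(w):
    return 2.0 * (w - 3.0)                     # gradient of (w - 3)^2

w_first = adam_minimize(grad, w=0.0, steps=1, lr=0.05)
w_final = adam_minimize(grad, w=0.0, steps=2000, lr=0.05)
```

The first step moves w by almost exactly lr regardless of how large the gradient is, which is the adaptive scaling at work. AdamW would additionally shrink w by a small fraction each step, separately from the gradient term.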

Memory tricks that make neurons, backprop and deep architectures click
