Machine Learning Memory Tricks -- Free AI Mnemonics

Most Tested

SPLIT -- Select, Prepare, Learn, Interpret, Test -- the 5-step ML workflow

THE ML PROJECT WORKFLOW

Every ML project follows this repeatable process -- know it cold

Select the right algorithm for your problem type. Prepare the data: clean, normalize, handle missing values, encode categoricals, and split into train and test sets. Learn: train the model so the algorithm finds the parameters that best fit the training data. Interpret: examine what was learned (for example, coefficients or feature importances). Test: evaluate on a held-out test set never seen during training. Performance on that held-out data is the only honest measure of model quality.

Select

Match algorithm to problem type (classification, regression, clustering)

Prepare

Often quoted as ~80% of ML work (a rule of thumb, not a measurement) -- clean, normalize, encode, split

Learn

Train on training set -- find optimal parameters

Test

Held-out test set -- the only honest performance measure
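
A minimal sketch of the workflow above in plain Python, assuming a toy one-feature dataset and a simple mean-threshold classifier. The helper names (train_test_split, fit_threshold, accuracy) are hypothetical and chosen for illustration only.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Prepare: shuffle a copy of the rows and hold out a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

def fit_threshold(train):
    """Learn: place a threshold halfway between the two class means."""
    class0 = [x for x, y in train if y == 0]
    class1 = [x for x, y in train if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def accuracy(threshold, rows):
    """Test: fraction of rows whose predicted label matches the true label."""
    correct = sum((1 if x > threshold else 0) == y for x, y in rows)
    return correct / len(rows)

# Select: a threshold classifier is assumed here because the toy problem
# is binary classification on a single numeric feature.
data = [(float(i), 0) for i in range(10)] + [(float(i), 1) for i in range(20, 30)]
train, test = train_test_split(data)
threshold = fit_threshold(train)                          # Learn (training rows only)
print("learned threshold:", threshold)                    # Interpret
print("held-out accuracy:", accuracy(threshold, test))    # Test
```

The test rows never touch fit_threshold, which is what makes the final accuracy an honest estimate.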

Regression

LINE -- Linear regression predicts a NUMBER. Logistic predicts a CATEGORY.

THE NAMING TRAP -- LOGISTIC REGRESSION IS A CLASSIFIER

Linear predicts a number. Logistic predicts a category. Despite the name.

Linear Regression: y = mx + b, fitted by minimizing squared errors, predicts continuous values (house price, temperature). Logistic Regression: despite the name, used for CLASSIFICATION -- passes a linear score through the sigmoid function to output a probability between 0 and 1. Default decision threshold is 0.5: above = positive class, below = negative class. This distinction is a staple of ML exams.

Linear Regression

Continuous output -- house price, temperature, salary

Logistic Regression

CLASSIFICATION despite the name -- outputs probability 0-1

Sigmoid function

Squashes any real number into 0-1: sigmoid(z) = 1/(1 + e^(-z))

Decision boundary

0.5 threshold by default -- above=positive, below=negative
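
To make the contrast concrete, here is a small sketch: a closed-form least-squares fit for y = mx + b (a number comes out) next to a sigmoid with a 0.5 threshold (a category comes out). The function names are illustrative, not from any library.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Turn the linear score z into a class label via the sigmoid."""
    return 1 if sigmoid(z) >= threshold else 0

m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # points on y = 2x + 1
print(m, b)                # linear regression: continuous output
print(classify(2.0))       # logistic-style output: a class label
```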

Decision Trees

TREE -- Test feature, Recurse on subsets, End at leaf, Ensemble to improve

DECISION TREE to RANDOM FOREST to XGBOOST

Single trees overfit. Ensembles are among the most powerful ML algorithms.

Decision Trees split data by feature questions using Gini impurity or Information Gain. Prone to overfitting alone. Random Forest (bagging): parallel trees on random data and feature subsets -- reduces variance. XGBoost (boosting): sequential trees, each corrects previous errors -- dominant for tabular data competitions. No feature scaling needed for tree-based methods.

Gini impurity

Probability of mislabeling a random sample if it were labeled according to the node's class distribution -- used to pick splits

Random Forest

Bagging -- parallel independent trees, majority vote

XGBoost

Boosting -- sequential, each tree corrects prior errors

No scaling needed

Tree-based methods are scale-invariant
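
A sketch of how a single tree node might pick a split using Gini impurity, assuming one numeric feature and candidate thresholds at midpoints between distinct values. The helper names are hypothetical; real libraries add many optimizations.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best single-feature split."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, gini(labels))            # start from the unsplit node
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

print(best_split([1, 2, 8, 9], ["n", "n", "y", "y"]))
```

A full tree would recurse on the left and right subsets until the leaves are pure or a depth limit is hit.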

K-Nearest Neighbors

K-NN -- K Neighbors vote, No training needed, Nearest wins

LAZY LEARNER -- ALL COMPUTATION AT PREDICTION TIME

K-NN has no training phase -- it simply memorizes the training set

Find K closest training examples (Euclidean distance), take majority vote (classification) or average (regression). Always normalize -- large-range features dominate distance. Curse of dimensionality: too many features makes distance meaningless. Choose K with cross-validation. K=1 overfits. Large K underfits.

Always normalize

Without feature scaling, the largest-range feature dominates the distance

Curse of dimensionality

Many features = sparse space = nearest neighbor meaningless

K=1

Overfits -- very sensitive to noise

Choose K

Cross-validate -- odd K for binary classification avoids ties
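
A minimal K-NN classifier sketch, assuming already-scaled numeric features and Euclidean distance. Note there is no fit step: the "model" is just the stored training rows. The function name knn_predict is illustrative.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label). Majority vote of the k nearest rows."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
print(knn_predict(train, (2, 2), k=3))
print(knn_predict(train, (7, 7), k=3))
```

For regression, the vote would be replaced by the average of the neighbors' target values.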

SVM

SVM = finds the WIDEST STREET between classes

MAXIMUM MARGIN CLASSIFIER

Support vectors are the training points closest to the boundary -- they define everything

SVM finds the hyperplane that separates the classes with the maximum margin. Wider margin = better generalization. Kernel trick: implicitly maps data to a higher-dimensional space where it can become linearly separable -- the RBF kernel handles nonlinear data. C parameter: low C = wider margin, more misclassifications tolerated. High C = narrower margin, fewer training misclassifications. Well suited to high-dimensional sparse data.

Maximum margin

Wider margin = better generalization to new data

Kernel trick

Implicitly map to a higher dimension where the data can become linearly separable

C parameter

Low C = wide margin (tolerant). High C = narrow margin (strict).

Best for

Text classification and high-dimensional sparse data
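
The pieces above can be sketched numerically, assuming a linear SVM with decision function w·x + b and labels of +1/-1. This only evaluates the margin, the soft-margin objective, and the RBF kernel; it does not train anything, and the function names are illustrative.

```python
import math

def decision(w, b, x):
    """Signed score w·x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the 'street' between the two margin lines: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def soft_margin_objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses). Larger C punishes violations more."""
    hinge = sum(max(0.0, 1.0 - y * decision(w, b, x)) for x, y in zip(points, labels))
    return 0.5 * sum(wi * wi for wi in w) + C * hinge

def rbf_kernel(x, z, gamma=1.0):
    """RBF similarity: 1.0 for identical points, decaying with squared distance."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]   # the last point sits inside the margin
labels = [1, -1, 1]
for C in (1.0, 10.0):
    print(C, soft_margin_objective([1.0, 0.0], 0.0, points, labels, C))
```

The same margin violation costs ten times more at C=10 than at C=1, which is why high C pushes the optimizer toward a narrower, stricter margin.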

K-Means Clustering

K-MEANS -- K clusters, Move centroids, Assign nearest, Repeat until stable

UNSUPERVISED CLUSTERING -- GROUP WITHOUT LABELS

Use the elbow method to find optimal K -- plot inertia vs K, pick the bend

Steps: choose K, randomly initialize K centroids, assign each point to its nearest centroid, recalculate centroids as cluster means, repeat until assignments are stable. Elbow method: plot inertia vs K, choose where improvement slows. DBSCAN alternative: finds arbitrary-shaped clusters and outliers automatically without specifying K (it takes a neighborhood radius and a minimum point count instead).

Elbow method

Plot inertia vs K -- choose where improvement slows

Weakness

Needs K in advance, sensitive to outliers, assumes spherical clusters

DBSCAN

No K needed, finds arbitrary shapes, detects outliers

Scale first

K-Means uses distance -- normalize all features
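
The steps above can be sketched as a plain-Python loop, assuming 2-D points and random initialization from the data itself. The function name kmeans and its seed parameter are illustrative choices, not a library API.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, inertia) after iterating assign/update until stable."""
    centroids = random.Random(seed).sample(points, k)   # random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # assign to nearest centroid
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [                                # move centroid to cluster mean
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:                   # stable: stop
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, inertia = kmeans(points, k=2)
print(sorted(centroids), inertia)
```

For the elbow method, one would call kmeans for several values of k and plot the returned inertia against k.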

Feature Engineering

SCALE before you TRAIN -- unscaled features ruin distance-based algorithms

NORMALIZE and STANDARDIZE and ENCODE and REDUCE

Always split data BEFORE fitting any transformers to prevent data leakage

Normalization (Min-Max): scales to 0-1. Standardization (Z-score): mean=0, std=1. Scale for: K-NN, SVM, neural networks, PCA, logistic regression. NOT for: tree-based methods. One-hot encoding: categoricals to binary columns. Label encoding: categories to integers -- only when the categories are ordinal. PCA: reduce dimensions by keeping directions of maximum variance. Golden rule: fit transformers on training data only.

Scale for

K-NN, SVM, neural networks, PCA, logistic regression

Skip scaling for

Decision trees, Random Forest, XGBoost -- scale-invariant

One-hot encoding

Red/Green/Blue becomes 3 binary columns

PCA

Keep maximum variance directions -- reduce noise and computation
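
A small sketch of the golden rule, assuming a single numeric feature: the scaler's statistics come from the training values only and are then reused on test values. The helper names are hypothetical, and constant features (zero range or zero std) are not handled here.

```python
import statistics

def fit_min_max(train_values):
    """Learn min and max from training data; return a function scaling to 0-1."""
    lo, hi = min(train_values), max(train_values)
    return lambda x: (x - lo) / (hi - lo)

def fit_standardizer(train_values):
    """Learn mean and std from training data; return a z-score function."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return lambda x: (x - mean) / std

def one_hot(value, categories):
    """Encode one categorical value as a list of 0/1 flags."""
    return [1 if value == c else 0 for c in categories]

train = [10.0, 20.0, 30.0]
scale = fit_min_max(train)            # fitted on training data only
zscore = fit_standardizer(train)
print(scale(20.0), scale(40.0))       # a test value may fall outside 0-1
print(zscore(20.0))
print(one_hot("Green", ["Red", "Green", "Blue"]))
```

A test value landing outside 0-1 is expected behavior, not a bug: refitting the scaler on test data to "fix" it would leak information.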

Naive Bayes

NAIVE = assumes all features INDEPENDENT -- wrong but works surprisingly well

BAYES THEOREM WITH AN INDEPENDENCE ASSUMPTION -- FAST AND EFFECTIVE

For text classification the independence assumption is only roughly true -- but that is often enough

P(class|features) is proportional to P(class) times the product of P(feature|class). Although the independence assumption is almost always violated, it works well for text -- classification only needs the correct class to score highest, not accurate probabilities. Gaussian NB: continuous features. Multinomial NB: word counts. Laplace smoothing: add 1 to all counts to prevent zero probabilities.

Gaussian NB

Assumes continuous features follow Gaussian distribution

Multinomial NB

For word counts -- text classification, spam detection

Laplace smoothing

Add 1 to counts -- prevents zero probability killing the product

Why it works

Only the ranking of classes has to be right, so rough independence is enough for text
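
A compact Multinomial Naive Bayes sketch with Laplace (add-1) smoothing, assuming documents are already tokenized into word lists. The tiny spam/ham corpus and the function names are made up for illustration. Log-probabilities are summed rather than multiplying raw probabilities, to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (word_list, label). Returns counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, words):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # words never seen in training are skipped
            # Laplace smoothing: add 1 to every count so no probability is zero.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    (["win", "money", "now"], "spam"), (["free", "money"], "spam"),
    (["meeting", "at", "noon"], "ham"), (["lunch", "at", "noon"], "ham"),
]
model = train_nb(docs)
print(predict_nb(model, ["free", "money"]))
print(predict_nb(model, ["noon", "meeting"]))
```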

Reinforcement Learning

AGENT SPAR -- State, Policy, Action, Reward -- maximize cumulative reward

TRIAL ERROR AND REWARDS -- THE THIRD ML PARADIGM

RLHF is a core technique for aligning assistants such as ChatGPT and Claude -- human feedback trains a reward model

Agent: learner. State: current situation. Action: choice made. Reward: feedback signal. Policy: strategy mapping states to actions. Goal: maximize expected cumulative reward. Exploration vs exploitation. Q-learning: learns value of state-action pairs. RLHF (Reinforcement Learning from Human Feedback): human raters rank outputs, reward model trained, RL maximizes it -- standard for aligning LLMs.

Agent

Learner that takes actions and receives rewards

Policy

Strategy: given state s, what action a to take

Q-learning

Learns Q(s,a) = expected cumulative reward from state s taking action a

RLHF

Human rankings + reward model + RL = aligned ChatGPT/Claude
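
A tabular Q-learning sketch on a made-up 5-state chain, assuming reward 1.0 only when the agent reaches the rightmost state. The environment, constants, and function names are all hypothetical; the update rule itself is the standard Q-learning update.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the terminal goal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical chain environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q(s, a) table
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon (or on ties); otherwise exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Target: immediate reward plus discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

def greedy_policy(q):
    """Policy: the best-valued action in each non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]

q = q_learning()
print(greedy_policy(q))   # 1 means "move right"
```

The epsilon parameter is the exploration vs exploitation dial: 0 always exploits the current table, 1 always acts randomly.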

Time Series

TASC -- Trend, Autocorrelation, Seasonality, Cycle

NEVER SHUFFLE TIME SERIES DATA -- ALWAYS SPLIT CHRONOLOGICALLY

Training on future data to predict the past leaks information -- chronological order must be preserved

Trend: long-term direction. Seasonality: regular repeating patterns (daily, weekly, yearly). Autocorrelation: current value correlates with its own past values. Cycle: irregular multi-year patterns. NEVER shuffle before splitting -- always chronological. Classic models: ARIMA. Modern: LSTMs, Temporal CNNs, Transformer-based forecasters.

Trend

Long-term direction (upward, downward, flat)

Seasonality

Regular repeating patterns -- daily, weekly, yearly

Autocorrelation

Current value correlates with past values (lag-1, lag-2...)

Never shuffle

Always split chronologically -- train on the past, test on the future
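
A short sketch of a chronological split and a lag autocorrelation, assuming the series is a list already sorted by time. The function names are illustrative.

```python
def chronological_split(series, test_fraction=0.2):
    """Keep time order: the earliest values train, the latest values test."""
    n_test = max(1, int(len(series) * test_fraction))
    return series[:-n_test], series[-n_test:]

def autocorrelation(series, lag=1):
    """Correlation of the series with itself shifted back by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return covariance / variance

series = list(range(10))                      # a pure upward trend
train, test = chronological_split(series)
print(train, test)
print(autocorrelation(series, lag=1))         # positive: values follow their past
print(autocorrelation([1, -1, 1, -1, 1, -1])) # negative: values flip each step
```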

Anomaly Detection

RARE events are ISOLATED -- Isolation Forest finds anomalies with fewer random splits

DETECT THE UNUSUAL WITHOUT LABELED ANOMALY EXAMPLES

Train on normal data only -- anomalies stand out by being hard to reconstruct

Isolation Forest: anomalies isolated by fewer random splits (short path = anomaly) -- fast, scalable, no distance metric needed. Statistical: z-score, flag beyond N standard deviations. Autoencoder: train on normal data only, flag high reconstruction error points as anomalies. Applications: fraud detection, network intrusion, manufacturing defects, medical outliers.

Isolation Forest

Short path to isolate = anomaly. Fast, scalable.

Z-score

Simple -- flag points beyond 2-3 standard deviations

Autoencoder

Train on normal data -- anomalies cannot be reconstructed well

Applications

Fraud, network intrusion, defects, medical outliers
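
The z-score approach is the simplest to sketch. This example assumes a reference sample of normal readings is available and uses a threshold of 3 standard deviations; the function name and the readings are made up for illustration.

```python
import statistics

def fit_zscore_detector(normal_values, threshold=3.0):
    """Learn mean and std from normal data; return an is_anomaly(x) function."""
    mean = statistics.fmean(normal_values)
    std = statistics.stdev(normal_values)

    def is_anomaly(x):
        # Flag values more than `threshold` standard deviations from the mean.
        return abs(x - mean) / std > threshold

    return is_anomaly

normal_readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10]
is_anomaly = fit_zscore_detector(normal_readings)
new_readings = [10, 11, 50, 9]
print([x for x in new_readings if is_anomaly(x)])
```

Fitting on normal data only matters here: a large outlier inside the fitting sample inflates the standard deviation and can hide itself.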

AutoML

AutoML = automate algorithm selection, hyperparameter tuning, feature engineering

AI BUILDING AI -- NEURAL ARCHITECTURE SEARCH FOUND EFFICIENTNET

AutoML automates the tuning, not the thinking -- still need good problem framing

AutoML tools: Google AutoML (cloud, no-code), H2O AutoML (open-source), Auto-sklearn. NAS (Neural Architecture Search): discovered EfficientNet -- outperformed human-designed networks. Active Learning: model queries the most uncertain examples for labeling -- can cut labeling cost severalfold (5-10x is often quoted; actual gains vary by task). Limitations: still requires clean data, correct evaluation metric, good problem definition.

Google AutoML

Cloud-based, no-code, requires little ML expertise

H2O AutoML

Open-source, competitive with cloud options on tabular data

NAS

Found EfficientNet -- better than human-designed CNN architectures

Active Learning

Query uncertain examples -- often quoted as 5-10x more label-efficient, though gains vary by task
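
A sketch of the uncertainty-sampling step in active learning, assuming a binary classifier has already produced P(positive) for each unlabeled example. The example IDs, probabilities, and function name are made up for illustration.

```python
def select_most_uncertain(probabilities, n):
    """probabilities: dict of example_id -> P(positive).

    Returns the n example IDs whose probability is closest to 0.5,
    i.e. the ones the model is least sure about.
    """
    ranked = sorted(probabilities, key=lambda key: abs(probabilities[key] - 0.5))
    return ranked[:n]

predicted = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45}
print(select_most_uncertain(predicted, 2))   # these go to a human labeler next
```

The selected examples would be labeled, added to the training set, and the model retrained before the next round of queries.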

Memory tricks that make machine learning algorithms click
