Memory tricks that make machine learning algorithms click
From supervised to reinforcement learning -- these memory tricks lock in every major ML algorithm, the bias-variance tradeoff, and the workflow that makes models actually work.
Machine Learning
Memory Tricks
Proven Mnemonics & Acronyms -- fast to learn, hard to forget.
Most Tested
SPLIT -- Select, Prepare, Learn, Interpret, Test -- the 5-step ML workflow
THE ML PROJECT WORKFLOW
Every ML project follows this repeatable process -- know it cold
Select the right algorithm for your problem type. Prepare the data: clean, normalize, handle missing values, encode categoricals, split into train and test sets. Learn: train the model -- the algorithm finds the parameters that best fit the training data. Interpret: examine what the model learned. Test: evaluate on a held-out test set never seen during training. Performance on test data is the only honest measure of model quality.
Select
Match algorithm to problem type (classification, regression, clustering)
Prepare
Often the bulk of ML work (commonly cited as ~80%) -- clean, normalize, encode, split
Learn
Train on training set -- find optimal parameters
Interpret
Examine what the model learned
Test
Held-out test set -- the only honest performance measure
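A minimal sketch of the split in the Prepare step, in plain Python. The helper name `train_test_split`, the 20% test fraction, and the fixed seed are assumptions for illustration, not part of the workflow above.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out the last test_fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

The test rows are set aside once and only touched in the final Test step.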
Regression
LINE -- Linear regression predicts a NUMBER. Logistic predicts a CATEGORY.
THE NAMING TRAP -- LOGISTIC REGRESSION IS A CLASSIFIER
Linear predicts a number. Logistic predicts a category. Despite the name.
Linear Regression: y = mx + b, fitted by minimizing squared errors, predicts continuous values (house price, temperature). Logistic Regression: despite the name, used for CLASSIFICATION -- passes a linear score through the sigmoid function to output a probability between 0 and 1. Default decision threshold is 0.5: above = positive class, below = negative class. This distinction is a common exam question.
Linear Regression
Continuous output -- house price, temperature, salary
Logistic Regression
CLASSIFICATION despite the name -- outputs probability 0-1
Sigmoid function
Squashes any real z into (0, 1): s(z) = 1 / (1 + e^(-z))
Decision boundary
0.5 threshold -- above=positive, below=negative
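The sigmoid and the 0.5 threshold can be sketched as below. The function names are hypothetical, and treating a probability of exactly 0.5 as positive is an assumption, since the card does not say how ties are handled.

```python
import math

def sigmoid(z):
    """s(z) = 1 / (1 + e^(-z)), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_class(z, threshold=0.5):
    """Return 1 (positive) if the probability reaches the threshold, else 0."""
    # Assumption: a probability exactly at the threshold counts as positive.
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))         # 0.5
print(predict_class(2.0))   # 1
print(predict_class(-2.0))  # 0
```

Here z stands for the linear score (mx + b) that logistic regression computes before squashing.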
Decision Trees
TREE -- Test feature, Recurse on subsets, End at leaf, Ensemble to improve
DECISION TREE to RANDOM FOREST to XGBOOST
Single trees overfit. Ensembles are among the most powerful ML algorithms.
Decision Trees split data by feature questions using Gini impurity or Information Gain. Prone to overfitting alone. Random Forest (bagging): independent trees trained in parallel on random samples of the data and random feature subsets -- reduces variance. XGBoost (boosting): sequential trees, each corrects previous errors -- dominant for tabular data competitions. No feature scaling needed for tree-based methods.
Gini impurity
Probability of misclassifying a random sample -- used to pick splits
XGBoost
Boosting -- sequential, each tree corrects prior errors
No scaling needed
Tree-based methods are scale-invariant
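A small sketch of how Gini impurity scores a candidate split. The helper names are hypothetical, and weighting each side by its size is the usual convention, assumed here rather than taken from the card.

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_k^2): chance of mislabeling a random sample drawn from labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two sides of a split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(gini_impurity(["a", "a", "b", "b"]))       # 0.5
print(split_impurity(["a", "a"], ["b", "b"]))    # 0.0
```

A tree would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.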
K-Nearest Neighbors
K-NN -- K Neighbors vote, No training needed, Nearest wins
LAZY LEARNER -- ALL COMPUTATION AT PREDICTION TIME
K-NN has no training phase -- it simply memorizes the training set
Find K closest training examples (Euclidean distance), take majority vote (classification) or average (regression). Always normalize -- large-range features dominate distance. Curse of dimensionality: too many features makes distance meaningless. Choose K with cross-validation. K=1 overfits. Large K underfits.
Always normalize
Distance is meaningless without feature scaling
Curse of dimensionality
Many features = sparse space = nearest neighbor meaningless
K=1
Overfits -- very sensitive to noise
Choose K
Cross-validate -- odd K for binary classification avoids ties
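K-NN classification fits in a few lines, which is the point of "no training needed". This is a sketch with a hypothetical function name and toy data; it assumes features are already on comparable scales.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to query (Euclidean)."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (0.5, 0.5)))  # a
print(knn_predict(points, labels, (5.5, 5.5)))  # b
```

All the work happens inside `knn_predict`, at prediction time, which is why K-NN gets slow on large training sets.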
SVM
SVM = finds the WIDEST STREET between classes
MAXIMUM MARGIN CLASSIFIER
Support vectors are the training points closest to the boundary -- they define everything
SVM finds the hyperplane with the widest margin between classes. Wider margin = better generalization. Kernel trick: implicitly maps data to a higher-dimensional space where it can become linearly separable -- RBF kernel handles nonlinear data. C parameter: low C = wider margin, more misclassifications tolerated. High C = narrow margin, fewer misclassifications. Works well on high-dimensional sparse data.
Maximum margin
Wider margin = better generalization to new data
Kernel trick
Map to higher dimension where data IS linearly separable
C parameter
Low C = wide margin (tolerant). High C = narrow margin (strict).
Best for
Text classification and high-dimensional sparse data
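The RBF kernel itself is just a similarity score, sketched below. The width parameter `gamma` does not appear in the card and is an assumption; the function name is hypothetical.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2): 1.0 for identical points, near 0 for distant ones."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

print(rbf_kernel((1.0, 2.0), (1.0, 2.0)))  # 1.0
print(rbf_kernel((0.0, 0.0), (0.1, 0.0)) > rbf_kernel((0.0, 0.0), (3.0, 0.0)))  # True
```

The SVM only ever needs these pairwise scores, so it never has to build the higher-dimensional coordinates explicitly.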
K-Means Clustering
K-MEANS -- K clusters, Move centroids, Assign nearest, Repeat until stable
UNSUPERVISED CLUSTERING -- GROUP WITHOUT LABELS
Use the elbow method to find optimal K -- plot inertia vs K, pick the bend
Steps: choose K, randomly initialize K centroids, assign each point to nearest centroid, recalculate centroids as cluster means, repeat until stable. Elbow method: plot inertia vs K, choose where improvement slows. DBSCAN alternative: finds arbitrary-shaped clusters and outliers automatically without specifying K.
Elbow method
Plot inertia vs K -- choose where improvement slows
Weakness
Needs K in advance, sensitive to outliers, assumes spherical clusters
DBSCAN
No K needed, finds arbitrary shapes, detects outliers
Scale first
K-Means uses distance -- normalize all features
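The steps above can be sketched in plain Python. The function names, the iteration cap, and the seed are assumptions; an empty cluster simply keeps its old centroid here, which is one of several possible choices.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Assign points to the nearest centroid, move centroids to cluster means, repeat."""
    centroids = random.Random(seed).sample(points, k)  # random initialization
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

def inertia(centroids, clusters):
    """Sum of squared distances from each point to its centroid (the elbow-plot y-axis)."""
    return sum(math.dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(sorted(cluster) for cluster in clusters))
print(inertia(centroids, clusters))
```

For the elbow method, `kmeans` would be run for several values of K and `inertia` plotted against K.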
Feature Engineering
SCALE before you TRAIN -- unscaled features ruin distance-based algorithms
NORMALIZE and STANDARDIZE and ENCODE and REDUCE
Always split data BEFORE fitting any transformers to prevent data leakage
Normalization (Min-Max): scales to 0-1. Standardization (Z-score): mean=0, std=1. Scale for: K-NN, SVM, neural networks, PCA. NOT for: tree-based methods. One-hot encoding: categoricals to binary columns. Label encoding: categories to integers -- only when ordinal. PCA: reduce dimensions by keeping directions of maximum variance. Golden rule: fit transformers on training data only.
No scaling needed
Decision trees, Random Forest, XGBoost -- scale-invariant
One-hot encoding
Red/Green/Blue becomes 3 binary columns
PCA
Keep maximum variance directions -- reduce noise and computation
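A sketch of the golden rule: statistics are fitted on the training values only, then reused on the test values. All helper names are hypothetical, and using the population standard deviation is an assumption.

```python
import statistics

def fit_min_max(train_values):
    """Learn min and max from the training data only."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    span = (hi - lo) or 1.0  # guard against a constant feature
    return [(v - lo) / span for v in values]

def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training data only."""
    std = statistics.pstdev(train_values)
    return statistics.fmean(train_values), (std or 1.0)

def standardize(values, mean, std):
    return [(v - mean) / std for v in values]

def one_hot(value, categories):
    """One binary column per category."""
    return [1 if value == c else 0 for c in categories]

train, test = [10.0, 20.0, 30.0], [40.0]
lo, hi = fit_min_max(train)
print(min_max_scale(train, lo, hi))  # [0.0, 0.5, 1.0]
print(min_max_scale(test, lo, hi))   # [1.5]
print(one_hot("Green", ["Red", "Green", "Blue"]))  # [0, 1, 0]
```

The test value scales to 1.5, outside 0-1. That is expected: the scaler never saw it, which is exactly what prevents leakage.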
Naive Bayes
NAIVE = assumes all features INDEPENDENT -- wrong but works surprisingly well
BAYES THEOREM WITH AN INDEPENDENCE ASSUMPTION -- FAST AND EFFECTIVE
For text classification the independence assumption holds approximately -- that's enough
P(class|features) is proportional to P(class) times the product of P(feature|class). Although the independence assumption is almost always violated, it works well for text -- word frequencies are roughly independent given class. Gaussian NB: continuous features. Multinomial NB: word counts. Laplace smoothing: add 1 to all counts to prevent zero probabilities.
Gaussian NB
Assumes continuous features follow Gaussian distribution
Multinomial NB
For word counts -- text classification, spam detection
Laplace smoothing
Add 1 to counts -- prevents zero probability killing the product
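A toy Multinomial NB with Laplace smoothing, as a sketch. The function names and the tiny spam/ham corpus are made up for illustration; log-probabilities are summed instead of multiplying raw probabilities, and words never seen in training are skipped, both of which are implementation choices assumed here.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word|class)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # assumption: ignore out-of-vocabulary words
            # Laplace smoothing: +1 on the count, +|vocab| on the total.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["win", "money", "now"], ["free", "money"], ["meeting", "at", "noon"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
print(predict_nb(["free", "money"], *model))     # spam
print(predict_nb(["lunch", "meeting"], *model))  # ham
```

Without the +1, "free" would have probability zero under ham and wipe out the whole product.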
Reinforcement Learning
RLHF is how ChatGPT and Claude are aligned -- human feedback trains a reward model
Agent: learner. State: current situation. Action: choice made. Reward: feedback signal. Policy: strategy mapping states to actions. Goal: maximize expected cumulative reward. Exploration vs exploitation: try new actions to learn more, or reuse the best action found so far. Q-learning: learns value of state-action pairs. RLHF (Reinforcement Learning from Human Feedback): human raters rank outputs, a reward model is trained on those rankings, and RL optimizes the LLM against it -- standard for aligning LLMs.
Agent
Learner that takes actions and receives rewards
Policy
Strategy: given state s, what action a to take
Q-learning
Learns Q(s,a) = expected cumulative reward from taking action a in state s, then acting optimally
RLHF
Human rankings + reward model + RL = aligned ChatGPT/Claude
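One tabular Q-learning step and an epsilon-greedy policy, as a sketch. The learning rate `alpha`, discount `gamma`, and exploration rate `epsilon` are assumptions not named in the card, the state and action names are made up, and terminal states are not handled.

```python
import random
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])

def epsilon_greedy(q, state, actions, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

q = defaultdict(float)  # unseen state-action pairs start at 0.0
actions = ["left", "right"]
q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(q[("s0", "right")])                             # 0.5
print(epsilon_greedy(q, "s0", actions, epsilon=0.0))  # right
```

With `epsilon=0.0` the policy is purely exploitative. Raising it trades short-term reward for information.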
Time Series
NEVER SHUFFLE TIME SERIES DATA -- ALWAYS SPLIT CHRONOLOGICALLY
Future data cannot predict the past -- chronological order must be preserved
Trend: long-term direction. Seasonality: regular repeating patterns (daily, weekly, yearly). Autocorrelation: current value correlates with its own past values. Cycle: irregular multi-year patterns. NEVER shuffle before splitting -- always chronological. Classic models: ARIMA. Modern: LSTMs, Temporal CNNs, Transformer-based forecasters.
Autocorrelation
Current value correlates with past values (lag-1, lag-2...)
Never shuffle
Always split chronologically -- future cannot predict past
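A chronological split and lag features, sketched in plain Python. The helper names and the 20% test fraction are assumptions, and the series is assumed to be sorted oldest to newest already.

```python
def chronological_split(series, test_fraction=0.2):
    """Keep order: earliest observations train, the most recent ones test."""
    n_test = max(1, int(len(series) * test_fraction))
    return series[:-n_test], series[-n_test:]

def lag_features(series, n_lags=2):
    """Pair each value with the n_lags values immediately before it."""
    return [(series[i - n_lags:i], series[i]) for i in range(n_lags, len(series))]

series = list(range(1, 11))
train, test = chronological_split(series)
print(train)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(test)   # [9, 10]
print(lag_features([1, 2, 3, 4]))  # [([1, 2], 3), ([2, 3], 4)]
```

Every test observation comes after every training observation, so the model is never evaluated on data older than what it trained on.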
Anomaly Detection
RARE events are ISOLATED -- Isolation Forest finds anomalies with fewer random splits
DETECT THE UNUSUAL WITHOUT LABELED ANOMALY EXAMPLES
Train on normal data only -- anomalies stand out by being hard to reconstruct
Isolation Forest: anomalies isolated by fewer random splits (short path = anomaly) -- fast, scalable, no distance metric needed. Statistical: z-score, flag beyond N standard deviations. Autoencoder: train on normal data only, flag high reconstruction error points as anomalies. Applications: fraud detection, network intrusion, manufacturing defects, medical outliers.
Isolation Forest
Short path to isolate = anomaly. Fast, scalable.
Z-score
Simple -- flag points beyond 2-3 standard deviations
Autoencoder
Train on normal data -- anomalies cannot be reconstructed well
Applications
Fraud, network intrusion, defects, medical outliers
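The z-score method is the simplest of the three to sketch. The function name and sample readings are made up, and the population standard deviation is an assumed choice.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_anomalies(readings, threshold=2.5))  # [50]
```

The outlier inflates the very mean and standard deviation it is judged against, so on small samples a threshold of 3 can miss it; with these ten readings the value 50 scores just under 3.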
AutoML
AI BUILDING AI -- NEURAL ARCHITECTURE SEARCH FOUND EFFICIENTNET
AutoML automates the tuning, not the thinking -- still need good problem framing
AutoML tools: Google AutoML (cloud, no-code), H2O AutoML (open-source), Auto-sklearn. NAS (Neural Architecture Search): found the EfficientNet baseline network, which was then scaled up -- the resulting family outperformed earlier human-designed CNNs when released. Active Learning: model queries the most uncertain examples for labeling -- can reduce labeling cost, by as much as 5-10x in favorable cases. Limitations: still requires clean data, correct evaluation metric, good problem definition.
Google AutoML
Cloud-based, no-code, requires little ML expertise
H2O AutoML
Open-source, competitive with cloud options on tabular data
NAS
Found EfficientNet -- beat earlier human-designed CNN architectures at release
Active Learning
Query uncertain examples -- up to 5-10x more label-efficient in favorable cases
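Uncertainty sampling, the simplest active-learning query strategy, can be sketched as below. It assumes a binary classifier that outputs a positive-class probability per unlabeled example; the function name and the probabilities are made up.

```python
def most_uncertain(probabilities, n_queries=2):
    """Indices of the examples whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:n_queries]

predicted = [0.95, 0.52, 0.10, 0.45, 0.80]
print(most_uncertain(predicted))  # [1, 3]
```

The returned indices are the examples to send to a human labeler next; the model is then retrained and the loop repeats.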