Model Evaluation Memory Tricks -- Free AI Mnemonics

Confusion Matrix

CONFUSE the matrix -- TP, FP, FN, TN are the four boxes everything else comes from

TRUE POSITIVE AND FALSE POSITIVE AND FALSE NEGATIVE AND TRUE NEGATIVE

Draw the matrix -- predicted on columns, actual on rows. Diagonal = correct predictions.

True Positive (TP): predicted positive, actually positive. False Positive (FP): predicted positive, actually negative -- Type I error, false alarm. False Negative (FN): predicted negative, actually positive -- Type II error, missed detection. True Negative (TN): predicted negative, actually negative. Accuracy, precision, recall, and F1 all derive from these four numbers; ROC and AUC come from recomputing them at every threshold.

TP

Predicted positive, actually positive -- correct detection

FP

Predicted positive, actually negative -- false alarm (Type I error)

FN

Predicted negative, actually positive -- missed detection (Type II error)

TN

Predicted negative, actually negative -- correct rejection
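
A minimal sketch of the four boxes, assuming binary labels where 1 means positive. The function name confusion_counts and the toy labels are illustrative, not from any library.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels where 1 = positive."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1          # correct detection
        elif p == 1 and a == 0:
            fp += 1          # false alarm (Type I error)
        elif p == 0 and a == 1:
            fn += 1          # missed detection (Type II error)
        else:
            tn += 1          # correct rejection
    return tp, fp, fn, tn


actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)   # 3, 1, 1, 3

# Actual on rows, predicted on columns: the diagonal holds the correct predictions.
matrix = [[tp, fn],
          [fp, tn]]

accuracy = (tp + tn) / (tp + fp + fn + tn)              # 6 correct out of 8
```

The later metrics in this sheet can be computed from the same four counts.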

Precision and Recall

Precision is PICKY -- Recall REMEMBERS to catch everything

PICKY FOR PRECISION AND REMEMBERS FOR RECALL -- IMPOSSIBLE TO FORGET

P for Picky -- few false alarms. R for Remembers -- few misses. Two words for life.

Precision is PICKY: only says yes when really sure -- few false alarms. TP divided by (TP + FP). Recall REMEMBERS everything: sweeps wide to catch every real positive -- few misses. TP divided by (TP + FN). Tradeoff: raising the classification threshold usually raises precision but lowers recall. Spam filter: be PICKY (don't block real email). Cancer screening: REMEMBER everything (don't miss a cancer).

Precision = TP/(TP+FP)

Picky -- minimize false alarms

Recall = TP/(TP+FN)

Remembers -- minimize missed detections

Tradeoff

Higher threshold = more precise but misses more

Cancer screening

Prioritize recall -- missing a cancer is catastrophic
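
The threshold tradeoff can be sketched on toy scores. The helper precision_recall and the score values are made up for illustration; a real model would supply the scores.

```python
def precision_recall(actual, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

# Low threshold: REMEMBERS everything (recall 1.0) but lets in 2 false alarms.
low = precision_recall(actual, scores, 0.25)    # precision 4/6, recall 1.0

# High threshold: PICKY (precision 1.0) but misses half the positives.
high = precision_recall(actual, scores, 0.75)   # precision 1.0, recall 0.5
```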

F1 Score

F1 = Harmonic mean of Precision and Recall -- punishes extreme imbalance between the two

F1 = 2 TIMES (P TIMES R) DIVIDED BY (P + R)

Harmonic mean is low if EITHER value is low -- you cannot compensate low recall with high precision

F1 combines precision and recall into one metric. Harmonic mean punishes extreme imbalance -- you cannot hide 10% recall behind 99% precision. F1 = 1.0 is perfect, 0 is worst. Use when class imbalance exists and you care about both FP and FN. F-beta: beta greater than 1 emphasizes recall (medical). Beta less than 1 emphasizes precision.

Harmonic mean

Low if EITHER P or R is low -- no hiding bad recall

Use F1 when

Class imbalance exists and both FP and FN matter

F-beta, beta > 1

Emphasizes recall -- medical diagnosis

F-beta, beta < 1

Emphasizes precision -- spam filtering
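
A short sketch of the harmonic-mean penalty, using the 99% precision / 10% recall case from the text. The function f_beta is a hypothetical helper implementing the standard F-beta formula; beta = 1 gives F1.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.99, 0.10

f1 = f_beta(precision, recall)                # about 0.18: low recall drags it down
arithmetic = (precision + recall) / 2         # about 0.55: a plain average hides the problem

f2 = f_beta(precision, recall, beta=2)        # recall-weighted, so even lower here
f_half = f_beta(precision, recall, beta=0.5)  # precision-weighted, so higher here
```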

ROC and AUC

ROC curve plots TPR vs FPR at every threshold -- AUC = area under that curve

AUC 0.5 = RANDOM AND AUC 1.0 = PERFECT AND AUC 0.8+ = GOOD

AUC = probability that model ranks a random positive higher than a random negative

ROC plots True Positive Rate (recall) vs False Positive Rate at every classification threshold. AUC = 0.5: no better than random. AUC = 1.0: perfect. AUC = 0.8+: good. Threshold-independent -- evaluates ranking ability. Use PR curve instead of ROC when positive class is very rare -- ROC can be optimistic with severe class imbalance.

TPR (Y-axis)

True Positive Rate = Recall = TP/(TP+FN)

FPR (X-axis)

False Positive Rate = FP/(FP+TN)

AUC meaning

Probability that model ranks a random positive above a random negative

PR curve

Better than ROC when positive class is very rare
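
The ranking meaning of AUC can be checked directly by comparing every positive with every negative. This brute-force sketch (the name auc_by_ranking is illustrative) counts ties as half a win; it is fine for small examples but slow on large datasets.

```python
def auc_by_ranking(actual, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))


actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]

auc = auc_by_ranking(actual, scores)                      # 13 of 16 pairs ranked correctly
perfect = auc_by_ranking([1, 0], [0.9, 0.1])              # every positive above every negative
no_signal = auc_by_ranking([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])  # all ties, same as random
```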

Regression Metrics

MAE in original units -- MSE squares the error -- RMSE back to original units -- R-squared explains variance

FOUR WAYS TO MEASURE REGRESSION ERROR

R-squared can be negative -- model is worse than simply predicting the mean for every example

MAE (Mean Absolute Error): average of absolute differences -- less sensitive to outliers than MSE, original units. MSE (Mean Squared Error): average of squared differences -- penalizes large errors heavily, in squared units. RMSE (Root Mean Squared Error): square root of MSE -- back in original units, commonly reported. R-squared: proportion of variance explained -- 1.0 perfect, 0 no better than predicting the mean, can be negative.

MAE

Average absolute error -- less sensitive to outliers than MSE, original units

MSE

Average squared error -- penalizes large errors heavily

RMSE

Square root of MSE -- back in original units

R-squared

Variance explained: 1=perfect, 0=no better than mean, negative=worse
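
All four regression metrics fit in a few lines. This is a from-scratch sketch on toy numbers; regression_metrics is a hypothetical helper, and R-squared is computed as 1 minus residual sum of squares over total sum of squares.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE, R-squared) for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # error left over after the model
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # error of always predicting the mean
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2


y_true = [3.0, 5.0, 7.0, 9.0]

mae, mse, rmse, r2 = regression_metrics(y_true, [2.0, 5.0, 8.0, 11.0])

# Predicting the mean (6.0) everywhere gives R-squared of 0.
_, _, _, r2_mean = regression_metrics(y_true, [6.0, 6.0, 6.0, 6.0])

# Predictions worse than the mean give a negative R-squared.
_, _, _, r2_bad = regression_metrics(y_true, [9.0, 7.0, 5.0, 3.0])
```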

Cross-Validation

K-FOLD -- K times: train on K-1 folds, test on 1 fold, rotate, average all scores

MORE RELIABLE THAN A SINGLE TRAIN/TEST SPLIT

Golden rule: NEVER use the test set for anything except final evaluation -- no tuning on test

Split data into K roughly equal folds. For each fold: train on the other K-1 folds, evaluate on the held-out fold. Average performance across all K folds. 5-fold and 10-fold are the standard choices. Stratified K-fold: ensures each fold has the same class distribution -- important for imbalanced data. Time series: always split chronologically -- never randomly.

5-fold standard

5 rounds: each fold serves as test set once

Stratified K-fold

Each fold maintains class distribution -- use for imbalanced data

Time series CV

Walk-forward validation -- always train on past, test on future

Golden rule

Never use test set for anything except final evaluation
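
The rotation can be sketched with index lists. Both generators below are illustrative helpers, not a library API: k_fold_indices assumes the rows are already shuffled, and walk_forward_indices uses an expanding window so training rows always precede test rows. The toy "model" just predicts the training mean.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx); fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def walk_forward_indices(n_samples, n_splits):
    """Yield expanding-window splits: train on the past, test on the next chunk."""
    chunk = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        yield list(range(0, i * chunk)), list(range(i * chunk, (i + 1) * chunk))


y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]


def mean_baseline_mae(train_idx, test_idx):
    """Toy model: predict the training mean, score with MAE on the held-out fold."""
    mean = sum(y[i] for i in train_idx) / len(train_idx)
    return sum(abs(y[i] - mean) for i in test_idx) / len(test_idx)


folds = list(k_fold_indices(len(y), 5))
scores = [mean_baseline_mae(train, test) for train, test in folds]
cv_mae = sum(scores) / len(scores)       # average over all 5 folds
```

Each index lands in exactly one held-out fold, which is what makes the averaged score more reliable than a single split.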

Class Imbalance

SMOTE -- Synthetic Minority Oversampling TEchnique -- create synthetic rare-class examples

WHEN 99% IS ONE CLASS -- ACCURACY IS USELESS

Apply resampling to training data ONLY -- never to test set

A model that always predicts the majority class gets 99% accuracy but 0% recall on minority -- useless. SMOTE: creates synthetic minority examples by interpolating between existing ones. Class weights: tell the algorithm to penalize minority class errors more (class_weight='balanced' in scikit-learn). Use F1 or AUC, not accuracy, for evaluation.

Accuracy trap

99% accuracy with all-majority predictor = useless

SMOTE

Synthetic minority oversampling -- apply to training data ONLY

Class weights

class_weight='balanced' -- reweights the loss inversely to class frequency; similar in effect to oversampling, but no rows are duplicated

Metrics to use

F1, AUC-ROC, or a PR curve when positives are very rare -- not accuracy on imbalanced data
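
A sketch of the accuracy trap and the two remedies on toy data. The weight formula n_samples / (n_classes * class_count) is the one scikit-learn documents for class_weight='balanced'. The interpolation helper is a simplification: real SMOTE interpolates toward one of the k nearest minority neighbors, while here the two points are simply given.

```python
import random

labels = [0] * 99 + [1]              # 99% majority class
always_majority = [0] * 100          # "model" that never predicts the rare class

accuracy = sum(a == p for a, p in zip(labels, always_majority)) / len(labels)
minority_recall = (
    sum(1 for a, p in zip(labels, always_majority) if a == 1 and p == 1)
    / labels.count(1)
)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n, classes = len(labels), sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def smote_like_sample(a, b, rng):
    """Synthetic point at a random position on the segment between a and b."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


weights = balanced_class_weights(labels)     # rare class gets the large weight

rng = random.Random(0)
synthetic = smote_like_sample([1.0, 2.0], [3.0, 6.0], rng)
```

Any resampling like this belongs inside the training fold only, so the test set keeps the real class distribution.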

Data Leakage

LEAK = test set information contaminating training -- model looks great until it hits the real world

ONE OF THE MOST COMMON REASONS ML MODELS FAIL IN PRODUCTION

Always split FIRST then transform -- never fit transformers on the full dataset

Data leakage: information from outside training set contaminates the model. Sources: fitting scaler on full dataset before splitting (never do this), including features that wouldn't be available at inference time, using future information for historical prediction. Detection: unrealistically high performance, feature importances that don't make causal sense, dramatic performance drop in production.

Train-test contamination

Fitting transformers on full dataset before splitting

Target leakage

Feature that encodes the target or is only recorded after the outcome -- not available at inference time

Temporal leakage

Using future data to predict past

Split FIRST

Always split data BEFORE fitting any transformers
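
Split first, then fit, sketched with a hand-rolled standard scaler. The helpers fit_scaler and transform are hypothetical stand-ins for any fitted transformer; the point is only which rows the statistics are learned from.

```python
import statistics


def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, scaler):
    """Standardize values with previously learned statistics."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]

# Correct: split FIRST, fit on the training rows only.
train, test = data[:6], data[6:]
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)    # reuse the training statistics, never refit

# Leaky, shown for contrast: the statistics have already seen the test rows.
leaky_scaler = fit_scaler(data)
```

Here the leaky mean is pulled toward the extreme test value, so the training features quietly carry information about the test set.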

Calibration

A CALIBRATED model -- when it says 80% confident, it is right 80% of the time

HIGH ACCURACY DOES NOT EQUAL WELL CALIBRATED

Modern neural networks are typically overconfident -- temperature scaling is a common fix

Calibration measures whether predicted probabilities match actual frequencies. Reliability diagram: predicted probability vs actual frequency -- diagonal line = perfect calibration. Overconfident: predictions cluster near 0 and 1. Underconfident: cluster near 0.5. Temperature scaling: divide logits by T before softmax -- T greater than 1 softens probabilities. Critical for medical AI, financial risk, any application where you act on confidence.

Perfect calibration

When model says 80%, it is correct 80% of the time

Reliability diagram

Plot predicted probability vs actual frequency -- diagonal = perfect

Overconfidence

Common in neural networks -- predictions too extreme

Temperature scaling

Divide logits by T before softmax -- T>1 softens overconfidence
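
A sketch of both ideas on toy numbers: temperature-scaled softmax, and the binning behind a reliability diagram. The logits, T = 2.0, and the helper names are illustrative; in practice T is fitted on a held-out validation set.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by T; T > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reliability_bins(probs, outcomes, n_bins=5):
    """Return (mean predicted probability, actual frequency) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    return [
        (sum(p for p, _ in b) / len(b), sum(o for _, o in b) / len(b))
        for b in bins if b
    ]


logits = [4.0, 1.0, 0.0]
sharp = softmax(logits)           # T = 1: top class gets most of the mass
soft = softmax(logits, 2.0)       # T = 2: same ranking, less extreme confidence

# Model says 90% five times but is right four times: overconfident by 10 points.
bins = reliability_bins([0.9] * 5, [1, 1, 1, 1, 0])
```

Dividing by T never changes which class wins, only how confident the probabilities look.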

MLOps

MLOps = DevOps + Data + Models -- CI/CD for machine learning systems

DATA DRIFT AND CONCEPT DRIFT AND MODEL MONITORING AND RETRAINING

Model code is a small fraction of ML system code (often quoted as about 5%) -- the rest is pipelines, monitoring, and infrastructure

MLOps applies software engineering to ML systems. Model monitoring: track data drift (input distribution changes), concept drift (label relationship changes), model performance degradation. Version control: code (Git), data (DVC), models (MLflow). CI/CD pipelines: automatically retrain and test models when data or code changes. Feature stores: centralized reusable feature computation.

Data drift

Input feature distribution changes over time

Concept drift

Relationship between features and labels changes

Model monitoring

Track data drift, performance, and prediction distributions

Version control

Track code, data, and model versions -- enable rollback
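
One very simple data drift monitor, as a sketch: compare the live mean of a feature against the training reference, measured in reference standard deviations. The function name, the sample values, and the alert level of 2.0 are all assumptions for illustration; production monitors usually compare whole distributions rather than means.

```python
import statistics


def mean_shift_score(reference, live):
    """Distance of the live mean from the reference mean, in reference std devs."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(live) - ref_mean) / ref_std


DRIFT_THRESHOLD = 2.0   # hypothetical alert level, tune per feature

reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]   # feature values at training time
live_ok = [10.1, 9.9, 10.3, 9.7]                            # recent traffic, same distribution
live_drifted = [14.0, 15.0, 13.5, 14.5]                     # recent traffic, inputs have shifted

score_ok = mean_shift_score(reference, live_ok)
score_drifted = mean_shift_score(reference, live_drifted)

needs_review = score_drifted > DRIFT_THRESHOLD   # would trigger an alert or retraining check
```

This only catches data drift. Concept drift needs fresh labels, because the inputs can look unchanged while the feature-label relationship moves.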

A/B Testing

A/B test = controlled experiment -- one change, random assignment, measure the right metric

NEVER STOP EARLY -- WAIT FOR THE FULL PRE-SPECIFIED DURATION

A/B testing compares two versions by randomly assigning users and measuring outcomes

Split live traffic: 50% current model (A), 50% new model (B). Measure a business metric (conversion, revenue), not just an ML metric (accuracy). Key principles: random assignment (eliminates selection bias), single change (isolate cause), sufficient sample size, pre-specified duration. Common mistake: stopping the test as soon as p<0.05 appears, which inflates the false positive rate.

Random assignment

Eliminates selection bias -- essential for valid comparison

Single change

Isolate exactly what is causing any difference

Pre-specified duration

Never stop early -- peeking inflates false positive rate

Business metric

Measure what actually matters -- not just ML accuracy
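
Why peeking hurts can be sketched with an A/A simulation: both arms have the same true conversion rate, so every "significant" result is a false positive. The two-proportion z-test with a pooled standard error is one common choice, assumed here; the sample sizes, check interval, and function names are illustrative.

```python
import math
import random


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test p-value using a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_trials=500, n_users=1000, check_every=100, rate=0.10, seed=0):
    """Return (false positive rate with peeking, false positive rate at fixed horizon)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_trials):
        conv_a = conv_b = 0
        ever_significant = False
        for i in range(1, n_users + 1):            # i users per arm so far
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % check_every == 0 and two_sided_p_value(conv_a, i, conv_b, i) < 0.05:
                ever_significant = True            # a peeker would have stopped here
        peeking_hits += ever_significant
        # Fixed horizon: only the final, pre-specified look counts.
        fixed_hits += two_sided_p_value(conv_a, n_users, conv_b, n_users) < 0.05
    return peeking_hits / n_trials, fixed_hits / n_trials


peeking_rate, fixed_rate = simulate()
```

Because the peeking rule also includes the final look, its false positive rate can never be lower than the fixed-horizon rate, and with ten looks it is typically well above the nominal 5%.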
