Robot AI -- Model Evaluation

Memory tricks that make precision, recall, F1 and AUC click

From the confusion matrix to cross-validation -- these memory tricks lock in every model evaluation metric, when to use each one, and how to honestly report your results.

Model Evaluation

Memory Tricks

Proven Mnemonics & Acronyms — fast to learn, hard to forget.

Confusion Matrix
CONFUSE the matrix -- TP, FP, FN, TN are the four boxes everything else comes from
TRUE POSITIVE AND FALSE POSITIVE AND FALSE NEGATIVE AND TRUE NEGATIVE
Draw the matrix -- predicted on columns, actual on rows. Diagonal = correct predictions.
True Positive (TP): predicted positive, actually positive. False Positive (FP): predicted positive, actually negative -- Type I error, false alarm. False Negative (FN): predicted negative, actually positive -- Type II error, missed detection. True Negative (TN): predicted negative, actually negative. Accuracy, precision, recall and F1 all derive from these four numbers at a single threshold; ROC and AUC come from recomputing them at every threshold.
TP
Predicted positive, actually positive -- correct detection
FP
Predicted positive, actually negative -- false alarm (Type I error)
FN
Predicted negative, actually positive -- missed detection (Type II error)
TN
Predicted negative, actually negative -- correct rejection
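To make the four boxes concrete, here is a minimal sketch that counts them for binary labels. The function name confusion_counts and the toy label lists are made up for illustration; they are not from any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for binary labels (hypothetical helper)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # correct detection
        elif predicted == positive:
            fp += 1  # false alarm (Type I error)
        elif actual == positive:
            fn += 1  # missed detection (Type II error)
        else:
            tn += 1  # correct rejection
    return tp, fp, fn, tn


# Toy data, invented for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, accuracy)
```

The diagonal boxes (TP and TN) are the correct predictions, so accuracy is their share of all four boxes.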
Precision and Recall
Precision is PICKY -- Recall REMEMBERS to catch everything
PICKY FOR PRECISION AND REMEMBERS FOR RECALL -- IMPOSSIBLE TO FORGET
P for Picky -- few false alarms. R for Remembers -- few misses. Two words for life.
Precision is PICKY: only says yes when really sure -- few false alarms. TP divided by (TP + FP). Recall REMEMBERS everything: sweeps wide to catch every real positive -- few misses. TP divided by (TP + FN). Tradeoff: raising the classification threshold usually raises precision but lowers recall. Spam filter: be PICKY (don't block real email). Cancer screening: REMEMBER everything (don't miss a cancer).
Precision = TP/(TP+FP)
Picky -- minimize false alarms
Recall = TP/(TP+FN)
Remembers -- minimize missed detections
Tradeoff
Higher threshold = usually more precise, but misses more
Cancer screening
Prioritize recall -- missing a cancer is catastrophic
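The threshold tradeoff can be sketched in a few lines. The helper precision_recall and the scores below are invented for illustration; a score at or above the threshold is treated as a positive prediction, which is an assumption of this sketch.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold count as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and model scores, invented for this example.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]

low = precision_recall(y_true, scores, threshold=0.35)   # wide net
high = precision_recall(y_true, scores, threshold=0.65)  # picky
print("low threshold :", low)
print("high threshold:", high)
```

On this toy data the low threshold catches every positive at the cost of one false alarm, while the high threshold has no false alarms but misses one positive.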
F1 Score
F1 = Harmonic mean of Precision and Recall -- punishes extreme imbalance between the two
F1 = 2 TIMES (P TIMES R) DIVIDED BY (P + R)
Harmonic mean is low if EITHER value is low -- you cannot compensate low recall with high precision
F1 combines precision and recall into one metric. Harmonic mean punishes extreme imbalance -- you cannot hide 10% recall behind 99% precision. F1 = 1.0 is perfect, 0 is worst. Use when class imbalance exists and you care about both FP and FN. F-beta: beta greater than 1 emphasizes recall (medical). Beta less than 1 emphasizes precision.
Harmonic mean
Low if EITHER P or R is low -- no hiding bad recall
Use F1 when
Class imbalance exists and both FP and FN matter
F-beta with beta > 1
Emphasizes recall -- medical diagnosis
F-beta with beta < 1
Emphasizes precision -- spam filtering
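A short sketch of why the harmonic mean cannot hide bad recall, using the 99% precision and 10% recall case from the text. The function name f_beta is chosen for this example; the formula is the standard F-beta definition, which reduces to F1 when beta is 1.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1 (harmonic mean of P and R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.99, 0.10

arithmetic_mean = (precision + recall) / 2        # looks deceptively fine
f1 = f_beta(precision, recall)                    # dragged down by recall
f2 = f_beta(precision, recall, beta=2.0)          # weights recall more
f_half = f_beta(precision, recall, beta=0.5)      # weights precision more

print(round(arithmetic_mean, 3), round(f1, 3), round(f2, 3), round(f_half, 3))
```

Because recall is the weak side here, the recall-heavy F2 scores lowest and the precision-heavy F0.5 scores highest, with F1 in between.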
ROC and AUC
ROC curve plots TPR vs FPR at every threshold -- AUC = area under that curve
AUC 0.5 = RANDOM AND AUC 1.0 = PERFECT AND AUC 0.8+ = GOOD
AUC = probability that model ranks a random positive higher than a random negative
ROC plots True Positive Rate (recall) vs False Positive Rate at every classification threshold. AUC = 0.5: no better than random. AUC = 1.0: perfect. AUC = 0.8+: good as a rule of thumb, though what counts as good depends on the task. Threshold-independent -- evaluates ranking ability. Use a PR curve instead of ROC when the positive class is very rare -- ROC can look optimistic under severe class imbalance.
TPR (Y-axis)
True Positive Rate = Recall = TP/(TP+FN)
FPR (X-axis)
False Positive Rate = FP/(FP+TN)
AUC meaning
Probability that model ranks a random positive above a random negative
PR curve
Better than ROC when positive class is very rare
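The ranking interpretation of AUC can be checked directly by comparing every positive with every negative. This is a minimal sketch with an invented function name and toy data; it counts ties as half a win, a common convention, and it is quadratic in the number of examples, so it is for intuition rather than large datasets.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data, invented for this example: one positive is ranked below one negative.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_by_ranking(y_true, scores)
perfect = auc_by_ranking([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(auc, perfect)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.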
Regression Metrics
MAE in original units -- MSE squares the error -- RMSE back to original units -- R-squared explains variance
FOUR WAYS TO MEASURE REGRESSION ERROR
R-squared can be negative -- model is worse than simply predicting the mean for every example
MAE (Mean Absolute Error): average of absolute differences -- less sensitive to outliers than MSE, original units. MSE (Mean Squared Error): average of squared differences -- penalizes large errors heavily, in squared units. RMSE (Root Mean Squared Error): square root of MSE -- back in original units, commonly reported. R-squared: proportion of variance explained -- 1.0 is perfect, 0 is no better than always predicting the mean, and it can be negative.
MAE
Average absolute error -- less sensitive to outliers than MSE, original units
MSE
Average squared error -- penalizes large errors heavily
RMSE
Square root of MSE -- back in original units
R-squared
Variance explained: 1=perfect, 0=no better than mean, negative=worse
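All four regression metrics fit in one small function. This is a sketch with invented names and toy numbers; the second prediction list is deliberately bad to show R-squared going negative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE, R-squared) for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # model's squared error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # mean predictor's squared error
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2


# Toy data, invented for this example.
y_true = [3.0, 5.0, 7.0, 9.0]
decent = [2.0, 5.0, 8.0, 11.0]
backwards = [9.0, 7.0, 5.0, 3.0]  # worse than predicting the mean

mae, mse, rmse, r2 = regression_metrics(y_true, decent)
bad_r2 = regression_metrics(y_true, backwards)[3]
print(mae, mse, rmse, r2, bad_r2)
```

R-squared compares the model's squared error with that of always predicting the mean, which is why a model worse than the mean predictor lands below zero.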
Cross-Validation
K-FOLD -- K times: train on K-1 folds, test on 1 fold, rotate, average all scores
MORE RELIABLE THAN A SINGLE TRAIN/TEST SPLIT
Golden rule: NEVER use the test set for anything except final evaluation -- no tuning on test
Split data into K equal folds. For each fold: train on the other K-1 folds, evaluate on the remaining fold. Average performance across all K folds. 5-fold and 10-fold are the standard choices. Stratified K-fold: ensures each fold has the same class distribution -- important for imbalanced data. Time series: always split chronologically -- never randomly.
5-fold standard
5 rounds: each fold serves as test set once
Stratified K-fold
Each fold maintains class distribution -- use for imbalanced data
Time series CV
Walk-forward validation -- always train on past, test on future
Golden rule
Never use test set for anything except final evaluation
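The rotate-and-average loop can be sketched without any library. The names k_fold_indices and mean_predictor_mae are hypothetical, the "model" is just the training mean, and the folds are contiguous with no shuffling; for non-time-series data you would normally shuffle first, which this sketch leaves out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy target values, invented for this example.
values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]


def mean_predictor_mae(train_idx, test_idx):
    """Fit on the training fold only (here: its mean), score MAE on the held-out fold."""
    prediction = sum(values[i] for i in train_idx) / len(train_idx)
    return sum(abs(values[i] - prediction) for i in test_idx) / len(test_idx)


folds = list(k_fold_indices(len(values), k=5))
fold_scores = [mean_predictor_mae(train, test) for train, test in folds]
cv_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, cv_score)
```

The reported number is the average over all five held-out folds, and a separate final test set, if you have one, stays untouched throughout.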
Class Imbalance
SMOTE -- Synthetic Minority Oversampling TEchnique -- create synthetic rare-class examples
WHEN 99% IS ONE CLASS -- ACCURACY IS USELESS
Apply resampling to training data ONLY -- never to test set
A model that always predicts the majority class gets 99% accuracy but 0% recall on the minority class -- useless. SMOTE: creates synthetic minority examples by interpolating between existing ones. Class weights: tell the algorithm to penalize minority class errors more (for example class_weight="balanced" in scikit-learn). Use F1 or AUC, not accuracy, for evaluation.
Accuracy trap
99% accuracy with all-majority predictor = useless
SMOTE
Synthetic minority oversampling -- apply to training data ONLY
Class weights
class_weight="balanced" -- similar in effect to oversampling, but reweights errors instead of adding examples
Metrics to use
F1, AUC-ROC, or a PR curve when positives are very rare -- not accuracy with imbalanced data
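The accuracy trap and the idea behind balanced class weights can both be shown on a 99-to-1 toy dataset. The weighting formula below, n_samples / (n_classes * class_count), is the heuristic scikit-learn documents for its "balanced" mode; the function name balanced_class_weights is made up for this sketch.

```python
from collections import Counter

# 99 majority examples and 1 minority example, invented for this example.
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # a "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
minority_recall = tp / (tp + fn)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Compute weights from TRAINING labels only; y_true stands in for them here.
weights = balanced_class_weights(y_true)
print(accuracy, minority_recall, weights)
```

With these weights, one mistake on the rare class costs about as much as 99 mistakes on the common class, which is why the effect resembles oversampling.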
Data Leakage
LEAK = test set information contaminating training -- model looks great until it hits the real world
ONE OF THE MOST COMMON REASONS ML MODELS FAIL IN PRODUCTION
Always split FIRST then transform -- never fit transformers on the full dataset
Data leakage: information from outside training set contaminates the model. Sources: fitting scaler on full dataset before splitting (never do this), including features that wouldn't be available at inference time, using future information for historical prediction. Detection: unrealistically high performance, feature importances that don't make causal sense, dramatic performance drop in production.
Train-test contamination
Fitting transformers on full dataset before splitting
Target leakage
Feature that encodes the target or is only known after the outcome -- unavailable at inference time
Temporal leakage
Using future data to predict past
Split FIRST
Always split data BEFORE fitting any transformers
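Split-first-then-transform looks like this for a simple standardizer. The helper names and the toy numbers are invented; the large last value is there to show how much the statistics change if the test rows sneak into the fit.

```python
def fit_standardizer(values):
    """Learn mean and standard deviation from the given values only."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # avoid dividing by zero for constant data
    return mean, std


def transform(values, mean, std):
    """Apply already-fitted statistics; never refit here."""
    return [(v - mean) / std for v in values]


# Toy feature column, invented for this example.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]

# 1. Split FIRST.
train, test = data[:6], data[6:]

# 2. Fit on the training split only, then apply to both splits.
mean, std = fit_standardizer(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# The leaky version: statistics computed on all rows, test included.
leaky_mean, leaky_std = fit_standardizer(data)
print(mean, leaky_mean)
```

The training-only mean is 3.5, while the leaky mean is 16.0 because it has already seen the test rows.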
Calibration
A CALIBRATED model -- when it says 80% confident, it is right 80% of the time
HIGH ACCURACY DOES NOT EQUAL WELL CALIBRATED
Modern neural networks are often overconfident -- temperature scaling is a common fix
Calibration measures whether predicted probabilities match actual frequencies. Reliability diagram: predicted probability vs actual frequency -- diagonal line = perfect calibration. Overconfident: predictions cluster near 0 and 1. Underconfident: cluster near 0.5. Temperature scaling: divide logits by T before softmax -- T greater than 1 softens probabilities. Critical for medical AI, financial risk, any application where you act on confidence.
Perfect calibration
When model says 80%, it is correct 80% of the time
Reliability diagram
Plot predicted probability vs actual frequency -- diagonal = perfect
Overconfidence
Common in neural networks -- predictions too extreme
Temperature scaling
Divide logits by T before softmax -- T>1 softens overconfidence
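Temperature scaling is a one-line change to softmax. In this sketch the logits and the value T = 2.0 are made up; in practice T is usually fitted on a held-out validation set rather than picked by hand.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T (T=1 is plain softmax)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for a 3-class prediction, invented for this example.
logits = [4.0, 1.0, 0.0]

sharp = softmax(logits)                   # T = 1: original confidence
soft = softmax(logits, temperature=2.0)   # T > 1: softened confidence

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

The predicted class stays the same because dividing by a positive T keeps the ordering of the logits; only the confidence attached to it drops.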
MLOps
MLOps = DevOps + Data + Models -- CI/CD for machine learning systems
DATA DRIFT AND CONCEPT DRIFT AND MODEL MONITORING AND RETRAINING
Model code is a small fraction of ML system code (often quoted as roughly 5%) -- the rest is pipelines, monitoring, and infrastructure
MLOps applies software engineering to ML systems. Model monitoring: track data drift (input distribution changes), concept drift (label relationship changes), model performance degradation. Version control: code (Git), data (DVC), models (MLflow). CI/CD pipelines: automatically retrain and test models when data or code changes. Feature stores: centralized reusable feature computation.
Data drift
Input feature distribution changes over time
Concept drift
Relationship between features and labels changes
Model monitoring
Track data drift, performance, and prediction distributions
Version control
Track code, data, and model versions -- enable rollback
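A very simple data drift check compares a live batch of one feature against a reference sample from training time. Everything here is an assumption for illustration: the function name, the toy numbers, and the alert threshold of 2 standard deviations. Production monitoring typically uses richer tests across many features.

```python
import statistics


def mean_shift_in_std(reference, live):
    """How far the live mean has moved, in units of the reference std."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        return 0.0 if statistics.mean(live) == ref_mean else float("inf")
    return abs(statistics.mean(live) - ref_mean) / ref_std


DRIFT_THRESHOLD = 2.0  # hypothetical alert level, tune per feature

# Toy feature values, invented for this example.
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]  # training-time sample
stable_batch = [10.1, 9.9, 10.3, 9.7]                       # looks like training data
shifted_batch = [13.0, 12.5, 13.5, 12.8]                    # input distribution moved

for name, batch in [("stable", stable_batch), ("shifted", shifted_batch)]:
    shift = mean_shift_in_std(reference, batch)
    print(name, round(shift, 2), "DRIFT" if shift > DRIFT_THRESHOLD else "ok")
```

This only detects data drift in the inputs; concept drift needs fresh labels, because the inputs can look unchanged while their relationship to the label moves.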
A/B Testing
A/B test = controlled experiment -- one change, random assignment, measure the right metric
NEVER STOP EARLY -- WAIT FOR THE FULL PRE-SPECIFIED DURATION
A/B testing compares two versions by randomly assigning users and measuring outcomes
Split live traffic: 50% current model (A), 50% new model (B). Measure business metric (conversion, revenue) not just ML metric (accuracy). Key principles: random assignment (eliminates selection bias), single change (isolate cause), sufficient sample size, pre-specified duration. Common mistake: stopping test when p<0.05 appears -- leads to inflated false positive rates.
Random assignment
Eliminates selection bias -- essential for valid comparison
Single change
Isolate exactly what is causing any difference
Pre-specified duration
Never stop early -- peeking inflates false positive rate
Business metric
Measure what actually matters -- not just ML accuracy
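When the business metric is a conversion rate, the comparison at the end of the test is often a two-proportion z-test. This is a minimal sketch with invented counts and a hypothetical function name; it assumes the test is evaluated once, after the pre-specified duration, not at every peek.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Invented results: 5,000 users per arm, randomly assigned.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(round(z, 2), round(p_value, 4))
```

Running this check repeatedly during the test and stopping at the first p below 0.05 is the peeking mistake described above, because every extra look is another chance for a false positive.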
© 2026 MemoryTricks 🧠