AI Algorithms Memory Tricks -- Free Mnemonics

Search Algorithms

A* = g(n) + h(n) -- actual cost PLUS heuristic estimate to goal

BFS AND DFS AND UNIFORM COST AND A-STAR

A-star is optimal if the heuristic is admissible -- never overestimates the true cost

BFS (queue, level-by-level): complete and optimal for unweighted graphs. DFS (stack, goes deep first): memory efficient but not optimal. Uniform Cost Search: optimal for weighted graphs. A* Search: f(n) = g(n) + h(n). g(n) = actual cost from start. h(n) = heuristic estimate to goal. Optimal if heuristic is admissible (never overestimates). Used in GPS navigation and game pathfinding.

BFS

Queue (FIFO), level-by-level, complete, optimal for unweighted

DFS

Stack (LIFO), memory efficient, NOT optimal

A* f(n)=g(n)+h(n)

Optimal if heuristic admissible (never overestimates)

Admissible examples

Straight-line distance, misplaced tiles in 8-puzzle

Minimax

MINIMAX -- MAXimize your score, MINimize opponent score at alternating levels

ASSUMES OPTIMAL OPPONENT -- ALWAYS PLAN FOR WORST CASE

Alpha-beta pruning eliminates branches that cannot affect the final decision -- free speed

Minimax for two-player zero-sum games (chess, tic-tac-toe). MAX player (you) chooses action maximizing score. MIN player (opponent) minimizes your score. Alternate MAX and MIN levels. Alpha-Beta Pruning: prune branches where alpha is >= beta -- same result, dramatically fewer nodes. Reduces from O(b^d) to O(b^(d/2)) in best case. AlphaZero: minimax plus deep learning plus self-play -- beat world champions.

Minimax

Alternate MAX (your move) and MIN (opponent move) levels

Alpha-beta pruning

Prune when alpha >= beta -- same result, far fewer nodes

AlphaZero

Minimax + deep learning + self-play -- beat Go, chess, shogi

CSP

CSP = Variables, Domains, Constraints must all be satisfied simultaneously

SUDOKU AND MAP COLORING AND SCHEDULING ARE ALL CSPs

MRV heuristic -- assign the most constrained variable first -- fails early and saves time

Variables (cells, regions), Domains (digits 1-9, colors), Constraints (no repeats, adjacent regions differ). Backtracking: assign one variable at a time, backtrack when constraint violated. MRV (Minimum Remaining Values): assign most constrained variable first. AC-3 Arc Consistency: preprocessing that removes domain values with no compatible assignment.

Variables

What needs to be assigned (cells, regions, time slots)

Domains

Possible values for each variable

Constraints

Rules that must hold (no repeats, adjacency, order)

MRV heuristic

Assign most constrained variable first -- catches failures early

Genetic Algorithms

SELECT, CROSS, MUTATE -- evolution as an optimization algorithm

POPULATION AND FITNESS AND SELECTION AND CROSSOVER AND MUTATION

Mutation maintains diversity -- without it the population converges to a local optimum

Population: candidate solutions. Fitness function: evaluate solution quality. Selection: fitter solutions more likely to reproduce. Crossover: combine two parents to produce offspring. Mutation: randomly alter part of a solution -- maintains diversity, avoids local optima. Repeat for many generations. Used when search space is huge or gradient is unavailable.

Population

Set of candidate solutions

Fitness function

Evaluates how good each solution is

Crossover

Combine parents -- offspring inherits from both

Mutation

Random change -- prevents convergence to local optima

Bayesian Networks

BAYES NET -- probabilistic graphical model where ARROWS MEAN CAUSES

DIRECTED ACYCLIC GRAPH WITH CONDITIONAL PROBABILITY TABLES

Given its parents, a node is conditionally independent of all non-descendants

Directed Acyclic Graph (DAG). Nodes = random variables. Edges = direct probabilistic influence (X to Y means X directly influences Y). Each node stores a Conditional Probability Table (CPT) given its parents. Applications: medical diagnosis, spam filtering, fault diagnosis. Inference: given evidence, compute probability of query variable. Inference is NP-hard in general.

Nodes

Random variables (symptoms, diseases, sensor readings)

Edges

X to Y means X directly influences Y

CPT

P(node|parents) -- conditional probability table at each node

Inference

Given evidence, compute P(query) -- NP-hard in general

Gradient Descent

GRADIENT DESCENT goes DOWNHILL -- like a ball rolling to the lowest point

BATCH AND SGD AND MINI-BATCH AND ADAM -- FOUR VARIANTS

Learning rate is the most critical hyperparameter -- too high diverges, too low is very slow

Batch GD: gradient on entire dataset -- stable, slow. SGD: gradient on one example -- fast, very noisy. Mini-batch (32-256 samples): standard in deep learning -- best of both worlds. Adam: adaptive per-parameter learning rate -- most popular optimizer, works with defaults (lr=0.001). Momentum: adds inertia, avoids oscillation, escapes saddle points.

Batch GD

Entire dataset per update -- stable but slow

SGD

One example per update -- fast, noisy, can escape local optima

Mini-batch (32-256)

Standard in deep learning -- best compromise

Adam

Adaptive per-parameter LR -- use by default, start lr=0.001

Ensemble Methods

BAG and BOOST -- Bagging = parallel, Boosting = sequential error correction

RANDOM FOREST (BAGGING) AND XGBOOST (BOOSTING) AND STACKING

Ensembles consistently outperform single models -- they win Kaggle competitions

Bagging: parallel independent models on random subsets, average predictions -- reduces variance. Random Forest = bagging of decision trees. Boosting: sequential, each model corrects previous errors -- reduces bias. XGBoost and LightGBM = gradient boosting. Stacking: meta-model learns to combine base model outputs. No Free Lunch: try ensembles before complex single models.

Bagging

Parallel, independent -- reduces variance (overfitting)

Boosting

Sequential, corrects prior errors -- reduces bias (underfitting)

Random Forest

Bagging of decision trees -- reliable off-the-shelf algorithm

XGBoost

Gradient boosting -- most powerful for tabular data

PCA

PCA finds the DIRECTIONS OF MAXIMUM VARIANCE in the data

REDUCE DIMENSIONS WHILE PRESERVING THE MOST IMPORTANT STRUCTURE

Always standardize data before PCA -- large-scale features will otherwise dominate

PCA finds orthogonal directions (principal components) of maximum variance. PC1: greatest variance direction. Project data onto top k components. Scree plot: variance explained vs k -- choose where curve elbows (~95% variance kept). Always standardize first. t-SNE and UMAP: nonlinear alternatives for visualization only.

PC1

Direction of greatest variance in the data

Scree plot

Variance explained vs k -- pick the elbow (~95% total)

Standardize first

Large-scale features dominate without standardization

t-SNE and UMAP

Nonlinear -- visualization only, not for feature extraction

Monte Carlo

SAMPLE then AVERAGE -- Monte Carlo estimates by running many random simulations

LAW OF LARGE NUMBERS: RANDOM SAMPLE AVERAGE CONVERGES TO TRUE VALUE

Error decreases as 1 over sqrt(N) -- quadruple samples to halve the error

Run many random simulations, average results -- converges to true value as samples increase. Monte Carlo Tree Search (MCTS): simulate random game playouts from each position to guide search -- used in AlphaGo. MCMC (Markov Chain Monte Carlo): sample from complex probability distributions -- used in Bayesian inference.

Core idea

Many random samples, take average, converges to true value

MCTS

Game AI -- simulate playouts to estimate position value

MCMC

Sample from complex distributions -- Bayesian inference

Error rate

Decreases as 1/sqrt(N) -- 4x samples to halve error

Association Rules

SUPPORT, CONFIDENCE, LIFT -- the three metrics that make bread-to-butter meaningful

MARKET BASKET ANALYSIS -- ITEMS THAT FREQUENTLY APPEAR TOGETHER

Apriori uses anti-monotonicity -- if A,B is infrequent then A,B,C must also be infrequent

Support: how often do X and Y appear together? Confidence (A to B): P(B given A). Lift: is the relationship stronger than chance? Lift greater than 1 = positive association beyond chance. Apriori: uses anti-monotonicity to prune itemset search efficiently. Applications: recommendations, fraud detection, drug interactions.

Support

Frequency of itemset -- how common is the pair?

Confidence (A to B)

P(B|A) -- given A, how often is B also present?

Lift greater than 1

Items appear together more than expected by chance

Apriori

Prune by anti-monotonicity -- efficient frequent itemset mining

Dynamic Programming

DIVIDE into subproblems, STORE solutions, REUSE -- never solve the same subproblem twice

OPTIMAL SUBSTRUCTURE PLUS OVERLAPPING SUBPROBLEMS = USE DP

Viterbi algorithm uses DP for HMMs -- speech recognition and POS tagging both use it

DP solves complex problems by breaking into overlapping subproblems, solving each once, storing results (memoization). Two conditions: optimal substructure and overlapping subproblems. Top-down (memoization): recursive with caching. Bottom-up (tabulation): iterative. Viterbi algorithm: DP for finding most likely sequence in a Hidden Markov Model -- used in speech recognition and POS tagging.

Optimal substructure

Optimal solution contains optimal sub-solutions

Overlapping subproblems

Same subproblems solved repeatedly in naive recursion

Memoization

Store results -- top-down recursive approach

Viterbi

DP for HMM sequences -- speech recognition and POS tagging

Memory tricks that make search, optimization and classic AI click

Memory Tricks