Computer Vision Memory Tricks -- Free AI Mnemonics

CV Fundamentals

PIXELS to FEATURES to PREDICTIONS -- the three levels of what a CNN sees

EDGES THEN SHAPES THEN OBJECTS -- LEARNED HIERARCHICALLY FROM DATA

Before deep learning: hand-crafted SIFT and HOG features. After: learned automatically from data.

Early layers detect low-level features (edges, corners, color gradients). Middle layers detect textures, shapes, and patterns. Deep layers detect semantic concepts (faces, cars, dogs). This hierarchy loosely mirrors the processing stages of the human visual cortex. AlexNet (2012): first deep CNN to win the ImageNet challenge (ILSVRC) -- it beat the runner-up by a wide margin (15.3% vs 26.2% top-5 error) and launched the deep learning revolution in computer vision.

Early layers

Edges, corners, color gradients -- low-level features

Middle layers

Textures, shapes, and patterns -- mid-level features

Deep layers

Semantic concepts: faces, cars, dogs -- high-level features

AlexNet 2012

First deep CNN to win ImageNet -- launched the CV revolution
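The claim that early layers respond to edges can be illustrated with a fixed edge filter. The sketch below is not a trained CNN: it applies a hand-written Sobel kernel (a classical filter, chosen here only for illustration) to a tiny image with one vertical edge. The names correlate2d and SOBEL_X are hypothetical. Learned first-layer filters often look similar, but they come from training rather than from a formula.

```python
# Illustration only: a fixed Sobel kernel stands in for the kind of edge
# filter that early CNN layers tend to learn from data.

def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation (the operation CNN layers call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Responds to left-to-right intensity changes (vertical edges).
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# 5x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

response = correlate2d(image, SOBEL_X)
for row in response:
    print(row)  # each row is [0, 4, 4, 0]: strong response only at the edge
```

The response is zero over the flat regions and large where the window straddles the edge, which is the behavior the "early layers" card describes.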

Object Detection

YOLO -- You Only Look Once -- detect all objects in one single forward pass

CLASSIFICATION = WHAT, DETECTION = WHAT + WHERE, SEGMENTATION = WHAT + WHICH PIXELS

IoU = Intersection over Union -- measures overlap between predicted and ground truth box

Classification: what is in this image. Detection: WHAT and WHERE -- bounding box plus class label for each object. Segmentation: which exact pixels belong to each object. YOLO: one-stage detector, divides image into grid, predicts boxes and classes simultaneously -- real-time speed. Faster R-CNN: two-stage (region proposals, then classification), typically slower but historically more accurate. IoU: area of overlap divided by area of union, ranging from 0 (no overlap) to 1 (identical boxes) -- 0.5 is a common threshold for counting a detection as correct.

Classification

Single label for whole image -- is there a cat?

Detection

Bounding box (x,y,w,h) + class label per object

Segmentation

Pixel-level mask per object -- most precise

YOLO

One forward pass, real-time speed -- slight accuracy tradeoff
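The IoU definition in this section (area of overlap divided by area of union) can be sketched in a few lines. This is a minimal example that assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner; some detectors, YOLO included, use the box center instead, so that convention is an assumption here. The function name iou and the example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes, top-left origin assumed."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


predicted = (10, 10, 100, 100)
ground_truth = (60, 10, 100, 100)

score = iou(predicted, ground_truth)
print(round(score, 3), score >= 0.5)  # 0.333 False
```

Half of each box overlaps the other, yet the IoU is only one third, so this prediction would be rejected at the 0.5 threshold.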

U-Net Segmentation

U-NET looks like a U -- encoder compresses down, decoder expands back up with skip connections

ENCODER SHRINKS TO CAPTURE CONTEXT -- DECODER EXPANDS TO RECOVER SPATIAL DETAIL

Skip connections preserve fine spatial detail that is lost during max pooling

U-Net: contracting path (encoder) compresses spatial dimensions while increasing channels -- captures what is in the image. Expanding path (decoder) restores spatial resolution -- recovers where. Skip connections: copy encoder feature maps directly to corresponding decoder layer -- preserve fine spatial detail lost during downsampling. Dominant for medical image segmentation. SAM (Meta 2023): foundation model -- segment any object with a click.

Encoder path

Compress spatial dimensions -- capture semantic context

Decoder path

Restore spatial resolution -- recover precise locations

Skip connections

Encoder feature maps passed directly to decoder -- preserve detail

SAM (Segment Anything)

Meta 2023 -- foundation model for segmentation
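To make the encoder, decoder, and skip connection roles concrete, the sketch below traces tensor shapes through a simplified U-Net without running any real network. It assumes size-preserving convolutions, 2x downsampling per encoder level, 2x upsampling per decoder level, and channel doubling per level; the original U-Net paper differs in details such as unpadded convolutions. The function name unet_shapes and its defaults are hypothetical.

```python
def unet_shapes(height, width, base_channels=64, depth=4):
    """Trace (channels, height, width) through a simplified U-Net."""
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    encoder = []
    c, h, w = base_channels, height, width
    for _ in range(depth):
        encoder.append((c, h, w))          # kept for the skip connection
        c, h, w = c * 2, h // 2, w // 2    # downsample, double channels
    bottleneck = (c, h, w)

    decoder = []
    for skip_c, skip_h, skip_w in reversed(encoder):
        c, h, w = c // 2, h * 2, w * 2     # upsample, halve channels
        decoder.append({
            "upsampled": (c, h, w),
            # Skip connection: encoder map concatenated along the channel axis.
            "after_concat": (c + skip_c, h, w),
            # Convolutions after the concat reduce the channels again.
            "output": (skip_c, skip_h, skip_w),
        })
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_shapes(256, 256)
print("bottleneck:", bottleneck)
for stage in decoder:
    print(stage)
```

The trace shows the U shape directly: spatial size shrinks while channels grow on the way down, and each decoder stage briefly doubles its channel count when the matching encoder feature map is concatenated.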

Data Augmentation

FLIP, CROP, ROTATE, COLOR -- data augmentation creates bigger datasets for free

AUGMENTATION IS REGULARIZATION -- FORCES INVARIANCE TO IRRELEVANT VARIATIONS

A cat is still a cat when flipped, slightly rotated, or slightly brighter -- teach this

Geometric: horizontal/vertical flip, random crop, rotation, perspective warp. Color: brightness, contrast, saturation, hue jitter. Advanced: Cutout (mask random patches), MixUp (blend two images and their labels), CutMix (paste patches from one image into another and mix the labels in proportion), AutoAugment (learned augmentation policy). A common ImageNet training recipe, using torchvision transform names: RandomResizedCrop(224) + RandomHorizontalFlip + ColorJitter + Normalize.

Geometric augmentation

Flip, crop, rotate, perspective -- spatial invariance

Color augmentation

Brightness, contrast, saturation -- color invariance

MixUp

Blend two images and labels -- smooth decision boundaries

AutoAugment

Augmentation policy learned by search -- often strong results, but the search is expensive
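MixUp, described above as blending two images and their labels, reduces to one weighted average applied to both. The sketch below works on flattened images and one-hot labels as plain lists. The function names are hypothetical, and alpha=0.2 is only an illustrative value for the Beta(alpha, alpha) distribution that the mixing weight is usually drawn from.

```python
import random


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two flattened images and their one-hot labels with weight lam."""
    image = [lam * a + (1 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha is an assumed value."""
    return rng.betavariate(alpha, alpha)


cat_pixels, cat_label = [0.0, 0.5, 1.0, 1.0], [1.0, 0.0]
dog_pixels, dog_label = [1.0, 0.5, 0.0, 0.0], [0.0, 1.0]

mixed_pixels, mixed_label = mixup(cat_pixels, cat_label, dog_pixels, dog_label, lam=0.75)
print(mixed_pixels)  # [0.25, 0.5, 0.75, 0.75]
print(mixed_label)   # [0.75, 0.25]

# In training, lam would be resampled for every batch or pair.
lam = sample_lambda(rng=random.Random(0))
```

The soft label [0.75, 0.25] is what pushes the model toward smoother decision boundaries: it is asked to be 75% cat on an image that is 75% cat.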

Vision Transformer

ViT -- split image into PATCHES, treat each patch like a TOKEN, run Transformer

AN IMAGE IS WORTH 16x16 WORDS -- THE PAPER THAT LAUNCHED VISION TRANSFORMERS

Swin Transformer: hierarchical ViT with shifted windows -- efficient for detection and segmentation

ViT (Dosovitskiy et al., 2020): split image into patches of 16x16 pixels, flatten each patch, project it to an embedding, add a position embedding, run the sequence through a standard Transformer encoder. Global context from layer 1 -- unlike CNNs that build context gradually. Typically needs more training data than CNNs (weaker built-in inductive biases such as locality and translation equivariance). Swin Transformer: hierarchical ViT with shifted windows, efficient for high-resolution detection and segmentation.

Patch embedding

16x16 pixel patches treated like word tokens

Position embedding

Required -- Transformer has no inherent sense of spatial order

Global attention

Every patch attends to every other from layer 1

Swin Transformer

Hierarchical + shifted windows -- efficient for dense prediction
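The patch step in ViT is mostly bookkeeping, and a short sketch shows how many tokens it yields. The example assumes a 224x224 RGB image, a common input size that is not fixed by the text above, stored as nested lists. The function name patchify is hypothetical, and the learned projection and position embedding that follow in a real ViT are left out.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows, then pixels, then channel values.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

A 224x224 image becomes 14 x 14 = 196 tokens, each holding 16 x 16 x 3 = 768 raw values. Because self-attention compares every token with every other, the cost grows quadratically with that token count, which is part of what windowed designs such as Swin address.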

Generative Vision

VAE encodes to SMOOTH space -- GAN has two fighters -- Diffusion REVERSES NOISE

THREE WAYS TO TEACH A MACHINE TO GENERATE REALISTIC IMAGES

Diffusion models now dominate -- more stable training, better diversity, easier text conditioning

VAE: encode image to a distribution (mean and variance), sample, decode -- smooth latent space enables interpolation. GAN: Generator vs Discriminator, adversarial training, sharp results but prone to mode collapse and training instability. Diffusion: add noise over T steps, train a network to denoise, generate by starting from pure noise and denoising step by step -- more stable training and better diversity than GANs, now the dominant approach for image generation (Stable Diffusion, DALL-E 3, Midjourney).

VAE

Smooth latent space, enables interpolation -- Stable Diffusion runs its diffusion process inside a VAE latent space

GAN

Generator vs Discriminator -- sharp but unstable, mode collapse risk

Diffusion

More stable to train than GANs, high quality and diversity, easy text conditioning -- dominant now

ControlNet

Adds spatial conditioning (pose, depth, edges) to diffusion models
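The "add noise over T steps" half of diffusion has a closed form in the common DDPM formulation, which lets training jump straight to any step t. The sketch below assumes that formulation with a linear noise schedule; T = 1000 and the beta range are conventional choices rather than anything stated above, and all function names are hypothetical. The denoising network itself is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng=random):
    """Sample x_t from x_0 directly: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, noise)]
    return xt, noise  # the denoiser is trained to predict this noise from xt and t


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [1.0, -1.0, 0.5, 0.0]  # a toy "image" of four pixel values
slightly_noisy, _ = add_noise(x0, 10, alpha_bars, random.Random(0))
almost_pure_noise, _ = add_noise(x0, 999, alpha_bars, random.Random(0))
print(alpha_bars[10], alpha_bars[999])
```

Early steps keep alpha_bar close to 1, so the image is barely changed. By the last step alpha_bar is near 0 and the sample is almost pure Gaussian noise, which is why generation can start from noise.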

Face Recognition

EMBED then COMPARE -- face recognition maps faces to vectors then finds nearest match

DETECTION THEN ALIGNMENT THEN EMBEDDING THEN COMPARISON -- FOUR-STEP PIPELINE

ArcFace: widely used strong baseline -- additive angular margin loss for face embeddings

Pipeline: detect faces (MTCNN, RetinaFace), align using landmark points, encode each face into a compact vector (for example 128 dimensions in FaceNet, 512 in ArcFace), compare via cosine similarity to known embeddings. Triplet loss (FaceNet): pull same-person embeddings together, push different people apart. Bias: commercial face analysis systems have shown markedly higher error rates for dark-skinned women. The EU AI Act prohibits real-time remote biometric identification in public spaces for law enforcement, with narrow exceptions.

Detection

Find faces in image -- MTCNN, RetinaFace

Alignment

Normalize face orientation using landmark points

Embedding

CNN encodes identity into compact vector (ArcFace is a widely used choice)

Bias

Higher error rates for dark-skinned women -- documented for commercial gender classifiers in the Gender Shades study (MIT Media Lab, 2018)
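The comparison step of the pipeline is a nearest-neighbor search under cosine similarity. The sketch below uses made-up 4-dimensional vectors in place of real 128- or 512-dimensional embeddings. The names, the gallery contents, and the 0.5 acceptance threshold are all hypothetical; real systems tune the threshold on validation data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify(query, gallery, threshold=0.5):
    """Return (name, score) for the best match, or (None, score) below threshold."""
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score  # treated as an unknown face
    return best_name, best_score


# Toy gallery of enrolled identities (placeholder vectors, not real embeddings).
gallery = {
    "alice": [1.0, 0.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0, 0.0],
}

print(identify([0.9, 0.1, 0.0, 0.0], gallery))  # best match is "alice"
print(identify([0.0, 0.0, 1.0, 0.0], gallery))  # no match above threshold
```

Cosine similarity ignores vector length and compares direction only, which fits losses such as ArcFace that shape embeddings by angle.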

Autonomous Driving

SENSE then PERCEIVE then PLAN then ACT -- the four-stage AV pipeline

CAMERA, LIDAR, RADAR, OR SENSOR FUSION -- THE SENSING DEBATE

SAE Level 2 is Tesla Autopilot -- human still monitors and is responsible

SAE Levels: L0 (no automation) to L5 (full autonomy in any condition). L2: partial automation, human monitors (Tesla Autopilot). L4: high automation within a limited geofenced area (Waymo One). Camera-only (Tesla) vs sensor fusion (Waymo). LiDAR: precise 3D depth, expensive. Camera: cheap, texture-rich, needs depth estimation. Radar: all-weather, direct velocity measurement. Long tail problem: edge cases too rare to appear often in training data -- a major obstacle to reliable autonomous driving.

SAE L2

Partial automation -- human monitors and is legally responsible

SAE L4

High automation -- limited geofenced area, Waymo in Phoenix and SF

Camera-only (Tesla)

Cost reduction -- camera provides texture but needs depth estimation

Long tail problem

Rare edge cases not in training data -- major unsolved challenge

Medical Imaging

READ the scan better than a doctor -- but only when told exactly what to look for

AI AUGMENTS RADIOLOGISTS -- AI PLUS PHYSICIAN OUTPERFORMS EITHER ALONE

Grad-CAM highlights which image regions most influenced the prediction -- helps build clinical trust

Demonstrated: diabetic retinopathy grading (on par with ophthalmologists), CheXNet pneumonia detection on chest X-rays, lymph node metastasis detection. Challenges: distribution shift (performance can drop at hospitals not represented in the training data), class imbalance (rare diseases), regulatory clearance (in the US, often through the FDA 510(k) pathway). Grad-CAM: coarse heatmap of influential image regions -- helps build clinician trust. 3D U-Net: a standard choice for CT and MRI volumetric segmentation.

Diabetic retinopathy

Google AI matches ophthalmologist accuracy

CheXNet

Stanford model (2017) reported exceeding average radiologist performance on pneumonia detection from chest X-rays

Distribution shift

Trained on teaching hospital data -- can fail at rural clinics

Grad-CAM

Highlights influential image regions -- helps build clinician trust

CV Metrics

mAP -- mean Average Precision -- the standard metric for object detection

AP = AREA UNDER PRECISION-RECALL CURVE FOR ONE CLASS

COCO mAP averages over IoU thresholds 0.5 to 0.95 in steps of 0.05 -- much harder than Pascal VOC at 0.5 only

Classification: Top-1 accuracy, Top-5 accuracy (correct class among the five highest-scoring predictions). Object Detection: mAP -- compute AP (area under precision-recall curve) for each class, average across classes. COCO mAP: also average over IoU thresholds 0.5 to 0.95 in steps of 0.05 -- much harder than Pascal VOC (IoU=0.5 only). Segmentation: mIoU (mean Intersection over Union across all semantic classes). FPS: frames per second -- real-time is commonly taken to mean 30+ FPS.

Top-1 accuracy

Is the top prediction the correct class?

mAP

Average AP across all classes (and, for COCO, across IoU thresholds too)

COCO vs VOC

COCO: IoU 0.5-0.95 (hard). VOC: IoU=0.5 only (easier).

mIoU

Mean Intersection over Union -- standard for segmentation
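AP as "area under the precision-recall curve" can be computed from a ranked list of detections. The sketch below assumes each detection has already been marked true or false positive at some IoU threshold, and it uses the all-point interpolated variant of AP; benchmarks differ in how they sample the curve, so this is one reasonable choice rather than the official procedure of any particular benchmark. Function names and example data are made up.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class."""
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Interpolation: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum rectangle areas wherever recall increases.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP: plain mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# COCO-style thresholds: 0.50, 0.55, ..., 0.95 (AP is averaged over these too).
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

ap_per_class = {
    "cat": average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2),
    "dog": average_precision([0.95, 0.4], [True, True], 2),
}
print(ap_per_class, mean_average_precision(ap_per_class))
```

Raising the IoU threshold turns some true positives into false positives, which lowers AP. Averaging over all ten thresholds is why COCO mAP is a stricter number than VOC mAP at 0.5.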

Optical Flow

FLOW = how pixels MOVE between frames -- a velocity field for every pixel in the video

LUCAS-KANADE IS CLASSICAL -- RAFT (2020) IS THE MODERN DEEP LEARNING REFERENCE

Two-stream networks: spatial stream (RGB) plus temporal stream (optical flow) for action recognition

Optical flow: 2D vector (dx, dy) per pixel between consecutive frames showing apparent motion. Classical: Lucas-Kanade (local, least squares over a small window), Horn-Schunck (global smoothness). Deep learning: FlowNet, PWC-Net, RAFT (2020, state of the art at release and still a common reference). Applications: video compression (encode motion), action recognition, video stabilization, slow-motion generation (interpolate frames), object tracking.

Optical flow output

(dx, dy) velocity vector per pixel between frames

Lucas-Kanade

Classical local method -- assumes constant brightness and the same small motion within each window

RAFT

Influential deep optical flow method (2020) -- state of the art at release

Two-stream networks

RGB appearance + optical flow motion for action recognition
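The Lucas-Kanade idea can be sketched for a single window: assume brightness is constant and every pixel in the window shares one (u, v) motion, then solve Ix*u + Iy*v = -It in the least squares sense. The example below is a minimal, unoptimized version on a synthetic pattern shifted one pixel to the right. The function name, the window, and the test pattern are made up for illustration; practical implementations add image pyramids and iteration.

```python
def lucas_kanade_window(frame1, frame2, x_range, y_range):
    """Estimate one (u, v) flow vector for a window by least squares."""
    sxx = sxy = syy = sxt = syt = 0.0
    for y in y_range:
        for x in x_range:
            # Spatial gradients (central differences) and temporal difference.
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0
            it = frame2[y][x] - frame1[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it

    # Solve [[sxx, sxy], [sxy, syy]] @ [u, v] = [-sxt, -syt].
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        raise ValueError("gradients too uniform in this window (aperture problem)")
    u = (sxy * syt - syy * sxt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


size = 11
# Smooth pattern with varied gradients; frame2 is frame1 moved right by one pixel.
frame1 = [[x * x + y * y for x in range(size)] for y in range(size)]
frame2 = [[(x - 1) ** 2 + y * y for x in range(size)] for y in range(size)]

u, v = lucas_kanade_window(frame1, frame2, range(3, 8), range(3, 8))
print(round(u, 2), round(v, 2))  # close to (1, 0), not exact
```

The estimate is close to the true motion (1, 0) but not exact, because the method linearizes brightness changes and a one-pixel shift is already a fairly large motion for that approximation.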

Memory tricks that make computer vision and image AI click
