Robot AI -- Computer Vision

Memory tricks that make computer vision and image AI click

From pixel to prediction -- these memory tricks lock in CNNs, object detection, image segmentation, generative vision models, and the architectures that gave machines the ability to see.


Computer Vision

Memory Tricks

Proven Mnemonics & Acronyms — fast to learn, hard to forget.

CV Fundamentals
PIXELS to FEATURES to PREDICTIONS -- the three levels of what a CNN sees
EDGES THEN SHAPES THEN OBJECTS -- LEARNED HIERARCHICALLY FROM DATA
Before deep learning: hand-crafted SIFT and HOG features. After: learned automatically from data.
Early layers detect low-level features (edges, corners, color gradients). Middle layers detect textures, shapes, and patterns. Deep layers detect semantic concepts (faces, cars, dogs). This hierarchy loosely mirrors the processing stages of the human visual cortex. AlexNet (2012): first deep CNN to win the ImageNet challenge (ILSVRC), with a top-5 error of about 15% against roughly 26% for the runner-up -- the result that launched the deep learning revolution in computer vision.
Early layers
Edges, corners, color gradients -- low-level features
Middle layers
Textures, shapes, and patterns -- mid-level features
Deep layers
Semantic concepts: faces, cars, dogs -- high-level features
AlexNet 2012
First deep CNN win at ImageNet -- launched CV revolution
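To make "early layers detect edges" concrete, here is a minimal sketch of one convolution filter responding to a vertical edge. The Sobel kernel is hand-written for illustration (a CNN would learn its kernels from data), and the helper name conv2d_valid and the toy image are assumptions, not part of any library.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


# Hand-crafted Sobel kernel: responds where intensity changes from left to right.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Toy 5x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

edges = conv2d_valid(image, SOBEL_X)
for row in edges:
    print(row)  # each row is [0, 4, 4, 0]: strong response only at the edge
```

The response is zero in flat regions and large where the window straddles the dark-to-bright boundary, which is the kind of low-level feature the first layers of a CNN tend to pick up.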
Object Detection
YOLO -- You Only Look Once -- detect all objects in one single forward pass
CLASSIFICATION = WHAT, DETECTION = WHAT + WHERE, SEGMENTATION = WHAT + WHICH PIXELS
IoU = Intersection over Union -- measures overlap between predicted and ground truth box
Classification: what is in this image. Detection: WHAT and WHERE -- bounding box plus class label for each object. Segmentation: which exact pixels belong to each object. YOLO: one-stage detector, divides image into grid, predicts boxes and classes simultaneously -- real-time speed. Faster R-CNN: two-stage (propose regions, then classify them), typically slower but historically more accurate. IoU: area of overlap divided by area of union -- 0.5 is a common threshold for counting a detection as correct.
Classification
Single label for whole image -- is there a cat?
Detection
Bounding box (x,y,w,h) + class label per object
Segmentation
Pixel-level mask per object -- most precise
YOLO
One forward pass, real-time speed -- slight accuracy tradeoff
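The IoU definition above can be sketched in a few lines. This is an illustrative helper (the name iou_xywh is hypothetical) and it assumes (x, y) is the top-left corner of the box; some datasets use the box center instead, in which case the corners must be computed differently.

```python
def iou_xywh(box_a, box_b):
    """Intersection over Union for two boxes given as (x, y, w, h).

    Assumption: (x, y) is the top-left corner, w and h are width and height.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping rectangle (0 if the boxes do not overlap).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


predicted = (5, 0, 10, 10)
ground_truth = (0, 0, 10, 10)
score = iou_xywh(predicted, ground_truth)
print(round(score, 3))  # 0.333: overlap 50, union 150
print(score >= 0.5)     # False: below the common 0.5 threshold
```

Note how a box shifted by half its width already falls to an IoU of one third, which is why 0.5 is a less lenient threshold than it first sounds.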
U-Net Segmentation
U-NET looks like a U -- encoder compresses down, decoder expands back up with skip connections
ENCODER SHRINKS TO CAPTURE CONTEXT -- DECODER EXPANDS TO RECOVER SPATIAL DETAIL
Skip connections preserve fine spatial detail that is lost during max pooling
U-Net: contracting path (encoder) compresses spatial dimensions while increasing channels -- captures what is in the image. Expanding path (decoder) restores spatial resolution -- recovers where. Skip connections: copy encoder feature maps directly to corresponding decoder layer -- preserve fine spatial detail lost during downsampling. Dominant for medical image segmentation. SAM (Meta 2023): foundation model -- segment any object with a click.
Encoder path
Compress spatial dimensions -- capture semantic context
Decoder path
Restore spatial resolution -- recover precise locations
Skip connections
Encoder feature maps passed directly to decoder -- preserve detail
SAM (Segment Anything)
Meta 2023 -- foundation model for segmentation
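The encoder/decoder/skip-connection pattern is easiest to remember by tracing tensor shapes. The sketch below does only the shape bookkeeping, with no actual convolutions; the function name unet_shapes, the base width of 64 channels, and the depth of 3 are illustrative assumptions rather than a fixed specification.

```python
def unet_shapes(height, width, base_channels=64, depth=3):
    """Trace (channels, H, W) through a U-Net-style encoder and decoder.

    Assumes height and width are divisible by 2**depth.
    """
    encoder = []
    c, h, w = base_channels, height, width
    for _ in range(depth):
        encoder.append((c, h, w))        # saved for the skip connection
        c, h, w = c * 2, h // 2, w // 2  # pooling halves H and W, next block doubles channels
    bottleneck = (c, h, w)

    decoder = []
    for skip_c, skip_h, skip_w in reversed(encoder):
        h, w = h * 2, w * 2              # upsampling restores spatial resolution
        c = c // 2                       # up-convolution halves the channel count
        assert (h, w) == (skip_h, skip_w)  # skip and upsampled maps must align spatially
        decoder.append((c + skip_c, h, w))  # concatenate the skip along the channel axis
        c = skip_c                       # following convs reduce channels again
    return encoder, bottleneck, decoder


enc, bott, dec = unet_shapes(128, 128)
print("encoder:   ", enc)   # [(64, 128, 128), (128, 64, 64), (256, 32, 32)]
print("bottleneck:", bott)  # (512, 16, 16)
print("decoder:   ", dec)   # [(512, 32, 32), (256, 64, 64), (128, 128, 128)]
```

The decoder shapes list the channel count right after concatenation, which is where the fine detail from the encoder re-enters the network.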
Data Augmentation
FLIP, CROP, ROTATE, COLOR -- data augmentation creates bigger datasets for free
AUGMENTATION IS REGULARIZATION -- FORCES INVARIANCE TO IRRELEVANT VARIATIONS
A cat is still a cat when flipped, slightly rotated, or slightly brighter -- teach this
Geometric: horizontal/vertical flip, random crop, rotation, perspective warp. Color: brightness, contrast, saturation, hue jitter. Advanced: CutOut (mask random patches), MixUp (blend two images and labels), CutMix (paste patches from one image into another), AutoAugment (learned augmentation policy). Standard ImageNet recipe: RandomResizedCrop(224) + RandomHorizontalFlip + ColorJitter + Normalize.
Geometric augmentation
Flip, crop, rotate, perspective -- spatial invariance
Color augmentation
Brightness, contrast, saturation -- color invariance
MixUp
Blend two images and labels -- smooth decision boundaries
AutoAugment
Learned augmentation policy -- strong results, but the policy search is expensive
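A horizontal flip and MixUp are simple enough to sketch without any framework. The helpers below (hflip, mixup) are hypothetical names operating on nested lists; real pipelines would apply the same arithmetic to tensors. MixUp usually draws the blend weight from a Beta(alpha, alpha) distribution, and the alpha value of 0.2 here is an assumption for illustration.

```python
import random


def hflip(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images and their one-hot labels with weight lam in [0, 1]."""
    mixed_image = [
        [lam * pa + (1 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_label = [lam * ya + (1 - lam) * yb for ya, yb in zip(label_a, label_b)]
    return mixed_image, mixed_label


cat = [[1.0, 0.0], [1.0, 0.0]]   # toy 2x2 "cat" image, label [1, 0]
dog = [[0.0, 0.0], [1.0, 1.0]]   # toy 2x2 "dog" image, label [0, 1]

print(hflip(cat))                # [[0.0, 1.0], [0.0, 1.0]]

image, label = mixup(cat, [1.0, 0.0], dog, [0.0, 1.0], lam=0.25)
print(image)                     # [[0.25, 0.0], [1.0, 0.75]]
print(label)                     # [0.25, 0.75]: a soft label, 25% cat and 75% dog

# In training, lam is typically sampled per batch, for example:
lam = random.betavariate(0.2, 0.2)
```

The soft label is the key point: the model is asked to predict the blend ratio, which discourages overconfident decision boundaries.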
Vision Transformer
ViT -- split image into PATCHES, treat each patch like a TOKEN, run Transformer
AN IMAGE IS WORTH 16x16 WORDS -- THE PAPER THAT LAUNCHED VISION TRANSFORMERS
Swin Transformer: hierarchical ViT with shifted windows -- efficient for detection and segmentation
ViT (Dosovitskiy et al., 2020): split image into 16x16 patches, flatten each patch, project to an embedding, add position embeddings, run through a standard Transformer encoder. Global context from layer 1 -- unlike CNNs, which build context gradually as the receptive field grows. Needs more training data than CNNs (weaker built-in inductive bias for locality). Swin Transformer: hierarchical ViT with shifted windows, efficient for high-resolution detection and segmentation.
Patch embedding
16x16 pixel patches treated like word tokens
Position embedding
Required -- Transformer has no inherent sense of spatial order
Global attention
Every patch attends to every other from layer 1
Swin Transformer
Hierarchical + shifted windows -- efficient for dense prediction
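The "patches as tokens" step can be sketched as plain list slicing. This is a simplified illustration on a single-channel image (the helper name image_to_patches is hypothetical); a real ViT works on 3-channel images and follows this step with a learned linear projection and position embeddings, which are omitted here.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image into non-overlapping, flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + i][left + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Small example that is easy to check by eye: a 4x4 image cut into 2x2 patches.
tiny = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
print(image_to_patches(tiny, 2))  # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]

# ViT-style sizes: a 224x224 image with 16x16 patches.
blank = [[0] * 224 for _ in range(224)]
tokens = image_to_patches(blank, 16)
print(len(tokens), len(tokens[0]))  # 196 tokens, 256 values each (768 with 3 color channels)
```

The sequence length of 196 is what the Transformer encoder sees, which is why attention cost grows quickly with image resolution.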
Generative Vision
VAE encodes to SMOOTH space -- GAN has two fighters -- Diffusion REVERSES NOISE
THREE WAYS TO TEACH A MACHINE TO GENERATE REALISTIC IMAGES
Diffusion models now dominate -- more stable training, better diversity, easier text conditioning
VAE: encode image to distribution (mean and variance), sample, decode -- smooth latent space enables interpolation. GAN: Generator vs Discriminator, adversarial training, sharp results but mode collapse and instability. Diffusion: add noise over T steps, train to denoise, generate by starting from noise -- most stable training, best diversity, dominant for image generation (Stable Diffusion, DALL-E 3, Midjourney).
VAE
Smooth latent space, enables interpolation -- Stable Diffusion uses a VAE to compress images into its latent space
GAN
Generator vs Discriminator -- sharp but unstable, mode collapse risk
Diffusion
Most stable, best quality, easiest text conditioning -- dominant now
ControlNet
Adds spatial conditioning (pose, depth, edges) to diffusion models
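The "add noise over T steps" half of diffusion has a closed form, sketched below in the DDPM style: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The linear schedule from 1e-4 to 0.02 over 1000 steps follows the commonly cited DDPM defaults; the function names are illustrative, and the denoising network that would be trained to predict the noise is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: fraction of signal left at step t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: scale the clean signal down and mix in Gaussian noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise  # the noise is the training target for the denoiser


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

x0 = [1.0, -1.0, 0.5]                       # stand-in for pixel values
x_mid, _ = add_noise(x0, 500, alpha_bars, rng)
x_end, _ = add_noise(x0, 999, alpha_bars, rng)

print(alpha_bars[0])    # close to 1: almost all signal remains after one step
print(alpha_bars[-1])   # close to 0: the final step is nearly pure noise
```

Generation runs this process in reverse: start from pure noise and repeatedly apply the trained denoiser.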
Face Recognition
EMBED then COMPARE -- face recognition maps faces to vectors then finds nearest match
DETECTION, ALIGNMENT, EMBEDDING, COMPARISON -- FOUR-STEP PIPELINE
ArcFace: widely used, high-accuracy method -- additive angular margin loss for face embeddings
Pipeline: detect faces (MTCNN, RetinaFace), align using landmark points, encode the face into a compact vector (for example 128 dimensions in FaceNet, 512 in typical ArcFace models), compare via cosine similarity to known embeddings. Triplet loss: pull same-person embeddings together, push different people apart. Bias: higher error rates reported for dark-skinned women. The EU AI Act largely bans real-time biometric identification in public spaces, with narrow law-enforcement exceptions.
Detection
Find faces in image -- MTCNN, RetinaFace
Alignment
Normalize face orientation using landmark points
Embedding
CNN encodes identity into compact vector (ArcFace is a widely used choice)
Bias
Higher error rates for dark-skinned women -- documented for commercial facial analysis systems in the MIT Media Lab Gender Shades study (2018)
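The "embed then compare" step and the triplet loss can be sketched with toy vectors. The 3-dimensional embeddings, the names in the gallery, the 0.5 similarity threshold, and the 0.2 margin are all made-up values for illustration; a real system would get its embeddings from a trained network and tune the threshold on validation data.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify(query, gallery, threshold=0.5):
    """Return (name, score) of the most similar known face, or (None, score) if too dissimilar."""
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by at least the margin."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}

print(identify([0.9, 0.1, 0.0], gallery))   # matches "alice" with similarity above 0.99
print(identify([0.0, 0.0, 1.0], gallery))   # (None, 0.0): unknown face

# Well-separated triplet: loss is 0, nothing left to learn from it.
print(triplet_loss([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]))
```

The threshold is what separates "nearest match" from "no match", and it sets the trade-off between false accepts and false rejects.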
Autonomous Driving
SENSE then PERCEIVE then PLAN then ACT -- the four-stage AV pipeline
CAMERA, LIDAR, RADAR, AND SENSOR FUSION -- THE SENSING DEBATE
SAE Level 2 is Tesla Autopilot -- human still monitors and is responsible
SAE Levels: L0 (no automation) to L5 (full autonomy in all conditions). L2: partial automation, human must monitor at all times (Tesla Autopilot). L4: high automation within a limited geofenced area (Waymo One). Camera-only (Tesla) vs sensor fusion (Waymo). LiDAR: precise 3D depth, expensive. Camera: cheap, texture-rich, needs depth estimation. Radar: works in all weather, measures velocity directly. Long tail problem: edge cases too rare to appear in training data still occur on the road -- a major obstacle to reliable autonomous driving.
SAE L2
Partial automation -- human monitors and is legally responsible
SAE L4
High automation -- limited geofenced area, Waymo in Phoenix and SF
Camera-only (Tesla)
Cost reduction -- camera provides texture but needs depth estimation
Long tail problem
Rare edge cases not in training data -- major unsolved challenge
Medical Imaging
READ the scan better than a doctor -- but only when told exactly what to look for
AI AUGMENTS RADIOLOGISTS -- AI PLUS PHYSICIAN OUTPERFORMS EITHER ALONE
Grad-CAM highlights which image regions most influenced the prediction -- important for clinical trust
Proven: diabetic retinopathy screening (performance comparable to ophthalmologists), CheXNet pneumonia detection on chest X-rays, lymph node metastasis detection. Challenges: distribution shift (performance drops at hospitals not represented in the training data), class imbalance (rare diseases), regulatory clearance (in the US, FDA review, often through the 510(k) pathway). Grad-CAM: highlights influential image regions -- helps clinicians check what the model relied on. 3D U-Net: standard for CT and MRI volumetric segmentation.
Diabetic retinopathy
Google AI matches ophthalmologist accuracy
CheXNet
Stanford model reported to exceed average radiologist performance on pneumonia detection from chest X-rays
Distribution shift
Trained on teaching hospital data -- fails at rural clinics
Grad-CAM
Highlights influential image regions -- important for clinician trust
CV Metrics
mAP -- mean Average Precision -- the standard metric for object detection
AP = AREA UNDER PRECISION-RECALL CURVE FOR ONE CLASS
COCO mAP averages over IoU thresholds 0.5 to 0.95 in steps of 0.05 -- much harder than Pascal VOC at 0.5 only
Classification: Top-1 accuracy, Top-5 accuracy. Object Detection: mAP -- compute AP (area under precision-recall curve) for each class, average across classes. COCO mAP: also average over IoU thresholds 0.5 to 0.95 in steps of 0.05 -- much harder than Pascal VOC (IoU=0.5 only). Segmentation: mIoU (mean Intersection over Union across all semantic classes). FPS: frames per second -- 30+ FPS is the usual bar for real-time.
Top-1 accuracy
Is the top prediction the correct class?
mAP
Average AP across all classes and IoU thresholds
COCO vs VOC
COCO: IoU 0.5-0.95 (hard). VOC: IoU=0.5 only (easier).
mIoU
Mean Intersection over Union -- standard for segmentation
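AP as "area under the precision-recall curve" can be sketched for a single class as follows. This assumes each detection has already been matched against ground truth at some IoU threshold and flagged as a true or false positive; the function name average_precision is hypothetical, and the all-point interpolation used here is one common convention (benchmarks differ in the details).

```python
def average_precision(detections, num_ground_truth):
    """AP for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of real objects of this class in the dataset.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    true_pos = false_pos = 0
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)

    # Interpolate: precision at each point becomes the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas under the interpolated curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Three detections for a class with two real objects: hit, false alarm, hit.
cat_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
dog_ap = average_precision([(0.9, True), (0.6, True)], num_ground_truth=2)

print(round(cat_ap, 3))                 # 0.833
print(dog_ap)                           # 1.0
print(round((cat_ap + dog_ap) / 2, 3))  # mAP over the two classes: 0.917
```

A COCO-style score would repeat this at each IoU threshold from 0.5 to 0.95 and average the results.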
Optical Flow
FLOW = how pixels MOVE between frames -- a velocity field for every pixel in the video
LUCAS-KANADE IS CLASSICAL -- RAFT IS THE MODERN DEEP LEARNING BASELINE
Two-stream networks: spatial stream (RGB) plus temporal stream (optical flow) for action recognition
Optical flow: 2D vector (dx, dy) per pixel between consecutive frames showing apparent motion. Classical: Lucas-Kanade (local: assumes constant flow within a small window), Horn-Schunck (global smoothness constraint). Deep learning: FlowNet, PWC-Net, RAFT (2020, a strong and widely used baseline). Applications: video compression (encode motion), action recognition, video stabilization, slow-motion generation (interpolate frames), object tracking.
Optical flow output
(dx, dy) velocity vector per pixel between frames
Lucas-Kanade
Classical local patch matching -- assumes constant brightness
RAFT
Deep optical flow estimation (2020) -- strong, widely used baseline
Two-stream networks
RGB appearance + optical flow motion for action recognition
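Lucas-Kanade for a single window reduces to a 2x2 least-squares problem built from image gradients, under the brightness-constancy assumption Ix*u + Iy*v = -It. The sketch below uses a synthetic smooth pattern shifted by a known sub-pixel amount so the estimate can be compared with the truth; the pattern, frame size, and function names are illustrative assumptions, and real implementations add image pyramids and iteration to handle larger motion.

```python
import math


def pattern(x, y):
    """Smooth synthetic texture with variation in both x and y."""
    return math.sin(0.5 * x) + math.cos(0.4 * y)


def make_frame(size, shift_x=0.0, shift_y=0.0):
    """Render the pattern moved by (shift_x, shift_y) pixels; indexed as frame[y][x]."""
    return [[pattern(x - shift_x, y - shift_y) for x in range(size)] for y in range(size)]


def lucas_kanade_window(frame1, frame2, cx, cy, radius):
    """Estimate one (u, v) flow vector for the window centered at (cx, cy)."""
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0  # spatial gradient in x
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0  # spatial gradient in y
            it = frame2[y][x] - frame1[y][x]                  # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("window has too little texture to determine the flow")
    # Solve [[sxx, sxy], [sxy, syy]] @ [u, v] = [-sxt, -syt] with Cramer's rule.
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


frame1 = make_frame(11)
frame2 = make_frame(11, shift_x=0.3, shift_y=-0.2)  # true motion: (0.3, -0.2) pixels
u, v = lucas_kanade_window(frame1, frame2, cx=5, cy=5, radius=4)
print(u, v)  # approximately (0.3, -0.2); not exact because gradients are finite differences
```

The determinant check is the aperture problem in miniature: a window with gradients in only one direction cannot pin down both components of the motion.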
© 2026 MemoryTricks 🧠