Module 7: Vision

“Machines that See”

Research Document for DATA 201 Course Development


Table of Contents

  1. Introduction
  2. Part I: The Quest for Machine Vision
  3. Part II: Classical Computer Vision
  4. Part III: The Deep Learning Revolution
  5. Part IV: Modern Applications
  6. DEEP DIVE: ImageNet and the 2012 Moment
  7. Lecture Plan and Hands-On Exercise
  8. Recommended Resources
  9. References

Introduction

Vision seems effortless to humans—we recognize faces, read text, navigate spaces without conscious effort. But teaching computers to see has been one of AI’s greatest challenges.

This module traces the quest to make machines see, from the first attempts through classical computer vision to the deep learning revolution and its applications.

Core Question: What does it mean for a machine to “see”?


Part I: The Quest for Machine Vision

The Summer Vision Project (1966)

In 1966, MIT professor Marvin Minsky assigned a summer project to an undergraduate:

“Connect a camera to a computer and get the computer to describe what it sees.”

The project was expected to take one summer. Nearly 60 years later, we are still working on it.

Why Vision Is Hard

What seems “simple” to humans hides enormous computational complexity.

A 3-year-old effortlessly recognizes a cat. This took AI research 50+ years.


David Marr’s Computational Vision (1982)

David Marr, a neuroscientist at MIT, proposed a theory of how vision works in his book Vision (1982).

Three Levels of Analysis

  1. Computational: What problem is vision solving?
  2. Algorithmic: What steps solve the problem?
  3. Implementation: How does the brain/computer do it?

Marr’s Stages of Vision

  1. Primal sketch: Edges, textures, basic shapes
  2. 2.5D sketch: Surfaces, depth, orientation
  3. 3D model: Full 3D representation of objects

Influence

Marr died at 35, but his framework influenced computer vision for decades. His book remains required reading.

Limitation

Marr’s approach was top-down: figure out the theory, then implement it. Modern deep learning is bottom-up: learn from data.


The AI Winter and Computer Vision

Expert Systems Era (1980s)

Like expert systems elsewhere in AI, computer vision tried to encode human visual knowledge as explicit, hand-written rules.

The Problem

Real-world images vary endlessly in lighting, viewpoint, scale, occlusion, and background clutter.

No amount of rules could handle this variability.


Part II: Classical Computer Vision

Edge Detection: Sobel, Canny

Finding Edges

Edges are boundaries between regions—changes in intensity.

Sobel Operator (1968): Detect horizontal and vertical gradients using 3x3 filters.

Canny Edge Detector (1986): A multi-stage pipeline still in wide use:

  1. Smooth the image with a Gaussian filter
  2. Compute intensity gradients
  3. Thin edges with non-maximum suppression
  4. Link edges with hysteresis thresholding
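
To make the 3x3-filter idea concrete, here is a minimal sketch of the Sobel step using NumPy and SciPy; the kernel values are the standard Sobel ones, and OpenCV’s cv2.Sobel and cv2.Canny provide production implementations.

import numpy as np
from scipy.signal import convolve2d

# The two 3x3 Sobel kernels: horizontal and vertical gradient estimates
Kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
Ky = Kx.T

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D grayscale array (larger = stronger edge)."""
    gx = convolve2d(gray, Kx, mode='same', boundary='symm')
    gy = convolve2d(gray, Ky, mode='same', boundary='symm')
    return np.hypot(gx, gy)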

Why Edges Matter

Edges were thought to be the fundamental building blocks of vision: find the edges, and recognition would follow.

But edges alone don’t tell you what you’re looking at.


Feature Detection: SIFT and HOG

SIFT - Scale-Invariant Feature Transform (1999)

David Lowe at UBC developed SIFT to find “keypoints” that are invariant to scale and rotation and robust to changes in illumination and viewpoint.

HOG - Histogram of Oriented Gradients (2005)

Dalal and Triggs developed HOG for pedestrian detection: it describes local shape with histograms of gradient orientations computed over a dense grid of cells.

HOG was state-of-the-art for object detection until deep learning.

The Approach

  1. Extract hand-crafted features (SIFT, HOG)
  2. Feed features to classifier (SVM)
  3. Train on labeled data

This worked, but required engineering the right features for each problem.
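
A minimal sketch of that three-step pipeline, using scikit-image’s hog and scikit-learn’s LinearSVC; random noise stands in for real images here so the example runs end to end.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Stand-in data: 100 "images" of 64x64 random noise with random 0/1 labels
rng = np.random.default_rng(0)
X_imgs = rng.random((100, 64, 64))
y = rng.integers(0, 2, 100)

# 1. Extract hand-crafted features
X_feat = np.array([hog(im, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2)) for im in X_imgs])

# 2-3. Feed features to a classifier and train on labeled data
clf = LinearSVC().fit(X_feat, y)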


The MNIST Dataset (1998)

Yann LeCun and colleagues created MNIST: 70,000 handwritten digits (60,000 for training, 10,000 for testing), each a 28x28 grayscale image.

Why MNIST Matters

The First CNNs: LeNet

LeCun developed LeNet-5 to recognize MNIST digits using Convolutional Neural Networks, which learn their own filters from data instead of relying on hand-crafted features.

LeNet achieved 99%+ accuracy on MNIST in 1998.

“MNIST is solved”

Today, even simple models achieve >99.5% on MNIST. It’s often called “the hello world of deep learning.”
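
To see why, here is a minimal Keras sketch; even this small fully connected network typically reaches about 98% test accuracy within a few epochs.

from tensorflow import keras

# Load MNIST: 60,000 training and 10,000 test images of 28x28 grayscale digits
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))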


Part III: The Deep Learning Revolution

The Pieces Come Together

GPUs for Computing (2006-2010)

NVIDIA GPUs, designed for video games, were repurposed for neural networks: their massively parallel architecture suits the matrix arithmetic of training, yielding order-of-magnitude speedups over CPUs.

Large Datasets

The web made it possible to collect labeled images at unprecedented scale; ImageNet (see the Deep Dive below) is the defining example.

Algorithmic Improvements

ReLU activations, dropout, and data augmentation, the same ingredients later credited for AlexNet’s win, made deep networks practical to train.


AlexNet: The 2012 Moment

In the 2012 ImageNet competition, Alex Krizhevsky’s neural network (AlexNet) achieved a top-5 error of 15.3%; the runner-up managed only 26.2%.

The gap was unprecedented. Deep learning had arrived.

Architecture

Eight learned layers (five convolutional, three fully connected) with roughly 60 million parameters, trained on two GPUs.

Impact

After AlexNet, virtually every competitive ImageNet entry was a deep convolutional network, and the broader field pivoted to deep learning almost overnight.


Deeper and Deeper: VGG, ResNet

VGG (2014)

Oxford’s VGG network showed that deeper is better: stacks of small 3x3 convolutions reaching 16-19 layers cut the previous year’s error nearly in half.

The Degradation Problem

But simply adding layers stopped working. Very deep networks trained poorly—not from overfitting, but from optimization difficulties.

ResNet (2015)

Microsoft’s ResNet introduced skip connections:

output = F(x) + x

If the layer can’t learn something useful, it can learn F(x) = 0, preserving the input.
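
A minimal Keras sketch of the idea; real ResNet blocks also use batch normalization, and this version assumes the input already has `filters` channels so the identity shortcut lines up.

from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters):
    """Simplified residual block: output = F(x) + x."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.Add()([y, shortcut])  # the skip connection
    return layers.Activation('relu')(y)

inputs = keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)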

The Result

Networks over 100 layers deep became trainable. ResNet-152 won ILSVRC 2015 with a 3.6% top-5 error, below the estimated human level.


Part IV: Modern Applications

Medical Imaging

The Promise

An “AI radiologist” could screen images around the clock, flag urgent cases for human review, and extend specialist-level reading to underserved regions.

Diabetic Retinopathy Detection

Google developed a system to detect diabetic retinopathy from retinal scans; the 2016 study in JAMA reported performance comparable to board-certified ophthalmologists.

Challenges

Regulatory approval, legal liability, integration into clinical workflows, and generalization across hospitals, scanners, and patient populations all remain open problems.


Self-Driving Cars

DARPA Grand Challenge (2004-2007)

The US military sponsored competitions for autonomous vehicles: no vehicle finished the 2004 desert course, Stanford’s “Stanley” won in 2005, and the 2007 Urban Challenge moved the race onto city streets.

Modern Autonomous Vehicles

Tesla, Waymo, and others use deep networks over camera and other sensor data to detect lanes, vehicles, pedestrians, and signs.

The Trolley Problem Goes Real

How should autonomous vehicles handle impossible situations? These philosophical questions become engineering decisions.


Face Recognition

Eigenfaces (1991)

Turk and Pentland: represent faces as combinations of “eigenfaces” (PCA components).
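
A minimal sketch of the idea with scikit-learn’s PCA; random noise stands in for a real face dataset here.

import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 200 "faces", each a flattened 64x64 grayscale image
rng = np.random.default_rng(0)
faces = rng.random((200, 64 * 64))

pca = PCA(n_components=50)
weights = pca.fit_transform(faces)       # each face as 50 eigenface coefficients
eigenfaces = pca.components_             # the "eigenfaces", shape (50, 4096)
approx = pca.inverse_transform(weights)  # faces rebuilt from the coefficients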

DeepFace (2014)

Facebook’s DeepFace achieved near-human performance: 97.35% accuracy on the Labeled Faces in the Wild benchmark, against a human level of roughly 97.5%.

Ethical Concerns

Face recognition raises serious issues: mass surveillance, loss of anonymity, consent, and documented accuracy disparities across demographic groups.

Several cities have banned government use of facial recognition.


DEEP DIVE: ImageNet and the 2012 Moment

The Vision

In 2006, Fei-Fei Li, a young professor at Princeton (later Stanford), had an audacious idea: create a dataset with every object in the world.

The Problem

Computer vision was stuck. Researchers used tiny datasets: Caltech-101, for example, offered about 9,000 images across 101 categories.

Li realized: data was the bottleneck.

Building ImageNet

The WordNet Foundation

ImageNet organized images according to WordNet, a lexical database that arranges English nouns into a hierarchy of synonym sets (“synsets”).

The Scale

Over 14 million labeled images spanning more than 20,000 categories.

Amazon Mechanical Turk

How do you label 14 million images? Li’s insight: use the crowd.

Amazon Mechanical Turk: A platform where workers complete small tasks for small payments.

ImageNet workers viewed candidate images gathered by web search and verified whether each matched its proposed label.

Cost and Time

Labeling took roughly two and a half years and drew on tens of thousands of workers around the world.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

Starting in 2010, ImageNet hosted an annual competition: classify 1.2 million training images into 1,000 categories.

The Metrics

Top-5 error: Did the correct label appear in the model’s top 5 guesses?
Top-1 error: Was the model’s single top guess correct?
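
A small NumPy sketch of the top-k computation, where scores is an (n_examples, n_classes) array of model outputs and labels holds the true class indices:

import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]     # indices of the k largest scores
    hits = (topk == labels[:, None]).any(axis=1)  # is the true label among them?
    return 1.0 - hits.mean()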

Before 2012

Best systems used hand-crafted features (SIFT, HOG) plus classifiers (SVM); top-5 error hovered around 26-28%, improving by only a couple of points per year.

AlexNet: The Breakthrough

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted a deep convolutional neural network.

Results

AlexNet’s top-5 error was 15.3%; the second-place entry’s was 26.2%.

Why It Won

  1. Deep architecture: 8 layers, learned hierarchical features
  2. GPU training: Used two NVIDIA GTX 580 GPUs
  3. ReLU activation: Faster training than sigmoid/tanh
  4. Dropout: Prevented overfitting
  5. Data augmentation: Artificially increased training data

The Paper

“ImageNet Classification with Deep Convolutional Neural Networks” became one of the most cited papers in the history of computer science.

The Aftermath

2013-2017: Deeper Networks Win

Year   Winner            Top-5 Error   Depth (layers)
2012   AlexNet           15.3%         8
2013   ZFNet             11.2%         8
2014   VGG / GoogLeNet   6.7%          19 / 22
2015   ResNet            3.6%          152
2017   SENet             2.3%          154

Superhuman Performance

By 2015, ResNet surpassed estimated human performance (~5% error).

The Competition Ends

In 2017, ImageNet discontinued the classification challenge. The problem was “solved” (for this benchmark).

The Controversies

Dataset Bias

ImageNet’s images are scraped from the web, labeled in English, and skewed toward Western contexts and photographic conventions.

Performance on ImageNet doesn’t guarantee real-world performance.

The Mechanical Turk Workers

The dataset was built on low-wage crowd labor, paid cents per batch of labels, raising questions about the invisible human work behind AI benchmarks.

Problematic Categories

ImageNet included some troubling categories: the “person” subtree contained offensive and derogatory labels, many of which were removed in a 2019 cleanup.

The Legacy

Positive

ImageNet demonstrated that data, as much as algorithms, drives progress, and it catalyzed the deep learning era.

Complicated

The same project normalized scraping web images without consent and depended on low-paid crowd labor.

Fei-Fei Li’s Reflection

Li has spoken about wanting AI development to be more inclusive and ethical. She later co-founded AI4ALL to diversify the field.

The Data Journey


Lecture Plan and Hands-On Exercise

Lecture Plan: “Teaching Machines to See” (75-90 minutes)

Part 1: Why Vision Is Hard (15 min)

Opening: Show an image and ask students what they see.

The 1966 Summer Project: Minsky’s optimism, 60 years later

Part 2: From Pixels to Features (20 min)

What is an image to a computer?
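
A quick in-class demo; the filename 'example.jpg' is a placeholder for any photo on disk.

import numpy as np
from PIL import Image

img = Image.open('example.jpg')
arr = np.asarray(img)
print(arr.shape)  # e.g. (height, width, 3): a grid of RGB pixels
print(arr[0, 0])  # the top-left pixel: three intensities in 0..255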

Edge detection: live demo of Sobel and Canny on a sample image (Part II covers the details)

The classical approach:

  1. Extract features (SIFT, HOG)
  2. Train classifier (SVM)
  3. Predict

Part 3: The ImageNet Story (20 min)

Show the graph: Error rates dropping after 2012

Part 4: How CNNs Work (15 min)

Convolution intuition: a small filter slides across the image, producing at each position a weighted sum that lights up where a matching local pattern appears (a naive implementation is sketched below)

Show visualizations of what each layer “sees”
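
A naive convolution to show during this segment; the kernel below is a simple vertical-edge detector, and the doubly nested loop is exactly what Conv2D layers compute, vastly optimized.

import numpy as np

def convolve2d_naive(image, kernel):
    """Slide the kernel over the image, taking a weighted sum at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge kernel responds where intensity changes left to right
image = np.zeros((6, 6)); image[:, 3:] = 1.0
kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])
print(convolve2d_naive(image, kernel))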

Part 5: Applications and Ethics (10 min)


Hands-On Exercise: “Build a Cat vs. Dog Classifier”

Objective

Train a convolutional neural network to classify images.

Duration

2-3 hours

Setup

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import numpy as np
import os

# Download and extract the cats vs. dogs dataset
url = "https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip"
path = keras.utils.get_file("cats_and_dogs.zip", origin=url, extract=True)
# get_file returns the zip's path; the archive extracts beside it as cats_and_dogs_filtered
base_dir = os.path.join(os.path.dirname(path), "cats_and_dogs_filtered")

train_dir = f"{base_dir}/train"
val_dir = f"{base_dir}/validation"

Task 1: Explore the Data (20 min)

import os
from PIL import Image

# Count images
train_cats = len(os.listdir(f"{train_dir}/cats"))
train_dogs = len(os.listdir(f"{train_dir}/dogs"))
print(f"Training: {train_cats} cats, {train_dogs} dogs")

# View some examples
fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for i, animal in enumerate(['cats', 'dogs']):
    files = os.listdir(f"{train_dir}/{animal}")[:4]
    for j, f in enumerate(files):
        img = Image.open(f"{train_dir}/{animal}/{f}")
        axes[i, j].imshow(img)
        axes[i, j].axis('off')
        axes[i, j].set_title(animal)
plt.tight_layout()
plt.show()

Questions:

  1. How many images are in each class? Are the classes balanced?
  2. How do the images vary in size, pose, and background?

Task 2: Build Data Pipeline (20 min)

# Create data generators with augmentation
train_datagen = keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

val_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)

val_generator = val_datagen.flow_from_directory(
    val_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)

Task 3: Build a Simple CNN (30 min)

model = keras.Sequential([
    # First conv block
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),

    # Second conv block
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Third conv block
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Dense layers
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.summary()

Task 4: Train the Model (30 min)

history = model.fit(
    train_generator,
    epochs=15,
    validation_data=val_generator
)

# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(history.history['accuracy'], label='Train')
ax1.plot(history.history['val_accuracy'], label='Validation')
ax1.set_title('Accuracy')
ax1.legend()

ax2.plot(history.history['loss'], label='Train')
ax2.plot(history.history['val_loss'], label='Validation')
ax2.set_title('Loss')
ax2.legend()

plt.show()

Task 5: Transfer Learning (30 min)

# Use pre-trained VGG16
base_model = keras.applications.VGG16(
    weights='imagenet',
    include_top=False,
    input_shape=(150, 150, 3)
)

# Freeze base model
base_model.trainable = False

# Add our classifier
model_transfer = keras.Sequential([
    base_model,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

model_transfer.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train only the top layers
history_transfer = model_transfer.fit(
    train_generator,
    epochs=5,
    validation_data=val_generator
)

Compare: How does transfer learning compare to training from scratch?
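
One quick, quantitative comparison, reusing the history objects from Tasks 4 and 5:

# Final validation accuracy of the two models
print(f"From scratch:     {history.history['val_accuracy'][-1]:.3f}")
print(f"Transfer (VGG16): {history_transfer.history['val_accuracy'][-1]:.3f}")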

Task 6: Visualize What the Model Sees (20 min)

# Get a sample image
img_path = f"{val_dir}/cats/cat.2000.jpg"
img = keras.preprocessing.image.load_img(img_path, target_size=(150, 150))
x = keras.preprocessing.image.img_to_array(img) / 255.0
x = np.expand_dims(x, axis=0)

# Build a model that returns the activations of the first six layers
layer_outputs = [layer.output for layer in model.layers[:6]]
activation_model = keras.Model(inputs=model.input, outputs=layer_outputs)
activations = activation_model.predict(x)

# Plot the first conv layer's activation maps (not the filters themselves)
first_layer_activation = activations[0]
plt.figure(figsize=(15, 5))
for i in range(min(8, first_layer_activation.shape[-1])):
    plt.subplot(2, 4, i+1)
    plt.imshow(first_layer_activation[0, :, :, i], cmap='viridis')
    plt.axis('off')
plt.suptitle('First Conv Layer Activations')
plt.show()

Recommended Resources

Books

Online Courses

Tools

Videos


References

Historical

ImageNet and Deep Learning

Applications


Document compiled for SCDS DATA 201: Introduction to Data Science I, Module 7: Vision, “Machines that See”