Lecture 2: How Machines See - Modern Machine Learning

Module: Vision Lecture

Course: Introduction to Data Science and Computing
Instructor: Prof. Joseph Bakarji
School of Data Science and Computing, AUB


From Hand-Coded Features to Learned Representations

Last lecture: We engineered features (edges, spikes, shapes) manually.
Today: Let machines learn the best features from data.

Learning Objectives

  1. Understand dimensionality reduction (PCA, t-SNE)
  2. Implement autoencoders for non-linear feature learning
  3. Visualize learned representations in latent space
  4. Apply pre-trained deep learning models
  5. Understand the evolution to modern vision AI

Core Idea

The right representation makes classification trivial.
Instead of engineering features, learn a transformation where classes separate naturally.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

import warnings
warnings.filterwarnings('ignore')

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)

# Configure matplotlib
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11

# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("Libraries loaded successfully!")
Using device: cpu
Libraries loaded successfully!

Part 1: Dimensionality Reduction

The Problem

  • Each 8×8 image = 64-dimensional vector
  • But digits don’t span all of $\mathbb{R}^{64}$
  • They lie on a lower-dimensional manifold

The Solution

Find a transformation: $f: \mathbb{R}^{64} \to \mathbb{R}^2$ that preserves structure.

Goal: Project to 2D for visualization and see if classes separate.

# Load data
digits = datasets.load_digits()
X = digits.data / 16.0  # Normalize to [0, 1]
y = digits.target
images = digits.images

print(f"Dataset: {X.shape[0]} images, {X.shape[1]} features (pixels)")
print(f"Goal: Reduce from {X.shape[1]} dimensions to 2 dimensions")
Dataset: 1797 images, 64 features (pixels)
Goal: Reduce from 64 dimensions to 2 dimensions

1.1 Principal Component Analysis (PCA)

PCA finds directions of maximum variance.

Mathematical Formulation

  1. Center the data: $\tilde{X} = X - \mu$
  2. Compute covariance: $\Sigma = \frac{1}{n} \tilde{X}^T \tilde{X}$
  3. Eigendecomposition: $\Sigma v_i = \lambda_i v_i$
  4. Project: $Z = \tilde{X} V_k$ where $V_k$ contains top $k$ eigenvectors

Key property: PCA is linear - it finds the best linear projection.
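
To ground these four steps, here is a from-scratch sketch, checked against scikit-learn (a minimal version; eigenvector signs may differ between the two, which is why the check compares absolute values):

# PCA from scratch, following the four steps above
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data / 16.0

X_c = X - X.mean(axis=0)                    # 1. Center the data
Sigma = (X_c.T @ X_c) / len(X_c)            # 2. Covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)    # 3. Eigendecomposition (Sigma is symmetric)
order = np.argsort(eigvals)[::-1]           #    sort eigenvalues descending
V_k = eigvecs[:, order[:2]]                 #    keep top-2 eigenvectors
Z = X_c @ V_k                               # 4. Project to 2D

# Should agree with sklearn up to the sign of each component
Z_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn), atol=1e-6))  # True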

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Visualize
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', 
                     alpha=0.7, edgecolors='k', linewidth=0.5, s=50)
plt.colorbar(scatter, ticks=range(10), label='Digit')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)', fontsize=12)
plt.title('PCA: 64D → 2D Projection', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n✅ PCA projection complete!")
print(f"   - Explained variance (2 components): {pca.explained_variance_ratio_.sum()*100:.1f}%")
print(f"   - PC1 explains {pca.explained_variance_ratio_[0]*100:.1f}% of variance")
print(f"   - PC2 explains {pca.explained_variance_ratio_[1]*100:.1f}% of variance")

[Figure: PCA 64D → 2D projection of the digits, colored by digit class]

✅ PCA projection complete!
   - Explained variance (2 components): 28.5%
   - PC1 explains 14.9% of variance
   - PC2 explains 13.6% of variance

Understanding PCA Components

What do the principal components actually look like?

# Visualize top principal components as images
n_components = 8
pca_full = PCA(n_components=n_components)
pca_full.fit(X)

fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    component = pca_full.components_[i].reshape(8, 8)
    ax.imshow(component, cmap='RdBu_r')
    ax.set_title(f'PC{i+1} ({pca_full.explained_variance_ratio_[i]*100:.1f}%)')
    ax.axis('off')

plt.suptitle('Top 8 Principal Components', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n🔍 These are the 'basis patterns' PCA discovered.")
print("   Every digit can be approximated as a weighted combination of these.")

[Figure: Top 8 principal components rendered as 8×8 images]

🔍 These are the 'basis patterns' PCA discovered.
   Every digit can be approximated as a weighted combination of these.
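
To see this claim in action, here is a quick sketch (reusing X, plt, and PCA from above) that rebuilds a single digit from progressively more components via inverse_transform:

# Rebuild one digit from its top-k component weights
fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for ax, k in zip(axes, [1, 2, 8, 32, 64]):
    pca_k = PCA(n_components=k).fit(X)
    x_hat = pca_k.inverse_transform(pca_k.transform(X[:1]))  # compress, then decompress
    ax.imshow(x_hat.reshape(8, 8), cmap='gray')
    ax.set_title(f'k = {k}')
    ax.axis('off')
plt.suptitle('One digit rebuilt from k principal components')
plt.tight_layout()
plt.show()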

Classification in PCA Space

Question: Is classification easier in the 2D PCA space?

# Split data
X_train_pca, X_test_pca, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42, stratify=y
)

# Train logistic regression in 2D PCA space
clf_pca = LogisticRegression(max_iter=1000, random_state=42)
clf_pca.fit(X_train_pca, y_train)
y_pred_pca = clf_pca.predict(X_test_pca)
acc_pca = accuracy_score(y_test, y_pred_pca)

# Compare to original 64D space (same random_state and stratify give the
# identical row split, so y_train / y_test from above can be reused)
X_train_orig, X_test_orig, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
clf_orig = LogisticRegression(max_iter=1000, random_state=42)
clf_orig.fit(X_train_orig, y_train)
y_pred_orig = clf_orig.predict(X_test_orig)
acc_orig = accuracy_score(y_test, y_pred_orig)

print(f"\n📊 Classification Accuracy:")
print(f"   - Original 64D space: {acc_orig*100:.2f}%")
print(f"   - PCA 2D space: {acc_pca*100:.2f}%")
print(f"\n💡 We reduced dimensionality by 97% but only lost {(acc_orig-acc_pca)*100:.1f}% accuracy!")
📊 Classification Accuracy:
   - Original 64D space: 96.67%
   - PCA 2D space: 61.85%

💡 We reduced dimensionality by 97%, but accuracy dropped by 34.8 percentage points: 2 dimensions keep too little information for accurate classification.

1.2 t-SNE: Non-Linear Dimensionality Reduction

PCA is linear - it can’t capture non-linear structure.

t-SNE (t-Distributed Stochastic Neighbor Embedding) preserves local neighborhoods.

Intuition

  • In high-D: Compute pairwise similarities $P_{ij}$
  • In low-D: Compute pairwise similarities $Q_{ij}$
  • Minimize: $\text{KL}(P \parallel Q) = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{Q_{ij}}$

Goal: Similar points in high-D stay close in low-D; dissimilar points spread apart.
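
A toy sketch of these quantities (not a full implementation: real t-SNE tunes a per-point Gaussian bandwidth to match the chosen perplexity, then optimizes the low-D coordinates by gradient descent; here we use one global bandwidth and random points):

# The quantities t-SNE works with, on 5 toy points
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_hi = rng.normal(size=(5, 64))   # 5 points in high-D
Y_lo = rng.normal(size=(5, 2))    # a candidate 2D embedding

# High-D similarities: Gaussian kernel, normalized into a distribution
D_hi = squareform(pdist(X_hi, 'sqeuclidean'))
P = np.exp(-D_hi / D_hi.mean())   # one global bandwidth (t-SNE tunes per point)
np.fill_diagonal(P, 0)
P /= P.sum()

# Low-D similarities: Student-t kernel (heavy tails fight crowding)
D_lo = squareform(pdist(Y_lo, 'sqeuclidean'))
Q = 1.0 / (1.0 + D_lo)
np.fill_diagonal(Q, 0)
Q /= Q.sum()

# The KL divergence t-SNE minimizes by moving the low-D points
mask = ~np.eye(5, dtype=bool)
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(f"KL(P || Q) = {kl:.4f}")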

# Apply t-SNE (use subset for speed)
n_samples = 1000
indices = np.random.choice(len(X), n_samples, replace=False)
X_subset = X[indices]
y_subset = y[indices]

print("Running t-SNE (this may take a minute)...")
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_subset)

# Visualize both PCA and t-SNE side by side
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# PCA (on same subset)
X_pca_subset = pca.transform(X_subset)
scatter1 = axes[0].scatter(X_pca_subset[:, 0], X_pca_subset[:, 1], 
                          c=y_subset, cmap='tab10', alpha=0.7, 
                          edgecolors='k', linewidth=0.5, s=50)
axes[0].set_xlabel('PC1', fontsize=12)
axes[0].set_ylabel('PC2', fontsize=12)
axes[0].set_title('PCA (Linear)', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# t-SNE
scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], 
                          c=y_subset, cmap='tab10', alpha=0.7,
                          edgecolors='k', linewidth=0.5, s=50)
axes[1].set_xlabel('t-SNE Dim 1', fontsize=12)
axes[1].set_ylabel('t-SNE Dim 2', fontsize=12)
axes[1].set_title('t-SNE (Non-Linear)', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# Add colorbar
fig.colorbar(scatter2, ax=axes, ticks=range(10), label='Digit')
plt.tight_layout()
plt.show()

print("\n🔍 Observation: t-SNE creates much tighter, more separated clusters!")
print("   This is because it can capture non-linear manifold structure.")
Running t-SNE (this may take a minute)...

[Figure: PCA (linear) vs t-SNE (non-linear) 2D projections, side by side]

🔍 Observation: t-SNE creates much tighter, more separated clusters!
   This is because it can capture non-linear manifold structure.

Part 2: Autoencoders - Learning Non-Linear Compressions

The Idea

Autoencoder: Neural network that learns to compress and reconstruct data.

Architecture

Input (64D) → Encoder → Bottleneck (2D) → Decoder → Output (64D)

Training Objective

Minimize reconstruction error: $\mathcal{L} = \frac{1}{n} \sum_{i=1}^n \| x_i - \hat{x}_i \|^2$

where $\hat{x}_i = f_{\text{dec}}(f_{\text{enc}}(x_i))$

Why This Works

  • Forces bottleneck to learn compressed representation
  • Non-linear activations allow capturing complex patterns
  • Similar to PCA, but non-linear
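
To make the last point concrete: a classical result (Baldi & Hornik, 1989) says a purely linear autoencoder can do no better than PCA, since its optimal 2D bottleneck spans the top-2 principal subspace. A minimal sketch to check this numerically, reusing X, np, and PCA from earlier (2000 full-batch Adam steps are an assumption that should get close to convergence):

# A linear "autoencoder" (no ReLU/Sigmoid) trained to reconstruct the digits
import torch
import torch.nn as nn
import torch.optim as optim

Xt = torch.FloatTensor(X)  # digits data from Part 1

linear_ae = nn.Sequential(nn.Linear(64, 2), nn.Linear(2, 64))
opt = optim.Adam(linear_ae.parameters(), lr=1e-2)
for _ in range(2000):      # full-batch gradient steps
    opt.zero_grad()
    loss = nn.functional.mse_loss(linear_ae(Xt), Xt)
    loss.backward()
    opt.step()

# PCA reconstruction error with the same 2D bottleneck
pca2 = PCA(n_components=2).fit(X)
X_back = pca2.inverse_transform(pca2.transform(X))
print(f"Linear AE MSE: {loss.item():.5f}")
print(f"PCA (2D) MSE:  {np.mean((X - X_back)**2):.5f}")
# The two errors should nearly coincide; the ReLU layers below are
# what let the autoencoder beat this linear baseline.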

2.1 Building an Autoencoder

class Autoencoder(nn.Module):
    def __init__(self, input_dim=64, latent_dim=2):
        super(Autoencoder, self).__init__()
        
        # Encoder: 64 → 32 → 16 → 2
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, latent_dim)
        )
        
        # Decoder: 2 → 16 → 32 → 64
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, input_dim),
            nn.Sigmoid()  # Output in [0, 1]
        )
    
    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon
    
    def encode(self, x):
        return self.encoder(x)
    
    def decode(self, z):
        return self.decoder(z)

# Initialize model
autoencoder = Autoencoder(input_dim=64, latent_dim=2).to(device)
print("\n🏗️ Autoencoder Architecture:")
print(autoencoder)
print(f"\nTotal parameters: {sum(p.numel() for p in autoencoder.parameters()):,}")
🏗️ Autoencoder Architecture:
Autoencoder(
  (encoder): Sequential(
    (0): Linear(in_features=64, out_features=32, bias=True)
    (1): ReLU()
    (2): Linear(in_features=32, out_features=16, bias=True)
    (3): ReLU()
    (4): Linear(in_features=16, out_features=2, bias=True)
  )
  (decoder): Sequential(
    (0): Linear(in_features=2, out_features=16, bias=True)
    (1): ReLU()
    (2): Linear(in_features=16, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=64, bias=True)
    (5): Sigmoid()
  )
)

Total parameters: 5,346

2.2 Training the Autoencoder

# Prepare data
X_train, X_test, y_train_ae, y_test_ae = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train).to(device)
X_test_t = torch.FloatTensor(X_test).to(device)

train_dataset = TensorDataset(X_train_t, X_train_t)  # Input = Target
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder.parameters(), lr=0.001)

# Training loop
n_epochs = 50
train_losses = []

print("\n🎯 Training autoencoder...")
for epoch in range(n_epochs):
    autoencoder.train()
    epoch_loss = 0
    
    for batch_x, _ in train_loader:
        # Forward pass
        x_recon = autoencoder(batch_x)
        loss = criterion(x_recon, batch_x)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
    
    avg_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_loss)
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{n_epochs}], Loss: {avg_loss:.6f}")

print("\n✅ Training complete!")

# Plot training loss
plt.figure(figsize=(10, 4))
plt.plot(train_losses, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('Autoencoder Training Loss', fontsize=13, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
🎯 Training autoencoder...
Epoch [10/50], Loss: 0.071815
Epoch [20/50], Loss: 0.060159
Epoch [30/50], Loss: 0.052960
Epoch [40/50], Loss: 0.049561
Epoch [50/50], Loss: 0.048478

✅ Training complete!

[Figure: Autoencoder training loss curve]

2.3 Visualizing the Learned Latent Space

# Encode all data into 2D latent space
autoencoder.eval()
with torch.no_grad():
    X_all_t = torch.FloatTensor(X).to(device)
    X_latent = autoencoder.encode(X_all_t).cpu().numpy()

# Visualize latent space
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# PCA
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.6, s=30)
axes[0].set_title('PCA (Linear)', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Autoencoder
axes[1].scatter(X_latent[:, 0], X_latent[:, 1], c=y, cmap='tab10', alpha=0.6, s=30)
axes[1].set_title('Autoencoder (Non-Linear)', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# t-SNE (on subset)
axes[2].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_subset, cmap='tab10', alpha=0.6, s=30)
axes[2].set_title('t-SNE (Non-Linear)', fontsize=13, fontweight='bold')
axes[2].grid(True, alpha=0.3)

plt.suptitle('Comparing Dimensionality Reduction Methods', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n💡 Key Insight: All three methods project to 2D, but structure differs!")
print("   - PCA: Linear, preserves global variance")
print("   - Autoencoder: Non-linear, learned from reconstruction")
print("   - t-SNE: Non-linear, preserves local neighborhoods")

[Figure: PCA, autoencoder, and t-SNE 2D projections compared]

💡 Key Insight: All three methods project to 2D, but structure differs!
   - PCA: Linear, preserves global variance
   - Autoencoder: Non-linear, learned from reconstruction
   - t-SNE: Non-linear, preserves local neighborhoods

2.4 Reconstruction Quality

# Reconstruct test images
autoencoder.eval()
with torch.no_grad():
    X_test_recon = autoencoder(X_test_t).cpu().numpy()

# Display original vs reconstructed
n_show = 10
fig, axes = plt.subplots(2, n_show, figsize=(15, 3))

for i in range(n_show):
    # Original (hide ticks but keep the axes so the row labels render;
    # with axis('off') the ylabel would never be displayed)
    axes[0, i].imshow(X_test[i].reshape(8, 8), cmap='gray')
    axes[0, i].set_xticks([]); axes[0, i].set_yticks([])
    if i == 0:
        axes[0, i].set_ylabel('Original', fontsize=12)
    
    # Reconstructed
    axes[1, i].imshow(X_test_recon[i].reshape(8, 8), cmap='gray')
    axes[1, i].set_xticks([]); axes[1, i].set_yticks([])
    if i == 0:
        axes[1, i].set_ylabel('Reconstructed', fontsize=12)

plt.suptitle('Autoencoder: Original vs Reconstructed (2D Bottleneck)', 
             fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# Compute reconstruction error
mse = np.mean((X_test - X_test_recon)**2)
print(f"\n📏 Reconstruction MSE: {mse:.6f}")
print("   Despite compressing 64D → 2D, reconstruction is quite good!")

[Figure: Original vs reconstructed test digits (2D bottleneck)]

📏 Reconstruction MSE: 0.049532
   Despite compressing 64D → 2D, reconstruction is quite good!

2.5 Interpolation in Latent Space

Cool property: We can interpolate between digits in latent space!

# Pick two digits
idx1, idx2 = 10, 50
img1, img2 = X_test[idx1], X_test[idx2]
label1, label2 = y_test_ae[idx1], y_test_ae[idx2]

# Encode to latent space
with torch.no_grad():
    z1 = autoencoder.encode(torch.FloatTensor(img1).unsqueeze(0).to(device))
    z2 = autoencoder.encode(torch.FloatTensor(img2).unsqueeze(0).to(device))
    
    # Interpolate
    n_steps = 10
    alphas = np.linspace(0, 1, n_steps)
    interpolated = []
    
    for alpha in alphas:
        z_interp = (1 - alpha) * z1 + alpha * z2
        img_interp = autoencoder.decode(z_interp).cpu().numpy()
        interpolated.append(img_interp)

# Visualize interpolation
fig, axes = plt.subplots(1, n_steps, figsize=(15, 2))
for i, ax in enumerate(axes):
    ax.imshow(interpolated[i].reshape(8, 8), cmap='gray')
    ax.axis('off')
    if i == 0:
        ax.set_title(f'Digit {label1}', color='blue', fontweight='bold')
    elif i == n_steps - 1:
        ax.set_title(f'Digit {label2}', color='red', fontweight='bold')

plt.suptitle(f'Latent Space Interpolation: {label1} → {label2}', 
             fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✨ Magic! We can 'morph' between digits by moving through latent space.")
print("   This shows the latent space captures semantic structure.")

[Figure: Latent space interpolation between two digits]

✨ Magic! We can 'morph' between digits by moving through latent space.
   This shows the latent space captures semantic structure.

Part 3: Modern Deep Learning for Vision

From Digits to ImageNet: The 2012 Revolution

AlexNet (Krizhevsky et al., 2012):

  • Won the 2012 ImageNet competition with a top-5 error of 15.3% (vs. 26.2% for the runner-up)
  • Used Convolutional Neural Networks (CNNs)
  • Sparked the modern deep learning era

Convolutional Neural Networks (Conceptual)

Key Ideas

  1. Local connectivity: Each neuron looks at small patch
  2. Weight sharing: Same filter applied across entire image
  3. Translation invariance: Detects features regardless of position
  4. Hierarchical learning: Edges → Textures → Parts → Objects

Architecture

Input Image → Conv + ReLU → Pool → Conv + ReLU → Pool → ... → Flatten → Dense → Output
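
Weight sharing is what makes this cheap. A quick illustrative count (using the nn module imported earlier), comparing one small conv layer against a dense layer producing the same output size on an 8×8 input:

# Parameter count: convolution vs. fully-connected (8x8 input)
conv = nn.Conv2d(1, 16, kernel_size=3)   # 16 filters, each 3x3, shared across the image
dense = nn.Linear(8 * 8, 16 * 6 * 6)     # same 16x6x6 output, no weight sharing

print(sum(p.numel() for p in conv.parameters()))   # 160
print(sum(p.numel() for p in dense.parameters()))  # 37440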

We won’t implement CNNs from scratch today (see advanced courses).
Instead, let’s use pre-trained models as black boxes.

3.1 Using Pre-Trained Models

Modern approach: Transfer learning

  • Use model trained on ImageNet (1.2M images, 1000 classes)
  • Apply to your dataset
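
Here is a minimal sketch of that workflow, assuming a recent torchvision is installed: a pre-trained ResNet-18 with its classification head removed becomes a frozen feature extractor (the random batch below is a stand-in for real, ImageNet-normalized 224×224 images):

# Pre-trained ResNet-18 as a frozen feature extractor
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = nn.Identity()        # drop the 1000-class ImageNet head
resnet.eval()
for p in resnet.parameters():    # freeze all weights
    p.requires_grad = False

dummy_batch = torch.randn(8, 3, 224, 224)   # stand-in for real images
with torch.no_grad():
    features = resnet(dummy_batch)
print(features.shape)  # torch.Size([8, 512]); train any classifier on these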

To see what a CNN looks like internally, let's also define a small one sized for our 8×8 digits (for full 28×28 MNIST you would use torchvision.datasets.MNIST).

# We stay with sklearn's 8x8 digits for this demonstration;
# in practice you'd load torchvision.datasets.MNIST for full 28x28 images.

# Simple CNN for digit classification
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 2 * 2, 64)
        self.fc2 = nn.Linear(64, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))  # 8x8 → 4x4
        x = self.pool(self.relu(self.conv2(x)))  # 4x4 → 2x2
        x = x.view(-1, 32 * 2 * 2)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

cnn = SimpleCNN().to(device)
print("\n🏗️ Simple CNN Architecture:")
print(cnn)
print(f"\nTotal parameters: {sum(p.numel() for p in cnn.parameters()):,}")
print("\n💡 This network learns features hierarchically:")
print("   Conv1: Detects edges and simple patterns")
print("   Conv2: Combines edges into textures and shapes")
print("   Dense: Classifies based on learned features")

🏗️ Simple CNN Architecture:
SimpleCNN(
  (conv1): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=128, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=10, bias=True)
  (relu): ReLU()
)

Total parameters: 13,706

💡 This network learns features hierarchically:
   Conv1: Detects edges and simple patterns
   Conv2: Combines edges into textures and shapes
   Dense: Classifies based on learned features

3.2 Quick Training Demo (Optional)

We won't train this CNN step by step in class; a minimal sketch of what the training loop would look like follows the notes below.

In practice:

  • Training on MNIST: ~99% accuracy in 10 epochs
  • Training on ImageNet: Weeks on multiple GPUs
  • Using pre-trained models: Minutes to fine-tune
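
For completeness, here is roughly what that training loop would look like on our 8×8 digits (a minimal full-batch sketch, reusing X, y, device, train_test_split, and SimpleCNN from above):

# Train SimpleCNN on the 8x8 digits (full-batch for simplicity)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr_t = torch.FloatTensor(X_tr).reshape(-1, 1, 8, 8).to(device)  # N x C x H x W
y_tr_t = torch.LongTensor(y_tr).to(device)
X_te_t = torch.FloatTensor(X_te).reshape(-1, 1, 8, 8).to(device)
y_te_t = torch.LongTensor(y_te).to(device)

model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(X_tr_t), y_tr_t)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    acc = (model(X_te_t).argmax(dim=1) == y_te_t).float().mean()
print(f"Test accuracy: {acc.item()*100:.1f}%")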

3.3 From Classification to Understanding

Modern vision AI goes beyond classification:

Object Detection

  • Task: Find and label all objects in an image
  • Examples: YOLO, Faster R-CNN, RetinaNet
  • Output: Bounding boxes + labels

Semantic Segmentation

  • Task: Label every pixel
  • Examples: U-Net, FCN, DeepLab
  • Output: Pixel-wise classification

Image Captioning

  • Task: Generate text description of image
  • Examples: Show and Tell, Show Attend and Tell
  • Output: Natural language sentence

Modern Breakthroughs (2021-2024)

  • CLIP (OpenAI): Vision + language understanding
  • Segment Anything (SAM): Universal segmentation
  • DALL-E, Stable Diffusion: Text-to-image generation
  • GPT-4V: Multimodal understanding

Summary and Reflection

What We Learned

  1. Dimensionality reduction reveals structure
    • PCA: Linear, preserves global variance
    • t-SNE: Non-linear, preserves neighborhoods
    • Right representation makes classification trivial
  2. Autoencoders learn compressed representations
    • Non-linear compression via neural networks
    • Latent space captures semantic structure
    • Can interpolate between examples
  3. Deep learning revolutionized vision
    • CNNs learn features hierarchically
    • Transfer learning leverages pre-trained models
    • Modern AI goes far beyond classification

The Paradigm Shift

Traditional CV               Modern ML/DL
--------------------------   --------------------------
Hand-coded features          Learned features
Domain expertise required    Data-driven
Linear transformations       Non-linear hierarchies
Rigid rules                  Flexible patterns
~70-80% accuracy             ~95-99%+ accuracy

Key Philosophical Point

Intelligence emerges from learning the right representation.

We don’t tell machines how to see.
We give them data and let them figure out what matters.


Homework: “Vision in the Wild”

Your Task

  1. Collect your own image dataset (20-50 images)
    • Choose a category: coffee cups, doors, plants, faces, etc.
    • Take photos with your phone
    • Ensure variety (angles, lighting, backgrounds)
  2. Preprocess the data (a starter sketch follows this list)
    • Resize to consistent dimensions
    • Normalize pixel values
    • Split into train/test
  3. Apply three approaches:
    • Traditional CV: Edge detection, color histograms, HOG features
    • Dimensionality reduction: PCA or t-SNE visualization
    • Pre-trained model: Use transfer learning (e.g., ResNet features)
  4. Analyze and reflect (1-page write-up):
    • What worked? What failed?
    • Why did certain approaches succeed/fail?
    • What does this teach you about vision?
    • How do machines “see” differently than humans?
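
A starter sketch for step 2 (the folder name, file extension, and target size are hypothetical; adjust to your own dataset, and install Pillow for image loading):

# Homework starter: load, resize, normalize, split (paths are hypothetical)
import numpy as np
from pathlib import Path
from PIL import Image
from sklearn.model_selection import train_test_split

imgs = []
for path in sorted(Path("photos").glob("*.jpg")):
    img = Image.open(path).convert("L").resize((64, 64))    # grayscale, 64x64
    imgs.append(np.asarray(img, dtype=np.float32) / 255.0)  # pixels in [0, 1]

X_hw = np.stack(imgs).reshape(len(imgs), -1)  # flatten each image to a vector
X_train_hw, X_test_hw = train_test_split(X_hw, test_size=0.3, random_state=42)
print(X_train_hw.shape, X_test_hw.shape)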

Deliverable

  • Jupyter notebook with code and visualizations
  • 1-page reflection (PDF)

Grading Rubric

  • Technical execution (50%): Data collection, preprocessing, implementation
  • Analysis (30%): Comparison of approaches, failure analysis
  • Reflection (20%): Conceptual understanding, epistemic humility

Due: Next week


References

Dimensionality Reduction

  1. Jolliffe, I. T. (2002). Principal Component Analysis. Springer.
  2. van der Maaten, L., & Hinton, G. (2008). “Visualizing data using t-SNE.” JMLR, 9, 2579-2605.
  3. McInnes, L., Healy, J., & Melville, J. (2018). “UMAP: Uniform manifold approximation and projection.” arXiv:1802.03426.

Autoencoders

  1. Hinton, G. E., & Salakhutdinov, R. R. (2006). “Reducing the dimensionality of data with neural networks.” Science, 313(5786), 504-507.
  2. Kingma, D. P., & Welling, M. (2013). “Auto-encoding variational bayes.” arXiv:1312.6114 (VAE).

Deep Learning for Vision

  1. LeCun, Y., et al. (1998). “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 86(11), 2278-2324. (LeNet)
  2. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). “ImageNet classification with deep convolutional neural networks.” NeurIPS. (AlexNet)
  3. He, K., et al. (2016). “Deep residual learning for image recognition.” CVPR, 770-778. (ResNet)
  4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Modern Vision AI

  1. Redmon, J., et al. (2016). “You only look once: Unified, real-time object detection.” CVPR.
  2. Ronneberger, O., et al. (2015). “U-Net: Convolutional networks for biomedical image segmentation.” MICCAI.
  3. Radford, A., et al. (2021). “Learning transferable visual models from natural language supervision.” ICML. (CLIP)
  4. Kirillov, A., et al. (2023). “Segment anything.” arXiv:2304.02643. (SAM)

End of Lecture 2

Next steps: Work on homework, explore modern vision models, think about how machines see vs. how you see!