Module 11: Deep Learning for Vision and Language
Introduction
In the span of a single decade, deep learning transformed from an academic curiosity dismissed by mainstream AI to the dominant paradigm powering image recognition, speech synthesis, machine translation, and AI systems that can converse, create, and reason. This revolution didn’t happen overnight—it was the culmination of 50 years of patient research, stubborn belief, and a few key breakthroughs that unlocked the power of neural networks.
This module explores the theory and practice of deep learning, the architectures that power modern AI, and the people whose persistence made it possible. From the perceptron debates of the 1960s to GPT-4 and beyond, we trace the arc of one of science’s great vindication stories.
Part 1: The Long Road to Deep Learning
The Perceptron and the First AI Winter
The story begins in 1958, when Frank Rosenblatt unveiled the perceptron—a simple neural network that could learn to classify inputs by adjusting weighted connections. The New York Times proclaimed it “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”
The hype was unsustainable. In 1969, Marvin Minsky and Seymour Papert published Perceptrons, mathematically proving that single-layer perceptrons couldn’t learn certain simple functions (like XOR). Their critique was nuanced—they acknowledged that multi-layer networks might overcome these limitations—but the damage was done. Funding dried up. Researchers abandoned neural networks. The first AI winter had begun.
Backpropagation: The Key That Took Decades
The solution was always there, waiting to be discovered: multi-layer networks with backpropagation—an algorithm to compute how each weight contributes to the error, allowing gradual improvement.
Backpropagation was invented multiple times:
- Paul Werbos described it in his 1974 PhD thesis
- David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized it in a landmark 1986 Nature paper
Yet the technique didn’t revolutionize AI—not immediately. Training deep networks remained difficult. Gradients vanished in deep layers. Computers were too slow. Data was scarce. Neural networks remained a niche interest through the 1990s and 2000s, overshadowed by SVMs and kernel methods.
The Deep Learning Renaissance
The renaissance began around 2006 when Geoffrey Hinton and colleagues showed that deep networks could be trained effectively using “pre-training”—unsupervised layer-by-layer initialization before fine-tuning with backpropagation.
But the true breakthrough came in 2012:
AlexNet (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton) entered the ImageNet competition and won by a huge margin—16.4% error rate versus 26.2% for the second-place traditional method. The architecture was simple by today’s standards: 8 layers, 60 million parameters. But it proved that deep learning could dominate real-world problems.
The key ingredients:
- GPUs: Graphics cards designed for video games turned out to be perfect for parallel matrix operations
- Large datasets: ImageNet provided millions of labeled images
- Techniques: ReLU activations, dropout regularization, data augmentation
Within years, every major tech company had deep learning research labs. The AI winter was over.
Part 2: Convolutional Neural Networks - Seeing with Mathematics
The Biological Inspiration
In 1959, David Hubel and Torsten Wiesel inserted electrodes into a cat’s brain and made a Nobel Prize-winning discovery: neurons in the visual cortex respond to specific oriented edges in specific locations. The visual system has a hierarchical structure—early neurons detect simple features; later neurons combine these into complex objects.
This insight inspired convolutional neural networks (CNNs): layers of artificial neurons that scan across images, detecting local patterns and progressively combining them into higher-level representations.
The Convolution Operation
A convolution slides a small filter (kernel) across an image, computing dot products at each position. A 3×3 edge-detection filter might be:
[-1 0 1]
[-1 0 1]
[-1 0 1]
This filter produces strong responses at vertical edges.
In a CNN:
- The convolutional layers learn these filters automatically through backpropagation
- Pooling layers downsample, providing translation invariance
- Fully connected layers at the end combine features for classification
Key CNN Architectures
LeNet-5 (Yann LeCun, 1998): The pioneer, designed for handwritten digit recognition. Two convolutional layers, modest by modern standards, but the blueprint for everything that followed.
AlexNet (2012): The ImageNet breakthrough. Deeper, trained on GPUs, used ReLU and dropout.
VGGNet (2014): Showed that depth matters. Used only 3×3 filters stacked deep.
GoogLeNet/Inception (2014): Introduced “inception modules” that process at multiple scales simultaneously.
ResNet (2015): Revolutionized deep learning with skip connections—direct paths that bypass layers, enabling training of networks with hundreds of layers. The key insight: it’s easier to learn a residual (the difference from identity) than the full transformation.
EfficientNet (2019): Systematically scaled depth, width, and resolution for optimal efficiency.
Beyond Classification
CNNs power far more than image classification:
- Object Detection: Locating and classifying multiple objects (YOLO, Faster R-CNN)
- Semantic Segmentation: Labeling every pixel (U-Net, DeepLab)
- Image Generation: Creating new images (VAEs, GANs)
- Face Recognition: Identifying individuals (FaceNet, DeepFace)
- Medical Imaging: Detecting tumors, analyzing X-rays and MRIs
Part 3: Recurrent Neural Networks - Learning Sequences
The Problem with Sequences
Standard neural networks process fixed-size inputs independently. But language, speech, music, and time series are sequences where context matters. The meaning of “bank” depends on whether we’re discussing rivers or money.
RNN Architecture
A Recurrent Neural Network (RNN) maintains a hidden state that updates with each new input:
\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\] \[y_t = W_{hy} h_t + b_y\]The hidden state $h_t$ acts as memory, carrying information from earlier in the sequence.
The Vanishing Gradient Problem
Training RNNs is notoriously difficult. When backpropagating through many timesteps, gradients either:
- Vanish: Multiply by values < 1 repeatedly, approaching zero
- Explode: Multiply by values > 1 repeatedly, becoming huge
Both make learning impossible for long sequences.
Long Short-Term Memory (LSTM)
In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented LSTM, which solved the vanishing gradient problem with a brilliant mechanism: the cell state—a highway that can carry information unchanged across many timesteps, with gates that control what enters, exits, and persists.
The three gates:
- Forget gate: What to erase from memory
- Input gate: What new information to add
- Output gate: What to reveal to the next layer
LSTMs dominated sequence modeling for a decade, powering Google Translate, Apple’s Siri, and Amazon’s Alexa.
Gated Recurrent Units (GRU)
Kyunghyun Cho (2014) simplified LSTM into the GRU, with just two gates (update and reset). GRUs are faster to train and often perform comparably.
Part 4: The Transformer Revolution
Attention Is All You Need
In 2017, a team at Google published “Attention Is All You Need,” introducing the Transformer architecture. It abandoned recurrence entirely, processing entire sequences in parallel using attention mechanisms.
The key innovation: self-attention allows each position to directly attend to all other positions, learning which parts of the input are relevant to each other.
The Attention Mechanism
Given queries Q, keys K, and values V:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]For each query (position), compute similarity to all keys, then take a weighted sum of values. Multi-head attention runs this multiple times with different learned projections.
Encoder-Decoder Architecture
The original Transformer had:
- Encoder: Processes input sequence, building representations
- Decoder: Generates output sequence, attending to encoder and its own previous outputs
This was designed for machine translation, but the components proved independently powerful.
BERT and the Encoder Revolution (2018)
BERT (Bidirectional Encoder Representations from Transformers) from Google used only the encoder, trained on a masked language modeling task: predict hidden words from context.
Pre-trained BERT embeddings revolutionized NLP. Fine-tuning BERT beat specialized models on virtually every benchmark—sentiment analysis, question answering, named entity recognition.
GPT and the Decoder Revolution (2018-2023)
GPT (Generative Pre-trained Transformer) from OpenAI used only the decoder, trained to predict the next word. This simple objective, scaled to billions of parameters and trillions of words, produced surprising capabilities:
- GPT-2 (2019): 1.5B parameters, wrote coherent paragraphs
- GPT-3 (2020): 175B parameters, few-shot learning, emergent abilities
- GPT-4 (2023): Multimodal, passes professional exams, reasons about images
The shift to large language models (LLMs) redefined what AI could do.
Part 5: Training Deep Networks
The Optimization Landscape
Training neural networks means minimizing a loss function over millions of parameters. The landscape is complex—riddled with local minima, saddle points, and flat regions.
Gradient Descent and Its Variants
Stochastic Gradient Descent (SGD): Update weights using gradients from random mini-batches. Noisy but efficient.
Momentum: Accumulate velocity, smoothing updates and escaping local minima.
Adam (2014): Adaptive learning rates for each parameter, combining momentum with RMSprop. The default choice for most deep learning.
Regularization Techniques
Dropout (Hinton, 2012): Randomly zero out neurons during training. Prevents co-adaptation, improves generalization.
Batch Normalization (2015): Normalize activations within mini-batches. Stabilizes training, enables higher learning rates.
Weight Decay: L2 penalty on weights, preventing them from growing too large.
Data Augmentation: Artificially expand training data through transformations (flips, rotations, crops for images).
Architectural Innovations
Skip/Residual Connections: Let gradients flow directly through deep networks.
Layer Normalization: Normalize across features rather than batches.
Attention: Allow direct connections across positions.
Mixture of Experts: Activate only relevant subnetworks, scaling parameters without scaling compute.
Part 6: Deep Learning in Practice
Transfer Learning
Training deep networks from scratch requires massive data and compute. Transfer learning leverages models pre-trained on large datasets:
- Take a model trained on ImageNet (images) or large text corpora (language)
- Replace the final layer(s) for your specific task
- Fine-tune on your smaller dataset
This democratized deep learning—anyone can build powerful models without Google-scale resources.
Computer Vision Applications
- Medical imaging: Detecting diabetic retinopathy, skin cancer, COVID from X-rays
- Autonomous vehicles: Recognizing pedestrians, traffic signs, lane markings
- Agriculture: Identifying crop diseases, counting livestock
- Manufacturing: Defect detection, quality control
- Security: Facial recognition, anomaly detection
Natural Language Processing Applications
- Machine translation: Google Translate, DeepL
- Chatbots and assistants: ChatGPT, Claude, Siri, Alexa
- Search: Semantic understanding of queries
- Content moderation: Detecting hate speech, misinformation
- Legal/medical: Document analysis, summarization
Multimodal Models
Modern systems combine vision and language:
- CLIP (OpenAI): Learns joint image-text representations
- DALL-E, Midjourney, Stable Diffusion: Generate images from text descriptions
- GPT-4V: Understands and reasons about images
- Gemini: Native multimodal understanding
Part 7: The Limits and Future of Deep Learning
What Deep Learning Struggles With
- Reasoning: Multi-step logical deduction remains challenging
- Causal understanding: Correlation patterns, not causal mechanisms
- Data efficiency: Humans learn from few examples; deep learning often needs millions
- Robustness: Small perturbations can fool classifiers
- Interpretability: Understanding why a network makes decisions
Emerging Directions
Neuro-symbolic AI: Combining neural networks with symbolic reasoning
Self-supervised learning: Learning from unlabeled data (contrastive learning, masked prediction)
Efficient architectures: Making deep learning work on edge devices
Foundation models: Pre-trained models adapted to many tasks
Scaling laws: Understanding how performance improves with model size and data
DEEP DIVE: Geoffrey Hinton and the 40-Year Quest to Vindicate Neural Networks
The Prophet in the Wilderness
In the winter of 2012, as the deep learning revolution was just beginning, Geoffrey Hinton stood before a crowd of skeptics and believers at a machine learning conference. His student Alex Krizhevsky had just won the ImageNet competition by a huge margin using a deep neural network. But Hinton’s path to this moment had taken 40 years—four decades of patient research on an approach that most of the field had abandoned.
Geoffrey Hinton was born in 1947 in London into a family of extraordinary thinkers. His great-great-grandfather was George Boole, inventor of Boolean algebra. His father was an entomologist; his cousin a mathematician. From childhood, Hinton was fascinated by the brain and how it might be understood mathematically.
The Edinburgh Years: Finding a Calling
As a psychology undergraduate at Cambridge in the late 1960s, Hinton became convinced that the brain’s ability to learn came from adjusting the strengths of connections between neurons. This idea—that learning is about changing weights—would guide his entire career.
He pursued a PhD at Edinburgh, one of the few places doing AI research in Britain. But the field was in crisis. Minsky and Papert’s Perceptrons had convinced most researchers that neural networks were a dead end. Funding agencies turned away. Prominent researchers advised students to work on something else.
Hinton didn’t listen. He believed the critics were wrong—that multi-layer networks, if we could figure out how to train them, would be far more powerful than single-layer perceptrons.
The Backpropagation Breakthrough
In 1986, Hinton, along with David Rumelhart and Ronald Williams, published a landmark paper in Nature: “Learning representations by back-propagating errors.” The paper described backpropagation—an algorithm for training multi-layer networks by propagating error signals backward through the layers.
Backpropagation wasn’t entirely new (Paul Werbos had described it in 1974), but the Rumelhart/Hinton/Williams paper made it accessible and demonstrated its power. The paper is now one of the most cited in all of computer science.
The AI community took notice. Briefly, neural networks were back. But the renewed interest didn’t last. Through the 1990s, backpropagation struggled with deeper networks. Gradients vanished. Training was slow. Support Vector Machines, with their elegant theory and guarantees, seemed more principled.
The Second Wilderness: 1995-2006
Through these lean years, Hinton kept working. At the University of Toronto, he built a small but dedicated research group. He explored Boltzmann machines, wake-sleep algorithms, and other approaches to training deep networks.
The mainstream AI community moved on. Machine learning conferences increasingly rejected neural network papers. Hinton later recalled that reviewers would dismiss submissions simply because they involved neural networks.
“They couldn’t believe that we were still interested in that stuff,” he said in an interview. “They thought we were crazy.”
The Deep Learning Breakthrough
In 2006, Hinton made a discovery that would change everything. With Simon Osindero and Yee-Whye Teh, he showed that deep networks could be trained effectively by first using unsupervised “pre-training”—teaching each layer to model its inputs before fine-tuning the whole network with backpropagation.
The paper, “A Fast Learning Algorithm for Deep Belief Nets,” demonstrated that deep architectures could learn meaningful representations. More importantly, it inspired a wave of research revisiting neural networks.
The term “deep learning” emerged around this time, distinguishing the new methods from earlier “shallow” neural networks.
The ImageNet Moment
By 2012, the pieces were in place. NVIDIA’s GPUs provided massive parallelism. ImageNet provided millions of labeled images. Dropout regularization prevented overfitting. ReLU activations solved vanishing gradients.
Hinton’s students Alex Krizhevsky and Ilya Sutskever built AlexNet, entered the ImageNet competition, and won by a margin that shocked the field. The error rate dropped from 26% to 16%—a quantum leap in a field accustomed to incremental progress.
The paper, “ImageNet Classification with Deep Convolutional Neural Networks,” has been cited over 100,000 times. It launched the deep learning revolution.
Recognition and Reflection
In 2018, Geoffrey Hinton shared the Turing Award—computing’s Nobel Prize—with Yann LeCun and Yoshua Bengio, “for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.”
But Hinton’s story doesn’t end with triumph. In 2023, he left Google to speak freely about the risks of the technology he helped create. He has become increasingly concerned about existential risks from AI, the potential for misuse, and whether we can control systems that may become smarter than us.
“I’m just a scientist who suddenly realized that these things are getting smarter than us,” he told The New York Times. “I’m scared.”
Lessons from Hinton’s Journey
Hinton’s story offers profound lessons for data science:
-
Persistence in the face of paradigm: The mainstream was wrong about neural networks. Hinton kept working when most had given up. Revolutionary ideas often start as minority views.
-
The importance of engineering: Backpropagation was known for 12 years before the Nature paper made it practical. Ideas need implementation, optimization, and demonstration.
-
Timing and infrastructure: Deep learning needed GPUs, big data, and specific techniques. The idea was right; the circumstances had to catch up.
-
The burden of success: Creating powerful technology brings responsibility. Hinton’s later concerns about AI safety reflect the ethical weight of transformative discoveries.
-
The value of fundamental research: Hinton worked on the brain-inspired principles of learning for 40 years before commercial applications emerged. Basic research pays off unpredictably but enormously.
LECTURE PLAN: The Deep Learning Revolution
Learning Objectives
By the end of this lecture, students will be able to:
- Explain how deep neural networks learn through backpropagation
- Understand the key architectures: CNNs for images, Transformers for sequences
- Apply transfer learning for practical problems
- Appreciate the history and future challenges of deep learning
Lecture Structure (90 minutes)
Opening Hook (8 minutes)
The 40-Year Wait
- Show AlexNet’s ImageNet victory (2012)
- Ask: “How long did it take to develop this ‘overnight success’?”
- Reveal: 40 years of research, multiple ‘AI winters’
- Introduce Geoffrey Hinton’s journey
- Frame the lecture: “Today we’ll understand what made deep learning finally work”
Part 1: Neural Networks Foundations (18 minutes)
The Neuron and the Perceptron (5 minutes)
- Biological inspiration: neurons, dendrites, axons
- The perceptron: weighted sum → activation
- Demo: single neuron classification
Multi-Layer Networks (5 minutes)
- Why one layer isn’t enough (XOR problem)
- Adding hidden layers
- The universal approximation theorem
- Interactive: show how adding layers increases expressivity
Backpropagation (8 minutes)
- The learning problem: how to credit/blame each weight
- The chain rule of calculus
- Forward pass: compute outputs
- Backward pass: propagate gradients
- Demo: simple backprop calculation by hand
- Why this is “deep”: gradients through many layers
Part 2: Convolutional Neural Networks (18 minutes)
The Convolution Operation (6 minutes)
- From brain to algorithm: Hubel & Wiesel’s discovery
- Convolution: sliding filters across images
- Demo: edge detection filters
- Show: what trained CNN filters look like
CNN Architecture (6 minutes)
- Convolutional layers → Pooling → Fully connected
- Translation invariance through weight sharing
- Walk through VGG or AlexNet architecture
- Visualize feature hierarchies: edges → textures → parts → objects
Modern Architectures (6 minutes)
- ResNet: skip connections enable 150+ layers
- Why residuals help: easier to learn “do nothing + small change”
- The architecture zoo: Inception, EfficientNet, Vision Transformer
- Demo: use a pre-trained model to classify images
Part 3: Transformers and Language Models (20 minutes)
The Sequence Problem (5 minutes)
- Why language is hard for neural networks
- RNNs and their limitations (vanishing gradients)
- LSTM: the gated memory solution
Attention Mechanism (7 minutes)
- The key insight: direct connections between any positions
- Query, Key, Value: the attention formula
- Multi-head attention: attending in multiple ways
- Demo: visualize attention patterns in a sentence
The Modern LLM (8 minutes)
- BERT: bidirectional encoders for understanding
- GPT: autoregressive decoders for generation
- Scaling: from millions to trillions of parameters
- Emergent abilities: few-shot learning, reasoning
- Live demo: GPT-style text completion
Part 4: Training Deep Networks (12 minutes)
The Optimization Challenge (4 minutes)
- The loss landscape: local minima, saddle points
- SGD with momentum
- Adam: adaptive learning rates
Regularization (4 minutes)
- The overfitting problem
- Dropout: random neuron silencing
- Batch normalization: stabilizing activations
- Data augmentation: synthetic training examples
Transfer Learning (4 minutes)
- Why train from scratch when others have done it?
- ImageNet pre-training for vision
- BERT/GPT pre-training for language
- Fine-tuning: adapt to your task
- Demo: fine-tune a model with 10 lines of code
Part 5: Limitations and Future (10 minutes)
What Deep Learning Struggles With (5 minutes)
- Reasoning and logic
- Causal understanding vs. correlation
- Data efficiency vs. humans
- Adversarial examples
- Interpretability: the black box problem
Looking Forward (5 minutes)
- Multimodal models: vision + language
- Self-supervised learning
- Efficiency: smaller, faster models
- Foundation models as a paradigm
- The responsibility of powerful AI (Hinton’s concerns)
Wrap-Up (4 minutes)
- Recap: neurons → networks → convolutions → attention
- Hinton’s message: persistence, but also caution
- Preview the hands-on exercise
- Closing thought: “The tools are powerful; use them wisely”
Materials Needed
- Visualization of neural network architectures
- Pre-trained model demos (image classification, text generation)
- Attention visualization tools
- Historical photos (Hinton, AlexNet moment)
Discussion Questions
- Why did it take 40 years for neural networks to become practical?
- What’s the difference between how CNNs and Transformers process information?
- Why might transfer learning be the most important practical technique?
- What responsibilities come with creating powerful AI systems?
HANDS-ON EXERCISE: Building a Deep Learning Image Classifier
Overview
In this exercise, students will:
- Build a CNN from scratch for image classification
- Use transfer learning with a pre-trained model
- Compare performance and training time
- Visualize what the network has learned
Prerequisites
- Python 3.8+
- Libraries: tensorflow/keras or pytorch, numpy, matplotlib
Setup
# Install required packages
# pip install tensorflow numpy matplotlib
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.applications import VGG16, ResNet50
from tensorflow.keras.preprocessing.image import ImageDataGenerator
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU')) > 0}")
Part 1: Loading and Exploring Data (10 minutes)
# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
# Class names
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
print(f"Training images: {x_train.shape}")
print(f"Training labels: {y_train.shape}")
print(f"Test images: {x_test.shape}")
print(f"Image shape: {x_train[0].shape}")
print(f"Pixel value range: {x_train.min()} to {x_train.max()}")
# Visualize some examples
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
ax.imshow(x_train[i])
ax.set_title(class_names[y_train[i][0]])
ax.axis('off')
plt.suptitle('Sample CIFAR-10 Images')
plt.tight_layout()
plt.show()
Part 2: Data Preprocessing (10 minutes)
# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
# Convert labels to one-hot encoding
y_train_cat = keras.utils.to_categorical(y_train, 10)
y_test_cat = keras.utils.to_categorical(y_test, 10)
# Create validation split
x_val = x_train[-5000:]
y_val = y_train_cat[-5000:]
x_train_final = x_train[:-5000]
y_train_final = y_train_cat[:-5000]
print(f"Training set: {x_train_final.shape}")
print(f"Validation set: {x_val.shape}")
print(f"Test set: {x_test.shape}")
Part 3: Building a CNN from Scratch (25 minutes)
def build_cnn():
"""Build a simple CNN architecture."""
model = models.Sequential([
# First convolutional block
layers.Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3)),
layers.BatchNormalization(),
layers.Activation('relu'),
layers.Conv2D(32, (3, 3), padding='same'),
layers.BatchNormalization(),
layers.Activation('relu'),
layers.MaxPooling2D((2, 2)),
layers.Dropout(0.25),
# Second convolutional block
layers.Conv2D(64, (3, 3), padding='same'),
layers.BatchNormalization(),
layers.Activation('relu'),
layers.Conv2D(64, (3, 3), padding='same'),
layers.BatchNormalization(),
layers.Activation('relu'),
layers.MaxPooling2D((2, 2)),
layers.Dropout(0.25),
# Third convolutional block
layers.Conv2D(128, (3, 3), padding='same'),
layers.BatchNormalization(),
layers.Activation('relu'),
layers.Conv2D(128, (3, 3), padding='same'),
layers.BatchNormalization(),
layers.Activation('relu'),
layers.MaxPooling2D((2, 2)),
layers.Dropout(0.25),
# Dense layers
layers.Flatten(),
layers.Dense(512),
layers.BatchNormalization(),
layers.Activation('relu'),
layers.Dropout(0.5),
layers.Dense(10, activation='softmax')
])
return model
# Build and compile the model
cnn_model = build_cnn()
cnn_model.compile(
optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
# Model summary
cnn_model.summary()
# Count parameters
total_params = cnn_model.count_params()
print(f"\nTotal parameters: {total_params:,}")
Task 3.1: Train the model
# Data augmentation
datagen = ImageDataGenerator(
rotation_range=15,
width_shift_range=0.1,
height_shift_range=0.1,
horizontal_flip=True
)
datagen.fit(x_train_final)
# Train with early stopping
early_stop = keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=5,
restore_best_weights=True
)
# Train the model
history = cnn_model.fit(
datagen.flow(x_train_final, y_train_final, batch_size=64),
epochs=30,
validation_data=(x_val, y_val),
callbacks=[early_stop],
verbose=1
)
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].set_title('Loss Over Training')
axes[1].plot(history.history['accuracy'], label='Training Accuracy')
axes[1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()
axes[1].set_title('Accuracy Over Training')
plt.tight_layout()
plt.show()
Part 4: Evaluating the Model (15 minutes)
# Evaluate on test set
test_loss, test_acc = cnn_model.evaluate(x_test, y_test_cat, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
print(f"Test loss: {test_loss:.4f}")
# Predictions
y_pred = cnn_model.predict(x_test)
y_pred_classes = np.argmax(y_pred, axis=1)
# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
cm = confusion_matrix(y_test, y_pred_classes)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
# Per-class accuracy
print("\nClassification Report:")
print(classification_report(y_test, y_pred_classes, target_names=class_names))
Task 4.1: Visualize correct and incorrect predictions
# Find correct and incorrect predictions
correct_idx = np.where(y_pred_classes == y_test.flatten())[0]
incorrect_idx = np.where(y_pred_classes != y_test.flatten())[0]
# Show some incorrect predictions
fig, axes = plt.subplots(3, 5, figsize=(15, 9))
for i, ax in enumerate(axes.flat):
idx = incorrect_idx[i]
ax.imshow(x_test[idx])
true_label = class_names[y_test[idx][0]]
pred_label = class_names[y_pred_classes[idx]]
confidence = y_pred[idx][y_pred_classes[idx]]
ax.set_title(f"True: {true_label}\nPred: {pred_label} ({confidence:.2f})")
ax.axis('off')
plt.suptitle('Incorrect Predictions')
plt.tight_layout()
plt.show()
Part 5: Transfer Learning (25 minutes)
# For transfer learning, we need to resize images to the expected input size
def resize_images(images, size=(224, 224)):
"""Resize images for transfer learning models."""
return tf.image.resize(images, size).numpy()
# This is slow, so we'll use a subset for demonstration
subset_size = 5000
x_train_subset = resize_images(x_train_final[:subset_size])
y_train_subset = y_train_final[:subset_size]
x_val_subset = resize_images(x_val[:1000])
y_val_subset = y_val[:1000]
x_test_resized = resize_images(x_test)
print(f"Resized image shape: {x_train_subset[0].shape}")
Task 5.1: Build transfer learning model
def build_transfer_model(base_model_name='vgg16'):
"""Build a transfer learning model."""
# Load pre-trained base model (without top layers)
if base_model_name == 'vgg16':
base_model = VGG16(
weights='imagenet',
include_top=False,
input_shape=(224, 224, 3)
)
else:
base_model = ResNet50(
weights='imagenet',
include_top=False,
input_shape=(224, 224, 3)
)
# Freeze base model layers
base_model.trainable = False
# Build the model
model = models.Sequential([
base_model,
layers.GlobalAveragePooling2D(),
layers.Dense(256, activation='relu'),
layers.Dropout(0.5),
layers.Dense(10, activation='softmax')
])
return model
# Build and compile transfer learning model
transfer_model = build_transfer_model('vgg16')
transfer_model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy']
)
# Count parameters
trainable_params = sum([np.prod(w.shape) for w in transfer_model.trainable_weights])
total_params = transfer_model.count_params()
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Frozen parameters: {total_params - trainable_params:,}")
Task 5.2: Train the transfer learning model
# Train with early stopping
transfer_history = transfer_model.fit(
x_train_subset, y_train_subset,
batch_size=32,
epochs=10,
validation_data=(x_val_subset, y_val_subset),
callbacks=[early_stop],
verbose=1
)
# Evaluate
transfer_test_loss, transfer_test_acc = transfer_model.evaluate(
x_test_resized, y_test_cat, verbose=0
)
print(f"\nTransfer Learning Test Accuracy: {transfer_test_acc:.4f}")
Part 6: Visualizing What CNNs Learn (15 minutes)
def visualize_filters(model, layer_name):
"""Visualize the filters of a convolutional layer."""
# Get the layer
layer = model.get_layer(layer_name)
filters = layer.get_weights()[0]
# Normalize filters for visualization
f_min, f_max = filters.min(), filters.max()
filters = (filters - f_min) / (f_max - f_min)
# Plot filters
n_filters = min(filters.shape[3], 32)
n_cols = 8
n_rows = (n_filters + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 1.5))
for i, ax in enumerate(axes.flat):
if i < n_filters:
# For RGB filters, show them as images
if filters.shape[2] == 3:
ax.imshow(filters[:, :, :, i])
else:
ax.imshow(filters[:, :, 0, i], cmap='gray')
ax.set_title(f'Filter {i}')
ax.axis('off')
plt.suptitle(f'Filters from {layer_name}')
plt.tight_layout()
plt.show()
# Visualize first conv layer filters
visualize_filters(cnn_model, 'conv2d')
Task 6.1: Visualize feature maps
def visualize_feature_maps(model, image, layer_names):
"""Visualize feature maps for given layers."""
# Create a model that outputs feature maps
outputs = [model.get_layer(name).output for name in layer_names]
feature_model = keras.Model(inputs=model.input, outputs=outputs)
# Get feature maps
features = feature_model.predict(image[np.newaxis, ...])
# Plot
for layer_name, feature_map in zip(layer_names, features):
n_features = min(feature_map.shape[-1], 16)
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
for i, ax in enumerate(axes.flat):
if i < n_features:
ax.imshow(feature_map[0, :, :, i], cmap='viridis')
ax.axis('off')
plt.suptitle(f'Feature maps from {layer_name}')
plt.tight_layout()
plt.show()
# Visualize feature maps for a test image
test_image = x_test[0]
plt.figure(figsize=(4, 4))
plt.imshow(test_image)
plt.title(f'Input: {class_names[y_test[0][0]]}')
plt.axis('off')
plt.show()
# Get conv layer names
conv_layers = [layer.name for layer in cnn_model.layers if 'conv2d' in layer.name][:3]
visualize_feature_maps(cnn_model, test_image, conv_layers)
Challenge Questions
-
Architecture Design: How does the number of convolutional layers affect accuracy and training time? Experiment with 1, 2, and 4 conv blocks.
-
Hyperparameter Tuning: What happens when you change the learning rate? The batch size? The dropout rate?
-
Data Augmentation: Remove data augmentation and compare results. Which augmentations help most for CIFAR-10?
-
Fine-Tuning: Instead of freezing all base model layers, try unfreezing the last few layers and training with a very small learning rate.
-
Model Comparison: How does a simple fully-connected network (no convolutions) perform on CIFAR-10?
Expected Outputs
Students should submit:
- Training curves showing loss and accuracy over epochs
- Confusion matrix and per-class accuracy analysis
- Comparison between custom CNN and transfer learning
- Visualization of learned filters and feature maps
- Written analysis of what the network has learned and why certain classes are confused
Evaluation Rubric
| Criteria | Points |
|---|---|
| Correct CNN architecture implementation | 20 |
| Proper training with regularization | 15 |
| Thorough evaluation and metrics | 20 |
| Transfer learning implementation | 20 |
| Visualization and interpretation | 15 |
| Code quality and documentation | 10 |
| Total | 100 |
Recommended Resources
Books
Technical
- Deep Learning by Goodfellow, Bengio, and Courville - The comprehensive textbook (free online)
- Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron - Practical focus
- Deep Learning with Python by François Chollet - By the creator of Keras
- Neural Networks and Deep Learning by Michael Nielsen - Free online, intuitive explanations
- Dive into Deep Learning by Zhang et al. - Free, interactive, with code
Historical and Popular
- The Deep Learning Revolution by Terrence Sejnowski - History from an insider
- Genius Makers by Cade Metz - The story of AI pioneers
- The Alignment Problem by Brian Christian - AI safety and values
- You Look Like a Thing and I Love You by Janelle Shane - Humorous introduction
Academic Papers
- Rumelhart, Hinton, Williams (1986). “Learning representations by back-propagating errors” - The backpropagation paper
- Krizhevsky, Sutskever, Hinton (2012). “ImageNet Classification with Deep Convolutional Neural Networks” - AlexNet
- He et al. (2016). “Deep Residual Learning for Image Recognition” - ResNet
- Vaswani et al. (2017). “Attention Is All You Need” - Transformers
- Devlin et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers”
- Brown et al. (2020). “Language Models are Few-Shot Learners” - GPT-3
Video Lectures
- 3Blue1Brown: Neural Networks - Beautiful visualizations
- Stanford CS231n: CNNs for Visual Recognition - Andrej Karpathy’s course
- Stanford CS224n: NLP with Deep Learning - Chris Manning’s course
- MIT 6.S191: Introduction to Deep Learning - Accessible introduction
- Fast.ai: Practical Deep Learning - Top-down practical approach
Online Courses
- Fast.ai: Practical, code-first deep learning
- Coursera: Deep Learning Specialization (Andrew Ng)
- DeepLearning.AI: Various specialized courses
- Hugging Face Course: NLP with Transformers
Tools and Libraries
- TensorFlow/Keras (https://tensorflow.org/) - Google’s framework
- PyTorch (https://pytorch.org/) - Meta’s framework, research standard
- Hugging Face Transformers (https://huggingface.co/) - Pre-trained models
- Weights & Biases (https://wandb.ai/) - Experiment tracking
- TensorBoard - Training visualization
Datasets
- ImageNet - The benchmark for image classification
- CIFAR-10/100 - Small image classification
- COCO - Object detection and segmentation
- Common Crawl - Web text for language models
- Hugging Face Datasets - Curated ML datasets
References
-
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533-536.
-
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems, 25.
-
He, K., Zhang, X., Ren, S., & Sun, J. (2016). “Deep Residual Learning for Image Recognition.” CVPR.
-
Vaswani, A., et al. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems, 30.
-
Devlin, J., et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL-HLT.
-
Brown, T.B., et al. (2020). “Language Models are Few-Shot Learners.” Advances in Neural Information Processing Systems, 33.
-
LeCun, Y., Bengio, Y., & Hinton, G. (2015). “Deep Learning.” Nature, 521(7553), 436-444.
-
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
-
Hochreiter, S., & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735-1780.
-
Hinton, G.E., Osindero, S., & Teh, Y.W. (2006). “A Fast Learning Algorithm for Deep Belief Nets.” Neural Computation, 18(7), 1527-1554.
-
Srivastava, N., et al. (2014). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research, 15, 1929-1958.
-
Ioffe, S., & Szegedy, C. (2015). “Batch Normalization: Accelerating Deep Network Training.” ICML.
Module 11 explores the theory and practice of deep learning—the neural network revolution that transformed artificial intelligence. Through Geoffrey Hinton’s 40-year journey from ignored researcher to Nobel-level recognition, we learn about the architectures that power modern AI: CNNs for vision, Transformers for language, and the techniques that make them work.