Module 11: Deep Learning for Vision and Language

Introduction

In the span of a single decade, deep learning transformed from an academic curiosity dismissed by mainstream AI to the dominant paradigm powering image recognition, speech synthesis, machine translation, and AI systems that can converse, create, and reason. This revolution didn’t happen overnight—it was the culmination of 50 years of patient research, stubborn belief, and a few key breakthroughs that unlocked the power of neural networks.

This module explores the theory and practice of deep learning, the architectures that power modern AI, and the people whose persistence made it possible. From the perceptron debates of the 1960s to GPT-4 and beyond, we trace the arc of one of science’s great vindication stories.


Part 1: The Long Road to Deep Learning

The Perceptron and the First AI Winter

The story begins in 1958, when Frank Rosenblatt unveiled the perceptron—a simple neural network that could learn to classify inputs by adjusting weighted connections. The New York Times proclaimed it “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

The hype was unsustainable. In 1969, Marvin Minsky and Seymour Papert published Perceptrons, mathematically proving that single-layer perceptrons couldn’t learn certain simple functions (like XOR). Their critique was nuanced—they acknowledged that multi-layer networks might overcome these limitations—but the damage was done. Funding dried up. Researchers abandoned neural networks. The first AI winter had begun.

Backpropagation: The Key That Took Decades

The solution was always there, waiting to be discovered: multi-layer networks with backpropagation—an algorithm to compute how each weight contributes to the error, allowing gradual improvement.

Backpropagation was invented multiple times:

Yet the technique didn’t revolutionize AI—not immediately. Training deep networks remained difficult. Gradients vanished in deep layers. Computers were too slow. Data was scarce. Neural networks remained a niche interest through the 1990s and 2000s, overshadowed by SVMs and kernel methods.

The Deep Learning Renaissance

The renaissance began around 2006 when Geoffrey Hinton and colleagues showed that deep networks could be trained effectively using “pre-training”—unsupervised layer-by-layer initialization before fine-tuning with backpropagation.

But the true breakthrough came in 2012:

AlexNet (Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton) entered the ImageNet competition and won by a huge margin—16.4% error rate versus 26.2% for the second-place traditional method. The architecture was simple by today’s standards: 8 layers, 60 million parameters. But it proved that deep learning could dominate real-world problems.

The key ingredients:

Within years, every major tech company had deep learning research labs. The AI winter was over.


Part 2: Convolutional Neural Networks - Seeing with Mathematics

The Biological Inspiration

In 1959, David Hubel and Torsten Wiesel inserted electrodes into a cat’s brain and made a Nobel Prize-winning discovery: neurons in the visual cortex respond to specific oriented edges in specific locations. The visual system has a hierarchical structure—early neurons detect simple features; later neurons combine these into complex objects.

This insight inspired convolutional neural networks (CNNs): layers of artificial neurons that scan across images, detecting local patterns and progressively combining them into higher-level representations.

The Convolution Operation

A convolution slides a small filter (kernel) across an image, computing dot products at each position. A 3×3 edge-detection filter might be:

[-1  0  1]
[-1  0  1]
[-1  0  1]

This filter produces strong responses at vertical edges.

In a CNN:

Key CNN Architectures

LeNet-5 (Yann LeCun, 1998): The pioneer, designed for handwritten digit recognition. Two convolutional layers, modest by modern standards, but the blueprint for everything that followed.

AlexNet (2012): The ImageNet breakthrough. Deeper, trained on GPUs, used ReLU and dropout.

VGGNet (2014): Showed that depth matters. Used only 3×3 filters stacked deep.

GoogLeNet/Inception (2014): Introduced “inception modules” that process at multiple scales simultaneously.

ResNet (2015): Revolutionized deep learning with skip connections—direct paths that bypass layers, enabling training of networks with hundreds of layers. The key insight: it’s easier to learn a residual (the difference from identity) than the full transformation.

EfficientNet (2019): Systematically scaled depth, width, and resolution for optimal efficiency.

Beyond Classification

CNNs power far more than image classification:


Part 3: Recurrent Neural Networks - Learning Sequences

The Problem with Sequences

Standard neural networks process fixed-size inputs independently. But language, speech, music, and time series are sequences where context matters. The meaning of “bank” depends on whether we’re discussing rivers or money.

RNN Architecture

A Recurrent Neural Network (RNN) maintains a hidden state that updates with each new input:

\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\] \[y_t = W_{hy} h_t + b_y\]

The hidden state $h_t$ acts as memory, carrying information from earlier in the sequence.

The Vanishing Gradient Problem

Training RNNs is notoriously difficult. When backpropagating through many timesteps, gradients either:

Both make learning impossible for long sequences.

Long Short-Term Memory (LSTM)

In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented LSTM, which solved the vanishing gradient problem with a brilliant mechanism: the cell state—a highway that can carry information unchanged across many timesteps, with gates that control what enters, exits, and persists.

The three gates:

LSTMs dominated sequence modeling for a decade, powering Google Translate, Apple’s Siri, and Amazon’s Alexa.

Gated Recurrent Units (GRU)

Kyunghyun Cho (2014) simplified LSTM into the GRU, with just two gates (update and reset). GRUs are faster to train and often perform comparably.


Part 4: The Transformer Revolution

Attention Is All You Need

In 2017, a team at Google published “Attention Is All You Need,” introducing the Transformer architecture. It abandoned recurrence entirely, processing entire sequences in parallel using attention mechanisms.

The key innovation: self-attention allows each position to directly attend to all other positions, learning which parts of the input are relevant to each other.

The Attention Mechanism

Given queries Q, keys K, and values V:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

For each query (position), compute similarity to all keys, then take a weighted sum of values. Multi-head attention runs this multiple times with different learned projections.

Encoder-Decoder Architecture

The original Transformer had:

This was designed for machine translation, but the components proved independently powerful.

BERT and the Encoder Revolution (2018)

BERT (Bidirectional Encoder Representations from Transformers) from Google used only the encoder, trained on a masked language modeling task: predict hidden words from context.

Pre-trained BERT embeddings revolutionized NLP. Fine-tuning BERT beat specialized models on virtually every benchmark—sentiment analysis, question answering, named entity recognition.

GPT and the Decoder Revolution (2018-2023)

GPT (Generative Pre-trained Transformer) from OpenAI used only the decoder, trained to predict the next word. This simple objective, scaled to billions of parameters and trillions of words, produced surprising capabilities:

The shift to large language models (LLMs) redefined what AI could do.


Part 5: Training Deep Networks

The Optimization Landscape

Training neural networks means minimizing a loss function over millions of parameters. The landscape is complex—riddled with local minima, saddle points, and flat regions.

Gradient Descent and Its Variants

Stochastic Gradient Descent (SGD): Update weights using gradients from random mini-batches. Noisy but efficient.

Momentum: Accumulate velocity, smoothing updates and escaping local minima.

Adam (2014): Adaptive learning rates for each parameter, combining momentum with RMSprop. The default choice for most deep learning.

Regularization Techniques

Dropout (Hinton, 2012): Randomly zero out neurons during training. Prevents co-adaptation, improves generalization.

Batch Normalization (2015): Normalize activations within mini-batches. Stabilizes training, enables higher learning rates.

Weight Decay: L2 penalty on weights, preventing them from growing too large.

Data Augmentation: Artificially expand training data through transformations (flips, rotations, crops for images).

Architectural Innovations

Skip/Residual Connections: Let gradients flow directly through deep networks.

Layer Normalization: Normalize across features rather than batches.

Attention: Allow direct connections across positions.

Mixture of Experts: Activate only relevant subnetworks, scaling parameters without scaling compute.


Part 6: Deep Learning in Practice

Transfer Learning

Training deep networks from scratch requires massive data and compute. Transfer learning leverages models pre-trained on large datasets:

  1. Take a model trained on ImageNet (images) or large text corpora (language)
  2. Replace the final layer(s) for your specific task
  3. Fine-tune on your smaller dataset

This democratized deep learning—anyone can build powerful models without Google-scale resources.

Computer Vision Applications

Natural Language Processing Applications

Multimodal Models

Modern systems combine vision and language:


Part 7: The Limits and Future of Deep Learning

What Deep Learning Struggles With

Emerging Directions

Neuro-symbolic AI: Combining neural networks with symbolic reasoning

Self-supervised learning: Learning from unlabeled data (contrastive learning, masked prediction)

Efficient architectures: Making deep learning work on edge devices

Foundation models: Pre-trained models adapted to many tasks

Scaling laws: Understanding how performance improves with model size and data


DEEP DIVE: Geoffrey Hinton and the 40-Year Quest to Vindicate Neural Networks

The Prophet in the Wilderness

In the winter of 2012, as the deep learning revolution was just beginning, Geoffrey Hinton stood before a crowd of skeptics and believers at a machine learning conference. His student Alex Krizhevsky had just won the ImageNet competition by a huge margin using a deep neural network. But Hinton’s path to this moment had taken 40 years—four decades of patient research on an approach that most of the field had abandoned.

Geoffrey Hinton was born in 1947 in London into a family of extraordinary thinkers. His great-great-grandfather was George Boole, inventor of Boolean algebra. His father was an entomologist; his cousin a mathematician. From childhood, Hinton was fascinated by the brain and how it might be understood mathematically.

The Edinburgh Years: Finding a Calling

As a psychology undergraduate at Cambridge in the late 1960s, Hinton became convinced that the brain’s ability to learn came from adjusting the strengths of connections between neurons. This idea—that learning is about changing weights—would guide his entire career.

He pursued a PhD at Edinburgh, one of the few places doing AI research in Britain. But the field was in crisis. Minsky and Papert’s Perceptrons had convinced most researchers that neural networks were a dead end. Funding agencies turned away. Prominent researchers advised students to work on something else.

Hinton didn’t listen. He believed the critics were wrong—that multi-layer networks, if we could figure out how to train them, would be far more powerful than single-layer perceptrons.

The Backpropagation Breakthrough

In 1986, Hinton, along with David Rumelhart and Ronald Williams, published a landmark paper in Nature: “Learning representations by back-propagating errors.” The paper described backpropagation—an algorithm for training multi-layer networks by propagating error signals backward through the layers.

Backpropagation wasn’t entirely new (Paul Werbos had described it in 1974), but the Rumelhart/Hinton/Williams paper made it accessible and demonstrated its power. The paper is now one of the most cited in all of computer science.

The AI community took notice. Briefly, neural networks were back. But the renewed interest didn’t last. Through the 1990s, backpropagation struggled with deeper networks. Gradients vanished. Training was slow. Support Vector Machines, with their elegant theory and guarantees, seemed more principled.

The Second Wilderness: 1995-2006

Through these lean years, Hinton kept working. At the University of Toronto, he built a small but dedicated research group. He explored Boltzmann machines, wake-sleep algorithms, and other approaches to training deep networks.

The mainstream AI community moved on. Machine learning conferences increasingly rejected neural network papers. Hinton later recalled that reviewers would dismiss submissions simply because they involved neural networks.

“They couldn’t believe that we were still interested in that stuff,” he said in an interview. “They thought we were crazy.”

The Deep Learning Breakthrough

In 2006, Hinton made a discovery that would change everything. With Simon Osindero and Yee-Whye Teh, he showed that deep networks could be trained effectively by first using unsupervised “pre-training”—teaching each layer to model its inputs before fine-tuning the whole network with backpropagation.

The paper, “A Fast Learning Algorithm for Deep Belief Nets,” demonstrated that deep architectures could learn meaningful representations. More importantly, it inspired a wave of research revisiting neural networks.

The term “deep learning” emerged around this time, distinguishing the new methods from earlier “shallow” neural networks.

The ImageNet Moment

By 2012, the pieces were in place. NVIDIA’s GPUs provided massive parallelism. ImageNet provided millions of labeled images. Dropout regularization prevented overfitting. ReLU activations solved vanishing gradients.

Hinton’s students Alex Krizhevsky and Ilya Sutskever built AlexNet, entered the ImageNet competition, and won by a margin that shocked the field. The error rate dropped from 26% to 16%—a quantum leap in a field accustomed to incremental progress.

The paper, “ImageNet Classification with Deep Convolutional Neural Networks,” has been cited over 100,000 times. It launched the deep learning revolution.

Recognition and Reflection

In 2018, Geoffrey Hinton shared the Turing Award—computing’s Nobel Prize—with Yann LeCun and Yoshua Bengio, “for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.”

But Hinton’s story doesn’t end with triumph. In 2023, he left Google to speak freely about the risks of the technology he helped create. He has become increasingly concerned about existential risks from AI, the potential for misuse, and whether we can control systems that may become smarter than us.

“I’m just a scientist who suddenly realized that these things are getting smarter than us,” he told The New York Times. “I’m scared.”

Lessons from Hinton’s Journey

Hinton’s story offers profound lessons for data science:

  1. Persistence in the face of paradigm: The mainstream was wrong about neural networks. Hinton kept working when most had given up. Revolutionary ideas often start as minority views.

  2. The importance of engineering: Backpropagation was known for 12 years before the Nature paper made it practical. Ideas need implementation, optimization, and demonstration.

  3. Timing and infrastructure: Deep learning needed GPUs, big data, and specific techniques. The idea was right; the circumstances had to catch up.

  4. The burden of success: Creating powerful technology brings responsibility. Hinton’s later concerns about AI safety reflect the ethical weight of transformative discoveries.

  5. The value of fundamental research: Hinton worked on the brain-inspired principles of learning for 40 years before commercial applications emerged. Basic research pays off unpredictably but enormously.


LECTURE PLAN: The Deep Learning Revolution

Learning Objectives

By the end of this lecture, students will be able to:

  1. Explain how deep neural networks learn through backpropagation
  2. Understand the key architectures: CNNs for images, Transformers for sequences
  3. Apply transfer learning for practical problems
  4. Appreciate the history and future challenges of deep learning

Lecture Structure (90 minutes)

Opening Hook (8 minutes)

The 40-Year Wait

Part 1: Neural Networks Foundations (18 minutes)

The Neuron and the Perceptron (5 minutes)

Multi-Layer Networks (5 minutes)

Backpropagation (8 minutes)

Part 2: Convolutional Neural Networks (18 minutes)

The Convolution Operation (6 minutes)

CNN Architecture (6 minutes)

Modern Architectures (6 minutes)

Part 3: Transformers and Language Models (20 minutes)

The Sequence Problem (5 minutes)

Attention Mechanism (7 minutes)

The Modern LLM (8 minutes)

Part 4: Training Deep Networks (12 minutes)

The Optimization Challenge (4 minutes)

Regularization (4 minutes)

Transfer Learning (4 minutes)

Part 5: Limitations and Future (10 minutes)

What Deep Learning Struggles With (5 minutes)

Looking Forward (5 minutes)

Wrap-Up (4 minutes)

Materials Needed

Discussion Questions

  1. Why did it take 40 years for neural networks to become practical?
  2. What’s the difference between how CNNs and Transformers process information?
  3. Why might transfer learning be the most important practical technique?
  4. What responsibilities come with creating powerful AI systems?

HANDS-ON EXERCISE: Building a Deep Learning Image Classifier

Overview

In this exercise, students will:

  1. Build a CNN from scratch for image classification
  2. Use transfer learning with a pre-trained model
  3. Compare performance and training time
  4. Visualize what the network has learned

Prerequisites

Setup

# Install required packages
# pip install tensorflow numpy matplotlib

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.applications import VGG16, ResNet50
from tensorflow.keras.preprocessing.image import ImageDataGenerator

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU')) > 0}")

Part 1: Loading and Exploring Data (10 minutes)

# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Class names
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

print(f"Training images: {x_train.shape}")
print(f"Training labels: {y_train.shape}")
print(f"Test images: {x_test.shape}")
print(f"Image shape: {x_train[0].shape}")
print(f"Pixel value range: {x_train.min()} to {x_train.max()}")

# Visualize some examples
fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(x_train[i])
    ax.set_title(class_names[y_train[i][0]])
    ax.axis('off')
plt.suptitle('Sample CIFAR-10 Images')
plt.tight_layout()
plt.show()

Part 2: Data Preprocessing (10 minutes)

# Normalize pixel values to [0, 1]
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Convert labels to one-hot encoding
y_train_cat = keras.utils.to_categorical(y_train, 10)
y_test_cat = keras.utils.to_categorical(y_test, 10)

# Create validation split
x_val = x_train[-5000:]
y_val = y_train_cat[-5000:]
x_train_final = x_train[:-5000]
y_train_final = y_train_cat[:-5000]

print(f"Training set: {x_train_final.shape}")
print(f"Validation set: {x_val.shape}")
print(f"Test set: {x_test.shape}")

Part 3: Building a CNN from Scratch (25 minutes)

def build_cnn():
    """Build a simple CNN architecture."""
    model = models.Sequential([
        # First convolutional block
        layers.Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3)),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Conv2D(32, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),

        # Second convolutional block
        layers.Conv2D(64, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Conv2D(64, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),

        # Third convolutional block
        layers.Conv2D(128, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Conv2D(128, (3, 3), padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),

        # Dense layers
        layers.Flatten(),
        layers.Dense(512),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])

    return model


# Build and compile the model
cnn_model = build_cnn()
cnn_model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Model summary
cnn_model.summary()

# Count parameters
total_params = cnn_model.count_params()
print(f"\nTotal parameters: {total_params:,}")

Task 3.1: Train the model

# Data augmentation
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True
)
datagen.fit(x_train_final)

# Train with early stopping
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

# Train the model
history = cnn_model.fit(
    datagen.flow(x_train_final, y_train_final, batch_size=64),
    epochs=30,
    validation_data=(x_val, y_val),
    callbacks=[early_stop],
    verbose=1
)

# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].set_title('Loss Over Training')

axes[1].plot(history.history['accuracy'], label='Training Accuracy')
axes[1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].legend()
axes[1].set_title('Accuracy Over Training')

plt.tight_layout()
plt.show()

Part 4: Evaluating the Model (15 minutes)

# Evaluate on test set
test_loss, test_acc = cnn_model.evaluate(x_test, y_test_cat, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
print(f"Test loss: {test_loss:.4f}")

# Predictions
y_pred = cnn_model.predict(x_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

cm = confusion_matrix(y_test, y_pred_classes)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

# Per-class accuracy
print("\nClassification Report:")
print(classification_report(y_test, y_pred_classes, target_names=class_names))

Task 4.1: Visualize correct and incorrect predictions

# Find correct and incorrect predictions
correct_idx = np.where(y_pred_classes == y_test.flatten())[0]
incorrect_idx = np.where(y_pred_classes != y_test.flatten())[0]

# Show some incorrect predictions
fig, axes = plt.subplots(3, 5, figsize=(15, 9))
for i, ax in enumerate(axes.flat):
    idx = incorrect_idx[i]
    ax.imshow(x_test[idx])
    true_label = class_names[y_test[idx][0]]
    pred_label = class_names[y_pred_classes[idx]]
    confidence = y_pred[idx][y_pred_classes[idx]]
    ax.set_title(f"True: {true_label}\nPred: {pred_label} ({confidence:.2f})")
    ax.axis('off')
plt.suptitle('Incorrect Predictions')
plt.tight_layout()
plt.show()

Part 5: Transfer Learning (25 minutes)

# For transfer learning, we need to resize images to the expected input size
def resize_images(images, size=(224, 224)):
    """Resize images for transfer learning models."""
    return tf.image.resize(images, size).numpy()

# This is slow, so we'll use a subset for demonstration
subset_size = 5000
x_train_subset = resize_images(x_train_final[:subset_size])
y_train_subset = y_train_final[:subset_size]
x_val_subset = resize_images(x_val[:1000])
y_val_subset = y_val[:1000]
x_test_resized = resize_images(x_test)

print(f"Resized image shape: {x_train_subset[0].shape}")

Task 5.1: Build transfer learning model

def build_transfer_model(base_model_name='vgg16'):
    """Build a transfer learning model."""

    # Load pre-trained base model (without top layers)
    if base_model_name == 'vgg16':
        base_model = VGG16(
            weights='imagenet',
            include_top=False,
            input_shape=(224, 224, 3)
        )
    else:
        base_model = ResNet50(
            weights='imagenet',
            include_top=False,
            input_shape=(224, 224, 3)
        )

    # Freeze base model layers
    base_model.trainable = False

    # Build the model
    model = models.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])

    return model


# Build and compile transfer learning model
transfer_model = build_transfer_model('vgg16')
transfer_model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Count parameters
trainable_params = sum([np.prod(w.shape) for w in transfer_model.trainable_weights])
total_params = transfer_model.count_params()
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Frozen parameters: {total_params - trainable_params:,}")

Task 5.2: Train the transfer learning model

# Train with early stopping
transfer_history = transfer_model.fit(
    x_train_subset, y_train_subset,
    batch_size=32,
    epochs=10,
    validation_data=(x_val_subset, y_val_subset),
    callbacks=[early_stop],
    verbose=1
)

# Evaluate
transfer_test_loss, transfer_test_acc = transfer_model.evaluate(
    x_test_resized, y_test_cat, verbose=0
)
print(f"\nTransfer Learning Test Accuracy: {transfer_test_acc:.4f}")

Part 6: Visualizing What CNNs Learn (15 minutes)

def visualize_filters(model, layer_name):
    """Visualize the filters of a convolutional layer."""
    # Get the layer
    layer = model.get_layer(layer_name)
    filters = layer.get_weights()[0]

    # Normalize filters for visualization
    f_min, f_max = filters.min(), filters.max()
    filters = (filters - f_min) / (f_max - f_min)

    # Plot filters
    n_filters = min(filters.shape[3], 32)
    n_cols = 8
    n_rows = (n_filters + n_cols - 1) // n_cols

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, n_rows * 1.5))
    for i, ax in enumerate(axes.flat):
        if i < n_filters:
            # For RGB filters, show them as images
            if filters.shape[2] == 3:
                ax.imshow(filters[:, :, :, i])
            else:
                ax.imshow(filters[:, :, 0, i], cmap='gray')
            ax.set_title(f'Filter {i}')
        ax.axis('off')

    plt.suptitle(f'Filters from {layer_name}')
    plt.tight_layout()
    plt.show()


# Visualize first conv layer filters
visualize_filters(cnn_model, 'conv2d')

Task 6.1: Visualize feature maps

def visualize_feature_maps(model, image, layer_names):
    """Visualize feature maps for given layers."""
    # Create a model that outputs feature maps
    outputs = [model.get_layer(name).output for name in layer_names]
    feature_model = keras.Model(inputs=model.input, outputs=outputs)

    # Get feature maps
    features = feature_model.predict(image[np.newaxis, ...])

    # Plot
    for layer_name, feature_map in zip(layer_names, features):
        n_features = min(feature_map.shape[-1], 16)
        fig, axes = plt.subplots(2, 8, figsize=(16, 4))

        for i, ax in enumerate(axes.flat):
            if i < n_features:
                ax.imshow(feature_map[0, :, :, i], cmap='viridis')
            ax.axis('off')

        plt.suptitle(f'Feature maps from {layer_name}')
        plt.tight_layout()
        plt.show()


# Visualize feature maps for a test image
test_image = x_test[0]
plt.figure(figsize=(4, 4))
plt.imshow(test_image)
plt.title(f'Input: {class_names[y_test[0][0]]}')
plt.axis('off')
plt.show()

# Get conv layer names
conv_layers = [layer.name for layer in cnn_model.layers if 'conv2d' in layer.name][:3]
visualize_feature_maps(cnn_model, test_image, conv_layers)

Challenge Questions

  1. Architecture Design: How does the number of convolutional layers affect accuracy and training time? Experiment with 1, 2, and 4 conv blocks.

  2. Hyperparameter Tuning: What happens when you change the learning rate? The batch size? The dropout rate?

  3. Data Augmentation: Remove data augmentation and compare results. Which augmentations help most for CIFAR-10?

  4. Fine-Tuning: Instead of freezing all base model layers, try unfreezing the last few layers and training with a very small learning rate.

  5. Model Comparison: How does a simple fully-connected network (no convolutions) perform on CIFAR-10?

Expected Outputs

Students should submit:

  1. Training curves showing loss and accuracy over epochs
  2. Confusion matrix and per-class accuracy analysis
  3. Comparison between custom CNN and transfer learning
  4. Visualization of learned filters and feature maps
  5. Written analysis of what the network has learned and why certain classes are confused

Evaluation Rubric

Criteria Points
Correct CNN architecture implementation 20
Proper training with regularization 15
Thorough evaluation and metrics 20
Transfer learning implementation 20
Visualization and interpretation 15
Code quality and documentation 10
Total 100

Books

Technical

Historical and Popular

Academic Papers

Video Lectures

Online Courses

Tools and Libraries

Datasets


References

  1. Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). “Learning representations by back-propagating errors.” Nature, 323(6088), 533-536.

  2. Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems, 25.

  3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). “Deep Residual Learning for Image Recognition.” CVPR.

  4. Vaswani, A., et al. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems, 30.

  5. Devlin, J., et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL-HLT.

  6. Brown, T.B., et al. (2020). “Language Models are Few-Shot Learners.” Advances in Neural Information Processing Systems, 33.

  7. LeCun, Y., Bengio, Y., & Hinton, G. (2015). “Deep Learning.” Nature, 521(7553), 436-444.

  8. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  9. Hochreiter, S., & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735-1780.

  10. Hinton, G.E., Osindero, S., & Teh, Y.W. (2006). “A Fast Learning Algorithm for Deep Belief Nets.” Neural Computation, 18(7), 1527-1554.

  11. Srivastava, N., et al. (2014). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research, 15, 1929-1958.

  12. Ioffe, S., & Szegedy, C. (2015). “Batch Normalization: Accelerating Deep Network Training.” ICML.


Module 11 explores the theory and practice of deep learning—the neural network revolution that transformed artificial intelligence. Through Geoffrey Hinton’s 40-year journey from ignored researcher to Nobel-level recognition, we learn about the architectures that power modern AI: CNNs for vision, Transformers for language, and the techniques that make them work.