DATA 202 Module 6: Video Data and Action Recognition

Introduction

Video is the richest data modality—combining visual information across time. YouTube hosts over 800 million videos; TikTok users upload millions more daily; surveillance systems record continuously. Processing video data requires understanding not just what appears in frames but how things change across time.

This module explores video analysis: from basic processing to action recognition, object tracking, and video understanding with deep learning.


Part 1: Video Fundamentals

What is Video?

Video is a sequence of images (frames) displayed rapidly to create the illusion of motion:

One minute of uncompressed 4K video at 30 fps, assuming 3840 × 2160 pixels at 3 bytes per pixel, comes to roughly 45 GB.

Compression is essential.
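
The scale of uncompressed video is easy to check with a few multiplications. The sketch below is a back-of-the-envelope estimate only; the 3840 × 2160 resolution and the 3 bytes per pixel (8-bit RGB) are assumptions, and the helper name uncompressed_bytes is made up for illustration.

```python
# Rough size of raw, uncompressed video.
# Assumptions (not stated in the module): 4K UHD = 3840x2160,
# 8-bit RGB = 3 bytes per pixel, no audio, no container overhead.

def uncompressed_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Return the raw size in bytes of a video stored without compression."""
    return width * height * bytes_per_pixel * fps * seconds

size = uncompressed_bytes(3840, 2160, fps=30, seconds=60)
print(f"{size:,} bytes = {size / 1e9:.1f} GB")  # 44,789,760,000 bytes = 44.8 GB
```

Changing the frame rate or bit depth scales the result linearly, so 60 fps would double the figure.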

Video as Structured Data

Extract structured information from video:

Video Processing Pipeline

import cv2

# Read video
cap = cv2.VideoCapture('video.mp4')

# Video properties
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Process frames
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Process frame
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Display or save results
    cv2.imshow('Frame', gray)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()  # close the window opened by imshow
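
The fps value read in the pipeline above is what links a frame index back to a position in time, which is often needed when logging results. Here is a minimal sketch of that conversion. The function name frame_to_timestamp is hypothetical, and the check for a non-positive fps is there because some sources may report an unusable frame rate.

```python
def frame_to_timestamp(frame_idx, fps):
    """Convert a zero-based frame index to an 'HH:MM:SS.mmm' string."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    total_ms = round(frame_idx * 1000 / fps)       # position in milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"

print(frame_to_timestamp(4500, 30.0))  # 00:02:30.000
```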

Part 2: Object Detection and Tracking

Detection in Video

Apply image object detection (YOLO, Faster R-CNN) to each frame:

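import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # pretrained nano model

cap = cv2.VideoCapture('video.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame)          # run detection on this frame
    annotated = results[0].plot()   # draw boxes and labels
    cv2.imshow('Detection', annotated)
    # imshow only refreshes when waitKey runs; press 'q' to quit
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()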

Object Tracking

Challenge: Link detections across frames to track objects.

Approaches:

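# Simple tracking with DeepSORT
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(max_age=30)  # drop a track after 30 frames without a match

# For each frame
# detect() is a placeholder for your own detector; it should return a list of
# ([left, top, width, height], confidence, class) tuples
detections = detect(frame)
tracks = tracker.update_tracks(detections, frame=frame)

for track in tracks:
    if track.is_confirmed():
        track_id = track.track_id   # ID that stays stable across frames
        bbox = track.to_ltrb()      # [left, top, right, bottom]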
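
To show what linking detections across frames involves at its simplest, here is a sketch of greedy matching by box overlap (IoU). This is not how DeepSORT works internally: DeepSORT adds motion prediction and appearance features on top of this idea. The function names, the 0.3 threshold, and the integer track IDs are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(prev_tracks, detections, iou_threshold=0.3):
    """Pair tracks and detections greedily, in order of decreasing IoU.

    prev_tracks: dict of integer track_id -> box from the previous frame
    detections:  list of boxes found in the current frame
    Returns a dict of track_id -> box for the current frame. Detections
    that match no existing track start new tracks with fresh IDs.
    """
    pairs = sorted(
        ((iou(box, det), tid, i)
         for tid, box in prev_tracks.items()
         for i, det in enumerate(detections)),
        reverse=True,
    )
    assigned, used_tracks, used_dets = {}, set(), set()
    for score, tid, i in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if tid in used_tracks or i in used_dets:
            continue
        assigned[tid] = detections[i]
        used_tracks.add(tid)
        used_dets.add(i)

    next_id = max(prev_tracks, default=-1) + 1
    for i, det in enumerate(detections):
        if i not in used_dets:
            assigned[next_id] = det
            next_id += 1
    return assigned

prev = {0: (0, 0, 10, 10), 1: (50, 50, 60, 60)}
dets = [(51, 50, 61, 60), (1, 0, 11, 10), (200, 200, 210, 210)]
print(match_detections(prev, dets))
```

In this example the first two detections inherit IDs 1 and 0 from the boxes they overlap, and the third, which overlaps nothing, becomes a new track with ID 2. Overlap-only matching breaks down when objects move fast or occlude one another, which is the gap that trackers such as DeepSORT try to close.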

Part 3: Action Recognition

Understanding Actions in Video

Action recognition classifies what activity is occurring:

Approaches:

  1. Two-Stream: Separate spatial (appearance) and temporal (motion) streams
  2. 3D Convolutions: Extend CNNs to space-time (C3D, I3D)
  3. Temporal Modeling: Use recurrent networks on frame features
  4. Transformers: Video Transformers (ViViT, TimeSformer)
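
As one concrete illustration of the two-stream idea, the sketch below combines class scores from a spatial stream and a temporal stream by averaging their softmax probabilities, which is one common late-fusion choice among several. The class labels, the logit values, and the function names are invented for the example; in practice the logits would come from two trained networks.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the two streams' class probabilities."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [
        (1 - temporal_weight) * s + temporal_weight * t
        for s, t in zip(p_spatial, p_temporal)
    ]

labels = ["walking", "running", "jumping"]   # hypothetical action classes
spatial = [2.0, 1.5, 0.1]    # appearance alone slightly favors "walking"
temporal = [0.5, 2.5, 0.2]   # motion strongly favors "running"

fused = fuse_two_stream(spatial, temporal)
print(labels[fused.index(max(fused))])  # running
```

Here the motion stream overrides an ambiguous appearance cue, which is the situation the two-stream design is meant to handle.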

Optical Flow

Optical flow captures motion between frames—the apparent motion of pixels.

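import cv2

# prev_gray and gray are two consecutive frames, already converted to grayscale

# Lucas-Kanade optical flow (sparse): follow corner features from one frame to the next
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)
p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
# status[i] is 1 where point i was tracked successfully

# Dense optical flow (Farneback): one (dx, dy) vector per pixel
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)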
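
Each entry in a dense flow field is a displacement (dx, dy) in pixels, and it is often easier to reason about as a speed and a direction. The sketch below does that conversion for a single vector in plain Python. The function name flow_to_polar is hypothetical; for whole arrays, OpenCV's cv2.cartToPolar performs the same conversion.

```python
import math

def flow_to_polar(dx, dy):
    """Convert one flow vector to (magnitude in pixels, angle in degrees).

    Angles are measured in image coordinates, where y grows downward,
    so 90 degrees means motion toward the bottom of the frame.
    """
    magnitude = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx)) % 360
    return magnitude, angle

print(flow_to_polar(3.0, 4.0))    # magnitude 5.0, pointing down and to the right
print(flow_to_polar(0.0, -2.0))   # magnitude 2.0, angle 270.0 (toward the top)
```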

Video Transformers

Modern video understanding uses Transformers:

Models: ViViT, TimeSformer, Video Swin Transformer
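
A useful way to see why video Transformers are expensive is to count tokens. The sketch below assumes a clip is cut into non-overlapping space-time blocks (often called tubelets), each becoming one token. The clip size of 32 frames at 224 × 224 and the 2 × 16 × 16 tubelet are illustrative assumptions, not settings taken from any particular model.

```python
def num_video_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Count tokens when a clip is split into non-overlapping tubelets.

    tubelet is (frames, pixels high, pixels wide) covered by one token.
    """
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return (frames // t) * (height // h) * (width // w)

video_tokens = num_video_tokens(32, 224, 224)                      # 16 * 14 * 14
image_tokens = num_video_tokens(1, 224, 224, tubelet=(1, 16, 16))  # 14 * 14

print(video_tokens)        # 3136
print(image_tokens)        # 196
# Full self-attention compares every token with every other token:
print(video_tokens ** 2)   # 9834496
```

Because attention cost grows with the square of the token count, a short clip is already far costlier than a single image, which is one motivation for models that factor attention into separate spatial and temporal steps.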


Part 4: Video Understanding at Scale

Video Captioning

Generate text descriptions of video content:

Video Question Answering

Answer questions about video:

Video-Language Models

Multimodal models align video with language:


DEEP DIVE: The Rise of TikTok and Short-Form Video AI

The Algorithm That Learned What You Want

TikTok’s recommendation algorithm became legendary for its accuracy—users report feeling “understood” within minutes of first use. How?

Key Components:

  1. Video Understanding: Deep learning extracts visual and audio features
  2. Engagement Signals: Watch time, replays, shares, comments
  3. User Modeling: Build preference profiles from behavior
  4. Real-Time Learning: Adapt quickly to changing interests

The For You Page algorithm reportedly:

The Dark Side

Optimization for engagement creates problems:

Video recommendation is a powerful demonstration of AI’s ability to shape behavior—for better and worse.


HANDS-ON EXERCISE: Video Analysis Pipeline

Part 1: Object Detection in Video

from ultralytics import YOLO
import cv2
import pandas as pd

model = YOLO('yolov8n.pt')

def process_video(input_path, output_path):
    cap = cv2.VideoCapture(input_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

    frame_count = 0
    detections_log = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        results = model(frame)

        for box in results[0].boxes:
            cls = int(box.cls)
            conf = float(box.conf)
            detections_log.append({
                'frame': frame_count,
                'class': model.names[cls],
                'confidence': conf
            })

        annotated = results[0].plot()
        out.write(annotated)
        frame_count += 1

    cap.release()
    out.release()
    return pd.DataFrame(detections_log)
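
Once detections are logged, a natural next step is to summarize them. The sketch below works on plain records shaped like the entries of detections_log (with frame, class, and confidence keys), so it runs without any video. The function name summarize_detections and the 0.5 confidence cutoff are assumptions. A DataFrame returned by process_video could be passed in as df.to_dict('records').

```python
from collections import Counter, defaultdict

def summarize_detections(records, min_confidence=0.5):
    """Count detections and distinct frames per class above a confidence cutoff."""
    counts = Counter()
    frames_seen = defaultdict(set)
    for r in records:
        if r["confidence"] < min_confidence:
            continue  # skip low-confidence detections
        counts[r["class"]] += 1
        frames_seen[r["class"]].add(r["frame"])
    return {
        cls: {"detections": counts[cls], "frames": len(frames_seen[cls])}
        for cls in counts
    }

# Hand-written sample records for illustration only
sample = [
    {"frame": 0, "class": "person", "confidence": 0.91},
    {"frame": 0, "class": "person", "confidence": 0.84},
    {"frame": 0, "class": "dog", "confidence": 0.32},
    {"frame": 1, "class": "person", "confidence": 0.88},
    {"frame": 1, "class": "dog", "confidence": 0.67},
]
print(summarize_detections(sample))
```

With the sample above, "person" is counted three times across two frames and "dog" once, since its first detection falls below the cutoff.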

Part 2: Motion Analysis

import cv2
import numpy as np
import pandas as pd

def compute_motion_magnitude(video_path, sample_rate=5):
    """Compute average motion magnitude per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    motion_data = []
    prev_gray = None
    frame_idx = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_idx % sample_rate == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
                )
                magnitude = np.sqrt(flow[..., 0]**2 + flow[..., 1]**2)
                motion_data.append({
                    'frame': frame_idx,
                    'mean_motion': np.mean(magnitude),
                    'max_motion': np.max(magnitude)
                })

            prev_gray = gray

        frame_idx += 1

    cap.release()
    return pd.DataFrame(motion_data)
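
The per-frame motion values can be used to flag moments where something happens, such as a cut or a burst of activity. Here is a minimal sketch that marks frames whose mean motion sits well above the clip's average. The rule (mean plus k standard deviations), the default k of 2.0, and the function name high_motion_frames are assumptions for illustration, not a standard method.

```python
from statistics import mean, pstdev

def high_motion_frames(records, k=2.0):
    """Return frame indices whose mean_motion exceeds mean + k * std.

    records: list of dicts with 'frame' and 'mean_motion' keys, the same
    shape as the rows built in compute_motion_magnitude.
    """
    values = [r["mean_motion"] for r in records]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform motion, nothing stands out
    threshold = mu + k * sigma
    return [r["frame"] for r in records if r["mean_motion"] > threshold]

# Synthetic example: a quiet clip with one burst of motion at frame 45
sample = [{"frame": i * 5, "mean_motion": 0.1} for i in range(9)]
sample.append({"frame": 45, "mean_motion": 5.0})
print(high_motion_frames(sample))  # [45]
```

A fixed threshold of this kind is sensitive to camera motion, so results on handheld footage should be treated with caution.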

Libraries

Datasets


Module 6 explores video data—the richest but most computationally demanding data modality. From object tracking to action recognition to video understanding, we learn how machines interpret the temporal visual world.