DATA 202 Module 6: Video Data and Action Recognition
Introduction
Video is the richest data modality—combining visual information across time. YouTube hosts over 800 million videos; TikTok users upload millions more daily; surveillance systems record continuously. Processing video data requires understanding not just what appears in frames but how things change across time.
This module explores video analysis: from basic processing to action recognition, object tracking, and video understanding with deep learning.
Part 1: Video Fundamentals
What is Video?
Video is a sequence of images (frames) displayed rapidly to create the illusion of motion:
- Frame rate: Frames per second (24 fps for film, 30 fps for TV, 60+ for games)
- Resolution: Pixels per frame (1920×1080 for HD, 3840×2160 for 4K)
- Bit rate: Data per second of video
- Codec: Compression algorithm (H.264, H.265, VP9, AV1)
One minute of uncompressed 4K video at 30 fps:
- 3840 × 2160 pixels × 3 bytes × 30 fps × 60 seconds ≈ 44 GB
Compression is essential.
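A quick back-of-the-envelope check of that arithmetic in Python (assuming 8-bit RGB, i.e. 3 bytes per pixel, and no chroma subsampling):

# Uncompressed size of one minute of 4K video at 30 fps
width, height = 3840, 2160
bytes_per_pixel = 3          # 8-bit RGB, no chroma subsampling
fps = 30
seconds = 60

raw_bytes = width * height * bytes_per_pixel * fps * seconds
print(f"{raw_bytes / 1e9:.1f} GB")   # ≈ 44.8 GB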
Video as Structured Data
Extract structured information from video:
- Per-frame: Objects, faces, text
- Temporal: Motion, actions, events
- Aggregated: Statistics, summaries, highlights
Video Processing Pipeline
import cv2

# Read video
cap = cv2.VideoCapture('video.mp4')

# Video properties
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Process frames one at a time
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Process frame (here: convert to grayscale)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Display or save results
    cv2.imshow('Frame', gray)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
Part 2: Object Detection and Tracking
Detection in Video
Apply a still-image object detector (e.g., YOLO or Faster R-CNN) to each frame:
from ultralytics import YOLO
import cv2

model = YOLO('yolov8n.pt')
cap = cv2.VideoCapture('video.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame)            # Run the detector on this frame
    annotated = results[0].plot()     # Draw boxes and labels on the frame
    cv2.imshow('Detection', annotated)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
Object Tracking
Challenge: Link detections across frames to track objects.
Approaches:
- SORT (Simple Online Realtime Tracking): Kalman filter + Hungarian algorithm
- DeepSORT: Add appearance features for re-identification
- ByteTrack: Also associates low-confidence detections instead of discarding them
- Transformers: MOTR, TrackFormer
# Simple tracking with DeepSORT
from deep_sort_realtime.deepsort_tracker import DeepSort

tracker = DeepSort(max_age=30)

# For each frame: detections as ([left, top, width, height], confidence, class) tuples
detections = detect(frame)  # detect() stands in for your detector (e.g., a YOLO wrapper)
tracks = tracker.update_tracks(detections, frame=frame)
for track in tracks:
    if track.is_confirmed():
        track_id = track.track_id
        bbox = track.to_ltrb()  # (left, top, right, bottom) of the tracked box
Part 3: Action Recognition
Understanding Actions in Video
Action recognition classifies what activity is occurring:
- Running, jumping, waving
- Cooking, playing piano
- Fighting, stealing (for security)
Approaches:
- Two-Stream: Separate spatial (appearance) and temporal (motion) streams
- 3D Convolutions: Extend CNNs to space-time (C3D, I3D); see the sketch after this list
- Temporal Modeling: Use recurrent networks on frame features
- Transformers: Video Transformers (ViViT, TimeSformer)
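As a minimal sketch of the 3D-convolution approach, the snippet below classifies a clip with a pretrained ResNet3D from torchvision. The model choice, clip shape, and random dummy input are illustrative; a real pipeline would sample and normalize frames following the model's preprocessing recipe.

import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

# Pretrained 3D ResNet trained on Kinetics-400 (illustrative model choice)
weights = R3D_18_Weights.KINETICS400_V1
model = r3d_18(weights=weights).eval()

# A video clip as a tensor: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 16, 112, 112)  # stand-in for a real, preprocessed clip

with torch.no_grad():
    logits = model(clip)
    pred = logits.argmax(dim=1).item()

print(weights.meta["categories"][pred])  # predicted Kinetics action label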
Optical Flow
Optical flow captures motion between frames—the apparent motion of pixels.
import cv2

# prev_gray and gray are two consecutive frames converted to grayscale
# Sparse (Lucas-Kanade) optical flow: track a set of corner points
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)
p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)

# Dense (Farneback) optical flow: a (height, width, 2) array of per-pixel (dx, dy)
flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
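A common way to visualize dense flow is to map flow direction to hue and flow magnitude to brightness; this sketch assumes the flow array from the Farneback call above:

import numpy as np

# Convert (dx, dy) vectors to magnitude and angle
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# Hue encodes direction, value encodes speed
hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
hsv[..., 0] = ang * 180 / np.pi / 2                          # hue range is 0-180 in OpenCV
hsv[..., 1] = 255                                            # full saturation
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)

flow_bgr = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
cv2.imshow('Optical flow', flow_bgr)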
Video Transformers
Modern video understanding uses Transformers:
- Patch frames into tokens
- Add temporal position encoding
- Self-attention across space and time
Models: ViViT, TimeSformer, Video Swin Transformer
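A minimal sketch of the tokenization step: split frames into patches, project them to tokens, add a learned space-time position embedding, and run joint self-attention. All sizes are illustrative, and a single PyTorch encoder layer stands in for a full video Transformer.

import torch
import torch.nn as nn

B, T, C, H, W = 2, 8, 3, 224, 224       # batch, frames, channels, height, width
patch, dim = 16, 384                    # patch size and token dimension (illustrative)

video = torch.randn(B, T, C, H, W)

# Split each frame into non-overlapping patches and project to tokens
to_tokens = nn.Conv2d(C, dim, kernel_size=patch, stride=patch)
tokens = to_tokens(video.flatten(0, 1))               # (B*T, dim, H/patch, W/patch)
tokens = tokens.flatten(2).transpose(1, 2)            # (B*T, patches, dim)
tokens = tokens.reshape(B, T * tokens.shape[1], dim)  # (B, T*patches, dim)

# Learned position embedding over space and time
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], dim))
tokens = tokens + pos

# Joint space-time self-attention (one encoder layer as a stand-in)
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
out = encoder(tokens)                                 # (B, T*patches, dim)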
Part 4: Video Understanding at Scale
Video Captioning
Generate text descriptions of video content:
- “A man is playing guitar in a park”
- “The cat jumps onto the table and knocks over a glass”
Video Question Answering
Answer questions about video:
- “What color is the car that appears first?”
- “How many people enter the room?”
Video-Language Models
Multimodal models align video with language:
- CLIP for Video: Extend image-text matching to video-text (see the sketch after this list)
- VideoBERT: BERT for video
- Video-LLaMA, Video-ChatGPT: LLMs that understand video
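A minimal sketch of the "CLIP for video" idea: embed a few sampled frames and some candidate captions with an image-text model, then average frame-text similarities over time. The model name, frame sampling, and captions are illustrative.

import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Sample a few frames from the video (here simply the first 8)
cap = cv2.VideoCapture('video.mp4')
frames = []
while len(frames) < 8:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
cap.release()

captions = ["a man playing guitar in a park", "a cat jumping onto a table"]
inputs = processor(text=captions, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_frames, num_captions); average over frames
scores = outputs.logits_per_image.mean(dim=0)
print(captions[scores.argmax().item()])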
DEEP DIVE: The Rise of TikTok and Short-Form Video AI
The Algorithm That Learned What You Want
TikTok’s recommendation algorithm became legendary for its accuracy—users report feeling “understood” within minutes of first use. How?
Key Components:
- Video Understanding: Deep learning extracts visual and audio features
- Engagement Signals: Watch time, replays, shares, comments
- User Modeling: Build preference profiles from behavior
- Real-Time Learning: Adapt quickly to changing interests
The For You Page algorithm reportedly:
- Uses hundreds of signals per video
- Weighs watch completion heavily
- Diversifies to avoid filter bubbles (somewhat)
- Serves new content to test user response
The Dark Side
Optimization for engagement creates problems:
- Addictive feedback loops
- Misinformation spread
- Mental health impacts
- Polarization concerns
Video recommendation is a powerful demonstration of AI’s ability to shape behavior—for better and worse.
HANDS-ON EXERCISE: Video Analysis Pipeline
Part 1: Object Detection in Video
from ultralytics import YOLO
import cv2
import pandas as pd

model = YOLO('yolov8n.pt')

def process_video(input_path, output_path):
    cap = cv2.VideoCapture(input_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Writer for the annotated output video
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

    frame_count = 0
    detections_log = []
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        results = model(frame)
        # Log every detection in this frame
        for box in results[0].boxes:
            cls = int(box.cls)
            conf = float(box.conf)
            detections_log.append({
                'frame': frame_count,
                'class': model.names[cls],
                'confidence': conf
            })
        annotated = results[0].plot()
        out.write(annotated)
        frame_count += 1

    cap.release()
    out.release()
    return pd.DataFrame(detections_log)
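One way to run it and summarize the results (the file names and aggregations here are illustrative):

detections = process_video('video.mp4', 'annotated.mp4')

# How many detections of each class, and their average confidence
print(detections['class'].value_counts())
print(detections.groupby('class')['confidence'].mean())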
Part 2: Motion Analysis
import cv2
import numpy as np
import pandas as pd

def compute_motion_magnitude(video_path, sample_rate=5):
    """Compute average motion magnitude per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    motion_data = []
    prev_gray = None
    frame_idx = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Only compute flow on every sample_rate-th frame
        if frame_idx % sample_rate == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0
                )
                magnitude = np.sqrt(flow[..., 0]**2 + flow[..., 1]**2)
                motion_data.append({
                    'frame': frame_idx,
                    'mean_motion': np.mean(magnitude),
                    'max_motion': np.max(magnitude)
                })
            prev_gray = gray
        frame_idx += 1
    cap.release()
    return pd.DataFrame(motion_data)
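For example, to find the most motion-heavy sampled frames (candidate action or highlight moments; the file name and sample rate are illustrative):

motion = compute_motion_magnitude('video.mp4', sample_rate=5)
print(motion.sort_values('mean_motion', ascending=False).head())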
Recommended Resources
Libraries
- OpenCV: Video processing fundamentals
- Ultralytics YOLO: Object detection
- MMAction2: Action recognition toolkit
- PyTorchVideo: Facebook’s video understanding library
Datasets
- Kinetics: Large-scale action recognition
- ActivityNet: Untrimmed video understanding
- YouTube-8M: Video classification
- MOT Challenge: Multi-object tracking
Module 6 explores video data—the richest but most computationally demanding data modality. From object tracking to action recognition to video understanding, we learn how machines interpret the temporal visual world.