DATA 202 Module 3: Real-Time and Streaming Data

Introduction

The world generates data continuously—every sensor reading, every click, every transaction, every heartbeat. Traditional batch processing assumes data sits still while you analyze it. But increasingly, data arrives in streams that must be processed as they flow.

This module explores streaming data paradigms: the technologies, architectures, and use cases for processing data in motion. From IoT sensors to financial markets to social media firehoses, we’ll learn to build systems that respond in real-time.

Part 1: Batch vs. Stream Processing

The Batch Paradigm

Traditional data processing is batch-oriented:

Collect data over a period
Store in a data warehouse or lake
Run analysis on the complete dataset
Generate reports

Limitations:

Latency: Hours or days between event and insight
Staleness: Decisions based on yesterday’s data
Missed Opportunities: Real-time patterns invisible

The Streaming Paradigm

Stream processing treats data as a continuous flow:

Process each event as it arrives
Maintain running calculations (counts, averages, windows)
Trigger actions immediately
Handle unbounded datasets

Trade-offs:

Complexity: Harder to reason about stateful processing
Ordering: Events may arrive out of order
Exactly-once: Guaranteeing each event processed exactly once is hard
State management: Where do running aggregates live?

Use Cases for Streaming

Financial Services: Fraud detection, algorithmic trading, risk monitoring IoT: Sensor monitoring, predictive maintenance, smart cities Social Media: Trending topics, content moderation, recommendations E-commerce: Real-time personalization, inventory tracking Transportation: Vehicle tracking, routing optimization, ETA calculation

Part 2: Streaming Concepts

Events and Streams

An event is an immutable record of something that happened:

{
    "event_type": "purchase",
    "user_id": "user123",
    "product_id": "prod456",
    "amount": 29.99,
    "timestamp": "2024-01-15T14:30:00Z"
}

A stream is an unbounded sequence of events, ordered by time.

Windowing

Since streams are unbounded, we need ways to group events:

Tumbling Windows: Fixed, non-overlapping intervals

“Count events per minute”
Each event belongs to exactly one window

Sliding Windows: Overlapping intervals

“Average over last 5 minutes, updated every second”
Events may belong to multiple windows

Session Windows: Dynamic windows based on activity

Group events within idle gaps
“All actions in a browsing session”

Time Semantics

Event Time: When the event actually occurred Processing Time: When the system processes the event Ingestion Time: When the event enters the system

Late-arriving data complicates event-time processing—an event timestamped 10:00 may arrive at 10:05.

State and Checkpointing

Streaming computations maintain state:

Running counts and aggregates
Session information
Machine learning model parameters

Checkpointing saves state periodically for fault tolerance—if a processor fails, restart from the last checkpoint.

Part 3: Streaming Technologies

Apache Kafka

Kafka is the dominant streaming platform:

Topics: Categories of messages
Producers: Publish messages to topics
Consumers: Subscribe to topics and process messages
Partitions: Enable parallel processing and scaling
Retention: Messages stored for configurable time

Kafka is more “log” than “queue”—messages persist and can be replayed.

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
producer.send('events', {'user': 'john', 'action': 'click'})

# Consumer
consumer = KafkaConsumer(
    'events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
for message in consumer:
    print(message.value)

Apache Spark Streaming

Spark Streaming processes micro-batches—small batch jobs run frequently:

Leverages Spark’s batch capabilities
Unified API for batch and stream
Structured Streaming uses DataFrames

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamExample").getOrCreate()

# Read from Kafka
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Process
events = df.selectExpr("CAST(value AS STRING)")

# Write results
query = events.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

Apache Flink

Flink is a true stream processor (not micro-batch):

Event-at-a-time processing
Rich windowing and event-time handling
Exactly-once semantics
Low latency

Cloud Services

AWS: Kinesis Data Streams, Kinesis Data Analytics Google Cloud: Pub/Sub, Dataflow Azure: Event Hubs, Stream Analytics

Part 4: IoT and Sensor Data

The IoT Data Challenge

IoT devices generate continuous streams:

Thousands or millions of devices
High frequency (sensor readings per second)
Often unreliable connectivity
Edge processing may be required

Edge Computing

Process data at the edge (on or near the device):

Reduce bandwidth by filtering/aggregating
Lower latency for time-critical decisions
Handle intermittent connectivity

Time-Series Databases

IoT data is inherently time-series:

InfluxDB: Purpose-built for time-series
TimescaleDB: PostgreSQL extension
Prometheus: Monitoring-focused
QuestDB: High-performance ingestion

DEEP DIVE: The Lambda Architecture and Its Evolution

Handling Two Speeds of Truth

In 2011, Nathan Marz, working on Twitter’s analytics, faced a dilemma. Batch processing (Hadoop) gave accurate, complete results but took hours. Real-time processing gave immediate results but lacked completeness and accuracy. How to get both?

His answer was the Lambda Architecture:

Batch Layer: Process complete historical data periodically

Master dataset: immutable, append-only
Batch views: Precomputed from complete data
Slow but accurate

Speed Layer: Process real-time data as it arrives

Compensate for batch latency
Approximate views
Fast but potentially incomplete

Serving Layer: Merge batch and real-time views

Answer queries from both
Real-time view replaced by batch when available

The Lambda Architecture solved real problems and was widely adopted. But it had a significant flaw: you had to maintain two separate codebases doing essentially the same computation.

The Kappa Architecture

Jay Kreps, while at LinkedIn developing Kafka, proposed a simpler approach:

Use streaming for everything. Store all data in a log (Kafka). Run streaming jobs to produce views. If you need to recompute from scratch, replay the log.

The Kappa Architecture eliminates code duplication—one codebase, one paradigm.

Modern Reality

Today, most organizations use hybrid approaches:

Kafka for real-time ingestion
Cloud data warehouses for historical analysis
Streaming for real-time views
The distinction between batch and stream blurs with tools like Spark and Flink that handle both

Lesson for Data Scientists

Understand both paradigms. Many problems need real-time responses built on historical context. The architecture choice depends on latency requirements, accuracy needs, and operational complexity tolerance.

HANDS-ON EXERCISE: Building a Real-Time Data Pipeline

Overview

Students will:

Set up a local Kafka environment
Build a producer that generates streaming events
Create a consumer that processes and aggregates
Implement windowed analytics

Setup (Using Docker)

# docker-compose.yml for Kafka
# Start with: docker-compose up -d

Part 1: Event Producer

import json
import random
import time
from datetime import datetime
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def generate_event():
    return {
        'event_id': random.randint(1, 1000000),
        'user_id': f'user_{random.randint(1, 100)}',
        'action': random.choice(['view', 'click', 'purchase']),
        'amount': round(random.uniform(1, 100), 2) if random.random() > 0.7 else 0,
        'timestamp': datetime.now().isoformat()
    }

while True:
    event = generate_event()
    producer.send('user_events', event)
    print(f"Sent: {event}")
    time.sleep(0.1)

Part 2: Streaming Consumer with Aggregation

from kafka import KafkaConsumer
from collections import defaultdict
from datetime import datetime, timedelta
import json

consumer = KafkaConsumer(
    'user_events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='latest'
)

# Track events in current window
window_size = timedelta(seconds=10)
window_start = datetime.now()
window_counts = defaultdict(int)

for message in consumer:
    event = message.value
    window_counts[event['action']] += 1

    # Check if window expired
    if datetime.now() - window_start > window_size:
        print(f"\n=== Window Summary ===")
        for action, count in window_counts.items():
            print(f"{action}: {count}")

        # Reset window
        window_start = datetime.now()
        window_counts = defaultdict(int)

Recommended Resources

Books

Streaming Systems by Tyler Akidau, et al.
Kafka: The Definitive Guide by Neha Narkhede, et al.
Designing Data-Intensive Applications by Martin Kleppmann

Documentation

Module 3 explores real-time and streaming data processing—the technologies and paradigms for handling data in motion. From Lambda and Kappa architectures to modern streaming platforms, we learn to build systems that respond as events unfold.