DATA 202 Module 3: Real-Time and Streaming Data

Introduction

The world generates data continuously—every sensor reading, every click, every transaction, every heartbeat. Traditional batch processing assumes data sits still while you analyze it. But increasingly, data arrives in streams that must be processed as they flow.

This module explores streaming data paradigms: the technologies, architectures, and use cases for processing data in motion. From IoT sensors to financial markets to social media firehoses, we'll learn to build systems that respond in real time.


Part 1: Batch vs. Stream Processing

The Batch Paradigm

Traditional data processing is batch-oriented:

  1. Collect data over a period
  2. Store in a data warehouse or lake
  3. Run analysis on the complete dataset
  4. Generate reports

Limitations:

High latency: insights arrive hours or days after the events occur
Staleness: results are only as current as the most recent run
Poor fit: time-sensitive decisions such as fraud detection cannot wait for the next batch

The Streaming Paradigm

Stream processing treats data as a continuous flow:

Events are processed as they arrive, one at a time or in small groups
Results update continuously instead of waiting for a scheduled run
Latency drops from hours to seconds or milliseconds

Trade-offs:

Operational complexity: always-on infrastructure that must be monitored
Approximate results: answers may be revised as late data arrives
Harder correctness: out-of-order events and failure recovery must be handled explicitly

Use Cases for Streaming

Financial Services: Fraud detection, algorithmic trading, risk monitoring
IoT: Sensor monitoring, predictive maintenance, smart cities
Social Media: Trending topics, content moderation, recommendations
E-commerce: Real-time personalization, inventory tracking
Transportation: Vehicle tracking, routing optimization, ETA calculation


Part 2: Streaming Concepts

Events and Streams

An event is an immutable record of something that happened:

{
    "event_type": "purchase",
    "user_id": "user123",
    "product_id": "prod456",
    "amount": 29.99,
    "timestamp": "2024-01-15T14:30:00Z"
}

A stream is an unbounded sequence of events, typically ordered by time (though, as we will see, not always perfectly).

Windowing

Since streams are unbounded, we need ways to group events:

Tumbling Windows: Fixed, non-overlapping intervals

Sliding Windows: Overlapping intervals

Session Windows: Dynamic windows based on activity
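
To make windowing concrete, here is a minimal sketch in plain Python (no framework; the event shapes and timestamps are illustrative) that assigns events to 60-second tumbling windows by truncating each timestamp to its window start:

from collections import defaultdict
from datetime import datetime, timezone

def window_start(ts, width_seconds=60):
    # Truncate a timestamp to the start of its tumbling window; every
    # event in the same 60-second span maps to the same key.
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - (epoch % width_seconds), tz=timezone.utc)

events = [
    {"amount": 29.99, "time": datetime(2024, 1, 15, 14, 30, 5, tzinfo=timezone.utc)},
    {"amount": 9.99, "time": datetime(2024, 1, 15, 14, 30, 47, tzinfo=timezone.utc)},
    {"amount": 4.99, "time": datetime(2024, 1, 15, 14, 31, 2, tzinfo=timezone.utc)},
]

windows = defaultdict(list)
for event in events:
    windows[window_start(event["time"])].append(event)

# The first two events share the 14:30:00 window; the third opens 14:31:00.

A sliding window would assign each event to several overlapping keys instead of exactly one.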

Time Semantics

Event Time: When the event actually occurred
Processing Time: When the system processes the event
Ingestion Time: When the event enters the system

Late-arriving data complicates event-time processing—an event timestamped 10:00 may arrive at 10:05.
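
A common remedy, sketched below in plain Python, is to hold each window open for a grace period after its end before emitting results (the ALLOWED_LATENESS value and function names are illustrative, not from any particular framework):

from datetime import timedelta

ALLOWED_LATENESS = timedelta(minutes=2)  # illustrative grace period

def window_is_closed(window_end, max_event_time_seen):
    # Use the largest event time observed so far as a simple watermark.
    # The window is emitted only after the watermark passes its end plus
    # the grace period, giving late events a chance to be counted.
    return max_event_time_seen > window_end + ALLOWED_LATENESS

Events that arrive even later than the grace period are typically dropped or routed to a side channel for separate handling.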

State and Checkpointing

Streaming computations maintain state:

Running aggregates: counts, sums, and averages updated per event
Window contents: events buffered until their window closes
Join buffers: events held while awaiting matches from another stream

Checkpointing saves state periodically for fault tolerance—if a processor fails, restart from the last checkpoint.
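
A minimal sketch of the idea, assuming state that fits in a dict and a local snapshot file (production engines such as Flink checkpoint to durable storage and coordinate snapshots across workers):

import json

CHECKPOINT_PATH = "checkpoint.json"  # illustrative local path

def save_checkpoint(state, offset):
    # Persist the state together with the stream position it reflects,
    # so recovery can resume from exactly this point.
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump({"state": state, "offset": offset}, f)

def load_checkpoint():
    try:
        with open(CHECKPOINT_PATH) as f:
            snapshot = json.load(f)
        return snapshot["state"], snapshot["offset"]
    except FileNotFoundError:
        return {}, 0  # no checkpoint yet: start fresh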


Part 3: Streaming Technologies

Apache Kafka

Kafka is the dominant streaming platform:

Topics: named streams that events are published to
Partitions: shards of a topic that provide parallelism and per-partition ordering
Producers and consumers: clients that write events to and read events from topics
Consumer groups: cooperating consumers that divide a topic's partitions among themselves
Durability: events persist in a replicated, append-only log

Kafka is more “log” than “queue”—messages persist and can be replayed.

from kafka import KafkaProducer, KafkaConsumer
import json

# Producer
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
producer.send('events', {'user': 'john', 'action': 'click'})
producer.flush()  # send() is asynchronous; flush to ensure delivery

# Consumer
consumer = KafkaConsumer(
    'events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)
for message in consumer:
    print(message.value)
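
To see the replay property in action, a new consumer group can be pointed at the beginning of the retained log. A sketch using the same kafka-python client (the group name is illustrative):

from kafka import KafkaConsumer
import json

# auto_offset_reset='earliest' makes a consumer group with no committed
# offsets start from the oldest retained message rather than new arrivals.
replay_consumer = KafkaConsumer(
    'events',
    bootstrap_servers='localhost:9092',
    group_id='replay-demo',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)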

Apache Spark Streaming

Spark's Structured Streaming API processes data as micro-batches: small batch jobs executed at short, regular intervals.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamExample").getOrCreate()

# Read from Kafka
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Process
events = df.selectExpr("CAST(value AS STRING)")

# Write results to the console; awaitTermination() keeps the script alive
query = events.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Apache Flink

Flink is a true stream processor (not micro-batch):

Per-event processing: operators handle each event individually as it arrives
Event-time support: native watermarks for handling out-of-order and late data
Exactly-once state: distributed checkpointing keeps state consistent across failures
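
A minimal PyFlink sketch of per-event processing (assumes the apache-flink package is installed; the bounded collection stands in for a real source such as Kafka):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A small in-memory collection substitutes for a streaming source here.
actions = env.from_collection(['view', 'click', 'view', 'purchase'])

# Each event flows through the operators individually, with no micro-batches.
actions.map(lambda action: (action, 1)) \
    .key_by(lambda pair: pair[0]) \
    .reduce(lambda a, b: (a[0], a[1] + b[1])) \
    .print()

env.execute("flink_example")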

Cloud Services

AWS: Kinesis Data Streams, Kinesis Data Analytics
Google Cloud: Pub/Sub, Dataflow
Azure: Event Hubs, Stream Analytics


Part 4: IoT and Sensor Data

The IoT Data Challenge

IoT devices generate continuous streams:

High volume: thousands of readings per second across a device fleet
Noisy data: dropped readings, duplicates, and sensor drift
Constrained devices: limited bandwidth, power, and compute at the source

Edge Computing

Process data at the edge (on or near the device):

Lower latency: decisions are made without a round trip to the cloud
Reduced bandwidth: filter and aggregate locally, transmit only summaries
Resilience: devices keep working through network outages
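
A sketch of the pattern: the device condenses a buffer of raw readings into one summary record and ships only that upstream (send_upstream is a hypothetical stand-in for whatever transport the device uses):

def summarize(readings):
    # Collapse many raw sensor readings into one compact record so only
    # the summary crosses the network.
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }

# Raw temperature readings buffered on the device
buffer = [21.9, 22.1, 22.0, 35.7, 22.2]

summary = summarize(buffer)
# send_upstream(summary)  # hypothetical transport call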

Time-Series Databases

IoT data is inherently time-series:

Timestamped readings arriving in append-heavy write patterns
Queries dominated by time ranges, downsampling, and retention policies
Purpose-built databases such as InfluxDB and TimescaleDB optimize for exactly these patterns
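
As a concrete example, InfluxDB's line protocol packs a measurement, indexed tags, field values, and a nanosecond timestamp into a single line (the sensor and site names are illustrative):

temperature,sensor_id=s1,site=plant_a value=22.5 1705329000000000000

The measurement names the series, tags carry indexed metadata, fields hold the values, and the trailing integer is nanoseconds since the Unix epoch (here 2024-01-15T14:30:00Z).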


DEEP DIVE: The Lambda Architecture and Its Evolution

Handling Two Speeds of Truth

In 2011, Nathan Marz, working on Twitter’s analytics, faced a dilemma. Batch processing (Hadoop) gave accurate, complete results but took hours. Real-time processing gave immediate results but lacked completeness and accuracy. How to get both?

His answer was the Lambda Architecture:

Batch Layer: Process complete historical data periodically

Speed Layer: Process real-time data as it arrives

Serving Layer: Merge batch and real-time views

The Lambda Architecture solved real problems and was widely adopted. But it had a significant flaw: you had to maintain two separate codebases doing essentially the same computation.

The Kappa Architecture

Jay Kreps, while at LinkedIn developing Kafka, proposed a simpler approach:

Use streaming for everything. Store all data in a log (Kafka). Run streaming jobs to produce views. If you need to recompute from scratch, replay the log.

The Kappa Architecture eliminates code duplication—one codebase, one paradigm.

Modern Reality

Today, most organizations use hybrid approaches:

Streaming pipelines serve operational, low-latency views
Batch jobs remain for model training, backfills, and reconciliation
Unified engines such as Spark and Flink increasingly run both modes from a single codebase

Lesson for Data Scientists

Understand both paradigms. Many problems need real-time responses built on historical context. The architecture choice depends on latency requirements, accuracy needs, and operational complexity tolerance.


HANDS-ON EXERCISE: Building a Real-Time Data Pipeline

Overview

Students will:

  1. Set up a local Kafka environment
  2. Build a producer that generates streaming events
  3. Create a consumer that processes and aggregates
  4. Implement windowed analytics

Setup (Using Docker)

# docker-compose.yml for Kafka
# Start with: docker-compose up -d
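
One common single-broker layout for local development is sketched below (image tags and settings are illustrative; check the Confluent documentation for current versions):

version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1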

Part 1: Event Producer

import json
import random
import time
from datetime import datetime
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def generate_event():
    return {
        'event_id': random.randint(1, 1000000),
        'user_id': f'user_{random.randint(1, 100)}',
        'action': random.choice(['view', 'click', 'purchase']),
        'amount': round(random.uniform(1, 100), 2) if random.random() > 0.7 else 0,
        'timestamp': datetime.now().isoformat()
    }

while True:
    event = generate_event()
    producer.send('user_events', event)
    print(f"Sent: {event}")
    time.sleep(0.1)

Part 2: Streaming Consumer with Aggregation

from kafka import KafkaConsumer
from collections import defaultdict
from datetime import datetime, timedelta
import json

consumer = KafkaConsumer(
    'user_events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='latest'
)

# Track events in current window
window_size = timedelta(seconds=10)
window_start = datetime.now()
window_counts = defaultdict(int)

for message in consumer:
    event = message.value
    window_counts[event['action']] += 1

    # Close the window once it expires. Note this check only runs when a
    # message arrives, so a quiet stream can hold a window open past its
    # nominal end (a simplification of processing-time tumbling windows).
    if datetime.now() - window_start > window_size:
        print("\n=== Window Summary ===")
        for action, count in window_counts.items():
            print(f"{action}: {count}")

        # Reset for the next window
        window_start = datetime.now()
        window_counts = defaultdict(int)


Module 3 explores real-time and streaming data processing—the technologies and paradigms for handling data in motion. From Lambda and Kappa architectures to modern streaming platforms, we learn to build systems that respond as events unfold.