Module 2: Data Structures and Data Visualization

“Telling Stories with Data”

Research Document for DATA 201 Course Development


Table of Contents

  1. Introduction
  2. Part I: The Evolution of Data Representation
  3. Part II: Pioneers of Information Design
  4. Part III: The Science of Visual Perception
  5. Part IV: Modern Data Visualization
  6. Part V: Data Structures - From Biology to Networks
  7. DEEP DIVE: W.E.B. Du Bois’s Data Portraits
  8. Lecture Plan and Hands-On Exercise
  9. Recommended Resources
  10. References

Introduction

This module explores how humans have developed ways to represent and visualize data across millennia—from ancient maps to modern interactive dashboards. We examine:

The central question: How do we transform raw data into understanding?


Part I: The Evolution of Data Representation

From Cave Paintings to Coordinate Systems

The Oldest “Data Visualizations”

Human beings have been visualizing information for at least 40,000 years. Cave paintings at Lascaux (France) and Altamira (Spain) are accompanied by dots and tally-like marks that may represent lunar cycles or hunting counts: proto-data visualizations painted and marked on stone.

Ancient Maps: The First Spatial Data

The Babylonian World Map (c. 600 BCE)

Ptolemy’s Geographia (2nd century CE)

The Coordinate Revolution: René Descartes (1637)

René Descartes’s invention of the Cartesian coordinate system in La Géométrie transformed how we represent relationships between variables. For the first time, abstract mathematical relationships could be visualized as geometric shapes.

The Data Journey:


The Mercator Projection (1569): When Maps Lie

Gerardus Mercator, a Flemish cartographer, created his famous projection in 1569 specifically for navigation. His innovation: representing rhumb lines (constant compass bearings) as straight lines.

Why Mercator Worked for Navigation

Sailors could plot a straight course on the map and follow it with a constant compass bearing—revolutionary for ocean navigation.

The Hidden Distortion

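The Mercator projection preserves angles but distorts areas. Landmasses near the poles appear vastly larger than they are. Greenland, for example, looks comparable in size to Africa, although Africa's land area is roughly 14 times greater.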
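
The size of this distortion follows from the projection's geometry. On a Mercator map the linear scale at latitude φ grows as sec φ in both directions, so areas are inflated by sec² φ relative to the equator. The sketch below tabulates that factor; the function name and the sample latitudes are illustrative choices, not part of the original notes.

```python
import math

def mercator_area_inflation(latitude_deg: float) -> float:
    """Area exaggeration on a Mercator map relative to the equator.

    The projection stretches east-west and north-south distances by
    sec(latitude), so areas grow by sec(latitude) squared.
    """
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2

# Sample latitudes chosen for illustration only.
for lat in (0, 30, 45, 60, 75):
    factor = mercator_area_inflation(lat)
    print(f"{lat:>2} degrees: areas appear {factor:.1f}x larger than at the equator")
```

At 60 degrees, roughly the latitude of Oslo, areas are already drawn four times too large, and the factor grows without bound toward the poles.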

The Politics of Projection

In 1974, historian Arno Peters promoted an “equal area” projection, arguing that Mercator’s distortions reinforced colonial perceptions—making European countries appear larger relative to Africa and South America.

Modern Alternatives:

Key Insight: Every map projection involves trade-offs. There is no “neutral” way to flatten a sphere—all choices embed values and priorities.


Part II: Pioneers of Information Design

Otto Neurath and ISOTYPE (1920s-1930s)

The Visual Language That Became Modern Infographics

Austrian sociologist Otto Neurath believed that visual communication could democratize knowledge. In 1920s Vienna, he developed ISOTYPE (International System of Typographic Picture Education)—a standardized pictogram language for showing social and economic data.

The Vienna Method

At the Social and Economic Museum of Vienna (1925-1934), Neurath and his collaborators created a system of:

Philosophy: “Words Divide, Pictures Unite”

Neurath believed visual statistics could communicate across language barriers and education levels. ISOTYPE was designed for “visual education”—always accompanied by text, but making complex data accessible to ordinary citizens.

The “Transformer”

Marie Reidemeister (later Marie Neurath) served as the “transformer”—the crucial role of translating data and ideas from subject experts into visual form. This anticipated the modern role of data visualization designers.

Legacy

ISOTYPE’s influence appears everywhere today:


Jacques Bertin’s Semiology of Graphics (1967)

The First Theory of Data Visualization

French cartographer Jacques Bertin published Sémiologie graphique in 1967—the first systematic theoretical foundation for information graphics.

The Seven Visual Variables

Bertin identified seven fundamental visual properties that can encode data:

Variable          | Best For          | Example
Position          | Quantitative data | X-Y coordinates on scatter plot
Size              | Quantitative data | Bubble size in bubble chart
Value (lightness) | Ordered data      | Light to dark shading
Texture           | Categorical data  | Patterns in maps
Color (hue)       | Categorical data  | Different colors for categories
Orientation       | Categorical data  | Angle of lines
Shape             | Categorical data  | Circles vs. squares
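
The table can be read as a lookup from data type to candidate encodings. As a minimal sketch, the dictionary below transcribes the seven rows, and a helper with the hypothetical name `variables_for` returns the variables listed for a given kind of data.

```python
# Bertin's seven visual variables, transcribed from the table above.
# Keys are the variables; values are the data type each is listed as best for.
VISUAL_VARIABLES = {
    "position": "quantitative",
    "size": "quantitative",
    "value": "ordered",
    "texture": "categorical",
    "color": "categorical",
    "orientation": "categorical",
    "shape": "categorical",
}

def variables_for(data_type: str) -> list[str]:
    """Return the visual variables the table lists for a data type."""
    return [name for name, best_for in VISUAL_VARIABLES.items()
            if best_for == data_type]

print("quantitative:", variables_for("quantitative"))
print("ordered:", variables_for("ordered"))
print("categorical:", variables_for("categorical"))
```

A chart designer might use such a lookup as a first check: a quantitative field mapped to shape or hue is a sign that the encoding deserves a second look.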

Planar vs. Retinal Variables

Bertin distinguished between:

Influence

Bertin’s framework influenced:


Edward Tufte: The Leonardo da Vinci of Data

Chartjunk, Data-Ink, and Graphical Excellence

Edward Tufte, professor emeritus at Yale, published The Visual Display of Quantitative Information in 1983—probably the most influential book on data visualization ever written.

Core Principles

1. The Data-Ink Ratio

\[\text{Data-Ink Ratio} = \frac{\text{Ink used to display data}}{\text{Total ink used in graphic}}\]

Maximize this ratio. Remove everything that doesn’t contribute to understanding.

2. Chartjunk

Tufte coined “chartjunk” to describe unnecessary decorative elements:

3. The Lie Factor

\[\text{Lie Factor} = \frac{\text{Size of effect shown in graphic}}{\text{Size of effect in data}}\]

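A Lie Factor of 1 means the graphic shows the effect at its true size; values well above or below 1 indicate distortion. Tufte documented a published graphic with a Lie Factor of 14.8, grossly exaggerating the underlying data.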
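
The 14.8 figure can be reproduced by hand. In the example Tufte analyses, a newspaper graphic of fuel-economy standards, the standard rises from 18 to 27.5 miles per gallon while the line representing it grows from 0.6 to 5.3 inches. The sketch below applies the formula to those numbers; the function names are illustrative.

```python
def relative_change(first: float, last: float) -> float:
    """Size of an effect, measured as the change relative to the starting value."""
    return (last - first) / first

def lie_factor(graphic_first: float, graphic_last: float,
               data_first: float, data_last: float) -> float:
    """Ratio of the effect shown in the graphic to the effect in the data."""
    shown = relative_change(graphic_first, graphic_last)
    actual = relative_change(data_first, data_last)
    return shown / actual

# Line lengths in inches (graphic) and miles per gallon (data).
factor = lie_factor(0.6, 5.3, 18.0, 27.5)
print(f"Effect shown in graphic: {relative_change(0.6, 5.3):.0%}")
print(f"Effect in data:          {relative_change(18.0, 27.5):.0%}")
print(f"Lie Factor:              {factor:.1f}")
```

The graphic shows a 783% increase where the data contain a 53% increase, which gives the ratio of 14.8.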

4. Small Multiples

Display multiple small graphics to reveal patterns across different conditions or time periods.

5. Sparklines

Tufte invented “sparklines”—word-sized graphics that can be embedded in text, maximizing data-ink ratio.
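
A text-only approximation shows how little space a sparkline needs. The sketch below maps each value onto one of eight Unicode block characters; it illustrates the idea and is not Tufte's own method, and the function name and sample series are made up.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight Unicode block characters, shortest to tallest

def sparkline(values: list[float]) -> str:
    """Render a sequence of numbers as a word-sized strip of bar characters."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A flat series has no variation to show; draw it at mid height.
        return BARS[len(BARS) // 2] * len(values)
    # Scale each value onto an index 0..7 into BARS.
    return "".join(
        BARS[round((v - low) / span * (len(BARS) - 1))] for v in values
    )

# Hypothetical monthly readings, embedded in a sentence.
readings = [3, 5, 4, 8, 12, 9, 6, 7, 11, 15, 13, 10]
print(f"Monthly readings {sparkline(readings)} peaked in month 10.")
```

Because the strip has no axes, labels, or gridlines, nearly every mark on it carries data.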

The Self-Published Books

Tufte self-published his books after being dissatisfied with traditional publishers’ treatment of visual content. They became design classics:

Modern Critique

Recent research suggests Tufte’s minimalism isn’t always optimal:


Part III: The Science of Visual Perception

Anscombe’s Quartet (1973)

Why We Must Always Visualize Data

In 1973, statistician Francis Anscombe constructed four datasets that revolutionized how we think about data analysis.

The Four Datasets

All four datasets have nearly identical statistical properties:

But When Plotted…

The four datasets reveal completely different patterns:

  1. Dataset I: Normal linear relationship with scatter
  2. Dataset II: Perfect curved (quadratic) relationship—not linear at all!
  3. Dataset III: Perfect linear relationship with one outlier
  4. Dataset IV: No relationship except for one extreme outlier

The Lesson

“Numerical calculations are exact, but graphs are rough” — a misconception Anscombe sought to counter.

Summary statistics can completely obscure the nature of your data. Always visualize.
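
The claim is easy to check numerically. The sketch below hard-codes Anscombe's published values and computes the mean, variance, and Pearson correlation for each dataset using only the standard library; the variable and function names are illustrative.

```python
from statistics import mean, variance

# Anscombe's quartet (1973). Datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), mean(ys), variance(xs), variance(ys), pearson_r(xs, ys))
    mx, my, vx, vy, r = summary[name]
    print(f"{name:>3}: mean x={mx:.2f}  mean y={my:.2f}  "
          f"var x={vx:.2f}  var y={vy:.2f}  r={r:.3f}")
```

The printed rows agree to about two decimal places even though the scatterplots look nothing alike, which is why the numbers alone cannot distinguish the four cases.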

The Datasaurus Dozen (2017)

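Justin Matejka and George Fitzmaurice at Autodesk Research extended Anscombe’s idea to create the “Datasaurus Dozen”: the dinosaur-shaped “Datasaurus” scatterplot plus twelve other datasets whose summary statistics match to two decimal places, yet which form: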

“Never trust summary statistics alone; always visualize your data.”


Pre-Attentive Processing

What the Eye Sees Before the Brain Thinks

Certain visual features are processed by the brain in less than 250 milliseconds—before conscious attention engages. These “pre-attentive” features enable instant pattern recognition.

Pre-Attentive Visual Features

Highly Pre-Attentive:

Moderately Pre-Attentive:

Design Implications

  1. Use pre-attentive features to highlight the most important information
  2. Don’t use too many pre-attentive channels simultaneously (visual overload)
  3. Color is powerful but should be used consistently

Gestalt Principles

The Gestalt psychologists identified how we perceive visual groupings:

These principles inform effective chart design.


Part IV: Modern Data Visualization

Hans Rosling and Gapminder

The Best Stats You’ve Ever Seen

Hans Rosling (1948-2017), a Swedish physician and professor, transformed how the world sees global development data.

The 2006 TED Talk

Rosling’s TED talk “The Best Stats You’ve Ever Seen” is one of the most viewed TED videos ever. In 19 minutes, he:

“I produce a road-less sound.”

The Gapminder Bubble Chart

The visualization showed:

The Tool: Trendalyzer

Rosling’s son Ola Rosling and daughter-in-law Anna Rosling Rönnlund built Trendalyzer, which Google acquired in 2007 and released as Google Motion Charts and Public Data Explorer.

Rosling’s Teaching Philosophy

Rosling didn’t just show data—he told stories:

Legacy


The Grammar of Graphics

From Bertin to ggplot2

Leland Wilkinson’s Grammar of Graphics (1999)

Wilkinson formalized a system for describing any statistical graphic as a composition of layers:

ggplot2 and the Tidyverse

Hadley Wickham implemented these ideas in ggplot2 (first released in 2007) for R:

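library(ggplot2)

# `data` is assumed to be a data frame with columns income, life_exp, continent, pop, year
ggplot(data, aes(x = income, y = life_exp, color = continent, size = pop)) +
  geom_point() +       # geometry: one point per row
  scale_x_log10() +    # scale: income on a log axis
  facet_wrap(~year)    # facets: one panel per year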

This declarative approach to visualization has influenced:


Part V: Data Structures - From Biology to Networks

Biological Data: From Lists to Images

Linnaeus and Taxonomy (1735)

Carl Linnaeus created the modern system of biological classification—hierarchical tree structures that organize all living things:

Kingdom → Phylum → Class → Order → Family → Genus → Species

This is a fundamental tree data structure—the same structure used in file systems, XML/HTML, and decision trees.
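
As a minimal sketch of that structure, the fragment below stores two species as nested dictionaries and walks the tree depth-first to recover each root-to-leaf path. The choice of species and the function name are illustrative.

```python
# A small taxonomy fragment stored as nested dictionaries: each key is a
# taxon and each value holds its children. An empty dict marks a leaf.
taxonomy = {
    "Animalia": {
        "Chordata": {
            "Mammalia": {
                "Primates": {"Hominidae": {"Homo": {"Homo sapiens": {}}}},
                "Carnivora": {"Felidae": {"Panthera": {"Panthera leo": {}}}},
            }
        }
    }
}

def paths_to_leaves(tree: dict, prefix: tuple = ()) -> list[tuple]:
    """Depth-first walk returning the root-to-leaf path of every species."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(paths_to_leaves(children, path))
        else:
            paths.append(path)
    return paths

for path in paths_to_leaves(taxonomy):
    print(" → ".join(path))
```

Humans and lions share the first three levels and diverge at the order, which is exactly what a shared prefix in a file path or a common ancestor in an HTML document expresses.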

DNA as Data: Rosalind Franklin’s Photo 51 (1952)

Rosalind Franklin’s X-ray crystallography image—Photo 51—revealed the helical structure of DNA. This single image contained the key information Watson and Crick needed to deduce the double helix.

The Data Journey:

What the Image Shows:

“Watson, Crick, and Wilkins repeatedly acknowledged that they could not have solved the structure without the crystallographic evidence.”

Franklin died in 1958, four years before the 1962 Nobel Prize was awarded to Watson, Crick, and Wilkins; the prize is not awarded posthumously. Her contributions went largely unrecognized for decades, and she has been called “the dark lady of DNA.”


Geographic Data: Maps as Data Structures

GIS: Geographic Information Systems

GIS represents geographic data in layers:

The Canada Geographic Information System (CGIS), developed in the 1960s, was the first true GIS. It enabled layered spatial analysis—overlaying maps to find patterns.
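
Overlay analysis can be sketched with two small grids. The example below assumes two hypothetical raster layers covering the same cells and intersects them cell by cell; the layer names and values are invented for illustration and do not come from CGIS.

```python
# Two hypothetical raster layers covering the same 3x4 grid of cells.
# 1 marks cells where the feature is present; the values are made up.
forest = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]
flood_zone = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]

def overlay_and(layer_a, layer_b):
    """Cell-by-cell intersection: 1 only where both layers are 1."""
    return [
        [a & b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(layer_a, layer_b)
    ]

forest_at_risk = overlay_and(forest, flood_zone)
for row in forest_at_risk:
    print(row)
print("cells in both layers:", sum(map(sum, forest_at_risk)))
```

Real GIS software also handles vector layers, coordinate systems, and differing resolutions, but the core operation is the same: align the layers, then combine them location by location.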

Modern Applications


Network Data: Graphs and Connections

Euler and the Seven Bridges of Königsberg (1736)

Leonhard Euler asked: Can you walk through Königsberg crossing each of its seven bridges exactly once?

His proof that it’s impossible founded graph theory—the mathematics of networks.
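
Euler's argument reduces to counting how many bridges touch each land mass: a walk that uses every bridge exactly once can exist only if zero or two land masses have an odd count. The sketch below applies that test; the single-letter labels for the four land masses are an illustrative convention.

```python
from collections import Counter

# The seven bridges as edges between four land masses. Labels are
# illustrative: A is the central island, B and C are the two river
# banks, and D is the eastern land mass.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

# Degree of a node = number of bridge ends that touch it.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_nodes = sorted(node for node, d in degree.items() if d % 2 == 1)

# In a connected graph, a walk using every edge exactly once requires
# zero or two odd-degree nodes.
walk_possible = len(odd_nodes) in (0, 2)

print("degrees:", dict(degree))
print("odd-degree land masses:", odd_nodes)
print("walk possible:", walk_possible)
```

All four land masses have an odd number of bridges, so the walk is impossible no matter where it starts.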

Network Data Structures

Nodes (vertices): Entities
Edges (links): Relationships between entities
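
In code, a network is commonly stored as an edge list, an adjacency list, or an adjacency matrix. The sketch below builds the latter two from the first for a small, hypothetical friendship network; the names are invented.

```python
from collections import defaultdict

# A hypothetical undirected friendship network stored as an edge list.
edges = [("Ana", "Ben"), ("Ana", "Cy"), ("Ben", "Cy"), ("Cy", "Dee")]

# Adjacency list: each node maps to the set of nodes it is linked to.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)  # undirected: record the link in both directions

# Adjacency matrix: rows and columns follow the sorted node order,
# with 1 where an edge exists and 0 elsewhere.
nodes = sorted(adjacency)
matrix = [[1 if v in adjacency[u] else 0 for v in nodes] for u in nodes]

for node in nodes:
    print(f"{node}: {sorted(adjacency[node])}")
for node, row in zip(nodes, matrix):
    print(f"{node:>3} {row}")
```

Adjacency lists stay compact when most pairs of nodes are unconnected, while a matrix makes it cheap to test whether any particular pair is linked.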

Types of Networks:

Famous Network Datasets


DEEP DIVE: W.E.B. Du Bois’s Data Portraits

Visualizing Black America at the 1900 Paris Exposition

The Story

In 1900, 37 years after the Emancipation Proclamation, sociologist W.E.B. Du Bois traveled to Paris with a radical mission: to show the world the progress of African Americans through the language of data visualization.

The Context: Post-Reconstruction America

By 1900, the promises of Reconstruction had been dismantled:

The Paris Exposition Universelle

The 1900 World’s Fair in Paris drew 50 million visitors. It showcased technological marvels like the Grande Roue (Ferris wheel), moving sidewalks, and talking films.

Thomas J. Calloway, an African American educator, secured space for “The Exhibit of American Negroes”—and invited Du Bois to create a statistical portrait of Black America.

Du Bois’s Vision

Du Bois, then a professor at Atlanta University, saw an opportunity. He would counter racist narratives not with emotion but with data—irrefutable evidence of progress despite systematic oppression.

With his students at Atlanta University, Du Bois created 63 hand-drawn data visualizations—what he called “data portraits.”

The Visualizations

Two Series

Series 1: National/International View. “A Series of Statistical Charts Illustrating the Condition of the Descendants of Former African Slaves Now in Residence in the United States of America”

Series 2: The Georgia Negro. A detailed focus on Georgia’s Black population: demographics, economics, education, property.

Design Innovation

Du Bois’s visualizations were decades ahead of their time:

Bold Colors:

Novel Chart Types:

Data Sources:

Examples of Charts

“City and Rural Population” (1890)

“Assessed Value of Household and Kitchen Furniture Owned by Georgia Negroes”

“Occupations of Negroes and Whites in Georgia”

“Illiteracy”

The Message

These weren’t neutral statistics. Du Bois deliberately chose data that demonstrated:

  1. Progress: Despite slavery and oppression, Black Americans were building wealth, education, and institutions
  2. Contribution: Black labor was essential to American prosperity
  3. Humanity: Statistics humanized a population dehumanized by stereotypes

“The problem of the Twentieth Century is the problem of the color line.” — Du Bois, 1903

Awards and Reception

The exhibit won a Gold Medal at the Paris Exposition. International visitors saw evidence that contradicted American racist propaganda.

Rediscovery

The visualizations were shipped to the Library of Congress and largely forgotten for over a century. They were rediscovered and published in full color in 2018:

W.E.B. Du Bois’s Data Portraits: Visualizing Black America (Princeton Architectural Press)

Legacy

Du Bois’s work anticipates:

Key Insight: Data visualization is always political. The choice of what to measure, how to present it, and who the audience is shapes the message. Du Bois used this power deliberately—to advocate for justice.


Lecture Plan and Hands-On Exercise

Lecture Plan: “Data Portraits” (75-90 minutes)

Part 1: The Power of Visualization (20 min)

Opening Hook: Show Anscombe’s Quartet

Key Message: “Always visualize your data.”

Transition: But visualization isn’t just about understanding data—it’s about communication and persuasion.

Part 2: The Du Bois Story (25 min)

Historical Context:

The Challenge:

The Visualizations:

Discussion Questions:

Part 3: Design Principles (20 min)

From Bertin’s Visual Variables:

From Tufte’s Principles:

Modern Tools:

Part 4: Hands-On Exercise Introduction (10 min)

Introduce the exercise and available datasets.


Hands-On Exercise: “Creating Your Own Data Portrait”

Objective

Create a data visualization that tells a story about a social, economic, or environmental issue—inspired by Du Bois’s approach.

Duration

2-3 hours (can be homework)

Materials Provided

Datasets (choose one):

  1. Modern Census Data
    • US Census API: Education, income, housing by race/ethnicity
    • World Bank indicators: Development data by country
  2. Historical Recreation
    • Du Bois’s original Georgia data (available in R’s duboisr package)
    • Recreate one of his charts with modern tools
  3. Local Data
    • Lebanon/AUB-specific datasets
    • Regional economic or social indicators

Tasks

Task 1: Data Exploration (30 min)

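import pandas as pd

# Load your chosen dataset
data = pd.read_csv('your_data.csv')

# Explore: first rows, summary statistics, then column types and missing values
print(data.head())
print(data.describe())
data.info()  # info() prints its report directly and returns None

# What story could this data tell?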

Task 2: Sketch Your Story (20 min)

Task 3: Create the Visualization (60 min)

Using Python (matplotlib/seaborn) or R (ggplot2):

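import matplotlib.pyplot as plt
import seaborn as sns

# Color palette inspired by Du Bois
dubois_colors = ['#dc143c', '#00aa00', '#000000', '#ffc107', '#7b3f00']

# Apply the grid style and palette before creating the figure.
# (The 'seaborn-whitegrid' matplotlib style name was removed in recent
# matplotlib releases, so the style is set through seaborn instead.)
sns.set_theme(style='whitegrid', palette=dubois_colors)
fig, ax = plt.subplots(figsize=(10, 8))

# Example: Recreate a Du Bois-style chart
# Your visualization code here (draw on `ax`)

# Add Du Bois-inspired styling
ax.set_title('YOUR TITLE HERE', fontsize=16, fontweight='bold')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')

fig.tight_layout()
fig.savefig('my_data_portrait.png', dpi=300)
plt.show()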

Task 4: Reflection (20 min)

Write a short paragraph (150-200 words) answering:

  1. What story does your visualization tell?
  2. What design choices did you make and why?
  3. What might be misleading or missing from your visualization?
  4. How might different audiences interpret it differently?

Extension: Du Bois Challenge

Recreate one of Du Bois’s original visualizations using modern data:

Resources:

Evaluation Criteria

Criterion  | Excellent                             | Good                   | Needs Work
Story      | Clear, compelling narrative           | Clear but conventional | Unclear or no narrative
Design     | Thoughtful choices, minimal chartjunk | Adequate, some clutter | Cluttered, confusing
Accuracy   | Data represented truthfully           | Minor issues           | Misleading representation
Reflection | Deep engagement with choices          | Surface-level          | Missing or minimal

Recommended Resources

Books

Data Visualization Theory

History and Context

Practical Guides

Online Courses

Websites and Tools

Learning Resources

Tools

Datasets

Videos

TED Talks

YouTube Channels

Documentaries


References

Data Visualization History

W.E.B. Du Bois

Visual Perception

Modern Visualization


Document compiled for SCDS DATA 201: Introduction to Data Science I. Module 2: Data Structures and Data Visualization, “Telling Stories with Data”.