Module 1: Introduction to Data Science and Systems Thinking
“Understanding the Data Revolution”
Research Document for DATA 201 Course Development
Table of Contents
- Introduction: The Data Journey Framework
- Part I: The Ancient Origins of Data Collection
- Part II: The Birth of Modern Statistics
- Part III: Data Visualization Pioneers
- Part IV: The Computing Revolution
- Part V: From Data to Discovery - Landmark Stories
- Part VI: The Information Age
- Recommended Resources
- References
Introduction: The Data Journey Framework
Every great data story follows a journey:
- Collection - How was the data gathered? What motivated someone to count, measure, or record?
- Understanding/Modeling - How did they make sense of patterns? What tools and mental frameworks emerged?
- Prediction/Inference - What actions resulted? How did data change decisions, save lives, or transform society?
This module explores the human stories behind data science—not just dates and discoveries, but the lives of people who saw patterns where others saw chaos, who counted when others assumed, and who visualized truths that words alone could not convey.
Part I: The Ancient Origins of Data Collection
The Babylonian Census (4000 BCE)
The First Data Collection
The census is older than Egyptian, Greek, and Roman civilizations. Around 4000 BCE, the Babylonians conducted what may be humanity’s first systematic data collection—a census to determine how much food they needed to find for each member of the population.
The Data Journey:
- Collection: Clay tablets recorded population counts and resource needs
- Understanding: Simple ratios of food to people
- Prediction: Planning agricultural production and distribution
Evidence of these records exists today in the British Museum—clay tiles that represent humanity’s first attempt to transform social reality into manageable numbers.
Ancient Egypt: Building Pyramids with Data (2500 BCE)
The Egyptians used censuses not for democratic representation, but for monumental engineering. From around 2,500 BCE, they counted their population to:
- Calculate the labor force needed to build pyramids
- Plan land redistribution after the annual flooding of the Nile
This represents one of the earliest examples of data-driven project management—using population statistics to coordinate massive construction projects.
The Pharaoh Amasis Census (570 BCE)
The oldest known census record that survives comes from the reign of Pharaoh Amasis around 570 BCE. It was ordered for two practical purposes:
- Taxation - knowing who could be taxed
- Military planning - knowing who could be called upon in times of war
Key Insight: From the very beginning, data collection was tied to power—the power to tax, the power to mobilize armies, and the power to plan large-scale projects.
The Roman Census: Where the Word Comes From
The word “census” originates from the Latin censere, meaning “to estimate.” The Roman census was arguably the most sophisticated data collection system of the ancient world.
Key Facts:
- Instituted by King Servius Tullius in the 6th century BCE
- Initially counted ~80,000 arms-bearing citizens
- Conducted every five years (quinquennial)
- Extended to the entire Roman Empire in 5 BCE
- Determined citizenship class, tax obligations, and military duties
The Data Journey:
- Collection: Official registers of citizens and their property
- Understanding: Classification systems (patricians, plebeians, etc.)
- Prediction: Military capacity, tax revenue projections
The Roman census wasn’t just counting—it was classification. Your census record determined your social class, your rights, and your obligations. Data became identity.
The Han Dynasty Census (2 CE): The World’s Most Accurate Ancient Count
The world’s oldest extant (surviving) census data comes from China during the Han Dynasty. Conducted in the fall of 2 CE, scholars consider it remarkably accurate:
- Total population: 59.6 million people
- Households: 12,366,470 families
- Recorded in official government archives
This was the largest population in the world at the time. The census focused on taxable families, revealing its fiscal purpose.
A Demographic Mystery: A later Han census in 140 CE recorded only 48 million people—an apparent decline of 11.6 million. Mass migrations to southern China are believed to explain this demographic shift, demonstrating how census data can reveal hidden historical movements.
The Domesday Book (1086): Medieval Big Data
After the collapse of Rome, systematic census-taking disappeared from Western Europe—until William the Conqueror.
In 1086, just 20 years after conquering England, William ordered the Great Survey—what would become known as the Domesday Book.
The Scale:
- Over 13,000 places recorded
- Every manor catalogued with:
  - Names of holders in 1066 (before conquest) and 1086 (after)
  - Dimensions and plowing capacity
  - Number of agricultural workers by type
  - Mills, fishponds, and amenities
  - Monetary value in pounds
The Method:
- 7-8 panels of commissioners divided the country into circuits
- Each commission carried standardized questions
- Juries of barons and villagers provided testimony
- The survey was carried out “against great popular resentment”
Why “Domesday”? The English held this book in awe. “Doom” was the Old English term for judgment—like the Last Judgment, this record was definitive and unchallengeable. Once recorded in Domesday Book, a property holding was legally established.
Legacy:
- Oldest government record held by the UK National Archives
- First printed in 1783
- Made available online in 2011 (Open Domesday)
- For most English villages, Domesday is literally the starting point of recorded history
Part II: The Birth of Modern Statistics
John Graunt and the Bills of Mortality (1662)
The Founding Father of Demography
John Graunt was a London draper—a cloth merchant with no formal scientific training. Yet his 1662 book Natural and Political Observations Made Upon the Bills of Mortality founded three fields simultaneously:
- Demography (population studies)
- Epidemiology (disease patterns)
- Vital statistics (births, deaths, health data)
The Bills of Mortality
Since 1593, London had published weekly Bills of Mortality—documents recording births, deaths, and causes of death by parish. These were printed on Thursdays and distributed throughout the city. They were originally created during plague outbreaks so citizens could track the disease’s spread.
What Graunt Did Differently:
Where others saw mere lists of the dead, Graunt saw patterns waiting to be discovered. He:
- Created the first life table - Predicting what percentage of people would survive to each age
- Discovered sex ratio patterns - More males are born, but males die at higher rates, equalizing the adult population
- Identified the “urban penalty” - City dwellers died younger than rural populations
- Documented “excess deaths” during epidemics - The statistical fingerprint of disease outbreaks
- Developed sampling methods - Using ratios to estimate total populations from partial data
The Revolutionary Insight:
“The originality in his Observations was phenomenal. The new deep perception that Graunt presented was the value of population-level analysis. Healers had always thought about causes of illness and death in individuals, but no one before him had studied community-wide patterns.”
Graunt’s work earned him election to the Royal Society in 1662, endorsed by King Charles II himself. A cloth merchant, through systematic analysis of data, joined the most elite scientific body in England.
Sources
- PMC - John Graunt F.R.S. (1620-74): The founding father of human demography
- Britannica - John Graunt
- PMC - Epidemiology’s 350th Anniversary: 1662–2012
Adolphe Quetelet: The Average Man (1796-1874)
Adolphe Quetelet, a Belgian astronomer and mathematician, asked a radical question: Could statistical methods used for astronomy be applied to human society?
Social Physics
Quetelet founded what he called “social physics”—the application of mathematical analysis to social phenomena. His central concept was l’homme moyen (“the average man”):
“Quetelet postulated that for any population, there exists a typical or ‘average man,’ characterized by the mean values of measured variables that follow a normal distribution.”
Key Discoveries:
- Crime rates remained remarkably stable year to year
- Marriage rates, suicide rates, and other social phenomena followed predictable patterns
- Human measurements (height, chest circumference) followed the bell curve
The Crime Statistics Paradox:
Quetelet studied French crime statistics and discovered disturbing regularity:
“Thus we pass from one year to another with the sad perspective of seeing the same crimes reproduced in the same order.”
If crime rates are statistically predictable, what does that say about free will? Are criminals making individual choices, or are they products of social forces?
The BMI Origin Story:
In trying to characterize the “average man,” Quetelet developed what we now call the Body Mass Index (BMI)—originally the “Quetelet Index.” He was searching for an ideal human form and believed the average body represented optimal health and beauty.
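In its modern form, the Quetelet Index is body mass divided by the square of height. A minimal Python sketch of that formula follows; the function name and the example values are assumptions chosen for illustration, not historical measurements.

```python
def quetelet_index(mass_kg, height_m):
    """Return mass divided by height squared (kg/m^2), the modern BMI formula."""
    if mass_kg <= 0 or height_m <= 0:
        raise ValueError("mass and height must be positive")
    return mass_kg / height_m ** 2


# Illustrative values only, not data from Quetelet's studies.
example_index = quetelet_index(70.0, 1.75)
print(f"Quetelet index: {example_index:.1f} kg/m^2")
```

Because height is squared, the index summarizes body proportions at the population level, which was Quetelet's interest, rather than measuring an individual's health directly.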
Controversy:
Quetelet’s work sparked fierce debate:
- Determinism vs. Free Will: If social statistics are predictable, are individuals truly free?
- Auguste Comte, who originally coined “social physics,” was so upset that Quetelet appropriated his term that he invented the word “sociology” to distinguish his approach
- Quetelet’s concept of an ideal “average man” later influenced eugenics, particularly Francis Galton’s work
Sources
- Wikipedia - Adolphe Quetelet
- Britannica - Adolphe Quetelet
- PMC - Quetelet and the emergence of the behavioral sciences
Francis Galton: Genius and Darkness (1822-1911)
Francis Galton was Charles Darwin’s cousin—a polymath who made fundamental contributions to statistics while pursuing deeply troubling goals.
Statistical Contributions
Galton developed or discovered:
- Regression to the mean - Observing that extreme values tend to be followed by more moderate ones (originally studying sweet pea seeds)
- Correlation - Measuring how two variables move together (first calculated correlation coefficients by comparing arm length to height)
- Measures of spread - Building on earlier work on statistical dispersion (Karl Pearson later coined the term “standard deviation”)
The Sweet Pea Experiment:
Galton noticed something odd when breeding sweet peas. If he selected very large seeds and planted them, the offspring seeds were large—but not as large as the parents. They “regressed” toward the average.
This wasn’t a flaw in his experiment. It was a fundamental statistical principle that applies everywhere: extremely tall parents tend to have tall (but not quite as tall) children; exceptional performance tends to be followed by merely good performance.
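Regression to the mean can be reproduced with a small simulation. The sketch below is a toy model, not Galton's data or method: it assumes each observed seed size is an inherited component plus independent noise, and every constant in it is hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

POP_MEAN = 100.0      # hypothetical average seed size, arbitrary units
INHERITED_SD = 10.0   # assumed spread of the inherited component
NOISE_SD = 10.0       # assumed spread of the non-inherited component


def observed_size(inherited):
    """Observed size = inherited component plus independent random noise."""
    return inherited + random.gauss(0.0, NOISE_SD)


pairs = []
for _ in range(20_000):
    inherited = random.gauss(POP_MEAN, INHERITED_SD)
    # Parent and offspring share the inherited part but get separate noise.
    pairs.append((observed_size(inherited), observed_size(inherited)))

# Keep the 5% of pairs whose parent seed was largest.
pairs.sort(key=lambda pair: pair[0], reverse=True)
selected = pairs[:1_000]

mean_parent = statistics.fmean(parent for parent, _ in selected)
mean_offspring = statistics.fmean(child for _, child in selected)

print(f"selected parents: {mean_parent:.1f}")
print(f"their offspring:  {mean_offspring:.1f}")
print(f"population mean:  {POP_MEAN:.1f}")
```

The offspring of the selected parents average above the population mean but below their parents, because part of what made the parents extreme was noise that is not passed on.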
Fingerprint Identification
Galton collected fingerprints in his anthropometric laboratory and proved:
- Fingerprints remain constant throughout life
- Fingerprints can serve as unique identifiers
- He developed classification characteristics
The Galton-Henry system of fingerprint classification was published in 1900 and adopted by Scotland Yard in 1901. It spread worldwide and remains the basis for forensic fingerprint identification.
The Dark Legacy: Eugenics
In 1883, Galton coined the term “eugenics”—from the Greek for “well-born.” After reading his cousin Darwin’s Origin of Species, Galton became convinced that humanity could be improved through selective breeding.
“He had in mind a purposeful breeding program, similar to agricultural animal husbandry.”
Galton’s ideas led to:
- Involuntary sterilization programs in the USA, Canada, and Scandinavia
- The Immigration Act of 1924 in the US (restricting immigration from Southern and Eastern Europe)
- Nazi racial policies
The Statistical-Eugenics Connection:
This dark history is important for data science students to understand. Statistics and data analysis are not neutral tools—they can be weaponized to justify prejudice and oppression. The same person who gave us correlation and regression also laid the groundwork for scientific racism.
Other prominent statisticians who supported eugenics:
- Karl Pearson
- R.A. Fisher
Sources
- Wikipedia - Francis Galton
- MacTutor History of Mathematics - Francis Galton
- Significance - The troubling legacy of Francis Galton
Karl Pearson: The Institutionalization of Statistics (1857-1936)
Karl Pearson established statistics as an academic discipline, founding the world’s first university statistics department at University College London in 1911.
Key Contributions
- Chi-squared test (1900) - A method to test whether observed data differ significantly from expected values
- Pearson correlation coefficient - The standard measure of linear correlation
- Standard deviation - Pearson coined the term in an 1893 lecture
- Standardized methods for estimator errors
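The chi-squared statistic sums, over all categories, the squared gap between observed and expected counts divided by the expected count. The sketch below computes only that statistic; the die-roll counts are hypothetical and the function name is an assumption.

```python
def chi_squared_statistic(observed, expected):
    """Return the sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 60 rolls of a die; a fair die expects 10 per face.
observed_counts = [8, 12, 9, 11, 10, 10]
expected_counts = [10] * 6

statistic = chi_squared_statistic(observed_counts, expected_counts)
print(f"chi-squared statistic: {statistic:.2f}")
```

To turn the statistic into a decision, it is compared with the chi-squared distribution for the appropriate degrees of freedom (here 5, one fewer than the number of faces). A small value like this one gives no reason to doubt the fair-die hypothesis.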
The Gresham College Lectures
From 1891 to 1894, Pearson was Professor of Geometry at Gresham College, delivering public lectures that attracted over 300 attendees. His lecture series included:
- “Geometry of Statistics” (1891-1892) - Comprehensive treatment of graphical representation
- “Laws of Chance” (1892-1893) - Probability theory and correlation
These lectures transformed statistics from a scattered set of techniques into a coherent mathematical discipline.
The Biometrika Journal
In 1901, with W.F.R. Weldon and Francis Galton, Pearson founded Biometrika—the first journal dedicated to mathematical statistics. He edited it until his death. This institutionalized statistics as a field with its own publication venue, peer review, and professional community.
R.A. Fisher and “The Lady Tasting Tea” (1935)
Ronald A. Fisher is considered one of the greatest statisticians of the 20th century. His 1935 book The Design of Experiments introduced the concept of the null hypothesis through a charming story.
The Story
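At a tea party sometime in the 1920s (popular retellings say Cambridge, though Fisher and Bristol both worked at Rothamsted Experimental Station at the time), a colleague named Muriel Bristol claimed she could tell whether milk or tea was added to the cup first. The scientists were skeptical: surely this was impossible!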
Her future husband, William Roach, suggested Fisher design an experiment. Fisher proposed:
- Eight cups of tea
- Four with milk added first, four with tea added first
- Presented in random order
- Bristol, told that there were four cups of each kind, must correctly classify all eight
The Statistical Framework
Fisher asked: What is the probability she could identify all eight correctly by pure chance?
Answer: 1/70, or about 1.4%
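The 1/70 figure comes from counting: there are 70 ways to choose which four of the eight cups are "milk first," and only one choice is entirely right. A short Python sketch of that arithmetic follows, assuming the taster knows there are four cups of each kind; the variable and function names are illustrative.

```python
from math import comb

CUPS = 8
MILK_FIRST = 4

# Number of ways to pick which 4 of the 8 cups are labeled "milk first".
ways = comb(CUPS, MILK_FIRST)
p_all_correct = 1 / ways


def p_exact_hits(k):
    """Probability of picking exactly k true milk-first cups by pure guessing."""
    return comb(MILK_FIRST, k) * comb(CUPS - MILK_FIRST, MILK_FIRST - k) / ways


print(f"ways to choose: {ways}")
print(f"P(all correct by chance): {p_all_correct:.4f}")
for k in range(MILK_FIRST + 1):
    print(f"P(exactly {k} right): {p_exact_hits(k):.4f}")
```

The same counting shows why a near miss is weak evidence: getting at least three of the four right happens by chance 17 times in 70, or about 24% of the time.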
This simple experiment established:
- The null hypothesis - “The subject has no ability to distinguish the teas”
- Randomization in experimental design
- Statistical significance - We reject the null only if the probability of the result occurring by chance is sufficiently low
Did It Work?
According to Fisher’s colleague H. Fairfield Smith, Bristol correctly identified all eight cups.
Fisher’s Exact Test
The mathematical method Fisher developed for this problem became known as Fisher’s Exact Test, still used today when sample sizes are small and the chi-squared approximation is unreliable.
Part III: Data Visualization Pioneers
William Playfair: The Scottish Scoundrel (1759-1823)
William Playfair invented most of the statistical graphics we use today—and led one of the most colorful lives imaginable.
Career Trajectory
Playfair was, in turn:
- Millwright
- Engineer
- Draftsman (assistant to James Watt, developer of the improved steam engine)
- Accountant
- Inventor
- Silversmith
- Merchant
- Investment broker
- Economist
- Statistician
- Pamphleteer
- Land speculator
- Banker
- Intelligence officer
- Convicted criminal
- Editor
- Blackmailer
- Journalist
He was present at the storming of the Bastille in 1789.
Graphical Inventions
1786: The Commercial and Political Atlas
- 43 time-series line charts showing England’s trade over time
- The first bar chart (invented because he lacked continuous data for Scotland!)
“Much to Playfair’s frustration, when he tried to plot trade data for Scotland, he found that there were a lot of records missing, meaning he couldn’t plot a time series as usual. And so the bar chart was born.”
Playfair himself considered bar charts “inferior in utility” to line charts!
1801: Statistical Breviary
- The first pie chart - Showing the Turkish Empire’s landholdings across Europe, Asia, and Africa
- The first color-coded chart - Using red for Europe, green for Asia, yellow for Africa
Why Playfair Succeeded
Playfair’s training with James Watt as an engineering draftsman gave him skills in technical drawing. But more importantly, Playfair had an intuitive understanding of human perception:
“William Playfair… had an instinctive understanding of our psychological capabilities and, moreover, understood how to exploit them. He anticipated many ideas that are the focus of work in experimental psychology to this day.”
Legacy Rediscovered
Playfair’s work was often neglected after his death, corresponding to periods when statistical graphics fell out of fashion. With the rise of computer-based data visualization, interest has surged. In 2010, a copy of his Commercial and Political Atlas sold at Christie’s for $43,750.
Sources
- Wikipedia - William Playfair
- Atlas Obscura - The Scottish Scoundrel Who Changed How We See Data
- History of Data Visualization - Chapter 5
Florence Nightingale: The Lady with the Lamp and the Data (1820-1910)
Florence Nightingale is remembered as the founder of modern nursing. Less known is her role as a pioneer of data visualization and statistical advocacy.
The Crimean War Crisis
In 1854, Nightingale led a team of nurses to care for British soldiers in the Crimean War. What she found horrified her:
- Hospitals overrun with disease
- Mortality rates exceeding 40%
- Soldiers dying not from wounds, but from cholera, dysentery, and typhus
- No systematic record-keeping
Her Response: She immediately began counting things. “She recognized the counting system was in complete shambles. She was very much in favor of fact-based statistics.”
The Rose Diagram (Coxcomb Chart)
In 1858, Nightingale created her famous “polar area diagram”—often called the “coxcomb” or “rose diagram”:
- A circular chart divided into 12 slices (months)
- The area of each wedge, measured from the center, represents the death rate
- Color-coding showed cause:
  - Blue: Deaths from preventable diseases
  - Red: Deaths from battle wounds
  - Black: Other causes
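In a polar area diagram the value is carried by the area of each wedge, so the radius has to grow with the square root of the value. The sketch below shows that relationship for a chart of 12 equal-angle wedges; the function names and the sample values are assumptions, not Nightingale's figures.

```python
import math


def wedge_radius(value, n_wedges=12):
    """Radius of an equal-angle wedge whose area equals `value`."""
    if value < 0:
        raise ValueError("value must be non-negative")
    angle = 2 * math.pi / n_wedges          # angle of one wedge in radians
    return math.sqrt(2 * value / angle)     # from area = 0.5 * angle * r^2


def wedge_area(radius, n_wedges=12):
    """Area of an equal-angle wedge with the given radius."""
    angle = 2 * math.pi / n_wedges
    return 0.5 * angle * radius ** 2


# Hypothetical monthly values: quadrupling the value only doubles the radius.
for value in (25.0, 100.0, 400.0):
    print(f"value {value:6.1f} -> radius {wedge_radius(value):.2f}")
```

Drawing the radius in direct proportion to the value would instead exaggerate large months, since the eye reads the wedge's area.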
The Shocking Truth:
The diagram revealed that most soldiers who died during the Crimean War died of sickness rather than of wounds. After sanitary improvements were made (March 1855), death rates plummeted.
Collaboration with William Farr
Nightingale worked closely with William Farr, a founder of medical statistics. Together, they compiled rigorous data from battlefield hospitals.
“She is famous for using graphical displays of her data to give the statistics context, realizing early on that officials would likely ignore numbers without a picture to get their attention.”
Impact
Within months of publication:
- The issue of overcrowded barracks was debated in Parliament
- Reforms were enacted
- Data collection improved dramatically
- Mortality from preventable disease among soldiers dropped below civilian levels
The Data Journey:
- Collection: Systematic counting of deaths by cause, location, and time
- Understanding: Pattern recognition through visualization
- Prediction/Action: Successful advocacy for sanitary reforms
Sources
- Scientific American - How Florence Nightingale Changed Data Visualization Forever
- McGill - How Florence Nightingale Used Data Visualization to Save Lives
- PMC - Florence Nightingale: An Unexpected Master of Data
John Snow: The Map That Stopped an Epidemic (1813-1858)
In 1854, cholera struck London’s Soho neighborhood with devastating speed. Dr. John Snow’s investigation became a founding moment of epidemiology—and a landmark in data visualization.
The Setting
London in 1854 was the world’s largest city (2.5 million people) with:
- No public health departments
- Inadequate sewage systems
- The Thames serving as both water source and sewer
- A perfect environment for epidemic disease
The prevailing theory blamed “miasma”—bad air—for cholera transmission.
Snow’s Method
Snow did something unprecedented: he mapped the deaths.
- Collected addresses of 578 cholera deaths
- Plotted them on a street map of Soho
- Also marked the locations of 13 water pumps
The Pattern:
Deaths clustered densely around one pump—the Broad Street pump. Areas served by other pumps had far fewer deaths.
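The core of this kind of spatial analysis is assigning each death to its nearest pump and counting. The sketch below uses made-up coordinates and pump names, not Snow's dataset, and it assumes straight-line distance, whereas walking distance along streets would be closer to how residents actually fetched water.

```python
from collections import Counter
from math import hypot

# Hypothetical pump locations on an arbitrary grid (not Snow's data).
pumps = {
    "Pump A": (0.0, 0.0),
    "Pump B": (10.0, 0.0),
    "Pump C": (0.0, 10.0),
}

# Hypothetical death locations on the same grid.
deaths = [(1, 1), (2, 0), (0, 2), (1, 3), (9, 1), (3, 2), (1, 8)]


def nearest_pump(point):
    """Return the name of the pump closest to `point` by straight-line distance."""
    return min(
        pumps,
        key=lambda name: hypot(point[0] - pumps[name][0], point[1] - pumps[name][1]),
    )


deaths_by_pump = Counter(nearest_pump(point) for point in deaths)
for name, count in deaths_by_pump.most_common():
    print(f"{name}: {count} deaths")
```

A lopsided tally like the one this toy data produces is the numerical counterpart of the cluster visible on the map.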
The Pump Handle
Snow presented his findings to local officials. On September 8, 1854, they removed the handle from the Broad Street pump. The epidemic slowed.
Important Caveat: Snow acknowledged that people fleeing the area may also have reduced deaths. The epidemic was already declining when the pump was disabled. But his analysis provided the first compelling evidence for waterborne transmission of cholera.
Legacy
- Snow is called the “father of epidemiology”
- His map pioneered spatial analysis in public health
- His methods anticipated GIS (Geographic Information Systems) by over a century
- Modern data scientists still use his dataset as a teaching example
Sources
- Harvard Online - PredictionX: John Snow and the Cholera Epidemic of 1854
- Britannica - John Snow cholera
- arXiv - Revisiting John Snow’s Cholera Map
Charles Minard: The Best Statistical Graphic Ever Drawn (1781-1870)
Charles Minard was a French civil engineer who, after retirement at age 70, devoted himself to creating “graphic tables and figurative maps.”
The Napoleon Graphic (1869)
At age 88, Minard created a visualization of Napoleon’s 1812 Russian campaign that information designer Edward Tufte said “may well be the best statistical graphic ever drawn.”
Six Variables in Two Dimensions:
- Army size - The thickness of the line (1 mm = 10,000 men)
- Geographic location - Latitude and longitude
- Direction of travel - Tan line advancing, black line retreating
- Date - Connected to the temperature scale
- Temperature - Scale at bottom showing the brutal Russian winter
- Terrain - Rivers crossed, cities passed
The Story in Numbers:
- June 1812: 422,000 soldiers enter Russia
- Moscow: ~100,000 remain
- December 1812: 10,000 return
The graphic shows the army shrinking as it advances and dying in the retreat through the Russian winter. The temperature scale at the bottom, which Minard marked in degrees Réaumur, falls to -30° (about -37.5°C).
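The numbers above translate directly into the graphic's geometry. The sketch below applies the stated scale of 1 mm per 10,000 men to the three troop counts and computes the fraction of the army that returned; the dictionary keys and function name are illustrative.

```python
MEN_PER_MM = 10_000  # scale given for the graphic: 1 mm of width per 10,000 men

# Troop counts as listed in the text.
troops = {
    "entering Russia": 422_000,
    "at Moscow": 100_000,
    "returning": 10_000,
}


def band_width_mm(men):
    """Width of the flow band, in millimetres, for a given army size."""
    return men / MEN_PER_MM


widths = {stage: band_width_mm(men) for stage, men in troops.items()}
survival_rate = troops["returning"] / troops["entering Russia"]

for stage, width in widths.items():
    print(f"{stage}: {width:.1f} mm")
print(f"share of the army that returned: {survival_rate:.1%}")
```

A band that starts about 42 mm wide and ends at 1 mm is what makes the loss legible at a glance.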
“Brutal Eloquence”
French physiologist Étienne-Jules Marey praised the graphic’s “brutal eloquence, which seems to defy the pen of the historian.”
Why It Works:
The power comes from combining statistical reality (the numbers) with human geography (the actual route) and environmental context (the temperature). You don’t just see that many soldiers died—you see where they died and why.
Sources
- National Geographic - The Underappreciated Man Behind the “Best Graphic Ever Produced”
- Wikipedia - Charles Joseph Minard
- Age of Revolution - Flow Map of Napoleon’s Invasion of Russia
Part IV: The Computing Revolution
Ada Lovelace: The First Programmer (1815-1852)
Ada Lovelace, daughter of the poet Lord Byron, is credited with writing the first computer program—a century before electronic computers existed.
Meeting Babbage
In June 1833, at age 17, Ada met Charles Babbage at a party. Babbage showed her his prototype Difference Engine—a mechanical calculator. He was so impressed by her intellect that he called her “The Enchantress of Number.”
The Analytical Engine
Babbage later designed a more ambitious machine: the Analytical Engine—a general-purpose mechanical computer that was never built but anticipated modern computer architecture.
In 1843, Ada published her translation of an article about the Analytical Engine by the Italian engineer Luigi Menabrea (written in French), adding her own notes that were about three times as long as the original article.
Note G: The First Algorithm
Ada’s “Note G” described a method for the Analytical Engine to calculate Bernoulli numbers. This is recognized as the first published computer algorithm.
“Bernoulli numbers can be calculated in many ways, but Lovelace deliberately chose an elaborate method in order to demonstrate the power of the engine.”
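For readers who want to see what the computation produces, here is a short modern sketch that generates Bernoulli numbers with exact fractions. It uses a standard textbook recurrence, not the sequence of operations Lovelace laid out in Note G, and her numbering of the Bernoulli numbers differs from the modern indexing used here.

```python
from fractions import Fraction
from math import comb


def bernoulli_numbers(n):
    """Return [B_0, ..., B_n] from the recurrence
    sum_{k=0}^{m} C(m+1, k) * B_k = 0 for m >= 1, with B_0 = 1."""
    numbers = [Fraction(1)]
    for m in range(1, n + 1):
        partial = sum(comb(m + 1, k) * numbers[k] for k in range(m))
        # Solve the recurrence for B_m; C(m+1, m) equals m + 1.
        numbers.append(-partial / (m + 1))
    return numbers


for index, value in enumerate(bernoulli_numbers(8)):
    print(f"B_{index} = {value}")
```

Exact fractions matter here: the values alternate in sign and grow quickly, so floating-point arithmetic loses accuracy as the index increases.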
Visionary Insight
Ada saw something that even Babbage missed:
“She developed a vision of the capability of computers to go beyond mere calculating or number-crunching… Lovelace was the first to point out the possibility of encoding information besides mere arithmetical figures, such as music, and manipulating it with such a machine.”
She understood that computers could manipulate symbols, not just numbers—the fundamental distinction between calculation and computation.
Legacy
- The Ada programming language is named in her honor
- Ada Lovelace Day (second Tuesday of October) celebrates women in STEM
- She established that programming is distinct from hardware engineering
Sources
- Wikipedia - Ada Lovelace
- Britannica - Ada Lovelace: The First Computer Programmer
- Computer History Museum - Ada Lovelace
Herman Hollerith: The Punch Card Revolution (1860-1929)
The 1880 US Census took eight years to tabulate. Projections warned the 1890 census might not be finished before the 1900 census began!
The Problem
The US Constitution requires a census every decade. With fewer than 4 million Americans in 1790, this was manageable. With 63 million in 1890, it was becoming impossible.
Hollerith’s Solution
Herman Hollerith, frustrated by the tedious manual process while working at the Census Office, invented an electromechanical tabulating machine using punched cards.
Key Insight: A datum could be recorded by the presence or absence of a hole at a specific location on a card—essentially binary encoding.
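The hole-or-no-hole idea can be sketched in a few lines. The card layout below is hypothetical (the real 1890 card had many more positions and a different arrangement), and the function names are assumptions; the point is only that each position holds one yes/no fact and that tabulating means counting holes per position.

```python
# Hypothetical card layout: each position answers one yes/no question.
POSITIONS = ["male", "female", "married", "single", "foreign_born"]


def punch_card(attributes):
    """Return a list of 0/1 flags: 1 where a hole is punched, 0 where it is not."""
    return [1 if name in attributes else 0 for name in POSITIONS]


def tabulate(cards):
    """Count punched holes per position, as a tabulator's counters would."""
    totals = [0] * len(POSITIONS)
    for card in cards:
        for i, hole in enumerate(card):
            totals[i] += hole
    return dict(zip(POSITIONS, totals))


# Three made-up census records.
cards = [
    punch_card({"male", "married"}),
    punch_card({"female", "single"}),
    punch_card({"male", "single", "foreign_born"}),
]
totals = tabulate(cards)
print(totals)
```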
The 1888 Competition
The Census Office held a competition. Three systems were tested:
- Competitor 1: 144.5 hours to capture data
- Competitor 2: 100.5 hours
- Hollerith: 72.5 hours
For data preparation, Hollerith logged 5.5 hours versus 44.5 and 55.5 for competitors.
The 1890 Census Success
Results:
- Completed the count in six months (vs. eight years for 1880)
- Full tabulation finished in two years
- Saved the Census Office $5 million and two years of labor
The Road to IBM
- 1896: Hollerith founded the Tabulating Machine Company
- 1911: Merged into Computing-Tabulating-Recording Company (CTR)
- 1924: CTR renamed International Business Machines Corporation (IBM)
Legacy:
Hollerith’s punched card system dominated data processing for nearly a century. It introduced:
- Mechanized binary code
- Semi-automatic data processing
- The foundation for the computer age
Sources
- IBM - The punched card tabulator
- Computer History Museum - Making Sense of the Census
- Columbia Magazine - How Herman Hollerith Helped Launch the Information Age
The Women of ENIAC: Hidden Figures of Computing (1940s)
Before “computer” meant a machine, it meant a job description.
Human Computers
During World War II, women—often with mathematics degrees—were hired to perform ballistic calculations by hand. They could be paid much less than men with comparable training.
At the University of Pennsylvania’s Moore School, 200 female computers calculated artillery-firing tables for the US Army. Even so, one table took about a month to complete.
The ENIAC Project
The ENIAC (Electronic Numerical Integrator and Computer) was the first general-purpose, programmable, all-electronic computer—a secret US Army project with 18,000 vacuum tubes.
Out of approximately 100 human computers, six women were chosen to program ENIAC:
- Jean (Jennings) Bartik
- Betty (Snyder) Holberton
- Frances (Bilas) Spence
- Kay (McNulty) Mauchly
- Marlyn (Wescoff) Meltzer
- Ruth (Lichterman) Teitelbaum
Programming Without Manuals
“There were no manuals available and ‘programming’, as we know it today, didn’t yet exist—it was much more physical. Not only did the ‘ENIAC six’ have to correctly wire each cable they had to fully understand the machine’s underlying blueprints and electronic circuits.”
They taught themselves, learning by trial and error, sometimes crawling inside the machine to fix broken wires.
Hidden from History
When ENIAC was presented to the press in 1946, the six women programmers were not mentioned. Programming was seen as “subprofessional” women’s work—the hardware was considered important, not the software.
A museum photo later labeled them as “just models hired to make the machine look better.”
Rediscovery
In the 1980s, a young programmer named Kathy Kleiman found the photo and refused to accept the “models” explanation. Her investigation revealed the truth.
Jean Bartik’s contributions went unrecognized for 40 years. She and Betty Holberton later worked on UNIVAC with Grace Hopper.
Sources
- IEEE Spectrum - The Women Behind ENIAC
- National Women’s History Museum - Women and Computing
- Engineering and Technology History Wiki - Women Computers in WWII
Alan Turing: Breaking Enigma with Data (1912-1954)
Alan Turing’s work at Bletchley Park during World War II represents one of the greatest data analysis feats in history—though it remained secret for decades.
The Enigma Challenge
Germany’s Enigma machine could produce messages with 158 quintillion possible settings (later increased further). The Germans believed it unbreakable.
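The 158 quintillion figure can be reconstructed by multiplying the independent choices an operator made. The sketch below assumes the commonly described Army configuration, which the text above does not spell out: three rotors chosen in order from a set of five, 26 starting positions per rotor, and ten plugboard cables. Ring settings and other details are ignored.

```python
from math import factorial, perm

ROTORS_AVAILABLE = 5   # assumed size of the rotor set
ROTORS_USED = 3        # rotors installed at once
LETTERS = 26
PLUG_CABLES = 10       # assumed number of plugboard cables in use

# Ordered choice of 3 rotors out of 5.
rotor_orders = perm(ROTORS_AVAILABLE, ROTORS_USED)

# Each installed rotor can start at any of 26 positions.
rotor_positions = LETTERS ** ROTORS_USED

# Ways to pair up 20 of the 26 letters with 10 interchangeable cables.
unplugged = LETTERS - 2 * PLUG_CABLES
plugboard_settings = factorial(LETTERS) // (
    factorial(unplugged) * factorial(PLUG_CABLES) * 2 ** PLUG_CABLES
)

total_settings = rotor_orders * rotor_positions * plugboard_settings
print(f"{total_settings:,}")
```

Almost all of the total comes from the plugboard, which is why attacks that could sidestep the plugboard mattered so much.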
Polish Groundwork
Polish cryptanalysts, recognizing that Enigma required mathematical rather than linguistic analysis, achieved the first breaks in the 1930s. When Poland was invaded in 1939, they shared their work with Britain.
The Bombe
Within weeks of arriving at Bletchley Park in September 1939, Turing designed the “bombe”—an electromechanical device to search for Enigma settings.
The Method:
Turing’s approach relied on “cribs”—likely fragments of plaintext. If you could guess part of a message (like the standard greeting “Heil Hitler”), the bombe could test which Enigma settings would produce that result.
Breaking Naval Enigma
The German Navy used a more complex Enigma system. Turing and his team cracked it, allowing the Allies to track U-boat movements during the Battle of the Atlantic (1941-1943).
Scale and Secrecy
The Bletchley Park operation grew from hundreds of workers to 10,000 at peak in 1944.
The operation remained classified until 1974—nearly 30 years after the war ended. Only then did the world learn what had been achieved.
Impact
General Dwight D. Eisenhower said the ULTRA intelligence (derived from Enigma decrypts) “saved thousands of British and American lives and, in no small way, contributed to the speed with which the enemy was routed.”
Some historians estimate that Bletchley Park’s work shortened the war by about two years, potentially saving millions of lives.
Sources
- The National WWII Museum - Alan Turing and the Hidden Heroes of Bletchley Park
- Imperial War Museums - How Alan Turing Cracked The Enigma Code
- CIA - The Enigma of Alan Turing
Part V: From Data to Discovery - Landmark Stories
Galileo and the Pendulum (c. 1602)
The Birth of Quantitative Science
According to his student Viviani, young Galileo sat in the Pisa cathedral watching a lamp swing back and forth. Using his pulse to measure time, he noticed something remarkable: the period of swing was independent of how far the lamp swung.
This observation—isochronism—would revolutionize timekeeping and science itself.
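In modern terms, isochronism is the statement that the small-angle period of a pendulum depends on its length and on gravity but not on the amplitude. The sketch below uses the standard small-angle formula together with the first-order amplitude correction to show how small the dependence on swing size is; the 1 m length is a hypothetical example, and neither formula appears in Galileo's own work.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (approximate standard value)


def small_angle_period(length_m, g=G):
    """Period in seconds from T = 2*pi*sqrt(L/g); amplitude does not appear."""
    return 2 * math.pi * math.sqrt(length_m / g)


def amplitude_correction(amplitude_deg):
    """First-order relative increase in period for a finite swing amplitude."""
    theta0 = math.radians(amplitude_deg)
    return theta0 ** 2 / 16


length = 1.0  # hypothetical 1 m pendulum
base_period = small_angle_period(length)
for amplitude in (2, 5, 10, 20):
    period = base_period * (1 + amplitude_correction(amplitude))
    print(f"amplitude {amplitude:2d} deg -> period {period:.4f} s")
```

The correction stays under 1% even for a 20 degree swing, which is why a pulse-timed observation would see the period as constant; strictly, isochronism is an excellent approximation rather than an exact law.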
A New Way of Thinking
“Galileo quickly began questioning the Aristotelian approach. Where Aristotle had taken a qualitative and verbal approach, Galileo developed a quantitative and mathematical approach.”
Galileo’s key innovations:
- Measurement - Quantifying natural phenomena
- Hypothesis - Making testable predictions
- Mathematics - Describing nature with equations
- Reproducibility - Experiments others could repeat
Stephen Hawking wrote: “Galileo, perhaps more than any other single person, was responsible for the birth of modern science.”
The Pendulum Clock
Galileo never built a pendulum clock, but his principle enabled Christiaan Huygens to build the first one in 1657. Pendulum clocks remained the world’s most accurate timekeepers for nearly three centuries, until the 1930s.
The Data Journey:
- Collection: Measuring swing periods with pulse beats
- Understanding: Discovering isochronism
- Prediction: Enabling precise timekeeping that made navigation, astronomy, and experimental science possible
Sources
- Museo Galileo - Isochronism of the pendulum
Tycho Brahe and Kepler: The Partnership That Unlocked the Solar System
Tycho Brahe’s Obsession (1546-1601)
Tycho Brahe, a Danish nobleman, was dissatisfied with the accuracy of existing astronomical tables. He dedicated his life—and considerable wealth—to fixing this.
Resources: The King of Denmark gave Tycho:
- An entire island
- Money to build an observatory
- Funding that, by one estimate, amounted to 10% of Denmark’s gross national product at the time (other estimates are considerably lower)
Achievement: Twenty years of continuous observations of planetary positions, accurate to one arc-minute—a tremendous feat before the telescope.
Enter Johannes Kepler
In 1600, the young mathematician Johannes Kepler became Tycho’s assistant in Prague. Tycho had the data; Kepler had the mathematical skills to analyze it.
But Tycho mistrusted Kepler, fearing the young man might eclipse him. He revealed only partial data, assigning Kepler the particularly troublesome observations of Mars.
The Ironic Twist
Apart from Mercury, which is difficult to observe, Mars has the most eccentric orbit of the planets known at the time. In trying to fit circular orbits to Mars’s motion, as everyone assumed planets must move, Kepler repeatedly failed.
This failure forced him to a revolutionary conclusion: planetary orbits are ellipses, not circles.
“In a twist of irony, Brahe unwittingly gave Kepler the very part of his data that would enable Kepler to formulate the correct theory of the solar system, banishing Brahe’s own geocentric theory.”
Kepler’s Three Laws
From Tycho’s data, Kepler derived:
- Planets move in elliptical orbits with the Sun at one focus
- A line joining a planet to the Sun sweeps out equal areas in equal times
- The orbital period squared is proportional to the semi-major axis cubed
These laws enabled Newton to formulate universal gravitation.
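The third law is easy to check numerically. The sketch below uses rounded modern values for each planet's semi-major axis (in astronomical units) and orbital period (in Earth years); these are not Tycho's measurements, and the names in the code are invented for illustration. In these units the ratio T²/a³ should come out close to 1 for every planet.

```python
# Rounded modern values, NOT Tycho's data:
# (semi-major axis in astronomical units, orbital period in Earth years)
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def third_law_ratio(semi_major_axis_au: float, period_years: float) -> float:
    """Return T^2 / a^3, which Kepler's third law says is the same for all planets."""
    return period_years ** 2 / semi_major_axis_au ** 3


if __name__ == "__main__":
    for name, (axis_au, period_years) in PLANETS.items():
        print(f"{name:<8} T^2/a^3 = {third_law_ratio(axis_au, period_years):.4f}")
```

The ratios agree to within a fraction of a percent even though the periods span a factor of more than 100, which is the regularity Newton later explained with universal gravitation.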
The Data Journey:
- Collection: 20 years of painstaking observation
- Understanding: Mathematical analysis revealing elliptical orbits
- Prediction: The Rudolphine Tables—unprecedented accuracy in predicting planetary positions
Sources
- NASA - Orbits and Kepler’s Laws
- University of Virginia - Tycho Brahe and Johannes Kepler
- Physics World - Kepler and Tycho Brahe: the odd couple
Gauss and the Lost Planet Ceres (1801)
The Birth of Least Squares
On January 1, 1801, Italian astronomer Giuseppe Piazzi discovered a new celestial body between Mars and Jupiter—filling a gap predicted by the Titius-Bode Law. He named it Ceres.
The Problem
Astronomers could only observe Ceres for 41 days before it disappeared behind the Sun. When it emerged months later, they couldn’t find it. They had data on less than 1% of its orbit—how could they predict where it would reappear?
The Challenge: Solve Kepler’s complex non-linear equations for elliptical orbits with minimal data.
Gauss’s Solution
Carl Friedrich Gauss, then just 24 years old, applied new mathematical techniques to the problem. His prediction pointed to an entirely different region of the sky than other astronomers suggested.
On December 7, 1801, astronomer Franz Xaver von Zach found Ceres—within half a degree of where Gauss predicted.
The Least Squares Method
In his 1809 book Theoria Motus Corporum Coelestium, Gauss described the method of least squares: minimizing the sum of squared errors when fitting a model to data. Adrien-Marie Legendre had published the method first, in 1805.
“Gauss went beyond Legendre and succeeded in connecting the method of least squares with the principles of probability and to the normal distribution.”
Legacy: Least squares remains one of the most fundamental techniques in statistics and machine learning. Every linear regression uses Gauss’s insight.
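The idea of minimizing squared error can be shown in its simplest setting, fitting a straight line. This is only a sketch: Gauss's orbit problem was nonlinear and far harder, the data points below are made up for illustration, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the data and a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    # Made-up noisy measurements scattered around a straight line.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 2.9, 5.2, 6.8, 9.1]
    intercept, slope = fit_line(xs, ys)
    print(f"intercept={intercept:.3f}, slope={slope:.3f}")
    print(f"squared error at the fit: {sum_squared_error(xs, ys, intercept, slope):.4f}")
```

Any other choice of intercept or slope gives a larger sum of squared errors on the same data, which is what "least squares" means.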
The Data Journey:
- Collection: 41 days of telescope observations
- Understanding: Mathematical modeling of elliptical orbits
- Prediction: Successful recovery of a lost celestial body
Sources
- ThatsMaths - Gauss Predicts the Orbit of Ceres
- Actuaries Institute - Gauss, Least Squares, and the Missing Planet
Semmelweis: The Doctor Who Could Have Saved Millions (1818-1865)
Ignaz Semmelweis made one of the most important medical discoveries in history—and was destroyed for it.
The Vienna Maternity Clinics
At Vienna General Hospital in the 1840s, two maternity clinics operated side by side:
- Clinic 1: Staffed by doctors and medical students - Mortality rate: up to 18%
- Clinic 2: Staffed by midwives - Mortality rate: ~3%
Women begged not to be sent to Clinic 1. Some gave birth in the street rather than enter.
The Breakthrough (1847)
Semmelweis’s friend Jakob Kolletschka died after being accidentally cut during an autopsy. His autopsy revealed pathology identical to women dying of childbed fever.
The Connection: Doctors and students in Clinic 1 went directly from autopsies to delivering babies—carrying “cadaverous particles” on their hands. Midwives in Clinic 2 never touched corpses.
The Intervention
In May 1847, Semmelweis required all doctors and students to wash their hands with chlorinated lime solution before examining patients.
Results:
- Mortality in Clinic 1 dropped from 12.2% in May to 2.2% in June 1847
- Eventually matched the midwife clinic
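The size of the effect can be expressed with two standard measures, absolute and relative risk reduction. The sketch below uses only the two monthly rates quoted above (12.2% and 2.2%); the underlying patient counts are not given here, so it makes no claim about statistical significance. The function name is invented for illustration.

```python
def risk_reduction(rate_before: float, rate_after: float):
    """Return (absolute, relative) risk reduction for two mortality rates given as fractions."""
    if not 0.0 < rate_before <= 1.0 or not 0.0 <= rate_after <= 1.0:
        raise ValueError("rates must be fractions between 0 and 1")
    absolute = rate_before - rate_after
    relative = absolute / rate_before
    return absolute, relative


if __name__ == "__main__":
    # Monthly mortality rates quoted in the text, before and after handwashing.
    absolute, relative = risk_reduction(0.122, 0.022)
    print(f"absolute reduction: {absolute * 100:.1f} percentage points")
    print(f"relative reduction: {relative * 100:.0f}%")
    print(f"deaths avoided per 1,000 births: {absolute * 1000:.0f}")
```

A drop of ten percentage points corresponds to roughly 100 fewer deaths per 1,000 births, a relative reduction of about 82%.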
The Rejection
Despite clear evidence, the medical establishment rejected Semmelweis’s findings:
- The germ theory of disease had not yet been established, so he could offer no accepted mechanism for his results
- His ideas implied doctors were killing patients
- He was abrasive and politically naive
His contract was not renewed. He returned to Hungary, grew increasingly unstable, and died in a mental institution in 1865—possibly beaten by guards.
About twenty years after his 1847 intervention, the work of Louis Pasteur and Joseph Lister on germ theory and antisepsis vindicated him. Semmelweis became “the savior of mothers,” but only posthumously.
The Data Journey:
- Collection: Mortality statistics from two clinics
- Understanding: Identifying the transmission mechanism
- Prediction/Action: Handwashing intervention with dramatic results
- Tragedy: Data alone couldn’t overcome institutional resistance
Sources
- Wikipedia - Ignaz Semmelweis
- Science History Institute - Ignaz Semmelweis
- PMC - How dramatic were the effects of handwashing
Edward Lorenz and the Butterfly Effect (1963)
The Discovery of Chaos
On a winter day in 1961, Edward Lorenz, a meteorology professor at MIT, ran a computer simulation of weather patterns. He decided to repeat a run—but rounded one variable from .506127 to .506.
The result completely diverged from the original.
Deterministic Chaos
Lorenz had discovered something profound: in certain systems, tiny differences in initial conditions produce vastly different outcomes—even though the underlying equations are completely deterministic.
His 1963 paper “Deterministic Nonperiodic Flow” founded chaos theory.
The Butterfly Effect
The famous metaphor came later. In 1972, Lorenz couldn’t think of a title for a talk. His colleague Philip Merilees suggested:
“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?”
The name stuck.
Implications for Weather Prediction
“In meteorology, it led to the conclusion that it may be fundamentally impossible to predict weather beyond two or three weeks with a reasonable degree of accuracy.”
This isn’t a limitation of our instruments or computers—it’s a fundamental property of the atmosphere.
The Lorenz Attractor
Lorenz’s simplified model of atmospheric convection produces a beautiful mathematical object—the Lorenz attractor—whose shape famously resembles a butterfly.
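Sensitive dependence on initial conditions can be reproduced in a few lines. The sketch below integrates the three-variable system from the 1963 paper with its standard parameters (sigma = 10, rho = 28, beta = 8/3) using a fourth-order Runge-Kutta step. It is not the larger weather model Lorenz ran in 1961; the starting point, step size, and perturbation of one part in a million are illustrative assumptions, and the function names are invented.

```python
import math


def lorenz_derivative(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations for state = (x, y, z)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = lorenz_derivative(state)
    k2 = lorenz_derivative(shifted(state, k1, dt / 2.0))
    k3 = lorenz_derivative(shifted(state, k2, dt / 2.0))
    k4 = lorenz_derivative(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_history(start, perturbation=1e-6, dt=0.01, steps=4000):
    """Distance over time between two runs whose initial x differs by `perturbation`."""
    run_a = tuple(start)
    run_b = (start[0] + perturbation, start[1], start[2])
    history = []
    for _ in range(steps + 1):
        history.append(math.dist(run_a, run_b))
        run_a = rk4_step(run_a, dt)
        run_b = rk4_step(run_b, dt)
    return history


if __name__ == "__main__":
    history = separation_history((1.0, 1.0, 1.0))
    for step in range(0, len(history), 500):
        print(f"t = {step * 0.01:5.1f}  separation = {history[step]:.3e}")
```

The two runs track each other closely at first, then the gap grows roughly exponentially until it is as large as the attractor itself, at which point the runs are effectively unrelated.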
Impact
Some scientists argue the 20th century will be remembered for three scientific revolutions:
- Relativity
- Quantum mechanics
- Chaos
MIT colleague Kerry Emanuel: “By showing that certain deterministic systems have formal predictability limits, Ed put the last nail in the coffin of the Cartesian universe.”
The Data Journey:
- Collection: Computer simulations of atmospheric equations
- Understanding: Discovery of sensitive dependence on initial conditions
- Prediction: Recognition of fundamental limits to long-range forecasting
Sources
- MIT Technology Review - When the Butterfly Effect Took Flight
- Britannica - Edward Lorenz
- MIT News - Edward Lorenz obituary
The Hudson’s Bay Company: 200 Years of Predator-Prey Data
For over two hundred years, trappers working for the Hudson’s Bay Company recorded pelts traded—creating one of the longest ecological time series in existence.
The Data
The series most often analyzed starts in the 1840s. Pelt counts serve as a proxy for the populations of:
- Snowshoe hares (prey)
- Canada lynx (predator)
The data shows a striking pattern: populations oscillate with a period of about 10 years, with the lynx population lagging the hare population.
Lotka-Volterra Model
In the 1920s, Alfred Lotka and Vito Volterra independently developed differential equations describing predator-prey dynamics:
- When prey increase, predators have more food and increase
- When predators increase, they eat more prey, causing prey to decrease
- When prey decrease, predators starve and decrease
- When predators decrease, prey recover
The Hudson’s Bay Company data are widely cited as empirical support for these theoretical models.
What the Data Shows
“Notice how the predator population lags the prey population: an increase in prey numbers results in a delayed increase in predator numbers as the predators eat more prey.”
This phase lag—about one quarter of a cycle—is a fundamental signature of predator-prey dynamics.
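The cycle and the lag can both be seen by simulating the model directly. The sketch below writes the Lotka-Volterra equations as d(prey)/dt = alpha·prey − beta·prey·predators and d(predators)/dt = delta·prey·predators − gamma·predators, and integrates them with a Runge-Kutta step. All parameter values and starting populations are hypothetical, chosen only to produce a clear oscillation; they are not fitted to the fur records.

```python
def lotka_volterra(prey, predators, alpha, beta, gamma, delta):
    """Growth rates of prey and predators under the Lotka-Volterra model."""
    d_prey = alpha * prey - beta * prey * predators
    d_predators = delta * prey * predators - gamma * predators
    return d_prey, d_predators


def simulate(prey0, predators0, alpha=1.0, beta=0.1, gamma=1.5, delta=0.075,
             dt=0.001, steps=15000):
    """Integrate the model with RK4; returns (times, prey, predators) as lists."""
    params = (alpha, beta, gamma, delta)
    u, v = prey0, predators0
    times, prey, predators = [0.0], [u], [v]
    for i in range(1, steps + 1):
        k1u, k1v = lotka_volterra(u, v, *params)
        k2u, k2v = lotka_volterra(u + dt / 2 * k1u, v + dt / 2 * k1v, *params)
        k3u, k3v = lotka_volterra(u + dt / 2 * k2u, v + dt / 2 * k2v, *params)
        k4u, k4v = lotka_volterra(u + dt * k3u, v + dt * k3v, *params)
        u += dt / 6 * (k1u + 2 * k2u + 2 * k3u + k4u)
        v += dt / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        times.append(i * dt)
        prey.append(u)
        predators.append(v)
    return times, prey, predators


def first_peak_time(times, values):
    """Time of the first local maximum in a series, or None if there is none."""
    for i in range(1, len(values) - 1):
        if values[i - 1] < values[i] >= values[i + 1]:
            return times[i]
    return None


if __name__ == "__main__":
    # Hypothetical starting populations (arbitrary units).
    times, prey, predators = simulate(prey0=10.0, predators0=5.0)
    print(f"first prey peak at t = {first_peak_time(times, prey):.2f}")
    print(f"first predator peak at t = {first_peak_time(times, predators):.2f}")
```

In this idealized model the predator peak always follows the prey peak, matching the lag described above. Real populations are noisier, which is one reason the modern analyses listed below fit the model statistically rather than by eye.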
Modern Analysis
The data continues to be studied:
- Bayesian model fitting (Stan)
- Climate correlations (Southern Oscillation Index)
- Distinguishing between fur records and questionnaire data
The Data Journey:
- Collection: Commercial fur trapping records (200+ years)
- Understanding: Mathematical modeling of population dynamics
- Prediction: Ecological forecasting and conservation planning
Sources
- Stan - Predator-Prey Population Dynamics: the Lotka-Volterra model
- Mathematics LibreTexts - The Lotka-Volterra Predator-Prey Model
Part VI: The Information Age
Claude Shannon: The Magna Carta of the Digital Era (1916-2001)
In 1948, Claude Shannon, a mathematician at Bell Labs, published a paper that created the foundation for all digital communication.
A Mathematical Theory of Communication
Shannon’s paper appeared in the Bell System Technical Journal. Historian James Gleick rated it more important than the transistor: “even more profound and more fundamental.”
Scientific American called it the “Magna Carta of the Information Age.”
Key Concepts
Information Entropy: Shannon defined a measure of information content analogous to entropy in thermodynamics. In essence, it is the average number of binary digits (bits) per symbol needed to encode messages from a source. He credited John Tukey with coining the term “bit.”
Channel Capacity: Every communication channel has a maximum rate at which information can be reliably transmitted—the Shannon limit. You can approach it but never exceed it.
Error Correction: Shannon proved that, at any rate below channel capacity, information can be transmitted with an arbitrarily small error rate. The result surprised engineers who believed reducing errors required reducing speed.
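Both entropy and channel capacity reduce to short formulas. The sketch below computes Shannon entropy H = −Σ p·log2(p) for a discrete distribution, and the capacity C = 1 − H(p) of a binary symmetric channel, a standard textbook channel that flips each bit with probability p. The binary symmetric channel is an assumption added for illustration, not something discussed above, and the function names are invented.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def bsc_capacity(flip_probability):
    """Capacity in bits per use of a binary symmetric channel."""
    return 1.0 - entropy_bits([flip_probability, 1.0 - flip_probability])


if __name__ == "__main__":
    print(f"fair coin:        {entropy_bits([0.5, 0.5]):.3f} bits")
    print(f"biased coin:      {entropy_bits([0.9, 0.1]):.3f} bits")
    print(f"fair 8-sided die: {entropy_bits([1 / 8] * 8):.3f} bits")
    for p in (0.0, 0.01, 0.11, 0.5):
        print(f"flip probability {p:.2f}: capacity {bsc_capacity(p):.3f} bits per use")
```

A fair coin carries exactly one bit per toss, while a biased coin carries less because its outcome is more predictable. A channel that flips bits half the time has zero capacity: its output tells the receiver nothing about the input.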
Impact
“His theory was motivated by practical engineering problems. And while it was esoteric to the engineers of his day, Shannon’s theory has now become the standard framework underlying all modern-day communication systems: optical, underwater, even interplanetary.”
Roboticist Rodney Brooks: Shannon was “the 20th century engineer who contributed the most to 21st century technologies.”
The Reluctant Publisher
Remarkably, Shannon initially wasn’t planning to publish the paper. He only did so at colleagues’ urging.
Sources
- Wikipedia - A Mathematical Theory of Communication
- Quanta Magazine - How Claude Shannon Invented the Future
- Scientific American - Claude E. Shannon: Founder of Information Theory
Recommended Resources
Books
History of Statistics and Data Science
- “The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century” by David Salsburg - Engaging history of modern statistics told through personal stories of statisticians
- “The Signal and the Noise: Why So Many Predictions Fail — but Some Don’t” by Nate Silver (2012) - Bayesian thinking and prediction across domains
- “Thinking, Fast and Slow” by Daniel Kahneman (2011) - Cognitive biases and statistical reasoning
- “The Ghost Map” by Steven Johnson - John Snow and the cholera epidemic
Data Visualization
- “The Visual Display of Quantitative Information” by Edward Tufte - The classic reference on statistical graphics
- “How to Lie with Statistics” by Darrell Huff - Classic introduction to statistical deception
Computing History
- “The Innovators” by Walter Isaacson - History of the digital revolution
- “Hidden Figures” by Margot Lee Shetterly - African American women mathematicians at NASA
Online Courses
- Harvard PredictionX: John Snow and the Cholera Epidemic of 1854 - Free online course
Websites
- HistData R Package Documentation: https://friendly.github.io/HistData/ - Historical datasets including Nightingale, Snow, and Galton data
- History of Data Visualization: https://friendly.github.io/HistDataVis/ - Comprehensive online book by Michael Friendly
- Open Domesday: https://opendomesday.org/ - Explore the 1086 Domesday Book online
- Bletchley Park: https://bletchleypark.org.uk/ - The codebreaking headquarters
Videos and Documentaries
- Search: “Florence Nightingale data visualization” on YouTube
- Search: “John Snow cholera map” on YouTube
- Search: “History of statistics documentary” on YouTube
- “The Imitation Game” (2014) - Feature film about Alan Turing (dramatized)
- “Hidden Figures” (2016) - Feature film about NASA’s human computers
Interactive Visualizations
- Minard’s Napoleon Graphic recreated: Various D3.js implementations online
- John Snow’s Cholera Map: Interactive versions on ESRI and other GIS platforms
- Nightingale’s Rose Diagram: Recreatable in R, Python, Tableau
References
Primary Sources and Academic Papers
Ancient and Medieval Data Collection
- Office for National Statistics. “Census-taking in the ancient world.” https://www.ons.gov.uk/census/2011census/howourcensusworks/aboutcensuses/censushistory/censustakingintheancientworld
- Wikipedia. “Domesday Book.” https://en.wikipedia.org/wiki/Domesday_Book
- The National Archives. “Domesday Book.” https://www.nationalarchives.gov.uk/education/resources/domesday-book/
Early Statistics and Demography
- Connor, H. (2024). “John Graunt F.R.S. (1620-74): The founding father of human demography, epidemiology and vital statistics.” Journal of Medical Biography. https://pmc.ncbi.nlm.nih.gov/articles/PMC10919065/
- PMC. “Epidemiology’s 350th Anniversary: 1662–2012.” https://pmc.ncbi.nlm.nih.gov/articles/PMC3640843/
Quetelet and Social Statistics
- Wikipedia. “Adolphe Quetelet.” https://en.wikipedia.org/wiki/Adolphe_Quetelet
- PMC. “Quetelet and the emergence of the behavioral sciences.” https://pmc.ncbi.nlm.nih.gov/articles/PMC4559562/
Galton, Pearson, and Fisher
- MacTutor History of Mathematics. “Francis Galton.” https://mathshistory.st-andrews.ac.uk/Biographies/Galton/
- Rutherford Journal. “Karl Pearson and the Origins of Modern Statistics.” https://rutherfordjournal.org/article010107.html
- Wikipedia. “Lady tasting tea.” https://en.wikipedia.org/wiki/Lady_tasting_tea
Data Visualization
- Atlas Obscura. “The Scottish Scoundrel Who Changed How We See Data.” https://www.atlasobscura.com/articles/the-scottish-scoundrel-who-changed-how-we-see-data
- Scientific American. “How Florence Nightingale Changed Data Visualization Forever.” https://www.scientificamerican.com/article/how-florence-nightingale-changed-data-visualization-forever/
- Harvard Online. “PredictionX: John Snow and the Cholera Epidemic of 1854.” https://harvardonline.harvard.edu/course/predictionx-john-snow-cholera-epidemic-1854
- National Geographic. “The Underappreciated Man Behind the ‘Best Graphic Ever Produced’.” https://www.nationalgeographic.com/culture/article/charles-minard-cartography-infographics-history
Computing History
- Computer History Museum. “Ada Lovelace.” https://www.computerhistory.org/babbage/adalovelace/
- IBM. “The punched card tabulator.” https://www.ibm.com/history/punched-card-tabulator
- IEEE Spectrum. “The Women Behind ENIAC.” https://spectrum.ieee.org/the-women-behind-eniac
- The National WWII Museum. “Alan Turing and the Hidden Heroes of Bletchley Park.” https://www.nationalww2museum.org/war/articles/alan-turing-betchley-park
Scientific Discoveries
- Museo Galileo. “Isochronism of the pendulum.” https://catalogue.museogalileo.it/indepth/IsochronismPendulum.html
- NASA. “Orbits and Kepler’s Laws.” https://science.nasa.gov/resource/orbits-and-keplers-laws/
- ThatsMaths. “Gauss Predicts the Orbit of Ceres.” https://thatsmaths.com/2021/06/24/gauss-predicts-the-orbit-of-ceres/
- Science History Institute. “Ignaz Semmelweis.” https://www.sciencehistory.org/education/scientific-biographies/ignaz-semmelweis/
- MIT Technology Review. “When the Butterfly Effect Took Flight.” https://www.technologyreview.com/2011/02/22/196987/when-the-butterfly-effect-took-flight/
- Stan. “Predator-Prey Population Dynamics.” https://mc-stan.org/learn-stan/case-studies/lotka-volterra-predator-prey.html
Information Theory
- Quanta Magazine. “How Claude Shannon Invented the Future.” https://www.quantamagazine.org/how-claude-shannons-information-theory-invented-the-future-20201222/
- Scientific American. “Claude E. Shannon: Founder of Information Theory.” https://www.scientificamerican.com/article/claude-e-shannon-founder/
Document compiled for SCDS DATA 201: Introduction to Data Science I. Module 1: Introduction to Data Science and Systems Thinking, “Understanding the Data Revolution”