Module 3: Statistical Thinking

“Probabilistic Thinking”

Research Document for DATA 201 Course Development


Table of Contents

  1. Introduction
  2. Part I: The Birth of Probability
  3. Part II: The Great Statistical Ideas
  4. Part III: Statistical Pitfalls and Paradoxes
  5. Part IV: The Bayesian Revolution
  6. Part V: Modern Statistical Practice
  7. DEEP DIVE: The Literary Digest Disaster of 1936
  8. Lecture Plan and Hands-On Exercise
  9. Recommended Resources
  10. References

Introduction

Statistical thinking is the art of reasoning under uncertainty. This module traces the journey from gambling problems to modern data science, exploring:

Core Question: How do we make decisions when we can’t be certain?


Part I: The Birth of Probability

The Problem of Points: Pascal and Fermat (1654)

In 1654, a French gambler named Antoine Gombaud (known as the Chevalier de Méré) posed a problem to mathematician Blaise Pascal: If a fair game of chance is interrupted, how should the stakes be divided based on the current score?

The Correspondence

Pascal wrote to Pierre de Fermat, and their exchange of letters founded probability theory. They solved the “problem of points” by reasoning about all possible future outcomes—the first systematic use of expected value.

Example Problem: Two players stake equal money on a series of fair games; the first to 3 wins takes the pot. Player A has 2 wins and Player B has 1 win when the game is interrupted. How should they split the pot fairly?

The Solution: Consider all possible ways the game could end. A needs one more win; B needs two, so at most two more games settle it. The four equally likely continuations are AA, AB, BA, BB, and A takes the pot in three of the four (every continuation except BB).

A should get 3/4 of the pot, B should get 1/4.
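
Pascal and Fermat's enumeration is easy to check in code. The sketch below lists the four equally likely continuations and then confirms the split by simulating the interrupted match (first to 3 wins) many times:

```python
import itertools
import random

# Enumerate the (at most) two games still to be played:
# A needs 1 more win, B needs 2, so A loses the pot only if B wins both.
continuations = list(itertools.product("AB", repeat=2))  # AA, AB, BA, BB
a_share = sum(games != ("B", "B") for games in continuations) / len(continuations)
print(f"A's fair share by enumeration: {a_share}")  # 0.75

# Monte Carlo check: play the interrupted match to completion
random.seed(0)
trials = 100_000
a_pot = 0
for _ in range(trials):
    wins_a, wins_b = 2, 1
    while wins_a < 3 and wins_b < 3:
        if random.random() < 0.5:
            wins_a += 1
        else:
            wins_b += 1
    a_pot += (wins_a == 3)
print(f"A's share by simulation: {a_pot / trials:.3f}")  # close to 0.750
```

The enumeration is exactly Pascal and Fermat's argument; the simulation is the modern sanity check.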

The Data Journey

Sources


The Bernoulli Family: A Dynasty of Probabilists

The Bernoulli family of Basel, Switzerland, produced at least eight prominent mathematicians across three generations, many of whom contributed to probability and statistics.

Jacob Bernoulli (1655-1705)

Jacob proved the Law of Large Numbers, published posthumously in Ars Conjectandi (1713): as independent trials accumulate, observed frequencies converge to the underlying probabilities.

Daniel Bernoulli (1700-1782)

Daniel introduced expected utility in 1738, arguing that the value of an extra dollar shrinks as wealth grows. This idea is his proposed resolution of the St. Petersburg paradox below.

The St. Petersburg Paradox

A casino offers a game: Flip a coin until it lands heads. If heads appears on flip n, you win $2^n.

Expected value: $1 + $1 + $1 + … = ∞

Would you pay $1,000,000 to play? Most wouldn’t—revealing that humans don’t simply maximize expected value.
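
A quick simulation shows why intuition balks. This sketch, assuming the game exactly as stated (payout of $2^n when the first heads lands on flip n), demonstrates that the sample mean stays modest even over many plays, because the infinite expectation is carried by vanishingly rare jackpots:

```python
import numpy as np

rng = np.random.default_rng(0)

def play_once(rng):
    # Flip until the first heads; the payout doubles with every flip
    n = 1
    while rng.random() < 0.5:  # tails: keep flipping
        n += 1
    return 2 ** n

payouts = np.array([play_once(rng) for _ in range(100_000)])
print(f"Mean payout over {len(payouts):,} plays: ${payouts.mean():,.2f}")
print(f"Largest single payout: ${payouts.max():,}")
# The theoretical mean is infinite, yet the sample mean stays small:
# almost all of the expectation lives in jackpots you will never see.
```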


Pierre-Simon Laplace: The Probability of Causes (1749-1827)

Laplace systematized probability in his monumental Théorie analytique des probabilités (1812). He developed the rule of succession, the "probability of causes" (an independent rediscovery and generalization of Bayes' rule), and an early general form of the central limit theorem.

Laplace’s Demon

Laplace imagined an intellect that knew the position and momentum of every particle in the universe—such a being could predict all future states. This “demon” represents the deterministic worldview that probability theory would eventually challenge.

“Probability is nothing but common sense reduced to calculation.” — Laplace


Part II: The Great Statistical Ideas

The Central Limit Theorem: Order from Chaos

The CLT states that the sum (or average) of many independent random variables approaches a normal distribution, regardless of the original distribution.

History

Why It Matters

The CLT explains why the normal distribution appears everywhere:

The Data Journey

Hands-On Demo

import numpy as np
import matplotlib.pyplot as plt

# Sample means from ANY distribution approach normal
# Try with: exponential, uniform, Bernoulli, etc.

np.random.seed(42)
sample_means = []

for _ in range(10000):
    # Sample from exponential distribution
    sample = np.random.exponential(scale=1, size=30)
    sample_means.append(np.mean(sample))

plt.hist(sample_means, bins=50, density=True, alpha=0.7)
plt.title("Distribution of Sample Means (n=30)")
plt.xlabel("Sample Mean")
plt.show()

Correlation Is Not Causation

The Classic Example: Ice Cream and Drowning

Ice cream sales and drowning deaths are strongly correlated. Does ice cream cause drowning?

No—both are caused by a confounding variable: hot weather. More people buy ice cream AND more people swim when it’s hot.

Spurious Correlations

Tyler Vigen’s website tylervigen.com/spurious-correlations documents absurd correlations:

When Correlation DOES Suggest Causation

Bradford Hill’s criteria (1965) for inferring causation:

  1. Strength: Strong associations more likely causal
  2. Consistency: Reproducible across studies
  3. Specificity: Specific exposure → specific outcome
  4. Temporality: Cause precedes effect
  5. Biological gradient: Dose-response relationship
  6. Plausibility: Mechanism makes sense
  7. Coherence: Consistent with known biology
  8. Experiment: Manipulation produces effect
  9. Analogy: Similar causes produce similar effects

Sources


Simpson’s Paradox: When Aggregation Misleads

The Berkeley Admissions Case (1973)

UC Berkeley’s graduate admissions appeared to discriminate against women: about 44% of male applicants were admitted, versus roughly 35% of female applicants.

But examining individual departments told a different story—in most departments, women had HIGHER admission rates than men!

The Explanation

Women applied disproportionately to more competitive departments (humanities, arts). Men applied more to less competitive departments (engineering, sciences). The aggregated data hid this pattern.

The Kidney Stone Treatment Paradox

Two treatments for kidney stones were compared. Overall, Treatment B succeeded more often than Treatment A (about 83% vs. 78%).

Treatment B seems better! But for small stones, A is better. For large stones, A is also better. How?

Treatment A was used more often on large (harder to treat) stones, pulling down its overall average.

Lesson

Always examine subgroups. Aggregate statistics can completely reverse when disaggregated.
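
A few lines of arithmetic reproduce the paradox, using the success counts commonly reported from the underlying 1986 study (Charig et al.); treat the exact figures as illustrative:

```python
# Successes / cases for each treatment, split by stone size
# (counts as commonly reported from Charig et al., BMJ 1986)
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

for treatment, groups in data.items():
    total_succ = sum(s for s, n in groups.values())
    total_n = sum(n for s, n in groups.values())
    by_size = ", ".join(f"{g}: {s / n:.1%}" for g, (s, n) in groups.items())
    print(f"Treatment {treatment}: {by_size}, overall: {total_succ / total_n:.1%}")

# A has the higher success rate within BOTH subgroups, yet B wins overall:
# A was given far more often to the harder, large-stone cases.
```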

Sources


Part III: Statistical Pitfalls and Paradoxes

The Challenger Disaster (1986): A Statistical Tragedy

On January 28, 1986, the Space Shuttle Challenger exploded 73 seconds after launch, killing all seven crew members. The cause: O-ring failure in cold weather.

The Night Before

Engineers at Morton Thiokol warned that the O-rings might fail at the predicted launch temperature of 36°F. They had data showing O-ring damage at lower temperatures.

The Statistical Failure

When presenting data, engineers showed only flights WITH O-ring damage, not the full dataset. The critical scatter plot—showing temperature vs. O-ring incidents for ALL flights—was never created.

What the Full Data Showed

When ALL data points are plotted (including flights without damage), the relationship is clear: colder temperatures strongly predict O-ring problems. The coldest previous launch (53°F) had significant damage. Launching at 36°F was far outside the safe range.

The Lesson

Show ALL the data. Selection bias—even unintentional—can lead to fatal decisions.

Sources


The Hot Hand Fallacy (Or Is It?)

The Original Study (1985)

Gilovich, Vallone, and Tversky analyzed basketball shooting data and concluded that the “hot hand” (a player being more likely to make a shot after making previous shots) was a cognitive illusion. Fans and players believed in streaks that didn’t exist statistically.

The Reversal (2014-2018)

Researchers Joshua Miller and Adam Sanjurjo discovered a subtle but critical flaw: in any finite sequence, the proportion of successes that immediately follow a success is expected to fall below the overall success rate, even for independent flips of a fair coin. The original study had used this downward-biased quantity as its no-hot-hand benchmark.

Correcting for this bias, there IS evidence of a hot hand—just smaller than intuition suggests.

The Lesson

Even experts can miss subtle statistical traps. Always question the sampling procedure.
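
The bias can be reproduced with a fair coin. This sketch (a hypothetical setup: 100,000 independent sequences of three fair flips) averages, sequence by sequence, the proportion of heads that immediately follow a head:

```python
import numpy as np

rng = np.random.default_rng(1)
props = []
for _ in range(100_000):
    flips = rng.integers(0, 2, size=3)        # three fair flips, 1 = heads
    after_heads = flips[1:][flips[:-1] == 1]  # outcomes that follow a head
    if after_heads.size:                      # skip sequences with no head to follow
        props.append(after_heads.mean())

print(f"Average per-sequence P(heads | just flipped heads): {np.mean(props):.3f}")
# About 5/12 ~ 0.417, not 0.5: averaging proportions sequence by sequence
# is biased downward, the same trap the original hot-hand analysis fell into.
```

Averaging per-sequence proportions, rather than pooling all flips, is exactly the step that biases the estimate downward.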

Sources


Part IV: The Bayesian Revolution

Thomas Bayes and the Essay (1763)

Thomas Bayes was an English Presbyterian minister whose Essay towards solving a Problem in the Doctrine of Chances was published posthumously in 1763 by his friend Richard Price.

Bayes’ Theorem

\[P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}\]
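
As a worked example with made-up numbers (a disease with 1% prevalence, a test with 99% sensitivity and a 5% false-positive rate), the theorem gives a posterior far lower than most people guess:

```python
# Prior, sensitivity, and false-positive rate are illustrative values
p_disease = 0.01             # P(H): prevalence
p_pos_given_disease = 0.99   # P(E|H): sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# P(E): total probability of testing positive
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {posterior:.1%}")  # about 16.7%
```

Despite the accurate test, a positive result leaves the patient more likely healthy than sick, because the prior P(H) is so small.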

The Bayesian vs. Frequentist Debate

Frequentist view: probability is the long-run frequency of repeatable events. Parameters are fixed unknowns, so probability statements attach to procedures (confidence intervals, p-values), not to hypotheses.

Bayesian view: probability quantifies degree of belief. Parameters receive prior distributions, which Bayes' theorem updates into posteriors as evidence arrives.

The Revival of Bayesian Methods

For most of the 20th century, frequentist methods dominated. The Bayesian revival came with computational advances, above all Markov chain Monte Carlo methods and cheap computing power in the 1990s, which made previously intractable posterior distributions practical to estimate.

Sources


James Lind and the Scurvy Trial (1747): The First Clinical Trial

The Problem

On long sea voyages, sailors developed scurvy—bleeding gums, weakness, and death. On some voyages, more than half the crew died.

Lind’s Experiment

Ship’s surgeon James Lind conducted what’s considered the first controlled clinical trial. He took 12 sailors with scurvy and divided them into six groups of two, each receiving a different treatment:

  1. Cider
  2. Elixir vitriol (sulfuric acid)
  3. Vinegar
  4. Seawater
  5. Oranges and lemons
  6. Nutmeg and barley water

The Result

The two sailors given citrus fruit recovered almost immediately. The others showed no improvement.

Why It Took 50 Years

Despite clear evidence, the British Navy didn't mandate citrus until 1795. Lind himself buried the result in his sprawling 1753 Treatise of the Scurvy, competing theories of the disease persisted, and supplying fresh fruit at sea was costly.

The Data Journey

Sources


Part V: Modern Statistical Practice

A/B Testing: Statistics at Scale

The Netflix Optimization Machine

Netflix runs hundreds of A/B tests simultaneously, optimizing everything from thumbnail images to recommendation algorithms. Each test involves millions of users.

Famous example: Netflix discovered that images of faces showing complex emotions drive 30% more clicks than neutral expressions.

A/B Testing Best Practices

  1. Randomization: Users randomly assigned to variants
  2. Sample size calculation: Ensure sufficient power
  3. Pre-registration: Specify analysis before seeing data
  4. Multiple testing correction: Adjust for running many tests
  5. Practical significance: Statistical significance ≠ meaningful effect
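
For a single test, significance can be checked with a two-proportion z-test. The sketch below uses made-up conversion counts and the standard normal approximation (standard library only):

```python
import math

def two_prop_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p, normal approx
    return z, p_value

# Hypothetical counts: 12.0% vs 12.9% conversion on 10,000 users per arm
z, p = two_prop_ztest(1200, 10_000, 1290, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p just above 0.05: not significant
```

Note that with many such tests running simultaneously, each p-value needs a multiple-testing correction (best practice 4 above).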

The Replication Crisis

In 2015, the Open Science Collaboration attempted to replicate 100 psychology studies. Only 36% replicated successfully.
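
Multiple testing alone can manufacture "findings". This simulation (assuming 20 independent tests per study at α = 0.05, with every null hypothesis true) shows how often at least one test comes up significant by pure chance:

```python
import numpy as np

rng = np.random.default_rng(7)
n_studies, n_tests = 10_000, 20

# Each simulated "study" runs 20 independent tests on pure noise
# (the null is true everywhere) and reports success if ANY p < 0.05.
hits = 0
for _ in range(n_studies):
    pvals = rng.uniform(0, 1, n_tests)  # under the null, p-values are uniform
    if (pvals < 0.05).any():
        hits += 1

frac = hits / n_studies
print(f"Studies with at least one 'significant' result: {frac:.2f}")
# Theory: 1 - 0.95**20 ~ 0.64, despite zero real effects
```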

Causes include p-hacking (trying analyses until one comes out significant), publication bias toward positive results, underpowered studies, and undisclosed flexibility in data collection and analysis.

Solutions

Sources


DEEP DIVE: The Literary Digest Disaster of 1936

The Most Famous Polling Failure in History

The Story

In 1936, The Literary Digest—a prestigious American magazine—conducted the largest poll in history to predict the presidential election between Franklin D. Roosevelt (Democrat) and Alf Landon (Republican).

The Method

The Digest mailed 10 million questionnaires to Americans, using lists drawn from telephone directories, automobile registration records, club membership rolls, and its own subscribers.

They received over 2.3 million responses—an unprecedented sample size.

The Prediction

Based on these massive returns, the Digest confidently predicted a comfortable victory for Landon, with about 57% of the popular vote.

The Actual Result

Roosevelt won in one of the largest landslides in American history, taking roughly 61% of the popular vote and carrying 46 of 48 states.

The Digest was off by 19 percentage points. The magazine, which had correctly predicted the previous five elections, never recovered from the humiliation and ceased publication within two years.

What Went Wrong

1. Selection Bias: The Sample Wasn’t Representative

In 1936, during the Great Depression, telephones and automobiles were luxuries, so the households on the Digest's lists skewed heavily toward the affluent.

The Digest’s sample systematically overrepresented wealthy Republicans and excluded poorer Democratic voters.

2. Non-Response Bias

Of 10 million surveys mailed, only 2.3 million responded (23%). Those who felt strongly about the election—disproportionately anti-Roosevelt—were more likely to return surveys.

3. The Fallacy of Large Numbers

A biased sample doesn’t become unbiased by making it larger.

The Digest’s massive sample size gave false confidence. In the words of statistician George Gallup:

“A sampling procedure that has built-in biases will not improve its accuracy no matter how big it gets.”

The Counterexample: George Gallup’s Success

Meanwhile, a young statistician named George Gallup used scientific sampling with only about 50,000 respondents.

Gallup predicted Roosevelt’s victory—and, brilliantly, also predicted the Literary Digest’s error!

Gallup’s Method

  1. Define the target population
  2. Use demographic quotas (age, gender, region, income)
  3. Random selection within quotas
  4. Weight responses to match population

Why This Matters Today

Modern Echoes

2016 US Presidential Election: Most polls predicted Hillary Clinton would win. Donald Trump won. National polls were actually close to the national popular vote, but key state polls underrepresented voters without college degrees.

2020 Polling Errors: Polls again underestimated Trump support, suggesting systematic issues with reaching certain voters.

The Fundamental Lessons

  1. Sample quality > sample size: A small random sample beats a large biased one
  2. Non-response matters: Who doesn’t answer tells you something
  3. The population changes: Who the “likely voters” are shifts
  4. Beware of confidence: Large numbers create false certainty

The Data Journey


Lecture Plan and Hands-On Exercise

Lecture Plan: “Why Sampling Matters” (75-90 minutes)

Part 1: The Setup (15 min)

Opening Question: “If you want to know what Americans think, how many do you need to ask?”

Show students:

“What went wrong?”

Part 2: The Story (25 min)

Historical Context:

The Two Errors:

  1. Selection bias (demonstrate with classroom example)
  2. Non-response bias

George Gallup’s Alternative:

Part 3: The Statistics (20 min)

Sampling Theory:

Key Formula: \(\text{Margin of Error} \approx \frac{1}{\sqrt{n}}\)

But this assumes random sampling! With bias: \(\text{True Error} = \text{Sampling Error} + \text{Bias}\)

Bias doesn’t shrink with sample size.
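
Plugging numbers into the 1/√n approximation makes the point concrete: the Digest's 2.3 million responses would have had a tiny sampling margin of error, so its 19-point miss was essentially all bias:

```python
import math

# Sampling margin of error shrinks with n; a fixed bias does not.
bias = 0.19  # the Digest's actual 19-point miss
for n in [1_000, 50_000, 2_300_000]:
    moe = 1 / math.sqrt(n)
    print(f"n = {n:>9,}: sampling MoE ~ {moe:6.2%}, bias still {bias:.0%}")
```

At n = 2,300,000 the sampling error is around 0.07%, four orders of magnitude smaller than the error actually observed.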

Part 4: Modern Applications (15 min)

Discussion:

The Replication Crisis:


Hands-On Exercise: “The Biased Sample”

Objective

Experience how bias corrupts inference, even with large samples.

Duration

1.5-2 hours

Setup

Dataset: Simulated population of 10,000 “voters” with known preferences

Task 1: Random Sampling (30 min)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create population
np.random.seed(42)
n_pop = 10000

population = pd.DataFrame({
    'age': np.random.normal(45, 15, n_pop).clip(18, 90),
    'income': np.random.lognormal(10.5, 0.8, n_pop),
    'education_years': np.random.normal(14, 3, n_pop).clip(8, 22)
})

# True preference: affected by demographics
# Younger, lower income → prefer A
prob_A = 0.4 + 0.2 * (population['age'] < 45) - 0.15 * (population['income'] > 60000)
prob_A = prob_A.clip(0.1, 0.9)
population['vote'] = np.random.binomial(1, prob_A)

# True population proportion
true_prop_A = population['vote'].mean()
print(f"True proportion voting A: {true_prop_A:.3f}")

# Random sample of size 1000
random_sample = population.sample(1000)
est_random = random_sample['vote'].mean()
print(f"Random sample estimate: {est_random:.3f}")

Questions:

  1. What’s your estimate from the random sample?
  2. Run it 100 times. What’s the distribution of estimates?
  3. What’s the standard error?

Task 2: Biased Sampling (30 min)

# "Literary Digest" style sampling: oversample wealthy
# Only people with income > $50,000 can receive survey
wealthy_only = population[population['income'] > 50000]
wealthy_sample = wealthy_only.sample(min(2000, len(wealthy_only)))
est_wealthy = wealthy_sample['vote'].mean()
print(f"Wealthy-only sample estimate: {est_wealthy:.3f}")

# Add non-response bias: baseline 25% response rate,
# rising to 40% for strong A supporters
response_prob = 0.25 + 0.15 * (wealthy_sample['vote'] == 1)
responders = wealthy_sample[np.random.random(len(wealthy_sample)) < response_prob]
est_biased = responders['vote'].mean()
print(f"Wealthy + non-response bias estimate: {est_biased:.3f}")

Questions:

  1. How does the biased estimate compare to truth?
  2. Is a larger biased sample more accurate?
  3. How would you correct for this bias if you knew the population demographics?

Task 3: Post-Stratification (30 min)

Technique: If your sample is biased but you know population demographics, you can reweight.

# Post-stratification: weight sample to match population
# Assume we know population proportion in each income bracket

pop_income_cats = pd.cut(population['income'],
                          bins=[0, 30000, 60000, 100000, np.inf],
                          labels=['low', 'medium', 'high', 'very_high'])
pop_weights = pop_income_cats.value_counts(normalize=True)

sample_income_cats = pd.cut(wealthy_sample['income'],
                             bins=[0, 30000, 60000, 100000, np.inf],
                             labels=['low', 'medium', 'high', 'very_high'])
sample_weights = sample_income_cats.value_counts(normalize=True)

# Calculate weights for each observation: (population share) / (sample share).
# Strata absent from the sample (here 'low') get weight NaN --
# reweighting cannot recover people who were never surveyed.
weight_map = (pop_weights / sample_weights).replace([np.inf, -np.inf], np.nan).to_dict()
obs_w = sample_income_cats.map(weight_map).astype(float)
valid = obs_w.notna()

# Compare weighted vs unweighted estimates
est_unweighted = wealthy_sample['vote'].mean()
est_weighted = np.average(wealthy_sample.loc[valid, 'vote'], weights=obs_w[valid])
print(f"Unweighted: {est_unweighted:.3f}, post-stratified: {est_weighted:.3f}")

Task 4: Reflection (20 min)

Write responses to:

  1. Why didn’t the Literary Digest’s huge sample help?
  2. In what modern contexts might we face similar biases?
  3. What’s the minimum information you need to correct for sampling bias?

Evaluation Rubric

Criterion   | Excellent                                  | Good             | Needs Work
Coding      | Correct implementation, clear code         | Minor errors     | Significant errors
Analysis    | Quantifies bias accurately                 | Identifies bias  | Misunderstands bias
Reflection  | Deep insight about real-world applications | Good connections | Surface-level

Recommended Resources

Books

Accessible

Technical

History

Online Resources

Courses

Websites

Videos

Datasets


References

Historical

Classical Papers

Modern

Books


Document compiled for SCDS DATA 201: Introduction to Data Science I Module 3: Statistical Thinking “Probabilistic Thinking”