Module 10: Regression and Classification - Predicting Outcomes

Introduction

At the heart of supervised machine learning lies a simple question: given what we know, can we predict what we don’t? Will this email be spam? What price will this house sell for? Will this patient develop diabetes? Is this transaction fraudulent?

These questions divide into two types. Regression predicts continuous values—prices, temperatures, durations, quantities. Classification predicts categories—spam or not spam, malignant or benign, approved or denied. Together, they form the foundation of predictive modeling, turning historical data into actionable predictions about the future.

This module traces the intellectual history of these methods, from a Victorian scientist’s study of heredity to the sophisticated ensemble methods that power modern AI systems. We’ll learn not just how these algorithms work, but why they were invented and how they reflect evolving ideas about the relationship between inputs and outputs.

Part 1: The Origins of Regression

Francis Galton and the Birth of Regression (1886)

The word “regression” has a curious history. It does not mean what it sounds like—going backward—but rather refers to a specific statistical phenomenon first observed by Francis Galton (1822-1911), the Victorian polymath who gave us fingerprinting, weather maps, and the foundations of statistics.

Galton was obsessed with heredity and eugenics (a field he named and promoted, now rightfully discredited). In the 1870s, he conducted a famous experiment: he grew sweet pea seeds, carefully measured the sizes of parent seeds and their offspring, and noticed something unexpected.

Large parent seeds did indeed produce offspring that were larger than average. But they weren’t as large as their parents. Small parent seeds produced offspring that were smaller than average—but not as small as their parents. Everything “regressed” toward the mean.

Galton then turned to humans. He collected data on the heights of parents and adult children. He found the same pattern: tall parents had children who were tall, but on average less tall than themselves. Short parents had children who were short, but on average taller than themselves.

This was regression to the mean—the tendency of extreme values to be followed by less extreme values. It’s not that nature “corrects” extremes; it’s that extreme values are partly due to chance, and chance doesn’t repeat perfectly.

Karl Pearson and the Correlation Coefficient

Galton’s protégé Karl Pearson (1857-1936) took the next step. He developed the mathematical framework we still use today:

The correlation coefficient (r) measures the strength and direction of a linear relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).

The regression line is the best-fitting straight line through a scatter plot—the line that minimizes the sum of squared vertical distances from each point to the line (the method of least squares, invented earlier by Gauss and Legendre).

The equation: $\hat{y} = \beta_0 + \beta_1 x$

where $\beta_1 = r \cdot \frac{s_y}{s_x}$ (slope) and $\beta_0 = \bar{y} - \beta_1 \bar{x}$ (intercept).
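As a rough illustration of these two formulas, the sketch below computes the slope and intercept from r, s_x, and s_y. The data points and the helper name `fit_line` are made up for illustration; they are not Galton's or Pearson's data.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope, r) for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Pearson correlation coefficient r = cov(x, y) / (s_x * s_y)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    slope = r * s_y / s_x              # beta_1
    intercept = y_bar - slope * x_bar  # beta_0
    return intercept, slope, r

# Hypothetical data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r = fit_line(xs, ys)
print(f"intercept={b0:.3f}, slope={b1:.3f}, r={r:.3f}")
```

Because r is dimensionless, the ratio s_y / s_x is what converts it into a slope in the units of y per unit of x.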

Least Squares: From Gauss to Machine Learning

The method of least squares—finding parameters that minimize squared errors—predates Galton by decades. Carl Friedrich Gauss (1777-1855) and Adrien-Marie Legendre (1752-1833) developed it independently in the early 1800s for astronomical calculations.

Gauss used it to predict the orbit of the asteroid Ceres from just a few observations, and his success made him famous throughout Europe. The method has endured because:

It has a closed-form solution (for linear problems)
It’s computationally tractable
It has optimal properties: the least squares estimate coincides with the maximum likelihood estimate when errors are normally distributed

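Today, least squares remains a foundation of machine learning. Training a neural network with a squared-error loss applies the same idea to complex, nonlinear functions; because no closed-form solution exists there, the loss is minimized iteratively with gradient descent.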
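To make the contrast between a closed-form solution and iterative optimization concrete, here is a minimal sketch that fits the same straight line both ways. The data, learning rate, and step count are illustrative assumptions, chosen so that plain gradient descent converges on this tiny problem.

```python
def closed_form(xs, ys):
    """Least squares intercept and slope from the closed-form solution."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def gradient_descent(xs, ys, lr=0.02, steps=20000):
    """Minimize mean squared error by repeatedly stepping against the gradient."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2 / n * sum(residuals)
        grad_b1 = 2 / n * sum(res * x for res, x in zip(residuals, xs))
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1

# Hypothetical data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
exact = closed_form(xs, ys)
approx = gradient_descent(xs, ys)
print("closed form:     ", exact)
print("gradient descent:", approx)
```

For a linear model the two answers agree to many decimal places; the iterative route only becomes necessary when no formula for the minimum exists.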

Part 2: From Linear to Logistic - Classification Emerges

The Need for Classification

Regression predicts numbers, but many predictions are inherently categorical:

Will this tumor be malignant or benign?
Will this customer default on their loan?
Is this email spam or legitimate?

You could try to use linear regression for classification (predict 0 or 1, threshold at 0.5), but this causes problems: predictions can go below 0 or above 1, the relationship between inputs and probability isn’t linear, and extreme values distort the fit.

The Logit Transform

The solution came from logistic regression, which predicts the probability of class membership and keeps predictions bounded between 0 and 1.

The key is the logit function (or log-odds): $\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$

And its inverse, the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Instead of modeling the outcome directly, we model the log-odds as a linear function of predictors: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...$

The sigmoid squashes any real number into the (0, 1) interval—perfect for probabilities.
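A small sketch of the two functions, showing that each undoes the other. The function names are our own, and the branch in `sigmoid` is simply a common way to avoid overflow for large negative inputs.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite to avoid computing exp of a large positive number
    e = math.exp(z)
    return e / (1.0 + e)

for p in (0.1, 0.5, 0.8):
    z = logit(p)
    print(f"p={p}  logit={z:+.3f}  sigmoid(logit)={sigmoid(z):.3f}")
```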

The History of Logistic Regression

The logistic function was introduced by Pierre François Verhulst in the 1830s-1840s to model population growth with resource constraints (the S-curve of growth that slows as carrying capacity is reached).

Its use for classification developed through the mid-20th century, with key contributions from:

Joseph Berkson (1944): coined “logit” and advocated logistic models in biostatistics
David Cox (1958): the theoretical foundations in regression context
Jerome Cornfield (1962): applied to epidemiology and disease risk

Maximum Likelihood Estimation

Unlike linear regression, logistic regression can’t be solved in closed form. Instead, we use maximum likelihood estimation (MLE)—finding the parameters that make the observed data most probable.

For binary outcomes: $L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i}$

We maximize this (or equivalently, minimize the negative log-likelihood) using iterative optimization—Newton’s method or gradient descent.
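The following sketch minimizes the negative log-likelihood for a one-predictor logistic model with plain gradient descent. The data set, learning rate, and step count are assumptions picked for illustration; real libraries typically use faster second-order or quasi-Newton solvers.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def neg_log_likelihood(b0, b1, xs, ys):
    """Negative log of L(beta) for binary outcomes, summed over observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(b0, b1, xs, ys):
    """Gradient of the negative log-likelihood with respect to (b0, b1)."""
    g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys))
    g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys))
    return g0, g1

def fit_logistic(xs, ys, lr=0.1, steps=20000):
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0, g1 = gradient(b0, b1, xs, ys)
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Hypothetical, deliberately non-separable data (a finite MLE exists)
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(f"b0={b0:.3f}, b1={b1:.3f}, NLL={neg_log_likelihood(b0, b1, xs, ys):.3f}")
```

At the maximum likelihood estimate the gradient is zero, which is a convenient check that the iteration has converged.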

Part 3: Beyond the Line - Non-Linear Methods

Polynomial Regression

The simplest extension of linear regression adds polynomial terms: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + ...$

This is still “linear regression” because it’s linear in the parameters (the βs), even though it’s non-linear in x.
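To see why this counts as linear regression, the sketch below builds a design matrix whose columns are 1, x, and x², then solves for the βs with ordinary least squares. The noise-free data is made up so that the recovered coefficients are easy to check.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x - 0.5x^2 (no noise)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x - 0.5 * x**2

# Columns [1, x, x^2]: nonlinear in x, but the model is linear in the betas
A = np.vander(x, N=3, increasing=True)

# Ordinary least squares on the expanded features
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("beta_0, beta_1, beta_2 =", np.round(coef, 6))
```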

k-Nearest Neighbors (k-NN)

Perhaps the simplest non-parametric method: to predict for a new point, find the k closest training points and:

For regression: average their values
For classification: take the majority vote

k-NN makes no assumptions about the form of the relationship. It can capture arbitrarily complex patterns. But it suffers from the curse of dimensionality: in high dimensions, “nearest neighbors” become meaningless because all points are approximately equidistant.
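A minimal from-scratch sketch of the procedure, using Euclidean distance. The function name `knn_predict`, the `task` argument, and the toy points are illustrative choices, not a library API.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, task="classification"):
    """Predict for `query` from its k nearest training points (Euclidean distance)."""
    # Sort training pairs by the distance of their features to the query
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    neighbors = [y for _, y in by_distance[:k]]
    if task == "regression":
        return sum(neighbors) / len(neighbors)      # average the values
    return Counter(neighbors).most_common(1)[0][0]  # majority vote

# Hypothetical 2-D training points forming two clusters
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_predict(points, labels, (1.5, 1.5)))
print(knn_predict(points, values, (1.5, 1.5), task="regression"))
```

Note that there is no training step at all: the "model" is the stored data, and all the work happens at prediction time.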

Decision Trees

Decision trees partition the feature space through a series of yes/no questions:

Is income > $50,000?
- Yes: Is age > 35?
  - Yes: Class A
  - No: Class B
- No: Class C
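The example tree above translates directly into nested conditionals, which is one way to see that a fitted tree is just a set of axis-aligned rules. The function name `classify` is hypothetical.

```python
def classify(income, age):
    """Follow the example tree: split on income first, then on age."""
    if income > 50_000:
        if age > 35:
            return "Class A"
        return "Class B"
    return "Class C"

print(classify(income=80_000, age=40))
print(classify(income=80_000, age=30))
print(classify(income=30_000, age=40))
```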

Invented in various forms over decades, decision trees were popularized by Leo Breiman and colleagues with their CART algorithm (Classification and Regression Trees) in 1984.

Trees are interpretable—you can follow the decision path—but they’re unstable (small changes in data can produce very different trees) and prone to overfitting.

Ensemble Methods: Random Forests and Boosting

The instability of trees became a feature, not a bug:

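Random Forests (Leo Breiman, 2001): Build many trees, each on a bootstrap sample of the data and with a random subset of features considered at each split. Average their predictions (or take a majority vote for classification). The randomness decorrelates the trees, reducing variance.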

Boosting: Build trees sequentially, each one focusing on the errors of the previous ones. AdaBoost (Freund & Schapire, 1997) and Gradient Boosting (Friedman, 2001) became the dominant methods for structured data.

XGBoost (Tianqi Chen, 2014) optimized gradient boosting for speed and accuracy and became a frequent winning choice in Kaggle competitions on structured data. LightGBM and CatBoost followed with further improvements.

Part 4: Support Vector Machines

The Maximum Margin Classifier

In 1963, Vladimir Vapnik and Alexey Chervonenkis developed the foundations of statistical learning theory. Their work led to Support Vector Machines (SVMs) in the 1990s.

The key insight: for classification, find the hyperplane that maximizes the margin—the distance to the nearest training points (the “support vectors”). This provides theoretical guarantees about generalization.

The Kernel Trick

SVMs became powerful through the kernel trick: instead of working in the original feature space, implicitly map the data to a higher-dimensional space where linear separation becomes possible.

The Gaussian (RBF) kernel effectively maps to an infinite-dimensional space: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
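The kernel formula is easy to evaluate directly. This small sketch (function name and sample points are our own) shows that it behaves like a similarity score: 1 for identical points, decaying toward 0 as points move apart.

```python
import math

def rbf_kernel(x, x_prime, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(rbf_kernel((0, 0), (0, 0)))             # identical points
print(rbf_kernel((0, 0), (1, 1)))             # nearby points
print(rbf_kernel((0, 0), (5, 5)))             # distant points
print(rbf_kernel((0, 0), (5, 5), sigma=5.0))  # a wider kernel sees them as closer
```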

For a brief period in the late 1990s and early 2000s, SVMs were the state of the art for many classification tasks—until neural networks resurged.

Part 5: Evaluation and Model Selection

The Bias-Variance Tradeoff

Every prediction model balances two sources of error:

Bias: Error from oversimplifying—missing the true pattern. A linear model for a curved relationship has high bias.

Variance: Error from being too sensitive to training data. A model that fits noise as if it were signal has high variance.

Total Error = Bias² + Variance + Irreducible Noise

Simple models (high bias, low variance) underfit. Complex models (low bias, high variance) overfit. The sweet spot lies in between.

Cross-Validation

How do we estimate how well a model will perform on new data?

Holdout validation: Split data into training and test sets. Simple, but the test portion is never used for fitting and the estimate depends on a single split.

k-Fold Cross-Validation: Split data into k folds. Train on k-1 folds, test on the remaining fold. Repeat k times. Average the scores.

Leave-One-Out: k-fold with k = n (each observation is a test set once). Computationally expensive, since it requires n model fits, but each fit uses all but one observation.
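The splitting logic behind these schemes can be sketched in a few lines. This version assigns contiguous blocks of indices to folds and does no shuffling, which is a simplifying assumption; in practice the data is usually shuffled (or stratified) first. The generator name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)

# Leave-one-out is the special case k = n
loo = list(k_fold_indices(5, 5))
```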

Regression Metrics

Mean Squared Error (MSE): $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ - emphasizes large errors

Root Mean Squared Error (RMSE): $\sqrt{MSE}$ - same units as outcome

Mean Absolute Error (MAE): $\frac{1}{n}\sum|y_i - \hat{y}_i|$ - robust to outliers

R² (Coefficient of Determination): Proportion of variance explained. $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
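A short sketch computing all four metrics by hand on a tiny made-up example; the function name and numbers are illustrative only.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r ** 2 for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    y_bar = sum(y_true) / n
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "R2": 1 - ss_res / ss_tot}

# Hypothetical values; the last prediction is off by 3
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 12.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The single large miss contributes 9 of the 11 units of squared error but only 3 of the 5 units of absolute error, which is why MSE is described as emphasizing large errors.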

Classification Metrics

Accuracy: Proportion correct. Can be misleading with imbalanced classes.

Confusion Matrix: Table of true positives, false positives, true negatives, false negatives.

Precision: $\frac{TP}{TP + FP}$ - of predicted positives, how many are correct?

Recall (Sensitivity): $\frac{TP}{TP + FN}$ - of actual positives, how many did we find?

F1 Score: Harmonic mean of precision and recall.

AUC-ROC: Area under the Receiver Operating Characteristic curve. Measures discrimination ability across all thresholds.
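The sketch below computes the threshold-based metrics from raw labels and shows why accuracy can mislead on imbalanced data. The 5%-positive data set and both sets of predictions are invented for illustration; returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def safe_div(a, b):
        return a / b if b else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Hypothetical imbalanced data: 5 positives, 95 negatives
y_true = [1] * 5 + [0] * 95

# A useless classifier that always predicts the majority class
always_negative = classification_metrics(y_true, [0] * 100)

# A classifier that finds 3 of 5 positives and raises 2 false alarms
some_signal = classification_metrics(y_true, [1, 1, 1, 0, 0] + [1, 1] + [0] * 93)

print(always_negative)
print(some_signal)
```

The always-negative classifier scores high on accuracy while finding none of the positives, which is exactly the failure that precision and recall expose.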

Part 6: Regularization - Controlling Complexity

The Problem of Overfitting

With enough parameters, any model can fit the training data perfectly—even noise. But such a model fails on new data. This is overfitting.

Ridge Regression (L2)

Add a penalty for large coefficients: $\text{minimize } \sum(y_i - \hat{y}_i)^2 + \lambda \sum \beta_j^2$

This shrinks coefficients toward zero, reducing variance at the cost of some bias.

Lasso Regression (L1)

Use the absolute value of coefficients as penalty: $\text{minimize } \sum(y_i - \hat{y}_i)^2 + \lambda \sum |\beta_j|$

Lasso can shrink coefficients all the way to zero, performing automatic feature selection.
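The difference between the two penalties is easiest to see in the simplest possible case. The sketch below assumes a single feature and no intercept, where both penalized objectives above have closed-form minimizers: ridge divides by a larger denominator, while lasso subtracts a fixed amount and clips at zero (soft-thresholding). The function names and data are illustrative.

```python
import math

def ols_1d(xs, ys):
    """Minimizer of sum (y - b*x)^2."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_1d(xs, ys, lam):
    """Minimizer of sum (y - b*x)^2 + lam * b^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_1d(xs, ys, lam):
    """Minimizer of sum (y - b*x)^2 + lam * |b| (soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx

# Hypothetical data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for lam in (0, 14, 28, 56, 100):
    print(f"lambda={lam:>3}  ridge={ridge_1d(xs, ys, lam):.3f}  "
          f"lasso={lasso_1d(xs, ys, lam):.3f}")
```

Ridge never reaches exactly zero for any finite λ, whereas lasso hits zero once λ is large enough.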

Elastic Net

Combines L1 and L2 penalties, getting the best of both: sparsity from Lasso, stability from Ridge.

The Regularization Parameter

The strength of regularization (λ) is a hyperparameter that must be tuned, typically using cross-validation. Too little regularization → overfitting. Too much → underfitting.

Part 7: Modern Supervised Learning Practice

Feature Engineering

Raw data rarely works well directly. Feature engineering transforms raw inputs into representations that help the model:

Scaling: Standardize features to mean 0, std 1
Encoding: Convert categories to numbers (one-hot, target encoding)
Interactions: Create products of features
Transformations: Log, square root, polynomial terms
Binning: Convert continuous to categorical
Domain knowledge: Create features that capture relevant information

Feature engineering often matters more than algorithm choice.

Handling Imbalanced Data

When one class is rare (fraud detection, disease diagnosis):

Oversampling: Create synthetic minority examples (SMOTE)
Undersampling: Reduce majority class
Class weights: Penalize errors on minority class more
Threshold adjustment: Lower the classification threshold
Anomaly detection: Treat minority as anomalies

The ML Pipeline

A complete machine learning pipeline:

Data collection and exploration
Data cleaning (missing values, outliers)
Feature engineering
Train/test split
Model selection and hyperparameter tuning (with cross-validation)
Evaluation on holdout test set
Interpretation and validation
Deployment and monitoring

AutoML

Automated machine learning (AutoML) systems automate the middle of this pipeline, roughly from feature engineering through evaluation:

Auto-sklearn, H2O AutoML, TPOT: Search over algorithms and hyperparameters
AutoFeat, Featuretools: Automated feature engineering
Google Cloud AutoML, AWS SageMaker Autopilot: Cloud-based solutions

DEEP DIVE: Francis Galton and the Discovery of Regression

The Polymath’s Obsession

Francis Galton was one of the most remarkable minds of the Victorian era—and one of the most troubling. A half-cousin of Charles Darwin, he made fundamental contributions to meteorology, psychology, genetics, and statistics. He also founded eugenics, believing that human society could be improved through selective breeding. His science was brilliant; his application of it was morally catastrophic.

But setting aside the darkness (which we must acknowledge and not minimize), Galton’s statistical discoveries revolutionized how we understand the world. And it all started with peas.

The Sweet Pea Experiment (1877)

In 1877, Galton distributed seeds from the same sweet pea plant to friends across Britain. He carefully measured each seed’s weight before sending it and asked his friends to grow the peas, collect the offspring seeds, and return them for measurement.

The experiment was designed to understand heredity. Darwin’s Origin of Species had been published in 1859, but the mechanisms of inheritance remained mysterious (Mendel’s work was ignored until 1900). Galton wanted to quantify how traits passed from parents to offspring.

When the data came back, Galton plotted parent seed size against offspring seed size. He saw a linear relationship—larger parents produced larger offspring—but something odd: the slope was less than 1.

If a parent seed was 1 standard deviation above average, the offspring was only about 0.33 standard deviations above average. The offspring were less extreme than their parents.

Galton called this phenomenon reversion—later renamed regression to the mean.

From Peas to People

Galton then turned to humans, studying family records of height across generations. In 1886, he published “Regression Towards Mediocrity in Hereditary Stature,” presenting data on 928 adult children and their parents.

He invented several techniques for this analysis:

The ellipse of equal frequency: Galton drew contour lines on his scatter plot, showing that parent and child heights formed an elliptical distribution, an early visualization of a bivariate distribution.

The regression line: He fitted a line through the data showing that if the mid-parent height was 1 inch above average, the child’s expected height was only about 2/3 inch above average.

The correlation coefficient: He quantified the strength of association, though his student Karl Pearson would formalize this.

Understanding Regression to the Mean

Galton initially thought regression was a biological phenomenon—nature “correcting” extremes. But he eventually realized it was mathematical. Consider why:

A tall person is tall partly because of genetics (which can be inherited) and partly because of random factors during development (which are not inherited). Their children inherit the genetic component but not the random component. So children of tall parents are, on average, less extremely tall.

This works both ways: short parents have children who are, on average, taller than themselves. The population isn’t collapsing toward the mean—it’s a statistical phenomenon about individual extremes, not about population change.
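A quick simulation makes this argument tangible. It assumes a deliberately crude model (not Galton's data): each family shares one inherited component, and parent and child each add their own independent random component, all drawn from standard normal distributions.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 50_000

# Shared inherited component, plus independent non-inherited noise for each generation
inherited = [rng.gauss(0, 1) for _ in range(n)]
parents = [g + rng.gauss(0, 1) for g in inherited]
children = [g + rng.gauss(0, 1) for g in inherited]

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Families with unusually tall or unusually short parents
tall = [(p, c) for p, c in zip(parents, children) if p > 2.0]
short = [(p, c) for p, c in zip(parents, children) if p < -2.0]

mean_tall_parent = mean([p for p, _ in tall])
mean_tall_child = mean([c for _, c in tall])
mean_short_parent = mean([p for p, _ in short])
mean_short_child = mean([c for _, c in short])

print(f"tall parents:  {mean_tall_parent:+.2f}, their children: {mean_tall_child:+.2f}")
print(f"short parents: {mean_short_parent:+.2f}, their children: {mean_short_child:+.2f}")
print(f"variance of parents:  {variance(parents):.2f}")
print(f"variance of children: {variance(children):.2f}")
```

Children of extreme parents land closer to the average in both directions, yet the overall spread of the two generations is essentially the same, so the population is not collapsing toward the mean.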

The Regression Fallacy

Regression to the mean creates a cognitive trap. After an extreme event, things tend to become less extreme—but we often attribute this to our interventions:

A student has a terrible test score; we tutor them; they improve. Was it the tutoring, or regression to the mean?
An athlete has a career-best year; they appear on a magazine cover; their next year is worse. The “Sports Illustrated jinx” is just regression.
A CEO takes over during a crisis; things improve; we credit their leadership. But things were likely to improve anyway.

Understanding regression to the mean is essential for evaluating interventions, treatments, and policies.

Galton’s Legacy

Galton invented or pioneered:

The correlation coefficient (in concept; Pearson formalized it)
Regression analysis
Routine use of the “probable error” to summarize spread (a forerunner of the standard deviation, a term Pearson introduced in the 1890s)
The percentile
Fingerprint analysis for identification
The weather map (with isobars)
Survey questionnaires
Twin studies for nature vs. nurture

His 1889 book Natural Inheritance laid out the foundations of modern statistics.

Karl Pearson, his protégé, built on this work to create the mathematical framework we use today. Together with R.A. Fisher (who would develop ANOVA, maximum likelihood, and experimental design), they established statistics as a discipline.

The Dark Side

We cannot discuss Galton without confronting his eugenics. He coined the term in 1883 and spent decades promoting the idea that human society could be improved by encouraging the “fit” to reproduce and discouraging the “unfit.”

These ideas, dressed in the authority of science, had horrific consequences: forced sterilization laws in the United States (upheld by the Supreme Court in Buck v. Bell, 1927), the horrors of Nazi racial policies, and continuing discrimination.

The lesson is sobering: brilliant science can be put to terrible uses. Statistical tools are not neutral—they can be used to justify inequality, discrimination, and violence. Every data scientist must grapple with the ethical implications of their work.

Why This Story Matters

Galton’s story illustrates several critical themes:

Discovery through data: Galton’s breakthroughs came from careful measurement and visualization. He let the data reveal patterns rather than imposing theories.
The power of abstraction: The regression line, a simple equation relating input to output, became a foundation for much of supervised learning.
Statistical intuition: Regression to the mean is subtle and counterintuitive. Understanding it prevents common reasoning errors.
Responsibility: Technical brilliance doesn’t guarantee ethical wisdom. Data scientists must think carefully about how their work affects people.

LECTURE PLAN: From Galton’s Peas to Modern Prediction

Learning Objectives

By the end of this lecture, students will be able to:

Explain regression to the mean and why it matters
Fit and interpret simple linear regression models
Understand logistic regression for classification
Evaluate models using appropriate metrics
Recognize the bias-variance tradeoff

Lecture Structure (90 minutes)

Opening Hook (8 minutes)

The Height Paradox

Ask students: “If tall parents have tall children, why doesn’t everyone eventually become the same height?”
Present Galton’s puzzle
Show the original scatter plot of parent and child heights
Introduce regression to the mean through this historical mystery

Part 1: Linear Regression Foundations (20 minutes)

From Data to Line (8 minutes)

Plot a simple scatter plot (two variables)
Ask: “What’s the best line to summarize this relationship?”
Intuition: minimize the vertical distances (residuals)
The formula: $\hat{y} = \beta_0 + \beta_1 x$
Interactive: show how changing β₀ and β₁ changes the line

Least Squares (6 minutes)

Why squared errors? (Differentiable, penalizes large errors)
The closed-form solution: $\beta_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$
Connection to correlation: $\beta_1 = r \cdot \frac{s_y}{s_x}$
Demo: fit a line in Python, show coefficients

Interpretation (6 minutes)

Slope: “For every 1-unit increase in X, Y changes by β₁”
Intercept: “When X = 0, Y = β₀”
R²: “What proportion of variance does X explain?”
Important: correlation ≠ causation!

Part 2: Classification with Logistic Regression (20 minutes)

Why Not Linear Regression for Categories? (5 minutes)

Demo: fit linear regression to binary outcome
Problems: predictions outside [0, 1], wrong functional form
The solution: predict probability, not outcome

The Logistic Function (8 minutes)

The S-curve (sigmoid): $\sigma(z) = \frac{1}{1 + e^{-z}}$
Squashes any input to (0, 1)—perfect for probability
The model: $P(Y=1) = \sigma(\beta_0 + \beta_1 x)$
Demo: show how sigmoid transforms linear combination

Interpretation in Logistic Regression (7 minutes)

Log-odds (logit): $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$
Coefficients as log-odds ratios
Exponentiated coefficient = odds ratio
Example: if exp(β₁) = 2, the odds double for each unit increase in X
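The odds-ratio example can be checked numerically. In this sketch β₁ is set to ln 2 so that exp(β₁) = 2; the intercept value is an arbitrary assumption and does not affect the ratio.

```python
import math

b0 = -1.0         # hypothetical intercept
b1 = math.log(2)  # chosen so that exp(b1) = 2

def probability(x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(x):
    p = probability(x)
    return p / (1 - p)

for x in (0, 1, 2, 3):
    print(f"x={x}  p={probability(x):.3f}  odds={odds(x):.3f}")

# The ratio of odds at consecutive x values is the odds ratio
ratio = odds(3) / odds(2)
print(f"odds(3) / odds(2) = {ratio:.3f}")
```

The odds double with every unit of x, but the probability does not: it rises quickly near p = 0.5 and slowly near 0 or 1.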

Part 3: Evaluation and the Bias-Variance Tradeoff (18 minutes)

Regression Metrics (5 minutes)

MSE, RMSE, MAE: when to use which
R²: what it tells us, and its limitations
Live demo: calculate metrics on a real dataset

Classification Metrics (7 minutes)

Accuracy and its limitations (imbalanced classes)
Confusion matrix: TP, FP, TN, FN
Precision vs. recall: the tradeoff
ROC curve and AUC: threshold-independent evaluation
Demo: plot confusion matrix and ROC curve

Bias-Variance Tradeoff (6 minutes)

Draw the classic U-shaped curves
Underfitting: too simple, misses pattern (high bias)
Overfitting: too complex, memorizes noise (high variance)
The sweet spot: just complex enough
Demo: polynomial regression with increasing degree

Part 4: Beyond Linear Models (15 minutes)

Decision Trees (5 minutes)

The intuition: a flowchart of questions
Demo: visualize a simple decision tree
Pros: interpretable, handles non-linear relationships
Cons: unstable, prone to overfitting

Ensemble Methods (5 minutes)

Random Forest: many trees, random subsets, vote/average
Gradient Boosting: sequential, each tree fixes previous errors
Why ensembles work: wisdom of crowds

Regularization (5 minutes)

The problem: too many features, overfitting
Ridge (L2): shrink coefficients
Lasso (L1): drive coefficients to zero (feature selection)
Cross-validation to choose λ

Part 5: Practical Considerations (5 minutes)

The ML Pipeline

Data cleaning → Feature engineering → Train/test split → Model selection → Evaluation
The importance of holdout test sets
Cross-validation for hyperparameter tuning

Regression to the Mean in Practice

Return to opening question: Galton’s insight applies everywhere
Examples: sports performance, medical interventions, business cycles
Key lesson: expect extremes to become less extreme

Wrap-Up (4 minutes)

Recap: linear regression → logistic regression → evaluation → beyond
Galton’s legacy: both technical and cautionary
Preview the hands-on exercise
Key message: “The best model is the simplest one that works”

Materials Needed

Scatter plots from Galton’s original data
Interactive visualization of regression lines
Confusion matrix examples
Python notebooks with live demonstrations

Discussion Questions

Why did Galton call it “regression” if it’s about prediction?
When would you choose logistic regression over a decision tree?
How would you explain regression to the mean to someone who thinks tutoring always works?
What features would you engineer to predict house prices?

HANDS-ON EXERCISE: Predicting Survival on the Titanic

Overview

In this exercise, students will:

Explore and prepare the Titanic dataset
Build regression and classification models
Evaluate model performance with appropriate metrics
Compare different algorithms and interpret results

Prerequisites

Python 3.8+
Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn

Setup

# Install required packages
# pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, auc, mean_squared_error, r2_score)

import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

Part 1: Data Loading and Exploration (15 minutes)

# Load the Titanic dataset
# You can download from: https://www.kaggle.com/c/titanic/data
# Or use seaborn's built-in version:
titanic = sns.load_dataset('titanic')

print("Dataset shape:", titanic.shape)
print("\nColumn names:", titanic.columns.tolist())
print("\nFirst few rows:")
titanic.head()

Task 1.1: Explore the data structure

# Data types and missing values
print("Data types:")
print(titanic.dtypes)
print("\nMissing values:")
print(titanic.isnull().sum())

# Basic statistics
print("\nSummary statistics:")
titanic.describe()

Task 1.2: Visualize survival by different features

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Survival rate by class
sns.countplot(data=titanic, x='pclass', hue='survived', ax=axes[0, 0])
axes[0, 0].set_title('Survival by Class')

# Survival rate by sex
sns.countplot(data=titanic, x='sex', hue='survived', ax=axes[0, 1])
axes[0, 1].set_title('Survival by Sex')

# Age distribution by survival
sns.histplot(data=titanic, x='age', hue='survived', bins=30, ax=axes[0, 2])
axes[0, 2].set_title('Age Distribution by Survival')

# Survival rate by embarkation port
sns.countplot(data=titanic, x='embarked', hue='survived', ax=axes[1, 0])
axes[1, 0].set_title('Survival by Embarkation')

# Fare distribution by survival
sns.boxplot(data=titanic, x='survived', y='fare', ax=axes[1, 1])
axes[1, 1].set_title('Fare by Survival')

# Survival by siblings/spouse aboard
sns.countplot(data=titanic, x='sibsp', hue='survived', ax=axes[1, 2])
axes[1, 2].set_title('Survival by Siblings/Spouse')

plt.tight_layout()
plt.show()

Part 2: Data Preparation (20 minutes)

# Create a copy for processing
df = titanic.copy()

# Handle missing values
# Age: fill with median by class and sex
df['age'] = df.groupby(['pclass', 'sex'])['age'].transform(
    lambda x: x.fillna(x.median())
)

# Embarked: fill with mode
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

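# Drop columns we won't use: 'alive' duplicates the target (leakage), 'deck' is
# mostly missing, and the others repeat information held in other columns
# ('alone' is rebuilt below as is_alone)
df = df.drop(['deck', 'alive', 'embark_town', 'class', 'who', 'adult_male', 'alone'], axis=1)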

print("Missing values after cleaning:")
print(df.isnull().sum())

Task 2.1: Feature engineering

# Create new features
# Family size
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Is alone?
df['is_alone'] = (df['family_size'] == 1).astype(int)

# Age categories
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 18, 35, 60, 100],
                          labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior'])

# Fare per person
df['fare_per_person'] = df['fare'] / df['family_size']

print("\nNew features:")
print(df[['family_size', 'is_alone', 'age_group', 'fare_per_person']].head(10))

Task 2.2: Encode categorical variables

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=['sex', 'embarked', 'age_group'], drop_first=True)

# Print the resulting features
print("Features after encoding:")
print(df_encoded.columns.tolist())

Part 3: Building Classification Models (30 minutes)

# Prepare features and target
X = df_encoded.drop(['survived'], axis=1)
y = df_encoded['survived']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Survival rate in training set: {y_train.mean():.2%}")
print(f"Survival rate in test set: {y_test.mean():.2%}")

# Scale features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Task 3.1: Logistic Regression

# Fit logistic regression
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_prob_log = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("Logistic Regression Results:")
print(classification_report(y_test, y_pred_log))

# Feature importance (coefficients)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': log_reg.coef_[0],
    'odds_ratio': np.exp(log_reg.coef_[0])
}).sort_values('coefficient', ascending=False)

print("\nFeature Coefficients (top 10):")
print(feature_importance.head(10))

Task 3.2: Decision Tree

# Fit decision tree
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# Predictions
y_pred_tree = tree.predict(X_test)

# Evaluate
print("Decision Tree Results:")
print(classification_report(y_test, y_pred_tree))

# Visualize the tree (top levels)
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(tree, feature_names=list(X.columns), class_names=['Died', 'Survived'],
          filled=True, rounded=True, max_depth=3)
plt.title('Decision Tree (First 3 Levels)')
plt.tight_layout()
plt.show()

Task 3.3: Random Forest

# Fit random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]

# Evaluate
print("Random Forest Results:")
print(classification_report(y_test, y_pred_rf))

# Feature importance
feature_importance_rf = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nRandom Forest Feature Importance:")
print(feature_importance_rf.head(10))

# Visualize
plt.figure(figsize=(12, 6))
top_features = feature_importance_rf.head(10)
plt.barh(top_features['feature'], top_features['importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Part 4: Model Comparison and Evaluation (20 minutes)

def evaluate_model(name, y_true, y_pred, y_prob=None):
    """Compute evaluation metrics."""
    results = {
        'Model': name,
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1': f1_score(y_true, y_pred)
    }
    if y_prob is not None:
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        results['AUC'] = auc(fpr, tpr)
    return results

# Compare models
models_to_compare = [
    ('Logistic Regression', y_pred_log, y_prob_log),
    ('Decision Tree', y_pred_tree, None),
    ('Random Forest', y_pred_rf, y_prob_rf),
]

# Add Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
y_prob_gb = gb.predict_proba(X_test)[:, 1]
models_to_compare.append(('Gradient Boosting', y_pred_gb, y_prob_gb))

# Compute metrics for all models
results_list = []
for name, y_pred, y_prob in models_to_compare:
    results_list.append(evaluate_model(name, y_test, y_pred, y_prob))

comparison_df = pd.DataFrame(results_list)
print("\nModel Comparison:")
print(comparison_df.to_string(index=False))

Task 4.1: Plot ROC curves

plt.figure(figsize=(10, 8))

for name, _, y_prob in models_to_compare:
    if y_prob is not None:
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend(loc='lower right')
plt.tight_layout()
plt.show()
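
A useful way to read the AUC values in the legend: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on a small set of made-up scores (the arrays are hypothetical, not lab output) by comparing every positive with every negative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical scores for 4 negatives and 4 positives
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score_demo = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.70, 0.90, 0.30])

pos = y_score_demo[y_true_demo == 1]
neg = y_score_demo[y_true_demo == 0]

# Compare every positive with every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"pairwise AUC  = {pairwise_auc:.4f}")
print(f"roc_auc_score = {roc_auc_score(y_true_demo, y_score_demo):.4f}")
```

Here 11 of the 16 positive/negative pairs are ranked correctly, so both numbers come out to 0.6875.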

Task 4.2: Analyze confusion matrices

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for ax, (name, y_pred, _) in zip(axes.flat, models_to_compare):
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Died', 'Survived'],
                yticklabels=['Died', 'Survived'])
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title(f'{name}\nConfusion Matrix')

plt.tight_layout()
plt.show()
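
The metrics from `evaluate_model` can be recovered directly from the four cells of a confusion matrix. The sketch below uses a small set of made-up labels (hypothetical, not lab output) to show the arithmetic and confirm it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 1 = survived, 0 = died
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# For binary labels ordered [0, 1], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)   # of predicted survivors, how many survived
recall_manual = tp / (tp + fn)      # of actual survivors, how many were found
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision_manual:.2f} "
      f"recall={recall_manual:.2f} accuracy={accuracy_manual:.2f}")
```

In the heatmaps above, rows are actual classes and columns are predicted classes, so false negatives sit in the bottom-left cell and false positives in the top-right.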

Part 5: Cross-Validation and Regularization (15 minutes)

# Cross-validation comparison
from sklearn.model_selection import cross_val_score

# Fix random_state so the cross-validation scores are reproducible
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
}

cv_results = []
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled if 'Logistic' in name else X_train,
                             y_train, cv=5, scoring='accuracy')
    cv_results.append({
        'Model': name,
        'Mean CV Accuracy': scores.mean(),
        'Std CV Accuracy': scores.std()
    })

cv_df = pd.DataFrame(cv_results)
print("Cross-Validation Results:")
print(cv_df.to_string(index=False))
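
One caveat about the loop above: if `X_train_scaled` was produced by a scaler fit on the whole training set, each validation fold has already influenced the scaling statistics. The effect is usually small, but a pipeline avoids it by refitting the scaler inside every fold. The sketch below assumes a synthetic dataset from `make_classification` as a stand-in for the unscaled training data; `X_demo`, `pipe`, and `pipe_scores` are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the unscaled training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)

# The scaler is refit on the training portion of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Mean CV accuracy: {pipe_scores.mean():.3f} (std {pipe_scores.std():.3f})")
```

The same pattern applies to the regularization experiment: pass the pipeline to `cross_val_score` together with the unscaled features.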

Task 5.1: Regularization in Logistic Regression

# Test different regularization strengths
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
cv_scores = []

for C in C_values:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    cv_scores.append({
        'C': C,
        'Mean Accuracy': scores.mean(),
        'Std Accuracy': scores.std()
    })

reg_df = pd.DataFrame(cv_scores)
print("\nRegularization Effect (C = 1/λ):")
print(reg_df.to_string(index=False))

# Plot
plt.figure(figsize=(10, 5))
plt.errorbar(np.log10(reg_df['C']), reg_df['Mean Accuracy'],
             yerr=reg_df['Std Accuracy'], marker='o', capsize=5)
plt.xlabel('log10(C) (higher = less regularization)')
plt.ylabel('CV Accuracy')
plt.title('Effect of Regularization on Cross-Validation Accuracy')
plt.tight_layout()
plt.show()
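
Accuracy alone hides what the penalty does to the model. With scikit-learn's default L2 penalty, a smaller `C` means a stronger penalty, which pulls the coefficients toward zero. The sketch below prints the coefficient norm for each `C`, using a synthetic dataset from `make_classification` as a stand-in for the lab data (`X_demo_scaled` and `coef_norms` are placeholder names).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

coef_norms = []
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo_scaled, y_demo)
    # L2 norm of the weight vector (the intercept is not included)
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"C={C:<7} ||w||_2 = {coef_norms[-1]:.4f}")
```

The norm should grow as `C` increases and then level off once the penalty is too weak to matter.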

Challenge Questions

Regression to the Mean: If a model performs exceptionally well on one random train/test split, what should you expect on different splits? How does this relate to Galton’s observations?
Feature Engineering: What additional features might you create to improve predictions? (Think about: titles in names, cabin locations, ticket classes)
Threshold Selection: The default threshold for classification is 0.5. How would you choose a different threshold if you wanted to prioritize recall (finding as many survivors as possible), and what would that cost you in precision?
Imbalanced Classes: What if only 10% of passengers had survived? How would you modify your approach?
Interpretability vs. Accuracy: Logistic regression is more interpretable than random forest. When would you choose interpretability over higher accuracy?
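
As a starting point for the threshold question, here is one possible approach, not the only valid answer: pick the highest threshold that still reaches a target recall, using `precision_recall_curve` on a validation set. The dataset is synthetic, and the 0.90 recall target and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_prob = clf.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds;
# recall[i] is the recall when predicting positive for prob >= thresholds[i]
precision, recall, thresholds = precision_recall_curve(y_val, val_prob)

target_recall = 0.90  # hypothetical requirement
# Recall does not increase as the threshold rises, so take the highest
# threshold that still meets the target
meets_target = recall[:-1] >= target_recall
chosen_threshold = thresholds[meets_target].max()

y_pred_custom = (val_prob >= chosen_threshold).astype(int)
print(f"threshold = {chosen_threshold:.3f}")
print(f"recall    = {recall_score(y_val, y_pred_custom):.3f}")
print(f"precision = {precision_score(y_val, y_pred_custom):.3f}")
```

Choose the threshold on validation data rather than the test set, so that the final test metrics remain an honest estimate.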

Expected Outputs

Students should submit:

Exploratory data analysis with visualizations and insights
Feature engineering decisions with justification
At least three trained models with evaluation metrics
A comparison of ROC curves and confusion matrices across models
Cross-validation results showing model reliability
Written analysis of which model they would deploy and why

Evaluation Rubric

Criteria	Points
Data exploration and visualization	15
Feature engineering quality	15
Correct model implementation	20
Proper evaluation methodology	20
Model comparison and selection	15
Code quality and interpretation	15
Total	100

Recommended Resources

Books

Technical

An Introduction to Statistical Learning (ISL) by James, Witten, Hastie, Tibshirani - Free online, the essential introduction
The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, Friedman - Free online, more mathematical
Pattern Recognition and Machine Learning by Christopher Bishop - Deep, comprehensive
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron - Practical Python focus

Historical and Popular

The Lady Tasting Tea by David Salsburg - Stories of statistical pioneers including Galton and Pearson
The Theory That Would Not Die by Sharon McGrayne - History of Bayesian statistics
Moneyball by Michael Lewis - Regression and prediction in baseball
Naked Statistics by Charles Wheelan - Accessible introduction

Academic Papers

Galton, F. (1886). “Regression Towards Mediocrity in Hereditary Stature” - The original regression paper
Breiman, L. (2001). “Random Forests” - Machine Learning, 45(1), 5-32
Friedman, J.H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine”
Chen, T., & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System”
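Hastie, T., Tibshirani, R., & Tibshirani, R.J. (2020). “Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons” - Statistical Science, 35(4), 579-592; on comparing model selection methods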

Video Lectures

StatQuest with Josh Starmer: Clear explanations of regression, logistic regression, decision trees
Stanford CS229 Machine Learning: Andrew Ng’s classic course
3Blue1Brown: Gradient Descent - Beautiful visualization
MIT OpenCourseWare: 18.S096 Topics in Mathematics with Applications in Finance - Includes a lecture on regression analysis

Online Courses

Coursera: Machine Learning by Andrew Ng - The classic introduction
Fast.ai: Introduction to Machine Learning for Coders - Practical, code-first approach
DataCamp: Supervised Learning with scikit-learn - Hands-on Python
Kaggle Learn: Intro to Machine Learning - Short, practical tutorials

Tools and Libraries

scikit-learn (https://scikit-learn.org/) - Python machine learning
XGBoost (https://xgboost.readthedocs.io/) - Gradient boosting
LightGBM (https://lightgbm.readthedocs.io/) - Fast gradient boosting
CatBoost (https://catboost.ai/) - Handles categorical features
SHAP (https://shap.readthedocs.io/) - Model interpretability
Yellowbrick (https://www.scikit-yb.org/) - ML visualization

Datasets for Practice

Kaggle Titanic - Classic binary classification
Boston Housing - Regression (deprecated in scikit-learn 1.0 and removed in 1.2 due to ethical concerns)
California Housing - Better housing price regression dataset
Pima Indians Diabetes - Medical classification
Credit Card Fraud - Imbalanced classification
UCI Machine Learning Repository - Hundreds of datasets

References

Galton, F. (1886). “Regression Towards Mediocrity in Hereditary Stature.” Journal of the Anthropological Institute, 15, 246-263.
Galton, F. (1889). Natural Inheritance. London: Macmillan.
Pearson, K. (1896). “Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia.” Philosophical Transactions of the Royal Society of London, 187, 253-318.
Cox, D.R. (1958). “The Regression Analysis of Binary Sequences.” Journal of the Royal Statistical Society: Series B, 20(2), 215-242.
Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth.
Breiman, L. (2001). “Random Forests.” Machine Learning, 45(1), 5-32.
Friedman, J.H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 29(5), 1189-1232.
Chen, T., & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.
Vapnik, V.N. (1995). The Nature of Statistical Learning Theory. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.
Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press.

Module 10 explores the fundamental methods of supervised learning—regression and classification—tracing their origins from Galton’s pioneering statistical work to the ensemble methods that dominate modern machine learning. Through the story of how “regression” got its name, we learn not just the techniques but the deeper insights about prediction, variability, and the statistical nature of the world.