Module 13: Project – Application and Integration

Introduction

Throughout this course, we’ve learned tools: NumPy arrays, pandas DataFrames, scikit-learn models, visualization techniques, statistical tests, neural networks. We’ve explored stories: from Florence Nightingale’s rose diagrams to Geoffrey Hinton’s 40-year quest, from Galton’s peas to the COMPAS controversy.

Now it’s time to put everything together. In this module, you will conceive, develop, and present a complete data science project—from identifying a question through gathering data, analysis, modeling, and interpretation. This is where you become not just a learner of data science, but a practitioner of it.


Part 1: What Makes a Good Data Science Project?

The Data Science Project Lifecycle

Every data science project follows a similar arc:

  1. Question Formulation: What do you want to know? What decisions will this inform?
  2. Data Acquisition: Where will the data come from? How will you obtain it?
  3. Data Exploration: What does the data look like? What are its properties and limitations?
  4. Data Preparation: How will you clean, transform, and engineer features?
  5. Modeling: What techniques will you apply? How will you evaluate them?
  6. Interpretation: What do the results mean? What are the limitations?
  7. Communication: How will you present findings to others?

Characteristics of Strong Projects

Interesting Question: The project addresses something genuinely curious—not just “I trained a model” but “I wanted to understand X and discovered Y.”

Appropriate Scope: Neither too ambitious (impossible to complete) nor too trivial (just running a tutorial). Achievable within the time available with meaningful depth.

Real Data: Working with real data brings real challenges—missing values, inconsistencies, unexpected patterns. These challenges teach more than clean textbook datasets.

Technical Rigor: Proper methodology—appropriate train/test splits, fair comparisons, correct metrics, honest treatment of uncertainty.

Clear Communication: Findings presented in a way others can understand and critique. Visualizations that illuminate, not obscure.

Ethical Awareness: Consideration of who might be affected by the analysis and its conclusions.

Project Types

Exploratory Analysis: “What’s happening here?” Diving deep into a dataset to discover patterns and generate hypotheses.

Predictive Modeling: “Can we predict X?” Building and evaluating models for forecasting or classification.

Causal Investigation: “Does X cause Y?” Using careful methodology to tease apart correlation and causation.

Tool Building: Creating a useful tool—a dashboard, a pipeline, an application that others can use.

Replication and Extension: Reproducing a published analysis, then extending it in new directions.


Part 2: Finding Your Question

Sources of Inspiration

Personal Curiosity: What have you always wondered about? What patterns do you notice in daily life?

Current Events: What’s in the news? COVID trends, climate data, election patterns, economic indicators.

Your Field: If you’re studying biology, economics, engineering, humanities—what questions matter there?

Existing Research: Read data journalism (FiveThirtyEight, The Pudding), academic papers, Kaggle competitions. What questions interest you? What analyses could you extend?

Local Context: What’s happening in your city, region, country? What local data is available?

From Topic to Question

A topic is not a question. “Climate change” is a topic. “How have average temperatures changed in Lebanon over the past 50 years?” is a question. “Can we predict next month’s temperature from historical patterns?” is a different question.

Good data science questions are specific, answerable with data you can realistically obtain, and connected to something you or others genuinely want to know.

The Iteration Process

Your question will evolve as you work. Initial exploration reveals what’s actually in the data. Perhaps your original question is unanswerable because the data doesn’t exist. Perhaps you discover something more interesting along the way.

This iteration is normal and expected. Document your journey—how the question evolved and why.


Part 3: Data Acquisition

Public Datasets

Government Data:

Research Data:

Domain-Specific:

Web Scraping

When data isn’t available in convenient form, you may need to collect it yourself through web scraping. Important considerations: check the site’s terms of service and robots.txt, rate-limit your requests, and cache pages locally so you download each one only once. A minimal scraping sketch follows the tool list below.

Tools: Beautiful Soup, Scrapy, Selenium
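
As a minimal sketch (assuming requests and Beautiful Soup, with a placeholder URL and placeholder CSS selectors that you would replace with your target site’s structure), a first scraper can be only a dozen lines:

```python
# Minimal scraping sketch; the URL and CSS selectors are placeholders for your target site.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"  # hypothetical page
response = requests.get(url, headers={"User-Agent": "student-project"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for item in soup.select("div.listing"):            # placeholder selector
    title = item.select_one("h2")
    price = item.select_one("span.price")
    if title and price:                             # skip malformed entries
        rows.append({"title": title.get_text(strip=True),
                     "price": price.get_text(strip=True)})

print(f"Scraped {len(rows)} rows")
```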

APIs

Many platforms provide APIs for structured data access, which is usually cleaner and more reliable than scraping.
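
For example, a hypothetical JSON API might be queried with requests and loaded into pandas as sketched below; the endpoint, parameters, and response structure are placeholders, and real APIs add their own authentication and rate limits:

```python
# Hypothetical JSON API call; the endpoint, parameters, and field names are placeholders.
import pandas as pd
import requests

endpoint = "https://api.example.com/v1/measurements"   # placeholder endpoint
params = {"city": "Beirut", "start": "2020-01-01", "limit": 1000}

resp = requests.get(endpoint, params=params, timeout=30)
resp.raise_for_status()
records = resp.json()["results"]                       # structure depends on the API

df = pd.DataFrame.from_records(records)
df.to_csv("data/raw/measurements.csv", index=False)    # keep raw downloads immutable
```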

Surveys and Collection

Sometimes you need to collect original data:

Data Quality Assessment

Before diving into analysis, assess your data:
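
A first pass with pandas, sketched below with a placeholder file path, surfaces missing values, duplicates, unexpected types, and suspicious ranges before they can distort your analysis:

```python
# First-pass quality checks with pandas; the file path is a placeholder.
import pandas as pd

df = pd.read_csv("data/raw/survey.csv")

print(df.shape)                                                # rows and columns
print(df.dtypes)                                               # are the types what you expect?
print(df.isna().mean().sort_values(ascending=False).head(10))  # columns with most missingness
print("duplicate rows:", df.duplicated().sum())                # exact duplicates
print(df.describe(include="all").T)                            # ranges, outliers, odd categories
```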


Part 4: Project Structure and Workflow

Directory Structure

Organize your project systematically:

project/
├── data/
│   ├── raw/           # Original, immutable data
│   └── processed/     # Cleaned, transformed data
├── notebooks/
│   ├── 01-exploration.ipynb
│   ├── 02-preprocessing.ipynb
│   └── 03-modeling.ipynb
├── src/               # Python modules for reusable code
├── reports/
│   ├── figures/       # Generated graphics
│   └── final-report.md
├── README.md
└── requirements.txt

Version Control

Use Git from the start: commit early and often, write meaningful commit messages, and keep large raw data files out of the repository (record where they came from instead).

Reproducibility

Your analysis should be reproducible:
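
Pin package versions in requirements.txt and fix random seeds so that data splits and model fits can be rerun exactly; the sketch below shows one common seeding pattern, assuming NumPy and scikit-learn:

```python
# Fix the sources of randomness so splits and fits can be rerun exactly; 42 is arbitrary.
import random

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Toy data standing in for your real features and target.
X = np.random.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

# Pass the same seed to anything that accepts random_state
# (splits, models, cross-validation) so reruns give identical results.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
```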

Documentation

Document as you go: a README describing the project and how to run it, comments explaining non-obvious decisions, and notes on where each dataset came from and when it was obtained.


Part 5: Analysis and Modeling

Exploratory Data Analysis (EDA)

Before modeling, understand your data: distributions of individual variables, missing values, outliers, and relationships between variables.

Visualizations are primary tools for EDA. Generate many plots. Not all will be in the final report—they’re for your understanding.
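
For example, a few lines of pandas and matplotlib (with placeholder paths and column names) produce the kind of quick, disposable plots EDA thrives on:

```python
# Throwaway EDA plots; the file path and column names ("price", "category") are placeholders.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/processed/listings.csv")

df["price"].hist(bins=50)                       # distribution of a numeric column
plt.xlabel("price")
plt.ylabel("count")
plt.savefig("reports/figures/price_hist.png", dpi=150)
plt.close()

df.groupby("category")["price"].median().sort_values().plot(kind="barh")
plt.xlabel("median price")
plt.tight_layout()
plt.savefig("reports/figures/price_by_category.png", dpi=150)
plt.close()

print(df.corr(numeric_only=True).round(2))      # pairwise correlations among numeric columns
```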

Feature Engineering

Transform raw data into useful model inputs: date and time components, ratios and interactions, encodings of categorical variables, aggregations across groups.

Feature engineering often matters more than model selection.
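
A typical pandas pass, sketched here with hypothetical file and column names, extracts date parts, builds ratios, and encodes categorical variables:

```python
# Illustrative feature engineering; all file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("data/processed/orders.csv", parse_dates=["order_date"])

# Date parts often carry signal that a raw timestamp hides.
df["order_month"] = df["order_date"].dt.month
df["order_dow"] = df["order_date"].dt.dayofweek

# Ratios and interactions between existing columns.
df["price_per_item"] = df["total_price"] / df["n_items"]

# One-hot encode a low-cardinality categorical column.
df = pd.get_dummies(df, columns=["region"], drop_first=True)

df.to_csv("data/processed/orders_features.csv", index=False)
```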

Model Selection

Choose appropriate models for your question:
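
One defensible pattern, sketched below for a classification task with scikit-learn and toy data, is to compare a trivial baseline against a few candidate models under the same cross-validation scheme:

```python
# Compare a trivial baseline against candidate models under the same cross-validation scheme.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)  # toy data

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:>10}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If a candidate model cannot clearly beat the baseline, that finding belongs in your report too.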

Evaluation

Rigorous evaluation prevents self-deception:
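
Concretely, hold out a test set that plays no role in model development and report metrics suited to your question; the sketch below assumes a binary classification task on toy data:

```python
# Final check on a held-out test set the model never saw during development.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))   # per-class precision and recall
print("ROC AUC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 3))
```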

Interpretation

What do results mean?
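
Model-agnostic tools such as permutation importance (sketched below with scikit-learn on toy data) help translate “the model performs well” into “these inputs drive its predictions,” keeping in mind that importance is not causation:

```python
# Permutation importance: how much does shuffling each feature hurt held-out performance?
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=1)  # toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=1)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```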


Part 6: Communication

Writing the Report

Your report tells a story:

Introduction: What question are you addressing? Why does it matter?

Data: Where did data come from? What does it contain? What are its limitations?

Methods: What techniques did you use? Why these choices?

Results: What did you find? Show key visualizations.

Discussion: What do results mean? What are limitations? What follow-up questions arise?

Conclusion: What’s the key takeaway?

Visualization Principles
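
Whatever style you adopt, make each figure self-explanatory: a descriptive title, labeled axes with units, and no reliance on surrounding text to decode it. The matplotlib sketch below illustrates this with made-up data:

```python
# A self-explanatory figure: descriptive title, labeled axes with units, readable size.
import matplotlib.pyplot as plt

years = list(range(2015, 2025))                                  # illustrative data only
avg_temp = [19.8, 20.1, 20.0, 20.4, 20.6, 20.3, 20.9, 21.1, 21.0, 21.3]

plt.figure(figsize=(7, 4))
plt.plot(years, avg_temp, marker="o")
plt.title("Average annual temperature, 2015-2024 (illustrative data)")
plt.xlabel("Year")
plt.ylabel("Temperature (°C)")
plt.tight_layout()
plt.savefig("reports/figures/temperature_trend.png", dpi=150)
```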

Presenting Your Work

Prepare for oral presentation:


Part 7: Project Ideas by Domain

Social Sciences / Public Policy

Environment and Climate

Health and Medicine

Business and Economics

Sports and Entertainment

Technology and Web

Culture and Arts


Project Templates

Template 1: Exploratory Analysis

Structure:

  1. Introduce the dataset and its context
  2. Ask 3-5 specific exploratory questions
  3. Investigate each with appropriate visualizations
  4. Synthesize findings into a coherent narrative
  5. Propose follow-up questions or analyses

Example: “What does the NYC 311 complaint data reveal about quality of life in different neighborhoods?”

Template 2: Predictive Modeling

Structure:

  1. Define prediction problem clearly
  2. Acquire and prepare data
  3. Establish baseline model
  4. Develop and compare multiple models
  5. Evaluate thoroughly with appropriate metrics
  6. Interpret results and discuss limitations

Example: “Can we predict which Kickstarter projects will be successfully funded?”

Template 3: Comparative Study

Structure:

  1. Identify phenomenon to compare across groups/time/places
  2. Define comparison framework
  3. Collect and harmonize data
  4. Conduct systematic comparison
  5. Explain observed differences

Example: “How do traffic accident patterns differ between European and American cities?”

Template 4: Tool or Dashboard

Structure:

  1. Identify user need
  2. Design data pipeline
  3. Develop interactive visualization
  4. Deploy accessible tool
  5. Document usage and maintenance

Example: “Build an interactive dashboard for exploring local air quality data”


Grading Rubric

| Criterion | Excellent (90-100) | Good (75-89) | Adequate (60-74) | Needs Work (<60) |
|---|---|---|---|---|
| Question (10%) | Insightful, well-scoped, original question | Clear question, appropriate scope | Basic question, slightly too broad/narrow | Unclear or inappropriate question |
| Data (15%) | Rich, relevant data; thorough quality assessment | Appropriate data; documented limitations | Basic data; minimal quality discussion | Insufficient or problematic data |
| Methodology (25%) | Rigorous, appropriate methods; proper evaluation | Sound methods; reasonable evaluation | Acceptable methods; some issues in evaluation | Flawed methodology or evaluation |
| Analysis (20%) | Deep insights; sophisticated techniques well-applied | Good analysis; techniques correctly used | Basic analysis; some technical issues | Superficial or incorrect analysis |
| Visualization (10%) | Publication-quality; illuminating and elegant | Clear and informative visualizations | Adequate visualizations; some issues | Poor or misleading visualizations |
| Communication (15%) | Clear, compelling narrative; professional presentation | Well-organized; clear writing | Understandable but could be clearer | Disorganized or unclear |
| Reproducibility (5%) | Fully reproducible; excellent documentation | Reproducible with minor issues | Mostly reproducible; some gaps | Not reproducible |

Timeline and Milestones

Week 1: Ideation and Data Assessment

Week 2: Data Acquisition and Exploration

Week 3: Analysis Development

Week 4: Refinement and Writing

Week 5: Presentation and Peer Review


Resources

Tools

Data Sources

Inspiration

Writing


Module 13 guides you through conceiving and executing a complete data science project. This is where the skills from the entire course come together—from data wrangling to modeling to communication—in service of answering a question that matters to you.