DATA 202 Module 11: Capstone Project
Introduction
The capstone project is the culmination of DATA 202: you integrate what you learned in DATA 201 and DATA 202 into a substantial, original data science project. Unlike the DATA 201 project, this capstone challenges you to work with advanced data types, modern AI tools, and production-quality engineering practices.
Part 1: Capstone Expectations
What Distinguishes a Capstone
A capstone project should demonstrate:
Technical Depth: Use of advanced techniques from DATA 202
- Non-traditional data types (audio, video, documents, graphs)
- Modern ML approaches (transformers, foundation models)
- Production-quality implementation
Originality: Not a tutorial reproduction
- Novel question or application
- Creative problem framing
- Contribution beyond coursework
Completeness: Full lifecycle coverage
- Data acquisition (APIs, scraping, collection)
- Rigorous analysis and modeling
- Deployment or deployment-ready system
- Documentation and reproducibility
Reflection: Critical analysis
- What worked and what didn’t
- Limitations and future directions
- Ethical considerations
Scope Guidelines
Too Small:
- Replicating a tutorial with different data
- Simple classification on a clean dataset
- API wrapper without substantial analysis
Too Large:
- Multi-year research project
- Production system for a company
- Something requiring resources beyond your access
Just Right:
- Novel application of course techniques
- Integration of multiple data types or methods
- Working prototype demonstrating capability
Part 2: Project Categories
Category A: Novel Data Applications
Apply advanced data science to a domain or dataset not typically explored:
Examples:
- Analyze Arabic poetry with NLP techniques
- Study traffic patterns in Beirut from video feeds
- Map the network of regional art collectors
- Build a dialect classifier for Levantine Arabic
Requirements:
- Original data collection or curation
- Domain expertise or research
- Novel insights beyond obvious analysis
Category B: Multi-Modal Systems
Combine multiple data types or modalities:
Examples:
- Music video analysis (audio + video)
- Social media sentiment with image and text
- Document processing pipeline (OCR + NLP)
- Podcast transcription and topic modeling
Requirements:
- Integration of techniques from multiple modules
- Meaningful combination (the modalities inform each other, not just run in sequence)
- Demonstration of cross-modal insights
Category C: Deployed Applications
Build a working application that serves predictions:
Examples:
- Real-time translation app for regional dialects
- Document classification service for a domain
- Recommendation system with web interface
- Monitoring dashboard with ML predictions
Requirements:
- Working deployment (local or cloud)
- API or user interface
- Documentation for use and maintenance
- Monitoring or evaluation plan
Category D: Research Replication and Extension
Replicate and extend a published paper:
Examples:
- Replicate a fairness analysis on local data
- Apply a new technique to a classic dataset
- Evaluate claims from a paper with different methods
- Combine approaches from multiple papers
Requirements:
- Complete replication (or documented failure)
- Substantive extension
- Critical analysis of original claims
Part 3: Project Phases
Phase 1: Proposal (Weeks 1-2)
Deliverable: 2-3 page proposal including:
- Problem statement and motivation
- Data sources and acquisition plan
- Methodology overview
- Timeline and milestones
- Potential challenges and mitigations
Review Process: Instructor feedback and approval
Phase 2: Data and Exploration (Weeks 3-4)
Deliverable: Progress report with:
- Acquired/collected data
- Exploratory analysis
- Refined methodology
- Any pivots from proposal
Checkpoint: Verify feasibility and scope
Phase 3: Implementation (Weeks 5-8)
Work: Core technical development
- Feature engineering and preprocessing
- Model development and evaluation
- Integration and deployment (if applicable)
- Iteration based on results
Mid-Point Check-in: Brief progress presentation
Phase 4: Documentation and Presentation (Weeks 9-10)
Deliverables:
- Final report (10-15 pages)
- Working code repository
- Presentation (15-20 minutes)
- Demo (if applicable)
Part 4: Deliverables in Detail
Final Report
Structure:
- Abstract: One paragraph summary
- Introduction: Problem, motivation, contributions
- Related Work: Prior approaches and context
- Data: Sources, collection, description, limitations
- Methods: Approach, techniques, architecture
- Results: Findings, evaluations, comparisons
- Discussion: Interpretation, limitations, implications
- Conclusion: Summary and future work
- References: Proper citations
Quality Expectations:
- Clear, professional writing
- Appropriate visualizations
- Honest treatment of limitations
- Reproducibility information
Code Repository
Requirements:
- README with setup instructions
- requirements.txt or environment.yml
- Organized structure
- Documentation of key functions
- Example usage
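The file-level requirements above can be checked automatically before submission. The following is a minimal sketch, not an official course tool: the function name `missing_repo_items` is hypothetical, and it only covers the two items that can be verified by file name (a README and a dependency file). Structure, function documentation, and example usage still need a human review.

```python
from pathlib import Path
from typing import List


def missing_repo_items(repo_root: str) -> List[str]:
    """Return the file-level requirements that are not found at repo_root.

    Hypothetical helper for illustration; it checks only for a README and
    for requirements.txt or environment.yml.
    """
    root = Path(repo_root)
    missing = []
    # Accept a README with any extension (README.md, README.rst, ...) or none.
    if not any(p.is_file() for p in root.glob("README*")):
        missing.append("README")
    # Either dependency file satisfies the requirement.
    dependency_files = ("requirements.txt", "environment.yml")
    if not any((root / name).is_file() for name in dependency_files):
        missing.append("requirements.txt or environment.yml")
    return missing


if __name__ == "__main__":
    # Check the current directory and report anything that is missing.
    problems = missing_repo_items(".")
    print("All file-level requirements found." if not problems else problems)
```

Running the script from the repository root prints either a confirmation or the list of missing items.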
Optional but Valued:
- Unit tests
- CI/CD configuration
- Docker deployment
- Interactive notebooks
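As an illustration of what "documentation of key functions" and "unit tests" might look like together, here is a small sketch. The function `split_indices` and its parameters are hypothetical examples, not part of any course-provided code; the point is the docstring layout and the matching tests, which use Python's standard `unittest` module.

```python
import random
import unittest
from typing import List, Tuple


def split_indices(
    n_rows: int, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and split them into train and test lists.

    Args:
        n_rows: Number of rows in the dataset.
        test_fraction: Share of rows reserved for testing, strictly between 0 and 1.
        seed: Seed for the shuffle, so the split is reproducible.

    Returns:
        A (train_indices, test_indices) pair with no index in both lists.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


class SplitIndicesTest(unittest.TestCase):
    def test_sizes_and_coverage(self):
        train, test = split_indices(10, test_fraction=0.3, seed=1)
        self.assertEqual(len(test), 3)
        # Every row lands in exactly one of the two lists.
        self.assertEqual(sorted(train + test), list(range(10)))

    def test_same_seed_gives_same_split(self):
        self.assertEqual(split_indices(10, seed=7), split_indices(10, seed=7))


if __name__ == "__main__":
    unittest.main()
```

Fixing the seed in the function signature is one simple way to support the reproducibility expectation: anyone rerunning the code gets the same split.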
Presentation
Components:
- Motivation and question
- Approach and methods
- Key results and insights
- Demo (if applicable)
- Limitations and future work
- Q&A
Delivery:
- Clear visuals (minimal text per slide)
- Narrative structure (tell a story)
- Technical depth calibrated to audience
- Time management
Part 5: Evaluation
Rubric
| Criterion | Weight | Description |
|---|---|---|
| Originality | 15% | Novel question, approach, or application |
| Technical Quality | 25% | Correct methodology, appropriate techniques |
| Data Work | 15% | Acquisition, preparation, documentation |
| Implementation | 20% | Code quality, reproducibility, engineering |
| Results and Analysis | 10% | Meaningful findings, honest evaluation |
| Communication | 10% | Report quality, presentation delivery |
| Reflection | 5% | Limitations, ethics, future directions |
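To see how the weights combine, the sketch below computes a weighted total from per-criterion scores. The weights are copied from the rubric table; the 0-100 scale for each criterion, the function name `weighted_score`, and the example scores are assumptions made for illustration, not the official grading procedure.

```python
from typing import Mapping

# Rubric weights in percent, copied from the table above.
RUBRIC_WEIGHTS = {
    "Originality": 15,
    "Technical Quality": 25,
    "Data Work": 15,
    "Implementation": 20,
    "Results and Analysis": 10,
    "Communication": 10,
    "Reflection": 5,
}


def weighted_score(criterion_scores: Mapping[str, float]) -> float:
    """Combine per-criterion scores into one total on the same scale.

    Assumes each criterion is scored on a 0-100 scale (an illustrative choice).
    """
    missing = set(RUBRIC_WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total = sum(
        weight * criterion_scores[name] for name, weight in RUBRIC_WEIGHTS.items()
    )
    return total / 100


if __name__ == "__main__":
    # Hypothetical project: 80 on every criterion except Technical Quality.
    example = {name: 80.0 for name in RUBRIC_WEIGHTS}
    example["Technical Quality"] = 90.0
    print(weighted_score(example))  # prints 82.5
```

Because Technical Quality carries 25% of the grade, a 10-point gain there moves the total by 2.5 points, while the same gain in Reflection would move it by only 0.5.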
Excellence Markers
A-Level Work:
- Publishable or portfolio-ready
- Could be extended to real-world use
- Demonstrates mastery of course material
- Thoughtful, thorough, polished
B-Level Work:
- Solid technical execution
- Demonstrates competence
- Some limitations in depth or polish
- Room for improvement but fundamentally sound
C-Level Work:
- Meets basic requirements
- Technical issues or superficial analysis
- Incomplete components
- Limited effort beyond minimum
Part 6: Project Ideas
Data Acquisition Focus
- Build a dataset of regional news articles and analyze bias
- Create a database of restaurant reviews for Arabic cities
- Collect and process historical documents from archives
NLP and Language
- Sentiment analysis for Arabic dialects
- Arabic-English code-switching detection
- Named entity recognition for regional proper nouns
- Summarization of news in local languages
Computer Vision
- Traffic analysis from intersection cameras
- Building damage assessment from satellite imagery
- Plant disease detection for regional agriculture
- Art style classification for regional artists
Audio and Speech
- Dialect identification from speech
- Transcription improvements for accented speech
- Music genre classification for regional music
- Podcast analysis and recommendation
Networks and Graphs
- Academic collaboration networks in the region
- Trade networks between Middle Eastern countries
- Social influence mapping on regional platforms
Foundation Models
- Fine-tuning models for domain-specific tasks
- RAG system for specialized knowledge base
- Prompt engineering for specific applications
- Evaluation of model performance across languages
Deployed Systems
- Personal finance tracker with ML insights
- Study assistant with document understanding
- Local event recommendation system
- Health monitoring with sensor data
Part 7: Resources and Support
Office Hours
- Weekly scheduled hours for project consultation
- Additional appointments by request
- Peer feedback sessions
Computing Resources
- Cloud credits available for deployment
- GPU access for model training
- Data storage allocation
Ethical Review
- Projects involving human subjects require ethical review
- Discuss sensitive data with the instructor
- Address privacy and consent requirements
Final Thoughts
The capstone is your opportunity to synthesize everything you’ve learned and create something meaningful. The best projects come from genuine curiosity—questions you actually want to answer, problems you want to solve, tools you want to build.
Start early. Iterate often. Ask for help when stuck. And remember: the goal is learning, not perfection. A project that encountered challenges and documented them honestly is more valuable than one that hid difficulties behind polished output.
Welcome to the final challenge of DATA 202. Make it count.