DATA 202 Module 1: Data Acquisition - From Web to Warehouse
Introduction
The most sophisticated machine learning algorithm is useless without data. And in the real world, data rarely arrives in clean CSV files ready for analysis. It must be hunted, gathered, scraped, requested, negotiated, and often fought for.
This module explores the art and science of data acquisition—the methods and tools for obtaining data from the wild web, from APIs, from documents, and from streams. We’ll learn not just the technical skills but also the legal, ethical, and practical considerations that separate responsible data collection from digital trespass.
Part 1: The Data Landscape
Where Data Lives
Data exists in many forms across the digital landscape:
Structured APIs: Services that provide data in organized formats (JSON, XML) through programmatic interfaces. Twitter, weather services, financial data providers.
Semi-Structured Web Pages: HTML containing embedded data—product listings, news articles, social media posts—that must be extracted.
Documents: PDFs, images of text, scanned forms that require optical character recognition and parsing.
Databases: Corporate and government databases, some accessible, many locked behind authentication.
Real-Time Streams: Continuous flows of data—stock tickers, sensor readings, social media firehoses.
Unstructured Media: Images, audio, video containing information that must be extracted through recognition systems.
The Data Acquisition Hierarchy
From easiest to most challenging:
- Published Datasets: Already collected, cleaned, documented
- API Data: Structured, documented, but requires coding
- Web Scraping: Requires parsing, more fragile
- Document Extraction: Requires OCR, handling noise
- Real-Time Streams: Requires infrastructure, handling velocity
- Original Collection: Surveys, sensors, experiments
Part 2: Working with APIs
What is an API?
An Application Programming Interface (API) is a contract between software systems. Web APIs allow programs to request data from servers over HTTP, receiving structured responses.
When you visit a weather website, your browser makes requests to their servers and receives HTML to display. When a program uses their API, it makes similar requests but receives structured data (typically JSON) instead of a webpage.
REST APIs
REST (Representational State Transfer) is the dominant architectural style for web APIs:
Resources: Things you want to access (users, tweets, products) identified by URLs
https://api.example.com/users/123https://api.example.com/products?category=electronics
HTTP Methods:
GET: Retrieve dataPOST: Create new dataPUT: Update existing dataDELETE: Remove data
Responses: Typically JSON with status codes
200 OK: Success401 Unauthorized: Need authentication404 Not Found: Resource doesn’t exist429 Too Many Requests: Rate limited
Authentication
Most valuable APIs require authentication:
API Keys: Simple tokens passed in headers or query parameters
headers = {'Authorization': 'Bearer YOUR_API_KEY'}
response = requests.get(url, headers=headers)
OAuth: More complex, used when accessing user data
- Redirect user to service login
- Receive temporary token
- Exchange for access token
- Use access token for requests
Rate Limiting
APIs protect themselves from overuse:
Rate limits: Maximum requests per time period (e.g., 100/hour)
Handling:
- Track your usage
- Implement backoff when limited
- Cache responses to avoid repeated requests
- Respect
Retry-Afterheaders
Working with JSON
JSON (JavaScript Object Notation) is the lingua franca of APIs:
import requests
response = requests.get('https://api.example.com/data')
data = response.json() # Parse JSON to Python dict/list
# Navigate the structure
users = data['users']
for user in users:
print(user['name'], user['email'])
Common APIs for Data Science
General Data:
- OpenWeatherMap: Weather data
- Twitter/X API: Social media (limited free tier)
- Reddit API: Discussions and posts
- Wikipedia API: Encyclopedia data
Financial:
- Alpha Vantage: Stock prices
- Coinbase: Cryptocurrency
- Yahoo Finance: Market data
Geographic:
- OpenStreetMap: Maps and locations
- Google Maps: Geocoding, places
- Mapbox: Mapping services
Government:
- data.gov API: US government data
- World Bank API: Development indicators
- Census API: Population data
Part 3: Web Scraping
When APIs Aren’t Available
Many websites don’t offer APIs, or their APIs are limited. Web scraping extracts data directly from HTML pages.
The Ethics and Legality of Scraping
Web scraping exists in legal gray areas:
Potentially Allowed:
- Public data for personal use
- Research and journalism
- Data your organization owns
- When explicitly permitted
Potentially Problematic:
- Violating Terms of Service
- Overloading servers
- Accessing data behind logins
- Commercial use of scraped data
- Republishing copyrighted content
Check First:
robots.txt: Machine-readable crawling rules- Terms of Service: Legal restrictions
- Rate limits: How much traffic is reasonable
Tools and Libraries
Requests: Make HTTP requests
import requests
html = requests.get('https://example.com').text
Beautiful Soup: Parse HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
titles = soup.find_all('h1')
Selenium: Automate browsers for JavaScript-rendered content
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://example.com')
content = driver.page_source
Scrapy: Full-featured scraping framework for larger projects
HTML Parsing Strategies
Find by tag: soup.find('div'), soup.find_all('a')
Find by class: soup.find_all('div', class_='product')
Find by ID: soup.find(id='main-content')
CSS selectors: soup.select('div.product > span.price')
Navigate structure: element.parent, element.next_sibling
Handling Common Challenges
JavaScript-Rendered Content: Use Selenium or APIs like Splash
Pagination: Follow “next” links or construct page URLs
Rate Limiting: Add delays between requests
import time
for url in urls:
response = requests.get(url)
time.sleep(2) # Wait 2 seconds
Changing Structures: Build robust selectors, add error handling
CAPTCHAs and Blocks: Respect them—they’re telling you to stop
Part 4: Data Formats and Parsing
Common Formats
CSV/TSV: Comma or tab-separated values
- Simple, widely supported
- No nested structures
- Encoding issues common
JSON: JavaScript Object Notation
- Nested structures
- Standard for APIs
- Human-readable
XML: eXtensible Markup Language
- Hierarchical
- Schema validation
- Verbose
Parquet: Columnar binary format
- Efficient for analytics
- Preserves types
- Not human-readable
Excel: Microsoft spreadsheet
- Multiple sheets
- Formatting metadata
- pandas can read directly
Parsing Strategies
Pandas for tabular data:
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
df = pd.read_excel('data.xlsx')
df = pd.read_parquet('data.parquet')
json module for nested structures:
import json
with open('data.json') as f:
data = json.load(f)
xml.etree for XML:
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
Handling Messy Data
Real-world data is messy:
Encoding issues: Specify encoding explicitly
pd.read_csv('data.csv', encoding='utf-8')
Missing values: Identify and handle
pd.read_csv('data.csv', na_values=['NA', 'N/A', ''])
Inconsistent formats: Clean as you parse
df['date'] = pd.to_datetime(df['date'], errors='coerce')
Part 5: Building Data Pipelines
From One-Off Script to Reliable Pipeline
One-off data collection is easy. Reliable, repeatable collection is engineering.
Pipeline Components
Extraction: Get data from sources Transformation: Clean and structure Loading: Store in destination Monitoring: Track success/failure Scheduling: Run automatically
Tools for Pipelines
Simple scheduling: cron (Unix), Task Scheduler (Windows)
Workflow orchestration:
- Apache Airflow: Full-featured, Python-based
- Prefect: Modern alternative
- Luigi: Simpler predecessor
Cloud services:
- AWS Glue, Lambda
- Google Cloud Dataflow
- Azure Data Factory
Error Handling and Reliability
Retry logic: Temporary failures should retry
from tenacity import retry, stop_after_attempt
@retry(stop=stop_after_attempt(3))
def fetch_data(url):
return requests.get(url)
Logging: Record what happened
import logging
logging.info(f"Fetched {len(records)} records from {url}")
Alerting: Know when things break
Idempotency: Running twice should give same result
DEEP DIVE: The Rise and Fall of the Twitter API
The Birth of the API Economy
In 2006, Twitter launched with a simple premise: share 140-character messages with the world. But what made Twitter revolutionary wasn’t just the product—it was the decision to open their data through an API.
The Twitter API allowed anyone to build applications that read and wrote tweets. This spawned an ecosystem of clients, analytics tools, and integrations. TweetDeck, the popular client, was built entirely on the API. Researchers could study information spread in real-time. Businesses monitored brand mentions. Developers experimented with new social applications.
This openness was strategic. Twitter was a small startup competing against Facebook. By opening the API, they gained developers as allies, extending Twitter’s reach far beyond what they could build alone.
The Golden Age (2006-2012)
The early Twitter API was remarkably permissive:
Firehose: The complete stream of all tweets—billions of messages—was available to partners.
Search API: Query historical tweets freely.
Streaming API: Real-time filtered streams of tweets.
High Rate Limits: Generous quotas for authenticated developers.
Researchers published thousands of papers using Twitter data—studying elections, disasters, public health, language, and behavior at unprecedented scale. The Arab Spring protests of 2011 were analyzed in real-time. COVID-19 misinformation would later be tracked through the platform.
Companies built entire businesses on Twitter data:
- DataSift: Sold filtered firehose access
- Gnip: Provided historical data (later acquired by Twitter)
- Social media monitoring tools: Brandwatch, Sprinklr, Hootsuite
The Closing Begins (2012-2022)
As Twitter grew and monetized, the open API became a liability:
2012: Twitter restricted API access, killing third-party clients that competed with their official app.
2013-2018: Gradual tightening—lower rate limits, more restrictive terms, fewer features.
2020: Academic Research Product Track launched—providing better access for researchers, but with application requirements.
2022: Twitter started charging for API access that was previously free.
The shift reflected a fundamental tension: Twitter’s data was valuable, and giving it away undermined their business model. They could charge for data access or use it for their own advertising targeting—not both.
The Musk Era (2023-Present)
When Elon Musk acquired Twitter in 2022, the API transformation accelerated dramatically:
Free tier eliminated: Basic API access that was free now costs hundreds of dollars per month.
Research access restricted: The Academic Research track was discontinued.
Rate limits slashed: Even paid tiers received far fewer requests.
Pricing restructured: The firehose, once available to partners, now costs millions of dollars annually.
For researchers, the change was catastrophic. Decades of methodology built on Twitter data became impossible to replicate. Studies tracking misinformation, public health, and political discourse lost their data source.
Lessons for Data Scientists
The Twitter API saga teaches essential lessons:
-
Data access is not guaranteed: APIs can change or disappear. Don’t assume today’s access will continue.
-
Terms of Service matter: Using an API creates a contractual relationship. Violations have consequences.
-
Data dependency is risk: Building on others’ data creates dependency. Consider alternatives and backups.
-
Store what you collect: If you have legitimate access, archive data you might need later.
-
The economics drive decisions: API access is a business decision. Understand the provider’s incentives.
-
Reproducibility challenges: Research built on proprietary data may not be reproducible.
The Post-Twitter Landscape
As Twitter closed, alternatives emerged:
Mastodon/Fediverse: Decentralized, open protocols Bluesky: Open API from the start Reddit API: Also restricted in 2023 Common Crawl: Web archive data Original collection: Building your own datasets
The era of freely available social media data may be ending. Future data science may require more direct data collection, better relationships with data providers, and more attention to data sustainability.
LECTURE PLAN: Data Acquisition in the Wild
Learning Objectives
By the end of this lecture, students will be able to:
- Design an API data collection strategy
- Build a basic web scraper
- Handle common data formats
- Understand legal and ethical considerations
- Plan for reliable data pipelines
Lecture Structure (90 minutes)
Opening Hook (8 minutes)
The Data Hunt
- Present a scenario: “Your organization needs social media data about a topic. Where do you start?”
- Show the landscape of options
- Introduce the Twitter API story as cautionary tale
- Frame: “Data acquisition is the foundation—and often the bottleneck”
Part 1: The API Approach (20 minutes)
Understanding APIs (7 minutes)
- What is an API? Restaurant metaphor (menu, order, kitchen)
- REST principles: Resources, URLs, HTTP methods
- Live demo: Simple API call in Python
- Show: JSON response structure
Authentication and Limits (7 minutes)
- API keys: The simplest case
- OAuth: When you need user permissions
- Rate limiting: Why and how to handle
- Demo: authenticated request, handle rate limit
Practical Strategies (6 minutes)
- Read documentation first
- Test interactively before scripting
- Handle errors gracefully
- Cache to avoid redundant calls
Part 2: Web Scraping (25 minutes)
When and Whether to Scrape (5 minutes)
- When APIs aren’t available
- Legal considerations: robots.txt, ToS
- Ethical considerations: Server load, privacy
- The gray areas
HTML Parsing (10 minutes)
- How web pages are structured
- Beautiful Soup walkthrough
- Demo: scrape a simple page
- Common patterns: lists, tables, links
Advanced Techniques (10 minutes)
- JavaScript-rendered content: Selenium
- Pagination and crawling
- Handling failures and changes
- When to stop: signs you’re being blocked
Part 3: Data Formats (12 minutes)
The Format Zoo (5 minutes)
- CSV, JSON, XML, Parquet, Excel
- When to use which
- Common pitfalls: encoding, types
Parsing Strategies (7 minutes)
- Pandas for everything tabular
- JSON parsing for nested data
- XML when you must
- Demo: Load messy real-world data
Part 4: Building Reliable Pipelines (15 minutes)
From Script to Pipeline (5 minutes)
- One-off vs. recurring collection
- Components: extract, transform, load
- Scheduling and orchestration
Reliability Engineering (5 minutes)
- Retry logic
- Logging and monitoring
- Error handling
- Idempotency
The Twitter Lesson (5 minutes)
- Story of API access changes
- Implications for research
- Strategies for robustness
- Importance of data archiving
Part 5: Hands-On Preview (5 minutes)
- Introduce the exercise
- Show the target data sources
- Expectations and tips
Wrap-Up (5 minutes)
- Recap key points
- Data acquisition is foundational
- Legal and ethical awareness
- Pipeline thinking
- Preview next module
Materials Needed
- Jupyter notebook with API examples
- Sample websites for scraping practice
- Various data format samples
- Pipeline architecture diagrams
Discussion Questions
- When is it okay to scrape data? When isn’t it?
- What would you do if an API you depend on shut down?
- How do you balance collecting enough data with respecting rate limits?
- What makes a data pipeline reliable?
HANDS-ON EXERCISE: Building a Multi-Source Data Collector
Overview
In this exercise, students will:
- Collect data from a public API
- Scrape data from a website
- Parse and combine multiple data sources
- Build a basic data pipeline with error handling
Prerequisites
- Python 3.8+
- Libraries: requests, beautifulsoup4, pandas, schedule
Setup
# Install required packages
# pip install requests beautifulsoup4 pandas lxml
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
import logging
from datetime import datetime
# Configure logging
logging.basicConfig(level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s')
Part 1: Working with APIs (20 minutes)
# We'll use the Open-Meteo weather API (free, no auth required)
def get_weather(latitude, longitude):
"""
Fetch current weather data from Open-Meteo API.
"""
url = "https://api.open-meteo.com/v1/forecast"
params = {
"latitude": latitude,
"longitude": longitude,
"current_weather": True,
"hourly": "temperature_2m,precipitation",
"timezone": "auto"
}
try:
response = requests.get(url, params=params, timeout=10)
response.raise_for_status() # Raise exception for bad status codes
return response.json()
except requests.exceptions.RequestException as e:
logging.error(f"API request failed: {e}")
return None
# Test the function
weather = get_weather(33.89, 35.50) # Beirut coordinates
if weather:
print(f"Current temperature: {weather['current_weather']['temperature']}°C")
print(f"Wind speed: {weather['current_weather']['windspeed']} km/h")
Task 1.1: Build a multi-city weather collector
cities = [
{"name": "Beirut", "lat": 33.89, "lon": 35.50},
{"name": "Dubai", "lat": 25.27, "lon": 55.30},
{"name": "Cairo", "lat": 30.04, "lon": 31.24},
{"name": "Amman", "lat": 31.95, "lon": 35.93},
{"name": "Riyadh", "lat": 24.77, "lon": 46.74}
]
def collect_regional_weather(cities):
"""
Collect weather for multiple cities with rate limiting.
"""
results = []
for city in cities:
logging.info(f"Fetching weather for {city['name']}")
weather = get_weather(city['lat'], city['lon'])
if weather:
results.append({
'city': city['name'],
'temperature': weather['current_weather']['temperature'],
'windspeed': weather['current_weather']['windspeed'],
'timestamp': datetime.now().isoformat()
})
time.sleep(0.5) # Rate limiting: 500ms between requests
return pd.DataFrame(results)
weather_df = collect_regional_weather(cities)
print(weather_df)
Part 2: Web Scraping (25 minutes)
# We'll scrape a simple, scrapable website for practice
# Example: Books to Scrape (http://books.toscrape.com)
def scrape_books(url="http://books.toscrape.com"):
"""
Scrape book information from Books to Scrape.
"""
try:
response = requests.get(url, timeout=10)
response.raise_for_status()
except requests.exceptions.RequestException as e:
logging.error(f"Failed to fetch page: {e}")
return []
soup = BeautifulSoup(response.text, 'html.parser')
books = []
# Find all book containers
for article in soup.find_all('article', class_='product_pod'):
# Extract title
title = article.h3.a['title']
# Extract price
price_text = article.find('p', class_='price_color').text
price = float(price_text.replace('£', ''))
# Extract rating
rating_class = article.find('p', class_='star-rating')['class'][1]
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
rating = rating_map.get(rating_class, 0)
# Extract availability
availability = article.find('p', class_='instock availability')
in_stock = 'In stock' in availability.text if availability else False
books.append({
'title': title,
'price': price,
'rating': rating,
'in_stock': in_stock
})
return books
# Test the scraper
books = scrape_books()
books_df = pd.DataFrame(books)
print(books_df.head(10))
Task 2.1: Add pagination to scrape multiple pages
def scrape_all_books(base_url="http://books.toscrape.com"):
"""
Scrape all books across multiple pages.
"""
all_books = []
page = 1
while True:
if page == 1:
url = base_url
else:
url = f"{base_url}/catalogue/page-{page}.html"
logging.info(f"Scraping page {page}")
try:
response = requests.get(url, timeout=10)
if response.status_code == 404:
logging.info("No more pages")
break
response.raise_for_status()
except requests.exceptions.RequestException as e:
logging.error(f"Failed to fetch page {page}: {e}")
break
soup = BeautifulSoup(response.text, 'html.parser')
articles = soup.find_all('article', class_='product_pod')
if not articles:
break
for article in articles:
title = article.h3.a['title']
price = float(article.find('p', class_='price_color').text.replace('£', ''))
rating_class = article.find('p', class_='star-rating')['class'][1]
rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
all_books.append({
'title': title,
'price': price,
'rating': rating_map.get(rating_class, 0),
'page': page
})
page += 1
time.sleep(1) # Be respectful
# Limit for exercise purposes
if page > 5:
break
return pd.DataFrame(all_books)
all_books_df = scrape_all_books()
print(f"Scraped {len(all_books_df)} books")
print(all_books_df.describe())
Part 3: Handling Multiple Data Formats (15 minutes)
# Simulate receiving data in different formats
# CSV data
csv_data = """name,value,category
item1,100,A
item2,200,B
item3,150,A
"""
# JSON data
json_data = '''
{
"items": [
{"name": "item4", "value": 175, "category": "B"},
{"name": "item5", "value": 225, "category": "C"}
],
"metadata": {"source": "api", "version": "1.0"}
}
'''
# Parse different formats
from io import StringIO
df_csv = pd.read_csv(StringIO(csv_data))
print("CSV Data:")
print(df_csv)
json_parsed = json.loads(json_data)
df_json = pd.DataFrame(json_parsed['items'])
print("\nJSON Data:")
print(df_json)
# Combine data sources
combined_df = pd.concat([df_csv, df_json], ignore_index=True)
print("\nCombined Data:")
print(combined_df)
Part 4: Building a Data Pipeline (20 minutes)
class DataPipeline:
"""
A simple data collection pipeline with error handling and logging.
"""
def __init__(self, output_dir="."):
self.output_dir = output_dir
self.run_log = []
def log_run(self, source, status, records=0, error=None):
"""Log pipeline run information."""
self.run_log.append({
'timestamp': datetime.now().isoformat(),
'source': source,
'status': status,
'records': records,
'error': str(error) if error else None
})
def fetch_weather_data(self, cities):
"""Fetch weather data with error handling."""
try:
df = collect_regional_weather(cities)
self.log_run('weather', 'success', len(df))
return df
except Exception as e:
self.log_run('weather', 'failed', error=e)
logging.error(f"Weather collection failed: {e}")
return pd.DataFrame()
def fetch_books_data(self, max_pages=3):
"""Fetch books data with error handling."""
try:
all_books = []
for page in range(1, max_pages + 1):
if page == 1:
url = "http://books.toscrape.com"
else:
url = f"http://books.toscrape.com/catalogue/page-{page}.html"
response = requests.get(url, timeout=10)
if response.status_code == 404:
break
soup = BeautifulSoup(response.text, 'html.parser')
for article in soup.find_all('article', class_='product_pod'):
all_books.append({
'title': article.h3.a['title'],
'price': float(article.find('p', class_='price_color').text.replace('£', '')),
})
time.sleep(0.5)
df = pd.DataFrame(all_books)
self.log_run('books', 'success', len(df))
return df
except Exception as e:
self.log_run('books', 'failed', error=e)
logging.error(f"Books collection failed: {e}")
return pd.DataFrame()
def run(self):
"""Execute the full pipeline."""
logging.info("Starting data pipeline")
# Collect from all sources
weather_df = self.fetch_weather_data(cities)
books_df = self.fetch_books_data()
# Save results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
if not weather_df.empty:
weather_file = f"weather_{timestamp}.csv"
weather_df.to_csv(weather_file, index=False)
logging.info(f"Saved weather data to {weather_file}")
if not books_df.empty:
books_file = f"books_{timestamp}.csv"
books_df.to_csv(books_file, index=False)
logging.info(f"Saved books data to {books_file}")
# Return summary
return pd.DataFrame(self.run_log)
# Run the pipeline
pipeline = DataPipeline()
run_summary = pipeline.run()
print("\nPipeline Run Summary:")
print(run_summary)
Part 5: Advanced Topics (10 minutes)
Task 5.1: Add retry logic
from functools import wraps
def retry(max_attempts=3, delay=1):
"""Decorator to retry failed functions."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
last_exception = None
for attempt in range(max_attempts):
try:
return func(*args, **kwargs)
except Exception as e:
last_exception = e
logging.warning(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_attempts - 1:
time.sleep(delay * (attempt + 1)) # Exponential backoff
raise last_exception
return wrapper
return decorator
@retry(max_attempts=3)
def fetch_with_retry(url):
"""Fetch URL with automatic retry."""
response = requests.get(url, timeout=5)
response.raise_for_status()
return response
# Test retry logic (this will fail gracefully)
try:
result = fetch_with_retry("http://httpstat.us/500") # Always returns 500
except Exception as e:
print(f"All retries failed: {e}")
Challenge Questions
-
API Design: Design an API collection strategy for a service that has a 100 requests/hour limit and you need 10,000 records. How would you structure this?
-
Scraping Ethics: You want to collect data from a website for academic research. The robots.txt doesn’t prohibit it, but the ToS says “no automated collection.” What do you do?
-
Data Freshness: You’re building a weather dashboard that needs hourly updates. Design a system that handles temporary API outages gracefully.
-
Deduplication: Your pipeline runs every hour and might collect overlapping data. How would you handle duplicates?
-
Scaling Up: Your single-threaded scraper takes 10 hours to collect all the data you need. How might you speed it up while remaining respectful to the source?
Expected Outputs
Students should submit:
- Working code for API data collection with error handling
- A web scraper with pagination support
- A data pipeline combining multiple sources
- Documentation of legal/ethical considerations for chosen data sources
- Run logs demonstrating successful collection
Evaluation Rubric
| Criteria | Points |
|---|---|
| API collection implementation | 20 |
| Web scraping with error handling | 25 |
| Data format handling | 15 |
| Pipeline design and reliability | 25 |
| Ethics/legal awareness documentation | 10 |
| Code quality | 5 |
| Total | 100 |
Recommended Resources
Documentation
Tutorials
Legal and Ethics
API Directories
References
-
Mitchell, R. (2018). Web Scraping with Python (2nd ed.). O’Reilly Media.
-
Masse, M. (2011). REST API Design Rulebook. O’Reilly Media.
-
Krotov, V., & Silva, L. (2018). “Legality and Ethics of Web Scraping.” Proceedings of AMCIS.
-
Freelon, D. (2018). “Computational Research in the Post-API Age.” Political Communication, 35(4).
-
Bruns, A. (2019). “After the ‘APIcalypse’: Social Media Platforms and Their Fight Against Critical Scholarly Research.” Information, Communication & Society, 22(11).
Module 1 of DATA 202 explores the foundational skill of data acquisition—the methods and tools for obtaining data from APIs, websites, and other sources. Through the story of the Twitter API’s rise and fall, we learn that data access is never guaranteed, and responsible data science requires understanding both the technical and ethical dimensions of data collection.