Module 8: Machines That Hear - Audio and Signal Processing
Introduction
Sound is vibration traveling through air, a continuous wave of pressure changes that our ears translate into the rich tapestry of human experience—music, speech, the warning cry of a child, the whisper of wind through trees. For millennia, sound remained ephemeral: once produced, it vanished forever into silence. Then, in the span of barely 150 years, humanity learned not only to capture sound but to analyze it, manipulate it, and teach machines to understand it.
This module explores the journey from Thomas Edison’s first crackling “Mary Had a Little Lamb” on a tinfoil cylinder to modern AI systems that can transcribe speech, identify songs from ambient noise, generate synthetic voices indistinguishable from humans, and even compose original music. At the heart of this revolution lies a mathematical discovery made two centuries ago by a French mathematician studying heat.
Part 1: The Mathematics of Sound - Fourier’s Gift
Jean-Baptiste Joseph Fourier (1768-1830)
The story of audio signal processing begins not with sound but with heat. Joseph Fourier was born in Auxerre, France, the son of a tailor who died when Fourier was nine. Orphaned and poor, he found refuge in mathematics, eventually becoming a professor and later accompanying Napoleon on his Egyptian campaign, where he helped establish the scientific study of Egyptian artifacts.
But Fourier’s immortal contribution came from his study of heat conduction. In 1807, he presented a paper to the French Academy claiming something that seemed almost magical: any periodic function, no matter how complex or jagged, could be represented as a sum of simple sine and cosine waves. The Academy was skeptical—Lagrange particularly objected—and it took until 1822 for Fourier to publish his full theory in “Théorie analytique de la chaleur” (The Analytical Theory of Heat).
The Fourier Transform
Fourier’s insight was profound: complexity can be decomposed into simplicity. A complex wave—whether describing heat distribution in a metal bar or the sound of a violin—is actually the sum of many simple waves, each with its own frequency and amplitude.
Consider the sound of a violin playing the note A (440 Hz). It’s not a pure 440 Hz tone—that would sound thin and lifeless, like a tuning fork. Instead, it contains the fundamental frequency (440 Hz) plus a series of harmonics (880 Hz, 1320 Hz, 1760 Hz, and so on), each at different amplitudes. This unique “recipe” of harmonics is what gives the violin its characteristic timbre, distinguishing it from a flute or piano playing the same note.
The Fourier Transform is the mathematical operation that takes a signal in the time domain (amplitude changing over time) and reveals its frequency domain representation (which frequencies are present and at what strength). The inverse transform goes the other way, reconstructing the time signal from its frequency components.
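The decomposition described above can be tried on a synthetic note. The following is a minimal sketch, assuming NumPy is available; the helper name `dominant_frequencies` and the harmonic amplitudes (1.0, 0.5, 0.25) are illustrative choices, not measurements of a real violin.

```python
import numpy as np

def dominant_frequencies(signal, sample_rate, count):
    """Return the `count` strongest frequencies (Hz) in a real signal, ascending."""
    spectrum = np.abs(np.fft.rfft(signal))                     # strength per frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)  # bin centers in Hz
    strongest = np.argsort(spectrum)[-count:]                  # indices of the largest bins
    return sorted(round(float(freqs[i]), 1) for i in strongest)

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate   # exactly one second of samples

# A 440 Hz fundamental plus two harmonics (amplitudes are made up for illustration)
note = (1.00 * np.sin(2 * np.pi * 440 * t)
        + 0.50 * np.sin(2 * np.pi * 880 * t)
        + 0.25 * np.sin(2 * np.pi * 1320 * t))

peaks = dominant_frequencies(note, sample_rate, count=3)
print(peaks)  # [440.0, 880.0, 1320.0]

# The inverse transform rebuilds the time-domain signal from its spectrum
rebuilt = np.fft.irfft(np.fft.rfft(note), n=len(note))
print(np.allclose(note, rebuilt))  # True
```

Because the signal lasts exactly one second, each frequency bin is 1 Hz wide and the three components land on exact bins. Real recordings are rarely this tidy, which is one reason short, windowed transforms are used in practice.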
From Analog to Digital: Harry Nyquist and Claude Shannon
For Fourier analysis to be useful for digital audio, we needed to understand how to sample continuous signals. Two engineers at Bell Labs provided the answer:
Harry Nyquist (1889-1976), a Swedish-American engineer, showed in the 1920s that to capture a signal accurately, you must sample it at more than twice its highest frequency component. This minimum sampling rate became known as the Nyquist rate.
Claude Shannon (1916-2001), the father of information theory, rigorously proved this sampling theorem in 1949. Human hearing extends to roughly 20,000 Hz, which is why CD audio samples at 44,100 Hz—just over twice the limit of human perception.
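What goes wrong below the Nyquist rate is easiest to see numerically. The sketch below, assuming NumPy, samples two pure tones at 10,000 Hz and reports the frequency that the sampled data appears to contain. The function name `apparent_frequency` and the chosen tones are illustrative.

```python
import numpy as np

def apparent_frequency(true_hz, sample_rate, duration=1.0):
    """Sample a pure tone and return the strongest frequency seen in the samples."""
    n = int(sample_rate * duration)
    t = np.arange(n) / sample_rate
    samples = np.sin(2 * np.pi * true_hz * t)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

# At a 10 kHz sampling rate, only content below 5 kHz is captured faithfully
print(round(apparent_frequency(3000, 10_000)))  # 3000: below the limit, preserved
print(round(apparent_frequency(7000, 10_000)))  # 3000: above the limit, aliased
```

The 7,000 Hz tone is indistinguishable from a 3,000 Hz tone once sampled. This effect is called aliasing, and it is why recording hardware filters out content above half the sampling rate before digitizing.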
Part 2: Capturing and Storing Sound
Thomas Edison and the Phonograph (1877)
“Mary had a little lamb, its fleece was white as snow…”
These were the first words ever recorded and played back, spoken by Thomas Edison in December 1877 into his newly invented phonograph. The device was brutally simple: a horn channeled sound onto a diaphragm connected to a needle, which etched grooves into tinfoil wrapped around a rotating cylinder. Playing it back reversed the process—the needle following the grooves made the diaphragm vibrate, reproducing the sound.
Edison initially saw the phonograph as a business machine for dictation. He completely missed its potential for music. That insight would come from others, eventually leading to the vinyl record, the tape recorder, and the CD.
Digital Audio: From PCM to MP3
The conversion from analog sound to digital data uses Pulse Code Modulation (PCM):
- Sample the audio at regular intervals (e.g., 44,100 times per second)
- Quantize each sample to a discrete value (e.g., 16 bits = 65,536 possible values)
- Encode the sequence of values as binary data
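The three steps can be sketched in a few lines. This is a minimal illustration assuming NumPy; `pcm16_encode` and `pcm_bytes` are hypothetical helper names, and the 440 Hz test tone is arbitrary.

```python
import numpy as np

def pcm16_encode(signal):
    """Quantize floats in [-1, 1] to 16-bit integers and return little-endian bytes."""
    max_level = 2 ** 15 - 1                          # 32767, the largest 16-bit sample
    clipped = np.clip(signal, -1.0, 1.0)
    quantized = np.round(clipped * max_level).astype("<i2")
    return quantized.tobytes()

def pcm_bytes(seconds, sample_rate=44_100, bits=16, channels=2):
    """Storage needed for uncompressed PCM audio, in bytes."""
    return seconds * sample_rate * (bits // 8) * channels

# 1. Sample: one second of a 440 Hz tone at 44,100 samples per second
sample_rate = 44_100
t = np.arange(sample_rate) / sample_rate
tone = 0.8 * np.sin(2 * np.pi * 440 * t)

# 2 and 3. Quantize to 16 bits and encode as binary data
raw = pcm16_encode(tone)
print(len(raw))        # 88200 bytes for one second of mono audio

# One minute of CD-quality stereo
print(pcm_bytes(60))   # 10584000 bytes, roughly 10.6 MB
```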
Uncompressed digital audio is massive: one minute of CD-quality stereo is about 10 MB. The need for compression led to one of the most successful audio technologies ever:
MP3 (MPEG-1 Audio Layer III): Developed through the 1980s-90s by a team led by Karlheinz Brandenburg at the Fraunhofer Institute in Germany, MP3 exploits psychoacoustic principles—the quirks of human hearing. We can’t hear sounds that are masked by louder nearby frequencies. We’re less sensitive to certain frequencies. MP3 discards the “inaudible” data, achieving 10:1 compression with minimal perceived quality loss.
Part 3: Making Machines Understand Speech
Early Dreams of Talking Machines
The dream of machines that speak and understand speech is old. In the 18th century, Wolfgang von Kempelen built a mechanical speaking machine that could produce vowels and some consonants using bellows, resonators, and a leather “mouth.”
The modern era of speech recognition began at Bell Labs in the 1950s:
Audrey (1952): Built by Davis, Biddulph, and Balashek at Bell Labs, Audrey could recognize spoken digits—but only from a single speaker, in a carefully controlled environment, and it filled an entire room.
Hidden Markov Models: The Statistical Revolution
The breakthrough in speech recognition came not from better acoustics but from better statistics. In the late 1970s and 1980s, researchers at IBM, led by Frederick Jelinek (1932-2010), applied Hidden Markov Models (HMMs) to speech:
Speech is modeled as a sequence of hidden states (phonemes) that produce observable outputs (acoustic features). The “hidden” aspect reflects the fact that we observe the sound, not the underlying phonetic units directly. HMMs provide a principled way to:
- Train the model: learn the probability distributions from labeled data
- Decode: given new audio, find the most likely sequence of words
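Decoding is commonly done with the Viterbi algorithm, which finds the most likely sequence of hidden states for a series of observations. The sketch below is a toy under stated assumptions: two hidden states ("sil" and "speech"), two coarse observations ("quiet" and "loud"), and made-up probabilities. A real recognizer uses phoneme-level states and continuous acoustic features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] holds the best path that ends in state s, with its probability
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that maximizes the path probability
            prev = max(states, key=lambda p: best[p][0] * trans_p[p][s])
            prob = best[prev][0] * trans_p[prev][s] * emit_p[s][obs]
            new_best[s] = (prob, best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical toy model: all numbers are invented for illustration
states = ["sil", "speech"]
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.2, "loud": 0.8}}

prob, path = viterbi(["quiet", "loud", "loud", "quiet"],
                     states, start_p, trans_p, emit_p)
print(path)  # ['sil', 'speech', 'speech', 'sil']
```

Training would estimate the three probability tables from labeled data; here they are simply written down by hand.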
Jelinek famously quipped: “Every time I fire a linguist, the performance of the speech recognizer goes up.” This reflected the power of statistical approaches over rule-based linguistic analysis.
The Deep Learning Revolution
Starting around 2010, deep learning transformed speech recognition:
Deep Neural Networks for Acoustic Modeling (2010-2012): Geoffrey Hinton’s group at Toronto, working with Microsoft, Google, and IBM, showed that deep neural networks dramatically outperformed the Gaussian mixture models (GMMs) then used for acoustic modeling inside HMM-based recognizers.
Recurrent Neural Networks and LSTMs: Long Short-Term Memory networks, invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997, proved ideal for sequential data like speech.
End-to-End Models: Systems like Deep Speech (Baidu, 2014) and Listen, Attend and Spell (Google, 2015) eliminated the traditional pipeline, directly mapping audio to text.
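Transformers and Wav2Vec: Facebook AI’s Wav2Vec 2.0 (2020) and OpenAI’s Whisper (2022) brought Transformer-based models to the forefront. Whisper approaches human-level accuracy on some English benchmarks and transcribes many other languages, though accuracy varies considerably by language and recording conditions.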
Part 4: Music Information Retrieval - Finding Songs in Sound
The Shazam Story
In 1999, Avery Wang and Chris Barton faced an impossible problem: how do you identify a song from a 10-second clip captured on a cell phone in a noisy bar? The audio would be distorted, partial, and competing with conversation, clinking glasses, and other background noise.
Their solution, now known as audio fingerprinting, was elegant:
- Create spectrograms: Convert audio to time-frequency representations
- Find peaks: Identify the loudest points in the spectrogram—these are robust to noise
- Create fingerprints: Hash pairs of peaks and their time differences into compact codes
- Build a database: Fingerprint millions of songs, storing the fingerprints in a searchable database
- Match: Fingerprint the query audio and find matching sequences in the database
The genius was in what they didn’t try to understand. They didn’t identify melody, rhythm, or key. They simply found a robust way to match acoustic patterns. In 2018, Apple acquired Shazam for $400 million.
Music Classification and Generation
Beyond identification, audio ML enables:
Genre Classification: Using features like tempo, spectral properties, and MFCCs (Mel-Frequency Cepstral Coefficients, which capture the “shape” of sound in a way inspired by human perception)
Mood Detection: Training models on labeled data to identify emotional content in music
Music Generation: From WaveNet (DeepMind, 2016) generating raw audio to Jukebox (OpenAI, 2020) creating music with lyrics, to Suno and AIVA creating complete compositions.
Part 5: The Spectrogram - Seeing Sound
Making the Invisible Visible
A spectrogram is a visual representation of sound showing time on the x-axis, frequency on the y-axis, and intensity as color or brightness. It transforms the one-dimensional waveform into a two-dimensional image, revealing patterns invisible to the ear.
The development of the sound spectrograph at Bell Labs in the 1940s was driven by wartime needs—analyzing enemy communications, developing better voice transmission. But it became an essential tool for linguistics, bioacoustics (studying animal sounds), and audio engineering.
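A spectrogram is computed by slicing the waveform into short overlapping frames and taking a Fourier transform of each one. The following is a minimal sketch assuming NumPy; the function name `simple_spectrogram`, the frame size of 256, and the hop of 128 are illustrative choices. Production code would normally use a library routine instead.

```python
import numpy as np

def simple_spectrogram(signal, frame_size=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_size)                  # taper each frame to reduce leakage
    starts = range(0, len(signal) - frame_size + 1, hop)
    frames = np.array([signal[s:s + frame_size] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
# A tone that jumps from 500 Hz to 2000 Hz halfway through the second
signal = np.where(t < 0.5,
                  np.sin(2 * np.pi * 500 * t),
                  np.sin(2 * np.pi * 2000 * t))

spec = simple_spectrogram(signal)
bin_hz = sample_rate / 256                           # width of one frequency bin

early_hz = spec[:, 2].argmax() * bin_hz              # loudest frequency near the start
late_hz = spec[:, -3].argmax() * bin_hz              # loudest frequency near the end
print(spec.shape, early_hz, late_hz)  # (129, 61) 500.0 2000.0
```

The waveform alone does not make the jump in pitch obvious, but the spectrogram shows it as a bright band that moves from one row to another.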
Mel Spectrograms and MFCCs
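Human perception of pitch is not linear in frequency: the 100 Hz step from 100 Hz to 200 Hz is a full octave, while the same 100 Hz step from 1000 Hz to 1100 Hz is less than two semitones. The Mel scale approximates this perceptual nonlinearity. It is roughly linear below about 1 kHz and roughly logarithmic above, so that equal steps on the scale sound about equally far apart.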
Mel-Frequency Cepstral Coefficients (MFCCs) compress the mel spectrogram into a compact representation that captures the essential acoustic features. Developed in the 1980s, MFCCs became the standard features for speech and audio analysis for decades.
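One widely used formula for the mel scale is m = 2595 · log10(1 + f / 700). The sketch below applies it to show how the same 100 Hz step shrinks perceptually at higher frequencies. Note the assumption: several variants of the scale exist, and audio libraries may default to a different one, so the numbers here will not match every tool.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (one common formula; variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 100 Hz step covers far more perceptual distance at low frequencies
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(1100) - hz_to_mel(1000)
print(f"100 -> 200 Hz:   {low_step:.1f} mel")
print(f"1000 -> 1100 Hz: {high_step:.1f} mel")

# By construction, 1000 Hz sits at approximately 1000 mel
print(f"{hz_to_mel(1000):.1f}")
```

A mel spectrogram groups the rows of an ordinary spectrogram into bands spaced evenly on this scale, giving low frequencies more resolution than high ones.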
Part 6: Voice Synthesis - Teaching Machines to Speak
From Voder to WaveNet
At the 1939 World’s Fair in New York, Bell Labs demonstrated the Voder (Voice Operating Demonstrator), the first electronic speech synthesizer. A trained operator used a keyboard and pedals to control the synthetic voice—it was like playing speech as an instrument.
The path from Voder to natural-sounding synthesis was long:
Formant Synthesis (1950s-1980s): Modeling the resonant frequencies (formants) of the vocal tract. The result sounded robotic but intelligible—think early GPS voices.
Concatenative Synthesis (1990s-2000s): Splicing together recorded fragments of real speech. Better quality but required large databases and produced occasional “glitches” at join points.
Statistical Parametric Synthesis: Using HMMs to generate acoustic features, then converting to audio with a vocoder.
WaveNet (2016): DeepMind’s autoregressive neural network generated audio sample-by-sample, producing remarkably natural speech. The first version was too slow for real-time use (requiring minutes to generate seconds of audio), but subsequent optimizations made it practical for Google Assistant.
Neural Vocoders: Systems like WaveGlow and HiFi-GAN generate high-quality audio efficiently from mel spectrograms.
Voice Cloning and Deepfakes
Modern systems like ElevenLabs, Resemble AI, and VALL-E can clone voices from just seconds of sample audio. This raises profound ethical questions: How do we verify the authenticity of audio recordings? What happens when anyone’s voice can be synthetically reproduced?
Part 7: Audio Data Science Pipeline
The Modern Audio Processing Stack
- Acquisition: Recording with appropriate sample rate and bit depth
- Preprocessing: Noise reduction, normalization, silence trimming
- Feature Extraction: Spectrograms, MFCCs, embeddings from pretrained models
- Modeling: Classification, regression, generation
- Evaluation: Word Error Rate (WER) for speech, Mean Opinion Score (MOS) for quality
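Of the metrics above, Word Error Rate is simple to compute: count the word-level substitutions, insertions, and deletions needed to turn the reference transcript into the system output, then divide by the number of reference words. The following is a minimal sketch using the standard edit-distance recurrence; the function name and the whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[-1][-1] / len(ref)

reference = "mary had a little lamb"
print(word_error_rate(reference, "mary had a little lamb"))  # 0.0
print(word_error_rate(reference, "mary had little lamb"))    # 0.2 (one deletion)
print(word_error_rate(reference, "mary has a little lamp"))  # 0.4 (two substitutions)
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.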
Libraries and Tools
- librosa: The standard Python library for audio analysis
- PyTorch Audio (torchaudio): Audio processing integrated with PyTorch
- Hugging Face Transformers: Pre-trained models for speech recognition, speaker identification
- Whisper: OpenAI’s open-source multilingual speech recognition
- Essentia: C++ library with Python bindings for music information retrieval
DEEP DIVE: Shazam and the Invention of Audio Fingerprinting
The Problem: “What’s That Song?”
It’s the late 1990s. You’re at a party, a coffee shop, or driving in your car when you hear an incredible song—but you have no idea what it is. The DJ doesn’t announce it. The barista shrugs. The radio host has moved on. That moment of musical discovery slips away into the void of unknowability.
This experience was universal and frustrating. If you knew some lyrics, you could try searching online (a new possibility in 1999). If you could hum the melody, a musically inclined friend might identify it. But for instrumental music, or when you only caught a brief snippet—you were out of luck.
Avery Wang had earned his PhD in electrical engineering at Stanford, where he worked on audio signal processing under Julius Smith, one of the pioneers of the field. He had long been fascinated by both music and mathematics.
Chris Barton was an MBA student at UC Berkeley with a vision for a consumer service that would identify songs. He had the entrepreneurial drive but needed the technical solution.
When they connected, the match was perfect: Barton’s business vision plus Wang’s signal processing expertise. But they faced a problem that seemed almost impossible to solve.
The Challenges
Consider what makes audio fingerprinting extraordinarily difficult:
- Degradation: The user is recording audio through a phone microphone in a noisy environment. The signal is distorted, compressed, and mixed with ambient sound.
- Partial Matching: The system receives maybe 10-15 seconds of audio that could be from any point in a 3-5 minute song.
- Scale: There are millions of songs. The database must be comprehensive enough to be useful.
- Speed: Users expect answers in seconds. Searching millions of songs with lossy audio seems computationally intractable.
- Robustness: The system must work regardless of whether the source is a crystal-clear studio recording or a crackling radio in a moving car.
The Breakthrough Insight
Wang’s key insight was counter-intuitive: instead of trying to understand the music (melody, rhythm, harmony), he would focus on finding features that were:
- Robust to noise: Survived even in degraded recordings
- Unique enough to discriminate: Distinguished between different songs
- Compact enough to search: Could be indexed efficiently
He called these features “landmarks” or “anchors”—stable points in the time-frequency representation of audio that could be reliably extracted from both the clean original and the noisy recording.
The Algorithm: A Technical Deep Dive
Wang published his approach in 2003 in a paper titled “An Industrial-Strength Audio Search Algorithm.” Here’s how it works:
Step 1: Create a Spectrogram
The audio is converted into a spectrogram—a 2D representation with time on the x-axis and frequency on the y-axis. Each point has an intensity representing how much energy is present at that frequency at that moment.
Step 2: Find Spectral Peaks
Rather than using the entire spectrogram, Wang extracts only the peaks—points that are louder than their local neighborhood. These peaks are remarkably stable. Even when noise is added or the audio is compressed, the same peaks tend to appear because they represent the dominant energy in the signal.
The peak extraction uses a local maximum filter: a point is a peak if it’s larger than all points within some neighborhood (e.g., 20 frequency bins by 20 time frames).
Step 3: Create Fingerprints from Peak Pairs
Here’s the clever part. A single peak isn’t distinctive enough—there might be hundreds of peaks in any audio segment, and different songs could share individual peaks. But pairs of peaks are much more distinctive.
For each peak (the “anchor”), Wang looks at nearby peaks (within a “target zone”) and creates a hash combining:
- Frequency of the anchor point (f1)
- Frequency of the target point (f2)
- Time difference between them (Δt)
This produces a fingerprint hash like: hash(f1, f2, Δt) along with an offset time t1 (when the anchor appears in the song).
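Because all three values are small integers, they can be packed into a single machine word that serves as the hash key. The sketch below shows one way to do this. The bit widths (10 bits per frequency, 12 bits for the time difference) and the function names are assumptions made for illustration, not the layout used by Shazam.

```python
FREQ_BITS = 10    # assumed: frequency bin indices from 0 to 1023
DELTA_BITS = 12   # assumed: time differences from 0 to 4095 frames

def pack_hash(f1, f2, delta_t):
    """Pack (f1, f2, delta_t) into one 32-bit integer key."""
    if not (0 <= f1 < 1 << FREQ_BITS and 0 <= f2 < 1 << FREQ_BITS):
        raise ValueError("frequency bin out of range")
    if not 0 <= delta_t < 1 << DELTA_BITS:
        raise ValueError("time difference out of range")
    return (f1 << (FREQ_BITS + DELTA_BITS)) | (f2 << DELTA_BITS) | delta_t

def unpack_hash(key):
    """Recover (f1, f2, delta_t) from a packed key."""
    delta_t = key & ((1 << DELTA_BITS) - 1)
    f2 = (key >> DELTA_BITS) & ((1 << FREQ_BITS) - 1)
    f1 = key >> (FREQ_BITS + DELTA_BITS)
    return f1, f2, delta_t

key = pack_hash(41, 82, 15)
print(key < 2 ** 32)     # True: the key fits in 32 bits
print(unpack_hash(key))  # (41, 82, 15)

# The anchor time t1 is stored next to the key, not inside it
index = {key: [("song_42", 1290)]}   # hypothetical song ID and anchor frame
```

Keeping t1 out of the key is what makes the hash independent of where in the song the pair occurs.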
Step 4: Build the Database
For each song in the database:
- Extract the spectrogram
- Find all peaks
- Generate all fingerprint hashes
- Store each hash in a hash table with the song ID and offset time
A 3-minute song might generate 10,000-20,000 fingerprint hashes. With millions of songs, the database contains billions of hashes—but hash table lookups are O(1), so searching is fast.
Step 5: Match Query Audio
When a user submits a query:
- Generate fingerprints from the query audio
- Look up each fingerprint hash in the database
- For matches, record the song ID and compute the time offset (database offset - query offset)
- Songs with many matches at a consistent time offset are candidates
The key insight is the time coherence requirement. Random matches will occur—different songs might share some fingerprints. But in the correct song, many fingerprints will match, and they’ll all show the same time offset (because the query’s position in the song is fixed).
For example, if your 10-second query is from 1:30 to 1:40 in the song, then every true matching fingerprint will show an offset of 90 seconds. False matches will have scattered offsets, so they never pile up at a single value and can be filtered out.
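The time-coherence check amounts to a vote over (song, offset) pairs. The sketch below uses invented match data, with times in whole seconds for readability. A real system would work in spectrogram frames and tolerate small offset errors.

```python
from collections import Counter

def best_alignment(matches):
    """Given (song_id, db_time, query_time) hash matches, return the
    (song_id, offset, votes) with the most matches at one consistent offset."""
    votes = Counter((song, db_t - query_t) for song, db_t, query_t in matches)
    (song, offset), count = votes.most_common(1)[0]
    return song, offset, count

# Hypothetical lookup results: song_a matches consistently at a 90-second
# offset, while the remaining matches are coincidental
matches = [
    ("song_a", 91, 1),
    ("song_a", 94, 4),
    ("song_a", 97, 7),
    ("song_a", 20, 6),   # same song, but an inconsistent offset
    ("song_b", 12, 2),
    ("song_b", 40, 5),
]

print(best_alignment(matches))  # ('song_a', 90, 3)
```

Even though song_b has two raw matches, they disagree about the offset, so neither can outvote the three coherent matches for song_a.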
The Result: Magic in Your Pocket
Shazam launched in the UK in 2002 as a phone service—you’d dial a number, hold your phone up to the music, and receive a text message with the song title. By the time smartphones arrived, Shazam was ready to become an app.
The numbers tell the story:
- By 2014: 10 billion songs identified
- By 2018: Over 1 billion app downloads
- Apple’s acquisition price: $400 million
Wang’s algorithm proved so robust that it worked in scenarios he never anticipated. Users Shazam’d songs playing from laptop speakers and from TV commercials. The system also identified recorded music played at loud venues, complete with crowd noise. Humming is a different matter: a hummed melody does not reproduce the recording’s spectral peaks, so identifying it requires a different approach from fingerprinting.
Why This Story Matters for Data Science
The Shazam story embodies several crucial data science principles:
- Feature Engineering Over Model Complexity: Wang didn’t need neural networks or complex machine learning. The power came from clever feature design: finding the right representation that captured what mattered while discarding what didn’t.
- Robustness Through Invariance: By focusing on time-frequency peaks rather than raw audio, the algorithm gained natural invariance to noise and amplitude changes. Good features are robust to irrelevant variations.
- Scalability Through Hashing: The problem seemed to require O(n) search through a database of millions. Hash-based lookup reduced this to O(1) per query fingerprint, making real-time matching possible.
- Physical Intuition: Wang’s engineering insight that spectral peaks are stable wasn’t derived from machine learning. It came from understanding the physics of audio signals and psychoacoustics.
- Less is More: The algorithm works precisely because it ignores most of the audio signal. By extracting only a few thousand robust features per song, matching becomes tractable.
LECTURE PLAN: From Fourier to Shazam - The Mathematics of Music Recognition
Learning Objectives
By the end of this lecture, students will be able to:
- Explain the Fourier Transform and its role in audio analysis
- Create and interpret spectrograms
- Understand the concept of audio fingerprinting
- Implement a simplified audio matching system
- Appreciate the engineering trade-offs in real-world audio systems
Lecture Structure (90 minutes)
Opening Hook (8 minutes)
The Mystery Song
- Play a 10-second audio clip recorded in a noisy environment
- Ask students: “How could a computer identify this song?”
- Demonstrate Shazam identifying it instantly
- Pose the question: “How is this possible? How does it search millions of songs in seconds?”
Part 1: The Sound of Data (20 minutes)
What is Sound? (5 minutes)
- Sound as pressure waves in air
- Demonstrate with tuning forks or audio software
- The waveform: amplitude vs. time
- Limitations of the time-domain view: “What note is this? What instrument?”
Fourier’s Revolution (10 minutes)
- Historical context: Fourier studying heat, not sound
- The core idea: any wave = sum of sine waves
- Interactive demo: build complex waves from simple ones
- Live demonstration: show Fourier decomposition of:
  - A pure sine wave
  - A violin note (fundamental + harmonics)
  - Human speech
From Time to Frequency (5 minutes)
- The spectrogram: the bridge between time and frequency
- Live demo: visualize different sounds as spectrograms
- Show how different instruments have different “fingerprints”
- Demonstrate the distinctiveness of spectrograms
Part 2: Digital Audio Fundamentals (15 minutes)
Sampling and Quantization (7 minutes)
- Nyquist theorem: sample at more than 2x the highest frequency
- CD audio: 44.1 kHz sampling rate, 16-bit depth
- Interactive: what happens with under-sampling (aliasing)
- Compute data rates: why compression matters
Audio Features (8 minutes)
- Raw audio vs. extracted features
- Mel scale: how humans perceive pitch
- MFCCs: the “shape” of sound
- Show how features compress information
- Demo: same song in different features
Part 3: The Shazam Algorithm (30 minutes)
The Challenge (5 minutes)
- The pre-smartphone world: “What’s that song?”
- The technical requirements: noise, partial matching, speed
- Why naive approaches fail
The Solution: Audio Fingerprinting (15 minutes)
- Step through the algorithm:
  - Spectrogram generation
  - Peak detection (draw on board, show examples)
  - Fingerprint creation (pairs of peaks)
  - Hash table storage
  - Matching with time coherence
- Work through example with actual spectrograms
- Show why peak pairs are distinctive
- Demonstrate time coherence filtering
Scaling Up (10 minutes)
- Database design: billions of hashes
- Hash table lookup complexity: O(1)
- Distributed systems for real-world deployment
- Handling edge cases: live performances, covers, remixes
Part 4: Modern Audio ML (12 minutes)
Beyond Fingerprinting (4 minutes)
- Limitations of fingerprinting: must have exact recording
- What about identifying covers? Remixes? Hummed melodies?
Deep Learning for Audio (8 minutes)
- Spectrograms as images: CNNs for audio
- Speech recognition: from HMMs to end-to-end
- Music generation: WaveNet, Jukebox, Suno
- Voice synthesis and its ethical implications
Wrap-Up and Preview (5 minutes)
- Recap: Fourier → spectrogram → fingerprints → matching
- Key insights: feature engineering, scalability, robustness
- Preview the hands-on exercise
- Open questions: What other applications can you imagine for audio fingerprinting?
Materials Needed
- Audio playback system with speakers
- Visualization software (Audacity, librosa in Jupyter)
- Pre-prepared spectrograms and audio clips
- Shazam or similar app for demonstration
- Tuning forks (optional, for physical demonstration)
Discussion Questions
- Why did Wang use pairs of peaks instead of single peaks for fingerprinting?
- What would happen if someone recorded a song at a different speed? Would Shazam still work?
- How might you fingerprint a song to match cover versions?
- What are the ethical implications of audio fingerprinting technology?
HANDS-ON EXERCISE: Building an Audio Analysis and Mini-Shazam System
Overview
In this exercise, students will:
- Load and visualize audio files
- Generate spectrograms and extract features
- Implement a simplified audio fingerprinting algorithm
- Build a small song identification system
Prerequisites
- Python 3.8+
- Libraries: librosa, numpy, scipy, matplotlib, soundfile (hashlib and collections ship with Python)
- Audio files: 5-10 short music clips
Setup
# Install required packages
# pip install librosa numpy scipy matplotlib soundfile
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import maximum_filter
from collections import defaultdict
import hashlib
# Disable warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')
Part 1: Loading and Visualizing Audio (20 minutes)
# Load an audio file
# Note: librosa converts to mono and resamples by default
audio_path = "path/to/your/song.mp3"
y, sr = librosa.load(audio_path, duration=30) # Load first 30 seconds
print(f"Sample rate: {sr} Hz")
print(f"Audio length: {len(y)} samples")
print(f"Duration: {len(y)/sr:.2f} seconds")
# Plot the waveform
plt.figure(figsize=(14, 5))
librosa.display.waveshow(y, sr=sr)
plt.title('Audio Waveform')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.tight_layout()
plt.show()
Task 1.1: Load a song and plot its waveform. What do you notice about the amplitude changes over time?
# Create a spectrogram
D = librosa.stft(y) # Short-Time Fourier Transform
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)
plt.figure(figsize=(14, 6))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.tight_layout()
plt.show()
Task 1.2: Generate a spectrogram for your audio. Can you identify the bass line (low frequencies) and higher-pitched elements?
# Create a mel spectrogram (more aligned with human perception)
S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
# melspectrogram returns a power spectrogram by default, so convert with power_to_db
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)
plt.figure(figsize=(14, 6))
librosa.display.specshow(S_mel_db, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.tight_layout()
plt.show()
Part 2: Feature Extraction (20 minutes)
# Extract MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
plt.figure(figsize=(14, 5))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCCs')
plt.ylabel('MFCC Coefficient')
plt.tight_layout()
plt.show()
print(f"MFCC shape: {mfccs.shape}")
# Each column is a feature vector representing a short time frame
Task 2.1: Extract MFCCs from two different songs (or two different genres). How do the patterns differ?
# Extract chroma features (pitch class representation)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
plt.figure(figsize=(14, 5))
librosa.display.specshow(chroma, sr=sr, x_axis='time', y_axis='chroma')
plt.colorbar()
plt.title('Chroma Features (Pitch Classes)')
plt.tight_layout()
plt.show()
Part 3: Audio Fingerprinting (40 minutes)
Now we’ll implement a simplified version of the Shazam algorithm.
def create_spectrogram(y, sr):
    """
    Create a magnitude spectrogram for fingerprinting.
    The window and hop sizes are illustrative choices, not Shazam's actual settings.
    """
    # Compute the STFT with a fixed window size and hop length
    hop_length = 512
    n_fft = 2048
    D = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    S = np.abs(D)
    return S, hop_length

def find_peaks(S, neighborhood_size=20):
    """
    Find local peaks in the spectrogram.
    These are points that are louder than their local neighborhood.
    Returns a list of (time_index, frequency_index) tuples.
    """
    # Apply maximum filter
    local_max = maximum_filter(S, size=neighborhood_size)
    # Peaks are points equal to their local maximum
    # Also apply a threshold to ignore very quiet peaks
    threshold = np.mean(S) + np.std(S)
    peaks = (S == local_max) & (S > threshold)
    # Get coordinates of peaks (rows are frequency bins, columns are time frames)
    freq_idx, time_idx = np.where(peaks)
    return list(zip(time_idx, freq_idx))

def create_fingerprints(peaks, fan_value=15):
    """
    Create fingerprints from pairs of peaks.
    Each fingerprint is (f1, f2, delta_t) -> (t1)
    """
    fingerprints = []
    # Sort peaks by time
    peaks_sorted = sorted(peaks, key=lambda x: x[0])
    for i, (t1, f1) in enumerate(peaks_sorted):
        # Look at the next 'fan_value' peaks (fewer near the end of the list)
        for j in range(1, min(fan_value + 1, len(peaks_sorted) - i)):
            t2, f2 = peaks_sorted[i + j]
            # Time difference in frames
            delta_t = t2 - t1
            # Only consider peaks within a reasonable time window
            if delta_t > 0 and delta_t < 200:
                # The fingerprint is (f1, f2, delta_t); t1 is stored alongside it
                fingerprint = (f1, f2, delta_t)
                fingerprints.append((fingerprint, t1))
    return fingerprints

def hash_fingerprint(fingerprint):
    """
    Create a hash from a fingerprint tuple.
    """
    f1, f2, delta_t = fingerprint
    key = f"{f1}|{f2}|{delta_t}"
    return hashlib.md5(key.encode()).hexdigest()
Now let’s build a database and matching system:
class AudioFingerprinter:
    def __init__(self):
        self.database = defaultdict(list)  # hash -> [(song_id, offset), ...]
        self.song_names = {}  # song_id -> name
        self.song_count = 0

    def add_song(self, audio_path, song_name):
        """Add a song to the database."""
        print(f"Adding: {song_name}")
        # Load audio
        y, sr = librosa.load(audio_path, duration=60)
        # Create spectrogram
        S, hop_length = create_spectrogram(y, sr)
        # Find peaks
        peaks = find_peaks(S)
        print(f"  Found {len(peaks)} peaks")
        # Create fingerprints
        fingerprints = create_fingerprints(peaks)
        print(f"  Created {len(fingerprints)} fingerprints")
        # Store in database
        song_id = self.song_count
        self.song_names[song_id] = song_name
        self.song_count += 1
        for fingerprint, offset in fingerprints:
            h = hash_fingerprint(fingerprint)
            self.database[h].append((song_id, offset))
        return len(fingerprints)

    def identify(self, query_audio):
        """
        Identify a song from a query audio clip.
        Returns the best matching song and confidence score.
        """
        # Load query audio
        if isinstance(query_audio, str):
            y, sr = librosa.load(query_audio, duration=15)
        else:
            y = query_audio
            sr = 22050  # Default sample rate
        # Create spectrogram and find peaks
        S, hop_length = create_spectrogram(y, sr)
        peaks = find_peaks(S)
        # Create fingerprints
        fingerprints = create_fingerprints(peaks)
        # Count matches for each song and offset
        matches = defaultdict(lambda: defaultdict(int))  # song_id -> offset_diff -> count
        for fingerprint, query_offset in fingerprints:
            h = hash_fingerprint(fingerprint)
            if h in self.database:
                for song_id, db_offset in self.database[h]:
                    # The offset difference tells us where in the song the query came from
                    offset_diff = db_offset - query_offset
                    matches[song_id][offset_diff] += 1
        # Find the best match
        best_song = None
        best_count = 0
        for song_id, offset_counts in matches.items():
            max_count = max(offset_counts.values())
            if max_count > best_count:
                best_count = max_count
                best_song = song_id
        if best_song is not None:
            return self.song_names[best_song], best_count
        else:
            return None, 0
# Demo with synthetic audio (since we might not have actual song files)
def create_synthetic_song(freq1, freq2, duration=10, sr=22050):
    """Create a synthetic song with two frequencies and some noise."""
    # Sample times spaced exactly 1/sr apart
    t = np.arange(int(sr * duration)) / sr
    y = 0.5 * np.sin(2 * np.pi * freq1 * t) + 0.3 * np.sin(2 * np.pi * freq2 * t)
    # Add some noise
    y += 0.1 * np.random.randn(len(y))
    return y, sr
Task 3.1: Create a database with a few songs and test the matching:
import os
import tempfile
import soundfile as sf  # pip install soundfile

# Create fingerprinter
fp = AudioFingerprinter()
# Add synthetic songs (or replace with real audio paths)
songs = [
    (create_synthetic_song(440, 880), "Song A"),
    (create_synthetic_song(330, 660), "Song B"),
    (create_synthetic_song(523, 1046), "Song C"),
]
for (y, sr), name in songs:
    # Save to a temporary WAV file, then add it to the database
    temp_path = os.path.join(tempfile.gettempdir(), f"{name.replace(' ', '_')}.wav")
    sf.write(temp_path, y, sr)
    fp.add_song(temp_path, name)
# Test matching with a short, independently generated clip
query_audio, _ = create_synthetic_song(440, 880, duration=3)
result, confidence = fp.identify(query_audio)
# 'confidence' is the number of hash matches at the best time offset
print(f"\nIdentified: {result} (confidence: {confidence})")
Part 4: Analysis and Extensions (10 minutes)
Task 4.1: Test the robustness of your system by:
- Adding noise to the query
- Using only a portion of the song
- Time-stretching the audio slightly
def add_noise(y, noise_level=0.1):
    """Add Gaussian noise to audio."""
    noise = np.random.randn(len(y)) * noise_level
    return y + noise


def test_robustness(fp, original_audio, song_name):
    """Test matching with various degradations."""
    print(f"\nTesting robustness for {song_name}:")

    # Clean audio (first 30% of the song)
    result, conf = fp.identify(original_audio[:int(len(original_audio) * 0.3)])
    print(f"  Clean (30% of song): {result} (conf: {conf})")

    # Noisy audio (first 30% of the song)
    noisy = add_noise(original_audio, 0.2)
    result, conf = fp.identify(noisy[:int(len(noisy) * 0.3)])
    print(f"  With noise: {result} (conf: {conf})")

    # Very short clip (first 10% of the song)
    result, conf = fp.identify(original_audio[:int(len(original_audio) * 0.1)])
    print(f"  10% of song: {result} (conf: {conf})")
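Task 4.1 also lists slight time-stretching, which the test function above does not cover. The following is a minimal sketch of one way to approximate it, not part of the lab code: `change_speed` is a hypothetical helper that resamples the signal with linear interpolation. Note the assumption: this changes pitch together with duration, so it is a speed change rather than a true pitch-preserving time stretch (librosa provides `librosa.effects.time_stretch` for the latter).

```python
import numpy as np


def change_speed(y, rate=1.02):
    """Resample y so it plays `rate` times faster (hypothetical helper).

    Linear interpolation shifts every frequency by the same factor,
    so pitch rises along with the shorter duration.
    """
    n_out = int(round(len(y) / rate))
    # Fractional positions in the original signal that each output sample reads from
    src_positions = np.arange(n_out) * rate
    return np.interp(src_positions, np.arange(len(y)), y)


# Small demonstration on a one-second 440 Hz tone
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
faster = change_speed(tone, rate=1.02)

# Locate the strongest frequency in the resampled tone
spectrum = np.abs(np.fft.rfft(faster))
peak_hz = np.fft.rfftfreq(len(faster), d=1 / sr)[np.argmax(spectrum)]
print(f"samples: {len(tone)} -> {len(faster)}, dominant frequency: {peak_hz:.1f} Hz")
```

In the lab, the degraded clip could be passed to the matcher in the same way as the other cases, for example `fp.identify(change_speed(original_audio, 1.02))`. Because the fingerprints encode both peak frequencies and the time differences between peaks, a speed change alters both, so match counts would be expected to drop as `rate` moves away from 1.0.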
Challenge Questions
- Feature Selection: Why do we use spectral peaks instead of other audio features for fingerprinting?
- Hash Collisions: What happens if two different songs produce the same fingerprint hash? How does the algorithm handle this?
- Time Invariance: The algorithm uses pairs of peaks with time differences. Why is this important for matching audio from different positions in a song?
- Scaling: Our implementation stores fingerprints in a Python dictionary. What data structures would you use for a database of millions of songs?
- Cover Songs: This algorithm matches exact recordings. How might you modify it to identify cover versions or different recordings of the same song?
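When thinking about the hash-collision and time-invariance questions, it may help to trace the offset voting by hand. The toy sketch below mirrors the counting step of the matcher on invented data: the hash strings, song ids, and offsets are all made up for illustration, and `vote` is a hypothetical stand-in for the matching loop rather than the lab's implementation.

```python
from collections import Counter, defaultdict

# Toy database: hash -> list of (song_id, time offset within that song).
# "h1" and "h3" occur in both songs, i.e. they are hash collisions.
database = {
    "h1": [("song_a", 10), ("song_b", 47)],
    "h2": [("song_a", 12)],
    "h3": [("song_a", 15), ("song_b", 3)],
    "h4": [("song_a", 19)],
}

# Query clip that starts 10 time units into song_a:
# each entry is (hash, time offset within the query clip).
query = [("h1", 0), ("h2", 2), ("h3", 5), ("h4", 9)]


def vote(database, query):
    """Count, per song, how often each (db_offset - query_offset) occurs."""
    votes = defaultdict(Counter)
    for h, query_offset in query:
        for song_id, db_offset in database.get(h, []):
            votes[song_id][db_offset - query_offset] += 1
    return votes


votes = vote(database, query)
for song_id, counts in votes.items():
    offset_diff, count = counts.most_common(1)[0]
    print(f"{song_id}: best offset difference {offset_diff} with {count} vote(s)")
```

In this toy data every hash from the true song agrees on the same offset difference (10), so `song_a` collects four votes in one bin. The two collisions with `song_b` land on different offset differences, so no single bin for `song_b` gets more than one vote.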
Expected Outputs
Students should submit:
- Visualizations of spectrograms for at least 3 different audio sources
- Analysis of how different audio types produce different spectrograms
- A working fingerprinting system that can match short clips
- Performance analysis: accuracy vs. clip length and noise level
- Written reflection on the strengths and limitations of audio fingerprinting
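One possible way to organize the performance analysis is a small sweep over clip length and noise level. The sketch below makes several assumptions: `sweep_accuracy` is a hypothetical helper, it expects an `identify` callable that returns `(name, confidence)` as in the lab's matcher, and `stub_identify` is only a stand-in (it picks the song with the nearest dominant frequency) so the example runs on its own.

```python
import numpy as np


def sweep_accuracy(identify, songs, clip_fractions, noise_levels, trials=5, seed=0):
    """Return {(clip_fraction, noise_level): accuracy} for random clips.

    songs: dict mapping song name -> 1-D audio array.
    identify: callable taking an audio array and returning (name, confidence).
    """
    rng = np.random.default_rng(seed)
    results = {}
    for frac in clip_fractions:
        for noise in noise_levels:
            correct = 0
            total = 0
            for name, y in songs.items():
                clip_len = max(1, int(len(y) * frac))
                for _ in range(trials):
                    # Random start position, then additive Gaussian noise
                    start = rng.integers(0, len(y) - clip_len + 1)
                    clip = y[start:start + clip_len]
                    clip = clip + noise * rng.standard_normal(clip_len)
                    predicted, _confidence = identify(clip)
                    correct += int(predicted == name)
                    total += 1
            results[(frac, noise)] = correct / total
    return results


# Stand-in data and identifier so the sketch is self-contained
sr = 8000
t = np.arange(2 * sr) / sr
songs = {
    "Song A": np.sin(2 * np.pi * 440 * t),
    "Song B": np.sin(2 * np.pi * 330 * t),
}
known_freqs = {"Song A": 440.0, "Song B": 330.0}


def stub_identify(clip):
    """Guess the song whose known frequency is closest to the clip's spectral peak."""
    spectrum = np.abs(np.fft.rfft(clip))
    freqs = np.fft.rfftfreq(len(clip), d=1 / sr)
    peak = freqs[np.argmax(spectrum)]
    name = min(known_freqs, key=lambda n: abs(known_freqs[n] - peak))
    return name, float(spectrum.max())


table = sweep_accuracy(stub_identify, songs, [0.1, 0.3], [0.0, 0.5], trials=3)
for (frac, noise), acc in sorted(table.items()):
    print(f"clip={frac:.0%} noise={noise:.1f} accuracy={acc:.2f}")
```

With the lab's fingerprinter, `fp.identify` would take the place of `stub_identify`, and `songs` would hold the audio arrays that were added to the database. The resulting table can then be plotted or reported directly in the write-up.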
Evaluation Rubric
| Criteria | Points |
|---|---|
| Correct spectrogram generation and visualization | 20 |
| Working peak detection algorithm | 20 |
| Functional fingerprint creation and matching | 25 |
| Robustness testing and analysis | 20 |
| Code quality and documentation | 15 |
| Total | 100 |
Recommended Resources
Books
Technical
- Digital Signal Processing by Alan Oppenheim and Ronald Schafer - The classic DSP textbook
- Speech and Language Processing by Dan Jurafsky and James Martin - Comprehensive NLP/speech text (free online)
- Fundamentals of Music Processing by Meinard Müller - Excellent audio/music analysis book
- The Scientist and Engineer’s Guide to Digital Signal Processing by Steven Smith - Free online, very accessible
Historical and Popular
- The Information by James Gleick - Shannon, information theory, and the digital age
- Chasing Sound: Technology, Culture, and the Art of Studio Recording by Susan Schmidt Horning
- How Music Works by David Byrne - Music, technology, and perception
- Perfecting Sound Forever by Greg Milner - The history of recorded music
Academic Papers
- Wang, A. (2003). “An Industrial-Strength Audio Search Algorithm” - The original Shazam paper
- Hinton, G., et al. (2012). “Deep Neural Networks for Acoustic Modeling in Speech Recognition”
- Oord, A., et al. (2016). “WaveNet: A Generative Model for Raw Audio”
- Radford, A., et al. (2022). “Robust Speech Recognition via Large-Scale Weak Supervision” - Whisper paper
- Baevski, A., et al. (2020). “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”
Video Lectures
- 3Blue1Brown: “But what is the Fourier Transform?” - Beautiful visual explanation
- MIT OpenCourseWare: 6.003 Signals and Systems - Rigorous mathematical treatment
- Stanford CS224S: Spoken Language Processing - Dan Jurafsky’s course
- Computerphile: “Audio Fingerprinting” - Accessible explanation of Shazam
Online Courses
- Coursera: Audio Signal Processing for Music Applications - Offered by Universitat Pompeu Fabra and Stanford University
- Udacity: Digital Signal Processing - Georgia Tech course
- Fast.ai: Practical Deep Learning - Includes audio classification examples
Tools and Libraries
- librosa (https://librosa.org/) - Python audio analysis
- Essentia (https://essentia.upf.edu/) - Music information retrieval
- torchaudio (https://pytorch.org/audio/) - PyTorch audio
- Whisper (https://github.com/openai/whisper) - Open source speech recognition
- Audacity (https://www.audacityteam.org/) - Free audio editor with spectrogram view
- Praat (https://www.fon.hum.uva.nl/praat/) - Speech analysis software
Datasets
- Free Music Archive (https://freemusicarchive.org/) - Open audio for experimentation
- LibriSpeech - Large-scale speech recognition dataset
- GTZAN Genre Collection - 1000 audio tracks for genre classification
- AudioSet (Google) - Large-scale audio event classification
- Common Voice (Mozilla) - Multilingual speech recognition
- UrbanSound8K - Urban environmental sounds
References
- Fourier, J.B.J. (1822). Théorie analytique de la chaleur. Paris: Firmin Didot.
- Shannon, C.E. (1949). “Communication in the Presence of Noise.” Proceedings of the IRE, 37(1), 10-21.
- Wang, A. (2003). “An Industrial-Strength Audio Search Algorithm.” Proceedings of the 4th International Conference on Music Information Retrieval.
- Davis, K.H., Biddulph, R., & Balashek, S. (1952). “Automatic Recognition of Spoken Digits.” The Journal of the Acoustical Society of America, 24(6), 637-642.
- Hinton, G., et al. (2012). “Deep Neural Networks for Acoustic Modeling in Speech Recognition.” IEEE Signal Processing Magazine, 29(6), 82-97.
- Oord, A.v.d., et al. (2016). “WaveNet: A Generative Model for Raw Audio.” arXiv:1609.03499.
- Radford, A., et al. (2022). “Robust Speech Recognition via Large-Scale Weak Supervision.” OpenAI Technical Report.
- McFee, B., et al. (2015). “librosa: Audio and Music Signal Analysis in Python.” Proceedings of the 14th Python in Science Conference.
- Müller, M. (2015). Fundamentals of Music Processing. Springer.
- Rabiner, L.R. (1989). “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” Proceedings of the IEEE, 77(2), 257-286.
- Hochreiter, S., & Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8), 1735-1780.
- Brandenburg, K., & Stoll, G. (1994). “ISO/MPEG-1 Audio: A Generic Standard for Coding of High-Quality Digital Audio.” Journal of the Audio Engineering Society, 42(10), 780-792.
Module 8 explores how data science enables machines to process and understand the world of sound—from the mathematical foundations of the Fourier Transform to modern AI systems that can transcribe, identify, and generate audio. The story of Shazam demonstrates how elegant algorithms and clever engineering can solve seemingly impossible problems.