DATA 202 Module 5: Music and Speech Processing

Introduction

Audio carries information beyond what’s captured in text—emotion in a voice, identity in a song, meaning in intonation. This module explores music information retrieval and speech processing: how machines learn to understand, generate, and transform audio.

Building on Module 8 of DATA 201 (Audio and Signal Processing), we dive deeper into specialized applications: music recommendation, speech recognition, speaker identification, and the emerging world of audio generation.


Part 1: Music Information Retrieval (MIR)

Understanding Music Computationally

Music is structured sound with melody, harmony, rhythm, timbre, and larger-scale structure (phrases, verses, choruses).

MIR extracts these elements and uses them for genre classification, music recommendation, playlist generation, audio fingerprinting, and automatic transcription.

Audio Features for Music

Low-Level Features: spectral centroid, zero-crossing rate, MFCCs, chroma (the quantities computed in the hands-on exercise below)

Mid-Level Features: tempo, beat locations, key, chord progressions

High-Level Features: genre, mood, instrumentation, artist style

Music Recommendation

Content-Based: Analyze audio features, recommend similar

from sklearn.neighbors import NearestNeighbors
# song_features: an (n_songs, n_features) array with one audio feature vector per song
nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(song_features)
# Query with one song's feature vector to retrieve its closest matches
distances, indices = nn.kneighbors(song_features[:1])

Collaborative Filtering: Users who liked X also liked Y
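
As a minimal illustration of the idea (not any particular production system), the sketch below computes item-item similarity on a toy user-song matrix with NumPy; the matrix values are made up for the example.

import numpy as np

# Toy interaction matrix: rows = users, columns = songs, 1 = listened/liked
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# Item-item cosine similarity: songs liked by the same users score high
norms = np.linalg.norm(ratings, axis=0, keepdims=True)
similarity = (ratings.T @ ratings) / (norms.T @ norms + 1e-9)
print(similarity.round(2))  # similarity[i, j] ~ "users who liked song i also liked song j"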

Hybrid: Combine audio analysis with listening patterns

Spotify combines audio features (extracted via neural networks) with collaborative signals and contextual information (time of day, activity).


Part 2: Speech Recognition

From Sound to Words

Automatic Speech Recognition (ASR) converts audio to text. The modern pipeline:

  1. Audio Input: Waveform at 16kHz or higher
  2. Feature Extraction: Mel spectrograms or learned features
  3. Acoustic Model: Neural network mapping audio to phonemes or characters
  4. Language Model: Predict likely word sequences
  5. Decoder: Combine acoustic and language scores for final transcription
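
As a concrete illustration of step 2, the sketch below computes a log-mel spectrogram with librosa; the file name speech.wav and the 16 kHz / 80-mel settings are assumptions for the example, not fixed requirements.

import librosa
import numpy as np

# Load audio at 16 kHz mono (a typical ASR input rate)
y, sr = librosa.load("speech.wav", sr=16000)

# 80-band mel spectrogram, then convert power to decibels (log-mel features)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, n_frames): the acoustic model's input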

Modern ASR Architectures

Encoder-Decoder with Attention: An encoder turns audio frames into hidden representations and a decoder generates text tokens while attending over them (the design used by Whisper).

CTC (Connectionist Temporal Classification): The network emits one label (or a blank) per frame; repeated labels and blanks are collapsed, so training needs no frame-level alignment. A toy sketch follows.
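
As a hedged sketch of the CTC objective (not Whisper's own training loss), the toy example below uses PyTorch's nn.CTCLoss with made-up shapes: 50 frames of acoustic-model output over 20 character classes, scored against a 10-character target without any frame-level labels.

import torch
import torch.nn as nn

# Toy acoustic-model output: 50 time steps, batch of 1, 20 classes (index 0 = blank)
log_probs = torch.randn(50, 1, 20).log_softmax(dim=2)
targets = torch.randint(1, 20, (1, 10))              # 10 target character indices
input_lengths = torch.full((1,), 50, dtype=torch.long)
target_lengths = torch.full((1,), 10, dtype=torch.long)

# CTC marginalizes over all possible alignments between frames and characters
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())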

Transducer/RNN-T: Combines an acoustic encoder with a label-prediction network and a joiner, emitting tokens as the audio streams in, which makes it a standard choice for on-device and streaming recognition.

OpenAI Whisper

Whisper (2022) is a breakthrough in ASR: trained on 680,000 hours of multilingual, multitask supervised data, it transcribes and translates speech across dozens of languages robustly, with no task-specific fine-tuning.

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

# With language specification
result = model.transcribe("arabic_audio.mp3", language="ar")

Speaker Diarization

“Who spoke when?” involves:

  1. Voice activity detection
  2. Speaker embedding extraction
  3. Clustering similar embeddings
  4. Assigning segments to speakers
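
Steps 3-4 can be sketched with scikit-learn's agglomerative clustering; the embeddings below are random placeholders standing in for vectors produced by a pretrained speaker-embedding model, and the distance threshold is chosen arbitrarily for the toy data.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder: 20 speech segments, each with a 192-dim speaker embedding
# (in practice these come from a pretrained speaker-embedding model)
embeddings = np.random.randn(20, 192)

# Merge segments whose embeddings are close; the number of speakers is not
# fixed in advance, only a distance threshold
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=30.0)
speaker_labels = clusterer.fit_predict(embeddings)
print(speaker_labels)  # one speaker ID per segment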

Part 3: Text-to-Speech and Voice Synthesis

From Text to Natural Speech

Text-to-Speech (TTS) has evolved dramatically:

Concatenative (1990s-2000s): Stitch recorded speech fragments

Statistical Parametric (2000s-2010s): Generate acoustic features from statistical models

Neural (2010s-present): End-to-end deep learning

Modern TTS Pipeline

  1. Text Analysis: Normalize, expand abbreviations
  2. Linguistic Analysis: Phonemes, prosody prediction
  3. Acoustic Model: Generate spectrograms (Tacotron, FastSpeech)
  4. Vocoder: Convert spectrogram to audio (WaveNet, HiFi-GAN)
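
As one hedged example of this pipeline packaged end-to-end, the sketch below uses the open-source Coqui TTS library, assuming the TTS package is installed and this Tacotron 2 + vocoder model is available to download.

from TTS.api import TTS

# Load a pretrained acoustic model + vocoder pair (downloads on first use)
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Text analysis, spectrogram generation, and vocoding all happen inside this call
tts.tts_to_file(text="Text to speech has come a long way.", file_path="output.wav")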

Voice Cloning

Modern systems clone a voice from seconds of reference audio, capturing timbre, accent, and speaking style (see the Deep Dive below).

Ethical concerns include consent, impersonation and fraud, misinformation, and the erosion of voice as proof of identity.


Part 4: Music Generation

Neural Music Generation

Symbolic Generation (MIDI/scores): Models predict sequences of note events (pitch, timing, velocity) that are rendered to audio afterwards (e.g., Music Transformer, MuseNet).
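
To make the symbolic representation concrete, here is a small sketch that writes a four-note melody to a MIDI file with the pretty_midi library; the note choices are arbitrary, and a generative model in this setting would be predicting exactly these pitch/start/end events.

import pretty_midi

# Build a tiny MIDI file: one piano track with a four-note melody (C-E-G-C)
pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano
for i, pitch in enumerate([60, 64, 67, 72]):
    note = pretty_midi.Note(velocity=100, pitch=pitch, start=0.5 * i, end=0.5 * (i + 1))
    piano.notes.append(note)
pm.instruments.append(piano)
pm.write("melody.mid")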

Audio Generation: Models generate waveforms or discrete audio tokens directly, producing full recordings rather than scores (e.g., Jukebox, MusicLM).


DEEP DIVE: The Voice Cloning Revolution

Three Seconds to Clone a Voice

In January 2023, Microsoft Research unveiled VALL-E: a neural network that could clone any voice from just three seconds of sample audio. The demonstration was striking—the model captured not just the voice but speaking style, accent, and emotional expression.

VALL-E treated speech synthesis as a language modeling problem:

  1. Train on 60,000 hours of English speech
  2. Learn to predict audio tokens (discrete representations of audio)
  3. Condition on a short reference sample
  4. Generate arbitrary new speech in that voice

The implications rippled immediately:

Banks have reported voice-cloning fraud cases. Scammers clone relatives’ voices from social media videos to make fake emergency calls.

The Response

Technology companies face a dilemma: the same models that power accessibility tools, dubbing, and voice assistants also enable impersonation and fraud.

Watermarking, detection models, and authentication systems are emerging responses, but the cat-and-mouse game continues.


HANDS-ON EXERCISE: Speech and Music Analysis

Part 1: Speech Recognition with Whisper

import whisper

# Load model (options: tiny, base, small, medium, large)
model = whisper.load_model("base")

# Transcribe
result = model.transcribe("speech.wav")
print(result["text"])

# Word-level timestamps
result = model.transcribe("speech.wav", word_timestamps=True)
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")

Part 2: Music Feature Extraction

import librosa
import numpy as np

# Load audio
y, sr = librosa.load("song.mp3")

# Extract features
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

print(f"Tempo: {tempo:.1f} BPM")
print(f"Chroma shape: {chroma.shape}")
print(f"MFCC shape: {mfcc.shape}")

Part 3: Music Similarity

def extract_features(audio_path):
    # Use only the first 30 seconds so every song yields a comparable vector
    y, sr = librosa.load(audio_path, duration=30)
    features = {
        'tempo': librosa.beat.tempo(y=y, sr=sr)[0],
        'mfcc_mean': np.mean(librosa.feature.mfcc(y=y, sr=sr), axis=1),
        'chroma_mean': np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),
        'spectral_centroid': np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
    }
    # Flatten everything into one fixed-length vector per song
    return np.concatenate([
        [features['tempo'], features['spectral_centroid']],
        features['mfcc_mean'],
        features['chroma_mean']
    ])

# Extract features for a set of songs (the file names here are placeholders)
song_paths = ['song1.mp3', 'song2.mp3', 'song3.mp3']
feature_matrix = np.array([extract_features(path) for path in song_paths])

# Find similar songs with nearest neighbors in feature space
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=2).fit(feature_matrix)
distances, indices = nn.kneighbors(feature_matrix[:1])
print(f"Most similar to {song_paths[0]}: {song_paths[indices[0][1]]}")

Libraries

Courses and Tutorials

Papers


Module 5 explores music and speech processing—the technologies for understanding, analyzing, and generating audio content. From speech recognition to music recommendation to voice synthesis, we learn how machines interact with the auditory world.