DATA 202 Module 5: Music and Speech Processing

Introduction

Audio carries information beyond what’s captured in text—emotion in a voice, identity in a song, meaning in intonation. This module explores music information retrieval and speech processing: how machines learn to understand, generate, and transform audio.

Building on Module 8 of DATA 201 (Audio and Signal Processing), we dive deeper into specialized applications: music recommendation, speech recognition, speaker identification, and the emerging world of audio generation.


Part 1: Music Information Retrieval (MIR)

Understanding Music Computationally

Music is structured sound with melody, harmony, rhythm, timbre, and larger-scale structure (phrases, verses, choruses).

MIR extracts these elements and uses them for genre classification, music recommendation, playlist generation, audio fingerprinting, and automatic transcription.

Audio Features for Music

Low-Level Features: spectral centroid, zero-crossing rate, MFCCs, chroma (the quantities computed in the hands-on exercise below)

Mid-Level Features: tempo, beat locations, key, chord progressions

High-Level Features: genre, mood, instrumentation, artist style

Music Recommendation

Content-Based: Analyze audio features, recommend similar

from sklearn.neighbors import NearestNeighbors
# song_features: an (n_songs, n_features) array with one audio feature vector per song
nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(song_features)
# Query with one song's feature vector to retrieve its closest matches
distances, indices = nn.kneighbors(song_features[:1])

Collaborative Filtering: Users who liked X also liked Y
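
As a minimal illustration of the idea (not any particular production system), the sketch below computes item-item similarity on a toy user-song matrix with NumPy; the matrix values are made up for the example.

import numpy as np

# Toy interaction matrix: rows = users, columns = songs, 1 = listened/liked
ratings = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# Item-item cosine similarity: songs liked by the same users score high
norms = np.linalg.norm(ratings, axis=0, keepdims=True)
similarity = (ratings.T @ ratings) / (norms.T @ norms + 1e-9)
print(similarity.round(2))  # similarity[i, j] ~ "users who liked song i also liked song j"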

Hybrid: Combine audio analysis with listening patterns

Spotify combines audio features (extracted via neural networks) with collaborative signals and contextual information (time of day, activity).


Part 2: Speech Recognition

From Sound to Words

Automatic Speech Recognition (ASR) converts audio to text. The modern pipeline:

  1. Audio Input: Waveform at 16kHz or higher
  2. Feature Extraction: Mel spectrograms or learned features
  3. Acoustic Model: Neural network mapping audio to phonemes or characters
  4. Language Model: Predict likely word sequences
  5. Decoder: Combine acoustic and language scores for final transcription
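
As a concrete illustration of step 2, the sketch below computes a log-mel spectrogram with librosa; the file name speech.wav and the 16 kHz / 80-mel settings are assumptions for the example, not fixed requirements.

import librosa
import numpy as np

# Load audio at 16 kHz mono (a typical ASR input rate)
y, sr = librosa.load("speech.wav", sr=16000)

# 80-band mel spectrogram, then convert power to decibels (log-mel features)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (80, n_frames): the acoustic model's input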

Modern ASR Architectures

Encoder-Decoder with Attention: An encoder turns audio frames into hidden representations and a decoder generates text tokens while attending over them (the design used by Whisper).

CTC (Connectionist Temporal Classification): The network emits one label (or a blank) per frame; repeated labels and blanks are collapsed, so training needs no frame-level alignment. A toy sketch follows.
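
As a hedged sketch of the CTC objective (not Whisper's own training loss), the toy example below uses PyTorch's nn.CTCLoss with made-up shapes: 50 frames of acoustic-model output over 20 character classes, scored against a 10-character target without any frame-level labels.

import torch
import torch.nn as nn

# Toy acoustic-model output: 50 time steps, batch of 1, 20 classes (index 0 = blank)
log_probs = torch.randn(50, 1, 20).log_softmax(dim=2)
targets = torch.randint(1, 20, (1, 10))              # 10 target character indices
input_lengths = torch.full((1,), 50, dtype=torch.long)
target_lengths = torch.full((1,), 10, dtype=torch.long)

# CTC marginalizes over all possible alignments between frames and characters
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())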

Transducer/RNN-T: Combines an acoustic encoder with a label-prediction network and a joiner, emitting tokens as the audio streams in, which makes it a standard choice for on-device and streaming recognition.

OpenAI Whisper

Whisper (2022) is a breakthrough in ASR: trained on 680,000 hours of multilingual, multitask supervised data, it transcribes and translates speech across dozens of languages robustly, with no task-specific fine-tuning.

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

# With language specification
result = model.transcribe("arabic_audio.mp3", language="ar")

Speaker Diarization

“Who spoke when?” involves:

  1. Voice activity detection
  2. Speaker embedding extraction
  3. Clustering similar embeddings
  4. Assigning segments to speakers
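
Steps 3-4 can be sketched with scikit-learn's agglomerative clustering; the embeddings below are random placeholders standing in for vectors produced by a pretrained speaker-embedding model, and the distance threshold is chosen arbitrarily for the toy data.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder: 20 speech segments, each with a 192-dim speaker embedding
# (in practice these come from a pretrained speaker-embedding model)
embeddings = np.random.randn(20, 192)

# Merge segments whose embeddings are close; the number of speakers is not
# fixed in advance, only a distance threshold
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=30.0)
speaker_labels = clusterer.fit_predict(embeddings)
print(speaker_labels)  # one speaker ID per segment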

Part 3: Text-to-Speech and Voice Synthesis

From Text to Natural Speech

Text-to-Speech (TTS) has evolved dramatically:

Concatenative (1990s-2000s): Stitch recorded speech fragments

Statistical Parametric (2000s-2010s): Generate acoustic features from statistical models

Neural (2010s-present): End-to-end deep learning

Modern TTS Pipeline

  1. Text Analysis: Normalize, expand abbreviations
  2. Linguistic Analysis: Phonemes, prosody prediction
  3. Acoustic Model: Generate spectrograms (Tacotron, FastSpeech)
  4. Vocoder: Convert spectrogram to audio (WaveNet, HiFi-GAN)
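
As one hedged example of this pipeline packaged end-to-end, the sketch below uses the open-source Coqui TTS library, assuming the TTS package is installed and this Tacotron 2 + vocoder model is available to download.

from TTS.api import TTS

# Load a pretrained acoustic model + vocoder pair (downloads on first use)
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")

# Text analysis, spectrogram generation, and vocoding all happen inside this call
tts.tts_to_file(text="Text to speech has come a long way.", file_path="output.wav")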

Voice Cloning

Modern systems clone a voice from seconds of reference audio, capturing timbre, accent, and speaking style (see the Deep Dive below).

Ethical concerns include consent, impersonation and fraud, misinformation, and the erosion of voice as proof of identity.


Part 4: Music Generation

Neural Music Generation

Symbolic Generation (MIDI/scores): Models predict sequences of note events (pitch, timing, velocity) that are rendered to audio afterwards (e.g., Music Transformer, MuseNet).
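
To make the symbolic representation concrete, here is a small sketch that writes a four-note melody to a MIDI file with the pretty_midi library; the note choices are arbitrary, and a generative model in this setting would be predicting exactly these pitch/start/end events.

import pretty_midi

# Build a tiny MIDI file: one piano track with a four-note melody (C-E-G-C)
pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano
for i, pitch in enumerate([60, 64, 67, 72]):
    note = pretty_midi.Note(velocity=100, pitch=pitch, start=0.5 * i, end=0.5 * (i + 1))
    piano.notes.append(note)
pm.instruments.append(piano)
pm.write("melody.mid")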

Audio Generation: Models generate waveforms or discrete audio tokens directly, producing full recordings rather than scores (e.g., Jukebox, MusicLM).


DEEP DIVE: The Voice Cloning Revolution

Three Seconds to Clone a Voice

In January 2023, Microsoft Research unveiled VALL-E: a neural network that could clone any voice from just three seconds of sample audio. The demonstration was striking—the model captured not just the voice but speaking style, accent, and emotional expression.

VALL-E treated speech synthesis as a language modeling problem:

  1. Train on 60,000 hours of English speech
  2. Learn to predict audio tokens (discrete representations of audio)
  3. Condition on a short reference sample
  4. Generate arbitrary new speech in that voice

The implications rippled immediately:

Banks have reported voice-cloning fraud cases. Scammers clone relatives’ voices from social media videos to make fake emergency calls.

The Response

Technology companies face a dilemma: the same models that power accessibility tools, dubbing, and voice assistants also enable impersonation and fraud.

Watermarking, detection models, and authentication systems are emerging responses, but the cat-and-mouse game continues.


HANDS-ON EXERCISE: Speech and Music Analysis

Part 1: Speech Recognition with Whisper

import whisper

# Load model (options: tiny, base, small, medium, large)
model = whisper.load_model("base")

# Transcribe
result = model.transcribe("speech.wav")
print(result["text"])

# Word-level timestamps
result = model.transcribe("speech.wav", word_timestamps=True)
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")

Part 2: Music Feature Extraction

import librosa
import numpy as np

# Load audio
y, sr = librosa.load("song.mp3")

# Extract features
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
spectral_centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

print(f"Tempo: {tempo:.1f} BPM")
print(f"Chroma shape: {chroma.shape}")
print(f"MFCC shape: {mfcc.shape}")

Part 3: Music Similarity

def extract_features(audio_path):
    # Use only the first 30 seconds so every song yields a comparable vector
    y, sr = librosa.load(audio_path, duration=30)
    features = {
        'tempo': librosa.beat.tempo(y=y, sr=sr)[0],
        'mfcc_mean': np.mean(librosa.feature.mfcc(y=y, sr=sr), axis=1),
        'chroma_mean': np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),
        'spectral_centroid': np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
    }
    # Flatten everything into one fixed-length vector per song
    return np.concatenate([
        [features['tempo'], features['spectral_centroid']],
        features['mfcc_mean'],
        features['chroma_mean']
    ])

# Extract features for a set of songs (the file names here are placeholders)
song_paths = ['song1.mp3', 'song2.mp3', 'song3.mp3']
feature_matrix = np.array([extract_features(path) for path in song_paths])

# Find similar songs with nearest neighbors in feature space
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=2).fit(feature_matrix)
distances, indices = nn.kneighbors(feature_matrix[:1])
print(f"Most similar to {song_paths[0]}: {song_paths[indices[0][1]]}")

Libraries

Courses and Tutorials

Papers


Module 5 explores music and speech processing—the technologies for understanding, analyzing, and generating audio content. From speech recognition to music recommendation to voice synthesis, we learn how machines interact with the auditory world.