Ellis & Mandel - Speech and Audio Processing and Recognition, E6820

Datasheet

Year, page count: 2009, 511 page(s)
Language: English
Size: 14 MB
Institution: Columbia University Dept. of Electrical Engineering


Content extract

EE E6820: Speech & Audio Processing & Recognition
Lecture 1: Introduction & DSP
Dan Ellis <dpwe@ee.columbia.edu>, Mike Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
January 22, 2009

Outline
1 Sound and information
2 Course structure
3 DSP review: timescale modification

Sound and information
Sound is air pressure variation: mechanical vibration -> pressure waves in air -> motion of a sensor -> time-varying voltage v(t). Transducers convert air pressure <-> voltage.

What use is sound?
[Figure: waveforms of two footsteps examples, amplitude vs. time / s]
Hearing confers an evolutionary advantage: it provides useful information that complements vision, at a distance, in the dark, around corners. Listeners are highly adapted to 'natural sounds' (including speech).

The scope of audio processing
[Figure: map of audio processing application areas]

The acoustic communication chain
message -> signal -> channel -> receiver -> decoder
(synthesis -> audio processing -> recognition)
Sound is an information bearer: received sound reflects the source(s) plus the effect of the environment (the channel).

Levels of abstraction
Much processing concerns shifting between levels of abstraction, from the concrete sound p(t) through abstract representations (e.g. time-frequency energy) up to 'information', via analysis and synthesis. Different representations serve different tasks: separating aspects, making things explicit, etc.

Course structure
Goals:
- survey topics in sound analysis & processing
- develop an intuition for sound signals
- learn some specific technologies
Course structure:
- weekly assignments (25%)
- midterm event (25%)
- final project (50%)
Text: Speech and Audio Signal Processing, Ben Gold & Nelson Morgan, Wiley, 2000. ISBN: 0-471-35154-7

Web-based
Course website: http://www.ee.columbia.edu/~dpwe/e6820/ for lecture notes, problem sets, examples, etc., plus student web pages for homework.

Course outline
Fundamentals: L1 DSP; L2 Acoustics; L3 Pattern recognition; L4 Auditory perception.
Audio processing: L5 Signal models; L6 Music analysis/synthesis; L7 Audio compression; L8 Spatial sound & rendering.
Applications: L9 Speech recognition; L10 Music retrieval; L11 Signal separation; L12 Multimedia indexing.

Weekly assignments
Research papers:
- journal & conference publications
- summarize & discuss in class
- written summaries on web page + Courseworks discussion
Practical experiments:
- Matlab-based (+ Signal Processing Toolbox)
- direct experience of sound processing
- skills for the project
Book sections

Final project
Most significant part of the course (50% of grade). Oral proposals mid-semester; presentations in the final class + website.
Scope:
- practical (Matlab recommended)
- identify a problem; try some solutions
- evaluation
Topic:
- few restrictions within the world of audio
- investigate other resources
- develop in discussion with me
Citation & plagiarism

Examples of past projects
Automatic prosody classification; model-based note transcription.

DSP review: digital signals
Discrete-time sampling limits bandwidth; discrete-level quantization limits dynamic range:
  x_d[n] = Q( x_c(nT) )
with sampling interval T, sampling frequency Ω_T = 2π/T, and quantizer Q(·) mapping each value to the nearest quantization level (step size ε).

The speech signal: time domain
Speech is a sequence of different sound types:
- Vowel: periodic ("has")
- Fricative: aperiodic ("watch")
- Glide: smooth transition ("watch")
- Stop burst: transient ("dime")
[Figure: waveform of "has a watch thin as a dime" with zoomed insets of each sound type]

Timescale modification (TSM)
Can we modify a sound to make it 'slower', i.e. speech pronounced more slowly (e.g. to help comprehension or analysis), or more quickly for 'speed listening'?
Why not just slow it down?
  x_s(t) = x_o(t/r),   r = slowdown factor (> 1 is slower)
This is equivalent to playback at a different sampling rate, which also shifts the pitch (2x slower is an octave lower).
[Figure: original waveform and the same waveform played 2x slower, r = 2]

Time-domain TSM
Problem: want to preserve local time structure but alter global time structure.
Repeat segments, but: artifacts from abrupt edges.
Cross-fade & overlap:
  y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[ ⌊m/r⌋·L + n ]
[Figure: windowed input segments overlap-added into the stretched output]
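As an illustration of the cross-fade-and-overlap idea, here is a minimal sketch in Python/NumPy (the course practicals use Matlab, so this is not the lecture's code). The hop L, the Hann window, and the choice of advancing the input by L/r per output frame (rather than the slide's ⌊m/r⌋·L) are my assumptions.

```python
import numpy as np

def ola_stretch(x, r, L=512):
    """Naive time-scale modification by overlap-add: windowed segments are
    read from the input at hop roughly L/r and overlap-added into the
    output at hop L, so the result is about r times longer (r > 1 = slower)."""
    w = np.hanning(2 * L)                    # tapered window, 50% overlap
    n_out = int(len(x) * r) + 2 * L
    y = np.zeros(n_out)
    m = 0
    while True:
        in_start = int(m * L / r)            # analysis position in the input
        out_start = m * L                    # synthesis position in the output
        if in_start + 2 * L > len(x) or out_start + 2 * L > n_out:
            break
        y[out_start:out_start + 2 * L] += w * x[in_start:in_start + 2 * L]
        m += 1
    return y

# Example: stretch one second of a 440 Hz tone to twice its length
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
y = ola_stretch(x, r=2.0)
```

Because segments are pasted without regard to waveform alignment, this naive version produces the edge/phasing artifacts mentioned above; SOLA (next slide) fixes that.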

Synchronous overlap-add (SOLA)
Idea: allow some leeway in placing each window to optimize the alignment of the waveforms (K_m maximizes the alignment of the overlapping segments). Hence

  y^m[mL + n] = y^{m-1}[mL + n] + w[n] · x[ ⌊m/r⌋·L + n + K_m ]

where K_m is chosen by normalized cross-correlation over the overlap region:

  K_m = argmax_{0 ≤ K ≤ K_u}  Σ_{n=0}^{N_ov} y^{m-1}[mL+n] · x[⌊m/r⌋L + n + K]
        / sqrt( Σ_n (y^{m-1}[mL+n])² · Σ_n (x[⌊m/r⌋L + n + K])² )
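A sketch extending the OLA code above with the cross-correlation search described here (Python/NumPy, not the lecture's implementation); the search range K_max, overlap length N_ov, and hop L are arbitrary choices.

```python
import numpy as np

def sola_stretch(x, r, L=512, K_max=200, N_ov=256):
    """Synchronous overlap-add: like plain OLA, but each new segment is
    shifted by the lag K (0..K_max) that maximizes its normalized
    cross-correlation with the already-written overlap of the output."""
    w = np.hanning(2 * L)
    y = np.zeros(int(len(x) * r) + 2 * L + K_max)
    m = 0
    while True:
        out0 = m * L
        in0 = int(m * L / r)
        if in0 + 2 * L + K_max > len(x) or out0 + 2 * L > len(y) - K_max:
            break
        if m == 0:
            K = 0
        else:
            ref = y[out0:out0 + N_ov]          # overlap region of the output
            scores = []
            for K in range(K_max + 1):
                seg = x[in0 + K:in0 + K + N_ov]
                denom = np.sqrt(np.sum(ref**2) * np.sum(seg**2)) + 1e-12
                scores.append(np.dot(ref, seg) / denom)
            K = int(np.argmax(scores))
        y[out0:out0 + 2 * L] += w * x[in0 + K:in0 + K + 2 * L]
        m += 1
    return y
```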

The Fourier domain
Fourier Series (periodic continuous x), with Ω_0 = 2π/T:
  x(t) = Σ_k c_k e^{jkΩ_0 t},   c_k = (1/T) ∫_{−T/2}^{T/2} x(t) e^{−jkΩ_0 t} dt
Fourier Transform (aperiodic continuous x):
  x(t) = (1/2π) ∫ X(jΩ) e^{jΩt} dΩ,   X(jΩ) = ∫ x(t) e^{−jΩt} dt
[Figure: periodic waveform and its line spectrum |c_k|; aperiodic waveform and its continuous spectrum |X(jΩ)| (level/dB vs. freq/Hz)]

Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x):
  x[n] = (1/2π) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω,   X(e^{jω}) = Σ_n x[n] e^{−jωn}
Discrete Fourier Transform (N-point x):
  x[n] = (1/N) Σ_k X[k] e^{j2πkn/N},   X[k] = Σ_n x[n] e^{−j2πkn/N}
[Figure: x[n], its periodic spectrum |X(e^{jω})|, and the sampled spectrum |X[k]|]

Sampling and aliasing
Discrete-time signals equal the continuous-time signal at the discrete sampling instants: x_d[n] = x_c(nT).
Sampling cannot represent rapid fluctuations:
  sin( (Ω_M + Ω_T) T n ) = sin( Ω_M T n )   for all n ∈ Z
Nyquist limit (Ω_T/2) from the periodic spectrum: the 'alias' of the 'baseband' signal appears at Ω_T − Ω_M.
[Figure: baseband spectrum G_a(jΩ) and its periodic images G_p(jΩ) around ±Ω_T]

Speech sounds in the Fourier domain
[Figure: time-domain waveforms and frequency-domain spectra (energy/dB vs. freq/Hz) for a vowel (periodic, "has"), a fricative (aperiodic, "watch"), a glide (transition, "watch"), and a stop (transient, "dime")]
dB = 20 log10(amplitude) = 10 log10(power)
The voiced spectrum has pitch (harmonics) plus formants.

Short-time Fourier Transform
Want to localize energy in time and frequency: break the sound into short-time pieces and calculate the DFT of each one. Mathematically,
  X[k, m] = Σ_{n=0}^{N−1} x[n] w[n − mL] exp( −j 2πk(n − mL)/N )
[Figure: short-time windows at hops L, 2L, 3L, ... and the DFT of each]

The Spectrogram
Plot the STFT X[k, m] as a gray-scale image (intensity/dB, freq/Hz vs. time/s).
[Figure: waveform and its spectrogram]

Time-frequency tradeoff
A longer window w[n] gains frequency resolution at the cost of time resolution.
[Figure: wideband spectrogram (48-point window) vs. narrowband spectrogram (256-point window)]
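A minimal sketch of the STFT formula above in Python/NumPy (the course uses Matlab); the window length N, hop L, and Hann window are arbitrary choices here.

```python
import numpy as np

def stft(x, N=256, L=128):
    """Short-time Fourier transform X[k, m]: window the signal with a
    length-N Hann window at hop L and take the DFT of each frame."""
    w = np.hanning(N)
    frames = []
    for m in range((len(x) - N) // L + 1):
        frames.append(np.fft.rfft(w * x[m * L:m * L + N]))
    return np.array(frames).T            # rows = frequency bins k, cols = frames m

# A spectrogram is just the magnitude in dB:
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
S = 20 * np.log10(np.abs(stft(x)) + 1e-10)
print(S.shape)                           # (N/2 + 1 frequency bins, number of frames)
```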

Speech sounds on the Spectrogram
The spectrogram is the most popular speech visualization.
[Figure: spectrogram of "has a watch thin as a dime" with the vowel (periodic, "has"), fricative (aperiodic, "watch"), glide (transition, "watch"), and stop (transient, "dime") regions marked]
Wideband (short window) is better than narrowband (long window) for seeing formants.

TSM with the Spectrogram
Just stretch out the spectrogram?
[Figure: a spectrogram and its time-stretched version]
But how to resynthesize? The spectrogram is only the magnitude |Y[k, m]|.

The Phase Vocoder
Timescale modification in the STFT domain.
Magnitude from the 'stretched' spectrogram, e.g. by linear interpolation:
  |Y[k, m]| = |X[k, m/r]|
But preserve the phase increment between slices, e.g. by a discrete differentiator:
  θ̇_Y[k, m] = θ̇_X[k, m/r]
This does the right thing for a single sinusoid: it keeps the overlapped parts of the sinusoid aligned (θ̇ = ∆θ/∆T, so the synthesis phase advances by ∆θ = θ̇·∆T per slice).

General issues in TSM
- Time window: stretching a narrowband spectrogram
- Malleability of different sounds: vowels stretch well, stops lose their nature
- Not a well-formed problem? We want to alter time without frequency, but time and frequency are not separate; a 'satisfying' result is a subjective judgment, so the solution depends on auditory perception.
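A bare-bones sketch of the phase-vocoder stretch described above (Python/NumPy, not the lecture's code): it interpolates STFT magnitudes at the analysis position m/r and accumulates each bin's measured phase increment. Resynthesis (inverse FFT plus overlap-add) and per-bin phase unwrapping against the bin's expected advance are omitted for brevity; all parameter names are my own.

```python
import numpy as np

def phase_vocoder(X, r):
    """Stretch an STFT X[k, m] (bins x frames) by factor r (> 1 = slower):
    output magnitudes come from linear interpolation of |X| at m/r, while
    the output phase accumulates the measured inter-frame phase increment."""
    bins, frames = X.shape
    out_frames = int(frames * r)
    Y = np.zeros((bins, out_frames), dtype=complex)
    phase = np.angle(X[:, 0])
    for m in range(out_frames):
        a = min(m / r, frames - 2)                    # fractional analysis index
        i = int(a)
        frac = a - i
        mag = (1 - frac) * np.abs(X[:, i]) + frac * np.abs(X[:, i + 1])
        dphi = np.angle(X[:, i + 1]) - np.angle(X[:, i])   # phase increment per slice
        Y[:, m] = mag * np.exp(1j * phase)
        phase = phase + dphi
    return Y
```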

Summary
- Information in sound: lots of it, at multiple levels of abstraction
- Course overview: survey of audio processing topics; practicals, readings, project
- DSP review: digital signals, time domain; Fourier domain, STFT
- Timescale modification: properties of the speech signal; time-domain methods; phase vocoder

References
J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal, pages 1493–1509, 1966.
M. Dolson. The Phase Vocoder: A Tutorial. Computer Music Journal, 10(4):14–27, 1986.
M. Puckette. Phase-locked vocoder. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 222–225, 1995.
A. T. Cemgil and S. J. Godsill. Probabilistic Phase Vocoder and its application to Interpolation of Missing Values in Audio Signals. In 13th European Signal Processing Conference, Antalya, Turkey, 2005.

EE E6820: Speech & Audio Processing & Recognition
Lecture 2: Acoustics
Dan Ellis & Mike Mandel
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
January 29, 2009

Outline
1 The wave equation
2 Acoustic tubes: reflections & resonance
3 Oscillations & musical acoustics
4 Spherical waves & room acoustics

Acoustics & sound
Acoustics is the study of physical waves. (Acoustic) waves transmit energy without permanently displacing matter (e.g. ocean waves). The same math recurs in many domains. Intuition: a pulse going down a rope.

The wave equation
Consider a small section of the rope, with displacement y(x), tension S, and mass ε dx.
[Figure: rope element of length dx under tension S, with slopes φ1 and φ2 at its ends]
The lateral force is
  F_y = S sin(φ_2) − S sin(φ_1) ≈ S (∂²y/∂x²) dx

Wave equation (2)
Newton's law F = ma gives
  S (∂²y/∂x²) dx = ε dx (∂²y/∂t²)
Call c² = S/ε (tension to mass-per-length); hence the Wave Equation:
  c² ∂²y/∂x² = ∂²y/∂t²
a partial DE relating curvature and acceleration.

Solution to the wave equation
If y(x, t) = f(x − ct) for any f(·), then
  ∂y/∂t = −c f′(x − ct),   ∂²y/∂t² = c² f″(x − ct)
  ∂y/∂x = f′(x − ct),      ∂²y/∂x² = f″(x − ct)
which satisfies c² ∂²y/∂x² = ∂²y/∂t²; the same works for y(x, t) = f(x + ct). Hence the general solution is
  y(x, t) = y⁺(x − ct) + y⁻(x + ct)

Solution to the wave equation (2)
y⁺(x − ct) and y⁻(x + ct) are traveling waves: the shape stays constant but changes position, moving ∆x = c·T in time T. c is the traveling-wave velocity (∆x/∆t); y⁺ moves right, y⁻ moves left, and the resultant y(x) is the sum of the two waves.

Wave equation solutions (3)
What is the form of y⁺, y⁻? Any doubly-differentiable function satisfies the wave equation; the actual waveshapes are dictated by boundary conditions, e.g. y(x) at t = 0 plus constraints on y at particular x: e.g. input motion y(0, t) = m(t), rigid termination y(L, t) = 0.

Terminations and reflections
System constraints:
- initial y(x, 0) = 0 (flat rope)
- input y(0, t) = m(t) (at the agent's hand) (→ y⁺)
- termination y(L, t) = 0 (fixed end)
- wave equation y(x, t) = y⁺(x − ct) + y⁻(x + ct)
At the termination: y(L, t) = 0 ⇒ y⁺(L − ct) = −y⁻(L + ct), i.e. y⁺ and y⁻ are mirrored in time and amplitude around x = L ⇒ inverted reflection at the termination. [simulation travel1.m]

Acoustic tubes
Sound waves travel down acoustic tubes: 1-dimensional, very similar to strings. Common situations: wind instrument bores, the ear canal, the vocal tract.

Pressure and velocity
Consider the air particle displacement ξ(x, t). Particle velocity v(x, t) = ∂ξ/∂t, hence volume velocity u(x, t) = A v(x, t). (Relative) air pressure p(x, t) = −(1/κ) ∂ξ/∂x.

Wave equation for a tube
Consider an elemental volume of area dA and length dx (volume dA·dx, mass ρ·dA·dx), with force p·dA on one face and (p + (∂p/∂x)dx)·dA on the other. Newton's law F = ma:
  −(∂p/∂x) dx dA = ρ dA dx (∂v/∂t)   ⇒   ∂p/∂x = −ρ ∂v/∂t
  ∴ c² ∂²ξ/∂x² = ∂²ξ/∂t²,   c = 1/√(ρκ)

Acoustic tube traveling waves
Traveling waves in particle displacement: ξ(x, t) = ξ⁺(x − ct) + ξ⁻(x + ct).
Call u⁺(α) = −cA (∂ξ⁺/∂α)(α) and Z_0 = ρc/A. Then volume velocity:
  u(x, t) = A ∂ξ/∂t = u⁺(x − ct) − u⁻(x + ct)
and pressure:
  p(x, t) = −(1/κ) ∂ξ/∂x = Z_0 [ u⁺(x − ct) + u⁻(x + ct) ]
i.e. (scaled) sum and difference of the traveling waves.

Acoustic traveling waves (2)
Different resultants for pressure and volume velocity:
  u(x, t) = u⁺ − u⁻   (volume velocity)
  p(x, t) = Z_0 [u⁺ + u⁻]   (pressure)

Terminations in tubes
What is the equivalent of a fixed point for tubes? A solid wall forces u(x, t) = 0, hence u⁺ = u⁻; an open end forces p(x, t) = 0, hence u⁺ = −u⁻. The open end is like the fixed point for the rope: it reflects the wave back inverted. Unlike a fixed point, a solid wall reflects the traveling wave without inversion.

Standing waves
Consider a (complex) sinusoidal input u_0(t) = U_0 e^{jωt}. Pressure and volume velocity must have the form K e^{j(ωt+φ)}, hence traveling waves
  u⁺(x − ct) = |A| e^{j(−kx + ωt + φ_A)},   u⁻(x + ct) = |B| e^{j(kx + ωt + φ_B)}
where k = ω/c is the spatial frequency (rad/m) and the wavelength is λ = c/f = 2πc/ω. The pressure and volume velocity resultants show a stationary pattern: standing waves, even when |A| ≠ |B|. [simulation sintwavemov.m]

Standing waves (2)
[Figure: standing-wave pattern for input U_0 e^{jωt}; at kx = π (x = λ/2) the pressure is 0 (node) and the volume velocity is maximum (antinode)]
For a lossless termination (|u⁺| = |u⁻|) we have true nodes and antinodes. Pressure and volume velocity are phase shifted, in space and in time.

Transfer function
Consider a tube excited by u_0(t) = U_0 e^{jωt}. Sinusoidal traveling waves must satisfy the termination 'boundary conditions', satisfied by complex constants A and B in
  u(x, t) = u⁺(x − ct) + u⁻(x + ct) = A e^{j(−kx+ωt)} + B e^{j(kx+ωt)} = e^{jωt} (A e^{−jkx} + B e^{jkx})
The standing-wave pattern scales with the input magnitude; the point of excitation makes a big difference.

Transfer function (2)
For an open-ended tube of length L excited at x = 0 by U_0 e^{jωt}:
  u(x, t) = U_0 e^{jωt} cos(k(L − x)) / cos(kL),   k = ω/c
(matches at x = 0, maximum at x = L), i.e. a standing-wave pattern. Varying L for a given ω (and hence k), the magnitude of U_L depends on L (and ω).

Transfer function (3)
Varying ω for a given L, i.e. at x = L:
  U_L / U_0 = u(L, t) / u(0, t) = 1 / cos(kL) = 1 / cos(ωL/c)
which → ∞ at ωL/c = (2r+1)π/2, r = 0, 1, 2, ... The output volume velocity is always larger than the input, and unbounded for L = (2r+1)πc/(2ω) = (2r+1)λ/4, i.e. resonance (amplitude grows without bound).

Resonant modes
For a lossless tube with L = mλ/4, m odd, u(L)/u(0) is unbounded, meaning the transfer function has a pole on the frequency axis and energy at that frequency sustains indefinitely.
[Figure: quarter-wave mode L = λ_0/4 and the third mode L = 3λ_1/4, ω_1 = 3ω_0; compare to the time domain]
e.g. a 17.5 cm vocal tract with c = 350 m/s ⇒ ω_0 = 2π·500 Hz (then 1500, 2500, ...).
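A quick numerical check of the resonance formula above (Python); the 17.5 cm length and c = 350 m/s are the values from the slide.

```python
# Resonances of a lossless tube closed at one end, open at the other:
# f_r = (2r + 1) * c / (4 * L)
c = 350.0          # speed of sound, m/s
L = 0.175          # vocal-tract length, m (17.5 cm)
resonances = [(2 * r + 1) * c / (4 * L) for r in range(4)]
print(resonances)  # [500.0, 1500.0, 2500.0, 3500.0] Hz
```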

Scattering junctions
At an abrupt change in area between sections k and k+1:
- pressure must be continuous: p_k(x, t) = p_{k+1}(x, t)
- volume velocity must be continuous: u_k(x, t) = u_{k+1}(x, t)
- but the traveling waves u⁺_k, u⁻_k, u⁺_{k+1}, u⁻_{k+1} will be different.
Solving e.g. for u⁻_k and u⁺_{k+1}, with the 'area ratio' r = A_{k+1}/A_k:
  u⁺_{k+1} = (2r/(1+r)) u⁺_k + ((r−1)/(r+1)) u⁻_{k+1}
  u⁻_k    = ((1−r)/(1+r)) u⁺_k + (2/(r+1)) u⁻_{k+1}

Concatenated tube model
The vocal tract acts as a waveguide from the glottis (x = 0, input u_0(t)) to the lips (x = L, output u_L(t)). Discrete approximation: a varying-diameter tube of sections with areas A_k and lengths L_k.

Concatenated tube resonances
Concatenated tubes → scattering junctions → a lattice filter of delays e^{−jωτ_k} and junction coefficients. One can solve for the transfer function, which is all-pole, giving approximate vowel synthesis from resonances. [sound example: ah ee oo]
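A tiny sketch of the scattering relations above (Python); the area values in the example calls are arbitrary.

```python
def scatter(u_plus_k, u_minus_k1, A_k, A_k1):
    """Scattering at an area change A_k -> A_k1: returns the outgoing waves
    (u+_{k+1}, u-_k) from the incoming ones, using r = A_{k+1} / A_k."""
    r = A_k1 / A_k
    u_plus_k1 = (2 * r / (r + 1)) * u_plus_k + ((r - 1) / (r + 1)) * u_minus_k1
    u_minus_k = ((1 - r) / (1 + r)) * u_plus_k + (2 / (r + 1)) * u_minus_k1
    return u_plus_k1, u_minus_k

# Equal areas (r = 1): no reflection, the wave passes straight through
print(scatter(1.0, 0.0, A_k=1.0, A_k1=1.0))   # (1.0, 0.0)
# Constriction (r < 1): part of the incident wave is reflected
print(scatter(1.0, 0.0, A_k=1.0, A_k1=0.5))   # (0.667, 0.333)
```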

Oscillations & musical acoustics
Pitch (periodicity) is the essence of music. Why? Why music?
Different kinds of oscillators:
- simple harmonic motion (tuning fork)
- relaxation oscillator (voice)
- string traveling wave (plucked/struck/bowed)
- air column (nonlinear energy element)

Simple harmonic motion
Basic mechanical oscillation: ẍ = −ω²x, so x = A cos(ωt + φ).
[Figure: spring k + mass m (+ damper ζ)]; ω² = k/m, e.g. a tuning fork.
Not great for music: fundamental (cos ωt) only, and relatively low energy.

Relaxation oscillator
A multi-state process: one state builds up potential (e.g. pressure), then the system switches to a second (release) state, reverts to the first state, and so on. e.g. the vocal folds: http://www.youtube.com/watch?v=ajbcJiYhFKY
The oscillation period depends on force (tension): easy to change, hard to keep stable ⇒ less used in music.

Ringing string
e.g. our original 'rope' example: for length L, tension S, mass/length ε,
  ω² = π²S / (L²ε)
Many musical instruments: guitar (plucked), piano (struck), violin (bowed).
Control of period (pitch): change length (fretting), change tension (tuning a piano), change mass (piano strings). Influence of excitation. [pluck1a.m]

Wind tube
Resonant tube + energy input: a nonlinear element feeding an acoustic waveguide, with scattering junctions (toneholes); ω = πc/(2L) (quarter wavelength). e.g. the clarinet:

- lip pressure keeps the reed closed
- the reflected pressure wave opens the reed
- the reinforced pressure wave passes through
- the finger holes determine the first reflection ⇒ the effective waveguide length

Room acoustics
Sound in free air expands spherically (radius r). Spherical wave equation:
  ∂²p/∂r² + (2/r) ∂p/∂r = (1/c²) ∂²p/∂t²
solved by p(r, t) = (P_0/r) e^{j(ωt − kr)}; energy ∝ p² falls as 1/r².

Effect of rooms (1): Images
Ideal reflections are like multiple sources: virtual (image) sources in addition to the direct path. 'Early echoes' in the room impulse response:
[Figure: room impulse response h_room(t) showing the direct path followed by early echoes]
Actual reflections may be h_r(t), not δ(t).

Effect of rooms (2): Modes
Regularly-spaced echoes behave like acoustic tubes. Real rooms have lots of modes: dense, sustained echoes in the impulse response, and a complex pattern of peaks in the frequency response.

Reverberation
Exponential decay of reflections: h_room(t) ~ e^{−t/T}.
Frequency-dependent: greater absorption at high frequencies ⇒ faster decay.
Size-dependent: larger rooms → longer delays → slower decay.
Sabine's equation:
  RT60 = 0.049 V / (S ᾱ)
The time constant varies with size and absorption.
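A one-line illustration of Sabine's formula as given above (Python). The room dimensions and average absorption coefficient are made-up example values; with the 0.049 constant, V is in cubic feet and S in square feet.

```python
# Sabine's equation: RT60 = 0.049 * V / (S * alpha_bar)
V = 20 * 30 * 10                     # room volume, cubic feet (example)
S = 2 * (20*30 + 20*10 + 30*10)      # total surface area, square feet
alpha_bar = 0.2                      # average absorption coefficient (example)
RT60 = 0.049 * V / (S * alpha_bar)
print(f"RT60 = {RT60:.2f} s")        # about 0.67 s for this small, fairly absorbent room
```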

Summary
- Traveling waves
- Acoustic tubes & resonances
- Musical acoustics & periodicity
- Room acoustics & reverberation
Parting thought: musical bottles.

EE E6820: Speech & Audio Processing & Recognition
Lecture 3: Machine learning, classification, and generative models
Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
February 7, 2008

Outline
1 Classification
2 Generative models
3 Gaussian models
4 Hidden Markov models

Classification and generative models
[Figure: spectrogram of the vowels 'ay' and 'ao', and a scatter of their F1/F2 formant measurements]
Classification:
- discriminative models
- a discrete, categorical random variable of interest
- a fixed set of categories
Generative models:
- descriptive models
- continuous or discrete random variable(s) of interest
- can estimate parameters
- Bayes' rule makes them useful for classification

Building a classifier
- Define classes/attributes: could state explicit rules, but better to define through 'training' examples
- Define a feature space
- Define a decision algorithm: set parameters from examples
- Measure performance: calculated (weighted) error rate
[Figure: Pols vowel formants, "u" (x) vs. "o" (o), F2/Hz vs. F1/Hz]

Classification system parts
Sensor -> signal -> pre-processing/segmentation (e.g. STFT, locate vowels) -> segment -> feature extraction (e.g. formant extraction) -> feature vector -> classification -> class -> post-processing (context constraints, costs/risk).

Feature extraction
The right features are critical: waveform vs. formants vs. cepstra; invariance under irrelevant modifications. Theoretically equivalent features may act very differently in a particular classifier: representations make important aspects explicit and remove irrelevant information. Feature design incorporates 'domain knowledge', although more data ⇒ less need for 'cleverness'. A smaller 'feature space' (fewer dimensions) means simpler models (fewer parameters), less training data needed, and faster training. [inverting MFCCs]

Optimal classification
Minimize the probability of error with the Bayes optimal decision θ̂ = argmax_θi p(θi | x):
  p(error) = ∫ p(error | x) p(x) dx = Σ_i ∫_{Λi} (1 − p(θi | x)) p(x) dx
where Λi is the region of x where θi is chosen; but p(θi | x) is largest in that region, so p(error) is minimized.

Sources of error
[Figure: two scaled class-conditional densities p(x|ω1)·Pr(ω1), p(x|ω2)·Pr(ω2), the decision regions, and the resulting error areas]
- Suboptimal threshold / regions (bias error): use a Bayes classifier
- Incorrect distributions (model error): better distribution models / more training data
- Misleading features ('Bayes error'): irreducible for a given feature set, regardless of classification scheme

Two roads to classification
The optimal classifier is θ̂ = argmax_θi p(θi | x), but we don't know p(θi | x).
Can model the distribution directly, e.g. nearest neighbor, SVM, AdaBoost, a neural net: a map from inputs x to outputs θi, i.e. a discriminative model.
It is often easier to model the data likelihood p(x | θi) and use Bayes' rule to convert to p(θi | x): a generative (descriptive) model.

Nearest neighbor classification
[Figure: Pols vowel formants "u" (x), "o" (o), "a" (+) with a new observation x]
Find the closest match (nearest neighbor). A naive implementation takes O(N) time for N training points. As N → ∞, the error rate approaches twice the Bayes error rate. With K summarized classes, it takes O(K) time. Locality-sensitive hashing gives approximate nearest neighbors in O(d n^{1/c²}) time (Andoni and Indyk, 2006).

Support vector machines
A "large margin" linear classifier for separable data:
- regularization of the margin avoids over-fitting
- can be adapted to non-separable data (C parameter)
- made nonlinear using kernels k(x1, x2) = Φ(x1) · Φ(x2)
[Figure: separating hyperplane (w·x) + b = 0 with margins (w·x) + b = ±1 and the support vectors on them]
The solution depends only on training points near the decision boundary, the support vectors, and is a unique, optimal solution for a given Φ and C.

Generative models
Describe the data using structured probabilistic models: observations are random variables whose distribution depends on model parameters. Source distributions p(x | θi):
- reflect variability in the features
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)
[Figure: overlapping class-conditional densities p(x|ωi) for classes ω1..ω4]

Generative models (2)
Three things to do with generative models:
1. Evaluate the probability of an observation, possibly under multiple parameter settings: p(x), p(x | θ1), p(x | θ2), ...
2. Estimate model parameters from observed data: θ̂ = argmin_θ C(θ*, θ | x)
3. Run the model forward to generate new data: x̃ ~ p(x | θ̂)

Random variables review
Random variables have joint distributions p(x, y).
[Figure: a joint density p(x, y) with means µx, µy, spreads σx, σy, covariance σxy, and its marginal p(y)]
Marginal distribution of y: p(y) = ∫ p(x, y) dx.
Knowing one value in a joint distribution constrains the remainder; the conditional distribution of x given y is
  p(x | y) ≡ p(x, y) / p(y) = p(x, y) / ∫ p(x, y) dx

Bayes' rule
p(x | y) p(y) = p(x, y) = p(y | x) p(x), therefore
  p(y | x) = p(x | y) p(y) / p(x)
⇒ we can reverse conditioning given priors/marginals; terms can be discrete or continuous; it generalizes to more variables,
  p(x, y, z) = p(x | y, z) p(y, z) = p(x | y, z) p(y | z) p(z)
and allows conversion between joint, marginal, and conditional distributions.

Bayes' rule for generative models
Run generative models backwards to compare them:
  p(θ | x) = p(x | θ) · p(θ) / p(x)
(posterior = likelihood x prior / evidence, with evidence p(x) = ∫ p(x | θ) p(θ) dθ).
The posterior is the classification we're looking for: a combination of the prior belief in each class with the likelihood under our model, normalized by the evidence (so the posteriors sum to 1). Objection: priors are often unknown, but omitting them amounts to assuming they are all equal.

Computing probabilities and estimating parameters
We want the probability of the observation under a model, p(x), regardless of parameter settings. The full Bayesian integral
  p(x) = ∫ p(x | θ) p(θ) dθ
is difficult to compute in general; approximate it as p(x | θ̂) with
- Maximum likelihood (ML): θ̂ = argmax_θ p(x | θ)
- Maximum a posteriori (MAP), ML + prior: θ̂ = argmax_θ p(θ | x) = argmax_θ p(x | θ) p(θ)

Model checking
After estimating parameters, run the model forward. Check that:
- the model is rich enough to capture variation in the data
- the parameters are estimated correctly
- there aren't any bugs in your code
Generate data from the model, x̃ ~ p(x | θ), and compare it to the observations: are they similar under some statistics T(x): R^d → R? Can you find the real data set in a group of synthetic data sets? Then go back and update your model accordingly (Gelman et al., 2003, ch. 6).

Gaussian models
The easiest way to model distributions is via a parametric model: assume a known form and estimate a few parameters. The Gaussian model is simple and useful. In 1D:
  p(x | θi) = (1 / (σi √(2π))) exp( −(1/2) ((x − µi)/σi)² )
Parameters: the mean µi and variance σi², fit from the data (e.g. by the sample mean and variance).
[Figure: 1D Gaussian p(x|ωi) with mean µi and width σi]
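Putting Bayes' rule and the 1D Gaussian likelihood together, here is a small sketch of a generative classifier (Python/NumPy); the class means, standard deviations, and priors are invented example numbers, not fitted values.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """1D Gaussian likelihood p(x | theta_i)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def classify(x, params, priors):
    """Bayes classification: posterior(i) proportional to p(x | theta_i) * p(theta_i)."""
    scores = {c: gaussian_pdf(x, mu, sigma) * priors[c]
              for c, (mu, sigma) in params.items()}
    z = sum(scores.values())                        # evidence p(x)
    posteriors = {c: s / z for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Example: classify an F1 value as vowel 'u' or 'o'
params = {'u': (350.0, 40.0), 'o': (480.0, 50.0)}   # (mean, std) per class, made up
priors = {'u': 0.5, 'o': 0.5}
print(classify(400.0, params, priors))
```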

Gaussians in d dimensions
  p(x | θi) = (1 / ((2π)^{d/2} |Σi|^{1/2})) exp( −(1/2) (x − µi)^T Σi^{−1} (x − µi) )
described by a d-dimensional mean µi and a d x d covariance matrix Σi.
[Figure: a 2D Gaussian density surface and its elliptical contours]

Gaussian mixture models
A single Gaussian cannot model distributions with multiple modes or with nonlinear correlations. What about a weighted sum?
  p(x) ≈ Σ_k c_k p(x | θ_k)
where {c_k} is a set of weights and {p(x | θ_k)} is a set of Gaussian components; this can fit anything given enough components. Interpretation: each observation is generated by one of the Gaussians, chosen with probability c_k = p(θ_k).

Gaussian mixtures (2)
e.g. nonlinear correlation.
[Figure: original data with a nonlinear correlation, the fitted Gaussian components, and the resulting mixture surface]
Problem: finding the c_k and θ_k parameters; it would be easy if we knew which θ_k generated each x.

Expectation-maximization (EM)
A general procedure for estimating model parameters when some are unknown, e.g. which GMM component generated a point. Iteratively update the model parameters θ to maximize Q, the expected log-probability of the observed data x and hidden data z:
  Q(θ, θ^t) = ∫_z p(z | x, θ^t) log p(z, x | θ)
- E step: calculate p(z | x, θ^t) using θ^t
- M step: find the θ that maximizes Q using p(z | x, θ^t)
One can prove p(x | θ) is non-decreasing, hence a maximum likelihood model; it reaches a local optimum, so the result depends on initialization.

Fitting GMMs with EM
We want the parameters of the Gaussians θ_k = {µ_k, Σ_k} and the weights/priors on the Gaussians c_k = p(θ_k) that maximize the likelihood of the training data x. If we could assign each x to a particular θ_k, estimation would be direct. Hence treat the mixture indices z as hidden: form Q = E[log p(x, z | θ)], differentiate with respect to the model parameters, and obtain equations for µ_k, Σ_k, c_k that maximize Q.

GMM EM update equations
The parameters that maximize Q, with ν_nk ≡ p(z_k | x_n, θ^t):
  µ_k = Σ_n ν_nk x_n / Σ_n ν_nk
  Σ_k = Σ_n ν_nk (x_n − µ_k)(x_n − µ_k)^T / Σ_n ν_nk
  c_k = (1/N) Σ_n ν_nk
Each involves ν_nk, the 'fuzzy membership' of x_n in Gaussian k; each updated parameter is just a sample average, weighted by the fuzzy membership.

GMM examples
Vowel data fit with different mixture counts.
[Figure: fits with 1, 2, 3, and 4 Gaussians; log p(x) = -1911, -1864, -1849, -1840 respectively] [Example]
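A compact EM loop for a 1D two-component GMM, following the ν_nk update equations above (Python/NumPy, not the lecture's Matlab example); the synthetic data and the initialization are arbitrary.

```python
import numpy as np

def em_gmm_1d(x, K=2, iters=50):
    """EM for a 1D Gaussian mixture: the E step computes fuzzy memberships
    nu[n, k] = p(z_k | x_n, theta); the M step re-estimates weighted means,
    variances, and priors exactly as in the update equations."""
    rng = np.random.default_rng(0)
    mu = rng.choice(x, K)                    # initialize means from the data
    var = np.full(K, np.var(x))
    c = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E step: responsibilities nu[n, k]
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        nu = c * lik
        nu /= nu.sum(axis=1, keepdims=True)
        # M step: weighted sample averages
        Nk = nu.sum(axis=0)
        mu = (nu * x[:, None]).sum(axis=0) / Nk
        var = (nu * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        c = Nk / len(x)
    return mu, var, c

x = np.concatenate([np.random.normal(350, 40, 200), np.random.normal(480, 50, 200)])
print(em_gmm_1d(x))
```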

Markov models
A (first order) Markov model is a finite-state system whose behavior depends only on the current state: "the future is independent of the past, conditioned on the present."
e.g. a generative Markov model with states S, A, B, C, E and transition probabilities p(q_{n+1} | q_n):

  q_n \ q_{n+1}    S    A    B    C    E
  S                0    1    0    0    0
  A                0   .8   .1   .1    0
  B                0   .1   .8   .1    0
  C                0   .1   .1   .7   .1
  E                0    0    0    0    1

A typical generated state sequence: SAAAAAAAABBBBBBBBBCCCCBBBBBBCE
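A short sketch that runs this Markov model forward to generate state sequences like the one above (Python/NumPy); the matrix is the table just given.

```python
import numpy as np

states = ['S', 'A', 'B', 'C', 'E']
# Rows: current state q_n, columns: next state q_{n+1} (table above)
P = np.array([[0, 1.0, 0,   0,   0  ],
              [0, 0.8, 0.1, 0.1, 0  ],
              [0, 0.1, 0.8, 0.1, 0  ],
              [0, 0.1, 0.1, 0.7, 0.1],
              [0, 0,   0,   0,   1.0]])

def generate(max_len=60, seed=0):
    """Run the chain from S until it reaches the absorbing end state E."""
    rng = np.random.default_rng(seed)
    q = 0                                   # start in state S
    seq = ['S']
    while states[q] != 'E' and len(seq) < max_len:
        q = rng.choice(len(states), p=P[q])
        seq.append(states[q])
    return ''.join(seq)

print(generate())    # e.g. something like 'SAAAABBBBBCCBBCCCE'
```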

Hidden Markov models
A Markov model where the state sequence Q = {q_n} is not directly observable ('hidden'); but the observations X do depend on Q: x_n is a random variable that depends only on the current state, p(x_n | q_n).
[Figure: a hidden state sequence (AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC), the emission distributions p(x|q) for states A, B, C, and the resulting observation sequence x_n]
We can still tell something about the state sequence.

(Generative) Markov models
An HMM is specified by parameters Θ:
- states q^i
- transition probabilities a_ij ≡ p(q^j_n | q^i_{n−1})
- emission distributions b_i(x) ≡ p(x | q^i)
- (+ initial state probabilities π_i ≡ p(q^i_1))
[Figure: a left-to-right 'k a t' model where each state has a self-loop probability 0.9 and advances with probability 0.1, each with its emission distribution p(x|q)]

Markov models for sequence recognition
Independence of observations: observation x_n depends only on the current state q_n:
  p(X | Q) = p(x_1, x_2, ..., x_N | q_1, q_2, ..., q_N) = Π_{n=1}^{N} p(x_n | q_n) = Π_{n=1}^{N} b_{q_n}(x_n)
Markov transitions: the transition to the next state depends only on the current state:
  p(Q | M) = p(q_1) Π_{n=2}^{N} p(q_n | q_{n−1}) = π_{q_1} Π_{n=2}^{N} a_{q_{n−1} q_n}

Model-fit calculations
From 'state-based modeling':
  p(X | Θ_j) = Σ_{all Q} p(X | Q, Θ_j) p(Q | Θ_j)
with the HMM factors p(X | Q) and p(Q | M) above. Hence solve for Θ̂ = argmax_{Θ_j} p(Θ_j | X), using Bayes' rule to convert from p(X | Θ_j). But a sum over all Q???

Summing over all paths
A small example model M1 with states S, A, B, E, transition probabilities

        S    A    B    E
  S     -   0.9  0.1   -
  A     -   0.7  0.2  0.1
  B     -    -   0.8  0.2
  E     -    -    -    1

and observation likelihoods p(x_n | q) for three observations x1, x2, x3:

        x1   x2   x3
  A    2.5  0.2  0.1
  B    0.1  2.2  2.3

All possible 3-emission paths Q_k from S to E:

  Path         p(Q | M) = Π_n p(q_n|q_{n−1})    p(X | Q, M) = Π_n p(x_n|q_n)    p(X, Q | M)
  S-A-A-A-E    .9 x .7 x .7 x .1 = 0.0441       2.5 x 0.2 x 0.1 = 0.05          0.0022
  S-A-A-B-E    .9 x .7 x .2 x .2 = 0.0252       2.5 x 0.2 x 2.3 = 1.15          0.0290
  S-A-B-B-E    .9 x .2 x .8 x .2 = 0.0288       2.5 x 2.2 x 2.3 = 12.65         0.3643
  S-B-B-B-E    .1 x .8 x .8 x .2 = 0.0128       0.1 x 2.2 x 2.3 = 0.506         0.0065
               Σ = 0.1109                                                       Σ = p(X | M) = 0.4020

The 'forward recursion'
A dynamic-programming-like technique to sum over all Q. Define α_n(i) as the probability of getting to state q^i at time step n (by any path):
  α_n(i) = p(x_1, x_2, ..., x_n, q_n = q^i) ≡ p(X_1^n, q_n^i)
α_{n+1}(j) can be calculated recursively:
  α_{n+1}(j) = [ Σ_{i=1}^{S} α_n(i) a_ij ] · b_j(x_{n+1})
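A small implementation of the forward recursion just defined (Python/NumPy); the transition matrix, emission likelihoods, and observations are placeholder values, not the worked example from the slide.

```python
import numpy as np

def forward(pi, A, B_lik):
    """Forward recursion: alpha[n, i] = p(x_1..x_n, q_n = i).
    pi: initial state probs (S,), A: transitions (S, S),
    B_lik: emission likelihoods b_i(x_n) as an (N, S) array."""
    N, S = B_lik.shape
    alpha = np.zeros((N, S))
    alpha[0] = pi * B_lik[0]                       # alpha_1(i) = pi_i b_i(x_1)
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B_lik[n]   # [sum_i alpha_n(i) a_ij] * b_j(x_{n+1})
    return alpha, alpha[-1].sum()                  # total p(X | model)

# Toy 2-state example with made-up numbers
pi = np.array([0.9, 0.1])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B_lik = np.array([[0.5, 0.1],     # b_i(x_n) for each state i, one row per observation
                  [0.4, 0.2],
                  [0.1, 0.6]])
alpha, pX = forward(pi, A, B_lik)
print(pX)
```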

be interested in actual qn assignments I which state was ‘active’ at each time frame e.g phone labeling (for training?) Total probability is over all paths . but can also solve for single best path, “Viterbi” state sequence j Probability along best path to state qn+1 :   α̂n+1 (j) = max {α̂n (i)aij } bj (xn+1 ) i backtrack from final state to get best path final probability is product only (no sum) log-domain calculation is just summation I I Best path often dominates total probability p(X | Θ) ≈ p(X , Q̂ | Θ) Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 38 / 43 Source: http://www.doksinet Interpreting the Viterbi path Viterbi path assigns each xn to a state q i I performing classification based on b (x) i . at the same time applying transition constraints aij Inferred classification xn 3 2 1 0 Viterbi labels: 0 10 20 30 AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC Can be used for segmentation I I train an HMM with ‘garbage’ and

‘target’ states decode on new data to find ‘targets’, boundaries Can use for (heuristic) training e.g forced alignment to bootstrap speech recognizer e.g train classifiers based on labels Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 39 / 43 Source: http://www.doksinet Aside: Training and test data A rich model can learn every training example (overtraining) But the goal is to classify new, unseen data I sometimes use ‘cross validation’ set to decide when to stop training For evaluation results to be meaningful: I I don’t test with training data! don’t train on test data (even indirectly. ) Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 40 / 43 Source: http://www.doksinet Aside (2): Model complexity More training data allows the use of larger models More model parameters create a better fit to the training data I I more Gaussian mixture components more HMM states For fixed training set size, there will be some

optimal model size that avoids overtraining Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 41 / 43 Source: http://www.doksinet Summary Classification is making discrete (hard) decisions Basis is comparison with known examples I explicitly or via a model Classification models I I I discriminative models, like SVMs, neural nets, boosters, directly learn posteriors p(θi | x) generative models, like Gaussians, GMMs, HMMs, model likelihoods p(x | θ) Bayes’ rule lets us use generative models for classification EM allows parameter estimation even with some data missing Parting thought Is it wise to use generative models for discrimination or vice versa? Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 42 / 43 Source: http://www.doksinet References Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc IEEE Symposium on Foundations of Computer Science, pages 459–468,

Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. IEEE Symposium on Foundations of Computer Science, pages 459–468, Washington, DC, USA, 2006. IEEE Computer Society.
Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition, July 2003. ISBN 158488388X.
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
Jeff A. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1997.
Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, June 1998.

http://www.eecolumbiaedu/∼dpwe/e6820 February 7, 2008 1 2 3 4 Classification Generative models Gaussian models Hidden Markov models Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 1 / 43 Source: http://www.doksinet Outline 1 Classification 2 Generative models 3 Gaussian models 4 Hidden Markov models Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 2 / 43 Source: http://www.doksinet Classification and generative models F2/Hz f/Hz 4000 2000 3000 ay ao x 2000 1000 1000 0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 x 1.0 time/s 0 0 1000 2000 F1/Hz Classification I I I discriminative models discrete, categorical random variable of interest fixed set of categories Generative models I I I I descriptive models continuous or discrete random variable(s) of interest can estimate parameters Bayes’ rule makes them useful for classification Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 3 / 43 Source:

http://www.doksinet Building a classifier Define classes/attributes I I could state explicit rules better to define through ‘training’ examples Define feature space Define decision algorithm I set parameters from examples Measure performance I calculated (weighted) error rate Pols vowel formants: "u" (x) vs. "o" (o) 1100 F2 / Hz 1000 900 800 700 600 250 Michael Mandel (E6820 SAPR) 300 350 400 450 F1 / Hz Machine learning 500 550 600 February 7, 2008 4 / 43 Source: http://www.doksinet Classification system parts Sensor signal Pre-processing/ segmentation • STFT • Locate vowels segment Feature extraction • Formant extraction feature vector Classification class Post-processing Michael Mandel (E6820 SAPR) • Context constraints • Costs/risk Machine learning February 7, 2008 5 / 43 Source: http://www.doksinet Feature extraction Right features are critical I I waveform vs formants vs cepstra invariance under irrelevant

modifications Theoretically equivalent features may act very differently in a particular classifier I I representations make important aspects explicit remove irrelevant information Feature design incorporates ‘domain knowledge’ I although more data ⇒ less need for ‘cleverness’ Smaller ‘feature space’ (fewer dimensions) simpler models (fewer parameters) less training data needed faster training [inverting MFCCs] Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 6 / 43 Source: http://www.doksinet Optimal classification Minimize probability of error with Bayes optimal decision θ̂ = argmax p(θi | x) θ Z i p(error) = p(error | x)p(x) dx XZ = (1 − p(θi | x))p(x) dx i Λi I where Λi is the region of x where θi is chosen . but p(θi | x) is largest in that region I so p(error) is minimized Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 7 / 43 Source: http://www.doksinet Sources of error p(x|ω1)·Pr(ω1) Xω^ 1

Xω^ 2 p(x|ω2)·Pr(ω2) Pr(ω2|Xω^ 1) = Pr(err|Xω^ 1) x Pr(ω1|Xω^ 2) = Pr(err|Xω^ 2) Suboptimal threshold / regions (bias error) I use a Bayes classifier Incorrect distributions (model error) I better distribution models / more training data Misleading features (‘Bayes error’) I I irreducible for given feature set regardless of classification scheme Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 8 / 43 Source: http://www.doksinet Two roads to classification Optimal classifier is θ̂ = argmax p(θi | x) θi but we don’t know p(θi | x) Can model distribution directly e.g Nearest neighbor, SVM, AdaBoost, neural net I maps from inputs x to outputs θi I a discriminative model Often easier to model data likelihood p(x | θi ) I I use Bayes’ rule to convert to p(θi | x) a generative (descriptive) model Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 9 / 43 Source: http://www.doksinet Nearest neighbor classification Pols

vowel formants: "u" (x), "o" (o), "a" (+) 1800 1600 1400 + F2 / Hz 1200 1000 o x 800 * 600 0 200 400 new observation x 600 800 1000 F1 / Hz 1200 1400 1600 1800 Find closest match (Nearest Neighbor) Naı̈ve implementation takes O(N) time for N training points As N ∞, error rate approaches twice the Bayes error rate With K summarized classes, takes O(K ) time Locality sensitive hashing gives approximate nearest neighbors 2 in O(dn1/c ) time (Andoni and Indyk, 2006) Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 10 / 43 Source: http://www.doksinet Support vector machines (w . x) + b = + 1 (w .x) + b = –1 “Large margin” linear classifier for separable data I I I regularization of margin avoids over-fitting can be adapted to non-separable data (C parameter) made nonlinear using kernels k(x1 , x2 ) = Φ(x1 ) · Φ(x2 ) x1 x2 yi = – 1 yi = +1 w (w .x) + b = 0 Depends only on training points near the

decision boundary, the support vectors Unique, optimal solution for given Φ and C Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 11 / 43 Source: http://www.doksinet Outline 1 Classification 2 Generative models 3 Gaussian models 4 Hidden Markov models Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 12 / 43 Source: http://www.doksinet Generative models Describe the data using structured probabilistic models Observations are random variables whose distribution depends on model parameters Source distributions p(x | θi ) I I I reflect variability in features reflect noise in observation generally have to be estimated from data (rather than known in advance) p(x|ωi) ω1 ω2 ω3 ω4 x Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 13 / 43 Source: http://www.doksinet Generative models (2) Three things to do with generative models Evaluate the probability of an observation, possibly under multiple parameter settings

p(x), p(x | θ1 ), p(x | θ2 ), . Estimate model parameters from observed data θ̂ = argmin C (θ∗ , θ | x) θ Run the model forward to generate new data x̃ ∼ p(x | θ̂) Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 14 / 43 Source: http://www.doksinet Random variables review Random variables have joint distributions, p(x, y ) p(x,y) y Knowing one value in a joint distribution constrains the remainder Conditional distribution of x given y p(x | y ) ≡ p(x, y ) p(x, y ) =R p(y ) p(x, y ) dy Michael Mandel (E6820 SAPR) Machine learning σxy µy σy σx p(y) Marginal distribution of y Z p(y ) = p(x, y ) dx µx x p(x) p(x,y) y Y x p(x y = Y ) | February 7, 2008 15 / 43 Source: http://www.doksinet Bayes’ rule p(x | y )p(y ) = p(x, y ) = p(y | x)p(x) ∴ p(y | x) = p(x | y )p(y ) p(x) ⇒ can reverse conditioning given priors/marginals terms can be discrete or continuous generalizes to more variables p(x, y , z) = p(x | y , z)p(y

, z) = p(x | y , z)p(y | z)p(z) allows conversion between joint, marginals, conditionals Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 16 / 43 Source: http://www.doksinet Bayes’ rule for generative models Run generative models backwards to compare them Likelihood R p(x | θ) · p(θ) p(θ | x) = p(x | θ)p(θ) Posterior prob Evidence = p(x) Prior prob Posterior is the classification we’re looking for I I I combination of prior belief in each class with likelihood under our model R normalized by evidence (so posteriors = 1) Objection: priors are often unknown . but omitting them amounts to assuming they are all equal Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 17 / 43 Source: http://www.doksinet Computing probabilities and estimating parameters Want probability of the observation under a model, p(x) I regardless of parameter settings Full Bayesian integral Z p(x) = p(x | θ)p(θ) dθ Difficult to compute in general,

approximate as p(x | θ̂) I Maximum likelihood (ML) θ̂ = argmax p(x | θ) θ I Maximum a posteriori (MAP): ML + prior θ̂ = argmax p(θ | x) = argmax p(x | θ)p(θ) θ Michael Mandel (E6820 SAPR) θ Machine learning February 7, 2008 18 / 43 Source: http://www.doksinet Model checking After estimating parameters, run the model forward Check that I I I model is rich enough to capture variation in data parameters are estimated correctly there aren’t any bugs in your code Generate data from the model and compare it to observations x̃ ∼ p(x | θ) I I are they similar under some statistics T (x) : Rd 7 R ? can you find the real data set in a group of synthetic data sets? Then go back and update your model accordingly Gelman et al. (2003, ch 6) Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 19 / 43 Source: http://www.doksinet Outline 1 Classification 2 Generative models 3 Gaussian models 4 Hidden Markov models Michael Mandel (E6820 SAPR)

Machine learning February 7, 2008 20 / 43 Source: http://www.doksinet Gaussian models Easiest way to model distributions is via parametric model I assume known form, estimate a few parameters Gaussian model is simple and useful. In 1D "   # 1 1 x − µi 2 p(x | θi ) = √ exp − 2 σi σi 2π Parameters mean µi and variance σi fit PM PM e1/2 Michael Mandel (E6820 SAPR) p(x|ωi) σi µi Machine learning x February 7, 2008 21 / 43 Source: http://www.doksinet Gaussians in d dimensions   1 1 T −1 exp − (x − µi ) Σi (x − µi ) p(x | θi ) = 2 (2π)d/2 |Σi |1/2 Described by a d-dimensional mean µi and a d × d covariance matrix Σi 5 4 1 0.5 3 0 2 4 1 2 4 2 0 0 Michael Mandel (E6820 SAPR) 0 0 Machine learning 1 2 3 4 5 February 7, 2008 22 / 43 Source: http://www.doksinet Gaussian mixture models Single Gaussians cannot model I I distributions with multiple modes distributions with nonlinear correlations What about a

weighted sum?

p(x) ≈ Σ_k c_k p(x | θ_k)

where {c_k} is a set of weights and {p(x | θ_k)} is a set of Gaussian components; such a mixture can fit anything given enough components. Interpretation: each observation is generated by one of the Gaussians, chosen with probability c_k = p(θ_k).

Gaussian mixtures (2)
A mixture can capture e.g. nonlinear correlation. [Figure: original data with nonlinear correlation, the Gaussian components fit to it, and the resulting mixture surface]
Problem: finding the c_k and θ_k parameters; easy if we knew which θ_k generated each x.

Expectation-maximization (EM)
A general procedure for estimating model parameters when some data are unknown, e.g. which GMM component generated a point. Iteratively update the model parameters θ to maximize Q, the expected log-probability of observed data x and hidden data z:

Q(θ, θ_t) = ∫ p(z | x, θ_t) log p(z, x | θ) dz

E step: calculate p(z | x, θ_t) using θ_t.
M step: find the θ that maximizes Q using p(z | x, θ_t).
One can prove p(x | θ) is non-decreasing, hence this yields a maximum-likelihood model; it finds a local optimum, so the result depends on initialization.

Fitting GMMs with EM
Want to find the parameters of the Gaussians θ_k = {µ_k, Σ_k} and the weights/priors on the Gaussians c_k = p(θ_k) that maximize the likelihood of the training data x. If we could assign each x to a particular θ_k, estimation would be direct; hence treat the mixture indices z as hidden, form Q = E[log p(x, z | θ)], and differentiate w.r.t. the model parameters to get the equations for µ_k, Σ_k, c_k that maximize Q.

GMM EM update equations
The parameters that maximize Q, with ν_nk ≡ p(z_k | x_n, θ_t):

µ_k = Σ_n ν_nk x_n / Σ_n ν_nk
Σ_k = Σ_n ν_nk (x_n − µ_k)(x_n − µ_k)^T / Σ_n ν_nk
c_k = (1/N) Σ_n ν_nk

Each involves ν_nk, the 'fuzzy membership' of x_n in Gaussian k; each updated parameter is just a sample average, weighted by fuzzy membership.

GMM examples
Vowel data fit with different mixture counts. [Figure: the same two-dimensional vowel data fit with 1, 2, 3 and 4 Gaussians; log p(x) = −1911, −1864, −1849 and −1840 respectively]

Outline
1 Classification  2 Generative models  3 Gaussian models  4 Hidden Markov models
Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 29
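To make the GMM EM updates above concrete, here is a minimal numpy sketch of one EM iteration for a full-covariance mixture. The function and variable names are illustrative only (not from the course code), and the responsibilities ν_nk are computed in the log domain for numerical stability:

```python
import numpy as np

def gmm_em_step(X, c, mu, Sigma):
    """One EM iteration for a Gaussian mixture model.
    X: (N, d) data; c: (K,) weights; mu: (K, d) means; Sigma: (K, d, d) covariances."""
    N, d = X.shape
    K = len(c)
    # E step: responsibilities nu[n, k] = p(z_k | x_n, theta_t)
    log_r = np.empty((N, K))
    for k in range(K):
        diff = X - mu[k]
        inv = np.linalg.inv(Sigma[k])
        _, logdet = np.linalg.slogdet(Sigma[k])
        quad = np.einsum('nd,de,ne->n', diff, inv, diff)
        log_r[:, k] = np.log(c[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    log_r -= log_r.max(axis=1, keepdims=True)          # avoid underflow
    nu = np.exp(log_r)
    nu /= nu.sum(axis=1, keepdims=True)                # fuzzy memberships, rows sum to 1
    # M step: fuzzy-membership-weighted sample averages, as in the update equations above
    Nk = nu.sum(axis=0)
    c_new = Nk / N
    mu_new = (nu.T @ X) / Nk[:, None]
    Sigma_new = np.empty_like(Sigma)
    for k in range(K):
        diff = X - mu_new[k]
        Sigma_new[k] = (nu[:, k, None] * diff).T @ diff / Nk[k]
    return c_new, mu_new, Sigma_new
```

Iterating this until the data log-likelihood stops improving gives the locally optimal maximum-likelihood fit described above.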

/ 43 Source: http://www.doksinet Markov models A (first order) Markov model is a finite-state system whose behavior depends only on the current state “The future is independent of the past, conditioned on the present” e.g generative Markov model S .8 .8 .1 .1 A .1 B .1 .1 C .7 .1 .1 E p(qn+1|qn) S A qn B C E qn+1 S A B C E 0 0 0 0 0 1 .8 .1 .1 0 0 .1 .8 .1 0 0 .1 .1 .7 0 0 0 0 .1 1 SAAAAAAAABBBBBBBBBCCCCBBBBBBCE Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 30 / 43 Source: http://www.doksinet Hidden Markov models Markov model where state sequence Q = {qn } is not directly observable (‘hidden’) But, observations X do depend on Q I xn is RV that depends only on current state p(xn | qn ) State sequence q=A 0.8 q=B AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC q=C 3 0.6 xn p(x|q) Emission distributions 0.4 0 0 0.8 q=A q=B q=C 3 0.6 xn p(x|q) 2 1 0.2 0.4 2 1 0.2 0 Observation sequence 0 1 2 3 4 0 observation x 0 10 20
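As a small illustration of the generative view, this sketch draws random state sequences from the example transition matrix on the slide above (the S/A/B/C/E chain with self-loop probabilities 0.8, 0.8 and 0.7); the code itself is illustrative only:

```python
import numpy as np

states = ['S', 'A', 'B', 'C', 'E']
# p(q_{n+1} | q_n), one row per current state, values as in the example above
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # S always moves to A
    [0.0, 0.8, 0.1, 0.1, 0.0],   # A
    [0.0, 0.1, 0.8, 0.1, 0.0],   # B
    [0.0, 0.1, 0.1, 0.7, 0.1],   # C
    [0.0, 0.0, 0.0, 0.0, 1.0],   # E (absorbing end state)
])

def sample_chain(A, start=0, end=4, seed=0):
    rng = np.random.default_rng(seed)
    q, seq = start, [states[start]]
    while q != end:
        q = rng.choice(len(states), p=A[q])
        seq.append(states[q])
    return ''.join(seq)

print(sample_chain(A))   # a random walk of the same flavor as the SAAAAAAAABBBB...CE example
```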

30 time step n can still tell something about state sequence. Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 31 / 43 Source: http://www.doksinet (Generative) Markov models HMM is specified by parameters Θ: - states qi - transition probabilities aij - emission distributions bi(x) k • a t • • k a t • • k a t • • k a t k 1.0 0.9 0.0 0.0 a 0.0 0.1 0.9 0.0 t 0.0 0.0 0.1 0.9 • 0.0 0.0 0.0 0.1 p(x|q) x (+ initial state probabilities πi ) i aij ≡ p(qnj | qn−1 ) Michael Mandel (E6820 SAPR) bi (x) ≡ p(x | qi ) Machine learning πi ≡ p(q1i ) February 7, 2008 32 / 43 Source: http://www.doksinet Markov models for sequence recognition Independence of observations I observation xn depends only on current state qn p(X | Q) = p(x1 , x2 , . xN | q1 , q2 , qN ) = p(x1 | q1 )p(x2 | q2 ) · · · p(xN | qN ) N Y = p(xn | qn ) = n=1 N Y bqn (xn ) n=1 Markov transitions I transition to next state qi+1 depends

only on q_i:

p(Q | M) = p(q_1, q_2, ... q_N | M)
         = p(q_N | q_N−1, ... q_1) p(q_N−1 | q_N−2, ... q_1) ... p(q_2 | q_1) p(q_1)
         = p(q_N | q_N−1) p(q_N−1 | q_N−2) ... p(q_2 | q_1) p(q_1)
         = p(q_1) Π_{n=2}^{N} p(q_n | q_n−1) = π_{q_1} Π_{n=2}^{N} a_{q_n−1 q_n}

Model-fit calculations
From 'state-based modeling':

p(X | Θ_j) = Σ_{all Q} p(X | Q, Θ_j) p(Q | Θ_j)

For HMMs,

p(X | Q) = Π_{n=1}^{N} b_{q_n}(x_n)    and    p(Q | M) = π_{q_1} Π_{n=2}^{N} a_{q_n−1 q_n}

Hence, solve for Θ̂ = argmax_{Θ_j} p(Θ_j | X), using Bayes' rule to convert from p(X | Θ_j). Sum over all Q???

Summing over all paths
[Figure: model M1 with states S, A, B, E; transitions p(A|S)=0.9, p(B|S)=0.1, p(A|A)=0.7, p(B|A)=0.2, p(E|A)=0.1, p(B|B)=0.8, p(E|B)=0.2; and a 3-point observation sequence x_1, x_2, x_3]

Observation likelihoods p(x | q):
        x_1    x_2    x_3
  A     2.5    0.2    0.1
  B     0.1    2.2    2.3

All possible 3-emission paths Q_k from S to E:

  q_0 q_1 q_2 q_3 q_4   p(Q | M) = Π_n p(q_n|q_n−1)    p(X | Q,M) = Π_n p(x_n|q_n)   p(X,Q | M)
  S   A   A   A   E     .9 × .7 × .7 × .1 = 0.0441     2.5 × 0.2 × 0.1 = 0.05        0.0022
  S   A   A   B   E     .9 × .7 × .2 × .2 = 0.0252     2.5 × 0.2 × 2.3 = 1.15        0.0290
  S   A   B   B   E     .9 × .2 × .8 × .2 = 0.0288     2.5 × 2.2 × 2.3 = 12.65       0.3643
  S   B   B   B   E     .1 × .8 × .8 × .2 = 0.0128     0.1 × 2.2 × 2.3 = 0.506       0.0065
                                     Σ = 0.1109                       Σ = p(X | M) = 0.4020

The 'forward recursion'
A dynamic-programming-like technique to sum over all Q. Define α_n(i) as the probability of getting to state q^i at time step n (by any path):

α_n(i) = p(x_1, x_2, ... x_n, q_n = q^i) ≡ p(X_1^n, q_n^i)

α_{n+1}(j) can be calculated recursively:

α_{n+1}(j) = [ Σ_{i=1}^{S} α_n(i) · a_ij ] · b_j(x_{n+1})

Forward recursion (2)
Initialize α_1(i) = π_i b_i(x_1). Then the total probability is p(X_1^N | Θ) = Σ_{i=1}^{S} α_N(i). This is a practical way to solve for p(X | Θ_j) and hence select the most probable model (recognition): compare p(X | M1)·p(M1), p(X | M2)·p(M2), ... over the observations X and choose the best.

Optimal path
May be interested in the actual q_n assignments: which state was 'active' at each time frame, e.g. phone labeling (for training?). The total probability is over all paths, but we can also solve for the single best path, the "Viterbi" state sequence. Probability along the best path to state q_{n+1}^j:

α̂_{n+1}(j) = [ max_i { α̂_n(i) a_ij } ] · b_j(x_{n+1})

Backtrack from the final state to get the best path; the final probability is a product only (no sum), so the log-domain calculation is just summation. Best path

often dominates total probability p(X | Θ) ≈ p(X , Q̂ | Θ) Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 38 / 43 Source: http://www.doksinet Interpreting the Viterbi path Viterbi path assigns each xn to a state q i I performing classification based on b (x) i . at the same time applying transition constraints aij Inferred classification xn 3 2 1 0 Viterbi labels: 0 10 20 30 AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC Can be used for segmentation I I train an HMM with ‘garbage’ and ‘target’ states decode on new data to find ‘targets’, boundaries Can use for (heuristic) training e.g forced alignment to bootstrap speech recognizer e.g train classifiers based on labels Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 39 / 43 Source: http://www.doksinet Aside: Training and test data A rich model can learn every training example (overtraining) But the goal is to classify new, unseen data I sometimes use ‘cross validation’

set to decide when to stop training For evaluation results to be meaningful: I I don’t test with training data! don’t train on test data (even indirectly. ) Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 40 / 43 Source: http://www.doksinet Aside (2): Model complexity More training data allows the use of larger models More model parameters create a better fit to the training data I I more Gaussian mixture components more HMM states For fixed training set size, there will be some optimal model size that avoids overtraining Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 41 / 43 Source: http://www.doksinet Summary Classification is making discrete (hard) decisions Basis is comparison with known examples I explicitly or via a model Classification models I I I discriminative models, like SVMs, neural nets, boosters, directly learn posteriors p(θi | x) generative models, like Gaussians, GMMs, HMMs, model likelihoods p(x | θ)
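For the HMM case, the likelihood p(X | Θ) that these generative models assign is exactly what the forward recursion from the earlier slides computes. A minimal log-domain sketch, assuming the per-frame emission log-likelihoods have already been evaluated (names and array layout are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def hmm_log_likelihood(log_pi, log_A, log_B):
    """Forward recursion giving log p(X | Theta).
    log_pi: (S,) initial state log-probabilities
    log_A:  (S, S) transition log-probabilities, log_A[i, j] = log a_ij
    log_B:  (N, S) emission log-likelihoods, log_B[n, i] = log b_i(x_n)"""
    N, S = log_B.shape
    log_alpha = log_pi + log_B[0]                 # alpha_1(i) = pi_i * b_i(x_1)
    for n in range(1, N):
        # alpha_{n+1}(j) = [sum_i alpha_n(i) a_ij] * b_j(x_{n+1}), done in the log domain
        log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_B[n]
    return logsumexp(log_alpha)                   # p(X | Theta) = sum_i alpha_N(i)
```

Replacing the logsumexp over i with a max (and keeping back-pointers) turns the same loop into the Viterbi best-path decoder.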

Bayes’ rule lets us use generative models for classification EM allows parameter estimation even with some data missing Parting thought Is it wise to use generative models for discrimination or vice versa? Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 42 / 43 Source: http://www.doksinet References Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc IEEE Symposium on Foundations of Computer Science, pages 459–468, Washington, DC, USA, 2006. IEEE Computer Society. Andrew Gelman, John B. Carlin, Hal S Stern, and Donald B Rubin Bayesian Data Analysis. Chapman & Hall/CRC, second edition, July 2003 ISBN 158488388X Lawrence R. Rabiner A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989 Jeff A. Bilmes A gentle tutorial on the em algorithm and its application to parameter estimation for gaussian mixture and

hidden markov models, 1997. Christopher J. C Burges A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov, 2(2):121–167, June 1998 Michael Mandel (E6820 SAPR) Machine learning February 7, 2008 43 / 43 Source: http://www.doksinet EE E6820: Speech & Audio Processing & Recognition Lecture 4: Auditory Perception Mike Mandel <mim@ee.columbiaedu> Dan Ellis <dpwe@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820 February 10, 2009 1 2 3 4 5 6 Motivation: Why & how Auditory physiology Psychophysics: Detection & discrimination Pitch perception Speech perception Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 1 / 44 Source: http://www.doksinet Outline 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination 4 Pitch perception 5 Speech perception 6

Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 2 / 44 Source: http://www.doksinet Why study perception? Perception is messy: can we avoid it? No! Audition provides the ‘ground truth’ in audio I I I what is relevant and irrelevant subjective importance of distortion (coding etc.) (there could be other information in sound. ) Some sounds are ‘designed’ for audition I co-evolution of speech and hearing The auditory system is very successful I we would do extremely well to duplicate it We are now able to model complex systems I faster computers, bigger memories E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 3 / 44 Source: http://www.doksinet How to study perception? Three different approaches: Analyze the example: physiology I dissection & nerve recordings Black box input/output: psychophysics I fit simple models of simple functions Information processing models

investigate and model complex functions e.g scene analysis, speech perception I E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 4 / 44 Source: http://www.doksinet Outline 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination 4 Pitch perception 5 Speech perception 6 Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 5 / 44 Source: http://www.doksinet Physiology Processing chain from air to brain: Middle ear Auditory nerve Cortex Outer ear Midbrain Inner ear (cochlea) Study via: I I anatomy nerve recordings Signals flow in both directions E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 6 / 44 Source: http://www.doksinet Outer & middle ear Ear canal Middle ear bones Pinna Cochlea (inner ear) Eardrum (tympanum) Pinna ‘horn’ I complex reflections give spatial (elevation) cues Ear canal I

acoustic tube Middle ear I bones provide impedance matching E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 7 / 44 Source: http://www.doksinet Inner ear: Cochlea Oval window (from ME bones) Basilar Membrane (BM) Travelling wave Cochlea 16 kHz Resonant frequency 50 Hz 0 Position 35mm Mechanical input from middle ear starts traveling wave moving down Basilar membrane Varying stiffness and mass of BM results in continuous variation of resonant frequency At resonance, traveling wave energy is dissipated in BM vibration I Frequency (Fourier) analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 8 / 44 Source: http://www.doksinet Cochlea hair cells Ear converts sound to BM motion I each point on BM corresponds to a frequency Cochlea Tectorial membrane Basilar membrane Auditory nerve Inner Hair Cell (IHC) Outer Hair Cell (OHC) Hair cells on BM convert motion into nerve impulses (firings) Inner Hair Cells detect motion

Outer Hair Cells? Variable damping? E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 9 / 44 Source: http://www.doksinet Inner Hair Cells IHCs convert BM vibration into nerve firings Human ear has ∼3500 IHCs I each IHC has ∼7 connections to Auditory Nerve Each nerve fires (sometimes) near peak displacement Local BM displacement 50 time / ms Typical nerve signal (mV) Histogram to get firing probability Firing count Cycle angle E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 10 / 44 Source: http://www.doksinet Auditory nerve (AN) signals Single nerve measurements Tone burst histogram Spike count Frequency threshold dB SPL 80 100 60 40 Time 20 100 ms 1 kHz 100 Hz Tone burst 10 kHz (log) frequency Rate vs intensity One fiber: ~ 25 dB dynamic range Spikes/sec 300 200 100 Intensity / dB SPL 0 0 20 40 60 80 100 Hearing dynamic range > 100 dB Hard to measure: probe living ANs? E6820 (Ellis & Mandel)

L4: Auditory Perception February 10, 2009 11 / 44 Source: http://www.doksinet AN population response All the information the brain has about sound average rate & spike timings on 30,000 fibers Not unlike a (constant-Q) spectrogram freq / 8ve re 100 Hz PatSla rectsmoo on bbctmp2 (2001-02-18) 5 4 3 2 1 0 0 E6820 (Ellis & Mandel) 10 20 30 40 50 L4: Auditory Perception 60 time / ms February 10, 2009 12 / 44 Source: http://www.doksinet Beyond the auditory nerve Ascending and descending Tonotopic × ? I modulation, position, source?? E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 13 / 44 Source: http://www.doksinet Periphery models IHC Sound IHC Outer/middle ear filtering Cochlea filterbank SlaneyPatterson 12 chans/oct from 180 Hz, BBC1tmp (20010218) 60 Modeled aspects I I I outer / middle ear hair cell transduction cochlea filtering efferent feedback? channel I 50 40 30 20 10 0 0.1 0.2 0.3 0.4 0.5 time / s

Results: ‘neurogram’ / ‘cochleagram’ E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 14 / 44 Source: http://www.doksinet Outline 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination 4 Pitch perception 5 Speech perception 6 Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 15 / 44 Source: http://www.doksinet Psychophysics Physiology looks at the implementation Psychology looks at the function/behavior Analyze audition as signal detection: p(θ | x) psychological tests reflect internal decisions assume optimal decision process I infer nature of internal representations, noise, . lower bounds on more complex functions I I Different aspects to measure I I I I time, frequency, intensity tones, complexes, noise binaural pitch, detuning E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 16 / 44 Source:

http://www.doksinet Basic psychophysics Relate physical and perceptual variables e.g intensity loudness frequency pitch Methodology: subject tests I I just noticeable difference (JND) magnitude scaling e.g “adjust to twice as loud” Results for Intensity vs Loudness: Weber’s law ∆I ∝ I ⇒ log(L) = k log(I ) Log(loudness rating) Hartmann(1993) Classroom loudness scaling data 2.6 2.4 Textbook figure: L α I 0.3 2.2 2.0 Power law fit: L α I 0.22 1.8 1.6 1.4 -20 -10 E6820 (Ellis & Mandel) 0 10 log2 (L) = 0.3 log2 (I ) log I = 0.3 10 log10 2 0.3 dB = log10 2 10 = dB/10 Sound level / dB L4: Auditory Perception February 10, 2009 17 / 44 Source: http://www.doksinet Loudness as a function of frequency Fletcher-Munsen equal-loudness curves Intensity / dB SPL 120 100 80 60 40 20 0 100 100 100 Hz 80 60 rapid loudness growth 40 20 E6820 (Ellis & Mandel) 1 kHz 80 60 0 10,000 freq / Hz Equivalent loudness @ 1kHz Equivalent loudness @ 1kHz 100

1000 0 20 40 60 80 Intensity / dB 40 20 0 0 L4: Auditory Perception 20 40 60 80 Intensity / dB February 10, 2009 18 / 44 Source: http://www.doksinet Loudness as a function of bandwidth Same total energy, different distribution e.g 2 channels at −6 dB (not −10 dB) freq I0 I1 freq B1 Loudness B0 mag mag time Same total energy I·B freq . but wider perceived as louder ‘Critical’ Bandwidth B bandwidth Critical bands: independent frequency channels I ∼25 total (4-6 / octave) E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 19 / 44 Source: http://www.doksinet Simultaneous masking Intensity / dB A louder tone can ‘mask’ the perception of a second tone nearby in frequency: absolute threshold masking tone masked threshold log freq Suggests an ‘internal noise’ model: p(x | I) p(x | I+∆I) p(x | I) σn I E6820 (Ellis & Mandel) internal noise decision variable L4: Auditory Perception x February 10, 2009 20 / 44

Source: http://www.doksinet Sequential masking Backward/forward in time: masker envelope Intensity / dB simultaneous masking ~10 dB masked threshold time backward masking ~5 ms forward masking ~100 ms Time-frequency masking ‘skirt’: intensity Masking tone freq Masked threshold time E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 21 / 44 Source: http://www.doksinet What we do and don’t hear A “two-interval forced-choice”: X B X = A or B? time Timing: 2 ms attack resolution, 20 ms discrimination I but: spectral splatter Tuning: ∼1% discrimination I but: beats Spectrum: profile changes, formants I variables time-frequency resolution Harmonic phase? Noisy signals & texture (Trace vs categorical memory) E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 22 / 44 Source: http://www.doksinet Outline 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination

4 Pitch perception 5 Speech perception 6 Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 23 / 44 Source: http://www.doksinet Pitch perception: a classic argument in psychophysics Harmonic complexes are a pattern on AN freq. chan 70 60 50 40 30 20 10 0 I 0.05 0.1 time/s but give a fused percept (ecological) What determines the pitch percept? I not the fundamental How is it computed? Two competing models: place and time E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 24 / 44 Source: http://www.doksinet Place model of pitch AN excitation pattern shows individual peaks ‘Pattern matching’ method to find pitch AN excitation broader HF channels cannot resolve harmonics resolved harmonics frequency channel Correlate with harmonic ‘sieve’: Pitch strength frequency channel Support: Low harmonics are very important But: Flat-spectrum noise can carry pitch E6820 (Ellis

& Mandel) L4: Auditory Perception February 10, 2009 25 / 44 Source: http://www.doksinet Time model of pitch Timing information is preserved in AN down to ∼1 ms scale freq Extract periodicity by e.g autocorrelation and combine across frequency channels per-channel autocorrelation autocorrelation time Summary autocorrelation 0 10 20 30 common period (pitch) lag / ms But: HF gives weak pitch (in practice) E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 26 / 44 Source: http://www.doksinet Alternate & competing cues Pitch perception could rely on various cues I I I average excitation pattern summary autocorrelation more complex pattern matching Relying on just one cue is brittle I e.g missing fundamental Perceptual system appears to use a flexible, opportunistic combination Optimal detector justification? argmax p(θ | x) = argmax p(x | θ)p(θ) θ θ = argmax p(x1 | θ)p(x2 | θ)p(θ) θ I if x1 and x2 are conditionally

independent E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 27 / 44 Source: http://www.doksinet Outline 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination 4 Pitch perception 5 Speech perception 6 Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 28 / 44 Source: http://www.doksinet Speech perception Highly specialized function I subsequent to source organization? . but also can interact Kinds of speech sounds glide freq / Hz vowel nasal fricative stop burst 60 4000 50 3000 40 2000 30 1000 20 level/dB 0 1.4 has a E6820 (Ellis & Mandel) 1.6 1.8 watch 2 2.2 thin 2.4 as L4: Auditory Perception a 2.6 time/s dime February 10, 2009 29 / 44 Source: http://www.doksinet Cues to phoneme perception Linguists describe speech with phonemes ^ e hε z tcl c θ w has a watch I n thin z I I d as a ay

m dime Acoustic-phoneticians describe phonemes by freq • formants & transitions • bursts & onset times transition stop burst voicing onset time vowel formants time E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 30 / 44 Source: http://www.doksinet Categorical perception (Some) speech sounds perceived categorically rather than analogically I e.g stop-burst and timing: vowel formants fb time T burst freq fb / Hz stop burst 3000 2000 K P P 1000 i e ε P a c freq 4000 o u following vowel I I tokens within category are hard to distinguish category boundaries are very sharp Categories are learned for native tongue I “merry” / “Mary” / “marry” E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 31 / 44 Source: http://www.doksinet Where is the information in speech? ‘Articulation’ of high/low-pass filtered speech: Articulation / % high-pass low-pass 80 60 40 20 1000 2000 3000

4000 freq / Hz sums to more than 1. Speech message is highly redundant e.g constraints of language, context listeners can understand with very few cues E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 32 / 44 Source: http://www.doksinet Top-down influences: Phonemic restoration (Warren, 1970) freq / Hz What if a noise burst obscures speech? 4000 3000 2000 1000 0 1.4 1.6 1.8 2 2.2 2.4 2.6 time / s auditory system ‘restores’ the missing phoneme . based on semantic content . even in retrospect Subjects are typically unaware of which sounds are restored E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 33 / 44 Source: http://www.doksinet A predisposition for speech: Sinewave replicas freq / Hz freq / Hz Replace each formant with a single sinusoid (Remez et al., 1981) 5000 4000 3000 2000 1000 0 5000 4000 3000 2000 1000 0 Speech Sines 0.5 1 1.5 2 2.5 3 time / s speech is (somewhat) intelligible people

hear both whistles and speech (“duplex”) processed as speech despite un-speech-like What does it take to be speech? E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 34 / 44 Source: http://www.doksinet Simultaneous vowels dB Mix synthetic vowels with different f0 s /iy/ @ 100 Hz + freq = /ah/ @ 125 Hz Pitch difference helps (though not necessarily) % both vowels correct DV identification vs. ∆f0 (200ms) (Culling & Darwin 1993) 75 50 25 0 1/4 1/2 1 2 4 ∆f0 (semitones) E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 35 / 44 Source: http://www.doksinet Computational models of speech perception Various theoretical-practical models of speech comprehension e.g Speech Phoneme recognition Lexical access Words Grammar constraints Open questions: I I I mechanism of phoneme classification mechanism of lexical recall mechanism of grammar constraints ASR is a practical implementation (?) E6820 (Ellis & Mandel) L4:

Auditory Perception February 10, 2009 36 / 44 Source: http://www.doksinet Outline 1 Motivation: Why & how 2 Auditory physiology 3 Psychophysics: Detection & discrimination 4 Pitch perception 5 Speech perception 6 Auditory organization & Scene analysis E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 37 / 44 Source: http://www.doksinet Auditory organization Detection model is huge simplification The real role of hearing is much more general: Recover useful information from the outside world Sound organization into events and sources frq/Hz Voice 4000 Stab 2000 Rumble 0 0 2 4 time/s Research questions: I I I what determines perception of sources? how do humans separate mixtures? how much can we tell about a source? E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 38 / 44 Source: http://www.doksinet Auditory scene analysis: simultaneous fusion freq Harmonics are distinct on AN, but perceived as

one sound (“fused”) time I I depends on common onset depends on harmonicity (common period) Methodologies: I I I ask subject how many ‘objects’ match attributes e.g object pitch manipulate high level e.g vowel identity E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 39 / 44 Source: http://www.doksinet Sequential grouping: streaming Pattern / rhythm: property of a set of objects I subsequent to fusion ∵ employs fused events? frequency TRT: 60-150 ms 1 kHz ∆f: –2 octaves time Measure by relative timing judgments I cannot compare between streams Separate ‘coherence’ and ‘fusion’ boundaries Can interact and compete with fusion E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 40 / 44 Source: http://www.doksinet Continuity and restoration freq Tone is interrupted by noise burst: what happened? + ? + + time I masking makes tone undetectable during noise Need to infer most probable real-world events

I I observation equally likely for either explanation prior on continuous tone much higher ⇒ choose Top-down influence on perceived events. E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 41 / 44 Source: http://www.doksinet Models of auditory organization Psychological accounts suggest bottom-up input mixture Front end signal features (maps) Object formation discrete objects Grouping rules Source groups freq onset time period frq.mod Brown and Cooke (1994) Complicated in practice formation of separate elements contradictory cues influence of top-down constraints (context, expectations, . ) E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 42 / 44 Source: http://www.doksinet Summary Auditory perception provides the ‘ground truth’ underlying audio processing Physiology specifies information available Psychophysics measure basic sensitivities Sounds sources require further organization Strong contextual effects in

speech perception Transduce Sound Multiple representns Scene analysis High-level recognition Parting thought Is pitch central to communication? Why? E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 43 / 44 Source: http://www.doksinet References Richard M. Warren Perceptual restoration of missing speech sounds Science, 167 (3917):392–393, January 1970. R. E Remez, P E Rubin, D B Pisoni, and T D Carrell Speech perception without traditional speech cues. Science, 212(4497):947–949, May 1981 G. J Brown and M Cooke Computational auditory scene analysis Computer Speech & Language, 8(4):297–336, 1994. Brian C. J Moore An Introduction to the Psychology of Hearing Academic Press, fifth edition, April 2003. ISBN 0125056281 James O. Pickles An Introduction to the Physiology of Hearing Academic Press, second edition, January 1988. ISBN 0125547544 E6820 (Ellis & Mandel) L4: Auditory Perception February 10, 2009 44 / 44 Source: http://www.doksinet

EE E6820: Speech & Audio Processing & Recognition Lecture 5: Speech modeling Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820 February 19, 2009 1 2 3 4 5 Modeling speech signals Spectral and cepstral models Linear predictive models (LPC) Other signal models Speech synthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 1 / 46 Source: http://www.doksinet Outline 1 Modeling speech signals 2 Spectral and cepstral models 3 Linear predictive models (LPC) 4 Other signal models 5 Speech synthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 2 / 46 Source: http://www.doksinet The speech signal w e has a ^ hε z tcl c θ watch I n I thin z I d ay as a m dime Elements of the speech signal spectral resonances (formants, moving) periodic excitation (voicing, pitched) + pitch contour noise

excitation transients (stop-release bursts) amplitude modulation (nasals, approximants) timing! E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 3 / 46 Source: http://www.doksinet The source-filter model Notional separation of source: excitation, fine time-frequency structure filter: resonance, broad spectral structure Formants Pitch t Glottal pulse train Voiced/ unvoiced Vocal tract resonances + Radiation characteristic Speech f Frication noise t Source Filter More a modeling approach than a single model E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 4 / 46 Source: http://www.doksinet Signal modeling Signal models are a kind of representation I I I to make some aspect explicit for efficiency for flexibility Nature of model depends on goal I I I classification: remove irrelevant details coding/transmission: remove perceptual irrelevance modification: isolate control parameters But commonalities emerge I I I perceptually

irrelevant detail (coding) will also be irrelevant for classification modification domain will usually reflect ‘independent’ perceptual attributes getting at the abstract information in the signal E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 5 / 46 Source: http://www.doksinet Different influences for signal models Receiver I see how signal is treated by listeners cochlea-style filterbank models . Transmitter (source) I physical vocal apparatus can generate only a limited range of signals . LPC models of vocal tract resonances Making explicit particular aspects I compact, separable correlates of resonances cepstrum I modeling prominent features of NB spectrogram I addressing unnaturalness in synthesis sinusoid models Harmonic+noise model E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 6 / 46 Source: http://www.doksinet Application of (speech) signal models Classification / matching Goal: highlight important

information I I I I speech recognition (lexical content) speaker recognition (identity or class) other signal classification content-based retrieval Coding / transmission / storage Goal: represent just enough information I I real-time transmission, e.g mobile phones archive storage, e.g voicemail Modification / synthesis Goal: change certain parts independently I I speech synthesis / text-to-speech (change the words) speech transformation / disguise (change the speaker) E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 7 / 46 Source: http://www.doksinet Outline 1 Modeling speech signals 2 Spectral and cepstral models 3 Linear predictive models (LPC) 4 Other signal models 5 Speech synthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 8 / 46 Source: http://www.doksinet Spectral and cepstral models Spectrogram seems like a good representation I I I long history satisfying in use experts can ‘read’ the speech What is

the information? I I intensity in time-frequency cells typically 5ms × 200 Hz × 50 dB Discarded detail: I I phase fine-scale timing The starting point for other representations E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 9 / 46 Source: http://www.doksinet Short-time Fourier transform (STFT) as filterbank View spectrogram rows as coming from separate bandpass filters f sound Mathematically: X [k, n0 ] = X = X n   2πk(n − n0 ) x[n]w [n − n0 ] exp −j N x[n]hk [n0 − n] n hk[n] where hk [n] = w [−n] exp j 2πkn N  Hk(ejω) w[-n] W(ej(ω − 2πk/N)) n 2πk/N E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 ω 10 / 46 Source: http://www.doksinet Spectral models: which bandpass filters? Constant bandwidth? (analog / FFT) But: cochlea physiology & critical bandwidths implement ear models with bandpass filters & choose bandwidths by e.g CB estimates Auditory frequency scales I constant ‘Q’

(center freq / bandwidth), mel, Bark, . E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 11 / 46 Source: http://www.doksinet Gammatone filterbank Given bandwidths, which filter shapes? I I I match inferred temporal integration window match inferred spectral shape (sharp high-freq slope) keep it simple (since it’s only approximate) Gammatone filters I I 2N poles, 2 zeros, low complexity reasonable linear match to cochlea h[n] = nN−1 e −bn cos(ωi n) 0 2 -10 2 2 2 mag / dB z plane -20 -30 -40 -50 E6820 (Ellis & Mandel) time 50 100 200 L5: Speech modeling 500 1000 2000 5000 freq / Hz February 19, 2009 12 / 46 Source: http://www.doksinet Constant-BW vs. cochlea model Frequency responses Spectrograms Effective FFT filterbank 0 freq / Hz Gain / dB -10 -20 -30 -40 -50 0 1000 2000 3000 4000 5000 6000 7000 2000 0.5 1 1.5 2 2.5 3 Q=4 4 pole 2 zero cochlea model downsampled @ 64 5000 2000 freq / Hz Gain / dB

4000 0 0 8000 -10 -20 -30 -40 -50 6000 Gammatone filterbank 0 FFT-based WB spectrogram (N=128) 8000 1000 500 200 100 0 1000 2000 3000 4000 5000 6000 7000 8000 Freq / Hz 0 0.5 1 1.5 2 2.5 3 time / s Magnitude smoothed over 5-20 ms time window E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 13 / 46 Source: http://www.doksinet Limitations of spectral models Not much data thrown away I I I just fine phase / time structure (smoothing) little actual ‘modeling’ still a large representation Little separation of features e.g formants and pitch Highly correlated features I modifications affect multiple parameters But, quite easy to reconstruct I iterative reconstruction of lost phase E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 14 / 46 Source: http://www.doksinet The cepstrum Original motivation: assume a source-filter model: Excitation source g[n] Resonance filter H(ejω) n ω n Define ‘Homomorphic

deconvolution’: source-filter convolution g [n] ∗ h[n] FT product G (e jω )H(e jω ) log sum log G (e jω ) + log H(e jω ) IFT separate fine structure cg [n] + ch [n] = deconvolution Definition Real cepstrum cn = idft (log |dft(x[n])|) E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 15 / 46 Source: http://www.doksinet Stages in cepstral deconvolution Original waveform has excitation fine structure convolved with resonances DFT shows harmonics modulated by resonances Waveform and min. phase IR 0.2 0 -0.2 20 Log DFT is sum of harmonic ‘comb’ and resonant bumps 10 IDFT separates out resonant bumps (low quefrency) and regular, fine structure (‘pitch pulse’) dB 0 Selecting low-n cepstrum separates resonance information (deconvolution / ‘liftering’) 0 0 L5: Speech modeling 100 200 300 400 abs(dft) and liftered samps 2000 3000 1000 log(abs(dft)) and liftered freq / Hz 1000 2000 3000 real cepstrum and lifter freq / Hz -20 -40 0 200

pitch pulse 100 0 0 E6820 (Ellis & Mandel) 0 100 200 quefrency February 19, 2009 16 / 46 Source: http://www.doksinet Properties of the cepstrum Separate source (fine) from filter (broad structure) I smooth the log magnitude spectrum to get resonances Smoothing spectrum is filtering along frequency i.e convolution applied in Fourier domain multiplication in IFT (‘liftering’) Periodicity in time harmonics in spectrum ‘pitch pulse’ in high-n cepstrum Low-n cepstral coefficients are DCT of broad filter / resonance shape Z  dω  cn = log X (e jω ) (cos nω +  j sin nω) 5th order Cepstral reconstruction Cepstral coefs 0.5 0.1 2 1 0 0 -1 0 1 2 3 4 5 -0.1 0 E6820 (Ellis & Mandel) 1000 2000 3000 L5: Speech modeling 4000 5000 6000 7000 February 19, 2009 17 / 46 Source: http://www.doksinet Aside: correlation of elements Cepstrum is popular in speech recognition I feature vector elements are decorrelated Cepstral coefficients

Auditory spectrum Features 20 20 16 15 12 10 8 5 4 Example joint distrib (10,15) -2 -3 -4 -5 20 18 16 14 12 10 8 6 4 2 16 12 8 4 50 I Covariance matrix 25 100 150 frames 5 10 15 20 3 2 1 0 -1 -2 -3 -4 -5 0 5 c0 ‘normalizes out’ average log energy Decorrelated pdfs fit diagonal Gaussians I simple correlation is a waste of parameters DCT is close to PCA for (mel) spectra? E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 18 / 46 Source: http://www.doksinet Outline 1 Modeling speech signals 2 Spectral and cepstral models 3 Linear predictive models (LPC) 4 Other signal models 5 Speech synthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 19 / 46 Source: http://www.doksinet Linear predictive modeling (LPC) LPC is a very successful speech model I I I it is mathematically efficient (IIR filters) it is remarkably accurate for voice (fits source-filter distinction) it has a satisfying physical

interpretation (resonances) Basic math I model output as linear function of prior outputs: ! p X s[n] = ak s[n − k] + e[n] k=1 I . hence “linear prediction” (p th order) e[n] is excitation (input), AKA prediction error ⇒ 1 S(z) 1 Pp = = −k E (z) A(z) 1 − k=1 ak z . all-pole modeling, ‘autoregression’ (AR) model E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 20 / 46 Source: http://www.doksinet Vocal tract motivation for LPC Direct expression of source-filter model ! p X s[n] = ak s[n − k] + e[n] k=1 Pulse/noise excitation e[n] Vocal tract H(z) = 1/A(z) s[n] |H(ejω)| H(z) f z-plane Acoustic tube models suggest all-pole model for vocal tract Relatively slowly-changing I update A(z) every 10-20 ms Not perfect: Nasals introduce zeros E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 21 / 46 Source: http://www.doksinet Estimating LPC parameters Minimize short-time squared prediction error E= m X 2 e [n] =

X s[n] − n n=1 p X !2 ak s[n − k] k=1 Differentiate w.rt ak to get equations for each k:   p X X 0= 2 s[n] − aj s[n − j] (−s[n − k]) n X s[n]s[n − k] = n φ(0, k) = j=1 X aj X s[n − j]s[n − k] j n X aj φ(j, k) j where φ(j, k) = coefficients I Pm n=1 s[n − j]s[n − k] are correlation p linear equations to solve for all aj s . E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 22 / 46 Source: http://www.doksinet Evaluating parameters P Linear equations φ(0, k) = pj=1 aj φ(j, k) If s[n] is assumed to be zero outside of some window X φ(j, k) = s[n − j]s[n − k] = rss (|j − k|) n I rss (τ ) is autocorrelation Hence equations become:    r (1) r (0) r (1) ···  r (2)   r (1) r (2) ···     .  =  . . .  .   . . . r (p) r (p − 1) r (p − 2) · · · r (p − 1) r (p − 2) . . r (0)      a1 a2 . .     

ap Toeplitz matrix (equal antidiagonals) can use Durbin recursion to solve (Solve full φ(j, k) via Cholesky) E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 23 / 46 Source: http://www.doksinet LPC illustration 0.1 0 windowed original -0.1 -0.2 LPC residual -0.3 0 dB 50 100 150 200 250 300 350 400 time / samp original spectrum 0 LPC spectrum -20 -40 -60 residual spectrum 0 1000 2000 3000 4000 5000 6000 7000 freq / Hz Actual poles z-plane E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 24 / 46 Source: http://www.doksinet Interpreting LPC Picking out resonances I if signal really was source + all-pole resonances, LPC should find the resonances Least-squares fit to spectrum minimizing e 2 [n] in time domain is the same as minimizing E 2 (e jω ) by Parseval close fit to spectral peaks; valleys don’t matter I Removing smooth variation in spectrum 1 A(z) S(z) I E (z) I I is a low-order approximation to

S(z) 1 = A(z) hence, residual E (z) = A(z)S(z) is a ‘flat’ version of S Signal whitening: white noise (independent x[n]s) has flat spectrum whitening removes temporal correlation I E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 25 / 46 Source: http://www.doksinet Alternative LPC representations Many alternate p-dimensional representations I I I I I coefficients {a  P Qj } 1 − λj z −j = 1 − aj z −1 roots {λj }: line spectrum frequencies. reflection coefficients {kj } from lattice  form  tube model log area ratios gj = log 1−kj 1+kj Choice depends on: I I I I I mathematical convenience / complexity quantization sensitivity ease of guaranteeing stability what is made explicit distributions as statistics E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 26 / 46 Source: http://www.doksinet LPC applications Analysis-synthesis (coding, transmission) (z) S(z) = EA(z) hence can reconstruct by filtering e[n] with {aj }s

I whitened, decorrelated, minimized e[n]s are easy to quantize . or can model e[n] eg as simple pulse train I Recognition / classification I I LPC fit responds to spectral peaks (formants) can use for recognition (convert to cepstra?) Modification I I separating source and filter supports cross-synthesis pole / resonance model supports ‘warping’ e.g male female E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 27 / 46 Source: http://www.doksinet Aside: Formant tracking Formants carry (most?) linguistic information Why not classify speech recognition? e.g local maxima in cepstral-liftered spectrum pole frequencies in LPC fit But: recognition needs to work in all circumstances I formants can be obscured or undefined Original (mpgr1 sx419) freq / Hz 4000 3000 2000 1000 0 Noise-excited LPC resynthesis with pole freqs freq / Hz 4000 3000 2000 1000 0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 time / s need more graceful, robust parameters E6820 (Ellis

& Mandel) L5: Speech modeling February 19, 2009 28 / 46 Source: http://www.doksinet Outline 1 Modeling speech signals 2 Spectral and cepstral models 3 Linear predictive models (LPC) 4 Other signal models 5 Speech synthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 29 / 46 Source: http://www.doksinet Sinusoid modeling Early signal models required low complexity e.g LPC Advances in hardware open new possibilities. freq / Hz NB spectrogram suggests harmonics model 4000 3000 2000 1000 0 I I I 0 0.5 1 time / s 1.5 ‘important’ info in 2D surface is set of tracks? harmonic tracks have ∼smooth properties straightforward resynthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 30 / 46 Source: http://www.doksinet Sine wave models Model sound as sum of AM/FM sinusoids N[n] s[n] = X Ak [n] cos(n ωk [n] + φk [n]) k=1 I I Ak , ωk , φk piecewise linear or constant can enforce harmonicity: ωk = kω0

Extract parameters directly from STFT frames: time mag freq I I find local maxima of |S[k, n]| along frequency track birth/death and correspondence E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 31 / 46 Source: http://www.doksinet Finding sinusoid peaks Look for local maxima along DFT frame i.e |s[k − 1, n]| < |S[k, n]| > |S[k + 1, n]| Want exact frequency of implied sinusoid DFT is normally quantized quite coarsely e.g 4000 Hz / 256 bands = 156 Hz/band I quadratic fit to 3 points magnitude interpolated frequency and magnitude spectral samples frequency I may also need interpolated unwrapped phase Or, use differential of phase along time (pvoc): ω= E6820 (Ellis & Mandel) aḃ − b ȧ a2 + b 2 where S[k, n] = a + jb L5: Speech modeling February 19, 2009 32 / 46 Source: http://www.doksinet Sinewave modeling applications Modification (interpolation) and synthesis I connecting arbitrary ω and φ requires cubic phase

interpolation (because ω = φ̇) Types of modification I time and frequency scale modification . with or without changing formant envelope I I concatenation / smoothing boundaries phase realignment (for crest reduction) freq / Hz Non-harmonic signals? OK-ish 4000 3000 2000 1000 0 E6820 (Ellis & Mandel) 0 0.5 L5: Speech modeling 1 time / s 1.5 February 19, 2009 33 / 46 Source: http://www.doksinet Harmonics + noise model Motivation to improve sinusoid model because I I I problems with analysis of real (noisy) signals problems with synthesis quality (esp. noise) perceptual suspicions Model N[n] s[n] = X k=1 I I Ak [n] cos(nkω0 [n]) + e[n](hn [n] ∗ b[n]) | {z } | {z } Harmonics Noise sinusoids are forced to be harmonic remainder is filtered and time-shaped noise ‘Break frequency’ Fm [n] between H and N dB 40 Harmonics Harmonicity limit Fm[n] Noise 20 0 0 E6820 (Ellis & Mandel) 1000 2000 L5: Speech modeling 3000 freq / Hz February 19, 2009

34 / 46 Source: http://www.doksinet HNM analysis and synthesis freq / Hz Dynamically adjust Fm [n] based on ‘harmonic test’: 4000 3000 2000 1000 0 0 0.5 1 time / s 1.5 Noise has envelopes in time e[n] and frequency Hn 0 0 1000 dB 40 2000 Hn[k] freq / Hz 3000 e[n] 0 0.01 0.02 0.03 time / s reconstruct bursts / synchronize to pitch pulses E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 35 / 46 Source: http://www.doksinet Outline 1 Modeling speech signals 2 Spectral and cepstral models 3 Linear predictive models (LPC) 4 Other signal models 5 Speech synthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 36 / 46 Source: http://www.doksinet Speech synthesis One thing you can do with models Synthesis easier than recognition? I listeners do the work . but listeners are very critical Overview of synthesis text Phoneme generation Text normalization Synthesis algorithm speech Prosody generation

normalization disambiguates text (abbreviations) phonetic realization from pronunciation dictionary I prosodic synthesis by rule (timing, pitch contour) . all control waveform generation I I E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 37 / 46 Source: http://www.doksinet Source-filter synthesis Flexibility of source-filter model is ideal for speech synthesis Pitch info Voiced/ unvoiced t Glottal pulse source Phoneme info th ax k ae t t Vocal tract filter + t Speech Noise source t Excitation source issues voiced / unvoiced / mixture ([th] etc.) pitch cycles of voiced segments glottal pulse shape voice quality? E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 38 / 46 Source: http://www.doksinet Vocal tract modeling freq Simplest idea: store a single VT model for each phoneme time th ax k ae t but discontinuities are very unnatural freq Improve by smoothing between templates time th ax k ae t trick is finding

the right domain E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 39 / 46 Source: http://www.doksinet Cepstrum-based synthesis Low-n cepstrum is compact model of target spectrum Can invert to get actual VT IR waveforms: cn = idft(log |dft(x[n])|) ⇒ h[n] = idft(exp(dft(cn ))) All-zero (FIR) VT response can pre-convolve with glottal pulses Glottal pulse inventory Pitch pulse times (from pitch contour) ee ae time ah I cross-fading between templates OK E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 40 / 46 Source: http://www.doksinet LPC-based synthesis Very compact representation of target spectra I 3 or 4 pole pairs per template Low-order IIR filter very efficient synthesis How to interpolate? cannot just interpolate aj in a running filter but lattice filter has better-behaved interpolation + s[n] a1 z-1 a2 z-1 a3 z-1 e[n] + + + kp-1 z-1 - + e[n] z-1 - + I I s[n] -1 k0 z -1 What to use for excitation I I I

residual from original analysis reconstructed periodic pulse train parametrized residual resynthesis E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 41 / 46 Source: http://www.doksinet Diphone synethsis Problems in phone-concatenation synthesis I I I phonemes are context-dependent coarticulation is complex transitions are critical to perception store transitions instead of just phonemes e hε z w ^ Phones tcl c θ I n I z I d ay m Diphone segments I I ∼ 40 phones ⇒ ∼ 800 diphones or even more context if have larger database How to splice diphones together? I I TD-PSOLA: align pitch pulses and cross fade MBROLA: normalized multiband E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 42 / 46 Source: http://www.doksinet HNM synthesis High quality resynthesis of real diphone units + parametric representation for modification I I pitch, timing modifications removal of discontinuities at boundaries Synthesis procedure

I I I I linguistic processing gives phones, pitch, timing database search gives best-matching units use HNM to fine-tune pitch and timing cross-fade Ak and ω0 parameters at boundaries freq time Careful preparation of database is key I I sine models allow phase alignment of all units larger database improves unit match E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 43 / 46 Source: http://www.doksinet Generating prosody The real factor limiting speech synthesis? Waveform synthesizers have inputs for I I I intensity (stress) duration (phrasing) fundamental frequency (pitch) Curves produced by superposition of (many) inferred linguistic rules I phrase final lengthening, unstressed shortening, . Or learn rules from transcribed elements E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 44 / 46 Source: http://www.doksinet Summary Range of models I I spectral, cepstral LPC, sinusoid, HNM Range of applications I I I general spectral

shape (filterbank) ASR precise description (LPC + residual) coding pitch, time modification (HNM) synthesis Issues I I I performance vs computational complexity generality vs accuracy representation size vs quality Parting thought not all parameters are created equal. E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 45 / 46 Source: http://www.doksinet References Alan V. Oppenheim Speech analysis-synthesis system based on homomorphic filtering. The Journal of the Acoustical Society of America, 45(1):309–309, 1969 J. Makhoul Linear prediction: A tutorial review Proceedings of the IEEE, 63(4): 561–580, 1975. Bishnu S. Atal and Suzanne L Hanauer Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America, 50(2B):637–655, 1971. J.E Markel and AH Gray Linear Prediction of Speech Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1982 R. McAulay and T Quatieri Speech analysis/synthesis based on a

sinusoidal representation. Acoustics, Speech, and Signal Processing [see also IEEE Transactions on Signal Processing], IEEE Transactions on, 34(4):744–754, 1986. Wael Hamza, Ellen Eide, Raimo Bakis, Michael Picheny, and John Pitrelli. The IBM expressive speech synthesis system. In INTERSPEECH, pages 2577–2580, October 2004. E6820 (Ellis & Mandel) L5: Speech modeling February 19, 2009 46 / 46 Source: http://www.doksinet EE E6820: Speech & Audio Processing & Recognition Lecture 6: Nonspeech and Music Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820 February 26, 2009 1 2 3 4 Music & nonspeech Environmental Sounds Music Synthesis Techniques Sinewave Synthesis E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 1 / 30 Source: http://www.doksinet Outline 1 Music & nonspeech 2 Environmental Sounds 3 Music

Synthesis Techniques 4 Sinewave Synthesis E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 2 / 30 Source: http://www.doksinet Music & nonspeech What is ‘nonspeech’ ? I I according to research effort: a little music in the world: most everything high Information content speech low music animal sounds wind & water natural contact/ collision Origin machines & engines man-made attributes? E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 3 / 30 Source: http://www.doksinet Sound attributes Attributes suggest model parameters What do we notice about ‘general’ sound? I I I psychophysics: pitch, loudness, ‘timbre’ bright/dull; sharp/soft; grating/soothing sound is not ‘abstract’: tendency is to describe by source-events Ecological perspective what matters about sound is ‘what happened’ our percepts express this more-or-less directly I E6820 (Ellis & Mandel) L6: Nonspeech and Music

February 26, 2009 4 / 30 Source: http://www.doksinet Motivations for modeling Describe/classify I cast sound into model because want to use the resulting parameters Store/transmit I model implicitly exploits limited structure of signal Resynthesize/modify I model separates out interesting parameters Sound E6820 (Ellis & Mandel) Model parameter space L6: Nonspeech and Music February 26, 2009 5 / 30 Source: http://www.doksinet Analysis and synthesis Analysis is the converse of synthesis: Model / representation Synthesis Analysis Sound Can exist apart: I I analysis for classification synthesis of artificial sounds Often used together: I I I encoding/decoding of compressed formats resynthesis based on analyses analysis-by-synthesis E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 6 / 30 Source: http://www.doksinet Outline 1 Music & nonspeech 2 Environmental Sounds 3 Music Synthesis Techniques 4 Sinewave Synthesis E6820

(Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 7 / 30 Source: http://www.doksinet Environmental Sounds Where sound comes from: mechanical interactions I I I contact / collisions rubbing / scraping ringing / vibrating Interest in environmental sounds I I carry information about events around us . including indirect hints need to create them in virtual environments . including soundtracks Approaches to synthesis I I recording / sampling synthesis algorithms E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 8 / 30 Source: http://www.doksinet Collision sounds Factors influencing: I I I colliding bodies: size, material, damping local properties at contact point (hardness) energy of collision Source-filter model I I ”source” = excitation of collision event (energy, local properties at contact) ”filter” = resonance and radiation of energy (body properties) Variety of strike/scraping sounds I resonant freqs ∼ size/shape

damping ∼ material HF content in excitation/strike ∼ mallet, force I (from Gaver, 1993) I I E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 9 / 30 Source: http://www.doksinet Sound textures What do we hear in: I I a city street a symphony orchestra How do we distinguish: I I I I waterfall rainfall applause static Rain01 5000 4000 4000 freq / Hz freq / Hz Applause04 5000 3000 3000 2000 2000 1000 1000 0 0 1 2 3 4 0 time / s 0 1 2 3 4 time / s Levels of ecological description. E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 10 / 30 Source: http://www.doksinet Sound texture modeling (Athineos) Model broad spectral structure with LPC I could just resynthesize with noise Model fine temporal structure in residual with linear prediction in time domain y[n] Sound e[n] E[k] TD-LP FD-LP DCT y[n] = Σ iaiy[n-i] Whitened Residual E[k] = ΣibiE[k-i] residual Per-frame spectral parameters I I spectrum

Per-frame temporal envelope parameters precise dual of LPC in frequency ‘poles’ model temporal events Temporal envelopes (40 poles, 256ms) 0.02 0 amplitude -0.02 0.03 0.02 0.01 0.05 0.1 0.15 0.2 time / sec 0.25 Allows modification / synthesis? E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 11 / 30 Source: http://www.doksinet Outline 1 Music & nonspeech 2 Environmental Sounds 3 Music Synthesis Techniques 4 Sinewave Synthesis E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 12 / 30 Source: http://www.doksinet Music synthesis techniques HA http://www.score -on-line com MESSIAH What is music? I 44. chorus could be anything flexible synthesisHALLELUJAH! needed! Allegro music Key elements of conventional 3 # j j œ. œ œ œ Œ & # c „ ∑ instruments J note-events (time, pitch, accent level) 3 j # œ . œ œj œj Œ & # c „ ∑ melody, harmony, rhythm I patterns of repetition &

variation 3 . I Soprano Hal - le - lu - jah! Alto # V # c Synthesis framework: Tenor „ ∑ œ œ œ œ Œ J J J Hal - le - lu - jah! G. F œ . œj œ œj ‰ œ œ R R J œ œ œ œ œ œ J J ‰R R J J r r j j r r j œ . œ œj œj ‰ œ œ Jœ œ ‰ œ œ Jœ œ Hal - le - lu - jah! Hal - le - œ. œ œ œ ‰ œ œ J J J R R Hal - le - lu - jah! Hal - le - lu - jah! Hal - le - lu - jah œ œ J Jœ ‰ Rœ Rœ J Jœ lu - jah! Hal - le - lu - jah 3 instruments: common for many notes ? ## framework c „ ∑ œ . Jœ Jœ œ Œ œ Jœ Jœ œ ‰ Rœ Rœ Jœ œ ‰ Rœ Rœ Jœ œ J J J score: sequence of (time, pitch, level) note events J Hal - le - lu - jah! Hal - le - lu - jah! Hal - le - Hal - le - lu - jah! Hal - le - lu - jah! Hal - le - lu - jah lu - jah! Hal - le - lu - jah Bass 7 S A # & # Jœ œ Jœ œ Œ & ## le j # œ V # Jœ œ œ Œ E6820 (Ellis & Mandel) - lu - jah, ? ## œ œ œ Jœ œ Œ le B lu - jah, œ

œ œ œj œ Œ le T - - lu - jah, Hal - le - lu - jah! œ . œj Jœ œ Œ J j j j œ. œ œ œ Œ Hal - le - lu - jah, œ . Jœ Jœ œ Œ J œ œ . œ J J œ Œ J œ . œj Jœ œ ‰ œ œ J R R j j j r r j j r r j j œ. œ œ œ ‰ œ œ œ œ ‰ œ œ œ œ Hal - le - lu - jah, Hal - le œ . Jœ Jœ œ ‰ Rœ Rœ J œ œ . œ J J œ ‰ Rœ Rœ J - Hal - le - lu - jah, Hal - le - lu - jah, Hal - le - Hal - le - lu - jah, Hal - le - lu - jah, Hal - le - L6: Nonspeech and Music œ œ J Jœ ‰ Rœ Rœ J Jœ lu - jah, Hal - le - lu - jah, œ œ œ œ J Jœ ‰ R R J Jœ lu - jah, Hal - le - lu - jah, œ œ J Jœ ‰ Rœ Rœ J Jœ lu - jah, February 26, 2009 Hal - le - lu - jah, 13 / 30 Source: http://www.doksinet The nature of musical instrument notes Characterized by instrument (register), note, loudness/emphasis, articulation. Frequency Piano Violin 4000 4000 3000 3000 2000 2000 1000 0 1000 0 0 1 2 3 Time 4 0 1 Frequency Clarinet

4000 3000 3000 2000 2000 1000 0 1000 0 1 2 3 Time 4 Trumpet 4000 0 2 3 Time 4 0 1 2 3 Time 4 Distinguish how? E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 14 / 30 Source: http://www.doksinet Development of music synthesis Goals of music synthesis: I I generate realistic / pleasant new notes control / explore timbre (quality) Earliest computer systems in 1960s (voice synthesis, algorithmic) Pure synthesis approaches: I I I 1970s: Analog synths 1980s: FM (Stanford/Yamaha) 1990s: Physical modeling, hybrids Analysis-synthesis methods: I I I sampling / wavetables sinusoid modeling harmonics + noise (+ transients) others? E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 15 / 30 Source: http://www.doksinet Analog synthesis The minimum to make an ‘interesting’ sound Envelope Trigger Pitch t Vibrato Oscillator t Filter + + + Cutoff freq f Sound Gain Elements: I I I I harmonics-rich

oscillators time-varying filters time-varying envelope modulation: low frequency + envelope-based Result: I time-varying spectrum, independent pitch E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 16 / 30 Source: http://www.doksinet FM synthesis Fast frequency modulation P sidebands: cos(ωc t + β sin(ωm t)) = ∞ n=−∞ Jn (β) cos((ωc + nωm )t) I a harmonic series if ωc = r ωm Jn (β) is a Bessel function: 1 J0 0.5 J1 J2 J3 J4 Jn(β) ≈ 0 for β < n - 2 0 -0.5 0 1 2 3 4 5 6 7 8 9 modulation index β Complex harmonic spectra by varying β 4000 freq / Hz 3000 2000 1000 0 I I 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 time / s ωc = 2000 Hz, ωm = 200 Hz what use? E6820 (Ellis & Mandel) L6: Nonspeech and Music February 26, 2009 17 / 30 Source: http://www.doksinet Sampling synthesis Resynthesis from real notes vary pitch, duration, level Pitch: stretch (resample) waveform 0.2 0.2 596 Hz 0.1 0.1 0
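A minimal numpy sketch of the FM relation above; the 2000 Hz / 200 Hz pair comes from the example just given, while the sample rate, duration, and the linear sweep of the modulation index are illustrative assumptions.

```python
import numpy as np

def fm_tone(fc=2000.0, fm=200.0, sr=16000, dur=1.0, beta_max=4.0):
    """Simple FM synthesis: y(t) = cos(2*pi*fc*t + beta(t)*sin(2*pi*fm*t)).

    Sweeping beta from 0 to beta_max moves energy into sidebands at
    fc + n*fm with weights J_n(beta); since fc is an integer multiple of fm,
    the sidebands land on a 200 Hz harmonic grid."""
    t = np.arange(int(sr * dur)) / sr
    beta = np.linspace(0.0, beta_max, t.size)      # time-varying modulation index
    return np.cos(2 * np.pi * fc * t + beta * np.sin(2 * np.pi * fm * t))

y = fm_tone()   # brightens over time as beta grows, as described above
```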

Sampling synthesis
Resynthesis from recordings of real notes, varying pitch, duration, and level. Pitch: stretch (resample) the waveform (e.g. shifting a 596 Hz note to 894 Hz). Duration: loop a 'sustain' section of the note. Level: cross-fade between different examples (e.g. mix a soft and a loud recording according to velocity). The source samples need to be 'lined up'. What is good and bad about this?

Sinewave synthesis
If patterns of harmonics are what matter, why not generate them all explicitly: s[n] = sum_k A_k[n] cos(k * w_0[n] * n), a particularly powerful model for pitched signals. Analysis (as with speech): find peaks in the STFT |S[w, n]| and track them, or track the fundamental w_0 (via harmonics / autocorrelation) and sample the STFT at k * w_0 to get the set of A_k[n] that duplicates the tone. [Figure: spectrogram of a tone and the extracted harmonic magnitudes.] Synthesis is via a bank of oscillators.

Steps to sinewave modeling - 1
The underlying STFT: X[k, n_0] = sum_{n=0..N-1} x[n + n_0] * w[n] * exp(-j 2 pi k n / N). What value for N (FFT length and window size)? What value for H (hop size, n_0 = r*H, r = 0, 1, 2, ...)? The window length determines frequency resolution, X_w(e^jw) = X(e^jw) convolved with W(e^jw), so choose N long enough to resolve harmonics: 2-3x the longest (lowest) fundamental period, e.g. 30-60 ms = 480-960 samples at 16 kHz, and choose H <= N/2. Making N too long loses time resolution and limits how fast sinusoid amplitudes can change.

Steps to sinewave modeling - 2
Choose candidate sinusoids at each time by picking peaks in each STFT frame. [Figure: spectrogram and one frame's spectrum with peaks marked.] A quadratic (parabolic) fit through each peak and its neighbours refines the peak frequency and level, and linear interpolation of the unwrapped phase gives the peak phase.

Steps to sinewave modeling - 3
Which peaks to pick? We want 'true' sinusoids, not noise fluctuations, so apply a 'prominence' threshold above a smoothed version of the spectrum. Sinusoids also exhibit stability, of amplitude in time and of phase derivative in time, so compare with adjacent time frames as a test.

Steps to sinewave modeling - 4
'Grow' tracks by appending newly-found peaks to existing tracks; ambiguous assignments are possible. An unclaimed new peak is the 'birth' of a new track (backtrack to find its earliest trace?); no continuation peak for an existing track is its 'death' (or: reduce the peak threshold there, for hysteresis). (A sketch of the peak-picking step follows.)
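As a concrete illustration of the peak-picking and quadratic-refinement steps, here is a hedged numpy sketch; the FFT size, Hann window, 32-bin moving-average 'floor', and 20 dB prominence threshold are illustrative stand-ins for the prominence test described above.

```python
import numpy as np

def spectral_peaks(frame, sr, n_fft=1024, min_db_above_floor=20.0):
    """Pick local maxima in one STFT frame and refine each by parabolic
    interpolation of the log-magnitude spectrum. Returns (freqs_hz, levels_db)."""
    w = np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(frame * w, n_fft))
    ldb = 20 * np.log10(mag + 1e-12)
    floor = np.convolve(ldb, np.ones(32) / 32, mode='same')   # crude smoothed spectrum
    freqs, levels = [], []
    for k in range(1, len(ldb) - 1):
        prominent = ldb[k] - floor[k] > min_db_above_floor
        if prominent and ldb[k] > ldb[k - 1] and ldb[k] >= ldb[k + 1]:
            a, b, c = ldb[k - 1], ldb[k], ldb[k + 1]
            p = 0.5 * (a - c) / (a - 2 * b + c)               # fractional-bin offset of vertex
            freqs.append((k + p) * sr / n_fft)
            levels.append(b - 0.25 * (a - c) * p)             # interpolated peak level
    return np.array(freqs), np.array(levels)
```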

Resynthesis of sinewave models
After analysis, each track defines contours in frequency and amplitude, f_k[n] and A_k[n] (plus phase?); use them to drive a bank of sinewave oscillators, A_k[n] * cos(2 pi f_k[n] t), and sum the outputs. [Figure: amplitude and frequency tracks around 600-700 Hz and the resulting waveform.] The tracks can also be 'regularized' to be exactly harmonic, f_k[n] = k * f_0[n].

Modification in sinewave resynthesis
Change duration by warping the timebase (you may want to keep the onset unwarped). Change pitch by scaling the frequencies, either stretching or resampling the spectral envelope. Change timbre by interpolating parameters between models. (A resynthesis sketch follows.)
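A hedged numpy sketch of oscillator-bank resynthesis from a set of tracks. The (frames x tracks) array layout, the use of NaN/zero for dead tracks, and the decision to discard the analyzed phases and integrate frequency instead are all assumptions for this illustration, not the lecture's exact procedure.

```python
import numpy as np

def resynth(freqs, amps, sr, hop):
    """Oscillator-bank resynthesis from sinusoidal tracks.

    freqs, amps: arrays of shape (n_frames, n_tracks) holding f_k[n] in Hz and
    A_k[n] per analysis frame. Frame-rate values are linearly interpolated to
    the sample rate; phase is obtained by integrating frequency."""
    n_frames, n_tracks = freqs.shape
    t_frame = np.arange(n_frames) * hop
    t_samp = np.arange(n_frames * hop)
    y = np.zeros(t_samp.size)
    for k in range(n_tracks):
        f = np.interp(t_samp, t_frame, np.nan_to_num(freqs[:, k]))
        a = np.interp(t_samp, t_frame, np.nan_to_num(amps[:, k]))
        phase = 2 * np.pi * np.cumsum(f) / sr      # integrate frequency to get phase
        y += a * np.cos(phase)
    return y
```

Time-stretching then amounts to interpolating the frame-rate contours onto a longer time axis before calling this routine.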

Sinusoids + residual
Only the 'prominent peaks' became tracks; the remainder of the spectral energy was noisy, so model the residual energy with noise. How to obtain the 'non-harmonic' spectrum? Either zero out the spectrum near the extracted peaks, or resynthesize (exactly) and subtract waveforms: e_s[n] = s[n] - sum_k A_k[n] cos(2 pi n f_k[n]), which requires preserving phase. [Figure: original, sinusoid, and residual spectra, with an LPC fit to the residual.] The residual can then be modelled with LPC, a flexible representation of the noisy residual.

Sinusoids + noise + transients
Sound represented as sinusoids plus noise: s[n] = sum_k A_k[n] cos(2 pi n f_k[n]) + h_n[n] * b[n], with parameters A_k[n], f_k[n], h_n[n] and residual e_s[n]. Abrupt transients can be separated out of the residual as well, e_s[n] = sum_k t_k[n] + h_n[n] * b'[n]: more specific, hence more flexible.

Summary
'Nonspeech audio', i.e. sound in general, with ecological characteristics. Music synthesis: control of pitch, duration, loudness, and articulation; an evolution of techniques up to sinusoids + noise + transients. Music analysis... and beyond?

References
W. W. Gaver. Synthesizing auditory icons. In Proc. Conference on Human Factors in Computing Systems (INTERCHI-93), pages 228-235. Addison-Wesley, 1993.
M. Athineos and D. P. W. Ellis. Autoregressive modeling of temporal envelopes. IEEE Trans. Signal Processing, 15(11):5237-5245, 2007. http://www.ee.columbia.edu/~dpwe/pubs/AthinE07-fdlp.pdf
X. Serra and J. Smith III. Spectral Modeling Synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12-24, 1990.
T. S. Verma and T. H. Y. Meng. An analysis/synthesis tool for transient signals that allows a flexible sines + transients + noise model for audio. In Proc. ICASSP, pages VI-3573-3576, Seattle, 1998.

EE E6820: Speech & Audio Processing & Recognition
Lecture 7: Audio compression and coding
Dan Ellis <dpwe@ee.columbia.edu>, Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering, http://www.ee.columbia.edu/~dpwe/e6820, March 31, 2009
Outline: 1 Information, Compression & Quantization; 2 Speech coding; 3 Wide-Bandwidth Audio Coding

compression and coding March 31, 2009 2 / 37 Source: http://www.doksinet Compression & Quantization How big is audio data? What is the bitrate? Fs frames/second (e.g 8000 or 44100) x C samples/frame (e.g 1 or 2 channels) x B bits/sample (e.g 8 or 16) Fs · C · B bits/second (e.g 64 Kbps or 14 Mbps) bits / frame I CD Audio 1.4 Mbps 32 8 Mobile !13 Kbps 8000 44100 frames / sec Telephony 64 Kbps How to reduce? I I I lower sampling rate less bandwidth (muffled) lower channel count no stereo image lower sample size quantization noise Or: use data compression E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 3 / 37 Source: http://www.doksinet Data compression: Redundancy vs. Irrelevance Two main principles in compression: I I remove redundant information remove irrelevant information Redundant information is implicit in remainder e.g signal bandlimited to 20kHz, but sample at 80kHz can recover every other sample by interpolation: I

(in a bandlimited signal, the discarded samples can be recovered exactly by interpolating the ones that are kept). Irrelevant information is unique but unnecessary, e.g. recording a microphone signal at an 80 kHz sampling rate.

Irrelevant data in audio coding
For coding of audio signals, 'irrelevant' means perceptually insignificant, an empirical property. The Compact Disc standard is adequate: 44.1 kHz sampling for 20 kHz bandwidth, and 16-bit linear samples for roughly 96 dB peak SNR. These reflect the limits of human sensitivity: about 20 kHz bandwidth and 100 dB of intensity range, with insensitivity to sinusoid phase and to the detail of noise structure; dynamic properties are hard to characterize. The problem is separating the salient from the irrelevant.

Quantization
Represent the waveform with discrete levels Q[x[n]]. [Figure: quantizer staircase, a quantized waveform, and the error signal.] Quantization is equivalent to adding an error e[n]: x[n] = Q[x[n]] + e[n], where e[n] behaves like uncorrelated, uniform white noise on (-D/2, +D/2) with variance sigma_e^2 = D^2 / 12.

Quantization noise (Q-noise)
Uncorrelated noise has a flat spectrum. With a B-bit word and quantization step D, the maximum signal range is -(2^(B-1))*D ... (2^(B-1) - 1)*D while the quantization noise spans -D/2 ... D/2, so the best signal-to-noise ratio (in power) is SNR = E[x^2]/E[e^2] ~ (2^B)^2, or in dB, 20*B*log10(2) ~ 6*B dB. [Figure: spectrum of a signal quantized at 7 bits, showing the flat noise floor.] (A quick numerical check follows.)
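A minimal numpy check of the 6 dB-per-bit rule above; the quantizer granularity, the near-full-scale test tone, and the bit depths tried are illustrative assumptions.

```python
import numpy as np

def quantize(x, n_bits):
    """Uniform quantizer over [-1, 1) with step D = 2 / 2**n_bits."""
    D = 2.0 / 2 ** n_bits
    return np.clip(np.round(x / D) * D, -1.0, 1.0 - D)

x = 0.99 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # near-full-scale tone
for B in (8, 12, 16):
    e = x - quantize(x, B)
    snr = 10 * np.log10(np.mean(x ** 2) / np.mean(e ** 2))
    print(B, round(snr, 1))   # lands close to 6.02*B + 1.76 dB for a full-scale sine
```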

Redundant information
Redundancy removal is lossless.

Signal correlation implies redundant information: e.g. if x[n] = x[n-1] + v[n], then x[n] has a greater amplitude range, and so uses more bits, than v[n]; sending v[n] = x[n] - x[n-1] reduces the amplitude and hence the bitrate. But a 'white noise' sequence has no redundancy. The problem is separating the unique from the redundant.

Optimal coding
Shannon information: an unlikely occurrence is more 'informative'. With p(A) = 0.5, p(B) = 0.5 (e.g. ABBBBAAABBABBABBABB) the symbols are equiprobable; with p(A) = 0.9, p(B) = 0.1 (e.g. AAAAABBAAAAAABAAAAB) A is expected and B is 'big news'. Information in bits is I = -log2(probability), which clearly works when all possibilities are equiprobable. The optimal bitrate, average token length = entropy H = E[I], is reached when equal-length tokens are equally likely. How to achieve this? Transform the signal to have a uniform pdf, use nonuniform quantization so that tokens are equiprobable, or use variable-length tokens, i.e. Huffman coding. (A small entropy experiment follows.)
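A hedged numpy sketch tying the two ideas above together: the empirical entropy bounds the average code length, and differencing a correlated signal (losslessly, since the first sample plus the differences reconstruct it) lowers that bound. The random-walk test signal is an illustrative stand-in for real audio.

```python
import numpy as np

def entropy_bits(symbols):
    """Empirical entropy H = -sum p*log2(p) of a discrete symbol stream,
    the lower bound on average code length in bits/symbol."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

x = np.cumsum(np.random.randn(100000))        # strongly correlated 'signal'
q = np.round(x).astype(int)                   # integer samples
print(entropy_bits(q))                        # many bits/sample: wide amplitude range
print(entropy_bits(np.diff(q)))               # far fewer: the difference is 'smaller'
```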

Quantization for optimum bitrate
Quantization should reflect the pdf of the signal: mapping through the cumulative pdf p(x < x0) yields a uniform variable, or codeword lengths can follow Shannon, about -log2 p(x) bits each (short codes for common values, long codes for rare ones). Huffman coding realizes this with a tree-structured decoder.

Vector Quantization
Quantize mutually dependent values jointly, in their joint space. [Figure: scatter of (x1, x2) values with VQ codewords.] This may help even if the values are largely independent: the larger joint space (x1, x2) is easier for

Huffman E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 11 / 37 Source: http://www.doksinet Compression & Representation As always, success depends on representation Appropriate domain may be ‘naturally’ bandlimited I I e.g vocal-tract-shape coefficients can reduce sampling rate without data loss In right domain, irrelevance is easier to ‘get at’ I e.g STFT to separate magnitude and phase E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 12 / 37 Source: http://www.doksinet Aside: Coding standards Coding only useful if recipient knows the code! Standardization efforts are important Federal Standards: Low bit-rate secure voice: I I FS1015e: LPC-10 2.4 Kbps FS1016: 4.8 Kbps CELP ITU G.x series (also Hx for video) I I G.726 ADPCM G.729 Low delay CELP MPEG I I I MPEG-Audio layers 1,2,3 (mp3) MPEG 2 Advanced Audio Codec (AAC) MPEG 4 Synthetic-Natural Hybrid Codec More recent ‘standards’ I I

proprietary: WMA, Skype. Speex . E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 13 / 37 Source: http://www.doksinet Outline 1 Information, Compression & Quantization 2 Speech coding 3 Wide-Bandwidth Audio Coding E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 14 / 37 Source: http://www.doksinet Speech coding Standard voice channel: I I analog: 4 kHz slot (∼ 40 dB SNR) digital: 64 Kbps = 8 bit µ-law x 8 kHz How to compress? Redundant I signal assumed to be a single voice, not any possible waveform Irrelevant I need code only enough for intelligibility, speaker identification (c/w analog channel) Specifically, source-filter decomposition I vocal tract & f0 change slowly Applications: I I live communications offline storage E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 15 / 37 Source: http://www.doksinet Channel Vocoder (1940s-1960s) Basic source-filter

decomposition I I filterbank breaks into spectral bands transmit slowly-changing energy in each band Encoder Bandpass filter 1 Decoder Smoothed energy Downsample & encode E1 Bandpass filter 1 Output Input Bandpass filter N Smoothed energy Voicing analysis I Downsample & encode EN Bandpass filter 1 V/UV Pitch Pulse generator Noise source 10-20 bands, perceptually spaced Downsampling? Excitation? I I pitch / noise model or: baseband + ‘flattening’. E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 16 / 37 Source: http://www.doksinet LPC encoding The classic source-filter model Encoder Decoder Filter coefficients {ai} |1/A(ej!)| Input s[n] Represent & encode f LPC analysis Represent & encode Residual e[n] Excitation generator ^ e[n] All-pole filter H(z) = t Output ^ s[n] 1 1 - "aiz-i Compression gains: I I filter parameters are ∼slowly changing excitation can be represented many ways 20 ms

filter-parameter frames and 5 ms excitation/pitch-parameter frames.

Encoding LPC filter parameters
For 'communications quality': 8 kHz sampling (4 kHz bandwidth), roughly 10th-order LPC (up to 5 pole pairs), updated every 20-30 ms, i.e. 300-500 parameters/s. Representation and quantization: the raw {a_i} have a poor distribution and cannot be interpolated; reflection coefficients {k_i} are guaranteed stable; line spectral frequencies behave best of all. Bit allocation for the filter: GSM (13 kbps) uses 8 LARs x 3-6 bits per 20 ms, about 1.8 kbps; FS1016 (4.8 kbps) uses 10 LSPs x 3-4 bits per 30 ms, about 1.1 kbps.

Line Spectral Pairs (LSPs)
LSPs encode the LPC filter as a set of frequencies, which are excellent for quantization and interpolation. Definition: the zeros of P(z) = A(z) + z^-(p+1) * A(z^-1) and Q(z) = A(z) - z^-(p+1) * A(z^-1). On the unit circle (z = e^jw) we have |A(z)| = |A(z^-1)|, so P(z) and Q(z) have interleaved zeros exactly where the phase of A(z) equals plus or minus the phase of z^-(p+1) A(z^-1). To reconstruct, rebuild P(z) and Q(z) as products of (1 - zeta_i z^-1) from the quantized zeros, then A(z) = [P(z) + Q(z)] / 2. (A sketch of the construction follows.)

Encoding LPC excitation
The excitation (residual) is already easier to code than the raw signal, saving several bits per sample, but a full-quality residual is still over 32 kbps. A crude model, a voiced/unvoiced flag plus pitch period at about 7 bits every 5 ms (1.4 kbps), gives LPC10 at 2.4 kbps. Alternatively, band-limit the residual and then re-extend it (RELP).
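A small numpy sketch of the LSP construction just described. The sign convention assumes A(z) = 1 - sum_i a[i] z^-i, and polynomial rooting is used for clarity rather than the Chebyshev-based searches found in production coders.

```python
import numpy as np

def lsp_frequencies(a):
    """Line spectral frequencies for an LPC inverse filter A(z) = 1 - sum a[i] z^-i.

    Builds P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z),
    whose zeros lie on the unit circle and interleave, and returns their
    angles in (0, pi), sorted (multiply by sr / (2*pi) for Hz)."""
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))     # coefficients of A(z)
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    lsp = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # keep one of each conjugate pair, drop the trivial roots near z = +/-1
        lsp.extend(w for w in ang if 1e-6 < w < np.pi - 1e-6)
    return np.sort(lsp)
```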

(Ellis & Mandel) L7: Audio compression and coding March 31, 2009 20 / 37 Source: http://www.doksinet Encoding excitation Something between full-quality residual (32 Kbps) and pitch parameters (1.4 kbps)? ‘Analysis by synthesis’ loop: Input Filter coefficients A(z) LPC analysis s[n] Excitation code Synthetic excitation control c[n] Pitch-cycle predictor + b·z–NL ‘Perceptual’ weighting W(z)= A(z) A(z/!) MSError minimization ^ x[n] LPC filter 1 A(z) – + ^ x[n] *ha[n] - s[n] Excitation encoding ‘Perceptual’ weighting discounts peaks: gain / dB 20 10 0 W(z)= A(z) A(z/!) 1/A(z/!) -10 -20 0 E6820 (Ellis & Mandel) 1/A(z) 500 1000 1500 2000 2500 L7: Audio compression and coding 3000 3500 freq / Hz March 31, 2009 21 / 37 Source: http://www.doksinet Multi-Pulse Excitation (MPE-LPC) Stylize excitation as N discrete pulses 15 original LPC residual 10 5 0 -5 mi multi-pulse excitation 5 0 -5 I 0 20 ti 40 60 80 100 120 time

/ samps encode as N × (ti , mi ) pairs Greedy algorithm places one pulse at a time:   A(z) X (z) Epcp = − S(z) A(z/γ) A(z) X (z) R(z) = − A(z/γ) A(z/γ) I I R(z) is residual of target waveform after inverse-filtering cross-correlate hγ and r ∗ hγ , iterate E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 22 / 37 Source: http://www.doksinet CELP Represent excitation with codebook e.g 512 sparse excitation vectors Codebook Search index 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 0 I excitation 60 time / samples linear search for minimum weighted error? FS1016 4.8 Kbps CELP (30ms frame = 144 bits): 10 LSPs 4x4 + 6x3 bits = 34 bits Pitch delay 4 x 7 bits = 28 bits Pitch gain 4 x 5 bits = 20 bits I 138 bits Codebk index 4 x 9 bits = 36 bits Codebk gain 4 x 5 bits = 20 bits E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 23 / 37 Source: http://www.doksinet Aside:

CELP for nonspeech? CELP is sometimes called a ‘hybrid’ coder: I I originally based on source-filter voice model CELP residual is waveform coding (no model) freq / kHz Original (mrZebra-8k) 4 3 2 CELP does not break with multiple voices etc. 1 0 5 kbps buzz-hiss 4 3 2 I just does the best it can 1 0 4.8 kbps CELP 4 3 2 1 0 0 1 2 3 4 5 6 7 8 time / sec LPC filter models vocal tract; also matches auditory system? I i.e the ‘source-filter’ separation is good for relevance as well as redundancy? E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 24 / 37 Source: http://www.doksinet Outline 1 Information, Compression & Quantization 2 Speech coding 3 Wide-Bandwidth Audio Coding E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 25 / 37 Source: http://www.doksinet Wide-Bandwidth Audio Coding Goals: I I transparent coding i.e no perceptible effect general purpose - handles any signal Simple

approaches (redundancy removal) I Adaptive Differential PCM (ADPCM) X[n] + – Xp[n] I + D[n] Adaptive quantizer C[n] = Q[D[n]] Predictor Xp[n] = F[X[n-i]] C[n] X[n] + + + D[n] Dequantizer D[n] = Q-1[C[n]] as prediction gets smarter, becomes LPC e.g shorten - lossless LPC encoding Larger compression gains needs irrelevance I hide quantization noise with psychoacoustic masking E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 26 / 37 Source: http://www.doksinet Noise shaping Plain Q-noise sounds like added white noise I I actually, not all that disturbing . but worst-case for exploiting masking 1 Have Q-noise scale with signal level mu-law quantization 0.8 0.6 I I i.e quantizer step gets larger with amplitude mu(x) 0.4 minimum distortion for some center-heavy pdf 0.2 x - mu(x) 0 0 0.2 0.4 0.6 0.8 1 Or: put Q-noise around peaks in spectrum key to getting benefit of perceptual masking level / dB I 20 Signal 0 Linear

Q-noise -20 -40 -60 E6820 (Ellis & Mandel) 0 Transform Q-noise 1500 2000 2500 3000 L7: Audio compression and coding 3500 freq / Hz March 31, 2009 27 / 37 Source: http://www.doksinet Subband coding Idea: Quantize separately in separate bands Encoder Bandpass f Input Decoder Downsample M Analysis filters Q-1[•] M M f Reconstruction filters (M channels) f I Quantize Q[•] Q-1[•] Q[•] Output + M Q-noise stays within band, gets masked gain / dB ‘Critical sampling’ 1/M of spectrum per band I 0 -10 -20 -30 -40 -50 0 alias energy 0.1 0.2 0.3 0.4 0.5 0.6 0.7 normalized freq 1 some aliasing inevitable Trick is to cancel with alias of adjacent band ‘quadrature-mirror’ filters E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 28 / 37 Source: http://www.doksinet MPEG-Audio (layer I, II) Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands Scale indices

Scale normalize Input 32 band polyphase filterbank Format Bitstream & pack bitstream Data frame Scale normalize Psychoacoustic masking model I I I I Quantize Control & scales Quantize 32 chans x 36 samples Per-band masking margins 24 ms 22 kHz ÷ 32 equal bands = 690 Hz bandwidth 8 / 24 ms frames = 12 / 36 subband samples fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp) scale factors are like LPC envelope? E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 29 / 37 Source: http://www.doksinet MPEG Psychoacoustic model Based on simultaneous masking experiments Difficulties: Procedure I pick ‘tonal peaks’ in NB FFT spectrum I remaining energy ‘noisy’ peaks I apply nonlinear ‘spreading function’ I sum all masking & threshold in power domain E6820 (Ellis & Mandel) SPL / dB I noise energy masks ∼10 dB better than tones masking level nonlinear in frequency & intensity complex, dynamic sounds not well

understood SPL / dB I SPL / dB I 60 50 40 30 20 10 0 -10 60 50 40 30 20 10 0 -10 60 50 40 30 20 10 0 -10 Tonal peaks Non-tonal energy Signal spectrum 1 3 5 7 9 11 13 15 17 19 21 23 25 23 25 23 25 Masking spread 1 3 5 7 9 11 13 15 17 19 21 Resultant masking 1 3 L7: Audio compression and coding 5 7 9 11 13 15 freq / Bark 17 19 21 March 31, 2009 30 / 37 Source: http://www.doksinet MPEG Bit allocation Result of psychoacoustic model is maximum tolerable noise per subband SNR ~ 6·B level Masking tone Masked threshold Safe noise level freq Quantization noise Subband N I safe noise level required SNR bits B Bit allocation procedure (fixed bit rate): I I I pick channel with worst noise-masker ratio improve its quantization by one step repeat while more bits available for this frame Bands with no signal above masking curve can be skipped E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 31 / 37 Source:

http://www.doksinet MPEG Audio Layer III ‘Transform coder’ on top of ‘subband coder’ Line Subband 575 31 Filterbank MDCT 32 Subbands 0 0 Window Switching Coded Audio Signal 192 kbit/s Distortion Control Loop Nonuniform Quantization Rate Control Loop Huffman Encoding Coding of Sideinformation Bitstream Formatting CRC-Check Digital Audio Signal (PCM) (768 kbit/s) 32 kbit/s Psychoacoustic Model FFT 1024 Points External Control Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples I I I more redundancy in spectral domain finer control e.g of aliasing, masking scale factors now in band-blocks Fixed Huffman tables optimized for audio data Power-law quantizer E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 32 / 37 Source: http://www.doksinet Adaptive time window Time window relies on temporal masking I single quantization level over 8-24 ms window ‘Nightmare’ scenario: Pre-echo distortion I

‘backward masking’ saves in most cases Adaptive switching of time window: normal window level 1 0.6 0.4 0.2 0 E6820 (Ellis & Mandel) transition short 0.8 0 20 40 60 80 100 time / ms L7: Audio compression and coding March 31, 2009 33 / 37 Source: http://www.doksinet The effects of MP3 Before & after: freq / kHz Josie - direct from CD 20 15 10 5 0 freq / kHz After MP3 encode (160 kbps) and decode 20 15 10 5 0 freq / kHz Residual (after aligning 1148 sample delay) 20 15 10 5 0 0 I I I 2 4 6 8 10 time / sec chop off high frequency (above 16 kHz) occasional other time-frequency ‘holes’ quantization noise under signal E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 34 / 37 Source: http://www.doksinet MP3 & Beyond MP3 is ‘transparent’ at ∼ 128 Kbps for stereo (11x smaller than 1.4 Mbps CD rate) I only decoder is standardized: better psychological models better encoders MPEG2 AAC I I I rebuild

of MP3 without backwards compatibility 30% better (stereo at 96 Kbps?) multichannel etc. MPEG4-Audio I I wide range of component encodings MPEG Audio, LSPs, . SAOL I I I ‘synthetic’ component of MPEG-4 Audio complete DSP/computer music language! how to encode into it? E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 35 / 37 Source: http://www.doksinet Summary For coding, every bit counts I I take care over quantization domain & effects Shannon limits. Speech coding I I LPC modeling is old but good CELP residual modeling can go beyond speech Wide-band coding I I noise shaping ‘hides’ quantization noise detailed psychoacoustic models are key E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 36 / 37 Source: http://www.doksinet References Ben Gold and Nelson Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley, July 1999 ISBN 0471351547 D. Pan A tutorial on

MPEG/audio compression IEEE Multimedia, 2(2):60–74, 1995 Karlheinz Brandenburg. MP3 and AAC explained In Proc AES 17th International Conference on High Quality Audio Coding, pages 1–12, 1999. T. Painter and A Spanias Perceptual coding of digital audio Proc IEEE, 80(4): 451–513, Apr. 2000 E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 37 / 37 Source: http://www.doksinet EE E6820: Speech & Audio Processing & Recognition Lecture 8: Spatial sound Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820 March 27, 2008 1 Spatial acoustics 2 Binaural perception 3 Synthesizing spatial audio 4 Extracting spatial sounds Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 1 / 33 Source: http://www.doksinet Outline 1 Spatial acoustics 2 Binaural perception 3 Synthesizing spatial audio 4 Extracting spatial sounds Michael Mandel (E6820 SAPR)

Spatial sound March 27, 2008 2 / 33 Source: http://www.doksinet Spatial acoustics Received sound = source + channel I so far, only considered ideal source waveform Sound carries information on its spatial origin I ”ripples in the lake” I evolutionary significance The basis of scene analysis? I yes and notry blocking an ear Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 3 / 33 Source: http://www.doksinet Ripples in the lake Listener Source Wavefront (@ c m/s) Energy ∝ 1/r2 Effect of relative position on sound I I I I delay = ∆r c energy decay ∼ r12 absorption ∼ G (f )r direct energy plus reflections Give cues for recovering source position Describe wavefront by its normal Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 4 / 33 Source: http://www.doksinet Recovering spatial information Source direction as wavefront normal pressure moving plane found from timing at 3 points B ∆t/c = ∆s = AB·cosθ θ A C time wavefront

need to solve correspondence ge ran Space: need 3 parameters r e.g 2 angles and range elevation φ azimuth θ Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 5 / 33 Source: http://www.doksinet The effect of the environment Reflection causes additional wavefronts reflection diffraction & shadowing + scattering, absorption many paths many echoes I I Reverberant effect causal ‘smearing’ of signal energy I + reverb from hlwy16 freq / Hz freq / Hz Dry speech airvib16 8000 6000 8000 6000 4000 4000 2000 2000 0 0 0.5 Michael Mandel (E6820 SAPR) 1 1.5 0 time / sec Spatial sound 0 0.5 1 1.5 time / sec March 27, 2008 6 / 33 Source: http://www.doksinet Reverberation impulse response Exponential decay of reflections: hlwy16 - 128pt window freq / Hz ~e-t/T hroom(t) 8000 -10 -20 6000 -30 -40 4000 -50 2000 -60 t 0 -70 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 time / s Frequency-dependent I greater absorption at high frequencies faster

decay. Size-dependent: larger rooms give longer delays and slower decay. Sabine's equation: RT60 = 0.049 V / (S a-bar), with V the room volume and S the surface area (in imperial units) and a-bar the mean absorption coefficient; the time constant grows with size and shrinks with absorption (e.g. V = 50,000 ft^3, S = 9,000 ft^2, a-bar = 0.25 gives RT60 of roughly 1.1 s).

Binaural perception
[Head diagram: head shadow (high frequencies) and path-length difference from a source to the left and right ears.] What is the information in the two ear signals? The sound of the source(s) (roughly L+R) and the position of the source(s) (roughly L-R). [Example waveforms from the ShATR database.]

Main cues to spatial hearing
Interaural time difference (ITD): from the different path lengths around the head; dominates at low frequency (below about 1.5 kHz); at most about 750 us; ambiguous for frequencies above about 600 Hz. Interaural intensity difference (IID): from head shadowing of the far ear; negligible at low frequency and increasing with frequency. Spectral detail (from pinna reflections) is useful for elevation and range; the direct-to-reverberant ratio is useful for range. [Figure: spectrogram of two hand claps.] (A sketch of ITD estimation follows the next slide.)

Head-Related Transfer Functions (HRTFs)
Capture the source-to-ear coupling as impulse responses {l_(theta,phi,R)(t), r_(theta,phi,R)(t)}. Collections exist, e.g. the CIPIC database (http://interface.cipic.ucdavis.edu/). [Figure: measured left/right HRIRs across azimuth at 0 degrees elevation.] Highly individual!
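A hedged numpy sketch of estimating the ITD cue above by cross-correlating a pair of equal-length ear signals; the plus/minus 750 us search limit matches the maximum ITD quoted above, everything else is an illustrative simplification (no per-band filtering, no interpolation of the peak).

```python
import numpy as np

def estimate_itd(left, right, sr, max_itd_s=0.00075):
    """Return the lag (in seconds) of the cross-correlation peak between the
    two ear signals, restricted to the physically plausible ITD range."""
    max_lag = int(round(max_itd_s * sr))
    xc = np.correlate(left, right, mode='full')          # lags -(N-1) .. (N-1)
    lags = np.arange(-(len(right) - 1), len(left))
    keep = np.abs(lags) <= max_lag                       # head-sized lags only
    return lags[keep][np.argmax(xc[keep])] / sr
```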

sound March 27, 2008 11 / 33 Source: http://www.doksinet Cone of confusion azimuth θ Cone of confusion (approx equal ITD) Interaural timing cue dominates (below 1kHz) I from differing path lengths to two ears But: only resolves to a cone I Up/down? Front/back? Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 12 / 33 Source: http://www.doksinet Further cues Pinna causes elevation-dependent coloration Monaural perception I separate coloration from source spectrum? Head motion I I synchronized spectral changes also for ITD (front/back) etc. Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 13 / 33 Source: http://www.doksinet Combining multiple cues Both ITD and ILD influence azimuth; What happens when they disagree? Identical signals to both ears image is centered l(t) r(t) t 1 ms t Delaying right channel moves image to left l(t) r(t) t t Attenuating left channel returns image to center l(t) r(t) t t “Time-intensity trading”

Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 14 / 33 Source: http://www.doksinet Binaural position estimation Imperfect results: (Wenzel et al., 1993) Judged Azimuth (Deg) 180 120 60 0 -60 -120 -180 -180 -120 -60 0 60 120 180 Target Azimuth (Deg) listening to ‘wrong’ HRTFs errors front/back reversals stay on cone of confusion Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 15 / 33 Source: http://www.doksinet The Precedence Effect Reflections give misleading spatial cues l(t) direct reflected R t r(t) R/ c t But: Spatial impression based on 1st wavefront then ‘switches off’ for ∼50 ms . even if ‘reflections’ are louder . leads to impression of room Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 16 / 33 Source: http://www.doksinet Binaural Masking Release Adding noise to reveal target Tone + noise to one ear: tone is masked + t t Identical noise to other ear: tone is audible + t t t

Binaural Masking Level Difference up to 12dB I greatest for noise in phase, tone anti-phase Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 17 / 33 Source: http://www.doksinet Outline 1 Spatial acoustics 2 Binaural perception 3 Synthesizing spatial audio 4 Extracting spatial sounds Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 18 / 33 Source: http://www.doksinet Synthesizing spatial audio Goal: recreate realistic soundfield I I hi-fi experience synthetic environments (VR) Constraints I I I resources information (individual HRTFs) delivery mechanism (headphones) Source material types I I live recordings (actual soundfields) synthetic (studio mixing, virtual environments) Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 19 / 33 Source: http://www.doksinet Classic stereo L R ‘Intensity panning’: no timing modifications, just vary level ±20 dB I works as long as listener is equidistant (ILD) Surround sound: extra

channels in the center, at the sides, and so on, with the same basic effect: panning between pairs.

Simulating reverberation
Reverb can be characterized by its impulse response; spatial cues matter, so record it in stereo; but IRs of about a second mean very long convolutions. Image model: treat reflections as duplicate 'virtual (image) sources' behind the walls, which produce the 'early echoes' in the room impulse response after the direct path. The actual reflection may be some h_reflect(t) rather than a pure impulse.

Artificial reverberation
Reproduce the perceptually salient aspects: the early echo pattern (impression of room size), the overall decay tail (wall materials, ...), and the interaural coherence (spaciousness). Nested allpass filters (Gardner, 1992): a single allpass section H(z) = (-g + z^-k) / (1 - g z^-k) has a flat magnitude response but a long impulse response (taps -g, then (1 - g^2), g(1 - g^2), g^2(1 - g^2), ... at multiples of k); nesting and cascading such sections, followed by a lowpass, builds a synthetic reverb (the figure uses delay/gain pairs of roughly (30, 0.7), (50, 0.5), (20, 0.3)). (A sketch follows.)

Synthetic binaural audio
Convolving a source with a {left, right} HRTF pair gives precise positioning for headphone presentation, and multiple sources can be combined by adding. Where do the HRTFs come from? A measured set (but it is specific to one individual and to discrete directions; interpolate by linear crossfade or a PCA basis set), or a parametric model of delay, head shadow, pinna reflections, and a room echo (Brown and Duda, 1998). Head-motion cues require head tracking plus fast updates.
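A minimal scipy sketch in the spirit of the allpass structure above. It cascades plain allpass sections rather than nesting them, and the dry/wet mix is an illustrative assumption; the delay/gain pairs loosely follow the figure's values with the delays read as milliseconds.

```python
import numpy as np
from scipy.signal import lfilter

def allpass(x, delay, g):
    """One allpass section H(z) = (-g + z^-delay) / (1 - g z^-delay):
    flat magnitude response, but it smears energy in time like a reflection."""
    b = np.zeros(delay + 1); b[0], b[-1] = -g, 1.0
    a = np.zeros(delay + 1); a[0], a[-1] = 1.0, -g
    return lfilter(b, a, x)

def toy_reverb(x, sr):
    """Cascade a few allpasses and mix with the dry signal."""
    y = x.copy()
    for delay_ms, g in ((30, 0.7), (50, 0.5), (20, 0.3)):
        y = allpass(y, int(sr * delay_ms / 1000), g)
    return 0.7 * x + 0.3 * y
```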

Transaural sound
Binaural signals without headphones? Cross-cancel the wrap-around (crosstalk) paths: with speakers S_L, S_R, ears E_L, E_R, binaural signals B_L, B_R, and speaker-to-ear transfer functions H_LL, H_LR, H_RL, H_RR, present B_L and B_R at the ears by driving S_L = H_LL^-1 (B_L - H_RL S_R) and S_R = H_RR^-1 (B_R - H_LR S_L). The 'sweet spot' is narrow, and what about head motion?

Soundfield reconstruction
Stop thinking about ears: just reconstruct the pressure p(x, y, z, t) and its spatial derivatives dp/dx, dp/dy, dp/dz at a point; ears placed in the reconstructed field receive the same sounds. This needs a complex reconstruction setup (ambisonics). Is it able to preserve head-motion cues?

Extracting spatial sounds
Given access to the

soundfield, can we recover separate components? I I degrees of freedom: > N signals from N sensors is hard but: people can do it (somewhat) Information-theoretic approach I I use only very general constraints rely on precision measurements Anthropic approach I I examine human perception attempt to use same information Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 27 / 33 Source: http://www.doksinet Microphone arrays Signals from multiple microphones can be combined to enhance/cancel certain sources ‘Coincident’ mics with different directional gains m1 s1 a11 a22 a12 a21 s2 m2  m1 m2   = a11 a12 a21 a22  s1 s2   ŝ1 ŝ2 ⇒  = Â−1 m Microphone arrays (endfire) 0 λ = 4D -20 λ = 2D -40 λ=D + Michael Mandel (E6820 SAPR) Spatial sound D + D + D March 27, 2008 28 / 33 Source: http://www.doksinet Adaptive Beamforming & Independent Component Analysis (ICA) Formulate mathematical criteria to optimize

Beamforming: Drive interference to zero I cancel energy during nontarget intervals ICA: maximize mutual independence of outputs I from higher-order moments during overlap m1 m2 x a11 a12 a21 a22 s1 s2 −δ MutInfo δa Limited by separation model parameter space I only N × N? Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 29 / 33 Source: http://www.doksinet Binaural models Human listeners do better? I certainly given only 2 channels Extract ITD and IID cues? I I I I cross-correlation finds timing differences ‘consume’ counter-moving pulses how to achieve IID, trading vertical cues. Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 30 / 33 Source: http://www.doksinet Time-frequency masking How to separate sounds based on direction? I I I assume one source dominates each time-frequency point assign regions of spectrogram to sources based on probabilistic models re-estimate model parameters based on regions selected Model-based EM

Source Separation and Localization I I I Mandel and Ellis (2007) models include IID as RLωω and IPD as arg RLωω independent of source, but can model it separately Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 31 / 33 Source: http://www.doksinet Summary Spatial sound I sampling at more than one point gives information on origin direction Binaural perception I time & intensity cues used between/within ears Sound rendering I I conventional stereo HRTF-based Spatial analysis I I optimal linear techniques elusive auditory models Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 32 / 33 Source: http://www.doksinet References Elizabeth M. Wenzel, Marianne Arruda, Doris J Kistler, and Frederic L Wightman Localization using nonindividualized head-related transfer functions. The Journal of the Acoustical Society of America, 94(1):111–123, 1993. William G. Gardner A real-time multichannel room simulator The Journal of the Acoustical Society of

America, 92(4):2395–2395, 1992. C. P Brown and R O Duda A structural model for binaural sound synthesis IEEE Transactions on Speech and Audio Processing, 6(5):476–488, 1998. Michael I. Mandel and Daniel P Ellis EM localization and separation using interaural level and phase cues. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 275–278, 2007. J. C Middlebrooks and D M Green Sound localization by human listeners Annu Rev Psychol, 42:135–159, 1991. Brian C. J Moore An Introduction to the Psychology of Hearing Academic Press, fifth edition, April 2003. ISBN 0125056281 Jens Blauert. Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization. The MIT Press, October 1996 V. R Algazi, R O Duda, D M Thompson, and C Avendano The cipic hrtf database. In Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the, pages 99–102, 2001. Michael Mandel (E6820 SAPR) Spatial sound March 27, 2008 33 / 33

Source: http://www.doksinet EE E6820: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820 April 7, 2009 1 2 3 4 Recognizing speech Feature calculation Sequence recognition Large vocabulary, continuous speech recognition (LVCSR) E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 1 / 43 Source: http://www.doksinet Outline 1 Recognizing speech 2 Feature calculation 3 Sequence recognition 4 Large vocabulary, continuous speech recognition (LVCSR) E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 2 / 43 Source: http://www.doksinet Recognizing speech “So, I thought about that and I think it’s still possible” Frequency 4000 2000 0 0 0.5 1 1.5 2 2.5 3 Time What kind of information might we want from the speech signal? I I I I words

phrasing, ‘speech acts’ (prosody) mood / emotion speaker identity What kind of processing do we need to get at that information? I I I time scale of feature extraction signal aspects to capture in features signal aspects to exclude from features E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 3 / 43 Source: http://www.doksinet Speech recognition as Transcription Transcription = “speech to text” I find a word string to match the utterance Gives neat objective measure: word error rate (WER) % I can be a sensitive measure of performance Reference: THE CAT SAT ON THE MAT – Recognized: CAT SAT AN THE A MAT Deletion Substitution Insertion Three kinds of errors: WER = (S + D + I )/N E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 4 / 43 Source: http://www.doksinet Problems: Within-speaker variability Timing variation I word duration varies enormously Frequency 4000 2000 0 0 0.5 s ow SO I 1.0 1.5 2.0 ay aa ax b aw ax

ay ih k t s t ih l th dx th n th n ih I ABOUT I ITS STILL THOUGHT THAT THINK AND 2.5 3.0 p aa s b ax l POSSIBLE fast speech ‘reduces’ vowels Speaking style variation I I careful/casual articulation soft/loud speech Contextual effects I speech sounds vary with context, role: “How do you do?” E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 5 / 43 Source: http://www.doksinet Problems: Between-speaker variability Accent variation I regional / mother tongue Voice quality variation I gender, age, huskiness, nasality Individual characteristics mbma0 mannerisms, speed, prosody freq / Hz I 8000 6000 4000 2000 0 8000 6000 fjdm2 4000 2000 0 0 E6820 (Ellis & Mandel) 0.5 1 1.5 L9: Speech recognition 2 2.5 time / s April 7, 2009 6 / 43 Source: http://www.doksinet Problems: Environment variability Background noise I fans, cars, doors, papers Reverberation I ‘boxiness’ in recordings Microphone/channel I huge effect on relative

spectral gain Close mic freq / Hz 4000 2000 0 4000 Tabletop mic 2000 0 E6820 (Ellis & Mandel) 0 0.2 0.4 0.6 0.8 L9: Speech recognition 1 1.2 1.4 time / s April 7, 2009 7 / 43 Source: http://www.doksinet How to recognize speech? Cross correlate templates? I I I waveform? spectrogram? time-warp problems Match short-segments & handle time-warp later I I model with slices of ∼10 ms pseudo-stationary model of words: freq / Hz sil g w 0.15 0.2 eh n sil 4000 3000 2000 1000 0 I 0 0.05 0.1 0.25 0.3 0.35 0.4 0.45 time / s other sources of variation. E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 8 / 43 Source: http://www.doksinet Probabilistic formulation Probability that segment label is correct I gives standard form of speech recognizers Feature calculation: s[n] Xm I (m = n H) transforms signal into easily-classified domain Acoustic classifier: p(q i | X ) I calculates probabilities of each

mutually-exclusive state q i ‘Finite state acceptor’ (i.e HMM) Q ∗ = argmax p(q0 , q1 , . qL | X0 , X1 , XL ) {q0 ,q1 ,.qL } I MAP match of allowable sequence to probabilities: q0 = “ay” q1 . X E6820 (Ellis & Mandel) 0 1 2 . L9: Speech recognition time April 7, 2009 9 / 43 Source: http://www.doksinet Standard speech recognizer structure Fundamental equation of speech recognition: Q ∗ = argmax p(Q | X , Θ) Q = argmax p(X | Q, Θ)p(Q | Θ) Q I I I I X = acoustic features p(X | Q, Θ) = acoustic model p(Q | Θ) = language model argmaxQ = search over sequences Questions: I I I what are the best features? how do we do model them? how do we find/match the state sequence? E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 10 / 43 Source: http://www.doksinet Outline 1 Recognizing speech 2 Feature calculation 3 Sequence recognition 4 Large vocabulary, continuous speech recognition (LVCSR) E6820 (Ellis & Mandel) L9: Speech

recognition April 7, 2009 11 / 43 Source: http://www.doksinet Feature Calculation Goal: Find a representational space most suitable for classification I I I waveform: voluminous, redundant, variable spectrogram: better, still quite variable .? Pattern Recognition: representation is upper bound on performance I I maybe we should use the waveform. or, maybe the representation can do all the work Feature calculation is intimately bound to classifier I pragmatic strengths and weaknesses Features develop by slow evolution I current choices more historical than principled E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 12 / 43 Source: http://www.doksinet Features (1): Spectrogram Plain STFT as features e.g X Xm [k] = S[mH, k] = s[n + mH] w [n] e −j2πkn/N n Consider examples: freq / Hz Feature vector slice 8000 6000 4000 2000 0 8000 6000 4000 2000 0 0 0.5 1 1.5 2 2.5 time / s Similarities between corresponding segments I but still large

differences E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 13 / 43 Source: http://www.doksinet Features (2): Cepstrum Idea: Decorrelate, summarize spectral slices: Xm [`] = IDFT{log |S[mH, k]|} I I good for Gaussian models greatly reduce feature dimension Male spectrum cepstrum Female 8000 4000 0 8 6 4 2 8000 spectrum 4000 cepstrum 0 8 6 4 2 0 E6820 (Ellis & Mandel) 0.5 1 1.5 L9: Speech recognition 2 2.5 time / s April 7, 2009 14 / 43 Source: http://www.doksinet Features (3): Frequency axis warp Linear frequency axis gives equal ‘space’ to 0-1 kHz and 3-4 kHz I but perceptual importance very different Warp frequency axis closer to perceptual axis I mel, Bark, constant-Q . X [c] = uc X |S[k]|2 k=`c Male 8000 spectrum 4000 0 15 audspec 10 5 Female audspec 15 10 5 0 E6820 (Ellis & Mandel) 0.5 1 1.5 L9: Speech recognition 2 2.5 time / s April 7, 2009 15 / 43 Source: http://www.doksinet Features (4):

Spectral smoothing
Generalizing across different speakers is helped by smoothing (i.e. blurring) the spectrum. The truncated cepstrum is one way (an MMSE approximation to log|S[k]|); LPC modeling is a little different (roughly an MMSE approximation to |S[k]|), so it prefers detail at spectral peaks. [Figure: auditory spectrum vs. PLP-smoothed spectrum.]

Features (5): Normalization along time
The idea is to keep feature variations, not absolute level: calculate the average level and subtract it, Y[n, k] = X[n, k] - mean_n{X[n, k]}. This factors out a fixed channel frequency response, since x[n] = h_c * s[n] gives log|X[n, k]| = log|H_c[k]| + log|S[n, k]| and the first term does not vary with n. [Figure: PLP features before and after mean normalization, male and female examples.]

Delta features
We want each segment to have 'static' feature values, but some segments are intrinsically dynamic, so calculate their derivatives (maybe those are steadier?). Append dX/dt (and d2X/dt2) to the feature vectors. [Figure: PLP features with deltas and double-deltas.] This relates to onset sensitivity in humans? (A sketch of both operations follows.)

Overall feature calculation
MFCCs and/or RASTA-PLP. [Block diagram: sound -> FFT; the MFCC path applies a mel-scale frequency warp, log magnitude, IFFT, truncation, and mean subtraction (CMN) to give MFCC features; the RASTA-PLP path applies a Bark-scale warp, log, RASTA band-pass filtering, LPC smoothing, and a cepstral recursion to give RASTA-PLP cepstral features.] Key attributes: spectra on an auditory frequency scale, decorrelation, smoothed (spectral) detail, and normalization of levels.
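A hedged numpy sketch of the two post-processing steps just described, mean normalization and delta features, applied to a (frames x dimensions) feature matrix; the regression half-width of 2 frames is a common but illustrative choice.

```python
import numpy as np

def cmn(feats):
    """Cepstral/log-spectral mean normalization: Y[n,k] = X[n,k] - mean_n X[n,k],
    removing a fixed channel's log-spectral offset."""
    return feats - feats.mean(axis=0, keepdims=True)

def add_deltas(feats, width=2):
    """Append regression-based delta features computed over +/- width frames."""
    pad = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    num = sum(w * (pad[width + w: len(feats) + width + w] -
                   pad[width - w: len(feats) + width - w])
              for w in range(1, width + 1))
    delta = num / (2 * sum(w * w for w in range(1, width + 1)))
    return np.hstack([feats, delta])
```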

Features summary
[Figure: spectrogram, auditory spectrum, RASTA, and delta features for male and female utterances.] Normalization brings the same phones closer together; the remaining contrast distinguishes different phones.

Sequence recognition: Dynamic Time Warp (DTW)
Framewise comparison of the test utterance with stored templates (ONE, TWO, THREE, FOUR, FIVE). [Figure: local-distance matrix between test and reference frames.] What distance metric? How do we compare across templates?

Dynamic Time Warp (2)
Find the lowest-cost constrained path through the matrix d(i, j) of distances between input frame f_i and reference frame r_j, given allowable predecessors and transition costs T_xy: D(i, j) = d(i, j) + min{ D(i-1, j) + T10, D(i, j-1) + T01, D(i-1, j-1) + T11 }. The best path is recovered by traceback from the final state, so store the best predecessor for each (i, j).

DTW-based recognition
Keep reference templates for each possible word. For isolated words, mark the endpoints of the input word and calculate scores through each template (plus pruning); for continuous speech, link together word ends. This successfully handles timing variation and recognizes speech at reasonable cost. (A sketch of the recursion follows.)
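A minimal numpy sketch of the DTW recursion above, with Euclidean local distances and the transition costs set to zero (T10 = T01 = T11 = 0); those choices are illustrative assumptions.

```python
import numpy as np

def dtw(test, ref):
    """Dynamic time warp between two feature sequences of shape (frames, dims).
    Returns the total cost of the best path from (0, 0) to (-1, -1)."""
    d = np.linalg.norm(test[:, None, :] - ref[None, :, :], axis=-1)   # d(i, j)
    D = np.full(d.shape, np.inf)
    D[0, 0] = d[0, 0]
    for i in range(d.shape[0]):
        for j in range(d.shape[1]):
            if i == 0 and j == 0:
                continue
            best = min(D[i - 1, j] if i else np.inf,
                       D[i, j - 1] if j else np.inf,
                       D[i - 1, j - 1] if i and j else np.inf)
            D[i, j] = d[i, j] + best            # D(i,j) = d(i,j) + best predecessor
    return D[-1, -1]
```

Keeping an argmin alongside each cell would allow the traceback described above.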

Statistical sequence recognition
DTW is limited because it is hard to optimize: how do we learn from multiple observations, and how should the distances and transition costs be interpreted? We need a theoretical foundation: probability. Formulate recognition as a MAP choice among word sequences, Q* = argmax_Q p(Q | X, Theta), where X is the observed features, Q ranges over word sequences, and Theta is all current parameters.

State-based modeling
Assume a discrete-state model for the speech: observations are divided into time frames, and the model's states generate the observed feature vectors (model M_j; states Q^k: q1 q2 q3 q4 ...; observations X_1^N: x1 x2 x3 x4 ...). The probability of the observations given the model is p(X | Theta) = sum over all Q of p(X_1^N | Q, Theta) p(Q | Theta), a sum over all possible state sequences Q. How do the observations X_1^N depend on the states Q? How do the state sequences Q depend on the model Theta?

HMM review
An HMM is specified by parameters Theta: states q^i, transition probabilities a_ij, and emission distributions b_i(x), plus initial state probabilities pi_i, where a_ij = p(q_n^j | q_(n-1)^i), b_i(x) = p(x | q^i), and pi_i = p(q_1^i). [Example: a left-to-right model for 'k-a-t' in which each state repeats with probability 0.9 and advances with probability 0.1, shown with its transition matrix and emission densities p(x | q).]

HMM summary (1)
HMMs are a generative model; recognition is inference of p(Q | X). During generation the model's behavior depends only on the current state q_n: transition probabilities p(q_(n+1) | q_n) = a_ij and observation distributions p(x_n | q_n) = b_i(x). Given states Q = {q1, q2, ..., qN} and observations X = X_1^N = {x1, x2, ..., xN}, the Markov assumption gives p(X, Q | Theta) = prod_n p(x_n | q_n) p(q_n | q_(n-1)). (A decoding sketch follows.)
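A hedged numpy sketch of Viterbi decoding for the model just defined, worked in the log domain for numerical safety; the array layout (time x states log-likelihoods, states x states log-transition matrix) is an assumed convention, not the lecture's notation.

```python
import numpy as np

def viterbi(log_b, log_a, log_pi):
    """Best-path HMM decode.

    log_b: (T, S) log observation likelihoods log b_j(x_t);
    log_a: (S, S) log transitions, log_a[i, j] = log p(q_j | q_i);
    log_pi: (S,) log initial state probabilities.
    Implements alpha*_n(j) = max_i{alpha*_{n-1}(i) a_ij} b_j(x_n), then backtraces."""
    T, S = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_a        # score of (previous i, next j)
        back[t] = scores.argmax(axis=0)        # best predecessor for each j
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())
```

Replacing the max/argmax with a log-sum-exp gives the forward recursion for p(X | Theta).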

HMM summary (2)
Calculate p(X | Theta) via the forward recursion: alpha_n(j) = p(X_1^n, q_n^j) = [ sum_{i=1..S} alpha_(n-1)(i) a_ij ] b_j(x_n). The Viterbi (best-path) approximation replaces the sum with a max, alpha*_n(j) = max_i{ alpha*_(n-1)(i) a_ij } b_j(x_n), and then backtraces to recover Q* = argmax_Q p(X, Q | Theta). Pictorially: X is observed; the model M and state sequence Q = {q1, q2, ..., qn} are assumed and hidden, and must be inferred.

Recognition with HMMs
Isolated words: choose the best p(M | X), proportional to p(X | M) p(M), across word models ('one', 'two', 'three', ...). Continuous speech: Viterbi decoding

of one large HMM gives words p(M1) Input p(M2) sil p(M3) E6820 (Ellis & Mandel) w th L9: Speech recognition ah t n uw r iy April 7, 2009 31 / 43 Source: http://www.doksinet Training HMMs Probabilistic foundation allows us to train HMMs to ‘fit’ training data i.e estimate aij , bi (x) given data I better than DTW. Algorithms to improve p(Θ | X ) are key to success of HMMs I maximum-likelihood of models. State alignments Q for training examples are generally unknown I . else estimating parameters would be easy Viterbi training I I ‘Forced alignment’ choose ‘best’ labels (heuristic) EM training I ‘fuzzy labels’ (guaranteed local convergence) E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 32 / 43 Source: http://www.doksinet Overall training procedure Labelled training data Word models “two one” one “five” two “four three” w t three Data ah th n uw r iy Models t uw th r f ao w ah n th

r iy iy Fit models to data Repeat until Re-estimate model parameters convergence E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 33 / 43 Source: http://www.doksinet Language models Recall, fundamental equation of speech recognition Q ∗ = argmax p(Q | X , Θ) Q = argmax p(X | Q, ΘA )p(Q | ΘL ) Q So far, looked at p(X | Q, ΘA ) What about p(Q | ΘL )? I I Q is a particular word sequence ΘL are parameters related to the language Two components: I I link state sequences to words p(Q | wi ) priors on word sequences p(wi | Mj ) E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 34 / 43 Source: http://www.doksinet HMM Hierarchy HMMs support composition I can handle time dilation, pronunciation, grammar all within the same framework ae1 k ae2 ae ae3 p(q | M) = p(q, φ, w | M) t aa = p(q | φ) · p(φ | w ) CAT ATE · p(wn | w1n−1 , M) THE SAT DOG E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 35 /

43 Source: http://www.doksinet Pronunciation models Define states within each word p(Q | wi ) Can have unique states for each word (‘whole-word’ modeling), or . Sharing (tying) subword units between words to reflect underlying phonology I I I more training examples for each unit generalizes to unseen words (or can do it automatically. ) Start e.g from pronunciation dictionary: ZERO(0.5) ZERO(0.5) ONE(1.0) TWO(1.0) E6820 (Ellis & Mandel) z iy r ow z ih r ow w ah n tcl t uw L9: Speech recognition April 7, 2009 36 / 43 Source: http://www.doksinet Learning pronunciations ‘Phone recognizer’ transcribes training data as phones I align to ‘canonical’ pronunciations Baseform Phoneme String f ay v y iy r ow l d f ah ay v y uh r ow l Surface Phone String I I infer modification rules predict other pronunciation variants e.g ‘d deletion’: d ∅|`stop p = 0.9 Generate pronunciation variants; use forced alignment to find weights E6820 (Ellis & Mandel)

Grammar
Account for different likelihoods of different words and word sequences, p(wi | Mj).
'True' probabilities are very complex for LVCSR
- need parses, but speech is often agrammatic
Use n-grams:  p(wn | w_1^{n−1}) = p(wn | w_{n−K}, ..., w_{n−1})
e.g. n-gram models of Shakespeare:
  n=1: To him swallowed confess hear both. Which Of save on
  n=2: What means, sir. I confess she? then all sorts, he is trim,
  n=3: Sweet prince, Falstaff shall die. Harry of Monmouth's grave
  n=4: King Henry. What! I will go seek the traitor Gloucester
Big win in recognizer WER
- raw recognition results are often highly ambiguous
- the grammar guides the search to 'reasonable' solutions
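A toy sketch (my own) of the n-gram idea: estimate bigram counts from text and score word sequences, backing off to unigrams when a bigram was never seen. This is deliberately crude -- real LVCSR grammars use proper smoothing such as Katz or Kneser-Ney, discussed on the next slide.

```python
import collections, math

def train_bigram(sentences):
    uni, bi = collections.Counter(), collections.Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        uni.update(words)
        bi.update(zip(words[:-1], words[1:]))
    return uni, bi

def logprob(words, uni, bi, alpha=0.4):
    total, V = sum(uni.values()), len(uni)
    lp = 0.0
    words = ["<s>"] + words + ["</s>"]
    for prev, w in zip(words[:-1], words[1:]):
        if bi[(prev, w)] > 0:                      # seen bigram: relative frequency
            p = bi[(prev, w)] / uni[prev]
        else:                                      # unseen: back off to a scaled add-one unigram
            p = alpha * (uni[w] + 1) / (total + V)
        lp += math.log(p)
    return lp

uni, bi = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(logprob(["the", "cat", "sat"], uni, bi) > logprob(["cat", "the", "sat"], uni, bi))  # True
```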

Smoothing LVCSR grammars
n-grams (n = 3 or 4) are estimated from large text corpora
- 100M+ words; but: not like spoken language
- a 100,000-word vocabulary means 10^15 possible trigrams!
- never see enough examples
- unobserved trigrams should NOT have Pr = 0!
Backoff to bigrams, unigrams
- p(wn) as an approximation to p(wn | wn−1), etc.
- interpolate 1-gram, 2-gram, 3-gram with learned weights?
Lots of ideas, e.g. category grammars
- p(PLACE | "went", "to") · p(wn | PLACE)
- how to define categories?
- how to tag words in the training corpus?

Decoding
How to find the MAP word sequence?
States, pronunciations, words define one big HMM
- with 100,000+ individual states for LVCSR!
Exploit hierarchic structure
- phone states independent of word
- next word (semi) independent of word history
[lexical-tree figure: root → d uw (DO); root → d iy k ow d (DECODE) → +s/+z (DECODES), +axr (DECODER); d iy k oy (DECOY); b ...]

Decoder pruning
Searching 'all possible word

sequences’ ? I I I need to restrict search to most promising ones: beam search sort by estimates of total probability = Pr (so far)+ lower bound estimate of remains trade search errors for speed Start-synchronous algorithm: I I extract top hypothesis from queue: [Pn, {w1 , . , wk }, n] pr. so far words next time frame find plausible words {wi } starting at time n new hypotheses: [Pn p(Xnn+N−1 | w i )p(w i | wk . ), I I {w1 , . , wk , w i }, n + N] discard if too unlikely, or queue is too long else re-insert into queue and repeat E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 41 / 43 Source: http://www.doksinet Summary Speech signal is highly variable I I need models that absorb variability hide what we can with robust features Speech is modeled as a sequence of features I I need temporal aspect to recognition best time-alignment of templates = DTW Hidden Markov models are rigorous solution I I self-loops allow temporal dilation exact,

efficient likelihood calculations Language modeling captures larger structure I I I pronunciation, word sequences fits directly into HMM state structure need to ‘prune’ search space in decoding Parting thought Forward-backward trains to generate, can we train to discriminate? E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 42 / 43 Source: http://www.doksinet References Lawrence R. Rabiner A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989 Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88, 2002 Wendy Holmes. Speech Synthesis and Recognition CRC, December 2001 ISBN 0748408576. Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition Prentice Hall PTR, April 1993. ISBN 0130151572 Daniel Jurafsky and James H. Martin Speech and Language Processing: An

Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, January 2000 ISBN 0130950696 Frederick Jelinek. Statistical Methods for Speech Recognition (Language, Speech, and Communication). The MIT Press, January 1998 ISBN 0262100665 Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall PTR, April 2001. ISBN 0130226165 E6820 (Ellis & Mandel) L9: Speech recognition April 7, 2009 43 / 43 Source: http://www.doksinet EE E6820: Speech & Audio Processing & Recognition Lecture 10: Signal Separation Dan Ellis <dpwe@ee.columbiaedu> Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820 April 14, 2009 1 2 3 4 Sound mixture organization Computational auditory scene analysis Independent component analysis Model-based separation E6820 (Ellis & Mandel)

L10: Signal separation April 14, 2009 1 / 45 Source: http://www.doksinet Outline 1 Sound mixture organization 2 Computational auditory scene analysis 3 Independent component analysis 4 Model-based separation E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 2 / 45 Source: http://www.doksinet Sound Mixture Organization 4000 frq/Hz 3000 0 -20 2000 -40 1000 0 0 2 4 6 8 10 12 time/s -60 level / dB Analysis Voice (evil) Voice (pleasant) Stab Rumble Choir Strings Auditory Scene Analysis: describing a complex sound in terms of high-level sources / events . like listeners do Hearing is ecologically grounded I I reflects ‘natural scene’ properties subjective, not absolute E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 3 / 45 Source: http://www.doksinet Sound, mixtures, and learning freq / kHz Speech Noise Speech + Noise 4 3 + 2 = 1 0 0 0.5 1 time / s 0 0.5 1 time / s 0 0.5 1 time / s

Sound I I carries useful information about the world complements vision Mixtures . are the rule, not the exception I medium is ‘transparent’, sources are many I must be handled! Learning I I the ‘speech recognition’ lesson: let the data do the work like listeners E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 4 / 45 Source: http://www.doksinet The problem with recognizing mixtures “Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman, 1990) Received waveform is a mixture I two sensors, N signals . underconstrained Disentangling mixtures as the primary goal? I I perfect solution is not possible need experience-based constraints E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 5 / 45 Source:

http://www.doksinet Approaches to sound mixture recognition Separate signals, then recognize e.g Computational Auditory Scene Analysis (CASA), Independent Component Analysis (ICA) I nice, if you can do it Recognize combined signal I I ‘multicondition training’ combinatorics. Recognize with parallel models I I full joint-state space? divide signal into fragments, then use missing-data recognition E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 6 / 45 Source: http://www.doksinet What is the goal of sound mixture analysis? CASA ? Separate signals? I I I output is unmixed waveforms underconstrained, very hard . too hard? not required? Source classification? I I output is set of event-names listeners do more than this. Something in-between? Identify independent sources + characteristics I standard task, results? E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 7 / 45 Source: http://www.doksinet Segregation vs. Inference

Source separation requires attribute separation I I sources are characterized by attributes (pitch, loudness, timbre, and finer details) need to identify and gather different attributes for different sources. Need representation that segregates attributes I I spectral decomposition periodicity decomposition Sometimes values can’t be separated e.g unvoiced speech I maybe infer factors from probabilistic model? p(O, x, y ) p(x, y | O) I or: just skip those values & infer from higher-level context E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 8 / 45 Source: http://www.doksinet Outline 1 Sound mixture organization 2 Computational auditory scene analysis 3 Independent component analysis 4 Model-based separation E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 9 / 45 Source: http://www.doksinet Auditory Scene Analysis (Bregman, 1990) How do people analyze sound mixtures? I I I break mixture into small elements (in

time-freq) elements are grouped in to sources using cues sources have aggregate attributes Grouping ‘rules’ (Darwin and Carlyon, 1995) I cues: common onset/offset/modulation, harmonicity, spatial location, . Onset map Sound Frequency analysis Elements Harmonicity map Sources Grouping mechanism Source properties Spatial map E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 10 / 45 Source: http://www.doksinet Cues to simultaneous grouping freq / Hz Elements + attributes 8000 6000 4000 2000 0 0 1 2 3 4 5 6 7 8 9 time / s Common onset I simultaneous energy has common source Periodicity I energy in different bands with same cycle Other cues I spatial (ITD/IID), familiarity, . E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 11 / 45 Source: http://www.doksinet The effect of context Context can create an ‘expectation’ i.e a bias towards a particular interpretation I A change in a signal will be

interpreted as an added source whenever possible
e.g. Bregman's "old-plus-new" principle:
- a different division of the same energy depending on what preceded it
[spectrogram pair, 0-1.2 s, frequency / kHz]

Computational Auditory Scene Analysis (CASA)
Goal: Automatic sound organization — systems to 'pick out' sounds in a mixture (sound mixture → object 1 / object 2 / object 3 percepts), like people do
- e.g. voice against a noisy background
- to improve speech recognition
Approach: psychoacoustics describes grouping 'rules' ... just implement them?

CASA front-end processing
Correlogram: loosely based on known/possible physiology
[diagram: sound → cochlea filterbank (frequency channels) → envelope follower → short-time autocorrelation (delay line), giving a freq × lag × time volume]
- linear filterbank: cochlear approximation
- static nonlinearity
- zero-delay slice is like a spectrogram
- periodicity from delay-and-multiply detectors

Bottom-Up Approach (Brown and Cooke, 1994)
Implement psychoacoustic theory:
input mixture → front end (signal features / maps: freq, onset, period, frq.mod) → object formation (discrete objects) → grouping rules → source groups
- left-to-right processing
- uses common onset & periodicity cues
Able to extract voiced speech
[spectrograms: 'v3n7' original mix vs. extracted voiced speech (Hu & Wang '02), 0-1.4 s, level / dB]

Problems with 'bottom-up' CASA
Circumscribing time-frequency elements
- need to have 'regions', but hard to find
Periodicity is the primary cue
- how to handle aperiodic energy?
Resynthesis via masked filtering
- cannot separate within a single t-f element
Bottom-up leaves no ambiguity or context
- how to model illusions?
E6820 (Ellis &

Mandel) L10: Signal separation April 14, 2009 16 / 45 Source: http://www.doksinet Restoration in sound perception Auditory ‘illusions’ = hearing what’s not there The continuity illusion & Sinewave Speech (SWS) 8000 Frequency 6000 4000 2000 0 0 0.5 1 1.5 Time 2 2.5 2 Time 2.5 3 Frequency 5000 4000 3000 2000 1000 0 I 0.5 1 1.5 3.5 duplex perception What kind of model accounts for this? I is it an important part of hearing? E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 17 / 45 Source: http://www.doksinet Adding top-down constraints: Prediction-Driven CASA Perception is not direct but a search for plausible hypotheses Data-driven (bottom-up). input mixture I Front end signal features Object formation discrete objects Grouping rules Source groups objects irresistibly appear vs. Prediction-driven (top-down) hypotheses Hypothesis management prediction errors input mixture I I Front end signal features Compare &

reconcile Noise components Periodic components Predict & combine predicted features match observations with a ‘world-model’ need world-model constraints. E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 18 / 45 Source: http://www.doksinet Generic sound elements for PDCASA (Ellis, 1996) Goal is a representational space that I I I f/Hz covers real-world perceptual sounds minimal parameterization (sparseness) separate attributes in separate parameters NoiseObject profile magProfile H[k] 4000 ClickObject XC[n,k] decayTimes T[k] f/Hz 2000 4000 2000 1000 1000 WeftObject H W[n,k] 400 400 H[k] 200 0 200 f/Hz 400 XN[n,k] 0.010 0.00 magContour A[n] 0.005 0.01 2.2 2.4 time / s 0.0 0.1 time / s 0.000 0.0 0.1 200 p[n] 100 XC[n,k] = H[k]·exp((n - n0) / T[k]) 50 f/Hz 400 0.2 time/s Correlogram analysis Periodogram 200 XN[n,k] = H[k]·A[n] 100 50 0.0 0.2 0.4 0.6 0.8 time/s XW[n,k] = HW[n,k]·P[n,k] Object hierarchies

built on top. E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 19 / 45 Source: http://www.doksinet PDCASA for old-plus-new Incremental analysis t1 t2 t3 Input Signal Time t1: Initial element created Time t2: Additional element required Time t2: Second element finished E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 20 / 45 Source: http://www.doksinet PDCASA for the continuity illusion Subjects hear the tone as continuous . if the noise is a plausible masker f/Hz ptshort 4000 2000 1000 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 t/s Data-driven analysis gives just visible portions: Prediction-driven can infer masking: E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 21 / 45 Source: http://www.doksinet Prediction-Driven CASA Explain a complex sound with basic elements f/Hz City 4000 2000 1000 400 200 1000 400 200 100 50 0 1 2 3 f/Hz Wefts1−4 4 5 Weft5 6 7 Wefts6,7 8 Weft8 9 Wefts9−12 4000 2000

1000 400 200 1000 400 200 100 50 Horn1 (10/10) Horn2 (5/10) Horn3 (5/10) Horn4 (8/10) Horn5 (10/10) f/Hz Noise2,Click1 4000 2000 1000 400 200 Crash (10/10) f/Hz Noise1 4000 2000 1000 −40 400 200 −50 −60 Squeal (6/10) Truck (7/10) −70 0 E6820 (Ellis & Mandel) 1 2 3 4 5 6 7 8 L10: Signal separation 9 time/s dB April 14, 2009 22 / 45 Source: http://www.doksinet Aside: Ground Truth What do people hear in sound mixtures? I do interpretations match? Listening tests to collect ‘perceived events’: E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 23 / 45 Source: http://www.doksinet Aside: Evaluation Evaluation is a big problem for CASA I I I what is the goal, really? what is a good test domain? how do you measure performance? SNR improvement I I I tricky to derive from before/after signals: correspondence problem can do with fixed filtering mask differentiate removing signal from adding noise Speech Recognition (ASR)

improvement I recognizers often sensitive to artifacts ‘Real’ task? I mixture corpus with specific sound events. E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 24 / 45 Source: http://www.doksinet Outline 1 Sound mixture organization 2 Computational auditory scene analysis 3 Independent component analysis 4 Model-based separation E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 25 / 45 Source: http://www.doksinet Independent Component Analysis (ICA) (Bell and Sejnowski, 1995, etc.) If mixing is like matrix multiplication, then separation is searching for the inverse matrix a11 a12 a21 a22 s1 × s1 s2 x1 x2 a11 w11 w12 × w21 w22 a21 a12 a22 s2 x1 x2 s^1 s^2 −δ MutInfo δwij i.e W ≈ A−1 I with N different versions of the mixed signals (microphones), we can find N different input contributions (sources) I how to rate quality of outputs? i.e when do outputs look separate? E6820 (Ellis & Mandel) L10:

Gaussianity, Kurtosis, & Independence
A signal can be characterized by its PDF p(x), i.e. as if successive time values are drawn from a random variable (RV)
- Gaussian PDF is 'least interesting'
- Sums of independent RVs (PDFs convolved) tend to a Gaussian PDF (central limit theorem)
Measures of deviation from Gaussianity: the 4th moment is kurtosis ("bulging")
  kurt(y) = E[ ((y − µ) / σ)^4 ] − 3
- kurtosis of a Gaussian is zero (with this definition)
- 'heavy tails' (e.g. exp(−|x|), kurt = 2.5) give kurt > 0; closer to uniform (e.g. exp(−x^4), kurt = −1) gives kurt < 0
- directly related to KL divergence from the Gaussian PDF
[figure: example PDFs exp(−|x|), Gaussian, exp(−x^4) with their kurtosis values]

Independence in Mixtures
Scatter plots & kurtosis values: sources s1 vs. s2 versus mixtures x1 vs. x2
- p(s1): kurtosis = 27.90   p(s2): kurtosis = 53.85
- p(x1): kurtosis = 18.50   p(x2): kurtosis = 27.90
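A quick numerical illustration (my own, not from the slides) of the point the kurtosis values above make: mixing independent heavy-tailed sources pushes the marginals toward Gaussian, i.e. toward zero excess kurtosis.

```python
import numpy as np

rng = np.random.default_rng(0)

def kurt(y):
    """Excess kurtosis kurt(y) = E[((y - mu)/sigma)^4] - 3 (zero for a Gaussian)."""
    z = (y - y.mean()) / y.std()
    return float(np.mean(z ** 4) - 3)

# Two independent heavy-tailed sources and a 2x2 instantaneous mix x = A s.
s = rng.laplace(size=(2, 100_000))     # Laplacian sources: excess kurtosis ~ 3
A = np.array([[0.8, 0.6],
              [0.4, 0.9]])
x = A @ s

print([round(kurt(si), 2) for si in s])   # roughly [3.0, 3.0]
print([round(kurt(xi), 2) for xi in x])   # smaller: the mixtures are closer to Gaussian
```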

Finding Independent Components
Sums of independent RVs are more Gaussian
- minimize Gaussianity to undo the sums, i.e. search over the w_ij terms of the inverse matrix
[figure: mixture scatter plot, and kurtosis as a function of unmixing angle θ/π]
Solve by gradient descent or Newton-Raphson:
  w⁺ = E[x g(wᵀx)] − E[g′(wᵀx)] w
  w = w⁺ / ‖w⁺‖
"Fast ICA" (Hyvärinen and Oja, 2000)  http://www.cishutfi/projects/ica/fastica/

Limitations of ICA
Assumes instantaneous mixing

I I real world mixtures have delays & reflections STFT domain? x1 (t) = a11 (t) ∗ s1 (t) + a12 (t) ∗ s2 (t) ⇒ X1 (ω) = A11 (ω)S1 (ω) + A12 (ω)S2 (ω) I Solve ω subbands separately, match up answers Searching for best possible inverse matrix I I cannot find more than N outputs from N inputs but: “projection pursuit” ideas + time-frequency masking. Cancellation inherently fragile I I I ŝ1 = w11 x1 + w12 x2 to cancel out s2 sensitive to noise in x channels time-varying mixtures are a problem E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 30 / 45 Source: http://www.doksinet Outline 1 Sound mixture organization 2 Computational auditory scene analysis 3 Independent component analysis 4 Model-based separation E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 31 / 45 Source: http://www.doksinet Model-Based Separation: HMM decomposition (e.g Varga and Moore, 1990; Gales and Young, 1993) Independent state

sequences for 2+ component source models model 2 model 1 observations / time New combined state space q 0 = q1 × q2 I need pdfs for combinations p(X | q1 , q2 ) E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 32 / 45 Source: http://www.doksinet One-channel Separation: Masked Filtering Multichannel ICA: Inverse filter and cancel target s s^ + - mix interference n n^ proc One channel: find a time-frequency mask mix s+n spectro gram stft x istft s^ mask proc Frequency Frequency Cannot remove overlapping noise in t-f cells, but surprisingly effective (psych masking?): 0 dB SNR 8000 -12 dB SNR -24 dB SNR 6000 Noise mix 4000 2000 8000 6000 Mask resynth 4000 2000 0 0 1 2 Time 3 E6820 (Ellis & Mandel) 0 1 2 Time 3 0 1 2 Time L10: Signal separation 3 April 14, 2009 33 / 45 Source: http://www.doksinet “One microphone source separation” (Roweis, 2001) State sequences t-f estimates mask Original voices freq / Hz

Speaker 1 Speaker 2 4000 3000 2000 1000 0 Mixture 4000 3000 2000 1000 0 Resynthesis masks State means 4000 3000 2000 1000 0 4000 3000 2000 1000 0 I I 0 1 2 3 4 5 6 time / sec 0 1 2 3 4 5 6 time / sec 1000 states/model ( 106 transition probs.) simplify by subbands (coupled HMM)? E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 34 / 45 Source: http://www.doksinet Speech Fragment Recognition (Barker et al., 2005) Signal separation is too hard! Instead: I I segregate features into partially-observed sources then classify Made possible by missing data recognition I integrate over uncertainty in observations for true posterior distribution Goal: Relate clean speech models P(X | M) to speech-plus-noise mixture observations . and make it tractable E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 35 / 45 "#$$#%&!()(!*+,-&%#)#-% Source: http://www.doksinet !!! !! ! . Speech

models p(x | m) are multidimensional...
- i.e. means, variances for every freq. channel
- need values for all dimensions to get p(·)
But: can evaluate over a subset of dimensions x_k:
  p(x_k | m) = ∫ p(x_k, x_u | m) dx_u
[figure: joint p(x_k, x_u), its marginal p(x_k), and the conditional p(x_k | x_u < y)]
  P(x | q) = P(x1 | q) · P(x2 | q) · P(x3 | q) · P(x4 | q) · P(x5 | q) · P(x6 | q)
- the hard part is finding the mask (segregation)
Hence, missing data recognition: 'present data' mask
E6820 (Ellis & Mandel)
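A minimal sketch (mine) of the marginalization above for a diagonal-Gaussian state model: with independent dimensions, integrating out the unreliable dimensions x_u just means dropping their factors from the product.

```python
import numpy as np

def log_presentdata_lik(x, mask, mean, var):
    """log p(x_k | q) for a diagonal-Gaussian state, using only the 'present' dimensions."""
    k = np.asarray(mask, dtype=bool)
    d = x[k] - mean[k]
    return float(-0.5 * np.sum(d * d / var[k] + np.log(2 * np.pi * var[k])))

# Example: 6 spectral channels, two of them dominated by the background (mask = 0 there).
x    = np.array([1.0, 2.0, 0.5, 9.0, 8.0, 0.2])
mask = np.array([1, 1, 1, 0, 0, 1], dtype=bool)      # the 'present data' mask
print(log_presentdata_lik(x, mask, mean=np.zeros(6), var=np.ones(6)))
```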

removing parts of mask? E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 37 / 45 Source: http://www.doksinet Comparing different segregations Standard classification chooses between models M to match source features X M ∗ = argmax p(M | X ) = argmax p(X | M)p(M) M M Mixtures: observed features Y , segregation S, all related by p(X | Y , S) Observation Y(f) Source X(f) Segregation S freq Joint classification of model and segregation: Z p(X | Y , S) p(M, S | Y ) = p(M) p(X | M) dX p(S | Y ) p(X ) I P(X ) no longer constant E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 38 / 45 Source: http://www.doksinet Calculating fragment matches Z p(M, S | Y ) = p(M) p(X | M) p(X | Y , S) dX p(S | Y ) p(X ) p(X | M) – the clean-signal feature model p(X | Y ,S) p(X ) – is X ‘visible’ given segregation? Integration collapses some bands. p(S|Y ) – segregation inferred from observation I I just assume uniform, find S for most likely

M or: use extra information in Y to distinguish Ss. Result: I I I probabilistically-correct relation between clean-source models p(X | M) and inferred, recognized source + segregation p(M, S | Y ) E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 39 / 45 Source: http://www.doksinet Using CASA features p(S | Y ) links acoustic information to segregation I I is this segregation worth considering? how likely is it? Opportunity for CASA-style information to contribute I I periodicity/harmonicity: these different frequency bands belong together onset/continuity: this time-frequency region must be whole Frequency Proximity E6820 (Ellis & Mandel) Common Onset L10: Signal separation Harmonicity April 14, 2009 40 / 45 Source: http://www.doksinet Fragment decoding Limiting S to whole fragments makes hypothesis search tractable: Fragments Hypotheses w1 w5 w2 w6 w7 w3 w8 w4 time T1 T2 T3 T4 T5 T6 I choice of fragments reflects p(S | Y )p(X

| M) i.e best combination of segregation and match to speech models Merging hypotheses limits space demands . but erases specific history E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 41 / 45 Source: http://www.doksinet Speech fragment decoder results Simple p(S | Y ) model forces contiguous regions to stay together I big efficiency gain when searching S space "1754" + noise SNR mask Fragments Fragment Decoder "1754" Clean-models-based recognition rivals trained-in-noise recognition E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 42 / 45 Source: http://www.doksinet Multi-source decoding Search for more than one source q2(t) S2(t) Y(t) S1(t) q1(t) Mutually-dependent data masks I I I disjoint subsets of cells for each source each model match p(Mx | Sx , Y ) is independent masks are mutually dependent: p(S1 , S2 | Y ) Huge practical advantage over full search E6820 (Ellis & Mandel) L10: Signal

separation April 14, 2009 43 / 45 Source: http://www.doksinet Summary Auditory Scene Analysis: I Hearing: partially understood, very successful Independent Component Analysis: I Simple and powerful, some practical limits Model-based separation: I Real-world constraints, implementation tricks Parting thought Mixture separation the main obstacle in many applications e.g soundtrack recognition E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 44 / 45 Source: http://www.doksinet References Albert S. Bregman Auditory Scene Analysis: The Perceptual Organization of Sound The MIT Press, 1990. ISBN 0262521954 C.J Darwin and RP Carlyon Auditory grouping In BCJ Moore, editor, The Handbook of Perception and Cognition, Vol 6, Hearing, pages 387–424. Academic Press, 1995. G. J Brown and M P Cooke Computational auditory scene analysis Computer speech and language, 8:297–336, 1994. D. P W Ellis Prediction–driven computational auditory scene analysis PhD thesis,

Department of Electrtical Engineering and Computer Science, M.IT, 1996 Anthony J. Bell and Terrence J Sejnowski An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995. ftp://ftpcnlsalkedu/pub/tony/bellblindpsZ A. Hyvärinen and E Oja Independent component analysis: Algorithms and applications. Neural Networks, 13(4–5):411–430, 2000 http://www.cishutfi/aapo/papers/IJCNN99 tutorialweb/ A. Varga and R Moore Hidden markov model decomposition of speech and noise In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 845–848, 1990. M.JF Gales and SJ Young Hmm recognition in noise using parallel model combination. In Proc Eurospeech-93, volume 2, pages 837–840, 1993 S. Roweis One-microphone source separation In NIPS, volume 11, pages 609–616 MIT Press, Cambridge MA, 2001. Jon Barker, Martin Cooke, and Daniel P. W Ellis Decoding speech in the presence of other

sources. Speech Communication, 45(1):5–25, 2005 URL http://www.eecolumbiaedu/~dpwe/pubs/BarkCE05-sfdpdf E6820 (Ellis & Mandel) L10: Signal separation April 14, 2009 45 / 45 Source: http://www.doksinet EE E6820: Speech & Audio Processing & Recognition Lecture 10: Music Analysis Michael Mandel <mim@ee.columbiaedu> Columbia University Dept. of Electrical Engineering http://www.eecolumbiaedu/∼dpwe/e6820 April 17, 2008 1 Music transcription 2 Score alignment and musical structure 3 Music information retrieval 4 Music browsing and recommendation Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 1 / 40 Source: http://www.doksinet Outline 1 Music transcription 2 Score alignment and musical structure 3 Music information retrieval 4 Music browsing and recommendation Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 2 / 40 Source: http://www.doksinet Music Transcription Basic idea: recover the score Frequency 4000 3000

2000 1000 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 Time Is it possible? Why is it hard? I music students do it . but they are highly trained; know the rules Motivations I I I for study: what was played? highly compressed representation (e.g MIDI) the ultimate restoration system. Not trivial to turn a “piano roll” into a score I I meter determination, rhythmic quantization key finding, pitch spelling Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 3 / 40 Source: http://www.doksinet Transcription framework Recover discrete events to explain signal Note events ? Observations {tk, pk, ik} I synthesis X[k,n] analysis-by-synthesis? Exhaustive search? would be possible given exact note waveforms . or just a 2-dimensional ‘note’ template? I note template 2-D convolution I but superposition is not linear in |STFT| space Inference depends on all detected notes I I is this evidence ‘available’ or ‘used’ ? full solution is exponentially

complex Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 4 / 40 Source: http://www.doksinet Problems for transcription Music is practically worst case! I note events are often synchronized I notes have harmonic relations (2:3 etc.) defeats common onset collision/interference between harmonics I variety of instruments, techniques, . Listeners are very sensitive to certain errors . and impervious to others Apply further constraints I I like our ‘music student’ maybe even the whole score! Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 5 / 40 Source: http://www.doksinet Types of transcription Full polyphonic transcriptions is hard, but maybe unnecessary Melody transcription I I the melody is produced to stand out useful for query-by-humming, summarization, score following Chord transcription I I consider the signal holistically useful for finding structure Drum transcription I I very different from other instruments can’t use

harmonic models Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 6 / 40 Source: http://www.doksinet Spectrogram Modeling Sinusoid model as with synthesis, but signal is more complex Break tracks I need to detect new ‘onset’ at single frequencies 0.06 0.04 0.02 0 0 0.5 1 1.5 3000 freq / Hz I 2500 2000 1500 1000 500 0 0 1 2 3 4 time / s time / s Group by onset & common harmonicity find sets of tracks that start around the same time + stable harmonic pattern I Pass on to constraint-based filtering. Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 7 / 40 Source: http://www.doksinet Searching for multiple pitches (Klapuri, 2005) At each frame: I I I estimate dominant f0 by checking for harmonics cancel it from spectrum repeat until no f0 is prominent Subtract & iterate audio frame level / dB Harmonics enhancement Predominant f0 estimation f0 spectral smoothing 60 40 dB dB 20 10 0 -20 0 20 30 0 1000 2000 3000

frq / Hz 20 10 10 0 0 0 -10 0 1000 2000 3000 4000 frq / Hz 0 0 100 200 300 1000 2000 3000 frq/Hz f0 / Hz Stop when no more prominent f0s Can use pitch predictions as features for further processing e.g HMM Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 8 / 40 Source: http://www.doksinet Probabilistic Pitch Estimates (Goto, 2001) Generative probabilistic model of spectrum as weighted combination of tone models at different fundamental frequencies: ! Z X p(x(f )) = w (F , m)p(x(f ) | F , m) dF m ‘Knowledge’ in terms of tone models + prior distributions for f0 : p(x,1 | F,2, µ(t)(F,2)) c(t)(1|F,2) (t) c (2|F,2) Tone Model (m=1) p(x | F,1, µ(t)(F,1)) c(t)(4|F,2) c(t)(2|F,1) fundamental frequency F+1200 Music analysis c(t)(3|F,2) c(t)(1|F,1) F EM (RT) results: Michael Mandel (E6820 SAPR) Tone Model (m=2) p(x | F,2, µ(t)(F,2)) c(t)(3|F,1) c(t)(4|F,1) F+1902 F+2400 x [cent] x [cent] April 17, 2008 9 / 40 Source:

http://www.doksinet Generative Model Fitting (Cemgil et al., 2006) Multi-level graphical model I I I I rj,t : piano roll note-on indicators sj,t : oscillator state for each harmonic yj,t : time-domain realization of each note separately yt : combined time-domain waveform (actual observation) Incorporates knowledge of high-level musical structure and low-level acoustics Inference exact in some cases, approximate in others I special case of the generally intractable switching Kalman filter Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 10 / 40 Source: http://www.doksinet Transcription as Pattern Recognition (Poliner and Ellis) Existing methods use prior knowledge about the structure of pitched notes i.e we know they have regular harmonics What if we didn’t know that, but just had examples and features? I the classic pattern recognition problem Could use music signal as evidence for pitch class in a black-box classifier: Audio I Trained classifier

p("C0"|Audio) p("C#0"|Audio) p("D0"|Audio) p("D#0"|Audio) p("E0"|Audio) p("F0"|Audio) nb: more than one class at once! But where can we get labeled training data? Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 11 / 40 Source: http://www.doksinet Ground Truth Data Pattern classifiers need training data i.e need {signal, note-label} sets i.e MIDI transcripts of real music already exists? Make sound out of midi I I “play” midi on a software synthesizer record a player piano playing the midi Make midi from monophonic tracks in a multi-track recording I for melody, just need a capella tracks Distort recordings to create more data I I resample/detune any of the audio and repeat add in reverb or noise Use a classifier to train a better classifier I I alignment in the classification domain run SVM & HMM to label, use to retrain SVM Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 12 /

40 Source: http://www.doksinet Polyphonic Piano Transcription (Poliner and Ellis, 2007) Training data from player piano freq / pitch Bach 847 Disklavier A6 20 A5 10 A4 0 A3 A2 -10 A1 -20 0 1 2 3 4 5 6 7 8 level / dB 9 time / sec Independent classifiers for each note I plus a little HMM smoothing Nice results . when test data resembles training Algorithm SVM Klapuri & Ryynänen Marolt Michael Mandel (E6820 SAPR) Errs False Pos False Neg d0 43.3% 66.6% 84.6% 27.9% 28.1% 36.5% 15.4% 38.5% 48.1% 3.44 2.71 2.35 Music analysis April 17, 2008 13 / 40 Source: http://www.doksinet Outline 1 Music transcription 2 Score alignment and musical structure 3 Music information retrieval 4 Music browsing and recommendation Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 14 / 40 Source: http://www.doksinet Midi-audio alignment Pattern classifiers need training data i.e need {signal, note-label} sets i.e MIDI transcripts of real music

already exists? Idea: force-align MIDI and original I I can estimate time-warp relationships recover accurate note events in real music! Also useful by itself I I comparing performances of the same piece score following, etc. Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 15 / 40 Source: http://www.doksinet Which features to align with? Transcription classifier posteriors Playback piano Recorded Audio chroma STFT Feature Analysis Warped MIDI tr Reference MIDI t (t ) w r DTW MIDI score MIDI to WAV Warping Function tw Synthesized Audio chroma ^ t (t ) r w STFT Audio features: STFT, chromagram, classification posteriors Midi features: STFT of synthesized audio, derived chromagram, midi score Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 16 / 40 Source: http://www.doksinet Alignment example Dynamic time warp can overcome timing variations Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 17 / 40 Source:

http://www.doksinet Segmentation and structure Find contiguous regions that are internally similar and different from neighbors e.g “self-similarity” matrix (Foote, 1997) time / 183ms frames DYWMB - self similarity 1800 1600 1400 1200 1000 800 600 400 200 0 0 500 1000 1500 time / 183ms frames 100 200 300 400 500 2D convolution of checkerboard down diagonal = compare fixed windows at every point I Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 18 / 40 Source: http://www.doksinet BIC segmentation Use evidence from whole segment, not just local window Do ‘significance test’ on every possible division of every possible context BIC : log L(X1 ; M1 )L(X2 ; M2 ) λ ≷ log(N)#(M) L(X ; M0 ) 2 Eventually, a boundary is found: Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 19 / 40 Source: http://www.doksinet HMM segmentation Recall, HMM Viterbi path is joint classification and segmentation e.g for singing/accompaniment segmentation

But: HMM states need to be defined in advance I I define a ‘generic set’ ? (MPEG7) learn them from the piece to be segmented? (Logan and Chu, 2000; Peeters et al., 2002) Result is ‘anonymous’ state sequence characteristic of particular piece U2-The Joshua Tree-01-Where The Streets Have No Name 33677 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 1 7 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 1 1 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 1 5 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 3 3 3 3 3 3 3 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 Michael Mandel

(E6820 SAPR) Music analysis April 17, 2008 20 / 40 Source: http://www.doksinet Finding Repeats Music frequently repeats main phrases Repeats give off-diagonal ridges in Similarity matrix (Bartsch and Wakefield, 2001) time / 183ms frames DYWMB - self similarity 1800 1600 1400 1200 1000 800 600 400 200 0 0 500 1000 1500 time / 183ms frames Or: clustering at phrase-level . Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 21 / 40 Source: http://www.doksinet Music summarization What does it mean to ‘summarize’ ? I I I compact representation of larger entity maximize ‘information content’ sufficient to recognize known item So summarizing music? short version e.g < 10% duration (< 20s for pop) sufficient to identify style, artist e.g chorus or ‘hook’ ? I I Why? I I I browsing existing collection discovery among unknown works commerce. Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 22 / 40 Source: http://www.doksinet

Outline 1 Music transcription 2 Score alignment and musical structure 3 Music information retrieval 4 Music browsing and recommendation Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 23 / 40 Source: http://www.doksinet Music Information Retrieval Text-based searching concepts for music? I I I I “Google of music” finding a specific item finding something vague finding something new Significant commercial interest Basic idea: Project music into a space where neighbors are “similar” (Competition from human labeling) Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 24 / 40 Source: http://www.doksinet Music IR: Queries & Evaluation What is the form of the query? Query by humming I I considerable attention, recent demonstrations need/user base? Query by noisy example I I I “Name that tune” in a noisy bar Shazam Ltd.: commercial deployment database access is the hard part? Query by multiple examples I “Find me more stuff like

this” Text queries? (Whitman and Smaragdis, 2002) Evaluation problems I I requires large, shareable music corpus! requires a well-defined task Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 25 / 40 Source: http://www.doksinet Genre Classification (Tzanetakis et al., 2001) Classifying music into genres would get you some way towards finding “more like this” Genre labels are problematic, but they exist Real-time visualization of “GenreGram”: I I I 9 spectral and 8 rhythm features every 200ms 15 genres trained on 50 examples each single Gaussian model ∼ 60% correct Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 26 / 40 Source: http://www.doksinet Artist Classification (Berenzweig et al., 2002) Artist label as available stand-in for genre Train MLP to classify frames among 21 artists Using only “voice” segments: I Song-level accuracy improves 56.7% 649% Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee) true voice Michael Penn The

Roots The Moles Eric Matthews Arto Lindsay Oval Jason Falkner Built to Spill Beck XTC Wilco Aimee Mann The Flaming Lips Mouse on Mars Dj Shadow Richard Davies Cornelius Mercury Rev Belle & Sebastian Sugarplastic Boards of Canada 0 50 100 150 200 time / sec Track 4 - Arto Lindsay (dynvox=Arto, unseg=Oval) true voice Michael Penn The Roots The Moles Eric Matthews Arto Lindsay Oval Jason Falkner Built to Spill Beck XTC Wilco Aimee Mann The Flaming Lips Mouse on Mars Dj Shadow Richard Davies Cornelius Mercury Rev Belle & Sebastian Sugarplastic Boards of Canada 0 10 Michael Mandel (E6820 SAPR) 20 30 40 Music analysis 50 60 70 80 time / sec April 17, 2008 27 / 40 Source: http://www.doksinet Textual descriptions Classifiers only know about the sound of the music not its context So collect training data just about the sound I I Have humans label short, unidentified clips Reward for being relevant and thorough MajorMiner Music game Michael Mandel (E6820 SAPR)

Music analysis April 17, 2008 28 / 40 Source: http://www.doksinet Tag classification Use tags to train classifiers I ‘autotaggers’ Treat each tag separately, easy to evaluate performance No ‘true’ negatives I approximate as absence of one tag in the presence of others rap hip hop saxophone house techno r&b jazz ambient beat electronica electronic dance punk fast rock country strings synth distortion guitar female drum machine pop soft piano slow british male solo 80s instrumental keyboard vocal bass voice drums 0.4 Michael Mandel (E6820 SAPR) Music analysis 0.5 0.6 0.7 0.8 0.9 Classification accuracy April 17, 2008 1 29 / 40 Source: http://www.doksinet Outline 1 Music transcription 2 Score alignment and musical structure 3 Music information retrieval 4 Music browsing and recommendation Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 30 / 40 Source: http://www.doksinet Music browsing Most interesting problem in music IR is

finding new music I is there anything on myspace that I would like? Need a feature space where music/artists are arranged according to perceived similarity Particularly interested in little-known bands I I little or no ‘community data’ (e.g collaborative filtering) audio-based measures are critical Also need models of personal preference I I where in the space is stuff I like relative sensitivity to different dimensions Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 31 / 40 Source: http://www.doksinet Unsupervised Clustering (Rauber et al., 2002) Map music into an auditory-based space Build ‘clusters’ of nearby similar music I “Self-Organizing Maps” (Kohonen) Look at the results: ga-iwantit ga-iwantit verve-bittersweetga-moneymilk ga-moneymilk verve-bittersweet party party backforgood backforgood bigworld bigworld nma-poison nma-poison br-anesthesia whenyousay br-anesthesia whenyousay youlearn youlearn br-punkrock br-punkrock limp-lesson

limp-lesson limp-stalemate limp-stalemate addict addict ga-lie ga-lie ga-innocent ga-innocent pinkpanther pinkpanther ga-time ga-time vm-classicalgas vm-classicalgas vm-toccata vm-toccata d3-kryptonite d3-kryptonite fbs-praise fbs-praise ga-anneclaire ga-anneclaire limp-rearranged korn-freak limp-rearranged limp-nobodyloves ga-heaven korn-freak limp-nobodyloves ga-heaven limp-show limp-show pr-neverenoughlimp-wandering pr-neverenough limp-99 limp-99 limp-wandering pr-deadcell pr-deadcell rem-endoftheworld pr-binge pr-binge wildwildwest wildwildwestrem-endoftheworld “Islands of music” I quantitative evaluation? Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 32 / 40 Source: http://www.doksinet Artist Similarity Artist classes as a basis for overall similarity: Less corrupt than ‘record store genres’ ? paula abdul abba roxette toni braxton westlife aaron carter erasure lara fabian jessica simpson mariah carey new order janet jackson a ha whitney houston

eiffel 65 celine dionpet shop boys christina aguilera aqua lauryn hill duran y spears sade soft cell all saints backstreet boys madonna prince spice girlsbelinda carlisle ania twain nelly furtado jamiroquai annie lennox cultur ace of base seal savage garden matthew sweet faith hill tha mumba But: what is similarity between artists? I pattern recognition systems give a number. Need subjective ground truth: Collected via web site I www.musicseercom Results: 1800 users, 22,500 judgments collected over 6 months Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 33 / 40 Source: http://www.doksinet Anchor space A classifier trained for one artist (or genre) will respond partially to a similar artist A new artist will evoke a particular pattern of responses over a set of classifiers We can treat these classifier outputs as a new feature space in which to estimate similarity n-dimensional vector in "Anchor Space" Anchor Anchor p(a1|x) Audio Input (Class i)

AnchorAnchor p(a2n-dimensional |x) vector in "Anchor Space" Anchor Audio Input (Class j) GMM Modeling Similarity Computation p(a1|x)p(an|x) p(a2|x) Anchor Conversion to Anchorspace GMM Modeling KL-d, EMD, etc. p(an|x) Conversion to Anchorspace “Anchor space” reflects subjective qualities? Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 34 / 40 Source: http://www.doksinet Playola interface ( http://www.playolaorg ) Browser finds closest matches to single tracks or entire artists in anchor space Direct manipulation of anchor space axes Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 35 / 40 Source: http://www.doksinet Tag-based search ( http://majorminer.com/search ) Two ways to search MajorMiner tag data Use human labels directly I I (almost) guaranteed to be relevant small dataset Use autotaggers trained on human labels I I can only train classifiers for labels with enough clips once trained, can label unlimited amounts of

music Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 36 / 40 Source: http://www.doksinet Music recommendation Similarity is only part of recommendation I need familiar items to build trust . in the unfamiliar items (serendipity) Can recommend based on different amounts of history I I none: particular query, like search lots: incorporate every song you’ve ever listened to Can recommend from different music collections I I personal music collection: “what am I in the mood for?” online databases: subscription services, retail Appropriate music is a subset of good music? Transparency builds trust in recommendations See (Lamere and Celma, 2007) Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 37 / 40 Source: http://www.doksinet Evaluation Are recommendations good or bad? Subjective evaluation is the ground truth . but subjects don’t know the bands being recommended I can take a long time to decide if a recommendation is good Measure match

to similarity judgments e.g musicseer data Evaluate on “canned” queries, use-cases I I concrete answers: precision, recall, area under ROC curve applicable to long-term recommendations? Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 38 / 40 Source: http://www.doksinet Summary Music transcription I hard, but some progress Score alignment and musical structure I making good training data takes work, has other benefits Music IR I alternative paradigms, lots of interest Music recommendation I potential for biggest impact, difficult to evaluate Parting thought Data-driven machine learning techniques are valuable in each case Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 39 / 40 Source: http://www.doksinet References A. P Klapuri A perceptually motivated multiple-f0 estimation method In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 291–294, 2005. M. Goto A predominant-f¡sub¿0¡/sub¿ estimation method

for cd recordings: Map estimation using em algorithm for adaptive tone models. In IEEE Intl Conf on Acoustics, Speech, and Signal Processing, volume 5, pages 3365–3368 vol.5, 2001 A. T Cemgil, H J Kappen, and D Barber A generative model for music transcription IEEE Transactions on Audio, Speech, and Language Processing, 14(2):679–694, 2006. G. E Poliner and D P W Ellis A discriminative model for polyphonic piano transcription EURASIP J Appl Signal Process., pages 154–154, January 2007 J. Foote A similarity measure for automatic audio classification In Proc AAAI Spring Symposium on Intelligent Integration and Use of Text, Image, Video, and Audio Corpora, March 1997. B. Logan and S Chu Music summarization using key phrases In IEEE Intl Conf on Acoustics, Speech, and Signal Processing, volume 2, pages 749–752, 2000. G. Peeters, A La Burthe, and X Rodet Toward automatic music audio summary generation from signal analysis In Proc. Intl Symp Music Information Retrieval, October 2002

M. A Bartsch and G H Wakefield To catch a chorus: Using chroma-based representations for audio thumbnailing. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2001 B. Whitman and P Smaragdis Combining musical and cultural features for intelligent style detection In Proc Intl. Symp Music Information Retrieval, October 2002 A. Rauber, E Pampalk, and D Merkl Using psychoacoustic models and self-organizing maps to create a hierarchical structuring of music by musical styles. In Proc Intl Symp Music Information Retrieval, October 2002. G. Tzanetakis, G Essl, and P Cook Automatic musical genre classification of audio signals In Proc Intl Symp Music Information Retrieval, October 2001. A. Berenzweig, D Ellis, and S Lawrence Using voice segments to improve artist classification of music In Proc AES-22 Intl. Conf on Virt, Synth, and Ent Audio, June 2002 P. Lamere and O Celma Music recommendation tutorial In Proc Intl Symp Music Information Retrieval,

September 2007. Michael Mandel (E6820 SAPR) Music analysis April 17, 2008 40 / 40 Source: http://www.doksinet Analysis of Everyday Sounds Dan Ellis and Keansub Lee Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng, Columbia Univ, NY USA dpwe@ee.columbiaedu 1. 2. 3. 4. 5. Personal and Consumer Audio Segmenting & Clustering Special-Purpose Detectors Generic Concept Detectors Challenges & Future Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 1 /35 Source: http://www.doksinet LabROSA Overview Information Extraction Music Recognition Environment Separation Machine Learning Retrieval Signal Processing Speech Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 2 /35 Source: http://www.doksinet 1. Personal Audio Archives • Easy to record everything you hear <2GB / week @ 64 kbps to find anything • Hard how to scan? how to visualize? how to index? • Need automatic analysis • Need minimal

impact Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 3 /35 Source: http://www.doksinet Personal Audio Applications • Automatic appointment-book history fills in when & where of movements • “Life statistics” how long did I spend in meetings this week? most frequent conversations favorite phrases? • Retrieving details what exactly did I promise? privacy issues. • Nostalgia • . or what? Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 4 /35 Source: http://www.doksinet Consumer Video • Short video clips as the evolution of snapshots 10-60 sec, one location, no editing browsing? • More information for indexing. video + audio foreground + background Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 5 /35 Source: http://www.doksinet Information in Audio • Environmental recordings contain info on: location – type (restaurant, street, .) and specific activity – talking, walking, typing people – generic (2

males), specific (Chuck & John) spoken content . maybe • but not: what people and things “looked like” day/night . . except when correlated with audible features Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 6 /35 Source: http://www.doksinet A Brief History of Audio Processing • Environmental sound classification draws on earlier sound classification work as well as source separation. Speech Recognition Source Separation One channel Speaker ID Multi-channel GMM-HMMs Model-based Music Audio Genre & Artist ID Cue-based Sountrack & Environmental Recognition Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 7 /35 Source: http://www.doksinet 2. Segmentation & Clustering • Top-level structure for long recordings: Where are the major boundaries? e.g for diary application support for manual browsing of fundamental time-frame • Length 60s rather than 10ms? background more important than foreground average out

uncharacteristic transients features • Perceptually-motivated . so results have perceptual relevance broad spectrum + some detail Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 8 /35 Source: http://www.doksinet MFCC Features • Need “timbral” features: log-domain          auditory-like frequency warping  Mel-Frequency Cepstral Coeffs (MFCCs)           discrete    cosine   transform   = orthogonalization  Analysis of Everyday Sounds - Ellis & Lee       2007-07-24 p. 9 /35 
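A rough MFCC sketch (my own, following the block diagram just described: auditory-like mel warping, then log, then a DCT as orthogonalization). The filterbank construction is abbreviated and the constants are illustrative, not the exact settings used in the work above.

```python
import numpy as np
from scipy.fft import dct

def mfcc(frame, sr=16000, n_mels=20, n_ceps=13):
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel_edges = np.linspace(mel(0), mel(sr / 2), n_mels + 2)
    edges = 700 * (10 ** (mel_edges / 2595) - 1)         # band edges back in Hz
    fb = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):                              # triangular mel filters
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        fb[m] = np.clip(np.minimum((freqs - lo) / (ctr - lo),
                                   (hi - freqs) / (hi - ctr)), 0, None)
    logmel = np.log(fb @ spec + 1e-10)                   # log-domain auditory spectrum
    return dct(logmel, norm="ortho")[:n_ceps]            # DCT = orthogonalization
```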

Long-Duration Features (Ellis & Lee '04)
[figure: average and deviation of linear energy, log energy, and per-band spectral entropy over a full day; freq / bark vs. time / min]
• Capture both average and variation
• Capture a little more detail in subbands

Spectral Entropy
• Auditory spectrum:  A[n, j] = Σ_{k=0}^{NF} w_jk X[n, k]
• Spectral entropy ≈ 'peakiness' of each band:
  H[n, j] = − Σ_{k=0}^{NF} ( w_jk X[n, k] / A[n, j] ) · log ( w_jk X[n, k] / A[n, j] )
[figure: FFT spectral magnitude, auditory spectrum (energy / dB), and per-band spectral entropies (rel. entropy / bits) vs. freq / Hz]
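A sketch (mine) of the per-band spectral entropy just defined: X is a (frames x fft_bins) magnitude-squared spectrogram and W is a (bands x fft_bins) matrix of band weights (e.g. Bark-spaced); the array names and shapes are my own choices.

```python
import numpy as np

def band_spectral_entropy(X, W, eps=1e-12):
    """Return the auditory spectrum A[n, j] and its per-band entropy H[n, j]."""
    A = X @ W.T                                   # A[n, j] = sum_k W[j, k] X[n, k]
    H = np.zeros_like(A)
    for j in range(W.shape[0]):
        P = (X * W[j]) / (A[:, [j]] + eps)        # per-bin share of band j's energy
        H[:, j] = -np.sum(P * np.log(P + eps), axis=1)
    return A, H
```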

BIC Segmentation (Chen & Gopalakrishnan '98)
• BIC (Bayesian Information Criterion) compares models:
  log [ L(X1; M1) L(X2; M2) / L(X; M0) ]  ≷  (λ/2) log(N) Δ#(M)
[figure: average log auditory spectrum and BIC score over an afternoon recording (13:30-16:00); a candidate boundary splits the current context into X1 (model M1) and X2 (model M2) versus the whole context X (model M0); a boundary is declared where the BIC score passes 0, otherwise the context is shortened]

BIC Segmentation Results
• 62 hr hand-marked dataset: 8 days, 139 segments, 16
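A sketch (my own) of the BIC comparison above, using full-covariance Gaussian segment models; the penalty term and regularization constant are illustrative choices, not the exact implementation evaluated here.

```python
import numpy as np

def bic_split_score(X, t, lam=1.0):
    """Score for placing a boundary at frame t of X (frames x dims); positive favours a split."""
    def fit_loglik(Z):
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        # max log-likelihood of a single Gaussian fit, up to terms that cancel below
        return -0.5 * len(Z) * np.linalg.slogdet(cov)[1]
    n, d = X.shape
    dparams = d + 0.5 * d * (d + 1)            # extra mean + covariance parameters for the split
    return (fit_loglik(X[:t]) + fit_loglik(X[t:]) - fit_loglik(X)
            - 0.5 * lam * np.log(n) * dparams)
```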

BIC Segmentation Results
• 62 hr hand-marked dataset: 8 days, 139 segments, 16 categories
• Evaluation measure: Correct Accept % at False Accept = 2%

  Feature             Correct Accept
  μdB                 80.8%
  μH                  81.1%
  σH/μH               81.6%
  μdB + σH/μH         84.0%
  μdB + σH/μH + μH    83.6%
  mfcc                73.6%

[Figure: ROC curves (sensitivity vs. 1 − specificity) for the feature combinations above.]

Segment Clustering
• Daily activity has lots of repetition: automatically cluster similar segments
  'affinity' of segments computed as KL2 (symmetrized KL) distances
[Figure: clustering of segments by pairwise affinity, grouped by hand-labeled category; labels unrecoverable from extraction.]
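One way to realize the KL2 segment affinity (a sketch; it assumes each segment is summarized by a single full-covariance Gaussian over its frame features, a common choice rather than necessarily the talk's exact model):

```python
import numpy as np

def kl_gauss(mu0, cov0, mu1, cov1):
    """KL divergence KL(N0 || N1) between two multivariate Gaussians."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.linalg.slogdet(cov1)[1] - np.linalg.slogdet(cov0)[1])

def kl2_affinity(segments, sigma=10.0):
    """segments: list of (n_frames, d) feature arrays. Returns an affinity matrix
    exp(-KL2 / sigma), where KL2 is the symmetrized KL between segment Gaussians."""
    stats = [(s.mean(axis=0), np.cov(s, rowvar=False) + 1e-6 * np.eye(s.shape[1]))
             for s in segments]
    n = len(stats)
    A = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            kl2 = kl_gauss(*stats[i], *stats[j]) + kl_gauss(*stats[j], *stats[i])
            A[i, j] = A[j, i] = np.exp(-kl2 / sigma)
    return A
```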

U•S•V SVD components: uk•skk•vk Affinity Matrix 900 800 800 600 700 400 600 200 k=1 k=2 k=3 k=4 500 400 800 300 600 200 400 100 200 200 400 600 800 200 400 600 800 200 400 600 800 eigenvectors vk give cluster memberships • Number of clusters? Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 15 /35 Source: http://www.doksinet Clustering Results • Clustering of automatic segments gives ‘anonymous classes’ BIC criterion to choose number of clusters make best correspondence to 16 GT clusters • Frame-level scoring gives ~70% correct errors when same ‘place’ has multiple ambiences Analysis of Everyday Sounds - Ellis & Lee 2007-07-24 p. 16 /35 Source: http://www.doksinet Browsing Interface / Diary interface • Browsing links to other information (diary, email, photos) synchronize with note taking? (Stifelman & Arons) audio thumbnails • Release Tools + “how to” for capture !"#!! !!(D!%D&$

Clustering Results
• Clustering of automatic segments gives 'anonymous classes'
  BIC criterion to choose the number of clusters
  make best correspondence to 16 ground-truth clusters
• Frame-level scoring gives ~70% correct
  errors occur when the same 'place' has multiple ambiences

Browsing Interface / Diary Interface
• Browsing links to other information (diary, email, photos)
  synchronize with note taking? (Stifelman & Arons)
  audio thumbnails
• Release tools + a "how to" for capture
[Figure: diary-style browsing display, days (columns) vs. time of day (rows), each block labeled with its clustered activity class; labels garbled in extraction.]

3. Special-Purpose Detectors: Speech
• Speech emerges as the most interesting content
• Just identifying speech would be useful; the goal is speaker identification / labeling
• Lots of background noise: conventional Voice Activity Detection is inadequate
• Insight: listeners detect the pitch track (melody)
  look for voice-like periodicity in noise
[Figure: spectrogram of a coffeeshop excerpt, 0-4 kHz over about 4.5 s.]

Voice Periodicity Enhancement (Lee & Ellis '06)
• Noise-robust subband autocorrelation
• Subtract the local average: suppresses steady background, e.g. machine noise
  15 min test set: 88% accuracy (79% without suppression)
  also useful for enhancing speech by harmonic filtering
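A rough sketch of background-compensated periodicity detection (a full-band simplification of the subband autocorrelogram idea; the frame sizes, pitch-lag range, and running-average context are assumed values, not the talk's):

```python
import numpy as np

def framewise_autocorr(x, sr, frame=1024, hop=512):
    """Normalized short-time autocorrelation, one row per frame."""
    n_lags = frame // 2
    rows = []
    for i in range(0, len(x) - frame, hop):
        w = x[i:i + frame] * np.hanning(frame)
        ac = np.correlate(w, w, mode='full')[frame - 1:frame - 1 + n_lags]
        rows.append(ac / (ac[0] + 1e-12))
    return np.array(rows)

def voice_periodicity(x, sr, f_lo=80.0, f_hi=400.0, context=50):
    """Periodicity score per frame after subtracting a local (running) average of the
    autocorrelation over time, which suppresses steady background periodicities."""
    A = framewise_autocorr(x, sr)
    kernel = np.ones(2 * context + 1) / (2 * context + 1)
    local_avg = np.apply_along_axis(lambda c: np.convolve(c, kernel, 'same'), 0, A)
    A_comp = A - local_avg
    lag_lo, lag_hi = int(sr / f_hi), int(sr / f_lo)
    return A_comp[:, lag_lo:lag_hi].max(axis=1)    # high when voice-like pitch is present
```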

Detecting Repeating Events (Ogle & Ellis '07)
• Recurring sound events can be informative: they indicate similar circumstances
  but: define "event" - sound organization
  define "recurring event" - how similar?
  ... and how to find them - is it tractable?
• Idea: use hashing (fingerprints)
  the index points to other occurrences of each hash; the intersection of hashes points to a match: much quicker search
  use a fingerprint insensitive to background?

Shazam Fingerprints (A. Wang '06)
• Prominent spectral onsets are landmarks; relations {f1, f2, Δt} between pairs of landmarks are used as hashes
[Figure: spectrogram of a phone ring with its Shazam fingerprint landmarks, 0-4 kHz over about 3 s.]
• Intrinsically robust to background noise
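A toy version of landmark hashing for repeat detection (a sketch under simplifying assumptions: crude local-maximum peak picking, a fixed pairing window, and unquantized hash keys; not Shazam's or the paper's actual parameters):

```python
import numpy as np
from collections import defaultdict

def spectrogram_peaks(S, thresh_db=20.0):
    """Pick (frame, bin) cells that dominate their 3x3 neighborhood of a dB spectrogram S."""
    peaks = []
    for t in range(1, S.shape[0] - 1):
        for f in range(1, S.shape[1] - 1):
            patch = S[t - 1:t + 2, f - 1:f + 2]
            if S[t, f] == patch.max() and S[t, f] > S.max() - thresh_db:
                peaks.append((t, f))
    return peaks

def landmark_hashes(peaks, max_dt=32, max_pairs=3):
    """Pair each peak with a few later peaks; key = (f1, f2, dt), value = anchor time t1."""
    hashes = defaultdict(list)
    for i, (t1, f1) in enumerate(peaks):
        partners = [(t2, f2) for (t2, f2) in peaks[i + 1:] if 0 < t2 - t1 <= max_dt]
        for t2, f2 in partners[:max_pairs]:
            hashes[(f1, f2, t2 - t1)].append(t1)
    return hashes

def matching_offsets(query_hashes, db_hashes):
    """Histogram of time offsets between query and database occurrences of shared hashes;
    a strong mode in this histogram indicates a repeat of the same event."""
    offsets = defaultdict(int)
    for h, q_times in query_hashes.items():
        for t_db in db_hashes.get(h, []):
            for t_q in q_times:
                offsets[t_db - t_q] += 1
    return offsets
```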

Exhaustive Search for Repeats
• More selective hashes: few hits are required to confirm a match (faster; better precision)
  but less robust to background (reduced recall)
• Works well when exact structure repeats: recorded music, electronic alerts
  no good for "organic" sounds, e.g. a garage door
[Figure: repeated sound events found by exhaustive search over a long recording; query event time plotted against match times gives a 'self'-match diagonal, with off-diagonal points for genuine repeats such as telephone rings.]


Music Detector (Lee & Ellis '08)
• Two characteristic features for music:
  strong, sustained periodicity (notes)
  clear, rhythmic repetition (beat)
  at least one should be present!
[Diagram: music audio feeds a pitch-range subband autocorrelation with a local stability measure, and a rhythm-range envelope autocorrelation with a perceptual rhythm model; the two streams feed a fused music classifier.]
• Noise-robust pitch detector looks for high-order autocorrelation
• Beat tracker ... from Music IR work
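As a small illustration of the rhythm half of such a detector, here is a sketch that measures periodicity of a spectral-flux onset envelope in a beat-rate lag range; this is a simple stand-in for the slide's perceptual rhythm model, with assumed parameters throughout:

```python
import numpy as np

def onset_envelope(x, sr, n_fft=1024, hop=512):
    """Half-wave rectified spectral flux: a crude onset-strength envelope."""
    win = np.hanning(n_fft)
    frames = np.array([x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    flux = np.maximum(np.diff(mag, axis=0), 0).sum(axis=1)
    return flux, sr / hop                        # envelope and its frame rate

def beat_strength(x, sr, bpm_lo=60, bpm_hi=180):
    """Peak of the normalized envelope autocorrelation inside the beat-period range."""
    env, fps = onset_envelope(x, sr)
    env = env - env.mean()
    ac = np.correlate(env, env, mode='full')[len(env) - 1:]
    ac = ac / (ac[0] + 1e-12)
    lag_lo = int(fps * 60.0 / bpm_hi)            # shortest beat period of interest
    lag_hi = int(fps * 60.0 / bpm_lo)            # longest beat period of interest
    return ac[lag_lo:lag_hi].max()
```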

4. Generic Concept Detectors (Chang, Ellis et al. '07)
• Consumer video application: how to assist browsing?
  system automatically tags recordings
  tags chosen by usefulness and feasibility
• Initial set of 25 tags defined: "animal", "baby", "cheer", "dancing", ...
  human annotation of 1300+ videos
  evaluate by average precision
• Multimodal detection: separate audio + visual low-level detectors (then fused)

MFCC Covariance Representation
• Each clip/segment reduced to fixed-size statistics: similar to speaker ID and music genre classification
• Full covariance matrix of the clip's MFCCs maps the kinds of spectral shapes present
[Figure: soundtrack spectrogram of clip VTS_04_0001, its MFCC features over time, and the resulting 20×20 MFCC covariance matrix.]
• Clip-to-clip distances for an SVM classifier, by KL or 2nd Gaussian model
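A sketch of how such clip-level Gaussian statistics can feed an SVM (assumptions: a symmetrized KL distance between single full-covariance Gaussians and an exponential kernel; the distance and kernel choices in the original work may differ):

```python
import numpy as np
from sklearn.svm import SVC

def clip_gaussian(mfcc_frames):
    """Summarize one clip by the mean and full covariance of its MFCC frames."""
    return mfcc_frames.mean(axis=0), np.cov(mfcc_frames, rowvar=False)

def sym_kl(g0, g1):
    """Symmetrized KL divergence between two Gaussian clip models."""
    (m0, c0), (m1, c1) = g0, g1
    d = len(m0)
    i0, i1 = np.linalg.inv(c0), np.linalg.inv(c1)
    diff = m1 - m0
    return 0.5 * (np.trace(i1 @ c0) + np.trace(i0 @ c1)
                  + diff @ (i0 + i1) @ diff - 2 * d)

def kl_kernel(clips, gamma=0.01):
    """Precomputed kernel matrix exp(-gamma * KL2) over a list of clip feature arrays."""
    gs = [clip_gaussian(c) for c in clips]
    return np.array([[np.exp(-gamma * sym_kl(a, b)) for b in gs] for a in gs])

# usage sketch: svm = SVC(kernel='precomputed').fit(kl_kernel(train_clips), labels)
```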

GMM Histogram Representation
• Want a more 'discrete' description
  ... to accommodate nonuniformity in MFCC space
  ... to enable other kinds of models
• Divide up the feature space with a single global Gaussian Mixture Model
  ... then represent each clip by the histogram of components used
[Figure: global GMM components plotted in MFCC(0)-MFCC(1) space, and a per-category histogram of mixture-component counts.]
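A sketch of the global-GMM histogram representation (using scikit-learn's GaussianMixture; the component count and the use of soft posteriors rather than hard assignments are illustrative choices):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_global_gmm(clips, n_components=16):
    """Fit one GMM to the pooled MFCC frames of all clips."""
    pooled = np.vstack(clips)                            # (total_frames, n_mfcc)
    return GaussianMixture(n_components=n_components,
                           covariance_type='full').fit(pooled)

def clip_histogram(gmm, mfcc_frames):
    """Represent a clip by the (normalized) occupancy of each global GMM component."""
    post = gmm.predict_proba(mfcc_frames)                # (n_frames, n_components)
    hist = post.sum(axis=0)
    return hist / hist.sum()
```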

Latent Semantic Analysis (LSA)
• Probabilistic LSA (pLSA, Hofmann '99) models each histogram as a mixture of several 'topics'
  ... each clip may have several things going on
• Topic sets optimized through EM:
    p(ftr | clip) = Σ_topics p(ftr | topic) · p(topic | clip)
  i.e. the clip-by-feature matrix of GMM histograms is factored into a clip-by-topic matrix p(topic | clip) and a topic-by-feature matrix p(ftr | topic)
• Use p(topic | clip) as per-clip features
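A minimal pLSA EM sketch on a clip-by-feature count matrix (illustrative; the initialization, smoothing, and stopping criterion are arbitrary choices here):

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """counts: (n_clips, n_features) non-negative matrix (e.g. GMM-component histograms).
    Returns p(topic | clip) and p(feature | topic) fit by EM."""
    rng = np.random.default_rng(seed)
    n_clips, n_feats = counts.shape
    p_t_c = rng.random((n_clips, n_topics)); p_t_c /= p_t_c.sum(axis=1, keepdims=True)
    p_f_t = rng.random((n_topics, n_feats)); p_f_t /= p_f_t.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each (clip, feature) cell
        joint = p_t_c[:, :, None] * p_f_t[None, :, :]        # (clips, topics, feats)
        resp = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: reweight responsibilities by the observed counts
        weighted = counts[:, None, :] * resp                  # (clips, topics, feats)
        p_f_t = weighted.sum(axis=0)
        p_f_t /= p_f_t.sum(axis=1, keepdims=True) + 1e-12
        p_t_c = weighted.sum(axis=2)
        p_t_c /= p_t_c.sum(axis=1, keepdims=True) + 1e-12
    return p_t_c, p_f_t
```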

AP Audio-Only Results
[Figure: per-concept average precision for 1G + KL, 1G + Mahalanobis, GMM histogram + pLSA, and a guessing baseline, across the 25 concepts.]
• Wide range of results: 'audio' concepts (music, ski) vs. non-audio (group, night)
  large AP uncertainty on infrequent classes

How does it 'feel'?
• Browser impressions: how wrong is wrong?
• Top 8 hits for "Baby"

Confusion Analysis
• Where are the errors coming from?
[Figure: (a) overlapped manual labels and (b) confusion matrix of classified labels over the 25 concepts.]

Fused Results - AV Joint Boosting
• Audio helps in many classes
[Figure: per-concept AP and MAP for a random baseline, video only, audio only, and A+V fusion across the 25 concepts.]

5. Future: Temporal Focus
• Global vs. local class models
  tell-tale acoustics may be 'washed out' in clip-level statistics
  try iterative realignment of HMMs
[Figure: spectrogram of a YouTube "baby" clip with voice / baby / laugh labels; the old way has all frames contribute to the class model, the new way limits each label to its temporal extent, with a "background" (bg) model shared by all clips.]

Handling Sound Mixtures
• MFCCs of mixtures ≠ mix of MFCCs: recognition despite widely varying background?
  factorial models / Non-negative Matrix Factorization
  sinusoidal / landmark techniques
[Figure: spectrograms of a solo voice and a male+female voice mixture, comparing the original, MFCC resynthesis, and MFCC + noise resynthesis.]
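For the NMF option mentioned above, a minimal multiplicative-update sketch that factors a magnitude spectrogram into spectral templates and activations (generic KL-style NMF, not a specific separation system from the talk):

```python
import numpy as np

def nmf(V, n_components, n_iter=200, seed=0):
    """Factor a non-negative magnitude spectrogram V (n_bins, n_frames) as W @ H,
    W: spectral templates (n_bins, k), H: activations (k, n_frames)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components)) + 0.1
    H = rng.random((n_components, V.shape[1])) + 0.1
    for _ in range(n_iter):
        # multiplicative updates for the generalized KL divergence
        H *= (W.T @ (V / (W @ H + 1e-9))) / (W.sum(axis=0)[:, None] + 1e-9)
        W *= ((V / (W @ H + 1e-9)) @ H.T) / (H.sum(axis=1) + 1e-9)
    return W, H
```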

Larger Datasets
• Many detectors are visibly data-limited
  getting data is hard; labeling data is expensive
• Bootstrap from YouTube etc.
  lots of web video is edited/dubbed: need a "consumer video" detector?
• Preliminary YouTube results disappointing
  downloaded data needed extensive clean-up
  models did not match the Kodak data
• (Freely available data!)

Conclusions
• Environmental sound contains information
  ... that's why we hear!
  ... computers can hear it too
• Personal audio can be segmented and clustered
  find specific sounds to help navigation/retrieval
• Consumer video can be 'tagged'
  ... even in unpromising cases audio is complementary to video
• Interesting directions for better models

References
• D. Ellis and K. S. Lee, "Minimal-Impact Audio-Based Personal Archives," First ACM Workshop on Continuous Archiving and Recording of Personal Experiences (CARPE-04), New York, Oct. 2004, pp. 39-47.
• S. Chen and P. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion," Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.

• A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Advances in Neural Information Processing Systems (NIPS), MIT Press, Cambridge MA, 2001.
• K. Lee and D. Ellis, "Voice Activity Detection in Personal Audio Recordings Using Autocorrelogram Compensation," Proc. Interspeech ICSLP-06, Pittsburgh, Oct. 2006, pp. 1970-1973.
• J. Ogle and D. Ellis, "Fingerprinting to Identify Repeated Sound Events in Long-Duration Personal Audio Recordings," Proc. ICASSP-07, Hawaii, 2007, pp. I-233-236.
• A. Wang, "The Shazam music recognition service," Communications of the ACM, 49(8):44-48, Aug. 2006.
• K. Lee and D. Ellis, "Detecting Music in Ambient Audio by Long-Window Autocorrelation," Proc. ICASSP-08, Las Vegas, April 2008, pp. 9-12.
• S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, and J. Luo, "Large-scale multimodal semantic concept detection for consumer video," Multimedia Information Retrieval Workshop, ACM Multimedia, Augsburg, Germany, Sep. 2007, pp. 255-264.

• T. Hofmann, "Probabilistic latent semantic indexing," Proc. SIGIR, 1999.