Thanks, Malcolm, and thank you and Josh for scheduling this presentation. It's great that everyone's here; the talks have been really great, and I've been making notes to cross-correlate with what everyone is talking about. In our work we are interested in speech signals, the raw data being, from the sensor side, the microphone: the audio signal sampled at a given rate. We use a parametric model called the source-filter model, which has been around for a while, and what we're extending it to do is joint source-filter modeling using wavelets, which we term flexible basis functions. This is work with Daniel Rudoy and Patrick Wolfe at Harvard.

Where we fit in the session: we've seen that representations of speech can come in various domains, including machine learning from the raw data, which we just looked at, and cortical and auditory-like representations that are very appropriate for perceptually based performance. We, however, are focused on the opposite side, production-based models, and I'll tell you where the motivation comes from: the clinic. Along those lines, I'll describe what the model is, then look at how we characterize and parameterize it and in which subspace; other speakers have talked about subspace representations. I won't go into the higher-level, auditory-like representations that were just discussed, but the summary statistics that were mentioned are actually relevant, because we use a noise model, which is a good segue into what the model is for speech.

LPC coefficients, or the way they were developed, use a model where there is a stochastic input, such as white Gaussian noise, here this w(n) term, that drives a linear filter.
That filter gives the resonances of the vocal tract, and we'll restrict ourselves for the moment to all-pole models of speech. In reality, though, there is a mismatch between that model and the source that we actually see in speech: from the production side, the source is both deterministic and stochastic. If we just had the white-noise term, the model would be appropriate for certain speech, but if we want to analyze both voiced and unvoiced speech, then we need another term, and we need to understand how to estimate this second term, which we call u(n) (sorry, it's a little low on the slide). So everything is the same except this added component, which will fit the deterministic, non-stochastic part that comes from the glottal pulses; those may be periodic, but they don't have to be. How we estimate this, and in which subspace this representation of the voiced part of speech lies, is the question.

To preempt a question: we're assuming clean speech. In some cases we can get that. My PhD research was at Massachusetts General Hospital in the voice clinic, where we were able to record clean speech in a very stable environment, in a sound booth, to analyze and compare features directly to vocal fold characteristics. So that is a limitation in real environments.

Now, simply including that extra parameter is not new. This is the ARX idea, an exogenous input x(n) in the AR model, and it has been looked at with parametric models such as the LF model, which is a parameterization of the glottal flow derivative, the airflow pulse, in addition to other linear models. But both of these require good knowledge of when the source pulses happen, when the vocal folds open and close. What we're proposing is to extend that model and find a subspace.
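The two-source model just described can be sketched in a few lines of Python; the sample rate, pitch, filter coefficients, and noise level below are illustrative choices, not values from the talk.

```python
import numpy as np

# Mixed-excitation source-filter model:
#   s[n] = a1*s[n-1] + a2*s[n-2] + u[n] + w[n]
# u[n]: deterministic glottal-pulse train; w[n]: white Gaussian noise.

fs = 8000                              # sample rate (Hz), illustrative
N = 1024
u = np.zeros(N)
u[::fs // 100] = 1.0                   # impulse train at a 100 Hz pitch

rng = np.random.default_rng(0)
w = 0.05 * rng.standard_normal(N)      # stochastic source

# all-pole filter with a single resonance near 500 Hz (toy vocal tract)
r, f_res = 0.97, 500.0
a1 = 2 * r * np.cos(2 * np.pi * f_res / fs)
a2 = -r * r

s = np.zeros(N)
for i in range(N):
    s[i] = u[i] + w[i]
    if i >= 1:
        s[i] += a1 * s[i - 1]
    if i >= 2:
        s[i] += a2 * s[i - 2]
# s now contains a voiced-speech-like signal with both source components
```

Setting u to zero recovers the purely stochastic LPC model; keeping both terms gives the mixed model whose second component the talk sets out to estimate.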
That subspace should not be limited or restricted to needing these time-based estimates. The solution our group presented last year was to use wavelets, these g(n) functions: we have a bunch of basis functions with weighting coefficients beta, and simply summing those will equal this non-stochastic voicing component. This is robust to variations in fundamental frequency: pitch glides and the irregular pitch periods that we see in certain disordered voices in the clinic. And with wavelets being time-localized, we can shift the bases in time and in frequency without knowing a priori some of the source properties.

I alluded to this in the beginning: why do we care? The data I typically deal with comes from clinical applications. Voice assessment informs speech pathologists and surgeons who do their work on the vocal folds and want to see, from the production side, how a certain technique will affect a given feature. In addition, there are source characteristics that are important to look at for health, emotional state, and speaker identity, such as stature.

So how do we do this subspace selection? I'll just put up one set of equations. All I'm showing here is the least squares solution to the problem I showed before. In the LPC case, the stochastic-only case, this boils down to the classic model where you get the LPC coefficients. What we have extra is this G matrix, which holds the subspace of the wavelets that will model the voicing on a given segment, plus the Gaussian noise of variance sigma squared. What I wanted to show you is the critical issue: if we simply estimate the wavelet basis on all the samples, say you have 256 samples of the raw speech waveform and you want to estimate 256 basis functions to parameterize the signal.
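The least-squares setup described here can be sketched as follows; the rectangular-pulse dictionary G below is only a stand-in for the wavelet basis, and all sizes and values are illustrative.

```python
import numpy as np

# Joint least-squares estimate of AR coefficients a and basis weights beta:
# stack the lagged-sample regressors and the basis matrix G side by side
# and solve for [a; beta] in one shot.

rng = np.random.default_rng(1)
N, p = 256, 2                             # frame length, AR order

# toy dictionary: 32 shifted 8-sample rectangular pulses (wavelet stand-in)
G = np.zeros((N, 32))
for j in range(32):
    G[8 * j:8 * j + 8, j] = 1.0

beta_true = np.zeros(32)
beta_true[[4, 12, 20]] = 1.0              # a few active atoms
a_true = np.array([1.6, -0.8])            # stable AR(2) filter

# synthesize: s[n] = a1*s[n-1] + a2*s[n-2] + (G @ beta)[n] + noise
e = G @ beta_true + 0.01 * rng.standard_normal(N)
s = np.zeros(N)
for i in range(N):
    s[i] = e[i]
    if i >= 1:
        s[i] += a_true[0] * s[i - 1]
    if i >= 2:
        s[i] += a_true[1] * s[i - 2]

# regression matrix [lagged samples | G], then ordinary least squares
P = np.column_stack([np.r_[np.zeros(k + 1), s[:N - k - 1]] for k in range(p)])
X = np.hstack([P, G])
theta, *_ = np.linalg.lstsq(X, s, rcond=None)
a_hat, beta_hat = theta[:p], theta[p:]
```

With 256 observations and 34 unknowns this system is comfortably overdetermined; the conditioning problem the talk raises next appears as G grows toward a full 256-column basis.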
Estimated that way, the problem becomes ill-conditioned because of the matrix inversion that we see here. We address that conditioning problem in this paper by thresholding the wavelets, so we don't take all of the wavelet functions in this space. We propose three (sorry for the folks in the back, this is small) wavelet shrinkage algorithms, which simply say: we don't need 256 wavelets, so how many do we need, 32, 64? That threshold can be specified using iterative methods: by hard thresholding of the wavelet coefficients, by soft thresholding, or by joint estimation of the parameters if we want to use the top K wavelet coefficients; and finally we present another optimization technique using an L1 norm as the criterion. I'm not going to go into the details here, because I want to give an intuition of what some of the performance issues are, as well as the results; the paper goes into detail about the algorithms themselves.

So what do we get? What I'm showing here is synthesized speech (I'll also show you a real example). When we synthesize with a given source, we know the pitch, and we can time-vary the pitch and see its effect on the various algorithms. The proposed algorithm uses the hard thresholding approach; "hard-64" keeps the top 64 wavelets, a second step after hard thresholding. What I want you to focus on first is the last output, which is the residual, the AR residual. That gives us the source waveform that we can parameterize to find speech voicing properties, and it includes both components, the stochastic and the non-stochastic parts. What our algorithm does is separate those in a principled manner to begin with. You could also do the deterministic/stochastic separation on the AR residual afterwards.
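Hard thresholding and top-K selection, two of the shrinkage rules just listed, are easy to state as plain functions over a coefficient vector; the threshold and K below are illustrative, not the paper's settings.

```python
import numpy as np

def hard_threshold(beta, thresh):
    """Zero out coefficients whose magnitude falls below thresh."""
    out = beta.copy()
    out[np.abs(out) < thresh] = 0.0
    return out

def top_k(beta, k):
    """Keep only the k largest-magnitude coefficients, zero the rest."""
    out = np.zeros_like(beta)
    idx = np.argsort(np.abs(beta))[-k:]   # indices of the k largest magnitudes
    out[idx] = beta[idx]
    return out

beta = np.array([0.05, -2.0, 0.3, 1.1, -0.02, 0.7])
bh = hard_threshold(beta, 0.5)            # keeps -2.0, 1.1, 0.7
bk = top_k(beta, 2)                       # keeps -2.0, 1.1
```

Soft thresholding would additionally shrink the surviving coefficients toward zero by the threshold amount, and the L1-norm variant folds the same sparsity pressure into the joint optimization itself.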
But we do this separation as part of the model, and we get a better root mean square error almost everywhere because of it; more importantly, the source property estimation is better and the filter property estimation is better. The classic case is when you have a filter with given formant values: if you just use the AR model, there is a bias in the formant frequencies, which I'm showing in the last curve, the bottom curve. In the AR residual and the LPC coefficients, when you look at the spectrum, there is a bias in F2 and F3 relative to the ground truth, and our algorithm handles that by putting the proper energy in the source and in the filter, in a way solving the separation problem, the inverse filtering problem.

The paper also goes into one theoretical observation, where we derive the Cramér-Rao lower bounds of this estimator, the lower bounds on its variance. In this toy example we have three parameters: we took a two-pole filter with one resonance, and that resonance sits at 2 kHz; and we swept the source sinusoid, so the input was a sinusoid swept over frequency, from zero to eight kHz, as you see here. The estimators are for the LPC coefficients a1 and a2 and for the beta coefficient of the sinusoid, and the bound gives us their variance. What we see, as expected, is that we are able to estimate the filter pretty well across the bandwidth, but the source properties cannot be estimated well when the source overlaps the center frequency. This is the classic problem of sampling the spectrum at the center frequency: as the sinusoid lines up with the resonance, we get an increase in the lower bound of that estimator. So this is the example I was talking about.
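The observation that the source becomes hard to estimate when it lines up with the resonance can be illustrated loosely (not via the actual Cramér-Rao derivation) by checking the conditioning of the joint least-squares problem for a sinusoidal source on versus off a 2 kHz resonance; all values below are illustrative.

```python
import numpy as np

def design_cond(f_src, fs=8000, N=512, seed=0):
    """Condition number of the joint [lagged samples | sinusoid] design
    matrix for an AR(2) filter with a resonance at 2 kHz, driven by a
    sinusoid at f_src plus a little white noise."""
    rng = np.random.default_rng(seed)
    r, f_res = 0.98, 2000.0
    a1 = 2 * r * np.cos(2 * np.pi * f_res / fs)
    a2 = -r * r
    n = np.arange(N)
    src = np.sin(2 * np.pi * f_src * n / fs)
    e = src + 0.1 * rng.standard_normal(N)
    s = np.zeros(N)
    for i in range(N):
        s[i] = e[i]
        if i >= 1:
            s[i] += a1 * s[i - 1]
        if i >= 2:
            s[i] += a2 * s[i - 2]
    # columns: s[n-1], s[n-2], and the known sinusoidal source
    X = np.column_stack([np.r_[0.0, s[:-1]], np.r_[0.0, 0.0, s[:-2]], src])
    return np.linalg.cond(X)

cond_on = design_cond(2000.0)    # source frequency on the resonance
cond_off = design_cond(3200.0)   # source frequency away from the resonance
```

When the sinusoid sits on the resonance, the lagged-output columns are nearly sinusoids at the same frequency, so the design matrix becomes close to rank-deficient and the variance floor of the source estimate rises, mirroring the bound behavior described in the talk.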
We can also show it in terms of bandwidth, the best that we can do as the bandwidth of the resonance of the filter varies. Here we fixed the frequency of the sinusoid at 2 kHz and varied the bandwidth, and we show that the estimator variance, as expected, gets better as the bandwidth gets smaller, so that the filter and the source are easier to discriminate. What this is showing is that the problem gets more and more difficult when F0, the fundamental frequency of the source, crosses F1, for example, in real speech.

So that gives you a high-level overview of what we're doing. The next steps, on the technical side, are to do level-dependent thresholding, not just hard thresholding, and other ways of penalizing the wavelet coefficients to get a better representation of the source. Here's an example of natural speech, where I'll show you the waveform of a glottalized voice quality. That sounds like an aperiodic source, and methods can be sensitive to that, because finding the exact impulse times is difficult when things are aperiodic. So the wavelets, not really caring about what the period is, can grab those impulses, and then we can find features from the reconstructed source.

Finally (I'm sorry this is low on the screen), here is an example of one of the clinical applications. There have been advances in high-speed videoendoscopy of the larynx, and we get to work with these videos and relate them to the source features that we estimate, theoretically on synthesized vowels and real speech, but also by looking down the throat or estimating airflow. This is our next step: to see whether these features
actually relate to what we can measure in human subjects. As you can see, this is not an easy setup, collecting airflow in addition to endoscopic views of the larynx. So that's what I had, and I welcome any questions. Thank you.

I can also repeat the question; a microphone may be coming.

Question: Thank you for your talk. My question relates to the main difference between your approach and what is done in CELP coding, where you also have LPC and a sum of, say, dictionary atoms as the source. Can you tell us the main difference?

Answer: Sure. In code-excited linear prediction you have a dictionary of sources (I'll go back to that slide) that you can reuse, and in that respect CELP is also not limited simply to noise, so it's similar. What we're doing is then parameterizing that codebook, each one of the entries in the codebook, to find features that relate to the source, instead of just the raw source. In CELP you have the raw samples, and in your cellphone processor the output sounds natural, so the aim there is a natural-sounding reconstruction; we're interested, once we get the parameters, in deriving features from those parameters. But that's a good point.

Question: Let me ask a question. It sounds like this would do better in noise than regular LPC, wouldn't it?

Answer: It depends on the noise, I guess; there's the question of when it would do better and when worse. My master's thesis actually found that we can extend this model one step further, because the noise in voiced speech has been observed to be modulated: it's not actually white Gaussian noise, but can be modulated white Gaussian noise. So even this model is
an approximation. But it's always a tradeoff: how much do we care about the modulations in the noise or not? Where it may not perform as well: this model puts the noise function before the filter, as in a speech production model. If the noise is reverberation, or a car going past, that's not filtered by the same transfer function, so there is a larger model mismatch there.

Great, that's great. Thank you.