Thank you, Malcolm, and thanks to you and Josh for scheduling this presentation. It's great that everyone's here; the talks have been really great, and I've been making notes to cross-correlate with what everyone else is talking about.
In our work we are interested in speech signals. The raw data come from a sensor specific to speech: a microphone, giving an audio signal sampled at a given rate. We use a parametric model called the source-filter model, which has been around for a while; what we're extending it to do is joint source-filter modeling using wavelets, which we term flexible basis functions. This is joint work with Daniel Rudoy and Patrick Wolfe at Harvard.
Where we fit into the session: we've seen that representations of speech can come in various domains, from machine learning on the raw data, which we just looked at, to cortical and auditory-like representations that would be very appropriate for perceptually based performance. We, however, are focused on the opposite side, on production-based models, and I'll tell you where the motivation comes from: the clinic. Along the lines of what the model is, we then look at how we characterize and parameterize it, and in which subspace; previous speakers have talked about other subspace representations. I won't go into those higher-level auditory-like representations that were just talked about in terms of summary statistics, but they are actually relevant here, because we also use a noise model, which is a good segue into what our model for speech is.
LPC coefficients, or rather how they were developed, come from a model in which a stochastic input such as white Gaussian noise, the term w(n) here, is fed into a linear filter that gives the resonances of the vocal tract; we'll restrict attention for the moment to a simple class of speech. In reality, though, there is a mismatch between that model and the source we actually see in speech: from the production side, the source is both deterministic and stochastic.
So that noise-driven model is appropriate for certain speech, but if we want to analyze both voiced and unvoiced speech, then we need another term, and we need to understand how to estimate it. This second source term we call u(n) — sorry, the slide is a little low. Everything is the same except this added component, which will fit the deterministic, non-stochastic part of the source, so that s(n) = sum_k a_k s(n-k) + u(n) + w(n). That deterministic part comes from the glottal pulses, which may be periodic but don't have to be. How we estimate these is a question of which subspace this representation of the voiced part of speech lies in.
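To make that model concrete, here is a minimal Python sketch of the mixed excitation — a pulse train u(n) plus white Gaussian noise w(n) driving an all-pole filter. The specific values (sampling rate, F0, the single resonance) are illustrative assumptions, not the settings used in the talk.

    import numpy as np
    from scipy.signal import lfilter

    fs = 8000                        # sampling rate in Hz (illustrative)
    N = 256                          # one analysis frame
    rng = np.random.default_rng(0)

    # Deterministic source u(n): glottal-like pulse train at F0 = 125 Hz
    f0 = 125.0
    u = np.zeros(N)
    u[::int(round(fs / f0))] = 1.0

    # Stochastic source w(n): white Gaussian noise
    w = 0.05 * rng.standard_normal(N)

    # All-pole "vocal tract": a single resonance near 800 Hz
    r, fc = 0.97, 800.0
    a = [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r ** 2]
    s = lfilter([1.0], a, u + w)     # s(n) = a1*s(n-1) + a2*s(n-2) + u(n) + w(n)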
Let me preempt a question by saying that we're assuming clean speech. In some cases that holds: my PhD research was at Massachusetts General Hospital, in the voice clinic, where we're able to get clean speech in a very stable environment, in a sound booth, to analyze features and compare them directly to vocal characteristics. So that is a limitation in real environments.
Now, this is not new in terms of simply including that extra parameter; the ARX model, with X as an exogenous input to the AR model, is not new. It has been studied with parametric source models such as the LF model, which is a parameterization of the glottal flow derivative, the airflow pulse, in addition to linear models. But both of these require good knowledge of when the source pulses happen — when the vocal folds open and close.
What we are proposing is to extend that model and find a subspace that is not limited or restricted to needing these time-based estimates.
The solution our group presented last year was to use wavelets, these g functions g(n): a bunch of basis functions with weighting coefficients beta, such that simply summing them equals this non-stochastic voicing component, u(n) = sum_i beta_i g_i(n). This is robust to variations in fundamental frequency, to pitch glides, and to the irregular pitch periods we see in certain disordered voices in the clinic. And with wavelets being time-localized, we can shift the bases in time and in frequency without knowing some of the source properties a priori.
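As one concrete, simplified instance of such a dictionary, here is a self-contained way to build an orthonormal Haar wavelet matrix G whose columns are the time- and scale-localized functions g_i(n). This is my own minimal construction, not necessarily the wavelet family used in the paper.

    import numpy as np

    def haar_matrix(n):
        """Orthonormal Haar basis for length n (a power of 2); rows are the
        time- and scale-localized basis functions."""
        if n == 1:
            return np.array([[1.0]])
        h = haar_matrix(n // 2)
        coarse = np.kron(h, [1.0, 1.0])                # scaling functions
        detail = np.kron(np.eye(n // 2), [1.0, -1.0])  # localized differences
        return np.vstack([coarse, detail]) / np.sqrt(2.0)

    G = haar_matrix(256).T        # columns g_i(n): 256 samples x 256 functions
    beta = np.zeros(256)
    beta[40] = 1.0                # one active coefficient, for illustration
    u = G @ beta                  # u(n) = sum_i beta_i * g_i(n)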
I alluded to this at the beginning: why do we care about this? The data I deal with typically come from clinical applications. Voice assessment informs speech pathologists and the surgeons who do their work on the vocal folds and want to see, from the production side, how a certain technique will affect a given feature. In addition, there are characteristics that are important to look at for health, emotional state, and speaker traits such as stature.
So how do we do this subspace selection? I'll show just one set of equations: the least-squares solution to the problem from before. In the LPC case, the AR-only case, this boils down to the classic result where you get the LPC coefficients. What we have extra is this G matrix, which holds the wavelet subspace that will model the voicing on each segment, plus the Gaussian noise of variance sigma^2. The reason I wanted to show you this is the critical issue: simply estimating the wavelet basis on all the samples — say you have 256 samples of the raw speech waveform and you want to estimate 256 basis functions and parameterize the signal that way — turns into an ill-conditioned problem, because of the matrix inversion you see here. That conditioning problem is what we address in this paper, by thresholding the wavelets so that we don't take all of the wavelet functions in this space.
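In matrix form this is an ordinary least-squares problem in the stacked parameter vector [a; beta]. Here is a hedged sketch under my own notation (frame s, AR order p, basis matrix G as built above); with all 256 columns of G retained, the regressor is rank-deficient, which is exactly the ill-conditioning being described.

    import numpy as np

    def joint_ls(s, G, p):
        """Jointly estimate AR coefficients a (order p) and basis weights beta
        in s(n) = sum_k a_k s(n-k) + (G @ beta)(n) + w(n), by least squares."""
        N = len(s)
        # Lagged-signal columns s(n-1), ..., s(n-p) for n = p, ..., N-1
        S = np.column_stack([s[p - k:N - k] for k in range(1, p + 1)])
        X = np.hstack([S, G[p:, :]])          # joint regressor [S | G]
        # With all 256 basis functions, X is rank-deficient; lstsq returns
        # the minimum-norm solution, but the estimate is poorly determined.
        theta, *_ = np.linalg.lstsq(X, s[p:], rcond=None)
        return theta[:p], theta[p:]           # (a_hat, beta_hat)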
So we propose — sorry for the folks in the back — three wavelet shrinkage algorithms. The idea is simply: we don't need all 256 wavelets, so how many do we need — 32? 64? That threshold can be specified using iterative methods: hard thresholding of the wavelet coefficients; soft thresholding; and joint estimation of the parameters where we keep only a top-K set of wavelet coefficients. Finally, we present another optimization technique using an L1 norm as the criterion. I'm not going into the details here, because I want to give an intuition of what some of the performance issues are, as well as the results; the paper goes into detail about the algorithms themselves.
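For intuition, the core coefficient-selection operations behind those three flavors look roughly like the following; this is a minimal sketch of the standard operators, not the paper's full iterative algorithms.

    import numpy as np

    def hard_threshold(beta, lam):
        """Keep coefficients with magnitude above lam; zero out the rest."""
        return np.where(np.abs(beta) > lam, beta, 0.0)

    def soft_threshold(beta, lam):
        """Shrink magnitudes toward zero by lam (the L1 proximal step)."""
        return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

    def top_k(beta, k):
        """Keep only the k largest-magnitude coefficients (e.g., k = 64)."""
        out = np.zeros_like(beta)
        idx = np.argsort(np.abs(beta))[-k:]
        out[idx] = beta[idx]
        return out

Soft thresholding is the proximal operator of an L1 penalty, which is the connection to the L1-norm criterion just mentioned.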
So what do we get? What I'm showing here is synthesized speech — I'll also show you a real example. When we synthesize with a given source, we know the pitch, we can time-vary the pitch, and we can see its effect on the various algorithms. The proposed algorithm uses the hard-thresholding approach; "hard-64" keeps the top 64 wavelets, as a second step after hard thresholding. What I want you to focus on first is the last output, the AR residual: that is the source waveform that we can parameterize to find the voicing properties of speech. It includes both components, the stochastic and the non-stochastic parts. What our algorithm does is separate those in a principled manner to begin with. You can also do a deterministic-stochastic separation on the AR residual afterwards, but we do the separation as part of the model, and we get a better root-mean-square error almost because of that; more importantly, the source-property estimation is better, and the filter-property estimation is better.
The classic case is when you have a filter with given formant values and you just use the AR model: there is a bias in the formant frequencies, which I'm showing in the last, bottom curve. With the AR residual and the LPC coefficients, when you look at the spectrum there is a bias in F2 and F3 relative to the ground truth. Our algorithm handles that by putting the proper energy in the source and in the filter — in a way, solving the separation problem, the inverse-filtering problem.
The paper also makes one theoretical observation: we derive the Cramér-Rao lower bounds, the lower bounds on the variance of these estimators. We have three parameters in a toy example: we took a two-pole filter with one resonance, at 2 kHz, and we swept a sinusoidal input source over frequency — what you're seeing is from 0 to 8 kHz. The bounds for the estimators of the LPC coefficients a1 and a2, and of the beta coefficient for the sinusoid, give us the variance. What we see, as expected, is that we are able to estimate the filter pretty well across the bandwidth; the source properties, however, are only well estimated when the center frequencies do not overlap. This is the classic problem of sampling the spectrum at the center frequency: as the sinusoid lines up with the resonance, we get an increase in the lower bound of that estimator. We can also show this in terms of the bandwidth. We fixed the resonance of the filter, fixed the frequency of the sinusoid at 2 kHz, and varied the bandwidth; the estimator variance, as expected, gets better as the bandwidth gets smaller, because the filter and the source become easier to discriminate. What this is saying is that the problem gets more and more difficult when F0, the fundamental frequency of the source, crosses F1, for example, in real speech.
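For a linear-in-parameters Gaussian model like this one, the bound can be computed (to a standard asymptotic approximation) as the diagonal of sigma^2 * (J^T J)^(-1), where the columns of J are the derivatives of the model mean with respect to (a1, a2, beta). Here is my own sketch of that toy calculation, not the paper's exact derivation.

    import numpy as np
    from scipy.signal import lfilter

    fs, N, sigma2 = 16000, 512, 1e-4
    r, fc = 0.98, 2000.0                         # one resonance at 2 kHz
    a1, a2 = 2 * r * np.cos(2 * np.pi * fc / fs), -r ** 2
    rng = np.random.default_rng(0)

    def crlb(f0):
        """CRLB diagonal for (a1, a2, beta) with a sinusoidal source at f0."""
        n = np.arange(N)
        x = np.sin(2 * np.pi * f0 * n / fs)      # sinusoidal input source
        w = np.sqrt(sigma2) * rng.standard_normal(N)
        s = lfilter([1.0], [1.0, -a1, -a2], x + w)
        # Columns: d(mean)/d(a1), d(mean)/d(a2), d(mean)/d(beta)
        J = np.column_stack([s[1:-1], s[:-2], x[2:]])
        return np.diag(np.linalg.inv(J.T @ J / sigma2))

    bounds = np.array([crlb(f0) for f0 in np.linspace(50.0, 7950.0, 80)])
    # The beta bound (bounds[:, 2]) peaks sharply as f0 crosses the resonance,
    # reproducing the source-filter identifiability problem described above.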
That gives you a high-level overview of what we're doing. The next steps on the technical side are to use level-dependent thresholding, not just hard thresholding, and other ways of penalizing the wavelet coefficients, to get a better representation of the source.
Here's an example on natural speech, where you see the waveform with a glottalized voice quality — it sounds like the onset of an aperiodic source. Methods that rely on pulse timing will be sensitive to that, because finding the exact input pulse times is difficult when things are aperiodic, whereas the wavelets, not really caring what the period is, can grab those impulses, and then we can find features from the reconstructed source.
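Chaining the earlier sketches together, reconstructing that source component would look roughly like this (joint_ls, top_k, s, and G are the hypothetical helpers and variables defined in the sketches above):

    a_hat, beta_hat = joint_ls(s, G, p=10)   # joint AR + basis fit (sketch above)
    u_hat = G @ top_k(beta_hat, 64)          # source rebuilt from top 64 wavelets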
And finally — sorry, this slide is low — here is an example of one of the clinical applications. There have been advances in high-speed videoendoscopy of the larynx, and I get to work with these videos all day and relate them to the source features that we estimate, theoretically on synthesized vowels and on real ones, but also to direct measurements: looking down the throat, or estimating airflow. That is our next step: to see whether these features actually relate to what we can measure in human subjects. As you can see, this is not an easy setup, collecting airflow in addition to endoscopic views of the larynx.
So that's what I had, and I welcome any questions. Thank you.

I can also repeat the questions; a microphone may be coming.
Q: Thank you for your talk. My question relates to: what is the main difference of your approach compared to what is done in CELP coding, where you also have LPC and a sum of, you know, dictionary atoms as the source for the LPC filter? Can you tell us the main difference?

A: Sure. In code-excited linear prediction you have — let me go back to the model — a dictionary of sources that you can reuse, and in CELP that dictionary is not limited to simply noise. It's similar; what we're doing is then parameterizing that codebook — each one of the entries in the codebook — to find features that relate to the source, rather than keeping just the raw source. In CELP you have the raw samples, and in your cellphone's processor the output is a natural-sounding resynthesis. We're interested, once we get the parameters, in deriving features from those parameters. But yes, that's a good point.
Q: Okay, let me ask a question. It sounds like this would do better with noise than regular LPC — or does it depend on the noise? When would it do better, and when worse?

A: So, my master's thesis actually found that we can extend this model one step further, because the noise in voiced speech has been observed to be modulated: it's not actually white Gaussian noise, but can be modulated white Gaussian noise. So even this model is off; there's always a tradeoff over how much we care about modulations in the noise. Where it may not perform as well: this model puts the noise source before the filter. Given that speech model, if it's reverberation noise, or a car going past, that's not filtered by the same transfer function, so there's a larger model mismatch there.

Q: That's great, thank you.