Thank you, Malcolm, and thanks to you and Josh for scheduling this presentation. It's great that everyone's here; the talks have been really great, and I've been making notes to cross-correlate with what everyone else is talking about.
In our work we are interested in speech signals. The raw data come from a sensor specific to speech: a microphone, giving an audio signal sampled at a given rate. We use a parametric model called the source-filter model, which has been around for a while; what we're extending it to do is joint source-filter modeling using wavelets, which we term flexible basis functions. This is joint work with Daniel Rudoy and Patrick Wolfe at Harvard.
Where we fit into the session: we've seen that representations of speech can come in various domains, from machine learning on the raw data, which we just looked at, to cortical and auditory-like representations that would be very appropriate for perceptually based performance. We, however, are focused on the opposite side, on production-based models, and I'll tell you where the motivation comes from: the clinic. Along the lines of what the model is, we then look at how we characterize and parameterize it, and in which subspace; previous speakers have talked about other subspace representations. I won't go into those higher-level auditory-like representations that were just talked about in terms of summary statistics, but they are actually relevant here, because we also use a noise model, which is a good segue into what our model for speech is.
LPC coefficients, or rather how they were developed, come from a model in which a stochastic input such as white Gaussian noise, the term w(n) here, is fed into a linear filter that gives the resonances of the vocal tract; we'll restrict attention for the moment to a simple class of speech. In reality, though, there is a mismatch between that model and the source we actually see in speech: from the production side, the source is both deterministic and stochastic.
So that noise-driven model is appropriate for certain speech, but if we want to analyze both voiced and unvoiced speech, then we need another term, and we need to understand how to estimate it. This second source term we call u(n) — sorry, the slide is a little low. Everything is the same except this added component, which will fit the deterministic, non-stochastic part of the source, so that s(n) = sum_k a_k s(n-k) + u(n) + w(n). That deterministic part comes from the glottal pulses, which may be periodic but don't have to be. How we estimate these is a question of which subspace this representation of the voiced part of speech lies in.
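To make that model concrete, here is a minimal Python sketch of the mixed excitation — a pulse train u(n) plus white Gaussian noise w(n) driving an all-pole filter. The specific values (sampling rate, F0, the single resonance) are illustrative assumptions, not the settings used in the talk.

    import numpy as np
    from scipy.signal import lfilter

    fs = 8000                        # sampling rate in Hz (illustrative)
    N = 256                          # one analysis frame
    rng = np.random.default_rng(0)

    # Deterministic source u(n): glottal-like pulse train at F0 = 125 Hz
    f0 = 125.0
    u = np.zeros(N)
    u[::int(round(fs / f0))] = 1.0

    # Stochastic source w(n): white Gaussian noise
    w = 0.05 * rng.standard_normal(N)

    # All-pole "vocal tract": a single resonance near 800 Hz
    r, fc = 0.97, 800.0
    a = [1.0, -2 * r * np.cos(2 * np.pi * fc / fs), r ** 2]
    s = lfilter([1.0], a, u + w)     # s(n) = a1*s(n-1) + a2*s(n-2) + u(n) + w(n)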
Let me preempt a question by saying that we're assuming clean speech. In some cases that holds: my PhD research was at Massachusetts General Hospital, in the voice clinic, where we're able to get clean speech in a very stable environment, in a sound booth, to analyze features and compare them directly to vocal characteristics. So that is a limitation in real environments.
Now, this is not new in terms of simply including that extra parameter; the ARX model, with X as an exogenous input to the AR model, is not new. It has been studied with parametric source models such as the LF model, which is a parameterization of the glottal flow derivative, the airflow pulse, in addition to linear models. But both of these require good knowledge of when the source pulses happen — when the vocal folds open and close.
What we are proposing is to extend that model and find a subspace that is not limited or restricted to needing these time-based estimates.
The solution our group presented last year was to use wavelets, these g functions g(n): a bunch of basis functions with weighting coefficients beta, such that simply summing them equals this non-stochastic voicing component, u(n) = sum_i beta_i g_i(n). This is robust to variations in fundamental frequency, to pitch glides, and to the irregular pitch periods we see in certain disordered voices in the clinic. And with wavelets being time-localized, we can shift the bases in time and in frequency without knowing some of the source properties a priori.
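As one concrete, simplified instance of such a dictionary, here is a self-contained way to build an orthonormal Haar wavelet matrix G whose columns are the time- and scale-localized functions g_i(n). This is my own minimal construction, not necessarily the wavelet family used in the paper.

    import numpy as np

    def haar_matrix(n):
        """Orthonormal Haar basis for length n (a power of 2); rows are the
        time- and scale-localized basis functions."""
        if n == 1:
            return np.array([[1.0]])
        h = haar_matrix(n // 2)
        coarse = np.kron(h, [1.0, 1.0])                # scaling functions
        detail = np.kron(np.eye(n // 2), [1.0, -1.0])  # localized differences
        return np.vstack([coarse, detail]) / np.sqrt(2.0)

    G = haar_matrix(256).T        # columns g_i(n): 256 samples x 256 functions
    beta = np.zeros(256)
    beta[40] = 1.0                # one active coefficient, for illustration
    u = G @ beta                  # u(n) = sum_i beta_i * g_i(n)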
I alluded to this at the beginning: why do we care about this? The data I deal with typically come from clinical applications. Voice assessment informs speech pathologists and the surgeons who do their work on the vocal folds and want to see, from the production side, how a certain technique will affect a given feature. In addition, there are characteristics that are important to look at for health, emotional state, and speaker traits such as stature.
So how do we do this subspace selection? I'll show just one set of equations: the least-squares solution to the problem from before. In the LPC case, the AR-only case, this boils down to the classic result where you get the LPC coefficients. What we have extra is this G matrix, which holds the wavelet subspace that will model the voicing on each segment, plus the Gaussian noise of variance sigma^2. The reason I wanted to show you this is the critical issue: simply estimating the wavelet basis on all the samples — say you have 256 samples of the raw speech waveform and you want to estimate 256 basis functions and parameterize the signal that way — turns into an ill-conditioned problem, because of the matrix inversion you see here. That conditioning problem is what we address in this paper, by thresholding the wavelets so that we don't take all of the wavelet functions in this space.
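In matrix form this is an ordinary least-squares problem in the stacked parameter vector [a; beta]. Here is a hedged sketch under my own notation (frame s, AR order p, basis matrix G as built above); with all 256 columns of G retained, the regressor is rank-deficient, which is exactly the ill-conditioning being described.

    import numpy as np

    def joint_ls(s, G, p):
        """Jointly estimate AR coefficients a (order p) and basis weights beta
        in s(n) = sum_k a_k s(n-k) + (G @ beta)(n) + w(n), by least squares."""
        N = len(s)
        # Lagged-signal columns s(n-1), ..., s(n-p) for n = p, ..., N-1
        S = np.column_stack([s[p - k:N - k] for k in range(1, p + 1)])
        X = np.hstack([S, G[p:, :]])          # joint regressor [S | G]
        # With all 256 basis functions, X is rank-deficient; lstsq returns
        # the minimum-norm solution, but the estimate is poorly determined.
        theta, *_ = np.linalg.lstsq(X, s[p:], rcond=None)
        return theta[:p], theta[p:]           # (a_hat, beta_hat)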
So we propose — sorry for the folks in the back — three wavelet shrinkage algorithms. The idea is simply: we don't need all 256 wavelets, so how many do we need — 32? 64? That threshold can be specified using iterative methods: hard thresholding of the wavelet coefficients; soft thresholding; and joint estimation of the parameters where we keep only a top-K set of wavelet coefficients. Finally, we present another optimization technique using an L1 norm as the criterion. I'm not going into the details here, because I want to give an intuition of what some of the performance issues are, as well as the results; the paper goes into detail about the algorithms themselves.
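For intuition, the core coefficient-selection operations behind those three flavors look roughly like the following; this is a minimal sketch of the standard operators, not the paper's full iterative algorithms.

    import numpy as np

    def hard_threshold(beta, lam):
        """Keep coefficients with magnitude above lam; zero out the rest."""
        return np.where(np.abs(beta) > lam, beta, 0.0)

    def soft_threshold(beta, lam):
        """Shrink magnitudes toward zero by lam (the L1 proximal step)."""
        return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

    def top_k(beta, k):
        """Keep only the k largest-magnitude coefficients (e.g., k = 64)."""
        out = np.zeros_like(beta)
        idx = np.argsort(np.abs(beta))[-k:]
        out[idx] = beta[idx]
        return out

Soft thresholding is the proximal operator of an L1 penalty, which is the connection to the L1-norm criterion just mentioned.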
So what do we get? What I'm showing here is synthesized speech — I'll also show you a real example. When we synthesize with a given source, we know the pitch, we can time-vary the pitch, and we can see its effect on the various algorithms. The proposed algorithm uses the hard-thresholding approach; "hard-64" keeps the top 64 wavelets, as a second step after hard thresholding. What I want you to focus on first is the last output, the AR residual: that is the source waveform that we can parameterize to find the voicing properties of speech. It includes both components, the stochastic and the non-stochastic parts. What our algorithm does is separate those in a principled manner to begin with. You can also do a deterministic-stochastic separation on the AR residual afterwards, but we do the separation as part of the model, and we get a better root-mean-square error almost because of that; more importantly, the source-property estimation is better, and the filter-property estimation is better.
The classic case is when you have a filter with given formant values and you just use the AR model: there is a bias in the formant frequencies, which I'm showing in the last, bottom curve. With the AR residual and the LPC coefficients, when you look at the spectrum there is a bias in F2 and F3 relative to the ground truth. Our algorithm handles that by putting the proper energy in the source and in the filter — in a way, solving the separation problem, the inverse-filtering problem.
The paper also makes one theoretical observation: we derive the Cramér-Rao lower bounds, the lower bounds on the variance of these estimators. We have three parameters in a toy example: we took a two-pole filter with one resonance, at 2 kHz, and we swept a sinusoidal input source over frequency — what you're seeing is from 0 to 8 kHz. The bounds for the estimators of the LPC coefficients a1 and a2, and of the beta coefficient for the sinusoid, give us the variance. What we see, as expected, is that we are able to estimate the filter pretty well across the bandwidth; the source properties, however, are only well estimated when the center frequencies do not overlap. This is the classic problem of sampling the spectrum at the center frequency: as the sinusoid lines up with the resonance, we get an increase in the lower bound of that estimator. We can also show this in terms of the bandwidth. We fixed the resonance of the filter, fixed the frequency of the sinusoid at 2 kHz, and varied the bandwidth; the estimator variance, as expected, gets better as the bandwidth gets smaller, because the filter and the source become easier to discriminate. What this is saying is that the problem gets more and more difficult when F0, the fundamental frequency of the source, crosses F1, for example, in real speech.
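For a linear-in-parameters Gaussian model like this one, the bound can be computed (to a standard asymptotic approximation) as the diagonal of sigma^2 * (J^T J)^(-1), where the columns of J are the derivatives of the model mean with respect to (a1, a2, beta). Here is my own sketch of that toy calculation, not the paper's exact derivation.

    import numpy as np
    from scipy.signal import lfilter

    fs, N, sigma2 = 16000, 512, 1e-4
    r, fc = 0.98, 2000.0                         # one resonance at 2 kHz
    a1, a2 = 2 * r * np.cos(2 * np.pi * fc / fs), -r ** 2
    rng = np.random.default_rng(0)

    def crlb(f0):
        """CRLB diagonal for (a1, a2, beta) with a sinusoidal source at f0."""
        n = np.arange(N)
        x = np.sin(2 * np.pi * f0 * n / fs)      # sinusoidal input source
        w = np.sqrt(sigma2) * rng.standard_normal(N)
        s = lfilter([1.0], [1.0, -a1, -a2], x + w)
        # Columns: d(mean)/d(a1), d(mean)/d(a2), d(mean)/d(beta)
        J = np.column_stack([s[1:-1], s[:-2], x[2:]])
        return np.diag(np.linalg.inv(J.T @ J / sigma2))

    bounds = np.array([crlb(f0) for f0 in np.linspace(50.0, 7950.0, 80)])
    # The beta bound (bounds[:, 2]) peaks sharply as f0 crosses the resonance,
    # reproducing the source-filter identifiability problem described above.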
That gives you a high-level overview of what we're doing. The next steps on the technical side are to use level-dependent thresholding, not just hard thresholding, and other ways of penalizing the wavelet coefficients, to get a better representation of the source.
Here's an example on natural speech, where you see the waveform with a glottalized voice quality — it sounds like the onset of an aperiodic source. Methods that rely on pulse timing will be sensitive to that, because finding the exact input pulse times is difficult when things are aperiodic, whereas the wavelets, not really caring what the period is, can grab those impulses, and then we can find features from the reconstructed source.
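Chaining the earlier sketches together, reconstructing that source component would look roughly like this (joint_ls, top_k, s, and G are the hypothetical helpers and variables defined in the sketches above):

    a_hat, beta_hat = joint_ls(s, G, p=10)   # joint AR + basis fit (sketch above)
    u_hat = G @ top_k(beta_hat, 64)          # source rebuilt from top 64 wavelets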
And finally — sorry, this slide is low — here is an example of one of the clinical applications. There have been advances in high-speed videoendoscopy of the larynx, and I get to work with these videos all day and relate them to the source features that we estimate, theoretically on synthesized vowels and on real ones, but also to direct measurements: looking down the throat, or estimating airflow. That is our next step: to see whether these features actually relate to what we can measure in human subjects. As you can see, this is not an easy setup, collecting airflow in addition to endoscopic views of the larynx.
So that's what I had, and I welcome any questions. Thank you.

I can also repeat the questions; a microphone may be coming.
Q: Thank you for your talk. My question relates to: what is the main difference of your approach compared to what is done in CELP coding, where you also have LPC and a sum of, you know, dictionary atoms as the source for the LPC filter? Can you tell us the main difference?

A: Sure. In code-excited linear prediction you have — let me go back to the model — a dictionary of sources that you can reuse, and in CELP that dictionary is not limited to simply noise. It's similar; what we're doing is then parameterizing that codebook — each one of the entries in the codebook — to find features that relate to the source, rather than keeping just the raw source. In CELP you have the raw samples, and in your cellphone's processor the output is a natural-sounding resynthesis. We're interested, once we get the parameters, in deriving features from those parameters. But yes, that's a good point.
Q: Okay, let me ask a question. It sounds like this would do better with noise than regular LPC — or does it depend on the noise? When would it do better, and when worse?

A: So, my master's thesis actually found that we can extend this model one step further, because the noise in voiced speech has been observed to be modulated: it's not actually white Gaussian noise, but can be modulated white Gaussian noise. So even this model is off; there's always a tradeoff over how much we care about modulations in the noise. Where it may not perform as well: this model puts the noise source before the filter. Given that speech model, if it's reverberation noise, or a car going past, that's not filtered by the same transfer function, so there's a larger model mismatch there.

Q: That's great, thank you.