0:00:13 Hello, I'm [inaudible] from [inaudible] University, and this is joint work with my adviser, John [inaudible]. The title of our work is "Log-Spectral Enhancement Using Speaker-Dependent Priors for Speaker Verification."
0:00:29 The key idea behind this work is how we can use Bayesian parameter estimation techniques to improve the robustness of speaker verification systems to noise and mismatch. The reason we want a Bayesian technique is that a Bayesian approach allows us to have a principled way of accounting for parameter uncertainty in the noise estimation task.
0:00:57 As in most pattern recognition systems, the front end is a key component: it is used to extract the parameters of interest from the raw signal. In this case we have speech corrupted by noise, and we want to extract features of interest that we then use in our classification algorithm. The noise makes our parameter estimates erroneous, and the severity of the noise determines how much of an effect it has on those estimates. If we can account for this uncertainty in a Bayesian estimate, we can probably enhance our speaker verification system.
0:01:50 Here we can see the two main causes of performance degradation: noise, which we have just discussed, and mismatch. In a speaker verification system we need a model for each speaker's distribution, and the acoustic environment in which the training data was collected may not be the same environment in which the system is used. This results in mismatch, and hence performance degradation.
0:02:22 So the aim of our work, as the title suggests, is to do enhancement using speaker-dependent priors in the log-spectral domain. The key idea is that we want to couple two systems which we feel are closely matched: the speech enhancement system and the recognition system. The intuition is that in speech enhancement you are enhancing features using speaker-dependent priors; if you have a better idea of who is speaking, and a good prior in that domain, then you can do a better job of enhancing the signal, and with the enhanced signal you can do a better job of recognition. So there is an interplay between these two systems, and we capture this interplay as message passing along the nodes of a graphical model. This will fall out naturally in our formulation.
0:03:27 Just a brief outline of the rest of the talk: I will briefly go over speaker verification for any members of the audience who may need it, then go into Bayesian inference, then into variational Bayesian inference, which is the framework we work in, then discuss our model, and then present the experimental results.
0:03:59 In verification, the task is as follows: given an utterance and a claimed identity, perform a hypothesis test. Given the speech segment X, is the speech from speaker S or not? What we do is model our target speakers using speaker-specific GMMs, and then use a universal background model (UBM) to model the alternative hypothesis. This is the baseline GMM-UBM system, which is the starting point for most verification systems; more advanced ones exist, but this is the most basic, and this is where we will try our enhancement in the log-spectral domain to see if we can obtain improvements.
0:04:58 For the classification decision, we compute a score, the log-likelihood ratio, and then compare it to a threshold to decide which hypothesis is correct. We can plot a DET curve as a performance metric, and we can also compute the equal error rate to characterize the trade-off between missed detections and false alarms.
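As a concrete illustration of this scoring and evaluation step, here is a minimal sketch with diagonal-covariance GMMs written directly in NumPy. The parameters and helper names are toy placeholders, not the system from the talk (which would MAP-adapt the target model from the UBM):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D) frames; weights: (K,); means, variances: (K, D)."""
    diff = X[:, None, :] - means[None, :, :]              # (T, K, D)
    log_comp = -0.5 * (np.sum(diff**2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # log-sum-exp over components, weighted by the mixture weights
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1)

def llr_score(X, target, ubm):
    """Average log-likelihood ratio: target GMM vs. UBM."""
    return np.mean(gmm_loglik(X, *target) - gmm_loglik(X, *ubm))

def equal_error_rate(true_scores, impostor_scores):
    """EER: sweep thresholds and find where miss rate meets false-alarm rate."""
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))
    miss = np.array([np.mean(true_scores < t) for t in thresholds])
    fa = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return 0.5 * (miss[i] + fa[i])
```

Accepting the claim when `llr_score` exceeds a threshold is the decision rule described above; sweeping that threshold traces out the DET curve.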
0:05:31 That covers the speaker verification part; now a little bit of Bayesian inference. There are two main approaches to parameter estimation: you can go the maximum likelihood route or the Bayesian inference route. Here we have data X, represented in this figure, generated by a model governed by a parameter theta. In the maximum likelihood paradigm, we assume this parameter is an unknown constant; the quantity of interest is the likelihood, and we estimate theta with the maximum likelihood criterion. In the Bayesian paradigm, on the other hand, we assume theta is a random variable governed by a prior, and this is where the robustness to parameter uncertainty comes in: the fact that we have a prior over the parameter of interest. The key quantity in this case is the posterior, which by Bayes' rule is proportional to the product of the likelihood and the prior.
0:06:49 The issue is how we obtain our estimates. Bayesian estimates are obtained by minimizing an expected cost. For instance, if the cost is the squared norm of the difference between the estimate and the true value, as in this expression here, it is well known that the resulting estimate, the minimum mean square error (MMSE) estimate, is just the posterior mean.
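For a scalar Gaussian model the posterior mean has a closed form, which makes a convenient sanity check of this statement; the numbers and the helper name below are illustrative only:

```python
import numpy as np

# Prior: theta ~ N(mu0, var0); likelihood: x_i | theta ~ N(theta, var).
# The posterior is Gaussian, so the MMSE estimate is its mean:
#   E[theta | x] = (mu0/var0 + sum(x)/var) / (1/var0 + n/var)
def mmse_gaussian(x, mu0, var0, var):
    n = len(x)
    precision = 1.0 / var0 + n / var
    return (mu0 / var0 + np.sum(x) / var) / precision

x = np.array([2.0, 1.5, 2.5])                 # observations
est = mmse_gaussian(x, mu0=0.0, var0=1.0, var=1.0)
# est == 1.5: pulled from the sample mean (2.0) toward the prior mean (0.0)
```

The shrinkage toward the prior mean is exactly the "accounting for parameter uncertainty" benefit mentioned at the start of the talk.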
0:07:23 Note that while this is easy to write down, in most practical cases, including the one we consider here, it is almost impossible to compute this posterior expectation exactly. So what do we do?
0:07:43 If the problem lies in the intractability of the posterior, then we can apply approximate Bayesian techniques; for instance, we can use VB, or variational Bayes, where we approximate the true posterior by one constrained to lie in a tractable family. We need a mapping from the intractable family to a tractable family of distributions, and we need a metric so that we know which member of the tractable family is the closest approximation to the true posterior. We measure this with the KL divergence: we obtain the approximation that minimizes the KL divergence between our approximation and the true posterior.
0:08:42 In cases where our parameter set consists of N parameters, we can ensure tractability by assuming that the posterior factorizes into a product of per-parameter factors, as shown in this expression.
0:09:02 The question then boils down to computing the forms of this approximate posterior, factor by factor, and then updating their sufficient statistics. It can be shown that the expression for the approximate form of each factor is obtained by taking the expectation, with respect to the other factors, of the logarithm of the joint distribution of the observations and the parameters of interest.
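As a concrete illustration of these coupled factor updates, here is the textbook mean-field VB example (a Gaussian with unknown mean and precision, conjugate Normal-Gamma prior), not the talk's log-spectral model; the hyperparameter defaults are arbitrary:

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field VB for x_i ~ N(mu, 1/tau), factorizing q(mu, tau) = q(mu)q(tau).
    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n)."""
    n, xbar = len(x), np.mean(x)
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)    # fixed by conjugacy
    a_n = a0 + 0.5 * (n + 1)                       # fixed by conjugacy
    e_tau = a0 / b0                                # initial guess for E[tau]
    for _ in range(n_iter):
        # q(mu) update: its precision depends on the current E[tau]
        lam_n = (lam0 + n) * e_tau
        e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n  # moments of q(mu)
        # q(tau) update: expectation of the quadratic terms under q(mu)
        b_n = b0 + 0.5 * (np.sum(x**2) - 2 * e_mu * np.sum(x) + n * e_mu2
                          + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0**2))
        e_tau = a_n / b_n                          # sufficient statistic fed back
    return mu_n, lam_n, a_n, b_n
```

Each factor's update takes an expectation of the log joint under the other factor, which is exactly the cyclic "message passing" coupling the talk describes; the talk's model swaps in the log-spectral likelihood and speaker-dependent GMM priors.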
0:09:47 Now let's get back to our speaker verification context, and in particular let's discuss the probabilistic model. Here we work in the log-spectral domain. We assume that our observed signal y(t) is the clean signal corrupted by additive noise. Taking the DFT, we can compute the log spectrum as shown, and it can be shown that there is a nice approximate relationship between the log spectrum of the observed signal, the clean log spectrum, and the log spectrum of the noise. This is our likelihood: in the Bayesian paradigm we have the likelihood and the prior, and this relationship gives us the likelihood.
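Under the usual assumption that the cross-terms between speech and noise are negligible, this additive-noise relationship in the log-spectral domain is a log-sum of exponentials; a quick numerical check with toy values:

```python
import numpy as np

# If |Y|^2 ~ |X|^2 + |N|^2 in the power domain, then in the log-spectral
# domain y ~ log(exp(x) + exp(n)) = x + log(1 + exp(n - x)).
def noisy_log_spectrum(x, n):
    """Approximate observed log spectrum from the clean (x) and noise (n)
    log spectra; logaddexp keeps the computation numerically stable."""
    return np.logaddexp(x, n)

x = np.array([0.0, 2.0, 5.0])    # clean log spectrum (toy values)
n = np.array([0.0, -1.0, -5.0])  # noise log spectrum (toy values)
y = noisy_log_spectrum(x, n)
# When speech dominates (x >> n), y is close to x; when the two are
# comparable, the observation is biased upward by up to log(2).
```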
0:10:49 Now we need to write out our joint distribution and how it factorizes, because this will help us when we compute the approximate posterior: recall that the expression for each of the optimal factors depends on an expectation of the logarithm of the joint distribution. So this is how the joint distribution factorizes in this context: we have the observed log spectrum, the clean log spectrum, an indicator variable (which I will explain shortly) that we introduce, and the noise. Here we have the likelihood term, and here the prior over the speech log spectrum, which we assume is speaker dependent.
0:11:49so what happens is
0:11:52yes the speaker dependent ubm so in a speaker I D context this would
0:11:57in mean that we we'll and models for
0:12:00each speaker
0:12:01not id context but in know a verification context what we do is we approximate that
0:12:06that would be that you snap not
0:12:08mean in this but if kitchen context we assume that we can
0:12:12model the light bright you'll speakers as as
0:12:15just the target speaker and the ubm so this is what happens is that the library dynamic
0:12:20for each at that your testing you your when you like
0:12:24and we have a what
0:12:25i it is that this indicator the variable
0:12:28uh that was you
0:12:32who peeking
0:12:33oh in other what where they'd the target the ubm and which mixture the component is active
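The indicator posterior is the familiar GMM responsibility computation, extended over the dynamic two-model library (claimed target plus UBM); a sketch with toy diagonal Gaussians and an assumed uniform prior over the two models:

```python
import numpy as np

def indicator_posterior(x, models):
    """Posterior over (model, mixture component) for one frame x.
    models: list of (weights, means, variances) tuples, here the claimed
    target's GMM and the UBM, weighted equally a priori (an assumption)."""
    log_post = []
    for weights, means, variances in models:
        diff = x[None, :] - means                          # (K, D)
        log_n = -0.5 * (np.sum(diff**2 / variances, axis=1)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
        log_post.append(np.log(weights / len(models)) + log_n)
    log_post = np.concatenate(log_post)
    log_post -= np.logaddexp.reduce(log_post)              # normalize
    return np.exp(log_post)                                # sums to 1
```

In the full VB scheme this expectation is taken jointly with the updates for the clean speech and noise factors rather than from the raw frame alone.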
0:12:39 This slide just shows the forms of the factors that we compute, and we can see that they are well-known forms. The VB algorithm boils down to iteratively updating the sufficient statistics, in our case the mean and the covariance, which are functions of the observations and the prior, and then cycling through the updates until some convergence criterion is reached. The nice thing is that once you obtain an estimate of the clean posterior, you can easily derive MFCCs from it for verification.
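That last step, turning an enhanced log spectrum into MFCCs, is just a discrete cosine transform of log mel-filterbank energies. A bare-bones sketch (the mel filterbank and any enhancement are omitted; the input is assumed to already be log mel energies, and the helper name is illustrative):

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_ceps=13):
    """DCT-II of log mel-filterbank energies -> cepstral coefficients.
    log_mel: (T, M) frames of log mel energies; returns (T, n_ceps)."""
    M = log_mel.shape[1]
    k = np.arange(n_ceps)[:, None]              # cepstral index
    m = np.arange(M)[None, :]                   # filterbank channel index
    basis = np.cos(np.pi * k * (m + 0.5) / M)   # unnormalized DCT-II basis
    return log_mel @ basis.T
```

In the talk's pipeline, `log_mel` would come from the posterior mean of the clean log spectrum, so the enhancement simply plugs in ahead of a standard MFCC front end.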
0:13:24 Now some experimental results. We used three datasets: initially TIMIT, then the MIT mobile device speaker verification corpus, and then we also tried it out on the NIST SRE 2004 corpora. The initial results here are for TIMIT. We trained a UBM using training data from a subset of the 630 speakers in TIMIT, and then we corrupted the speech with additive white Gaussian noise; I will present results for more realistic noise later. We used two test utterances per speaker, so for the 630 speakers we can generate 1260 true trials, and then we select a random subset of ten speakers as impostors and compute scores for each trial. We also compared against our implementation of FDIC, a feature-domain intersession compensation technique, which entails learning a projection matrix to project the features into a session-independent subspace so that verification would be more robust. The details are in the paper, so I won't go through them.
0:14:59 Here is a brief table of some results for the TIMIT case with additive white Gaussian noise, sweeping through several SNRs. From the raw data we compute MFCCs: the top line shows what you obtain if you use the MFCCs without applying anything, just the raw features; the second line shows what you obtain if you compute MFCCs after enhancing the log spectra with the VB technique. Our implementation of FDIC was able to work in the low-SNR cases; in the high-SNR case it seemed to break down in our implementation, which we are still investigating.
0:15:53 This is a DET plot for one of the SNR conditions for TIMIT, and we see that the equal error rate dropped by about half in that case, and across the SNRs we investigated. We also looked at other types of noise; what we have here is factory noise, obtained from the NOISEX-92 dataset, and the results are similar, only at different SNRs because of the type of noise. You can see that the improvement here is not as large, but that is because this is an almost clean condition.
0:16:40 Then we applied this to the MIT corpus. We wanted to show what happens when we have mismatch: models trained on data obtained in an office and tested with data from a noisy street intersection. We do observe the mismatch: the equal error rate jumps to over twenty percent when the test data is from the noisy intersection while the models were trained on office data, and when we apply the VB technique, it reduces to twenty-four percent.
0:17:29 For the final experiments, we used the SRE 2004 corpora. As for the details: we used a UBM with 512 mixture components and nineteen-dimensional MFCCs with cepstral mean normalization. The upshot is that we only obtained modest gains when we applied our method: the baseline system had an equal error rate of 13.8, and we were only able to get it to 13.4. This may be due to the fact that the formulation assumes models trained on clean speech, and this limits the gain compared to what we obtained with TIMIT and the MIT dataset. And that's it. Thank you.
0:18:26 [Session chair] We have time for one quick question.
0:18:38 [Audience member] I have a question: did you try using a standard type of speech enhancement, such as Wiener filtering, to obtain enhanced speech, and then using the enhanced speech to do speaker verification?
0:18:55 [Presenter] No, we did not, but we tried using a [inaudible] front end; we were able to get gains in a speaker ID context, but not in this context. That is something we should do.
0:19:10 [Session chair] Okay, thank you. Let's thank the speaker.