Speech Transcript - Exemplar-based Sparse Representation and Sparse Discrimination for Noise Robust Speaker Identification

0:00:15	the next presentation
0:00:16	that is
0:00:18	exemplar-based sparse representation and sparse discrimination
0:00:22	for noise robust speaker
0:00:24	identification
0:00:43	oh
0:00:44	two examples i think
0:00:47	is the joint work with the university a
0:00:50	oh
0:00:51	not to miss rate
0:00:52	and
0:00:53	and when and
0:00:56	well
0:00:58	so
0:01:00	the name maybe
0:01:01	why we can T
0:01:04	but
0:01:04	this is the first attempt, sort of,
0:01:07	to try that for
0:01:08	speaker recognition
0:01:15	so
0:01:25	section five
0:01:26	that is
0:01:27	speaker
0:01:29	the name
0:01:30	yeah
0:01:32	this
0:01:34	you
0:01:35	noisy conditions that recently
0:01:37	with this sort of motivate us
0:01:41	the recent studies of this one
0:01:45	been done in our group
0:01:46	that showed that the noise, for example
0:01:50	yeah
0:01:50	the effect of noise
0:01:51	is quite harsh on state-of-the-art speaker recognition
0:01:55	i-vector based systems and you have a basis
0:01:58	it is
0:01:59	there needs to be some sort of way to deal with the effect of additive noise
0:02:04	in speaker recognition systems
0:02:07	yeah
0:02:07	as they are
0:02:09	use
0:02:10	with that being
0:02:12	something about how to deal with the effect of noise in speaker recognition, especially
0:02:18	in i-vector based systems
0:02:20	a recent literature
0:02:23	in i
0:02:25	i
0:02:26	first
0:02:27	and they
0:02:29	tried multi-condition training to deal with different types of noises
0:02:35	speaker recognition
0:02:37	that work was about to sort of very different models and clearly models based on
0:02:43	different noises
0:02:44	and the work of labels about you know how to a different
0:02:51	features
0:02:52	noisy features
0:02:53	and then all of them together in modeling phase
0:02:56	in the sort of a the only thing
0:02:59	most conditional speech
0:03:02	the other way is
0:03:04	it is also
0:03:06	we go a small initial training class a missing features
0:03:11	it means that when we are using multi-condition training, the features are
0:03:15	contaminated by noise and put together at the modeling phase
0:03:20	that the features that they are affected by noise technicians their account the so called
0:03:26	in the out how we can in a
0:03:31	and the rest for a auditory or features
0:03:35	and separation so
0:03:37	well, the auditory features are gfccs, gammatone filterbank
0:03:41	cepstral coefficients, which are shown to be quite efficient compared to mfccs
0:03:46	because it is sort of a more robust model of the auditory system
0:03:52	and the separation system is based on auditory scene analysis, where they
0:03:58	try to separate the speech and noise and build a
0:04:02	binary mask so that they can rely on the speech and try to get clean speech out of
0:04:07	it, and what can be done after that is missing-feature marginalisation or reconstruction
0:04:13	so
0:04:14	a recent for us to make the speaker
0:04:17	robust against i
0:04:21	and
0:04:23	i
0:04:25	yeah
0:04:31	a
0:04:31	what i am presenting here are preliminary results of our
0:04:36	research to improve noise robust speaker
0:04:40	and it is quite different from the things that you have seen because that the
0:04:44	message inside the speech is somehow disturbing the speaker
0:04:49	section
0:04:49	there whatever think it's a sort of speaker
0:04:52	mission with a speech
0:04:54	so what do not being exactly is
0:04:57	vision
0:04:57	important what is being said
0:05:01	that works
0:05:03	exemplar-based approach means that we have examples of
0:05:06	the data, or clustered examples of the data, in the dictionary, and then we
0:05:11	build the observation based on what we have in that sort of dictionary
0:05:17	yeah we are considering
0:05:22	a sort of long temporal context of the spectrum
0:05:28	so we build mel-band amplitudes for each
0:05:34	what is that, for each
0:05:37	for each frame we have, note that, unlike mfccs, this is
0:05:41	just the mel-band amplitude spectrum
0:05:46	and E that the before here
0:05:48	and we have this three
0:05:52	yeah we
0:05:53	so each frame
0:05:55	and
0:05:56	okay then you have a sort of
0:05:58	superframe, in which we have all the information in this part, that is typically twenty
0:06:04	five frames
0:06:06	so it is in the order of two hundred fifty milliseconds, all
0:06:10	in one vector, to be considered as one building block
0:06:15	section
0:06:16	with a sliding window we are going to cover all of the
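The windowing described here can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name `stack_superframes`, the 26 mel bands, and the 10 ms frame hop (which makes 25 frames span roughly 250 ms) are assumptions, and the input is random toy data.

```python
import numpy as np

def stack_superframes(spectrogram: np.ndarray, context: int = 25) -> np.ndarray:
    """Stack `context` consecutive frames into one long vector per window position.

    spectrogram: (n_bands, n_frames) non-negative mel-band amplitudes.
    Returns an array of shape (n_bands * context, n_frames - context + 1),
    one column per position of the sliding window (window shift of one frame).
    """
    n_bands, n_frames = spectrogram.shape
    if n_frames < context:
        raise ValueError("utterance is shorter than one context window")
    n_windows = n_frames - context + 1
    columns = [
        # Column-major flattening puts the frames of the window one after another.
        spectrogram[:, start:start + context].reshape(-1, order="F")
        for start in range(n_windows)
    ]
    return np.stack(columns, axis=1)

# Toy input: 26 bands (assumed), 200 frames, i.e. about 2 s at an assumed 10 ms hop.
rng = np.random.default_rng(0)
mel = rng.random((26, 200))
superframes = stack_superframes(mel, context=25)
print(superframes.shape)  # (650, 176)
```

Each column is one "building block" in the sense used in the talk: a single vector holding the whole long temporal context.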
0:06:24	is that
0:06:25	a small one
0:06:30	in
0:06:31	next
0:06:33	so
0:06:34	i
0:06:35	this
0:06:36	let me say a example
0:06:39	what we need to do is to build the dictionary the next so we have
0:06:43	these things and we need to build the dictionary that it is representative of the
0:06:48	speaker
0:06:49	yeah
0:06:50	so here, in this work, we had a small vocabulary
0:06:55	so we were able to do hmm forced alignment on it and make a sort
0:07:00	of label for each of the frames and
0:07:03	for example, if you have hmm models, for each of the
0:07:10	word models we have several states, this is a word model, and we have
0:07:15	several states per model
0:07:17	so each of these frames
0:07:19	could be associated with one of the hmm states
0:07:22	once we have associated the
0:07:24	states with the frames, we take the context around each one, call it the
0:07:33	long temporal context
0:07:34	that we have labeled as belonging to the same hmm state
0:07:38	and then after that, as they all represent the same sort of phonetic event, if you
0:07:44	can call it that
0:07:45	what we do to make just one representative of this event
0:07:50	is to take the element-wise median over all of these temporal contexts
0:07:53	and keep just one representative of this state
0:07:57	it means that in this special task that we performed we had sort of two hundred
0:08:04	fifty hmm states for the, let me say, hmm word models
0:08:09	and then we now have two hundred fifty long temporal contexts, each of which we just
0:08:16	put in one vector and we have it as a representative of this phonetic event
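A minimal sketch of this dictionary-building step, assuming that every stacked context vector already carries the index of the HMM state it was aligned to. The names `build_speaker_atoms` and `state_labels` are hypothetical, the forced alignment itself is not shown, and the data below is random.

```python
import numpy as np

def build_speaker_atoms(superframes, state_labels, n_states):
    """Return one atom per HMM state for a single speaker.

    superframes:  (dim, n_windows) stacked long temporal contexts.
    state_labels: (n_windows,) integer state index assigned to each context.
    Each atom is the element-wise median of all contexts sharing a state label.
    """
    atoms = np.zeros((superframes.shape[0], n_states))
    for state in range(n_states):
        members = superframes[:, state_labels == state]
        if members.shape[1] > 0:  # states never visited keep an all-zero atom
            atoms[:, state] = np.median(members, axis=1)
    return atoms

# Toy data: 1000 contexts of dimension 650, labelled with 250 states as in the talk.
rng = np.random.default_rng(0)
superframes = rng.random((650, 1000))
state_labels = rng.integers(0, 250, size=1000)
atoms = build_speaker_atoms(superframes, state_labels, n_states=250)
print(atoms.shape)  # (650, 250)
```

The median is taken independently for every element of the vector, so one atom summarizes all occurrences of the same state for that speaker.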
0:08:22	so
0:08:23	this is done per speaker, so per speaker we have hmms trained on
0:08:28	the data and these atoms are stored in the dictionary
0:08:34	in addition we also have an important dictionary to model the noise
0:08:38	so we have the speaker part and the noise part in the dictionary, and for the
0:08:43	noise we are using a noise dictionary, it means that in this
0:08:49	data it is assumed that the utterance is sort of existing inside a large recording
0:08:56	so we know the time when the speaker is going to start
0:09:03	and we sampled the noise from just before the time that it is going to start
0:09:10	so i think the dictionary so this is sort of context recognition normal way that
0:09:15	people do the sparse representation they do there are lots of taking dictionary building but
0:09:21	this is context recognition rate and we know what we are building and that sort
0:09:24	of the strength of this approach
0:09:27	then there is the factorisation; in the factorisation we normally estimate the observation
0:09:33	based on the dictionary and the
0:09:37	activations, let's say X is the activation of all the atoms of the dictionary and X as
0:09:43	a nation
0:09:45	it is just a pictorial representation
0:09:47	the data from and to provide and icassp two thousand twelve paper because we were
0:09:53	doing the same thing
0:09:54	so this is because there's that we have three from the dictionary and the or
0:10:01	for a result in this context of the spectral
0:10:06	and an observation
0:10:08	in this once we have this sort of events that they are coming after each
0:10:12	other
0:10:13	and in decomposing the observation into these frames we need to
0:10:20	somehow minimize the distance between the observation and a combination of atoms
0:10:28	so we have for example three atoms, and also in the activation we have three
0:10:34	elements, so it is sort of a linear combination of atoms that builds the observation
0:10:40	and yeah well
0:10:42	so we have non-negative matrix factorization, and also non-negative matrix deconvolution
0:10:49	what is done in both is that there is a distance function to be minimized to make
0:10:53	the reconstruction quite similar to what we observe
0:10:56	in addition, this distance function is actually not the euclidean distance; what we are
0:11:03	using is the kullback-leibler divergence, it is presented in the references of
0:11:09	the paper, sorry, and in addition we have a penalty term to
0:11:13	just have a sparse
0:11:15	so you variation used here
0:11:17	using the
0:11:18	sparse, in what is being estimated, means that if you want to estimate this observation
0:11:24	it needs to be estimated from a few of the atoms of the dictionary and we
0:11:27	cannot use all of them in combination, just tuning optimal weights in the best way to
0:11:32	reconstruct it
0:11:33	and that is because we say that these are somehow events of
0:11:38	the speech that we have seen before, so we don't need to combine too many of
0:11:41	the old observations to represent the current
0:11:47	context
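A sparse non-negative decomposition against a fixed dictionary can be sketched as below. Several things here are assumptions rather than details confirmed by the transcript: the distance is taken to be the generalized Kullback-Leibler divergence, sparsity is an L1 penalty with weight `sparsity`, and the solver uses the standard multiplicative update. It is also the simpler window-by-window factorization (NMF), not the deconvolution (NMD) variant that the talk says was used.

```python
import numpy as np

def kl_divergence(Y, Y_hat, eps=1e-12):
    """Generalized Kullback-Leibler divergence between non-negative arrays."""
    return float(np.sum(Y * np.log((Y + eps) / (Y_hat + eps)) - Y + Y_hat))

def sparse_activations(Y, D, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations X such that Y is close to D @ X.

    Y: (dim, n_windows) observations, D: (dim, n_atoms) fixed dictionary.
    Minimizes KL(Y || D @ X) + sparsity * sum(X) with multiplicative updates,
    which keep X non-negative as long as it starts positive.
    """
    X = np.ones((D.shape[1], Y.shape[1]))
    denom = D.sum(axis=0)[:, None] + sparsity  # one value per atom
    for _ in range(n_iter):
        ratio = Y / (D @ X + eps)
        X *= (D.T @ ratio) / denom
    return X

# Toy problem: each observation is built from exactly one of 12 hypothetical atoms.
rng = np.random.default_rng(0)
D = rng.random((40, 12))
X_true = np.zeros((12, 5))
X_true[rng.integers(0, 12, size=5), np.arange(5)] = 1.0
Y = D @ X_true
X_hat = sparse_activations(Y, D)
print(kl_divergence(Y, D @ X_hat))
```

The penalty term only appears in the denominator of the update: a larger `sparsity` value shrinks all activations, so only atoms that explain the observation well stay active.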
0:11:48	so the non-negative matrix deconvolution that is employed here takes care of this
0:11:54	overlap between the events, you know, here is a
0:11:57	space where it does not build this one based on the atoms that exist
0:12:04	in the dictionary, so all of the activations are zero here
0:12:09	because it is able to build it from the next one and from the one before
0:12:13	and that is the way that it is represented
0:12:17	well, nmf works just one by one
0:12:23	it decomposes this one and tries to build it, then goes to the
0:12:27	next one and tries to build it, and the cost function is minimized over one long temporal context
0:12:33	but nmd takes all of this into account and minimizes the distance over
0:12:38	the whole utterance, you know
0:12:41	so and it was proved that it is utilized in this study
0:12:49	it was on T well it doesn't sort of background in years ago about the
0:12:55	class and this is what on your volunteers were not just the users for speech
0:13:00	recognition
0:13:04	so the content and no we need to
0:13:09	oh well we are using a speaker
0:13:13	so i
0:13:16	oh we are building dictionaries
0:13:17	or long on that for each
0:13:22	one
0:13:25	for example
0:13:27	all the two hundred fifty-atom dictionaries from each speaker are concatenated here
0:13:33	plus the noise exemplars
0:13:37	so we have a representation of the speakers that exist
0:13:42	it is closed-set speaker identification
0:13:44	okay
0:13:46	i
0:13:47	when we are decomposing or factor on the relation to see that all we can
0:13:53	deal with this
0:13:55	the dictionary
0:13:57	the activation vector that you have here is sort of a representative of the
0:14:02	speaker identity by itself
0:14:04	because each of the atoms belongs to one of the speakers
0:14:08	when we decompose it
0:14:10	it is actually the components that are activated, and because we also have
0:14:15	sparsity, not all of them get activated, only a few of them, usually in
0:14:20	the order and fifty
0:14:22	and then we see that normally be again we have one of the things that
0:14:29	the event was called
0:14:31	simple manipulation or something like that but we go over the last in the activations
0:14:37	see if it is just a speaker that's talking
0:14:40	if we just concentrate on one frame this could be noisy
0:14:44	because we have similarities between the speakers, and for some of the events it can
0:14:49	happen that the
0:14:51	atom from another speaker is detected
0:14:54	so what you three a this reliability has also called so now we are concentrating
0:15:00	on the can think that each of each one is activate
0:15:04	if you go averaging over these activations
0:15:07	over the utterance, so for each window we have an activation vector, and for
0:15:12	example for two seconds we have two hundred activations
0:15:15	so we average over the activations just to
0:15:18	somehow deemphasize the content
0:15:21	so the content
0:15:22	gets less important in the averaged activations, because
0:15:27	by averaging or summing them up, the effect of the content is less important, but
0:15:33	the information from the speaker
0:15:36	whose atoms are activated is still present
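The averaging idea can be illustrated with a small sketch. It assumes the noise rows have already been dropped from the activation matrix and that a lookup array `atom_speaker` (a hypothetical name) records which speaker each speech atom belongs to. The numbers are toy values chosen by hand.

```python
import numpy as np

def speaker_scores(X_speech, atom_speaker, n_speakers):
    """Average activations over time, then sum them per speaker.

    X_speech:     (n_atoms, n_windows) activations of the speech atoms only.
    atom_speaker: (n_atoms,) index of the speaker that owns each atom.
    """
    mean_activation = X_speech.mean(axis=1)  # averaging deemphasizes the content
    return np.bincount(atom_speaker, weights=mean_activation, minlength=n_speakers)

# Three speakers with two atoms each, four window positions.
atom_speaker = np.array([0, 0, 1, 1, 2, 2])
X_speech = np.array([
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2],
    [0.9, 0.0, 0.7, 0.0],
    [0.0, 0.8, 0.0, 0.6],
    [0.3, 0.0, 0.0, 0.0],  # an atom of another speaker firing in a single window
    [0.0, 0.0, 0.0, 0.0],
])
scores = speaker_scores(X_speech, atom_speaker, n_speakers=3)
identified = int(np.argmax(scores))
print(scores, identified)
```

A single window can activate an atom of the wrong speaker, as in the first column above, but after averaging over the utterance the true speaker's atoms dominate.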
0:15:40	i think this is it
0:15:42	feature
0:15:43	you're representing if
0:15:44	the speaker again
0:15:46	so in the normal approach we have
0:15:49	mfccs and then
0:15:50	from these we have i-vectors as secondary features, and you do classification on the i-vectors; here we
0:15:56	have a spectrogram and then this sparse representation
0:16:00	on a strict or as the representative of identity representative there are
0:16:09	so what
0:16:10	but able to do this one
0:16:13	to do the classification is to go for lda or plda
0:16:17	i-vector out of three
0:16:19	and some people are doing lda and then
0:16:22	plda to classify the i-vectors
0:16:26	i'm not describing the slides as well no
0:16:28	but what's the
0:16:30	oh yeah i work for the features are sparse features that we have are sparse
0:16:36	so what can be you better
0:16:39	L are sparse
0:16:41	that was a question that i and i
0:16:44	literature and that recently
0:16:47	it is proposed to have sparse discriminant analysis when the data are sparse
0:16:53	sparse discriminant analysis is sort of an extension of penalized discriminant analysis
0:16:59	in penalized discriminant analysis we need to account for the within-class covariance estimation, or
0:17:05	scatter estimation, because the data is sparse and this scatter matrix cannot
0:17:10	be estimated reliably
0:17:11	so there is a term that is added to the within-class scatter matrix
0:17:17	which is normally an identity matrix, to bias the estimation
0:17:24	to make it is sparse
0:17:25	that is
0:17:26	in addition, in this part of the sparse representation we had sort of around eight thousand
0:17:31	five hundred, so we need to have that, and then we want to
0:17:36	make it sparse, so the eigen directions of the between-class scatter matrix are
0:17:43	sort of penalized with the L one norm of the direction, and in this sense it is
0:17:49	possible
0:17:50	you
0:17:51	i think direction sparse to this i get a sparse direction that it is utilized
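One common way to write down the idea described here is a penalized Fisher criterion. The symbols are assumptions introduced for illustration and do not come from the talk: \Sigma_b and \Sigma_w are the between-class and within-class scatter matrices, \gamma I is the identity-type term added to the within-class scatter, and \lambda weights the L1 penalty on the discriminant direction \beta. The formulation actually used in the paper may differ, for example it may be stated through optimal scoring.

```latex
\hat{\beta} \;=\; \arg\max_{\beta}\;
  \beta^{\top} \Sigma_b \, \beta \;-\; \lambda \,\lVert \beta \rVert_1
\qquad \text{subject to} \qquad
  \beta^{\top} \left( \Sigma_w + \gamma I \right) \beta \;\le\; 1
```

With \lambda = 0 and \gamma = 0 this reduces to ordinary linear discriminant analysis; a positive \gamma biases the within-class estimate so that it is usable in a high-dimensional space, and a positive \lambda drives many entries of \beta to zero.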
0:18:00	so
0:18:01	going to description of course
0:18:04	people in this community are too much time to chime corpus
0:18:08	it stands for computational hearing in multisource environments and it was
0:18:13	a challenge at
0:18:15	interspeech two thousand twelve for noise-robust speech recognition
0:18:19	this data
0:18:20	was collected in the U K, and there are thirty four speakers, with five hundred segments
0:18:27	per speaker in training
0:18:29	six snr levels in test, and six hundred files
0:18:33	per snr in test
0:18:37	it does
0:18:39	the noises were collected in a
0:18:40	room environment, a real living room environment, so the noises sort
0:18:46	of vary widely in this data; at the lower snrs we have really nonstationary noises, sort
0:18:54	of the T V is running, the washing machine is working, there are many things
0:18:58	happening at the same time, and kids are screaming, so it is quite
0:19:03	challenging, especially
0:19:05	from zero db to minus six db it is a very challenging database
0:19:10	speech
0:19:12	so the dictionary is limited
0:19:15	each of the segments is about two seconds
0:19:20	so we just present some results
0:19:25	some results that we have
0:19:29	yeah
0:19:30	yeah
0:19:32	you
0:19:33	at all
0:19:34	right
0:19:35	that we had speaker dependent hmm training
0:19:39	this one so it is
0:19:41	we decode each test utterance based on the hmm sets, and we have thirty four
0:19:46	hmms, and we decoded each test segment with
0:19:51	thirty four hmms
0:19:53	to see which hmm wins, as the baseline
0:19:57	so this is the result of that one
0:19:59	considering the speaker
0:20:01	well
0:20:02	so for the clean it is quite good so missus
0:20:06	match but
0:20:08	but to the lower snrs
0:20:11	so the hmm likelihood was not really robust when we
0:20:19	look at it purely from the speaker side
0:20:23	if you just want a number for the
0:20:27	speech recognition results at minus six db for these hmms
0:20:31	it is something in the order of thirty six or so
0:20:37	so going for gmm system is very baseline
0:20:40	but we need to just try to see that what is the what is the
0:20:45	results of a speaker independent
0:20:48	sorry, text-independent system, in which, unlike the hmm system here, we don't care about the content, and
0:20:55	we do
0:20:55	just easy modeling and gmm you know
0:20:59	what is there anything that you can have compared to the H
0:21:04	since this is some sort of
0:21:08	designed for speaker recognition
0:21:10	it gives us a really large margin of improvement for the noise environments
0:21:16	but this was not something that we wanted to focus on, this was just a baseline
0:21:21	included here
0:21:23	so for example of a simple manipulation
0:21:26	you i
0:21:27	remember, simple manipulation is just going to the activations, and after that
0:21:32	just simply averaging all the activations to see which ones get activated
0:21:38	and which speaker
0:21:39	is present in this trial
0:21:42	and
0:21:43	it was still in a good range compared to gmm-ubm and hmm; in noisy
0:21:48	conditions it was quite robust, and the reason is that none of these two
0:21:55	has noise models, but in the exemplar-based approach we have the noise included
0:22:00	inside a dictionary so it is sort of dealing with noise but not
0:22:04	a the noise inside
0:22:07	well
0:22:09	so the next one was cosine scoring, so it was
0:22:13	a simple manipulation but with cosine distance, the cosine distance between the representatives
0:22:21	it was also
0:22:23	a bit better, because it was just the distance between these two that was important
0:22:29	in simple manipulation it is not compared to anything, it was just the test utterance
0:22:34	on which we do simple manipulation of the activations
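Cosine scoring between activation-based representatives can be sketched as follows. The talk does not spell out how the per-speaker reference vectors are obtained, so `speaker_vectors` is a hypothetical input here, for instance the mean averaged-activation vector over a speaker's training utterances. The values are toy numbers.

```python
import numpy as np

def cosine_scores(test_vector, speaker_vectors, eps=1e-12):
    """Cosine similarity between one test vector and each speaker's reference vector.

    test_vector:     (dim,) averaged activation vector of the test utterance.
    speaker_vectors: (n_speakers, dim) one reference vector per speaker.
    """
    norms = np.linalg.norm(speaker_vectors, axis=1) * np.linalg.norm(test_vector)
    return (speaker_vectors @ test_vector) / (norms + eps)

speaker_vectors = np.array([
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.1, 0.5, 0.4],
])
test_vector = np.array([0.05, 0.1, 0.45, 0.4])
scores = cosine_scores(test_vector, speaker_vectors)
best = int(np.argmax(scores))
print(scores, best)
```

Unlike the plain averaging rule, this compares the test utterance against something learned from training data, and it ignores the overall scale of the activations.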
0:22:38	and we put this
0:22:40	we said that no we can have a training for the training we use pot
0:22:47	you
0:22:48	and this training brought improvements in the sort of that's a close to the noisy
0:22:56	for rating that the reason was that the training was that the clean so the
0:23:02	exemplars of four hundred
0:23:04	or plastic or just clean speech
0:23:08	and the final what is that
0:23:12	is that we train
0:23:15	the training method used is sparse discriminant analysis, and
0:23:19	the difference
0:23:20	i
0:23:20	i two of them on it
0:23:23	some of the effect of having
0:23:25	sparse features in the
0:23:28	for the input of training
0:23:31	this is really important: when the features are sparse, the modeling technique should also be
0:23:37	sparse to deal better with the data
0:23:42	and we actually sort of improved this average by including group sparsity on
0:23:48	top of the normal sparsity
0:23:51	this paper a unique what's it is gonna be in the speech recognition systems that
0:23:56	is most likely that you have not gonna see so i'm taking much as presented
0:24:00	here
0:24:01	so group sparsity means that, when imposing the normal sparsity, we say
0:24:07	that
0:24:08	you should select few activations; with group sparsity we also add
0:24:15	more penalty if the activations come from different groups of speakers
0:24:20	so it sort of forces the activations to be inside
0:24:25	one of the speakers
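The difference between ordinary and group sparsity can be seen in a small numeric sketch. The exact penalty used in the work is not given in the talk; the mixed L2,1 norm below (one L2 norm per speaker group, summed over groups) is a common choice and is an assumption here, as are the function names.

```python
import numpy as np

def l1_penalty(x):
    """Ordinary sparsity penalty: sum of absolute activations."""
    return float(np.sum(np.abs(x)))

def group_penalty(x, atom_speaker, n_speakers):
    """Mixed L2,1 norm: L2 norm of each speaker's activations, summed over speakers."""
    energy = np.bincount(atom_speaker, weights=x ** 2, minlength=n_speakers)
    return float(np.sum(np.sqrt(energy)))

atom_speaker = np.array([0, 0, 1, 1])          # two speakers, two atoms each
concentrated = np.array([1.0, 1.0, 0.0, 0.0])  # both active atoms from speaker 0
spread = np.array([1.0, 0.0, 1.0, 0.0])        # one active atom per speaker

print(l1_penalty(concentrated), l1_penalty(spread))
print(group_penalty(concentrated, atom_speaker, 2), group_penalty(spread, atom_speaker, 2))
```

Both activation patterns cost the same under the L1 penalty, but the group penalty is lower when the active atoms belong to a single speaker, which is the behavior described above.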
0:24:27	so it can be improvement in development and test set a specially action
0:24:34	so this work
0:24:37	containing
0:24:38	and
0:24:39	it is
0:24:40	one
0:24:42	that is being now
0:24:44	it's a lot are working on it to fit it's to thing is this are
0:24:48	well as we have posted there i mean that close but we are allowed to
0:24:53	use the speaker information in the training
0:24:57	and the well there are some issues about only about the
0:25:01	the channel effect and dictionary size you nist and with this one
0:25:07	so far we probably no noise and them and the channel estimate the channel difference
0:25:12	with the fact that the channel if you look at this is
0:25:22	is the
0:25:24	the activations are different for each frame, but if we consider the
0:25:31	channel to be constant over an utterance
0:25:34	we are able to estimate the channel difference between what has been observed in
0:25:39	the training and or training or making detection
0:25:45	the test
0:25:47	thank you
0:25:57	yeah
0:26:00	oh
0:26:01	not really different here because
0:26:06	you was provided that the one you have to horses
0:26:10	and each of these two seconds were happening somewhere in this
0:26:15	and we were able to see that the noise before happening inside
0:26:23	i
0:26:25	yeah
0:26:30	i
0:26:35	the strength of this method is that it doesn't care about the snr
0:26:39	or something like that when it is combining the speech and noise atoms
0:26:43	of all the T V
0:26:45	one inside
0:26:47	also there
0:26:48	and about the different noise types
0:26:50	what is the M
0:26:52	sorry, the idea we are working on right now is that we don't really need
0:26:57	the noise dictionary; what is needed is a sort of initialization for the noise dictionary, and
0:27:03	it is adapted during that
0:27:05	in the parts where we see that there is no speech activation, so
0:27:10	we take that as a sort of
0:27:12	adaptation for the noise dictionary
0:27:27	i
0:27:34	yeah
0:27:43	i
0:27:47	so all sort of yeah
0:27:52	oh
0:27:55	i
0:28:03	so we are also estimating the
0:28:05	this one but not also which is
0:28:07	that the a non-negative matrix factorisation
0:28:12	there is a linear on this one that these are all
0:28:15	these are not simply so this more feature
0:28:56	well
0:29:01	yeah so
0:29:03	each frame
0:29:07	and the dictionary
0:29:09	test for all zero
0:29:11	as we are able we have to be able i
0:29:17	speech inside the
0:29:19	we see
0:29:21	yeah
0:29:21	so
0:29:23	yeah we know
0:29:25	computes vol
0:29:27	so
0:29:28	generated in the morning to be you know
0:29:32	maybe
0:29:34	speech or
0:29:42	there was
0:29:51	i

SESSION 08: Features for Speaker Recognition

Rahim Saeidi