0:00:15 | hello everybody i'm happy to be here

0:00:19 | because this is my first work in the field of speaker recognition

0:00:24 | and i want to thank nist for providing such a chance for us to participate

0:00:36 | because i think it is

0:00:37 | an important opportunity

0:00:44 | that we can improve speaker recognition systems

0:00:47 | by holding these kinds of challenges

0:00:52 | okay |

0:00:54 | what i'm going to propose is a

0:00:57 | kind of idea from beamforming

0:01:02 | which is a well-known technique in signal processing

0:01:09 | okay, what am i going to

0:01:11 | present

0:01:15 | first i want to try to explain what beamforming is and

0:01:21 | how we apply it to this challenge

0:01:26 | then explain how we can solve the problem with adaptive filtering

0:01:31 | and then find an optimal beamformer in order to

0:01:37 | solve the problem

0:01:38 | first of all without any constraints, and then

0:01:43 | we include uncertainty

0:01:45 | to make it more robust

0:01:47 | and our work also includes some modification of the

0:01:52 | impostor covariance matrix and some score normalisation in order to

0:01:57 | improve the

0:01:58 | performance

0:02:01 | so what do we know about i-vectors

0:02:09 | i-vectors are interesting because they

0:02:12 | provide

0:02:14 | a fixed-dimensional representation of arbitrary-length

0:02:17 | speech

0:02:18 | and the

0:02:22 | problem with i-vectors is that they vary with different environments and speakers and so on

0:02:30 | and this is the challenge |

0:02:32 | in this field |

0:02:37 | okay, with intersession compensation we are going to remove this unwanted variability, but in

0:02:44 | this challenge

0:02:46 | using probabilistic linear discriminant analysis is not going to be a good idea since

0:02:53 | we don't have any labels for the data

0:02:56 | and if we provide these labels for the data

0:03:05 | the performance of that clustering-based labelling will affect the performance of plda

0:03:15 | okay |

0:03:19 | one important thing is

0:03:23 | what

0:03:24 | we have

0:03:25 | is a lot of

0:03:27 | speech data

0:03:30 | so

0:03:31 | we can make use of these amounts of

0:03:34 | available data, for example in a speech centre

0:03:37 | a telephone speech centre, there is a lot of speech data passing through

0:03:44 | so we can take advantage of these data in order to improve speaker

0:03:49 | recognition

0:03:50 | instead of providing some

0:03:54 | artificial

0:03:55 | labels for the data

0:04:01 | so plda and similar approaches

0:04:05 | need to

0:04:07 | have labels

0:04:08 | so it's not a good idea to use them, and we take on new

0:04:13 | approaches

0:04:15 | to

0:04:16 | solve the problem

0:04:18 | so if we can't

0:04:21 | find the within-speaker scatter matrix reliably

0:04:27 | why don't we try to find the between-speaker variance and increase that

0:04:35 | okay |

0:04:36 | the first thing i'm going to explain is beamforming

0:04:41 | it is a signal processing technique

0:04:44 | used in sensor arrays in order to direct signal transmission to a

0:04:49 | desired target

0:04:52 | and adaptive filtering is used to

0:04:58 | do optimal filtering and interference rejection

0:05:01 | in order to estimate the signal of interest

0:05:07 | so the beamforming operation is that a signal impinges on some antennas

0:05:14 | as waves

0:05:16 | from the same distance

0:05:19 | it then passes through a filter

0:05:23 | and the result is

0:05:26 | that the filter

0:05:28 | passes the

0:05:29 | desired angles

0:05:31 | and rejects all the other angles

0:05:35 | this is the same as the

0:05:38 | dot product of a filter and the signal

0:05:44 | so if we

0:05:46 | illustrate the

0:05:47 | idea, in

0:05:50 | omnidirectional antennas

0:05:53 | the signal, the interference, and the target are all treated equally, but in the beamformer

0:06:01 | we focus on the target

0:06:07 | so what is the

0:06:09 | filter? we are going to design a filter like this, w transpose times

0:06:15 | i, where i is the i-vector and w is the filter

0:06:20 | so we want to

0:06:23 | pass the target speaker through this filter

0:06:26 | but reject all the other, impostor, speakers

0:06:30 | so the development set gives the

0:06:33 | impostors, all the impostors come from the development set

0:06:38 | so if we

0:06:41 | use the mean square error

0:06:43 | in order to solve the problem

0:06:45 | we reach this result, as you can see here

0:06:54 | okay |

0:06:55 | the w is the

0:06:56 | optimal filter for this solution

0:07:00 | and r is the

0:07:02 | autocorrelation matrix

0:07:04 | and a is the target, which can be estimated by using the mean of the target i-vectors
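The MMSE filter described here can be sketched numerically as follows; the dimensions, the random stand-in data, and the use of the enrollment mean as the target `a` are illustrative assumptions, not the challenge data:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20                                 # toy i-vector dimension (real systems use e.g. 600)

dev = rng.standard_normal((500, dim))    # development i-vectors, treated as impostors
enroll = rng.standard_normal((3, dim))   # hypothetical enrollment i-vectors for one target
a = enroll.mean(axis=0)                  # target model: mean of the enrollment i-vectors

# Autocorrelation matrix of the impostor i-vectors.
R = dev.T @ dev / len(dev)

# MMSE solution: w = R^{-1} a, a Wiener-style filter.
w = np.linalg.solve(R, a)

def score(test_ivec):
    # Filter output w^T i: high for the target, low on average for impostors.
    return w @ test_ivec
```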

0:07:11 | okay, let's compare it with the baseline system

0:07:15 | the baseline system's score

0:07:18 | is computed after whitening the i-vectors

0:07:21 | and using the cosine similarity to find the score

0:07:28 | you can see that when

0:07:30 | we use cosine similarity, we should normalize the magnitude of the i-vectors

0:07:37 | as displayed, but in the

0:07:40 | adaptive filtering, as i just

0:07:43 | explained

0:07:45 | there is no normalisation of

0:07:47 | the i-vectors
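For contrast, the cosine-similarity baseline just described (whitening followed by length-normalized cosine scoring) might look like the sketch below; the whitening estimate and toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
dev = rng.standard_normal((500, 20))     # development i-vectors (toy stand-ins)
model = rng.standard_normal(20)          # target model i-vector
test = rng.standard_normal(20)           # test i-vector

# Whitening transform estimated from the development set.
mu = dev.mean(axis=0)
W_white = np.linalg.cholesky(np.linalg.inv(np.cov(dev, rowvar=False)))

def whiten(x):
    return W_white.T @ (x - mu)

def cosine_score(m, t):
    m, t = whiten(m), whiten(t)
    # Unlike the adaptive filter, BOTH i-vectors are length-normalized here.
    return (m @ t) / (np.linalg.norm(m) * np.linalg.norm(t))
```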

0:07:50 | okay, let's go a little further and change the criterion

0:07:55 | in beamforming there is the minimum variance distortionless response

0:08:01 | this is an approach that maximizes the signal-to-interference ratio

0:08:07 | so we want to

0:08:11 | maximize

0:08:12 | this ratio

0:08:14 | that is, to maximize the output of the filter when the target passes

0:08:19 | but to reject all the

0:08:22 | impostors, that is, we want to minimize the

0:08:25 | denominator while keeping the numerator fixed

0:08:29 | in order to

0:08:30 | solve the problem we assume that

0:08:33 | the numerator

0:08:35 | equals one

0:08:36 | that's the

0:08:38 | distortionless

0:08:39 | constraint

0:08:41 | so

0:08:41 | we want to minimize the

0:08:44 | denominator, which is the power of the impostors being passed through the

0:08:49 | filter

0:08:51 | the expected value of that

0:08:53 | and here

0:08:55 | r is the

0:08:58 | impostor covariance matrix, so the optimum solution for this problem

0:09:03 | can easily be found this way
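A minimal sketch of the MVDR filter as described: minimize the impostor output power w^T R w subject to the distortionless constraint w^T a = 1 (the data here are toy assumptions; R is the impostor covariance as in the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 20
impostors = rng.standard_normal((500, dim))  # development set = impostors
a = rng.standard_normal(dim)                 # target model i-vector

R = np.cov(impostors, rowvar=False)          # impostor covariance matrix

# MVDR: minimize w^T R w subject to w^T a = 1.
Rinv_a = np.linalg.solve(R, a)
w = Rinv_a / (a @ Rinv_a)
```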

0:09:07 | so let's just compare it with the cosine similarity

0:09:11 | the baseline system is like that, and the mvdr is proposed this way

0:09:16 | so if you look at this idea we see that

0:09:22 | this mvdr proposes a new similarity measure

0:09:27 | that does not include the normalisation of the test i-vector but focuses more on the

0:09:33 | target

0:09:37 | the results

0:09:39 | show that

0:09:40 | it provides an

0:09:43 | improvement of seven point seven percent

0:09:46 | in the

0:09:47 | i-vector challenge

0:09:50 | so let's go one

0:09:53 | step further and make it more robust

0:09:57 | as we had in the previous

0:10:00 | slide, we used the mean of

0:10:03 | all the target i-vectors in order to

0:10:08 | estimate the target, since the mvdr supposes that there is no uncertainty regarding the

0:10:13 | target

0:10:14 | but in this

0:10:17 | approach, the linearly constrained minimum variance, we include uncertainty by some linear constraints

0:10:26 | so we put all the i-vectors provided for the target

0:10:32 | in the matrix c

0:10:35 | and

0:10:36 | we enforce

0:10:40 | that they pass the filter

0:10:43 | with the value of one

0:10:46 | so f is equal to one

0:10:49 | so if we solve this problem

0:10:53 | the optimal filter will be as you can see here

0:10:59 | and when we applied it to the challenge

0:11:02 | there is another improvement of three point seven percent relative to the mvdr

0:11:08 | and eleven point one percent relative to the baseline system
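The LCMV filter can be sketched numerically: all enrollment i-vectors of the target are stacked as columns of C, and each is constrained to pass the filter with gain one (f is a vector of ones); sizes and data are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, n_enroll = 20, 3
impostors = rng.standard_normal((500, dim))
C = rng.standard_normal((dim, n_enroll))  # columns: the target's enrollment i-vectors
f = np.ones(n_enroll)                     # each target i-vector must pass with gain 1

R = np.cov(impostors, rowvar=False)       # impostor covariance matrix

# LCMV: minimize w^T R w subject to C^T w = f.
Rinv_C = np.linalg.solve(R, C)
w = Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)
```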

0:11:13 | so

0:11:14 | now we can

0:11:17 | do an additional

0:11:20 | job

0:11:21 | in order to improve the performance

0:11:24 | since, as usual in signal processing

0:11:27 | there are many more techniques, such as the robust beamformer or the norm-constrained

0:11:33 | robust beamformer, which

0:11:37 | improve the performance by diagonally loading the

0:11:41 | covariance matrix

0:11:44 | i used a similar approach, but used the top impostor i-vectors

0:11:51 | that were the most similar to the target i-vectors

0:11:55 | so

0:11:57 | in this way we passed all the impostors through the

0:12:03 | filter for each target

0:12:05 | and

0:12:06 | selected those, the top six thousand impostors, with the highest similarity

0:12:13 | and computed the covariance matrix again
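That covariance modification step can be sketched as follows: score every impostor with the target's filter, keep the top-scoring ones (six thousand in the talk; a smaller number in this toy example), and recompute the impostor covariance from those alone; all sizes and data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 20
impostors = rng.standard_normal((5000, dim))
a = rng.standard_normal(dim)                 # target model i-vector

# Initial MVDR-style filter from the full impostor covariance.
R = np.cov(impostors, rowvar=False)
w = np.linalg.solve(R, a)
w /= a @ w

# Pass every impostor through the target's filter and keep the most
# target-like ones (the talk keeps the top six thousand of the dev set).
scores = impostors @ w
top = impostors[np.argsort(scores)[-1000:]]

# Recompute the covariance from those impostors only and re-derive the filter.
R_top = np.cov(top, rowvar=False)
w_ref = np.linalg.solve(R_top, a)
w_ref /= a @ w_ref
```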

0:12:17 | this resulted in a

0:12:20 | very good improvement

0:12:22 | of twenty one point five percent

0:12:25 | relative to the baseline system

0:12:28 | we can see the

0:12:30 | score distribution

0:12:31 | for all the impostors when compared to the

0:12:34 | target

0:12:36 | and after

0:12:39 | applying this covariance matrix modification we can see

0:12:43 | a good reduction

0:12:46 | in the scores

0:12:50 | another

0:12:52 | factor to improve the

0:12:54 | speaker recognition performance was to use score normalisation

0:12:59 | i just found this relation

0:13:02 | the best

0:13:04 | contrary to some others

0:13:09 | like z-norm or t-norm, which use the variance of the scores

0:13:14 | we could not do that

0:13:18 | and this resulted in further improvement
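The exact normalisation relation is only on the slide; since the talk notes that the score variance (as used by z-norm and t-norm) could not be used, a mean-only cohort normalisation is one plausible reading, sketched here purely as an illustration with hypothetical scores:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical raw scores of one target's filter against an impostor cohort.
cohort = rng.standard_normal(1000) * 0.3 + 0.1

def mean_norm(raw, cohort_scores):
    # Remove only the cohort mean shift; z-norm/t-norm would also divide
    # by the cohort standard deviation, which the talk says was not usable.
    return raw - np.mean(cohort_scores)
```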

0:13:23 | okay |

0:13:24 | now let's go and be more supervised

0:13:28 | okay

0:13:29 | we use the within-class covariance matrix

0:13:33 | found using some clustering method

0:13:37 | but this clustering method is somewhat different

0:13:41 | as we

0:13:42 | treated each single i-vector in the development set as a target

0:13:49 | and found the closest or most similar

0:13:54 | i-vector to that target

0:13:56 | and this is repeated, each time adding one more, in order to find more similar

0:14:04 | i-vectors

0:14:05 | after finding those i-vectors

0:14:08 | we use this formula in order to compute the within-class covariance of those

0:14:12 | i-vectors, which are assumed to be from the same speaker
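The unsupervised clustering just described can be sketched like this: each development i-vector is a seed, grouped with its most similar neighbours (the neighbourhood size `k` is an assumption), and the within-class covariance is the scatter of each group around its own mean:

```python
import numpy as np

rng = np.random.default_rng(6)
dev = rng.standard_normal((200, 20))   # unlabeled development i-vectors (toy)
k = 4                                  # neighbours grouped with each seed (assumed size)

# Treat every development i-vector as a "target" and collect its k most
# similar i-vectors, assumed to come from the same speaker.
sims = dev @ dev.T
np.fill_diagonal(sims, -np.inf)        # a vector must not match itself
clusters = [np.append(np.argsort(sims[i])[-k:], i) for i in range(len(dev))]

# Within-class covariance: scatter of each cluster around its own mean.
W = np.zeros((20, 20))
for idx in clusters:
    members = dev[idx]
    centered = members - members.mean(axis=0)
    W += centered.T @ centered
W /= sum(len(idx) for idx in clusters)
```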

0:14:18 | and the final model

0:14:22 | can be found by adding this w, since what we applied to the challenge handles

0:14:28 | the inter-session variability as well as

0:14:31 | rejecting impostors

0:14:34 | so we added them together

0:14:37 | to find this optimum filter

0:14:39 | so you can see the results

0:14:43 | it

0:14:45 | leads to an improvement of twenty five to twenty seven point five percent

0:14:50 | relative to the baseline system

0:14:54 | so in conclusion

0:14:55 | we have proposed a new

0:14:59 | idea, borrowed from signal processing, of adaptive filtering in order to solve the

0:15:04 | i-vector challenge

0:15:10 | so a modification of the impostor covariance matrix is possible

0:15:15 | this way

0:15:16 | so

0:15:17 | we have used

0:15:20 | this idea

0:15:22 | and i think we can

0:15:27 | improve speaker recognition if we apply it to plda

0:15:32 | but we did not have enough time to do that

0:15:37 | thank you for listening

0:15:58 | so at one time, i do not remember exactly when, we did language id, and

0:16:05 | we were doing something with cosine, and like you, the target model was length-normalized but the test

0:16:12 | was not, the same as what you did

0:16:14 | what we found was that the backend was able to calibrate the

0:16:19 | scores, but you have some shift in the scores from test to test

0:16:24 | so the calibration suffered; the calibration worked, but

0:16:32 | it was also about the score offset estimation

0:16:36 | and we found that it was much worse when you

0:16:40 | don't normalise the test

0:16:42 | so no

0:16:44 | for language id it was a problem; maybe for speaker it's okay, but

0:16:47 | okay, thank you

0:16:57 | good |