Speech Transcript - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

0:00:13	hello
0:00:14	my name is an and i one of the unwieldy timit speech signal just
0:00:19	i'm going to tell you about our work on deep speaker embeddings for far-field speaker
0:00:24	recognition on short utterances
0:00:34	the growth of the market of virtual assistant systems and smart speakers fuels the demand for
0:00:39	far-field speaker recognition
0:00:41	as the environmental conditions those devices are usually used in are in most cases far from
0:00:46	clean, speech processing algorithms additionally have to be robust to noise
0:00:51	and the last thing there does recognition you have what incomprehension complete on the results
0:00:57	you know what are you
0:01:00	is performance on short duration test segments
0:01:04	so the main focus of our study
0:01:07	was to design a model that would
0:01:10	not only perform well on unseen far-field audio samples recorded in noisy environments but
0:01:16	also retain the recognition quality when tested on short speech segments
0:01:22	in order to achieve this
0:01:24	we started by moving the training data closer to the testing scenario
0:01:29	for that we investigated the effect
0:01:31	on overall recognition performance of changes in how we do the reverberation of the
0:01:37	training data
0:01:39	the second concern was the problem of the presence of segments with non-speaker-specific
0:01:45	information such as background noise and silence in the audio, so we prioritised the robustness
0:01:51	to noise of the voice activity detector which was used to cope with this issue
0:01:59	next we tried different acoustic features as well as embedding extractor architectures
0:02:06	we also investigated the effect of
0:02:09	back-end-level adaptation and score normalisation
0:02:14	as the
0:02:16	data is the foundation of every data-dependent experiment, we will first introduce the data used in the current
0:02:22	study
0:02:24	so we have constructed four datasets
0:02:27	that are primarily comprised of VoxCeleb1 and VoxCeleb2 data
0:02:32	except for training data
0:02:35	one and three, which also have a fraction of SITW data mixed in
0:02:43	the significant
0:02:45	difference between these datasets is the augmentation used
0:02:49	for training data two and four the standard Kaldi-style augmentation is used
0:02:54	while training data one and three are augmented
0:02:59	in a fundamentally different way
0:03:03	so in contrast to the augmentation scheme developed in reality or readable moist and speech
0:03:11	rate sure
0:03:12	what two thousand
0:03:14	we generated room impulse responses for different positions of sources
0:03:21	and distractors
0:03:23	to generate those impulse responses we used the impulse response generator proposed by
0:03:31	Allen and Berkley
0:03:34	this way we tried to narrow the gap between real and
0:03:39	simulated room impulse responses by creating more realistic ones
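A minimal sketch of the reverberation step described above, assuming that a clean recording is simply convolved with a simulated room impulse response and rescaled to its original peak level. The function names, the toy impulse response, and the peak rescaling are illustrative assumptions, not details taken from the talk, and the impulse response generator itself is not reproduced here.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sample sequences (plain-Python sketch)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def reverberate(speech, rir):
    """Apply a room impulse response and keep the original length."""
    wet = convolve(speech, rir)[: len(speech)]
    # Rescale to the original peak so reverberation does not change the level
    # (an assumption made for this sketch).
    peak_in = max(abs(x) for x in speech)
    peak_out = max(abs(x) for x in wet)
    if peak_out == 0.0:
        return wet
    return [x * peak_in / peak_out for x in wet]


# Toy example: a direct path followed by a single echo (hypothetical RIR).
toy_rir = [1.0, 0.0, 0.5]
dry = [1.0, 0.0, 0.0, 0.0]
wet = reverberate(dry, toy_rir)
```

In a real pipeline the convolution would normally be done with an FFT-based routine, since simulated impulse responses can be thousands of samples long.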
0:03:46	the benchmark of our best speaker recognition systems verified that
0:03:51	the described scheme is indeed better than the standard augmentation scheme
0:03:59	as you can see
0:04:01	so now let's start with the
0:04:06	data preprocessing
0:04:07	as acoustic features we experimented with 40-dimensional MFCCs and 80-dimensional
0:04:13	mel filter banks
0:04:16	the extracted acoustic features underwent either local mean normalisation
0:04:22	followed by global mean and variance normalisation or just a single local mean normalisation
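The two normalisation steps mentioned here can be sketched as follows. This is an illustration under stated assumptions: the sliding-window length of 300 frames is hypothetical, and the global statistics are computed over the frames passed in, whereas a real system would more likely estimate them once over the training data.

```python
import math


def local_mean_norm(frames, window=300):
    """Subtract a sliding-window mean from every frame (window is in frames)."""
    half = window // 2
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        mean = [sum(f[d] for f in frames[lo:hi]) / (hi - lo) for d in range(dim)]
        out.append([frames[t][d] - mean[d] for d in range(dim)])
    return out


def global_mean_var_norm(frames, eps=1e-8):
    """Normalise each feature dimension to zero mean and unit variance."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return [[(f[d] - mean[d]) / (std[d] + eps) for d in range(dim)] for f in frames]


# Toy feature matrix: 4 frames, 2 dimensions.
feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normed = global_mean_var_norm(local_mean_norm(feats, window=3))
```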
0:04:30	if we look at the benchmark, it reveals that the model trained on 80-dimensional
0:04:34	mel filter banks outperforms the same model trained on 40-dimensional MFCCs
0:04:40	on the
0:04:41	presented test protocols
0:04:45	the next preprocessing stage we want to draw attention to is voice activity detection
0:04:52	in our previous studies we found the conventional energy-based voice activity detector to be
0:04:58	sensitive to noise
0:05:00	so we decided to create our own neural-network-based voice activity detector
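To make the noise sensitivity concrete, here is a deliberately simplified energy-based detector. It is only a sketch: the fixed threshold and the frame sizes are assumptions, and it is not the energy-based detector the speakers actually used. Because the decision depends on frame energy alone, stationary noise that is loud enough is labelled as speech.

```python
import math


def frame_log_energy(samples, frame_len=200, hop=80):
    """Log energy per frame; 200/80 samples are 25 ms / 10 ms at 8 kHz."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(math.log(sum(x * x for x in frame) + 1e-10))
    return energies


def energy_vad(samples, threshold, frame_len=200, hop=80):
    """Mark a frame as speech when its log energy exceeds a fixed threshold."""
    return [e > threshold for e in frame_log_energy(samples, frame_len, hop)]


quiet = [0.001] * 400          # near-silence
loud = [0.5] * 400             # stand-in for speech
decisions = energy_vad(quiet + loud, threshold=0.0)

# A constant noise floor of moderate amplitude is also flagged as "speech".
noise_only = energy_vad([0.2] * 400, threshold=0.0)
```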
0:05:07	our voice activity detector is based on the U-Net architecture, which was initially developed
0:05:15	for medical image segmentation
0:05:18	the joyous or unit is actually read to the tree don't betweens you to one
0:05:24	we trained it on
0:05:25	telephone
0:05:27	data and a small fraction of microphone data which was downsampled to 8 kHz
0:05:33	the labels for this dataset
0:05:37	were obtained either by manual segmentation or by using automatic
0:05:46	speech-recognition-based voice activity segmentation
0:05:50	followed by manual post-processing
0:05:55	as for the results, what we observe is that the U-Net-based voice activity
0:06:00	detector actually helps us to improve the quality of the systems in difficult conditions
0:06:07	compared to the standard Kaldi energy-based voice activity detector
0:06:17	let's now dive into the details of the main component of our system
0:06:23	the embedding extractors
0:06:26	the embedding extractor is comprised of a frame-level network
0:06:32	then a statistics pooling layer and
0:06:36	also the segment-level or
0:06:41	utterance-level network
0:06:43	the first block of the network analyses the acoustic features at the
0:06:48	frame level
0:06:50	for the frame level we considered two types of neural networks
0:06:55	TDNN-based ones
0:07:00	and also ResNet-based ones
0:07:04	the main difference between
0:07:07	these two is the type of kernel and the processing
0:07:12	that what
0:07:15	the frame-level network is followed by a statistics pooling layer
0:07:19	that
0:07:22	aggregates the frame-level outputs along time
0:07:25	the pooled feature maps are then flattened and passed to the segment level that extracts
0:07:30	utterance-level information
0:07:32	the resulting embedding vector is then normalised and classified
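The pooling step described above can be sketched like this, assuming the common formulation in which the per-dimension mean and standard deviation over time are concatenated. The function name and the small `eps` term are illustrative choices rather than details from the talk.

```python
import math


def statistics_pooling(frame_outputs, eps=1e-10):
    """Concatenate per-dimension mean and standard deviation over time."""
    n, dim = len(frame_outputs), len(frame_outputs[0])
    mean = [sum(f[d] for f in frame_outputs) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frame_outputs) / n + eps)
        for d in range(dim)
    ]
    return mean + std


# Two frames with two channels each: the output has 2 * 2 = 4 values.
pooled = statistics_pooling([[1.0, 0.0], [3.0, 0.0]])
```

The output size depends only on the number of channels, not on the number of frames, which is what lets the segment-level layers handle utterances of any duration.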
0:07:38	we started with the well-known extended version of
0:07:42	TDNN
0:07:43	and replaced a
0:07:48	time-delay
0:07:51	layer with an LSTM layer
0:07:55	then we moved to the factorised TDNN architecture and finally
0:08:00	ended our experiments with ResNets
0:08:03	where we tried the ResNet34 configuration and a heavier ResNet
0:08:10	with SE blocks added
0:08:16	and
0:08:19	analysing the test results for those architectures we have drawn two conclusions
0:08:25	first
0:08:26	ResNet34 outperforms x-vectors
0:08:33	second no improvement is achieved by switching to heavier ResNets
0:08:39	as for the loss functions we stuck to additive margin softmax
0:08:43	which is well studied in the area of speaker recognition
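For reference, a minimal sketch of the additive margin softmax loss for a single training sample. It assumes the usual formulation in which a margin is subtracted from the target-class cosine before scaling. The scale and margin values below are illustrative defaults, not the settings used in this work.

```python
import math


def am_softmax_loss(cosines, target, scale=30.0, margin=0.2):
    """Additive margin softmax loss for one sample.

    cosines[j] is the cosine between the embedding and the weight of class j.
    scale and margin are illustrative values, not the ones used in the talk.
    """
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.3]
with_margin = am_softmax_loss(cosines, target=0)
without_margin = am_softmax_loss(cosines, target=0, margin=0.0)
```

With the margin set to zero this reduces to ordinary scaled softmax cross-entropy. The margin makes the same sample cost more, which pushes the network towards a larger gap between the target class and the others.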
0:08:49	we also tried to train our best model using D-Softmax
0:08:55	which was recently proposed; what it actually does is dissect the softmax
0:09:01	loss into independent intra-class and inter-class objectives
0:09:06	however we were not able to make D-Softmax-based training outperform AM-Softmax
0:09:14	in this work we use cosine similarity and cosine similarity
0:09:20	metric learning
0:09:21	as scoring
0:09:23	we
0:09:24	also used a
0:09:25	simple domain adaptation procedure based on centering the data
0:09:30	on an in-domain set by mean speaker embedding subtraction
0:09:35	the mean vector is calculated using the adaptation set in this case
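The centering step combined with cosine scoring can be sketched as below. This is an illustration only: the toy two-dimensional embeddings and all function names are hypothetical, and the metric-learning back-end mentioned earlier is not shown.

```python
import math


def mean_vector(embeddings):
    """Per-dimension mean of a list of embeddings."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def centre(embedding, mean):
    """Subtract the in-domain mean from one embedding."""
    return [x - m for x, m in zip(embedding, mean)]


def cosine_score(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


adaptation_set = [[2.0, 2.0], [4.0, 2.0], [3.0, 5.0]]  # toy in-domain embeddings
mu = mean_vector(adaptation_set)
enrol, test = [4.0, 3.0], [5.0, 3.0]
raw = cosine_score(enrol, test)
adapted = cosine_score(centre(enrol, mu), centre(test, mu))
```

Removing the shared offset changes the angle between embeddings, so the scores change even though no model parameters are retrained.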
0:09:40	we also adaptively normalise the scores with the statistics of the top
0:09:47	ten percent best-scoring impostors for each embedding
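A sketch of this kind of adaptive score normalisation, assuming the symmetric (s-norm) form in which the raw trial score is normalised with the enrolment-side and test-side cohort statistics and the two results are averaged. The cohort scores here are toy numbers, and the function names are hypothetical.

```python
import math


def top_stats(cohort_scores, top_fraction=0.1):
    """Mean and standard deviation of the best-scoring share of the cohort."""
    k = max(1, int(len(cohort_scores) * top_fraction))
    top = sorted(cohort_scores, reverse=True)[:k]
    mean = sum(top) / k
    std = math.sqrt(sum((s - mean) ** 2 for s in top) / k)
    return mean, max(std, 1e-8)  # floor avoids division by zero


def adaptive_s_norm(raw_score, enrol_cohort_scores, test_cohort_scores):
    """Symmetric normalisation with per-embedding top-cohort statistics."""
    mu_e, sd_e = top_stats(enrol_cohort_scores)
    mu_t, sd_t = top_stats(test_cohort_scores)
    return 0.5 * ((raw_score - mu_e) / sd_e + (raw_score - mu_t) / sd_t)


# Scores of the enrolment and test embeddings against 20 impostor embeddings.
enrol_cohort = [0.5, 0.3] + [0.0] * 18
test_cohort = [0.6, 0.2] + [0.0] * 18
normalised = adaptive_s_norm(0.8, enrol_cohort, test_cohort)
```

Only the two highest cohort scores on each side (ten percent of twenty) contribute to the statistics, so the normalisation adapts to the impostors that are closest to each embedding.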
0:09:52	mean adaptation allows us to reduce the equal error rate and improve minDCF
0:09:59	but only slightly
0:10:01	but if we compare it with
0:10:07	score normalisation we will see that score normalisation outperforms mean adaptation
0:10:14	on the majority of the test sets, so now we can make a short summary of
0:10:19	the results
0:10:21	change during training
0:10:24	propose to model for jesus on the duration of training samples
0:10:30	so
0:10:34	what we saw is that systems based on ResNet architectures outperform x-vector-based
0:10:39	systems in all experiments
0:10:41	the U-Net-based
0:10:43	voice activity detector outperforms the energy-based voice activity detector
0:10:48	and score normalisation improves the performance of all extractor types on the majority
0:10:55	of the test sets
0:10:57	also careful training data preparation can slightly improve the
0:11:04	quality of the systems
0:11:07	D-Softmax-based training doesn't help to improve ResNet EER
0:11:12	or minDCF, and we also did not achieve any gain by using more complex ResNets
0:11:24	five right
0:11:28	for testing our hypothesis on short-duration test segments
0:11:32	we conducted
0:11:36	experiments
0:11:38	with test lengths ranging from a point five seconds
0:11:43	first we have seen that independently of the test sample duration heavier ResNets
0:11:47	still don't do better compared to ResNet34
0:11:53	secondly we validated that ResNet-based architectures
0:11:59	beat
0:12:04	the ones
0:12:06	based on
0:12:09	x-vectors in terms of equal error rate and minDCF for the tests
0:12:13	on a
0:12:15	a four and the while to twenty five second segments
0:12:21	it is
0:12:22	also
0:12:24	easy to see that TDNN-based x-vector systems degrade more
0:12:29	than ResNet systems on short segments
0:12:34	this figure is an illustration of the relative differences between moving
0:12:41	from testing ResNet on full sample durations to short-length segments and moving from
0:12:46	testing x-vectors on full durations to short-length segments
0:12:57	we have also tried to see how
0:13:00	the augmentation we refer to as more realistic compares to
0:13:05	the Kaldi-style augmentation in terms of the performance of the best model
0:13:11	what is then trained on short duration segments would see that the situation changes in
0:13:16	the way that the we now
0:13:18	is not one obvious how well that
0:13:23	what is that the gap between the
0:13:26	metrics into roles just quite now
0:13:28	no
0:13:30	if we say the training segments not sure that
0:13:35	we would
0:13:36	differentiating we know that are
0:13:38	the model trained on data with more realistic room impulse responses
0:13:43	outperforms the model trained on the Kaldi-style version of the impulse responses
0:13:52	and the gap is getting wider
0:13:55	how with the but absolute we are
0:13:58	is the still not
0:14:01	we as
0:14:03	the obvious conclusion we can draw from the results
0:14:09	here is that in case of training address that based more on short differences
0:14:16	in the one for shorter duration is that it was tough degradation for
0:14:27	in order to compare our speaker recognition systems' performance on short utterances with
0:14:33	those already presented to the public
0:14:35	we have they publish describing
0:14:40	are used as well as same time nibbling too heavy steel above results on what's
0:14:45	layer experiment
0:14:47	so we were able to
0:14:51	create training and testing protocols mostly identical to those used in the paper
0:14:56	of interest so this is the second p with the
0:15:01	you can see hold endurance level location of speaker recognition in the war
0:15:07	so for reproducibility purposes we also did not use the U-Net voice activity detector for
0:15:15	this data
0:15:17	so you can see a how
0:15:21	actually try to trying to create a problem
0:15:26	as for the results we can say that testing shows significantly better quality
0:15:32	for our model on very short durations
0:15:37	a slight one second to second of artistic as
0:15:44	like durations
0:15:47	and
0:15:50	here is the final slide and here are the
0:15:56	main
0:15:58	takeaways
0:15:59	of this talk: the obtained results confirm that
0:16:06	ResNet architectures outperform the x-vector approach in both long-duration and
0:16:11	short-duration scenarios
0:16:14	appropriate training data preparation can significantly improve the quality of the final speaker recognition systems
0:16:22	also the proposed U-Net-based voice activity detector outperforms the energy-based voice
0:16:28	activity detector
0:16:32	the best-performing system for the VOiCES challenge
0:16:37	is the ResNet34-based system built on 80-dimensional mel filter bank
0:16:43	features
0:16:44	and it actually outperforms our previous best single system submitted to the VOiCES
0:16:50	challenge
0:16:53	the proposed scoring model, mean adaptation and score normalisation techniques provide additional performance gains for speaker recognition
0:17:03	and that's it
0:17:06	thank you for your attention; if you have any questions we will be happy to answer them in the Q&A
0:17:11	session

Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Special Session: VOiCES 2020

Aleksei Gusev, Vladimir Volokhov, Tseren Andzhukaev, Sergey Novoselov, Galina Lavrentyeva, Marina Volkova, Alice Gazizullina, Andrey Shulipa, Artem Gorlanov, Anastasia Avdeeva, Artem Ivanov, Alexander Kozlov, Timur Pekhovsky, Yuri Matveev