Speech Transcript - Utilizing VOiCES Dataset for Multichannel Speaker Verification with Beamforming

0:00:13	no
0:00:14	mining solutions
0:00:16	and i would like to present our contribution regarding utilization of the voices corpus
0:00:21	for multichannel speaker verification with beamforming
0:00:27	according to research papers
0:00:29	there is
0:00:30	more interest
0:00:32	in multichannel speaker verification
0:00:34	but the number of datasets
0:00:37	is still limited
0:00:40	therefore we wanted to use
0:00:43	the voices data
0:00:45	for the evaluation of multichannel speaker verification systems so the objectives of our
0:00:51	work
0:00:52	are as follows
0:00:55	we analyzed the original trial lists defined for the voices challenge
0:01:01	we redefined them
0:01:03	so that multichannel speaker verification systems can use them
0:01:10	since we created new trial lists
0:01:13	multichannel trial lists
0:01:15	we verified that they are robust
0:01:18	and also we assessed the use of the voices data for training of such systems
0:01:26	so because we wanted to create a multichannel trial set
0:01:31	we needed first
0:01:33	to analyze
0:01:34	the original
0:01:35	trial sets defined for the challenge
0:01:39	so we can see first of all that every set of recordings
0:01:45	was recorded
0:01:47	in a different room
0:01:49	as regards noise conditions
0:01:51	we can see that test recordings were recorded with background noise
0:01:57	it was either babble noise
0:02:00	television noise
0:02:02	or music and also without anything
0:02:06	enrollment recordings
0:02:07	were recorded without any distractor noise
0:02:10	so there is just room reverberation and background noise
0:02:15	and we can see that half of the
0:02:19	enrollment data for evaluation
0:02:22	was taken from the original librispeech
0:02:28	as regards microphones
0:02:30	enrollment recordings were recorded with two microphones
0:02:34	test recordings with eight or eleven microphones
0:02:39	these numbers
0:02:40	will be quite important for us
0:02:44	in terms of speakers we can see that there are some unique speakers in the
0:02:50	enrollment and test portions
0:02:53	overall we have about one hundred speakers in enrollment both for evaluation and development
0:03:02	for development
0:03:04	we have many more speakers in test than in the evaluation set
0:03:13	regarding utterances
0:03:15	utterances are disjoint between enrollment and test
0:03:21	also speakers in the development set are different from those in the evaluation set
0:03:30	so we wanted to create multichannel trials
0:03:35	so we analyzed the original ones and we realized
0:03:40	that for every enrollment recording
0:03:43	there are always multiple test recordings
0:03:46	containing the same utterance the same noise speaker
0:03:51	room
0:03:52	but they are recorded with a different microphone
0:03:56	and this is what we made use of
0:04:00	so while creating our multichannel trials we use single-channel enrollment
0:04:06	and in terms of test recordings
0:04:10	we group some recordings to create a microphone array
0:04:16	so now we will look into the creation of test portions of development and evaluation
0:04:24	for the development set
0:04:27	we can see that for every enrollment utterance
0:04:31	there are always eight
0:04:34	recordings containing the same
0:04:37	basically the same utterance
0:04:39	and they are recorded over different microphones
0:04:43	she one to a three s o one
0:04:46	are numbers representing random turkish
0:04:51	we decided to always group four
0:04:54	recordings
0:04:55	into one microphone array
0:04:58	that means
0:04:59	that
0:05:01	instead of eight trials we obtained two trials
0:05:06	meaning that we reduced the number of trials from four million to one million
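
The grouping step described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the record fields `content` and `mic` are hypothetical names, and forming arrays from consecutive microphone IDs is an assumption, since the talk does not make clear how the four recordings in each array were picked.

```python
from collections import defaultdict


def group_into_arrays(test_recordings, array_size=4):
    """Group single-channel test recordings into microphone arrays.

    Each recording is a dict with two hypothetical keys:
      'content': an ID shared by recordings of the same utterance,
                 speaker, noise and room
      'mic':     the microphone that made the recording
    """
    by_content = defaultdict(list)
    for rec in test_recordings:
        by_content[rec["content"]].append(rec["mic"])

    arrays = []
    for content, mics in by_content.items():
        mics = sorted(mics)
        # Keep only complete groups; leftovers are ignored in this sketch.
        for i in range(0, len(mics) - array_size + 1, array_size):
            arrays.append((content, tuple(mics[i:i + array_size])))
    return arrays


# Eight recordings of the same content, one per microphone.
recordings = [{"content": "utt-a", "mic": m} for m in range(1, 9)]
arrays = group_into_arrays(recordings)
print(arrays)
```

With eight recordings per enrollment utterance this yields two arrays, so eight single-channel trials collapse into two multichannel trials, a reduction by a factor of four.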
0:05:14	for the evaluation set
0:05:16	we have eleven
0:05:18	recordings for every enrollment utterance
0:05:22	we again grouped four recordings together and we were left with three more utterances
0:05:32	therefore we randomly selected another utterance from those already used
0:05:40	this reduced the number of trials from three point six million to nine hundred
0:05:46	eighty thousand trials
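
The handling of the incomplete last group can be sketched like this. It is an illustration under assumptions: the function name, the fixed seed, and the use of consecutive microphone IDs are all hypothetical; the only thing taken from the talk is that the leftover recordings are completed to a full array of four with a randomly chosen extra recording.

```python
import random


def arrays_with_padding(mics, array_size=4, seed=0):
    """Split microphones into arrays and pad an incomplete last array
    with randomly chosen microphones from the complete arrays."""
    mics = list(mics)
    if len(mics) < array_size:
        raise ValueError("need at least one complete array to borrow from")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    arrays = [mics[i:i + array_size] for i in range(0, len(mics), array_size)]
    last = arrays[-1]
    if len(last) < array_size:
        already_used = [m for group in arrays[:-1] for m in group]
        last.extend(rng.sample(already_used, array_size - len(last)))
    return [tuple(group) for group in arrays]


# Eleven recordings per enrollment utterance give three arrays of four.
print(arrays_with_padding(range(1, 12)))
```

With eleven recordings this produces three arrays, so eleven single-channel trials per enrollment utterance become three multichannel trials.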
0:05:50	we tried not only creating
0:05:53	development
0:05:55	and evaluation sets but we also tried creating
0:05:58	training data
0:06:01	our multichannel training dataset is based on the full list of recordings from rooms one and
0:06:08	two
0:06:10	we excluded completely recordings from rooms three and four
0:06:15	because as we have seen all evaluation utterances were recorded in rooms three and four
0:06:23	we also excluded all the development data because they were recorded in rooms one and two
0:06:31	then we again grouped the recordings based on the content and we obtained again microphone
0:06:39	arrays containing four microphones
0:06:43	so
0:06:45	the result was a training dataset comprising fifty seven point eight thousand examples but with just
0:06:52	two hundred speakers
0:06:55	so it is clear that there is there
0:06:58	because this dataset is similar to the development dataset in terms of speakers and also
0:07:06	acoustic conditions
0:07:08	but this was already just because the
0:07:11	all the data set
0:07:13	so now there are three sets
0:07:16	development and evaluation sets
0:07:18	and also the training set
0:07:21	now let's turn to the explanation of our multichannel speaker verification
0:07:26	so we use a standard system
0:07:30	it contains a front end which is the beamformer that performs spatial
0:07:36	filtering
0:07:37	and then the single channel output goes to the x-vector extractor
0:07:42	and the embeddings are scored using plda
0:07:47	so this is a very standard pipeline
0:07:50	but our goal was not to propose a novel system
0:07:55	but rather to assess the use of the voices data
0:08:00	for
0:08:02	beamforming we were able to make use of the original voices training data
0:08:09	we also tried using simulated data and i will explain why and when later in the presentation
0:08:16	the voices training dataset is quite small and therefore we couldn't use it for training
0:08:23	of the x-vector extractor
0:08:26	it means that we used voxceleb for training of the x-vector extractor and also the
0:08:33	plda
0:08:35	for front end processing
0:08:37	we use the gev
0:08:39	generalized eigenvalue beamformer
0:08:42	so this beamformer utilizes
0:08:47	second-order statistics
0:08:49	and produces a single-channel output
0:08:54	so first we need to compute or estimate the speech cross power spectral density matrix
0:09:01	and the noise one
0:09:04	those matrices
0:09:05	go to the gev solver
0:09:08	which is generalized eigenvalue decomposition
0:09:12	the principal eigenvector is then used to construct the beamformer weights
0:09:18	it is applied to the multichannel input
0:09:21	and we obtain a single-channel output
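
For reference, the steps just described correspond to the usual formulation of the generalized eigenvalue beamformer. The notation below is an assumption introduced for illustration and is not taken from the talk: $\boldsymbol{\Phi}_{ss}(f)$ and $\boldsymbol{\Phi}_{nn}(f)$ are the speech and noise cross power spectral density matrices at frequency $f$, $\mathbf{y}(t,f)$ is the multichannel input, and $\mathbf{w}(f)$ is the beamformer weight vector.

```latex
% The weights maximise the output signal-to-noise ratio per frequency bin:
\mathbf{w}_{\mathrm{GEV}}(f)
  = \operatorname*{arg\,max}_{\mathbf{w}}
    \frac{\mathbf{w}^{\mathsf{H}} \boldsymbol{\Phi}_{ss}(f)\, \mathbf{w}}
         {\mathbf{w}^{\mathsf{H}} \boldsymbol{\Phi}_{nn}(f)\, \mathbf{w}}

% This is solved by the generalized eigenvalue problem, taking the
% eigenvector that belongs to the largest eigenvalue \lambda:
\boldsymbol{\Phi}_{ss}(f)\, \mathbf{w} = \lambda\, \boldsymbol{\Phi}_{nn}(f)\, \mathbf{w}

% Applying the weights to the multichannel input gives one output channel:
\hat{s}(t,f) = \mathbf{w}_{\mathrm{GEV}}^{\mathsf{H}}(f)\, \mathbf{y}(t,f)
```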
0:09:25	in order
0:09:26	to estimate speech and noise psds
0:09:30	we use a neural network
0:09:34	we have
0:09:35	a single network
0:09:37	and
0:09:39	this is applied to all of the channels
0:09:45	given an input
0:09:47	this neural network is supposed to output a mask for speech and a mask for noise
0:09:54	the resulting masks
0:09:57	are applied to the input spectrum
0:10:01	and noise and speech psd matrices are estimated
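
A common way to write mask-based estimation of these matrices is given below. It is a sketch under assumptions: the symbols are introduced here for illustration, and averaging the per-channel masks over the $C$ channels is an assumed pooling rule, since the talk only says that the same network is applied to every channel.

```latex
% Pool the per-channel masks M_{\nu,c} into one mask per class,
% where \nu \in \{s, n\} stands for speech or noise:
M_{\nu}(t,f) = \frac{1}{C} \sum_{c=1}^{C} M_{\nu,c}(t,f)

% Weight the outer products of the multichannel input y(t,f) by the mask
% and accumulate over time frames:
\boldsymbol{\Phi}_{\nu\nu}(f)
  = \sum_{t} M_{\nu}(t,f)\, \mathbf{y}(t,f)\, \mathbf{y}^{\mathsf{H}}(t,f)
```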
0:10:06	this architecture is differentiable as utilized in our previous work
0:10:12	the architecture of this model is pretty simple a contains how about two in your
0:10:20	layers
0:10:21	and then there are two output layers
0:10:24	one
0:10:25	outputting
0:10:27	a mask for noise and one
0:10:29	for speech
0:10:32	in our experiments we will refer to two models
0:10:37	but essentially they are the same
0:10:40	and what is different is the way of training
0:10:44	so for the bce model
0:10:46	we do
0:10:48	the weight of the most system either
0:10:51	just by optimizing the output masks
0:10:56	therefore
0:10:57	we
0:10:58	compute first the
0:11:01	ideal binary masks
0:11:03	and then we minimize binary cross entropy between the outputs and these masks
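
The training target for this model can be illustrated with a small sketch. The rule used here for the ideal binary mask (a time-frequency bin is marked as speech when the speech magnitude exceeds the noise magnitude) is an assumption chosen for simplicity, and the function names are hypothetical; the talk does not give the exact thresholds.

```python
import math


def ideal_binary_mask(speech_mag, noise_mag):
    """Return 1.0 where speech dominates noise in a time-frequency bin, else 0.0.

    Both inputs are lists of frames, each frame a list of magnitudes.
    """
    return [[1.0 if s > n else 0.0 for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech_mag, noise_mag)]


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross entropy between predicted and target masks."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


speech = [[2.0, 0.1], [0.5, 3.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
target = ideal_binary_mask(speech, noise)   # [[1.0, 0.0], [0.0, 1.0]]
predicted = [[0.9, 0.2], [0.1, 0.8]]        # stand-in for network output
print(binary_cross_entropy(predicted, target))
```

Computing the target requires the speech and noise components separately, which is why this objective needs data where both are known.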
0:11:10	so in order to compute ideal binary masks
0:11:14	we need to know speech and noise
0:11:17	so that means that we cannot use the voices dataset and we need to simulate data for training
0:11:27	to create such a simulated dataset we use the same utterances as in the multichannel voices
0:11:34	dataset
0:11:36	and we perform simulation using the image source method
0:11:42	and everything
0:11:44	all sessions
0:11:45	which was also used in
0:11:49	the voices dataset
0:11:53	for the mse model
0:11:55	we optimize the output of the beamformer
0:12:00	therefore we minimize
0:12:02	mse between the output
0:12:05	and clean speech
0:12:08	in this case we can use the multichannel voices training data
0:12:13	because we have
0:12:14	the degraded audio
0:12:16	and then clean speech which is taken from librispeech
0:12:22	so much for the explanation of our architecture and now we will turn to the experiments
0:12:31	for reference
0:12:32	we show results for the so called single channel
0:12:37	in this case we use the original trial list
0:12:42	defined for the voices challenge
0:12:45	and we evaluated our x-vector extractor
0:12:50	our baseline is beamformit which is a well established tool for performing beamforming
0:12:59	the results are here
0:13:03	then we tried assessing the bce and mse models
0:13:08	using the same trial list
0:13:11	as for the baseline
0:13:14	it is worth mentioning that
0:13:17	single channel cannot be directly compared to the beamformers because the number of trials
0:13:23	is different
0:13:26	then we tried assessing the performance of bce and mse
0:13:33	we can see that the bce model attains
0:13:36	better results than the baseline beamformit
0:13:40	however the performance of the mse model is quite poor
0:13:46	we hypothesize that it is much more difficult
0:13:50	to train the neural network to output correct masks for speech and
0:13:56	for noise just by minimizing mse of the output
0:14:03	moreover
0:14:04	there is more variability in the training data for the bce model
0:14:09	than in the training data for the mse model the training data for the mse model
0:14:14	are all taken
0:14:16	from the voices dataset
0:14:20	further
0:14:21	we can see
0:14:22	that the bce model generalizes better
0:14:25	than the mse model
0:14:27	and this is
0:14:29	again because of variability in the data
0:14:34	then we tried to improve the mse model
0:14:38	but still using the voices dataset and no external data
0:14:43	so we
0:14:44	used specaugment
0:14:47	and specifically we proposed a variant of specaugment where we apply masks directly to the spectrum
0:14:56	more specifically we have two frequency masks
0:15:00	and two time masks
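
The kind of masking described here can be sketched as follows. This is a generic SpecAugment-style illustration, not the variant proposed by the authors: the mask widths, the zero fill value, the random placement, and the function name are all assumptions. Only the counts of two frequency masks and two time masks follow the talk.

```python
import random


def mask_spectrogram(spec, num_freq_masks=2, num_time_masks=2,
                     max_freq_width=2, max_time_width=2, seed=None):
    """Return a copy of spec with random frequency bands and time spans zeroed.

    spec is a list of frames, each frame a list of frequency-bin values.
    The maximum widths must not exceed the corresponding dimension.
    """
    rng = random.Random(seed)
    n_frames, n_bins = len(spec), len(spec[0])
    out = [list(frame) for frame in spec]

    # Frequency masks: zero a band of adjacent bins in every frame.
    for _ in range(num_freq_masks):
        width = rng.randint(0, max_freq_width)
        start = rng.randint(0, n_bins - width)
        for frame in out:
            for f in range(start, start + width):
                frame[f] = 0.0

    # Time masks: zero a run of adjacent frames across all bins.
    for _ in range(num_time_masks):
        width = rng.randint(0, max_time_width)
        start = rng.randint(0, n_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * n_bins

    return out


spec = [[1.0] * 8 for _ in range(6)]  # 6 frames, 8 frequency bins
for frame in mask_spectrogram(spec, seed=1):
    print(frame)
```

Because the masks are drawn anew for every training example, the same recording is presented to the network in many differently corrupted forms, which is one way to add variability without external data.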
0:15:03	we can see that we were able to improve
0:15:08	performance of the mse model quite substantially
0:15:12	we can also observe that performance is now better than the baseline performance
0:15:19	we also tried using specaugment with the bce model
0:15:23	and again we see some improvement
0:15:26	but the improvement is not that high
0:15:30	as for the mse model which is good news for us
0:15:35	so much for the
0:15:37	first experiment
0:15:38	and let's turn to the second
0:15:43	experiment
0:15:44	here we are assessing performance of individual microphones
0:15:49	we hypothesized that some of the microphones can perform poorly
0:15:55	they are used in multiple microphone arrays
0:16:00	in this case
0:16:01	microphones
0:16:02	can be far from each other as opposed to conventional small microphone arrays
0:16:09	and we thought that maybe poorly performing microphones degrade overall performance greatly
0:16:18	and it might be useful to exclude them from trials
0:16:24	so to assess this
0:16:27	we first needed to assess single microphones
0:16:31	so we took
0:16:35	the original trial list
0:16:37	and then we created microphone specific trial lists
0:16:42	where as you can see
0:16:45	enrollment recordings are always the same
0:16:48	and test recordings correspond to the microphone
0:16:53	that recorded
0:16:55	the specific utterance
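
Splitting a trial list per microphone can be sketched as below. The tuple layout `(enrollment_id, test_id, test_mic, label)` and the example IDs are hypothetical and only serve the illustration; the actual trial list format is not described in the talk.

```python
from collections import defaultdict


def split_trials_by_microphone(trials):
    """Build one trial list per test microphone.

    trials is an iterable of (enrollment_id, test_id, test_mic, label) tuples.
    Enrollment entries are left untouched; only the test side is filtered.
    """
    per_mic = defaultdict(list)
    for enrollment_id, test_id, test_mic, label in trials:
        per_mic[test_mic].append((enrollment_id, test_id, label))
    return dict(per_mic)


trials = [
    ("enr-1", "utt-a-mic06", 6, "target"),
    ("enr-1", "utt-a-mic12", 12, "target"),
    ("enr-2", "utt-a-mic06", 6, "nontarget"),
]
for mic, mic_trials in sorted(split_trials_by_microphone(trials).items()):
    print(mic, mic_trials)
```

Scoring each of these lists separately gives one result per microphone, which is what allows the best and worst microphones to be compared.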
0:16:59	these are the results that we obtained
0:17:02	and we can see that the best microphone
0:17:06	or best performing microphone is right in front of the loudspeaker
0:17:11	then the worst microphone is microphone number twelve which was obstructed
0:17:18	and another poor microphone is the one that is far
0:17:23	from the loudspeaker number six
0:17:26	we can see that there is quite some difference between the best and worst microphones
0:17:32	this is even more pronounced
0:17:35	for the evaluation set
0:17:38	where we can see the best performing microphone
0:17:41	and its performance of two point two eight
0:17:45	percent eer
0:17:48	the worst performing microphone is almost seven times worse than the best one
0:17:54	again the microphone number twelve was obstructed
0:17:59	and microphone number six is far from the loudspeaker
0:18:05	then we tried excluding those microphones
0:18:08	from trials
0:18:11	as expected the numbers that we got are better
0:18:17	but what is more important
0:18:18	the difference is not that large
0:18:22	so
0:18:23	we decided not to exclude any
0:18:26	microphones from the trials
0:18:30	this concludes our presentation
0:18:33	and now let's move
0:18:34	to the outcomes
0:18:36	of our work
0:18:39	we adopted the voices definition of trials
0:18:43	and created trial lists for development and evaluation of multichannel speaker verification
0:18:49	we are aware of the fact that we reduced the number of trials quite substantially
0:18:56	but we verified that the results obtained with the trial lists are reliable
0:19:02	details on that can be found in our paper
0:19:07	we have identified several
0:19:09	problems
0:19:10	such as a small number of speakers and small variability in the acoustic environments
0:19:17	and channels
0:19:18	and we tackled these problems via data augmentation
0:19:24	in our set of experiments we have confirmed that even with a data set of
0:19:30	this size
0:19:31	and with the help of data augmentation
0:19:33	we can achieve interesting results
0:19:36	and carry out research in the field of multichannel speaker verification
0:19:42	thank you for your attention

Utilizing VOiCES Dataset for Multichannel Speaker Verification with Beamforming

Special Session: VOiCES 2020

Ladislav Mošner, Oldřich Plchot, Johan Rohdin, Jan Černocký