0:00:13 Hello. I would like to present our contribution regarding a new definition of trials of the VOiCES corpus for multichannel speaker verification.
0:00:27 According to research papers, there is growing interest in multichannel speaker verification, but the number of datasets is still limited.
0:00:40 Therefore, we wanted to use the VOiCES data for the evaluation of multichannel speaker verification systems. The objectives of our work are as follows.
0:00:55 We analyze the original trial lists defined for the VOiCES challenge and redefine them so that multichannel speaker verification systems can make use of them.
0:01:10 Since we created new multi-channel trial lists, we verify that the final lists are robust, and we also assess the use of the VOiCES data for training our subsystems.
0:01:26 Because we wanted to create a multichannel trial set, we first needed to analyze the original trial set defined for the VOiCES challenge. We can see that each set of recordings was recorded in different rooms.
0:01:49 As regards noise conditions, we can see that the test recordings were recorded with background noise: babble noise, television noise, or music, and also without any. Enrollment recordings were recorded without any distractor noise, so there is just room reverberation and ambient background.
0:02:15 We can also see that half of the enrollment data for evaluation was taken from the original source recordings.
0:02:28 As regards microphones, enrollment recordings were recorded with two microphones and test recordings with eight or eleven microphones. These numbers will be quite important for us.
0:02:44 In terms of speakers, we can see that there are some unique speakers in the enrollment and test portions. Overall, we have about one hundred speakers in enrollment, both for evaluation and development. For development, we have many more speakers in the test portion than in the evaluation set.
0:03:13 Regarding utterances, they are disjoint between enrollment and test. Also, the speakers in the development set are different from those in the evaluation set.
0:03:30 We wanted to create multichannel trials, so we analyzed the original ones and realized that for every enrollment recording there are always multiple test recordings containing the same utterance, the same noise, speaker and room, but recorded with a different microphone. This is what we make use of.
0:04:00 While creating our multichannel trials, we use single-channel enrollment, and in terms of test recordings, we group several recordings together to create a microphone array.
0:04:16 Now we will look into the creation of the test portions of the development and evaluation sets.
0:04:27 For the development set, we can see that for every enrollment utterance there are always eight test utterances containing basically the same utterance, recorded over different microphones; the numbers shown identify the individual recordings. We decided to always group four recordings into one microphone array. That means that instead of eight trials we obtain two trials, which reduces the number of trials from four million to one million.
0:05:14 For the evaluation set, we have eleven recordings for every enrollment utterance. We again group four recordings together and are left with three more utterances; therefore, we randomly select one more utterance so that we can form another group of four. This reduces the number of trials from 3.15 million to nine hundred eighty thousand.
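As an illustration of this grouping, here is a minimal Python sketch of how test recordings of the same source utterance could be combined into four-microphone arrays. It is not the authors' actual tooling, and the record fields such as `utt_id` and `mic` are assumptions made for the example.

```python
import random
from collections import defaultdict

def build_mic_arrays(recordings, group_size=4, seed=0):
    """Group test recordings of the same source utterance (same speaker,
    room and noise, different microphone) into pseudo microphone arrays."""
    rng = random.Random(seed)
    by_utt = defaultdict(list)
    for rec in recordings:                      # rec: {"utt_id": ..., "mic": ..., "path": ...}
        by_utt[rec["utt_id"]].append(rec)

    arrays = []
    for utt_id, recs in by_utt.items():
        rng.shuffle(recs)
        # split into consecutive groups of `group_size` microphones
        groups = [recs[i:i + group_size] for i in range(0, len(recs), group_size)]
        # if the last group is short (e.g. 3 of 11 mics), pad it with randomly
        # chosen recordings that were already used, so every array is complete
        if len(groups[-1]) < group_size and len(recs) >= group_size:
            used = [r for g in groups[:-1] for r in g]
            groups[-1] = groups[-1] + rng.sample(used, group_size - len(groups[-1]))
        arrays.extend((utt_id, g) for g in groups)
    return arrays
```

With eight recordings per utterance this yields two arrays, and with eleven it yields two full arrays plus one padded with an already used recording, which matches the trial counts mentioned above.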
0:05:50 We tried not only redefining the development and evaluation sets, but also creating training data.
0:06:01 Our multichannel training dataset is based on the full list of recordings from rooms one and two. We completely excluded recordings from rooms three and four because, as we have seen, the evaluation utterances were recorded in rooms three and four. We also excluded the development data, because they were recorded in rooms one and two. Then we again grouped the recordings based on their content, and we obtained microphone arrays containing four microphones.
0:06:43 The result was a training dataset comprising 57.8 thousand examples from two hundred speakers. It is clear that there is a caveat, because this dataset is similar to the development dataset in terms of speakers and also acoustic conditions, but this was already the case for the original dataset.
0:07:13 So now we have three sets ready: the development and evaluation sets, and also a training set.
0:07:21 Now let's turn to the explanation of our multichannel speaker verification pipeline. We use a standard system: it contains a front end, which is a beamformer, then the single-channel output goes to an x-vector extractor, and the embeddings are scored using PLDA. So this is a very standard pipeline.
0:07:50 Our goal was not to propose a novel system, but rather to assess the use of the VOiCES data. For beamforming, we were able to make use of the original VOiCES training data.
0:08:09 We also tried using simulated data, and I will explain why and when later in the presentation. The VOiCES training dataset is quite small, and therefore we could not use it for training the x-vector extractor. It means that we used VoxCeleb for training the x-vector extractor and also the PLDA.
0:08:35 For front-end processing, we use the GEV, the generalized eigenvalue beamformer. This beamformer utilizes second-order statistics and produces a single-channel output.
0:08:54 First, we need to compute or estimate the speech cross-power spectral density matrix and the noise one. These two matrices go to the GEVD, which is generalized eigenvalue decomposition. The principal eigenvector is then used to construct the beamformer weights; they are applied to the multi-channel input, and we obtain a single-channel output.
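As a rough illustration of this step, the following NumPy sketch solves the generalized eigenvalue problem per frequency bin and applies the principal eigenvector as the beamformer weights. It assumes the speech and noise PSD matrices are already available; the variable names are mine, not from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def gev_beamform(stft, psd_speech, psd_noise):
    """stft:       (channels, freq, frames) complex multi-channel spectrum
    psd_speech: (freq, channels, channels) speech cross-power spectral density
    psd_noise:  (freq, channels, channels) noise cross-power spectral density
    Returns the beamformed single-channel spectrum of shape (freq, frames)."""
    out = np.zeros(stft.shape[1:], dtype=complex)
    for f in range(stft.shape[1]):
        # generalized eigenvalue problem  Phi_speech w = lambda Phi_noise w
        _, vecs = eigh(psd_speech[f], psd_noise[f])
        w = vecs[:, -1]                    # principal eigenvector = GEV weights
        out[f] = w.conj() @ stft[:, f, :]  # filter-and-sum over channels
    return out
```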
0:09:25 In order to estimate the speech and noise PSD matrices, we use a neural network. We have a single mask estimator, and it is applied to all of the channels. Given an input, this network is supposed to output a mask for speech and a mask for noise. The resulting masks are applied to the input spectra, and the noise and speech PSD matrices are estimated.
0:10:06 This whole pipeline is differentiable, as is usual in our previous work. The architecture of the model is pretty simple: it contains a couple of linear layers, and then there are two output layers, one producing the speech mask and the other one the noise mask.
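A PyTorch sketch of such a shared, per-channel mask estimator and of turning the predicted masks into PSD matrices; the layer sizes and tensor layout are illustrative assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Predicts a speech mask and a noise mask from a single-channel
    magnitude spectrogram; the same network is shared by all channels."""
    def __init__(self, n_freq=257, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.speech_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())
        self.noise_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mag):                 # mag: (channels, frames, freq)
        h = self.body(mag)
        return self.speech_head(h), self.noise_head(h)

def masked_psd(stft, mask):
    """Mask-weighted cross-power spectral density matrices, one per frequency.
    stft: (channels, frames, freq) complex, mask: (channels, frames, freq)."""
    m = mask.mean(dim=0)                            # average masks over channels
    x = stft.permute(2, 0, 1)                       # (freq, channels, frames)
    w = m.permute(1, 0).unsqueeze(1)                # (freq, 1, frames)
    psd = (w * x) @ x.conj().transpose(1, 2)        # (freq, channels, channels)
    return psd / w.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```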
0:10:32 In our experiments, we will refer to two models, BCE and MSE, but essentially they are the same; what is different is the way of training. For the BCE model, we train the mask-estimation system directly by optimizing the output masks. Therefore, we first compute ideal binary masks, and then we minimize the binary cross entropy between the output masks and the ideal ones.
0:11:10 In order to compute ideal binary masks, we need to know speech and noise separately. That means we cannot use the VOiCES dataset and have to use simulated data for training. To create such a simulated dataset, we used the same utterances as in the multi-channel VOiCES dataset, and we performed the simulation using the image-source method and room configurations that were also used in the VOiCES dataset.
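To make this concrete, here is a small sketch, my own and with a simple dominance threshold (the exact mask definition in the paper may differ), of computing ideal binary masks from separately known speech and noise magnitudes and the corresponding BCE loss.

```python
import torch
import torch.nn.functional as F

def ideal_binary_masks(speech_mag, noise_mag):
    """Ideal binary masks from separately known speech and noise magnitudes:
    a time-frequency bin belongs to speech where speech dominates noise."""
    ibm_speech = (speech_mag > noise_mag).float()
    ibm_noise = 1.0 - ibm_speech
    return ibm_speech, ibm_noise

def bce_mask_loss(pred_speech, pred_noise, speech_mag, noise_mag):
    # BCE between predicted soft masks and the ideal binary targets
    ibm_s, ibm_n = ideal_binary_masks(speech_mag, noise_mag)
    return (F.binary_cross_entropy(pred_speech, ibm_s)
            + F.binary_cross_entropy(pred_noise, ibm_n))
```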
0:11:53 For the MSE model, we optimize the output of the beamformer. Therefore, we minimize the mean squared error between the beamformer output and clean speech. In this case, we can use the multichannel VOiCES training data, because we have the degraded audio, and the clean speech is taken from LibriSpeech.
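For the MSE variant, a minimal sketch of the loss, assuming the whole front end (mask estimation, PSD matrices, GEV) is implemented with differentiable tensor operations so that the gradient flows back into the mask estimator. Computing the error on magnitude spectra is one possible choice of domain, not necessarily the one from the paper.

```python
import torch

def mse_beamformer_loss(beamformed_stft, clean_stft):
    """MSE between the differentiable beamformer output and the clean
    LibriSpeech reference, here on magnitude spectra."""
    return torch.mean((beamformed_stft.abs() - clean_stft.abs()) ** 2)

# training step sketch (hypothetical variable names): the loss gradient flows
# through the GEV beamformer and the mask-weighted PSDs into the mask estimator
# loss = mse_beamformer_loss(gev_output, clean_reference)
# loss.backward(); optimizer.step()
```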
0:12:22 So much for the explanation of our architecture; now we will turn to the experiments. For reference, we show results for the so-called single-channel setup. In this case, we use the original trial lists defined for VOiCES and we evaluate our x-vector extractor alone. Our baseline is BeamformIt, which is a well-established tool for beamforming; the results are shown here. Then we tried assessing the BCE and MSE models using the same trial lists as for BeamformIt. It is worth mentioning that the single-channel results cannot be readily compared to BeamformIt, because the number of trials is different.
0:13:26 Looking at the performance of the BCE and MSE models, we can see that the BCE model attains better results than the BeamformIt baseline. However, the performance of the MSE model is quite poor. We hypothesize that it is much more difficult to train the neural network to output correct masks for speech and noise just by minimizing the MSE at the beamformer output. Moreover, there is more variability in the training data for the BCE model than in the training data for the MSE model: all training data for the MSE model come from the VOiCES corpus. Further, we can see that the BCE model generalizes better than the MSE model, and this is again because of the variability in the data.
0:14:34 Then we tried to improve the MSE model while still using the VOiCES dataset and no external data. What we used is SpecAugment, and especially a proposed variant of SpecAugment where we apply the masks directly to the spectra; more specifically, we apply frequency masks and time masks.
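As a sketch of this kind of augmentation, the function below zeroes out random frequency bands and time spans of a magnitude spectrogram; the number and maximum widths of the masks are illustrative defaults, not the values used in the paper.

```python
import torch

def spec_augment(mag, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20, seed=None):
    """Apply SpecAugment-style masking to a magnitude spectrogram
    of shape (frames, freq) by zeroing random bands and spans."""
    g = torch.Generator()
    if seed is not None:
        g.manual_seed(seed)
    out = mag.clone()
    n_frames, n_freq = out.shape
    for _ in range(n_freq_masks):
        width = int(torch.randint(1, max_f + 1, (1,), generator=g))
        start = int(torch.randint(0, max(1, n_freq - width), (1,), generator=g))
        out[:, start:start + width] = 0.0      # frequency mask
    for _ in range(n_time_masks):
        width = int(torch.randint(1, max_t + 1, (1,), generator=g))
        start = int(torch.randint(0, max(1, n_frames - width), (1,), generator=g))
        out[start:start + width, :] = 0.0      # time mask
    return out
```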
0:15:03 We can see that we were able to improve the performance of the MSE model quite substantially, and we can also observe that the performance is now better than the baseline performance. We also tried using SpecAugment with the BCE model and again see some improvement, but the improvement is not as large as for the MSE model, which is good news for us.
0:15:35 So much for the first set of experiments; let's turn to the second one, where we assess the performance of individual microphones. We hypothesized that some of the microphones can perform poorly when they are used in multi-microphone arrays. In this case, the microphones can be far from each other, as opposed to conventional small microphone arrays, and we thought that poorly performing microphones might degrade the overall performance greatly, so it might be useful to exclude them from the trials.
0:16:24 To assess this, we first needed to evaluate single microphones. So we returned to the original trial lists and then created microphone-specific trial lists where, as you can see, the enrollment recordings are always the same and the test recordings correspond to the microphone that recorded the specific utterance.
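A short sketch of how such microphone-specific lists could be derived from the original trials; the trial tuple layout and the `mic` field are assumptions made for the example.

```python
from collections import defaultdict

def per_microphone_trials(trials):
    """Split trials into per-microphone lists: the enrollment side stays the
    same, test recordings are grouped by the microphone that captured them.
    Each trial is assumed to be (enroll_id, test_rec) with a 'mic' key."""
    by_mic = defaultdict(list)
    for enroll_id, test_rec in trials:
        by_mic[test_rec["mic"]].append((enroll_id, test_rec))
    return by_mic
```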
0:16:59 These are the results that we obtained. We can see that our best performing microphone lies right in front of the loudspeaker; the worst microphone is microphone number twelve, which was obstructed, and another poor microphone is the one that is farthest from the loudspeaker, number six. We can see that there is quite some difference between the best and worst microphone. This is even more pronounced for the evaluation set, where the best performing microphone attains a performance of 2.28 percent EER, while the worst performing microphone is almost seven times worse than the best one. Again, microphone number twelve was fully obstructed and microphone number six is far from the loudspeaker.
0:18:05 Then we tried excluding those microphones from the trials. As expected, the numbers that we got are better, but what is more important, the difference is not large. So we decided not to exclude any microphones from the trials.
0:18:30 This concludes our presentation, so now let's move to the outcomes of our work. We adapted the VOiCES definition of trials and created trial lists for the development and evaluation of multichannel speaker verification. We are aware of the fact that we reduced the number of trials quite substantially, but we verified that the results obtained with the new trial lists are reliable; details on that can be found in our paper.
0:19:07 We have identified several shortcomings, such as the small number of speakers and the small variability in acoustic environments and channels, and we tackled these problems via data augmentation. In our set of experiments, we have confirmed that even with a dataset of this size, and in spite of its limitations, we can achieve interesting results and carry out research in the field of multichannel speaker verification. Thank you for your attention.