| 0:00:17 | hi everyone, i'll be presenting the work by myself and | 
|---|
| 0:00:23 | my colleagues | 
|---|
| 0:00:24 | on the VOiCES from a Distance Challenge 2019: analysis of speaker verification results | 
|---|
| 0:00:30 | and remaining challenges | 
|---|
| 0:00:34 | when we look at evaluations and challenges in the community, they tend to provide | 
|---|
| 0:00:38 | common data, | 
|---|
| 0:00:39 | benchmarks, and performance metrics for the advancement of research in the speaker recognition community | 
|---|
| 0:00:44 | some examples that might be familiar to you are the NIST SRE series, | 
|---|
| 0:00:49 | the Speakers in the Wild challenge, | 
|---|
| 0:00:50 | the VoxCeleb speaker recognition challenge, | 
|---|
| 0:00:52 | and the SdSV challenge | 
|---|
| 0:00:54 | previous evaluations focused on speaker verification in domains covering telephone and microphone data, | 
|---|
| 0:01:01 | different speaking styles, | 
|---|
| 0:01:03 | noisy data, vocal effort, audio from video, short durations, and more | 
|---|
| 0:01:08 | however, there haven't been many that focus on | 
|---|
| 0:01:11 | the far-field, distant-speaker domain | 
|---|
| 0:01:17 | nowadays we've got commercial personal assistants that are really | 
|---|
| 0:01:22 | operating in this area, so trying to get a bit more of an understanding in | 
|---|
| 0:01:25 | this context is important, especially when we look at the single-microphone | 
|---|
| 0:01:29 | scenario | 
|---|
| 0:01:31 | the VOiCES from a Distance Challenge 2019 was hosted by SRI | 
|---|
| 0:01:35 | International and Lab41, | 
|---|
| 0:01:38 | as a special session at Interspeech 2019 | 
|---|
| 0:01:41 | and what this challenge focused on was both speaker recognition and speech recognition | 
|---|
| 0:01:45 | using distant, far-field speech acquired with a single microphone | 
|---|
| 0:01:49 | in noisy and realistic reverberant environments | 
|---|
| 0:01:52 | there were several objectives that we had for this challenge | 
|---|
| 0:01:55 | one was to benchmark state-of-the-art technology for far-field speech | 
|---|
| 0:02:00 | we wanted to support the development of new ideas in technology, to bring that technology | 
|---|
| 0:02:05 | forward | 
|---|
| 0:02:06 | we wanted to support new research groups entering the field of distant speech processing | 
|---|
| 0:02:11 | and to do that with a large, publicly available dataset | 
|---|
| 0:02:15 | that is realistic in its reverberation characteristics | 
|---|
| 0:02:21 | what we noticed since the release of the public database in 2019 was | 
|---|
| 0:02:25 | an increased use of the VOiCES dataset | 
|---|
| 0:02:28 | so we thought this actually called for the current special session that we're hosting | 
|---|
| 0:02:32 | here at Odyssey 2020, even though it's virtual | 
|---|
| 0:02:36 | now, the session, we're hoping, will focus on broad areas such as single- versus multi- | 
|---|
| 0:02:41 | channel speaker recognition, | 
|---|
| 0:02:43 | single- versus multi-channel speech enhancement for speaker recognition, | 
|---|
| 0:02:47 | domain adaptation for far-field speaker recognition, | 
|---|
| 0:02:51 | calibration in far-field conditions, | 
|---|
| 0:02:53 | and advancing the state of the art | 
|---|
| 0:02:55 | over what we saw in the VOiCES from a Distance Challenge 2019 | 
|---|
| 0:03:01 | let's have a look at what the VOiCES corpus actually contains | 
|---|
| 0:03:05 | VOiCES stands for Voices Obscured in Complex Environmental Settings, | 
|---|
| 0:03:09 | and it is a large, now publicly available corpus collected in real reverberant | 
|---|
| 0:03:15 | environments | 
|---|
| 0:03:16 | what we have inside the dataset is | 
|---|
| 0:03:19 | 3,900 or more hours of audio | 
|---|
| 0:03:22 | from about a million segments, | 
|---|
| 0:03:24 | multiple rooms, four in total, | 
|---|
| 0:03:26 | different distractors such as TV and babble noise, | 
|---|
| 0:03:29 | and different microphones at different distances | 
|---|
| 0:03:32 | we even have a loudspeaker that rotates to mimic human head movement | 
|---|
| 0:03:37 | the idea for this dataset was that it would be useful for speaker recognition, | 
|---|
| 0:03:41 | automatic speech recognition | 
|---|
| 0:03:43 | speech enhancement | 
|---|
| 0:03:45 | and speech activity detection | 
|---|
| 0:03:49 | here are a couple of different statistics from the VOiCES dataset | 
|---|
| 0:03:52 | it is released under the Creative Commons BY 4.0 license, and that makes it accessible for commercial, | 
|---|
| 0:03:57 | academic, and government use | 
|---|
| 0:04:00 | we have a large number of speakers, three hundred, over four different rooms, | 
|---|
| 0:04:04 | up to twenty different microphones and different microphone types | 
|---|
| 0:04:09 | the source dataset that we used was a read speech dataset, LibriSpeech | 
|---|
| 0:04:15 | and we've got a number of different background noises, including babble, | 
|---|
| 0:04:18 | music, | 
|---|
| 0:04:18 | and TV sounds | 
|---|
| 0:04:20 | the loudspeaker orientation for mimicking human head movement | 
|---|
| 0:04:25 | ranges between 0 and 180 degrees | 
|---|
| 0:04:30 | let's drill into what we saw in the challenge of | 
|---|
| 0:04:33 | 2019 | 
|---|
| 0:04:35 | we had two different tasks, speaker recognition and ASR, | 
|---|
| 0:04:39 | and they had two different task conditions; one was a fixed condition, | 
|---|
| 0:04:42 | and the idea here was that the data was constrained; | 
|---|
| 0:04:45 | everyone got to use the same constrained dataset | 
|---|
| 0:04:48 | the purpose behind this was to benchmark systems trained with that same dataset, to | 
|---|
| 0:04:53 | see if there's a dramatic difference between individual technologies versus what was commonly applied | 
|---|
| 0:04:58 | in the open condition, | 
|---|
| 0:05:00 | teams were allowed to use any available dataset, private or public | 
|---|
| 0:05:04 | now, the idea here was to quantify the gains that could be achieved when we | 
|---|
| 0:05:07 | have an unconstrained amount of data, | 
|---|
| 0:05:09 | relative to the fixed condition | 
|---|
| 0:05:14 | in terms of the goal here, | 
|---|
| 0:05:15 | we're looking at | 
|---|
| 0:05:16 | whether we can determine that a target speaker is speaking | 
|---|
| 0:05:20 | in a segment of speech, given the enrollment of that target speaker | 
|---|
| 0:05:25 | the performance metric is similar to the NIST SRE | 
|---|
| 0:05:28 | cost function, | 
|---|
| 0:05:29 | with the parameters on screen | 
|---|
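[editor's note] the NIST SRE-style cost function mentioned here can be sketched in code. This is a minimal illustration of a normalized detection cost; the parameter values below (P_target = 0.01, unit miss and false-alarm costs) are common defaults and only an assumption, standing in for the actual values shown on the slide.

```python
# Minimal sketch of a NIST SRE-style normalized detection cost function.
# The parameters (p_target = 0.01, unit costs) are assumed defaults,
# not necessarily the exact ones used in the challenge.

def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Expected cost: weighted sum of miss and false-alarm rates."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

def normalized_dcf(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalize by the best 'do-nothing' system (always accept or always reject)."""
    default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, p_target, c_miss, c_fa) / default

# e.g. a system missing 5% of targets at a 1% false-alarm rate:
print(normalized_dcf(0.05, 0.01))  # -> about 1.04, worse than rejecting everything
```

a normalized value above 1.0 means the system's decisions did worse than simply rejecting every trial, which is why calibration (choosing the decision threshold well) matters so much later in this talk.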
| 0:05:32 | as part of the challenge, we also provided a scorer so users could measure | 
|---|
| 0:05:36 | performance | 
|---|
| 0:05:37 | during development and confirm the validity of their scores before submitting them to us for evaluation | 
|---|
| 0:05:46 | and the training set in the fixed condition was limited to all speakers in the Speakers in | 
|---|
| 0:05:51 | the Wild collection | 
|---|
| 0:05:52 | and the VoxCeleb1 and VoxCeleb2 datasets | 
|---|
| 0:05:57 | in terms of development and evaluation data, the challenge participants were allowed to develop | 
|---|
| 0:06:02 | on the development data, | 
|---|
| 0:06:04 | and then there was held-out evaluation data that we benchmarked the systems on | 
|---|
| 0:06:08 | now a couple of different things to point out here about how we divided these conditions | 
|---|
| 0:06:14 | we made sure that we actually had some room mismatch between enrollment and test, | 
|---|
| 0:06:18 | as well as between the rooms used in development and evaluation | 
|---|
| 0:06:22 | and this is to help mimic | 
|---|
| 0:06:25 | what would happen with a system developed on laboratory-collected | 
|---|
| 0:06:29 | data | 
|---|
| 0:06:30 | and then sent out for real-world use | 
|---|
| 0:06:34 | similarly, we had mismatch between enrollment and test on the microphone type, | 
|---|
| 0:06:39 | comparing the studio mic to the lapel, | 
|---|
| 0:06:41 | or to the MEMS and boundary mics | 
|---|
| 0:06:45 | we also had mismatch between enrollment and verification in the specific microphone used | 
|---|
| 0:06:50 | between those two different tasks | 
|---|
| 0:06:54 | finally, the loudspeaker orientation | 
|---|
| 0:06:56 | we had quite a range, and we list those ranges so that we were able | 
|---|
| 0:07:00 | to analyze the impact of head movement on speaker recognition | 
|---|
| 0:07:05 | in terms of the results, we had twenty-one teams successfully submit scores, | 
|---|
| 0:07:09 | and four of those teams also submitted scores for the open submission, so we can | 
|---|
| 0:07:14 | get that comparison point | 
|---|
| 0:07:17 | in total, we had over fifty system submissions for the fixed condition | 
|---|
| 0:07:21 | on the slide here we've shown the top scores for each | 
|---|
| 0:07:24 | team | 
|---|
| 0:07:26 | i will dig into these a little bit on the next slide | 
|---|
| 0:07:29 | let's start analyzing some of those results | 
|---|
| 0:07:33 | the first thing we did was compute the confidence intervals, the ninety-five | 
|---|
| 0:07:36 | percent confidence intervals, | 
|---|
| 0:07:38 | and we did this by using a modified version of a joint bootstrapping technique; | 
|---|
| 0:07:42 | the reference can be found in the paper | 
|---|
| 0:07:45 | now, the reason we modified this was to account for the correlation of trials due | 
|---|
| 0:07:49 | to multiple models being available per speaker | 
|---|
| 0:07:54 | that is, different recordings from a speaker could each represent a different enrollment, | 
|---|
| 0:07:59 | and so there's correlation | 
|---|
| 0:08:01 | in the trial scores | 
|---|
| 0:08:03 | what we're calling the interval here is between the fifth and ninety-fifth percentiles | 
|---|
| 0:08:07 | of the resulting empirical distribution | 
|---|
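[editor's note] the speaker-level resampling just described can be sketched as follows. This is a simplified illustration of a joint bootstrap, assuming a flat list of (speaker, score, is_target) trials; the function name and trial format are illustrative, and it is not the exact modified procedure used in the challenge analysis.

```python
# Sketch: bootstrap a metric's confidence interval by resampling whole
# speakers, so that correlated trials (multiple enrollment models per
# speaker) stay together instead of being resampled independently.
import random
from collections import defaultdict

def bootstrap_ci(trials, metric, n_boot=1000, lo=0.05, hi=0.95, seed=0):
    """trials: iterable of (speaker_id, score, is_target).
    Returns the (lo, hi) percentiles of metric over bootstrap replicas."""
    rng = random.Random(seed)
    by_spk = defaultdict(list)
    for spk, score, tgt in trials:
        by_spk[spk].append((score, tgt))
    speakers = list(by_spk)
    stats = []
    for _ in range(n_boot):
        # draw speakers with replacement; keep each one's trials intact
        sample = [t for spk in rng.choices(speakers, k=len(speakers))
                  for t in by_spk[spk]]
        stats.append(metric(sample))
    stats.sort()
    return stats[int(lo * n_boot)], stats[int(hi * n_boot) - 1]
```

the interval reported in the talk is the 5th-to-95th percentile range of exactly this kind of empirical distribution; the percentile lookup here is a simple order statistic.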
| 0:08:10 | now, if we look at those top four scores, | 
|---|
| 0:08:13 | we can see that the confidence intervals are narrow | 
|---|
| 0:08:16 | when you don't take into account the speaker sampling or the multiple models per speaker, | 
|---|
| 0:08:20 | so they can easily mislead us if we don't take that into account | 
|---|
| 0:08:24 | what we should be looking at are the red bars; | 
|---|
| 0:08:27 | they give us a truer picture of what the confidence intervals are | 
|---|
| 0:08:33 | and if we look at those four systems with respect to the other submissions, | 
|---|
| 0:08:37 | we see that they are significantly different compared to the rest of the submissions; | 
|---|
| 0:08:41 | however, they also perform relatively similarly to one another | 
|---|
| 0:08:47 | some of the observations we found when looking at what the different groups submitted: | 
|---|
| 0:08:52 | most of the top teams applied weighted prediction error for dereverberation; remember, the VOiCES corpus | 
|---|
| 0:08:59 | has a lot of reverb and the rooms are quite noisy, | 
|---|
| 0:09:02 | and that was then the first processing step | 
|---|
| 0:09:06 | every team also used an x-vector system with data augmentation, | 
|---|
| 0:09:10 | and this was sometimes complemented with ResNet and DenseNet based architectures | 
|---|
| 0:09:16 | PLDA was the most popular choice in the backend | 
|---|
| 0:09:19 | and system calibration was actually crucial here, | 
|---|
| 0:09:23 | with all of the bottom sixteen teams failing to achieve good system calibration | 
|---|
| 0:09:27 | and what that means is there was a significant difference between the minimum | 
|---|
| 0:09:30 | and actual DCF values, for which the systems were penalized | 
|---|
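[editor's note] the minimum-versus-actual DCF gap described here can be sketched as follows, assuming scores are intended as log-likelihood ratios and assuming P_target = 0.01 with unit costs; these are illustrative defaults, not necessarily the challenge's exact parameters.

```python
# Sketch: a calibrated system thresholds its log-likelihood-ratio scores
# at the Bayes threshold implied by the application parameters; its
# "actual" DCF should then sit close to the best achievable ("minimum")
# DCF over all thresholds. A large gap signals miscalibration.
import math

def dcf(tgt_scores, non_scores, threshold, p_tgt=0.01):
    p_miss = sum(s < threshold for s in tgt_scores) / len(tgt_scores)
    p_fa = sum(s >= threshold for s in non_scores) / len(non_scores)
    return (p_tgt * p_miss + (1 - p_tgt) * p_fa) / min(p_tgt, 1 - p_tgt)

def min_dcf(tgt_scores, non_scores, p_tgt=0.01):
    # best cost over every threshold the data can distinguish
    return min(dcf(tgt_scores, non_scores, t, p_tgt)
               for t in sorted(tgt_scores + non_scores))

def act_dcf(tgt_scores, non_scores, p_tgt=0.01):
    # for true LLR scores the Bayes decision threshold is log(prior odds against)
    bayes_thr = math.log((1 - p_tgt) / p_tgt)
    return dcf(tgt_scores, non_scores, bayes_thr, p_tgt)
```

a system whose scores are shifted or mis-scaled can have a fine min_dcf but a much worse act_dcf; that gap is the kind of miscalibration observed for the bottom sixteen teams.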
| 0:09:36 | let's look now at what happens when you change the enrollment condition | 
|---|
| 0:09:41 | in particular, we're looking at what happens in a reverberant environment if we use source data, | 
|---|
| 0:09:46 | that is, no reverberation, a close-talking microphone, | 
|---|
| 0:09:49 | or use data from a different room, | 
|---|
| 0:09:52 | with reverberation, | 
|---|
| 0:09:54 | to enroll | 
|---|
| 0:09:56 | what we actually saw, looking at the blue results, the blue bars, | 
|---|
| 0:10:00 | is source enrollment tested against room-four data, whereas the red is enrolling on | 
|---|
| 0:10:08 | reverberant room-three data | 
|---|
| 0:10:10 | against the same test data | 
|---|
| 0:10:12 | we see the red bars are higher than the blue; | 
|---|
| 0:10:15 | this reverberant enrollment | 
|---|
| 0:10:18 | cost up to forty-two percent relative degradation versus source enrollment, | 
|---|
| 0:10:23 | and that depends on the system being benchmarked, of course, | 
|---|
| 0:10:26 | but it does suggest that speakers should be enrolled using close-talking segments | 
|---|
| 0:10:31 | of clean speech | 
|---|
| 0:10:33 | basically, when you have different reverberation between enrollment and test, | 
|---|
| 0:10:38 | enrolling on reverberant data doesn't help | 
|---|
| 0:10:45 | we had several different background distractors | 
|---|
| 0:10:48 | we call them distractors because they tend to distract the system from the true speech of the | 
|---|
| 0:10:52 | speaker | 
|---|
| 0:10:53 | we had TV in the background, | 
|---|
| 0:10:55 | or babble noise in the background | 
|---|
| 0:10:57 | when enrolling, we enrolled on clean speech, no distraction, | 
|---|
| 0:11:01 | but for verification we had three different types of distraction: | 
|---|
| 0:11:04 | none, TV noise, which sometimes includes speech, | 
|---|
| 0:11:08 | and babble noise | 
|---|
| 0:11:11 | and what we found was that the systems that were submitted were reasonably robust to the | 
|---|
| 0:11:14 | effect of TV noise in the background | 
|---|
| 0:11:16 | however, with babble, | 
|---|
| 0:11:18 | including speech in the environment of the true speaker | 
|---|
| 0:11:21 | resulted in a forty-five to fifty percent relative degradation, so it's quite a | 
|---|
| 0:11:26 | significant drop | 
|---|
| 0:11:30 | okay, now microphone type | 
|---|
| 0:11:33 | we had a studio mic placed close to the source for enrollment, | 
|---|
| 0:11:36 | and then three different mic classes, lapel, MEMS, and boundary, for verification at different | 
|---|
| 0:11:41 | positions | 
|---|
| 0:11:42 | we'll look at different distances on the next slide; | 
|---|
| 0:11:45 | here we just want to look at how the different microphones | 
|---|
| 0:11:49 | compare | 
|---|
| 0:11:51 | consistently across systems, | 
|---|
| 0:11:53 | we have a step down going from boundary to MEMS to the lapel microphone | 
|---|
| 0:12:00 | for looking at different distances, we're just looking at the top five systems here | 
|---|
| 0:12:04 | to constrain the results to look at, | 
|---|
| 0:12:07 | with the lapel mics placed at seven distances, for the top five teams | 
|---|
| 0:12:12 | note that, due to occlusion and masking effects in the setup, a more distant mic | 
|---|
| 0:12:17 | does not always pose a greater challenge | 
|---|
| 0:12:20 | what was interesting: the bars that really stand out, the | 
|---|
| 0:12:23 | red, | 
|---|
| 0:12:25 | teal, and blue, | 
|---|
| 0:12:27 | correspond to mics that tended to be partially obscured, | 
|---|
| 0:12:28 | so some of them are actually hidden, | 
|---|
| 0:12:30 | or very far from the | 
|---|
| 0:12:33 | speaker | 
|---|
| 0:12:34 | so those mics tend to really drop performance as well | 
|---|
| 0:12:39 | this also tends to explain the poor performance of the lapel mics in general | 
|---|
| 0:12:43 | that we saw on the previous slide | 
|---|
| 0:12:48 | now, as a summary, let's look at the remaining challenges, | 
|---|
| 0:12:51 | based on what we've seen so far from VOiCES publications and system submissions | 
|---|
| 0:12:58 | the range in the reverberation (RT60) characteristics | 
|---|
| 0:13:01 | was two to three times worse in the evaluation set than in the development set | 
|---|
| 0:13:06 | now, this was quite deliberate: | 
|---|
| 0:13:09 | we made the level of reverberation in the evaluation rooms greater | 
|---|
| 0:13:12 | than in development, | 
|---|
| 0:13:13 | and it was quite clear, we found, that this | 
|---|
| 0:13:17 | severe amount of reverberation degraded results compared to | 
|---|
| 0:13:22 | development | 
|---|
| 0:13:24 | current speaker recognition technology doesn't tend to address | 
|---|
| 0:13:27 | the impact of reverberation sufficiently | 
|---|
| 0:13:31 | the error rates are a lot higher for the reverberant conditions than for the source signal, | 
|---|
| 0:13:35 | reverberation in the presence of noise further degrades the performance, | 
|---|
| 0:13:39 | and | 
|---|
| 0:13:40 | increasing distance | 
|---|
| 0:13:42 | amplifies the impact of reverberation and degrades performance | 
|---|
| 0:13:46 | so we need to explore novel speaker modeling techniques in this context, capable of | 
|---|
| 0:13:50 | handling long-term information | 
|---|
| 0:13:53 | in utterances, dealing with reverberation like that which can happen in this domain, | 
|---|
| 0:13:57 | and being robust to multiple noise conditions | 
|---|
| 0:14:02 | system calibration, as observed, was critical for systems deployed in the real world | 
|---|
| 0:14:07 | the bottom sixteen teams failed to successfully calibrate their systems, | 
|---|
| 0:14:10 | and previous work has shown that there is actually a large degradation in calibration performance when | 
|---|
| 0:14:15 | the distance to the microphone | 
|---|
| 0:14:17 | is significantly different between the calibration training conditions and the conditions it's applied to | 
|---|
| 0:14:23 | so one way that we might be able to mitigate this kind of effect | 
|---|
| 0:14:27 | is to have calibration methods that dynamically consider the conditions of the trial, | 
|---|
| 0:14:32 | the predicted distance, for instance | 
|---|
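[editor's note] one way to sketch that idea: extend standard linear score calibration with a per-trial side feature such as predicted microphone distance. Everything below, the log-distance feature, the plain gradient-descent trainer, and all names, is an illustrative assumption rather than the recipe of any submitted system.

```python
# Sketch: condition-aware calibration. We learn an affine map
#   llr = a * raw_score + b * log(distance) + c
# by full-batch gradient descent on the logistic (cross-entropy) loss,
# so the calibrated score can shift with the trial's predicted distance.
import math

def train_calibration(scores, distances, labels, epochs=2000, lr=0.1):
    """labels: 1 for target trials, 0 for non-target trials."""
    a, b, c = 1.0, 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = gc = 0.0
        for s, d, y in zip(scores, distances, labels):
            z = a * s + b * math.log(d) + c
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - y                      # dLoss/dz for cross-entropy
            ga += err * s
            gb += err * math.log(d)
            gc += err
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c

def calibrated_llr(score, distance, params):
    a, b, c = params
    return a * score + b * math.log(distance) + c
```

in practice such a mapping would be trained on held-out trials with known conditions, and a deployed system would likely have to estimate the distance from the signal itself.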
| 0:14:36 | and while the challenge is based on a single channel, the VOiCES data was actually | 
|---|
| 0:14:40 | collected with more than one microphone, | 
|---|
| 0:14:42 | multiple microphones in the room, | 
|---|
| 0:14:44 | and we haven't looked into the effect of, for instance, beamforming | 
|---|
| 0:14:49 | and there are a number of other front-end processing techniques | 
|---|
| 0:14:52 | that we would like to look at, | 
|---|
| 0:14:53 | including speech enhancement | 
|---|
| 0:14:55 | and dereverberation, tailored a little bit more specifically for the task of speaker recognition | 
|---|
| 0:15:01 | so we hope you enjoy this special session at odyssey this year and that you | 
|---|
| 0:15:06 | continue to drive technology forward in these areas | 
|---|
| 0:15:09 | and we look forward to seeing what comes out of it | 
|---|
| 0:15:12 | thank you | 
|---|