0:00:56 Hello, my name is Paola García, and it is my great pleasure to present our work on speaker detection in the wild: lessons learned from JSALT 2019. First of all, I would of course like to thank everyone who made this work possible. So let's start.
0:01:16 What data do we have? We have plenty of devices, like smartphones and recorders, and we can even get information from social media. We gather these data and use them for downstream tasks. However, these data need to be labeled to be useful, and with this labeling we can perform speaker detection.
0:01:44 One of our very first experiments was to use brute force, and it was the motivation to use diarization afterwards. We take the speech recording and obtain homogeneous segments from it. From those segments we compute the embeddings, we compare those embeddings with the target speaker's, and we obtain a result. But then we added diarization, extracted the segments that belong to the same speaker, and obtained better results. So it was worthwhile to do it this way.
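As a rough sketch of the two strategies contrasted here: brute force scores every fixed window independently, while the diarization-aware variant first pools each speaker's segments into one embedding. The embed() function is a placeholder for the x-vector network, and the window size and pooling are illustrative assumptions, not details from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((128, 64))

def embed(segment: np.ndarray) -> np.ndarray:
    # Placeholder for the x-vector extractor: a fixed random projection
    # of crude frame statistics, only so the sketch runs end to end.
    frames = segment[: len(segment) // 64 * 64].reshape(-1, 64)
    return PROJ @ frames.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def brute_force_scores(recording, target_emb, win=32000):
    """Brute force: embed every fixed window and score it independently."""
    return [cosine(embed(recording[s:s + win]), target_emb)
            for s in range(0, len(recording) - win + 1, win)]

def diarized_scores(recording, turns, target_emb):
    """With diarization: pool all of a speaker's segments into a single,
    more reliable embedding before scoring. `turns` maps each speaker
    label to a list of (start, end) sample indices."""
    return {spk: cosine(np.mean([embed(recording[s:e]) for s, e in segs],
                                axis=0), target_emb)
            for spk, segs in turns.items()}
```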
0:02:34 So this is the big picture of the whole pipeline. We have a recording, and we are looking for John. The first stage is to apply voice activity detection; that means getting rid of all the silence. The second stage is to perform speaker type classification; that means tagging all the segments according to gender, whether it is a kid or an adult, or even whether it is TV. Then comes speaker diarization, which answers the question "who spoke when" by gathering together the segments that belong to the same speaker. Speaker detection answers the question of whether we have John in any segment, so it is a binary decision. And then we can look for John along the recording with speaker tracking.
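A structural sketch of this five-stage pipeline; the stage names and signatures below are illustrative stubs, not the actual JSALT code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float
    end: float
    label: str                    # speaker id, "child", "adult", "TV", ...

# Stage interfaces only; a real system would put a trained model behind each.
def vad(audio) -> List[Segment]: ...
def speaker_type(audio, speech: List[Segment]) -> List[Segment]: ...
def diarize(audio, typed: List[Segment]) -> List[Segment]: ...
def detect(audio, turns: List[Segment], enrollment) -> bool: ...
def track(turns: List[Segment], enrollment) -> List[Segment]: ...

def find_john(audio, enrollment):
    speech = vad(audio)                    # 1) drop the silence
    typed = speaker_type(audio, speech)    # 2) gender / kid / adult / TV
    turns = diarize(audio, typed)          # 3) who spoke when
    if detect(audio, turns, enrollment):   # 4) binary: is John here at all?
        return track(turns, enrollment)    # 5) follow John along the recording
    return []
```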
0:03:33 Does it work fine to follow this type of pipeline if we have challenges, such as a cocktail party? The answer is no. And if we have 5 dB SNR? The answer, again, is no.
0:03:49 So let's take a look at some numbers on the diarization side. On the right we can observe the results obtained on the datasets we tried, based on the recipes provided by BUT. We can observe that CHiME-5, where we have long recordings, and BabyTrain got very bad results. We concluded that these bad results come from far-field microphones, noisy speech, overlapping speech, condition mismatch, non-cooperative speakers, and a bias towards English speech. So we wanted to study these conditions.
0:04:43 Now let's see some numbers on speaker recognition. For speaker recognition we compared two systems on two datasets: the first one is SRI and the second one is VOiCES. And we are comparing a close-talking microphone against far-field. We can observe that for far-field microphones the equal error rate doubles, or worse.
0:05:13 So our main goal was to research, develop, and benchmark speaker diarization and speaker recognition systems for real speech, using single microphones in realistic scenarios that included background noises such as television, music, or other people talking.
0:05:35 One of the characteristics of the data: is it like this one, where you are having a meeting? Or is it completely wild, like the one in CHiME-5, where people gathered together to have a party? Or is it a long recording, five hours or even longer? Or do we have a far-field microphone in the other room that is catching the voice of the speaker?
0:06:15 To cover all these types of data we included these four corpora: AMI, SRI, CHiME-5, and BabyTrain, going from the easiest one to the most difficult one. For AMI we have a meeting domain, and we used it both for diarization and detection. For SRI we have a semi-controlled domain; we only used it for detection, and we didn't use it for diarization because we don't have complete labels for all the speakers. CHiME-5 we used for diarization only; it is a dinner-party domain, and we didn't use it for detection because it usually has only four speakers, which is quite few. And BabyTrain we used both for diarization and detection; it is completely wild and uncontrolled.
0:07:17 The models that we explored, as I said before, are for diarization and for speaker detection. From diarization we get the labels for all the speakers, and with speaker detection we can track the speaker we are looking for.
0:07:36 This is the picture of the diarization pipeline. We have a traditional modularized system that is composed of enhancement, the VAD, the embedding, the scoring, the clustering, the resegmentation, and the overlap assignment. We have two types of enhancement: one at the signal level and another one at the embedding level. The boxes in orange are the ones that we explored.
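As a concrete (assumed) instance of the scoring and clustering boxes, here is a minimal agglomerative clustering over cosine distances between segment embeddings; the stopping threshold is a tuning knob, not a value from the talk:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_embeddings(embeddings: np.ndarray, threshold: float = 0.3):
    """Group per-segment embeddings into speakers with average-linkage
    agglomerative clustering; returns one integer label per segment."""
    # Length-normalise so cosine distance behaves well.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = pdist(X, metric="cosine")
    tree = linkage(dists, method="average")
    return fcluster(tree, t=threshold, criterion="distance")

print(cluster_embeddings(np.random.randn(10, 128)))
```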
0:08:09 Let's start with the enhancement at the signal level. We built an SNR-progressive multi-target LSTM-based speech enhancement model. The progressive multi-target network is divided into sequentially stacked blocks, with one LSTM layer and one fully connected layer for multi-target learning per block. The fully connected layer in every block is designed to learn an intermediate speech target with a higher SNR than the previous target. A series of progressive ratio masks is concatenated with the progressively enhanced log-power spectral features as the targets. At test time, we directly feed the enhanced audio, processed by the well-trained enhancement model, to the back-end systems.
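A PyTorch sketch of that progressive structure, assuming three stacked blocks and assuming each intermediate target is a ratio mask concatenated with enhanced log-power spectra; all dimensions and the block count are illustrative:

```python
import torch
import torch.nn as nn

class PMTBlock(nn.Module):
    """One block: an LSTM layer plus a fully connected layer that predicts
    an intermediate target at a higher SNR than the previous block."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.fc(h)

class ProgressiveEnhancer(nn.Module):
    def __init__(self, feat_dim=257, hidden=512, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(n_blocks):
            # Each block sees the original features plus the previous
            # block's target (mask + enhanced LPS -> 2 * feat_dim).
            self.blocks.append(PMTBlock(in_dim, hidden, 2 * feat_dim))
            in_dim = feat_dim + 2 * feat_dim

    def forward(self, noisy_lps):
        targets = []
        x = noisy_lps
        for block in self.blocks:
            y = block(x)
            targets.append(y)              # supervised at increasing SNR
            x = torch.cat([noisy_lps, y], dim=-1)
        return targets                     # train with one loss per block

model = ProgressiveEnhancer()
outs = model(torch.randn(2, 100, 257))     # (batch, frames, bins)
print([t.shape for t in outs])
```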
0:09:07 Now that we have a cleaner signal, we can explore the VAD. In this case we have two directions: the one on the top is based on MFCCs, and the one on the bottom is based on filterbank features; in both architectures these are followed by LSTM and fully connected layers. The output is speech and non-speech. It is important to note that the lower branch is the one that we chose for our experiments.
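A minimal sketch in the spirit of the lower branch: filterbank features into a bidirectional LSTM followed by fully connected layers that produce per-frame speech/non-speech decisions; the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class SimpleVAD(nn.Module):
    """Frame-level speech / non-speech classifier."""
    def __init__(self, n_fbank=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_fbank, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # speech vs. non-speech
        )

    def forward(self, fbank):              # (batch, frames, n_fbank)
        h, _ = self.lstm(fbank)
        return self.fc(h)                  # per-frame logits

vad = SimpleVAD()
logits = vad(torch.randn(1, 300, 40))
speech_frames = logits.argmax(dim=-1)      # 1 = speech, 0 = non-speech
```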
0:09:42 Although this is not part of the final stages, it is also true that the embedding network is related to the performance, as shown in the table. So we explored the extended TDNN trained on VoxCeleb, with and without augmentation, and we also explored a factorized TDNN, also with augmentation. We can see that the factorized TDNN obtained the best results on BabyTrain and AMI, and was comparable on CHiME-5, so we chose the factorized TDNN for our experiments.
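The factorization idea can be sketched as one TDNN (dilated 1-d convolution) layer split into two low-rank convolutions with a bottleneck in between. Note that the real F-TDNN also constrains one factor to be semi-orthogonal during training, which this simplified sketch does not enforce; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class FactorizedTDNNLayer(nn.Module):
    """One F-TDNN-style layer: a wide TDNN layer replaced by two
    low-rank convolutions through a small bottleneck."""
    def __init__(self, in_dim=512, bottleneck=256, out_dim=512, dilation=1):
        super().__init__()
        self.factor1 = nn.Conv1d(in_dim, bottleneck, kernel_size=2,
                                 dilation=dilation)
        self.factor2 = nn.Conv1d(bottleneck, out_dim, kernel_size=2,
                                 dilation=dilation)
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):                  # (batch, feat, frames)
        return self.norm(self.act(self.factor2(self.factor1(x))))

layer = FactorizedTDNNLayer()
print(layer(torch.randn(4, 512, 200)).shape)
```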
0:10:22 Now let's focus on the speech enhancement at the feature level. We addressed how to train an unsupervised speech enhancement system which can be used as a front-end preprocessing module to improve the quality of the features before they are passed to the embedding extractor. The main idea here is to use an unsupervised adaptation system based on CycleGANs. We train a CycleGAN network using log filterbank features as input to each of the generator networks, so we have a clean source signal on the left and the real, target-domain data on the right. During testing, we map the test data towards the target signal; these enhanced acoustic features are then used by the embedding extractors. Even though the CycleGAN network was trained for dereverberation, we also tested it on noisy datasets, showing improvements.
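A minimal sketch of the CycleGAN objective, assuming generators G (clean to wild) and F (wild to clean) over feature frames; the tiny linear stand-in networks and the weight lam are placeholders, not the models from the talk:

```python
import torch
import torch.nn as nn

def cycle_gan_losses(G, F, D_clean, D_wild, clean, wild, lam=10.0):
    """Generator-side CycleGAN objective. G maps clean -> wild,
    F maps wild -> clean; D_* are the two discriminators."""
    mse, l1 = nn.MSELoss(), nn.L1Loss()
    fake_wild, fake_clean = G(clean), F(wild)

    # Least-squares adversarial terms: fool each discriminator.
    adv = mse(D_wild(fake_wild), torch.ones_like(D_wild(fake_wild))) \
        + mse(D_clean(fake_clean), torch.ones_like(D_clean(fake_clean)))

    # Cycle consistency: clean -> wild -> clean (and the reverse) must
    # reconstruct the input; this is what keeps training unsupervised,
    # since no paired clean/wild utterances are needed.
    cyc = l1(F(fake_wild), clean) + l1(G(fake_clean), wild)
    return adv + lam * cyc

# Tiny stand-in networks over 40-dim log filterbank frames.
G, F = nn.Linear(40, 40), nn.Linear(40, 40)
D_clean, D_wild = nn.Linear(40, 1), nn.Linear(40, 1)
loss = cycle_gan_losses(G, F, D_clean, D_wild,
                        torch.randn(8, 40), torch.randn(8, 40))
# At test time only F is kept: enhanced = F(wild_features).
```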
0:11:28 Now let's continue with the overlap detection. This architecture might also sound familiar here: it is exactly the same as the one used for the VAD approach, but now trained in a way that decides between overlapped and non-overlapped speech. It can also be trained to perform the VAD task at the same time, but the dedicated approach showed better results.
0:11:57 Let's continue with the overlap assignment. From the resegmentation we get a posterior matrix for each of the speakers; in this example the most probable speakers are in rows one and two. We can combine this with the overlap detector and also with the VAD. Merging these results, we get what we call the overlap assignment: in regions where the overlap detector tells us that we have two speakers, we put there the two most probable speakers. With this part, we completed our diarization system.
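A minimal sketch of this merging step: per-frame speaker posteriors combined with the VAD and overlap flags, assigning the top two speakers wherever overlap is detected; the matrix shapes are assumptions:

```python
import numpy as np

def assign_overlap(posteriors: np.ndarray, vad: np.ndarray,
                   overlap: np.ndarray):
    """posteriors: (frames, speakers) from clustering/resegmentation;
    vad, overlap: boolean per-frame flags. Returns, per frame, the list
    of assigned speakers: none for silence, the top speaker for
    single-speaker frames, the two most probable speakers in overlap."""
    labels = []
    for t in range(posteriors.shape[0]):
        if not vad[t]:
            labels.append([])              # silence
            continue
        order = np.argsort(posteriors[t])[::-1]
        n = 2 if overlap[t] else 1         # two speakers in overlap regions
        labels.append(order[:n].tolist())
    return labels

post = np.random.dirichlet(np.ones(4), size=50)   # 50 frames, 4 speakers
vad = np.ones(50, dtype=bool)
ovl = np.zeros(50, dtype=bool); ovl[10:20] = True
print(assign_overlap(post, vad, ovl)[8:12])
```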
0:12:45 But now the question is: which combination of all these things gives good results? In our case, we put together the TDNN VAD, the enhancement, the VB resegmentation, and the overlap assignment. For all the corpora we got nice improvements. For example, on AMI we went from a 49% diarization error rate to a 30% diarization error rate. For the CHiME-5 corpus, with the same combination, we went from a 69% diarization error rate to a 63% diarization error rate. And finally, for BabyTrain, we got a nice improvement from an 85% diarization error rate to a 47% diarization error rate.
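For reference, the metric quoted here is the diarization error rate. A simplified frame-level version is sketched below; the official md-eval scoring additionally searches the optimal speaker mapping and applies a forgiveness collar:

```python
import numpy as np

def frame_der(ref, hyp):
    """Simplified diarization error rate over per-frame speaker labels
    (0 = silence), assuming hypothesis speakers are already mapped to
    reference speakers."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != 0
    missed = np.sum(speech & (hyp == 0))
    falarm = np.sum(~speech & (hyp != 0))
    confusion = np.sum(speech & (hyp != 0) & (ref != hyp))
    return (missed + falarm + confusion) / max(np.sum(speech), 1)

print(frame_der([1, 1, 2, 2, 0, 0],
                [1, 2, 2, 2, 0, 1]))   # 2 errors / 4 speech frames = 0.5
```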
0:13:45 It is important to note here that, put together, these modules really improved the system.
0:13:54 This is the speaker detection pipeline. We have the enhancement at the signal level and also at the embedding level; we have the diarization segmentation; we have the embedding extractor, the back-end, the calibration; and finally we get the speaker detection. The boxes in orange use the same techniques as in diarization. So we use the enhancement at two levels: at the signal level and also at the embedding level. The diarization segmentation is fed into the embedding extractor, and the pipeline continues. The embedding extractor, as we already emphasized before, is a factorized TDNN, which is getting the best results for speaker ID. We also used an enhancement module for this embedding extractor. And finally we have the back-end and the calibration: the back-end uses a PLDA, fed from the diarization, with augmentation, and the calibration stage leads directly to the speaker detection.
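A minimal sketch of this back-end step, with cosine scoring standing in for the PLDA log-likelihood ratio and a linear calibration whose constants are placeholders, not trained values:

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    # Stand-in for the PLDA scoring used in the talk.
    return float(enroll @ test /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))

def calibrate(score: float, a: float, b: float) -> float:
    # Linear calibration (typically trained with logistic regression)
    # mapping raw scores to well-behaved log-likelihood ratios.
    return a * score + b

def detect(enroll, cluster_embeddings, a=8.0, b=-2.0, threshold=0.0):
    """Binary decision: does the target speaker appear in any diarized
    cluster? a, b, and threshold are illustrative placeholders."""
    llrs = [calibrate(cosine_score(enroll, e), a, b)
            for e in cluster_embeddings]
    return max(llrs) >= threshold, llrs

present, llrs = detect(np.random.randn(128),
                       [np.random.randn(128) for _ in range(4)])
print(present, [round(s, 2) for s in llrs])
```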
0:15:11 The combination that gave the best results for all of our corpora includes the speech enhancement, the spectral augmentation, and the PLDA with augmentation. It is important to note that this pipeline includes the diarization as its first stage. For AMI we got an improvement, going from a 17% equal error rate to a 2% equal error rate; in terms of minDCF and actual DCF, shown at the bottom, we can also see some improvement. For BabyTrain we can observe the same trend, going from a 14% equal error rate to a 9% equal error rate; at the bottom we can observe the minDCF and the actual DCF: the minDCF got an improvement, but the actual DCF did not. For the SRI data our system also improved the results, going from a 21% equal error rate to a 16% equal error rate, and the minDCF and the actual DCF for SRI show the same trend, also getting improvements.
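Since the results are reported as EER and minDCF, here is a small sketch of how both can be computed from target and non-target trial scores; the prior and costs are common defaults, not necessarily those used in the talk:

```python
import numpy as np

def eer(tar, non):
    """Equal error rate via a simple threshold sweep: the operating
    point where the miss rate and false-alarm rate cross."""
    thr = np.sort(np.concatenate([tar, non]))
    miss = np.array([np.mean(tar < t) for t in thr])
    fa = np.array([np.mean(non >= t) for t in thr])
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2

def min_dcf(tar, non, p_tgt=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost over all thresholds."""
    thr = np.sort(np.concatenate([tar, non]))
    return min(p_tgt * c_miss * np.mean(tar < t)
               + (1 - p_tgt) * c_fa * np.mean(non >= t) for t in thr)

rng = np.random.default_rng(0)
tar = rng.normal(1.5, 1.0, 1000)      # target-trial scores
non = rng.normal(0.0, 1.0, 1000)      # non-target-trial scores
print(f"EER={eer(tar, non):.3f}  minDCF={min_dcf(tar, non):.3f}")
```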
0:16:34 Finally, some takeaways I'd like to mention. Diarization is a fundamental stage for performing speaker detection. There are some modules that are really needed to have a competitive system: of course, a good enhancement, a good VAD, good embeddings, and overlap detection and assignment. The speaker detection depends not only on the diarization module, but also on the embedding extractor and on the augmentation.

0:17:12 The future directions of this work are as follows. For the signal-to-signal enhancement and speaker separation, we need some customization; it could be by dataset, by speaker, or by task. For the speech enhancement, we have to explore other architectures, such as transformers, and large-scale training. For the VAD, we need ways to handle domain mismatch, which can be done, for example, using domain adversarial training. For the clustering, we need unsupervised adaptation, to take the overlap into account during the clustering, and also to include the transcription in parallel with the speaker labels. For the speaker detection, we need some enhancement for the multi-speaker scenario; that means highlighting the speaker of interest, and also performing better clustering for short segments.
0:18:12 This is our amazing team. I would like to thank all of them very much. Thank you. Questions?