
Hello. It is my great pleasure to present "Speaker Detection in the Wild: Lessons Learned from JSALT 2019". First of all, I would like to thank everyone who made this work possible.

So let's start. What data do we have? We have plenty of devices, like smartphones and recorders, and we can even get information from social media. We gather data and use it for downstream tasks. However, these data need to be labeled to be useful, and with this labeling we can perform speaker detection.

One of our very first experiments was to use brute force, and it became the motivation to use diarization. It works as follows: we have the speech recording, and we obtain homogeneous segments from it. From those segments we compute the embeddings and compare those embeddings with the target speakers' enrollment embeddings, and this gives an initial result. But then we added diarization: we extracted the segments that belong to the same speaker, and we obtained better results. So we found it worthwhile to do it this way.

So this is the big picture of the whole pipeline. We have a recording, and we are looking for John. The first stage is voice activity detection (VAD); that means getting rid of all the silence. The second stage is speaker type classification; that means tagging all the segments according to gender, or whether the speaker is a child or an adult, or even whether it is TV. Then speaker diarization answers the question "who spoke when", gathering together the segments that belong to the same speaker. Speaker detection asks whether we have John in any segment, so it is a binary decision. And then we can look for John along the recording with speaker tracking. Does it work fine to follow this pipeline? If we have challenges such as a cocktail-party scenario, the answer is no. If we have a 5 dB SNR, the answer again is no.
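To make the ordering of the stages concrete, here is a minimal sketch of this pipeline in Python; every stage is passed in as a callable, and all names (`vad`, `diarizer`, `scorer`, the `Segment` structure) are illustrative placeholders rather than the actual JSALT 2019 code.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float              # seconds
    end: float
    embedding: object = None  # filled in by the embedding extractor

def detect_and_track(recording, enrolled_embedding,
                     vad, type_classifier, diarizer, scorer,
                     threshold=0.5):
    """Run the five stages in order; each stage is an injected callable."""
    speech = vad(recording)                  # 1. drop the silence
    speech = type_classifier(speech)         # 2. tag gender / child / adult / TV
    segments = diarizer(speech)              # 3. "who spoke when"
    # 4. detection: binary decision against the enrolled target
    hits = [s for s in segments
            if scorer(s.embedding, enrolled_embedding) > threshold]
    # 5. tracking: where the target speaks along the recording
    return bool(hits), [(s.start, s.end) for s in hits]
```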

So let's take a look at some numbers on the diarization side. Here we can observe the results we obtained on the datasets we tried, based on the x-vector baseline provided by BUT. We can observe that CHiME-5, with its long recordings, and BabyTrain, the most in-the-wild datasets, got very bad results. We concluded that these bad results arise because we are dealing with far-field microphones, noisy speech, overlapping speech, condition mismatch, non-cooperative speakers, and a bias towards English speech. So we wanted to study these conditions.
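For reference, the diarization error rate quoted throughout is the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker; a minimal sketch, with hypothetical numbers:

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """All arguments are durations in seconds; the result is a fraction."""
    return (missed + false_alarm + confusion) / total_speech

# hypothetical example: 120 s missed, 60 s false alarm, 300 s confused
# over one hour of reference speech -> 0.133, i.e. 13.3 % DER
der = diarization_error_rate(120, 60, 300, 3600)
```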

Now let's see some numbers on speaker recognition. For speaker recognition we compared two systems on two datasets: the first one is SRI and the second one is VOiCES. And we are comparing a close-talking microphone against a far-field one. We can observe that for the far-field microphone the equal error rate doubles, or worse.
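The equal error rate used here is the operating point where the false-rejection and false-acceptance rates meet; a small sketch of how it can be computed from arrays of trial scores (a simple threshold sweep, not the exact tooling used in the workshop):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep the threshold over all observed scores and return the point
    where false rejection and false acceptance are closest to equal."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(frr - far)))
    return (frr[i] + far[i]) / 2
```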

So our main goal was to research, develop and benchmark speaker diarization and speaker recognition systems for real speech, using a single microphone in realistic scenarios that include background noises such as television, music, or other people talking.

One of the questions about the data is: what is it like? Is it like this one, where you are having a meeting? Or is it completely wild, like the one in CHiME-5, where people gathered together to have a party? Or is it a long recording, say a five-hour recording or even longer? Or do we have a far-field microphone in another room that is capturing the voice of the speaker? To cover all these types of data we included these four corpora: AMI, SRI, CHiME-5 and BabyTrain, going from the easiest one to the most atypical one.

For AMI we have a meeting domain, and we used it both for diarization and for detection. For SRI we have a semi-controlled domain; we used it only for detection and not for diarization, because we do not have complete labels for all the speakers. CHiME-5 we used for diarization only; it is a dinner-party domain, and we didn't use it for detection because it usually has only four speakers, which is quite few persons. And BabyTrain we used both for diarization and for detection; it is completely wild and uncontrolled.

The models that we explored, as I said before, are diarization and speaker detection. From diarization we get the labels for all the speakers, and with speaker detection we can track the speaker of interest.

This is the picture of the diarization pipeline. We have a traditional modularized system that is composed of the enhancement, the VAD, the embedding, the scoring, the clustering, the resegmentation and the overlap assignment. We have two types of enhancement: one at the signal level, and another one at the embedding level. The boxes shown in orange are the ones that we explored.
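As a sketch, this modular chain can be written as one function per box; the callables and their interfaces are illustrative assumptions, not the actual implementation:

```python
def diarize(features, vad_mask, embed, score, cluster,
            resegment, assign_overlap):
    """features: (frames, dims) array, already enhanced at the signal level;
    vad_mask: boolean per frame; the remaining stages are injected callables."""
    speech = features[vad_mask]           # keep only speech frames
    X = embed(speech)                     # one embedding per subsegment
    S = score(X)                          # pairwise similarities (e.g. PLDA)
    labels = cluster(S)                   # e.g. agglomerative clustering
    labels = resegment(features, labels)  # frame-level refinement
    return assign_overlap(labels)         # add second speakers in overlaps
```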

Let's start with the enhancement at the signal level. We built an SNR-progressive multi-target LSTM-based speech enhancement model. The progressive multi-target network, or PMT, is divided into sequentially stacked blocks, with one LSTM layer and one fully connected layer enabling multi-target learning per block. The fully connected layer in every block is designed to learn an intermediate speech target with a higher SNR than the previous target. A series of progressive ratio masks are concatenated with the progressively enhanced log-power spectral features as the targets. At test time, we directly feed the enhanced audio, processed by the well-trained enhancement model, to the backend systems.
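A minimal PyTorch sketch of such a progressive multi-target network, assuming 257-dimensional log-power-spectrum inputs; the block count and layer sizes are illustrative:

```python
import torch.nn as nn

class PMTBlock(nn.Module):
    """One block: an LSTM layer plus a fully connected layer that predicts
    an intermediate target at a higher SNR than the previous block's."""
    def __init__(self, in_dim, hidden, feat_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        # target = progressive ratio mask concatenated with the enhanced
        # log-power spectrum, hence 2 * feat_dim outputs
        self.fc = nn.Linear(hidden, 2 * feat_dim)

    def forward(self, x):
        h, _ = self.lstm(x)
        return h, self.fc(h)

class PMTNet(nn.Module):
    def __init__(self, feat_dim=257, hidden=512, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [PMTBlock(feat_dim if i == 0 else hidden, hidden, feat_dim)
             for i in range(n_blocks)])

    def forward(self, noisy_lps):          # (batch, time, feat_dim)
        targets, h = [], noisy_lps
        for block in self.blocks:
            h, t = block(h)
            targets.append(t)              # each target has a higher SNR
        return targets                     # one loss per block at training time
```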

Now that we have a cleaner signal, we can explore the VAD. In this case we have two directions: the one on the top, which is based on MFCCs, and the one on the bottom, which is based on a different front end. Both share a similar architecture: a recurrent layer followed by fully connected layers. The output is speech and non-speech. It is important to note that the lower branch is the one that we chose for our work.
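A minimal sketch of such a VAD branch; the feature dimension and layer sizes are assumptions:

```python
import torch.nn as nn

class SpeechNonSpeechVAD(nn.Module):
    """Per-frame speech / non-speech classifier: a recurrent layer
    followed by fully connected layers, as in the lower branch."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 64), nn.ReLU(),
            nn.Linear(64, 2))              # logits: [non-speech, speech]

    def forward(self, frames):             # frames: (batch, time, feat_dim)
        h, _ = self.lstm(frames)
        return self.head(h)
```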

Although this is not part of the VAD stages, it is also true that the VAD and the embedding network are related in terms of performance, as shown in the table. So we explored the extended TDNN, with VoxCeleb and with VoxCeleb plus augmentation, and we also explored the factorized TDNN, also with augmentation. We can see that the factorized TDNN with augmentation got the best results on BabyTrain and AMI, and was competitive on CHiME-5. So we chose the factorized TDNN for our experiments.
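For context, the factorized TDNN replaces each wide time-delay layer with two low-rank convolutions through a small bottleneck; here is a sketch of one such layer (the semi-orthogonal constraint of the original recipe is omitted, and the sizes are illustrative):

```python
import torch.nn as nn

class FactorizedTDNNLayer(nn.Module):
    """One F-TDNN layer: a time-delay layer (1-d convolution over time)
    factorized into two low-rank convolutions through a bottleneck."""
    def __init__(self, in_dim=512, bottleneck=128, out_dim=512,
                 kernel=3, dilation=1):
        super().__init__()
        self.factor1 = nn.Conv1d(in_dim, bottleneck, kernel,
                                 dilation=dilation, padding=dilation)
        self.factor2 = nn.Conv1d(bottleneck, out_dim, 1)
        self.post = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(out_dim))

    def forward(self, x):                  # x: (batch, features, time)
        return self.post(self.factor2(self.factor1(x)))
```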

Now let's focus on the speech enhancement at the feature level. The idea here is how to train an unsupervised speech enhancement system which can be used as a front-end preprocessing module to improve the quality of the features before they are passed to the embedding extractor. The main idea is to use an unsupervised adaptation system based on CycleGANs. We train a CycleGAN network using log-mel features as the input to each of the generator networks, so we have a clean source signal on the left and the real-domain data on the right. During testing, we map the test data towards the target domain. These enhanced acoustic features are then used by the x-vector extractors. Even though the CycleGAN network was trained for dereverberation, we also tested it on noisy datasets, showing improvements.
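A compact sketch of the training objective under these assumptions (least-squares adversarial terms plus L1 cycle consistency; the generator and discriminator networks themselves are passed in as callables):

```python
import torch
import torch.nn as nn

def cyclegan_losses(G, F, D_clean, D_real, clean, real, lam=10.0):
    """G maps real-domain log-mel features toward the clean domain,
    F maps clean features toward the real domain; D_* are the domain
    discriminators. Only G is kept at test time."""
    mse = nn.MSELoss()                      # least-squares GAN objective
    fake_clean, fake_real = G(real), F(clean)
    adv = (mse(D_clean(fake_clean), torch.ones_like(D_clean(fake_clean))) +
           mse(D_real(fake_real), torch.ones_like(D_real(fake_real))))
    # cycle consistency: real -> clean -> real and clean -> real -> clean
    cyc = ((F(fake_clean) - real).abs().mean() +
           (G(fake_real) - clean).abs().mean())
    return adv + lam * cyc                  # generators' loss for one step
```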

Now let's continue with the overlap detection. Perhaps this architecture may also look familiar here: it is exactly the same as the one used for the VAD approach, but now trained in a way that decides between overlapped and non-overlapped speech. It can be applied on its own or by masking the VAD output; the second approach showed better results.

Let's continue with the overlap assignment. From the resegmentation we get a posterior matrix for each of the speakers; in this example the most probable speakers would be rows one and two. We can combine this with the overlap detector and also with the VAD. Merging these results, we get what we call the overlap assignment: in the regions where the overlap detector tells us that we have two speakers, we put the two most probable speakers in that part.
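In code form, the merge can be sketched like this; the array shapes and masks are assumptions:

```python
import numpy as np

def assign_overlap(posteriors, overlap_mask, vad_mask):
    """posteriors: (frames, n_speakers) matrix from the resegmentation;
    overlap_mask / vad_mask: boolean arrays, one value per frame."""
    order = np.argsort(posteriors, axis=1)[:, ::-1]   # speakers by probability
    labels = []
    for f in range(len(posteriors)):
        if not vad_mask[f]:
            labels.append([])                  # silence: no speaker
        elif overlap_mask[f]:
            labels.append(list(order[f, :2]))  # two most probable speakers
        else:
            labels.append([order[f, 0]])       # single most probable speaker
    return labels
```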

With this, we end our diarization system. But now the question is: what combination of all these things gives good results? In our case, we put together the TDNN VAD, the enhancement, the VB resegmentation and the overlap assignment, and for all the corpora we got nice improvements. For example, on AMI we went from 14.9 percent diarization error rate to 13 percent diarization error rate. For CHiME-5, the dinner corpus, we put together the same combination, and we went from 69 percent diarization error rate to 63 percent diarization error rate. And finally, for BabyTrain, we got a nice improvement from 85 percent diarization error rate to 47 percent diarization error rate. It is important to note here that, put together, these modules really improved the system.

This is the speaker detection pipeline. We have the enhancement at the signal level and also at the embedding level, we have the diarization segmentation, we have the embedding extractor, the backend, the calibration, and finally we get the speaker detection decision. The boxes in orange use the same techniques as in the diarization. So we use the enhancement at two levels: at the signal level and also at the embedding level. The diarization segmentation is fed into the embedding extractor and then into the backend. The embedding extractor, as we already emphasized before, is a factorized TDNN, which is getting the best results for speaker ID; we also used an enhancement module for this embedding extractor. And finally we have the backend and the calibration: the backend is a PLDA, applied on top of the diarization, with augmentation, and the calibration stage leads directly to the speaker detection decision.
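As a sketch of the backend stage only, with cosine similarity standing in for the PLDA scoring used in the actual system and a simple affine calibration:

```python
import numpy as np

def detect(cluster_embeddings, enrollment, a=1.0, b=0.0, threshold=0.0):
    """Score each diarized cluster against the enrolled target, calibrate
    with an affine map (a, b would be learned on held-out trials), and
    take the binary detection decision."""
    def cosine(e1, e2):
        return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    raw = np.array([cosine(e, enrollment) for e in cluster_embeddings])
    llr = a * raw + b                      # calibration stage
    return bool((llr > threshold).any()), llr
```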

The combination that yielded the best results for all our corpora includes the speech enhancement, the spectral augmentation, and the PLDA with augmentation. It is important to note that all of these include the diarization as their first stage. So for AMI we got an improvement, going from 17 percent equal error rate down to 2 percent equal error rate; in terms of minDCF and actual DCF, shown at the bottom, we can also see some improvement. For BabyTrain we can observe the same trend, going from 14 percent equal error rate to 9 percent equal error rate; at the bottom we can observe the minDCF and the actual DCF, where the minDCF got an improvement but the actual DCF did not. For the SRI data our system also improved the results, going from 21 percent equal error rate to 16 percent equal error rate, with the minDCF and the actual DCF following the same trend, getting improvements as well.

Finally, some takeaways I'd like to mention. Diarization is a fundamental stage to perform speaker detection. There are some modules that are really needed to have a competitive system, of course: a good enhancement, a good VAD, good embeddings, and overlap detection and assignment. And speaker detection depends not only on the diarization module, but also on the embedding extractor and on the augmentation.

The future directions of this work are as follows. For the signal-to-signal enhancement and speaker separation, we need some customization; it could be by dataset, by speaker, or by task. For the speech enhancement, we have to explore other architectures, such as transformers, and large-scale training. For the VAD, we need ways to handle domain mismatch; this can be done, for example, using domain adversarial training. For the clustering, we need unsupervised adaptation, to take the overlap into account during the clustering, and also to include the transcription in parallel with the speaker embeddings. For the speaker detection, we need some enhancement for the multi-speaker scenario; that means highlighting the speaker of interest, and also performing better clustering for short segments.

This is our amazing team. I would like to thank all of them very much. Thank you. Are there any questions?