Good afternoon everybody. Today I am going to present our work, which addresses the problem of online speaker diarization; in contrast, the majority of the work in this area is mainly offline. We also work in a semi-supervised scenario.

I will first provide a brief introduction and the motivation, then describe the system implementation, and finally present some experimental results and the conclusions.

So, I guess most of you are familiar with the problem of speaker diarization: basically, given an audio stream, we want to determine who spoke when. We want to determine an optimal segmentation, in which the segment boundaries correspond to speaker changes, and an optimal speaker sequence, in which each segment is assigned to a specific speaker.

Most of the state-of-the-art systems revolve around offline diarization. However, with the diffusion of smart objects, such as intelligent meeting rooms and smartphones, online diarization has attracted an increasing interest in recent years.

In the literature only a few online diarization systems have been presented, mainly focusing on plenary speeches and broadcast news, where the speaker turns are long.

Our previous work addressed the problem of unsupervised online diarization for meeting data with a single distant microphone condition. Unfortunately, although the results were aligned with previous work, that system did not perform well enough for practical applications that require online diarization.

So, basically, in online diarization we have to deal with the problem of speaker model initialization. After a period of non-speech, when we encounter speech we want to be able to initialize a speaker model, and the question is with which kind of analysis window, with which amount of speech, we initialize that model.

I can choose a short amount of speech in order to decrease the latency of the system. However, as everybody probably knows, the error rate is then much higher, because the speaker models are not well initialized with little data. Otherwise, I can take longer windows, a longer amount of speech. In that case, however, there is the problem of speaker variation: a window that is too long might contain multiple speakers, because the number of speakers tends to increase with longer windows.

So, a way to improve online diarization is to allow the speaker models to be initialized with some initial labeled training data. Our contribution in the presented work is therefore a semi-supervised online diarization system. This kind of approach had already been presented, but in the context of offline diarization. The problem that we try to address is: what amount of seed data is required to reach a performance similar to that of an offline diarization system?

OK, I will continue by describing the incremental MAP adaptation that we use to update the models in our system. We suppose we have a sequence of speech segments from a particular speaker s, and each segment is parameterized by a set of acoustic features. We also have an initial GMM-UBM model with a given number of Gaussian components.
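
For the illustrative sketches that follow (our own, not code from the talk), assume the GMM-UBM and the speaker models are held in a small diagonal-covariance container like this:

from dataclasses import dataclass
import numpy as np

@dataclass
class DiagGMM:
    weights: np.ndarray    # (C,)   component weights, summing to one
    means: np.ndarray      # (C, D) component means
    variances: np.ndarray  # (C, D) diagonal covariances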

We found that, in most of the works available in the literature, the authors initialize the speaker model by MAP adapting the UBM model with the first speech segment, obtaining the first speaker model; then, using the next speaker segment, they adapt the previous model s1, obtaining a new model s2, and so on. However, we found that by doing this the final model is not the same model that would be obtained by adapting the UBM model with all the segments at once.

So, although it is a modest contribution, we found that by calculating the sufficient statistics of each available speech segment against the UBM model every time, and by accumulating those statistics, the final model is more consistent, that is, more similar to the model obtained with offline MAP adaptation.

Just as a reminder, the sufficient statistics for a Gaussian component are the zeroth order, first order and second order statistics. They basically represent, through the posterior of each component, the contribution of every feature vector contained in the segment, computed against the UBM model, and I use all the available segments to adapt the UBM model.

To obtain the new estimates, each parameter is basically an interpolation between the statistics accumulated from the data and the original parameters. The interpolation coefficient depends on the relevance factor, which tells me how much importance to give to the initial parameters rather than to the new data, that is, how conservative I want to be in estimating the new parameters. There is also a normalization term so that the estimated weights sum to one.
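
Written out, this is the standard relevance-MAP update (the notation below is ours, not taken from the talk). For a component c with UBM parameters (w_c, \mu_c, \sigma_c^2), accumulated statistics (n_c, F_c, S_c) over T frames and relevance factor r:

\alpha_c = \frac{n_c}{n_c + r}

\hat{w}_c = \gamma \left[ \alpha_c \frac{n_c}{T} + (1 - \alpha_c)\, w_c \right]

\hat{\mu}_c = \alpha_c \frac{F_c}{n_c} + (1 - \alpha_c)\, \mu_c

\hat{\sigma}_c^2 = \alpha_c \frac{S_c}{n_c} + (1 - \alpha_c)\left(\sigma_c^2 + \mu_c^2\right) - \hat{\mu}_c^2

where \gamma is the scale that makes the adapted weights sum to one. A large r keeps the adapted parameters close to the initial ones; a small r lets the new data dominate.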

So, in sequential MAP adaptation, as we said, the first segment is used to adapt the UBM model, giving the first speaker model s1; then the sufficient statistics are recalculated against the speaker model s1 to train a new model s2, and so on. More in general, I train the s1 speaker model by MAP adapting the UBM model with the first segment, and then, given a new speaker segment i+1, its sufficient statistics are calculated against the previous model s_i; so all the sufficient statistics are computed against the most recent model.

In incremental MAP adaptation, instead, with the first segment we calculate the sufficient statistics against the UBM model to obtain the first speaker model; with the second segment we again compute the sufficient statistics against the UBM model and accumulate them with the previous ones. In this way, the resulting model is the same as the one given by offline MAP adaptation.

So, to summarize: we train the initial speaker model with the first segment, and then, given a new segment i+1, its statistics are obtained by computing the sufficient statistics of the features of the last segment against the UBM model and accumulating them with the previous ones.

As I said, it is a modest contribution, but we found that it brings good improvements in the final diarization results, as we will see later.
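
The difference between the two schemes can be sketched roughly as follows (our own illustration, reusing sufficient_stats() and the DiagGMM container above, plus a hypothetical map_adapt(model, n, F, S, r) helper that applies the relevance-MAP update and returns a new DiagGMM; the relevance factor value is only a placeholder):

def sequential_adaptation(ubm, segments, r=16.0):
    """Each new segment is scored against the previous speaker model."""
    model = ubm
    for X in segments:
        n, F, S = sufficient_stats(X, model.weights, model.means, model.variances)
        model = map_adapt(model, n, F, S, r)   # adapt the most recent model
    return model

def incremental_adaptation(ubm, segments, r=16.0):
    """Statistics are always computed against the UBM and accumulated."""
    n_acc, F_acc, S_acc = 0.0, 0.0, 0.0
    model = ubm
    for X in segments:
        n, F, S = sufficient_stats(X, ubm.weights, ubm.means, ubm.variances)
        n_acc, F_acc, S_acc = n_acc + n, F_acc + F, S_acc + S
        model = map_adapt(ubm, n_acc, F_acc, S_acc, r)  # always adapt the UBM itself
    return model

Because the incremental variant keeps scoring against the same fixed UBM and accumulates the statistics over segments, it reproduces what offline MAP adaptation over the concatenated segments would give, which is the consistency property described above.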

OK, the system implementation. We have a supervised phase and an unsupervised, online phase. In the supervised phase we are allowed a given amount of labeled speech segments per speaker, and with those, after feature extraction, we initialize the models of all the people speaking in the meeting, for example.

In the online phase, instead, the processing is unsupervised: we classify each incoming speech segment, after dividing it into segments of a maximum duration Ts, which represents our latency. These segments are classified against the available speaker models: I determine which speaker model is the most likely, I label the segment according to the speaker model that maximizes the likelihood, and I update that model by incremental or sequential MAP adaptation; we will show results for both.

During the online processing, we derive the sufficient statistics that are then used to update the speaker models. So, in the online processing, I assign each segment to one of the speaker models according to a maximum likelihood criterion: the model that maximizes the likelihood of the features contained in the segment is used to label the segment, and that speaker model is then adapted by either sequential or incremental MAP adaptation. This is the implementation of the system that we use.
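
In rough pseudo-code form, the online phase described above looks like this (again a sketch under our own assumptions; sufficient_stats(), map_adapt() and the DiagGMM container are the hypothetical helpers introduced earlier):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def avg_log_likelihood(X, gmm):
    """Average per-frame log-likelihood of features X under a DiagGMM."""
    log_lik = np.stack([
        multivariate_normal.logpdf(X, gmm.means[c], np.diag(gmm.variances[c]))
        for c in range(len(gmm.weights))
    ], axis=1) + np.log(gmm.weights)
    return float(logsumexp(log_lik, axis=1).mean())

def online_diarization(segments, speaker_models, ubm, r=16.0, incremental=True):
    """Label each incoming segment with the most likely speaker and update that model."""
    labels = []
    acc = {spk: None for spk in speaker_models}        # accumulated stats per speaker
    for X in segments:                                  # X: features of one <= Ts-second segment
        scores = {spk: avg_log_likelihood(X, m) for spk, m in speaker_models.items()}
        best = max(scores, key=scores.get)              # maximum-likelihood speaker label
        labels.append(best)
        ref = ubm if incremental else speaker_models[best]
        n, F, S = sufficient_stats(X, ref.weights, ref.means, ref.variances)
        if incremental:                                 # accumulate stats against the UBM
            acc[best] = (n, F, S) if acc[best] is None else \
                tuple(a + b for a, b in zip(acc[best], (n, F, S)))
            speaker_models[best] = map_adapt(ubm, *acc[best], r)
        else:                                           # adapt from the current model
            speaker_models[best] = map_adapt(speaker_models[best], n, F, S, r)
    return labels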

So, now I will present the experimental setup and the experimental results. We used four different datasets, all coming from the NIST Rich Transcription evaluations. The first dataset is used to train the UBM, and it is a set of sixty meeting shows from the NIST RT'04 evaluation. Then we have a development dataset, a set of fifteen meeting shows from the RT'05 and RT'06 evaluations, which is used to develop the system. The evaluation set used to evaluate the system is the set of eight meeting shows from RT'07 and the set of seventeen shows from the RT'09 evaluation. We show the results independently for these two datasets, to allow a better comparison with previous work.

For the experimental setup, we use nineteen mel-frequency cepstral coefficients, plus the energy, computed with a twenty-millisecond window and a ten-millisecond shift, that is, a ten-millisecond overlap. The UBM is trained on the UBM dataset with ten EM iterations and sixty-four Gaussian components. The analysis window, that is, the segment duration that corresponds to the latency of the system, is varied: we analyzed latencies from 0.25 seconds and 0.5 seconds up to four seconds. The amount of training data used to initialize the models ranges from one to thirty-nine seconds. We show results for both sequential and incremental MAP adaptation, with a fixed relevance factor for the MAP adaptation.
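
For illustration only, a configuration like the one described (19 MFCCs, 20 ms window, 10 ms shift, a 64-component diagonal-covariance UBM trained with EM) could be reproduced along these lines; librosa and scikit-learn are our choices here, not necessarily the tools used by the authors:

import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000):
    """19 MFCCs with a 20 ms window and a 10 ms shift, one row per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=19,
                                n_fft=int(0.020 * sr),       # 20 ms analysis window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    return mfcc.T

def train_ubm(feature_list, n_components=64, em_iters=10):
    """64-component diagonal-covariance UBM trained with EM."""
    X = np.vstack(feature_list)
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag",
                           max_iter=em_iters).fit(X)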

The overlapped speech is removed according to the reference transcription. As an offline baseline we use a top-down diarization system, which serves as a reference for the results.

In the first plot, on the left, the results obtained with the sequential MAP adaptation approach are presented. We can see that, by allowing an amount of labeled training data to initialize the models, we manage to perform better than the offline diarization system. The results on the right are those obtained with incremental MAP adaptation, which gives better profiles of the curves, because the models accumulate the statistics. We can see that we reach the offline diarization performance with only five seconds of training data with sequential adaptation, while incremental adaptation works better, needing three seconds; and by allowing more training data we reach a diarization error rate of ten percent. For the lowest latencies, instead, the system does not perform well, because the latency is simply too low.
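
For reference, the diarization error rate (DER) quoted throughout these results is the standard NIST measure:

\mathrm{DER} = \frac{T_{\text{missed speech}} + T_{\text{false alarm}} + T_{\text{speaker error}}}{T_{\text{scored speech}}}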

In this table we present the results for different amounts of training data, namely three, five and seven seconds, and for the different datasets; all these results correspond to a latency of three seconds. We can see that in all cases incremental MAP adaptation works better than sequential MAP adaptation, so accumulating the statistics against the UBM provides better results.

Finally, this graph represents the amount of training data, as a function of the latency, needed to reach the offline diarization performance; all points correspond to a DER of seventeen percent. Here again we can see that incremental MAP adaptation works better than sequential MAP adaptation.

For future work, the goal is to further reduce both the latency and the amount of training data, in order to reach better performance.

So, to conclude: we proposed a semi-supervised online diarization system, and we showed that, on the RT'07 dataset, the system can outperform an offline diarization system with only three seconds of speaker seed data and with a latency of three seconds, when using the incremental MAP adaptation approach. By allowing a higher latency or more seed data, we obtain even lower error rates, around ten percent.

Also, if we accept the inconvenience of initializing the speaker models with some labeled training data, this opens the door to the development of supervised or semi-supervised speaker-discriminative feature transformations, to reduce both the latency and the amount of data.

Thank you.

Thank you. We have time for a few questions.

Thank you for the talk. Does your system need to know how many speakers the conversation contains?

Yes. Usually we assume that we know the number of speakers in advance, in order to initialize the models; that is the main assumption. We are now looking at other ways to introduce new speakers that are unknown at the beginning.

And you divide the data between the speakers, so you assume that all the speakers speak from the beginning?

Yes.

Each speaker presents himself, or something like this?

Yes, exactly; it is as if everybody introduces themselves at the start.

Your reference system does not assume that the number of speakers is known, so it is not a totally fair comparison, is it?

I agree with the comment. When we did these experiments we only had that offline diarization system available, and it was difficult to initialize the speaker models in it; so, since we already had that baseline, we decided to stick with that one as a reference.

And the last question: what practical use do you see for this online diarization, for a first segmentation, for filtering?

OK, another application where this work can be useful, for example, is when you interact with a smart object: you can provide some initial enrollment data for the people that usually use that smart object, and the system can then exploit that initial data.

Thank you very much.

Any other questions?

So, I have a question. From your presentation I understand that you are adapting all the parameters, that is, the weights, the covariances and the means. Have you tried to check what happens if you adapt fewer parameters?

Yes, we tried adapting only the means. With sequential MAP adaptation, adapting all the parameters gives worse results, because each update is based on only a few data. With incremental MAP adaptation, instead, since we accumulate the statistics, those accumulated statistics are also useful for updating the variances and the weights; so, in the incremental case, updating all the parameters gave better results than updating only the means.

And, for example, do you think it would make sense to compare in terms of number of parameters, maybe increasing the number of Gaussians in the UBM? I don't know if that makes sense.

OK, so maybe, for a better comparison, reducing the number of Gaussians, you mean?

No, I was thinking of increasing it: as you go on, for example, when the speaker model becomes more reliable, you could increase the number of components.

We did not try that in this case; we used sixty-four Gaussian components because that was the best configuration in the experiments we did. But that might be a good idea.

There is still time for questions. Questions?

Could you give us an explanation of why your system does better than the offline system? Sorry, why your system does better than the offline system; we would not expect that.

OK, in that case: in previous work we tried a totally unsupervised system, in which we did not know the number of speakers, and the performance was much worse than that of the offline diarization system. As I said before, that comparison was fairer, because the offline diarization system also does not see the number of speakers. In this case, instead, we allow the system to know the number of speakers in advance, in order to get better performance and to target practical applications; knowing the number of speakers already adds a lot of information to the problem, information that the offline diarization system does not have.

I understand, but an offline system can also be run in an online manner; so what is the difference in the end? I mean, you can imitate an online system by using an offline system.

OK, but an offline system basically needs all the audio from the beginning, so you have to wait for all the audio. Also, the offline system that we used as a baseline is computationally very heavy: it runs a lot of segmentation and clustering iterations, so, from the point of view of latency, it would be much worse than the online diarization system.

Any more questions? So, let's thank the speaker again.