Hello. In this talk I'm going to present our work on binary key speaker diarization. This is joint work with the University of Avignon; I myself come from Telefonica Research.

Here is the outline of the talk. First I'll review what speaker diarization is, or at least refresh it for those of you who remember it from the previous talk. Then I'll talk about binary speaker modeling, and about how we bring those two things together into the binary key speaker diarization system that we have just developed. Finally I'll show experiments and conclude with future work.

First, what is speaker diarization? Given an audio recording, we split it among the speakers: we determine who spoke when, without knowing who the speakers are or how many speakers there are.

So where is the state of the art?

Well, for broadcast news, over the last years we have gotten down to around seven to ten percent error, even though that task has not been part of the NIST evaluations since around 2004, and I bet nowadays it's even lower than that. And we have gotten to twelve to fourteen percent for meetings, maybe even nine percent now on meeting data.

These are great results; they mean that diarization should be usable as a building block for other applications, like speaker ID when there are multiple speakers in the recording. But we still have a problem: it's too slow.

Let me give you some numbers. With standard systems, if you develop a diarization system and don't do anything about speed, it is most probably going to run way above one times real time.

If you do try to do something about it: there are at least two systems I know of where people tried. The first one, a couple of years ago, got down to 0.97 times real time on a single core; they did some tweaks to the GMM-based algorithms of a hierarchical bottom-up system and got to just under real time. Further on they said, okay, let's go to the GPU, so we can use sixteen cores or however many the architecture gives us, and they went down to 0.07 times real time. Nowadays it is probably even faster, but it is GPU-based.

So you cannot run it on a mobile phone; it depends on a specific architecture. And this is what we wanted: a system that really is very fast, no matter what architecture you run it on, and that still performs well.

In this case we did it by adapting a recently proposed technique called binary speaker modeling. We also have another poster at this conference on using it for speaker ID. Here we adapted it to diarization, and I'll tell you how we did it; but to follow that, you need to know the basics of binary speaker modeling, so let me explain it a little bit first.

So, these are the basics of it. We have some input acoustic data, and from it we want to obtain a vector of N zeros and ones: the binary key. In a very general way, as shown here, we first extract some acoustic parameters, MFCCs or whatever you want,

and we use what we call a binary key background model, the KBM, which is basically a UBM but trained in a different way, to fit this acoustic data. With this KBM we then obtain these binary keys, one for each chunk of acoustic data, which could be all the data of one speaker or just a couple of seconds. The KBM can be understood in different ways; it is basically a set of Gaussians positioned in a particular way in the acoustic space, the multidimensional acoustic space. Here you have just a one-dimensional example.

As we can see in the example, we first position the Gaussians in the space, and then we take our input data, which could be a chunk of acoustic data or all of a speaker's data, and we see which of these Gaussians are most present in, best represent, that data. From there we extract a binary fingerprint, which has zeros at the positions of the Gaussians that do not represent the data well, and ones at the Gaussians that are matching our data. And that's it.

So how do we do this for an input audio stream; how does it all fit together? On the left side we have our input signal, from which we compute the MFCC acoustic features, and on the right side we have the KBM. The vertical vectors here each have dimensionality N, the number of Gaussians in our KBM, and for each input feature vector we select the best Gaussians: we could take the top one percent, the top two percent, the ten best, whatever we want to use.

Accumulating this over all the input feature vectors X1 to Xn of our data, we get a counting vector, the sum of those selection vectors, which basically counts how many times each Gaussian has been selected as one of the best-representing Gaussians for the acoustic data. Then we just say, okay, the top Gaussians most present in the data get a one, and the rest get zeros.
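As a concrete illustration, the counting-and-thresholding step just described could be sketched as follows. This is a minimal sketch, assuming per-frame KBM Gaussian scores are already available; the `top_gaussians` and `keep_ratio` values are illustrative, not the paper's settings.

```python
import numpy as np

def binary_key(likelihoods, top_gaussians=5, keep_ratio=0.1):
    """Extract a binary key from frame-level KBM scores.
    likelihoods: (n_frames, n_gaussians) array, one score per
    Gaussian per frame (hypothetical input format)."""
    n_frames, n_g = likelihoods.shape
    counts = np.zeros(n_g, dtype=int)
    for frame in likelihoods:
        # select the best-scoring Gaussians for this frame
        best = np.argsort(frame)[-top_gaussians:]
        counts[best] += 1
    # set to 1 the positions of the most-selected Gaussians
    n_keep = max(1, int(keep_ratio * n_g))
    key = np.zeros(n_g, dtype=np.uint8)
    key[np.argsort(counts)[-n_keep:]] = 1
    return key
```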

So once we have a binary vector for two speakers, or for two sets of acoustic data, it is very fast and very easy to compare them, to compute how close they are.

Here is just an example of the type of similarity that can be used. There are many possibilities; working in the binary domain, you just need a way to compare two binary vectors. The one we used in this paper is the following: the numerator adds one for every position where both vectors have a one, and the denominator counts the positions where either of the two vectors has a one. This gives us a score from zero to one, where zero means the two binary keys share nothing and one means they are the same vector. These are our speaker binary models.
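The similarity just described is a Jaccard-style ratio: intersection count over union count. A minimal sketch:

```python
import numpy as np

def key_similarity(a, b):
    """Similarity between two binary keys: (# positions where both
    are 1) / (# positions where either is 1), scored in [0, 1]."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(a, b).sum() / union
```

Because the keys are bit vectors, this comparison is just logical operations and counts, which is what makes clustering in the binary domain so cheap.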

As I said, we have a poster with more experiments about this, and you can go see it here at the conference. Let's see now how we apply this to speaker diarization.

So this is basically the new system that we developed. Even if it looks a bit different or strange, it is just an agglomerative bottom-up system: you can see there is bottom-up clustering going on, and there is a kind of stopping criterion, or cluster selection. Let's look at the blocks from the bottom.

First there is feature extraction, to extract MFCCs or whatever we want. The next block is training: we need to train the KBM, and in this case we train it from the test data itself; we don't use any external data.

Then we take the acoustic features and initialize. As anyone interested in diarization knows, we always need to initialize the system; since we are doing bottom-up clustering, we need many more initial clusters than there are actual speakers, so we need to create those clusters somehow. This part is almost free processing: it is just a nice way of reusing what we have already computed, so it takes only a little bit of the computational time of the system.

After that comes the agglomerative clustering, where we keep joining together the clusters that are closest; this is all happening in the binary space. And finally — and this is one difference from a standard agglomerative clustering system, that we go all the way from N clusters down to one — once we have reached one cluster, we use an algorithm to select how many clusters we should output.

As I said, for the MFCCs we use a standard configuration, computed every ten milliseconds.

And the KBM, as I said, is a model trained in a special way. If you use a model trained with standard EM-ML techniques, the Gaussians end up positioned at the average points, modeling the overall acoustic space optimally; but then the particularities, the discriminative information of the speakers present in your audio, are not well represented. So we try to do something different that can model that.

The size N can be anything above five hundred Gaussians; we can go up to ten thousand and the performance doesn't change, neither do the error rates.

How do we do this? In this paper we do it in the following way. We take the input audio and first train a single Gaussian for every two seconds of speech, with some overlap, so we end up with a pool of around two thousand candidate Gaussians. Since each of these Gaussians was trained on a very small portion of the audio, whenever there is a speaker it represents that speaker very discriminatively.

Then we use a dissimilarity metric to iteratively select the Gaussians that, put together, optimally model the space — choosing at each step the Gaussians most separate from those already selected — until we cover the whole acoustic space. And that's it. This is actually much faster than training a GMM with iterative splitting and EM-ML.
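A possible sketch of this greedy selection, using Euclidean distance between Gaussian means as a stand-in for the dissimilarity metric (the exact metric is not specified in this talk):

```python
import numpy as np

def select_kbm(means, n_select):
    """Greedy KBM construction sketch: from a pool of candidate
    Gaussian means, repeatedly pick the one farthest (max-min
    distance) from those already selected, so the chosen
    Gaussians spread over the acoustic space."""
    means = np.asarray(means, dtype=float)
    chosen = [0]  # start from the first candidate
    while len(chosen) < n_select:
        # for each candidate, distance to its nearest selected mean
        d = np.min(
            np.linalg.norm(means[:, None, :] - means[chosen][None, :, :], axis=2),
            axis=1,
        )
        d[chosen] = -1.0  # never re-pick a selected Gaussian
        chosen.append(int(np.argmax(d)))
    return chosen
```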

Right. Next, the binarization of the data — extracting these binary vectors from the acoustic data — is done in two steps. The first step, computing the K best Gaussians for each acoustic feature, we do one time only. Then in the second step, for every subset of features from which we want to compute a fingerprint, we only need to do the accumulation, which we can repeat every time we need it; this is actually very fast.

So we have the MFCC vectors at the top, and for each of them we get its best Gaussians; from that point on we are no longer working in the acoustic domain. That is the first part, and we can store it in memory. It is done only one time, and it is the expensive part, because we are evaluating Gaussian mixture models, but again, it is one time only. Then every time we need a speaker model we just have to gather those stored selections, accumulate the counts, and from the counts get a binary vector. And this is lightning fast.
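The two-step scheme might look like this (a sketch: the per-frame top-K table is computed once, and each segment fingerprint is then just a cheap count-and-threshold; parameter values are illustrative):

```python
import numpy as np

def precompute_topk(likelihoods, k):
    """Step 1 (done once): indices of the k best-scoring KBM
    Gaussians for every frame."""
    return np.argsort(likelihoods, axis=1)[:, -k:]

def fingerprint(topk, frames, n_gaussians, keep_ratio=0.1):
    """Step 2 (per segment, cheap): accumulate selection counts
    over the chosen frames and keep the most-selected Gaussians
    as ones in the binary key."""
    counts = np.bincount(topk[frames].ravel(), minlength=n_gaussians)
    n_keep = max(1, int(keep_ratio * n_gaussians))
    key = np.zeros(n_gaussians, dtype=np.uint8)
    key[np.argsort(counts)[-n_keep:]] = 1
    return key
```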

Point five: initialization. Here we did something simple. For simplicity we just reuse the KBM: we do a uniform segmentation, and we assign each segment to an initial cluster according to the Gaussians it selects most, that is, the cluster it matches the most. Note that we are already in the binary domain.

Okay, and then we have the agglomerative clustering. For us this is exactly the same as, for example, the ICSI system's agglomerative clustering, except that now everything happens in the binary domain: we compute fingerprints for each of the cluster models, build the similarity matrix between all of them completely in the binary domain, and choose the two closest clusters to merge.
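One merge step of this binary-domain clustering could be sketched as follows (cluster re-training after the merge is omitted):

```python
import numpy as np

def merge_closest(cluster_keys):
    """Find the two most similar cluster binary keys, using the
    intersection-over-union score, and return their indices plus
    the score. One step of the agglomerative loop."""
    n = len(cluster_keys)
    best, pair = -1.0, (0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            a = np.asarray(cluster_keys[i], dtype=bool)
            b = np.asarray(cluster_keys[j], dtype=bool)
            union = np.logical_or(a, b).sum()
            s = np.logical_and(a, b).sum() / union if union else 0.0
            if s > best:
                best, pair = s, (i, j)
    return pair, best
```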

For the reassignment, we just take three seconds of data, one second at a time, compute a fingerprint for each of them, and assign it to the best speaker model.

Last but not least, the last part of the system: once we get down to one cluster, we have to choose what our optimum number of clusters is. For that we adapted a metric that other people presented at Interspeech 2008; I don't have time to define it fully here, but it is in the paper. Basically we estimate the relation between the intra-cluster and inter-cluster distances, which allows us to select the optimal number of clusters. I have to say this is the part of the system I am least happy about, and we will have to improve it.
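As an illustration only — the actual Interspeech 2008 metric is not reproduced here — a selection based on the relation between intra- and inter-cluster distances could look like:

```python
import numpy as np

def pick_best_clustering(intra, inter):
    """Hypothetical stand-in for the selection step: given, for each
    candidate clustering along the merge sequence, the average
    intra-cluster and inter-cluster binary-key distances, pick the
    clustering with the largest gap between them (compact clusters,
    well separated from each other)."""
    intra = np.asarray(intra, dtype=float)
    inter = np.asarray(inter, dtype=float)
    return int(np.argmax(inter - intra))
```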

Of course we evaluate with the diarization error rate, but we also use the real-time factor, since speed is the point of this work. We test on NIST Rich Transcription evaluation data, about thirty-six shows, and I have to say the whole set runs in just about an hour on a laptop PC, so it's pretty fast.

Now the results. The first line shows the results using a basic GMM-based agglomerative system, just a straightforward implementation of the standard one: about twenty-three percent average diarization error, and a running time of about 1.19 times real time. There is no optimization there; it is just a standard implementation.

The last two lines are two configurations, depending on the number of Gaussians we take for the KBM — two possible implementations of our binary system. We can see that the diarization error rate is slightly higher than the baseline's, but the real-time factor is ten times faster, which is pretty good.

I also want to show the importance of how the KBM is trained. If instead we use just a standard GMM — that is the second line of results — the system just breaks: if the KBM does not capture speaker-discriminant information, it simply doesn't work.

I also said that the selection of the number of clusters still doesn't do the job well. If we select the optimal number of clusters after running the system, we actually gain about five percent in error rate, which would be better than our baseline.

This figure just shows the diarization error rate depending on the number of Gaussians; the black line is the average. We can see that after five hundred Gaussians for the KBM the results are more or less flat, so it doesn't matter much whether we use five hundred or six hundred; that's fine.

And this figure shows, for all the meeting shows, our proposed system versus the baseline. You can see that in most cases they are about the same; of course on some shows the baseline makes a two percent difference, but there are also a couple of shows where we are better.

So, what we have shown here is kind of a start. Most recent work adds more and more things on top of a standard system to get little gains in performance; here we instead started a new system from scratch, and we can already get this far. In the next steps we want to improve the binary key fingerprinting, hopefully find a better stopping criterion, and make the system runnable anywhere, maybe getting it working on cell phones.

Thank you very much.

[Question from the audience.]

Oh, sorry — the merging and the speech activity detection are done right at the beginning, at the very beginning.

It is just a standard speech activity detection system, nothing special; it sits right after the acoustic feature extraction at the beginning of the system, and we used an existing speech detection module for it.

[Another question.]

No, no, I just use the acoustic features; I don't merge channels. I use the MDM condition, with multiple microphones, but I just pick a single channel. There are many ideas there, but they haven't worked yet; we have to keep trying.

Okay, since we have run out of time, let's stop here. Thank you.