0:00:20 Hello, in this talk I am going to present a fast speaker diarization system based on binary keys. This is joint work with Jean-Francois Bonastre from the University of Avignon; I am coming from Telefonica Research.
0:00:39 Here is the outline. I am going to review what speaker diarization is, or at least remind those of you who remember it from the previous talk. Then I will talk about binary speaker modeling, then how we turn these two things into the binary speaker diarization system that we developed, then experiments, and then conclusions and future work.
0:00:59 First, speaker diarization: given an audio recording, we split it between the speakers. We determine who spoke when, and we do not know beforehand who the speakers are or how many speakers there are.
0:01:16 So, what is the state of the art these days?
0:01:18 Well, with what people have done over the last years, we have gotten to around seven to ten percent error for broadcast news, even though that domain has not been part of the NIST evaluations since 2004, and I bet nowadays it is even lower than that. And we have gotten to twelve to fourteen percent for meetings, maybe even nine percent now on the meeting domain.
0:01:43 These are great results. They mean that diarization should be usable as a building block for other applications, like speaker ID when there are multiple speakers in the data. But we still have a problem: it is too slow.
0:01:58 Let me give you some numbers. With standard systems, if you develop a diarization system and do not do anything about speed, it is most probably going to run way above one times real time.
0:02:10 And if you try doing something about it: there are at least two systems I know of where people were trying to improve speed. The first one, from a couple of years ago, was going down to 0.97 times real time on a single core; they were doing some tweaks to a GMM-based algorithm, a hierarchical bottom-up system, and they were getting to just under real time. Later on they said, okay, let's go to the GPU, so we can use sixteen cores or however many the architecture gives us, and they went down to 0.07 times real time. Nowadays this is probably even faster, but it is tied to a GPU.
0:02:52 So it will not run in a mobile phone; these systems depend on what architecture you have available. And this is what we wanted: a system that is really, really fast, that does not care what architecture it is running on, and that still gets good accuracy.
0:03:14 In our case we achieve this by adapting a recently proposed technique called binary speaker modeling. We also have another poster at this conference on using it for speaker ID. Here we adapt it to speaker diarization, and I will tell you how we did it.
0:03:34 To understand what we do, we need the basics of binary speaker modeling, so let me explain it a little bit more now.
0:03:43 So, this is the gist of it: we have some input acoustic data, and we want to obtain a vector of zeros and ones. In a very general way, as shown here, we just extract some acoustic parameters, MFCC or whatever you want, and we use a binary key background model, the KBM, which is basically a UBM but trained in a different way, to model this acoustic data. Then, with this KBM, we obtain these binary keys for each acoustic segment, which could be the data for one speaker or just a couple of seconds of data.
0:04:30 The KBM can be understood in different ways; basically it is a set of Gaussians positioned in a particular way in the multidimensional acoustic space.
0:04:40 In the figure you have just a one-dimensional example, so we can see it. We first position the acoustic Gaussians in the space, and then we take the input data, which is our acoustic data for a speaker, and we see which of these Gaussians are most present in, best represent, our data. From there we extract a binary fingerprint by putting zeros in the positions of the Gaussians that do not represent the data well, and ones for the Gaussians that are dominant in our data. And that is the key.
0:05:16 So how do we do it algorithmically? How do we put all this together? On the left side we have the input signal, from which we compute the MFCC acoustic features, and on the right side we have the KBM. These are the vertical vectors here: we have vectors whose dimensionality is N, where N is the number of Gaussians in our KBM. For each input feature vector we select the best Gaussians: it could be the top one percent, the top two percent, the top ten, whatever we want to use.
0:06:00 For each input feature vector, from X1 to Xn of our data, given the model, we compute this counting vector, which counts how many times each of the Gaussians has been selected as one of the best-representing Gaussians for the acoustic data. Then, to binarize, we just say: the top N percent, or whatever fraction, of the Gaussians most present in the data are set to one, and the rest are set to zero.
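The counting-and-binarization step just described can be sketched as follows; `top_fraction` and the tie handling are illustrative choices, not necessarily the paper's exact settings:

```python
def binary_key(top_gaussians_per_frame, n_gaussians, top_fraction=0.2):
    """Turn per-frame top-Gaussian indices into a binary key.

    top_gaussians_per_frame: one list per acoustic frame, holding the
    indices of the best-scoring KBM components for that frame.
    """
    # Counting vector: how often each Gaussian was among the best.
    counts = [0] * n_gaussians
    for frame_best in top_gaussians_per_frame:
        for g in frame_best:
            counts[g] += 1
    # Keep the most frequently selected Gaussians as ones
    # (ties at the threshold may yield a few extra ones).
    n_ones = max(1, int(round(top_fraction * n_gaussians)))
    threshold = sorted(counts, reverse=True)[n_ones - 1]
    return [1 if c >= threshold and c > 0 else 0 for c in counts]
```

For example, if Gaussian 0 is selected in three frames and Gaussian 2 in two, those two positions end up as ones in a five-Gaussian key with `top_fraction=0.4`.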
0:06:39 So, once we have a binary vector for two speakers, or for two sets of acoustic data, it is very fast and very easy to compare them, to compute how close they are. Here is just one example of the kind of similarity that can be used; it is one possibility among many. When working in the binary domain you just need some way to compare two binary vectors.
0:07:10 What we used in this paper is the following: the numerator adds one whenever the two vectors both have a one in the same position, and the denominator counts the positions where at least one of the two vectors has a one. This gives a score from zero to one, where zero means the two keys have nothing in common and one means they are the same vector.
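A minimal sketch of this score; as described it matches the Jaccard index over the two keys, though the paper's exact formulation may differ slightly:

```python
def binary_key_similarity(a, b):
    """Similarity between two equal-length binary keys:
    positions where both are 1, over positions where at
    least one is 1 (a Jaccard-style index in [0, 1])."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Two all-zero keys are trivially identical.
    return both / either if either else 1.0
```

Because it only needs AND/OR counts over bit vectors, this comparison is far cheaper than evaluating Gaussian likelihoods, which is the point of working in the binary domain.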
0:07:41 So much for binary speaker modeling. As I said, we have a poster with many more experiments about it, and you can come talk to us there. Let's see now how we apply it to speaker diarization.
0:07:57 This is basically the new system that we developed. Even if it looks a bit different or strange, it is just an agglomerative bottom-up system: you can see there is agglomerative clustering at the bottom, and we have a kind of stopping criterion, or rather a cluster selection, at the end.
0:08:19 Let's go block by block. First, at the bottom, there is the feature extraction, to extract MFCCs or whatever we want. Next we need to train the KBM; in this case we train it from the test data itself, we do not use any external data. Then the features are binarized.
0:08:38 Then we take the acoustic features and, as in any agglomerative diarization, we need to initialize the system: since we are doing a bottom-up system, we need many more clusters than there are actual speakers, so we somehow need to create those initial clusters. This part is done as preprocessing, and it takes just a little bit of the computational time of the system.
0:09:05 After that comes the agglomerative clustering, in which we keep joining together the clusters that are closest; this all happens in the binary space. And finally, once we have reached one cluster, and this is one difference from a standard agglomerative clustering system, we go all the way from N clusters down to one, and then we use an algorithm to select how many clusters we optimally have.
0:09:30 As I said, the MFCCs are a standard setup, computed with a ten-millisecond step over twenty-five-millisecond windows.
0:09:39 The KBM, as I said, is a model trained in a special way. If you use a UBM trained with standard EM-ML techniques, the Gaussians end up positioned at the average points, modeling the overall data optimally, but it is possible that they do not capture the particular regions of discriminant information that the speakers in your audio have. So we try to do something different that can model that. As for the size, it can be anything above five hundred Gaussians; we can go to ten thousand and the performance does not change, nor does the error rate.
0:10:24 How do we do this? In this paper we do it in the following way. We take the input audio and first train a single Gaussian for every, I believe it is two seconds of speech, with some overlap, so we end up with on the order of two thousand Gaussians. Each of these Gaussians models a very small portion of the audio, so whenever there is a speaker, it represents that speaker very discriminatively. Then we use a greedy metric to iteratively select those Gaussians that optimally model the space, the ones most separate from each other, so that together they cover the whole acoustic space. And that's it. This is actually much faster than doing iterative splitting with EM-ML training.
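The KBM construction just described, as a sketch under assumptions: diagonal Gaussians fitted per overlapping window, and a symmetric KL divergence as the separation metric (the talk only says a greedy metric that keeps the selected Gaussians apart, so the exact distance is an assumption here):

```python
def train_window_gaussians(features, win=200, hop=100):
    """Fit one diagonal Gaussian per overlapping window of frames
    (the talk mentions ~2 s windows with overlap; 200 frames at a
    10 ms step). Returns a (mean, var) pair per window."""
    gaussians = []
    for start in range(0, max(1, len(features) - win + 1), hop):
        chunk = features[start:start + win]
        dim = len(chunk[0])
        mean = [sum(f[d] for f in chunk) / len(chunk) for d in range(dim)]
        var = [max(1e-6, sum((f[d] - mean[d]) ** 2 for f in chunk) / len(chunk))
               for d in range(dim)]
        gaussians.append((mean, var))
    return gaussians

def select_kbm(gaussians, n_components):
    """Greedily pick the Gaussian farthest from those already selected,
    so the chosen set spreads over the acoustic space."""
    def skl(g1, g2):
        # Symmetric KL between two diagonal Gaussians.
        (m1, v1), (m2, v2) = g1, g2
        return sum(0.5 * ((v1[d] / v2[d] + v2[d] / v1[d]) - 2
                          + (m1[d] - m2[d]) ** 2 * (1 / v1[d] + 1 / v2[d]))
                   for d in range(len(m1)))
    selected = [gaussians[0]]
    rest = list(gaussians[1:])
    while rest and len(selected) < n_components:
        best = max(rest, key=lambda g: min(skl(g, s) for s in selected))
        selected.append(best)
        rest.remove(best)
    return selected
```

Seeding with the first window Gaussian is another simplifying choice; what matters is the greedy farthest-first selection, which avoids any EM iterations.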
0:11:18 Right. Next, the binarization of the data: we compute these binary vectors from the acoustic data in two steps. The first step, which is finding the K best Gaussians for each acoustic feature, we do one time only. Then, in the second step, for every subset of features that we need to compute a fingerprint from, and that is going to happen many times in our agglomerative iterations, every time we need it, the computation is actually very fast.
0:11:53 So: we have the MFCC vectors at the top, and for each of them we get its best Gaussians, ordered by likelihood. That is the first part, and we can store it in memory; it is done one time only. It is a little expensive, because we are evaluating Gaussian mixture models, but only once. Then, every time we need a speaker model, we just have to gather those stored indices, accumulate the counts, and from the counts get a binary vector. And this is lightning fast.
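The two-step computation can be sketched like this; the diagonal-Gaussian scoring and the representation of the KBM as (mean, variance) pairs are assumptions for illustration:

```python
import math

def top_k_per_frame(features, kbm, k):
    """Step 1 (done once): score every frame against all KBM Gaussians
    and keep only the indices of the k best."""
    def loglik(x, mean, var):
        # Diagonal-Gaussian log-likelihood.
        return -0.5 * sum((x[d] - mean[d]) ** 2 / var[d]
                          + math.log(2 * math.pi * var[d])
                          for d in range(len(x)))
    table = []
    for x in features:
        scores = [(loglik(x, m, v), i) for i, (m, v) in enumerate(kbm)]
        scores.sort(reverse=True)
        table.append([i for _, i in scores[:k]])
    return table

def segment_counts(table, start, end, n_gaussians):
    """Step 2 (done at every merge/assignment): accumulate how often
    each Gaussian appears in the frames of one segment. Cheap: no
    Gaussian evaluation, just counting stored indices."""
    counts = [0] * n_gaussians
    for frame in table[start:end]:
        for i in frame:
            counts[i] += 1
    return counts
```

The expensive likelihood table is built once per recording; every later fingerprint is just an integer count over a slice of it, which is why the clustering iterations stay so fast.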
0:12:31 Very fast. Next I have to talk about the initialization. Here we did something simple: for simplicity we just reuse the KBM. We define the initial clusters from the Gaussians that were chosen first when building the KBM; with those we do a rough segmentation, and each segment is assigned to the cluster it matches the most.
0:13:03 Now we are in the binary domain, and from here on, this is exactly the same as, for example, the ICSI system's agglomerative clustering, except that now everything happens in the binary domain. So, for example, the cluster models are fingerprints computed with our approach over the KBM Gaussians; the cluster comparison is completely binary, comparing all the cluster models and choosing the two that are closest to merge them. And in the reassignment we just take three seconds of data at a time, in one-second steps, compute a fingerprint for each, and assign it to the best speaker model.
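A toy version of the binary-domain agglomeration; ORing keys on merge is a simplification (the real system would recompute the merged key from the pooled counts), and `similarity` is any binary-key score such as the one described earlier:

```python
def agglomerate(keys, similarity, stop_at=1):
    """Bottom-up clustering over binary keys: repeatedly merge the
    two most similar clusters until stop_at clusters remain.
    Returns (segment_ids, key) pairs."""
    clusters = [([i], list(key)) for i, key in enumerate(keys)]
    while len(clusters) > stop_at:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = similarity(clusters[a][1], clusters[b][1])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        ids = clusters[a][0] + clusters[b][0]
        # Simplified merge: bitwise OR of the two keys.
        merged = [x | y for x, y in zip(clusters[a][1], clusters[b][1])]
        del clusters[b]          # b > a, so index a is unaffected
        clusters[a] = (ids, merged)
    return clusters
```

Every pairwise comparison here is a bit-vector operation, which is what makes running the full N-down-to-one merge sequence affordable.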
0:13:55 Last, the final part of the system: once we have gone down to one cluster, we have to choose what our optimal number of clusters is. For that we adapted a technique that other people presented at Interspeech 2008. I do not have time to go into it, and it is all in the paper, but essentially we estimate the relation between the intra- and inter-cluster distances, which allows us to select the optimal number of clusters. I have to say this is the part of the system I am least happy about, and we will have to improve it.
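A loose sketch of an intra/inter-distance criterion of this kind; the exact formulation is in the cited paper, so treat the scoring used here (mean within-cluster similarity minus mean between-cluster similarity) as an assumption:

```python
def select_num_clusters(partitions, segment_keys, similarity):
    """Score each stored partition (a list of clusters, each a list of
    segment indices) by the gap between within-cluster and
    between-cluster similarity, and return the cluster count of the
    best-scoring partition."""
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0
    best_n, best_score = None, None
    for clusters in partitions:
        intra, inter = [], []
        for ci, c in enumerate(clusters):
            # Similarities between segments inside the same cluster.
            for i in range(len(c)):
                for j in range(i + 1, len(c)):
                    intra.append(similarity(segment_keys[c[i]],
                                            segment_keys[c[j]]))
            # Similarities between segments of different clusters.
            for cj in range(ci + 1, len(clusters)):
                for i in c:
                    for j in clusters[cj]:
                        inter.append(similarity(segment_keys[i],
                                                segment_keys[j]))
        score = mean(intra) - mean(inter)
        if best_score is None or score > best_score:
            best_n, best_score = len(clusters), score
    return best_n
```

The idea is simply that at the right number of clusters, members of a cluster look alike while different clusters look different, so this gap peaks there.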
0:14:38 For the experiments, of course we use the diarization error rate, but we also report the real-time factor. And because diarization results vary so much from show to show, we decided to use all of the NIST Rich Transcription evaluation data, which is about thirty-six shows. I should say that the whole set runs in just about an hour on a laptop PC, so it is pretty fast.
0:15:03 These are the main results. The first line shows the results of a classical GMM-based system, just an implementation of the basic one: it gets about twenty-three percent average diarization error rate and a running time of about 1.19 times real time. There is no optimization there; it is just a standard implementation.
0:15:31 The last two lines are two configurations of our binary system, depending on the number of Gaussians we take for the KBM. We can see that the diarization error rate is slightly higher than the baseline's, but the real-time factor is ten times faster, so it is pretty good.
0:15:55 And, to show the importance of how the KBM is trained: if instead we use just a standard GMM, trained with EM, that is the second line of results, and we see that it just breaks. It does not reach any speaker-discriminant quality; it just does not work.
0:16:15 I also said that the selection of the number of clusters still does not do the job well: if we select the optimal number of clusters after running the system, we actually get an error rate that is better than our baseline's.
0:16:34 This plot shows how the diarization error rate varies with the number of Gaussians; the black line is the average. We can see that beyond five hundred Gaussians for the KBM the results are more or less flat, so it does not matter whether it is five hundred or six hundred; it is fine.
0:16:58 And this is a show-by-show comparison over all meetings between our proposed system and the baseline. We can see that in most cases they have about the same output; of course in some shows the baseline makes a two-percent difference or so, but there are also a couple of shows where we are better.
0:17:18 So, to conclude: diarization research had become kind of stagnant, mostly adding things on top of a standard system to get little gains in performance. By starting from scratch with a system like the one we presented, we can go even further.
0:17:39 As for next steps: we can improve the binary key fingerprinting, we hope to find a better stopping criterion, and we want to keep the system mono-core so that maybe it will even work in cell phones. Thank you very much.
0:18:02 [Applause. The moderator opens the floor for questions; the first exchange is partly inaudible.]
0:18:27 Oh, sorry: the merging and speech/non-speech detection are done right at the beginning, at the very beginning. It is just a standard speech activity detection system, nothing special. Let me see if the slide goes back... it is in the acoustic feature extraction at the beginning of the system; we used an existing speech/non-speech detection front end. Thanks for that.
0:18:59 No, no, I just use the acoustic data; I do not do any merging. I use MDM, that is, multiple microphones, but just beamformed, and then I use a single channel.
0:19:13 There are many ideas like that which we have not tried yet; we will have to try them.
0:19:18 Okay, since we ran out of time, let's thank the speaker again.