0:00:26 So this is the second talk about speaker diarization; here we are trying to focus on a multistream approach.
0:00:35 The baseline technique which we are using is the same as in the previous talk, the Information Bottleneck system. As the previous speaker said, we are trying to look at the combination of the outputs, or actually the combination of different streams at different levels. These are only acoustic streams, so no prior information is used.
0:01:02 Again, most of this work was done by the first author.
0:01:16 The introduction and motivation are the same, or close to the previous talk. Again, we assume that the recordings we are working with are recorded with multiple distant microphones.
0:01:31 As features we use two kinds of acoustic features: MFCC features, which are kind of standard, and then time delay of arrival (TDOA) features. The latter are quite complementary to MFCCs, and nowadays people use them quite a lot for diarization.
0:01:56 Actually this combination is the winning acoustic feature combination: with the Information Bottleneck technique it achieves state-of-the-art results on meeting diarization.
0:02:12 So back to our motivation. Usually the feature streams are combined at the model level: there are separate models, for example GMM models, for the different feature streams, and their log-likelihoods are combined with some weighting. There are also other approaches, like voting schemes between diarization systems, or cascaded approaches where the initialization of one system is done on the output of the other.
0:02:49 Our question is whether these two kinds of acoustic features can be integrated using independent diarization systems rather than independent models; in other words, is there some advantage of a system combination over a model combination? What we mean by system and model combination will hopefully become clear in a few slides.
0:03:16 Maybe one last word about the outline of the talk. First let me say a few words about this Information Bottleneck principle which we use, applied to single-stream diarization, so with no combination of features; then a few words about the model-based combination, the system-based combination, and some hybrid combination; and then the experimental results. Again, we are getting state-of-the-art results with such a system using the Information Bottleneck technique, and there is not too much computational complexity in it, so that is a second advantage.
0:04:04 How does it work, this Information Bottleneck principle? It is a kind of intuitive approach which has been borrowed from document clustering. So suppose at the beginning we have some documents that we want to cluster into C clusters, in our terminology. What is added as side information is some variable Y, which we call the relevance variable, which tells us something about the clustering. In document clustering this relevance variable can be the words of the vocabulary, which of course tell us something about the documents and carry information about the clusters. We also assume that the conditional distribution P(y|x), the distribution of the relevance variable given the input element, is available.
0:05:07 Going back to our problem of speaker diarization, our X is actually a set of elements of the speech: speech segments obtained by uniform segmentation, as we said, and these need to be clustered into C clusters.
0:05:29 The Information Bottleneck principle states that the clustering should preserve as much information as possible between C and Y, while minimizing a distortion term. This distortion we can see as a compression term, for example, or in our case as some regularization: this term is, in our setting, the mutual information between X and C, and if you do not have it, the clustering is probably going to collapse into one global cluster, which is not the desired solution.
0:06:11 So this is the intuitive approach, but in the end it turns out, or it can be proved, that if you maximize this objective function, which is again the mutual information between C and Y minus some weighted mutual information between X and C, you move the problem to a setting where the distributions P(y|x) are going to be compared using a simple divergence. That is the nice point: we do not need to look for some special divergence telling us which clusters should be merged together in this intuitive approach. If we do the derivation, we find out that the Jensen-Shannon divergence should be used for the clustering.
0:07:11 So in the end the approach is pretty simple: it is agglomerative. In each iteration we merge two clusters together based on this divergence: we take the two clusters which have the smallest divergence and we just merge them, and we iterate until some stopping criterion is reached. The stopping criterion is again pretty simple; it is a normalized version of this mutual information between C and Y.
0:07:51 So again, to somehow finalize this algorithmic approach: we have a stopping criterion, and we have a way to measure the similarity between clusters, and it is pretty simple to code.
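As a rough sketch, the greedy agglomerative procedure just described could be coded as follows. This is a simplified illustration, not the system's actual implementation: it uses a fixed target number of clusters instead of the normalized-mutual-information stopping criterion, assumes uniform segment priors, and drops the Lagrange-multiplier term of the full objective.

```python
import numpy as np

def _kl(p, q):
    """KL divergence, skipping zero-probability entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_merge_cost(p, q, w_p, w_q):
    """Weighted Jensen-Shannon merge cost between two clusters with
    relevance distributions p, q and priors w_p, w_q."""
    pi_p, pi_q = w_p / (w_p + w_q), w_q / (w_p + w_q)
    m = pi_p * p + pi_q * q
    return (w_p + w_q) * (pi_p * _kl(p, m) + pi_q * _kl(q, m))

def agglomerative_ib(p_y_given_x, n_clusters):
    """Greedy agglomerative clustering: start with one cluster per
    segment and repeatedly merge the pair with the smallest JS cost."""
    dists = [np.asarray(p, dtype=float) for p in p_y_given_x]
    n = len(dists)
    clusters = [[i] for i in range(n)]
    priors = [1.0 / n] * n
    while len(clusters) > n_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: js_merge_cost(dists[ab[0]], dists[ab[1]],
                                         priors[ab[0]], priors[ab[1]]))
        w = priors[i] + priors[j]
        # the merged cluster's distribution is the prior-weighted average
        dists[i] = (priors[i] * dists[i] + priors[j] * dists[j]) / w
        priors[i] = w
        clusters[i] += clusters[j]
        del clusters[j], dists[j], priors[j]
    return clusters
```

Each merge replaces two clusters by the prior-weighted average of their relevance distributions, which is exactly the quantity the Jensen-Shannon merge cost compares.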
0:08:15 Just a bit of information about the distributions which actually appear here. First, the probability of C given X, where C is a cluster and X is an input segment, is going to be a hard partition, meaning each segment belongs to only one cluster; there is no soft weighting between several clusters. And then there is the probability of Y given C, which is the distribution of the relevance variable for a cluster, and this is what is actually used to do the merging.
0:08:55 Everything should be clearer in this picture. Suppose we have input speech which is uniformly segmented, with for example MFCC features; this is the single-stream approach. The segments are the elements of X. As for the relevance variables, I still did not say what they are: in our case Y is simply the components of a universal background model, a GMM trained on the entire speech, and this is actually the variable with respect to which we do everything. So what you see in the middle are the vectors P(y|x), the posterior probabilities of the components of Y given the input segments. Then comes the clustering, which is again the agglomerative technique, and in the end we get some initial segmentation; finally we do a refinement by training a GMM and doing Viterbi decoding.
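For illustration, here is one way the P(y|x) vectors could be computed from a diagonal-covariance background GMM by averaging per-frame component posteriors over a segment. The exact feature processing and normalization in the described system may differ; treat this as a hedged sketch.

```python
import numpy as np

def gmm_component_posteriors(frames, weights, means, variances):
    """P(y|x): posterior of each background-GMM component for one segment.

    frames:    (T, D) feature frames of the segment
    weights:   (M,) mixture weights of the background GMM
    means:     (M, D) component means
    variances: (M, D) diagonal component variances
    Per-frame posteriors are averaged over the segment's frames."""
    frames = np.atleast_2d(frames)
    diff = frames[:, None, :] - means[None, :, :]          # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                 # (T, M)
    log_post -= log_post.max(axis=1, keepdims=True)        # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    return post.mean(axis=0)                               # (M,)
```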
0:09:58 Now let us go back to the feature combination. In the case of feature combination based on the background models, suppose that we have two features, again MFCC and TDOA, and two background models, each trained on one of the features. What we can simply do is linearly weight these P(y|x) vectors of probabilities with some weights, and this gives us a new matrix of these segment probabilities. How do we get the weights? Of course we train them, or estimate them, on development data, so hopefully they generalize to different data. Once we have this combined P(y|x) matrix, the rest of the diarization system is the same: the combination happens just at the beginning, where we combine these probability vectors, and then we run the same iterative clustering.
0:11:00 So actually this is not new; it has already been published by us at last year's Interspeech. This picture again shows how it is done: there is a matrix of these P(y|x) probability vectors for each stream, and they are simply weighted and added; and then there is the clustering operation and the refinement.
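The model-level combination just described reduces to a convex combination of the two streams' posterior matrices. A minimal sketch, where the 0.7 default is only an illustrative value (in practice the weight is tuned on development data):

```python
import numpy as np

def combine_posteriors_model_level(p_mfcc, p_tdoa, w_mfcc=0.7):
    """Model-level combination: linearly weight the two streams'
    P(y|x) matrices before any clustering is done.

    p_mfcc, p_tdoa: (N, M) matrices of segment posteriors, one row
    per segment, assumed row-normalized."""
    p_mfcc = np.asarray(p_mfcc, dtype=float)
    p_tdoa = np.asarray(p_tdoa, dtype=float)
    # a convex combination of row-normalized posteriors stays
    # row-normalized, so no renormalization is needed
    return w_mfcc * p_mfcc + (1.0 - w_mfcc) * p_tdoa
```

The combined matrix then feeds the same agglomerative clustering as in the single-stream case.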
0:11:25 Now, what is actually new, and what we are trying in this paper, is multiple system combination. So instead of doing the combination before clustering, what happens if we do the combination after clustering? Again, there are two background models trained on the different features, but now there are two complete diarization systems. Each of them iteratively produces its own clusters, and the stopping criteria can actually be different, meaning we can end up with a different number of clusters for each feature stream. At the end we get the distributions P(y|c) for each system, and to go back from these clusters to the initial segmentation, that is, to obtain a combined P(y|x), all we have to do is a simple Bayesian operation.
0:12:21 Again, here is an image of how this is done: there are two diarization systems which do the complete clustering, and in the end we get some clusters. To get back to the initial segments' distributions P(y|x), we just apply this simple operation: we integrate P(y|c) over all clusters c, weighted by P(c|x).
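The Bayesian operation just mentioned is P(y|x) = Σ_c P(y|c) P(c|x). Since the clustering yields a hard partition, P(c|x) is an indicator of the segment's cluster, so the sum reduces to a table lookup. A minimal sketch:

```python
import numpy as np

def system_level_posteriors(p_y_given_c, hard_labels):
    """Map cluster-level relevance distributions back to segments:
    P(y|x) = sum_c P(y|c) P(c|x). With a hard partition each segment
    simply inherits the P(y|c) of the cluster it was assigned to.

    p_y_given_c: (C, M) distribution per cluster
    hard_labels: (N,) cluster index of each segment"""
    p_y_given_c = np.asarray(p_y_given_c, dtype=float)  # (C, M)
    return p_y_given_c[np.asarray(hard_labels)]         # (N, M)
```

The segment-level matrices recovered this way from each stream's system can then be weighted and clustered again, which is the system-level combination.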
0:12:54 Why should this actually work? It is again pretty intuitive: in this case the P(y|x) distributions entering the combination are actually estimated on a large amount of data; they are not estimated on short segments, as in the case of the combination before clustering. Each P(y|c) is estimated on a lot of data, because there are just a few clusters at the end.
0:13:24 The third approach is a hybrid system, which is just the combination of those two: both before clustering and after clustering. For one stream we first do the system combination, and then we combine its output with the other stream as in the combination before clustering. Maybe it is easier to see here with the two streams: for one stream we do the system combination, so we run the clustering, and from its P(y|c) matrices we go back to P(y|x) to get the initial segmentation, or rather the initial probabilities for the segments; for the second stream, say the TDOA stream, we just take its probabilities directly, as in the combination before clustering. Then those two things are simply combined, we get some new P(y|x) matrix, and the P(y|c) matrices are obtained by clustering as before.
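One of the two hybrid variants can be sketched by chaining the two previous operations. Which stream is clustered first and the weight value are illustrative assumptions to be validated on development data:

```python
import numpy as np

def hybrid_combination(p_a_y_given_x, p_b_y_given_c, b_labels, w_a=0.7):
    """Hybrid scheme: stream B is fully clustered first and its
    cluster-level P(y|c) is mapped back to the segments (the
    system-combination side); the result is then linearly weighted
    with stream A's raw segment posteriors P(y|x) (the
    model-combination side) before the final clustering.

    p_a_y_given_x: (N, M) segment posteriors of stream A
    p_b_y_given_c: (C, M) cluster distributions from stream B's system
    b_labels:      (N,) hard cluster label of each segment in system B"""
    p_a = np.asarray(p_a_y_given_x, dtype=float)
    # hard partition: each segment inherits its cluster's distribution
    p_b = np.asarray(p_b_y_given_c, dtype=float)[np.asarray(b_labels)]
    return w_a * p_a + (1.0 - w_a) * p_b
```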
0:14:38 Of course there are two possible cases, depending on which operation is done on which kind of stream, and this is going to be seen in the results table. But again, maybe it is intuitively clear how this should be done. So let me say a few words about the experiments.
0:14:52 We are using the same Rich Transcription data, so only meeting data from the Rich Transcription evaluations. We use the MFCC features and these TDOA features, and the speech comes from a single enhanced speech signal. Again, the weights are estimated on a development set. As before, we are only measuring the diarization error rate with respect to speaker errors, so not the speech/nonspeech errors.
0:15:31 Here are the results. If you remember from the previous talk, the baseline was around fifteen, or fifteen point five, percent, using a single-stream technique with just MFCC features. If we do the combination of MFCC and TDOA features, both in the case of the Information Bottleneck technique and of a kind of HMM/GMM system, you may see that we get around ten and twelve percent. The second point is the weights: these are the weights for weighting the different features, and they weight different quantities. In our case they weight the posterior probabilities which we are combining; in the case of the HMM/GMM they weight log-likelihoods, and that is also why the weights are different. Again, in our case the combination is done using the relevance variables, and as you can see it outperforms the other system.
0:16:35 These are the results for the combination done after clustering, the combination at the system level, as we call it. The baseline of ten point six percent comes from the previous table. If we do the system combination, meaning we combine the relevance-variable distributions after clustering, you may see we are getting a pretty high improvement, almost forty percent. Then there are of course the two possible combinations of system-level and model-level weighting. And again it is pretty straightforward: it is better to do the system combination, or system weighting, with the TDOA features, because they are usually more noisy and they probably need more data for the relevance variables to be well estimated. In the case of the MFCC features, it looks like the model combination works much better, and that is, we believe, the reason. Also, you may look at the weights in the table: the weights which we need to estimate go close to those of the system combination, so instead of zero point seven and zero point three we go to zero point eight; they were estimated on different data, but they generalize to this case.
0:18:00 Just a bit of explanation of why we are possibly getting such an improvement. If you look at the single-stream results for each meeting, seventeen meetings in this case, both for the model combination and the system combination, and you look at the bottom rows, which are just the simple MFCC and TDOA single-stream Information Bottleneck techniques with no combination of different features, you may see that most of the improvement comes in cases where there is a big gap between those two single-stream techniques. Where there is no big gap between the MFCC and TDOA single streams, of course you do not get the improvement; but where there is a big gap, the system combination works pretty well for such a meeting.
0:18:51 Just to conclude the paper: here we presented a new technique, or a new way, of combining streams of acoustic features. So rather than, as we did before, weighting the acoustic features before clustering, here we presented a technique which tries to do the weighting after clustering. The reason is simple: the relevance variables which are used to match the different clusters or different segments are then going to be estimated on more data, not just on short segments. And as was seen in the results, we are getting a pretty good improvement with such a technique, around forty percent averaged over all seventeen meetings.
0:19:43 I think I'm done.
0:19:46 [Question session; audio unintelligible]