Speech Transcript - MULTISTREAM SPEAKER DIARIZATION THROUGH INFORMATION BOTTLENECK SYSTEM OUTPUTS COMBINATION

0:00:26	so um this is the second talk
0:00:29	about uh again speaker diarization where we are trying to focus on a multistream approach
0:00:35	and uh actually the baseline technique which we are using
0:00:40	is the same as in the previous talk which is uh
0:00:43	information bottleneck system
0:00:46	and uh we are mainly as the title is saying trying to look at the
0:00:50	combination of the outputs or a combination actually of different streams on different levels
0:00:56	and these are only acoustic streams so no prior information
0:01:00	from brawl statistic so
0:01:02	again
0:01:03	um
0:01:04	and the first
0:01:05	author here
0:01:07	this was done by deepu
0:01:08	was was D is you them to a T D a we let
0:01:12	for for should L D
0:01:14	and um
0:01:16	uh introduction and motivation
0:01:18	is the same or
0:01:19	kind of close
0:01:21	um again we suppose uh or we assume that uh
0:01:26	the recordings which we are working with are recorded with multiple distant microphones
0:01:31	as um actually features we are using two kinds of acoustic features and that is
0:01:37	mfcc features which are kind of standard
0:01:39	and then
0:01:40	the time delay of arrival tdoa features
0:01:44	um
0:01:45	which look they are pretty uh complementary to mfcc
0:01:49	and uh um people nowadays use them quite a lot for
0:01:53	for uh diarization
0:01:56	actually this combination
0:01:58	meaning acoustic feature combination
0:02:00	with uh
0:02:01	uh information bottleneck technique
0:02:04	achieves
0:02:06	state-of-the-art results in meeting diarization
0:02:12	um so back to the motivation so usually the feature streams are combined on a model level
0:02:19	so
0:02:19	there are separate models gmm models
0:02:22	for
0:02:23	different uh actually uh streams
0:02:25	and this is the usual way
0:02:28	and in the end
0:02:29	these uh actually uh log-likelihoods are combined
0:02:32	with some you know weighting
0:02:34	and there are also some other approaches like voting schemes between
0:02:38	these uh systems
0:02:39	diarization systems already
0:02:41	or actually the initialization
0:02:44	of one system is done on the output of the other system or some integrated approach
0:02:49	our actual question is uh if
0:02:52	mfcc and tdoa these two kinds of different acoustic features
0:02:56	can be integrated using independent diarization systems
0:02:59	rather than independent
0:03:01	models or in other words
0:03:03	whether there is actually some advantage of using systems rather than
0:03:07	model combination
0:03:08	what do we mean by system or model combination i hope is going to be clear
0:03:12	uh shortly
0:03:14	in two slides
0:03:16	um
0:03:19	so maybe the last one about the outline of the talk so first let me say a few words
0:03:24	about this
0:03:25	information bottleneck principle which we use
0:03:28	and which is actually done on single stream diarization so no combination of features
0:03:34	and also a few words about model based combination about
0:03:38	system based combination some hybrid combination
0:03:40	and the experimental results
0:03:43	again uh state-of-the-art results using actually
0:03:47	uh this uh
0:03:48	information bottleneck technique
0:03:51	um
0:03:53	um we are getting state-of-the-art results with such a system and there is not too much
0:03:58	computational
0:03:59	complexity in that
0:04:01	um
0:04:02	so this is uh kind of the advantage
0:04:04	uh how does it work
0:04:06	this information bottleneck principle
0:04:08	um actually this is a kind of intuitive approach which has been borrowed from uh
0:04:13	from document clustering so
0:04:15	at the beginning suppose that we have some documents that we want to cluster into
0:04:20	C clusters
0:04:21	in our terminology
0:04:22	and
0:04:24	um
0:04:25	and uh
0:04:27	what is actually
0:04:28	what is added there as the information is some variable Y which is a variable
0:04:33	of interest
0:04:35	or we call it a relevance variable which surely knows
0:04:39	something about this clustering so in this uh
0:04:43	document clustering this Y uh Y variable can be
0:04:47	can be words
0:04:49	of the vocabulary which
0:04:51	of course tell us about uh about these uh
0:04:55	clusters and have information about
0:04:57	about X
0:04:59	so actually we assume that there is a conditional distribution P Y X so Y given
0:05:04	X is available
0:05:06	and back
0:05:07	and going back to this uh problem of speaker diarization
0:05:11	our X is actually a set of elements
0:05:15	of the speech so again
0:05:17	speech uh segments
0:05:20	again uniform segmentation we assume and
0:05:23	these need to be
0:05:24	uh
0:05:25	uh clustered into C clusters
0:05:29	so this information bottleneck principle states
0:05:32	uh that the clustering should be preserving as much information as possible between
0:05:38	C and Y
0:05:40	while minimizing the distortion this distortion we can see as
0:05:44	uh some
0:05:45	compression for example
0:05:47	or
0:05:48	also in our
0:05:49	way it's actually some regularization so if you don't have uh
0:05:54	this distortion
0:05:56	which is actually in our terms uh
0:05:59	mutual information
0:06:00	of X and C so I X and C
0:06:03	uh if you don't have it it's probably going to
0:06:06	cluster into one global cluster which is not the case we want
0:06:11	so again this is an
0:06:13	intuitive approach
0:06:15	but in the end it looks that uh
0:06:18	or it can be proved
0:06:19	that
0:06:20	if we actually
0:06:21	are going to
0:06:23	um
0:06:24	optimize this objective function which is again
0:06:27	uh mutual information of C and Y
0:06:30	minus
0:06:31	some
0:06:32	uh lagrange multiplier uh I X and C
0:06:35	uh we are going to
0:06:37	actually uh
0:06:39	to move the problem to the
0:06:41	uh to the way where
0:06:42	uh the probabilities
0:06:44	those
0:06:45	P Y given X are going to be
0:06:48	uh
0:06:49	measured using a simple divergence
0:06:52	so
0:06:53	that's the point so we don't need to look for some
0:06:55	special distance measure which is saying
0:06:58	which clusters we should merge together
0:07:01	in this uh
0:07:02	intuitive approach
0:07:04	by doing the derivation we will find out that actually it should be the jensen-shannon uh divergence
0:07:09	used for
0:07:10	for clustering
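
Written out, the objective and the merge cost described in this passage take roughly the following form. The symbol \beta for the Lagrange multiplier, the weights \pi_i, \pi_j and the mixture m are notational assumptions that do not come from the talk, and the merge cost shown is the standard agglomerative information bottleneck form rather than something quoted from the slides.

```latex
% Objective: keep information about Y, penalise information kept about X
\mathcal{F} = I(C;Y) - \frac{1}{\beta}\, I(X;C)

% Cost of merging clusters c_i and c_j
\Delta\mathcal{F}(c_i, c_j) = \bigl(p(c_i) + p(c_j)\bigr)\, d_{ij}

d_{ij} = JS_{\Pi}\bigl[p(y \mid c_i),\, p(y \mid c_j)\bigr]
       - \frac{1}{\beta}\, JS_{\Pi}\bigl[p(x \mid c_i),\, p(x \mid c_j)\bigr]

% Weighted Jensen-Shannon divergence
JS_{\Pi}[p, q] = \pi_i\, D_{KL}(p \,\|\, m) + \pi_j\, D_{KL}(q \,\|\, m),
\qquad m = \pi_i\, p + \pi_j\, q,
\qquad \pi_i = \frac{p(c_i)}{p(c_i) + p(c_j)},\;
       \pi_j = \frac{p(c_j)}{p(c_i) + p(c_j)}
```
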
0:07:11	so in the end uh the approach is pretty simple
0:07:16	it is going to be
0:07:17	so here it's actually agglomerative
0:07:20	so in each iteration
0:07:23	we are uh
0:07:24	we are merging two clusters together based on the information
0:07:28	uh from this uh divergence so we take those clusters which have
0:07:32	the smallest divergence and we just merge them
0:07:34	and we do it iteratively
0:07:36	until
0:07:37	we reach some stopping criterion
0:07:39	the stopping criterion
0:07:40	is again pretty simple and it is actually a normalized
0:07:44	value of
0:07:46	i go back
0:07:47	uh this mutual information between C and Y
0:07:51	so again to somehow
0:07:55	finalize this uh approach
0:07:58	uh
0:07:59	we have a stopping criterion and we have actually
0:08:03	the um
0:08:05	way how to measure the
0:08:07	similarity between clusters
0:08:09	and uh
0:08:11	it's pretty simple
0:08:12	to encode it you know
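
The merging loop and stopping rule described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform segment priors, the `beta` value, the normalisation of I(C;Y) by I(X;Y), and the threshold of 0.5 are all hypothetical choices made for the example.

```python
import numpy as np

def js_divergence(p, q, w_p, w_q):
    """Weighted Jensen-Shannon divergence between two distributions."""
    m = w_p * p + w_q * q
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return w_p * kl(p, m) + w_q * kl(q, m)

def mutual_information(p_c, p_y_given_c):
    """I(C;Y) in nats from cluster priors and per-cluster P(y|c)."""
    joint = p_c[:, None] * p_y_given_c
    p_y = joint.sum(axis=0)
    indep = p_c[:, None] * p_y[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

def agglomerative_ib(p_y_given_x, nmi_threshold=0.5, beta=10.0):
    """Greedy merging of segments; stops before NMI drops below the threshold."""
    n = p_y_given_x.shape[0]
    p_c = np.full(n, 1.0 / n)                 # assumption: uniform segment priors
    p_y_c = np.array(p_y_given_x, dtype=float)
    labels = np.arange(n)                     # every segment starts as its own cluster
    i_xy = mutual_information(p_c, p_y_c)     # I(X;Y), used to normalise I(C;Y)
    while len(p_c) > 1:
        best = None
        for i in range(len(p_c)):
            for j in range(i + 1, len(p_c)):
                p_m = p_c[i] + p_c[j]
                w_i, w_j = p_c[i] / p_m, p_c[j] / p_m
                js_y = js_divergence(p_y_c[i], p_y_c[j], w_i, w_j)
                # for a hard partition the JS term on P(x|c) equals the entropy of the weights
                h_w = -(w_i * np.log(w_i) + w_j * np.log(w_j))
                cost = p_m * (js_y - h_w / beta)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        p_m = p_c[i] + p_c[j]
        merged = (p_c[i] * p_y_c[i] + p_c[j] * p_y_c[j]) / p_m
        new_p_c = np.delete(p_c, j)
        new_p_c[i] = p_m
        new_p_y_c = np.delete(p_y_c, j, axis=0)
        new_p_y_c[i] = merged
        # stopping criterion: normalised mutual information I(C;Y) / I(X;Y)
        if i_xy > 0 and mutual_information(new_p_c, new_p_y_c) / i_xy < nmi_threshold:
            break
        p_c, p_y_c = new_p_c, new_p_y_c
        labels[labels == j] = i
        labels[labels > j] -= 1
    return labels, p_c, p_y_c

# toy input: four segments, two relevance variables, two obvious groups
toy = np.array([[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.85]])
toy_labels, toy_p_c, toy_p_y_c = agglomerative_ib(toy)
```

The pairwise search makes each iteration quadratic in the number of clusters, which is fine for a sketch but would need a cached cost matrix for real meeting-length input.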
0:08:14	so
0:08:15	um
0:08:17	just a little information about uh those probabilities which are actually here so
0:08:21	we basically suppose that uh the probability of C given X where C is a cluster and X
0:08:27	is an input uh segment
0:08:28	is going to be a hard
0:08:30	partition meaning
0:08:31	it always
0:08:32	belongs only to one cluster
0:08:34	there is no like
0:08:35	uh weighting between several clusters
0:08:39	and then the probability Y given C which is actually
0:08:43	the relevance variable
0:08:46	distribution
0:08:47	which is used actually to do this merging
0:08:51	and um
0:08:55	everything should be more clear on this
0:08:58	on this picture
0:08:59	so i mean suppose we have input speech which is uniformly segmented
0:09:04	for example mfcc features in this single
0:09:07	stream approach
0:09:09	we have uh elements of
0:09:11	these
0:09:12	relevance variables
0:09:13	i still didn't say what it is but
0:09:15	it's probably intuitive in
0:09:17	our case it is just a universal background model
0:09:20	trained on the entire speech
0:09:22	and uh
0:09:23	uh this is actually defining the variables on which to do the thing so
0:09:28	actually the vectors which you see in the middle are vectors P Y given X which
0:09:33	are
0:09:33	probabilities
0:09:35	for a vector Y given
0:09:37	uh the input segments
0:09:40	and um
0:09:42	then the clustering which is again an agglomerative technique and in the end we get some initial segmentation
0:09:48	and finally we do refinement by
0:09:51	training a gmm and doing viterbi decoding
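
As an illustration of the first step of this pipeline, the sketch below turns frame-level features into one P(Y|X) vector per uniform segment, using the components of a background GMM as the relevance variables. The talk does not spell out the estimator, so averaging the frame posteriors within a segment, the diagonal covariances, and the names `segment_posteriors` and `seg_len` are assumptions made for the example.

```python
import numpy as np

def segment_posteriors(frames, weights, means, variances, seg_len):
    """P(y|x) per uniform segment from a diagonal-covariance GMM.

    frames: (T, D) features, weights: (K,), means and variances: (K, D).
    Assumption: a segment's posterior is the mean of its frame posteriors.
    """
    diff = frames[:, None, :] - means[None, :, :]                 # (T, K, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )                                                             # (T, K)
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)               # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                       # frame-level P(y|frame)
    n_seg = len(frames) // seg_len                                # drop a trailing partial segment
    return post[: n_seg * seg_len].reshape(n_seg, seg_len, -1).mean(axis=1)

# toy input: one-dimensional features, two well separated components
toy_frames = np.array([[0.0], [0.1], [10.0], [9.9]])
toy_p_y_x = segment_posteriors(
    toy_frames,
    weights=np.array([0.5, 0.5]),
    means=np.array([[0.0], [10.0]]),
    variances=np.array([[1.0], [1.0]]),
    seg_len=2,
)
```

Each row of the result is one segment's distribution over the relevance variables, which is the matrix the clustering step operates on.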
0:09:58	now let's go back to
0:09:59	to the feature combination
0:10:02	so in case of uh
0:10:04	uh a feature combination which is based on the background models so suppose that we have two
0:10:09	features again uh mfcc and tdoa
0:10:13	and we have two background models
0:10:15	uh which are trained on such features
0:10:18	uh what we can simply do is that
0:10:19	we uh we can just linearly weight
0:10:22	these uh
0:10:23	P Y given X uh
0:10:25	vectors or probabilities
0:10:27	with
0:10:27	some weight
0:10:28	and it's going to be a new matrix
0:10:31	of these relevance variable probabilities
0:10:33	and
0:10:34	these weights
0:10:36	how to get them of course we train them or estimate them on the development data so
0:10:41	we should be generalizing on different data
0:10:43	also once
0:10:45	we have actually this P Y X matrix the rest of the diarization system is the same so we
0:10:49	actually do it just at the beginning where we combine these
0:10:53	relevance variables
0:10:54	and then we just do an iterative
0:10:57	approach to
0:10:58	to do clustering
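
The combination before clustering amounts to a weighted sum of the two P(Y|X) matrices. The sketch below assumes both background models have the same number of components so that the matrices have the same shape; the function name and the toy weight of 0.7 are illustrative only, since in the talk the weights are estimated on development data.

```python
import numpy as np

def combine_before_clustering(p_y_x_mfcc, p_y_x_tdoa, w_mfcc):
    """Linear weighting of two P(y|x) matrices of shape (segments, relevance variables)."""
    p_y_x_mfcc = np.asarray(p_y_x_mfcc, dtype=float)
    p_y_x_tdoa = np.asarray(p_y_x_tdoa, dtype=float)
    if p_y_x_mfcc.shape != p_y_x_tdoa.shape:
        raise ValueError("both streams must share segments and relevance variables")
    if not 0.0 <= w_mfcc <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # the weights sum to one, so every row stays a probability distribution
    return w_mfcc * p_y_x_mfcc + (1.0 - w_mfcc) * p_y_x_tdoa

# toy input: two segments, two relevance variables per stream
toy_mfcc = np.array([[0.9, 0.1], [0.2, 0.8]])
toy_tdoa = np.array([[0.6, 0.4], [0.5, 0.5]])
toy_combined = combine_before_clustering(toy_mfcc, toy_tdoa, w_mfcc=0.7)
```

The combined matrix is then passed to the same clustering and refinement steps as in the single stream case.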
0:11:00	so actually this is not new this has already been uh
0:11:03	published at the last interspeech
0:11:06	um this is just again the graph how it is done
0:11:10	again there is a matrix called
0:11:11	P Y X
0:11:13	probability
0:11:14	um
0:11:15	vectors
0:11:17	and they are simply
0:11:18	weighted uh by the weights
0:11:21	yeah and then there is a clustering operation and refinement
0:11:25	now what is actually new and what uh what we are trying in this paper is uh
0:11:30	multiple system combination
0:11:32	so
0:11:33	instead of doing the combination before clustering uh what would happen if you do the combination after clustering
0:11:39	so
0:11:40	um
0:11:41	again we assume that there are two background models
0:11:44	trained on different uh features
0:11:46	and there are two diarization systems in the end so
0:11:49	uh we
0:11:50	actually iteratively
0:11:52	get some clusters
0:11:53	the stopping criterion actually can be different
0:11:56	meaning
0:11:57	we can have a different number of clusters for
0:11:59	for feature a or for feature b
0:12:02	in the end we get uh Y given X
0:12:06	or Y given C actually
0:12:08	and
0:12:09	and
0:12:09	to go back
0:12:11	from these clusters to the initial segmentation
0:12:14	so
0:12:14	that would be P Y X
0:12:16	what we do is just a simple bayesian operation
0:12:20	and um
0:12:21	again there is um
0:12:23	some image how this is done
0:12:25	so again there are two diarization systems
0:12:29	which are doing complete clustering
0:12:32	and in the end we are again getting
0:12:34	um some
0:12:36	we are getting
0:12:37	some clusters and to get actually back
0:12:39	to
0:12:40	to these initial segments P Y given X
0:12:43	uh we just apply those uh simple operations um
0:12:47	and just simply
0:12:48	uh integrate over all P Y C
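
A minimal sketch of this step, assuming the "simple operation" is P(y|x) = sum over c of P(y|c) P(c|x) with a hard partition, so that each segment inherits the relevance distribution of its cluster. The function names, the uniform segment priors used to estimate P(y|c), and the equal stream weights in the toy call are assumptions made for the example, not values from the talk.

```python
import numpy as np

def back_project(p_y_given_x, labels):
    """Re-estimate P(y|x) through the clusters: sum_c P(y|c) P(c|x)."""
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)
    labels = np.asarray(labels)
    n_clusters = int(labels.max()) + 1           # assumes labels 0..K-1, none empty
    p_c_given_x = np.eye(n_clusters)[labels]     # hard partition: one-hot P(c|x)
    counts = p_c_given_x.sum(axis=0)             # number of segments per cluster
    # P(y|c): average over member segments (uniform segment priors assumed)
    p_y_given_c = (p_c_given_x.T @ p_y_given_x) / counts[:, None]
    return p_c_given_x @ p_y_given_c

def system_combination(p_a, labels_a, p_b, labels_b, w_a):
    """Weight the cluster-level estimates of two independent diarization systems."""
    return w_a * back_project(p_a, labels_a) + (1.0 - w_a) * back_project(p_b, labels_b)

# toy input: three segments, two relevance variables, different clusterings per stream
toy_p_a = np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]])
toy_p_b = np.array([[0.5, 0.5], [0.3, 0.7], [0.2, 0.8]])
toy_back_a = back_project(toy_p_a, [0, 0, 1])
toy_system = system_combination(toy_p_a, [0, 0, 1], toy_p_b, [0, 1, 1], w_a=0.5)
```

Note that the two systems may end with different numbers of clusters; the back projection brings both to the same segment-level shape before they are weighted.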
0:12:54	uh
0:12:54	why this should actually work uh is uh again pretty intuitive
0:12:59	in this case uh these P Y X
0:13:02	after combination are actually estimated on
0:13:05	a large amount of data so they are not estimated on those short segments
0:13:09	as in the case of uh combination
0:13:12	before clustering
0:13:13	now each actually Y is uh
0:13:16	estimated
0:13:17	on a lot of data because you have just a few clusters in the end of course
0:13:24	um
0:13:25	the third approach so
0:13:27	actually a hybrid system which is just the combination of those two
0:13:33	uh before clustering and after clustering
0:13:36	so in one case
0:13:38	what we can do is just
0:13:40	as before we just uh
0:13:42	or
0:13:43	in one single stream just do uh
0:13:47	a system combination and then we just uh
0:13:50	combine such output with
0:13:53	the other
0:13:54	stream uh
0:13:56	which is done before clustering so maybe it's better seen here
0:13:59	again two streams
0:14:00	in one case we do this system combination so we do clustering and from this P Y C
0:14:06	we go back to P Y X
0:14:08	to get initial
0:14:09	uh segmentation or initial probabilities for the segmentation
0:14:13	and in the second case actually
0:14:17	we just do this uh um
0:14:20	uh tdoa stream
0:14:22	just
0:14:22	uh
0:14:24	try to do this combination before
0:14:27	clustering
0:14:28	so those two things are simply combined of course
0:14:31	in the end we have some P Y X uh
0:14:34	matrix
0:14:35	and we just do the same as before
0:14:38	of course there are two possible cases uh what should be done on which kind of stream
0:14:43	and uh this is going to be seen the results are going to be seen in the table but again
0:14:47	maybe it's intuitive how this should be done so let me say a few words about the
0:14:51	experiments
0:14:52	uh we are using the same rich transcription data uh seventeen meetings so no i
0:14:58	mean data but only rich transcription
0:15:01	um the mfcc features and these uh
0:15:04	uh tdoa features
0:15:06	um
0:15:07	and uh
0:15:08	and the speech is coming from uh again
0:15:11	um beamformed
0:15:12	uh
0:15:13	single enhanced speech signal
0:15:15	um
0:15:16	again the weights which we need to estimate are estimated on the development set
0:15:21	um as before we are only measuring diarization error rate with respect to speaker error so not speech
0:15:28	or speech nonspeech error
0:15:31	here are the results which we achieved if you remember from the previous uh talk
0:15:36	the baseline was around fifteen or
0:15:38	fifteen point five uh percent
0:15:41	which uh
0:15:42	actually used
0:15:43	single stream techniques so just mfcc features
0:15:46	if you do combination
0:15:48	of mfcc and tdoa features
0:15:51	in case of information bottleneck technique and
0:15:54	in case of the hmm gmm
0:15:56	uh we may see that we get to eleven and twelve percent
0:16:00	um
0:16:01	and the second column is just showing what are the weights those are weights for
0:16:06	weighting the
0:16:08	different features so in case of
0:16:10	because these are different quantities so in our case those are probabilities which are actually
0:16:15	which we are combining
0:16:17	in case of hmm gmm those are log-likelihoods so
0:16:21	that's why also the uh weights are different
0:16:25	and again in our case the combination is done using relevance variables
0:16:29	and this is actually as you can see outperforming the
0:16:33	the hmm gmm system
0:16:35	so these are the results for combination
0:16:38	uh but combination
0:16:40	now
0:16:42	on the um
0:16:43	actually after clustering so
0:16:45	combination on system level as we call it
0:16:47	so in that case this baseline eleven point six percent comes from the previous table
0:16:52	if you do system
0:16:54	combination meaning that we combine these relevance variables after
0:16:58	clustering you may see we are getting pretty high uh almost forty percent uh improvement
0:17:04	and then there are of course two possible combinations of system and model
0:17:08	weighting
0:17:10	um
0:17:11	actually
0:17:12	it looks
0:17:13	and again it's pretty straightforward that
0:17:15	it's better to
0:17:16	to do system
0:17:17	combination or system weighting with the tdoa features because they are usually
0:17:22	mm more noisy
0:17:24	and they probably need more data to be well estimated or at least those relevance variables
0:17:30	need to have more data to be well estimated
0:17:32	in case of a and that's is is features uh it looks at works so much better
0:17:36	so that's the reason
0:17:38	also you may look at the table
0:17:40	at the weights
0:17:41	the weights go close to the
0:17:44	those weights uh which we need to estimate they go close to the system combination so instead of
0:17:50	zero point seven zero point three
0:17:52	we go to zero point eight
0:17:53	and they are estimated on different data but
0:17:56	they generalize
0:17:58	for this case
0:18:00	um
0:18:01	uh
0:18:02	just a bit to explain why
0:18:04	possibly why we are getting such improvement
0:18:07	if you look at the single stream
0:18:09	results
0:18:10	for each meeting seventeen meetings in this case
0:18:13	so
0:18:14	for model combination and system combination
0:18:17	um
0:18:18	and you look at the bottom which is just simple mfcc and tdoa
0:18:23	information bottleneck techniques so there is no combination of different features
0:18:27	you may see that
0:18:28	most of the improvement comes in case
0:18:31	there is a big gap between
0:18:32	those two single stream techniques
0:18:36	here of course you don't get improvement where there is
0:18:38	a big gap between mfcc and tdoa single stream
0:18:42	but system combination works pretty well for such a meeting
0:18:51	and um
0:18:52	just to conclude the paper
0:18:54	uh so here we present a new technique or new way of combination of the streams of
0:19:00	acoustic streams
0:19:01	so rather than as we did before uh
0:19:04	before clustering to weight the acoustic features here we present a technique which
0:19:09	actually is trying to do it after clustering
0:19:11	and the reason uh is simple for that uh probably these relevance variables which
0:19:16	which are used then to merge different uh clusters or different segments
0:19:22	are
0:19:23	going to be estimated on more data
0:19:25	and not just on
0:19:27	short segments
0:19:29	and uh actually uh as was seen in uh
0:19:34	the results we are getting pretty good improvement for
0:19:37	for such a technique so forty percent uh
0:19:40	over all seventeen meetings
0:19:43	um i think i'm done
0:19:46	oh
0:19:47	we
0:19:48	the on spoken
0:19:53	since something that i mean
0:19:55	i no not i think some a specific question
0:20:00	for them
0:20:01	yeah
0:20:03	i for all of the
0:20:04	and and goes to P

Speaker Diarization

Presented by: Petr Motlíček, Author(s): Deepu Vijayasenan, Fabio Valente, Petr Motlicek, Idiap Research Institute, Switzerland