| 0:00:06 | okay, so this talk is about unsupervised compensation of intra-session intra-speaker variability for speaker diarization | 
|---|
| 0:00:17 | okay, so the outline of the talk is the following: I will first define the notion of intra-session intra-speaker variability, which is actually very close to the concept of segment-level variability that was described in the previous presentation | 
|---|
| 0:00:36 | I will talk about how to estimate this variability, in a supervised manner and in an unsupervised manner, and how to compensate for it | 
|---|
| 0:00:45 | then I will propose a speaker diarization system which is supervector-based and which also utilises compensation of this intra-session intra-speaker variability, which from now on I will just call intra-speaker variability | 
|---|
| 0:01:03 | and I will report experiments, summarise, and talk about future work | 
|---|
| 0:01:11 | okay, so why does intra-speaker variability matter? It is the reason why speaker diarization is not a trivial task: if there were no intra-speaker variability, it would be very easy to perform this task | 
|---|
| 0:01:27 | now, when I talk about intra-speaker variability, I mean phonetic variability; energy or loudness variability within the speech of a single speaker; acoustic, speaker-intrinsic variability; speech-rate variation; and even non-speech events, which sometimes introduce more errors and sometimes fewer, and can also cause variability that is harmful to the diarization algorithm | 
|---|
| 0:01:58 | and the relevant tasks for this kind of variability are, first of all, speaker diarization, but also speaker recognition with short training and testing sessions | 
|---|
| 0:02:10 | okay, so the idea now is first to propose a proper generative model to handle this kind of variability. I will start with the classical GMM model, where a speaker is modelled by a GMM and the frames are generated independently according to this GMM | 
|---|
| 0:02:32 | now, in 2005 we proposed a modified model that also accounts for intersession variability. According to this model a speaker is not modelled by a single GMM; rather, it is modelled by a PDF over the session space: every session is modelled by a GMM, and the frames are generated according to this session GMM | 
|---|
| 0:02:57 | so now what we want to do is to augment this model again and propose a richer model. In this case a speaker is a PDF over the session space, a session is in turn a PDF over the segment space, a segment is modelled by a GMM, and the frames are generated independently according to the segment GMM | 
|---|
| 0:03:31 | if we want to visualise this in the GMM supervector space, we can see in this slide four supervectors, corresponding to four different segments of the same speaker recorded in a particular session, session A, and this can be modelled by a distribution | 
|---|
| 0:03:56 | now, if the same speaker talks in a different session, we get another distribution, and we can model the entire set of sessions with a single distribution, which will be the speaker-A distribution; and we can do the same for speaker B | 
|---|
| 0:04:19 | okay, so according to this generative model, we assume that each supervector, for a particular speaker, a particular session, and a particular segment, distributes normally with some mean that is speaker- and session-dependent, and a covariance that is also speaker- and session-dependent. This is quite a general model; we will later make more assumptions in order to make it practical to actually use | 
|---|
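To make the hierarchy concrete, here is a hedged sketch of the model in equations; the symbols are my own notation, not the speaker's slides:

```latex
% speaker -> session -> segment -> frames (notation assumed)
\mu_{\mathrm{spk,sess}} \sim p(\mu \mid \mathrm{spk})
\quad\text{(a session drawn from the speaker's PDF)}
s_{\mathrm{seg}} \sim \mathcal{N}\!\bigl(\mu_{\mathrm{spk,sess}},\, \Sigma_{\mathrm{spk,sess}}\bigr)
\quad\text{(a segment supervector)}
x_t \sim \mathrm{GMM}(s_{\mathrm{seg}})
\quad\text{(frames generated independently from the segment GMM)}
```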
| 0:04:56 | okay, so back in 2007, in a paper called Trainable Speaker Diarization at Interspeech, we used this generative model to do supervised intra-speaker variability modelling for speaker diarization, in the context of trainable diarization | 
|---|
| 0:05:17 | in that case, what we did was assume that the covariance of the intra-speaker variability is speaker- and session-independent, so we have a single covariance matrix, and we estimated it from a labelled development set, which makes it quite trivial to estimate. Once we have estimated this covariance matrix, we can use it to develop a metric induced from it, and use this metric to actually perform the diarization; the techniques we used were PCA and WCCN | 
|---|
| 0:05:59 | okay, so this worked, in 2007, but the problem with this technique is that we must have a labelled development set, and this development set must be labelled according to speaker turns, which can sometimes be problematic | 
|---|
| 0:06:27 | okay, so in this paper, what we do is assume that the covariance matrix is session-dependent: it is not global, as before | 
|---|
| 0:06:50 | and with the algorithm that is described in the next slide, we actually do not need any labelled data; in fact we have no training process at all. We just estimate the covariance matrix, which is session-dependent, on the fly, on a per-session basis | 
|---|
| 0:07:08 | okay, so how do we do that? The first stage is GMM supervector parameterisation. We take the session that we want to diarize, and we first extract features, such as MFCC features, at the standard frame rate. Then we estimate a session-dependent UBM, which is a GMM of fairly low order (a typical order is 64), using the EM algorithm | 
|---|
| 0:07:42 | after that, we take the speech signal and divide it into overlapping superframes at a frame rate of ten per second: we define a chunk of one second as a superframe, with ninety percent overlap. For each superframe we estimate a GMM using standard MAP adaptation from the UBM we just estimated, and then we concatenate the parameters of that GMM into a supervector, which is the representation of that superframe | 
|---|
| 0:08:19 | so at the end of this process the session is parameterised by a sequence of supervectors | 
|---|
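As a rough illustration of this parameterisation stage, here is a minimal sketch of MAP-adapting the UBM means to one superframe and stacking them into a supervector. The relevance factor `r` and the toy dimensions are my assumptions; the talk does not give implementation details.

```python
import numpy as np

def map_adapt_supervector(frames, weights, means, variances, r=16.0):
    """MAP-adapt the UBM means to one superframe and stack them into a
    supervector (illustrative sketch; relevance factor r is assumed)."""
    # log-likelihood of each frame under each diagonal-covariance Gaussian
    log_w = np.log(weights)
    ll = np.empty((frames.shape[0], weights.size))
    for k in range(weights.size):
        diff = frames - means[k]
        ll[:, k] = (log_w[k]
                    - 0.5 * np.sum(np.log(2 * np.pi * variances[k]))
                    - 0.5 * np.sum(diff ** 2 / variances[k], axis=1))
    # posterior responsibilities of the mixture components
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # zeroth/first-order statistics and MAP interpolation of the means
    n_k = post.sum(axis=0)                       # (K,)
    f_k = post.T @ frames                        # (K, d)
    alpha = (n_k / (n_k + r))[:, None]
    adapted = alpha * (f_k / np.maximum(n_k, 1e-10)[:, None]) + (1 - alpha) * means
    return adapted.reshape(-1)                   # supervector of length K*d

# toy usage: a 4-component, 2-dimensional "UBM" and one random superframe
rng = np.random.default_rng(0)
weights = np.full(4, 0.25)
means = rng.normal(size=(4, 2))
variances = np.ones((4, 2))
sv = map_adapt_supervector(rng.normal(size=(100, 2)), weights, means, variances)
```

In the system described above this would run once per overlapping one-second superframe, yielding the sequence of supervectors.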
| 0:08:27 | okay, so the next phase is to estimate the intra-speaker covariance matrix. We are given the sequence of supervectors, and we note that the supervector at time t, s_t, is actually the sum of two components: the first is a speaker mean component, which is speaker- and session-dependent | 
|---|
| 0:08:50 | and the second component is the actual nuisance, the intra-speaker variability, which is denoted by i_t; according to our previous assumption, this nuisance distributes normally with zero mean and a covariance matrix which is session-dependent | 
|---|
| 0:09:10 | now, our goal is to estimate this covariance matrix, and what we do is consider the difference between two consecutive supervectors. So in equation four I define delta_t, which is the difference between the supervectors of two consecutive superframes. We can see that the difference between two superframes is actually the difference between the corresponding intra-speaker variability terms, i_t minus i_{t-1}, plus a component which is usually zero, because most of the time there is no speaker change; but whenever there is a speaker change, this value is non-zero | 
|---|
| 0:09:57 | now, what we do next is something a bit risky: we assume that most of the time there is no speaker change, and that we can live with the noise added by the violations of this assumption, so we just neglect the impact of the speaker changes on the covariance matrix of delta_t. Basically, we compute the covariance matrix of the deltas by simply disregarding the places where there is a speaker change and computing the covariance over all the deltas | 
|---|
| 0:10:35 | so what we get is that if we want to estimate the covariance of the intra-speaker variability, all we have to do is compute the empirical covariance of the delta supervectors; that is what we do, using the equation at the bottom of the slide | 
|---|
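The estimation step above can be sketched in a few lines. The factor of one half reflects that, if consecutive nuisance terms are independent with a common covariance Sigma, then cov(delta) = 2 * Sigma; that factor is my reading of the construction, not something stated explicitly in the talk.

```python
import numpy as np

def intra_speaker_covariance(supervectors):
    """Empirical covariance of consecutive supervector differences, used
    as a session-dependent estimate of the intra-speaker covariance;
    speaker changes are simply neglected, as in the talk."""
    deltas = np.diff(supervectors, axis=0)      # delta_t = s_t - s_{t-1}
    return 0.5 * np.cov(deltas, rowvar=False)   # halve: var(i_t - i_{t-1}) = 2*Sigma

# toy usage on random stand-in "supervectors"
rng = np.random.default_rng(1)
sigma_hat = intra_speaker_covariance(rng.normal(size=(50, 3)))
```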
| 0:11:03 | okay, so once we have done that and obtained some kind of estimate of the intra-speaker variability, we can try to compensate for it. The way we do that is, of course, to assume that it is of low rank, and to apply PCA to find a basis for this low-rank subspace in the supervector space | 
|---|
| 0:11:22 | now we have two possible compensation options. The first is to compensate for the intra-speaker variability in the supervector space, which is only useful if we want to do the diarization in that space. The alternative is feature-based NAP, where we compensate for the intra-speaker variability in the feature space; then we can actually use any other diarization algorithm on the compensated features | 
|---|
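A hedged sketch of the supervector-space option (a NAP-style projection): take the top eigenvectors of the estimated covariance as the nuisance basis and project them out of every supervector. The rank and the toy data are illustrative assumptions.

```python
import numpy as np

def nap_compensate(supervectors, sigma, rank):
    """Remove the low-rank intra-speaker subspace: s <- s - U U^T s,
    where U spans the top-`rank` eigenvectors of sigma."""
    eigvals, eigvecs = np.linalg.eigh(sigma)    # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :rank]              # nuisance basis, largest first
    return supervectors - (supervectors @ U) @ U.T

# toy usage: nuisance variability concentrated along the first axis
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
Xc = nap_compensate(X, np.diag([10.0, 1.0, 0.1]), rank=1)
```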
| 0:11:57 | okay, so what we actually decided is to use NAP and to do the rest of the diarization in the supervector domain | 
|---|
| 0:12:09 | the motivation for that, the reason why, is illustrated by the figures at the bottom of the slide. On the left-hand side we see an illustration in the feature space (this is not real data, just an illustration) where we have two target speakers, and the problem is that a diarization algorithm of course does not know the colour of the points, and it is very hard to separate the two speakers | 
|---|
| 0:12:43 | now, if we work in the supervector space, as in the middle of the slide, we see that the distributions of the two classes tend to be more unimodal; therefore it is much easier to attempt the diarization, and this is of course due to the fact that we do some smoothing, because a superframe is one second long | 
|---|
| 0:13:07 | and if we then manage to remove a substantial amount of the intra-speaker variability, what we get is the illustration on the right-hand side of the slide. Even if we do not know the two colours, that is, if we do not know which points belong to which speaker, it is still reasonable that we can find a way to separate the speakers, because the separation here is very easy | 
|---|
| 0:13:38 | okay, so this was the motivation | 
|---|
| 0:13:43 | okay, so an overview of the algorithm is as follows: first, of course, we parameterise the speech, as shown, to obtain a series of supervectors, and we compensate for intra-speaker variability as I have already described | 
|---|
| 0:13:58 | then we use the Viterbi algorithm to do the segmentation: we just assign a single state for each speaker (actually we also use a length constraint, so there is some simplification here, but basically we use one state per speaker), and all we have to do is to estimate the transition probabilities and the output probabilities | 
|---|
| 0:14:18 | now, the transition probabilities are very easy to estimate: we can just take a very small development set and estimate the statistics of the length of a speaker turn; if we know the mean turn length, we can estimate the transition probabilities | 
|---|
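The transition estimate described above can be sketched as follows; the geometric duration model (a self-loop probability chosen so the expected stay matches the mean turn length) is my assumption about how the mean would be used, not a detail given in the talk.

```python
import numpy as np

def transition_matrix(mean_turn_superframes, n_speakers=2):
    """One state per speaker; under a geometric duration model the
    expected stay in a state is 1 / (1 - p_stay), so we match it to the
    mean speaker-turn length measured on a small development set."""
    p_stay = 1.0 - 1.0 / mean_turn_superframes
    p_switch = (1.0 - p_stay) / (n_speakers - 1)
    A = np.full((n_speakers, n_speakers), p_switch)
    np.fill_diagonal(A, p_stay)
    return A

# e.g. a 3-second mean turn at 10 superframes per second
A = transition_matrix(30.0)
```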
| 0:14:43 | the tricky part is to estimate the output probabilities: the probability of a supervector at time t given speaker i. We do not do this directly; we do it indirectly, and I will describe how in the next slide | 
|---|
| 0:15:04 | so, assuming we can do this in some way, we just apply Viterbi segmentation and come up with a first segmentation, and then we refine the diarization using iterative Viterbi resegmentation: we go back to the original frame-based features, train the speaker HMMs using our previous segmentation, and rerun the Viterbi resegmentation, but now the resolution is better, because we are working at a frame rate of a hundred per second. We do this for a couple of iterations, and that is the end of the algorithm | 
|---|
| 0:15:42 | so what I still have to explain is how to estimate the output probabilities: the probability of supervector s_t given speaker i | 
|---|
| 0:15:55 | okay, so the method is the following (I will give some motivation after this slide). We find the largest eigenvalue of the covariance matrix of all the compensated supervectors from the session, and take the corresponding eigenvector: we take the compensated supervectors, compute their covariance matrix, and take the first eigenvector | 
|---|
| 0:16:25 | then we take each compensated supervector and project it onto this largest eigenvector, so for each time t we get p_t, the projection of supervector s_t onto the largest eigenvector | 
|---|
| 0:16:47 | now, I am not going to prove this here (the details are in the paper; we will just try to get some intuition), but the point is that the log-likelihood ratio we are looking for, the log-likelihood of supervector s_t given speaker one minus the log-likelihood of s_t given speaker two, is actually a linear function of this projection. So all we have to do is to somehow estimate the parameters a and b, and then we can approximate the likelihoods we need in order to plug them into the Viterbi algorithm, just by taking this projection | 
|---|
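As a hedged sketch of this scoring step: compute the leading eigenvector of the compensated supervectors' covariance, project every supervector onto it, and scale by the calibration constants. The toy two-cluster data and the values of `a` and `b` are illustrative assumptions.

```python
import numpy as np

def projection_scores(supervectors, a=1.0, b=0.0):
    """Approximate log p(s_t|spk1) - log p(s_t|spk2) as a * p_t + b,
    where p_t is the projection of the (compensated) supervector onto
    the leading eigenvector; b = 0 assumes equally dominant speakers,
    as in the talk."""
    centered = supervectors - supervectors.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalue order
    p = centered @ eigvecs[:, -1]               # projection onto top eigenvector
    return a * p + b

# toy usage: two well-separated "speakers" along one direction
toy = np.vstack([np.full((10, 2), 5.0), np.full((10, 2), -5.0)])
scores = projection_scores(toy)
```

These scores would then enter the Viterbi pass as the per-superframe emission log-likelihood ratios.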
| 0:17:28 | but just taking this projection | 
|---|
| 0:17:30 | now be is actually | 
|---|
| 0:17:32 | is related to the dominance of of that speaker if we have two speakers and one of them is more | 
|---|
| 0:17:38 | dominant than be will be uh either positive or negative | 
|---|
| 0:17:41 | and in this day in in our experiments we just assumed to be zero therefore we assume that | 
|---|
| 0:17:47 | the speakers are equally dominant | 
|---|
| 0:17:49 | it's about this uh | 
|---|
| 0:17:51 | the conversation of course not true but uh we hope to | 
|---|
| 0:17:55 | accommodate with this uh problem a by using a final a viterbi resegmentation | 
|---|
| 0:18:00 | and a is assumed to be corpus the tape and then call content and we just have to make it | 
|---|
| 0:18:06 | single uh | 
|---|
| 0:18:07 | parameter with timit form this mode of that | 
|---|
| 0:18:10 | and | 
|---|
| 0:18:10 | no rules and do those out in the paper | 
|---|
| 0:18:14 | just to get some motivation, let us pretend we have a very easy problem: in this case the red points are one speaker and the blue points are the second speaker, and this is actually in the supervector space, after we have done the intra-speaker compensation, so the distributions have become quite unimodal. If we take all the blue and red points together, compute the covariance matrix, and take the first eigenvector, what we get is the black arrow | 
|---|
| 0:18:46 | so if we take the black arrow and just project all the points onto it, we get the distribution on the right-hand side of the slide; and if we decide according to the projection onto the black arrow, if we just compute the decision boundary, we see that it is actually exactly the optimal decision boundary (we can compute the optimal boundary because we know the true distributions of the red and blue points, since this is artificial data). So this algorithm works, in this simple case | 
|---|
| 0:19:23 | now, as we increase the amount of residual intra-speaker variability, as if we had not managed to remove almost all of it, we see that the decision boundary we obtain is no longer exactly the optimal one, and this algorithm will actually fail at some point, when there is too much residual intra-speaker variability. So the hope is that our algorithm, which is supposed to compensate for intra-speaker variability, actually manages, on the data we are going to work on, to remove enough of the variability to let this method work | 
|---|
| 0:20:11 | okay, so now I will report experiments. We used a small development set from NIST 2005: one hundred sessions. Actually, I do not think we really need that much data for development, because we estimate only a handful of parameters: the set is used to tune the HMM transition parameters and the a parameter, which is used for the log-likelihood calibration | 
|---|
| 0:20:44 | and we used the NIST 2005 corpus for evaluation. What we did is we took the stereo phone calls and artificially summed the two sides, in order to get two-wire data, and we derived the ground truth from the ASR transcripts provided by NIST | 
|---|
| 0:21:09 | we report the speaker error rate; our error measure is the speaker error rate, computed with the standard NIST evaluation tool | 
|---|
| 0:21:24 | okay, and in order to be able to compare our results to different systems, we use a baseline which is BIC-based, inspired by the LIMSI 2005 system: it is based on detection of speaker changes using BIC, then iterations of Viterbi resegmentation along with agglomerative BIC clustering, followed finally by a stage of Viterbi resegmentation | 
|---|
| 0:21:52 | okay, so these are the main results. On all the data, we obtained a speaker error rate of 6.1 for the baseline; when we just used the supervector-based system, without any intra-speaker variability compensation, we got 4.8; and when we also used intra-speaker variability compensation, we got 2.8. In this setup the supervector GMM order is 64 and the NAP compensation order is 5 | 
|---|
| 0:22:34 | we ran some experiments to try to improve on this; we actually did not manage to, but we tried to see what happens if we change the front end. We found that feature warping actually degrades performance; this has been explained before by the fact that the channel is something we want to exploit when doing diarization, because it may be the case that different speakers are on different channels, so we do not want to remove channel information. Other changes to the front end also slightly degraded performance | 
|---|
| 0:23:13 | we also wanted to check what the error rate would be if we had a perfect estimate of the calibration parameter, and it actually improved only slightly, to 2.6 | 
|---|
| 0:23:31 | some more experiments were aimed at checking the sensitivity of the system to the GMM order and to the NAP compensation order. For the GMM order, the best orders are 64 and 128; I did not try to increase it further, so I do not know what would happen with a higher order. For the NAP order, we see that the system is quite insensitive: even at an order of 15 we still get a speaker error rate of about 3, with the best performance at around order 5 | 
|---|
| 0:24:11 | okay so finally before i am i | 
|---|
| 0:24:13 | and | 
|---|
| 0:24:14 | one of the motivations for this work was to use this the devastation for | 
|---|
| 0:24:19 | speaker I D into in um a | 
|---|
| 0:24:22 | they are in a | 
|---|
| 0:24:23 | to wire data | 
|---|
| 0:24:25 | so we we try to to say whether we get improvements using this devastation | 
|---|
| 0:24:30 | so | 
|---|
| 0:24:30 | what would we did it we we tried we defended acquisition | 
|---|
| 0:24:34 | and systems one of them was that a reference to | 
|---|
| 0:24:37 | a diarisation second one was the baseline diarization | 
|---|
| 0:24:41 | the third was the proposal | 
|---|
| 0:24:42 | it every station | 
|---|
| 0:24:44 | and we tried to apply diarization either only on the test data, or also on the train data, because one of our sponsors actually has this problem: the training data is also summed | 
|---|
| 0:25:01 | therefore we wanted to check the performance also when the training data is summed | 
|---|
| 0:25:09 | and we used a NAP-based speaker ID system that achieves an equal error rate of 5.2 on the NIST 2005 female data set | 
|---|
| 0:25:21 | and what we concluded is that when only the test data is summed, the proposed diarization achieves performance that is equivalent to using manual diarization | 
|---|
| 0:25:32 | and when both the train and test data are summed, there is a degradation compared to the reference diarization | 
|---|
| 0:25:41 | however, the degradation is smaller than what we get for the baseline system when we use the proposed diarization | 
|---|
| 0:25:51 | okay, so to summarize: we have described an algorithm for unsupervised estimation of intra-session intra-speaker variability | 
|---|
| 0:26:01 | and we also described two methods to compensate for this variability: one of them uses NAP in supervector space, and, although i didn't report any results, we also ran experiments using feature-space NAP, that is, NAP in the feature space | 
|---|
| 0:26:18 | and using the two-speaker diarization based on GMM supervectors we got a speaker error of 4.1 | 
|---|
| 0:26:28 | and if we also apply intra-speaker variability compensation within the supervector-based two-speaker diarization system, we get a further improvement, to 2.8 | 
|---|
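A rough sketch of how a GMM supervector can be computed per segment (means-only MAP adaptation from a diagonal-covariance UBM, with the adapted means stacked into one vector). This is a generic textbook construction with hypothetical names, not the talk's actual system.

```python
import numpy as np

def map_adapted_supervector(frames, ubm_means, ubm_vars, ubm_weights, r=16.0):
    """Stack MAP-adapted GMM means into a supervector.

    frames: (t, d) features for one segment.
    ubm_means, ubm_vars: (m, d) diagonal-covariance UBM parameters.
    ubm_weights: (m,) mixture weights. r: MAP relevance factor.
    """
    # Per-frame mixture posteriors under the diagonal-covariance UBM.
    log_p = -0.5 * (((frames[:, None, :] - ubm_means) ** 2) / ubm_vars
                    + np.log(2 * np.pi * ubm_vars)).sum(-1)
    log_p += np.log(ubm_weights)
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                     # zeroth-order statistics
    f = post.T @ frames                      # first-order statistics
    alpha = (n / (n + r))[:, None]           # per-mixture adaptation weight
    means = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm_means
    return means.ravel()
```

A two-speaker diarization along these lines extracts one supervector per short segment and clusters the supervectors into two groups (e.g., with k-means), optionally after compensating intra-speaker variability first.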
| 0:26:40 | and finally, the whole system improves speaker ID accuracy for summed audio, especially for the summed-summed case | 
|---|
| 0:26:50 | now, future work would be first of all to apply feature-space NAP for two-speaker diarization of the conversation, and then to use different diarization systems; this could be interesting | 
|---|
| 0:27:06 | and we did try feature-space NAP, but then we just estimated supervectors, so all we did was on top of the supervector space | 
|---|
| 0:27:16 | and of course we should try to extend this work to diarization of multiple speakers, possibly in broadcast news or in meetings | 
|---|
| 0:27:24 | and of course to integrate other methods, such as inter-speaker variability modeling, which were proposed lately, with this approach | 
|---|
| 0:27:50 | first of all, congratulations on your work | 
|---|
| 0:27:55 | you restricted yourself to two-speaker diarization, so just clustering, something like diarization without model selection | 
|---|
| 0:28:04 | in my opinion the beauty of diarization is the model selection, but anyway | 
|---|
| 0:28:09 | you claim that with this subspace projection you somehow gaussianize the data, right? | 
|---|
| 0:28:19 | yeah, i think that first of all, by using the supervector approach it's more convenient to do diarization, because in some sense your distribution tends to be more unimodal | 
|---|
| 0:28:38 | and also, i claim that you can use techniques similar to those used for speaker recognition, such as inter-session variability modeling, in this domain | 
|---|
| 0:28:50 | so if you want, forget about two speakers, you really have a model selection task | 
|---|
| 0:28:56 | one question would be something like: you consider the results in, let's say, the image of this subspace, after the projection | 
|---|
| 0:29:06 | since you claim these would be gaussian, then traditional model selection techniques, like the bayesian information criterion, would be applicable without tuning, or hopefully so | 
|---|
| 0:29:22 | because the misspecification of the model is obvious when you use it in the MFCC domain, which is clearly multimodal | 
|---|
| 0:29:31 | okay, well, if you consider that there is now also gaussianization, then probably BIC will give you answers in this subspace about the number of speakers as well | 
|---|
| 0:29:44 | the number of speakers, possibly, or integrating it into an HMM framework | 
|---|
| 0:29:49 | so we should consider this, definitely | 
|---|
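For the model selection point raised in this exchange, a toy version of the standard BIC test (one full-covariance Gaussian for all the data versus one per cluster) might look as follows. This is a generic sketch, not something evaluated in the talk; the penalty convention shown is one common choice.

```python
import numpy as np

def gaussian_loglik(x):
    """Log-likelihood of data under a single full-covariance Gaussian (MLE fit).
    With the MLE covariance, the summed Mahalanobis term equals n * d."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True) + 1e-8 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def delta_bic(x, labels, lam=1.0):
    """BIC comparison: one Gaussian for all data vs. one per cluster.
    A positive value favors the split (e.g., two-speaker) hypothesis."""
    n, d = x.shape
    ll_split = sum(gaussian_loglik(x[labels == k]) for k in np.unique(labels))
    ll_single = gaussian_loglik(x)
    n_params = d + d * (d + 1) / 2       # extra mean + extra covariance
    return (ll_split - ll_single) - 0.5 * lam * n_params * np.log(n)
```

In a gaussianized subspace, as the questioner suggests, such a criterion could in principle decide the number of speakers without per-domain tuning of `lam`.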
| 0:30:03 | now a question about the summed condition | 
|---|
| 0:30:07 | how can you define the summed condition in the training set? because if you make the diarization in the training set, you don't know which cluster is the target speaker, okay? | 
|---|
| 0:30:19 | so the method that was decided with the sponsor is that you run diarization, and then, since you have the reference segmentation, you just decide which side has more of an overlap with the segmentation that you got | 
|---|
| 0:30:43 | so you know what portions of the speech you actually need to train on | 
|---|
| 0:30:51 | okay, so because you have the four-wire data and you have a segmentation, you get two clusters | 
|---|
| 0:30:59 | and then you just compare these clusters automatically to the reference to decide which cluster is more correct | 
|---|
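The overlap-based choice between the two clusters can be sketched as follows; representing segments as (start, end) pairs in seconds and the function names are illustrative assumptions.

```python
def overlap(segs_a, segs_b):
    """Total time overlap between two lists of (start, end) segments."""
    total = 0.0
    for a0, a1 in segs_a:
        for b0, b1 in segs_b:
            total += max(0.0, min(a1, b1) - max(a0, b0))
    return total

def pick_target_cluster(clusters, reference):
    """Return the index of the hypothesized cluster that overlaps most
    with the reference (e.g., four-wire) segmentation of the target speaker."""
    return max(range(len(clusters)),
               key=lambda i: overlap(clusters[i], reference))
```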
| 0:31:10 | and that's something that, actually, if you think about it: if someone trains on summed data, one of the ways you can assist him is to apply automatic diarization and just let the user listen to the two clusters, and he'll decide which cluster is the right one | 
|---|
| 0:31:28 | that's the motivation | 
|---|
| 0:31:31 | thank you | 
|---|