0:00:38OK, so the title of my talk is the following. We start from the fully
0:00:46offline supervector-based speaker diarization system which we presented at the last Odyssey, two years ago.
0:01:10OK, so the ...
0:01:17it doesn't ...
0:01:24no, it doesn't go into the computer ... not here ...
0:01:38first time I see something like this.
0:01:41OK, so this is the
0:01:42outline of my talk.
0:01:45OK, so just for those who are not familiar with the baseline algorithm:
0:01:50the idea is to take a conversation between two speakers and to
0:01:56do speaker diarization.
0:01:58And the main principle is the following: if you look at this illustration, this is
0:02:02an illustration of the speech features of two speakers, one speaker in blue and the other in
0:02:06red.
0:02:07And if we did not have the color labels, we wouldn't be able to do the
0:02:13separation between these 2 speakers.
0:02:15So the idea is to take the speech, and to do some kind of parameterization
0:02:22into a series of supervectors, representing overlapping short segments.
0:02:27So what we get is what we see here: now we see some kind of
0:02:33separation between the two speakers.
0:02:36And we can also see that every speaker can roughly be modeled by a unique
0:02:42PDF.
0:02:43This is thanks to the supervector representation.
0:02:47And the next step is to improve the separation between speakers by
0:02:54removing some of the intra-session intra-speaker variability.
0:02:59This is the sketch of the algorithm.
0:03:02and here are the actual steps.
0:03:06So first there is the audio parameterization ... the session is taken and a conversation-dependent UBM
0:03:14is estimated.
0:03:16So basically this algorithm doesn't need any development data, the UBM is estimated from the
0:03:22conversation.
0:03:23Then the conversation is segmented into overlapping 1-second superframes,
0:03:28and each superframe is represented by a supervector which is adapted from the UBM.
0:03:35Then there is another step which I am not going to go into in detail, because
0:03:39it's something that we've already presented:
0:03:42what we do is we try to estimate, on the fly, from
0:03:47the conversation, the
0:03:48intra-speaker variability and compensate for it,
0:03:52to improve the accuracy.
0:03:55The next step is to score each superframe as being either from speaker 1 or speaker
0:03:592.
0:04:00This is done by first computing the covariance matrix of the
0:04:05compensated
0:04:06supervectors,
0:04:07then applying PCA to this covariance matrix, identifying the largest eigenvector, and projecting
0:04:15everything onto this largest eigenvector.
0:04:19Then we use Viterbi to do some smoothing, and finally we do Viterbi segmentation in
0:04:26the MFCC space.
0:04:28So this is the baseline.
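A minimal sketch of this scoring step, assuming the compensated supervectors are already stacked as rows of a numpy array; the array sizes and the sign-based labelling are illustrative, not taken from the paper:

    import numpy as np

    def pca_scores(supervectors):
        """Project supervectors onto the largest eigenvector of their covariance."""
        X = supervectors - supervectors.mean(axis=0)   # center the supervectors
        cov = np.cov(X, rowvar=False)                  # covariance matrix of the superframes
        eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition (ascending order)
        w = eigvecs[:, -1]                             # largest eigenvector
        return X @ w                                   # one scalar score per superframe

    # Toy usage: 200 superframes of dimension 50; the sign of the score gives the
    # initial speaker-1 / speaker-2 assignment before Viterbi smoothing.
    scores = pca_scores(np.random.randn(200, 50))
    labels = (scores > 0).astype(int)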
0:04:30There are a few shortcomings to this algorithm.
0:04:34First, we found out that when we apply this algorithm to short sessions,
0:04:39and when I say short, it can be 15 seconds or 30 seconds,
0:04:45it doesn't work that well
0:04:49on short sessions. This is first of all because of insufficient data for estimating
0:04:55all the models and parameters from a single short session,
0:04:59and also because the probability of imbalance in the representation of
0:05:05the speakers increases when we're dealing with short sessions.
0:05:10And also, this algorithm relies heavily on there being some kind
0:05:16of balance between the 2 speakers.
0:05:19Another issue is that this algorithm is inherently offline, and several of
0:05:26our customers require an online solution.
0:05:31So these are the shortcomings.
0:05:34So first, I'll talk about robustness on short sessions, which is important by itself, but
0:05:40is also the first step towards the online
0:05:43algorithm.
0:05:44So the basic idea is to try to do everything that we can
0:05:49offline, from the development set.
0:05:51Instead of training the UBM
0:05:53from the conversation, we just train it from
0:05:56the development set, and also the NAP intra-speaker
0:06:01variability compensation is trained from the development set.
0:06:05But we don't need any labeling of the development set, because
0:06:09our algorithm is unsupervised; it doesn't need speaker labels or speaker-turn
0:06:16labellings, we just need the raw audio.
0:06:19So we take the development set, we estimate the UBM, we estimate the NAP
0:06:26transform, and also the GMM model, in order to make the system
0:06:31more robust to short sessions.
0:06:33The next thing is what we call the outlier-emphasizing PCA.
0:06:38Contrary to robust PCA, which some of you may be familiar with, in our case we're actually interested
0:06:43in the outliers, and
0:06:45we want to emphasize them and give high weight to outliers when we're doing the PCA.
0:06:52To see why, let's look at this illustration.
0:06:56This illustration is for the case where we have 2 speakers
0:06:59and they're balanced: we have the same amount of data from both.
0:07:04If we look at this example, then,
0:07:09if certain conditions actually hold,
0:07:13if we just take the supervectors
0:07:15and apply PCA, the largest eigenvector will actually
0:07:20give us the decision boundary.
0:07:23Now if we have imbalanced speakers, then
0:07:26in many cases the PCA will be dominated by the most dominant speaker,
0:07:32and we won't get the right decision boundary.
0:07:37So what we do is the following:
0:07:40we assign a higher weight to outliers,
0:07:42which are found by selecting
0:07:45the top 10% of supervectors
0:07:47in the given session with the largest distance to the sample mean.
0:07:51So we compute the center of gravity, the sample mean, and we just
0:07:55select the 10% of the supervectors which are most distant
0:08:01from
0:08:02this mean; in this case, these are the outliers,
0:08:06and we just give them the higher weight.
0:08:09And now
0:08:10the PCA suddenly works well in this example.
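A minimal sketch of this outlier-emphasizing PCA, assuming the supervectors are rows of a numpy array; only the top-10% rule comes from the talk, the exact outlier weight is an illustrative assumption:

    import numpy as np

    def outlier_emphasizing_pca(X, top_fraction=0.10, outlier_weight=5.0):
        mu = X.mean(axis=0)
        dist = np.linalg.norm(X - mu, axis=1)              # distance of each supervector to the sample mean
        cutoff = np.quantile(dist, 1.0 - top_fraction)     # top-10% most distant supervectors
        w = np.where(dist >= cutoff, outlier_weight, 1.0)  # emphasize the outliers
        Xc = X - np.average(X, axis=0, weights=w)          # weighted centering
        cov = (Xc * w[:, None]).T @ Xc / w.sum()           # weighted covariance matrix
        return np.linalg.eigh(cov)[1][:, -1]               # largest eigenvector = projection direction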
0:08:15Another problem is how to choose the threshold,
0:08:21because, for example,
0:08:23in this case, when the speakers are imbalanced,
0:08:28if we just take, for example, the center of gravity
0:08:32as the threshold, then we would not be able to distinguish these two
0:08:38speakers
0:08:39correctly.
0:08:41So what we're trying to do is, again
0:08:44according to the same principle, we compute the 10% and 90% percentiles,
0:08:50look at the values that give these percentiles
0:08:54along the largest eigenvector,
0:08:58and we just take these two values, average them, and decide on
0:09:03the threshold more robustly.
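A minimal sketch of this percentile-based threshold, applied to the 1-D projections onto the largest eigenvector (numpy assumed):

    import numpy as np

    def percentile_threshold(projections, low=10, high=90):
        p_low, p_high = np.percentile(projections, [low, high])
        return 0.5 * (p_low + p_high)   # average of the 10% and 90% percentile values

The resulting threshold replaces the plain center of gravity when assigning superframes to speakers.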
0:09:09OK, so before talking about the online diarization, just
0:09:13a few experiments for this part.
0:09:16We used the NIST 2005
0:09:19dataset for this evaluation.
0:09:24And something important is that we compute the speaker error rate without discarding the
0:09:30margin around the speaker turns.
0:09:32This is contrary to the standard, and this is because we're dealing with short sessions, and
0:09:37when we tried to throw away
0:09:40data, we found out it caused some numerical problems.
0:09:46So basically what it means is that
0:09:48the results that I present are in some way
0:09:50a bit more pessimistic than what we would get if we
0:09:54used the standard method.
0:09:56Another important issue is that we
0:10:00throw away short sessions
0:10:02with less than 3 seconds
0:10:04per speaker.
0:10:06What we actually do is we take
0:10:08the 5-minute sessions from NIST
0:10:11and we just chop them into
0:10:15short sessions.
0:10:17And now, sometimes when doing that,
0:10:20we may get short sessions,
0:10:22for example 15 seconds,
0:10:25with only a single speaker,
0:10:27or with only 1 second from the second speaker.
0:10:29In this work we will not try to deal with the problem of
0:10:35detecting such situations, where we only have a single speaker;
0:10:39therefore we remove such sessions.
0:10:44The results for
0:10:48the diarization I talked about: basically what we can see here is that
0:10:54for long sessions we don't get
0:10:57any improvement or degradation;
0:10:59however, for short sessions, we can get roughly something like a 15% error reduction using this
0:11:07technique.
0:11:12OK, so now let's talk about online diarization.
0:11:16The framework here is the following ... what we do is we
0:11:21take a prefix of the session,
0:11:24and the prefix is something that we will have to process offline.
0:11:29Of course you would want the prefix to be as short as possible,
0:11:34and we will actually set the length of the prefix adaptively.
0:11:39So we start by taking a short prefix,
0:11:42and according to a confidence estimate we will verify whether this
0:11:48prefix is good enough for the processing, or whether we should just take a longer prefix
0:11:54and redo the processing.
0:11:56So we take the prefix of the session
0:11:59and we do offline processing ... we just apply our algorithm to this prefix,
0:12:06and the result of this processing is the segmentation for the prefix,
0:12:12and also
0:12:13some model parameters, for example the PCA and
0:12:17the threshold from the PCA. Then we take these model and threshold parameters and we
0:12:23go on to process the rest of the session online,
0:12:27using these models as a starting point.
0:12:31We update them periodically,
0:12:34and we do the online processing
0:12:35usually with some delay, because we need some kind of backtracking, so we
0:12:42have some short delay.
0:12:44It can be a second or less, but
0:12:46we will always have some kind of latency.
0:12:52so we first apply this for voice activity detection
0:12:57I won't go over all the details ... it's quite standard
0:13:03OK, so once we have voice activity detection done online,
0:13:08then we have to do speaker diarization.
0:13:10So first there is the front end ... we do it online, step by step:
0:13:16we extract the MFCCs,
0:13:17extract the supervectors,
0:13:20and compensate for intra-speaker variability.
0:13:24And then we take the prefix and we compute the PCA for the supervectors in this
0:13:31prefix ... we
0:13:33project all the supervectors onto the largest eigenvector,
0:13:38and we do Viterbi ... Viterbi segmentation.
0:13:41Then for the rest of the session, we just take the PCA statistics from the
0:13:47prefix,
0:13:48we accumulate them online ... we periodically recompute the
0:13:53PCA and adjust our decision boundary
0:13:57periodically,
0:13:58and also we do Viterbi with partial backtracking, with some kind of latency.
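A minimal sketch of this online update, assuming the PCA is kept as running first- and second-order statistics; the class structure and names are assumptions for illustration:

    import numpy as np

    class OnlinePCAState:
        def __init__(self, prefix_supervectors):
            self.n = len(prefix_supervectors)
            self.s1 = prefix_supervectors.sum(axis=0)              # first-order statistics
            self.s2 = prefix_supervectors.T @ prefix_supervectors  # second-order statistics
            self.recompute()

        def add(self, sv):
            """Accumulate one newly arrived (compensated) supervector."""
            self.n += 1
            self.s1 = self.s1 + sv
            self.s2 = self.s2 + np.outer(sv, sv)

        def recompute(self):
            """Periodically refresh the projection direction from the accumulated stats."""
            mu = self.s1 / self.n
            cov = self.s2 / self.n - np.outer(mu, mu)
            self.mean = mu
            self.direction = np.linalg.eigh(cov)[1][:, -1]         # largest eigenvector

After each recompute, the decision threshold can be re-derived from the projections with the percentile rule sketched earlier.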
0:14:07So here are some results.
0:14:10First we will try to analyze the sensitivity to the delay
0:14:17parameter ... the delay parameter is the delay we have when we do the online diarization
0:14:23on the rest of the conversation; we still have some delay because we're using Viterbi
0:14:28in order to do smoothing ... so we found out that 0.2
0:14:36seconds was good enough for this algorithm.
0:14:40Then we ran some experiments to verify the sensitivity to the prefix length,
0:14:48and we found out that, if we start with a speaker error rate of 4.4,
0:14:55we see some significant degradation:
0:14:58it gets to 9.0 for 15 seconds
0:15:02of prefix.
0:15:03Now we ran a control experiment:
0:15:06we did the same experiments, but
0:15:08we threw away all the sessions
0:15:11that did not have at least 3 seconds per
0:15:15speaker in the prefix.
0:15:17For example, if we take this column,
0:15:20we throw away all the sessions where, in the first 15 seconds,
0:15:25we don't have at least 3 seconds per speaker.
0:15:27And when we do that we see quite good results, and the explanation is that
0:15:31most of the degradation is due to the fact
0:15:33that when we take the prefix,
0:15:35sometimes we do not have representation of both speakers.
0:15:39And so
0:15:41the way we address this is to try to apply the confidence
0:15:45measure I will talk about.
0:15:47But before talking about the confidence:
0:15:51the overall latency of the system is 1.3 seconds,
0:15:54not counting the prefix. So
0:15:56if we have a 5-minute conversation ... for the first, say, 15 seconds
0:16:02it's not online, it's offline, and then starting
0:16:06from ... after this prefix,
0:16:08we will get a latency of 1.3 seconds.
0:16:14So now, the issue of the confidence-based prefix: we saw that
0:16:19sometimes 15 seconds is enough and sometimes it's not enough, and it's heavily controlled
0:16:25by the fact that we need both speakers to be present
0:16:30in the prefix.
0:16:32So what we do is we start with a short prefix ... we do diarization,
0:16:36we estimate the confidence
0:16:38in the diarization,
0:16:40and if the confidence is not high enough, we just expand the prefix and
0:16:45start over.
0:16:47We tried several confidence measures, and we finally chose to use the Davies-Bouldin index,
0:16:55which is the ratio between the average intra-class standard deviation and the inter-class distance,
0:17:01which we're able to calculate once we have the diarization.
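A minimal sketch of this confidence-gated prefix, using the two-cluster Davies-Bouldin-style ratio as it is defined in the talk; the confidence threshold, the step sizes and the diarize() callable are assumptions for illustration:

    import numpy as np

    def two_cluster_db_index(projections, labels):
        a, b = projections[labels == 0], projections[labels == 1]
        intra = 0.5 * (a.std() + b.std())         # average intra-class standard deviation
        inter = abs(a.mean() - b.mean())          # inter-class distance
        return intra / max(inter, 1e-12)          # lower value = better-separated clusters

    def choose_prefix(session, diarize, start=15.0, step=15.0, max_len=60.0, conf_thr=0.5):
        """Expand the prefix until its diarization looks confident enough."""
        length = start
        while True:
            projections, labels = diarize(session, prefix_seconds=length)  # hypothetical helper
            if two_cluster_db_index(projections, labels) < conf_thr or length >= max_len:
                return length, labels             # confident enough, or we give up expanding
            length += step                        # otherwise expand the prefix and start over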
0:17:08OK, and ... so
0:17:12I won't go into all the details of this slide and the next ones, but
0:17:16the main idea is that you can
0:17:18actually get nice gains
0:17:21by using this confidence measure. So, for example, for 30-second prefixes,
0:17:2750% of the sessions need to be extended
0:17:30to get almost as good a result, but for the other 50% of the sessions
0:17:35you can just stop.
0:17:36So you can start with a prefix of 30 seconds,
0:17:39do diarization, compute this confidence measure,
0:17:43and for 50% of the sessions you can decide that it's OK, I can
0:17:47stop now and do the online processing;
0:17:49and for the rest of the sessions you would need, for example, 45 to 60
0:17:54seconds
0:17:55to get optimal results.
0:18:01OK ... so,
0:18:03what is the time complexity of the offline system and the online system?
0:18:09This is a question that
0:18:11many, many people asked me after the previous presentation at the
0:18:18last Odyssey.
0:18:20So we ran an analysis ... an experimental analysis
0:18:24of this algorithm,
0:18:25and the analysis was run on 5-minute sessions.
0:18:30There was no sort of optimization done,
0:18:33just plain research code.
0:18:36And ... so what we see here is that the baseline system
0:18:41is 5 times faster than real time,
0:18:44and
0:18:45we can actually improve the accuracy of the system by taking some of the
0:18:51algorithms that I presented.
0:18:56And if we just take the whole ...
0:19:01all the complexity reductions I talked about ... some of them actually degrade the accuracy,
0:19:05for example training the UBM offline gives some degradation,
0:19:11so we get back to the 4.4, but we get a speed-up factor of 50:
0:19:1650 times faster than real time.
0:19:18And for the online system, if we take the prefix of 30 seconds and the
0:19:23delay of 0.2 seconds,
0:19:25then the speed-up factor is actually controlled by the retraining parameter.
0:19:32The retraining parameter means how frequently we re-estimate our PCA model and our GMMs.
0:19:41So we control it in a variable way, meaning we start with a high
0:19:47frequency at the beginning of the conversation, and then,
0:19:51towards the end of the conversation, we actually stop retraining, or do it at a very
0:19:57low frequency.
0:19:58For the online system, we managed to get a speaker error rate of 7.8 with
0:20:04a speed-up factor of 30.
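A minimal sketch of such a variable retraining schedule, with frequent model updates early in the conversation that become progressively rarer; the specific interval growth is an illustrative assumption:

    def retraining_times(conversation_seconds, first_interval=2.0, growth=1.5, stop_after=180.0):
        """Return the times (in seconds) at which the PCA model and GMMs are re-estimated."""
        t, interval, times = first_interval, first_interval, []
        while t < min(conversation_seconds, stop_after):   # stop retraining towards the end
            times.append(t)
            interval *= growth                              # retrain less and less often
            t += interval
        return times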
0:20:10OK, before concluding, I'll just talk about a specific task
0:20:17which we're interested in,
0:20:20which is speaker diarization for speaker verification.
0:20:23Here we're not really interested in getting a very accurate, very high-resolution
0:20:29diarization;
0:20:29we just don't want to get a large degradation in the equal error
0:20:34rate for the speaker recognition on two-wire data.
0:20:39So we have initial work presented at Interspeech 2011, and here we have some
0:20:46improvements
0:20:46that integrate all the components that I talked about in
0:20:52this presentation
0:20:53into this variant of our system.
0:20:58So we divide our audio into overlapping 5-second superframes, because we don't need the
0:21:06high resolution,
0:21:07and we score each superframe independently against the target speaker model.
0:21:13Now what we have to do is to be able
0:21:17to classify, or cluster, these superframes into the 2 speakers.
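A minimal sketch of scoring each superframe independently against the target speaker model, simplified here as a GMM/UBM log-likelihood ratio rather than the full GMM-NAP-SVM scoring; the use of sklearn GaussianMixture objects is an assumption for illustration:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def superframe_llrs(superframes, target_gmm, ubm):
        """superframes: list of (n_frames, n_features) arrays, one per 5-second superframe.
        target_gmm, ubm: fitted GaussianMixture models."""
        return np.array([target_gmm.score(f) - ubm.score(f) for f in superframes])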
0:21:24So what we do is a partial diarization:
0:21:27we cluster these superframes into 2 groups of clusters, and also
0:21:32deemphasize some of the superframes which are on the borderline between the clusters,
0:21:37because we're actually interested in speaker verification, not speaker diarization, so we
0:21:43can just throw away some superframes which we are not certain to which speaker they
0:21:49belong.
0:21:50And we use eigenvoice-based dimensionality reduction and k-means,
0:21:56and we found out that
0:21:58the silhouette measure was actually optimal for deemphasizing some of the superframes.
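A minimal sketch of this hard-decision deemphasis: superframes with a low silhouette value, i.e. those on the borderline between the two clusters, are simply dropped. The silhouette cutoff is an illustrative assumption, and the eigenvoice-based dimensionality reduction is assumed to have been applied already:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples

    def cluster_and_filter(superframes, cutoff=0.1):
        """superframes: (n_superframes, dim) array after dimensionality reduction."""
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(superframes)
        sil = silhouette_samples(superframes, labels)   # per-superframe silhouette value
        keep = sil > cutoff                             # hard decision: drop borderline superframes
        return labels[keep], superframes[keep]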
0:22:09We also do it online, so we
0:22:12use the same framework: a prefix which is processed offline, and then
0:22:18we just adapt it
0:22:22for the rest of the conversation. We use the GMM-NAP-SVM system
0:22:27developed for NIST 04 and 06, evaluated on NIST 2005, for males only.
0:22:38We see that we get an improvement ...
0:22:43some improvement compared to the results that we presented at Interspeech,
0:22:48and we also observed that, using this new technique,
0:22:54using the silhouette confidence measure for removing superframes ... using the
0:23:00hard decision ... we get the optimal result,
0:23:05compared to using a soft decision or no removal at all.
0:23:12So, to summarize:
0:23:14we extended our speaker diarization method to work with short sessions and to run online,
0:23:20and we proposed the following novelties: offline unsupervised estimation of intra-session intra-speaker variability,
0:23:26so again, we use the development set to estimate this variability,
0:23:32but it's not labeled at all, we don't need labeled data;
0:23:36and we also use outlier-emphasizing PCA for improving speaker clustering, and adaptive threshold setting.
0:23:43The overall latency is 1.3 seconds, except for the prefix,
0:23:49and the speed is 50 times faster than real time for the offline system and between
0:23:5530 and 40 times for the online system.
0:23:59And also, for the speaker verification task ... it's more in
0:24:04the paper than in the presentation, but
0:24:07we managed to substantially reduce the delay
0:24:14for speaker verification on summed channels.
0:24:20OK, thank you
0:24:29For initialization, did you consider trying an online speaker segmentation
0:24:37algorithm, where you just find the first speaker change, so that you
0:24:42are sure the second speaker,
0:24:46or the first speaker, is present in the next 15 seconds?
0:24:51Yeah, what we're trying to do now is
0:24:54to start with ...
0:24:56to go with the prefix
0:24:59framework, start with a very short prefix, and to try to
0:25:03start expanding it,
0:25:06and assessing whether there is a single speaker or not in this prefix ... so
0:25:11that would be hard ... yeah, that's why we don't have it
0:25:15in the paper.
0:25:19So, we ...
0:25:21do you have the speaker diarization
0:25:24rate, the diarization error rate?
0:25:27It's the speaker error rate, without voice activity detection.
0:25:30OK, so it's just the confusion, that's all.
0:25:36Uh, there we didn't ...
0:25:39go to the result, go back to the results for
0:25:42the tests ... some ...
0:25:44for recognition, for recognition.
0:25:51So ... do you know how the baseline was done?
0:25:56It did nothing, just scoring.
0:25:58Do you have the number?
0:26:08We have it in the Interspeech ...
0:26:11in the last Interspeech paper, we have that number.
0:26:15The last question is about the PCA itself. So one of the things is
0:26:19the NAP, which removes ...
0:26:23do you remove the channel first, before
0:26:28the PCA, or not ...
0:26:33do you do any kind of channel compensation?
0:26:35Channel compensation, no,
0:26:37but there's something that actually
0:26:42tries to do the same technique as
0:26:44is done for speaker verification:
0:26:47it's the NAP technique.
0:26:51So what we're doing is ... we're just taking
0:26:55pairs of adjacent supervectors,
0:26:57and we just assume that
0:26:58they belong to the same speaker, which is usually the case.
0:27:02Once in a while it's not, because of a speaker change, but usually they are from the same
0:27:06speaker,
0:27:07and from this we're estimating the
0:27:09intra-speaker variability.
0:27:12So you only estimate short-term variability?
0:27:14Short-term variability, yes.
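A minimal sketch of this idea: adjacent supervector differences are treated as intra-speaker variability, their top eigenvectors form a nuisance subspace, and a NAP-style projection removes it. The subspace rank is an illustrative assumption:

    import numpy as np

    def nap_from_adjacent_pairs(supervectors, rank=10):
        diffs = supervectors[1:] - supervectors[:-1]   # adjacent pairs, assumed same speaker
        cov = diffs.T @ diffs / len(diffs)             # intra-speaker variability covariance
        nuisance = np.linalg.eigh(cov)[1][:, -rank:]   # top eigenvectors = nuisance directions

        def compensate(X):
            return X - (X @ nuisance) @ nuisance.T     # project the nuisance subspace out
        return compensate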
0:27:22I don't understand the reason online diarization is used.
0:27:29OK.
0:27:29I'm trying to understand the motivation.
0:27:31OK, this started because there were actually two clients.
0:27:36One of them is ...
0:27:39for example, in the call center scenario,
0:27:41let's assume that it's two-wire;
0:27:45in many cases in practice, that's the case
0:27:48nowadays,
0:27:50at least for one of the vendors,
0:27:52actually,
0:27:54this is the case ... so ...
0:27:57The project was ... the idea was
0:28:00to run speech recognition
0:28:03online, on the
0:28:06call center data,
0:28:08and to present the agent with some summary
0:28:12of the conversation.
0:28:14And in order to do the summary, they need the speaker diarization,
0:28:18and everything must be done online, but
0:28:20it can be done with some latency;
0:28:22for example, with a 30-second prefix, it's OK,
0:28:27because it's usually a longer conversation.
0:28:34When you use Viterbi, do you always go all the way back to the beginning,
0:28:37or do you just do ...?
0:28:38In the online case, no; in the online case we do it just in a small chunk.
0:28:42how far do you go back?
0:28:46It depends, because
0:28:48we also, of course, tried to go all the way back;
0:28:51it does not really cause false alarms,
0:28:54but we found out that we can
0:28:58save a bit by not doing that, but it's not very important.
0:29:03The latency is caused by what happens after ...
0:29:07by the future, not the past; the past is something you can do very
0:29:11quickly.
0:29:12One more question:
0:29:13did you try to adapt the algorithm to the
0:29:16multi-speaker diarization task that is used
0:29:20on meeting data?
0:29:22Actually, now we're working in a
0:29:24framework, a European project,
0:29:26that's ...
0:29:28we're dealing with a
0:29:30meeting-type scenario,
0:29:33and we will have to take this algorithm and run it there;
0:29:37we will have to modify it, of course.
0:29:41Alright, let's thank the speaker again.
0:29:41[applause]