0:00:38OK, so the title of my talk is the following. We start from the fully
0:00:46offline supervector-based speaker diarization system which we presented at the last Odyssey, two years ago.
0:01:10OK, so the ...
0:01:17it doesn't ...
0:01:24no, it doesn't go into the computer ... not here ...
0:01:38first time I see something like this.
0:01:41OK, so this is the
0:01:42outline of my talk.
0:01:45OK, so just for those who are not familiar with the baseline algorithm:
0:01:50the idea is to take a conversation between two speakers and to
0:01:56do speaker diarization.
0:01:58And the main principle is the following: if you look at this illustration, this is
0:02:02an illustration of the speech features of two speakers, one speaker in blue and the other in
0:02:06red.
0:02:07And if we did not have the color labels, we wouldn't be able to do the
0:02:13separation between these 2 speakers.
0:02:15So the idea is to take the speech, and to do some kind of parameterization
0:02:22into a series of supervectors, representing overlapping short segments.
0:02:27So what we get is what we see here: now we see some kind of
0:02:33separation between the two speakers.
0:02:36And we can also see that every speaker can roughly be modeled by a unique
0:02:42PDF.
0:02:43This is thanks to the supervector representation.
0:02:47And the next step is to improve the separation between speakers by
0:02:54removing some of the intra-session intra-speaker variability.
0:02:59This is the sketch of the algorithm.
0:03:02and here are the actual steps.
0:03:06So first there is the audio parameterization ... the session is taken and a conversation-dependent UBM
0:03:14is estimated.
0:03:16So basically this algorithm doesn't need any development data, the UBM is estimated from the
0:03:22conversation.
0:03:23Then the conversation is segmented into overlapping 1-second superframes,
0:03:28and each superframe is represented by a supervector which is adapted from the UBM.
0:03:35Then there is another step which I am not going to go into in detail, because
0:03:39it's something that we've already presented:
0:03:42what we do is we try to estimate, on the fly, from
0:03:47the conversation, the
0:03:48intra-speaker variability and compensate for it,
0:03:52to improve the accuracy.
0:03:55The next step is to score each superframe as being either from speaker 1 or speaker
0:03:592.
0:04:00This is done by first computing the covariance matrix of the
0:04:05compensated
0:04:06supervectors,
0:04:07then applying PCA to this covariance matrix, identifying the largest eigenvector, and projecting
0:04:15everything onto this largest eigenvector.
0:04:19Then we use Viterbi to do some smoothing, and finally we do Viterbi segmentation in
0:04:26the MFCC space.
0:04:28So this is the baseline.
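A minimal sketch of this scoring step, assuming the compensated supervectors are already stacked as rows of a numpy array; the array sizes and the sign-based labelling are illustrative, not taken from the paper:

    import numpy as np

    def pca_scores(supervectors):
        """Project supervectors onto the largest eigenvector of their covariance."""
        X = supervectors - supervectors.mean(axis=0)   # center the supervectors
        cov = np.cov(X, rowvar=False)                  # covariance matrix of the superframes
        eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition (ascending order)
        w = eigvecs[:, -1]                             # largest eigenvector
        return X @ w                                   # one scalar score per superframe

    # Toy usage: 200 superframes of dimension 50; the sign of the score gives the
    # initial speaker-1 / speaker-2 assignment before Viterbi smoothing.
    scores = pca_scores(np.random.randn(200, 50))
    labels = (scores > 0).astype(int)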
0:04:30There are a few shortcomings to this algorithm.
0:04:34First, we found out that when we apply this algorithm to short sessions,
0:04:39and when I say short, it can be 15 seconds or 30 seconds,
0:04:45it doesn't work that well
0:04:49on short sessions. This is first of all because of insufficient data for estimating
0:04:55all the models and parameters from a single short session,
0:04:59and also because the probability of imbalance in the representation of
0:05:05the speakers increases when we're dealing with short sessions.
0:05:10And also, this algorithm relies heavily on there being some kind
0:05:16of balance between the 2 speakers.
0:05:19Another issue is that this algorithm is inherently offline, and several of
0:05:26our customers require an online solution.
0:05:31So these are the shortcomings.
0:05:34So first, I'll talk about robustness on short sessions, which is important by itself, but
0:05:40is also the first step towards the online
0:05:43algorithm.
0:05:44So the basic idea is to try to do everything that we can
0:05:49offline, from the development set.
0:05:51Instead of training the UBM
0:05:53from the conversation, we just train it from
0:05:56the development set, and also the NAP intra-speaker
0:06:01variability compensation is trained from the development set.
0:06:05But we don't need any labeling of the development set, because
0:06:09our algorithm is unsupervised; it doesn't need speaker labels or speaker-turn
0:06:16labellings, we just need the raw audio.
0:06:19So we take the development set, we estimate the UBM, we estimate the NAP
0:06:26transform, and also the GMM model, in order to make the system
0:06:31more robust to short sessions.
0:06:33The next thing is what we call the outlier-emphasizing PCA.
0:06:38Contrary to robust PCA, which some of you may be familiar with, in our case we're actually interested
0:06:43in the outliers, and
0:06:45we want to emphasize them and give high weight to outliers when we're doing the PCA.
0:06:52To see why, let's look at this illustration.
0:06:56This illustration is for the case where we have 2 speakers
0:06:59and they're balanced: we have the same amount of data from both.
0:07:04If we look at this example, then,
0:07:09if certain conditions actually hold,
0:07:13if we just take the supervectors
0:07:15and apply PCA, the largest eigenvector will actually
0:07:20give us the decision boundary.
0:07:23Now if we have imbalanced speakers, then
0:07:26in many cases the PCA will be dominated by the most dominant speaker,
0:07:32and we won't get the right decision boundary.
0:07:37So what we do is the following:
0:07:40we assign a higher weight to outliers,
0:07:42which are found by selecting
0:07:45the top 10% of supervectors
0:07:47in the given session with the largest distance to the sample mean.
0:07:51So we compute the center of gravity, the sample mean, and we just
0:07:55select the 10% of the supervectors which are most distant
0:08:01from
0:08:02this mean; in this case, these are the outliers,
0:08:06and we just give them the higher weight.
0:08:09And now
0:08:10the PCA suddenly works well in this example.
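A minimal sketch of this outlier-emphasizing PCA, assuming the supervectors are rows of a numpy array; only the top-10% rule comes from the talk, the exact outlier weight is an illustrative assumption:

    import numpy as np

    def outlier_emphasizing_pca(X, top_fraction=0.10, outlier_weight=5.0):
        mu = X.mean(axis=0)
        dist = np.linalg.norm(X - mu, axis=1)              # distance of each supervector to the sample mean
        cutoff = np.quantile(dist, 1.0 - top_fraction)     # top-10% most distant supervectors
        w = np.where(dist >= cutoff, outlier_weight, 1.0)  # emphasize the outliers
        Xc = X - np.average(X, axis=0, weights=w)          # weighted centering
        cov = (Xc * w[:, None]).T @ Xc / w.sum()           # weighted covariance matrix
        return np.linalg.eigh(cov)[1][:, -1]               # largest eigenvector = projection direction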
0:08:15Another problem is how to choose the threshold,
0:08:21because, for example,
0:08:23in this case, when the speakers are imbalanced,
0:08:28if we just take, for example, the center of gravity
0:08:32as the threshold, then we would not be able to distinguish these two
0:08:38speakers
0:08:39correctly.
0:08:41So what we're trying to do is, again
0:08:44according to the same principle, we compute the 10% and 90% percentiles,
0:08:50look at the values that give these percentiles
0:08:54along the largest eigenvector,
0:08:58and we just take these two values, average them, and decide on
0:09:03the threshold more robustly.
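A minimal sketch of this percentile-based threshold, applied to the 1-D projections onto the largest eigenvector (numpy assumed):

    import numpy as np

    def percentile_threshold(projections, low=10, high=90):
        p_low, p_high = np.percentile(projections, [low, high])
        return 0.5 * (p_low + p_high)   # average of the 10% and 90% percentile values

The resulting threshold replaces the plain center of gravity when assigning superframes to speakers.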
0:09:09OK, so before talking about the online diarization, just
0:09:13a few experiments for this part.
0:09:16We used the NIST 2005
0:09:19dataset for this evaluation.
0:09:24And something important is that we compute the speaker error rate without discarding the
0:09:30margin around the speaker turns.
0:09:32This is contrary to the standard, and this is because we're dealing with short sessions, and
0:09:37when we tried to throw away
0:09:40data, we found out it caused some numerical problems.
0:09:46So basically what it means is that
0:09:48the results that I present are in some way
0:09:50a bit more pessimistic than what we would get if we
0:09:54used the standard method.
0:09:56Another important issue is that we
0:10:00throw away short sessions
0:10:02with less than 3 seconds
0:10:04per speaker.
0:10:06What we actually do is we take
0:10:08the 5-minute sessions from NIST
0:10:11and we just chop them into
0:10:15short sessions.
0:10:17And now, sometimes when doing that,
0:10:20we may get short sessions,
0:10:22for example 15 seconds,
0:10:25with only a single speaker,
0:10:27or with only 1 second from the second speaker.
0:10:29In this work we will not try to deal with the problem of
0:10:35detecting such situations, where we only have a single speaker;
0:10:39therefore we remove such sessions.
0:10:44The results for
0:10:48the diarization I talked about: basically what we can see here is that
0:10:54for long sessions we don't get
0:10:57any improvement or degradation;
0:10:59however, for short sessions, we can get roughly something like a 15% error reduction using this
0:11:07technique.
0:11:12OK, so now let's talk about online diarization.
0:11:16The framework here is the following ... what we do is we
0:11:21take a prefix of the session,
0:11:24and the prefix is something that we will have to process offline.
0:11:29Of course you would want the prefix to be as short as possible,
0:11:34and we will actually set the length of the prefix adaptively.
0:11:39So we start by taking a short prefix,
0:11:42and according to a confidence estimate we will verify whether this
0:11:48prefix is good enough for the processing, or whether we should just take a longer prefix
0:11:54and redo the processing.
0:11:56So we take the prefix of the session
0:11:59and we do offline processing ... we just apply our algorithm to this prefix,
0:12:06and the result of this processing is the segmentation for the prefix,
0:12:12and also
0:12:13some model parameters, for example the PCA and
0:12:17the threshold from the PCA. Then we take these model and threshold parameters and we
0:12:23go on to process the rest of the session online,
0:12:27using these models as a starting point.
0:12:31We update them periodically,
0:12:34and we do the online processing
0:12:35usually with some delay, because we need some kind of backtracking, so we
0:12:42have some short delay.
0:12:44It can be a second or less, but
0:12:46we will always have some kind of latency.
0:12:52so we first apply this for voice activity detection
0:12:57I won't go over all the details ... it's quite standard
0:13:03OK, so once we have voice activity detection done online,
0:13:08then we have to do speaker diarization.
0:13:10So first there is the front end ... we do it online, step by step:
0:13:16we extract the MFCCs,
0:13:17extract the supervectors,
0:13:20and compensate for intra-speaker variability.
0:13:24And then we take the prefix and we compute the PCA for the supervectors in this
0:13:31prefix ... we
0:13:33project all the supervectors onto the largest eigenvector,
0:13:38and we do Viterbi ... Viterbi segmentation.
0:13:41Then for the rest of the session, we just take the PCA statistics from the
0:13:47prefix,
0:13:48we accumulate them online ... we periodically recompute the
0:13:53PCA and adjust our decision boundary
0:13:57periodically,
0:13:58and also we do Viterbi with partial backtracking, with some kind of latency.
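A minimal sketch of this online update, assuming the PCA is kept as running first- and second-order statistics; the class structure and names are assumptions for illustration:

    import numpy as np

    class OnlinePCAState:
        def __init__(self, prefix_supervectors):
            self.n = len(prefix_supervectors)
            self.s1 = prefix_supervectors.sum(axis=0)              # first-order statistics
            self.s2 = prefix_supervectors.T @ prefix_supervectors  # second-order statistics
            self.recompute()

        def add(self, sv):
            """Accumulate one newly arrived (compensated) supervector."""
            self.n += 1
            self.s1 = self.s1 + sv
            self.s2 = self.s2 + np.outer(sv, sv)

        def recompute(self):
            """Periodically refresh the projection direction from the accumulated stats."""
            mu = self.s1 / self.n
            cov = self.s2 / self.n - np.outer(mu, mu)
            self.mean = mu
            self.direction = np.linalg.eigh(cov)[1][:, -1]         # largest eigenvector

After each recompute, the decision threshold can be re-derived from the projections with the percentile rule sketched earlier.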
0:14:07So here are some results.
0:14:10First we will try to analyze the sensitivity to the delay
0:14:17parameter ... the delay parameter is the delay we have when we do the online diarization
0:14:23on the rest of the conversation; we still have some delay because we're using Viterbi
0:14:28in order to do smoothing ... so we found out that 0.2
0:14:36seconds was good enough for this algorithm.
0:14:40Then we ran some experiments to verify the sensitivity to the prefix length,
0:14:48and we found out that, if we start with a speaker error rate of 4.4,
0:14:55we see some significant degradation:
0:14:58it gets to 9.0 for 15 seconds
0:15:02of prefix.
0:15:03Now we ran a control experiment:
0:15:06we did the same experiments, but
0:15:08we threw away all the sessions
0:15:11that did not have at least 3 seconds per
0:15:15speaker in the prefix.
0:15:17For example, if we take this column,
0:15:20we throw away all the sessions where, in the first 15 seconds,
0:15:25we don't have at least 3 seconds per speaker.
0:15:27And when we do that we see quite good results, and the explanation is that
0:15:31most of the degradation is due to the fact
0:15:33that when we take the prefix,
0:15:35sometimes we do not have representation of both speakers.
0:15:39And so
0:15:41the way we address this is to try to apply the confidence
0:15:45measure I will talk about.
0:15:47But before talking about the confidence:
0:15:51the overall latency of the system is 1.3 seconds,
0:15:54not counting the prefix. So
0:15:56if we have a 5-minute conversation ... for the first, say, 15 seconds
0:16:02it's not online, it's offline, and then starting
0:16:06from ... after this prefix,
0:16:08we will get a latency of 1.3 seconds.
0:16:14So now, the issue of the confidence-based prefix: we saw that
0:16:19sometimes 15 seconds is enough and sometimes it's not enough, and it's heavily controlled
0:16:25by the fact that we need both speakers to be present
0:16:30in the prefix.
0:16:32So what we do is we start with a short prefix ... we do diarization,
0:16:36we estimate the confidence
0:16:38in the diarization,
0:16:40and if the confidence is not high enough, we just expand the prefix and
0:16:45start over.
0:16:47We tried several confidence measures, and we finally chose to use the Davies-Bouldin index,
0:16:55which is the ratio between the average intra-class standard deviation and the inter-class distance,
0:17:01which we're able to calculate once we have the diarization.
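A minimal sketch of this confidence-gated prefix, using the two-cluster Davies-Bouldin-style ratio as it is defined in the talk; the confidence threshold, the step sizes and the diarize() callable are assumptions for illustration:

    import numpy as np

    def two_cluster_db_index(projections, labels):
        a, b = projections[labels == 0], projections[labels == 1]
        intra = 0.5 * (a.std() + b.std())         # average intra-class standard deviation
        inter = abs(a.mean() - b.mean())          # inter-class distance
        return intra / max(inter, 1e-12)          # lower value = better-separated clusters

    def choose_prefix(session, diarize, start=15.0, step=15.0, max_len=60.0, conf_thr=0.5):
        """Expand the prefix until its diarization looks confident enough."""
        length = start
        while True:
            projections, labels = diarize(session, prefix_seconds=length)  # hypothetical helper
            if two_cluster_db_index(projections, labels) < conf_thr or length >= max_len:
                return length, labels             # confident enough, or we give up expanding
            length += step                        # otherwise expand the prefix and start over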
0:17:08OK, and ... so
0:17:12I won't go into all the details of this slide and the next ones, but
0:17:16the main idea is that you can
0:17:18actually get nice gains
0:17:21by using this confidence measure. So, for example, for 30-second prefixes,
0:17:2750% of the sessions need to be extended
0:17:30to get almost as good a result, but for the other 50% of the sessions
0:17:35you can just stop.
0:17:36So you can start with a prefix of 30 seconds,
0:17:39do diarization, compute this confidence measure,
0:17:43and for 50% of the sessions you can decide that it's OK, I can
0:17:47stop now and do the online processing;
0:17:49and for the rest of the sessions you would need, for example, 45 to 60
0:17:54seconds
0:17:55to get optimal results.
0:18:01OK ... so,
0:18:03what is the time complexity of the offline system and the online system?
0:18:09This is a question that
0:18:11many, many people asked me after the previous presentation at the
0:18:18last Odyssey.
0:18:20So we ran an analysis ... an experimental analysis
0:18:24of this algorithm,
0:18:25and the analysis was run on 5-minute sessions.
0:18:30There was no sort of optimization done,
0:18:33just plain research code.
0:18:36And ... so what we see here is that the baseline system
0:18:41is 5 times faster than real time,
0:18:44and
0:18:45we can actually improve the accuracy of the system by taking some of the
0:18:51algorithms that I presented.
0:18:56And if we just take the whole ...
0:19:01all the complexity reductions I talked about ... some of them actually degrade the accuracy,
0:19:05for example training the UBM offline gives some degradation,
0:19:11so we get back to the 4.4, but we get a speed-up factor of 50:
0:19:1650 times faster than real time.
0:19:18And for the online system, if we take the prefix of 30 seconds and the
0:19:23delay of 0.2 seconds,
0:19:25then the speed-up factor is actually controlled by the retraining parameter.
0:19:32The retraining parameter means how frequently we re-estimate our PCA model and our GMMs.
0:19:41So we control it in a variable way, meaning we start with a high
0:19:47frequency at the beginning of the conversation, and then,
0:19:51towards the end of the conversation, we actually stop retraining, or do it at a very
0:19:57low frequency.
0:19:58For the online system, we managed to get a speaker error rate of 7.8 with
0:20:04a speed-up factor of 30.
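A minimal sketch of such a variable retraining schedule, with frequent model updates early in the conversation that become progressively rarer; the specific interval growth is an illustrative assumption:

    def retraining_times(conversation_seconds, first_interval=2.0, growth=1.5, stop_after=180.0):
        """Return the times (in seconds) at which the PCA model and GMMs are re-estimated."""
        t, interval, times = first_interval, first_interval, []
        while t < min(conversation_seconds, stop_after):   # stop retraining towards the end
            times.append(t)
            interval *= growth                              # retrain less and less often
            t += interval
        return times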
0:20:10OK, before concluding, I'll just talk about a specific task
0:20:17which we're interested in,
0:20:20which is speaker diarization for speaker verification.
0:20:23Here we're not really interested in getting a very accurate, very high-resolution
0:20:29diarization;
0:20:29we just don't want to get a large degradation in the equal error
0:20:34rate for the speaker recognition on two-wire data.
0:20:39So we have initial work presented at Interspeech 2011, and here we have some
0:20:46improvements
0:20:46that integrate all the components that I talked about in
0:20:52this presentation
0:20:53into this variant of our system.
0:20:58So we divide our audio into overlapping 5-second superframes, because we don't need the
0:21:06high resolution,
0:21:07and we score each superframe independently against the target speaker model.
0:21:13Now what we have to do is to be able
0:21:17to classify, or cluster, these superframes into the 2 speakers.
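A minimal sketch of scoring each superframe independently against the target speaker model, simplified here as a GMM/UBM log-likelihood ratio rather than the full GMM-NAP-SVM scoring; the use of sklearn GaussianMixture objects is an assumption for illustration:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def superframe_llrs(superframes, target_gmm, ubm):
        """superframes: list of (n_frames, n_features) arrays, one per 5-second superframe.
        target_gmm, ubm: fitted GaussianMixture models."""
        return np.array([target_gmm.score(f) - ubm.score(f) for f in superframes])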
0:21:24So what we do is a partial diarization:
0:21:27we cluster these superframes into 2 groups of clusters, and also
0:21:32deemphasize some of the superframes which are on the borderline between the clusters,
0:21:37because we're actually interested in speaker verification, not speaker diarization, so we
0:21:43can just throw away some superframes which we are not certain to which speaker they
0:21:49belong.
0:21:50And we use eigenvoice-based dimensionality reduction and k-means,
0:21:56and we found out that
0:21:58the silhouette measure was actually optimal for deemphasizing some of the superframes.
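A minimal sketch of this hard-decision deemphasis: superframes with a low silhouette value, i.e. those on the borderline between the two clusters, are simply dropped. The silhouette cutoff is an illustrative assumption, and the eigenvoice-based dimensionality reduction is assumed to have been applied already:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples

    def cluster_and_filter(superframes, cutoff=0.1):
        """superframes: (n_superframes, dim) array after dimensionality reduction."""
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(superframes)
        sil = silhouette_samples(superframes, labels)   # per-superframe silhouette value
        keep = sil > cutoff                             # hard decision: drop borderline superframes
        return labels[keep], superframes[keep]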
0:22:09We also do it online, so we
0:22:12use the same framework: a prefix which is processed offline, and then
0:22:18we just adapt it
0:22:22for the rest of the conversation. We use the GMM-NAP-SVM system
0:22:27developed for NIST 04 and 06, evaluated on NIST 2005, for males only.
0:22:38We see that we get an improvement ...
0:22:43some improvement compared to the results that we presented at Interspeech,
0:22:48and we also observed that, using this new technique,
0:22:54using the silhouette confidence measure for removing superframes ... using the
0:23:00hard decision ... we get the optimal result,
0:23:05compared to using a soft decision or no removal at all.
0:23:12So, to summarize:
0:23:14we extended our speaker diarization method to work with short sessions and to run online,
0:23:20and we proposed the following novelties: offline unsupervised estimation of intra-session intra-speaker variability,
0:23:26so again, we use the development set to estimate this variability,
0:23:32but it's not labeled at all, we don't need labeled data;
0:23:36and we also use outlier-emphasizing PCA for improving speaker clustering, and adaptive threshold setting.
0:23:43The overall latency is 1.3 seconds, except for the prefix,
0:23:49and the speed is 50 times faster than real time for the offline system and between
0:23:5530 and 40 times for the online system.
0:23:59And also, for the speaker verification task ... it's more in
0:24:04the paper than in the presentation, but
0:24:07we managed to substantially reduce the delay
0:24:14for speaker verification on summed channels.
0:24:20OK, thank you
0:24:29For initialization, did you consider trying an online speaker segmentation
0:24:37algorithm, where you just find the first speaker change, so that you
0:24:42are sure the second speaker,
0:24:46or the first speaker, is present in the next 15 seconds?
0:24:51Yeah, what we're trying to do now is
0:24:54to start with ...
0:24:56to go with the prefix
0:24:59framework, start with a very short prefix, and to try to
0:25:03start expanding it,
0:25:06and assessing whether there is a single speaker or not in this prefix ... so
0:25:11that would be hard ... yeah, that's why we don't have it
0:25:15in the paper.
0:25:19So, we ...
0:25:21do you have the speaker diarization
0:25:24rate, the diarization error rate?
0:25:27It's the speaker error rate, without voice activity detection.
0:25:30OK, so it's just the confusion, that's all.
0:25:36Uh, there we didn't ...
0:25:39go to the result, go back to the results for
0:25:42the tests ... some ...
0:25:44for recognition, for recognition.
0:25:51So ... do you know how the baseline was done?
0:25:56It did nothing, just scoring.
0:25:58Do you have the number?
0:26:08We have it in the Interspeech ...
0:26:11in the last Interspeech paper, we have that number.
0:26:15The last question is about the PCA itself. So one of the things is
0:26:19the NAP, which removes ...
0:26:23do you remove the channel first, before
0:26:28the PCA, or not ...
0:26:33do you do any kind of channel compensation?
0:26:35Channel compensation, no,
0:26:37but there's something that actually
0:26:42tries to do the same technique as
0:26:44is done for speaker verification:
0:26:47it's the NAP technique.
0:26:51So what we're doing is ... we're just taking
0:26:55pairs of adjacent supervectors,
0:26:57and we just assume that
0:26:58they belong to the same speaker, which is usually the case.
0:27:02Once in a while it's not, because of a speaker change, but usually they are from the same
0:27:06speaker,
0:27:07and from this we're estimating the
0:27:09intra-speaker variability.
0:27:12So you only estimate short-term variability?
0:27:14Short-term variability, yes.
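A minimal sketch of this idea: adjacent supervector differences are treated as intra-speaker variability, their top eigenvectors form a nuisance subspace, and a NAP-style projection removes it. The subspace rank is an illustrative assumption:

    import numpy as np

    def nap_from_adjacent_pairs(supervectors, rank=10):
        diffs = supervectors[1:] - supervectors[:-1]   # adjacent pairs, assumed same speaker
        cov = diffs.T @ diffs / len(diffs)             # intra-speaker variability covariance
        nuisance = np.linalg.eigh(cov)[1][:, -rank:]   # top eigenvectors = nuisance directions

        def compensate(X):
            return X - (X @ nuisance) @ nuisance.T     # project the nuisance subspace out
        return compensate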
0:27:22I don't understand the reason online diarization is used.
0:27:29OK.
0:27:29I'm trying to understand the motivation.
0:27:31OK, this started because there were actually two clients.
0:27:36One of them is ...
0:27:39for example, in the call center scenario,
0:27:41let's assume that it's two-wire;
0:27:45in many cases in practice, that's the case
0:27:48nowadays,
0:27:50at least for one of the vendors,
0:27:52actually,
0:27:54this is the case ... so ...
0:27:57The project was ... the idea was
0:28:00to run speech recognition
0:28:03online, on the
0:28:06call center data,
0:28:08and to present the agent with some summary
0:28:12of the conversation.
0:28:14And in order to do the summary, they need the speaker diarization,
0:28:18and everything must be done online, but
0:28:20it can be done with some latency;
0:28:22for example, with a 30-second prefix, it's OK,
0:28:27because it's usually a longer conversation.
0:28:34When you use Viterbi, do you always go all the way back to the beginning,
0:28:37or do you just do ...?
0:28:38In the online case, no; in the online case we do it just in a small chunk.
0:28:42how far do you go back?
0:28:46It depends, because
0:28:48we also, of course, tried to go all the way back;
0:28:51it does not really cause false alarms,
0:28:54but we found out that we can
0:28:58save a bit by not doing that, but it's not very important.
0:29:03The latency is caused by what happens after ...
0:29:07by the future, not the past; the past is something you can do very
0:29:11quickly.
0:29:12One more question:
0:29:13did you try to adapt the algorithm to the
0:29:16multi-speaker diarization task that is used
0:29:20on meeting data?
0:29:22Actually, now we're working in a
0:29:24framework, a European project,
0:29:26that's ...
0:29:28we're dealing with a
0:29:30meeting-type scenario,
0:29:33and we will have to take this algorithm and run it there;
0:29:37we will have to modify it, of course.
0:29:41Alright, let's thank the speaker again.
0:29:41[applause]