0:00:38 | OK, so the title of my talk is the following. We start from a

0:00:45 | fully offline

0:00:46 | supervector-based speaker diarization system, which we presented at the last Odyssey, two years ago.

0:01:10 | OK, so the...

0:01:17 | ...it doesn't...

0:01:24 | no, it doesn't go into the computer... not here.

0:01:38 | First time I see something like this.

0:01:41 | OK, so this is the

0:01:42 | outline of my talk.

0:01:45 | OK, so just for those who are not familiar with the baseline algorithm:

0:01:50 | The idea is to take two speakers, usually in a summed signal, and to

0:01:56 | do speaker diarization.

0:01:58 | And the main principle is the following: if you look at this illustration, this is

0:02:02 | an illustration of the acoustic space, with one speaker in blue and the other in

0:02:06 | red.

0:02:07 | And if we did not have the color coding, we would not be able to

0:02:13 | separate these two speakers.

0:02:15 | So the idea is to take the speech, and to do some kind of parameterization |

0:02:22 | into a series of supervectors, representing overlapping short segments. |

0:02:27 | So what we get is what we see here: now we see some kind of |

0:02:33 | separation between the two speakers.

0:02:36 | And also we can see that each speaker can roughly be modeled by a unique

0:02:42 | PDF.

0:02:43 | This is thanks to the supervector representation. |

0:02:47 | And the next step is to improve the separation between speakers by |

0:02:54 | removing some of the intra-session intra-speaker variability.

0:02:59 | This is the sketch of the algorithm. |

0:03:02 | And here are the actual steps.

0:03:06 | So first there's the audio parameterization... the session is taken and a conversation-dependent UBM

0:03:14 | is estimated. |

0:03:16 | So basically this algorithm doesn't need any development data; the UBM is estimated from the

0:03:22 | conversation. |

0:03:23 | Then the conversation is segmented into overlapping 1-second superframes,

0:03:28 | and each superframe is represented by a supervector, which is adapted from the UBM.

0:03:35 | Then there is another step which I am not going to describe in detail, because

0:03:39 | it's something that we've already presented. |

0:03:42 | What we do is we try to estimate, on the fly, from

0:03:47 | the conversation,

0:03:48 | the intra-speaker variability, and compensate for it,

0:03:52 | to improve the accuracy.

0:03:55 | The next step is to score each superframe as being either from speaker 1 or speaker

0:03:59 | 2. |

0:04:00 | This is done by first computing the covariance matrix of the

0:04:05 | compensated |

0:04:06 | supervectors. |

0:04:07 | Then applying PCA to this covariance matrix, identifying the largest eigenvector, and projecting

0:04:15 | everything onto this largest eigenvector.

0:04:19 | Then we use Viterbi to do some smoothing, and finally we do Viterbi resegmentation in

0:04:26 | the MFCC space.
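As an illustration of the scoring step just described, here is a minimal numpy sketch: covariance of the compensated supervectors, PCA, and projection onto the largest eigenvector. The names are illustrative; this is not the authors' code.

```python
import numpy as np

def score_superframes(supervectors: np.ndarray) -> np.ndarray:
    """supervectors: (n_superframes, dim) NAP-compensated supervectors.
    Returns one scalar score per superframe; the sign of the score
    separates the two speakers before Viterbi smoothing."""
    mean = supervectors.mean(axis=0)
    centered = supervectors - mean
    cov = centered.T @ centered / len(centered)
    # np.linalg.eigh returns eigenvalues in ascending order,
    # so the last column is the largest eigenvector.
    _, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, -1]
    return centered @ principal
```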

0:04:28 | So this is the baseline. |

0:04:30 | There are at least a few shortcomings to this algorithm.

0:04:34 | First, we found out that when we apply this algorithm to short sessions, and

0:04:39 | when I say short, it can be 15 seconds or 30 seconds,

0:04:45 | it doesn't work that well

0:04:49 | on short sessions. This is first of all because of insufficient data for estimating

0:04:55 | all the models and parameters from a single short session.

0:04:59 | And also because the probability of imbalance between the speakers, in the representation of

0:05:05 | the speakers, increases when we're dealing with short sessions.

0:05:10 | And this algorithm relies heavily on there being some kind

0:05:16 | of balance between the two speakers.

0:05:19 | And another issue is that this algorithm is inherently offline, and several of

0:05:26 | our customers require an online solution.

0:05:31 | So these are the shortcomings.

0:05:34 | So first, I'll talk about robustness for short sessions, which is important by itself, but

0:05:40 | is also the first step towards the online

0:05:43 | algorithm.

0:05:44 | So the basic idea is to try to do everything that we can

0:05:49 | offline, from the development set.

0:05:51 | Instead of training the UBM

0:05:53 | from the conversation, we just train it

0:05:56 | from the development set, and also the NAP intra-speaker

0:06:01 | variability compensation is trained from the development set.

0:06:05 | But we don't need any labeling of the development set, because

0:06:09 | our algorithm is unsupervised; it doesn't need speaker labels or speaker-turn

0:06:16 | labelings, we just need the raw audio.

0:06:19 | So we take the development set, we estimate the UBM, we estimate

0:06:26 | the NAP transform, and we also train the GMM model, in order to make it

0:06:31 | more robust to short sessions.

0:06:33 | The next thing is what we call outlier-emphasizing PCA.

0:06:38 | Contrary to robust PCA, which some of you may be familiar with, in our case we're actually interested

0:06:43 | in the outliers, and

0:06:45 | we want to emphasize them and give high weight to outliers when we're doing PCA.

0:06:52 | To see why, let's look at this illustration.

0:06:56 | This illustration is for when we have two speakers.

0:06:59 | And they're balanced; we have the same amount of data from both.

0:07:04 | If we look at this example, then,

0:07:09 | if certain conditions hold,

0:07:13 | we can just take the supervectors

0:07:15 | and apply PCA, and the largest eigenvector will actually

0:07:20 | give us the decision boundary.

0:07:23 | Now if we have imbalanced speakers, then

0:07:26 | in many cases the PCA will be dominated by the dominant speaker,

0:07:32 | and we won't get the right decision boundary.

0:07:37 | So what we do is the following:

0:07:40 | we assign higher weight to outliers,

0:07:42 | which are found by selecting

0:07:45 | the top 10% of supervectors

0:07:47 | in the given session with the largest distance to the sample mean.

0:07:51 | So we compute the center of gravity, the sample mean, and we just

0:07:55 | select the 10% of the supervectors which are most distant

0:08:01 | from

0:08:02 | this mean; in this case, these are the outliers,

0:08:06 | and we just give them a higher weight.

0:08:09 | And now,

0:08:10 | suddenly, the PCA works well in this example.
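A minimal sketch of the outlier-emphasizing PCA described here, assuming a simple weighted covariance; the weight value is an assumption, since the talk only says outliers get a higher weight.

```python
import numpy as np

def outlier_emphasizing_pca(supervectors, top_fraction=0.10, outlier_weight=5.0):
    """Weighted PCA that up-weights the 10% of superframes farthest from
    the sample mean. outlier_weight is illustrative, not from the paper."""
    mean = supervectors.mean(axis=0)
    dists = np.linalg.norm(supervectors - mean, axis=1)
    cutoff = np.quantile(dists, 1.0 - top_fraction)
    w = np.where(dists >= cutoff, outlier_weight, 1.0)
    # Weighted mean and weighted covariance.
    wmean = (w[:, None] * supervectors).sum(axis=0) / w.sum()
    centered = supervectors - wmean
    cov = (w[:, None] * centered).T @ centered / w.sum()
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]   # candidate decision axis
```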

0:08:15 | Another problem is how to choose the threshold,

0:08:21 | because, for example,

0:08:23 | in this case, when the speakers are imbalanced,

0:08:28 | if we just take the center of gravity

0:08:32 | as the threshold, then we would not be able to distinguish these two

0:08:38 | speakers

0:08:39 | correctly.

0:08:41 | So what we're trying to do is, again

0:08:44 | according to the same principle, we compute the 10th and 90th percentiles,

0:08:50 | look at the values that give these percentiles

0:08:54 | along the largest eigenvector,

0:08:58 | and we just take these two values, average them, and set

0:09:03 | the threshold more robustly.
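The percentile-based threshold is straightforward; a sketch, assuming the projections onto the largest eigenvector have already been computed:

```python
import numpy as np

def robust_threshold(projections):
    """Average of the 10th and 90th percentiles of the projections,
    instead of the imbalance-sensitive mean."""
    p10, p90 = np.percentile(projections, [10, 90])
    return 0.5 * (p10 + p90)

# labels = projections > robust_threshold(projections)
```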

0:09:09 | OK, so before talking about online diarization, just

0:09:13 | a few experiments for this part.

0:09:16 | So we used the NIST 2005

0:09:19 | dataset for this evaluation.

0:09:24 | And something important is that we compute the speaker error rate without discarding the

0:09:30 | margin around the speaker turns |

0:09:32 | This is contrary to the standard, and this is because we're dealing with short sessions, and

0:09:37 | when we tried to throw away

0:09:40 | data, we found out it causes some numerical problems.

0:09:46 | So basically what it means is that

0:09:48 | the results that I present are in some way

0:09:50 | a bit more pessimistic than what we would get if we

0:09:54 | used the standard method.

0:09:56 | Another important issue is that we

0:10:00 | throw away short sessions

0:10:02 | with less than 3 seconds

0:10:04 | per speaker. So

0:10:06 | actually what we do is we take

0:10:08 | the 5-minute sessions from NIST

0:10:11 | and we just chop them into

0:10:15 | short sessions.

0:10:17 | And sometimes when doing that,

0:10:20 | we may get short sessions,

0:10:22 | for example of 15 seconds,

0:10:25 | with only a single speaker,

0:10:27 | or with only 1 second from the second speaker.

0:10:29 | So in this work we do not try to deal with the problem of

0:10:35 | detecting situations where we only have a single speaker;

0:10:39 | therefore we remove such sessions.

0:10:44 | Here are the results

0:10:48 | for the diarization I talked about. Basically, what we can see here is that

0:10:54 | for long sessions we don't get

0:10:57 | any improvement or degradation;

0:10:59 | however, for short sessions, we get roughly a 15% error reduction using this

0:11:07 | technique.

0:11:12 | OK, so now let's talk about online diarization |

0:11:16 | So the framework here is the following: what we do is we

0:11:21 | take the prefix of a session,

0:11:24 | and the prefix is something that we will have to process offline.

0:11:29 | Of course, you would want the prefix to be as short as possible,

0:11:34 | and we will actually set the length of the prefix adaptively.

0:11:39 | So we start by taking a short prefix,

0:11:42 | and according to a confidence estimate, we will verify whether this

0:11:48 | prefix is good enough for the processing, or whether we should take a longer prefix

0:11:54 | and redo the processing.

0:11:56 | So we take the prefix of the session,

0:11:59 | and we do offline processing the same way: we just apply our algorithm on this prefix.

0:12:06 | The result of this processing is the segmentation for the prefix,

0:12:12 | and also

0:12:13 | some model parameters, for example the PCA,

0:12:17 | the threshold from the PCA; and then we take these model and threshold parameters and we

0:12:23 | go on to process the rest of the session online,

0:12:27 | using these models as a starting point.

0:12:31 | We update them periodically,

0:12:34 | and we do online processing,

0:12:35 | usually with some delay, because we need some kind of backtracking, so we

0:12:42 | have some short delay.

0:12:44 | It can be a second or less, but

0:12:46 | we will always have some kind of latency.

0:12:52 | So we first apply this to voice activity detection.

0:12:57 | I won't go over all the details... it's quite standard.

0:13:03 | OK, so once we have voice activity detection done online,

0:13:08 | then we have to do speaker diarization |

0:13:10 | So first there is the front end... we do it online, step by step:

0:13:16 | getting the MFCCs,

0:13:17 | extracting the supervectors,

0:13:20 | and compensating for intra-speaker variability.

0:13:24 | And then we take the prefix and we compute the PCA for the supervectors in this

0:13:31 | prefix...

0:13:33 | we project all the supervectors onto the largest eigenvector,

0:13:38 | and we do Viterbi segmentation.

0:13:41 | Then, for the rest of the session, we just take the PCA statistics from the

0:13:47 | prefix,

0:13:48 | we accumulate them online... we periodically recompute the

0:13:53 | PCA and adjust our decision boundary

0:13:57 | periodically,

0:13:58 | and we also do Viterbi with partial backtracking, with some kind of latency.
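A sketch of this online scheme; the retraining period and the zero threshold after centering are simplifying assumptions, not the authors' exact implementation.

```python
import numpy as np

class OnlineBoundary:
    """Accumulate first- and second-order statistics of incoming
    supervectors and periodically recompute the PCA decision axis."""

    def __init__(self, prefix_supervectors, retrain_every=50):
        dim = prefix_supervectors.shape[1]
        self.n = 0
        self.s1 = np.zeros(dim)            # sum of supervectors
        self.s2 = np.zeros((dim, dim))     # sum of outer products
        self.retrain_every = retrain_every
        for sv in prefix_supervectors:     # statistics from the prefix
            self.n += 1
            self.s1 += sv
            self.s2 += np.outer(sv, sv)
        self._retrain()

    def add(self, sv):
        """Feed one new superframe from the online part of the session."""
        self.n += 1
        self.s1 += sv
        self.s2 += np.outer(sv, sv)
        if self.n % self.retrain_every == 0:
            self._retrain()
        return int((sv - self.mean) @ self.axis > 0.0)  # speaker 0 or 1

    def _retrain(self):
        self.mean = self.s1 / self.n
        cov = self.s2 / self.n - np.outer(self.mean, self.mean)
        _, eigvecs = np.linalg.eigh(cov)
        self.axis = eigvecs[:, -1]
```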

0:14:07 | So here are some results.

0:14:10 | First, we will try to analyze the sensitivity to the delay

0:14:17 | parameter... the delay parameter is the delay we have when we do the online diarization.

0:14:23 | On the rest of the conversation we still have some delay, because we're using Viterbi

0:14:28 | in order to do smoothing... and we found out that 0.2

0:14:36 | seconds was good enough for this algorithm.

0:14:40 | And then we ran some experiments to check the sensitivity to the prefix length.

0:14:48 | And we found out that, if we start with a speaker error rate of 4.4,

0:14:55 | we see some significant degradation:

0:14:58 | it gets to 9.0 for 15 seconds

0:15:02 | of prefix.

0:15:03 | Now, we ran some control experiments:

0:15:06 | we did the same experiments, but

0:15:08 | we threw away all the sessions

0:15:11 | that did not have at least 3 seconds per

0:15:15 | speaker in the prefix.

0:15:17 | For example, if we take this column,

0:15:20 | we throw away all the sessions where, in the first 15 seconds,

0:15:25 | we don't have at least 3 seconds per speaker.

0:15:27 | And when we do that, we see quite good results, and the explanation is that

0:15:31 | most of the degradation is due to the fact

0:15:33 | that when we take the prefix,

0:15:35 | sometimes we do not have a representation of the two speakers.

0:15:39 | And so,

0:15:41 | the remedy we introduce is to try to apply this confidence

0:15:45 | measure I will talk about.

0:15:47 | But before talking about the confidence... the

0:15:51 | overall latency of the system is 1.3 seconds,

0:15:54 | not including the prefix. So...

0:15:56 | if we have a 5-minute conversation... for the first, say, 15 seconds,

0:16:02 | it's not online, it's offline, and then,

0:16:06 | after this prefix,

0:16:08 | we get a latency of 1.3 seconds.

0:16:14 | So now, the issue of the confidence-based prefix: we saw that

0:16:19 | sometimes 15 seconds is enough, sometimes it's not enough, and it's heavily controlled

0:16:25 | by the fact that we need both speakers to be present

0:16:30 | in the prefix.

0:16:32 | So what we do is we start with a short prefix... we do diarization,

0:16:36 | we estimate the confidence

0:16:38 | in the diarization,

0:16:40 | and if the confidence is not high enough, we just expand the prefix and

0:16:45 | start over.
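The confidence-driven prefix expansion can be sketched as a simple loop; `offline_diarize`, `confidence_of`, and all numeric values here are hypothetical placeholders, not values from the paper.

```python
def diarize_adaptive_prefix(session, start_s=15.0, step_s=15.0,
                            max_s=60.0, min_confidence=0.5):
    """Expand the offline prefix until the diarization confidence is
    high enough, then hand over to online processing."""
    prefix_s = start_s
    while True:
        segmentation, model = offline_diarize(session, prefix_s)
        if confidence_of(segmentation) >= min_confidence or prefix_s >= max_s:
            return segmentation, model    # good enough: go online from here
        prefix_s += step_s                # low confidence: take a longer prefix
```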

0:16:47 | We tried several confidence measures, and we finally chose to use the Davies-Bouldin index,

0:16:55 | which is the ratio between the average intra-class standard deviation and the inter-class distance,

0:17:01 | which we are able to calculate once we have the diarization.
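For two clusters, the Davies-Bouldin index reduces to a single ratio; a minimal numpy sketch:

```python
import numpy as np

def davies_bouldin_2(x, labels):
    """Davies-Bouldin index for the two diarization clusters: intra-class
    scatter divided by the distance between the class centroids.
    Lower values mean better separation, i.e. higher confidence."""
    c0, c1 = x[labels == 0], x[labels == 1]
    m0, m1 = c0.mean(axis=0), c1.mean(axis=0)
    s0 = np.linalg.norm(c0 - m0, axis=1).mean()
    s1 = np.linalg.norm(c1 - m1, axis=1).mean()
    return (s0 + s1) / np.linalg.norm(m0 - m1)
```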

0:17:08 | OK, so...

0:17:12 | I won't go into all the details of this slide and the next ones, but

0:17:16 | the main idea is that you can

0:17:18 | actually get nice gains

0:17:21 | by using this confidence measure. So, for example, for 30-second prefixes,

0:17:27 | 50% of the sessions need to be extended

0:17:30 | to get almost as good a result, but for the other 50% of the sessions

0:17:35 | you can just stop.

0:17:36 | So you can start with a prefix of 30 seconds,

0:17:39 | do diarization, compute this confidence measure,

0:17:43 | and for 50% of the sessions, you can decide: it's OK, I can

0:17:47 | stop now and do the online processing;

0:17:49 | and for the rest of the sessions, you would need, for example, 45 to 60

0:17:54 | seconds

0:17:55 | to get the optimal result.

0:18:01 | OK... so,

0:18:03 | what is the time complexity of the offline system and the online system?

0:18:09 | This is the question that

0:18:11 | many people asked me after the previous presentation at the

0:18:18 | last Odyssey.

0:18:20 | So we ran an experimental analysis

0:18:24 | of this algorithm,

0:18:25 | and the analysis was run on 5-minute sessions.

0:18:30 | There was no sort of optimization done,

0:18:33 | just plain research code.

0:18:36 | And so, what we see here is that the baseline system

0:18:41 | is 5 times faster than real time,

0:18:44 | and

0:18:45 | we can actually improve the accuracy of the system by taking some of the

0:18:51 | algorithms that I presented

0:18:54 | to improve the accuracy.

0:18:56 | And if we just take

0:19:01 | all the components I talked about (some of them actually degrade accuracy a bit;

0:19:05 | for example, training the UBM and NAP offline gives some degradation),

0:19:11 | then we get back to 4.4, but we get a speed-up factor of 50:

0:19:16 | 50 times faster than real time.

0:19:18 | And for the online system, if we take a prefix of 30 seconds and a

0:19:23 | delay of 0.2 seconds,

0:19:25 | then the speed-up factor is controlled by the retraining parameter;

0:19:32 | the retraining parameter means at what frequency we re-estimate our PCA model and our GMMs.

0:19:41 | So we control it in a variable way: that means we start with a high

0:19:47 | frequency at the beginning of the conversation, and then,

0:19:51 | towards the end of the conversation, we actually stop retraining, or do it at a very

0:19:57 | low frequency.

0:19:58 | We managed to get, for the online system, a speaker error rate of 7.8 with

0:20:04 | a speed-up factor of 30.

0:20:10 | OK, before concluding, I'll just talk about a specific task

0:20:17 | which we're interested in,

0:20:20 | which is speaker diarization for speaker verification.

0:20:23 | Here we're not really interested in getting a very accurate, very high-resolution

0:20:29 | diarization.

0:20:29 | We just don't want to get a degradation in the equal error

0:20:34 | rate for speaker recognition on two-wire data.

0:20:39 | We have initial work presented at Interspeech 2011, and here we have some

0:20:46 | improvements

0:20:46 | that integrate all the components that I talked about in

0:20:52 | this presentation

0:20:53 | into this variant of our system.

0:20:58 | So we divide our audio into overlapping 5-second superframes, because we don't need the

0:21:06 | high resolution,

0:21:07 | and we score each superframe independently against the target speaker model.

0:21:13 | Now, what we have to do is to be able

0:21:17 | to classify or cluster these superframes into two speakers.

0:21:24 | So what we do is a partial diarization:

0:21:27 | we cluster these superframes into two clusters, and also

0:21:32 | de-emphasize some of the superframes which are on the borderline between the clusters,

0:21:37 | because we're actually interested in speaker verification, not speaker diarization, so we

0:21:43 | can just throw away the superframes for which we are not certain to which speaker they

0:21:49 | belong.

0:21:50 | And we use eigenvoice-based dimensionality reduction and k-means.

0:21:56 | And we found out that

0:21:58 | the silhouette measure was actually optimal for de-emphasizing some of the superframes.
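A simplified sketch of the silhouette-based removal, computed against the two cluster centroids rather than with full pairwise distances; the cutoff value is an illustrative assumption.

```python
import numpy as np

def silhouette_keep_mask(x, labels, min_sil=0.1):
    """Per-superframe silhouette against the two cluster centroids;
    superframes with low silhouette (borderline between the clusters)
    are removed before verification scoring."""
    m0 = x[labels == 0].mean(axis=0)
    m1 = x[labels == 1].mean(axis=0)
    d0 = np.linalg.norm(x - m0, axis=1)
    d1 = np.linalg.norm(x - m1, axis=1)
    d_own = np.where(labels == 0, d0, d1)      # distance to own centroid
    d_other = np.where(labels == 0, d1, d0)    # distance to other centroid
    sil = (d_other - d_own) / np.maximum(d_own, d_other)
    return sil >= min_sil                      # True = keep this superframe
```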

0:22:09 | We also do it online, so we

0:22:12 | use the same prefix framework, where the prefix is processed offline, and then

0:22:18 | we just adapt it

0:22:22 | for the rest of the conversation. We use a GMM-NAP-SVM system

0:22:27 | developed on NIST 04 and 06, and evaluated on NIST 2005, for males only.

0:22:38 | We see that we get an improvement...

0:22:43 | some improvement compared to the results that we presented at Interspeech.

0:22:48 | And we also observed that, using this new technique,

0:22:54 | using the silhouette confidence measure for removing superframes with a

0:23:00 | hard decision, we get the optimal result,

0:23:05 | compared to using a soft decision or no removal at all.

0:23:12 | So, to summarize:

0:23:14 | we extended our speaker diarization method to work with short sessions and to run online,

0:23:20 | and we proposed the following novelties: offline unsupervised estimation of intra-session intra-speaker variability,

0:23:26 | so again, we use the development set to estimate this variability,

0:23:32 | but it's not labeled at all; we don't need labeled data.

0:23:36 | And we also use outlier-emphasizing PCA for improving speaker clustering, and adaptive threshold setting.

0:23:43 | The overall latency is 1.3 seconds, except for the prefix,

0:23:49 | and the speed is 50 times faster than real time for the offline system, and between

0:23:55 | 30 and 40 times for the online system.

0:23:59 | And also, for the speaker verification task... it's more in

0:24:04 | the paper than in the presentation, but

0:24:07 | we managed to substantially reduce the delay

0:24:14 | for speaker verification on summed channels.

0:24:20 | OK, thank you |

0:24:29 | For initialization, did you consider trying an online speaker segmentation

0:24:37 | algorithm? You could just find the first speaker change, so that you

0:24:42 | are sure the second speaker,

0:24:46 | or the first speaker, appears in the first 15 seconds?

0:24:51 | Yeah, what we're trying to do now is

0:24:54 | to start with...

0:24:56 | to go with the prefix

0:24:59 | framework, start with a very short prefix, and to

0:25:03 | start expanding it,

0:25:06 | and assessing whether there is a single speaker or not in this prefix... so

0:25:11 | that would be hard... yeah, that's why we don't have it

0:25:15 | in the paper.

0:25:19 | So, we...

0:25:21 | You have the speaker diarization

0:25:24 | rate, the diarization error rate?

0:25:27 | The speaker error rate; it's without voice activity detection.

0:25:30 | OK, so it's just the confusion, that's all.

0:25:36 | Uh, one thing we didn't see...

0:25:39 | can you go back to the results for

0:25:42 | the tests...

0:25:44 | for recognition, for recognition?

0:25:51 | So... do you know how the baseline was done?

0:25:56 | We did nothing, just scoring.

0:25:58 | Do you have the number?

0:26:08 | We have it in the Interspeech...

0:26:11 | in the last Interspeech paper, we have that number.

0:26:15 | The last question is about the PCA itself. So, one of the things is

0:26:19 | the NAP, which removes...

0:26:23 | do you remove the channel first, before

0:26:28 | the PCA? No...

0:26:33 | do you do any kind of channel compensation?

0:26:35 | Channel, no.

0:26:37 | We do... there's something where we actually

0:26:42 | try to do the same techniques as

0:26:44 | are being done for speaker verification;

0:26:47 | it's the NAP technique. So,

0:26:51 | what we are doing is... we're just taking

0:26:55 | pairs of adjacent supervectors,

0:26:57 | and we just assume that

0:26:58 | they belong to the same speaker, which is usually the case.

0:27:02 | Once in a while it's not, because of a speaker change, but usually they are from the same

0:27:06 | speaker.

0:27:07 | From this, we're estimating the

0:27:09 | intra-speaker variability.

0:27:12 | So you only estimate short-term variability?

0:27:14 | Short-term variability, yes.
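A minimal sketch of this adjacent-pair estimation, using the SVD of the difference vectors; the corank is an illustrative value, not the authors' setting.

```python
import numpy as np

def intra_speaker_nap(supervectors, corank=10):
    """Estimate the intra-speaker (nuisance) subspace from differences of
    adjacent superframes, which are assumed to share a speaker, then
    remove it from the supervectors."""
    diffs = supervectors[1:] - supervectors[:-1]
    # Principal directions of the adjacent-pair differences span the
    # short-term intra-speaker variability.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[:corank].T                       # (dim, corank) nuisance basis
    # NAP projection: x <- x - V V^T x
    return supervectors - (supervectors @ v) @ v.T
```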

0:27:22 | I don't understand the reason online diarization is used?

0:27:29 | OK.

0:27:29 | I'd like to know the motivation.

0:27:31 | OK, this started because actually there were two clients...

0:27:36 | one of them is...

0:27:39 | for example, in the call-center scenario,

0:27:41 | let's assume that it's two-wire;

0:27:45 | in practice, that's often the case

0:27:48 | nowadays,

0:27:50 | at least for one of the vendors,

0:27:52 | actually,

0:27:54 | this is the case. So...

0:27:57 | The project was... the idea was

0:28:00 | to run speech recognition

0:28:03 | online, on the

0:28:06 | call-center data,

0:28:08 | and to present the agent with some summary

0:28:12 | of the conversation.

0:28:14 | And in order to do the summary, they need the speaker diarization,

0:28:18 | and everything must be done online, but

0:28:20 | it can be done with some latency;

0:28:22 | for example, a 30-second prefix is OK,

0:28:27 | because it's usually a longer conversation.

0:28:34 | When you use Viterbi, do you always go all the way back to the beginning,

0:28:37 | or do you just...?

0:28:38 | In the online case, no; in the online case we do it just on a small chunk.

0:28:42 | How far do you go back?

0:28:46 | It depends, because

0:28:48 | we also, of course, tried going all the way back;

0:28:51 | it does not really cost a lot,

0:28:54 | but we found out that we can

0:28:58 | save a bit by not doing that. But it's not very important:

0:29:03 | the latency is caused by what happens after...

0:29:07 | by the future, not the past; the past is something you can do very

0:29:11 | quickly.

0:29:12 | One more question:

0:29:13 | did you try to adapt the algorithm to the

0:29:16 | multi-speaker diarization task that is used

0:29:20 | in meeting data?

0:29:22 | Actually, now we're working in

0:29:24 | the framework of a European project

0:29:26 | that's...

0:29:28 | we're dealing with

0:29:30 | a meeting-type scenario.

0:29:33 | We will have to take this algorithm and run it;

0:29:37 | we will have to modify it, of course.

0:29:40 | Alright, let's thank the speaker again.

0:29:41 | [applause]