| 0:00:06 | okay, so this talk is about unsupervised compensation of intra-session intra-speaker variability for speaker diarization | 
|---|
| 0:00:17 | okay, so the outline of the talk is the following: I will first define the notion of intra-session intra-speaker variability, which is actually very close to the concept of segment-level variability that was described in the previous presentation | 
|---|
| 0:00:36 | I will talk about how to estimate this variability, in a supervised manner and in an unsupervised manner, and how to compensate for it | 
|---|
| 0:00:45 | then I will propose a speaker diarization system which is supervector-based and which also utilises compensation of this intra-session intra-speaker variability, which from now on I will just call intra-speaker variability | 
|---|
| 0:01:03 | and I will report experiments, summarise, and talk about future work | 
|---|
| 0:01:11 | okay, so why does intra-speaker variability matter? It is the reason why speaker diarization is not a trivial task: if there were no intra-speaker variability, it would be very easy to perform this task | 
|---|
| 0:01:27 | now, when I talk about intra-speaker variability, I mean phonetic variability; energy or loudness variability within the speech of a single speaker; acoustic, speaker-intrinsic variability; speech-rate variation; and even non-speech events, which sometimes introduce more errors and sometimes fewer, and can also cause variability that is harmful to the diarization algorithm | 
|---|
| 0:01:58 | and the relevant tasks for this kind of variability are, first of all, speaker diarization, but also speaker recognition with short training and testing sessions | 
|---|
| 0:02:10 | okay, so the idea now is first to propose a proper generative model to handle this kind of variability. I will start with the classical GMM model, where a speaker is modelled by a GMM and the frames are generated independently according to this GMM | 
|---|
| 0:02:32 | now, in 2005 we proposed a modified model that also accounts for intersession variability. According to this model a speaker is not modelled by a single GMM; rather, it is modelled by a PDF over the session space: every session is modelled by a GMM, and the frames are generated according to this session GMM | 
|---|
| 0:02:57 | so now what we want to do is to augment this model again and propose a richer model. In this case a speaker is a PDF over the session space, a session is in turn a PDF over the segment space, a segment is modelled by a GMM, and the frames are generated independently according to the segment GMM | 
|---|
| 0:03:31 | if we want to visualise this in the GMM supervector space, we can see in this slide four supervectors, corresponding to four different segments of the same speaker recorded in a particular session, session A, and this can be modelled by a distribution | 
|---|
| 0:03:56 | now, if the same speaker talks in a different session, we get another distribution, and we can model the entire set of sessions with a single distribution, which will be the speaker-A distribution; and we can do the same for speaker B | 
|---|
| 0:04:19 | okay, so according to this generative model, we assume that each supervector, for a particular speaker, a particular session, and a particular segment, distributes normally with some mean that is speaker- and session-dependent, and a covariance that is also speaker- and session-dependent. This is quite a general model; we will later make more assumptions in order to make it practical to actually use | 
|---|
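To make the hierarchy concrete, here is a hedged sketch of the model in equations; the symbols are my own notation, not the speaker's slides:

```latex
% speaker -> session -> segment -> frames (notation assumed)
\mu_{\mathrm{spk,sess}} \sim p(\mu \mid \mathrm{spk})
\quad\text{(a session drawn from the speaker's PDF)}
s_{\mathrm{seg}} \sim \mathcal{N}\!\bigl(\mu_{\mathrm{spk,sess}},\, \Sigma_{\mathrm{spk,sess}}\bigr)
\quad\text{(a segment supervector)}
x_t \sim \mathrm{GMM}(s_{\mathrm{seg}})
\quad\text{(frames generated independently from the segment GMM)}
```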
| 0:04:56 | okay, so back in 2007, in a paper called Trainable Speaker Diarization at Interspeech, we used this generative model to do supervised intra-speaker variability modelling for speaker diarization, in the context of trainable diarization | 
|---|
| 0:05:17 | in that case, what we did was assume that the covariance of the intra-speaker variability is speaker- and session-independent, so we have a single covariance matrix, and we estimated it from a labelled development set, which makes it quite trivial to estimate. Once we have estimated this covariance matrix, we can use it to develop a metric induced from it, and use this metric to actually perform the diarization; the techniques we used were PCA and WCCN | 
|---|
| 0:05:59 | okay, so this worked, in 2007, but the problem with this technique is that we must have a labelled development set, and this development set must be labelled according to speaker turns, which can sometimes be problematic | 
|---|
| 0:06:27 | okay, so in this paper, what we do is assume that the covariance matrix is session-dependent: it is not global, as before | 
|---|
| 0:06:50 | and with the algorithm that is described in the next slide, we actually do not need any labelled data; in fact we have no training process at all. We just estimate the covariance matrix, which is session-dependent, on the fly, on a per-session basis | 
|---|
| 0:07:08 | okay, so how do we do that? The first stage is GMM supervector parameterisation. We take the session that we want to diarize, and we first extract features, such as MFCC features, at the standard frame rate. Then we estimate a session-dependent UBM, which is a GMM of fairly low order (a typical order is 64), using the EM algorithm | 
|---|
| 0:07:42 | after that, we take the speech signal and divide it into overlapping superframes at a frame rate of ten per second: we define a chunk of one second as a superframe, with ninety percent overlap. For each superframe we estimate a GMM using standard MAP adaptation from the UBM we just estimated, and then we concatenate the parameters of that GMM into a supervector, which is the representation of that superframe | 
|---|
| 0:08:19 | so at the end of this process the session is parameterised by a sequence of supervectors | 
|---|
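As a rough illustration of this parameterisation stage, here is a minimal sketch of MAP-adapting the UBM means to one superframe and stacking them into a supervector. The relevance factor `r` and the toy dimensions are my assumptions; the talk does not give implementation details.

```python
import numpy as np

def map_adapt_supervector(frames, weights, means, variances, r=16.0):
    """MAP-adapt the UBM means to one superframe and stack them into a
    supervector (illustrative sketch; relevance factor r is assumed)."""
    # log-likelihood of each frame under each diagonal-covariance Gaussian
    log_w = np.log(weights)
    ll = np.empty((frames.shape[0], weights.size))
    for k in range(weights.size):
        diff = frames - means[k]
        ll[:, k] = (log_w[k]
                    - 0.5 * np.sum(np.log(2 * np.pi * variances[k]))
                    - 0.5 * np.sum(diff ** 2 / variances[k], axis=1))
    # posterior responsibilities of the mixture components
    post = np.exp(ll - ll.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # zeroth/first-order statistics and MAP interpolation of the means
    n_k = post.sum(axis=0)                       # (K,)
    f_k = post.T @ frames                        # (K, d)
    alpha = (n_k / (n_k + r))[:, None]
    adapted = alpha * (f_k / np.maximum(n_k, 1e-10)[:, None]) + (1 - alpha) * means
    return adapted.reshape(-1)                   # supervector of length K*d

# toy usage: a 4-component, 2-dimensional "UBM" and one random superframe
rng = np.random.default_rng(0)
weights = np.full(4, 0.25)
means = rng.normal(size=(4, 2))
variances = np.ones((4, 2))
sv = map_adapt_supervector(rng.normal(size=(100, 2)), weights, means, variances)
```

In the system described above this would run once per overlapping one-second superframe, yielding the sequence of supervectors.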
| 0:08:27 | okay, so the next phase is to estimate the intra-speaker covariance matrix. We are given the sequence of supervectors, and we note that the supervector at time t, s_t, is actually the sum of two components: the first is a speaker mean component, which is speaker- and session-dependent | 
|---|
| 0:08:50 | and the second component is the actual nuisance, the intra-speaker variability, which is denoted by i_t; according to our previous assumption, this nuisance distributes normally with zero mean and a covariance matrix which is session-dependent | 
|---|
| 0:09:10 | now, our goal is to estimate this covariance matrix, and what we do is consider the difference between two consecutive supervectors. So in equation four I define delta_t, which is the difference between the supervectors of two consecutive superframes. We can see that the difference between two superframes is actually the difference between the corresponding intra-speaker variability terms, i_t minus i_{t-1}, plus a component which is usually zero, because most of the time there is no speaker change; but whenever there is a speaker change, this value is non-zero | 
|---|
| 0:09:57 | now, what we do next is something a bit risky: we assume that most of the time there is no speaker change, and that we can live with the noise added by the violations of this assumption, so we just neglect the impact of the speaker changes on the covariance matrix of delta_t. Basically, we compute the covariance matrix of the deltas by simply disregarding the places where there is a speaker change and computing the covariance over all the deltas | 
|---|
| 0:10:35 | so what we get is that if we want to estimate the covariance of the intra-speaker variability, all we have to do is compute the empirical covariance of the delta supervectors; that is what we do, using the equation at the bottom of the slide | 
|---|
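The estimation step above can be sketched in a few lines. The factor of one half reflects that, if consecutive nuisance terms are independent with a common covariance Sigma, then cov(delta) = 2 * Sigma; that factor is my reading of the construction, not something stated explicitly in the talk.

```python
import numpy as np

def intra_speaker_covariance(supervectors):
    """Empirical covariance of consecutive supervector differences, used
    as a session-dependent estimate of the intra-speaker covariance;
    speaker changes are simply neglected, as in the talk."""
    deltas = np.diff(supervectors, axis=0)      # delta_t = s_t - s_{t-1}
    return 0.5 * np.cov(deltas, rowvar=False)   # halve: var(i_t - i_{t-1}) = 2*Sigma

# toy usage on random stand-in "supervectors"
rng = np.random.default_rng(1)
sigma_hat = intra_speaker_covariance(rng.normal(size=(50, 3)))
```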
| 0:11:03 | okay, so once we have done that and obtained some kind of estimate of the intra-speaker variability, we can try to compensate for it. The way we do that is, of course, to assume that it is of low rank, and to apply PCA to find a basis for this low-rank subspace in the supervector space | 
|---|
| 0:11:22 | now we have two possible compensation options. The first is to compensate for the intra-speaker variability in the supervector space, which is only useful if we want to do the diarization in that space. The alternative is feature-based NAP, where we compensate for the intra-speaker variability in the feature space; then we can actually use any other diarization algorithm on the compensated features | 
|---|
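A hedged sketch of the supervector-space option (a NAP-style projection): take the top eigenvectors of the estimated covariance as the nuisance basis and project them out of every supervector. The rank and the toy data are illustrative assumptions.

```python
import numpy as np

def nap_compensate(supervectors, sigma, rank):
    """Remove the low-rank intra-speaker subspace: s <- s - U U^T s,
    where U spans the top-`rank` eigenvectors of sigma."""
    eigvals, eigvecs = np.linalg.eigh(sigma)    # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :rank]              # nuisance basis, largest first
    return supervectors - (supervectors @ U) @ U.T

# toy usage: nuisance variability concentrated along the first axis
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
Xc = nap_compensate(X, np.diag([10.0, 1.0, 0.1]), rank=1)
```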
| 0:11:57 | okay, so what we actually decided is to use NAP and to do the rest of the diarization in the supervector domain | 
|---|
| 0:12:09 | the motivation for that, the reason why, is illustrated by the figures at the bottom of the slide. On the left-hand side we see an illustration in the feature space (this is not real data, just an illustration) where we have two target speakers, and the problem is that a diarization algorithm of course does not know the colour of the points, and it is very hard to separate the two speakers | 
|---|
| 0:12:43 | now, if we work in the supervector space, as in the middle of the slide, we see that the distributions of the two classes tend to be more unimodal; therefore it is much easier to attempt the diarization, and this is of course due to the fact that we do some smoothing, because a superframe is one second long | 
|---|
| 0:13:07 | and if we then manage to remove a substantial amount of the intra-speaker variability, what we get is the illustration on the right-hand side of the slide. Even if we do not know the two colours, that is, if we do not know which points belong to which speaker, it is still reasonable that we can find a way to separate the speakers, because the separation here is very easy | 
|---|
| 0:13:38 | okay, so this was the motivation | 
|---|
| 0:13:43 | okay, so an overview of the algorithm is as follows: first, of course, we parameterise the speech, as shown, to obtain a series of supervectors, and we compensate for intra-speaker variability as I have already described | 
|---|
| 0:13:58 | then we use the Viterbi algorithm to do the segmentation: we just assign a single state for each speaker (actually we also use a length constraint, so there is some simplification here, but basically we use one state per speaker), and all we have to do is to estimate the transition probabilities and the output probabilities | 
|---|
| 0:14:18 | now, the transition probabilities are very easy to estimate: we can just take a very small development set and estimate the statistics of the length of a speaker turn; if we know the mean turn length, we can estimate the transition probabilities | 
|---|
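The transition estimate described above can be sketched as follows; the geometric duration model (a self-loop probability chosen so the expected stay matches the mean turn length) is my assumption about how the mean would be used, not a detail given in the talk.

```python
import numpy as np

def transition_matrix(mean_turn_superframes, n_speakers=2):
    """One state per speaker; under a geometric duration model the
    expected stay in a state is 1 / (1 - p_stay), so we match it to the
    mean speaker-turn length measured on a small development set."""
    p_stay = 1.0 - 1.0 / mean_turn_superframes
    p_switch = (1.0 - p_stay) / (n_speakers - 1)
    A = np.full((n_speakers, n_speakers), p_switch)
    np.fill_diagonal(A, p_stay)
    return A

# e.g. a 3-second mean turn at 10 superframes per second
A = transition_matrix(30.0)
```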
| 0:14:43 | the tricky part is to estimate the output probabilities: the probability of a supervector at time t given speaker i. We do not do this directly; we do it indirectly, and I will describe how in the next slide | 
|---|
| 0:15:04 | so, assuming we can do this in some way, we just apply Viterbi segmentation and come up with a first segmentation, and then we refine the diarization using iterative Viterbi resegmentation: we go back to the original frame-based features, train the speaker HMMs using our previous segmentation, and rerun the Viterbi resegmentation, but now the resolution is better, because we are working at a frame rate of a hundred per second. We do this for a couple of iterations, and that is the end of the algorithm | 
|---|
| 0:15:42 | so what I still have to explain is how to estimate the output probabilities: the probability of supervector s_t given speaker i | 
|---|
| 0:15:55 | okay, so the method is the following (I will give some motivation after this slide). We find the largest eigenvalue of the covariance matrix of all the compensated supervectors from the session, and take the corresponding eigenvector: we take the compensated supervectors, compute their covariance matrix, and take the first eigenvector | 
|---|
| 0:16:25 | then we take each compensated supervector and project it onto this largest eigenvector, so for each time t we get p_t, the projection of supervector s_t onto the largest eigenvector | 
|---|
| 0:16:47 | now, I am not going to prove this here (the details are in the paper; we will just try to get some intuition), but the point is that the log-likelihood ratio we are looking for, the log-likelihood of supervector s_t given speaker one minus the log-likelihood of s_t given speaker two, is actually a linear function of this projection. So all we have to do is to somehow estimate the parameters a and b, and then we can approximate the likelihoods we need in order to plug them into the Viterbi algorithm, just by taking this projection | 
|---|
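As a hedged sketch of this scoring step: compute the leading eigenvector of the compensated supervectors' covariance, project every supervector onto it, and scale by the calibration constants. The toy two-cluster data and the values of `a` and `b` are illustrative assumptions.

```python
import numpy as np

def projection_scores(supervectors, a=1.0, b=0.0):
    """Approximate log p(s_t|spk1) - log p(s_t|spk2) as a * p_t + b,
    where p_t is the projection of the (compensated) supervector onto
    the leading eigenvector; b = 0 assumes equally dominant speakers,
    as in the talk."""
    centered = supervectors - supervectors.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalue order
    p = centered @ eigvecs[:, -1]               # projection onto top eigenvector
    return a * p + b

# toy usage: two well-separated "speakers" along one direction
toy = np.vstack([np.full((10, 2), 5.0), np.full((10, 2), -5.0)])
scores = projection_scores(toy)
```

These scores would then enter the Viterbi pass as the per-superframe emission log-likelihood ratios.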
| 0:17:28 | but just taking this projection | 
|---|
| 0:17:30 | now be is actually | 
|---|
| 0:17:32 | is related to the dominance of of that speaker if we have two speakers and one of them is more | 
|---|
| 0:17:38 | dominant than be will be uh either positive or negative | 
|---|
| 0:17:41 | and in this day in in our experiments we just assumed to be zero therefore we assume that | 
|---|
| 0:17:47 | the speakers are equally dominant | 
|---|
| 0:17:49 | it's about this uh | 
|---|
| 0:17:51 | the conversation of course not true but uh we hope to | 
|---|
| 0:17:55 | accommodate with this uh problem a by using a final a viterbi resegmentation | 
|---|
| 0:18:00 | and a is assumed to be corpus the tape and then call content and we just have to make it | 
|---|
| 0:18:06 | single uh | 
|---|
| 0:18:07 | parameter with timit form this mode of that | 
|---|
| 0:18:10 | and | 
|---|
| 0:18:10 | no rules and do those out in the paper | 
|---|
| 0:18:14 | just to get some motivation, let us pretend we have a very easy problem: in this case the red points are one speaker and the blue points are the second speaker, and this is actually in the supervector space, after we have done the intra-speaker compensation, so the distributions have become quite unimodal. If we take all the blue and red points together, compute the covariance matrix, and take the first eigenvector, what we get is the black arrow | 
|---|
| 0:18:46 | so if we take the black arrow and just project all the points onto it, we get the distribution on the right-hand side of the slide; and if we decide according to the projection onto the black arrow, if we just compute the decision boundary, we see that it is actually exactly the optimal decision boundary (we can compute the optimal boundary because we know the true distributions of the red and blue points, since this is artificial data). So this algorithm works, in this simple case | 
|---|
| 0:19:23 | now, as we increase the amount of residual intra-speaker variability, as if we had not managed to remove almost all of it, we see that the decision boundary we obtain is no longer exactly the optimal one, and this algorithm will actually fail at some point, when there is too much residual intra-speaker variability. So the hope is that our algorithm, which is supposed to compensate for intra-speaker variability, actually manages, on the data we are going to work on, to remove enough of the variability to let this method work | 
|---|
| 0:20:11 | okay, so now I will report experiments. We used a small development set from NIST 2005: one hundred sessions. Actually, I do not think we really need that much data for development, because we estimate only a handful of parameters: the set is used to tune the HMM transition parameters and the a parameter, which is used for the log-likelihood calibration | 
|---|
| 0:20:44 | and we used the NIST 2005 corpus for evaluation. What we did is we took the stereo phone calls and artificially summed the two sides, in order to get two-wire data, and we derived the ground truth from the ASR transcripts provided by NIST | 
|---|
| 0:21:09 | we report the speaker error rate; our error measure is the speaker error rate, computed with the standard NIST evaluation tool | 
|---|
| 0:21:24 | okay, and in order to be able to compare our results to different systems, we use a baseline which is BIC-based, inspired by the LIMSI 2005 system: it is based on detection of speaker changes using BIC, then iterations of Viterbi resegmentation along with agglomerative BIC clustering, followed finally by a stage of Viterbi resegmentation | 
|---|
| 0:21:52 | okay, so these are the main results. On all the data, we obtained a speaker error rate of 6.1 for the baseline; when we just used the supervector-based system, without any intra-speaker variability compensation, we got 4.8; and when we also used intra-speaker variability compensation, we got 2.8. In this setup the supervector GMM order is 64 and the NAP compensation order is 5 | 
|---|
| 0:22:34 | we ran some experiments to try to improve on this; we actually did not manage to, but we tried to see what happens if we change the front end. We found that feature warping actually degrades performance; this has been explained before by the fact that the channel is something we want to exploit when doing diarization, because it may be the case that different speakers are on different channels, so we do not want to remove channel information. Other changes to the front end also slightly degraded performance | 
|---|
| 0:23:13 | we also wanted to check what the error rate would be if we had a perfect estimate of the calibration parameter, and it actually improved only slightly, to 2.6 | 
|---|
| 0:23:31 | some more experiments were aimed at checking the sensitivity of the system to the GMM order and to the NAP compensation order. For the GMM order, the best orders are 64 and 128; I did not try to increase it further, so I do not know what would happen with a higher order. For the NAP order, we see that the system is quite insensitive: even at an order of 15 we still get a speaker error rate of about 3, with the best performance at around order 5 | 
|---|
| 0:24:11 | okay so finally before i am i | 
|---|
| 0:24:13 | and | 
|---|
| 0:24:14 | one of the motivations for this work was to use this the devastation for | 
|---|
| 0:24:19 | speaker I D into in um a | 
|---|
| 0:24:22 | they are in a | 
|---|
| 0:24:23 | to wire data | 
|---|
| 0:24:25 | so we we try to to say whether we get improvements using this devastation | 
|---|
| 0:24:30 | so | 
|---|
| 0:24:30 | what would we did it we we tried we defended acquisition | 
|---|
| 0:24:34 | and systems one of them was that a reference to | 
|---|
| 0:24:37 | a diarisation second one was the baseline diarization | 
|---|
| 0:24:41 | the third was the proposal | 
|---|
| 0:24:42 | it every station | 
|---|
| 0:24:44 | and we tried to apply diarization either only on the test data, or also on the train data, because one of our sponsors actually has this problem: the training data is also summed | 
|---|
| 0:25:01 | therefore we wanted to check the performance also when the training data is summed | 
|---|
| 0:25:09 | and we used a NAP-based speaker ID system that achieves an equal error rate of 5.2 on the NIST 2005 female data set | 
|---|
| 0:25:21 | and what we concluded is that when only the test data is summed, the proposed diarization achieves performance that is equivalent to using manual diarization | 
|---|
| 0:25:32 | and when both the train and test data are summed, there is a degradation compared to the reference diarization | 
|---|
| 0:25:41 | however, the degradation is smaller than what we get for the baseline system when we use the proposed diarization | 
|---|
| 0:25:51 | okay, so to summarize: we have described an algorithm for unsupervised estimation of intra-session intra-speaker variability | 
|---|
| 0:26:01 | and we also described two methods to compensate for this variability: one of them uses NAP in supervector space, and, although i didn't report any results, we also ran experiments using feature-space NAP, that is, NAP in the feature space | 
|---|
| 0:26:18 | and using the two-speaker diarization based on GMM supervectors we got a speaker error of 4.1 | 
|---|
| 0:26:28 | and if we also apply intra-speaker variability compensation within the supervector-based two-speaker diarization system, we get a further improvement, to 2.8 | 
|---|
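A rough sketch of how a GMM supervector can be computed per segment (means-only MAP adaptation from a diagonal-covariance UBM, with the adapted means stacked into one vector). This is a generic textbook construction with hypothetical names, not the talk's actual system.

```python
import numpy as np

def map_adapted_supervector(frames, ubm_means, ubm_vars, ubm_weights, r=16.0):
    """Stack MAP-adapted GMM means into a supervector.

    frames: (t, d) features for one segment.
    ubm_means, ubm_vars: (m, d) diagonal-covariance UBM parameters.
    ubm_weights: (m,) mixture weights. r: MAP relevance factor.
    """
    # Per-frame mixture posteriors under the diagonal-covariance UBM.
    log_p = -0.5 * (((frames[:, None, :] - ubm_means) ** 2) / ubm_vars
                    + np.log(2 * np.pi * ubm_vars)).sum(-1)
    log_p += np.log(ubm_weights)
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                     # zeroth-order statistics
    f = post.T @ frames                      # first-order statistics
    alpha = (n / (n + r))[:, None]           # per-mixture adaptation weight
    means = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm_means
    return means.ravel()
```

A two-speaker diarization along these lines extracts one supervector per short segment and clusters the supervectors into two groups (e.g., with k-means), optionally after compensating intra-speaker variability first.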
| 0:26:40 | and finally, the whole system improves speaker ID accuracy for summed audio, especially for the summed-summed case | 
|---|
| 0:26:50 | now, future work would be first of all to apply feature-space NAP for two-speaker diarization of the conversation, and then to use different diarization systems; this could be interesting | 
|---|
| 0:27:06 | and we did try feature-space NAP, but then we just estimated supervectors, so all we did was on top of the supervector space | 
|---|
| 0:27:16 | and of course we should try to extend this work to diarization of multiple speakers, possibly in broadcast news or in meetings | 
|---|
| 0:27:24 | and of course to integrate other methods, such as inter-speaker variability modeling, which were proposed lately, with this approach | 
|---|
| 0:27:50 | first of all, congratulations on your work | 
|---|
| 0:27:55 | you restricted yourself to two-speaker diarization, so just clustering, something like diarization without model selection | 
|---|
| 0:28:04 | in my opinion the beauty of diarization is the model selection, but anyway | 
|---|
| 0:28:09 | you claim that with this subspace projection you somehow gaussianize the data, right? | 
|---|
| 0:28:19 | yeah, i think that first of all, by using the supervector approach it's more convenient to do diarization, because in some sense your distribution tends to be more unimodal | 
|---|
| 0:28:38 | and also, i claim that you can use techniques similar to those used for speaker recognition, such as inter-session variability modeling, in this domain | 
|---|
| 0:28:50 | so if you want, forget about two speakers, you really have a model selection task | 
|---|
| 0:28:56 | one question would be something like: you consider the results in, let's say, the image of this subspace, after the projection | 
|---|
| 0:29:06 | since you claim these would be gaussian, then traditional model selection techniques, like the bayesian information criterion, would be applicable without tuning, or hopefully so | 
|---|
| 0:29:22 | because the misspecification of the model is obvious when you use it in the MFCC domain, which is clearly multimodal | 
|---|
| 0:29:31 | okay, well, if you consider that there is now also gaussianization, then probably BIC will give you answers in this subspace about the number of speakers as well | 
|---|
| 0:29:44 | the number of speakers, possibly, or integrating it into an HMM framework | 
|---|
| 0:29:49 | so we should consider this, definitely | 
|---|
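For the model selection point raised in this exchange, a toy version of the standard BIC test (one full-covariance Gaussian for all the data versus one per cluster) might look as follows. This is a generic sketch, not something evaluated in the talk; the penalty convention shown is one common choice.

```python
import numpy as np

def gaussian_loglik(x):
    """Log-likelihood of data under a single full-covariance Gaussian (MLE fit).
    With the MLE covariance, the summed Mahalanobis term equals n * d."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True) + 1e-8 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def delta_bic(x, labels, lam=1.0):
    """BIC comparison: one Gaussian for all data vs. one per cluster.
    A positive value favors the split (e.g., two-speaker) hypothesis."""
    n, d = x.shape
    ll_split = sum(gaussian_loglik(x[labels == k]) for k in np.unique(labels))
    ll_single = gaussian_loglik(x)
    n_params = d + d * (d + 1) / 2       # extra mean + extra covariance
    return (ll_split - ll_single) - 0.5 * lam * n_params * np.log(n)
```

In a gaussianized subspace, as the questioner suggests, such a criterion could in principle decide the number of speakers without per-domain tuning of `lam`.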
| 0:30:03 | now a question about the summed condition | 
|---|
| 0:30:07 | how can you define the summed condition in the training set? because if you make the diarization in the training set, you don't know which cluster is the target speaker, okay? | 
|---|
| 0:30:19 | so the method that was decided with the sponsor is that you run diarization, and then, since you have the reference segmentation, you just decide which side has more of an overlap with the segmentation that you got | 
|---|
| 0:30:43 | so you know what portions of the speech you actually need to train on | 
|---|
| 0:30:51 | okay, so because you have the four-wire data and you have a segmentation, you get two clusters | 
|---|
| 0:30:59 | and then you just compare these clusters automatically to the reference to decide which cluster is more correct | 
|---|
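The overlap-based choice between the two clusters can be sketched as follows; representing segments as (start, end) pairs in seconds and the function names are illustrative assumptions.

```python
def overlap(segs_a, segs_b):
    """Total time overlap between two lists of (start, end) segments."""
    total = 0.0
    for a0, a1 in segs_a:
        for b0, b1 in segs_b:
            total += max(0.0, min(a1, b1) - max(a0, b0))
    return total

def pick_target_cluster(clusters, reference):
    """Return the index of the hypothesized cluster that overlaps most
    with the reference (e.g., four-wire) segmentation of the target speaker."""
    return max(range(len(clusters)),
               key=lambda i: overlap(clusters[i], reference))
```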
| 0:31:10 | and that's something that, actually, if you think about it: if someone trains on summed data, one of the ways you can assist him is to apply automatic diarization and just let the user listen to the two clusters, and he'll decide which cluster is the right one | 
|---|
| 0:31:28 | that's the motivation | 
|---|
| 0:31:31 | thank you | 
|---|