0:00:06 | okay so this talk is about unsupervised compensation of intra-session intra-speaker variability for speaker diarisation |
---|---|

0:00:17 | okay |

0:00:18 | so |

0:00:18 | the outline of the talk is the following i will first uh |

0:00:22 | define the notion of intra-session intra-speaker variability and |

0:00:26 | actually it's very |

0:00:26 | yeah |

0:00:27 | equivalent to that |

0:00:29 | to the notion to the concept of inter-segment variability that was described in the previous uh |

0:00:35 | presentation |

0:00:36 | i will talk about how to estimate |

0:00:38 | this variability in an unsupervised |

0:00:40 | manner |

0:00:42 | in an unsupervised manner |

0:00:44 | and how to compensate for it |

0:00:45 | then i will propose a two-speaker diarization |

0:00:49 | system |

0:00:50 | which is supervector based and also utilises |

0:00:53 | um compensation of |

0:00:55 | this uh intra-session intra-speaker variability |

0:00:58 | which from now on i will just call intra-speaker variability |

0:01:03 | i will then report experiments |

0:01:05 | and i will summarise and talk about future work |

0:01:11 | okay |

0:01:12 | so |

0:01:13 | the idea is that intra-speaker variability is uh |

0:01:17 | in fact this is the reason why speaker diarisation is not a trivial task |

0:01:21 | because if there was no intra-speaker variability it would be very easy to |

0:01:26 | to perform this task |

0:01:27 | and now when i talk about intra-speaker variability i |

0:01:30 | i'm |

0:01:31 | talking about phonetic variability |

0:01:33 | energy or loudness uh variability |

0:01:36 | within the speech of a single speaker |

0:01:39 | acoustic or speaker intrinsic uh variability |

0:01:43 | speech rate variation and even then |

0:01:45 | non-speech rate which is due to the VAD |

0:01:48 | so sometimes the VAD makes more errors and sometimes the VAD makes fewer errors and this can also |

0:01:53 | cause um variability that can be harmful to the |

0:01:56 | diarisation algorithm |

0:01:58 | and the relevant tasks for this kind of variability are first of all of course speaker diarization |

0:02:04 | but also speaker recognition in short training and testing sessions |

0:02:10 | okay so the idea now is to |

0:02:13 | propose a generative model a proper generative model |

0:02:17 | to handle this kind of variability |

0:02:19 | first i will start with the classical gmm model |

0:02:22 | where a speaker is uh |

0:02:23 | is modelled by a gmm |

0:02:26 | and the frames are |

0:02:27 | generated independently according to this gmm |

0:02:32 | now in two thousand five we proposed a modified model |

0:02:37 | that accounts also for inter-session variability |

0:02:41 | according to this model a speaker is not modelled by a single gmm |

0:02:45 | rather it's modelled by a pdf over the session space |

0:02:49 | uh every session is modelled by a gmm and uh frames are |

0:02:53 | generated |

0:02:54 | according to this session gmm |

0:02:57 | so now what we want to do is to again augment this model and |

0:03:01 | to propose uh a hierarchical model |

0:03:04 | where a speaker in this case is a |

0:03:07 | is a pdf over the session space |

0:03:16 | the speaker is a pdf over the session space |

0:03:18 | a session is again a pdf over the segment space and a segment is modelled by a gmm |

0:03:25 | and the frames are generated independently according to the segment gmm |

0:03:31 | if we want to visualise this in uh gmm space or the gmm supervector space we can see in this |

0:03:37 | slide |

0:03:38 | we see four supervectors |

0:03:40 | and these four supervectors |

0:03:42 | correspond to four different segments |

0:03:45 | of the same speaker speaker a in a particular session session alpha |

0:03:53 | and this can be modelled by this distribution |

0:03:56 | now if the same speaker |

0:03:57 | talks in a different session |

0:03:59 | then |

0:04:00 | it was okay |

0:04:01 | uh this distribution |

0:04:03 | and we can model the entire session |

0:04:06 | set of sessions with a single distribution which will be the speaker a distribution |

0:04:13 | and we can do the same for speaker B |

0:04:19 | okay so according to this uh |

0:04:21 | generative model |

0:04:22 | we assume that each supervector |

0:04:29 | for a |

0:04:30 | particular speaker |

0:04:31 | a particular session and a particular segment |

0:04:34 | is normally distributed |

0:04:35 | with some mean and covariance where the mean is speaker and session dependent |

0:04:40 | and the covariance is also speaker and session dependent |

0:04:42 | so this is a |

0:04:44 | quite um |

0:04:45 | general model we will then make more assumptions in order to make it |

0:04:50 | possible to actually use this model |

0:04:55 | okay |

0:04:56 | so back in two thousand seven in a paper called trainable speaker diarization in interspeech |

0:05:03 | we used this model |

0:05:05 | this general generative model |

0:05:07 | to |

0:05:08 | do supervised |

0:05:09 | intra-speaker variability modelling for speaker diarisation |

0:05:13 | in the context of broadcast news diarisation |

0:05:17 | and in this case what we did we assumed that the covariance |

0:05:21 | of the intra-speaker variability |

0:05:23 | is speaker and session independent |

0:05:25 | so we have a single covariance matrix |

0:05:28 | and we estimated it from a labelled |

0:05:31 | segmented development set |

0:05:34 | and so it's quite trivial to estimate it |

0:05:37 | and once we have estimated this covariance matrix we can use it to |

0:05:42 | to develop a metric |

0:05:45 | that is induced |

0:05:46 | from this uh |

0:05:48 | covariance matrix |

0:05:49 | and use this metric to actually perform the diarisation |

0:05:53 | and the techniques that we used are pca and W C C N |

0:05:59 | okay so this was in two thousand seven but the problem with this technique is that we must have a labelled |

0:06:05 | development set |

0:06:06 | and this development set must be labelled according to speaker turns and sometimes this can be problematic |

0:06:27 | okay |

0:06:29 | so |

0:06:30 | in |

0:06:31 | in this paper |

0:06:33 | what we do is we assume |

0:06:36 | that the covariance matrix is session dependent |

0:06:39 | so |

0:06:40 | it's not global as we did before but it's session dependent |

0:06:44 | and |

0:06:45 | that |

0:06:46 | and in the algorithm that i'm going to describe in the next slide |

0:06:50 | actually we don't need any labelled data |

0:06:53 | we don't use any labelled data we have no training process |

0:06:57 | we just |

0:06:59 | estimate the covariance matrix which is |

0:07:01 | session dependent we estimate it on the fly on a per |

0:07:04 | session basis |

0:07:08 | okay so how do we do that |

0:07:10 | the first stage is to do gmm supervector parameterisation |

0:07:14 | what we do here is that we take the session the session that we want to |

0:07:18 | and that |

0:07:19 | to apply diarisation to |

0:07:21 | and we first extract standard features such as mfcc features at the standard frame rate |

0:07:28 | and then what we do we estimate a session dependent ubm |

0:07:32 | of course this ubm is of low order |

0:07:36 | typical order can be for them |

0:07:38 | four |

0:07:39 | and we do that using the E M algorithm |

0:07:42 | now after we do that we take the speech signal and divide it into overlapping superframes |

0:07:48 | which are at a frame rate of ten per second |

0:07:51 | so what we do we just |

0:07:53 | define a chunk of one second as a superframe |

0:07:57 | and we have a ninety percent overlap |

0:07:59 | now for each |

0:08:00 | such superframe we estimate a gmm using standard uh |

0:08:04 | map adaptation from the uh the ubm that we just |

0:08:08 | estimated |

0:08:09 | and now |

0:08:10 | and then we take the uh the parameters of the gmm we concatenate them into a supervector |

0:08:15 | and this is our representation |

0:08:17 | for that superframe |

0:08:19 | so at the end of this process the session is parameterised by a sequence of |

0:08:23 | of |

0:08:24 | supervectors |
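
A minimal sketch of the superframe layout described above, assuming 100 feature frames per second (the rate mentioned later in the talk) and one-second superframes with ninety percent overlap. The function name `superframe_bounds` and its parameters are hypothetical and introduced only for illustration; the per-superframe GMM estimation is not shown.

```python
def superframe_bounds(num_frames, frame_rate=100, length_s=1.0, overlap=0.9):
    """Return (start, end) frame indices of overlapping superframes."""
    length = int(round(frame_rate * length_s))           # frames per superframe
    hop = max(1, int(round(length * (1.0 - overlap))))   # shift between superframes
    return [(start, start + length)
            for start in range(0, num_frames - length + 1, hop)]


# Three seconds of features at the assumed 100 frames per second.
bounds = superframe_bounds(num_frames=300)
```

Under these assumed settings consecutive superframes start 10 frames apart, which matches the rate of ten superframes per second quoted in the talk.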

0:08:27 | okay |

0:08:27 | so now the next |

0:08:29 | next phase |

0:08:30 | is to estimate the intra-speaker covariance matrix |

0:08:33 | and we're given the sequence of the supervectors |

0:08:37 | and |

0:08:38 | we should note |

0:08:39 | that the supervector at time T |

0:08:41 | S sub T is actually a sum of two components |

0:08:44 | the first component is a speaker mean component |

0:08:47 | it's a component that is speaker dependent and session dependent |

0:08:50 | and the second component is the actual nuisance |

0:08:53 | the |

0:08:54 | due to the intra-speaker variability |

0:08:56 | which is denoted by I sub T |

0:08:58 | and according to our previous assumption this nuisance |

0:09:01 | is |

0:09:03 | normally distributed with zero mean and uh |

0:09:06 | a covariance matrix which is |

0:09:08 | session dependent |

0:09:10 | now |

0:09:10 | our goal is to estimate this covariance matrix and what we do is |

0:09:14 | we consider the difference between two consecutive supervectors |

0:09:18 | so in equation four i define |

0:09:21 | delta sub T which is the difference between |

0:09:23 | the supervectors of two consecutive |

0:09:27 | a subframe |

0:09:29 | superframes |

0:09:30 | and |

0:09:31 | we can see that the difference between two superframes is actually the difference between the |

0:09:36 | uh |

0:09:37 | corresponding intra-speaker variability |

0:09:40 | nuisance that is I sub T minus I sub T |

0:09:44 | minus one |

0:09:45 | plus some mean component |

0:09:47 | which is usually zero |

0:09:49 | because most of the time there's no speaker change |

0:09:52 | but every time there's a speaker change this value is non zero |
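
The decomposition described above can be written compactly. This is a sketch in assumed notation: the talk names the supervector, the nuisance term and the delta, while the symbols $m_t$ for the speaker mean component and $\Sigma$ for the session-dependent covariance are introduced here for illustration.

```latex
s_t = m_t + i_t, \qquad i_t \sim \mathcal{N}(0, \Sigma)

\delta_t = s_t - s_{t-1} = (i_t - i_{t-1}) + (m_t - m_{t-1})
```

The last term, $m_t - m_{t-1}$, is zero whenever superframes $t-1$ and $t$ belong to the same speaker, which is the case the estimation relies on.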

0:09:57 | now what we |

0:09:58 | what we do now is |

0:09:59 | something that is a bit risky but we assume that |

0:10:03 | most of the time there's no speaker change |

0:10:05 | and we can live with the noise that is added by |

0:10:09 | the violation of this assumption and we just neglect |

0:10:12 | the impact of the speaker changes on the covariance matrix of |

0:10:15 | delta sub T |

0:10:17 | so basically we compute |

0:10:18 | the covariance matrix |

0:10:20 | of delta sub T by just |

0:10:22 | eh |

0:10:23 | and |

0:10:24 | throwing away all the uh places where there is |

0:10:27 | a speaker change and just computing |

0:10:29 | the covariance |

0:10:31 | or that of a |

0:10:33 | so so we in in |

0:10:35 | so what we get is that if we want to estimate |

0:10:38 | the covariance of the intra-speaker variability what we have to do we just have to |

0:10:43 | compute |

0:10:44 | the empirical |

0:10:45 | covariance |

0:10:46 | of the delta supervectors |

0:10:48 | that's what we do |
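
A minimal sketch of the estimate described above, assuming the supervectors of one session are available as equal-length lists of floats. Whether the mean is removed and how the sum is normalised are not stated in the talk; here the mean delta is subtracted and the sum is divided by the number of deltas, both of which are assumptions. The name `delta_covariance` is hypothetical, and plain Python lists are used only to keep the toy example dependency-free.

```python
def delta_covariance(supervectors):
    """Empirical covariance of differences between consecutive supervectors."""
    deltas = [[cur_j - prev_j for prev_j, cur_j in zip(prev, cur)]
              for prev, cur in zip(supervectors, supervectors[1:])]
    n, dim = len(deltas), len(deltas[0])
    mean = [sum(d[j] for d in deltas) / n for j in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for d in deltas:
        for j in range(dim):
            for k in range(dim):
                cov[j][k] += (d[j] - mean[j]) * (d[k] - mean[k]) / n
    return cov


# Toy session: all the variation is along the first dimension.
toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
cov = delta_covariance(toy)
```

No speaker labels are used anywhere, which is the point of the approach: deltas that straddle a speaker change are simply tolerated as noise.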

0:10:50 | that |

0:10:50 | that's in the equation |

0:10:52 | at the bottom |

0:10:53 | right |

0:11:03 | okay |

0:11:03 | so once we have done that uh and we get some kind of estimate for the intra |

0:11:07 | speaker variability |

0:11:08 | we can try to compensate for it |

0:11:10 | and the way we do that is that we of course assume that it's of low rank |

0:11:15 | and we can apply pca to find a basis for this low rank subspace |

0:11:21 | in the supervector |

0:11:21 | space |

0:11:22 | now we have two possible compensation methods the first one is |

0:11:26 | nap |

0:11:27 | where we can compensate for the |

0:11:30 | intra-speaker variability |

0:11:32 | in the supervector space |

0:11:34 | and this is only useful if we want to do the diarisation in the |

0:11:38 | supervector |

0:11:38 | space |

0:11:39 | and another alternative is to use feature |

0:11:43 | based nap |

0:11:44 | where we can compensate for the intra-speaker variability in the feature space |

0:11:48 | and then we can actually use any other eh |

0:11:51 | diarisation algorithm |

0:11:52 | on these compensated features |
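
A minimal sketch of the supervector-space compensation, assuming NAP here means removing the components of a supervector that lie in the low-rank subspace found by PCA. The basis vectors are assumed to be orthonormal; `nap_compensate` and the toy values are hypothetical.

```python
def nap_compensate(vector, basis):
    """Remove from `vector` its components along each orthonormal basis direction."""
    compensated = list(vector)
    for direction in basis:
        # Coefficient of the original vector along this nuisance direction.
        coeff = sum(d * v for d, v in zip(direction, vector))
        compensated = [c - coeff * d for c, d in zip(compensated, direction)]
    return compensated


# Toy example: a rank-one nuisance subspace along the first axis.
basis = [[1.0, 0.0, 0.0]]
compensated = nap_compensate([3.0, 2.0, -1.0], basis)
```

In the talk the subspace dimension (the NAP compensation order) is a tunable parameter; the rank-one basis above is only for illustration.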

0:11:57 | okay |

0:11:58 | so what we do is we actually decided to use nap |

0:12:04 | and then |

0:12:05 | to do the rest of the diarization in the supervector |

0:12:08 | domain |

0:12:09 | so the motivation for that button which i'm about |

0:12:11 | why |

0:12:12 | is that |

0:12:13 | if we look for example at |

0:12:15 | the figures at the bottom of the slide |

0:12:17 | in the left hand side we we see an illustration on and off the ice |

0:12:22 | if and |

0:12:23 | right |

0:12:23 | and what we see here |

0:12:24 | are the uh two speakers |

0:12:26 | but this is not real data just an |

0:12:28 | illustration |

0:12:29 | so in blue we have one speaker and in red we have the other speaker and the problem is |

0:12:33 | that |

0:12:34 | if we want to apply a diarisation algorithm |

0:12:36 | of course it doesn't know the colour of the points |

0:12:40 | and it's very hard to separate the two speakers |

0:12:43 | now if we work in the supervector space |

0:12:46 | then what |

0:12:47 | what we see |

0:12:48 | in the middle of the |

0:12:49 | slide |

0:12:50 | we see here that the distributions of the |

0:12:53 | speakers tend to be more unimodal |

0:12:56 | therefore it's much easier to |

0:12:58 | to try to do |

0:12:59 | eh |

0:13:00 | diarisation and this is of course due to the fact that we do some smoothing |

0:13:04 | because a superframe is one second long |

0:13:07 | now if we manage to remove |

0:13:10 | a substantial amount of the intra-speaker variability |

0:13:14 | what we get is uh the illustration on the right hand side of the slide |

0:13:19 | and |

0:13:20 | even if we don't know the two colours if we don't know that |

0:13:23 | some of the points are blue and some of the points are red |

0:13:26 | still |

0:13:27 | it's reasonable that we may |

0:13:29 | find a solution of how to separate these speakers |

0:13:33 | because the separation is very easy |

0:13:38 | okay this is separation |

0:13:43 | okay so an overview of the uh of the algorithm is as follows first of course we |

0:13:47 | parameterise the speech into a time series of supervectors |

0:13:52 | and we compensate for intra-speaker variability as i have shown already |

0:13:58 | and then we use the viterbi algorithm to do segmentation |

0:14:02 | now in order to do viterbi |

0:14:03 | segmentation we just assign a single state |

0:14:06 | for each speaker |

0:14:07 | actually we also use um a |

0:14:11 | length constraint so |

0:14:13 | i'm |

0:14:13 | there is some simplification here |

0:14:15 | but basically we're using one state per speaker |

0:14:18 | and all we have to do is to be able to |

0:14:20 | to estimate transition probabilities and output probabilities |

0:14:24 | now transition probabilities are very easy to estimate |

0:14:29 | we can just take a very small dataset and then estimate |

0:14:33 | the statistics |

0:14:35 | of the length of a speaker turn |

0:14:38 | so if we know the mean speaker turn length we can estimate transition probabilities |
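
A minimal sketch of how a mean turn length could be turned into transition probabilities, assuming geometrically distributed turn durations so that the expected time spent in a state is the reciprocal of the probability of leaving it. This mapping is an assumption rather than the formula used in the talk, the length constraint mentioned above is ignored, and the 2.5 second mean turn is a made-up value. The name `transition_matrix` is hypothetical.

```python
def transition_matrix(mean_turn_seconds, superframes_per_second=10.0):
    """2x2 HMM transition matrix from the mean speaker-turn duration.

    Assumes geometric turn durations, so the expected number of superframes
    spent in a state is 1 / p_change.
    """
    mean_turn_len = mean_turn_seconds * superframes_per_second
    p_change = 1.0 / mean_turn_len
    p_stay = 1.0 - p_change
    return [[p_stay, p_change],
            [p_change, p_stay]]


# Hypothetical mean turn of 2.5 seconds at ten superframes per second.
trans = transition_matrix(mean_turn_seconds=2.5)
```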

0:14:43 | now if you want to |

0:14:44 | the tricky part is to be able to estimate the output probabilities |

0:14:48 | the probability of a |

0:14:51 | speaker |

0:14:52 | of a supervector |

0:14:53 | at time T |

0:14:54 | given speaker I |

0:14:55 | and we're not doing this directly we do it |

0:14:58 | indirectly and i will talk about this in the next slide |

0:15:04 | so let's say that we can do this in some way |

0:15:07 | and |

0:15:08 | so |

0:15:08 | we just apply viterbi segmentation and we can come up with a segmentation |
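
A minimal sketch of the two-state Viterbi segmentation, assuming one state per speaker, a symmetric speaker-change probability, and per-superframe log-likelihood ratios that are already available. The length constraint mentioned in the talk is omitted, speakers are indexed 0 and 1, and the function name and toy values are hypothetical.

```python
import math


def viterbi_two_speakers(llrs, p_change):
    """Most likely speaker sequence (0 or 1) for a list of log-likelihood ratios.

    llrs[t] is log p(s_t | speaker 0) - log p(s_t | speaker 1). Only the
    difference matters, so the emission scores are taken as +llr/2 and -llr/2.
    """
    log_stay, log_change = math.log(1.0 - p_change), math.log(p_change)
    score = [0.5 * llrs[0], -0.5 * llrs[0]]   # uniform prior on the first state
    backpointers = []
    for llr in llrs[1:]:
        emission = (0.5 * llr, -0.5 * llr)
        new_score, pointers = [], []
        for state in (0, 1):
            candidates = [score[prev] + (log_stay if prev == state else log_change)
                          for prev in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + emission[state])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy scores: the weak negative value at index 2 sits inside a speaker-0 turn.
toy_llrs = [2.0, 1.5, -0.2, 1.8, -2.0, -1.5, -2.2]
path = viterbi_two_speakers(toy_llrs, p_change=0.04)
```

The transition penalty is what smooths over isolated weak scores, so a single superframe that slightly favours the other speaker does not produce a spurious turn.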

0:15:13 | and then we run again the |

0:15:16 | we refine the diarization using iterative viterbi resegmentation |

0:15:20 | so we just come back to the original frame based features that is we train the |

0:15:25 | speaker uh H M M |

0:15:27 | using our previous segmentation and we rerun the viterbi segmentation but now the resolution is better |

0:15:32 | because we're working at a frame rate of a hundred |

0:15:35 | per second |

0:15:36 | and we can do this for a |

0:15:38 | couple of iterations and that's the end of the algorithm |

0:15:42 | so what i still have to talk about is how to |

0:15:45 | estimate the output probabilities |

0:15:47 | the probability of a supervector at time T |

0:15:50 | given speaker one |

0:15:54 | okay so |

0:15:55 | that would |

0:15:56 | is the following |

0:15:57 | what we do is and i will give some motivation after this slide |

0:16:01 | so what we do is we find the largest |

0:16:03 | eigenvector |

0:16:04 | so in |

0:16:05 | or more precisely we find the largest |

0:16:07 | eigenvalue and we take the corresponding eigenvector |

0:16:11 | and this |

0:16:13 | this eigenvector is taken from the covariance matrix |

0:16:16 | of all the supervectors |

0:16:17 | so what we do we take the uh compensated supervectors from the session |

0:16:22 | and we just compute the covariance matrix |

0:16:24 | of these supervectors |

0:16:25 | and we take the first eigenvector |

0:16:28 | now what we do we take each compensated supervector |

0:16:32 | and project it onto the largest eigenvector |

0:16:36 | and what we get for each time T we get |

0:16:40 | P sub T P sub T is the projection |

0:16:42 | of supervector S sub T |

0:16:44 | onto the largest eh |

0:16:46 | eigenvector |

0:16:47 | and the |

0:16:48 | what i'm not going to convince you of now but the details are in the paper and we |

0:16:52 | try to give some intuition |

0:16:54 | is that actually |

0:16:55 | the log likelihood that we are looking for the log likelihood of a supervector S sub T |

0:16:59 | given speaker one |

0:17:01 | and |

0:17:02 | divided by the log likelihood of |

0:17:05 | is |

0:17:06 | you know the probability of |

0:17:07 | the supervector given speaker two is actually a linear |

0:17:11 | function |

0:17:12 | of this projection |

0:17:13 | that we have found |

0:17:14 | so what we have to do is to be able to somehow |

0:17:18 | estimate parameters A and B |

0:17:20 | and if we estimate A and B we can approximate the likelihood |

0:17:24 | we're looking for in order to plug it into the viterbi |

0:17:28 | by just taking this projection |

0:17:30 | now B is actually |

0:17:32 | related to the dominance of the speaker if we have two speakers and one of them is more |

0:17:38 | dominant then B will be uh either positive or negative |

0:17:41 | and in our experiments we just assumed B to be zero therefore we assume that |

0:17:47 | the speakers are equally dominant |

0:17:49 | in the uh |

0:17:51 | conversation which is of course not true but uh we hope to |

0:17:55 | accommodate this uh problem by using a final viterbi resegmentation |

0:18:00 | and A is assumed to be a corpus dependent constant and we just have to estimate a |

0:18:06 | single uh |

0:18:07 | parameter we estimate it from a small dataset |

0:18:10 | and |

0:18:10 | more details are in the paper |
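
A minimal sketch of the projection step, assuming the compensated supervectors of one session are given as lists of floats. The leading eigenvector is found by power iteration, the supervectors are centred on the session mean before projection (an assumption that fits the choice of B equal to zero), and the log-likelihood ratio is approximated as A times the projection plus B. The function names, the value of A and the toy data are hypothetical.

```python
import math


def top_eigenvector(cov, iterations=200):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    dim = len(cov)
    v = [1.0 + 0.1 * j for j in range(dim)]   # arbitrary non-zero start
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(dim)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                        # zero matrix: no dominant direction
            break
        v = [x / norm for x in w]
    return v


def projection_llrs(supervectors, a, b=0.0):
    """Approximate per-superframe log-likelihood ratios as a * p_t + b."""
    n, dim = len(supervectors), len(supervectors[0])
    mean = [sum(s[j] for s in supervectors) / n for j in range(dim)]
    centred = [[s[j] - mean[j] for j in range(dim)] for s in supervectors]
    cov = [[sum(c[j] * c[k] for c in centred) / n for k in range(dim)]
           for j in range(dim)]
    direction = top_eigenvector(cov)
    projections = [sum(c[j] * direction[j] for j in range(dim)) for c in centred]
    return [a * p + b for p in projections]


# Toy session: two well separated speakers along the first axis.
toy = [[-2.0, 0.1], [-2.0, -0.1], [2.0, 0.1], [2.0, -0.1]]
llrs = projection_llrs(toy, a=1.5)
```

The sign of an eigenvector is arbitrary, so which speaker ends up on the positive side is arbitrary as well; only the separation between the two groups is meaningful.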

0:18:14 | but just to get the motivation let's say |

0:18:17 | we have a very easy problem |

0:18:20 | uh in this case the red |

0:18:22 | is one speaker and uh |

0:18:24 | and the blue is the second speaker |

0:18:26 | and this is uh actually in the supervector space |

0:18:29 | right |

0:18:30 | after we have done the intra |

0:18:32 | speaker compensation |

0:18:33 | it it does become a bit of computation |

0:18:35 | so if we just take all the blue and the red eh |

0:18:39 | points |

0:18:39 | and we just compute the covariance matrix and take the first eigenvector what we get we get |

0:18:44 | the black arrow |

0:18:46 | okay so if we take the black arrow and |

0:18:48 | just project all the points on that arrow we get the distribution on the right hand side of the |

0:18:53 | slide |

0:18:53 | and if we just decide |

0:18:55 | according to that projection on the black arrow if we just |

0:18:59 | compute |

0:19:00 | the decision boundary |

0:19:01 | we see that |

0:19:02 | that |

0:19:02 | this decision boundary is actually |

0:19:07 | exactly the optimal decision boundary we can compute the optimal decision boundary because |

0:19:12 | we know the true distribution of the red and the blue points |

0:19:15 | because it's artificial data here |

0:19:17 | in this slide |

0:19:19 | so actually this algorithm works in this simple case |

0:19:23 | now as we try to increase the amount of intra-speaker variability |

0:19:28 | let's say that we didn't manage to |

0:19:30 | remove all or almost all the intra-speaker variability |

0:19:33 | so |

0:19:34 | as we see |

0:19:35 | as we increase the amount of the |

0:19:37 | intra-speaker variability |

0:19:40 | we see that uh |

0:19:42 | the decision boundary that we get is not exactly the optimal decision boundary and |

0:19:46 | actually this algorithm will fail at some point |

0:19:50 | when there is too much intra-session intra-speaker variability |

0:19:54 | so the hope is that |

0:19:55 | in our algorithm that |

0:19:58 | is supposed to compensate for intra-speaker variability the hope is that on the data we're going to |

0:20:02 | work on |

0:20:03 | we actually manage to remove enough |

0:20:05 | of the variability in order to let the algorithm work |

0:20:10 | oh |

0:20:11 | okay so now yeah |

0:20:13 | i will report experiments |

0:20:15 | yeah |

0:20:15 | we used a small dataset |

0:20:19 | from the nist two thousand five |

0:20:21 | sre as a development set we used one hundred sessions |

0:20:26 | actually i don't think that we actually |

0:20:28 | need |

0:20:29 | so much data for development because we actually |

0:20:32 | estimate only a handful of parameters |

0:20:34 | and this is used to tune the hmm transition parameters |

0:20:38 | and uh the A parameter |

0:20:41 | which is used for the loglikelihood calibration |

0:20:44 | and we used the nist two thousand five core |

0:20:48 | test set |

0:20:49 | for evaluation |

0:20:51 | what we did is we took the stereo |

0:20:53 | phone calls and we just summed |

0:20:56 | artificially |

0:20:57 | the two sides in order to get |

0:21:00 | two-wire data |

0:21:02 | and we take the ground truth we derive it from the asr transcripts |

0:21:07 | provided by nist |

0:21:09 | we report speaker error rate eh |

0:21:12 | our error measure |

0:21:14 | our error measure is the speaker error rate |

0:21:17 | and |

0:21:17 | we use the standard nist scoring tool |

0:21:20 | this is |

0:21:21 | correct |

0:21:24 | okay |

0:21:24 | and just in order to be able to compare our results to different systems we use a baseline which |

0:21:30 | is bic based |

0:21:31 | the |

0:21:32 | and inspired by the limsi two thousand five system |

0:21:36 | oh |

0:21:36 | it's based on detection of speaker changes using bic then iterations of |

0:21:42 | viterbi resegmentation along with bic based clustering |

0:21:46 | followed finally by a stage of viterbi resegmentation |

0:21:52 | okay so these are the main results |

0:21:55 | and on our data we achieved |

0:21:59 | we |

0:22:01 | we actually had uh a |

0:22:03 | speaker error rate of six point one |

0:22:05 | for the baseline |

0:22:06 | and when we just used the supervector based system without any intra-speaker variability compensation |

0:22:13 | we got four point eight |

0:22:15 | and when we also used intra-speaker variability compensation we got |

0:22:20 | two point |

0:22:21 | eight |

0:22:23 | and in this |

0:22:24 | experiment the supervector gmm order is sixty four and uh the nap |

0:22:27 | compensation order is |

0:22:29 | five |

0:22:34 | we ran some experiments in order to try to improve on this we actually didn't manage to improve |

0:22:39 | but |

0:22:39 | we tried to see |

0:22:41 | if we just |

0:22:42 | change the front end |

0:22:44 | what's going to happen so |

0:22:46 | and we found out that feature warping actually degrades performance |

0:22:49 | and this is the |

0:22:51 | this has already been |

0:22:54 | explained by the fact that the channel is actually something that we want to exploit when we do |

0:22:58 | diarisation |

0:22:59 | because |

0:23:00 | it may be the case that different speakers are on different channels |

0:23:03 | so we don't want to remove channel |

0:23:05 | information |

0:23:07 | and i think the other end |

0:23:09 | see also slightly degraded |

0:23:11 | performance |

0:23:12 | and |

0:23:13 | we also wanted to check |

0:23:15 | how much |

0:23:18 | better the error rate can be if we had a perfect |

0:23:22 | eh |

0:23:22 | voice activity detector and we actually improved the |

0:23:26 | error rate to two point six |

0:23:31 | some more experiments |

0:23:32 | to |

0:23:33 | eh |

0:23:34 | check the sensitivity of |

0:23:37 | the system to |

0:23:39 | the gmm order |

0:23:42 | and the nap compensation order so basically uh we see that for |

0:23:46 | gmm order |

0:23:47 | the best order is sixty four and one twenty eight i didn't try to increase the eh |

0:23:52 | i don't know what would happen if i would |

0:23:54 | increase the gmm order |

0:23:55 | and for nap and we see that it's quite a bit of it i think |

0:23:58 | from fifteen uh we get quite the we get already any for the uh uh |

0:24:03 | speaker error rate of |

0:24:03 | three |

0:24:04 | but |

0:24:05 | but performance |

0:24:06 | something around |

0:24:07 | five |

0:24:11 | okay so finally before i am i |

0:24:13 | and |

0:24:14 | one of the motivations for this work was to use this diarisation for |

0:24:19 | speaker I D in um |

0:24:22 | in |

0:24:23 | two-wire data |

0:24:25 | so we tried to see whether we get improvements using this diarisation |

0:24:30 | so |

0:24:30 | what we did we tried three different diarisation |

0:24:34 | systems one of them was a reference |

0:24:37 | diarisation the second one was the baseline diarization |

0:24:41 | the third was the proposed |

0:24:42 | diarisation |

0:24:44 | and we tried to apply diarisation either only on the test |

0:24:48 | data or on the test data and also |

0:24:51 | and |

0:24:52 | the train data because |

0:24:54 | one of our sponsors actually |

0:24:56 | has this problem so also the training data is uh summed |

0:25:01 | therefore we |

0:25:02 | we wanted to check |

0:25:03 | what |

0:25:05 | the performance also on training on |

0:25:07 | summed data |

0:25:09 | and we used |

0:25:10 | a speaker I D system which is nap based |

0:25:12 | and |

0:25:13 | that achieves an equal error rate of five point two on |

0:25:17 | the nist two thousand five core female data set |

0:25:21 | and what we concluded is that when only the test |

0:25:25 | data is summed |

0:25:26 | the proposed method achieves performance that is equivalent to using manual diarization |

0:25:32 | and when both train and test |

0:25:35 | data are summed |

0:25:36 | there is a degradation compared to the reference |

0:25:39 | diarisation |

0:25:41 | but however the degradation that we get for the baseline system you have |

0:25:46 | when we use |

0:25:47 | the proposed |

0:25:51 | okay so to summarise |

0:25:53 | we have described an algorithm for unsupervised estimation of |

0:25:58 | uh intra-session intra-speaker variability |

0:26:01 | and we also |

0:26:03 | described two methods |

0:26:04 | to compensate for this variability one of them is uh using nap |

0:26:10 | in supervector space and also |

0:26:12 | i didn't report any results but we also ran experiments using feature space nap |

0:26:17 | in the mfcc |

0:26:17 | space |

0:26:18 | and |

0:26:19 | using it for |

0:26:21 | two-speaker diarization |

0:26:22 | using the gmm supervectors we got a speaker error rate of |

0:26:26 | four point eight |

0:26:28 | and if we also apply |

0:26:30 | intra-speaker variability compensation |

0:26:33 | within the supervector based two-speaker diarization system |

0:26:37 | we get a further improvement |

0:26:38 | to two point eight |

0:26:40 | and finally the whole system |

0:26:43 | improves speaker I D accuracy |

0:26:45 | for summed audio especially for the summed training case |

0:26:50 | now future work would be first of all to apply feature space nap |

0:26:56 | for intra-speaker variability compensation |

0:26:59 | and then |

0:27:01 | to use |

0:27:01 | different diarisation systems |

0:27:04 | so this can be interesting |

0:27:06 | and we did try feature space nap but then we just |

0:27:10 | stayed with |

0:27:11 | we just re-estimated supervectors so all we did was in the supervector |

0:27:15 | space |

0:27:16 | and of course we should try to extend this work to multiple speaker diarization |

0:27:22 | possibly in broadcast news or in meetings |

0:27:24 | and of course to integrate other methods |

0:27:28 | such as interspeaker variability modelling |

0:27:30 | which were proposed lately and |

0:27:33 | to integrate them |

0:27:34 | with this uh this approach |

0:27:50 | first of all congratulations for your work |

0:27:53 | um |

0:27:55 | you restricted yourself to two-speaker diarization |

0:27:58 | yeah |

0:27:59 | just diarisation something like uh diarisation without model selection |

0:28:03 | and |

0:28:04 | in my opinion the beauty of diarisation is model selection but anyway |

0:28:08 | um |

0:28:09 | you claim that with this |

0:28:11 | so later |

0:28:12 | subspace |

0:28:13 | this projection |

0:28:14 | you somehow gaussianise |

0:28:16 | the data |

0:28:17 | right |

0:28:19 | yeah |

0:28:21 | i think that first of all by using the supervector approach it's more convenient |

0:28:26 | to do |

0:28:27 | diarisation because in some sense yes |

0:28:30 | your distribution tends to be more unimodal and also |

0:28:37 | and |

0:28:38 | i claim that you can use similar techniques that are used for speaker recognition such as |

0:28:43 | inter-session variability modelling you can use |

0:28:46 | similar techniques |

0:28:47 | in this domain |

0:28:50 | uh you there is or something |

0:28:51 | forget about two speaker |

0:28:53 | you |

0:28:54 | re a model selection task |

0:28:56 | uh one question would be something like |

0:28:58 | so you |

0:28:59 | you can see the rascals and image |

0:29:02 | let's say the emission this |

0:29:03 | so space |

0:29:04 | fine |

0:29:05 | projection |

0:29:06 | um since you |

0:29:08 | she is |

0:29:08 | these would be a gaussian |

0:29:10 | i blame |

0:29:11 | then traditional uh model selection techniques |

0:29:14 | uh like |

0:29:15 | like the bayesian information criterion |

0:29:16 | okay |

0:29:17 | we'll uh |

0:29:18 | be applicable without these student |

0:29:20 | or um |

0:29:21 | hopefully it |

0:29:22 | yeah because |

0:29:23 | misspecification of the model |

0:29:25 | when you use it with |

0:29:27 | oh it's obvious when you use it in the M F C C |

0:29:29 | domain which is clearly multimodal |

0:29:31 | okay if you consider it now as a gaussian emission |

0:29:36 | then probably the bic |

0:29:38 | will give you answers |

0:29:40 | in these subspace |

0:29:42 | about |

0:29:42 | the number of speakers as well |

0:29:44 | number of speakers possibly |

0:29:46 | oh |

0:29:46 | he into an hmm frame |

0:29:49 | uh so |

0:29:50 | uh we should consider this |

0:29:53 | definitely |

0:30:03 | yeah no worse than about the summit some of the condition |

0:30:06 | and |

0:30:07 | how |

0:30:07 | you |

0:30:08 | can you define it in the training set the |

0:30:11 | the sum |

0:30:12 | condition because the |

0:30:13 | if you |

0:30:14 | the make that position in the training set the you don't know what is the motor okay |

0:30:19 | so uh |

0:30:21 | the method that was decided with |

0:30:24 | some sponsor is that |

0:30:26 | eh |

0:30:28 | you do diarisation and then you just you have the reference |

0:30:32 | segmentation you just decide which |

0:30:35 | side |

0:30:36 | has more overlap with |

0:30:39 | with the um |

0:30:41 | the segmentation that you got |

0:30:43 | so you know you know what you know what what portions of the speech are actually |

0:30:48 | eh |

0:30:49 | yeah |

0:30:49 | you need to change |

0:30:50 | on |

0:30:51 | okay so because you have the four-wire |

0:30:53 | the original four-wire data and you have a segmentation |

0:30:56 | you got |

0:30:57 | you get uh two clusters |

0:30:59 | and then you just compare these clusters |

0:31:01 | it automatically to the uh to the reference to decide which cluster is more uh |

0:31:07 | more correct |

0:31:10 | and that's something that actually the |

0:31:12 | if you if you think about it |

0:31:14 | if someone trains |

0:31:15 | on summed data one way of assisting is that you can apply automatic diarisation and just |

0:31:21 | so |

0:31:22 | and just let |

0:31:23 | the user see the two clusters and he'll just decide which cluster is the right cluster |

0:31:28 | that that's the motivation |

0:31:31 | thank you |

0:31:39 | yeah |

0:31:46 | oh |