Okay, so this talk is about unsupervised compensation of intra-session intra-speaker variability for speaker diarization. The outline is the following: I will first define the notion of intra-session intra-speaker variability, which is actually very close to the concept of segment-to-segment variability that was described in the previous presentation. I will talk about how to estimate this variability in an unsupervised manner, and how to compensate for it. Then I will propose a speaker diarization system which is supervector-based and which also utilizes this compensation of intra-session intra-speaker variability, which from now on I will just call intra-speaker variability. Finally I will report experiments, and I will summarize and talk about future work. So what is intra-speaker variability? It is the reason why speaker diarization is not an easy task: if there were no intra-speaker variability, it would be very easy to perform this task. When I talk about intra-speaker variability, I mean phonetic variability, energy or loudness variability within the speech of a single speaker, acoustic or speaker-intrinsic variability, speech-rate variation, and even non-speech variability, which is due to the VAD: sometimes the VAD makes more errors and sometimes fewer errors, and this also causes variability that can be harmful to the diarization algorithm. The relevant tasks for this kind of variability are, first of all, of course, speaker diarization, but also speaker recognition with short training and testing sessions. So the idea now is first to propose a proper generative model to handle this kind of variability. I will start with the classical GMM model, where a speaker is modelled by a GMM and the frames are
generated independently according to this GMM. In 2005 we proposed a modified model that also accounts for intersession variability. According to this model, a speaker is not modelled by a single GMM; rather, it is modelled by a PDF over the session space: every session is modelled by a GMM, and frames are generated according to this session GMM. Now what we want to do is to augment this model again and propose a hierarchical model, where a speaker in this case is a PDF over the session space, a session is a PDF over the segment space, a segment is modelled by a GMM, and the frames are generated independently according to the segment GMM. If we want to visualize this in the GMM supervector space, we can look at this slide: we see four supervectors, and these four supervectors correspond to four different segments of the same speaker, recorded in a particular session, session A; this can be modelled by a distribution. Now, if the same speaker talks in a different session, then we get another distribution, and we can model the entire set of sessions with a single distribution, which will be the speaker-A distribution; we can do the same for speaker B. So according to this generative model, we assume that each supervector, for a particular speaker, a particular session and a particular segment, distributes normally with some mean and covariance, where the mean is speaker- and session-dependent and the covariance is also speaker- and session-dependent. This is quite a general model; we will then make further assumptions in order to make it practical to actually use this model.
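To pin the model down in symbols, the hierarchical generative model just described can be sketched as below; the notation is mine and not taken from the slides.

```latex
% Hedged sketch of the hierarchical generative model (notation assumed).
% A supervector for speaker k, session s, segment (superframe) t is drawn as
\[
  \mathbf{s}_{k,s,t} \sim \mathcal{N}\!\left(\boldsymbol{\mu}_{k,s},\;\boldsymbol{\Sigma}_{k,s}\right)
\]
% with a speaker- and session-dependent mean and covariance; the
% simplification made later in the talk keeps the covariance
% session-dependent only.
```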
Back in 2007, in a paper called "Trainable Speaker Diarization" at Interspeech, we used this generative model to do supervised intra-speaker variability modelling for speaker diarization, in the context of two-wire diarization. In that case we assumed that the covariance of the intra-speaker variability is speaker- and session-independent, so we have a single covariance matrix, and we estimated it from a labelled, segmented development set, so it is quite trivial to estimate. Once we have estimated this covariance matrix, we can develop a metric that is induced from it and use this metric to actually perform the diarization; the techniques we used were PCA and WCCN. So this was in 2007, but the problem with this technique is that we must have a labelled development set, and this development set must be labelled according to speaker turns, and sometimes this can be problematic. So in this paper, what we do instead is assume that the covariance matrix is session-dependent only; it is not global as we had before, and with the algorithm described in the next slide, we actually do not need any labelled data: there is no training process, and we estimate this session-dependent covariance matrix on the fly, on a per-session basis. So how do we do that? The first stage is GMM-supervector parameterization. We take the session to which we want to apply diarization, we first extract features such as MFCC features at the standard frame rate, and then we estimate a session-dependent UBM, which is a GMM of low order (a typical order is 64), and we do that using the EM algorithm.
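As a concrete sketch of the session-UBM step, a minimal diagonal-covariance EM fit is shown below. The low GMM order and the per-session EM fit follow the talk; the random stand-in data, the dimensions, the iteration count, and the diagonal-covariance choice are my own placeholders.

```python
import numpy as np

def fit_diag_gmm(x, k, iters=25, seed=0):
    """Minimal EM for a diagonal-covariance GMM (the session UBM)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    mu = x[rng.choice(n, k, replace=False)]          # init means from data
    var = np.tile(x.var(axis=0), (k, 1))             # init shared variances
    w = np.full(k, 1.0 / k)                          # uniform weights
    for _ in range(iters):
        # E-step: log responsibility of each diagonal Gaussian per frame
        logp = (-0.5 * (((x[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = r.sum(axis=0) + 1e-10
        w = nk / n
        mu = (r.T @ x) / nk[:, None]
        var = (r.T @ (x ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

rng = np.random.default_rng(1)
frames = rng.normal(size=(1000, 13))    # stand-in for session MFCC frames
w, mu, var = fit_diag_gmm(frames, k=8)  # talk uses low orders such as 64
```

In practice one would run this once per session on the real MFCC stream; the fitted means then serve as the starting point for the per-superframe adaptation described next.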
After we do that, we take the speech signal and divide it into overlapping superframes, at a rate of ten per second: we define a chunk of one second as a superframe, and we use ninety percent overlap. For each superframe we estimate a GMM using standard MAP adaptation from the UBM we have just estimated, and then we take the parameters of this GMM, concatenate them into a supervector, and this is the representation for that superframe. So at the end of this process the session is parameterized by a sequence of supervectors. The next phase is to estimate the intra-speaker covariance matrix. We are given this sequence of supervectors, and we note that the supervector at time t, s_t, is actually a sum of two components: the first component is a speaker mean component, which is speaker- and session-dependent, and the second component is the actual instance of the intra-speaker variability, which is denoted by i_t. According to our previous assumption, i_t distributes normally with zero mean and a covariance matrix which is session-dependent. Our goal is to estimate this covariance matrix, and what we do is consider the difference between two consecutive supervectors. In the equation I define delta_t, which is the difference between the supervectors of two consecutive superframes, and we can see that the difference between two superframes is actually the difference between the corresponding intra-speaker variability instances, i_t minus i_{t-1}, plus a component which is usually zero, because most of the time there is no speaker change; every time there is a speaker change, this value is non-zero. Now, what we do next is something a bit risky: we assume that most of the time there is no speaker change, and we can live with the noise that is added by the violation of this assumption.
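The estimator just motivated (difference consecutive supervectors, neglect the rare speaker-change term, take the empirical covariance) can be sketched as follows. The 1/2 factor assumes successive i_t are independent; whether the talk's own equation carries that factor is not shown here, but since the compensation step only uses the eigenvectors, the overall scale does not matter.

```python
import numpy as np

def intra_speaker_cov(S):
    """Estimate the intra-speaker covariance from a sequence of
    superframe supervectors S (n_superframes x dim).

    delta_t = s_t - s_{t-1} = i_t - i_{t-1} almost everywhere, since
    speaker changes are rare; neglecting them, Cov(delta) ~ 2 Cov(i),
    so half the empirical covariance of the deltas is the estimate."""
    deltas = np.diff(S, axis=0)                 # delta_t = s_t - s_{t-1}
    return 0.5 * np.cov(deltas, rowvar=False)   # empirical covariance

rng = np.random.default_rng(0)
S = rng.normal(size=(500, 6))   # stand-in supervector sequence
C = intra_speaker_cov(S)
```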
We just neglect the impact of the speaker changes on the covariance matrix of delta_t: basically, we compute the covariance matrix of the deltas by throwing away the speaker-change term and just computing the covariance of the deltas themselves. So what we get is that if we want to estimate the covariance of the intra-speaker variability, all we actually have to do is compute the empirical covariance of the delta supervectors; that is what is done in the equation at the bottom right. Once we have done that and we have some kind of estimate for the intra-speaker variability, we can try to compensate for it. The way we do that is that we assume, of course, that it is of low rank, and we apply PCA to find a basis for this low-rank subspace in the supervector space. Now we have two possible compensation approaches. The first one is NAP: we can compensate for the intra-speaker variability in the supervector space, and this is only useful if we want to do the diarization itself in the supervector space. The other alternative is feature-space NAP: we can compensate for the intra-speaker variability in the feature space, and then we can actually use any other diarization algorithm on these compensated features. What we actually decided is to use NAP and to do the rest of the diarization in the supervector domain. The motivation for this is the following: if we look, for example, at the figures at the bottom of the slide, on the left-hand side we see an illustration in the frame domain. We see two speakers here (this is not real data, I just created it for illustration), and the problem is that a diarization algorithm of course does not know the colours of the points, and it is very hard to separate the two speakers.
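A supervector-space NAP sketch under these definitions: take the top eigenvectors of the estimated intra-speaker covariance (the PCA basis of the low-rank subspace) and remove that subspace from every supervector. The rank argument plays the role of the compensation order (5 in the talk's experiments); the data here is synthetic.

```python
import numpy as np

def nap_compensate(S, cov, rank):
    """Remove the top-`rank` eigen-directions of `cov` from each row of S:
    s <- (I - V V^T) s, with V the leading eigenvectors of cov."""
    _, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    V = vecs[:, -rank:]               # basis of the nuisance subspace
    return S - (S @ V) @ V.T          # project out the subspace

rng = np.random.default_rng(0)
S = rng.normal(size=(200, 6))         # stand-in compensated-input vectors
C = np.cov(S, rowvar=False)           # stand-in intra-speaker covariance
S_comp = nap_compensate(S, C, rank=2)
```

After compensation, the variance of S_comp along the removed directions is zero, which is exactly the effect the diarization step relies on.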
If we work in the supervector space, then, as we see in the middle of the slide, the distributions of the speakers tend to be more unimodal, and therefore it is much easier to attempt diarization; this is of course due to the fact that we do some smoothing, because a superframe is one second long. Now, if we manage to remove a substantial amount of the intra-speaker variability, what we get is the illustration on the right-hand side of the slide, and even if we do not know the two colours, even if we do not know which sample points belong to which speaker, it is still reasonable that we may find a solution for how to separate these speakers, because the clustering problem is now very easy: there is good separation. So the outline of the algorithm is the following. First, of course, we parameterize the speech to obtain a time series of supervectors, and we compensate for intra-speaker variability as I have already shown; then we use the Viterbi algorithm to do segmentation. To do the Viterbi segmentation we just assign a single state to each speaker; actually, we also use a minimum-length constraint, so there is some simplification here, but basically we are using one state per speaker, and all we have to do is to be able to estimate transition probabilities and output probabilities. The transition probabilities are very easy to estimate: we can just take a very small dataset and estimate the statistics of the length of a speaker turn, and if we know the mean speaker-turn length we can estimate the transition probabilities. The tricky part is to be able to estimate the output probabilities: the probability of a supervector at time t given speaker i.
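For the transition probabilities, a one-state-per-speaker HMM with a geometric duration model gives a simple closed form. The 1 - 1/D relation below is the standard HMM self-loop identity rather than a formula quoted from the talk, and the mean turn length is a made-up number; only the superframe rate of ten per second comes from the talk.

```python
# A self-loop state with stay probability p_stay has expected duration
# 1 / (1 - p_stay), so a mean speaker turn of D superframes implies
# p_stay = 1 - 1/D.
mean_turn_sec = 3.0            # hypothetical mean speaker-turn length
superframe_rate = 10           # superframes per second, from the talk
D = mean_turn_sec * superframe_rate
p_stay = 1.0 - 1.0 / D         # same-speaker (self-loop) probability
p_switch = 1.0 / D             # speaker-change probability
```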
We do not estimate this directly; we do it indirectly, and I will describe this in the next slide. So let us say that we can do this in some way; then we just apply Viterbi segmentation and come up with a first segmentation, and then we refine the diarization using iterative Viterbi resegmentation: we go back to the original frame-based features, we train the speaker HMMs using our previous segmentation, and we rerun the Viterbi segmentation, but now the resolution is better, because we are working at a frame rate of a hundred per second. We can do this for a couple of iterations, and that is the end of the algorithm. So what I still have to talk about is how to estimate the output probabilities, the probability of a supervector at time t given speaker i. The method is the following, and I will give some motivation after this slide. More precisely, we find the largest eigenvalue and we take the corresponding eigenvector, and this eigenvector is taken from the covariance matrix of all the supervectors: we take the compensated supervectors from the session, we compute the covariance matrix of these supervectors, and we take the first eigenvector. Then we take each compensated supervector and project it onto this largest eigenvector, and what we get, for each time t, is p_t: p_t is the projection of supervector s_t onto the largest eigenvector. Now, I am not going to convince you here (the details are in the paper), but we can try to get some intuition that the log-likelihood ratio we are looking for, the likelihood of a supervector s_t given speaker one divided by the likelihood of the supervector given speaker two, is, in the log domain, actually a linear function of this projection that we have found.
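The projection-based score can be sketched as below. The mean removal before projecting and the synthetic two-speaker data are my own choices; a and b play the roles described in the talk (a corpus-tuned, b tied to speaker dominance and set to zero).

```python
import numpy as np

def projection_llr(S, a=1.0, b=0.0):
    """Project each compensated supervector onto the top eigenvector of
    the session covariance; a * p_t + b approximates the per-superframe
    log-likelihood ratio between the two speakers."""
    C = np.cov(S, rowvar=False)
    top = np.linalg.eigh(C)[1][:, -1]   # largest-eigenvalue eigenvector
    p = (S - S.mean(axis=0)) @ top      # p_t for every superframe
    return a * p + b

rng = np.random.default_rng(0)
S = np.vstack([rng.normal(-1.0, 0.3, (50, 4)),   # synthetic speaker 1
               rng.normal(+1.0, 0.3, (50, 4))])  # synthetic speaker 2
llr = projection_llr(S)
```

On well-separated data like this, the sign of the score splits the two synthetic speakers, which is the "black arrow" picture discussed next.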
So what we have to do is to somehow estimate the parameters a and b, and if we estimate a and b, we can approximate the likelihood ratio we are looking for, in order to plug it into the Viterbi, by just taking this projection. Now, b is actually related to the dominance of each speaker: if we have two speakers and one of them is more dominant, then b will be either positive or negative. In our experiments we just assumed b to be zero; therefore we assume that the speakers are equally dominant. This assumption is of course not always true, but we hope to accommodate this problem by using the final Viterbi resegmentation. The parameter a is assumed to be corpus-dependent and constant, so we just have to estimate a single parameter, which we tune on a small development set; the derivations are in the paper. Just to get some motivation, let us say we have a very easy problem: in this case the red points are one speaker and the blue points are the second speaker, and this is actually in the supervector space, right after we have done the intra-speaker compensation. If we just take all the blue and the red points together, compute the covariance matrix and take the first eigenvector, what we get is the black arrow. If we take the black arrow and project all the points onto it, we get the distribution on the right-hand side of the slide, and if we just decide according to the projection onto the black arrow, if we just compute the decision boundary, we see that it is actually exactly the optimal decision boundary; we can compute the optimal decision boundary because we know the true distributions of the red and the blue points, because
it is artificial data in this slide. So this algorithm works in this simple case. Now, as we increase the amount of intra-speaker variability, simulating the case where we did not manage to remove almost all of the intra-speaker variability, we see that the decision boundary we obtain is no longer exactly the optimal decision boundary, and this algorithm can actually fail when too much intra-speaker variability remains. So the hope is that our algorithm, which is supposed to compensate for intra-speaker variability, will, on the data we are going to work on, actually manage to remove enough of the variability to let this method work. Okay, so now I will report experiments. We used a small development set from the NIST 2005 evaluation data; we used one hundred sessions, though I do not think we actually need that much data for development, because we estimate only a handful of parameters: the set is used to tune the HMM transition parameters and the a parameter which is used for the log-likelihood calibration. We used the NIST 2005 core test set for evaluation. What we did is we took the stereo phone calls and just summed the two sides artificially in order to get two-wire data, and we derived the ground truth from the ASR transcripts provided by NIST. We report speaker error rate; our error measure is the speaker error rate, and we use the standard NIST scoring. Just to be able to compare our results, we use two different reference systems. The first is a baseline which is BIC-based, inspired by a 2005 system; it is based on detection
of speaker changes using BIC, ten iterations of Viterbi resegmentation along with BIC-based clustering, followed finally by a phase of Viterbi resegmentation. So these are the systems, and on all the data we achieved the following: a speaker error rate of 6.1% for the baseline; when we just use the supervector-based system, without any intra-speaker variability compensation, we got 4.8%; and when we also used intra-speaker variability compensation, we got 2.8%. In these experiments the supervector GMM order is 64 and the NAP compensation order is 5. We ran some experiments in order to try to improve on this; we actually did not manage to improve, but we tried to see what would happen if we changed the front end, and we found that feature warping actually degrades performance. This has been explained before by the fact that the channel is actually something we want to exploit when we do diarization, because it may be the case that different speakers are on different channels, so we do not want to remove channel information. Adding delta features also slightly degraded performance. We also wanted to check what the performance would be if we had perfect voice activity detection, and in that case we improved to 2.6%. Some more experiments examined the sensitivity of the system to the GMM order and to the NAP compensation order. Basically, we see that the best GMM orders are 64 and 128; I did not try to increase the order further, so I do not know what would happen with a larger order. For the NAP order, we see that the system is quite robust: over a range of compensation orders we already get a speaker error rate of around 3%, while outside that range performance is something around 5%. Okay, so finally,
before I finish: one of the motivations for this work was to use this diarization for speaker ID on summed two-wire data, so we tried to see whether we get improvements using this diarization. What we did is we tried three different diarization systems: one of them was the reference diarization, the second was the baseline diarization, and the third was the proposed diarization. We tried to apply the diarization not only on the test data but also on the training data, because for one of our sponsors this is actually the problem: the training data is also summed, and therefore we wanted to check the performance when training on summed data as well. We used a speaker ID system which is NAP-based and achieves an equal error rate of 5.2% on the NIST 2005 female dataset. What we concluded is that when only the test data is summed, the proposed method achieves performance that is equivalent to using manual diarization, and when both train and test data are summed, there is a degradation compared to the reference diarization; however, the degradation that we get for the baseline system is larger than when we use the proposed method. So, to summarize: we have described an algorithm for unsupervised estimation of intra-session intra-speaker variability, and we also described two methods to compensate for this variability, one of them using NAP in the supervector space; I did not report results for it, but we also ran experiments using feature-space NAP. Using this for speaker diarization with GMM supervectors, we got a speaker error rate of 4.8%, and if we also apply intra-speaker variability compensation within the supervector-based speaker diarization system, we get a further improvement to 2.8%. Finally, the whole system improves speaker ID accuracy for summed audio,
especially for the summed-training case. Now for future work: the first item would be to apply feature-space NAP for intra-speaker variability compensation and then to use different diarization systems on top of it; this could be interesting. We did try feature-space NAP, but then we just re-estimated supervectors, so everything we did remained in the supervector space. Of course, we should also try to extend this work to diarization with more than two speakers, possibly in broadcast news or in meetings, and to integrate other methods, such as the inter-speaker variability modelling approaches which were proposed lately, with this approach. [Question] First of all, congratulations on your work. You restricted yourself to two-speaker diarization, just segmentation, something like diarization without model selection, and in my opinion the beauty of diarization is the model selection. But anyway, you claim that with this subspace projection you somehow Gaussianize the data, right? [Answer] I think that, first of all, by using the supervector approach it is more convenient to do diarization, because in some sense the distributions tend to be more unimodal, and I also claim that you can use techniques similar to those used for speaker recognition, such as inter-session variability modelling, in this domain. [Question] So, if you want, forget about the two-speaker case and consider a model selection task. One question would be something like this: let us say that in this subspace, after the projection, the emission distributions would be Gaussian; then traditional model selection techniques, like the Bayesian Information Criterion, would be applicable without these issues? [Answer] Hopefully, yes, because the misspecification of the model is obvious when you
use it in the MFCC domain, which is clearly multimodal. [Question] And if you consider now a Gaussian assumption, then probably the BIC will give you answers in this subspace about the number of speakers as well. [Answer] About the number of speakers, possibly; within an HMM framework, we should definitely consider this. [Question] A question about the summed condition: how can you define it in the training set? Because if you do the diarization on the training set, you do not know which speaker is the target. [Answer] The method that was decided with the sponsor is that you do the diarization, and then, given the reference segmentation, you just decide which side has more overlap with the segmentation that you got, so you know which portions of the speech actually belong to the target speaker. Because you have it for two-wire data: you have a segmentation, you get two clusters, and then you just compare these clusters automatically to the reference to decide which cluster is more correct. And if you think about it, if someone trains on summed data, one consistent way of doing it is that you can apply automatic diarization and just let the user listen to the two clusters and decide which cluster is the right one. That is the motivation. Thank you.