We'll come back to that in the discussion session. So now we change topic to speaker diarization. In speaker diarization, one of the important things is to guess the right number of speakers and to do the segmentation. We have four papers in this session; the first one is on online diarization of telephone conversations, presented by our first speaker. I now give him the floor; the topic of the presentation is online diarization of telephone conversations.

I will begin by presenting the speaker diarization problem, and then talk about online speaker diarization and give a short overview of current speaker diarization systems. I will then present the suggested online speaker diarization system, including a description of the diarization, its time complexity and its performance, and of course I will close with the conclusions.

The task of a speaker diarization system is to assign temporal segments of speech to the participants in a conversation; speaker diarization basically attempts to segment and cluster the conversation. In this example, on the left is a manual diarization of a conversation done by a human listener, and on the right is the automatic diarization produced by the suggested speaker diarization system.

Most state-of-the-art speaker diarization systems operate in an offline manner; that is, conversation samples are gathered until the conversation ends, followed by an application of the diarization system. However, for some applications, such as forensics or speech recognition, online diarization could be beneficial; that is, if we want to apply some automatic speaker recognition system, we would like to obtain the diarization of the conversation up to the point at which we want to apply it. Online diarization can be achieved by removing or minimising the size of the buffer; however, this introduces some difficulty into the system, because the amount of data is reduced.

Most offline diarization systems operate in a two-stage process: first the conversation is over-segmented by some change detection algorithm, and then an agglomerative or hierarchical clustering algorithm is applied in which segments are merged until some termination conditions are met, generally the final number of speakers in the conversation. Some recent approaches in offline diarization include GMM-UBM speaker modelling, speaker identification based clustering, and the fusion of several systems with several feature sets in order to apply diarization. Online speaker diarization systems encountered in the literature include online GMM learning, novelty detection algorithms applied to detecting when a new speaker appears in a conversation, and a GMM-UBM based scheme. Most of the state-of-the-art diarization systems, online and offline, encountered in the literature require some offline training of background, channel or gender models in order to apply the diarization algorithms, some require several sets of features, and practically all require a large amount of computation power.

The suggested online diarization system operates in a two-stage process: first, an unsupervised algorithm is applied over an initial training segment of the conversation, followed by the use of the models generated in the first stage in order to apply segmentation of the conversation on demand.
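As a rough illustration of this two-stage flow, here is a minimal Python skeleton. The stage functions are placeholders, and the frame rate and the 120-second default are assumptions of the sketch, not values taken from the presented system.

    # Schematic sketch of the two-stage flow described above; every stage
    # below is a placeholder, not the implementation used in the talk.

    def extract_features(samples):
        # stands in for preprocessing and MFCC extraction
        return list(samples)

    def initial_diarization(features):
        # stands in for speech/non-speech detection plus two-speaker assignment
        return [0] * len(features)

    def train_speaker_models(features, labels):
        # stands in for training one model (e.g. a self-organising map) per speaker
        return {"speaker_models": None}

    def segment_on_demand(features, models):
        # stands in for Viterbi segmentation using the trained models
        return [0] * len(features)

    def online_diarization(blocks, init_seconds=120, frames_per_second=100):
        buffer, models = [], None
        init_frames = init_seconds * frames_per_second
        for block in blocks:
            buffer.extend(extract_features(block))
            # stage 1: unsupervised diarization on the initial training segment only
            if models is None and len(buffer) >= init_frames:
                labels = initial_diarization(buffer[:init_frames])
                models = train_speaker_models(buffer[:init_frames], labels)
            # stage 2: once models exist, a segmentation can be produced on demand
            if models is not None:
                yield segment_on_demand(buffer, models)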
In more detail: the sound samples enter the preprocessing and feature extraction stage and go into a buffer which holds the initial training segment. Diarization is applied only on the initial training segment, and models are generated from it. Once the models are available, we can perform segmentation of the conversation based on these initial models. A major assumption, however, is that all of the speakers in the conversation must participate in this initial training segment, or else models for those speakers will not be available for the rest of the segmentation process.

The first stage is diarization over the initial training segment. Preprocessing and feature extraction are applied to the samples of the initial training segment, followed by an initial assignment algorithm. That is, in a conversation, let's assume a telephone conversation, once we have successfully identified the non-speech we still have two speakers to assign features to; we can identify the speech, but we must apply some kind of algorithm to assign the features to either of the speakers. Once features are assigned to each of the speakers, an iterative process of modelling and time series clustering is applied until termination conditions are met, and once the termination conditions are reached we can provide the segmentation. Modelling in this work is done by self-organising maps, and the time series processing is done by a variant of the hidden Markov model.

When we apply diarization over short segments of speech, two main issues arise. One is that low model complexity is required because of the sparse amount of data. The other is clustering constraints; that is, we would not like a segmentation that switches rapidly between speakers, so we would like to enforce minimum duration constraints on the speech time of each speaker.

The first problem is tackled by replacing the common GMM models with a self-organising map; that is, we train a self-organising map for each of the speakers. Self-organising maps were introduced by Kohonen, and their training is composed of three main stages: first an initialisation, second a rough training, and finally a fine tuning of the neurons, or centroids, to the distribution of points.

Once we have a trained model for each of the speakers in the conversation, we need some means to estimate the likelihood of a new feature; that is, we would like to estimate the likelihood of a feature observation given the model. Under the assumption of normality, that is, that each centroid in the self-organising map is the mean of a Gaussian with a unit covariance matrix, we can apply the following equation to estimate the log likelihood, or rather the minus log likelihood, of an observation. Note that we estimate the log likelihood using only a single neuron, the one closest to the observation, because it generally contains most of the information regarding that observation. The joint log likelihood of a set of features can then be estimated by the sum of the log likelihoods of the single features, given the assumption that the features are independent.
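A minimal sketch of that nearest-neuron estimate, assuming unit-covariance Gaussians centred on the SOM centroids; the array shapes, function names and the commented example are illustrative only.

    import numpy as np

    def neg_log_likelihood(frame, centroids):
        # approximate -log p(frame | SOM) using only the closest neuron,
        # with each centroid taken as the mean of a unit-covariance Gaussian
        d = frame.shape[0]
        sq_dists = np.sum((centroids - frame) ** 2, axis=1)   # distance to every neuron
        return 0.5 * np.min(sq_dists) + 0.5 * d * np.log(2.0 * np.pi)

    def neg_log_likelihood_sequence(frames, centroids):
        # joint score of a feature sequence: sum of the per-frame scores,
        # under the independence assumption mentioned above
        return sum(neg_log_likelihood(f, centroids) for f in frames)

    # example: 24-dimensional MFCC plus delta frames against a 16-neuron map
    # som = np.random.randn(16, 24); x = np.random.randn(24)
    # print(neg_log_likelihood(x, som))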
The second issue, the clustering constraints, is handled using a hidden Markov model, or rather a minimum duration hidden Markov model. In this model, each speaker is modelled using a hyper-state; within each hyper-state we enforce a minimum duration before a transition from one state to another, and in this manner we can use the hidden Markov model to enforce a minimum duration for each of the speakers. Each state in the minimum duration hidden Markov model belongs to a left-to-right hyper-state and uses the speaker's model to estimate the log likelihood, or emission probability, of each observation. The transition matrix of the hidden Markov model contains the within-hyper-state transitions, and a hyper-state transition matrix holds the transitions between hyper-states; this matrix is updated as part of the training process.

For segmentation, once we have the models for each of the speakers and the hidden Markov model, segmentation is applied using the Viterbi time series clustering algorithm. That is, samples of the sound wave are entered into a buffer, diarization is applied to the initial training segment, and the hidden Markov model is generated by the diarization system; once we have this hidden Markov model, segmentation is applied almost instantaneously on demand using the Viterbi algorithm. The computational complexity is in the order of Q squared times T, where Q is the number of states in the HMM and T is the number of features in the conversation. The initialisation and recursion of the Viterbi algorithm can be applied online; that is, the first feature is used to initialise the Viterbi algorithm, and each following feature is used in the recursion step. Once a segmentation is demanded, only the termination and backtracking remain to be applied, and that is almost instantaneous.

A graph of the time required to generate the segmentation of a conversation as a function of the conversation length is given here; it shows that for a four hundred second conversation, for example, only about one millisecond of computation time is required. In the current implementation of the diarization system, one second of processing time covers about seventy-three seconds of audio during the first stage of diarization.
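To make the hyper-state structure and the online initialisation, recursion and on-demand backtracking concrete, here is a small sketch. It assumes a list of trained SOM centroid arrays (for example two, for a telephone conversation), a fixed minimum duration in frames and constant stay/switch probabilities; these names and numbers are illustrative, not the values used in the presented system.

    import numpy as np

    class MinDurationViterbi:
        # Sketch of online Viterbi decoding over a minimum-duration HMM.
        # Each speaker is a hyper-state made of `min_dur` left-to-right states,
        # so a speaker change is only possible after the last state of a chain.

        def __init__(self, soms, min_dur=50, stay=0.99):
            self.soms = soms                       # one centroid array per speaker
            self.n_spk = len(soms)
            self.min_dur = min_dur
            self.Q = self.n_spk * min_dur
            self.log_stay = np.log(stay)
            self.log_switch = np.log((1.0 - stay) / max(self.n_spk - 1, 1))
            self.delta = None                      # best log score per state
            self.psi = []                          # back pointers, one row per frame

        def _log_lik(self, frame, centroids):
            # nearest-neuron log likelihood (unit covariance, constant term dropped)
            return -0.5 * np.min(np.sum((centroids - frame) ** 2, axis=1))

        def _emission(self, frame):
            # every state of a hyper-state shares its speaker's SOM score
            per_spk = np.array([self._log_lik(frame, som) for som in self.soms])
            return np.repeat(per_spk, self.min_dur)

        def step(self, frame):
            # initialisation with the first feature, recursion for every new one
            em = self._emission(frame)
            if self.delta is None:
                self.delta = em.copy()             # simplification: any start state
                self.psi.append(np.arange(self.Q))
                return
            log_a = np.full((self.Q, self.Q), -np.inf)
            for s in range(self.n_spk):
                base = s * self.min_dur
                for d in range(self.min_dur - 1):
                    log_a[base + d, base + d + 1] = 0.0      # forced left-to-right move
                last = base + self.min_dur - 1
                log_a[last, last] = self.log_stay            # keep the same speaker
                for t in range(self.n_spk):
                    if t != s:
                        log_a[last, t * self.min_dur] = self.log_switch
            scores = self.delta[:, None] + log_a
            best_prev = np.argmax(scores, axis=0)
            self.delta = scores[best_prev, np.arange(self.Q)] + em
            self.psi.append(best_prev)

        def backtrack(self):
            # termination and backtracking, run only when a segmentation is demanded
            state = int(np.argmax(self.delta))
            path = [state]
            for bp in reversed(self.psi[1:]):
                state = int(bp[state])
                path.append(state)
            path.reverse()
            return [s // self.min_dur for s in path]         # states back to speakers

The recursion here costs on the order of Q squared operations per frame, matching the Q squared times T complexity quoted above, and the on-demand backtracking only walks the stored back pointers, which is why a segmentation can be served almost instantaneously.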
For the experiments, the database used was two thousand and forty-eight conversations from the NIST 2005 speaker recognition evaluation: two-speaker conversations recorded in a four-wire setting, which were summed and normalised in order to generate two-speaker single-channel conversations. The features extracted were twelve MFCC features, and twelve MFCC plus delta features. The entire database was first processed by the diarization system using all of the data available, producing a twenty percent diarization error rate and a six point nine percent speaker error rate. The way we measured it was to include all of the errors available, that is, speech and non-speech confusion, and overlapped speech, which is segments where speakers speak together, was also counted as an error. For the speaker error rate we eliminated the non-speech in both of the segmentations in order to measure only the speaker confusion.

The diarization error rate as a function of the initial segment length is shown to approach the optimal performance obtained by applying the diarization system over the entire conversation; as we can see, for one hundred and twenty seconds, or two minutes, of initial training segment, we get about twenty-four percent diarization error rate. This behaviour also appears in the speaker error rate. It seems that given two minutes of initial training segment, the diarization error rate is sufficiently close to the diarization error rate obtained by applying the diarization over the entire conversation. Using one hundred and twenty seconds of initial training segment we obtain about twenty-three to twenty-four percent diarization error rate and ten point six percent speaker error rate, while a longer initial training segment provides twenty-two point three percent diarization error rate and about ten percent speaker error rate. The delta features did not provide improved performance.

To conclude, an online speaker diarization system was presented, and it was shown that using as few as one hundred and twenty seconds of conversation we can provide a segmentation of the conversation with an increase of only four percent compared to the diarization error rate obtained by applying the diarization system over the entire conversation. For reasons of robustness and simplicity, GMM models were replaced by a self-organising map, and we assume no prior information regarding the speakers or the conversation; that is, we use no background models of any kind in order to apply diarization, and no parameters are required to be trained offline. Thank you, I will take some questions.
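As a rough, frame-level illustration of the scoring described above (a toy, not the NIST md-eval tool), the two measures could be counted like this; the label conventions for non-speech and overlap are assumptions of the sketch.

    def frame_error_rates(ref, hyp, non_speech=0, overlap=-1):
        # ref and hyp are equal-length lists of per-frame labels;
        # 0 marks non-speech, -1 marks overlapped speech in the reference
        assert len(ref) == len(hyp)

        # "diarization error" here: every disagreement counts, including
        # speech/non-speech confusion, and overlap frames always count as errors
        der_errors = sum(1 for r, h in zip(ref, hyp) if r == overlap or r != h)
        der = der_errors / len(ref)

        # "speaker error": drop frames either side marks as non-speech,
        # keeping only the speaker confusion (and overlap) frames
        scored = [(r, h) for r, h in zip(ref, hyp)
                  if r != non_speech and h != non_speech]
        spk_errors = sum(1 for r, h in scored if r == overlap or r != h)
        spk = spk_errors / len(scored) if scored else 0.0
        return der, spk

    # example with two speakers (1, 2), non-speech (0) and one overlap frame (-1)
    # print(frame_error_rates([1, 1, 0, 2, -1, 2], [1, 2, 0, 2, 2, 2]))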
Regarding the initialisation, maybe I missed it, but what is the length of the segments that go into the SOM?

We have done this experiment using a variable length of initial training segment. That is, assuming one hundred and twenty seconds of initial training segment, some of it belongs to speaker A, some belongs to speaker B, and some belongs to non-speech; the exact amount of features belonging to each of the speakers was not measured, because it is a function of the initialisation algorithm.

Okay, but the self-organising map is using the short segments from this initialisation; do you have a fixed length for those segments, or how are they segmented?

Diarization is actually applied on the initial training segment; that is, first speech and non-speech are detected and the non-speech is removed, and then the segments belonging to speech are distributed among the two speakers in the conversation. The distribution of the features to each of the speakers is a function of the initialisation algorithm, which is a variant of the K-means clustering algorithm, so the exact amount of features assigned to each of the speakers is not fixed.

I have a note and a question about the overlapping speech. You said that you do not handle overlapping speech, but you score it as an error, so it always counts against you; do you have an idea of the amount of overlap and what it contributes to your results?

We have used two databases for diarization; the one used here was the two thousand and forty-eight conversations from the NIST 2005 speaker recognition evaluation, and if I remember correctly it contained about three point eight percent of overlapped speech on average.

I also have two questions. First, have you evaluated the degradation you get from replacing the Gaussian model with the self-organising map model? And secondly, you use the initial so many seconds for building your speaker clusters; could you just redo that every so often? Most machines can process this data more than once, so you could continue doing online segmentation and in the background recompute your speaker clusters every thirty seconds or something like that.

For the first question, we have examined self-organising maps and GMM models for diarization in previously presented papers; in our experiments they gave the same performance, so we did not find any reason to use a GMM, especially because the training process for the self-organising map is a lot faster and, for us, more robust. For the second question, a paper was submitted to Interspeech that does exactly that.

Two questions here. One is a comment about the data set being used: you get good performance using the first hundred and twenty seconds as your initial segment, but the files are only about five minutes long, so you are using a large percentage of the data. Do you find it realistic to go halfway through a conversation?

Absolutely not, because if we use only about thirty seconds of data to initialise, the performance degrades; we get something like a thirty-three percent diarization error rate and about twenty-four percent speaker error rate. The amount of data required by the diarization system is quite large, so if we had the possibility to train the system online as the conversation goes it would be great, and that is exactly what we present in the next paper.

It is not only the duration of the conversation that matters but also its structure. In conversations people take turns, and the duty cycle varies; if you look at the variance, the error rate should be fine unless someone dominates the first part of the conversation. And in the CallHome and CallFriend data there are actually sometimes more than two people, with two people on one side sharing the phone, so you have more realistic interaction.

Maybe you already addressed this: for the online aspect, what did you compare this to? There are published papers, from this workshop even, on exactly this task, where you start out blindly and build up models while doing online diarization. Did you use that as a baseline?

No. I think I encountered two papers which perform this online diarization task, but mostly on broadcast news and not on telephone speech, I believe, so there was very little to compare to.

Thank you. I wanted to know if you have some idea of how to detect a new cluster, a new speaker; the system does not seem to be able to add classes during decoding.

Our diarization system is only oriented towards telephone conversations between two speakers; that is, we already assume that the number of speakers is two. But I have encountered some ideas, part of which use the leader-follower algorithm, which is practically very simple: if we segment the conversation and take a new segment, we can take its distance to the current models we have, and if the distance to all of them exceeds a certain threshold, then we say that this is a new speaker, train a new model for it, and use it to cluster the conversation later on. When you come to the end of the conversation, you could also use the distance matrix between the models in order to merge models which are very close.
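A minimal sketch of that leader-follower idea, assuming segments are scored against each model by an average nearest-centroid Euclidean distance; the distance measure, the threshold and the train_model callback are illustrative choices, not something specified in the answer.

    import numpy as np

    def segment_distance(segment, centroids):
        # average nearest-centroid distance of a segment's frames to one model
        dists = [np.min(np.linalg.norm(centroids - f, axis=1)) for f in segment]
        return float(np.mean(dists))

    def leader_follower(segments, threshold, train_model):
        # if a new segment is far from every existing model, declare a new
        # speaker and train a model for it; otherwise follow the closest model
        models, labels = [], []
        for seg in segments:
            dists = [segment_distance(seg, m) for m in models]
            if not models or min(dists) > threshold:
                models.append(train_model(seg))        # new speaker detected
                labels.append(len(models) - 1)
            else:
                labels.append(int(np.argmin(dists)))   # assign to closest leader
        return labels, models

At the end of the conversation, the pairwise distances between the trained models could then be used to merge models that are very close, as mentioned in the answer.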
I want to make one comment. When you say two minutes out of five: in real life we never know what the length of the conversation will be; it can be four minutes, it can be ten minutes, so two minutes of initialisation may be a large or a small fraction. The question is really how many seconds you need before you have sufficient statistics to cover both speakers.

It is not only a percentage of the conversation; it is a matter of the amount of statistics required to train the two speaker models. That is, if the conversation lasted for half an hour, then after those two minutes, unless the channel changes in such a manner that the models are no longer valid, the result would be the same. But you are correct, we would have to run that experiment in order to show it.

So you have an online system, but I suspect you actually don't; I suspect that your online system is actually an offline system. Do you output anything before you reach the end of the file?

At any point where we are asked for results, that is the output of the diarization system.

But you do use an HMM.

I do.

So you are deferring your decisions; you output the history as soon as it is requested, so the results come from a single pass?

The results are given on user request. Using the HMM, in order to provide diarization results I only need to perform the termination and backtracking, and this can be done in about one millisecond of processing time. All of the stages can be done online: initialisation using only the first feature, the recursion stage for any new feature, and then termination and backtracking, which only requires following the memorised back pointers, so I can provide the results almost instantaneously.

Instantaneously, but the HMM results still come from a single pass. What I really want to say is that I think this online versus offline distinction is really a red herring; it would be better, I think, to talk about the allowed deferral time before a decision needs to be made. You make a distinction between online and offline, but really you are convolving it with this particular approach where you create models from an initial segment, and to my thinking that does not really make the distinction between what is online and what is offline.

If I called it semi-online, would that be okay with you?
What I would like to see is a specification of the amount of time that a decision is allowed to be deferred. If you do that, then for an offline system the deferral time would be infinite, and for an online system the deferral time would be whatever is demanded by the application.

Yes, so the definition of online for the system comes from the client.