Speech transcript - Factor analysis-based approaches applied to the speaker diarization task of meetings: a preliminary study

Well, after a great discussion about the last talk, I will continue with another topic related to speaker diarization. My name is Pavel Tomasek, and the previous semester I was working as an Erasmus student at the University of Avignon, at the LIA (Laboratoire Informatique d'Avignon). My supervisors were Corinne Fredouille and Driss Matrouf. The work is a preliminary study of factor analysis-based approaches applied to the speaker diarization task of meetings.

What will it be about? I will briefly describe speaker diarization and factor analysis, and I will tell you something about the objectives of this study, some experiments and the perspectives.

Shortly about diarization. I suppose almost all of you know what speaker diarization means and what its purpose is. Speaker diarization tries to answer the question "who spoke when?". We don't have any a priori knowledge about the speakers, neither their number nor their identity. As you can see, here is a small example of an output of such a system, where the speech segments are labelled by speaker. The diarization system tries to find these speaker segments and label them.

For my experiments I used the diarization system developed at the LIA. The system has participated in the NIST Rich Transcription campaigns since 2003. It uses a top-down strategy, which I will explain now.

The top-down strategy consists of four main steps. The first is speech activity detection, where two pretrained GMMs are used as models of speech and non-speech, followed by Viterbi decoding and MAP adaptation. The next step is segmentation, which uses an evolutive hidden Markov model and also Viterbi decoding. The third and the
fourth steps are almost the same: both are resegmentation, but using a different parameterization.

Factor analysis is well known in fields like speaker verification, language identification and video genre classification. The big difference can be described by these two equations: the first equation is standard GMM-UBM modelling, and the second equation contains the U term, which models the session variability.

So what about trying factor analysis modelling within single audio files? We have situations where, for example, a speaker is speaking and the recording environment is changing: the speaker is walking around the microphone, and the distance between the speaker and the microphone changes. Factor analysis can be helpful in this case. We tried two approaches in this work: one localizes the subspace U containing the inter-segment variability, and the other localizes the inter-speaker variability.

About the experimental protocol, the details are the following. As a development set I used 23 audio files from the NIST Rich Transcription evaluations from 2004 to 2006. They were recorded in seven different meeting rooms. Some statistics: the recordings last from 10 to 18 minutes and contain from four to nine participants. As an evaluation set I used seven audio files from NIST from the previous year; they last from 17 to 27 minutes and contain from four to seven speakers. Multiple distant microphones were used here, and as a performance measure I used the diarization error rate. Factor analysis modelling was applied only in the third step of the speaker diarization system.

Now the first approach, modelling the inter-speaker variability. The U matrix in this equation is common to all speakers, and
the assumptions are that the main relevant speaker information is located in a low-dimensional subspace and the rest of the speaker information in the full space.

The results are on the next page. There is nothing interesting except one thing: the difference between these two columns. What does it mean? The first column contains the baseline diarization error rate of each file, without application of factor analysis. The next column contains the results after applying factor analysis in the segmentation with the U term, and the last one without the U term. The difference is big, on average about 10 percent. It means that U contains some information useful for discriminating speakers. In this case, the only important thing is that all the results are on average worse.

The second approach is the inter-segment variability. It is almost the same, except for the basic thing that the U matrix is modelling the inter-segment variability. The results are on this page: the baseline diarization error rate, here after application of factor analysis modelling with U, and here without U. What is interesting here is only the fact that some speaker information is present in the inter-segment component, but it is not significant.

I tried another experiment, based on filtering the speech segments in the development set. In the first column you can see the results of the system that uses a U matrix estimated on all speech segments from the development set. In the next column you can see the results of the system using a U matrix estimated on segments longer than or equal to one second, and so on for longer minimum durations, such as five seconds. The most interesting thing in this paper, I think, is the big difference in these values for this file: the original diarization error rate for
this file was about 20 percent. After applying this modelling and this filtering of segments shorter than one second, we improved the segmentation by about 15.5 percent. That is interesting: we improved the segmentation so much that we went from a 20 percent error rate to a 5 percent error rate.

What about the next resegmentation step, the standard resegmentation step? The hypothesis is that we can gain another improvement with Viterbi and MAP adaptation, and we can see here that the hypothesis is confirmed: starting from the much-changed segmentation, we improved it by another 1.4 percent. But this is important and significant only for this file, where the segmentation changed a lot. In general these changes are not significant. The second resegmentation was just classical Viterbi and MAP adaptation.

I would like to summarize this work. I tested two strategies, inter-speaker variability modelling and inter-segment variability modelling, and only the second gives an improvement of the segmentation, although a very small one. It can be useful to filter out some short speech segments in the estimation of the U matrix, and it is also useful to use another resegmentation step.

Future work can be done with more training data and a larger number of speakers when dealing with the inter-speaker variability. Regarding the inter-segment variability, it can be interesting when dealing with multiple distant microphones. Another test can also be done, applying factor analysis-based speaker modelling in the first step of the speaker diarization system.

Well, thank you very much for your attention. If you have any questions...

You only reported an improvement when you actually selected only the speech segments longer than one
second, right? It means that actually in your segmentation [unintelligible]. So how was the VAD configured? Are there any limits on the minimum duration of a segment?

Sorry, I cannot tell you anything about the VAD, because I just worked with the diarization system as it was. Maybe Corinne, if [unintelligible]. But maybe I didn't understand well: this filtering is done on the development set.

Yes, in fact the U matrix is trained on the development set. We have the references for the development set, so we can choose the length of the segments used for the U matrix estimation.

I have a question. I see that there is inter-speaker variability and inter-segment variability. I guess the inter-segment variability reflects the change of speaker. Is it useful information for detecting the speaker change? [unintelligible]

Well, why not. In the estimation of the U matrix on the development set we had the reference, and in this case the U matrix was estimated for each speaker, between the segments of one speaker. So it was not inter-segment variability in the sense of all segments, but only the inter-segment variability of a certain speaker, for all speakers.

[unintelligible] And then you do the resegmentation using a generative model. You mentioned that the resegmentation is an iterative process, so how many rounds are there?

How many resegmentations? Well, normally there is one segmentation and then the resegmentation takes place. In this case it was a resegmentation that uses factor analysis modelling, and the resegmentation
was repeated until the number of changes in the segmentation was less than a certain value.

So one resegmentation process with many iterations, right? [unintelligible] slide?

I don't know which slide you mean. Here there are the parts of the resegmentation, yes. This is the original baseline system, and there are two resegmentation steps. The factor analysis took place after this resegmentation step, as the last part of the diarization system.

In fact, the number of iterations is not fixed; it depends on the changes in the segmentation. When we observe no more changes in the segmentation at a given stage, we stop.

Thank you.

You only scored the sections of the meetings that did not have overlapping speakers, correct?

We just used the evaluation set from NIST.

But there were different ways to score that. There was a parameter which determined how much overlapping speech was included, and your error rates are quite low, so I assume you did not score the overlapping speakers. But that is just an assumption; I wanted to hear it from you.

Well, the error rates here are the global rates, the total error rates, including the errors on speech and on speakers [unintelligible]. I don't know.

OK. And then about this one meeting where you had a significant improvement: I remember that in one of the NIST meetings there was a much larger number of speakers than in the other meetings, and I wonder if that was the one meeting where you saw a gain. There were many more speaker changes, because the number of speakers was about double that of the other meetings. So I wondered if you had looked at some statistics of your meetings to see if
there is some variable, like the number of speakers, that could predict when your method works well and when it might make a difference.

Well, I don't have any information on that; we did not look at it.

[unintelligible] As for the overlap, we did not score it. We thought about that, because we do something to delete the overlap [unintelligible].
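
Editorial note: the performance measure used throughout the talk, the diarization error rate, can be illustrated with a minimal frame-level sketch in Python. This is only an illustration under stated assumptions, not the NIST scoring tool used in the study. The function name `diarization_error_rate`, the one-label-per-frame representation and the brute-force speaker mapping are all hypothetical choices made for the example, and overlapping speech (the subject of the last question) is not modelled at all.

```python
from collections import Counter
from itertools import permutations


def diarization_error_rate(reference, hypothesis):
    """Frame-level DER sketch (hypothetical helper, not the NIST tool).

    Both arguments are equal-length lists with one entry per frame:
    a speaker label (str) or None for non-speech. Overlap is not modelled.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech")

    frames = list(zip(reference, hypothesis))
    # Speech in the reference that the system labelled as non-speech.
    missed = sum(r is not None and h is None for r, h in frames)
    # Non-speech in the reference that the system labelled as speech.
    false_alarm = sum(r is None and h is not None for r, h in frames)

    # Count co-occurrences of (reference speaker, hypothesis speaker).
    pair_counts = Counter((r, h) for r, h in frames
                          if r is not None and h is not None)
    both_speech = sum(pair_counts.values())

    # Hypothesis labels are arbitrary, so search for the one-to-one mapping
    # that maximizes agreement. Brute force: fine for a handful of speakers.
    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})
    size = max(len(ref_speakers), len(hyp_speakers))
    ref_padded = ref_speakers + [None] * (size - len(ref_speakers))
    hyp_padded = hyp_speakers + [None] * (size - len(hyp_speakers))

    best_correct = 0
    for perm in permutations(hyp_padded):
        correct = sum(pair_counts[(r, h)] for r, h in zip(ref_padded, perm))
        best_correct = max(best_correct, correct)

    confusion = both_speech - best_correct
    return (missed + false_alarm + confusion) / total_speech


reference = ["A"] * 4 + [None] * 2 + ["B"] * 4
hypothesis = ["x"] * 3 + ["y"] + [None] + ["y"] * 5
print(diarization_error_rate(reference, hypothesis))  # 0.25
```

In this toy example one frame is a false alarm and one frame is attributed to the wrong speaker, out of eight reference speech frames, which gives 2/8 = 0.25. A real evaluation works on time intervals rather than frames, and whether overlapping speech is scored is a separate configuration choice, as the last question points out.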

Factor analysis-based approaches applied to the speaker diarization task of meetings: a preliminary study

SESSION 6: Diarization

Added: 14 July 2010 11:08, Authors: Pavel Tomasek, Corinne Fredouille, Driss Matrouf (University of Avignon, CERI/LIA, Avignon, France), Length: 0:22:18