I'm from Radboud University in the Netherlands, where I normally work on speaker diarisation, but I also do a little bit of work on speech recognition for spoken document retrieval, so I'm very glad to be here in this diarisation session talking about that.

A while ago we got a question from the Dutch Veterans Institute asking whether we could process about two hundred interviews that they had held with war veterans. The interviews took place at the veterans' homes, with tabletop microphones, background noise and not very clear speech every now and then. We tried to do this, and the first thing we did was supervised adaptation of the acoustic models. For about half of the interviews I think we did a pretty good job: we had word error rates of thirty to forty percent, which was good enough to build a search system. But for the other half the results were terrible. On average the word error rate over those two hundred interviews was sixty-three percent. I don't think this was surprising; it was probably caused by the acoustic mismatch between our training data and the evaluation data: our decoder was trained on broadcast news, and now we tried to run it on interviews recorded with tabletop microphones.

This is an issue we actually try to solve in diarisation, where most systems train their models on the evaluation data itself, unsupervised, without using any training data at all. So I thought: if we can do this for diarisation, would it be possible to do a similar thing for speech recognition? Skip all the training data and train all the models on the evaluation data itself. Of course that is quite a task, which I am not going to solve today, so I thought I should look at the acoustic models first: is it possible to train acoustic models unsupervised on the evaluation data itself, and maybe to do it the same way as we do it for diarisation?

So the goal of the research I would like to talk about today is to create a system that is able to automatically segment and cluster an audio recording into small clusters that we call sub-word units, so that these sub-word units can be used for ASR. Even this turned out to be a very difficult task, because once you have unsupervised sub-word units that might represent phones, you still need a dictionary and so on. So the first step, which I will be talking about today, is: can we evaluate these sub-word units in a query-by-example spoken term detection experiment?

First, our diarisation system; I don't want to say too much about it. As usual in diarisation, we try to prevent the system from training on short-term characteristics by enforcing a minimum duration constraint and by making sure that we don't use deltas; especially the minimum duration is important. The two pictures below show how my diarisation system works. It is agglomerative clustering: we start with speech/non-speech detection and create initial models on basically randomly chosen data, and by re-aligning and retraining these models we get very good initial models. Then we start the agglomerative clustering: we pick the two models that are most similar, based on the Bayesian Information Criterion, merge them, retrain, pick the next two most similar models, and go on and on until a stopping criterion is met, which is also based on the Bayesian Information Criterion.
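To make that merging loop concrete, here is a minimal sketch of BIC-based agglomerative clustering. It is not the system from the talk: each cluster is modelled here by a single full-covariance Gaussian instead of a GMM, the re-alignment and retraining step is only hinted at in a comment, and the function names are illustrative.

```python
import numpy as np

def delta_bic(x_i, x_j, lam=1.0):
    """Classic delta-BIC between two clusters of feature frames, each modelled
    by a single full-covariance Gaussian (a simplification; the talk's system
    uses GMMs).  A negative value favours merging the two clusters."""
    n_i, n_j = len(x_i), len(x_j)
    x = np.vstack([x_i, x_j])
    n, d = x.shape
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # Likelihood gain of keeping the clusters separate rather than merged.
    gain = 0.5 * (n * logdet(x) - n_i * logdet(x_i) - n_j * logdet(x_j))
    # Penalty for the extra parameters of the two separate models.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty

def agglomerative_cluster(clusters, lam=1.0):
    """Greedy loop: repeatedly merge the most similar pair (lowest delta-BIC)
    until no pair has a negative delta-BIC.  For the sub-word unit variant
    described in the talk, the loop would instead stop at a fixed number of
    clusters."""
    while len(clusters) > 1:
        pairs = [(delta_bic(clusters[i], clusters[j], lam), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        score, i, j = min(pairs)
        if score >= 0:          # BIC-based stopping criterion
            break
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        # A real system would re-align frames to clusters and retrain here.
    return clusters
```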
Below that you see the HMM topology: there is a number of strings of states, each string represents one speaker, and all the states in a string share one single GMM, so the string is mainly there to enforce the minimum duration.

So, obtaining these sub-word units unsupervised: we had to choose a name, so we called it unsupervised acoustic sub-word unit detection, UASU.

Here I list the differences between our diarisation system and the UASU system. In diarisation we typically have multiple speakers; in this experiment we had only one speaker each time, the veteran, and that one speaker was talking for about two hours, so we had quite some data from a single speaker. The minimum duration in diarisation is two and a half seconds for our system; the minimum duration in the UASU system is forty milliseconds. I guess the ideal would have been thirty milliseconds, because phone models are thirty milliseconds, but that was technically difficult, so it is forty milliseconds. In diarisation we don't use deltas; in the UASU system we do. In diarisation the initial number of clusters varies, because we use more initial clusters if the recording is longer; in the UASU system we just start with a large fixed number of initial clusters. And we didn't actually stop using the Bayesian Information Criterion: we just kept merging until we had fifty-seven clusters left. I will come back to that later.

So that is how we automatically generate the sub-word units, but we still need to evaluate them somehow. We decided to do a spoken term detection experiment, because we don't have a dictionary or a vocabulary available. The idea is to take an example query from the audio itself, and the system should then provide a list of all the other occurrences of that term in the audio. That is how we are going to evaluate it.

How did we build the system? Until now we only have the features. We do it the same way as Hazen et al. in their paper on query-by-example spoken term detection using phonetic posteriorgram templates, which I think was presented here a few years ago. What they do is first create a posteriorgram of the entire recording; I have tried to draw it here on the left: on the X axis you have time, and on the Y axis you have, for each time frame, the posteriors of all the phones in the system, which in our case are the sub-word units. Once you have this posteriorgram, you can calculate a similarity matrix between the query and the actual recording; that is the drawing on the right. As a similarity measure we just take the log of the inner product of the posterior vectors of the query and the posterior vectors of the recording. Once you have done this, you can do dynamic time warping to find the paths, the regions that are very similar to your example query.
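A minimal sketch of this posteriorgram matching step, assuming the query and recording posteriorgrams are NumPy arrays of shape (frames, units); the function names are illustrative, and a full system would use segmental DTW with path-length normalisation rather than this bare-bones subsequence variant.

```python
import numpy as np

def similarity_matrix(query_post, utt_post, eps=1e-10):
    """Frame-by-frame distance between two posteriorgrams (frames x units):
    the negative log of the inner product of the posterior vectors, as in the
    posteriorgram template matching approach referred to above."""
    return -np.log(np.clip(query_post @ utt_post.T, eps, None))

def subsequence_dtw(dist):
    """Minimal subsequence DTW: align the whole query (rows) against any
    contiguous region of the recording (columns) and return the best end
    column with its path cost (lower cost = better match).  A real system
    would also normalise the cost by the path length."""
    n_q, n_u = dist.shape
    acc = np.full((n_q, n_u), np.inf)
    acc[0, :] = dist[0, :]                 # the match may start at any frame
    for i in range(1, n_q):
        for j in range(n_u):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1, :]))
    return end, float(acc[-1, end])
```

Used together, something like `subsequence_dtw(similarity_matrix(query_post, utt_post))` returns the frame in the recording where the best-matching region ends, and ranking detections by their path cost gives the hit list that the spoken term detection experiment scores.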
We actually implemented four different systems. The first one is the UASU system, where we automatically find the clusters. The second one is a system similar to that of Hazen et al., but using phones. For the third one we just use the MFCC features directly, and the fourth one is a GMM system that was presented here last year. Basically it is the GMM variant of the same idea: you take the entire audio recording, train a single GMM on it, and use the posterior probability of each Gaussian component as a dimension of the posteriorgram.

These are the results. We did two experiments, one on broadcast news and one on the interviews with the war veterans, and calculated the mean average precision for each system. As you can see, the MFCC system performs worst on both experiments. The phone system and the other two systems did pretty well on the broadcast news experiment, and there they are very similar, but if you go to the interviews you can see that especially the phone system fails. I think that is because of the acoustic mismatch. The UASU system is a little bit better than the GMM system; that might be because of the effect, mentioned in the third talk today, that with agglomerative clustering the clusters end up capturing mostly the linguistic variance, which is exactly what we try to find here. But I am not sure whether the difference is actually significant, so we have to test more and add more data to find out.

Things I would like to do next: try to generate speaker-independent models, because these are models specific to each veteran; that is the acoustic step. After that, maybe try to find some kind of dictionary, that is, try to find recurrent sequences of sub-word units. Also, we have ten minutes of annotated data for each interview, which we used to adapt our phone models; we might be able to use this annotated data to get a little bit more information on the words that were spoken, and learn how to map our sub-word units to these words.

So that's it for me, thank you.
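For reference, a small sketch of the GMM-posteriorgram idea used by the fourth system, written with scikit-learn; the number of components and the diagonal covariances are arbitrary illustrative choices, not the setup of the system referred to in the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_posteriorgram(features, n_components=64, seed=0):
    """Train a single GMM on all frames of the recording and use the posterior
    probability of each Gaussian component as the per-frame 'unit' posterior,
    i.e. a GMM posteriorgram.  `features` is an array of shape (frames, dims);
    the returned array has shape (frames, n_components)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',
                          random_state=seed).fit(features)
    return gmm.predict_proba(features)
```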