0:00:21 I work at Radboud University in the Netherlands. Normally I work on speaker diarization, but I also do a little bit of work on speech recognition for spoken document retrieval, so I'm very glad to be here at this diarization session.
0:00:44 A while ago we got a question from the Dutch Veterans Institute asking if we could process about two hundred interviews that they had conducted with war veterans. The interviews took place at the veterans' homes, with tabletop microphones, background noise, and not very clear speech every now and then.
0:01:06 We tried to do this, and the first thing we did was supervised adaptation of the acoustic models. For about half of the interviews I think we did a pretty good job: word error rates of thirty to forty percent, which was good enough to build a search system. But the other half was terrible. On average, the word error rate over those two hundred interviews was sixty-three percent.
0:01:33 I don't think that was surprising. It was probably because of the acoustic mismatch between the training data and our evaluation data: our decoder was trained on broadcast news, and now we were evaluating it on interviews recorded with tabletop microphones.
0:01:54 This is an issue we actually try to solve in diarization, where most systems train their models on the evaluation data itself, unsupervised, and don't use any training data at all. So I thought: if we can do this for diarization, would it be possible to do a similar thing for speech recognition? Skip all the training data, and train all your models on the evaluation data itself.
0:02:24 Of course that is quite a task, which I'm not going to solve today, so I thought I should look at the acoustic models first. Is it possible to train acoustic models, unsupervised, on the evaluation data itself, and can we maybe do it the same way we do it for diarization?
0:02:44 So the goal of the research I'd like to talk about today is to create a system that is able to automatically segment and cluster an audio recording into small clusters that we call sub-word units, so that these sub-word units can be used to perform ASR.
0:03:03 Even this turned out to be a very difficult task, because once you have unsupervised-trained sub-word units that might represent phones, you still need a dictionary, and so on. So the first step, which is what I'll be talking about today, is: can we evaluate these sub-word units in a query-by-example spoken term detection experiment?
0:03:36 First, our diarization system; I don't want to say too much about it. In diarization we normally try to prevent the system from training on short-term characteristics, such as phone-like units, by enforcing a minimum duration constraint and by making sure that we don't use delta features; the minimum duration especially is important, of course.
0:03:59 The two pictures below show how our diarization system works. It is agglomerative clustering: we start with speech/non-speech detection and create initial models on basically randomly chosen data, and by re-aligning and re-training those models we end up with very good initial models. Then we start the agglomerative clustering: we take the two models that are most similar, based on the Bayesian information criterion, merge them, do re-alignment and re-training again, pick the next most similar pair of models, and go on and on until a stopping criterion is met, which is also based on the Bayesian information criterion.
0:04:42 Below that you see the HMM topology. There are a number of strings of states, each string represents one speaker, and all the states in a string share one single GMM. So the string is really only there to enforce the minimum duration.
0:05:06 So, obtaining these sub-word units unsupervised; well, we had to choose a name, so we called it the Unsupervised Acoustic Sub-word units Detector, or UASD.
0:05:20 Let me list the differences between our diarization system and the UASD system. In diarization we typically have multiple speakers; in this experiment we had only one speaker at a time, the veteran, who was speaking for about two hours, so we had quite some data for that one speaker. The minimum duration in our diarization system is two and a half seconds; the minimum duration in the UASD system is forty milliseconds. I guess the ideal would have been thirty milliseconds, because phone models are thirty milliseconds, but that was not technically possible, so it's forty milliseconds. In diarization we don't use deltas; in UASD we do. In diarization the initial number of clusters varies, because we use more initial clusters if the recording is longer; in UASD we just start with a large, fixed number of initial clusters. And we didn't actually stop using the Bayesian information criterion: we just kept merging until we had fifty-seven clusters left. I'll come back to that later.
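The differences just listed can be collected into a small configuration sketch (the values are as stated in the talk; the dictionary keys and labels are illustrative, not from any real system config):

```python
# Settings contrasted in the talk: diarization vs. the UASD sub-word system.
DIARIZATION = {
    "speakers_per_recording": "multiple",
    "min_duration_s": 2.5,
    "use_deltas": False,
    "initial_clusters": "varies with recording length",
    "stopping": "BIC",
}
UASD = {
    "speakers_per_recording": 1,   # one veteran, about two hours of speech
    "min_duration_s": 0.040,       # 40 ms; 30 ms was the ideal but not feasible
    "use_deltas": True,
    "initial_clusters": "large, fixed number",
    "stopping": "merge until 57 units remain",
}
```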
0:06:29 So that is how we generate the sub-word units, but we need to evaluate them somehow. We decided to do spoken term detection, because we don't have a dictionary or language model available, only the examples. What we are going to do is provide an example query from the audio itself, and the system should be able to provide a list of the other occurrences of that same term in the audio. That is how we are going to evaluate it.
0:07:07 How did we create the system? Because until now we only have the features. We do it the same way as Hazen et al. in their paper "Query-by-example spoken term detection using phonetic posteriorgram templates", which I think was presented here in 2009. What they do is first create a posteriorgram of the entire recording; I have tried to draw one here on the left. On the X axis you have time, and on the Y axis you have, for each time frame, the posteriors of all the phones in the system; in our case these are the sub-word units. Once you have this posteriorgram, you can calculate a similarity matrix between the query and the actual recording; that is the drawing on the right. As the similarity measure we simply take the log likelihood of the inner product of the posterior vectors of the query and the posterior vectors of the recording. Once you have done this, you can do dynamic time warping to find the parts of the recording that are very similar to your example query.
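The matching pipeline just described (frame similarity as the log of the inner product of two posterior vectors, then dynamic time warping) can be sketched as follows. Hazen et al. use segmental DTW over the whole recording; this toy version, for clarity, aligns the query against a single candidate segment and returns a length-normalised cost (lower means a better match).

```python
import math

def frame_dist(p, q, eps=1e-10):
    """Distance between two posterior vectors: minus the log inner product."""
    return -math.log(sum(a * b for a, b in zip(p, q)) + eps)

def dtw_cost(query, segment):
    """Plain DTW alignment cost between two posteriorgrams
    (lists of per-frame posterior vectors)."""
    nq, ns = len(query), len(segment)
    INF = float("inf")
    D = [[INF] * (ns + 1) for _ in range(nq + 1)]
    D[0][0] = 0.0
    for i in range(1, nq + 1):
        for j in range(1, ns + 1):
            d = frame_dist(query[i - 1], segment[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[nq][ns] / (nq + ns)   # length-normalised: lower = more similar
```

Ranking all candidate segments by this cost gives the hit list that the spoken term detection evaluation scores.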
0:08:24 We actually implemented four different systems. The first one is the UASD system, where we automatically find the clusters. The second one is a phone system, similar to that of Hazen et al., but with our phones. The third one uses the MFCC features directly. And the fourth one is a GMM system that was presented here last year by Yaodong Zhang, hopefully I pronounced that correctly. It is basically a variant of the same idea: you take the entire audio recording, train a single GMM on it, and use the posterior probability of each Gaussian as one dimension of the posteriorgram.
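That Gaussian-posteriorgram idea can be sketched as follows for one-dimensional features. In the real system a GMM is first fitted to the whole recording with EM; here the components are simply given, to keep the sketch short, and each frame is mapped to the vector of component posteriors (responsibilities).

```python
import math

def gauss(x, mean, var):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_posteriorgram(frames, means, variances, weights):
    """One posterior vector per frame: each Gaussian's responsibility.
    These vectors play the role of the phone posteriors in the
    phonetic-posteriorgram systems."""
    gram = []
    for x in frames:
        lik = [w * gauss(x, m, v) for m, v, w in zip(means, variances, weights)]
        total = sum(lik) or 1e-300
        gram.append([l / total for l in lik])
    return gram
```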
0:09:15 These are the results. We did two experiments, one on broadcast news and one on the interviews with the war veterans, and calculated the mean average precision for each system. As you can see, the MFCC system performs very poorly on both experiments. The phone system and the other two systems did pretty well on the broadcast news experiment, all very similar; but if you go to the interviews, you can see that the phone system especially fails drastically. I think that is because of the acoustic mismatch. The UASD system is a little bit better than the GMM system. That might be because of the effect mentioned in the third talk today: with agglomerative clustering the models are not as well normalized for the linguistic content, which is what we try to find here. But I'm not sure whether it is actually significant, so we will have to test on more data to find out.
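Mean average precision, the metric used above, can be computed like this: for each query, average the precision at the rank of every relevant hit in the returned list, then average those values over all queries.

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance flags each returned hit as
    relevant or not, best-scoring hit first."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precisions."""
    return sum(average_precision(q) for q in queries) / len(queries)
```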
0:10:19 What I would like to do next is try to generate speaker-independent models, because these models are now specific to each war veteran; so that is the acoustic step. After that, maybe we can try to find some kind of dictionary, so try to find recurrent sequences of sub-word units. Also, we have some annotated data for each interview, which we used to adapt our phone model on; we might be able to use that annotated data to get a little bit more information on the words that were spoken, and learn how to map our sub-word units to these words.
0:11:00 So that's it from me. Thank you.
0:11:13 [audience question, inaudible]