0:00:15so thank you very much for the introduction i'd like to say
0:00:20this work was focused primarily on my two students chang change she and
0:00:26gang who worked on this as
0:00:29a part of the lre efforts that we've been looking at
0:00:33and they were both supposed to be here unfortunately the process for getting the visa to
0:00:38finland was a little bit more elaborate from the states so they weren't able
0:00:41to get in here but this represents their work so credit noted that he was
0:00:47kinda pass the baton over to me say something about
0:00:50i don't know how the
0:00:51on the highway or something like this i was afraid that i was gonna get
0:00:54into a bad spot here so i'll start off the talk by thanking the
0:00:59organisers for last night
0:01:01i pulled a bunch of pictures i see
0:01:04we have one long wheels that sitting out here kind of waving to everyone here
0:01:07and tell me sits here is a
0:01:11kind of got energy and even though these cities named after joe
0:01:16expected generated go diving into the lake i and cannibal around a pretty kind of
0:01:21took the gentle approaches of siding in all those daughter to cannibal and systems
0:01:27right so
0:01:29now that we've adjusted for the event pair for the morning i guess so the
0:01:34outline for the talk
0:01:35first we'll talk about robust language recognition some ideas that we're looking at in
0:01:40this area and the focus of this talk will be a little more on feature
0:01:44characterisation we have a number of different features that we're exploring
0:01:48and from that we'll talk about a proposed fusion system that we're looking at
0:01:54then our evaluations our evaluations are on two different corpora the darpa rats corpus
0:02:00which is a very noisy corpus and the nist lre which we just heard
0:02:05about this is from the o nine test set that we are working with
0:02:09and some performance analysis and conclusions so to begin with
0:02:15one of the things is that when you look at language id you could
0:02:20simply say well the purpose is to distinguish one language from a set of languages
0:02:25or multiple languages
0:02:26but the type of task that you're looking at might be different depending on the
0:02:30different context
0:02:31you've got to kind of note
0:02:33that in the nist lre there are a number of different scenarios you're looking at for example
0:02:37if you're doing hindi or let's say russian or ukrainian
0:02:42these are languages that are close to each other and while they are unique separate
0:02:47languages they are maybe a little bit different than dialects of a particular language
0:02:51on the other hand you could have very distinct languages that are really far apart so the
0:02:56classifiers and features that you might use
0:02:59for languages that are spaced really far apart may not necessarily be the best when
0:03:04you're looking at closely spaced languages
0:03:06or dialects of the same language now the challenge i think that is becoming
0:03:11more and more relevant in the language id space
0:03:15is not just
0:03:16the space between the languages but the space between the different characteristics that you might
0:03:20see in the audio streams if you're gonna be using
0:03:24it's much more likely that you use found data to help build acoustic models
0:03:29and particularly for the out of set languages than you would for in set languages
0:03:33not knowing the context in which the audio is captured for those out of set languages
0:03:38introduces a lot of challenges
0:03:41we had a paper at interspeech two years back that was entitled dialect
0:03:49id is the secret in the silence and this was by no means an indictment
0:03:54of ldc's efforts to collect a wide variety of language data both for
0:03:59dialect and language id
0:04:01we had done some studies on an arabic corpus which is a five corpus
0:04:06set for arabic and compared that against the four corpora available from ldc
0:04:12and found that
0:04:14in fact that if you throw away all the speech from the five were corpora
0:04:19from the ldc set for arabic you actually did better for language id or dialect
0:04:25id by just focusing on the silent sections and so what that actually tells us
0:04:30is that
0:04:31if you're not sure about how the data is being collected you're probably doing
0:04:35channel handset or microphone id and not necessarily doing dialect id so the work we're
0:04:41looking at here is actually to see if we can improve performance and robustness i'd
0:04:46note that
0:04:48in some previous work there have been a lot of efforts from ibm
0:04:52sri and bbn
0:04:54of late from teams working on the darpa rats language id task which is very noisy
0:05:02more recently our work is focused a little bit more in looking at improving open
0:05:06set out of set language rejection
0:05:09and there primarily because we were interested in seeing how we can come up with
0:05:13more efficient ways to develop background models
0:05:16for when we don't have all of the rejection language information we're trying to change
0:05:22in this study we're going to focus a little more on alternate features as well
0:05:25as various backend classifiers and fusion
0:05:30so three different sets of features are being considered here the classical features that typically
0:05:35you might expect to see in a typical speech application these are four different sets
0:05:41of features we have here the innovative features are the power normalized cepstral coefficients
0:05:46pncc from the cmu group and
0:05:49perceptual minimum variance distortionless response pmvdr these are a set of features that we
0:05:56presented maybe ten years back
0:05:58at one of the interspeech meetings
0:06:00that we used for speech recognition and then a number of extension features and
0:06:05we refer to these primarily because there's additional processing that might be associated with these
0:06:09as opposed to simply just extracting a base feature set so
0:06:13these include
0:06:15various versions of mfcc features depending on the window
0:06:19as well as lfcc and rasta-plp type features
0:06:23these are kind of the three classes of features that we've been working with
0:06:28in order to kind of give you a flow diagram of how the data is
0:06:31being extracted you can kind of see in the paper we kind of summarise all the
0:06:35different aspects here but
0:06:37these are the various sets of features that are coming out of our system
0:06:41and in the next part we'll look at how we actually extract these so in
0:06:45the front end processing we have a speech activity detector that uses a common setup
0:06:49we developed for the rats program
0:06:52standard shifted delta cepstra features
0:06:56with a seven one three seven configuration for this
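The shifted delta cepstra configuration mentioned above is usually read as N-d-P-k = 7-1-3-7: seven cepstral coefficients, a delta spread of one frame, a shift of three frames between blocks, and seven stacked blocks. A minimal sketch of that stacking follows, assuming the input is a list of per-frame cepstral vectors and that indices running past the utterance edges are clamped to the nearest frame. The function name and the edge handling are illustrative assumptions, not the speaker's implementation.

```python
def shifted_delta_cepstra(frames, n=7, d=1, p=3, k=7):
    """Return one n*k dimensional SDC vector per input frame.

    frames: list of cepstral vectors (each with at least n coefficients).
    Block i of frame t is c[t + i*p + d] - c[t + i*p - d] over the first
    n coefficients; the k blocks are concatenated.
    """
    last = len(frames) - 1

    def clamp(idx):
        # Assumption: repeat the edge frame when the index leaves the utterance.
        return min(max(idx, 0), last)

    sdc = []
    for t in range(len(frames)):
        vec = []
        for i in range(k):
            ahead = frames[clamp(t + i * p + d)]
            behind = frames[clamp(t + i * p - d)]
            vec.extend(ahead[j] - behind[j] for j in range(n))
        sdc.append(vec)
    return sdc
```

With the default 7-1-3-7 setting each frame yields 7 x 7 = 49 delta values, which are often appended to the static cepstra before UBM training.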
0:07:00a ubm and a state-of-the-art i-vector based system that uses for dimensions and we
0:07:05use an lda based setup again for dimensionality reduction on the back end processing we
0:07:11do duration length normalisation and we have two different setups one a
0:07:17generative gaussian backend
0:07:19and also a gaussianized cosine distance scoring strategy for the two different classifiers
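As a rough illustration of the cosine distance scoring back end described above, the sketch below length-normalises a test vector and each language's mean vector and picks the language with the highest cosine similarity. It assumes i-vectors are plain lists of floats and that one mean vector per language is already available. The Gaussianization step, the LDA projection and the i-vector extraction itself are omitted, and all names are hypothetical.

```python
import math


def length_normalise(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_score(test_vec, language_mean):
    """Cosine similarity between a test i-vector and a language model vector."""
    a = length_normalise(test_vec)
    b = length_normalise(language_mean)
    return sum(x * y for x, y in zip(a, b))


def classify(test_vec, language_means):
    """Return (best_language, scores) for a dict {language: mean_vector}."""
    scores = {lang: cosine_score(test_vec, mean)
              for lang, mean in language_means.items()}
    return max(scores, key=scores.get), scores
```

The per-language scores returned by `classify` are the kind of values a later score fusion stage would combine with the scores from the Gaussian back end.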
0:07:24so the system flow diagram for this looks like this
0:07:28we have our input audio data here the two audio a datasets that you see
0:07:32here basically represent raw data for the ubm type construction as well as for
0:07:37the total variability matrix that's needed for the i-vector setup
0:07:41and these two datasets are actually the same as what we use in our training
0:07:45set the generative gaussian back end is on the side here and then the
0:07:50cosine distance scoring setup is here
0:07:52and then we do score fusion
0:07:54score processing first and then fuse the setups
0:07:58so for system fusion our setup looks like this we can do feature
0:08:04concatenation and that's one of the approaches we look at where you're just concatenating the feature sets
0:08:08and for back end fusion we use focal to kind of fuse the backend
0:08:14systems so as we see here we end up with the final decision surface
0:08:17or decision
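The two fusion routes described above can be sketched as follows: feature concatenation joins the per-frame vectors of several front ends before modelling, while back-end fusion combines the per-language scores of separately trained systems. This is a sketch under stated assumptions: the fusion here is a plain weighted sum with hand-set placeholder weights, whereas a fusion toolkit would normally learn the weights and bias from development data. Function names are illustrative.

```python
def concatenate_features(*streams):
    """Frame-level concatenation of several feature streams.

    Each stream is a list of per-frame vectors; all streams are assumed
    to be frame-aligned and of equal length.
    """
    return [sum((list(f) for f in frames), []) for frames in zip(*streams)]


def fuse_scores(system_scores, weights, bias=0.0):
    """Linear score fusion across systems.

    system_scores: list of dicts {language: score}, one dict per system.
    weights: one weight per system (placeholders here, normally trained).
    """
    languages = system_scores[0].keys()
    return {
        lang: bias + sum(w * s[lang] for w, s in zip(weights, system_scores))
        for lang in languages
    }
```

Concatenation grows the feature dimension and requires a single model over the joint vector, while score fusion keeps each system independent and only needs their outputs, which is one reason the two can behave differently on the same data.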
0:08:20for the evaluation corpora i noted that we had two different corpora that we're working
0:08:24with the nist lre which we classify as a large scale setup twenty three different
0:08:29languages we're only using these for the in set there and for the duration mismatch
0:08:35we looked at the three sets that you would typically see
0:08:39for the darpa program i know some of you may not be familiar
0:08:44with the darpa setup but it's five languages that are in the darpa language id
0:08:50task arabic farsi urdu pashto and dari
0:08:54and there are ten out of set languages that are included in there it is extremely noisy
0:08:59i'll play just an audio clip here so you get some sense of how bad
0:09:02the data is
0:09:19you can clearly see that that's not your typical telephone call that you might be
0:09:24picking up
0:09:25and so in that context the language id task is quite challenging so one of
0:09:32the things we wanted to kind of see in our setup here at least for the
0:09:36darpa rats corpus was to understand
0:09:39if the channels were somehow dependent on each other if everything was kind of uniform or there's
0:09:44some variability across the channels so we consider seven of the channels channel d was
0:09:50there is a channels in the system we set out channel id here because seven
0:09:54channels here
0:09:55or it is we look at the six are a language classes that of the
0:09:59correlation in seven languages and then there's the ten out of set languages that are
0:10:03set up here we scored
0:10:06to a seven or eight errors only seven sorted files here crosses forty one classes
0:10:11and the idea is that you kind of look at the channel confusion set up here
0:10:16if there is no
0:10:18you know dependency here we kind of expect there to be kind of clear diagonal
0:10:23lines here the fact you see these off diagonal aspects here tells us that there are
0:10:27clearly some channel dependencies in here so what this is telling us is that there's a
0:10:31lot of transmission channel factors
0:10:33that are kind of influencing
0:10:35all the data and what we would expect in the classifier setup so that was the reason
0:10:41i pointed to this previous study where we were looking at the arabic task to see if
0:10:46we could try to do some type of normalisation of channel characteristics
0:10:50so in looking at the two corpora we did kind of our evaluation here for
0:10:56the various feature sets so
0:10:59this has the rats results here and the lre o nine results here
0:11:04the three different broad classes of features the classical features innovative features and extension
0:11:09features are here and we list
0:11:15performance here
0:11:17for each of the different feature sets
0:11:19and you can see with the gaussian and the cosine distance scoring individual scores
0:11:24here if you look at the back end fusion strategy we get the performance improvement here
0:11:29and we can see obviously that hybrid fusion ends up helping in all these conditions
0:11:33there's a very striking in terms of the performance on the clean datasets are a
0:11:39little bit better than the performance on the noisy sets
0:11:43you see from the rats next we wanted to kind of look at rank ordering
0:11:47which features
0:11:50might actually show better improvement so here we just plot
0:11:55the two classifiers and the backend fusion setup so this just gives you a
0:12:00relative comparison across the rats and the lre o nine dataset and basically hybrid fusion
0:12:06here benefits over the various feature concatenation strategies in almost all combinations
0:12:13we get a thirty three percent relative improvement in performance for lid on the rats data
0:12:18and a thirty four percent relative improvement on the lre set
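For readers less used to the term, a relative improvement is the reduction in error divided by the baseline error. The sketch below shows that arithmetic. The error values in the usage line are made-up placeholders chosen only to reproduce a thirty three percent figure; the talk does not give the underlying error rates or name the metric here.

```python
def relative_improvement(baseline_error, new_error):
    """Fractional reduction in error relative to the baseline (0.33 means 33%)."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical numbers, not results from the talk:
# a baseline error of 10.0 reduced to 6.7 is a 33% relative improvement.
example = relative_improvement(10.0, 6.7)
```

A relative figure says nothing about the absolute error level, so the same percentage can correspond to very different absolute gains on a noisy and a clean corpus.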
0:12:23so next we wanted to look a little bit more at kind of
0:12:30test duration aspects here so baseline system shows how test duration performance varies depending on
0:12:38the test sets here for the lre data
0:12:42and you can kind of see as the test duration increases obviously we get better
0:12:45performance if you look at the hybrid fusion it also has a nice improvement here
0:12:51we see that the relative improvement is quite substantial a hybrid fusion obviously does improve
0:12:57lid performance
0:12:58and the relative improvement is actually much stronger the longer the duration set is but you
0:13:03can see that we're almost kind of cutting the error rates here in half
0:13:07or forty percent at least
0:13:09which is quite nice in terms of the shorter three second duration sets
0:13:15the final thing we want to look at is looking at the various features we want
0:13:19to ask a couple of basic questions in terms of how each of these features might be
0:13:23contributing to improve system performance
0:13:25so one question might be how do we calibrate the contribution of each feature
0:13:30in the fusion set
0:13:31and is that contribution similar for the different tasks
0:13:35for rats and for the lre so the idea is that if you look at
0:13:39the rank ordering here in clean data versus the noisy data do you actually get a
0:13:43different set of features that might be better for that particular task
0:13:47so we use this relative significance factor here where we use the leave-one-out
0:13:53system ranking
0:13:55for each particular feature and we normalize that by the individual system's ratings for that
0:14:02particular feature so that allows us to look at the relative rank for the particular
0:14:07features and this kind of shows now the
0:14:11rank-order setups for the different features for rats and for lre
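The talk describes the relative significance factor only in words, so the sketch below is one possible reading rather than the formula from the paper. It assumes that the leave-one-out term is the error of the fused system with a given feature removed, and that this is divided by the error of that feature's own single system. Under that assumption, a feature whose removal leaves the fusion with a high error gets a large factor. All names and numbers are hypothetical.

```python
def relative_significance(leave_one_out_error, individual_error):
    """Assumed form: fusion error without the feature / that feature's own error.

    Both arguments are dicts keyed by feature name.
    """
    return {feat: leave_one_out_error[feat] / individual_error[feat]
            for feat in leave_one_out_error}


def rank_features(rsf):
    """Feature names sorted from most to least significant."""
    return sorted(rsf, key=rsf.get, reverse=True)


# Hypothetical error values for two placeholder features:
loo = {"feat_a": 8.0, "feat_b": 5.0}
own = {"feat_a": 10.0, "feat_b": 10.0}
ranking = rank_features(relative_significance(loo, own))
```

Comparing the spread of the factors on a clean and a noisy corpus is what shows whether one feature dominates or whether the contribution is shared across features.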
0:14:16and what we see here is that if you're looking closely you see that it says
0:14:19pasta lp i guess my students got hungry it's rasta plp
0:14:24at least on the rats and lre the rasta plp feature actually
0:14:28gave us the strongest contribution for improved lid performance
0:14:33and you can see various other features here rank lower
0:14:38what's interesting to note is that if you look at the relative significance factor here
0:14:42for the clean data rasta plp actually far surpasses all the other features
0:14:49in the noisy task that relative impact actually reduces quite significantly it still
0:14:55ranks first
0:14:56but the impact of that single feature when the data becomes extremely noisy is
0:15:01a whole lot less
0:15:02what that's telling us is that in noisy tasks you actually need to leverage performance
0:15:07across multiple features
0:15:09in order to hope to get similar levels of performance in the lid task
0:15:13in noisy conditions
0:15:15so in conclusion here
0:15:18properly fusing various types of acoustic features and backend classifiers can contribute
0:15:23to stronger lid performance
0:15:27in various corpora
0:15:29the proposed gaussianized cosine distance scoring back end was shown to outperform the
0:15:34generative gaussian backend
0:15:36for the darpa rats scenario we saw that we had a thirty eight percent improvement
0:15:43for that particular task and for nist lre we had some additional experiments in
0:15:48the paper that show a forty six percent relative improvement
0:15:52and for the rank ordered features the rasta-plp feature turned out to be the most
0:15:57significant feature set
0:15:59for the two corpora that we considered but we found that you need to
0:16:03fuse multiple features particularly for the noisy conditions in order to hope to get
0:16:08similar levels of performance gain
0:16:13a star
0:16:21any questions
0:16:27next on and i just don wallace logic presented right to left and spot and
0:16:32lre results
0:16:38given rats so noise
0:16:42so there's always a challenge in explaining why something works like so i would say
0:16:48kind of looking at the lre data
0:16:52i think you have different sets of levels of noise on the rats i think
0:16:59for us the rejection but you see for the rats data you got the ten
0:17:04out of set languages those in some sense might be a little bit easier we
0:17:09have done a test with the lre set and what we did was we generated
0:17:13a five in set task that was as close as possible to the five
0:17:18languages that we saw from rats
0:17:20and the performance there was actually fairly different than what we were
0:17:24seeing on the lre o nine set
0:17:28i wish i could give you more insight as to why performance was colours but
0:17:33i have to say that using more features actually helps
0:17:38expected say
0:17:44did you look at the unseen channel in the rats so as i understood the
0:17:48rats you trained on data in set through all the channels you're testing on or
0:17:53did you pull one out as unseen i think there's the unseen
0:17:57one which was recently released
0:18:00well you could do not held out just help wanted uni that's we need to
0:18:04be gentle that we have all that actually with the channel but we did doing
0:18:08to help
0:18:09we do we have done tests on late in that context but not against all
0:18:14these features
0:18:16i can say you similar we did a fair amount of testing actually when we're
0:18:20looking at ms and may g c features and a couple of other frontend enhancement
0:18:25straight techniques in the last icassp for the lid task and there we
0:18:31did hold out one of the channels just to see if we could do an
0:18:33unseen channel that might help
0:18:39so that you looked at a year shifted delta
0:18:42cepstrum like but you can use for plp something that you have more long-term information
0:18:46so you should be is used actual to shifted delta cepstra area system plp so
0:18:52we use the shifted delta cepstra setup on all of them seven one
0:18:56three seven is the configuration
0:19:11the just a excellent talk solely on the at all the well the try to
0:19:17the on the study on
0:19:19is set up recognising the channel america the language rather than the channels so comment
0:19:26on how you know in what cell findings from so which is features this simple
0:19:32effective enough
0:19:35i can answer that question in a bit but let me make one comment so
0:19:39when joe was giving his keynote talk yesterday i had one comment i guess i didn't
0:19:44get a chance to make so i'll say it now
0:19:46when you're doing language id or speaker id for that matter particularly language id because
0:19:51you're much more likely to use kind of found data for this and you may
0:19:55not know the channel conditions one of the tasks that's actually a really good
0:19:59thing to do and it may not be something you wanna report but it's something
0:20:02that i think everyone should do
0:20:04typically when you're looking at lid you would run a speech activity detector so you're
0:20:09gonna have kind of your silence or low energy and noise and your speech what
0:20:14is really a good task is to run your language id task on the speech
0:20:18and then run it on the silence okay all the data that you pulled out
0:20:23if you run it on the silence and you find that you're getting basically chance
0:20:26across all your setups then you kind of know that the channels are not really
0:20:31dependent on each other
0:20:33but if you're getting really good performance
0:20:36actually better performance than if you're using the speech then you kind of know
0:20:40that your classifier is not really targeting the speech it's actually targeting the channel characteristics
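The speech versus silence check described above can be written down as a small decision rule: score the language classifier once on the speech segments and once on the segments the activity detector discarded, then compare the silence accuracy with chance. The sketch below assumes balanced classes, so that chance is one over the number of languages, and uses an arbitrary tolerance `margin`. The function name, the margin and the wording of the messages are illustrative assumptions.

```python
def channel_leak_check(speech_accuracy, silence_accuracy, num_classes, margin=0.05):
    """Interpret language id accuracy measured on silence-only audio.

    Accuracies are fractions in [0, 1]; chance assumes balanced classes.
    """
    chance = 1.0 / num_classes
    if silence_accuracy <= chance + margin:
        return "silence near chance: no obvious channel cue"
    if silence_accuracy >= speech_accuracy:
        return "silence beats speech: classifier is likely keying on the channel"
    return "silence above chance: some channel dependence"
```

For a five language task, `channel_leak_check(0.80, 0.21, 5)` falls in the first case, while a silence accuracy of 0.90 against a speech accuracy of 0.80 would flag the corpus as the kind where the system is doing channel id rather than language id.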
0:20:47and that's what we found we tried actually in a previous paper a number
0:20:51of ways to kind of adjust
0:20:53long term channel normalisation techniques something like this we were able to get the long
0:20:58term channel exactly the same for those different corpora
0:21:01even doing that we still could not get the performance on silence to drop
0:21:05to chance
0:21:06a personal and no for i think it from looking at nist
0:21:10i really would like to see a performance benchmark especially for lid not necessarily for
0:21:15sid but if you look for lid if you could come up with
0:21:18a performance benchmark that looked at your performance for all the speech and kind of
0:21:23balanced that against performance on the silence
0:21:26because the idea is that if you get great performance here and you're getting just a
0:21:30little improvement here than your gain is you that's all you're really leveraging actually looking
0:21:35at the speech but the performance is really big thing that actually and out with
0:21:40make up and affected your cheating
0:21:42so that the kind of says more about the speaker