Speech Transcript - Robust Language Recognition Based on Diverse Features

0:00:15	so thank you very much for the introduction i'd like to say
0:00:20	this work was focused primarily on my two students qian zhang and
0:00:26	gang liu who worked on this as
0:00:29	a part of the lre efforts that we've been looking at
0:00:33	and they were both supposed to be here unfortunately the process for getting the visa to
0:00:38	finland was a little bit more elaborate from the states and so i wasn't able
0:00:41	to get them here but this represents their work so credit noted that he was
0:00:46	gonna
0:00:47	kinda pass the baton over to me say something about
0:00:50	i don't know how the
0:00:51	on the highway or something like this i was afraid that i was gonna get
0:00:54	into a bad spot here so i'll start off the talk by thanking the
0:00:59	organisers for last night
0:01:01	i pulled a bunch of pictures i see
0:01:04	we have one long wheels that sitting out here kind of waving to everyone here
0:01:07	and tell me sits here is a
0:01:11	kind of got energy and even though these cities named after joe
0:01:16	expected generated go diving into the lake i and cannibal around a pretty kind of
0:01:21	took the gentle approaches of siding in all those daughter to cannibal and systems
0:01:27	right so
0:01:29	now that we've adjusted for the event pair for the morning i guess so the
0:01:34	outline for the talk
0:01:35	first we'll talk about robust language recognition and some ideas that we're looking at in
0:01:40	this area and the focus of this talk will be a little more on feature
0:01:44	characterisation we have a number of different features that we're exploring
0:01:48	and from that we'll talk about a proposed fusion system that we're looking at
0:01:54	then our evaluations are on two different corpora the darpa rats corpus
0:02:00	which is a very noisy corpus and the nist lre which we just heard
0:02:05	about this is from the lre09 test set that we are working with
0:02:09	and then some performance analysis and conclusions so to begin with the focus we want
0:02:15	is that one of the things when you look at language id is you could
0:02:20	simply say well the purpose is to distinguish one language from a set of languages
0:02:25	or multiple languages
0:02:26	but the type of task that you're looking at might be different depending on the
0:02:30	different context
0:02:31	you can kind of note
0:02:33	that in the nist lre there are a number of different scenarios you're looking at for example
0:02:37	hindi and urdu or let's say russian and ukrainian
0:02:42	these are languages that are close to each other and while they are unique separate
0:02:47	languages they may be only a little bit different much like dialects of a particular language
0:02:51	on the other hand you could have very distinct languages that are really far apart so the
0:02:56	classifiers and features that you might use
0:02:59	for languages that are spaced really far apart may not necessarily be the best when
0:03:04	you're looking at closely spaced languages
0:03:06	or dialects of the same language now the challenge i think that is becoming
0:03:11	more and more relevant in the language id space
0:03:15	is not just
0:03:16	the space between the languages but the space between the different characteristics that you might
0:03:20	see in the audio streams if you're going to be doing language id
0:03:24	it's much more likely that you'll use found data to help build acoustic models
0:03:29	particularly for the out of set languages and even for the in set languages
0:03:33	not knowing the context in which the audio is captured for those out of set languages
0:03:38	introduces a lot of challenges
0:03:41	we had a paper at interspeech two years back that was entitled dialect
0:03:49	id is the secret in the silence and this was by no means an indictment
0:03:54	of ldc's efforts to collect a wide variety of language data both for
0:03:59	dialect and language id
0:04:01	we had done some studies on an arabic corpus which is a five corpus
0:04:06	set for arabic and compared that against the corpora available from ldc
0:04:12	and found that
0:04:14	in fact if you throw away all the speech from the corpora
0:04:19	in the ldc set for arabic you actually did better for language id or dialect
0:04:25	id by just focusing on the silent sections and so what that actually tells us
0:04:30	is that
0:04:31	if you're not sure about how the data is being collected you're probably doing
0:04:35	channel handset or microphone id and not necessarily doing dialect id so the work we're
0:04:41	looking at here is actually to see if we can improve performance and robustness as a side
0:04:46	note
0:04:48	in some previous work there have been a lot of efforts from ibm
0:04:52	sri and bbn
0:04:54	and other teams working on the darpa rats language id task which is very noisy
0:05:02	more recently our work has focused a little bit more on improving open
0:05:06	set out of set language rejection
0:05:09	and that's primarily because we were interested in seeing how we can come up with
0:05:13	more efficient ways to develop background models
0:05:16	for when we don't have all of the rejection language information we're trying to change
0:05:22	in this study we're going to focus a little more on alternate features as well
0:05:25	as various backend classifiers and fusion
0:05:30	so three different sets of features are being considered here the classical features that typically
0:05:35	you might expect to see in a typical speech application these are for different sets
0:05:41	of features we have here the innovative features are the power normalized cepstral coefficients
0:05:46	pncc from the cmu group and
0:05:49	perceptual minimum variance distortionless response pmvdr these are a set of features that we
0:05:54	had
0:05:56	maybe ten years back
0:05:58	at one of the interspeech meetings
0:06:00	that we used for speech recognition and then a number of extensional features and
0:06:05	we refer to these that way primarily because there's additional processing that might be associated with these
0:06:09	as opposed to simply just extracting a base feature set so
0:06:13	these include
0:06:15	various versions of mfcc features depending on the window
0:06:19	as well as lfcc and rasta-plp type features
0:06:23	these are kind of the three classes of features that we've been working with
0:06:28	in order to kind of give you a flow diagram of how the data is
0:06:31	being extracted you can kind of see in the paper we kind of summarise all the
0:06:35	different aspects here but
0:06:37	these are the various sets of features that are coming out of our system
0:06:41	and in the next part we'll look at how we actually extract these so in
0:06:45	the front end processing we have a speech activity detector that uses a common setup
0:06:49	we developed for the rats program
0:06:52	and standard shifted delta cepstra features
0:06:56	with a seven one three seven configuration for this
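The seven one three seven configuration is conventionally read as N-d-P-k: N = 7 cepstral coefficients per frame, a delta spread of d = 1 frame, a shift of P = 3 frames between blocks, and k = 7 stacked blocks, which gives 49 values per frame. The minimal Python sketch below illustrates that standard computation. The function name and the edge handling (clamping indices at the ends of the utterance) are assumptions made for illustration and are not taken from the talk.

```python
def shifted_delta_cepstra(frames, d=1, p=3, k=7):
    """Stack k delta blocks, spaced p frames apart, for every frame.

    frames: list of per-frame cepstral vectors (N values each).
    Returns one vector of N * k values per input frame.
    """
    last = len(frames) - 1

    def frame_at(index):
        # Assumption: clamp out-of-range indices to the first/last frame.
        return frames[min(max(index, 0), last)]

    stacked = []
    for t in range(len(frames)):
        vector = []
        for block in range(k):
            centre = t + block * p
            ahead = frame_at(centre + d)
            behind = frame_at(centre - d)
            # Delta for this block: c(centre + d) - c(centre - d)
            vector.extend(a - b for a, b in zip(ahead, behind))
        stacked.append(vector)
    return stacked


# Toy input: 30 frames of 7 coefficients that grow by 1.0 per frame.
toy_frames = [[float(t)] * 7 for t in range(30)]
sdc = shifted_delta_cepstra(toy_frames)
print(len(sdc[0]))  # 7 coefficients x 7 blocks = 49
```

Because each block looks one frame ahead and one frame behind, the toy ramp yields a delta of 2.0 wherever no clamping occurs.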
0:07:00	then a ubm and a state-of-the-art i-vector based system that uses for dimensions and we
0:07:05	use an lda based setup for dimensionality reduction on the back end processing we
0:07:11	do duration length normalisation and we have two different setups one a
0:07:17	generative gaussian backend
0:07:19	and also a gaussianized cosine distance scoring strategy for the two different classifiers
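As a rough illustration of the scoring side, the sketch below length-normalises an i-vector and scores it against one mean vector per language with a plain cosine similarity. This is a generic cosine distance scorer and does not include the Gaussianization step used in the talk, which is not described here. The vectors, language names, and function names are hypothetical.

```python
import math


def length_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_score(a, b):
    """Cosine similarity between two vectors."""
    a_n, b_n = length_normalize(a), length_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def score_languages(test_ivector, language_means):
    """Score one test vector against each language's mean vector."""
    return {lang: cosine_score(test_ivector, mean)
            for lang, mean in language_means.items()}


# Hypothetical 3-dimensional "i-vectors" purely for illustration.
language_means = {"arabic": [1.0, 0.0, 0.0], "farsi": [0.0, 1.0, 0.0]}
scores = score_languages([0.2, 0.9, 0.1], language_means)
best = max(scores, key=scores.get)
print(best)  # farsi
```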
0:07:24	so the system flow diagram for this looks like this
0:07:28	we have our input audio data here the two audio datasets that you see
0:07:32	here basically represent raw data for the ubm construction as well as for
0:07:37	the total variability matrix that's needed for the i-vector setup
0:07:41	and these two datasets are actually the same as what we use in our training
0:07:45	set the generative gaussian back end is on the side here and then the
0:07:50	cosine distance scoring setup is here
0:07:52	and then we do score fusion
0:07:54	score processing first and then fuse the setups
0:07:58	so for system fusion our setup looks like this we can do feature
0:08:04	concatenation and that's one of the approaches we look at just concatenating the feature sets
0:08:08	and for back end fusion we use focal to fuse the backend
0:08:14	systems as you see here to come up with the final decision surface
0:08:17	or decision
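Score-level fusion of the two back ends can be pictured as a weighted sum of their per-language scores. The sketch below is a minimal linear fusion with hand-picked weights. In practice the weights would be trained on held-out data (for example by logistic regression), and nothing here reflects the interface of any particular toolkit or the weights used in the talk. All names and numbers are hypothetical.

```python
def fuse_scores(system_scores, weights, bias=0.0):
    """Linearly combine per-language scores from several subsystems.

    system_scores: list of dicts mapping language -> score, one per subsystem.
    weights: one weight per subsystem (assumed to be trained elsewhere).
    """
    languages = system_scores[0].keys()
    return {
        lang: bias + sum(w * scores[lang] for w, scores in zip(weights, system_scores))
        for lang in languages
    }


# Hypothetical scores from two back ends for the same test utterance.
gaussian_backend = {"urdu": 1.0, "dari": 0.0}
cosine_backend = {"urdu": 0.5, "dari": 1.0}
fused = fuse_scores([gaussian_backend, cosine_backend], weights=[0.5, 0.5])
decision = max(fused, key=fused.get)
print(decision)  # urdu
```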
0:08:20	for the evaluation corpora i noted that we had two different corpora that we're working
0:08:24	with the nist lre is a large scale setup with twenty three different
0:08:29	languages we're only using these for the in set there and for the duration mismatch
0:08:35	we looked at the three sets that you would typically see
0:08:39	for the darpa program i know some of you may not be familiar
0:08:44	with the darpa setup but there are five languages in the darpa language id
0:08:50	task arabic farsi urdu pashto and dari
0:08:54	and there are ten out of set languages that are included in there it is extremely noisy
0:08:59	i'll play just an audio clip here so you get some sense of how bad
0:09:02	the data is
0:09:19	you can clearly see that that's not your typical telephone call that you might be
0:09:24	picking up
0:09:25	and so in that context the language id task is quite challenging so one of
0:09:32	the things we wanted to kind of see in our setup here at least for the
0:09:36	darpa rats corpus was to understand
0:09:39	if the channels were somehow dependent on each other if everything was kind of uniform or if there's
0:09:44	some variability across the channels so we considered seven of the channels channel d was
0:09:50	left out there are eight channels in the system we set aside channel d here so there are seven
0:09:54	channels here
0:09:55	and then we look at the six language classes that's the
0:09:58	five
0:09:59	in set languages and then there's the ten out of set languages that are
0:10:03	set up here we scored
0:10:06	to a seven or eight errors only seven sorted files here crosses forty one classes
0:10:11	and the idea is that you kind of look at the channel confusion setup here
0:10:16	if there is no
0:10:18	you know dependency here we'd kind of expect there to be kind of clear diagonal
0:10:23	lines here the fact that we see these other aspects here tells us that there are
0:10:27	clearly some channel dependencies in here so what this is telling us is that there's a
0:10:31	lot of transmission channel factors
0:10:33	that are kind of influencing
0:10:35	all the data and what we would expect in the classifier setup so that was the reason
0:10:41	i pointed to this previous study we did looking at the arabic task to see if
0:10:46	we could try to do some type of normalisation of channel characteristics
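The channel check described here amounts to building a confusion matrix and looking at how much of it falls off the diagonal. A minimal sketch, assuming hypothetical class labels and a tiny made-up list of decisions (none of these values come from the talk):

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def off_diagonal_fraction(counts):
    """Fraction of decisions where the predicted class differs from the true one."""
    total = sum(counts.values())
    off = sum(n for (truth, guess), n in counts.items() if truth != guess)
    return off / total if total else 0.0


# Hypothetical channel labels for six test files.
true_channels = ["A", "A", "B", "B", "C", "C"]
predicted_channels = ["A", "B", "B", "B", "C", "A"]
counts = confusion_counts(true_channels, predicted_channels)
fraction = off_diagonal_fraction(counts)
print(round(fraction, 3))  # 0.333
```

A fraction near zero corresponds to the clean diagonal; anything substantially above that is the kind of off-diagonal structure that would need a closer look.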
0:10:50	so in looking at the two corpora we did kind of our evaluation here for
0:10:56	the various feature sets so
0:10:59	this has the rats results here and the lre09 results here
0:11:04	the three different broad classes of features the classical features innovative features and extensional
0:11:09	features are here and we list the
0:11:15	performance here
0:11:17	for each of the different feature sets
0:11:19	and you can see with the gaussian and the cosine distance scoring individual scores
0:11:24	here if you look at the back end fusion strategy we get a performance improvement here
0:11:29	and we can see obviously that back end fusion ends up helping in all these conditions
0:11:33	it's not very striking but in terms of the performance the clean datasets are a
0:11:39	little bit better than the performance on the noisy sets
0:11:43	that you see from the rats next we wanted to kind of look at rank ordering
0:11:47	which features
0:11:50	might actually show better improvement so here we just plot
0:11:55	the two classifiers and the backend fusion setup so this just gives you a
0:12:00	relative comparison across the rats and the lre09 datasets and basically back end fusion
0:12:06	here benefits the various feature concatenation strategies in almost all combinations
0:12:13	we get a thirty three percent relative improvement in performance for lid on the rats data
0:12:18	and a thirty four percent relative improvement on the lre set
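For reference, a relative improvement compares the reduction in error to the baseline error. A minimal sketch of the arithmetic follows; the error values in the example are invented and are not results from the talk.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates: 10.0 before fusion, 6.7 after fusion.
print(round(relative_improvement(10.0, 6.7), 1))  # 33.0
```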
0:12:23	so next we wanted to look a little bit more at kind of
0:12:30	test duration aspects here so the baseline system shows how performance varies depending on
0:12:38	the duration of the test sets here for the lre data
0:12:42	and you can kind of see as the test duration increases obviously we get better
0:12:45	performance if you look at the hybrid fusion it also has a nice improvement here
0:12:51	we see that the relative improvement is quite substantial a hybrid fusion obviously does improve
0:12:57	lid performance
0:12:58	and the relative improvement is actually much stronger the longer the duration set is but you
0:13:03	can see that we're almost kind of cutting the error rates here in half which
0:13:07	is or forty percent at least
0:13:09	which is quite nice in terms of the shorter three second duration sets
0:13:15	the final thing we want to look at is the various features we want
0:13:19	to ask a couple of basic questions in terms of how each of these features might be
0:13:23	contributing to improved system performance
0:13:25	so one question might be how do we calibrate the contribution of each feature
0:13:30	in the fusion set
0:13:31	and is that contribution similar for the different tasks
0:13:35	for rats and for the lre so the idea is that if you look at
0:13:39	the rank ordering here in clean data versus the noisy data do you actually get a
0:13:43	different set of features that might be better for that particular task
0:13:47	so we use this relative significance factor here where we use a leave-one-out
0:13:53	system ranking
0:13:55	for each particular feature and we normalize that by the individual system's rating for that
0:14:02	particular feature so that allows us to look at the relative rank for the particular
0:14:07	features and this kind of shows now
0:14:11	the rank-order setups for the different features for rats and for lre
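The talk does not spell out the formula for the relative significance factor. One possible reading, used here purely for illustration, is to measure how much the fused error grows when a feature is left out and to divide that by the error of the feature's stand-alone system. The sketch below implements only that assumed definition. The function names, feature names, and error values are hypothetical, and the paper's actual definition may differ.

```python
def relative_significance(full_fusion_error, leave_one_out_errors, individual_errors):
    """Assumed definition: (error without the feature - full fusion error)
    divided by the error of that feature's stand-alone system."""
    factors = {}
    for feature, loo_error in leave_one_out_errors.items():
        degradation = loo_error - full_fusion_error
        factors[feature] = degradation / individual_errors[feature]
    return factors


def rank_features(factors):
    """Order features from most to least significant."""
    return sorted(factors, key=factors.get, reverse=True)


# Hypothetical error rates for two features.
factors = relative_significance(
    full_fusion_error=5.0,
    leave_one_out_errors={"feature_a": 8.0, "feature_b": 5.5},
    individual_errors={"feature_a": 10.0, "feature_b": 11.0},
)
ranking = rank_features(factors)
print(ranking)  # ['feature_a', 'feature_b']
```

Under this reading, a feature whose removal barely changes the fused error gets a factor close to zero, however well it performs on its own.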
0:14:16	and what we see here is that if you're looking closely you see that it says
0:14:19	pasta plp i guess my students got hungry that's rasta plp
0:14:24	at least on the rats and lre the rasta plp feature actually
0:14:28	gave us the strongest contribution for improved lid performance
0:14:33	and you can see various other features here rank lower
0:14:38	what's interesting to note is that if you look at the relative significance factor here
0:14:42	for the clean data rasta plp actually far surpasses all the other features
0:14:49	in the noisy task that relative impact actually reduces quite significantly it still
0:14:55	ranks first
0:14:56	but the impact of that single feature when the data becomes extremely noisy is
0:15:01	a whole lot less
0:15:02	what that's telling us is that in noisy tasks you actually need to leverage performance
0:15:07	across multiple features
0:15:09	in order to hope to get similar levels of performance in the lid task
0:15:13	in noisy conditions
0:15:15	so in conclusion here
0:15:18	by fusing various types of acoustic features and backend classifiers we can contribute
0:15:23	to stronger lid performance
0:15:27	on various corpora
0:15:29	the proposed gaussianized cosine distance scoring back end was shown to outperform the
0:15:34	generative gaussian backend
0:15:36	for the darpa rats scenario we saw that we had a thirty eight percent improvement
0:15:42	i'm
0:15:43	for that particular task and for nist lre we had some additional experiments that are in
0:15:48	the paper that show a forty six percent relative improvement
0:15:52	and for the rank ordered features the rasta-plp feature turned out to be the most
0:15:57	significant feature set
0:15:59	for the two corpora that we considered but we found that you need to
0:16:03	fuse multiple features and particularly for the noisy conditions in order to hope to get
0:16:08	similar levels of performance gain
0:16:13	a star
0:16:21	any questions
0:16:24	which
0:16:27	next on and i just don wallace logic presented right to left and spot and
0:16:32	lre results
0:16:35	what's
0:16:38	given rats so noise
0:16:42	so there's always a challenge in explaining why something works like so i would say
0:16:48	kind of looking at the lre data
0:16:52	i think you have different sets of levels of noise on the rats i think
0:16:59	for us the rejection but you see for the rats data you got the ten
0:17:04	out of set languages those in some sense might be a little bit easier we
0:17:09	have done a test where we took the lre sets and what we did is we generated
0:17:13	a five in set task that was as close as possible to the five
0:17:18	in set languages that we saw from rats
0:17:20	we showed the performance there was actually fairly different than what we were
0:17:24	seeing on the lre09 set
0:17:28	i wish i could give you more insight as to why performance was colours but
0:17:33	i have to say that using more features actually helps
0:17:38	expected say
0:17:44	did you look at the unseen channel in the rats so as i understood the
0:17:48	rats you trained on data in set through all the channels you're testing or
0:17:53	did you did the pull one out and unseen there's see i think the unseen
0:17:57	which the one images recently released
0:18:00	well you could do not held out just help wanted uni that's we need to
0:18:04	be gentle that we have all that actually with the channel but we did doing
0:18:08	to help
0:18:09	we do we have done tests on late in that context but not against all
0:18:14	these features
0:18:16	i can say similarly we did a fair amount of testing actually when we were
0:18:20	looking at ms and may g c features and a couple of other frontend enhancement
0:18:25	straight techniques in the last icassp for the lid task and there we
0:18:31	did hold one of the channels out just to see if we could do an
0:18:33	unseen channel that might help
0:18:39	so that you looked at a year shifted delta
0:18:42	cepstrum like but you can use for plp something that you have more long-term information
0:18:46	so you should be is used actual to shifted delta cepstra area system plp so
0:18:52	we use the shifted delta cepstra setup on all of that and seven one
0:18:56	three seven is the configuration
0:19:11	the just a excellent talk solely on the at all the well the try to
0:19:17	the on the study on
0:19:19	is set up recognising the channel america the language rather than the channels so comment
0:19:26	on how you know in what cell findings from so which is features this simple
0:19:32	effective enough
0:19:34	also
0:19:35	i can answer that question in a bit but let me make one comment so
0:19:39	when joe was giving his keynote talk there was one comment i guess i didn't
0:19:44	get a chance to make i'll say it now
0:19:46	when you're doing language id or speaker id for that matter particularly language id because
0:19:51	you're much more likely to use kind of found data for this and you may
0:19:55	not know the channel conditions one of the tasks that's actually a really good
0:19:59	thing to do and it may not be something you wanna report but it's something
0:20:02	that i think everyone should do
0:20:04	typically when you're looking at lid you would run a speech activity detector so you're
0:20:09	gonna have kind of your silence or low energy and noise and your speech what
0:20:14	is really a good task is to run your language id task on the speech
0:20:18	and then run it on the silence okay all the data that you pulled out
0:20:23	if you run it on the silence and you find that you're getting basically chance
0:20:26	across all your setups then you kind of know that the channels are not really
0:20:31	dependent on each other
0:20:33	but if you're getting really good performance
0:20:36	actually better performance than if you're using the speech then you kind of know that
0:20:40	your classifier is not really targeting the speech it's actually targeting the channel characteristics
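The speech versus silence check described here can be written down as a small decision rule. The sketch below compares language id accuracy on the silence segments against chance and against the accuracy on speech. The function name, the message strings, and the 0.05 margin around chance are arbitrary assumptions for illustration, as are the accuracy values in the example.

```python
def silence_check(speech_accuracy, silence_accuracy, num_classes, margin=0.05):
    """Interpret language id accuracy measured on silence-only segments."""
    chance = 1.0 / num_classes
    if silence_accuracy <= chance + margin:
        # Silence carries no usable cue, so the channels look independent
        # of the language labels.
        return "silence near chance"
    if silence_accuracy >= speech_accuracy:
        # The classifier does at least as well without any speech at all.
        return "silence beats speech: likely channel id"
    return "silence above chance: some channel leakage"


# Hypothetical accuracies for a five-language task.
print(silence_check(0.80, 0.21, num_classes=5))  # silence near chance
print(silence_check(0.80, 0.85, num_classes=5))  # silence beats speech: likely channel id
```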
0:20:47	and that's what we found we actually tried in a previous paper a number
0:20:51	of ways to kind of do
0:20:53	long term channel normalisation techniques something like this we were able to get the long
0:20:58	term channel exactly the same for those different corpora
0:21:01	doing that we still could not get the performance on silence to drop
0:21:05	down to chance
0:21:06	a personal and no for i think it from looking at nist
0:21:10	i really would like to see a performance benchmark especially for lid not necessarily for
0:21:15	sid but if you look at lid if you could come up with
0:21:18	a performance benchmark that looked at your performance for all the speech and kind of
0:21:23	balanced that against performance on the silence
0:21:26	because the idea is that if you get great performance here and you're getting just a
0:21:30	little improvement here than your gain is you that's all you're really leveraging actually looking
0:21:35	at the speech but the performance is really big thing that actually and out with
0:21:40	make up and affected your cheating
0:21:42	so that the kind of says more about the speaker

Language Recognition

Qian Zhang, Gang Liu and John Hansen