0:00:15 I'll also be presenting the work of a whole team at SRI, with a little from myself. This talk is looking at applying convolutional deep neural networks to language ID in noisy conditions, in particular the conditions of the RATS program.
0:00:31 I'll start with a bit of background on why we might want to do this, and our main motivation: the DNN i-vector framework that we recently proposed for speaker ID. Then, because we need to handle the noisy conditions, we started looking at convolutional neural networks for that purpose. We'll also present a simpler system, called the CNN posterior system, for language ID and show results on that. Then I'll go through the experimental setup and walk through some results.
0:00:58 So, for a bit of background on language ID: the UBM i-vector framework is pretty widely used in language ID, and phone recognizers are also a good option. When you use these two together, that's when you get a really nice improvement; they're quite complementary. So in our books, one of the challenges has always been how to get phonetic information, the way someone pronounces something, into a single system that outperforms the two individual systems. That's what we call the challenge: we want one phonetically aware system that can produce scores that outperform fused scores. We recently solved this for speaker ID, at least in the telephone case.
0:01:40 So, just as background on the DNN i-vector framework: what we're doing here is combining a deep neural network that's trained for automatic speech recognition with the popular i-vector model. The way we use it is to generate our zeroth-order and first-order stats; in particular, we use the DNN in place of the UBM.
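To make that concrete in equations (the notation here is mine, not from the slides): for each frame x_t, the DNN gives a posterior gamma_t(k) for every senone class k, and those posteriors simply replace the UBM component occupancies in the usual sufficient statistics,

$$N_k = \sum_t \gamma_t(k), \qquad F_k = \sum_t \gamma_t(k)\, x_t,$$

which then feed the standard i-vector model unchanged.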
0:02:01 What we're doing here, if you look at the comparison down the bottom: the UBM is trained in an unsupervised manner. It's trying to represent classes with Gaussians, and it's aiming, generally, to map to different phonetic classes. However, if someone pronounces a phone one way and someone else pronounces that phone in a completely different way, the UBM is going to model that in different components. The DNN, on the other hand, is trained in a supervised manner. That means it's trying to map those same phones to what we call senones, that is, tied triphone states. So two different people pronouncing the same phone in different ways would be activating the same senone, and that hopefully lets the statistics capture the differences between speakers.
0:02:52 So how powerful is this for speaker ID? It's very powerful for speaker ID. In the initial publication at ICASSP this year, we got a thirty percent relative improvement on telephone conditions, particularly C2 and C5 of NIST SRE'12.

0:03:09 What I'm showing on this slide is actually three different systems: the SRI SRE'12 submission, which was a fusion of six different features plus side information, a whole conglomeration; and then some recent work that uses, instead of MFCCs with deltas and double deltas, what we're calling the PCA-DCT features. We have a publication on that at Interspeech, I mean ICASSP, next year. Just to give you a reference, that gives about a twenty percent relative improvement over MFCCs on all conditions of SRE'12. But what's really nice to see is that the DNN i-vector can still bring a twenty percent improvement on top of that on these two conditions, C2 and C5. So it's very powerful. There's still work to be done on microphone trials, where there's mismatch happening; we have made progress on that, and in fact we should be able to publish on that very soon.
0:04:01 So what I want to conclude here is that we've now got a single system that beats the SRE'12 submission. So how can this be useful for language ID? That's the question: how do we get there?
0:04:13 The context here is that the output of the DNN should include language-related information and, ideally, be more robust to speaker and noise variations. The reason I say that is that when you train the DNN for ASR, you want to remove speaker variability. But is it suitable for channel-degraded language ID? In the RATS program, I think it was IBM who showed that the CNN was particularly good for the RATS noisy conditions. We validated that on our keyword spotting trials, and you can see the dramatic difference it makes on the channel-degraded speech. So we said, let's use the CNN; it should behave in a very similar way to the DNN. That is something that's still open, and we got a few review comments on it: we need to validate the difference between the two to show the actual improvement in LID performance, so we'll do that in future work.
0:05:08 Moving along to the CNN: the training is essentially the same as training the DNN. You can see we've got acoustic features that go into an HMM-GMM, which provides the alignments for the DNN training; once you've trained the DNN you no longer need those alignments, so you don't need to generate them for the test files coming in. The acoustic features are forty-dimensional log mel filterbank energies, which are used for training the neural net. In our work we're stacking fifteen frames together as the input for training, and we use a decision tree to define the senones. As I said, we generate the training alignments with the pre-trained HMM-GMM, which we don't need afterwards.
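Just to illustrate the input format, here is a minimal sketch of that frame stacking, assuming forty log mel filterbank energies per frame and a fifteen-frame window (the function and array names are mine):

```python
import numpy as np

def stack_frames(feats, context=7):
    """Stack each frame with +/- `context` neighbours (15 frames total for context=7).

    feats: (num_frames, 40) log mel filterbank energies for one utterance.
    Returns: (num_frames, 15 * 40) stacked vectors, one network input per frame.
    """
    num_frames, _ = feats.shape
    # Pad the edges by repeating the first/last frame so every frame has a full window.
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)], axis=0)
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(num_frames)])
```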
0:05:51 Just as an illustration, the CNN basically appends, in front of the DNN, this extra layer, all of this processing here, where you've got your filterbank energies within the fifteen-frame context and you pass a convolutional filter over them; I think we're using a filter size of eight, which is the input there. And then what we're doing is using the max pooling option of the CNN, which means that for each group of three outputs that comes out, we take only the maximum one, and that just helps with the noise robustness.
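As a rough sketch of that convolution plus max pooling step (toy code of mine, not SRI's implementation; the filter size of eight and pooling of three follow the numbers mentioned here, while the nonlinearity is an arbitrary choice):

```python
import numpy as np

def conv_maxpool(window, filters, pool=3):
    """One convolutional layer over the mel-frequency axis followed by max pooling.

    window:  (15, 40) stacked log mel filterbank energies (frames x mel bands).
    filters: (num_filters, 15, 8) filters spanning all 15 frames and 8 adjacent bands.
    Returns: pooled activations, shape (num_band_groups, num_filters).
    """
    num_filters, _, fsize = filters.shape
    num_positions = window.shape[1] - fsize + 1       # 40 - 8 + 1 = 33 band positions
    acts = np.empty((num_positions, num_filters))
    for b in range(num_positions):
        patch = window[:, b:b + fsize].reshape(-1)    # local (15 x 8) patch, flattened
        acts[b] = np.maximum(filters.reshape(num_filters, -1) @ patch, 0.0)
    # Max pooling: within each group of `pool` adjacent positions keep only the maximum,
    # which gives some robustness to noisy shifts along the frequency axis.
    usable = (num_positions // pool) * pool
    return acts[:usable].reshape(-1, pool, num_filters).max(axis=1)
```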
0:06:27 Moving to the single i-vector system that goes with this: you can see that we simply plug in the CNN instead of the UBM, which is quite straightforward. What's interesting here is that we've got two different acoustic features. The first is used by the CNN to get the posteriors for each of the senones, and then we multiply those posteriors with the acoustic features for language ID, that is, features chosen to discriminate languages; that is the second set of features. The nice thing about keeping the two apart is that the statistics can be extracted from whichever features you choose; if you want to, you can use the same features for both. But if you want to use multiple features, as in a fusion system, you only need to extract posteriors using that one set of features. This is in contrast to using multiple features for fusion with UBM systems, where you would have to extract, for instance, five different sets of posteriors, one per feature, if you had a five-way fusion.
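Here is a small sketch of that decoupling, the same statistics as before but with the two feature streams kept separate (the function names are hypothetical; `cnn_posteriors` stands in for a forward pass through the trained CNN):

```python
import numpy as np

def collect_stats(asr_feats, lid_feats, cnn_posteriors):
    """Accumulate i-vector sufficient statistics with decoupled feature streams.

    asr_feats: (T, d_asr) features the CNN was trained on, used only for posteriors.
    lid_feats: (T, d_lid) features chosen to discriminate languages, used in the stats.
    cnn_posteriors: callable mapping (T, d_asr) -> (T, K) senone posteriors.
    """
    gammas = cnn_posteriors(asr_feats)   # senone posteriors, computed once per utterance
    N = gammas.sum(axis=0)               # zeroth-order stats: soft count per senone
    F = gammas.T @ lid_feats             # first-order stats: (K, d_lid)
    return N, F
```

Swapping in a different language ID feature only changes `lid_feats`; the posteriors never need to be recomputed, which is the contrast with the UBM case just described.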
0:07:18 Another aspect is that you're able to optimize the language ID features independently of those providing the posteriors, whereas currently, with the UBM systems, it's a bit of a balancing act: you want stable posteriors, but you also want good discriminability from the acoustic features that go into the statistics.
0:07:40 So can we go with an even simpler, alternative system? Here's a simple system where we take mostly the same approach and get the frame posteriors, but we forget about the first-order statistics. What we're doing here is normalizing the zeroth-order statistics in the log domain, and then we just use a simple LID backend: for instance, here we're using a neural network, but you could use a Gaussian backend as well. That's one thing that distinguishes it from a phonotactic system: you can use standard language ID backends, which is nice. So here we're using counts of, let's say, context-dependent states over time, and that's at the state level instead of phone labels.
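Roughly, as I read that description, the per-utterance vector would be something like this (a toy sketch, not the actual SRI code, and the exact normalization is my assumption):

```python
import numpy as np

def posterior_vector(gammas, eps=1e-10):
    """Turn frame-level senone posteriors into one utterance-level vector.

    gammas: (T, K) CNN senone posteriors for an utterance.
    Returns: (K,) log-domain normalised zeroth-order statistics, which can then go
             straight into a standard LID backend (neural network, Gaussian, ...).
    """
    counts = gammas.sum(axis=0)       # zeroth-order stats: soft senone counts
    probs = counts / counts.sum()     # normalise to an utterance-level distribution
    return np.log(probs + eps)        # move to the log domain, as described in the talk
```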
0:08:18 Let's look at the experimental setup. This is the DARPA RATS program, and I'm sure many of you are aware of how noisy these samples are; I think John was talking about them earlier anyway. There are five target languages, which you can see on the screen, plus out-of-set languages, and there's a lot of channel degradation: seven different channels, with SNRs between zero and thirty. The transcriptions that we used to train the CNN come from the keyword spotting task, and that's only two languages; that's an unusual aspect here, in that we're trying to distinguish five languages but we're training on two of them. Test durations are three, ten, thirty and one hundred and twenty seconds, and the metric we use here is the average equal error rate across the target languages.
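For reference, that metric just means computing an equal error rate per target language and averaging; here is a quick sketch of one way to compute it (the threshold search is my own simplification, not necessarily the official scoring tool):

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate for one target language (labels: 1 = target, 0 = non-target)."""
    order = np.argsort(-np.asarray(scores))         # sort trials by descending score
    labels = np.asarray(labels)[order]
    n_tar = max(int((labels == 1).sum()), 1)
    n_non = max(int((labels == 0).sum()), 1)
    miss = 1.0 - np.cumsum(labels == 1) / n_tar     # targets rejected below the threshold
    fa = np.cumsum(labels == 0) / n_non             # non-targets accepted above it
    idx = int(np.argmin(np.abs(miss - fa)))         # point where the two rates cross
    return float((miss[idx] + fa[idx]) / 2.0)

def average_eer(per_language_trials):
    """Mean of per-language EERs, the figure reported for these RATS LID results."""
    return float(np.mean([eer(s, y) for s, y in per_language_trials]))
```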
0:08:59 In terms of the model used to generate the training alignments for the DNN: the HMM-GMM setup here produced around three thousand senones, with two hundred thousand Gaussians, and this was multilingual training on both the Arabic and the Farsi data. The CNN model was also trained on senone alignments with that multilingual training set. We've got hidden layers with twelve hundred nodes each, and forty filterbanks over fifteen stacked frames; you can see the pooling size and the filter size there for the CNN convolutional part.
0:09:37 For the UBM model, for comparison, we trained a two-thousand-and-forty-eight-component UBM, and the features were optimized directly for this task; the speaker ID features didn't tend to carry over well to language ID. The actual features are forty-dimensional 2D-DCT log mel spectral features, and this is similar to the zigzag DCT, the PCA-DCT, work that we proposed at ICASSP for speaker ID; this is an extension that further improves on that.
0:10:08 What about the i-vectors and the backend? So for both the DNN and the UBM i-vectors, everything is trained on the same data for the i-vector subspace, and they're both four hundred dimensional.

0:10:18 For the posterior system, we're collecting the three thousand average posteriors, removing the silence indexes (three of those), and reducing to four hundred dimensions, the same as the i-vector subspace, using probabilistic PCA. For the backend we train a simple neural network, an MLP, with cross entropy. What we do with the data, to enlarge our training dataset, is to chunk it into thirty-second and shorter chunks with fifty percent overlap; I think we end up with around two million i-vectors to train on. The output is the five target languages plus the out-of-set class as well.
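Put together, the backend stage looks roughly like this (an sklearn-flavoured sketch under my own assumptions: plain PCA stands in for the probabilistic PCA, the hidden layer size is a guess, and the data is a random placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Placeholder training data: one row per training chunk, e.g. a log-normalised posterior
# vector with the silence senones removed (or a 400-dim i-vector); labels cover the five
# target languages plus one out-of-set class.
X_train = rng.standard_normal((2000, 3000))
y_train = rng.integers(0, 6, size=2000)

reducer = PCA(n_components=400)              # stands in for probabilistic PCA
X_red = reducer.fit_transform(X_train)

backend = MLPClassifier(hidden_layer_sizes=(400,), max_iter=50)  # trained with cross entropy
backend.fit(X_red, y_train)

# Scoring new segments: class posteriors over the 5 + 1 language classes.
X_test = rng.standard_normal((10, 3000))
language_scores = backend.predict_proba(reducer.transform(X_test))
```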
0:10:55 So how was performance? Well, first of all, the UBM in the UBM i-vector approach isn't normally trained in a supervised way, so what we said was: let's take the senones from the CNN, where we know we've got three thousand classes, let's align the frames for each of those senones, and let's train each of the UBM components with that. The idea here was to try to give a fair comparison between the UBM and the CNN systems.

0:11:20 We then see a nice improvement across all of those conditions from the CNN approaches. It's nice to see that for ten seconds or more we're getting a thirty percent or greater relative improvement over the UBM approach, and for the three-second test condition, a twenty percent relative improvement.
0:11:39 What was interesting is that, between the posterior system and the i-vector system for the CNN, LID performance is actually quite similar. But if we fuse the two, we get another nice gain, a twenty percent relative improvement, when we simply combine the two different CNN approaches; that's particularly for durations of less than one hundred and twenty seconds, whereas for one-twenty the gain wasn't really there. When we then add the UBM i-vector system to that, a different modelling approach, we actually get no gain from the fusion except in the one-twenty case, which is another interesting observation.
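The fusion mechanism isn't spelled out here, but a simple score-level combination learned on held-out data, along these lines, is the usual approach (entirely my own sketch with placeholder data, not necessarily what SRI used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder detection scores for one target language from the two CNN systems
# (i-vector and posterior), plus target/non-target labels on a held-out set.
scores_ivec = rng.standard_normal(500)
scores_post = rng.standard_normal(500)
is_target = rng.integers(0, 2, size=500)

# Learn a weighted combination of the two systems' scores on the held-out set ...
fuser = LogisticRegression()
fuser.fit(np.column_stack([scores_ivec, scores_post]), is_target)

# ... and apply it to new trials: the fused score is a weighted sum plus an offset.
new_trials = np.column_stack([rng.standard_normal(10), rng.standard_normal(10)])
fused_scores = fuser.decision_function(new_trials)
```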
0:12:14 In conclusion, we compared the robustness of the CNN in the noisy conditions, in particular taking the DNN i-vector framework and making it effective on the RATS language ID task. We also proposed what you could call, in a sense, a phonotactic-style system, the CNN posterior system, which is quite a simple system, and we showed high complementarity between these two proposals.

0:12:36 In terms of extensions, where do we go from here? We can improve performance a little by not doing the probabilistic PCA before the backend classification, and fusion of different language-dependent CNNs, using the scores from those, also provides a gain. Then there are bottleneck features, which I think a later talk might cover; they are also a good alternative to the direct use of the DNN or CNN for language ID.

0:13:04 Thank you.
0:13:12 We have time for some questions.

0:13:21 [inaudible]
0:13:31 So for the posterior CNN system, you said you use PCA to four hundred dimensions prior to running your neural net, right?

Yep.

Did you also try putting in the full vector, three thousand dimensions I imagine it is?

0:13:44 Yes; the first point on the extensions slide says that if we don't do that reduction, we do get a slight improvement. I think the motivation for reducing it to four hundred dimensions was for comparability with the i-vector space: to see what we can get in that same four hundred dimensions.
0:14:18 So my question is to do with the data that was used to train the ASR and the CNN. Was it the multiple channels of the Arabic and the Farsi data, so you trained on channel-degraded data, or with no channel degradation, just the clean channel?

0:14:32 I believe that the UBM... the UBM, yes, so we used the channel-degraded data for the UBM, like, the same data. We used the Arabic and the Farsi data to train the alignments, but the UBM was exposed to all five languages across all channel conditions.

0:14:50 But that was the one thing you said: you did the alignment of the senones and then trained the states of the UBM?

0:14:56 The second one there, I guess, that's what it was used for: the supervised UBM, to get the alignments, with the senone labels coming through the CNN, and that was trained with keyword spotting data. But the UBM itself, I believe (I'd have to check this), was trained with data which is across all five languages.
0:15:15 So how much of an impact do you think that has, having changed datasets between the classifiers?

0:15:21 To that first question, I would have to say that you would think having five languages in the UBM, plus the out-of-set languages, would give it an advantage to some degree. But, as you said, the datasets are changing, so I think that would be a good point.
0:15:45 So, if you're going to do a very wide set of languages, do you have a hope of having sort of a single master universal DNN, like the Hungarian TRAPS that have been so successful in the past, or do you think you're going to have to build many different language DNNs?

0:16:00 What we were seeing so far is basically that the more language-dependent DNNs you put together, that you fuse together, the more the improvement reduces. So perhaps if you had five that covered a good space of the phones across the different languages, that might be what you might call the universal collection. That's appealing.

0:16:27 Right.