0:00:15 My name is Pavel Matejka, and I will be talking about neural network bottleneck features for language identification. I did this work during my postdoc at BBN.
0:00:29 First I will talk about the DARPA RATS program, which is what I tested on, so it is a noisy condition. Then I will talk about the neural network bottleneck features, and then about their application to language identification.
0:00:48 So, the DARPA RATS program. I think it has already been introduced, so I would just like to give you a taste of the RATS data; unfortunately there are not enough rats for all of you to taste, so instead I will play some audio samples.

[audio samples played]

0:01:24 So you get some impression of how noisy it is.
0:01:29 So, the bottleneck features. The term bottleneck refers to a neural network topology where one hidden layer has a significantly lower dimension than the surrounding layers. In my case I used 80 dimensions for the bottleneck and 1500 for the surrounding layers. What it actually does is a kind of compression, and this compressed information can then be used in other ways than just inside the neural network.
0:01:59 The idea comes from speech recognition, where these features are usually used either alone or in conjunction with the baseline features, which would be, for example, MFCCs. What I actually used is called the stacked bottleneck, where I have two neural networks trained in a row, both with bottlenecks. The second neural network takes its input from the bottleneck of the first one, stacked in time: five frames with a five-frame shift. This was shown by the BUT guys to be very good for speech recognition, so I used the same setup; they had already tried different numbers of frames, different shifts and so on, so we did not have to tune this.
0:02:52 Here is the topology of the bottleneck network. For the first net I used frequency-domain linear prediction coefficients plus fundamental frequency as input; actually, if we use a block of log mel-filterbank outputs instead, it gives about the same results. Then there are hidden layers of 1500, 1500, 80 in the bottleneck, and 1500, followed by the targets. The targets for me were states of context-dependent clustered quinphones. Usually, like the BUT guys, you would use triphones; I used quinphones because BBN is using quinphones.
0:03:34 The second net has about the same topology, just the input is different: I have five frames of the bottleneck stacked in time, so it is five times 80, which is 400 dimensions; otherwise it is qualitatively the same.
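To make the data flow concrete, here is a minimal numpy sketch of the stacked-bottleneck cascade just described. The 1500-unit hidden layers, the 80-dimensional linear bottleneck, and the stacking of five bottleneck frames with a five-frame shift follow the talk; the input dimensionality, the number of targets, and the weights themselves are random placeholders, so this only illustrates shapes and data flow, not a trained system.

import numpy as np

rng = np.random.default_rng(0)

def make_net(dims):
    # Random placeholder weights for a feed-forward net with the given layer sizes
    # (no training is done here; this only illustrates shapes and data flow).
    return [rng.standard_normal((d_in, d_out)) * 0.01
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward_to_bottleneck(net, x, bn_index):
    # Propagate frame-level input x (frames x dim) up to the bottleneck layer.
    # Hidden layers use sigmoids; the bottleneck itself is kept linear, as in the talk.
    h = x
    for i, w in enumerate(net[:bn_index + 1]):
        h = h @ w
        if i < bn_index:
            h = 1.0 / (1.0 + np.exp(-h))
    return h

n_targets = 3000   # assumed number of context-dependent state targets (not from the talk)
n_input = 360      # assumed dimensionality of the stacked input features (not from the talk)

# First net: input -> 1500 -> 1500 -> 80 (linear bottleneck) -> 1500 -> targets
net1 = make_net([n_input, 1500, 1500, 80, 1500, n_targets])
# Second net: 5 bottleneck frames taken with a 5-frame shift -> 400-dim input
net2 = make_net([5 * 80, 1500, 1500, 80, 1500, n_targets])

def stacked_bottleneck(features):
    # features: frames x n_input block; returns the 80-dim stacked-bottleneck outputs.
    bn1 = forward_to_bottleneck(net1, features, bn_index=2)           # frames x 80
    n = bn1.shape[0]
    offsets = np.array([-10, -5, 0, 5, 10])                           # 5 frames, 5-frame shift
    idx = np.clip(np.arange(n)[:, None] + offsets, 0, n - 1)
    stacked = bn1[idx].reshape(n, -1)                                 # frames x 400
    return forward_to_bottleneck(net2, stacked, bn_index=2)           # frames x 80

print(stacked_bottleneck(rng.standard_normal((100, n_input))).shape)  # (100, 80)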
0:03:53 For RATS we have two languages that were transcribed, Farsi and Levantine Arabic. You can see the number of hours the nets were trained on and the number of targets used for each system.
0:04:10 So let us move to language recognition. The task is language identification as it was already introduced: five target languages plus an out-of-set class, different durations, and, as you heard, it is quite noisy, so I will just skip this slide.
0:04:28 Here is my baseline system description. I use PLPs, nine PLP coefficients, with short-time Gaussianization; usually you do not see a benefit from this for language ID, but for these noisy conditions it actually helps. We take a block of eleven frames, stack them together, and project them to sixty dimensions with HLDA.
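A small numpy sketch of that stacking-and-projection step follows; the feature dimensionality is illustrative, and PCA is used only as a stand-in for the HLDA projection, since a real HLDA estimate would also need per-frame class labels.

import numpy as np

def stack_frames(feats, context=5):
    # Stack +/- `context` neighbouring frames: frames x d -> frames x d * (2 * context + 1).
    n = feats.shape[0]
    offsets = np.arange(-context, context + 1)
    idx = np.clip(np.arange(n)[:, None] + offsets, 0, n - 1)
    return feats[idx].reshape(n, -1)

def project(feats, out_dim=60):
    # Reduce to `out_dim` dimensions. PCA is only a stand-in here: estimating a real
    # HLDA (or LDA) transform would additionally need per-frame class alignments.
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T

rng = np.random.default_rng(0)
plp = rng.standard_normal((200, 9))               # 200 frames of stand-in "PLP" features
reduced = project(stack_frames(plp, context=5))   # 11 stacked frames -> 60 dimensions
print(reduced.shape)                              # (200, 60)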
0:04:54 As you will see in the next slide, I tried different coefficients to compare. I used a UBM with 1024 Gaussians, the i-vector was 400-dimensional, and the final classifier was a neural network; we found it was the best for this kind of task, but you have to do some tricks, which are described in the paper.
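To make the i-vector step concrete, here is a minimal numpy sketch of MAP i-vector extraction from a diagonal-covariance UBM. The UBM and the total-variability matrix are random placeholders and the sizes are scaled down from the 1024-Gaussian, 400-dimensional setup mentioned in the talk, so it only illustrates the computation, not the trained system or the neural-network classifier on top.

import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the talk used 1024 Gaussians and 400-dimensional i-vectors.
C, D, R = 8, 60, 20          # Gaussians, feature dimension, i-vector dimension

# Placeholder UBM parameters and total-variability matrix (random, untrained).
weights = np.full(C, 1.0 / C)
means = rng.standard_normal((C, D))
variances = np.ones((C, D))                   # diagonal covariances
tv = rng.standard_normal((C, D, R)) * 0.1     # per-Gaussian slices of the T matrix

def extract_ivector(x):
    # MAP point estimate of the i-vector for one utterance x (frames x D).
    # Frame posteriors under the diagonal-covariance UBM.
    logp = (-0.5 * (((x[:, None, :] - means) ** 2) / variances).sum(-1)
            - 0.5 * np.log(2 * np.pi * variances).sum(-1)
            + np.log(weights))
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)              # frames x C

    # Baum-Welch statistics.
    n_c = gamma.sum(axis=0)                                # zero-order stats
    f_c = gamma.T @ x - n_c[:, None] * means               # centred first-order stats

    # Posterior precision and mean of the latent factor.
    precision = np.eye(R)
    rhs = np.zeros(R)
    for c in range(C):
        t_scaled = tv[c] / variances[c][:, None]           # Sigma_c^{-1} T_c
        precision += n_c[c] * (tv[c].T @ t_scaled)
        rhs += t_scaled.T @ f_c[c]
    return np.linalg.solve(precision, rhs)

print(extract_ivector(rng.standard_normal((300, D))).shape)   # (20,)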
0:05:27 Here is the slide with the baseline results for four different feature extractions. We focused on the three-second and ten-second conditions, because the 120-second condition was so good that it did not make sense to look at it, and the thirty-second one was also good after the fusion, so we mainly focused on these two conditions.
0:05:50 As you can see, the MHEC coefficients from UT Dallas were the best for the ten-second condition and PLPs were the best for the three-second condition. The rest were the MFCC features which we had been using for the NIST evaluations, and the features which were the best for us in the 2013 RATS evaluation. So these are the baseline features, the conventional acoustic features.
0:06:21 Before I present the results with the bottleneck features, let me talk about the prior work. The prior approaches mainly used context-independent phonemes, which makes quite a lot of difference, as we will see later. In 2013, for the RATS evaluation, Jeff Ma from BBN used context-independent phonemes, trained on Levantine Arabic, with dimensionality 39. He took the log of these posteriors and simply stacked it onto the block of PLPs from the baseline, and then all of this was projected back to sixty dimensions with HLDA. He got pretty good results; it is essentially a feature-level fusion.
0:07:11 Another idea, from Mireia Diez, is the so-called phone log-likelihood ratio (PLLR) features. She takes the phone posteriors, takes the log, and computes the likelihood ratio between them. Usually she appends deltas, sometimes uses PCA to reduce the dimensionality, and then fuses it with the PLPs. Before Christmas she was at BUT and she was working on RATS as well, so we could compare these features.
0:07:43 These features were indeed better than the baseline features, and they were also better than the phonotactic system: they built a conventional phonotactic system as well, and the PLLR features were much better, so the conventional phonotactic system did not make it into the fusion, while these features did.
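A minimal sketch of the PLLR computation described above, assuming the common formulation in which each dimension is the log-ratio of one phone posterior against the average of the competing posteriors; the posteriors are random stand-ins and the delta computation is a crude approximation.

import numpy as np

def pllr(posteriors, eps=1e-10, add_deltas=True):
    # Phone log-likelihood ratio features from frame-level phone posteriors (frames x P):
    # each dimension is log of p_i against the average of the competing posteriors.
    p = np.clip(posteriors, eps, 1.0 - eps)
    n_phones = p.shape[1]
    llr = np.log(p * (n_phones - 1) / (1.0 - p))
    if add_deltas:
        llr = np.hstack([llr, np.gradient(llr, axis=0)])   # crude delta approximation
    return llr

# Toy usage: random stand-in posteriors, 100 frames over 40 phones.
rng = np.random.default_rng(0)
raw = rng.random((100, 40))
post = raw / raw.sum(axis=1, keepdims=True)
print(pllr(post).shape)          # (100, 80)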
0:08:07 During the review process, one of the reviewers told us that there was a very similar work submitted to IEEE Electronics Letters at the end of 2013. It was by Song, and it was applied to clean data, the NIST 2009 data. Then, in a presentation in 2014, I guess (it is actually not in the paper, only in the presentation), Ignacio Lopez-Moreno from Google presented bottleneck-style features, but his neural network, his DNN, is actually trained to produce the posterior probabilities of the target languages, not of phonemes. So it might open a new field of data-driven features.
0:09:03 So let us go to the results. Here are again the four baseline feature sets. Then, if I take the log posteriors that come out of the neural network, just one frame this time, and simply build the i-vector system on them, you can see that it is already better than any of the baselines. Then what I did was take a block of eleven frames of these posteriors, stack them together, and project them with HLDA to sixty dimensions, and you can see that this is quite a bit better than just one frame, which means the context is very important. And then this is what Jeff Ma did: the baseline features plus one frame of the log posteriors, projected with HLDA to sixty dimensions. You can see that this is also good, but it is already something like a fusion of two systems.
0:10:10 So how do the bottleneck features perform? Again, this is just one frame; I tried more frames as well, but it did not help for me. So, one frame of bottleneck features, with dimensionality 80. This row is when I take the bottleneck from the first neural network, and this one is the stacked bottleneck, the bottleneck from the second neural network. You can see that both of these systems are quite a bit better than any of the baselines, and it actually makes sense to use the stacked bottleneck architecture, because you gain something from it. Why am I taking just one frame here? It might be that for the stacked bottleneck features, where the stacking is already done between the nets, the context is already inside.
0:11:05 Then I have some analysis slides. The first thing was obviously to try to tune the bottleneck size. For speech recognition they usually use 80, so I took 80 as the baseline and then tried to vary the bottleneck size, but as you see, 80 was the best. If you move away from it, for example to 60, it stays about the same or starts to degrade, so I stuck with 80, because it was the baseline for me.
0:11:40 The other thing I was interested in was what the targets for the neural network should be. We trained it with context-dependent phonemes, but how does it work with context-independent ones? It is much easier to train the system with context-independent phonemes than with context-dependent ones, because we do not need to build the LVCSR system, the training of the neural network is much faster, and so on. But if you look at the results, it is clearly better to use the context-dependent phones. I think it is because of the finer modeling of the phonetic space in this feature space.
0:12:27 Then, as I said at the beginning, we have two languages that we have transcriptions for, Farsi and Levantine Arabic. So I produced two sets of features, one trained on Farsi and one on Levantine Arabic, and you can see that they perform about the same. Actually, as you will see on the final slide, for the evaluation we needed to choose just one, and I chose the Levantine one, because it is just slightly better. You would hardly see a difference on the test side, but in training, Farsi has a much higher number of targets, many more context-dependent phones, so the training took more time; so for training convenience the Levantine one won.
0:13:26 Then, in 2013, what Jeff did for the RATS evaluation was a kind of fusion of several systems, language-dependent systems, as explained in this picture. What is the language dependency? Usually we have just one UBM and one i-vector extractor, and they are trained on the same data, which is usually all the data we have. What we did instead was to train the GMM on a single language, let us say just on Dari, or Farsi, or Pashto, or Urdu, while the i-vector extractor and everything after it was trained on all of the data. At the end we took just a simple average of the scores; we did not want to train a fusion here, because it means more parameters, and because the final fusion was then trained on top of these systems anyway.
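A minimal sketch of that score-level combination: each language-dependent subsystem produces its own score matrix and the result is just their unweighted average, with no trained fusion. The numbers of subsystems, utterances and languages below are illustrative.

import numpy as np

def average_scores(score_matrices):
    # score_matrices: list of (utterances x languages) arrays, one per
    # language-dependent subsystem; the combination is a plain unweighted average.
    return np.mean(np.stack(score_matrices, axis=0), axis=0)

# Toy usage: six language-dependent subsystems, 4 utterances, 5 target languages.
rng = np.random.default_rng(0)
subsystem_scores = [rng.standard_normal((4, 5)) for _ in range(6)]
print(average_scores(subsystem_scores).shape)    # (4, 5)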
0:14:27 Personally I do not like that structure very much, because the complexity of the system grows quite a lot, but I think it takes advantage of the different alignments of the different UBMs. So how do the results look? The first line is the baseline, where we train the UBM on all languages, on all the data.
0:14:53 The next six lines are the separate systems, where we train the UBM only on one particular language. If you look at the results, none of these beats the baseline, which is kind of expected. But when you take the average of these scores and score with that, you can see that there is a very nice benefit from doing this: something like 15 percent for the three-second condition and 25 percent for the ten-second condition.
0:15:21 We also did the same on the channel level: in RATS there are eight (it should be nine) channels plus the source channel, so I built the same kind of separate systems per channel. They perform about the same, and when I then take the average over all of them, it is also about the same, so some separation helps up to a certain point. It would be good to try the DNN-based alignment we heard about, which might give a similarly different alignment, and look at this there as well.
0:16:01 So let us look at the final slide, with the fusion. The first line is the PLP system. Then I have a fusion of three systems: the stacked bottleneck systems trained on Farsi and on Levantine Arabic, plus the feature-level fusion with the acoustic system, and you can see that there is about a 30 percent improvement. Then I wanted to compare what happens if we make all the systems language dependent. We saw something like a 25 or 30 percent improvement from the language-dependent systems, and here we can see that the fusion still gains the same amount as when we do not use the language dependency, which is very nice. So there is about 30 percent from the fusion over the single best system, whether you do the language dependency or not.
0:17:00 Then one of the reviewers, maybe it was a reviewer from SRI, suggested doing something with the direct language posteriors as well. Also, after the RATS evaluation I exchanged some emails about this and was encouraged to try it, so what we added is the blue stream in the picture. It was quite easy for me to try: I did not take the bottleneck here, I just used the entire network, took the posteriors at its output, fed them directly to another MLP to produce the scores, and then I could fuse them.
0:17:43 You can see that, for me, this posterior system was worse than the stacked bottleneck with i-vectors. But yesterday I compared the results with Mitch, and their CNN posterior system is actually a little bit better than my system here. We talked about it a little; it might be because the CNN behaves much better than the DNN in noisy conditions, which we would need to try. In any case, the fusion of these two approaches is very nice.
0:18:21 To conclude: the bottleneck features provide a very nice gain. They compete very nicely with the conventional phonotactic system that we built before; actually, they are much better. As I said, for the RATS evaluation this year we also had phonotactic systems, and none of them made it into the final fusion. There are much bigger gains for longer audio files. And, as I mentioned, what Ignacio Lopez-Moreno showed, training the net directly on the task with the languages as targets instead of training it for a bottleneck, might open a new space for data-driven feature extraction. Thank you.
0:19:26 Thank you, Pavel. Do we have any questions?
0:19:31 How did you train the neural networks, and how long did it take?
0:19:35 For this task I used the BBN training tools, which is stochastic gradient descent on GPUs, and each net was trained for about three days; I have two nets, so it took about a week to train both.
0:19:55 What activation function was it?
0:20:01 I do not remember exactly which activation function is in the hidden layers, but I know that for the bottleneck it is a linear one; there was a linear activation function in the bottleneck layer, which was also shown to give better results for speech recognition. The exact configuration is in the paper; I am sorry if I deleted it from the slides.
0:20:25 The same question Mitch asked: can you tell us what was used to train your ASR and your DNN? The same data? All the channels?
0:20:34 Yes, all channels. And the DNN for the bottleneck features was trained on the keyword-spotting data, so it is different data from what the UBM and the rest of the system were trained on.
0:20:48 OK, so you also used different datasets there. So the question is: what is your sense of the sensitivity of this? For the DNNs it seems like you start with a good ASR system, label your data with it, and then train the DNN. Maybe people in other places have experience with this: how sensitive do people think it is to starting off with a very good alignment when you train the DNN? Do you have a sense of that, from this work or otherwise?
0:21:22 That is hard to say; I do not know exactly how good the LVCSR system really needs to be. What I like is what Ignacio Lopez-Moreno is doing: you do not actually need the ASR system at all, you can use the language ID data directly and train the neural network on the language posteriors, so you use the same data as for the rest of the system.
0:21:52 I played with that a little bit as well, with this kind of bottleneck, because if you do what he was doing at the JHU workshop, you train the net, the DNN, to produce the languages directly as targets, and then, since you have the posterior probability of each language for every frame, you need to do some averaging over time. What he did was simply average them, which is good for three seconds but is not good for ten seconds. So what I did was take exactly these posteriors, or rather the output of the layer before, as features, and build i-vectors on them, and then it helps; and for that I would have a much smaller i-vector system. So that might be one way to avoid building the LVCSR.
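A minimal sketch of the frame-averaging step discussed here, assuming a network that outputs per-frame language posteriors; the posteriors are random stand-ins, and the point is only that the utterance-level decision comes from a plain average over frames, which is the part that works for short segments but less well for longer ones.

import numpy as np

def utterance_language_scores(frame_log_posteriors):
    # Average frame-level language log-posteriors into one utterance-level score vector.
    return frame_log_posteriors.mean(axis=0)

# Toy usage: 300 frames, 5 target languages, random stand-in posteriors.
rng = np.random.default_rng(0)
raw = rng.random((300, 5))
post = raw / raw.sum(axis=1, keepdims=True)
scores = utterance_language_scores(np.log(post))
print(scores.argmax())           # index of the language picked for this utterance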
0:22:55 Just to follow on, in response to Doug's question about the keyword-spotting data, which was transmitted at a different time from the language ID data: one thing I observed in the speaker ID data was that a retransmission at a different time changes things, because the atmosphere and the transmission effects change, so the channel is varying over time. So in one regard this keyword-spotting data kind of has different channels from the language ID data, even though it is theoretically the same equipment doing the transmission; there is a different effect coming through. So it is nice to see that it still works despite that difference.
0:23:32 A similar question: for instance, on the clean SRE data we are seeing a difference, a problem, when trying to classify microphone trials when most of the data you trained your network on is telephone speech. One of your last statements was that the bottleneck features are great even in noisy conditions, but of course you had very matched data here. Do you have any theories about how the bottleneck features might behave in mismatched conditions? I ask because our system appears sensitive to it, and I wonder if the bottlenecks might be a little more resistant just because of the compression factor.
0:24:12 I think it will depend on the training data for the DNN. On another task we had only clean data for training the DNN, so we asked ourselves: what do we do if the test data is noisy? We took thirty percent of the training data and artificially added noise to it, and that helped a lot; the DNN then saw something of the noisy conditions.
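A minimal sketch of that kind of noise augmentation, assuming raw waveforms stored as numpy arrays; the 30 percent fraction follows the talk, while the noise source and the 10 dB SNR are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def add_noise(signal, noise, snr_db):
    # Mix `noise` into `signal` at the requested signal-to-noise ratio (1-D float arrays).
    noise = np.resize(noise, signal.shape)
    sig_pow = np.mean(signal ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

def augment_corpus(utterances, noise, fraction=0.3, snr_db=10.0):
    # Corrupt a random `fraction` of the training utterances and keep the rest clean.
    n_noisy = int(round(fraction * len(utterances)))
    noisy_idx = set(rng.choice(len(utterances), size=n_noisy, replace=False).tolist())
    return [add_noise(u, noise, snr_db) if i in noisy_idx else u
            for i, u in enumerate(utterances)]

# Toy usage: white-noise arrays as stand-ins for real utterances and for the noise source.
corpus = [rng.standard_normal(16000) for _ in range(10)]
noise = rng.standard_normal(16000)
print(len(augment_corpus(corpus, noise)))    # 10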
0:24:46 A somewhat related question: if you have to handle very many languages, could you imagine having one universal recognizer system for the DNN, or do you think you would have to build many?
0:25:01 I think people would need to build at least a few DNNs. I think, Mitch, you said that you tried the Farsi one and the Levantine Arabic one and then a universal one, right? You might comment on whether it was better than the two separate ones or than the fusion of those two.
0:25:23 We had someone in our lab construct a multilingual dictionary between these two languages, and that was the best of the three systems that we tried. But we also found that the fusion of all three was best; in fact, our primary system was the fusion of those three CNN systems, together with three CNN i-vector systems for those languages, all with one language ID feature set.
0:25:47 If you remember the distinction, the DNN is tied to a certain language while the language ID feature set stays the same, so we just maintained one language ID feature set while the CNN changed, and that was a very good fusion. In terms of the SRE or LRE data, we found that having the multiple languages helps: if you get good scores across the different phone sets, that is when it starts to converge.
0:26:16 So I think that you would need a few systems, not many, maybe three or four, and it would be better than having one universal one.

0:26:30 Okay.