0:00:16 Okay, so in this work we basically test and evaluate our LSTM language recognition system in different scenarios. This is a system that we had already presented, and the objective is to test how it is affected by different scenarios.
0:00:44 First, the motivation: why we started using this architecture and how we started using it. Then we will give a brief overview of the LSTM; probably you will already be quite aware of this, but I guess it is nice to have some background. Then we will go into the details of the experiments: we will detail the system description, the reference i-vector system that we will compare our proposed system with, the different scenarios we are going to test, and the results. And finally we will conclude the work.
0:01:26 We all know what LID is: the process of automatically identifying the language of a given spoken utterance. Typically, for many years, this has been done relying on acoustic models, so these systems basically have two stages: first some i-vector extraction, and then some classification stage. In the last years we are seeing a really strong new line, which is deep neural networks, and it can be more or less divided into three different approaches.
0:01:58 One is the end-to-end systems: we have seen that it is a very nice solution, but we are not achieving the best results with it so far. Then we have the bottleneck features, where after computing them we go to the i-vector extraction and we keep the full pipeline. And then we have the senone posteriors — sorry for the typo there. In this paper we want to focus on the end-to-end approach; we want to improve the end-to-end approach.
0:02:29 This would be a very standard DNN for language recognition when we try to use an end-to-end approach. Basically we have some parameters as input, then we have one or several hidden layers with some nonlinearity, and in the last layer we try to compute the probability of each of the languages we are going to test; for this we use a softmax that gives us probabilities.
0:03:02 One of the main drawbacks of this system is that we need some context: if we try to get an output frame by frame, we are not going to get a very good result. So this system relies on stacking several acoustic frames in order to model the time context, and that has many problems. One, we have a fixed length that probably will not work best for all the different conditions; and it is quite a crude way of adding context.
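As a concrete illustration of the frame-stacking trick just described, here is a minimal sketch in Python; the feature dimension (39) and the ±10-frame window are hypothetical values for illustration, not taken from the talk:

```python
import numpy as np

def stack_frames(features, context=10):
    """Splice each frame with `context` left/right neighbours so a
    feed-forward DNN sees a fixed-length window of acoustic frames."""
    n_frames, dim = features.shape
    # replicate the first/last frame at the edges
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    stacked = np.zeros((n_frames, dim * (2 * context + 1)))
    for t in range(n_frames):
        stacked[t] = padded[t:t + 2 * context + 1].ravel()
    return stacked

feats = np.random.randn(100, 39)   # hypothetical: 100 frames of 39-dim features
X = stack_frames(feats, context=10)
print(X.shape)                     # one 819-dim input vector per frame
```

The fixed window (here 21 frames) is exactly the limitation the talk points out: it has to be chosen once, for all conditions.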
0:03:36 So how can we model this in a better way? The theoretical answer is recurrent neural networks. Basically we have the same structure as before, but this time we have recursive connections; all the rest is the same.
0:03:51 What is the problem with these? We have the vanishing gradient problem. Basically what happens is: in theory it is a very nice model, but when we try to train these networks, because of these recursive connections we end up having all the weights going either to zero or to something really high. There are ways to avoid this, but usually it is very tricky: it depends a lot on the task and on the data, so it is not really useful.
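The vanishing/exploding gradient behaviour described here can be illustrated with a toy scalar recurrence; this is a simplified sketch, not the actual networks from the talk:

```python
# Toy scalar recurrence h_t = w * h_{t-1}: backpropagating through T
# steps multiplies the gradient by w each step, so d h_T / d h_0 = w**T.
def gradient_magnitude(w, steps):
    return abs(w) ** steps

vanished = gradient_magnitude(0.9, 100)   # |w| < 1: gradient goes to zero
exploded = gradient_magnitude(1.1, 100)   # |w| > 1: gradient blows up
print(vanished, exploded)
```

With a hundred time steps the gradient is either negligible or enormous unless the recurrent weight sits exactly at 1, which is what makes plain RNNs so hard to train.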
0:04:22 And here is where the LSTM comes in. Basically, the LSTM takes first a standard DNN and replaces all the hidden nodes with this LSTM block that we have here.
0:04:38 So let's go to the theory of this block. It seems kind of scary when you see it at first, but it is pretty simple after you look at it for a while. We have a flow of information that goes from the bottom to the top, and as in any standard neural network we have a nonlinear function, this one here. The essential thing of the LSTM is that it has a memory cell, this one here.
0:05:15 All the other stuff that we have there are three different gates; what they do is they let — or they do not let — the information go through. Here we have the input gate: if it is activated, it will let the input of the new time step go forward; if it is not, it won't. We have the forget gate: what it does is basically reset the memory, so if it is activated it will reset that cell to zero; otherwise it will keep the state of the previous time step. And the output gate: that gate will let the computed output go to the rest of the network, or not.
0:06:08 And then what we have, of course, are the recurrent connections, so the output of one time step is the input of the next one. It is basically trying to mimic the RNN model, but in this case we avoid the vanishing gradient problem, because the gates work not only at the current time step but also across time: when we are doing the backpropagation and we have some error that is going to modify the weights, the forget gate and the input gate will also block that error from going back too many time steps, so we avoid that problem.
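A minimal sketch of one LSTM step with the three gates just described (biases and the peephole connections used in the actual system are omitted for brevity; the shapes and single-weight-matrix layout are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One time step of an LSTM block: input, forget and output gates
    around a memory cell.  A single weight matrix maps [x; h_prev] to
    the four pre-activations."""
    H = h_prev.size
    z = np.concatenate([x, h_prev]) @ W      # (4*H,) pre-activations
    i = sigmoid(z[0:H])                      # input gate: let new input in?
    f = sigmoid(z[H:2 * H])                  # forget gate: keep or reset the cell?
    o = sigmoid(z[2 * H:3 * H])              # output gate: expose the cell?
    g = np.tanh(z[3 * H:4 * H])              # candidate cell update
    c = f * c_prev + i * g                   # memory cell carries state over time
    h = o * np.tanh(c)                       # hidden state sent onward
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 8                                  # illustrative input/hidden sizes
W = 0.1 * rng.standard_normal((D + H, 4 * H))
h = c = np.zeros(H)
for _ in range(5):                           # unroll over five input frames
    h, c = lstm_step(rng.standard_normal(D), h, c, W)
print(h.shape)
```

The key line is the cell update `c = f * c_prev + i * g`: the additive path through `c` is what lets gradients flow across many time steps when the forget gate stays open.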
0:06:51 The system that we use for language recognition, then, does not rely on stacking acoustic frames, so we receive only one frame at a time. We will have one or two hidden layers, and the recurrent layer will be a unidirectional LSTM. We also include these peephole connections that we have here, which basically allow the network to decide things depending on time, so they are supposed to improve the performance of the memory cell. For the output we will use a softmax, just like in the DNN, with a cross-entropy error function.
0:07:34 And for training, what we do is the following. In the first scenario we will have a very balanced, nice dataset, so we do not need to do anything special; but in the more difficult scenarios we will have some imbalanced data. So what we do, in order to avoid problems with imbalanced data, is we just oversample: we take random excerpts of two seconds, so that we have six hours of each of the languages in every iteration, and the data that we see in every iteration is different.
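The oversampling scheme just described could be sketched as follows; the frame rate, the helper names and the one-minute demo budget (the talk uses six hours per language) are assumptions for illustration:

```python
import numpy as np

FRAMES_PER_SEC = 100          # assumed 10 ms frame shift
CROP = 2 * FRAMES_PER_SEC     # two-second excerpts, as in the talk

def sample_crops(utterances, target_frames, rng):
    """Draw random 2 s crops (with replacement) from a language's
    utterances until `target_frames` frames have been collected, so
    every language contributes the same amount per iteration."""
    crops, total = [], 0
    while total < target_frames:
        utt = utterances[rng.integers(len(utterances))]
        start = rng.integers(0, max(1, len(utt) - CROP))
        crops.append(utt[start:start + CROP])
        total += CROP
    return crops

rng = np.random.default_rng(0)
# a hypothetical low-resource language: three short utterances
utts = [np.zeros((int(n), 20)) for n in rng.integers(CROP, 5 * CROP, size=3)]
epoch = sample_crops(utts, target_frames=60 * FRAMES_PER_SEC, rng=rng)
print(len(epoch))             # number of 2 s crops in the demo budget
```

Because the crops are drawn randomly, every iteration sees a different balanced subset, which is the point the talk makes.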
0:08:05 Then, to compute the final score of an utterance, we will do the average of the softmax outputs, but taking into account only the last ten percent of the scores; I will explain why a little later.
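The utterance-level scoring rule (average of the frame-level softmax outputs over the last ten percent of frames) might be sketched like this; the shapes are illustrative:

```python
import numpy as np

def utterance_score(frame_posteriors, keep_fraction=0.10):
    """Average the per-frame softmax outputs, keeping only the last
    `keep_fraction` of frames, where the recurrent state has already
    seen most of the utterance and is therefore most reliable."""
    n = frame_posteriors.shape[0]
    start = int(np.ceil(n * (1.0 - keep_fraction)))
    return frame_posteriors[start:].mean(axis=0)

# hypothetical 3 s utterance at 100 frames/s, two target languages
post = np.tile([0.2, 0.8], (300, 1))
scores = utterance_score(post)
print(scores)                 # per-language scores from the last 30 frames
```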
0:08:20 And then, finally, we will use a simple multiclass linear logistic regression calibration.
0:08:27 We will compare the system to a reference i-vector system. It is very straightforward: it uses MFCC-SDC features, exactly the same features that we use for the LSTM. We use one thousand twenty-four Gaussian components for the UBM, and the i-vectors are of size four hundred. It is based on cosine distance scoring: it turns out that, depending on how many languages we have, this was working better than doing LDA, so that is why we decided to take cosine distance scoring. If we had more languages it would be better to use LDA, but the difference was small enough not to matter too much here. And this is a standard implementation, always trained with exactly the same data.
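Cosine distance scoring for the reference i-vector backend could look roughly like this; scoring against per-language mean i-vectors is an assumption about backend details the talk does not spell out:

```python
import numpy as np

def cosine_scores(w_test, language_means):
    """Score a test i-vector against each language by cosine similarity
    with that language's mean i-vector."""
    w = w_test / np.linalg.norm(w_test)
    M = language_means / np.linalg.norm(language_means, axis=1, keepdims=True)
    return M @ w

rng = np.random.default_rng(1)
means = rng.standard_normal((8, 400))              # 8 languages, 400-dim i-vectors
test = means[3] + 0.1 * rng.standard_normal(400)   # utterance near language 3
scores = cosine_scores(test, means)
print(int(np.argmax(scores)))
```

With more languages, projecting the i-vectors with LDA before scoring (as the talk mentions) would typically help, but cosine scoring keeps the backend training-free.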
0:09:19 So these are the three scenarios that we are going to use to compare and test this network. The first scenario is a subset of the NIST 2009 Language Recognition Evaluation; the data that we use come from the three-second condition. This is a subset with pretty benign conditions, like an easy set, so that the LSTM will work best. So it is a very easy subset of the 2009 evaluation. What we did is, first: there is an imbalanced mix of CTS and Voice of America data, so we dropped all the CTS data; that way we avoid that imbalanced mix and we also avoid a mismatch in training, so we have only one kind of data.
0:10:09 For the languages, we wanted to have a high amount of data, so we took only those languages that had at least two hundred or more hours. We also did not want to have unbalanced data, so we cut all those datasets to two hundred hours available for training, and that led to the subset we have here. These languages were not selected by difficulty, as we said before; they are just those that have these two hundred hours of Voice of America data. And we use only the three-second task because, historically, we saw that short durations are where the neural networks outperform the i-vector, so we wanted to be in that scenario.
0:11:00 The second scenario that we want to test is the dev set of the NIST Language Recognition Evaluation 2015. Here we do not avoid any of the difficulties, so we keep the mix of CTS and broadcast narrowband speech. We have seen the clusters of this set: it is twenty languages grouped into six clusters according to similarity, so it is supposed to be more challenging, because the languages are closer within a cluster. The amount of training data is also going to be quite varied: we have some languages with less than an hour and some languages with more than a hundred hours.
0:11:40 The split that we made is eighty-five percent for training and fifteen percent for testing. That is something we would not do again if we ran the experiments again; this is what we did at the time, before the eval set and everything, and we thought it would be nice to have more data for training. But afterwards we ran some experiments and found that having a little less training data but more dev data would help, although here we keep exactly what we used in the evaluation. As test, with that fifteen percent we took chunks of three seconds, ten seconds and thirty seconds, to mimic a little bit the conditions of the evaluation.
0:12:28 And then the third scenario will be the test set of the NIST Language Recognition Evaluation 2015, where we discover a broad range of speech durations — it is not fixed bins anymore — and where we have big mismatches between training and evaluation data, as we saw before.
0:12:47 So, the results. First, this is kind of a side result, it is not that important, but as we are using a unidirectional LSTM, the output at a given time step depends not only on the input at that time step but also on all the previous inputs; so the last output is always more reliable than the ones before. We thought that maybe we were hurting the performance by taking the first outputs, which are less reliable, so we started dropping the first outputs and seeing how that affected the performance.
0:13:28 For this plot we do not really care about the model we used here, only about how it improves; so the absolute equal error rate does not matter, only the relative difference. And we found that taking into account only the last ten percent would be close to an optimal point. We also saw that taking only the very last score, only one output of the softmax, was as good as taking the last ten percent; but we kept the last ten percent anyway.
0:14:03 So these are the results on the first scenario; remember that this is the one with only Voice of America languages and two hundred hours per language for training. Here we have the different architectures that we used: we tried both one hidden layer and two layers, and then different sizes of the hidden layer, from the smallest with two hundred fifty-six units to the biggest with one thousand twenty-four. This column is the size, in terms of number of parameters, of all the models, and these are the results that we obtained.
0:14:47 The reference i-vector system had almost seventeen percent equal error rate, with the corresponding Cavg, and we see that pretty much all the LSTM approaches clearly outperform it; moreover, all of them have a much smaller number of parameters. So those are really good results, but we are in this balanced, easy scenario. As we can see, the best system has something like a fifteen percent better error rate, with around an eighty-five percent gain in terms of size.
0:15:27 We also wanted to check how complementary the information extracted by the LSTM and the i-vector was, so we fused the best LSTM system with the reference i-vector, and the result was even better: around twelve percent, which is like fifteen percent better than the best single system. This is the confusion matrix; it does not carry much information, but we can see, not only in terms of accuracy but also in the comparison between languages, how we performed in this subset.
0:16:03 These are the results on the dev set of the Language Recognition Evaluation 2015. For this one we did not run experiments with different architectures — we were in a bit of a hurry — so we used only the best system from the previous scenario, which was two hidden layers of size five hundred twelve. What we can see here is that the LSTM is much better than the i-vector on three seconds, while on thirty seconds, in this scenario where we have these mismatches between the databases and this imbalance in the datasets, the end-to-end system is not as good. So we still see results like before — on short utterances we always outperform the i-vector, because this end-to-end approach is able to extract more information from short segments — but not so much for longer ones.
0:17:03 The good thing that we see here is that even though the results for longer durations are way worse than those of the i-vector, the fusion is pretty much always better than any of the single systems. So even when the LSTM is working worse than the i-vector, we are able to extract different information that helps in a final fused system, so we were also quite happy with these results. This is the DET curve that we have for three seconds, where we can see that the LSTM outperforms the i-vector with over twenty percent relative improvement, and we see also that the fusion always works better than any of the single systems.
0:17:50 And now we go to the results on the actual test set of the Language Recognition Evaluation, and here things get much worse. First of all: the first column is the LSTM, the second column is the i-vector, and the third one is the fusion of both — the non-cheating one, the one we used for the submission. The fourth one is exactly the same but using a kind of cheating fusion: we use a two-fold cross-validation, so we train the fusion on half of the test set and apply it on the other half. Of course that is not allowed in the evaluation, but we wanted to know whether the systems were learning complementary information or not — whether, if we fused them in a good way, we could extract that complementary information.
0:18:51 The message to take from here is that, first, end-to-end learning in this very hard scenario is able to get results comparable with the i-vector, but then it gets much worse as the duration increases, because the i-vector is able to extract better results while the LSTM stays at the same performance. The good thing is that, when we do not have such a big mismatch and we are able to train a good fusion, we can still, even in this harder scenario, improve with the fusion on the performance of the i-vector.
0:19:37 To conclude the work, basically the main take-away messages are these. First of all, on a controlled, balanced scenario we have very promising results: it is a very simple system, with eighty-five percent fewer parameters, and it is able to get a fifteen percent relative improvement. The problem is that once it goes to an imbalanced, more realistic scenario, the results are not as good. And finally, we saw that with strong mismatches, in the hardest scenario, we are not able to extract the information as well, so there is a need for variability compensation. But we still think that it is a really promising approach, in that it leads to simpler systems that can get quite good results.
0:20:38 We have time for lots of questions.
0:20:50 Just a small comment: you said that you are averaging the outputs over the last ten percent of frames, and you are always using ten percent, for the three-second test as well as for the thirty-second test. Did you try to just average over the thirty last frames, independently of the duration of the utterance?
0:21:11 Actually, not for these harder scenarios, but for the easy ones we tried a lot of things: not only averaging but also, for instance, taking the minimum, or selecting based on only one output, or just dropping all the outputs that are outliers. And we found that it did not really change much — it did not matter there. But maybe in a more challenging scenario it would be worth trying; we have not tried.
0:21:51 Is it possible to go back to slide twenty-four for a second? Sorry. I noticed you are always getting an improvement, I guess, when you go to the LSTMs versus the i-vector; but when you look at the English case, I think, when you went to the fusion, the fusion actually did worse than the i-vector system — three point eight six or eight seven, while the i-vector had one point nine. That is the only one where you did not get an improvement. Was there a reason why — did you see maybe why it happened? Maybe your LSTM actually had worse performance there; I am guessing it is because of the LSTM system itself.
0:22:36 So, I am not completely sure, but we have some idea of why that happened. The idea is that for training the systems, what we did is this oversampling; and in English there was one dialect — I think it was British English — that had only half an hour of data. In order to train the LSTM, that of course hurt it — it has worse results — but I think it also hurt the fusion. When you have a language that has less data, for the DNN, for the LSTM, you can more or less fix it with oversampling; but for the fusion you usually need much less data in general, so in all the other clusters that is not a problem, because even if they are imbalanced, for calibration you still have enough of all of them. For the English one, I think the issue is that we did not have enough data for calibrating — for training the fusion. So I think that was the reason: the fusion is not well trained because of not having enough data for one of the languages.
0:23:54 I have a question. I found it quite interesting that your LSTM has fewer parameters than the i-vector system, and I am wondering about the time complexity: how long does it take to train, and what about test time, compared to the i-vector system?
0:24:16 The training time is much longer, because we had a lot of iterations; I think that is also because of the way we trained it — we use a different subset per iteration, so we need a lot of them. Actually, I think the numbers we have are not the best we could get, because this was during an evaluation, so some of the networks were still improving when we had to stop them, and we just ran them as they were. So in training time it is slower, even though it has much fewer parameters; but at testing time it is way faster. And of course, one thing is that once you have the network trained, you only need to do the forward pass, while with the i-vector, every time you have new data you always have to extract the i-vector before doing the scoring.
0:25:10 Any more questions? So then there is lots of time for coffee, I guess; we will be back at five forty for the special session on target speakers.