0:00:15 kind of the transition from the systems in the previous session into the
0:00:21 dnn's and what
0:00:22people do deeply we all could have
0:00:25no i don't think so
0:00:29 we all could have presented in both but i think this is a good transition "'cause"
0:00:32 we did have some kind of new things that we did that i wanna talk about
0:00:37 this is work with my colleagues greg sell and daniel from johns hopkins both
0:00:41 of whom were unable unfortunately to get spousal permission to attend this workshop but
0:00:47 they have good excuses greg's wife had their second child two weeks ago and daniel's
0:00:51 is due in about two weeks so
0:00:54they have a reason
0:00:57 so i'm gonna present an overview of the dnn i-vector system that we
0:01:02 submitted to lre fifteen
0:01:04 i wanna here give a shout out to nist for introducing this fixed training data condition
0:01:10 which actually allowed us to make a very competitive system with only three people which
0:01:15 is not very common for us historically
0:01:20 the approach that we used algorithmically i'll go into more detail but we used
0:01:25 dnn's unlike some of the previous presentations you've seen we were able to
0:01:30 get good performance not just with the bottleneck features but also with the dnn state
0:01:35 labels i'll talk about that
0:01:38 we used three different kinds of i-vectors i'll explain that more but
0:01:43 everyone had acoustic systems and those are very good we were able to do quite well
0:01:47with the phonotactic i-vector system as well and here we're trying for the first time
0:01:52a joint i-vector which does both things at once
0:01:56 because we had a fairly powerful system that we were comfortable with and we didn't
0:02:02 trust that we had enough development data we used i think the simplest and
0:02:07 most naive fusion of anybody and it seemed to work for us because we actually
0:02:10 got a fusion gain which i think also made us one of the few
0:02:14 and that was just to sum the scores together and then scale "'em" with the
0:02:18 duration model that i'll talk about
0:02:21 and lastly as i think it's been mentioned but i wanna go into it a little
0:02:24 bit more because this was a limited data task data augmentation turned out to be
0:02:29very helpful for us
0:02:33 so in the talk i'll go through our basic i-vector system design and talk
0:02:38 about the two ways that we used the dnn's that have both been
0:02:41 touched on previously today
0:02:43 and i'll talk about the two alternate i-vectors that we experimented with
0:02:50 i'll talk more specifically about the lre fifteen task and how we used the data
0:02:54 and what we learned later about how we could have used the data
0:02:59 and in trying that i'll talk about the results that we had in the submission
0:03:02 and some interesting things that we've learned since both about what other systems could have
0:03:07 done and also how we could've done better with the systems that we built
0:03:13 so here's a basic block diagram of
0:03:16our lid system
0:03:21 it's a little i-vector system so it can be split into two parts the first
0:03:24 uses the unlabeled data to do the ubm and the t matrix learning
0:03:29 and then the supervised system is basically the two covariance model
0:03:34 within class and across class covariance that's first used in lda to reduce the dimension and
0:03:39 then the same matrices are used for the gaussian scoring following on after that
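the two-part pipeline just described might be sketched roughly like this in numpy — a toy illustration of a two-covariance model with lda and shared-covariance gaussian scoring, not the actual lre system; all names and shapes are made up:

```python
import numpy as np

def two_cov_train(X, y, n_classes, lda_dim):
    """Toy two-covariance training: per-class means, within/across-class
    covariances, an LDA projection, and the pieces needed for
    shared-covariance gaussian scoring."""
    mu = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    W = sum(np.cov(X[y == c].T, bias=True) * np.sum(y == c)
            for c in range(n_classes)) / len(X)      # within-class
    B = np.cov(mu.T, bias=True)                      # across-class
    # LDA directions: leading eigenvectors of W^-1 B
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-np.real(evals))[:lda_dim]
    A = np.real(evecs[:, order])
    return A, mu @ A, A.T @ W @ A

def gaussian_scores(x, A, means, W_lda):
    """Per-class log-likelihoods under gaussians sharing one covariance."""
    z = x @ A
    d = z - means
    Winv = np.linalg.inv(W_lda)
    return -0.5 * np.einsum('ij,jk,ik->i', d, Winv, d)
```

note how the same within-class matrix serves both the lda dimension reduction and the scoring, which is the point being made in the talk.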
0:03:45 as we've done for a while rather than having a separate back end to do the work
0:03:48 we do a discriminative refinement of these gaussian parameters
0:03:53to produce a system that not only performs a little bit better but also produces
0:03:58naturally calibrated scores
0:04:00 and we do that in a two-step process first we learn a scale factor for
0:04:05 this within class covariance
0:04:07 and then we go into all the class means and adjust them to better
0:04:10 provide the discriminative power and there we're using the mmi algorithm from gmm training
0:04:17 in a really simplified mode
0:04:19 and of course that's the same criterion as the multiclass cross entropy that everybody
0:04:23 uses every day
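conceptually the mean-adjustment step might look like this toy numpy sketch — here a spherical shared covariance stands in for the real within-class covariance and plain gradient descent on the multiclass cross entropy stands in for the actual mmi update, so treat it as an illustration of the criterion only:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def refine_means(X, y, means, alpha=1.0, lr=0.05, steps=200):
    """Toy discriminative refinement: class scores are scaled negative
    squared distances (a spherical stand-in for shared-covariance
    gaussian scoring); descend the multiclass cross entropy w.r.t.
    the class means."""
    means = means.copy()
    onehot = np.eye(len(means))[y]
    for _ in range(steps):
        d = X[:, None, :] - means[None, :, :]          # (n, C, dim)
        scores = -0.5 * alpha * (d ** 2).sum(-1)       # (n, C)
        p = softmax(scores)
        # gradient of mean cross entropy w.r.t. each class mean
        g = alpha * np.einsum('nc,ncd->cd', p - onehot, d) / len(X)
        means -= lr * g
    return means
```

the scale factor alpha plays the role of the learned within-class covariance scaling, which is what makes the resulting scores come out calibrated rather than needing a separate back end.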
0:04:28 so
0:04:29 let me talk more about how we use the dnn's other people mentioned it but
0:04:33 let me have some pictures so you can see better what we're
0:04:36 doing splitting up the normal use of the gmm to do the alignment and then compute
0:04:40 the stats after the fact
0:04:42 from that
0:04:43 we're splitting it out in two ways and using the dnn's the first is simply
0:04:47 to replace the mfccs with bottleneck features
0:04:51 from the dnn and we're just using a straightforward bottleneck not anything fancy
0:04:56and then the
0:04:58second system
0:04:59 is a little bit more complicated we use the dnn to generate the frame
0:05:03 posteriors for the senones or the clustered states
0:05:06 that's used to label the data and do the alignment and then you use the
0:05:10 ubm after that
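the split being described is easy to see in code — the sufficient stats are computed the same way either way and only the source of the per-frame posteriors changes; a minimal numpy sketch with illustrative names:

```python
import numpy as np

def sufficient_stats(feats, post):
    """Zeroth- and first-order stats for i-vector training/extraction.
    feats: (T, D) frames; post: (T, C) per-frame posteriors -- from the
    ubm in the classic recipe, or from the dnn senone output in the
    hybrid recipe. Swapping the posterior source is the only change."""
    N = post.sum(axis=0)        # (C,) soft counts
    F = post.T @ feats          # (C, D) first-order stats
    return N, F
```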
0:05:15 i didn't have time to draw a dnn but this is daniel's best rendition of
0:05:20 probably a dnn a couple of things that are perhaps particular
0:05:23 about our system or about the kaldi way of doing things
0:05:27 which by the way we do highly recommend
0:05:30 is it uses this p-norm which is kind of like max pooling so there
0:05:33 is an expansion and a contraction made at each layer that's how the
0:05:37 nonlinearity comes in there
0:05:40 what else i think probably nobody does these days but we're not using fmllr
0:05:43 which i think is common
0:05:45for our purposes
0:05:48 you can see we basically use the same architecture either for the senone posteriors
0:05:52 or we introduce the bottleneck for the one that's just gonna be the bottleneck
0:05:56 system
0:05:57 that's the little linear layer before the
0:06:01 one in the middle there
0:06:06we have
0:06:07 about nine thousand output states so it is a pretty big ubm that
0:06:13 we get out of this
0:06:14 and of course it's trained using switchboard one "'cause" that's what we were given for
0:06:18 the fixed data condition
0:06:20 you know
0:06:24 so let me talk about the i-vectors a little bit the one that
0:06:29 we're all familiar with i'm gonna call the acoustic i-vector this is based on a gaussian
0:06:33 probability model and i've written the output in a little parentheses given that the
0:06:39 alignment's already known otherwise it would be much more complicated
0:06:44 and because of that it's a big gaussian supervector problem there's a closed form solution
0:06:48 for the map estimate of the i-vector
0:06:51 and there's an em algorithm for the t matrix estimation
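that closed form is the standard one — a small numpy sketch of the map i-vector point estimate from the stats, with illustrative shapes and a diagonal-covariance ubm assumed:

```python
import numpy as np

def map_ivector(N, F, T, Sigma):
    """Closed-form MAP i-vector point estimate from baum-welch stats,
    with a standard-normal prior and a diagonal-covariance ubm.
    N: (C,) counts; F: (C, D) centered first-order stats;
    T: (C, D, R) total-variability matrix; Sigma: (C, D) variances."""
    R = T.shape[2]
    L = np.eye(R)               # prior precision
    b = np.zeros(R)
    for c in range(len(N)):
        TS = T[c].T / Sigma[c]              # T_c' Sigma_c^-1
        L += N[c] * TS @ T[c]
        b += TS @ F[c]
    return np.linalg.solve(L, b)            # posterior mean of w
```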
0:06:55 the second approach is the phonotactic thing now i think you guys mentioned you used
0:07:00 it for a number of years before
0:07:03 i'll talk about the details of that a little later but the
0:07:07 key thing is we can still have sort of a gaussian model for an i-vector
0:07:12 but the output now of the latent model we're talking about is the weights of the gmm
0:07:17 instead of the means
0:07:19 and those things are naturally gonna be count based so we need a multinomial probability
0:07:24 model now not a gaussian probability model
0:07:27 and the way we do that is to go from log space with the
0:07:30 softmax to get the probability part
0:07:33 even though it's a fairly simple formula unfortunately there's not a closed form solution for
0:07:38 the optimal i-vector so we use a newton's method iteration
0:07:42 and similarly there's not an em algorithm for the t matrix that we know of
0:07:46 yet so there is an alternating maximization algorithm
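a toy numpy version of that newton iteration — here with the full hessian since the example is tiny; the subspace names are made up and this illustrates the multinomial weight model, not the actual implementation:

```python
import numpy as np

def phonotactic_ivector(N, m, V, iters=20):
    """MAP i-vector for the multinomial weight model via newton's method:
    gmm weights = softmax(m + V w), observed counts N, standard-normal
    prior on w. Full hessian here since the toy problem is tiny."""
    R = V.shape[1]
    w = np.zeros(R)
    n_tot = N.sum()
    for _ in range(iters):
        z = m + V @ w
        p = np.exp(z - z.max())
        p /= p.sum()
        grad = V.T @ (N - n_tot * p) - w          # d log-posterior / dw
        H = (-n_tot * (V.T * p) @ V
             + n_tot * np.outer(V.T @ p, V.T @ p)
             - np.eye(R))                         # negative definite
        w = w - np.linalg.solve(H, grad)
    return w
```

because the log-posterior is concave in w (multinomial log-likelihood plus gaussian prior), the iteration is well behaved even though there is no closed form.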
0:07:53 so we presented this phonotactic thing for lid before
0:07:59 and in the meantime we thought okay we have two systems we have
0:08:02 an acoustic and a phonotactic how are we gonna combine them
0:08:05 actually the first thing we do is score fusion and yes we did that and yes
0:08:08 that works
0:08:09 and then we thought a little more
0:08:11 about two i-vector systems they are doing the same thing why don't i stack the
0:08:15 i-vectors together and get one big i-vector and then run one i-vector system and does
0:08:19 that work
0:08:20 and yes that works too
0:08:22 and we thought about it some more and said well
0:08:24 why do i need two independent i-vector extractors
0:08:28 why can't i make one latent variable that both models
0:08:31 the means of the gmm the latent gmm that generated the cut and the weights of
0:08:35 the gmm that generated the cut
0:08:38 the fact is the math says that you can i'll go into a little more
0:08:42 detail but basically this is
0:08:44 a permutation of the subspace gmm that dan povey was talking about in two
0:08:49 thousand eight two thousand nine
0:08:52 at the clsp workshop and since
0:08:54 so there are algorithms for doing this we had to manipulate them a little bit
0:08:58 for our purposes
0:09:02 so a couple of details on how to do this we have some references in
0:09:07 the paper
0:09:08 some things in particular that we're doing differently than if you just took it
0:09:12 out of what's published
0:09:14 the first is they did everything with sort of ml estimates so they didn't have
0:09:17 any prior didn't have any backoff
0:09:19obviously for acoustic we don't wanna use ml i-vectors we wanna use map i-vectors
0:09:24 we've actually shown previously that for a phonotactic system map is also beneficial and if
0:09:29 we're gonna do it jointly it's
0:09:31 critical that it be the same criterion for both things because it's in fact
0:09:35 a joint optimization of
0:09:38 the map of the overall likelihood plus the prior
0:09:44 a nice trick we can do with this joint i-vector is since there's a closed form
0:09:47 solution for the acoustic we can
0:09:49 initialize the newton's method with the acoustic and then just refine it using the phonotactic
0:09:54 as well
0:09:55 and that gets us to a starting point pretty easily where we can then do
0:09:59 a greatly simplified newton's descent
0:10:03 in particular by pretending everything is independent of each other which is a huge speed
0:10:08 improvement because doing the full hessian in this update
0:10:11 as anybody who's ever looked at it knows is pretty tedious
0:10:15 so once we do that
0:10:16 rather than being much slower than the acoustic i-vector system it's essentially the same
0:10:22 order it's very simple
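the trick might be sketched like this — initialize from the acoustic closed form, then take newton steps on the joint objective keeping only a diagonal hessian approximation; a toy numpy illustration with made-up shapes, not the actual code:

```python
import numpy as np

def joint_ivector(N, F, T, Sigma, m, V, iters=30):
    """Toy joint i-vector: one latent w drives both the gmm means
    (through T) and the gmm weights (through V). Initialize from the
    acoustic closed-form MAP solution, then refine with newton steps
    that keep only a diagonal hessian approximation."""
    C, D, R = T.shape
    # acoustic closed form: (I + sum_c N_c T_c' S_c^-1 T_c) w = b
    L = np.eye(R)
    b = np.zeros(R)
    for c in range(C):
        TS = T[c].T / Sigma[c]
        L += N[c] * TS @ T[c]
        b += TS @ F[c]
    w = np.linalg.solve(L, b)
    n_tot = N.sum()
    for _ in range(iters):
        z = m + V @ w
        p = np.exp(z - z.max())
        p /= p.sum()
        # gradient of the joint log-posterior (acoustic + multinomial)
        grad = b - L @ w + V.T @ (N - n_tot * p)
        # diagonal ("independent dimensions") hessian approximation
        h = -np.diag(L) - n_tot * np.einsum('cr,c,cr->r', V, p, V)
        w = w - grad / h
    return w
```

dividing by the diagonal instead of solving with the full hessian is the "pretend everything is independent" speedup; since the acoustic start is already close, the cheap steps only have to polish.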
0:10:33 so now okay
0:10:36 the lre fifteen task which has been discussed
0:10:39 i guess is nothing new here there is telephone and broadcast narrowband speech with
0:10:44 twenty languages in six confusable clusters
0:10:48 but the limited training condition is a very important element of what we were able to
0:10:52 get away with
0:10:53and of course that means
0:10:54 both that you have limited data for your twenty languages but also
0:10:58 means that you can only train your supervised dnn
0:11:01 on the switchboard english because that's the only thing that had transcripts
0:11:06 which is not our favourite thing to do it was kind of limiting but it
0:11:09 allows nist to exercise the technology
0:11:12 and because the languages didn't have much data that was also key
0:11:20so all of our systems
0:11:21 basically because we had a small team we didn't build too much complicated stuff
0:11:26 i've described really everything that we did
0:11:28 so we had two different ways of using the dnn and we had
0:11:31 three different kinds of i-vectors that we could have built out of each of the
0:11:34 two dnn systems
0:11:37 out of that we could've done six things i'll talk about a few that were
0:11:40 interesting and the ones that we actually submitted
0:11:43 but everything was the same classifier
0:11:48 as i mentioned because the systems are already calibrated by this mmi process
0:11:54we didn't have to use a complicated back end
0:11:57 the thing we did introduce is because we knew there was this range of
0:12:01 durations that had to be exercised
0:12:04 i think the simplest way that we could get there was to reuse some
0:12:08 work that we had done previously on making a
0:12:11 duration dependent backend where there's a continuous function which maps
0:12:15 duration into a scale factor on the score
0:12:19 between the raw score and the true log likelihood estimate that you're trying to get
0:12:25 and there's a justification for that function but for our purposes the important thing
0:12:29is that
0:12:29 it's very simply trainable because it's just got two free parameters
0:12:34so then you can use this cross entropy criterion and figure out the best parameters
0:12:39 and then because we have a very simple system
0:12:43 we just add all scores together assume that they were independent estimates of things and
0:12:48 then rescale the whole thing to bring it back in
0:12:52and we found that to be helpful for us
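as an illustration only — the paper derives the actual functional form, so the two-parameter saturating map below and the crude grid search are stand-ins for the real duration backend:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scale(dur, a, b):
    """Illustrative two-parameter duration-to-scale map: grows with
    duration and saturates for long cuts. A stand-in for the real
    function, which the paper derives."""
    return a * dur / (dur + b)

def cross_entropy(params, scores, durs, labels):
    """Multiclass cross entropy of duration-rescaled scores."""
    a, b = params
    cal = scores * scale(durs, a, b)[:, None]
    p = softmax(cal)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def fit(scores, durs, labels):
    """Crude grid search over the two free parameters."""
    grid = [(a, b) for a in np.linspace(0.1, 3.0, 30)
                   for b in np.linspace(1.0, 60.0, 30)]
    return min(grid, key=lambda q: cross_entropy(q, scores, durs, labels))
```

with only two parameters this is trainable from very little dev data, and the submitted fusion then just summed the per-system scores and applied one more such rescaling to the sum.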
0:12:58 another thing about lre fifteen which was mentioned but maybe to those who aren't
0:13:01 familiar with the task it went past too quickly is very important
0:13:05 so nist
0:13:07 proposed this somewhat odd task of closed set detection within each of the clusters
0:13:13 what we did is
0:13:15 we generated for each cluster an id score which means that each cluster had
0:13:19 id posteriors summing to one and since there are six clusters that means we gave
0:13:23 nist scores from the six which means if nist wanted to evaluate across cluster performance
0:13:29 it was meaningless
0:13:32 and we had to convert these ids to detection log likelihood ratios which is something
0:13:36 we've all learned how to do here
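that conversion, assuming flat priors within a cluster, is the usual one — a small numpy sketch:

```python
import numpy as np

def posteriors_to_llrs(p):
    """Convert closed-set class posteriors (flat priors assumed) into
    per-class detection log-likelihood ratios:
    llr_l = log p_l - log(average of the other classes' posteriors)."""
    p = np.asarray(p, dtype=float)
    K = p.shape[-1]
    rest = (p.sum(axis=-1, keepdims=True) - p) / (K - 1)
    return np.log(p) - np.log(rest)
```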
0:13:39 but one thing i wanna mention about our system is we didn't do anything
0:13:42 cluster specific anywhere we just trained a twenty language lid system
0:13:47 and then just
0:13:50 spun out the scores for each of the clusters because that's what nist wanted
0:13:54 i think we would like that in the future for a more generic lid task
0:14:01 now the key element that i mentioned is
0:14:04 dealing with limited training data so
0:14:08 we had to figure out what to do with that
0:14:11 as i mentioned we have the unsupervised and supervised parts we took the theory which
0:14:16 was later proven not quite right that we would use everything we could
0:14:20 for the unsupervised data which included switchboard which is english only and was not one
0:14:25 of the languages
0:14:27 it turns out we could've done better than that i'll talk about it
0:14:30 and then for the classifier design we did find it helpful
0:14:34 to do augmentation and to do duration modeling of cuts so we could use all sides
0:14:39 we used segments that were duration
0:14:42 appropriate for the lid task
0:14:44 and we used augmentation to change the limited clean data
0:14:49 and try and give us more examples of things to learn what i-vectors would look like
0:14:55to go into the augmentation a little bit more
0:14:58 many of these are standard things the big thing in dnn's now is to
0:15:02 do augmentation
0:15:05so sample rate perturbation additive noise
0:15:08 we made a kind of additive noise but maybe what's more
0:15:11 interesting we did throw in reverb
0:15:15 and multi band compression is kind of a signal processing thing that you might
0:15:18 see in an audio chain
0:15:20 but the thing i wanna mention and the thing that we actually don't have
0:15:23 in the slides but if you look in the paper
0:15:26 the most effective single augmentation for us in the task was to run
0:15:30 "'em" through a speech coder an encoder decoder
0:15:32 which kind of makes sense
0:15:35 as a thing to do
0:15:36 and as former speech coding people we find it fairly attractive
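one of the augmentations is easy to sketch — additive noise mixed at a target snr (the codec and reverb augmentations are analogous extra passes over the waveform); a minimal numpy illustration:

```python
import numpy as np

def add_noise(signal, noise, snr_db, rng=None):
    """Additive-noise augmentation: mix a random slice of a noise
    recording into a clean cut at a target snr in dB."""
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(signal) + 1)
    n = noise[start:start + len(signal)]
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(n ** 2)
    # gain so that p_sig / p(gain * n) hits the requested snr
    gain = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + gain * n
```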
0:15:42so our submission performance
0:15:45 these are the four things that we submitted our primary was in fact the one at the
0:15:49 bottom which looks like a pretty good choice out of the ones available to us
0:15:54 so we did a joint i-vector on the bottleneck features i'll show
0:15:58 later that the senone one was more stable i guess we didn't know that
0:16:02 at the time of the submission
0:16:04 our senone based system was actually slightly better than our bottleneck system and again
0:16:09 that makes it the best sort of phonotactic system i think that anybody saw because
0:16:14 everyone else found the bottlenecks to be the only really good thing to do
0:16:18 and fusion provided gain partly because we have simple fusion and partly because we have
0:16:23two systems which are pretty good
0:16:28 so we learned a couple things post eval that we found somewhat educational the first
0:16:34 one i won't go into in much detail it's in the paper but
0:16:38 within the family of gaussian scoring there's a question of whether you count trials as
0:16:42 independent or not which in speaker you typically can't you only
0:16:46 had one trial for
0:16:47 enrolment it's all one model
0:16:50 the one we submitted which we usually see as slightly better turned
0:16:53 out for this eval to be slightly worse
0:16:55 i have no idea why
0:16:57 the other thing that might be a little bit more interesting is the data usage
0:17:00 we spent quite a bit of time even with the metadata trying to
0:17:05 decide what to do with the ubm and t
0:17:08 but the thing that turned out to work best
0:17:10 we didn't try because we thought it was a dumb idea which is to just
0:17:15 use only the lid data
0:17:16 and only the full cuts
0:17:18 which i forget exactly but i think that's only three or four thousand cuts
0:17:22 it ought to be nowhere near enough to train a t matrix we thought
0:17:27 but it worked
0:17:30 so here again there are some more numbers splitting things out the first thing which is
0:17:34 kind of interesting for us is we went and re-ran this acoustic baseline so
0:17:38 what we would have done with previous technology we are definitely better with all the
0:17:42 stuff we have i dunno if we're hugely better but we're better
0:17:49i'm sorry
0:17:51 the last piece is now we split out with the senone system the three different
0:17:54 kinds of i-vectors and the first thing is the phonotactic system by itself
0:17:59 is actually better than the acoustic system which is what we have seen before and
0:18:03 i think that's
0:18:04 well a linguist might argue about whether it's really a phonotactic system to look at the counts
0:18:08 of frame posteriors but
0:18:10 that aside it's i think the best performing phonotactic system that's out there for lid
0:18:16 right now and then you see also that the joint i-vector does give a noticeable
0:18:21 gain over the acoustic
0:18:23so that's
0:18:44 okay and the fusion still worked let me just move on so then in conclusion
0:18:52we were able to get pretty good performance in this evaluation with a small team
0:18:55and of relatively straightforward system
0:18:58 we think that there is still more in the senone system it doesn't have
0:19:03 to be just bottleneck
0:19:05and we were able show that
0:19:07 we think that the phonotactic and the joint i-vectors the joint i-vector especially is a
0:19:12 nice simple way to capture
0:19:14 that information it's one of the things that enables the senone system to be competitive
0:19:20we think it is helpful to use a really simple fusion if you have this
0:19:23discriminatively trained classifier to start with
0:19:27 and we find that data augmentation can be a very valuable thing for managing
0:19:33 limited data
0:19:35thank you
0:19:43 we have time for some questions
0:19:55 thank you you proposed to model the collected counts
0:20:02 when you go to the lower levels do you use the same tools the same backend tools
0:20:10 for this as the classical ones for the acoustic model too
0:20:15 yes we always use the same ml gaussian classifiers
0:20:20 no matter what kind of i-vectors
0:20:22 "'cause" the distribution is not gaussian
0:20:24 no the intention is the i-vector could still be in a gaussian space
0:20:29 that's why we like this kind of
0:20:33 subspace there are other count subspace algorithms like latent dirichlet allocation and non-negative matrix factorization i
0:20:40 think that others for example have compared some of those
0:20:42 where the subspaces are in the linear probability space and
0:20:47 i don't think that would be well modeled by a gaussian in fact i know it wouldn't be
0:20:50 well modeled by a gaussian i'm pretty comfortable with that "'cause" it's positive
0:20:53 but by going into the log space i think it does
0:20:57 it really is closer to gaussian and that's why the tools are okay
0:21:20 i very much liked the additional processing that you're doing to kind of augment
0:21:24 the data you had cases where you did sample rate perturbation speech
0:21:29 coders noise versions
0:21:31 if you had to go back again which ones do you think actually would help
0:21:35 i think you mean which ones helped there is a table in the paper
0:21:41 which shows many of them are helpful but the speech coder is the most helpful on
0:21:45 its own
0:21:45 so when you did the sample rate conversion was it a really big variation
0:21:51 we did things like plus or minus ten percent plus or minus five percent but
0:21:56 i think
0:21:57 i wouldn't say that big
0:22:02 did you see a big difference maybe where you have other cts or broadcast news which
0:22:08 would typically be gsm
0:22:12we didn't break them apart
0:22:24 did you try other nonlinearities not just
0:22:27 the p-norm
0:22:30we have since
0:22:34 a little bit it seems like for this particular task it looks like the sigmoids
0:22:39 that some other people use are a little bit better i'm not sure if
0:22:43 we think that's a universal statement
0:22:46 excuse me the sigmoids are better for training the bottlenecks
0:22:51 i think for the senones maybe not
0:22:54so we have looked a little bit
0:22:56there is more to explore
0:23:07 so if there are no more questions and we assume everybody here knows everything about
0:23:12 language recognition "'cause" we've seen both systems
0:23:16 so it's the same speaker again