0:00:15 | kind of the transition from the systems in the previous session into the |

0:00:21 | DNNs, in what |

0:00:22 | people do, deeply? we all could have |

0:00:25 | no, i don't think so |

0:00:29 | we all could have presented in both, but i think this is a good transition 'cause |

0:00:32 | we did have some kind of new things that we did that i want to talk |

0:00:36 | about |

0:00:37 | this is work with my colleagues greg sell and daniel from johns hopkins, both |

0:00:41 | of whom were unfortunately unable to get spousal permission to attend, so |

0:00:47 | they have good excuses: greg's wife had their second child two weeks ago, and daniel's |

0:00:51 | is due in about two, so |

0:00:54 | they have a reason |

0:00:57 | so i'm going to present an overview of the DNN i-vector system that we |

0:01:02 | submitted to LRE fifteen |

0:01:04 | i want to here give a shout out to NIST for introducing this fixed training data condition |

0:01:10 | which actually allowed us to make a very competitive system with only three people, which |

0:01:15 | is not very common in LREs historically |

0:01:20 | the approach that we used, algorithmically, i'll go into in more detail, but we used |

0:01:25 | DNNs; unlike some of the previous presentations you've seen, we were able to |

0:01:30 | get good performance not just with the bottleneck features but also with the DNN state |

0:01:35 | labels; i'll talk about that |

0:01:38 | we used three different kinds of i-vectors, i'll explain that more, but |

0:01:43 | everyone had acoustic systems and those are very good; we were able to do quite well |

0:01:47 | with the phonotactic i-vector system as well, and here we're trying for the first time |

0:01:52 | a joint i-vector which does both things at once |

0:01:56 | because we had a fairly powerful system that we were comfortable with, and we didn't |

0:02:02 | trust that we had enough development data, we used i think the simplest and |

0:02:07 | most naive fusion of anybody, and it seemed to work for us because we actually |

0:02:10 | got a gain from fusing, which i think also made us one of the few |

0:02:14 | and that was just to sum the scores together and then scale them with the |

0:02:18 | duration model that i'll talk about |

0:02:21 | and lastly, as i think has been mentioned, but i want to go into it a little |

0:02:24 | bit more because this was a limited data task: data augmentation turned out to be |

0:02:29 | very helpful for us |

0:02:33 | so at the top i'll go through our basic i-vector system design and talk |

0:02:38 | about the two ways that we used the DNNs, which have both been |

0:02:41 | touched on previously today |

0:02:43 | and i'll talk about the alternate i-vectors we experimented with |

0:02:50 | then i'll talk more specifically about the LRE fifteen task, how we used the data |

0:02:54 | and what we learned later about how we could have used the data |

0:02:59 | and to wrap up, i'll talk about the results that we had in the submission |

0:03:02 | and some interesting things that we've learned since, both about what other systems could have |

0:03:07 | done and also how we could have done better with the systems that we |

0:03:10 | used |

0:03:13 | so here's a basic block diagram of |

0:03:16 | our LID system |

0:03:21 | it's a LID i-vector system, so it can be split into two parts: the first |

0:03:24 | uses the unlabeled data to do the UBM and the T matrix learning |

0:03:29 | and then the supervised system is basically the two-covariance model: |

0:03:34 | within-class and across-class covariance matrices that are first used in LDA to reduce the dimension, and |

0:03:39 | then the same matrices are used for the gaussian scoring following on after that |

0:03:45 | as we've done for a while, rather than having a separate back end to do the work |

0:03:48 | we do a discriminative refinement of these gaussian parameters |

0:03:53 | to produce a system that not only performs a little bit better but also produces |

0:03:58 | naturally calibrated scores |

0:04:00 | and we do that in a two-step process: first we learn a scale factor for |

0:04:05 | this within-class covariance |

0:04:07 | and then we go into all the class means and adjust them to better |

0:04:10 | provide discriminative power; for that we're using the MMI algorithm from GMM training |

0:04:17 | in a really simplified mode |

0:04:19 | and of course that's the same criterion as the multiclass cross entropy that everybody |

0:04:23 | uses every day |
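
An editorial sketch (not the authors' code) of the backend just described: a shared within-class covariance gaussian classifier scoring i-vectors, followed by a discriminative refinement of the class means. The refinement is written here as plain gradient descent on the multiclass cross-entropy, as a stand-in for the simplified MMI update in the paper; function names, the learning rate, and the iteration count are illustrative.

```python
import numpy as np

def gaussian_scores(X, means, W_inv):
    # log-likelihood (up to a shared constant) of each i-vector under each
    # class gaussian with a shared within-class covariance W:
    #   score_c(x) = m_c' W^-1 x - 0.5 m_c' W^-1 m_c
    A = means @ W_inv                           # (C, D)
    return X @ A.T - 0.5 * np.sum(A * means, axis=1)

def refine_means(X, labels, means, W_inv, lr=0.1, n_iter=100):
    # simplified discriminative refinement: gradient descent on multiclass
    # cross-entropy with respect to the class means
    C = means.shape[0]
    Y = np.eye(C)[labels]                       # one-hot targets
    m = means.copy()
    for _ in range(n_iter):
        S = gaussian_scores(X, m, W_inv)
        P = np.exp(S - S.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)       # class posteriors
        # dCE/dm_c = sum_i (p_ic - y_ic) W^-1 (x_i - m_c)
        G = (P - Y).T @ X @ W_inv - (P - Y).sum(axis=0)[:, None] * (m @ W_inv)
        m -= lr * G / len(X)
    return m
```

With calibrated scores coming straight out of this classifier, no separate backend is required.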

0:04:28 | so, to lay that out |

0:04:29 | let me talk more about how we used the DNNs; other people have mentioned it, but |

0:04:33 | let me have some pictures so you can see better what we're |

0:04:36 | doing. splitting up the normal usage: the GMM is used to do the alignment, and then the |

0:04:40 | sufficient stats are computed |

0:04:42 | from that |

0:04:43 | we're splitting it out in two ways using the DNNs. the first is simply |

0:04:47 | to replace the MFCCs with bottleneck features |

0:04:51 | from the DNN, and we are just using a straightforward bottleneck, not anything |

0:04:55 | else |

0:04:56 | and then the |

0:04:58 | second system |

0:04:59 | is a little bit more complicated: we use the DNN to generate the frame |

0:05:03 | posteriors for the senones, or for the clustered states |

0:05:06 | that's used to label the data and do the alignment, and then you use the |

0:05:10 | UBM after that |
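
A minimal sketch of the sufficient-statistics step being split out here: the zeroth- and first-order stats depend only on per-frame component posteriors, so the alignment source can be swapped, either UBM posteriors in the classic recipe or DNN senone posteriors in the second system described. Names are illustrative.

```python
import numpy as np

def baum_welch_stats(feats, post):
    """Zeroth/first-order sufficient statistics for i-vector training.

    feats: (T, D) acoustic features (MFCCs or bottlenecks)
    post:  (T, C) per-frame component posteriors, from a UBM or, in the
           senone system, from the DNN output layer
    """
    N = post.sum(axis=0)      # (C,)  soft occupation counts per component
    F = post.T @ feats        # (C, D) first-order stats
    return N, F
```

Everything downstream of this point (the i-vector extractor and classifier) is identical regardless of which model produced `post`.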

0:05:15 | i didn't have time to draw a DNN, but this is daniel's best rendition of |

0:05:19 | a |

0:05:20 | of a plausible DNN. a couple of things that are perhaps particular |

0:05:23 | about our system, or about the kaldi way of doing things |

0:05:27 | which, by the way, we do highly recommend |

0:05:30 | is it uses this p-norm, which is kind of like max pooling, so there |

0:05:33 | is an expansion and a contraction made at each layer; that's how the |

0:05:37 | nonlinearity comes in |

0:05:40 | what else; i think probably nobody does this these days, but we're not using fMLLR |

0:05:43 | which i think is common |

0:05:45 | for our purposes |

0:05:48 | you can see we basically use the same architecture, either for the senone posteriors |

0:05:52 | or we introduce the bottleneck; for the one that's just going to be the bottleneck |

0:05:56 | that goes |

0:05:57 | there's a little linear layer before the |

0:06:01 | one in the middle there |

0:06:06 | we have |

0:06:07 | about nine thousand output states, so it is a pretty big UBM that |

0:06:13 | we get out of this |

0:06:14 | and of course it's trained using switchboard one, 'cause that's what we were given for |

0:06:18 | the fixed data condition |

0:06:20 | you know |

0:06:24 | so let me talk about the i-vectors a little bit. the one that |

0:06:29 | we're all familiar with, what we're going to call the acoustic i-vector: this is based on a gaussian |

0:06:33 | probability model, and i've written the output with a little parenthetical, for use given that the |

0:06:39 | alignments are already known; otherwise it would be much more complicated |

0:06:44 | but because of that it's a big gaussian supervector problem: there's a closed form solution |

0:06:48 | for the MAP estimate of the i-vector |

0:06:51 | and there's an EM algorithm for the T estimation |
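
The closed-form MAP estimate mentioned here can be sketched as follows, assuming diagonal UBM covariances and a standard-normal prior; this is a hypothetical illustration, not a production extractor, and the names are invented.

```python
import numpy as np

def map_ivector(N, F, T, means, Sigma_inv):
    """Closed-form MAP acoustic i-vector, given fixed alignments.

    N: (C,) zeroth-order stats, F: (C, D) first-order stats,
    T: (C, D, R) total-variability loadings per component,
    means, Sigma_inv: (C, D) UBM means and diagonal inverse covariances.
    With a standard-normal prior this is the usual
      w = (I + sum_c N_c T_c' S_c^-1 T_c)^-1 sum_c T_c' S_c^-1 (F_c - N_c m_c)
    """
    R = T.shape[2]
    A = np.eye(R)                            # prior precision
    b = np.zeros(R)
    for c in range(len(N)):
        TS = T[c].T * Sigma_inv[c]           # (R, D) = T_c' diag(S_c^-1)
        A += N[c] * TS @ T[c]
        b += TS @ (F[c] - N[c] * means[c])
    return np.linalg.solve(A, b)
```

The prior term `np.eye(R)` is what makes this MAP rather than ML; dropping it recovers the unregularized estimate.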

0:06:55 | the second approach is the phonotactic one; i think you guys mentioned that we used |

0:07:00 | it for a number of LREs before |

0:07:03 | i'll talk about the details of it later, but |

0:07:07 | the key thing is we can still have sort of a gaussian model for an i-vector |

0:07:12 | but the output now of the latent model we're talking about is the weights of the GMM |

0:07:17 | instead of the means |

0:07:19 | and those things are naturally going to be count based, so we need a multinomial probability |

0:07:24 | model, not a gaussian probability model |

0:07:27 | and the way we do that is to go from log space with the |

0:07:30 | softmax to the probability part |

0:07:33 | even though there's a fairly simple formula, unfortunately there's not a closed form solution for what |

0:07:38 | is the optimal i-vector, so we use newton's method iteration |

0:07:42 | and similarly there's not an EM algorithm for the T matrix that we know of |

0:07:46 | yet, so there is an alternating maximization algorithm |
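
A sketch of the multinomial (weight-subspace) i-vector just described, again assuming a standard-normal prior: there is no closed form, so the MAP point is found with a few Newton iterations on the log-space model. Parameter names are illustrative.

```python
import numpy as np

def phonotactic_ivector(counts, V, b, n_iter=20):
    """MAP i-vector for the multinomial weight-subspace model.

    counts: (C,) senone occupation counts for the cut
    V: (C, R) weight subspace, b: (C,) log-weight offsets.
    Model: weights = softmax(b + V x); we maximize
      sum_c n_c log w_c(x) - 0.5 x'x   by Newton's method.
    """
    R = V.shape[1]
    x = np.zeros(R)
    Ntot = counts.sum()
    for _ in range(n_iter):
        z = b + V @ x
        p = np.exp(z - z.max())
        p /= p.sum()                                   # current weights
        g = V.T @ (counts - Ntot * p) - x              # gradient of MAP objective
        Vp = V * p[:, None]
        # negative Hessian: Ntot * V'(diag(p) - p p')V + I  (positive definite)
        H = Ntot * (V.T @ Vp - np.outer(V.T @ p, V.T @ p)) + np.eye(R)
        x += np.linalg.solve(H, g)
    return x
```

Because the objective is concave, these iterations converge quickly in practice; the full Hessian here is what the joint i-vector later simplifies away.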

0:07:53 | so we had presented this phonotactic thing for LID before |

0:07:59 | and in the meantime we thought, okay, we have two systems, we have |

0:08:02 | an acoustic and a phonotactic, how are we going to combine them |

0:08:05 | naturally the first thing we do is score fusion, and yes we did that, and yes |

0:08:08 | that works |

0:08:09 | and then we got a little more ambitious: well |

0:08:11 | these two i-vector systems are doing the same thing, why don't i stack the |

0:08:15 | i-vectors together and get one big i-vector and then run one i-vector system, and does |

0:08:19 | that work |

0:08:20 | and yes, that works too |

0:08:22 | and then we thought about it some more and said, well |

0:08:24 | why do i want two independent i-vector extractors |

0:08:28 | why can't i make one latent variable that both models |

0:08:31 | the means of the GMM, the latent GMM that generated the cut, and the weights of |

0:08:35 | the GMM that generated the cut |

0:08:38 | the fact is, the math says that you can; i'll go into a little more |

0:08:42 | detail, but basically this is |

0:08:44 | a permutation of the subspace GMM that dan povey was talking about in two |

0:08:49 | thousand eight, two thousand nine |

0:08:52 | at the CLSP workshop and since |

0:08:54 | so there are algorithms for doing this; we had to manipulate them a little bit |

0:08:58 | for our purposes |

0:09:02 | so, a couple of details on how to do this; we have some references in |

0:09:07 | the paper |

0:09:08 | some things in particular that we're doing differently than if you just took it |

0:09:12 | out of that prior work |

0:09:14 | the first is, he did everything with sort of ML estimates, so he didn't have |

0:09:17 | any prior |

0:09:19 | obviously for acoustic we don't want to use ML i-vectors, we want to use MAP i-vectors |

0:09:24 | we've actually shown previously that for a phonotactic system MAP is also beneficial, and if |

0:09:29 | we're going to do it jointly, it's |

0:09:31 | critical that it be the same criterion for both things, because then |

0:09:35 | it is a joint optimization: |

0:09:38 | MAP of the overall likelihood plus the prior |

0:09:44 | a nice trick we can do with this joint i-vector is, since there's this closed form |

0:09:47 | solution for the acoustic, we can |

0:09:49 | initialize newton's method with the acoustic and then just refine it using the phonotactic |

0:09:54 | as well |

0:09:55 | and that gets us to a starting point pretty easily, where we can then do |

0:09:59 | a greatly simplified newton's descent |

0:10:03 | in particular by pretending everything is independent of each other, which is a huge speed |

0:10:08 | improvement, because doing the full hessian in this update |

0:10:11 | as anybody who's ever looked at it knows, is pretty tedious |

0:10:15 | so once we do that |

0:10:16 | rather than being much slower than an acoustic i-vector system, it's essentially the same |

0:10:22 | order; it's very simple |
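
The initialization-and-refinement trick can be sketched like this, with the acoustic term reduced to its quadratic form and a diagonal (independence) Hessian approximation for the joint refinement; a small damping factor is added here to keep the simplified update stable. This is an illustration under those assumptions, not the authors' exact algorithm.

```python
import numpy as np

def joint_ivector(A, c, counts, V, b, n_iter=100, step=0.5):
    """Sketch of a joint acoustic+phonotactic MAP i-vector estimate.

    The acoustic log-likelihood (alignments fixed) is quadratic in x:
      -0.5 x'Ax + c'x, with A = I + sum_c N_c T_c' S_c^-1 T_c.
    The phonotactic term is sum_c n_c log softmax(b + V x)_c.
    Start at the acoustic closed form, then refine with a diagonal
    Hessian approximation (dimensions treated as independent).
    """
    x = np.linalg.solve(A, c)            # closed-form acoustic start
    Ntot = counts.sum()
    for _ in range(n_iter):
        z = b + V @ x
        p = np.exp(z - z.max())
        p /= p.sum()
        g = c - A @ x + V.T @ (counts - Ntot * p)   # joint gradient
        # diagonal of the negative joint Hessian: acoustic + multinomial
        h = np.diag(A) + Ntot * ((V**2 * p[:, None]).sum(0)
                                 - (V * p[:, None]).sum(0)**2)
        x += step * g / h                # damped diagonal Newton step
    return x
```

Replacing the full R-by-R Hessian solve with an elementwise division is what makes this essentially the same cost as the acoustic extractor alone.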

0:10:33 | so now, okay |

0:10:36 | the LRE fifteen task, which has been discussed |

0:10:39 | so i guess this isn't news to anyone here: there is telephone and broadcast narrowband speech, with |

0:10:44 | twenty languages in six confusable clusters |

0:10:48 | but the limited training condition is a very important element of what we were able to |

0:10:52 | get away with |

0:10:53 | and of course that means |

0:10:54 | both that you have limited data for the twenty languages, but it also |

0:10:58 | means that you can only train your supervised DNN |

0:11:01 | on the switchboard english, because that's the only thing that had transcripts |

0:11:06 | which is not our favorite thing to do, it was kind of limiting, but it |

0:11:09 | allows NIST to exercise the technology |

0:11:12 | and because the languages didn't have much data, that was also key |

0:11:20 | so, all of our systems: |

0:11:21 | basically, because we had a small team, we didn't build too much complicated stuff |

0:11:26 | i've described really everything that we did |

0:11:28 | so we had two different ways of using the DNN, and we had |

0:11:31 | three different kinds of i-vectors that we could have built out of each of the |

0:11:34 | two DNN systems |

0:11:37 | out of that we could have done six things; i'll talk about a few that were |

0:11:40 | interesting and the ones that we actually ran |

0:11:43 | but everything used the same classifier |

0:11:48 | as i mentioned, because the systems are already calibrated by this MMI process |

0:11:54 | we didn't have to use a complicated back end |

0:11:57 | the one thing we did introduce, because we knew there was this range of |

0:12:01 | durations that had to be exercised: |

0:12:04 | i think the simplest way that we could get there was to reuse some |

0:12:08 | work that we had done previously on making a |

0:12:11 | duration dependent backend, where there's a continuous function which maps |

0:12:15 | duration into a scale factor on the score |

0:12:19 | between the raw score and the true log likelihood estimate that you're trying to |

0:12:23 | make |

0:12:25 | and there's a justification for that function, but for our purposes the important thing |

0:12:29 | is that |

0:12:29 | it's very simply trainable, because it's just got two free parameters |

0:12:34 | so then you can use this cross entropy criterion and figure out the best parameters |

0:12:39 | and then, because we have a very simple system |

0:12:43 | we just add all the scores together, assume that they were independent estimates, and |

0:12:48 | then rescale the whole thing to bring it back in |

0:12:52 | and we found that to be helpful for us |
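
A toy version of the duration-scaled fusion described above. The talk does not give the exact two-parameter function, so a saturating form `a * d / (d + b)` is assumed here purely for illustration; the parameters are fit by grid search on multiclass cross-entropy, and fused system scores would simply be summed before this scaling is applied.

```python
import numpy as np

def scaled_llrs(raw, dur, a, b):
    # hypothetical two-parameter saturating scale: short cuts get shrunk
    # toward zero, long cuts approach the full scale a
    d = np.asarray(dur, dtype=float).reshape(-1, 1)
    return raw * (a * d / (d + b))

def cross_entropy(scores, labels):
    S = scores - scores.max(axis=1, keepdims=True)
    logp = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def train_scale(raw, dur, labels, a_grid, b_grid):
    # raw: (N, L) summed system scores; pick (a, b) by grid search on
    # multiclass cross-entropy, a stand-in for the real training
    best = None
    for a in a_grid:
        for b in b_grid:
            ce = cross_entropy(scaled_llrs(raw, dur, a, b), labels)
            if best is None or ce < best[0]:
                best = (ce, a, b)
    return best[1], best[2]
```

The point of the two-parameter form is that it needs almost no development data to fit, which matched the limited-data setting.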

0:12:58 | another thing about LRE fifteen, which was mentioned, but maybe if you're not |

0:13:01 | familiar with the task it went past you: scoring it correctly is very important |

0:13:05 | so NIST |

0:13:07 | proposed this somewhat odd task of closed-set detection within each of the clusters |

0:13:13 | what we did is |

0:13:15 | we generated for each cluster an ID score, which means that each cluster had |

0:13:19 | ID posteriors that sum to one; since there are six clusters, that means we gave |

0:13:23 | NIST scores from the six, which means if NIST wanted to evaluate across-cluster performance |

0:13:29 | it was meaningless |

0:13:32 | and we had to convert these ID scores to detection log likelihood ratios, which is something |

0:13:36 | we've all learned how to do here |

0:13:39 | but one thing i want to mention about our system is we didn't do anything |

0:13:42 | cluster specific anywhere; we just trained a twenty language LID system |

0:13:47 | and then just |

0:13:50 | spun off the scores for each of the clusters, because that's what NIST wanted |

0:13:54 | i think we would like that in the future for a more generic LID task |
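
A sketch of that cluster scoring step: one set of calibrated language scores from a single twenty-language system is renormalized within each cluster and converted to closed-set detection log-likelihood ratios. The cluster map and the flat-prior LLR convention used here are illustrative assumptions.

```python
import numpy as np

def cluster_llrs(scores, clusters):
    """Per-cluster closed-set detection LLRs from one LID score vector.

    scores:   (L,) calibrated log-likelihoods over all languages
    clusters: dict mapping cluster name -> list of language indices
    Within a cluster, the LLR for language i is
      log p_i - log(mean of p_j for j != i),
    with posteriors renormalized over that cluster only.
    """
    out = {}
    for name, idx in clusters.items():
        s = scores[np.asarray(idx)]
        p = np.exp(s - s.max())
        p /= p.sum()                         # within-cluster posteriors
        llr = np.empty(len(idx))
        for k in range(len(idx)):
            rest = np.delete(p, k).mean()    # flat prior over the rest
            llr[k] = np.log(p[k] + 1e-30) - np.log(rest + 1e-30)
        out[name] = llr
    return out
```

Nothing cluster-specific ever enters the classifier itself; the clustering only appears in this final renormalization.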

0:14:01 | now, the key element that i mentioned is |

0:14:04 | dealing with limited training data, so |

0:14:08 | we had to figure out what to do with that |

0:14:11 | as i mentioned, we have the unsupervised and supervised parts; we took the theory, which |

0:14:16 | was later proven not quite right, that we would use everything we could |

0:14:20 | for the unsupervised data, which included switchboard, which is english only, and english was not one |

0:14:25 | of the languages |

0:14:27 | it turns out we could have done better than that; i'll talk about it |

0:14:30 | and then for the classifier design we did find it helpful |

0:14:34 | to do augmentation and to do duration modeling of cuts: we used all sides |

0:14:39 | we used segments that were duration |

0:14:42 | appropriate for the LID task |

0:14:44 | and we used augmentation to change the limited clean data |

0:14:49 | and try to give us more examples of what the i-vectors would look |

0:14:53 | like |

0:14:55 | to go into the augmentation a little bit more |

0:14:58 | many of these are standard things; the big thing in DNNs now is to |

0:15:02 | do augmentation |

0:15:05 | so: sample rate perturbation, additive noise |

0:15:08 | we made a couple of kinds of additive noise, but maybe what's more |

0:15:11 | interesting, we did throw in reverb |

0:15:15 | and multiband compression, which is kind of a signal processing thing that you might |

0:15:18 | see in an audio chain |

0:15:20 | but the thing i want to mention, the thing that we actually don't have |

0:15:23 | in the slides, but if you look in the paper: |

0:15:26 | the most effective single augmentation for us in the task was to run |

0:15:30 | a speech codec encoder-decoder against it |

0:15:32 | which kind of makes sense |

0:15:35 | as a thing to do |

0:15:36 | and to us, as former speech coding people, it was fairly attractive |
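
A toy illustration of three of the augmentations named here: sample-rate (speed) perturbation, additive noise at a target SNR, and a crude synthetic reverb. A real pipeline would use proper resampling, measured room impulse responses, and an actual speech codec, none of which are reproduced here; all names and parameter choices are illustrative.

```python
import numpy as np

def speed_perturb(x, factor):
    # resample by a small factor (e.g. 0.95 or 1.05) via linear interpolation
    n = int(len(x) / factor)
    t = np.arange(n) * factor
    return np.interp(t, np.arange(len(x)), x)

def add_noise(x, noise, snr_db, rng):
    # mix in a random noise segment scaled to hit the requested SNR
    start = rng.integers(0, len(noise) - len(x) + 1)
    seg = noise[start:start + len(x)]
    g = np.sqrt((x**2).mean() / ((seg**2).mean() * 10 ** (snr_db / 10)))
    return x + g * seg

def add_reverb(x, rt_samples, rng):
    # toy room impulse response: white noise with an exponential decay
    ir = rng.standard_normal(rt_samples) * np.exp(-6 * np.arange(rt_samples) / rt_samples)
    ir[0] = 1.0
    y = np.convolve(x, ir)[:len(x)]
    return y / (np.abs(y).max() + 1e-9)
```

Each transform yields a new training cut from the same clean audio, multiplying the limited data before feature extraction.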

0:15:42 | so, our submission performance: |

0:15:45 | these are the four things that we submitted; our primary was in fact the one at the |

0:15:49 | bottom, which looks like it was a pretty good choice out of the ones available to us |

0:15:54 | so we did a joint i-vector on the bottleneck features; i'll show |

0:15:58 | later some more detail, i guess, about the |

0:16:02 | dimensionalities we used in this submission |

0:16:04 | our senone-based system was actually slightly better than our bottleneck system, and again |

0:16:09 | that makes it the best sort of phonotactic system, i think, that anybody saw, because |

0:16:14 | everyone else found the bottlenecks to be the only really good thing to do |

0:16:18 | and fusion provided a gain, partly because we have simple fusion and partly because we have |

0:16:23 | two systems which are pretty good |

0:16:28 | so we learned a couple things post-eval that we found somewhat educational; the first |

0:16:34 | one i won't go into in much detail, it's in the paper, but |

0:16:38 | within the family of gaussian scoring there's a question of whether you count trials as |

0:16:42 | independent or not, which in speaker you typically can't: you only |

0:16:46 | had one trial for |

0:16:47 | enrollment |

0:16:50 | the way we turned that on in what we submitted, we usually see it as slightly better; it turns |

0:16:53 | out for this it was actually slightly worse |

0:16:55 | i have no idea why |

0:16:57 | the other thing that might be a little bit more interesting is the data usage |

0:17:00 | we spent quite a bit of time, even with the metadata, trying to |

0:17:05 | decide what to do with the UBM and T |

0:17:08 | but the thing that turned out to work best |

0:17:10 | we didn't try, because we thought it was a dumb idea: which is to just |

0:17:14 | use |

0:17:15 | only the LID data |

0:17:16 | and only the full cuts |

0:17:18 | which, i forget exactly, but i think that's only three or four thousand cuts or |

0:17:21 | something |

0:17:22 | it ought to be nowhere near enough to train a T matrix, we thought |

0:17:27 | but there it was |

0:17:30 | so here again there are more numbers splitting things out; the first thing which is |

0:17:34 | kind of interesting for us is we went and ran this acoustic baseline, so |

0:17:38 | what we would have done with previous technology; we are definitely better with all the |

0:17:42 | stuff we have. i don't know that we're immensely better, but we're better |

0:17:48 | one thing |

0:17:49 | i'm sorry |

0:17:51 | the next thing is, now we split out, with the senone system, the three different |

0:17:54 | kinds of i-vectors, and the first thing is that the phonotactic system by itself |

0:17:59 | is actually better than the acoustic system, which is what we have seen before, and |

0:18:03 | i think that |

0:18:04 | well, a linguist might argue about whether it's really a phonotactic system if it looks at the counts |

0:18:08 | of frame posteriors, but |

0:18:10 | that aside, it's i think the best performing phonotactic system that's out there for LID |

0:18:16 | right now. and then you see also that the joint i-vector does give a noticeable |

0:18:21 | gain over the acoustic |

0:18:23 | so that's that |

0:18:44 | okay, and the fusion still works; let me just go on. so then, in conclusion: |

0:18:52 | we were able to get pretty good performance in this evaluation with a small team |

0:18:55 | and a relatively straightforward system |

0:18:58 | we think that there is still a role for the senone count system; it doesn't have |

0:19:03 | to be just bottlenecks |

0:19:05 | and we were able to show that |

0:19:07 | we think that the phonotactic and the joint i-vectors, the joint i-vector especially, are a |

0:19:12 | nice simple way to capture that |

0:19:14 | information, and that's one of the things that enables the senone system to be competitive |

0:19:20 | we think it is helpful to use a really simple fusion if you have this |

0:19:23 | discriminatively trained classifier to start with |

0:19:27 | and we find that data augmentation can be a very valuable thing for the management |

0:19:32 | of |

0:19:33 | limited data |

0:19:35 | thank you |

0:19:43 | we have time for some questions |

0:19:55 | [question, partly inaudible] thank you; for the phonotactic i-vectors you're collecting counts |

0:20:02 | if we can focus on the lower levels: do you use the same tools, the same classifier |

0:20:10 | for this as for the acoustic model too? |

0:20:15 | yes, we always use the same MMI-trained gaussian classifiers |

0:20:20 | no matter what kind of i-vectors |

0:20:22 | even though the distribution is not gaussian? |

0:20:24 | well, the intention is that the i-vector could still live in a gaussian space; |

0:20:29 | that's why we like this kind of |

0:20:33 | subspace. there are other count subspace algorithms, like LDA and non-negative matrix factorization; i |

0:20:40 | think others for example have compared some of those |

0:20:42 | where the subspaces are in the linear probability space, and that |

0:20:47 | i don't think would be well modeled by a gaussian; in fact i know it wouldn't be |

0:20:50 | well modeled by a gaussian, i'm pretty comfortable, 'cause it's positive |

0:20:53 | but by going into the log space, i think it does |

0:20:57 | really become reasonable to model it as gaussian. okay |

0:21:20 | i very much like the additional processing that you're doing to kind of augment |

0:21:24 | the data you had; you spoke of sample rate perturbation, the speech |

0:21:29 | coders, noise versions |

0:21:31 | if you had to go back again, which ones do you think actually would help? |

0:21:35 | i can't remember exactly which, but there is a table in the paper |

0:21:41 | and many of them are helpful, but the speech coder is the most helpful on |

0:21:45 | its own |

0:21:45 | and for the sample rate conversion, did you choose really big variations? |

0:21:51 | we did things like plus or minus ten percent, plus or minus five percent, but |

0:21:56 | i think |

0:21:57 | i would say that's big |

0:22:02 | was there a big difference, maybe, between the other sources, CTS and broadcast news, which |

0:22:08 | would typically be where you'd guess |

0:22:12 | we didn't break them apart |

0:22:24 | did you try other nonlinearities, or just |

0:22:27 | the p-norm? |

0:22:30 | we have since |

0:22:31 | and |

0:22:34 | it's |

0:22:34 | a little bit; it seems like for this particular task the sigmoid |

0:22:39 | that some other people use is a little bit better; i'm not sure if |

0:22:43 | we think that's a universal statement |

0:22:46 | excuse me: the sigmoids are better for training the bottlenecks |

0:22:51 | i think for the senones maybe not |

0:22:54 | so we have looked a little bit |

0:22:56 | there is more to explore |

0:23:07 | so if there are no more questions, then we assume everybody here knows everything about |

0:23:12 | language recognition, including both systems |

0:23:16 | and next it's the same speaker again |