I will be presenting what we did for LRE15, and probably a great part of you have already seen most of this presentation at the workshop. We have changed a few things, corrected some errors, and I will give you the presentation again.
Well, as John already said, it was a collaboration centred at Brno University of Technology. I included almost the full list of people who participated in our team; it was a lot of concentrated fun during the autumn and we really enjoyed it.
Let's go straight to the system and the data we used. We decided to participate in both NIST conditions, the fixed data condition and the open data condition. For the fixed data condition we joined efforts with MIT, and they provided the definitions of the development set and the short cuts. So we split all of the data we had available for training: we kept sixty percent for training and forty percent for dev, and we also generated some short cuts out of the long segments, uniformly distributed from three to thirty seconds, because that is what we were expecting in the eval data according to the evaluation plan.
For the open training data condition we tried to harvest all of the data from our hard drives that we could find. We also asked our friends here from Bilbao to provide some other databases, and got something from MIT as well, so there are databases that you might not be using in your systems regularly: we took European Spanish and British English, and from the Al Jazeera free speech corpus we took some Arabic dialects. Otherwise it was just all the data that we harvested for NIST LRE09 from the radio broadcasts, from the Voice of America and so on. Just to let you know, we didn't use any Babel data for the classifier training; we only used the Babel data to train some bottleneck feature extractors, and I will speak about that later.
Bottleneck features, that is really the core of our system. I think most of you are already familiar with this architecture: we train a neural network to classify phoneme states. It is a slightly special architecture because it is a stacked bottleneck; the structure is here on the picture. Stacked means that we first train a classical network to classify the phoneme states, then we cut it at the bottleneck, stack these bottleneck outputs in time, and train again. So we train another stage, and we take the bottlenecks from the second-stage network; that is why they are called stacked bottlenecks. The effect is that in the end they see a longer context, and from our experience they work pretty well. But if you do some tuning, you can just use the first bottlenecks; it's enough, especially for speaker ID, I'd say.
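As a rough illustration of the two-stage dataflow described above, here is a minimal numpy sketch. The layer sizes, the tanh nonlinearity, the random (untrained) weights and the offset set are made-up placeholders, not the actual network configuration; the point is only the cut-stack-retrain pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_stage(frames, w_in, w_bn):
    # One stage of the cascade: a hidden layer, then a low-dimensional
    # linear bottleneck whose activations become the features.
    hidden = np.tanh(frames @ w_in)
    return hidden @ w_bn                # (n_frames, bn_dim)

def stack_in_time(bn, offsets=(-10, -5, 0, 5, 10)):
    # Concatenate bottleneck frames at several time offsets so the
    # second stage sees a longer temporal context (edges are clamped).
    n = len(bn)
    idx = np.clip(np.arange(n)[:, None] + np.array(offsets), 0, n - 1)
    return bn[idx].reshape(n, -1)       # (n_frames, bn_dim * len(offsets))

# Toy dimensions: 24-dim input features, 80-dim bottleneck.
feats = rng.standard_normal((200, 24))
w1_in, w1_bn = rng.standard_normal((24, 64)), rng.standard_normal((64, 80))
bn1 = bottleneck_stage(feats, w1_in, w1_bn)          # first-stage bottlenecks

stacked = stack_in_time(bn1)                         # (200, 400)
w2_in, w2_bn = rng.standard_normal((400, 64)), rng.standard_normal((64, 80))
bn2 = bottleneck_stage(stacked, w2_in, w2_bn)        # stacked-bottleneck features
print(bn2.shape)
```

In the real system each stage is trained to classify phoneme states before being cut at the bottleneck; here the weights are random just to show the shapes.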
For the fixed training condition we of course had to use Switchboard, and the network had approximately seven thousand triphone states in total. We were also trying a new technique with automatic acoustic unit discovery, and we trained a bottleneck on these units; for that we used the LRE15 data. For the open training condition we used the Babel data, and later, after the evaluation, we trained another network on seventeen languages of Babel; that is indeed the one we would like to use if you can use all kinds of data.
So, the general system overview. As I already said, the basis of our system are the bottlenecks, either based on Switchboard or on Babel data. Then, for reference, we had an MFCC shifted-delta-cepstra system and a PLLR system. We also tried some phonotactic systems, modelling the expected n-gram counts with a multinomial subspace model and techniques like that, but they didn't make it into the fusion. Our favourite classifier is just a simple Gaussian linear classifier, and if you can, it is good to include the i-vector uncertainty in the computation of the scores; that helps quite a bit with the calibration and also provides a slight performance boost.
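A minimal sketch of such a Gaussian linear classifier with the uncertainty term folded in, assuming numpy and toy dimensions; the per-utterance i-vector posterior covariance is mocked up here as a scaled identity, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_glc(X, y, n_lang):
    # Gaussian linear classifier: one mean per language, one shared
    # within-class covariance estimated over all training i-vectors.
    means = np.stack([X[y == l].mean(axis=0) for l in range(n_lang)])
    centered = X - means[y]
    W = centered.T @ centered / len(X)
    return means, W

def score_glc(x, means, W, x_cov=None):
    # Log-likelihood of each language for one i-vector x; if the
    # extractor's posterior covariance is available, add it to the
    # shared covariance (the "uncertainty" variant from the talk).
    S = W if x_cov is None else W + x_cov
    Sinv = np.linalg.inv(S)
    _, logdet = np.linalg.slogdet(S)
    d = means - x
    return -0.5 * (np.einsum('ld,dk,lk->l', d, Sinv, d) + logdet)

# Toy data: 3 languages, 10-dim "i-vectors" with well-separated means.
X = rng.standard_normal((300, 10)) + np.repeat(np.eye(3, 10) * 5, 100, axis=0)
y = np.repeat(np.arange(3), 100)
means, W = train_glc(X, y, 3)
scores = score_glc(X[0], means, W, x_cov=0.1 * np.eye(10))
print(scores.argmax())
```

With `x_cov=None` this reduces to the plain Gaussian linear classifier; passing the posterior covariance per utterance is what helps the calibration.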
We had one new thing, a sequence summarizing neural network, which I will speak about a little later, because it was a bit of a disaster, as we will see. The fusion was a little bit different: we tried to reflect the NIST criterion, because the C_avg was computed over the clusters and then averaged, so we reflected this.
Otherwise we had one weight per system and one bias per language, and we assigned cluster-specific priors to the data: for each cluster, all data from the other sets had their prior set to zero, and we trained over all clusters in the end. I think it improved the results on the NIST metric quite substantially.
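The cluster-then-average structure of the metric can be sketched as follows. This is a simplified stand-in for the official NIST LRE15 C_avg (hard decisions and a plain error rate instead of the miss and false-alarm costs), just to show the per-cluster averaging:

```python
import numpy as np

def cavg_per_cluster(scores, labels, clusters):
    # For each language cluster, closed-set decisions are taken among
    # that cluster's languages only; a cost is computed per cluster and
    # the final metric is the mean over clusters.
    costs = []
    for langs in clusters:
        mask = np.isin(labels, langs)
        sub = scores[np.ix_(mask, langs)]          # trials of this cluster
        pred = np.array(langs)[sub.argmax(axis=1)]
        costs.append(np.mean(pred != labels[mask]))
    return float(np.mean(costs))

# Synthetic example: 4 languages in 2 clusters, 120 trials.
rng = np.random.default_rng(2)
n, L = 120, 4
labels = rng.integers(0, L, n)
scores = rng.standard_normal((n, L))
scores[np.arange(n), labels] += 3.0    # make the true language score high
clusters = [[0, 1], [2, 3]]
cost = cavg_per_cluster(scores, labels, clusters)
print(cost)
```

Because each cluster contributes equally regardless of its trial count, a calibration trained per cluster (as described above) pays off on this metric.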
We also gave NIST a system that was a classical multiclass system, so that they could do some between-cluster analysis, because if we had given them just the one calibrated and fused this way, they would have been out of luck doing anything with it; they of course asked for log-likelihood ratios, not log-likelihoods. I hope that next time they will rectify this.
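The form of the fusion described above (one scalar weight per system, one bias per language, cluster-specific priors) can be sketched like this; the weights, biases and sizes are made-up numbers, whereas in the real system they would be trained over all clusters:

```python
import numpy as np

def fuse(system_scores, alphas, betas):
    # Linear fusion: one scalar weight per system and one bias per
    # language; inputs are per-system log-likelihood matrices of
    # shape (n_trials, n_languages).
    fused = sum(a * s for a, s in zip(alphas, system_scores))
    return fused + betas                 # broadcast the per-language bias

def cluster_priors(languages, cluster):
    # Cluster-specific priors: uniform over the cluster's languages,
    # zero for every language outside the cluster.
    p = np.zeros(len(languages))
    p[cluster] = 1.0 / len(cluster)
    return p

# Two toy systems, 5 trials, 3 languages.
s1, s2 = np.random.default_rng(3).standard_normal((2, 5, 3))
fused = fuse([s1, s2], alphas=[0.7, 0.3], betas=np.array([0.1, 0.0, -0.1]))
prior = cluster_priors(languages=range(3), cluster=[0, 1])
print(fused.shape, prior)
```

The per-language bias is what lets the same fusion absorb different priors per cluster while keeping a single weight per input system.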
This is all we had in the end in our submissions. Most of the systems are stacked-bottleneck systems, plus the cluster-dependent system that I will speak about two slides later, and then there was this sequence summarizing network. As you can see, it is clearly the worst system; it would never make it into the fusion. But at the NIST workshop I presented it as a system that could almost perfectly classify our test data. That is not the case: there was, of course, some overlapping data in the training data, so now it is the worst system.
Anyway, we were so scared that it worked so well on our test data that we didn't include it in the primary system. The red arrow shows what we had as the primary system, and the alternate system would be the one with the sequence summarizing network included. What I report here is the C_avg; the star means that the calibration was performed on the dev set. I don't show the C_avg for the dev set, because during the development we were effectively cheating, I think, and it is not in this slide anymore.
So these are the results on our test set, and they are pretty good. Let's skip to the results on the eval set. There is not much to say, just that we see quite some calibration loss on the eval data, which was not the case on our test data, especially on the fixed set, because it proved to be quite an easier set than the one I designed for the open data condition. So that's it, that's our system for the fixed training condition.
Now let's talk about the specialities we had there. The first one is the cluster-dependent i-vector system. Cluster-dependent means that we train per cluster: we train a UBM per cluster and then an i-vector extractor, and the rest of the system is trained on the whole data. As you can see, there are six independent systems which provide the scores, and then we fuse them with a simple average to provide some robustness; we calibrate them later anyway. Basically this proved to be quite effective during the development; you just need to take care about the amount of data in each cluster. The result lines here indicate that there is not enough data: if you use a diagonal UBM, you get a better result in the end, which I believe is caused by not having enough data per cluster to fit all the parameters of a full-covariance UBM.
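The diagonal-versus-full-covariance argument is just parameter counting. With hypothetical sizes (2048 components, 60-dimensional features; the talk does not state the actual configuration), the full-covariance UBM has over fifteen times as many parameters to estimate from the same per-cluster data:

```python
def ubm_params(components, dim, full_cov):
    # Parameter count of a GMM-UBM: means, mixture weights, and either
    # full covariances (D*(D+1)/2 each) or diagonal ones (D each).
    means = components * dim
    weights = components - 1
    cov = components * (dim * (dim + 1) // 2 if full_cov else dim)
    return means + weights + cov

# Hypothetical sizes: 2048 components, 60-dim features.
full = ubm_params(2048, 60, full_cov=True)
diag = ubm_params(2048, 60, full_cov=False)
print(full, diag)
```

When a cluster contributes only a few hours of speech, the diagonal model's far smaller parameter count is the plausible reason it wins here.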
And then the sequence summarizing neural network, which doesn't work. I don't know if you have ever used it for language ID. Basically you take a sequence, a short utterance, and pass it through the network; there is a summarization layer where you average the frame activations, then you propagate through the rest of the network to the end, where you have the probabilities of the classes, and you do this over all the data, and that's it. The trick is that you can use the sequence summarizing layer as some sort of feature extractor and model its output differently later, and apparently that works a little bit better than just using the network to do the final classification.
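A minimal sketch of that forward pass, assuming mean-pooling as the summarization; the sizes and random (untrained) weights are placeholders. The `summary` vector is what can be reused as an utterance-level feature for a separate classifier:

```python
import numpy as np

rng = np.random.default_rng(4)

def ssnn_forward(frames, w_frame, w_out):
    # Sequence-summarizing network: a frame-level layer, then a
    # summarization layer averaging over all frames of the utterance,
    # then a softmax over the language classes.
    h = np.tanh(frames @ w_frame)       # per-frame hidden activations
    summary = h.mean(axis=0)            # summarization layer: one vector per utterance
    logits = summary @ w_out
    e = np.exp(logits - logits.max())
    return summary, e / e.sum()

frames = rng.standard_normal((150, 24))   # one short utterance, 24-dim features
w_frame = rng.standard_normal((24, 32))
w_out = rng.standard_normal((32, 6))      # say, 6 language classes
summary, post = ssnn_forward(frames, w_frame, w_out)
print(summary.shape, post.sum())
```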
We had some promising partial results with the sequence summarizing network when we tried it on LRE09, but here the task is so much tougher that the system was a complete disaster.
The open training data condition. It is almost the same scenario; we just had a little bit more variability in the features. Here I would specifically like to point out the multilingual bottleneck features, the ML7 system. You can see that if you include this whole machinery and all of the data, and a nice multilingual network that can really cluster the space of the languages, you get clearly the best system you can get, and that is also the case on the eval data.
Here I can even show you the difference when you use the uncertainty in the Gaussian linear classifier to obtain the scores: it is the last line versus the second line of the table. There is not so much gain on the dev data, because it is already close to whatever we are training on, but there is a nice gain on the eval data. If we had submitted just the single best system, that would probably have been the best submission, but of course we had not seen the results on the eval data before submitting, and we tried the whole fusion, which is slightly worse than the single best system.
Some analysis of the training data. We had time constraints, and from our experience it is always good, even necessary, to retrain the final classifier; I mean, when you have the i-vectors, to retrain the logistic regression or the Gaussian classifier to get your class posteriors. Unfortunately that was not enough in this case. For the open data condition we decided: okay, we have this UBM and i-vector extractor, let's just use those and retrain the classifier for our open data submission, and we didn't train a new UBM and i-vector extractor. After the evaluation, of course, we did it, and you can see that the column just below the submission is the one we would have got if we had taken the time and retrained both the UBM, the i-vector extractor, and the classifier on top. So we hurt ourselves quite a bit here as well.
So, features. As I already said, the bottleneck features are the best ones we were able to train. If you compare them with the MFCCs with shifted delta cepstra, there is a huge gap, and I think that a bottleneck system should be the basis of any serious language ID system nowadays.
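For reference, the shifted-delta-cepstra baseline features mentioned above can be computed like this; the defaults follow the common d-P-k parametrization (the classic 7-1-3-7 configuration pairs these defaults with 7-dimensional cepstra), though the talk does not state the exact configuration used:

```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    # Shifted delta cepstra: for each frame t, stack k delta vectors
    # computed at offsets t, t+p, ..., t+(k-1)*p, where each delta is
    # c[t+i*p+d] - c[t+i*p-d]; frame indices are clamped at the edges.
    n, dim = cepstra.shape

    def at(i):
        return cepstra[np.clip(i, 0, n - 1)]

    t = np.arange(n)
    blocks = [at(t + i * p + d) - at(t + i * p - d) for i in range(k)]
    return np.concatenate(blocks, axis=1)    # (n, dim * k)

# Toy 7-dim "MFCCs"; the SDCs are usually appended to the static features.
mfcc = np.random.default_rng(5).standard_normal((100, 7))
feats = np.hstack([mfcc, sdc(mfcc)])
print(feats.shape)
```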
The bottlenecks out of the network trained on the automatically derived units didn't perform very well, but of course that was a very new thing, and we didn't want to only run the standard bottlenecks and be done with the evaluation, so we tried it. It really depends on whether you can derive some meaningful units; more specifically, if the eval data matched the data on which the units were derived, then the units would correspond and the bottlenecks would probably be better. So far it doesn't work that well.
Now the French cluster. Yesterday many people presented results here without the French cluster, probably inspired by the NIST workshop, where it was excluded from the results. I think we should not do that. I spoke to LDC and the data are completely okay; people can recognise it. There is just a problem with the channel: they gave us one channel in training and another one in the test, they basically swapped them, and because this is a cluster of just two languages, we all built a very nice channel detector. That is something we should deal with, not a reason to exclude the French cluster from the evaluation. You might say, just please fix it; well, we would try, but we didn't have time to really do that, so all of the results I will show of course include the French cluster.
The results are pretty good if you take the multilingual bottleneck features, but you have to be careful even when you are doing analysis with the French cluster: the Creole in the French cluster actually comes from Babel, so if you happen to have some Babel data, think about whether to use it or not; you might be surprised how useful it is. Well, it didn't solve it all.
We of course tried a bunch of classifiers on top of the i-vectors, and I can say that they are all about the same. The classifier of choice is the simplest one, the Gaussian linear classifier that you can build right away out of the i-vectors. A colleague was experimenting with language-dependent i-vectors, where you extract the i-vectors with language priors involved; it was performing nicely, but not really beating the simple Gaussian linear classifier. We tried a fully Bayesian classifier, we tried a neural network and logistic regression, and you can see that all the columns here are pretty much the same.
We still have a few minutes, so I can briefly describe the automatically derived units. It is a variational Bayes method: we train a Dirichlet process mixture of HMMs, and we try to fit this open phone loop on the data to estimate the units. Then we use this to somehow transcribe the data, and use these transcriptions as the targets for training the neural network that includes the bottleneck, so that we get an unsupervised bottleneck. Maybe there is still some way forward here, and I hope that people at the JHU workshop will move this thing forward, and we will see. The good thing is that we were able to surpass the MFCC baseline on the dev set with this system, and I think that is already impressive.
So, the conclusions. Use a bottleneck system in your LID system; the Gaussian linear classifier is enough, and if you can, include the uncertainty in the score computation. We tried a bunch of phonotactic systems; they performed okay, but they didn't make it into the fusion. I would say it is always good to have some exercise in data engineering: look at the data you have, try to collect something, and work with the data, not only with the systems. We tried a bunch of other things like denoising and dereverberation; we didn't see any gains on the dev set, and there are only very slight gains on the evaluation data. For the phonotactic systems we were using Switchboard for training; we also tried a DNN for them, which was pretty bad. So that's all; thank you.
Okay, time for some questions.
My question is related to the stacked bottlenecks you presented. You mentioned that they are good for language ID, but you didn't get equally good results for speaker ID?

Well, we get good results for speaker ID; it is just that we get equally good results with bottlenecks that are not stacked, so you can train the first network only and take the classical bottlenecks. You don't need to do this exercise of stacking the bottlenecks and training another network.

But do they perform well for speaker ID? Is it not worth it?

I wouldn't say that it's worth it, but maybe we will be using it for SRE16.
And the other question: I guess that using these stacked bottleneck features and six UBMs, one per language cluster, your solution was quite heavy in terms of computation time, wasn't it?

Well, that is indeed a horrible system from the point of view of design, but it worked slightly better. I wouldn't be in favour of building such a system for a five percent relative gain, or even ten percent relative, but in an evaluation the numbers matter; usability is the second thing.
Thank you for the presentation. I'm sorry, because my question is also related to the stacked bottlenecks. I was wondering if you have made any analysis of the alignments provided by both the first bottlenecks and the stacked ones, to see if there is really an evolution in the process.

You mean the performance of the system, or...?

No, I'm talking about the alignment on your UBM, to see how the distribution of the features evolves.

I don't think we made this comparison.
I can ask a question as well. What is the frame context you are looking at, plus or minus ten frames? Did you explore that, or is it fixed?

We explored a bunch of numbers, and this is the ideal one. If you are using just the first network, I think you can play more with the context; you should aim for something like three hundred milliseconds of context. If you are using the stacked bottleneck, the context is longer, because you stack several bottlenecks in the second stage; that is why we use something like plus or minus ten.

I was thinking it is maybe more sensitive to background noise, because in your other systems you said you did some denoising, so I was wondering which is more sensitive to noise.

The bottleneck is pretty good at dealing with noise, actually. I had a paper at Interspeech where we trained a denoising autoencoder, and it works pretty well on the MFCCs. Then we used the denoised spectra to generate the bottlenecks and basically repeated all the experiments with the bottlenecks, and the gains are much smaller.
This is more of a comment on the French cluster you were speaking about, and I agree: it showed up as problematic, and you said ignoring it is not the answer. I would point out that we have a contradiction going on, in the sense that you labelled it a channel thing, but we know from LRE09 and other evaluations, where we have done narrowband audio pulled out of broadcast, that we haven't seen this massive shift before. So we have a contradiction: in the past we used telephony speech pulled from broadcast successfully. There is an interesting point here: LDC did say that it was not a labelling issue, there were no errors in there, but there is a chance that the formality of the language changes depending on whether you are on broadcast, where you might be in a high register, versus a low register on telephony. I just bring this up in general, because there are talks on this coming up; it may be something about the actual dialect shift that happens based on how the speech is produced, not so much the channel. We don't know yet.

I agree.

Okay, let's thank the speaker again.