Speech Transcript - BAT System Description for NIST LRE 2015

you know all that from you the

i will be presenting what we did for lre fifteen and probably

great part of you have already seen most of this presentation

at the workshop

we have changed a few things corrected some errors

and i will give you the presentation again

well let's start here it was as john already said a collaboration between

brno agnitio and politecnico di torino

i included almost the full list of people who participated in

our team that was a lot of concentrated fun during the autumn and we really

enjoyed that

let's go straight to the system and the data we used we decided

to participate in both nist conditions the fixed data condition and open data condition

in the fixed data condition we joined efforts with mit and they provided

some definitions of the

of the development set and the shortcuts so we split all of the data we

had available for

training and dev we kept sixty percent for training and forty percent for dev

and we also generated some short cuts out of the long segments that are

uniformly distributed from three to thirty seconds because that's what we were

expecting in the eval data according to the evaluation plan
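A minimal sketch of how such short cuts could be generated, assuming each long segment is chopped into consecutive pieces whose durations are drawn uniformly between three and thirty seconds. The function name `cut_short_segments`, the seed handling, and the choice to drop a remainder shorter than three seconds are illustrative assumptions, not the team's actual tooling.

```python
import random

def cut_short_segments(total_sec, min_sec=3.0, max_sec=30.0, seed=0):
    """Split one long segment into consecutive cuts with uniformly drawn durations.

    Hypothetical helper: returns (start, end) times in seconds.
    """
    rng = random.Random(seed)
    cuts, start = [], 0.0
    # Stop once the remainder is too short to form a valid cut.
    while total_sec - start >= min_sec:
        dur = rng.uniform(min_sec, max_sec)
        end = min(start + dur, total_sec)  # the last cut may be truncated
        cuts.append((start, end))
        start = end
    return cuts

for start, end in cut_short_segments(120.0):
    print(f"{start:7.2f} {end:7.2f} {end - start:6.2f}")
```

Truncating the final cut skews the duration distribution slightly, so a real pipeline might redraw instead.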

for the open training data condition a

we tried to harvest all of the data from our hard drives that we could find

we also asked our friends

from here from bilbao to provide some other databases and also najim from mit so

the databases that you might not be using in your systems regularly are kalaka three from which

we took european spanish and british english

and from al jazeera free speech corpus we took some arabic dialects otherwise it was

just all the data that we harvested for nist lre o nine from the radios

from the voice of america and so on just to let you know we didn't

use any babel data

for the classifier training we just used the babel data to train some

bottleneck feature extractors i will speak about it later

bottleneck features that's really the core of our system so

i think that most of you are already familiar with this architecture we train a

neural network to classify phoneme states it's just somewhat special this architecture because

it is stacked bottleneck so

the structure is here on the picture

stacked means that

we first train a classical network to classify the phoneme states then we cut it at

the bottleneck

and then stack these bottlenecks in time and train again

so that we train another stage and we take the bottlenecks

from the second stage from the second network so that's why the stacked bottlenecks
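The two-stage data flow described above can be sketched as follows. This only illustrates the stacking in time: the weights are random rather than trained, a single tanh layer stands in for each network cut at its bottleneck, and all dimensions and time offsets (plus minus five frames in the first stage, offsets of minus ten to plus ten in steps of five in the second) are assumptions chosen for the example.

```python
import math
import random

def splice(frames, offsets):
    """Concatenate each frame with its neighbours at the given offsets (edges clamped)."""
    last = len(frames) - 1
    return [
        [x for o in offsets for x in frames[min(max(t + o, 0), last)]]
        for t in range(len(frames))
    ]

def make_bottleneck(in_dim, bn_dim, rng):
    """Stand-in for a trained network cut at its bottleneck: one random tanh layer."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(bn_dim)]
    return lambda x: [math.tanh(sum(a * b for a, b in zip(row, x))) for row in w]

rng = random.Random(0)
feats = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(50)]  # toy features

# Stage 1: context window in, first-stage bottleneck out.
ctx1 = splice(feats, range(-5, 6))
stage1 = make_bottleneck(len(ctx1[0]), 8, rng)
bn1 = [stage1(x) for x in ctx1]

# Stage 2: stack first-stage bottlenecks in time, then take the second bottleneck.
ctx2 = splice(bn1, (-10, -5, 0, 5, 10))
stage2 = make_bottleneck(len(ctx2[0]), 8, rng)
sbn = [stage2(x) for x in ctx2]
```

With these assumed offsets every output frame depends on fifteen input frames on each side, which is the longer context the stacking is meant to provide.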

the effect is that

in the end they see longer context and

from our experience they work pretty well but if you do

some tuning you can

just use the first bottlenecks it's enough especially for speaker id i'd say

so for the fixed training condition apparently we had to use switchboard and the network

had approximately seven thousand triphone states in all

and we were trying a new technique with automatic acoustic unit discovery

and we trained the bottleneck on these and for that we used lre fifteen data

for the open training

condition

we used the babel data and later and most of all we trained another

network that has seventeen languages of babel and it is indeed the one

that you would like to use if you can use

all kinds of data

so the general system overview as i already said the basis of our

system are the bottlenecks either based on switchboard or babel data and then as a reference we

had the mfcc shifted delta cepstra system we had a pllr system we also tried

some

some phonotactic systems and modelled the

expected n-gram counts with the multinomial subspace model and techniques like that which were around

a few years back they didn't make it to the fusion

and our favourite classifier is just a simple linear gaussian classifier

and if you can it's good to include the i-vector uncertainty in the

computation of scores that helps quite a bit with the calibration and also

provides a slight

performance boost
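A minimal sketch of a Gaussian classifier with class-specific means and one shared covariance, which is what makes the scores linear in the input. To stay dependency-free the shared covariance is diagonal here, which is a simplification, and the optional `x_var` argument that adds a per-utterance variance to the shared one is only one assumed way of folding i-vector uncertainty into the score, not necessarily the formulation used in the submitted system. The names `train_glc` and `score_glc` are hypothetical.

```python
import math

def train_glc(vectors, labels):
    """Estimate per-class means and one shared diagonal variance (simplification)."""
    dim = len(vectors[0])
    means = {}
    for c in sorted(set(labels)):
        rows = [v for v, l in zip(vectors, labels) if l == c]
        means[c] = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    var = [0.0] * dim
    for v, l in zip(vectors, labels):
        for d in range(dim):
            var[d] += (v[d] - means[l][d]) ** 2
    # Floor the variance so that tiny toy sets do not divide by zero.
    var = [max(s / len(vectors), 1e-6) for s in var]
    return means, var

def score_glc(x, means, var, x_var=None):
    """Per-class Gaussian log-likelihoods; x_var optionally adds utterance uncertainty."""
    scores = {}
    for c, mu in means.items():
        ll = 0.0
        for d in range(len(x)):
            v = var[d] + (x_var[d] if x_var else 0.0)
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x[d] - mu[d]) ** 2 / v)
        scores[c] = ll
    return scores
```

In this sketch a larger `x_var` pulls the class scores closer together, so uncertain (for example very short) utterances yield less extreme scores, which is consistent with the calibration benefit mentioned in the talk.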

and

we had a new thing

a sequence summarizing neural network

i will speak about it

just later because it was a little bit of a disaster as you will see

the fusion

fusion was a little bit different we tried to reflect the nist criterion because

the c average was computed over the clusters and then averaged so

we reflected this and otherwise

we had one weight

per system and one bias per language

and the cluster priors we assigned the cluster specific priors for the data

of each cluster and the languages of all the

other clusters had the prior set to zero and we trained over

all clusters in the end so

i think that it improved the results on the nist metric quite substantially
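The scoring side of such a fusion can be sketched as below, assuming the fused score of a language is a weighted sum of the per-system scores plus a language-specific bias, and that a cluster-specific prior amounts to normalising only over the languages of that cluster while every other language gets zero prior. Training the weights and biases (for example with multiclass logistic regression) is omitted, and the names `fuse` and `cluster_posteriors` are hypothetical.

```python
import math

def fuse(system_scores, weights, biases):
    """One weight per system, one bias per language."""
    return [
        sum(w * s[l] for w, s in zip(weights, system_scores)) + biases[l]
        for l in range(len(biases))
    ]

def cluster_posteriors(fused, cluster):
    """Softmax restricted to the languages of one cluster (zero prior elsewhere)."""
    m = max(fused[l] for l in cluster)  # subtract the max for numerical stability
    exps = {l: math.exp(fused[l] - m) for l in cluster}
    z = sum(exps.values())
    return [exps[l] / z if l in exps else 0.0 for l in range(len(fused))]

# Two systems, four languages, the first two languages form one cluster.
fused = fuse([[1.0, 0.0, 5.0, 5.0], [0.5, 0.5, 9.0, 9.0]], [1.0, 2.0], [0.0, 0.1, 0.0, 0.0])
print(cluster_posteriors(fused, [0, 1]))
```

Languages outside the cluster cannot absorb any posterior mass here, however high their fused scores are, which mirrors a metric that is computed within each cluster and then averaged.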

and also we gave nist a system that was

a classical multiclass system so that they could do some between cluster analysis on it

this is because if we gave them just the one that we calibrated or fused

this way

they would be out of luck doing anything with that because of course

they asked for

log likelihood ratios not the log likelihoods i hope that the next time they

will they will rectify this

this is all that we had in the end in our submissions

most of the systems are stacked bottlenecks cd in the name means the cluster

dependent system i will speak about it just two slides later

and then there was this sequence summarizing network

and as you can see

it is clearly the worst system it would never make it to

the fusion but at the nist workshop i was presenting this as a

system that could almost perfectly classify the test data it's not the case there was

a bug of course

some dev data in the training data

so now it's the worst system

so anyway we were so scared that it worked so well on our test data

that we didn't include it in the primary system anyway so the red arrow shows

what we had as a primary system and

the alternate system would be the one with the

sequence summarizing network included what i report here is the c average the star

means that the calibration was performed on the dev set

i don't show the c average for the dev set because during

the development we were doing jackknifing

which is

not here in the slides anymore

and so these are the results on the test set

it's pretty good let's skip to the

results on the eval set

there is not much to say just that we see quite some calibration

loss on the eval data

and the

which was not the case on our test data especially on the on the fixed

set because it proved to be

a quite easier set than the one i designed for the open data condition

so that's it that's our system for the

fixed training condition

so now let's talk about those specialities we had there the one with a cluster

dependent i-vector system

the cluster dependent means that we train

a separate ubm per cluster and then the i-vector extractor and the rest

of the system is trained on the whole data

they provide

you can see there are six independent systems which provide the scores and then we

fuse them here

with a simple average to provide some robustness we calibrate them later anyway

so this proved to be quite effective during the development you just need

to take care about the amount of data in the cluster so

the results here indicate that there is not enough data and

if you use a diagonal ubm you have

a better result in the end which i believe is caused by not

enough data per cluster to fit all of the parameters of the full

covariance ubm

and the sequence summarizing neural network which doesn't work

it's

i don't know if you have ever used it for language id it's basically

you take a sequence a short utterance

and

pass it through the network there is a summarization layer

inside

where you average over the frames then you propagate the rest

till the end where you have the

probabilities of the classes and you do it all over again over all the data

and

and the that's it

the

and then you can also use the sequence summarizing layer

as some sort of feature extractor and model it later differently

and apparently it works a little bit better than just using the network to do

the final classification

we had some partial results with the sequence summarizing network when we

tried it on lre o nine but here the task is so much tougher

and

the system was a complete disaster

open training data condition

it's almost the same scenario just we had a little bit more variability in the

features here specifically i would like to point out the multilingual bottleneck features

that is the ml seventeen system

and

you can see that if you include this whole machinery and all of the data

and the nice bottlenecks that can really cluster the space

of the languages you get clearly the best system that you can get

and it is also the case on the eval data

here i can even show you what the difference is when you

use the covariance in the gaussian linear classifier to obtain the scores

it's the last line versus the second line of the table there is not so

much gain on the dev data because they're already

close to whatever we are training on but there is a nice gain

on the eval data

if we had submitted just the single system that would probably be the best

but of course

we hadn't

seen the results on the eval data before submitting and

we tried the whole fusion which is

slightly worse than the single best system

some analysis with the training data

we had some time constraints and we thought that

from our experience

it's only

necessary to retrain the final classifier i mean when you have the i-vectors to retrain

the logistic regression or gaussian classifier to get your class posteriors

but unfortunately that was not the case for the open data condition we decided

okay we have this ubm and i-vector extractor let's just use these and retrain

the system we will use for our submission to the open data condition

and we didn't train a new ubm and i-vector extractor of course we did it

after

and you can see that

the column just below the submission is the one that we would get if we

took the time and retrained both ubm i-vector extractor and the classifier on top of our

dataset

so we hurt ourselves quite a bit here as well

so features

as i already said the bottleneck features are the best ones that we were able

to train

if you compare it with the mfcc shifted delta cepstra there

is a huge gap and i think that

the bottleneck system should be the basis of

any serious

language id system nowadays

the bottlenecks out of the network that was trained on the automatically derived units

it didn't perform very well but of course

that was a very new thing and we didn't want to only

run the bottlenecks and

be done with the evaluation so we tried it you can see that still it

really depends if you can derive some

meaningful units and

more specifically if

the eval data would match the data you trained it on because

then the units

would correspond and probably the bottleneck would be better

so far it doesn't work that well

the french cluster yesterday i saw many people present the results here already without

the french cluster they were inspired by the nist workshop where it was

excluded from the results i think that we should not do that i spoke

to the ldc

and the data are completely okay people can recognise it there is just the problem

with the channel they gave us

one channel in training and another one in the test they basically swapped it

and because this is a cluster of just two languages we all built a very

nice channel detector

that is something we should deal with and not exclude the french cluster

from the evaluation

just please fix it

well we will try but we haven't had time to really do that so all of

the results i am showing here of course include the french cluster

and

there

they're pretty good if you take the multilingual bottleneck features but you

have to be careful when you're doing analysis with the french cluster

the creole from the french cluster is actually from babel so if you happen to have

some babel data be careful about it rather not use it or use it

carefully

or you might be surprised how you solved the problem

well you didn't solve it at all

we of course tried a bunch of classifiers on top of the i-vectors and

i can say that

it's all about the same

and the classifier of choice is the simplest one just the gaussian linear classifier that

you can build

right away out of i-vectors

and niko was experimenting with some different language dependent i-vectors where you extract the i-vectors

with the language priors involved it was

performing nicely but

but

not really beating

the simple gaussian linear classifier we tried a

fully bayesian classifier we tried a neural network and the logistic regression you can see

that all the columns here are pretty much the same

and

we still have a few minutes so i can briefly say something

about these automatically derived units it's a variational bayes method we

train a dirichlet process mixture of hmms and we try to fit the

open phoneme loop on the data to estimate the

units

and then we used this to somehow transcribe the data

and used these transcripts

as the source for training the neural network which would include the

bottleneck and then

then we would have some

unsupervised bottleneck

well maybe there is

still some hope for this and i hope that people at the jhu workshop

will move this thing forward and we will see the good thing is that

we were able to surpass the mfcc baseline on the dev set with this system

i think that's already impressive

so the conclusions

again

use the bottleneck features in your lid system the gaussian linear classifier is enough

and if you can just include the uncertainty in the score computation

and we tried a bunch of the phonotactic systems and they perform

okay but they didn't make it to the fusion

and

i would say that it's always good to have some exercise with the data engineering

and try to see the

data that we have and try to collect something and

work with the data not only with the systems

we tried a bunch of other things like denoising and dereverberation we didn't see

any gains on the dev set and there were very slight gains on the evaluation

set

for the phonotactic systems we were using switchboard to train them

and

we tried a frame level dnn which

which was pretty bad

so that's all thank you

okay time for some questions

so my question is more related to the stacked bottleneck that you presented there

you mentioned that it's good for language id but you didn't get such good

results for speaker id

well we get good results for speaker id it's just that we get as good

results with the bottlenecks that are not stacked so you can train the

first network

only and take the classical bottlenecks you don't need to do this exercise with

stacking the bottlenecks and training another network

well but they perform well for speaker as well is that right

i wouldn't say that it's worth it

but maybe we'll be using them in sre sixteen just don't use it as an

excuse

and the other question it's a

i guess that using these stacked bottleneck features and later six ubms for the

language clusters your solution was like in terms of time like are we can't

well that is indeed a

oracle system

from the point of view of the design but it worked slightly better

i wouldn't be in favour of building such a system for five percent relative

gain or ten percent relative but in the evaluation you know

the numbers matter the usability is

the second thing

more questions

thank you for the presentation i'm sorry because my question is also related to

the stacked bottleneck i was wondering if you have made any analysis on the

alignment provided by both

the first bottlenecks and the stacked ones to see if there is really an evolution

in the process of

alignment

you mean the performance of the system or something

no i'm talking really about the alignment on your ubm to see how

the distribution of the features evolves

i don't think we made this comparison

sorry

can i ask a question also about the context you're looking at plus minus ten

found that did you

is that something you kind of explored or is that fixed

of course this is the ideal number we explored

a bunch of numbers if you're having just the first network i think that you

can play more with the context

you should aim for something like three hundred milliseconds of context if you're

using the stacked bottleneck the context is larger because you stack

several bottlenecks and

use that in the second stage so

that's why there will be something like plus minus ten

i was thinking they're maybe more sensitive

to the background noise because in your other systems you said you did

some denoising so i was wondering what's more sensitive to noise the bottleneck is pretty good in

dealing with the noise actually i had a paper at interspeech where we trained a denoising

autoencoder

and it works pretty well on the mfccs

then we used the denoised spectra to generate the bottlenecks

and

well basically repeated all the experiments with the bottlenecks and the gains are

much smaller

discussion

so this is more of

a comment on the french cluster you were speaking about and i agree you know it

showed up as problematic and as you said ignoring it is not the answer to it

i would point out that we have a contradiction going on in the sense that

you labelled it as simply a channel thing right

but we know from lre o nine and other ones we've done

narrowband over broadcast and haven't seen this massive shift before

so we have that contradiction in the past we used this successfully with telephony speech

pulling it from broadcast and so forth there is an interesting point here which

again ldc pointed out they did say that it was not that

there were labelling errors in there but

there's a chance that the formality of the language changes based on whether you're in broadcast you

might be at a you know high versus low whereas in telephony so i

just bring doesn't bring these are in general because policing talks coming on the display

be one thing that may be something about the actual

dialect shift that happens based on how it's produced not so much the channel

we don't know yet

i agree

okay let's thank the speaker again

BAT System Description for NIST LRE 2015

Speaker & Language Recognition Systems

Oldrich Plchot, Pavel Matejka, Ondrej Glembek, Radek Fer, Ondrej Novotny, Jan Pesan, Lukas Burget, Niko Brummer, Sandro Cumani