So, for this past LRE I'll be presenting the MIT Lincoln Laboratory side of the presentation for the submission.
You know, we had a number of systems, so I'll focus more on the systems themselves, with a little bit of analysis and even some listening toward the end of the talk.
So in general I basically want to talk a little bit about the kinds of systems we looked at, and then which ones we ended up using on the evaluation itself, that is, the primary submission. Then a bit about the development data: how we ended up using that data, and the things we looked at to try to augment it. Then some of the evaluation results; like everybody else, we were kind of surprised when we first saw what happened on the evaluation compared to what we had seen on the development set. And then I'll close with some of the reasons and conclusions.
So in terms of systems, we looked at well over ten systems. Not surprisingly, the systems we looked at were either i-vector systems or DNN bottleneck systems in some way. On the more conventional i-vector subset of systems, we had an SDC-type system with cepstra, and then we also had a system that was basically the same system with pitch added to it. Then we had our set of DNN systems, with bottleneck features plus pitch, and with modeling of DNN posteriors. Then we even looked at things like an MMI system, kind of an older system as well, but in that case using bottleneck features instead of the conventional features we had used in the past.
For the open task, and let me emphasize this quite a bit, we also tried the multilingual system, where we used five of the Babel languages. And we also had a few other systems that were maybe on the slightly more exotic side: we had a kind of unit-discovery system, along the lines of what was described earlier, and we also had this DNN-counts multinomial model system, which is something that I think is going to be talked about a bit more in a later talk.
And it turns out that for calibration we really didn't do anything new that we hadn't done over the last few evaluations, so there wasn't really anything new on that side.
Next I want to talk a little bit about the development data. As you've probably heard by now, we had the supplied development data. We did a little bit of work on a variety of ways of augmenting that data, and in the end there wasn't really a whole lot that worked on the side of augmenting it. We basically ended up having kind of two views of the data: we had the full segments, I mean the full utterances, plus segments we derived from that same data, so we counted the data twice, except we got some sort of duration variability out of it. Of the other things we tried, like doing some warping or changes in the spectrum, none really seemed to help performance in the end.
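To make the two-view idea concrete, here is a minimal sketch of deriving shorter cuts from a full utterance; the cut durations and the function name are illustrative, not the team's actual tooling.

```python
def derive_subsegments(samples, rate, chunk_secs=(3.0, 10.0)):
    """Cut one full utterance (a list of samples) into non-overlapping
    shorter pieces. Training then sees the data twice: once as the
    full utterance and once as short cuts with duration variability.
    The chunk lengths here are assumed values for illustration."""
    pieces = []
    for secs in chunk_secs:
        step = int(secs * rate)
        # non-overlapping windows; any trailing remainder is dropped
        for start in range(0, len(samples) - step + 1, step):
            pieces.append(samples[start:start + step])
    return pieces

# a 30-second utterance at 8 kHz yields ten 3 s cuts and three 10 s cuts
utt = [0.0] * (30 * 8000)
cuts = derive_subsegments(utt, 8000)
```

Each derived cut would then be fed through the same front end as the original utterance.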
One thing we did: we did not retrain our whole systems, but we did retrain the backend stage. So basically we kept the front-end systems fixed on the data we had been developing with, but we did retrain the backend with basically one hundred percent of the data. That was, by the way, mainly for the fixed condition.
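The keep-the-front-end, refit-only-the-backend idea can be sketched as follows; the per-language mean model scored by inner product is a hypothetical stand-in for the actual backend, which the talk does not specify.

```python
def retrain_backend(ivectors, labels):
    """Refit only the backend on 100% of the dev data; the front end
    that produced `ivectors` stays frozen. Here the backend is simply
    a per-language mean vector (an illustrative assumption)."""
    grouped = {}
    for vec, lab in zip(ivectors, labels):
        grouped.setdefault(lab, []).append(vec)
    return {lab: [sum(v[i] for v in vecs) / len(vecs)
                  for i in range(len(vecs[0]))]
            for lab, vecs in grouped.items()}

def classify(means, vec):
    """Pick the language whose mean has the largest inner product."""
    return max(means,
               key=lambda lab: sum(m * x for m, x in zip(means[lab], vec)))
```

The point of the sketch is only the division of labor: the expensive front end is untouched, and the cheap backend is re-estimated on all available data.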
For the open set, of course, like everybody else, we looked at all the sources we had available, and of course there were plenty of sources out there. In the end, the only system that really benefited from having this additional data was the multilingual one. So basically most of the systems we used on the open set were the ones we had developed on the fixed condition, except of course the multilingual one, which needed all the extra data.
One thing I want to talk a little bit more about, and get into some specifics on, is that during that development we noticed that using all the data we had available actually did not help performance. So after doing some kind of early experiments, we decided to only add data in a few of the languages, and I'm going to talk about whether that was the best decision or not.
So, at least on the dev set, we did see that adding data for those languages helped the performance. In terms of the dev results we saw, this addresses both the cluster average and the detail, and what happened between the fixed set and the open set.
What this kind of shows is that, for the most part, Chinese and Iberian were the toughest ones on the dev set, but the performance in general seemed reasonable, so we were pretty happy with it; on average we were somewhere in the 0.10 neighborhood. The other observation here, I think, is that on the open set we did see a little bit of improvement over the fixed condition. Maybe we didn't see as much as we could have expected, but we saw some improvement, so that was also reassuring.
Now I want to talk a little bit about the evaluation results. Well, on the evaluation results, the bad part: we got this big discrepancy between what we saw during development and what we saw on the evaluation set. You can see the gap right away, and I'll get into some of the reasons later. We ended up submitting a five-way fusion of systems: we had the unit-discovery system, we had the counts system, we had the bottleneck-feature systems, and we had the pitch, kind of conventional, system that we trained. The performance that we ended up obtaining was a Cavg of a little bit below 0.18. And of course I'm also showing here this idea of what happened with the French cluster, contrasting both the performance we had as a whole and the performance when not counting the French cluster.
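For reference, the average-cost number quoted here has roughly this shape; the sketch below works from hard decisions with equal miss/false-alarm costs and a 0.5 target prior (all assumptions on my part), and is not NIST's actual scoring tool.

```python
def c_avg(trials, languages, p_target=0.5):
    """Simplified LRE-style average cost. For each target language,
    combine its miss rate with the average false-alarm rate measured
    against each other language, then average over languages.
    `trials` is a list of (true_language, decided_language) pairs."""
    n = len(languages)
    total = 0.0
    for tgt in languages:
        tgt_decisions = [d for t, d in trials if t == tgt]
        p_miss = sum(d != tgt for d in tgt_decisions) / len(tgt_decisions)
        p_fa_sum = 0.0
        for non in languages:
            if non == tgt:
                continue
            non_decisions = [d for t, d in trials if t == non]
            p_fa_sum += sum(d == tgt for d in non_decisions) / len(non_decisions)
        total += p_target * p_miss + (1 - p_target) * p_fa_sum / (n - 1)
    return total / n
```

A perfect classifier scores 0; the 0.18 mentioned above is on this kind of scale.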
And one other observation here is that, like everybody else, we ended up fusing systems, and we had a greedy approach, roughly along the lines of what you saw in the last presentation. After looking at a big evaluation of all the fusions of three-way systems and five-way systems, we ended up with this five-way fusion system. And for the most part we were not necessarily that far off from the best performance we could have obtained: when we look back at what our best selection would have been, we actually would have been very little off from kind of the oracle system.
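The greedy selection loop described here might look something like the following; the simple score-averaging fusion and the names are illustrative assumptions, not necessarily the submission's actual fusion code.

```python
def greedy_fusion(system_scores, dev_cost, max_size=5):
    """Greedy forward selection for system fusion: start empty, keep
    adding the system whose inclusion most lowers a dev-set cost on
    the averaged scores, and stop when nothing helps or max_size is
    reached. `system_scores[name]` is a list of per-trial scores and
    `dev_cost` maps a fused score list to a cost."""
    chosen = []
    best_cost = float("inf")
    while len(chosen) < max_size:
        candidate, cand_cost = None, best_cost
        for name in system_scores:
            if name in chosen:
                continue
            members = chosen + [name]
            # fuse by averaging scores trial by trial
            fused = [sum(s) / len(members)
                     for s in zip(*(system_scores[m] for m in members))]
            cost = dev_cost(fused)
            if cost < cand_cost:
                candidate, cand_cost = name, cost
        if candidate is None:  # no remaining system improves the cost
            break
        chosen.append(candidate)
        best_cost = cand_cost
    return chosen, best_cost
```

The same loop, run to exhaustion, is also what lets you read off how close the chosen fusion came to the best ("oracle") subset in hindsight.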
Other than that, one of the observations is that the best single system we had for this evaluation was the bottleneck-feature system, closely followed by the others.
Of course, this is something that has been talked about quite a bit by now: there was this issue with the French cluster, and there were really two things that came up that we talked about. The first one was that it looks like we were really building a channel detector, which is kind of what was mentioned earlier; and then there were all the things we heard from LDC at the workshop, that there might be other issues, not only channel. Before I forget, I want to draw a connection to the earlier discussion: we did do a lot of analysis on the channel issue in 2009. One thought, and maybe here I can say something different from what everybody else has said, is that what we analyzed in 2009 was mainly based on the language, while here we have one cluster with two classes, which may or may not add to the discussion of why this seems to be pinned on the channel side, even though apparently, when people listened to it, those differences might not be there.
So, going into more detail on this issue with the French cluster: we did see here that things do seem to line up by channel, and the big factor, as I think was obvious earlier, had to do with the fact that for one of the languages we just did not have any data on that channel. So it seemed like, with the channel we saw on the eval not being available in the dev set, the system was keying more on the actual channel instead of the language. And you can see here that there doesn't seem to be a big difference in being able to tell the two classes apart; it seems to be more about the channel element.
One thing we did do is say, well, maybe this is just the nature of the problem. So we looked at a different cluster, the Slavic cluster, which was Polish and Russian, and when we look at that cluster we didn't seem to observe the same issue with this kind of channel alignment. So, even though there is a channel element there as well, we could also tell the classes apart a lot better than we were able to do on the French cluster.
Now moving to the open condition: the main difference here, like I said, was that we had this multilingual bottleneck-feature system, and that actually replaced the bottleneck system we had on the fixed condition. Once again, the performance here was a little bit better; not substantially better than on the fixed condition, but a little bit better. And like I said, the multilingual bottleneck seemed to be the one that drove the difference, the one thing that was actually different in this case.
Like I said earlier, one thing that came up, and we were a little bit surprised by it, was the fact that using extra data did not seem to help on the development set. Here you're looking at what happened in the case of Arabic: we added data in a number of ways, and you can see in the lower right corner there that for the most part it didn't seem to make a big difference. There's only one particular scenario where we got a little bit of improvement, but it's not like, as we add more data, we seem to consistently be able to get improvements.
One thing that also came into play is what happened as we looked at things after the evaluation. And one thing, and I think others have also addressed some of this, was that even though we did not see any improvements by adding data on the development set, we would have gotten substantial improvements if we had carried all that data into the eval. Of course, one issue is that a lot of that has to do with this labeled data that had that particular channel in it, and whether there was some data in there that is essentially the same data or not. We didn't go in and precisely check whether these are exactly the same cuts; of course we're expecting that maybe they're not necessarily the same. But it would have substantially changed our performance, maybe on the order of thirty to forty percent.
Another thing we did a little bit of after the eval was to keep looking at these multilingual bottleneck features, and once again this is on our 2009 setup. So the one thing we saw is that we also get some improvements with the multilingual bottleneck-feature system as we change the diversity of languages, but this is not completely linear, meaning it doesn't mean that as we go from five to seven, five to ten, and five to fifteen languages it's always improving. It's still something we're getting a better handle on, but there's obviously some relationship between the diversity of the languages we use to train the network and the performance. Once again, at this point we've probably seen as much as ten to fifteen percent improvement.
Another thing that we did, and I actually volunteered for this, was that I tried to listen to the languages that I know, so Spanish and English. The idea was, well, for our system, and once again this is my assessment and I'm not a linguist, if I listen to some of the errors we had, is there anything I can hear that seems to be systematic? Once again, this is for our submission. We had a number of errors, probably on the order of two thousand for the whole eval, so what I ended up doing was just randomly picking fifty on each of these two languages, listening to them, and trying to figure out if there was anything that seemed somewhat systematic.
In the Spanish case there were two things that seemed common. The first one, which was a little bit surprising to me, was that it seemed like we had quite a problem with European Spanish. Once again, I don't know exactly why, but one idea that comes to mind is maybe it was somewhat underrepresented in the training. And by the way, when I say Spanish errors, I mean Spanish errors: I took the Portuguese cuts out of the Iberian cluster, so I'm only looking at errors among the three Spanish classes.
One other point is that the examples I sampled covered errors across all durations: I probably listened to maybe a handful of forty-second cuts, maybe ten or so on the order of ten seconds, and maybe seventy percent of the cuts were around, you know, the low twenties down to the three-second part of the range. And that applied for both cases, actually.
One thing I also want to mention is that on the Spanish side we actually had between five and seven cuts that had either non-speech on them, or things like laughter or something. So, I mean, how much you should be able to detect language from that, I'm not quite sure. Obviously, having five cuts like that in there seems like it might be a big number, but whether that usefully extends to the whole set of errors we had is not clear; at least that's the observation on this limited set of data that I listened to. On the English side we also had this issue of basically empty, or nearly empty, speech files. Most of them were on the three-second condition, but even on some of the ten-second ones, we would have this nominal ten seconds of speech in the cut, and then you'd see that the person comes in at the beginning, maybe speaks for a second, and then there's nothing left, yet that gets detected; and then they come in again, and maybe there's laughter or something. So there was a little bit of that. Once again, I guess to some extent that's reality, but it's something peculiar that I wanted to bring to your attention.
The other thing was, once again on this limited sample, that on the English side it seemed like most of the errors I saw were between British English and American English. There were maybe five errors in there that in one way or another involved Indian English, but most of them, maybe on the order of eighty percent or so, were actually confusions between British English and American English. And I think that's actually a particularly hard pair.
We're going too much over time, so, quickly, as a summary, let me just go through it. We did see a little bit of improvement from the added features. Needless to say, bottlenecks and DNN-based i-vectors dominated. We're still kind of parsing out this issue with the French cluster; I actually saw a presentation yesterday where I think they folded in some of the eval data, kind of cross-validating, training with some of that data, and it seemed like they got really big improvements by using a little bit of that data for training. So it does seem like having the channel represented in the training would improve performance quite a bit. And there was also this issue of adding more data helping on the eval but not on the dev. Everything else, you know, hindsight is twenty-twenty. And once again, I guess the general question for the future is: should we focus on some particular conditions, or think about it in terms of robustness? And now I think we have time for some questions.
So I was wondering whether maybe these errors that we have in the Spanish clusters could also be due to, like, issues with the labels, because it raises the question: if you get Spanish from the south of Spain, is it closer to Caribbean Spanish, or to the regular Spanish from Spain?
I mean, in my personal experience, I find that people from, say, Andalucía, for example, sound very close to the people in Puerto Rico, way closer than people from Madrid or anywhere else. And to that point, I saw a lot of those errors: like, people that I would hypothesize had been from the south of Spain. Once again, from what I heard, at least for our system it seems like this particular confusion was something that actually showed up. But absolutely, in my limited understanding and knowledge about this, I would have expected that, because the way people from Andalucía sound to me, they kind of draw out the last syllables, and it seems like that is precisely the way people in Puerto Rico would do it.
Question?
Thank you for your presentation. One of your slides said that for your open-set task you didn't use all the data sets for training this out-of-set model, right?
All right, so, if I recall correctly, what I said was that we only used the open data for the multilingual model, not a lot of the open-set data otherwise. Once again, if I remember, not necessarily all the data: the multilingual was trained on the five Babel languages.
Ah, sorry, and you mentioned here that adding more data did not solve the problem. Which data did you add? Was it an analyzed addition, or just blind?
No, absolutely; like I showed, I think on that Arabic slide, that's just an example of one language. In this case we basically plotted, as we were adding more data, the error rate we were observing on the test set. So obviously, like everybody else, you're doing the best you can on the dev set and hoping it makes a good prediction of what you're going to see on the eval set. What we observed here is that adding more training data, and this is just one example, did not seem to help. The training we started with included all the data that was in there from the sources we had available, and it didn't actually seem to work, so we backtracked and only added training data in some of the languages. Once again, we didn't necessarily go back and redo all of this on the eval data systematically. We did do the analysis of whether, if we had run the systems with the whole training data, that would have been better, and it mostly would have, because of the French cluster, right, because we would have had labeled data representing that channel. But we have not done it systematically, language by language, to say on this language it would have helped and on that one it would have hurt; we have not done that.
Okay, thanks.
Any other questions?
I'm going to ask a question on the slide that you had up here, with the errors from the listening. What I'm going to ask is whether that's really the speech-activity assignment there, because, you know, when we did our test, if you threw away all the speech and just used what you thought was silence, you could still get about five percent.
I'm just not sure, right; I can't necessarily say whether it's channel dependent.
Other questions?
Okay, let's thank the speaker again, from MIT Lincoln Laboratory.