Yeah, I will describe the 2009 NIST Language Recognition Evaluation, LRE09. NIST coordinates this evaluation, under U.S. government sponsorship, and this work was largely done with Craig Greenberg, my colleague in the Multimodal Information Group.

So the 2009 evaluation was the fifth in the series of NIST-coordinated LREs. The first was in '96, and there were evaluations in 2003, 2005, 2007, and 2009; one might suspect that there could be another evaluation in 2011. The changes in '09 were in the nature of the data (I'll say more about that), the treatment of dialects and mutually intelligible languages, and the set of evaluation test conditions; we will get to those. The data for LRE09 are indicated there. There were eighteen total participating sites.

The prior NIST evaluations used conversational telephone speech. This involved paying subjects to make a single call in their native language, in the U.S., preferably under controlled channel conditions. This paradigm is becoming expensive and impractical; it's hard to pay people to make a single call these days [unclear].

So LRE09 attempted to use primarily found data, in this case found data from Voice of America. This was sampled by the Linguistic Data Consortium, who actually found data from three different years of Voice of America [unclear]. The LDC has separately reported at other conferences on this data collection. A feasibility study of using this data for LRE was done beforehand by researchers here at the Brno University of Technology; that was a key part of selecting the data for this evaluation. The selected segments that were actually used for testing were also used for development work. Segments
were determined by wanting them to involve narrowband speech, and we wanted to get as many different speakers as possible. The evaluation also used CTS data that had been collected previously but for various reasons had not been used in the prior evaluations.

So here is our list of target languages for this evaluation. An advantage of using found data is that we can have more target languages; indeed we had twenty-three in this case. In some cases we just list as languages what previously would have been treated as dialects: American English and Indian English, or also Hindi and Urdu. So in this one we just treat each of these as a single target language; we'll talk about the language pair condition.

Anyway, we specified eight language pairs as being of particular interest. They are either languages that are similar, such as the English dialects, or Hindi and Urdu, which may be viewed as a dialect distinction; others are languages that are in many cases mutually intelligible, such as Bosnian and Croatian. Haitian Creole and French are of interest, and we included such pairs as Cantonese and Mandarin, and Spanish and Portuguese. So we specified these eight as being of particular interest for those who wanted to investigate.

So the evaluation consists of a long series of trials for each of the conditions, and as in the past the test segments are approximately thirty, approximately ten, or approximately three seconds of speech. For each trial you have a target language hypothesis and an alternative hypothesis, and for each trial we require a decision and a score.

We specified three different test conditions this year. The closed-set condition: this is the traditional condition, it has been part of all the evaluations, and it is the required condition. In this, for each test segment you have one of the target languages as the hypothesis, each segment really is in one of the target languages, and the alternative hypothesis is that it's a different target language, one of the other twenty
two. In the open-set condition, the alternative hypothesis is not simply that it is one of those twenty-two languages; it could also be some other language, an unknown out-of-set language. And finally we introduced the language pair condition, which is designed to look at just distinguishing pairs, so that the hypothesis in all cases is a single language: the target language is, say, English, and the alternative is that it's French. With twenty-three target languages there are two hundred fifty-three pairs of languages you could look at this way, and systems were invited to do all of them. Only a couple chose to do so; others did selected ones, in particular the eight I mentioned above.

This gives you some indication of the training and test segments that were provided. There's the source for each language, and it indicates the number of test segments of each duration. We were providing VOA training data, and all the CTS data from previous evaluations was also available. For the VOA training we provided lots of data, not just limited to these selected segments; on the order of terabytes, on drives that were distributed to people. For each language we had about two hundred segments of each duration from the VOA data [unclear]; we had up to three or four hundred, depending on availability. And we had VOA training data for all the languages for which there was no previous CTS data. For a number of languages the available training data was CTS, but the evaluation data could be the other way.

So the numbers are there. The eighteen sites are listed here; many of them are represented in this room.

The evaluation metric was the traditional metric we have used. It is essentially something like a total error rate: we equally weight
the cost of a miss and the cost of a false alarm, and take an average of the miss rate and the false alarm rate, but we average that over all possible target languages and all possible alternative languages, and the metric is computed this way. There's also a weighting indicated for the open-set condition, of how we weight the out-of-set alternative relative to the actual target languages.

So, to turn to results. In terms of the official metric, these are the results for the systems, the average cost scores for the closed- and open-set conditions. The scores are cumulative, so that the three-second score is the total of the green and the yellow and the red bar. Closed set is to the left, open set on the right. We have labels on the systems to indicate the same system in closed and open set. Traditionally we have not identified systems with their scores in public presentations. [unclear] It's really going from ten seconds to three seconds that takes the big performance hit. Here are all three: closed set, open set, and language pair. Only two sites did the full language pair condition, and we see relatively good performance, as you might expect, on language pairs.

And we traditionally put these on DET curves. There are the DET plots for the closed set: on the left we have the various systems for thirty seconds, and on the right, for one system, to give a flavor, the different curves for thirty seconds, ten seconds, and three seconds. The linearity of most of the plots suggests underlying normal distributions. This was open set [unclear]. On the right is a plot of the closed set and open set for each of the three durations, to give you a sense.

Some findings and analysis. I will talk about the effect of averaging, or what others term pooling; in fact we're moving away from the term "pool". We had a long discussion at the workshop: is it right
to average DET curves across multiple languages, when we have the same data in multiple trials for the multiple languages? We didn't resolve that, but we do see some funny things that happen, in particular for the language pairs. Here are two systems for Russian and Ukrainian. One curve is where Ukrainian is the target language, the other where Russian is the target language. There is inherently a symmetry between these curves, because this is the pair condition: the only possibility is that it's Russian or Ukrainian. And if you average those, pooling them together, what happens? For system one the combined curve, in black, falls right through the middle; that's what you'd expect. For system two the combined curve shows lesser combined performance. One thing is that the score distributions for the two languages have different shapes. Another thing to note is that the crosses show the actual decision points and the circles the minimum average cost points. For the first system they're right on top of one another in the middle, good calibration. For system two, on the right, they're way at the extremes in the individual cases and sort of in the middle for the combined curve, indicating poor calibration; combine them, and that is what you see.

So as I said, there are questions: is it the right thing to average across languages? We have done so.

If you look at language pairs, this is for one system, one of the systems that did all the language pairs. George Doddington created the curve, I believe. This looked at all of them and shows the ones that have the highest average error rate; all the others were below two percent. Most confusable, at the top, were Hindi-Urdu and Bosnian-Croatian. These were among the eight pairs of interest, and these are certainly mutually intelligible; they may be considered dialects, and indeed it's at least arguable that these language or dialect distinctions are based on [unclear] political boundaries
rather than on more inherent language patterns. In any case those two were the most confusable. Next were Russian-Ukrainian, the English dialects, and Dari-Farsi, which are generally considered mutually intelligible. [unclear] Creole-French is in the list. Two that were in our pairs of interest, Cantonese-Mandarin and Portuguese-Spanish, [unclear]; maybe languages that might be regarded as similar in fact aren't, at least for the system involved, all that hard to distinguish.

We can also look at the DET curves with regard to particular target languages [unclear]. Here we do so looking at the training corpus type. These show the various languages that had training on the VOA data, and then we look at the ones with training on CTS data. [unclear] languages that had been done previously, in many cases with CTS training [unclear]. Spanish, Korean, and Mandarin, for example, were among the best performing languages. Worst performing were several Indian languages, and there are other confusions there: Hindi, Urdu, Indian English.

We also look at performance by the test corpus, whether VOA or CTS, at thirty, ten, and three seconds. One thing we were sort of pleased with: we had just introduced the VOA data, and we wondered how well it would be recognized, and in fact the overall performance was pretty comparable, even though for some of the VOA languages the training was with CTS. For some reason, and I don't know that we know why, the CTS curves here appear less linear.

And some history. We like to look back over the course of several evaluations at how things have changed: are we seeing better performance? [unclear] Okay, we have curves over the
evaluations. The numbers of target languages have gone up in recent evaluations, and the number of participants has gone up too in the most recent evaluations. [unclear] with an increasing number of out-of-set languages [unclear]. For thirty seconds [unclear]; the data changed, the languages changed, the data type changed. But we think we are seeing improved results for three seconds [unclear] over the past couple of evaluations [unclear]. It was also noted that this year's three-second performance was at the level of thirty-second performance in 1996.

Here we do some history looking at the best systems. The caveat: differences reflect the systems, and some changes in the task definition, and of course different data, and it's hard to sort those out [unclear]. Nonetheless, what can we say about how performance has fared? I think we hinted at that before: at three seconds, looking at '09 and '07, we see performance improvement, but at ten seconds [unclear], and at thirty seconds maybe we regressed a bit.

We can also look at a couple of individual languages [unclear], the same language in '09 and '07 [unclear], and the colors are the three durations. Here are a couple of languages [unclear]. For Korean [unclear] improvements throughout [unclear] in 2009 [unclear]. For [unclear] we see overall what we are seeing for the evaluation as a whole: improvement at three seconds, little change or even regression at thirty seconds. And of course there are languages that are new this year as well.

Also here is a dialect pair that had been done previously,
American English and Indian English. We do see improvement in 2009 at thirty seconds and ten seconds, and even more [unclear] at three seconds, for American versus Indian English. And going to Hindi-Urdu, known to be a challenging language pair, we see improvement at thirty seconds and at ten seconds; at three seconds, well, maybe there is improvement, but at three seconds Hindi-Urdu is a hard comparison, with performance little better than random.

In other words, in summary: we experimented with a new data collection paradigm, and we're reasonably satisfied with it producing an effective evaluation. But we may have trouble repeating this trick; finding the right data for future evaluations remains a challenge. We saw continued performance improvement, mainly on the shorter segments, for both closed- and open-set conditions. The language pairs condition was introduced here [unclear]; it is of interest and poses challenges, and will likely be part of any future evaluation that we do. There is an issue we've argued about, about whether to average across languages [unclear], and I think that concludes my remarks. Thank you.

[unclear] This is just a comment on comparing across the different NIST evaluations, with the number of target languages. If there are more languages [unclear] the hypothesis [unclear]; if there are more languages you're less sure about which one it is, so that makes it a little bit harder, but not a lot. If you were doing just identification, obviously the number of languages has a strong effect [unclear]. So arguably, if we just apparently tread water but made the
problem harder by introducing languages people haven't seen before, I'd argue that that would be [unclear]. It's also an argument for the language pairs condition, which [unclear].

Do you plan to use Voice of America for the next NIST evaluation, or other [unclear]?

We need to discuss this with [unclear]. I don't think we can hope to just get more Voice of America data. We're exploring other similar types of data that may be available and that have multiple languages. Are there any recommendations that people have?

I'm honestly wondering why [unclear] identification [unclear]. I would like to [unclear] identification, and I wonder why [unclear]; you could try [unclear] identification with recognition [unclear] accuracy. I always wonder; I want to see how well it does.

I am not sure: are you saying you're interested in distinguishing particular closely related [unclear], or are you saying, I think, the identification problem: the language is one of N possibilities, which one is it?

[unclear] I'm interested in identification; I'd like to see a comparison [unclear].

But I mean, the language pair condition does that [unclear].

[unclear] Maybe that's something we can talk about [unclear].

[unclear] I think that's been discussed [unclear]
[unclear] the pooling [unclear] and the equal error rate [unclear]; it's part of that discussion, so I'm not going to start that again. But I think I have something useful to say about averaging. If you're doing identification, given a speech segment, you're told there are N languages and the speech segment can be in one of these N languages. Then you also have to assume some prior, so you can assume a flat prior over those languages, what you think is likely before you look at the speech. That would be the identification problem. What NIST has done is to solve this problem, if there are N languages, N different times, with N different priors. The first has target language number one with a prior of a half, and all of the other languages share between them a probability of a half. So you do that trial, and then you go to the next one, target two: you say this one has probability of a half, and all the others have a smaller probability. So what NIST does is essentially identification, given that prior, N times, and you average all those. That's the secret.

So, okay, we need to go on to the next speaker. [unclear]
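
The averaging described in the talk and in the final comment can be sketched roughly as follows. This is a minimal illustration and not NIST's scoring software: the function name `average_cost`, the dictionary layout, and the toy error rates are all hypothetical. It assumes the closed-set case, equal costs for misses and false alarms, a target prior of one half, and the remaining one half shared equally among the other languages, as the commenter describes.

```python
from math import comb


def average_cost(p_miss, p_fa, p_target=0.5):
    """Closed-set average cost over all target languages (illustrative sketch).

    p_miss[t]      : miss rate when t is the target language
    p_fa[(t, o)]   : false alarm rate for target t on segments of language o
    p_target       : prior given to the target; the rest is shared equally
                     among the other languages (assumption taken from the talk)
    """
    langs = sorted(p_miss)
    n = len(langs)
    p_nontarget = (1.0 - p_target) / (n - 1)  # prior for each non-target language
    total = 0.0
    for target in langs:
        cost = p_target * p_miss[target]
        for other in langs:
            if other != target:
                cost += p_nontarget * p_fa[(target, other)]
        total += cost
    return total / n  # average over the N "detection problems"


# Toy numbers, made up purely for illustration.
langs = ["english", "french", "hindi"]
p_miss = {"english": 0.10, "french": 0.20, "hindi": 0.30}
p_fa = {(t, o): 0.05 for t in langs for o in langs if t != o}

c_avg = average_cost(p_miss, p_fa)

# Number of unordered language pairs among 23 target languages.
n_pairs = comb(23, 2)
```

With twenty-three target languages, `comb(23, 2)` gives the two hundred fifty-three pairs mentioned for the language pair condition. The open-set weighting for out-of-set languages is left out here because the talk does not give its value.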