Good morning. What I would like to present today is our language recognition 2009 submission, together with the work we did after the evaluation to figure out what happened, because we saw a big difference between the performance on our development data and on the actual evaluation data. First I will explain what was new in language recognition in 2009, namely some new data; then I will go through a very quick and brief description of our whole system; then I will concentrate on the issues of calibration and data selection and how we resolved the problems with our original development set; and finally I will conclude our work.

So what was new in 2009 was that a new source of data came into language recognition. These data are broadcasts from Voice of America; a big archive covering many languages was found, and out of this archive only the detected telephone calls were used. This data brought big variability compared to the original CTS data we had always used for training our language ID systems, so it brought new problems with calibration and channel compensation. These are the languages present in the Voice of America archive (I would have to check whether they are all still present there). As you can see, the number of languages is very large, and it makes a very nice dataset to test our systems on and to improve the ability of language recognition systems to classify more languages. For the 2009 NIST LRE these are the twenty-three target languages; the bold ones are the languages where the only data we had came from the Voice of America archive, so there was no CTS data for training on those languages. For the other languages we also had conversational telephone speech
data recorded by the LDC in previous collections and also for the 2009 evaluation. So we had to deal with this issue and do proper calibration and channel compensation.

What motivated us, after the evaluation, to do this work, to rework our development set and run a lot of experiments, was that we saw a huge difference between the performance on our original development set and on the evaluation set, which was scored by NIST. All of the numbers you will see here are the average detection cost defined by NIST. At the language recognition workshop there were a lot of discussions about the crafting of development sets for these systems. Some people created rather small and very clean development sets; we had a very huge development set containing a lot of data, which brought some computational issues when training the systems. We decided to go with this big development set, and in the end it did not turn out to be the best decision, but we had to live with it.

Now a presentation of the system we had in the submission. We had two types of front ends. The first are the acoustic front ends, which are based on GMM modelling, and the features are MFCC-derived; these are the popular shifted delta cepstral features. On this side we had a JFA system; we tried a new feature extraction based on RDLT; and then we had a system discriminatively trained with the maximum mutual information criterion using channel-compensated features. We also tried a plain GMM with RDLT features without any channel compensation. We performed vocal tract length normalisation, cepstral mean and variance normalisation, and voice activity detection using our Hungarian phoneme recogniser, where we mapped all of the phonemes to speech and non-speech classes to decide what is speech. Then there is a standard JFA system.
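Since the shifted delta cepstral features come up throughout, here is a minimal sketch of how they are computed. The exact configuration used in the system is not stated in the talk, so the sketch assumes the common 7-1-3-7 setup (seven cepstra per frame, delta spread d=1, shift P=3, k=7 stacked blocks); `shifted_delta_cepstra` is an illustrative name, not something from our toolkit.

```python
import numpy as np

def shifted_delta_cepstra(C, d=1, P=3, k=7):
    """Shifted delta cepstra over a cepstral matrix C of shape (T, N).

    For each frame t, k delta blocks are stacked, the i-th computed at
    an offset of i*P frames: C[t + i*P + d] - C[t + i*P - d].
    Edges are handled by replicating the first/last frame.
    Returns an array of shape (T, N * k).
    """
    T, N = C.shape
    # pad d frames before and (k-1)*P + d frames after the signal
    Cp = np.pad(C, ((d, (k - 1) * P + d), (0, 0)), mode="edge")
    blocks = [Cp[i * P + 2 * d : i * P + 2 * d + T] - Cp[i * P : i * P + T]
              for i in range(k)]
    return np.hstack(blocks)
```

With seven input cepstra this yields the usual 49-dimensional SDC vector per frame.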
In this JFA system, as you can see, there are this time of course no eigenvoices; only channel variability is present. So we have a supervector of GMM means for every speech segment, which is channel dependent. The channel loading matrix was trained using the EM algorithm, with five hundred sessions per language used to train it. The language-dependent supervectors were then MAP-adapted, using relevance MAP, and also trained on five hundred segments per language. This is the core acoustic system here, because it also uses our RDLT features; as you will see later on, we decided to drop the RDLT features and use just the JFA system with the shifted delta cepstra.

For the RDLT, we tried a new discriminative technique to derive our features, based on region dependent linear transforms. This technique was introduced in speech recognition, where it is known under the name fMPE. The idea is that we have some linear transformations, and we take linear combinations of these transformations to form a new feature which should be discriminatively trained. There is a picture here, and I will try to at least briefly describe what is going on. At the start we have some linear transformations, which in the beginning are initialised to produce just the shifted delta cepstral features. We have a GMM, trained over all languages, which is supposed to select the transformations: in every step it provides the weights. So for every twenty-one frames, we take those twenty-one frames of MFCCs, put them into the GMM, take the most likely Gaussian components, which provide us the weights, and combine the linear transformations according to
these weights. Usually it happened that only one to three Gaussian components were nonzero for these twenty-one frames, so not all of the transformations were linearly combined; all the other weights are set to zero. So we take the linearly combined transformations, sum them up, and then there is a GMM which evaluates the resulting feature; according to the discriminative training criterion we update these linear transforms, and then we move on by another twenty-one frames and train the system further. In the end, after the training, these are the features we feed into our JFA.

The next acoustic system was a GMM which was discriminatively trained using the MMI criterion, on features which were channel compensated. That was it for the acoustic subsystems; now for the phonotactic side. The core of our phonotactic systems are of course our phoneme recognisers. The first one, the English one, is a phoneme recogniser based on triphone acoustic models from an LVCSR system, with just a simple language model. The two other phoneme recognisers, the Russian and the Hungarian, are neural network based: the neural network estimates the posterior probabilities of the phonemes and feeds them to an HMM for decoding. These three phoneme recognisers were used to build three binary decision tree language models and one SVM system, which was based on the Hungarian phoneme recogniser; there, trigrams were used, and the SVM was actually using only the expected trigram counts from the lattices as features.

Then we were doing the fusion, and we used the multiclass logistic regression from the FoCal toolkit. The thing is that, for the first time, we did not train three separate backends, one for each duration condition.
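The region dependent linear transform loop described a moment ago can be sketched as follows. Only the forward feature computation is shown (the discriminative update of the transforms is omitted), and all dimensions, the region GMM, and the transform matrices are made-up placeholders rather than the system's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not the actual system's: 21 stacked frames of
# 7 cepstra give a 147-dim context vector, G Gaussian "regions",
# and a 56-dim output feature.
D_IN, G, D_OUT = 147, 8, 56

# A diagonal, unit-variance GMM over context vectors selects the regions
# (random placeholder parameters; the real GMM is trained on all languages).
means = rng.normal(size=(G, D_IN))
log_w = np.full(G, -np.log(G))

# One linear transform (plus a bias column) per region, initialised small.
M = rng.normal(scale=0.01, size=(G, D_OUT, D_IN + 1))

def rdlt_features(x, top_n=3):
    """Forward pass of a region dependent linear transform for one block.

    x : (D_IN,) stacked context vector. The region-GMM posteriors give
    the combination weights; only the top_n components are kept and the
    rest zeroed, mirroring the observation that only a few Gaussians
    are active for a given block of frames.
    """
    # unit-variance diagonal Gaussian log-likelihoods (sketch only)
    ll = log_w - 0.5 * np.sum((x - means) ** 2, axis=1)
    post = np.exp(ll - ll.max())
    post /= post.sum()
    w = np.zeros(G)
    keep = np.argsort(post)[-top_n:]
    w[keep] = post[keep]
    w /= w.sum()
    xb = np.append(x, 1.0)  # homogeneous coordinate for the bias
    # posterior-weighted linear combination of the region transforms
    return np.einsum("g,gij,j->i", w, M, xb)
```

In training, the gradient of the discriminative objective with respect to the output feature would be propagated back into the matrices `M`, which is the part omitted here.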
Instead, we tried duration-independent fusion. Every system was outputting some raw scores, and in addition it was also outputting some information about the length of the segment, which for the acoustic systems was the number of frames, while the phonotactic systems provided the number of phonemes. These raw scores for every system then went into a Gaussian backend; we had three Gaussian backends per system, because we used three length normalisations: either we divided the scores by the length, or by its square root, or we did not normalise at all. Then we put all the outputs of these Gaussian backends into the multiclass logistic regression, discriminatively trained, which outputs calibrated language log-likelihood scores. So here is the scheme of the fusion: again, each system's output is either taken as it is, or normalised by the square root of the length, or divided by the length, and then the outputs of the Gaussian backends go, together with the information about the lengths, into the discriminatively trained multiclass logistic regression.

The actual core of this paper was to go through our development set and decide whether the problem was there or somewhere else. The LPT team from Torino kindly provided us with their development set, so we were able to do this analysis. In Torino they had a much smaller development set than we had: it contained, if I remember correctly, about ten thousand segments of thirty-three or thirty-four languages. Our development set was very huge: it contained data from fifty-seven languages and about sixty thousand segments. So we did the experiment: we took their whole training and development set, and of course we also had our own training and development set, and then we ran four types of experiment: training and calibrating our system on the LPT set; training on the LPT set and calibrating on ours; training on our set and calibrating on the LPT set; and training and calibrating on our own set.
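As a rough sketch of this backend-plus-fusion chain: a Gaussian backend with one mean per language and a shared covariance, followed by a FoCal-style multiclass logistic regression that learns one weight per subsystem and a per-language offset. The dimensions and the plain gradient-descent training are illustrative simplifications (FoCal itself is a separate toolkit), and the duration side-information is left out for brevity.

```python
import numpy as np

def gaussian_backend(scores, labels, n_classes):
    """Gaussian backend: one mean per language over the score vectors,
    with a single shared covariance. Returns a function mapping a score
    vector to per-language log-likelihoods (up to a constant)."""
    mus = np.array([scores[labels == c].mean(axis=0)
                    for c in range(n_classes)])
    centered = scores - mus[labels]
    cov = np.cov(centered.T) + 1e-6 * np.eye(scores.shape[1])
    icov = np.linalg.inv(cov)
    return lambda x: -0.5 * np.einsum("ij,jk,ik->i", x - mus, icov, x - mus)

def train_fusion(F, y, steps=3000, lr=0.1):
    """Multiclass logistic regression fusion, FoCal-style, sketched.

    F : (n, S, L) array, n trials, S subsystem score sets, L languages.
    y : (n,) true language indices.
    Learns one weight per subsystem plus a per-language offset by
    minimising multiclass cross-entropy with plain gradient descent.
    """
    n, S, L = F.shape
    alpha, beta = np.ones(S), np.zeros(L)
    Y = np.eye(L)[y]
    for _ in range(steps):
        z = np.tensordot(F, alpha, axes=([1], [0])) + beta   # (n, L)
        z -= z.max(axis=1, keepdims=True)
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        g = (p - Y) / n                                      # grad wrt z
        alpha -= lr * np.einsum("nl,nsl->s", g, F)
        beta -= lr * g.sum(axis=0)
    return alpha, beta
```

A useful property of this kind of fusion is visible even on synthetic data: a subsystem whose scores carry no information about the language is driven toward a weight near zero.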
The last columns are our original scores. This analysis was of course done using only one acoustic subsystem, the JFA system, because it would not have been feasible to retrain all of the systems. As you can see, we had some serious issues for some languages; these were the languages where only the Voice of America data were available. The Bosnian language was an issue: you can see a big difference between the LPT set and our set. The blue column is just training on our set and using the LPT development set for calibration, so there must have been some bothersome issue in our development set. The problems were in Bosnian and Farsi, and in the final score we were also losing some performance everywhere. So we tried to focus on these languages and find the issues we had in our development set.

The first issue we found was ridiculous: we had mislabelled one language in our development set. There was a label for Farsi and a label for Persian, and we treated them as different languages. We corrected this, and the problems for that language mostly disappeared.

The next problem we addressed was finding the repeated speakers between the training and development sets, because based on the discussions at the language recognition workshop we already suspected this could be a problem for our training and development data. So what we did was to train our speaker ID system from previous evaluations, which is a GMM-based speaker ID system, train a model for every training segment within a language, and test it against the segments in the development set. What we ended up with was a bimodal distribution of scores: this part here, these are the high speaker ID scores, and it means there are some recurring speakers between the training and the development sets. When we looked at these pictures, we decided to threshold the data and to discard from our development set everything with a score higher than the threshold, which for this Ukrainian language was twenty. When we did this experiment, we discovered that for some languages we were discarding almost everything from our development set: for Bosnian, for example, we ended up with just fourteen segments, and for the other languages the speaker identification filtering also discarded a lot of data; for Ukrainian, only twelve segments remained.

So what was the performance change? Merely correcting the label already showed some improvement, and the speaker ID filtering then made quite a huge difference in performance. Again, these are the results for our acoustic subsystem, the JFA system with the RDLT features. When we saw this, we decided to run the whole fusion on our filtered data. Note that we did not change anything else: we did not retrain the JFA system we had in the submission for the NIST language recognition evaluation; we just filtered out scores from our development set and ran the fusion again. And we were gaining quite substantial performance improvements: for the 30-second condition, the average cost went from 2.3 to 1.93, which is quite a nice improvement, and if you look at the table, there is an improvement for every duration; I think there is no number which deteriorated. So it worked over all the conditions, over the whole set, for every language and every duration. What we also saw, however, was a deterioration of the results on our own development set.
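The filtering step just described amounts to a simple threshold on each development segment's maximum speaker-ID score against the training models of the same language. A sketch, where the segment list and the score scale are hypothetical and the threshold of twenty is the one read off the bimodal histogram in the talk:

```python
def filter_by_speaker_overlap(segments, threshold=20.0):
    """Discard dev segments that likely share a speaker with training.

    segments : iterable of (segment_id, language, score), where score is
               the maximum speaker-ID score of the dev segment against
               all training-segment models of the same language.
    Returns (kept, dropped_per_language); the per-language drop counts
    show how much of the set survives, as with the fourteen Bosnian
    segments mentioned above.
    """
    kept, dropped = [], {}
    for seg_id, lang, score in segments:
        if score > threshold:
            # high score: probably the same speaker in train and dev
            dropped[lang] = dropped.get(lang, 0) + 1
        else:
            kept.append((seg_id, lang))
    return kept, dropped
```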
The cause could be that our system had actually been trained to the speakers, so that for some languages it can recognise the speaker rather than the language.

Then we decided to work on our acoustic system, the JFA RDLT system, because we also wanted to do other experiments to improve the final fusion. What we did was simply discard the RDLT features and use the plain shifted delta cepstra to train the system, and there was some improvement out of this. What we also did was to train the JFA using all the segments per language instead of five hundred segments per language, and this brought some nice improvement. So when we did the final fusion, we discarded the RDLT JFA system and replaced it with the normal JFA system; the MMI system still remained in the fusion; and instead of all the binary decision tree language models we put in a number of SVM systems, which are phonotactic and based on all of our phoneme recognisers (a colleague will explain more about these systems in his talk). When we did this, the final fusion went from 1.9, as we saw previously, to 1.57, which is a very competitive result, although of course it is a post-evaluation one.

So what are the conclusions of this work? We really have to care about our development data: rather than creating a huge development set, it is better to pay attention and have a smaller but filtered and clean development set. We actually did experiments with adding more data, and it did not help us. The problem of the repeated speakers between the training and the development set was large, and we should pay attention to it when we are doing the next evaluations. That is all, thank you.
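For reference, the cost figures quoted throughout (2.3, 1.93, 1.57) are the NIST average detection cost, quoted multiplied by one hundred. A simplified closed-set reading of the definition is sketched below; the full 2009 evaluation plan also includes an out-of-set component, which is omitted here.

```python
import numpy as np

def cavg(p_miss, p_fa, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Simplified closed-set average detection cost (Cavg).

    p_miss : (N,) miss rate for each of the N target languages.
    p_fa   : (N, N) false-alarm rate of the detector for target i on
             trials of non-target j (the diagonal is ignored).
    The non-target prior (1 - p_target) is spread uniformly over the
    N - 1 non-target languages.
    """
    N = len(p_miss)
    p_non = (1.0 - p_target) / (N - 1)
    off = np.asarray(p_fa, dtype=float).copy()
    np.fill_diagonal(off, 0.0)
    per_target = (c_miss * p_target * np.asarray(p_miss)
                  + c_fa * p_non * off.sum(axis=1))
    return per_target.mean()
```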
[Question about whether the LPT development set was filtered in the same way.] We looked at LPT's set, and we talked with them at the workshop; they were doing the speaker filtering as well, but we did not filter their set according to our training set, so even if some repeated speakers remained there, we just used it as it was.

[Question about whether the same speakers appear in the evaluation set.] We do not know that, and we did not check it; we wanted to treat our evaluation set as an evaluation set, so we did not look at it. I think there are probably not so many repeating speakers in the evaluation set, because as I understood it, NIST was using some previously recorded data, and it is much less likely that there will be repeated speakers there. Of course it can happen for some of them, but we did not actually check.

[Comment on the phonotactic systems.] Yes, it is like that: we were making a lot of effort to try this new RDLT technique, which did not work, while what was working was combining the scores of many phonotactic systems, as you did in your submission. Simply combining thirteen PCA-based SVM systems built on our phoneme recognisers gave very nice, quite comparable results: with a cost of 1.78, these SVM systems alone were better than our final submission even after the filtering and recalibration.