So this is work that we did at SRI, and it's in fact our first foray into language recognition. It's great because most of the background techniques were already covered in plenty of detail in the previous three talks, so I won't go into them again. I will start with some preliminaries and then tell you about our experimental setup. The focus here will be on phonotactic — or I should rather say phone-based — recognition. I say that because it encompasses both phonotactic modeling, such as what we've heard about before, as well as MLLR-based modeling, which of course is also based on phone models, hence the commonalities. I will also compare two different ways of doing the phonotactic modeling, and then conclude with some pointers to future work and conclusions.

Because we haven't participated in the past language recognition evaluations, we didn't actually have access to any data after LRE-05. Since this work was done, the LRE-07 data has been released, but we haven't had a chance to process it yet. So we're dealing with — and I apologize for this — a rather outdated task, which is the seven-language task of LRE-05. It's conversational speech, plus a lot of the Voice of America material. The test data consists of about thirty-six hundred test segments, and we only look at the thirty-second condition. The training data is from the same seven languages, and the duration, after we perform our automatic segmentation into speech and nonspeech, boils down to about fifty-six hours for the first few languages and fewer hours for the remaining ones you see here. I'll be reporting the equal error rate averaged over all languages; that's just what we chose, and maybe it's not the best choice, but that's what you'll see. We performed two-fold cross-validation: to do calibration and fusion we
split the data into two halves, estimate on the first half and test on the second, and vice versa, and then combine the results. Again, this is because we didn't have an independent development set, since we were working only with LRE-05.

Okay, you've heard all this: the two mainstream techniques are really the cepstral GMM — lately with the incorporation of session variability compensation via JFA — and we implemented that using a fairly standard framework. Just for reference, it gives you something like 2.87 average equal error rate on our data. The alternative popular technique is of course the PRLM technique, and you've heard all about that already, so I won't repeat it here. What's popular about it is that you can combine multiple language-specific systems by fusion at the score level and get much better results. For calibration and fusion we also didn't attempt anything out of the ordinary; in fact we haven't even tried to incorporate the Gaussian backend modeling yet, so we just use the multiclass FoCal toolkit, which is based on logistic regression, for both fusion and calibration.

The first section is about phonotactic language modeling. This again is a standard technique by now, as you saw before: instead of one-best phone decoding, we do lattice decoding. This was actually adopted twice — LIMSI proposed it for language ID, and SRI proposed it for speaker ID — and in both cases it showed pretty dramatic improvements. Actually, I wanted to respond to something said in the previous talk: I do not think that lattice decoding increases your variability. In fact, I think it reduces your variability, because you're not making hard decisions. Whereas with a one-best hypothesis the recognizer might
decide between a frequency of one or, say, 0.999, in the lattice approach you actually represent both the one-best and the lower-ranked hypotheses. You have all the hypotheses represented, and they just differ by small numerical values, so I think it actually gives you a more robust feature. That was demonstrated in the original paper, and the other reason why it works, of course, is that it gives you a feature with more granularity.

So much for how we do the feature extraction. We didn't have time to really develop new phone recognizers for this, so we just took three phone recognition systems that we had lying around: one for American English, one for Spanish, and a third for Levantine Arabic. As you can see here, the phone sets differ in their sizes, and furthermore the amounts of training data are vastly different: for English we have basically as much data as we want, and for that reason we do gender-dependent modeling; for the other two languages, with less training data, we do gender-independent modeling. Other than that, they all use the same kind of standard ASR setup: a PLP front end, vocal tract length normalization, HLDA for dimensionality reduction, and crossword triphones in the acoustic model training. The decoding is of course done without phonotactic constraints, but, following the results from LIMSI, we use context-dependent triphones in the decoder. Also, again following a very nice result of theirs from a couple of years ago, we use CMLLR adaptation in decoding. By the way, we tried regular MLLR as well, and it didn't perform as well, which I guess agrees with another of the previous talks.

Okay, so the first thing we would like to propose is to get rid of, or largely get rid of, all these different parallel phone decoders. Instead, we define a universal phone set that covers several languages.
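To illustrate the lattice argument from a moment ago, here is a toy sketch (with invented posteriors) contrasting hard one-best n-gram counts with the posterior-weighted expected counts that lattice decoding yields:

```python
from collections import Counter

def ngram_counts(phones, n=2):
    """Hard n-gram counts from a single one-best phone string."""
    return Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))

def expected_ngram_counts(hypotheses, n=2):
    """Soft counts: each hypothesis' n-grams weighted by its posterior,
    mimicking what counting over a lattice achieves."""
    soft = Counter()
    for phones, posterior in hypotheses:
        for gram, count in ngram_counts(phones, n).items():
            soft[gram] += posterior * count
    return soft

# Two nearly tied hypotheses (posteriors invented for illustration).
hyps = [(["ah", "b", "ah"], 0.51), (["ah", "p", "ah"], 0.49)]
hard = ngram_counts(hyps[0][0])      # one-best keeps only ("ah", "b")
soft = expected_ngram_counts(hyps)   # both variants survive, close in value
```

If a small perturbation flips which hypothesis wins, the hard count for ("ah", "b") jumps between 1 and 0, while the soft counts move only slightly — the robustness claimed above.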
In our case we made up such a set of fifty-two phones. What you do is map your individual language-specific dictionaries to the common shared phone set and then retrain your acoustic models using the mapped transcripts. Of course the language models — if you perform decoding with a language model — and the phonotactic models should also be retrained. The phone recognition accuracies, as measured on the individual languages, are very close to what you get with the universal phone set, so you're not really sacrificing much in terms of accuracy. These are the language-specific datasets that we combined and mapped in this fashion. We took American English data containing both native and nonnative speakers; because we know that in much of what we do both native and nonnative speakers occur — not in their natural frequencies, but with more of a nonnative focus — we actually weighted them so that they contribute roughly equal amounts of data. Then there are Spanish and Egyptian Arabic; note that we use Egyptian Arabic here because that happened to be a dataset for which we have transcriptions, so we could perform the mapping in a pretty straightforward way. Also, the two datasets with very little data, the Spanish and the Egyptian, are weighted more heavily so that things are even more balanced in the overall model.

Then — this may be a detail known to everybody, but I thought I'd point it out — when we do the log-likelihood ratio scoring, we do not use all the languages in the denominator, but only the languages that are not the target language, and that gives you a slightly better result.

Okay, so here are the results using the PRLM approach, with the three individual PRLM systems based on the American English, Levantine Arabic, and Spanish recognizers.
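The scoring detail just mentioned — excluding the target language from the denominator — can be written as a minimal sketch (language names and log-likelihood values are invented):

```python
import math

def llr_score(logliks, target):
    """Log-likelihood ratio of the target language against the average
    likelihood of the *non-target* languages only (the target is excluded
    from the denominator, as described in the talk)."""
    others = [ll for lang, ll in logliks.items() if lang != target]
    m = max(others)
    # log of the mean of the non-target likelihoods, computed stably
    log_denom = m + math.log(sum(math.exp(ll - m) for ll in others) / len(others))
    return logliks[target] - log_denom

# Invented per-language log-likelihoods for one 30-second test segment.
logliks = {"english": -100.0, "spanish": -105.0, "arabic": -104.0}
score = llr_score(logliks, "english")   # positive: evidence for English
```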
American English, as might be expected because it has the most training data, gives the best individual result. It may also do well because American English has a relatively large phone set, which gives you a lot of resolution in the decoding. Then, when you do the standard PRLM fusion, first with two recognizers and then with three, you get progressive improvements over the single best system, the American English one, down to the three-way PRLM, with about a thirty-four percent relative gain overall. The single decoder that uses only the multilingual phone set gives you 3.01, which is very close to the combined result and of course vastly simpler and faster. And if you combine them all — you now have a four-way PRLM — you get another nice improvement: you go from the previous result with the three language-specific systems to the full four-way PRLM, with a pretty significant twenty-four percent additional reduction. Usually, when you add more and more of these language-specific systems, the improvements peter out, as you might expect; but when you add the multilingual system, you get another big gain.

Just some details — again, this may all be common knowledge. We found that there was actually no gain from 4-grams; 3-grams were the best language model order. Somehow the 4-grams are too sparse, or the models are not adequate to capture the information in 4-grams. And it is actually good to use a fairly suboptimal smoothing in terms of language model perplexity: a simple add-one smoothing works better than fancier methods.
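As a toy version of that language-model detail, a phone trigram LM with add-one smoothing might look like this (the phone inventory and training data are made up):

```python
import math
from collections import Counter

def train_trigram_lm(phone_seqs, vocab_size):
    """Phone trigram LM with add-one smoothing -- the deliberately
    simple scheme that worked best here."""
    tri, bi = Counter(), Counter()
    for seq in phone_seqs:
        for i in range(len(seq) - 2):
            tri[tuple(seq[i:i + 3])] += 1
            bi[tuple(seq[i:i + 2])] += 1

    def logprob(trigram):
        history = trigram[:2]
        # add-one: every trigram, seen or not, gets nonzero probability
        return math.log((tri[trigram] + 1) / (bi[history] + vocab_size))

    return logprob

# Toy training data over a made-up 3-phone inventory.
lm = train_trigram_lm([["a", "b", "a", "b", "a"]], vocab_size=3)
seen = lm(("a", "b", "a"))      # observed trigram
unseen = lm(("a", "b", "c"))    # unseen, still scored
```

Scoring a decoded test segment is then just the sum of `logprob` over its trigrams.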
Okay, so now we add something that is very easily done in our systems: we augment the standard cepstral front end with MLP features — multilayer perceptron, i.e. neural network, features — which work very well when we do word recognition and other tasks. We have also shown that a front end trained on English, to perform English phone discrimination, actually generalizes to other languages: you can train a network to discriminate English phones at the frame level and then use that trained front end to train, say, a Mandarin recognizer, and you see a nice gain. This gives us confidence that this front end, although trained on only one language, will generalize to other languages, which is exactly what we want for language ID. And indeed we find that, across the board for all languages, we get a small but consistent improvement in phone recognition accuracy.

Now we throw this at the multilingual PRLM: we augment the multilingual PRLM with this MLP front end, and you see an improvement from 3.01 to 2.81. If you then do the combination with the other language-specific PRLM systems, you get a further improvement from the earlier 2.09. So it's a nice gain from adding the MLP features, as others have seen, but we wanted to verify it in this framework with the multilingual system.

Okay, so now we're going to try something different. Another thing that we have used with some success in speaker identification is MLLR transforms, so why shouldn't we be able to do the same for language recognition? The idea, as you've probably seen in talks earlier at the workshop, is that you have a language-independent set of phone models and you use MLLR adaptation: you estimate a transform that moves certain phone classes from their language-independent locations — or speaker-independent, or whatever the independence is that you care
about — to a location that is specific to a subset of your data, such as a language. Then you use the transform coefficients as features and model them with an SVM. In our case we have eight phone classes; the feature vectors have thirty-nine components, so each affine transform is a thirty-nine-by-forty matrix, and in total we get about twelve to thirteen thousand raw features. We perform rank normalization, as we do in our speaker ID systems, and that's our feature vector. Then we do support vector machine training with linear kernels: the hyperplane is really the model for the language, and the distance from it scores your test sample.

Here are the results — this is a very crude system, so bear with me. We tried this first with an English MLLR reference model, using female English speakers only in the reference model, and we get a reasonable first result. We can play the game of combining male and female transforms, and we get a better result, consistent with what we see in our speaker ID work. But when we use a single gender-independent multilingual MLLR reference model, we do much better. This goes to show, first, that the approach works in principle, and secondly, that once again the multilingual phone models work better than the language-specific ones.

Now we want to get this result down to be more competitive with our standard cepstral GMM system. First of all, we can use a little trick: the training conversations are actually pretty long compared to those in the test set, so we can split our training conversations into thirty-second segments and get many more data points for the SVM training. We can also reduce the number of Gaussians in our phone models, which forces the MLLR transform to do more of the adaptation work, as opposed to the adaptation being absorbed by different regions of the Gaussian mixture.
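A minimal sketch of the feature preparation just described — flattening the per-class transform coefficients and rank-normalizing them before SVM training (the dimensions follow the talk; the data is illustrative):

```python
import numpy as np

def rank_normalize(X):
    """Rank-normalize each feature column against the set of training
    vectors, mapping values into (0, 1), as done for the raw MLLR
    transform coefficients before SVM training."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    return (ranks + 0.5) / X.shape[0]

# Hypothetical setup: 8 phone-class transforms, each 39x40, flattened
# per conversation side into one long feature vector.
rng = np.random.default_rng(0)
n_sides, n_feats = 6, 8 * 39 * 40          # 12480 raw features
X = rank_normalize(rng.normal(size=(n_sides, n_feats)))

# Scoring against a trained linear SVM is then just the signed distance
# to the hyperplane: score = X @ w + b, with w and b from training.
```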
And finally, we can use NAP to try to project out the within-language variability. With all of that applied incrementally, you see that the average equal error rate goes down from about seven to just below four. We're not quite there yet as far as the cepstral GMM baseline goes, but it's much closer.

Now, another incremental improvement: we augment the PLP front end of the MLLR system with the twenty-five MLP features. The number of features thus goes from thirty-nine times forty to that plus another block-diagonal component that accounts for adapting the MLP features, which is twenty-five by twenty-six; so overall the feature dimension increases from about twelve thousand to just under eighteen thousand. The performance improves accordingly, a roughly thirteen percent relative reduction.

Okay, so now I want to go back to phonotactic modeling. As we've seen, hardly anybody uses language models anymore; people use SVMs for phonotactic modeling, so we wanted to do the same and see whether what we saw before still holds. We had also found, many years ago in speaker ID, that SVM models applied to phone n-gram features work better than language models. Here we apply this to the multilingual phone n-grams, using the TFLLR kernel due to Campbell, and in this case we do not perform any rank normalization. Again we play the game of splitting our training conversation sides into segments that match the length of the test data, which gives us more training samples for the SVM. So this was our baseline using a language model over the phone n-grams, and when we do an SVM with the same trigram feature space we do slightly worse.
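The TFLLR-style feature mapping might be sketched roughly as follows; the exact form is my reading of Campbell's weighting, and the background frequencies here are invented:

```python
import math
from collections import Counter

def tfllr_features(phones, bg_freq, n=3):
    """TFLLR-style mapping (after Campbell): each n-gram's relative
    frequency in the utterance is scaled by 1 / sqrt(background
    frequency), so frequent n-grams do not dominate the linear kernel."""
    grams = [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: (c / total) / math.sqrt(bg_freq[g])
            for g, c in counts.items() if g in bg_freq}

# Made-up background frequencies and a short decoded phone string.
bg = {("a", "b", "a"): 0.04, ("b", "a", "b"): 0.01}
feats = tfllr_features(["a", "b", "a", "b"], bg)
```

The rarer trigram ends up with the larger feature value, which is the point of the weighting.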
But whereas previously we did not get a gain with 4-grams, we now do get a gain with 4-grams: with the additional features, the result actually gets better than the language model. Finally, we can fuse the two phonotactic systems, the LM-based and the SVM-based, and we get yet another gain. So apparently the SVM is the better tool when it comes to modeling very sparse features, and that's why we see a gain in going from 3-grams to 4-grams. We also tried using NAP here, but got no gain from it, replicating something we tried in speaker ID where it also didn't work. However, we haven't tried the dimensionality reduction techniques like those proposed in the previous talk, so that's certainly something for the future.

Okay, and as kind of a grand finale, we put everything together. This is our single best system, the multilingual phonotactic SVM, with its result over there, and this is our other baseline, the cepstral GMM. Then we can incrementally add the phonotactic and phone-based systems; all the combinations start with the cepstral GMM. First of all, we see that doing MLLR-type modeling on cepstral features does combine with the cepstral GMM. The multilingual PRLM system is the best single system to combine with the baseline: we see a whopping reduction there. And then, adding all the others on top of these two, you go down another twenty percent relative, so we essentially halved the error rate from the 2.87 baseline, which looks like a pretty nice reduction.

Well, I really told you only the highlights: the fact that two different kinds of phonotactic modeling combine, and the fact that the two types of cepstral modeling combine. And the interesting thing here — this is what is behind PPRLM — is that adding multiple language-specific phonotactic
models does not help: once you have all these other systems, it's no longer useful to also have the language-specific phone recognizers.

Okay, a quick rundown of some of our future directions. Obviously, we want to verify these results on more recent LRE datasets; in particular we want to try the language- and dialect-pair type of task. The SVM approach, as already seen in some of the previous talks, can also be pursued in parallel with multiple language-specific phone sets. More interestingly, though, I think we should retrain the MLP features to be well matched to the multilingual phone set that we're now using in the back end. A less interesting additional improvement: we could use MLP features for all the language-specific phone recognizers, but we might not pursue that, because we're trying to get rid of the language-specific phone recognizers. And we can of course then move on to higher-level features that we've tried and that worked well in speaker ID, such as prosodic features and constrained cepstral features.

Okay, so here are the take-home messages. We tried various phone-based systems for language ID, using techniques that we had previously seen work well in ASR and in speaker ID. For the first time, to our knowledge, we tried MLLR-SVM modeling in a language recognition framework. Probably the biggest takeaway is that the multilingual phone model approach works better, and is simpler, than using a combination of language-dependent recognizers in parallel, and it still gives you some gains when you combine it with language-specific phone recognition. The MLP front end improves things here too: what others found — that an MLP front end gains in language recognition — carries over to the two techniques that we explored. And the MLLR and cepstral GMM approaches to cepstral modeling also combine quite well.
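The score-level fusion used throughout — multiclass logistic regression in the spirit of the FoCal toolkit mentioned earlier — could be sketched as a simplified gradient-descent version on synthetic scores:

```python
import numpy as np

def fuse(scores, labels, iters=500, lr=0.5):
    """Multiclass logistic-regression fusion: learn one weight per
    subsystem plus a per-language offset by minimizing the multiclass
    cross-entropy. scores: (n_trials, n_systems, n_languages)."""
    n, s, L = scores.shape
    w, b = np.zeros(s), np.zeros(L)
    for _ in range(iters):
        fused = np.tensordot(scores, w, axes=([1], [0])) + b    # (n, L)
        p = np.exp(fused - fused.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        err = p
        err[np.arange(n), labels] -= 1.0                        # p - onehot
        w -= lr * np.einsum('nl,nsl->s', err, scores) / n
        b -= lr * err.mean(axis=0)
    return w, b

# Synthetic check: system 0 is informative, system 1 is pure noise.
rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=120)
good = np.eye(3)[labels] + 0.1 * rng.normal(size=(120, 3))
noise = rng.normal(size=(120, 3))
scores = np.stack([good, noise], axis=1)                        # (120, 2, 3)
w, b = fuse(scores, labels)
```

The learned weights downweight the noise system, which is the calibration-and-fusion behavior wanted here.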
Well, the rest I've said already, so that's it. Any questions?

Q: Thank you very much for the nice talk. At the beginning you said that the multilingual phone recognizer works approximately the same as the language-dependent ones. Do you have numbers for that?

A: Not here; I have them back home, but I didn't think they were really relevant, because when we measure phone recognition accuracy we usually apply a phonotactic model, whereas for language ID purposes we throw away the phonotactic model, because we want to be very sensitive to the particulars of the language.

Q: Maybe this was at the very beginning of the talk, but can you give more details about the discriminative MLP features that you're feeding into the HMM phone recognizer? Do you take the posteriors and apply some postprocessing?

A: Yes, they are actually quite complex. By the way, we didn't train anything in particular for this language task; it's something that we have used in word recognition — it was optimized for word recognition on conversational English telephone speech. We take PLP features over a nine-frame window and perform the usual kind of MLP training with those input features. We also use the HATS features, which are a kind of derivative of the TRAP features going back to Hermansky's work; those capture longer-term critical-band energies. We then combine the posteriors from these two MLPs into a single set of posterior vectors and reduce them to twenty-five dimensions.

Q: In the MLP setup, what exactly are you trying to predict?

A: It's an English phone set, so it has about forty-five categories, and the network performs frame-level classification.
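A rough sketch of that tandem post-processing — combining the two posterior streams and reducing to twenty-five dimensions; the geometric-mean combination and the PCA here are assumptions, not necessarily the exact recipe:

```python
import numpy as np

def tandem_features(post_a, post_b, out_dim=25):
    """Combine frame-level posteriors from two MLPs (e.g. a PLP-window
    net and a HATS-style net) by a geometric mean, take logs, and
    project down to out_dim dimensions with PCA via SVD."""
    p = np.sqrt(post_a * post_b)
    p /= p.sum(axis=1, keepdims=True)           # renormalize per frame
    logp = np.log(p + 1e-10)
    centered = logp - logp.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T

# Random stand-ins for two 45-class posterior streams over 200 frames.
rng = np.random.default_rng(2)
a = rng.random((200, 45)); a /= a.sum(axis=1, keepdims=True)
b = rng.random((200, 45)); b /= b.sum(axis=1, keepdims=True)
feats = tandem_features(a, b)
```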
So you're trying to predict the English phone at each frame, regardless of the language. As I said, we did not train a language-specific or even a multilingual MLP; we just reused the English-specific MLP that we already had.

Q: And perhaps you could try a multilingual one?

A: Yes, that's what I put in the future work as one of the obvious improvements: you could generalize the concept to cover all the languages and then retrain the MLP on that.

Q: How was the universal phone set designed?

A: It's a mapping designed by a phonetician.

Q: [inaudible]

A: We plan to do that, but we haven't yet. Thank you.