Okay, I will start. I will present our work on feature extraction for phonotactic language recognition, which is done with PCA. This is the overview of the whole talk: first I will speak a little about the motivation of this work, why we want to do it, and then I will describe the results on the NIST Language Recognition Evaluation 2009.

For the introduction: if we want to recognize languages with phonotactic modelling, we can basically either use n-gram models, like language models, where we compute the likelihood of a given utterance against language-specific models, or we can try to use discriminative models, like SVM-based models, which usually perform better, and this is what we will be talking about in this presentation. Usually for these SVM models a linear kernel and soft margin are used, which basically means that we allow some errors on the training data.

The problem with this SVM approach is that we end up with really very large feature vectors if we use, let's say, trigrams or 4-grams, and going to higher orders is computationally almost impossible, because the feature set grows exponentially, as you can see on this slide. We can easily compute that for some sensible values: if the set of phonemes is large and we are using 4-grams, we easily end up with much more than a million possible features. Of course, not all of these features are actually present in the data, so this is more of a theoretical limit, but we still need to somehow limit this space. Usually we can either discard features, which means we perform some feature selection, or we can make combinations of the features, which we call here feature extraction; that is what I will be describing later.

For feature selection, we can choose the features that occur frequently in the data. It is useless to keep in the feature vector combinations of phonemes that form n-grams that never occur in the training set, which can easily happen, since some combinations of phonemes are very unlikely, and with some reasonable pruning such n-grams never appear at all. Of course, there are also other combinations of n-grams that do occur sometimes, but it is not very meaningful to keep them among the features, and we can usually discard them based on some threshold value; basically, all the n-grams that occur only rarely in the training sample can be discarded.

Another approach is to use discriminative information, which means that we try to keep all the n-grams that are actually good for the classification of languages. It is slightly different, because you can imagine that some low-frequency n-grams that are quite rare in general across languages might still be quite informative for discriminating some particular language, so this can be a better way to discard features.
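As a minimal sketch of the frequency-based selection just described, here is how it could look in Python. The 1-best phone strings, the 33-phone inventory used in the comment, and the cut-off of five thousand n-grams are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter

def phone_ngrams(phones, n):
    """All order-n phone n-grams in one decoded utterance (a list of phone labels)."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def select_by_frequency(train_utterances, n=3, keep=5000):
    """Keep only the `keep` most frequent n-grams observed in the training data."""
    counts = Counter()
    for phones in train_utterances:
        counts.update(phone_ngrams(phones, n))
    return {g for g, _ in counts.most_common(keep)}

# For example, a 33-phone inventory gives 33**3 = 35,937 possible trigrams and
# 33**4 = 1,185,921 possible 4-grams, but far fewer ever occur in the training
# data, and frequency pruning shrinks the feature vector further.
if __name__ == "__main__":
    toy = [["a", "b", "c", "a", "b"], ["b", "c", "a"]]
    print(select_by_frequency(toy, n=3, keep=2))
```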
Here I would like to show the idea of why we tried to use feature extraction. We can easily see that, for example, some combinations of phonemes will have values that are basically always zero, and we can discard them, as I described on the previous slide. On the other hand, there can be combinations of n-grams, like the ones I have written here, which are just things that sound almost the same. In some cases it can happen that some phoneme combinations are very similar pronunciation variants, and then they frequently appear together in the lattices. Even if the frequency of these n-grams is quite high, it would be a good idea to at least cluster them together somehow, so that we would not need to deal with an amount of features that is essentially useless: you can see that you do not need, say, four separate features here, it would be enough to have just one. So this is the motivating example for what we try to do.

We have tried to use simple PCA, which is basically a linear projection: we compute a projection matrix that maps the original feature space to some lower-dimensional feature space. It seems like a good way to fight the curse of dimensionality that is caused by the exponentially increasing number of parameters as we increase the size of the context, going from trigrams to 4-grams and so on, and a quite similar idea actually works in normal bigram language modelling, so it sounds like a reasonable way to go. The other advantages are that we do not need to tune many parameters to make it work, it is very fast and simple, and there are plenty of tools that can be used to compute PCA, so simplicity is one of the reasons for using this technique.
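As a minimal sketch of this projection step (using scikit-learn's PCA as a stand-in for whatever implementation was actually used, with small placeholder arrays; in the talk the trigram space has about thirty-six thousand dimensions and is reduced to a few hundred or a thousand):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random placeholders standing in for per-utterance n-gram count/probability vectors.
rng = np.random.default_rng(0)
X_train = rng.random((600, 4000))
X_test = rng.random((100, 4000))

pca = PCA(n_components=100)            # the projection is estimated once,
Z_train = pca.fit_transform(X_train)   # during training,
Z_test = pca.transform(X_test)         # and applying it at test time is just a matrix multiply
print(Z_train.shape, Z_test.shape)     # (600, 100) (100, 100)
```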
Now I will discuss the results that we obtained on the NIST Language Recognition Evaluation 2009, on the closed-set condition, with results shown for all the test durations.

For our development set for NIST 2009, we first tried to see what happens when we discard features by their frequency, so we keep only the most frequently appearing five thousand features, ten thousand, and so on. You can see that the accuracy of the system grows as we add more features, and it seems natural that it would be best to keep all the features, let the SVMs decide which ones are useful, and not discard any of them. Of course this is impossible; there is no result in this table for what would happen if we used all the features, because the full feature space has over a million combinations. Not all of these combinations actually appear in the training set, but the number that does occur is still several hundreds of thousands, and that is simply impossible to compute in a reasonable time. So this result can be read as: we cannot go any further this way.

Then we have to use the PCA. This is actually shown on trigrams because, as I said before, for the 4-gram system with a larger recognizer you are not able to compute the full feature-space baseline at all. So this is on trigrams: we can see that the full system is at around 2.3 Cavg, which is the last line, and the previous lines show what happens when we reduce this feature space from the thirty-six thousand features down to one hundred, five hundred, and so on. You can see that when we go to something like five hundred or one thousand features, which is a small fraction of the original space, we get almost the same performance. The speedup, which is described in more detail in the paper, can be huge, both for training the system and for testing; actually the testing phase is even faster than the training phase, because in training we first need to estimate the PCA, while in testing we do not need to do this, we only project the data. So it can be seen from this table that the approach actually seems to work reasonably well. Maybe I can add that we tried several freely available toolkits for training these SVM models and tuned all of them to obtain the best performance; my experience is that one of them gave quite good results, while another one was about ten times faster but about five percent worse in accuracy.

Now for the results with multiple systems. We have trained a Hungarian phoneme recognizer, an English phoneme recognizer and a Russian phoneme recognizer, and in the end we fuse all the results together; we will see that later. We can see in this table what happened: if you focus on the Hungarian system, five hundred features already work quite well, going to one thousand is actually better, and then we do not observe any real gain from going to four thousand features, so a value around one thousand features seems to be quite good. The interesting thing is that the 4-grams actually work worse than the trigrams. Of course, the feature space of the 4-grams is not full; we had to apply some feature selection there, because otherwise the estimation of the PCA would be difficult to do. So basically it seems reasonable to use just trigrams, and it should work fine; we will see the results in more detail later.

For the English system we can see basically the same behaviour as for the Hungarian system; it even seems that it would be enough to keep just five hundred features. Of course, the optimal size of this reduced space also depends on the density of values in the lattices and things like that, so there is no single value that always works, and it should be tuned for every system; on the other hand, using a bigger feature space is somewhat of a problem because the computation time goes up.

The Russian system was the last one, and with the largest phoneme set it was quite difficult to train. We actually tried to use more training data here: for the other two systems we used around ten thousand training samples, but for this last one we used almost fifty thousand, which is already very large, and the original feature space was more than one hundred thousand features, so it takes quite long to train the PCA, but it can be seen that in the end this system works the best. It would definitely be good to train all the systems on all the available data, as can be seen from this result, but in the end we did not.
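As a minimal sketch of the classification step that all of these systems share: linear soft-margin SVMs trained per target language on the PCA-reduced vectors. scikit-learn's LinearSVC is only a stand-in for the toolkits mentioned in the talk, and the label set, array sizes and C value are placeholders.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Z_train / Z_test play the role of the PCA-reduced features from the earlier
# sketch; random placeholders keep this block self-contained.
rng = np.random.default_rng(1)
Z_train = rng.random((600, 100))
Z_test = rng.random((100, 100))
y_train = rng.integers(0, 23, size=600)   # placeholder labels (e.g. 23 LRE 2009 target languages)

# One linear soft-margin SVM per language, one-versus-rest.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(Z_train, y_train)
scores = clf.decision_function(Z_test)    # per-language scores, to be calibrated or fused later
print(scores.shape)                       # (100, 23)
```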
Apart from that, here we have the final results when we put together all the trigram systems from the previous slides and all the 4-gram systems. It can be seen that the trigram systems perform better than the 4-grams, and from the combination of trigrams and 4-grams we get some small improvement that holds across all the conditions, but it is very small. What is more useful, I think, is a system trained on more data, the Russian trigram system, and that actually gives us a bigger improvement than using 4-grams. In the end, when we were able to fix the development set, which was described in a presentation this morning, it was possible to get an even much better result, around 1.8 on the 30-second condition, which is quite a nice number. I do not have the results for the fusion here, but they are in the paper.

For the conclusion, we can say that we achieve a very high speedup, as was shown in the previous tables; we can have a system that is trained more than a hundred times faster, and we lose almost no performance in accuracy. This technique for reducing the parameter space, or the feature space, would also allow us to use more complicated techniques, like SVMs with nonlinear kernels, where we need to tune more parameters, which is quite difficult to do when we operate on the full feature space.

As for future work, we can of course think about some more complicated feature reduction techniques, something nonlinear, maybe some neural network, and it can be expected that this kind of approach could give even bigger reductions of the feature space with similar performance. Also, in the speedup examples that I have shown here we estimated the PCA on all the data, which is not really needed; we have some other results from which we know that we can estimate the PCA on just a subset of the data, and then it is even faster. So that would be all, thanks for your attention. Questions?

Thanks. I mean, that is very surprising, that a hundred dimensions or so are enough to capture this. Usually when you do principal components in some domain, you estimate them and then look at how many are needed to account for ninety percent, ninety-five percent of the variance. I wondered, do you know how much of the variability is captured?

Well, actually I did not try to compute this. I was just looking at the final results, the accuracy of the system, as I reduced the data, so I am not able to say how much it is.
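For reference, the questioner's check could be done directly from a fitted PCA object; this is a minimal sketch with placeholder data, not a result reported in the talk (the speaker notes it was not computed).

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for the n-gram feature vectors.
X_train = np.random.default_rng(0).random((600, 4000))
pca = PCA(n_components=100).fit(X_train)

cumulative = np.cumsum(pca.explained_variance_ratio_)
if cumulative[-1] >= 0.90:
    n_for_90 = int(np.searchsorted(cumulative, 0.90)) + 1
    print(f"{n_for_90} components capture 90% of the training-set variance")
else:
    print(f"the kept {pca.n_components_} components capture {cumulative[-1]:.1%} of the variance")
```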
I think it might be interesting to do that calculation, and also to check whether the variability here is really due to redundancy, the recognizer choosing many very similar n-grams that you can then project together, which you would not have if you just had the correct transcription. My intuition would be that the phonotactic variability of a thirty-second utterance should live in a much smaller dimension, so I think it would be nice if we could see something like that.

Yes. I guess PCA is simply the most simple technique to use, and that was the reason why we used it here, just to see whether the idea works or not. Of course, I am not saying that it is the optimal thing. Thank you. More questions?

This may be based on ignorance, but you said you use PCA because it is easy, yet the dimensionality goes up to a million. Specifically, as far as I know, for PCA you need the covariance matrix; how do you do that?

In our paper we cite another paper of ours where there is a technique for estimating PCA on large amounts of data efficiently, and you can even find code for this estimation of PCA where you do not need to compute the whole covariance matrix. It is based on an iterative algorithm that never requires the full covariance matrix.

Well, if there are no more questions, then thank you.
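As a rough illustration of that kind of approach: the sketch below uses scikit-learn's randomized truncated SVD, which likewise works directly on the data matrix and never builds the huge covariance matrix. It is only a stand-in for the iterative algorithm from the cited paper (and, unlike true PCA, it does not mean-centre the data); all sizes are placeholders, far smaller than the million-dimensional space mentioned in the question.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Sparse placeholder standing in for high-dimensional n-gram count vectors.
X = sparse_random(2000, 100_000, density=1e-3, format="csr", random_state=0)

svd = TruncatedSVD(n_components=100, algorithm="randomized", random_state=0)
Z = svd.fit_transform(X)   # low-dimensional projection without an explicit covariance matrix
print(Z.shape)             # (2000, 100)
```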