Hi, I am going to talk today about an approach to the selection of high-order n-grams in a phonotactic language recognition framework. Since time is short I will try to go quickly: I will give a short introduction, then I will show you the feature selection method that we propose, I will present the experiments we ran and their results, and I will finish with a short summary of the work.

The motivation, the main reason, is that a phonotactic language recognizer using higher-order n-grams is expected to capture more discriminative information from the languages, I mean longer-span phonotactic constraints. The problem is that the number of n-grams grows exponentially as N increases, so there are serious computational issues, and that is why most systems usually stick to orders no higher than three or four. We cannot directly apply dimensionality reduction techniques like PCA either, because the feature space is huge.

Regarding related work on feature selection, I would mention just two papers. The first one, by Richardson and Campbell at ICASSP 2008, used a filter method to select the most discriminative n-grams based on SVM weights; those n-grams were then aggregated to define a subset of the n-gram inventory. The second is quite a similar work: they used the same kind of filter method, first with SVM weights, which is basically the same as in the previous work, and second with a chi-square measure. The fact is that, in both cases, there was no improvement, or even a degradation, when n-grams of order higher than four were used.

We had faced quite a similar problem in a previous work, where we did phonotactic language recognition using counts of phone n-grams. In that case the feature space was smaller, and the key idea was a simple frequency-based feature selection: to build the sparse vectors of counts using only the most frequent units. The problem is that at orders higher than four the feature space is really huge, so even a simple frequency-based selection becomes a challenge.

So let us look at frequency-based feature selection. If F is the number of phonetic units of an acoustic decoder, the number of possible n-grams up to order N is F + F^2 + ... + F^N, so the feature space is really huge. But we must take into account that most of those features will never be seen in the training data, so we can forget about them; and even most of the seen features will have very low counts, so we can forget about them too and simply select the most frequent features. The remaining problem is that, with such a huge feature space, we cannot even store the cumulative counts of all the features seen in the training set, so we cannot compute them directly; we must estimate them.

The idea is quite simple. We use the full training set and build a table with the cumulative counts of the seen features, but we clean the table periodically: every T collected counts, we retain only those entries with counts higher than a given threshold tau; all the entries with lower counts are discarded, so if they appear again they restart from zero. Both T and tau are heuristically fixed constants; T must be big enough to yield a quite big table, and in our experiments we tried to get tables around ten times bigger than the desired final size. So the proposed algorithm works as follows: we start with an empty table; T is the parameter that sets how many cumulative counts we accumulate since the last update; for every training sentence we accumulate the counts of its n-grams in the table and update the counter; when the counter is higher than T, we prune the table, removing all the entries with counts lower than tau; at the end we do a final pruning, so that the final size of the table is still much bigger than the desired size; then we take the table and just keep the most frequent features.
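A minimal Python sketch of this counting-with-pruning procedure, as described above, could look as follows; the function and parameter names (prune_period for T, min_count for tau, target_size for the desired final size) are illustrative choices, and the exact point at which the counter is checked is an assumption.

```python
from collections import defaultdict

def select_frequent_ngrams(sentences, n_max, prune_period, min_count, target_size):
    """Approximate selection of the most frequent n-grams under memory limits.

    sentences    : iterable of phone-label sequences (lists of strings)
    n_max        : highest n-gram order to collect
    prune_period : T, number of counts accumulated between table prunings
    min_count    : tau, entries with counts at or below this are dropped
    target_size  : number of n-grams to keep at the end

    Pruned entries restart from zero if they are seen again, so the counts
    are approximate lower bounds, not exact corpus counts.
    """
    table = defaultdict(int)
    since_prune = 0
    for phones in sentences:
        for n in range(1, n_max + 1):
            for i in range(len(phones) - n + 1):
                table[tuple(phones[i:i + n])] += 1
                since_prune += 1
        # prune once T counts have been accumulated since the last pruning
        if since_prune >= prune_period:
            table = defaultdict(int,
                                {ng: c for ng, c in table.items() if c > min_count})
            since_prune = 0
    # final pruning, then keep the most frequent surviving n-grams
    survivors = {ng: c for ng, c in table.items() if c > min_count}
    ranked = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return [ng for ng, _ in ranked[:target_size]]
```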
For the language recognition system itself we used a quite common approach in phonotactic language recognition: phone decoders, SVM models built on n-gram counts, a Gaussian backend and linear fusion. The training, development and test corpora were those of the NIST 2007 Language Recognition Evaluation. We used ten conversations per language for development, fusion and calibration, and those conversations were split into thirty-second segments. The evaluation was carried out on the core condition, closed-set, thirty seconds. The phone decoders were those from the Brno University of Technology group for Czech, Hungarian and Russian, trained with HTK. The Gaussian backend and the calibration and linear fusion were done using the FoCal toolkit from Niko Brümmer. We removed the non-speech segments from the training data, and all the non-phonetic units were mapped to silence. We did not use lattices; the decoders were used only to produce one-best phone strings. Those phone strings were then modeled by means of support vector machines, using vectors of n-gram counts with the standard background-probability weighting; the training was done one versus all.
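The talk does not spell out the exact vector construction. The following is a minimal sketch of the usual recipe for SVM phonotactic systems, relative frequencies of the selected n-grams scaled by one over the square root of their background probability, which is one reading of the "standard background probability weighting" mentioned here; all names are illustrative.

```python
import numpy as np

def ngram_feature_vector(phones, selected_ngrams, background_prob):
    """Weighted relative-frequency vector for one decoded phone string.

    phones          : list of phone labels produced by one decoder
    selected_ngrams : list of n-gram tuples kept by the selection step
    background_prob : dict mapping each selected n-gram to its relative
                      frequency over the whole training set
    Each relative frequency is scaled by 1/sqrt(background probability),
    a common weighting in SVM phonotactic systems.
    """
    index = {ng: j for j, ng in enumerate(selected_ngrams)}
    orders = {len(ng) for ng in selected_ngrams}
    counts = np.zeros(len(selected_ngrams))
    total = 0
    for n in orders:
        for i in range(len(phones) - n + 1):
            total += 1
            ng = tuple(phones[i:i + n])
            if ng in index:
                counts[index[ng]] += 1
    rel_freq = counts / max(total, 1)
    weights = np.array([1.0 / np.sqrt(background_prob[ng]) for ng in selected_ngrams])
    return rel_freq * weights
```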
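Similarly, the Gaussian backend is only named in the talk; a minimal sketch of the usual shared-covariance version, fitted on development-set score vectors, might look like this (again, the names and details are assumptions, not the authors' implementation).

```python
import numpy as np

def train_gaussian_backend(dev_scores, dev_labels):
    """Fit one Gaussian per language over dev-set score vectors.

    dev_scores : (num_segments, num_languages) array of raw SVM scores
    dev_labels : array with the language index of each dev segment
    A single shared covariance matrix is used, the usual choice.
    """
    languages = np.unique(dev_labels)
    means = np.array([dev_scores[dev_labels == l].mean(axis=0) for l in languages])
    centered = np.vstack([dev_scores[dev_labels == l] - means[i]
                          for i, l in enumerate(languages)])
    cov = np.cov(centered, rowvar=False)
    return means, np.linalg.inv(cov)

def backend_log_likelihoods(scores, means, inv_cov):
    """Per-language log-likelihoods (up to a common constant) for one segment."""
    diffs = scores[None, :] - means
    return -0.5 * np.einsum('ij,jk,ik->i', diffs, inv_cov, diffs)
```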
So let us jump from 3-grams to 4-grams. If we take all the 4-grams seen in training, we get only around two million features, with different numbers for each decoder, so there is no need for the pruning procedure yet: with two million entries we can simply count them all and select the most frequent ones. In fact, even if we keep the full two million features, not all of them are really needed, since the sparse count vectors are much smaller on average. Using the full 4-gram counts we got an improvement in equal error rate, but we should take into account that the vector size with the full 4-grams was quite a bit bigger than that of the 3-gram baseline system, and that would be a problem if we had a lot more data, for example for the 2009 evaluation, where the training set was much bigger.

So the first step was just to select the most frequent units from the full 4-gram inventory. In this slide you can see the results as we select fewer and fewer units. Obviously, as we select fewer units the performance degrades: the equal error rate grows, but with some delay. That is why we prefer to look at the average cost for evaluation, because it is somehow more significant: small perturbations around the operating point can lead to quite different equal error rates. We marked two selection points, the first at one hundred thousand features and the second at thirty thousand. One hundred thousand features is more or less the same number of features as the full trigram system, and thirty thousand was selected because its average vector size is more or less equivalent to that of the trigram case, so the computational cost at thirty thousand features was more or less the same as for the trigram system.

So let us try to jump from 4-grams to higher orders. We fixed the T and tau values to ensure at least two million features at the end of the algorithm. Just to note, the tau value was equivalent to more than two hours of voice, which means that the features discarded at each pruning step were really infrequent ones. Also, as N increases, the number of frequent n-grams decreases: in this table we can see how many frequent n-grams are found as we change the order from three up to seven, and you can see that very few of the most frequent seven-grams survive, so we selected seven as the highest order.

In this table you can see the results, in terms of cost and equal error rate, for the two selected sizes, one hundred thousand and thirty thousand features, for orders from three up to seven. As you can see, the results with n-grams up to five, six and seven do not differ much from those of the four-gram system. Anyway, the good news is that even those results were not bad, they were not worse; I mean, they are somehow stable, even though they include quite a big number of higher-order n-grams.
To finish my presentation: a feature selection algorithm has been proposed whose goal is to perform phonotactic SVM-based language recognition with high-order n-grams. Performance improvements with regard to the baseline trigram SVM system have been reported in experiments on the NIST 2007 Language Recognition Evaluation, when applying the proposed algorithm to select the most frequent units up to orders four, five, six and seven. The best performance was obtained when selecting the one hundred thousand most frequent units up to 5-grams, which yielded a relative improvement of eleven percent with regard to the trigram system. We are currently working on the evaluation of smarter selection criteria within this approach. That's all, thank you.

Audience: Thank you, I have a question. What we noticed was that with each lower-order n-gram you had a different dynamic range; I was wondering whether you tried to scale them differently, or model them separately, or something like that.
Speaker: No, we just leave them all together in the same vector.
[The remainder of the question-and-answer exchange, touching on phonotactic versus acoustic baselines, is largely unintelligible in the recording.]