So, I'm going to present some work on language ID using a combination of an articulatory-based approach and prosody-based cues. This work was done by two of my staff members, and I'll be presenting it on their behalf today. The work is focused on developing a closed-set language ID system. The approach we're using is a phonological feature based scheme; we've used this for a number of applications, such as accent classification as well as dialect ID, and here we're going to be using it for language ID. The main advancement here is that it combines prosody with an articulatory-based structure. For our evaluations, we're going to benchmark performance against a parallel phone recognizer with language models (PPRLM), using a closed set: our corpus of five Indian languages, focusing on the read-speech portion of the data.

So what is the motivation for using articulatory-based features? Well, languages, accents, and dialects have different dominant articulatory traits, and we believe that emphasizing these components will help improve language ID, as opposed to a purely statistical based approach. Some of these traits, for example: nasalization of vowels, rounded vowels, a lack of retroflex-type phonemes, a lack of consonant-consonant type clusters. If you're looking, for example, at diphthongs, languages like Danish may have seventy-five to a hundred diphthongs, whereas languages like Japanese have no diphthongs. So the presence or absence of some of these traits would be useful to see in the articulatory domain, and in addition, automatically learning these articulatory traits allows us to hopefully build models that contribute to improved
language classification.

So, this slide is a little busy, but it's the key to the proposed system. We start off by extracting our phonological features: this is a time-frequency representation here, and right above it is the phonological feature representation. We tend to use a government phonology based approach for the partitioning; each of these blocks here represents a different articulatory-type trait, like lip rounding, tongue height, and tongue frontness or backness. The phonological feature extraction scheme itself is a traditional HMM-based approach, trained on the Switchboard corpus. So that's the first step here. Of the remaining steps, steps four and five are basically the phonological (articulatory) language feature extraction, and steps two and three are the prosodic feature extraction phase.

When we look at the prosodic feature phase, we are analyzing consonant-vowel-consonant clusters. This is all done in an unsupervised manner, so we don't know what the phone sequences are, of course. We break up the sequence into pseudo-syllables, consonant-vowel-consonant clusters, and after that we extract prosody-based traits; this includes both pitch contours and energy contours at the pseudo-syllable level.

On the phonological side, we wind up extracting the static language features at the frame level, and that's what's done here in step four. This gives us a static snapshot of the phonological feature values at a particular given time, and each static feature is also augmented into a unigram, bigram, and trigram type of representation, so we have an expanded language feature set. That's the static side. On the dynamic side, at step five, we extract features along time and
the phonological feature axis. So if I can get the pointer here: you can see we have phonological features going this way and time going this way. This plot shows movement along the phonological features and then across time, so this movement here shows a pattern where we have movement along the articulatory, or phonological, features as well as movement across time. This gives us a new feature: it generates a change token for every phonological feature change, and I'll show you that on the next slide. The long bars here represent phonological feature values across time, and each of the dots that you see represents a change in the values, much like a delta of the phonological features.

So here we can show some examples of what this might look like for articulatory-inspired language features. We can see static-type combinations of phonological features; that would be the combination of all of these here. We might see dynamic changes; here we see, from this stage to here, one phonological feature turning off and another turning on, so we can look at this transition. We also have the absence of phonological features, which might be the combination here; that may be an indication of a particular trait that is unique to a given language.

In addition to that, we can also look at static features, the static combinations of features, where you work across the phonological feature types for a particular time block. What we do is we tend to skip: we only need to get one snapshot here, and we skip the next one if it is the same, so we're just capturing the individual static phonological feature vectors that are unique to that stretch. So an example might be, for a particular language feature, frontness might be found very high for that particular time position, and
this gives you a static representation, for a given time, of, say, frontness and height. These are also augmented with their unigram, bigram, and trigram type combinations for the static language features that we're using, and because of that, they allow some allophonic-type variations to be captured in the language model.

In addition to that, we can look at extracting the dynamic features, and in this context we're obviously looking at transitions, where you have movement. When things are static there is no change, and when there is movement here you capture those parts: we're looking at those value pairs when a phonological feature changes, and we skip the things that are not changing over time. An example would be a language feature where the place of articulation moves from a velar to a labial type position, which would show you movement in the articulators. And again, unigram, bigram, and trigram type language model combinations of the dynamic language features are incorporated.

At this phase, we use a maximum entropy classification framework. Again, as I said, it's a closed set of five languages we're working with. We extract the evidence from these language features, represented here, and we train a maximum entropy classifier for each particular language. The language features themselves could be articulatory, prosodic, or a combination of those; in the prosodic case they are energy and pitch, which we look at in this phase.

So, the prosody-based language features: the motivation here is that we know from studies of human perception of languages that prosody is an important factor. We extract the language features from pitch and energy contours. The extraction strategy uses the RAPT algorithm for pitch information, and we normalize the pitch values by the mean, which removes some of the speaker
dependency there. The contours themselves are broken up, as I mentioned, into pseudo-syllables, and that's done using the phonological feature based parsing scheme. The log pitch and log energy contours are then approximated using a Legendre polynomial basis, and those coefficients are used in a GMM-based classifier for the language. These are the Legendre polynomials that are used; their argument varies between minus one and plus one, so we get different shapes here, and we approximate the contours for the energy and pitch using these polynomials; we have three of them here.

So the prosody-based language features are set up this way: the coefficients themselves are used to train GMMs, and the GMM means are just the cluster centroids forming the codebook for those particular prosody components; the vectors themselves form the language features we're going to be working with. Again, unigram, bigram, and trigram language models over the codebook entries are used.

For the evaluation, we used five Indian languages: a hundred speakers per language, seventy-five hours of speech per language, of which there are ninety hours of spontaneous speech; this work has focused on the read speech. Four of the languages — Kannada, Telugu, Tamil, and Malayalam — are Dravidian languages, while the fifth is the one that belongs to the Indo-Aryan type of languages.

We are benchmarking this approach against a parallel bank of phone recognizers with language models (PPRLM) to see what the performance would be. We used the BUT (Brno) phone recognition setups. For the first set of phone recognizers from BUT, we used the German, Hindi, Japanese, Mandarin, and Spanish set; the
second set was English, Czech, Hungarian, and Russian. We wanted to see whether leaving out any Indian-type languages would actually make a difference.

These are some of the results for the articulatory-based language features, evaluated on the read speech. You can see that using just the static features themselves gives fifty-nine percent language ID accuracy. The Indo-Aryan language is the one that actually scores the best, and there is the most confusion between Kannada and Telugu. If you focus on just the dynamic language features, again we get a significant improvement: for that best language the ID accuracy stays about the same, with some improvement on the others, and overall the dynamic features increase accuracy to seventy-one point nine percent. If you combine the static and dynamic language features: seventy-four percent.

Next, we looked at incorporating the prosody-based language features. This included pitch, energy, and the combination of pitch and energy contours. Of course, prosody by itself is not an overriding factor on which you can build a language ID system, but if you use it to augment the right spectral-based structure, you can improve things. Again, chance here, since it's a five-way classification, is twenty percent, and you can see there is an improvement by combining both pitch and energy based contours using the Legendre polynomial modeling scheme: forty-seven percent classification.

Next, we combined the phonological feature and prosody-based setups, and performance increases to seventy-nine point five percent with this combination. Using the PPRLM we saw eighty-two point two percent, and in another experiment eighty-two point seven percent, so we were still getting better performance from the PPRLM than from the phonological feature plus prosody based structure. We did do some additional experiments
on adding the prosodic-based pieces to the PPRLM, to see if that would improve performance. So these are the final results. Using static language features from the phonological feature based scheme: fifty-nine percent. Using dynamic language features, which double the feature set size to about seven thousand: seventy-one percent. The combination of static and dynamic gives seventy-four percent. Using the prosodic-type structure alone, you can see you don't get much improvement, but if you combine the pitch and energy contour information it improves, and the combination of the phonological and prosodic based schemes gives you seventy-five percent. The PPRLM line gives more improvement on its own, but if you do system fusion with it, you get a four percent absolute increase in language ID for these five languages. So it does show some improvement, incorporating the phonological feature and prosody based structure here.

In conclusion: we presented a new framework for using articulatory and prosodic information for language identification. We developed a new methodology for extracting language features from phonological representations, and the language features themselves can be learned using maximum entropy based techniques. The combination of prosodic and articulatory information was shown to be useful for improved language ID, and the proposed system shows some further improvements when combined with a PPRLM-type system. In the future we're going to expand this to new languages, and also consider performance on the spontaneous speech, which shows some changes in the production-type traits relative to read speech. And there are some references on the page. Thank you.

[Audience question, inaudible]

Yes, we agree, because we actually ran the same types of experiments on the same five-language corpus for dialect ID, and you can clearly see that for these very
confusable South Indian languages, the task is much more challenging. We've seen some differences when we look at accent structure in previous Interspeech and ICASSP papers, and that's why we think we should look at languages that are particularly close together: somehow I think that the subtle differences you might see in these languages may come out a little bit more in the articulatory-type patterns. If you look at the larger statistical based schemes, like the PPRLM approaches, I think that if there are big differences between the languages, it's maybe a little bit more difficult to tease those things out. Sometimes, if there's channel or microphone mismatch, like what Fred talked about with MAP adaptation, that tends to dominate the differences between the sets, so unless the data are all collected in the same way, you can't be sure of that. So I do agree; we tried to make the task more challenging. We are participating in the LRE this year, so we hope that this might come to fruition better when there is a wider set of languages.

[Audience question, inaudible]

Yes, it was actually recorded out on the street, in each of the different regions. So people were recorded in kind of a quiet setting when they were reading, and then they were in more public settings when they were in spontaneous mode. So we have spontaneous speech, and we have a noisier version of the data, at each region.

[Audience question, inaudible]

Yes, there is; we have information on that. We've also run listener tests on this. We had a paper on this for dialect ID, but we are very careful, because these are languages and not dialects. We ran listener tests where we had listeners that were poly-lingual, speaking either two or three of the five languages, to see if they could assess the differences; we had some of those results at the previous Interspeech conference. But yes, there are some differences. I think what you're asking is whether they were all recorded in a
consistent space. The recording setup was the same, but they were recorded in different regional locations.

[Audience question, inaudible]

Well, we could, but as we've seen when we look at accent and dialect on read speech, you don't get good performance. We've actually seen that tremendously with accent: you just don't get much information on accent-sensitive material from read speech. The spontaneous speech is really what you have to focus on. We did this primarily because we could get the results a little bit faster on the read speech. We are running experiments right now on the spontaneous part; we just didn't have them ready in time to get them into this particular paper. That's what we know, I think.
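As an illustration of the prosody parameterization described in the talk — approximating a mean-normalized log-pitch (or log-energy) contour over one pseudo-syllable with a Legendre polynomial basis and using the fitted coefficients as features — here is a minimal sketch in NumPy. The helper name `contour_features`, the toy contour values, and the polynomial degree are assumptions for illustration, not details taken from the system itself.

```python
import numpy as np
from numpy.polynomial import legendre

def contour_features(contour, degree=3):
    """Fit a Legendre polynomial basis to one pseudo-syllable contour.

    Returns the fitted coefficients, which serve as the prosody-based
    features for that pseudo-syllable (hypothetical helper; the talk's
    exact implementation details are not specified).
    """
    contour = np.asarray(contour, dtype=float)
    # Map the time axis onto [-1, +1], the natural domain of Legendre polynomials.
    x = np.linspace(-1.0, 1.0, len(contour))
    return legendre.legfit(x, contour, deg=degree)

# Toy example: a rising-then-falling pitch contour in Hz over one pseudo-syllable.
pitch = np.log([120.0, 135.0, 150.0, 158.0, 150.0, 138.0, 125.0])
pitch = pitch - pitch.mean()          # mean-normalize to reduce speaker dependency
feats = contour_features(pitch, degree=3)

# Reconstruct the contour from the coefficients to check the approximation.
x = np.linspace(-1.0, 1.0, len(pitch))
recon = legendre.legval(x, feats)
print("coefficients:", feats)
print("max approximation error:", np.max(np.abs(recon - pitch)))
```

In the full system, these per-syllable coefficient vectors would be clustered (e.g., via GMM centroids) into a codebook, and unigram/bigram/trigram language models over the codebook entries would provide the prosody-based language features.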