0:00:13the of the not
0:00:20have the perhaps effect
0:00:22so that like to a
0:00:23i'm gonna tend to say this
0:00:24yeah i
0:00:25she can she can a
0:00:27is that correct
0:00:28that's thank you and check
0:00:31i probably said something
0:00:34so i'm going to present some work on language id using a combination of an articulatory
0:00:38based approach and a prosody based approach
0:00:41this work was done by
0:00:44two of my staff members are but she sent one am an ash
0:00:48a i will be presenting a all but later today
0:00:51the work we're focused on is to develop a closed-set language
0:00:55id task
0:00:57the approach that we're using is a phonological feature based scheme
0:01:01we've used this for a number of applications in accent classification
0:01:06as well as dialect id
0:01:08and here we're going to be using this for language id
0:01:11and the main aspect or advancement here is that it combines the prosody and articulatory based structure
0:01:17in our evaluations we're going to benchmark performance against a parallel bank of phone recognizers with language
0:01:24models, or pprlm
0:01:26using a closed set, that is
0:01:29our corpus of five indian languages from the southern portion of india
0:01:36what is the motivation for using articulatory based features well
0:01:40languages accents and dialects have different dominant articulatory traits and we believe that emphasizing these components will
0:01:48help improve language id
0:01:50as opposed to just a pure statistical based approach
0:01:53some of these traits for example we can look at nasalization of vowels, rounded vowels, a lack of
0:01:59retroflex type phonemes
0:02:01a lack of consonant-consonant type clusters
0:02:03if you're looking for example at diphthongs, languages like danish may have seventy five to a hundred
0:02:09diphthongs
0:02:10whereas languages like japanese have no diphthongs so
0:02:13the presence or absence of some of these traits
0:02:17would be useful to see in the articulatory domain
0:02:21and in addition to this automatically learning these articulatory traits allows us to hopefully
0:02:26build a model that contributes to improved language classification
0:02:31so this slide is a little busy but it's key for the proposed system
0:02:37so we start off first
0:02:39by extracting our phonological features so this is a time-frequency representation here
0:02:44and this is a phonological feature representation, we tend to use a government phonology based approach
0:02:51for partitioning
0:02:53each of these blocks here represents a different articulatory type trait like lip rounding
0:02:58tongue height and tongue
0:03:00forward or backward position
0:03:01in addition to that this approach or the phonological feature extraction scheme
0:03:07is a traditional hmm based approach trained on
0:03:10the switchboard corpus
0:03:12and so that's the first step here
0:03:15the bottom two
0:03:17i have a pointer here
0:03:19the four
0:03:20steps two and three and four and five
0:03:24is basically the phonological feature
0:03:26language feature extraction
0:03:28and so two and three are basically the
0:03:32prosodic feature extraction phase so when we look at the prosodic feature phase
0:03:37we're analyzing consonant-vowel clusters
0:03:41this is all done in an unsupervised manner so we don't know what the phone sequence is of course
0:03:47so we break up the sequence into
0:03:51pseudo-syllables or consonant-vowel clusters
0:03:56and after that we extract prosody based traits and this includes
0:04:00both pitch contours and energy contours
0:04:03at the pseudo-syllable level
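A minimal sketch of the pseudo-syllable parsing described here, assuming each frame has already been labelled consonant-like ('C') or vowel-like ('V') from the phonological features. The function name, the two-symbol label alphabet, and the "consonant run followed by vowel run" grouping rule are illustrative assumptions; the talk does not spell out the exact parsing rule.

```python
from typing import List, Tuple


def pseudo_syllables(labels: str) -> List[Tuple[int, int]]:
    """Split frame labels into pseudo-syllables.

    Each segment is a run of 'C' frames followed by a run of 'V' frames,
    returned as (start, end) frame indices with end exclusive. Trailing
    consonants with no following vowel form their own final segment.
    """
    segments: List[Tuple[int, int]] = []
    start = 0
    i = 0
    n = len(labels)
    while i < n:
        while i < n and labels[i] == "C":  # consume the consonant run
            i += 1
        while i < n and labels[i] == "V":  # consume the vowel run
            i += 1
        if i == start:  # unknown label: take one frame so the loop advances
            i += 1
        segments.append((start, i))
        start = i
    return segments


if __name__ == "__main__":
    print(pseudo_syllables("CCVVCVC"))
```

Pitch and energy contours would then be sliced with these frame ranges, one contour piece per pseudo-syllable.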
0:04:06on the phonological side over here
0:04:10we wind up extracting
0:04:11the static language features at the
0:04:15frame level and so that's what's done here in step four
0:04:18this gives us a static snapshot
0:04:22of the phonological feature values that you get at a particular given time
0:04:26and each static feature is also
0:04:29augmented into a unigram bigram and trigram type of representation so we have an expanded
0:04:36language feature set
0:04:39that's the static side, on the dynamic side
0:04:42at step five
0:04:44we extract features along time and phonological feature so if you look at
0:04:50if i can get a nice
0:04:51pointer here
0:04:52you can see here we have phonological features going this way and time going this way so this plot shows
0:04:58movement along the phonological feature and then across time so
0:05:01this movement here shows
0:05:03a pattern
0:05:05where we have movement in the articulatory or the phonological features as well as
0:05:09movement across time
0:05:11so this gives us a new feature
0:05:13it generates a change for every phonological feature change so i'll show you that on the
0:05:19next slide here so
0:05:20in essence
0:05:21the long bars here represent phonological feature
0:05:27values across time
0:05:28and each of the dots that you see in here
0:05:31represents a change in the values of the phonological features
0:05:38so here we can show some examples of what this might look like
0:05:41for articulatory inspired language features
0:05:44we can see static type combinations of phonological features so that would be the combination of all of these
0:05:50we might see dynamic changes so here we see the move from this state to here
0:05:55we have one phonological feature turning off and another turning on so we can look at this transition
0:06:00and we also have the absence of phonological features that might be represented by the combination here
0:06:05which may be an indication of a particular language
0:06:08trait that is unique to this language
0:06:13in addition to that we can also look at the static features or the static combinations of features
0:06:17here we look across
0:06:19the phonological feature types for a particular time
0:06:22block here
0:06:24what we do is we tend to skip, we only need to get a
0:06:28snapshot here and then we'll skip the next one because it is the same
0:06:32so we're just capturing the individual
0:06:34static phonological feature vectors that are unique to that stretch there
0:06:41so an example might be
0:06:43for a particular language feature the tongue might be found back and very high for that particular time
0:06:49position
0:06:50and this gives you a static representation over time for backness and height
0:06:55these are also augmented with their unigram bigram and trigram type combinations for the static language features
0:07:02and
0:07:03because of that they allow us to have some type of
0:07:06allophonic type variations that can be captured in the language model
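A small sketch of the static language feature extraction as described: take one snapshot per stretch of identical frames, then expand the snapshots into unigram, bigram and trigram combinations. The frame encoding (a tuple of feature values) and all function names are hypothetical choices for illustration, not the speakers' implementation.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

# One frame = one value per phonological feature, e.g. (backness, height).
Frame = Tuple[str, ...]


def collapse_repeats(frames: Sequence[Frame]) -> List[Frame]:
    """Keep a single snapshot for each run of identical consecutive frames."""
    return [key for key, _run in groupby(frames)]


def ngrams(tokens: Sequence[Frame], n: int) -> List[Tuple[Frame, ...]]:
    """All contiguous n-grams over the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def static_language_features(frames: Sequence[Frame]) -> List[Tuple[Frame, ...]]:
    """Unigram, bigram and trigram combinations of the collapsed snapshots."""
    tokens = collapse_repeats(frames)
    features: List[Tuple[Frame, ...]] = []
    for n in (1, 2, 3):
        features.extend(ngrams(tokens, n))
    return features


if __name__ == "__main__":
    demo = [("back", "high"), ("back", "high"), ("front", "high"), ("front", "low")]
    for feature in static_language_features(demo):
        print(feature)
```

Collapsing repeats first means the n-grams describe sequences of distinct articulatory configurations rather than frame counts, which is one way to read "we skip the next one because it is the same".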
0:07:11in addition to that we can look at extracting the dynamic features here and in this context we're
0:07:16obviously also looking at transitions where you have movement
0:07:19so when things are static
0:07:20you don't change and when there is a movement here you identify those parts
0:07:25so we're looking at those value pairs when a phonological feature changes
0:07:29and we skip things that are not changing
0:07:33over time so an example would be like
0:07:36a language feature that would be a change in place
0:07:39from an alveolar to a labial type position
0:07:41and that would show you an articulatory movement
0:07:45and again unigram bigram and trigram type language model combinations of the dynamic language features are
0:07:53incorporated to arrive at this phase as well
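A sketch of the dynamic language features under the same assumed frame encoding: for each pair of consecutive frames, emit a (feature, old value, new value) event whenever a phonological feature changes, and emit nothing while the frames stay the same. The event format and the names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[str, ...]
Event = Tuple[str, str, str]  # (feature name, value before, value after)


def dynamic_language_features(frames: Sequence[Frame],
                              names: Sequence[str]) -> List[Event]:
    """Value pairs for every phonological feature change between frames."""
    events: List[Event] = []
    for prev, cur in zip(frames, frames[1:]):
        for name, before, after in zip(names, prev, cur):
            if before != after:  # static features produce no event
                events.append((name, before, after))
    return events


if __name__ == "__main__":
    names = ("place", "height")
    frames = [("alveolar", "high"), ("alveolar", "high"),
              ("labial", "high"), ("labial", "low")]
    print(dynamic_language_features(frames, names))
```

The resulting event sequence can be fed to the same unigram, bigram and trigram expansion used for the static features.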
0:07:56we use a maximum entropy classification framework, again as i said it's a closed set
0:08:01of five languages we're working with
0:08:04we extract the evidence from these language features
0:08:07represented here and we find a maximum entropy classifier for a particular language
0:08:13and the language features themselves could be articulatory, prosodic or a combination of those and in the prosodic case it would be energy
0:08:19and pitch, we look at those at this phase
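For reference, the standard conditional maximum entropy model has the form below. The symbols are assumptions for illustration: x is an utterance, f_i(x) indicates or counts the i-th language feature (an articulatory or prosodic n-gram), and \lambda_{l,i} is the weight learned for that feature under language l. The talk does not give the exact parameterization used.

```latex
P(l \mid x) \;=\;
  \frac{\exp\!\Big(\sum_{i} \lambda_{l,i}\, f_i(x)\Big)}
       {\sum_{l'} \exp\!\Big(\sum_{i} \lambda_{l',i}\, f_i(x)\Big)},
\qquad
\hat{l} \;=\; \arg\max_{l}\; P(l \mid x)
```

In a closed-set task the sum over l' runs over the five languages only, so the posteriors always sum to one.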
0:08:22so for the prosody based language features the motivation here is we do know that perception
0:08:28of languages by humans shows that prosody is an important factor
0:08:32we extract the language features from pitch and energy contours, the extraction strategy uses the rapt algorithm for
0:08:38pitch information
0:08:40and we normalize the pitch value
0:08:42for the mean so we remove some of the speaker dependency there
0:08:46the contours themselves are broken up as i mentioned into pseudo-syllable form
0:08:51and that's done using the phonological feature based parsing scheme
0:08:55and the log pitch and log energy contours are then approximated using a legendre
0:09:00polynomial basis and those coefficients are used
0:09:04in a gmm based classifier
0:09:06for the language
0:09:08so these are the legendre polynomials that are used and they have values
0:09:13varying between minus one and plus one so we get different shapes here and we approximate the contours for
0:09:18the energy and pitch using
0:09:20these polynomials, we have three of them here
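A minimal sketch of the contour approximation, assuming the three basis functions are the first three Legendre polynomials (P0, P1, P2) and that each pseudo-syllable contour is sampled uniformly and mapped onto [-1, 1]. The least-squares fit, the helper names, and the choice of polynomial order are assumptions; the talk only says three polynomials with values between minus one and plus one are used.

```python
from typing import List, Sequence


def mean_normalize(contour: Sequence[float]) -> List[float]:
    """Subtract the mean (e.g. of log pitch) to reduce speaker dependency."""
    mean = sum(contour) / len(contour)
    return [v - mean for v in contour]


def legendre_basis(x: float) -> List[float]:
    """P0, P1, P2 evaluated at x in [-1, 1]."""
    return [1.0, x, 0.5 * (3.0 * x * x - 1.0)]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def fit_contour(values: Sequence[float]) -> List[float]:
    """Least-squares Legendre coefficients for one pseudo-syllable contour."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 samples to fit 3 coefficients")
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    basis = [legendre_basis(x) for x in xs]
    gram = [[sum(b[i] * b[j] for b in basis) for j in range(3)] for i in range(3)]
    rhs = [sum(b[i] * v for b, v in zip(basis, values)) for i in range(3)]
    return solve_linear(gram, rhs)


if __name__ == "__main__":
    log_pitch = [4.9, 5.0, 5.2, 5.1, 4.8]
    print(fit_contour(mean_normalize(log_pitch)))
```

The three coefficients roughly capture the level, slope and curvature of the contour, which is why a short vector per pseudo-syllable can stand in for the whole pitch or energy trajectory.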
0:09:24so the prosody based language features are set up this way, the coefficients themselves are used
0:09:29to train gmms, the gmms
0:09:31are just
0:09:33the cluster centroids for the codebook for those particular prosodic components
0:09:38and the vectors themselves form the language features we're going to be working with and again
0:09:44unigram bigram trigram language models for the codebook entries are used
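A sketch of turning contour coefficient vectors into codebook entries, assuming a hard nearest-centroid assignment. This is a simplification: the talk describes gmms whose centroids act as the codebook, and a real system might assign by likelihood instead. The codebook values and the function names are made up for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest_codeword(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the centroid closest to vec in squared Euclidean distance."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each pseudo-syllable coefficient vector to a codebook entry."""
    return [nearest_codeword(v, codebook) for v in vectors]


def bigrams(entries: Sequence[int]) -> List[Tuple[int, int]]:
    """Adjacent pairs of codebook entries, for a bigram language model."""
    return list(zip(entries, entries[1:]))


if __name__ == "__main__":
    codebook = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]  # hypothetical centroids
    vectors = [(0.1, 0.0, 0.0), (0.9, 1.0, 1.0), (0.2, 0.1, 0.0)]
    entries = quantize(vectors, codebook)
    print(entries, bigrams(entries))
```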
0:09:50for the evaluation we used five languages
0:09:54these are all indian languages, a hundred speakers per language, seventy five hours
0:10:00of speech per language, there's ninety hours of spontaneous speech, this work has focused on the
0:10:06read speech
0:10:08marathi kannada telugu and tamil are these four and malayalam
0:10:13uh a on a seven port here the are but the for the if
0:10:17so marathi is the one that belongs to the indo-aryan type
0:10:22the other four kannada telugu tamil and malayalam
0:10:25are dravidian languages
0:10:27so we are benchmarking this approach against a
0:10:32parallel bank of phone recognizers with language models
0:10:35to see what the performance would be so
0:10:38we
0:10:40collaborated
0:10:42i guess with brno
0:10:44with BUT, and we're using the BUT
0:10:47phone recognition setup
0:10:48so in the first set of phone recognizers from BUT we use the german hindi japanese mandarin and
0:10:55spanish set
0:10:57the second set was
0:10:59english czech hungarian and russian so
0:11:02we wanted to see if we left out any indian type languages whether that would actually
0:11:06make a difference
0:11:08these are some of the results for the articulatory based language features evaluated on the read speech
0:11:13and we can see using just the static features themselves it's fifty nine percent language id
0:11:18accuracy here
0:11:20marathi is the one that actually scores the best and there's most confusion between
0:11:26kannada and telugu
0:11:28um um
0:11:29if you focus just on the dynamic language features again
0:11:33we have a significant improvement
0:11:37at list and the are do uh i D uh a kind of the still the same
0:11:42and some improvement on the marathi so the
0:11:45dynamic features increase this to seventy one point nine percent
0:11:48if you combine the static and dynamic language features it's seventy four percent
0:11:55next we looked at incorporating or considering the prosody based language features
0:12:01this included pitch, energy and the combinations of pitch and energy contours
0:12:06of course prosody by itself is not an overriding factor that
0:12:10you can build a language id system on but if you augment that with a
0:12:14spectral based structure you can improve things
0:12:17so again chance here, it's a five way classification so twenty percent is chance, so you can see that there is
0:12:22an improvement
0:12:23by combining both pitch and energy based contours using the
0:12:27legendre polynomial type modeling scheme
0:12:30forty seven percent classification
0:12:34next we combine the phonological feature and prosodic based setups and performance actually increases to seventy
0:12:41nine point five percent
0:12:44using this combination
0:12:46and using the pprlm
0:12:48we was uh a two point uh as uh two percent
0:12:52some other experiment we had other one that it two point seven percent
0:12:55and so we're still getting better performance with the pprlm versus the phonological feature prosodic based structure
0:13:02we did do some additional experiments of adding the prosodic based pieces to the pprlm
0:13:09to see if that would improve performance and so these are the final results
0:13:13using
0:13:19static language features from the phonological feature based scheme
0:13:22it's fifty nine percent, using dynamic
0:13:25language features, which doubles the feature set size to seven thousand
0:13:30it's seventy one percent and with the combination of static and dynamic we get seventy four percent
0:13:36using the prosodic type structure you can see you don't get much improvement there but if you combine pitch
0:13:41and energy contour information
0:13:44it improves
0:13:45and the combination of phonological and prosodic based schemes gives you seventy five percent
0:13:50the pprlm gives us more improvement but if you do
0:13:54system fusion for that
0:13:56you get a four percent increase
0:13:59absolute in language id for the five languages
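The talk does not say how the system fusion was done. One common and simple option is score-level fusion, sketched below as a weighted sum of per-language scores from the two systems. The weight, the score values and the function names are all hypothetical, and this is not claimed to be the method behind the reported four percent gain.

```python
from typing import Dict


def fuse_scores(scores_a: Dict[str, float],
                scores_b: Dict[str, float],
                weight: float = 0.5) -> Dict[str, float]:
    """Weighted sum of per-language scores from two systems."""
    return {lang: weight * scores_a[lang] + (1.0 - weight) * scores_b[lang]
            for lang in scores_a}


def decide(scores: Dict[str, float]) -> str:
    """Closed-set decision: pick the language with the highest fused score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Made-up posteriors for a two-language toy case.
    articulatory_prosodic = {"kannada": 0.6, "telugu": 0.4}
    pprlm = {"kannada": 0.3, "telugu": 0.7}
    fused = fuse_scores(articulatory_prosodic, pprlm, weight=0.5)
    print(decide(fused))
```

Fusion helps most when the two systems make different errors, which fits the claim that the articulatory and prosodic features carry information the pprlm does not.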
0:14:02it does show some improvements incorporating the phonological feature and the prosody based structure
0:14:09so in conclusion we presented a new framework for using articulatory and prosodic information for language identification
0:14:16we developed a new methodology for extracting language features from phonological representations
0:14:22and the language features themselves
0:14:24can be learned from maximum entropy based techniques
0:14:28the combination of prosodic and articulatory type information was shown to be useful for improved language id
0:14:35and the proposed system
0:14:37shows some further improvements when combined with a pprlm type system
0:14:42in the future we're going to expand this to
0:14:44new languages and also consider performance on the spontaneous speech which sees some changes in the
0:14:51production type traits for the
0:14:52spontaneous speech
0:14:54and there are some references from the paper
0:14:57thank you
0:15:36we agree because we actually ran the same types of experiments on a five way arabic corpus for
0:15:41dialect id
0:15:43and
0:15:44so you can clearly see that the four very confusable south indian languages
0:15:49are much more challenging, we've seen some differences when we look at accent structure in
0:15:54previous interspeech icassp papers
0:15:56that's why we think that we can look at languages that are particularly close together
0:16:00in the sense that i think that the subtle differences that you might see in these languages may come out a little
0:16:05bit more in the articulatory type pattern
0:16:08than if you look at the larger statistical based schemes like pprlm
0:16:14i think somehow
0:16:16if there are big differences between the languages i think it's maybe a little bit more difficult to kind
0:16:21of weed those things out sometimes
0:16:24if there's channel or microphone mismatch like what fred talked about using map
0:16:29sometimes that tends to dominate the differences between the sets so
0:16:32unless they are collected all the same way you can't be sure of that so
0:16:36i do agree, we tried to make the task more challenging
0:16:39we are participating in the lre this year so we hope that this might
0:16:43come to fruition a bit better when
0:16:45there is a wider
0:17:01it was recorded
0:17:10it was actually recorded on the street
0:17:14in each of the different regions
0:17:16people were recorded in kind of a quiet area when they were reading and then they were in
0:17:20more public settings when they were
0:17:24in spontaneous mode so we have spontaneous and we have more of a noisy version of that
0:17:32in each region yeah
0:17:40so we have information on that, we've also run listener tests on this one
0:17:44so we had a paper on this in dialect id but we're very careful because these are languages and
0:17:49not dialects
0:17:51but we have listener tests where we had listeners that
0:17:55were multilingual and spoke either two or three of the five languages
0:17:59to see if they could assess the differences
0:18:02we had some of those results in the previous interspeech conference but yes there are some differences, i think
0:18:06what you're asking is whether they were all recorded
0:18:09in a consistent space
0:18:11so the recording set up was the same
0:18:13but they were recorded in different regional locations
0:18:22joe chose
0:18:39yeah well we've seen when we look at accent and dialect on read speech you don't get
0:18:44good performance, we've actually seen that tremendously in accent, you just
0:18:49you don't get much information on accent sensitive material from read speech
0:18:54the spontaneous is really what you have to focus on and so
0:18:57we've done this primarily because we can get the results
0:19:00a little bit faster on the read speech, we are running experiments right now on the spontaneous part
0:19:05we just didn't have them ready in time to get them into this particular paper, what we do know