Speech Transcript - LANGUAGE IDENTIFICATION USING A COMBINED ARTICULATORY PROSODY FRAMEWORK

0:00:13	and
0:00:13	the of the not
0:00:16	oh
0:00:20	have the perhaps effect
0:00:22	so that like to a
0:00:23	i'm gonna tend to say this
0:00:24	yeah i
0:00:25	she can she can a
0:00:27	is that correct
0:00:28	that's thank you and check
0:00:30	say
0:00:31	i probably said something
0:00:33	oh
0:00:34	so i'm going to present some work on language id using a combination of an articulatory
0:00:38	based approach and a prosody based one
0:00:41	this work was done by
0:00:44	two of my staff members abhijeet sangwan and mahnoosh
0:00:48	a i will be presenting a all but later today
0:00:50	um
0:00:51	the work we're focused on is to develop a closed set language
0:00:55	id task
0:00:57	the approach that we're using is a phonological feature based scheme
0:01:01	we've used this for a number of applications in accent classification
0:01:06	as well as dialect id
0:01:08	and here we're going to be using this for language id
0:01:11	and the main aspect or advancement here is a combined prosody and articulatory based structure
0:01:17	in our evaluations we're going to benchmark performance against a parallel bank of phone recognizers with language
0:01:24	models or pprlm
0:01:26	using a closed set
0:01:29	our corpus of five indian languages from the southern portion of india
0:01:35	so
0:01:36	what is the motivation for using articulatory based features well
0:01:40	languages accents and dialects have different dominant articulatory traits and we believe that emphasising these components will
0:01:48	help improve language id
0:01:50	as opposed to just a pure statistical based approach
0:01:53	some of these traits for example we can look at nasalization of vowels rounded vowels a lack of
0:01:59	retroflex type phonemes
0:02:01	a lack of consonant-consonant type clusters
0:02:03	if you're looking for example at diphthongs languages like danish may have seventy five to a hundred
0:02:09	diphthongs
0:02:10	whereas languages like japanese have no diphthongs so
0:02:13	the presence or absence of some of these traits
0:02:17	would be useful to see in the articulatory domain
0:02:21	and in addition to this automatically learning these articulatory traits will allow us to hopefully
0:02:26	build models that contribute to improved language classification
0:02:31	so this slide is a little busy but it's key for the proposed system
0:02:37	so we start off first
0:02:39	by extracting our phonological features so this is a time-frequency representation here
0:02:44	i would probably out five time as a phonological feature representation we tend to use government phonology based approaches
0:02:51	for partitioning
0:02:53	each of these blocks here represents a different articulatory type trait like lip rounding
0:02:58	tongue height and
0:03:00	forward or backward
0:03:01	in addition to that the phonological feature extraction scheme
0:03:07	is a traditional hmm based approach trained on
0:03:10	the switchboard corpus
0:03:12	and so that's the first step here
0:03:15	the bottom two
0:03:17	do i have a pointer here
0:03:18	um
0:03:19	uh four
0:03:20	steps four and five and two and three four and five
0:03:24	is basically the phonological feature
0:03:26	language feature extraction
0:03:28	and so two and three are basically the
0:03:32	uh
0:03:32	prosodic feature extraction phase so when we look at the prosodic feature phase
0:03:37	we're analysing consonant-vowel and consonant clusters
0:03:41	this is all done in an unsupervised manner so we don't know what the phone sequence is of course
0:03:46	um
0:03:47	so we break up the sequence into
0:03:51	pseudo syllables or consonant-vowel and consonant clusters
0:03:55	um
0:03:56	and after that we extract prosody based traits and this includes
0:04:00	both pitch contours and energy contours
0:04:03	at the pseudo syllable level
0:04:06	and on the phonological side over here
0:04:10	we wind up extracting
0:04:11	the static language features at the
0:04:15	frame level and so that's what's done here in step four
0:04:18	this gives us a static snapshot
0:04:21	um
0:04:22	of the phonological feature values that we get at a particular given time
0:04:26	and each static feature is also
0:04:29	augmented into a unigram bigram and trigram type of representation so we have an expanded
0:04:36	language feature set
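
[Editor's sketch] The unigram, bigram and trigram expansion described above can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `ngram_counts` and the symbol strings are hypothetical, and the talk does not say how the features were actually encoded or counted.

```python
from collections import Counter


def ngram_counts(symbols, max_n=3):
    """Count every n-gram of order 1..max_n in a sequence of hashable symbols."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(symbols) - n + 1):
            counts[tuple(symbols[i:i + n])] += 1
    return counts


# Hypothetical static snapshots: one symbol per snapshot, naming the active features.
frames = ["high+front", "low+back", "high+front", "low+back"]
features = ngram_counts(frames)
```

Each key of `features` is a tuple of one, two or three consecutive snapshots, so the same counter holds the whole expanded feature set.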
0:04:38	um
0:04:39	that's the static side on the dynamic side
0:04:42	at step five
0:04:44	we extract features along time and phonological feature so if you look at
0:04:50	if i get a nice
0:04:51	pointer here
0:04:52	um
0:04:52	you can see here we have phonological features going this way time going this way so this plot kind of shows
0:04:58	a movement along the phonological feature and then across time so
0:05:01	this movement here kind of shows
0:05:03	a pattern
0:05:05	where we have movement along the phonological features as well as
0:05:09	movement across time
0:05:11	so this gives us a new feature
0:05:13	it generates a change for every phonological feature change so i'll show you that on the
0:05:19	next slide here so
0:05:20	in essence
0:05:21	the long bars here represent phonological feature
0:05:27	values across time
0:05:28	and each of the dots that you see in here
0:05:31	represents a change in the values of the phonological features
0:05:38	so here we can show some examples of what this might look like
0:05:41	for articulatory inspired language features
0:05:44	we can see static type combinations of phonological features so that would be the combination of all of these
0:05:49	here
0:05:50	we might see dynamic changes so here we see the move from this state to here
0:05:55	we have one phonological feature turning off and one turning on so we can look at this transition
0:06:00	and we also have the absence of phonological features that might represent the combination here
0:06:05	which may be an indication of a particular language
0:06:08	trait that is unique to this language
0:06:13	in addition to that we can also look at static features or the static combinations of features
0:06:17	you work across
0:06:19	the phonological feature types for a particular time
0:06:22	block here
0:06:24	and
0:06:24	what we do is we tend to skip we only need to get a
0:06:28	snapshot here and we will skip the next one because it is the same
0:06:32	so we're just capturing the individual
0:06:34	static phonological feature vectors that are unique to that step there
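
[Editor's sketch] Skipping a frame when it repeats the previous snapshot amounts to collapsing runs of identical frame vectors. A minimal Python sketch follows; the binary feature tuples and the name `static_snapshots` are illustrative assumptions, not the authors' implementation.

```python
from itertools import groupby


def static_snapshots(frames):
    """Keep one copy of each run of identical consecutive frame vectors."""
    return [key for key, _ in groupby(frames)]


# Hypothetical frame-level feature vectors (1 = feature on, 0 = feature off).
frames = [(1, 0, 1), (1, 0, 1), (0, 0, 1), (0, 0, 1), (1, 0, 1)]
snapshots = static_snapshots(frames)
```

A vector that reappears later, after something else has intervened, is kept again, because only consecutive repeats are skipped.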
0:06:41	so an example might be
0:06:43	a particular language feature might be found very high for that particular tongue
0:06:49	position
0:06:50	and this gives you a static representation for tongue backness and height
0:06:54	okay
0:06:55	these are also augmented with their unigram bigram and trigram type combinations for the static language features that we're
0:07:01	using
0:07:02	and
0:07:03	because of that they allow us to have some type of
0:07:06	allophonic type variations that can be captured in the language model
0:07:11	in addition to that we can look at extracting the dynamic features here and in this context we're
0:07:16	obviously also looking at transitions where you have movement
0:07:19	so when things are static
0:07:20	you don't change and when there is a movement here you kind of identify those parts
0:07:25	so we're looking at those value pairs when a phonological feature changes
0:07:29	and we skip things that are not changing
0:07:33	over time so an example would be
0:07:36	a language feature that would be a place
0:07:39	from an alveolar to a labial type position
0:07:41	and that would show you an articulatory movement
0:07:45	and again unigram bigram and trigram type language model combinations of the dynamic language features are
0:07:53	incorporated at this phase as well
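
[Editor's sketch] The dynamic features are described as value pairs emitted only when a phonological feature changes. One way to picture that, assuming each frame is a tuple of named feature values (the names and values below are hypothetical):

```python
def dynamic_features(frames, names):
    """Return (name, old_value, new_value) for each feature that changes between frames."""
    changes = []
    for prev, curr in zip(frames, frames[1:]):
        for name, old, new in zip(names, prev, curr):
            if old != new:
                changes.append((name, old, new))
    return changes


names = ["place", "height"]
frames = [
    ("alveolar", "high"),
    ("alveolar", "high"),  # nothing changes here, so nothing is emitted
    ("labial", "high"),    # place moves from alveolar to labial
    ("labial", "low"),     # height moves from high to low
]
changes = dynamic_features(frames, names)
```

The resulting sequence of change events could then be fed to the same kind of n-gram expansion used for the static features.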
0:07:56	we use a maximum entropy classification framework again as i said it's a closed set
0:08:01	of five languages we're working with
0:08:04	we extract the evidence from these language features
0:08:07	represented here and we find a maximum entropy classifier for a particular language
0:08:13	and the language features themselves could be articulatory prosodic or a combination of those and the prosodic cues would be energy
0:08:19	and pitch so we'll look at those at this phase
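
[Editor's sketch] A maximum entropy classifier over feature counts is a log-linear model, i.e. multinomial logistic regression. The sketch below trains one with plain per-example gradient ascent on toy data. The feature names, language labels, learning rate and epoch count are all assumptions made for illustration; the talk does not describe the training procedure or any regularisation.

```python
import math
from collections import defaultdict


def softmax_scores(weights, feats, labels):
    """P(label | feats) under a log-linear (maximum entropy) model."""
    raw = {y: sum(weights[(y, f)] * v for f, v in feats.items()) for y in labels}
    top = max(raw.values())  # subtract the max score for numerical stability
    exp = {y: math.exp(s - top) for y, s in raw.items()}
    z = sum(exp.values())
    return {y: e / z for y, e in exp.items()}


def train_maxent(data, labels, epochs=200, lr=0.5):
    """Per-example gradient ascent on the conditional log-likelihood (no regularisation)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in data:
            probs = softmax_scores(weights, feats, labels)
            for y in labels:
                err = (1.0 if y == gold else 0.0) - probs[y]
                for f, v in feats.items():
                    weights[(y, f)] += lr * err * v
    return weights


def classify(weights, feats, labels):
    """Pick the label with the highest posterior probability."""
    probs = softmax_scores(weights, feats, labels)
    return max(probs, key=probs.get)


# Hypothetical language-feature counts for two made-up languages.
labels = ["lang_a", "lang_b"]
data = [
    ({"retroflex": 2, "diphthong": 0}, "lang_a"),
    ({"retroflex": 0, "diphthong": 3}, "lang_b"),
    ({"retroflex": 1}, "lang_a"),
    ({"diphthong": 1}, "lang_b"),
]
weights = train_maxent(data, labels)
```

In a five-language closed-set task the same code applies unchanged with five labels; only the label list and the feature dictionaries differ.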
0:08:22	so for the prosody based language features the motivation here is we do know that perception
0:08:28	of languages by humans shows that prosody is an important factor
0:08:32	we extract the language features from pitch and energy contours the extraction strategy uses the RAPT algorithm for
0:08:38	pitch information
0:08:40	and we normalize the pitch value
0:08:42	for the mean so we remove some of the speaker dependency there
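
[Editor's sketch] Mean normalization of pitch can be pictured as subtracting the utterance-level mean from the log pitch values. The sketch assumes that normalization happens in the log domain and that unvoiced frames are marked with 0; neither detail is stated in the talk, and the function name is hypothetical.

```python
import math


def normalize_log_pitch(f0_hz):
    """Log-compress voiced F0 values and subtract their mean."""
    # Assumption: a value of 0 marks an unvoiced frame and is dropped.
    log_f0 = [math.log(f) for f in f0_hz if f > 0]
    mean = sum(log_f0) / len(log_f0)
    return [x - mean for x in log_f0]


# Hypothetical pitch track in Hz with one unvoiced frame.
normalized = normalize_log_pitch([100.0, 0.0, 200.0, 400.0])
```

After this step a speaker with a generally high voice and one with a generally low voice produce contours centred on zero, so only the shape of the contour remains.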
0:08:46	the contours themselves are broken up as i mentioned into pseudo syllable form
0:08:50	um
0:08:51	and that's done using the phonological feature based parsing scheme
0:08:55	and the log pitch and log energy contours are then approximated using a legendre
0:09:00	polynomial basis and those coefficients are used
0:09:04	in a gmm based classifier for
0:09:06	for the language
0:09:08	so these are legendre polynomials that are used and have values
0:09:13	that vary between minus one and plus one so we get different shapes here and we approximate the contours for
0:09:18	the energy and pitch using
0:09:20	these polynomials we have three of them here
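
[Editor's sketch] Approximating a contour with a small polynomial basis on the interval from minus one to plus one can be done by least squares. The sketch below assumes the basis is the Legendre polynomials of order 0, 1 and 2 (the talk only says there are three) and that the contour is sampled uniformly over one pseudo syllable. It uses only the standard library, and all names are hypothetical.

```python
def legendre_basis(x):
    """First three Legendre polynomials P0, P1, P2 evaluated at x in [-1, 1]."""
    return [1.0, x, 0.5 * (3.0 * x * x - 1.0)]


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_contour(values):
    """Least-squares Legendre coefficients for a uniformly sampled contour (needs >= 3 points)."""
    n = len(values)
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]  # map frame index onto [-1, 1]
    rows = [legendre_basis(x) for x in xs]
    k = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve(gram, rhs)


# Hypothetical log-pitch contour for one pseudo syllable: level 5.0, rising slope 0.2.
contour = [5.0 + 0.2 * x for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
coeffs = fit_contour(contour)
```

The three coefficients summarise the level, the slope and the curvature of the contour, which is what makes them a compact feature vector for each pseudo syllable.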
0:09:24	so the prosody based language features are set up this way the coefficients themselves are used
0:09:29	to train gmms the gmms
0:09:31	are just
0:09:33	the cluster centroids of the codebook for those particular prosodic components
0:09:38	and the vectors themselves form the language features we're going to be working with and again
0:09:44	unigram bigram trigram language models for the codebook entries are used
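
[Editor's sketch] Turning coefficient vectors into codebook entries can be pictured as nearest-centroid quantization followed by n-grams over the resulting indices. Treating the gmm means as centroids with a hard assignment is a simplification assumed here, and the centroids and vectors are made-up values.

```python
def quantize(vectors, centroids):
    """Map each coefficient vector to the index of its nearest centroid (squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist(v, centroids[k]))
            for v in vectors]


# Hypothetical two-entry codebook: a rising contour and a falling contour.
centroids = [(5.0, 0.2, 0.0), (5.0, -0.2, 0.0)]
vectors = [(5.1, 0.25, 0.01), (4.9, -0.15, 0.0), (5.0, 0.1, -0.02)]
codes = quantize(vectors, centroids)
bigrams = list(zip(codes, codes[1:]))  # consecutive pairs of codebook entries
```

Once every pseudo syllable is reduced to a codebook index, the prosodic stream becomes a symbol sequence that can be modelled with the same unigram, bigram and trigram machinery as the articulatory stream.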
0:09:50	for the evaluation we used five languages
0:09:54	these are indian languages a hundred speakers per language seventy five hours
0:10:00	of speech per language there's ninety hours of spontaneous speech this work has focused on the
0:10:06	read speech
0:10:07	um
0:10:08	marathi kannada telugu tamil are these four and malayalam
0:10:13	uh a on a seven port here the are but the for the if
0:10:17	so marathi is the one that belongs to the indo-aryan type
0:10:21	languages
0:10:22	the other four kannada telugu tamil and malayalam
0:10:25	are dravidian languages
0:10:27	so we are benchmarking this approach against a
0:10:32	parallel bank of phone recognizers with language models
0:10:35	to see what the performance would be so
0:10:38	uh we
0:10:39	uh
0:10:40	collaborated or
0:10:42	i guess with brno
0:10:43	sorry
0:10:44	BUT and we're using the BUT
0:10:47	phone recognition setup
0:10:48	so in the first set of phone recognizers from BUT we used the german hindi japanese mandarin and
0:10:55	spanish set
0:10:57	the second set was
0:10:59	english czech hungarian and russian so
0:11:02	we wanted to see if we left out any indian type languages whether that would actually
0:11:06	make a difference
0:11:08	these are some of the results for articulatory based language features evaluated on the read speech
0:11:13	and we can see using just the static features themselves a fifty nine percent language id
0:11:18	case here
0:11:19	um
0:11:20	marathi is the one that actually scores the best and there's most confusion between
0:11:26	kannada and telugu
0:11:28	um um
0:11:29	if you focus just on the dynamic language features again
0:11:33	we have a significant improvement
0:11:37	at list and the are do uh i D uh a kind of the still the same
0:11:42	and some improvement uh on B am are at a so the
0:11:45	dynamic features increase this to seventy one point nine percent
0:11:48	if you combine the static and dynamic language features it's seventy four percent
0:11:54	okay
0:11:55	next we looked at incorporating or considering the prosody based language features
0:12:01	this included pitch energy and the combinations of pitch and energy contours
0:12:06	of course prosody by itself is not an overriding factor that
0:12:10	you can build a language id system on but if you augment that with
0:12:14	a spectral based structure you can improve things
0:12:17	so again chance here it's a five way classification so twenty percent is chance so you can see that there is
0:12:22	an improvement
0:12:23	by combining both pitch and energy based contours using the
0:12:27	legendre polynomial type modeling scheme
0:12:30	forty seven percent classification
0:12:34	next we combine the phonological feature and prosodic based setups performance actually increases to seventy
0:12:41	nine point five percent
0:12:44	uh using this combination
0:12:46	and using the pprlm
0:12:48	we was uh a two point uh as uh two percent
0:12:52	some other experiment we had other one that it two point seven percent
0:12:55	and so we're still getting better performance with the pprlm versus the phonological feature prosodic based structure
0:13:02	we did do some additional experiments of adding the prosodic based pieces to the pprlm
0:13:09	to see if that would improve performance and so these are the final results
0:13:13	using static language features from a
0:13:16	um i
0:13:17	uh
0:13:18	using
0:13:19	static language features from the phonological feature based scheme
0:13:22	fifty nine percent using dynamic
0:13:25	language features double the number of feature set size you seven thousand
0:13:30	seventy one percent and the combination of static and dynamic we get seventy four percent
0:13:36	using prosodic type structure you can see you don't get much improvement there but if you combine pitch
0:13:41	and energy contour information
0:13:44	it improves
0:13:45	and the combination of phonological and prosodic based schemes gives you seventy five percent
0:13:50	the pprlm gives us more improvement but if you do
0:13:54	system fusion for that
0:13:56	you get a four percent increase
0:13:59	absolute in language id for the five languages
0:14:01	so
0:14:02	it does show some improvements incorporating the phonological feature and the prosody based structure
0:14:08	here
0:14:09	so in conclusion we presented a new framework for using articulatory and prosodic information for language identification
0:14:16	we developed a new methodology for extracting language features from phonological representations
0:14:22	and the language features themselves
0:14:24	are automatically learned using maximum entropy based techniques
0:14:27	um
0:14:28	the combination of prosodic and articulatory type information was shown to be useful for improved language id
0:14:35	and the proposed system
0:14:37	shows some further improvements when combined with a pprlm type system
0:14:42	in the future we're going to expand this to
0:14:44	new languages and also consider performance on the spontaneous speech which sees some changes in articulation and
0:14:51	production type traits for the
0:14:52	spontaneous speech
0:14:54	and there are some references from the paper
0:14:57	thank you
0:15:05	i
0:15:06	i
0:15:24	oh
0:15:35	you
0:15:36	we agree because we've actually run the same types of experiments on a five arabic dialect corpus for
0:15:41	dialect id
0:15:43	and uh
0:15:44	so you can clearly see that the four very confusable south indian languages
0:15:49	that's much more challenging we've seen some differences when we look at accent structure in
0:15:54	previous interspeech icassp papers
0:15:56	that's why we think that when we look at languages that are particularly close together
0:16:00	we somehow think that the subtle differences that you might see in these languages may come out a little
0:16:05	bit more in the articulatory type pattern
0:16:08	if you look at the larger statistical based schemes like pprlm
0:16:12	approaches
0:16:13	um
0:16:14	i think somehow
0:16:16	if there are big differences between the languages i think it's maybe a little bit more difficult to kind
0:16:21	of weed those things out sometimes
0:16:24	uh if there's channel or microphone mismatch like what fred it talked about using map
0:16:29	sometimes that tends to dominate the differences between the sets so
0:16:32	unless they are collected all the same way you can't be sure of that so
0:16:36	i do agree we tried to make the task more challenging
0:16:39	we are participating in the LRE this year so we hope that this might
0:16:43	come to fruition a bit better when
0:16:45	there is a wider
0:16:48	oh
0:16:53	i
0:16:54	i
0:16:58	i
0:16:58	yeah
0:16:59	i
0:17:00	oh
0:17:01	it was recorded
0:17:03	uh
0:17:04	oh
0:17:06	a
0:17:07	a
0:17:10	it was actually recorded on the street
0:17:13	and
0:17:14	in each of the different regions
0:17:15	so
0:17:16	people were recorded in kind of a quiet setting when they were reading and then they were in
0:17:20	more public settings when they were
0:17:22	uh
0:17:24	in spontaneous mode so we have spontaneous and we have more of a noisy version of that
0:17:30	they
0:17:32	at each region yeah
0:17:33	i
0:17:36	i
0:17:37	i
0:17:38	yeah
0:17:39	a
0:17:40	so we have information on we've also run listener tests on this one we are looking
0:17:44	so we had a paper on this in dialect id but we're very careful because these are languages and
0:17:49	not dialects
0:17:51	but we have listener tests where we had listeners that
0:17:55	were polylingual and spoke either two or three of the five languages
0:17:59	to see if they could assess the differences
0:18:01	um
0:18:02	we had some of those results in the previous interspeech conference but yes there are some differences i think
0:18:06	what you're asking is whether they were all recorded
0:18:09	in a consistent space
0:18:10	um
0:18:11	so the recording setup was the same
0:18:13	but they were recorded in different regional locations
0:18:18	a
0:18:21	two
0:18:22	joe chose
0:18:29	a
0:18:39	yeah well we've seen when we look at accent and dialect on read speech you don't get
0:18:44	good performance we've actually seen that tremendously in accent you just
0:18:49	you don't get much information on accent sensitive material from read speech
0:18:52	um
0:18:54	the spontaneous is really what you have to focus on and so
0:18:57	we've done this primarily because we can get the results
0:19:00	a little bit faster on the read speech we are running experiments right now on the spontaneous part
0:19:05	we just didn't have them ready in time to get them into this particular paper what we do know
0:19:11	a
0:19:12	think

LANGUAGE IDENTIFICATION USING A COMBINED ARTICULATORY PROSODY FRAMEWORK

Language Identification

Presented by: John Hansen, Author(s): Abhijeet Sangwan, Mahnoosh Mehrabani, John Hansen, The University of Texas at Dallas, United States