0:00:15 Our next speaker is Professor Minematsu from the University of Tokyo.
0:00:20 The talk is about speaker-basis accent clustering of Englishes using invariant structure analysis and the Speech Accent Archive.
0:00:56 All right, thank you. This is the outline of the presentation.
0:01:02 First, the background and objective; then what kind of corpus we used and what kind of method of speech analysis we used. After that I will show you a very interesting result from a previous study, which led to the experiments done in the current paper.
0:01:18 In this talk I focus on English, but not only on American or British English. As you know, English is used as a global language, an international language spoken by everybody.
0:01:32 Recently we can find more and more researchers and teachers treating English not as "English" but as "Englishes".
0:01:43 What is "Englishes"? For linguists, it is a set of localized versions of English. They claim that there is no standard pronunciation of English, and that American English and British English are just two major examples of accented English.
0:01:59 This is the very well-known three-circle model of World Englishes. The inner circle is English as a native language; the outer circle is English as an official language, like Singapore; and the expanding circle is English as a foreign language, in countries like Japan, Finland, and Brazil.
0:02:19 In this situation, what kind of view change is found among linguists?
0:02:26 Their great interest lies in how one type of pronunciation compares to other varieties, not in how incorrect one type of pronunciation is compared to American English or British English.
0:02:40 Here I ask a simple question: what is the minimum unit of accent diversity in World Englishes? Some people may say it is the country: American accent, Japanese accent, or Finnish accent. Others may say it is the region or the city: New York accent or Helsinki accent, or even the town or village.
0:02:59 But if we consider the reason for accent, it is the personal history of learning English. Then the minimum unit will be the individual: my English, your English, his English, and her English. How many different kinds of English are there? The number of English users is said to be 1.5 billion, so we can say there are 1.5 billion Englishes on this planet.
0:03:25 OK, so the aim of this study is to examine the technical feasibility of speaker-basis accent clustering of World Englishes.
0:03:34 If you do bottom-up clustering, you have to prepare a distance matrix among all the elements, that is, among all the speakers. So the aim of this study is the technical feasibility of estimating inter-speaker accent distance.
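The bottom-up step mentioned here can be sketched in a few lines. This is only an illustration, with a made-up four-speaker distance matrix and a naive average-linkage merge, not the clustering code from the study:

```python
import numpy as np

def bottom_up_cluster(D, n_clusters):
    """Naive average-linkage agglomerative clustering on a distance matrix."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average inter-cluster distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([D[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merge the closest pair
    return clusters

# Hypothetical symmetric accent-distance matrix for four speakers:
# speakers 0 and 1 share an accent, as do speakers 2 and 3.
D = np.array([[0.0, 0.2, 0.9, 1.0],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [1.0, 0.9, 0.3, 0.0]])
print(bottom_up_cluster(D, 2))  # -> [[0, 1], [2, 3]]
```

Once a reliable inter-speaker accent-distance matrix is available, any agglomerative scheme of this kind can produce the accent clusters.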
0:03:51 What kind of corpus did we use? The Speech Accent Archive. This is a very interesting and very useful corpus for us, developed by Steven Weinberger at George Mason University.
0:04:06 In developing the corpus, he asked lots and lots of international users of English to read a common paragraph, "Please call Stella ...". The paragraph was designed to achieve high phonetic coverage of American English.
0:04:31 Let me show you one example from the Speech Accent Archive. (A sample is played.)
0:04:51 This is a speaker from the Czech Republic. In the Speech Accent Archive, this kind of variously accented English can be found.
0:05:01 The corpus is also very useful because it provides us with narrow IPA transcripts, something like this.
0:05:28 Using these transcripts, we can train a predictor of the accent distances, so this is very useful.
0:05:34 So, the next point: what is the technical challenge here? The point is that the acoustic distance between two speakers is not the accent distance.
0:05:47 I will show you three example utterances reading the same sentence. A is from an American female speaker, and the other two, B and X, are from my pronunciation: B is my normal English, and X is my intentionally Japanized English. (The three utterances are played.)
0:06:27 The question is whether X is closer to A or closer to B. If you focus on the acoustic difference between two speakers, X has to be much closer to B, because both sounds are generated by the same speaker. But if you focus on the accent difference, the phonetic difference, then I think X will be roughly equally distant from the two.
0:06:53 So how do we extract, how do we estimate, the accent distance between two speakers? Several methods are possible, but in this talk I focus on the speech features used for that task.
0:07:06 We try to remove, or suppress, the non-linguistic factors such as age and gender. These are totally irrelevant factors and have to be removed.
0:07:16 In normal acoustic analysis of speech, phase information is removed and pitch harmonics are removed. But what about speaker identity? How can we remove its main acoustic correlate, the formants, from speech? This is the question.
0:07:29 For that, something like a pronunciation skeleton has to be extracted for comparison. How do we do that? In a previous study we proposed invariant speech structure analysis: a speaker-invariant, or speaker-independent, representation of speech.
0:07:50 How do we extract this skeleton? In this task, good features should be insensitive to age and gender differences, and at the same time sensitive to accent differences.
0:08:05 This figure shows the age and gender differences in the formant frequencies of the Japanese vowels; I think you are familiar with this. And this figure shows the accent differences between two groups of American English speakers with different dialects.
0:08:21 Looking at these graphs, we can say that good features seem to be not feature instances but feature relations: the distribution pattern of the features is similar among speakers of the same dialect, but for different dialects the feature distributions are totally different.
0:08:41 In this talk we focus on such relations, on the shape of the feature distribution, and this shape can be represented geometrically as a distance matrix.
0:08:52 The question here is what kind of distance measure is speaker-independent, or speaker-invariant. How do we define an invariant distance between two speech events?
0:09:13 In studies of voice conversion, speaker variability is often modeled as a transformation of acoustic space. For example, this is the acoustic space of speaker A and this is the acoustic space of speaker B, with one trajectory representing one utterance, "good morning", of speaker A, and "good morning" of speaker B.
0:09:34 How can we extract speaker-independent features here? Speaker independence, or speaker invariance, can be interpreted as transform invariance. So the question is: what is the completely transform-invariant feature, or measure?
0:09:52 We found that f-divergence is a very good candidate for that.
0:10:00 So here, every speech event is characterized as a distribution, not as a point in acoustic space, and we calculate f-divergences between the distributions. The f-divergence measure is invariant under any differentiable and continuous (invertible) transform.
0:10:18 And it is interesting that the converse also holds: if we want a measure that is completely invariant, it has to be an f-divergence.
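Formally, the invariance can be stated as follows; this is the standard formulation, for a convex generator f with f(1) = 0:

```latex
% f-divergence between two distributions p_1 and p_2:
D_f(p_1 \| p_2) = \int p_2(x)\, f\!\left(\frac{p_1(x)}{p_2(x)}\right) \mathrm{d}x

% Under any invertible differentiable transform y = h(x), densities map as
% q_i(y) = p_i(h^{-1}(y)) \left|\det \frac{\partial h^{-1}(y)}{\partial y}\right|,
% so the Jacobian cancels inside the ratio and
D_f(q_1 \| q_2) = D_f(p_1 \| p_2)

% The Bhattacharyya coefficient is of this form (f(t) = -\sqrt{t}, up to sign),
% hence the Bhattacharyya distance used below is also transform-invariant:
BD(p_1, p_2) = -\ln \int \sqrt{p_1(x)\, p_2(x)}\, \mathrm{d}x
```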
0:10:27 This leads to the speech-contrast, structure-based method, which consists of contrast-only features. Let us use these contrasts to represent pronunciation, to represent speech.
0:10:40 This is our approach. This is a trajectory in acoustic space, and one utterance is converted into a sequence of distributions. After that, we calculate the f-divergence between every pair of distributions. In this talk we use the Bhattacharyya distance, which is one of the f-divergence measures.
0:11:03 The same procedure, looked at from a different viewpoint: we implement it as training of an HMM and calculating the distance between any pair of states. From one utterance, one utterance HMM is built, and then we extract not only local contrasts but also distant contrasts.
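The per-utterance structure described here can be sketched as follows, assuming each HMM state is a diagonal-covariance Gaussian; the three toy states below are made up:

```python
import numpy as np

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    var = (var1 + var2) / 2.0
    mahalanobis = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    log_det = 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2)))
    return mahalanobis + log_det

def structure_matrix(states):
    """Matrix of BDs between every pair of states -- the 'speech structure'.

    It keeps all local AND distant contrasts, and each element is unchanged
    by any invertible transform of the feature space (ideally, speaker
    differences).
    """
    n = len(states)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = bhattacharyya(*states[i], *states[j])
    return S

# Three toy 2-dimensional states, each a (mean, variance) pair.
states = [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),
          (np.array([1.0, 0.0]), np.array([1.0, 1.0])),
          (np.array([3.0, 3.0]), np.array([1.0, 1.0]))]
S = structure_matrix(states)
print(S[0, 1])  # -> 0.125 (with unit variances: 1/8 of squared Euclidean distance)
```

With equal covariances, the log-determinant term vanishes and only the separation of the means contributes, which is why the toy values are easy to check by hand.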
0:11:28 OK, so I have explained the background, the objective, the corpus, and the method, and now I am going to show you some interesting results from previous work.
0:11:39 Around 2006 we did speaker-basis accent clustering, but in that experiment we used simulated data: simulated Japanese English.
0:11:52 In this work we used twelve Japanese university students who were returnees from the US. They speak Japanese very well, of course, and they are also very good speakers of American English.
0:12:04 We asked them to pronounce the US English words "beat", "bit", "bet", "bat", and so on, and we also asked them to pronounce the corresponding Japanese words. Then we extracted the vowel segment from each word and formed vowel-based structures.
0:12:34 We wanted simulated, variously accented Japanese English, so for that we did replacement of some American English vowels with Japanese vowels.
0:12:45 Here, this is the set of American English vowels, and accents 1 to 8 differ in the degree of replacement. Accent 8 has no replacement: purely American English vowels. In accent 1, all the American English vowels are replaced by Japanese vowels: totally Japanese-accented vowels. Accents 2 to 7 are partially Japanized American English.
0:13:12 For example, which Japanese vowel replaces this English vowel? This is the replacement table: these English vowels are replaced by the Japanese vowel /a/, and so on.
0:13:29 We have twelve speakers, A to L, and eight pronunciation accents, 1 to 8, so we can have 96 simulated learners. Let's cluster these learners.
0:13:44 From the vowel samples of each simulated learner we can estimate vowel distributions, and then we can get a Bhattacharyya-distance matrix that represents the structure.
0:14:00 To cluster 96 speakers, we have to calculate a 96-by-96 distance matrix. But one speaker is modeled as a structure, so how do we define the distance between two structures? We prepared two kinds of structure-to-structure distance measures.
0:14:19 This is the first one, a very simple definition of the distance between two structures: the Euclidean distance between the two speakers' structure matrices. Speaker A is the blue one and speaker B is the green one, and we calculate the Euclidean distance between these two matrices.
0:14:36 This is another definition of the distance between two structures. In this case, let's focus on vowel /a/ of speaker S and vowel /a/ of speaker T, and calculate the difference between these two distributions using the Bhattacharyya distance. We do this for every vowel i of speakers S and T and take the summation. So we have two different definitions of the distance.
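The two definitions can be sketched as below for 1-dimensional vowel Gaussians; the toy numbers are made up, not the study's data:

```python
import numpy as np

def bd(m1, v1, m2, v2):
    """Bhattacharyya distance between two 1-dim Gaussians (mean, variance)."""
    v = (v1 + v2) / 2.0
    return (m1 - m2) ** 2 / (8.0 * v) + 0.5 * np.log(v / np.sqrt(v1 * v2))

def structure(vowels):
    """Vowel-to-vowel BD matrix for one speaker (the 'structure')."""
    n = len(vowels)
    return np.array([[bd(*vowels[i], *vowels[j]) for j in range(n)]
                     for i in range(n)])

def contrast_based(vs, vt):
    """Definition 1: Euclidean distance between structure matrices
    (a difference of differences, i.e. second order)."""
    return np.linalg.norm(structure(vs) - structure(vt))

def instance_based(vs, vt):
    """Definition 2: summed BD between corresponding vowels (first order)."""
    return sum(bd(*a, *b) for a, b in zip(vs, vt))

s = [(0.0, 1.0), (2.0, 1.0)]  # speaker S: two vowels
u = [(1.0, 1.0), (3.0, 1.0)]  # speaker U: same vowel contrasts, globally shifted
print(contrast_based(s, u))   # -> 0.0   (blind to the global shift)
print(instance_based(s, u))   # -> 0.25  (reacts to the global shift)
```

A uniform shift of all vowels, which is a speaker-like change, leaves the contrast-based distance at zero while the instance-based distance reacts to it.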
0:15:04 Using these two definitions, we can obtain two 96-by-96 distance matrices among the speakers, and we can draw dendrograms from these two matrices.
0:15:21 What matters is what kind of result we obtain. If the result is like this, where 1 to 8 are the pronunciation accents, we are very happy, because it is accent clustering. If the result is something like this, with A, B, C, D, it is speaker clustering, and we are not happy.
0:15:37 So what kind of result did we obtain?
0:15:42 This is the result of the contrast-based Euclidean distance, and this is the result of the instance-based distance measure, the second definition.
0:15:53 Here you can see accents 1, 3, 5: some noise can be found, but on the whole this is rather good accent clustering. But what about this one? J, L, K, A, and so on: completely speaker clustering.
0:16:11 There is a big difference between the two dendrograms. Why such a big difference? Because of the difference in the definition of the distance between two structures.
0:16:27 This one is just a difference of two vowel sets, but this one is a difference of differences. The first-order difference gives you speaker clustering, but the second-order difference gives you accent clustering. That is the interesting finding.
0:16:45 So let's use this for real data: the Speech Accent Archive. We have data from a large number of speakers, all reading the same paragraph. Let's cluster these speakers.
0:17:18 In this work we adopted a strategy that is a little different from the previous study. In the previous study we just calculated the Euclidean distance between two structures, but in this study we treated the distance calculation as a regression problem.
0:17:39 First we prepared reference distances between two speakers. The reference distances are derived from the IPA transcripts: we ran a DTW alignment between the transcripts of two speakers, and that defines the reference distance.
0:17:57 This is the target of the prediction, and for the prediction we used a regression model. Here, structure-based features are always used as input features.
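The reference-distance computation described here can be sketched as a classic DTW over phone sequences, where the local cost of aligning two phones comes from a phone-to-phone distance table (in the talk, Bhattacharyya distances between phone HMMs). The tiny three-phone table below is hypothetical:

```python
import numpy as np

def dtw_distance(seq1, seq2, phone_dist):
    """Total cost of the best monotonic alignment of two phone sequences."""
    n, m = len(seq1), len(seq2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = phone_dist[(seq1[i - 1], seq2[j - 1])]
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Hypothetical phone-to-phone distances: 0 for identical phones, 1 otherwise.
pd = {(a, b): 0.0 if a == b else 1.0 for a in "aiu" for b in "aiu"}
print(dtw_distance(list("aia"), list("aiu"), pd))  # -> 1.0 (one mismatched phone)
```

With a graded phone-distance table instead of this 0/1 one, the alignment cost becomes a continuous measure of how differently two speakers pronounced the same paragraph.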
0:18:08 For comparison, we ran another experiment where the distance between two phonemic transcripts is used. In that experiment, the phonetic transcripts are converted into phonemic transcripts, a kind of rough transcript, and we calculate the DTW distance between these two. That corresponds to a rough calculation of the IPA-based reference distance.
0:18:46 For the DTW between two transcripts, we have to prepare an inter-phone distance matrix covering all the IPA phones that may be found in the archive. The number of IPA phones is very large, more than three hundred, but we found that 153 IPA symbols can cover 95 percent of all the phone instances in the Speech Accent Archive.
0:19:10 We asked a phonetician to produce each of these symbols twenty times, and we built speaker-dependent phone HMMs, not phoneme HMMs. Then we calculated the Bhattacharyya distance between every pair of IPA phones.
0:19:26 So we prepared a phone-based distance matrix, and using that we calculate the transcript-to-transcript distance.
0:19:35 For this calculation we had to select speakers from the archive: only a part of the speakers was usable for this task, because many speakers inserted or deleted some words, for example "well", "wall", "were". It is a kind of non-nativeness, but we excluded those speakers, so the number of speakers was drastically reduced: the number of original speakers is more than 1,800, but the effective number of speakers is only about 370. Still, the number of speaker pairs is very large.
0:20:17 Using these reference distances, we ran the test. What kind of features and regression model did we use? First we built a UBM HMM corresponding to the Stella paragraph: using the whole Speech Accent Archive speech, we built a phoneme-HMM concatenation as the UBM.
0:20:41 For each utterance we did MAP adaptation to obtain a speaker-dependent, paragraph-based HMM, and then the structure calculation is done. Since the paragraph contains 221 phoneme instances, by referring to the CMU dictionary, a 221-by-221 distance matrix is obtained for each speaker. This is the kind of pronunciation skeleton, the accent skeleton, we wanted.
0:21:05 What we want to predict is the accent distance between two speakers, so the input features to the SVR should be differential features between speakers S and T. Here we used the difference of the two structure matrices, just a subtraction.
0:21:28 In previous works we took the squared sum of these differences, that is, the Euclidean distance, but in this study we keep each element separate, and these elements are used as input features to the SVR. The number of elements, the input dimensionality, is quite huge: about 24,000.
0:21:48 So one high-dimensional vector can represent accent characteristics. I think this is similar to a GMM supervector, where one high-dimensional vector can represent speaker characteristics. This is useful as input features for the SVR, and a linear kernel is used.
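The regression setup can be sketched as follows. Everything below is synthetic: random "structures" over 10 phones instead of 221, and a target that is linear in the features purely so the toy model has something to learn. Only the feature construction, the element-wise difference of two structure matrices (upper triangle), mirrors the talk:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_spk, n_phones = 20, 10
iu = np.triu_indices(n_phones, k=1)

def pair_feature(S_s, S_t):
    """One element per phone pair: the difference of the two structures."""
    return (S_s - S_t)[iu]

# Random symmetric structure matrices standing in for real speakers.
structures = []
for _ in range(n_spk):
    A = np.abs(rng.normal(size=(n_phones, n_phones)))
    structures.append((A + A.T) / 2.0)

pairs = [(i, j) for i in range(n_spk) for j in range(i + 1, n_spk)]
X = np.array([pair_feature(structures[i], structures[j]) for i, j in pairs])
w = rng.normal(size=X.shape[1])
y = X @ w            # synthetic stand-in for the IPA-based reference distance

model = SVR(kernel="linear").fit(X, y)
pred = model.predict(X)
print(round(np.corrcoef(pred, y)[0, 1], 2))
```

In the talk, the feature vector has about 24,000 dimensions (221 x 220 / 2 phone pairs) and the target is the DTW distance between the speakers' IPA transcripts.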
0:22:10 And one more thing: the comparison with the transcript-to-transcript distance. Two kinds of phoneme-based transcripts are used. One is the oracle transcripts. The other is the transcripts generated from a phoneme recognizer, whose accuracy is about 73.5 percent. The DTW distance between two phoneme transcripts then gives a rough answer to the question of the accent distance.
0:22:46 OK, so the results: two conditions and the results.
0:22:52 We did prediction experiments with two conditions. One is the speaker-pair open mode, and the other is the speaker open mode.
0:23:05 What we want to do is prediction of the accent distance between two speakers, so the unit of input to the SVR is a speaker pair. In the speaker-pair open mode, not a single speaker pair is found simultaneously in training and testing. In the speaker open mode, not a single speaker is found simultaneously in training and testing. So, two modes.
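The two evaluation conditions can be sketched as split rules over speaker pairs; the six-speaker setup is made up:

```python
from itertools import combinations

speakers = list(range(6))
pairs = list(combinations(speakers, 2))

# Speaker-pair open: hold out whole pairs; the same *speaker* may still
# appear in other training pairs.
test_pairs = pairs[:3]
train_pairs = [p for p in pairs if p not in test_pairs]

# Speaker open: hold out whole speakers; no held-out speaker may appear
# in any training pair.
held_out = {4, 5}
test_pairs_spk = [p for p in pairs if set(p) <= held_out]
train_pairs_spk = [p for p in pairs if not (set(p) & held_out)]

print(len(pairs), len(train_pairs), len(test_pairs_spk), len(train_pairs_spk))
# -> 15 12 1 6
```

Speaker open is the harder condition: the regressor has never seen any structure from the test speakers, not even inside other training pairs.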
0:23:43 Here are the results of the accent distance prediction. We did cross-validation, and the performance metric is the correlation to the IPA-based reference distance.
0:23:53 In the speaker-pair open mode the correlation is very high; this is the scatter plot of the IPA-based reference distances against the predicted distances. But in the speaker open mode the correlation is not so high; it is quite low.
0:24:09 The oracle transcription gives a phoneme-based, rough estimation of the accent distance. You can find that the speaker-pair open prediction is higher than the oracle transcription. The speaker open prediction is lower than the oracle transcription, but still higher than using the transcription generated from the ASR.
0:24:30 Why is the speaker open mode so low? If we consider the mechanism of the cross-validation, we can say that the magnitude of the task difficulty can be roughly estimated as O(N) in the speaker-pair open mode but O(N squared) in the speaker open mode, where N is the number of speakers available. In the speaker-pair open mode the task is close to simple averaging over known speakers, while the speaker open mode is a much more complicated version of the task.
0:25:06 OK, let me conclude this work.
0:25:12 The ultimate goal of the study is to create a really global, individual-basis map of World Englishes. For that, we have to develop a technique to estimate the accent distance between any pair of speakers.
0:25:29 We used the Speech Accent Archive, and invariant speech structure analysis was used as the speech analysis method. Experiments showed that a high correlation was found in the speaker-pair open mode, but the speaker open mode was not satisfactory.
0:25:46 Future directions: I think the structure vector plus SVR is somewhat similar to the GMM supervector, a high-dimensional vector that can characterize speaker identity, plus SVM. But these days lots of researchers use i-vectors, so i-vector-based features might be used for this task.
0:26:05 Also, some feature engineering is still needed, I think, and more machine learning techniques should be used. We are also interested in a more extensive collection of data using crowdsourcing, with the paragraph read by all the speakers of the world.
0:26:20 That's all.
0:26:26 Question: A quick question about the correlations you are getting. Were they on an open speaker set? Which speakers did you use?
0:26:37 Answer: I used all the speakers available, so European speakers, Asian speakers, and African speakers, but I selected the speakers who read the paragraph without inserting or deleting words.
0:26:54 Question: My question is: this is still based on a perfectly read paragraph, right? A large number of studies have shown that when you are working on accent, if you compare reading prepared text with spontaneous or conversational speech, you get much more accent variation in conversational, non-read speech. Could you comment on what you think would happen with spontaneous speech for each speaker?
0:27:25 Answer: Before coming here, I visited Helsinki; yesterday I skipped the first half of this conference. There is a research team there collecting spontaneous, natural non-native English, and some other research groups are also collecting spontaneous speech data of non-native speakers. That kind of messy data will surely let us analyze unexpected things.
0:27:51 This database is a very artificial, controlled dataset. But what is possible with spontaneous speech, and what is possible with controlled data? I think some things are possible with controlled data, and other things become possible with spontaneous data.
0:28:14 My proposal to those researchers is to collect controlled data and spontaneous data at the same time. For example, the Stella paragraph, "Please call Stella ...", is collected from the speakers, and then you also collect spontaneous data from those same speakers. Then accent clustering is done with the controlled data, and the clustering result can be used to explain what is happening in the non-native conversations.
0:28:51 I think what is important is to collect both kinds of data, controlled and spontaneous.
0:28:59 I know that some researchers claim that the Speech Accent Archive is not really non-native data, that it is just an artificial collection of data, but I think that, from a technical point of view, that kind of dataset is very useful.