0:00:15 Thank you. It's a pleasure, and an honor, to have been invited to speak here at this meeting.
0:00:27 The talk I'm going to give is about a very classic question in speech communication: understanding variability and invariance in speech. People have been asking about this for a long time. My specific focus will be on the remarkable vocal instrument we each have for producing speech.
0:00:52 Here are six different people; I'm showing midsagittal slices of their vocal tracts. You can see immediately that each one has a very uniquely shaped vocal instrument with which they produce speech, and it's the speech signal produced by this instrument that we use when we do speaker recognition. Let me orient you, in case you're not used to looking into the mouth this way: here are the lips, the nose, the tongue, and the velum, the soft palate. Keep this layout in mind, because you'll see a lot of these pictures in my talk today.
0:01:30 Here are more people, all of them trying to produce the same well-known vowel. Even a quick look shows that, although these people are producing the same sound, they do so slightly differently. In another example, a first and a second speaker make the gesture for the same vowel, but again in slightly different ways.
0:01:57 So we know that both the structure within which speech production happens and the way we produce speech vary across people, and some of that is reflected in the speech signal. That's what we're trying to get at. The aim of this line of work is to ask what speech science can contribute to understanding and supporting the development of speech technologies: not only do we want to recognize speakers, we want to know what makes them different.
0:02:29 Specifically, my focus today is on vocal tract structure, the physical instrument we're given; on function, the behavior we use within it to produce speech; and on the interplay between the two. By structure I mean the physical characteristics of the vocal tract apparatus: the hard palate geometry, the tongue volume, the length of the vocal tract, the velum, the nasal passages. Function typically refers to the temporal characteristics of speech articulation: how we dynamically form and release, for example, the constrictions in the vocal tract, narrowing the airway channel to create the turbulence for a sound like /s/.
0:03:15 This leads to the very specific questions we ask. How are individual vocal tract differences, like the ones in these pictures, reflected in the speech acoustics? Can the inverse problem be solved: can they be predicted from the acoustics? How do people adapt to their structural differences to create phonetic equivalence, given that we all communicate using a shared code and language? And, as has often been pointed out, what contributes to distinguishing speakers from one another from the speech signal? I want to emphasize that we are not only trying to differentiate individuals from their speech signal, but to understand what makes them different, in structure and in behavior.
0:03:55 So let me step through some of this, one piece at a time. We'll see how we can quantify individual variability in vocal tract morphology; whether we can predict some of it from the signal, and what the bounds on that are; how individual articulatory strategies differ; and whether we can exploit all of this for applications like automatic speaker recognition, offering some interpretation while doing so.
0:04:25 Here is the approach we take in our laboratory. One of my research groups is the SPAN group, the speech production and articulation knowledge group, which looks at a lot of different questions, including questions of variability. We take a multimodal approach, using different ways of getting at speech production: MRI, which I'll talk about a lot today, audio, and other measurement technologies, together with a whole lot of multimodal signal processing, image processing, speech processing, and modeling built on top of that. We try to use these engineering advances to gain insights about the dynamics of production and speaker variability, and about questions of speaking style, prosody, and emotion.
0:05:10 The rest of the talk is structured as follows. In the first part I'll focus on how we can measure speech production, how we get those images, with a particular focus on MRI, magnetic resonance imaging, something we've been working hard to develop. Then, given the datasets, how do we analyze them? And I'll end with some modeling questions.
0:05:35 So, how do we do vocal tract imaging? It has been central to speech science for a long time: the quest to observe and measure articulatory details along the tract. There are a number of techniques, each with its own strengths and limitations. For example, there are the early X-ray films, like those made by pioneers such as Fant and Stevens; X-ray has pretty good temporal resolution, but it's not safe for people, so it's no longer a usable methodology. Then there are other techniques, like ultrasound, which gives you only a partial view of the inside and is not necessarily helpful for the kinds of modeling I'll describe, and devices like the artificial palate, which I'll show in a picture.
0:06:23 Here, actually, is an X-ray, in fact of Ken Stevens. Here are ultrasound results: you only see the tongue surface, and only parts of it, the edges. This is the artificial palate I mentioned: people speak with it in place, and it has contact electrodes, so the contact made by the tongue against the palate gives you some insight into timing and coordination in speech. And finally, electromagnetic articulography: we glue little rice-crispy-like sensors onto a person's tongue and measure their dynamics.
0:07:12 New possibilities were created with advances in MRI, which provides very good soft tissue contrast. What it relies on is the water content of tissue, the proton density, which varies across soft tissues; we make use of that by exciting the protons and imaging the signals they generate as they relax. It's very exciting because it gives you very good quality images, but traditional MRI is very slow. It also has other issues: the scanner is very loud, and you have to lie down inside it to produce speech, so experiments are a little constrained. These are the things we've been contending with over the years.
0:08:01 The very first advance, around 2004, was real-time imaging: getting to speeds, that is, sampling rates, higher than the rates of speech articulation itself. Let me show you a sample session.
[audio/video demo: real-time MRI of a speaker reading the rainbow passage]
0:08:41 If you're familiar with the rainbow passage, you'll recognize it there. It was very exciting for us to actually be able to do this. We were also making acoustic recordings inside the scanner, synchronized with the imaging; a lot of our speech enhancement work for MRI audio came out of that, and it opened up a lot of different possibilities.
0:09:00 The rates we had there were in principle adequate for a wide range of signals, but we have been trying to see whether we can do even better. When you look at the rates actually required, speech is not one homogeneous thing: we use a lot of different articulators for different tasks, from slower movements all the way to very fast sounds like trills, and these have different characteristic rates. If we could capture those kinds of rates, that would be really useful.
0:09:33 In fact, last year we were able to make a breakthrough and get up to around one hundred frames per second of real-time MRI, work led by one of my postdocs. Not only can we now capture very fast articulation, you can really see the tongue tip when it moves quickly, but we can also image multiple planes simultaneously: a midsagittal slice like this, an axial slice like that, or a coronal slice like this. So we can get simultaneous views of the vocal tract. It's really exciting to be able to image at these high rates and gain new insights.
0:10:16 This was made possible by both hardware and algorithmic advances. We developed a custom receiver coil designed for the vocal tract, and we made a lot of progress in both pulse sequence design and constrained reconstruction using compressed sensing, building on things that have been happening in signal processing. So we were able to really speed this up, and we're quite excited about it. This is how you run an experiment: someone is sitting there doing the audio collection; we reprogrammed the scanner so that the audio is synchronized with the imaging; and we have an interactive control system to select the scan planes and so on.
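To give a flavor of the compressed-sensing reconstruction mentioned above, here is a minimal, self-contained sketch of the core idea in one dimension: recover a sparse signal from randomly undersampled Fourier samples by iterative soft-thresholding. This is a didactic toy, not the actual real-time MRI pipeline; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x_true = np.zeros(n)
x_true[rng.choice(n, 8, replace=False)] = rng.normal(0, 1, 8)  # sparse "image"

mask = rng.random(n) < 0.3                 # keep ~30% of k-space samples
y = np.fft.fft(x_true)[mask] / np.sqrt(n)  # undersampled measurements

def A(x):        # forward model: undersampled, orthonormal FFT
    return np.fft.fft(x)[mask] / np.sqrt(n)

def At(y_):      # adjoint: zero-fill missing samples, inverse FFT
    full = np.zeros(n, dtype=complex)
    full[mask] = y_
    return np.fft.ifft(full).real * np.sqrt(n)

x = np.zeros(n)
step, lam = 1.0, 0.05
for _ in range(200):                       # ISTA iterations
    grad = At(A(x) - y)                    # gradient of the data-fit term
    z = x - step * grad
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold

print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

The same ingredients, a measurement operator, an adjoint, and a sparsity-promoting penalty, carry over to real 2-D dynamic MRI, where the sparsity is typically enforced in a temporal transform domain.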
[audio/video demo: multi-plane real-time MRI with synchronized audio; the speaker repeats short utterances while scan planes are selected interactively]
0:11:24 You get the idea. You can really see things; it doesn't look as good on the projector as on the actual display, but the point is that we are now collecting production data at scales that are conducive to the kinds of machine learning approaches one could apply, although that's not what I'll be talking about today. In addition to single-plane or multi-plane slice imaging, we are also very interested in volumetric imaging. Especially if you're interested in characterizing speakers, which is one of the topics of interest here, you really want to capture the full 3D geometry while people are speaking.
0:12:03 We've made some advances there too: with about seven seconds of a sustained sound we can do a full sweep of the entire vocal tract, so we can get exemplary 3D geometries of people's vocal tracts. In addition, we can do very high resolution imaging of the static anatomical structures, the classic T2-weighted MRI. I'll show you why we do all of these things: we want a comprehensive characterization of speakers, of both the vocal instrument and the behavior.
0:12:43 I should also mention that we have recently been releasing a lot of these data. For example, we released real-time MRI data from a set of speakers, each producing hundreds of sentences, with audio alignments, image features, and so on, all available for free download.
0:13:04 Here are some examples of that kind of data. [audio/video demo: sample sentences from the released real-time MRI corpus]
0:13:20 It has five male and five female speakers; you may even know some of them. We also provide alignments, basically co-registration of the audio and the images, and some of our algorithms for that have been released as well. So we have this kind of data to work with. Now, what do we do with all this?
0:13:45 Let me introduce some analysis preliminaries; there is a lot of image processing involved. The very first task is getting at the structural details of the human vocal apparatus. For people interested in anatomy and morphometrics, this offers a way of measuring everything: the length of the palate, the cavity sizes, and so on. We do that very carefully with the high-resolution imaging. On top of that we also want to track the articulators, since the articulators serve important, specific tasks, and we want to be able to process these things automatically.
0:14:26 The methodology we proposed is a model-based segmentation with a very nice mathematical formulation, work done by one of my students, and it yields a segmentation algorithm that works fairly well. [video demo: automatic contour segmentation of real-time MRI] With this we can capture the variations automatically from these vast amounts of data; one way to think about it is as a kind of feature extraction.
0:15:04 On top of that we can derive events that are linguistically more meaningful. One of my closest collaborators, Louis Goldstein, one of the founders of Articulatory Phonology, conceptualizes speech production as a dynamical system: the various articulators are involved in tasks, basically creating, forming, and releasing constrictions as we speak. So we are interested in features like lip aperture and constriction degree and location, which we are able to extract automatically.
0:15:50 So we go from images to segmentations and then try to extract these kinds of linguistically meaningful features. We can also derive other representations: for example, we can run PCA on these contours to look at the contributions of the different articulators. These, then, are some ways of objectively characterizing this production information, which is speaker specific; a sketch of both steps follows.
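As a concrete illustration, here is a minimal sketch of the two steps just described: frame-wise constriction features computed from tracked contours, and PCA over the contour coordinates. The array shapes, file name, and units are hypothetical stand-ins for whatever a real tracking pipeline would produce.

```python
import numpy as np
from sklearn.decomposition import PCA

def lip_aperture(upper_lip, lower_lip):
    """Smallest distance between two (N, 2) lip contours, e.g. in mm."""
    d = np.linalg.norm(upper_lip[:, None, :] - lower_lip[None, :, :], axis=-1)
    return d.min()

def constriction_degree_location(tongue, palate):
    """Minimum tongue-palate distance and the palate point where it occurs."""
    d = np.linalg.norm(tongue[:, None, :] - palate[None, :, :], axis=-1)
    i, j = np.unravel_index(d.argmin(), d.shape)
    return d[i, j], palate[j]

# PCA over flattened contours, one row per video frame, to find the main
# articulator factors (e.g., jaw opening, tongue-body movement).
# `contours` is a hypothetical (n_frames, n_points, 2) array of tracked points.
contours = np.load("contours.npy")
X = contours.reshape(len(contours), -1)
pca = PCA(n_components=10).fit(X)
print("cumulative variance explained:", pca.explained_variance_ratio_.cumsum())
scores = pca.transform(X)                    # low-dimensional factor time series
factor1 = pca.components_[0].reshape(-1, 2)  # deformation pattern of factor 1
```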
0:16:30 So far I've told you how we get the data and shown some of the basic analysis; with that in hand, we can now start looking at speaker-specific properties.
0:16:45 As I mentioned earlier, the first analysis is anatomical: how to characterize every single vocal instrument. This task is helped by the anatomy literature, so we went through that literature, compiled a whole set of anatomical landmarks relevant to speech, and came up with measures we can extract, like vocal tract length and the oral and pharyngeal cavity lengths, which we can measure from these high-contrast images. That's one source of speaker-specific information.
0:17:27 As an aside: since we have many repetitions of the same tokens by these people, recorded at different sessions, we were interested in how consistent people are. It was reassuring to find that people are fairly consistent in how they produce these tokens; the measurements were very consistent across sessions, as shown, for example, by these correlations of the measurements. This is something we presented at Interspeech.
0:18:00them which we are not be produce speech behavior we wanna know
0:18:05how much of it is dictated by the environment we have waters that strategies that
0:18:09are adopted by speakers of a unique to them due to various reasons which we
0:18:13can't really pinpoint but it is you know
0:18:15learning that they have done or the environment follows so more c can be sort
0:18:21of start deconstructing this little bit
0:18:25 Next, let me walk through a few examples along this direction. In this picture I want you to focus on the palatal variation. The hard palate is the bony, ridged structure at the roof of the mouth, an important part of the vocal apparatus. Let me point with the mouse here: we see that this person's palatal vault is very deep and domed; here it sits more posterior; here more anterior; here it is sharper. And this is just six different people.
0:19:19 Now, how do we begin to quantify what we're seeing qualitatively? What one of my students did was take these extracted palate shapes from the images and run even a simple PCA analysis, and he showed that most of the variance could be explained by the first few factors. These factors were akin to interpretable properties: how concave the palatal vault is, how front or back that concavity sits, and how sharp it is. So the factors have nice interpretations, yet they are derived completely objectively, and we can quantify and cluster people along these low-dimensional latent variables.
0:20:14 We can then plug these kinds of parameters into models, for example acoustic simulations, to see what the acoustic consequences of these variations are. One of the things we found is that the degree and the front-back position of the palatal concavity affected fricative acoustics very much, while the sharpness really didn't matter, at least in these first-order simulations. So going from data to morphological characteristics, we can actually predict what acoustic consequences to expect. In fact, we can put this into an articulatory synthesizer and hear the differences, for example for a fricative like /s/. [audio demo: synthesized fricatives for different palate shapes] You can hear how things vary across the different vault shapes. So we can do this kind of analysis very carefully.
0:21:18 Of course, we are also interested in the inverse problem: can we estimate these shapes from the acoustic signal alone? How much of the palate shape detail is available to us in the acoustics? We did the classic thing: extract all kinds of features from the acoustic signal and try to classify. The idea is that the palate shapes the airway as we speak, so the signal is influenced both by the environment, the structure, and by the movements, the behavior; both leave their mark. We showed in a very simple first experiment that we can detect the palate shape class, concave versus flat, about sixty-some percent of the time from the acoustic signal alone. So the information is indeed available.
0:22:15 A more interesting question concerns a very classic morphological parameter that the community has used a lot: vocal tract length. This has been important in speech recognition and elsewhere, both to normalize for and to estimate, for example when doing age recognition. Here the same question arises again: some of this speaker-specific anatomy is reflected in the signal, and we want to see how well we can pin it down for a given speaker. Speakers compensate, to some extent, for the vocal tract they have, and we want to know how much residual remains that we can actually infer. I start with vocal tract length because it's a classic question people have been asking. For example, here is developmental data, from Vorperian and colleagues in 2009, showing how vocal tract length grows with age, from roughly six centimeters in infancy to seventeen and a half or eighteen centimeters in adulthood.
0:23:29 There is also a differentiation between males and females that emerges empirically, and correspondingly the effect shows up in the formant space of the spectrum. Zeroing in on the first two formants of the vowel space, we can see that in going between shorter and longer vocal tracts, the vowel space gets compressed and shifted, and these systematic changes happen. What people have been doing, implicitly or explicitly, when we do VTLN is basically to normalize for this effect.
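For readers less familiar with VTLN, here is a minimal sketch of one common variant of the idea: a piecewise-linear warp of the frequency axis by a speaker-specific factor alpha, typically chosen by a likelihood search over roughly 0.88 to 1.12 before the filterbank is applied. The knee placement and parameter values below are illustrative simplifications.

```python
import numpy as np

def vtln_warp(f, alpha, f_nyq=8000.0, f_cut=0.85):
    """Warp frequencies: slope alpha below f_cut*f_nyq, then a linear
    segment chosen so that [0, f_nyq] maps exactly onto itself."""
    f = np.asarray(f, dtype=float)
    knee = f_cut * f_nyq
    return np.where(
        f <= knee,
        alpha * f,
        alpha * knee + (f_nyq - alpha * knee) * (f - knee) / (f_nyq - knee),
    )

freqs = np.linspace(0, 8000, 5)
print(vtln_warp(freqs, alpha=0.92))  # shorter-tract speaker: compress the axis
```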
0:24:12 The classic estimation of vocal tract length goes back to a very simple resonator model: treating the tract at rest as a uniform tube, we can estimate its length from the formants. What we proposed is that from the formant pattern you can estimate the length parameter directly. One early piece of work, by Wakita, used linear prediction; another well-known rule relies on the third and fourth formants, and other people have proposed variants. What we decided was this: since we now have direct evidence of both vocal tract length, from the imaging, and the corresponding acoustics, can we come up with better regression models? And sure enough, we showed that we can get really good estimates, with very high correlations to the measured vocal tract lengths. This is very interesting: we can fit the regression model, then turn it around and estimate vocal tract length as yet another interpretable morphological detail of a person, obtained from the acoustics. That's kind of exciting.
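As a concrete anchor for the tube-model reasoning, here is a minimal sketch: the uniform-tube formula gives one length estimate per formant, and averaging them yields a crude acoustic estimate, which a regression model fitted against MRI-measured lengths would refine. The numeric values are illustrative.

```python
import numpy as np

C = 35000.0  # speed of sound in warm, humid air, cm/s

def vtl_uniform_tube(formants_hz):
    """For a uniform tube closed at the glottis and open at the lips,
    F_n = (2n - 1) * c / (4 * L); invert and average over the measured
    formants (ordered F1, F2, ...) to estimate the length L in cm."""
    f = np.asarray(formants_hz, dtype=float)
    n = np.arange(1, len(f) + 1)
    return np.mean((2 * n - 1) * C / (4 * f))

# Illustrative neutral-vowel formants for an adult male-like talker:
print("VTL estimate (cm):", vtl_uniform_tube([500, 1500, 2500, 3500]))
```

Higher formants are usually preferred for this, since they are less dependent on the particular vowel being produced.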
0:25:27 One last point summarizing what I just said: the combination of direct measurement, availability of data, and good statistical methods allows us to get much better insights here.
0:25:42 Now, moving on, let's look at the tongue. The vocal tract is a kind of confined space, bounded by the hard palate and the other walls, and the tongue, which is a pretty remarkable organ, plays a big role in how we shape that space. So here is the question we ask. Take vocal tract length and formant frequencies, the same kind of chart I was showing before. We normalize using linear normalization, which is what is typically done, but we still have residual differences that are unexplained. People have proposed nonlinear vocal tract normalization and various multi-parameter warpings, but the question remains: what is behind the residual effect? Does it say something about the size of the tongue that people have, and can it be automatically accounted for? So the hypothesis on this slide is that relative tongue size differs across people and will explain some of the vowel space differences.
0:26:53 So the questions we have are these. How does one define and measure tongue size? How does tongue size vary across the population? What is the effect of tongue size on articulation? And is it visible in the acoustics; can it be predicted and normalized? There is very little published work on this kind of thing.
0:27:23 People know that there is a coordinated, global growth of the tongue and the vocal tract as we develop. There are some disorders, for example Down syndrome, where one usually sees unusually large tongue sizes. What effects does a large tongue have on how speech is produced? Clinical reports mention things like lisp-like productions, using the tongue in producing what should be bilabial sounds like /p/ and /b/, other articulation changes, and a slowing of speech rate, because there is a larger mass to control. But these effects are mentioned rather than quantified.
0:28:22 So we set out to say: well, we have lots of data; we can estimate a mean posture for each speaker, take the segmentation, and come up with a proxy measure for tongue size. Once we do that, we can plot the distributions of tongue size across the male and female speakers in our corpus. What we see, with the female speakers in green and the male speakers in the other color and the averages marked, is a significant difference in tongue size between the sexes. So this is yet another interpretable marker we can try to get at from the acoustic signal.
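Here is a minimal sketch of how such a proxy measure could be computed from a midsagittal segmentation; the label id, pixel spacing, and toy image are assumptions standing in for whatever the real segmentation pipeline emits.

```python
import numpy as np

TONGUE = 3        # hypothetical label id for tongue tissue
PIXEL_MM = 2.0    # assumed in-plane resolution, mm per pixel

def tongue_area_mm2(seg):
    """Midsagittal tongue area: count tongue-labelled pixels, scale to mm^2."""
    return np.count_nonzero(seg == TONGUE) * PIXEL_MM ** 2

seg = np.zeros((84, 84), dtype=int)
seg[30:60, 20:55] = TONGUE    # toy blob standing in for a segmented tongue
print("tongue area (mm^2):", tongue_area_mm2(seg))
```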
0:29:09 How well the tongue fills the vocal tract, how tongue size relates to the structure it sits in, is still not well established; it remains an open question how exactly one should assess it. But we have taken a shot: we tried different kinds of normalization, looking both at the rest posture and during movement, and there is not much difference between them; they are pretty highly correlated.
0:29:41 Once we have that, we can use this information in simulations, for example in an articulatory model; people still study speech production with the classic models the field inherited. We can reflect our measurements back into such a model and study things through analysis by synthesis: with a larger tongue, you would expect longer constrictions, and so on. So what we did was vary, based on the measurements we had, the constriction degrees and locations, to see whether tongue size differences play a role in the acoustics, say for a vowel. What we observed is that the tongue size differences in the population and what was predicted by the simulations were very well correlated in their acoustic effect: in the simulations, the formants moved in the same way as in the data.
0:30:43 So the general trends hold. All in all, we saw that tongue size varies across speakers quite a bit, up to around thirty percent in our data. A consequence of a larger tongue is longer constrictions made in the vocal tract as speech is produced, and constrictions are central to how we produce the various speech sounds. The data suggest this stretches and twists the vowel space, and that shows up in the signal. But the interplay between constrictions, formants, and tongue size is fairly complex and requires much more sophisticated modeling; hopefully, with data like these, it can be pursued.
0:31:33 The final topic on this theme of speaker-specific behavior is articulatory strategies: how talkers move their vocal tracts. As you know, the vocal tract is a pretty clever system, a redundant system with motor equivalence: you can use different articulators to complete the same task. For example, you can move the jaw and the lips, which both contribute to bilabial constrictions, like making /b/ or /p/, with more jaw or with more lip; people have several ways of changing the airway shape to accomplish this. We call these articulatory strategies, and some of them are speaker specific, some language specific. We want to get at them because this is yet another piece of the puzzle as we try to understand what makes me different from you when we produce the speech signal, beyond merely detecting that I'm different from you from the speech.
0:32:33 Here is the approach; again, this is very early work. We have lots of real-time MRI data; by now the database we've collected includes a pilot study of eighteen speakers, with the volumetric data and everything else, in very detailed form. From it we can characterize both the morphology and the speaking behavior. Once we have that, we estimate what we call speaker-specific forward maps, from the vocal tract shapes to the constrictions they produce. Think of the shape changes that achieve a task, in the sense of a dynamical system: we estimate each speaker's forward map, plug it back into a synthesis model, a dynamical-systems control model of the kind used in task dynamics, and examine the contributions of the various articulators each person uses, to quantify the strategies people adopt; a sketch of the task-level dynamics follows.
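Here is a minimal sketch of the task-level dynamics just described, assuming the standard critically damped second-order point-attractor form used in task dynamics. The full model additionally maps task velocities into the individual articulators (jaw, lips, tongue) through a weighted Jacobian pseudoinverse, which this sketch omits; the parameter values are illustrative.

```python
import numpy as np

def simulate_gesture(z_init, z_target, k=200.0, dt=0.001, steps=600):
    """Drive a constriction variable z (e.g., lip aperture) toward a target
    with m*z'' + b*z' + k*(z - z_target) = 0 and b = 2*sqrt(m*k), i.e.
    critical damping: the gesture approaches its target without overshoot."""
    m = 1.0
    b = 2.0 * np.sqrt(m * k)
    z, zdot = z_init, 0.0
    traj = []
    for _ in range(steps):
        zddot = (-b * zdot - k * (z - z_target)) / m
        zdot += dt * zddot          # simple Euler integration
        z += dt * zdot
        traj.append(z)
    return np.array(traj)

traj = simulate_gesture(z_init=12.0, z_target=0.5)  # lips closing, mm
print("final aperture (mm):", round(traj[-1], 3))
```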
0:33:37 Again, as a reminder of the pipeline: from the data we extract air-tissue contours, run PCA to extract factors, basically how much the jaw contributes, what the tongue factors are, and so on, and from those we estimate the various constrictions at the places of articulation. Along the vocal tract we defined roughly six anatomically meaningful regions, such as the lips, the alveolar ridge, the palate, the velum, and the pharynx, and we can automatically estimate the degree of constriction people make at each.
0:34:18 So we have some insights from the eighteen speakers we analyzed, work by my student Tanner Sorensen, presented at Interspeech, in which we used a model-based approach. We approximated the speaker-specific forward map from each speaker's real-time MRI data, then simulated with a task-dynamics model from motor control, where the dynamical systems are basically control systems written in state-space form.
0:34:54 Then we were able to interpret the results. One of the results here basically represents the ratio of lip versus jaw contribution used by speakers to create constrictions at various places, bilabial, alveolar, palatal, as we look along the vocal tract. You see that there are different ratios of how much people use each articulator: a value near one means relying more on the lips, near zero means using more jaw. Different constrictions are created in different ways, and the speakers in our set varied considerably in how they created the same kinds of constrictions: people really do differ in their strategies. This is a very early insight, how much a speaker uses the jaw versus the lips, whether that is a function of their specific morphology or of their motor planning, and these are questions that are begging for a more computational approach. Now, with the data in hand, we can go and see how people actually use their vocal instrument in producing the sounds we call speech.
0:36:29 The final part brings us to the kinds of slides we've been seeing at this conference: we also explored a little whether production information can be of use in speaker recognition experiments. We did a small piece of work on speaker verification with production data. There is not much data, so this is not a state-of-the-art system of the kind common here; the question was simply whether speech production data would be of any use at all for speaker verification.
0:37:05 We know we won't be getting data like what I've been showing, X-ray or MRI, in operational conditions. So we need some articulatory-type representation obtained another way, which is why people have been working on the inversion problem: given the acoustics, can we estimate the articulatory parameters? It's a classic, mathematically ill-posed problem, and one where deep learning approaches are proving very powerful, because the mapping is highly nonlinear and these methods are well suited to learning it.
0:37:41 What we wanted, though, was a speaker-independent mapping. In work done with a former student a few years ago, the idea was this: if I can learn the acoustic-to-articulatory mapping well for one exemplary talker, for whom I have lots of parallel data, the way one does for synthesis, then I can project anyone else's acoustics through that talker's inverse map, to see how this talker would have produced those acoustics. That gives some semblance of an articulatory representation, so that we can compute speaker-independent articulatory-style features. So we use a reference speaker to build the articulatory-acoustic map and its inverse model; when a test speaker's acoustic signal comes in, we derive these inverted, pseudo-articulatory features and check whether there is any benefit. The rationale is that the inversion produces projections in a robust, constrained way: it provides physically meaningful constraints on how we partition the signal, so there might be some advantage to be had. This was published earlier this year in Computer Speech and Language.
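A minimal sketch of the reference-speaker inversion idea, with hypothetical file names standing in for parallel acoustic and articulatory data from one well-instrumented talker:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Train a nonlinear regressor from acoustic frames to articulator positions
# on ONE reference talker, then apply it to any new speaker's acoustics to
# obtain pseudo-articulatory features. The arrays are placeholders for, e.g.,
# MFCC frames and pellet trajectories from an articulography corpus.
mfcc_ref = np.load("ref_mfcc.npy")   # (n_frames, n_ceps), reference speaker
art_ref = np.load("ref_art.npy")     # (n_frames, n_articulator_dims)

inverter = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200)
inverter.fit(mfcc_ref, art_ref)      # acoustic-to-articulatory map

# Project a NEW speaker's acoustics through the reference map:
mfcc_test = np.load("test_mfcc.npy")
pseudo_articulation = inverter.predict(mfcc_test)  # features for the back end
```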
0:39:15 As the front end for these first experiments we used the X-ray microbeam database, which is publicly available and has a fair number of speakers, and a standard GMM-based back end, because we don't have that much data. Here are some of the initial results. Using MFCCs only, on this small and pretty noisy dataset, we get an equal error rate of about 7.5 percent. If you add the real, measured articulatory data, you get a boost: it provides nice complementary information, which is encouraging. You might think of that as an oracle experiment, an upper bound for when you have the measured articulation.
0:40:06 Now, if you instead use the inverted, estimated articulatory features, they compare really well, in fact slightly better, and putting the two together provides an additional boost that is pretty significant. The encouraging part is that to create these maps we need detailed data from just one exemplary speaker, and the approach then provides an additional source of information for everyone else. Perhaps it will give us some gains, but maybe also some insight into why people are different, into which categories of articulation or structure differ across speakers.
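Sketching the verification experiment in its simplest form: per-speaker GMMs over each feature stream, with score-level fusion at test time. A real system would use a universal background model with MAP adaptation and calibrated fusion weights; this bare-bones version only illustrates the structure.

```python
from sklearn.mixture import GaussianMixture

def train_models(mfcc_frames, art_frames, n_mix=32):
    """Fit one diagonal-covariance GMM per feature stream for a speaker."""
    gm_acoustic = GaussianMixture(n_mix, covariance_type="diag").fit(mfcc_frames)
    gm_artic = GaussianMixture(n_mix, covariance_type="diag").fit(art_frames)
    return gm_acoustic, gm_artic

def verify(gm_acoustic, gm_artic, test_mfcc, test_art, w=0.5):
    """Fuse the two mean per-frame log-likelihoods; compare to a threshold."""
    score_a = gm_acoustic.score(test_mfcc)
    score_p = gm_artic.score(test_art)
    return w * score_a + (1 - w) * score_p
```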
0:40:47 This is just the standard experimental setup, using the same subset of the X-ray microbeam database throughout.
0:40:59 To summarize the speaker recognition experiments: using both acoustic and articulatory information helps. There are significant gains if you use measured articulatory information alongside standard acoustic features, and more modest gains if you use estimated articulatory information. What would be nice is to look at new ways of doing this fusion, with the kinds of advances happening right now in neural networks and with the growing availability of data, to do this better, and to evaluate on larger acoustic datasets from SRE-like campaigns.
0:41:45 Moving toward the close: we are very excited about where this is going. Some of the early work here was done with my collaborators at Lincoln Laboratory, along with parallel work by colleagues there, and after the initial pilot studies we recently received an NSF grant to pursue exactly this line of work connecting speech science and speaker modeling. Our plan is to do this very systematically: to collect data from about two hundred subjects, all of it, real-time, volumetric, and detailed anatomical imaging, and to share it with the community. We describe this in an upcoming paper, and this is the stimulus material; if you're interested I'll share the slides. We have collected about ten speakers so far since the project started, with everything from the rainbow passage to all kinds of spontaneous speech. If you have suggestions or ideas about what would be useful for speaker modeling, now is the time to tell us so we can consider them. Most subjects will be native speakers of American English, and about twenty percent will be non-native speakers of English; in other projects we also collect a lot of data from people speaking other languages, everything from African languages onward.
0:43:20 Finally, besides insights into inter-speaker variability, we can also pursue some clinical use cases, whether it's studying the developing vocal tract in kids or how speaker variability manifests in the signal in patient populations. For example, we've been working with people who have undergone treatment for oral cancer. The surgical interventions remove parts of the tongue, and on top of that there are therapeutic treatments with radiation, so the physical structure itself is modified. Here we see two such patients. One basically lost much of the tongue body to a cancer at the tongue base, and it was reconstructed with a flap taken from the forearm; you can see the variation relative to the normal vocal tract shown for comparison. How does their speech cope? Speech, along with swallowing, is one of the big quality-of-life measures. These cases give us additional insight when we look at speaker variability. Interestingly, some of these people show remarkable compensation abilities; we have access to these speakers and have collected a lot of data from them, so we can compare how they compensate and what strategies they use. Some of them speak quite intelligibly, pretty well. So this provides an additional source of information for understanding the question of individual variability.
0:45:05 In conclusion: the point I want to make is that production data is, I believe, integral to advancing speech communication research, and vocal tract information is a crucial piece of this puzzle. To get it, we need to gather data from lots of different sources to build a complete picture of speech production. That is not trivial, from a technological and computational standpoint as well as from a conceptual and theoretical one. But I believe the payoff is rich, for applications including machine speech recognition and speaker modeling. This sort of approach is inherently interdisciplinary, so people have to come together to work on these topics and to share.
0:45:53 These are some of the people in my speech production group, past members along the bottom and current members above, who contributed to this particular collection of work: my colleague who leads the imaging work, the research scientists and engineers, and Louis Goldstein, the linguist, who provides the conceptual framework for how we approach all of this. My students did the imaging analysis and the morphology work I talked about, including the modeling, and much of that is now translating into speaker verification. I also want to acknowledge the collaborators and sponsors who made the data collection possible and who keep pushing us to share these things with the community. With that, I thank all of you for listening. All of this material, including the video demos, is available online if you're interested. Thank you very much.
0:47:27 [applause] Question: Thank you very much, that was fascinating. Two questions. First of all, when are you going to get to the larynx? I'm asking from the perspective of the forensic phonetician: we are conscious of between-speaker differences arising from the larynx, spectral slope and that sort of thing, and at the moment that seems to be missing. And also, relatedly, what I would consider almost more robust and useful: what knowledge do we have about speaker variability in the nasal cavity, the sinuses and that sort of thing, and what is known about speaker identity there? It's attractive because it's largely outside voluntary control, though of course you won't capture much of it in telephone speech, where anything above about three kilohertz is gone.
0:48:37 Answer: The first question was about the larynx, this region here. The glottal, voice-source phenomena happen at a much higher rate, and MRI is still not fast enough for them; we can go to about a hundred frames per second. So what people have been doing for voice-source questions is high-speed imaging of the larynx with a camera passed through the nose, which involves a little more intervention.
0:49:14 On the other hand, what we can do with MRI is look at things like vertical larynx movement and related behavior, and get some of that slower information. One particular strength of MRI is that it gives complete views of the pharyngeal region, which none of the other modalities provide, so you can look at the epilaryngeal region and those kinds of behavioral phenomena. And in terms of characterizing structures like the sinuses, which don't change much during speech, we can characterize them really well with the T2-weighted images, so we can describe every speaker by the anatomy they have and then see how to relate it to, or account for it in, the signal. We are also trying to do multimodal imaging of the voice source; we have tried EGG alongside MRI, but it gives only a small window into this, and for the full picture we want the high-speed imaging. So completely characterizing the voice source this way is still an open question. I can point you to references; they were on an earlier slide, if people are interested.
0:50:40 Question: Is it possible to say, broadly, whether there are particular areas that show the greatest amount of between-speaker difference? If you were going to look for where speakers differ most, is there one place where the differences concentrate, or do people simply differ in all sorts of ways?
0:51:14 Answer: I think the latter is my guess right now, although I do think the differences will begin to cluster as we increase the numbers, just as we see with eigenvoice-type analyses; I'm sure certain patterns will start clustering as the population grows. For now, the sources of variability seem to be all over the place, even from a perceptual point of view. Plus, how people compensate also varies quite a bit, because of where they come from and how they learned to speak. There is other work I could point to on articulatory settings and on how people, from a motor control point of view, differ in ways that may trace back to language background or other factors; it's still an open question. What I feel has been missing is scale: we are talking about very small datasets compared to what you're used to on the purely acoustic side. If we can increase the data to some extent, and combine it with the computational tools and advances you are all making, I think we can slowly begin to understand this at the level you want. So, an open question.
0:52:49 Comment and question: You put up a classic acoustic tube model, and I'll point out one thing from one of the workshops in the early nineties. From the mid-sixties until the late eighties or early nineties, we used an acoustic tube model that was straight, and we had a summer student who basically spent a summer on the fact that the vocal tract has a right-angle bend in it, which no one had really thought much about: how much does that bend actually impact formant locations and bandwidths? He formulated a closed-form solution, and I think it came out to between a one and three percent shift in formant locations and bandwidths. So it is very much worth testing the physiological assumptions, as you are doing. My basic question: you focused on speaker ID, and I'm assuming many of your speakers are bilingual. Have you thought about looking at language ID, to see whether the physiological production systematically changes when people speak one language versus another?
0:54:00 Answer: Absolutely, and first, on the comment John Hansen made regarding the bend in the vocal tract: there is classic work on the acoustics of the bend from long ago that estimated the effect at around three to five percent, and simulations verified something similar later on. More recent models try to capture this with finite-element-style simulations, which is something we can now feed with real data: we have the postures from all of these speakers, and with high-performance computing this kind of simulation is becoming a reality, so we can actually do what people always wanted to do.
0:55:00 The second question; John, remind me... right, language ID. Yes, of course: we actually have speakers of about forty or fifty different languages in our datasets, people with another first language who speak English as a second language, collected across various linguistic experiments. We have looked at that data a little, though not as much as we would like, and we have some hypotheses. We've looked at things like articulatory setting, which is the posture from which you start executing a speech task. If you think of speech as a dynamical system, the initial state matters: from it you go to another state, release a particular task, and go on to the next, making one constriction after another. We found that people have preferred settings from which they start executing, and that these are quite language specific; we showed this for German speakers versus Spanish speakers versus English speakers. These kinds of things can be estimated from articulatory data. Doing it through inversion is something we haven't done yet, but it's quite possible, and we are happy to share data with anybody.
0:56:26 Question: I have a comment I'd like you to respond to. One of the problems in speaker recognition is what happens between the vocal apparatus and the speech signal we receive. The first thing almost every pipeline applies is cepstral mean subtraction, and that basically throws away the average shape of the vocal tract. How does that impact what you are proposing?
0:57:07 Answer: Right. I didn't talk about channel effects and channel normalization, the things that happen because of recording conditions and so on. One of the things we are contemplating, in the way many people have talked about joint factor analysis, and these ideas carry over even to the new deep learning systems, is to model these multiple factors jointly: to obtain measures of speaker-specific variability separated from the variability caused by extraneous interference or other transformations the signal may undergo. That's why we are doing this from first principles rather than just making the jump of throwing everything into some machine learning system and estimating blindly: by systematically bringing in linguistic theory and speech science, with analysis-by-synthesis types of approaches, we can then ask, when we face other kinds of conditions, such as open-environment or distant speech recording, which is of much interest for various reasons, whether we can account for these effects. I tend to believe in that kind of more organic approach.
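For concreteness, here is the operation the question refers to. Subtracting the per-utterance cepstral mean removes any fixed convolutive component, the channel, and with it part of the average spectral shape of the talker's vocal tract, because convolution in the time domain becomes addition in the cepstral domain.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: (n_frames, n_coeffs) array; returns a mean-normalized copy."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```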
0:58:38 Moderator: We have time for one more question, and then we should proceed. Question: Sorry, I'll be fast. I want first to thank you; it is very nice to see this science brought into speech technology, and particularly into speaker recognition and the forensic area. My comment is just to recall the difference between speaker recognition and forensic voice comparison, because the field really needs both, and your work is relevant to both. When we try to do this kind of analysis, there is a huge difference between read speech, which is trained and controlled, and spontaneous speech. For speaker recognition we can imagine that the speaker is trying to cooperate with the system; in forensic voice comparison we should imagine exactly the opposite. And you are anticipating my question: would it be possible to find constrictions, or articulatory strategies, that remain robust to the deliberate changes an uncooperative speaker might attempt, the challenge your department is exposed to?
1:00:12 Answer: Yes. There are certain things we can change and certain things we can't; they are given. That's exactly one of the things we are trying to go after: some aspects of our physical instrument are given, and we can compensate around them to some extent, but we still see residual effects, and we want to know whether we can get at those residual effects and their bounds. I have a background in information theory, so I am always interested in bounding the limits of things: after all, we have a one-dimensional signal from which we project onto all kinds of feature spaces and do all our computation and inference, whether the target is the speaker or something else. Say you manipulate your strategies; that is only so many degrees of freedom you can move. It causes some differences, but if we can account for them, can we still see the residual effects of the instrument the person has, or of the specific ways they are changing things? And there is the shared-code constraint: you cannot just do random things with your articulation and still create speech sounds. That's why joint modeling of structure and function would be very interesting to see, and how much it can be spoofed remains to be seen; I haven't studied that. But I am hoping that by being very microscopic with these analyses we can get some insight into it, in a way that is very objective, not the impressionistic "this voice is definitely him" that experts used to offer in court. That's one of the reasons this community has been very supportive of the idea: let's approach it in as objective and scientifically grounded a way as possible, rather than leaving it to intuition. Moderator: Let's thank the speaker again. Thank you.