| 0:00:15 | thank you |
|---|
| 0:00:18 | it's wonderful to be here and to be part of this august twenty sixteen meeting |
|---|
| 0:00:21 | thank you, it's an honour to have been invited here to this meeting |
|---|
| 0:00:27 | so, with that, okay, the talk i'm going to give today is |
|---|
| 0:00:31 | about a very classic problem, a question in speech communication: understanding variability and invariance in |
|---|
| 0:00:38 | speech |
|---|
| 0:00:40 | people have been asking this for a long time |
|---|
| 0:00:42 | so | 
|---|
| 0:00:43 | the specific sort of focus today is the variability of the vocal instrument we have to |
|---|
| 0:00:48 | produce speech |
|---|
| 0:00:52 | here are six different people, just showing midsagittal slices of their vocal tracts |
|---|
| 0:00:58 | and we can see immediately each has a very uniquely shaped vocal instrument |
|---|
| 0:01:03 | with which they produce speech, and which is what you're trying to use for |
|---|
| 0:01:07 | doing speaker recognition: the speech signal is produced through this vocal instrument |
|---|
| 0:01:11 | in fact, just to orient yourself if you're not familiar with this kind of looking |
|---|
| 0:01:15 | into the |
|---|
| 0:01:17 | mouth |
|---|
| 0:01:19 | here are the lips, the nose, and the tongue, and the velum, or |
|---|
| 0:01:22 | the soft palate, that, you know, |
|---|
| 0:01:25 | moves there; just so you know, because you'll see a lot of these pictures in my talk today |
|---|
| 0:01:30 | here is a group of a few more people |
|---|
| 0:01:33 | all of them trying to produce the same well-known vowel |
|---|
| 0:01:36 | but you can just take a quick look at it and you see, even statically, these |
|---|
| 0:01:40 | people produce these sounds slightly differently; if we look at, like, another |
|---|
| 0:01:43 | example |
|---|
| 0:01:44 | there, like, you know, the first and second speakers vary in how the tongue is raised |
|---|
| 0:01:50 | and tucked in |
|---|
| 0:01:50 | to make the gesture for making that sound; well, they're slightly different |
|---|
| 0:01:55 | so | 
|---|
| 0:01:57 | we kind of know that both the structure in |
|---|
| 0:02:02 | which this speech production happens and how we produce speech vary across people |
|---|
| 0:02:07 | and some of it is reflected in the speech signal, which is |
|---|
| 0:02:09 | just what we're trying to sort of get at |
|---|
| 0:02:14 | so the aim of this line of work is to say, well, what role |
|---|
| 0:02:17 | can speech science, you know, play in understanding and supporting speech technology development? not only |
|---|
| 0:02:24 | do we want to recognize speakers, we want to know what makes them different |
|---|
| 0:02:29 | so specifically, you know, the focus today |
|---|
| 0:02:33 | is to look at vocal tract structure, the physical instrument a person is given, and function, |
|---|
| 0:02:37 | the behaviour within that apparatus for producing speech, |
|---|
| 0:02:41 | and the interplay between them |
|---|
| 0:02:42 | so by structure i mean the physical characteristics of this vocal tract apparatus that we have, |
|---|
| 0:02:48 | right, like the hard palate geometry, the oral volume, you know, |
|---|
| 0:02:51 | the length of the vocal tract, the velum, the tongue mass |
|---|
| 0:02:54 | function typically refers to the behavioural characteristics of speech articulation: |
|---|
| 0:02:58 | how we dynamically move, for example, to produce the consonants and vowels, the constrictions in the vocal |
|---|
| 0:03:03 | tract; you know, to make a sound like /s/ the tongue is raised towards the palate, |
|---|
| 0:03:07 | you know, |
|---|
| 0:03:09 | and creates a narrow channel to |
|---|
| 0:03:11 | create turbulence |
|---|
| 0:03:15 | so | 
|---|
| 0:03:16 | this leads to very specific questions we ask, right: how are individual vocal tract differences, |
|---|
| 0:03:21 | like in those pictures of people, reflected in the speech acoustics |
|---|
| 0:03:25 | can they, as the inverse problem, be predicted from the acoustics |
|---|
| 0:03:30 | how do people sort of, you know, accommodate for structural differences to |
|---|
| 0:03:34 | create phonetic equivalence, right, because we all try to communicate using a shared speech code and language |
|---|
| 0:03:40 | and, most pointedly, what contributes to distinguishing speakers from one another from the |
|---|
| 0:03:44 | speech |
|---|
| 0:03:45 | right, so i want to emphasise: not only are we trying to differentiate individuals from |
|---|
| 0:03:50 | their speech signal, but to understand what makes them different, in structure and behaviour |
|---|
| 0:03:55 | so let's tackle some of this |
|---|
| 0:03:59 | sort of one by one |
|---|
| 0:04:02 | so we'll try to see how we can quantify individual variability given vocal tract data, |
|---|
| 0:04:07 | try to see if we can predict some of these from the signal, and |
|---|
| 0:04:10 | what the bounds of it are, and so on, |
|---|
| 0:04:13 | how individual articulatory strategies differ, and whether we can explore, you know, automatic speaker |
|---|
| 0:04:19 | recognition type, you know, applications, and |
|---|
| 0:04:23 | offer some interpretation while doing so |
|---|
| 0:04:25 | so, the approach we take in our laboratory: |
|---|
| 0:04:29 | one of my research groups is the speech production and articulation knowledge |
|---|
| 0:04:33 | group; it looks at a lot of different questions, including questions of variability, so we take a multimodal |
|---|
| 0:04:39 | approach |
|---|
| 0:04:39 | and look at different kinds of ways of getting at speech production, you know: |
|---|
| 0:04:44 | MRI, which i'll talk about a lot today, audio, and other kinds of measurement |
|---|
| 0:04:48 | technologies, and a whole lot of multimodal processing: image processing and, you |
|---|
| 0:04:53 | know, speech processing, and then modelling based on that, |
|---|
| 0:04:57 | and try to use |
|---|
| 0:04:58 | these kinds of engineering advances to gain insights about the dynamics of production, speaker variability, |
|---|
| 0:05:06 | questions about speaking style, prosody, emotions |
|---|
| 0:05:10 | so the rest of the talk is structured as follows |
|---|
| 0:05:14 | i'll focus the first part on seeing how we can measure speech production, right, |
|---|
| 0:05:19 | how do we get those images and so on, with a particular focus on |
|---|
| 0:05:24 | MRI, magnetic resonance imaging, something that we've been trying to develop a lot |
|---|
| 0:05:27 | and then, given datasets of data, how do we analyze them; and i'll end with sort |
|---|
| 0:05:33 | of some modeling questions |
|---|
| 0:05:35 | so | 
|---|
| 0:05:36 | how do you get at vocal tract imaging |
|---|
| 0:05:39 | this has been very central to speech science, you know, for a long time, |
|---|
| 0:05:44 | right, the aim to observe and measure articulatory details, the tongue surface and so on, and |
|---|
| 0:05:50 | there are a number of techniques, you know, each with its own strengths and limitations |
|---|
| 0:05:54 | you know, for example, the early sort of x-ray movies that were made, right, like, you know, |
|---|
| 0:05:58 | the ones by ken stevens and so on; x-rays, you know, |
|---|
| 0:06:03 | have got pretty good temporal resolution, but they're not safe for |
|---|
| 0:06:07 | people, so that's no longer a methodology; and then a number of other techniques, like ultrasound, |
|---|
| 0:06:13 | which provide you a partial view of the insides and are not necessarily helpful for the kinds |
|---|
| 0:06:18 | of modeling we are after, and things like electropalatography, shown in this picture |
|---|
| 0:06:23 | so here actually is an x-ray |
|---|
| 0:06:26 | let me play that |
|---|
| 0:06:31 | this, in fact, is ken stevens |
|---|
| 0:06:34 | right; and this is ultrasound, so you only see the tongue surface, parts of it; you only |
|---|
| 0:06:39 | see the edges |
|---|
| 0:06:41 | and this is electropalatography: you have people wear a palate, like |
|---|
| 0:06:47 | the one you see here, with the contact electrodes, |
|---|
| 0:06:50 | and so when we speak, the contact made by the tongue with the palate provides |
|---|
| 0:06:55 | you some insights about timing and coordination, you know, in speech, to study |
|---|
| 0:07:00 | aspects of it |
|---|
| 0:07:01 | and finally |
|---|
| 0:07:03 | this is electromagnetic articulography: a person sits |
|---|
| 0:07:05 | there, |
|---|
| 0:07:06 | we put little rice-crispy-like sensors in there and measure the dynamics, you |
|---|
| 0:07:11 | know |
|---|
| 0:07:12 | so, you know, those provide you some options |
|---|
| 0:07:14 | now |
|---|
| 0:07:15 | the new possibilities we are excited about were created with advances in MRI, |
|---|
| 0:07:19 | which provides you very good soft tissue contrast; you know, basically |
|---|
| 0:07:24 | what it relies on is the water content of tissue, so the hydrogen in it, |
|---|
| 0:07:30 | which varies across various soft tissues; so we make use of it by |
|---|
| 0:07:34 | exciting the protons, and as they relax a signal is generated according to that content, |
|---|
| 0:07:38 | and then we can image it, right |
|---|
| 0:07:41 | it's very exciting because it provides you very rich, |
|---|
| 0:07:45 | very good quality images, but it's very slow, the traditional kind |
|---|
| 0:07:50 | and also it has a lot of challenges: it's very noisy, and, you know, if |
|---|
| 0:07:53 | you have to lie down in the scanner |
|---|
| 0:07:55 | to produce speech sounds, experiments are a little constrained; so these are some of the things we've been contending |
|---|
| 0:08:00 | with over the last ten years |
|---|
| 0:08:01 | i mean, so, you know, the very first sort of |
|---|
| 0:08:06 | advance on our part was made around two thousand four, |
|---|
| 0:08:11 | moving into |
|---|
| 0:08:12 | real-time imaging, that is, |
|---|
| 0:08:15 | getting to speeds, |
|---|
| 0:08:16 | or sampling rates, that are higher than |
|---|
| 0:08:18 | what the speech rates are, like, you know, |
|---|
| 0:08:23 | the twelve or so hertz of syllable and articulation rates, and so on |
|---|
| 0:08:28 | maybe i'll show you a session |
|---|
| 0:08:31 | [speech sample from the scanner plays] |
|---|
| 0:08:41 | so | 
|---|
| 0:08:41 | if you're familiar with the rainbow passage, people read that; and it was |
|---|
| 0:08:46 | very exciting for us to actually be able to do this |
|---|
| 0:08:49 | we were doing acoustic recordings, and a lot of the speech enhancement work, for the |
|---|
| 0:08:53 | MRI, and it was synchronised; so it kind of opened up a lot of different possibilities |
|---|
| 0:08:58 | for doing this |
|---|
| 0:09:00 | that's what we saw |
|---|
| 0:09:02 | but we did not stop at that rate, really; |
|---|
| 0:09:04 | in principle the signals were good for a wide range of purposes, but we have been trying |
|---|
| 0:09:09 | to see, can we make it even better |
|---|
| 0:09:11 | and so when you actually look at the kinds of rates: |
|---|
| 0:09:16 | for various sounds in speech it's not like one constant; we are using a lot of |
|---|
| 0:09:19 | different, you know, movements in the articulatory task, |
|---|
| 0:09:21 | so from trills, like the /r/ in spanish, |
|---|
| 0:09:23 | to, you know, other such sounds, and so on, |
|---|
| 0:09:27 | they all have different rates |
|---|
| 0:09:28 | so if we can get at that kind of rate, right, it would be really cool |
|---|
| 0:09:33 | so | 
|---|
| 0:09:34 | in fact, we were able last year to make a breakthrough |
|---|
| 0:09:38 | and get up to sort of one hundred frames per second doing real-time MRI, |
|---|
| 0:09:41 | with the work of |
|---|
| 0:09:44 | more than one postdoc |
|---|
| 0:09:46 | and not only can we do so at very fast speech rates, where you can really |
|---|
| 0:09:51 | see the tongue tip when, you know, a trill is produced, |
|---|
| 0:09:54 | but you can also do multiple planes simultaneously: what you see here is a sagittal |
|---|
| 0:10:00 | slice, like this, of myself there, |
|---|
| 0:10:02 | or slices axially, like that, or coronally, like this; so we can do simultaneous |
|---|
| 0:10:07 | views of the vocal tract |
|---|
| 0:10:09 | so it's really exciting, actually, to be able to do this at really high rates |
|---|
| 0:10:12 | to get |
|---|
| 0:10:14 | our insights |
|---|
| 0:10:16 | and so this was made possible by both hardware and algorithmic sort of advances |
|---|
| 0:10:22 | we developed a custom coil receiver for |
|---|
| 0:10:27 | this |
|---|
| 0:10:27 | and made a lot of progress in both sequence design |
|---|
| 0:10:31 | but also sort of in reconstruction, using compressed sensing, things that have been happening in |
|---|
| 0:10:35 | signal processing |
|---|
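As an aside, the compressed-sensing idea behind these reconstructions can be sketched in a few lines: recover a signal that is sparse in some basis from far fewer Fourier samples than Nyquist would demand, via iterative soft-thresholding (ISTA). This is a toy 1-D illustration with made-up data and parameters, not the group's actual MRI reconstruction pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "image": length-128 signal with only 5 nonzeros (sparse in its own domain)
n, m = 128, 48
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.standard_normal(5)

# undersampled unitary DFT: keep m of the n frequency rows (the "k-space" samples)
rows = rng.choice(n, m, replace=False)
F = np.fft.fft(np.eye(n), axis=0) / np.sqrt(n)
A = F[rows, :]
y = A @ x_true                      # what the scanner would measure

def soft(v, t):
    """soft-thresholding: the proximal operator of the l1 penalty"""
    mag = np.abs(v)
    return np.where(mag > t, v * (1.0 - t / np.maximum(mag, 1e-12)), 0.0)

# ISTA: gradient step on ||Ax - y||^2, then shrinkage (step size 1 is safe
# here because the kept DFT rows are orthonormal)
x = np.zeros(n, dtype=complex)
for _ in range(500):
    x = soft(x + A.conj().T @ (y - A @ x), 0.01)

rel_err = np.linalg.norm(x.real - x_true) / np.linalg.norm(x_true)
```

With 48 of 128 Fourier samples the 5-sparse signal comes back nearly exactly; a dense signal would not.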
| 0:10:36 | so we were able to really |
|---|
| 0:10:38 | speed this up, and we're quite excited about it; so this is how you run |
|---|
| 0:10:42 | an experiment now: |
|---|
| 0:10:44 | someone is sitting there doing the audio collection; you know, we reprogrammed the scanner |
|---|
| 0:10:48 | so that the audio is synchronised with the imaging |
|---|
| 0:10:53 | we have an interactive sort of |
|---|
| 0:10:56 | control system to select the scan plane and so on |
|---|
| 0:11:01 | [speech samples from the demo play] |
|---|
| 0:11:24 | so you get the idea, right; you can really see things, you know; on |
|---|
| 0:11:28 | the projector it doesn't look as good as it actually does |
|---|
| 0:11:31 | on a proper display, where it's really good; but actually now we are looking at production data at a |
|---|
| 0:11:36 | scale which is conducive to the kinds of machine learning approaches one could use, |
|---|
| 0:11:41 | although i will not be talking about that today; |
|---|
| 0:11:44 | this is, we are not done with the problem |
|---|
| 0:11:46 | in addition to doing single-plane or multi-plane slice imaging, we are also very interested in |
|---|
| 0:11:51 | volumetric imaging: if you are interested in characterizing speakers, which is one of |
|---|
| 0:11:54 | the sort of topics our research is interested in, you want to |
|---|
| 0:11:58 | really know the full geometry while people are speaking |
|---|
| 0:12:03 | and we made some advances there too: with about seven seconds of holding a sort |
|---|
| 0:12:07 | of posture, things like that, |
|---|
| 0:12:09 | we can do full sweeps of |
|---|
| 0:12:11 | the entire vocal tract, and so we can get full exemplar geometries of people's vocal |
|---|
| 0:12:16 | tracts |
|---|
| 0:12:18 | in addition | 
|---|
| 0:12:19 | we can also do really fine imaging of the anatomical structures, notably with |
|---|
| 0:12:25 | static MRI; so we can do this, classically, in addition to the real-time MRI, and i'll |
|---|
| 0:12:30 | show you why we are doing all these things, for the kinds of measures, because what |
|---|
| 0:12:33 | we really want to have is a comprehensive way of characterizing speakers, characterizing both |
|---|
| 0:12:39 | the vocal instrument and the behaviour |
|---|
| 0:12:43 | so, as an aside, one of the things we decided recently is to release |
|---|
| 0:12:47 | a lot of these data; so, for example, an MRI database with, you know, different |
|---|
| 0:12:50 | speakers, with, for each of them, you know, four hundred and sixty sentences, |
|---|
| 0:12:55 | with alignments and, you know, the image features and so on; it's all available |
|---|
| 0:13:00 | for free download, so |
|---|
| 0:13:04 | here are some examples of that kind of data |
|---|
| 0:13:07 | [speech samples from the database play] |
|---|
| 0:13:20 | so it's got five male and five female speakers, |
|---|
| 0:13:22 | maybe some of them |
|---|
| 0:13:26 | you actually |
|---|
| 0:13:28 | know, |
|---|
| 0:13:31 | and so on, so |
|---|
| 0:13:33 | and we also have alignments, basically coregistration of this, you know, some algorithms for that; |
|---|
| 0:13:38 | that's also released; so we have this kind of data that we can work |
|---|
| 0:13:42 | with; so what do you do with this stuff |
|---|
| 0:13:45 | so i'll sort of introduce some analysis, preliminarily |
|---|
| 0:13:49 | a lot of image processing, you know; the very first thing is, like, actually getting |
|---|
| 0:13:54 | at the structural details of the human vocal apparatus; for people interested in sort |
|---|
| 0:14:00 | of, you know, anatomy and morphometrics, this offers a way |
|---|
| 0:14:04 | of measuring, among other things, the length of the palate and |
|---|
| 0:14:08 | so on, |
|---|
| 0:14:10 | and that's what we wanted to do, very carefully, with the high-resolution static |
|---|
| 0:14:14 | imaging | 
|---|
| 0:14:16 | on top of that, we also want to track articulators, right, since |
|---|
| 0:14:20 | articulators serve important, specific tasks, |
|---|
| 0:14:23 | so we want to be able to automatically process these things |
|---|
| 0:14:26 | so | 
|---|
| 0:14:26 | the methodology we sort of proposed was a sort of sampling-based segmentation model |
|---|
| 0:14:33 | and it's a very nice mathematical formulation, actually, work done by one of ours, |
|---|
| 0:14:38 | and he was able to create a segmentation algorithm that works fairly well |
|---|
| 0:14:45 | so it does things like this, okay |
|---|
| 0:14:49 | [video plays] |
|---|
| 0:14:52 | so once we're doing that, we can actually capture the variables we want automatically |
|---|
| 0:14:57 | from these vast amounts of data; so one way i like to think about it is as one |
|---|
| 0:15:00 | kind of feature extraction, to me |
|---|
| 0:15:04 | so we can go to the variables that are actually linguistically more meaningful to us, |
|---|
| 0:15:08 | right |
|---|
| 0:15:09 | so one of my close collaborators is among the founders of the articulatory phonology |
|---|
| 0:15:15 | view that underlies this; |
|---|
| 0:15:18 | we sort of conceptualise speech production as a dynamical system, |
|---|
| 0:15:22 | and so various articulators, involved in a task, basically coordinate, forming and releasing constrictions as we |
|---|
| 0:15:29 | move around |
|---|
| 0:15:30 | so we are interested in features like, for example, |
|---|
| 0:15:33 | sort of lip aperture and protrusion, |
|---|
| 0:15:36 | constriction degree and location, and so on; so we want to be able to get these automatically, |
|---|
| 0:15:42 | to extract them |
|---|
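To make features like lip aperture and constriction degree concrete, here is a minimal sketch computed from hypothetical 2-D contour points; the coordinates and helper names are invented for illustration, and the real features come from the tracked MRI contours.

```python
import numpy as np

# hypothetical contour points (cm) standing in for tracked MRI contours
upper_lip = np.array([1.0, 3.2])
lower_lip = np.array([1.0, 2.1])
tongue = np.array([[3.0, 2.0], [4.0, 2.6], [5.0, 3.0], [6.0, 2.8]])
palate = np.array([[3.0, 3.4], [4.0, 3.6], [5.0, 3.3], [6.0, 3.1]])

def lip_aperture(ul, ll):
    # distance between upper- and lower-lip landmarks: one classic tract variable
    return float(np.linalg.norm(ul - ll))

def constriction(tongue_pts, palate_pts):
    # constriction degree = minimum tongue-to-palate distance;
    # constriction location = front-back (x) position where that minimum occurs
    d = np.linalg.norm(tongue_pts[:, None, :] - palate_pts[None, :, :], axis=-1)
    i, _ = np.unravel_index(int(d.argmin()), d.shape)
    return float(d.min()), float(tongue_pts[i, 0])

degree, location = constriction(tongue, palate)   # ~0.3 cm, at x = 5.0 here
```

With these made-up points, `lip_aperture(upper_lip, lower_lip)` is about 1.1 cm; per-frame values like these, tracked over time, are what the dynamical-systems view works with.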
| 0:15:43 | here's another view |
|---|
| 0:15:50 | so we need to automate these things now: going from images, to segmentation, |
|---|
| 0:15:56 | to actually extracting these sort of linguistically meaningful |
|---|
| 0:16:02 | features |
|---|
| 0:16:06 | and that, you know, lets you extract other kinds of representations; |
|---|
| 0:16:11 | like, for example, we can look at PCA on these contours to |
|---|
| 0:16:14 | look at the contributions of different articulators, |
|---|
| 0:16:18 | and so on; so this just provides you some ways of getting at, sort |
|---|
| 0:16:22 | of, objectively characterizing this production information, |
|---|
| 0:16:26 | and it's speaker specific |
|---|
| 0:16:30 | so, so far, like, i've told you about how |
|---|
| 0:16:34 | to get the data, and some of the basic analysis, with which we |
|---|
| 0:16:39 | can now start looking at speaker-specific properties |
|---|
| 0:16:43 | so | 
|---|
| 0:16:45 | as i mentioned earlier, the data analysis gets at the anatomy: how to characterise every |
|---|
| 0:16:50 | single vocal instrument, structurally |
|---|
| 0:16:52 | and this sort of task was treated pretty well in the anatomy literature and so on, so |
|---|
| 0:16:56 | we went to look at |
|---|
| 0:16:57 | all that literature |
|---|
| 0:16:59 | and, you know, compiled a whole bunch of these landmarks; you may be familiar with some of |
|---|
| 0:17:05 | the landmarks in speech, |
|---|
| 0:17:07 | and came up with these kinds of measures that we can get at, like, you |
|---|
| 0:17:11 | know, vocal tract sort of overall length, and the cavity lengths separately, |
|---|
| 0:17:16 | and, you know, so on, which we can sort of measure from these |
|---|
| 0:17:20 | kinds of very high contrast images; so that's one source of sort of speaker-specific information |
|---|
| 0:17:27 | as an aside, also, you know, since we had many repetitions of the same tokens by |
|---|
| 0:17:31 | these people at different sessions, you know, |
|---|
| 0:17:34 | we were interested in how consistent people are, and it was very sort of |
|---|
| 0:17:39 | reassuring that, you know, people are fairly consistent in how they produce these; it turns |
|---|
| 0:17:45 | out, you know, that the measurements were very consistent, so |
|---|
| 0:17:48 | this is, for example, looking at the correlations between measurements from one session to another, |
|---|
| 0:17:51 | something that we presented at interspeech |
|---|
| 0:17:55 | so, you see, the striking thing is: we have this sort of fixed structural environment within |
|---|
| 0:18:00 | which we produce speech behaviour, and we want to know |
|---|
| 0:18:05 | how much of it is dictated by the environment we have, versus the strategies that |
|---|
| 0:18:09 | are adopted by speakers, or are unique to them, due to various reasons which we |
|---|
| 0:18:13 | can't really pinpoint, but it is, you know, |
|---|
| 0:18:15 | learning that they have done, or the environment, and so forth; so let's see if we can sort |
|---|
| 0:18:21 | of start deconstructing this a little bit |
|---|
| 0:18:25 | so next i'll also show you a few examples along this direction |
|---|
| 0:18:29 | so, for example, in this picture i want you to focus on the palate and the palatal |
|---|
| 0:18:33 | variation; the palate is, like, you know, behind your upper teeth, and then the hard surface at |
|---|
| 0:18:37 | the roof of the mouth, you know, right; that's the hard palate, which is an important part of the |
|---|
| 0:18:40 | vocal |
|---|
| 0:18:41 | apparatus; so here we see |
|---|
| 0:18:43 | that this person |
|---|
| 0:18:45 | (where's my mouse) |
|---|
| 0:19:05 | there it is |
|---|
| 0:19:05 | so, you know, we see that the highest point of the palatal vault varies: here it is |
|---|
| 0:19:11 | more posterior, |
|---|
| 0:19:14 | then more anterior; here it is sharper, more domed; |
|---|
| 0:19:17 | and that is just six different people |
|---|
| 0:19:19 | so now, how do we begin to actually quantify what you are qualitatively seeing; |
|---|
| 0:19:24 | can you quantify this, right, so |
|---|
| 0:19:30 | so what one of our students did |
|---|
| 0:19:32 | was actually, you know, to take these kinds of extracted palate shapes and |
|---|
| 0:19:37 | start doing sort of, you know, even simple PCA analysis, |
|---|
| 0:19:41 | and showed that most of the variance could be explained by the |
|---|
| 0:19:45 | first five factors, |
|---|
| 0:19:47 | which were sort of akin to, the first was like the concavity or convexity of |
|---|
| 0:19:51 | the shape; the next one was more, you know, how forward or backward |
|---|
| 0:19:56 | this concavity was, like sort of its anteriority; and then how sharp it is; and so on; so |
|---|
| 0:20:01 | these had nice interpretations, while being actually very objective, so |
|---|
| 0:20:07 | we can actually begin to quantify and cluster people along these sort of |
|---|
| 0:20:11 | low-dimensional, sort of, latent variables |
|---|
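That palate-shape PCA can be sketched on synthetic data: contours generated from two latent factors, a concavity depth and a front-back shift, so a handful of components recover almost all of the variance. The contour model below is invented for illustration; the actual analysis ran on the extracted MRI shapes.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic "palate contours": heights sampled at 32 points along the mouth;
# each speaker = a dome whose depth and front-back position vary, plus noise
x = np.linspace(-1.0, 1.0, 32)
depth = rng.normal(0.0, 0.3, 200)      # concavity factor
shift = rng.normal(0.0, 0.2, 200)      # anteriority factor
contours = np.stack([-(1.0 + d) * (x - s) ** 2 + 0.02 * rng.standard_normal(32)
                     for d, s in zip(depth, shift)])

# PCA via SVD of the centred data matrix
centred = contours - contours.mean(axis=0)
_, svals, _ = np.linalg.svd(centred, full_matrices=False)
explained = svals ** 2 / np.sum(svals ** 2)   # variance share per component
```

Here the leading components carry nearly all the shape variance and line up with the generating factors, mirroring the interpretable depth and anteriority factors found for real palates.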
| 0:20:14 | and then we can actually |
|---|
| 0:20:15 | plug these kinds of things into models, right; like, for example, acoustic |
|---|
| 0:20:20 | simulations, to see what the acoustic consequences of these variations are |
|---|
| 0:20:24 | right |
|---|
| 0:20:24 | so one of the things we found is that |
|---|
| 0:20:27 | the concavity mattered: it affected the first formant very much, |
|---|
| 0:20:32 | whereas, like, the anteriority, how forward or backward the palate |
|---|
| 0:20:36 | shape is, and actually the sharpness, really didn't matter, at least from these first-order |
|---|
| 0:20:41 | simulations |
|---|
| 0:20:42 | so from data, to |
|---|
| 0:20:45 | morphological characteristics, we can actually see and pretty much interpret what acoustic consequences we can |
|---|
| 0:20:50 | expect |
|---|
| 0:20:51 | right |
|---|
| 0:20:52 | in fact, we can put this in an articulatory synthesiser and listen to the |
|---|
| 0:20:57 | words from the different |
|---|
| 0:20:59 | palates |
|---|
| 0:21:02 | [synthesised speech samples play] |
|---|
| 0:21:09 | you can hear the same thing coming out differently depending on the palate |
|---|
| 0:21:13 | so we can do this kind of analysis very, you know, carefully |
|---|
| 0:21:16 | so | 
|---|
| 0:21:18 | of course, we are also interested now in, like, the inverse problem, right: can we estimate |
|---|
| 0:21:22 | these shapes given the acoustic signal; how much of it is available to |
|---|
| 0:21:27 | us, the body shape details, right; so |
|---|
| 0:21:30 | we did the classic thing, right, okay: |
|---|
| 0:21:34 | we have all kinds of features from the |
|---|
| 0:21:37 | acoustic signal; one thing to realise, right: |
|---|
| 0:21:42 | the shaping of the airway as we speak is directly influenced |
|---|
| 0:21:46 | by both the environment, the anatomy, and the movements, the behaviours, right; so what |
|---|
| 0:21:50 | i mean is, |
|---|
| 0:21:52 | that is to say, both how we articulate |
|---|
| 0:21:55 | and what we have: |
|---|
| 0:21:56 | both influence the signal |
|---|
| 0:21:59 | so now let's see how we can get at it from the signal |
|---|
| 0:22:01 | and we showed that, in a very simple first experiment, we can get at the shape, |
|---|
| 0:22:05 | sort of, detection: |
|---|
| 0:22:06 | concave or flat; that, like, sixty-some percent of the time we can guess what kind |
|---|
| 0:22:10 | of palate they have just from the acoustic signal; so the morphological information is |
|---|
| 0:22:14 | available |
|---|
| 0:22:15 | so a more interesting question would be: |
|---|
| 0:22:18 | sort of a very classic morphological parameter that we've been using a lot is vocal |
|---|
| 0:22:24 | tract length, right; this is something that has often been important in speech recognition |
|---|
| 0:22:28 | and otherwise; we've been, you know, thinking about it, |
|---|
| 0:22:31 | well, to |
|---|
| 0:22:33 | normalize for it, and also to estimate things like, for example, when we're doing age recognition and |
|---|
| 0:22:39 | so on |
|---|
| 0:22:39 | right, so here, again, the same question: |
|---|
| 0:22:42 | we have some of the speaker-specific anatomy, we think, |
|---|
| 0:22:46 | reflected in the signal, right; |
|---|
| 0:22:47 | we want to see how much we can grab at it to pinpoint the speaker there |
|---|
| 0:22:52 | and, you know, we know that to some extent speakers compensate for what |
|---|
| 0:22:57 | environment they have, and we want to know now how much |
|---|
| 0:23:02 | of it is residual, that you can actually use to |
|---|
| 0:23:05 | get at this; this is, again, vocal tract length; i start with this because it's a classic |
|---|
| 0:23:09 | question that people have been asking; so, for example, here is the data from a published |
|---|
| 0:23:13 | study, you know, from two thousand nine; |
|---|
| 0:23:16 | they're, like, you know, showing vocal tract length growth with age here, |
|---|
| 0:23:21 | over the years; and it goes across from, what, six centimetres to seventeen point five, |
|---|
| 0:23:27 | eighteen centimetres long, |
|---|
| 0:23:29 | and there's some |
|---|
| 0:23:30 | differentiation that happens, empirically, for males and females as well |
|---|
| 0:23:35 | and correspondingly this |
|---|
| 0:23:37 | affects, strikingly, the formant space in the spectrum |
|---|
| 0:23:40 | now |
|---|
| 0:23:42 | by zeroing in on the first formant, the range for |
|---|
| 0:23:47 | a shorter vocal tract, we can see that, between a |
|---|
| 0:23:52 | shorter vocal tract and a longer vocal tract, the space |
|---|
| 0:23:56 | all sort of |
|---|
| 0:23:58 | gets compressed |
|---|
| 0:23:59 | and, you know, shifted, and these kinds of things happen |
|---|
| 0:24:02 | and what people have been doing, implicitly or explicitly, when we do VTLN |
|---|
| 0:24:07 | is to basically normalize for this effect |
|---|
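The inverse scaling between tract length and formants that VTLN exploits can be written down with the classic uniform-tube, quarter-wavelength approximation (an idealisation, not the talk's measured data): F_n = (2n - 1)c / 4L, so all formants scale as 1/L and a single linear warp factor aligns two speakers.

```python
# uniform lossless tube, closed at the glottis and open at the lips:
# resonances F_n = (2n - 1) * c / (4 * L), i.e. formants scale as 1/L
C = 35000.0  # approximate speed of sound in warm, moist air, in cm/s

def tube_formants(length_cm, count=4):
    return [(2 * k - 1) * C / (4.0 * length_cm) for k in range(1, count + 1)]

adult = tube_formants(17.5)     # 17.5 cm tract -> F1 = 500 Hz, F2 = 1500 Hz, ...
child = tube_formants(12.0)     # shorter tract -> every formant scaled up

# VTLN in its simplest, linear form: warp frequency by the length ratio
alpha = 17.5 / 12.0
warped = [f / alpha for f in child]  # lands back on the adult formants here
```

Real speakers deviate from the uniform tube, which is exactly why a single linear warp leaves the residual differences discussed later.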
| 0:24:12 | so the classic estimation of vocal tract length, you know, goes back, you know, |
|---|
| 0:24:17 | you know, to a very simple sort of resonator idea: |
|---|
| 0:24:21 | sort of like, with a uniform tube model, we can begin to estimate the length |
|---|
| 0:24:26 | of the vocal tract from |
|---|
| 0:24:28 | the formants |
|---|
| 0:24:29 | right, so what people have proposed is that |
|---|
| 0:24:31 | with some sort of a formula over the formants you can estimate |
|---|
| 0:24:35 | the length parameter |
|---|
| 0:24:38 | and | 
|---|
| 0:24:39 | one of the early works to improve this was, you know, |
|---|
| 0:24:41 | a |
|---|
| 0:24:43 | length prediction |
|---|
| 0:24:44 | formula; okay, and it primarily relies on the third and fourth formants, and other |
|---|
| 0:24:50 | people have proposed variants |
|---|
| 0:24:51 | what we decided was: well, now, since we actually have |
|---|
| 0:24:54 | direct measurements of the vocal tract length and the acoustics, |
|---|
| 0:24:57 | can we come up with better regression models |
|---|
| 0:25:00 | and sure enough, we showed, actually, from this timit corpus, you know, we |
|---|
| 0:25:05 | showed that we can get, like, really good estimates, you know, with very high correlations, |
|---|
| 0:25:10 | of vocal tract length, you know |
|---|
| 0:25:12 | and this is kind of very interesting; so we are able to sort of |
|---|
| 0:25:15 | regress a good model and estimate the model parameters, |
|---|
| 0:25:18 | and the practical effect is: now we are able to estimate vocal tract length as yet |
|---|
| 0:25:22 | another sort of morphometric detail of the person, from the signal; |
|---|
| 0:25:25 | that's kind of exciting |
|---|
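A toy version of that regression idea, with simulated speakers in place of the MRI corpus: generate "true" lengths, produce noisy higher formants from the uniform-tube approximation, and fit an ordinary-least-squares model of length on the formants. The high correlation here is a property of this simulation; it illustrates the approach rather than reproducing the reported numbers.

```python
import numpy as np

rng = np.random.default_rng(2)
C = 35000.0  # approximate speed of sound, cm/s

# simulated speakers: vocal tract lengths between 13 and 19 cm
L_true = rng.uniform(13.0, 19.0, 200)

# "observed" third and fourth formants from the uniform-tube approximation,
# F_n = (2n - 1) * c / (4 * L), plus measurement noise
F = np.stack([(2 * n - 1) * C / (4.0 * L_true) for n in (3, 4)], axis=1)
F += rng.normal(0.0, 20.0, F.shape)

# ordinary least squares: length regressed on the two formants plus a bias
X = np.column_stack([F, np.ones_like(L_true)])
coef, *_ = np.linalg.lstsq(X, L_true, rcond=None)
L_hat = X @ coef

r = float(np.corrcoef(L_true, L_hat)[0, 1])   # predicted vs. true length
```

The higher formants are used because they are less vowel-dependent than F1 and F2, which is also why the earlier prediction formulas leaned on F3 and F4.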
| 0:25:27 | one last point |
|---|
| 0:25:29 | so | 
|---|
| 0:25:32 | this summarizes what i just said: computation, with better vocal tract length estimation, and |
|---|
| 0:25:36 | availability of data, and sort of, you know, good statistical methods, allow us to get |
|---|
| 0:25:40 | like, better insights |
|---|
| 0:25:42 | now | 
|---|
| 0:25:42 | moving on | 
|---|
| 0:25:44 | let's look at the tongue; the vocal tract is kind of a confined construct, you know, |
|---|
| 0:25:49 | it's very much defined, and the tongue, which is, like, you know, |
|---|
| 0:25:54 | pretty remarkable, actually plays a big role in how we shape |
|---|
| 0:25:59 | the sound |
|---|
| 0:26:01 | so the question we ask is, like, okay: |
|---|
| 0:26:04 | we have, sort of, |
|---|
| 0:26:06 | vocal tract length and formant frequencies, the same chart i was showing you before; |
|---|
| 0:26:10 | we normalize for it using linear normalization, as that is what we typically do, but |
|---|
| 0:26:15 | we still have residual differences that are unexplained; people, you know, have |
|---|
| 0:26:21 | proposed, like, nonlinear vocal tract length normalisation, multi-parameter warps and all that, but again it's not |
|---|
| 0:26:26 | specified what it is; so what we want to know is: is that residual effect |
|---|
| 0:26:30 | actually |
|---|
| 0:26:32 | saying something about the size of the tongue that people have, |
|---|
| 0:26:36 | something we can automatically, you know, account for |
|---|
| 0:26:38 | so | 
|---|
| 0:26:40 | so the hypothesis i have up here is that the tongue size, and, like, the relative tongue shape, |
|---|
| 0:26:44 | here, |
|---|
| 0:26:45 | this thing, |
|---|
| 0:26:47 | differing across people, |
|---|
| 0:26:49 | will explain some of the vowel space differences |
|---|
| 0:26:52 | okay | 
|---|
| 0:26:54 | so the questions we ask, and this slide lays them out, are: how does one define and measure tongue size |
|---|
| 0:27:03 | how does tongue size vary across the population |
|---|
| 0:27:09 | what is the effect of tongue size on articulation |
|---|
| 0:27:13 | and is that effect visible in the acoustics, and can it be predicted and normalized |
|---|
| 0:27:19 | same questions; there is very little published work on this kind of thing |
|---|
| 0:27:23 | people know that there is a coordinated, global growth of the size of the vocal tract as we develop |
|---|
| 0:27:30 | there are some disorders that are usually associated with large tongue sizes |
|---|
| 0:27:39 | so what happens with a large tongue? it has effects on how we produce speech, like velarization of coronal sounds, the sounds made at the front of the mouth, like l, s, t, d and n |
|---|
| 0:27:56 | labialization, how we try to produce things with the lips, and sort of almost linguolabial productions, using the tongue in producing what would be bilabial sounds like p and b |
|---|
| 0:28:10 | and other compensatory articulations, like slowing of speech rate, because you have a larger mass to control, and so on |
|---|
| 0:28:16 | these things are sometimes mentioned but not much quantified |
|---|
| 0:28:21 | so we set out to say, well, we have lots of data: can we estimate a mean posture, run the segmentation, and come up with some proxy measure for tongue size |
|---|
| 0:28:38 | and once you do that, we can actually plot the distributions of tongue sizes across the male and female speakers in our corpus |
|---|
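A proxy measure of the kind described, derived from a segmentation, can be as simple as a pixel count per midsagittal frame; the tiny binary masks and pixel scale below are invented toy stand-ins, not the corpus data:

```python
import statistics

def tongue_area(mask, pixel_mm2=1.0):
    """Proxy tongue size: count segmented pixels in a midsagittal slice."""
    return sum(row.count(1) for row in mask) * pixel_mm2

# Toy binary masks standing in for midsagittal tongue segmentations.
speaker_masks = {
    "f1": [[1, 1, 0], [1, 0, 0]],
    "f2": [[1, 1, 0], [1, 1, 0]],
    "m1": [[1, 1, 1], [1, 1, 0]],
    "m2": [[1, 1, 1], [1, 1, 1]],
}
areas = {s: tongue_area(m) for s, m in speaker_masks.items()}
female_mean = statistics.mean(areas[s] for s in ("f1", "f2"))
male_mean = statistics.mean(areas[s] for s in ("m1", "m2"))
```

Comparing the resulting per-group distributions is then a standard statistics exercise; the toy numbers here simply mimic the male/female separation the talk reports.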
| 0:28:46 | what we see is that the green is female and the other is male, and on average there are significant sex differences in tongue size |
|---|
| 0:29:00 | so if we can get at it from the acoustic signal, it is yet another sort of interpretable marker |
|---|
| 0:29:11 | how well we can infer it from the formant structure is still not really well established; again an open question, how do you really assess this thing, but we have taken a shot |
|---|
| 0:29:28 | we did different kinds of normalization, looking at the rest shape and at the shape during movement; there is not much difference between them, they are pretty highly correlated |
|---|
| 0:29:39 | so once you have that, we can actually use this information in simulations, say in a vocal tract model; people still study speech production with classic articulatory models |
|---|
| 0:29:56 | there you can actually reflect this back and try to study it through analysis by synthesis |
|---|
| 0:30:01 | so if you have a larger tongue we can expect longer constrictions, and so on; what we did was to vary, based on the measurements, different constriction lengths and locations, just to see how tongue-size differences play a role in the acoustics, like in the formants |
|---|
| 0:30:19 | what we observed is that the tongue-size differences we had in the population and what was estimated by the simulations were very well correlated in terms of the formants |
|---|
| 0:30:30 | it was very nice; what you see here is that in the simulations the formants of the vowels moved in like ways |
|---|
| 0:30:43 | so the general trends hold: all in all, in the pilot we saw that tongue size varies across speakers by quite a bit, fifteen up to thirty percent |
|---|
| 0:30:56 | a consequence of a large tongue is longer constrictions made in the vocal tract as people produce sounds, and constrictions are very central to how we produce speech sounds |
|---|
| 0:31:08 | they tend to stretch and twist the vowel space, so that is a signal we can play with |
|---|
| 0:31:17 | but this interplay between constriction formation and tongue size is fairly complex and requires much more sophisticated learning and modeling |
|---|
| 0:31:27 | hopefully, with data, these things can be pursued |
|---|
| 0:31:33 | so the final thing, another note on the side of speaker-specific behaviour, is to actually talk about articulatory strategy |
|---|
| 0:31:40 | what i mean by that is how talkers move their vocal tracts; as you know, the vocal tract is actually a pretty clever system, a very redundant system with lots of degrees of freedom |
|---|
| 0:31:52 | you can use different articulators to complete the same task; for example, you can move the jaw or the lips, both of which contribute to bilabial constrictions, like making p and b |
|---|
| 0:32:06 | and one person may use more jaw while another uses more lips |
|---|
| 0:32:09 | people have several ways of changing their airway shapes to do this, and so we call these articulatory strategies; some of them are speaker specific and some are language specific |
|---|
| 0:32:16 | we want to get at this because it is again yet another piece of the puzzle as you try to understand what makes me different from you when we produce a speech signal, beyond just knowing that i am different from you from the speech alone |
|---|
| 0:32:33 | okay |
|---|
| 0:32:36 | this is, again, very early work; we have lots of real-time mri data |
|---|
| 0:32:40 | the database we collected is from a pilot study of eighteen speakers, with vocal tract outlines and volumes and all of that, annotated in very detailed ways |
|---|
| 0:32:50 | and so we can actually characterize the morphology and the speaking style |
|---|
| 0:32:57 | once we have that, we established what we call speaker-specific forward maps from the vocal tract shapes to the constrictions |
|---|
| 0:33:07 | imagine the shape changes that create these tasks as coming from a dynamical system; we estimate the forward maps in a linearization sense |
|---|
| 0:33:17 | and then we can pull out each of these speakers' forward maps, put them back into a synthesis model, a dynamical-systems model of the kind used in task dynamics |
|---|
| 0:33:28 | and see the contributions of the various articulators people actually use, to predict what articulatory strategies people adopt |
|---|
| 0:33:38 | so, again reminding you: we can go from the data, extract air-tissue boundaries, and do pca to extract factors, basically how much the jaw contributes, what the tongue factors are, and so on |
|---|
| 0:33:53 | and from that we can go and estimate the various constrictions at the places of articulation you are probably more familiar with |
|---|
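The PCA step described, extracting a small number of articulator factors from contour data, can be sketched like this; the synthetic "contours" below are generated from two hidden factors purely for illustration, not from MRI data:

```python
import numpy as np

# Toy stand-in for vocal-tract contour data: rows are frames, columns are
# flattened (x, y) contour coordinates driven by two hidden factors.
rng = np.random.default_rng(1)
jaw = rng.normal(size=(200, 1))      # hypothetical "jaw" factor
tongue = rng.normal(size=(200, 1))   # hypothetical "tongue" factor
basis = rng.normal(size=(2, 12))
contours = np.hstack([jaw, tongue]) @ basis + 0.01 * rng.normal(size=(200, 12))

# PCA via SVD of the mean-centred data, as in the factor extraction
# the talk describes.
X = contours - contours.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()
factors = X @ Vt[:2].T  # per-frame scores on the first two components
```

Because the toy data really is two-dimensional plus a little noise, the first two components recover nearly all the variance; on real contours one keeps however many components are interpretable as jaw, tongue, and so on.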
| 0:34:00 | along the vocal tract we mark out six different anatomical regions, like the alveolar ridge, the palate, the velum, the pharynx, and so on |
|---|
| 0:34:10 | and we can automatically estimate the baseline degree to which people use each |
|---|
| 0:34:18 | we have some insights from the roughly eighteen speakers that we analyzed; this is work by sorensen |
|---|
| 0:34:25 | briefly presented at interspeech; to fill in how we went about it, we used a model-based approach |
|---|
| 0:34:33 | we approximated the speaker-specific forward map from the real-time mri data from these eighteen speakers |
|---|
| 0:34:40 | and simulated with task dynamics, which comes from the motor control literature, an articulatory dynamical system |
|---|
| 0:34:48 | the dynamical systems are basically control systems written in state-space form |
|---|
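The state-space control systems mentioned here are, in the task dynamics tradition, typically critically damped second-order point attractors driving a task variable toward a target; a minimal Euler-integrated sketch, with arbitrary parameter values:

```python
def task_dynamics_step(z, dz, target, omega, dt):
    """One Euler step of a critically damped second-order point attractor:
    z'' = -2*omega*z' - omega**2 * (z - target)."""
    ddz = -2.0 * omega * dz - omega ** 2 * (z - target)
    dz = dz + dt * ddz
    z = z + dt * dz
    return z, dz

# Drive a constriction-degree variable from 10 mm toward a 2 mm target.
z, dz = 10.0, 0.0
for _ in range(2000):
    z, dz = task_dynamics_step(z, dz, target=2.0, omega=8.0, dt=0.005)
```

Critical damping means the variable settles at the target without overshoot; fitting such a system per speaker is what lets the articulator contributions be read off afterwards.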
| 0:34:54 | and then we were able to interpret the results; one of the results here basically represents the ratio of lips to jaw used by speakers to create various constrictions, bilabial, alveolar, palatal, along the vocal tract |
|---|
| 0:35:15 | and you see that there are different ratios of how much people use each |
|---|
| 0:35:22 | a value of one means relying more on the lips, zero means relying more on the jaw; different constrictions are created in different ways |
|---|
| 0:35:34 | in this work we see, on the right, that the lips contribute more than the jaw |
|---|
| 0:35:42 | except for alveolar, which is close to a tongue-tip target |
|---|
| 0:35:48 | and the speakers in our set vary in how they choose to create the same kind of constrictions, so people differ in their strategies |
|---|
| 0:35:59 | this is a very early insight into how much speakers use the jaw and the lips; there is a functional specificity, and questions about the motor planning behind it |
|---|
| 0:36:10 | there are questions here that are actually begging for a more computational approach; now, with the data in hand, we can go and see how people actually use the vocal instrument in producing these sounds that we call speech |
|---|
| 0:36:29 | so for the final thing, now we get to the kind of slides familiar at this conference |
|---|
| 0:36:35 | we have also explored a little bit whether production information could be of use in speaker recognition type experiments; we did a little early work on speaker verification with the production data |
|---|
| 0:36:48 | there is not much data, so nothing conclusive, but the question people commonly ask is this |
|---|
| 0:37:00 | would speech production data be of any use at all in speaker verification |
|---|
| 0:37:04 | we know we cannot count on getting data like what i was showing, x-ray or mri, in operational conditions |
|---|
| 0:37:15 | so we need to have some articulatory type representation; people have been working on inversion problems, that is |
|---|
| 0:37:23 | given the acoustics, can we estimate the articulatory parameters; this is the classic, in fact ill-posed, inversion problem |
|---|
| 0:37:31 | where i feel that deep learning approaches are very powerful, because it is a highly nonlinear process, so these mappings are very conducive to such models |
|---|
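A toy version of such an inversion map, a small network regressing a pretend articulatory parameter from a pretend acoustic feature, can be sketched as follows; the data, network size, and target function are all invented for illustration and are not the actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for acoustic-to-articulatory inversion: learn a nonlinear
# map from a 1-D "acoustic" feature to a 1-D "articulatory" parameter.
X = rng.uniform(-1.0, 1.0, size=(256, 1))
Y = np.sin(2.5 * X)  # pretend articulator trajectory, nonlinear in X

def forward(X, params):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    return H @ W2 + b2, H

# One hidden layer, trained with plain full-batch gradient descent.
params = [rng.normal(0.0, 0.5, (1, 16)), np.zeros(16),
          rng.normal(0.0, 0.5, (16, 1)), np.zeros(1)]
mse_before = float(((forward(X, params)[0] - Y) ** 2).mean())
lr = 0.2
for _ in range(2000):
    W1, b1, W2, b2 = params
    pred, H = forward(X, params)
    err = (pred - Y) / len(X)          # gradient of mean squared error
    dH = (err @ W2.T) * (1.0 - H ** 2)  # backprop through tanh
    params = [W1 - lr * (X.T @ dH), b1 - lr * dH.sum(0),
              W2 - lr * (H.T @ err), b2 - lr * err.sum(0)]
mse_after = float(((forward(X, params)[0] - Y) ** 2).mean())
```

The point is only that a nonlinear regressor can learn such a mapping from paired data; the speaker-independent trick described next sidesteps needing paired data for every speaker.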
| 0:37:41 | nevertheless, what we wanted was a speaker-independent mapping |
|---|
| 0:37:45 | so in this work from a few years ago we said, well, suppose i can learn the acoustic-to-articulatory mapping of an exemplary talker |
|---|
| 0:37:56 | i have lots of data from one single speaker, like in synthesis, where you always take the properties of one talker and then try to produce from them |
|---|
| 0:38:08 | then we can project anyone else's acoustics onto this speaker's maps, to see how this talker would have produced those acoustics, and get some semblance of an articulatory representation |
|---|
| 0:38:20 | so that we can do speaker-independent measures; that was the idea, so we said, well, we can use a reference speaker |
|---|
| 0:38:31 | to create an articulatory-acoustic map, train the inverse model on it, and then, when we get a test speaker's |
|---|
| 0:38:39 | acoustic signal |
|---|
| 0:38:42 | we can actually compute inverted features and use them to see if there is any benefit |
|---|
| 0:38:48 | the rationale is that it produces projections in a robust way and provides physically meaningful constraints on how we parameterize the signal |
|---|
| 0:39:05 | so there might be some advantage to be had |
|---|
| 0:39:08 | this was published earlier this year in csl |
|---|
| 0:39:15 | for some of these early experiments we used the x-ray microbeam database, which is available and has a lot of speakers |
|---|
| 0:39:27 | and a standard gmm model, because we do not have that much data |
|---|
| 0:39:32 | here are some of the initial results; if you use just mfccs only |
|---|
| 0:39:39 | for this small, pretty noisy data set you get about seven point five percent eer; but if you actually have the real, measured articulation you get a boost |
|---|
| 0:39:57 | it provides nice complementary information, which is kind of encouraging; you might think about it as an oracle experiment, or an upper bound |
|---|
| 0:40:06 | now, if you use the inverted measurements instead, they compare really well, slightly better, and putting them together actually provides an additional boost, which is pretty significant |
|---|
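The gain from putting the two streams together is, at its simplest, score-level fusion: a weighted sum of per-trial scores, with EER computed by sweeping a decision threshold. The scores below are invented toy numbers, not the paper's results:

```python
def fuse(acoustic, articulatory, w=0.5):
    """Score-level fusion of two per-trial scores."""
    return [w * a + (1.0 - w) * b for a, b in zip(acoustic, articulatory)]

def eer(target_scores, nontarget_scores):
    """Equal error rate by sweeping the decision threshold over all scores."""
    best = 1.0
    for t in sorted(target_scores + nontarget_scores):
        miss = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        best = min(best, max(miss, fa))
    return best

# Toy scores: the articulatory stream fixes one trial the acoustic one misses.
tgt_ac, non_ac = [2.0, 1.5, -0.5], [-1.0, -1.5, 0.6]
tgt_ar, non_ar = [1.0, 0.8, 1.2], [-0.2, -1.0, -0.4]
fused_tgt = fuse(tgt_ac, tgt_ar)
fused_non = fuse(non_ac, non_ar)
```

On this toy data the fused scores separate targets from nontargets perfectly while the acoustic stream alone does not, which is the shape of the complementarity the talk reports.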
| 0:40:20 | this is encouraging: if you have enough data to create these maps across speakers, and we need just exemplars in each case |
|---|
| 0:40:30 | then we can provide an additional source of information |
|---|
| 0:40:32 | perhaps it will give us some wins, but maybe also some insight into why people are different, or in what categories of articulation or structure or strategy they differ |
|---|
| 0:40:47 | this is just the standard setup, showing the same thing on the full x-ray microbeam database |
|---|
| 0:40:59 | to summarize the speaker recognition experiments, which are the first steps in using both acoustic and articulatory information |
|---|
| 0:41:07 | there is a significant eer benefit if you use measured articulatory information with the standard acoustic features |
|---|
| 0:41:16 | gains are more modest if we instead use estimated articulatory information |
|---|
| 0:41:22 | so what would be nice is to actually look at new ways of doing inversion, with the kinds of advances that are happening right now in neural modeling |
|---|
| 0:41:32 | and the availability of data, more and more data, to do this better |
|---|
| 0:41:38 | and to be able to evaluate on larger acoustic data sets, from sre-like campaigns |
|---|
| 0:41:45 | so mowing for most on | 
|---|
| 0:41:48 | so we're very excited about no some of this actually | 
|---|
| 0:41:52 | a premier work was done with my collaborators that lincoln laboratory some point your unique | 
|---|
| 0:41:58 | model is gonna | 
|---|
| 0:41:59 | and parallel work was mice your voice now also their | 
|---|
| 0:42:03 | and so we had some initial pilot work and then | 
|---|
| 0:42:06 | i recently got an innocent right actually to a and you the slider work people | 
|---|
| 0:42:10 | actually | 
|---|
| 0:42:12 | or okay we're doing speed signs looks like | 
|---|
| 0:42:14 | so we are excited about it | 
|---|
| 0:42:16 | our idea is to do this very systematically; we are set to collect about two hundred subjects with |
|---|
| 0:42:24 | real-time and volumetric mri, in detail, and share the data with people |
|---|
| 0:42:31 | we describe this in an upcoming paper |
|---|
| 0:42:36 | i will share the slides if people want them; we have collected about ten speakers so far |
|---|
| 0:42:47 | since the project started |
|---|
| 0:42:49 | everything from the rainbow passage to all kinds of spontaneous speech and so on |
|---|
| 0:42:56 | if you have any suggestions or ideas about what would be useful for speaker modeling, we would love to consider them |
|---|
| 0:43:04 | most of the subjects will be native speakers of english, and about twenty percent could be nonnative speakers of english |
|---|
| 0:43:11 | but we have other projects collecting a lot of people speaking other languages, everything from african languages to others |
|---|
| 0:43:20 | finally, beyond getting insights into inter-speaker variability, we can also look at some of these use cases |
|---|
| 0:43:27 | in the case of the developing vocal tract in kids, or how speaker variability manifests in the signal; so for example |
|---|
| 0:43:37 | we have been working with people who have had operations for oral cancer |
|---|
| 0:43:42 | the surgical interventions basically remove parts of the tongue |
|---|
| 0:43:48 | on top of that there are other therapeutic treatments, like radiation, that |
|---|
| 0:43:53 | cause modified physical structure or damage to the tissue |
|---|
| 0:43:58 | here we see two patients |
|---|
| 0:44:03 | one basically lost most of the tongue because of cancer at the tongue base |
|---|
| 0:44:10 | and it was replaced, reconstructed with a flap from the forearm |
|---|
| 0:44:15 | you see the variation compared with the normal anatomy here |
|---|
| 0:44:21 | so how does their speech cope with this; speech and swallowing is one of the big quality of life measures |
|---|
| 0:44:26 | these cases also give us additional insights when looking at speaker variability |
|---|
| 0:44:35 | interestingly, in some of these cases people have retained remarkable ability |
|---|
| 0:44:39 | we have access to these speakers and have collected a lot of data from them |
|---|
| 0:44:46 | so we can compare how they compensate, what strategies they use; one person |
|---|
| 0:44:54 | speaks pretty intelligibly, pretty well |
|---|
| 0:44:57 | this provides an additional source of information to understand this question of individual variability |
|---|
| 0:45:05 | so in conclusion |
|---|
| 0:45:07 | the point i want to make is, yes, data is integral to advancing speech communication research, and vocal tract information is a crucial piece of this puzzle, i believe |
|---|
| 0:45:20 | to do that we need to gather data from lots of different sources to get a complete picture of speech production |
|---|
| 0:45:27 | that is not trivial, from a technological and computational as well as a conceptual and theoretical perspective |
|---|
| 0:45:35 | but i do believe there are rich dividends, for applications including machine speech recognition and speaker modeling |
|---|
| 0:45:44 | and this sort of approach is very interdisciplinary, so people have to come together to work on these topics |
|---|
| 0:45:52 | and share |
|---|
| 0:45:53 | these are some of the people in my speech production lab, past and present, who contributed to this work |
|---|
| 0:46:04 | including this particular collection: my colleague who does all the imaging work |
|---|
| 0:46:11 | the mr scientists, and the linguists |
|---|
| 0:46:16 | the linguists provide the conceptual framework for how we approach |
|---|
| 0:46:21 | questions like these; and the students who did all this work, on the imaging and on the morphology, whose models i was talking about |
|---|
| 0:46:34 | and those who took a lot of this and actually translated it to speaker verification |
|---|
| 0:46:39 | and collaborators who made things like i-vector systems available for this work |
|---|
| 0:46:46 | and finally, the organizers; they have been very supportive, which has been important for this, and have been pushing us to |
|---|
| 0:46:56 | bring this sort of thing here too |
|---|
| 0:46:58 | so with that, i thank all of you for listening to me |
|---|
| 0:47:04 | all of this is available online if you are interested |
|---|
| 0:47:09 | thank you very much |
|---|
| 0:47:32 | thank you very much, that was fascinating; two questions |
|---|
| 0:47:38 | first of all, when are you going to get to the larynx |
|---|
| 0:47:42 | because, and i am talking from the perspective of |
|---|
| 0:47:48 | the forensic phoneticians |
|---|
| 0:47:54 | we are conscious of between-speaker differences from the larynx, in the spectral slope, that sort of thing, but in this data that is suppressed |
|---|
| 0:48:05 | and also, secondly |
|---|
| 0:48:09 | what knowledge, which would give almost more robust features, do we have about speaker variability in |
|---|
| 0:48:20 | the nasal cavity, the sinuses, that sort of thing |
|---|
| 0:48:30 | though, granted, you are not going to get that in telephone speech and so forth; anything above three kilohertz is gone |
|---|
| 0:48:37 | so the first question is about the larynx |
|---|
| 0:48:41 | so, here in this region |
|---|
| 0:48:45 | the glottal, voice-source phenomena happen at a much higher rate |
|---|
| 0:48:50 | and mri is still not fast enough; it is about |
|---|
| 0:48:54 | we can go to about a hundred frames per second here |
|---|
| 0:48:58 | so what people have been doing, particular groups, is |
|---|
| 0:49:04 | high speed imaging of the larynx, putting a camera through the nose |
|---|
| 0:49:08 | which is a little bit of an intervention |
|---|
| 0:49:14 | on the other hand |
|---|
| 0:49:15 | what we can do with mri is look at things like larynx height and other things, and also get some |
|---|
| 0:49:22 | f-zero-related information |
|---|
| 0:49:25 | and particularly, one of the things mri offers is a complete view of the region, which we can really use |
|---|
| 0:49:31 | this is not available in any of the other modalities; so you can look at these sorts of |
|---|
| 0:49:40 | behavioral phenomena |
|---|
| 0:49:42 | and in terms of actually characterizing things like the sinuses and so on, which do not change very much during speech, we can characterize those anatomically; that is where |
|---|
| 0:49:49 | we use t2-weighted contrast images |
|---|
| 0:49:51 | to really characterize every speaker by the anatomy they have, in terms of |
|---|
| 0:49:55 | which we can actually get |
|---|
| 0:50:00 | a good anatomical characterization of a speaker, and see how to relate it to, or account for it in, the signal |
|---|
| 0:50:06 | and so we are trying to see how we can |
|---|
| 0:50:10 | in a controlled way, do some multimodal imaging of the voice source; we have tried |
|---|
| 0:50:15 | but these are quite small windows into this thing |
|---|
| 0:50:19 | we want to see the high speed stuff |
|---|
| 0:50:23 | still an open question in terms of laryngeal imaging |
|---|
| 0:50:29 | i can point you to references, like in the previous slides, for people interested |
|---|
| 0:50:40 | any more questions |
|---|
| 0:50:53 | is it possible to say, broadly |
|---|
| 0:50:55 | if there are any particular areas that show the greatest amount of between-speaker difference |
|---|
| 0:51:03 | so that if you are going to look for where somebody is completely distinctive, you know where to look; or is it just that people differ in all |
|---|
| 0:51:11 | sorts of different ways |
|---|
| 0:51:14 | i think the latter is my guess right now, although |
|---|
| 0:51:18 | i do think they will begin to cluster |
|---|
| 0:51:21 | once we increase the numbers |
|---|
| 0:51:25 | just like what we do with eigenvoices and |
|---|
| 0:51:28 | eigenfaces; i am sure things will start clustering as we get more data |
|---|
| 0:51:33 | right now the sources of variability seem to be, from our point of view, all over the place |
|---|
| 0:51:40 | plus how people speak also varies quite a bit, because of |
|---|
| 0:51:46 | where they come from, how they learned, and so on, and the practices people use |
|---|
| 0:51:51 | there are other pieces of work i could talk about, on articulatory setting, and |
|---|
| 0:51:56 | ideas about |
|---|
| 0:51:59 | how people actually |
|---|
| 0:52:04 | extract parameters |
|---|
| 0:52:06 | from a motor control point of view, and why people differ, whether it can be attributed to language or |
|---|
| 0:52:11 | background or other kinds of things; still an open question |
|---|
| 0:52:15 | but what i feel encouraged by is that these are very small datasets we are talking about, compared to what you have been used to on |
|---|
| 0:52:23 | the speech side |
|---|
| 0:52:25 | but if we increase this to some extent |
|---|
| 0:52:28 | and harness the kind of computational tools and advances that you are making, i think |
|---|
| 0:52:33 | we can slowly begin to understand this at a deeper level |
|---|
| 0:52:40 | open question |
|---|
| 0:52:49 | first, let me make a comment |
|---|
| 0:52:53 | you put up a kind of acoustic tube model; well, i remember one thing from one of the workshops from |
|---|
| 0:53:00 | the early nineties |
|---|
| 0:53:02 | from the mid sixties up until the late eighties or early nineties, we used an acoustic tube |
|---|
| 0:53:09 | model that was essentially straight |
|---|
| 0:53:12 | and we had a summer student who basically spent the summer saying, well |
|---|
| 0:53:18 | actually the vocal tract has a right-angle turn, and no one had really thought |
|---|
| 0:53:23 | about how much that right angle actually impacts formant locations |
|---|
| 0:53:28 | and bandwidths |
|---|
| 0:53:29 | so he formulated a closed-form solution, and i think it |
|---|
| 0:53:34 | was between one and three percent shifts in formant locations and bandwidths; so very much |
|---|
| 0:53:39 | in line with the physiology you are taking care of; now my basic question |
|---|
| 0:53:44 | you focused on speaker id |
|---|
| 0:53:47 | i am assuming many of your speakers here are bilingual; have you thought about looking at language |
|---|
| 0:53:52 | id, to see if the physiological production systematically changes between people speaking one language versus another |
|---|
| 0:54:00 | absolutely, along those lines. for the first comment that john hansen made, which was |
|---|
| 0:54:05 | regarding the vocal tract bend, people have actually done simulations |
|---|
| 0:54:11 | of |
|---|
| 0:54:12 | the |
|---|
| 0:54:13 | articulation to acoustics mapping and the effect of the bend. in fact there is a classic paper |
|---|
| 0:54:17 | on this, on the |
|---|
| 0:54:19 | analysis of the bend, from |
|---|
| 0:54:21 | a long time ago |
|---|
| 0:54:23 | that actually estimated it at about three to five percent, and the student's result was verified by |
|---|
| 0:54:27 | simulations later on |
|---|
| 0:54:31 | i would have to look up the exact reference |
|---|
| 0:54:34 | and |
|---|
| 0:54:36 | so i think the more recent models try to do this, you know, with |
|---|
| 0:54:40 | full finite element and fluid dynamics simulations, and the ones we can do with the |
|---|
| 0:54:44 | data we now have access to are the ones like what i talked about, right |
|---|
| 0:54:48 | for all the postures from all these speakers we have that |
|---|
| 0:54:50 | so with high performance computing |
|---|
| 0:54:53 | this is becoming a reality; what we are planning and want to do right |
|---|
| 0:54:56 | now is |
|---|
| 0:54:58 | possible |
|---|
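As an editorial aside: the percent-level formant sensitivities quoted above (one to three percent from the closed-form analysis, three to five from the classic paper) can be put in context with a back-of-the-envelope sketch. The code below is not a model of the right-angle bend itself; it only uses the standard quarter-wave resonator approximation for a uniform tube, with an assumed 17.5 cm tract length and an illustrative 2% geometric perturbation, to show what a percent-level shift in formant frequencies looks like.

```python
# Quarter-wave resonator formants for a uniform tube closed at the glottis
# and open at the lips: F_n = (2n - 1) * c / (4 * L).  This is NOT a model
# of the right-angle bend discussed above; it is only a back-of-the-envelope
# illustration of how a small change in effective tract geometry maps to
# percent-level formant shifts of the size quoted in the discussion.
C = 35000.0  # approximate speed of sound in warm, moist air, cm/s

def tube_formants(length_cm, n=3):
    """First n formants of a uniform quarter-wave tube, in Hz."""
    return [(2 * k - 1) * C / (4.0 * length_cm) for k in range(1, n + 1)]

base = tube_formants(17.5)              # ~17.5 cm adult vocal tract
perturbed = tube_formants(17.5 * 1.02)  # 2% longer effective tract

for f0, f1 in zip(base, perturbed):
    shift = 100.0 * (f1 - f0) / f0
    print(f"{f0:7.1f} Hz -> {f1:7.1f} Hz  ({shift:+.2f}%)")
```

A 2% change in effective length moves every formant by about 2% in this idealized model, the same order of magnitude as the shifts attributed to the bend.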
| 0:55:00 | and the second question |
|---|
| 0:55:03 | john, a reminder? |
|---|
| 0:55:07 | ah, the language id. yes, of course, we have actually |
|---|
| 0:55:10 | about |
|---|
| 0:55:11 | forty or fifty different first languages, speakers with english as a second language who |
|---|
| 0:55:17 | speak english, in our datasets, across the various cross-linguistic experiments we have been doing |
|---|
| 0:55:22 | so one of the things we |
|---|
| 0:55:24 | did with the data, |
|---|
| 0:55:25 | a little bit, not as much maybe as |
|---|
| 0:55:27 | people with an intuition for language id, |
|---|
| 0:55:31 | who may have some hypotheses and so on, is that we looked at things like articulatory setting |
|---|
| 0:55:34 | you know, which is |
|---|
| 0:55:36 | the place from which you start executing a task, right, from rest to speech |
|---|
| 0:55:40 | so if you think about it as a dynamical system, right, as you know, for |
|---|
| 0:55:44 | an individual constriction, in such modelling your initial state is important: the state from |
|---|
| 0:55:48 | which you go to another state, where you set up, |
|---|
| 0:55:53 | release that particular task and go to the next, making one constriction after another, on |
|---|
| 0:55:57 | and on. and we found that people have preferred settings from which |
|---|
| 0:56:02 | they start executing, and that's very language specific: we showed this for german speakers |
|---|
| 0:56:07 | and spanish speakers versus english speakers. so these kinds of things can be estimated from |
|---|
| 0:56:11 | articulatory data |
|---|
| 0:56:13 | the inversion from acoustics, no, we have not done that yet |
|---|
| 0:56:17 | but that's quite possible, and, you know, we are happy to share data |
|---|
| 0:56:21 | there are two people waiting |
|---|
| 0:56:26 | okay |
|---|
| 0:56:27 | sure, you first, okay |
|---|
| 0:56:32 | okay so |
|---|
| 0:56:34 | i have a comment i would like you to respond to |
|---|
| 0:56:37 | one of the old problems in speaker recognition is what happens between the talker |
|---|
| 0:56:44 | and the recorded speech, right |
|---|
| 0:56:48 | the first thing everyone does is |
|---|
| 0:56:51 | cepstral mean subtraction |
|---|
| 0:56:54 | which basically throws away the average shape of the vocal tract |
|---|
| 0:57:02 | how does that sort of |
|---|
| 0:57:04 | impact what you do |
|---|
| 0:57:07 | right, so, you know, i didn't talk about the channel effects and channel normalization |
|---|
| 0:57:11 | things that happen due to the recording conditions and so on, right, so |
|---|
| 0:57:16 | one of the things that we are contemplating, like many people |
|---|
| 0:57:19 | have been talking about joint factor analysis or these kinds of approaches, even with these |
|---|
| 0:57:24 | new deep learning systems, right |
|---|
| 0:57:27 | is that you could model these multiple factors jointly together, to see how |
|---|
| 0:57:31 | we can have speaker specific variability measures |
|---|
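For reference, the joint factor analysis mentioned here is a standard formulation from the speaker recognition literature (not something specific to this lab's work): it decomposes a session-dependent GMM mean supervector into additive speaker and channel components.

```latex
% Classical joint factor analysis supervector decomposition:
% a session-dependent supervector M splits into a speaker-independent
% mean, a speaker part, a channel part, and a speaker-specific residual.
M = m + Vy + Ux + Dz
```

Here $m$ is the speaker-independent (UBM) mean supervector, $Vy$ captures speaker factors via the eigenvoice matrix $V$, $Ux$ captures channel and session factors via the eigenchannel matrix $U$, and $Dz$ is a diagonal residual term; the suggestion in the answer is that articulatory and extraneous factors could be teased apart in a similar joint fashion.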
| 0:57:35 | separated from things that are caused by other |
|---|
| 0:57:39 | extraneous |
|---|
| 0:57:42 | interferences or other kinds of transformations that might happen |
|---|
| 0:57:46 | so that's why we are doing first principles type things, right: the way we |
|---|
| 0:57:51 | want to do it is not to just make the jump into throwing all of this into |
|---|
| 0:57:55 | some, you know, machine learning tool and begin to estimate blindly |
|---|
| 0:58:01 | but to systematically look at linguistic theory, speech science, acoustic features, analysis by |
|---|
| 0:58:07 | synthesis type approaches, and then we can see, well, if you have other |
|---|
| 0:58:11 | kinds of conditions, |
|---|
| 0:58:15 | both |
|---|
| 0:58:16 | controlled and open environment speech recordings, |
|---|
| 0:58:19 | for instance distant speech recording, which is of much interest to a lot of us |
|---|
| 0:58:23 | for various reasons, |
|---|
| 0:58:26 | whether we can account for these things. so i tend to believe in that kind of |
|---|
| 0:58:30 | more organic approach |
|---|
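The point raised in the question about cepstral mean subtraction can be made concrete in a few lines. This is a toy sketch on synthetic features (not the speaker's actual pipeline): a time-invariant convolutive channel is additive in the cepstral domain, so per-utterance mean subtraction removes it, but it also removes any constant speaker-specific offset such as the average vocal-tract shape, which is exactly the trade-off the questioner highlights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-frame cepstral features: 200 frames x 13 coefficients.
frames = rng.normal(size=(200, 13))

# A time-invariant convolutive channel (and, by the same argument, the
# speaker's average vocal-tract shape) shows up as a constant additive
# offset in the cepstral domain.
channel = rng.normal(size=13)
observed = frames + channel

# Cepstral mean subtraction: remove the per-utterance mean over time.
cms = observed - observed.mean(axis=0, keepdims=True)

# The constant offset is gone, but so is the speaker's own cepstral mean,
# which carries the average vocal-tract information.
reference = frames - frames.mean(axis=0, keepdims=True)
print(np.allclose(cms, reference))  # prints True
```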
| 0:58:38 | we have time for one more question maybe, please be quick |
|---|
| 0:58:44 | i |
|---|
| 0:58:47 | i'm sorry, i will be fast |
|---|
| 0:58:50 | i |
|---|
| 0:58:51 | i want first to thank you, it was very nice |
|---|
| 0:58:57 | to see |
|---|
| 0:58:59 | science |
|---|
| 0:59:00 | meet technology, particularly in speaker recognition and in forensics, so |
|---|
| 0:59:06 | my comment is just to remind us of the difference between speaker recognition and forensic voice |
|---|
| 0:59:12 | comparison |
|---|
| 0:59:14 | because it really concerns both, and |
|---|
| 0:59:17 | the field |
|---|
| 0:59:18 | you presented |
|---|
| 0:59:20 | because |
|---|
| 0:59:21 | we know that when we try to do some articulatory analysis, we think like |
|---|
| 0:59:26 | this |
|---|
| 0:59:27 | we have a huge difference between controlled read speech |
|---|
| 0:59:32 | and spontaneous conversational |
|---|
| 0:59:37 | speech, right |
|---|
| 0:59:40 | for speaker recognition we could imagine that the speakers are trying to |
|---|
| 0:59:46 | be |
|---|
| 0:59:47 | cooperative and are not trying to disguise their voice |
|---|
| 0:59:50 | in forensic voice |
|---|
| 0:59:51 | comparison |
|---|
| 0:59:52 | we could imagine exactly the opposite, and here is my question |
|---|
| 0:59:58 | could a speaker deliberately |
|---|
| 1:00:01 | modify |
|---|
| 1:00:03 | their constrictions or articulatory strategies in a way that would |
|---|
| 1:00:08 | challenge the approach you exposed |
|---|
| 1:00:12 | yes, and, right, because there are certain things we can change and certain things we can't, right |
|---|
| 1:00:16 | they are given, and that's one of the things that we are trying to go after |
|---|
| 1:00:20 | there is something given in our physical instrument; one can compensate for it |
|---|
| 1:00:25 | somewhat, but we may still see residual effects, and we want to see whether we can get |
|---|
| 1:00:29 | at this residual effect, maybe |
|---|
| 1:00:31 | the bounds. you know, i have a bit of an information theory background |
|---|
| 1:00:35 | so i'm always interested in bounds, the limits of things, how much we can actually recover |
|---|
| 1:00:39 | after all we have |
|---|
| 1:00:40 | a one dimensional signal, which we project onto all kinds of feature spaces and |
|---|
| 1:00:44 | do all our computation based on that, to do all the inference problems, target speaker |
|---|
| 1:00:49 | or whatever it is, and so |
|---|
| 1:00:52 | say you manipulate the strategies; that's only one degree of freedom, or a few, that you |
|---|
| 1:00:58 | can manipulate |
|---|
| 1:01:00 | and it causes some differences, but still, if we can account for this somehow |
|---|
| 1:01:04 | can we still see the residual effects of the instrument that they have, or the |
|---|
| 1:01:10 | specific ways they are |
|---|
| 1:01:11 | changing the shape? speakers work within the instrument they have, right; you can't |
|---|
| 1:01:17 | just do random things with your articulation to create speech sounds, right, so that's |
|---|
| 1:01:23 | why this joint modelling of, you know, structure and function would be very |
|---|
| 1:01:28 | interesting, to see how much can be spoofed by people, like, you know, if |
|---|
| 1:01:32 | you're really good at it |
|---|
| 1:01:33 | it remains to be seen, you know, but |
|---|
| 1:01:36 | i'm hoping that by, you know |
|---|
| 1:01:38 | being very microscopic in these analyses we can get some insight into it |
|---|
| 1:01:43 | in a way that is very objective, not, you know |
|---|
| 1:01:46 | just |
|---|
| 1:01:48 | impressionistic, you know, the way all these experts will talk about it |
|---|
| 1:01:52 | you know, in court |
|---|
| 1:01:55 | i think that's one of the reasons |
|---|
| 1:01:57 | there was very |
|---|
| 1:01:59 | broad support for the idea: let's go at it in as objective and scientifically |
|---|
| 1:02:03 | grounded a way as possible |
|---|
| 1:02:06 | we are unfortunately out of time |
|---|
| 1:02:11 | so |
|---|
| 1:02:13 | let's thank the speaker again. thank you, thank you |
|---|