| 0:00:15 | thank you |
|---|
| 0:00:18 | it's wonderful to be here and to be part of this august twenty sixteen meeting |
|---|
| 0:00:21 | thank you, it's an honour to have been invited here to this meeting |
|---|
| 0:00:27 | so, with that, okay, the talk i'm going to give today is |
|---|
| 0:00:31 | about a very classic problem, a question in speech communication: understanding variability and invariance in |
|---|
| 0:00:38 | speech |
|---|
| 0:00:40 | people have been asking this for a long time |
|---|
| 0:00:42 | so | 
|---|
| 0:00:43 | the specific sort of focus today is the variability of the vocal instrument we have to |
|---|
| 0:00:48 | produce speech |
|---|
| 0:00:52 | here are six different people, just showing midsagittal slices of their vocal tracts |
|---|
| 0:00:58 | and we can see immediately each has a very uniquely shaped vocal instrument |
|---|
| 0:01:03 | with which they produce speech, and which is what you're trying to use for |
|---|
| 0:01:07 | doing speaker recognition: the speech signal is produced through this vocal instrument |
|---|
| 0:01:11 | in fact, just to orient yourself if you're not familiar with this kind of looking |
|---|
| 0:01:15 | into the |
|---|
| 0:01:17 | mouth |
|---|
| 0:01:19 | here are the lips, the nose, and the tongue, and the velum, or |
|---|
| 0:01:22 | the soft palate, that, you know, |
|---|
| 0:01:25 | moves there; just so you know, because you'll see a lot of these pictures in my talk today |
|---|
| 0:01:30 | here is a group of a few more people |
|---|
| 0:01:33 | all of them trying to produce the same well-known vowel |
|---|
| 0:01:36 | but you can just take a quick look at it and you see, even statically, these |
|---|
| 0:01:40 | people produce these sounds slightly differently; if we look at, like, another |
|---|
| 0:01:43 | example |
|---|
| 0:01:44 | there, like, you know, the first and second speakers vary in how the tongue is raised |
|---|
| 0:01:50 | and tucked in |
|---|
| 0:01:50 | to make the gesture for making that sound; well, they're slightly different |
|---|
| 0:01:55 | so | 
|---|
| 0:01:57 | we kind of know that both the structure in |
|---|
| 0:02:02 | which this speech production happens and how we produce speech vary across people |
|---|
| 0:02:07 | and some of it is reflected in the speech signal, which is |
|---|
| 0:02:09 | just what we're trying to sort of get at |
|---|
| 0:02:14 | so the aim of this line of work is to say, well, what role |
|---|
| 0:02:17 | can speech science, you know, play in understanding and supporting speech technology development? not only |
|---|
| 0:02:24 | do we want to recognize speakers, we want to know what makes them different |
|---|
| 0:02:29 | so specifically, you know, the focus today |
|---|
| 0:02:33 | is to look at vocal tract structure, the physical instrument a person is given, and function, |
|---|
| 0:02:37 | the behaviour within that apparatus for producing speech, |
|---|
| 0:02:41 | and the interplay between them |
|---|
| 0:02:42 | so by structure i mean the physical characteristics of this vocal tract apparatus that we have, |
|---|
| 0:02:48 | right, like the hard palate geometry, the oral volume, you know, |
|---|
| 0:02:51 | the length of the vocal tract, the velum, the tongue mass |
|---|
| 0:02:54 | function typically refers to the behavioural characteristics of speech articulation: |
|---|
| 0:02:58 | how we dynamically move, for example, to produce the consonants and vowels, the constrictions in the vocal |
|---|
| 0:03:03 | tract; you know, to make a sound like /s/ the tongue is raised towards the palate, |
|---|
| 0:03:07 | you know, |
|---|
| 0:03:09 | and creates a narrow channel to |
|---|
| 0:03:11 | create turbulence |
|---|
| 0:03:15 | so | 
|---|
| 0:03:16 | this leads to very specific questions we ask, right: how are individual vocal tract differences, |
|---|
| 0:03:21 | like in those pictures of people, reflected in the speech acoustics |
|---|
| 0:03:25 | can they, as the inverse problem, be predicted from the acoustics |
|---|
| 0:03:30 | how do people sort of, you know, accommodate for structural differences to |
|---|
| 0:03:34 | create phonetic equivalence, right, because we all try to communicate using a shared speech code and language |
|---|
| 0:03:40 | and, most pointedly, what contributes to distinguishing speakers from one another from the |
|---|
| 0:03:44 | speech |
|---|
| 0:03:45 | right, so i want to emphasise: not only are we trying to differentiate individuals from |
|---|
| 0:03:50 | their speech signal, but to understand what makes them different, in structure and behaviour |
|---|
| 0:03:55 | so let's tackle some of this |
|---|
| 0:03:59 | sort of one by one |
|---|
| 0:04:02 | so we'll try to see how we can quantify individual variability given vocal tract data, |
|---|
| 0:04:07 | try to see if we can predict some of these from the signal, and |
|---|
| 0:04:10 | what the bounds of it are, and so on, |
|---|
| 0:04:13 | how individual articulatory strategies differ, and whether we can explore, you know, automatic speaker |
|---|
| 0:04:19 | recognition type, you know, applications, and |
|---|
| 0:04:23 | offer some interpretation while doing so |
|---|
| 0:04:25 | so, the approach we take in our laboratory: |
|---|
| 0:04:29 | one of my research groups is the speech production and articulation knowledge |
|---|
| 0:04:33 | group; it looks at a lot of different questions, including questions of variability, so we take a multimodal |
|---|
| 0:04:39 | approach |
|---|
| 0:04:39 | and look at different kinds of ways of getting at speech production, you know: |
|---|
| 0:04:44 | MRI, which i'll talk about a lot today, audio, and other kinds of measurement |
|---|
| 0:04:48 | technologies, and a whole lot of multimodal processing: image processing and, you |
|---|
| 0:04:53 | know, speech processing, and then modelling based on that, |
|---|
| 0:04:57 | and try to use |
|---|
| 0:04:58 | these kinds of engineering advances to gain insights about the dynamics of production, speaker variability, |
|---|
| 0:05:06 | questions about speaking style, prosody, emotions |
|---|
| 0:05:10 | so the rest of the talk is structured as follows |
|---|
| 0:05:14 | i'll focus the first part on seeing how we can measure speech production, right, |
|---|
| 0:05:19 | how do we get those images and so on, with a particular focus on |
|---|
| 0:05:24 | MRI, magnetic resonance imaging, something that we've been trying to develop a lot |
|---|
| 0:05:27 | and then, given datasets of data, how do we analyze them; and i'll end with sort |
|---|
| 0:05:33 | of some modeling questions |
|---|
| 0:05:35 | so | 
|---|
| 0:05:36 | how do you get at vocal tract imaging |
|---|
| 0:05:39 | this has been very central to speech science, you know, for a long time, |
|---|
| 0:05:44 | right, the aim to observe and measure articulatory details, the tongue surface and so on, and |
|---|
| 0:05:50 | there are a number of techniques, you know, each with its own strengths and limitations |
|---|
| 0:05:54 | you know, for example, the early sort of x-ray movies that were made, right, like, you know, |
|---|
| 0:05:58 | the ones by ken stevens and so on; x-rays, you know, |
|---|
| 0:06:03 | have got pretty good temporal resolution, but they're not safe for |
|---|
| 0:06:07 | people, so that's no longer a methodology; and then a number of other techniques, like ultrasound, |
|---|
| 0:06:13 | which provide you a partial view of the insides and are not necessarily helpful for the kinds |
|---|
| 0:06:18 | of modeling we are after, and things like electropalatography, shown in this picture |
|---|
| 0:06:23 | so here actually is an x-ray |
|---|
| 0:06:26 | let me play that |
|---|
| 0:06:31 | this, in fact, is ken stevens |
|---|
| 0:06:34 | right; and this is ultrasound, so you only see the tongue surface, parts of it; you only |
|---|
| 0:06:39 | see the edges |
|---|
| 0:06:41 | and this is electropalatography: you have people wear a palate, like |
|---|
| 0:06:47 | the one you see here, with the contact electrodes, |
|---|
| 0:06:50 | and so when we speak, the contact made by the tongue with the palate provides |
|---|
| 0:06:55 | you some insights about timing and coordination, you know, in speech, to study |
|---|
| 0:07:00 | aspects of it |
|---|
| 0:07:01 | and finally |
|---|
| 0:07:03 | this is electromagnetic articulography: a person sits |
|---|
| 0:07:05 | there, |
|---|
| 0:07:06 | we put little rice-crispy-like sensors in there and measure the dynamics, you |
|---|
| 0:07:11 | know |
|---|
| 0:07:12 | so, you know, those provide you some options |
|---|
| 0:07:14 | now |
|---|
| 0:07:15 | the new possibilities we are excited about were created with advances in MRI, |
|---|
| 0:07:19 | which provides you very good soft tissue contrast; you know, basically |
|---|
| 0:07:24 | what it relies on is the water content of tissue, so the hydrogen in it, |
|---|
| 0:07:30 | which varies across various soft tissues; so we make use of it by |
|---|
| 0:07:34 | exciting the protons, and as they relax a signal is generated according to that content, |
|---|
| 0:07:38 | and then we can image it, right |
|---|
| 0:07:41 | it's very exciting because it provides you very rich, |
|---|
| 0:07:45 | very good quality images, but it's very slow, the traditional kind |
|---|
| 0:07:50 | and also it has a lot of challenges: it's very noisy, and, you know, if |
|---|
| 0:07:53 | you have to lie down in the scanner |
|---|
| 0:07:55 | to produce speech sounds, experiments are a little constrained; so these are some of the things we've been contending |
|---|
| 0:08:00 | with over the last ten years |
|---|
| 0:08:01 | i mean, so, you know, the very first sort of |
|---|
| 0:08:06 | advance on our part was made around two thousand four, |
|---|
| 0:08:11 | moving into |
|---|
| 0:08:12 | real-time imaging, that is, |
|---|
| 0:08:15 | getting to speeds, |
|---|
| 0:08:16 | or sampling rates, that are higher than |
|---|
| 0:08:18 | what the speech rates are, like, you know, |
|---|
| 0:08:23 | the twelve or so hertz of syllable and articulation rates, and so on |
|---|
| 0:08:28 | maybe i'll show you a session |
|---|
| 0:08:31 | [speech sample from the scanner plays] |
|---|
| 0:08:41 | so | 
|---|
| 0:08:41 | if you're familiar with the rainbow passage, people read that; and it was |
|---|
| 0:08:46 | very exciting for us to actually be able to do this |
|---|
| 0:08:49 | we were doing acoustic recordings, and a lot of the speech enhancement work, for the |
|---|
| 0:08:53 | MRI, and it was synchronised; so it kind of opened up a lot of different possibilities |
|---|
| 0:08:58 | for doing this |
|---|
| 0:09:00 | that's what we saw |
|---|
| 0:09:02 | but we did not stop at that rate, really; |
|---|
| 0:09:04 | in principle the signals were good for a wide range of purposes, but we have been trying |
|---|
| 0:09:09 | to see, can we make it even better |
|---|
| 0:09:11 | and so when you actually look at the kinds of rates: |
|---|
| 0:09:16 | for various sounds in speech it's not like one constant; we are using a lot of |
|---|
| 0:09:19 | different, you know, movements in the articulatory task, |
|---|
| 0:09:21 | so from trills, like the /r/ in spanish, |
|---|
| 0:09:23 | to, you know, other such sounds, and so on, |
|---|
| 0:09:27 | they all have different rates |
|---|
| 0:09:28 | so if we can get at that kind of rate, right, it would be really cool |
|---|
| 0:09:33 | so | 
|---|
| 0:09:34 | in fact, we were able last year to make a breakthrough |
|---|
| 0:09:38 | and get up to sort of one hundred frames per second doing real-time MRI, |
|---|
| 0:09:41 | with the work of |
|---|
| 0:09:44 | more than one postdoc |
|---|
| 0:09:46 | and not only can we do so at very fast speech rates, where you can really |
|---|
| 0:09:51 | see the tongue tip when, you know, a trill is produced, |
|---|
| 0:09:54 | but you can also do multiple planes simultaneously: what you see here is a sagittal |
|---|
| 0:10:00 | slice, like this, of myself there, |
|---|
| 0:10:02 | or slices axially, like that, or coronally, like this; so we can do simultaneous |
|---|
| 0:10:07 | views of the vocal tract |
|---|
| 0:10:09 | so it's really exciting, actually, to be able to do this at really high rates |
|---|
| 0:10:12 | to get |
|---|
| 0:10:14 | our insights |
|---|
| 0:10:16 | and so this was made possible by both hardware and algorithmic sort of advances |
|---|
| 0:10:22 | we developed a custom coil receiver for |
|---|
| 0:10:27 | this |
|---|
| 0:10:27 | and made a lot of progress in both sequence design |
|---|
| 0:10:31 | but also sort of in reconstruction, using compressed sensing, things that have been happening in |
|---|
| 0:10:35 | signal processing |
|---|
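As an aside, the compressed-sensing idea behind these reconstructions can be sketched in a few lines: recover a signal that is sparse in some basis from far fewer Fourier samples than Nyquist would demand, via iterative soft-thresholding (ISTA). This is a toy 1-D illustration with made-up data and parameters, not the group's actual MRI reconstruction pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "image": length-128 signal with only 5 nonzeros (sparse in its own domain)
n, m = 128, 48
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.standard_normal(5)

# undersampled unitary DFT: keep m of the n frequency rows (the "k-space" samples)
rows = rng.choice(n, m, replace=False)
F = np.fft.fft(np.eye(n), axis=0) / np.sqrt(n)
A = F[rows, :]
y = A @ x_true                      # what the scanner would measure

def soft(v, t):
    """soft-thresholding: the proximal operator of the l1 penalty"""
    mag = np.abs(v)
    return np.where(mag > t, v * (1.0 - t / np.maximum(mag, 1e-12)), 0.0)

# ISTA: gradient step on ||Ax - y||^2, then shrinkage (step size 1 is safe
# here because the kept DFT rows are orthonormal)
x = np.zeros(n, dtype=complex)
for _ in range(500):
    x = soft(x + A.conj().T @ (y - A @ x), 0.01)

rel_err = np.linalg.norm(x.real - x_true) / np.linalg.norm(x_true)
```

With 48 of 128 Fourier samples the 5-sparse signal comes back nearly exactly; a dense signal would not.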
| 0:10:36 | so we were able to really |
|---|
| 0:10:38 | speed this up, and we're quite excited about it; so this is how you run |
|---|
| 0:10:42 | an experiment now: |
|---|
| 0:10:44 | someone is sitting there doing the audio collection; you know, we reprogrammed the scanner |
|---|
| 0:10:48 | so that the audio is synchronised with the imaging |
|---|
| 0:10:53 | we have an interactive sort of |
|---|
| 0:10:56 | control system to select the scan plane and so on |
|---|
| 0:11:01 | [speech samples from the demo play] |
|---|
| 0:11:24 | so you get the idea, right; you can really see things, you know; on |
|---|
| 0:11:28 | the projector it doesn't look as good as it actually does |
|---|
| 0:11:31 | on a proper display, where it's really good; but actually now we are looking at production data at a |
|---|
| 0:11:36 | scale which is conducive to the kinds of machine learning approaches one could use, |
|---|
| 0:11:41 | although i will not be talking about that today; |
|---|
| 0:11:44 | this is, we are not done with the problem |
|---|
| 0:11:46 | in addition to doing single-plane or multi-plane slice imaging, we are also very interested in |
|---|
| 0:11:51 | volumetric imaging: if you are interested in characterizing speakers, which is one of |
|---|
| 0:11:54 | the sort of topics our research is interested in, you want to |
|---|
| 0:11:58 | really know the full geometry while people are speaking |
|---|
| 0:12:03 | and we made some advances there too: with about seven seconds of holding a sort |
|---|
| 0:12:07 | of posture, things like that, |
|---|
| 0:12:09 | we can do full sweeps of |
|---|
| 0:12:11 | the entire vocal tract, and so we can get full exemplar geometries of people's vocal |
|---|
| 0:12:16 | tracts |
|---|
| 0:12:18 | in addition | 
|---|
| 0:12:19 | we can also do really fine imaging of the anatomical structures, notably with |
|---|
| 0:12:25 | static MRI; so we can do this, classically, in addition to the real-time MRI, and i'll |
|---|
| 0:12:30 | show you why we are doing all these things, for the kinds of measures, because what |
|---|
| 0:12:33 | we really want to have is a comprehensive way of characterizing speakers, characterizing both |
|---|
| 0:12:39 | the vocal instrument and the behaviour |
|---|
| 0:12:43 | so, as an aside, one of the things we decided recently is to release |
|---|
| 0:12:47 | a lot of these data; so, for example, an MRI database with, you know, different |
|---|
| 0:12:50 | speakers, with, for each of them, you know, four hundred and sixty sentences, |
|---|
| 0:12:55 | with alignments and, you know, the image features and so on; it's all available |
|---|
| 0:13:00 | for free download, so |
|---|
| 0:13:04 | here are some examples of that kind of data |
|---|
| 0:13:07 | [speech samples from the database play] |
|---|
| 0:13:20 | so it's got five male and five female speakers, |
|---|
| 0:13:22 | maybe some of them |
|---|
| 0:13:26 | you actually |
|---|
| 0:13:28 | know, |
|---|
| 0:13:31 | and so on, so |
|---|
| 0:13:33 | and we also have alignments, basically coregistration of this, you know, some algorithms for that; |
|---|
| 0:13:38 | that's also released; so we have this kind of data that we can work |
|---|
| 0:13:42 | with; so what do you do with this stuff |
|---|
| 0:13:45 | so i'll sort of introduce some analysis, preliminarily |
|---|
| 0:13:49 | a lot of image processing, you know; the very first thing is, like, actually getting |
|---|
| 0:13:54 | at the structural details of the human vocal apparatus; for people interested in sort |
|---|
| 0:14:00 | of, you know, anatomy and morphometrics, this offers a way |
|---|
| 0:14:04 | of measuring, among other things, the length of the palate and |
|---|
| 0:14:08 | so on, |
|---|
| 0:14:10 | and that's what we wanted to do, very carefully, with the high-resolution static |
|---|
| 0:14:14 | imaging | 
|---|
| 0:14:16 | on top of that, we also want to track articulators, right, since |
|---|
| 0:14:20 | articulators serve important, specific tasks, |
|---|
| 0:14:23 | so we want to be able to automatically process these things |
|---|
| 0:14:26 | so | 
|---|
| 0:14:26 | the methodology we sort of proposed was a sort of sampling-based segmentation model |
|---|
| 0:14:33 | and it's a very nice mathematical formulation, actually, work done by one of ours, |
|---|
| 0:14:38 | and he was able to create a segmentation algorithm that works fairly well |
|---|
| 0:14:45 | so it does things like this, okay |
|---|
| 0:14:49 | [video plays] |
|---|
| 0:14:52 | so once we're doing that, we can actually capture the variables we want automatically |
|---|
| 0:14:57 | from these vast amounts of data; so one way i like to think about it is as one |
|---|
| 0:15:00 | kind of feature extraction, to me |
|---|
| 0:15:04 | so we can go to the variables that are actually linguistically more meaningful to us, |
|---|
| 0:15:08 | right |
|---|
| 0:15:09 | so one of my close collaborators is among the founders of the articulatory phonology |
|---|
| 0:15:15 | view that underlies this; |
|---|
| 0:15:18 | we sort of conceptualise speech production as a dynamical system, |
|---|
| 0:15:22 | and so various articulators, involved in a task, basically coordinate, forming and releasing constrictions as we |
|---|
| 0:15:29 | move around |
|---|
| 0:15:30 | so we are interested in features like, for example, |
|---|
| 0:15:33 | sort of lip aperture and protrusion, |
|---|
| 0:15:36 | constriction degree and location, and so on; so we want to be able to get these automatically, |
|---|
| 0:15:42 | to extract them |
|---|
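To make features like lip aperture and constriction degree concrete, here is a minimal sketch computed from hypothetical 2-D contour points; the coordinates and helper names are invented for illustration, and the real features come from the tracked MRI contours.

```python
import numpy as np

# hypothetical contour points (cm) standing in for tracked MRI contours
upper_lip = np.array([1.0, 3.2])
lower_lip = np.array([1.0, 2.1])
tongue = np.array([[3.0, 2.0], [4.0, 2.6], [5.0, 3.0], [6.0, 2.8]])
palate = np.array([[3.0, 3.4], [4.0, 3.6], [5.0, 3.3], [6.0, 3.1]])

def lip_aperture(ul, ll):
    # distance between upper- and lower-lip landmarks: one classic tract variable
    return float(np.linalg.norm(ul - ll))

def constriction(tongue_pts, palate_pts):
    # constriction degree = minimum tongue-to-palate distance;
    # constriction location = front-back (x) position where that minimum occurs
    d = np.linalg.norm(tongue_pts[:, None, :] - palate_pts[None, :, :], axis=-1)
    i, _ = np.unravel_index(int(d.argmin()), d.shape)
    return float(d.min()), float(tongue_pts[i, 0])

degree, location = constriction(tongue, palate)   # ~0.3 cm, at x = 5.0 here
```

With these made-up points, `lip_aperture(upper_lip, lower_lip)` is about 1.1 cm; per-frame values like these, tracked over time, are what the dynamical-systems view works with.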
| 0:15:43 | here's another view |
|---|
| 0:15:50 | so we need to automate these things now: going from images, to segmentation, |
|---|
| 0:15:56 | to actually extracting these sort of linguistically meaningful |
|---|
| 0:16:02 | features |
|---|
| 0:16:06 | and that, you know, lets you extract other kinds of representations; |
|---|
| 0:16:11 | like, for example, we can look at PCA on these contours to |
|---|
| 0:16:14 | look at the contributions of different articulators, |
|---|
| 0:16:18 | and so on; so this just provides you some ways of getting at, sort |
|---|
| 0:16:22 | of, objectively characterizing this production information, |
|---|
| 0:16:26 | and it's speaker specific |
|---|
| 0:16:30 | so, so far, like, i've told you about how |
|---|
| 0:16:34 | to get the data, and some of the basic analysis, with which we |
|---|
| 0:16:39 | can now start looking at speaker-specific properties |
|---|
| 0:16:43 | so | 
|---|
| 0:16:45 | as i mentioned earlier, the data analysis gets at the anatomy: how to characterise every |
|---|
| 0:16:50 | single vocal instrument, structurally |
|---|
| 0:16:52 | and this sort of task was treated pretty well in the anatomy literature and so on, so |
|---|
| 0:16:56 | we went to look at |
|---|
| 0:16:57 | all that literature |
|---|
| 0:16:59 | and, you know, compiled a whole bunch of these landmarks; you may be familiar with some of |
|---|
| 0:17:05 | the landmarks in speech, |
|---|
| 0:17:07 | and came up with these kinds of measures that we can get at, like, you |
|---|
| 0:17:11 | know, vocal tract sort of overall length, and the cavity lengths separately, |
|---|
| 0:17:16 | and, you know, so on, which we can sort of measure from these |
|---|
| 0:17:20 | kinds of very high contrast images; so that's one source of sort of speaker-specific information |
|---|
| 0:17:27 | as an aside, also, you know, since we had many repetitions of the same tokens by |
|---|
| 0:17:31 | these people at different sessions, you know, |
|---|
| 0:17:34 | we were interested in how consistent people are, and it was very sort of |
|---|
| 0:17:39 | reassuring that, you know, people are fairly consistent in how they produce these; it turns |
|---|
| 0:17:45 | out, you know, that the measurements were very consistent, so |
|---|
| 0:17:48 | this is, for example, looking at the correlations between measurements from one session to another, |
|---|
| 0:17:51 | something that we presented at interspeech |
|---|
| 0:17:55 | so, you see, the striking thing is: we have this sort of fixed structural environment within |
|---|
| 0:18:00 | which we produce speech behaviour, and we want to know |
|---|
| 0:18:05 | how much of it is dictated by the environment we have, versus the strategies that |
|---|
| 0:18:09 | are adopted by speakers, or are unique to them, due to various reasons which we |
|---|
| 0:18:13 | can't really pinpoint, but it is, you know, |
|---|
| 0:18:15 | learning that they have done, or the environment, and so forth; so let's see if we can sort |
|---|
| 0:18:21 | of start deconstructing this a little bit |
|---|
| 0:18:25 | so next i'll also show you a few examples along this direction |
|---|
| 0:18:29 | so, for example, in this picture i want you to focus on the palate and the palatal |
|---|
| 0:18:33 | variation; the palate is, like, you know, behind your upper teeth, and then the hard surface at |
|---|
| 0:18:37 | the roof of the mouth, you know, right; that's the hard palate, which is an important part of the |
|---|
| 0:18:40 | vocal |
|---|
| 0:18:41 | apparatus; so here we see |
|---|
| 0:18:43 | that this person |
|---|
| 0:18:45 | (where's my mouse) |
|---|
| 0:19:05 | there it is |
|---|
| 0:19:05 | so, you know, we see that the highest point of the palatal vault varies: here it is |
|---|
| 0:19:11 | more posterior, |
|---|
| 0:19:14 | then more anterior; here it is sharper, more domed; |
|---|
| 0:19:17 | and that is just six different people |
|---|
| 0:19:19 | so now, how do we begin to actually quantify what you are qualitatively seeing; |
|---|
| 0:19:24 | can you quantify this, right, so |
|---|
| 0:19:30 | so what one of our students did |
|---|
| 0:19:32 | was actually, you know, to take these kinds of extracted palate shapes and |
|---|
| 0:19:37 | start doing sort of, you know, even simple PCA analysis, |
|---|
| 0:19:41 | and showed that most of the variance could be explained by the |
|---|
| 0:19:45 | first five factors, |
|---|
| 0:19:47 | which were sort of akin to, the first was like the concavity or convexity of |
|---|
| 0:19:51 | the shape; the next one was more, you know, how forward or backward |
|---|
| 0:19:56 | this concavity was, like sort of its anteriority; and then how sharp it is; and so on; so |
|---|
| 0:20:01 | these had nice interpretations, while being actually very objective, so |
|---|
| 0:20:07 | we can actually begin to quantify and cluster people along these sort of |
|---|
| 0:20:11 | low-dimensional, sort of, latent variables |
|---|
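That palate-shape PCA can be sketched on synthetic data: contours generated from two latent factors, a concavity depth and a front-back shift, so a handful of components recover almost all of the variance. The contour model below is invented for illustration; the actual analysis ran on the extracted MRI shapes.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic "palate contours": heights sampled at 32 points along the mouth;
# each speaker = a dome whose depth and front-back position vary, plus noise
x = np.linspace(-1.0, 1.0, 32)
depth = rng.normal(0.0, 0.3, 200)      # concavity factor
shift = rng.normal(0.0, 0.2, 200)      # anteriority factor
contours = np.stack([-(1.0 + d) * (x - s) ** 2 + 0.02 * rng.standard_normal(32)
                     for d, s in zip(depth, shift)])

# PCA via SVD of the centred data matrix
centred = contours - contours.mean(axis=0)
_, svals, _ = np.linalg.svd(centred, full_matrices=False)
explained = svals ** 2 / np.sum(svals ** 2)   # variance share per component
```

Here the leading components carry nearly all the shape variance and line up with the generating factors, mirroring the interpretable depth and anteriority factors found for real palates.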
| 0:20:14 | and then we can actually |
|---|
| 0:20:15 | plug these kinds of things into models, right; like, for example, acoustic |
|---|
| 0:20:20 | simulations, to see what the acoustic consequences of these variations are |
|---|
| 0:20:24 | right |
|---|
| 0:20:24 | so one of the things we found is that |
|---|
| 0:20:27 | the concavity mattered: it affected the first formant very much, |
|---|
| 0:20:32 | whereas, like, the anteriority, how forward or backward the palate |
|---|
| 0:20:36 | shape is, and actually the sharpness, really didn't matter, at least from these first-order |
|---|
| 0:20:41 | simulations |
|---|
| 0:20:42 | so from data, to |
|---|
| 0:20:45 | morphological characteristics, we can actually see and pretty much interpret what acoustic consequences we can |
|---|
| 0:20:50 | expect |
|---|
| 0:20:51 | right |
|---|
| 0:20:52 | in fact, we can put this in an articulatory synthesiser and listen to the |
|---|
| 0:20:57 | words from the different |
|---|
| 0:20:59 | palates |
|---|
| 0:21:02 | [synthesised speech samples play] |
|---|
| 0:21:09 | you can hear the same thing coming out differently depending on the palate |
|---|
| 0:21:13 | so we can do this kind of analysis very, you know, carefully |
|---|
| 0:21:16 | so | 
|---|
| 0:21:18 | of course, we are also interested now in, like, the inverse problem, right: can we estimate |
|---|
| 0:21:22 | these shapes given the acoustic signal; how much of it is available to |
|---|
| 0:21:27 | us, the body shape details, right; so |
|---|
| 0:21:30 | we did the classic thing, right, okay: |
|---|
| 0:21:34 | we have all kinds of features from the |
|---|
| 0:21:37 | acoustic signal; one thing to realise, right: |
|---|
| 0:21:42 | the shaping of the airway as we speak is directly influenced |
|---|
| 0:21:46 | by both the environment, the anatomy, and the movements, the behaviours, right; so what |
|---|
| 0:21:50 | i mean is, |
|---|
| 0:21:52 | that is to say, both how we articulate |
|---|
| 0:21:55 | and what we have: |
|---|
| 0:21:56 | both influence the signal |
|---|
| 0:21:59 | so now let's see how we can get at it from the signal |
|---|
| 0:22:01 | and we showed that, in a very simple first experiment, we can get at the shape, |
|---|
| 0:22:05 | sort of, detection: |
|---|
| 0:22:06 | concave or flat; that, like, sixty-some percent of the time we can guess what kind |
|---|
| 0:22:10 | of palate they have just from the acoustic signal; so the morphological information is |
|---|
| 0:22:14 | available |
|---|
| 0:22:15 | so a more interesting question would be: |
|---|
| 0:22:18 | sort of a very classic morphological parameter that we've been using a lot is vocal |
|---|
| 0:22:24 | tract length, right; this is something that has often been important in speech recognition |
|---|
| 0:22:28 | and otherwise; we've been, you know, thinking about it, |
|---|
| 0:22:31 | well, to |
|---|
| 0:22:33 | normalize for it, and also to estimate things like, for example, when we're doing age recognition and |
|---|
| 0:22:39 | so on |
|---|
| 0:22:39 | right, so here, again, the same question: |
|---|
| 0:22:42 | we have some of the speaker-specific anatomy, we think, |
|---|
| 0:22:46 | reflected in the signal, right; |
|---|
| 0:22:47 | we want to see how much we can grab at it to pinpoint the speaker there |
|---|
| 0:22:52 | and, you know, we know that to some extent speakers compensate for what |
|---|
| 0:22:57 | environment they have, and we want to know now how much |
|---|
| 0:23:02 | of it is residual, that you can actually use to |
|---|
| 0:23:05 | get at this; this is, again, vocal tract length; i start with this because it's a classic |
|---|
| 0:23:09 | question that people have been asking; so, for example, here is the data from a published |
|---|
| 0:23:13 | study, you know, from two thousand nine; |
|---|
| 0:23:16 | they're, like, you know, showing vocal tract length growth with age here, |
|---|
| 0:23:21 | over the years; and it goes across from, what, six centimetres to seventeen point five, |
|---|
| 0:23:27 | eighteen centimetres long, |
|---|
| 0:23:29 | and there's some |
|---|
| 0:23:30 | differentiation that happens, empirically, for males and females as well |
|---|
| 0:23:35 | and correspondingly this |
|---|
| 0:23:37 | affects, strikingly, the formant space in the spectrum |
|---|
| 0:23:40 | now |
|---|
| 0:23:42 | by zeroing in on the first formant, the range for |
|---|
| 0:23:47 | a shorter vocal tract, we can see that, between a |
|---|
| 0:23:52 | shorter vocal tract and a longer vocal tract, the space |
|---|
| 0:23:56 | all sort of |
|---|
| 0:23:58 | gets compressed |
|---|
| 0:23:59 | and, you know, shifted, and these kinds of things happen |
|---|
| 0:24:02 | and what people have been doing, implicitly or explicitly, when we do VTLN |
|---|
| 0:24:07 | is to basically normalize for this effect |
|---|
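The inverse scaling between tract length and formants that VTLN exploits can be written down with the classic uniform-tube, quarter-wavelength approximation (an idealisation, not the talk's measured data): F_n = (2n - 1)c / 4L, so all formants scale as 1/L and a single linear warp factor aligns two speakers.

```python
# uniform lossless tube, closed at the glottis and open at the lips:
# resonances F_n = (2n - 1) * c / (4 * L), i.e. formants scale as 1/L
C = 35000.0  # approximate speed of sound in warm, moist air, in cm/s

def tube_formants(length_cm, count=4):
    return [(2 * k - 1) * C / (4.0 * length_cm) for k in range(1, count + 1)]

adult = tube_formants(17.5)     # 17.5 cm tract -> F1 = 500 Hz, F2 = 1500 Hz, ...
child = tube_formants(12.0)     # shorter tract -> every formant scaled up

# VTLN in its simplest, linear form: warp frequency by the length ratio
alpha = 17.5 / 12.0
warped = [f / alpha for f in child]  # lands back on the adult formants here
```

Real speakers deviate from the uniform tube, which is exactly why a single linear warp leaves the residual differences discussed later.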
| 0:24:12 | so the classic estimation of vocal tract length, you know, goes back, you know, |
|---|
| 0:24:17 | you know, to a very simple sort of resonator idea: |
|---|
| 0:24:21 | sort of like, with a uniform tube model, we can begin to estimate the length |
|---|
| 0:24:26 | of the vocal tract from |
|---|
| 0:24:28 | the formants |
|---|
| 0:24:29 | right, so what people have proposed is that |
|---|
| 0:24:31 | with some sort of a formula over the formants you can estimate |
|---|
| 0:24:35 | the length parameter |
|---|
| 0:24:38 | and | 
|---|
| 0:24:39 | one of the early works to improve this was, you know, |
|---|
| 0:24:41 | a |
|---|
| 0:24:43 | length prediction |
|---|
| 0:24:44 | formula; okay, and it primarily relies on the third and fourth formants, and other |
|---|
| 0:24:50 | people have proposed variants |
|---|
| 0:24:51 | what we decided was: well, now, since we actually have |
|---|
| 0:24:54 | direct measurements of the vocal tract length and the acoustics, |
|---|
| 0:24:57 | can we come up with better regression models |
|---|
| 0:25:00 | and sure enough, we showed, actually, from this timit corpus, you know, we |
|---|
| 0:25:05 | showed that we can get, like, really good estimates, you know, with very high correlations, |
|---|
| 0:25:10 | of vocal tract length, you know |
|---|
| 0:25:12 | and this is kind of very interesting; so we are able to sort of |
|---|
| 0:25:15 | regress a good model and estimate the model parameters, |
|---|
| 0:25:18 | and the practical effect is: now we are able to estimate vocal tract length as yet |
|---|
| 0:25:22 | another sort of morphometric detail of the person, from the signal; |
|---|
| 0:25:25 | that's kind of exciting |
|---|
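A toy version of that regression idea, with simulated speakers in place of the MRI corpus: generate "true" lengths, produce noisy higher formants from the uniform-tube approximation, and fit an ordinary-least-squares model of length on the formants. The high correlation here is a property of this simulation; it illustrates the approach rather than reproducing the reported numbers.

```python
import numpy as np

rng = np.random.default_rng(2)
C = 35000.0  # approximate speed of sound, cm/s

# simulated speakers: vocal tract lengths between 13 and 19 cm
L_true = rng.uniform(13.0, 19.0, 200)

# "observed" third and fourth formants from the uniform-tube approximation,
# F_n = (2n - 1) * c / (4 * L), plus measurement noise
F = np.stack([(2 * n - 1) * C / (4.0 * L_true) for n in (3, 4)], axis=1)
F += rng.normal(0.0, 20.0, F.shape)

# ordinary least squares: length regressed on the two formants plus a bias
X = np.column_stack([F, np.ones_like(L_true)])
coef, *_ = np.linalg.lstsq(X, L_true, rcond=None)
L_hat = X @ coef

r = float(np.corrcoef(L_true, L_hat)[0, 1])   # predicted vs. true length
```

The higher formants are used because they are less vowel-dependent than F1 and F2, which is also why the earlier prediction formulas leaned on F3 and F4.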
| 0:25:27 | one last point |
|---|
| 0:25:29 | so | 
|---|
| 0:25:32 | this summarizes what i just said: computation, with better vocal tract length estimation, and |
|---|
| 0:25:36 | availability of data, and sort of, you know, good statistical methods, allow us to get |
|---|
| 0:25:40 | like, better insights |
|---|
| 0:25:42 | now | 
|---|
| 0:25:42 | moving on | 
|---|
| 0:25:44 | let's look at the tongue; the vocal tract is kind of a confined construct, you know, |
|---|
| 0:25:49 | it's very much defined, and the tongue, which is, like, you know, |
|---|
| 0:25:54 | pretty remarkable, actually plays a big role in how we shape |
|---|
| 0:25:59 | the sound |
|---|
| 0:26:01 | so the question we ask is, like, okay: |
|---|
| 0:26:04 | we have, sort of, |
|---|
| 0:26:06 | vocal tract length and formant frequencies, the same chart i was showing you before; |
|---|
| 0:26:10 | we normalize for it using linear normalization, as that is what we typically do, but |
|---|
| 0:26:15 | we still have residual differences that are unexplained; people, you know, have |
|---|
| 0:26:21 | proposed, like, nonlinear vocal tract length normalisation, multi-parameter warps and all that, but again it's not |
|---|
| 0:26:26 | specified what it is; so what we want to know is: is that residual effect |
|---|
| 0:26:30 | actually |
|---|
| 0:26:32 | saying something about the size of the tongue that people have, |
|---|
| 0:26:36 | something we can automatically, you know, account for |
|---|
| 0:26:38 | so | 
|---|
| 0:26:40 | so the hypothesis i have up here is that the tongue size, and, like, the relative tongue shape, |
|---|
| 0:26:44 | here, |
|---|
| 0:26:45 | this thing, |
|---|
| 0:26:47 | differing across people, |
|---|
| 0:26:49 | will explain some of the vowel space differences |
|---|
| 0:26:52 | okay | 
|---|
| 0:26:54 | so the questions we ask, and this slide lays them out, are: how does one define and measure tongue size |
|---|
| 0:27:03 | how does tongue size vary across the population |
|---|
| 0:27:09 | what is the effect of tongue size on articulation |
|---|
| 0:27:13 | and is that effect visible in the acoustics, and can it be predicted and normalized |
|---|
| 0:27:19 | same questions; there is very little published work on this kind of thing |
|---|
| 0:27:23 | people know that there is a coordinated, global growth of the size of the vocal tract as we develop |
|---|
| 0:27:30 | there are some disorders that are usually associated with large tongue sizes |
|---|
| 0:27:39 | so what happens with a large tongue? it has effects on how we produce speech, like velarization of coronal sounds, the sounds made at the front of the mouth, like l, s, t, d and n |
|---|
| 0:27:56 | labialization, how we try to produce things with the lips, and sort of almost linguolabial productions, using the tongue in producing what would be bilabial sounds like p and b |
|---|
| 0:28:10 | and other compensatory articulations, like slowing of speech rate, because you have a larger mass to control, and so on |
|---|
| 0:28:16 | these things are sometimes mentioned but not much quantified |
|---|
| 0:28:21 | so we set out to say, well, we have lots of data: can we estimate a mean posture, run the segmentation, and come up with some proxy measure for tongue size |
|---|
| 0:28:38 | and once you do that, we can actually plot the distributions of tongue sizes across the male and female speakers in our corpus |
|---|
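A proxy measure of the kind described, derived from a segmentation, can be as simple as a pixel count per midsagittal frame; the tiny binary masks and pixel scale below are invented toy stand-ins, not the corpus data:

```python
import statistics

def tongue_area(mask, pixel_mm2=1.0):
    """Proxy tongue size: count segmented pixels in a midsagittal slice."""
    return sum(row.count(1) for row in mask) * pixel_mm2

# Toy binary masks standing in for midsagittal tongue segmentations.
speaker_masks = {
    "f1": [[1, 1, 0], [1, 0, 0]],
    "f2": [[1, 1, 0], [1, 1, 0]],
    "m1": [[1, 1, 1], [1, 1, 0]],
    "m2": [[1, 1, 1], [1, 1, 1]],
}
areas = {s: tongue_area(m) for s, m in speaker_masks.items()}
female_mean = statistics.mean(areas[s] for s in ("f1", "f2"))
male_mean = statistics.mean(areas[s] for s in ("m1", "m2"))
```

Comparing the resulting per-group distributions is then a standard statistics exercise; the toy numbers here simply mimic the male/female separation the talk reports.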
| 0:28:46 | what we see is that the green is female and the other is male, and on average there are significant sex differences in tongue size |
|---|
| 0:29:00 | so if we can get at it from the acoustic signal, it is yet another sort of interpretable marker |
|---|
| 0:29:11 | how well we can infer it from the formant structure is still not really well established; again an open question, how do you really assess this thing, but we have taken a shot |
|---|
| 0:29:28 | we did different kinds of normalization, looking at the rest shape and at the shape during movement; there is not much difference between them, they are pretty highly correlated |
|---|
| 0:29:39 | so once you have that, we can actually use this information in simulations, say in a vocal tract model; people still study speech production with classic articulatory models |
|---|
| 0:29:56 | there you can actually reflect this back and try to study it through analysis by synthesis |
|---|
| 0:30:01 | so if you have a larger tongue we can expect longer constrictions, and so on; what we did was to vary, based on the measurements, different constriction lengths and locations, just to see how tongue-size differences play a role in the acoustics, like in the formants |
|---|
| 0:30:19 | what we observed is that the tongue-size differences we had in the population and what was estimated by the simulations were very well correlated in terms of the formants |
|---|
| 0:30:30 | it was very nice; what you see here is that in the simulations the formants of the vowels moved in like ways |
|---|
| 0:30:43 | so the general trends hold: all in all, in the pilot we saw that tongue size varies across speakers by quite a bit, fifteen up to thirty percent |
|---|
| 0:30:56 | a consequence of a large tongue is longer constrictions made in the vocal tract as people produce sounds, and constrictions are very central to how we produce speech sounds |
|---|
| 0:31:08 | they tend to stretch and twist the vowel space, so that is a signal we can play with |
|---|
| 0:31:17 | but this interplay between constriction formation and tongue size is fairly complex and requires much more sophisticated learning and modeling |
|---|
| 0:31:27 | hopefully, with data, these things can be pursued |
|---|
| 0:31:33 | so the final thing, another note on the side of speaker-specific behaviour, is to actually talk about articulatory strategy |
|---|
| 0:31:40 | what i mean by that is how talkers move their vocal tracts; as you know, the vocal tract is actually a pretty clever system, a very redundant system with lots of degrees of freedom |
|---|
| 0:31:52 | you can use different articulators to complete the same task; for example, you can move the jaw or the lips, both of which contribute to bilabial constrictions, like making p and b |
|---|
| 0:32:06 | and one person may use more jaw while another uses more lips |
|---|
| 0:32:09 | people have several ways of changing their airway shapes to do this, and so we call these articulatory strategies; some of them are speaker specific and some are language specific |
|---|
| 0:32:16 | we want to get at this because it is again yet another piece of the puzzle as you try to understand what makes me different from you when we produce a speech signal, beyond just knowing that i am different from you from the speech alone |
|---|
| 0:32:33 | okay |
|---|
| 0:32:36 | this is, again, very early work; we have lots of real-time mri data |
|---|
| 0:32:40 | the database we collected is from a pilot study of eighteen speakers, with vocal tract outlines and volumes and all of that, annotated in very detailed ways |
|---|
| 0:32:50 | and so we can actually characterize the morphology and the speaking style |
|---|
| 0:32:57 | once we have that, we established what we call speaker-specific forward maps from the vocal tract shapes to the constrictions |
|---|
| 0:33:07 | imagine the shape changes that create these tasks as coming from a dynamical system; we estimate the forward maps in a linearization sense |
|---|
| 0:33:17 | and then we can pull out each of these speakers' forward maps, put them back into a synthesis model, a dynamical-systems model of the kind used in task dynamics |
|---|
| 0:33:28 | and see the contributions of the various articulators people actually use, to predict what articulatory strategies people adopt |
|---|
| 0:33:38 | so, again reminding you: we can go from the data, extract air-tissue boundaries, and do pca to extract factors, basically how much the jaw contributes, what the tongue factors are, and so on |
|---|
| 0:33:53 | and from that we can go and estimate the various constrictions at the places of articulation you are probably more familiar with |
|---|
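The PCA step described, extracting a small number of articulator factors from contour data, can be sketched like this; the synthetic "contours" below are generated from two hidden factors purely for illustration, not from MRI data:

```python
import numpy as np

# Toy stand-in for vocal-tract contour data: rows are frames, columns are
# flattened (x, y) contour coordinates driven by two hidden factors.
rng = np.random.default_rng(1)
jaw = rng.normal(size=(200, 1))      # hypothetical "jaw" factor
tongue = rng.normal(size=(200, 1))   # hypothetical "tongue" factor
basis = rng.normal(size=(2, 12))
contours = np.hstack([jaw, tongue]) @ basis + 0.01 * rng.normal(size=(200, 12))

# PCA via SVD of the mean-centred data, as in the factor extraction
# the talk describes.
X = contours - contours.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained = (S ** 2) / (S ** 2).sum()
factors = X @ Vt[:2].T  # per-frame scores on the first two components
```

Because the toy data really is two-dimensional plus a little noise, the first two components recover nearly all the variance; on real contours one keeps however many components are interpretable as jaw, tongue, and so on.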
| 0:34:00 | along the vocal tract we mark out six different anatomical regions, like the alveolar ridge, the palate, the velum, the pharynx, and so on |
|---|
| 0:34:10 | and we can automatically estimate the baseline degree to which people use each |
|---|
| 0:34:18 | we have some insights from the roughly eighteen speakers that we analyzed; this is work by sorensen |
|---|
| 0:34:25 | briefly presented at interspeech; to fill in how we went about it, we used a model-based approach |
|---|
| 0:34:33 | we approximated the speaker-specific forward map from the real-time mri data from these eighteen speakers |
|---|
| 0:34:40 | and simulated with task dynamics, which comes from the motor control literature, an articulatory dynamical system |
|---|
| 0:34:48 | the dynamical systems are basically control systems written in state-space form |
|---|
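The state-space control systems mentioned here are, in the task dynamics tradition, typically critically damped second-order point attractors driving a task variable toward a target; a minimal Euler-integrated sketch, with arbitrary parameter values:

```python
def task_dynamics_step(z, dz, target, omega, dt):
    """One Euler step of a critically damped second-order point attractor:
    z'' = -2*omega*z' - omega**2 * (z - target)."""
    ddz = -2.0 * omega * dz - omega ** 2 * (z - target)
    dz = dz + dt * ddz
    z = z + dt * dz
    return z, dz

# Drive a constriction-degree variable from 10 mm toward a 2 mm target.
z, dz = 10.0, 0.0
for _ in range(2000):
    z, dz = task_dynamics_step(z, dz, target=2.0, omega=8.0, dt=0.005)
```

Critical damping means the variable settles at the target without overshoot; fitting such a system per speaker is what lets the articulator contributions be read off afterwards.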
| 0:34:54 | and then we were able to interpret the results; one of the results here basically represents the ratio of lips to jaw used by speakers to create various constrictions, bilabial, alveolar, palatal, along the vocal tract |
|---|
| 0:35:15 | and you see that there are different ratios of how much people use each |
|---|
| 0:35:22 | a value of one means relying more on the lips, zero means relying more on the jaw; different constrictions are created in different ways |
|---|
| 0:35:34 | in this work we see, on the right, that the lips contribute more than the jaw |
|---|
| 0:35:42 | except for alveolar, which is close to a tongue-tip target |
|---|
| 0:35:48 | and the speakers in our set vary in how they choose to create the same kind of constrictions, so people differ in their strategies |
|---|
| 0:35:59 | this is a very early insight into how much speakers use the jaw and the lips; there is a functional specificity, and questions about the motor planning behind it |
|---|
| 0:36:10 | there are questions here that are actually begging for a more computational approach; now, with the data in hand, we can go and see how people actually use the vocal instrument in producing these sounds that we call speech |
|---|
| 0:36:29 | so for the final thing, now we get to the kind of slides familiar at this conference |
|---|
| 0:36:35 | we have also explored a little bit whether production information could be of use in speaker recognition type experiments; we did a little early work on speaker verification with the production data |
|---|
| 0:36:48 | there is not much data, so nothing conclusive, but the question people commonly ask is this |
|---|
| 0:37:00 | would speech production data be of any use at all in speaker verification |
|---|
| 0:37:04 | we know we cannot count on getting data like what i was showing, x-ray or mri, in operational conditions |
|---|
| 0:37:15 | so we need to have some articulatory type representation; people have been working on inversion problems, that is |
|---|
| 0:37:23 | given the acoustics, can we estimate the articulatory parameters; this is the classic, in fact ill-posed, inversion problem |
|---|
| 0:37:31 | where i feel that deep learning approaches are very powerful, because it is a highly nonlinear process, so these mappings are very conducive to such models |
|---|
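A toy version of such an inversion map, a small network regressing a pretend articulatory parameter from a pretend acoustic feature, can be sketched as follows; the data, network size, and target function are all invented for illustration and are not the actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for acoustic-to-articulatory inversion: learn a nonlinear
# map from a 1-D "acoustic" feature to a 1-D "articulatory" parameter.
X = rng.uniform(-1.0, 1.0, size=(256, 1))
Y = np.sin(2.5 * X)  # pretend articulator trajectory, nonlinear in X

def forward(X, params):
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    return H @ W2 + b2, H

# One hidden layer, trained with plain full-batch gradient descent.
params = [rng.normal(0.0, 0.5, (1, 16)), np.zeros(16),
          rng.normal(0.0, 0.5, (16, 1)), np.zeros(1)]
mse_before = float(((forward(X, params)[0] - Y) ** 2).mean())
lr = 0.2
for _ in range(2000):
    W1, b1, W2, b2 = params
    pred, H = forward(X, params)
    err = (pred - Y) / len(X)          # gradient of mean squared error
    dH = (err @ W2.T) * (1.0 - H ** 2)  # backprop through tanh
    params = [W1 - lr * (X.T @ dH), b1 - lr * dH.sum(0),
              W2 - lr * (H.T @ err), b2 - lr * err.sum(0)]
mse_after = float(((forward(X, params)[0] - Y) ** 2).mean())
```

The point is only that a nonlinear regressor can learn such a mapping from paired data; the speaker-independent trick described next sidesteps needing paired data for every speaker.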
| 0:37:41 | nevertheless, what we wanted was a speaker-independent mapping |
|---|
| 0:37:45 | so in this work from a few years ago we said, well, suppose i can learn the acoustic-to-articulatory mapping of an exemplary talker |
|---|
| 0:37:56 | i have lots of data from one single speaker, like in synthesis, where you always take the properties of one talker and then try to produce from them |
|---|
| 0:38:08 | then we can project anyone else's acoustics onto this speaker's maps, to see how this talker would have produced those acoustics, and get some semblance of an articulatory representation |
|---|
| 0:38:20 | so that we can do speaker-independent measures; that was the idea, so we said, well, we can use a reference speaker |
|---|
| 0:38:31 | to create an articulatory-acoustic map, train the inverse model on it, and then, when we get a test speaker's |
|---|
| 0:38:39 | acoustic signal |
|---|
| 0:38:42 | we can actually compute inverted features and use them to see if there is any benefit |
|---|
| 0:38:48 | the rationale is that it produces projections in a robust way and provides physically meaningful constraints on how we parameterize the signal |
|---|
| 0:39:05 | so there might be some advantage to be had |
|---|
| 0:39:08 | this was published earlier this year in csl |
|---|
| 0:39:15 | for some of these early experiments we used the x-ray microbeam database, which is available and has a lot of speakers |
|---|
| 0:39:27 | and a standard gmm model, because we do not have that much data |
|---|
| 0:39:32 | here are some of the initial results; if you use just mfccs only |
|---|
| 0:39:39 | for this small, pretty noisy data set you get about seven point five percent eer; but if you actually have the real, measured articulation you get a boost |
|---|
| 0:39:57 | it provides nice complementary information, which is kind of encouraging; you might think about it as an oracle experiment, or an upper bound |
|---|
| 0:40:06 | now, if you use the inverted measurements instead, they compare really well, slightly better, and putting them together actually provides an additional boost, which is pretty significant |
|---|
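The gain from putting the two streams together is, at its simplest, score-level fusion: a weighted sum of per-trial scores, with EER computed by sweeping a decision threshold. The scores below are invented toy numbers, not the paper's results:

```python
def fuse(acoustic, articulatory, w=0.5):
    """Score-level fusion of two per-trial scores."""
    return [w * a + (1.0 - w) * b for a, b in zip(acoustic, articulatory)]

def eer(target_scores, nontarget_scores):
    """Equal error rate by sweeping the decision threshold over all scores."""
    best = 1.0
    for t in sorted(target_scores + nontarget_scores):
        miss = sum(s < t for s in target_scores) / len(target_scores)
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        best = min(best, max(miss, fa))
    return best

# Toy scores: the articulatory stream fixes one trial the acoustic one misses.
tgt_ac, non_ac = [2.0, 1.5, -0.5], [-1.0, -1.5, 0.6]
tgt_ar, non_ar = [1.0, 0.8, 1.2], [-0.2, -1.0, -0.4]
fused_tgt = fuse(tgt_ac, tgt_ar)
fused_non = fuse(non_ac, non_ar)
```

On this toy data the fused scores separate targets from nontargets perfectly while the acoustic stream alone does not, which is the shape of the complementarity the talk reports.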
| 0:40:20 | this is encouraging: if you have enough data to create these maps across speakers, and we need just exemplars in each case |
|---|
| 0:40:30 | then we can provide an additional source of information |
|---|
| 0:40:32 | perhaps it will give us some wins, but maybe also some insight into why people are different, or in what categories of articulation or structure or strategy they differ |
|---|
| 0:40:47 | this is just the standard setup, showing the same thing on the full x-ray microbeam database |
|---|
| 0:40:59 | to summarize the speaker recognition experiments, which are the first steps in using both acoustic and articulatory information |
|---|
| 0:41:07 | there is a significant eer benefit if you use measured articulatory information with the standard acoustic features |
|---|
| 0:41:16 | gains are more modest if we instead use estimated articulatory information |
|---|
| 0:41:22 | so what would be nice is to actually look at new ways of doing inversion, with the kinds of advances that are happening right now in neural modeling |
|---|
| 0:41:32 | and the availability of data, more and more data, to do this better |
|---|
| 0:41:38 | and to be able to evaluate on larger acoustic data sets, from sre-like campaigns |
|---|
| 0:41:45 | so mowing for most on | 
|---|
| 0:41:48 | so we're very excited about no some of this actually | 
|---|
| 0:41:52 | a premier work was done with my collaborators that lincoln laboratory some point your unique | 
|---|
| 0:41:58 | model is gonna | 
|---|
| 0:41:59 | and parallel work was mice your voice now also their | 
|---|
| 0:42:03 | and so we had some initial pilot work and then | 
|---|
| 0:42:06 | i recently got an innocent right actually to a and you the slider work people | 
|---|
| 0:42:10 | actually | 
|---|
| 0:42:12 | or okay we're doing speed signs looks like | 
|---|
| 0:42:14 | so we are excited about it | 
|---|
| 0:42:16 | our idea is to do this very systematically; we are set to collect about two hundred subjects with |
|---|
| 0:42:24 | real-time and volumetric mri, in detail, and share the data with people |
|---|
| 0:42:31 | we describe this in an upcoming paper |
|---|
| 0:42:36 | i will share the slides if people want them; we have collected about ten speakers so far |
|---|
| 0:42:47 | since the project started |
|---|
| 0:42:49 | everything from the rainbow passage to all kinds of spontaneous speech and so on |
|---|
| 0:42:56 | if you have any suggestions or ideas about what would be useful for speaker modeling, we would love to consider them |
|---|
| 0:43:04 | most of the subjects will be native speakers of english, and about twenty percent could be nonnative speakers of english |
|---|
| 0:43:11 | but we have other projects collecting a lot of people speaking other languages, everything from african languages to others |
|---|
| 0:43:20 | finally, beyond getting insights into inter-speaker variability, we can also look at some of these use cases |
|---|
| 0:43:27 | in the case of the developing vocal tract in kids, or how speaker variability manifests in the signal; so for example |
|---|
| 0:43:37 | we have been working with people who have had operations for oral cancer |
|---|
| 0:43:42 | the surgical interventions basically remove parts of the tongue |
|---|
| 0:43:48 | on top of that there are other therapeutic treatments, like radiation, that |
|---|
| 0:43:53 | cause modified physical structure or damage to the tissue |
|---|
| 0:43:58 | here we see two patients |
|---|
| 0:44:03 | one basically lost most of the tongue because of cancer at the tongue base |
|---|
| 0:44:10 | and it was replaced, reconstructed with a flap from the forearm |
|---|
| 0:44:15 | you see the variation compared with the normal anatomy here |
|---|
| 0:44:21 | so how does their speech cope with this; speech and swallowing is one of the big quality of life measures |
|---|
| 0:44:26 | these cases also give us additional insights when looking at speaker variability |
|---|
| 0:44:35 | interestingly, in some of these cases people have retained remarkable ability |
|---|
| 0:44:39 | we have access to these speakers and have collected a lot of data from them |
|---|
| 0:44:46 | so we can compare how they compensate, what strategies they use; one person |
|---|
| 0:44:54 | speaks pretty intelligibly, pretty well |
|---|
| 0:44:57 | this provides an additional source of information to understand this question of individual variability |
|---|
| 0:45:05 | so in conclusion |
|---|
| 0:45:07 | the point i want to make is, yes, data is integral to advancing speech communication research, and vocal tract information is a crucial piece of this puzzle, i believe |
|---|
| 0:45:20 | to do that we need to gather data from lots of different sources to get a complete picture of speech production |
|---|
| 0:45:27 | that is not trivial, from a technological and computational as well as a conceptual and theoretical perspective |
|---|
| 0:45:35 | but i do believe there are rich dividends, for applications including machine speech recognition and speaker modeling |
|---|
| 0:45:44 | and this sort of approach is very interdisciplinary, so people have to come together to work on these topics |
|---|
| 0:45:52 | and share |
|---|
| 0:45:53 | these are some of the people in my speech production lab, past and present, who contributed to this work |
|---|
| 0:46:04 | including this particular collection: my colleague who does all the imaging work |
|---|
| 0:46:11 | the mr scientists, and the linguists |
|---|
| 0:46:16 | the linguists provide the conceptual framework for how we approach |
|---|
| 0:46:21 | questions like these; and the students who did all this work, on the imaging and on the morphology, whose models i was talking about |
|---|
| 0:46:34 | and those who took a lot of this and actually translated it to speaker verification |
|---|
| 0:46:39 | and collaborators who made things like i-vector systems available for this work |
|---|
| 0:46:46 | and finally, the organizers; they have been very supportive, which has been important for this, and have been pushing us to |
|---|
| 0:46:56 | bring this sort of thing here too |
|---|
| 0:46:58 | so with that, i thank all of you for listening to me |
|---|
| 0:47:04 | all of this is available online if you are interested |
|---|
| 0:47:09 | thank you very much |
|---|
| 0:47:32 | thank you very much, that was fascinating; two questions |
|---|
| 0:47:38 | first of all, when are you going to get to the larynx |
|---|
| 0:47:42 | because, and i am talking from the perspective of |
|---|
| 0:47:48 | the forensic phoneticians |
|---|
| 0:47:54 | we are conscious of between-speaker differences from the larynx, in the spectral slope, that sort of thing, but in this data that is suppressed |
|---|
| 0:48:05 | and also, secondly |
|---|
| 0:48:09 | what knowledge, which would give almost more robust features, do we have about speaker variability in |
|---|
| 0:48:20 | the nasal cavity, the sinuses, that sort of thing |
|---|
| 0:48:30 | though, granted, you are not going to get that in telephone speech and so forth; anything above three kilohertz is gone |
|---|
| 0:48:37 | so the first question is about the larynx |
|---|
| 0:48:41 | so, here in this region |
|---|
| 0:48:45 | the glottal, voice-source phenomena happen at a much higher rate |
|---|
| 0:48:50 | and mri is still not fast enough; it is about |
|---|
| 0:48:54 | we can go to about a hundred frames per second here |
|---|
| 0:48:58 | so what people have been doing, particular groups, is |
|---|
| 0:49:04 | high speed imaging of the larynx, putting a camera through the nose |
|---|
| 0:49:08 | which is a little bit of an intervention |
|---|
| 0:49:14 | on the other hand |
|---|
| 0:49:15 | what we can do with mri is look at things like larynx height and other things, and also get some |
|---|
| 0:49:22 | f-zero-related information |
|---|
| 0:49:25 | and particularly, one of the things mri offers is a complete view of the region, which we can really use |
|---|
| 0:49:31 | this is not available in any of the other modalities; so you can look at these sorts of |
|---|
| 0:49:40 | behavioral phenomena |
|---|
| 0:49:42 | and in terms of actually characterizing things like the sinuses and so on, which do not change very much during speech, we can characterize those anatomically; that is where |
|---|
| 0:49:49 | we use t2-weighted contrast images |
|---|
| 0:49:51 | to really characterize every speaker by the anatomy they have, in terms of |
|---|
| 0:49:55 | which we can actually get |
|---|
| 0:50:00 | a good anatomical characterization of a speaker, and see how to relate it to, or account for it in, the signal |
|---|
| 0:50:06 | and so we are trying to see how we can |
|---|
| 0:50:10 | in a controlled way, do some multimodal imaging of the voice source; we have tried |
|---|
| 0:50:15 | but these are quite small windows into this thing |
|---|
| 0:50:19 | we want to see the high speed stuff |
|---|
| 0:50:23 | still an open question in terms of laryngeal imaging |
|---|
| 0:50:29 | i can point you to references, like in the previous slides, for people interested |
|---|
| 0:50:40 | any more questions |
|---|
| 0:50:53 | is it possible to say, broadly |
|---|
| 0:50:55 | if there are any particular areas that show the greatest amount of between-speaker difference |
|---|
| 0:51:03 | so that if you are going to look for where somebody is completely distinctive, you know where to look; or is it just that people differ in all |
|---|
| 0:51:11 | sorts of different ways |
|---|
| 0:51:14 | i think the latter is my guess right now, although |
|---|
| 0:51:18 | i do think they will begin to cluster |
|---|
| 0:51:21 | once we increase the numbers |
|---|
| 0:51:25 | just like what we do with eigenvoices and |
|---|
| 0:51:28 | eigenfaces; i am sure things will start clustering as we get more data |
|---|
| 0:51:33 | right now the sources of variability seem to be, from our point of view, all over the place |
|---|
| 0:51:40 | plus how people speak also varies quite a bit, because of |
|---|
| 0:51:46 | where they come from, how they learned, and so on, and the practices people use |
|---|
| 0:51:51 | there are other pieces of work i could talk about, on articulatory setting, and |
|---|
| 0:51:56 | ideas about |
|---|
| 0:51:59 | how people actually |
|---|
| 0:52:04 | extract parameters |
|---|
| 0:52:06 | from a motor control point of view, and why people differ, whether it can be attributed to language or |
|---|
| 0:52:11 | background or other kinds of things; still an open question |
|---|
| 0:52:15 | but what i feel encouraged by is that these are very small datasets we are talking about, compared to what you have been used to on |
|---|
| 0:52:23 | the speech side |
|---|
| 0:52:25 | but if we increase this to some extent |
|---|
| 0:52:28 | and harness the kind of computational tools and advances that you are making, i think |
|---|
| 0:52:33 | we can slowly begin to understand this at a deeper level |
|---|
| 0:52:40 | open question |
|---|
| 0:52:49 | first, let me make a comment |
|---|
| 0:52:53 | you put up a kind of acoustic tube model; well, i remember one thing from one of the workshops from |
|---|
| 0:53:00 | the early nineties |
|---|
| 0:53:02 | from the mid sixties up until the late eighties or early nineties, we used an acoustic tube |
|---|
| 0:53:09 | model that was essentially straight |
|---|
| 0:53:12 | and we had a summer student who basically spent the summer saying, well |
|---|
| 0:53:18 | actually the vocal tract has a right-angle turn, and no one had really thought |
|---|
| 0:53:23 | about how much that right angle actually impacts formant locations |
|---|
| 0:53:28 | and bandwidths |
|---|
| 0:53:29 | so he formulated a closed-form solution, and i think it |
|---|
| 0:53:34 | was between one and three percent shifts in formant locations and bandwidths; so very much |
|---|
| 0:53:39 | in line with the physiology you are taking care of; now my basic question |
|---|
| 0:53:44 | you focused on speaker id |
|---|
| 0:53:47 | i am assuming many of your speakers here are bilingual; have you thought about looking at language |
|---|
| 0:53:52 | id, to see if the physiological production systematically changes between people speaking one language versus another |
|---|
| 0:54:00 | absolutely, along those lines. for the first comment that john hansen made, which was |
|---|
| 0:54:05 | regarding the vocal tract bend, people have actually done simulations |
|---|
| 0:54:11 | of |
|---|
| 0:54:12 | the |
|---|
| 0:54:13 | articulation to acoustics mapping and the effect of the bend. in fact there is a classic paper |
|---|
| 0:54:17 | on this, on the |
|---|
| 0:54:19 | analysis of the bend, from |
|---|
| 0:54:21 | a long time ago |
|---|
| 0:54:23 | that actually estimated it at about three to five percent, and the student's result was verified by |
|---|
| 0:54:27 | simulations later on |
|---|
| 0:54:31 | i would have to look up the exact reference |
|---|
| 0:54:34 | and |
|---|
| 0:54:36 | so i think the more recent models try to do this, you know, with |
|---|
| 0:54:40 | full finite element and fluid dynamics simulations, and the ones we can do with the |
|---|
| 0:54:44 | data we now have access to are the ones like what i talked about, right |
|---|
| 0:54:48 | for all the postures from all these speakers we have that |
|---|
| 0:54:50 | so with high performance computing |
|---|
| 0:54:53 | this is becoming a reality; what we are planning and want to do right |
|---|
| 0:54:56 | now is |
|---|
| 0:54:58 | possible |
|---|
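As an editorial aside: the percent-level formant sensitivities quoted above (one to three percent from the closed-form analysis, three to five from the classic paper) can be put in context with a back-of-the-envelope sketch. The code below is not a model of the right-angle bend itself; it only uses the standard quarter-wave resonator approximation for a uniform tube, with an assumed 17.5 cm tract length and an illustrative 2% geometric perturbation, to show what a percent-level shift in formant frequencies looks like.

```python
# Quarter-wave resonator formants for a uniform tube closed at the glottis
# and open at the lips: F_n = (2n - 1) * c / (4 * L).  This is NOT a model
# of the right-angle bend discussed above; it is only a back-of-the-envelope
# illustration of how a small change in effective tract geometry maps to
# percent-level formant shifts of the size quoted in the discussion.
C = 35000.0  # approximate speed of sound in warm, moist air, cm/s

def tube_formants(length_cm, n=3):
    """First n formants of a uniform quarter-wave tube, in Hz."""
    return [(2 * k - 1) * C / (4.0 * length_cm) for k in range(1, n + 1)]

base = tube_formants(17.5)              # ~17.5 cm adult vocal tract
perturbed = tube_formants(17.5 * 1.02)  # 2% longer effective tract

for f0, f1 in zip(base, perturbed):
    shift = 100.0 * (f1 - f0) / f0
    print(f"{f0:7.1f} Hz -> {f1:7.1f} Hz  ({shift:+.2f}%)")
```

A 2% change in effective length moves every formant by about 2% in this idealized model, the same order of magnitude as the shifts attributed to the bend.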
| 0:55:00 | and the second question |
|---|
| 0:55:03 | john, a reminder? |
|---|
| 0:55:07 | ah, the language id. yes, of course, we have actually |
|---|
| 0:55:10 | about |
|---|
| 0:55:11 | forty or fifty different first languages, speakers with english as a second language who |
|---|
| 0:55:17 | speak english, in our datasets, across the various cross-linguistic experiments we have been doing |
|---|
| 0:55:22 | so one of the things we |
|---|
| 0:55:24 | did with the data, |
|---|
| 0:55:25 | a little bit, not as much maybe as |
|---|
| 0:55:27 | people with an intuition for language id, |
|---|
| 0:55:31 | who may have some hypotheses and so on, is that we looked at things like articulatory setting |
|---|
| 0:55:34 | you know, which is |
|---|
| 0:55:36 | the place from which you start executing a task, right, from rest to speech |
|---|
| 0:55:40 | so if you think about it as a dynamical system, right, as you know, for |
|---|
| 0:55:44 | an individual constriction, in such modelling your initial state is important: the state from |
|---|
| 0:55:48 | which you go to another state, where you set up, |
|---|
| 0:55:53 | release that particular task and go to the next, making one constriction after another, on |
|---|
| 0:55:57 | and on. and we found that people have preferred settings from which |
|---|
| 0:56:02 | they start executing, and that's very language specific: we showed this for german speakers |
|---|
| 0:56:07 | and spanish speakers versus english speakers. so these kinds of things can be estimated from |
|---|
| 0:56:11 | articulatory data |
|---|
| 0:56:13 | the inversion from acoustics, no, we have not done that yet |
|---|
| 0:56:17 | but that's quite possible, and, you know, we are happy to share data |
|---|
| 0:56:21 | there are two people waiting |
|---|
| 0:56:26 | okay |
|---|
| 0:56:27 | sure, you first, okay |
|---|
| 0:56:32 | okay so |
|---|
| 0:56:34 | i have a comment i would like you to respond to |
|---|
| 0:56:37 | one of the old problems in speaker recognition is what happens between the talker |
|---|
| 0:56:44 | and the recorded speech, right |
|---|
| 0:56:48 | the first thing everyone does is |
|---|
| 0:56:51 | cepstral mean subtraction |
|---|
| 0:56:54 | which basically throws away the average shape of the vocal tract |
|---|
| 0:57:02 | how does that sort of |
|---|
| 0:57:04 | impact what you do |
|---|
| 0:57:07 | right, so, you know, i didn't talk about the channel effects and channel normalization |
|---|
| 0:57:11 | things that happen due to the recording conditions and so on, right, so |
|---|
| 0:57:16 | one of the things that we are contemplating, like many people |
|---|
| 0:57:19 | have been talking about joint factor analysis or these kinds of approaches, even with these |
|---|
| 0:57:24 | new deep learning systems, right |
|---|
| 0:57:27 | is that you could model these multiple factors jointly together, to see how |
|---|
| 0:57:31 | we can have speaker specific variability measures |
|---|
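For reference, the joint factor analysis mentioned here is a standard formulation from the speaker recognition literature (not something specific to this lab's work): it decomposes a session-dependent GMM mean supervector into additive speaker and channel components.

```latex
% Classical joint factor analysis supervector decomposition:
% a session-dependent supervector M splits into a speaker-independent
% mean, a speaker part, a channel part, and a speaker-specific residual.
M = m + Vy + Ux + Dz
```

Here $m$ is the speaker-independent (UBM) mean supervector, $Vy$ captures speaker factors via the eigenvoice matrix $V$, $Ux$ captures channel and session factors via the eigenchannel matrix $U$, and $Dz$ is a diagonal residual term; the suggestion in the answer is that articulatory and extraneous factors could be teased apart in a similar joint fashion.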
| 0:57:35 | separated from things that are caused by other |
|---|
| 0:57:39 | extraneous |
|---|
| 0:57:42 | interferences or other kinds of transformations that might happen |
|---|
| 0:57:46 | so that's why we are doing first principles type things, right: the way we |
|---|
| 0:57:51 | want to do it is not to just make the jump into throwing all of this into |
|---|
| 0:57:55 | some, you know, machine learning tool and begin to estimate blindly |
|---|
| 0:58:01 | but to systematically look at linguistic theory, speech science, acoustic features, analysis by |
|---|
| 0:58:07 | synthesis type approaches, and then we can see, well, if you have other |
|---|
| 0:58:11 | kinds of conditions, |
|---|
| 0:58:15 | both |
|---|
| 0:58:16 | controlled and open environment speech recordings, |
|---|
| 0:58:19 | for instance distant speech recording, which is of much interest to a lot of us |
|---|
| 0:58:23 | for various reasons, |
|---|
| 0:58:26 | whether we can account for these things. so i tend to believe in that kind of |
|---|
| 0:58:30 | more organic approach |
|---|
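The point raised in the question about cepstral mean subtraction can be made concrete in a few lines. This is a toy sketch on synthetic features (not the speaker's actual pipeline): a time-invariant convolutive channel is additive in the cepstral domain, so per-utterance mean subtraction removes it, but it also removes any constant speaker-specific offset such as the average vocal-tract shape, which is exactly the trade-off the questioner highlights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-frame cepstral features: 200 frames x 13 coefficients.
frames = rng.normal(size=(200, 13))

# A time-invariant convolutive channel (and, by the same argument, the
# speaker's average vocal-tract shape) shows up as a constant additive
# offset in the cepstral domain.
channel = rng.normal(size=13)
observed = frames + channel

# Cepstral mean subtraction: remove the per-utterance mean over time.
cms = observed - observed.mean(axis=0, keepdims=True)

# The constant offset is gone, but so is the speaker's own cepstral mean,
# which carries the average vocal-tract information.
reference = frames - frames.mean(axis=0, keepdims=True)
print(np.allclose(cms, reference))  # prints True
```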
| 0:58:38 | we have time for one more question maybe, please be quick |
|---|
| 0:58:44 | i |
|---|
| 0:58:47 | i'm sorry, i will be fast |
|---|
| 0:58:50 | i |
|---|
| 0:58:51 | i want first to thank you, it was very nice |
|---|
| 0:58:57 | to see |
|---|
| 0:58:59 | science |
|---|
| 0:59:00 | meet technology, particularly in speaker recognition and in forensics, so |
|---|
| 0:59:06 | my comment is just to remind us of the difference between speaker recognition and forensic voice |
|---|
| 0:59:12 | comparison |
|---|
| 0:59:14 | because it really concerns both, and |
|---|
| 0:59:17 | the field |
|---|
| 0:59:18 | you presented |
|---|
| 0:59:20 | because |
|---|
| 0:59:21 | we know that when we try to do some articulatory analysis, we think like |
|---|
| 0:59:26 | this |
|---|
| 0:59:27 | we have a huge difference between controlled read speech |
|---|
| 0:59:32 | and spontaneous conversational |
|---|
| 0:59:37 | speech, right |
|---|
| 0:59:40 | for speaker recognition we could imagine that the speakers are trying to |
|---|
| 0:59:46 | be |
|---|
| 0:59:47 | cooperative and are not trying to disguise their voice |
|---|
| 0:59:50 | in forensic voice |
|---|
| 0:59:51 | comparison |
|---|
| 0:59:52 | we could imagine exactly the opposite, and here is my question |
|---|
| 0:59:58 | could a speaker deliberately |
|---|
| 1:00:01 | modify |
|---|
| 1:00:03 | their constrictions or articulatory strategies in a way that would |
|---|
| 1:00:08 | challenge the approach you exposed |
|---|
| 1:00:12 | yes, and, right, because there are certain things we can change and certain things we can't, right |
|---|
| 1:00:16 | they are given, and that's one of the things that we are trying to go after |
|---|
| 1:00:20 | there is something given in our physical instrument; one can compensate for it |
|---|
| 1:00:25 | somewhat, but we may still see residual effects, and we want to see whether we can get |
|---|
| 1:00:29 | at this residual effect, maybe |
|---|
| 1:00:31 | the bounds. you know, i have a bit of an information theory background |
|---|
| 1:00:35 | so i'm always interested in bounds, the limits of things, how much we can actually recover |
|---|
| 1:00:39 | after all we have |
|---|
| 1:00:40 | a one dimensional signal, which we project onto all kinds of feature spaces and |
|---|
| 1:00:44 | do all our computation based on that, to do all the inference problems, target speaker |
|---|
| 1:00:49 | or whatever it is, and so |
|---|
| 1:00:52 | say you manipulate the strategies; that's only one degree of freedom, or a few, that you |
|---|
| 1:00:58 | can manipulate |
|---|
| 1:01:00 | and it causes some differences, but still, if we can account for this somehow |
|---|
| 1:01:04 | can we still see the residual effects of the instrument that they have, or the |
|---|
| 1:01:10 | specific ways they are |
|---|
| 1:01:11 | changing the shape? speakers work within the instrument they have, right; you can't |
|---|
| 1:01:17 | just do random things with your articulation to create speech sounds, right, so that's |
|---|
| 1:01:23 | why this joint modelling of, you know, structure and function would be very |
|---|
| 1:01:28 | interesting, to see how much can be spoofed by people, like, you know, if |
|---|
| 1:01:32 | you're really good at it |
|---|
| 1:01:33 | it remains to be seen, you know, but |
|---|
| 1:01:36 | i'm hoping that by, you know |
|---|
| 1:01:38 | being very microscopic in these analyses we can get some insight into it |
|---|
| 1:01:43 | in a way that is very objective, not, you know |
|---|
| 1:01:46 | just |
|---|
| 1:01:48 | impressionistic, you know, the way all these experts will talk about it |
|---|
| 1:01:52 | you know, in court |
|---|
| 1:01:55 | i think that's one of the reasons |
|---|
| 1:01:57 | there was very |
|---|
| 1:01:59 | broad support for the idea: let's go at it in as objective and scientifically |
|---|
| 1:02:03 | grounded a way as possible |
|---|
| 1:02:06 | we are unfortunately out of time |
|---|
| 1:02:11 | so |
|---|
| 1:02:13 | let's thank the speaker again. thank you, thank you |
|---|