So, Tecumseh Fitch is from the University of Vienna, where he is a professor at the Department of Cognitive Biology. His main interests are in the evolution of language and vocal communication in vertebrates, and what makes this also very interesting for us is that he uses synthetic speech to investigate these questions and to test his hypotheses.

And Bart de Boer is from the Artificial Intelligence Lab at the Vrije Universiteit Brussel.

He is also interested in the cognitive bases of language, and he uses machine learning and speech technology to investigate how this combinatorial factor can be modeled. The two are also very well known for their work together, in particular the recent paper "Monkey vocal tracts are speech-ready", which we will hear about today.

So, with that, I'll hand over to them.

Thank you, Michael, for the kind introduction. So, this is the first time Bart and I have tried to do a tag-team talk like this, so we'll see how well it works. I'll start off, and then Bart will give you more of the technical details, of the sort that I'm sure you are all hungry for on a Saturday morning.

I'll start by giving some perspective on why a biologist like myself, who is interested in animal communication, would dive into speech science. I actually studied speech science with people like Ken Stevens at MIT when I was a postdoc, and I have used the tools that you invented to investigate how animals make their sounds and what those sounds mean. In other words, we use the technology of speech science to create animal sounds in order to understand animal communication; and then, in the second part of that arc, we will turn it around and ask how we can use an understanding of the animal vocal tract to understand the evolution of human speech. And the answer may surprise some of you.

Okay, so why would anyone want to synthesize animal vocalisations? Why would you want to make a synthetic cat's meow or a synthetic bark? As I said, my main reason is that I'm a biologist: I'm interested in understanding the biology of animal communication from the point of view of physics and physiology, and because speech scientists have done so much of that work, we can essentially borrow it to understand animal communication. And then we'll turn to the second part, where we try to understand how our own speech arose.

I'm sure this is very familiar to you, but I just want to very quickly run through the source-filter theory as it applies to human language. What you might be more surprised by is how broadly this theory applies across vertebrates. With the possible exception of fish, dolphins and other toothed whales, and probably a few others like some rodent high-frequency sounds, this theory, which was developed to understand our own speech apparatus basically from the 1930s through the 1970s, turns out to apply to virtually all other sounds you might think of: dogs barking, cows mooing, birds singing.

The basic idea, of course, is that we can break the speech production process into two components: the source, which turns a silent airflow into sound, and the filter, which then modifies that sound via the formant frequencies, the vocal tract resonances that filter out certain frequencies.
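To make that two-component picture concrete, here is a minimal Python sketch of source-filter synthesis: a pulse-train source passed through a cascade of second-order formant resonators. The sample rate, pitch, and formant values are illustrative assumptions, not measurements from the talk.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate (Hz), assumed
f0 = 120.0                       # source fundamental frequency (Hz)

# source: a crude glottal pulse train (one impulse per glottal cycle)
source = np.zeros(fs // 2)       # half a second of signal
source[::int(fs / f0)] = 1.0

# filter: cascade of second-order resonators, one per formant
def resonator(x, freq, bw):
    r = np.exp(-np.pi * bw / fs)           # pole radius from bandwidth
    w = 2 * np.pi * freq / fs              # pole angle from centre frequency
    return lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(w), r * r], x)

speech = source
for freq, bw in [(700, 80), (1200, 90), (2500, 120)]:  # assumed /a/-like formants
    speech = resonator(speech, freq, bw)
```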

And this is an image that may look familiar. These are vocal folds, except they are the vocal folds of a Siberian tiger. This is a big larynx: the vocal folds are about this long, so of course it makes very low-frequency vocalisations. But you can see that the basic process, this aerodynamically excited vibration, is pretty much the same as what you would see in human vocal folds. And of course the vibration rate of these vocal folds, the rate at which they slap together, determines the pitch of the sound.

You may be wondering how we did this. We didn't have a live tiger vocalising with an endoscope down its throat; you don't want to do that. This is an excised tiger larynx: it was removed from an animal that was euthanised, put on a table, and we blew air through it and videotaped the result. What that shows is that, just like in humans, we don't need active neural firing at the rate of the fundamental frequency to create the source. And that seems to be true for the vast majority of sounds: songbirds are vocalising at fundamentals of eight kilohertz, whales are vocalising at fundamentals of ten kilohertz, all using the same principle.

There are a few exceptions, and my favourite one, which many of you will be familiar with, is the cat's purr. That's a situation where there is actual muscle contraction: each contraction of muscle that generates the purr is driven by the brain, so it's one of the few exceptions where it's not this kind of passive vibration. But for the vast majority of the sounds we're talking about, including everything we know from nonhuman primates, this is the way it works.

So then that source sound, whether it's noisy or harmonic, passes through the vocal tract. When I show my students this image, I describe the formants as being like windows that allow certain frequencies to pass through. But it's certainly much more fun to listen to what a formant is.

What I've done here is use LPC resynthesis. We take human speech, which is of course the source and the filter combined, and now I'm going to take the formants of that speech and apply them to this source, a bison roaring, and this is what we hear as a result.

I think everybody can understand the words, even though it sounds rather more terrifying when it's a bison saying them. Just another random example: this is a narwhal, and here is the narwhal with my formants.
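The cross-synthesis trick behind these demos can be sketched in a few lines. This is only a minimal frame-wise version under assumed parameters, not the actual script used for the demos; the file names are hypothetical, and librosa's LPC routine stands in for whatever analysis was originally used.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

# hypothetical file names: any recorded talker plus an animal source sound
human, sr = librosa.load("human_speech.wav", sr=16000)
animal, _ = librosa.load("bison_roar.wav", sr=16000)

n = min(len(human), len(animal))
frame, hop, order = 512, 256, 16
win = np.hanning(frame)
out = np.zeros(n)

for start in range(0, n - frame, hop):
    h = human[start:start + frame] * win
    a = animal[start:start + frame] * win
    if np.sum(h ** 2) < 1e-8 or np.sum(a ** 2) < 1e-8:
        continue                                   # skip near-silent frames
    env_human = librosa.lpc(h, order=order)        # human formant envelope
    env_animal = librosa.lpc(a, order=order)       # animal's own envelope
    residual = lfilter(env_animal, [1.0], a)       # whiten the animal frame
    out[start:start + frame] += lfilter([1.0], env_human, residual)  # impose human formants
```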

Okay, so I think that illustrates the point: the vocal signal we hear is this composite of source and filter, and in these cases we can hear the filter doing the phonetic work, while the source still comes through loud and clear.

Taking these basic principles of source-filter theory, we started thinking: what kinds of cues, other than speech, might there be in animal signals? One of the first things that has now been really extensively investigated was based on the idea that vocal tract length correlates with body size, and because formant frequencies are determined by vocal tract length, maybe formants provide a cue to body size in other species.

The first part of this is easy: we anaesthetise animals and use X-rays to measure vocal tract length. It's a little harder to get them to vocalise, but when we do, and we measure the formants, we find (this is just one of many cases; these are monkeys) that vocal tract length correlates with formant dispersion, which is the average spacing between the formants. And because vocal tract length correlates with body size, that means body length correlates very nicely with formants. I first showed this in monkeys, but it has since been found in pigs, it's true in humans, and it's true in deer. This seems to be a kind of fundamental aspect of the voice signal: it carries information about body size.
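Formant dispersion itself is trivial to compute once the formants have been measured. A small sketch with made-up example values; the uniform-tube relation used for the length estimate is standard acoustics, not a claim about these particular data:

```python
import numpy as np

def formant_dispersion(formants_hz):
    """Average spacing between adjacent formants: (F_N - F_1) / (N - 1)."""
    f = np.sort(np.asarray(formants_hz, dtype=float))
    return (f[-1] - f[0]) / (len(f) - 1)

# for a uniform tube, dispersion ~ c / (2 * L), so it also yields a rough
# vocal tract length estimate (c = speed of sound, ~35000 cm/s in warm air)
disp = formant_dispersion([980.0, 2940.0, 4900.0])   # made-up measurements
print(disp, "Hz; estimated vocal tract length:", 35000.0 / (2 * disp), "cm")
```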

So this is something that we as scientists can see objectively; we can measure it. But the question is: do animals pay attention to it? It's fine if I go and measure formants and say that formants correlate with body size, but that's kind of meaningless for animal communication unless the animals themselves perceive that signal.

This is where animal sound synthesis comes in. How do we ask that question? How do we find out whether an animal is paying attention to formants? This was a long time ago; some of you may recognise this old version of MATLAB running on an old Macintosh, in which I built an animal sound synthesizer using very standard technology that most of you will be familiar with. Basically, you use linear prediction to estimate the formants, subtract those away so that you have an error signal, which we can use as a source, and then we can change the formants (shift only the formants, leaving everything else the same) and ask whether the animals perceive that shift in formants.
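A minimal sketch of that shift-only-the-formants operation: fit an all-pole model, inverse-filter to get the residual as the source, scale the pole angles, and refilter. A real stimulus generator would work frame by frame; this whole-signal version just shows the idea, and the file name, model order and shift factor are assumptions.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

y, sr = librosa.load("contact_call.wav", sr=22050)   # hypothetical recording
a = librosa.lpc(y, order=12)                         # all-pole (formant) model
residual = lfilter(a, [1.0], y)                      # inverse filter -> source signal

poles = np.roots(a)
theta = np.angle(poles)
# raise only the resonance frequencies by 10%; leave real poles untouched
theta = np.where(np.abs(poles.imag) > 1e-8, theta * 1.10, theta)
a_up = np.real(np.poly(np.abs(poles) * np.exp(1j * theta)))

shifted = lfilter([1.0], a_up, residual)             # same source, shifted formants
```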

Now, how do we do these experiments? How do you ask an animal whether it perceives something? We usually use something called habituation-dishabituation. We play a bunch of sounds in which, in this case, the formants remain the same but other aspects vary: the fundamental frequency, the duration, et cetera vary, but the formants are fixed. Once our listening animal stops paying attention (it may take ten plays, or a hundred plays, before the animal finally stops looking toward the sound), once it has habituated to the original sounds, we play sounds where we change the formants, or whatever variable is of interest. And if the animal pays attention to that, if it perceives the change and finds it salient enough to be noticeable, then it should look again.
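The trial logic reduces to a simple criterion loop. The sketch below is a toy stand-in for the field procedure: `bird_looks` replaces the human observer's judgment, and the three-misses criterion matches the one described later for the crane experiment.

```python
import random

def bird_looks(stimulus):
    """Stand-in for the field observation: did the animal orient to the speaker?"""
    return random.random() < 0.5            # toy model only, not real behaviour

def habituation_dishabituation(habituation_stimuli, replica, test, criterion=3):
    misses = 0
    while misses < criterion:               # habituate: wait for 3 no-looks in a row
        s = random.choice(habituation_stimuli)   # formants fixed; f0, duration vary
        misses = misses + 1 if not bird_looks(s) else 0
    if bird_looks(replica):                 # unmodified resynthesis as a control
        return "looked at replica: synthesis artefact, discard subject"
    return "dishabituated" if bird_looks(test) else "no response to formant shift"
```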

Okay. So the first species I actually tried this with was whooping cranes; I'll explain why in a second. I'm going to sort of walk you through this experiment. These are whooping crane contact calls, and what we did is play a bunch of the actual calls from one particular bird. They sound like this. Here's another one; they sound pretty similar to our ears. We keep playing those (these are recordings; we're playing them from a laptop), and we watch whether the listening bird looks up: we wait till the bird has its head down, feeding, we play one of these sounds, and it looks up, because it sounds like there's another whooping crane there.

So the logic is pretty simple. In the case of whooping cranes we had to do this in the winter, and it takes these birds hundreds of trials before they stop paying attention; the laptop dies, it starts snowing, et cetera, et cetera. But eventually we were able to do it: you get the bird habituated by playing these kinds of sounds over and over.

And then, just to be safe, we play a synthetic replica that we've run through the synthesizer but without changing the formants; if everything's fine, they shouldn't dishabituate to that. Here's what that sounds like. Pretty similar.

And now here's the key moment: we play either the formants lowered or the formants raised. Of course you can all hear that, because you're humans, and we already knew you perceive formants; the question is whether the birds do.

When we do this, what we find is that initially the birds respond eighty percent of the time on average, but as we get to twenty-five or thirty trials, we finally reach the last habituation trial, which by definition is one where they don't look at all; we actually require three of those in a row. Then we play the synthetic replica, and they don't look, which means our synthesizer is working. And then finally we play the test stimuli, and we get a massive jump: a dishabituation.

We've done this with many different species and always found the same thing. Paying attention to formant frequency shifts in this kind of context seems to be a very basic thing: birds do it, monkeys do it, dogs do it, pigs do it, and of course people do.

Now you might ask whether we can go further with this. For example, these are two colleagues who have used animal sound synthesis to look at what other species are actually using these formant frequencies for. In this case we can show that the deer, or the koalas, are using these sounds as indicators of body size, and the kind of evidence we have is, for example, that males played a playback of another male with lower formant frequencies, that is, with an elongated vocal tract, run away and are afraid, while females find it more attractive, et cetera, et cetera. This has now been done with many species.

Probably many of you have heard deer, but you might not have heard the koala. This is a koala; they have a very impressive vocalisation. If you're wondering how a little teddy-bear-sized animal makes that terrifying sound, it's because they actually have a vocal tract in which the larynx is pulled down, making the vocal tract much longer than it would be in a normal animal. By elongating their vocal tract, they make themselves sound bigger.

These are just a few of the many publications that use the approach I've been telling you about to dig deeper into animal communication, so I hope that makes the case that this is a worthwhile thing to do, again in a wide variety of species.

Okay, so now, getting to something that's maybe closer to what a lot of you do, I want to turn to, well, this is supposed to be part two; sorry, we only put this together yesterday. How can you turn this around and start asking questions about human communication, based on what we understand about animals?

The first fact, the kind of core fact that many people in the world of speech science have been trying to understand for a long time, is that we humans are amazing at imitating sounds. We not only imitate the speech sounds of our environment; we learn to sing songs, we can even imitate animal sounds. Basically, kids will imitate whatever sounds they hear. And it turns out that our nearest living relatives, the great apes, can't do this at all.

This is just one example; these are examples of apes that have been raised in human homes. Of course a human child, by the age of about one, is already making the sounds it hears, is already starting to say its first words and produce the sounds of its environment, whatever its native-language phonology is. And no ape has ever done that: no ape has even spontaneously said "mama", much less learned complex vocalisations.

People have known this for a long time, and the question that has been driving this field for at least a hundred years, since Darwin's time, is: why is it that an animal that seems in many ways so similar to us, that can learn to do remarkable things like drive a car, can't produce even the most basic speech sounds with its vocal tract?

That's the driving force behind the second part of the talk. And there are two theories; Darwin had already mentioned this. One is that it has something to do with the peripheral vocal apparatus, and the other is that it has more to do with the brain. Darwin said, well, they probably both matter, but the brain is probably more important. What we're going to try to convince you of now is that it is actually the brain that's key, and that vocal tract differences, although they exist, are not what keeps a monkey or an ape from producing speech.

Now, the most famous example of a difference between us and apes is illustrated by these MRIs. On the left side we see a chimpanzee, and the red line marks the vocal folds, so that's the larynx. Of course, in humans the larynx has descended in the vocal tract; it pulls down into the throat, whereas in the chimpanzee the larynx is in a high position, engaged with the nasal passage most of the time. That means the tongue rests flat in the mouth; the tongue is basically sitting like this. What happened in humans is that we essentially swallowed the back of our tongue: our larynx descends, pulling the tongue with it, so that we have this two-part tongue that we can move up and down and back and forth, and that's how we get this wide variety of speech sounds.

So the idea, which goes back to Darwin's time but really became concrete in the nineteen-sixties, is that with a tongue like that you simply can't make the sounds of speech, and therefore, no matter what brain was in control, that vocal tract couldn't make the sounds you would need to imitate speech. It's a plausible hypothesis. It goes back to my mentor Phil Lieberman, who was my PhD thesis supervisor. He published a series of papers in the late sixties and early seventies in which they took a dead monkey, made a cast of the vocal tract of this monkey, and used that to produce a computer program to simulate the sounds that vocal tract could make. There was a lot of guesswork involved, because it was one dead monkey and one cast, but they did the best they could.

What they found, plotted here in formant-one by formant-two space, is shown alongside the famous three point vowels of English, /i/, /a/ and /u/, which are found in most languages; all those little numbers in there are what the monkey vocal tract, or rather the computer model of the monkey vocal tract, could do. So they concluded that the acoustic vowel space of a rhesus monkey is quite restricted: they lack the output mechanism for speech per se.

And this is one of those ideas that, like I said, is well-founded in acoustics. If you look at what we actually do when we produce speech, and these are just a couple of videos that will be familiar ("The rainbow is a division of white light into many beautiful colours"), you see the tongue dancing around in that two-dimensional space. Here it is slowed down a bit. So we use that additional space, gained by swallowing the back of our tongue, and we clearly use it to its full extent when we produce speech. So I think this Lieberman hypothesis is quite plausible.

I became suspicious of it when we first started to do X-rays of animals as they vocalise, instead of looking at dead animals. The classic way of analysing the animal vocal tract is to take a dead goat, cut it in half, and draw conclusions from that. Getting a goat vocalising in the X-ray machine is harder than it may seem; I have had many animals sit in a setup like this without vocalising at all. But this little goat was one of our first subjects: when we played it its mother's bleats, it would respond, and this is what we saw in the X-ray.

I want you to look at this region right there. Based on the static anatomy, it had been claimed that the glottis prevents mouth breathing; in other words, the idea was that a goat can't breathe through its mouth. And here's what we actually see: the larynx pulling down, such that every one of those vocalisations passes out through the mouth of the goat.

Now, this shouldn't be that surprising: if you want to make a loud sound, you should radiate it through your mouth and not through your nose. But again, this is what the static anatomy claimed was impossible, up until we started doing this work. We've seen it in other animals too. This is a dog; you're going to see a very extensive pulling down of the larynx, a descent of the larynx, when the dog barks. This is slow motion. There's the larynx. What you can see is that every time the dog barks, the larynx pulls down, pulling the back of the tongue with it, and basically going into a human-like vocal tract configuration, but only while the animal is vocalising. The unusual thing about us is that our larynx stays low: we keep our larynx low all the time, not only while we're vocalising.

When we first got these data, almost twenty years ago now, I became convinced that the descent of the larynx can't be the crucial factor keeping animals from speaking. But unfortunately the textbooks continue to say that the reason monkeys can't speak, apes can't speak, is peripheral anatomy: that they just don't have the vocal tract for it. And then I saw the Simpsons episode where, you know, the main guy, Homer, Homer exactly, gets this monkey, and the monkey can't talk, so Homer is learning sign language, and they keep saying it's because it doesn't have the vocal tract.

So that's when we decided: okay, this dog and goat stuff isn't enough; we have to do it with nonhuman primates. Working together with Asif Ghazanfar, whose monkeys they were, and Bart, who's going to take over from here, we took X-rays like this one of the monkey vocalising. You'll see there's a little movement of the larynx, just the same as we saw in the goat and in the dog. We then traced those to create a vocal tract model, and this is where Bart's going to take over.

Do you want to take this? That looks good. Alright. Okay.

So: how do we actually build a model to create vocalisations of the monkey? If you think about it, it's a very different problem, or a problem that requires a very different solution, from what we use for human speech, because what we're trying to do is figure out what the monkey could do in principle with its vocal tract, not what it's actually doing; the whole point is that monkeys don't talk. So what we don't have is a corpus of data to which we could apply some kind of machine learning approach.

What we need instead is a really predictive approach, based on what is in a sense a very old-fashioned way of going about speech synthesis, namely articulatory synthesis. I'll briefly recap how it works, though I assume you are all intimately familiar with it. What I would like to stress, however, is that even though we keep talking about biology and about speech science, these methods were developed by people who were actually engineers, people interested in things like putting as many phone conversations on transatlantic cables as possible. So this is very much theory that was developed by engineers, by people who were working with the same goals as you.

So how does articulatory synthesis work? You start with an articulatory model, an idea of how the vocal tract works. With that model you can create different positions of the tongue, lips, et cetera, and from that you need to calculate what is called an area function: basically, the cross-sectional area of the vocal tract at each position along the vocal tract. It turns out that the precise details of the shape don't matter; the area is the thing that counts. For instance, there is a right angle here in the vocal tract, but because of the wavelengths involved you can ignore that, so you can basically model the tract as a straight tube with a circular cross-sectional shape. Of course, if you want to put that into a computer model, you have to discretise it, so what you end up with is what is called a tube model: a number of tubes along the length of the vocal tract, from the larynx to the lips. On the basis of that, you can then calculate the acoustic response, either in the time domain or in the frequency domain. So that's what we're going to do.

we do that for the monkey model

this is the x-ray image that to come sages child

with the outline

and in red here you can see the outline of the vocal tract

so this is what we have this is what we start with we have we

had about a hundred of these

and i guess they were made by hand that ratings were made by hand and

so what we first need to do is to figure out

how the sound waves propagate through this tract

and for that the technique that we use is called a medial axis transform so

it's basically you're trying to squeeze

a circle

through that tract and that circle basically represents the propagating acoustic wavefront and if the

line in the middle it's kind of the center of the wavefront and the radius

of the circle

for the diameter of the circle as the diameter of the vocal tract

so this is what you end up with

and so

you can then calculate for each position

in the vocal tract

from the glottis to the lips

the diameter

okay so you have it

a function

the diameter of the vocal tract

at each point in the vocal tract
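For the medial axis step, off-the-shelf image-processing tools do essentially this. A sketch assuming a boolean `mask` that is True inside a hand-traced tract outline (the toy rectangle below just makes it runnable); ordering the axis points from glottis to lips is a further step not shown:

```python
import numpy as np
from skimage.morphology import medial_axis

# toy stand-in for a traced vocal-tract outline: True inside the tract
mask = np.zeros((64, 256), dtype=bool)
mask[24:40, 8:248] = True

skeleton, distance = medial_axis(mask, return_distance=True)

ys, xs = np.nonzero(skeleton)           # points on the medial axis
diameters = 2.0 * distance[ys, xs]      # diameter of the largest inscribed circle
# remaining (nontrivial) step: order (ys, xs) along the axis, glottis to lips,
# to obtain diameter as a function of position along the tract
```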

However, this is just part of what we need: we need the area, and the diameter isn't enough. So the problem is to calculate the area on the basis of the observed diameter. Fortunately, it turns out that, to a good approximation, for the monkey vocal tract the function converting diameter to area is more or less the same everywhere in the vocal tract. How did we figure that out? Apart from the X-ray movies, we also had a few MRI scans of an anaesthetised monkey. If you look at those (this is the side view, so here are basically the monkey's lips, this is its vocal tract, here's the larynx), you can make cross-sectional cuts, and you can see that the shape of the vocal tract at these different cross-sections (it's not quite a parabola, but this particular shape) is more or less the same everywhere.

So what you want to know is: for a given opening of the vocal tract, how large is the area? Suppose the diameter were about this; then the area would be this. If you open up further, then obviously the area gets bigger, and it turns out, it's just a matter of integration, that the area is proportional to some constant times the diameter to the power of 1.4. There's no deep theoretical reason for that value of 1.4; it's something we learned from observation. Now, by applying that function to the diameters that we observe, we actually obtain the area function: position along the vocal tract on one axis, and the area at each point on the other.
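That empirical mapping is one line of code. The exponent 1.4 is the observed value just described; the scale constant here is a placeholder:

```python
import numpy as np

def area_function(diameters, k=1.0, beta=1.4):
    """Empirical diameter-to-area mapping from the MRI fit: A = k * d**beta.
    beta = 1.4 is the observed exponent; k is an assumed scale constant."""
    return k * np.asarray(diameters, dtype=float) ** beta

areas = area_function(np.linspace(0.3, 2.0, 100))   # toy diameter profile (cm)
```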

The next step is turning that into formants, and for that we use, again, a very old-fashioned, classical approach: an acoustic model, an electric line analog of the vocal tract. Here again you can see that, historically, a lot of this theory was developed by electrical engineers, because it's an electronic circuit: for each of those discrete tubes, the electric line analog basically models the physical wave equation with a little electrical circuit, and from that we can calculate the formant frequencies. So for each of those hundred tracings we can calculate the first, second and third formants, and these are the values we actually calculated for all those points.
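A minimal frequency-domain sketch of such a tube computation: each discrete section contributes a lossless transmission-line (ABCD) matrix, and formants appear as peaks of the glottis-to-lips volume-velocity transfer function. The real model includes losses and realistic terminations; this one assumes a closed glottis and ideally open lips.

```python
import numpy as np
from scipy.signal import find_peaks

def tube_formants(areas_m2, length_m, c=350.0, rho=1.15, fmax=5000.0):
    """Resonances of a lossless concatenated-tube tract (volume-velocity
    source at a closed glottis, pressure-release termination at the lips)."""
    dl = length_m / len(areas_m2)
    freqs = np.arange(50.0, fmax, 5.0)
    gain = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        kl = 2 * np.pi * f / c * dl                # phase per tube section
        M = np.eye(2, dtype=complex)
        for A in areas_m2:                         # glottis -> lips ordering
            Z = rho * c / A                        # characteristic impedance
            M = M @ np.array([[np.cos(kl), 1j * Z * np.sin(kl)],
                              [1j * np.sin(kl) / Z, np.cos(kl)]])
        gain[i] = 1.0 / abs(M[1, 1])               # |U_lips / U_glottis| with P_lips = 0
    peaks, _ = find_peaks(20 * np.log10(gain))
    return freqs[peaks]

# sanity check: a uniform 17 cm tube has odd quarter-wavelength resonances,
# about 515, 1544, 2574, ... Hz for c = 350 m/s
print(tube_formants(np.full(40, 4e-4), 0.17))
```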

At this point we have basically determined what the acoustic abilities of the monkey vocal tract are. From there, there are different things you could do. In principle, on the basis of this kind of data, you can actually build a computational articulatory model. This is something that Shinji Maeda did in 1989, again quite some time ago, on the basis of very similar data about the human vocal tract.

But it's not certain that we have enough data to do the same thing. What Maeda did was make a thousand tracings of the vocal tract, and if you know how difficult it is to make a single tracing, you can imagine how much time he must have spent on making this model. He then subjected these articulations to a factor analysis and derived an articulatory model, an articulatory synthesizer, so that you could use the model to synthesize new sounds. The problem is that we don't have that many tracings, so we probably couldn't make a good-quality model. And what we wanted to do, as Tecumseh is going to explain in a moment, is re-synthesize some of these sounds, which is still very challenging with an articulatory synthesizer and wasn't really necessary for our purposes, so we took a slightly different approach.

One of the things we wanted to do was simply quantify the articulatory abilities of monkeys and compare them to humans. To do that, we can measure the acoustic range of the monkey vocalisations, and one way to do that is by calculating the convex hull. Again, I assume you're all familiar with what a convex hull is, so I'll just very quickly show you how we did it. To calculate the convex hull, you start with one of the extreme points and then fit a line around all the points, as if you took a rubber band and squeezed it around them. Then you can do several things: you can calculate the area of the convex hull, or you can calculate the extent of the points along the first formant or the second formant. We based ourselves on the area and the extent.
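With SciPy this takes a few lines; the (F1, F2) points below are made up purely to show the calls:

```python
import numpy as np
from scipy.spatial import ConvexHull

# rows of (F1, F2) in Hz, one per traced configuration (made-up values)
pts = np.array([[300, 2300], [700, 1200], [450, 1900],
                [600, 2100], [350, 1400], [500, 1700]])
hull = ConvexHull(pts)

print("vowel-space area:", hull.volume)    # in 2-D, .volume is the enclosed area
print("F1 extent:", np.ptp(pts[:, 0]))     # range along the first formant
print("F2 extent:", np.ptp(pts[:, 1]))
print("hull corners:", pts[hull.vertices])
```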

One of the things we wanted to know is how the monkey would sound if it were speaking. To do that, we modified some human sounds, in a way very similar to what Tecumseh just showed with those recordings. So this is a sentence spoken by a human; we decompose it into the formant tracks, which represent the filter, and the source, and then we modify those formants to make them more similar to a monkey vocal tract. In the examples that Tecumseh played for you, the formants were just shifted up or down; we did a little more. First, we need to shift the formants up a bit, because the monkey vocal tract is shorter than the human vocal tract, so the formants tend to be higher. But in addition, we found that the range of the second formant is somewhat reduced in the monkey vocal tract in comparison to the human vocal tract, so we also compressed the range of the second formant. And then we resynthesized the sound.

Now, the thing about an analysis in terms of source and filter is that it's complete: if you have the source information and the filter information, you can re-synthesize the sound perfectly; there is no loss. So if we had used the human source with the modified formants, the result would probably have sounded too perfect. We wanted a source that was more monkey-like, so we also synthesized a new source, based on a very simple model of the monkey vocal folds, which vibrate in a much more irregular way than human vocal folds do. We took our monkey source, applied the modified formant filter to it, and then we got our "real" monkey vocalisation.
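Putting the recipe together: build a jittered, monkey-like pulse source, then filter it with the modified (raised, F2-compressed) formants. This static sketch uses assumed numbers throughout; the actual system applied time-varying formant tracks taken from the human sentence.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
rng = np.random.default_rng(1)

# irregular source: a pulse train whose period jitters from cycle to cycle
src = np.zeros(fs)                        # one second of signal
t = 0.0
while t < 1.0:
    src[int(t * fs)] = 1.0
    t += (1.0 / 140.0) * rng.uniform(0.85, 1.15)   # +/-15% period jitter (assumed)

def resonator(x, freq, bw):
    r = np.exp(-np.pi * bw / fs)
    w = 2 * np.pi * freq / fs
    return lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(w), r * r], x)

# raised formants with a compressed F2 range; purely illustrative values
monkey_like = src
for freq, bw in [(860, 90), (1500, 110), (2950, 160)]:
    monkey_like = resonator(monkey_like, freq, bw)
```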

And this is where Tecumseh takes over again.

Okay. Hopefully that satisfied your morning need for technical details. So, just a synopsis of the whole process: we X-rayed the monkey making about a hundred different vocal tract configurations, basically everything the monkey did while he was in our X-ray setup. We traced those; we used the medial axis transform and then the diameter-to-area function to create the area-function model of the vocal tract; and from that we can compute synthesized formants.

And here is what we get. Here's the original data from Lieberman that I showed you at the beginning: the red triangle represents a human female's vowel space, the F1-F2 range of a human female, with /i/, /a/ and /u/ making up the corner points, and the little blue triangle is what the old model from Lieberman said a monkey could do. And this is what our model looks like in comparison. Unlike Lieberman's model, which is very restricted, we can see that what a monkey actually does covers quite a wide range of the first formant, with a somewhat compressed second formant.

We used that to create monkey vowels: artificial monkey vowels that occupy the corners of that convex hull. With five monkey vowels in a discrimination task, humans are basically at ceiling, so they do just as well with the monkey vowels as they do with human vowels. And what that shows is that the monkey's capacity to produce a diverse set of vowels, the same as the number in most human languages, namely five, is absolutely intact: the monkey's vocal tract has no problem doing that.

We also have good indications that things like bilabial and glottal stops, et cetera, many of the different consonants, would be possible, so clearly the monkey vocal tract is capable of producing a wide range of sounds. Now, that all sounds very dry, so it's kind of more interesting to hear what our model sounds like when it's trying to imitate human speech. The model for this was my wife: we had her speak a bunch of sentences, but rather than playing her first, so that you'd know what to listen for, I'm going to play the monkey model first and see if you can understand what the monkey is saying.

[plays monkey-model audio] Everybody got it, right? Okay, and this is my wife's formants with that synthetic monkey source. [plays audio]

What you can hear is that the phonetic content is basically preserved. The human formants are lower, which makes sense because humans are larger than monkeys, so it has a more bassy, less wheezy sound to it, but the phonetic content is basically present. What this shows is that whatever it is that keeps a monkey, or an ape, and thus presumably our ancestors, from speaking, it's not the peripheral vocal tract; it's not the anatomy of the vocal tract. That's the conclusion we drew in this paper, which was called "Monkey vocal tracts are speech-ready".

What that tells us is that rather than looking more at the anatomy of the vocal tract, we should be paying attention to the brain that's in charge. It would take another talk to explain: we have lots of evidence about what it is about the human brain that gives us such exquisite control over our vocal apparatus, but it doesn't seem that the vocal apparatus itself is the crucial thing. To put it in other terms: we've done this with the monkey, but I'm quite sure the same thing would be true of a dog or a pig or a cow. If a human brain were in control of a dog or a cow or a pig or a monkey, the vocal tract would be perfectly able to communicate English. So there's a lot of work to do before we make talking animals, but it's going to involve the brain, not the vocal tract.

Okay, so that is our story; that was actually faster than we thought. Just to state our general conclusions: you can use these methods, which were mainly developed by physicists and engineers to understand human language and human speech, to understand and synthesize a wide variety of vertebrate sounds. I mainly work with birds and mammals, but other people have used these same methods on things like alligators and frogs, so these are very general principles: what you all learned in your intro-to-speech class actually applies to most of the species we know about. It's not the vocal tract that keeps most mammals from talking; it's really their neural control of that vocal tract. And I think the more general message, probably meaningful to pretty much everybody in this room, is that a better understanding of the physics and physiology of the vocal production system, whether it's in a dog, a monkey, a deer or a whale, can really play a key role, and should play a key role, in speech synthesis.

And Bart, do you want to say a few extra words of wisdom? No? Okay, then I think we have plenty of time for questions. Thanks to all the people who did this work, and thank you for your attention. Shall we take questions at the mic, or should I repeat them?

Q: My question is about the dog example and the vocal fold dynamics when it barks. The dog isn't trying to imitate a human; it's just what dogs do when they bark. That's the first point. And the second is that in the last part you said the key difference lies in neural mechanisms. So my question is about the vocal fold dynamics on one hand, and the mapping onto the neural side on the other: is there any kind of cue for this?

A: Good question. Are you asking about the recovery of the source properties, or...?

Q: I'm asking about the neural mechanism that is responsible for that.

A: For the auditory perception or for the production? Okay. So what we know, and I don't have a slide for this, is that in humans there are direct connections from the motor cortex onto the motor neurons that actually control the laryngeal and tongue muscles. Those direct connections from cortex onto the laryngeal motor neurons are not present in most mammals: they are absent in other primates, and they appear to be absent in cats and dogs, et cetera. But in those species which are good vocal imitators, and this includes many birds, the parrots and mynah birds, but also some pinnipeds, elephants, and various cetaceans, in all of those groups that have been investigated, these direct connections, the equivalent of what we humans have, are present. So the current theory for what it is about our brains that gives us this control is that we have direct connections onto the motor neurons, whereas in most animals there are only indirect connections, via various brainstem intermediaries, onto the vocal system itself. In other words, we've essentially got a new gear shift on this shared vocal tract, one that gives our brains more control over it than we would otherwise have.

Q: Thanks a lot, very interesting talk. I myself have a few pets at home, and I would love to have evidence of what they are saying. There have also been papers published recently about converting brain signals to speech, using speech synthesis for the reconstruction of speech from brain recordings. Do you think it would be possible to do something similar for our pets, to decode what they are trying to express from a neural signal?

A: That's an interesting question. So: given that we can use neural signals, from fMRI or ECoG, to synthesize okay speech, could we do the same thing for animals? My answer, for most animals, because of my answer to the first question, would be no. The reason is that in humans there is a correspondence between the cortical signals we can measure with something like fMRI or ECoG and the actual sounds that are produced, but in most animals it is mainly the brainstem and the midbrain that control these sounds, when a cat meows or a dog barks. In fact, you can remove the cortex, and a cat will still meow and a dog will still bark, in the same way that a human baby born without a cortex will still cry and laugh in a normal way.

What I would also say, and it would be a lot easier to do, so it's probably a better use of your research money, is to see whether you can synthesize laughter and crying from a cortical signal. My prediction would be that even in humans you won't be able to do it. A fake laugh, like when I go "ha ha", which is not a real laugh, should be cortically controlled; but when I really laugh or really cry, that's coming from this core brain that's very hard to measure, so you shouldn't be able to synthesize realistic laughter and crying even in humans, maybe.

Q: Do you have any evidence of the point in evolution at which this connection between the brain and the vocal tract starts appearing?

A: The unfortunate answer is no. Probably many of you know there's a whole field, I used to have a slide about this, that is essentially trying to reconstruct from fossils when in the history of human evolution this capacity for speech occurred, and the old argument was always: if we could know when the larynx descended, we would know when speech occurred. What I think I've shown you with all this work is that it's not laryngeal descent that's crucial for speech; it's these direct connections. And for those there is unfortunately just no fossil cue: that's exactly the stuff that doesn't preserve, even for a Neanderthal, much less deeper in the fossil record. You would need detailed neuroanatomy at the micron level to answer that question, so it's even hard with a genome.

A: If I may: Tecumseh and I agree on the importance of the neural control, of course, but we may disagree a little on the precise interpretation of what the vocal tract data mean. In a sense you could say that there has been some fine-tuning of the human vocal tract for vocalisation, and if you are a little liberal in the interpretation of what we find in the fossil record, you can say it happened somewhere between three million and three hundred thousand years ago. It's not very precise.

A: The evidence for that is based on various cues that supposedly indicate, from the base of the skull, what the position of the larynx and tongue would have been. But since I have these slides, I took them out because I thought we'd run too long, I want to show you some examples of animals that have independently modified their vocal tract in a way that has nothing to do with speech. One way to make your vocal tract longer is to make your nose longer, like this proboscis monkey, or various animals like elephants. You can also stick your lips out, which many species do: if you do this you sound bigger, and if you do this you sound smaller. Or you can do more bizarre things, like make an extension of your nasal tract that forms a big crest, like that dinosaur up there, or like these birds which, because the sound source is at the base of the trachea, have an elongated trachea. All of these adaptations seem to be ways of making the animal sound bigger.

Here's a nice example. This is an animal with a permanently descended larynx, a red deer, and you'll find it makes a pretty impressive sound. [plays video] The first thing you probably noticed in that video is the penis pumping as they roar; ignore that, and look at what's happening at the front of the animal. You'll see the larynx going back and forth. When we first saw these videos we wondered what this was, and it turns out to be the resting position of the larynx: a permanently descended larynx in a ruminant. Now watch what it does when the animal vocalises.

I think we can all agree that that's a much more impressive descent of the larynx than the few centimetres that happens in humans. And these are not the only such species, because in our own species there is a secondary descent of the larynx that happens only in males and only at puberty, and I think that's exactly the same kind of adaptation that makes this deer, or a roaring bird, sound bigger. So I guess that's where we differ: I think that even if we knew when the larynx descended in humans, it could have been an adaptation simply to make yourself sound bigger, and it might have been a million years after that that we started using it for speech. That's why I really don't think the fossils are going to give us the answer. The only way we're going to get it, I think, is from genetics: now that we are recovering genomes from Denisovans and Neanderthals, that might help us answer this question about the direct connections.

Q: I just want to mention that the vocal tract is of course only part of the story; my question is about what animals actually communicate with the voice source. You've been talking about the vocal tract, but a lot seems to be done with the voice source. Do you have an idea of the vocabulary that is available to these species when they use their voice, not only for emotions, but for social behaviours?

A: We've actually got quite a lot of evidence about overall vocabulary size for different species, but most of it comes from relatively intuitive methods: scientists listen and say, there are about five sounds here, there are about twenty sounds there. Only for a few species have we really done what we need to do, which is playback experiments to see what the animals discriminate, and I would say in many cases those show us that something we thought was one call, say a meow or a bark or a growl, actually has multiple variants. So I think a conservative number for animal vocabularies is something like fifteen sounds, and a less conservative number would be something like fifty different sounds. In some birds it goes a lot larger than that, but if you're talking about your average mammal it's somewhere in that range; roughly thirty would be a good nonhuman primate vocabulary size of discriminable calls that have different meanings.

Of course there are animals that can make thousands of different sounds, for example birds in their songs, or whales in theirs, but they don't appear to use these to signal thousands of different meanings. Then we can't talk about vocabulary anymore; we have to start talking about something more like phoneme or syllable types, rather than meanings.

Q: Sorry, unless there's somebody else first: what do we know about the frequency resolution of monkey hearing? We could hear the relative positions of all the formants, but can they perceive them?

A: Absolutely. Most monkeys have a higher high-frequency cutoff than us; most monkeys can hear up to forty or even sixty kilohertz, so their high frequencies are more extensive than ours. But where it counts, in the low frequencies, they have perfectly good frequency resolution: from five hundred hertz up to twenty-five hundred or thirty-five hundred hertz, which is where all the formant information is, they hear just fine. And that's why, of course, an animal like a dog or a chimpanzee, or basically any other species you care to name, can learn to discriminate different human words. Virtually every dog knows its name, and in some cases you can train a dog to discriminate between hundreds or even thousands of words. So the speech perception apparatus seems to be built on basically widely shared perceptual mechanisms.

Q: Sorry, I know nothing about speech synthesis, and this may be an odd place to say that, but why actually did you need to do all this? Why not do the sort of more standard phonetic thing and just record loads and loads of monkey vocalisations and measure the formants? What would happen if you did that?

A: Well, we've done that, and we've actually looked at that subset of the sounds. Remember what we have: a sample of everything these vocal tracts do, and that includes things like feeding, chewing, swallowing, et cetera. It also includes a class of non-vocal displays. Most nonhuman primates, well, most monkeys and apes, do things like this, which is called lip smacking. It's a very typical primate thing, but it's virtually silent; they make maybe a tiny bit of sound, and in some species they actually vocalise while they do it. It turns out that the monkey is doing a lot more with its vocal tract in these visual displays than it does in its auditory displays. So if we take just the vocal tract configurations where the monkey is making a sound, that's a subset of what the vocal tract can actually do, and precisely these non-vocal communicative signals, you could call them visual communication signals, contain a lot of the interesting variance in vocal tract shape. Because those are silent, we have to figure out what they would sound like if the monkey vocalised, and that's why we had to do all this work; that's why it took years.

and then adjust and to that

well i guess coincidentally almost at the same time as our paper came out that

we change the way and according to which just mentioned here in the front and

came up with the paper where they get exactly what you would use it and

they five and basically that

actually what the user to different monkey species act-utterance but and they can produce a

surprisingly large range of silence that especially surprising if you compared to what the lieberman

had claimed that they could produce

but not as large as the range of sounds that are mobile produced so

they do mainly not produce in their in their actual productions the potential that they

have with their vocal tract

Q: I would like to confirm that I understood correctly what you say on this slide: that the vibration is generally passive. Is the output of an experiment like this, where just airflow is blown through, enough to say that the vibration is generally passive? I think this is too risky, because this is exactly what would happen if I were dead and you pushed airflow through my vocal folds; I don't think much would be different. In order to say that it is generally passive, I think you have to look more at neuronal activity, and not just at this kind of experiment. I respect this work, but I think it is too dangerous to say this.

A: I think there may be a misunderstanding, because we're not saying that you don't need muscles to put the larynx into phonatory position. Of course you do; in this case I moved the tiger's larynx into the phonatory position myself. What we're saying is that the individual pulses that make up the fundamental frequency, the openings and closings of the glottis, are what is passively determined, by things like muscle tension and air pressure. So we're not saying that muscle activity plays no role; we're saying it doesn't have to happen at the periodicity of the fundamental frequency. And that's obvious if you think about a bat producing sounds at forty thousand hertz: there's no way neurons can fire at that rate; neurons basically can't fire faster than about a thousand hertz. So even if it could work for something like an elephant, and it does work for something like a cat purring at thirty hertz, it could never work for the high vocalisations: not for a cat at two thousand hertz, and certainly not for these animals producing in the high kilohertz range. It has to be passive, because there's no way neurons can fire, or muscles can twitch, that rapidly. So the claim is not that humans, or any animal, don't need muscles to position and control the larynx; you do. It's only that you don't need muscle activity at the fundamental frequency. Does that make sense?

Q: Yes, that's better.

Q: I'm just curious: you and Lieberman both did work trying to figure out exactly the same thing, on the same subject, and came to radically different conclusions. So what was it about Lieberman's approach: is that approach never going to work, or what was the issue that made the difference between what you did and what he did? And what can that teach us for other things where we want to draw conclusions?

A: I would say, and maybe you can comment on this too, that from the point of view of the technology, what we do to understand how you go from a vocal tract to formant frequencies has not changed much. They did a pretty good job given the computers they had; their simulation was pretty good. Their problem was in the biology: they took a single dead animal and expected that dead animal to tell them the range of motions possible in a living animal's vocal tract. So they had no indication of the dynamics of the vocal tract, and that's what we needed these X-rays of a living monkey to find out.

Q: Okay, but you're not saying that you can never figure out what's going on from a dead animal?

A: By the way, Dennis Klatt, who should be a familiar name to people working on speech synthesis, was a co-author of one of these papers, and he was basically the one who did the acoustic modeling work. At the time there were a few competing labs working on speech synthesis, and the acoustic model I used for our model is basically contemporaneous with his; it's quite similar, indeed, classic stuff. So basically they just didn't have the data. It's kind of like eighties neural nets versus Google: they just didn't have the data, and we do have the data.

Q: Yes, and a follow-up: you defined a vocabulary of, say, fifteen to fifty different sounds. But what are the semantics they are trying to express? A bird and, say, a primate presumably have very different things to express.

A: There's a certain set of core vocalisations that are very widely shared among species. For example, sounds that mean threat, that say "I'm big and scary", tend to be low and have very low formants; sounds that are appeasing, that say "don't hurt me, I'm just a little guy", tend to be high frequency. We see that class of vocalisations very widely across mammals and birds. Then we have a class of mating vocalisations that a lot of species produce, but these typically sound very different: sometimes it's males just going "moo", and sometimes it's much more interesting and complicated. Then there are typically mother-infant communications, sounds that, in mammals in particular, the mother uses to communicate with her infant; again very widespread. And then there's the really weird stuff, like whale songs or echolocation clicks, that is really only found in particular groups. So I'd say there's a kind of shared core of semantics, and then, because it's biology, all kinds of weird stuff in the corners. But if you count parental care, aggression, affiliation, and then alarm calls and threat calls, which are pretty common, a handful of maybe five semantic axes would probably cover the standard repertoire.

A: And there are some vocalisations that basically say "I'm here", while other vocalisations try their best to hide that; those can be very high-frequency, quiet things that tail off, which makes them hard to localise. Various alarm calls are like that.

Q: So, in fact, a dog, my dog in fact, understands quite a lot of human words. But the vocabulary it can express is so small: it has to map whatever it wants to say onto maybe seven sounds or so. If I ask it something and its possible responses are so constrained compared to what it understands, that seems a kind of frustration.

A: Well, I think that is a fundamental finding of animal communication: animals understand a lot more than they can say. We have many species, for example, that understand not only their own species' calls but can learn the alarm calls of other species in their environment; and of course animals raised with humans learn to understand human words, which none of those species ever produces. So, just as with a child, or indeed any of us, the receptive vocabulary, the words we understand, is much larger than the set of words we typically say. For most animals, I think, the receptive vocabulary is large and the productive vocabulary is very limited. Whether they find that frustrating or not, I don't know; that's a harder question.

Q: So, do humans also have more control over the voice source? In your model, the excitation signal you used was much more irregular. Could you say more about how you model that, and about what makes the human source clearer and more regular?

A: Let me go back to this image. We've now done a lot of excised larynx work, and one of the things we found is that most species can very easily be driven into a chaotic state, where rather than this nice regular harmonic process that we see here, you essentially get coupled oscillators in the vocal folds generating chaos. You can see the classic steps, biphonation, period doubling, into chaos in the vocal folds of virtually every species we have looked at.
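The period-doubling route to chaos mentioned here is the same one seen in the simplest nonlinear systems. The logistic map below is of course not a vocal-fold model; it just shows the signature: one value per cycle, then two, then four, then aperiodic output as the control parameter grows.

```python
def logistic_orbit(r, n_transient=500, n_keep=8):
    """Iterate the logistic map x -> r * x * (1 - x) and return a few values
    after the transient has died out."""
    x = 0.4
    for _ in range(n_transient):      # discard the transient
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(n_keep):
        x = r * x * (1.0 - x)
        orbit.append(round(x, 3))
    return orbit

for r in (2.8, 3.2, 3.5, 3.9):        # increasing "driving" parameter
    print(r, logistic_orbit(r))       # period 1 -> 2 -> 4 -> chaos
```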

It seems to be very easy for most animals to go into a chaotic state, and that's reflected by the fact that many sounds we hear animals produce have a chaotic source. Monkeys do this all the time, and even dog barks are like that: they let themselves use chaos much more than we do in speech. Unless you're Batman, nobody talks like that; we favour this harmonic source for most things, though if you listen to a baby crying you'll hear plenty of chaos. So what's hard to say is whether we humans can produce chaos with our vocal folds but simply choose to use this nice, regular, harmonic, clear-pitch signal, because it's better for understanding or because it sounds nicer, or whether our vocal folds are actually less inclined to go chaotic than those of other species. That's a question I don't think we can answer at present. But we certainly use a lot less chaos: with monkeys it's the most common thing you'll hear; these threat growls are chaotic. And that's what we were trying to model in the synthesis.

are chaotic and so that's what we were trying to model in the sentence

so i've done if you

models where there's interaction between the vocal tract in the vocal folds and also looking

at chaotic vibrations and one of the other things that you find even if you

get these chaotic vibrations is it's somewhat well it's

quite a bit harder to control vocal fold onset so tends to be more gradual

and which makes for instance it almost impossible to make a distinction between voiced and

voiceless

that consonants which are pretty important in speech and so am i just find out

there but it seems that this

more

regular vibration of the human vocal fold is useful for speech whether it's you know

being

the being used by speech because that way or because whether it has become that

way because it useful for speech that's another question

Okay. Thank you very much.