0:00:16Well, the main thing I'm grateful for
0:00:20is the award and this wonderful medal. It's an
0:00:23amazing honor.
0:00:28It's particularly pleasing to me because I love this community. I love the Interspeech
0:00:33community and the Interspeech conferences.
0:00:38Some people in the audience, I don't know who ??, but she knows that
0:00:43I'm particularly proud of my ISCA,
0:00:46previously ESCA, membership number being thirty.
0:00:50And here is a list of the conferences in the Interspeech series, starting
0:00:56with the predecessor of the first Eurospeech, the meeting in Edinburgh,
0:01:02all of the Eurospeech conferences, and the ICSLP
0:01:05conferences, and since 2000 the Interspeech conferences,
0:01:08and the ones marked in red are the ones I was actually at.
0:01:14And another four where you'll find my name in the program as
0:01:20co-author or committee member or area chair.
0:01:24And so there are only three of them,
0:01:27you see, that I have nothing to do with: it's Geneva, it's Pittsburgh and it's Budapest.
0:01:32I have actually been to
0:01:33Pittsburgh and I've been to Geneva.
0:01:36Pity about Budapest.
0:01:38Such a lovely city and I'll probably never get the chance. I missed it in
0:01:44However, I love these conferences;
0:01:52it's the interdisciplinary nature that I particularly love.
0:01:58You heard from the introduction that
0:02:02interdisciplinarity is ...
0:02:04well, it's at the heart of psycholinguistics;
0:02:07we are an interdisciplinary undertaking.
0:02:11But I loved the idea from the beginning of bringing all the speech communities together
0:02:16in a single organization and a
0:02:20single conference series.
0:02:24I think the founding fathers of the organisations, the founding
0:02:30members of Eurospeech,
0:02:32quite a broad team there,
0:02:35and the founding
0:02:37father, or founding fellow, because we
0:02:40never knew who it was, of ICSLP, that was Hiroya Fujisaki.
0:02:44These people were visionaries,
0:02:46and the continuing success of this conference series is a tribute
0:02:52to their vision
0:02:54in the 1980s and early 90s.
0:02:58And that's
0:03:00that's why I'm very proud to be part of this
0:03:04community, this interdisciplinary community.
0:03:10I love the conferences and I'm just tremendously grateful
0:03:14for the award of this medal, so thank you very much to everybody.
0:03:22Back to my title slide.
0:03:27I'm afraid it's a little messy;
0:03:29there are all my affiliations on it. Tanja
0:03:32already mentioned most of them. You would think, wouldn't you, that
0:03:36the various people involved would at least have chosen the same shade of red.
0:03:43Down on the right-hand side is my primary affiliation at the moment,
0:03:48the MARCS Institute at the University of Western Sydney. My previous European
0:03:52affiliations, in which I still have an emeritus position, are on the
0:03:55left at the bottom.
0:03:58It's the upper layer of logos there that
0:04:03I want to call your attention to, for a practical reason.
0:04:07So on the right is the Centre of Excellence for the Dynamics of
0:04:11Language, which is
0:04:12an enormous grant actually; it's the big prize in the Australian
0:04:17grant landscape,
0:04:19and this is
0:04:20this is going to run for
0:04:22seven years. It's just started. In fact, if I'm
0:04:25not mistaken, today is actually the first
0:04:28day of its operation. So it was just awarded, we've just been setting it up
0:04:32over the last six months, and it's starting off today.
0:04:36And it's a grant worth some 28 million Australian dollars over seven years.
0:04:42And on the left of that is another big grant
0:04:46running in the Netherlands; it's been going for about a year
0:04:49and a half now:
0:04:50Language in Interaction.
0:04:52And that's a similar kind of undertaking, and again it's 27 million euros
0:04:58over a period of ten years.
0:05:02It is remarkable
0:05:03that two
0:05:04government organizations, two government research councils, on different sides of
0:05:10the world, more or less simultaneously saw it was really important to put some serious money
0:05:18into language research, speech and language research.
0:05:22Okay, now the practical reason that I wanted to draw
0:05:24your attention to these two is that they both have websites.
0:05:29If you have
0:05:31bright undergraduates looking for a PhD place
0:05:34at the moment, please go to the Language in Interaction website, where every
0:05:39six months for at least the next six years there will be a
0:05:42bunch of new PhD positions advertised.
0:05:47We are looking worldwide for bright PhD
0:05:51candidates. It's being run mainly as a training
0:05:53ground, so there are mainly PhD positions on this grant.
0:05:57And on the right: if you know somebody who's looking for a postdoc position, we are,
0:06:01in
0:06:02the Centre of Excellence, about to advertise a very large number of postdoctoral positions, most
0:06:08of them requiring a linguistics background,
0:06:10but please go and look at that website
0:06:12too, if you or your students or anybody you know
0:06:16is looking for such a position.
0:06:20On to my title, Learning about speech. Why did I choose that?
0:06:25As Tanja
0:06:27rubbed in,
0:06:28there weren't many topics that I could have chosen.
0:06:33In choosing this one
0:06:37I was guided by first looking at the abstracts for the other keynote
0:06:41talks in this conference.
0:06:45And I discovered that there is a theme:
0:06:48two of the others actually have learning in the title,
0:06:51and all of them address some
0:06:54form of learning about speech, and I thought, well, okay,
0:07:00it would be really useful,
0:07:02in the spirit of encouraging interdisciplinary communication and integration across the various
0:07:10Interspeech areas,
0:07:12if I took
0:07:14the same kind of
0:07:18general theme
0:07:19and started by
0:07:22sketching what I think are
0:07:25some of the most important basic attributes
0:07:29of human learning about speech. Namely:
0:07:33that it starts at
0:07:34the very earliest possible moment,
0:07:37no kidding,
0:07:39I will illustrate that in a second.
0:07:42That it
0:07:43actually shapes the
0:07:45processing; it engineers the
0:07:48algorithms that
0:07:50are going on in your brain.
0:07:52That is, the speech you learn about
0:07:55sets up the processing that you're going to be using for the rest of your
0:07:59life. This
0:07:59was also foreshadowed in what Tanja just told you about me.
0:08:04And it never stops; the learning never stops.
0:08:08So, on to
0:08:11the first part of that.
0:08:13So let's listen to something.
0:08:15Warning: you won't be able to understand it.
0:08:18Well, at least I hope not.
0:08:40Okay, I see several people in the audience
0:08:42making ...
0:08:46movements to show that they have understood what was going on.
0:09:00What we know now is that
0:09:03infants start learning about speech as soon as the auditory system that they have
0:09:08is functional.
0:09:10And the auditory system becomes functional in the third trimester of the mother's pregnancy.
0:09:16That is to say,
0:09:17for the last three months before you are born you are already listening
0:09:22to speech.
0:09:26When a baby is born,
0:09:29the baby already shows a preference for the native language over other languages. Mind you, you
0:09:35can't tell the difference between all individual languages; for instance, it's known that you can't tell the
0:09:38difference between Dutch and English on the day you're born.
0:09:41But
0:09:42you have a preference, if you were
0:09:44exposed to an environment speaking one of those languages, for that kind of language.
0:09:49So what did you think
0:09:51was in that audio that I just played? I mean, what did it sound like?
0:09:55Speech, right? But
0:09:57what else could you tell? What language was that?
0:10:02Do you have any idea?
0:10:04What language might that have been?
0:10:08Was it Chinese?
0:10:18I think that this is an easy question for you guys, come on.
0:10:21Well, were they speaking Chinese in that? No!
0:10:27Yeah, it was English; it was Canadian English actually. So
0:10:32the point is you can't tell, and the baby can,
0:10:38before birth.
0:10:41That's a recording taken by a Canadian team, which made the
0:10:46recording in the womb
0:10:53at about eight and a half to nine
0:10:57months of pregnancy, right? So you can put a little microphone in.
0:11:02Let's not think
0:11:04too much about this.
0:11:06You can actually make a recording within the womb, and that's the kind of
0:11:16audio that you get. So that kind of audio is
0:11:20presented to babies before they're even born, and so that's why
0:11:26they get born with a preference, with knowing something about the general shape
0:11:30of the language. So you can tell that it's a stress-based language, right?
0:11:36That was a stress-based language you were listening to.
0:11:42Learning about speech starts as early as possible.
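Why does in-utero speech sound like that? The womb behaves roughly like a low-pass filter: pitch and rhythm survive, fine segmental detail does not. Here is a minimal sketch of that idea, assuming Python with NumPy and SciPy; the 400 Hz cutoff is an illustrative assumption, not a measured value.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def womb_filter(signal, sr, cutoff=400.0):
    """Low-pass filter a signal to approximate what reaches the fetus:
    prosody and rhythm get through, frication and fine detail do not.
    The cutoff is a hypothetical value chosen for illustration."""
    sos = butter(4, cutoff, btype="low", fs=sr, output="sos")
    return sosfilt(sos, signal)

# Demo: a low tone (roughly F0 territory) passes almost unchanged,
# while a high tone (roughly frication territory) is strongly attenuated.
sr = 16000
t = np.arange(sr) / sr
low = womb_filter(np.sin(2 * np.pi * 120 * t), sr)
high = womb_filter(np.sin(2 * np.pi * 4000 * t), sr)
print(np.abs(low).max(), np.abs(high).max())
```

Run over real speech, the same filter leaves an audio that, like the clip just played, reveals the rhythm class of the language but not its words.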
0:11:47We also know now, another thing that many people in this audience will know, that
0:11:52infant
0:11:53speech perception is actually one of the most rapidly
0:11:55growing areas in speech processing,
0:11:59in speech research, at the moment.
0:12:02When I set up
0:12:04a lab fifteen years ago in the Netherlands, it was the first modern
0:12:10infant speech perception lab in the Netherlands; now there are half a dozen.
0:12:15And people who,
0:12:17PhD students who graduate in this topic, have no trouble finding a position. Everybody in
0:12:24the U.S. is hiring; every psychology and linguistics
0:12:26department wants to have somebody doing infant speech perception at the moment.
0:12:29Good job.
0:12:30Good place
0:12:31for students to get into.
0:12:33But what the
0:12:35recent explosion of research in this area
0:12:39has meant is that
0:12:42we've actually overturned some of the initial ideas that we had in this area, so
0:12:47we now know that
0:12:49it is really
0:12:53learning that's grounded in social communication. It's
0:12:57the social interactions with the caregivers that
0:13:06motivate the child to continue learning.
0:13:11And we also know that
0:13:14we don't teach individual words to babies:
0:13:18in
0:13:19this very early period they're mainly exposed to continuous speech input, and they learn
0:13:25from it.
0:13:26And that they construct vocabulary and phonology together:
0:13:33it was first thought, because of the results that we had, that you had to
0:13:37learn the
0:13:40phoneme repertoire of your language first, and only then could you start building a vocabulary.
0:13:47Successful building of vocabulary is slow, but nevertheless the very first
0:13:55access to meaning can now be shown to occur
0:13:58as early as the very first access to ...
0:14:05And the latest finding,
0:14:09also from my colleagues in Sydney, concerns,
0:14:14sorry, you know there is a kind of speech
0:14:19called Motherese, the special way you talk to babies.
0:14:22You know, you see a baby and you start talking in a special way, and
0:14:25it turns out
0:14:25that part of this is under the infant's control: it's the infant who
0:14:32elicits this kind of speech by
0:14:35responding positively to it, and who
0:14:40also trains
0:14:42caregivers to stop doing or
0:14:45to start doing one kind of
0:14:49speech, with enhanced phonemic contrasts, and then stop doing that later and start doing
0:14:55individual words and so on. So that's all under the babies' control.
0:15:02So what we
0:15:04tried to do in the lab that I set up in
0:15:09Nijmegen, the Netherlands, some fifteen years ago, was to apply
0:15:19the techniques, the electrophysiological techniques,
0:15:23of the brain sciences, using event-related potentials in the infant brain,
0:15:29to look at
0:15:32the signature of word recognition in an infant brain; that's what we were looking for.
0:15:35We decided to go
0:15:36and look for what word recognition looks like
0:15:39in an infant's brain.
0:15:42So he's an infant in our lab,
0:15:46sweet, right?
0:15:47You don't have to stick the electrodes on their heads
0:15:53separately, we just have a little cap
0:15:56they were quite happy to wear a little cap.
0:15:59and so
0:16:00what we usually do is
0:16:03familiarize them with speech, so it could be words in isolation or it could be
0:16:11and then we
0:16:14playing some
0:16:16speech as it might be
0:16:18continuous sentences containing
0:16:21the words that they've already heard or containing some other words.
0:16:29What we find is a particular kind of response; this is the word recognition response,
0:16:34a more negative
0:16:35response to familiarized words compared to the
0:16:40unfamiliarized words.
0:16:42It's in the left side of the brain.
0:16:48This is word onset, the word onset here, right,
0:16:54and you'll see it's about
0:16:56half a second after
0:16:58word onset.
0:17:00And so this is the word recognition effect that you can see
0:17:05in an infant's brain.
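For readers wondering how such an effect is quantified: the EEG epochs for each condition are averaged and subtracted, and the difference is checked in a window around 500 ms after word onset. A sketch with simulated numbers, assuming Python with NumPy; the amplitudes, latencies, and trial counts are invented, not the actual lab data.

```python
import numpy as np

sr = 250                                   # EEG samples per second
t = np.arange(-0.2, 1.0, 1 / sr)           # epoch: -200 ms to 1000 ms
rng = np.random.default_rng(0)

def simulate_epochs(n_trials, effect):
    """Noisy EEG epochs with a deflection of `effect` microvolts
    centred near 500 ms after word onset (hypothetical values)."""
    bump = effect * np.exp(-((t - 0.5) ** 2) / (2 * 0.08 ** 2))
    return bump + rng.normal(0.0, 2.0, size=(n_trials, t.size))

familiar = simulate_epochs(40, effect=-3.0)   # familiarized words
novel = simulate_epochs(40, effect=0.0)       # unfamiliarized words

# The word recognition effect: familiarized minus unfamiliarized,
# averaged over the 400-600 ms window.
difference = familiar.mean(axis=0) - novel.mean(axis=0)
window = (t >= 0.4) & (t <= 0.6)
print(difference[window].mean())   # negative, as in the infant effect
```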
0:17:13We know,
0:17:14as I said, that in the first year of life
0:17:18infants mainly hear
0:17:20continuous speech.
0:17:22Okay, so are they able to learn words from continuous
0:17:25speech? So in this experiment
0:17:29we only used continuous speech.
0:17:34And this was with ten-month-old infants. Now, they don't understand any of
0:17:38this, and you don't have to
0:17:39understand it either. Whatever, it's in Dutch.
0:17:43This is just
0:17:44showing what they were like, so that in a particular trial
0:17:49you'd have
0:17:50eight different
0:17:53sentences, and all the sentences have one word in common,
0:17:56and this is the word "drummer", which happens to be the same as in English, right?
0:18:03Then you switch to hearing four other sentences
0:18:07later on.
0:18:10The trick is that of course all of these things occur in pairs, so
0:18:15for every infant
0:18:16that hears eight sentences with "drummer",
0:18:18there's going to be another
0:18:20infant that's going to hear eight sentences with "fakir".
0:18:26So then you have two each of these sentences, and what you expect is that
0:18:29you get a more
0:18:30negative response to whichever word you have actually
0:18:36already heard,
0:18:38and that's exactly what you find. This one has just been published, as you see.
0:18:42And so what we have is proof that
0:18:45just exposing
0:18:47infants to a word in continuous-speech
0:18:51contexts is enough for them to recognize that same word form.
0:18:55And they don't understand anything at ten months,
0:18:57right; they are not understanding anything about it. They're pulling
0:19:00words out of continuous speech
0:19:03at this
0:19:04at this early age.
0:19:07Now, given the fact that
0:19:12the input to infants is mainly continuous speech,
0:19:18it is of course vital that they can do this, right? And another
0:19:27important finding that has come from this series of
0:19:34experiments using the infant word recognition effect
0:19:39is that it predicts
0:19:43your later
0:19:44language performance
0:19:46as a child,
0:19:47right? So that
0:19:49if you're showing that
0:19:51negative-going response that I've just talked about already
0:19:56at seven months, which is very early,
0:20:00if it's a nice big effect that you get, a big difference,
0:20:05and if it's a nice clean
0:20:10response in the brain ...
0:20:13For instance, here
0:20:18I've sorted
0:20:22groups of infants
0:20:23which had a negative response at the age of seven months,
0:20:27or in the same experiment did not have a negative response.
0:20:31And at age three,
0:20:33look at their comprehension scores, their sentence production scores, their vocabulary size scores.
0:20:40The blue guys, the ones who showed that segmentation, that word recognition effect in continuous
0:20:46speech at age seven months, are already
0:20:50much better. So it's vital for your
0:20:52later development of
0:20:55speech and language competence.
0:20:57Here is an actual
0:21:01participant-by-participant correlation
0:21:05between the size of the response,
0:21:10so remember that we're looking at a negative response, so
0:21:13the bigger it is, the further down here, right? The more negative it is,
0:21:18the bigger your scores
0:21:20in the number of words you know at age one, or the number of words
0:21:25you can speak
0:21:25at age two. Both correlate significantly, so this is really important.
0:21:32Okay, so: starting early,
0:21:36listening actually just to real continuous speech,
0:21:41recognizing that what it consists of is
0:21:47items that you can pull out of that speech signal and store for later use.
0:21:52That is
0:21:53setting up a vocabulary,
0:21:54and starting early on that
0:21:56really launches your
0:21:58language skill.
0:22:01And we're currently working on just how long
0:22:06that effect lasts.
0:22:08So the second
0:22:10major topic
0:22:12that I want to talk about is how learning shapes processing.
0:22:17You'll know already from Tanja's introduction that this has actually been the
0:22:25theme of my research for the last,
0:22:28well, I don't think we're going to count how many years it is now,
0:22:32for a long time.
0:22:34And I could easily stand here and talk for the whole hour about this topic,
0:22:40or I could talk for a month about this topic alone, but I'm not going
0:22:43to. I am going to take one particular,
0:22:45really cool,
0:22:47very small
0:22:49example of how it works.
0:22:53So the point is that
0:22:56the way you actually deal with the speech signal,
0:23:00the actual processes that you apply,
0:23:04are different
0:23:07depending on the language you
0:23:09grew up speaking, your primary language, right? So those of you out there
0:23:15for whom English is not your primary
0:23:19language are going to have different
0:23:21processes going on
0:23:23in your head
0:23:24than what I have.
0:23:28I'm going to take this really tiny
0:23:32form of processing. So take a fricative sound, right, s or f.
0:23:37Now these are pretty simple sounds.
0:23:39How do we recognise them? How do we identify
0:23:42a sound,
0:23:43right? For these fricatives, do we actually just use
0:23:49the frication noise,
0:23:51which is different for sss, fff?
0:23:54You can just hear the difference:
0:23:56sss has high-frequency energy, right?
0:23:57fff is lower.
0:24:00Or do we analyze the surrounding vowels,
0:24:03the transitional information in the vowels? Well, there is always transitional information in any speech
0:24:10signal between sounds. So are we using this in identifying s and f?
0:24:17Maybe we shouldn't, because s and f are
0:24:20tremendously common
0:24:22sounds across languages, and their pronunciation is very similar across languages, so we would probably
0:24:28expect them to be processed in much the same way across languages.
0:24:32But we can actually test whether
0:24:36vowel information is used, in the following way.
0:24:41You ask:
0:24:43is it going to be harder
0:24:45to identify a particular sound,
0:24:48and this works for any sound, right, but now we are talking about s and f,
0:24:52if you insert it into a context that was originally uttered with another sound?
0:24:58So in the experiment I'm going to tell you about,
0:25:02your task is just to detect a sound, which might be s or f in
0:25:06this experiment.
0:25:08And it's going to be nonsense you're listening to, so:
0:25:10dokubapi, pekida, tikufa,
0:25:13right, and your task would then be to press the button when you hear f,
0:25:18as the sound
0:25:19f in tikufa.
0:25:20And the crucial thing is that every one of those target
0:25:23sounds is going to come from another recording, every one of them,
0:25:26and it's going to be either another
0:25:28recording which
0:25:31originally had the same sound.
0:25:39The f in tikufa is either going to have come from another utterance of tikufa,
0:25:44or it's going to come from
0:25:48tikusa: the tiku_a is going to come from tikusa
0:25:51and have the f put into it, right? So you're going to have
0:25:56mismatching vowel cues if it was originally tikusa
0:26:00and congruent vowel cues if it was another utterance of tikufa.
0:26:04Now some of you who teach speech science may recognise
0:26:08this experiment, because it's a very old experiment.
0:26:13Anybody recognise it?
0:26:15It was originally published in 1958, right? Really old experiment.
0:26:21First done with American English,
0:26:25and the result was very surprising, because what
0:26:28was found was different for f and s.
0:26:32In the case of f,
0:26:35if it came from another context, if tiku_a was originally tikusa,
0:26:42if you put the f
0:26:46into a different context, it was much harder to detect it,
0:26:49whereas if you did it with the s there was zero effect
0:26:53of the cross-splicing. No effect whatsoever for s,
0:26:56but a big effect for f.
0:26:59So listeners were only using vowel context for f; they weren't using it for
0:27:04s, right? So this
0:27:05just seemed like a bit of a puzzle at the time. But you know, since 1958
0:27:09these old results have been
0:27:11in the textbooks for years, you know. It's in the textbooks.
0:27:15And the explanation was: well, you know, it's the high-frequency energy in s
0:27:20that makes it clearer;
0:27:21you don't need to listen to anything else, the vowels, you can just do
0:27:25s on the frication noise
0:27:27alone, but f is not so clear, so you need something else.
0:27:34As you will see,
0:27:39I'm going to tell you about some thesis work of my student A. Wagner
0:27:44from a few years ago.
0:27:46And she first replicated this experiment. So what I'm going to plot up here is
0:27:52the cross-splicing effect
0:27:56for f minus the effect for s,
0:27:59right? So
0:28:00you know that
0:28:02there's a bigger effect for f
0:28:04than there is for s; we just saw that, right?
0:28:07And so she replicated that, right. The original one was American English; she did it
0:28:13with British English and got exactly the same
0:28:15effect, so a
0:28:18huge effect for f and very little effect for s.
0:28:24So the size of the effect for f is bigger.
0:28:27And she did it in Spanish and got exactly the same result.
0:28:32So it's looking good for the original hypothesis, right?
0:28:36And then she did it in Dutch.
0:28:39In fact there was no effect for either s or f in Dutch,
0:28:44or in Italian, she did it in Italian,
0:28:46or in German, she did it in German.
0:28:48So, okay.
0:28:51Audience response time again, right? I left something out:
0:28:54I didn't tell you one crucial bit of information here.
0:28:58The Spanish listeners were in Madrid,
0:29:02so this is Castilian Spanish.
0:29:05So what do English,
0:29:08think now,
0:29:09what do English
0:29:10and Castilian Spanish have
0:29:13that Dutch and
0:29:14German and
0:29:17Chinese or whatever languages don't have?
0:29:21You're good, you're really good.
0:29:23That's right.
0:29:27So here, this is the reason: the original explanation,
0:29:31??, that s is clearer,
0:29:34accounts for the results for English and Spanish, but it doesn't account for the results for
0:29:37Dutch and
0:29:38Italian and German, right? But
0:29:43the explanation that
0:29:45you need extra information for f
0:29:49because it's so like θ does, right? Because f and θ are about the most confusable
0:29:55phonemes in any phoneme repertoire,
0:29:59as the confusion matrices of English certainly show us.
0:30:04So you need the extra information for f just because there is another sound in
0:30:10your phoneme repertoire which it's confusable with.
0:30:14But how do you test that explanation?
0:30:18You need,
0:30:19now, you know I'm not going to ask you to guess what's coming
0:30:21up, right, because you know it if you are looking at the slide,
0:30:25but you need a language
0:30:26which has a lot of different s sounds, right?
0:30:30Because then the effect should reverse
0:30:33if you find a language with a lot of other sounds like s,
0:30:37and yes, Polish is such a language.
0:30:40Then what you should find in that cross-splicing experiment is that
0:30:46you get a big effect
0:30:48of mismatching vowel cues for s
0:30:51and nothing much for f, if you don't also have θ in the repertoire.
0:30:55And that's exactly what you find in Polish.
0:30:58Very nice result. How cool is that, overturning the textbooks in your PhD?
0:31:06So we listen to different sources of information in different
0:31:11languages, right? So we learn to process the signal differently,
0:31:16even though s and f are really articulated much the same across languages. But in Spanish
0:31:21and English you
0:31:22have fricatives that resemble f, and in Polish
0:31:25you have fricatives that resemble s, so you have to pay
0:31:28extra attention to surrounding,
0:31:31well, it helps to pay extra attention to surrounding
0:31:36speech information to identify them.
0:31:39The information that surrounds
0:31:41intervocalic
0:31:44consonants is always going to be there. There is always information in the vowel, which
0:31:48you only use
0:31:49if it helps you.
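As a rough illustration of how a cross-splicing effect like this is scored: for each fricative you compare detection times for targets spliced into a mismatching context against targets from a matching one, then take the f effect minus the s effect, as in the plot described above. The `splicing_effect` helper and all the reaction times below are invented for illustration.

```python
def splicing_effect(rt_mismatch, rt_match):
    """Mean detection-time cost (ms) of mismatching vowel cues
    for one fricative; positive means the mismatch hurt."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(rt_mismatch) - mean(rt_match)

# An English-like pattern: a large splicing cost for f, none for s.
effect_f = splicing_effect([540, 555, 560, 545], [480, 470, 490, 475])
effect_s = splicing_effect([470, 465, 475, 480], [472, 468, 470, 478])
print(effect_f - effect_s)   # prints 70.75: the difference score plotted
```

In a Polish-like pattern the sign of the difference would flip, with the large cost attaching to s instead of f.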
0:31:52On to the third
0:31:54point that I want to make.
0:31:56Learning about speech
0:31:58never stops.
0:32:01Even if we were only to speak one language,
0:32:04even if we knew every word of that language, so we didn't have to learn
0:32:08any new words,
0:32:09even if we always heard speech spoken in clean conditions,
0:32:13there would still be learning to be done, especially whenever we meet a new
0:32:15talker, which we can do every day. Especially at a conference.
0:32:22When we do meet new talkers, we adapt quickly.
0:32:26That's one of
0:32:26the most robust findings in human speech recognition, right? We have no problem walking into
0:32:32a shop
0:32:33and engaging in a conversation with somebody behind the counter we've never spoken to before.
0:32:40And this kind of talker adaptation also begins very early
0:32:44in infancy
0:32:46and it continues throughout life.
0:32:53As I already said,
0:32:55you know about
0:32:57particular talkers: you can tell your
0:33:00mother's speech from other
0:33:02talkers at birth.
0:33:03So there are these experiments that people do at birth, right. I mean, it's literally within
0:33:09the first couple of hours after an infant is born. In some labs they are
0:33:14presenting them with speech to see
0:33:16if they show a preference. And they show a preference by sucking
0:33:19harder:
0:33:21you give them a pacifier with a transducer in it, and sucking
0:33:26keeps the speech signal going, and you find
0:33:30that infants will suck longer to hear their own mother's voice than other voices.
0:33:36But when do they,
0:33:38when do they tell the difference between
0:33:43talkers? So when you have new talkers, when can an infant
0:33:47tell whether
0:33:50they're the same or not?
0:33:52Well, you can test discrimination easily
0:33:56in infants, right,
0:33:58and it's a habituation test method that we use.
0:34:14so mother
0:34:15can't hear what babies are hearing
0:34:19baby is hearing speech coming over
0:34:24and is looking at a pattern on the screen which
0:34:30and if they look away the speech will stop,
0:34:36What happens is you
0:34:37play them
0:34:39a repeating
0:34:40stimulus of some kind, so
0:34:42in this experiment that I'm gonna talk about, the repeating stimulus is just
0:34:46some sentences that they wouldn't understand
0:34:48being spoken by
0:34:50three different speakers, interchanging one's. Speaker will say
0:34:54a sentence and the next one will say a couple of sentences and the first
0:34:57one will also say a couple of sentences
0:34:58again and third speaker also says sentence These are just sentences that the babies can't
0:35:03actually understand.
0:35:04These babies are actually seven months old. Younger than the baby in the picture there.
0:35:11so as to the
0:35:13stimulus keeps repeating the infant keeps listening, right.
0:35:19And the stimulus keeps repeating,
0:35:22and the infant keeps listening,
0:35:24and the stimulus keeps repeating,
0:35:30and eventually baby get bored and looks away, right.
0:35:33And at that point
0:35:35you change the input,
0:35:38And then you wanna know if and that's the way you test discrimination, does the
0:35:43baby look back? Right.
0:35:44Look back at the screen and perk up.
0:35:47Okay and continues to look at
0:35:52the screen and thereby keep the speech going.
0:35:59these were seven month olds as I said, so really they don't understand anything like
0:36:04no words yet.
0:36:05Maybe that recognise their own name, that's about it.
0:36:10And we have
0:36:11got three different voices, the three different
0:36:15young women
0:36:17that have reasonably similar voices
0:36:19talking away and saying sentences that are you know way beyond seven month olds' comprehension
0:36:25like: Artist are attracted to life in the capital.
0:36:30And then at the point in which the infant
0:36:34loses attention you'll bring in a fourth voice,
0:36:39a new voice and the question is: Does the infant notice?
0:36:44So these are Dutch babies. This was run in Nijmegen.
0:36:49And yes, they do.
0:36:51They really do notice the difference, right.
0:36:55As long as it's in Dutch.
0:36:56We also did the experiment with four people talking in Japanese,
0:37:00four people talking Italian
0:37:02and it was no significant
0:37:06discrimination in that case. So it's only in the native language, right. That is to
0:37:10say the
0:37:10language of the environment that they have been exposed to.
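The habituation logic just described can be sketched as a simple criterion on looking times. The 50%-of-baseline rule and the three-trial window below are one common convention, not necessarily the one used in this lab, and every looking time is invented for illustration.

```python
def habituates_at(looking_times, window=3, criterion=0.5):
    """Index of the trial at which the mean of the last `window` trials
    first falls below `criterion` times the initial baseline, else None."""
    baseline = sum(looking_times[:window]) / window
    for i in range(window, len(looking_times) - window + 1):
        if sum(looking_times[i:i + window]) / window < criterion * baseline:
            return i + window - 1
    return None

def discriminates(switch_look, habituated_looks):
    """Dishabituation: looking on the switch trial rebounds above
    the habituated level, i.e. the infant noticed the change."""
    return switch_look > max(habituated_looks)

looks = [12.0, 11.5, 12.5, 9.0, 7.0, 5.0, 4.5, 4.0]   # seconds per trial
bored_at = habituates_at(looks)
print(bored_at)                           # prints 6: criterion met there
print(discriminates(10.5, looks[-3:]))    # prints True: looking rebounds
```

In the native-language condition the switch to a fourth voice produces a rebound like the one modelled here; in the Japanese and Italian conditions it does not.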
0:37:15This is important because it's not
0:37:18whether the speech is understood that's going on here, it's whether the sound is familiar, because what
0:37:24infants are doing between six and nine months is
0:37:27building up their knowledge of the phonology of
0:37:31their language and building up their first
0:37:35store of words.
0:37:39And this is important. Some of you probably know the literature from forensic
0:37:44speech science on this, and you know that
0:37:51if you're trying to do a voice lineup and pick a speaker you heard in a
0:37:56criminal context or something, and that speaker is speaking a language you don't know very well,
0:38:02you're much poorer at making a judgement than if they're speaking
0:38:06your native language.
0:38:10And this appears to be based on exactly the same
0:38:14basic phonological
0:38:18adjustment
0:38:20that we see happening in the first year of life.
0:38:24And we can do a little better: we can show adaptation to
0:38:29new talkers
0:38:31and strange speech sounds
0:38:33in a perceptual learning experiment that we first
0:38:37ran about eleven years ago,
0:38:40and it has been replicated in many languages and in many labs around the world since.
0:38:47And in this paradigm what we do is we start with a learning phase, right.
0:38:51Now there are many different kinds of things you can do in this learning phase,
0:38:55but one of them is
0:38:56to ask people to decide, while they're listening to individual
0:39:01tokens, you ask them to decide:
0:39:03is this a real word or not?
0:39:06And that's called a lexical decision task, right.
0:39:25Yes, no, yes, no and so on.
0:39:27Now the crucial thing in this experiment that we're doing
0:39:30is that we're changing one of the sounds
0:39:33in the experiment,
0:39:35And we're gonna stick with s and f here, just to keep things simple,
0:39:40but again we've done it with a lot of different sounds,
0:39:45If you,
0:39:46for instance, had a
0:39:48sound that was halfway between s and f,
0:39:54that is, if we create a sound along a continuum between s and f that's right in
0:39:58the middle,
0:39:59and we stick it on the end of what would've been the word "giraffe",
0:40:03then it sounds like this.
0:40:08No, like here.
0:40:10Can you hear that it's a blend of f and s?
0:40:16And a dozen other words in the experiment
0:40:20all should have an f in them;
0:40:24if they had an s they would be non-words. So we expose
0:40:30a group of people to learning that
0:40:33the way this speaker says f
0:40:35is this strange thing which is a bit more s-like.
0:40:39Meanwhile there's another group
0:40:41that's doing the same experiment,
0:40:45and they're hearing things like this.
0:40:49That's exactly the same sound at the end of what should be "horse".
0:40:54Right, so they have been trained to
0:40:57hear that particular strange sound and identify it as s,
0:41:02where the other group identifies it
0:41:04as f, right.
0:41:06And then you do a standard phoneme categorization experiment,
0:41:12right, where what everybody hears is exactly the same continuum,
0:41:28and some of the sounds are better s's and some of them are better f's,
0:41:32but none of them are a really good s or f. But the
0:41:36point is that
0:41:38you make a
0:41:40categorization function out of an experiment like that, right, which goes from one
0:41:45of those sounds to the other
0:41:47and you would normally,
0:41:49under normal conditions, get
0:41:52a baseline categorization function like the one shown up there,
0:41:57but if your
0:41:59f category was expanded
0:42:01you might get that function, and if your s category was expanded you might get
0:42:06that function, okay. So
0:42:07that's what we're gonna look at
0:42:09as a result of
0:42:10our experiment, in which one group of people expanded their f category and another
0:42:14group of people
0:42:15expanded their s category. And that's exactly what you get:
0:42:21completely different functions for identical continua.
0:42:26Okay, so we exposed these people to a changed sound in just a few words,
0:42:32so we had
0:42:35up to twenty words in our experiments, but people have been
0:42:37tested with many fewer words, and obviously
0:42:40in real life with a new talker it probably works with just one.
0:42:48It only works if you can work out what the sound was
0:42:51supposed to be, right, and with real words: if we did
0:42:54the same thing with non-words there's no significant shift; those are both exactly
0:42:59equivalent to the baseline function.
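[Editor's aside: the boundary-shift result can be made concrete with a small sketch. The parameters below are made up for illustration, not taken from the talk's data: two logistic categorization functions over a seven-step f-to-s continuum that differ only in boundary location, mimicking the expanded-/f/ and expanded-/s/ groups.]

```python
import math

def p_s_response(step, boundary, slope=1.5):
    """Probability of an 's' response at a continuum step (logistic sketch)."""
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))

steps = list(range(1, 8))                                    # 7-step f-to-s continuum
f_expanded = [p_s_response(s, boundary=5.0) for s in steps]  # group trained: odd sound = f
s_expanded = [p_s_response(s, boundary=3.0) for s in steps]  # group trained: odd sound = s

# The physically identical midpoint (step 4) gets opposite majority labels:
print(round(f_expanded[3], 2), round(s_expanded[3], 2))  # 0.18 0.82
```

That is exactly the "different functions for identical continua" point: the audio at each step is the same for both groups; only the learned boundary differs.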
0:43:02So that's basically what we're doing:
0:43:05adapting to talkers we've just met by adapting our phoneme boundaries
0:43:11especially for them.
0:43:13Now this, as I've already said,
0:43:18has spawned a huge number of follow-up experiments, not only in our lab.
0:43:23We know that it generalizes across the vocabulary; you don't have to
0:43:27hear the same sound in a similar context.
0:43:33We know that lots of different kinds of exposure
0:43:37can bring about the adaptation:
0:43:40it doesn't have to be a lexical decision task, you don't have to be making any decision
0:43:44about the word,
0:43:45you can just have passive exposure, and you can have
0:43:47nonsense words if their phonotactic
0:43:51constraints force you to
0:43:54choose one particular sound.
0:43:57And we know that it's pretty much speaker-specific,
0:44:01that is, at the least the adjustment is bigger for the speaker you actually heard.
0:44:06And we've done it across many different languages, and I brought along some results
0:44:11from Mandarin, because Mandarin gives us something really beautiful,
0:44:16namely, that you can do the same
0:44:18adjustment, the same
0:44:21experiment, with segments and with tones, right,
0:44:24different kinds of speech sounds. As I said, not just
0:44:29the same segments that I used in that
0:44:32experiment; but here they are again, f and s, in Mandarin. Same result.
0:44:38Very new data.
0:44:39And there is the result when you do it with tone one and tone two
0:44:43in Mandarin exactly the same way. Make an ambiguous stimulus halfway between tone one and
0:44:49tone two.
0:44:50And you get the same adjustment.
0:44:54You can
0:44:57use this, you can use this
0:44:59kind of adaptation
0:45:01effectively in a second language, which is good.
0:45:06At least
0:45:07in this experiment by colleagues of mine in
0:45:12Nijmegen, using the same Dutch input with which Dutch listeners get
0:45:16exactly the same shift, right.
0:45:19German students... now, German and Dutch are very close languages, and German students come to
0:45:26study in the Netherlands in Nijmegen. They take... imagine this, those of
0:45:32you who've gone to study in another
0:45:36country, you know, one which doesn't speak your L1 (first language):
0:45:41They take a course for five weeks,
0:45:44a course in Dutch for five weeks and at the end of that five weeks
0:45:48they just go into the lectures
0:45:49which are in Dutch
0:45:50and they're just treated like anybody else
0:45:55in the lectures.
0:45:56So that's how long it takes
0:45:58to get up to speed:
0:46:00if you're German, that's how long it takes to get up to speed with
0:46:04Dutch, okay.
0:46:05So, not surprisingly,
0:46:07a huge effect, the same effect, the same shift,
0:46:14with German students in the Netherlands. I have to say that this is
0:46:20my current research, one of my current research projects,
0:46:24and the news isn't a hundred percent good on this
0:46:27topic after all, because I brought along some data
0:46:32which is actually just from a couple of weeks ago, we've only just got it in,
0:46:37and this is
0:46:42adaptation in two languages,
0:46:45in the same individuals. Now, you've just seen that graph.
0:46:48That's the Mandarin listeners doing the task in Mandarin
0:46:53and what I'm trying to do in one of my current projects
0:46:58is look at the processing
0:47:01of different languages by the same person,
0:47:05right. Because I want to track down
0:47:08what the source of native-language listening advantages is in
0:47:12various different contexts, and so what I'm trying to do now is look at the
0:47:18same people
0:47:19doing the same kind of task.
0:47:23It might be listening in noise, it might be perceptual learning for speakers, and so on,
0:47:28in their different languages.
0:47:30So here are the same Mandarin listeners
0:47:33doing the English experiment.
0:47:39Not so good,
0:47:42it looks.
0:47:42Now, these were tested in China, so
0:47:47it was,
0:47:49they are not in an immersion situation; it is their second language and they are living
0:47:54in their
0:47:54L1 environment. So that's not quite
0:47:57as hopeful
0:48:00as the previous
0:48:05study. However, one thing we know
0:48:08about this adaptation to talkers: we've already seen that discrimination
0:48:13between talkers is something that even seven-month-old listeners can do. So what about
0:48:19this kind of
0:48:22lexically based adaptation to strange pronunciations? We decided to test this in children,
0:48:29who couldn't really do a
0:48:35lexical decision experiment, because you can't really ask kids; they don't know a lot of words.
0:48:42So we did a picture verification experiment with them.
0:48:46The one on the left is a giraffe and the one on the right is a platypus, right.
0:48:49So the first one ends with an f and the second
0:48:52one ends with an s. We're doing the s/f thing again.
0:48:57And then we had a name continuum
0:49:02for our
0:49:04final categorization. So again, you don't want to be asking young kids to
0:49:09decide whether they're hearing f or s, it's not a natural
0:49:12task. But if you teach them that the guy on the left is called Fimpy
0:49:16and the guy on the right is called Simpy
0:49:19and then you give them something that's halfway between Fimpy and Simpy, right.
0:49:25then you can
0:49:27get a phoneme categorization experiment. And we first of all had to validate
0:49:33the task with adults; needless to say, we did not
0:49:37have to do all that:
0:49:39the adults could just press a button,
0:49:42so they didn't have to point to the character and so on.
0:49:45But we get the same shift again for the adults,
0:49:50and we get it with twelve-year-olds and we get it with
0:49:54six-year-olds. And the important difference between
0:49:56twelve-year-olds and six-year-olds is that twelve-year-olds can read already,
0:49:59and six-year-olds can't read.
0:50:01And there is a certain school of thought that believes
0:50:06that you get phoneme categories from reading. But you don't get phoneme categories from reading:
0:50:10you have
0:50:11your phoneme categories in place very early in life.
0:50:17That's exactly the same effect. As I say, very early in life, even at age six,
0:50:23you're using your perceptual learning to
0:50:26understand new talkers.
0:50:28And I think I saw ?? over there, so I'm going to show
0:50:31some of ??'s data,
0:50:34so we know... yes, there you are.
0:50:38This is some of the work with older listeners, so we know
0:50:43that this kind of perceptual learning goes on later in life. I brought this particular
0:50:51result, which is again with s and f and was presented at Interspeech in 2012,
0:50:57so I
0:50:58hope you were all there and you all heard it,
0:51:01but they also have a
0:51:052013 paper with a
0:51:07different phoneme continuum, which I urge you also to look at.
0:51:14Even when you're losing your hearing you're still doing this perceptual learning
0:51:19and adapting
0:51:23to new talkers. So learning about new talkers is just
0:51:27something that human listeners do
0:51:31across the lifespan.
0:51:32So that brings me
0:51:33to my final slide.
0:51:36So this has been a
0:51:39tour through some highlights of some really important issues in human learning about speech:
0:51:44namely, that it starts as early as it possibly can,
0:51:48that it actually trains up the nature of the processes,
0:51:52and that it never actually stops.
0:51:58When I was doing this I thought, well, actually, you know,
0:52:01I love these conferences because they're
0:52:04interdisciplinary, because we get to talk about the same topic
0:52:09from different viewpoints. So what do I actually
0:52:12think, after
0:52:14preparing this talk,
0:52:17is the
0:52:19biggest difference you could put your finger on between human learning about speech and
0:52:24machine learning about speech?
0:52:27So I'll be talking about this during the week, and I'll give you
0:52:32that question to take to all the other keynotes and think about too.
0:52:39If you'd say, you know, it starts at the earliest possible moment, well, I mean,
0:52:44so would a good machine
0:52:47learning algorithm, right? I mean,
0:52:50it shapes the processing, it actually changes the algorithms that you're using; that's not the
0:52:55way we usually start:
0:52:58in programming a
0:53:01machine learning system we start with the algorithm, right?
0:53:06You don't actually change the algorithm
0:53:08as a result of the input. But you could; I mean,
0:53:12there's no logical reason why that can't be done, I think.
0:53:19And "never stops", I mean, that's not the difference, is it? No, that's not
0:53:22a difference; you can run
0:53:23any machine learning algorithm as long as you like.
0:53:27I think buried in one of my very early slides is
0:53:32something which is crucially important
0:53:35and that is the social reward,
0:53:38which we now know to be a really important factor in early human
0:53:43learning about speech. And you can think of humans
0:53:46as machines that really
0:53:49want to
0:53:50learn about speech. I'd be very happy to talk about this
0:53:54at any time
0:53:56during the rest of this week
0:53:58or at any other time
0:54:00too and I thank you very much for your attention.
0:54:29Hi, and fascinating talk,
0:54:31so a quick question. Your boundaries, the ??. Do they change as a function
0:54:35of the adjacent vowels? So far versus
0:54:39fa, sa versus fa. ??
0:54:46We've always used... whatever was the constant...
0:54:54So you're talking about the perceptual learning experiments?
0:54:59The last set of experiments, right? We've always tried to use a
0:55:05varying context, so I can't answer that question. If we had used only "a"...
0:55:14or, hang on,
0:55:16we did use a constant context in the non-word experiment with
0:55:25phonotactic constraints, but then that was different in many other ways. So
0:55:31no, I can't answer that question, but
0:55:36there is some tangential
0:55:39information from another lab
0:55:44which has shown that people can learn,
0:55:47in this way,
0:55:49a dialect feature
0:55:51that is only
0:55:53applied in a certain context.
0:55:57So the answer would be yes: people would be sensitive to that if it was consistent.
0:56:11There are two in the same row.
0:56:19Have you found any sex-specific differences in the infants' responses?
0:56:24Have we found sex-specific differences in the infants' responses? There are some
0:56:29sex-specific differences
0:56:31in... But we have not found them
0:56:36in this speech
0:56:37segmentation, in the word recognition in continuous speech; we've actually always looked
0:56:44and never found a significant difference between boys and girls.
0:56:52That was a short one. So, are there any other questions?
0:57:02With respect to the
0:57:05negative responses
0:57:07on the words
0:57:08that you used there,
0:57:10that were presented in the experiment,
0:57:15at age three the children were...
0:57:17The size of the negative-going brain potential, right?
0:57:25Is that just
0:57:27would you say that could be good to
0:57:31detect pathology?
0:57:35Definitely, and the person whose name you saw on the slides as first author, Caroline,
0:57:42is actually starting a new
0:57:45personal career development award project in Amsterdam,
0:57:50and in Utrecht... sorry, in Utrecht, where she will actually look at that.
0:57:56Okay so, thank you so much again for delivering this wonderful keynote, and
0:58:02congratulations again on being our ISCA medallist. I am happy that you're around, so you
0:58:07can approach our medallist over
0:58:09the whole duration of the Interspeech conference. Thank you, Anne.