Speech Transcript - Reverse Engineering: Infant Language Acquisition

0:00:16	so
0:00:17	well
0:00:19	thank you i'm thanking the organisers for allowing me to be a sort of surprise
0:00:25	talker and
0:00:27	and so i'm going to tell you a little bit what we have been doing
0:00:31	in terms of trying to understand language acquisition
0:00:35	now when as parents we are trying to understand how babies are learning
0:00:42	languages it seems very obvious we are just using intuition and well babies
0:00:46	just have to listen to what we are telling them right it's very simple
0:00:51	now as a psychologist
0:00:53	then we have been trained to try to take the place
0:00:57	of the baby okay
0:00:59	how does it feel to be a baby in this situation well it's going to be
0:01:03	a lot more complicated because
0:01:05	we don't understand what we are told we just have the signal
0:01:09	and now
0:01:10	what i would like to do is to take the third perspective
0:01:14	which would be the perspective of an engineer i'm not an engineer myself i'm a
0:01:18	psychologist but here the idea is try to see how could we basically construct a
0:01:23	system that does what the baby does
0:01:25	okay that's the perspective we would like to push
0:01:29	so okay
0:01:32	so this is what basically we know
0:01:35	or we think we know about babies' early language acquisition so this timeline here
0:01:40	is in months
0:01:41	so this is birth and this is the first birthday
0:01:44	and as you can see babies are learning a quite a number of things quite
0:01:47	quickly so
0:01:49	basically here babies are starting to say their first few words and before that they are
0:01:53	uttering various vocalisations but actually
0:01:57	before they utter their first word they are
0:02:00	learning quite a bit of information about their own language
0:02:04	for instance they start to be sensitive to the vowels that are
0:02:08	typical of their language then they start to build some representation of the consonants then
0:02:13	here they are starting to build basically language models with sequential dependencies et cetera
0:02:19	et cetera so this is taking place
0:02:21	very early
0:02:23	way before they can show us that they have
0:02:26	learned these things okay
0:02:28	at the same time they are also learning other aspects of language in the
0:02:31	prosody and in the lexical domain
0:02:34	so
0:02:35	how do we know that babies are doing this well this is the job of
0:02:38	psychologists to try to interrogate babies that don't talk
0:02:42	and we have to find clever ways to
0:02:46	build situations where the baby is going to
0:02:50	for instance look at a screen or
0:02:52	suck a little
0:02:55	blind nipple here
0:02:56	and this behaviour of the baby is a way to control the stimuli that they are
0:03:00	going to be presented with so in the typical experiment that was basically the beginning
0:03:05	of this field in the seventies
0:03:07	eimas did this study where you basically present over and over again each
0:03:12	time the baby is doing this little behaviour this sucking behaviour you present the
0:03:15	same syllable so it's ba
0:03:19	and you can see here that basically the frequency of this sucking is decreasing
0:03:24	because it's boring it's always the same syllable but then suddenly
0:03:28	you change the syllable now it's pa
0:03:31	okay and now the baby is sucking a lot more
0:03:33	okay that means that the baby has
0:03:35	noticed that there was a change
0:03:38	and this is compared to control conditions where the same syllable continues ba exactly
0:03:44	the same syllable or a slightly different one
0:03:47	so
0:03:47	with this kind of idea you can basically probe babies' perception of speech sounds
0:03:54	and you can ask yourself okay do they discriminate ba and pa
0:03:57	dot and got and all these kinds of sounds you can also probe memory
0:04:01	have they memorised have they segmented out
0:04:05	particular frequent or interesting parts of the sounds in their environment
0:04:11	there are also more fancy types of equipment to do the same kind of experiments
0:04:15	but i'm not going to talk about them
0:04:18	so
0:04:19	the question that really interests me here is how can we understand what babies
0:04:25	are doing okay now if you open up linguistic i mean psycholinguistic or
0:04:30	developmental psychology journals you find some hypotheses
0:04:35	that are interesting but i'm not going to talk about them because unfortunately these theories
0:04:40	do not allow us to basically have an idea of what are the mechanisms that
0:04:43	babies are using for
0:04:46	understanding speech
0:04:49	you do find in psychology and also linguistics journals publications trying to
0:04:54	basically cut down the learning problem into sub-problems so for instance some people i'm
0:04:58	going to talk more about that have studied how you could
0:05:01	find phonemes
0:05:03	from raw speech using some kind of unsupervised clustering
0:05:07	but others have shown that once you have the phonemes you can
0:05:11	find the word forms
0:05:12	or once you have the word forms you can learn some semantics et cetera et cetera
0:05:17	so these are
0:05:19	these papers were done basically by psychologists and linguists they are not
0:05:23	done by engineers and
0:05:26	one particular aspect of them is that they are focusing on really
0:05:31	a small part of the learning problem
0:05:35	and they are also
0:05:38	basically making a lot of assumptions about the rest of the system
0:05:41	so the question we can ask ourselves is
0:05:45	could we make a global system that would learn what the babies are
0:05:49	doing by concatenating these elements
0:05:51	and what i
0:05:53	i think i will try to demonstrate to you is that such a system simply
0:05:57	does not work
0:05:59	it doesn't work because it doesn't scale it incorporates some circularities and also
0:06:04	it doesn't represent what the babies are doing anyway
0:06:07	so i'm going to focus on this particular part and we talked a lot about
0:06:12	that at least two talks today focused on how you could discover some units of
0:06:18	speech from
0:06:19	from raw speech in psychology
0:06:24	people really believe that the way babies do that is by accumulating evidence
0:06:31	and doing some kind of unsupervised clustering
0:06:35	so a couple of papers were published
0:06:39	basically showing that babies at six months are able to distinguish sounds that
0:06:43	are not in their language so they can distinguish dot i wouldn't well if you
0:06:49	are speaking yes and you say that i say right
0:06:53	but most of you wouldn't hear that contrary to the english baby at six
0:06:58	months but at twelve months they have lost that ability because that contrast is
0:07:02	not used in the language
0:07:04	okay and so the hypothesis about how babies do that is that they are basically accumulating
0:07:09	evidence and doing some kind of statistical clustering based on the input that's available in
0:07:14	the language
0:07:15	now
0:07:17	and a number of papers have tried to demonstrate that you can
0:07:21	build a system like this
0:07:23	however
0:07:24	most of these papers have dealt with a very small number of categories so these
0:07:29	are sort of proof of principle papers that basically construct data according to distributions like
0:07:35	this and they show that you can recover these by doing some kind of clustering
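A minimal sketch of the kind of proof-of-principle setup described here: data are generated from two made-up category distributions along a single acoustic cue, and a plain two-cluster k-means recovers the category centres. The cue (a voice-onset-time-like value in milliseconds), the means, the spreads and the function name `two_means_1d` are all assumptions for illustration; the papers referred to in the talk may use different models.

```python
import random

def two_means_1d(values, iterations=20):
    """Plain k-means with k=2 on scalar values; returns the two sorted centres."""
    centres = [min(values), max(values)]  # deterministic initialisation
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Keep the old centre if a group happens to be empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)

# Hypothetical toy data: one cue drawn from two invented Gaussian categories.
rng = random.Random(0)
tokens = [rng.gauss(10.0, 5.0) for _ in range(200)] + \
         [rng.gauss(60.0, 5.0) for _ in range(200)]
centres = two_means_1d(tokens)
```

With two well-separated categories this works almost by construction, which is the point the talk goes on to question for running speech.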
0:07:39	so that's nice but does it scale
0:07:44	and as everybody knows here speech is more complicated than this this is
0:07:50	basically running speech and if you go to more conversational speech the units are not separated they are
0:07:55	not so easily segmented segmentation is part of the problem et cetera et cetera
0:08:00	okay so this is where i started to get involved in this problem working
0:08:06	with a
0:08:07	that sounds if a hopkins and we wish we choose the idea was basically to
0:08:12	try to apply a really simple unsupervised clustering algorithm on raw speech on running speech
0:08:19	and see what we get could we get phonemes out of that
0:08:23	so this is what
0:08:25	we did the idea is you start with a very
0:08:28	simple markov model with just one state and then you split the states into
0:08:34	various
0:08:36	possibilities you can split it in the time domain or horizontally so that
0:08:40	you have two different versions of each sound and then you can
0:08:45	continue to iterate this graph growing process until you have a very complicated
0:08:49	network
0:08:50	and so in order to analyze
0:08:53	what the system was doing what they did was
0:08:59	to apply decoding
0:09:03	using finite state transducers so that you can basically have some interpretation of what the
0:09:07	states mean
0:09:08	and what was discovered was that the units that are found in this
0:09:12	kind of system are very small smaller than phonemes
0:09:16	but even if you concatenate them and these are the most frequent
0:09:20	concatenations
0:09:23	they correspond not to phonemes but more to contextual allophones
0:09:27	there is also another problem which is that the units are not very talker invariant
0:09:34	but so
0:09:35	so these problems are not very surprising for those of you who work with speech and
0:09:39	that's the majority of people here because we all know that the phonemes are not
0:09:44	going to be found in such a way
0:09:47	there is one problem i want to insist on because i think that's quite crucial
0:09:52	and we talked about that in earlier discussions
0:09:55	is the fact that languages
0:09:58	do contain elements
0:10:01	that are
0:10:03	that you will discover if you do some kind of unsupervised clustering but there is
0:10:07	no way to merge them into abstract phonemes and this is due to the existence of
0:10:13	allophones okay you have in many languages in most languages you have allophonic
0:10:17	rules like for instance in french you have the overall voiced what get number one
0:10:22	and you have the unvoiced in cannot wrote all okay so these sounds exist in
0:10:28	the language there is nothing you can do about that they actually are
0:10:30	two different phonemes in some other language
0:10:34	so
0:10:35	you are going to end up in fact discovering these units
0:10:40	okay so in a purely bottom-up fashion there is no way to remove
0:10:44	this
0:10:46	okay
0:10:47	so
0:10:48	well
0:10:49	you could say and that actually was one of the questions that was discussed
0:10:55	before how many phonemes how many units do you want to discover
0:11:01	and it was sort of said it doesn't really matter we can take sixty
0:11:05	four we can take a hundred
0:11:08	well actually it does matter for the rest of the processing at least
0:11:11	that's what we discovered with a
0:11:14	phd student of mine so what we did there was to basically vary the number
0:11:19	of allophones that you use to
0:11:22	transcribe speech okay and then we use this other algorithm which is this word
0:11:28	segmentation algorithm
0:11:30	that was also referred to before so we use one of sharon goldwater's type
0:11:35	of algorithm
0:11:36	and
0:11:37	what we found so here what you have is the segmentation f-score and this is
0:11:41	basically the number of allophones converted into the number of
0:11:46	alternate word forms that the
0:11:48	allophones create
0:11:50	and you can see that the performance is affected it is dropping
0:11:54	this is the result here for english french and japanese and in some languages like
0:11:59	japanese it's really having a very detrimental effect if you have lots of allophones then it's
0:12:04	becoming extremely difficult to find words okay because these algorithms just break down
0:12:11	so it does matter to start with good units
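To make the segmentation f-score mentioned above concrete, here is a small sketch of one common variant, the token F-score, in which a predicted word counts as correct only if both of its boundaries match the gold segmentation. Which exact variant was plotted in the talk is not stated, so this choice, and the helper names `word_spans` and `segmentation_fscore`, are assumptions.

```python
def word_spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_fscore(gold_words, predicted_words):
    """Token F-score: a predicted word counts only if both its boundaries match."""
    gold, pred = word_spans(gold_words), word_spans(predicted_words)
    hits = len(gold & pred)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy utterance "thedogbarks" with a perfect and an under-segmented analysis.
gold = ["the", "dog", "barks"]
perfect = segmentation_fscore(gold, ["the", "dog", "barks"])
under = segmentation_fscore(gold, ["thedog", "barks"])
```

In the under-segmented case only "barks" is recovered, giving precision 1/2 and recall 1/3.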
0:12:18	so this is another experiment that was reported by aren
0:12:23	where again if you basically replace phonemes with some kind of unsupervised unit
0:12:30	and you try to
0:12:31	feed that into a word segmentation system then you end up with a very poor
0:12:36	performance
0:12:38	okay so
0:12:41	that means that phonemes
0:12:43	at least with a simple minded
0:12:45	clustering system are not learnable acoustically
0:12:49	so there are two ideas from there which i want to discuss one is to
0:12:53	use a better auditory model and the other is to use top
0:12:55	down
0:12:57	information
0:13:00	so
0:13:01	regarding the
0:13:04	well i'm basically going to
0:13:06	this is just a summary of what i said so what we have right now
0:13:10	is that with simplified or fake input there are some unsupervised clustering
0:13:16	approaches that have been successful with more realistic input we have systems that work
0:13:22	but they use heavily supervised
0:13:24	models and the question is whether we can build systems that
0:13:29	cover this portion of the space
0:13:33	and
0:13:34	so i'm not going to present much of the work that we did on unsupervised
0:13:39	phoneme discovery
0:13:40	because for me there was a preliminary very important question first which is how
0:13:46	we evaluate unsupervised phoneme discovery
0:13:49	so imagine you have a system that discovers units how do you know how can
0:13:53	you evaluate whether these units are good or not good
0:13:56	so traditionally people use phoneme error rate which is basically you train a phoneme
0:14:02	decoder which is what we did with this successive state splitting
0:14:06	it was this
0:14:10	finite state transducer that translated the states of the system into phonemes
0:14:17	of course the problem is that when you do that then maybe a lot of
0:14:20	the performance at the end is due to the decoder
0:14:25	it may not be
0:14:27	the fact that these units are good it's just that you have trained a good
0:14:30	decoder
0:14:32	and also we don't even know that phonemes are the relevant units for
0:14:36	infants okay
0:14:37	so maybe they are using something else maybe they're using
0:14:39	syllables diphones some other kind of unit
0:14:43	so the idea is to use an evaluation technique that is basically suited
0:14:50	to do this kind of work
0:14:51	and
0:14:52	and the idea the intuitive idea is that we don't really care whether babies or
0:14:57	the system are discovering phonemes what we care about is that they are able to distinguish
0:15:01	words that mean different things
0:15:03	so dog and
0:15:05	doll mean different things so they should be distinguished by the system no matter
0:15:08	how you code them okay
0:15:12	so this is the idea underlying the same different task that aren has been
0:15:16	pushing
0:15:17	all these years and we have a slightly different version of that which we call
0:15:22	the ABX task
0:15:24	so with the
0:15:26	the same different task goes like this you are given two word
0:15:30	tokens and then you have to say whether they are the same word
0:15:33	and you compute the acoustic distance between them and these are the distributions of the
0:15:37	two acoustic distances and what aren showed
0:15:41	was that
0:15:41	if you are basically doing things within the same talker the two distributions
0:15:47	are quite different so it's easy to say whether it is the same word or not
0:15:51	but if it's a different talker it becomes a lot harder
0:15:55	okay
0:15:56	so what we did was to
0:16:00	build on this
0:16:02	and
0:16:03	ask a slightly different question i give you three things i give you dog said
0:16:07	by one talker
0:16:08	doll said by the same talker
0:16:11	and then dog said by a second talker okay so now you have to say
0:16:14	whether this
0:16:16	item here is closer to this one or this one
0:16:19	so this is a simple psychoacoustic task
0:16:22	that's
0:16:23	for me it's really inspired by the type of experiments we do with
0:16:27	babies and adults and with that we can compute d primes you can compute
0:16:32	values that have
0:16:35	i mean a psychological interpretation but also we can
0:16:39	basically have a very fine grained analysis of the type of errors the system
0:16:43	is making so there we applied this task to a database of syllables that have
0:16:49	been recorded in english across talkers
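The three-item task described above can be sketched as a small scoring routine. This is an illustrative version only: the triplet convention (A and X share the word, B is the other word), the Euclidean distance and the two-dimensional toy features are assumptions, not the features or distances used in the study.

```python
def abx_error_rate(triplets, distance):
    """Fraction of (a, b, x) triplets where x is NOT judged closer to a.

    Convention assumed here: a and x belong to the same category and b to
    another, so a correct response is distance(a, x) < distance(b, x).
    Ties count as half an error.
    """
    errors = 0.0
    for a, b, x in triplets:
        d_ax, d_bx = distance(a, x), distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triplets)

def euclidean(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

# Hypothetical 2-D "features": first value carries the word, second the talker.
dog_t1, doll_t1 = (0.0, 0.0), (1.0, 0.0)   # talker 1
dog_t2, doll_t2 = (0.2, 0.5), (0.4, 3.0)   # talker 2, doll strongly shifted
triplets = [(dog_t1, doll_t1, dog_t2), (doll_t1, dog_t1, doll_t2)]
error = abx_error_rate(triplets, euclidean)
```

Here the first triplet is answered correctly and the second is not, because the talker shift on the second dimension swamps the word difference, so the error rate is one half. Swapping in a different `distance` or feature representation is how such a task can compare representations.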
0:16:51	and this is the performance that you get what's nice with this kind
0:16:55	of task is that you can really compare humans and machines
0:16:58	and this is the performance of humans and this is the performance of mfcc coefficients okay so we
0:17:03	can see there is quite a bit of difference between
0:17:06	these two kinds of
0:17:08	of systems
0:17:10	so these are actually run on meaningless syllables so it cannot be
0:17:15	the case that humans are using meaning to do this task okay
0:17:21	but then this kind of task can be used to test different kinds
0:17:25	of features
0:17:26	which is nice so that's what we did with a student of mine thomas
0:17:31	schatz
0:17:31	and also hynek
0:17:36	here it is actually the same task i was talking about so
0:17:39	cross talker phoneme discrimination you can then apply a typical processing pipeline where you
0:17:46	start with the signal you compute the power spectrum and you apply all kinds of transformations
0:17:51	and you can see whether each of these different types of transformation you do
0:17:56	to the signal is actually improving or not on that particular task
0:18:00	okay so in this graph you have the performance depending on the
0:18:04	number of spectral channels and what we found was that
0:18:09	actually the phoneme discrimination task requires fewer channels than for instance if you were to
0:18:13	do a talker discrimination task which we can do now by having
0:18:17	dog spoken by two speakers and then a third item that's a different word but
0:18:22	spoken by either the first talker or the other talker
0:18:26	so i'm not going to say more about that but
0:18:28	but we
0:18:30	the idea is that
0:18:32	trying to specify the proper evaluation tasks is going to help devising proper features that
0:18:39	would work for
0:18:44	unsupervised learning
0:18:46	this is work we started with another postdoc of mine
0:18:50	what he did was to apply the deep belief network
0:18:55	to this problem so we already
0:18:59	learned a lot about this during the first day of the talks
0:19:03	but then what you can do is you can compare the performance of these deep
0:19:07	belief network representations at each of the levels on this kind of discrimination task
0:19:12	okay
0:19:13	and this is the mfcc for instance this is what you have
0:19:17	at the first level of the dbn
0:19:20	without any training so actually you are doing better
0:19:25	this is the error rate here and if you do some unsupervised training like the
0:19:31	restricted boltzmann machine training actually you are getting slightly worse okay on that task now it
0:19:37	is known that this pre-training here helps a lot when you do supervised training after
0:19:42	that but if you don't do supervised training it is actually not doing much
0:19:46	okay so what i'm saying is it's important to have a good
0:19:49	evaluation task
0:19:50	for unsupervised
0:19:52	problems because then you can discover whether your unsupervised units are actually any good
0:19:57	or not
0:19:59	okay so now in the time that remains i would like to talk a little bit
0:20:03	about
0:20:04	this other idea the idea of using top-down information
0:20:08	and that's an idea that was not at least to me very
0:20:11	natural because
0:20:13	i had this idea that babies should learn first the phonemes the elements
0:20:18	of the language before learning higher
0:20:20	order information but of course phonemes are part of a big system okay and
0:20:26	so maybe the definition of the phonemes
0:20:31	emerges out of the big system so the intuition there is that maybe babies are
0:20:36	trying to learn the whole system
0:20:37	and while they do that they are going to find the phonemes
0:20:42	okay so
0:20:44	so of all the different things we tried i'm going to
0:20:49	talk about this idea of using lexical information
0:20:53	so lexical information is a very simple idea
0:20:56	it is the following
0:20:57	typically when you take two words at random
0:21:02	or you take your whole lexicon and you try
0:21:06	to find minimal pairs
0:21:08	that differ on only one segment so for instance canard and
0:21:13	canal
0:21:13	okay
0:21:15	you don't find a lot of them you do find them but they are
0:21:18	very infrequent statistically
0:21:20	so now if you are looking at your lexicon imagine you
0:21:25	have some initial procedure to find the words and then you are looking at
0:21:29	the whole set of minimal pairs that you find if you find
0:21:32	a lot of minimal pairs that correspond to
0:21:35	this contrast then you can be pretty sure that it's not really a
0:21:39	phonemic
0:21:40	contrast these are probably allophones
0:21:43	and that's the intuition okay
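The intuition above, that a contrast carried by unusually many minimal pairs is probably allophonic, can be sketched by counting minimal pairs per phone contrast in a lexicon. The toy lexicon, the phone labels `r1` and `r2`, and the function name `minimal_pair_counts` are hypothetical; real lexicons would also need a decision threshold or ranking, which is not shown.

```python
from collections import Counter
from itertools import combinations

def minimal_pair_counts(lexicon):
    """Count, for each unordered phone pair, how many word pairs differ only in it.

    Words are tuples of phone symbols; only same-length words are compared.
    """
    counts = Counter()
    for w1, w2 in combinations(sorted(set(lexicon)), 2):
        if len(w1) != len(w2):
            continue
        diffs = [(p, q) for p, q in zip(w1, w2) if p != q]
        if len(diffs) == 1:
            counts[frozenset(diffs[0])] += 1
    return counts

# Hypothetical toy lexicon: "r1" and "r2" stand for two allophones of one
# phoneme, so every word containing that phoneme shows up in both variants.
lexicon = [("k", "a", "r1"), ("k", "a", "r2"),
           ("p", "o", "r1"), ("p", "o", "r2"),
           ("k", "a", "l")]
counts = minimal_pair_counts(lexicon)
```

In this toy case the `r1`/`r2` contrast collects two minimal pairs, while each genuinely phonemic contrast with `l` collects only one.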
0:21:45	so how did we test that
0:21:47	we started with
0:21:50	a transcribed corpus then we transcribed it into phonemes then we made random allophones
0:21:58	this is not going well
0:22:00	okay
0:22:01	and then we transcribe this phonemic transcription again into a
0:22:06	very fine description with all these allophones and we vary the number of
0:22:10	allophones
0:22:11	so that's how we generate the corpus and then the task is to take this
0:22:15	and basically find
0:22:17	which pairs of phones belong to the same phoneme
0:22:22	using just information in the corpus
0:22:26	so
0:22:28	so that's what we do
0:22:29	and this is basically
0:22:32	so we compute the distance
0:22:35	now the number of different
0:22:38	minimal pairs that you have for each contrast
0:22:42	and we compute here the area-under-the-curve and that's this
0:22:47	right here and this is the number of phones
0:22:49	so don't look at this curve here this curve here is the relevant one it is
0:22:55	the effect of using the strategy of counting the number of
0:22:59	minimal pairs okay
0:23:01	so the performance is quite good and it's actually not really affected
0:23:06	negatively by the number of phones that you have
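The area-under-the-curve measure mentioned above can be computed directly from scores and labels: it is the probability that a randomly chosen positive item gets a higher score than a randomly chosen negative one. The scores and labels below are invented for illustration, and `area_under_curve` is a hypothetical helper name.

```python
def area_under_curve(scores, labels):
    """ROC area: probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical scores: minimal-pair counts per phone pair, with label True when
# the pair really is a pair of allophones of the same phoneme.
pair_scores = [12, 9, 7, 3, 1, 0]
same_phoneme = [True, True, False, True, False, False]
auc = area_under_curve(pair_scores, same_phoneme)
```

A value of 0.5 corresponds to chance and 1.0 to a score that ranks every allophone pair above every non-allophone pair; the toy data give 8/9.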
0:23:09	okay so this is
0:23:11	this strategy works quite well but of course it's cheating right because there i assume
0:23:17	that the babies had the boundaries of words
0:23:20	but they don't and in fact i showed you just before that it's actually extremely
0:23:24	difficult if you have lots of allophones to find the boundaries of words
0:23:28	so
0:23:29	so that's the kind of circularity that we would like to avoid
0:23:35	and so the idea that andrew martin the postdoc had
0:23:39	which was great was to say well maybe we don't need to have an exact
0:23:42	lexicon maybe babies can go and build a proto-lexicon with whatever segmentation
0:23:48	algorithm they have it's going to be incomplete it's going to be wrong it has
0:23:51	many non-words in it
0:23:53	but still maybe it could be a useful thing to have and that's what we found
0:23:57	here so there we used a really extremely rudimentary segmentation procedure
0:24:03	using n-grams the ten percent most frequent n-grams in the corpus and that's
0:24:08	the lexicon so it was really pretty awful
0:24:11	but still it provided
0:24:15	actually performance that was almost as good as the gold lexicon
0:24:20	okay
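A rough sketch of the rudimentary n-gram proto-lexicon described above: count every phone n-gram in unsegmented utterances and keep the most frequent fraction. The n-gram length range (`min_n`, `max_n`) is an assumption, since the talk only says that the ten percent most frequent n-grams were kept, and the function name is hypothetical.

```python
from collections import Counter

def ngram_proto_lexicon(utterances, min_n=2, max_n=4, keep_fraction=0.10):
    """Keep the most frequent phone n-grams (min_n..max_n) as a crude lexicon."""
    counts = Counter()
    for phones in utterances:
        for n in range(min_n, max_n + 1):
            for i in range(len(phones) - n + 1):
                counts[tuple(phones[i:i + n])] += 1
    keep = max(1, int(len(counts) * keep_fraction))
    return [gram for gram, _ in counts.most_common(keep)]

# Toy unsegmented "utterances" (letters stand in for phones). The corpus is
# tiny, so a larger fraction than ten percent is kept for the illustration.
utterances = [list("thedog"), list("thecat"), list("adog")]
lexicon = ngram_proto_lexicon(utterances, keep_fraction=0.25)
```

On this toy input the result mixes real words ("the", "dog") with fragments such as "th" and "og", which matches the description of a lexicon that is incomplete and full of non-words.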
0:24:22	so then
0:24:24	andrew martin went to japan and then i had a doctoral student
0:24:28	who said well we could go even further than that
0:24:32	maybe babies could construct
0:24:34	some approximate semantics
0:24:37	and the reason why it could be useful to do that is that well canard
0:24:40	okay
0:24:42	they are different allophones because they are in a minimal pair but what about this one
0:24:48	these are two words in french canard and canal
0:24:50	and if i were to apply the same strategy i would declare that r and
0:24:55	l are allophones which is wrong and then i would end up with a
0:25:00	japanese french
0:25:02	type of system so that's not what we want
0:25:07	so but on the other hand if we have some idea even a vague idea
0:25:10	about canard
0:25:12	about the meaning of canard that it is
0:25:15	some kind of bird whereas this one is some kind of water
0:25:21	thing then maybe that is sufficient to help distinguish these two cases that's also
0:25:28	kind of cannot
0:25:31	so there
0:25:33	what we did was the same kind of pipeline we made the problem
0:25:37	actually more realistic by instead of having random allophones we generated allophones
0:25:42	by using tied three state
0:25:49	hmms
0:25:50	actually that makes the problem much more difficult the allophones are more realistic
0:25:55	but the lexical strategy that i presented before is having trouble with
0:25:59	that
0:26:01	and then the idea is that you take that now you don't cheat anymore you are
0:26:05	trying to recover
0:26:07	possible words from that and then you do some semantic estimation and then
0:26:11	now you compute the semantic distance between pairs of phones
0:26:16	so how does it work
0:26:18	well word segmentation
0:26:19	is state-of-the-art
0:26:22	minimum description length or adaptor grammar
0:26:25	okay so we know that is working but we know that's not working very well
0:26:28	okay especially if we have lots of allophones it's going to have a pretty bad
0:26:33	estimate of the lexicon
0:26:35	but then we still take that as a lexicon and then we apply latent
0:26:38	semantic analysis
0:26:41	which basically is
0:26:43	counting how many
0:26:44	how many times these different terms occur in different documents and here we took
0:26:49	documents to be ten sentences in length
0:26:52	so we have this whole corpus we segment it into chunks of ten sentences and we compute
0:26:57	this matrix of counts which we then decompose and we arrive at a semantic representation where
0:27:02	basically each word now is a vector
0:27:05	so i mean i know that people have been doing much more sophisticated things
0:27:08	than this one so this is pretty
0:27:13	first
0:27:13	order semantic analysis
0:27:17	but what's nice is that we can compute the cosine between the two
0:27:21	proto semantic representations of the words and the idea now is that if
0:27:26	you have two allophones they should have quite similar
0:27:30	vectors because they are occurring in the same contexts
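The counting and cosine steps just described can be sketched as follows. This is a simplified illustration: it builds the word-by-document count vectors over fixed-size chunks of sentences and compares them with the cosine, but it skips the matrix decomposition step of latent semantic analysis, so the vectors here are raw counts rather than reduced representations. The word forms and helper names are hypothetical.

```python
import math

def document_count_vectors(sentences, sentences_per_doc=10):
    """Map each word to its vector of counts over consecutive chunks of sentences."""
    docs = [sentences[i:i + sentences_per_doc]
            for i in range(0, len(sentences), sentences_per_doc)]
    vectors = {}
    for d, doc in enumerate(docs):
        for sentence in doc:
            for word in sentence:
                vectors.setdefault(word, [0] * len(docs))[d] += 1
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus with two-sentence "documents". "kanar1" and "kanar2" stand for two
# allophonic variants of one word; "kanal" is a different word.
sentences = [["kanar1", "swims"], ["kanar2", "quacks"],
             ["kanal", "flows"], ["kanal", "boat"]]
vectors = document_count_vectors(sentences, sentences_per_doc=2)
same_word = cosine(vectors["kanar1"], vectors["kanar2"])
different_word = cosine(vectors["kanar1"], vectors["kanal"])
```

The two variants of the same word occur in the same documents and get a cosine of 1, while the different word gets 0, which is the signal used to separate allophonic from phonemic contrasts.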
0:27:34	okay so that's the result
0:27:37	there
0:27:38	so in this in this study what we did was to try to
0:27:44	to look at them
0:27:46	because we have generated these allophones on the basis of hmms we can compute
0:27:50	the acoustic distance between them okay so obviously acoustic distance
0:27:54	is going to help if you have two allophones two phones that are quite close to
0:27:58	one another maybe they should be grouped together because they are likely allophones
0:28:03	we also know that it is not working
0:28:05	if we only have that it's not working because the bottom-up strategy doesn't work
0:28:09	the performance is not that bad
0:28:11	but it is still not perfect okay so that's the performance
0:28:14	in the task where i give you two phones and you have to tell
0:28:19	whether they're allophones or not okay
0:28:21	chance is fifty percent
0:28:23	so this is the percent correct if you use acoustics only for english and
0:28:28	japanese
0:28:30	there is something missing there which is the number of allophones so it's a hundred five hundred
0:28:34	a thousand allophones
0:28:37	and this is the effect of the semantic distance okay and
0:28:42	the performance is almost as good as the acoustic distance
0:28:45	when you combine them you have very good performance
0:28:48	so that shows
0:28:50	that
0:28:51	that you can basically
0:28:54	use this kind of semantic representation even though they are computed on the basis
0:28:59	of an extremely bad lexicon i mean at this level here the number of real
0:29:03	words that you find with the adaptor grammar type of framework is about
0:29:07	twenty percent so your lexicon is twenty percent real words and eighty percent non-words
0:29:12	but nevertheless
0:29:14	that lexicon is enough to give you some semantic top-down information so that shows
0:29:21	that the semantic information is very strong
0:29:26	alright so
0:29:27	i'm going to wrap up very quickly so i started with this idea
0:29:32	that babies would go in a sort of
0:29:36	a bottom-up fashion
0:29:38	that doesn't work it doesn't scale and it has circularities
0:29:43	and also it doesn't really account for the fact that babies are learning words
0:29:48	before they have really zoomed in on their inventory of phonemes okay
0:29:53	in fact now there are results showing that even at six months babies have
0:29:56	an idea of the semantic
0:29:59	representations of words
0:30:01	so basically babies are learning everything
0:30:05	so now we would like to replace this scenario by
0:30:09	a system like this where you start with raw speech and you try to
0:30:13	learn all the levels
0:30:14	at the same time
0:30:16	and then of course you're going to do a bad job the phonemes are going
0:30:19	to be wrong the word segmentation is going to be wrong the semantics is going to be
0:30:22	awful
0:30:23	but then
0:30:24	you combine all this and you make a better next iteration
0:30:28	so that would be the
0:30:30	the proposed architecture for babies
0:30:34	now for this to work we have to do
0:30:37	a lot more work we have to
0:30:39	basically stop
0:30:40	of course using toy languages and try to approximate the real input that babies are
0:30:45	getting as much as we can
0:30:47	we have to quantify what it means to have a proto-lexicon or proto-
0:30:50	something okay so i gave you an idea of an evaluation procedure for evaluating what
0:30:56	it is to have a proto-phoneme we have to do the same thing for
0:30:59	proto-words and proto-semantics et cetera
0:31:03	because these are sort of really approximate representations
0:31:07	and then
0:31:08	and then
0:31:09	the synergies are what i described just
0:31:12	now which is
0:31:14	the synergy is when you try to learn the phonological units alone you are
0:31:17	doing a bad job semantic representations alone it's difficult but
0:31:23	if you try to learn
0:31:24	sort of a joint model you are going to be better
0:31:29	and there are a lot of potential synergies you could imagine
0:31:33	the last thing that i have to do as a psychologist of course is to
0:31:36	go back to the baby and test whether the babies are actually doing this but
0:31:39	i'm not going to
0:31:41	talk about that now
0:31:43	and finally
0:31:44	i mean why should we do this i think this reverse engineering of the human
0:31:49	infant is really a new challenge and i think
0:31:54	both sides can bring a lot of things
0:31:57	psychologists can bring ideas we can bring some interesting corpora and we can
0:32:03	test whether the ideas are
0:32:07	realistic possibilities
0:32:08	and then engineers can bring algorithms and also tests this large scale testing on
0:32:13	real data is very important
0:32:17	and we have a lot to work on because that would be some of the
0:32:21	potential architecture i tried to put everything that has been
0:32:26	documented somewhere in terms of
0:32:28	potential links between different levels that you're
0:32:32	thank you approximate is and there i guess you have added a lot of things
0:32:37	you have also babies are actually articulating so maybe this articulation is feeding back
0:32:42	to help in constructing
0:32:44	the sub-lexical units
0:32:46	they are also seeing the faces
0:32:48	of caretakers and they also have a lot of rich
0:32:52	semantic input for acquisition so all these
0:32:56	representations have to be put in at some point but i think
0:32:59	what we have to do is to establish whether we do have interesting synergies or
0:33:04	not if we don't then we can refactor the system into separate subsystems
0:33:10	and that's what i have to say so this is the team like human
0:33:13	these are
0:33:14	very nice colleagues that helped with this work
0:33:29	okay so we are gonna have an abbreviated panel so we don't have a whole
0:33:34	lot of time for questions but
0:33:35	one or two
0:33:47	what do you think about infants learning
0:33:50	something between
0:33:51	phonemes and words like syllables which have a nice sort of acoustics
0:33:56	chunkiness to them
0:33:58	i mean that's actually
0:34:01	that was basically the hypothesis i had when i did not used
0:34:05	but the role of the syllables
0:34:09	i guess
0:34:12	i mean that's perfectly possible i mean the thing is that
0:34:19	in a way i think that what
0:34:22	deep belief networks are doing by having these input representations where you stack about a hundred
0:34:30	and fifty milliseconds of signal
0:34:33	is sort of going in that direction i mean a hundred and fifty milliseconds is basically the
0:34:36	size of a syllable
0:34:39	so
0:34:40	i guess that's behind this i mean this is what people are using
0:34:44	most recently and the reason is that that's basically the domain of the syllable where
0:34:48	you have coarticulation so that's where you can capture the essential information for
0:34:54	recovering contrasts
0:34:55	so i think there are many ways in which syllable type units could play a
0:34:59	role it could be in an implicit fashion like i just said or you could
0:35:04	actually try to build recognition systems i mean units that have the syllable shape
0:35:09	which is another way to the
0:35:15	i mean we know that infants are counting syllables for instance so at
0:35:19	birth if you present them three syllable
0:35:23	words and then you switch to two syllables they notice the change they don't notice the
0:35:27	change if you go from four to six phonemes for instance which is the same
0:35:31	ratio of change
0:35:33	so we have evidence that at least syllable nuclei are things that they
0:35:39	pay attention to
0:35:40	a lot
0:35:44	thank you for your talk emmanuel i think you at one time told me that
0:35:48	almost from day one
0:35:50	that infants sort of can articulate can imitate articulatory gestures
0:35:55	that's somehow hardcoded
0:35:58	in and well okay it and i don't know how you do that experiment but
0:36:02	on the other hand all of these acquiring you know phonemic boundaries and word
0:36:08	boundary segmentation lexicons that seems to be all sort of this
0:36:12	part of the plasticity of learning in infants so why isn't some notion
0:36:18	of starting with the articulatory gestures since that is sort of
0:36:23	there at the beginning why isn't that part of your model or should it be
0:36:28	in the most the mobile it's sparse
0:36:31	so i have i have actually of course
0:36:34	so i have proposed a working on this and there are actually a number of
0:36:40	have tried to incorporate
0:36:42	using actually deep learning systems trying to learn at the same time speech features and
0:36:47	articulatory features
0:36:50	and then
0:36:50	and then if you train like this and then you only present now speech features
0:36:55	you are obtaining better decoding than if you had only learned speech features so
0:37:00	we know that there is some sense in which that could work but of course
0:37:03	this work was done with
0:37:05	real adult articulation
0:37:07	the baby articulation is much more primitive so it's not clear that's going to help
0:37:11	as much but that's one thing we want to try
0:37:15	so i think we are out of time but emmanuel i believe you're gonna be
0:37:18	here tomorrow as well so i encourage people to go and ask all these questions
0:37:22	i think they're very relevant to the work we all do in this community

Reverse Engineering: Infant Language Acquisition

Limited Resources Day

Emmanuel Dupoux