0:00:13so it is my privilege this morning to introduce our keynote speaker frank guenther
0:00:22he is a computational and cognitive neuroscientist specialising in speech and sensorimotor control
0:00:30he is from the
0:00:34department of speech language and hearing sciences and biomedical engineering at boston university where he also obtained his phd
0:00:43and his research combines theoretical modelling
0:00:46with behavioural and neuroimaging experiments to characterise the neural computations underlying speech and language so this is a
0:00:55fascinating research field
0:00:58which we thought would be of interest to us all in our research
0:01:04and so without further ado
0:01:06i'd like you to help me welcome professor frank guenther
0:01:19morning thanks for showing up this early in the morning i'd like to start by thanking the organisers for inviting me to
0:01:25this conference in such a beautiful location
0:01:28and i'd also like to acknowledge my collaborators before i get started the main collaborators on the work i'll talk
0:01:35about today include
0:01:36people from my lab at boston university including jason tourville jonathan brumberg
0:01:42satrajit ghosh alfonso nieto-castanon maya peeva elisa golfinopoulos and oren
0:01:48civier
0:01:50but in addition we collaborate a lot with outside labs and i'll be talking about a number of projects that
0:01:56involve collaborations with people at mit including joseph perkell melanie matthies and harlan lane
0:02:02we've worked with shinji maeda who created the speech synthesizer we use for much of our modelling work
0:02:09and philip kennedy and his colleagues at neural signals who work with us on our neural prosthesis project which i'll
0:02:16talk about at the end of the lecture
0:02:20the research program in our laboratory has the following goals
0:02:25we are interested in understanding the brain first and foremost and
0:02:29we're in particular interested in elucidating the neural processes that underlie normal speech learning and production
0:02:37but we are also interested in looking at disorders and our goal is to provide a mechanistic model based account
0:02:44and by model here i mean a neural network model that mimics the brain processes that are underlying speech and
0:02:52using this model to understand communication disorders problems that happen when part of the circuit is broken
0:03:00and i'll talk a bit about communication disorders today but will focus on the last part of our work which
0:03:06is developing technologies that aid individuals with severe communication disorders and i'll talk a bit about a project involving a patient
0:03:14with locked in syndrome who was
0:03:16given a brain implant in order to try to restore some speech processing
0:03:22the methods we use include neural network modelling we use very simple neural networks the neurons in our
0:03:29models are simply units that sum their inputs with a nonlinear thresholding of the output
0:03:36we have other equations that define synaptic weights between the neurons
0:03:41and we adjust these weights in a learning process as i'll describe in a bit
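the simple units and weight equations just described can be sketched roughly as follows; this is an illustrative toy in python, not the actual diva equations, and the threshold value, learning rate, and hebbian-style rule are assumptions made for the sketch:

```python
import numpy as np

def unit_output(inputs, weights, threshold=0.1):
    """A rate-coded unit: weighted sum of inputs passed through a
    simple nonlinear thresholding (rectification) of the output."""
    return max(float(weights @ inputs) - threshold, 0.0)

def adjust_weights(weights, pre, post, rate=0.01):
    """Toy learning step: strengthen weights in proportion to
    correlated pre- and post-synaptic activity."""
    return weights + rate * post * pre

pre = np.array([0.5, 1.0, 0.2])     # presynaptic activities
w = np.array([0.1, 0.3, 0.0])       # synaptic weights
post = unit_output(pre, w)          # 0.05 + 0.3 + 0.0 - 0.1 = 0.25
w = adjust_weights(w, pre, post)    # weights nudged toward active inputs
```

the point is only that each "neuron" is a weighted sum with a nonlinearity, and that the weights themselves are adjusted by a separate learning rule.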
0:03:45we test the model using a number of different types of experiments we use motor and auditory psychophysics experiments
0:03:52to look at speech look at the formant frequencies for example during different speech tasks
0:03:57and we also use functional brain imaging including fmri but also meg and eeg to try to
0:04:04verify the model or help us improve the model by pointing out weaknesses in the model
0:04:10and the final set of things we do given that we're a computational neuroscience department we're interested in
0:04:17producing technologies that are capable of helping people with communication disorders and i'll talk about one project involving
0:04:24the development of a neural prosthesis for allowing people to speak who have problems with their speech output
0:04:34the studies we carry out are largely organised around one particular model which we call the diva model and this
0:04:41is a neural network model of speech acquisition and production that we've developed over the past twenty years in our lab
0:04:48so in today's talk i'll first give you an overview of the diva model including a description of the process
0:04:53of learning that allows the model to tune up so that it can produce speech sounds
0:04:57i'll talk a bit about how we extract simulated fmri activity from the model fmri is functional magnetic resonance imaging
0:05:05and this is a technique for measuring blood flow in the brain and areas of the brain that are active
0:05:11during a task
0:05:12have increased blood flow so we can identify from fmri what parts of the brain are most active for
0:05:17a task and differences in activities for different tasks
0:05:23this allows us to test the model and i'll show an example of this where we use auditory perturbation of
0:05:28speech in real time so that a speaker is saying a word but they hear something slightly different
0:05:33and we use this to test a particular aspect of the model which involves auditory feedback control of speech
0:05:40and then i'll end the talk with a presentation of a project that involved
0:05:46communication disorders in this case an extreme communication disorder in a patient with locked in syndrome who was completely paralysed and
0:05:54unable to move
0:05:56and so we are working on prostheses for people in this condition to help restore their ability to speak
0:06:03so that they can communicate with people around them
0:06:08this slide shows a schematic of the diva model i will not be talking about the full model much i will
0:06:14use a simplified schematic in a minute
0:06:16what i want to point out is that the different blocks in this diagram correspond to different brain regions
0:06:23that include different
0:06:25what we call neural maps a neural map in our terminology is simply a set of neurons that represent a
0:06:32particular type of information so in motor cortex for example down here in the ventral motor cortex part of the
0:06:38model we have articulator velocity and position maps
0:06:42these are basically neurons that command the positions of the speech articulators in an articulatory synthesizer
0:06:51which is schematized here so the output of our model is a set of commands to an articulatory
0:06:56synthesizer this is just a piece of software to which you provide a set of articulator positions as input the
0:07:04synthesiser we use the most was created by shinji maeda and involves
0:07:09seven articulatory degrees of freedom there's a jaw degree of freedom three tongue degrees of freedom two lip degrees of
0:07:16freedom for opening and protrusion
0:07:18and a larynx height degree of freedom and together once you specify the positions of these articulators you can create
0:07:26a vocal tract area function and you can use that area function to synthesise the acoustic signal that would
0:07:32be produced by a vocal tract of that shape
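the full maeda synthesizer is far more than a few lines, but the idea of going from a vocal tract shape to its acoustics can be hinted at with the textbook uniform-tube approximation; the 17.5 cm tract length and speed of sound below are standard illustrative values, not parameters from the talk:

```python
def uniform_tube_formants(length_cm=17.5, n=3, c_cm_per_s=35000.0):
    """Resonances of a uniform tube closed at the glottis and open at
    the lips: F_k = (2k - 1) * c / (4 * L). A toy stand-in for the
    area-function-to-acoustics step of a real articulatory synthesizer."""
    return [(2 * k - 1) * c_cm_per_s / (4.0 * length_cm)
            for k in range(1, n + 1)]

formants = uniform_tube_formants()  # roughly 500, 1500, 2500 Hz
```

a real synthesizer replaces the uniform tube with the area function computed from the seven articulator positions, but the shape-to-resonance mapping is the same idea.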
0:07:36the model's
0:07:39productions are fed back to the model in the form of auditory and somatosensory information that go to maps
0:07:45for the auditory state and somatosensory state located in auditory cortical areas in heschl's gyrus and the posterior superior temporal gyrus
0:07:54and in somatosensory cortical areas in the ventral somatosensory cortex and supramarginal gyrus
0:08:01each of the large boxes here represents a map in the cerebral cortex
0:08:05and the smaller boxes represent sub cortical components of the model most notably a basal ganglia loop
0:08:13for initiating speech output
0:08:16and a cerebellar loop
0:08:18which contributes to several aspects of production i'm going to focus on the cortical components of the model today
0:08:26and so i'll use this simplified version of the model which doesn't have all the components but it has all
0:08:32the main processing levels that we'll need for today's talk so the highest level of processing in the model
0:08:40is what we call the speech sound map
0:08:42and this corresponds to cells in the left ventral premotor cortex and inferior frontal gyrus
0:08:49in what is commonly called broca's area and the premotor cortex immediately behind broca's area
0:08:57in the model each one of these cells comes to represent a different speech sound and a speech sound in
0:09:03the model can be either a phoneme or syllable or even a multi syllabic phrase the key thing here is
0:09:10that it's something that's produced
0:09:11very frequently so that there's a stored motor program for that speech sound and the canonical sort of speech sound
0:09:18that we use
0:09:19is the syllable so for the remainder of the talk i'll talk mostly about syllable production when referring to the
0:09:24speech sound map
0:09:26so cells in the speech sound map project
0:09:30both to the primary motor cortex through what we call a feedforward pathway which is a set of learned
0:09:37commands for producing these speech sounds and they activate associated cells in the motor cortex that command the right articulators
0:09:45but also the speech sound map cells project to sensory areas
0:09:49and what they do is they send
0:09:51targets to those sensory areas so if i want to produce a particular syllable such as ba
0:09:57when i say ba i expect to hear certain things i expect certain formant frequencies as a function of
0:10:03time and that information is represented by synaptic projections from the speech sound map over to what we call an
0:10:10auditory error map
0:10:11where this target is compared to incoming auditory information
0:10:16similarly when we produce a syllable we expect it to feel a particular way when i say ba for example i
0:10:22expect my lips to touch for the b and then to release
0:10:25for the vowel this sort of information is represented in a somatosensory target that projects over to the somatosensory
0:10:32cortical areas where it is compared to incoming somatosensory information
0:10:37these targets are learned as is the feed forward command during a learning process that i'll describe briefly in just a minute
0:10:45the arrows in the diagram represent synaptic projections from one type of representation to another
0:10:52so you can think of these synaptic projections as basically transforming information from one sort of representation frame into another
0:10:59representation frame and the main representations we focus on here are
0:11:04phonetic representations in the speech sound map
0:11:06motor representations in the articulator velocity and position maps
0:11:11auditory representations in the auditory maps and finally somatosensory representations in the somatosensory maps
0:11:18the auditory dimensions we use in the model typically correspond to formant frequencies and i'll talk about that
0:11:25quite a bit as i go on in the talk
0:11:27whereas the somatosensory targets correspond to things like
0:11:31pressure and tactile information from the lips and the tongue while you're speaking as well as muscle information about
0:11:40lengths of muscles that give you a read of where your articulators are in the vocal tract
0:11:47okay so just to give you a feel for what the model does i'm going to show the synthesizer the articulatory
0:11:54synthesizer with just purely random movements now so this is
0:11:58what we do in the very early stages of learning in the model we randomly move the speech articulators
0:12:05that creates auditory information and somatosensory information
0:12:09from the speech and we can associate the auditory information and the somatosensory information with each other and with the
0:12:16motor information that was used to produce the movements of speech so these movements don't sound anything like speech as
0:12:23you'll see here
0:12:25so this is just randomly activating the seven dimensions of movement
0:12:32so this is what the model does for the first forty five minutes we call this a babbling cycle it takes
0:12:37about forty five minutes of real time to go through this
0:12:40and what the model does is it tunes up many of the projections between the different areas so here for
0:12:45example in red are the projections that are tuned during this random babbling cycle
0:12:50so the key things being learned here are relationships between motor commands
0:12:56somatosensory feedback and auditory feedback
0:12:59and in particular what the model needs to learn for producing sounds later is how to correct for sensory errors
0:13:06and so what the model is learning largely is if i need to change my first formant frequency in an
0:13:13upward direction for example because i'm too low
0:13:16then i need to activate a particular set of motor commands and this will flow through a feedback
0:13:21control map to the motor cortex
0:13:24and will translate this auditory error into a motor corrective command
0:13:29and similarly if i feel that my lips are not closing enough for b there will be a somatosensory
0:13:36error representing that and that somatosensory error will then be mapped into a corrective motor command in the motor cortex
0:13:43these arrows in red here are the transformations basically the synaptic weights encoding these transformations and they're tuned up
0:13:51during this babbling cycle
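one way to picture what this babbling stage accomplishes is to pair random motor perturbations with the sensory changes they cause, then fit the reverse sensory-to-motor mapping from those pairs; the linear "synthesizer" below is a made-up stand-in for the maeda model, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear forward map: 7 articulator changes -> 2 formant changes.
A = rng.normal(size=(2, 7))

d_motor = rng.normal(size=(200, 7))   # random babbling movements
d_audio = d_motor @ A.T               # the auditory consequences

# Learn the auditory -> motor direction from the babbled pairs
# by least squares, mimicking the tuning of the red projections.
inv_map, *_ = np.linalg.lstsq(d_audio, d_motor, rcond=None)

# A desired auditory change can now be turned into a motor change.
wanted = np.array([50.0, -30.0])      # e.g. raise F1, lower F2
command = wanted @ inv_map
achieved = A @ command                # applying the command recovers the goal
```

the model of course learns this relationship with neural weight updates rather than a batch least-squares fit, but the information being stored is the same: which motor change produces which sensory change.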
0:13:55after the babbling cycle so at this point the model still has no sense of speech sounds this corresponds to
0:14:01very early babbling in infants
0:14:04up to about six months of age before they start really learning and producing sounds from a particular language and
0:14:11the next stage of the model handles the learning of speech sounds from a particular language and this is the
0:14:16imitation process in the model
0:14:18and what happens in the imitation process is we provide the model with an auditory target so we give it
0:14:23a sound file of somebody producing a word or phrase
0:14:28the formant frequencies are extracted and are used as the auditory target for the model
0:14:34and the model then attempts to produce the sound by reading out whatever feed forward commands it might have if
0:14:41it just heard the sound for the first time for the first time it will not have any feed forward
0:14:46commands because it hasn't yet produced the sound it doesn't know what commands are necessary to produce the sound
0:14:51and so in this case it's going to rely largely on auditory feedback control in order to produce the sound
0:14:57because all it has is an auditory target
0:14:59the model attempts to produce the sound it makes some errors but it does some things correctly due to the
0:15:05feedback control and it takes whatever commands are generated on the first attempt and uses them as the feed forward
0:15:11command for the next attempt
0:15:13so the next attempt now has
0:15:16a better feed forward command so there will be fewer errors and less of a correction
0:15:22but again both the
0:15:24feed forward command and the correction added together that's the total output that's then
0:15:29turned into the feed forward command for the next iteration and with each iteration the error gets smaller and smaller
0:15:35due to the incorporation of these corrective motor commands into the feed forward command
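the attempt-by-attempt loop just described can be caricatured in a few lines: each attempt's feedback correction is folded into the next attempt's feedforward command, so the error shrinks with every iteration; the target value, gain, and identity "vocal tract" are all invented for the sketch:

```python
target = 700.0        # desired auditory value, e.g. an F1 in Hz
feedforward = 0.0     # first attempt: no stored motor program yet
gain = 0.7            # feedback controller gain

errors = []
for attempt in range(6):
    produced = feedforward           # identity "plant" for simplicity
    error = target - produced        # detected by the error map
    feedforward += gain * error      # correction absorbed into feedforward
    errors.append(abs(error))
```

with any gain between 0 and 1 the residual error shrinks geometrically, which is the qualitative behaviour described for the model's six learning attempts.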
0:15:41just to give you an example of what that sounds like so here is an example that was presented to
0:15:46the model for learning
0:15:50good doggie
0:15:52this is a speaker saying good doggie and
0:15:54here it is once more
0:15:57good doggie
0:15:58and what the model is going to now try to do is it's going to try to mimic this with
0:16:03initially no feed forward command just using auditory feedback control the auditory feedback control system that was tuned up during the
0:16:11earlier babbling stage
0:16:13and so it does a reasonable rendition but it's kind of sloppy
0:16:18this is the second attempt it'll be significantly improved because the feedback commands from the first attempt have been
0:16:25now moved into the feed forward command
0:16:32and then by the sixth attempt the model has perfectly learned the sound meaning that there are no errors
0:16:39in its formant frequencies which is all you can hear from the sound pretty much and so it sounds like
0:16:47this was the original
0:16:49good doggie
0:16:50so what you can hear is that the formant frequencies pretty much track the original formant frequencies in this case
0:16:55they track imperfectly as we looked at just the first three formant frequencies of the speech sound
0:17:01when doing this and so in this case we would say the model has learned to produce this phrase now
0:17:06so it would have a speech sound map cell devoted to that phrase if we activate that cell it reads
0:17:12the phrase out now with no errors
0:17:16well an important aspect of this model is that it's a neural network and the reason we chose the neural
0:17:22network construction is so that we could
0:17:25investigate brain function in more detail so what we've done is we've taken each of the neurons in the model
0:17:31and we localise them in a standard brain space a stereotactic space
0:17:37that is commonly used for analysing neuroimaging results from experiments such as fmri experiments and so here these orange
0:17:46dots represent the different components of the model
0:17:50here for example this is the central sulcus in the brain the motor cortex is in front of
0:17:55the central sulcus and the somatosensory cortex is behind it
0:17:58and we have representations of the speech articulators in this region in both hemispheres
0:18:03the auditory cortical areas include state cells and auditory error cells which was a novel prediction we made from the
0:18:11model that these cells would reside somewhere in the higher level auditory cortical areas and i'll talk about testing that
0:18:17prediction in a minute
0:18:19we have somatosensory cells in the somatosensory cortical areas of the supramarginal gyrus here
0:18:26and these include our somatosensory error cells also crucial to
0:18:30feedback control
0:18:32and so forth so in general the representations in the model are bilateral meaning that the neurons for
0:18:40representing the lips are located in both hemispheres but the highest level of the model the speech sound map
0:18:47is left lateralized and the reason it's left lateralized is that
0:18:52a large amount of data from the neurology literature suggests that
0:18:57the left hemisphere is where we store our speech motor programs
0:19:01in particular if there is damage to the left ventral premotor cortex or adjoining broca's area here in the inferior
0:19:09frontal gyrus
0:19:10speakers have what's referred to as apraxia of speech and this is an inability to read out the motor
0:19:17programs for speech sounds so they hear the sound they understand what the word is and they
0:19:24they try to say it but they just can't get the syllables to come out and this in our view is
0:19:30because their motor programs represented by the speech sound map cells
0:19:34are damaged due to the stroke if you have a stroke in the right hemisphere in the corresponding location there
0:19:41is no apraxia speech is largely spared
0:19:45and in our view this is because the right hemisphere as i'll describe a bit later is more involved in
0:19:51feedback control than feed forward control
0:19:54an important insight is that once adult speakers learn to produce the speech sounds of their language
0:20:01and their speech articulators have largely stopped growing
0:20:04they don't need feedback control very often because their feed forward commands are already accurate
0:20:10and if you for example listen to the speech of somebody who became deaf as an adult for many
0:20:16many years their speech remains largely intelligible presumably because these motor programs are intact
0:20:23and they by themselves are enough to produce the speech properly
0:20:28in an adult however if we do something novel to the person such as
0:20:32block their jaw while they try to speak or we perturb the auditory feedback of their speech then we
0:20:38should reactivate the feedback control system by first activating sensory error cells that detect that the sensory feedback isn't what
0:20:46it should be
0:20:47and then motor correction takes place through the feedback control pathways of the model
0:20:54okay so just to highlight the
0:20:58use of these locations what i'll show you now is a typical simulation where we have the model produce an
0:21:05utterance in this case a short phrase
0:21:08and what you'll see is you'll hear first the production in our model the activities of the neurons correspond to electrical
0:21:15activity in the brain
0:21:17fmri actually measures blood flow in the brain and blood flow is a function of the electrical activity but it's
0:21:23quite slow relative to the activity it peaks four or five seconds after the speech has started and so what you'll see is
0:21:33the brain activity starting to build up in terms of blood flow over time after the utterance is produced
0:21:46quite useful for us because we can do neuroimaging experiments
0:21:50where people speak in silent
0:21:53and then we collect data after they're done speaking at the peak of this blood flow so what we would
0:21:58do is basically have them speak in silence and
0:22:03at this point we would take scans with an fmri scanner is very loud which would interrupt the speech if
0:22:09it was going on during your speech but in this case were able to scan after the speech is completed
0:22:14and get a measure of what brain activity what brain regions where active and how active they were during speech
0:22:23okay so that's an overview of the model next what i'll do is go into a little more detail about
0:22:28the functioning of the feedback control system
0:22:31and my main goal here is simply to give you a feel for the type of experiment we do we've
0:22:36done many experiments of this sort to test and refine the model over the years
0:22:41and the experiment i'll talk about in this case is an experiment involving auditory perturbation of the speech signal while the subject
0:22:48is speaking in an mri scanner
0:22:51so just to review then the model has the feed forward control system shown on the left here and the
0:22:59feedback control system shown on the right
0:23:01and feedback control has both an auditory and a somatosensory component
0:23:06so during production of speech when we activate this speech sound map cell to produce the speech sound
0:23:13in the feedback control system we read out these targets to the somatosensory system and to the auditory
0:23:18system and those targets are compared to the incoming auditory and somatosensory information
0:23:25the targets take the form of regions so there's an acceptable region that f one can be in
0:23:30if it's anywhere within this region it's okay but if it goes outside of the region an error cell is
0:23:35activated and that will drive the feedback control system
0:23:38by driving articulator movements that will move it back into the appropriate target region
0:23:45if we have an error arising in one of these maps and in particular we're gonna be focusing on the
0:23:51auditory error map
0:23:53what happens next in the model is that the error gets transformed
0:23:56through a feedback control map in the right ventral premotor cortex
0:24:01and then projected to the motor cortex in the form of a corrective motor command and so what the model
0:24:07has essentially learned is how to take auditory errors and correct them with motor movements
0:24:13in terms of mathematics this corresponds to a pseudoinverse of the jacobian matrix that relates the articulatory
0:24:20and auditory spaces
0:24:22and this can be learned during babbling simply by moving the articulators around and seeing what changes in somatosensory
0:24:28and auditory state take place
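a minimal numerical sketch of that pseudoinverse idea follows; the jacobian entries here are made up for illustration, and in the model the mapping is learned from babbling data rather than written down:

```python
import numpy as np

# Hypothetical Jacobian: rows are (F1, F2), columns are three articulators.
J = np.array([[ 80.0, -40.0,  10.0],
              [-30.0, 120.0,  60.0]])

J_pinv = np.linalg.pinv(J)              # the auditory-to-motor direction

auditory_error = np.array([50.0, 0.0])  # need F1 raised by 50 Hz
motor_correction = J_pinv @ auditory_error
achieved = J @ motor_correction         # the auditory change this produces
```

because the system is redundant, with more articulators than formants, the pseudoinverse picks the smallest motor correction that achieves the requested auditory change.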
0:24:31the fact that we have this feedback control map in the right ventral premotor cortex in the model now
0:24:36was partially the result of the experiment that i'll be talking about this was not originally in the model originally
0:24:42these projections went to the primary motor cortex
0:24:44i'll show the experimental result that caused us to change that component of the model
0:24:52so based on this feedback control system we can make some explicit predictions about brain activities during speech
0:24:59and in particular we made some predictions about what would happen if we shifted your first formant frequency during speech
0:25:07so that when we feed it back to you over earphones in fifty milliseconds you hear something slightly different than
0:25:14what you're actually producing
0:25:16well according to our model this should cause activity of cells in the auditory error map which we have localised to
0:25:24the posterior superior temporal gyrus and the adjoining planum temporale these regions in the sylvian fissure
0:25:31on the temporal lobe
0:25:32so we should see increased activity there if we perturb the speech
0:25:38and also we should see some motor corrective activity because according to our model the feedback control system will kick
0:25:45in when it hears this error even during the perturbed utterance
0:25:48and it will try to correct if the utterance is long enough it will try to correct the error that
0:25:54it hears
0:25:56now keep in mind that auditory feedback takes time to get back up to the brain so the time from
0:26:02motor cortical activity to movement and sound output to hearing that sound output and
0:26:09projecting it back up to your auditory cortex is somewhere in the neighbourhood of a hundred to a hundred fifty milliseconds
0:26:16and so we should see a corrective command kicking in not at the instant that the perturbation starts
0:26:22but about a hundred or a hundred twenty five milliseconds later because that's how long it takes to process this auditory information
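that timing argument can be checked with a toy closed loop: if the corrective command can only use feedback that is one loop delay old, compensation necessarily begins about one delay after perturbation onset; the step size, gain, and the 125 ms figure below are illustrative choices, not fitted values:

```python
dt_ms = 5
delay_steps = 125 // dt_ms       # ~125 ms auditory loop delay
perturbation = 100.0             # Hz added to the heard F1
gain = 0.02                      # small corrective gain per step

command = 0.0
produced = []                    # the compensatory command over time
for t in range(80):
    produced.append(command)
    if t >= delay_steps:
        # the correction uses the heard error from one delay ago
        heard_error = -(produced[t - delay_steps] + perturbation)
        command += gain * heard_error
```

the output stays flat for the first 125 ms of the perturbation and only then drifts opposite to the shift, mirroring the onset latency predicted for subjects' compensation.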
0:26:30so what we did was we developed a digital signal processing system that allowed us to shift the first formant
0:26:37frequency in real time meaning that a subject hears the sound with a sixty millisecond delay which is pretty much unnoticeable
0:26:46to the subject
0:26:47even unperturbed speech has that same sixty millisecond delay so they're always hearing
0:26:52a slightly delayed version of their speech over headphones we play it rather loud over the headphones and they speak quietly
0:26:59as a result of this and the reason we do that is we want to minimize things like bone conduction
0:27:04of the actual speech
0:27:06and make them focus on the auditory feedback that we're providing them which is the perturbed auditory feedback
0:27:12and what we do in particular is we take the first formant frequency and in one fourth of the utterances
0:27:18we will perturb it either up or down so three out of every four utterances are unperturbed
0:27:25one in four is perturbed well excuse me one in eight is perturbed up and one in eight is perturbed
0:27:32down so
0:27:33they get these perturbations randomly distributed they can't predict them because first of all the direction changes all the time
0:27:42and secondly because many of the productions are not perturbed
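the randomized design just described, one in eight trials shifted up, one in eight shifted down, and the rest unperturbed, is easy to sketch; the trial count and seed are arbitrary:

```python
import random

def make_schedule(n_trials, seed=0):
    """1/8 up-shift, 1/8 down-shift, 3/4 unperturbed, in random order
    so subjects cannot predict either the timing or the direction."""
    assert n_trials % 8 == 0
    trials = (["up"] * (n_trials // 8)
              + ["down"] * (n_trials // 8)
              + ["none"] * (3 * n_trials // 4))
    random.Random(seed).shuffle(trials)
    return trials

schedule = make_schedule(80)
```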
0:27:46and what we did well here's what this sounds like so the people were producing vowels
0:27:52like the eh in bet and so the words that they would produce were words like bet and peck and
0:28:00and here's an example of unshifted speech before the perturbation
0:28:09and here is a case where we've shifted f one upward an upward shift of f one corresponds to a more
0:28:16open mouth and that should make the
0:28:19eh vowel sound a little bit more like the a in bat
0:28:22and so if you hear the perturbed version of that production
0:28:27it sounds more like bat than bet in this case so that's the original
0:28:41so it's consciously noticeable to you now when i play it to you like this but most subjects don't notice what's
0:28:46going on during the experiment we ask them afterwards if they noticed anything sometimes they'll say
0:28:52occasionally my speech sounded a little odd but usually they didn't really notice that much of anything going on with
0:28:59their speech and yet their brains are definitely picking up this difference and we found that with the fmri
0:29:07we also look at their formant frequencies so what i'm showing here is
0:29:13a normalized f one
0:29:16and what normalized means in this case is that the f one in a baseline unperturbed utterance
0:29:22is what we expect to see and we'll take the f one in a given utterance and compare it to that
0:29:30if it's exactly the same then we'll have a value of one so if they're producing the exact same thing as
0:29:36they do in the baseline they would stay flat on this value of one
0:29:39on the other hand if they're increasing their f one then we'll see the normalized f one go above one
0:29:46and if they're decreasing f one we'll see it go below one
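that normalization is just a pointwise ratio against the baseline trace, so a value of 1.0 means identical to baseline; the numbers below are invented for illustration:

```python
def normalized_f1(f1_trace, baseline_trace):
    """Divide each sample of an utterance's F1 trace by the
    corresponding sample of the baseline (unperturbed) F1 trace."""
    return [f / b for f, b in zip(f1_trace, baseline_trace)]

baseline = [600.0, 610.0, 620.0]        # mean unperturbed F1 over time
compensating = [600.0, 640.5, 682.0]    # speaker raising F1 over time
ratios = normalized_f1(compensating, baseline)
```

a trace that climbs above 1.0 over the utterance is the signature of upward compensation against a downward shift, which is what the plots described next show.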
0:29:51the gray shaded areas here are the ninety five percent confidence intervals of the subjects' productions in the experiment
0:29:59and what we see for the down shift is that over time the subjects increase their f one to try
0:30:06to correct for the decrease of f one that we've
0:30:09given them with the perturbation
0:30:12and in the case where we up shift their speech they decrease f one as shown by this confidence interval
0:30:19the split between the two occurs right about where we expect which is somewhere around a hundred to a hundred and
0:30:26fifty milliseconds after the first sound comes out that they hear with the perturbation
0:30:33the solid lines here are the results of simulations of the diva model producing the same speech sounds under perturbed conditions
0:30:41and so the black dashed line here shows the model's productions in the up shift condition we see it waits about
0:30:47a hundred twenty five milliseconds well in this case actually it only waits about eighty milliseconds our delay loop was a bit short here
0:30:52and then it starts to compensate for the perturbation
0:30:56similarly in the down shift case it goes for about eighty milliseconds until it starts to hear the error and
0:31:03then it compensates in an upward direction
0:31:05and we can see that the model's productions fall within the confidence intervals of the subjects' productions so the model
0:31:11produces a good fit to the behavioural data
0:31:16but we also took a look at the neuroimaging data and on the bottom what i'm showing is the results
0:31:23of a simulation that we ran before the study where we generated predictions of fmri activity
0:31:30when we compare shifted speech to non shifted speech as i mentioned when we shift the speech that should turn
0:31:37these auditory error cells on and we've localised them to these posterior areas of the superior temporal gyrus here
0:31:44when those error cells become active they should lead to a motor correction and these are shown by activities in
0:31:51the motor cortex here in the model simulation
0:31:55now we also see a little bit of cerebellar activity here in the model but i'll skip that for today
0:32:02here on the top is what we actually got from our experimental results for the ship minus no ship contrast
0:32:08the auditory hair cells were pretty much where we expected them so first of all there are auditory ourselves there
0:32:15are cells in your brain that detect the difference between what you're saying and what you expect it to sound
0:32:20like even as an adult
0:32:22these auditory errors of become active at but we noticed is that the motor corrective activity we saw was actually
0:32:29right lateralized in it was pretty motor it wasn't bilateral and primary motor as we predicted it's farther forward in
0:32:36the brain it's in a more pretty motor cortical real area
0:32:39and it's right-lateralized. so one of the things we learned from this experiment was that auditory feedback control appears
0:32:46to be right-lateralized in the frontal cortex.
0:32:49and so we modified the model to have an auditory feedback,
0:32:53or sorry, a feedback control map, in the right ventral premotor cortex area, corresponding with this region here.
0:33:01we actually ran a parallel experiment where we perturbed speech with a balloon in the mouth. so we actually
0:33:09built a machine that
0:33:11perturbed your jaw while you were speaking, so you would be saying something like "a-pa", and during
0:33:16the vowel this balloon would blow up very rapidly; it was actually the finger of a glove
0:33:21that would blow up to about a centimetre and a half and would block your jaw from closing, so that when
0:33:26you were
0:33:27done with the vowel and getting ready to say the consonant and the final vowel, the jaw was blocked;
0:33:33the jaw couldn't move as much, and subjects compensate again.
0:33:37and we saw in that experiment activity in the somatosensory cortical areas corresponding to this somatosensory error map,
0:33:45but we also saw right-lateralized motor cortical activity. and so based on these two experiments
0:33:51we modified the model to include a right-lateralized feedback control map that we did not have in the original model.
0:34:02okay so
0:34:03the other thing we can do is look at connectivity in brain activities using techniques such as structural
0:34:10equation modelling. very briefly, in a structural equation modelling analysis what we do is we use a
0:34:18predefined model of connectivity in the brain, and then we go and look at the fMRI data and
0:34:24see how much of the covariance matrix of the fMRI data can be captured by this model
0:34:31if we optimize the connections. and so what SEM does is it
0:34:36produces connection strengths from that modelling and gives you goodness-of-fit data.
0:34:41and in addition to being able to fit the data very well, meaning that the connections in the model are
0:34:47in the right place,
0:34:49we also noted an increase in what's called effective connectivity, an increase in the strength of the
0:34:56effect of these
0:34:57auditory areas on the motor areas in the right hemisphere when the speech was perturbed. so the interpretation of that
0:35:05is, when we perturb your speech with an auditory perturbation like this,
0:35:09the error cells are active, that drives activity in the right ventral premotor cortex, and so we have an
0:35:14increased effect on the motor cortex from the auditory areas in this case.
0:35:19and so this is further support for the structure in the model and the feedback control system that we just
0:35:28described.
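the effective-connectivity finding can be illustrated with a toy version of the analysis. this sketch uses synthetic ROI time series and a single least-squares path coefficient as a stand-in for a full structural equation model; the region names, coupling strengths, and noise level are all assumptions for illustration, not the study's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic time series standing in for the fMRI data: an auditory error
# region (posterior superior temporal) driving right ventral premotor
# cortex, with stronger coupling on perturbed (shifted) trials.
n = 200
aud = rng.standard_normal(n)
pm_noshift = 0.2 * aud + 0.5 * rng.standard_normal(n)  # unperturbed trials
pm_shift = 0.8 * aud + 0.5 * rng.standard_normal(n)    # shifted trials

def path_strength(source, target):
    """Least-squares path coefficient from source ROI to target ROI,
    a one-connection stand-in for optimizing SEM connection strengths."""
    return float(np.linalg.lstsq(source[:, None], target, rcond=None)[0][0])

w_noshift = path_strength(aud, pm_noshift)
w_shift = path_strength(aud, pm_shift)
# the perturbed condition recovers the larger auditory-to-premotor influence
```

a real SEM optimizes many connections jointly against the full covariance matrix and reports goodness of fit; this one-path regression only conveys what "an increase in effective connectivity" means operationally.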
0:35:30okay so that's one example of an experimental test we've done a very large number of a test of this
0:35:36we've tested predictions of can "'em" addicts in the model so we look we work with people who measure articulator
0:35:45movements using
0:35:46electromagnetic articulatory this is a technique where you basically glue receiver coils on the talking in the lips and the
0:35:55job and you can measure the very accurately the position of the articulators of these points on the articulators
0:36:03in the midsagittal plane and from this you can estimate quite a accurately in time the positions of speech articulators
0:36:10and compare them to
0:36:12productions that use the in the model we've done a lot of work looking at for example phonetic context effects
0:36:19in our production which i'll come back to later R is a phoneme in english that is produced with a
0:36:24very wide range of articulatory variability
0:36:27but the acoustic cues for /r/ are very stable; this has been shown by people such as Boyce and Espy-Wilson.
0:36:34and what you see if you produce movements with the model is that
0:36:40the model will also produce very different articulations for /r/ in different phonetic contexts, and this has to do with
0:36:45the fact that it's starting from different initial positions and it's simply going to the closest point to
0:36:51the acoustic target
0:36:53that it can get to, and that point will be in different parts of the articulator space depending on where
0:36:58you start.
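the closest-point idea can be sketched with a toy forward map: descending the acoustic-error gradient from different starting configurations lands on different articulations that achieve the same acoustic target. the forward map, target value, and starting "contexts" below are invented for illustration; DIVA itself uses a full articulatory synthesizer.

```python
import numpy as np

# Toy forward map from a 2-D "articulator" configuration to one acoustic
# variable (think F3). Invented for illustration only.
def acoustics(artic):
    return np.sin(artic[0]) + 0.5 * artic[1]

def reach_target(start, target, lr=0.1, steps=600):
    """Descend the acoustic-error gradient: move toward the closest
    articulator configuration that reaches the acoustic target."""
    a = np.array(start, dtype=float)
    eps = 1e-5
    for _ in range(steps):
        err = acoustics(a) - target
        grad = np.array([  # numerical Jacobian of the forward map
            (acoustics(a + np.array([eps, 0.0])) - acoustics(a)) / eps,
            (acoustics(a + np.array([0.0, eps])) - acoustics(a)) / eps,
        ])
        a -= lr * err * grad
    return a

TARGET = 0.9
end_a = reach_target([0.0, 0.0], TARGET)  # "context" 1 starting position
end_b = reach_target([1.5, 1.0], TARGET)  # "context" 2 starting position
# both endpoints hit the same acoustic target via different articulations
```

both runs reach the acoustic target to within rounding, yet the final articulator configurations differ substantially, which is the context-dependent variability the model exhibits for /r/.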
0:37:00we've looked at a large number of experiments on other types of articulatory movements, both in
0:37:08normal-hearing and hearing-impaired individuals. we look at what happens when you put a bite block in, we look
0:37:13at what happens when you noise-mask speakers, and we've also looked at what happens over time in the
0:37:21speech of people with cochlear implants, for example. so,
0:37:24in the case of a cochlear implant recipient who was an adult and had already learned to speak,
0:37:29when they first
0:37:31receive the cochlear implant they hear sounds that are not the same as the sounds that they used to hear,
0:37:38so their auditory targets don't match
0:37:41what's coming in from the cochlear implant, and it actually impairs their speech for a little while, for about
0:37:48a month or so, before they start to improve their speech, and by a year they show very strong
0:37:54improvements in their speech.
0:37:56and according to the model this is occurring because they have to retune their auditory feedback control system to deal
0:38:02with the new feedback, and only when that auditory feedback control system is tuned can they start to retune their
0:38:07movements to produce more distinct speech sounds.
0:38:12we've also done a number of neuroimaging experiments. for example, we predicted that the left ventral premotor cortex
0:38:21involves syllabic motor programs,
0:38:24and we used a technique called repetition suppression in fMRI, where you present stimuli that change in some
0:38:32dimensions but don't change in other dimensions.
0:38:35with this technique you can find out what it is about the stimuli that a particular brain region cares
0:38:41about, and using this technique we were able to show that in fact the only region in the brain that
0:38:46we found that had
0:38:47a syllabic sort of representation was the left ventral premotor cortex, where we believe these syllabic motor programs are located,
0:38:54highlighting the fact that the syllable is a particularly important entity for motor control.
0:39:00and this we believe is because our syllables are very highly practised and well-tuned motor programs
0:39:07that we can read out; we don't have to produce the individual phonemes, we read out the whole syllable as
0:39:12a motor program that we've stored in memory.
0:39:16finally, we've fortunately been able to even test the model's predictions electrophysiologically. this was in
0:39:24the case
0:39:25of a patient with locked-in syndrome that i'll speak about in a bit, and i'll talk about exactly what
0:39:30we were able to verify using electrophysiology, in this case actually recording from neurons in the cortex.
0:39:39okay so
0:39:40the last part of my talk now will start to focus on using the model to investigate communication disorders,
0:39:47and we've done a number of studies of this sort. as i mentioned, we've looked at speech in normal-hearing
0:39:54and hearing-impaired populations.
0:39:57we are now doing quite a bit of work on stuttering, which is a very common speech disorder that affects
0:40:03about one percent of the population. stuttering is a very complicated disorder. it's been known
0:40:10since the beginning of time, basically; every culture seems to have people who stutter within that culture. people have
0:40:17been trying to cure stuttering forever, and we've been unable to do so. and the brains of people who
0:40:23stutter are actually
0:40:24really similar to the brains of people who don't stutter, unless you look very closely. and if you start looking
0:40:30very closely you start to see things like white matter differences
0:40:35and grey matter thickness differences in the brain, and these tend to be localised around the basal
0:40:41ganglia-thalamo-cortical loop. and so our view of stuttering is that several different problems can occur in this loop;
0:40:48different people who stutter
0:40:51can have different locations of damage or of an anomaly in their basal ganglia-thalamo-cortical loop, and all of these can
0:40:59lead to stuttering. and the complexity of this disorder is partly because
0:41:05it's a system-level disorder where different parts of the system can cause problems; it's not always the same part
0:41:11of the system that's problematic in different people who stutter. and so one of the important areas of research
0:41:19for stuttering is
0:41:20computational modelling of this loop, to get a much better understanding of what's going on and how these different problems
0:41:25can lead to similar sorts of behaviour.
0:41:29we're also looking at spasmodic dysphonia, which is a vocal fold problem similar to a dystonia;
0:41:36it's a
0:41:37problem where typically the vocal folds are too tense during speech.
0:41:42again, this appears to be basal ganglia loop related.
0:41:46there's apraxia of speech, which involves left-hemisphere frontal damage; childhood apraxia of speech, which is actually
0:41:52a different disorder from acquired apraxia of speech, and tends to involve
0:42:00kind of lesser damage but in a more widespread portion of the brain;
0:42:05and so forth. and the project i'll talk most about here will be a project involving a neural prosthesis for locked-in
0:42:12syndrome. this is a project that we've done with Phil Kennedy from Neural Signals,
0:42:19who developed technology for implanting the brains of people with locked-in syndrome, and we helped him build a prosthesis
0:42:28from that technology.
0:42:31so typically our studies where we're looking at disorders involve some sort of damaged version of the model. it's a
0:42:37neural network, so we can go in and we can mess up white matter projections, which are these synaptic projections;
0:42:42we can mess up
0:42:43neurons in a particular area; we can even adjust things like levels of neurotransmitters. some studies suggest that there may
0:42:53be an excess of dopamine in some people who stutter,
0:42:56or an excess of dopamine receptors, in the basal ganglia loop. so we can go in and we
0:43:01can start changing dopamine levels and seeing how that changes both the behaviour of the model and also the
0:43:07brain activities of the model.
0:43:09and what we're doing now is running a number of imaging studies involving people who stutter, where we've made predictions
0:43:15based on several possible
0:43:19loci of damage in the brain that may result in stuttering, and we're testing those predictions both by seeing
0:43:25if the model is capable of producing stuttering behaviour but also by seeing if the brain activities
0:43:31match up with what we see in people who stutter. there are many different ways to invoke stuttering in the
0:43:36model, but each way causes a different pattern of brain activity to occur.
0:43:42so by having both the behavioural results and the neuroimaging results we can do a much more detailed
0:43:49treatment of what exactly is going on in this population.
0:43:54the example i'm gonna spend the rest of the talk describing is a bit different. in this case the
0:44:01speech motor system of the patient was damaged: the
0:44:06patient was suffering from locked-in syndrome due to a brain stem stroke.
0:44:11locked-in syndrome is a syndrome where
0:44:15patients have intact cognition and sensation but are completely unable to perform voluntary movement, so it's a case of being
0:44:23almost kind of
0:44:25buried alive in your own body. the patients sometimes have eye movements; the patient we worked with could very slowly
0:44:33move his eyes up and down, his eyelids actually, to answer yes/no questions.
0:44:39this was his only form of communication.
0:44:42and so prior to our involvement in the project he was implanted, as part of a project developing technologies for
0:44:51locked-in patients to control computers or external devices.
0:44:56these technologies are referred to by several different names: brain-computer interface, brain-machine interface, or neural prosthesis,
0:45:05and in this case we were focusing on a neural prosthesis for speech restoration.
0:45:10locked-in syndrome is typically caused by brain stem stroke in the ventral pons or, more commonly, people become
0:45:19locked in through neurodegenerative diseases such as ALS, which attack the motor system.
0:45:25people who suffer from ALS
0:45:27go through a stage in the later stages of the disease where they are basically locked in: unable
0:45:34to move or speak
0:45:35but still fully conscious and with sensation.
0:45:41well, the electrode that was developed by our colleague Phil Kennedy is schematized here, and here's a photograph of
0:45:49it. it's a tiny glass cone that is open on both ends; the cone is about a millimetre long. there are
0:45:56three gold wires inside the cone.
0:45:59they're coated with an insulator except at the very end, where the wire is cut off, and that acts as
0:46:07a recording site. so there are three recording sites within the cone: one is used as a reference and the
0:46:12other two are used as recording channels.
0:46:15this electrode is inserted into the cerebral cortex. here i've got a schematic of the cortex,
0:46:23which consists of six layers of cell types.
0:46:28the goal is to get this near layer five of the cortex,
0:46:32where the output neurons are. these are the output neurons of the motor cortex, the
0:46:39neurons that project to the periphery to cause movement.
0:46:42but it doesn't matter too much where you go, because the cone is filled with a nerve growth factor, and
0:46:47what happens is,
0:46:49over a month or two, axons actually grow into this cone and lock it into place. that's very important because
0:46:55it stops movement. if you have movement of an electrode in the brain,
0:47:00you get problems such as gliosis, which is scar tissue building up around the electrode and stopping the
0:47:06electrode from picking up signals.
0:47:08in this case the wires are actually inside a protective glass cone, and no gliosis builds up inside the cone,
0:47:16so it's a permanent electrode: you can implant this electrode and record from it for many years.
0:47:22when we did the project i'll talk about, the electrode had been in the subject's brain for over three and
0:47:28a half years.
0:47:33the electrode location was chosen in this case by having the subject attempt to produce speech while in an fMRI scanner,
0:47:41and what we noticed was that the brain activity is relatively normal; it looks like the brain activity
0:47:49of a neurologically normal person trying to produce speech. in particular, there's a blob of activity on
0:47:57the precentral gyrus, which is the location of the motor cortex,
0:48:01in the region where we expect it for speech. so i'm going to refer to this region as speech motor cortex;
0:48:08this is where the electrode was implanted. so this is an fMRI scan performed before implantation. here is actually
0:48:15a CT scan afterwards, where you can see in the same brain area the wires of the electrode coming out.
0:48:22this bottom picture is a three-D CT scan showing the skull, where you can see
0:48:29the craniotomy where the electrode was inserted. you can see the wires coming out, and the wires
0:48:34go into a package of electronics that is located under the skin,
0:48:39and these electronics amplify the signal and then send it as radio signals across the scalp.
0:48:44we attach antennas, basically just antenna coils, to the scalp, so the subject has a normal-looking head;
0:48:52he's got hair on his head, there's nothing sticking out of his head.
0:48:56when he comes into the lab we attach these antennas to the scalp and we tune them to just the
0:49:02right frequencies, and they pick up the two signals that we are generating from our electrode.
0:49:08the signals are then routed to a recording system and then to a computer where we can operate on them
0:49:14in real time.
0:49:19Kennedy had implanted the patient several years before we got involved in the project,
0:49:27but they were having trouble decoding the signals, and part of the problem is
0:49:31that if you look in motor cortex there's nothing obvious that corresponds to a word or, for that matter, a syllable or
0:49:38phoneme. you don't see neurons turn on when the subject produces a particular syllable and then shut off when the
0:49:46subject's done.
0:49:47you see instead that all the neurons are just subtly changing their activity over time. so it appears
0:49:53that there's some sort of continuous representation here in the motor cortex; there's not a representation of just words and
0:49:59phonemes, at least at the motor level.
0:50:02Kennedy's group contacted us because we had a model of what these brain areas are doing, and so
0:50:09we collaborated on decoding these signals and routing them to a speech synthesizer so the subject could actually control some
0:50:17speech output.
0:50:20the tricky question here is: what is the neural code for speech in the motor cortex?
0:50:26and the problem of course is that there are no prior studies; people don't go into a human motor cortex
0:50:33and record, normally,
0:50:35and monkeys don't speak, nor do other animals. so we don't have any single-cell data about what's going
0:50:41on in the motor cortex during speech. we have data from arm movements, and we used the insights from those,
0:50:48but we also used insights from what we saw in human speech movements to determine what the
0:50:54variables were that speakers were controlling, what the motor system was caring about:
0:50:59did it mostly care about muscle positions, or did it care about the sound signal?
0:51:04and there is some available data from stimulation studies of the motor cortex. these come from
0:51:11the work of Penfield, who worked with epilepsy patients who were having surgeries to remove portions of the
0:51:18cortex that were
0:51:19causing epileptic fits.
0:51:22before they did the removal, what they would do is actually stimulate in the cortex to see what
0:51:30parts of the brain were doing what; in particular, what they wanted to do was avoid parts of the brain
0:51:35involved in speech, and they mapped out along the motor cortex areas that cause movements of the speech articulators, for
0:51:41example, and other areas that caused interruptions of speech, and so forth.
0:51:46and these studies were informative, and we used them to help us determine where to localise some of
0:51:52the neurons in the model, but they don't really tell you about what kind of representation is being used by
0:51:57the neurons. when you stimulate a portion of cortex you're stimulating hundreds of neurons minimally; they were using something like
0:52:04two volts for stimulation, and the maximum activity of a neuron is fifty-five millivolts, so the stimulation signal was dramatically
0:52:11bigger than any natural signal,
0:52:13and it activates a large area of cortex, and so you see a gross,
0:52:17poorly formed movement coming out. and speech movements tended to be things like a vocalisation: the
0:52:22subject might say "ah",
0:52:24something like that; it's just a movement, it's not really a speech sound. they don't produce any words or anything
0:52:30like that,
0:52:31and from these sorts of studies it's next to impossible to determine what sort of representation is going on in
0:52:37the motor cortex.
0:52:39however, we do have our model, which does provide the first explicit characterisation of what the response properties should
0:52:46be of speech motor cortical cells. we have actual speech motor cortical cells in the model; they are tuned to
0:52:52particular things.
0:52:54and so what we did was we used the model to guide our search for information in this part of
0:53:00the brain.
0:53:01and i want to point out that the characterisation provided by the model was something that we spent twenty years
0:53:08refining: we ran a large number of experiments testing different possibilities about how speech was controlled,
0:53:15and we ended up with a particular format in the model, and that's no coincidence; that's because we spent a
0:53:22lot of time looking at that. here is the result of one such study, which highlights the fact
0:53:28that in motor planning,
0:53:30sound appears to be more important than where your tongue is actually located. and this is a study of the
0:53:37phoneme /r/ that i mentioned before. just to describe what you're going to see here: each of
0:53:43these lines you see represents a tongue shape,
0:53:47and there are two tongue shapes in each panel; there's a dashed line.
0:53:52so this is the tip of the tongue, this is the centre of the tongue, and this is the back of the tongue;
0:53:56we're actually measuring the positions of these transducers that are located on the tongue using electromagnetic articulometry.
0:54:01and the dashed lines show the tongue shape that occurs seventy-five milliseconds before
0:54:09the centre of the /r/, which happens to be the minimum of the F3 trajectory,
0:54:14and the dark bold lines show the tongue shape at the centre of the /r/, at that F3
0:54:21minimum. so in this case you can see the speaker used
0:54:24an upward movement of the tongue tip to produce the /r/
0:54:28in this panel.
0:54:30so what we have over here are two separate subjects, where we have measurements from the subject on the
0:54:36top row and then productions of the model represented in the bottom row, and the model was actually using a speaker-specific
0:54:43vocal tract in this case. so
0:54:45what we did was we took the subject and we collected a number of MRI scans while they
0:54:50were producing different phonemes;
0:54:52we did principal components analysis to pull out their main movement degrees of freedom; we had their acoustic signals; and
0:54:58so we built a synthesiser that had their vocal tract shape and produced their formant frequencies.
0:55:04then we had the DIVA model learn to control their vocal tract. so we put this vocal tract synthesiser in
0:55:10place of the Maeda synthesizer, we babbled the vocal tract around, had it learn to produce /r/s, and
0:55:16then we went back and had it
0:55:18produce the stimuli in the study. and in this case the people were producing nonsense utterances
0:55:24in which
0:55:25the /r/ was either preceded by a vowel, a /d/, or a /g/.
0:55:34what we see is that the subject produces very different movements in these three cases. so in the vowel context the
0:55:40subject uses an upward movement of the tongue tip, like we see over here,
0:55:44but in the /d/ context the subject actually moves their tongue backwards to produce the /r/, and
0:55:49in the /g/ context they move their tongue downward to produce the /r/. so they're using three completely different gestures,
0:55:55or articulatory movements, to produce the /r/, and yet they're producing pretty much the same F3 trace; the
0:56:01F3 traces are very similar in these cases.
0:56:04if we take the model and we have it produce /r/s with the speaker-specific vocal tract, we see that the
0:56:11model, because it cares about the acoustic signal primarily, is trying to hit this F3 target,
0:56:17and the model also uses different movements in the different contexts, and in fact the movements reflect the movements of the
0:56:23speaker. so here the model uses an upward movement of the tongue tip, here the model uses a backward movement
0:56:29of the tongue, and here the model uses a downward movement of the tongue to produce the /r/. so
0:56:34what we see is that with a very simple model that's just going to the appropriate position in formant frequency
0:56:39space, we can capture this complicated variability in the articulator movements
0:56:45of the actual speaker.
0:56:47another thing to note here: this is the second speaker. again the model replicates the movements, and the
0:56:53model also captures speaker-specific differences. in this case, this speaker used a small upward tongue tip movement to produce
0:57:01the /r/,
0:57:02but that speaker, for reasons having to do with the morphology of their vocal tract, had to do a
0:57:06much bigger movement of the tongue tip to produce the /r/ in this context,
0:57:11and again the model produces a bigger movement in that speaker's case than in this speaker's case. so
0:57:17this provides pretty solid evidence that speakers are really concentrating on
0:57:21the formant frequency trajectories of their speech output, more so than where the individual articulators are located.
0:57:29and so we made a prediction that we should see formant frequency representations in the speech motor cortical area if
0:57:38we're able to look at what's going on during speech.
0:57:42this slide, i'm sure everybody here follows; these are actually the formant frequency traces for "good doggie". this
0:57:51is what i used as the target for the
0:57:54simulations i showed you earlier, and down here i show the first two formant frequencies, what's called the formant
0:58:00plane. and the important point here is that if we can just change F1 and
0:58:06F2, we can produce pretty much all of the vowels
0:58:09of the language, because they are differentiated by their first two formant frequencies. and so formant frequency space provides a
0:58:18very low-dimensional continuous space for the planning of movements,
0:58:22and that's crucial for the development of the brain-computer interface.
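a minimal sketch of why two formants go a long way: an impulse-train source passed through two resonators already yields recognizably different vowel-like signals. the formant values and bandwidths below are rough textbook-style guesses, not the synthesizer used in the talk.

```python
import numpy as np

FS = 16000  # sample rate in Hz

def resonator(x, freq, bw):
    """Two-pole IIR resonance at `freq` Hz with bandwidth `bw` Hz,
    the classic building block of formant synthesis."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2.0 * np.pi * freq / FS
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

def vowel(f1, f2, f0=120, dur=0.3):
    """Crude two-formant vowel: a glottal impulse train at pitch `f0`
    passed through resonators at F1 and F2 (bandwidths are guesses)."""
    src = np.zeros(int(FS * dur))
    src[:: FS // f0] = 1.0               # impulse train at the pitch period
    out = resonator(resonator(src, f1, 80.0), f2, 100.0)
    return out / np.max(np.abs(out))     # normalize to unit peak

ee = vowel(300, 2300)   # F1/F2 roughly like "ee"
ah = vowel(700, 1200)   # F1/F2 roughly like "ah"
```

moving continuously through (F1, F2) pairs with this kind of synthesizer is exactly the two-dimensional control problem the prosthesis targets.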
0:58:27okay, and why is it crucial? well,
0:58:31there have been a number of brain-computer interfaces that involve implants in the hand area
0:58:37of the motor cortex,
0:58:39and what they do usually is they decode cursor position on the screen from neural activities in the hand area,
0:58:46and people learn to control movement of a cursor by activating their neurons in their hand motor cortex.
0:58:55now, when they build these interfaces they don't try to decode all of the joint angles of the arm
0:59:01and then determine where the cursor would be based on where the mouse would be; instead they go directly to
0:59:06the output space, in this case the two-dimensional cursor space.
0:59:11and the reason they do that is we're dealing with a very small number of neurons in these sorts of
0:59:15studies relative to the entire motor system: there are hundreds of millions of neurons involved in your motor system,
0:59:21and in the best case you might get a hundred neurons in the brain-computer interface. we were actually getting
0:59:26far fewer than that; we had a very old implant that only had two electrode wires,
0:59:32so we had less than ten neurons, maybe as few as two or three neurons;
0:59:38we could pull out more signals than that, but they weren't single-neuron activities.
0:59:42well, if we tried to pull out a high-dimensional representation of the arm configuration from a small number of
0:59:49neurons we would have a tremendous amount of error, and this is why they don't do that; instead they try
0:59:54to pull out a very low-dimensional thing, which is this two-D cursor position.
0:59:58well, we're doing the analogous thing here. instead of trying to pull out all of the articulator positions that determine
1:00:04the shape of the vocal tract, we're simply going to the output space, which is the formant frequency space, which
1:00:10for vowel production can be as simple as a two-dimensional signal.
1:00:16okay, so what we're doing is basically decoding an intended sound position in this two-D formant frequency space
1:00:23that's generated from motor cortical cells, but it is a much lower-dimensional thing than the entire vocal tract shape.
1:00:32well, the first thing we needed to do was verify that this formant frequency information was actually in this part
1:00:37of the brain, and the way we did this was we had the subject try to imitate a minute-long
1:00:44vowel sequence that was something like
1:00:46"ah, ee, oo". this lasted a minute, and the subject was told to do this in synchrony
1:00:56with the stimulus.
1:00:58this is crucial because we don't know otherwise when he's trying to speak, because no speech comes out. and
1:01:04so what we do is we record the neural activities during this minute-long attempted utterance
1:01:08and then we try to map them onto the formant frequencies that the subject was trying to imitate. so the
1:01:14square wave here is the actual F2
1:01:21going up and down, and here's the actual F1 going up and down for the different vowels,
1:01:27and the non-bold squiggly line here is the decoded signal. it's not great, but it's actually
1:01:35highly statistically significant: we did cross-validated training and testing, and we had a very highly significant
1:01:42representation of the formant frequencies, with R values of 0.69 for F1 and 0.68
1:01:48for F2. and so this verifies that there is indeed formant frequency information in your primary motor cortex.
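the calibration-and-decoding step can be sketched as a cross-validated linear regression from unit firing rates to intended formants. everything below is synthetic and assumed (unit count, tuning, noise level); it only illustrates the train-on-one-sequence, test-held-out procedure and the correlation metric described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the calibration run: a handful of motor-cortex
# units whose firing rates carry a noisy linear code for intended (F1, F2).
n_t, n_units = 600, 5
t = np.arange(n_t)
formants = np.column_stack([
    np.sign(np.sin(2 * np.pi * t / 150) + 1e-9),  # F1 alternating up/down
    np.sign(np.sin(2 * np.pi * t / 100) + 1e-9),  # F2 alternating up/down
])
tuning = rng.standard_normal((2, n_units))        # assumed linear tuning
rates = formants @ tuning + rng.standard_normal((n_t, n_units))

# Cross-validated linear decoder: fit on the first half of the attempted
# utterance, evaluate on the held-out second half.
half = n_t // 2
W, *_ = np.linalg.lstsq(rates[:half], formants[:half], rcond=None)
pred = rates[half:] @ W
r_f1 = np.corrcoef(pred[:, 0], formants[half:, 0])[0, 1]
r_f2 = np.corrcoef(pred[:, 1], formants[half:, 1])[0, 1]
```

with only a few noisy units the held-out correlations are far from perfect, mirroring the "not great but highly significant" fits reported for the real recordings.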
1:01:55and so the next step was simply to use this information to try to produce speech output
1:02:00just as a review for most of you: formant synthesis of speech has been around for a long time. Gunnar
1:02:07Fant, for example, in nineteen fifty-three used this very large piece of electronic equipment here
1:02:14with this stylus on a two-dimensional pad, and what he did was he would move the stylus around on the
1:02:20pad, and the location of the stylus was a location in the F1-F2 space,
1:02:26so he was basically moving around in the formant plane, and just by moving this cursor around in this two-dimensional
1:02:32space he was able to produce
1:02:33intelligible speech. so here's an example.
1:02:41so the good news here is that with just two dimensions some degree of speech output can be produced.
1:02:47consonants are very difficult, i'll get back to that at the end, but certainly vowels are possible with this sort
1:02:53of synthesis.
1:02:55so here's what we did with the system. here is a schematic: our electrode is in the
1:03:01speech motor cortex;
1:03:03its signals are picked up and amplified and then sent across the scalp.
1:03:08we record the signals and we then run them through a neural decoder, and what the neural decoder does is
1:03:15it predicts what formant frequencies are being attempted based on the activities. so it's trained up on one of these
1:03:21one-minute-long sequences,
1:03:23and once you train it up, then it can take a set of neural activities and translate that into
1:03:30a predicted first and second formant frequency, which we can then send through a speech synthesiser to the subject.
1:03:36the delay from the brain activity to the sound output was fifty milliseconds in our system, and this is approximately
1:03:42the same delay as from
1:03:43your motor cortical activity to your sound output. and this is crucial, because if the subject is going to be
1:03:49able to learn to use this synthesiser, you need to have a natural feedback delay. if you delay speech feedback
1:03:55by a hundred milliseconds in a normal speaker,
1:03:58they start to become highly disfluent; they go through some stuttering-like behaviour; it's very disruptive. so
1:04:08it's important that this thing operates very quickly
1:04:11and produces this feedback in a natural time frame.
Now what I'm going to show is the subject's performance with the speech BCI. We had him perform a vowel task: the subject would start out at a central vowel, and his task on each trial was to go to the vowel that we told him to go to. So in the video I'll play, you'll hear the computer say the target vowel, something like "E", and then it'll say "speak", and then he's supposed to say "E" with the synthesiser. You'll hear his sound output as produced by the synthesiser as he attempts to produce the vowel that was presented, and you'll see the target vowel in green here.
The cursor you'll see is the subject's location in the formant frequency space.
On most of the trials we did not provide visual feedback; the subject didn't need visual feedback, and we saw no increase in performance from visual feedback. He instead used the auditory feedback that we produced from the synthesiser to produce better and better speech, or vowel sounds at least. And so here are five examples, five consecutive productions in a block.
"E. Speak."
So that one darted very quickly to the target.
Here he goes off a little; there's the error, and he kind of steers it back into the target.
Another direct hit on the next trial. On this one he seems to just make it before the timeout.
And here's the last one.
Straight to the target. So what we saw were two sorts of behaviour. Oftentimes it was straight to the target, but other times he would go off a little bit, and then, once he hears the feedback going off, presumably in his head he's trying to change the shape of his tongue to try to actually say the sound. So he's trying to reshape where that sound is going, and you'll see him kind of steer toward the target in those cases. So what's
happening in these panels is that I'm showing the hit rate as a function of block. In any given session we would have four blocks of trials, with about five to ten productions per block, so during the course of a session he would produce about ten to twenty repetitions, well, actually about five to ten repetitions of each vowel. When he first starts, his hit rate is just below fifty percent; that's above chance, but it's not great. But we see that with practice it gets better with each block, and by the end he's improved his hit rate to over seventy percent
on average. In fact, in the later sessions he was able to get up to about a ninety percent hit rate. If we look at the endpoint error as a function of block, this is how far away he was from the target in formant space at the end of the trial: if the trial was a success it would be zero, and if it's not a success there's an error. We see that this pretty much linearly drops off over the course of a forty-five-minute session, and the movement time also improves a little bit.
This slide shows what happens over many sessions; these are twenty-five sessions, and this is the endpoint error we're looking at. One thing to note is that there's a lot of variability from day to day, and I'll be happy to talk about that. We had to train up a new decoder every day because we weren't sure we had the same neurons every day, so some days the decoder worked very well, like here, and other days it didn't work so well. What we saw on average over the sessions is that the subject got better and better at learning to use the synthesisers, meaning that even though he was given a brand-new synthesiser on the twenty-fifth session, it didn't take him nearly as long to get good at using that synthesiser.
Well, to summarise then for the speech brain-computer interface: there are several novel aspects of this interface. It was the first real-time speech brain-computer interface, so this is the first attempt to actually decode ongoing speech, as opposed to pulling out words or moving a cursor to choose words on a screen. It was the first real-time control using a wireless system, and wireless is very important for this: if you have a connector coming out of your head, which is the case for some patients who get this sort of surgery, that connector can have an infection build up around it, and this is a constant problem for people with this sort of system. Wireless systems are the way of the future. We were able to do a wireless system because we only had two channels of information; current systems usually have a hundred channels or more of information, and the wireless technology is still catching up, so these hundred-channel systems typically still have connectors coming out of the head.
And finally, our project was the first real-time control with an electrode that had been implanted for this long; the subject had been implanted for over three years, and this highlights the utility of this sort of electrode for permanent implantation. The speech that came out was extremely rudimentary, as you saw, but keep in mind that we had two tiny wires of information coming out of the brain, pulling out information from at most ten neurons out of the hundreds of millions of neurons involved in the system, and yet the subject was still able to learn to use the system and improve the speech over time. There are a number of things we're working on now to improve it. Most notably, we're working on improving the synthesis: we are developing two-dimensional synthesisers that can produce both vowels and consonants, and that sound much more natural than a straight formant synthesiser.
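For intuition about what a straight formant synthesiser does, here is a toy sketch, with assumed bandwidths and sample rate, nothing like production quality: a glottal impulse train passed through two second-order digital resonators tuned to F1 and F2.

```python
import math

def resonator_coeffs(f_hz, bw_hz, fs):
    """Coefficients of a second-order digital resonator (one formant)."""
    r = math.exp(-math.pi * bw_hz / fs)
    theta = 2.0 * math.pi * f_hz / fs
    a1, a2 = -2.0 * r * math.cos(theta), r * r
    b0 = 1.0 + a1 + a2  # normalise for unity gain at DC
    return b0, a1, a2

def synthesize_vowel(f1, f2, fs=8000, f0=120, dur=0.3):
    """Crude two-formant vowel: impulse train through resonators in series."""
    period = int(fs / f0)
    x = [1.0 if i % period == 0 else 0.0 for i in range(int(fs * dur))]
    for f, bw in ((f1, 80.0), (f2, 120.0)):  # assumed formant bandwidths (Hz)
        b0, a1, a2 = resonator_coeffs(f, bw, fs)
        y1 = y2 = 0.0
        out = []
        for s in x:
            v = b0 * s - a1 * y1 - a2 * y2
            y2, y1 = y1, v
            out.append(v)
        x = out
    return x

samples = synthesize_vowel(700.0, 1200.0)  # roughly an /a/-like vowel
```

Sweeping f1 and f2 continuously while the train runs is exactly the two-dimensional control problem the subject was solving with his cursor.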
A number of groups are working on smaller electronics and more electrodes. The state of the art now, as I mentioned, is probably ten times the information that we were able to get out of this brain-computer interface, so we would expect a dramatic improvement in performance with a modern system.
And we're spending a lot of time working on improved decoding techniques as well. The initial decoder that you give these subjects is very rough, it just gets them in the ballpark, and that's because there's not nearly enough information to tune up a decoder properly from one training sample. So what people are working on, including people in our lab, are decoders that actually tune while the subject is trying to use the prosthesis, so that not only is the subject's motor system adapting to use the prosthesis, but the prosthesis itself is helping that adaptation by cutting the error down on each production, very slowly over time, to help the system adapt.
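The idea of a decoder that co-adapts with its user can be sketched as a simple least-mean-squares update. This is my own illustration, not the lab's algorithm, and it assumes the intended target is known on each trial, as it is in a cued vowel task:

```python
import numpy as np

rng = np.random.default_rng(1)
n_units = 10

# Hypothetical "true" mapping from firing rates to intended (F1, F2).
W_true = rng.normal(size=(n_units, 2))

# The initial decoder is rough: the short training sample only gets
# the subject into the ballpark.
W = W_true + rng.normal(scale=0.5, size=(n_units, 2))
eta = 0.0005  # small step so the decoder adapts very slowly over time

errors = []
for trial in range(200):
    rates = rng.poisson(5.0, size=n_units).astype(float)
    intended = rates @ W_true   # what the subject is trying to say (cued target)
    decoded = rates @ W         # what the prosthesis actually produced
    err = intended - decoded
    errors.append(float(np.linalg.norm(err)))
    # Cut the error down a little on each production (LMS update).
    W += eta * np.outer(rates, err)
```

Because the step size is small, the decoder drifts toward the user slowly, which keeps the feedback stable while the user's own motor adaptation is also in play.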
And with that I'd like to again thank my collaborators, and also thank the NIDCD and NSF for the funds that supported this research.
Okay, so we have time for two questions.

Really interesting talk. There was a pretty strong emphasis on formants running through the model and the speech it synthesises, like when you had the playback of "good doggie".
That's great. So, is there other work that you're doing with stop consonants, or figuring out a way to put things like that in? Right, so I largely focused on formants for simplicity during the talk. The somatosensory feedback control system in the model actually does a lot of the work for stop consonants. For example, for a /b/ we have a target for the closure itself; so in addition to the formant representation we have tactile dimensions that supplement the targets.
Somatosensory feedback is, in our model, secondary to auditory feedback, largely because during development we get auditory targets in their entirety from the people around us, but we can't tell what's going on in their mouths. So early development, we believe, is largely driven by auditory dimensions; the somatosensory system learns what goes on when you properly produce the sound, and then it later contributes to the production once you build up this somatosensory target.
One other quick note: another simplification here is that formant frequencies, strictly speaking, are very different for women, children and men, and so when we are using different voices we use a normalized formant frequency space, where we actually use ratios of the formant frequencies, to help accommodate this.
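A toy example of why ratios help, using illustrative /i/ formant values roughly in the range of the classic Peterson and Barney averages (these numbers are for illustration, not the model's actual representation):

```python
# Approximate /i/ formants (Hz) for an adult male and a child; the absolute
# frequencies differ a lot because the child's vocal tract is much shorter.
adult_male = {"F1": 270.0, "F2": 2290.0, "F3": 3010.0}
child = {"F1": 370.0, "F2": 3200.0, "F3": 3730.0}

def formant_ratios(f):
    # Represent the vowel by ratios of adjacent formants, discarding much
    # of the overall vocal-tract-length scaling.
    return f["F2"] / f["F1"], f["F3"] / f["F2"]

# Raw F2 differs by roughly 40% between the two speakers, but the F2/F1
# ratio agrees within a few percent: the ratio space is far more
# speaker-independent than raw frequencies.
raw_gap = abs(adult_male["F2"] - child["F2"]) / adult_male["F2"]
ratio_gap = (abs(formant_ratios(adult_male)[0] - formant_ratios(child)[0])
             / formant_ratios(adult_male)[0])
```

This is the sense in which a child can imitate an adult's vowel without being able to imitate the adult's absolute formant frequencies.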
Next question.

I think I understand where you're coming from, but I want to push on it, because we're working with people whose data look very similar to yours. In articulatory phonology you have the gestural score, and from that you can get at the articulatory information; the gesture is still actually made in "perfect memory", for example.
Yeah, okay. So in my view the gestural score is more or less equivalent to the feedforward motor command, and that feedforward command is tuned up to hit auditory targets. So we do have an analogue of a gestural score in the form of a feedforward motor command, and if you produce speech very rapidly that whole feedforward motor command will get read out, but it won't necessarily make the right sounds if you push it to the limit. So for example in the "perfect memory" case, the model would still do the gesture for the /t/ if it's producing it very rapidly; the /t/ may not come out, but the model would presumably hear a slight error and try to correct for that a little bit in later productions. To make a long story short, my view is that the gestural score, which I think does exist, is something that is equivalent to a feedforward motor command, and the DIVA model shows how you tune up that gestural score, how you keep it tuned over time, things like that. Okay.
Thanks, a really interesting talk. It seems to me that auditory and somatosensory feedback don't really tell you whether the words got through or what those words mean to people, and there are also things like visual feedback and other kinds of feedback in speech.

That absolutely matters, but we do not have anything like that in the model. We purposely focused on motor control: speech is a motor control problem, and the words are meaningless to the model. That is of course a simplification, made for tractability, so that we could study a system we could actually characterise computationally. We are now working at a higher level, connecting this model, which is kind of a low-level motor control model if you will, with higher-level models of the sequencing of syllables, and we're starting to think about how these sequencing areas of the brain interact with areas that represent meaning. The middle frontal gyrus, for example, is very commonly associated with word meaning, and temporal lobe areas interface with the sequencing system, but we have not yet modelled that. In our view we're working our way up from the bottom, where the bottom is motor control and the top is language, and we're not that far up yet.
That was a really inspiring talk. I'm wondering, thinking about the beginning of your talk and the babbling and imitation phases: one thing that's pretty apparent is that you're effectively starting your model out with an adult vocal tract, and it's listening to external stimuli which are also matched to it. I've worked a lot on things like normalisation, so I'm curious what your take is on how things change when, you know, you have a six-month-old and their vocal tract grows and so on. How do you see that fitting into the model?
Well, I think that highlights the fact that formants, strictly speaking, are not the representation that's used for this transformation: when a child hears an adult sample, they're hearing a vocal-tract-normalized version of it that their own vocal tract can imitate, because the frequencies themselves they can't imitate. So we've looked at a number of representations that involve things like ratios of the formants and so forth, and those improve the imitation abilities and work well in some cases, but we haven't nailed down what that representation is. Where I think it is in the brain is the planum temporale and the higher-order auditory areas; that's probably where you're representing speech in this speaker-independent manner. But what exactly those dimensions are I can't say for sure. It's some normalized formant representation, but the ones we've tried, Miller's space for example, from his 1989 paper, are not fully satisfactory: they do a lot of the normalisation, but they don't work that well for controlling movements.
I mean, one of the things I was thinking about is that Keith Johnson, for example, really feels that this normalisation is actually a learned phenomenon. So it feels like you have some of the machinery there; you could posit that it's some operation of an adaptive system that actually learns that normalisation.

It's possible. There are examples like parrots being able to imitate speech and so forth, so I think there's something about the mammalian auditory system such that the dimensions it pulls out naturally are largely speaker-independent already. It pulls out all kinds of information, but for the speech system I think that's what it's using. I wish I could give you a more satisfactory answer.
We have time for one more question.

Is it just the first three formants you're using?

We use the first three, or the first two, depending: for the prosthesis project we just used the first two, and for the simulations I showed in the rest of the talk we used the first three.

Okay, because in recent work, for example, you can get information about which particular tongue shape was used if you look at the higher formants, F4 and F5, when those are available.
It would be great if you could include something like that. Do you have any ideas about it?

I was just going to say we can look at that: by controlling F1 through F3 we can see what F4 and F5 would be for different articulator configurations. We haven't looked at that yet, but my view is that they're perceptually not very important or even salient. Of course the physics will make the higher formants slightly different if your tongue shapes are different, but I think that what speakers perceive and control is largely limited to the lower formants.
Regarding some of your earlier work, I've heard the argument, and there's Brad Story's work and some more recent work on this, that the higher formants give you colouring: you can add speaker-specific information to the same sound and make it sound like a different person just by changing what those values are.

I see. So we currently fix those higher formants in our model at constant values for all sounds, and you can hear the sounds properly, but the voice quality may well change if we allowed them to vary.
Very good. Just to continue: with the model you would be able to determine what the acoustic features are in these various cases, because you get the right targets for the vowels but also the continuous trajectories in between. That would be great information for separating what's speaker-independent from speaker identification and characteristics, for speaker recognition systems, as well as for speech therapy and pronunciation tools. So that's just something to think about.

Right, we'll revisit that.
Okay, so we're going to close the session, because I don't want to take too much out of the break, but let's thank our speaker once again.