0:00:13 Um, my name is Jeff Berry, and I'm going to talk about using ultrasound to visualise the tongue during speech.
0:00:21 So first of all, we might want to ask: why would we want to use ultrasound to look at the tongue anyway?
0:00:30 One of the main reasons is that there is evidence to suggest there's a lot more than just audio processing going on during speech perception.
0:00:43 For example, there is the well-known McGurk effect, which shows that there's integration between the visual signal and the audio signal, at least during human speech perception.
0:01:00 I have here a short video demonstrating the McGurk effect: if you look at the lips during the video and listen, you perceive different syllables. So let's watch it.
0:01:19 [video plays]
0:01:26 Okay, if you were watching the video, you should have noticed that you heard different sounds, right? But now, if you close your eyes or look away from the video so you don't look at the lips, I'll replay the video and you'll hear that they're actually all the same sound acoustically.
0:01:46 [video replays]
0:01:55 If you're not looking at the lips, you notice that the sound is always the same, it's always "ba". However, if you look at the lips while you hear the sound, then it changes.
0:02:05 This is a really strong effect, and it doesn't matter which language you speak, you still get it, at least most people do. So this really suggests that our brain makes use of lots of different kinds of information.
0:02:23 There's a recent study, from a lab in Italy, showing that there's also use of the motor system, specifically the part of the brain that controls the tongue and the lips, during speech perception.
0:02:38 But their effect is only observed during noisy speech. What that means is that in a noisy situation, if you have trouble hearing the person you're listening to, your motor cortex may become involved in helping you to parse the speech you're hearing.
0:03:00 This raises lots of interesting questions to pursue, and we want to be able to use ultrasound to investigate what the tongue is doing during speech.
0:03:17 This is a typical ultrasound image of the tongue; I've added the profile of the face to give you some landmarks for what we're looking at. It's a midsagittal view, and the tongue tip is always to the right of the image.
0:03:33 How far back you can see depends on the subject and on the probe you're using, but you can usually get most of the tongue body. Sometimes the tongue tip is not visible, but you can definitely see the tongue body, and this bright band represents the surface of the tongue.
0:03:53 So here's a typical segment of speech and what it looks like in ultrasound.
0:04:01 [video plays] Okay, I'll play that again. [video plays]
0:04:08 Okay, so you can see the tongue moving around to make the different sounds.
0:04:17 As for applications: a new handheld ultrasound machine has recently been released, a fully operational ultrasound machine, so in the future there's the possibility of portable ultrasound machines integrated with other sensors to do speech recognition.
0:04:41 There's also a lot of interest in silent speech interfaces, which means measuring some sort of articulatory motion without vocalisation and resynthesizing speech from it. That could be useful in environments where there's too much noise, or where you have to be silent but still need to communicate.
0:05:07 And there's the possibility of adding a model of tongue motion to a speech recognition system, so we can use these images to construct such a model.
0:05:23 Okay, so in this work we're addressing these questions: first we wanted to classify tongue shapes into phonemes, and then use that to try to segment the ultrasound video into phoneme segments.
0:05:44 The tool we chose for this is a variation of a deep belief network that we're calling the translational deep belief network.
0:05:54 A deep belief network is composed of stacked restricted Boltzmann machines. Restricted Boltzmann machines are probabilistic generative models; each one consists of a single visible layer and a single hidden layer, so it's a neural network. In a deep belief network, the hidden layer of the first restricted Boltzmann machine becomes the visible layer of the second one, and you just stack them up.
0:06:33 These deep belief networks are typically trained in two stages. First you do an unsupervised pre-training stage, in which the network decorrelates the data: it learns a representation on its own, without any human-labelled knowledge.
0:07:00 Then, after that pre-training has been done, a second, discriminative stage optimises the network to output a label. So the human label is not actually added to the training until the second stage.
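A minimal sketch of that standard two-stage recipe, assuming scikit-learn's BernoulliRBM as the stacked building block and a logistic-regression classifier standing in for the discriminative stage (this is the generic recipe, not the network from the talk):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

def train_dbn(X, y, layer_sizes=(500, 200)):
    """X: frames scaled to [0, 1], shape (n_samples, n_pixels); y: phoneme labels."""
    rbms, H = [], X
    # Stage 1: greedy unsupervised pre-training.  The hidden layer of each RBM
    # becomes the visible layer of the next one, and no labels are used here.
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=20)
        H = rbm.fit_transform(H)
        rbms.append(rbm)
    # Stage 2: discriminative training; the human labels enter only at this point.
    # (A full DBN would backpropagate through every layer; this sketch just fits
    # a classifier on top of the pre-trained features.)
    clf = LogisticRegression(max_iter=1000).fit(H, y)
    return rbms, clf

def dbn_predict(rbms, clf, X):
    H = X
    for rbm in rbms:
        H = rbm.transform(H)
    return clf.predict(H)
```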
0:07:19 With the translational deep belief network, we wanted to add the label during the pre-training as well. So this is the basic idea of what we're proposing.
0:07:37 We take a regular deep net, but we train it with the sensor inputs and the labels concatenated to form the visible layer. So we train a regular deep belief network, then we copy the weights of the upper hidden layers over to a new network and retrain the bottom layer to accept only the sensor inputs, without a label.
0:08:10 Then, finally, we do backpropagation to add the labels on at the last stage.
0:08:23 A bit more about that. The first thing you want to do is add the human-labelled data into the pre-training stage, and that helps the network build features that contain information about both the labels and the sensor input.
0:08:46 So the first step is to train a deep belief network on both the labels and the sensor data. Then we replace the bottom layer of the network with the translational layer, so that it accepts the images only, and then we do discriminative backprop to fine-tune the network. That allows us to extract a label from an unlabeled image.
0:09:17 So again, to explain this: we're just copying the weights from the original network over, and then substituting the bottom layer with the new one that we train.
0:09:30 To train this, we're using a slight variation on the contrastive divergence rule. This is the equation for it, and what it says is: you sample the hidden-layer representation from the input, and then you sample from those to get a reconstruction. This is where the fact that it's a generative model comes into play: you use the hidden layer to regenerate the visible layer, then use the regenerated visible layer to reconstruct another hidden layer, and then you minimise the difference between those.
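The slide equation itself is not in the transcript; the standard CD-1 weight update that this rule is described as a variation of has the form

\[
\Delta w_{ij} \;\propto\; \langle v_i h_j \rangle_{\text{data}} \;-\; \langle \tilde v_i \tilde h_j \rangle_{\text{reconstruction}},
\]

where \(h\) is sampled from the data visible layer \(v\), \(\tilde v\) is the visible layer regenerated from \(h\), and \(\tilde h\) is re-inferred from \(\tilde v\). In the translational variant described next, the reconstruction \(\tilde v\) is taken over an image-only visible layer rather than the full image-plus-label layer.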
0:10:15 Here it is diagrammatically. We sample from this layer to get our hidden units, and normally you would just sample from those to reconstruct the same layer again. The key difference here is that we sample down to a new visible layer that contains only the sensor input, and then sample from that to get reconstructed hidden units, and that lets us apply the contrastive divergence rule.
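A small numpy sketch of one such training step, as I read the description above: the hidden units are inferred from the full image-plus-label visible layer through the pre-trained weights, the reconstruction goes down through the new weights onto an image-only visible layer, and the usual contrastive-divergence difference then updates only the new weights. The variable names, the biases, and the exact update are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
sample = lambda p: (rng.random(p.shape) < p).astype(float)   # Bernoulli sampling

def translational_cd_step(x, y_code, W_old, b_h_old, W_new, b_v_new, b_h_new, lr=0.01):
    """x: (n, n_pixels) image batch in [0, 1]; y_code: (n, n_code) label code.
    W_old: (n_pixels + n_code, n_hidden) pre-trained weights over [image, label].
    W_new: (n_pixels, n_hidden) weights of the new image-only bottom layer."""
    v_full = np.hstack([x, y_code])
    # Up-pass through the pre-trained RBM: hidden units given image AND label.
    h_prob = sigmoid(v_full @ W_old + b_h_old)
    h = sample(h_prob)
    # Down-pass through the NEW weights onto an image-only visible layer.
    x_recon = sigmoid(h @ W_new.T + b_v_new)
    # Up-pass again through the new weights: reconstructed hidden units.
    h_recon = sigmoid(x_recon @ W_new + b_h_new)
    # Contrastive-divergence-style update of the new weights only.
    n = len(x)
    W_new += lr * (x.T @ h_prob - x_recon.T @ h_recon) / n
    b_v_new += lr * (x - x_recon).mean(axis=0)
    b_h_new += lr * (h_prob - h_recon).mean(axis=0)
    return W_new, b_v_new, b_h_new
```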
0:10:53 For our experiment we used just five phoneme categories. We had a database of one thousand eight hundred and ninety-three images that represented prototypical shapes for those categories: we went through by hand and labelled where we thought the prototypical shape was for P, T, K, R and L. We also chose images to represent a "none", or garbage, category; those were images that were at least five frames away in the video from a peak image.
0:11:35 To feed this to the network, we use a labelling scheme where we have a one-versus-all representation and we repeat it a number of times, so that there's a similar number of nodes on the visible layer for the images and for the labels. That way the network can't just minimise the error on the image part and ignore the label; it actually has to take the label into account.
0:12:05 So if we had category two, a T, the label would look like this: the second element is a one, and then we repeat this string a number of times.
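A small sketch of that labelling scheme; the category list, the unit counts, and the amount of repetition here are illustrative assumptions:

```python
import numpy as np

CATEGORIES = ["P", "T", "K", "R", "L", "none"]   # five phonemes plus the garbage class

def encode_label(category, n_image_units):
    """One-vs-all code for the category, tiled so the label part of the visible
    layer has roughly as many units as the image part and cannot be ignored."""
    one_hot = np.zeros(len(CATEGORIES))
    one_hot[CATEGORIES.index(category)] = 1.0
    n_repeats = max(1, n_image_units // len(CATEGORIES))
    return np.tile(one_hot, n_repeats)

# e.g. for a "T" frame the code begins 0 1 0 0 0 0, 0 1 0 0 0 0, ...
t_code = encode_label("T", n_image_units=900)
```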
0:12:18 We then took the images, scaled the relevant part down to a smaller number of pixels, and did five-fold cross-validation.
0:12:30 We saw a dramatic difference in accuracy between a regular deep belief network and our translational deep belief network: on average the standard deep belief network got about forty-two percent, and when we add the label information during pre-training we get a lot higher, into the eighties.
0:12:56 We compared this to some other methods. There's been work on constructing what they call eigentongues, which is similar to the eigenfaces of Turk and Pentland; it's just a PCA analysis: you take a set of images, find the principal components of those images, and then represent every image in terms of the coefficients of the first N components.
0:13:34 We used that to do dimensionality reduction, representing each image with a hundred coefficients, then used those to train a support vector machine, and that got fifty-three percent accuracy.
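A rough sketch of that eigentongue baseline: PCA on the raw frames, then an SVM on the leading coefficients. The talk only says about a hundred coefficients were kept; the kernel and other settings here are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def eigentongue_svm_accuracy(X, y):
    """X: flattened ultrasound frames, shape (n_samples, n_pixels); y: phoneme labels."""
    model = make_pipeline(PCA(n_components=100), SVC(kernel="rbf"))
    # Five-fold cross-validation, matching the evaluation described above.
    return cross_val_score(model, X, y, cv=5).mean()
```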
0:13:51 We also tried segmenting out just part of the image by tracing the tongue surface. We have these ultrasound images, and instead of using the whole image we use only what a human sees as the relevant part, which is just a trace of the tongue surface. In previous work we showed how you can use this deep belief network to extract these traces automatically as well, and I'll talk about that.
0:14:31 So we used these features instead: we sampled the curves at fifty points, used that to train a support vector machine, and got seventy percent accuracy.
0:14:43 A nice thing about the translational deep belief nets is that they can also be used to do this sort of automatic image segmentation. The way that works is to use, again, the generative properties of the model to construct an auto-encoder.
0:15:09 That means we first train a network up, just like before, but then we can use the top hidden layer to reconstruct what was on the input. So it's an auto-encoder in the sense that whatever you put in, you can reconstruct on the output.
0:15:30 That means if we give it an image and a label, we can reconstruct that image and that label. So we use this property to train an auto-encoder like this, and then again we create a new network by copying all of this piece over to it, and use the translational RBM to retrain the bottom layer of the network to accept only the image.
0:16:01 This allows us to put in an image and get a label out, and in this case the label is the tongue trace. So here's what the input looks like: we read in this image, put it into the network, and it produces the trace.
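Just to make the input and output of the contour extractor concrete, here is a much simpler stand-in, a plain feed-forward regressor rather than the translational auto-encoder the talk describes, mapping a flattened ultrasound frame to fifty (x, y) points along the tongue surface. Everything in it is an illustrative assumption.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_contour_extractor(X, traces):
    """X: (n_frames, n_pixels) flattened images; traces: (n_frames, 50, 2) hand traces."""
    Y = traces.reshape(len(traces), -1)            # 50 (x, y) points -> 100 targets
    model = MLPRegressor(hidden_layer_sizes=(500, 200), max_iter=500)
    model.fit(X, Y)
    return model

def extract_contour(model, frame):
    """Return the predicted tongue trace for one flattened frame."""
    return model.predict(frame.reshape(1, -1)).reshape(50, 2)
```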
0:16:19 Okay. The other thing we looked at was segmenting a speech sequence. In this case we used a regular sequence which included images that were not part of the original training set: we only trained on prototypical shapes, and the sequence contains lots of transitional shapes as well.
0:16:45 When we did that, the deep belief network gives us an activation that can handle multiple categories, and it looks like this. Over the sequence we can see the dynamics of the tongue motion and use it to segment the speech stream. For example, on this frame we have activation for the P shape and the T shape at the same time, so it's transitioning between those shapes.
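A small sketch of how per-frame category activations like these could be turned into a segmentation: frames where one category dominates get that phoneme label, and frames where two categories are active at once are marked as transitions. The threshold is an assumption.

```python
import numpy as np

def segment_frames(activations, categories, threshold=0.5):
    """activations: (n_frames, n_categories) network outputs in [0, 1]."""
    labels = []
    for frame in activations:
        active = [c for c, a in zip(categories, frame) if a > threshold]
        if len(active) == 1:
            labels.append(active[0])                 # a stable phoneme frame
        elif len(active) > 1:
            labels.append("->".join(active))         # e.g. "P->T": a transition frame
        else:
            labels.append("none")                    # no prototypical shape active
    return labels
```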
0:17:20 Putting it all together, we have the original ultrasound, the automatically extracted contour, and then here is the label that it outputs, and it looks like that. Okay, so that's all I have.
0:17:48 Thank you very much. Are there any questions?
0:17:53 [Audience question] Was your training set from one person or from several people? There are probably large differences in the geometry of the tongue between people.
0:18:17 That's right, there are large differences between people. For the tongue surface extraction we trained with nine speakers, and it was able to generalise to new speakers as long as their tongue was around about the same size as one of those nine subjects it was trained on.
0:18:42 For the classification we trained on two speakers, and it did well for those people. So that's still an ongoing area, where we need to see how well it generalises to other speakers, but you're right about that.
0:18:59 Thank you.
0:19:00 [Audience question] And are those differences mainly due to the shape of the tongue, or also to the way of speaking?
0:19:10 I think it's mostly to do with the shape of the tongue, and especially during data collection: the depth of the scan and things like that make a large difference.
0:19:20 [Audience] Uh-huh, so maybe that could be normalised?
0:19:23 Yeah, that's right.
0:19:25 Okay. Any other questions? No? Then let's move on to the next talk.