0:00:13 Um, my name is Jeff Berry, and I'm going to talk about using ultrasound to visualise the tongue during speech.
0:00:21 So first of all, we might want to ask: why would we want to use ultrasound to look at the tongue anyway?
0:00:30 One of the main reasons is that there is evidence to suggest that there's a lot more than just audio processing going on during speech perception. For example, there is the well-known McGurk effect, which shows that there's integration between the visual signal and the audio signal, at least during human speech perception.
0:00:59 Okay, so I have here a short video demonstrating the McGurk effect. If you look at the lips during the video and listen, then you perceive different syllables. So let's watch that.
0:01:26 Okay, so if you were watching the video, you should have heard different syllables. But now, if you close your eyes or look away from the video so you don't look at the lips, I'll replay the video, and you'll see that they're actually the same sound acoustically.
0:01:53 Okay, so if you're not looking at the lips, you notice that the sound is always the same: it's always "ba". However, if you look at the lips while you hear it, the sound changes. So this is a really strong effect, and it doesn't matter which language you speak, you still get this effect, at least most people do. This really suggests that our brain makes use of lots of different signals.
0:02:22 There's a recent study, from a lab in Italy, showing that there's also use of the motor system, the part of the brain that controls the tongue and the lips, during speech perception. But their effect is only observed during noisy speech. What that means is that in a noisy environment, if you have trouble hearing the person you're listening to, your motor cortex may become involved in helping you to parse the speech that you're listening to.
0:02:59 Um, so this raises lots of interesting questions to pursue, and we want to be able to use ultrasound to investigate what the tongue is doing during speech.
0:03:17 Here's a typical ultrasound image of the tongue. I've added the profile of the face to give you landmarks for what we're looking at. It's a midsagittal view, and the tongue tip is always to the right of the image. How far back you can see depends on the subject and the probe you're using; you can usually get most of the tongue body. Sometimes the tongue tip is not visible, but you can definitely see the tongue body. This bright band represents the surface of the tongue.
0:03:53 So here's a typical segment of speech and what it looks like in ultrasound.
0:04:05 Okay, I'll play that again.
0:04:08 Okay, so you can see the tongue moving around to make the different sounds.
0:04:17 So, applications for this: recently a new ultrasound machine has been released that's handheld, so this is a portable ultrasound machine. There are possibilities in the future of having portable ultrasound machines integrated with other sensors to do speech recognition. There's also a lot of interest in silent speech interfaces, which means being able to measure some sort of articulatory motion without vocalisation and to resynthesize speech from that. This could be useful in environments where there's either too much noise, or where you have to be silent but still need to communicate. And there's also the possibility of adding a model of tongue motion to a speech recognition system; we can use these images to construct such a model.
0:05:23 Okay, so in this work we're addressing these questions: first of all, we wanted to classify tongue shapes into phonemes, and also to use this to try to segment the ultrasound video into phoneme segments.
0:05:44 The tool we chose to do this is a variation of a deep belief network that we're calling the translational deep belief network. A deep belief network is composed of stacked restricted Boltzmann machines. Restricted Boltzmann machines are probabilistic generative models; a restricted Boltzmann machine consists of a single visible layer and a single hidden layer, so it's basically a neural network. In a deep belief network, the hidden layer of the first restricted Boltzmann machine becomes the visible layer of the second one, and you just stack them up.
0:06:28 These deep belief networks are typically trained in two stages. The first thing that happens is an unsupervised pre-training stage, where the network decorrelates the data: it learns a representation on its own, without any human-labeled knowledge. After that pre-training is done, a second, discriminative stage optimises the network to a label. So the human label is not actually added to the training until the second stage.
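As a minimal numpy sketch of the unsupervised stage described here, greedy layer-wise pre-training of stacked RBMs with one-step contrastive divergence (CD-1) might look like the following. Sizes, learning rate, and epoch count are illustrative assumptions, and the discriminative fine-tuning stage is omitted; this is a sketch, not the talk's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """A single restricted Boltzmann machine: one visible, one hidden layer."""
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        # Contrastive divergence with one Gibbs step (CD-1):
        # data phase correlations minus reconstruction phase correlations.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)        # reconstruction
        h1 = self.hidden_probs(v1)
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

def pretrain_dbn(data, layer_sizes, epochs=5):
    """Greedy layer-wise pre-training: each RBM's hidden activations
    become the visible data of the next RBM in the stack."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_step(x)
        rbms.append(rbm)
        x = rbm.hidden_probs(x)   # feed upward to the next layer
    return rbms
```

In a full system the stacked weights would then be fine-tuned discriminatively with backpropagation, which is the second stage the talk describes.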
0:07:19 With the translational deep belief network, we wanted to add the label during pre-training as well. So this is the basic idea of what we're doing: we have a regular deep belief network, but we train it with the sensory inputs and the labels concatenated to form the visible layer. We train a regular deep belief network that way, and then we copy the weights of the upper hidden layers over to a new network, and we retrain the bottom layer to accept only the sensor inputs, without the label. Then, finally, we do backpropagation to add the labels back on at the last stage.
0:08:19 I'll say a little more about that.
0:08:23 Um, so the first thing you want to do is add the human-labeled data into the pre-training stage, and that helps the network build features that contain information about both the labels and the sensor input. So the first step is to train a deep belief network on both the labels and the images. Then we replace the bottom layer of the network with the translational layer, so that it accepts the images only, and then we do discriminative backprop to fine-tune the network. That allows us to extract a label from an unlabeled image. So again, to explain this: we're just copying the weights from the original network over, and then substituting the bottom layer with the new network that we train.
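The copy-and-substitute step can be sketched as plain weight-matrix bookkeeping. All names and sizes here are my own for illustration; how the replacement bottom matrix is trained is the translational CD step discussed next.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, v):
    # Deterministic up-pass through a stack of layer weight matrices.
    for W in weights:
        v = sigmoid(v @ W)
    return v

def build_translational_net(joint_weights, retrained_bottom):
    """Network surgery as described in the talk: keep the upper layers of
    the DBN trained on the concatenated [image ; label] visible layer,
    and swap in a bottom matrix retrained to accept the image alone."""
    return [retrained_bottom] + [W.copy() for W in joint_weights[1:]]
```

After the surgery, an image-only input can be pushed all the way up through the copied layers, which is what makes discriminative fine-tuning on unlabeled-input images possible.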
0:09:30 To train this, we're using a slight variation on the contrastive divergence rule. This is the equation for it, and what it says is: you sample the representation in the hidden layer from the visible layer, and then you sample from both of those to get a reconstruction. This is where the fact that it's a generative model comes into play: you use the hidden layer to regenerate the visible layer, and then use the regenerated visible layer to reconstruct another hidden layer, and then you minimize the difference between those.
0:10:15 So here it is diagrammatically. We're sampling from this layer to get our hidden units, and normally you would just sample from those to reconstruct this visible layer again. But the key difference here is that we're sampling into a new visible layer that contains only the sensor input, and then sampling from that to get reconstructed hidden units. That then allows us to apply the contrastive divergence rule.
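My reading of that modified update, as a hedged numpy sketch (biases are omitted and all names are mine): hidden units are driven by the already-trained joint [image ; label] RBM, the reconstruction goes into a new image-only visible layer through the replacement weights, and the ordinary CD update is then applied to those replacement weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def translational_cd1(W_joint, W_new, images, labels, lr=0.1):
    """One sketch of the talk's modified CD-1 step.

    W_joint: weights of the already-trained bottom RBM, whose visible
             layer is [image ; label] concatenated (held fixed here).
    W_new:   weights of the replacement bottom layer, whose visible
             layer is the image alone.  Updated in place.
    """
    v_joint = np.hstack([images, labels])
    # Hidden units driven by the full (image + label) visible layer.
    h0 = sigmoid(v_joint @ W_joint)
    h0_s = (rng.random(h0.shape) < h0).astype(float)
    # Key difference: reconstruct into a NEW visible layer holding only
    # the sensor input, through the replacement weights.
    v1 = sigmoid(h0_s @ W_new.T)
    # Reconstructed hidden units from the image-only visible layer.
    h1 = sigmoid(v1 @ W_new)
    # Standard CD update applied to the replacement weights.
    W_new += lr * (images.T @ h0 - v1.T @ h1) / len(images)
    return W_new
```

Repeating this step trains the image-only bottom layer to produce hidden codes that match those of the joint network, so the copied upper layers still apply.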
0:10:51 Okay, so for our experiment we used just five phoneme categories. We had a database of 1,893 images that represented prototypical shapes for those categories. We went through by hand and labeled where we thought the prototypical shape was for /p/, /t/, /k/, /r/ and /l/, and we also chose images to represent a "none" category, or garbage category. Those were images that were at least five frames away in the video from a peak image.
0:11:33 Then, to feed this to the network, we used a labeling scheme with a one-versus-all representation, and we repeat that a number of times so that there's a similar number of input nodes on the visible layer for images and for labels. That way the network doesn't just minimise the error for the images and ignore the label; it actually has to take the label into account. So if we had category two, a /t/, the label would look like this: the second bit would be a one, and then we would repeat this string a number of times.
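As a concrete sketch of that labeling scheme (the category order and repeat count are my assumptions; the talk only says the label block is repeated until it roughly matches the image's share of the visible layer):

```python
import numpy as np

CATEGORIES = ["p", "t", "k", "r", "l", "none"]   # order assumed for illustration

def encode_label(category, n_repeats=50):
    """One-versus-all bit vector for a category, tiled n_repeats times so
    labels occupy about as many visible units as the image pixels do."""
    one_hot = np.zeros(len(CATEGORIES))
    one_hot[CATEGORIES.index(category)] = 1.0
    return np.tile(one_hot, n_repeats)
```

For /t/ (the second category) the vector starts 0 1 0 0 0 0, and that pattern repeats for the rest of the label block.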
0:12:18 Then we took the images, cropped out the relevant part, scaled it down to a smaller number of pixels, and did five-fold cross-validation.
0:12:30 We found dramatic differences in accuracy between a regular deep belief network and our translational deep belief network. As you can see, on average the standard deep belief network got about forty-two percent, and when we add the label information during pre-training we get a lot higher, up in the eighties.
0:12:56 Um, so we compared this to some other methods. There's been work on constructing what they call EigenTongues; that's similar to the Eigenfaces of Turk and Pentland. The idea is just PCA analysis: you take a set of images and find the principal components of those images, and then you can represent all your images in terms of their coefficients on the first few components. We used that to do dimensionality reduction, representing each image with a hundred coefficients, and then used those to train a support vector machine, and that got fifty-three percent.
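The EigenTongue baseline can be sketched with plain-numpy PCA. The hundred-coefficient count follows the talk; the SVM stage is left as a comment, and all names are mine.

```python
import numpy as np

def eigentongue_features(train_imgs, n_components=100):
    """PCA over vectorised ultrasound frames (the 'EigenTongue' idea):
    each image is re-expressed as its coefficients on the leading
    principal components.  Returns a projection function whose output
    would then be fed to a support vector machine."""
    mean = train_imgs.mean(axis=0)
    centered = train_imgs - mean
    # SVD of the centered data; rows of Vt are the principal components.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    def project(imgs):
        # One coefficient vector of length n_components per image.
        return (imgs - mean) @ components.T
    return project
```

Note that `n_components` cannot exceed the smaller of the number of training images and the number of pixels per image.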
0:13:51 We also tried just segmenting out part of the image, by tracing the tongue surface. We have these ultrasound images, and instead of using the whole image, we use only what a human sees as the relevant part, which is just a trace of the tongue surface. In previous work we showed how you can use this deep belief network to extract these traces automatically as well, and I'll talk about that. So we used these features instead: we sampled these curves at fifty points, used that to train a support vector machine, and got seventy percent accuracy.
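Turning a traced contour into the fifty-point feature vector mentioned here could look like the following arc-length resampling (my sketch; the talk does not specify the sampling method):

```python
import numpy as np

def resample_contour(xs, ys, n_points=50):
    """Resample a hand- or automatically-traced tongue contour at
    n_points equally spaced positions along its arc length, giving a
    fixed-length feature vector suitable for an SVM."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    seg = np.hypot(np.diff(xs), np.diff(ys))       # per-segment lengths
    arc = np.concatenate([[0.0], np.cumsum(seg)])  # cumulative arc length
    t = np.linspace(0.0, arc[-1], n_points)        # equal arc-length steps
    return np.column_stack([np.interp(t, arc, xs),
                            np.interp(t, arc, ys)])
```

Fixing the point count this way makes contours of different lengths comparable as feature vectors.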
0:14:43 So the nice thing about translational deep belief nets is that they can also be used to do this sort of automatic image segmentation. The way that works is by using, again, the generative properties of the model to construct an autoencoder. What that means is: we first train a network up, just like before, but then we can use the top hidden layer to reconstruct what was on the input. This is an autoencoder in the sense that whatever you put in, you can reconstruct on the output. That means when we give it an image and a label, we can reconstruct that image and that label.
0:15:37 So we use this property to train an autoencoder like this, and then, again, we create a new network by copying all of this piece over, and then, using the translational RBM, we retrain the bottom layer of the network to accept only the image. This allows us to put in an image and get a label out; in this case the label is the tongue trace. So again, this is what our input looks like: we read in this image, put it into the network, and it produces the trace.
0:16:19 Um, the other thing we looked at was segmenting a speech sequence. In this case we used a regular sequence, which included images that were not part of the original training set: we trained only on the prototypical shapes, and the sequence contains lots of transitional stages as well. When we did that, the deep belief network gives us an activation that can handle multiple categories. That looks like this: for the sequence, we sort of see the dynamics of the tongue motion, and we can use this to segment the speech stream. For example, on this frame we have activation for both the /p/ shape and the /t/ shape at the same time, so the tongue is transitioning between those shapes.
0:17:20 So, putting it all together, we have the original ultrasound, the automatically-extracted contour, and then here is the label that it shows. And it looks like that.
0:17:38 So, that's all I have.
0:17:48 Thank you very much.
0:17:49 So, questions?
0:18:05 [Audience question, partly inaudible: Was your training set from one person, or from several people? There are probably large individual differences in tongue anatomy.]
0:18:17 That's right, there are large differences between people. Actually, for the tongue surface extraction we trained with nine speakers, and it was able to generalise to new speakers as long as their tongue shape was roughly the same size as one of the nine subjects it was trained on. For the classification we trained on two speakers, and it did well for those people. So that's still an ongoing area, where we need to see how well it generalises to other speakers, but you're right about that.
0:19:00 [Audience question, partly inaudible: Is that mainly due to the shape of the tongue, or also to the manner of speaking?]
0:19:10 I think it's mostly to do with the shape of the tongue, and especially during data collection; the depth of the scan and things like that make a large difference.
0:19:20 [Audience, partly inaudible: So maybe the data could be normalised?]
0:19:23 Yeah, that's something to look at.
0:19:27 [Session chair, partly inaudible: Any other questions? If not, let's move on. Thanks.]