My name is Jeff Berry, and this talk is about using ultrasound to visualise the tongue during speech.

First of all, we might ask why we would want to use ultrasound to look at the tongue at all. One of the main reasons is that there is evidence to suggest that a lot more than just audio processing goes on during speech perception. For example, there is the well-known McGurk effect, which shows that there is integration between the visual signal and the audio signal during human speech perception. Here is a short video demonstrating the McGurk effect: if you look at the lips during the video while you listen, you perceive different syllables. [plays video] If you were watching the video, you should have heard different sounds. Now close your eyes or look away so you don't see the lips, and I'll replay it; you'll notice the sounds are actually acoustically identical. [plays video] If you are not looking at the lips, the sound is always the same, always "ba"; but if you watch the lips while you hear it, the sound changes. This is a really strong effect, it doesn't matter which language you speak, and at least most people get it. It really suggests that our brain makes use of lots of different sources of information.

There is also a recent study from Italy showing that the motor system, specifically the parts of the brain that control the tongue and the lips, is used during speech perception, but their effect is only observed with noisy speech. What that means is that in a noisy situation, when you have trouble hearing the person you are listening to, your motor cortex may become involved in helping you parse the speech. This opens up lots of interesting questions to pursue, and we want to be able to use ultrasound to investigate what the tongue is doing during speech.

Here is a typical ultrasound image of the tongue. I have added the profile of the face to give you some landmarks for what we are looking at. It is a midsagittal view, and the tongue tip is always to the right of the image. How far back you can see depends on the subject and the probe you are using, but you can usually get most of the tongue body; sometimes the tongue tip is not visible, but you can definitely see the tongue body. This bright band represents the surface of the tongue. Here is a typical segment of speech and what it looks like in ultrasound. [plays video] I'll play that again. [plays video] You can see the tongue moving around to make the different sounds.

As for applications: a new ultrasound machine was recently released that is handheld yet fully operational, so in the future there is the possibility of portable ultrasound machines integrated with other sensors to do speech recognition. There is also a lot of interest in silent speech interfaces, which means measuring some sort of articulatory motion without vocalisation and resynthesising speech from it. This could be useful in environments where there is too much noise, or where you have to be silent but still need to communicate.
There is also the possibility of adding a model of tongue motion to a speech recognition system, and we can use these images to construct such a model.

In this work we address two questions: first, classifying tongue shapes into phonemes, and second, using that classification to segment the ultrasound video into phoneme segments.

The tool we chose is a variation of a deep belief network that we are calling the translational deep belief network. A deep belief network is composed of stacked restricted Boltzmann machines. Restricted Boltzmann machines are probabilistic generative models; each one consists of a single visible layer and a single hidden layer, so it is a neural network, and in a deep belief network the hidden layer of the first restricted Boltzmann machine becomes the visible layer of the second, and you just stack them up. These deep belief networks are typically trained in two stages. First there is an unsupervised pre-training stage in which the network decorrelates the data: it learns a representation on its own, without any human-labelled knowledge. Then, after pre-training, a second, discriminative stage optimises the network to output a label. So the human label is not actually added to the training until the second stage.

With the translational deep belief network we wanted to add the label during pre-training as well. The basic idea of what we are proposing is this: we take a regular deep belief network, but we train it with the sensory inputs and the labels concatenated to form the visible layer. We then copy the weights of the upper hidden layers over to a new network and retrain the bottom layer to accept only the sensor inputs, without a label. Finally we do backpropagation to attach the labels at the last stage.

To say a bit more about that: adding the human-labelled data to the pre-training stage helps the network build features that contain information about both the labels and the sensor input. So the first step is to train a deep belief network on both the labels and the sensors. Then we replace the bottom layer of the network with the translational layer, which accepts the images only, and we do discriminative backpropagation to fine-tune the network. That allows us to extract a label from an unlabelled image. Again, we are just copying the weights from the original network over and substituting the bottom layer with the new one that we train.

To train this we use a slight variation on the contrastive divergence rule. This is the equation for it. What it says is that you take the input and the representation in the hidden layer, and you sample from both of those to get a reconstruction. This is where the fact that it is a generative model comes into play: you use the hidden layer to regenerate the visible layer, then use the regenerated visible layer to reconstruct another hidden layer, and you minimise the difference between the two. Here it is diagrammatically. We sample from this layer to get our hidden units; normally you would simply sample from those to reconstruct the same layer again, but the key difference here is that we sample down to a new visible layer that contains only the sensor input, and then sample from that to get the reconstructed hidden units. That is what allows us to apply the contrastive divergence rule.
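To make that step a little more concrete, here is a rough sketch of what one such translational update might look like, following the verbal description above. The names (translational_cd_step, W_full, W_sensor), the use of the original full-visible weights for the upward pass, and the omission of bias terms are assumptions made for illustration, not the exact implementation; the update itself is the familiar contrastive-divergence form, roughly ΔW ∝ ⟨v h⟩_data − ⟨v h⟩_reconstruction, applied to the new sensor-only bottom layer.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample(p, rng):
        # Draw binary units from their activation probabilities.
        return (rng.random(p.shape) < p).astype(float)

    def translational_cd_step(v_full, v_sensor, W_full, W_sensor, lr, rng):
        # v_full   : image pixels concatenated with the (replicated) label units
        # v_sensor : the same example, image pixels only
        # W_full   : weights of the original bottom RBM trained on images + labels (kept fixed here)
        # W_sensor : weights of the new bottom layer that accepts images only (being learned)
        # Bias terms are omitted to keep the sketch short.

        # Positive phase: hidden units inferred from the full (image + label) visible layer.
        h_prob = sigmoid(v_full @ W_full)
        h = sample(h_prob, rng)

        # "Translation": instead of reconstructing the full visible layer, generate a
        # visible layer that contains only the sensor input.
        v_recon = sample(sigmoid(h @ W_sensor.T), rng)

        # Negative phase: reconstructed hidden units from that sensor-only reconstruction.
        h_recon_prob = sigmoid(v_recon @ W_sensor)

        # Contrastive-divergence update on the sensor-only weights.
        W_sensor += lr * (np.outer(v_sensor, h_prob) - np.outer(v_recon, h_recon_prob))
        return W_sensor

In a full training loop a step like this would be run over all (image, label) pairs for several epochs before the discriminative backpropagation stage.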
For our experiment we used just five phoneme categories. We had a database of 1,893 images representing prototypical shapes for those categories: we went through by hand and labelled what we judged to be the prototypical shapes for P, T, K, R and L. We also chose images to represent a "none" or garbage category; those were images at least five frames away in the video from a peak image.

To feed this to the network we used a labelling scheme with a one-versus-all representation, repeated a number of times so that there is a similar number of visible-layer input nodes for the images and for the labels. That way the network cannot drive down the reconstruction error from the images alone; it has to take the label into account as well. So if we had category two, a T, the label would look like this: the second unit would be a one, and that string would be repeated several times. We then cropped the images, scaled the relevant part down to a smaller number of pixels, and did five-fold cross-validation.

We found dramatic differences in accuracy between a regular deep belief network and our translational deep belief network. As you can see, on average the standard deep belief network got about 42%, and when we add the label information during pre-training we get much higher accuracy, in the eighties.

We compared this to some other methods. There has been work on constructing what are called eigentongues, which are similar to the eigenfaces of Turk and Pentland; it is just a PCA analysis. You take a set of images and find its principal components, and then you can represent every image in terms of the coefficients of the first N components. We used that for dimensionality reduction, ending up representing each image with a hundred coefficients, and used those to train a support vector machine; that got 53% accuracy. We also tried segmenting out just part of the image by tracing the tongue surface: instead of using the whole ultrasound image, we use only what a human judges to be the relevant part, the traced tongue surface. In previous work we showed how this kind of deep belief network can also be used to extract these traces automatically, and I will come back to that. We sampled these curves at fifty points, used them as features to train a support vector machine, and got 70% accuracy.
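For reference, the eigentongue baseline is easy to sketch with off-the-shelf tools. This is only an illustration assuming scikit-learn, with placeholder arrays standing in for the flattened ultrasound frames and the phoneme labels, not our actual data or pipeline.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    images = rng.random((200, 32 * 32))      # placeholder for flattened, downscaled frames
    labels = rng.integers(0, 6, size=200)    # placeholder for P/T/K/R/L/garbage categories

    # Project each frame onto the first 100 principal components ("eigentongues"),
    # then classify the coefficient vectors with a support vector machine,
    # scored with five-fold cross-validation as in the experiments above.
    baseline = make_pipeline(PCA(n_components=100), SVC())
    print(cross_val_score(baseline, images, labels, cv=5).mean())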
The nice thing about translational deep belief networks is that they can also be used to do this sort of automatic image segmentation. The way that works is to use the generative properties of the model to construct an autoencoder. We first train a network up, just as before, but then we use the top hidden layer to reconstruct what was on the input. It is an autoencoder in the sense that whatever you put in, you can reconstruct on the output: if we give it an image and a label, we can reconstruct that image and that label. We use this property to train an autoencoder, and then, again, we create a new network by copying this piece over and use the translational RBM to retrain the bottom layer of the network to accept only the image. That allows us to put in an image and get a label out; in this case the label is the tongue trace. So this is what our input looks like: we read in this image, feed it to the network, and it produces the trace.

The other thing we looked at was segmenting a speech sequence. In this case we used a regular sequence that included images which were not part of the original training set: we trained only on prototypical shapes, while the sequence also contains lots of transitional shapes. When we did that, the deep belief network gave us activations that can handle multiple categories at once. It looks like this: over the sequence we actually see the dynamics of the tongue motion, and we can use that to segment the speech stream. For example, on this frame we have activation for the P shape and the T shape at the same time, so the tongue is transitioning between those shapes. Putting it all together, we have the original ultrasound, the automatically extracted contour, and then the label that the network outputs, and it looks like that.

That's all I have; thank you very much.

[Question] Was your training set from one person or from several people? There are presumably large differences in tongue geometry between speakers as well.

[Answer] That's right, there are large differences between people. For the tongue surface extraction we trained with nine speakers, and the network was able to generalise to new speakers as long as their tongue was roughly the same size as one of those nine training subjects. For the classification we trained on two speakers, and it did well for those people. How well it generalises to other speakers is still an ongoing area we need to look at, but you're right about that. Thanks.

[Question] And is that mainly due to the shape of the tongue, or also to the manner of speaking?

[Answer] I think it is mostly to do with the shape of the tongue, and especially, during data collection, the depth of the scan and things like that make a large difference.

[Question] So maybe it could be normalised?

[Answer] Yes, that's right.

[Session chair] OK, if there are no further questions, let's move on to the next talk. Thank you.