This is work that we did this summer at the Johns Hopkins workshop, and I want to point out a student we had from Stanford, who did most of the work; I'm just here to talk about it. We are going to talk about duration models, and the particular take we have here is to look specifically at discrimination with duration models. We will start with the driving motivation for this work, which is a look at what happens with durations in the context of discrimination. From that intuition we try to derive features that can help speech recognition, and for that we need a mathematical framework, the segmental conditional random field, that integrates those features easily. Then we will come back to duration features, say specifically which features we actually added to the segmental conditional random fields, and close with the results.

The generative story of HMMs with respect to duration is well known: the probability of staying in a particular state decays exponentially. If you look at what happens in reality, that is not the case; real durations do not look exponential, so we know these models are wrong. It is possible to fix this, but the solutions tend to be a little bit awkward and difficult to use, and that is why we introduce the segmental conditional random field. But first, what we need to establish is whether duration actually is a good indicator of whether a word is correctly or incorrectly recognized. What we did was look at the output produced by the decoder. Here we have the histogram of the word "two" against its duration: on the x axis you have the duration, and on the y axis you have the frequency with which the word was pronounced with that particular duration. The question now is whether that is a good indication of whether the word was correctly recognized. So we separated the correctly recognized instances from the rest, and did the same for the instances that were misrecognized. Interestingly, the instances that were misrecognized tend to be shorter. I will come back to why we think that is the case, but clearly those distributions are different, so they might be useful for us in the context of discrimination.

So how do we turn that intuition into something that can help speech recognition? That is the purpose of segmental conditional random fields. If you look at the graph, on top are the word hypotheses, and you go from word to word, from state to state, under a Markov assumption; that is basically an n-gram language model over the words. You see that the observations are grouped into blocks, and each block is associated with a word. So unlike HMMs, which score a separate frame-by-frame alignment where a word is just a concatenation of frames, here we are allowed to use multiple observations in a single block, a segment, to decide whether a word is the correct one or not. The way you do that is you take those observations and create a feature vector, and you get a score that is a weighted sum of these feature vectors; that is the log-linear part.
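As a sketch of what this log-linear scoring looks like, the standard segmental CRF form can be written as follows; the notation here is mine, not taken verbatim from the talk:

$$
P(\mathbf{w}\mid\mathbf{o}) \;=\; \frac{1}{Z(\mathbf{o})}\sum_{\mathbf{s}\,\sim\,\mathbf{w}} \exp\Big(\sum_{i}\sum_{k} \lambda_k\, f_k\big(w_{i-1},\, w_i,\, \mathbf{o}_{s_i}\big)\Big)
$$

where $\mathbf{s}$ ranges over segmentations consistent with the word sequence $\mathbf{w}$, $\mathbf{o}_{s_i}$ is the block of observations assigned to word $w_i$, the $f_k$ are feature functions (duration scores, detector matches, and so on), the $\lambda_k$ are their learned weights, and $Z(\mathbf{o})$ normalizes over competing word sequences.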
There are basically three things to know about this model. First of all, these are conditional models, which means they are actually discriminative. Secondly, they are log-linear models, which means you can use multiple features of different types, combine them, and make the determination of whether a word is correct or not. And most importantly, they are segmental models, which means you are allowed to group observations and use features that operate globally on that group of observations.

Here is an example, and for more information we have a poster this afternoon describing the multiple approaches that we integrated into the segmental conditional random field framework; there you can see what kinds of features we can attach to a word. Among the features we used: one of them is a word detector system built by Microsoft Research, which you can combine with the different hypotheses. At the bottom you see phoneme detections that are extracted from a neural network or from perceptual linear prediction features. And in the middle you see our features; the duration feature, for instance, is just a number that you associate with a word hypothesis. Note that sometimes we allow features to be missing; that is something the framework lets us do. So for the different word hypotheses you can look up the duration and assign a different duration score to each hypothesis, depending on whether that duration is plausible.

This is basically what we want to exploit, since we showed that short durations are suspicious. Here is a real example from our data. The true transcription was a fragment containing a place name, somewhere in India I think, which is a very, very rare word. What happens is that, through the back-off weights, the language model likes to insert, in place of the true hypothesis, several very short words. These are typically function words, which are very frequent, and because they don't fit, they tend to be compressed: in the recognized fragment "to my ...", a word like "my" has to be shortened to fit, because it covers only a section of the real word. So this is our goal: we need to penalize the misrecognized words, shown in red, while the words in blue are correct, and we want to do that with an additional duration score.

The way we are going to do that is to produce two features, two scores. If you remember, these are the histograms of durations, the frequency of each duration when the word "two" is recognized correctly or incorrectly: the blue one is the correct one and the red one is the incorrect one. If you have a hypothesis of "two" that is, say, twenty frames, we look up the probability of that duration in each histogram, and you see that the blue one is higher than the red one. Ultimately the model is going to learn that this difference should have a positive weight: it should help any hypothesis with a positive difference and penalize any hypothesis with a negative difference. So a hypothesis of around ten frames would get a large penalty.
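To make the mechanics concrete, here is a minimal sketch of how such histogram-based duration scores could be computed; this is not the workshop's code, and the function names, smoothing, and bin range are my own assumptions.

```python
import math
from collections import Counter

def duration_histograms(instances, max_frames=100, smoothing=1.0):
    """Estimate P(duration | correct) and P(duration | incorrect) for one
    word type from decoder output. `instances` is a list of
    (duration_in_frames, was_correct) pairs; add-one style smoothing keeps
    unseen durations from producing -inf log-probabilities."""
    correct = Counter(d for d, ok in instances if ok)
    incorrect = Counter(d for d, ok in instances if not ok)

    def normalize(counts):
        total = sum(counts.values()) + smoothing * max_frames
        return [(counts[d] + smoothing) / total for d in range(1, max_frames + 1)]

    return normalize(correct), normalize(incorrect)

def duration_scores(frames, p_correct, p_incorrect):
    """The two duration features for a word hypothesis: the log-probability
    of the observed duration under each histogram. A correct-looking
    duration gives log p_correct > log p_incorrect, and the learned SCRF
    weights turn that difference into a boost or a penalty."""
    d = min(max(frames, 1), len(p_correct))  # clamp to the histogram range
    return math.log(p_correct[d - 1]), math.log(p_incorrect[d - 1])
```

Under this reading, a ten-frame hypothesis of "two" would score higher under the incorrect-word histogram than under the correct one, and the model would penalize it, which is the behavior the talk describes.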
The first thing we do is restrict ourselves to the top hundred words. The reason is that to draw these histograms we need enough samples to estimate them reliably, and luckily, given the skewness of the task, the top hundred words account for about fifty percent of the probability mass and fifty percent of the error mass. So they are relatively few word types, a tiny fraction of the vocabulary, but they make up fifty percent of the word tokens.

The first refinement we looked at was long span versus short span. The intuition here is the phenomenon where the language model inserts lots of small words in place of a large word: it is trying to break up a large, infrequent word into lots of small, frequent words. We want to distinguish, for instance, a case like "called" recognized as "calling", which is just a one-for-one substitution, from the place-name case, where one word is broken into several; these are errors of a different type. So instead of producing two features, we split them according to these cases into six features: whenever there is no span mismatch we produce the two features for that case, and whenever a word spans a different number of words we produce the two features for that case, so that weights can be assigned differently to the different cases (there is a sketch of this case split below). We decided that a long span was a word that spans multiple words, and a short span was a word that is spanned by one word.

The second refinement was pre-pausal lengthening. As has been reported multiple times in the literature, before a pause a word tends to be pronounced more slowly, so it will have a longer duration, while in the middle of a sentence, or after a pause, it will tend to have a normal duration, so to speak. If you take the example sentence here, where the same word appears once before a pause and once mid-sentence, as in "President Clinton said ...", you can see that the second instance, the blue one, has a shorter duration. So we can separate these, and let words that appear at the end of a sentence, or before a pause, have a different duration model.

We integrated this into the framework. We had a state-of-the-art IBM baseline decoder on a broadcast news task; we combined it with the MSR system, which got us to fifteen point three, and then we added the duration features. You can see that this works: there is a further improvement when you add the duration features, and when you add them with the different variants shown. That said, as individual features these duration features do not turn out to be as good as the other individual features we tried at the workshop.

In conclusion, I hope I have given an idea of how durations can be used for word discrimination. The idea is that misrecognized words tend to be shorter, because they come from the language model forcing in what tend to be short function words. We were able to turn this intuition into quantitative features that we integrated into the segmental conditional random field framework to penalize spurious word hypotheses individually via our duration scores, and combining that with a state-of-the-art system gave a small improvement.
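As a concrete reading of the long-span/short-span split described above, here is a minimal sketch; the case definitions and the six feature slots are my interpretation, and the names (`Seg`, `span_case`, and so on) are hypothetical, not from the workshop system.

```python
from dataclasses import dataclass

@dataclass
class Seg:
    word: str
    start: int  # frame index
    end: int

def span_case(hyp, others):
    """Classify a hypothesized word against a competing segmentation:
    'long' if it spans multiple words on the other side, 'short' if it
    falls within a single longer word, 'same' for one-to-one alignment."""
    overlapping = [s for s in others if s.end > hyp.start and s.start < hyp.end]
    if len(overlapping) > 1:
        return "long"
    if len(overlapping) == 1 and \
       (overlapping[0].end - overlapping[0].start) > (hyp.end - hyp.start):
        return "short"
    return "same"

def split_duration_features(hyp, others, score_correct, score_incorrect):
    """Route the pair of histogram scores into one of three span-case
    slots, yielding six features in total, so the SCRF can learn a
    separate weight for each case."""
    case = span_case(hyp, others)
    feats = {}
    for c in ("same", "long", "short"):
        feats[f"dur_correct_{c}"] = score_correct if c == case else 0.0
        feats[f"dur_incorrect_{c}"] = score_incorrect if c == case else 0.0
    return feats
```

The point of the split is that a compressed function word inserted inside a rare long word ("short" case) should be penalized differently from an ordinary one-for-one substitution ("same" case).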
Okay, we have time for a few questions.

[Audience member] I have a question. I think that could affect the duration of the word ... [partly inaudible] ... not only that, but also the way it sounds.

[Presenter] Yes, that's interesting. One thing we looked at, which I think we reported in the paper and which was interesting, is that you can look at the duration of each phone within the word, and you can see that they actually do differ, exactly depending on the stress, on whether the stress is correct; there you see differences in the duration.

Any other questions?