0:00:14This is work that we did this summer at the Johns Hopkins workshop.
0:00:21I just want to acknowledge
0:00:22a student that we had
0:00:24from Stanford, who did most of the work;
0:00:26I am just here to give the talk.
0:00:31We are going to talk about duration models,
0:00:34and the particular take that we have here
0:00:36is to look specifically at discrimination
0:00:41with duration models.
0:00:43So we are going to start by
0:00:45looking at the driving motivation
0:00:48for this work, which is
0:00:50a look at
0:00:51what happens with duration
0:00:52in the context of discrimination,
0:00:55and from that intuition try to derive
0:01:00features that we can use
0:01:02to help
0:01:04speech recognition. For that we need a
0:01:07mathematical framework, the segmental conditional random field,
0:01:11to integrate those features.
0:01:15Then we are going to come back to duration features
0:01:19and see
0:01:20specifically what
0:01:21features we have actually added to
0:01:24segmental conditional random fields,
0:01:27and then
0:01:28go over the results.
0:01:34So, HMMs.
0:01:35The generative
0:01:36story of HMMs is well known:
0:01:41the distribution of the
0:01:43probability of staying in a particular state
0:01:46is exponential.
0:01:47If you look at what happens in reality, that is not the case; it does not look like
0:01:52an exponential.
0:01:53So we know these models are wrong.
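To make the exponential claim concrete, here is a minimal sketch of the duration distribution that a single HMM state implies through its self-loop. The self-loop probability of 0.8 is an illustrative assumption, not a value from the talk.

```python
def hmm_duration_pmf(self_loop_prob: float, max_frames: int) -> list[float]:
    """P(duration = d) for d = 1..max_frames: stay d-1 times, then leave."""
    leave_prob = 1.0 - self_loop_prob
    return [self_loop_prob ** (d - 1) * leave_prob for d in range(1, max_frames + 1)]

# Assumed, illustrative self-loop probability.
pmf = hmm_duration_pmf(self_loop_prob=0.8, max_frames=50)

# The probability can only decrease as the duration grows, so the
# shortest duration is always the most likely one.
is_monotone_decreasing = all(a > b for a, b in zip(pmf, pmf[1:]))
```

Under this model a one-frame stay is always the most probable outcome, whatever the self-loop probability, which is the mismatch with observed durations that the talk points to.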
0:01:56It is possible to fix
0:01:59the HMM, essentially,
0:02:00but the solutions tend to be a little bit awkward and difficult
0:02:05to use.
0:02:07For that, we introduce the segmental conditional random field.
0:02:12But first, what we need to do is check whether
0:02:16duration actually is a good
0:02:19indicator of whether a word is
0:02:21correctly recognized
0:02:23or incorrectly recognized.
0:02:27So what we did was to
0:02:29look at the lattices
0:02:31produced by the decoder.
0:02:33Here we have the histogram
0:02:35of a word
0:02:39against its duration: on the X axis you have the duration,
0:02:43and on the Y axis you have
0:02:45the frequency with which
0:02:47that word is pronounced with
0:02:50that particular duration.
0:02:53The question is
0:02:54now whether
0:02:54that is a good indication of whether the word is correctly recognized or not.
0:02:58So
0:02:59we took the correctly recognized instances from the lattices,
0:03:02and then
0:03:03did the same for
0:03:05the instances that were misrecognized.
0:03:11The ones that are
0:03:13misrecognized tend to be shorter,
0:03:15and I will come back
0:03:17to why we think that is the case.
0:03:19But clearly
0:03:21those distributions are different, so they might be
0:03:24useful for us
0:03:25to use in the
0:03:27context of discrimination.
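The two histograms described here could be built along the following lines. This is only a sketch: the input format (word, duration in frames, correct flag) and the toy data are assumptions made for illustration, not the workshop's actual pipeline.

```python
from collections import Counter, defaultdict

def duration_histograms(instances):
    """instances: iterable of (word, duration_in_frames, is_correct)."""
    counts = defaultdict(lambda: {True: Counter(), False: Counter()})
    for word, frames, is_correct in instances:
        counts[word][is_correct][frames] += 1
    return counts

def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {frames: n / total for frames, n in counter.items()}

# Made-up toy instances for a single word.
toy = [
    ("to", 20, True), ("to", 22, True), ("to", 20, True),
    ("to", 8, False), ("to", 10, False), ("to", 20, False),
]
hists = duration_histograms(toy)
correct = normalize(hists["to"][True])
incorrect = normalize(hists["to"][False])
```

Comparing `correct` and `incorrect` for the same word is the check described above: if the two distributions differ, duration carries information about recognition errors.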
0:03:30So how do we
0:03:32turn that
0:03:34intuition into
0:03:36something that can help
0:03:39the speech recognition engine?
0:03:41That is the purpose of segmental conditional random fields.
0:03:44So this picture here is
0:03:46the graph.
0:03:48You see on top the word
0:03:52hypotheses,
0:03:53and you go from word to word, from state
0:03:56to state, under
0:03:57the Markov assumption; that is basically an
0:04:00n-gram language model.
0:04:02Below the words, you see that the observations are
0:04:06grouped into small blocks,
0:04:07and each block
0:04:08is associated with a word.
0:04:11So unlike HMMs, which
0:04:15operate frame by frame and where the words are just a concatenation of frames,
0:04:21here
0:04:23you use multiple observations in a single block, or segment,
0:04:28to make the determination of whether
0:04:32a word is the correct one or not.
0:04:36The way you do that is you take
0:04:37those observations and you create a feature vector,
0:04:41and you compute a score that is a weighted sum of these features,
0:04:48and that gives the log-probability, up to normalization.
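The weighted-sum scoring can be sketched as follows. The feature names, the weights, and the normalization over an explicit list of candidates are assumptions for illustration; a real segmental CRF normalizes over all segmentations and word sequences.

```python
import math

def segment_score(weights, features):
    """Weighted sum of the features of one (word, block) pairing.

    Features that are missing from `features` simply contribute nothing.
    """
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def posterior(weights, hypotheses):
    """hypotheses: one list of per-segment feature dicts per candidate.

    Returns probabilities normalized over the given candidates only.
    """
    totals = [sum(segment_score(weights, seg) for seg in hyp) for hyp in hypotheses]
    shift = max(totals)  # subtract the max for numerical stability
    exps = [math.exp(t - shift) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical weights and feature values.
weights = {"lm": 1.0, "duration": 0.5}
hyp_a = [{"lm": -2.0, "duration": 1.0}]                 # one long word
hyp_b = [{"lm": -1.0, "duration": -2.0}, {"lm": -1.5}]  # two short words
probs = posterior(weights, [hyp_a, hyp_b])
```

The second segment of `hyp_b` has no duration feature at all, which the weighted sum handles without any special casing.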
0:04:51So basically there are three things to know about these models. First of all,
0:04:55they are conditional models,
0:04:58which means that they are
0:05:00actually discriminative.
0:05:02Secondly,
0:05:05they are log-linear models, which means that you can use multiple features of different types,
0:05:11interpolate them,
0:05:13and make the determination of whether the word is correct or not.
0:05:16And third of all,
0:05:17most importantly, they are segmental models, which means
0:05:21that you allow yourself to group observations
0:05:24and build features
0:05:26that
0:05:27operate globally on
0:05:30this group of observations.
0:05:33Here is an example. For more information,
0:05:37we have a poster this afternoon
0:05:39describing
0:05:40the multiple approaches that we integrated in the
0:05:42segmental conditional random field framework.
0:05:45Here you see what kinds of features we can attach to a word.
0:05:51These are all features that we developed.
0:05:53One of them comes from another system, the MSR detections; it is a system
0:05:59developed by
0:06:00Microsoft Research,
0:06:03which you can combine with the
0:06:05different hypotheses.
0:06:07At the bottom you see
0:06:09phoneme detections
0:06:11that are extracted from
0:06:13a neural network or
0:06:15a multilayer perceptron.
0:06:17In the middle you see
0:06:20our features. The duration feature, for instance,
0:06:23is just a number
0:06:25that you associate with
0:06:27a word hypothesis.
0:06:30You can see that sometimes we allow
0:06:32features to be missing;
0:06:35that is something that the framework allows us to do.
0:06:38So if you have different hypotheses, "we" instead of "he",
0:06:42you look at
0:06:44the duration,
0:06:45and you can assign a different duration score
0:06:48to each
0:06:50word hypothesis
0:06:52depending on
0:06:55whether the duration is plausible or not.
0:07:00So this is basically what we want to do:
0:07:03to integrate
0:07:05this, since we showed that short durations are
0:07:09improbable.
0:07:11Here is a real example.
0:07:14The true transcription (it is a fragment) was "in a place called to michael query",
0:07:20which is a place, I think, somewhere in India,
0:07:23and it is a very, very rare word.
0:07:26So what happens is that, through the backoff weights,
0:07:29the language model likes to,
0:07:32instead of the true hypothesis,
0:07:34insert very short words.
0:07:37They are typically function words that are very frequent,
0:07:40and because they do not fit they tend to be shorter.
0:07:43So "to", "my":
0:07:46typically "my" is shortened;
0:07:49it has to be compressed to fit,
0:07:54because it is a section of the real hypothesis.
0:07:58So this is our goal: we need to penalize
0:08:01the words that are in red,
0:08:03and the words in blue are correct, so we want to give them an
0:08:09additional boost.
0:08:12So the way we are going to do that is to produce two features, two scores.
0:08:17If you remember, these are the histograms of
0:08:21the correct and incorrect instances:
0:08:25the histograms of
0:08:28duration against frequency when the word is recognized correctly or incorrectly,
0:08:32for the word "to".
0:08:33The blue one is the correct one and the red one is the incorrect one.
0:08:38So if you have a word hypothesis of "to"
0:08:40that is
0:08:43twenty frames long,
0:08:44we are going to look up the probability
0:08:48in each histogram,
0:08:50and you see that the blue one is higher than
0:08:53the red one.
0:08:54Ultimately the model is going to learn that
0:08:57this difference
0:09:00should have a positive weight:
0:09:02it should help
0:09:03any hypothesis that has a positive difference
0:09:06and penalize any hypothesis that has a negative difference.
0:09:09So anything that is ten frames long would get
0:09:12a large
0:09:13negative score,
0:09:15a penalty.
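The histogram lookup can be sketched as below. The use of log probabilities, the floor for unseen durations, and the histogram values are assumptions made for illustration; the talk only states that two scores are looked up and that their difference is informative.

```python
import math

def duration_scores(frames, correct_hist, incorrect_hist, floor=1e-3):
    """Look up one duration in both histograms and return two log scores.

    `floor` is an assumed smoothing constant for durations never observed.
    """
    log_correct = math.log(max(correct_hist.get(frames, 0.0), floor))
    log_incorrect = math.log(max(incorrect_hist.get(frames, 0.0), floor))
    return log_correct, log_incorrect

# Made-up relative frequencies for one word, indexed by duration in frames.
correct_hist = {10: 0.05, 20: 0.30}
incorrect_hist = {10: 0.40, 20: 0.10}

def score_difference(frames):
    log_correct, log_incorrect = duration_scores(frames, correct_hist, incorrect_hist)
    return log_correct - log_incorrect

diff_20 = score_difference(20)  # positive: looks like a correct instance
diff_10 = score_difference(10)  # negative: looks like a misrecognition
```

With a positive weight on this difference, a twenty-frame hypothesis is helped and a ten-frame hypothesis is penalized, matching the behavior described for the two curves.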
0:09:20So another thing we are going to do is
0:09:23only look at the top hundred words.
0:09:26The reason is that to draw these histograms we need enough
0:09:29samples to be able to estimate the histograms
0:09:32reliably.
0:09:34Luckily, given the skewness of the task,
0:09:37the top hundred words actually account for
0:09:40fifty percent of the probability mass and fifty percent of the error mass.
0:09:44So they are relatively
0:09:47important words,
0:09:48and we can restrict ourselves to
0:09:51a small percentage of the word types
0:09:53that accounts for fifty percent of the word tokens.
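The coverage argument can be checked on any corpus with a few lines. The toy token list below is made up and far too small to reproduce the fifty percent figure; it only shows the computation.

```python
from collections import Counter

def top_n_coverage(tokens, n):
    """Return (fraction of tokens covered, fraction of types used) for the top n words."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens), n / len(counts)

# Made-up toy corpus: 11 tokens, 6 word types.
tokens = "the a the to the of a to the rare word".split()
token_coverage, type_fraction = top_n_coverage(tokens, n=2)
```

Because word frequencies are heavily skewed, a small `type_fraction` can correspond to a large `token_coverage`, which is what makes per-word histograms feasible for the most frequent words.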
0:10:00Another feature that we looked at is
0:10:03long and short spans.
0:10:05The intuition here is that
0:10:07you have this phenomenon where
0:10:10the language model inserts
0:10:12lots of small words for a large word;
0:10:14it is just trying to break up a large, infrequent word
0:10:17into lots of small words.
0:10:21We want to distinguish a case like, for instance, "called" versus "calling" from
0:10:25a case like "to mark korean T"
0:10:27versus the other
0:10:28word.
0:10:30The first one, "called" versus "calling", is just a substitution,
0:10:33and the other one is of a different type.
0:10:36So instead of producing
0:10:38two features, we are going to split each of these cases in two.
0:10:44Whenever
0:10:45there is no special span, we
0:10:47produce
0:10:49features for that case,
0:10:51and whenever a word is of a different span type,
0:10:54we produce two other
0:10:56features, so that different weights can be assigned
0:10:59for these different cases.
0:11:02We decided that a long span was a word that spans multiple words, and a short span was a word
0:11:08that was spanned by
0:11:10a single word.
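One possible reading of the long-span and short-span definitions is sketched below. It assumes that spans are measured against the time intervals of some other word sequence (for example a baseline one-best), which the talk does not specify; the function name, the interval format, and the feature naming are all hypothetical.

```python
def span_type(hyp, reference):
    """Classify a hypothesized word by how it overlaps another word sequence.

    hyp: (start_frame, end_frame) of the hypothesized word.
    reference: list of (start_frame, end_frame) for the other sequence
    (an assumption; the talk does not say what the spans are measured against).
    """
    start, end = hyp
    contained = [r for r in reference if start <= r[0] and r[1] <= end]
    if len(contained) >= 2:
        return "long"   # the word covers several other words
    for r_start, r_end in reference:
        if r_start <= start and end <= r_end and (r_start, r_end) != (start, end):
            return "short"  # the word sits inside one longer word
    return "none"

def feature_name(base, hyp, reference):
    """Give each span type its own feature so it can get its own weight."""
    return f"{base}_{span_type(hyp, reference)}"

# Made-up intervals, in frames.
reference = [(0, 10), (10, 18), (18, 60)]
name_long = feature_name("duration_correct", (0, 18), reference)
name_short = feature_name("duration_correct", (20, 30), reference)
name_plain = feature_name("duration_correct", (0, 10), reference)
```

Keeping the span type in the feature name means the model can learn, for instance, a harsher penalty for short words that sit inside one long word than for a plain substitution.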
0:11:14The second
0:11:16idea was this: it has been reported multiple times in the literature that
0:11:22before a pause
0:11:24a word tends to be pronounced
0:11:27slower, so it will have a longer duration,
0:11:31and in the middle of a sentence or after a pause it will tend to be of normal duration,
0:11:36so to speak.
0:11:37So if you take
0:11:38the example sentences here,
0:11:41one ending in "president" and
0:11:43one containing "President Clinton said",
0:11:46you see that the second instance, the blue instance,
0:11:49has a shorter duration.
0:11:53So we can separate these and
0:11:56allow
0:11:58words that appear at the
0:12:00end of a sentence or before a pause
0:12:03to have a different duration model.
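Selecting a separate duration model for pre-pausal words could look like the sketch below. The pause token `<sil>`, the context labels, and the fallback to the normal model are assumptions for illustration.

```python
def context_of(index, words, pause_token="<sil>"):
    """Return 'pre_pausal' if the word ends the sentence or precedes a pause."""
    is_last = index == len(words) - 1
    before_pause = not is_last and words[index + 1] == pause_token
    return "pre_pausal" if is_last or before_pause else "normal"

def select_histogram(models, word, context):
    """Pick the context-specific histogram, falling back to the normal one."""
    return models.get((word, context), models.get((word, "normal")))

# Made-up example: the same word in sentence-medial and sentence-final position.
words = ["president", "clinton", "said", "<sil>", "the", "president"]
medial = context_of(0, words)
final = context_of(5, words)
before_silence = context_of(2, words)

# Hypothetical duration histograms (frames -> relative frequency).
models = {
    ("president", "normal"): {40: 0.6, 55: 0.4},
    ("president", "pre_pausal"): {55: 0.5, 70: 0.5},
}
chosen = select_histogram(models, "president", final)
```

The rest of the duration scoring stays unchanged; only the histogram that is consulted depends on the position of the word.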
0:12:08So we integrated this
0:12:11within the framework,
0:12:14in the model. We had a state-of-the-art IBM baseline system,
0:12:20and this is a broadcast news task.
0:12:24We
0:12:25combined it with the MSR system and we got to fifteen point three.
0:12:30Then we added the duration features,
0:12:34and you can see that this is more or less working
0:12:38when you add duration features, and when you add them
0:12:42with the
0:12:44different variants that I showed.
0:12:48and uh these features where
0:12:51cindy read your feature don't turn out to be as good as is in the other
0:12:55individual feature we try the workshop
0:13:01So in conclusion,
0:13:04I hope I have given you an intuition about
0:13:08how durations can be used
0:13:10for word discrimination,
0:13:14and the idea that
0:13:16words that are misrecognized tend to be shorter because they come
0:13:20from being forced in
0:13:21by the language model,
0:13:23and tend to be short function words.
0:13:27We were able to
0:13:29take this intuition and
0:13:31quantify it into features that
0:13:34we were able to
0:13:36integrate in the segmental conditional random field
0:13:39framework
0:13:40to penalize
0:13:42spurious word hypotheses
0:13:44through our duration scores.
0:13:47We combined that
0:13:48with a
0:13:50state-of-the-art system and still got
0:13:53a small improvement.
0:14:03yeah have a few
0:14:05i have a question
0:14:08I think that could affect the duration of the word.
0:14:11my keys met
0:14:13yeah yeah i not only keep on but also a
0:14:17yeah the way sounds
0:14:19i think you yeah
0:14:21and i i have a duration
0:14:24Yes, that is interesting.
0:14:27I have not yet, but one thing we looked at, which I think we
0:14:31report in the paper and which was I think interesting, is that
0:14:33you can look at the
0:14:35duration of each phone
0:14:38within the word,
0:14:39and you can see that they actually differ.
0:14:42And you see, yes, exactly: depending on the stress, whether the stress is correct or not, you
0:14:47see differences in the duration.
0:14:53yeah one a question