Speech Transcript - DISCRIMINATIVE DURATION MODELING FOR SPEECH RECOGNITION WITH SEGMENTAL CONDITIONAL RANDOM FIELDS

0:00:14	This is work that we did this summer at the Johns Hopkins workshop,
0:00:21	and I would just like to thank
0:00:22	a student that we had
0:00:24	from Stanford, who did most of the work:
0:00:26	Justine Kao.
0:00:31	So we are going to talk about duration models,
0:00:34	and the particular take that we have here
0:00:36	is to look at discrimination specifically
0:00:41	with duration models.
0:00:43	We are going to start by
0:00:45	looking at the driving motivation
0:00:48	for this work, which is
0:00:50	a look at
0:00:51	what happens with duration
0:00:52	in the context of discrimination,
0:00:55	and from that intuition we try to derive
0:01:00	features that we can use
0:01:02	to help
0:01:04	speech recognition. For that we need a
0:01:07	mathematical framework, the segmental conditional random field,
0:01:11	to integrate those features
0:01:14	easily.
0:01:15	Then we are going to come back to duration features
0:01:19	and see
0:01:20	specifically what
0:01:21	features we have actually added to
0:01:24	the segmental conditional random fields,
0:01:27	and then
0:01:28	go on to the results.
0:01:32	So,
0:01:34	looking at HMMs,
0:01:35	the generative
0:01:36	story of HMMs is well known: it is an
0:01:39	exponential
0:01:41	distribution; the
0:01:43	probability of staying in a particular state
0:01:46	is exponential.
0:01:47	If you look at what happens in reality, that is not the case; it does not look like
0:01:52	an exponential.
0:01:53	So we know these models are wrong.
0:01:56	It is possible to fix
0:01:59	the HMM, essentially,
0:02:00	but the solutions tend to be a little bit awkward and difficult
0:02:05	to use.
0:02:07	For that we introduce the segmental conditional random field,
0:02:12	but first what we need to do is check whether
0:02:16	duration is actually a good
0:02:19	indicator for whether a word is
0:02:21	correctly recognized
0:02:23	or incorrectly recognized.
0:02:27	So what we did was to
0:02:29	look at the lattices
0:02:31	produced by the decoder,
0:02:33	and here we have the histogram
0:02:35	of the word
0:02:36	"two"
0:02:39	against its duration. On the X axis you have the duration,
0:02:43	and on the Y axis you have
0:02:45	the frequency with which
0:02:47	the word is pronounced with
0:02:50	that particular duration.
0:02:53	The question is
0:02:54	now whether
0:02:54	that is a good indication of whether the word is correctly recognized or not.
0:02:58	And so
0:02:59	we took each of the correctly recognized ones from the lattices,
0:03:02	and then
0:03:03	did the same for
0:03:05	the instances that were misrecognized.
0:03:10	Interestingly,
0:03:11	the ones that are
0:03:13	misrecognized tend to be shorter,
0:03:15	and I will come back
0:03:17	to why we think that is the case.
0:03:19	But clearly
0:03:21	those distributions are different, so they might be
0:03:24	useful for us
0:03:25	to use in the
0:03:27	context of discrimination.
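
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (tuples of word, duration in frames, and a correctness flag), the function names, and the toy data are all invented here and are not taken from the talk.

```python
from collections import Counter, defaultdict


def duration_histograms(hypotheses):
    """Build per-word duration histograms, split by correctness.

    `hypotheses` is an iterable of (word, duration_in_frames, is_correct)
    tuples; this input format is an assumption made for the sketch.
    Returns {word: {True: Counter, False: Counter}}.
    """
    hists = defaultdict(lambda: {True: Counter(), False: Counter()})
    for word, frames, is_correct in hypotheses:
        hists[word][bool(is_correct)][frames] += 1
    return hists


def mean_duration(counter):
    """Average duration (in frames) of the instances in one histogram."""
    total = sum(counter.values())
    return sum(d * n for d, n in counter.items()) / total


# Toy data, invented for illustration only.
toy = [("two", 22, True), ("two", 18, True), ("two", 20, True),
       ("two", 8, False), ("two", 10, False), ("two", 12, False)]
hists = duration_histograms(toy)
print(mean_duration(hists["two"][True]))   # 20.0
print(mean_duration(hists["two"][False]))  # 10.0
```

With real decoder output, the two histograms for a word would be compared in the same way as the blue and red curves on the slide.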
0:03:30	So how do we
0:03:32	turn that
0:03:34	intuition into
0:03:36	something that can help
0:03:39	the speech recognition engine?
0:03:41	That is the purpose of segmental conditional random fields.
0:03:44	So that is the picture here,
0:03:46	the graph.
0:03:48	You see on top are word
0:03:52	hypotheses,
0:03:53	and you go from word to word, from state
0:03:56	to state;
0:03:57	the Markov assumption there is basically an
0:04:00	n-gram language model.
0:04:02	Below the words you see that observations are
0:04:06	grouped into small blocks,
0:04:07	and each block
0:04:08	is associated with a word.
0:04:11	So unlike HMMs,
0:04:15	which score frame by frame and where the words are just a concatenation of frames,
0:04:21	here we
0:04:22	allow
0:04:23	the use of multiple observations in a single block, or segment,
0:04:28	to make the determination of
0:04:31	whether
0:04:32	a word is the correct one or not.
0:04:36	The way you do that is you take
0:04:37	those observations and you create a feature vector,
0:04:41	and your score is a weighted sum of these features,
0:04:48	and that is the log-linear part.
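
The weighted-sum scoring just described can be sketched as follows. This is a simplified illustration, not the workshop implementation: a real segmental CRF normalizes over all segmentations in a lattice, whereas this sketch normalizes over an explicit list of competing hypotheses. The feature names (`lm`, `duration`), the weights, and the values are hypothetical.

```python
import math


def segment_score(weights, features):
    """Weighted sum of one segment's features.

    Features absent from `features` simply contribute nothing, which is
    one way to model features that are allowed to be missing.
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def hypothesis_score(weights, segments):
    """Unnormalized log-score of a hypothesis: the sum over its segments."""
    return sum(segment_score(weights, feats) for feats in segments)


def posteriors(weights, hypotheses):
    """Normalize exp(score) over a list of competing hypotheses."""
    scores = [hypothesis_score(weights, h) for h in hypotheses]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical weights and feature values, for illustration only.
weights = {"lm": 1.0, "duration": 2.0}
hyp_a = [{"lm": -1.0, "duration": 0.5}]                   # one long word
hyp_b = [{"lm": -0.5}, {"lm": -0.5, "duration": -0.5}]    # two short words
probs = posteriors(weights, [hyp_a, hyp_b])
```

Because the score is linear in the features, adding a new knowledge source only requires adding a feature name and learning one more weight.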
0:04:51	So basically there are three things to know about this model. First of all,
0:04:55	they are conditional models,
0:04:58	which means that they are
0:05:00	actually discriminative.
0:05:02	Secondly,
0:05:05	they are log-linear models, which means that you can use multiple features of different types,
0:05:11	interpolate them,
0:05:13	and make the determination of whether the word is correct or not.
0:05:16	And third of all,
0:05:17	most importantly, they are segmental models, which means
0:05:21	that you allow yourself to group observations
0:05:24	and build features
0:05:26	that
0:05:27	operate globally on
0:05:30	this group of observations.
0:05:33	And so here is an example, and for more information we
0:05:37	have a poster this afternoon
0:05:39	describing
0:05:40	the multiple approaches that we integrated in the
0:05:42	segmental conditional random field framework.
0:05:45	You see what kind of features we can add on a word.
0:05:51	Among the features that we developed,
0:05:53	one of them is another system, the MSR detection system; it is a
0:05:58	system
0:05:59	built by
0:06:00	Microsoft Research, where
0:06:03	you can combine with
0:06:05	different hypotheses.
0:06:07	At the bottom you see
0:06:09	phoneme detections
0:06:11	that are extracted from
0:06:13	a neural network or
0:06:15	a multilayer perceptron,
0:06:17	and in the middle you see
0:06:20	our features. The duration feature, for instance,
0:06:23	is just a number
0:06:25	that you associate
0:06:26	with
0:06:27	a word hypothesis.
0:06:30	You can see that sometimes we allow
0:06:32	features to be missing;
0:06:35	that is something that the framework allows us to do.
0:06:38	So if you have different hypotheses, we can
0:06:42	look at
0:06:44	duration,
0:06:45	and you can assign a different duration score
0:06:48	for different
0:06:50	word hypotheses
0:06:52	depending on
0:06:55	whether the duration is plausible or not.
0:07:00	So this is basically what we want to do:
0:07:03	to integrate
0:07:05	this, since we showed that short durations are
0:07:09	improbable.
0:07:11	And so here is a real example from a lattice.
0:07:14	It is a fragment; the true transcription was "in a place called to michael query",
0:07:20	which is a place, I think, somewhere in India,
0:07:23	and it is a very, very rare word.
0:07:26	What happens is that, through the backoff weights,
0:07:29	the language model likes,
0:07:32	instead of the true hypothesis,
0:07:34	to insert very short words.
0:07:37	They are typically function words that are very frequent,
0:07:40	and because they do not fit, they tend to be shorter.
0:07:43	So, "to my
0:07:45	cocker":
0:07:46	typically, you know, "my" is shortened; it
0:07:49	has to be compressed to fit,
0:07:51	because
0:07:54	it is a section of the real hypothesis.
0:07:58	So this is our goal: we need to penalize
0:08:01	the words that are in red,
0:08:03	and the words in blue are correct, so we want to
0:08:08	give them an
0:08:09	additional boost.
0:08:12	So the way we are going to do that is we are going to produce two features, two scores.
0:08:17	If you remember, these are the histograms of
0:08:21	the correct and incorrect instances,
0:08:25	the histograms of
0:08:28	duration frequency when the word is recognized correctly or incorrectly,
0:08:32	for the word "two".
0:08:33	The blue one is the correct one and the red one is the incorrect one.
0:08:38	So if you have a word hypothesis of "two"
0:08:40	that is
0:08:43	twenty frames long,
0:08:44	we are going to look up that probability
0:08:48	in the histogram,
0:08:50	and you see that the blue one is higher than
0:08:53	the red one.
0:08:54	Ultimately the model is going to learn
0:08:57	that
0:08:57	this difference
0:09:00	should have a positive weight:
0:09:02	it should help
0:09:03	any hypothesis that has a positive difference
0:09:06	and penalize any hypothesis that has a negative difference.
0:09:09	So anything that is ten frames long would have
0:09:12	a large
0:09:13	negative difference,
0:09:15	and so a large penalty.
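
The histogram lookup described above might look like the following sketch. The add-alpha smoothing, the fixed number of duration bins, and the function names are assumptions introduced here so that unseen durations do not produce a log of zero; the talk does not say how the histograms were smoothed or binned. The toy counts are invented.

```python
import math
from collections import Counter


def log_prob(hist, frames, num_bins=100, alpha=1.0):
    """Add-alpha smoothed log-probability of a duration under a histogram.

    `num_bins` and `alpha` are assumed values chosen for this sketch.
    """
    total = sum(hist.values())
    return math.log((hist[frames] + alpha) / (total + alpha * num_bins))


def duration_features(correct_hist, incorrect_hist, frames):
    """Return the two scores, plus their difference for inspection."""
    lp_correct = log_prob(correct_hist, frames)
    lp_incorrect = log_prob(incorrect_hist, frames)
    return lp_correct, lp_incorrect, lp_correct - lp_incorrect


# Toy histograms for one word (duration in frames -> count), invented.
correct = Counter({20: 8, 10: 2})
incorrect = Counter({20: 2, 10: 8})

_, _, diff_long = duration_features(correct, incorrect, 20)
_, _, diff_short = duration_features(correct, incorrect, 10)
print(diff_long > 0, diff_short < 0)  # True True
```

In this toy setup a twenty-frame hypothesis gets a positive difference and a ten-frame hypothesis a negative one, which is the pattern the learned weights are meant to exploit.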
0:09:20	So another thing we are going to do is
0:09:23	we are going to look only at the top hundred words,
0:09:26	and the reason is that we want to draw these histograms and need enough
0:09:29	samples to be able to draw the histograms
0:09:32	reliably.
0:09:34	Luckily, given the skewness of the task,
0:09:37	the top hundred words are actually
0:09:40	fifty percent of the probability mass and fifty percent of the error mass.
0:09:44	So they are relatively
0:09:47	important words,
0:09:48	and we can cover only
0:09:51	a small percentage of the word types
0:09:53	and still
0:09:53	account for fifty percent of the word tokens.
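
The coverage argument (a few word types accounting for a large share of the tokens) can be checked on any corpus with a sketch like this one. The function name and the toy sentence are illustrative only.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Fraction of tokens covered by the n most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


# Toy corpus, invented for illustration.
tokens = "the cat saw the dog and the bird".split()
print(top_n_coverage(tokens, 1))  # 0.375
```

Running the same function with `n=100` on the decoder output would give the token coverage of the top hundred words.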
0:10:00	Another feature that we looked at was
0:10:03	multi-span and short span.
0:10:05	The intuition here is that
0:10:07	you have this phenomenon where
0:10:10	the language model inserts
0:10:12	lots of small words for a large word;
0:10:14	it is trying to break up a large, infrequent word
0:10:17	into lots of small words.
0:10:21	So
0:10:21	we want to distinguish cases, for instance "called" and "calling", from
0:10:25	"and to mark korean T";
0:10:27	these are other
0:10:28	words.
0:10:30	The first one, "called" and "calling", is just a substitution,
0:10:33	and the other one is of a different type.
0:10:36	So instead of producing
0:10:38	two features, we are going to split each of these cases into
0:10:43	six
0:10:44	features.
0:10:44	Whenever
0:10:45	there is no special span, we
0:10:47	produce
0:10:49	two features for that case,
0:10:51	and whenever a word is of a different span type
0:10:54	we produce two other
0:10:56	features, so weights can be assigned
0:10:58	differently
0:10:59	for these different cases.
0:11:02	We decided that a multi-span was a word that spans multiple words, and a short span was
0:11:07	a
0:11:08	word
0:11:08	that was spanned by
0:11:10	another word.
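
One way to picture the six-feature split is the sketch below. It is a guess at the mechanics, not the workshop code: the talk does not say what a hypothesis is compared against, so the `baseline` segmentation (for example, a first-pass output), the frame-interval representation, and all names are assumptions.

```python
SPAN_TYPES = ("same", "multi", "short")


def span_type(hyp, baseline):
    """Classify a hypothesis segment against a baseline segmentation.

    `hyp` is a (start, end) frame interval; `baseline` is a list of such
    intervals. Both the representation and the baseline are assumptions.
    - "multi": the hypothesis covers two or more whole baseline words
    - "short": the hypothesis lies strictly inside one baseline word
    - "same":  anything else (for example, a plain substitution)
    """
    start, end = hyp
    inside = [b for b in baseline if start <= b[0] and b[1] <= end]
    if len(inside) >= 2:
        return "multi"
    for b_start, b_end in baseline:
        contained = b_start <= start and end <= b_end
        if contained and (b_start, b_end) != (start, end):
            return "short"
    return "same"


def six_features(kind, lp_correct, lp_incorrect):
    """Place the two duration scores in the slots for this span type."""
    feats = [0.0] * 6
    i = 2 * SPAN_TYPES.index(kind)
    feats[i], feats[i + 1] = lp_correct, lp_incorrect
    return feats


# Toy baseline segmentation in frames, invented for illustration.
baseline = [(0, 10), (10, 20), (20, 50)]
print(span_type((0, 20), baseline))   # multi
print(span_type((20, 30), baseline))  # short
print(span_type((0, 10), baseline))   # same
```

Keeping the scores in separate slots lets the model learn a different weight for each span type, as described in the talk.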
0:11:13	Okay.
0:11:14	The second
0:11:16	idea was this, and it has been reported multiple times in the
0:11:20	literature:
0:11:21	basically,
0:11:22	before a pause
0:11:24	a word tends to be pronounced
0:11:27	slower, so it will have a longer duration,
0:11:31	and in the middle of a sentence or after a pause it will tend to be of normal duration,
0:11:36	so to speak.
0:11:37	So if you take
0:11:38	the example sentence here,
0:11:41	"to President",
0:11:42	pause,
0:11:43	"to President Clinton said
0:11:45	something",
0:11:46	you see that the second instance, the blue instance,
0:11:49	has
0:11:49	a shorter duration.
0:11:53	And so we can separate these:
0:11:56	we have
0:11:58	words that appear at the
0:12:00	end of a sentence or before a pause
0:12:03	have a different duration model.
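
Keeping a separate duration model for pre-pause words could be sketched by keying the histograms on the context as well as the word. The `pre_pause` flag, the tuple format, and the toy durations are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def context_histograms(instances):
    """Build duration histograms keyed by (word, pre_pause).

    `instances` is an iterable of (word, duration_in_frames, pre_pause)
    tuples, where `pre_pause` is True for a word at the end of a sentence
    or right before a pause. This input format is an assumption.
    """
    hists = defaultdict(Counter)
    for word, frames, pre_pause in instances:
        hists[(word, bool(pre_pause))][frames] += 1
    return hists


# Toy instances, invented for illustration.
toy = [("president", 45, True), ("president", 47, True),
       ("president", 30, False), ("president", 32, False)]
hists = context_histograms(toy)
print(sorted(hists[("president", True)]))   # [45, 47]
print(sorted(hists[("president", False)]))  # [30, 32]
```

A hypothesis would then be scored against the histogram that matches its own context.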
0:12:07	Okay.
0:12:08	And so we integrated this
0:12:11	with the framework, in the
0:12:14	model. We had a state-of-the-art IBM baseline decoder,
0:12:20	and this is a broadcast news task,
0:12:24	and we
0:12:25	combined it with the MSR system and we got to fifteen point three.
0:12:30	Then we added duration features,
0:12:34	so we can see that this is working
0:12:38	when you add duration features, and when you add them
0:12:42	with the
0:12:44	different variants shown.
0:12:48	And these features,
0:12:51	the individual duration features, turn out to be as good as any other
0:12:55	individual feature we tried at the workshop.
0:13:00	Right.
0:13:01	So in conclusion,
0:13:04	I hope I have given you an idea about
0:13:08	how durations can be used
0:13:10	for word discrimination,
0:13:13	and the
0:13:14	idea that
0:13:16	words that are misrecognized tend to be shorter because they come from
0:13:20	being forced in
0:13:21	by the language model;
0:13:23	they tend to be short function words.
0:13:27	We were able to
0:13:29	take this intuition and
0:13:31	quantify it into features that
0:13:34	we were able to
0:13:36	integrate in the segmental conditional random field
0:13:39	framework
0:13:40	to penalize
0:13:42	spurious word hypotheses
0:13:43	individually
0:13:44	with our duration scores.
0:13:47	And when we combine that
0:13:48	with
0:13:50	a state-of-the-art system, we still get
0:13:53	a small improvement.
0:13:57	okay
0:14:03	yeah have a few
0:14:05	i have a question
0:14:08	so that that yeah i i think that could effect the duration of them where
0:14:11	my keys met
0:14:13	yeah yeah i not only keep on but also a
0:14:17	yeah the way sounds
0:14:19	i think you yeah
0:14:20	yeah
0:14:21	and i i have a duration
0:14:24	Yes, that is interesting.
0:14:27	We have not yet, but one thing we looked at, which I think we
0:14:31	report in the paper and which was, I think, interesting, is that
0:14:33	you can look at the
0:14:35	duration of each phone
0:14:38	within the word,
0:14:39	and you can see that they actually differ.
0:14:42	And you see, yes, exactly, depending on the stress, whether the stress is correct,
0:14:47	you see differences in the duration.
0:14:53	yeah one a question
0:15:00	yeah

Speech Analysis

Presented by: Patrick Nguyen, Author(s): Justine Kao, Stanford University, United States; Geoffrey Zweig, Patrick Nguyen, Microsoft Research, United States