0:00:13okay
0:00:14welcome to the morning session on acoustic modeling
0:00:19we'll start off with a talk by geoff zweig
0:00:21before we start let me introduce him
0:00:23actually i'm really happy to introduce someone i've known since he was this high
0:00:27but who has grown a lot since he was a graduate student
0:00:32anyway
0:00:35this will be followed by the poster session on acoustic models
0:00:40he was at berkeley where he really did an amazing job and was already interested in
0:00:46trying crazy different models which is something i've always liked
0:00:51he went on to IBM to work
0:00:54and he continued there to work on graphical models
0:00:59not only working on the theory but also
0:01:01working on implementations
0:01:04and he got sucked into a lot of darpa meetings
0:01:08as have many of us
0:01:09and he moved on from there to microsoft where he's been since two thousand six
0:01:13so he's well known in the field now both for
0:01:16principled
0:01:18developments and also for implementations that have been useful for the community
0:01:24so i'm happy
0:01:26to have jeff up here to give a talk on the very interesting idea of segmental conditional random fields
0:01:41i thank you very much
0:01:46okay so i'd like to start today with a very high level description of what the theme of
0:01:55the talk is going to be
0:01:57and i tried to put a little bit of thought in advance into what would be a good sort
0:02:02of pictorial metaphor a pictorial representation of what the talk would be about and also something that is
0:02:11fitting to the beautiful location that we're in today
0:02:16when i did that i decided the best thing that i could come up with was this picture that you
0:02:21see here of a nineteenth century clipper ship
0:02:25and these are sort of very interesting things they were basically the space shuttle
0:02:30of their day they were designed to go absolutely as fast as possible making trips to say
0:02:37london or boston
0:02:40and when you look at the ship there you see that they put a huge amount of thought and engineering
0:02:46into its design
0:02:48and in particular if you look at those sails they didn't sorta just build a ship and then put one
0:02:54big sail up on top of it instead what they did was they tried in many ways
0:03:01to harness sort of every aspect every facet of the wind
0:03:05that they could that they possibly could and so they have sails positioned in all different ways they have
0:03:11some rectangular sails they have some triangular sails they have the sort of funny sail that you see
0:03:18there back at the end
0:03:20and the idea here is to really pull out absolutely all the energy that you can get from the wind
0:03:26and then drive this thing forward
0:03:29that relates to what i'm talking about today which is speech recognition systems that in a similar way harness together
0:03:37a large number of information sources to try to drive the speech recognizer forward in a faster and better
0:03:43way
0:03:44and this is going to lead to a discussion of log-linear models
0:03:48segmental models and then their synthesis
0:03:52in the form of segmental conditional random fields
0:03:57here's an outline of the talk
0:03:59i'll start with some motivation of the work
0:04:03i'll go into the mathematical details
0:04:05of segmental conditional random fields starting with hidden markov models
0:04:09and then progressing through a sequence of models that lead to the scrf
0:04:14i'll talk about a specific implementation that my colleague patrick nguyen and i put together this is the scarf
0:04:22toolkit i'll talk about the language modeling that's implemented there which is sort of interesting
0:04:28the inputs to the system and then the features that it generates from them
0:04:33i'll present some experimental results some research challenges and a few concluding remarks
0:04:41okay so the motivation of this work is that state-of-the-art speech recognizers really look at speech sort of frame-by-frame
0:04:51we go we extract our speech frames every ten milliseconds
0:04:55we extract the features usually one kind of feature for example plps or mfccs
0:05:02and send those features into a time synchronous
0:05:06recognizer that processes them and outputs words
0:05:10i'm going to be the last person in the room to underestimate the power of that basic model and how
0:05:18well it can perform how good a performance you can get from working with that kind of model
0:05:24and doing a good job in terms of the basics of it and so a very good question to
0:05:29ask
0:05:30is how to improve that model in some way
0:05:35but that is not the question that i'm going to ask today
0:05:39instead i'm going to ask a different question i should say i will re-ask
0:05:45a question because this is something that a number of people have looked at in the past
0:05:51and this is whether or not we could do better with a more general model
0:05:55and in particular the questions i'd like to look into are whether we can move from a frame-wise analysis
0:06:02to a segmental analysis
0:06:05from the use of real-valued feature vectors
0:06:08such as mfccs and plps
0:06:11to more arbitrary feature functions
0:06:13and if we can design a system around the synthesis
0:06:19of disparate information sources
0:06:22what's going to be new in this
0:06:24is doing it in the context of log-linear modeling
0:06:28and it's going to lead us to a model like the one that you see at the bottom of the
0:06:33picture here
0:06:35so in this model we have basically a two-state a two layer model i should say
0:06:40at the top layer we are going to end up with states these are going to be segmental states representing
0:06:47stereotypically words
0:06:49and then at the bottom layer we'll have a sequence of observation streams we'll have many observation streams
0:06:55and these
0:06:58each provide some information there can be many different kinds of information sources for example the detection of a
0:07:06phoneme the detection of a syllable the detection of an energy burst a template match score
0:07:12all kinds of different information coming in through these multiple observation streams
0:07:17and because they're general like detections
0:07:21they're not necessarily frame synchronous and you can have variable numbers
0:07:26in a fixed amount of time across the different streams
0:07:30and we'll have a log-linear model that relates
0:07:33the states that we're hypothesizing to the observations that are hanging down below each state and
0:07:41blocked into words
0:07:46okay so i'd like to move on
0:07:48and now discuss
0:07:50scrfs mathematically starting first from hidden markov models
0:07:56so here's a depiction of a hidden markov model i think we're all familiar with this
0:08:01the key thing that we're getting here is an estimation of the probability of the state sequence
0:08:10given an observation sequence in this model states usually represent context-dependent phones or sub-states of context-dependent phones
0:08:20and the observations are most frequently spectral representations such as mfccs or plps
0:08:27the probability is given by the expression that you see there where we go frame by frame
0:08:32and multiply in transition probabilities the probability of a state at one time given the previous state
0:08:39and then observation probabilities the probability of an observation at a given time given that state
0:08:45and those observation probabilities are most frequently gaussians on mfcc or plp features
0:08:52whereas in hybrid systems you can also use neural net posteriors as input to the
0:08:59to the likelihood computation
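[editor's note: the expression on the slide is not captured in the transcript; a standard reconstruction of the hmm probability being described with state sequence q and observation sequence o is:]

$$ p(q, o) \;=\; \prod_{t=1}^{T} \; p(q_t \mid q_{t-1}) \; p(o_t \mid q_t) $$

[for a fixed observation sequence maximizing this joint score over q is equivalent to maximizing the probability of the state sequence given the observations]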
0:09:04okay so i think the first
0:09:06sort of
0:09:07big step away conceptually from the hidden markov model is maximum entropy markov models
0:09:15and these were first investigated by adwait ratnaparkhi in the mid nineties in the context of
0:09:20part-of-speech tagging
0:09:22for natural language processing
0:09:26and then generalized or formalized by mccallum and his colleagues in two thousand
0:09:32and then there were some seminal applications of these to speech recognition by jeff kuo and yuqing gao
0:09:40in the mid two thousands
0:09:43the idea behind these models
0:09:45is to ask the question what if we don't condition the observation on the state but instead condition the state
0:09:52on the observation
0:09:54so if you look at the graph here what's happened is the arrow instead of going down is going up
0:09:59and we're conditioning a state at a given time J on the previous state and the current observation
0:10:06states are still context-dependent phone states as they were before
0:10:11but what we're gonna get out of this whole operation is the ability to have potentially much richer observations
0:10:19than mfccs down here
0:10:22the probability of the state sequence given the observations for a memm is given by this expression here
0:10:29where we go through time frame by time frame and compute the probability of the current state given the previous
0:10:35state
0:10:35and the and the current observation
0:10:39how do we do that
0:10:40the key to this is to use
0:10:43a
0:10:45small little maximum entropy model
0:10:48and apply it at every time frame
0:10:51so what this maximum entropy model does
0:10:54is primarily
0:10:56computes some feature functions
0:11:00that relate the state at the
0:11:02previous time to the state at the current time
0:11:05and the observation at the current time
0:11:07those feature functions can be arbitrary functions they can return a real number or a binary number and they can
0:11:14do an arbitrary computation
0:11:17they get weighted by lambdas
0:11:19those are the parameters of the model summed over all the different kinds of features that you have and then
0:11:24exponentiated
0:11:26it's normalized by the sum over all possible ways that you could assign values to the state there with the
0:11:33same sort of expression
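[editor's note: a hedged reconstruction of the memm expression just described with feature functions f_i and weights λ_i and a local maxent model normalized at each frame:]

$$ P(q \mid o) \;=\; \prod_{t} P(q_t \mid q_{t-1}, o_t) \;=\; \prod_{t} \frac{\exp\big(\sum_i \lambda_i f_i(q_{t-1}, q_t, o_t)\big)}{\sum_{q'} \exp\big(\sum_i \lambda_i f_i(q_{t-1}, q', o_t)\big)} $$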
0:11:36and this is doing two things again
0:11:38the first is gonna let us have arbitrary feature functions that we use
0:11:43rather than say gaussian mixtures
0:11:45and it's inherently discriminative in that it has this normalisation factor here
0:11:53i'm gonna talk a lot about features and so i wanna make sure that we're on the same page in
0:11:58terms of what exactly i mean by features and feature functions
0:12:02features by the way are distinct from observations observations are things you actually see and then the features
0:12:09are numbers that you compute using those observations as input
0:12:16a nice way of thinking about the features is as a product of a state component and a linguistic component
0:12:24i'm sorry a state component and then an acoustic component
0:12:28and i've illustrated a few possible state functions and acoustic functions
0:12:34in this table and then the features the kind of features that you extract from that
0:12:40so one very simple
0:12:42function is to ask the question is the current state
0:12:47y what's the current phone or what's the current context-dependent phone what's the value of that and just
0:12:53use a constant for the acoustic function
0:12:56and you multiply those together and you have a binary feature
0:12:59it's either
0:13:01the state is either this thing y or it's not zero or one
0:13:04and the weight that you learn on that is essentially a prior on that particular context-dependent state
0:13:12a full transition function would be the previous state was x
0:13:17and the current state is y the previous phone is such and so and the current phone is such and so
0:13:22we don't pay attention to the acoustics we just use one and that gives us a binary function that says
0:13:27what the transition is
0:13:29we get little bit more interesting features when we start actually using the acoustic function
0:13:33so one example of that is to say the state function is the current state is such and so
0:13:41oh and by the way when i take my observation and plug it into my voicing detector that comes out
0:13:46either yes it's voiced or no it's not voiced and i get a binary feature when i multiply those two
0:13:51together
0:13:53yet another example is the state is such and so
0:13:56and i happen to have a
0:13:58a gaussian mixture model for every state and when i plug the observation into the gaussian mixture model for that
0:14:04state i get a score and i multiply the score by the fact that i'm seeing the state
0:14:10and that gives me a real-valued feature function
0:14:13and so forth and so you can get fairly sophisticated feature functions this one down here by the
0:14:19way is the one that kuo and gao used in their memm work where they looked at the rank
0:14:25of a gaussian mixture model
0:14:29the rank of the gaussian mixture model associated with a particular state compared to all the other states in the
0:14:35system
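[editor's note: a minimal python sketch of the state-times-acoustic feature decomposition in the table; this is illustrative only and not the scarf toolkit code and the state names and toy detectors are invented]

```python
# each feature is the product of a state component and an acoustic component
def make_feature(state_fn, acoustic_fn):
    def feature(prev_state, state, observation):
        return state_fn(prev_state, state) * acoustic_fn(observation)
    return feature

# state components
def state_is(y):
    return lambda prev, cur: 1.0 if cur == y else 0.0

def transition_is(x, y):
    return lambda prev, cur: 1.0 if (prev, cur) == (x, y) else 0.0

# acoustic components
constant = lambda obs: 1.0                                # ignore the acoustics
voiced = lambda obs: 1.0 if obs.get("voiced") else 0.0    # binary voicing detector
gmm = lambda obs: obs.get("gmm_score", 0.0)               # real-valued gmm score

# examples from the table: a prior on a state, a transition indicator,
# a binary voicing feature, and a real-valued gmm-score feature
features = [
    make_feature(state_is("ae_2"), constant),
    make_feature(transition_is("ae_1", "ae_2"), constant),
    make_feature(state_is("ae_2"), voiced),
    make_feature(state_is("ae_2"), gmm),
]

# evaluating the features for one hypothesized transition and observation
obs = {"voiced": True, "gmm_score": -3.2}
print([f("ae_1", "ae_2", obs) for f in features])   # [1.0, 1.0, 1.0, -3.2]
```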
0:14:38let's move on to the conditional random field
0:14:40now
0:14:41it turns out that under certain pathological conditions if you use memms you can make a decision early on
0:14:50and the transition structure
0:14:52just so happens to be set up in a way such that you ignore the observations for the
0:14:57rest of the utterance
0:14:59and you run into a problem i think these are pathological conditions but they can theoretically exist
0:15:06and that motivated the development of conditional random field
0:15:10where rather than doing a bunch of the local normalizations making a bunch of local state wise decisions there's one
0:15:18global normalisation over all possible state sequences
0:15:22because there is a global normalisation it doesn't make sense to have arrows in the picture the arrows
0:15:29indicate where you're gonna do the local normalisation and we're not doing a local normalisation
0:15:34so the picture is this
0:15:36the states are as with the maximum entropy model and the observations are also as with the maximum entropy model
0:15:42and the feature functions are as with the maximum entropy model the thing that's different is that when you
0:15:48compute the probability of the state sequence given the observations
0:15:51you normalise
0:15:54not locally but once globally over all the possible ways that you can assign values
0:15:59to those state sequences
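[editor's note: a hedged reconstruction of the globally normalized crf in the same notation as the memm above; the only change is that the sum over competing state sequences moves outside the product over time:]

$$ P(q \mid o) \;=\; \frac{\exp\big(\sum_{t}\sum_i \lambda_i f_i(q_{t-1}, q_t, o_t)\big)}{\sum_{q'} \exp\big(\sum_{t}\sum_i \lambda_i f_i(q'_{t-1}, q'_t, o_t)\big)} $$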
0:16:05that brings me now to the segmental version of the crf which is the main point of the talk
0:16:11so the key difference between the segmental version of the crf and the previous version of the crf
0:16:17is that we're going to take the observations
0:16:21and we're now going to block them into groups that correspond to segments
0:16:25and we're actually gonna make those segments be words
0:16:28conceptually they could be any kind of segment they could be a phone segment or syllable segment but the rest
0:16:33of this talk i'm gonna refer to them as words
0:16:36and for each word we're gonna block together a bunch of observations and associate them concretely with that state
0:16:44those observations again can be more general than mfccs for example they could be phoneme detections or the detection
0:16:51of articulatory features
0:16:54there's some complexity that comes with this model because
0:16:58even when we do training where we know how many words there are we don't know what the segmentation is
0:17:03and so we'd have to consider all possible segmentations of the observations into the right number of words
0:17:10and in this picture here for example we have to consider segmenting seven observations not just as two
0:17:16two and three but maybe moving this guy over here and having three associated with the first word and only
0:17:22one associated with the second word
0:17:24and then three with the last
0:17:26when you do decoding you don't even know how many words there are and so you have to consider both
0:17:31all the possible number of segments and all the possible segmentations
0:17:36given that number of segments
0:17:39this leads to an expression for segmental crfs that you see here
0:17:43it's written in terms of the edges that exist in the top layer of the graph there
0:17:49each edge has a state to its left and a state to its right
0:17:54and it has a group of observations that are linked together underneath it o(e)
0:18:01and the segmentation is denoted by Q
0:18:04with that notation the probability of a state sequence given the observations is given by the expression you see
0:18:11there which is essentially the same as the expression for the regular crf
0:18:15except that now we have a sum over segmentations that are consistent with the number of segments that are hypothesized
0:18:24or known during training
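[editor's note: a hedged reconstruction of the segmental crf expression in the notation just described, where s is the sequence of segment states, q ranges over segmentations with as many segments as s has states, e ranges over the edges of a segmentation, and o(e) is the block of observations under edge e:]

$$ P(s \mid o) \;=\; \frac{\sum_{q\,:\,|q|=|s|} \; \prod_{e} \exp\big(\sum_i \lambda_i f_i(s_{l(e)}, s_{r(e)}, o(e))\big)}{\sum_{s'} \sum_{q\,:\,|q|=|s'|} \; \prod_{e} \exp\big(\sum_i \lambda_i f_i(s'_{l(e)}, s'_{r(e)}, o(e))\big)} $$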
0:18:29okay so that was
0:18:31that was a lot of work to go through to introduce segmental features do we really need to introduce segmental features
0:18:36at all do we get anything from that because after all with the crf the state sequence is
0:18:43conditioned on the observations we've got the observation sitting there in front of us
0:18:47isn't that enough is there anything else you need
0:18:50and i think the answer to that is clearly yes you do need to have boundaries or you get more
0:18:56if you talk about concrete boundaries
0:18:59segment boundaries here are a few examples of that
0:19:03suppose you wanna use template match scores
0:19:06as feature functions for example you have a segment and you ask the question what's the dtw distance between
0:19:13this segment and the closest example of the word that i'm hypothesizing in some database that i have
0:19:20to do that you need to know where you start the alignment and where you end the alignment and you need
0:19:24the boundaries so you get something from that you don't have when you just say here's a big blob of
0:19:29observations
0:19:31similarly word durations if you wanna talk about a word duration model you have to be precise about when the
0:19:36word starts and when the word ends so that the duration is defined
0:19:40turns out to be useful to have boundaries if you're incorporating scores from other models
0:19:45two examples of that are the hmm likelihoods and fisher kernel scores
0:19:50that layton and gales have used
0:19:52and the point process model scores
0:19:55that jansen and niyogi have proposed
0:19:59later in the talk i'll talk about detection subsequences
0:20:03as features and there again we need to know the boundaries
0:20:08okay so before proceeding i'd like to just emphasise that this is really building on a long tradition of work
0:20:15and i want to go over and call out some of the components of that tradition the first is log-linear
0:20:21models that use a frame level markov assumption
0:20:27and there i think the key work was done by jeff kuo and yuqing gao with the maximum entropy markov model
0:20:35theirs really was the first to propose and exercise
0:20:38the power of using general feature functions
0:20:44shortly thereafter
0:20:46hidden or actually more or less simultaneously with that hidden crfs were proposed by gunawardana
0:20:52and his colleagues and then there was a very interesting paper by him and one of his students at
0:20:58last year's asru
0:21:00where essentially an extra hidden variable is introduced into the crf
0:21:04to represent gaussian mixture components
0:21:06with the intention
0:21:08of simulating mmi training in a conventional system
0:21:15jeremy morris and eric fosler-lussier did some fascinating initial work on applying crfs in speech recognition
0:21:25they used features such as neural net attribute posteriors
0:21:30and in particular
0:21:31the detection of sonority voicing manner of articulation and so forth as feature functions that went into
0:21:40the model
0:21:41and they also proposed and experimented with the use of mlp phoneme posteriors as features
0:21:48and proposed the use of something called the crandem model
0:21:51which is essentially a hybrid crf hmm model where the crf phone posteriors are used as the state likelihood functions rather
0:22:01than neural net posteriors in the standard hybrid system
0:22:05the second tradition i'd like to call out is actually the tradition of segmental log-linear models
0:22:11the first use of this was termed semi-crfs by sarawagi and cohen in the development
0:22:19of natural language processing
0:22:22layton and gales proposed something termed the conditional augmented statistical model which is a segmental crf
0:22:29that uses hmm scores and fisher kernel scores
0:22:33zhang and gales proposed the use of structured svms
0:22:37which are essentially segmental crfs with large margin training
0:22:43lehr and shafran have an interesting transducer representation that uses perceptron training and similarly achieves joint acoustic language and
0:22:52duration model training
0:22:54and finally georg heigold
0:22:56and patrick nguyen and i have done a lot of work on flat direct models which are essentially whole sentence maximum
0:23:05entropy
0:23:06acoustic models maxent models at the segment level and you can think of the segmental models i'm talking about today as
0:23:13essentially stringing together a whole bunch of flat direct models one for each segment
0:23:20it's also important to realise that there's significant previous work on just classical segmental modelling and detector based asr
0:23:29the segmental modelling i think comes in sort of two main threads
0:23:33in one of these the likelihoods are based on framewise computations so you have a different number of scores that
0:23:39contribute to each segment
0:23:41and there's a long line of work that was done here by mari ostendorf and her students and a number
0:23:48of other researchers that you see here
0:23:50and then in a separate thread
0:23:52there's a development of using a fixed-length segment representation for each segment
0:23:58that mari ostendorf and her colleagues
0:24:01looked at in the late nineties and then jim glass more recently has worked on and contributed using
0:24:08phone likelihoods in the computation in a way that i think is similar to the normalisation in the scrf
0:24:16framework
0:24:18i'm going to talk about using detections phone detections and multi-phone detections and so i should acknowledge
0:24:24chin-hui lee and his colleagues and their proposal of detector based asr
0:24:30which combines detector information in a bottom-up way to do speech recognition
0:24:38okay so i'm gonna move on now to the scarf implementation a specific implementation of a segmental crf
0:24:44and what this is going to do is essentially extend that tradition that i've mentioned
0:24:48and it's going to extend it with the synthesis of detector based recognition segmental modelling and log-linear modeling
0:24:58it's going to further
0:25:00develop some new features that weren't present before and in particular features termed existence expectation and levenshtein features
0:25:09and then it's going to extend that tradition with an adaptation to large vocabulary speech recognition by fusing finite state language
0:25:18modeling into that segmental framework that i've been
0:25:21talking about
0:25:24okay so let's move on to a specific implementation
0:25:28so this is a toolkit that i developed with patrick nguyen
0:25:32it's available from the web page that you see there you can download it and play around with it
0:25:39and the features that i talk about next
0:25:42are
0:25:43specific
0:25:44to this implementation and they're sort of one way of realizing the general scrf framework and using it for
0:25:53speech recognition where you sort of have to dot all the i's and cross all the t's and make sure that
0:25:58everything works
0:26:02okay so i want to start by talking about how language models are implemented there because it's sort
0:26:08of a tricky issue
0:26:09when i see a model like this
0:26:12i think bigram language model i see two states
0:26:16they're next to each other they're connected by an edge that's like the probability of one state given the preceding
0:26:21state and that looks a whole lot like a bigram language model so is that what we're talking about are we
0:26:26just talking about bigram language models
0:26:29and the answer is no what we're going to do is we're actually going to be able to model long
0:26:37span language model context
0:26:37by making these states
0:26:39refer to states in an underlying finite state language model
0:26:44here's an example of that
0:26:46what you see on the left is a fragment from a finite state language model it's a trigram language model
0:26:52so it has bigram history states
0:26:54for example there's a bigram history state for the dog and similarly for other word pairs
0:27:00and sometimes we don't have all the trigrams in the world so to
0:27:05decode an unseen trigram we need to be able to back off to a lower order history state so for
0:27:11example if we're in the history state the dog we might have to back off to the history state dog
0:27:18the one word history state and then we could decode a word that we haven't seen before in a trigram
0:27:22context like yep and then moved to the history state dog yep
0:27:28finally as a last resort you can back off to the null history state three down there at the bottom
0:27:34and just decode any word in the vocabulary
0:27:38okay so let's assume that we want to decode the sequence the dog yep
0:27:43how would that look
0:27:45we decode the first word the and we end up in state seven here having seen the history
0:27:52the
0:27:54then we decode the word dog
0:27:56that moves us around up to state one we've seen the bigram now the dog
0:28:02now suppose we wanna decode yep
0:28:06to do that
0:28:08so right now we're in state one
0:28:10we've gotten as far as the dog and that's gotten us to state one here
0:28:15and now suppose you want to decode yep we'd have to back off
0:28:19from state one to state two and then we could decode the word yep and end up in state six over
0:28:26here dog yep
0:28:28so what this means is that by the time we get around to decoding the word yep
0:28:34we know a lot more than that
0:28:36the last word was dog we actually know that the previous state was state one which corresponds to the two
0:28:42word history the dog and so this is not a bigram language model that we have here it actually reflects
0:28:48the semantics
0:28:50of the trigram language model that you see in that fragment on the left
0:28:57so there's two ways that we can use this one is to generate a basic language model score if we
0:29:03provide the system with the with the finite state language model then we can just look up the language model
0:29:08cost of transitioning between states and use that as one of the features in the system
0:29:13but more interestingly we can create a binary feature for each arc in the language model
0:29:21now these arcs in the language model are normally labeled with things like bigram probabilities trigram probabilities or back-off
0:29:30probabilities
0:29:31what we're gonna do is we're gonna create a binary feature that just says have i traversed
0:29:36this arc in transitioning from one state to the next
0:29:40so for example when we go from
0:29:42the dog to dog yep we traverse two arcs
0:29:46the arc from one to two and then the arc from two to six
0:29:49the weights
0:29:50the lambdas that we learn in association with those
0:29:54are analogous to the back-off weights and the bigram weights of the normal language model but we're actually learning what
0:30:01those weights are
0:30:03what that means is that when we do training we end up with a discriminatively trained language model and actually
0:30:09a language model that we train in association with the acoustic model training at the same time jointly with the
0:30:16acoustic model training
0:30:18so i think that's sort of an interesting phenomenon
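[editor's note: a minimal python sketch of the language model arc features just described; given a backoff n-gram automaton it collects the arcs traversed when decoding the next word and each fired arc id becomes a binary feature with a learned weight; the data structures and toy fragment are invented for illustration and this is not the scarf code]

```python
from collections import namedtuple

# word_arcs maps (state, word) -> (arc_id, next_state)
# backoff_arcs maps state -> (arc_id, lower_order_state)
LM = namedtuple("LM", ["word_arcs", "backoff_arcs"])

# toy trigram fragment: state 1 = history "the dog", state 2 = history "dog"
lm = LM(
    word_arcs={(2, "yep"): ("arc_2_6", 6)},   # yep is decodable after dog
    backoff_arcs={1: ("arc_1_2", 2)},         # back off from "the dog" to "dog"
)

def traverse(lm, state, word):
    """back off until word can be decoded and return the fired arcs and new state
    (assumes the word is decodable from some reachable history state)"""
    fired = []
    while (state, word) not in lm.word_arcs:
        arc, state = lm.backoff_arcs[state]   # traverse a backoff arc
        fired.append(arc)
    arc, state = lm.word_arcs[(state, word)]  # traverse the word arc
    fired.append(arc)
    return fired, state

# decoding yep from state 1 fires the backoff arc 1->2 and the word arc 2->6
print(traverse(lm, 1, "yep"))   # (['arc_1_2', 'arc_2_6'], 6)
```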
0:30:23okay i'd like to talk about the inputs to the system now
0:30:28the first inputs are detector inputs so a detection is simply a unit and its midpoint
0:30:35an example of that is shown here what we have are phone detections this is from a voice mail system
0:30:41or rather
0:30:41from a voice search system and it looks like the person is asking for burgers the person
0:30:48says the phones
0:30:49of the word
0:30:51burgers
0:30:53and so the way to read this is that we detected the b at time frame seven ninety and the er
0:30:58at a later time frame and so forth and these correspond to the observations that are in red
0:31:05in the illustration here
0:31:07you can also provide dictionaries that specify the expected sequence of detections for each word for example
0:31:15if we're going to decode burgers we expect the b then the er then the g and so forth the pronunciation of the word
0:31:23second input is lattices
0:31:26that constrain the search space
0:31:28the easiest way of getting these lattices is to use a conventional hmm system
0:31:33and use it to just provide
0:31:37constraints on the search space
0:31:37and the way to read this is
0:31:39that from time twelve twenty one to time twenty five sixty a reasonable hypothesis is workings
0:31:48and these times here give us segment boundaries hypothesized segment boundaries and the word gives us
0:31:56possible labelings of the state
0:31:59and we're gonna use those when we actually do the computations to constrain the set of possibilities we have to
0:32:05consider
0:32:07the second kind of lattice input is user-defined features
0:32:11if you happen to have a model that you think provides some measure of consistency between the word that you're
0:32:19hypothesizing and the observations you can plug it in as a user-defined feature like you see here
0:32:25this lattice has a single feature that's been added it's a dynamic time warping feature
0:32:30and the particular one i've got underlined in red here is indicating that the dtw feature value for hypothesizing
0:32:38the word fell
0:32:40between frames nineteen eleven and twenty two sixty is eight point two seven
0:32:45and that feature corresponds to one of the features in the log-linear models that exist on those vertical edges
0:32:54now multiple inputs
0:32:56are very much encouraged and what you see here is a fragment of a lattice file that
0:33:03kris demuynck put together
0:33:05and you can see it's got lots of different feature functions that he's defined
0:33:10and essentially these features are the things that following the metaphor that i started at the beginning
0:33:16are analogous to the sails on the ship that are providing the information and pushing the whole thing forward
0:33:22and that we want to get as many of those
0:33:24as possible
0:33:27okay
0:33:28let's talk about some features that are automatically defined from the inputs
0:33:34the user-defined features are user-defined you don't have to worry about them once you put them in
0:33:38on a lattice
0:33:40if you provide detector sequences there's a set of features that can be automatically extracted and the system will learn
0:33:46the weights of those features those are existence expectation and levenshtein features along with something called the baseline feature
0:33:56so the idea of an existence feature is to measure whether a particular unit
0:34:02exists within the span of the word
0:34:04that you're hypothesizing
0:34:06these are created for all word unit pairs
0:34:10and they have the advantage that you don't need any predefined pronunciation dictionary
0:34:15but they have the disadvantage that you don't get any generalization ability across words
0:34:21here's an example suppose we're hypothesizing the word accord
0:34:25and it spans the detections of the k and the ao and the r
0:34:29i would create a feature that says okay i'm hypothesizing accord
0:34:33and i detected a k in the span that would be an existence feature when you train the model it presumably would
0:34:39get a positive weight because presumably it's a good thing to detect a k if you're hypothesizing the word accord
0:34:47but
0:34:48there's no generalisation ability across words here so that k is a completely different k than the k that you would have
0:34:54if you were hypothesizing accordion and there's no transfer of the weight or smoothing there
0:35:03the idea behind expectation features is to use a dictionary to avoid this and actually get generalization ability across words
0:35:11there's three different kinds of expectation features
0:35:15and i think i'll just go through the examples and describe them
0:35:20so let's take the first one suppose we're hypothesizing accord again and we detected k ao r
0:35:28we have a correct accept
0:35:30of the k because we expect to see it on the basis of the dictionary and we've actually detected it
0:35:37now that feature is very different from the other feature because we can learn that that's a good thing that
0:35:42detecting a k when you expect a k is good in the context of training on the word accord
0:35:48and then use that same feature weight when we detect a k in association with the word accordion or some other
0:35:56word
0:36:00the second kind of expectation feature is a false reject of the unit
0:36:05and the second example is one where we expect to see it but we don't actually detect it
0:36:05finally you can have a false accept of the unit where you don't expect to see it based on your
0:36:09dictionary pronunciation but it shows up there in the things that you've detected
0:36:14and the third
0:36:15example here illustrates that
0:36:19levenshtein features are similar to expectation features but they
0:36:25use stronger ordering constraints
0:36:29the idea behind the levenshtein features is to take the dictionary pronunciation of a word
0:36:34and the units that you've detected
0:36:36in association with that word
0:36:39align them to each other get the edit distance
0:36:42and then create one feature for each kind of edit that you've had to make
0:36:46so to follow along in this example where we expect accord and we see k ao r
0:36:51we have a substitution of the first phone a match of the k a match of the ao and the r and a delete of
0:36:57the d
0:36:58and again presumably we can learn that matching a k is a good thing and that it has a positive weight
0:37:04by seeing one set of words in the training data and then use that
0:37:09to evaluate hypotheses of new words
0:37:13at test time where we haven't seen those particular words but they use these subword units
0:37:20whose feature values we've already learned
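[editor's note: a minimal python sketch of the levenshtein features; it aligns the dictionary pronunciation against the detected units with a standard edit-distance dp and emits one feature per alignment step; the phone strings are illustrative and this is not the scarf implementation]

```python
def levenshtein_features(expected, detected):
    """return one (edit_kind, unit) feature per alignment step"""
    n, m = len(expected), len(detected)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): cost[i][0] = i
    for j in range(m + 1): cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i-1][j-1] + (expected[i-1] != detected[j-1]),
                             cost[i-1][j] + 1,   # delete an expected unit
                             cost[i][j-1] + 1)   # insert a detected unit
    feats, i, j = [], n, m                        # trace back the alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + (expected[i-1] != detected[j-1]):
            kind = "match" if expected[i-1] == detected[j-1] else "substitute"
            feats.append((kind, expected[i-1])); i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i-1][j] + 1:
            feats.append(("delete", expected[i-1])); i -= 1
        else:
            feats.append(("insert", detected[j-1])); j -= 1
    return feats[::-1]

# expecting a pronunciation of accord but detecting only three of the units
print(levenshtein_features("ax k ao r d".split(), "k ao r".split()))
# [('delete', 'ax'), ('match', 'k'), ('match', 'ao'), ('match', 'r'), ('delete', 'd')]
```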
0:37:25okay the baseline feature is kind of an important feature i wanna mention it here
0:37:29i think many people in the room have had the experience of taking a system having a an interesting idea
0:37:37very novel scientific thing to try out
0:37:41doing it adding it in and it gets worse
0:37:44and the idea behind the baseline feature is that we wanna take sort of the hippocratic oath
0:37:50where we're gonna do no harm we're gonna have a system where you can add information to it
0:37:55and not go backward
0:37:58so we're gonna make it so that you can build on the best system that you have
0:38:02by treating the output of that system as a word detector stream the detection of words
0:38:08and then defining a feature this baseline feature that sorta stabilises the system
0:38:13the definition of the baseline feature is that if you look at an arc
0:38:18that you're hypothesizing
0:38:20and you look at what words you've detected underneath it you get a plus one for the baseline feature
0:38:26if the hypothesized word covers exactly one baseline detection and the words are the same and otherwise you get a
0:38:33minus one for this feature
0:38:36here's an example of that
0:38:38in the lattice path the sample path that we're evaluating is random light sort cardamom
0:38:46the baseline system output was randomly sort cards man detected at these vertical lines that you see here
0:38:54so when we compute the baseline feature we take the first arc random and we say how many words does
0:38:59it cover
0:39:00one that's good is it the same word no minus one
0:39:04then we take light we say how many words does it cover none
0:39:08so that's going to get a minus one then we take sort we say how many words does it cover
0:39:12one
0:39:13is it the same yes okay we get a plus one there and finally cardamom covers two words
0:39:19not one like it's supposed to so we get a minus one also
0:39:23it turns out if you think about this you can see that
0:39:26the way to optimize the baseline score is to output exactly as many words as the baseline system has output
0:39:33and to make their identities
0:39:35exactly the same as the baseline identities
0:39:38so if you give the baseline feature high enough weight the baseline output is guaranteed
0:39:43in practice of course you don't just set that weight manually you add the feature to the system with all
0:39:48the other features and learn its weight along with the rest
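[editor's note: a minimal python sketch of the baseline feature as defined above; plus one if the hypothesized arc covers exactly one baseline word detection with the same identity and minus one otherwise; the interval representation and toy times are invented for illustration]

```python
def baseline_feature(hyp_word, hyp_span, detections):
    """detections is a list of (word, midpoint_time) and hyp_span is (start, end)"""
    start, end = hyp_span
    covered = [w for (w, t) in detections if start <= t < end]
    return 1.0 if covered == [hyp_word] else -1.0

# toy version of the slide's example: the hypothesis random light sort cardamom
# scored against the baseline detections randomly sort cards man
dets = [("randomly", 5), ("sort", 20), ("cards", 30), ("man", 35)]
for word, span in [("random", (0, 10)), ("light", (10, 15)),
                   ("sort", (15, 25)), ("cardamom", (25, 40))]:
    print(word, baseline_feature(word, span, dets))
# random -1.0   light -1.0   sort 1.0   cardamom -1.0
```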
0:39:52okay i'd like to move on now to some experimental results
0:39:56and the first of these has to do with using multi-phone detectors detecting multi-phone units
0:40:03in the context of voice search there's nothing special about voice search here it just happens to be the application
0:40:09we were using
0:40:11the idea is to try to empirically find multi-phone units
0:40:16sequences of phones that tell us a lot about words
0:40:19then to train an hmm system
0:40:22whose units
0:40:23are these multi-phone units do decoding with that hmm system and take its output as a sequence of multi-phone
0:40:30detections
0:40:31we're gonna put that detector stream then into the scrf
0:40:36the main question here is what are good phonetic subsequences to use
0:40:41and we're gonna start by using every subsequence that occurs in the dictionary as a candidate
0:40:47the expression for the mutual information between the unit u j and the word
0:40:53w is given by this big
0:40:55big mess that you see here
0:40:57and the important thing to take away is that there is a tradeoff it turns out that you want
0:41:02words that occur in about half i'm sorry you want units that occur in about half of the words so
0:41:08that when you get one of these binary detections you actually get a full bit of information
0:41:14and from that standpoint phones come close
0:41:17but you also need units that can be reliably detected because the best unit in the world isn't gonna do
0:41:24you any good if you can't actually detect it and from that point of view long units are better
0:41:29turns out that if you do a phone decoding of the data you can then compile statistics and choose the
0:41:34units that are best
0:41:36and my colleague patrick nguyen and i followed a research stream along those lines and you can look at this paper
0:41:45for details
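[editor's note: the big mess on the slide is not in the transcript; presumably it is the mutual information between the binary detection of unit u_j and the word identity w whose generic form is:]

$$ I(U_j; W) \;=\; \sum_{u \in \{0,1\}} \sum_{w} P(u, w)\,\log\frac{P(u, w)}{P(u)\,P(w)} $$

[a binary detection carries at most one bit which is achieved when the unit fires for about half of the words hence the tradeoff described above]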
0:41:47if you do this and look at what are the most informative units in this particular voice search task you
0:41:52see something sort of interesting
0:41:54some of them are very short like an r
0:41:57but then some of them are very long like california
0:42:01and so we get these units that sometimes are short and frequent and sometimes long and california is still pretty
0:42:08frequent but it's less frequent
0:42:12okay so what happens if we use multi-phone units
0:42:14we started with the baseline system that was about thirty seven percent
0:42:19if we added phone detections that dropped it by about a percent
0:42:24if we use multi-phone units instead of phone units
0:42:28that turns out to be better so that was gratifying that using these multi-phone units instead of the simple phone
0:42:34units actually made a difference
0:42:36and then if you use phones and multi-phones together it works
0:42:38a little bit better
0:42:40if you use phone and multi-phone units with the three best units that were detected it's a little bit
0:42:45better yet
0:42:47and finally when we did discriminative training
0:42:50that added a little bit more
0:42:53and so what you see here is that it is actually possible to exploit somewhat redundant information in this kind of
0:42:59a framework
0:43:02the next kind of features i want to talk about are template features and this is work that was done
0:43:08in the two thousand ten johns hopkins workshop
0:43:12on wall street journal by my colleagues kris demuynck and dirk van compernolle
0:43:18in order to understand that work i need to say just a little bit about
0:43:23how
0:43:24a baseline template system works
0:43:28that is about the baseline template system that's used at leuven university
0:43:34so the idea here is that you have a big speech database
0:43:37and you do forced alignment of all the utterances those utterances are rows in that top picture
0:43:43and for each phone you know where its boundaries are
0:43:47and that's what those square boxes are those are phone boundaries
0:43:50and you get a new utterance like the utterance that you see at the bottom
0:43:54and you try to explain it by going into this
0:43:57database that you have and pulling out phone templates
0:44:01and then doing an alignment of those phone templates to the new speech such that you cover the whole of
0:44:06the new utterance
0:44:08since the original templates come with phone labels you can then read off the phone sequence
0:44:17okay so suppose we have a system like that set up is it possible to use features
0:44:22that are created from templates in this sort of scrf framework
0:44:26and it turns out that you can and there are sort of interesting kinds of features
0:44:32that you can have
0:44:33so the idea is to create features
0:44:36based on the template matches that explain a hypothesis what you see at the upper left is a hypothesis of
0:44:42the word the
0:44:44and we further aligned it so that we know where the first phone dh is and the second phone iy
0:44:50is
0:44:51then we go into the database we find all the close matches to those phones
0:44:56so template number thirty five was a good match number four hundred twenty three was a good match
0:45:02number one thousand two no twelve thousand eleven was a good match and so forth
0:45:08so given all those good matches what are some features that we can get
0:45:11one of these features is a word id feature
0:45:14what's the fraction of the templates that you see stacked up here that actually came from the word that we're
0:45:20hypothesizing the
0:45:22another question is position consistency if the phone is word-initial like the dh
0:45:28what fraction of the
0:45:30the templates were word-initial in the original data that's another interesting feature
0:45:36speaker id entropy are all the close matches just from one speaker that would be a bad thing because potentially
0:45:43it's a fluke
0:45:45and degree of warping if you look at how much you have to warp those examples to get them to
0:45:50fit what's the average warp scale those are all features that provide some information that you can put
0:45:55into the system
0:45:56and kris demuynck wrote a nice icassp paper that describes this in detail
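[editor's note: a minimal python sketch of the template metadata features just described computed from the close template matches for one hypothesized phone; the match record fields are invented for illustration]

```python
import math
from collections import Counter

def template_features(matches, hyp_word, hyp_word_initial):
    """matches is a list of dicts with word, word_initial, and speaker fields"""
    n = len(matches)
    # word id: fraction of close matches that came from the hypothesized word
    word_id = sum(m["word"] == hyp_word for m in matches) / n
    # position consistency: fraction whose word position matches the hypothesis
    position = sum(m["word_initial"] == hyp_word_initial for m in matches) / n
    # speaker id entropy: low entropy means the matches may be a one-speaker fluke
    counts = Counter(m["speaker"] for m in matches)
    speaker_entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"word_id": word_id, "position": position,
            "speaker_entropy": speaker_entropy}

matches = [{"word": "the", "word_initial": True, "speaker": "spk1"},
           {"word": "this", "word_initial": True, "speaker": "spk2"},
           {"word": "the", "word_initial": True, "speaker": "spk2"}]
print(template_features(matches, hyp_word="the", hyp_word_initial=True))
```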
0:46:03if we look at the results there we started from a baseline template system at eight point two percent
0:46:10adding the template metadata features provided an improvement
0:46:14to seven point six percent
0:46:16if we then add hmm scores we get to six point eight percent i have to say there that the hmm
0:46:22by itself actually was seven point three so that's
0:46:25seven point three sort of the baseline
0:46:27and then adding phone detectors dropped it down finally to six point six percent and this is actually a very good
0:46:34number for the open vocabulary twenty K test set
0:46:39and again this is showing the effective use
0:46:41of multiple information sources
0:46:45okay the last
0:46:46experimental result i'd like to go over
0:46:50is a broadcast news system that we worked on also at the twenty ten clsp workshop
0:46:57i don't have time to go into detail on all the particular information sources that went into this
0:47:03i just want to call out a few things
0:47:05so IBM was kind enough to donate their attila system for use in creating a baseline system
0:47:12that constrained the search space
0:47:16we created a word detector system
0:47:19at microsoft research that created these word detections that you see here detector streams
0:47:24there were a number of real-valued
0:47:28information sources here aren jansen had a point process model that he worked on justine kao worked on a duration
0:47:35model les atlas and some of his students had some scores based on modulation features those provided real-valued
0:47:42feature scores such as you see here
0:47:45and then fei sha had some deep neural net phone detectors
0:47:49and samuel thomas looked at
0:47:52the use of mlp phoneme detections and those provided the discrete detection streams that you see at the very bottom
0:47:59there
0:48:00if we look at the results let's just move over to the test results the baseline system that we
0:48:07built had a fifteen point seven percent word error rate
0:48:11if we did that training with the scarf baseline feature there was a small improvement there i think that has
0:48:17to do with the dynamic range of the baseline feature plus minus one versus the dynamic range of the original
0:48:23likelihood
0:48:24adding the word detectors provided about a percent adding the other feature scores added a bit more
0:48:31and altogether we got about a nine point six percent
0:48:34relative improvement or about twenty seven percent of the gain possible given the lattices and again this indicates that
0:48:42you can take multiple kinds of information put it into a system like this
0:48:47and then move in the right direction
0:48:51okay i want to just quickly go over a couple of research challenges i won't spend much time here because
0:48:57research challenges are things that haven't been done and people are gonna do what they're gonna do anyway
0:49:03but i'll just mention a few things that seem like they might be interesting
0:49:06one of them would be to use an scrf
0:49:10to boost hmms
0:49:12the motivation for this is that the use of the word detectors in the broadcast news system was
0:49:18actually very effective we tried combination with rover and that didn't really work but we were able
0:49:25to use it with this log-linear weighting
0:49:28so the question is can we use scrfs
0:49:30in a more general boosting loop
0:49:33the idea would be to train the system
0:49:35take its output take the word-level output
0:49:38reweight the training data according to the boosting algorithm upweighting the regions where we have mistakes
0:49:45train a new system
0:49:47and then treat the output of that system as a new detector stream to add in to the overall
0:49:54boosted system
0:49:56the second question is the use of spectro-temporal receptive field models as detectors
0:50:03previously we've worked with hmm systems as detectors i think it would be interesting to try to train S T R
0:50:09F
0:50:10models
0:50:11to
0:50:13work as detectors and provide these
0:50:17detection streams
0:50:18one way of approaching that would be to take a bunch of examples of phones or multi-phone units in-class
0:50:24examples and out-of-class examples for example
0:50:27and train a maximum entropy classifier to make the distinction
0:50:33and use the weight matrix of the maxent classifier essentially as a learned spectro-temporal receptive field
0:50:41the last
0:50:43idea that i'll throw out is to try to make much larger scale use of template
0:50:49information than we've used so far we saw the wall street journal results
0:50:55that i discussed a few moments ago
0:50:58and i think
0:50:59maybe we could take that further for example in voice search systems we have an endless stream of data that
0:51:04comes in
0:51:06and we can keep transcribing some so we get more and more examples
0:51:09of phones and words and sub-word units and so forth
0:51:12and could we take some of those same features that were described previously and use them on a
0:51:18much larger scale as they come in on an ongoing basis
0:51:24okay so i'd like to conclude here
0:51:26i've talked today about segmental log-linear models specifically segmental conditional random fields
0:51:33i think these are a flexible framework for testing novel scientific ideas
0:51:39in particular they allow you to integrate diverse information sources different types of information at different granularities at the
0:51:50word level at the phone level at the frame level
0:51:53information that comes in at variable quality levels some can be better than others
0:51:57potentially redundant
0:51:59information sources and generally speaking much more than what we're currently using
0:52:05and finally i think there's a lot of interesting research left to do in this area
0:52:10so thank you
0:52:18okay we have time for some questions
0:52:20we do have
0:52:22and please if you want to step up to the mic actually speaking close to the microphone is
0:52:26actually very helpful
0:52:41so in segmental models there's an issue of normalization
0:52:45because you're comparing hypotheses with different numbers of segments and so
0:52:49there's an issue of how you make sure that you know
0:52:53hypotheses with fewer segments aren't favored over
0:52:56ones with longer segments i was wondering how
0:52:58you deal with that yeah good question and we deal with it because when you do training you have to normalize
0:53:06by considering all possible segmentations in the denominator
0:53:12so when you do training you know how many segments there are you know how many words there are in the
0:53:18training hypothesis that gives you a fixed number like maybe ten
0:53:23and then you have this normaliser
0:53:25where you have to consider all possible segmentations
0:53:29and if the system has a strong bias
0:53:31say towards
0:53:33segmentations that only had one segment because there were fewer scores
0:53:37that wouldn't work because
0:53:40your denominator then would assign high weights
0:53:44to the wrong segmentations it wouldn't assign high weight to this thing in the numerator
0:53:50that has ten segments it would assign high weight
0:53:53to the
0:53:56hypotheses that just had a single segment in the denominator and the objective function would be bad
0:54:03and
0:54:05training would take care of that because maximizing the objective function the conditional likelihood of the training data it would have
0:54:12to assign parameter values
0:54:14and so that it didn't have that particular bias
0:54:21in one of your slides you were saying that you train a discriminative kind of language model implicitly by building
0:54:27it into the joint training
0:54:28yeah so my question is but then it limits things
0:54:31the thing is in order to train the model we need to have data that is not
0:54:35just acoustically annotated data
0:54:37but usually for language modeling we have huge amounts of text data
0:54:43for which we may not have corresponding acoustic
0:54:46features
0:54:48so if we were to train a big language model from just text how do we incorporate it
0:54:55yeah
0:54:56so i think the way to do that
0:54:58is to
0:55:01annotate the lattice
0:55:03with the language model score
0:55:06that you get from this language model you train on lots and lots of data
0:55:11so that score's gonna get into the system
0:55:14then have a second language model that you could think of as sort of a corrective language model
0:55:20that is trained only on the data for which you have acoustics
0:55:26and
0:55:28add those
0:55:29features in
0:55:31in addition to the language model score from the basic language model
0:55:39one more question
0:55:41is it just one pass decoding
0:55:44i mean from what i understand you take lattices in to constrain your search space
0:55:48but then what if i have a language model which is much more complicated than an n-gram and i wish to
0:55:53do rescoring
0:55:54is it possible to output lattices or lattice structures
0:55:57out of the decoder
0:56:00there's a question of in theory and in the particular
0:56:05implementation that we've made in the particular implementation that we've made no it takes lattices in and it produces one
0:56:12best
0:56:13there's nothing about the theory or the framework that says you can't take lattices in and produce lattices out and
0:56:21i was just curious about the toolkit yeah okay
0:56:29yeah i just have one question
0:56:31i think it's a good idea to combine the different sources of information but this can also be done
0:56:36in a much simpler model right without using the concept of the segments
0:56:41you introduce the segments here and what is the real benefit
0:56:45of that
0:56:47so i think the benefit is
0:56:52features that you can't express
0:56:55if you don't have the concept of the segment
0:56:58an example of a feature where you need segment boundaries probably the simplest example is say a word duration model
0:57:06you really need to talk about when the word starts
0:57:09and when the word ends
0:57:11another example where i think it's useful is in template matching if there's a hypothesis and you wanna have a
0:57:19feature of the form
0:57:23what is the dtw distance
0:57:26to the closest
0:57:27example in my training database
0:57:30of this word that i'm hypothesizing
0:57:34it helps if you have a boundary to start that dtw alignment and a boundary to end that dtw alignment
0:57:42so i think the answer to the question is that
0:57:44by reasoning explicitly about segmentations
0:57:49you can incorporate features
0:57:52that
0:57:53you can't incorporate if you reason only about frames
0:57:57but features like that you're incorporating in a very heuristic way
0:58:01in some of the simpler models with three levels
0:58:05with all the
0:58:07information just combined
0:58:09mapping
0:58:10the whole thing
0:58:11and training it
0:58:13again can we do that
0:58:15i
0:58:20so my own personal philosophy is that if you care about features if you care about
0:58:27information
0:58:29where the natural measure of that information is in terms of segments
0:58:34then you're better off
0:58:37explicitly reasoning in terms of those units in terms of segments
0:58:41than somehow trying to
0:58:44implicitly or through the back door
0:58:47encode that information in some other way
0:58:59how many sorts of segments have you tried have you tried syllables for example
0:59:04i would imagine because many syllables are also monosyllabic words that you might see some confusion
0:59:12in your word models
0:59:14i
0:59:15syllables
0:59:17right i didn't mention this as a as a research direction but one thing i'm really interested in is being
0:59:25able to do decoding from scratch with the segmental model like this
0:59:30i also didn't go into detail about the computational
0:59:34burden of using these models
0:59:38but it turns out that it's
0:59:40proportional to the size of your vocabulary
0:59:43so if you wanted to do bottom-up decoding from scratch without reference
0:59:49to some initial lattices or an external system
0:59:52you need to use subword units for example syllables which are on the order of some thousands
0:59:58or
1:00:00even better phones and for phones we actually have
1:00:03begun some initial experiments
1:00:06with doing bottom-up phone recognition actually just at the segment level with the pure segmental model where we just by
1:00:14brute force consider
1:00:16all possible segments and all possible phones
1:00:22okay let's thank the speaker