0:00:13okay
0:00:14welcome to the morning session on acoustic modeling
0:00:19we'll start off with a talk by geoff zweig
0:00:21before we start let me introduce him
0:00:23actually i'm really happy to introduce someone i've known since he was this high
0:00:27but who has grown a lot since he was a graduate student
0:00:32anyway
0:00:35this will be followed by the poster session on acoustic models
0:00:40he was at berkeley where he really did an amazing job and was already interested in
0:00:46trying crazy different models which is something i've always liked
0:00:51he went on to IBM to work
0:00:54and he continued there to work on graphical models
0:00:59not only working on the theory but also
0:01:01working on implementations
0:01:04and he got sucked into a lot of darpa meetings
0:01:08as have many of us
0:01:09and he moved on from there to microsoft where he's been since two thousand six
0:01:13so he's well known in the field now both for
0:01:16principled
0:01:18developments and also for implementations that have been useful for the community
0:01:24so i'm happy
0:01:26to have jeff up here to give a talk on the very interesting idea of segmental conditional random fields
0:01:41i thank you very much
0:01:46okay so i'd like to start today with a very high level description of what the theme of
0:01:55the talk is going to be
0:01:57and i tried to put a little bit of thought in advance into what would be a good sort
0:02:02of pictorial metaphor a pictorial representation of what the talk would be about and also something that is
0:02:11fitting to the beautiful location that we're in today
0:02:16when i did that i decided the best thing that i could come up with was this picture that you
0:02:21see here of a nineteenth century clipper ship
0:02:25and these are sort of very interesting things they were basically the space shuttle
0:02:30of their day they were designed to go absolutely as fast as possible making trips to say
0:02:37london or boston
0:02:40and when you look at the ship there you see that they put a huge amount of thought and engineering
0:02:46into its design
0:02:48and in particular if you look at those sails they didn't sorta just build a ship and then put one
0:02:54big sail up on top of it instead what they did was they tried in many ways
0:03:01to harness sort of every aspect every facet of the wind
0:03:05that they could that they possibly could and so they have sails positioned in all different ways they have
0:03:11some rectangular sails they have some triangular sails they have the sort of funny sail that you see
0:03:18there back at the end
0:03:20and the idea here is to really pull out absolutely all the energy that you can get from the wind
0:03:26and then drive this thing forward
0:03:29that relates to what i'm talking about today which is speech recognition systems that in a similar way harness together
0:03:37a large number of information sources to try to drive the speech recognizer forward in a faster and better
0:03:43way
0:03:44and this is going to lead to a discussion of log-linear models
0:03:48segmental models and then their synthesis
0:03:52in the form of segmental conditional random fields
0:03:57here's an outline of the talk
0:03:59i'll start with some motivation of the work
0:04:03i'll go into the mathematical details
0:04:05of segmental conditional random fields starting with hidden markov models
0:04:09and then progressing through a sequence of models that lead to the scrf
0:04:14i'll talk about a specific implementation that my colleague patrick nguyen and i put together this is the scarf
0:04:22toolkit i'll talk about the language modeling that's implemented there which is sort of interesting
0:04:28the inputs to the system and then the features that it generates from them
0:04:33i'll present some experimental results some research challenges and a few concluding remarks
0:04:41okay so the motivation of this work is that state-of-the-art speech recognizers really look at speech sort of frame-by-frame
0:04:51we go we extract our speech frames every ten milliseconds
0:04:55we extract the features usually one kind of feature for example plps or mfccs
0:05:02and send those features into a time synchronous
0:05:06recognizer that processes them and outputs words
0:05:10i'm going to be the last person in the room to underestimate the power of that basic model and how
0:05:18well it can perform how good a performance you can get from working with that kind of model
0:05:24and doing a good job in terms of the basics of it and so a very good question to
0:05:29ask
0:05:30is how to improve that model in some way
0:05:35but that is not the question that i'm going to ask today
0:05:39instead i'm going to ask a different question i should say i will re-ask
0:05:45a question because this is something that a number of people have looked at in the past
0:05:51and this is whether or not we could do better with a more general model
0:05:55and in particular the questions i'd like to look into are whether we can move from a frame-wise analysis
0:06:02to a segmental analysis
0:06:05from the use of real-valued feature vectors
0:06:08such as mfccs and plps
0:06:11to more arbitrary feature functions
0:06:13and if we can design a system around the synthesis
0:06:19of disparate information sources
0:06:22what's going to be new in this
0:06:24is doing it in the context of log-linear modeling
0:06:28and it's going to lead us to a model like the one that you see at the bottom of the
0:06:33picture here
0:06:35so in this model we have basically a two-state a two layer model i should say
0:06:40at the top layer we are going to end up with states these are going to be segmental states representing
0:06:47stereotypically words
0:06:49and then at the bottom layer we'll have a sequence of observation streams we'll have many observation streams
0:06:55and these
0:06:58each provide some information there can be many different kinds of information sources for example the detection of a
0:07:06phoneme the detection of a syllable the detection of an energy burst a template match score
0:07:12all kinds of different information coming in through these multiple observation streams
0:07:17and because they're general like detections
0:07:21they're not necessarily frame synchronous and you can have variable numbers
0:07:26in a fixed amount of time across the different streams
0:07:30and we'll have a log-linear model that relates
0:07:33the states that we're hypothesizing to the observations that are hanging down below each state and
0:07:41blocked into words
0:07:46okay so i'd like to move on
0:07:48and now discuss
0:07:50scrfs mathematically starting first from hidden markov models
0:07:56so here's a depiction of a hidden markov model i think we're all familiar with this
0:08:01the key thing that we're getting here is an estimation of the probability of the state sequence
0:08:10given an observation sequence in this model states usually represent context-dependent phones or sub-states of context-dependent phones
0:08:20and the observations are most frequently spectral representations such as mfccs or plps
0:08:27the probability is given by the expression that you see there where we go frame by frame
0:08:32and multiply in transition probabilities the probability of a state at one time given the previous state
0:08:39and then observation probabilities the probability of an observation at a given time given that state
0:08:45and those observation probabilities are most frequently gaussians on mfcc or plp features
0:08:52whereas in hybrid systems you can also use neural net posteriors as input to the
0:08:59to the likelihood computation
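[editor's note: the expression on the slide is not captured in the transcript; a standard reconstruction of the hmm probability being described with state sequence q and observation sequence o is:]

$$ p(q, o) \;=\; \prod_{t=1}^{T} \; p(q_t \mid q_{t-1}) \; p(o_t \mid q_t) $$

[for a fixed observation sequence maximizing this joint score over q is equivalent to maximizing the probability of the state sequence given the observations]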
0:09:04okay so i think the first
0:09:06sort of
0:09:07big step away conceptually from the hidden markov model is maximum entropy markov models
0:09:15and these were first investigated by adwait ratnaparkhi in the mid nineties in the context of
0:09:20part-of-speech tagging
0:09:22for natural language processing
0:09:26and then generalized or formalized by mccallum and his colleagues in two thousand
0:09:32and then there were some seminal applications of these to speech recognition by jeff kuo and yuqing gao
0:09:40in the mid two thousands
0:09:43the idea behind these models
0:09:45is to ask the question what if we don't condition the observation on the state but instead condition the state
0:09:52on the observation
0:09:54so if you look at the graph here what's happened is the arrow instead of going down is going up
0:09:59and we're conditioning a state at a given time J on the previous state and the current observation
0:10:06states are still context-dependent phone states as they were before
0:10:11but what we're gonna get out of this whole operation is the ability to have potentially much richer observations
0:10:19than mfccs down here
0:10:22the probability of the state sequence given the observations for a memm is given by this expression here
0:10:29where we go through time frame by time frame and compute the probability of the current state given the previous
0:10:35state
0:10:35and the and the current observation
0:10:39how do we do that
0:10:40the key to this is to use
0:10:43a
0:10:45small little maximum entropy model
0:10:48and apply it at every time frame
0:10:51so what this maximum entropy model does
0:10:54is primarily
0:10:56computes some feature functions
0:11:00that relate the state at the
0:11:02previous time to the state at the current time
0:11:05and the observation at the current time
0:11:07those feature functions can be arbitrary functions they can return a real number or a binary number and they can
0:11:14do an arbitrary computation
0:11:17they get weighted by lambdas
0:11:19those are the parameters of the model summed over all the different kinds of features that you have and then
0:11:24exponentiated
0:11:26it's normalized by the sum over all possible ways that you could assign values to the state there with the
0:11:33same sort of expression
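[editor's note: a hedged reconstruction of the memm expression just described with feature functions f_i and weights λ_i and a local maxent model normalized at each frame:]

$$ P(q \mid o) \;=\; \prod_{t} P(q_t \mid q_{t-1}, o_t) \;=\; \prod_{t} \frac{\exp\big(\sum_i \lambda_i f_i(q_{t-1}, q_t, o_t)\big)}{\sum_{q'} \exp\big(\sum_i \lambda_i f_i(q_{t-1}, q', o_t)\big)} $$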
0:11:36and this is doing two things again
0:11:38the first is gonna let us have arbitrary feature functions that we use
0:11:43rather than say gaussian mixtures
0:11:45and it's inherently discriminative in that it has this normalisation factor here
0:11:53i'm gonna talk a lot about features and so i wanna make sure that we're on the same page in
0:11:58terms of what exactly i mean by features and feature functions
0:12:02features by the way are distinct from observations observations are things you actually see and then the features
0:12:09are numbers that you compute using those observations as input
0:12:16a nice way of thinking about the features is as a product of a state component and a linguistic component
0:12:24i'm sorry a state component and then an acoustic component
0:12:28and i've illustrated a few possible state functions and acoustic functions
0:12:34in this table and then the features the kind of features that you extract from that
0:12:40so one very simple
0:12:42function is to ask the question is the current state
0:12:47y what's the current phone or what's the current context-dependent phone what's the value of that and just
0:12:53use a constant for the acoustic function
0:12:56and you multiply those together and you have a binary feature
0:12:59it's either
0:13:01the state is either this thing y or it's not zero or one
0:13:04and the weight that you learn on that is essentially a prior on that particular context-dependent state
0:13:12a full transition function would be the previous state was x
0:13:17and the current state is y the previous phone is such and so and the current phone is such and so
0:13:22we don't pay attention to the acoustics we just use one and that gives us a binary function that says
0:13:27what the transition is
0:13:29we get little bit more interesting features when we start actually using the acoustic function
0:13:33so one example of that is to say the state function is the current state is such and so
0:13:41oh and by the way when i take my observation and plug it into my voicing detector that comes out
0:13:46either yes it's voiced or no it's not voiced and i get a binary feature when i multiply those two
0:13:51together
0:13:53yet another example is the state is such and so
0:13:56and i happen to have a
0:13:58a gaussian mixture model for every state and when i plug the observation into the gaussian mixture model for that
0:14:04state i get a score and i multiply the score by the fact that i'm seeing the state
0:14:10and that gives me a real-valued feature function
0:14:13and so forth and so you can get fairly sophisticated feature functions this one down here by the
0:14:19way is the one that kuo and gao used in their memm work where they looked at the rank
0:14:25of a gaussian mixture model
0:14:29the rank of the gaussian mixture model associated with a particular state compared to all the other states in the
0:14:35system
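[editor's note: a minimal python sketch of the state-times-acoustic feature decomposition in the table; this is illustrative only and not the scarf toolkit code and the state names and toy detectors are invented]

```python
# each feature is the product of a state component and an acoustic component
def make_feature(state_fn, acoustic_fn):
    def feature(prev_state, state, observation):
        return state_fn(prev_state, state) * acoustic_fn(observation)
    return feature

# state components
def state_is(y):
    return lambda prev, cur: 1.0 if cur == y else 0.0

def transition_is(x, y):
    return lambda prev, cur: 1.0 if (prev, cur) == (x, y) else 0.0

# acoustic components
constant = lambda obs: 1.0                                # ignore the acoustics
voiced = lambda obs: 1.0 if obs.get("voiced") else 0.0    # binary voicing detector
gmm = lambda obs: obs.get("gmm_score", 0.0)               # real-valued gmm score

# examples from the table: a prior on a state, a transition indicator,
# a binary voicing feature, and a real-valued gmm-score feature
features = [
    make_feature(state_is("ae_2"), constant),
    make_feature(transition_is("ae_1", "ae_2"), constant),
    make_feature(state_is("ae_2"), voiced),
    make_feature(state_is("ae_2"), gmm),
]

# evaluating the features for one hypothesized transition and observation
obs = {"voiced": True, "gmm_score": -3.2}
print([f("ae_1", "ae_2", obs) for f in features])   # [1.0, 1.0, 1.0, -3.2]
```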
0:14:38let's move on to the conditional random field
0:14:40now
0:14:41it turns out that under certain pathological conditions if you use memms you can make a decision early on
0:14:50and the transition structure
0:14:52just so happens to be set up in a way such that you ignore the observations for the
0:14:57rest of the utterance
0:14:59and you run into a problem i think these are pathological conditions but they can theoretically exist
0:15:06and that motivated the development of conditional random field
0:15:10where rather than doing a bunch of the local normalizations making a bunch of local state wise decisions there's one
0:15:18global normalisation over all possible state sequences
0:15:22because there is a global normalisation it doesn't make sense to have arrows in the picture the arrows
0:15:29indicate where you're gonna do the local normalisation and we're not doing a local normalisation
0:15:34so the picture is this
0:15:36the states are as with the maximum entropy model and the observations are also as with the maximum entropy model
0:15:42and the feature functions are as with the maximum entropy model the thing that's different is that when you
0:15:48compute the probability of the state sequence given the observations
0:15:51you normalise
0:15:54not locally but once globally over all the possible ways that you can assign values
0:15:59to those state sequences
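[editor's note: a hedged reconstruction of the globally normalized crf in the same notation as the memm above; the only change is that the sum over competing state sequences moves outside the product over time:]

$$ P(q \mid o) \;=\; \frac{\exp\big(\sum_{t}\sum_i \lambda_i f_i(q_{t-1}, q_t, o_t)\big)}{\sum_{q'} \exp\big(\sum_{t}\sum_i \lambda_i f_i(q'_{t-1}, q'_t, o_t)\big)} $$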
0:16:05that brings me now to the segmental version of the crf which is the main point of the talk
0:16:11so the key difference between the segmental version of the crf and the previous version of the crf
0:16:17is that we're going to take the observations
0:16:21and we're now going to block them into groups that correspond to segments
0:16:25and we're actually gonna make those segments be words
0:16:28conceptually they could be any kind of segment they could be a phone segment or syllable segment but the rest
0:16:33of this talk i'm gonna refer to them as words
0:16:36and for each word we're gonna block together a bunch of observations and associate them concretely with that state
0:16:44those observations again can be more general than mfccs for example they could be phoneme detections or the detection
0:16:51of articulatory features
0:16:54there's some complexity that comes with this model because
0:16:58even when we do training where we know how many words there are we don't know what the segmentation is
0:17:03and so we'd have to consider all possible segmentations of the observations into the right number of words
0:17:10and in this picture here for example we have to consider segmenting seven observations not just as two
0:17:16two and three but maybe moving this guy over here and having three associated with the first word and only
0:17:22one associated with the second word
0:17:24and then three with the last
0:17:26when you do decoding you don't even know how many words there are and so you have to consider both
0:17:31all the possible number of segments and all the possible segmentations
0:17:36given that number of segments
0:17:39this leads to an expression for segmental crfs that you see here
0:17:43it's written in terms of the edges that exist in the top layer of the graph there
0:17:49each edge has a state to its left and a state to its right
0:17:54and it has a group of observations that are linked together underneath it o(e)
0:18:01and the segmentation is denoted by Q
0:18:04with that notation the probability of a state sequence given the observations is given by the expression you see
0:18:11there which is essentially the same as the expression for the regular crf
0:18:15except that now we have a sum over segmentations that are consistent with the number of segments that are hypothesized
0:18:24or known during training
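[editor's note: a hedged reconstruction of the segmental crf expression in the notation just described, where s is the sequence of segment states, q ranges over segmentations with as many segments as s has states, e ranges over the edges of a segmentation, and o(e) is the block of observations under edge e:]

$$ P(s \mid o) \;=\; \frac{\sum_{q\,:\,|q|=|s|} \; \prod_{e} \exp\big(\sum_i \lambda_i f_i(s_{l(e)}, s_{r(e)}, o(e))\big)}{\sum_{s'} \sum_{q\,:\,|q|=|s'|} \; \prod_{e} \exp\big(\sum_i \lambda_i f_i(s'_{l(e)}, s'_{r(e)}, o(e))\big)} $$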
0:18:29okay so that was
0:18:31that was a lot of work to go through to introduce segmental features do we really need to introduce segmental features
0:18:36at all do we get anything from that because after all with the crf the state sequence is
0:18:43conditioned on the observations we've got the observation sitting there in front of us
0:18:47isn't that enough is there anything else you need
0:18:50and i think the answer to that is clearly yes you do need to have boundaries or you get more
0:18:56if you talk about concrete boundaries
0:18:59segment boundaries here are a few examples of that
0:19:03suppose you wanna use template match scores
0:19:06as feature functions for example you have a segment and you ask the question what's the dtw distance between
0:19:13this segment and the closest example of the word that i'm hypothesizing in some database that i have
0:19:20to do that you need to know where you start the alignment and where you end the alignment and you need
0:19:24the boundaries so you get something from that you don't have when you just say here's a big blob of
0:19:29observations
0:19:31similarly word durations if you wanna talk about a word duration model you have to be precise about when the
0:19:36word starts and when the word ends so that the duration is defined
0:19:40turns out to be useful to have boundaries if you're incorporating scores from other models
0:19:45two examples of that are the hmm likelihoods and fisher kernel scores
0:19:50that layton and gales have used
0:19:52and the point process model scores
0:19:55that jansen and niyogi have proposed
0:19:59later in the talk i'll talk about detection subsequences
0:20:03as features and there again we need to know the boundaries
0:20:08okay so before proceeding i'd like to just emphasise that this is really building on a long tradition of work
0:20:15and i want to go over and call out some of the components of that tradition the first is log-linear
0:20:21models that use a frame level markov assumption
0:20:27and there i think the key work was done by jeff kuo and yuqing gao with the maximum entropy markov model
0:20:35theirs really was the first to propose and exercise
0:20:38the power of using general feature functions
0:20:44shortly thereafter
0:20:46hidden or actually more or less simultaneously with that hidden crfs were proposed by gunawardana
0:20:52and his colleagues and then there was a very interesting paper by him and one of his students at
0:20:58last year's asru
0:21:00where essentially an extra hidden variable is introduced into the crf
0:21:04to represent gaussian mixture components
0:21:06with the intention
0:21:08of simulating mmi training in a conventional system
0:21:15jeremy morris and eric fosler-lussier did some fascinating initial work on applying crfs in speech recognition
0:21:25they used features such as neural net attribute posteriors
0:21:30and in particular
0:21:31the detection of sonority voicing manner of articulation and so forth as feature functions that went into
0:21:40the model
0:21:41and they also proposed and experimented with the use of mlp phoneme posteriors as features
0:21:48and proposed the use of something called the crandem model
0:21:51which is essentially a hybrid crf hmm model where the crf phone posteriors are used as the state likelihood functions rather
0:22:01than neural net posteriors in the standard hybrid system
0:22:05the second tradition i'd like to call out is actually the tradition of segmental log-linear models
0:22:11the first use of this was termed semi-crfs by sarawagi and cohen in the development
0:22:19of natural language processing
0:22:22layton and gales proposed something termed the conditional augmented statistical model which is a segmental crf
0:22:29that uses hmm scores and fisher kernel scores
0:22:33zhang and gales proposed the use of structured svms
0:22:37which are essentially segmental crfs with large margin training
0:22:43lehr and shafran have an interesting transducer representation that uses perceptron training and similarly achieves joint acoustic language and
0:22:52duration model training
0:22:54and finally georg heigold
0:22:56and patrick nguyen and i have done a lot of work on flat direct models which are essentially whole sentence maximum
0:23:05entropy
0:23:06acoustic models maxent models at the segment level and you can think of the segmental models i'm talking about today as
0:23:13essentially stringing together a whole bunch of flat direct models one for each segment
0:23:20it's also important to realise that there's significant previous work on just classical segmental modelling and detector based asr
0:23:29the segmental modelling i think comes in sort of two main threads
0:23:33in one of these the likelihoods are based on framewise computations so you have a different number of scores that
0:23:39contribute to each segment
0:23:41and there's a long line of work that was done here by mari ostendorf and her students and a number
0:23:48of other researchers that you see here
0:23:50and then in a separate thread
0:23:52there's a development of using a fixed-length segment representation for each segment
0:23:58that mari ostendorf and her colleagues
0:24:01looked at in the late nineties and then jim glass more recently has worked on and contributed using
0:24:08phone likelihoods in the computation in a way that i think is similar to the normalisation in the scrf
0:24:16framework
0:24:18i'm going to talk about using detections phone detections and multi-phone detections and so i should acknowledge
0:24:24chin-hui lee and his colleagues and their proposal of detector based asr
0:24:30which combines detector information in a bottom-up way to do speech recognition
0:24:38okay so i'm gonna move on now to the scarf implementation a specific implementation of a segmental crf
0:24:44and what this is going to do is essentially extend that tradition that i've mentioned
0:24:48and it's going to extend it with the synthesis of detector based recognition segmental modelling and log-linear modeling
0:24:58it's going to further
0:25:00develop some new features that weren't present before and in particular features termed existence expectation and levenshtein features
0:25:09and then it's going to extend that tradition with an adaptation to large vocabulary speech recognition by fusing finite state language
0:25:18modeling into that segmental framework that i've been
0:25:21talking about
0:25:24okay so let's move on to a specific implementation
0:25:28so this is a toolkit that i developed with patrick nguyen
0:25:32it's available from the web page that you see there you can download it and play around with it
0:25:39and the features that i talk about next
0:25:42are
0:25:43specific
0:25:44to this implementation and they're sort of one way of realizing the general scrf framework and using it for
0:25:53speech recognition where you sort of have to dot all the i's and cross all the t's and make sure that
0:25:58everything works
0:26:02okay so i want to start by talking about how language models are implemented there because it's sort
0:26:08of a tricky issue
0:26:09when i see a model like this
0:26:12i think bigram language model i see two states
0:26:16they're next to each other they're connected by an edge that's like the probability of one state given the preceding
0:26:21state and that looks a whole lot like a bigram language model so is that what we're talking about are we
0:26:26just talking about bigram language models
0:26:29and the answer is no what we're going to do is we're actually going to be able to model long
0:26:37span language model context
0:26:37by making these states
0:26:39refer to states in an underlying finite state language model
0:26:44here's an example of that
0:26:46what you see on the left is a fragment from a finite state language model it's a trigram language model
0:26:52so it has bigram history states
0:26:54for example there's a bigram history state for the dog and similarly for other word pairs
0:27:00and sometimes we don't have all the trigrams in the world so to
0:27:05decode an unseen trigram we need to be able to back off to a lower order history state so for
0:27:11example if we're in the history state the dog we might have to back off to the history state dog
0:27:18the one word history state and then we could decode a word that we haven't seen before in a trigram
0:27:22context like yep and then moved to the history state dog yep
0:27:28finally as a last resort you can back off to the null history state three down there at the bottom
0:27:34and just decode any word in the vocabulary
0:27:38okay so let's assume that we want to decode the sequence the dog yep
0:27:43how would that look
0:27:45we decode the first word the and we end up in state seven here having seen the history
0:27:52the
0:27:54then we decode the word dog
0:27:56that moves us around up to state one we've seen the bigram now the dog
0:28:02now suppose we wanna decode yep
0:28:06to do that
0:28:08so right now we're in state one
0:28:10we've gotten as far as the dog and that's gotten us to state one here
0:28:15and now suppose you want to decode yep we'd have to back off
0:28:19from state one to state two and then we could decode the word yep and end up in state six over
0:28:26here dog yep
0:28:28so what this means is that by the time we get around to decoding the word yep
0:28:34we know a lot more than that
0:28:36the last word was dog we actually know that the previous state was state one which corresponds to the two
0:28:42word history the dog and so this is not a bigram language model that we have here it actually reflects
0:28:48the semantics
0:28:50of the trigram language model that you see in that fragment on the left
0:28:57so there's two ways that we can use this one is to generate a basic language model score if we
0:29:03provide the system with the with the finite state language model then we can just look up the language model
0:29:08cost of transitioning between states and use that as one of the features in the system
0:29:13but more interestingly we can create a binary feature for each arc in the language model
0:29:21now these arcs in the language model are normally labeled with things like bigram probabilities trigram probabilities or back-off
0:29:30probabilities
0:29:31what we're gonna do is we're gonna create a binary feature that just says have i traversed
0:29:36this arc in transitioning from one state to the next
0:29:40so for example when we go from
0:29:42the dog to dog yep we traverse two arcs
0:29:46the arc from one to two and then the arc from two to six
0:29:49the weights
0:29:50the lambdas that we learn in association with those
0:29:54are analogous to the back-off weights and the bigram weights of the normal language model but we're actually learning what
0:30:01those weights are
0:30:03what that means is that when we do training we end up with a discriminatively trained language model and actually
0:30:09a language model that we train in association with the acoustic model training at the same time jointly with the
0:30:16acoustic model training
0:30:18so i think that's sort of an interesting phenomenon
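[editor's note: a minimal python sketch of the language model arc features just described; given a backoff n-gram automaton it collects the arcs traversed when decoding the next word and each fired arc id becomes a binary feature with a learned weight; the data structures and toy fragment are invented for illustration and this is not the scarf code]

```python
from collections import namedtuple

# word_arcs maps (state, word) -> (arc_id, next_state)
# backoff_arcs maps state -> (arc_id, lower_order_state)
LM = namedtuple("LM", ["word_arcs", "backoff_arcs"])

# toy trigram fragment: state 1 = history "the dog", state 2 = history "dog"
lm = LM(
    word_arcs={(2, "yep"): ("arc_2_6", 6)},   # yep is decodable after dog
    backoff_arcs={1: ("arc_1_2", 2)},         # back off from "the dog" to "dog"
)

def traverse(lm, state, word):
    """back off until word can be decoded and return the fired arcs and new state
    (assumes the word is decodable from some reachable history state)"""
    fired = []
    while (state, word) not in lm.word_arcs:
        arc, state = lm.backoff_arcs[state]   # traverse a backoff arc
        fired.append(arc)
    arc, state = lm.word_arcs[(state, word)]  # traverse the word arc
    fired.append(arc)
    return fired, state

# decoding yep from state 1 fires the backoff arc 1->2 and the word arc 2->6
print(traverse(lm, 1, "yep"))   # (['arc_1_2', 'arc_2_6'], 6)
```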
0:30:23okay i'd like to talk about the inputs to the system now
0:30:28the first inputs are detector inputs so a detection is simply a unit and its midpoint
0:30:35an example of that is shown here what we have are phone detections this is from a voice mail system
0:30:41or rather
0:30:41from a voice search system and it looks like the person is asking for burgers the person
0:30:48says the phones
0:30:49of the word
0:30:51burgers
0:30:53and so the way to read this is that we detected the b at time frame seven ninety and the er
0:30:58at a later time frame and so forth and these correspond to the observations that are in red
0:31:05in the illustration here
0:31:07you can also provide dictionaries that specify the expected sequence of detections for each word for example
0:31:15if we're going to decode burgers we expect the b then the er then the g and so forth the pronunciation of the word
0:31:23second input is lattices
0:31:26that constrain the search space
0:31:28the easiest way of getting these lattices is to use a conventional hmm system
0:31:33and use it to just provide
0:31:37constraints on the search space
0:31:37and the way to read this is
0:31:39that from time twelve twenty one to time twenty five sixty a reasonable hypothesis is workings
0:31:48and these times here give us segment boundaries hypothesized segment boundaries and the word gives us
0:31:56possible labelings of the state
0:31:59and we're gonna use those when we actually do the computations to constrain the set of possibilities we have to
0:32:05consider
0:32:07the second kind of lattice input is user-defined features
0:32:11if you happen to have a model that you think provides some measure of consistency between the word that you're
0:32:19hypothesizing and the observations you can plug it in as a user-defined feature like you see here
0:32:25this lattice has a single feature that's been added it's a dynamic time warping feature
0:32:30and the particular one i've got underlined in red here is indicating that the dtw feature value for hypothesizing
0:32:38the word fell
0:32:40between frames nineteen eleven and twenty two sixty is eight point two seven
0:32:45and that feature corresponds to one of the features in the log-linear models that exist on those vertical edges
0:32:54now multiple inputs
0:32:56are very much encouraged and what you see here is a fragment of a lattice file that
0:33:03kris demuynck put together
0:33:05and you can see it's got lots of different feature functions that he's defined
0:33:10and essentially these features are the things that following the metaphor that i started at the beginning
0:33:16are analogous to the sails on the ship that are providing the information and pushing the whole thing forward
0:33:22and that we want to get as many of those
0:33:24as possible
0:33:27okay
0:33:28let's talk about some features that are automatically defined from the inputs
0:33:34the user-defined features are user-defined you don't have to worry about them once you put them in
0:33:38on a lattice
0:33:40if you provide detector sequences there's a set of features that can be automatically extracted and the system will learn
0:33:46the weights of those features those are existence expectation and levenshtein features along with something called the baseline feature
0:33:56so the idea of an existence feature is to measure whether a particular unit
0:34:02exists within the span of the word
0:34:04that you're hypothesizing
0:34:06these are created for all word unit pairs
0:34:10and they have the advantage that you don't need any predefined pronunciation dictionary
0:34:15but they have the disadvantage that you don't get any generalization ability across words
0:34:21here's an example suppose we're hypothesizing the word accord
0:34:25and it spans the detections of the k and the ao and the r
0:34:29i would create a feature that says okay i'm hypothesizing accord
0:34:33and i detected a k in the span that would be an existence feature when you train the model it presumably would
0:34:39get a positive weight because presumably it's a good thing to detect a k if you're hypothesizing the word accord
0:34:47but
0:34:48there's no generalisation ability across words here so that k is a completely different k than the k that you would have
0:34:54if you were hypothesizing accordion and there's no transfer of the weight or smoothing there
0:35:03the idea behind expectation features is to use a dictionary to avoid this and actually get generalization ability across words
0:35:11there's three different kinds of expectation features
0:35:15and i think i'll just go through the examples and describe them
0:35:20so let's take the first one suppose we're hypothesizing accord again and we detected k ao r
0:35:28we have a correct accept
0:35:30of the k because we expect to see it on the basis of the dictionary and we've actually detected it
0:35:37now that feature is very different from the other feature because we can learn that that's a good thing that
0:35:42detecting a k when you expect a k is good in the context of training on the word accord
0:35:48and then use that same feature weight when we detect a k in association with the word accordion or some other
0:35:56word
0:36:00the second kind of expectation feature is a false reject of the unit
0:36:05and the second example is one where we expect to see it but we don't actually detect it
0:36:05finally you can have a false accept of the unit where you don't expect to see it based on your
0:36:09dictionary pronunciation but it shows up there in the things that you've detected
0:36:14and the third
0:36:15example here illustrates that
0:36:19levenshtein features are similar to expectation features but they
0:36:25use stronger ordering constraints
0:36:29the idea behind the levenshtein features is to take the dictionary pronunciation of a word
0:36:34and the units that you've detected
0:36:36in association with that word
0:36:39align them to each other get the edit distance
0:36:42and then create one feature for each kind of edit that you've had to make
0:36:46so to follow along in this example where we expect accord and we see k ao r
0:36:51we have a substitution of the first phone a match of the k a match of the ao and the r and a delete of
0:36:57the d
0:36:58and again presumably we can learn that matching a k is a good thing and that it has a positive weight
0:37:04by seeing one set of words in the training data and then use that
0:37:09to evaluate hypotheses of new words
0:37:13at test time where we haven't seen those particular words but they use these subword units
0:37:20whose feature values we've already learned
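[editor's note: a minimal python sketch of the levenshtein features; it aligns the dictionary pronunciation against the detected units with a standard edit-distance dp and emits one feature per alignment step; the phone strings are illustrative and this is not the scarf implementation]

```python
def levenshtein_features(expected, detected):
    """return one (edit_kind, unit) feature per alignment step"""
    n, m = len(expected), len(detected)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1): cost[i][0] = i
    for j in range(m + 1): cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i-1][j-1] + (expected[i-1] != detected[j-1]),
                             cost[i-1][j] + 1,   # delete an expected unit
                             cost[i][j-1] + 1)   # insert a detected unit
    feats, i, j = [], n, m                        # trace back the alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i-1][j-1] + (expected[i-1] != detected[j-1]):
            kind = "match" if expected[i-1] == detected[j-1] else "substitute"
            feats.append((kind, expected[i-1])); i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i-1][j] + 1:
            feats.append(("delete", expected[i-1])); i -= 1
        else:
            feats.append(("insert", detected[j-1])); j -= 1
    return feats[::-1]

# expecting a pronunciation of accord but detecting only three of the units
print(levenshtein_features("ax k ao r d".split(), "k ao r".split()))
# [('delete', 'ax'), ('match', 'k'), ('match', 'ao'), ('match', 'r'), ('delete', 'd')]
```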
0:37:25okay the baseline feature is kind of an important feature i wanna mention it here
0:37:29i think many people in the room have had the experience of taking a system having a an interesting idea
0:37:37very novel scientific thing to try out
0:37:41doing it adding it in and it gets worse
0:37:44and the idea behind the baseline feature is that we wanna take sort of the hippocratic oath
0:37:50where we're gonna do no harm we're gonna have a system where you can add information to it
0:37:55and not go backward
0:37:58so we're gonna make it so that you can build on the best system that you have
0:38:02by treating the output of that system as a word detector stream the detection of words
0:38:08and then defining a feature this baseline feature that sorta stabilises the system
0:38:13the definition of the baseline feature is that if you look at an arc
0:38:18that you're hypothesizing
0:38:20and you look at what words you've detected underneath it you get a plus one for the baseline feature
0:38:26if the hypothesized word covers exactly one baseline detection and the words are the same and otherwise you get a
0:38:33minus one for this feature
0:38:36here's an example of that
0:38:38in the lattice path the sample path that we're evaluating is random light sort cardamom
0:38:46the baseline system output was randomly sort cards man detected at these vertical lines that you see here
0:38:54so when we compute the baseline feature we take the first arc random and we say how many words does
0:38:59it cover
0:39:00one that's good is it the same word no minus one
0:39:04then we take light we say how many words does it cover none
0:39:08so that's going to get a minus one then we take sort we say how many words does it cover
0:39:12one
0:39:13is it the same yes okay we get a plus one there and finally cardamom covers two words
0:39:19not one like it's supposed to so we get a minus one also
0:39:23it turns out if you think about this you can see that
0:39:26the way to optimize the baseline score is to output exactly as many words as the baseline system has output
0:39:33and to make their identities
0:39:35exactly the same as the baseline identities
0:39:38so if you give the baseline feature high enough weight the baseline output is guaranteed
0:39:43in practice of course you don't just set that weight manually you add the feature to the system with all
0:39:48the other features and learn its weight along with the rest
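[editor's note: a minimal python sketch of the baseline feature as defined above; plus one if the hypothesized arc covers exactly one baseline word detection with the same identity and minus one otherwise; the interval representation and toy times are invented for illustration]

```python
def baseline_feature(hyp_word, hyp_span, detections):
    """detections is a list of (word, midpoint_time) and hyp_span is (start, end)"""
    start, end = hyp_span
    covered = [w for (w, t) in detections if start <= t < end]
    return 1.0 if covered == [hyp_word] else -1.0

# toy version of the slide's example: the hypothesis random light sort cardamom
# scored against the baseline detections randomly sort cards man
dets = [("randomly", 5), ("sort", 20), ("cards", 30), ("man", 35)]
for word, span in [("random", (0, 10)), ("light", (10, 15)),
                   ("sort", (15, 25)), ("cardamom", (25, 40))]:
    print(word, baseline_feature(word, span, dets))
# random -1.0   light -1.0   sort 1.0   cardamom -1.0
```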
0:39:52okay i'd like to move on now to some experimental results
0:39:56and the first of these has to do with using multi-phone detectors detecting multi-phone units
0:40:03in the context of voice search there's nothing special about voice search here it just happens to be the application
0:40:09we were using
0:40:11the idea is to try to empirically find multi-phone units
0:40:16sequences of phones that tell us a lot about words
0:40:19then to train an hmm system
0:40:22whose units
0:40:23are these multi-phone units do decoding with that hmm system and take its output as a sequence of multi-phone
0:40:30detections
0:40:31we're gonna put that detector stream then into the scrf
0:40:36the main question here is what are good phonetic subsequences to use
0:40:41and we're gonna start by using every subsequence that occurs in the dictionary as a candidate
0:40:47the expression for the mutual information between the unit u j and the word
0:40:53w is given by this big
0:40:55big mess that you see here
0:40:57and the important thing to take away is that there is a tradeoff it turns out that you want
0:41:02words that occur in about half i'm sorry you want units that occur in about half of the words so
0:41:08that when you get one of these binary detections you actually get a full bit of information
0:41:14and from that standpoint phones come close
0:41:17but you also need units that can be reliably detected because the best unit in the world isn't gonna do
0:41:24you any good if you can't actually detect it and from that point of view long units are better
0:41:29turns out that if you do a phone decoding of the data you can then compile statistics and choose the
0:41:34units that are best
0:41:36and my colleague patrick nguyen and i followed a research stream along those lines and you can look at this paper
0:41:45for details
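[editor's note: the big mess on the slide is not in the transcript; presumably it is the mutual information between the binary detection of unit u_j and the word identity w whose generic form is:]

$$ I(U_j; W) \;=\; \sum_{u \in \{0,1\}} \sum_{w} P(u, w)\,\log\frac{P(u, w)}{P(u)\,P(w)} $$

[a binary detection carries at most one bit which is achieved when the unit fires for about half of the words hence the tradeoff described above]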
0:41:47if you do this and look at what are the most informative units in this particular voice search task you
0:41:52see something sort of interesting
0:41:54some of them are very short like an r
0:41:57but then some of them are very long like california
0:42:01and so we get these units that sometimes are short and frequent and sometimes long and california is still pretty
0:42:08frequent but it's less frequent
0:42:12okay so what happens if we use multi-phone units
0:42:14we started with the baseline system that was about thirty seven percent
0:42:19if we added phone detections that dropped it by about a percent
0:42:24if we use multi-phone units instead of phone units
0:42:28that turns out to be better so that was gratifying that using these multi-phone units instead of the simple phone
0:42:34units actually made a difference
0:42:36and then if you use phones and multi-phones together it works
0:42:38a little bit better
0:42:40if you use phone and multi-phone units with the three best units that were detected it's a little bit
0:42:45better yet
0:42:47and finally when we did discriminative training
0:42:50that added a little bit more
0:42:53and so what you see here is that it is actually possible to exploit somewhat redundant information in this kind of
0:42:59a framework
0:43:02the next kind of features i want to talk about are template features and this is work that was done
0:43:08in the two thousand ten johns hopkins workshop
0:43:12on wall street journal by my colleagues kris demuynck and dirk van compernolle
0:43:18in order to understand that work i need to say just a little bit about
0:43:23how
0:43:24a baseline template system works
0:43:28that is about the baseline template system that's used at leuven university
0:43:34so the idea here is that you have a big speech database
0:43:37and you do forced alignment of all the utterances those utterances are rows in that top picture
0:43:43and for each phone you know where its boundaries are
0:43:47and that's what those square boxes are those are phone boundaries
0:43:50and you get a new utterance like the utterance that you see at the bottom
0:43:54and you try to explain it by going into this
0:43:57database that you have and pulling out phone templates
0:44:01and then doing an alignment of those phone templates to the new speech such that you cover the whole of
0:44:06the new utterance
0:44:08since the original templates come with phone labels you can then read off the phone sequence
0:44:17okay so suppose we have a system like that set up is it possible to use features
0:44:22that are created from templates in this sort of scrf framework
0:44:26and it turns out that you can and there are sort of interesting kinds of features
0:44:32that you can have
0:44:33so the idea is to create features
0:44:36based on the template matches that explain a hypothesis what you see at the upper left is a hypothesis of
0:44:42the word the
0:44:44and we further aligned it so that we know where the first phone dh is and the second phone iy
0:44:50is
0:44:51then we go into the database we find all the close matches to those phones
0:44:56so template number thirty five was a good match number four hundred twenty three was a good match
0:45:02number one thousand two no twelve thousand eleven was a good match and so forth
0:45:08so given all those good matches what are some features that we can get
0:45:11one of these features is a word id feature
0:45:14what's the fraction of the templates that you see stacked up here that actually came from the word that we're
0:45:20hypothesizing the
0:45:22another question is position consistency if the phone is word-initial like the dh
0:45:28what fraction of the
0:45:30the templates were word-initial in the original data that's another interesting feature
0:45:36speaker id entropy are all the close matches just from one speaker that would be a bad thing because potentially
0:45:43it's a fluke
0:45:45and degree of warping if you look at how much you have to warp those examples to get them to
0:45:50fit what's the average warp scale those are all features that provide some information that you can put
0:45:55into the system
0:45:56and kris demuynck wrote a nice icassp paper that describes this in detail
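[editor's note: a minimal python sketch of the template metadata features just described computed from the close template matches for one hypothesized phone; the match record fields are invented for illustration]

```python
import math
from collections import Counter

def template_features(matches, hyp_word, hyp_word_initial):
    """matches is a list of dicts with word, word_initial, and speaker fields"""
    n = len(matches)
    # word id: fraction of close matches that came from the hypothesized word
    word_id = sum(m["word"] == hyp_word for m in matches) / n
    # position consistency: fraction whose word position matches the hypothesis
    position = sum(m["word_initial"] == hyp_word_initial for m in matches) / n
    # speaker id entropy: low entropy means the matches may be a one-speaker fluke
    counts = Counter(m["speaker"] for m in matches)
    speaker_entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"word_id": word_id, "position": position,
            "speaker_entropy": speaker_entropy}

matches = [{"word": "the", "word_initial": True, "speaker": "spk1"},
           {"word": "this", "word_initial": True, "speaker": "spk2"},
           {"word": "the", "word_initial": True, "speaker": "spk2"}]
print(template_features(matches, hyp_word="the", hyp_word_initial=True))
```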
0:46:03if we look at the results there we started from a baseline template system at eight point two percent
0:46:10adding the template metadata features provided an improvement
0:46:14to seven point six percent
0:46:16if we then add hmm scores we get to six point eight percent i have to say there that the hmm
0:46:22by itself actually was seven point three so that's
0:46:25seven point three sort of the baseline
0:46:27and then adding phone detectors dropped it down finally to six point six percent and this is actually a very good
0:46:34number for the open vocabulary twenty K test set
0:46:39and again this is showing the effective use
0:46:41of multiple information sources
0:46:45okay the last
0:46:46experimental result i'd like to go over
0:46:50is a broadcast news system that we worked on also at the twenty ten clsp workshop
0:46:57i don't have time to go into detail on all the particular information sources that went into this
0:47:03i just want to call out a few things
0:47:05so IBM was kind enough to donate their attila system for use in creating a baseline system
0:47:12that constrained the search space
0:47:16we created a word detector system
0:47:19at microsoft research that created these word detections that you see here detector streams
0:47:24there were a number of real-valued
0:47:28information sources here aren jansen had a point process model that he worked on justine kao worked on a duration
0:47:35model les atlas and some of his students had some scores based on modulation features those provided real-valued
0:47:42feature scores such as you see here
0:47:45and then fei sha had some deep neural net phone detectors
0:47:49and samuel thomas looked at
0:47:52the use of mlp phoneme detections and those provided the discrete detection streams that you see at the very bottom
0:47:59there
0:48:00if we look at the results let's just move over to the test results the baseline system that we
0:48:07built had a fifteen point seven percent word error rate
0:48:11if we did that training with the scarf baseline feature there was a small improvement there i think that has
0:48:17to do with the dynamic range of the baseline feature plus minus one versus the dynamic range of the original
0:48:23likelihood
0:48:24adding the word detectors provided about a percent adding the other feature scores added a bit more
0:48:31and altogether we got about a nine point six percent
0:48:34relative improvement or about twenty seven percent of the gain possible given the lattices and again this indicates that
0:48:42you can take multiple kinds of information put it into a system like this
0:48:47and then move in the right direction
0:48:51okay i want to just quickly go over a couple of research challenges i won't spend much time here because
0:48:57research challenges are things that haven't been done and people are gonna do what they're gonna do anyway
0:49:03but i'll just mention a few things that seem like they might be interesting
0:49:06one of them would be to use an scrf
0:49:10to boost hmms
0:49:12the motivation for this is that the use of the word detectors in the broadcast news system was
0:49:18actually very effective we tried combination with rover and that didn't really work but we were able
0:49:25to use it with this log-linear weighting
0:49:28so the question is can we use scrfs
0:49:30in a more general boosting loop
0:49:33the idea would be to train the system
0:49:35take its output take the word-level output
0:49:38reweight the training data according to the boosting algorithm upweighting the regions where we have mistakes
0:49:45train a new system
0:49:47and then treat the output of that system as a new detector stream to add in to the overall
0:49:54boosted system
0:49:56the second question is the use of spectro-temporal receptive field models as detectors
0:50:03previously we've worked with hmm systems as detectors i think it would be interesting to try to train S T R
0:50:09F
0:50:10models
0:50:11to
0:50:13work as detectors and provide these
0:50:17detection streams
0:50:18one way of approaching that would be to take a bunch of examples of phones or multi-phone units in-class
0:50:24examples and out-of-class examples for example
0:50:27and train a maximum entropy classifier to make the distinction
0:50:33and use the weight matrix of the maxent classifier essentially as a learned spectro-temporal receptive field
0:50:41the last
0:50:43idea that i'll throw out is to try to make much larger scale use of template
0:50:49information than we've used so far we saw the wall street journal results
0:50:55that i discussed a few moments ago
0:50:58and i think
0:50:59maybe we could take that further for example in voice search systems we have an endless stream of data that
0:51:04comes in
0:51:06and we can keep transcribing some so we get more and more examples
0:51:09of phones and words and sub-word units and so forth
0:51:12and could we take some of those same features that were described previously and use them on a
0:51:18much larger scale as they come in on an ongoing basis
0:51:24okay so i'd like to conclude here
0:51:26i've talked today about segmental log-linear models specifically segmental conditional random fields
0:51:33i think these are a flexible framework for testing novel scientific ideas
0:51:39in particular they allow you to integrate diverse information sources different types of information at different granularities at the
0:51:50word level at the phone level at the frame level
0:51:53information that comes in at variable quality levels some can be better than others
0:51:57potentially redundant
0:51:59information sources and generally speaking much more than what we're currently using
0:52:05and finally i think there's a lot of interesting research left to do in this area
0:52:10so thank you
0:52:18okay we have time for some questions
0:52:20we do have
0:52:22and please if you want to step up to the mic actually speaking close to the microphone is
0:52:26actually very helpful
0:52:41so in segmental models there's an issue of normalization
0:52:45because you're comparing hypotheses with different numbers of segments and so
0:52:49there's an issue of how you make sure that you know
0:52:53hypotheses with fewer segments aren't favored over
0:52:56ones with longer segments i was wondering how
0:52:58you deal with that yeah good question and we deal with it because when you do training you have to normalize
0:53:06by considering all possible segmentations in the denominator
0:53:12so when you do training you know how many segments there are you know how many words there are in the
0:53:18training hypothesis that gives you a fixed number like maybe ten
0:53:23and then you have this normaliser
0:53:25where you have to consider all possible segmentations
0:53:29and if the system has a strong bias
0:53:31say towards
0:53:33segmentations that only had one segment because there were fewer scores
0:53:37that wouldn't work because
0:53:40your denominator then would assign high weights
0:53:44to the wrong segmentations it wouldn't assign high weight to this thing in the numerator
0:53:50that has ten segments it would assign high weight
0:53:53to the
0:53:56hypotheses that just had a single segment in the denominator and the objective function would be bad
0:54:03and
0:54:05training would take care of that because maximizing the objective function the conditional likelihood of the training data it would have
0:54:12to assign parameter values
0:54:14and so that it didn't have that particular bias
0:54:21in one of your slides you were saying that you train a discriminative kind of language model implicitly by building
0:54:27it into the joint training
0:54:28yeah so my question is but then it limits things
0:54:31the thing is in order to train the model we need to have data that is not
0:54:35just acoustically annotated data
0:54:37but usually for language modeling we have huge amounts of text data
0:54:43for which we may not have corresponding acoustic
0:54:46features
0:54:48so if we were to train a big language model from just text how do we incorporate it
0:54:55yeah
0:54:56so i think the way to do that
0:54:58is to
0:55:01annotate the lattice
0:55:03with the language model score
0:55:06that you get from this language model you train on lots and lots of data
0:55:11so that score's gonna get into the system
0:55:14then have a second language model that you could think of as sort of a corrective language model
0:55:20that is trained only on the data for which you have acoustics
0:55:26and
0:55:28add those
0:55:29features in
0:55:31in addition to the language model score from the basic language model
0:55:39one more question
0:55:41is it just one pass decoding
0:55:44i mean from what i understand you take lattices in to constrain your search space
0:55:48but then what if i have a language model which is much more complicated than an n-gram and i wish to
0:55:53do rescoring
0:55:54is it possible to output lattices or lattice structures
0:55:57out of the decoder
0:56:00there's a question of in theory and in the particular
0:56:05implementation that we've made in the particular implementation that we've made no it takes lattices in and it produces one
0:56:12best
0:56:13there's nothing about the theory or the framework that says you can't take lattices in and produce lattices out and
0:56:21i was just curious about the toolkit yeah okay
0:56:29yeah i just have one question
0:56:31i think it's a good idea to combine the different sources of information but this can also be done
0:56:36in a much simpler model right without using the concept of the segments
0:56:41you introduce the segments here and what is the real benefit
0:56:45of that
0:56:47so i think the benefit is
0:56:52features that you can't express
0:56:55if you don't have the concept of the segment
0:56:58an example of a feature where you need segment boundaries probably the simplest example is say a word duration model
0:57:06you really need to talk about when the word starts
0:57:09and when the word ends
0:57:11another example where i think it's useful is in template matching if there's a hypothesis and you wanna have a
0:57:19feature of the form
0:57:23what is the dtw distance
0:57:26to the closest
0:57:27example in my training database
0:57:30of this word that i'm hypothesizing
0:57:34it helps if you have a boundary to start that dtw alignment and a boundary to end that dtw alignment
0:57:42so i think the answer to the question is that
0:57:44by reasoning explicitly about segmentations
0:57:49you can incorporate features
0:57:52that
0:57:53you can't incorporate if you reason only about frames
0:57:57but features like that you're incorporating in a very heuristic way
0:58:01in some of the simpler models with three levels
0:58:05with all the
0:58:07information just combined
0:58:09mapping
0:58:10the whole thing
0:58:11and training it
0:58:13again can we do that
0:58:15i
0:58:20so my own personal philosophy is that if you care about features if you care about
0:58:27information
0:58:29where the natural measure of that information is in terms of segments
0:58:34then you're better off
0:58:37explicitly reasoning in terms of those units in terms of segments
0:58:41than somehow trying to
0:58:44implicitly or through the back door
0:58:47encode that information in some other way
0:58:59how many sorts of segments have you tried have you tried syllables for example
0:59:04i would imagine because many syllables are also monosyllabic words that you might see some confusion
0:59:12in your word models
0:59:14i
0:59:15syllables
0:59:17right i didn't mention this as a as a research direction but one thing i'm really interested in is being
0:59:25able to do decoding from scratch with the segmental model like this
0:59:30i also didn't go into detail about the computational
0:59:34burden of using these models
0:59:38but it turns out that it's
0:59:40proportional to the size of your vocabulary
0:59:43so if you wanted to do bottom-up decoding from scratch without reference
0:59:49to some initial lattices or an external system
0:59:52you need to use subword units for example syllables which are on the order of some thousands
0:59:58or
1:00:00even better phones and for phones we actually have
1:00:03begun some initial experiments
1:00:06with doing bottom-up phone recognition actually just at the segment level with the pure segmental model where we just by
1:00:14brute force consider
1:00:16all possible segments and all possible phones
1:00:22okay let's thank the speaker