Okay, welcome to the morning session on acoustic modeling.

We'll start off with a talk by Geoff Zweig.

Let me introduce him. Actually, I'm really happy to introduce someone I've known since he was this high, but who has grown a lot since he was a graduate student.

Anyway, this will be followed by the poster session on acoustic models.

Geoff was a graduate student at Berkeley, where he did an amazing job and was already interested in trying all kinds of different models, which is something I've always liked.

He went on to IBM, where he continued to work on graphical models, not only working on the theory but also on implementations.

And he got sucked into a lot of DARPA meetings, as have many of us.

He moved on from there to Microsoft, where he's been since 2006.

So he's well known in the field now, both for principled developments and also for implementations that have been useful to the community.

I'm happy to have Geoff up here to give a talk on the very interesting idea of segmental conditional random fields.

Thank you very much.

Okay, so I'd like to start today with a very high-level description of what the theme of the talk is going to be.

I tried to put a little bit of thought, in advance, into what would be a good pictorial metaphor — a pictorial representation of what the talk would be about — and also something fitting to the beautiful location that we're in today.

When I did that, I decided the best thing I could come up with was this picture that you see here of a nineteenth-century clipper ship.

These are very interesting things: they were basically the space shuttle of their day. They were designed to go absolutely as fast as possible, making trips to, say, London or Boston.

When you look at the ship, you see that they put a huge amount of thought and engineering into its design.

In particular, if you look at those sails, they didn't just build a ship and then put one big pole with a sail on top of it. Instead, they tried in many ways to harness every aspect, every facet of the wind that they possibly could. So they have sails positioned in all different ways: some rectangular sails, some triangular sails, and that sort of funny sail that you see there back at the end.

The idea is to really pull out absolutely all the energy that you can get from the wind and drive this thing forward.

That relates to what I'm talking about today, which is speech recognition systems that, in a similar way, harness together a large number of information sources to try to drive the speech recognizer forward in a faster and better way.

And this is going to lead to a discussion of log-linear models, segmental models, and then their synthesis in the form of segmental conditional random fields.

Here's an outline of the talk.

I'll start with some motivation for the work.

Then I'll go into the mathematical details of segmental conditional random fields, starting with hidden Markov models and then progressing through a sequence of models that lead to the SCRF.

I'll talk about a specific implementation that my colleague Patrick Nguyen and I put together — this is the SCARF toolkit. I'll talk about the language modeling that's implemented there, which is sort of interesting, then the inputs to the system and the features that it generates from them.

Then I'll present some experimental results, some research challenges, and a few concluding remarks.

Okay, so the motivation for this work is that state-of-the-art speech recognizers really look at speech frame by frame.

We extract our speech frames every ten milliseconds, we extract a feature — usually one kind of feature, for example PLPs or MFCCs — and we send those features into a time-synchronous recognizer that processes them and outputs words.

I'm going to be the last person in the room to underestimate the power of that basic model and how good the performance is that you can get from working with that kind of model and doing a good job on the basics of it. So a very good question to ask is how to improve that model in some way.

But that is not the question that I'm going to ask today.

Instead, I'm going to ask a different question — I should say, I'm going to re-ask a question, because this is something that a number of people have looked at in the past — and that is whether or not we could do better with a more general model.

In particular, the questions I'd like to look into are whether we can move from a frame-wise analysis to a segmental analysis, from the use of real-valued feature vectors such as MFCCs and PLPs to more arbitrary feature functions, and whether we can design a system around the synthesis of disparate information sources.

What's going to be new in this is doing it in the context of log-linear modeling, and it's going to lead us to a model like the one that you see at the bottom of the picture here.

In this model we basically have a two-state — a two-layer model, I should say.

At the top layer we are going to end up with states. These are going to be segmental states, representing, stereotypically, words.

At the bottom layer we'll have a sequence of observation streams — we'll have many observation streams — and each of these provides some information. They can be many different kinds of information sources: for example, the detection of a phoneme, the detection of a syllable, the detection of an energy burst, a template match score — all kinds of different information coming in through these multiple observation streams.

And because they're general, like detections, they're not necessarily frame-synchronous, and you can have variable numbers of them in a fixed amount of time across the different streams.

And we'll have a log-linear model that relates the states that we're hypothesizing to the observations that are hanging down below each state, blocked into words.

Okay, so I'd like to move on now and discuss SCRFs mathematically, but starting first from hidden Markov models.

Here's a depiction of a hidden Markov model; I think we're all familiar with this.

The key thing we're getting here is an estimate of the probability of the state sequence given an observation sequence. In this model, states usually represent context-dependent phones or sub-states of context-dependent phones, and the observations are most frequently spectral representations such as MFCCs or PLPs.

The probability is given by the expression that you see there, where we go frame by frame and multiply in transition probabilities — the probability of a state at one time given the previous state — and then observation probabilities — the probability of an observation at a given time given that state.

Those observation probabilities are most frequently Gaussians on MFCC or PLP features, whereas in hybrid systems you can also use neural net posteriors as input to the likelihood computation.
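As a point of reference, the factorization just described has the standard HMM form below (a sketch in my own notation rather than a reproduction of the slide), with $s_t$ the state and $o_t$ the observation at time $t$:

$$
P(s_1^T \mid o_1^T) \;\propto\; P(o_1^T, s_1^T) \;=\; \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)
$$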

Okay, so I think the first big conceptual step away from the hidden Markov model is maximum entropy Markov models.

These were first investigated by Ratnaparkhi in the mid-nineties in the context of part-of-speech tagging for natural language processing, and then generalized, or formalized, by McCallum and his colleagues in 2000.

Then there were some seminal applications of these to speech recognition by Jeff Kuo and Yuqing Gao in the mid-2000s.

The idea behind these models is to ask the question: what if we don't condition the observation on the state, but instead condition the state on the observation?

So if you look at the graph here, what's happened is that the arrow, instead of going down, is going up, and we're conditioning the state at a given time on the previous state and the current observation.

States are still context-dependent phone states, as they were before, but what we're going to get out of this whole operation is the ability to have potentially much richer observations than MFCCs down here.

The probability of the state sequence given the observations in a MEMM is given by this expression, where we go time frame by time frame and compute the probability of the current state given the previous state and the current observation.

How do we do that?

The key is to use a small maximum entropy model and apply it at every time frame.

What this maximum entropy model does, primarily, is compute some feature functions that relate the state at the previous time to the state at the current time and to the observation at the current time.

Those feature functions can be arbitrary functions: they can return a real number or a binary number, and they can do an arbitrary computation.

They get weighted by lambdas — those are the parameters of the model — summed over all the different kinds of features that you have, and then exponentiated.

That is normalized by the sum, over all possible ways that you could assign values to the state, of the same sort of expression.
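In symbols, a sketch of the MEMM just described, in the standard form with feature functions $f_i$ and weights $\lambda_i$ (assumed rather than copied from the slide):

$$
P(s_1^T \mid o_1^T) = \prod_{t=1}^{T} P(s_t \mid s_{t-1}, o_t),
\qquad
P(s_t \mid s_{t-1}, o_t) = \frac{\exp\!\big(\sum_i \lambda_i f_i(s_{t-1}, s_t, o_t)\big)}{\sum_{s'} \exp\!\big(\sum_i \lambda_i f_i(s_{t-1}, s', o_t)\big)}
$$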

This is doing two things. The first is that it's going to let us use arbitrary feature functions rather than, say, Gaussian mixtures, and the second is that it's inherently discriminative, in that it has this normalization factor here.

I'm going to talk a lot about features, so I want to make sure that we're on the same page in terms of what exactly I mean by features and feature functions.

Features, by the way, are distinct from observations: observations are things you actually see, and features are numbers that you compute using those observations as input.

A nice way of thinking about the features is as a product of a state component and a linguistic — I'm sorry, a state component and an acoustic component.

I've illustrated a few possible state functions and acoustic functions in this table, along with the kinds of features that you extract from them.

One very simple state function is to ask: what is the current state — what's the current phone, or the current context-dependent state — and then just use a constant for the acoustic function.

You multiply those together and you have a binary feature: the state either is this particular value y, or it's not — zero or one. The weight that you learn on that is essentially a prior on that particular context-dependent state.

A full transition function would be: the previous state was x and the current state is y — the previous phone is such-and-such and the current phone is such-and-such. We don't pay attention to the acoustics, we just use one, and that gives us a binary function that encodes the transition.

You get slightly more interesting features when you start actually using the acoustic function.

One example of that: the state function says the current state is such-and-such, and, by the way, when I take my observation and plug it into my voicing detector, it comes out either yes, it's voiced, or no, it's not voiced — and I get a binary feature when I multiply those two together.

Yet another example: the state is such-and-such, and I happen to have a Gaussian mixture model for every state; when I plug the observation into the Gaussian mixture model for that state I get a score, and I multiply that score by the indicator that I'm seeing the state, and that gives me a real-valued feature function.

And so forth — you can get fairly sophisticated feature functions. This one down here, by the way, is the one that Kuo and Gao used in their MEMM work, where they looked at the rank of the Gaussian mixture model associated with a particular state, compared to all the other states in the system.
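To make the state-component-times-acoustic-component idea concrete, here is a minimal sketch in Python (the function names and the observation format are hypothetical, not the SCARF API):

```python
# Sketch of a feature built as state-component x acoustic-component.

def voicing_detector(observation):
    # Hypothetical acoustic function: 1.0 if the frame looks voiced, else 0.0.
    return 1.0 if observation["voicing_score"] > 0.5 else 0.0

def make_state_voicing_feature(target_state):
    # Binary feature: fires when the hypothesized state matches target_state
    # AND the voicing detector fires on the current observation.
    def feature(prev_state, cur_state, observation):
        state_part = 1.0 if cur_state == target_state else 0.0
        acoustic_part = voicing_detector(observation)
        return state_part * acoustic_part
    return feature

f = make_state_voicing_feature("aa_s2")
print(f("t_s3", "aa_s2", {"voicing_score": 0.9}))  # 1.0
```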

Let's move on to the conditional random field.

Now, it turns out that under certain pathological conditions, if you're using MEMMs, you can make a decision early on, and the transition structure just so happens to be set up in such a way that you would ignore the observations for the rest of the utterance, and you run into a problem. I think these are pathological conditions, but they can theoretically exist, and that motivated the development of conditional random fields, where, rather than doing a bunch of local normalizations — making a bunch of local, state-wise decisions — there's one global normalization over all possible state sequences.

Because there is a global normalization, it doesn't make sense to have arrows in the picture: the arrows indicate where you're going to do a local normalization, and we're not doing a local normalization.

So the picture is this: the states are as with the maximum entropy model, the observations are as with the maximum entropy model, and the feature functions are as with the maximum entropy model. The thing that's different is that when you compute the probability of the states given the observations, you normalize not locally but once, globally, over all the possible ways that you can assign values to those state sequences.
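Written out, the CRF described here has the standard form (again a sketch in my own notation):

$$
P(s_1^T \mid o_1^T) = \frac{\exp\!\big(\sum_{t}\sum_i \lambda_i f_i(s_{t-1}, s_t, o_1^T)\big)}{\sum_{\mathbf{s}'} \exp\!\big(\sum_{t}\sum_i \lambda_i f_i(s'_{t-1}, s'_t, o_1^T)\big)}
$$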

That brings me now to the segmental version of the CRF, which is the main point of this talk.

The key difference between the segmental version of the CRF and the previous version is that we're going to take the observations and we're now going to block them into groups that correspond to segments, and we're actually going to make those segments be words.

Conceptually they could be any kind of segment — they could be phone segments or syllable segments — but for the rest of this talk I'm going to refer to them as words.

For each word we're going to block together a bunch of observations and associate them concretely with that state. Those observations, again, can be more general than MFCCs: for example, they could be phoneme detections or detections of articulatory features.

There's some complexity that comes with this model, because even when we do training, where we know how many words there are, we don't know what the segmentation is, and so we have to consider all possible segmentations of the observations into the right number of words.

In this picture here, for example, we have to consider segmenting the seven observations not just as two, two, and three, but maybe moving this one over here and having three associated with the first word, only one associated with the second word, and then three with the last.

When you do decoding, you don't even know how many words there are, so you have to consider both all the possible numbers of segments and all the possible segmentations given that number of segments.

This leads to the expression for segmental CRFs that you see here.

It's written in terms of the edges that exist in the top layer of the graph: each edge has a state to its left and a state to its right, and it has a group of observations that are linked together underneath it, o(e).

The segmentation is denoted by q.

With that notation, the probability of a state sequence given the observations is given by the expression you see there, which is essentially the same as the expression for the regular CRF, except that now we have a sum over segmentations that are consistent with the number of segments that are hypothesized, or known during training.
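A sketch of that form (my notation; the slide's exact symbols may differ), with $s_{l(e)}$ and $s_{r(e)}$ the states to the left and right of edge $e$ and $o(e)$ the observations blocked under it:

$$
P(\mathbf{s} \mid \mathbf{o}) =
\frac{\sum_{\mathbf{q}\,:\,|\mathbf{q}|=|\mathbf{s}|} \exp\!\big(\sum_{e}\sum_i \lambda_i f_i(s_{l(e)}, s_{r(e)}, o(e))\big)}
{\sum_{\mathbf{s}'} \sum_{\mathbf{q}'\,:\,|\mathbf{q}'|=|\mathbf{s}'|} \exp\!\big(\sum_{e}\sum_i \lambda_i f_i(s'_{l(e)}, s'_{r(e)}, o(e))\big)}
$$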

Okay, so that was a lot of work to go through to introduce segmental features. Do we really need to introduce segmental features at all — do we get anything from that? After all, with the CRF, the state sequence is conditioned on the observations; we've got the observations sitting there in front of us. Isn't that enough? Is there anything else you need?

I think the answer to that is clearly yes: you do need to have boundaries — you get more if you can talk about concrete segment boundaries. Here are a few examples.

Suppose you want to use template match scores as feature functions. For example, you have a segment and you ask the question: what's the DTW distance between this segment and the closest example of the word that I'm hypothesizing in some database that I have? To do that you need to know where to start the alignment and where to end the alignment — you need the boundaries — so you get something from that that you don't have when you just say, here's a big blob of observations.

Similarly, word durations: if you want to talk about a word duration model, you have to be precise about when the word starts and when the word ends, so that the duration is defined.

It also turns out to be useful to have boundaries if you're incorporating scores from other models. Two examples of that are the HMM likelihoods and Fisher kernel scores that Layton and Gales have used, and the point process model scores that Jansen and Niyogi have proposed.

Later in the talk I'll discuss detection subsequences as features, and there again we need to know the boundaries.

Okay, so before proceeding I'd like to emphasize that this is really building on a long tradition of work, and I want to call out some of the components of that tradition. The first is log-linear models that use a frame-level Markov assumption.

There, I think the key work was done by Jeff Kuo and Yuqing Gao with the maximum entropy Markov model; theirs really was the first to propose and exercise the power of using general feature functions.

Shortly thereafter — actually, more or less simultaneously — hidden CRFs were proposed by Gunawardana and his colleagues, and then there was a very interesting paper at last year's ASRU where essentially extra hidden variables are introduced into the CRF to represent Gaussian mixture components, with the intention of simulating MMI training in a conventional system.

Jeremy Morris and Eric Fosler-Lussier did some fascinating initial work on applying CRFs to speech recognition. They used features such as neural net attribute posteriors — in particular, the detection of sonority, voicing, manner of articulation and so forth — as feature functions that went into the model. They also proposed and experimented with the use of MLP phoneme posteriors as features, and proposed something called the Crandem model, which is essentially a hybrid CRF/HMM model where CRF phone posteriors are used as the state likelihood functions rather than the neural net posteriors of a standard hybrid system.

The second tradition I'd like to call out is the tradition of segmental log-linear models.

The first use of these was termed semi-CRFs by Sarawagi and Cohen, developed in the context of natural language processing.

Layton and Gales proposed something termed the conditional augmented statistical model, which is a segmental CRF that uses HMM scores and Fisher kernel scores.

Zhang and Gales proposed the use of structured SVMs, which are essentially segmental CRFs with large-margin training.

Lehr and Shafran have an interesting transducer representation that uses perceptron training and similarly achieves joint training of the acoustic, language, and duration models.

And finally, Georg Heigold, Patrick Nguyen and I have done a lot of work on flat direct models, which are essentially whole-sentence maximum entropy acoustic models — maxent models at the segment level — and you can think of the segmental models I'm talking about today as essentially stringing together a whole bunch of flat direct models, one for each segment.

It's also important to realize that there's significant previous work on classical segmental modeling and on detector-based ASR.

The segmental modeling, I think, comes in two main threads.

In one of these, the likelihoods are based on frame-wise computations, so you have a different number of scores contributing to each segment, and there's a long line of work that was done here by Mari Ostendorf and her students and a number of other researchers that you see here.

Then, in a separate thread, there's the development of a fixed-length segment representation for each segment, which Mari Ostendorf and her colleagues looked at in the late nineties, and which Jim Glass more recently has worked on and contributed to, using phone likelihoods in the computation in a way that I think is similar to the normalization in the SCRF framework.

I'm also going to talk about using detections — phone detections and multi-phone detections — and that is in the spirit of the work of Chin-Hui Lee and his colleagues in their proposal of detector-based ASR, which combines detector information in a bottom-up way to do speech recognition.

Okay, so I'm going to move on now to the SCARF implementation — a specific implementation of an SCRF.

What this is going to do is essentially extend the tradition that I've mentioned, and it's going to extend it with a synthesis of detector-based recognition, segmental modeling, and log-linear modeling.

It's going to further develop some new features that weren't present before — in particular, features termed existence, expectation, and Levenshtein features.

And it's going to extend that tradition with an adaptation to large-vocabulary speech recognition, by fusing finite-state language modeling into the segmental framework that I'm talking about.

Okay, so let's move on to the specific implementation.

This is a toolkit that I developed with Patrick Nguyen.

It's available from the web page that you see there; you can download it and play around with it.

The features that I talk about next are specific to this implementation. They're one way of realizing the general SCRF framework and using it for speech recognition, where you have to dot all the i's and cross all the t's and make sure that everything works.

Okay, so I want to start by talking about how language models are implemented there, because it's sort of a tricky issue.

When I see a model like this, I think bigram language model: I see two states next to each other, connected by an arrow, which is like the probability of one state given the preceding state, and that looks a whole lot like a bigram language model. So is that what we're talking about — are we just talking about bigram language models?

The answer is no. What we're actually going to do is model long-span context, by making these states refer to states in an underlying finite-state language model.

Here's an example of that.

What you see on the left is a fragment of a finite-state language model. It's a trigram language model, so it has bigram history states — for example, there's a bigram history state for 'the dog', and similar states for other word pairs.

Sometimes we don't have all the trigrams in the world, so to decode an unseen trigram we need to be able to back off to a lower-order history state. For example, if we're in the history state 'the dog', we might have to back off to the one-word history state 'dog', and then we could decode a word that we haven't seen before in this trigram context, like 'yep', and then move to the history state 'dog yep'.

Finally, as a last resort, you can back off to the null history state — state three, down there at the bottom — and decode any word in the vocabulary.

Okay, so let's assume that we want to decode the sequence 'the dog yep'. How would that look?

We decode the first word, 'the', and we end up in state seven here, having seen the history 'the'.

Then we decode the word 'dog'. That moves us up to state one; we've now seen the bigram 'the dog'.

Now suppose we want to decode 'yep'. Right now we're in state one — we've gotten as far as 'the dog', and that got us to state one here. To decode 'yep' we have to back off from state one to state two, and then we can decode the word 'yep' and end up in state six over here, 'dog yep'.

What this means is that by the time we get around to decoding the word 'yep', we know a lot more than that the last word was 'dog': we actually know that the previous state was state one, which corresponds to the two-word history 'the dog'. So this is not a bigram language model that we have here; it actually reflects the semantics of the trigram language model that you see in the fragment on the left.

So there are two ways that we can use this. One is to generate a basic language model score: if we provide the system with the finite-state language model, then we can just look up the language model cost of transitioning between states and use that as one of the features in the system.

But more interestingly, we can create a binary feature for each arc in the language model.

Now, these arcs in the language model are normally labeled with things like bigram probabilities, trigram probabilities, or back-off probabilities. What we're going to do is create a binary feature that just says whether I traversed that arc in transitioning from one state to the next.

So, for example, when we go from 'the dog' to 'dog yep', we traverse two arcs: the arc from one to two, and then the arc from two to six.

The weights — the lambdas — that we learn in association with those arcs are analogous to the back-off weights and the bigram weights of the normal language model, but now we're actually learning what those weights are.

What that means is that when we do training, we end up with a discriminatively trained language model — and, in fact, a language model that is trained jointly with the acoustic model, at the same time.

So I think that's sort of an interesting phenomenon.
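Here is a minimal sketch of that idea in Python: walking a backoff n-gram automaton word by word and recording which arcs fire, so each arc can carry a learned weight. The state numbers mirror the example above, but the data structures and names are hypothetical, not the SCARF implementation:

```python
# arcs[state] maps a word to (next_state, arc_id); backoff[state] = (lower_state, arc_id)
arcs = {
    7: {"dog": (1, "a_7_1")},   # history "the"     -> history "the dog"
    1: {},                      # history "the dog" (no trigram for "yep")
    2: {"yep": (6, "a_2_6")},   # history "dog"     -> history "dog yep"
}
backoff = {1: (2, "a_1_2")}     # "the dog" backs off to "dog"

def traverse(state, word):
    """Return (next_state, arc ids fired) when decoding `word` from `state`."""
    fired = []
    while word not in arcs.get(state, {}):
        state, arc_id = backoff[state]   # follow a backoff arc
        fired.append(arc_id)
    next_state, arc_id = arcs[state][word]
    fired.append(arc_id)                 # the word arc itself
    return next_state, fired

print(traverse(1, "yep"))  # (6, ['a_1_2', 'a_2_6']) -- the two arcs in the example
```

Each arc id in the fired list becomes a binary feature whose lambda is learned during training.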

Okay, I'd like to talk about the inputs to the system now.

The first inputs are detector inputs. A detection is simply a unit and its midpoint.

An example is shown here, where we have phone detections. This is from a voice search system, and it looks like the person is asking for 'burgers', although what was actually said comes out a little differently in the detections.

The way to read this is that we detected the 'b' at time frame 790, the next unit at the next time shown, and so forth; these correspond to the observations that are in red in the illustration here.

Optionally, you can also provide a dictionary that specifies the expected sequence of detections for each word. For example, if we're going to decode 'burgers', we expect 'b', 'er', 'g', and so forth — the pronunciation of the word.

The second input is lattices that constrain the search space.

The easiest way of getting these lattices is to use a conventional HMM system and use it to provide constraints on the search space.

The way to read this is that from time 1221 to time 2560, a reasonable hypothesis is the word 'workings'.

Those times give us hypothesized segment boundaries, and the word gives us possible labelings of the state, and we're going to use those when we actually do the computations, to constrain the set of possibilities we have to consider.

A second kind of lattice input is user-defined features.

If you happen to have a model that you think provides some measure of consistency between the word that you're hypothesizing and the observations, you can plug it in as a user-defined feature, like you see here.

This lattice has a single feature that's been added; it's a dynamic time warping feature. The particular one I've underlined in red here indicates that the DTW feature value for hypothesizing the word 'fell' between frames 1911 and 2260 is 8.27.

That feature corresponds to one of the features in the log-linear models that exist on those vertical edges.

Now, multiple inputs are very much encouraged. What you see here is a fragment of a lattice file that Kris Demuynck put together, and you can see it's got lots of different feature functions that he's defined.

Essentially these features are — to follow the metaphor that I started with at the beginning — analogous to the sails on the ship: they're providing the information that's pushing the whole thing forward, and we want to get as many of them as possible.

Okay, let's talk about some features that are automatically defined from the inputs.

The user-defined features have already been defined; you don't have to worry about them once you put them in on a lattice.

If you provide detector sequences, there's a set of features that can be automatically extracted, and the system will learn the weights of those features. Those are the existence, expectation, and Levenshtein features, along with something called the baseline feature.

The idea of an existence feature is to measure whether a particular unit exists within the span of the word that you're hypothesizing.

These are created for all word-unit pairs.

They have the advantage that you don't need any predefined pronunciation dictionary, but they have the disadvantage that you don't get any generalization ability across words.

Here's an example. Suppose we hypothesize the word 'accord', and within its span we detect, among other things, a 'k'. I would create a feature that says: I'm hypothesizing 'accord', and I detected a 'k' in the span. That would be an existence feature. When you train the model, it would presumably get a positive weight, because presumably it's a good thing to detect a 'k' if you're hypothesizing the word 'accord'.

But there's no generalization ability across words here, so that 'k' is completely different from the 'k' you would have if you were hypothesizing 'accordion', and there's no transfer of the weight, or smoothing, between them.

The idea behind expectation features is to use a dictionary to avoid this and actually get generalization ability across words.

There are three different kinds of expectation features, and I think I'll just go through them by example.

Take the first one. Suppose we're hypothesizing 'accord' again, and the 'k' shows up in the detections. We have a correct accept of the 'k', because we expect to see it on the basis of the dictionary, and we've actually detected it.

Now, that feature is very different from the previous feature, because we can learn that detecting a 'k' when you expect a 'k' is a good thing — in the context of training on the word 'accord' — and then use that same feature weight when we detect a 'k' in association with the word 'accordion', or the word 'cat'.

The second kind of expectation feature is a false reject of a unit, and here's an example of that, where we expect to see a unit but we don't actually detect it.

Finally, you can have a false accept of a unit, where you don't expect to see it based on your dictionary pronunciation, but it shows up in the things that you've detected; the example here illustrates that.

Levenshtein features are similar to expectation features, but they use stronger ordering constraints.

The idea behind the Levenshtein features is to take the dictionary pronunciation of a word and the units that you've detected in association with that word, align them to each other to get the edit distance, and then create one feature for each kind of edit that you've had to make.

So, to follow along in this example, where we expect 'accord' and we see the detections shown: we have a substitution on the first unit, a match of the 'k', a match of the vowel, a match of the 'r', and a deletion of the 'd'.

Again, presumably we can learn that matching a 'k' is a good thing — that it gets a positive weight — by seeing one set of words in the training data, and then use that to evaluate hypotheses of new words at test time, where we haven't seen those particular words, but where they use the subword units whose feature weights we've already learned.
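Here is a minimal sketch in Python of the three automatically derived feature families just described, for one hypothesized word and the detections falling inside its span (hypothetical code, not the SCARF feature extractor; difflib is used as a stand-in for a true Levenshtein alignment):

```python
import difflib

def existence_features(word, detected):
    # One binary feature per (word, unit) pair actually seen in the span.
    return {("exists", word, u): 1.0 for u in set(detected)}

def expectation_features(pron, detected):
    # Correct accepts, false rejects, and false accepts of each unit, ignoring order.
    pron_set, det_set = set(pron), set(detected)
    feats = {}
    feats.update({("correct_accept", u): 1.0 for u in pron_set & det_set})
    feats.update({("false_reject", u): 1.0 for u in pron_set - det_set})
    feats.update({("false_accept", u): 1.0 for u in det_set - pron_set})
    return feats

def levenshtein_features(pron, detected):
    # Align the dictionary pronunciation to the detections and count edit operations.
    ops = {"equal": "match", "replace": "sub", "delete": "del", "insert": "ins"}
    feats = {}
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=pron, b=detected).get_opcodes():
        units = detected[j1:j2] if op == "insert" else pron[i1:i2]
        for u in units:
            key = (ops[op], u)
            feats[key] = feats.get(key, 0.0) + 1.0
    return feats

pron = ["ax", "k", "ao", "r", "d"]   # hypothetical dictionary pronunciation of "accord"
detected = ["ah", "k", "ao", "r"]    # units detected within the word's hypothesized span
print(existence_features("accord", detected))
print(expectation_features(pron, detected))
print(levenshtein_features(pron, detected))
```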

Okay, the baseline feature is kind of an important feature, and I want to mention it here.

I think many people in the room have had the experience of taking a system, having an interesting idea — a very novel, scientific thing to try out — doing it, adding it in, and it gets worse.

The idea behind the baseline feature is that we want to take something like the Hippocratic oath: we're going to do no harm. We're going to have a system where you can add information to it and not go backward.

So we're going to make it so that you can build on the best system that you have, by treating the output of that system as a word detector stream — the detection of words — and then defining a feature, this baseline feature, that sort of stabilizes the system.

The definition of the baseline feature is this: if you look at an arc that you're hypothesizing, and you look at what words you've detected underneath it, you get a plus one for the baseline feature if the hypothesized word covers exactly one baseline detection and the words are the same, and otherwise you get a minus one.

Here's an example. In the lattice, the sample path we're evaluating reads 'random', 'like', 'sort', and so on across the top, while the baseline system output was 'randomly sort cards man', detected at the vertical lines that you see here.

So when we compute the baseline feature, we take the first arc, 'random', and we ask: how many words does it cover? One — that's good. Is it the same word? No — minus one.

Then we take 'like' and ask how many words it covers: none, so that gets a minus one. Then we take 'sort': how many words does it cover? One. Is it the same? Yes, so we get a plus one there. And finally, the last arc covers two baseline words, not one like it's supposed to, so we get a minus one there as well.

It turns out, if you think about this, that the way to optimize the baseline score is to output exactly as many words as the baseline system output, and to make their identities exactly the same as the baseline identities.

So if you give the baseline feature a high enough weight, the baseline output is guaranteed.

In practice, of course, you don't just set that weight by hand; you add the feature to the system with all the other features, and its weight is learned along the way.
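A minimal sketch of that plus-one/minus-one rule (hypothetical code and data, not the toolkit's internals):

```python
def baseline_feature(hyp_word, hyp_start, hyp_end, baseline_detections):
    # baseline_detections: list of (word, midpoint_time) from the baseline system.
    # +1 if the hypothesized arc covers exactly one baseline detection of the same word.
    covered = [w for (w, t) in baseline_detections if hyp_start <= t < hyp_end]
    return 1.0 if len(covered) == 1 and covered[0] == hyp_word else -1.0

baseline = [("randomly", 10), ("sort", 30), ("cards", 50), ("man", 70)]
print(baseline_feature("random", 0, 20, baseline))   # -1.0: covers one word, but a different one
print(baseline_feature("sort", 25, 40, baseline))    #  1.0: covers exactly "sort"
```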

Okay, I'd like to move on now to some experimental results.

The first of these has to do with using multi-phone detectors — detecting multi-phone units — in the context of voice search. There's nothing special about voice search here; it just happens to be the application we were using.

The idea is to try to empirically find multi-phone units — sequences of phones that tell us a lot about words — then to train an HMM system whose units are these multi-phone units, do decoding with that HMM system, and take its output as a sequence of multi-phone detections.

We're then going to feed that detector stream into the SCRF.

The main question here is: what are good phonetic subsequences to use?

We're going to start by using every subsequence that occurs in the dictionary as a candidate.

The expression for the mutual information between a unit u_j and the words is given by this big mess that you see here.

The important thing to take away from it is that there is a tradeoff. It turns out that you want units that occur in about half of the words, so that when you get one of these binary detections you actually get a full bit of information — and from that point of view, single phones come close.

But you also need units that can be reliably detected, because the best unit in the world isn't going to do you any good if you can't actually detect it — and from that point of view, longer units are better.
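For reference, the quantity being traded off is the mutual information between a binary unit-detection variable $U_j$ and the word $W$; its generic definition is below (a sketch — the exact expression on the slide may be arranged differently):

$$
I(U_j; W) \;=\; \sum_{u \in \{0,1\}} \sum_{w} P(U_j{=}u,\, W{=}w)\, \log \frac{P(U_j{=}u,\, W{=}w)}{P(U_j{=}u)\, P(W{=}w)}
$$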

It turns out that if you do a phone decoding of the data, you can then compile statistics and choose the units that are best. My colleague Patrick Nguyen and I pursued a line of research along these lines, and you can look at this paper for the details.

If you do this and look at what the most informative units are in this particular voice search task, you see something sort of interesting.

Some of them are very short, like single phones such as 'r', but some of them are very long, like 'california'.

So we get units that are sometimes short and frequent, and sometimes long — 'california' is still pretty frequent, but it's less frequent.

Okay, so what happens if we use multi-phone units?

We started with a baseline system that was at about thirty-seven percent word error rate.

If we added phone detections, that dropped by about a percent.

If we used multi-phone units instead of phone units, that turned out to be better — so that was gratifying, that using these multi-phone units instead of the simple phone units actually made a difference.

If you use both together, it works a little bit better.

If you use multiple multi-phone streams — the three best units that were detected — it's a little bit better yet.

And finally, when we did discriminative training, that added a little bit more.

So what you see here is that it is actually possible to exploit somewhat redundant information in this kind of framework.

The next kind of features I want to talk about are template features. This is work that was done at the 2010 Johns Hopkins workshop, on Wall Street Journal, by my colleagues Kris Demuynck and Dirk Van Compernolle.

In order to understand that work, I need to say just a little bit about the baseline template system that's used at Leuven.

The idea is that you have a big speech database and you do forced alignment of all the utterances; those utterances are the rows in that top picture, and for each phone you know where its boundaries are — that's what those square boxes are, phone boundaries.

Then you get a new utterance, like the one you see at the bottom, and you try to explain it by going into the database that you have, pulling out phone templates, and then aligning those phone templates to the new speech such that you cover the whole of the new utterance.

Since the original templates come with phone labels, you can then read off the phone sequence.

Okay, so suppose we have a system like that set up. Is it possible to use features created from templates in this SCRF framework?

It turns out that you can, and there's a sort of interesting set of features that you can define.

The idea is to create features based on the template matches that explain a hypothesis. What you see at the upper left is a hypothesis of the word 'the', and we've further aligned it so that we know where the first phone, 'dh', is and where the second phone, 'iy', is.

Then we go into the database and find all the close matches to those phones: template number thirty-five was a good match, number four hundred twenty-three was a good match, number twelve thousand eleven was a good match, and so forth.

So given all those good matches, what are some features that we can get?

One of them is a word ID feature: what fraction of the templates that you see stacked up here actually came from the word that we're hypothesizing, 'the'?

Another is position consistency: if the phone is word-initial, like the 'dh', what fraction of the templates were word-initial in the original data? That's another interesting feature.

Speaker ID entropy: are all the close matches just from one speaker? That would be a bad thing, because potentially it's a fluke.

Degree of warping: if you look at how much you have to warp those examples to get them to fit, what's the average warp scale?

Those are all features that provide some information and that you can put into the system.
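Two of those, as a minimal sketch in Python (the metadata format here is hypothetical, just to make the fractions concrete):

```python
def word_id_feature(hyp_word, matches):
    # Fraction of matched templates that came from the hypothesized word.
    return sum(1 for m in matches if m["source_word"] == hyp_word) / len(matches)

def position_consistency_feature(matches):
    # Fraction of matched templates that were word-initial in the source data.
    return sum(1 for m in matches if m["word_initial"]) / len(matches)

matches = [
    {"template_id": 35,    "source_word": "the",  "word_initial": True},
    {"template_id": 423,   "source_word": "the",  "word_initial": True},
    {"template_id": 12011, "source_word": "then", "word_initial": True},
]
print(word_id_feature("the", matches))        # 0.666...
print(position_consistency_feature(matches))  # 1.0
```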

Kris Demuynck wrote a nice ICASSP paper that describes this in detail.

If we look at the results, we started from a baseline template system at 8.2 percent.

Adding the template metadata features provided an improvement, to 7.6 percent.

If we then add HMM scores, we get to 6.8 percent. I have to say that the HMM by itself was actually at 7.3 percent, so 7.3 is sort of the baseline there.

Then adding phone detectors dropped it down, finally, to 6.6 percent, which is actually a very good number for the open-vocabulary 20k test set.

Again, this is showing the effective use of multiple information sources.

Okay, the last experimental result I'd like to go over is a broadcast news system that we worked on, also at the 2010 CLSP workshop.

I don't have time to go into detail on all the particular information sources that went into this; I just want to call out a few things.

IBM was kind enough to donate their Attila system for use in creating a baseline system that constrained the search space.

We created a word detector system at Microsoft Research that produced the word detections that you see here — detector streams.

There were a number of real-valued information sources: Aren Jansen had a point process model that he worked on, Justine Kao worked on a duration model, and Les Atlas and some of the students had scores based on modulation features; those provided real-valued feature scores such as you see here.

Then there were some deep neural net phone detectors, and Samuel Thomas looked at the use of MLP phoneme detections; those provided the discrete detection streams that you see at the very bottom there.

If we look at the results — let's just move over to the test results — the baseline system that we built had a 15.7 percent word error rate.

If we did training with just the SCARF baseline feature, there was a small improvement; I think that has to do with the dynamic range of the baseline feature, plus or minus one, versus the dynamic range of the original likelihoods.

Adding the word detectors provided about a percent, and adding the other feature scores added a bit more.

Altogether we got about a 9.6 percent relative improvement, or about 27 percent of the gain possible given the lattices. Again, this indicates that you can take multiple kinds of information, put them into a system like this, and move in the right direction.

Okay, I want to quickly go over a couple of research challenges. I won't spend much time here, because research challenges are things that haven't been done yet, and people are going to do what they're going to do anyway, but I'll mention a few things that seem like they might be interesting.

One of them would be to use SCRFs to boost HMMs.

The motivation for this is that the use of the word detectors in the broadcast news system was actually very effective; we tried combining systems with ROVER and that didn't really work, but we were able to use the detectors with this log-linear weighting.

So the question is: can we use SCRFs in a more general boosting loop?

The idea would be to train the system, take its word-level output, reweight the training data according to the boosting algorithm — up-weighting the regions where we made mistakes — train a new system, and then treat the output of that system as a new detector stream to add into the overall boosted system.

The second idea is the use of spectro-temporal receptive field models as detectors.

Previously we've worked with HMM systems as detectors; I think it would be interesting to try to train STRF models to work as detectors and provide these detection streams.

One way of approaching that would be to take a bunch of examples of phones or multi-phone units — in-class examples and out-of-class examples, say — train a maximum entropy classifier to make the distinction, and use the weight matrix of the maxent classifier essentially as a learned spectro-temporal receptive field.
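A minimal sketch of that last idea, using scikit-learn's logistic regression as the maxent classifier (the patch sizes and the random stand-in data are purely illustrative; the real experiment would use spectro-temporal patches cut around labeled units):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_freq, n_frames = 40, 25   # hypothetical patch size: 40 mel bands x 25 frames
rng = np.random.default_rng(0)

# Stand-in data: in-class and out-of-class patches for one unit (e.g. a multi-phone).
X_pos = rng.normal(1.0, 1.0, size=(200, n_freq * n_frames))
X_neg = rng.normal(0.0, 1.0, size=(200, n_freq * n_frames))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 200 + [0] * 200)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The learned weight vector, reshaped to (frequency x time), acts as an STRF-like filter.
strf = clf.coef_.reshape(n_freq, n_frames)
print(strf.shape)  # (40, 25)
```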

The last idea that I'll throw out is to try to make much larger-scale use of template information than we have so far. We have a start on that from the Wall Street Journal results I discussed.

I think maybe we could take that further. For example, in voice search systems we have an endless stream of data coming in, and we can keep transcribing some of it, so we get more and more examples of phones and words and sub-word units and so forth.

Could we take some of those same features that were described previously and use them, on a much larger scale, as the data comes in on an ongoing basis?

Okay, so I'd like to conclude here.

I've talked today about segmental log-linear models, specifically segmental conditional random fields.

I think these are a flexible framework for testing novel scientific ideas.

In particular, they allow you to integrate diverse information sources: different types of information at different granularities — at the word level, at the phone level, at the frame level; information that comes at variable quality levels, some of it better than the rest; potentially redundant information sources; and, generally speaking, much more than we are currently using.

And finally, I think there's a lot of interesting research left to do in this area.

So, thank you.

Okay, we have time for some questions.

We do have microphones — please, if you want to ask a question, step up and speak close to the microphone; that's actually very helpful.

In segmental models there's an issue of normalization, because you're comparing hypotheses with different numbers of segments, and so there's an issue of how you make sure that hypotheses with fewer, longer segments aren't favored. I was wondering how you deal with that.

Yeah, good question. We deal with it because when you do training you have to normalize by considering all possible segmentations in the denominator.

So when you do training, you know how many segments there are — you know how many words there are in the training hypothesis — and that gives you a fixed number, maybe ten.

And then you have this normalizer, where you have to consider all possible segmentations.

If the system had a strong bias, say towards segmentations with only one segment because there were fewer scores, that wouldn't work, because your denominator would then assign high weight to the wrong segmentations: it wouldn't assign high weight to the thing in the numerator that has ten segments, it would assign high weight to the hypotheses with just a single segment in the denominator, and the objective function would be bad.

Training takes care of that, because maximizing the objective function — the conditional likelihood of the training data — forces it to assign parameter values such that it doesn't have that particular bias.

In one of your slides you were saying that you get a discriminatively trained language model implicitly, by building the language model arcs in. My question is: doesn't that limit things? To train it we need data that is not only acoustic but also has the other features, whereas for language modeling we usually have lots and lots of text data for which we may not have corresponding acoustic features. If we were to train a big language model from just text, how would you incorporate it?

So I think the way to do that is to annotate the lattice with the language model score that you get from this language model that you train on lots and lots of data, so that score gets into the system.

Then have a second language model, which you could think of as sort of a corrective language model, that is trained only on the data for which you have acoustics, and add those features in, in addition to the language model score from the basic language model.

One other thing: is it just one-pass decoding? I mean, from what I understand, you take lattices in and then constrain your search space. But what if I have a language model which is much more complicated than an n-gram and I wish to do rescoring — is the possible output a lattice-like structure, or is the output just the one best?

There's a question of what's true in theory and what's true in the particular implementation that we've made. In the particular implementation that we've made, it takes lattices in and it produces one-best output. There's nothing about the theory or the framework that says you can't take lattices in and produce lattices out.

— I was just curious about the toolkit, yeah. Okay.

Yeah, I just have one question. I think it's a good idea to combine the different sources of information, but this can also be done in a much simpler model, right, without using the concept of the segments? You introduced the segments here — what is the real benefit of that?

So I think the benefit is features that you can't express if you don't have the concept of a segment.

An example of a feature where you need segment boundaries — probably the simplest example — is a word duration model: you really need to talk about when the word starts and when the word ends.

Another example where I think it's useful is in template matching: if there's a hypothesis and you want to have a feature of the form "what is the DTW distance to the closest example, in my training database, of this word that I'm hypothesizing", it helps to have a boundary to start that DTW alignment and a boundary to end that DTW alignment.

So I think the answer to the question is that by reasoning explicitly about segmentations, you can incorporate features that you can't incorporate if you reason only about frames.

But aren't those features then being incorporated in a rather heuristic way? In some of the simpler, frame-level models, couldn't you just combine all of them — map the whole thing together and train it? Can we do that?

So my own personal philosophy is that if you care about features, if you care about information, where the natural measure of that information is in terms of segments, then you're better off explicitly reasoning in terms of those units — in terms of segments — than somehow trying to encode that information implicitly, or through the back door, in some other way.

How many sorts of segments have you tried? Have you tried syllables, for instance? I would imagine, because many syllables are also monosyllabic words, that you might see some confusion in your word models.

Syllables — right. I didn't mention this as a research direction, but one thing I'm really interested in is being able to do decoding from scratch with a segmental model like this.

I also didn't go into detail about the computational burden of using these models, but it turns out that it's proportional to the size of your vocabulary.

So if you wanted to do bottom-up decoding from scratch, without reference to some initial lattices or an external system, you'd need to use subword units — for example syllables, which are on the order of some thousands, or, even better, phones.

And for phones, we've actually begun some initial experiments with doing bottom-up phone recognition, just at the segment level with the pure segmental model, where we simply consider all possible segments and all possible phones by brute force.

Okay, let's thank the speaker.