Right. It is my great pleasure to present right after the two best paper nominees, so I hope you will also like this talk.
Alright. So, this work is about joint online spoken language understanding and language modeling with recurrent neural networks. My name is Bing Liu, and this is joint work with my advisor Ian Lane. We are from Carnegie Mellon University.
Here is the outline of the talk. First, I will introduce the background and the motivation of our work. Following that, we will explain our proposed method in detail, and then come the experiment setup and the result analysis. Finally, conclusions will be given.
First, the background. Spoken language understanding is one of the important components in spoken dialogue systems. In SLU, there are two major tasks: intent detection and slot filling. Given a user query, we want the SLU system to identify the user's intent, and also to extract useful semantic constituents from the user query.
Given an example query like "show me the flights from Seattle to ..." that also names a destination city and a departure time, we want the SLU system to identify that the user is looking for flight information; that is the intent. And we also want to extract useful information such as the from location, the to location, and the departure time; this is the task of slot filling.
Intent detection can be treated as a sequence classification problem, so standard classifiers like support vector machines with n-gram features, convolutional neural networks, or recursive neural networks can be applied. On the other hand, slot filling can be treated as a sequence labeling problem, so sequence models like maximum entropy Markov models, conditional random fields, and recurrent neural networks are good candidates.
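As a concrete picture of the two tasks, here is a minimal sketch in Python with ATIS-style labels; the particular labels chosen for this utterance are my own illustration, not taken from the talk.

```python
# One utterance, two SLU tasks: a single intent class for the whole query,
# and one IOB slot label per word (labels here are illustrative).
query = "i want a first class flight from phoenix to seattle".split()

intent = "atis_flight"  # intent detection: sequence classification

slots = [                # slot filling: sequence labeling
    "O", "O", "O",
    "B-class_type", "I-class_type",
    "O", "O",
    "B-fromloc.city_name",
    "O",
    "B-toloc.city_name",
]

for word, slot in zip(query, slots):
    print(f"{word:10s} {slot}")
print("intent:", intent)
```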
Intent detection and slot filling are typically processed separately in spoken language understanding systems. A joint model that can perform the two tasks at the same time simplifies the SLU system, as only one model needs to be trained and deployed. Also, by training two related tasks together, it is likely that we can improve the generalization performance of one task using the other related task.
Joint models for slot filling and intent detection have been proposed in the literature, using convolutional neural networks and recursive neural networks. The limitation of these previously proposed joint models is that their output is typically conditioned on the entire word sequence, which makes those models not very suitable for online tasks.
For example, in speech recognition, instead of receiving the transcribed text at the end of the speech, users typically prefer to see the ongoing transcription while they speak. Similarly, in spoken language understanding, with real-time intent detection and slot filling, the downstream system will be able to start acting on the query while the user is still speaking.
So in this work, we want to develop a model that can perform online spoken language understanding as new words arrive from the ASR engine. Moreover, we suggest that the SLU output can provide additional context for the next word prediction in the ASR online decoding. So we want to build a model that can perform online SLU and language modeling jointly.
Here is a simple visualization of our proposed idea. Given a user query like "i want a first class flight from phoenix to seattle" that we feed to the ASR engine for online decoding: with the arrival of the first few words, our intent model, based on this available information, provides an estimation of the user intent. Here, the intent model gives a very high confidence score to the intent class airfare, and lower confidence scores to the other intent classes. Conditioning on this intent estimation, the language model adjusts its next word prediction probabilities. So here we see that the probability of "price" being the next word is pretty high, because "price" is closely related to the intent of airfare.
Then, with the arrival of another word, "flight", from the ASR engine, the intent model updates its intent estimation: it increases the confidence score for the intent class flight, and reduces the confidence score for airfare. Accordingly, the language model adjusts its next word prediction probabilities. So here, the location related words such as "pittsburgh" and "phoenix" receive higher probabilities, and the probability of "price" is reduced.
And with additional word inputs from the ASR, our intent model becomes more confident that what the user is looking for is flight information, and accordingly the language model adjusts the next word probabilities conditioned on the intent estimation, until in the end we compute the probability of the entire utterance. So this is a simple visualization of our proposed idea of joint online spoken language understanding and language modeling.
Okay, next: our proposed method.
Okay, here are the RNN (recurrent neural network) models for the three different tasks that we want to model in our work. I believe these three models are very familiar to most of us. The first one is the standard recurrent neural network language model. The second one is the RNN model for intent detection, where the last hidden state output is used to produce the intent estimation.
And the third model uses a recurrent neural network for slot filling. Here, different from the RNN language model, the label output is connected back to the hidden state, so that the slot label dependencies can also be modeled in the RNN.
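In symbols, the three independent models can be summarized roughly as below; this is my paraphrase, with $h_t$ denoting the RNN hidden state at step $t$ and the $W$'s denoting output layers, not notation quoted from the slides.

```latex
% RNN language model: predict the next word from the word history.
P(w_{t+1} \mid w_{1:t}) = \operatorname{softmax}(W_{\text{LM}} h_t)

% RNN intent detection: classify the utterance from the last hidden state.
P(c \mid w_{1:T}) = \operatorname{softmax}(W_{\text{intent}} h_T)

% RNN slot filling: one label per step, with previous labels fed back
% into the hidden state so that label dependencies are modeled.
P(s_t \mid w_{1:t}, s_{1:t-1}) = \operatorname{softmax}(W_{\text{slot}} h_t)
```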
And here is our proposed joint model. Similar to the three independent RNN models, the inputs to the model are the words in the given utterance. So we have the word input, and the hidden layer output is used for the three different tasks. Here, c represents the intent class, s represents the slot label, and w represents the next word. The output from the RNN hidden state is first used to generate the intent estimation.
Once we obtain the intent class probability distribution, we draw a sample from this distribution as the sampled intent class. Similarly, we do the same thing for the slot label. Once we have these two vectors, we concatenate them into a single one, and use this concatenated context vector for the next word prediction. Also, we connect this context vector back to the RNN hidden state, such that the intent variations along the sequence, as well as the slot label dependencies, can be modeled.
In the recurrent neural network, basically, the task outputs at each time step depend on the task outputs from the previous time steps. So by using the chain rule, the three models for intent detection, slot filling, and language modeling can be factorized accordingly.
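One way to write this factorization out (again my notation, with $c_t$ and $s_t$ the intent class and slot label produced at step $t$; the talk only states the chain rule idea):

```latex
P(c_{1:T}, s_{1:T}, w_{2:T+1} \mid w_1) =
  \prod_{t=1}^{T}
    P(c_t \mid w_{1:t},\, c_{1:t-1},\, s_{1:t-1})\;
    P(s_t \mid w_{1:t},\, c_{1:t},\, s_{1:t-1})\;
    P(w_{t+1} \mid w_{1:t},\, c_{1:t},\, s_{1:t})
```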
Taking a closer look at our model: at each time step, the word goes into the RNN hidden state. The inputs to the hidden state are the hidden state from the previous time step, the intent class and slot label from the previous time step, and the word input from the current time step. Once we have the RNN hidden state output, we perform intent classification, slot filling, and next word prediction, in that sequence.
So here, the intent distribution, slot label distribution, and word distribution blocks each represent a multilayer perceptron for one of the different tasks. The reason why we apply a multilayer perceptron for each task is that we are using a shared representation, the RNN hidden state, for the three different tasks. In order to introduce additional discriminative power for the joint model, we use a multilayer perceptron for each task, instead of using a simple linear transformation.
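Putting the pieces together, here is a minimal sketch of one step of such a joint model; this is my reconstruction in PyTorch, not the authors' code, and the LSTM cell, layer sizes, and two-layer MLP heads are assumptions.

```python
import torch
import torch.nn as nn

class JointSLULM(nn.Module):
    """One step of a joint SLU + language model RNN, sketched after the
    talk: a shared LSTM state feeds per-task MLP heads; sampled intent
    and slot labels form a context vector that conditions the next word
    prediction (local context) and is fed back into the recurrent input
    at the next step (recurrent context)."""

    def __init__(self, vocab_size, n_intents, n_slots, d=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.intent_emb = nn.Embedding(n_intents, d)
        self.slot_emb = nn.Embedding(n_slots, d)
        # Step input: current word plus previous intent/slot context.
        self.cell = nn.LSTMCell(3 * d, d)

        def mlp(n_in, n_out):  # small MLP head instead of a plain linear layer
            return nn.Sequential(nn.Linear(n_in, d), nn.Tanh(),
                                 nn.Linear(d, n_out))

        self.intent_head = mlp(d, n_intents)
        self.slot_head = mlp(d, n_slots)
        self.word_head = mlp(3 * d, vocab_size)  # sees hidden state + context

    def step(self, word, prev_intent, prev_slot, state):
        x = torch.cat([self.word_emb(word),
                       self.intent_emb(prev_intent),
                       self.slot_emb(prev_slot)], dim=-1)
        h, c = self.cell(x, state)
        # 1) intent estimation, and a sample from the distribution
        intent_logits = self.intent_head(h)
        intent = torch.distributions.Categorical(logits=intent_logits).sample()
        # 2) slot label estimation and sample
        slot_logits = self.slot_head(h)
        slot = torch.distributions.Categorical(logits=slot_logits).sample()
        # 3) next word prediction, conditioned on the sampled context
        ctx = torch.cat([h, self.intent_emb(intent),
                         self.slot_emb(slot)], dim=-1)
        word_logits = self.word_head(ctx)
        return intent_logits, slot_logits, word_logits, intent, slot, (h, c)
```

In use, `state` would start as a pair of zero tensors and `prev_intent` / `prev_slot` as special begin-of-sequence indices, with `step` called once per word as it arrives from the ASR engine.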
"'kay" this one is about model training
is what we have seen so what we do use we
model the three different tasks jointly
so
doing model training the anywhere from the street given tasks
all probably are propagated
to the beginning of the input sequence
and we perform a linear interpolation of the cost for each task
so as
in this object a function
we can see that's we interpolate
the cost from the intent classification
from smart meeting and the language modeling linearly
and but addition be at one l two recommendations
to this object to this objective function
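Spelled out, the objective has roughly the following form; the symbols are mine, and the interpolation weights and regularization strength are hyperparameters the talk does not give.

```latex
\mathcal{L}(\theta) =
    \alpha\, \mathcal{L}_{\text{intent}}(\theta)
  + \beta\,  \mathcal{L}_{\text{slot}}(\theta)
  + \gamma\, \mathcal{L}_{\text{LM}}(\theta)
  + \lambda\, \lVert \theta \rVert_2^2
```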
As we have noted in the previous example, the intent estimation at the beginning of the sequence may not be very stable or accurate, so conditioning the next word prediction on a wrong intent class may not be desirable. To mitigate this effect, we propose a scheduled approach for adjusting the intent contribution to the context.
To be specific, during the first k steps, we disable the intent contribution to the context vector entirely, and after the k-th step, we gradually increase the intent contribution to the context vector until the end of the sequence. Here we propose to use a linear increasing function of the step, and other types of increasing functions, such as log functions, can also be explored.
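A sketch of one such schedule; the exact functional form is my guess, since the talk only specifies zero contribution for the first k steps followed by a linear increase toward the end of the sequence.

```python
def intent_context_weight(t, k, seq_len):
    """Weight on the intent contribution to the context vector at step t
    (1-indexed): disabled for the first k steps, then increased linearly
    to 1.0 by the end of the sequence."""
    if t <= k:
        return 0.0
    return min(1.0, (t - k) / max(1, seq_len - k))

# For a 10-word utterance with k = 3, the weights per step are
# 0, 0, 0, 1/7, 2/7, 3/7, 4/7, 5/7, 6/7, 1.
```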
Okay, so these are some model variations of the joint model that we just introduced. The first one is what we call the basic joint model. Here, the same shared representation from the RNN hidden state is used for the three different tasks, and there are no conditional dependencies among the three tasks; this is what we call the basic joint model.
In the second one, once we produce the intent estimation, the intent sample is connected locally to the next word prediction, without connecting it back to the RNN state. So we call this the model with local context.
In the third one, the context vector is not connected locally to the next word prediction; instead, it is connected back to the RNN hidden state. So we call this the model with recurrent context. The last variation is the one with both local and recurrent context, and this is the final model, as we have just seen.
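In code terms, the four variants differ only in two switches; the flag names below are mine, for illustration.

```python
# Hypothetical configuration flags for the four joint model variants.
VARIANTS = {
    "basic joint":            dict(local_context=False, recurrent_context=False),
    "local context only":     dict(local_context=True,  recurrent_context=False),
    "recurrent context only": dict(local_context=False, recurrent_context=True),
    "local + recurrent":      dict(local_context=True,  recurrent_context=True),
}
```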
Okay, next come the experiment setup and results. In the experiments, the dataset that we used is the Airline Travel Information System (ATIS) dataset. In this dataset, in total, we have eighteen intent classes and a hundred and twenty-seven slot labels. For intent detection, we evaluate our intent model on the intent classification error rate; for slot filling, we evaluate with the F1 score.
Some details about our RNN model configurations: we use LSTM cells as the basic RNN unit, for their stronger capability in terms of modeling longer-term dependencies. We perform mini-batch training using the Adam optimization method, and to improve the generalization capability of the proposed model, we use dropout and L2 regularization.
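As a rough picture of that setup, a hypothetical configuration might look like the following; the learning rate, weight decay, and dropout values are placeholders, not numbers from the talk (and `JointSLULM` is the sketch class from above).

```python
import torch

# 18 intent classes and 127 slot labels, as in the ATIS setup described.
model = JointSLULM(vocab_size=10000, n_intents=18, n_slots=127, d=128)

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # assumed learning rate
                             weight_decay=1e-4)  # L2 regularization
dropout = torch.nn.Dropout(p=0.5)                # assumed dropout rate
```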
In order to evaluate the robustness of our proposed model, we not only experiment with the true text input, but also with noisy speech input. So we use these two types of input, and there are some details of our ASR model settings, which we will see later. Basically, in these experiments, we report performance using these two types of input: the true text input, and the speech input with simulated noise.
We compare the performance of five different types of models on the three different tasks: intent detection, slot filling, and language modeling.
Here is the intent detection performance using the true text input. The five models, from left to right, are: the independently trained model for intent detection; the basic joint model, as we have seen just now in the model variations; third, the joint model with intent context; fourth, the joint model with slot label context; and last, the joint model with both types of context. As we can see, the joint model with both types of context performs the best, and it achieves a 26.3% relative error reduction over the independently trained intent model.
This is the slot filling performance using the true text input. As we can see, our proposed joint model shows a slight degradation in the slot filling F1 score compared to the independently trained models. This might be due to the fact that the proposed joint model lacks some discriminative power for the multiple tasks, because we are using the shared representation from the RNN hidden state output. This is one aspect that can be improved further in our future work on joint modeling.
This is the language modeling performance using the true text input. As we can see, the best performing model is the joint model with both intent and slot label context, and this model achieves about an eleven percent relative reduction of perplexity compared to the independently trained language model.
So one thing that we can note from this result is that the intent context is very important for producing good language modeling performance. Without the intent context, the joint model with slot label context alone produces very similar perplexity to the independently trained language model. So here we show that the intent information in the context is very important for language modeling.
And lastly, here are some results using the speech input, feeding the ASR output to our model. These are the four ASR model settings: the first one just uses the output directly from decoding; the second one, after decoding, does rescoring with a 5-gram language model; the third one does rescoring with the independently trained RNN language model; and the last one does rescoring using our proposed jointly trained model.
As we can see from these results, the joint training approach produces the best performance across all three evaluation criteria here: the word error rate for speech recognition, the intent error rate, and the slot F1 score. So basically, this result shows that even at this level of word error rate, our joint model can still produce competitive performance in intent detection and slot filling.
These numbers are slightly worse than in the experiments with the true text input, but the results on these two tasks show the robustness of our proposed model.
Okay, lastly, the conclusion. In this work, we proposed an RNN model for joint online spoken language understanding and language modeling. By modeling the three tasks jointly, our model is able to achieve improved performance on intent detection and language modeling, with a slight degradation in slot filling performance. In order to show the robustness of our model, we applied it to the ASR output of noisy speech input, and we observed consistent performance gains over the independently trained models by using our joint model. So this is the end of the talk.
Alright, okay. Okay, we have time for a few questions. Thanks.
Okay, so the question is: if I had the chance to redefine the corpus, what are the criteria that I would be looking for in the corpus? Yes.
Right. So basically, the thing here is, we can see that we are using recurrent neural network models, and typically such models on NLP tasks require a very large dataset to show stable and robust performance. So the first criterion is, of course, that if we can have a lot of data, that would be the best; the bigger the better, I would assume.
And the second thing that I can think of is that, for ATIS, the reason why this is a rather simple dataset is that it is very domain-limited, so most of the training utterances are closely related to flights and airline travel information. So if I could, you know, redesign the corpus, I would explore the multi-domain scenario, to see whether our model is able to perform really well not only in the domain-limited case, but also in more general, multi-domain cases. So that is what I would really care about in the corpus design.
Right, I completely agree with you; I think this is a very good suggestion. Here we are doing joint modeling of SLU and language modeling, and typically language modeling, you know, has to make a prediction of what the user might say at the next step. If the user utters, say, five words, we treat this as one single training instance. So in our experiments with the true text input, we don't have that situation, but in the ASR output we may be seeing partial phrases or corrections. We did not, you know, look into these particularly in this work, but that is something we can look into in future work. Alright, okay, thanks.
Just a quick comment related to the multi-domain language model suggestion: you know, the main problem is that the corpus we have for training an SLU model is usually very small, while for training a language model you need a big corpus. But with joint training, you know, you are saying that you have to have, you know, enough data to train a language model as well.
Right, I think, I believe in this domain, data, well-labeled data, is really a limitation, because we don't have very large well-labeled datasets for these SLU tasks. So I think if we can put more effort into generating, you know, better quality corpora, that will help a lot with this SLU research. Next question.
Yes, I did. Okay, so I think that is a very good question. We have a chart in the paper, though it is not included here in the presentation. Basically, we evaluated a number of different sizes of k; the basic schedule is that, starting from the k-th step, we gradually increase the intent contribution, and we show the training curve and validation curve for different k values. But basically, these k values are set manually in the experiments; they are not learned in any automatic way. I think, definitely, going forward, this is one of the hyperparameters that can be learned through a purely data-driven approach. It's just that, in the current work, we manually selected the k values and evaluated with these selected k values.
Okay, so let's thank the speaker again. Okay.