Right. It is my great pleasure to present right after the two best paper nominees, so I hope you will also like this talk.
Alright. So, this work is about joint online spoken language understanding and language modeling with recurrent neural networks. My name is Bing Liu, and this is joint work with my advisor Ian Lane. We are from Carnegie Mellon University.
Here is the outline of the talk. First, I will introduce the background and the motivation of our work. Following that, we will explain our proposed method in detail, and then come the experiment setup and the result analysis. Finally, conclusions will be given.
First, the background. Spoken language understanding is one of the important components in spoken dialogue systems. In SLU, there are two major tasks: intent detection and slot filling. Given a user query, we want the SLU system to identify the user's intent, and also to extract useful semantic constituents from the user query.
Given an example query like "show me the flights from Seattle to ..." that also names a destination city and a departure time, we want the SLU system to identify that the user is looking for flight information; that is the intent. And we also want to extract useful information such as the from location, the to location, and the departure time; this is the task of slot filling.
Intent detection can be treated as a sequence classification problem, so standard classifiers like support vector machines with n-gram features, convolutional neural networks, or recursive neural networks can be applied. On the other hand, slot filling can be treated as a sequence labeling problem, so sequence models like maximum entropy Markov models, conditional random fields, and recurrent neural networks are good candidates.
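As a concrete picture of the two tasks, here is a minimal sketch in Python with ATIS-style labels; the particular labels chosen for this utterance are my own illustration, not taken from the talk.

```python
# One utterance, two SLU tasks: a single intent class for the whole query,
# and one IOB slot label per word (labels here are illustrative).
query = "i want a first class flight from phoenix to seattle".split()

intent = "atis_flight"  # intent detection: sequence classification

slots = [                # slot filling: sequence labeling
    "O", "O", "O",
    "B-class_type", "I-class_type",
    "O", "O",
    "B-fromloc.city_name",
    "O",
    "B-toloc.city_name",
]

for word, slot in zip(query, slots):
    print(f"{word:10s} {slot}")
print("intent:", intent)
```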
Intent detection and slot filling are typically processed separately in spoken language understanding systems. A joint model that can perform the two tasks at the same time simplifies the SLU system, as only one model needs to be trained and deployed. Also, by training two related tasks together, it is likely that we can improve the generalization performance of one task using the other related task.
Joint models for slot filling and intent detection have been proposed in the literature, using convolutional neural networks and recursive neural networks. The limitation of these previously proposed joint models is that their output is typically conditioned on the entire word sequence, which makes those models not very suitable for online tasks.
For example, in speech recognition, instead of receiving the transcribed text at the end of the speech, users typically prefer to see the ongoing transcription while they speak. Similarly, in spoken language understanding, with real-time intent detection and slot filling, the downstream system will be able to start acting on the query while the user is still speaking.
So in this work, we want to develop a model that can perform online spoken language understanding as new words arrive from the ASR engine. Moreover, we suggest that the SLU output can provide additional context for the next word prediction in the ASR online decoding. So we want to build a model that can perform online SLU and language modeling jointly.
Here is a simple visualization of our proposed idea. Given a user query like "i want a first class flight from phoenix to seattle" that we feed to the ASR engine for online decoding: with the arrival of the first few words, our intent model, based on this available information, provides an estimation of the user intent. Here, the intent model gives a very high confidence score to the intent class airfare, and lower confidence scores to the other intent classes. Conditioning on this intent estimation, the language model adjusts its next word prediction probabilities. So here we see that the probability of "price" being the next word is pretty high, because "price" is closely related to the intent of airfare.
Then, with the arrival of another word, "flight", from the ASR engine, the intent model updates its intent estimation: it increases the confidence score for the intent class flight, and reduces the confidence score for airfare. Accordingly, the language model adjusts its next word prediction probabilities. So here, the location related words such as "pittsburgh" and "phoenix" receive higher probabilities, and the probability of "price" is reduced.
And with additional word inputs from the ASR, our intent model becomes more confident that what the user is looking for is flight information, and accordingly the language model adjusts the next word probabilities conditioned on the intent estimation, until in the end we compute the probability of the entire utterance. So this is a simple visualization of our proposed idea of joint online spoken language understanding and language modeling.
Okay, next: our proposed method.
Okay, here are the RNN (recurrent neural network) models for the three different tasks that we want to model in our work. I believe these three models are very familiar to most of us. The first one is the standard recurrent neural network language model. The second one is the RNN model for intent detection, where the last hidden state output is used to produce the intent estimation.
And the third model uses a recurrent neural network for slot filling. Here, different from the RNN language model, the label output is connected back to the hidden state, so that the slot label dependencies can also be modeled in the RNN.
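In symbols, the three independent models can be summarized roughly as below; this is my paraphrase, with $h_t$ denoting the RNN hidden state at step $t$ and the $W$'s denoting output layers, not notation quoted from the slides.

```latex
% RNN language model: predict the next word from the word history.
P(w_{t+1} \mid w_{1:t}) = \operatorname{softmax}(W_{\text{LM}} h_t)

% RNN intent detection: classify the utterance from the last hidden state.
P(c \mid w_{1:T}) = \operatorname{softmax}(W_{\text{intent}} h_T)

% RNN slot filling: one label per step, with previous labels fed back
% into the hidden state so that label dependencies are modeled.
P(s_t \mid w_{1:t}, s_{1:t-1}) = \operatorname{softmax}(W_{\text{slot}} h_t)
```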
And here is our proposed joint model. Similar to the three independent RNN models, the inputs to the model are the words in the given utterance. So we have the word input, and the hidden layer output is used for the three different tasks. Here, c represents the intent class, s represents the slot label, and w represents the next word. The output from the RNN hidden state is first used to generate the intent estimation.
Once we obtain the intent class probability distribution, we draw a sample from this distribution as the sampled intent class. Similarly, we do the same thing for the slot label. Once we have these two vectors, we concatenate them into a single one, and use this concatenated context vector for the next word prediction. Also, we connect this context vector back to the RNN hidden state, such that the intent variations along the sequence, as well as the slot label dependencies, can be modeled.
In the recurrent neural network, basically, the task outputs at each time step depend on the task outputs from the previous time steps. So by using the chain rule, the three models for intent detection, slot filling, and language modeling can be factorized accordingly.
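One way to write this factorization out (again my notation, with $c_t$ and $s_t$ the intent class and slot label produced at step $t$; the talk only states the chain rule idea):

```latex
P(c_{1:T}, s_{1:T}, w_{2:T+1} \mid w_1) =
  \prod_{t=1}^{T}
    P(c_t \mid w_{1:t},\, c_{1:t-1},\, s_{1:t-1})\;
    P(s_t \mid w_{1:t},\, c_{1:t},\, s_{1:t-1})\;
    P(w_{t+1} \mid w_{1:t},\, c_{1:t},\, s_{1:t})
```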
Taking a closer look at our model: at each time step, the word goes into the RNN hidden state. The inputs to the hidden state are the hidden state from the previous time step, the intent class and slot label from the previous time step, and the word input from the current time step. Once we have the RNN hidden state output, we perform intent classification, slot filling, and next word prediction, in that sequence.
So here, the intent distribution, slot label distribution, and word distribution blocks each represent a multilayer perceptron for one of the different tasks. The reason why we apply a multilayer perceptron for each task is that we are using a shared representation, the RNN hidden state, for the three different tasks. In order to introduce additional discriminative power for the joint model, we use a multilayer perceptron for each task, instead of using a simple linear transformation.
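Putting the pieces together, here is a minimal sketch of one step of such a joint model; this is my reconstruction in PyTorch, not the authors' code, and the LSTM cell, layer sizes, and two-layer MLP heads are assumptions.

```python
import torch
import torch.nn as nn

class JointSLULM(nn.Module):
    """One step of a joint SLU + language model RNN, sketched after the
    talk: a shared LSTM state feeds per-task MLP heads; sampled intent
    and slot labels form a context vector that conditions the next word
    prediction (local context) and is fed back into the recurrent input
    at the next step (recurrent context)."""

    def __init__(self, vocab_size, n_intents, n_slots, d=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.intent_emb = nn.Embedding(n_intents, d)
        self.slot_emb = nn.Embedding(n_slots, d)
        # Step input: current word plus previous intent/slot context.
        self.cell = nn.LSTMCell(3 * d, d)

        def mlp(n_in, n_out):  # small MLP head instead of a plain linear layer
            return nn.Sequential(nn.Linear(n_in, d), nn.Tanh(),
                                 nn.Linear(d, n_out))

        self.intent_head = mlp(d, n_intents)
        self.slot_head = mlp(d, n_slots)
        self.word_head = mlp(3 * d, vocab_size)  # sees hidden state + context

    def step(self, word, prev_intent, prev_slot, state):
        x = torch.cat([self.word_emb(word),
                       self.intent_emb(prev_intent),
                       self.slot_emb(prev_slot)], dim=-1)
        h, c = self.cell(x, state)
        # 1) intent estimation, and a sample from the distribution
        intent_logits = self.intent_head(h)
        intent = torch.distributions.Categorical(logits=intent_logits).sample()
        # 2) slot label estimation and sample
        slot_logits = self.slot_head(h)
        slot = torch.distributions.Categorical(logits=slot_logits).sample()
        # 3) next word prediction, conditioned on the sampled context
        ctx = torch.cat([h, self.intent_emb(intent),
                         self.slot_emb(slot)], dim=-1)
        word_logits = self.word_head(ctx)
        return intent_logits, slot_logits, word_logits, intent, slot, (h, c)
```

In use, `state` would start as a pair of zero tensors and `prev_intent` / `prev_slot` as special begin-of-sequence indices, with `step` called once per word as it arrives from the ASR engine.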
"'kay" this one is about model training
is what we have seen so what we do use we
model the three different tasks jointly
so
doing model training the anywhere from the street given tasks
all probably are propagated
to the beginning of the input sequence
and we perform a linear interpolation of the cost for each task
so as
in this object a function
we can see that's we interpolate
the cost from the intent classification
from smart meeting and the language modeling linearly
and but addition be at one l two recommendations
to this object to this objective function
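Spelled out, the objective has roughly the following form; the symbols are mine, and the interpolation weights and regularization strength are hyperparameters the talk does not give.

```latex
\mathcal{L}(\theta) =
    \alpha\, \mathcal{L}_{\text{intent}}(\theta)
  + \beta\,  \mathcal{L}_{\text{slot}}(\theta)
  + \gamma\, \mathcal{L}_{\text{LM}}(\theta)
  + \lambda\, \lVert \theta \rVert_2^2
```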
As we have noted in the previous example, the intent estimation at the beginning of the sequence may not be very stable or accurate, so conditioning the next word prediction on a wrong intent class may not be desirable. To mitigate this effect, we propose a scheduled approach for adjusting the intent contribution to the context.
To be specific, during the first k steps, we disable the intent contribution to the context vector entirely, and after the k-th step, we gradually increase the intent contribution to the context vector until the end of the sequence. Here we propose to use a linear increasing function of the step, and other types of increasing functions, such as log functions, can also be explored.
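A sketch of one such schedule; the exact functional form is my guess, since the talk only specifies zero contribution for the first k steps followed by a linear increase toward the end of the sequence.

```python
def intent_context_weight(t, k, seq_len):
    """Weight on the intent contribution to the context vector at step t
    (1-indexed): disabled for the first k steps, then increased linearly
    to 1.0 by the end of the sequence."""
    if t <= k:
        return 0.0
    return min(1.0, (t - k) / max(1, seq_len - k))

# For a 10-word utterance with k = 3, the weights per step are
# 0, 0, 0, 1/7, 2/7, 3/7, 4/7, 5/7, 6/7, 1.
```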
Okay, so these are some model variations of the joint model that we just introduced. The first one is what we call the basic joint model. Here, the same shared representation from the RNN hidden state is used for the three different tasks, and there are no conditional dependencies among the three tasks; this is what we call the basic joint model.
In the second one, once we produce the intent estimation, the intent sample is connected locally to the next word prediction, without connecting it back to the RNN state. So we call this the model with local context.
In the third one, the context vector is not connected locally to the next word prediction; instead, it is connected back to the RNN hidden state. So we call this the model with recurrent context. The last variation is the one with both local and recurrent context, and this is the final model, as we have just seen.
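In code terms, the four variants differ only in two switches; the flag names below are mine, for illustration.

```python
# Hypothetical configuration flags for the four joint model variants.
VARIANTS = {
    "basic joint":            dict(local_context=False, recurrent_context=False),
    "local context only":     dict(local_context=True,  recurrent_context=False),
    "recurrent context only": dict(local_context=False, recurrent_context=True),
    "local + recurrent":      dict(local_context=True,  recurrent_context=True),
}
```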
Okay, next come the experiment setup and results. In the experiments, the dataset that we used is the Airline Travel Information System (ATIS) dataset. In this dataset, in total, we have eighteen intent classes and a hundred and twenty-seven slot labels. For intent detection, we evaluate our intent model on the intent classification error rate; for slot filling, we evaluate with the F1 score.
Some details about our RNN model configurations: we use LSTM cells as the basic RNN unit, for their stronger capability in terms of modeling longer-term dependencies. We perform mini-batch training using the Adam optimization method, and to improve the generalization capability of the proposed model, we use dropout and L2 regularization.
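As a rough picture of that setup, a hypothetical configuration might look like the following; the learning rate, weight decay, and dropout values are placeholders, not numbers from the talk (and `JointSLULM` is the sketch class from above).

```python
import torch

# 18 intent classes and 127 slot labels, as in the ATIS setup described.
model = JointSLULM(vocab_size=10000, n_intents=18, n_slots=127, d=128)

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # assumed learning rate
                             weight_decay=1e-4)  # L2 regularization
dropout = torch.nn.Dropout(p=0.5)                # assumed dropout rate
```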
In order to evaluate the robustness of our proposed model, we not only experiment with the true text input, but also with noisy speech input. So we use these two types of input, and there are some details of our ASR model settings, which we will see later. Basically, in these experiments, we report performance using these two types of input: the true text input, and the speech input with simulated noise.
We compare the performance of five different types of models on the three different tasks: intent detection, slot filling, and language modeling.
Here is the intent detection performance using the true text input. The five models, from left to right, are: the independently trained model for intent detection; the basic joint model, as we have seen just now in the model variations; third, the joint model with intent context; fourth, the joint model with slot label context; and last, the joint model with both types of context. As we can see, the joint model with both types of context performs the best, and it achieves a 26.3% relative error reduction over the independently trained intent model.
This is the slot filling performance using the true text input. As we can see, our proposed joint model shows a slight degradation in the slot filling F1 score compared to the independently trained models. This might be due to the fact that the proposed joint model lacks some discriminative power for the multiple tasks, because we are using the shared representation from the RNN hidden state output. This is one aspect that can be improved further in our future work on joint modeling.
This is the language modeling performance using the true text input. As we can see, the best performing model is the joint model with both intent and slot label context, and this model achieves about an eleven percent relative reduction of perplexity compared to the independently trained language model.
So one thing that we can note from this result is that the intent context is very important for producing good language modeling performance. Without the intent context, the joint model with slot label context alone produces very similar perplexity to the independently trained language model. So here we show that the intent information in the context is very important for language modeling.
And lastly, here are some results using the speech input, feeding the ASR output to our model. These are the four ASR model settings: the first one just uses the output directly from decoding; the second one, after decoding, does rescoring with a 5-gram language model; the third one does rescoring with the independently trained RNN language model; and the last one does rescoring using our proposed jointly trained model.
As we can see from these results, the joint training approach produces the best performance across all three evaluation criteria here: the word error rate for speech recognition, the intent error rate, and the slot F1 score. So basically, this result shows that even at this level of word error rate, our joint model can still produce competitive performance in intent detection and slot filling.
These numbers are slightly worse than in the experiments with the true text input, but the results on these two tasks show the robustness of our proposed model.
Okay, lastly, the conclusion. In this work, we proposed an RNN model for joint online spoken language understanding and language modeling. By modeling the three tasks jointly, our model is able to achieve improved performance on intent detection and language modeling, with a slight degradation in slot filling performance. In order to show the robustness of our model, we applied it to the ASR output of noisy speech input, and we observed consistent performance gains over the independently trained models by using our joint model. So this is the end of the talk.
Alright, okay. Okay, we have time for a few questions. Thanks.
Okay, so the question is: if I had the chance to redefine the corpus, what are the criteria that I would be looking for in the corpus? Yes.
Right. So basically, the thing here is, we can see that we are using recurrent neural network models, and typically such models on NLP tasks require a very large dataset to show stable and robust performance. So the first criterion is, of course, that if we can have a lot of data, that would be the best; the bigger the better, I would assume.
And the second thing that I can think of is that, for ATIS, the reason why this is a rather simple dataset is that it is very domain-limited, so most of the training utterances are closely related to flights and airline travel information. So if I could, you know, redesign the corpus, I would explore the multi-domain scenario, to see whether our model is able to perform really well not only in the domain-limited case, but also in more general, multi-domain cases. So that is what I would really care about in the corpus design.
Right, I completely agree with you; I think this is a very good suggestion. Here we are doing joint modeling of SLU and language modeling, and typically language modeling, you know, has to make a prediction of what the user might say at the next step. If the user utters, say, five words, we treat this as one single training instance. So in our experiments with the true text input, we don't have that situation, but in the ASR output we may be seeing partial phrases or corrections. We did not, you know, look into these particularly in this work, but that is something we can look into in future work. Alright, okay, thanks.
Just a quick comment related to the multi-domain language model suggestion: you know, the main problem is that the corpus we have for training an SLU model is usually very small, while for training a language model you need a big corpus. But with joint training, you know, you are saying that you have to have, you know, enough data to train a language model as well.
Right, I think, I believe in this domain, data, well-labeled data, is really a limitation, because we don't have very large well-labeled datasets for these SLU tasks. So I think if we can put more effort into generating, you know, better quality corpora, that will help a lot with this SLU research. Next question.
Yes, I did. Okay, so I think that is a very good question. We have a chart in the paper, though it is not included here in the presentation. Basically, we evaluated a number of different sizes of k; the basic schedule is that, starting from the k-th step, we gradually increase the intent contribution, and we show the training curve and validation curve for different k values. But basically, these k values are set manually in the experiments; they are not learned in any automatic way. I think, definitely, going forward, this is one of the hyperparameters that can be learned through a purely data-driven approach. It's just that, in the current work, we manually selected the k values and evaluated with these selected k values.
Okay, so let's thank the speaker again. Okay.