0:00:15 Alright, it is my great pleasure to introduce our next presenter, right after the two best paper nominees, so I hope you will also enjoy this talk.
0:00:27 Alright, so this work is about joint online spoken language understanding and language modeling with recurrent neural networks.
0:00:37 My name is Bing Liu, and this is joint work with my advisor Ian Lane. We are from Carnegie Mellon University.
0:00:46 Here is the outline of the talk. First, I will introduce the background and the motivation of our work.
0:00:52 Following that, we will explain our proposed method in detail, then come the experiment setup and the result analysis, and finally the conclusions will be given.
0:01:06 First, the background.
0:01:09 Spoken language understanding is one of the important components in spoken dialogue systems.
0:01:15 In SLU there are two major tasks: intent detection and slot filling.
0:01:20 Given a user query, we want the SLU system to identify the user's intent and also to extract useful semantic constituents from the user query.
0:01:32 Given an example query like "show me the flights from Seattle to San Diego", we want the SLU system to identify that the user is looking for flight information, which is the intent.
0:01:47 We also want to extract useful information such as the from location, the to location, and the departure time; this is the task of slot filling.
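For illustration, here is a minimal sketch of what such an annotation could look like; the slot label names follow the ATIS-style BIO convention and are assumptions for this example, not values stated in the talk.

```python
# Illustrative only: one possible intent / slot annotation for the example query,
# with BIO-style slot labels similar to those used in the ATIS corpus.
query = ["show", "me", "the", "flights", "from", "seattle", "to", "san", "diego"]

intent = "flight"  # intent detection: a single label for the whole utterance

slots = [          # slot filling: one label per word
    "O", "O", "O", "O", "O",
    "B-fromloc.city_name",
    "O",
    "B-toloc.city_name",
    "I-toloc.city_name",
]

for word, slot in zip(query, slots):
    print(f"{word:10s} {slot}")
```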
0:02:00 Intent detection can be treated as a sequence classification problem, so standard classifiers like support vector machines with n-gram features, convolutional neural networks, or recursive neural networks can be applied.
0:02:16 On the other hand, slot filling can be treated as a sequence labeling problem, so sequence models like maximum entropy Markov models, conditional random fields, and recurrent neural networks are good candidates for sequence labeling.
0:02:34 Intent detection and slot filling are typically processed separately in spoken language understanding systems.
0:02:41 A joint model that can perform the two tasks at the same time simplifies the SLU system, as only one model needs to be trained and tuned.
0:02:52 Also, by training two related tasks together, it is likely that we can improve the generalization performance of one task using the other related task.
0:03:05 Joint models for slot filling and intent detection have been proposed in the literature, using convolutional neural networks and recursive neural networks.
0:03:17 The limitation of these previously proposed joint SLU models is that their output is typically conditioned on the entire word sequence, which makes those models not very suitable for online tasks.
0:03:35 For example, in speech recognition, instead of receiving the transcribed text at the end of the speech, users typically prefer to see the ongoing transcription while they speak.
0:03:47 Similarly, in spoken language understanding, with real-time intent detection and slot filling, the system will be able to start processing the query while the user is still speaking.
0:04:01 So in this work, we want to develop a model that can perform online spoken language understanding as new words arrive from the ASR engine.
0:04:12 Moreover, we suggest that the SLU output can provide additional context for next word prediction in the ASR online decoding.
0:04:24 So we want to build a model that can perform online SLU and language modeling jointly.
0:04:33 Here is a simple visualization of our proposed idea.
0:04:37 Given a user query like "I want a first class flight from Phoenix to Seattle", we push this query to the ASR engine for online decoding.
0:04:48 With the arrival of the first few words, our intent model provides an estimation of the user intent based on the available information.
0:04:59 Here the intent model gives a very high confidence score to the intent class airfare and lower confidence scores to the other intent classes.
0:05:10 Conditioned on this intent estimation, the language model adjusts its next word prediction probabilities.
0:05:19 So here we see that the probability of "price" being the next word is quite high, because "price" is closely related to the intent airfare.
0:05:32 Then, with the arrival of another word, "flight", from the ASR engine, the intent model updates its intent estimation, increasing the confidence score for the intent class flight and reducing the confidence score for airfare.
0:05:52 Accordingly, the language model adjusts its next word prediction probabilities.
0:06:00 So here, location-related words such as "pittsburgh" and "phoenix" receive higher probabilities, and the probability of "price" is reduced.
0:06:13 With further word input from the ASR, our intent model becomes more confident that what the user is looking for is flight information, and accordingly the language model adjusts the next word probabilities, conditioned on the intent estimation.
0:06:36 In the end, we compute the probability of the entire utterance.
0:06:41 So this is the visualization of our proposed idea for joint online spoken language understanding and language modeling.
0:06:52 Okay, next, our proposed method.
0:06:57 Okay, here are the RNN, recurrent neural network, models for the three different tasks that we want to model in our work; I believe these three models are very familiar to most of us.
0:07:08 The first one is the standard recurrent neural network language model.
0:07:16 The second one is the RNN model for intent detection, where the last hidden state output is used to produce the intent estimation.
0:07:27 The third model uses a recurrent neural network for slot filling; here, different from the RNN language model, the output is connected back to the hidden state, so that the slot label dependencies can also be modeled in the RNN.
0:07:48 And here is our proposed joint model.
0:07:52 Similar to the three independent RNN models, the inputs to the model are the words in the given utterance.
0:08:02 So we have the word input, and the hidden layer output is used for the three different tasks; here c represents the intent class, s represents the slot label, and w represents the next word.
0:08:17 The output from the RNN hidden state is first used to generate the intent estimation.
0:08:26 Once we obtain the intent class probability distribution, we draw a sample from this distribution as the estimated intent class at that point.
0:08:39 Similarly, we do the same thing for the slot label.
0:08:42 Once we have these two vectors, we concatenate them into a single one, and use this concatenated context vector for the next word prediction.
0:08:51 We also connect this context vector back to the RNN hidden state, such that the intent variations along the sequence, as well as the slot label dependencies, can be modeled in the recurrent neural network.
0:09:09 Basically, the task outputs at each time step depend on the task outputs from the previous time steps, so by using the chain rule, the three models, for intent detection, slot filling, and language modeling, can be factorized accordingly, as sketched below.
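A hedged reconstruction of this factorization (the exact conditioning used in the paper may differ slightly): for an utterance w_1, ..., w_T with per-step intent estimates c_t and slot labels s_t,

```latex
P(c_{1:T},\, s_{1:T},\, w_{2:T+1} \mid w_1)
  \;=\; \prod_{t=1}^{T}
      P(c_t \mid w_{1:t},\, c_{1:t-1},\, s_{1:t-1})\,
      P(s_t \mid w_{1:t},\, c_{1:t},\, s_{1:t-1})\,
      P(w_{t+1} \mid w_{1:t},\, c_{1:t},\, s_{1:t})
```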
0:09:26 Taking a closer look at our model: at each time step, the word input goes into the RNN hidden state.
0:09:36 The inputs to the hidden state are the hidden state from the previous time step, the estimated intent class and slot label from the previous time step, and the word input from the current time step.
0:09:50 Once we have the RNN hidden state output, we perform intent classification, slot filling, and next word prediction, in that sequence.
0:09:59 Here, the intent distribution, slot label distribution, and word distribution components each represent a multilayer perceptron for one of the tasks.
0:10:09 The reason why we apply a multilayer perceptron for each task is that we are using a shared representation, the RNN hidden state, for the three different tasks.
0:10:21 In order to introduce additional discriminative power for the joint model, we use a multilayer perceptron for each task instead of a simple linear transformation, as in the sketch below.
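A minimal sketch of one time step of such a joint model, assuming a PyTorch-style implementation; the layer sizes, names, and the use of the full softmax distribution as context (rather than a sampled discrete label, as described in the talk) are simplifications, not details from the paper.

```python
import torch
import torch.nn as nn


class JointSLULM(nn.Module):
    """One-step joint model: shared LSTM state, per-task MLP heads."""

    def __init__(self, vocab_size, n_intents, n_slots, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The recurrent state also receives the previous intent / slot context.
        self.rnn_cell = nn.LSTMCell(emb_dim + n_intents + n_slots, hid_dim)
        # A small MLP head per task instead of a single linear transformation.
        self.intent_head = nn.Sequential(
            nn.Linear(hid_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, n_intents))
        self.slot_head = nn.Sequential(
            nn.Linear(hid_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, n_slots))
        self.word_head = nn.Sequential(
            nn.Linear(hid_dim + n_intents + n_slots, hid_dim), nn.Tanh(),
            nn.Linear(hid_dim, vocab_size))

    def step(self, word_id, state, context):
        """One time step: current word plus previous context in, three task outputs out."""
        x = torch.cat([self.embed(word_id), context], dim=-1)
        h, c = self.rnn_cell(x, state)
        intent_logits = self.intent_head(h)
        slot_logits = self.slot_head(h)
        # Use the predicted intent / slot distributions as the context vector
        # (the talk describes sampling a discrete label; the softmax outputs are
        # a simplification here).  This context conditions the next word
        # prediction and is fed back into the recurrent state at the next step.
        new_context = torch.cat([torch.softmax(intent_logits, dim=-1),
                                 torch.softmax(slot_logits, dim=-1)], dim=-1)
        word_logits = self.word_head(torch.cat([h, new_context], dim=-1))
        return intent_logits, slot_logits, word_logits, (h, c), new_context
```

The per-task MLP heads on top of the shared LSTM state mirror the point above about adding discriminative power beyond a single linear layer.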
0:10:40 Okay, this part is about model training.
0:10:44 As we have seen, what we do is model the three different tasks jointly.
0:10:52 During model training, the errors from the three tasks are all backpropagated to the beginning of the input sequence, and we perform a linear interpolation of the costs for each task.
0:11:06 As shown in the objective function, we linearly interpolate the costs from intent classification, slot filling, and language modeling, and in addition we add an L2 regularization term to this objective function.
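One way to write such an objective (the weight symbols are illustrative, not necessarily the paper's notation):

```latex
L(\theta) \;=\; \alpha_{c}\, L_{\text{intent}}(\theta)
          \;+\; \alpha_{s}\, L_{\text{slot}}(\theta)
          \;+\; \alpha_{w}\, L_{\text{lm}}(\theta)
          \;+\; \lambda \lVert \theta \rVert_2^2
```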
0:11:28 As we have noted in the previous example, the intent estimation at the beginning of the sequence may not be very stable or accurate.
0:11:41 So when we do next word prediction, conditioning on the wrong intent class may not be desirable.
0:11:47 To mitigate this effect, we propose a scheduled approach for adjusting the intent contribution to the context vector.
0:11:57 To be specific, during the first k steps we disable the intent contribution to the context vector entirely, and after step k we gradually increase the intent contribution to the context vector until the end of the sequence.
0:12:17 Here we propose to use a linear increasing function of the step; other types of increasing functions, such as log functions or exponential functions, could also be explored.
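A minimal sketch of this schedule, assuming a linear ramp from step k to the end of the sequence (the exact ramp used in the paper may differ):

```python
def intent_context_weight(t, k, seq_len):
    """Weight on the intent context vector at time step t (1-indexed):
    zero for the first k steps, then a linear ramp up to 1.0 by the last step."""
    if t <= k:
        return 0.0
    return min(1.0, (t - k) / max(1, seq_len - k))


# Example: a 10-word utterance with k = 3
weights = [intent_context_weight(t, k=3, seq_len=10) for t in range(1, 11)]
print([round(w, 2) for w in weights])  # 0.0 for the first 3 steps, then rises to 1.0
```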
0:12:31 Okay, so these are some variations of the joint model that we just introduced.
0:12:39 The first one is what we call the basic joint model: the same shared representation from the RNN hidden state is used for the three different tasks, and there are no conditional dependencies among the three tasks.
0:12:58 In the second one, once we produce the intent estimation, the intent sample is connected locally to the next word prediction, without connecting it back to the RNN state; we call this the model with local context.
0:13:19 In the third one, the context vector is not connected to the local next word prediction; instead, it is connected back to the RNN hidden state, so we call this the model with recurrent context.
0:13:35 The last variation is the one with both local and recurrent context, and this is the final model, as we have just seen.
0:13:46 Okay, next, the experiment setup and results.
0:13:52 In the experiments, the dataset that we used is the Airline Travel Information System (ATIS) dataset; in total it has eighteen intent classes and one hundred and twenty-seven slot labels.
0:14:04 For intent detection, we evaluated our intent model on the intent classification error rate; for slot filling, we evaluated the F1 score.
0:14:16 Some details about our RNN model configuration: we use LSTM cells as the basic RNN unit for their stronger capability in modeling longer-term dependencies.
0:14:29 We perform mini-batch training using the Adam optimization method, and to improve the generalization capability of the proposed model, we use dropout and L2 regularization.
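A minimal sketch of this training setup, reusing the JointSLULM class sketched earlier; the vocabulary size, learning rate, and dropout rate are assumptions for illustration, not values from the talk.

```python
import torch

# ATIS: 18 intent classes and 127 slot labels, as mentioned above.
# JointSLULM is the class from the earlier sketch; vocab_size here is assumed.
model = JointSLULM(vocab_size=10000, n_intents=18, n_slots=127)

optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # assumed learning rate
                             weight_decay=1e-4)  # L2 regularization
dropout = torch.nn.Dropout(p=0.5)                # applied to the LSTM hidden output
```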
0:14:43 In order to evaluate the robustness of our proposed model, we experiment not only with the true text input but also with noisy speech input.
0:14:59 So we use these two types of input, and there are some details of the ASR model settings which we will see later.
0:15:08 Basically, in these experiments we report performance using these two types of input, the true text input and the speech input with simulated noise, and compare the performance of five different types of models on the three different tasks: intent detection, slot filling, and language modeling.
0:15:31 And here is the intent detection performance using true text input.
0:15:40 The five models, from left to right, are: the independently trained model for intent detection; the basic joint model, as we saw just now in the model variations; the joint model with intent context; the joint model with slot label context; and the joint model with both types of context.
0:16:04 As we can see, the joint model with both types of context performs the best, and it achieves a 26.3 percent relative error reduction over the independently trained intent model.
0:16:21 Next is the slot filling performance using the true text input.
0:16:27 As we can see, our proposed joint model shows a slight degradation in the slot filling F1 score compared to the independently trained model.
0:16:39 This might be due to the fact that the proposed joint model lacks certain discriminative power for the multiple tasks, because we are using the shared representation from the RNN hidden state.
0:16:56 This is one aspect that can be improved further in our future work on joint modeling.
0:17:04 This is the language modeling performance using the true text input.
0:17:09 As we can see, the best performing model is the joint model with both intent and slot label context, and this model achieves an eleven percent relative reduction in perplexity compared to the independently trained language model.
0:17:27 One thing that we can note from this result is that the intent context is very important for producing good language modeling performance.
0:17:41 Without the intent context, the joint model with slot label context alone produces very similar perplexity to the independently trained model.
0:17:53 So here we show that the intent information in the context is very important for language modeling.
0:18:04 Lastly, here are some results using speech input, with the ASR output fed to our model.
0:18:11 These are the four ASR model settings: the first one uses the output directly from decoding; the second one performs rescoring after decoding with a 5-gram language model; the third one performs rescoring with the independently trained RNN language model; and the last one performs rescoring using our proposed jointly trained model.
0:18:36 As we can see from these results, the joint training approach produces the best performance across all three evaluation criteria: the word error rate for speech recognition, the intent error rate, and the slot filling F1 score.
0:18:56 Basically, this result shows that even with the word error rate introduced by the ASR, applying our joint model can still produce competitive performance in intent detection and slot filling.
0:19:13 These numbers are slightly worse than in the experiments with true text input, but these results show the robustness of our proposed model.
0:19:27 Okay, lastly, the conclusions.
0:19:30 In this work, we proposed an RNN model for joint online spoken language understanding and language modeling.
0:19:38 By modeling the three tasks jointly, our model is able to achieve improved performance on intent detection and language modeling, with a slight degradation in slot filling performance.
0:19:54 In order to show the robustness of our model, we applied it to the ASR output of noisy speech input, and we also observed consistent performance gains over the independently trained models by using our joint model.
0:20:13 So this is the end of the talk.
0:20:16 Alright, okay.
0:20:22 Okay, we have time for a few questions.
0:21:00 Okay, so the question is: if I were to collect a corpus to refine the model, what are the criteria that I would be looking for in the corpus? Yes.
0:21:10 Right, so basically, as we can see here, we are using recurrent neural network models, and typically such models on NLP tasks require very large datasets to show stable and robust performance.
0:21:27 So the first criterion is, of course, having a lot of data; that would be the best, and the bigger the better, I would assume.
0:21:35 The second thing I can think of is that the reason ATIS is a rather simple dataset is that it is very domain-limited, so most of the training utterances are closely related to flights and airline travel information.
0:21:56 So if I could design the corpus, I would explore a multi-domain scenario, to see whether our model is able to perform well not only in the domain-limited case but also in more general, multi-domain cases.
0:22:15 So that is what I would really care about in the corpus design.
0:22:47 Right, I completely agree with you; I think that is a very good suggestion, because here we are doing joint modeling of SLU and language modeling, and typically language modeling is used to make a prediction of what the user might say at the next step, so I think that would be very nice; that is a good point.
0:23:21 If the model has five words, maybe this is just one single training instance?
0:23:43 So in our experiments with the true text input, we don't have that situation; in the ASR output we may see partial phrases or corrections.
0:24:00 We did not look into that particularly in this work, but it is something we can look into in future work.
0:24:35 Alright, okay, thanks.
0:24:39 Just a quick comment related to the multi-domain language model point: the main problem is that the corpus we have for training an SLU model is usually very small, while for training a language model you would want a much bigger corpus; but with joint training you need data that has both, which constrains how you can train the language model.
0:25:04 Right, I think, I believe that in this domain well-labeled data is really a limitation, because we don't have very large well-labeled datasets for these SLU tasks.
0:25:18 So I think if we can put more effort into generating better quality corpora, that will help a lot with this SLU research.
0:25:27 Thanks for the question.
0:25:44 Yes, I did.
0:25:56 Okay, so I think that is a very good question. We have a chart in the paper, but I did not show it here in the presentation.
0:26:03 Basically, we evaluated a number of different values of k: starting from step k, we gradually increase the intent contribution, and we show the training curves and validation curves for the different k values.
0:26:23 But basically, these values were set manually in the experiments; they were not learned in this work.
0:26:32 I think, going forward, this is definitely one of the hyperparameters that could be learned with a purely data-driven approach; it is just that in the current work we manually selected the k values and evaluated with those values.
Okay, so let's thank the speaker again. Okay.