0:00:15 | right |

0:00:16 | it is my great honor |

0:00:18 | to present right after the two best paper nominees |

0:00:23 | so i hope you will also like this talk |

0:00:27 | alright so |

0:00:29 | so this work is about |

0:00:31 | joint online spoken language understanding and language modeling |

0:00:35 | with recurrent neural networks |

0:00:37 | my name is bing liu |

0:00:38 | this is joint work with my advisor ian lane |

0:00:42 | we are from carnegie mellon university |

0:00:46 | and this is the outline of the talk |

0:00:48 | first i will introduce the background and the motivation of our work |

0:00:52 | following that we will explain in detail our proposed method |

0:00:57 | and then comes the experiment setup and the result analysis and finally |

0:01:03 | conclusions will be presented |

0:01:06 | first the background |

0:01:09 | spoken language understanding is one of the important components in spoken dialogue systems |

0:01:15 | in slu |

0:01:16 | there are two major tasks |

0:01:18 | intent detection and slot filling |

0:01:20 | given a user query we want the slu system to identify the user's intent |

0:01:26 | and also to extract |

0:01:28 | useful semantic constituents from the user query |

0:01:32 | given an |

0:01:33 | example query like |

0:01:35 | show me the flights from seattle to san francisco |

0:01:38 | we want the slu system |

0:01:41 | to identify that |

0:01:43 | the user is looking for flight information that is the intent |

0:01:47 | and we also want to |

0:01:49 | extract useful information such as the from location |

0:01:53 | the to location |

0:01:54 | and the departure time this is the task of slot filling |
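As a concrete aside for readers of this transcript: intent and slot outputs for such a query are conventionally written in the IOB tagging scheme, one slot label per token plus one intent label per query. A minimal sketch, where the token sequence and label names are illustrative stand-ins rather than quotes from the talk:

```python
# Illustrative ATIS-style annotation (label names are hypothetical examples of
# the IOB scheme commonly used for slot filling, not quoted from the talk).
tokens = ["show", "me", "flights", "from", "seattle", "to", "san", "francisco"]
slots = ["O", "O", "O", "O", "B-fromloc", "O", "B-toloc", "I-toloc"]
intent = "flight"  # one sequence-level intent class for the whole query

# Slot filling assigns one label per token; intent detection labels the query.
assert len(tokens) == len(slots)
```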

0:02:00 | intent detection |

0:02:02 | can be treated as a sequence classification problem |

0:02:05 | so standard classifiers |

0:02:07 | like |

0:02:08 | support vector machines with n-gram features |

0:02:11 | or convolutional neural networks |

0:02:12 | and recursive neural networks can be applied |

0:02:16 | on the other hand slot filling |

0:02:19 | can be treated as a sequence labeling problem |

0:02:21 | so sequence models like maximum entropy markov model |

0:02:26 | conditional random fields |

0:02:27 | and recurrent neural networks |

0:02:29 | are good candidates for sequence labeling |

0:02:34 | intent detection and slot filling are typically processed separately |

0:02:38 | in spoken language understanding systems |

0:02:41 | a joint model |

0:02:42 | that can perform the two tasks |

0:02:44 | at the same time simplifies |

0:02:46 | the slu system |

0:02:48 | as only one model needs to be trained and tuned |

0:02:52 | also |

0:02:53 | by training |

0:02:55 | two related tasks together |

0:02:57 | it is likely that |

0:02:59 | we can improve the generalization performance of a task |

0:03:02 | using the other related task |

0:03:05 | joint models for slot filling and intent detection have been proposed in the literature |

0:03:10 | using convolutional neural networks |

0:03:12 | and recursive neural networks |

0:03:17 | the limitation of the previously proposed joint models |

0:03:22 | is that |

0:03:24 | the output of these models is typically conditioned |

0:03:29 | on the entire word sequence |

0:03:31 | which makes those models not very suitable for online tasks |

0:03:35 | for example in speech recognition |

0:03:37 | instead of receiving the transcribed text |

0:03:40 | at the end of the speech |

0:03:42 | users typically prefer to see the ongoing transcription |

0:03:45 | while the user speaks |

0:03:47 | similarly in spoken language understanding |

0:03:50 | with real-time intent detection and slot filling |

0:03:53 | the downstream system will be able to process the user query |

0:03:57 | while the user dictates |

0:04:01 | so in this work |

0:04:02 | we want to develop a model that can perform online spoken language understanding |

0:04:08 | as new words arrive from the asr engine |

0:04:12 | moreover |

0:04:13 | we suggest that |

0:04:15 | the slu output |

0:04:16 | can provide additional context for the next word prediction |

0:04:20 | in the asr online decoding |

0:04:24 | so we want to build a model that can perform online slu |

0:04:28 | and language modeling jointly |

0:04:33 | here is a simple visualization of our proposed idea |

0:04:37 | so given a user query like i want a first class flight from |

0:04:41 | phoenix to seattle |

0:04:43 | and we push this query to the asr engine for online decoding |

0:04:48 | with the arrival of the first few |

0:04:50 | words |

0:04:51 | our intent model |

0:04:53 | based on the available information |

0:04:55 | provides an estimation of the user intent |

0:04:58 | and |

0:04:59 | the |

0:05:00 | intent model gives a very high confidence score |

0:05:03 | to |

0:05:04 | the intent class airfare and lower |

0:05:07 | confidence scores for the other intent classes |

0:05:10 | conditioned on this intent estimation |

0:05:14 | the language model |

0:05:15 | adjusts the next word |

0:05:17 | prediction probabilities |

0:05:19 | so here we see that |

0:05:21 | the probability for price being the next word is pretty high because |

0:05:26 | price |

0:05:27 | is closely related |

0:05:29 | to the intent of airfare |

0:05:32 | then with the arrival of another word flight from the asr engine |

0:05:37 | the intent model updates its intent estimation |

0:05:41 | and increases |

0:05:43 | the confidence score for the intent class flight |

0:05:45 | and |

0:05:47 | reduces the |

0:05:49 | confidence score for airfare |

0:05:51 | accordingly |

0:05:52 | the language model |

0:05:54 | adjusts its |

0:05:56 | next word prediction probabilities |

0:06:00 | so here |

0:06:01 | the location related words such as pittsburgh and phoenix |

0:06:06 | receive higher probability |

0:06:07 | and the probability of price |

0:06:10 | is reduced |

0:06:13 | and with the |

0:06:14 | additional input from the |

0:06:16 | asr |

0:06:17 | of more words |

0:06:19 | our intent model becomes more confident that what the user is looking for is the |

0:06:24 | flight information |

0:06:25 | and accordingly the language model |

0:06:27 | adjusts the next word probability |

0:06:30 | conditioned on the intent estimation |

0:06:35 | and |

0:06:36 | until we complete the processing |

0:06:39 | of the entire utterance |

0:06:41 | so this is a simple visualization of our |

0:06:45 | proposed idea for joint online spoken language understanding and language |

0:06:50 | modeling |

0:06:52 | okay next |

0:06:53 | our proposed method |

0:06:57 | okay here are the rnn |

0:07:00 | recurrent neural net models |

0:07:01 | for the three different tasks |

0:07:03 | that we want to model in our work i believe |

0:07:08 | these three models are very familiar to most of us the first one is the |

0:07:12 | standard recurrent |

0:07:14 | neural network language model |

0:07:16 | the second one is the rnn model for intent detection |

0:07:20 | so |

0:07:20 | the last hidden state output |

0:07:23 | is used to produce the intent estimation |

0:07:27 | and the third model uses a recurrent neural network for slot filling |

0:07:31 | here different from the rnn language model |

0:07:34 | the |

0:07:36 | rnn output is connected back to the hidden state so that the slot |

0:07:41 | label dependencies can also be modeled |

0:07:44 | in the rnn |

0:07:48 | and here is our proposed joint model |

0:07:52 | so similar to the independent training models the inputs to the model |

0:07:56 | are the words in the given utterance |

0:08:01 | okay |

0:08:02 | so we have the words as input |

0:08:05 | and the hidden layer outputs are used for the three different tasks |

0:08:10 | so here c represents the intent class |

0:08:12 | s represents the slot label |

0:08:14 | and |

0:08:15 | w represents the next word |

0:08:17 | so the output from the rnn hidden state is first |

0:08:22 | used to generate |

0:08:24 | the |

0:08:24 | intent estimation |

0:08:26 | once we obtain the |

0:08:29 | intent class probability distribution we draw a sample from this probability distribution |

0:08:34 | as the |

0:08:36 | sampled intent class |

0:08:39 | similarly we do the same thing for the slot label |

0:08:42 | once we have these two vectors we concatenate these two vectors into a single |

0:08:46 | one |

0:08:47 | and use this concatenated context vector |

0:08:49 | for the next word prediction |

0:08:51 | also we connect this context vector |

0:08:54 | back |

0:08:55 | to the rnn hidden state |

0:08:57 | such that the intent variations along the sequence |

0:09:01 | as well as the slot label dependencies can be modeled |

0:09:05 | in the recurrent neural network |

0:09:09 | well basically |

0:09:10 | the task outputs |

0:09:12 | at each time-step depend on the task outputs from previous time steps |

0:09:16 | so by using the chain rule the three |

0:09:19 | models intent detection slot filling and language modeling can be factorized accordingly |
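To make this factorization concrete, here is a toy numpy sketch of one time step of such a joint model. The dimensions, random initialization, and plain tanh cell are placeholder assumptions (the model described in the talk uses trained LSTM cells); the sketch only illustrates the data flow: a hidden update from the previous state plus the sampled intent/slot context, then the three output distributions in sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the talk does not state exact dimensions).
H, V, N_INTENTS, N_SLOTS = 8, 12, 3, 5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# Untrained placeholder parameters for a single joint step.
W_hh = rng.normal(0, 0.1, (H, H))            # recurrent weights
W_xh = rng.normal(0, 0.1, (V, H))            # word-input weights
W_ch = rng.normal(0, 0.1, (N_INTENTS, H))    # recurrent intent-context weights
W_sh = rng.normal(0, 0.1, (N_SLOTS, H))      # recurrent slot-context weights
W_hi = rng.normal(0, 0.1, (H, N_INTENTS))    # intent output head
W_hs = rng.normal(0, 0.1, (H, N_SLOTS))      # slot output head
W_hw = rng.normal(0, 0.1, (H + N_INTENTS + N_SLOTS, V))  # word head with context

def joint_step(h_prev, c_prev, s_prev, word_id):
    """One time step: update the hidden state from the previous hidden state,
    the previous intent/slot context, and the current word; then produce the
    intent, slot, and next-word distributions in sequence."""
    x = one_hot(word_id, V)
    h = np.tanh(h_prev @ W_hh + x @ W_xh + c_prev @ W_ch + s_prev @ W_sh)
    p_intent = softmax(h @ W_hi)
    p_slot = softmax(h @ W_hs)
    # Sample an intent and a slot label; their one-hot vectors, concatenated
    # with h, form the context for next-word prediction.
    c = one_hot(rng.choice(N_INTENTS, p=p_intent), N_INTENTS)
    s = one_hot(rng.choice(N_SLOTS, p=p_slot), N_SLOTS)
    p_word = softmax(np.concatenate([h, c, s]) @ W_hw)
    return h, c, s, p_intent, p_slot, p_word

# Run a toy three-word sequence.
h, c, s = np.zeros(H), np.zeros(N_INTENTS), np.zeros(N_SLOTS)
for w in [3, 1, 4]:
    h, c, s, p_intent, p_slot, p_word = joint_step(h, c, s, w)
```

Feeding the sampled labels back into the hidden state is what gives the model its recurrent context; dropping that feedback recovers the "local context" variant discussed later.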

0:09:26 | a closer look at our model |

0:09:29 | at each time-step the word goes into the rnn hidden state |

0:09:33 | and |

0:09:33 | the inputs to the hidden state |

0:09:36 | are the hidden state from the previous time step |

0:09:40 | the intent class and slot labels from the previous time step |

0:09:44 | and the word input from the current time step |

0:09:47 | and |

0:09:48 | once we have the rnn state output |

0:09:50 | we perform |

0:09:52 | intent classification |

0:09:53 | slot filling and next word prediction |

0:09:57 | in sequence |

0:09:59 | so here |

0:10:00 | the intent distribution slot label distribution and word distribution |

0:10:04 | each represents a |

0:10:05 | multilayer perceptron for one of the different tasks |

0:10:09 | the reason why we apply |

0:10:10 | a multilayer perceptron for each task is because |

0:10:14 | we are using a shared representation |

0:10:16 | which is the rnn hidden state for the three different tasks |

0:10:21 | so in order to |

0:10:24 | introduce additional discriminative power |

0:10:27 | for the joint model |

0:10:28 | we use |

0:10:31 | a multilayer perceptron for each task |

0:10:33 | instead of using a simple linear transformation |

0:10:40 | okay this one is about model training |

0:10:44 | as what we have seen what we do is we |

0:10:48 | model the three different tasks jointly |

0:10:50 | so |

0:10:52 | during model training the errors from the three given tasks |

0:10:55 | are all back-propagated |

0:10:57 | to the beginning of the input sequence |

0:11:00 | and we perform a linear interpolation of the cost for each task |

0:11:04 | so as shown |

0:11:06 | in this objective function |

0:11:08 | we can see that we interpolate |

0:11:10 | the costs from the intent classification |

0:11:14 | from slot filling and the language modeling linearly |

0:11:17 | and in addition we add the l2 regularization |

0:11:23 | to this objective function |
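The objective just described can be sketched as follows; the interpolation weights and the L2 coefficient below are hypothetical hyperparameter choices, since the talk only states the form of the objective (a weighted sum of the three task costs plus L2 regularization):

```python
import numpy as np

def cross_entropy(p, target):
    """Negative log-likelihood of the target class under distribution p."""
    return -np.log(p[target] + 1e-12)

def joint_loss(p_intent, intent_t, p_slots, slot_ts, p_words, word_ts,
               params, alpha=(1.0, 1.0, 1.0), l2=1e-4):
    """Linear interpolation of intent, slot-filling, and LM costs, plus L2.
    `alpha` and `l2` are illustrative values, not the paper's settings."""
    loss_intent = cross_entropy(p_intent, intent_t)
    loss_slot = float(np.mean([cross_entropy(p, t) for p, t in zip(p_slots, slot_ts)]))
    loss_lm = float(np.mean([cross_entropy(p, t) for p, t in zip(p_words, word_ts)]))
    reg = l2 * sum(float(np.sum(w ** 2)) for w in params)
    return alpha[0] * loss_intent + alpha[1] * loss_slot + alpha[2] * loss_lm + reg
```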

0:11:28 | as we have noted in the previous example |

0:11:32 | the intent estimation at the beginning of the sequence |

0:11:36 | may not be very stable and accurate |

0:11:39 | so |

0:11:41 | when we do next word prediction |

0:11:43 | conditioning on the wrong intent class |

0:11:46 | may not be desirable |

0:11:47 | to mitigate this effect |

0:11:50 | we propose a scheduled approach |

0:11:52 | in adjusting the intent contribution to the context vector |

0:11:57 | so to be specific |

0:11:58 | during the first k steps |

0:12:01 | we disable |

0:12:02 | the intent contribution to the context vector |

0:12:06 | entirely |

0:12:07 | and after the k-th step |

0:12:09 | we gradually |

0:12:10 | increase |

0:12:11 | the intent contribution to the context vector |

0:12:15 | until the end of the sequence |

0:12:17 | so here we |

0:12:19 | propose to use a linear increasing function of the time step and other |

0:12:22 | types of increasing functions like log functions can also be explored |
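That schedule can be sketched as a per-step weight on the intent contribution. The linear ramp is what the talk proposes; the concrete form below (reaching 1.0 at the last step of the sequence) is an assumption:

```python
def intent_context_weight(t, k, seq_len):
    """Weight on the intent contribution to the context vector at time step t.

    For the first k steps the contribution is disabled entirely; after step k
    it increases linearly, reaching 1.0 at the last step of the sequence.
    (The exact endpoint of the ramp is an assumed detail.)
    """
    if t < k:
        return 0.0
    if seq_len <= k:  # degenerate case: sequence shorter than the warm-up
        return 1.0
    return min(1.0, (t - k) / (seq_len - k))
```

A log ramp or another increasing function could be swapped in for the linear one, as the talk notes.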

0:12:31 | okay so these are some model variations of the joint model that we |

0:12:36 | introduced just now |

0:12:39 | the first one is what we call |

0:12:40 | the basic joint model |

0:12:42 | so here |

0:12:44 | the same shared representation from the rnn hidden state |

0:12:48 | is used for the three different tasks |

0:12:50 | and there are no conditional dependencies |

0:12:54 | among these three different tasks so this is what we call the basic joint |

0:12:57 | model |

0:12:58 | the second one |

0:13:01 | once we produce the |

0:13:03 | intent estimation |

0:13:04 | the intent sample is connected |

0:13:07 | locally |

0:13:08 | to the next word prediction |

0:13:10 | without connecting it back to the rnn state |

0:13:14 | so we call this model |

0:13:16 | the |

0:13:17 | model with local context |

0:13:19 | the third one |

0:13:21 | here |

0:13:22 | the context vector is not connected locally to the next word prediction |

0:13:26 | instead it is connected back to the rnn hidden state |

0:13:30 | so we call this model |

0:13:32 | the model with recurrent context |

0:13:35 | the last variation |

0:13:37 | is the one with both local and recurrent context |

0:13:40 | and this is the joint model |

0:13:41 | as what we have seen just now |

0:13:46 | okay next come the experiment setup and results |

0:13:52 | so in the experiments the dataset that will be used |

0:13:54 | is the airline travel information system dataset and in this dataset in total we have |

0:13:59 | eighteen intent classes and a hundred and twenty-seven slot labels |

0:14:04 | for intent detection we evaluate |

0:14:08 | the intent model on intent classification error rate for slot filling |

0:14:12 | we evaluate it on the f1 score |
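The two metrics can be sketched as follows. This is a minimal chunk-based F1 in the style of the standard CoNLL slot evaluation; the exact scoring script used in the experiments is not specified in the talk, and the sketch assumes well-formed IOB tags (every chunk starts with a B- tag):

```python
def classification_error_rate(pred, gold):
    """Intent detection metric: fraction of queries with a wrong intent."""
    return sum(p != g for p, g in zip(pred, gold)) / len(gold)

def chunk_spans(tags):
    """Extract (type, start, end) slot chunks from a well-formed IOB sequence."""
    spans, typ, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last chunk
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and typ != tag[2:]):
            if typ is not None:
                spans.append((typ, start, i))
            typ, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
    return spans

def slot_f1(pred_tags, gold_tags):
    """Slot filling metric: F1 over exact chunk matches."""
    pred, gold = set(chunk_spans(pred_tags)), set(chunk_spans(gold_tags))
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)
```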

0:14:16 | some details about our rnn model |

0:14:20 | configurations |

0:14:21 | we use lstm cells as the basic rnn unit for its |

0:14:25 | stronger capability in terms of modeling longer-term dependencies |

0:14:29 | we perform mini-batch training using the adam optimization method |

0:14:33 | and to improve the generalization capability of the proposed model |

0:14:38 | we use dropout and l2 regularization |

0:14:43 | in order to |

0:14:45 | to evaluate the robustness of our proposed model |

0:14:49 | we not only experiment with the true text input |

0:14:53 | but also with |

0:14:54 | noisy speech input |

0:14:55 | so |

0:14:59 | we use these two types of input and these are some details on |

0:15:03 | our asr model settings which we will see |

0:15:06 | later |

0:15:08 | basically in these experiments we report performance |

0:15:12 | using these two types of input the true text input and the speech input with |

0:15:16 | simulated noise |

0:15:18 | and compare the performance of five different types of models |

0:15:22 | on these three different tasks |

0:15:24 | intent detection slot filling and the language modeling |

0:15:31 | and |

0:15:32 | here is the |

0:15:34 | intent detection performance |

0:15:37 | using true text input |

0:15:40 | the five models from left to right |

0:15:42 | are the independent training model for intent detection the basic joint model |

0:15:48 | as what we have seen just now in the model variations |

0:15:52 | the third one is the joint model with intent context |

0:15:56 | the fourth one is the joint model with slot label context |

0:15:59 | and the last one is the joint model |

0:16:02 | with both types of context |

0:16:04 | so as we can see the joint model with both types of |

0:16:08 | context |

0:16:09 | performs the best and it achieves twenty six point three percent relative error reduction |

0:16:16 | over the independent training intent model |

0:16:18 | so |

0:16:21 | next is the slot filling performance |

0:16:25 | using the true text input |

0:16:27 | so as what we can see |

0:16:30 | our proposed joint model shows a slight degradation on the slot filling f1 score |

0:16:36 | comparing to the independent training models |

0:16:39 | and this might be due to the fact that |

0:16:42 | the proposed joint model |

0:16:45 | lacks certain discriminative power |

0:16:48 | for the multiple tasks because we are using the shared |

0:16:52 | representation from the |

0:16:53 | rnn hidden state |

0:16:56 | but this |

0:16:57 | suggests one aspect that can be improved further in our future work for |

0:17:01 | the joint modeling |

0:17:04 | this one is the language modeling performance |

0:17:07 | using the true text input |

0:17:09 | as what we can see |

0:17:11 | the best performing model is the joint model with intent and slot label |

0:17:15 | context |

0:17:16 | and this model achieves eleven point eight percent relative error |

0:17:20 | reduction sorry |

0:17:21 | relative reduction of perplexity |

0:17:24 | comparing to the independent training language model |

0:17:27 | so one thing that we can note from this result is that |

0:17:32 | the intent context |

0:17:35 | is very important |

0:17:37 | in terms of producing |

0:17:39 | good language modeling performance |

0:17:41 | without intent context |

0:17:43 | the joint model with slot label context alone |

0:17:46 | produces very similar performance |

0:17:48 | in terms of perplexity comparing to the independent training models |

0:17:53 | so |

0:17:54 | here we show that the intent |

0:17:57 | information in the context is very important for language modeling |

0:18:04 | and lastly here are some results |

0:18:07 | using the speech input |

0:18:08 | and asr output to our model |

0:18:11 | these are the four asr model settings |

0:18:13 | the first one just uses the output directly from the decoding |

0:18:17 | and in the second one |

0:18:19 | after decoding we do rescoring with a five-gram |

0:18:22 | language model |

0:18:23 | the third one uses rescoring with the independent training rnn language model |

0:18:29 | the last one is |

0:18:30 | the model with rescoring |

0:18:32 | using our proposed joint training model |

0:18:36 | as what we can see from these results |

0:18:39 | the joint training |

0:18:42 | approach |

0:18:44 | produces the |

0:18:45 | best performance |

0:18:46 | across all these three evaluation criteria here |

0:18:50 | basically the word error rate for |

0:18:52 | speech recognition the intent error and the f1 score |

0:18:56 | so basically this result shows that |

0:18:58 | even at this word error rate level |

0:19:03 | from the asr |

0:19:04 | our intent model and slot filling model can still produce |

0:19:10 | competitive performance in intent detection and slot filling |

0:19:13 | so these numbers are slightly worse than the experiments |

0:19:18 | with true text input |

0:19:19 | but these results also show the robustness |

0:19:23 | of our proposed |

0:19:25 | model |

0:19:27 | okay lastly the conclusion |

0:19:30 | in this work |

0:19:31 | we proposed an rnn model for joint online |

0:19:35 | spoken language understanding and language modeling |

0:19:38 | and by modeling the three tasks jointly |

0:19:43 | our model is able to |

0:19:45 | achieve improved performance on the intent detection and the language modeling |

0:19:50 | with slight degradation |

0:19:51 | on the slot filling performance |

0:19:54 | in order to show the robustness of our model |

0:19:56 | we applied our model |

0:19:59 | on the asr output of the noisy speech input |

0:20:03 | and we also observed consistent performance gain |

0:20:07 | over the independent training models |

0:20:10 | by using our joint model |

0:20:13 | so this is the end of the talk |

0:20:16 | right okay |

0:20:22 | okay |

0:20:23 | time for a few questions |

0:20:25 | thanks |

0:21:00 | okay so the question is if i had a chance to redefine the corpus what |

0:21:05 | are the criteria that i would be looking for |

0:21:09 | in the corpus yes |

0:21:10 | right so |

0:21:13 | basically the thing here is |

0:21:14 | we can see that we are using the recurrent neural network models |

0:21:17 | and |

0:21:19 | typically such models on nlp tasks require |

0:21:22 | very large datasets to show stable and robust performance |

0:21:27 | so the first criterion is of course if we can have a lot of data |

0:21:31 | that would be the best |

0:21:33 | the bigger the better i would assume |

0:21:35 | and the second thing i can think of is that |

0:21:39 | for atis |

0:21:40 | why this is a rather simple dataset is because it is very |

0:21:46 | domain limited so most of the training utterances |

0:21:51 | are closely related to flights |

0:21:54 | airline travel information |

0:21:56 | so if i can |

0:21:57 | you know redefine the corpus |

0:21:59 | i would explore the |

0:22:01 | multi-domain |

0:22:04 | scenario |

0:22:05 | to see whether our model is able to |

0:22:08 | you know perform |

0:22:09 | really well not only in the domain limited case but also in the generalized |

0:22:13 | or more multi-domain cases |

0:22:15 | so that is |

0:22:17 | what i really care about in the corpus design |

0:22:47 | right i completely agree with you i think this is |

0:22:51 | a very good suggestion so here we are doing joint modeling of slu |

0:22:56 | and the language modeling |

0:22:57 | and typically language modeling is you know having the task to make a prediction of what |

0:23:02 | the user might say at the next step and |

0:23:05 | i think that is a very nice suggestion |

0:23:21 | if an utterance has five words maybe |

0:23:23 | is this one single training instance |

0:23:43 | so in our experiments for the |

0:23:46 | for the true text samples we don't have that situation |

0:23:50 | but in the asr output we may see |

0:23:55 | partial phrases or corrections |

0:23:59 | we |

0:24:00 | did not look into these particularly in this work |

0:24:02 | but that is something |

0:24:04 | we can look into in the future work |

0:24:35 | alright okay thanks |

0:24:39 | just a quick comment so you are jointly training the language model over the |

0:24:44 | slu corpus the main problem is that the corpus we have for training |

0:24:48 | our slu model is usually very small while for training a language model you need a big |

0:24:52 | corpus right but for joint training you know you need |

0:24:57 | to have a |

0:24:59 | you know way to automatically determine your |

0:25:02 | training of the language model |

0:25:04 | right i think |

0:25:06 | i believe in this domain |

0:25:08 | data or |

0:25:09 | well labeled data that is really a limitation because we don't have very large |

0:25:15 | labeled data for these slu tasks so |

0:25:18 | i think if we can put more effort in generating |

0:25:21 | you know |

0:25:22 | better quality corpora that will |

0:25:24 | help a lot with these slu research |

0:25:27 | thanks for the question |

0:25:44 | yes i did |

0:25:56 | okay so i think that is a very good question so we have |

0:26:00 | a chart in the paper but it is not included here in the presentation |

0:26:03 | basically we evaluated a number of different sizes of k |

0:26:08 | the basic idea is |

0:26:09 | starting from the k-th step |

0:26:11 | we start gradually increasing the intent contribution |

0:26:14 | and we evaluate so we show the training curve and validation curve |

0:26:18 | for different k values |

0:26:20 | but basically these values are set |

0:26:23 | manually in the experiments they are not learned |

0:26:26 | in the current work |

0:26:32 | i think |

0:26:33 | definitely going forward i think this is |

0:26:35 | one of the hyperparameters that can be |

0:26:38 | learned from a purely data-driven approach |

0:26:41 | it is just that in the current work we |

0:26:43 | manually selected the k values |

0:26:45 | and evaluated with these |

0:26:48 | set k values |

0:26:50 | okay so let's thank the speaker again okay |