0:00:16 So my name is Daniel, and I'm a PhD student at the Technical University of Munich,
0:00:21 and today I want to present to you the joint work of my colleagues and me
0:00:25 about natural language understanding services and their evaluation.
0:00:30 This work is part of a bigger project, a cooperation between our chair and
0:00:34 the corporate technology department of Siemens.
0:00:37 The project is about social software and, I would say, very much
0:00:42 driven by technology, so we try a lot of
0:00:46 new technologies
0:00:48 and libraries and so on, and we also do a lot of prototyping, and one
0:00:53 of these prototypes happened to be a chatbot, because
0:00:56 that's what you do these days
0:00:58 if you want to be cool as a corporation.
0:01:01 So this is, on a very abstract level, the architecture we chose for our chatbot,
0:01:07 and I don't want to go into detail on every point, but I want
0:01:12 to highlight two things. The first one is, you can see that contextual information
0:01:17 plays quite an important role in our chatbot.
0:01:20 This is because it is also one of the focuses of the project,
0:01:25 because we also tried to build
0:01:26 a context broker, which
0:01:29 stores, processes, and distributes
0:01:31 context information among different sources and applications, and this can be everything like user
0:01:41 information, information about hardware, or preferences, and so on.
0:01:45 And why do we think it's important for chatbots as well?
0:01:48 Well, a chatbot is basically a pipeline with three steps,
0:01:52 and we think
0:01:54 context information can be very helpful in every one of these steps. For example, for
0:01:59 the request interpretation:
0:02:01 you get a question like "how can I get home from
0:02:05 the airport",
0:02:06 and then, obviously, in order to generate a query out of this, you first
0:02:11 have to replace "home" with information like an address or a city. So this would be
0:02:17 one example where contextual information could be useful.
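A minimal sketch of that substitution step, assuming a hypothetical context store keyed by user; the store, the key names, and the address are invented for illustration and are not part of any real chatbot framework:

```python
# Hypothetical context store mapping a user to personal context values.
CONTEXT = {"user42": {"home": "Arcisstrasse 21, Munich"}}

def resolve_placeholders(user_id, text):
    """Replace personal placeholders like 'home' with concrete values
    from the user's stored context before query generation."""
    context = CONTEXT.get(user_id, {})
    for placeholder, value in context.items():
        text = text.replace(placeholder, value)
    return text

resolved = resolve_placeholders("user42", "how can i get home from the airport")
```

A real context broker would of course distribute this information between applications rather than keep it in a dictionary; the point here is only where the lookup happens in the pipeline.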
0:02:21 Then also:
0:02:22 for me, home is Munich,
0:02:24 and to get to Munich you have a lot of different options: you can
0:02:28 fly, you can take the train,
0:02:30 or you can drive.
0:02:31 So how do you select which of these options you want to take?
0:03:05 You have a lot of options, and how do you choose one?
0:03:09 You could always just choose the cheapest,
0:03:12 or you can take into account user preferences: maybe I'm afraid of flying,
0:03:17 so the chatbot shouldn't suggest a flight, or
0:03:22 I don't even have a car, so it shouldn't suggest driving.
0:03:25 That's just another point where contextual information could be useful.
0:03:29 The same holds for the message generation, on a very high level: in which language
0:03:33 do I want to have the output, or on which device
0:03:36 am I receiving the message? If it's a watch, the message has to
0:03:41 be very short, and so on.
0:03:42 So contextual information plays a very important role. But actually, that's not what I want
0:03:47 to talk about today. Today I want to focus on this part:
0:03:53 how can I analyse incoming requests?
0:03:58 Here we have an example: "how can I get from Munich to the airport".
0:04:02 What do we actually want to extract from this? That would be the first question.
0:04:08 I think what would be useful is, we first need, somehow,
0:04:13 what is the user actually talking about, what is the task,
0:04:17 and this would be "find connection".
0:04:20 And then the other important things are: I want to start somewhere,
0:04:25 in this case Munich, and I want to travel to somewhere,
0:04:28 and these are something like concepts.
0:04:31 So when we map this to the concepts of natural language understanding services:
0:04:37 nearly all of them use intents and entities as their concepts. An intent is basically
0:04:43 a label for a whole message;
0:04:46 in this case the intent would be
0:04:48 "find connection". And entities are labels for parts of the message: this can be a
0:04:54 word, it can be a character, multiple words, multiple characters.
0:04:59 And then I can define different entity types.
0:05:02 For this example I could define
0:05:06 an entity type "start" and an entity type "destination", and what I would want to have
0:05:12 from a natural language understanding service is: when I put
0:05:16 in something like this,
0:05:18 I get this information back:
0:05:20 the intent and the entities.
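The desired behaviour is essentially a function from an utterance to an intent plus a list of entities. A toy, hard-coded sketch of that result shape (the field names are illustrative; every real service uses its own):

```python
def parse(text):
    """Toy 'NLU service' that returns the intent/entity structure the talk
    describes. A real service would classify and extract with a trained
    model; here the result is hard-coded purely to show the shape."""
    return {
        "text": text,
        "intent": "FindConnection",
        "entities": [
            {"type": "start", "value": "Munich"},
            {"type": "destination", "value": "the airport"},
        ],
    }

result = parse("how can I get from Munich to the airport")
```

The downstream query-generation step would then read `result["intent"]` to pick a handler and the entity values to fill its parameters.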
0:05:24 And that's actually how all of them work:
0:05:28 you can train all of them through a web interface, and
0:05:31 you do basically what you can see here: you mark the words, you
0:05:34 select the intent, and so on.
0:05:36 But
0:05:41 if you want to train with a lot of data, you obviously don't want to do
0:05:46 all of this over the web interface, so most of them also offer a batch
0:05:51 import function, and this is actually the data from the import format of Microsoft LUIS,
0:05:59 but they all look kind of similar.
0:06:03 Okay, so I already mentioned Microsoft LUIS, and there are a lot of other
0:06:10 popular services around. I think these are probably the most popular ones at the moment.
0:06:15 So when we started to implement our prototype, we asked ourselves:
0:06:21 which of these should we use?
0:06:23 Has anybody here ever used one of them?
0:06:29 Okay, and has anyone ever tried multiple of them?
0:06:36 And how did you decide which one to use?
0:06:40 Okay, so
0:06:43 we didn't know how to choose. So the first thing we did was look into
0:06:48 recent publications, because actually
0:06:51 quite a few people are using these services
0:06:54 these days. From this year alone we could find quite some papers using one of them,
0:07:00 but none of these papers actually says "okay, we chose this one because of...". They
0:07:05 just say "we use this",
0:07:07 and we wanted to know why.
0:07:09 We also asked our industry partners, and they also use
0:07:13 different services,
0:07:14 different divisions use different services,
0:07:18 and their answer was usually:
0:07:20 "well, we have a contract with this company anyway", or "we got it for free,
0:07:24 so we are using it".
0:07:26 And well,
0:07:28 those are valid reasons, but still we thought
0:07:32 that's not enough:
0:07:34 we want to know which service is better,
0:07:38 which service has the better classification,
0:07:40 to make a more educated decision about which service we want to use. So what we
0:07:44 wanted to do is compare all of them.
0:07:48 And how do you do that? You train them all with the same data, and you test
0:07:52 them all
0:07:52 with the same data.
0:07:54 Unfortunately,
0:07:57 we were not able to compare all of them,
0:08:00 because when we started, Amazon Lex was still in closed beta.
0:08:05 I don't know, maybe it has changed by today, but at this point in time they didn't
0:08:09 offer a batch import function, so you had to mark everything in the web interface,
0:08:15 and we
0:08:17 couldn't, or we didn't want to, do that.
0:08:19 With wit.ai, there is a batch import function, but it was not working
0:08:23 with external data: you could only export data from wit.ai and re-import it.
0:08:30 According to their issue tracker it's a known bug,
0:08:33 although I'm not sure if it's really a bug, or a feature to lock people in, actually.
0:08:42 So, I already said that
0:08:44 they all have kind of similar looking
0:08:48 data formats,
0:08:49 but still, of course, they are all somewhat different: some use just one file, some
0:08:54 distribute the information
0:08:56 over different files,
0:08:59 some denote the entity position
0:09:00 by character, some by words, and so on. So what we did,
0:09:05 because we wanted to automate
0:09:07 this process as much as possible:
0:09:10 we implemented a small converter which is able to take a generic
0:09:17 representation that we use for our corpora and
0:09:21 convert it to the different import formats.
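One concrete conversion such a tool has to do is mapping character-offset entity spans onto token indices, for services that position entities by words. A minimal sketch, purely illustrative and not the actual converter:

```python
def char_span_to_token_span(text, start, end):
    """Map an inclusive character span [start, end] onto the indices of
    the first and last whitespace-separated token it covers."""
    tokens, offsets, pos = text.split(), [], 0
    for tok in tokens:
        pos = text.index(tok, pos)           # locate token in original text
        offsets.append((pos, pos + len(tok) - 1))
        pos += len(tok)
    first = next(i for i, (s, e) in enumerate(offsets) if s <= start <= e)
    last = next(i for i, (s, e) in enumerate(offsets) if s <= end <= e)
    return first, last
```

Going the other way (token indices back to character offsets) uses the same offset table, which is why keeping one generic representation and converting outward is less error-prone than converting pairwise between service formats.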
0:09:23 And actually,
0:09:25 one thing that is
0:09:27 maybe also interesting:
0:09:29 out of these services there are three
0:09:33 which are free:
0:09:34 API.ai, wit.ai, and Rasa.
0:09:38 API.ai and wit.ai are free as in free of charge,
0:09:42 and Rasa is free as in freedom, because it's open-source software.
0:09:48 Another interesting thing about Rasa is that it works with the
0:09:53 import formats
0:09:55 from all the other services. That means,
0:09:57 when you switch from one of the commercial services to Rasa,
0:10:01 you don't have to do any work, you can just copy all your data over.
0:10:06 So what we then did:
0:10:08 with the data
0:10:10 converted
0:10:11 into the right formats, we used the APIs of the services to train them.
0:10:16 For the commercial services
0:10:19 this takes just five or ten minutes, and you can do it over the
0:10:23 REST API. For Rasa, you have to do it on the command line,
0:10:27 and it takes roughly,
0:10:30 for roughly
0:10:31 four hundred instances that you're training on, you can
0:10:36 assume it takes about one hour on a reasonable desktop machine.
0:10:43 And then
0:10:44 we did
0:10:46 the same,
0:10:47 only in the other direction:
0:10:50 we again took test data from our corpus,
0:10:54 sent it to all the different APIs,
0:10:57 stored the resulting annotations, and then compared them to our
0:11:01 gold standard.
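The scoring step just described can be sketched as follows; representing predictions as (utterance_id, label, value) triples and micro-averaging F1 are our own illustrative choices here, not the paper's exact evaluation code:

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over (utterance_id, label, value) triples,
    treating intent labels and entity labels uniformly."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)           # triples the service got exactly right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(0, "intent", "FindConnection"), (0, "start", "munich")]
pred = [(0, "intent", "FindConnection"), (0, "start", "airport")]
score = micro_f1(gold, pred)
```

Running the same comparison for every service against the same gold standard is what makes the per-service F-scores shown later directly comparable.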
0:11:04 Regarding the corpora, we used two of them.
0:11:08 One was collected
0:11:12 through a chatbot that we had built before: it was a working Telegram chatbot
0:11:16 for public transport in Munich, and it was manually checked by us.
0:11:22 So we had 206
0:11:25 questions, requests from the chatbot, and they had
0:11:30 two different intents and five entity types, so we have a lot of training data
0:11:34 for some intents and entity types, and less for others.
0:11:37 This data was interesting because it's very natural:
0:11:42 real users used the chatbot, so it's
0:11:48 hopefully comparable,
0:11:50 linguistically, to the form of requests you would receive with
0:11:54 a chatbot.
0:11:55 But regarding the domain, obviously Siemens was more interested in
0:12:01 a technical domain. That's why we had a second corpus,
0:12:04 which we
0:12:06 collected from StackExchange. All programmers
0:12:10 probably know StackOverflow, and they have a bunch of
0:12:13 different platforms for different topics.
0:12:17 We took questions from
0:12:20 their platform for web applications and another platform
0:12:24 called Ask Ubuntu, which is about questions
0:12:28 about Ubuntu.
0:12:30 These were tagged with Amazon Mechanical Turk,
0:12:34 and the StackExchange corpus is available online;
0:12:37 you can find it
0:12:38 linked in the paper.
0:12:44 In the corpus you can also find the answers to these questions, because
0:12:49 we only took
0:12:50 questions which have an accepted answer. Although we are not using these answers for our
0:12:57 evaluation, it might be useful for somebody else in the future.
0:13:02 Also, we took the highest-ranked questions,
0:13:05 because we assume that they have a somewhat good quality.
0:13:12 How did we do it on Mechanical Turk? Well, we basically modelled
0:13:16 the interface that all these services offer: we presented a sentence, and the workers then
0:13:25 highlighted the different parts that are entities,
0:13:29 and they could choose from a predefined list of intents.
0:13:34 We also asked them to rate how confident they are
0:13:37 about their annotation,
0:13:39 and we only took into account annotations
0:13:43 which were
0:13:45 at least somewhat confident,
0:13:46 and for which we could find inter-annotator agreement
0:13:50 of more than sixty percent.
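A minimal sketch of that filtering rule: keep a label only if annotators were at least somewhat confident and more than 60% of the confident annotators agree. The data shapes and the numeric confidence scale are assumptions for illustration, not the study's actual pipeline:

```python
from collections import Counter

def majority_label(annotations, min_agreement=0.6, min_confidence=2):
    """annotations: list of (label, confidence) pairs for one item, with
    confidence on an assumed 1-3 scale. Returns the agreed label, or None
    if the item should be discarded."""
    confident = [lab for lab, conf in annotations if conf >= min_confidence]
    if not confident:
        return None  # nobody was confident enough
    label, votes = Counter(confident).most_common(1)[0]
    # require strictly more than 60% agreement among confident annotators
    return label if votes / len(confident) > min_agreement else None
```

Items that return `None` simply never enter the training or test sets, which trades corpus size for annotation quality.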
0:13:54 So this is what we got: the distribution of intents and entities.
0:14:00 The actual numbers are not so important, but
0:14:04 if you look at it, you can see that there are
0:14:06 entities with more training data and entities with less training data,
0:14:10 so we have some variety in there,
0:14:13 although of course, in total, it is still a rather small dataset.
0:14:19 Before we started our evaluation, we had three main hypotheses.
0:14:24 The first one might sound obvious, but it was still the reason why we
0:14:30 did all this, because we assume that
0:14:33 you should think about which of these services you choose, and not just because of
0:14:36 pricing, but because of the quality
0:14:41 of the annotations.
0:14:44 We also assumed that commercial products will overall perform better;
0:14:48 after all, they probably have hundreds of thousands of users feeding them with data.
0:14:54 And therefore we also thought that, especially for
0:14:58 entities and intents where there's not much training data,
0:15:02 they should be
0:15:03 better, because Rasa uses
0:15:08 MITIE as its machine-learning backend, which comes with
0:15:12 three hundred megabytes of initial data. So you would assume, if there's not much training
0:15:16 data provided, that
0:15:20 LUIS, Watson, and so on have
0:15:22 a lot more data to start with.
0:15:26 And we also thought that the quality of the labels is influenced by the domain,
0:15:31 so if one service is
0:15:33 good on the corpus about public transport, it doesn't necessarily mean that it is also good
0:15:38 on the other corpora.
0:15:41 So this is, on a very high level, the
0:15:44 result of our evaluation.
0:15:47 What you can see:
0:15:48 the blue bar, which is LUIS.
0:15:51 This is the F-score
0:15:53 across all labels, so intents and entities combined; in the paper you can find a
0:15:58 broken-down version of it.
0:16:00 So, good for the guys from Microsoft: in our evaluation, LUIS was best on every domain.
0:16:08 What was actually surprising for us is that Rasa came second:
0:16:13 across all the domains it has the second-best performance,
0:16:17 which was quite surprising for us.
0:16:20 If you look into the details, you can find quite some interesting reasons why on
0:16:26 some domains some service is worse. For example, Watson
0:16:30 was very bad, compared to the others, on the public transport data, because it
0:16:37 could not handle,
0:16:39 it ignored,
0:16:40 the word order. Nearly every example uses "from" and "to",
0:16:44 and you can obviously have the same words after "from" and "to" all the time.
0:16:50 Watson was the only service that was not able to distinguish between "from" and "to":
0:16:55 if you write "from Munich to the airport" or "from the airport to Munich",
0:17:01 Watson always gave
0:17:03 both words both labels, "from" and "to".
0:17:16 So what are the key findings of our evaluation?
0:17:20 Well, as I said, LUIS performs best in all the domains we tested,
0:17:24 and Rasa is second best.
0:17:27 An interesting point: if you look at intents and entities with
0:17:33 not much training data, there's no difference there, so Rasa is not
0:17:39 better or worse on them than the commercial services.
0:17:42 So it seems that there is no big influence
0:17:47 of the initial training set
0:17:48 that is already there.
0:17:51 And well, you see that the domain matters, but the question is how much, since
0:17:57 LUIS still performs best in all domains.
0:18:01 So that's kind of the question:
0:18:03 can we now say "okay, you should always use LUIS"?
0:18:07 And I would say no.
0:18:09 You still have to try it with your domain, with your data,
0:18:14 to find out which service is the best for you.
0:18:18 Also, services might change without you noticing it.
0:18:25 That
0:18:25 is why I think it is very useful to automate this pipeline, with the
0:18:31 scripts we did and so on, because then you can do it on all the
0:18:34 services and even redo it constantly, to find out
0:18:39 which service is
0:18:40 the best for you.
0:18:42 And one
0:18:44 interesting question which arose from
0:18:47 these findings
0:18:49 is whether the commercial services really
0:18:52 benefit that much from user data, because when we talked with industry partners,
0:18:58 that was one of their main concerns:
0:19:00 we pay with money, and we pay with data.
0:19:06 I'm not really sure about this, at least for the user-defined entities. So
0:19:10 if I define my entity, called "start",
0:19:14 and I label one thousand datasets,
0:19:18 how useful is that for
0:19:21 any of these services? Because
0:19:23 it's my user-defined label,
0:19:26 are they able to extract anything from it?
0:19:30 Maybe that's the reason why we don't see what we expected when it comes
0:19:35 to
0:19:38 entity types and intents with
0:19:40 less training data: that the commercial services do not perform
0:19:44 significantly better.
0:19:46 Thank you.
0:19:53 Okay, so we have about five minutes for questions.
0:20:05 The experiments were great, so,
0:20:08 full disclosure, I'm one of the creators of Rasa, so I'm slightly biased.
0:20:12 Did you go and
0:20:14 tweak any of the hyperparameters
0:20:16 in Rasa?
0:20:19 The hyperparameters: did you just use the defaults, or did you tweak them?
0:20:23 No, we used the defaults.
0:20:23 Okay; I think you could maybe squeeze out some more performance.
0:20:32 Thanks for the very interesting talk. This question is more a comment for some future work:
0:20:36 it seems that there's almost a baseline lacking, which would be, like,
0:20:41 maybe a PhD student spending a week of time trying to get the accuracy of
0:20:45 something, because these services are really designed for people who aren't that technical. I think
0:20:48 that this kind of comparison would also be interesting; I'd just like to see,
0:20:53 maybe, what happens if you take something slightly
0:20:56 more standard, something like that, and just see how well you can do without these
0:20:59 services, how much these services are actually helping you, because
0:21:02 they may be easy to use,
0:21:05 but if you really want to get the accuracy you should
0:21:08 get, you may have to go into the details.
0:21:31 I'm very appreciative that some independent party is taking the time to evaluate
0:21:37 these services independently. Some services like LUIS, and possibly the others, have something like active learning: they'll suggest
0:21:44 utterances you might want to go and label, once you've collected some utterances.
0:21:49 If I understood the evaluation correctly, you haven't done that here: you have a fixed
0:21:52 training set.
0:21:54 I'm curious, have you looked at that aspect of the services, or do you have any comments?
0:21:59 So, I mean, there are a lot of other aspects which we didn't look at,
0:22:02 and this is one point. Another point is also
0:22:05 that a lot of these services, including LUIS, also have
0:22:10 built-in entity types already,
0:22:12 so you have fixed,
0:22:15 pre-trained entity types for locations, phone numbers, and so on,
0:22:19 and I think that's also something you can benefit a lot from when you use them.
0:22:26 So we looked at them; for Siemens we also did
0:22:33 a comparison of
0:22:35 the functionalities: some of them include
0:22:39 already giving responses, canned responses, and so on.
0:22:43 But here we really just used the dataset, and we only did this evaluation
0:22:49 on these things, because, again, if you do it with the suggestions, you
0:22:54 have to do it through the web interface, and this means that you have to label
0:22:58 five hundred utterances on all the systems.
0:23:04 That is something that might be interesting in the future, but it takes more time.
0:23:15 Do we have any other questions? We have about two minutes left.
0:23:21 Okay, I have a question.
0:23:24 So this is a chatbot session, so could you elaborate on the
0:23:28 relationship between this work and chatbots?
0:23:31 Well, as I said, I think this is one of the parts,
0:23:37 or this can be one useful part, if you want to develop a chatbot. And
0:23:41 what we saw,
0:23:43 the typical workflow is: you use one of these services,
0:23:48 and if you just evaluate your chatbot as a whole at the end,
0:23:54 you might be influenced by these results without knowing it.
0:23:58 Your chatbot might perform
0:24:00 better just because you changed your natural language understanding service. So I think
0:24:06 it is important
0:24:08 to know about these things and to think about them, and also, if you do
0:24:12 an evaluation of a chatbot as a whole system, to take
0:24:17 these things into account. And I also think, from an industry perspective,
0:24:22 these services are one of the reasons why
0:24:24 chatbots became so popular recently,
0:24:27 because it is really easy. So
0:24:30 there are other services, which are not as popular, which really offer you to
0:24:36 click together a whole chatbot without programming a single line of code,
0:24:41 and here you can at least train a model without having any knowledge about language processing or machine learning.
0:24:48 And I think, therefore, it's especially
0:24:51 important for this type of chatbot development, and it influences a lot of it.
0:25:00 Okay, one more time, please.
0:25:17 So that's it; thanks again to the speaker.