0:00:16 So my name is Daniel, and I'm a PhD student at the Technical University of Munich,
0:00:21 and today I want to present to you the joint work of my colleagues and me
0:00:25 about natural language understanding services and their evaluation.
0:00:30 This work is part of a bigger project, a cooperation between our chair and
0:00:34 the corporate technology department of Siemens.
0:00:37 The project is about social software and, I would say, very much
0:00:42 driven by technology, so we try a lot of
0:00:46 new technologies
0:00:48 and libraries and so on, and we also do a lot of prototyping, and one
0:00:53 of these prototypes happened to be a chatbot, because
0:00:56 that's what you do these days
0:00:58 if you want to be cool as a corporation.
0:01:01 So this is, on a very abstract level, the architecture we chose for our chatbot,
0:01:07 and I don't want to go into detail on every point, but I want
0:01:12 to highlight two things. The first one is, you can see that contextual information
0:01:17 plays quite an important role in our chatbot.
0:01:20 This is because it is also one of the focuses of the project,
0:01:25 because we also tried to build
0:01:26 a context broker, which
0:01:29 stores, processes, and distributes
0:01:31 context information among different sources and applications, and this can be everything like user
0:01:41 information, information about hardware, or preferences, and so on.
0:01:45 And why do we think it's important for chatbots as well?
0:01:48 Well, a chatbot is basically a pipeline with three steps,
0:01:52 and we think
0:01:54 context information can be very helpful in every one of these steps. For example, for
0:01:59 the request interpretation:
0:02:01 you get a question like "how can I get home from
0:02:05 the airport",
0:02:06 and then, obviously, in order to generate a query out of this, you first
0:02:11 have to replace "home" with information like an address or a city. So this would be
0:02:17 one example where contextual information could be useful.
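A minimal sketch of that substitution step, assuming a hypothetical context store keyed by user; the store, the key names, and the address are invented for illustration and are not part of any real chatbot framework:

```python
# Hypothetical context store mapping a user to personal context values.
CONTEXT = {"user42": {"home": "Arcisstrasse 21, Munich"}}

def resolve_placeholders(user_id, text):
    """Replace personal placeholders like 'home' with concrete values
    from the user's stored context before query generation."""
    context = CONTEXT.get(user_id, {})
    for placeholder, value in context.items():
        text = text.replace(placeholder, value)
    return text

resolved = resolve_placeholders("user42", "how can i get home from the airport")
```

A real context broker would of course distribute this information between applications rather than keep it in a dictionary; the point here is only where the lookup happens in the pipeline.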
0:02:21 Then also:
0:02:22 for me, home is Munich,
0:02:24 and to get to Munich you have a lot of different options: you can
0:02:28 fly, you can take the train,
0:02:30 or you can drive.
0:02:31 So how do you select which of these options you want to take?
0:03:05 You have a lot of options, and how do you choose one?
0:03:09 You could always just choose the cheapest,
0:03:12 or you can take into account user preferences: maybe I'm afraid of flying,
0:03:17 so the chatbot shouldn't suggest a flight, or
0:03:22 I don't even have a car, so it shouldn't suggest driving.
0:03:25 That's just another point where contextual information could be useful.
0:03:29 The same holds for the message generation, on a very high level: in which language
0:03:33 do I want to have the output, or on which device
0:03:36 am I receiving the message? If it's a watch, the message has to
0:03:41 be very short, and so on.
0:03:42 So contextual information plays a very important role. But actually, that's not what I want
0:03:47 to talk about today. Today I want to focus on this part:
0:03:53 how can I analyse incoming requests?
0:03:58 Here we have an example: "how can I get from Munich to the airport".
0:04:02 What do we actually want to extract from this? That would be the first question.
0:04:08 I think what would be useful is, we first need, somehow,
0:04:13 what is the user actually talking about, what is the task,
0:04:17 and this would be "find connection".
0:04:20 And then the other important things are: I want to start somewhere,
0:04:25 in this case Munich, and I want to travel to somewhere,
0:04:28 and these are something like concepts.
0:04:31 So when we map this to the concepts of natural language understanding services:
0:04:37 nearly all of them use intents and entities as their concepts. An intent is basically
0:04:43 a label for a whole message;
0:04:46 in this case the intent would be
0:04:48 "find connection". And entities are labels for parts of the message: this can be a
0:04:54 word, it can be a character, multiple words, multiple characters.
0:04:59 And then I can define different entity types.
0:05:02 For this example I could define
0:05:06 an entity type "start" and an entity type "destination", and what I would want to have
0:05:12 from a natural language understanding service is: when I put
0:05:16 in something like this,
0:05:18 I get this information back:
0:05:20 the intent and the entities.
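The desired behaviour is essentially a function from an utterance to an intent plus a list of entities. A toy, hard-coded sketch of that result shape (the field names are illustrative; every real service uses its own):

```python
def parse(text):
    """Toy 'NLU service' that returns the intent/entity structure the talk
    describes. A real service would classify and extract with a trained
    model; here the result is hard-coded purely to show the shape."""
    return {
        "text": text,
        "intent": "FindConnection",
        "entities": [
            {"type": "start", "value": "Munich"},
            {"type": "destination", "value": "the airport"},
        ],
    }

result = parse("how can I get from Munich to the airport")
```

The downstream query-generation step would then read `result["intent"]` to pick a handler and the entity values to fill its parameters.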
0:05:24 And that's actually how all of them work:
0:05:28 you can train all of them through a web interface, and
0:05:31 you do basically what you can see here: you mark the words, you
0:05:34 select the intent, and so on.
0:05:36 But
0:05:41 if you want to train with a lot of data, you obviously don't want to do
0:05:46 all of this over the web interface, so most of them also offer a batch
0:05:51 import function, and this is actually the data from the import format of Microsoft LUIS,
0:05:59 but they all look kind of similar.
0:06:03 Okay, so I already mentioned Microsoft LUIS, and there are a lot of other
0:06:10 popular services around. I think these are probably the most popular ones at the moment.
0:06:15 So when we started to implement our prototype, we asked ourselves:
0:06:21 which of these should we use?
0:06:23 Has anybody here ever used one of them?
0:06:29 Okay, and has anyone ever tried multiple of them?
0:06:36 And how did you decide which one to use?
0:06:40 Okay, so
0:06:43 we didn't know how to choose. So the first thing we did was look into
0:06:48 recent publications, because actually
0:06:51 quite a few people are using these services
0:06:54 these days. From this year alone we could find quite some papers using one of them,
0:07:00 but none of these papers actually says "okay, we chose this one because of...". They
0:07:05 just say "we use this",
0:07:07 and we wanted to know why.
0:07:09 We also asked our industry partners, and they also use
0:07:13 different services,
0:07:14 different divisions use different services,
0:07:18 and their answer was usually:
0:07:20 "well, we have a contract with this company anyway", or "we got it for free,
0:07:24 so we are using it".
0:07:26 And well,
0:07:28 those are valid reasons, but still we thought
0:07:32 that's not enough:
0:07:34 we want to know which service is better,
0:07:38 which service has the better classification,
0:07:40 to make a more educated decision about which service we want to use. So what we
0:07:44 wanted to do is compare all of them.
0:07:48 And how do you do that? You train them all with the same data, and you test
0:07:52 them all
0:07:52 with the same data.
0:07:54 Unfortunately,
0:07:57 we were not able to compare all of them,
0:08:00 because when we started, Amazon Lex was still in closed beta.
0:08:05 I don't know, maybe it has changed by today, but at this point in time they didn't
0:08:09 offer a batch import function, so you had to mark everything in the web interface,
0:08:15 and we
0:08:17 couldn't, or we didn't want to, do that.
0:08:19 With wit.ai, there is a batch import function, but it was not working
0:08:23 with external data: you could only export data from wit.ai and re-import it.
0:08:30 According to their issue tracker it's a known bug,
0:08:33 although I'm not sure if it's really a bug, or a feature to lock people in, actually.
0:08:42 So, I already said that
0:08:44 they all have kind of similar looking
0:08:48 data formats,
0:08:49 but still, of course, they are all somewhat different: some use just one file, some
0:08:54 distribute the information
0:08:56 over different files,
0:08:59 some denote the entity position
0:09:00 by character, some by words, and so on. So what we did,
0:09:05 because we wanted to automate
0:09:07 this process as much as possible:
0:09:10 we implemented a small converter which is able to take a generic
0:09:17 representation that we use for our corpora and
0:09:21 convert it to the different import formats.
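One concrete conversion such a tool has to do is mapping character-offset entity spans onto token indices, for services that position entities by words. A minimal sketch, purely illustrative and not the actual converter:

```python
def char_span_to_token_span(text, start, end):
    """Map an inclusive character span [start, end] onto the indices of
    the first and last whitespace-separated token it covers."""
    tokens, offsets, pos = text.split(), [], 0
    for tok in tokens:
        pos = text.index(tok, pos)           # locate token in original text
        offsets.append((pos, pos + len(tok) - 1))
        pos += len(tok)
    first = next(i for i, (s, e) in enumerate(offsets) if s <= start <= e)
    last = next(i for i, (s, e) in enumerate(offsets) if s <= end <= e)
    return first, last
```

Going the other way (token indices back to character offsets) uses the same offset table, which is why keeping one generic representation and converting outward is less error-prone than converting pairwise between service formats.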
0:09:23 And actually,
0:09:25 one thing that is
0:09:27 maybe also interesting:
0:09:29 out of these services there are three
0:09:33 which are free:
0:09:34 API.ai, wit.ai, and Rasa.
0:09:38 API.ai and wit.ai are free as in free of charge,
0:09:42 and Rasa is free as in freedom, because it's open-source software.
0:09:48 Another interesting thing about Rasa is that it works with the
0:09:53 import formats
0:09:55 from all the other services. That means,
0:09:57 when you switch from one of the commercial services to Rasa,
0:10:01 you don't have to do any work, you can just copy all your data over.
0:10:06 So what we then did:
0:10:08 with the data
0:10:10 converted
0:10:11 into the right formats, we used the APIs of the services to train them.
0:10:16 For the commercial services
0:10:19 this takes just five or ten minutes, and you can do it over the
0:10:23 REST API. For Rasa, you have to do it on the command line,
0:10:27 and it takes roughly,
0:10:30 for roughly
0:10:31 four hundred instances that you're training on, you can
0:10:36 assume it takes about one hour on a reasonable desktop machine.
0:10:43 And then
0:10:44 we did
0:10:46 the same,
0:10:47 only in the other direction:
0:10:50 we again took test data from our corpus,
0:10:54 sent it to all the different APIs,
0:10:57 stored the resulting annotations, and then compared them to our
0:11:01 gold standard.
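The scoring step just described can be sketched as follows; representing predictions as (utterance_id, label, value) triples and micro-averaging F1 are our own illustrative choices here, not the paper's exact evaluation code:

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over (utterance_id, label, value) triples,
    treating intent labels and entity labels uniformly."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)           # triples the service got exactly right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(0, "intent", "FindConnection"), (0, "start", "munich")]
pred = [(0, "intent", "FindConnection"), (0, "start", "airport")]
score = micro_f1(gold, pred)
```

Running the same comparison for every service against the same gold standard is what makes the per-service F-scores shown later directly comparable.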
0:11:04 Regarding the corpora, we used two of them.
0:11:08 One was collected
0:11:12 through a chatbot that we had built before: it was a working Telegram chatbot
0:11:16 for public transport in Munich, and it was manually checked by us.
0:11:22 So we had 206
0:11:25 questions, requests from the chatbot, and they had
0:11:30 two different intents and five entity types, so we have a lot of training data
0:11:34 for some intents and entity types, and less for others.
0:11:37 This data was interesting because it's very natural:
0:11:42 real users used the chatbot, so it's
0:11:48 hopefully comparable,
0:11:50 linguistically, to the form of requests you would receive with
0:11:54 a chatbot.
0:11:55 But regarding the domain, obviously Siemens was more interested in
0:12:01 a technical domain. That's why we had a second corpus,
0:12:04 which we
0:12:06 collected from StackExchange. All programmers
0:12:10 probably know StackOverflow, and they have a bunch of
0:12:13 different platforms for different topics.
0:12:17 We took questions from
0:12:20 their platform for web applications and another platform
0:12:24 called Ask Ubuntu, which is about questions
0:12:28 about Ubuntu.
0:12:30 These were tagged with Amazon Mechanical Turk,
0:12:34 and the StackExchange corpus is available online;
0:12:37 you can find it
0:12:38 linked in the paper.
0:12:44 In the corpus you can also find the answers to these questions, because
0:12:49 we only took
0:12:50 questions which have an accepted answer. Although we are not using these answers for our
0:12:57 evaluation, it might be useful for somebody else in the future.
0:13:02 Also, we took the highest-ranked questions,
0:13:05 because we assume that they have a somewhat good quality.
0:13:12 How did we do it on Mechanical Turk? Well, we basically modelled
0:13:16 the interface that all these services offer: we presented a sentence, and the workers then
0:13:25 highlighted the different parts that are entities,
0:13:29 and they could choose from a predefined list of intents.
0:13:34 We also asked them to rate how confident they are
0:13:37 about their annotation,
0:13:39 and we only took into account annotations
0:13:43 which were
0:13:45 at least somewhat confident,
0:13:46 and for which we could find inter-annotator agreement
0:13:50 of more than sixty percent.
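A minimal sketch of that filtering rule: keep a label only if annotators were at least somewhat confident and more than 60% of the confident annotators agree. The data shapes and the numeric confidence scale are assumptions for illustration, not the study's actual pipeline:

```python
from collections import Counter

def majority_label(annotations, min_agreement=0.6, min_confidence=2):
    """annotations: list of (label, confidence) pairs for one item, with
    confidence on an assumed 1-3 scale. Returns the agreed label, or None
    if the item should be discarded."""
    confident = [lab for lab, conf in annotations if conf >= min_confidence]
    if not confident:
        return None  # nobody was confident enough
    label, votes = Counter(confident).most_common(1)[0]
    # require strictly more than 60% agreement among confident annotators
    return label if votes / len(confident) > min_agreement else None
```

Items that return `None` simply never enter the training or test sets, which trades corpus size for annotation quality.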
0:13:54 So this is what we got: the distribution of intents and entities.
0:14:00 The actual numbers are not so important, but
0:14:04 if you look at it, you can see that there are
0:14:06 entities with more training data and entities with less training data,
0:14:10 so we have some variety in there,
0:14:13 although of course, in total, it is still a rather small dataset.
0:14:19 Before we started our evaluation, we had three main hypotheses.
0:14:24 The first one might sound obvious, but it was still the reason why we
0:14:30 did all this, because we assume that
0:14:33 you should think about which of these services you choose, and not just because of
0:14:36 pricing, but because of the quality
0:14:41 of the annotations.
0:14:44 We also assumed that commercial products will overall perform better;
0:14:48 after all, they probably have hundreds of thousands of users feeding them with data.
0:14:54 And therefore we also thought that, especially for
0:14:58 entities and intents where there's not much training data,
0:15:02 they should be
0:15:03 better, because Rasa uses
0:15:08 MITIE as its machine-learning backend, which comes with
0:15:12 three hundred megabytes of initial data. So you would assume, if there's not much training
0:15:16 data provided, that
0:15:20 LUIS, Watson, and so on have
0:15:22 a lot more data to start with.
0:15:26 And we also thought that the quality of the labels is influenced by the domain,
0:15:31 so if one service is
0:15:33 good on the corpus about public transport, it doesn't necessarily mean that it is also good
0:15:38 on the other corpora.
0:15:41 So this is, on a very high level, the
0:15:44 result of our evaluation.
0:15:47 What you can see:
0:15:48 the blue bar, which is LUIS.
0:15:51 This is the F-score
0:15:53 across all labels, so intents and entities combined; in the paper you can find a
0:15:58 broken-down version of it.
0:16:00 So, good for the guys from Microsoft: in our evaluation, LUIS was best on every domain.
0:16:08 What was actually surprising for us is that Rasa came second:
0:16:13 across all the domains it has the second-best performance,
0:16:17 which was quite surprising for us.
0:16:20 If you look into the details, you can find quite some interesting reasons why on
0:16:26 some domains some service is worse. For example, Watson
0:16:30 was very bad, compared to the others, on the public transport data, because it
0:16:37 could not handle,
0:16:39 it ignored,
0:16:40 the word order. Nearly every example uses "from" and "to",
0:16:44 and you can obviously have the same words after "from" and "to" all the time.
0:16:50 Watson was the only service that was not able to distinguish between "from" and "to":
0:16:55 if you write "from Munich to the airport" or "from the airport to Munich",
0:17:01 Watson always gave
0:17:03 both words both labels, "from" and "to".
0:17:16 So what are the key findings of our evaluation?
0:17:20 Well, as I said, LUIS performs best in all the domains we tested,
0:17:24 and Rasa is second best.
0:17:27 An interesting point: if you look at intents and entities with
0:17:33 not much training data, there's no difference there, so Rasa is not
0:17:39 better or worse on them than the commercial services.
0:17:42 So it seems that there is no big influence
0:17:47 of the initial training set
0:17:48 that is already there.
0:17:51 And well, you see that the domain matters, but the question is how much, since
0:17:57 LUIS still performs best in all domains.
0:18:01 So that's kind of the question:
0:18:03 can we now say "okay, you should always use LUIS"?
0:18:07 And I would say no.
0:18:09 You still have to try it with your domain, with your data,
0:18:14 to find out which service is the best for you.
0:18:18 Also, services might change without you noticing it.
0:18:25 That
0:18:25 is why I think it is very useful to automate this pipeline, with the
0:18:31 scripts we did and so on, because then you can do it on all the
0:18:34 services and even redo it constantly, to find out
0:18:39 which service is
0:18:40 the best for you.
0:18:42 And one
0:18:44 interesting question which arose from
0:18:47 these findings
0:18:49 is whether the commercial services really
0:18:52 benefit that much from user data, because when we talked with industry partners,
0:18:58 that was one of their main concerns:
0:19:00 we pay with money, and we pay with data.
0:19:06 I'm not really sure about this, at least for the user-defined entities. So
0:19:10 if I define my entity, called "start",
0:19:14 and I label one thousand datasets,
0:19:18 how useful is that for
0:19:21 any of these services? Because
0:19:23 it's my user-defined label,
0:19:26 are they able to extract anything from it?
0:19:30 Maybe that's the reason why we don't see what we expected when it comes
0:19:35 to
0:19:38 entity types and intents with
0:19:40 less training data: that the commercial services do not perform
0:19:44 significantly better.
0:19:46 Thank you.
0:19:53 Okay, so we have about five minutes for questions.
0:20:05 The experiments were great, so,
0:20:08 full disclosure, I'm one of the creators of Rasa, so I'm slightly biased.
0:20:12 Did you go and
0:20:14 tweak any of the hyperparameters
0:20:16 in Rasa?
0:20:19 The hyperparameters: did you just use the defaults, or did you tweak them?
0:20:23 No, we used the defaults.
0:20:23 Okay; I think you could maybe squeeze out some more performance.
0:20:32 Thanks for the very interesting talk. This question is more a comment for some future work:
0:20:36 it seems that there's almost a baseline lacking, which would be, like,
0:20:41 maybe a PhD student spending a week of time trying to get the accuracy of
0:20:45 something, because these services are really designed for people who aren't that technical. I think
0:20:48 that this kind of comparison would also be interesting; I'd just like to see,
0:20:53 maybe, what happens if you take something slightly
0:20:56 more standard, something like that, and just see how well you can do without these
0:20:59 services, how much these services are actually helping you, because
0:21:02 they may be easy to use,
0:21:05 but if you really want to get the accuracy you should
0:21:08 get, you may have to go into the details.
0:21:31 I'm very appreciative that some independent party is taking the time to evaluate
0:21:37 these services independently. Some services like LUIS, and possibly the others, have something like active learning: they'll suggest
0:21:44 utterances you might want to go and label, once you've collected some utterances.
0:21:49 If I understood the evaluation correctly, you haven't done that here: you have a fixed
0:21:52 training set.
0:21:54 I'm curious, have you looked at that aspect of the services, or do you have any comments?
0:21:59 So, I mean, there are a lot of other aspects which we didn't look at,
0:22:02 and this is one point. Another point is also
0:22:05 that a lot of these services, including LUIS, also have
0:22:10 built-in entity types already,
0:22:12 so you have fixed,
0:22:15 pre-trained entity types for locations, phone numbers, and so on,
0:22:19 and I think that's also something you can benefit a lot from when you use them.
0:22:26 So we looked at them; for Siemens we also did
0:22:33 a comparison of
0:22:35 the functionalities: some of them include
0:22:39 already giving responses, canned responses, and so on.
0:22:43 But here we really just used the dataset, and we only did this evaluation
0:22:49 on these things, because, again, if you do it with the suggestions, you
0:22:54 have to do it through the web interface, and this means that you have to label
0:22:58 five hundred utterances on all the systems.
0:23:04 That is something that might be interesting in the future, but it takes more time.
0:23:15 Do we have any other questions? We have about two minutes left.
0:23:21 Okay, I have a question.
0:23:24 So this is a chatbot session, so could you elaborate on the
0:23:28 relationship between this work and chatbots?
0:23:31 Well, as I said, I think this is one of the parts,
0:23:37 or this can be one useful part, if you want to develop a chatbot. And
0:23:41 what we saw,
0:23:43 the typical workflow is: you use one of these services,
0:23:48 and if you just evaluate your chatbot as a whole at the end,
0:23:54 you might be influenced by these results without knowing it.
0:23:58 Your chatbot might perform
0:24:00 better just because you changed your natural language understanding service. So I think
0:24:06 it is important
0:24:08 to know about these things and to think about them, and also, if you do
0:24:12 an evaluation of a chatbot as a whole system, to take
0:24:17 these things into account. And I also think, from an industry perspective,
0:24:22 these services are one of the reasons why
0:24:24 chatbots became so popular recently,
0:24:27 because it is really easy. So
0:24:30 there are other services, which are not as popular, which really offer you to
0:24:36 click together a whole chatbot without programming a single line of code,
0:24:41 and here you can at least train a model without having any knowledge about language processing or machine learning.
0:24:48 And I think, therefore, it's especially
0:24:51 important for this type of chatbot development, and it influences a lot of it.
0:25:00 Okay, one more time, please.
0:25:17 So that's it; thanks again to the speaker.