| 0:00:15 | Okay, so |
|---|
| 0:00:16 | Hi everyone. |
|---|
| 0:00:18 | I would like to |
|---|
| 0:00:19 | talk about the new dataset that |
|---|
| 0:00:25 | we have |
|---|
| 0:00:27 | created at Heriot-Watt University, and it's a dataset designed for |
|---|
| 0:00:34 | end-to-end natural language generation. |
|---|
| 0:00:36 | By that we mean generating fully from data, from unaligned |
|---|
| 0:00:45 | data pairs, that is, pairs of a meaning representation and the corresponding textual |
|---|
| 0:00:49 | reference | 
|---|
| 0:00:51 | with no other additional annotation. |
|---|
| 0:00:54 | This has already been done, but so far all the approaches were limited to |
|---|
| 0:01:00 | relatively small datasets, and all of them used delexicalization. |
|---|
| 0:01:06 | These are the datasets you can see on the slide. |
|---|
| 0:01:09 | Our goal here is to go a bit further with the data-driven |
|---|
| 0:01:14 | approach and to replicate the | 
|---|
| 0:01:17 | rich dialogue and discourse phenomena |
|---|
| 0:01:20 | that have been targeted by earlier, non-end-to-end rule-based |
|---|
| 0:01:24 | or statistical approaches. |
|---|
| 0:01:28 | And what |
|---|
| 0:01:28 | we have done is |
|---|
| 0:01:31 | we have collected a new training dataset that should be challenging enough to | 
|---|
| 0:01:37 | show | 
|---|
| 0:01:38 | some | 
|---|
| 0:01:40 | more interesting outputs, more interesting sentences. |
|---|
| 0:01:43 | and | 
|---|
| 0:01:44 | it is also much bigger than all the previous datasets: we have over fifty thousand |
|---|
| 0:01:49 | pairs of meaning representations and textual references. |
|---|
| 0:01:55 | The textual references are longer, so we usually |
|---|
| 0:01:58 | have more sentences that |
|---|
| 0:02:00 | describe | 
|---|
| 0:02:01 | one meaning representation, and the sentences themselves are also longer than in previous datasets. |
|---|
| 0:02:07 | We |
|---|
| 0:02:08 | have also made the effort to collect the dataset in as diverse a way as |
|---|
| 0:02:13 | possible | 
|---|
| 0:02:14 | possible, and that's why we used pictorial |
|---|
| 0:02:18 | instructions to crowd workers on a | 
|---|
| 0:02:21 | crowdsourcing website | 
|---|
| 0:02:23 | and | 
|---|
| 0:02:24 | we have found that this leads to more diverse descriptions, so |
|---|
| 0:02:29 | if you look at these two examples, |
|---|
| 0:02:33 | we have "low-cost |
|---|
| 0:02:35 | Japanese-style cuisine" and |
|---|
| 0:02:36 | we have "cheap Japanese food", so the |
|---|
| 0:02:39 | descriptions are very diverse, and |
|---|
| 0:02:42 | there are also more of them on average than in most previous NLG datasets: we have |
|---|
| 0:02:48 | more than eight | 
|---|
| 0:02:50 | reference texts per meaning representation. |
|---|
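(To make the kind of data pair described above concrete, here is a minimal illustrative sketch in Python; the attribute names `food` and `priceRange` are hypothetical, and the references loosely echo the two example descriptions from the talk rather than being copied from the actual dataset.)

```python
# Minimal sketch of one data pair: an attribute-value meaning representation
# plus the diverse textual references that describe it.
# Attribute names are hypothetical; references echo the talk's two examples.
example_pair = {
    "meaning_representation": {
        "food": "Japanese",
        "priceRange": "cheap",
    },
    # Several crowd workers verbalise the same meaning representation,
    # which is what makes the references diverse.
    "references": [
        "This place serves low-cost Japanese-style cuisine.",
        "You can get cheap Japanese food here.",
    ],
}

for reference in example_pair["references"]:
    print(reference)
```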
| 0:02:55 | We have evaluated the dataset in various ways and compared it with previous datasets |
|---|
| 0:03:02 | in the same domain | 
|---|
| 0:03:03 | and we have found that |
|---|
| 0:03:05 | we have | 
|---|
| 0:03:08 | higher lexical richness, which means |
|---|
| 0:03:11 | more | 
|---|
| 0:03:12 | diverse text in terms of words used, and a higher proportion of rare words in |
|---|
| 0:03:19 | the data. |
|---|
| 0:03:20 | The sentences are also |
|---|
| 0:03:23 | on average more syntactically complex, so we have |
|---|
| 0:03:29 | longer and more complex sentences. |
|---|
| 0:03:32 | We have also added |
|---|
| 0:03:34 | a |
|---|
| 0:03:35 | kind of semantic challenge, because we asked the crowd workers only to verbalise information |
|---|
| 0:03:40 | that seems relevant given the picture in the instructions, so this actually requires content selection as part of |
|---|
| 0:03:49 | natural language generation, which is not present in the previous |
|---|
| 0:03:55 | datasets of the same type. |
|---|
| 0:03:58 | And we are organising a shared challenge with this dataset, so |
|---|
| 0:04:03 | you can | 
|---|
| 0:04:05 | all register for the challenge; we would like to encourage you to do so |
|---|
| 0:04:09 | and try to train your own NLG system and |
|---|
| 0:04:15 | submit your results |
|---|
| 0:04:16 | by the end of October. |
|---|
| 0:04:18 | We provide the data and also a baseline system, along with the baseline system outputs |
|---|
| 0:04:23 | and metrics scripts |
|---|
| 0:04:25 | that |
|---|
| 0:04:26 | will be used for the challenge, along with some human evaluation. |
|---|
| 0:04:32 | So |
|---|
| 0:04:33 | that's it, and I would |
|---|
| 0:04:35 | like to invite you to come and see our poster later on, and we can |
|---|
| 0:04:40 | talk about this some more. |
|---|
| 0:04:42 | And definitely |
|---|
| 0:04:44 | download the data and take part in our challenge. |
|---|
| 0:04:48 | Thank you. |
|---|