Speech Transcript - The Many Facets of Dialog

0:00:17	good morning everyone welcome to day three of sigdial and i'm delighted to
0:00:24	be here to introduce our third keynote speaker professor helen meng from the chinese university of
0:00:29	hong kong helen got her phd from mit
0:00:34	and she has been a professor at the chinese university of hong kong
0:00:41	for some time let's not count the number of years and in addition to what
0:00:47	she's done in many aspects of speech and language processing language learning et cetera
0:00:52	she is also involved in university things she's been an associate dean of research she's also given
0:00:57	presentations at the world economic forum and the world peace conference and so on i mean so she
0:01:04	is not just doing research but actually trying to get
0:01:09	the information about speech and language out and help other people so without further
0:01:16	ado i'd like to introduce professor helen meng
0:01:31	so thank you very much talent for the kind introduction good morning ladies and
0:01:36	gentlemen i'm really delighted to be here i wish to thank the organizers for the
0:01:40	very kind invitation
0:01:42	and i've been working as was said a lot on language learning in recent
0:01:48	years but upon receiving the invitation from sigdial
0:01:51	i thought this is an
0:01:53	excellent opportunity for me to take stock of what i've been doing
0:01:58	rather serendipitously
0:02:00	on dialogue
0:02:01	so i decided to
0:02:05	choose this topic the many facets of dialogue for
0:02:09	my presentation
0:02:11	and in fact
0:02:14	the different facets that i'm going to cover
0:02:16	include
0:02:17	dialogue in teaching and learning
0:02:19	dialogue in e-commerce
0:02:21	dialogue in cognitive assessment and the first three are more application oriented and then
0:02:28	the next two are more research oriented extracting semantic patterns from dialogues
0:02:32	and modeling user emotion changes in dialogues
0:02:38	so here we go the first one
0:02:40	is on
0:02:42	dialogue in teaching and learning
0:02:44	where
0:02:45	this project is
0:02:46	about investigating student discussion dialogues and learning outcomes in flipped classroom teaching
0:02:54	so how is that my phd of it and more so too is
0:03:00	the research assistant in our team
0:03:02	and we have three undergraduate student helpers in this project
0:03:08	so
0:03:09	this project came about because back in twenty twelve
0:03:13	there was actually a sweeping change in university education in hong kong
0:03:18	where
0:03:19	all the universities had to migrate from a three year
0:03:23	curriculum to a four year curriculum
0:03:26	so what that meant is that we're admitting
0:03:28	students
0:03:29	who are one year younger
0:03:31	and we had to design a curriculum for first year engineering students which is broad
0:03:38	based meaning
0:03:39	all engineering students need to
0:03:42	take those courses
0:03:44	and among these is the engineering freshman
0:03:48	math course
0:03:49	and because it's a broad based admission
0:03:52	so we have really big classes
0:03:54	and after a few years of teaching these big classes
0:03:58	we realise that we need to
0:04:01	serve the students better
0:04:03	especially for the elite students
0:04:05	so we designed an
0:04:08	elite freshman math course
0:04:10	where it has a much more demanding curriculum and of course students can opt
0:04:15	in and opt out of this course
0:04:18	it's basically a freshman year engineering math course
0:04:22	but we have this elite course and we have a very dedicated teacher my
0:04:28	colleague professor sidharth jaggi
0:04:31	and he's very creative and innovative and he has been
0:04:35	trying out many different
0:04:39	ways to teach the elite students
0:04:41	and so many different ways to flip his classroom
0:04:46	and eventually he's settled upon a
0:04:50	mode which i'm gonna talk about so in general as you know flipped
0:04:56	classroom teaching involves having students watch online video lectures before they come into class and
0:05:03	then class time is all dedicated to in class discussions
0:05:08	so students are given
0:05:10	in class exercises and they work in teams
0:05:14	and they discuss and in fact they try to solve these problems and sometimes the
0:05:20	team
0:05:20	gets picked to go up to the front and
0:05:24	present their solution to their classmates
0:05:29	now this is the setting
0:05:33	and in fact it's in a computer lab so you see computers i
0:05:36	think it would be ideal if we had reconfigurable furniture in the classroom
0:05:41	but hopefully it will come someday so
0:05:45	as i mentioned every week
0:05:48	the class
0:05:49	time is
0:05:50	spent on
0:05:52	peer to peer learning and group discussions and some groups are selected to present their
0:05:56	solution
0:05:57	so
0:05:59	so we let my students record
0:06:05	the student group discussions during class
0:06:08	so the dots are where the computer monitors are placed in the room
0:06:13	and the red dots are where we put the speech recorders
0:06:18	and
0:06:20	so you can see the students in groups and we actually got consent from most
0:06:25	of the groups
0:06:26	except for two
0:06:27	which are shown here to record their discussions
0:06:31	so technically
0:06:34	the contents of an audio file look like this
0:06:36	so the lecturer would start the class
0:06:40	by addressing the whole class and of course also close the class
0:06:45	so we have lecturer speech
0:06:47	at the beginning and at the end
0:06:49	and
0:06:50	at various
0:06:51	points in time
0:06:53	in the class
0:06:54	sometimes the lecturer will speak and sometimes the ta will speak
0:06:57	again addressing the whole class
0:07:01	and there are times
0:07:02	when a student group finishes an exercise and they're invited to go up to the
0:07:07	front to present their solution but all the other times are open for the
0:07:12	student groups to discuss
0:07:15	within the team within the group to try to solve
0:07:18	the problem at hand
0:07:20	so this is the content of the audio file
0:07:23	so it's actually
0:07:25	we have two types of speech
0:07:27	one which is directed at the whole class
0:07:30	and one
0:07:31	which is the student group discussions
0:07:34	so we devised a methodology for automatic separation
0:07:38	between these two types
0:07:39	so that we can filter out the we want to be able to filter out
0:07:44	the
0:07:45	student group discussion speech
0:07:47	for further processing and study
0:07:51	this methodology we will be presenting at interspeech next week
0:07:55	now
0:07:56	it's actually
0:07:57	within the student group discussions
0:07:59	we actually segment the speech the audio
0:08:03	and this segmentation is based on speaker change
0:08:06	and also if there's a pause
0:08:08	of more than one second duration then we'll segment it and
0:08:12	we have a lot of student helpers helping us in terms of transcribing
0:08:17	the speech
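
The segmentation rule described above (start a new segment on a speaker change, or after a pause of more than one second) can be sketched in Python. This is a minimal sketch, not the project's actual pipeline: the `Word` record, its field names, and the assumption that speaker-labelled word timestamps are already available are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    speaker: str   # hypothetical speaker label
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str


def segment(words, max_pause=1.0):
    """Group time-ordered words into segments.

    A new segment starts when the speaker changes or when the silence
    since the previous word exceeds max_pause seconds.
    """
    segments = []
    current = []
    for w in words:
        if current:
            prev = current[-1]
            speaker_changed = w.speaker != prev.speaker
            long_pause = (w.start - prev.end) > max_pause
            if speaker_changed or long_pause:
                segments.append(current)
                current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments
```

In practice the speaker labels and timings would come from manual annotation or a diarization step, which the talk does not detail.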
0:08:18	and a typical transcription looks like this
0:08:23	so each segment includes
0:08:25	the name
0:08:27	so for example gets more bits known as and report the call themselves and reburned
0:08:31	and here are the
0:08:32	segments in fact the students we teach and we lecture in english but when
0:08:37	they are
0:08:38	open to discussing among themselves some of them
0:08:41	discuss in putonghua mandarin
0:08:43	and some of them discuss in
0:08:44	in cantonese
0:08:46	so
0:08:47	so here the speech is actually in chinese
0:08:50	but i've translated it for presentation here so just to play for you
0:09:00	each of these segments in turn
0:09:02	so basically the first segment is a speaker a male speaker
0:09:08	saying it really should be the same and then the female speaker says you know these
0:09:11	two are always exactly the same and so on so i'm gonna play for you
0:09:14	what the audio sounds like starting with the first segment
0:09:21	so that's the first segment second segment
0:09:28	third segment
0:09:32	fourth segment and the last so very noisy
0:09:38	and
0:09:39	so what we have been working on is the transcription
0:09:44	now
0:09:45	the class exercises generally take one week to solve
0:09:49	and each week has three classes
0:09:51	and so together the recordings compose a set
0:09:55	we have ten groups and over a semester where we are able to record over twelve
0:10:00	weeks we end up with a hundred and twenty
0:10:03	weekly group discussion sets which we denote by w g d s
0:10:07	out of these
0:10:09	fifty two have been transcribed this is from the previous offering
0:10:13	last year's offering of the course
0:10:15	and the total number of hours of the audio is five hundred fifty
0:10:19	hours
0:10:20	and the total hours of discussion is about two hundred eighty hours and we've transcribed
0:10:27	about a hundred hours
0:10:29	so what we do here
0:10:30	as a beginning step
0:10:33	is to look at the weekly group discussion set and try to look at
0:10:38	the discussions of the students and see whether it is relevant
0:10:42	to the course topic
0:10:44	and also what level of activity
0:10:48	there was in communicative exchange
0:10:52	and then we try to conduct analysis to tie it with the academic performance
0:10:57	of the group in the course
0:10:59	so
0:11:00	if we look at these two
0:11:03	measures relevance to the course topic in fact we divide that up into
0:11:09	two components
0:11:10	the first is the number of matching math terms
0:11:14	that occur in the speech
0:11:16	so for example here is
0:11:18	a group audio
0:11:20	i
0:11:29	so basically they say if there's a circle then usually use polar coordinates
0:11:34	and i've
0:11:35	used polar coordinates and then i've used it for integration but the variable y has
0:11:40	some problems
0:11:41	so that's what he's saying
0:11:42	and in this
0:11:43	segment
0:11:45	we actually see the matching math terms based on some textbooks and math dictionaries these
0:11:52	are the resources that we have chosen
0:11:55	and so we take note of those
0:12:00	then the next component is on content similarity and we figured that because the discussion
0:12:05	is there to solve an in class exercise so they should bear similarity the discussion
0:12:11	content should have similarity to the in class exercise so to measure that
0:12:16	we trained a
0:12:17	word2vec model
0:12:19	and then we use that
0:12:21	to compute a segment vector so for
0:12:24	each segment in the discussion
0:12:26	we get a segment vector and we also get a document vector
0:12:30	from the in class exercise and we measure the cosine similarity
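
The content-similarity measure above (a segment vector compared with a document vector of the in-class exercise by cosine similarity) can be illustrated as follows. This is a sketch under assumptions: the talk does not say how word vectors are combined into a segment or document vector, so simple averaging is assumed here, and the `embeddings` dictionary stands in for a trained word-embedding model with made-up two-dimensional values.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of the tokens found in the embedding table.

    Averaging is an assumption; returns None if no token is known.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embedding table with invented values, for illustration only.
embeddings = {
    "polar": [1.0, 0.1],
    "coordinates": [1.0, 0.0],
    "integral": [0.9, 0.2],
    "lunch": [0.0, 1.0],
    "movie": [0.1, 0.9],
}

exercise_vec = average_vector(["polar", "coordinates", "integral"], embeddings)
math_segment_vec = average_vector(["polar", "integral"], embeddings)
chat_segment_vec = average_vector(["lunch", "movie"], embeddings)
```

With these toy values, `cosine(math_segment_vec, exercise_vec)` comes out much higher than `cosine(chat_segment_vec, exercise_vec)`, which mirrors the high-similarity versus low-similarity contrast described in the talk.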
0:12:33	so here's an example of the high similarity segments on top versus the
0:12:39	low similarity segments at the bottom so you can see that upon first glance the
0:12:44	top two segments they are indeed about some math
0:12:50	and then the third one is which chapter so it's referring to the textbook
0:12:56	probably
0:12:57	whereas the low similarity segments are general conversation
0:13:02	so that has to do with the relevance of the content we also measure the
0:13:08	level of activity in information exchange and for that
0:13:11	we
0:13:13	count the number of segments in the discussion dialogue
0:13:17	and also the number of words
0:13:19	in the discussion dialogue and we add both
0:13:22	chinese characters and english words together
0:13:26	so actually for a weekly group discussion set we have
0:13:30	four features
0:13:31	two
0:13:34	pertaining to relevance to the course topic and two for information exchange measures
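
The two information-exchange features (number of segments, and number of words counting Chinese characters and English words together) can be sketched like this. It is only an illustration: the regular expressions, the function names, and the choice of the basic CJK Unicode block are assumptions, not the project's actual tokenization.

```python
import re

# Assumption: one Chinese character counts as one unit (basic CJK block only).
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Assumption: an English word is a run of letters, optionally with an apostrophe part.
ENGLISH_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def count_units(text):
    """Count Chinese characters plus English words in a transcribed segment."""
    return len(CJK_CHAR.findall(text)) + len(ENGLISH_WORD.findall(text))


def exchange_features(segments):
    """Return the two communicative-exchange features for a list of segment texts."""
    return {
        "num_segments": len(segments),
        "num_words": sum(count_units(s) for s in segments),
    }
```

For a code-switched segment such as "用 polar coordinates 做", `count_units` gives 4: two Chinese characters plus two English words.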
0:13:39	now
0:13:40	the next thing we do is to look at
0:13:43	the academic performance
0:13:45	so the learning outcome
0:13:46	that corresponds to each week's course topic
0:13:49	is measured through the relevant question components
0:13:53	that are present in the way we've set the midterm paper and the final exam
0:13:58	paper
0:13:59	so
0:14:00	basically we have a score and the final exam counts sixty percent
0:14:04	the midterm counts forty percent but we have set the questions such that the course content
0:14:11	for each week will be present in different components
0:14:14	in the midterm and
0:14:16	final papers respectively
0:14:19	therefore we are able to
0:14:21	look at a group's overall performance according to the course content for a particular week
0:14:29	so this is the way we did the analysis and here's the
0:14:33	quick summary
0:14:34	so basically we looked at the high performing groups
0:14:38	versus the low performing groups and it's no surprise we can see that
0:14:42	the high performing groups generally have a much higher average proportion of
0:14:46	matching math terms in the discussions
0:14:49	and also they have higher content similarity so
0:14:52	in other words the discussion content is much more relevant
0:14:57	and
0:14:58	in terms of communicative exchange activity the high-performing groups have many more
0:15:04	total segments exchanged and
0:15:08	more words
0:15:10	note that for the first three measures so these three matching math terms content similarity
0:15:16	and number of segments exchanged
0:15:18	we did a statistical significance test and it's significant but the fourth one is at
0:15:24	point oh eight so but i think it's still relevant and it's still an
0:15:30	important feature
0:15:32	so what i have presented to you is the first step
0:15:35	where we
0:15:37	collected the data and we tried to investigate the discussion dialogues in the
0:15:41	flipped classroom setting
0:15:42	in relation to learning outcomes
0:15:45	in terms of further investigation what
0:15:48	our team would like to understand is how
0:15:52	can
0:15:53	the student discussion
0:15:57	become an effective platform for peer to peer learning how the dialogue can
0:16:03	facilitate learning and enhance learning
0:16:06	and furthermore if there are high-performing teams
0:16:09	because of very efficient exchange
0:16:12	in the dialogues
0:16:14	whether
0:16:14	we can use that information to inform group formation
0:16:19	so right now the students would form a group at the beginning of the
0:16:22	semester and they stick with that for the entire semester so
0:16:26	we're thinking that if there are high performing groups as a result of very effective discussions
0:16:33	maybe if we are able to swap the groups around and
0:16:38	and
0:16:39	let this dialogue exchange the benefits of the dialogue exchange to learning
0:16:44	spread then maybe
0:16:45	you know rising tide
0:16:47	raises all boats so maybe we enhance learning for the whole class
0:16:50	so that's the direction we'd like to take this investigation
0:16:55	so that's the first section
0:16:57	now i will move on to the second section which is on e-commerce
0:17:00	so this is actually the jingdong dialogue challenge in the summer of twenty
0:17:04	eighteen
0:17:06	and i had a summer
0:17:08	intern
0:17:08	that year and i ching is an undergraduate student and so i said well
0:17:14	maybe you may be interested in joining the jingdong dialogue challenge but you have
0:17:19	no background luckily i have also had a part time a postdoctoral fellow duct according
0:17:25	to
0:17:25	and also doctor a value is a recent graduate from our group and he's now
0:17:30	working for the startup speechx limited
0:17:33	and in particular i'd like to thank a doctor bones order to show don't go
0:17:37	and
0:17:37	miss them on track of
0:17:39	jingdong ai for running this jingdong dialogue challenge from which we've benefited a lot
0:17:46	especially the student
0:17:47	the junior undergraduate student
0:17:49	learned a lot
0:17:50	so
0:17:51	the goal of this dialogue challenge is to develop a chatbot for e-commerce
0:17:55	customer service
0:17:56	using jingdong's very large dataset
0:17:59	they're giving us
0:18:00	they gave us one million chinese customer service conversation sessions
0:18:04	which amounts to twenty million conversation utterances or turns
0:18:07	this data covers ten after sales topics
0:18:10	and they're unlabeled and each of these topics may have further subtopics so
0:18:16	for example invoice modification this topic
0:18:19	it can have
0:18:20	the subtopics of changing the name
0:18:22	changing the invoice type asking about e-invoices et cetera
0:18:27	and the task is to do the following we have a context
0:18:31	which consists of
0:18:32	the two previous conversation
0:18:35	turns
0:18:35	so the two
0:18:36	so four utterances
0:18:38	from the two previous turns and the current query
0:18:41	from the user or from the customer
0:18:44	and the task is to generate a response for this context
0:18:49	okay so it's basically a five utterance group
0:18:54	and we need to generate a response
0:18:57	and generally that response from the system is evaluated by experts
0:19:02	human experts from customer service
0:19:07	so there are two very well known approaches the retrieval-based approach and the
0:19:11	generation based approach
0:19:13	and we
0:19:15	take advantage of the training data with the context and response pairs
0:19:19	in building these
0:19:20	so our retrieval-based approach is very standard basically it's tf-idf plus cosine similarity
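
A retrieval step of the kind described (tf-idf vectors compared by cosine similarity over stored context and response pairs) can be sketched in plain Python. This is a minimal sketch, not the submitted system: the smoothed idf formula, the function names, and the assumption that contexts are already tokenized are illustrative choices.

```python
import math
from collections import Counter


def tfidf(tokens, idf):
    """Sparse tf-idf vector as a dict; tokens unseen in training are dropped."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def build_index(contexts):
    """contexts: list of token lists taken from the training contexts."""
    n = len(contexts)
    df = Counter()
    for toks in contexts:
        df.update(set(toks))
    # Smoothed idf; one common variant, assumed here.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [tfidf(toks, idf) for toks in contexts]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_tokens, idf, vectors, responses, n=20):
    """Return up to n (response, score) pairs whose stored context best matches the query."""
    q = tfidf(query_tokens, idf)
    scored = sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [(responses[i], s) for s, i in scored[:n]]
```

The stored response of each matching context is returned as a candidate, so the quality of this step depends entirely on how well the query context overlaps with contexts seen in training.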
0:19:26	and our generation based approach is also a very standard configuration where we segment
0:19:33	the chinese
0:19:35	context
0:19:36	the two previous
0:19:38	dialogue turns together with the current query
0:19:40	with that met that's
0:19:42	and then also we segment the response
0:19:45	and we feed those data and we model the statistical relation between the context
0:19:49	and the response
0:19:50	using seq2seq with attention
0:19:53	using this model
0:19:55	and so that's the training and also the inference phases
0:19:58	now
0:19:59	the
0:20:00	system that we eventually submitted is a hybrid model
0:20:04	based on a
0:20:05	very commonly used rescoring framework
0:20:08	so what we did was to generate using the retrieval-based approach
0:20:14	n best response alternatives
0:20:16	where we chose n to be twenty
0:20:18	so that
0:20:19	there's enough choice but also it won't take too long
0:20:22	and
0:20:23	and we use the generation based approach to rescore
0:20:26	these twenty responses so
0:20:29	the thinking behind that is the generation based approach will
0:20:34	consider
0:20:35	the
0:20:35	given context and the chosen response the relationship between those
0:20:40	and then we use this
0:20:42	rescored
0:20:45	the highest scoring response so we rescore and we rerank and
0:20:50	we check whether the highest scoring response has exceeded the threshold and this is arbitrarily
0:20:56	chosen
0:20:57	at points out of five
0:20:58	so if it exceeds the threshold then we'll output that response
0:21:02	otherwise we think that maybe that's a sign that our retrieval based model
0:21:09	does not have enough information to choose the right response so we just use the
0:21:13	entire seq2seq
0:21:15	to generate a new response
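
The hybrid rescoring logic just described can be summarized in a short sketch: retrieve n candidates, rescore them with the generation model, output the top candidate if its score exceeds a threshold, and otherwise generate a fresh response. The three callables (`retrieve`, `score`, `generate`) are hypothetical placeholders for the retrieval model and the seq2seq model, and the threshold is left as a required parameter because its value is not clear from the transcript.

```python
def hybrid_respond(context, retrieve, score, generate, threshold, n=20):
    """Hybrid retrieval plus generation response selection.

    retrieve(context, n) -> list of candidate responses (hypothetical interface)
    score(context, response) -> float from the generation model (hypothetical)
    generate(context) -> newly generated response (hypothetical)
    """
    candidates = retrieve(context, n)
    scored = [(score(context, r), r) for r in candidates]
    if scored:
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score > threshold:
            return best
    # No candidate is convincing enough: fall back to pure generation.
    return generate(context)
```

Keeping the three models behind plain callables makes it easy to swap in stubs for testing, or to try a different scorer without touching the selection logic.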
0:21:17	and so that's the system and we got a technology innovation award for the system
0:21:22	so it has been a very fruitful experience especially for my undergraduate student and she
0:21:27	decided after this jingdong dialogue challenge to pursue a phd so she's actually starting
0:21:33	her first term as a phd student in our lab now
0:21:37	and also we got valuable data resources from industry during this summer
0:21:42	and i think
0:21:43	moving forward we'd like to
0:21:45	look into flexible use of context information
0:21:48	for different kinds of user inputs ranging from chit chats to one shot information-seeking enquiries
0:21:54	followup questions multi intent input et cetera and i think yesterday i saw
0:21:59	professor of folk owens
0:22:03	poster and i think you have a very comprehensive decomposition of this problem
0:22:08	so that's my second project and now i'm gonna move to the third project which
0:22:14	is looking at dialogue in cognitive screening
0:22:17	so investigating spoken language markers in neuropsychological dialogues for cognitive screening this is
0:22:24	actually a recently funded project it's a very big project and we have a cross
0:22:29	university team
0:22:31	so there's the chinese university team
0:22:33	and we also have colleagues from h k u s t and also polytechnic university
0:22:38	so
0:22:39	but also from chinese university not only do we have engineers we also have
0:22:44	linguists
0:22:45	psychologists neurologists
0:22:48	geriatricians and audiologists on our team so i'm really excited about this
0:22:52	team
0:22:53	and
0:22:54	we have our teaching hospital which is the prince of wales hospital and we're also
0:22:59	building a new c u h k teaching hospital which is a private hospital so i
0:23:03	think we're gonna be able to get
0:23:05	many
0:23:06	subjects to
0:23:08	participate in our study
0:23:10	so actually this study focuses on neurocognitive disorder
0:23:17	so it's another term for dementia
0:23:19	and it is you know well known that the global population is ageing
0:23:24	fast and actually hong kong's population is ageing even faster
0:23:28	and n c d neurocognitive disorder
0:23:31	is very prevalent among older adults
0:23:34	it has an insidious onset it's chronic and progressive and there's a general global deterioration
0:23:40	in memory
0:23:41	communication thinking judgement and either probably to functions
0:23:44	and it's the most incapacitating
0:23:46	disease
0:23:48	now n c d manifests itself in communicative impairments such as uncoordinated articulation like
0:23:55	dysarthria the subject may
0:23:57	lose the capability in language use such as in aphasia
0:24:00	they may have a reduced vocabulary and grammar weakened listening reading and writing
0:24:05	and the existing detection methods include brain scans blood tests
0:24:09	and face-to-face neuropsychological n p assessments which include structured
0:24:14	semi-structured and free-form dialogues
0:24:17	so a free-form dialogue is where the participant is invited to
0:24:24	to do a picture description so they're given a picture or sometimes the process
0:24:29	and asked to describe it
0:24:31	now
0:24:33	my colleagues in the teaching hospital they have been recording
0:24:38	actually they were allowed to record their neuropsychological tests
0:24:43	and that will provide some that provides some initial data for our research so
0:24:48	actually
0:24:49	the flow of the conversation includes the mmse
0:24:53	the mini mental state examination together with the montreal cognitive assessment test
0:24:59	so it's the combination of both and there are some overlapping components so that's shared
0:25:05	and
0:25:06	we have about two hundred hours of conversations between the clinicians and the subjects
0:25:10	it's a one on one
0:25:12	neuropsychological test
0:25:15	now here's an example so we have normal subjects and also others who are cognitively impaired
0:25:22	and here are some examples of the
0:25:25	excerpts of the conversation so this is from a normal subject who was asked about the
0:25:31	commonality between a train and a bicycle
0:25:33	and this is the answer
0:25:36	and then the clinician says size is big and then the subject says yes the
0:25:39	train is long and the bike is smaller isn't it and then the clinician
0:25:43	says oh
0:25:44	okay but what's common between them and the subject says both are used for transport
0:25:49	now for the cognitively impaired subject
0:25:53	this is more typical and in fact the original
0:25:57	dialogue is in chinese so we also translated it into english for presentation here
0:26:03	and this is the dialogue for a cognitively impaired subject so we did a
0:26:08	very preliminary analysis based on about twenty individuals gender balanced
0:26:13	and we look at the average number of utterances in an n p assessment
0:26:18	so you can see
0:26:19	that for males
0:26:21	so the total number of utterances drops as we move
0:26:26	from the normal to the cognitively impaired
0:26:28	and also the same trend for the female
0:26:31	and then the gap time that's sort of the reaction time
0:26:34	there's a general increase small increase
0:26:37	going from the normal to the cognitively impaired and this is for the male and
0:26:41	this one is for the female
0:26:42	also the normal subjects tend to speak faster so they have a higher
0:26:48	average number of characters per minute and average number of words per minute
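
The preliminary measures mentioned above (gap time as a proxy for reaction time, and speaking rate in characters per minute) can be computed from time-stamped turns roughly as follows. This is an illustrative sketch: the `Turn` record, the speaker labels "clinician" and "subject", and the choice to measure rate over the subject's speaking time only are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # "clinician" or "subject" (hypothetical labels)
    start: float   # seconds
    end: float     # seconds
    text: str


def gap_times(turns):
    """Silence between the end of a clinician turn and the start of the subject's reply."""
    gaps = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "clinician" and cur.speaker == "subject":
            gaps.append(max(0.0, cur.start - prev.end))
    return gaps


def chars_per_minute(turns):
    """Subject's characters per minute of speaking time (spaces ignored)."""
    subject_turns = [t for t in turns if t.speaker == "subject"]
    duration = sum(t.end - t.start for t in subject_turns)
    chars = sum(len(t.text.replace(" ", "")) for t in subject_turns)
    return 60.0 * chars / duration if duration > 0 else 0.0
```

Averaging `gap_times` per subject gives one reaction-time feature per assessment; the same turn structure also yields the utterance counts discussed in the talk.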
0:26:52	and
0:26:55	so this is very preliminary data
0:26:58	and we're looking at
0:26:59	different linguistic features such as
0:27:04	grammar quality
0:27:06	information density fluency and also acoustic features such as
0:27:10	in addition to reaction time duration of pauses hesitations pitch prosody et
0:27:15	cetera so we'll be looking at a whole spectrum of these features
0:27:19	and also my student has developed an initial prototype which illustrates how interactive screening may
0:27:26	be done
0:27:27	and here's the
0:27:29	a demonstration video to show you
0:27:32	so it's actually it starts with
0:27:38	a word recall
0:27:39	exercise
0:27:41	please listen carefully i am going to state three words that i want you to
0:27:47	try to remember and repeat them back to me
0:27:51	please repeat the following three words to me
0:27:55	c then
0:27:57	can
0:27:58	radar
0:28:00	say a response it'd be
0:28:05	well
0:28:07	season
0:28:08	it should
0:28:10	river
0:28:18	good
0:28:20	please remember the three words that were presented and recall them later on
0:28:27	please do your best to describe what is happening in the picture above
0:28:33	tap on the button below to begin or complete your response
0:28:42	i see
0:28:43	a family of four
0:28:46	or sitting in the living room
0:28:50	there is a order
0:28:53	monitor
0:28:55	carol
0:28:57	and the board
0:28:59	they are do you do we are we to release
0:29:06	i can't really see much clearly i don't know
0:29:12	that's
0:29:14	good
0:29:16	tap on data and that an if you have completed the task
0:29:20	tap on the try again button to redo the picture description task
0:29:31	please say the three words i asked you to remember earlier in the
0:29:37	recall and say the three words to me
0:29:41	say a response it'd be
0:29:47	season
0:29:50	rumour
0:29:53	i don't remember the last one
0:29:56	summer
0:29:58	u denotes the
0:30:07	so basically the system tries just or a job
0:30:11	the results of everyone several
0:30:13	the data
0:30:14	and so there are score charts
0:30:17	related to for example how many correct answers
0:30:21	correct responses were given the response time length the gap time et cetera so
0:30:27	i need to i need to state clearly that
0:30:30	the voice is actually so the voice is based on no the speech is based
0:30:36	on
0:30:37	real data but it's in chinese
0:30:39	so my student
0:30:42	translated it to english and tried to mimic the
0:30:45	the pauses and also you saw i think that the subject likes to
0:30:50	say i think that's it so sort of talk
0:30:53	talking to himself
0:30:54	so he also mimicked that so that is for illustration only
0:30:58	almost all our data
0:31:00	will be in chinese cantonese or maybe
0:31:02	mandarin
0:31:04	so as a quick summary spoken dialogue offers easy accessibility
0:31:09	and high feature
0:31:11	resolution i'm talking about even millisecond resolution
0:31:14	in terms of reaction time and pause time et cetera
0:31:17	for cognitive assessment so we want to be able to develop
0:31:21	various speech language and dialogue processing technologies
0:31:24	to support holistic assessment of various cognitive functions
0:31:28	and domains
0:31:29	by combining dialog interaction with other interactions
0:31:33	and also we want to further develop this platform as a supportive tool
0:31:37	for cognitive screening
0:31:40	so that's the end of the third project and now i'm gonna move away from
0:31:45	the application oriented facets to more research oriented facets
0:31:50	so the fourth project is on extracting
0:31:53	semantic patterns from user inputs
0:31:55	in dialogues and we've been developing a convex polytopic model for that and this
0:32:01	work done by a doctor according to myself and my colleague are professor younger
0:32:06	so
0:32:07	this study actually uses atis two and three
0:32:11	and together about five thousand utterances to support our investigation
0:32:16	and the convex polytopic model
0:32:19	is really an unsupervised approach
0:32:22	that is applicable to short text
0:32:24	and it can help us automatically identify semantic patterns from a dialogue corpus
0:32:30	via a geometric technique
0:32:32	so as shown here with the well-known atis
0:32:37	examples
0:32:38	we can see the semantic pattern of
0:32:40	show me flights
0:32:41	so this is an intent
0:32:43	and also another
0:32:44	semantic pattern of going from an origin to a destination and also
0:32:50	another
0:32:50	semantic pattern on a certain day
0:32:54	so we begin with a space of m dimensions where m is the vocabulary size and each
0:33:00	utterance forms in this space a point and the coordinates of the point
0:33:06	are
0:33:07	equal to the sum normalized word count of that axis
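
The representation just described, where each utterance is a point in an m-dimensional space whose coordinates are sum-normalized word counts, can be written down directly. This is a small sketch under assumptions: whitespace tokenization and the helper names are illustrative, and the two example utterances are toy inputs rather than corpus data.

```python
from collections import Counter


def build_vocab(utterances):
    """Sorted vocabulary over whitespace-tokenized utterances (m = len(vocab))."""
    return sorted({w for u in utterances for w in u.split()})


def sum_normalized_vector(utterance, vocab):
    """Word counts along each vocabulary axis, divided by the total count."""
    counts = Counter(utterance.split())
    total = sum(counts[w] for w in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[w] / total for w in vocab]


utterances = ["show me flights", "flights to baltimore"]
vocab = build_vocab(utterances)
points = [sum_normalized_vector(u, vocab) for u in utterances]
```

Because the coordinates of every non-empty utterance sum to one, all points lie on a common hyperplane, which is consistent with the talk's next step of working in a lower dimensional affine subspace.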
0:33:11	so there are two steps in our approach the first one is to embed
0:33:15	the utterances into a low dimensional affine subspace using principal component analysis so actually
0:33:21	this is a very common technique and the principal components tend to capture
0:33:26	features that can optimally distinguish points by their semantic differences
0:33:31	then we move on to the second step where we try to generate a compact
0:33:36	compact convex polytope
0:33:39	to
0:33:40	enclose all the embedded utterance points
0:33:43	and this is using
0:33:44	the quickhull algorithm
0:33:46	so as an illustration
0:33:50	this is what we call a normal type
0:33:54	convex polytope
0:33:55	and all these
0:33:57	points are utterance points so they illustrate the utterances in the corpus
0:34:03	residing in that space
0:34:05	the affine subspace
0:34:07	and the
0:34:08	compact convex polytope the vertices of the polytope
0:34:14	each vertex is actually
0:34:16	a point from the set of from the collection of utterance points
0:34:21	so each vertex
0:34:23	also corresponds to an utterance
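
The property stated above, that every vertex of the enclosing polytope is itself one of the utterance points, can be seen in a small two-dimensional example. Note the swap: the talk uses the Quickhull algorithm, whereas this sketch uses Andrew's monotone chain, a simpler method that is assumed here only because it fits in a few lines of plain Python; for a set of 2D points both yield the same convex hull. The sample points are invented.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull of 2D points in counter-clockwise order (monotone chain).

    Collinear boundary points are dropped, so only extreme points remain.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


# Invented embedded utterance points, for illustration only.
embedded = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
vertices = convex_hull(embedded)
```

Here `vertices` contains only the four corner points; the interior point (1, 1) and the boundary point (1, 0) are enclosed but are not vertices, just as most utterances lie inside the polytope while a few extreme utterances define it.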
0:34:25	now
0:34:26	we can then connect the linguistic aspects
0:34:29	of the utterances within the corpus to the geometric aspects of the convex polytope
0:34:37	so actually you can think of the utterances in the dialogue corpus they become
0:34:42	embedded points in the affine subspace
0:34:44	the scope of the corpus
0:34:47	is now encompassed by the compact
0:34:50	convex polytope
0:34:51	that is delineated by the boundaries connecting the vertices
0:34:55	and then the semantic patterns of the language of the corpus
0:34:59	are now represented
0:35:01	as
0:35:02	the vertices
0:35:03	of the convex
0:35:05	of the compact convex polytope
0:35:09	now
0:35:09	because the vertices represent extreme points of the polytope
0:35:14	each utterance point can also be formed by a linear combination of the polytope's
0:35:18	vertices
0:35:20	so let's look at the atis corpus
0:35:23	the atis corpora
0:35:24	and as you know in atis we have these intents
0:35:28	and we also colour code them here and then we plot the utterances in the
0:35:33	atis training corpora
0:35:35	in that space and we chose a two-dimensional space so that you can
0:35:39	see all the plots on a plane
0:35:41	and then we ran the quickhull algorithm and it came up with this polytope
0:35:48	so this is the most compact one
0:35:51	and you can see
0:35:52	that the most compact
0:35:54	a polytope
0:35:55	needs
0:35:56	twelve vertices so v one v two
0:35:59	all the way to v twelve
0:36:04	now each vertex actually also
0:36:06	corresponds to an utterance
0:36:08	so you can look at
0:36:10	vertices one
0:36:11	to nine they're all
0:36:13	dark blue in colour and in fact they all
0:36:16	correspond to an utterance with the intent class of flight
0:36:21	vertex ten
0:36:22	is light blue
0:36:23	and actually corresponds to
0:36:25	the intent of
0:36:27	abbreviation
0:36:29	and then vertex eleven is also dark blue so is vertex twelve
0:36:34	so this is
0:36:36	an illustration
0:36:37	of the convex polytope
0:36:39	now we can then look at each vertex
0:36:43	so v one to v nine they all
0:36:47	correspond to utterances so you can see
0:36:49	v one to v nine
0:36:51	so these v one vertex one to vertex nine over here they're very close
0:36:55	together and essentially they are well
0:36:58	capturing the semantic pattern
0:37:00	of
0:37:01	from some origin to some destination and these are all
0:37:07	utterances with the labeled intent of flight
0:37:10	now vertex twelve it's very close by
0:37:14	and
0:37:15	vertex twelve itself the constituent utterance is flights to baltimore
0:37:20	so just having the destination
0:37:23	and
0:37:24	we also want to look at vertex ten and eleven so let's
0:37:28	go to the next page
0:37:29	now vertex
0:37:30	ten here in green
0:37:32	the other
0:37:34	utterances and if you look at the constituent utterances you can see that they're
0:37:39	all questions of what is an abbreviation
0:37:43	and then vertex eleven so the nearest neighbors of vertex eleven
0:37:49	basically all capture show me
0:37:51	show me some flights
0:37:53	okay so
0:37:54	you can see
0:37:55	that the vertices generally together with their nearest neighbors capture some core
0:38:01	semantic patterns
0:38:02	now
0:38:03	for the convex polytope we don't have any control on the number of
0:38:08	vertices and it's usually unknown until you actually run the algorithm
0:38:13	so if you want to
0:38:15	control the number of vertices we can use
0:38:18	a simplex
0:38:20	and here again
0:38:22	we want to plot in two d two dimensions so we chose a simplex
0:38:26	with three vertices so if we want to constrain it to
0:38:30	three vertices we can use
0:38:32	the sequential quadratic programming algorithm
0:38:35	to come up with the minimum volume simplex
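
The sketch below does not implement sequential quadratic programming. It only illustrates, under assumed toy data, the two quantities such an optimisation would trade off in two dimensions: the simplex must enclose every utterance point, and among enclosing simplices the one with the smallest volume (here, triangle area) is preferred. All points, candidate triangles and function names are hypothetical.

```python
def triangle_area(v1, v2, v3):
    """Area of a 2-D simplex (triangle)."""
    return abs((v2[0] - v1[0]) * (v3[1] - v1[1]) - (v3[0] - v1[0]) * (v2[1] - v1[1])) / 2.0


def weights(p, v1, v2, v3):
    """Barycentric weights of p with respect to the triangle v1, v2, v3."""
    det = (v2[1] - v3[1]) * (v1[0] - v3[0]) + (v3[0] - v2[0]) * (v1[1] - v3[1])
    w1 = ((v2[1] - v3[1]) * (p[0] - v3[0]) + (v3[0] - v2[0]) * (p[1] - v3[1])) / det
    w2 = ((v3[1] - v1[1]) * (p[0] - v3[0]) + (v1[0] - v3[0]) * (p[1] - v3[1])) / det
    return w1, w2, 1.0 - w1 - w2


def encloses(simplex, points, tol=1e-9):
    """True if every point has non-negative barycentric weights, i.e. lies inside the simplex."""
    return all(min(weights(p, *simplex)) >= -tol for p in points)


# Hypothetical utterance points and three hand-picked candidate simplices.
points = [(1, 1), (2, 1), (1, 2), (3, 1)]
small = ((1, 1), (2, 1), (1, 2))        # vertices are data points, but (3, 1) falls outside
tight = ((0, 0), (5, 0), (0, 5))        # encloses everything
loose = ((-5, -5), (15, -5), (-5, 15))  # encloses everything, but wastefully large

candidates = [small, tight, loose]
enclosing = [s for s in candidates if encloses(s, points)]
best = min(enclosing, key=lambda s: triangle_area(*s))
print(best, triangle_area(*best))
```

Note that the preferred simplex has vertices that are not data points, which is consistent with the later remark in the talk that a simplex vertex is no longer guaranteed to be an utterance point.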
0:38:38	so just
0:38:40	for you to recall
0:38:42	this is the normal type convex polytope
0:38:44	so you can see
0:38:45	it has twelve vertices now we want to
0:38:49	constrain the number of vertices to three so we want to
0:38:52	generate a
0:38:54	minimum volume simplex and here is the output of the algorithm
0:38:58	okay so we can now see
0:39:00	we have the
0:39:01	minimum volume simplex with three vertices
0:39:04	and
0:39:05	if you look at this minimum volume simplex vertex one
0:39:08	two and three
0:39:09	and if you compare with the previous normal type
0:39:14	convex polytope so let's look at vertex one of the simplex
0:39:18	and it just corresponds to vertex eleven of the normal type polytope
0:39:23	and it also happens to coincide with an utterance
0:39:27	now if we go to vertex three of the simplex you can see that there's
0:39:32	the light
0:39:33	blue
0:39:34	dot here and that actually corresponds to
0:39:37	vertex
0:39:38	ten
0:39:38	of the normal type polytope so it's very close by
0:39:43	so vertex
0:39:44	three of the simplex is very close to vertex ten of the normal type
0:39:50	polytope
0:39:51	now what about
0:39:52	all these vertices from one to nine and also vertex twelve
0:39:56	these are all
0:39:58	we grouped into
0:40:00	into here
0:40:02	and we have a little bit by
0:40:04	extending vertex two
0:40:06	so you can see that this is actually the minimum
0:40:09	volume simplex it's now encompassing all the utterances we're no longer guaranteed
0:40:14	that the vertex itself is an utterance point but
0:40:18	we have only three vertices and the resulting
0:40:21	minimum volume simplex is formed by extrapolating the three lines
0:40:26	joining the previous
0:40:27	normal type bounding convex hull the vertices from that convex hull
0:40:32	including v ten
0:40:34	v ten to v eleven v eleven to v twelve
0:40:37	and then v eight and nine in the three lines
0:40:41	now
0:40:42	we can also look at
0:40:44	for this minimum volume simplex for each vertex we can look at it further so
0:40:49	for example
0:40:50	the first vertex
0:40:53	you can look at its
0:40:54	nearest neighbors and here is the list of the utterances
0:40:58	that corresponds to each point
0:41:01	in the nearest neighbor group and they all have the pattern of show me
0:41:06	some flights from someplace to someplace show me flights so that's a semantic pattern
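
Inspecting a vertex through its nearest neighbours can be sketched as a simple distance ranking. The utterance texts, the two-dimensional coordinates and the vertex position below are invented for illustration, and Euclidean distance is an assumption, since the talk does not name the distance measure.

```python
import math


def nearest_neighbours(vertex, utterances, k=3):
    """Return the texts of the k utterance points closest to the vertex (Euclidean distance)."""
    ranked = sorted(utterances, key=lambda item: math.dist(vertex, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedded point) pairs; not taken from the real corpus.
utterances = [
    ("show me flights from boston to denver", (0.90, 0.10)),
    ("show me flights to baltimore", (0.80, 0.20)),
    ("what is fare code q", (0.10, 0.90)),
    ("show me the flights from dallas", (0.85, 0.05)),
]
vertex = (1.0, 0.0)

neighbours = nearest_neighbours(vertex, utterances, k=2)
print(neighbours)
```

Reading the neighbour list side by side is what lets a shared pattern such as "show me ... flights" become visible.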
0:41:11	now let's look at
0:41:13	vertex two
0:41:15	so this is where you can see the patterns are from an origin to
0:41:20	a destination
0:41:21	for every vertex
0:41:23	because it's also residing in
0:41:25	the m dimensional space so the
0:41:29	coordinates can actually show us what are the top words the strongest words that are
0:41:32	most representative of the vertex
0:41:34	so you can also see
0:41:36	the list of ten top words for the vertex coordinates of each v
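
Reading off the top words of a vertex amounts to sorting its coordinates by weight. The sketch assumes the vertex has already been expressed in a space with one coordinate per vocabulary word; the vocabulary and the weights are made up for illustration.

```python
def top_words(vertex, vocabulary, k=10):
    """Return the k words whose coordinates carry the largest weight in the vertex."""
    ranked = sorted(zip(vocabulary, vertex), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical vocabulary and vertex coordinates (one weight per word).
vocabulary = ["show", "me", "flights", "from", "to", "abbreviation"]
vertex = [0.90, 0.80, 0.50, 0.10, 0.05, 0.00]

strongest = top_words(vertex, vocabulary, k=3)
print(strongest)
```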
0:41:41	now let's look at v three
0:41:44	v three and its nearest neighbors are shown here and it's mostly
0:41:48	about what is
0:41:50	followed by an abbreviation
0:41:51	okay so the minimum volume simplex actually allows us to pick
0:41:57	the number of vertices that we want to use and also shows some
0:42:01	of the semantic patterns
0:42:02	that are captured
0:42:04	and we picked three because we wanna be able to plot it
0:42:07	in fact we can pick any arbitrary number for higher dimensions
0:42:12	so
0:42:13	we can examine at a higher dimensionality the semantic patterns
0:42:17	by analysing the nearest neighbors and also the top words of the vertices
0:42:21	so for example we ran
0:42:23	one with sixteen dimensions
0:42:25	so we end up with seventeen vertices
0:42:27	and i list the
0:42:28	first ten here
0:42:30	followed by the next
0:42:31	seven so seventeen altogether
0:42:33	and then here are the top words for each vertex and also the representative nearest
0:42:38	neighbor
0:42:40	so you can see that
0:42:42	for example vertex four
0:42:44	it's capturing the semantic pattern show me something
0:42:48	and number x
0:42:50	from someplace to someplace
0:42:52	vertex
0:42:52	eight
0:42:53	what does
0:42:54	some abbreviation mean
0:42:56	and vertex nine
0:42:58	asking about ground transportation
0:43:01	we also have vertices one
0:43:03	two
0:43:06	five which are
0:43:08	really
0:43:11	related to locations
0:43:12	and i think
0:43:13	that's perhaps due to data sparsity
0:43:17	and also vertex three
0:43:19	it's about can i get something i would like something
0:43:23	and vertex
0:43:24	so then
0:43:25	it's really a bunch of
0:43:27	frequently occurring words and i guess
0:43:29	now if we look at the next set of vertices
0:43:32	vertex
0:43:33	thirteen it's
0:43:35	about flights from someplace
0:43:37	maybe to someplace as well
0:43:39	fourteen is what is something
0:43:41	sixteen is list all
0:43:43	something and again vertices eleven
0:43:47	fifteen and seventeen are location names
0:43:51	vertex twelve
0:43:53	is an airline
0:43:54	name
0:43:55	actually about either a date or an airline so i think this is the
0:43:59	case where
0:44:00	we may have been
0:44:02	too aggressive in reducing the subspace dimensions
0:44:05	and i think if we run this
0:44:08	same experiment with more dimensions hopefully it will
0:44:11	separate the date from the airline
0:44:14	so basically we're just playing around with this convex polytopic model as a
0:44:22	tool for exploratory data analysis
0:44:25	and
0:44:26	i like the geometric nature because it helps me interpret the semantic patterns
0:44:31	and my hope is to extend this
0:44:34	from
0:44:34	semantic pattern extraction to tracking dialog states in the future
0:44:39	so that's section four
0:44:41	and now
0:44:42	section five
0:44:44	the last section which is on
0:44:46	affective design
0:44:47	for conversational agents
0:44:49	modeling user emotion changes in a dialogue
0:44:51	this is actually the phd work of runnan li
0:44:54	a phd student from tsinghua university
0:44:57	who also interned
0:44:59	in our lab in hong kong for a couple of summers and the direct supervisor is
0:45:05	professor zhiyong wu of tsinghua university
0:45:07	and this work is conducted in the tsinghua
0:45:11	chinese university joint research center for media sciences technologies and systems
0:45:15	which is in shenzhen
0:45:16	and it is funded by the
0:45:18	national
0:45:19	natural science foundation of china
0:45:21	hong kong research grants council joint research scheme
0:45:25	so
0:45:26	our long-term goal is to impart affect
0:45:29	sensitivity
0:45:31	into conversational agents
0:45:32	which is important for user engagement and also for supporting
0:45:36	socially intelligent conversations
0:45:39	so
0:45:40	this work looks at inferring users' emotion changes
0:45:44	our main assumption is that emotive state change is related to the user's emotive state
0:45:50	in the current
0:45:51	dialogue turn and also the corresponding system response
0:45:56	so the objective is to infer the user's emotion state
0:46:00	and also the emotive state change
0:46:02	which can in the future inform the generation of the system response
0:46:09	we use the p a d model the pleasure arousal dominance framework for describing
0:46:14	emotions in a three dimensional continuous space
0:46:18	so pleasure is more about positive and negative emotions arousal is about mental
0:46:24	alertness and dominance is more about control
0:46:28	so this is a real dialogue which is originally in chinese and again i
0:46:32	i have translated into english here for presentation
0:46:35	so this is a dialogue between a chatbot and the user
0:46:39	and
0:46:40	we have
0:46:42	annotated the p a d values
0:46:44	for each dialogue turn
0:46:45	so you can see for example in dialogue turn two
0:46:50	the user said he broke up with me and the response from the system
0:46:53	is let it go you deserve a better one and you see that from
0:46:57	that dialogue turn the values of p a and d all
0:47:00	increase
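
The emotive state change between two turns can be pictured as the difference between two points in the three-dimensional pleasure arousal dominance space. The numbers below are invented for illustration and are not the annotated values from the slide.

```python
def pad_change(previous, current):
    """Per-dimension change (pleasure, arousal, dominance) between two dialogue turns."""
    return tuple(round(c - p, 6) for p, c in zip(previous, current))


# Hypothetical annotations for two consecutive user turns, each as (P, A, D).
turn_before = (-0.6, -0.2, -0.4)
turn_after = (-0.3, 0.1, -0.1)

delta = pad_change(turn_before, turn_after)
all_increase = all(d > 0 for d in delta)
print(delta, all_increase)
```

A positive value in every dimension corresponds to the case described in the talk, where pleasure, arousal and dominance all rise after the system response.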
0:47:02	and
0:47:03	and then
0:47:04	for example in dialogue turn eight
0:47:07	the user just said
0:47:08	actually
0:47:10	and the systems that use get me
0:47:12	would seem to amuse the user
0:47:14	and also softened the dominance
0:47:16	the value of the dominance
0:47:18	so these are the values that we work with in the p a d space and this
0:47:22	is our approach towards inferring emotive state change
0:47:27	on the left it's the speech input on the right is the output of emotion
0:47:31	recognition
0:47:32	and the prediction of emotive state change
0:47:35	now we start by integrating the acoustic and lexical features
0:47:39	from the speech input
0:47:41	and
0:47:42	this is basically a multimodal fusion problem
0:47:45	and it is achieved by concatenating the features and then applying the
0:47:50	multitask learning convolutional
0:47:52	fusion auto-encoder
0:47:54	so it goes through different layers of convolution
0:47:57	and
0:47:58	also max pooling
0:48:01	and
0:48:02	then we also
0:48:05	capture the system response as a whole utterance
0:48:08	and it is
0:48:09	this is because the holistic message is received by the user and the entire message
0:48:13	plays a role in influencing the users emotions
0:48:17	now the system response encoding uses a long short-term memory recurrent auto-encoder
0:48:23	and it is trained to map the system response into a sentence level vector
0:48:27	representation
0:48:30	next the user's input
0:48:32	and the system's response are further
0:48:34	combined using convolutional fusion
0:48:37	and
0:48:38	the framework
0:48:39	then performs emotion recognition using stacked hidden layers
0:48:43	stacked hidden layers and the results will be
0:48:46	further used for inferring emotive state change
0:48:49	and for this we use a multitask learning structured output layer
0:48:54	so that the dependency between the emotive state change
0:48:57	and the
0:48:59	emotion recognition output is captured
0:49:02	so in other words the emotive state change is conditioned on the recognized
0:49:06	emotion state of the current query
0:49:10	now the experimentation is done on iemocap which is a corpus very
0:49:14	widely used
0:49:15	in emotion recognition
0:49:17	and also the sogou voice assistant corpus so the sogou voice assistant
0:49:22	corpus has over four million putonghua utterances in
0:49:27	three domains
0:49:28	it is transcribed by an asr engine with five point five percent word error rate
0:49:32	now we actually look at the chat dialogues
0:49:36	and
0:49:36	there are
0:49:37	ninety eight thousand such conversations with between four and forty nine turns but we use
0:49:43	a pre-trained
0:49:45	emotion dnn to filter out the
0:49:48	the
0:49:49	neutral
0:49:50	dialogues
0:49:51	a neutral conversations so we ended up with about nine thousand
0:49:55	emotive conversations
0:49:56	with over fifty two thousand utterances which are selected for labeling
0:50:01	so labeling the p a d values
0:50:03	and then we run the emotion recognition and also the emotion state change
0:50:09	prediction
0:50:10	so we use a whole suite of evaluation criteria on both predicted emotive states
0:50:17	in p a d values and also the emotive state changes in p a d values
0:50:21	the unweighted accuracy
0:50:24	the mean accuracy of different emotion categories
0:50:26	the mean absolute error and also the concordance correlation coefficient
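
The three criteria named above can be sketched in plain Python. Unweighted accuracy is taken here as the mean of per-class recall, which matches the description "mean accuracy of different emotion categories"; the concordance correlation coefficient uses population variance and covariance. The labels and values are toy data, not results from the talk.

```python
def unweighted_accuracy(true_labels, predicted_labels):
    """Mean of per-class recall, so every emotion category counts equally."""
    recalls = []
    for label in sorted(set(true_labels)):
        indices = [i for i, t in enumerate(true_labels) if t == label]
        hits = sum(1 for i in indices if predicted_labels[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


def mean_absolute_error(truth, predicted):
    """Average absolute difference between annotated and predicted values."""
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


def concordance_cc(truth, predicted):
    """Concordance correlation coefficient: 2*cov / (var_t + var_p + (mean_t - mean_p)^2)."""
    n = len(truth)
    mean_t, mean_p = sum(truth) / n, sum(predicted) / n
    var_t = sum((t - mean_t) ** 2 for t in truth) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, predicted)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Toy data for illustration only.
ua = unweighted_accuracy(
    ["pos", "pos", "neg", "neg", "neg", "neg"],
    ["pos", "neg", "neg", "neg", "neg", "neg"],
)
mae = mean_absolute_error([0.1, 0.5, -0.2], [0.2, 0.5, -0.4])
ccc_perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
ccc_shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(ua, mae, ccc_perfect, ccc_shifted)
```

The shifted example shows why the concordance coefficient is stricter than plain correlation: a constant offset leaves correlation at one but lowers the concordance score.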
0:50:31	now
0:50:32	this is a
0:50:33	benchmark against other recent work using other methods
0:50:37	and for iemocap and also for the sogou datasets
0:50:44	the proposed approach
0:50:45	actually achieves competitive performance
0:50:48	in emotion recognition
0:50:50	now in emotion
0:50:52	change prediction actually
0:50:54	our proposed approach achieves a significantly better performance than the other approaches
0:51:00	but there's still room for improvement if you compare with
0:51:03	a human performance in human annotation
0:51:07	so to sum up this is among the first efforts to analyze
0:51:11	user input features
0:51:13	both acoustical and lexical features
0:51:15	together with the system response to understand how the user emotion changes
0:51:21	due to the system response in the dialogue
0:51:24	and we have achieved competitive performance in emotive state change prediction
0:51:29	and we believe that this is a very important step
0:51:33	towards having socially intelligent virtual assistants
0:51:38	with the incorporation of affect sensitivity for human computer interaction
0:51:44	so
0:51:45	so my talk is in five chunks but this is the overall summary
0:51:49	basically
0:51:51	when i look back at all these different projects
0:51:54	you know it really
0:51:57	drives home the message that
0:51:58	much can be gleaned
0:52:00	from dialogues
0:52:01	to understand many important phenomena including
0:52:04	how group discussions may facilitate learning
0:52:07	or student group discussions may facilitate learning
0:52:10	how the customer experience can be shaped by chatbot responses and also the status
0:52:15	of an individual's cognitive health
0:52:17	and i guess i'm preaching to the choir here but i really truly believe there's
0:52:21	tremendous potential
0:52:23	we've only seen
0:52:24	the tip of an iceberg
0:52:25	and there's tremendous potential with abundant opportunities and a lot of research so thank you very
0:52:30	much
0:52:38	thank you very much do we have questions
0:52:47	thank you very much for the great talk regarding topic three cognitive impairment so
0:52:52	we're also working on that but still
0:52:55	so the heavy cognitive impairment of people is easy to detect because with just a
0:53:01	small conversation we can identify that this guy has cognitive impairment
0:53:06	but i think the problem is the mild cognitive impairment m c i which is
0:53:14	very difficult to detect
0:53:16	so i think so the final goal of this work may be how to estimate the
0:53:22	degree of cognitive impairment using features so what do you think
0:53:29	so thank you very much for the question
0:53:32	indeed
0:53:34	in our study we will be covering
0:53:38	cognitively normal adults and also what they now call
0:53:44	minor n c d that's the new terminology
0:53:49	it's
0:53:49	minor n c d minor small
0:53:52	neurocognitive disorder
0:53:54	and major big
0:53:56	neurocognitive disorder
0:53:58	and
0:53:59	so this is what i learnt from our colleagues in neurology so
0:54:06	for elderly people we need to be more diligent in engaging them in these
0:54:14	cognitive assessments because they're really exercises and there are subjective fluctuations going from one
0:54:23	exercise to another so therefore the more frequently you can
0:54:28	take the assessment the better
0:54:29	and
0:54:31	and the issue is not an absolute scoring so
0:54:35	it's more the personal level and if there are any sudden changes perhaps more
0:54:41	drastic changes
0:54:43	in the
0:54:44	scoring level of the individual that is off
0:54:48	that would be an important
0:54:50	sign
0:54:51	and
0:54:53	and also tracking
0:54:55	frequently is important
0:54:57	so sometimes minor n c d or mild cognitive impairment is harder
0:55:03	to detect and also you have to work
0:55:08	against sort of the natural cognitive decline due to ageing and the pathological cognitive decline
0:55:15	so it's a complex problem but nevertheless because
0:55:21	dementia is such a big problem and people talk about
0:55:25	the dementia tsunami of the ageing global population
0:55:30	and there's no cure
0:55:31	so we just have to work very hard on how to do early
0:55:37	early detection and intervention thank you for the
0:55:41	question
0:55:46	thank you for this very nice talk many topics really impressive i was wondering especially
0:55:52	in relation to the classrooms and to the cognitive screening
0:55:57	at the moment as i understood you're
0:55:59	working on transcriptions right on the basis of transcriptions have you made any experiments
0:56:04	with asr and if so what was your experience there what's the likelihood
0:56:10	of being sufficiently good
0:56:12	so the
0:56:14	the classroom
0:56:16	it is very difficult
0:56:18	that's why we have to
0:56:19	we have no choice but to work on transcriptions
0:56:22	but so for
0:56:24	the
0:56:26	the
0:56:27	the way we have recorded these neuropsychological tests
0:56:32	it's actually between a clinician and the subject
0:56:35	so the clinicians said that they don't want any sensors
0:56:39	so we just put a phone there
0:56:41	and with the consent of the subject of course
0:56:43	and
0:56:44	depending on the device some of it we think is doable
0:56:48	but we went to have a response on
0:56:51	speaker adaptive training and noise robust
0:56:55	speech processing we
0:56:56	we need to throw in the kitchen sink to be able to do
0:57:00	well
0:57:12	thanks for a great talk
0:57:14	is
0:57:15	on the cognitive assessment from a discourse structure point of view actually i was wondering
0:57:22	what sort of processing now you plan to do on those descriptions that they provide
0:57:27	apart from you know speech processing and lexical cohesion any thoughts about
0:57:35	discourse coherence rhetorical relations
0:57:39	among the sentences that they provide and so on
0:57:42	so thank you for that wonderful question we must look at that
0:57:45	we must look at that we haven't looked at that yet but actually i have
0:57:51	heard from our you know our colleagues the clinicians that coherence in
0:57:56	following the
0:57:59	discourse of a dialog oftentimes shows problems
0:58:03	if there's cognitive impairment so that is definitely
0:58:06	one aspect that we must
0:58:09	and in fact we would welcome any
0:58:11	interested collaborators to look at that together
0:58:14	thank you for the question
0:58:20	thanks for the very interesting talk i wanted to ask about
0:58:26	the emotion modeling the p a d space modeling is that just based on speech input
0:58:32	or are you also using it to analyse things like
0:58:37	nonverbal signals like laughter or sighing little things like that
0:58:43	right now we don't have that it would be wonderful if we can have
0:58:46	those features but right now it's really the speech input so acoustics and lexical input
0:58:52	and also the sentence level of the system's response
0:59:03	hi my question is about section five
0:59:07	so you did two prediction tasks you did emotion recognition and the emotive change prediction
0:59:13	so even though these seem similar i think there is a subtle but important difference
0:59:17	between the two
0:59:19	so my question is
0:59:21	do you use the same features to do both or do you think there are
0:59:26	features that are more important for the emotive change rather than the emotion recognition
0:59:30	and
0:59:32	what difference have you seen
0:59:34	between these two
0:59:36	so great question so we think that
0:59:41	for the current query
0:59:42	based on the current user input we want to be able to
0:59:46	understand the emotion of the user
0:59:49	but if you think about
0:59:51	what comes next so depending on how to respond
0:59:54	to the user
0:59:56	the system response the user's emotion in the next
1:00:00	input
1:00:01	may be different
1:00:03	right so for example
1:00:05	in be
1:00:06	in the
1:00:15	so here this is a subject talking about a breakup
1:00:22	and
1:00:23	at first the system tries to
1:00:26	comfort the subject and then at some point you know the
1:00:31	the dialogue goes
1:00:38	i in timit assistive so are you real or not how can robot's no you
1:00:42	like
1:00:43	i know what you like as i do it should be
1:00:46	and then
1:00:46	the user says something
1:00:49	and at this point it sort of like a in this i at this point
1:00:52	of the dialogue you can you can respond in various ways but the talk about
1:00:57	that all used here
1:00:58	and then it seems that
1:01:02	a and then the user says you must be real so i think
1:01:06	the emotive state changes depend on the system response
1:01:09	so if we can
1:01:11	model that
1:01:12	and the way we've modeled that is to
1:01:15	use
1:01:17	multitask training where the
1:01:19	emotive state change
1:01:22	is dependent on the
1:01:24	recognized emotion
1:01:26	we want to be able to capture this dependency
1:01:29	and
1:01:30	in
1:01:31	and to be able to utilize this
1:01:34	dependency as we choose how to
1:01:37	in the future choose how to
1:01:39	respond how to generate the system response so that you can hopefully guide the
1:01:44	emotion change in the dialogue
1:01:47	in the way you

The Many Facets of Dialog

Keynotes

Helen Meng (Chinese University of Hong Kong, China)