Speech Transcript - Cross-Corpus Data Augmentation for Acoustic Addressee Detection

0:00:18	alright welcome to the second session on acoustics we will
0:00:24	follow this immediately with the sponsors session and then the
0:00:28	banquet dinner our first speaker
0:00:30	is oleg akhtiamov
0:00:35	thank you
0:00:49	okay it's not all okay
0:00:51	okay sorry
0:00:53	hello everyone welcome to my talk my name is oleg akhtiamov
0:00:57	and that you might be
0:01:00	is not better or
0:01:05	sound check
0:01:06	okay that's good
0:01:08	things
0:01:09	well welcome to my talk so
0:01:14	today i'd like to present a study that i conducted together with my colleagues
0:01:19	ingo siegert alexey karpov and wolfgang minker i'd also like to thank them
0:01:23	without them it would be impossible to conduct this research
0:01:27	and so the addressee problem as you can probably guess this topic is
0:01:33	related to the big problem introduced by dan bohus
0:01:40	at the beginning of our conference today
0:01:43	so it's also about situated
0:01:46	interaction and multi party interaction
0:01:49	so
0:01:51	the title is cross corpus data augmentation for acoustic addressee detection
0:01:56	first of all i'd like to
0:01:58	clarify what addressee detection actually is
0:02:01	so it's a common trend that modern spoken dialogue systems are getting
0:02:07	more adaptive and human like
0:02:09	now they are able to
0:02:12	interact with multiple users under realistic conditions in the real physical world
0:02:18	and's
0:02:21	sorry
0:02:25	so
0:02:26	it may happen that
0:02:29	not a single user addresses the system but a group of users and this
0:02:33	is exactly the place where the addressee detection
0:02:35	where
0:02:36	this problem arises it appears in conversations between
0:02:43	a technical system and a group of users
0:02:45	and
0:02:46	we're going to call this kind of
0:02:49	interaction human machine
0:02:50	conversation and here we have
0:02:53	a realistic example from our data
0:02:56	so
0:02:58	the sds
0:02:59	in such a mixed kind of interaction is supposed
0:03:03	to distinguish between human and computer directed utterances
0:03:07	that means solving a binary
0:03:09	classification problem in order to maintain efficient conversations in a realistic manner
0:03:15	it's important that
0:03:18	the system is not supposed to give a direct answer to
0:03:22	human directed utterances
0:03:25	because otherwise it would interrupt the dialogue flow between the two human participants
0:03:34	well
0:03:35	a similar problem arises in conversations between several adults and a child
0:03:41	and similarly to
0:03:43	human machine addressee detection we call this problem adult child addressee detection
0:03:47	and here we have again
0:03:49	a realistic example of how
0:03:52	not to educate your children with smart phones
0:03:59	yes and again in this case the system is supposed to distinguish between adult
0:04:04	and child directed utterances produced by adults
0:04:07	and this also means
0:04:10	a binary classification problem
0:04:12	and this functionality may be useful for a system performing
0:04:17	child development monitoring
0:04:21	namely let's assume that the less distinguishable the child and adult directed acoustic patterns are
0:04:27	the bigger progress the child has made in maintaining social interactions and
0:04:33	in particular in maintaining
0:04:36	spoken conversations
0:04:39	so
0:04:41	now
0:04:43	let's find out if
0:04:45	these two addressee detection problems have anything in common
0:04:51	first of all we need to answer the question how we address other people in
0:04:55	real life
0:04:56	the simplest way to do this is just
0:04:59	by name so called wake word okay google or okay alexa or
0:05:04	something like this
0:05:06	then
0:05:08	we can do the same thing implicitly by using for example gaze
0:05:12	i'm looking at whom i'm talking to
0:05:15	then some contextual markers like specific topics or
0:05:19	specialist a convenience
0:05:21	and
0:05:23	the
0:05:24	the last option is to
0:05:26	modify the acoustic speaking style and prosody
0:05:29	and the present study is focused
0:05:32	exactly on the
0:05:35	last way
0:05:36	on the latter way of
0:05:38	addressing
0:05:40	subjects in a conversation
0:05:44	so the
0:05:46	the idea behind acoustic addressee detection is that people tend to change their manner of
0:05:51	speech depending on whom they are talking to
0:05:53	for example we may face some special addressees such as hard of hearing people
0:05:58	elderly people
0:06:00	children or spoken dialogue systems
0:06:03	that in our opinion might have some communication difficulties
0:06:07	and talking to such addressees we intentionally
0:06:12	we intentionally modify our manner of speech making it more
0:06:16	rhythmical loud and generally more understandable since we do not
0:06:20	perceive them as adequate conversational agents
0:06:23	and the main assumption that we make here is that human directed speech
0:06:31	is supposed to be
0:06:32	similar to adult directed speech
0:06:36	well
0:06:43	and
0:06:45	in the same way machine directed speech must be quite similar
0:06:48	to child directed speech
0:06:54	in our experiments we use a
0:06:56	relatively simple and yet efficient data augmentation approach called mixup mixup encourages a
0:07:02	model to behave linearly in the space between seen data points and it
0:07:08	already has quite many applications in
0:07:11	asr in
0:07:13	image recognition and
0:07:14	many other
0:07:16	popular fields
0:07:18	basically mixup generates artificial examples
0:07:21	as linear combinations
0:07:24	of two random real feature and label vectors taken with the coefficient lambda
0:07:31	and this lambda is a real number randomly generated
0:07:36	from a beta distribution
0:07:37	specified as follows by its only parameter alpha so technically alpha lies
0:07:44	within the interval from zero to infinity
0:07:47	but according to our experiments
0:07:50	alpha values higher than one
0:07:54	already lead to
0:07:55	underfitting
0:07:58	and in our opinion the most reasonable interval to vary
0:08:02	this parameter is from zero to one
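The mixing step described above can be sketched in a few lines. This is a minimal illustration rather than the speaker's implementation: the function name `mixup_pair`, the default `alpha=0.5`, and the use of plain Python lists with one-hot labels are assumptions made for the example; only the Beta(alpha, alpha) draw and the linear combination come from the talk.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.5, rng=random):
    """Blend two (feature vector, one-hot label) examples with a single lambda.

    alpha is the only parameter of the Beta(alpha, alpha) distribution;
    the default of 0.5 is an assumed value inside the (0, 1] range
    that the talk recommends.
    """
    lam = rng.betavariate(alpha, alpha)  # lambda lies in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam
```

With small alpha the drawn lambda tends to sit near 0 or 1, so the artificial example stays close to one of the two real ones; larger alpha pushes lambda toward 0.5 and blends them more strongly.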
0:08:07	so
0:08:07	the next question is how many examples to generate and here
0:08:12	let's imagine that we just merge
0:08:15	c
0:08:16	different datasets without applying any data augmentation just put them together
0:08:21	so we generate one batch
0:08:24	from each dataset
0:08:25	and it means that we increase the initial amount of training data in the
0:08:30	target corpus c times
0:08:33	but if we simultaneously apply mixup
0:08:35	so we generate
0:08:37	along with
0:08:38	these c batches we generate also
0:08:43	k
0:08:45	examples
0:08:46	i'd
0:08:49	k artificial examples from each real example
0:08:52	increasing the amount of training data
0:08:55	c multiplied by k plus one times
0:08:59	and it's important to note that the artificial examples are generated
0:09:02	or
0:09:03	batch wise on the fly without any significant delays in the training process so we
0:09:07	just
0:09:07	do it on the go
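The bookkeeping above (c merged corpora, k artificial examples per real example, a growth factor of c times k plus one) can be sketched as follows. The names `augment_batch` and `growth_factor` are hypothetical, and drawing the mixing partner at random from the same batch is an assumption; the talk does not say how partners are chosen.

```python
import random

def augment_batch(batch, k, alpha=0.5, rng=random):
    """Return the real examples plus k mixup examples per real example.

    batch is a list of (features, one_hot_label) pairs. The partner for
    each mix is drawn at random from the same batch (an assumption).
    """
    out = list(batch)
    for x_i, y_i in batch:
        for _ in range(k):
            x_j, y_j = rng.choice(batch)
            lam = rng.betavariate(alpha, alpha)
            x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
            y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
            out.append((x, y))
    return out

def growth_factor(num_corpora, k):
    """Training data grows c times from merging and (k + 1) times from mixup."""
    return num_corpora * (k + 1)
```

Because the artificial examples are produced per batch during training, nothing has to be stored up front; for three corpora and k equal to two, the sketch gives a ninefold increase.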
0:09:11	here you can see
0:09:14	the models that we used to
0:09:17	two
0:09:19	that we used to solve our problem
0:09:23	and they are arranged according to their complexity from
0:09:26	left to right
0:09:29	well the first model is a simple
0:09:32	linear svm
0:09:34	using the compare functionals as the input so this is a pretty popular feature set
0:09:40	in the area of emotion recognition it was introduced at interspeech two thousand and thirteen
0:09:46	i guess
0:09:47	yes so these features are extracted from the whole utterance
0:09:52	next we apply
0:09:55	the lld model
0:09:57	that includes a recurrent neural network with long short-term memory
0:10:02	and so
0:10:03	it receives the llds which were also used to compute the
0:10:08	compare functionals for the first model
0:10:12	and in contrast to
0:10:14	the functionals the llds have
0:10:17	a time continuous nature
0:10:20	so it's time continuous signal
0:10:22	and the last model is an end-to-end model performing raw signal
0:10:28	processing so
0:10:30	it receives just the
0:10:33	raw audio utterance that passes through a stack of convolutional input layers and after the
0:10:39	convolutional component there is the
0:10:41	same long short-term memory network
0:10:43	that was introduced within the previous model
0:10:47	yes and
0:10:49	as the reference point for the convolutional component we took
0:10:53	taking
0:10:53	the five-layer soundnet architecture and slightly modified it for our needs mainly we reduced
0:10:58	its dimensionality
0:11:00	so by reducing the number of filters in each layer according to the
0:11:06	amount of data that we have at our disposal and we also reduced the kernel
0:11:11	sizes in this paper according to the dimensionality of the signal that we have
0:11:20	well
0:11:21	here you can see the data that we have at our disposal we
0:11:24	we have two datasets for modeling
0:11:27	human machine addressee detection namely smart video corpus that contains interactions between a user a confederate
0:11:34	and a mobile sds
0:11:35	and by the way this is the only corpus that
0:11:38	that was
0:11:40	modelled like
0:11:42	with a wizard-of-oz setting
0:11:46	the next
0:11:47	corpus
0:11:48	is voice assistant conversation corpus that contains
0:11:51	similarly to svc it contains
0:11:54	interactions between a user a confederate and an amazon alexa echo dot this data
0:11:58	is real
0:12:00	without any wizard-of-oz simulation
0:12:03	and
0:12:04	the third corpus is home bank that includes conversations between an adult another adult
0:12:10	and a child
0:12:12	we tried to use the same splittings into training development and test sets
0:12:18	that
0:12:20	were introduced in the
0:12:21	original studies published by the authors of the corpora
0:12:25	and they turned out to be approximately the same in proportion so
0:12:32	train development and test have approximately the proportion of five by one by
0:12:36	four
0:12:40	first we conducted some preliminary analysis with the linear model the func model we perform
0:12:47	feature selection by means of recursive feature elimination
0:12:51	we just exclude a small portion of the
0:12:54	compare features with the lowest svm weights
0:12:57	and then we measure the performance
0:12:59	of the
0:13:01	reduced feature set in terms of unweighted average recall
0:13:04	and a feature set is considered to be optimal
0:13:07	if further
0:13:08	dimensionality reduction leads to a significant information loss
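The elimination loop described above can be sketched generically. Everything model-specific is passed in as a callable, so the names `fit_weights` and `evaluate`, the `drop_fraction` of 0.1 and the `tolerance` of 0.01 are all assumptions for illustration; in the talk the weights come from a linear SVM and the score is unweighted average recall, but the exact portion removed and the threshold for a significant loss are not stated.

```python
def recursive_feature_elimination(features, fit_weights, evaluate,
                                  drop_fraction=0.1, tolerance=0.01):
    """Drop the lowest-weight features until the score degrades noticeably.

    fit_weights(features) -> dict mapping each feature to a model weight
                             (hypothetical hook, e.g. linear SVM coefficients)
    evaluate(features)    -> validation score for that feature subset
                             (hypothetical hook, e.g. UAR on a development set)
    """
    current = list(features)
    best_score = evaluate(current)
    while len(current) > 1:
        weights = fit_weights(current)
        n_drop = max(1, int(len(current) * drop_fraction))
        ranked = sorted(current, key=lambda f: abs(weights[f]))
        dropped = set(ranked[:n_drop])
        candidate = [f for f in current if f not in dropped]
        score = evaluate(candidate)
        if score < best_score - tolerance:
            break  # further reduction loses too much information
        current, best_score = candidate, max(best_score, score)
    return current, best_score
```

The subset returned is the last one before the score fell by more than the tolerance, which is one way to read "optimal" in the sense used in the talk.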
0:13:13	and here in this figure we see that the
0:13:18	optimal feature sets
0:13:20	vary significantly
0:13:22	and it's also very interesting that the size of the optimal feature set on
0:13:26	svc is much greater than on the other two so it may be explained
0:13:30	by the
0:13:31	wizard-of-oz modelling probably
0:13:34	some of the participants
0:13:35	didn't really believe that they were interacting with a real technical system
0:13:39	and this issue resulted in
0:13:43	mm slightly a acoustic the basic buttons
0:13:47	well another
0:13:50	sequence of experiments that we conducted is loco and inverse loco experiments
0:13:54	loco means leave one corpus out everyone knows what it means and inverse loco
0:14:00	is just that we train our model on one corpus and test on
0:14:06	each of the other corpora separately
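The two evaluation protocols named above differ only in which corpora go into training. A minimal sketch, assuming the corpora are referred to by short labels (the labels `SVC`, `VACC` and `HB` in the usage note are placeholders for the three corpora in the talk, and both function names are hypothetical):

```python
def loco_splits(corpora):
    """Leave one corpus out: train on all the others, test on the held-out one."""
    return [([c for c in corpora if c != held_out], held_out)
            for held_out in corpora]

def inverse_loco_splits(corpora):
    """Inverse LOCO: train on one corpus, test on each other corpus separately."""
    return [(train, test)
            for train in corpora
            for test in corpora
            if test != train]
```

For `["SVC", "VACC", "HB"]` this gives three LOCO splits, each training on two corpora, and six inverse LOCO train/test pairs.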
0:14:08	so in this figure there is a pretty clear relation between vacc
0:14:12	and
0:14:13	svc
0:14:14	so it's pretty natural that
0:14:19	these corpora
0:14:21	are perceived as similar by our system because
0:14:24	the domains are pretty close and they are both uttered in german
0:14:28	in contrast to home bank that was uttered in english and as we can see from
0:14:32	this figure
0:14:33	so our
0:14:34	linear model
0:14:37	fails to find any direct relation between
0:14:41	this corpus and the other two
0:14:43	but let's take a look at the
0:14:45	next figure
0:14:47	and here we notice a very interesting trend that
0:14:52	even though
0:14:52	home bank
0:14:55	significantly differs from the other two corpora the linear model trained
0:15:00	on
0:15:02	any two corpora
0:15:05	performs on each of them equally well as if it was trained
0:15:10	on each of the corpora separately and tested on them separately
0:15:14	so it means that
0:15:16	the data sets that we have are
0:15:18	at least not contradictory
0:15:22	so well let's take a look at our experiments with
0:15:27	the lld model on various context lengths applying mixup
0:15:33	and here
0:15:34	in each of the three cases
0:15:36	red green and blue we see that the
0:15:39	dashed line is located above the
0:15:42	the solid one
0:15:43	meaning that mixup results in an additional performance improvement
0:15:50	already
0:15:51	when applied to the same corpus
0:15:53	and
0:15:54	it's also interesting to note that
0:15:58	the context length of two seconds
0:16:01	turns out to be optimal for each of the corpora given
0:16:05	that they have
0:16:07	very different utterance length distributions
0:16:10	so two seconds is sufficient to predict addressees using the acoustic modality
0:16:16	well
0:16:16	unfortunately mixup gives no performance improvement to the end-to-end model probably we just
0:16:21	don't have enough data to provide
0:16:28	so we reproduced the same experiments with
0:16:32	loco and inverse loco on the neural network based models
0:16:35	and so
0:16:37	they both show the same trends
0:16:39	that
0:16:40	svc and vacc seem quite similar to them
0:16:44	and actually the end-to-end model managed to capture
0:16:47	this similarity even better compared to the lld one
0:16:51	but there is an issue with multitask learning
0:16:55	particularly
0:16:56	the issue is that
0:16:58	our neural network
0:17:00	regardless of which one starts overfitting to
0:17:05	the easiest task
0:17:06	with the highest correlation between features and labels and here we can see that the model
0:17:11	trained on any two datasets
0:17:14	starts
0:17:15	like
0:17:15	so the model
0:17:17	completely ignores the home bank
0:17:19	even though it was trained on this corpus
0:17:22	and it also star discriminating
0:17:25	i guess with you dataset but the situation changes if we start applying mixup
0:17:30	over all the corpora
0:17:33	and the model actually starts perceiving
0:17:36	both corpora really efficient
0:17:38	efficiently
0:17:39	as if it was
0:17:41	trained on each of the corpora separately and tested on each of the corpora
0:17:45	separately
0:17:47	again we really but we conduct
0:17:49	we conducted a similar experiment just merging all three
0:17:54	datasets with and without mixup
0:17:57	using all three models
0:17:58	and so here we can see that mixup regularizes both
0:18:02	the lld and end-to-end models and also prevents overfitting to
0:18:05	a specific corpus namely svc with the highest correlation between the features and labels as
0:18:09	it is the easiest task for our system
0:18:13	but unfortunately mixup doesn't provide an improvement for the func model
0:18:18	what
0:18:19	actually
0:18:20	this model
0:18:21	doesn't suffer from overfitting to a specific task and
0:18:24	doesn't need to be regularized
0:18:25	due to its very simple structure
0:18:27	due to its very simple architecture
0:18:30	well the last series of experiments
0:18:33	is experiments with asr based features
0:18:37	the idea behind them is that so
0:18:39	system directed utterances tend to match
0:18:44	the asr
0:18:45	acoustic and language models much better compared to
0:18:48	human addressed utterances
0:18:51	and it's
0:18:52	this definitely works in the human machine setting
0:18:56	but
0:18:57	it seems to be
0:18:58	not working
0:18:59	in the adult child setting and we just analysed the
0:19:03	the data itself so
0:19:06	deep inside and noted that
0:19:09	sometimes addressing children
0:19:12	no
0:19:13	addressing children people don't even use words instead they just use some separate intonations
0:19:20	or sounds without any words and
0:19:23	this causes real problems to our asr meaning that
0:19:27	the
0:19:30	the asr confidence will be equally low for both of the target classes
0:19:34	so
0:19:35	this is the reason why it performs so poorly
0:19:38	at this home bank problem
0:19:41	so here we come to the conclusions and we can conclude that mixup improves
0:19:45	classification performance for models with
0:19:49	predefined features and also
0:19:52	this is less like
0:19:53	and also enables multitask learning abilities
0:19:57	for both end-to-end models and models with predefined feature sets
0:20:03	two second speech fragments
0:20:07	allow us to
0:20:08	capture
0:20:11	addressees with
0:20:12	sufficient quality
0:20:13	and actually the same conclusion was drawn by a group of
0:20:17	researchers regarding the english language
0:20:21	yes and
0:20:22	as i told
0:20:24	a couple of slides before asr confidence is not representative for adult child addressee detection though
0:20:28	it is still useful for human machine addressee detection through our experiments we also
0:20:34	beat a couple of baselines so we introduced the first official baseline for the
0:20:38	vacc corpus and beat the home bank end-to-end baseline
0:20:43	for future directions i would propose extending our experiments applying mixup to two dimensional
0:20:50	spectrograms and to features extracted with the convolutional component
0:20:54	thank you
0:21:01	we have time for some questions
0:21:04	hi a credit when you in c
0:21:08	yes i
0:21:11	i was wondering why you chose to treat adult child interaction like
0:21:17	human machine interaction is there any literature leading to this decision or was it just
0:21:23	sort of an assumption you know it was our assumption without any background
0:21:28	i mean it was like an interesting
0:21:30	assumption an interesting thing to do to prove it or to prove it wrong
0:21:35	yes and so
0:21:36	conceptually
0:21:38	it should be like this so sometimes we perceive a system as an
0:21:44	infant or a person having a lack of communication skills
0:21:48	and that's what we take as the basic assumption for
0:21:55	forums actually simulate conceptually there's do not sitting
0:21:59	conceptually distinct okay this is on one so i put into our experiments a single
0:22:05	i think
0:22:06	yes that's actually they are probably overlap but only partially
0:22:12	but according to our experiments a single system is capable of solving both
0:22:16	tasks simultaneously
0:22:17	it performs far worse on the adult child corpus
0:22:22	yes but because the baseline performance is far worse
0:22:25	i mean the highest baseline on home bank is like
0:22:29	zero point sixty four
0:22:32	or zero point sixty six or something like this
0:22:34	okay
0:22:36	so it's just a matter of the data quality
0:22:48	high and just from a reporter numerous the interesting talk i was wondering
0:22:56	maybe i missed something did you use any language features if not do you
0:23:01	or can you speculate whether there is going to be an impact on the performance
0:23:06	what do you mean by language features i mean like separate words or
0:23:09	for instance if i'm talking to a child i might address the child in a
0:23:14	different way than i address adults
0:23:17	okay well it's a difficult question given that as i told sometimes talking to a
0:23:23	child we don't use real words
0:23:25	this is the problem for language modeling right i mean my hypothesis is
0:23:30	that you would simplify the language you use if you're addressing a child compared to
0:23:35	when you address an adult yes we do we do
0:23:40	my speculation on this would be yes
0:23:43	so we can try to leverage both textual and acoustical
0:23:47	modalities
0:23:48	to solve the same problem yes okay next
0:23:52	for one more
0:23:56	that is common
0:24:00	i just so have you checked
0:24:04	how well you do with respect to the results of the competition
0:24:07	so the same data set was used a similar data set was used as part
0:24:11	of the interspeech compare challenge and the baseline there i think it was
0:24:16	seventy point something
0:24:17	so i'm curious did you look at the majority baseline so are you predicting the
0:24:22	majority class because it is essentially binary class prediction
0:24:25	and so one thing is that your model is just learning
0:24:28	how to predict the majority class
0:24:31	i mean i use a
0:24:33	no
0:24:34	i use unweighted average recall and if it would predict just
0:24:39	just a majority class so it means that actually the model would
0:24:44	just
0:24:45	assign
0:24:46	all the examples to one class
0:24:49	it means that your performance metric would be
0:24:54	like
0:24:55	not above zero point five
0:24:59	because it's like a global metric
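The point about the metric can be checked with a small sketch. Unweighted average recall is the mean of the per-class recalls, so a predictor that always outputs the majority class scores 0.5 on a two-class problem regardless of class imbalance. The function name and the 90/10 split in the usage note are illustrative assumptions, not figures from the talk.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

For 90 examples of class 0 and 10 of class 1, predicting class 0 everywhere gives an accuracy of 0.9 but a recall of 1.0 on class 0 and 0.0 on class 1, which averages to 0.5.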
0:25:02	sure but for instance even so if you look at the
0:25:06	the baseline for the speech and that's about seventy point something
0:25:10	so you mean the baseline for the home bank corpus
0:25:16	of using the end-to-end or
0:25:18	similarly no actually the end-to-end baseline was the worst baseline
0:25:23	so and sixty four so
0:25:26	i remember the
0:25:29	the article
0:25:30	released right before the submission for the challenge and the
0:25:36	result there of the baseline for the end-to-end model was like
0:25:39	zero point fifty nine or so
0:25:42	and if you mean this and
0:25:45	if we talk about the entire multi model
0:25:49	like thing so the baseline was like
0:25:54	zero point seven or so but they used a much greater feature set for
0:26:01	this and several models like a collective of models
0:26:05	including bag of audio words end-to-end llds and all that
0:26:10	stuff
0:26:13	okay let's thank our speaker again

Cross-Corpus Data Augmentation for Acoustic Addressee Detection

Oral Session 5: Acoustics

Oleg Akhtiamov, Ingo Siegert, Alexey Karpov and Wolfgang Minker