alright welcome to the second session on acoustics we will

follow this immediately with the sponsors session and then be

back with dinner our speaker

is oleg akhtiamov

thank you

okay it's not all okay

okay sorry

hello everyone welcome to my talk my name is oleg akhtiamov

can you hear me

is it not better

sound check

okay that's good

thanks

well welcome welcome to my talk so

today i'd like to present the study that i conducted together with my colleagues

ingo siegert alexey karpov and wolfgang minker first i'd like to thank them all

without them it would be impossible to conduct this research

and so the issue as you probably can guess so this topic is

related to the big problem introduced

at the beginning of our conference today

so it's also about situated

interaction and multi-party interaction

so

the title is cross-corpus data augmentation for acoustic addressee detection

first of all i'd like to

clarify what addressee detection actually is

so it's a common trend that modern spoken dialogue systems are getting

more adaptive and human-like

and are now able to

interact with multiple users under realistic conditions in the real physical world

and

sorry

so

it may happen that

not a single user interacts with the system but a group of users and this

is exactly the place where addressee detection

where

this problem arises it appears in conversations between a

technical system and a group of users

and

we're gonna call this kind of

interaction human-machine

conversation and here we have a

realistic example from our data

so

the sds

so basically in such a mixed kind of interaction the sds is supposed

to distinguish between human- and computer-directed utterances

that means solving a binary

classification problem in order to maintain efficient conversations in a realistic manner

it's important that

the system is not supposed to give a direct answer to

human-directed utterances

because otherwise it would interrupt the dialogue flow between two human participants

well

a similar problem arises in conversations between several adults and a child

and similarly to

the previous problem we call this problem adult-child addressee detection

and here we have again

a realistic example of how

not to educate your children with smartphones

yes and again in this case the sds is supposed to distinguish between adult-

and child-directed utterances produced by adults

and this also means

binary classification problem

and this functionality may be useful for a system performing

child development monitoring

namely let's assume that the less distinguishable child- and adult-directed acoustic patterns are

the bigger progress the child makes in maintaining social interactions and

in particular in maintaining

spoken conversations

so

now

let's find out if

these two addressee detection problems have anything in common

first of all we need to answer the question how we address other people in

real life

the simplest way to do this is just

by name or with a wake word like okay google or okay alexa or

something like this

then

we can do the same thing implicitly by using for example gaze

i'm looking at him while talking to you

then some contextual markers like specific topics or

specialised expressions

and

the

the last option is a

modified acoustic speaking style and prosody

and the present study is focused

exactly on the

last way

on the latter way of

addressing

subjects in a conversation

so the

the idea behind acoustic addressee detection is that people tend to change the manner of

speech depending on whom they are talking to

for example we may face some special addressees such as hard-of-hearing people

elderly people

children or spoken dialogue systems

that in our opinion might have some communication difficulties

and talking to such addressees we intentionally

we intentionally modify our manner of speech making it more

articulated loud and in general more understandable since we do not

perceive them as adequate conversational agents

and the main assumption that we make here is that human-directed speech

is supposed to be

similar to adult-directed speech

well

and

in the same way machine-directed speech is supposed to be quite similar

to child-directed speech

in our experiments we use a

relatively simple and yet efficient data augmentation approach called mixup which encourages a

model to behave linearly in the space between seen data points and it

already has quite many applications in

asr in

image recognition and

many other

popular fields

basically mixup generates artificial examples

as linear combinations

of two random real feature and label vectors taken with the coefficient lambda

and this lambda is a real number randomly generated

from a beta distribution

specified by its only parameter alpha so technically alpha lies

within the interval from zero to infinity

but according to our experiments

alpha values higher than one

already lead to

underfitting

and in our opinion the most reasonable interval to vary

this parameter is from zero to one
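to make the interpolation concrete, here is a minimal numpy sketch of the mixup step described above; the function name, array shapes and the default alpha are illustrative assumptions rather than details from the talk

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.4):
    """Blend two real (feature, one-hot label) pairs into one artificial example."""
    lam = np.random.beta(alpha, alpha)      # lambda ~ Beta(alpha, alpha), lies in [0, 1]
    x_new = lam * x1 + (1.0 - lam) * x2     # linear combination of the feature vectors
    y_new = lam * y1 + (1.0 - lam) * y2     # same combination of the label vectors
    return x_new, y_new
```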

so

the question is how many examples to generate and here

let's imagine that we just merge the

c

different datasets without applying any data augmentation just put them together

so we generate one batch

from each dataset

and it means that we thereby increase the initial amount of training data in the

target corpus c times

but if we now apply mixup

so we generate

along with

these c batches we also generate

k

examples

that is

k artificial examples from each real example

increasing the amount of training data

c multiplied by k plus one times

and it's important to note that the artificial examples are generated

batch-wise

on the fly without any significant delays in the training process so we

just

do it on the go
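as a rough sketch of the scheduling just described, one real batch per corpus plus k mixup examples generated on the fly for each real example; the generator name, batch size and corpus representation are assumptions for illustration

```python
import numpy as np

def mixed_batches(corpora, k=2, alpha=0.4, batch_size=32):
    """Yield batches: one real batch from each corpus plus k mixup copies of it.

    `corpora` is a list of C (X, Y) arrays, so the training data grows roughly
    C * (k + 1) times compared to training on a single corpus without mixup.
    """
    while True:
        for X, Y in corpora:                               # one batch from each of the C corpora
            idx = np.random.choice(len(X), size=min(batch_size, len(X)), replace=False)
            xb, yb = X[idx], Y[idx]
            xs, ys = [xb], [yb]
            for _ in range(k):                             # k artificial examples per real example
                perm = np.random.permutation(len(xb))
                lam = np.random.beta(alpha, alpha)
                xs.append(lam * xb + (1 - lam) * xb[perm])
                ys.append(lam * yb + (1 - lam) * yb[perm])
            yield np.concatenate(xs), np.concatenate(ys)
```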

well here you can see

the models that we used

to

to solve our problem

and they are arranged according to their complexity from

left to right

well the first model is a simple

linear svm

using the compare functionals as the input so this is a pretty popular feature set

in the area of emotion recognition it was introduced at interspeech two thousand and thirteen

i guess

yes so these features are extracted from the whole utterance

next we apply

the lld model

that includes a recurrent neural network with long short-term memory

and so

it receives the llds which were also used to compute the

the compare functionals for the first model

and in contrast to

the functionals the llds have

a time-continuous nature

so it's a time-continuous signal
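as an aside, the two feature views mentioned here, utterance-level functionals and frame-level llds, can be extracted with the opensmile toolkit; the snippet below uses the opensmile python package with its ComParE_2016 configuration as a stand-in, which is an assumption about tooling rather than the exact setup used in the study

```python
import opensmile

# utterance-level functionals: one fixed-length vector per utterance
func_extractor = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# frame-level low-level descriptors: a time-continuous sequence per utterance
lld_extractor = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

functionals = func_extractor.process_file("utterance.wav")   # shape: (1, n_functionals)
llds = lld_extractor.process_file("utterance.wav")           # shape: (n_frames, n_llds)
```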

and the last model is an end-to-end model performing raw signal

processing so

it receives just the

raw audio utterance that passes the stack of convolutional input layers and after the

convolutional component there follows the long

short-term memory

that was introduced within the previous model

yes and

as the reference point for the convolutional component we have

taken

the five-layer soundnet architecture and slightly modified it for our needs namely we reduced

its dimensionality

so by reducing the number of filters in each layer according to the

amount of data that we have at our disposal and we also reduced the kernel

sizes according to the dimensionality of the signal that we have
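a minimal pytorch sketch of an end-to-end model of the kind described, a small stack of one-dimensional convolutions over the raw waveform followed by an lstm and a binary output layer; the number of layers, filter counts and kernel sizes are illustrative and not the exact reduced soundnet configuration from the talk

```python
import torch
import torch.nn as nn

class EndToEndAddresseeDetector(nn.Module):
    """Raw-waveform addressee detection: conv front-end + LSTM + 2-class output."""
    def __init__(self, n_filters=32, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(                          # reduced SoundNet-like front-end
            nn.Conv1d(1, n_filters, kernel_size=32, stride=2), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, kernel_size=16, stride=2), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)                     # human- vs machine-directed

    def forward(self, wav):                                 # wav: (batch, n_samples)
        f = self.conv(wav.unsqueeze(1))                     # (batch, n_filters, n_frames)
        h, _ = self.lstm(f.transpose(1, 2))                 # (batch, n_frames, hidden)
        return self.out(h[:, -1])                           # logits from the last time step
```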

well

here you can see the data that we have at our disposal we

we have two datasets for modelling

human-machine addressee detection namely the smart video corpus that contains interactions between a user a confederate

and a mobile sds

and by the way this is the only corpus that

was

simulated

in a wizard-of-oz setting

the next

corpus

is vacc the voice assistant conversation corpus that contains

similarly to the svc

interactions between the user a confederate and an amazon alexa and this data

is real

without any wizard-of-oz simulation

and

the third corpus is homebank that includes conversations between an adult another adult

and a child

we tried to reuse the same splittings into training development and test sets

that were

introduced in the

original studies published by the authors of the corpora

and they turned out to be approximately the same in proportion so

train development and test have approximately the proportion of five by one by

four

first we conduct some preliminary analysis with a linear model the func model we perform

feature selection by means of recursive feature elimination

we iteratively exclude a small portion of the

compare features with the lowest svm weights

and then we measure the performance

of the

reduced feature set in terms of unweighted average recall

a feature subset is considered to be optimal

if further

dimensionality reduction leads to a significant information loss
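a compact scikit-learn sketch of this selection loop: repeatedly drop the features with the smallest absolute linear-svm weights and score each subset with unweighted average recall on the development set; the elimination step size and the stopping criterion of keeping the best-scoring subset are simplifying assumptions

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import recall_score

def rfe_uar(X_train, y_train, X_dev, y_dev, step=0.05):
    """Recursive feature elimination ranked by |SVM weight|, scored with UAR."""
    keep = np.arange(X_train.shape[1])
    best_uar, best_subset = 0.0, keep.copy()
    while len(keep) > 1:
        clf = LinearSVC(C=0.1, max_iter=10000).fit(X_train[:, keep], y_train)
        uar = recall_score(y_dev, clf.predict(X_dev[:, keep]), average="macro")
        if uar > best_uar:
            best_uar, best_subset = uar, keep.copy()
        order = np.argsort(np.abs(clf.coef_[0]))            # weakest features first
        n_drop = max(1, int(step * len(keep)))              # remove a small portion per step
        keep = keep[np.sort(order[n_drop:])]                # keep the stronger features
    return best_subset, best_uar
```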

and here in this figure we see that the

the optimal feature sets

vary significantly

and it's also very interesting that the size of the optimal feature set on the

svc is much greater than on the other two so it may be explained

by the

wizard-of-oz modelling probably

some of the participants

didn't really believe that they were interacting with a real technical system

and this issue resulted in

slightly different acoustic addressing patterns

well another

sequence of experiments that we conduct is loco and inverse loco experiments

loco means leave-one-corpus-out everyone knows what it means and inverse loco

is just that we train our model on one corpus and test on

each of the other corpora separately
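a small sketch of these two protocols; train_fn and eval_fn are hypothetical placeholders for model training and per-corpus uar evaluation

```python
def loco(corpora, train_fn, eval_fn):
    """Leave-one-corpus-out: train on all corpora but one, test on the held-out one."""
    scores = {}
    for held_out in corpora:
        model = train_fn([c for c in corpora if c != held_out])
        scores[held_out] = eval_fn(model, held_out)
    return scores

def inverse_loco(corpora, train_fn, eval_fn):
    """Inverse LOCO: train on a single corpus, test on each of the remaining corpora."""
    scores = {}
    for source in corpora:
        model = train_fn([source])
        for target in corpora:
            if target != source:
                scores[(source, target)] = eval_fn(model, target)
    return scores
```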

so in this figure there is a pretty clear relation between vacc

and

svc

so it's pretty natural that

these corpora are

perceived as similar by our system because

the domains are pretty close and they are both uttered in german

in contrast to homebank that was uttered in english and as we can see from

this figure

our

linear model

fails to find any direct relation between

this corpus and the other two

but let's take a look at the

at the next figure

and here we notice a very interesting trend that

even though

homebank

significantly differs from the other two corpora the linear model trained

on

on any two corpora

performs on each of them equally well as if it was trained

on each of the corpora separately and tested on them separately

so it means that

the datasets that we have are compatible

at least not contradictory

so well let's take a look at our experiments with

the lld model applying mixup on various context lengths

and here

in each of the three cases

red green and blue we see that the

dashed line is located above the

the solid one

meaning that mixup results in an additional performance improvement

even

when already applied to the same corpus

and

it's also interesting to note that

so the context length of two seconds

turns out to be optimal for each of the for each of the corpora given

given that they have

very different utterance length distributions

so two seconds is sufficient to predict addressees using the acoustic modality only

well

unfortunately mixup gives no performance improvement to the end-to-end model probably we just

don't have enough data

so we reproduce the same experiments with

loco and inverse loco on the neural-network-based models

and so

they both show the same trends

that

svc and vacc seem quite similar to them

and actually the end-to-end model managed to capture

this similarity even better compared to the lld one

but there is an issue with multitask learning

particularly

the issue is that

our neural network

regardless of which one starts overfitting to

the easiest task

the one with the highest correlation between features and labels and here you can see that the model

trained on any two datasets

starts

like

so the model

completely ignores homebank

even though it was trained on this corpus

and it also stops discriminating

on this dataset the situation changes if we apply mixup

over all the corpora

and the model actually starts processing

both corpora really efficient

efficiently

as if it was

trained on each of the corpora separately and tested on each of the corpora

separately

again

we conducted a similar experiment just merging all three

datasets with and without mixup

using all three models

and so here we can see that mixup regularises both the

lld and end-to-end models and also prevents overfitting to

the specific corpus with the highest correlation between the features and labels as

it is the easiest task for our system

but unfortunately mixup doesn't provide an improvement for the func model

which is

actually okay

this model

doesn't suffer from overfitting to a specific task and

doesn't need to be regularised

due to its very simple structure

due to its very simple architecture

well the last the last series of experiments

is experiments with asr confidence features

the idea behind them is that

system-directed utterances tend to match

the asr

acoustic and language models much better compared to

human-addressed utterances

and

this definitely works in the human-machine setting

but

it seems to be

not working

in the adult-child setting and we just analysed the

the data itself so

looked deep inside and noted that

sometimes when addressing children

no

when addressing children people don't even use words instead they just use some separate intonations

or sounds or so without any words and

this causes real problems to our asr meaning that

the

the asr confidence will be equal over both of the target classes

so

this is the reason why it performs so poorly

at this homebank problem

so here we come to the conclusions and we can conclude that mixup improves

classification performance for models handling

predefined features and also

regularises end-to-end models

and also enables multitask learning abilities

for both end-to-end models and models handling predefined feature sets

two-second speech fragments

allow us to

capture

addressees with

sufficient quality

and actually the same conclusion was drawn by another group of

researchers regarding the english language

yes and

as i told

a couple of slides before asr confidence is not representative for adult-child addressee detection though

it is still useful for human-machine addressee detection and through all our experiments we also

beat a couple of baselines so we introduce the first official baseline for the

vacc corpus and we beat the homebank end-to-end baseline

for future directions i would propose extending our experiments applying mixup to two-dimensional

spectrograms and to features extracted with the convolutional component

thank you

we have time for some questions

hi

yes

i was wondering why you chose to treat adult-child interaction like

human-machine interaction is there any literature that led to this decision or was it just

sort of intuition you know it was just our assumption without any background

i mean it was like an interesting

assumption an interesting thing to try to prove or disprove

yes and so

conceptually

it should be like this sometimes we perceive a system as an

infant or a person having a lack of communication skills

and that's what we take as the basic assumption for

our study so conceptually they are not distinct

conceptually distinct okay that is one thing but in your experiments a single

i think

yes actually they probably overlap but only partially

well in our experiments a single system is capable of solving both

tasks simultaneously

it performs far worse on the adult-child corpus

yes but because the baseline performance is far worse

i mean the highest baseline on homebank is like

it is zero point sixty four

or zero point sixty six or something like this

okay

so it's just a matter of the data quality

hi thanks for the interesting talk i was wondering

maybe i missed something did you use any language features and if not can you

speculate whether it is going to have an impact on the performance

what do you mean by language features i mean like separate words or

for instance if i'm talking to a child i might address the child in a

different way than i address adults

okay well it's a difficult question remember that i told that sometimes talking to a

child we don't use real words

this is a problem for language modelling right i mean my hypothesis is

that you would simplify the language you use if you're addressing a child compared to

when you address an adult yes we do we do

my speculation on this would be yes

we can so we can we can try to leverage both textual and acoustic

modalities

to solve the same problem yes okay next

time for one more

it's more of a comment

i just so have you checked

how well you do with respect to the results of the competition

so the same dataset was used a similar dataset was used as part

of the interspeech compare challenge and i think the result there was

seventy point something

so i'm curious did you look at the majority baseline so are you predicting the

majority class because it's essentially binary class prediction that you do

and so one concern is that your model just learns

how to predict the majority class

i mean i use

no

i use unweighted average recall and if it would predict just

just the majority class so it means that actually the model would

just

assign

all the examples to one single class

it means that your performance metric would be

like

not above zero point five

because it's like it's like a class-balanced metric
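a quick numeric illustration of this point, with made-up labels: a degenerate model that always predicts the majority class reaches high raw accuracy on imbalanced data but only 0.5 unweighted average recall

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 70 + [0] * 30   # imbalanced binary labels, 70% majority class
y_pred = [1] * 100             # always predict the majority class

print(accuracy_score(y_true, y_pred))                     # 0.70
print(recall_score(y_true, y_pred, average="macro"))      # UAR = (1.0 + 0.0) / 2 = 0.5
```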

sure but for instance even so if you look at the

the baseline for the interspeech challenge that's about seventy point something

so you so you mean the baseline for the homebank corpus

using the end-to-end or

no actually the end-to-end baseline was the worst baseline

so around sixty four so

i remember the

the article

released right before the submission deadline for the challenge and the

result there of the baseline for the end-to-end model was like

zero point fifty nine or so

alright and the end-to-end if you if you mean this and

if we talk about the entire multimodal

like thing so the baseline was like

zero point seven or so but they used much greater feature sets for

this and several models like a collection of models

including bag-of-audio-words and end-to-end llds and all that

stuff

okay let's thank our speaker again