0:00:18 Alright, welcome to the second session on acoustics. We will
0:00:24 follow this immediately with the sponsor session, and then
0:00:28 be back with dinner. Our first speaker
0:00:30 is Oleg Akhtiamov.
0:00:35 Thank you.
0:00:49 Okay, it's not on. Okay.
0:00:51 Okay, sorry.
0:00:53 Hello everyone, welcome to my talk. My name is Oleg Akhtiamov
0:00:57 and that you might be
0:01:00 Is it better now?
0:01:05 Sound check.
0:01:06 Okay, that's good.
0:01:09 So, welcome to my talk.
0:01:14 Today I'd like to present a study that I conducted together with my colleagues
0:01:19 Ingo Siegert, Alexey Karpov and Wolfgang Minker. I'd like to thank them:
0:01:23 without them it would have been impossible to conduct this research. And thank you for your attention.
0:01:27 So, the addressee problem: as you can probably guess, this topic is
0:01:33 related to the big problem introduced by Dan Bohus
0:01:40 at the beginning of our conference today.
0:01:43 So it's also about situated
0:01:46 interaction and multi-party interaction.
0:01:51 The title is "Cross-corpus data augmentation for acoustic addressee detection".
0:01:56 First of all, I'd like to
0:01:58 clarify what addressee detection actually is.
0:02:01 It's a common trend that modern spoken dialogue systems are getting
0:02:07 more adaptive and human-like.
0:02:09 They are now able to
0:02:12 interact with multiple users under realistic conditions in the real physical world.
0:02:26 It may happen that
0:02:29 not a single user interacts with the system but a group of users, and this
0:02:33 is exactly the place where the addressee detection
0:02:36 problem arises. It appears in conversations between a
0:02:43 technical system and a group of users.
0:02:45 And
0:02:46 we're going to call this kind of
0:02:49 interaction human-machine
0:02:50 conversation, and here we have a
0:02:53 realistic example from our data.
0:02:58 So the SDS,
0:02:59 facing such a mixed kind of interaction, is supposed
0:03:03 to distinguish between human- and computer-directed utterances.
0:03:07 That means solving a binary
0:03:09 classification problem. In order to maintain efficient conversations in a realistic manner,
0:03:15 it's important that
0:03:18 the system is not supposed to give a direct answer to
0:03:22 human-directed utterances,
0:03:25 because otherwise it would interrupt the dialogue flow between two human participants.
0:03:35 A similar problem arises in conversations between several adults and a child.
0:03:41 Similarly to
0:03:43 human-machine addressee detection, we call this problem adult-child addressee detection.
0:03:47 And here we have, again,
0:03:49 a realistic example of how
0:03:52 not to educate your children with smartphones.
0:03:59 Yes, and again, in this case the system is supposed to distinguish between adult-
0:04:04 and child-directed utterances produced by adults.
0:04:07 This also means a
0:04:10 binary classification problem.
0:04:12 This functionality may be useful for a system performing
0:04:17 child development monitoring.
0:04:21 Namely, let's assume that the less distinguishable child- and adult-directed acoustic patterns are,
0:04:27 the bigger the progress children make in maintaining social interactions and,
0:04:33 in particular, in maintaining
0:04:36 spoken conversations.
0:04:43 Let's find out if
0:04:45 these two addressee detection problems have anything in common.
0:04:51 First of all, we need to answer the question of how we address other people in
0:04:55 real life.
0:04:56 The simplest way to do this is just
0:04:59 by name, or by a wake word: "OK Google", or "OK Alexa", or
0:05:04 something like this.
0:05:08 We can do the same thing implicitly by using, for example, gaze:
0:05:12 I'm looking at him, talking to you.
0:05:15 Then there are some contextual markers, like specific topics or
0:05:19specialist a convenience
0:05:24 The last way is to
0:05:26 modify the acoustic speaking style and prosody.
0:05:29 The present study is focused
0:05:32 exactly on the
0:05:35 last way,
0:05:36 on the latter way of
0:05:40 addressing subjects in a conversation.
0:05:44 So the
0:05:46 idea behind acoustic addressee detection is that people tend to change their manner of
0:05:51 speech depending on whom they are talking to.
0:05:53 For example, we may face some special addressees, such as hard-of-hearing people,
0:05:58 elderly people,
0:06:00 children or spoken dialogue systems,
0:06:03 that, in our opinion, might have some communication difficulties.
0:06:07 Talking to such addressees, we intentionally
0:06:12 modify our manner of speech, making it more
0:06:16 rhythmical, louder and generally more understandable, since we do not
0:06:20 perceive them as adequate conversational agents.
0:06:23 The main assumption that we make here is that human-directed speech
0:06:31 is supposed to be
0:06:32 similar to adult-directed speech,
0:06:43 and,
0:06:45 in the same way, machine-directed speech is supposed to be quite similar
0:06:48 to child-directed speech.
0:06:54 In our experiments we use a
0:06:56 relatively simple and yet efficient approach to data augmentation called mixup. Mixup encourages a
0:07:02 model to behave linearly in the space between seen data points, and it
0:07:08 already has quite many applications in
0:07:11 ASR, in
0:07:13 image recognition and
0:07:14 many other
0:07:16 popular fields.
0:07:18 Basically, mixup generates artificial examples
0:07:21 as linear combinations
0:07:24 of two random real feature and label vectors, taken with the coefficient lambda.
0:07:31 This number is a real number randomly generated
0:07:36 from a beta distribution
0:07:37 specified, as follows, by the only parameter alpha. So technically alpha lies
0:07:44 within the interval from zero to infinity,
0:07:47 but according to our experiments,
0:07:50 alpha values higher than one
0:07:54 already lead to
0:07:55 underfitting.
0:07:58 In our opinion, the most reasonable interval in which to vary
0:08:02 this parameter is from zero to one.
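The mixing step described here can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function name `mixup_pair`, the plain-list feature vectors and the default `alpha=0.5` are assumptions made for the example. The only facts taken from the talk are that the coefficient is drawn from a beta distribution governed by a single parameter alpha and that features and labels are combined linearly with that coefficient.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.5, rng=random):
    """Blend two (features, one-hot label) pairs into one artificial example.

    lam is drawn from Beta(alpha, alpha); small alpha keeps lam close to 0 or 1,
    so the artificial example stays close to one of the two real examples.
    """
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam

# Hypothetical two-dimensional features with one-hot labels for the two classes.
x, y, lam = mixup_pair([0.0, 2.0], [1.0, 0.0], [4.0, 6.0], [0.0, 1.0],
                       alpha=0.5, rng=random.Random(0))
```

The mixed label stays a valid soft label: its components still sum to one, which is why the same loss function can be used for real and artificial examples.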
0:08:07 The next question is how many examples to generate. Here,
0:08:12 let's imagine that we just merge C
0:08:16 different datasets without applying any data augmentation, just put them together.
0:08:21 So we generate one batch
0:08:24 from each dataset,
0:08:25 and it means that we increase the initial amount of training data in the
0:08:30 target corpus C times.
0:08:33 But if we simultaneously apply mixup,
0:08:35 we generate,
0:08:37 along with
0:08:38 these C batches, also
0:08:45 k examples,
0:08:49 k artificial examples from each real example,
0:08:52 increasing the amount of training data
0:08:55 C multiplied by (k + 1) times.
0:08:59 It's important to note that the artificial examples are generated
0:09:03 batch-wise on the fly, without any significant delays in the training process, so we
0:09:07 do it on the go.
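A rough sketch of this on-the-fly generation is given below. It is an illustration under stated assumptions rather than the speaker's code: the names `augment_batch` and `batches_on_the_fly`, the dictionary of corpora, and the choice to draw each mixing partner from the same batch are all hypothetical. The talk only states that k artificial examples are produced per real example and that C merged corpora therefore yield C(k + 1) times the data of a single corpus.

```python
import random

def augment_batch(batch, k=2, alpha=0.5, rng=random):
    """Return the real examples followed by k mixed examples per real example.

    `batch` is a list of (features, one_hot_label) pairs. Drawing the mixing
    partner from the same batch is an assumption of this sketch.
    """
    augmented = list(batch)
    for x_i, y_i in batch:
        for _ in range(k):
            x_j, y_j = rng.choice(batch)
            lam = rng.betavariate(alpha, alpha)
            x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
            y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
            augmented.append((x_mix, y_mix))
    return augmented

def batches_on_the_fly(corpora, batch_size, k, alpha, rng=random):
    """Yield one augmented batch per corpus; nothing is precomputed or stored."""
    for examples in corpora.values():
        real = rng.sample(examples, batch_size)
        yield augment_batch(real, k=k, alpha=alpha, rng=rng)
```

With C corpora, a batch size of B and k artificial examples per real one, a single pass over `batches_on_the_fly` produces C * B * (k + 1) training examples, which matches the C(k + 1) factor mentioned in the talk.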
0:09:11 Here you can see
0:09:14 the models that we used to
0:09:19 solve our problem.
0:09:23 They are arranged according to their complexity, from
0:09:26 left to right.
0:09:29 The first model is a simple
0:09:32 linear SVM
0:09:34 using the ComParE functionals as the input. This is a pretty popular feature set
0:09:40 in the area of emotion recognition. It was introduced at Interspeech 2013,
0:09:46 I guess.
0:09:47 Yes, so these features are extracted from the whole utterance.
0:09:52 Next, we apply
0:09:55 the LLD model
0:09:57 that includes a recurrent neural network with long short-term memory
0:10:02 and
0:10:03 receives the low-level descriptors, which were also used to compute
0:10:08 the ComParE functionals for the first model.
0:10:12 In contrast to
0:10:14 the functionals, the LLDs have
0:10:17 a time-continuous nature,
0:10:20 so it's a time-continuous signal.
0:10:22 The last model is an end-to-end model performing raw signal
0:10:28 processing, so
0:10:30 it receives just the
0:10:33 raw audio utterance, which passes through a stack of convolutional input layers, and after the
0:10:39 convolutional component comes the same recurrent network
0:10:41 with long short-term memory
0:10:43 that was introduced within the previous model.
0:10:47 Yes, and
0:10:49 as the reference point for the convolutional component we took
0:10:53 the five-layer SoundNet architecture and slightly modified it for our needs, mainly by reducing
0:10:58 its dimensionality,
0:11:00 that is, by reducing the number of filters in each layer according to the
0:11:06 amount of data that we have at our disposal. We also reduced the kernel
0:11:11 sizes in this work according to the dimensionality of the signal that we have.
0:11:21 Here you can see the data that we have at our disposal.
0:11:24 We have two datasets for modelling
0:11:27 human-machine addressee detection, namely the SmartWeb Video Corpus (SVC), which contains interactions between a user, a confederate
0:11:34 and a mobile SDS.
0:11:35 By the way, this is the only corpus
0:11:38 that was
0:11:40 modelled, like,
0:11:42 played with a Wizard-of-Oz setting.
0:11:46 The next one
0:11:48 is the Voice Assistant Conversation Corpus (VACC), which,
0:11:51 similarly to SVC, contains
0:11:54 interactions between a user, a confederate and an Amazon Alexa Echo Dot. This data
0:11:58 is real,
0:12:00 without any Wizard-of-Oz simulation.
0:12:04 The third corpus is HomeBank, which includes conversations between an adult, another adult
0:12:10 and a child.
0:12:12 We tried to reuse the same splits into training, development and test sets
0:12:20 introduced in the
0:12:21 original studies published by the authors of the corpora,
0:12:25 and they turned out to be approximately the same in proportion, so
0:12:32train development and test has a purple the proportion of four five by one by
0:12:40 First, we conduct some preliminary analysis with the linear model, the func model. We perform
0:12:47 feature selection by means of recursive feature elimination:
0:12:51 we just exclude a small portion of the
0:12:54 ComParE features with the lowest SVM weights
0:12:57 and then we measure the performance
0:12:59 of the
0:13:01 reduced feature set in terms of unweighted average recall.
0:13:04 The feature set is considered to be optimal
0:13:07 if further
0:13:08 dimensionality reduction leads to a significant information loss.
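The elimination loop described here can be sketched as follows. This is a generic illustration, not the speaker's code: `fit_weights` and `score` are hypothetical callables standing in for training the linear SVM and measuring unweighted average recall on a development set, and `drop_fraction` and `tolerance` are assumed values, since the talk gives neither the portion removed per step nor the threshold for a "significant" loss.

```python
def recursive_feature_elimination(features, fit_weights, score,
                                  drop_fraction=0.1, tolerance=0.01):
    """Drop the lowest-weighted features until performance degrades noticeably.

    fit_weights(features) -> dict mapping each feature name to its model weight
    score(features)       -> performance of a model restricted to `features`
    """
    current = list(features)
    best_score = score(current)
    while len(current) > 1:
        weights = fit_weights(current)
        n_drop = max(1, int(len(current) * drop_fraction))
        # Rank by absolute weight, smallest first, and discard the weakest ones.
        ranked = sorted(current, key=lambda f: abs(weights[f]))
        candidate = ranked[n_drop:]
        candidate_score = score(candidate)
        if candidate_score < best_score - tolerance:
            break  # further reduction loses too much information
        current = candidate
        best_score = max(best_score, candidate_score)
    return current, best_score
```

In a real setting the two callables would wrap model training and evaluation; here they are left abstract so that the stopping rule itself is visible.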
0:13:13 Here, in this figure, we see that
0:13:18 the optimal feature sets
0:13:20 vary significantly.
0:13:22 It's also very interesting that the size of the optimal feature set on
0:13:26 SVC is much greater than on the other two. It may be explained
0:13:30 by the
0:13:31 Wizard-of-Oz modelling: probably
0:13:34 some of the participants
0:13:35 didn't really believe that they were interacting with a real technical system,
0:13:39 and this issue resulted in
0:13:43 slightly different acoustic addressee patterns.
0:13:47 Well, another
0:13:50 sequence of experiments that we conducted is the inverse LOCO and LOCO experiments.
0:13:54 LOCO means leave-one-corpus-out, everyone knows what it means, and inverse LOCO
0:14:00 means just that we train our model on one corpus and test on
0:14:06 each of the other corpora separately.
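The two protocols differ only in which side of the split holds a single corpus, which a short sketch makes explicit. The function names and the placeholder corpus labels "A", "B" and "C" are assumptions for illustration; only the definitions of LOCO and inverse LOCO come from the talk.

```python
def loco_splits(corpora):
    """Leave-one-corpus-out: train on all corpora but one, test on the held-out one."""
    return [([c for c in corpora if c != held_out], [held_out])
            for held_out in corpora]

def inverse_loco_splits(corpora):
    """Inverse LOCO: train on one corpus, test on each other corpus separately."""
    return [([train], [test])
            for train in corpora
            for test in corpora
            if test != train]

# Placeholder labels standing in for the three corpora used in the talk.
corpora = ["A", "B", "C"]
for train, test in inverse_loco_splits(corpora):
    print("train on", train, "-> test on", test)
```

For three corpora this gives three LOCO splits and six inverse LOCO train/test pairs.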
0:14:08 In this figure, there is a pretty clear relation between VACC
0:14:13 and SVC.
0:14:14 It's pretty natural that
0:14:19 these corpora are
0:14:21 perceived as similar by our system, because
0:14:24 the domains are pretty close and they were both uttered in German,
0:14:28 in contrast to HomeBank, which was uttered in English. As we can see from
0:14:32 this figure,
0:14:33 our
0:14:34 linear model
0:14:37 fails to find any direct relation between
0:14:41 this corpus and the other two.
0:14:43 But let's take a look
0:14:45 at the next figure.
0:14:47 Here we notice a very interesting trend:
0:14:52 even though
0:14:52 HomeBank
0:14:55 significantly differs from the other two corpora, the linear model trained
0:15:02 on every, sorry, on any two corpora
0:15:05 performs on each of them equally well, as if it was trained
0:15:10 on each of the corpora separately and tested on them separately.
0:15:14 So it means that
0:15:16 the datasets that we have are
0:15:18 at least not contradictory.
0:15:22 Let's take a look at our experiments with
0:15:27 the LLD model on various context lengths, for example.
0:15:33 Here,
0:15:34 in each of the three cases,
0:15:36 red, green and blue, we see that the
0:15:39 dashed line is located above
0:15:42 the solid one,
0:15:43 meaning that mixup results in an additional performance improvement
0:15:50 even
0:15:51 when applied to the same corpus.
0:15:54 It's also interesting to note that
0:15:58 the context length of two seconds
0:16:01 turns out to be optimal for each of the corpora,
0:16:05 given that they have
0:16:07 very different utterance length distributions.
0:16:10 So two seconds is sufficient to predict addressees using the acoustic modality.
0:16:16 Unfortunately, mixup gives no performance improvement to the end-to-end model; probably we just
0:16:21 don't have enough data.
0:16:28 We reproduced the same experiments with
0:16:32 LOCO and inverse LOCO on both neural-network-based models,
0:16:35 and
0:16:37 they both show the same trends:
0:16:40 SVC and VACC seem quite similar to them.
0:16:44 Actually, the end-to-end model managed to capture
0:16:47 this similarity even better compared to the LLD one.
0:16:51 But there is an issue with multitask learning.
0:16:56 The issue is that
0:16:58 our neural network,
0:17:00 regardless of which one, starts overfitting
0:17:05 to the easiest task,
0:17:06 with the highest correlation between features and labels. Here we can see that the model
0:17:11 trained on any two datasets,
0:17:15 so the model
0:17:17 completely ignores HomeBank,
0:17:19 even though it was trained on this corpus,
0:17:22 and it also starts discriminating
0:17:25 against this dataset. But the situation changes if we start applying mixup
0:17:30 over all the corpora,
0:17:33 and the model actually starts perceiving
0:17:36 both corpora really efficiently,
0:17:39 as if it were
0:17:41 trained on each of the corpora separately and tested on each of the corpora.
0:17:49this index but we conducted a similar experiment it just merging all three
0:17:54datasets with and without makes up
0:17:57using all three models
0:17:58and so here we can see that makes up a low rises both settle these
0:18:02l d and models and also prevents overfitting
0:18:05the specific corpus mainly dstc with the highest correlation with the features and labels as
0:18:09i is the set so these this task for our system
0:18:13but unfortunately makes up doesn't provide an improvement for the funk model
0:18:19actually goal
0:18:20this model
0:18:21doesn't suffer from overfitting the specific task and
0:18:24doesn't need to be regularized
0:18:25you do it's very simple structure
0:18:27did it is very simple architecture
0:18:30 The last series of experiments
0:18:33 is experiments with ASR confidence features.
0:18:37 The idea behind them is that
0:18:39 system-directed utterances tend to match
0:18:44 the ASR
0:18:45 acoustic and language models much better compared to
0:18:48 human-addressed utterances.
0:18:51 And
0:18:52 this definitely works in the human-machine setting,
0:18:57 but it seems to be
0:18:58 not working
0:18:59 in the adult-child setting. We just analysed
0:19:03 the data itself,
0:19:06 deep inside, and noted that
0:19:09 sometimes, addressing children,
0:19:13 people don't even use words; instead they just use some separate intonations
0:19:20 or sounds without any words, and
0:19:23 this causes real problems for our ASR, meaning that
0:19:27 the
0:19:30 ASR confidence will be equally low for both of the target classes.
0:19:35 This is the reason why it performs so weakly
0:19:38 on this HomeBank problem.
0:19:41 So here we come to the conclusions. We can conclude that mixup improves
0:19:45 classification performance for models with
0:19:49 predefined features and also
0:19:52this is less like
0:19:53 and also enables multitask learning abilities
0:19:57 for both end-to-end models and models with handcrafted feature sets.
0:20:03 Two-second-long speech fragments
0:20:07 allow us to
0:20:11 detect addressees with
0:20:12 sufficient quality,
0:20:13 and actually the same conclusion was drawn by a group of
0:20:17 researchers regarding the English language.
0:20:21 Yes, and,
0:20:22 as I told
0:20:24 a couple of slides before, ASR confidence is not representative for adult-child addressee detection, though
0:20:28 it's still useful for human-machine addressee detection. In our experiments we also beat
0:20:34 a couple of baselines: we introduced the first official baseline for the
0:20:38 VACC corpus and beat the HomeBank end-to-end baseline.
0:20:43 For future directions, I would propose extending our experiments by applying mixup to two-dimensional
0:20:50 spectrograms and to features extracted with the convolutional component.
0:20:54 Thank you.
0:21:01 We have time for some questions.
0:21:04hi a credit when you in c
0:21:08yes i
0:21:11 I was wondering why you chose to treat adult-child interaction like
0:21:17 human-machine interaction. Is there any literature that led you to this decision, or was it just
0:21:23 sort of your intuition? No, it was our assumption, without any background.
0:21:28 I mean, it was like an interesting
0:21:30 assumption, an interesting thing to do, to prove it or to prove it wrong.
0:21:35 Yes, and so
0:21:38 it should be like this: sometimes we perceive a system as an
0:21:44 infant or a person lacking communication skills,
0:21:48 and that's what we take as the basic assumption for
0:21:55forums actually simulate conceptually there's do not sitting
0:21:59conceptually distinct okay this is on one so i put into our experiments a single
0:22:05i think
0:22:06 Yes, actually they probably overlap, but only partially.
0:22:12 According to our experiments, a single system is capable of solving both
0:22:16 tasks simultaneously.
0:22:17 But it performs far worse on the adult-child corpus.
0:22:22 Yes, but because the baseline performance is far worse.
0:22:25 I mean, the highest baseline on HomeBank is like
0:22:29 zero point sixty-four
0:22:32 or zero point sixty-six or something like this.
0:22:36 So it's just a matter of the data quality.
0:22:48high and just from a reporter numerous the interesting talk i was wondering
0:22:56 Maybe I missed something: did you use any language features? If not, can you
0:23:01 at all speculate whether there is going to be an impact on the performance?
0:23:06 What do you mean by language features? I mean, like separate words, or,
0:23:09 for instance, if I'm talking to a child, I might address the child in a
0:23:14 different way than I address adults.
0:23:17 Okay, well, it's a difficult question. You remember that I told that sometimes, talking to
0:23:23 a child, we don't use real words.
0:23:25 This is the problem for language modelling, right? I mean, my hypothesis is
0:23:30 that you would simplify the language you use if you're addressing a child, compared to
0:23:35 when you address an adult. Yes, we do, we do.
0:23:40 My speculation on this would be yes,
0:23:43 we can try to leverage both the textual and acoustic modalities
0:23:48 to solve the same problem. Yes. Okay, next.
0:23:52 Time for one more.
0:23:56that is common
0:24:00 I just, so, have you checked
0:24:04 how well you do with respect to the results of the competition?
0:24:07 The same dataset, or a similar dataset, was used as part
0:24:11 of the Interspeech ComParE challenge, and I think the baseline there, I think it was
0:24:16 seventy point something.
0:24:17 So I'm curious: if you look at the majority baseline, are you predicting the
0:24:22 majority class? Because it's essentially binary class prediction that you do,
0:24:25 and so one concern is that your model is just learning
0:24:28 how to predict the majority class.
0:24:31 I mean, I use
0:24:34 unweighted average recall, and if it predicted just
0:24:39 the majority class, so it means that the model would actually
0:24:45 assign
0:24:46 all the examples to one single class,
0:24:49 it means that your performance metric would be
0:24:55 not above zero point five,
0:24:59 because it's like a global metric.
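The point made in this answer can be checked with a small worked example. The sketch below computes unweighted average recall (the mean of the per-class recalls) in plain Python; the function name and the toy class counts of 90 versus 10 are assumptions chosen for illustration and do not come from the corpora discussed in the talk.

```python
def unweighted_average_recall(y_true, y_pred):
    """Average the recall of each class, giving every class the same weight."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)

# Toy imbalanced test set: 90 examples of class 1 and 10 examples of class 0.
y_true = [1] * 90 + [0] * 10
y_majority = [1] * 100  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
uar = unweighted_average_recall(y_true, y_majority)
print(accuracy, uar)  # accuracy is 0.9, UAR is 0.5
```

A majority-class predictor reaches a recall of 1.0 on one class and 0.0 on the other, so its UAR is 0.5 in a binary task no matter how imbalanced the data is, whereas plain accuracy would reward it.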
0:25:02 Sure, but for instance, even so, if you look at
0:25:06 the baseline for the challenge, that's about seventy point something.
0:25:10 So you mean the baseline for the HomeBank corpus
0:25:16 using the end-to-end model or
0:25:18 something else? Actually, the end-to-end baseline was the worst baseline,
0:25:23 like sixty-four, so.
0:25:26 I remember the
0:25:29 article
0:25:30 released right before the submission deadline for the challenge, and the
0:25:36 result there of the baseline for the end-to-end model was like
0:25:39 zero point fifty-nine or so,
0:25:42 at least for the end-to-end, if you mean this. And
0:25:45 if we talk about the entire multi-model
0:25:49 thing, the baseline was like
0:25:54 zero point seven or so, but they used a much greater feature set for
0:26:01 this and several models, like a collective of models,
0:26:05 including bag-of-audio-words, end-to-end, LLDs and all that.
0:26:13 Okay, let's thank our speaker again.