| 0:00:18 | alright, welcome to the second session on acoustics. we will |
|---|
| 0:00:24 | follow this immediately with the sponsor session and then be |
|---|
| 0:00:28 | back after dinner. our first speaker |
|---|
| 0:00:30 | is Oleg Akhtiamov |
|---|
| 0:00:35 | thank you |
|---|
| 0:00:49 | okay, it's not on... okay |
|---|
| 0:00:51 | okay sorry |
|---|
| 0:00:53 | hello everybody, welcome to my talk. my name is Oleg Akhtiamov |
|---|
| 0:00:57 | and... |
|---|
| 0:01:00 | is it not better, or... |
|---|
| 0:01:05 | sound check |
|---|
| 0:01:06 | okay that's good |
|---|
| 0:01:08 | thanks |
|---|
| 0:01:09 | well, welcome to my talk. so |
|---|
| 0:01:14 | today i'd like to present the study that i conducted together with my colleagues |
|---|
| 0:01:19 | Ingo Siegert, Alexey Karpov, and Wolfgang Minker. i'd like to thank them, as |
|---|
| 0:01:23 | without them it would be impossible to conduct this research |
|---|
| 0:01:27 | and so, as you probably can guess, this topic is |
|---|
| 0:01:33 | related to the big problem introduced |
|---|
| 0:01:40 | at the beginning of our conference today |
|---|
| 0:01:43 | so it's also about situated |
|---|
| 0:01:46 | interaction and multi-party interaction |
|---|
| 0:01:49 | so |
|---|
| 0:01:51 | the title is Cross-Corpus Data Augmentation for Acoustic Addressee Detection |
|---|
| 0:01:56 | first of all i'd like to |
|---|
| 0:01:58 | clarify what addressee detection actually is |
|---|
| 0:02:01 | so, it's a common trend that modern spoken dialogue systems are getting |
|---|
| 0:02:07 | more adaptive and human-like |
|---|
| 0:02:09 | they are now able to |
|---|
| 0:02:12 | interact with multiple users under realistic conditions in the real physical world |
|---|
| 0:02:18 | and |
|---|
| 0:02:21 | sorry |
|---|
| 0:02:25 | so |
|---|
| 0:02:26 | it may happen that |
|---|
| 0:02:29 | not a single user interacts with the system but a group of users, and this |
|---|
| 0:02:33 | is exactly the place where the addressee detection |
|---|
| 0:02:35 | problem |
|---|
| 0:02:36 | arises: it appears in conversations between a |
|---|
| 0:02:43 | technical system and a group of users |
|---|
| 0:02:45 | and |
|---|
| 0:02:46 | we're gonna call this kind of |
|---|
| 0:02:49 | interaction human-machine |
|---|
| 0:02:50 | conversation, and here we have a |
|---|
| 0:02:53 | realistic example from our data |
|---|
| 0:02:56 | so |
|---|
| 0:02:58 | the SDS, |
|---|
| 0:02:59 | in such a mixed kind of interaction, is supposed |
|---|
| 0:03:03 | to distinguish between human- and computer-directed utterances |
|---|
| 0:03:07 | that means solving a binary |
|---|
| 0:03:09 | classification problem in order to maintain efficient conversations in a realistic manner |
|---|
| 0:03:15 | it's important that |
|---|
| 0:03:18 | the system handles human-directed utterances properly: the system is not supposed to give a direct answer to |
|---|
| 0:03:22 | human-directed utterances |
|---|
| 0:03:25 | because otherwise it would interrupt the dialogue flow between two human participants |
|---|
| 0:03:34 | well |
|---|
| 0:03:35 | a similar problem arises in conversations between several adults and a child |
|---|
| 0:03:41 | and similarly to |
|---|
| 0:03:43 | the first kind of addressee detection, we call this problem adult-child addressee detection |
|---|
| 0:03:47 | and here we have again |
|---|
| 0:03:49 | a realistic example of how |
|---|
| 0:03:52 | not to educate your children with smartphones |
|---|
| 0:03:59 | yes, and again, in this case the SDS is supposed to distinguish between adult- |
|---|
| 0:04:04 | and child-directed utterances produced by adults |
|---|
| 0:04:07 | and this is also a |
|---|
| 0:04:10 | binary classification problem |
|---|
| 0:04:12 | and this functionality may be useful for a system performing |
|---|
| 0:04:17 | children's development monitoring |
|---|
| 0:04:21 | namely, let's assume that the less distinguishable a child's adult- and child-directed acoustic patterns are, |
|---|
| 0:04:27 | the bigger progress that child has made in maintaining social interactions, and |
|---|
| 0:04:33 | in particular in maintaining |
|---|
| 0:04:36 | spoken conversations |
|---|
| 0:04:39 | so |
|---|
| 0:04:41 | now |
|---|
| 0:04:43 | let's find out if |
|---|
| 0:04:45 | these two addressee detection problems have anything in common |
|---|
| 0:04:51 | first of all we need to answer the question how we address other people in |
|---|
| 0:04:55 | real life |
|---|
| 0:04:56 | the simplest way to do this is just |
|---|
| 0:04:59 | by name, or with a wake word like "OK Google" or "OK Alexa", or |
|---|
| 0:05:04 | something like this |
|---|
| 0:05:06 | then |
|---|
| 0:05:08 | we can do the same thing implicitly by using, for example, gaze: |
|---|
| 0:05:12 | i'm looking at the one i'm talking to |
|---|
| 0:05:15 | then some contextual markers, like specific topics or |
|---|
| 0:05:19 | specialized vocabulary |
|---|
| 0:05:21 | and |
|---|
| 0:05:23 | the |
|---|
| 0:05:24 | last alternative is a |
|---|
| 0:05:26 | modified acoustic speaking style and prosody |
|---|
| 0:05:29 | and the present study is focused |
|---|
| 0:05:32 | exactly on this |
|---|
| 0:05:35 | last way, |
|---|
| 0:05:36 | on the latter way of |
|---|
| 0:05:38 | addressing |
|---|
| 0:05:40 | subjects in a conversation |
|---|
| 0:05:44 | so the |
|---|
| 0:05:46 | the idea behind acoustic addressee detection is that people tend to change the manner of their |
|---|
| 0:05:51 | speech depending on whom they are talking to |
|---|
| 0:05:53 | for example, we may face some special addressees, such as hard-of-hearing people, |
|---|
| 0:05:58 | elderly people, |
|---|
| 0:06:00 | children, or spoken dialogue systems |
|---|
| 0:06:03 | that, in our opinion, might have some communication difficulties |
|---|
| 0:06:07 | and talking to such addressees, we intentionally, |
|---|
| 0:06:12 | we intentionally modify the manner of our speech, making it more |
|---|
| 0:06:16 | clear and loud and in general more understandable, since we do not |
|---|
| 0:06:20 | perceive them as adequate conversational agents |
|---|
| 0:06:23 | and the main assumption that we make here is that human-directed speech |
|---|
| 0:06:31 | is supposed to be |
|---|
| 0:06:32 | similar to adult-directed speech |
|---|
| 0:06:36 | well |
|---|
| 0:06:43 | and |
|---|
| 0:06:45 | in the same way, machine-directed speech must be quite similar |
|---|
| 0:06:48 | to child-directed speech |
|---|
| 0:06:54 | in our experiments we use a |
|---|
| 0:06:56 | relatively simple and yet efficient data augmentation approach called mixup. mixup encourages a |
|---|
| 0:07:02 | model to behave linearly in the space between seen data points, and it |
|---|
| 0:07:08 | already has quite many applications in |
|---|
| 0:07:11 | ASR, in |
|---|
| 0:07:13 | image recognition, and |
|---|
| 0:07:14 | many other |
|---|
| 0:07:16 | popular fields |
|---|
| 0:07:18 | basically, mixup generates artificial examples |
|---|
| 0:07:21 | as linear combinations |
|---|
| 0:07:24 | of two random real feature and label vectors, taken with the coefficients lambda and (1 - lambda) |
|---|
| 0:07:31 | and this lambda is a real number randomly generated |
|---|
| 0:07:36 | from a beta distribution |
|---|
| 0:07:37 | specified by its only parameter alpha. so technically, alpha lies |
|---|
| 0:07:44 | within the interval from zero to infinity |
|---|
| 0:07:47 | but according to our experiments, |
|---|
| 0:07:50 | alpha values higher than one |
|---|
| 0:07:54 | already lead to |
|---|
| 0:07:55 | underfitting |
|---|
| 0:07:58 | and, in our opinion, the most reasonable interval to vary |
|---|
| 0:08:02 | this parameter is from zero to one |
|---|
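A minimal sketch of the mixup scheme described above, assuming NumPy arrays for features and one-hot label vectors (the function name, the default alpha, and the seeding are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4):
    """Combine two real (feature, label) pairs into one artificial example:
    x_mix = lam * x_i + (1 - lam) * x_j,
    y_mix = lam * y_i + (1 - lam) * y_j,
    where lam ~ Beta(alpha, alpha); per the talk, alpha in (0, 1] works best."""
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j
```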
| 0:08:07 | so |
|---|
| 0:08:07 | the question is how many examples to generate, and here, |
|---|
| 0:08:12 | let's imagine that we just merge |
|---|
| 0:08:15 | C |
|---|
| 0:08:16 | different datasets without applying any data augmentation, just put them together |
|---|
| 0:08:21 | so we generate one batch |
|---|
| 0:08:24 | from each dataset |
|---|
| 0:08:25 | and it means that we can increase the initial amount of training data in the |
|---|
| 0:08:30 | target corpus C times |
|---|
| 0:08:33 | but if we additionally apply mixup, |
|---|
| 0:08:35 | then we generate, |
|---|
| 0:08:37 | along with |
|---|
| 0:08:38 | these C batches, |
|---|
| 0:08:43 | k |
|---|
| 0:08:45 | examples, |
|---|
| 0:08:46 | that is, |
|---|
| 0:08:49 | k artificial examples from each real example, |
|---|
| 0:08:52 | increasing the amount of training data |
|---|
| 0:08:55 | C multiplied by (k + 1) times |
|---|
| 0:08:59 | and it's important to note that the artificial examples are generated |
|---|
| 0:09:02 | more |
|---|
| 0:09:03 | or less on the fly, without any significant delays in the training process, so we |
|---|
| 0:09:07 | just |
|---|
| 0:09:07 | do it on the go |
|---|
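A rough sketch of this generation scheme: one real batch per corpus per step, plus k mixed batches built on the fly. The data layout (utterance-level feature matrices X and one-hot label matrices Y per corpus) and all names are assumptions for illustration:

```python
import numpy as np

def mixed_batches(corpora, batch_size=32, k=1, alpha=0.4, seed=0):
    """corpora: list of (X, Y) pairs, one per corpus (C corpora in total).
    Each pass yields one real batch per corpus plus k mixup batches derived
    from it, i.e. roughly C * (k + 1) times the single-corpus training data."""
    rng = np.random.default_rng(seed)
    while True:
        for X, Y in corpora:
            idx = rng.choice(len(X), size=batch_size, replace=False)
            xb, yb = X[idx], Y[idx]
            yield xb, yb                          # real batch
            for _ in range(k):                    # k artificial batches
                lam = rng.beta(alpha, alpha, size=(batch_size, 1))
                perm = rng.permutation(batch_size)
                yield (lam * xb + (1 - lam) * xb[perm],
                       lam * yb + (1 - lam) * yb[perm])
```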
| 0:09:11 | well, here you can see |
|---|
| 0:09:14 | the models that we used |
|---|
| 0:09:17 | to |
|---|
| 0:09:19 | solve our problem |
|---|
| 0:09:23 | and they are arranged according to their complexity, from |
|---|
| 0:09:26 | left to right |
|---|
| 0:09:29 | well, the first model is a simple |
|---|
| 0:09:32 | linear SVM |
|---|
| 0:09:34 | using the ComParE functionals as the input. this is a pretty popular feature set |
|---|
| 0:09:40 | in the area of emotion recognition; it was introduced at INTERSPEECH in two thousand thirteen, |
|---|
| 0:09:46 | i guess |
|---|
| 0:09:47 | yes, so these features are extracted from the whole utterance |
|---|
| 0:09:52 | next we apply |
|---|
| 0:09:55 | the LLD model |
|---|
| 0:09:57 | that includes a recurrent neural network with long short-term memory |
|---|
| 0:10:02 | and it |
|---|
| 0:10:03 | receives the LLDs (low-level descriptors), which were also used to compute |
|---|
| 0:10:08 | the ComParE functionals for the first model |
|---|
| 0:10:12 | and in contrast to |
|---|
| 0:10:14 | the functionals, the LLDs have |
|---|
| 0:10:17 | a time-continuous nature |
|---|
| 0:10:20 | so it's a time-continuous signal |
|---|
| 0:10:22 | and the last model is an end-to-end model performing raw signal |
|---|
| 0:10:28 | processing, so |
|---|
| 0:10:30 | it receives just the |
|---|
| 0:10:33 | raw audio utterance, which passes through a stack of convolutional input layers, and after the |
|---|
| 0:10:39 | convolutional component there is the same recurrent network with long |
|---|
| 0:10:41 | short-term memory |
|---|
| 0:10:43 | that was introduced within the previous model |
|---|
| 0:10:47 | yes, and |
|---|
| 0:10:49 | as the reference point for the convolutional component, we |
|---|
| 0:10:53 | took |
|---|
| 0:10:53 | the five-layer SoundNet architecture and slightly modified it for our needs: namely, we reduced |
|---|
| 0:10:58 | its dimensionality |
|---|
| 0:11:00 | by reducing the number of filters in each layer according to the |
|---|
| 0:11:06 | amount of data that we have at our disposal, and we also reduced the kernel |
|---|
| 0:11:11 | sizes according to the dimensionality of the signal that we have |
|---|
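As a rough illustration of this end-to-end model (a reduced SoundNet-style 1-D convolutional front end feeding the same LSTM component), here is a Keras-style sketch; the filter counts, kernel sizes, strides, and the two-second 16 kHz input are illustrative placeholders rather than the exact configuration from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_end_to_end(num_classes=2, sr=16000, seconds=2):
    inp = layers.Input(shape=(sr * seconds, 1))          # raw waveform
    x = inp
    # Five SoundNet-style convolutional blocks with reduced filters/kernels.
    for filters, kernel in [(16, 64), (32, 32), (64, 16), (128, 8), (128, 4)]:
        x = layers.Conv1D(filters, kernel, strides=2, padding="same",
                          activation="relu")(x)
        x = layers.MaxPooling1D(2)(x)
    x = layers.LSTM(128)(x)                              # recurrent component
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```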
| 0:11:20 | well |
|---|
| 0:11:21 | here you can see the data that we have at our disposal. we |
|---|
| 0:11:24 | have two datasets for modelling |
|---|
| 0:11:27 | human-machine addressee detection, namely the Smart Video Corpus (SVC), which contains interactions between a user, a confederate, |
|---|
| 0:11:34 | and a mobile SDS |
|---|
| 0:11:35 | and by the way, this is the only corpus |
|---|
| 0:11:38 | that was |
|---|
| 0:11:40 | simulated, i.e. |
|---|
| 0:11:42 | played in a wizard-of-oz setting |
|---|
| 0:11:46 | the next |
|---|
| 0:11:47 | corpus |
|---|
| 0:11:48 | is VACC, the Voice Assistant Conversation Corpus, which |
|---|
| 0:11:51 | similarly to SVC contains |
|---|
| 0:11:54 | interactions between a user, a confederate, and an Amazon Alexa. this data |
|---|
| 0:11:58 | is real, |
|---|
| 0:12:00 | without any wizard-of-oz simulation |
|---|
| 0:12:03 | and |
|---|
| 0:12:04 | the third corpus is HomeBank, which includes conversations between an adult, another adult, |
|---|
| 0:12:10 | and a child |
|---|
| 0:12:12 | we tried to reuse the same splits into training, development, and test sets |
|---|
| 0:12:18 | that |
|---|
| 0:12:20 | were introduced in the |
|---|
| 0:12:21 | original studies published for these corpora |
|---|
| 0:12:25 | and they turned out to be approximately the same in proportion, so |
|---|
| 0:12:32 | train, development, and test have a proportion of five by one by |
|---|
| 0:12:36 | four |
|---|
| 0:12:40 | first we conduct some preliminary analysis with the linear model, the func model: we perform |
|---|
| 0:12:47 | feature selection by means of recursive feature elimination |
|---|
| 0:12:51 | we iteratively exclude a small portion of the |
|---|
| 0:12:54 | ComParE features with the lowest SVM weights |
|---|
| 0:12:57 | and then we measure the performance |
|---|
| 0:12:59 | of the |
|---|
| 0:13:01 | reduced feature set in terms of unweighted average recall |
|---|
| 0:13:04 | and a feature set is considered to be optimal |
|---|
| 0:13:07 | if further |
|---|
| 0:13:08 | dimensionality reduction leads to a significant information loss |
|---|
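A hedged scikit-learn sketch of this selection loop; unweighted average recall is the same thing as macro-averaged recall, and the SVM hyperparameters and elimination step here are placeholders:

```python
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFECV
from sklearn.metrics import make_scorer, recall_score

# UAR == macro-averaged recall (mean of the per-class recalls).
uar = make_scorer(recall_score, average="macro")

# Recursively drop the 5% of features with the smallest |SVM weight| and keep
# the feature set beyond which UAR degrades under cross-validation.
selector = RFECV(estimator=LinearSVC(C=0.01, max_iter=20000),
                 step=0.05, scoring=uar, cv=5)
# selector.fit(X_train, y_train)   # X_train: utterance-level ComParE functionals
# print(selector.n_features_)      # size of the resulting "optimal" feature set
```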
| 0:13:13 | and here in this figure we see that the |
|---|
| 0:13:18 | optimal feature sets |
|---|
| 0:13:20 | vary significantly |
|---|
| 0:13:22 | and it's also very interesting that the size of the optimal feature set on |
|---|
| 0:13:26 | SVC is much greater than for the other two, so it may be explained |
|---|
| 0:13:30 | by the |
|---|
| 0:13:31 | wizard-of-oz modelling: probably |
|---|
| 0:13:34 | some of the participants |
|---|
| 0:13:35 | didn't really believe that they were interacting with a real technical system |
|---|
| 0:13:39 | and this issue resulted in |
|---|
| 0:13:43 | slightly atypical acoustic addressing patterns |
|---|
| 0:13:47 | well, another |
|---|
| 0:13:50 | sequence of experiments that we conducted is inverse LOCO and LOCO experiments. |
|---|
| 0:13:54 | LOCO means leave-one-corpus-out (everyone knows what it means), and inverse LOCO |
|---|
| 0:14:00 | means just that we train our model on one corpus and test it on |
|---|
| 0:14:06 | each of the other corpora separately |
|---|
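A self-contained toy sketch of the two protocols; the corpus data here is random stand-in data, and the fit/uar helpers are simple placeholders, not the talk's actual models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

def toy(n, d=20):                      # random stand-in for a corpus split
    return rng.normal(size=(n, d)), rng.integers(0, 2, n)

corpora = {"SVC": (toy(200), toy(80)),        # (train, test) per corpus
           "VACC": (toy(200), toy(80)),
           "HomeBank": (toy(200), toy(80))}

def fit(X, y):
    return LogisticRegression(max_iter=1000).fit(X, y)

def uar(model, X, y):                  # unweighted average recall
    return recall_score(y, model.predict(X), average="macro")

# LOCO: train on all corpora except one, test on the held-out corpus.
for held_out in corpora:
    Xs, ys = zip(*(corpora[c][0] for c in corpora if c != held_out))
    model = fit(np.vstack(Xs), np.concatenate(ys))
    print("LOCO", held_out, uar(model, *corpora[held_out][1]))

# Inverse LOCO: train on one corpus, test on each of the others separately.
for source in corpora:
    model = fit(*corpora[source][0])
    for target in corpora:
        if target != source:
            print("iLOCO", source, "->", target, uar(model, *corpora[target][1]))
```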
| 0:14:08 | so, in this figure there is a pretty clear relation between VACC |
|---|
| 0:14:12 | and |
|---|
| 0:14:13 | SVC |
|---|
| 0:14:14 | and it's pretty natural that |
|---|
| 0:14:19 | these corpora are |
|---|
| 0:14:21 | perceived as similar by our system, because |
|---|
| 0:14:24 | the domains are pretty close and they were both uttered in german |
|---|
| 0:14:28 | in contrast to HomeBank, which was uttered in english. and as we can see from |
|---|
| 0:14:32 | this figure, |
|---|
| 0:14:33 | our |
|---|
| 0:14:34 | linear model |
|---|
| 0:14:37 | fails to find any direct relation between |
|---|
| 0:14:41 | this corpus and the other two |
|---|
| 0:14:43 | but let's take a look at |
|---|
| 0:14:45 | the next figure |
|---|
| 0:14:47 | and here we notice a very interesting trend: |
|---|
| 0:14:52 | even though |
|---|
| 0:14:52 | HomeBank |
|---|
| 0:14:55 | significantly differs from the other two corpora, the linear model trained |
|---|
| 0:15:00 | on |
|---|
| 0:15:02 | all three corpora |
|---|
| 0:15:05 | performs on each of them equally well, as if it were trained |
|---|
| 0:15:10 | on each of the corpora separately and tested on them separately |
|---|
| 0:15:14 | so it means that |
|---|
| 0:15:16 | the datasets that we have are not contradictory, |
|---|
| 0:15:18 | at least |
|---|
| 0:15:22 | now let's take a look at our experiments with |
|---|
| 0:15:27 | the LLD model and various context lengths per example |
|---|
| 0:15:33 | and here, |
|---|
| 0:15:34 | in each of the three cases, |
|---|
| 0:15:36 | red, green, and blue, we see that the |
|---|
| 0:15:39 | dashed line is located above |
|---|
| 0:15:42 | the solid one |
|---|
| 0:15:43 | meaning that mixup results in an additional performance improvement |
|---|
| 0:15:50 | even |
|---|
| 0:15:51 | when applied to the same corpus |
|---|
| 0:15:53 | and |
|---|
| 0:15:54 | it's also interesting to note that |
|---|
| 0:15:58 | the context length of two seconds |
|---|
| 0:16:01 | turns out to be optimal for each of the corpora, |
|---|
| 0:16:05 | given that they have |
|---|
| 0:16:07 | very different utterance length distributions |
|---|
| 0:16:10 | so two seconds is sufficient to predict addressees using the acoustic modality |
|---|
| 0:16:16 | well |
|---|
| 0:16:16 | unfortunately, mixup gives no performance improvement to the end-to-end model; probably we just |
|---|
| 0:16:21 | don't have enough data to profit from it |
|---|
| 0:16:28 | so we reproduced the same experiments with |
|---|
| 0:16:32 | LOCO and inverse LOCO on the neural-network-based models |
|---|
| 0:16:35 | and so |
|---|
| 0:16:37 | they both show the same trends, |
|---|
| 0:16:39 | that is, |
|---|
| 0:16:40 | SVC and VACC seem quite similar to them |
|---|
| 0:16:44 | and actually the end-to-end model managed to capture |
|---|
| 0:16:47 | this similarity even better compared to the LLD one |
|---|
| 0:16:51 | but there is an issue with multitask learning |
|---|
| 0:16:55 | particularly, |
|---|
| 0:16:56 | the issue is that |
|---|
| 0:16:58 | our neural network, |
|---|
| 0:17:00 | regardless of which one we start with, overfits to |
|---|
| 0:17:05 | the easiest task, |
|---|
| 0:17:06 | the one with the highest correlation between features and labels. and here you can see that the model |
|---|
| 0:17:11 | trained on all three datasets |
|---|
| 0:17:14 | starts |
|---|
| 0:17:15 | doing this: |
|---|
| 0:17:15 | the model |
|---|
| 0:17:17 | completely ignores HomeBank |
|---|
| 0:17:19 | even though it was trained on this corpus |
|---|
| 0:17:22 | and it also starts discriminating |
|---|
| 0:17:25 | against this dataset. the situation changes if we apply mixup |
|---|
| 0:17:30 | over all the corpora |
|---|
| 0:17:33 | and the model actually starts perceiving |
|---|
| 0:17:36 | both corpora really efficient... |
|---|
| 0:17:38 | efficiently, |
|---|
| 0:17:39 | as if it were |
|---|
| 0:17:41 | trained on each of the corpora separately and tested on each of the corpora |
|---|
| 0:17:45 | separately |
|---|
| 0:17:47 | again, we conducted |
|---|
| 0:17:49 | a similar experiment, just merging all three |
|---|
| 0:17:54 | datasets, with and without mixup, |
|---|
| 0:17:57 | using all three models |
|---|
| 0:17:58 | and here we can see that mixup regularizes both the |
|---|
| 0:18:02 | LLD and end-to-end models and also prevents overfitting to |
|---|
| 0:18:05 | a specific corpus, namely SVC, the one with the highest correlation between the features and labels, as |
|---|
| 0:18:09 | it is the easiest task for our system |
|---|
| 0:18:13 | but unfortunately, mixup doesn't provide an improvement for the func model |
|---|
| 0:18:18 | which |
|---|
| 0:18:19 | is actually logical: |
|---|
| 0:18:20 | this model |
|---|
| 0:18:21 | doesn't suffer from overfitting to a specific task and |
|---|
| 0:18:24 | doesn't need to be regularized |
|---|
| 0:18:25 | due to its very simple structure, |
|---|
| 0:18:27 | its very simple architecture |
|---|
| 0:18:30 | well, the last series of experiments |
|---|
| 0:18:33 | is experiments with ASR features |
|---|
| 0:18:37 | the idea behind them is that |
|---|
| 0:18:39 | system-directed utterances tend to match |
|---|
| 0:18:44 | the ASR |
|---|
| 0:18:45 | acoustic and language models much better compared to |
|---|
| 0:18:48 | human-directed utterances |
|---|
| 0:18:51 | and |
|---|
| 0:18:52 | this definitely works in the human-machine setting |
|---|
| 0:18:56 | but |
|---|
| 0:18:57 | it seems to be |
|---|
| 0:18:58 | not working |
|---|
| 0:18:59 | in the adult-child setting. and we just analysed the |
|---|
| 0:19:03 | data itself, so, |
|---|
| 0:19:06 | deep inside, and noticed that |
|---|
| 0:19:09 | sometimes addressing children... |
|---|
| 0:19:12 | well, |
|---|
| 0:19:13 | addressing children, people don't even use words; instead they just use some separate intonations |
|---|
| 0:19:20 | or sounds, without any words, and |
|---|
| 0:19:23 | this causes real problems for our ASR, meaning that |
|---|
| 0:19:27 | the |
|---|
| 0:19:30 | ASR confidences will be equal for both of the target classes |
|---|
| 0:19:34 | so |
|---|
| 0:19:35 | this is the reason why it performs so poorly |
|---|
| 0:19:38 | on this HomeBank problem |
|---|
| 0:19:41 | so here we come to the conclusions, and we can conclude that mixup improves |
|---|
| 0:19:45 | classification performance for models with |
|---|
| 0:19:49 | predefined feature sets, and also |
|---|
| 0:19:52 | regularizes them, |
|---|
| 0:19:53 | and also enables multitask learning abilities |
|---|
| 0:19:57 | for both end-to-end models and models with predefined feature sets |
|---|
| 0:20:03 | two-second-long speech fragments |
|---|
| 0:20:07 | allow us to |
|---|
| 0:20:08 | capture |
|---|
| 0:20:11 | addressees with |
|---|
| 0:20:12 | sufficient quality |
|---|
| 0:20:13 | and actually, the same conclusion was drawn by another group of |
|---|
| 0:20:17 | researchers regarding the english language |
|---|
| 0:20:21 | yes, and |
|---|
| 0:20:22 | as i told |
|---|
| 0:20:24 | a couple of slides before, ASR confidence is not representative for adult-child addressee detection, |
|---|
| 0:20:28 | but it is still useful for human-machine addressee detection. in our experiments we also |
|---|
| 0:20:34 | beat a couple of baselines: we introduced the first official baseline for the |
|---|
| 0:20:38 | VACC corpus and we beat the HomeBank end-to-end baseline |
|---|
| 0:20:43 | for future directions, i would propose extending our experiments, applying mixup to two-dimensional |
|---|
| 0:20:50 | spectrograms and to features extracted with the convolutional component |
|---|
| 0:20:54 | thank you |
|---|
| 0:21:01 | we have time for some questions |
|---|
| 0:21:04 | hi, [inaudible] |
|---|
| 0:21:08 | yes, hi |
|---|
| 0:21:11 | i was wondering why you chose to treat adult-child interaction like |
|---|
| 0:21:17 | human-machine interaction. is there any literature linked to this decision, or was it just |
|---|
| 0:21:23 | sort of intuition? well, it was our assumption, without any background |
|---|
| 0:21:28 | i mean, it was like an interesting |
|---|
| 0:21:30 | assumption, an interesting thing to either disprove or prove |
|---|
| 0:21:35 | yes, and so |
|---|
| 0:21:36 | conceptually |
|---|
| 0:21:38 | it should be like this: sometimes we perceive a system as an |
|---|
| 0:21:44 | infant or a person having a lack of communication skills |
|---|
| 0:21:48 | and that's what we took as the basic assumption |
|---|
| 0:21:55 | for our study. so, conceptually, they are not distinct? |
|---|
| 0:21:59 | conceptually distinct... okay, this is one thing. and in your experiments, a single |
|---|
| 0:22:05 | system, i think |
|---|
| 0:22:06 | yes, actually they probably overlap, but only partially |
|---|
| 0:22:12 | but in our experiments, a single system is capable of solving both |
|---|
| 0:22:16 | tasks simultaneously |
|---|
| 0:22:17 | it performs far worse on the adult-child corpus |
|---|
| 0:22:22 | yes, but that's because the baseline performance is far worse |
|---|
| 0:22:25 | i mean, the highest baseline on HomeBank is like |
|---|
| 0:22:29 | zero point sixty-four |
|---|
| 0:22:32 | or zero point sixty-six or something like this |
|---|
| 0:22:34 | okay |
|---|
| 0:22:36 | so it's just a matter of the data quality |
|---|
| 0:22:48 | hi, [inaudible]. thanks for the interesting talk. i was wondering, |
|---|
| 0:22:56 | maybe i missed something: did you use any language features? and if not, can you |
|---|
| 0:23:01 | speculate whether there is going to be an impact on the performance? |
|---|
| 0:23:06 | what do you mean by language features? i mean, like separate words, or... |
|---|
| 0:23:09 | for instance, if i'm talking to a child, i might address the child in a |
|---|
| 0:23:14 | different way than i address adults |
|---|
| 0:23:17 | okay, well, it's a difficult question. remember that i told that sometimes, talking to a |
|---|
| 0:23:23 | child, we don't use real words |
|---|
| 0:23:25 | this is a problem for language modelling, right? i mean, my hypothesis is |
|---|
| 0:23:30 | that you would simplify the language you use if you're addressing a child, compared |
|---|
| 0:23:35 | to when you address an adult. yes, we do, we do |
|---|
| 0:23:40 | my speculation on this would be yes: |
|---|
| 0:23:43 | we can try to leverage both textual and acoustic |
|---|
| 0:23:47 | modalities |
|---|
| 0:23:48 | to solve the same problem. yes, okay. next |
|---|
| 0:23:52 | time for one more |
|---|
| 0:23:56 | it's more of a comment |
|---|
| 0:24:00 | i just... have you checked |
|---|
| 0:24:04 | how well you do with respect to the results of the competition? |
|---|
| 0:24:07 | so, the same dataset, or a similar dataset, was used as part |
|---|
| 0:24:11 | of the interspeech ComParE challenge, and i think it was |
|---|
| 0:24:16 | seventy point something |
|---|
| 0:24:17 | so i'm curious: have you looked at the majority baseline? are you predicting the |
|---|
| 0:24:22 | majority class? because it's essentially a binary class prediction you do |
|---|
| 0:24:25 | and so one concern is that your model just learns |
|---|
| 0:24:28 | how to predict the majority class |
|---|
| 0:24:31 | i mean, i use... |
|---|
| 0:24:33 | no, |
|---|
| 0:24:34 | i use unweighted average recall, and if the model would predict just |
|---|
| 0:24:39 | the majority class, it means that the model |
|---|
| 0:24:44 | just |
|---|
| 0:24:45 | assigns |
|---|
| 0:24:46 | all the examples to one class |
|---|
| 0:24:49 | and it means that the performance metric would be |
|---|
| 0:24:54 | like |
|---|
| 0:24:55 | not above zero point five |
|---|
| 0:24:59 | because it's a class-balanced metric |
|---|
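A quick worked example of that point: on imbalanced binary data, a majority-class predictor scores high plain accuracy but only 0.5 UAR:

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1] * 80 + [0] * 20)   # 80/20 class imbalance
y_pred = np.ones_like(y_true)            # always predict the majority class

print((y_true == y_pred).mean())                      # accuracy: 0.8
print(recall_score(y_true, y_pred, average="macro"))  # UAR: (1.0 + 0.0) / 2 = 0.5
```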
| 0:25:02 | sure, but for instance, if you look at the |
|---|
| 0:25:06 | baseline for interspeech, that's about seventy point something |
|---|
| 0:25:10 | i see, you mean the baseline for the HomeBank corpus |
|---|
| 0:25:16 | using the end-to-end model, or... |
|---|
| 0:25:18 | no, actually the end-to-end baseline was the worst baseline, |
|---|
| 0:25:23 | and sixty-four... so |
|---|
| 0:25:26 | i remember the |
|---|
| 0:25:29 | article |
|---|
| 0:25:30 | released right before the submission deadline for the challenge, and the |
|---|
| 0:25:36 | result of the baseline for the end-to-end model there was like |
|---|
| 0:25:39 | zero point fifty-nine or so |
|---|
| 0:25:42 | alright. and the end-to-end... if you mean this, and |
|---|
| 0:25:45 | if we talk about the entire multimodal |
|---|
| 0:25:49 | thing, so the baseline was like |
|---|
| 0:25:54 | zero point seven or so, but they used much greater feature sets for |
|---|
| 0:26:01 | this, and several models, like a collective of models |
|---|
| 0:26:05 | including bag-of-audio-words, end-to-end, LLDs, and all that |
|---|
| 0:26:10 | stuff |
|---|
| 0:26:13 | okay let's thank our speaker again |
|---|