0:00:15 | Okay, moving on. This is joint work of mine with Bob Dunn.

0:00:20 | bob dunn first |

0:00:23 | First I will describe the challenge, then explain our approach, which is based on ladder networks.

0:00:33 | I will show how it can be applied to this task, and then show experiments.

0:00:39 | So we start with the challenge itself.

0:00:45 | We have some labeled data and a lot of unlabeled data, given as a development set.

0:00:59 | At its core this is a standard classification task, but there are two differences: first, we have unlabeled data that we would like to use as part of the classification,

0:01:15 | and second, we have out-of-set data that we want to take care of.

0:01:19 | So in this talk I will focus on these two challenges: how to use the unlabeled data as part of training, and how to take care of the out-of-set data.

0:01:30 | In this task there are fifty target languages; some of them are very similar to each other, and some are different.

0:01:40 | This is the cost function of the challenge. From the cost function we can roughly assume that one quarter of the test set, and of the unlabeled data, is out-of-set.

0:01:59 | Okay. First I want to discuss how we can use the unlabeled data for training in the framework of deep learning.

0:02:07 | The standard way to use unlabeled data is as pre-training: instead of a random initialization of the network, we use the unlabeled data to pre-train it.

0:02:23 | There are two popular methods to do pre-training: one is stacking restricted Boltzmann machines, and the second is based on denoising autoencoders.

0:02:36 | But in both of them pre-training is only used to initialize the network, and afterwards the fine-tuning forgets about the unlabeled data.

0:02:46 | Let me briefly recall how a denoising autoencoder works: we take a data point, add noise to it, and try to reconstruct from the noisy version something that is close to the clean data.
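As a concrete illustration of the idea just described, here is a minimal sketch of a tied-weight denoising autoencoder in NumPy. It is not the speaker's implementation; the dimensions, noise level, and learning rate are illustrative assumptions. The key point it demonstrates is that the reconstruction error is measured against the clean input, not the corrupted one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_step(x, W, b, c, noise_std=0.3, lr=0.05):
    """One gradient step of a tied-weight denoising autoencoder.

    Corrupt x with additive Gaussian noise, encode, decode, and minimize
    the squared reconstruction error against the CLEAN input x.
    """
    x_tilde = x + noise_std * rng.standard_normal(x.shape)
    h = sigmoid(x_tilde @ W + b)        # encoder on the noisy input
    x_hat = h @ W.T + c                 # tied-weight linear decoder
    err = x_hat - x                     # compare to the clean data
    dh = (err @ W) * h * (1.0 - h)      # backprop through the encoder
    W -= lr * (err.T @ h + x_tilde.T @ dh) / len(x)
    b -= lr * dh.mean(axis=0)
    c -= lr * err.mean(axis=0)
    return float(np.mean(err ** 2))

# Toy usage: the clean-reconstruction error shrinks as training proceeds.
x = rng.standard_normal((64, 8))
W = 0.1 * rng.standard_normal((8, 4))
b, c = np.zeros(4), np.zeros(8)
losses = [dae_step(x, W, b, c) for _ in range(300)]
```

Because the hidden layer is narrower than the input, the model cannot simply memorize the noise and is pushed toward the structure of the clean data.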

0:03:07 | In our approach we use a generalization of the denoising autoencoder called the ladder network.

0:03:18 | The denoising is applied not just to the input data points but across the entire network: the cost function includes not only the reconstruction of the input, but also the reconstruction of the hidden layers.

0:03:37 | I will explain how in more detail.

0:03:40 | This is the architecture. It is a standard feed-forward network with a softmax on top for classification, and this is what we apply to the labeled data.

0:04:00 | For the unlabeled data we use the same network, with the same parameters, but at each step we add noise: to the input data and, importantly, also to each of the hidden layers, and then we try to reconstruct the hidden layers.

0:04:28 | Each layer is reconstructed from its noisy version and, as I will describe, the cost function requires that this reconstruction be very close to the clean hidden layers.

0:04:45 | More formally, we have an encoder and a decoder: in the encoder, noise is added to each hidden layer at each step, and in the decoder we reconstruct the denoised hidden layers.

0:05:08 | To be more specific, the main component of this model is the denoising function: how we reconstruct a hidden layer based on its noisy version and on the previously reconstructed layer above it.

0:05:25 | Since we assume that additive Gaussian noise is applied to the hidden layers, we use a linear estimation: we estimate the hidden layer as a linear function of the noisy hidden layer,

0:05:49 | where the coefficients are taken, in a nonlinear way, from the previous reconstruction. Each coefficient is computed by a linear function plus a sigmoid function, applied per unit; the gating concept is similar to LSTMs.
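The denoising function just described can be sketched as follows, assuming the common ten-parameter combinator from the ladder-network literature (the exact parametrization used by the speaker is not stated in the talk). The noisy lateral signal `z_noisy` is shifted and scaled by quantities `mu` and `v`, each a linear-plus-sigmoid function of the vertical signal `u` from the reconstruction above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoise(z_noisy, u, a):
    """Ladder-network style denoising function (vectorized, per unit).

    z_noisy : lateral input, the noisy encoder activation of this layer.
    u       : vertical input, the reconstruction from the layer above.
    a       : ten parameters; a linear term plus a sigmoid term modulates
              both the shift (mu) and the gain (v) of the estimate.
    """
    mu = a[0] * sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
    v  = a[5] * sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
    # Linear (Gaussian-style) estimate: shift by mu, scale by v, shift back.
    return (z_noisy - mu) * v + mu
```

Two extremes show the gating at work: parameters giving `mu = 0, v = 1` pass the noisy lateral signal through unchanged, while `mu = u, v = 0` copy the vertical reconstruction and ignore the noisy signal. Training learns, per unit, how much to trust each source.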

0:06:15 | The intuition is that we reconstruct the noisy hidden layer based on the previously reconstructed layer above it.

0:06:32 | So now we have training data that consists of both labeled data and unlabeled data.

0:06:44 | The training cost function will be the standard cross-entropy applied on the labeled data, plus a reconstruction error applied on all the data, where by reconstruction I mean reconstructing not just the input but each of the hidden layers.
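The combined cost just described can be written down directly. This is a schematic sketch, not the speaker's code; the per-layer weights `lambdas` are hyperparameters I introduce for illustration:

```python
import numpy as np

def cross_entropy(probs, labels):
    # Supervised term: mean negative log-probability of the true class,
    # computed only on the labeled part of the batch.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def reconstruction_cost(clean_layers, denoised_layers, lambdas):
    # Unsupervised term: weighted squared error between the clean-pass
    # activations and the denoised reconstructions, summed over all layers
    # (input included), computed on labeled and unlabeled data alike.
    return sum(lam * np.mean((zc - zd) ** 2)
               for lam, zc, zd in zip(lambdas, clean_layers, denoised_layers))

def ladder_loss(probs, labels, clean_layers, denoised_layers, lambdas):
    return cross_entropy(probs, labels) + reconstruction_cost(
        clean_layers, denoised_layers, lambdas)
```

When the reconstructions match the clean activations exactly, the unsupervised term vanishes and only the supervised cross-entropy remains.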

0:07:10 | Going back to this picture: for the unlabeled data, we inject the noisy version into the network and then try to reconstruct it such that the reconstruction will be very similar to the clean data.

0:07:32 | In this way we use the unlabeled data not just as pre-training that is then forgotten; rather, the training data of the neural network explicitly comprises both the labeled and the unlabeled data.

0:07:54 | This is an illustration of the power of ladder networks. These are results on the standard MNIST task: this axis is the number of labeled examples and this is the classification error.

0:08:12 | We can see that using ladder networks, with only something like one or two hundred labeled examples, we can reach performance close to that of full supervision, while all the other images are used as unlabeled data.

0:08:35 | Okay, so this is the idea of ladder networks, which we apply to this challenge.

0:08:46 | Now I want to discuss how we can incorporate the out-of-set data in this framework.

0:08:52 | We use the same network architecture but add another class, a fifty-first one: fifty classes, one for each of the known languages, plus an out-of-set class.

0:09:10 | How can we train the out-of-set label? For that we use a label distribution regularization.

0:09:20 | What do we mean? Assuming we do batch training, we can compute the frequency of the classifier's decisions over the batch: count how many times we classified the language as English, how many times as Hindi, and so on, and how many times we classified it as out-of-set.

0:09:52 | And we have a rough estimation of what this histogram, this distribution, should be: all languages should appear roughly uniformly, and out-of-set should be roughly one quarter of the data.

0:10:14 | So we can use a cross-entropy score function to define a score on the label distribution of the classifier.

0:10:24 | The main point is that we can do this because there is out-of-set in the unlabeled data: in this challenge the unlabeled data contains out-of-set examples, so we can assume that some of the predicted labels should be out-of-set.
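A minimal sketch of such a label-distribution score, under the assumptions stated in the talk (fifty in-set languages appearing roughly uniformly, out-of-set taking roughly one quarter). The exact form of the prior and the use of averaged posteriors rather than hard decision counts are my assumptions for illustration:

```python
import numpy as np

def label_distribution_score(probs, prior):
    """Cross-entropy between an assumed prior over labels and the
    classifier's average output distribution on a training batch.

    probs : (batch, n_classes) softmax outputs.
    prior : (n_classes,) assumed label distribution.
    """
    batch_dist = probs.mean(axis=0)          # empirical decision distribution
    return -np.sum(prior * np.log(batch_dist + 1e-12))

# Assumed prior for this task: the 50 in-set languages share the remaining
# three quarters of the mass uniformly, and out-of-set (the 51st class,
# index 50) takes one quarter.
prior = np.concatenate([np.full(50, 0.75 / 50), [0.25]])
```

The score is minimized when the batch-level decision distribution matches the prior, so adding it to the loss pushes the classifier to assign roughly one quarter of the unlabeled data to the out-of-set class even though that class has no labeled examples.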

0:10:50 | So this is an additional cost function. We have two cost functions: one is the ladder cost function, which is the semi-supervised cost, and the other one is the discrepancy cost on the label distribution.

0:11:11 | Okay, now I go to the experiments. The input is the i-vectors; we use a feed-forward network with several hidden layers and a softmax output with fifty-one classes.
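To make the architecture concrete, here is a plain forward pass over i-vector inputs ending in a 51-way softmax. The layer sizes, the 400-dimensional input, and the sigmoid activations are illustrative assumptions, not the configuration reported in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def forward(x, weights, biases):
    # Feed-forward pass: nonlinear hidden layers, then a 51-way softmax
    # output (50 languages plus one out-of-set class).
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return softmax(h @ weights[-1] + biases[-1])

# Illustrative dimensions: 400-dim i-vectors, two hidden layers of 500 units.
dims = [400, 500, 500, 51]
weights = [0.05 * rng.standard_normal((d_in, d_out))
           for d_in, d_out in zip(dims[:-1], dims[1:])]
biases = [np.zeros(d) for d in dims[1:]]
probs = forward(rng.standard_normal((4, 400)), weights, biases)
```

Each row of `probs` is a distribution over the fifty languages plus the out-of-set class; it is this output that feeds both the supervised cross-entropy and the label-distribution score.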

0:11:31 | This is the experimental setup, a simulation: we take some of the languages as in-set and the other languages as out-of-set. Because it is a simulation, we know all the labels.

0:11:49 | Here is an example of what happens if we use the baseline without the ladder, and if we add the ladder cost: we can see that we gain a significant improvement by using the unlabeled data.

0:12:10 | The price of doing it is that the network is more difficult to learn: we need to run more epochs, but that is not a big issue.

0:12:26 | These are the results. We either use the ladder cost or not, and either use the label-statistics score or not; this is the baseline for our case.

0:12:48 | If we use the ladder we get an improvement. If we use the label statistics we also get an improvement, but not as much. And if we combine the two strategies, the first for the unlabeled data and the second for the out-of-set, we gain a significant improvement.

0:13:15 | We also looked at the out-of-set statistics that the system provides. For example, here we classified about thirty percent of the development set as out-of-set,

0:13:39 | so we tried to adjust the number of out-of-set decisions to be one quarter, because we know that roughly this should be the number.

0:13:54 | In the baseline this adjustment gave an improvement, but here it does not; actually the performance decreases. So still, the combined system gave the best results.

0:14:12 | To conclude, we tried to apply recent deep learning strategies that take care of both challenges of this task: the unlabeled data and the out-of-set data.

0:14:28 | For the unlabeled data we used the ladder network, which explicitly takes the unlabeled data into account while training. For the out-of-set we used a label distribution score that is also used in the training.

0:14:45 | We showed that by combining these two methodologies we can improve the results.

0:14:54 | Okay, thank you.

0:15:01 | We have time for questions.

0:15:13 | [Question] Can you tell us exactly how much the unsupervised data helped you in training? For example, I imagine you could also apply the reconstruction to the same labeled training data, as a regularization of the classification. Did you compare the two, to separate the regularization effect from the supervised versus unsupervised effect, and measure how much you gain from the unsupervised data?

0:15:42 | [Answer] It's a good question.

0:15:43 | I didn't try that, but the ladder indeed also acts as a regularization. You could think of the dropout strategy as something similar, but I'm not sure. We did try it, and if I remember correctly it helps. But anyhow, we need the unlabeled data because it contains the out-of-set examples.

0:16:33 | [Question] I want to know whether you applied some kind of pre-processing to the i-vectors, such as normalization.

0:16:46 | [Answer] The i-vectors were provided by NIST. Maybe they did some preprocessing, I don't know; we used the raw data as provided.

0:17:11 | If there are no other questions, let's thank the speaker again.

0:17:18 | So I think we're at the end of the session. I think we have a few...