0:00:15 Okay. So this is joint work of mine. First I will describe the challenge, then explain our approach, which is based on ladder networks, show how it can be applied to the challenge, and then show experiments.
0:00:39 So we start with the challenge. We have some labeled data, a lot of unlabeled data, and a development set. Essentially it is a standard classification task, but there are two differences. First, we have unlabeled data that we need to use as part of the classification training, and second, we have out-of-set data that we want to take care of. In this talk I will focus on these two challenges: how to use the unlabeled data as part of training, and how to take care of the out-of-set data. There are fifty in-set languages; some of them are very similar to each other and some are different. From the challenge's cost function we can roughly assume that one quarter of the test set, and of the unlabeled data, is out-of-set.
0:01:59 First I want to discuss how we can use the unlabeled data for training in a deep-learning framework. The standard way to use unlabeled data is pre-training: instead of a random initialization of the network, we use the unlabeled data to pre-train the network. There are two popular ways to do pre-training: the first is stacking restricted Boltzmann machines, and the second is based on denoising autoencoders. But in both of them, pre-training is only used to initialize the network, and afterwards the unlabeled data is forgotten.
0:02:44 Let me briefly recall how a denoising autoencoder works: we take a data point, add noise to it, and try to reconstruct from the noisy version a clean output that is similar to the original, clean data.
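As a toy sketch of this idea (a single hidden layer with squared-error reconstruction; all names, sizes, and the noise level here are illustrative, not the system from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoising_ae_loss(x, W_enc, W_dec, noise_std=0.3):
    """Corrupt the input, encode and decode it, and score the
    reconstruction against the *clean* input -- the defining
    trait of a denoising autoencoder."""
    x_noisy = x + noise_std * rng.normal(size=x.shape)  # corrupt
    h = sigmoid(x_noisy @ W_enc)                        # encode
    x_rec = h @ W_dec                                   # decode
    return np.mean((x_rec - x) ** 2)                    # clean target

x = rng.normal(size=(16, 8))
W_enc = rng.normal(scale=0.1, size=(8, 4))
W_dec = rng.normal(scale=0.1, size=(4, 8))
loss = denoising_ae_loss(x, W_enc, W_dec)
```

In training, the weights would be updated to minimize this loss; only the forward computation is shown here.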
0:03:07 In our approach we use a generalization of the denoising autoencoder called the ladder network. The idea is not to apply the denoising just to the data points but across the entire network: the cost function includes not only the reconstruction of the input but also the reconstruction of the hidden layers. I will explain how in more detail.
0:03:40 So this is the setup. This is a standard feed-forward network with a softmax classification layer on top; this is the architecture we apply to the labeled data. In the case of unlabeled data we use the same network with the same parameters, but at each step we add noise to the data and, importantly, we add noise to each of the hidden layers, and then we try to denoise, that is, to reconstruct the hidden layers. Each hidden layer is reconstructed from its noisy version and from the reconstruction of the layer above, and the cost function is the reconstruction error, so the reconstruction should be very close to the clean hidden layers.
0:04:45 This is the formal description: we have an encoder and a decoder. In the encoder, each hidden layer is corrupted with noise at each step; in the decoder we reconstruct, that is denoise, the hidden layers.
0:05:08 To be more specific, the main problem in this model is the denoising function: how we can reconstruct a hidden layer based on the noisy version of that layer and, of course, the previously reconstructed layer above it. Since we assume additive Gaussian noise on the hidden layers, we use a linear estimation: the clean hidden layer is estimated as a linear function of the noisy hidden layer. The coefficients of this linear function are not fixed; they are derived from the previous reconstruction, that is the layer above, through a linear function plus a sigmoid, and we do this for each unit separately. So the intuition is that we reconstruct each noisy hidden layer based on the previously reconstructed layer above it.
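In code, a per-unit denoising function of this form ("linear in the noisy layer, with coefficients computed from the layer above by a linear-plus-sigmoid function") could look like the following sketch; the parameter vector `a` and its exact parameterisation are illustrative, not necessarily the talk's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoise(z_tilde, u, a):
    """Estimate the clean layer from the noisy layer z_tilde, linearly,
    with coefficients mu and v computed per unit from the
    reconstruction u of the layer above via linear-plus-sigmoid."""
    mu = a[0] * sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
    v  = a[5] * sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
    return (z_tilde - mu) * v + mu

z_tilde = np.array([0.5, -1.0, 2.0])   # noisy hidden activations
u = np.array([0.1, 0.0, -0.2])         # reconstruction of the layer above
a = np.zeros(10); a[9] = 1.0           # v = 1, mu = 0: identity denoiser
z_hat = denoise(z_tilde, u, a)
```

In the ladder network the parameters of this function are learned jointly with the rest of the network, so the decoder can decide per unit how much to trust the noisy observation versus the top-down signal.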
0:06:32 So now we have training data that consists of both labeled data and unlabeled data. The training cost function is the standard cross-entropy applied to the labeled data, plus a reconstruction error applied to the unlabeled data, and by reconstruction I mean not just of the input but of each of the hidden layers.
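A minimal sketch of this combined objective; the per-layer reconstruction errors and their weights `lambdas` stand in for the ladder decoder costs, and the names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def semi_supervised_loss(logits_labeled, y, recon_errors, lambdas):
    """Cross-entropy on the labeled batch plus a weighted sum of
    per-layer reconstruction errors on the unlabeled batch."""
    p = softmax(logits_labeled)
    ce = -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))
    recon = sum(lam * err for lam, err in zip(lambdas, recon_errors))
    return ce + recon

logits = np.array([[5.0, 0.0], [0.0, 5.0]])   # confident, correct
loss = semi_supervised_loss(logits, np.array([0, 1]),
                            recon_errors=[0.1, 0.2], lambdas=[1.0, 0.5])
```

Both terms are minimized jointly, so the unlabeled data shapes the same parameters that the classifier uses.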
0:07:04 So if we go back to this picture: for the unlabeled data, we inject the noisy version into the network and then try to reconstruct it such that it will be very similar to the noise-free data. In this way we use the unlabeled data not just as pre-training that we then forget about, but as part of the training itself: the training data of the neural network is explicitly composed of both the labeled and the unlabeled data.
0:07:54 This is an illustration of the power of ladder networks. This is a result on the standard MNIST benchmark: the x-axis is the number of labels and the y-axis is the classification error. We can see that using ladder networks we can obtain good performance based on only something like one or two hundred labeled examples, while all the other images are unlabeled.
0:08:35 Okay, so this is the idea of ladder networks, which we will now apply to this challenge.
0:08:46 Now we want to discuss how we can incorporate out-of-set data into this framework. We use the same network architecture, but we add another class: we have fifty classes, one for each of the known languages, and one additional out-of-set class. How can we train the out-of-set label? For that we use a label-distribution regularization.
0:09:20 What do we mean? Assuming we do batch training, we can compute the frequency of each label assigned by the classifier: we can count how many times we classified the language as English, how many times we classified it as Hindi, and so on, and how many times we classified it as out-of-set. And we have a rough estimate of what this histogram, this distribution, should be: all the in-set languages should appear roughly uniformly, and out-of-set should be roughly one quarter of the data.
0:10:12 So we can use a cross-entropy score function that measures the discrepancy between the label distribution of the classifier and this prior. The main point is that we can do this even though we have no labeled out-of-set examples: in this challenge the out-of-set data appears inside the unlabeled data, so we can assume that for the data we adapt with, some of the labels should be out-of-set.
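On a batch of unlabeled data, this score could be sketched as follows. The 51-way prior, with three quarters of the mass spread over the 50 in-set languages and one quarter on the out-of-set class, follows the rough numbers given in the talk; function names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def label_distribution_score(logits_unlabeled):
    """Cross-entropy between the assumed prior over labels and the
    mean predicted class distribution of an unlabeled batch."""
    prior = np.full(51, 0.75 / 50)   # 50 in-set languages, uniform
    prior[50] = 0.25                 # out-of-set class: one quarter
    batch_mean = softmax(logits_unlabeled).mean(axis=0)
    return -np.sum(prior * np.log(batch_mean + 1e-12))

# The score is minimized when the batch's mean prediction matches the prior:
prior = np.full(51, 0.75 / 50); prior[50] = 0.25
matched = label_distribution_score(np.tile(np.log(prior), (4, 1)))
```

Because the score is computed on the batch average rather than per example, it only pushes the overall decision statistics toward the prior; individual examples can still be classified freely.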
0:10:50 So altogether these are the cost functions we have: one is the ladder cost function, which includes the supervised cost function, and the other one is the discrepancy score of the label distributions.
0:11:11 Okay, now I move to the experiments. The network we use takes the i-vectors as input, has fully connected hidden layers, and has a softmax output with fifty-one classes.
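The rough shape of such a classifier is sketched below; the layer widths and the 400-dimensional i-vector input are illustrative assumptions, not the exact configuration reported in the talk:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(x, 0.0)

def forward(ivec, Ws):
    """Feed-forward pass: ReLU hidden layers, then a 51-way softmax
    output (50 in-set languages plus one out-of-set class)."""
    h = ivec
    for W in Ws[:-1]:
        h = relu(h @ W)
    logits = h @ Ws[-1]
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Illustrative sizes: 400-dim i-vector input, two hidden layers, 51 outputs.
shapes = [(400, 500), (500, 500), (500, 51)]
Ws = [rng.normal(scale=0.1, size=s) for s in shapes]
probs = forward(rng.normal(size=(8, 400)), Ws)
```

Only the clean forward pass is shown; in the ladder setup a corrupted copy of this pass and a denoising decoder run alongside it.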
0:11:31 This is the experimental setup for the simulation: we took a subset of the languages as in-set and the other languages as out-of-set, so in this simulation we know all the labels.
0:11:49 And here is an example of what happens if we use the baseline, without the ladder, and what happens if we add the ladder score: we can see that we gain a significant improvement using the unlabeled data. The price of doing it is that it is more difficult to learn: we need to do more epochs, but that is not a big issue.
0:12:26 These are the results on the progress set. We have four configurations: either using the ladder or not, and either using the label-statistics score or not. This is the baseline for our case. If we use the ladder, we get an improvement. If we use the label statistics, we also get an improvement, but not as much. And if we combine the two strategies, the first one for the unlabeled data and the second one for the out-of-set, we gain significantly.
0:13:15 A remaining problem is the out-of-set statistics that the system provides. For example, here we classify about thirty percent of the development set as out-of-set, so we tried to adjust the number of out-of-set decisions to be one quarter, because we know that this should roughly be the number. In the baseline we got an improvement from this adjustment, but here it doesn't help; actually the performance decreases. So without the adjustment we still got the best results.
0:14:12 Okay, so to conclude: we tried to apply recent deep-learning strategies to take care of the two challenging parts of this task, the unlabeled data and the out-of-set data inside the unlabeled data. For the unlabeled data we used the ladder network, which explicitly takes the unlabeled data into account while training. For the out-of-set we used a label-distribution score that is also used in the training. We showed that with these two methodologies we can improve the results. Okay, thank you.
0:15:01 We have time for questions.
0:15:13 Q: Can you tell us exactly how much the unsupervised data helps you in the training? For example, I imagine you could do the autoencoder reconstruction on the same training data that you have, like a regularization on top of the classification. Did you compare adding it as a regularization on the supervised data versus on the unsupervised data, to measure how much you gain from the unsupervised data?
0:15:42 A: It's a good question. I didn't check exactly, but the ladder can indeed be used also as a regularization; I think it is related to the dropout strategy, but I'm not sure. We did try it, and if I remember correctly it helps. But anyhow we need the unlabeled data, because it contains the out-of-set data.
0:16:28 Q: I want to know if you were applying some kind of pre-processing to the i-vectors, some kind of normalization?
0:16:49 A: The i-vectors were provided by NIST. Maybe they did some preprocessing, I don't know, but we used the raw data.
0:17:11 If there are no other questions, let's thank the speaker again. So I think we're at the end of the session; I think we have a few…