0:00:15All right, thank you for this introduction. I would also like to thank the
0:00:20organizers for inviting me to give this presentation and to
0:00:25present my latest work, and also for bringing us here to
0:00:28this wonderful location. It was an amazing week.
0:00:30The weather was very good,
0:00:32and the social events had many things going on, so I got some exercise, which is good.
0:00:40It was a very enjoyable week to talk to
0:00:44people, meet them, have discussions, and exchange ideas, so that was wonderful.
0:00:48It was also great to
0:00:50see this region of the Basque Country,
0:00:53so hopefully we will come back to visit as tourists
0:00:56if we have the chance. Today I am presenting some of my latest work about using
0:01:02some kind of i-vectors to model the hidden layers of
0:01:07the DNN and to explore the information in the hidden layers. Usually,
0:01:13the way we use DNNs now, we either look at
0:01:17the output of the DNN to make some decisions,
0:01:20or we use bottleneck features: we take one of
0:01:24the hidden layers as a bottleneck feature and do some classification with it.
0:01:28Unfortunately,
0:01:29not much work has been proposed that looks at
0:01:34the whole path through the DNN.
0:01:36I believe there is some information that we are not
0:01:40exploring and using in the DNN: the pattern of activation,
0:01:45how the information propagates through the DNN. That is what I am going to
0:01:49talk about today,
0:01:50and I will show some results.
0:01:54This is the outline of my talk. I will start with an introduction, and after
0:01:58that I will move
0:02:00slowly to my latest work. Before that I will
0:02:04introduce i-vectors, which I do not really need to do, because a lot of people
0:02:07here probably know them better than me.
0:02:11As you know, i-vectors are based on the GMM, so
0:02:14the first part will be about how we use them with the GMM.
0:02:17I will present them for GMM mean adaptation,
0:02:21and I will show two case studies, speaker recognition and language recognition. Here I am
0:02:25not telling you
0:02:27how to build your language or speaker recognition system; I just want to tell
0:02:30you that with i-vectors we can do some visualization and
0:02:34see some very interesting behavior of the data: how the channel and
0:02:38the recording conditions can affect
0:02:40a speaker recognition system if you do not do any channel compensation,
0:02:43and, for language recognition, how close the languages are in a data-driven
0:02:47visualization. After that I will move on.
0:02:51Then comes the direction of how we can use discrete i-vectors to model
0:02:55GMM weight adaptation. This is work that started with one
0:02:59of Hugo Van hamme's students, Hasan Bahari, who
0:03:03was at KU Leuven in Belgium. He visited me
0:03:07at MIT for six months,
0:03:09and we started working on GMM weight adaptation for language ID.
0:03:13After that, DNNs started progressing and taking over the field,
0:03:18and that is when I started thinking that maybe these discrete i-vectors can also be used
0:03:22to model the posterior distributions of DNNs.
0:03:25That is the second part of the talk.
0:03:28I started looking at how the DNN represents information in its hidden layers,
0:03:33because a lot of work in vision shows that you can recognize
0:03:37what a neuron models, for example a cat's face from YouTube videos or something like that.
0:03:41Can we do something like that for speech?
0:03:43That is how I started thinking about using an i-vector representation to model the hidden layers.
0:03:48That is why
0:03:48we then show, for example, how the accuracy improves as you go
0:03:52deeper in the DNN, for a language ID task,
0:03:56how it gets better,
0:03:57and also how we can model the pattern of activation, the progression of
0:04:00the information through the DNN.
0:04:04So if you feel that one hour is too much for you to sit
0:04:07here and you want to step out, you should skip the
0:04:11first part, the GMM part; the second part may be more interesting for you.
0:04:16I will not be offended if you want to do that.
0:04:18After that I will finish by giving some conclusions of the work.
0:04:23As you know, i-vectors have been widely used. It is a nice way to work:
0:04:28it is a compact representation that nicely summarizes and describes what is happening in
0:04:34a given recording.
0:04:35It has been widely used for different tasks:
0:04:39speaker recognition, language recognition, speaker diarization,
0:04:41and speech recognition. The i-vector was originally related to GMM mean adaptation.
0:04:47As I said, lately I have also been interested in GMM weight adaptation
0:04:51using i-vectors, and after that I moved on to use
0:04:55this to build DNN-based i-vectors
0:04:59that model the DNN activations.
0:05:03So let me slowly take you
0:05:08to my latest work.
0:05:11In speech processing, what you usually have is a recording.
0:05:15You take this recording and transform it to get some features.
0:05:18Then, based on the complexity of the distribution of the features, you build a GMM. Classically,
0:05:23you fit the GMM on top of these features to maximize the likelihood of the data.
0:05:27So, you know,
0:05:29a GMM is defined by Gaussians, and each Gaussian has a weight, a mean,
0:05:34and a covariance matrix that describe it.
0:05:38This is how the i-vector came about, in the context of
0:05:42speaker recognition. The way we were doing it in the early 2000s, and that is
0:05:46how it all started,
0:05:47is that we take a lot of non-target speakers and train a large Gaussian mixture model.
0:05:53Then, because we sometimes do not have many recordings from the
0:05:57same speaker, we do maximum a posteriori adaptation: we shift
0:06:01the universal background model, which is a kind of prior on how all the sounds
0:06:05look, in the direction of the
0:06:07target speaker.
0:06:08That is what led to GMM
0:06:15supervectors, because people found out that
0:06:18adapting only the means is enough. So the
0:06:23mean shift from this universal background model, the large GMM trained on a lot
0:06:27of data,
0:06:28in the direction of the target speaker can be characterized as something that happened in the recording
0:06:32and caused that shift.
0:06:34A lot of people started to model this shift, for example Patrick Kenny with joint
0:06:38factor analysis, which tries to
0:06:39separate the speaker and the channel
0:06:42in the GMM supervector, and also Bill Campbell with SVMs,
0:06:47using the GMM supervector as input to the SVM to model the
0:06:51separation between speakers.
0:06:53It is in this atmosphere that i-vectors came out as well.
0:06:57In the i-vector approach you have a GMM supervector space, and the UBM is one point there.
0:07:02We have one recording, so we shift
0:07:07the UBM toward this new recording. If you have several recordings,
0:07:12you have different points in this space. The i-vector extractor models the variability between
0:07:16all these recordings
0:07:17in a low-dimensional space
0:07:20whose origin is the UBM.
0:07:23Every new recording can be mapped to this space, and now we
0:07:27can represent any recording by a vector of fixed length.
0:07:31This can be modeled by this equation: we have the universal background
0:07:35model here in the middle, and each recording's GMM supervector can be explained by the UBM
0:07:41plus an offset,
0:07:43where the offset describes
0:07:45what happened in the recording and is given by the i-vector in the total variability
0:07:49space, the i-vector space.
0:07:51Once you have trained the total variability matrix, when
0:07:56you have a new recording you extract the features, and after that you
0:07:59map them to your subspace. I am sure you are all familiar with that.
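The equation on the slide is not part of the transcript. In the notation commonly used in the i-vector literature (the symbols below are an assumption, not taken from the talk), the "UBM plus an offset" model reads:

```latex
\[
  M = m + T\,w
\]
% M : GMM mean supervector of the recording
% m : UBM mean supervector
% T : low-rank total variability matrix
% w : the i-vector, a fixed-length vector for the recording
```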
0:08:03Now, I am not going to tell you how to do
0:08:07speaker recognition; you have seen a lot of good
0:08:10talks on that during this wonderful
0:08:13conference. Instead I will show you how we can
0:08:16do visualization with it.
0:08:18First of all, for speaker recognition, i-vectors have been applied to different kinds of
0:08:22speaker modeling tasks: speaker identification, where you have a
0:08:26set of speakers and a recording, and you want to identify who
0:08:30spoke in this segment; speaker verification, where you want to verify that
0:08:35two recordings are coming from the same speaker; or diarization,
0:08:38where you want to know who spoke when.
0:08:40For the speaker recognition task, I would like to show some visualizations
0:08:46that explain what is happening in the data if you do not do any
0:08:49channel compensation. For that,
0:08:51I would like to acknowledge the work of a colleague who was a PhD student
0:08:55with Bill Campbell at MIT at the time and who was working on this.
0:09:00We took the system from the NIST 2010
0:09:04speaker recognition evaluation, which was based on i-vectors. The
0:09:08system we built at the time was a single system designed to
0:09:12deal with telephone and microphone data in the same subspace.
0:09:15We took about five thousand recordings from that data and
0:09:21built a cosine similarity between all the recordings.
0:09:24He took this similarity matrix and built
0:09:28a ten-nearest-neighbor graph, so each node is connected to
0:09:31its ten nearest neighbors,
0:09:33and then used software called GUESS to do the graph visualization.
0:09:37In this graph, the absolute location of a node is not
0:09:41important, but the relative distance between the nodes
0:09:44and the clusters is important, because
0:09:46it reflects how close they are and how structured your data is.
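The graph construction described here (cosine similarity between all recordings, then a nearest-neighbor graph) can be sketched as follows. This is a minimal illustration, not the code used for the talk: the function names are hypothetical, the toy vectors stand in for real i-vectors, and the layout step done with GUESS is left out.

```python
import numpy as np

def cosine_similarity_matrix(ivectors):
    """Pairwise cosine similarity between the rows of an (n, d) array."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    unit = ivectors / np.maximum(norms, 1e-12)
    return unit @ unit.T

def knn_graph(similarity, k=10):
    """Undirected edges (i, j), i < j, linking each node to its k most similar nodes."""
    sim = similarity.copy()
    np.fill_diagonal(sim, -np.inf)  # a node is never its own neighbor
    edges = set()
    for i in range(sim.shape[0]):
        neighbors = np.argsort(sim[i])[::-1][:k]
        for j in neighbors:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

# Toy stand-in for i-vectors: two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
toy = np.vstack([c + 0.1 * rng.standard_normal((20, 3)) for c in centers])
edges = knn_graph(cosine_similarity_matrix(toy), k=3)
```

In this toy case every edge stays inside one of the two groups, which is the kind of cluster structure the visualization is meant to expose.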
0:09:50So here
0:09:53is the female data, with the intersession, the channel compensation,
0:10:00applied. The colors are by speaker:
0:10:03each point corresponds to a recording, and each cluster
0:10:08corresponds to a speaker.
0:10:09so for people that actually want to the museum and since all are this early
0:10:13week can you do i mean what's the at this was thinking twenty been this
0:10:21What we did next was to say, okay,
0:10:25now let us remove the channel compensation and see what happens. Well, we
0:10:29lost the speaker clustering,
0:10:31and some other clusters appeared.
0:10:35I asked, well, what is going on here? So we went
0:10:39together and looked
0:10:41at the labels to see what was going on. For example, here
0:10:45we checked all the microphones used for the different recordings,
0:10:49which microphone was used to record each of them, and we found that
0:10:54the clusters actually correspond to the microphone that was used.
0:10:57That was pretty surprising. For example, here
0:11:00is the telephone data, which forms one group, and these are the microphones.
0:11:05We also found that for the
0:11:08same microphone there are two clusters, and that is because of the room:
0:11:13at that time the LDC used two rooms to collect the data, so the two rooms
0:11:18were also reflected in the data.
0:11:20This is just a visualization to show that, you know, I do not want
0:11:23to give you an equal error rate going from two to one point five or whatever; I want
0:11:26to tell you that if you do not do anything about the microphone, no channel compensation,
0:11:30it may be a big issue.
0:11:31So this is what happens:
0:11:33the data can be affected by the microphone, by the channel,
0:11:37and also by the room in which it was recorded.
0:11:42This is what we get when we turn on the channel compensation.
0:11:47The clustering is by speaker, while the coloring is
0:11:52by channel, so the channel compensation is doing a good job of
0:11:56normalizing this.
0:11:57That was the female data; this is the male data.
0:12:01The clusters are different, but it is the same thing,
0:12:05and we see the same behavior. This is the
0:12:08one with the microphone data, which is the most interesting,
0:12:11and you can still see the split between room one and room
0:12:15two that the LDC used to collect the data.
0:12:19So this is a unique visualization
0:12:21that has been very helpful for us to understand, and to show
0:12:26people, that what we are doing makes sense,
0:12:29and how we can still benefit from
0:12:36microphone and channel compensation.
0:12:39So this is the same thing. After that, when we were
0:12:43doing language ID in 2011, I started looking at the language ID task,
0:12:47and I tried to do the same thing for visualization. In the language
0:12:52recognition task we have verification and identification, so I do not need
0:12:56to spend too much time on that. What I did here is that
0:13:00I took NIST 2009. The i-vector extractor was trained on the
0:13:03training data, but that does not matter here.
0:13:07I took two hundred recordings for each language; I think we have about twenty-three
0:13:12languages.
0:13:13Then I did the same thing: I built the cosine similarity,
0:13:20built a nearest-neighbor graph, and tried to visualize it. This is what you get
0:13:24for this language recognition task. It is interesting because we have,
0:13:29for example,
0:13:33English and Indian English close together;
0:13:35we have Indian English, Hindi, and Urdu, which are very
0:13:39close together;
0:13:41Mandarin, Cantonese, Vietnamese, and Korean
0:13:44are almost in the same cluster.
0:13:48Here Russian, Ukrainian, Bosnian, Croatian, and Georgian
0:13:53are in the same cluster, and also French and Creole.
0:13:56So it is really a data-driven
0:13:58visualization that shows you how close the languages are
0:14:04in terms of the acoustics,
0:14:05which are what we primarily use to build the i-vector representation.
0:14:09So this is what,
0:14:12you know, i-vectors allow you to do, because you have this representation; at the time it was used with
0:14:15cosine distance, and you can apply LDA to it as well.
0:14:20With i-vectors we can represent the data, see what is happening in
0:14:25the data, and interpret
0:14:26what phenomenon is going on.
0:14:28So this visualization was a good tool for that.
0:14:31Now let me try to move on, because I
0:14:34know that you are all familiar with i-vectors and I do not want
0:14:37to spend too much time on them. You probably prefer that we move to the more interesting
0:14:41topic of this talk. After that I started looking at
0:14:46GMM weight adaptation, as I said, with the student Hasan Bahari.
0:14:50For GMM weights there are several techniques
0:14:56that have been applied.
0:14:57For example, maximum likelihood,
0:14:59the simplest way;
0:15:01also non-negative matrix factorization, which Hugo Van hamme
0:15:06was working on; the subspace multinomial model,
0:15:10which is what Marcel Kockmann and the BUT people use;
0:15:14and what we proposed, which is called non-negative factor analysis. You know,
0:15:18GMM weight adaptation is a little bit tricky, because you have the non-negativity of the
0:15:23weights, and they should also sum to one. These are constraints that
0:15:26you have to deal with
0:15:27during the optimization, when you are training
0:15:30your subspace.
0:15:32So
0:15:33this is how weight adaptation works. For example, you have a set of features for one
0:15:37recording, a sequence of features,
0:15:39and you have a UBM. You compute the posterior distribution
0:15:43of a given component for a given frame,
0:15:49given the UBM. So you get these posteriors, and then
0:15:53you accumulate them over time and get the counts
0:15:55from that.
0:15:56In order to get the GMM weight adaptation,
0:16:00you try to maximize the auxiliary function given here.
0:16:02If you want to do maximum likelihood, what you do is
0:16:06accumulate all these posteriors over time and divide by the number of frames that
0:16:11you have, and that gives you the maximum likelihood estimate.
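The maximum likelihood recipe just described (frame posteriors under the UBM, accumulated over time, divided by the number of frames) can be sketched like this. It is an illustration under stated assumptions: a diagonal-covariance UBM, a single pass with the UBM posteriors, and hypothetical function names and toy data.

```python
import numpy as np

def component_posteriors(features, weights, means, variances):
    """Posterior of each diagonal-covariance Gaussian for each frame, shape (T, C)."""
    diff = features[:, None, :] - means[None, :, :]              # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff**2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                       # (T, C)
    log_joint -= log_joint.max(axis=1, keepdims=True)             # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

def ml_weights(posteriors):
    """Accumulate posteriors over time (the counts) and divide by the number of frames."""
    counts = posteriors.sum(axis=0)
    return counts / posteriors.shape[0]

# Toy UBM with two 1-D components, and a recording with 30 frames near the
# first component and 10 frames near the second.
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[-5.0], [5.0]])
ubm_vars = np.ones((2, 1))
features = np.concatenate([np.full((30, 1), -5.0), np.full((10, 1), 5.0)])

adapted = ml_weights(component_posteriors(features, ubm_weights, ubm_means, ubm_vars))
```

For this toy recording the adapted weights come out close to 0.75 and 0.25, the fractions of frames assigned to each component.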
0:16:14You can also, for example, do non-negative matrix factorization,
0:16:18which consists of splitting these weights into two small
0:16:23non-negative matrices,
0:16:25a basis
0:16:25that also maximizes the auxiliary function given here. The input is the counts,
0:16:31and you try to estimate the subspace and the
0:16:36representation of each recording in this subspace
0:16:38to characterize the weight adaptation.
0:16:41So this is non-negative matrix factorization; there is a paper by Hugo Van hamme's students
0:16:46that describes it.
0:16:48What BUT implemented is that you have a multinomial distribution,
0:16:52which is described like this.
0:16:58We have this subspace that describes the i-vector representation
0:17:05in the weight subspace. We have, you
0:17:10know, the UBM plus a shift, and there is a normalization here to make sure
0:17:14that the weights obtained sum to one.
0:17:18The good part here is that this is very good when you
0:17:22have nonlinear data to fit. Here is an example; I would like to
0:17:26thank the BUT people for giving me the slides.
0:17:33Here, for example, you have a GMM of two Gaussians,
0:17:37and each point corresponds to one recording's weight adaptation,
0:17:41for example the maximum likelihood estimate.
0:17:44We tried to simulate what happens when you have a large GMM, where you
0:17:48have some sparsity and not all the Gaussians appear. You can see that this
0:17:51Gaussian here in the corner, sorry,
0:17:55this one,
0:17:56this Gaussian here, would not appear. This is just a simulation
0:18:00of what happens when you have a large UBM.
0:18:03We can see, in this case, how the data looks,
0:18:06and the subspace multinomial model is very
0:18:13good at fitting the data.
0:18:15But it has a drawback: it may overfit. That is why the BUT guys
0:18:19use regularization to keep it from overfitting.
0:18:22Hasan's work at the time was trying to do something similar to
0:18:28an i-vector: you have the UBM weights, and you want
0:18:33the weights for a new recording
0:18:37to be the UBM weights plus an offset.
0:18:40The constraints here are that
0:18:42the weights should sum to one and should be non-negative. So
0:18:46we developed an EM-like approach
0:18:52to maximize the likelihood, the objective function.
0:18:58You have two steps: you compute all the i-vectors with the subspace fixed,
0:19:02then you update the subspace L, and you iterate until convergence.
0:19:06We try to maximize the log-likelihood of the data
0:19:10subject to the constraints that the weights should sum to one and should be
0:19:14positive, and there is
0:19:16projected gradient ascent, which can be used to do that.
0:19:18If you go to the reference you can find all
0:19:21the information; I do not want to go into the details in this talk.
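A heavily simplified sketch of the constrained problem, for intuition only. It estimates the low-dimensional vector r for one recording with the subspace L held fixed, maximizing the count-weighted log-likelihood of the weights b + L r. Two choices here are assumptions of this sketch and not necessarily what the referenced paper does: the sum-to-one constraint is kept by forcing every column of L to sum to zero, and positivity is kept by rejecting any gradient step that produces a non-positive weight. All names are hypothetical.

```python
import numpy as np

def project_columns_to_zero_sum(L):
    """Make every column of L sum to zero, so b + L @ r still sums to one."""
    return L - L.mean(axis=0, keepdims=True)

def log_likelihood(counts, b, L, r):
    """Count-weighted log-likelihood of the adapted weights; -inf outside the simplex."""
    w = b + L @ r
    if np.any(w <= 0):
        return -np.inf
    return float(counts @ np.log(w))

def estimate_r(counts, b, L, n_iter=200, step=1.0):
    """Gradient ascent on r with step halving; steps that leave the simplex are rejected."""
    r = np.zeros(L.shape[1])              # r = 0 corresponds to the UBM weights
    best = log_likelihood(counts, b, L, r)
    for _ in range(n_iter):
        w = b + L @ r
        grad = L.T @ (counts / w)
        t = step
        while t > 1e-12:
            candidate = r + t * grad
            value = log_likelihood(counts, b, L, candidate)
            if value > best:
                r, best = candidate, value
                break
            t *= 0.5
        else:
            break                          # no improving step was found
    return r

# Toy problem: 4 components, 2-dimensional subspace, counts generated from known weights.
b = np.full(4, 0.25)
L = project_columns_to_zero_sum(
    0.1 * np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]))
r_true = np.array([0.5, -0.3])
w_true = b + L @ r_true
counts = 1000.0 * w_true

r_hat = estimate_r(counts, b, L)
w_hat = b + L @ r_hat
```

In the full method the subspace L is also re-estimated in alternation with the i-vectors; that step is omitted here.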
0:19:28The difference between, for example, non-negative factor analysis and the SMM is
0:19:33shown in this table. NFA tends not to overfit, because
0:19:39its approximation of the data will not touch the corner, compared to
0:19:44the SMM.
0:19:47Sometimes that is good and sometimes bad, depending on which application you are targeting.
0:19:52But we compared them for several applications, and the SMM and
0:19:56non-negative factor analysis in practice
0:19:59behave almost the same.
0:20:01These discrete i-vectors have been applied for several applications and purposes, for example modeling
0:20:07of prosody, which is what Marcel did for his PhD;
0:20:11phonotactics, where you model the n-grams, which others have done
0:20:15with this kind of method;
0:20:17and also what we did for GMM weight adaptation for language recognition
0:20:23and dialect recognition, which is Hasan's work.
0:20:26In this paper we compared NMF, the
0:20:31SMM, as well as non-negative factor analysis, so you
0:20:33can go and check that.
0:20:35They behave almost the same for GMM weight adaptation.
0:20:38So now, to go to the fun part:
0:20:44how can we use these
0:20:48discrete i-vectors to model
0:20:51the DNN activations? At the time I
0:20:55was motivated by
0:20:57this picture.
0:20:59I was watching someone, I do not remember exactly who,
0:21:03giving a talk on unsupervised training or something like that, and he was
0:21:06showing that you can use something like a deep belief network for unsupervised training
0:21:11of an auto-encoder.
0:21:13He trained it on millions of unlabeled
0:21:17YouTube videos,
0:21:20and he said that if you take one neuron at the top, you
0:21:23can actually reconstruct
0:21:25the picture, and he was saying, okay, I can see the cat.
0:21:29I was like, okay, can we do something like that for speech? Sure, it is
0:21:33a continuous time series, but
0:21:35I was thinking, can I actually see how the data is organized in the DNN
0:21:40hidden layers? That is exactly what motivated me to start this work.
0:21:45Remember that before, I said we have a recording and we transform it
0:21:50into a set of features.
0:21:53Earlier we gave these features to a GMM; now let us just remove the GMM
0:21:57and give them
0:21:58to a DNN. For example, we can do language recognition,
0:22:03where you give it a frame with a context window of frames. That is what
0:22:07the Google paper from 2014 did: the input is a
0:22:12frame with its context and the output is a language.
0:22:16I will show several experiments in the same spirit.
0:22:20Note that when you have a new recording and you want to make a decision,
0:22:24you make frame-by-frame decisions, average them, and take the max of the
0:22:29output. That is largely what we compare to. You can also do this, for example,
0:22:35with senone DNNs, if you want to see how the data is
0:22:38represented in that task.
0:22:43So imagine you have a DNN. The way we did it
0:22:47before, as I said earlier, is that we take the DNN and use
0:22:51the output to make a decision,
0:22:54or for alignment, for example for UBM i-vectors,
0:22:57or we take one hidden layer
0:22:59and use it as bottleneck features.
0:23:02But in these cases we only see one level of what the DNN does,
0:23:07only
0:23:08one hidden layer or the output. We do not see how the DNN actually propagates
0:23:12the information
0:23:13along the whole path through the DNN. Here is the reason this matters. For
0:23:17example, imagine you have sparse coding
0:23:21in each hidden layer,
0:23:23so that for each input only fifty percent of
0:23:27the neurons are active, as for example with dropout.
0:23:33Then the way the DNN encodes information, for example for class one,
0:23:38in one place and the way it encodes it in another can
0:23:40be different,
0:23:42because there is some randomness in the way it propagates and encodes the information. So if you
0:23:47can model the pattern of activation, how the class information went
0:23:52through the DNN,
0:23:54that is information that is available there but that we are not using.
0:23:58That is exactly what motivated me to do this work.
0:24:02Can we look at the whole DNN and see how the information progresses there?
0:24:07This is one way to do it; maybe it is not
0:24:10the best way, but it is one way to do it.
0:24:14So the idea here is this:
0:24:17since we have these discrete i-vectors, which are based on counts
0:24:21and posteriors, can I use them to build
0:24:24i-vectors for each hidden layer?
0:24:27That is what we built. For example, for the DNN here, we have an
0:24:30i-vector that represents layer one,
0:24:31and an i-vector that represents the last layer as well. In
0:24:36order to do that, I need to have some counts,
0:24:39so that I will be able to apply the same
0:24:43techniques we used for GMM weight adaptation. Here is how
0:24:47you can get the counts.
0:24:49For example, you can compute the activation of each neuron; then,
0:24:55in each hidden layer and for each input, you normalize the activations to sum to one.
0:24:58That is artificial, because the DNN was not trained to do that.
0:25:02Then you accumulate over time, and that becomes the counts, because for each frame
0:25:07they sum to one.
0:25:10You can then use the same GMM weight techniques again; you do not change anything
0:25:13in them.
0:25:14The second option is to apply a softmax, for example.
0:25:17It is a similar thing, but you apply a softmax to normalize the activations so they sum to one,
0:25:21and then accumulate. You can also train the DNN with softmax as well.
0:25:24But the most important one, the one that makes the most sense of all, is this:
0:25:30you compute the probability of activation of each neuron and its complement, one minus that probability.
0:25:35So you can consider each neuron as a GMM of two Gaussians.
0:25:40Now you model the DNN as it is, and you
0:25:43keep its actual responses, so we do not normalize anything.
0:25:47For example, if you have one thousand neurons,
0:25:51you will have double that: you will have a
0:25:55thousand
0:25:57GMMs of two Gaussians, and you use the subspace model to handle that, with
0:26:01the constraint that each neuron's activation and its complement sum
0:26:05to one. In this case you are not doing anything wrong, because you are modeling
0:26:09the actual behavior of the DNN.
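The three ways of turning one hidden layer's activations into counts can be sketched as below. This is only an illustration: the function names are hypothetical, the first option assumes non-negative activations (for example sigmoid outputs), and the random array stands in for a real layer's pre-activations over 50 frames.

```python
import numpy as np

def counts_normalized(activations):
    """Option 1: rescale each frame's non-negative activations to sum to one, then accumulate."""
    per_frame = activations / activations.sum(axis=1, keepdims=True)
    return per_frame.sum(axis=0)

def counts_softmax(pre_activations):
    """Option 2: softmax across the layer for each frame, then accumulate."""
    z = pre_activations - pre_activations.max(axis=1, keepdims=True)
    e = np.exp(z)
    return (e / e.sum(axis=1, keepdims=True)).sum(axis=0)

def counts_sigmoid_pairs(sigmoid_activations):
    """Option 3: each unit is a two-class distribution (p, 1 - p); result has shape (units, 2)."""
    on = sigmoid_activations.sum(axis=0)
    off = (1.0 - sigmoid_activations).sum(axis=0)
    return np.stack([on, off], axis=1)

# Toy layer: 50 frames, 8 hidden units.
rng = np.random.default_rng(0)
pre = rng.standard_normal((50, 8))
sig = 1.0 / (1.0 + np.exp(-pre))

c_norm = counts_normalized(sig)
c_soft = counts_softmax(pre)
c_pair = counts_sigmoid_pairs(sig)
```

With the first two options the counts of a layer add up to the number of frames; with the third, each unit's pair of counts adds up to the number of frames, which is why no artificial normalization is needed.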
0:26:12We tried to compare a few of them, but I am not going into
0:26:15much detail; I do not want to put too many numbers here and confuse you.
0:26:19They all behave about the same.
0:26:22In this case, as I said, for the first
0:26:26application, which is dialect recognition,
0:26:28I use non-negative factor analysis;
0:26:30for the NIST data I use the subspace multinomial model,
0:26:34because I wanted to show that both actually work. There is no difference between
0:26:37them.
0:26:38So, as I said,
0:26:40in non-negative factor analysis the weights of a new recording are the UBM weights plus an offset.
0:26:44The way I compute the UBM weights is that I take
0:26:49all the training data, extract the counts for each recording, normalize
0:26:54them, and take the average, and that is my UBM. So the UBM
0:26:58is simply the average response of each hidden layer
0:27:02for a given DNN,
0:27:04over all the recordings.
0:27:08Then you can use non-negative factor analysis
0:27:12to do that.
0:27:14So now,
0:27:15the interesting part is that non-negative factor analysis, and also the other approaches,
0:27:19can help you model all the hidden layers together as well. One way to
0:27:23do it is, for example, to build an i-vector for each hidden layer; then you
0:27:28can concatenate these i-vectors.
0:27:30Alternatively,
0:27:31you could have one
0:27:33subspace that models everything, with the constraint that the weights of each hidden layer sum to one.
0:27:38This will allow you to see,
0:27:41you know, how the correlation happens between the activations of all your hidden layers,
0:27:45and that is exactly what we did.
0:27:49In order to do that we extended, for example, the non-negative factor analysis:
0:27:53you have different UBMs, each one corresponding to a hidden layer, and you
0:27:58have a common
0:28:00i-vector that controls the output for each layer. Sorry, you have
0:28:05a common
0:28:07i-vector for the weights of all the hidden layers.
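One way to write down this extension, as a sketch only (the symbols are assumptions of this note and do not come from the slides), is one UBM weight vector and one subspace per hidden layer, tied together by a single shared i-vector:

```latex
\[
  w^{(l)} = b^{(l)} + L^{(l)}\, r , \qquad l = 1, \dots, K,
\]
\[
  \text{subject to } \sum_{c} w^{(l)}_{c} = 1
  \quad \text{and} \quad w^{(l)}_{c} \ge 0 \quad \text{for every layer } l .
\]
% w^{(l)} : adapted weights of hidden layer l for the recording
% b^{(l)} : UBM weights of layer l (average normalized counts over the training data)
% L^{(l)} : subspace matrix of layer l
% r       : the common i-vector shared by all K hidden layers
```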
0:28:14So let me give some experiments and show some results.
0:28:22The first experiment that I would like to show is on dialect
0:28:25ID, where we have a small data set.
0:28:29We were interested in doing something with it. Here we have five dialects.
0:28:33this isn't know how many recalling by training
0:28:36it's about forty hours important thing for ten or fifteen hours and it'll it an
0:28:40hour threeish a dialect
0:28:42We also have the number of cuts for training, development, and eval.
0:28:47So I trained the DNN.
0:28:51We have a five-class problem, so I trained a DNN with five
0:28:55hidden layers.
0:28:56The first one has about two thousand units, and after that
0:29:00all the other hidden layers have five hundred,
0:29:06five hundred.
0:29:07For training, the input is the same:
0:29:11the features are a stack of,
0:29:14I think, twenty-one feature frames, and the output is the five dialect classes,
0:29:20the same as in the Google paper.
0:29:25Then, once you get the i-vectors, I use cosine scoring with LDA, which
0:29:30people described earlier today.
0:29:32The best dimension we found for this task is
0:29:37almost full rank:
0:29:39about one thousand five hundred, and five hundred for each of the other ones.
0:29:42so the first result i show is the i-vector result
0:29:47and here the i-vectors are actually worse than when you do the dnn
0:29:52average of the output
0:29:54which means that for each frame you compute the posterior for the five
0:29:57for the five classes and you average them and you take the max which is exactly what
0:30:01the google paper describes and it is better because the characteristic of
0:30:07this data is that the recordings are very short cuts around thirty seconds you know
0:30:12or sometimes less
0:30:14so we know that you know if you train the dnn and you
0:30:17average the scores it's always better you have already seen talks on wednesday afternoon
0:30:22that show that
0:30:23even for nist data so this is the error rate sorry so less is better
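The averaging rule described here can be sketched as follows, assuming the dnn already gives one posterior vector per frame. The posterior values are made up for illustration and the function name is hypothetical.

```python
def average_frame_posteriors(frame_posteriors):
    """Average per-frame class posteriors and return (averages, winning class index)."""
    num_frames = len(frame_posteriors)
    num_classes = len(frame_posteriors[0])
    averaged = [
        sum(frame[c] for frame in frame_posteriors) / num_frames
        for c in range(num_classes)
    ]
    decision = max(range(num_classes), key=lambda c: averaged[c])
    return averaged, decision

# Three frames, three classes (illustrative numbers only).
posteriors = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.7, 0.2, 0.1]]
averaged, decision = average_frame_posteriors(posteriors)
```

Some systems average log posteriors instead of posteriors; the transcript does not say which variant was used, so the plain average above is an assumption.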
0:30:30so now i will show you know the results when you do the i-vectors in
0:30:36the hidden layers starting from layer one to layer five and how the
0:30:42results are
0:30:44the more you go deep the better it is which we know
0:30:48so this is the understanding that people have in other fields like in vision so we
0:30:53were able to see the same thing here so
0:30:55you can see that from layer one to layer four the error
0:30:59goes down and i kept these
0:31:02five layers because i want to show that sometimes there's no need to go too
0:31:05deep
0:31:06for example layer five already saturated
0:31:09like layer five didn't add anything but i kept it just to make sure that
0:31:13you know sometimes we will try to make it really deep but it is not necessary
0:31:17so this is one example where you really don't need to do it
0:31:22and the point is now we can also see that you know we were able to
0:31:26see the accuracy of each hidden layer and we were also able
0:31:29to show that the more you go deep in the network the better
0:31:33the results are so you will probably get more information
0:31:36by modeling all the hidden layers and maybe have a better representation
0:31:40so here this is lda
0:31:44in two dimensions
0:31:46so the five classes with lda projected into two dimensions and i
0:31:51remember the first time i presented this work and showed the slide people
0:31:55said well but why didn't you do t-sne i said that's true i forgot to do it
0:32:00so this time i didn't forget
0:32:02and so i took a set of the raw i-vectors for example for the
0:32:05last layer
0:32:06and i did t-sne to model that so now here these are just
0:32:10the raw i-vectors with t-sne not lda and you can see that
0:32:14for example the origin is around here so we can see the scatter going like this
0:32:19which is just a sign that okay length normalization will be useful again
0:32:23so this is when you do the length normalization it does the same thing
0:32:27as in the speaker area
0:32:28so it is the same thing so length normalization is also useful here so
0:32:35i'm not sure this is good unfortunately i was hoping to see different behaviour but
0:32:38it seems to behave the same way
0:32:42so this is using t-sne so this is the raw
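Length normalization as mentioned here is just a rescaling of each vector to unit euclidean norm. A minimal sketch, with a hypothetical function name and a small `eps` guard that is my own addition to avoid dividing by zero:

```python
import math

def length_normalize(vec, eps=1e-12):
    """Scale a vector to unit euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

unit = length_normalize([3.0, 4.0])
```

After this step every vector sits on the unit sphere, so the scatter radiating from the origin seen in the plot collapses onto directions only.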
0:32:46so the reason why i was asking this question is because the dnn was
0:32:49just trained for the task
0:32:51so how does it really actually represent
0:32:54the data in the layers was an important thing to look at
0:32:58so this is one thing that we were trying
0:33:01so now
0:33:04i just show here the previous results the i-vector result and the dnn
0:33:09averaging the scores of the frames which is better than i-vectors then modeling
0:33:14the hidden layers is actually better
0:33:17and so i say that from all the experiments that i have
0:33:21been seeing the output the last layer is the worst
0:33:25one in terms of information so don't take the decision there
0:33:28but with this data we saw that all the information is actually in the hidden layers
0:33:32there's no doubt about it
0:33:34so here i give the last layer result and then what happens if you model
0:33:39everything well you get more gain
0:33:42you get another two percent gain by modeling all the hidden layers
0:33:46and the same thing will happen with the nist data
0:33:49so my point here is you know it is true for the hidden layers
0:33:52you know the more you go deep the better it is
0:33:55but if you also look at all the correlation that is happening over all hidden layers
0:33:59it is actually better
0:34:02and the reason for example why is you know the even people that do some
0:34:06you know brain division amount vision and everything that wanna try to the activation the
0:34:11cost of you know what him or more i've something's can use it and one
0:34:14level but you cannot see how this propagates maybe someone can correct me about
0:34:18that if i'm wrong you know this way we can do the same thing for
0:34:22the dnn we can
0:34:23stop at one hidden layer or we can see what's happening in all the dnn
0:34:26it is the same okay
0:34:28you can look at the whole dnn to see the activation how it happens or
0:34:32you can cut at one level and make a decision this is the same
0:34:36thing so this is the same behaviour and here i'm just saying
0:34:41the dnn has more information than we are now using
0:34:45because we are not looking at the path of activation that it took to code
0:34:49this data
0:34:51so this is dialect id probably you are not familiar with that so i will probably move
0:34:54onto the speaker id but before that i did an experiment because you know
0:35:00in the case of the i-vectors it was completely unsupervised i was thinking okay so the dnn that
0:35:05i used is actually
0:35:07discriminatively trained for this specific task
0:35:10can i have a dnn that was just used to code the data
0:35:15for example
0:35:17and you know the simplest way to do it i said let me just try
0:35:20to do greedy learning of rbms to try to see you know what is
0:35:24happening i'm sure that people have more sophisticated networks for that
0:35:28so i tried this rbm with the same architecture that i trained before the same
0:35:32data the speech frames as input
0:35:36and i used the dimensionality reduction the subspace and used cosine distance so
0:35:42we used five hidden layers of rbms
0:35:44and
0:35:45these are the results here the i-vectors and here the dnn output
0:35:49but i am having some struggle because i cannot go further than the first layer
0:35:55for the rbm or the autoencoders
0:35:58so the first layer gives me the best result although it is not as
0:36:01good as
0:36:03you know this discriminatively trained dnn subspace or the i-vectors but
0:36:10you know it's not that bad
0:36:12you know and that's what we have been seeing
0:36:14so of the hidden layers the first one you train is actually the best one the more
0:36:19you go deeper
0:36:21it doesn't help and my
0:36:23my hypothesis i'm not sure if it's true
0:36:26is because they are not jointly trained
0:36:30maybe if all the
0:36:34layers are jointly trained to maximize the likelihood of the data that may be
0:36:38a different story and that's what we are trying to investigate now
0:36:43with my students so can we train for example a variational autoencoder to
0:36:47maximize the likelihood of the data
0:36:49and see whether
0:36:50all this representation is meaningful or not
0:36:53so this is one thing that we are trying to explore
0:36:56so now for people that are more familiar
0:37:01with the nist data as you have seen in the wednesday afternoon
0:37:06session people are modeling six languages
0:37:09i tried to do the same thing so we selected with the help of like to
0:37:12laugh read a give me this subset of the data
0:37:17so first in the korean mandarin russian vietnamese
0:37:21and the difference between us and other people is that people try to use only part of the
0:37:25data because they want to remove the mismatch so they tend to use
0:37:29voa only instead of cts and voa to avoid the mismatch
0:37:32because i wanted to know what's going on
0:37:34for us we put everything together
0:37:37it seems that we didn't have this issue
0:37:39so that's the difference between possibly our paper and some other papers in
0:37:43that session of the
0:37:45wednesday afternoon so we put everything together and we trained a dnn
0:37:49that actually takes the frames as input and the output is the six classes
0:37:54and this is actually so actually before that i will say
0:37:58i trained five hidden layers of about a thousand each
0:38:03the input is the stacked frames twenty one so eleven context frames for each side
0:38:10sorry ten context frames for each side the output is the class
0:38:14of the six classes i used lda the same as before and of course
0:38:19cosine scoring for classification
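The input described here, a centre frame with ten context frames on each side for twenty one frames in total, can be sketched like this. Repeating the first and last frame at the edges is an assumption of mine; the talk does not say how the edges were handled, and the function name is hypothetical.

```python
def stack_context(frames, context=10):
    """Concatenate each frame with `context` neighbours on each side.

    frames: list of feature vectors (lists of floats), one per frame.
    Returns one stacked vector per frame of length (2*context+1)*feat_dim.
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for offset in range(-context, context + 1):
            # Clamp the index so edge frames reuse the first/last frame (assumption).
            idx = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

# Five toy frames with two features each.
frames = [[float(t), float(t) + 0.5] for t in range(5)]
stacked = stack_context(frames, context=10)
```

With two features per frame each stacked vector has 21 x 2 = 42 entries, and the centre frame sits at positions 20 and 21.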
0:38:21and so here this is the result on a subset of the two thousand nine
0:38:26for the six languages
0:38:28so this is the result of the i-vectors in thirty second ten second and three second
0:38:32and the average of the scores which is what everyone is doing what is called the
0:38:37direct approach
0:38:40so the characteristic of this as has been said before
0:38:44is that the dnn
0:38:49average only beats the i-vectors in the three second in thirty seconds and ten seconds
0:38:53it doesn't work
0:38:55but what happens when you do the hidden layers is a little bit different story
0:39:00so as well the more you go deep in the dnn the better it is
0:39:05so this is the same thing there is not a different story here
0:39:09but the thing is
0:39:12actually here for thirty seconds and ten seconds no one was able
0:39:16to beat the i-vectors
0:39:18but if you do the hidden layers and for example one of the hidden layers
0:39:23it obtains the best result everywhere
0:39:25even for ten and even for thirty seconds
0:39:29so hidden layers and also this was actually interesting hidden layer
0:39:33five is just the one preceding the output
0:39:38so this is one sign that the output layer is the one that you really don't need
0:39:42to look at
0:39:43so based on my experience and here again you can see that the last hidden
0:39:47layer is actually much better than
0:39:50the i-vectors as well as the dnn output average
0:39:56so the hidden layers with the i-vector representation for this case seem to do
0:40:01an interesting job of aggregating and pooling
0:40:05the frame data to make a representation of the data and you can do classification
0:40:09with it
0:40:09so this is an interesting finding so actually it was surprising to see this
0:40:14on the data
0:40:15so now
0:40:16what happens when you model everything all the hidden layers as well
0:40:21so here i show the
0:40:25i-vector representation the dnn average score as well as the last hidden
0:40:28layer five
0:40:30and you know
0:40:34i also tried to see what happens if you do
0:40:39all hidden layers and you get again some gain
0:40:44and you can win almost like zero point eight sorry i forgot
0:40:48to say this is the average error rate here so we can see that for thirty seconds the error
0:40:53is already low
0:40:54you know i don't take that too seriously
0:40:57that we lost a little bit here but for ten seconds we were able to win
0:41:02and for three seconds we were also able to
0:41:06so it's the same behaviour that all hidden layers
0:41:10have better information than a single layer at a time
0:41:14and also the last hidden layer is also better than the first layer
0:41:20so the last is also better than the first layer
0:41:24of the hidden layers but the last output layer is not that interesting
0:41:31in terms of making the decision
0:41:33so to be honest one explanation is that these dnns sometimes
0:41:38tend to overfit
0:41:39which i just a do
0:41:41second to shoot
0:41:43but even when they overfit like that and you use them to make a representation or
0:41:48a discriminative space
0:41:50they still work fine if you try to make the decision with it overfitting is a
0:41:54different story
0:41:55that's one thing here
0:41:56so this is what i have been finding this last
0:42:01year trying to use these models to
0:42:04understand what's going on
0:42:09so let me try to conclude
0:42:12so we have five minutes and i have some take-home messages that i want to say
0:42:16so the i-vector representation is you know an elegant way to do a representation of
0:42:22speech with different lengths you know a lot of people have already used it
0:42:25a wood that's and twenty of
0:42:27of the work of the recordings the one where you have a long segment and
0:42:31short segment
0:42:32gmm mean adaptation and gmm weight adaptation subspaces can also be applied as i showed
0:42:39as you have seen in this talk they can be applied to model the
0:42:43dnn activations
0:42:44in the hidden layers as well and they are doing a good job
0:42:48so that was actually the take home here
0:42:51so the thing that i want to focus on here the take home is that all
0:42:56the information that was modeled in the dnn is not in the output but is in
0:43:03the hidden layers
0:43:05don't try to make a decision directly from the output
0:43:11and also you know looking at one layer at a time and not
0:43:17seeing what's going on in all the hidden layers
0:43:19may be a mistake it may be good also to look
0:43:23at that
0:43:24because it will tell you how the information went through all the dnn
0:43:28and how each class is being modeled
0:43:32that is something that seems to be
0:43:35very useful
0:43:37the subspace approach that i have been trying is one thing that i was thinking of
0:43:42to do this work especially in terms of modelling all hidden layers
0:43:47that you know we can use and it seems to be doing a good job of
0:43:51pooling and aggregating all the frames and giving you a
0:43:56representation with the maximum information you can use for your
0:43:59for your classification task
0:44:02so this has seemed to be very good even if the dnn was trained
0:44:07frame based
0:44:08so with dnns trained frame based and used to make a
0:44:12sequential classification
0:44:15the i-vector representation seems to be doing a really good job for that
0:44:23take two minutes to
0:44:25and we have to mitigate
0:44:30for future work the tracks that we have been exploring with my students and
0:44:35my colleagues
0:44:37as i said earlier the dnns that i have been using are frame based
0:44:42with a fixed
0:44:43frame context of twenty one or something like that
0:44:47so we are trying to shift to
0:44:50dnns with more memory like for example tdnn
0:44:56or lstm which is a
0:44:58special case of recurrent networks that's what ruben is doing
0:45:02my intern so we are trying to explore that instead of frame-by-frame
0:45:09to model more of the speech dynamics
0:45:14and explore more this kind of vector for speaker
0:45:17to make them more useful for speaker
0:45:20we're still working on that as well
0:45:23and as i said earlier i would be interested if people are inspired by my
0:45:26talk to meet i mean maybe there is a better way
0:45:31to really code the speech data
0:45:35and my hope is at some point we will be able
0:45:39to get some speech modeling dnn or speech coder so you
0:45:44just code the speech and after that you use the discriminative space for
0:45:48your task for example i give you
0:45:51a bunch of thousands of recordings you code your data and after that you say
0:45:55i want to do speaker i want to do language
0:45:58can i do it
0:46:00from the same model
0:46:01just coding speech
0:46:03so if anyone has any idea or has any thoughts please come talk to me
0:46:09so also to make the activations more interesting
0:46:15i'm interested in exploring the sparsity of activation for the hidden layers
0:46:19now i'm not doing anything specific i'm just using the normal dnn training but
0:46:23is there a way for example one way that i'm doing now we didn't
0:46:27have time to compare the results is dropout
0:46:30for example i say
0:46:32for each input for each hidden layer fifty percent of my units
0:46:36are active
0:46:37so there is some randomness between the recordings and between the hidden layers because
0:46:41i find that actually if you have a dnn and you take two
0:46:44consecutive hidden layers sometimes they are redundant because they are close together but if you skip one
0:46:50and there is actually some separation between them it's better
0:46:54so if you do sparse activation with for example dropout which is the simplest way
0:46:58to do it
0:46:59you make them complementary because there's some randomness happening in the middle
0:47:03so that you force the dnn to take a different path for each hidden layer
0:47:07randomly
0:47:10so that's something i'm really interested in to make the
0:47:13information between two consecutive
0:47:16hidden layers more powerful more interesting and make them complementary rather
0:47:20than redundant
0:47:21and also there's a way to for example alternate activation functions
0:47:26for example we can say sigmoid rectified linear and sigmoid
0:47:30so between two consecutive sigmoids there is something in the middle to make things change a
0:47:35little bit
0:47:35so the behaviour changes between the consecutive sigmoids
0:47:38so when you model that there is hopefully a way to get more information
0:47:44in the subspace and also how the dnn
0:47:48is coding information can be useful for the classification
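A minimal sketch of the dropout idea mentioned here, keeping roughly fifty percent of the units of a hidden layer active for each input. It uses the common inverted-dropout scaling, which is an assumption on my part since the talk gives no implementation detail; the function name is hypothetical.

```python
import random

def dropout(activations, keep_prob=0.5, rng=None):
    """Randomly zero units and rescale the survivors by 1/keep_prob."""
    rng = rng or random.Random()
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    dropped = [a * m / keep_prob for a, m in zip(activations, mask)]
    return dropped, mask

# One hidden layer with 1000 units, all set to 1.0 for illustration.
hidden = [1.0] * 1000
dropped, mask = dropout(hidden, keep_prob=0.5, rng=random.Random(0))
```

Drawing a fresh mask for every layer and every input is what injects the randomness between consecutive layers that the talk is after.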
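The alternating pattern sigmoid, rectified linear, sigmoid can be sketched with a tiny forward pass. The weights and layer sizes are arbitrary illustrative values and the helper names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(x, 0.0)

def layer(inputs, weights, activation):
    """weights: one row of input weights per output unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(inputs, layers):
    """Run the input through each (weights, activation) pair, keeping every layer's output."""
    outputs = []
    h = inputs
    for weights, activation in layers:
        h = layer(h, weights, activation)
        outputs.append(h)
    return outputs

# Three hidden layers with alternating activation functions.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], sigmoid),
    ([[1.0, -1.0], [-0.5, 0.5]], relu),
    ([[0.3, 0.7], [-0.6, 0.2]], sigmoid),
]
hidden = forward([1.0, 2.0], layers)
```

Keeping the output of every layer, rather than only the last one, is what a model of all hidden layers would consume.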
0:47:53to conclude
0:47:55well i'm organising slt two thousand sixteen in puerto rico
0:47:59so hopefully to see you there please submit your papers it is the same time
0:48:03as the c
0:48:04so that's it for this work so please i hope to see you there and if you come
0:48:08to the workshop you can also stay
0:48:11for the rest of the week and enjoy the beach and the cocktails very
0:48:15nice to signature nor owns to make your compared to the right object a function
0:48:19so and that's it thank you
0:48:42najim thank you for this talk my question concerns just a point which is
0:48:47not the main point which is not the main point of your talk
0:48:52it's about the visualization
0:48:54in particular t-sne the stochastic neighbour embedding
0:48:58the to use of form determinization think that
0:49:03this techniques and that is this phenomena useful and satisfying four
0:49:07for thinking for the it but also for the thinking and understandings the distributions
0:49:13but we remark and some if you put forth
0:49:17and so for presenting the high divorce which of data with those techniques particular these
0:49:24speaker classes
0:49:26i'll distributed along ambulance form norwegian
0:49:31this thing directions
0:49:33t-sne does not respect the initial distribution
0:49:38it separates speaker classes but
0:49:42the does not respects is montreal
0:49:45direction of speaker classes
0:49:48so it is useful because we see the
0:49:52separation between the classes of speakers
0:49:55but not
0:49:59or maybe more
0:50:01view of this is we'll distribution
0:50:05so it's i think a very good tool
0:50:07two but it may become few not to use it of to propose a new
0:50:15it's as those more one so you're saying i it's here's just want to show
0:50:21that you know how it's kind of structured but i'm not checking account how it's
0:50:25model was a distribution from a t c any that's what you're saying yes
0:50:32simply for the also points in particular fourteen and
0:50:45i didn't write down all the numbers but i saw you had results
0:50:49on the dialect id task for the five arabic dialects
0:50:54and the numbers i was writing down here you had i think for the
0:50:58fourth layer right twelve point two percent and then when you went to the fifth
0:51:05it was twelve point five percent
0:51:08and i apologise i didn't see the slide that's up there so my question is
0:51:14as you're moving forward you're actually getting improvement but it would really be nice in dialect
0:51:19id there are a lot more subtle differences between dialects right so a lot of
0:51:24times it is interesting to figure out what are the things that are differentiating between each
0:51:29of the dialects so i'm wondering if at any point you go back
0:51:33and look at the test files that you went through here
0:51:37as you are moving through the improvement here
0:51:40you know the assumption would be that you're getting
0:51:44a few more files accepted correctly but you're just as likely to have a few
0:51:51more files rejected incorrectly and it would be nice to see what the
0:51:56balance is are you getting more pluses
0:51:59and losing a few or are you not losing anything and gaining more so
0:52:03that's what i'd like to kind of see as you're moving down here
0:52:06is it a positive movement forward or are there some that are falling backwards but the net
0:52:13gain is always positive
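The bookkeeping asked for in this question, how many files a new system fixes versus how many it breaks, can be sketched as below. The labels `d1` to `d4` and the two prediction lists are invented for illustration; no such per-file results appear in the talk.

```python
def compare_systems(labels, baseline_preds, new_preds):
    """Count files the new system fixes (gained) and files it breaks (lost)."""
    gained = lost = 0
    for truth, old, new in zip(labels, baseline_preds, new_preds):
        if old != truth and new == truth:
            gained += 1
        elif old == truth and new != truth:
            lost += 1
    return {"gained": gained, "lost": lost, "net": gained - lost}

# Hypothetical per-file decisions from two hidden layers.
labels = ["d1", "d2", "d3", "d4", "d1"]
layer4 = ["d1", "d3", "d3", "d1", "d2"]
layer5 = ["d1", "d2", "d4", "d1", "d1"]
result = compare_systems(labels, layer4, layer5)
```

Two systems with the same error rate can still differ in which files they get right, so the gained and lost counts say more than the net change alone.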
0:52:14no i agree with that in i didn't do it you know virginia the wood
0:52:18but also i was interested at that time more interested also to see
0:52:23between the hidden layers what i'm getting i was hoping to see what happened to
0:52:29the recording you know maybe having a linguist work with me trying to
0:52:32understand okay a recording like this was classified correctly in hidden layer five but not
0:52:37in layer four three or two what made it change so i
0:52:42want to know
0:52:42which information in layer five got me to make this one better than
0:52:46another one that's true we didn't do it but we were thinking about it
0:53:02so not too much i just want to thank you very much for proposing a new
0:53:07solution to a very hard problem so
0:53:11i just like to put that the difficulty of the problem in into context because
0:53:15we've been banging our heads against the same kind of difficulty so
0:53:20to summarize the problem
0:53:23the problem is to get a low dimensional representation of the information in the
0:53:28in a sequence so you've got lots of speech frames
0:53:32and then you want to distill the information in all the speech frames to
0:53:36a single smallish vector
0:53:41the reason it is difficult is let's look at the i-vectors the classical i-vector so
0:53:47you can write down the generative model for the i-vectors in one equation
0:53:52you had
0:53:55it's very easy for most of us to just look at that an immediately understand
0:54:00so that's the generative route
0:54:02but what you're doing is the inference route
0:54:05from the data back to
0:54:08the hidden information so now we have to
0:54:12share all the information from all the frames accumulate that information back into
0:54:18back into the single
0:54:22vectors so
0:54:24if you look at the i-vector solution
0:54:27that the formula for
0:54:31calculating the i-vector posterior
0:54:33that's a lot more complex than
0:54:36just the generative formula for the i-vector
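For reference, the contrast drawn here can be written out, assuming the classical total-variability formulation. The symbols are the conventional ones and are not taken from the talk: $m$ is the ubm mean supervector, $T$ the total-variability matrix, $w$ the i-vector, $\Sigma$ the ubm covariance, and $N(u)$ and $\tilde{F}(u)$ the zeroth-order and centred first-order statistics of utterance $u$.

```latex
% Generative direction: one line
M = m + T w, \qquad w \sim \mathcal{N}(0, I)

% Inference direction: posterior of w given the frames of utterance u
w \mid u \sim \mathcal{N}\!\left( L^{-1} T^{\top} \Sigma^{-1} \tilde{F}(u),\; L^{-1} \right),
\qquad L = I + T^{\top} \Sigma^{-1} N(u)\, T
```

The generative line is a single affine map, while the posterior needs statistics accumulated over every frame plus a matrix inverse per utterance, which is the accumulation step the comment calls hard.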
0:54:39and that takes as
0:54:43might be applied to the live
0:54:45i that formula and
0:54:48i believe it's similarly difficult for the neural network to learn that
0:54:53so you mentioned the variational bayes autoencoders
0:54:59so we've been looking at that quite a lot
0:55:02in the papers that have been published thus far it's always a one-to-one relationship between
0:55:07the hidden variable and the observation and then everything's i.i.d. so
0:55:12the machine learning papers have been solving a much easier problem
0:55:18to accumulate all that information is a harder problem it is
0:55:25also computationally hard
0:55:27if you think of the i-vector posterior lots of papers were published on how to make
0:55:32that computationally lighter
0:55:36that's why i say your
0:55:37new solution is quite exciting to us
0:55:42what else also one of the guys from machine learning asked me okay
0:55:46so you have a dnn and you have your i-vector representation can you propagate the
0:55:51errors from the i-vectors to the dnn to make it more powerful for your specific
0:55:56task with the i-vector representation
0:55:58that's something interesting for a phd topic
0:56:01if you
0:56:04know a way to combine the subspace and the dnn like the same as what
0:56:08people do in asr with discriminative training sequence training can
0:56:13we do similar things when you have the error coming from the i-vector
0:56:16space and back-propagate it to the dnn
0:56:21that's something maybe
0:56:23interesting as well that's what a guy from machine learning asked me
0:56:35so nice presentation najim
0:56:39i had a couple of questions one was
0:56:43when people moved from gmm based i-vectors to you know dnn
0:56:49based i-vectors using senones as classes
0:56:54as i understood the improvement was
0:56:57because of the fact that the space was quantized much better than using gmms right
0:57:04either it was phones as classes or you know languages as classes
0:57:09the autoencoder that you're proposing to use
0:57:14has no information about you know any classes so what's your intuition for why
0:57:20something like that would work better than
0:57:22using senones or you know languages as classes
0:57:27well you know it's actually a good question so my intuition is just
0:57:33my feeling about speech processing and how we are doing it
0:57:37is that we start too early
0:57:39throwing information away from the signal
0:57:42for example
0:57:43here if you do frame level with language as a class
0:57:47i'm normalising speakers the dnn is doing all these things for me
0:57:52so i'm hoping to not do that
0:57:55try to keep as much information
0:57:58as i can
0:58:00for example i give you
0:58:02four thousand six or ten thousand hours of speech i'm not giving you labels
0:58:06but you know can you train the speech coder in an unsupervised way on your
0:58:11data which would be helpful for you because you have thousands hundreds of thousands of hours of speech
0:58:16and maybe in the industry it is different
0:58:19i say you have moral appleton with us
0:58:22but for so can we do that so that's what i hope so i can
0:58:26you know this is the same talk what the jackal said the twenty have letterman
0:58:31can you use that you and your training
0:58:32so i'm hoping to have a kind of speech coder
0:58:36that models speech so that you hear something and you get the same thing from both sides
0:58:39so the information is there
0:58:42it's not thrown away it is just how to use it
0:58:45that's exactly my feeling and i'm not saying that it would beat i don't think it
0:58:49would beat the discriminative training or something like that i'm just saying that if i
0:58:52haven't all the speech coder that something like to maybe if i am too much
0:58:56use anything august alameda truth but that this is what i one is like something
0:59:01you know if we have a vocoder style or something like that
0:59:04if it can reproduce the speech again
0:59:08then the information is there we just need to extract it
0:59:12i don't know if it was clear and