0:00:15 | alright, thank you for this introduction, and I would like of course to also thank the

0:00:20 | organizers for inviting me to give this presentation, to

0:00:25 | present my latest work, and also for bringing us here to

0:00:28 | this wonderful location. It was an amazing week:

0:00:30 | the weather was very good,

0:00:32 | and the social events had many things, so I also got some exercise, which is a good

0:00:38 | part.

0:00:39 | So,

0:00:40 | it was a really enjoyable week, talking to

0:00:44 | people, meeting new colleagues, and exchanging ideas, so that was wonderful,

0:00:48 | and it was a good occasion

0:00:50 | to see this winter vision of the Basque Country.

0:00:53 | So hopefully we'll come back to visit as tourists

0:00:56 | if we have the chance. So today I'll be presenting some of my latest work about using

0:01:00 | the,

0:01:02 | some i-vectors, some kind of i-vectors, to model the hidden layers and see

0:01:06 | how

0:01:07 | the DNN is propagating information in its hidden layers. Because usually the

0:01:12 | way,

0:01:13 | actually, the way we use DNNs now is either we look at

0:01:17 | the output of the DNN to make some decisions,

0:01:20 | or we take one of

0:01:24 | the hidden layers and use it as bottleneck features to do some classification.

0:01:28 | But unfortunately not

0:01:29 | a lot of work, hardly any work, has been proposed to look at

0:01:34 | the whole behavior of the DNN,

0:01:36 | because I believe there is some information that we are not

0:01:40 | exploring and using in the DNN: the pattern of activations,

0:01:45 | how the information is propagated through the DNN. That's what we're going to

0:01:49 | be talking about today,

0:01:50 | and I'll show some results.

0:01:52 | so |

0:01:54 | So this is the outline of my talk. I'll start with an introduction, and after

0:01:58 | that I'll move on,

0:02:00 | you know, slowly to my latest work. But before that, I'll give a quick

0:02:04 | introduction to i-vectors, which I probably don't need to, because a lot of people,

0:02:07 | a lot of you, probably know them, sometimes better than me.

0:02:11 | So, as you know, the i-vectors are based on the GMM, so

0:02:14 | the first part will be based on GMMs and how we use i-vectors with them;

0:02:17 | we present the GMM mean adaptation and the GMM weight adaptation,

0:02:21 | and we show two case studies: speaker recognition and language recognition. Here I'm

0:02:25 | not telling you how

0:02:27 | to build your language or speaker recognition system, but I just want to show

0:02:30 | you that with i-vectors we can do visualization, and

0:02:34 | see some very interesting behavior of the data: how the channels and

0:02:38 | the recording conditions can affect

0:02:40 | a speaker recognition system if you don't do any channel compensation,

0:02:43 | and for language recognition we show how the closeness of the languages emerges from a data-driven

0:02:47 | visualization.

0:02:51 | Then I'll go in the direction of how we can actually use some discrete i-vectors to model the

0:02:55 | GMM weight adaptation. This is some work that started with one

0:02:59 | of Hugo Van hamme's students, Hassan Bahari; his PhD

0:03:03 | thesis was actually about this. He was in Belgium, and he visited me

0:03:07 | at MIT for six months,

0:03:09 | and we started working on this GMM weight adaptation for language ID.

0:03:13 | And then, after that, the DNNs were progressing, coming over the field,

0:03:18 | and that's where I started thinking: maybe these discrete i-vectors can also be used

0:03:22 | to model the posterior distributions for the DNNs.

0:03:25 | So I started, and this is the second part of the talk, I also

0:03:28 | started looking at how the DNN is representing information in its hidden layers,

0:03:33 | because a lot of work in the vision community showed, for example,

0:03:37 | that one neuron in the model actually detected cat faces from YouTube videos, or something like that.

0:03:41 | So,

0:03:41 | can we do something for speech?

0:03:43 | That's how I started thinking about using an i-vector representation to model the hidden layers,

0:03:48 | and that's why

0:03:48 | we then show, for example, how the accuracy changes as we go

0:03:52 | deeper in the DNN, how the accuracy for a language ID task, for example,

0:03:56 | gets better,

0:03:57 | and also how we can model the pattern of activations, the progression,

0:04:00 | you know, the activation of the information over the whole DNN.

0:04:04 | So if you feel like one hour is too much for you to sit

0:04:07 | in this chair and you want to take a nap, you should do it in the

0:04:11 | first part, the GMM part, because the second part may be more interesting for you

0:04:14 | guys.

0:04:16 | I would not be offended if you do.

0:04:18 | And after that, I'll finish the hour by giving some conclusions of the work.

0:04:22 | So,

0:04:23 | as you know, i-vectors have been largely used. It's a nice way to

0:04:28 | work: it's a compact representation that nicely summarizes and describes what's happening in

0:04:34 | a given recording.

0:04:35 | They have been largely used for different tasks:

0:04:39 | speaker recognition, language recognition, speaker diarization,

0:04:41 | speech recognition. The original i-vectors were actually related to the GMM adaptation of the

0:04:46 | means.

0:04:47 | As I said, lately I have also been interested in the GMM weights adaptation

0:04:51 | using i-vectors, and after that we moved on to use

0:04:55 | this to model the DNN-based i-vectors

0:04:59 | from the neuron activations.

0:05:03 | So let me now slowly take you toward

0:05:08 | my latest work.

0:05:11 | So, you know, in speech processing, usually what you have is a recording,

0:05:15 | this one recording, and you transform it to get some features.

0:05:18 | Then, based on the complexity of the feature distributions, you build a GMM;

0:05:23 | you put the GMM on top of this, trained to maximize the likelihood of the distributions.

0:05:27 | So, you know,

0:05:29 | GMMs are defined by Gaussians, and each Gaussian has a weight, a mean,

0:05:34 | and a covariance matrix that describe it.

0:05:38 | So let me introduce the i-vectors in the context of

0:05:42 | speaker recognition, the way we were doing it in the early 2000s; that's

0:05:46 | how it all started.

0:05:47 | You know, we take a lot of non-target speakers and train a large Gaussian mixture

0:05:52 | model.

0:05:53 | Then, after that, because we sometimes don't have too many recordings from the

0:05:57 | same speaker, we cannot run a full maximum likelihood training, so we do adaptation: we adapt

0:06:01 | the universal background model, which is kind of a prior of what all the sounds

0:06:05 | look like, in the direction of the

0:06:07 | target speaker.

0:06:08 | And this is how the GMM

0:06:15 | supervector came about, because people finally found out that

0:06:18 | the adaptation of the means alone is enough. So the way the

0:06:23 | means shift from this universal background model, the large GMM trained on a lot

0:06:27 | of data,

0:06:28 | in the direction of the target speaker can be characterized as something that happened in the recording

0:06:32 | that made that shift happen.

0:06:34 | So a lot of people started to model this shift, for example Patrick Kenny with joint

0:06:38 | factor analysis, to try to

0:06:39 | split it into speaker and channel components,

0:06:42 | or, using the GMM supervector, for example Campbell with SVMs,

0:06:47 | taking the GMM supervector as input to the SVM to discriminate

0:06:51 | between speakers.

0:06:53 | So in the same spirit, i-vectors came out as well.

0:06:57 | So in the i-vector approach you have a GMM supervector space; the UBM is one point there,

0:07:02 | and when we have one recording, we try to shift the UBM,

0:07:06 | so,

0:07:07 | to shift it toward this new recording. So if you have several recordings,

0:07:12 | we look for one low-dimensional space, and the i-vectors extracted model the largest variability between

0:07:16 | all these recordings

0:07:17 | in that low-dimensional space.

0:07:19 | So,

0:07:20 | and the anchor point is still the UBM.

0:07:23 | So all the new recordings can be mapped to this new space, and now we

0:07:27 | can represent any recording by a vector of fixed length.

0:07:31 | This can be modeled by this equation: we have the universal background

0:07:35 | model supervector m in the middle, and each recording's GMM supervector M can be explained by the UBM

0:07:41 | plus an offset Tw,

0:07:43 | where the offset describes

0:07:45 | what happened in this recording and is given by the i-vector w in the total variability

0:07:49 | space, the i-vector space.

0:07:51 | So once you have trained your total variability matrix, when

0:07:56 | you have a new recording, a new utterance, you extract the features, and after that you

0:07:59 | map it to your subspace. I'm sure you are all familiar with that.
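A minimal sketch of that mapping step, assuming the standard formulation M = m + Tw: given one recording's per-Gaussian occupancy counts and centered first-order statistics, the usual point estimate of w can be computed as below. The function and variable names, and all shapes, are illustrative and not from the talk.

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """Point estimate of the i-vector w for one recording, under
    M = m + T w with a standard normal prior on w.

    N     : (C,)     zeroth-order (occupancy) counts per Gaussian
    F     : (C, D)   centered first-order statistics per Gaussian
    T     : (C*D, R) total variability matrix
    Sigma : (C, D)   diagonal UBM covariances
    """
    C, D = F.shape
    R = T.shape[1]
    L = np.eye(R)            # posterior precision of w
    b = np.zeros(R)          # precision-weighted projected stats
    for c in range(C):
        Tc = T[c * D:(c + 1) * D, :]            # (D, R) block for Gaussian c
        inv_sig = 1.0 / Sigma[c]                # diagonal inverse covariance
        L += N[c] * Tc.T @ (inv_sig[:, None] * Tc)
        b += Tc.T @ (inv_sig * F[c])
    return np.linalg.solve(L, b)                # posterior mean of w
```

With heavy occupancy counts, the estimate converges to the offset that generated the statistics; with empty statistics it falls back to the prior mean, zero.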

0:08:03 | Now, I'm not going to tell you how to do

0:08:07 | speaker recognition; you have seen a lot of good

0:08:10 | talks during this wonderful

0:08:13 | conference. Instead, I will show you how we can

0:08:16 | do visualization with it.

0:08:18 | So first of all, for speaker recognition, these i-vectors have been applied to different kinds of

0:08:22 | speaker modeling tasks, like speaker identification, when you have a

0:08:26 | set of speakers and, given a recording, you want to identify who

0:08:30 | spoke in this segment; speaker verification, when you want to verify that

0:08:35 | two recordings are coming from the same speaker; or diarization, when

0:08:38 | you want to know who spoke when.

0:08:40 | So for the speaker recognition task I would like to show some visualizations

0:08:46 | that explain to you what's happening in the data if you don't do any

0:08:49 | channel compensation.

0:08:51 | I would like to acknowledge the work of Zahi Karam, who was actually a PhD student

0:08:55 | working with Bill Campbell at MIT, and he was working with us at the

0:08:59 | time.

0:09:00 | So we took the data of the NIST 2010

0:09:04 | speaker recognition evaluation. Our system was based on i-vectors, and at the time the

0:09:08 | system we built was actually a single system trained to

0:09:12 | deal with the telephone and microphone data in the same subspace.

0:09:15 | So we took about five thousand recordings from that data, and

0:09:21 | we built the cosine similarity between all the recordings.

0:09:24 | That makes a similarity matrix, and he built

0:09:28 | a ten-nearest-neighbor graph, so each recording is connected to

0:09:31 | its ten nearest neighbors,

0:09:33 | and then used the software called Gephi to do the graph visualization.
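The graph construction described here, pairwise cosine similarity followed by keeping each recording's ten nearest neighbors, can be sketched in a few lines before handing the edge list to a layout tool. A minimal NumPy illustration; the function name and default `k` are mine:

```python
import numpy as np

def knn_cosine_graph(X, k=10):
    """Build a k-nearest-neighbor graph from a (n, d) matrix of
    i-vectors. Returns a set of undirected edges (i, j) linking each
    vector to its k most cosine-similar peers."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit length
    S = Xn @ Xn.T                                       # cosine similarity
    np.fill_diagonal(S, -np.inf)                        # no self edges
    edges = set()
    for i in range(len(S)):
        for j in np.argsort(S[i])[-k:]:                 # top-k neighbors
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges
```

The resulting edge list is what a force-directed layout turns into the cluster pictures discussed below.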

0:09:37 | So in this graph, you know, the absolute location of a node is not

0:09:41 | important, but the relative distances between the nodes

0:09:44 | and the clusters are important, because

0:09:46 | they reflect how close the recordings are and how structured your data is.

0:09:50 | So here,

0:09:53 | this is exactly the female data, with the inter-session variability, the channel compensation,

0:10:00 | applied. You can see the colors are by speaker,

0:10:03 | so each point corresponds to a recording, and the clusters

0:10:08 | correspond to the speakers.

0:10:09 | For the people who actually went to the museum earlier this

0:10:13 | week, I was thinking this could almost be a piece of modern art hanging

0:10:17 | there.

0:10:20 | So,

0:10:21 | the thing is, what we started doing next: we said, okay,

0:10:25 | now let's remove the channel compensation and see what happens. Well, we

0:10:29 | lost the speaker clustering,

0:10:31 | and something happened: some new clusters appeared,

0:10:35 | and we said, well, what's going on here? So we went

0:10:39 | together and looked

0:10:41 | at the labels, and we started checking what's going on. For example, here,

0:10:45 | if you check all the microphones that were used for the different recordings,

0:10:49 | which microphone was used to record each of the recordings, you find that

0:10:54 | actually the clusters correspond to the microphones that have been used.

0:10:57 | And that was pretty surprising. For example,

0:11:00 | the telephone data forms one island, and this is the microphone

0:11:04 | data.

0:11:05 | And we also found that for the

0:11:08 | same microphone there were actually two clusters, and it is because of the room:

0:11:13 | the LDC at the time used two rooms to collect the data, so the two rooms

0:11:18 | were also reflected in your data.

0:11:20 | This is a deliberately simple visualization to show that, you know, I don't want

0:11:23 | to give you an equal error rate going from two to one point five or whatever, but I

0:11:26 | want to tell you that if you don't do anything about the channel compensation,

0:11:30 | it may be a big issue.

0:11:31 | So this is what happens:

0:11:33 | the data can be affected by the microphone, can be affected by the channels,

0:11:37 | and can also be affected by the room where it has been recorded.

0:11:42 | So this is what we get when we do apply the channel compensation,

0:11:47 | with the clustering by speaker, and the coloring of the visualization is

0:11:52 | by channel. You can see the channel compensation is doing a good job of

0:11:56 | trying to normalize this.

0:11:57 | Unfortunately, we can still recognize a male and female split,

0:12:01 | but the channel clusters are much better. This is the same

0:12:05 | data, and we see the same behavior. This is the

0:12:08 | one for the microphone data, which is the most interesting,

0:12:11 | and you can still see that split between the microphones, between room one and room

0:12:15 | two where the LDC collected the data.

0:12:19 | So this is actually a nice visualization

0:12:21 | that has been, you know, very helpful for us to understand, and to show

0:12:26 | people that what we are doing actually makes sense,

0:12:29 | and how we can still benefit from doing

0:12:36 | microphone channel compensation.

0:12:39 | So this is the same thing: honestly, after that, around

0:12:43 | 2011, I started looking at the language ID task,

0:12:47 | and I tried to do the same thing for visualization. In the language

0:12:52 | recognition task we have detection and identification variants, so I don't need

0:12:56 | to spend too much time on that. So here what I did is actually:

0:13:00 | I took the NIST 2009 evaluation; I had an i-vector extractor trained on the

0:13:03 | training data,

0:13:07 | and two hundred recordings for each language; I think we had like twenty-three

0:13:12 | languages.

0:13:13 | And I did the same thing: I built the cosine distance, or similarity,

0:13:20 | built the nearest-neighbor graph, and tried to visualize it. So this is what happened

0:13:24 | for this kind of language recognition task. For example, here it is interesting because we have,

0:13:29 | for example,

0:13:33 | American English and Indian English close together;

0:13:35 | we have Indian English and Hindi and Urdu, you know, they are very

0:13:39 | close together;

0:13:41 | Mandarin, Cantonese, Vietnamese, and Korean

0:13:44 | almost in the same cluster.

0:13:46 | So,

0:13:48 | also here Russian, Ukrainian, and Bosnian, and of course Croatian,

0:13:53 | are in the same cluster, and also French and Creole.

0:13:56 | So it's really a data-driven

0:13:58 | visualization that shows you how close the languages are

0:14:04 | from the acoustics

0:14:05 | that we are using to model the i-vector representation.

0:14:09 | So here, this is what, you know,

0:14:12 | i-vectors allow us to do, because you have this cosine

0:14:15 | distance between them; you can apply LDA to this as well.

0:14:19 | So

0:14:20 | with i-vectors we can represent the data and see what's happening in

0:14:25 | the data, and interpret what

0:14:26 | phenomena are going on.

0:14:28 | So i-vectors were a good tool for that.

0:14:31 | So let me now try to move on, because I

0:14:34 | know that you are all familiar with i-vectors and I don't want

0:14:37 | to spend too much time on them anymore; you probably prefer the more interesting

0:14:41 | topic of this talk. So after that, I started looking at

0:14:46 | the GMM weight adaptation, as I said, with the student

0:14:49 | Hassan Bahari,

0:14:50 | and the way the GMM weight adaptation works: there are actually several techniques

0:14:56 | that have been applied to it,

0:14:57 | for example maximum likelihood,

0:14:59 | the most simple way;

0:15:01 | and also non-negative matrix factorization, which is actually what Hugo Van

0:15:06 | hamme's group was working on; and the subspace multinomial model,

0:15:10 | which is what the BUT (Brno) people implemented and use;

0:15:14 | and what we proposed, which is called non-negative factor analysis. Because, you know, the

0:15:18 | GMM weight adaptation is a little bit tricky: you have the non-negativity of the

0:15:23 | weights, and they should sum to one, so these are constraints you

0:15:26 | have to deal with

0:15:27 | during the optimization, when you're training your

0:15:30 | subspace.

0:15:32 | So,

0:15:33 | the way it works: for example, you have a set of features from one

0:15:37 | recording,

0:15:39 | and you have a UBM model, and you compute the posterior distribution of

0:15:43 | a given component for each frame,

0:15:49 | given the UBM. So you get these posteriors, and then you

0:15:53 | accumulate them into counts;

0:15:55 | from that,

0:15:56 | in order to get the GMM weight adaptation,

0:16:00 | you try to maximize the log-likelihood function given here.

0:16:02 | And if you want to do maximum likelihood, the way to do it is

0:16:06 | you accumulate all these posteriors over time and divide by the number of frames that

0:16:11 | you have; that gives you the maximum likelihood weights.
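A minimal sketch of that maximum-likelihood weight re-estimation, assuming a diagonal-covariance GMM; the function names and array shapes are illustrative, not from the talk:

```python
import numpy as np

def responsibilities(X, weights, means, variances):
    """Frame posteriors P(c | x_t) under a diagonal-covariance GMM.
    X: (T, D) frames; weights: (C,); means, variances: (C, D)."""
    # log N(x_t | mu_c, sigma_c^2) per frame and component -> (T, C)
    logp = -0.5 * (((X[:, None, :] - means) ** 2 / variances
                    + np.log(2 * np.pi * variances)).sum(axis=2))
    logp += np.log(weights)
    logp -= logp.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def ml_weight_update(posteriors):
    """Accumulate the (T, C) posteriors over time and divide by the
    number of frames: the maximum-likelihood GMM weight estimate."""
    return posteriors.sum(axis=0) / posteriors.shape[0]
```

The accumulated, unnormalized column sums are exactly the "counts" that the subspace methods below take as input.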

0:16:13 | Or

0:16:14 | you can, for example, do non-negative matrix factorization,

0:16:18 | which consists of splitting this weight adaptation into small

0:16:23 | nonnegative matrices,

0:16:25 | bases that

0:16:25 | also maximize the log-likelihood function given here. The input is the counts,

0:16:31 | and you try to estimate the two matrices: the subspace bases and

0:16:36 | the representation of each recording in this subspace,

0:16:38 | to characterize the weight adaptation.

0:16:41 | So this non-negative matrix factorization is described in the paper

0:16:46 | of Hugo Van hamme's student.
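As a hedged sketch of such a decomposition: the classic multiplicative updates for the KL objective are one common way to factor nonnegative per-recording count vectors into bases plus low-dimensional coordinates. This is the generic textbook algorithm, not necessarily the exact variant of the cited paper:

```python
import numpy as np

def nmf_kl(V, r, iters=300, seed=0):
    """Factor a nonnegative count matrix V (recordings x components)
    as W @ H with W, H >= 0, via Lee-Seung multiplicative updates for
    the KL divergence. H holds the r bases; each row of W is the
    low-dimensional representation of one recording."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 0.1
    H = rng.random((r, m)) + 0.1
    eps = 1e-12                                   # avoid division by zero
    for _ in range(iters):
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    return W, H
```

The multiplicative form keeps both factors nonnegative automatically, which is why it fits count data like the accumulated posteriors.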

0:16:48 | What BUT implemented is the subspace multinomial model, where you have a multinomial distribution,

0:16:52 | which is described here.

0:16:58 | So we have this subspace that describes the i-vector representation in

0:17:05 | the weight subspace: the weights are given by a softmax, so we have, you

0:17:10 | know, the UBM log-weights plus a shift, and the softmax makes sure

0:17:14 | that the weights obtained are normalized to sum to one.
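That softmax construction can be sketched directly; a minimal illustration of how the subspace multinomial model guarantees valid weights (the function and variable names are mine):

```python
import numpy as np

def smm_weights(m, T, v):
    """Subspace multinomial model: component weights are the softmax
    of the UBM log-weights m plus a low-rank offset T @ v, which
    enforces positivity and sum-to-one by construction.
    m: (C,) log-weights; T: (C, R) subspace; v: (R,) factor."""
    z = m + T @ v
    z -= z.max()              # numerical stability
    w = np.exp(z)
    return w / w.sum()
```

With v = 0 the model falls back exactly to the UBM weights, and any v yields a valid weight vector, so no explicit constraint handling is needed during optimization.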

0:17:18 | The good part of it here is that this is very good when you

0:17:22 | have nonlinear data to fit. For example, here is an example; I would like to

0:17:25 | thank

0:17:26 | especially Oldrich Plchot for giving me the slides with this

0:17:32 | picture.

0:17:33 | Here, for example, you have a GMM of two Gaussians,

0:17:37 | and we try to simulate; each point corresponds to one recording's weight adaptation,

0:17:41 | for example by ML estimation.

0:17:44 | And we tried to simulate what happens when you have a large GMM, so we

0:17:48 | have some sparsity: not all the Gaussians appear. So we can see that the data

0:17:51 | concentrates here in the corners;

0:17:55 | some of the

0:17:56 | Gaussians here would not appear. This is just a simulation

0:18:00 | of what happens when you have a large UBM.

0:18:03 | So we can see, for example, in this case how the data looks

0:18:06 | like,

0:18:06 | and this subspace multinomial model is very

0:18:13 | good at fitting this data.

0:18:15 | But it has a drawback: it may overfit. That's why the BUT guys

0:18:19 | use regularization to keep it from overfitting.

0:18:22 | So Hassan's work at the time was trying to do the same

0:18:28 | thing as an i-vector: you have the UBM weights, and you

0:18:33 | say that the weights for a new recording

0:18:37 | will be the UBM weights plus an offset.

0:18:40 | And the constraints here:

0:18:42 | the weights should sum to one, and they should be nonnegative. So

0:18:46 | we developed an EM-like approach,

0:18:52 | where we alternate steps to maximize the likelihood objective function:

0:18:58 | in one step you compute all the i-vectors given

0:19:02 | the subspace L, then you update L given them, and you iterate until convergence.

0:19:06 | So let's say we try to maximize the likelihood of the data as

0:19:10 | a function of the subspace, subject to the constraints that the weights sum to one and are

0:19:14 | nonnegative;

0:19:16 | projected gradient ascent can be used to do that.
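A rough sketch of the per-recording estimation step, assuming the subspace is already trained and each of its columns sums to zero so the sum-to-one constraint is preserved automatically. This is a simplified illustration of the projected-gradient idea, not the exact published algorithm:

```python
import numpy as np

def nfa_factor(counts, w0, L, steps=2000, lr=1.0):
    """Estimate the factor x of one recording in the non-negative
    factor analysis model w = w0 + L x, by projected gradient ascent
    on the multinomial log-likelihood sum_c p_c log w_c.
    Assumes w0 > 0 and that every column of L sums to zero, so w
    always sums to one; the projection keeps every weight positive."""
    p = counts / counts.sum()            # normalized counts
    x = np.zeros(L.shape[1])
    for _ in range(steps):
        w = w0 + L @ x
        x += lr * (L.T @ (p / w))        # gradient of the log-likelihood
        while np.any(w0 + L @ x <= 0):   # projection: shrink toward the
            x *= 0.9                     # feasible point x = 0
    return x
```

Because the log-likelihood is concave in x, this simple ascent converges to the constrained optimum for a fixed subspace; the full training alternates this step with a subspace update.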

0:19:18 | And if you go to the reference, you can find all

0:19:21 | the information; I don't want to go too deep into it for this talk,

0:19:25 | no.

0:19:27 | So

0:19:28 | the difference between, for example, the non-negative factor analysis and the SMM is

0:19:32 | actually

0:19:33 | shown in this table. The NFA tends not to overfit, because

0:19:39 | its approximation of the data would not touch the corners, compared to

0:19:44 | the SMM.

0:19:47 | That is sometimes good and sometimes bad, depending on which application you are targeting,

0:19:52 | but we compared them on several applications, and the SMM and the

0:19:56 | non-negative factor analysis in practice tend to

0:19:59 | behave almost the same.

0:20:01 | So these discrete i-vectors have been applied for several applications and purposes: for example, modeling

0:20:07 | of prosody, which is what Marcel Kockmann did for his PhD;

0:20:11 | phonotactics, when you model the n-grams, which has also been done with this

0:20:15 | method;

0:20:17 | and also what we did for the GMM weight adaptation for language recognition and

0:20:23 | dialect recognition with Hassan's work.

0:20:26 | In this paper we compared all these techniques: the NMF,

0:20:31 | the SMM, as well as the non-negative factor analysis. You

0:20:33 | can go and check that:

0:20:35 | they almost behave the same for GMM weight adaptation.

0:20:38 | So now, in order to get to the fun part:

0:20:44 | how can we use these

0:20:48 | discrete i-vectors to model the,

0:20:51 | to model the DNN activations? At the time I was actually

0:20:55 | motivated by

0:20:57 | this picture.

0:20:59 | I was watching a talk, I think by one of the Google Brain guys,

0:21:03 | given somewhere or other, and he was

0:21:06 | showing that if you train something like a deep belief network, an unsupervised

0:21:11 | auto-encoder,

0:21:13 | and you train it on millions of unlabeled YouTube

0:21:17 | images,

0:21:20 | he said that if you take one neuron at the top, you maybe

0:21:23 | can actually reconstruct

0:21:25 | the picture that excites it, and he was saying: okay, I can see the cat

0:21:28 | face.

0:21:29 | And I thought: okay, can we do something for speech? Sure, it's

0:21:33 | a continuous time series, but

0:21:35 | the idea was there: can we actually see how the data goes through the DNN

0:21:40 | hidden layers? That is exactly what motivated me to start this work.

0:21:45 | So remember, as I said before, we have a recording and we extract from it

0:21:50 | a set of features.

0:21:53 | Earlier we gave these features to a GMM; now let's just remove the GMM

0:21:57 | and give them

0:21:58 | to the DNN. So, for example, we can do language recognition the easy way,

0:22:03 | where you do frame-by-frame modeling; that's what the

0:22:07 | Google guys did in their 2014 paper: the input is a

0:22:12 | segment treated frame by frame, and the output is a language, and

0:22:16 | I will show the same kind of experiment.

0:22:20 | Note that when you have a new recording and you want to make the decision,

0:22:24 | you do a frame-by-frame decision and average, or take the max of the

0:22:29 | output; that's largely what we compare to. And you can also, for example, look

0:22:35 | at more of what is seen in the DNN, how the data is

0:22:38 | represented for this task.

0:22:43 | So imagine you have a trained DNN. The way that we do it

0:22:46 | now,

0:22:47 | as I said earlier, is we take the DNN and we take

0:22:51 | the output to make a decision,

0:22:54 | or use it as an alignment, for example for UBM i-vectors,

0:22:57 | or we take one hidden layer

0:22:59 | and use it as bottleneck features.

0:23:02 | But whatever we do, we only see one level of the DNN,

0:23:07 | only one,

0:23:08 | one hidden layer or the output; we don't see how the DNN actually propagates

0:23:12 | the information over

0:23:13 | all its layers, all the parts of the DNN. And the reason it matters: for

0:23:17 | example, imagine you have sparse coding in each,

0:23:21 | for example in each hidden layer,

0:23:23 | and say for each input only fifty percent of your

0:23:27 | neurons are active, for example because of dropout.

0:23:33 | Then the way the DNN encodes information, for example for class one,

0:23:38 | here and there can

0:23:40 | be different,

0:23:42 | because of some randomness in the way it propagates and encodes information. So if you

0:23:47 | can model the pattern of activations, how the class goes

0:23:52 | through the DNN,

0:23:54 | this is information that's available there but that we're not using,

0:23:58 | and that's exactly what motivated me to do this work.

0:24:02 | So can we look at the whole DNN and see how the information progresses there? And,

0:24:07 | you know, what I'll show is one way to do it; maybe it is not

0:24:10 | the best way, maybe it doesn't always work, but this is one way to do it.

0:24:14 | So the idea, what we tried to do, is:

0:24:17 | since we have these discrete i-vectors that are based on counts

0:24:21 | and posteriors, can I use them to model

0:24:24 | an i-vector for each hidden layer?

0:24:27 | That's what we built: for example, for the DNN here, we use an

0:24:30 | i-vector representing hidden layer one,

0:24:31 | another one representing the last hidden layer, and so on. And

0:24:36 | in order to do that, I need to have some counts,

0:24:39 | so that I am able to apply my GMM weight

0:24:43 | adaptation techniques. So here is how

0:24:47 | you can obtain the counts.

0:24:49 | For example, you can compute a posterior-like normalized activation for each neuron:

0:24:55 | in a hidden layer, for each input, you normalize the activations to sum to one,

0:24:58 | artificially, kind of, because the DNN was not trained to do that,

0:25:02 | and then you accumulate them over time, and they become counts, because here you

0:25:07 | made them sum to one,

0:25:10 | and you can use the same GMM techniques again; you don't change anything

0:25:13 | in them.

0:25:14 | The second one is to apply a softmax, for example:

0:25:17 | a similar thing, but you apply a softmax so the activations sum to one,

0:25:21 | and accumulate that; you can also train the network with a softmax as well.

0:25:24 | But the most important one, the one that makes the most sense of all these ad

0:25:28 | hoc

0:25:29 | normalizations,

0:25:30 | is to compute the probability of activation p of each neuron and its complement, one minus p,

0:25:35 | so you can consider this pair as a normalized two-component GMM per neuron.

0:25:40 | Now we only model what the DNN

0:25:43 | gives us as responses, so we don't normalize anything across neurons.

0:25:47 | So here, for example, if you have one thousand twenty-four neurons,

0:25:51 | you double that, and you would have

0:25:55 | a thousand twenty-four

0:25:57 | two-Gaussian GMMs, and you use the subspace model to model that, with

0:26:01 | the constraint that each activation and its complement sum

0:26:05 | to one. And in this case you can't do anything wrong, because you're modeling

0:26:09 | the exact behavior of the DNN.
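The three count constructions just described can be sketched as one helper; a minimal illustration assuming a (frames x neurons) activation matrix, with values in (0, 1) for the sigmoid case (the function name and mode labels are mine):

```python
import numpy as np

def layer_counts(H, mode="sigmoid"):
    """Turn a (T, K) matrix of hidden-layer activations into the
    nonnegative 'counts' fed to the weight-subspace models.
    'norm'    : normalize each frame's activations to sum to one, then
                accumulate over time (an artificial normalization).
    'softmax' : apply a per-frame softmax before accumulating.
    'sigmoid' : model each unit as a two-component Bernoulli (p, 1-p),
                so no cross-unit normalization is forced on the DNN."""
    if mode == "norm":
        P = H / H.sum(axis=1, keepdims=True)
    elif mode == "softmax":
        Z = H - H.max(axis=1, keepdims=True)     # numerical stability
        P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    elif mode == "sigmoid":
        P = np.concatenate([H, 1.0 - H], axis=1)  # H assumed in (0, 1)
    else:
        raise ValueError(mode)
    return P.sum(axis=0)                          # accumulate over frames
```

In the sigmoid case the count vector has 2K entries, and each neuron's pair of counts sums to the number of frames, matching the per-neuron two-component view described above.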

0:26:11 | so |

0:26:12 | we tried to compare few of them but we didn't will i'm not going too |

0:26:15 | much in a detector the want to make too much numbers here to confuse you |

0:26:19 | there will be have the same one |

0:26:22 | so in this case, for example, for the first

0:26:26 | application, the Arabic dialect identification

0:26:28 | I use non-negative factor analysis

0:26:30 | and for the NIST data the subspace multinomial model rather than the NFA model

0:26:34 | because I wanted to show that both actually work; there is no real distinction between

0:26:37 | the two

0:26:38 | so here, let's say

0:26:40 | in the non-negative factor analysis, you model the weights of a new recording as the UBM

0:26:44 | weights plus a subspace shift; so how do we compute the UBM weights? I usually take

0:26:49 | all the training data, extract the counts for each recording, normalize

0:26:54 | them, and take the average, and that's my UBM; so every UBM weight is

0:26:58 | just the average response of a neuron in the hidden layers

0:27:02 | for a given unit

0:27:04 | of the hidden layers, over all the recordings

0:27:08 | so then you can use the non-negative factor analysis

0:27:12 | to do that
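
The "average response" UBM the speaker describes can be sketched in a few lines (names are assumptions of mine, not from the talk): normalize each recording's counts so they sum to one, then average over recordings.

```python
import numpy as np

def ubm_weights(count_list):
    """count_list: per-recording activation-count vectors (each length N).
    L1-normalize each recording's counts, then average over recordings —
    a simple stand-in for the 'average response' UBM the talk describes."""
    normed = [c / c.sum() for c in count_list]
    w = np.mean(normed, axis=0)
    return w / w.sum()  # guard against rounding; still sums to one
```

A new recording's weight vector is then modeled as this average plus a low-dimensional shift learned by the factor analysis.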

0:27:14 | so now |

0:27:15 | now, the interesting part is that non-negative factor analysis, and the other subspace approaches

0:27:19 | can also help you model all the hidden layers together; one way to

0:27:23 | do it, for example: you can build i-vectors for each layer in a separate subspace, then you

0:27:28 | can concatenate the i-vectors

0:27:30 | and you would have

0:27:31 | or you could have one

0:27:33 | vector that actually models everything, with the constraint that each hidden layer's weights sum to one

0:27:38 | and this will allow you to see how

0:27:41 | the correlation is happening between all the activations of your hidden layers

0:27:45 | and that's exactly what we did

0:27:48 | so |

0:27:49 | in order to do that we extended, for example, the non-negative factor analysis

0:27:53 | so you have a different UBM, each one corresponding to a hidden layer, and you

0:27:58 | would have a common

0:28:00 | vector that controls all the output for each layer; sorry, you have

0:28:05 | a common

0:28:07 | i-vector for the weights of all the hidden layers
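
A crude way to prototype this tied-subspace idea is the following sketch: L1-normalize each layer's counts, stack the layers into one supervector per recording, and learn a single shared low-dimensional projection. Here a centered SVD (PCA) stands in for the actual non-negative factor analysis, and all names are my own assumptions.

```python
import numpy as np

def joint_layer_vectors(per_layer_counts, dim=2):
    """per_layer_counts: list over recordings; each entry is a list of
    per-layer count vectors. Returns one joint low-dim vector per recording.
    `dim` must not exceed min(n_recordings, total stacked dimension)."""
    X = np.array([np.concatenate([c / c.sum() for c in layers])
                  for layers in per_layer_counts])
    X = X - X.mean(axis=0)                      # center the supervectors
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:dim].T                       # shared projection over all layers
```

The point of the shared latent vector, as in the talk, is that it couples the layers, so correlations between activations across layers are captured in one representation.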

0:28:14 | so with that, let me go through some experiments and show some

0:28:21 | results

0:28:22 | so the first experiment that I would like to show is Arabic dialect

0:28:25 | id; we have a small corpus

0:28:29 | we were interested in this task; we have five dialects

0:28:33 | I don't remember exactly how many recordings for training

0:28:36 | it's about forty hours of training in total, roughly ten or fifteen

0:28:40 | hours, an hour or three per dialect

0:28:42 | and we have separate sets for training, development, and eval

0:28:47 | so I trained a DNN

0:28:49 | to

0:28:51 | solve this five-class problem, a DNN with five

0:28:55 | hidden layers

0:28:56 | the first one has about two thousand units, and after that I

0:29:00 | have all the remaining hidden layers at five hundred

0:29:06 | five hundred

0:29:07 | then the input at training time is the same

0:29:11 | features, a stack of

0:29:14 | I think twenty-one frames of features; the output is the five dialect classes

0:29:20 | the same setup as the Google LID paper with DNNs

0:29:25 | then, once you get the i-vectors, I used cosine scoring with LDA, as

0:29:30 | people described earlier today

0:29:32 | and the best configuration we found for this task is almost

0:29:37 | full rank

0:29:39 | about a thousand five hundred, and similarly for the other ones
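
The cosine-scoring back end mentioned here can be sketched minimally (the LDA projection is omitted; vectors are assumed already projected, and the names are mine): score a test i-vector against per-class mean vectors by cosine similarity.

```python
import numpy as np

def cosine_score(class_means, test_vec):
    """class_means: dict of class -> mean i-vector; test_vec: test i-vector.
    Returns cosine similarity of the (unit-normalized) test vector
    against each unit-normalized class mean."""
    t = test_vec / np.linalg.norm(test_vec)
    return {cls: float(np.dot(m / np.linalg.norm(m), t))
            for cls, m in class_means.items()}
```

The predicted class is simply the argmax over these scores.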

0:29:42 | so the first result I show is the i-vector result

0:29:47 | and here the i-vectors are actually worse than the DNN with the

0:29:52 | average of the output

0:29:54 | which means that for each frame you compute the posterior

0:29:57 | for the five classes, you average them, and you take the max, which is exactly what

0:30:01 | the Google paper describes; and it is better because the characteristic of

0:30:07 | this data is that the recordings are very short, around thirty seconds

0:30:12 | or sometimes less

0:30:14 | so we know that if you take the DNN and you

0:30:17 | average the scores, it's better on short segments; you have already seen talks on Wednesday afternoon

0:30:22 | showing that

0:30:23 | even for NIST data; and this is the error rate, sorry, so lower is

0:30:29 | better
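
The direct frame-averaging baseline described above fits in a couple of lines (a sketch, names assumed): average the per-frame class posteriors over the recording and take the argmax.

```python
import numpy as np

def average_posterior_decision(frame_posteriors):
    """frame_posteriors: (T, C) per-frame class posteriors from the DNN.
    Average over frames and pick the argmax class — the direct
    score-averaging approach from the Google-style LID setup."""
    avg = frame_posteriors.mean(axis=0)
    return int(np.argmax(avg)), avg
```

This is the baseline that the hidden-layer i-vector representations are compared against in the rest of the talk.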

0:30:30 | so now I will show what happens when you do the i-vectors in

0:30:36 | the hidden layers, starting from layer one up to layer five, and how the

0:30:42 | results evolve

0:30:44 | the deeper you go, the better the results, which is what we know

0:30:48 | this matches the understanding people have in other fields like vision, and we

0:30:53 | were able to show the same thing here, so

0:30:55 | you can see that going from layer one to the deeper layers the error

0:30:59 | rate goes down; and I kept

0:31:02 | five layers because I want to show that sometimes there's no need to go too

0:31:05 | deep

0:31:06 | for example, layer five is already saturated

0:31:09 | layer five didn't add anything, but I kept it just to show that

0:31:13 | sometimes we try to make it really deep, but it's not necessary

0:31:17 | so this is one example where you really don't need to do it

0:31:21 | so |

0:31:22 | the point is, now we can also see

0:31:26 | the accuracy of each of the hidden layers, and we were also able

0:31:29 | to show that the deeper you go in the network, the better the

0:31:33 | results; so you will probably get more information by

0:31:36 | modeling all the hidden layers, maybe a better representation

0:31:40 | so here this is an LDA visualization

0:31:44 | a two-dimensional projection

0:31:46 | of the five classes with LDA; and I

0:31:51 | remember the first time I presented this work, at this slide people

0:31:55 | said, well, why did you apply LDA first? I said, that's true, I forgot to show it without

0:31:59 | that

0:32:00 | so this time I didn't forget

0:32:02 | so what I did: I took the raw i-vectors, for example for the

0:32:05 | last layer

0:32:06 | and I ran t-SNE on them; so now here just

0:32:10 | the raw i-vectors go into t-SNE, no LDA; and you can see that

0:32:14 | for example the origin is around here, and the scatter radiates outward

0:32:18 | from it

0:32:19 | which is a sign that, okay, length normalization will be useful again

0:32:23 | so that's what you want to do; length normalization does the same thing

0:32:27 | as in the speaker area

0:32:28 | it's the same thing; length normalization is also useful here

0:32:35 | I'm not sure about this projection; unfortunately I was hoping to see a different behaviour, but

0:32:38 | it essentially behaves the same way

0:32:42 | so this is using t-SNE, and this is on the raw

0:32:45 | i-vectors
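
The length normalization the speaker keeps invoking is just projection onto the unit sphere, as in speaker recognition; a minimal sketch (names mine):

```python
import numpy as np

def length_normalize(ivectors, eps=1e-12):
    """Project each i-vector (rows of `ivectors`) onto the unit sphere —
    the standard length normalization from speaker recognition that the
    talk says also helps for these hidden-layer i-vectors."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / np.maximum(norms, eps)
```

After this step, cosine scoring reduces to a dot product.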

0:32:46 | the reason I was asked this question is that the DNN I

0:32:49 | used was discriminatively trained for the task

0:32:51 | so how well does it really represent

0:32:54 | the data, and is the layer structure important? that was an important thing to check

0:32:58 | so this is one thing that we tracked

0:33:01 | so now |

0:33:04 | to summarize here: the i-vector result; then the DNN

0:33:09 | averaging the scores of the frames, which is better than i-vectors; then modeling

0:33:14 | the hidden layers, which is better still

0:33:17 | and the results; and I can say, from all my experiments, what I have

0:33:21 | been seeing is that the output of the last layer is the worst

0:33:25 | one in terms of information, so don't take decisions there

0:33:28 | with this data we saw that the real information is actually in the hidden layers

0:33:32 | there's no doubt about it

0:33:34 | so here I give the last hidden layer result, and then, if you model

0:33:39 | everything jointly, you gain more again

0:33:42 | you get another two percent gain by modeling all the hidden layers

0:33:46 | and the same thing happens with the NIST data

0:33:49 | so my point here is: yes, it's true, for the hidden layers

0:33:52 | the deeper you go, the better

0:33:55 | but if you also look at all the correlations happening across all the hidden layers

0:33:59 | it's actually better

0:34:02 | and the reason, for example: even people who do

0:34:06 | brain imaging and vision and so on, who want to study the activations

0:34:11 | they can observe them at one

0:34:14 | level, but they cannot see how the information propagates; maybe someone can correct me about

0:34:18 | that if I'm wrong; and this way we can do the same thing for

0:34:22 | the DNN: we can

0:34:23 | tap one hidden layer, or we can see what's happening across the whole DNN

0:34:26 | it's the same idea, okay

0:34:28 | you can do an MRI to see the activations as they happen, or

0:34:32 | you can cut at one level and make a decision; it's the same

0:34:36 | kind of question; so this is the same behaviour, and here I'm just saying

0:34:40 | that

0:34:41 | the DNN has more information than we are now using

0:34:45 | because we are not looking at the path of activations that it took to encode

0:34:49 | this data

0:34:51 | so that was dialect id; you're probably less familiar with that, so let me move

0:34:54 | on to the NIST data; but before that, I did an experiment, because

0:35:00 | in the old days the i-vector was completely unsupervised, and I was thinking, okay, the DNN

0:35:05 | I used is actually

0:35:07 | discriminatively trained for this specific task

0:35:10 | can I have a DNN that is just used to encode the data, an

0:35:14 | encoder

0:35:15 | for example

0:35:17 | and the simplest way to do it, I said, let me just try

0:35:20 | greedy layer-wise RBM learning and see what

0:35:24 | happens; I'm sure that people have more sophisticated networks for that

0:35:28 | so I tried RBMs with the same architecture I trained before, the same

0:35:32 | data, the speech frames as input

0:35:36 | and I used the dimensionality-reduction subspace on top, and cosine distance; so

0:35:42 | we use five layers of RBMs

0:35:44 | and

0:35:45 | these are the results; the i-vectors here, the DNN output there

0:35:49 | but I had some struggle, because I cannot do better than the first layer

0:35:55 | with the RBM encoders

0:35:58 | so the first layer gives me the best results; it's not as

0:36:01 | good as

0:36:03 | the discriminatively trained network with the subspace i-vectors, but

0:36:10 | it's not that bad

0:36:12 | and that's what we have been seeing

0:36:14 | so among the hidden layers, the first one you train is actually the best one; the more

0:36:19 | you go deeper

0:36:21 | the worse it gets; and my

0:36:23 | hypothesis, I'm not sure if it's true

0:36:26 | is that it's because they are not jointly trained

0:36:28 | altogether

0:36:30 | if, maybe, all of

0:36:34 | the layers were jointly trained to maximize the likelihood of the data, it may be a

0:36:38 | different story; and that's what we are trying to investigate now

0:36:43 | with my students: can we train, for example, a variational autoencoder to

0:36:47 | maximize the likelihood of the data

0:36:49 | and see whether

0:36:50 | this representation is meaningful or not

0:36:53 | so this is one thing that we are trying to explore

0:36:56 | so now, for people that are more familiar

0:37:01 | with the NIST data: as you saw in the Wednesday afternoon

0:37:06 | session, people model these six languages

0:37:09 | I tried the same thing; so we selected, with the help of a

0:37:12 | colleague who gave me this subset of the data

0:37:17 | six languages including Korean, Mandarin, Russian, and Vietnamese

0:37:21 | and the difference between us and other people: people try to select the evaluation

0:37:25 | data so as to remove the mismatch, and train only on matched

0:37:29 | data to avoid the mismatch

0:37:32 | but I wanted to know what's going on

0:37:34 | so for us we put everything together

0:37:37 | and it seems that we didn't have this issue

0:37:39 | so that's the difference between our paper and some of the other papers in

0:37:43 | that section of the

0:37:45 | Wednesday afternoon; so we put everything together, and we trained a DNN

0:37:49 | that takes the frames as input, and the output is the six classes

0:37:54 | and actually, before that, I will say

0:37:58 | I trained five hidden layers of about a thousand units each

0:38:03 | the input is a stack of twenty-one frames, with context on each side

0:38:10 | ten of context on each side, sorry; the output is the classes

0:38:14 | the six classes; I used the same scoring as before, of course

0:38:19 | cosine scoring in this case

0:38:21 | so here are the results on a subset of the 2009 eval

0:38:26 | for the six languages

0:38:28 | so there are the i-vector results at thirty seconds, ten seconds, and three seconds

0:38:32 | and the average of the scores, which is what everyone is doing with the

0:38:37 | direct approach

0:38:38 | and

0:38:40 | the characteristic of this, as has been said before

0:38:44 | is that the score

0:38:49 | average only beats the i-vectors at three seconds; at thirty seconds and ten seconds

0:38:53 | it doesn't work

0:38:55 | but what happens when you do the hidden layers is a little bit of a different story

0:39:00 | again, the deeper you go in the DNN, the better it is

0:39:05 | so it's the same trend, not a different story here

0:39:09 | but the thing is

0:39:12 | actually here, for the three-second condition, no one had been able

0:39:16 | to beat this, because

0:39:18 | if you use the hidden layers, and for example I go to hidden layer

0:39:21 | five

0:39:23 | it obtains the best result everywhere

0:39:25 | for thirty seconds, for ten seconds, and for just three seconds

0:39:29 | and this is actually interesting: hidden layer

0:39:33 | five is just the one preceding the output

0:39:38 | so this is one more sign that the output layer is the one that you really don't need

0:39:42 | to look at

0:39:43 | based on my experience; and here again you see that the last

0:39:47 | hidden layers are actually much better than

0:39:50 | the i-vectors, as well as the DNN output average

0:39:56 | so the hidden-layer i-vector representation in this case seems to do

0:40:01 | an interesting job of aggregating and pooling

0:40:05 | the frame data to make one representation of the recording, and you can do classification

0:40:09 | with it

0:40:09 | so this is an interesting finding; it was actually surprising to see what's

0:40:14 | in the data

0:40:15 | so now |

0:40:16 | what happens when you model all the hidden layers together as well

0:40:21 | so here I show the

0:40:25 | i-vector representation, the DNN average score, as well as the last hidden

0:40:28 | layer, five

0:40:30 | and

0:40:34 | I also tried to see what happens if you do

0:40:39 | all the hidden layers: do we gain something

0:40:44 | and you can also win, almost zero point eight; sorry, I forgot

0:40:48 | to put the averages right in there; so we can see that for thirty seconds the error

0:40:53 | is already low

0:40:54 | so I don't take that too seriously

0:40:57 | we lose a little bit there, but for ten seconds we were able to win

0:41:02 | and for three seconds we were also able to

0:41:06 | so it's the same behaviour: all the hidden layers together

0:41:10 | carry better information than any single layer at a time

0:41:14 | and also the last hidden layer is better than the first layer

0:41:20 | so among the hidden layers the deeper ones are the ones worth

0:41:24 | looking at, but the final output layer is not that interesting

0:41:31 | in terms of making decisions

0:41:33 | so, to be honest, one explanation is that these DNNs

0:41:38 | tend to overfit

0:41:39 | which this one did

0:41:41 | a little bit

0:41:43 | but even when they overfit like that, if you use them to make a representation, to

0:41:48 | describe your space

0:41:50 | they still work fine; if you try to make decisions with them while overfitting, that's a

0:41:54 | different story

0:41:55 | that's one thing here

0:41:56 | so this is what I have been finding this last

0:42:01 | year, trying to use these models to

0:42:04 | understand what's going on

0:42:07 | so |

0:42:09 | so let me try to conclude |

0:42:12 | so we have five minutes, and I have a few things that I want to say

0:42:16 | the i-vector representation is an elegant way to do a representation of

0:42:22 | speech of different lengths; a lot of people already use it

0:42:25 | in a lot of

0:42:27 | work on recordings, the case where you have long segments and

0:42:31 | short segments

0:42:32 | the GMM weight adaptation subspaces, as I have

0:42:39 | shown, as you have seen in this talk, can be applied to model the

0:42:43 | DNN activations

0:42:44 | in the hidden layers as well, and they do a good job

0:42:48 | so what is actually the take-home here

0:42:51 | the statement that I want to focus on, the thing to underline, is that

0:42:56 | the information that the DNN is modeling is not in the output; it isn't

0:43:01 | inherently

0:43:03 | located there, so

0:43:05 | don't try to make a decision directly from the output

0:43:09 | so

0:43:11 | and also, looking at one layer at a time and not

0:43:17 | seeing what's going on in all the hidden layers

0:43:19 | may be a mistake; it may be good to also look

0:43:23 | at that

0:43:24 | because it will tell you how the information went through the whole DNN

0:43:28 | and how it chose each class to be modeled

0:43:32 | that's something that seems to be

0:43:35 | very useful

0:43:37 | the subspace approaches I have been trying are one way I thought of

0:43:42 | to do this work, especially in terms of modelling all the hidden layers

0:43:47 | that we can use, and they seem to do a good job of

0:43:51 | pooling and aggregating all the frames to give you one

0:43:56 | representation with the maximum information you can use for

0:43:59 | your classification task

0:44:02 | so this seems to be very good even if the DNN was trained

0:44:07 | frame based

0:44:08 | so the DNN is trained at the frame level, and you use it to make a

0:44:12 | sequential classification

0:44:15 | the i-vector representation seems to be doing a really good job for that

0:44:22 | so |

0:44:23 | let me take two minutes

0:44:25 | for the future work

0:44:28 | so |

0:44:30 | for future work, the tracks that we have been exploring, my students and

0:44:35 | my colleagues

0:44:37 | are

0:44:37 | as I said earlier, the DNNs I have been using are frame based

0:44:42 | with a fixed segment length

0:44:43 | a frame context of twenty-one or something like that

0:44:47 | so we are trying to shift to

0:44:50 | DNNs with more memory, like for example time-delay neural networks

0:44:56 | or LSTMs, which are a

0:44:58 | special case of recurrent networks; that's what Ruben is doing

0:45:02 | my intern; so we are trying to explore that, instead of frame-by-frame, to

0:45:09 | extract and model more of the speech dynamics

0:45:14 | and to explore such vectors for speaker recognition

0:45:17 | to make them more useful for speakers

0:45:20 | we're still working on that as well

0:45:23 | and as I said earlier, I would be interested, if people are inspired by my

0:45:26 | talk, to meet; I mean, maybe there is a better way to do

0:45:30 | autoencoders

0:45:31 | that really encode the speech data well

0:45:35 | and my hope is that at some point we would be able

0:45:39 | to get some speech-modeling DNN, a speech encoder, you

0:45:43 | know

0:45:44 | it just encodes the speech, and after that I use it to describe my space and

0:45:48 | use it for a task; for example, I give you

0:45:51 | a bunch of thousands of recordings, you encode your data, and after that you say

0:45:55 | I want to do speaker, I want to do language

0:45:58 | can I do both from the

0:46:00 | from the same model

0:46:01 | that just encodes speech

0:46:03 | so if anyone has any idea or have any tell please come talk to me |

0:46:09 | so also, to make the activations more interesting

0:46:15 | I'm interested in exploring the sparsity of activations per hidden layer

0:46:19 | no, I'm not doing it specifically; I'm trying to use it during DNN training; but

0:46:23 | is there a way — for example, one way I'm doing now, and we didn't

0:46:27 | have time to compare the results, is dropout

0:46:30 | for example, I say

0:46:32 | that for each input, fifty percent of my units in each layer

0:46:36 | are active

0:46:37 | so there is some randomness between the hidden layers, because

0:46:41 | I find that actually, if you look at a DNN, two

0:46:44 | consecutive hidden layers are sometimes redundant, because they are close together, but separating them

0:46:50 | is actually better; the separation between the two helps

0:46:53 | sometimes

0:46:54 | so if you do sparse activation, for example with dropout, obviously the simplest way

0:46:58 | to do it

0:46:59 | you make them complementary, because there's some randomness happening in the middle

0:47:03 | forcing the DNN to take a different path through each hidden layer

0:47:07 | each time
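
The "fifty percent active" idea is ordinary dropout; a minimal sketch of sampling such a mask (names and the inverted-dropout scaling are my assumptions, not from the talk):

```python
import numpy as np

def dropout_mask(n_units, keep_prob=0.5, rng=None):
    """Sample a binary dropout mask keeping roughly `keep_prob` of the
    units active — the simple randomness the talk suggests for
    decorrelating consecutive hidden layers during training."""
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(n_units) < keep_prob).astype(float)
    return mask / keep_prob  # inverted-dropout scaling keeps expectations equal
```

Applied elementwise to a layer's activations at training time, each input sees a different random path through the network.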

0:47:08 | so |

0:47:10 | that's something I'm really interested in: making the

0:47:13 | information between two consecutive

0:47:16 | hidden layers more powerful, more interesting, and making them complementary rather

0:47:20 | than redundant

0:47:21 | and there's also a way, for example, to alternate activation functions

0:47:26 | let's say sigmoid, rectified linear, and sigmoid again

0:47:30 | so between two consecutive sigmoid layers there is something in the middle to change things a

0:47:35 | little bit

0:47:35 | so the behaviour changes between the consecutive sigmoids

0:47:38 | so when you model all the layers, there's hopefully a way to get more information

0:47:44 | into the subspace, and also how the DNN

0:47:48 | is coding the information can be useful for the classification

0:47:52 | and |

0:47:53 | to conclude |

0:47:55 | well, I'm organising SLT 2016 in Puerto Rico

0:47:59 | so hopefully you will submit your papers; the deadline is around the same time

0:48:03 | as usual

0:48:04 | so please, I hope to see you there, and if you come

0:48:08 | to the workshop you can also stay

0:48:11 | for the rest of the week and enjoy the beach and the cocktails; very

0:48:15 | nice to recharge your neurons and make them converge to the right objective function

0:48:19 | so that's it, thank you

0:48:42 | thank you; my comment concerns just a point which is

0:48:47 | not the main point, or which is not the main point of your

0:48:51 | talk

0:48:52 | it's about the visualization

0:48:54 | in particular with t-SNE, the stochastic neighbor embedding

0:48:58 | the use of such dimension-reduction techniques

0:49:03 | these techniques are useful and satisfying

0:49:07 | for visualizing, but also for thinking about and understanding the distributions

0:49:13 | but we remarked something: if you

0:49:17 | present high-dimensional data with those techniques, in particular the

0:49:24 | speaker classes

0:49:26 | which are distributed along some particular

0:49:31 | directions

0:49:33 | t-SNE does not respect the initial distribution

0:49:38 | it separates speaker classes, but

0:49:42 | it does not respect the actual

0:49:45 | directions of the speaker classes

0:49:48 | so it is useful because we see the

0:49:52 | separation between the classes of speakers

0:49:55 | but not

0:49:59 | maybe a more faithful

0:50:01 | view of the real distribution

0:50:05 | so I think it's a very good tool

0:50:07 | but one must be careful not to use it to propose a new

0:50:13 | method

0:50:15 | so you're saying: here I just want to show

0:50:21 | how the data is kind of structured, but I shouldn't take into account how it

0:50:25 | models the distribution, from a t-SNE; that's what you're saying, yes

0:50:32 | yes, simply for those points in particular

0:50:45 | I didn't write down all the numbers, but I saw you had results

0:50:49 | on the dialect id task for the five Arabic dialects

0:50:54 | and the numbers I wrote down here: you had, I think, for the

0:50:58 | fourth-layer vectors, right, twelve point two percent, and then the next one

0:51:05 | was twelve point five percent

0:51:08 | and I apologise if I missed a slide that showed this; so my question

0:51:12 | is

0:51:14 | as you're moving forward you're actually getting improvement; but it would really be nice — in dialect

0:51:19 | id there are much more subtle differences between the dialects, right, so a lot of

0:51:24 | times it's interesting to figure out what are the things that differentiate each

0:51:29 | of the dialects; so I'm wondering if at any point you went back

0:51:33 | and looked at the test files that you ran through here, to find

0:51:37 | what is really moving in the improvement here

0:51:40 | off the top of your head, the assumption might be that you're getting

0:51:44 | a few more files accepted correctly, but you're just as likely to have a few

0:51:51 | more files rejected incorrectly, and it would be nice to see what the

0:51:56 | balance is: are you getting more pluses

0:51:59 | and losing a few, or are you not losing anything and gaining more; so

0:52:03 | that's what I'd like to see: as you're moving down here

0:52:06 | is it a positive movement forward, or are there some that are falling backwards while the net

0:52:13 | gain is always positive

0:52:14 | no, I agree with that; I didn't do it

0:52:18 | but I was interested at the time, and even more interested, to see

0:52:23 | between the hidden layers what I'm getting; I was hoping to see what happened

0:52:29 | per recording, you know, ideally with a linguist working with me, trying to

0:52:32 | understand, okay, a dialect like this gets classified correctly in hidden layer five but not

0:52:37 | in layer four, three, or two; what made it change; so I

0:52:42 | want to know

0:52:42 | which information in the fifth layer got me to make this one better than

0:52:46 | another one; that's true, but we didn't do it in the end; we were thinking about it

0:53:02 | not so much a question; I just want to thank you very much for proposing a new

0:53:07 | solution to a very hard problem

0:53:11 | I'd just like to put the difficulty of the problem into context, because

0:53:15 | we've been banging our heads against the same kind of difficulty

0:53:20 | to summarize the problem

0:53:23 | the problem is to get a low-dimensional representation of the information

0:53:28 | in a sequence; so you've got lots of speech frames

0:53:32 | and then you want to distill the information in all the speech frames into a

0:53:36 | single smallish vector

0:53:39 | so

0:53:41 | the reason it's difficult: let's look at the i-vectors, the classical i-vector

0:53:47 | you can write down the generative model for the i-vectors in one equation

0:53:52 | you had it

0:53:54 | so

0:53:55 | it's very easy for most of us to just look at that and immediately understand it

0:54:00 | so that's the generative route

0:54:02 | but what you're doing is the inference route

0:54:05 | from the data back to

0:54:08 | the hidden information; so now we have to

0:54:12 | share the information from all the frames, accumulate that information back into

0:54:18 | back into the single

0:54:22 | vector; so

0:54:24 | if you look at the i-vector solution

0:54:27 | the formula for

0:54:29 | for

0:54:31 | calculating the i-vector posterior

0:54:33 | that's a lot more complex than

0:54:36 | just the generative formula for the i-vector

0:54:39 | and it took us

0:54:43 | some effort to arrive

0:54:45 | at that formula, and

0:54:48 | I believe it's similarly difficult for a neural network to learn that

0:54:53 | so you mentioned the variational Bayes autoencoders

0:54:59 | we've been looking at those quite a lot

0:55:02 | in the papers that have been published thus far, it's always a one-to-one relationship between

0:55:07 | the hidden variable and the observation, and then everything's i.i.d., so

0:55:12 | the machine learning people have thus far been solving a much easier problem

0:55:16 | so

0:55:18 | to accumulate over all that information is a harder problem; it is

0:55:25 | also computationally hard

0:55:27 | if you think of the i-vector posterior, lots of papers were published on how to make

0:55:32 | that computationally lighter

0:55:34 | so

0:55:36 | that's why, as I say, your

0:55:37 | new solution is quite exciting to us
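
The "more complex inference formula" the commenter contrasts with the one-line generative model M = m + Tw is the classical i-vector posterior mean computed from Baum-Welch statistics; a compact sketch (array names are mine): w = (I + Σ_c N_c T_c' S_c⁻¹ T_c)⁻¹ Σ_c T_c' S_c⁻¹ f_c.

```python
import numpy as np

def ivector_posterior_mean(T_list, Sigma_inv_list, N, F_centered):
    """Classical i-vector point estimate from sufficient statistics.
    T_list[c]: (D, R) loading matrix for mixture component c;
    Sigma_inv_list[c]: (D, D) inverse covariance; N: (C,) occupation counts;
    F_centered[c]: (D,) first-order stats centered on the component mean."""
    R = T_list[0].shape[1]
    A = np.eye(R)                # precision: I + sum_c N_c T_c' S_c^-1 T_c
    b = np.zeros(R)              # projected stats: sum_c T_c' S_c^-1 f_c
    for c, Tc in enumerate(T_list):
        TS = Tc.T @ Sigma_inv_list[c]
        A += N[c] * (TS @ Tc)
        b += TS @ F_centered[c]
    return np.linalg.solve(A, b)
```

Every frame contributes through the accumulated N and F, which is exactly the "gather information from all frames into one vector" step the commenter argues is hard for a plain neural network to learn.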

0:55:42 | also, one of the guys from machine learning asked me, okay

0:55:46 | so you have a DNN and you have your i-vector representation; can you propagate the

0:55:51 | errors from the i-vector space into the DNN to make it more powerful for your specific

0:55:56 | task with the i-vector present

0:55:58 | that's something interesting, maybe a PhD topic

0:56:01 | if you find

0:56:04 | a way to combine the subspace and the DNN, like what

0:56:08 | people do in ASR with sequence training, can

0:56:13 | we do similar things when you have the error coming from the i-vector

0:56:16 | space and propagate it down through the DNN

0:56:21 | that's something maybe

0:56:23 | interesting as well; that's what the machine learning colleague asked me

0:56:35 | so not nice presentation nudging |

0:56:39 | i hadn't thought of questions one was |

0:56:43 | when people moved from gmm based i-vectors to dnn

0:56:49 | based i-vectors using senones as classes

0:56:54 | as i understood, the improvement was

0:56:57 | because the space was quantized much better than using gmms, right

0:57:03 | and |

0:57:04 | whether it was phones as classes or languages as classes

0:57:08 | so |

0:57:09 | the autoencoder that you're proposing to use

0:57:14 | has no information about any classes, so what's your intuition behind why

0:57:20 | something like that would work better than

0:57:22 | using senones or languages as classes

0:57:27 | well, it's actually a good question, so my intuition is just

0:57:33 | my feeling that in speech processing, the way we have been doing it

0:57:37 | is we are too quick

0:57:39 | to throw information away from the signal

0:57:42 | for example |

0:57:43 | here, if you label every frame with the language as the class

0:57:47 | i'm normalising speakers, the dnn is doing all these things for

0:57:51 | you

0:57:52 | so i'm hoping not to do that

0:57:55 | and to preserve as much information

0:57:58 | as i can |

0:58:00 | for example, i give you

0:58:02 | four thousand, six thousand, or ten thousand hours of speech, i'm not giving you labels,

0:58:06 | just the data, and you can train on the speech in an unsupervised way on your

0:58:11 | data, which should be helpful for you because you have thousands, hundreds of thousands of hours of speech

0:58:16 | and maybe in industry it's different

0:58:19 | i assume you have more data than us

0:58:22 | but can we do that, so that's what i hope

0:58:26 | you know, this is the same point people make about unsupervised and weakly

0:58:30 | supervised learning

0:58:31 | can you use that in your training

0:58:32 | so i'm hoping to have a kind of speech coder

0:58:36 | that models speech, so that when you give it something it gives you the same thing back,

0:58:39 | which means the information is there

0:58:42 | it's not thrown away, we just need to know how to use it

0:58:45 | that's exactly my feeling, and i'm not saying that it

0:58:49 | would replace discriminative training or something like that, i'm just saying that if i

0:58:52 | have a good speech coder, then maybe, if i'm not being too

0:58:56 | optimistic, that is what i want, something

0:59:01 | like a vocoder style model or something like that

0:59:04 | if it can reproduce the speech again

0:59:08 | then the information is there, we just need to extract it

0:59:12 | i don't know if that was clear
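The intuition "if it can reproduce the speech, the information is there" is precisely the autoencoder reconstruction objective. A minimal linear-autoencoder sketch on toy data (everything here is an illustrative stand-in, not a speech model): the data lies on a low-dimensional subspace, and gradient descent on reconstruction error drives the bottleneck to capture it without any class labels.

```python
import numpy as np

rng = np.random.default_rng(2)
# toy "speech" features: 200 frames lying on a 10-dim subspace of a 16-dim space
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 16)) / np.sqrt(10)

D, B = 16, 10                            # feature dim, bottleneck dim
We = rng.standard_normal((D, B)) * 0.1   # encoder weights
Wd = rng.standard_normal((B, D)) * 0.1   # decoder weights
lr, n = 0.01, len(X)

mse_history = []
for _ in range(2000):                    # plain gradient descent on reconstruction error
    Z = X @ We                           # encode to bottleneck
    Xh = Z @ Wd                          # decode (reconstruct)
    E = Xh - X
    mse_history.append(np.mean(E ** 2))
    We -= lr * (X.T @ (E @ Wd.T)) / n    # gradient step on encoder
    Wd -= lr * (Z.T @ E) / n             # gradient step on decoder
```

Since the bottleneck matches the true subspace dimension, reconstruction error falls toward zero, i.e. no information is discarded, which is the property the speaker wants before any task-specific training happens.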