Speech Transcript - I-Vector Representation Based on GMM and DNN for Audio Classification

0:00:15	alright thank you for this introduction and i would also like to thank the
0:00:20	organizers for inviting me to give this presentation and to
0:00:25	present my latest work and also for bringing us here to
0:00:28	this wonderful location it was an amazing week
0:00:30	the weather was very good
0:00:32	and the social events with many things so i i'll exercising so as a good
0:00:38	part
0:00:39	so
0:00:40	so it was really good since was very enjoyable to a week to talk to
0:00:44	people and meet would be blunt this costs and exchange ideas so that's what wonderful
0:00:48	and the gospel begins to
0:00:50	to see this winter vision of the basque country
0:00:53	so hopefully we'll come back to visit that this tourist
0:00:56	we have chance so today i'm presenting some of my latest work about using
0:01:00	the
0:01:02	i-vectors some kind of i-vectors to model the hidden layers and see
0:01:06	how
0:01:07	the dnn is encoding information in the hidden layers because usually the
0:01:12	way
0:01:13	the way we are using dnns now is that we either look at
0:01:17	the output of the dnn to make some decisions
0:01:20	or we use bottleneck features we take one of
0:01:24	the hidden layers as bottleneck features to do some classification with it
0:01:28	but unfortunately not
0:01:29	a lot of work has been proposed to look at
0:01:34	the whole pattern in the dnn
0:01:36	because i believe that there is some information we are not
0:01:40	exploring and using in the dnn which is the path of activation
0:01:45	how the information was propagated over the dnn and that's what we're going to
0:01:49	be talking about today
0:01:50	and show some results
0:01:52	so
0:01:54	so this is the outline of my talk i'll start with an introduction then after
0:01:58	that i'll move on
0:02:00	slowly to my latest work but before that i would give some
0:02:04	introduction to the i-vectors which i don't need to because a lot of people
0:02:07	here probably know it sometimes better than me
0:02:11	so as you guys know the i-vector is based on the gmm so
0:02:14	the first part will be based on the gmm how we use it for the gmm
0:02:17	so we present it for the gmm mean adaptation
0:02:21	and we show two case studies speaker recognition and language recognition here i'm
0:02:25	not telling you
0:02:27	how to build your language or speaker recognition system but i just want to tell
0:02:30	you that with i-vectors we can do some visualisation and
0:02:34	see some very interesting behavior of the data how the channels and
0:02:38	the recording conditions can affect
0:02:40	the speaker recognition system if you don't do any channel compensation
0:02:43	for language recognition we show the closeness of the languages from data driven
0:02:47	visualisation and after that we move on
0:02:51	then in the direction of how we can use some discrete i-vectors to model the
0:02:55	gmm weight adaptation this is some work that was started with one
0:02:59	of hugo van hamme's students hasan bahari
0:03:03	he was actually at ku leuven in belgium he visited me at
0:03:07	mit for six months
0:03:09	and we started working on this gmm weight adaptation for language id
0:03:13	and then after that the dnns started progressing and taking over the field
0:03:18	and that's where i started looking saying maybe these discrete i-vectors can also be used
0:03:22	to model the posterior distribution for the dnns
0:03:25	so this is the second part of the talk
0:03:28	i started looking at how the dnn is representing information in the hidden layers
0:03:33	because a lot of the work in vision shows how you can recognise
0:03:37	this neuron that models a cat's face from youtube videos or something like that
0:03:41	so
0:03:41	can we do something for speech
0:03:43	that's how i started thinking about using the i-vector representation to model the hidden layers
0:03:48	and that's why
0:03:48	then we show for example how the accuracy improves as we go
0:03:52	deeper in the dnn for example for the language id task how
0:03:56	it gets better
0:03:57	and also how we can model the path of activation the progression
0:04:00	of the information over the dnn
0:04:04	so if you feel like you what one hours too much for you to sit
0:04:07	in the shower and you want to the perfume is that you should even the
0:04:11	first part because the gmm part but the second part maybe more interesting for you
0:04:14	guys
0:04:16	i would be not offended if you want to the
0:04:18	so after that i'll finish by giving some conclusions of the work
0:04:22	so
0:04:23	so as you know i-vectors have been largely used it's a nice way to work
0:04:28	it's a compact representation that nicely summarizes and describes what's happening in
0:04:34	a given recording
0:04:35	it has been largely used for different tasks
0:04:39	speaker recognition language recognition speaker diarization
0:04:41	speech recognition so the i-vector was originally related to the gmm adaptation of the
0:04:46	means
0:04:47	so as i said lately i have been interested also in the gmm weight adaptation
0:04:51	using i-vectors and then after that i moved on to use
0:04:55	this to model the dnn based i-vectors
0:04:59	for the for you what modifications
0:05:03	so let me slowly take you
0:05:08	to my latest work
0:05:11	so in speech processing usually what you have is a recording
0:05:15	and you transform it to get some features
0:05:18	then based on the complexity of the distribution of the features you build a gmm usually
0:05:23	you put the gmm on top of these features to maximize the likelihood of their distribution
0:05:27	so you know
0:05:29	gmms are defined by gaussians and each gaussian has a weight a mean
0:05:34	and a covariance matrix that describe this gaussian
0:05:38	so this is how the i-vectors came out in the context
0:05:42	of speaker recognition so the way we were doing it in the early two thousands that's
0:05:46	how it all started
0:05:47	we take a lot of non-target speakers and train a large gaussian mixture
0:05:52	model
0:05:53	then after that because we sometimes don't have many recordings from the
0:05:57	same speaker we do maximum a posteriori adaptation so we adapt
0:06:01	the universal background model which is a kind of prior of how all the sounds
0:06:05	look like in the direction of the
0:06:07	target speaker
0:06:08	and so that's how people started to use the gmm
0:06:15	supervectors because we found out that
0:06:18	only the adaptation of the means is enough so the
0:06:23	mean shift from this universal background model the large gmm trained on a lot
0:06:27	of data
0:06:28	in the direction of the target speaker can be characterized as something that happened in the recording
0:06:32	that made that shift happen
0:06:34	so a lot of people started to model this shift for example patrick kenny with joint
0:06:38	factor analysis trying to
0:06:39	split between speaker and channel
0:06:42	using the gmm supervector and for example also campbell who worked with svms
0:06:47	using the gmm supervector as input to the svm to model the
0:06:51	separability between speakers
0:06:53	so in the same field i-vectors came out as well
0:06:57	so with the i-vector you have the gmm space and the ubm is one point there
0:07:02	and we have one recording so we try to shift the ubm to the new recording
0:07:06	so
0:07:07	from the ubm to this new recording so if you have several recordings
0:07:12	you have a low dimensional subspace and the i-vectors model all the variability between
0:07:16	all these recordings
0:07:17	in the low dimensional space
0:07:19	so
0:07:20	and the origin is still the ubm
0:07:23	so all these new recordings can be mapped to this new space and now we
0:07:27	can represent any recording by a vector of fixed length
0:07:31	so this can be modeled by this equation we have the universal background
0:07:35	model here in the middle and each recording's gmm supervector can be explained by the ubm
0:07:41	plus an offset
0:07:43	where the offset describes
0:07:45	what happened in this recording which is given by the i-vector in the total variability
0:07:49	space the i-vector space
0:07:51	so now once you have trained this
0:07:56	when you have a new recording you extract the features then after that you
0:07:59	map it to your subspace i'm sure that you are all familiar with that
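The equation on the slide is not captured in the transcript. A common way to write the model described here (UBM plus an offset given by the i-vector in the total variability space) is sketched below. The symbol names are assumptions chosen for illustration: $M$ is the GMM mean supervector of one recording, $m$ is the UBM mean supervector, $T$ is the low-rank total variability matrix, and $w$ is the i-vector.

```latex
\[
  M = m + T\,w
\]
```

Under this notation, extracting an i-vector for a new recording amounts to estimating $w$ from that recording's features with $m$ and $T$ held fixed.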
0:08:03	so now i'm not going to tell you how to do
0:08:07	speaker recognition you have seen a lot of good
0:08:10	talks during this whole week
0:08:13	at this conference but i will still show you how we can
0:08:16	do visualisation with it so
0:08:18	first of all for speaker recognition these i-vectors have been applied for different kinds of
0:08:22	speaker modeling tasks like speaker identification when you have a
0:08:26	set of speakers and given a recording you want to identify who
0:08:30	spoke in this segment speaker verification when you want to verify that
0:08:35	two recordings are coming from the same speaker or diarization
0:08:38	where you want to know who spoke when
0:08:40	so for the speaker recognition task i would like to show some visualisation
0:08:46	that explains to you what's happening in the data if you don't do any
0:08:49	channel compensation for that
0:08:51	i would like to acknowledge the work of zahi karam who was a phd student
0:08:55	with bill campbell at mit and he was working with me at the
0:08:59	time
0:09:00	so we took the data from the nist two thousand ten
0:09:04	speaker recognition evaluation the system was based on i-vectors and at the time the
0:09:08	system we built was a single system that was trained to
0:09:12	deal with the telephone and microphone data in the same subspace
0:09:15	and so we took about five thousand recordings from that data and
0:09:21	we computed the cosine similarity between all the recordings
0:09:24	that gives this similarity matrix and he built
0:09:28	a ten nearest neighbor graph so each node is connected to
0:09:31	its ten nearest neighbors
0:09:33	and then he used this software called guess to do the graph visualisation
0:09:37	so in this graph the absolute location of the nodes is not
0:09:41	important but the relative distance between the nodes
0:09:44	and the clusters is important because
0:09:46	it reflects how close they are and how structured your data is
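The graph construction described here (pairwise cosine similarity, then linking each recording to its nearest neighbors) can be sketched as follows. This is a minimal illustration, not the code used for the talk: the function names, the toy two-dimensional vectors, and the choice of `k=1` are assumptions made to keep the example small, whereas the talk uses ten neighbors on about five thousand i-vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_graph(ivectors, k):
    """Map each node index to the indices of its k most similar other nodes."""
    graph = {}
    for i, vi in enumerate(ivectors):
        scored = [
            (cosine_similarity(vi, vj), j)
            for j, vj in enumerate(ivectors)
            if j != i
        ]
        # Highest similarity first
        scored.sort(key=lambda pair: pair[0], reverse=True)
        graph[i] = [j for _, j in scored[:k]]
    return graph


# Toy "i-vectors": recordings 0 and 1 point one way, 2 and 3 another
toy_ivectors = [[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0], [0.0, 0.9]]
edges = knn_graph(toy_ivectors, k=1)
```

The resulting adjacency lists are what a graph layout tool would take as input; only the relative distances between nodes carry meaning in the drawing.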
0:09:50	so here
0:09:53	is the female data with the inter-session channel compensation
0:10:00	applied so we can see the colours are by speakers
0:10:03	and each point corresponds to a recording and each cluster
0:10:08	corresponds to a speaker
0:10:09	so for people that actually want to the museum and since all are this early
0:10:13	week can you do i mean what's the at this was thinking twenty been this
0:10:17	"'cause"
0:10:20	so
0:10:21	so what we started doing is we said okay
0:10:25	now we remove the channel compensation and see what happens well we
0:10:29	lost the speaker clustering
0:10:31	and something happened some clusters appeared
0:10:35	and i said well what's going on here so we went
0:10:39	together and we looked
0:10:41	at the labels and we started looking at what's going on so for example here
0:10:45	he checked the microphone that was used
0:10:49	to record each of the recordings and he found that
0:10:54	the clusters are actually linked to the microphone that has been used
0:10:57	and that was pretty surprising for example
0:11:00	this is the telephone data we have one cluster and this is the microphone
0:11:04	data
0:11:05	and we also found that for the
0:11:08	same microphone there are two clusters and that's actually because of the room
0:11:13	the ldc at the time used two rooms to collect the data so the two rooms
0:11:18	were also reflected in your data
0:11:20	this is just a visualisation to show you i don't want
0:11:23	to give you an equal error rate going from two to one point five or whatever but i
0:11:26	want to tell you that if you don't do anything about the microphone and the channel compensation
0:11:30	it may be a big issue
0:11:31	so this is what happens
0:11:33	the data can be affected by the microphone can be affected by the channels
0:11:37	and also can be affected by the room where it has been recorded
0:11:42	so this is what we get when we apply the channel compensation
0:11:47	the clustering is by speaker and the colouring is by
0:11:52	channel so the channel compensation is doing a good job
0:11:56	trying to normalize this
0:11:57	so i front lately we recognise mel bit and female on a y
0:12:01	but different clusters of the time was better so this is that say the same
0:12:05	data and here we also see the same behaviour so this is the
0:12:08	one with the microphone data which is the most interesting
0:12:11	and you can still see the split between room one and room
0:12:15	two that the ldc used to collect the data
0:12:19	so this is actually a unique visualisation
0:12:21	that has been very helpful for us to understand and it shows
0:12:26	people that what we are doing makes sense
0:12:29	and you know how we can still be fun to the some pictures and microphone
0:12:36	a microphone channel compensation
0:12:39	so this is the same thing so after that when we were
0:12:43	doing language id in two thousand eleven i started looking at the language id task
0:12:47	and i tried to do the same thing for visualisation so in the language
0:12:52	recognition task we have verification and identification so you don't need
0:12:56	to spend too much time on that so here what i did is
0:13:00	i took nist two thousand nine i had an i-vector extractor trained on the
0:13:03	training data on it took it doesn't matter just a can cost
0:13:07	and i took two hundred recordings for each language i think we have twenty three
0:13:12	languages
0:13:13	and i did the same thing i computed the cosine similarity and
0:13:20	built the nearest neighbor graph and tried to visualise it so this is what happened
0:13:24	for this language recognition task so for example here it is interesting because we have
0:13:29	for example
0:13:33	english and indian english close together
0:13:35	we have indian english and hindi and urdu which are very
0:13:39	close together
0:13:41	mandarin cantonese vietnamese and korean
0:13:44	are almost in the same cluster
0:13:46	so
0:13:48	so also here russian ukrainian bosnian and croatian and georgian
0:13:53	are in the same cluster and also french and creole
0:13:56	so it's really a data driven
0:13:58	visualisation that shows you how close the languages are
0:14:04	from the acoustics
0:14:05	that are primarily used to model the i-vector representation
0:14:09	so here this is what have been you know you know
0:14:12	i-vectors were allowed to do because you have this you know in the time with
0:14:15	cosine distance between you can be lda to this was a bit as well
0:14:19	so
0:14:20	with i-vectors we can represent the data and see what's happening
0:14:25	in the data and how you can interpret
0:14:26	what phenomena are going on
0:14:28	so the visualisation was a good tool for that
0:14:31	so let me now try to move on because i
0:14:34	know that you are all familiar with i-vectors i don't want
0:14:37	to spend too much time on it you probably prefer to move to the more interesting
0:14:41	topic of this talk so after that i started looking at
0:14:46	the gmm weight adaptation as i said with the student hasan
0:14:49	bahari
0:14:50	and the way the gmm weight adaptation works there are actually several techniques
0:14:56	that have been applied to that
0:14:57	for example maximum likelihood
0:14:59	which is the most simple way
0:15:01	and also non-negative matrix factorization which hugo van
0:15:06	hamme was working on and the subspace multinomial model
0:15:10	which is that what else complement inequality and what but people use
0:15:14	and what we proposed which is called non-negative factor analysis because
0:15:18	the gmm weight adaptation is a little bit tricky because you have the nonnegativity of the
0:15:23	weights as well as they should sum to one so these are constraints
0:15:26	you have to deal with
0:15:27	during the optimization when you're training your
0:15:30	subspace
0:15:32	so it's a
0:15:33	so for the weight adaptation for example you have a set of features for one
0:15:37	recording
0:15:39	and you have a ubm and you compute the posterior distribution of
0:15:43	a given component for a given frame
0:15:49	given the ubm so we get these posteriors and
0:15:53	then you accumulate them over time and get the counts
0:15:55	from that
0:15:56	so in order to get the gmm weight adaptation
0:16:00	you try to maximize the auxiliary function given here
0:16:02	and if you want to do maximum likelihood the way to do it is
0:16:06	you accumulate all these posteriors over time and divide by the number of frames that
0:16:11	you have and you get the maximum likelihood estimate
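The maximum likelihood weight estimate described here (accumulate the component posteriors over time, then divide by the number of frames) can be sketched in a few lines. The function name and the toy posteriors are assumptions for illustration; in practice the per-frame posteriors would come from the UBM.

```python
def ml_weights(frame_posteriors):
    """Maximum likelihood GMM weights from per-frame component posteriors.

    frame_posteriors: list of frames, each a list of posteriors over the
    UBM components (each frame sums to one).
    """
    num_components = len(frame_posteriors[0])
    counts = [0.0] * num_components
    # Accumulate the posteriors over time: these are the zero-order counts
    for posteriors in frame_posteriors:
        for c, p in enumerate(posteriors):
            counts[c] += p
    # Divide by the number of frames to get weights that sum to one
    num_frames = len(frame_posteriors)
    return [n / num_frames for n in counts]


# Three frames, three components (toy values)
toy_posteriors = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.4, 0.2]]
weights = ml_weights(toy_posteriors)
```

The same accumulated counts are the input to the subspace methods discussed next; only the way the weights are parameterized changes.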
0:16:13	or
0:16:14	you can for example do non-negative matrix factorization
0:16:18	which consists of splitting the weights into small
0:16:23	non-negative matrices
0:16:25	a basis that
0:16:25	also maximizes the auxiliary function given here the input is the counts
0:16:31	and you try to estimate the subspace the basis and
0:16:36	the representation in this subspace
0:16:38	to characterize the weight adaptation
0:16:41	so this non-negative matrix factorization is in the paper by hugo van hamme's student
0:16:46	that describes that
0:16:48	what was implemented at but is that you have a multinomial distribution
0:16:52	which is described
0:16:58	by this subspace that describes the i-vector representation
0:17:05	in the weight subspace so we have
0:17:10	the ubm plus a shift and the denominator here is to make sure
0:17:14	that the weights obtained are normalized to sum to one
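A minimal sketch of the subspace multinomial idea described here: a UBM term plus a low-rank shift, passed through a normalization so that the weights are positive and sum to one. The sketch assumes the usual softmax form with the UBM term stored as log weights; the function name, the 3x2 subspace matrix, and the latent vectors are hypothetical values for illustration only.

```python
import math


def smm_weights(log_ubm_weights, subspace, latent):
    """Weights as softmax(m + T w): positive and summing to one by construction.

    log_ubm_weights: m, one value per component (assumed to be log UBM weights)
    subspace: T, one row per component
    latent: w, the low-dimensional vector for one recording
    """
    shifted = [
        m_c + sum(t_ck * w_k for t_ck, w_k in zip(row, latent))
        for m_c, row in zip(log_ubm_weights, subspace)
    ]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)  # the denominator that normalizes the weights
    return [e / total for e in exps]


ubm_weights = [0.5, 0.3, 0.2]
log_ubm = [math.log(p) for p in ubm_weights]
toy_subspace = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical T

unchanged = smm_weights(log_ubm, toy_subspace, [0.0, 0.0])
adapted = smm_weights(log_ubm, toy_subspace, [0.5, -0.2])
```

With a zero latent vector the model returns the UBM weights unchanged; a nonzero vector moves mass between components while the normalization keeps the result a valid distribution.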
0:17:18	the good part of it here is that this is very good to when you
0:17:22	have a nonlinear data to fit for example he an example i would like to
0:17:25	thanks
0:17:26	but an specially older for shown with giving me the slides and that this
0:17:32	picture
0:17:33	here for example you have a gmm of to question for example
0:17:37	and he would try to similar each point corresponds to one recording weights adaptation
0:17:41	for example much estimation
0:17:44	and we tried to simulate what happens when you have a large gmm so we
0:17:48	have some sparsity not all the gaussians would appear so we can see that this
0:17:51	gaussian here in the corner sorry
0:17:55	then the d
0:17:56	so this gaussian here would not appear this is just a simulation
0:18:00	of what happens when you have a large ubm
0:18:03	so we can see for example in this case how the data looks
0:18:06	like
0:18:06	and this subspace multinomial model is very
0:18:13	good at fitting the data
0:18:15	but it has a drawback it may overfit so that's why the but guys
0:18:19	use regularization so that it does not overfit
0:18:22	so hasan's work at the time was trying to do something similar to
0:18:28	the i-vector so you have the ubm weights and you want to make
0:18:33	sure that for the new recording the weights
0:18:37	are the ubm weights plus an offset
0:18:40	and the constraint here is
0:18:42	that the weights should sum to one and they should be nonnegative so
0:18:46	we developed an em like approach
0:18:52	to maximize the likelihood the objective function
0:18:58	so you have two steps first you compute all the i-vectors
0:19:02	then you update the l and you iterate until convergence
0:19:06	so we maximize the likelihood of the data
0:19:10	subject to the constraints that the weights should sum to one and they should be
0:19:14	positive and there is
0:19:16	projected gradient ascent that can be used to do that
0:19:18	and you can go to the reference and you can find all
0:19:21	the information i don't want to go into it
0:19:25	in this talk
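The constrained problem described here can be written compactly. The slide's notation is not captured in the transcript, so the symbols below are assumptions: $b$ holds the UBM weights, $L$ is the low-rank subspace matrix with rows $L_c$, $r$ is the latent vector for one recording, $n_c$ is the accumulated count for component $c$, and $C$ is the number of components.

```latex
\[
\begin{aligned}
\omega &= b + L\,r \\
\hat{r} &= \arg\max_{r} \; \sum_{c=1}^{C} n_c \log\left(b_c + L_c\, r\right) \\
\text{subject to}\quad & \sum_{c=1}^{C} \left(b_c + L_c\, r\right) = 1,
\qquad b_c + L_c\, r \ge 0 \quad \text{for all } c
\end{aligned}
\]
```

The alternating procedure then estimates $r$ for every recording with $L$ fixed, updates $L$ with the latent vectors fixed, and repeats until convergence, with a projection step keeping the weights inside the constraints.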
0:19:27	so
0:19:28	the difference between the non-negative factor analysis and the smm is
0:19:32	actually
0:19:33	shown in this table the non-negative factor analysis tends to not overfit because
0:19:39	its approximation of the data does not touch the corners compared to
0:19:44	the smm
0:19:47	sometimes that's good sometimes bad depending on which application you are targeting
0:19:52	but we compared them for several applications and the smm and the
0:19:56	non-negative factor analysis in practice
0:19:59	behave almost the same
0:20:01	so these discrete i-vectors have been applied for several applications for example modeling
0:20:07	of prosody that's what marcel did for his phd
0:20:11	phonotactics when you model the n-grams for example on dry and the did that and
0:20:15	method is based is this
0:20:17	and also what we did for the gmm weight adaptation for language recognition and
0:20:23	dialect recognition with hasan's work so
0:20:26	in this paper we compared activity taking and i'm have
0:20:31	smm as well as the non-negative factor analysis so you
0:20:33	can go and check that
0:20:35	they behave almost the same for gmm weight adaptation
0:20:38	so now in order to go to the fun part
0:20:44	how we can use these
0:20:48	discrete i-vectors to model the
0:20:51	dnn activations at the time i
0:20:55	was motivated by
0:20:57	this picture
0:20:59	so i was watching what it was actually that any one of the pocketing whatever
0:21:03	was given a talking to go on training or something like that and he was
0:21:06	showing that you if you do like some a deep belief network to unsupervised trained
0:21:11	your auto-encoder data
0:21:13	and he trained in the millions of unlabeled youtube
0:21:17	number link but component
0:21:20	and he said that maybe if you divide one or in top you maybe you
0:21:23	can actually construct
0:21:25	the pictures and he was saying all kayaking see the cat
0:21:28	face
0:21:29	and i was like okay can we do something for speech and i thought okay it's
0:21:33	a continuous time series but
0:21:35	can we actually see how the data is organized in the dnn
0:21:40	hidden layers and that is exactly what motivated me to start this work
0:21:45	so remember that before i said we have a recording and we transform it
0:21:50	to a set of features
0:21:53	then we gave these features to a gmm earlier now let's just remove the gmm
0:21:57	and give them
0:21:58	to a dnn so for example we can do language recognition
0:22:03	where you give as input a stack of frames that's what
0:22:07	the google paper did in two thousand fourteen so the input is a
0:22:12	stack of frames and the output is the language and
0:22:16	i will show several experiments like that
0:22:20	note that when you have a new recording and you want to make the decision
0:22:24	you make a frame-by-frame decision and you average and take the max of the
0:22:29	output so that's what we compare to
0:22:35	and now we want to see how the data is
0:22:38	represented in the dnn for this task so
0:22:43	so imagine you have a dnn the way that we use it
0:22:46	now
0:22:47	as i said earlier is we take the dnn and we take
0:22:51	the output to make a decision
0:22:54	or for alignment for example for ubm i-vectors
0:22:57	or we take one hidden layer
0:22:59	and use it as bottleneck features
0:23:02	but in both cases we only see one level of the dnn
0:23:07	only one
0:23:08	hidden layer or the output we don't see how the dnn actually propagates
0:23:12	the information over
0:23:13	the whole path of the dnn and the reason for
0:23:17	example imagine you have sparse coding
0:23:21	for example for each hidden layer
0:23:23	and for each input only fifty percent of your
0:23:27	neurons are active for example with dropout
0:23:33	so the way the dnn encodes information for example for one class
0:23:38	and the way it encodes it for another one can
0:23:40	be different
0:23:42	because of some randomness in the way it propagates and encodes the information so if you
0:23:47	can model the path of activation of how the class went
0:23:52	through the dnn
0:23:54	this is information that's available there but we're not using it
0:23:58	and that's exactly what motivated me for doing this work
0:24:02	so can we look at the whole dnn and see how the information progresses there and
0:24:07	this is one way to do it maybe it is not
0:24:10	the best way but this is one way to do it
0:24:14	so the idea here that we tried is
0:24:17	since we have these discrete i-vectors that are also based on counts
0:24:21	and posteriors can i use that to model
0:24:24	i-vectors for each hidden layer
0:24:27	so i build for example a dnn here with an
0:24:30	i-vector representing hidden layer one
0:24:31	an i-vector representing the last layer as well and
0:24:36	in order to do that i need to have some counts
0:24:39	so that i'll be able to apply my gmm weight
0:24:43	adaptation techniques that we used for gmm weight adaptation so here is
0:24:47	how you can get the counts
0:24:49	for example you can compute the posterior for each neuron activation then
0:24:55	for each hidden layer for each input you normalize them to sum to one
0:24:58	artificially because the dnn was not trained to do that
0:25:02	and then you accumulate them over time and they become counts because here you
0:25:07	force them to sum to one
0:25:10	and you can use the same dnn you don't change anything
0:25:13	in it
0:25:14	so the second one is post softmax
0:25:17	it is a similar thing but you apply a softmax to normalize them to sum to one
0:25:21	and then accumulate you can also train with softmax as well
0:25:24	but the most important one which is the most interesting of all these ad
0:25:28	hoc
0:25:29	situations
0:25:30	is to compute the probability of activation of each neuron and its complement one minus that
0:25:35	so you can consider each neuron as a gmm of two gaussians
0:25:40	so now you model exactly how the dnn
0:25:43	responds so we don't normalise anything
0:25:47	so here for example if you have one thousand neurons
0:25:51	you will have double that and you would have
0:25:55	a thousand
0:25:57	gmms of two gaussians and you use the subspace model to do that with
0:26:01	the constraint that each activation and its complement sum
0:26:05	to one and in this case you don't do anything wrong because you're modeling
0:26:09	the same behavior of the dnn
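The three ways of turning hidden-layer activations into counts can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the talk: activations are assumed to be sigmoid outputs in (0, 1), the softmax is assumed to be applied directly to those activation values, and the function names and toy frames are hypothetical.

```python
import math


def sum_normalized_counts(frames):
    """Scheme 1: rescale each frame's activations to sum to one, then accumulate."""
    counts = [0.0] * len(frames[0])
    for frame in frames:
        total = sum(frame)
        for i, a in enumerate(frame):
            counts[i] += a / total
    return counts


def softmax_counts(frames):
    """Scheme 2: apply a softmax over the layer for each frame, then accumulate."""
    counts = [0.0] * len(frames[0])
    for frame in frames:
        peak = max(frame)  # subtract the max for numerical stability
        exps = [math.exp(a - peak) for a in frame]
        total = sum(exps)
        for i, e in enumerate(exps):
            counts[i] += e / total
    return counts


def complement_counts(frames):
    """Scheme 3: per neuron, accumulate the activation p and its complement 1 - p."""
    counts = [[0.0, 0.0] for _ in frames[0]]
    for frame in frames:
        for i, p in enumerate(frame):
            counts[i][0] += p
            counts[i][1] += 1.0 - p
    return counts


# Two frames of a three-neuron hidden layer (toy sigmoid outputs)
toy_frames = [[0.2, 0.6, 0.2], [0.9, 0.05, 0.05]]
layer_counts = sum_normalized_counts(toy_frames)
layer_softmax_counts = softmax_counts(toy_frames)
neuron_pair_counts = complement_counts(toy_frames)
```

In the first two schemes the counts of a layer sum to the number of frames, like GMM occupation counts. In the third scheme each neuron contributes its own pair that sums to the number of frames, so the network's response is kept as it is and no artificial normalization across neurons is introduced.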
0:26:11	so
0:26:12	we tried to compare a few of them but i'm not going into too
0:26:15	much detail i don't want to put too many numbers here to confuse you
0:26:19	they behave about the same
0:26:22	so in this case for the first
0:26:26	application which is dialect recognition
0:26:28	i use non-negative factor analysis
0:26:30	for the nist data i use the subspace multinomial model
0:26:34	"'cause" i wanted to show that but actually but works there is no distinctive to
0:26:37	be you
0:26:38	so here let's say
0:26:40	for the non-negative factor analysis you have the weights of a new recording given the ubm
0:26:44	so the way i compute the ubm weights is i usually take
0:26:49	all the training data extract the counts for each recording normalize
0:26:54	them and take an average and that's my ubm so the ubm response
0:26:58	is only the average response of a given
0:27:02	hidden layer
0:27:04	over all the recordings
0:27:08	so then you can use the non-negative factor analysis
0:27:12	to do that
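The "UBM" for a hidden layer, as described here, is just the average of the normalized counts over the training recordings. A minimal sketch, with a hypothetical function name and toy counts:

```python
def average_layer_weights(per_recording_counts):
    """Average of per-recording normalized counts for one hidden layer.

    per_recording_counts: one list of accumulated counts per training recording.
    """
    normalized = []
    for counts in per_recording_counts:
        total = sum(counts)
        normalized.append([c / total for c in counts])
    num_recordings = len(normalized)
    # Average each unit's normalized count across recordings
    return [sum(column) / num_recordings for column in zip(*normalized)]


# Two training recordings, two units (toy values)
toy_counts = [[2.0, 2.0], [1.0, 3.0]]
layer_ubm = average_layer_weights(toy_counts)
```

Because every recording is normalized before averaging, long recordings do not dominate the average, and the result still sums to one.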
0:27:14	so now
0:27:15	the interesting thing is that non-negative factor analysis and also the other approaches
0:27:19	can help you to model all the hidden layers as well one way to
0:27:23	do it for example is you can build i-vectors for each hidden layer then you
0:27:28	can concatenate the i-vectors
0:27:30	and you would have that
0:27:31	or you could have one subspace
0:27:33	that actually models everything with the constraint that the weights of each hidden layer sum to one
0:27:38	and this will allow you to see
0:27:41	how the correlation is happening between all the activations of your hidden layers
0:27:45	and that's exactly what we did
0:27:48	so
0:27:49	in order to do that we extended the non-negative factor analysis
0:27:53	so you have different ubms each one corresponding to one hidden layer and you
0:27:58	have a common
0:28:00	i-vector that controls all the outputs for each hidden layer sorry you have
0:28:05	a common
0:28:07	i-vector for all the weights for all the hidden layers
0:28:14	so let's go through some experiments and show some
0:28:21	results
0:28:22	so the first experiment that i would like to show is in that some dialect
0:28:25	id so we have a small sore from apart from vision
0:28:29	so we were interested in doing some back here we have five dialects we have
0:28:33	this isn't know how many recalling by training
0:28:36	it's about forty hours important thing for ten or fifteen hours and it'll it an
0:28:40	hour threeish a dialect
0:28:42	and we have training how many cost for training and development and eval
0:28:47	so i trained the dnn
0:28:49	to
0:28:51	so we have a five class problem so i trained a dnn with five
0:28:55	hidden layers
0:28:56	the first hidden layer is about two thousand and then after that i
0:29:00	have all the other hidden layers of five hundred
0:29:06	five hundred
0:29:07	then so in the dnn training the input is
0:29:11	the features a stack of
0:29:14	i think it was twenty one feature frames then the output is the five dialect classes
0:29:20	the same as a google paper with any with the in a two
0:29:25	then when you get the i-vectors i used cosine scoring with lda as
0:29:30	people described earlier today
0:29:32	and the best image method we find for this task is that the it's also
0:29:37	most full rank
0:29:39	as about thousand five hundred five and the for each other ones
0:29:42	so the first result i show is the i-vector result
0:29:47	and here the i-vector is actually worse than when you do the dnn
0:29:52	average of the output
0:29:54	which means that for each frame you compute the posterior
0:29:57	for the five classes and you average them and you take the max which is exactly what
0:30:01	the google paper describes and here it is better because the characteristic of
0:30:07	this data is that the recordings are very short around thirty seconds you know
0:30:12	or sometimes less
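The averaging baseline described above (compute a posterior over the classes for each frame, average over the frames, take the maximum) can be sketched like this. The posterior values are made up for illustration and the function name is hypothetical.

```python
import numpy as np

def classify_by_average(frame_posteriors):
    """Average frame-level class posteriors and pick the most likely class.

    frame_posteriors: array of shape (num_frames, num_classes), rows sum to one.
    Returns the winning class index and the averaged posterior vector.
    """
    avg = frame_posteriors.mean(axis=0)
    return int(np.argmax(avg)), avg

# Toy example: three frames, five dialect classes.
posteriors = np.array([
    [0.1, 0.6, 0.1, 0.1, 0.1],
    [0.2, 0.3, 0.3, 0.1, 0.1],
    [0.1, 0.5, 0.2, 0.1, 0.1],
])
label, avg = classify_by_average(posteriors)
```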
0:30:14	so we know that you know if you do the dnn and you
0:30:17	average the scores it's always better you have already seen the talks on wednesday afternoon
0:30:22	that show that
0:30:23	even for nist data so this is the error rate sorry so less is
0:30:29	better
0:30:30	so now i will show you the results when you do the i-vectors in
0:30:36	the hidden layers starting from layer one to layer five and how the
0:30:42	results are
0:30:44	the more you go deep the better it is which we know
0:30:48	so this is the understanding from other fields like in vision so we
0:30:53	were able to see the same thing here so
0:30:55	you can see that from layer one to layer four the error
0:30:59	goes down and i kept these
0:31:02	five layers because i want to show that sometimes there's no need to go too
0:31:05	deep
0:31:06	for example layer five already saturated
0:31:09	layer five didn't add anything but i kept it just to show that
0:31:13	you know sometimes we try to make it really deep but it is not necessary
0:31:17	so this is one example where you really don't need to do it
0:31:21	so
0:31:22	and the good thing is now we can also see that you know we were able to
0:31:26	see the accuracy of each of the hidden layers and we were also able
0:31:29	to show that the more you go deep in the network the better
0:31:33	the results are so you will probably get more information
0:31:36	modelling all the hidden layers maybe you have a better representation
0:31:40	so here this is lda
0:31:44	in two dimensions
0:31:46	of the five classes it is an lda projection in two dimensions and i
0:31:51	remember the first time i presented this work and showed the slide people
0:31:55	said well why didn't you do t-sne i said that's true i forgot to do
0:31:59	that
0:32:00	so this time i didn't forget
0:32:02	and so i took a set of the raw i-vectors for example for the
0:32:05	last layer
0:32:06	and i did t-sne to model that so now here it is just
0:32:10	the raw i-vectors using t-sne instead of lda and you can see that
0:32:14	for example the origin is around here so we can see the scatter going this
0:32:18	way
0:32:19	which is a sign that okay length normalization will be useful again
0:32:23	so this is when you do the length normalization you see the same thing
0:32:27	as in the speaker area
0:32:28	so it is the same thing so length normalisation is also useful here so
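Length normalization followed by cosine scoring, as mentioned in this part of the talk, can be sketched as below. This is a simplified illustration that leaves out the LDA step; the class means, test vectors, and function names are invented for the example.

```python
import numpy as np

def length_normalize(x):
    """Scale each vector (row) to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_scores(test_ivectors, class_means):
    """Cosine similarity between every test i-vector and every class mean."""
    return length_normalize(test_ivectors) @ length_normalize(class_means).T

# Toy example: two classes in a two-dimensional i-vector space.
class_means = np.array([[1.0, 0.0], [0.0, 2.0]])
test_ivectors = np.array([[3.0, 0.5], [0.1, 0.4]])
scores = cosine_scores(test_ivectors, class_means)
predicted = scores.argmax(axis=1)
```

After normalization only the direction of each i-vector matters, which matches the observation that the scatter spreads outward from the origin along class-specific directions.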
0:32:35	i'm not sure this project was unfortunately i was hoping to see different behaviour but
0:32:38	it seems to behave the same way
0:32:42	so this is using t-sne so this is the raw
0:32:45	i-vectors
0:32:46	so since the reason why i was asked this question because of the i was
0:32:49	just which are trained to the task
0:32:51	so how it really actually represent
0:32:54	the that the data and the layer was and their important thing to do
0:32:58	so this is a one is one thing that we were tracked
0:33:01	so now
0:33:04	i just show here the i-vector result and the dnn
0:33:09	averaging the scores of the frames which is better than i-vectors then modelling
0:33:14	the hidden layers is actually even better
0:33:17	and what i can say from all my experiments that i have
0:33:21	been seeing is that the output of the last layer is the worst
0:33:25	one in terms of information so don't take the decision there
0:33:28	with this data we saw that all the information is actually in the hidden layers
0:33:32	there's no doubt about it
0:33:34	so here i give the last layer result and then what happens if you model
0:33:39	everything at once you get more gain
0:33:42	you get another two percent gain by modeling all the hidden layers
0:33:46	and the same thing happens with nist data
0:33:49	so my point here is you know it is true for the hidden layers
0:33:52	you know the more you go deep the better it is
0:33:55	but if you also look at all the correlation that is happening over all the hidden layers
0:33:59	it is actually better
0:34:02	and the reason for example why is you know the even people that do some
0:34:06	you know brain division amount vision and everything that wanna try to the activation the
0:34:11	cost of you know what him or more i've something's can use it and one
0:34:14	level but you cannot see that how this to propagate maybe she can correcting about
0:34:18	that if i'm wrong you know this way we can do the same thing for
0:34:22	the dnn we can
0:34:23	stop at one hidden layer or we can see what's happening in all the dnn
0:34:26	it is the same okay
0:34:28	you can you do td in my right to sit activation how it happened or
0:34:32	you can cut and one levels can and make a decision this is the same
0:34:36	thing can we just so this is the same behaviour and here i'm just saying
0:34:40	that
0:34:41	the dnn has more information than we are now using
0:34:45	because we are not looking at the path of activation that it took to code
0:34:49	the data
0:34:51	so this is dialect id you are probably not familiar with that so i will move
0:34:54	on to the speaker id but before that i did an experiment because you know
0:35:00	the i-vector was completely unsupervised i was thinking okay so the dnn that
0:35:05	i used is actually
0:35:07	discriminatively trained for this specific task
0:35:10	can i have a dnn that is just used to code the data an
0:35:14	autoencoder
0:35:15	for example
0:35:17	and you know the simplest way to do it i said let me just try
0:35:20	to do greedy learning with rbms to try to see you know what is
0:35:24	happening i'm sure that people have more sophisticated networks for that
0:35:28	so i tried these rbms with the same architecture that i trained before the same
0:35:32	data the speech frames as input
0:35:36	and i used dimensionality reduction with the subspace and used cosine distance so
0:35:42	we use five hidden layer rbms
0:35:44	and
0:35:45	these are the results the i-vectors here and the dnn output
0:35:49	but i am having some struggle because i cannot go further than the first layer
0:35:55	for the rbms or the autoencoders
0:35:58	so the first layer gives me the best result and it is not as
0:36:01	good as
0:36:03	you know the discriminatively trained dnn subspace i-vectors but
0:36:10	you know it's not that bad
0:36:12	you know and that's what i have been seeing
0:36:14	so of the hidden layers the first one you train is actually the best one the more
0:36:19	you go deeper
0:36:21	it doesn't help and my
0:36:23	my hypothesis i'm not sure if it's true
0:36:26	is that it is because they are not jointly trained
0:36:28	all together
0:36:30	maybe if all the
0:36:34	layers are jointly trained to maximize the likelihood of the data that may be a
0:36:38	different story and that's what we are trying to investigate now
0:36:43	with my students so can we train for example a variational autoencoder to
0:36:47	maximize the likelihood of the data
0:36:49	and see whether
0:36:50	this representation is meaningful or not
0:36:56	so now for people that are more familiar would
0:37:01	with the nist data so are you what you seen as it was wednesday afternoon
0:37:06	session that people are more than in six languages
0:37:09	i tried to the same thing so we selected with the help of like to
0:37:12	laugh read a give me this subset of the data
0:37:17	so first in the korean mandarin russian vietnamese
0:37:21	and the difference between us and other people doing people try to use all evaluate
0:37:25	data so that want to remove the mismatch but the trend not use the what
0:37:29	of density s and v only be to avoid the mismatch
0:37:32	it because i want to know what's going on
0:37:34	for us was where everything together
0:37:37	it seems that we didn't have this issue
0:37:39	so that's the difference between possibly not you paper and sum p other papers in
0:37:43	the that section of the
0:37:45	wednesday afternoon so we put everything together and we trained a dnn
0:37:49	that takes the frames as input and the output is the six classes
0:37:54	and this is actually that is also so actually before that i will say
0:37:58	i trained five hidden layers of about a thousand each
0:38:03	the input is the stacked frames twenty one frames eleven context frames each side
0:38:10	ten context frames for each side sorry the output is the class
0:38:14	of the six class use a linear according to this time before of course
0:38:19	cosine this one is a collection
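The input described here, a stack of twenty one frames with ten context frames on each side of the current frame, can be sketched as follows. The feature dimension, the toy data, and the choice to pad the edges by repeating the first and last frame are assumptions made for illustration; the talk does not say how the edges were handled.

```python
import numpy as np

def stack_frames(features, context=10):
    """Stack each frame with `context` frames on each side.

    features: array of shape (num_frames, dim).
    Returns an array of shape (num_frames, (2 * context + 1) * dim).
    Edges are padded by repeating the first and last frame (assumption).
    """
    num_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    window = 2 * context + 1
    return np.stack([padded[t:t + window].reshape(-1) for t in range(num_frames)])

# Toy example: 50 frames of 3-dimensional features, giving 21 * 3 = 63 inputs.
features = np.arange(50 * 3, dtype=float).reshape(50, 3)
stacked = stack_frames(features)
```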
0:38:21	and so here this is the result on a subset of the two thousand nine
0:38:26	for the six languages
0:38:28	so there's the result of the i-vectors on the thirty second ten second and three second
0:38:32	and the average of the scores which is what everyone is doing the
0:38:37	direct approach
0:38:38	and
0:38:40	so the characteristic of this as has been said before
0:38:44	is that the dnn
0:38:49	average only beats the i-vectors on the three second on the thirty second and ten second
0:38:53	it doesn't work
0:38:55	but what happens when you do the hidden layers is a little bit of a different story
0:39:00	so as well the more you go deep in the dnn the better it is
0:39:05	so this is the same thing it is not a different story here
0:39:09	but the thing is
0:39:12	or actually here forty four you know participant and second that no one is able
0:39:16	to be this because the this
0:39:18	if you do the hidden layers and for example if i go to hidden layer
0:39:21	five
0:39:23	it obtains the best result everywhere
0:39:25	even for ten seconds and for thirty seconds
0:39:29	so the hidden layers and also this was actually interesting hidden layer
0:39:33	five is just the one preceding the output
0:39:38	so this is one sign that the last layer is the one that you really don't need
0:39:42	to look at
0:39:43	so based on my experience and here again you see that the last hidden
0:39:47	layer is actually much better than
0:39:50	the i-vectors as well as the dnn output average
0:39:56	so the hidden layer i-vector representation for this case seems to do
0:40:01	an interesting job of aggregating and pooling
0:40:05	the frame data to make a representation of the data and you can do classification
0:40:09	with it
0:40:09	so this is an interesting finding so it was actually surprising to see that
0:40:14	on the data
0:40:15	so now
0:40:16	what happened when you do everything model all that a whole hidden layers as well
0:40:21	so here are show d
0:40:25	i-vector representation d v d n and every score as well as the last hidden
0:40:28	layer five
0:40:30	and you know i'll i
0:40:34	and also try to see what happen if you do
0:40:39	all hidden layers what used again some k
0:40:44	and you can win also one almost like zero point eight this sorry i forgot
0:40:48	synthesis the averages right in there so we can see that for thirty seconds there
0:40:53	is already low
0:40:54	you know i don't i don't think that too much seriously
0:40:57	we lose a little bit here but for ten seconds we were able to gain
0:41:02	and for three seconds we were also able to
0:41:06	so it's the same behaviour that all the hidden layers
0:41:10	have better information than a single layer at a time
0:41:14	and also the last hidden layer is also better than the first layer
0:41:20	so the last one is better than the first layer
0:41:24	of the hidden layers but the last output layer is not that interesting
0:41:31	in terms of making decisions
0:41:33	so one reason to be honest one explanation is that these dnns sometimes
0:41:38	tend to overfit
0:41:39	which i just a do
0:41:41	second to shoot
0:41:43	but even when they overfit like that and use them to make a representation or
0:41:48	discrete your space
0:41:50	it's in they work fine if you try to make decision what over fitting a
0:41:54	different story
0:41:55	as one thing here
0:41:56	so this is what i have been finding this last
0:42:01	you're trying to use this models to
0:42:04	understand what's going on
0:42:07	so
0:42:09	so let me try to conclude
0:42:12	so we have five minutes and have something called that you want to say
0:42:16	so the i-vector representation is you know an elegant way to do a representation of
0:42:22	speech with different lengths you know a lot of people already also used it
0:42:25	a wood that's and twenty of
0:42:27	of the work of the recordings the one where you have a long segment and
0:42:31	short segment
0:42:32	gmm mean adaptation and gmm weight adaptation subspaces can also be applied as i showed
0:42:39	as you have seen in this talk they can be applied to model the
0:42:43	dnn activations
0:42:44	in the hidden layers as well and they are doing a good job
0:42:48	so that was actually the take home here
0:42:51	so the statement that i want to focus on here is that all
0:42:56	the information that was modelled in the dnn is not in the output but is in
0:43:01	the hidden layers
0:43:03	look at the hidden layers
0:43:05	don't try to make a decision directly from the output
0:43:09	so
0:43:11	and so also you know looking at one layer at a time and not
0:43:17	seeing what's going on in all the hidden layers
0:43:19	may be a mistake so it may be good also to look
0:43:23	at that
0:43:24	because it will tell you how the information went through all the dnn
0:43:28	and how it allows each class to be modelled
0:43:32	that's something that seems to be
0:43:35	very useful
0:43:37	the subspace approaches that i have been trying are one thing that i was thinking of
0:43:42	to do this work especially in terms of modelling all the hidden layers
0:43:47	that you know we can use and they seem to be doing a good job of
0:43:51	pooling and aggregating all the frames and giving you a
0:43:56	representation with the maximum information you can use for your
0:43:59	for your classification task
0:44:02	so this seemed to be very good even if the dnn was trained
0:44:07	frame based
0:44:08	so when the dnn is trained frame based and you use it to make a
0:44:12	sequence classification
0:44:15	the i-vector as a representation seems to be doing a really good job for that
0:44:22	so
0:44:23	take two minutes to
0:44:25	and we have to mitigate
0:44:28	so
0:44:30	for future work contracts that we have been explored my students and colleague
0:44:35	my colleagues
0:44:37	is
0:44:37	now as i said earlier the dnns that i have been using are frame based
0:44:42	and segment length
0:44:43	a frame context of twenty one or something like that
0:44:47	so we are trying to shift to
0:44:50	more like memory dnns like for example tdnn
0:44:56	or lstm which is a
0:44:58	special case of recurrent networks that's what ruben is doing
0:45:02	my intern so we are trying to explore that instead of frame-by-frame to make more
0:45:09	to extract a model more speech more the dynamics
0:45:14	explore more data such vector for speaker
0:45:17	to make them more useful for speaker
0:45:20	we're still working on that as well
0:45:23	and as i said earlier i would be interested in people's thoughts so come
0:45:26	talk to me i mean maybe there is a better way to do
0:45:30	autoencoders
0:45:31	to really code the speech data
0:45:35	and my hope is that at some point we would be able to
0:45:39	to get some speech modeling dnn or speech coder so you
0:45:43	know
0:45:44	it just call the speech and after that i used to discrete my space and
0:45:48	use task for example i give you
0:45:51	a bunch of thousands of recordings you code your data and after that you say
0:45:55	i want to do speaker i want to do language
0:45:58	can i do it from the
0:46:00	from the same model
0:46:01	just coding speech
0:46:03	so if anyone has any idea or have any tell please come talk to me
0:46:09	so also to make the activations more interesting
0:46:15	i'm interested in exploring the sparsity of activation for the hidden layers
0:46:19	no i'm not doing it specifically i'm trying to use the dnn training but
0:46:23	is there a way to for example one way that i'm doing now we didn't
0:46:27	have time to compare the results is dropout
0:46:30	for example i say
0:46:32	for each input fifty percent of my for each hidden layer fifty percent of my neurons
0:46:36	are active
0:46:37	so there is some randomness between the recordings but also within the hidden layers because
0:46:41	i find that actually if you have a dnn and two
0:46:44	consecutive hidden layers sometimes they are redundant because they are close together but if you skip one
0:46:50	the separation between them is actually better
0:46:53	sometimes
0:46:54	so if you do sparse activation with for example dropout which is the simplest way
0:46:58	to do it
0:46:59	you make them complementary because there's some randomness happening in the middle
0:47:03	so you force the dnn to take a different path for each hidden layer
0:47:07	randomly
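The dropout idea mentioned here, where roughly fifty percent of the units of a hidden layer are active for each input, can be sketched as below. The rescaling by the keep probability (so-called inverted dropout) is an assumption of this sketch, and the function name and toy activations are hypothetical.

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=None):
    """Randomly keep each unit with probability keep_prob and rescale the rest.

    Returns the masked activations and the boolean mask that was applied.
    The division by keep_prob keeps the expected activation unchanged
    (an assumption; the talk only says half of the units are active).
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob, mask

# Toy example: four input frames, one hidden layer with 1000 units.
rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))
dropped, mask = dropout(hidden, keep_prob=0.5, rng=rng)
```

Each input frame gets its own random mask, so consecutive layers see different subsets of active units, which is the source of the randomness described in the talk.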
0:47:08	so
0:47:10	so that's something i'm really interested in to make the
0:47:13	information between two consecutive
0:47:16	hidden layers more powerful more interesting and make them complementary rather
0:47:20	than redundant
0:47:21	and also there's a way to for example alternate activation functions
0:47:26	for example we can say sigmoid rectified linear sigmoid and so on
0:47:30	so between two consecutive sigmoids there is something in the middle to make things change a
0:47:35	little bit
0:47:35	so the behaviour changes for the consecutive sigmoids
0:47:38	so when you model down there is there's hopefully a way to get more information
0:47:44	and you're so in the subspace and also how the how the d n and
0:47:48	is coding information can be useful for the classification
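Alternating activation functions between consecutive hidden layers (sigmoid, rectified linear, sigmoid) can be sketched with a small forward pass. The layer sizes, random weights, and missing bias terms are simplifications chosen for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, activations):
    """Run a feed-forward pass and return the output of every hidden layer."""
    outputs = []
    h = x
    for w, act in zip(weights, activations):
        h = act(h @ w)
        outputs.append(h)
    return outputs

# Toy network: 63 inputs, three hidden layers of 16 units (no biases).
rng = np.random.default_rng(0)
sizes = [63, 16, 16, 16]
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes[:-1], sizes[1:])]
layers = forward(rng.standard_normal((5, 63)), weights, [sigmoid, relu, sigmoid])
```

The per-layer outputs returned by `forward` are the kind of activations that would then be accumulated and modelled in the subspace.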
0:47:52	and
0:47:53	to conclude
0:47:55	well i'm organising slt two thousand sixteen in puerto rico
0:47:59	so hopefully to suit their lee's summit your paper the same is the same time
0:48:03	as the c
0:48:04	so that that's this work so please help to see there and if you come
0:48:08	at the workshop you can also stay
0:48:11	for the rest of the week you enjoy the beach and that the cocktails very
0:48:15	nice to signature nor owns to make your compared to the right object a function
0:48:19	so and so that's it thank you
0:48:42	jim had sent you from these distortions mum concerns just about a point which is
0:48:47	not main point of view or which is not in the main point of your
0:48:51	talk
0:48:52	it's about the visualization
0:48:54	in particular the t-sne the stochastic neighbor embedding
0:48:58	the to use of form determinization think that
0:49:03	this techniques and that is this phenomena useful and satisfying four
0:49:07	for thinking for the it but also for the thinking and understandings the distributions
0:49:13	but we remark and some if you put forth
0:49:17	and so for presenting the high divorce which of data with those techniques particular these
0:49:24	speaker classes
0:49:26	i'll distributed along ambulance form norwegian
0:49:31	this thing directions
0:49:33	t-sne does not respect the initial distribution
0:49:38	it separates speaker classes but so as you
0:49:42	the does not respects is montreal
0:49:45	direction of speaker classes
0:49:48	so it is useful because we use e
0:49:52	separation between necklaces of speaker
0:49:55	but not
0:49:59	or maybe more
0:50:01	view of this is we'll distribution
0:50:05	so it's i think a very good tool
0:50:07	two but it may become few not to use it of to propose a new
0:50:13	nist
0:50:15	it's as those more one so you're saying i it's here's just want to show
0:50:21	that you know how it's kind of structured but i'm not taking into account how it's
0:50:25	modelled as a distribution by t-sne that's what you're saying yes
0:50:32	simply for the also points in particular fourteen and
0:50:45	i didn't write down all the numbers but i saw you had results
0:50:49	on the dialect id task for the five arabic dialects
0:50:54	and the numbers i was writing down here you had i think that the
0:50:58	fourth layer was at twelve point two percent and then when you went to the
0:51:05	fifth it was twelve point five percent
0:51:12	wise
0:51:14	as you're moving forward you're actually getting improvement but would really be nice in dialect
0:51:19	id it's a lot more subtle differences between a derelict right search a lot of
0:51:24	times it interesting to figure out what are the things that are differentiating between each
0:51:29	of the dialects so i'm wondering if it anywhere you go back
0:51:33	and look and the bad the test files that you went through here for guitar
0:51:37	residual moving in the improvement here
0:51:40	you you're some not your hand it may be assumption would be that you're getting
0:51:44	a few more files except it correctly but you're just likely to have a few
0:51:51	morph rows rejected incorrectly a and it would be nice to can see what they
0:51:56	balance it's are you getting more pluses
0:51:59	and you're losing a few or are you not losing anything in gaining more so
0:52:03	that's where i'd like to kind of c is you're moving down here is zero
0:52:06	is a positive movement forward or are there some better falling backwards but the net
0:52:13	gain is always possible
0:52:14	no i agree with that in i didn't do it you know virginia the wood
0:52:18	but also is interested at the time of than more interested also to see
0:52:23	between the hidden layers what's if i'm getting i was hoping to see what happened
0:52:29	the recording you know is that having a linguist work we made me trying to
0:52:32	understand okay handling like this that classified correctly in the hidden layer five but not
0:52:37	in the layer for three or to what make its change that it's so i
0:52:42	want to know
0:52:42	which affirmation of the five layer that got me to make this one better than
0:52:46	another one that's true we window at the end we were thinking about
0:53:02	so not too much just want to thank you very much for proposing a new
0:53:07	solution to the very heart problem so
0:53:11	i just like to put that the difficulty of the problem in into context because
0:53:15	we've been banging our heads against the same kind of difficulty so
0:53:20	to summarize the problem
0:53:23	it the problem is to get a low dimensional representation of the information in the
0:53:28	in that it in a sequence so you've got lots of speech frames
0:53:32	and then you want to distill the information in all the speech frames to a
0:53:36	single smallish vector
0:53:39	so
0:53:41	the reason is difficult is let's look at the i-vectors the classical i-vector so
0:53:47	you can write down information that the generative model for the i-vectors in one equation
0:53:52	you had
0:53:54	so
0:53:55	it's very easy for most of us to just look at that an immediately understand
0:54:00	so that's the generative route
0:54:02	but what you're doing is the inference route
0:54:05	from the data back to the two
0:54:08	the hidden information so now we have two
0:54:12	share all the information from all the frames accumulate that information back into
0:54:18	back into the single
0:54:22	vectors so
0:54:24	if you look at the i-vector solution
0:54:27	that the formula for
0:54:29	for
0:54:31	calculating the i-vector posterior
0:54:33	that's a lot more complex than
0:54:36	just the generative formula for the i-vector
0:54:39	and that takes as
0:54:43	might be applied to the live
0:54:45	i that formula and
0:54:48	i believe it's similarly difficult for the neural network to learn that
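The contrast drawn here between the generative formula and the posterior can be made concrete with the standard i-vector equations. The notation below is the usual textbook one and is an assumption of this note, not something shown in the talk: M is the recording's mean supervector, m the UBM mean supervector, T the total variability matrix, w the i-vector, Σ the UBM covariance, N the zeroth-order statistics, and F̃ the centered first-order statistics.

```latex
% Generative model: one line
M = m + T w, \qquad w \sim \mathcal{N}(0, I)

% Inference: posterior of w given the statistics of one recording
p(w \mid X) = \mathcal{N}\!\left( L^{-1} T^{\top} \Sigma^{-1} \tilde{F},\; L^{-1} \right),
\qquad L = I + T^{\top} \Sigma^{-1} N T
```

The posterior needs statistics accumulated over every frame and a matrix inversion per recording, which illustrates why the inference direction is the harder one.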
0:54:53	so you mentioned the variational bayes autoencoders
0:54:59	so we've been looking at that quite a lot
0:55:02	in the papers that have been published thus far it's always a one-to-one relationship between
0:55:07	the hidden variable and the observation and then everything's i i d so
0:55:12	the machine learning papers have been solving a much easier problem
0:55:16	so
0:55:18	to accumulate all that information is a harder problem that's also computationally it is
0:55:25	also computationally hard
0:55:27	if you think of the i-vector posterior lots of papers were published on how to make
0:55:32	that computationally lighter
0:55:34	so
0:55:36	that's why i say your
0:55:37	new solution is quite exciting to us
0:55:42	what else also the one of the guy from machine learning ask me okay say
0:55:46	okay so we have indian and you have your i-vectors representation can you propagate the
0:55:51	errors from the i-vectors of the nn to make it more power for your specific
0:55:56	task with the i-vector percent
0:55:58	that's something interesting for psd topic noise
0:56:01	if you're i
0:56:04	you know way to combine the subspace and that the like the same as what
0:56:08	people do in the data in asr the symmetry of training sequence of training can
0:56:13	we do the summary things with when you have the error coming from the i-vector
0:56:16	space that work to propagate the data the d n n's dow
0:56:21	that's something maybe
0:56:23	interesting as well that's we got from machine learning cost me this
0:56:35	so not nice presentation nudging
0:56:39	i hadn't thought of questions one was
0:56:43	when people moved from gmm based i-vectors to you know dnn
0:56:49	based i-vectors using you know senones as classes
0:56:54	as i understood the improvement was
0:56:57	because of the fact that the space was quantized much better than using gmms right
0:57:03	and
0:57:04	whether it was phones as classes or you know languages as classes
0:57:08	so
0:57:09	now you're proposing to use an autoencoder which
0:57:14	has no information about you know any classes so what's your intuition behind why
0:57:20	something like that would work better than
0:57:22	using you know senones or you know languages as classes
0:57:27	well you know it's actually is a good question so my tuition is just a
0:57:33	my feeling up in the speech processing and hairless how without doing it
0:57:37	is we start too much scrolly
0:57:39	make in to win information away from the signal
0:57:42	for example
0:57:43	here if you do line frame and language is a class
0:57:47	i'm normalising speakers i'm doing the l d n is doing all the things for
0:57:51	you
0:57:52	so i'm hoping to not do that
0:57:55	try to maximize as much information
0:57:58	as i can
0:58:00	for example i give you
0:58:02	to a four thousand six or ten thousand of speech i don't giving level about
0:58:06	the development but you know going to train the speech continuance provides way in your
0:58:11	data which you had be helpful for you because you have thousand hundred thousand speech
0:58:16	and maybe in the industry is different
0:58:19	i say you have moral appleton with us
0:58:22	but for so can we do that so that's what i hope so i can
0:58:26	you know this is the same talk what the jackal said the twenty have letterman
0:58:30	supervised
0:58:31	can you use that you and your training
0:58:32	so i'm hoping to have a kind of speech coder
0:58:36	this model speech that you hear something you given the same thing from both sides
0:58:39	of the affirmation is there
0:58:42	it's not sure what away it just how to use it
0:58:45	that's exactly feeling wineries and i'm not saying that would be the i don't they
0:58:49	would be the destructive training or something like that i'm just saying that if i
0:58:52	haven't all the speech coder that something like to maybe if i am too much
0:58:56	use anything august alameda truth but that this is what i one is like something
0:59:01	you know if we haven't woken colour style or something like that
0:59:04	if the if he can produce the speech again
0:59:08	so the information is there we just need to extract it
0:59:12	i don't know if it was clear and

I-Vector Representation Based on GMM and DNN for Audio Classification

Keynotes

Najim Dehak