0:00:06 Okay, so my name is Sandro Cumani, I am from Politecnico di Torino, and I will present our work. Its title is "Analysis of Large-Scale SVM Training Algorithms for Language Recognition".
0:00:21 This is the outline of the work: first a short introduction, then I will spend a few words on support vector machines. Then we discuss some algorithms for training large-scale support vector machines. I will present the subset of the LRE models that we trained in order to evaluate their performance. Then I will present our experimental results, continue with some notes on pushed GMM systems, and close with the conclusions on SVM training.
0:00:58 So, why SVMs? As you can see, SVMs tend to appear in many different systems; here we will focus on our language recognition systems. Just to make some examples, we have a phonetic n-gram based system, a GSV SVM system, and pushed GMMs. They are quite different, but they all share one block, which is SVM training and classification. For the phonetic and GSV SVM systems the SVM is the classifier itself; in GMM pushing it is actually used in a different way. However, they all need SVM training.
0:01:38 So, support vector machines. An SVM is a linear classifier whose objective function can be cast as a regularized risk minimization problem. The most used loss function is the hinge loss, which gives rise to what is called a soft-margin classifier, and the regularization term is given by the norm of the hyperplane, which is actually related to the inverse of the margin. So we have a trade-off between the margin and the misclassification error.
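For reference, the soft-margin objective described here is the standard regularized risk formulation; in the usual notation (not taken from the slides), with hyperplane w, regularization coefficient λ, and labeled patterns (x_i, y_i):

```latex
\min_{\mathbf{w}} \;\;
\underbrace{\frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^{2}}_{\text{regularizer}\ \sim\ 1/\text{margin}^{2}}
\;+\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \max\!\left(0,\; 1 - y_i\,\mathbf{w}^{\top}\mathbf{x}_i\right)}_{\text{hinge loss}}
```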
0:02:19 Another formulation is given by the dual form of the SVM problem, which is actually a constrained quadratic optimization problem. This formulation is interesting because we have a matrix of dot products between training patterns, and the fact that we can work with dot products alone allows us to extend the support vector machine to nonlinear classification by means of what is called the kernel trick: we map our data to a high-dimensional space by just evaluating dot products in that space, without the need to actually perform any kind of projection.
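In standard notation (again, not the slides' own), the dual referred to here is the constrained quadratic problem

```latex
\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{n} \alpha_i
\;-\; \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)
\qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
```

where the training patterns enter only through the dot products k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩; replacing the plain dot product with another kernel function is the kernel trick.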
0:03:04 So, why large-scale SVMs? Because we have many training patterns: for LRE09 we have around seventeen thousand, which may not be so many for a recognition system in general, but for our needs they are many. And the dimensionality also varies a lot, because we can go from about one thousand for a bigram model with thirty-five phone units to more than one hundred thousand for a GMM system.
0:03:41 So now I present different algorithms to train the SVM in an efficient way. Most of these algorithms actually work for linear kernels only, but the kernels we use are almost always linear, so this is not a problem.
0:04:04 This is our baseline, SVMLight, which is one of the most famous dual-space solvers. It solves the dual problem in an iterative way, by decomposing the actual problem into smaller subproblems. The problem is that it has a quadratic time behavior. It can be sped up by caching the kernel evaluations; however, the matrix of dot products also tends to grow quadratically in memory. So we are interested in algorithms which are memory bounded and possibly linear in time.
0:05:10 The first algorithm we analyzed was Pegasos, which is a primal-space solver based on subgradient stochastic descent. We talk about subgradients because the loss function is not differentiable everywhere, so we cannot take the gradient and we resort to subgradients. And we have stochastic selection of the learning samples: we do not train the system every time on the whole database, but we just select a random subset of training patterns. In order to improve the convergence, there is also a projection onto a ball whose radius is the inverse square root of the regularization coefficient of the SVM problem formulation, and this actually helps to reach convergence. The problem with this algorithm is that it does not directly provide the dual solution of the SVM problem; however, if we need it, as we do if we want to implement pushed GMMs as MIT proposed, we can actually recover it while we are training the hyperplane.
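The following is a minimal sketch of a Pegasos-style update loop, assuming the standard algorithm of Shalev-Shwartz et al.; the function name, batch size, and iteration count are illustrative, not the talk's actual configuration:

```python
import numpy as np

def pegasos_train(X, y, lam, n_iters, batch_size=64, seed=0):
    """Pegasos-style primal solver: stochastic subgradient steps on the
    hinge loss, plus projection onto a ball of radius 1/sqrt(lam)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    radius = 1.0 / np.sqrt(lam)
    for t in range(1, n_iters + 1):
        # stochastic selection: a random subset, not the whole database
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        Xb, yb = X[idx], y[idx]
        viol = yb * (Xb @ w) < 1          # where the hinge loss is active
        g = lam * w                       # subgradient of the objective
        if viol.any():
            g -= (yb[viol, None] * Xb[viol]).sum(axis=0) / len(idx)
        w -= g / (lam * t)                # decaying step size 1/(lam * t)
        norm = np.linalg.norm(w)          # projection step: helps
        if norm > radius:                 # convergence, as noted in the talk
            w *= radius / norm
    return w
```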
0:06:33 Next we have dual coordinate descent. This time we move back to a dual-space solver, and again we have an iterative solver, which performs coordinate descent in the dual space. We split the dual problem into a series of univariate optimizations, where we keep all but one variable fixed and we optimize just that one variable by some kind of one-dimensional minimization; we just have to project the gradient in order to ensure that the SVM dual problem constraints are satisfied. This time we do not directly have the primal solution, but it is very easy to update it while we are updating the dual solution. This is nice also because, in order to evaluate the scores, we do not have to store the support vectors, since we already have the hyperplane. This algorithm can be sped up by performing a random permutation of the subproblems, that is, we just switch the order in which we optimize the variables, and also by introducing some sort of shrinking, which means that we tend not to update the variables which have reached the bounds of the constraints of the SVM problem, because they will probably stay there; we just check that this assumption is correct when we meet the convergence criterion.
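A minimal sketch of such a dual coordinate descent loop, in the spirit of the LIBLINEAR solver of Hsieh et al.; the simple at-bound test below stands in for full shrinking, and all names are illustrative:

```python
import numpy as np

def dcd_train(X, y, C, n_epochs=10, seed=0):
    """Dual coordinate descent for the L1-loss SVM dual: optimize one
    alpha_i at a time while keeping the primal hyperplane w in sync."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Qii = np.einsum('ij,ij->i', X, X)      # diagonal of the Gram matrix
    for _ in range(n_epochs):
        for i in rng.permutation(n):       # random order speeds convergence
            if Qii[i] == 0.0:
                continue
            G = y[i] * (X[i] @ w) - 1.0    # gradient along alpha_i
            # projected gradient: variables sitting at a bound whose gradient
            # pushes them outward are left alone (a real solver would
            # "shrink" them away until the final convergence check)
            if (alpha[i] == 0.0 and G >= 0.0) or (alpha[i] == C and G <= 0.0):
                continue
            old = alpha[i]
            alpha[i] = min(max(old - G / Qii[i], 0.0), C)   # box-clipped step
            w += (alpha[i] - old) * y[i] * X[i]             # cheap primal update
    return w, alpha
```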
0:08:22 Then we have a primal-space solver, the cutting-plane approach introduced in SVMperf. This is based on a different formulation of the SVM problem, the so-called one-slack formulation, where we optimize over the hyperplane and a single slack variable, and this time we have a much larger set of constraints. What this algorithm does is iteratively rebuild a working set of the constraints over which we solve the quadratic problem. What is interesting in this algorithm is that the solution is not represented by means of support vectors, but by means of so-called basis vectors, which essentially play the same role but are not actually taken from the training set itself. So what we obtain is a much sparser representation, because the number of basis vectors is much more limited compared with the support vectors, whose number actually tends to increase linearly with the training set size. However, the problem is that this time we cannot easily recover the dual solution of the SVM problem. What is nice to see is that, since we have so few basis vectors, it is easy to extend this technique to nonlinear kernels, but we actually did not try.
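For reference, the one-slack formulation referred to here is, in Joachims' standard notation (not the slides' own):

```latex
\min_{\mathbf{w},\,\xi \ge 0} \;\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\,\xi
\qquad \text{s.t.} \quad
\forall\, \mathbf{c} \in \{0,1\}^{n}: \;\;
\frac{1}{n}\,\mathbf{w}^{\top} \sum_{i=1}^{n} c_i\, y_i\, \mathbf{x}_i
\;\ge\; \frac{1}{n} \sum_{i=1}^{n} c_i \;-\; \xi
```

A single slack variable ξ is shared by the 2^n constraints, and the cutting-plane solver only materializes the few constraints in the working set; the aggregated vectors Σ_i c_i y_i x_i are what play the role of the basis vectors mentioned above, which is why they are not training patterns themselves.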
0:10:03 The final algorithm is BMRM, which comes from a more general regularized risk minimization framework and which, for the SVM, amounts to a form of cutting planes. This time again we iteratively build a working set of approximate solutions by taking tangent planes to the objective function, and we solve the minimization over the functional approximated by means of the tangent planes. So we still need to solve a quadratic problem, but the size of this quadratic problem is much smaller: it is actually equal to the number of tangent planes we are using to approximate the function, which is also equal to the number of iterations we have taken. So the size of this problem is much, much smaller than the size of the original problem, and usually it can be neglected, since we do not need more than two hundred or so iterations. BMRM handles the primal formulation of the SVM problem, but the dual solution can be derived easily this time as well.
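A sketch of the approximation being described, in standard bundle-method notation (not the slides'): each subgradient a_j of the empirical risk R_emp at iterate w_j gives a tangent (cutting) plane R_emp(w) ≥ ⟨a_j, w⟩ + b_j, and iteration t solves

```latex
\mathbf{w}_{t+1} \;=\; \arg\min_{\mathbf{w}} \;\;
\frac{\lambda}{2}\lVert\mathbf{w}\rVert^{2}
\;+\; \max_{1 \le j \le t}\,\bigl(\langle \mathbf{a}_j, \mathbf{w}\rangle + b_j\bigr)
```

whose dual is a quadratic problem with only t variables, one per tangent plane, which is why the cost of this inner problem is negligible for a couple of hundred iterations.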
0:11:23 Now, the models. We trained a small subset of the models we used for the LRE09 evaluation. The phonetic model is just a standard bigram-based system, where we perform phone decoding using an Italian tokenizer. Then we extract the bigram counts, we perform SVM training, and we adopt the TFLLR kernel, which is actually a linear kernel, so we just perform some kind of frequency normalization of the counts before feeding them to the SVM.
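A hedged sketch of what a TFLLR-style normalization looks like, assuming the recipe of Campbell et al.; the helper name and the background-frequency input are illustrative, not the talk's actual pipeline:

```python
import numpy as np

def tfllr_features(ngram_counts, background_freq, floor=1e-8):
    """Scale per-utterance n-gram frequencies by 1/sqrt(background
    frequency), so a plain dot product approximates a TFLLR kernel."""
    # relative n-gram frequencies for this utterance
    rel = ngram_counts / max(ngram_counts.sum(), 1)
    # floor the background frequencies to avoid division by zero
    return rel / np.sqrt(np.maximum(background_freq, floor))
```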
0:12:02 The acoustic system is a standard 2048-Gaussian GMM with 56-dimensional features. We stack the Gaussian means into supervectors, and we use the KL kernel, which again just amounts to normalizing the patterns.
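A hedged sketch of the KL-kernel normalization for mean supervectors, assuming the usual Campbell-style recipe; the talk does not spell out the exact scaling:

```python
import numpy as np

def kl_supervector(adapted_means, ubm_weights, ubm_diag_covs):
    """Stack per-utterance adapted GMM means into one supervector, scaled
    so that a plain dot product approximates the KL-divergence kernel.

    adapted_means: (C, d) adapted Gaussian means
    ubm_weights:   (C,)   UBM mixture weights
    ubm_diag_covs: (C, d) UBM diagonal covariances
    """
    # per-component scaling sqrt(w_c) * Sigma_c^(-1/2)
    scaled = np.sqrt(ubm_weights)[:, None] * adapted_means / np.sqrt(ubm_diag_covs)
    return scaled.reshape(-1)
```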
0:12:22 The system we were actually interested in evaluating was the pushed GMM system, where we use the SVM dual solution as the combination weights for the model and the anti-model, and scoring is performed by means of a log-likelihood ratio.
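A hedged sketch of this weighting, assuming the usual pushed-GMM construction (the talk does not give the exact formula): the dual coefficients α_i select and weight the training utterances' GMM means, e.g.

```latex
\boldsymbol{\mu}_{\text{model}} \;\propto\; \sum_{i:\, y_i = +1} \alpha_i\, \boldsymbol{\mu}_i,
\qquad
\boldsymbol{\mu}_{\text{anti}} \;\propto\; \sum_{i:\, y_i = -1} \alpha_i\, \boldsymbol{\mu}_i
```

with the test utterance scored by the log-likelihood ratio between the two pushed models. Under this reading, the "arithmetic mean" pushing discussed later simply corresponds to setting all the α_i equal.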
0:12:44 As for the evaluation conditions, we tested on LRE09, which combines twenty-three languages with narrowband broadcast and telephone data. We tested the systems on the thirty-second, ten-second, and three-second evaluation conditions. Training was performed using more or less seventeen thousand training sentences. The main difference between what we did this time and what we did in the LRE09 evaluation is that this time we trained channel-independent systems, while for the evaluation we used channel-dependent systems, so our models are not exactly the ones we used in the evaluation. Class balancing is simulated for all systems except for SVMperf, which does not easily allow it; we perform the simulation by just playing with the C factor of the loss function. And in order to improve time performance, all models were trained together, so that with just one scan of the database we train all the models.
0:14:00 Here are the results for the phonetic system. The reference here is the same system trained with the hinge loss, shown just to give a reference result. What you can see is that all results refer to systems which have met the tight convergence criterion, and all the systems perform almost the same, except for SVMperf, which suffers from the lack of class balancing.
0:14:46 This is the same for the acoustic system: the results are almost the same. This time we do not have SVMperf, because it does not provide the dual solution, which we need to build the pushed GMM system.
0:15:04 Now the training times; here is the phonetic system, thirty-second condition. What we can see is that dual coordinate descent performs very well. SVMLight, which is our baseline, is not shown because it took more than nine thousand seconds to train, so we just report its cost; its training time was just too large to show in the plot. What we can see is that all the algorithms improve performance with training time, but the best performing one is actually coordinate descent, which allows us to train the SVM in more or less two hundred seconds.
0:15:56 It is the same for the ten-second condition. For the three-second condition we can actually notice that the best generalization of the SVM is reached slightly before we actually meet the convergence criterion; however, it is not so relevant.
0:16:23 Now we get to the acoustic pushed GMM system. For the thirty-second condition there is nothing new with respect to the previous graphs: again coordinate descent performs quite well, and there is just a small difference in cost between one system and another, but it is very little.
0:16:48 For the ten-second condition we start to see some interesting things, and for the three-second condition we obtained results which we did not expect. Here, for the pushed system, the first part of the graph represents pushing with weights which are far from the SVM optimum, and what we obtain is that the SVM optimum is actually not optimal for pushed GMMs, at least for the three-second condition. Here we have the first iteration of the BMRM method, which essentially amounts to pushing by simply taking the arithmetic mean of the true-class patterns and of the false-class patterns, without any need of SVM training; and for the three-second condition this actually performs even better than using the SVM-optimal weighting. And here we have some other kinds of weightings which are very far from the optimum; they have no real meaning, they are just intermediate solutions of the algorithm. They are very far from the optimum, but they are the best performing for the three-second condition.
0:18:22 So, as we said, for pushed GMMs we obtained unexpected results: we get good performance even when the pushing weights are very far from the SVM optimum. A model and an anti-model trained by just taking the arithmetic mean actually improve performance on the three-second condition; and while SVM-based pushing improves performance for the thirty-second condition, and slightly for the ten-second condition, for the three-second condition we still have to look into it.
0:18:58 So now the conclusions on SVM modeling. We tried different algorithms. What we obtained is that dual coordinate descent proved to be the fastest one for our problems, and if we are still interested in the dual solution, it is provided natively. However, since the solution is updated after each pattern, it cannot directly be distributed in a grid or cluster environment. SVMperf is the second fastest algorithm, and it can take advantage of a distributed environment, since the updates are performed just at the end of a complete database scan; the scaling is good also for nonlinear kernels. However, the dual solution is not provided, and class balancing cannot be implemented directly in an easy way, as is possible with the other algorithms. Then BMRM: BMRM is much slower than the other algorithms; however, we still have to see how much we can speed it up by using a distributed environment, since again the solution update is performed after a complete database scan. What is also interesting in this algorithm is that it allows working with very different loss functions. And finally Pegasos, which turned out to be slower than the other ones and moreover cannot exploit a distributed environment.
0:20:44 So, that was all. Thank you. Questions?
0:20:59 (Question from the audience, partly inaudible:) You showed that stopping before reaching the SVM solution performs better; does it mean that the model is overfitting?
0:21:34 Okay, actually I think that the optimal solution of the SVM problem is trying to minimize an estimate of the generalization error, and it is just an estimate, because we train on the training set and the SVM tries to estimate the generalization error while still using the training set. So when we actually deploy it, it can happen that the actual best generalization error is not obtained when we have reached the tight convergence criterion, but when we impose a less tight criterion for convergence, that is, when we stop some iterations before. So maybe it is just a matter of imposing less tight convergence conditions.
0:22:38 (Question:) With the number of samples that you trained on, doesn't the kernel matrix become too large to store in memory?
0:23:05 Yeah, actually it should. We went from LRE05, where we had five thousand samples, to LRE09, where we have seventeen thousand, and we do not know what we will have next: if we had, say, thirty thousand, I do not know if it would still fit in memory. And yes, SVMLight was trained by evaluating the kernel matrix and storing it in memory.
0:23:35 (Follow-up question inaudible.)
0:23:44 (Question:) I had a quick question: when you say class balancing, do you mean giving equal weight to the positive and the negative samples? (Answer:) Yeah, well, actually I mean we tried to simulate having the same number of true samples and false samples by playing with the scaling parameter of the loss function: that is, the losses for the true patterns and the losses for the false patterns are just weighted differently.
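In standard notation (not the speaker's slides), this amounts to weighting the two sides of the loss with class-dependent factors, e.g.

```latex
C^{+} \sum_{i:\, y_i = +1} \xi_i \;+\; C^{-} \sum_{i:\, y_i = -1} \xi_i,
\qquad \text{with} \quad C^{+}\, n^{+} \;=\; C^{-}\, n^{-}
```

so that the n⁺ true samples and the n⁻ false samples contribute equally to the objective.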
0:24:15 Okay, let's thank the speaker again.