0:00:13 So, what I would like to talk about today is how we combine
0:00:19 multiple binary classifiers to solve the
0:00:23 multi-class problem.
0:00:25 The technique that we are going to use is
0:00:28 geometric programming, where
0:00:30 we can solve the problem using one of the convex optimization solvers.
0:00:40 I would like to start with multi-class learning,
0:00:45 and then the two different approaches to solving multi-class problems:
0:00:51 either a direct method,
0:00:53 or we can reduce the multi-class problem into multiple binary problems.
0:00:59 In such a case we need to combine the multiple binary answers
0:01:06 to determine the final answer to the multi-class problem.
0:01:10 And then
0:01:12 I will formulate this aggregation problem
0:01:15 as a geometric program, so that we can always find a global solution.
0:01:20 Okay, so after introducing the geometric programming formulation, I will introduce the softmax model
0:01:27 and L1-norm regularized maximum likelihood
0:01:29 estimation,
0:01:31 then show some of the
0:01:34 numerical experiments,
0:01:35 and conclude.
0:01:37 So in a multi-class problem, for example,
0:01:41 we need to assign class labels from 1 to K
0:01:46 to the data points.
0:01:48 For a direct method, suppose we have three classes:
0:01:52 the direct method tries to find a separating hyperplane which
0:01:57 discriminates these three classes.
0:01:59 In general the separating hyperplane is not linear.
0:02:04 On the other hand,
0:02:06 for a binary decomposition method, for example all-pairs,
0:02:10 we look at the
0:02:12 pairs (1,2), (2,3), and (1,3); for each binary
0:02:18 pair problem we can always find a binary classifier.
0:02:23 The remaining problem is how we aggregate the solutions of the binary problems
0:02:29 to determine the final answer to the multi-class problem.
0:02:33 Okay, so what are the advantages of binary decomposition over the direct method?
0:02:37 It is
0:02:37 easier and simpler to learn the classifiers,
0:02:41 and a lot of sophisticated classifiers are already available for
0:02:48 binary problems, for example support vector machines.
0:02:52 And also, this is better suited to parallel computation.
0:02:56 So, for example, these three are well-known examples of binary decompositions,
0:03:01 and we can also treat binary decomposition as a binary encoding problem.
0:03:06 Okay, so in other words,
0:03:08 how we aggregate the multiple binary answers
0:03:12 is a binary decoding problem.
0:03:16 For example, in one-versus-all, say we have three classes:
0:03:21 the first binary classifier discriminates the first class
0:03:25 from the remaining classes,
0:03:27 and the second binary classifier discriminates the
0:03:32 second class from the remaining classes, and so on.
0:03:34 In all-pairs, we need to look at the pairs (1,2),
0:03:39 (2,3), and (1,3),
0:03:41 and so on.
0:03:44 Error-correcting output coding can also be used:
0:03:48 we determine a set of code words
0:03:53 where the code words have maximal pairwise Hamming distance, so in other words
0:03:58 predictions with a tolerable number of errors can still be correctly classified.
0:04:04 So in other words, the binary decomposition leads to a
0:04:08 code matrix. In this case
0:04:11 we have three classes
0:04:12 and three binary classifiers,
0:04:14 and these are exemplary
0:04:17 code matrices
0:04:18 for one-versus-all, all-pairs, and
0:04:22 error-correcting output coding.
0:04:24 In other words, we need to train three different binary classifiers
0:04:30 following the columns in this code matrix.
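The code-matrix idea just described can be sketched in a few lines. The matrices below are illustrative examples (not the exact matrices on the slide), using the 0/1 labeling convention from the talk; rows are classes and columns are binary classifiers.

```python
import numpy as np

# Hypothetical 3-class code matrices: rows = classes, columns = binary
# classifiers; each entry is the binary label that classifier assigns
# to examples of that class.
one_vs_all = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1]])

# A small made-up error-correcting output code: with more columns than
# classes, rows can be chosen to have large pairwise Hamming distance.
ecoc = np.array([[0, 0, 1, 1, 0],
                 [0, 1, 0, 0, 1],
                 [1, 0, 0, 1, 1]])

def min_hamming(M):
    """Minimum pairwise Hamming distance between code words (rows)."""
    k = M.shape[0]
    return min(np.sum(M[i] != M[j])
               for i in range(k) for j in range(i + 1, k))

# A code with minimum distance d tolerates floor((d - 1) / 2) bit errors.
print(min_hamming(one_vs_all), min_hamming(ecoc))  # → 2 3
```

The ECOC rows here have minimum distance 3, so a single wrong binary prediction still decodes to the correct class.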
0:04:37 Okay, so for example, in the one-versus-all case, this is the code matrix produced
0:04:42 by one-versus-all,
0:04:44 and in that case
0:04:46 we need to
0:04:47 train three different binary classifiers.
0:04:50 For example, x_i
0:04:53 is a data point and its target label is 2.
0:04:57 In such a case,
0:05:00 for the target label 2,
0:05:03 then,
0:05:04 in terms of the binary classifiers, the correct labels should be 0,
0:05:09 1, and 0 for the first, second, and third binary classifiers.
0:05:15 I mean, these binary classifiers
0:05:19 follow the
0:05:20 binary labels
0:05:22 in this code matrix.
0:05:26 So we train the binary classifiers,
0:05:30 and then each binary
0:05:32 classifier produces a probability estimate.
0:05:35 For example, we can use support vector machines with a sigmoid model, so that the binary
0:05:41 classifiers produce scores
0:05:43 which are between zero and one.
0:05:47 So the problem here is:
0:05:50 okay, we trained three binary classifiers, and each binary classifier produces a score between
0:05:57 zero and one. To get an answer to the multi-class problem, we have to combine the predictions
0:06:05 determined by the three binary classifiers.
0:06:08 Okay, so how do we aggregate the binary classifiers?
0:06:12 Some of the existing heuristics are:
0:06:14 for the case of all-pairs, we do a simple majority voting,
0:06:19 and for the case of one-versus-all, maybe the maximum wins.
0:06:24 In the hard decoding case,
0:06:26 we
0:06:28 find the code word which best matches the collection of the predicted results computed by the binary classifiers.
0:06:34 Okay, so in the case of three classes we have three
0:06:39 code words, and we train three binary classifiers so that,
0:06:44 given some test data point, the three binary classifiers produce scores.
0:06:50 The collection of those three values constitutes a three-dimensional vector, and
0:06:55 then we search for the code word which best matches this three-dimensional prediction result,
0:07:02 in order to determine the final answer to the multi-class problem.
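The hard-decoding step just described, matching the score vector against the code words, can be sketched as follows. The code matrix and score vector are made-up illustrations, and squared distance stands in for whatever loss the decoder uses.

```python
import numpy as np

# One-vs-all code words for 3 classes (rows = classes).
code = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])

def decode(scores, code_matrix):
    """Return the class (0-based) whose code word best matches `scores`."""
    dists = np.sum((code_matrix - scores) ** 2, axis=1)  # squared distance
    return int(np.argmin(dists))

# Each binary classifier outputs a score in [0, 1], e.g. an SVM with a
# sigmoid; here the second classifier fires strongly.
scores = np.array([0.2, 0.9, 0.1])
print(decode(scores, code))  # → 1, i.e. the second class
```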
0:07:07 And also we can do probabilistic decoding; in other words, in this case we
0:07:13 really need to compute the class membership probabilities.
0:07:17 Once we can get the class membership probabilities, then we can do the prediction
0:07:22 for the class.
0:07:25 So,
0:07:26 one of the popular approaches in probabilistic decoding is based on the
0:07:31 Bradley-Terry model,
0:07:33 and let me just briefly explain what the Bradley-Terry model is
0:07:38 doing in this case.
0:07:40 So again, we have three classes,
0:07:44 and the Bradley-Terry model has been used to relate the binary predictions to the class membership probabilities.
0:07:50 For example,
0:07:51 okay, so we have three answers produced by the three binary classifiers,
0:07:57 and we have to relate those answers to the
0:08:00 class membership probabilities.
0:08:02 In such a case we treat the class membership probabilities as parameters.
0:08:08 So, in the case of all-pairs binary decomposition,
0:08:11 then,
0:08:13 okay, so capital P_1 star is
0:08:18 the
0:08:19 class membership probability
0:08:22 for the
0:08:23 data point x star.
0:08:35 Okay, so this is the class membership probability,
0:08:39 and this is the
0:08:41 all-pairs result.
0:08:43 So these blue-highlighted items are based on the Bradley-Terry
0:08:49 model,
0:08:50 and we introduce parameters pi_1, pi_2, and pi_3;
0:08:55 these relations come directly from the Bradley-Terry model,
0:08:59 and r_j star is the probability estimate
0:09:03 determined by the binary classifiers.
0:09:06 So we have these relations. Okay, so in order to compute the class membership probabilities, we treat
0:09:12 them as parameters, and we estimate these parameters by minimizing the
0:09:17 KL divergence between the
0:09:21 binary predictions
0:09:22 and
0:09:24 these pi's
0:09:26 coming from the model.
0:09:29 So in such a case, the problem with the techniques
0:09:36 which exploit this
0:09:38 is the following:
0:09:41 in other words, the number of parameters grows with the number of
0:09:47 training examples.
0:09:49 So if you have a huge number of training examples, then we have a huge number
0:09:54 of parameters which should be
0:09:57 optimized.
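As a concrete illustration of the Bradley-Terry style decoding just described, here is a minimal pairwise-coupling sketch. The function name `couple`, the fixed-point update (in the spirit of Hastie and Tibshirani's pairwise coupling, not the speaker's exact estimator), and the probability values are all my own illustrative choices.

```python
import numpy as np

def couple(r, iters=500):
    """Recover class probabilities p from pairwise estimates
    r[j, m] ~ p_j / (p_j + p_m) by a simple fixed-point iteration."""
    k = r.shape[0]
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        for j in range(k):
            # model's current pairwise probabilities involving class j
            mu = sum(p[j] / (p[j] + p[m]) for m in range(k) if m != j)
            # rescale p_j toward agreement with the observed r's
            p[j] *= sum(r[j, m] for m in range(k) if m != j) / mu
        p /= p.sum()
    return p

# Build consistent pairwise predictions from known class probabilities,
# then check that the iteration recovers them.
p_true = np.array([0.5, 0.3, 0.2])
r = p_true[:, None] / (p_true[:, None] + p_true[None, :])
print(np.round(couple(r), 3))  # approximately [0.5, 0.3, 0.2]
```

With real (noisy, inconsistent) binary predictions the iteration instead converges to the probabilities that best fit the pairwise answers, which is the estimation problem the talk refers to.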
0:10:00 So some of the existing techniques are based on the Bradley-Terry model, and
0:10:05 one of the recent techniques tries to find an optimal aggregation.
0:10:12 So why is optimal aggregation good? Because some of the predictions are biased and unreliable,
0:10:17 and they degrade
0:10:19 the aggregate,
0:10:20 I mean the overall performance. So if we can somehow
0:10:26 come up with weights which optimally aggregate the binary predictions,
0:10:31 then we can really avoid this problem.
0:10:35 So this technique for optimal aggregation is also based
0:10:40 on the
0:10:42 Bradley-Terry model,
0:10:43 but the problem here is that these
0:10:47 optimized probabilistic decoders
0:10:50 use the Bradley-Terry model, so
0:10:53 the parameters are the aggregation weights
0:10:57 and also the class membership probabilities, which grow with the number of training examples.
0:11:02 So simultaneous optimization is a problem, and also
0:11:07 this is not a convex optimization problem, so it doesn't guarantee
0:11:12 a global solution.
0:11:13 Okay, so what I would like to do here is to formulate this problem
0:11:20 as a convex optimization problem.
0:11:22 Okay, so
0:11:23 in the aggregation model we don't use the Bradley-Terry model;
0:11:28 instead we
0:11:29 use a softmax model, which we also used recently,
0:11:36 actually
0:11:38 just
0:11:39 last year.
0:11:42 So I will introduce the softmax model. In this case,
0:11:47 these are the M different binary classifiers,
0:11:52 and in our approach we introduce aggregation weights: each binary
0:11:59 classifier is really weighted by a different weight, w_1 through w_M,
0:12:05 and our goal is to optimize these coefficients
0:12:09 to produce the best
0:12:12 combination of the binary predictions.
0:12:20 Okay, so
0:12:22 the class membership probabilities follow the softmax function. In other words,
0:12:28 the probability of y_i = k, given the parameters (these are the aggregation weights)
0:12:34 and the data point x_i, follows the softmax function,
0:12:38 where the exponent is
0:12:40 the weighted sum of the discrepancies; these are the discrepancies between the
0:12:46 code word and the binary predictions.
0:12:50 So, for example, maybe we can use the cross-entropy or other functions.
0:12:54 So this is really a probabilistic extension of loss-based decoding,
0:13:00 and in this way
0:13:02 we have only the aggregation weights as parameters.
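The softmax aggregation model just described can be sketched as follows. The cross-entropy discrepancy, the code matrix, the weights, and the scores below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def class_probs(w, M, s, eps=1e-12):
    """Softmax over the weighted discrepancy between each code word
    (row of M) and the binary classifier scores s, with weights w."""
    # cross-entropy discrepancy between code bit M[k, n] and score s[n]
    d = -(M * np.log(s + eps) + (1 - M) * np.log(1 - s + eps))
    logits = -(d @ w)                  # negated weighted sum per class
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()

M = np.array([[1.0, 0.0, 0.0],   # one-vs-all code words, 3 classes
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
w = np.array([1.0, 1.0, 1.0])    # aggregation weights (to be learned)
s = np.array([0.2, 0.9, 0.1])    # binary scores for one test point
p = class_probs(w, M, s)
print(p.argmax())                # → 1: the second class wins
```

Learning then amounts to choosing w to maximize the likelihood of the training labels under this model, which is where the L1-regularized objective in the next part comes in.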
0:13:08 So, based on this model, then,
0:13:11 we write the likelihood of the training data. This is the likelihood, and
0:13:17 the details you can find in the paper.
0:13:21 And then we add the L1-norm regularization:
0:13:25 okay, so
0:13:26 the negative log-likelihood plus the L1-norm regularizer,
0:13:30 and then we come up with a log-sum-exponential function.
0:13:36 And then we figured out
0:13:39 that our optimization is actually to minimize the log-sum-exponential function,
0:13:44 subject to
0:13:46 some constraints,
0:13:49 and the log-sum-exponential is convex.
0:13:52 Okay, so we can really solve this problem as a convex optimization
0:13:55 problem.
0:13:56 What we figured out about two years ago is that we can
0:14:01 really fit this into geometric programming.
0:14:05 So this is just a short introduction to geometric programming.
0:14:10 This is the standard form of a geometric program:
0:14:14 we minimize a posynomial.
0:14:17 A posynomial
0:14:19 is like a polynomial, but
0:14:21 the difference is
0:14:22 that the exponents are allowed to be real-valued; in a plain polynomial the exponents
0:14:27 should only be integers.
0:14:29 So we minimize the posynomial f_0 under these inequality constraints and also equality constraints,
0:14:37 and the geometric program in this posynomial form can always be
0:14:42 converted to a geometric program in convex form;
0:14:45 this is the well-known convex form.
0:14:50 This is our optimization problem,
0:14:53 and we can really write this
0:14:57 optimization as a
0:14:59 geometric program in either convex or posynomial form.
0:15:03 So there are some efficient solvers available,
0:15:07 and we simply use those solvers
0:15:09 to find
0:15:12 the minimum of this
0:15:14 objective function.
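To make the convex form concrete, here is a toy sketch of my own (not the speaker's code): the posynomial p(x) = x + 1/x over x > 0 becomes the convex log-sum-exp function g(u) = log(e^u + e^(-u)) under the substitution x = e^u, and plain gradient descent then finds the global minimum. Real GP solvers perform this log transformation internally and handle constraints as well.

```python
import numpy as np

def minimize_posynomial(steps=2000, lr=0.1):
    """Minimize p(x) = x + 1/x for x > 0 via its convex log form:
    u = log x, g(u) = log(exp(u) + exp(-u))."""
    u = 2.0                                # start away from the optimum
    for _ in range(steps):
        eu, emu = np.exp(u), np.exp(-u)
        grad = (eu - emu) / (eu + emu)     # d/du log(e^u + e^-u) = tanh(u)
        u -= lr * grad
    return np.exp(u)                       # map back to x = exp(u)

x_opt = minimize_posynomial()
print(round(x_opt, 3), round(x_opt + 1 / x_opt, 3))  # → 1.0 2.0
```

Because g(u) is convex, any local minimum found this way is the global one, which is exactly the guarantee the talk is after for the aggregation weights.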
0:15:17 So in the experiments we compared with some of the existing work:
0:15:22 loss-based decoding, which is one of the hard decoding methods,
0:15:25 and a MAP method, which is one of the
0:15:28 optimal aggregation methods based on the Bradley-Terry model.
0:15:36 So these are the data sets,
0:15:41 from the UCI repository,
0:15:44 and the number of samples
0:15:47 is listed, and these are the numbers of attributes
0:15:50 and the numbers of classes.
0:15:52 And then we compared the classification performance
0:15:56 for the three different encoding techniques: all-pairs, one-versus-all, and
0:16:00 error-correcting output coding.
0:16:01 These are the results for loss-based decoding,
0:16:04 and then for the MAP method,
0:16:06 and these are the
0:16:09 results of our method.
0:16:11 So, I mean,
0:16:13 across the experiments, our method
0:16:17 performs better than these two existing methods.
0:16:21 And although the MAP method is also an optimal aggregation, it involves a really huge number of parameters,
0:16:27 so in terms of run time, I mean, our method is much faster than the
0:16:33 previous one,
0:16:36 because in our case the parameters are only the aggregation weights.
0:16:41 So in conclusion, we presented a convex optimization
0:16:47 technique for the aggregation of binary classifiers to
0:16:52 solve multi-class problems.
0:16:54 We chose geometric programming because our objective function can be easily fit into the
0:17:01 standard form of a geometric program,
0:17:03 and then we compared the classification performance with some of the existing methods to show,
0:17:09 I mean, that the method we proposed seems to work better than
0:17:14 some of the existing methods.
0:17:15 And this concludes my talk.
0:17:33 (Audience) The fact that you
0:17:35 have fewer parameters
0:17:37 for your method,
0:17:40 I presume that directly relates to
0:17:43 you being less likely to overfit.
0:17:46 Is that right? (Speaker) Yes, I think that's right, because, I mean, the previous one
0:17:49 has a huge number of parameters, so it easily overfits,
0:17:52 and then maybe,
0:17:54 I mean, that might be one of the reasons why our method performs better than
0:17:58 some of the others.
0:18:05 (Audience) So did you want to compare your results with
0:18:07 a direct multi-class method? For example, you could have used multinomial logistic regression as the combiner,
0:18:13 instead of
0:18:14 comparing only against the fusion of binary classifiers,
0:18:19 for solving the multi-class problem.
0:18:23 (Speaker) Okay, so yeah, maybe we can compare, but I don't think we really
0:18:28 compared to multinomial logistic regression,
0:18:33 because multinomial logistic regression is also convex.
0:18:37 So you might be right there, you're right.
0:18:39 Okay, so we really didn't do it, but
0:18:41 we will.
0:18:51 (Audience) The number of features, actually, in the data description, I mean the number of attributes,
0:18:56 is from some ten to six hundred;
0:19:00 did you do any feature extraction on the data?
0:19:07 (Speaker) No, no, actually we just used the whole
0:19:10 set of features, so this is just a matter of
0:19:12 classifier performance, not of feature extraction.
0:19:18 (Audience) All right, thank you.