So, what I would like to talk about today is how we combine multiple binary classifiers to solve the multi-class problem. The technique that we are going to use is geometric programming, so we can solve the problem using one of the convex optimization solvers.
I would like to start with multi-class learning. There are two different approaches to multi-class problems: either a direct method, or we can reduce the multi-class problem to multiple binary problems. In the latter case we need to combine the multiple binary answers to determine the final answer to the multi-class problem. I formulate this aggregation problem as a geometric program, so that we can always find a global solution. So I will introduce the geometric programming formulation, then I will introduce the softmax model and L1-norm regularized maximum likelihood estimation, show some of the numerical experiments, and then conclude.
So, in a multi-class problem, we need to assign a class label, from 1 to K, to each data point.

In a direct method, suppose we have three classes: the direct method tries to find separating hyperplanes which discriminate these three classes, although in general the separating boundary is not really linear. On the other hand, in a binary decomposition method, for example all-pairs, we look at the pairs (1,2), (2,3), and (1,3); for each binary pair problem we can always find a binary classifier, and the remaining problem is how we aggregate the solutions of the binary problems to determine the final answer to the multi-class problem.
Okay, so what are the advantages of binary decomposition over the direct methods? It is easier and simpler to learn the classifiers; a lot of sophisticated classifiers are already available for binary problems, for example support vector machines; and it is better suited to parallel computation.
These three are well-known examples of binary decompositions, and we can also treat binary decomposition as a binary encoding problem; in other words, how we aggregate the multiple binary answers is a binary decoding problem.

For example, in one-versus-all, with three classes, the first binary classifier discriminates the first class from the remaining classes, the second binary classifier discriminates the second class from the remaining classes, and so on. In all-pairs, we look at the pairs (1,2), (2,3), and (1,3). Error-correcting output coding can also be used: we determine code words with maximal Hamming distance, so that, in other words, solutions with a tolerable number of errors can still be correctly classified.
So, in other words, a binary decomposition leads to a code matrix. In this case we have three classes and three binary classifiers, and these are exemplary code matrices for one-versus-all, all-pairs, and error-correcting output coding.
In other words, we need to train three different binary classifiers, each following a column of this code matrix. For example, in the case of one-versus-all, this is the code matrix produced by one-versus-all, and in that case we need to train three different binary classifiers. If x_i is a data point whose target label is 2, then in terms of the binary classifiers the correct labels should be 0, 1, and 0 for the first, second, and third binary classifiers, respectively; each binary classifier follows the binary labels in this code matrix. We train the binary classifiers, and each binary classifier then produces a probability estimate; for example, we can use support vector machines with a sigmoid model, so that the binary classifiers produce scores between zero and one.
So the problem here is: we have trained three binary classifiers, and each binary classifier produces a score between zero and one; to answer the multi-class problem, we have to combine the answers determined by the three binary classifiers.
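To make the encoding concrete, here is a small sketch of the 0/1 code matrices for three classes, and of the binary targets for the example of a point with label 2 under one-versus-all. This is my own illustration, not code from the talk; in particular, marking "class not used by this classifier" with -1 in the all-pairs matrix is my convention.

    import numpy as np

    # Rows are classes 1..3, columns are binary classifiers.
    one_vs_all = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1]])

    # Columns for the pairs (1,2), (2,3), (1,3); -1 = class not used.
    all_pairs = np.array([[ 1, -1,  1],
                          [ 0,  1, -1],
                          [-1,  0,  0]])

    # A point with target label 2 gets the binary targets of row 2:
    target = 2
    print(one_vs_all[target - 1])   # -> [0 1 0], as in the example above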
Okay, so how do we aggregate the binary classifiers? Among the celebrated heuristics, for the case of all-pairs we can do majority voting, and for the case of one-versus-all, maybe the maximum wins.
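For instance, a minimal sketch of these two heuristics (again my own illustration; I assume each pairwise score is the estimated probability that the first class of the pair wins):

    import numpy as np

    # One-versus-all, "maximum wins": take the classifier with the largest score.
    def max_wins(scores):
        return int(np.argmax(scores)) + 1           # class labels 1..K

    # All-pairs majority voting: each pairwise score casts a vote for one class.
    def majority_vote(pair_scores, pairs, n_classes):
        votes = np.zeros(n_classes, dtype=int)
        for s, (j, k) in zip(pair_scores, pairs):
            votes[j - 1 if s >= 0.5 else k - 1] += 1
        return int(np.argmax(votes)) + 1

    print(max_wins(np.array([0.2, 0.9, 0.1])))                        # -> 2
    print(majority_vote([0.3, 0.8, 0.6], [(1, 2), (2, 3), (1, 3)], 3))  # -> 2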
In the hard decoding case, we find the code word which best matches the collection of predicted results computed by the binary classifiers. In the case of three classes we have three code words, and we train three binary classifiers so that, given a test data point, the three binary classifiers produce scores; the collection of these three values constitutes a three-dimensional vector, and we search for the code word that best matches this three-dimensional prediction result in order to determine the final answer to the multi-class problem.
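As a sketch of this hard-decoding step (my own illustration; squared distance is one possible matching criterion):

    import numpy as np

    def hard_decode(scores, code_matrix):
        # Distance from the prediction vector to each code word (row).
        dists = np.sum((code_matrix - scores) ** 2, axis=1)
        return int(np.argmin(dists)) + 1    # class labels 1..K

    one_vs_all = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1]])
    print(hard_decode(np.array([0.2, 0.9, 0.1]), one_vs_all))  # -> 2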
We can also do probabilistic decoding; in this case we really need to compute the class membership probabilities. Once we have the class membership probabilities, we can do the prediction for the class. One of the popular approaches to probabilistic decoding is based on the Bradley-Terry model, so let me briefly explain what the Bradley-Terry model is doing in this case.
Suppose again that we have three classes. The Bradley-Terry model has been used to relate the binary predictions to the class membership probabilities. So we have three answers produced by the three binary classifiers, and we have to relate those answers to the class membership probabilities; in such a case we treat the class membership probabilities as parameters. In the case of the all-pairs binary decomposition, capital P_1^* is the class membership probability of the first class for the data point x^*, and this is the all-pairs result. The blue-highlighted quantities are based on the Bradley-Terry model: we introduce parameters pi_1, pi_2, and pi_3, these relations come directly from the Bradley-Terry model, and p-bar_j^* is the probability estimate determined by the binary classifiers.
So, in order to compute the class membership probabilities, we treat them as parameters, and we estimate these parameters by minimizing the KL divergence between the binary predictions and the pi's coming from the model.
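In symbols, a sketch of this pairwise-coupling step in notation I am assuming here (the standard Bradley-Terry form; the slides' own symbols may differ): the binary estimate \bar{p}_{jk} for the pair (j,k) is matched to \mu_{jk}, and the pi's minimize the KL divergence between the two,

    \mu_{jk} = \frac{\pi_j}{\pi_j + \pi_k}, \qquad
    \min_{\pi}\ \sum_{j<k} \Big[ \bar{p}_{jk} \log\frac{\bar{p}_{jk}}{\mu_{jk}}
        + (1 - \bar{p}_{jk}) \log\frac{1 - \bar{p}_{jk}}{1 - \mu_{jk}} \Big]
    \quad \text{subject to}\ \sum_{j} \pi_j = 1,\ \pi_j \ge 0.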
So, in such a case, the problem with the techniques which exploit this model is that the parameters are these pi's; in other words, the number of parameters grows with the number of examples. So if we have a huge number of training examples, then we have a huge number of parameters which should be optimized.
Some of the existing techniques are based on the Bradley-Terry model, and one of the recent techniques tries to find an optimal aggregation. Why is optimal aggregation good? Because some of the binary predictions may be biased or unreliable, which degrades the aggregated, I mean the overall, performance; so if we can come up with weights which optimally aggregate the binary predictions, then we can alleviate this problem. These optimal-aggregation techniques have also been based on the Bradley-Terry model, but the problem is that, for the previously proposed optimal probabilistic decoders which use the Bradley-Terry model, the parameters are the aggregation weights and also the class membership probabilities, which grow with the number of examples. So a lot of parameters have to be optimized, and moreover this is not a convex optimization problem, so it does not guarantee a global solution.
Okay, so what I would like to do here is formulate this problem as a convex optimization problem. In our aggregation model we do not use the Bradley-Terry model; instead we use a softmax model, which was also recently used in work presented last year.
So let me introduce the softmax model. Here we have M different binary classifiers, and in our approach the parameters are the aggregation weights: each classifier is weighted by a different coefficient, w_1 through w_M, and our goal is to optimize these coefficients to produce the best combination of the binary predictions. The class membership probabilities follow a softmax function; in other words, the probability that y_i equals k, given the parameters (the aggregation weights) and the data point x_i, follows a softmax function whose exponent is a weighted sum of discrepancies between the code word and the binary predictions. For the discrepancy, we can use, for example, cross-entropy or other functions. So this is really a probabilistic extension of loss-based decoding, and in this way we have only the aggregation weights as parameters.
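Written out, a sketch of the model under notation I am assuming (code-matrix entries M_{km}, binary predictions f_m(x), discrepancy d, and nonnegative weights w_m):

    P(y = k \mid \mathbf{w}, \mathbf{x})
      = \frac{\exp\big( -\sum_{m=1}^{M} w_m\, d(M_{km}, f_m(\mathbf{x})) \big)}
             {\sum_{l=1}^{K} \exp\big( -\sum_{m=1}^{M} w_m\, d(M_{lm}, f_m(\mathbf{x})) \big)}.

With all weights equal to one, taking the most probable class reduces to loss-based decoding, which is the sense in which this is its probabilistic extension.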
Based on this model, we write the likelihood of the training data; this is the likelihood, and you can find the details in the paper. Then we add an L1-norm regularizer, so the objective is the negative log-likelihood plus the L1-norm regularization, and we end up with a log-sum-exp function. So our optimization is to minimize a log-sum-exp function subject to positivity constraints on the coefficients, and since the log-sum-exp of affine functions is convex, we can really solve this as a convex optimization problem.
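Under the same assumed notation, writing d_{im}(k) = d(M_{km}, f_m(x_i)) for training pairs (x_i, y_i), the objective is a sum of log-sum-exp terms plus a linear term; note that with nonnegative weights the L1 norm is simply the sum of the weights:

    \min_{\mathbf{w} \ge 0}\
      \sum_{i=1}^{n} \Big[ \sum_{m=1}^{M} w_m\, d_{im}(y_i)
        + \log \sum_{k=1}^{K} \exp\Big( -\sum_{m=1}^{M} w_m\, d_{im}(k) \Big) \Big]
      \ +\ \lambda \sum_{m=1}^{M} w_m.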
What we figured out about two years ago is that we can actually cast this into geometric programming.
This is just a short introduction to geometric programming. This is the standard form of a geometric program: we minimize a posynomial. A posynomial is like a polynomial, but the difference is that the exponents are allowed to be real-valued, whereas in a polynomial the exponents should be integers. So we minimize a posynomial subject to posynomial inequality constraints and also monomial equality constraints. A geometric program in this nonlinear form can always be converted into a geometric program in convex form, and in convex form it is solved just like a convex problem.
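For reference, the standard form being described is (all variables positive; the f_i are posynomials and the g_j are monomials, with positive coefficients c_t and real exponents a_{tl}):

    \begin{aligned}
      \text{minimize}   \quad & f_0(x) \\
      \text{subject to} \quad & f_i(x) \le 1, \quad i = 1, \dots, p, \\
                              & g_j(x) = 1,  \quad j = 1, \dots, q,
    \end{aligned}
    \qquad \text{where}\quad
    f(x) = \sum_{t} c_t\, x_1^{a_{t1}} \cdots x_n^{a_{tn}},\ c_t > 0.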
So this is our optimization problem, and we can write this optimization as a geometric program in either convex or posynomial form. There are efficient solvers available, so we simply use those solvers to find the minimum of this objective function.
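As a minimal sketch of this step, here is a toy standard-form GP solved with CVXPY's geometric-programming mode (an assumed toolchain for illustration; the talk does not name a particular solver):

    import cvxpy as cp

    x = cp.Variable(pos=True)   # GP variables must be positive
    y = cp.Variable(pos=True)

    # Monomial objective, posynomial inequality: a valid standard-form GP.
    problem = cp.Problem(cp.Minimize(1.0 / (x * y)), [x + y <= 2.0])
    problem.solve(gp=True)      # solved via the convex (log-log) form

    print(problem.value, x.value, y.value)  # about 1.0 at x = y = 1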
In the experiments we compared against some of the existing methods: loss-based decoding, which is one of the hard decodings, and the MAP method, which is one of the optimal aggregation methods based on the Bradley-Terry model. These are some of the datasets, from the UCI repository, with the number of samples, the number of attributes, and the number of classes. We compared the classification performance for three different encoding techniques, all-pairs, one-versus-all, and error-correcting output coding; these are the results for loss-based decoding, then for the MAP method, and these are the results of our method.
In almost all experiments our method performs better than these two existing methods. And although the MAP method also performs optimal aggregation, it involves a really huge number of parameters, so our method is much faster than the previous one: in our case the parameters are only the aggregation weights.
In conclusion, we presented a convex optimization technique for the aggregation of binary classifiers to solve multi-class problems. We chose geometric programming because our objective function can be easily fit into the standard form of a geometric program, and we compared the classification performance against some of the existing methods to show that the method we propose seems to work better than some of the existing methods. And this concludes my talk.
The fact that you have fewer parameters for your method, I presume that directly relates to you being less likely to overfit; is that right?

Right, yeah, because the previous one has a huge number of parameters, so it easily overfits; that might be one of the reasons why our method performs better than some of the existing ones.
So, did you want to compare your results with a true multi-class method? For example, you could have used multinomial logistic regression as the combiner, instead of comparing only against fusions of binary classifiers for solving the multi-class problem.

Ah, okay, so maybe we can compare, but I don't think we really compared against multinomial logistic regression. Multinomial logistic regression is also convex, so you might be right there. Okay, we really didn't do it, but we will.
The number of features is in the dataset descriptions; it is the number of attributes, which ranges from around ten to six hundred, depending on the data.

No, no, actually we just used the whole feature set, so this is just a matter of classifier performance, not feature extraction.

All right, thank you.