So, what I would like to talk about today is how we combine multiple binary classifiers to solve the multi-class problem. The technique that we are going to use is geometric programming, so we can solve the problem using one of the convex optimization solvers.
I would like to start with multi-class learning. There are two different approaches to multi-class problems: either a direct method, or we can reduce the multi-class problem to multiple binary problems. In the latter case we need to combine the multiple binary answers to determine the final answer to the multi-class problem. I formulate this aggregation problem as a geometric program, so that we can always find a global solution. So I will introduce the geometric programming formulation, then I will introduce the softmax model and L1-norm regularized maximum likelihood estimation, show some of the numerical experiments, and then conclude.
So, in a multi-class problem, we need to assign a class label, from 1 to K, to each data point.

In a direct method, suppose we have three classes: the direct method tries to find separating hyperplanes which discriminate these three classes, although in general the separating boundary is not really linear. On the other hand, in a binary decomposition method, for example all-pairs, we look at the pairs (1,2), (2,3), and (1,3); for each binary pair problem we can always find a binary classifier, and the remaining problem is how we aggregate the solutions of the binary problems to determine the final answer to the multi-class problem.
Okay, so what are the advantages of binary decomposition over the direct methods? It is easier and simpler to learn the classifiers; a lot of sophisticated classifiers are already available for binary problems, for example support vector machines; and it is better suited to parallel computation.
These three are well-known examples of binary decompositions, and we can also treat binary decomposition as a binary encoding problem; in other words, how we aggregate the multiple binary answers is a binary decoding problem.

For example, in one-versus-all, with three classes, the first binary classifier discriminates the first class from the remaining classes, the second binary classifier discriminates the second class from the remaining classes, and so on. In all-pairs, we look at the pairs (1,2), (2,3), and (1,3). Error-correcting output coding can also be used: we determine code words with maximal Hamming distance, so that, in other words, solutions with a tolerable number of errors can still be correctly classified.
So, in other words, a binary decomposition leads to a code matrix. In this case we have three classes and three binary classifiers, and these are exemplary code matrices for one-versus-all, all-pairs, and error-correcting output coding.
In other words, we need to train three different binary classifiers, each following a column of this code matrix. For example, in the case of one-versus-all, this is the code matrix produced by one-versus-all, and in that case we need to train three different binary classifiers. If x_i is a data point whose target label is 2, then in terms of the binary classifiers the correct labels should be 0, 1, and 0 for the first, second, and third binary classifiers, respectively; each binary classifier follows the binary labels in this code matrix. We train the binary classifiers, and each binary classifier then produces a probability estimate; for example, we can use support vector machines with a sigmoid model, so that the binary classifiers produce scores between zero and one.
So the problem here is: we have trained three binary classifiers, and each binary classifier produces a score between zero and one; to answer the multi-class problem, we have to combine the answers determined by the three binary classifiers.
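To make the encoding concrete, here is a small sketch of the 0/1 code matrices for three classes, and of the binary targets for the example of a point with label 2 under one-versus-all. This is my own illustration, not code from the talk; in particular, marking "class not used by this classifier" with -1 in the all-pairs matrix is my convention.

    import numpy as np

    # Rows are classes 1..3, columns are binary classifiers.
    one_vs_all = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1]])

    # Columns for the pairs (1,2), (2,3), (1,3); -1 = class not used.
    all_pairs = np.array([[ 1, -1,  1],
                          [ 0,  1, -1],
                          [-1,  0,  0]])

    # A point with target label 2 gets the binary targets of row 2:
    target = 2
    print(one_vs_all[target - 1])   # -> [0 1 0], as in the example above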
Okay, so how do we aggregate the binary classifiers? Among the celebrated heuristics, for the case of all-pairs we can do majority voting, and for the case of one-versus-all, maybe the maximum wins.
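For instance, a minimal sketch of these two heuristics (again my own illustration; I assume each pairwise score is the estimated probability that the first class of the pair wins):

    import numpy as np

    # One-versus-all, "maximum wins": take the classifier with the largest score.
    def max_wins(scores):
        return int(np.argmax(scores)) + 1           # class labels 1..K

    # All-pairs majority voting: each pairwise score casts a vote for one class.
    def majority_vote(pair_scores, pairs, n_classes):
        votes = np.zeros(n_classes, dtype=int)
        for s, (j, k) in zip(pair_scores, pairs):
            votes[j - 1 if s >= 0.5 else k - 1] += 1
        return int(np.argmax(votes)) + 1

    print(max_wins(np.array([0.2, 0.9, 0.1])))                        # -> 2
    print(majority_vote([0.3, 0.8, 0.6], [(1, 2), (2, 3), (1, 3)], 3))  # -> 2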
In the hard decoding case, we find the code word which best matches the collection of predicted results computed by the binary classifiers. In the case of three classes we have three code words, and we train three binary classifiers so that, given a test data point, the three binary classifiers produce scores; the collection of these three values constitutes a three-dimensional vector, and we search for the code word that best matches this three-dimensional prediction result in order to determine the final answer to the multi-class problem.
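As a sketch of this hard-decoding step (my own illustration; squared distance is one possible matching criterion):

    import numpy as np

    def hard_decode(scores, code_matrix):
        # Distance from the prediction vector to each code word (row).
        dists = np.sum((code_matrix - scores) ** 2, axis=1)
        return int(np.argmin(dists)) + 1    # class labels 1..K

    one_vs_all = np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1]])
    print(hard_decode(np.array([0.2, 0.9, 0.1]), one_vs_all))  # -> 2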
We can also do probabilistic decoding; in this case we really need to compute the class membership probabilities. Once we have the class membership probabilities, we can do the prediction for the class. One of the popular approaches to probabilistic decoding is based on the Bradley-Terry model, so let me briefly explain what the Bradley-Terry model is doing in this case.
Suppose again that we have three classes. The Bradley-Terry model has been used to relate the binary predictions to the class membership probabilities. So we have three answers produced by the three binary classifiers, and we have to relate those answers to the class membership probabilities; in such a case we treat the class membership probabilities as parameters. In the case of the all-pairs binary decomposition, capital P_1^* is the class membership probability of the first class for the data point x^*, and this is the all-pairs result. The blue-highlighted quantities are based on the Bradley-Terry model: we introduce parameters pi_1, pi_2, and pi_3, these relations come directly from the Bradley-Terry model, and p-bar_j^* is the probability estimate determined by the binary classifiers.
So, in order to compute the class membership probabilities, we treat them as parameters, and we estimate these parameters by minimizing the KL divergence between the binary predictions and the pi's coming from the model.
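In symbols, a sketch of this pairwise-coupling step in notation I am assuming here (the standard Bradley-Terry form; the slides' own symbols may differ): the binary estimate \bar{p}_{jk} for the pair (j,k) is matched to \mu_{jk}, and the pi's minimize the KL divergence between the two,

    \mu_{jk} = \frac{\pi_j}{\pi_j + \pi_k}, \qquad
    \min_{\pi}\ \sum_{j<k} \Big[ \bar{p}_{jk} \log\frac{\bar{p}_{jk}}{\mu_{jk}}
        + (1 - \bar{p}_{jk}) \log\frac{1 - \bar{p}_{jk}}{1 - \mu_{jk}} \Big]
    \quad \text{subject to}\ \sum_{j} \pi_j = 1,\ \pi_j \ge 0.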
So, in such a case, the problem with the techniques which exploit this model is that the parameters are these pi's; in other words, the number of parameters grows with the number of examples. So if we have a huge number of training examples, then we have a huge number of parameters which should be optimized.
Some of the existing techniques are based on the Bradley-Terry model, and one of the recent techniques tries to find an optimal aggregation. Why is optimal aggregation good? Because some of the binary predictions may be biased or unreliable, which degrades the aggregated, I mean the overall, performance; so if we can come up with weights which optimally aggregate the binary predictions, then we can alleviate this problem. These optimal-aggregation techniques have also been based on the Bradley-Terry model, but the problem is that, for the previously proposed optimal probabilistic decoders which use the Bradley-Terry model, the parameters are the aggregation weights and also the class membership probabilities, which grow with the number of examples. So a lot of parameters have to be optimized, and moreover this is not a convex optimization problem, so it does not guarantee a global solution.
Okay, so what I would like to do here is formulate this problem as a convex optimization problem. In our aggregation model we do not use the Bradley-Terry model; instead we use a softmax model, which was also recently used in work presented last year.
So let me introduce the softmax model. Here we have M different binary classifiers, and in our approach the parameters are the aggregation weights: each classifier is weighted by a different coefficient, w_1 through w_M, and our goal is to optimize these coefficients to produce the best combination of the binary predictions. The class membership probabilities follow a softmax function; in other words, the probability that y_i equals k, given the parameters (the aggregation weights) and the data point x_i, follows a softmax function whose exponent is a weighted sum of discrepancies between the code word and the binary predictions. For the discrepancy, we can use, for example, cross-entropy or other functions. So this is really a probabilistic extension of loss-based decoding, and in this way we have only the aggregation weights as parameters.
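Written out, a sketch of the model under notation I am assuming (code-matrix entries M_{km}, binary predictions f_m(x), discrepancy d, and nonnegative weights w_m):

    P(y = k \mid \mathbf{w}, \mathbf{x})
      = \frac{\exp\big( -\sum_{m=1}^{M} w_m\, d(M_{km}, f_m(\mathbf{x})) \big)}
             {\sum_{l=1}^{K} \exp\big( -\sum_{m=1}^{M} w_m\, d(M_{lm}, f_m(\mathbf{x})) \big)}.

With all weights equal to one, taking the most probable class reduces to loss-based decoding, which is the sense in which this is its probabilistic extension.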
Based on this model, we write the likelihood of the training data; this is the likelihood, and you can find the details in the paper. Then we add an L1-norm regularizer, so the objective is the negative log-likelihood plus the L1-norm regularization, and we end up with a log-sum-exp function. So our optimization is to minimize a log-sum-exp function subject to positivity constraints on the coefficients, and since the log-sum-exp of affine functions is convex, we can really solve this as a convex optimization problem.
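Under the same assumed notation, writing d_{im}(k) = d(M_{km}, f_m(x_i)) for training pairs (x_i, y_i), the objective is a sum of log-sum-exp terms plus a linear term; note that with nonnegative weights the L1 norm is simply the sum of the weights:

    \min_{\mathbf{w} \ge 0}\
      \sum_{i=1}^{n} \Big[ \sum_{m=1}^{M} w_m\, d_{im}(y_i)
        + \log \sum_{k=1}^{K} \exp\Big( -\sum_{m=1}^{M} w_m\, d_{im}(k) \Big) \Big]
      \ +\ \lambda \sum_{m=1}^{M} w_m.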
What we figured out about two years ago is that we can actually cast this into geometric programming.
This is just a short introduction to geometric programming. This is the standard form of a geometric program: we minimize a posynomial. A posynomial is like a polynomial, but the difference is that the exponents are allowed to be real-valued, whereas in a polynomial the exponents should be integers. So we minimize a posynomial subject to posynomial inequality constraints and also monomial equality constraints. A geometric program in this nonlinear form can always be converted into a geometric program in convex form, and in convex form it is solved just like a convex problem.
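For reference, the standard form being described is (all variables positive; the f_i are posynomials and the g_j are monomials, with positive coefficients c_t and real exponents a_{tl}):

    \begin{aligned}
      \text{minimize}   \quad & f_0(x) \\
      \text{subject to} \quad & f_i(x) \le 1, \quad i = 1, \dots, p, \\
                              & g_j(x) = 1,  \quad j = 1, \dots, q,
    \end{aligned}
    \qquad \text{where}\quad
    f(x) = \sum_{t} c_t\, x_1^{a_{t1}} \cdots x_n^{a_{tn}},\ c_t > 0.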
So this is our optimization problem, and we can write this optimization as a geometric program in either convex or posynomial form. There are efficient solvers available, so we simply use those solvers to find the minimum of this objective function.
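As a minimal sketch of this step, here is a toy standard-form GP solved with CVXPY's geometric-programming mode (an assumed toolchain for illustration; the talk does not name a particular solver):

    import cvxpy as cp

    x = cp.Variable(pos=True)   # GP variables must be positive
    y = cp.Variable(pos=True)

    # Monomial objective, posynomial inequality: a valid standard-form GP.
    problem = cp.Problem(cp.Minimize(1.0 / (x * y)), [x + y <= 2.0])
    problem.solve(gp=True)      # solved via the convex (log-log) form

    print(problem.value, x.value, y.value)  # about 1.0 at x = y = 1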
In the experiments we compared against some of the existing methods: loss-based decoding, which is one of the hard decodings, and the MAP method, which is one of the optimal aggregation methods based on the Bradley-Terry model. These are some of the datasets, from the UCI repository, with the number of samples, the number of attributes, and the number of classes. We compared the classification performance for three different encoding techniques, all-pairs, one-versus-all, and error-correcting output coding; these are the results for loss-based decoding, then for the MAP method, and these are the results of our method.
In almost all experiments our method performs better than these two existing methods. And although the MAP method also performs optimal aggregation, it involves a really huge number of parameters, so our method is much faster than the previous one: in our case the parameters are only the aggregation weights.
In conclusion, we presented a convex optimization technique for the aggregation of binary classifiers to solve multi-class problems. We chose geometric programming because our objective function can be easily fit into the standard form of a geometric program, and we compared the classification performance against some of the existing methods to show that the method we propose seems to work better than some of the existing methods. And this concludes my talk.
The fact that you have fewer parameters for your method, I presume that directly relates to you being less likely to overfit; is that right?

Right, yeah, because the previous one has a huge number of parameters, so it easily overfits; that might be one of the reasons why our method performs better than some of the existing ones.
So, did you want to compare your results with a true multi-class method? For example, you could have used multinomial logistic regression as the combiner, instead of comparing only against fusions of binary classifiers for solving the multi-class problem.

Ah, okay, so maybe we can compare, but I don't think we really compared against multinomial logistic regression. Multinomial logistic regression is also convex, so you might be right there. Okay, we really didn't do it, but we will.
The number of features is in the dataset descriptions; it is the number of attributes, which ranges from around ten to six hundred, depending on the data.

No, no, actually we just used the whole feature set, so this is just a matter of classifier performance, not feature extraction.

All right, thank you.