this work was supported in part by grants from the national science foundation

okay let me go back

okay so um

the topic of the talk is on classification

so

in model based classification, as you are all aware

you are given

a prior distribution on the classes and

the likelihood function of the observations given the class

and given these two things we can come up with the uh minimum probability of error decision rule

which is the well known maximum a posteriori probability rule

and simplifies to the maximum likelihood rule for equally likely classes

so that's model based

classification: if the model is fully specified then you can in principle come up with the optimum

decision
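just to fix ideas, here is a minimal sketch of that MAP rule for a toy two-class scalar Gaussian setup; the priors, means, and variances are made-up numbers for illustration, not values from the talk

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class setup: priors and class-conditional Gaussian likelihoods.
priors = np.array([0.7, 0.3])                        # P(class 0), P(class 1)
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def map_rule(x):
    # MAP: pick the class maximizing prior * likelihood (minimum probability of error).
    return int(np.argmax(priors * norm.pdf(x, means, stds)))

def ml_rule(x):
    # For equally likely classes the prior drops out and MAP reduces to ML.
    return int(np.argmax(norm.pdf(x, means, stds)))

print(map_rule(0.8), ml_rule(0.8))
```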

in contrast to this, the other paradigm is what is known as learning based classification

where everything is data driven

so you are only given examples of the two classes, say

and you want to come up with an algorithm which separates these classes

the challenge we would like to address in this scenario is

that

very often you encounter situations where you have high dimensional data, for example you have

surveillance video which can run into gigabytes of data

you have hyperspectral images you have you know synthetic aperture radar images and so forth

so you have high dimensional data on the one hand, and very few examples

compared to the dimensionality of the data on the other hand

now you might say, well, why not just use a generic

dimensionality reduction technique like

say PCA or LLE or isomap

well, on the one hand these are really generic methods

which are not

really devised for the classification problem, so they optimize other generic

measures such as preserving pairwise distances and so forth, on the one hand

and on the other hand

they haven't been designed with a view to the high dimensionality problem with very few examples

so our approach is to sort of exploit

what i shall call the

latent low dimensional sensing structure

now to make

this clear let's take a cartoon example

let's suppose that

you are given examples of each class, only two classes here

and a learning based classification algorithm such as an SVM or a kernel SVM

would simply take the data and learn a classification rule

and completely ignore

whether any sensing structure was present or not

in contrast to this there is

what i would call sensing aware classification, where let's say we know that these observations came from some

underlying sensing process

say for example a blurring operator

and we may have either full or partial information about the blurring operator

together with some noise

and the question is, can we exploit knowledge of the fact that these observations came from some underlying sensing structure

to improve the classification performance

now

what we are actually interested in studying here are the fundamental asymptotic limits of classification in

the

regime of high dimensions and very few samples

to make things more concrete, let's assume that the data dimension and possibly the

number of samples

go to infinity

while the samples per dimension

goes to zero

so this captures the notion that you have very few samples of very high dimensional data

but

in contrast to a number of studies in the literature which have focused on

an asymptotically easy situation, we want to fix the problem difficulty asymptotically, meaning that even if the

dimension increases to infinity

it's not going to be easy to classify

and

what this essentially means is that we are fixing the signal to noise ratio as the problem scales, and this

would be considered the mathematically more

challenging regime. the fundamental question we wish to answer is, what is the asymptotic classification performance

in this asymptotic regime

does the probability of error go to half, which means

it is no better than random guessing

or does it go to the optimum bayes

probability of error, which by the way is not equal to half

and not equal to zero, which is what i mean by fixing the problem difficulty, or does it go to something else

now

to make things more concrete i have to

introduce a model, so the rest of the talk is based on analysing this specific

model

because to understand the crux of these issues we decided to start with a simple model

the model is simple in that

the observations are made up of

a mean location which is lying in some sensing subspace, think

of H as the sensing subspace

and given the class, plus one or minus one, you are at the corresponding mean location

and then

you have a scalar gaussian perturbation along the H axis

followed by a vector gaussian noise perturbation which takes you outside this

subspace into the general P dimensional space

so that is the sensing model for which we analyse the performance

the class conditional means are different for each class, so we know that the means

lie along a subspace and that

there's a scalar perturbation component along the subspace, followed by a gaussian perturbation that takes you outside the

subspace

so that's the simple model. the goal here is that you are given a number of

P dimensional vectors, n P dimensional vectors from each class

and you have to come up with a classifier

and understand the asymptotic classification performance for different scenarios
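to make the model concrete, here is a minimal sketch of one way to simulate this kind of observation model; the dimension, sample sizes, noise levels, and the particular H are all illustrative assumptions, not the values used in the talk

```python
import numpy as np

rng = np.random.default_rng(0)
P, n = 500, 20                              # ambient dimension, samples per class (illustrative)
H = rng.standard_normal(P)
H /= np.linalg.norm(H)                      # unit-norm sensing direction (spans the 1-D subspace)
mu, sigma_s, sigma_n = 1.0, 0.5, 1.0        # class mean offset, scalar and ambient noise levels

def sample(label, size):
    # Mean location lies on the sensing subspace: +mu*H for one class, -mu*H for the other,
    # perturbed by a scalar Gaussian along H and a vector Gaussian in the full P-dim space.
    s = label * mu + sigma_s * rng.standard_normal(size)
    noise = sigma_n * rng.standard_normal((size, P))
    return s[:, None] * H[None, :] + noise

X_pos, X_neg = sample(+1, n), sample(-1, n)  # n observations from each class
```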

now

we chose the model to be simple to keep things tractable; we are after an analytical understanding here

even though it's fairly simple

note that it does make sense, for example in a sensor network scenario

where you could have, let's say, P sensors, P being the dimension of the observation in the previous slide

each component being a sensor in this case

observing some kind of a weak signal

and under one class you are observing H, which is the signal

plus noise

and under the other class you observe the negative of H plus noise

and the point of course is that

you are given n observations of the weak signal at the sensors

for each class, and

the question is

you have to come up with a classifier which decides

whether the next observation belongs to the positive class

or the negative class

now moving ahead, the kinds of classifiers that for the rest of the talk we would

consider are the following

we will look at the baseline classifier, which is the fully informed bayes classifier, meaning you know everything about

the model, so what is

the decision rule which implements that; we will review it

to get familiar with the notation there

then we want to look at what i call the unstructured

classifier, which means that i know that the observations are conditionally gaussian but i don't know the means

or any of the variances and covariances

so i would have to estimate everything

using maximum likelihood estimates

how does that perform

and then

finally we look at structure based

classification approaches

in the first case we look at the structure aware rule with exact knowledge of the sensing subspace

how do things behave in that case

in the second case we go for a structured maximum likelihood

which means that

we have to estimate the parameters

knowing that there is a latent low dimensional subspace, but i don't know the subspace

and finally

we see that

we have negative results in these cases, and that will motivate a structured sparsity

model on top of the baseline sensing model

so for the baseline you can write down the likelihood ratio test, turn the crank, and come up with

the optimal decision rule

it's going to be a linear discriminant rule and is based on these parameters, delta and mu; it's

not important to know exactly what the expressions are

delta stands for the difference in the class conditional means

mu is the average of the class conditional means, and sigma is the covariance of the observations, so

the decision rule depends on these parameters

and the misclassification probability can be obtained in closed form

it is in terms of the Q function, which is nothing but the tail probability of a standard normal

and in terms of these parameters, which are written up

here

the important thing is that here we fix the difficulty of the problem as the dimension scales

which means that i have to fix the argument of the Q function

that essentially amounts to fixing almost everything here, in particular the energy of the sensing vector

H

so we want to keep the norm of H fixed as things scale, and that's an important part of

this work
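the exact expressions are on the slide; as a rough sketch of the kind of rule and closed-form error being described, for equally likely Gaussian classes with a common covariance Sigma it would look something like this (a sketch under those assumptions, not the slide's exact formula)

```python
import numpy as np
from scipy.stats import norm

def bayes_linear_rule(x, delta, mu_bar, Sigma):
    # Linear discriminant: sign of delta' Sigma^{-1} (x - mu_bar), equal priors.
    w = np.linalg.solve(Sigma, delta)
    return 1 if w @ (x - mu_bar) >= 0 else -1

def bayes_error(delta, Sigma):
    # Closed form for equally likely Gaussian classes with common covariance:
    # Q(0.5 * sqrt(delta' Sigma^{-1} delta)), Q being the standard normal tail probability.
    return norm.sf(0.5 * np.sqrt(delta @ np.linalg.solve(Sigma, delta)))
```

in the model above, fixing the argument of the Q function as the dimension grows essentially pins down the energy of H, which is the point being made here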

so that's what the fully informed bayes rule looks like

now let's move on to the case where we know that it's conditionally gaussian but we don't know any

of these parameters so

this is what the bayes classifier looks like

but i don't know the model

so i have to estimate all these parameters from the data i am given

so one approach, a natural approach, is to use a plug-in estimator, which means estimate all these parameters

using the data given

and plug them into the optimum decision rule

then you get what is known as the empirical fisher rule

and you can analyse the probability of error; you can get a closed form expression and

look at what happens to that probability of error as

the samples per dimension go down to zero and the dimension increases to infinity

while you fix the difficulty level
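a minimal sketch of such a plug-in rule; the sample means and pooled sample covariance are one natural choice of ML-style estimates, and the small ridge term is my own addition since with n much smaller than P the sample covariance is singular

```python
import numpy as np

def empirical_fisher_fit(X_pos, X_neg, ridge=1e-6):
    # Unstructured plug-in estimates of the parameters of the Bayes rule.
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    delta_hat, mu_bar_hat = m_pos - m_neg, 0.5 * (m_pos + m_neg)
    X_c = np.vstack([X_pos - m_pos, X_neg - m_neg])
    Sigma_hat = X_c.T @ X_c / X_c.shape[0] + ridge * np.eye(X_c.shape[1])
    return delta_hat, mu_bar_hat, Sigma_hat

def empirical_fisher_predict(x, delta_hat, mu_bar_hat, Sigma_hat):
    # Same linear discriminant form as before, with estimated parameters plugged in.
    w = np.linalg.solve(Sigma_hat, delta_hat)
    return 1 if w @ (x - mu_bar_hat) >= 0 else -1
```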

now

turns out

not surprisingly, that the probability of error goes to half

which means it is no better than random guessing

now this is not surprising because

you're trying to estimate far more parameters than you have data for

so asymptotically you never catch up with the amount of information you need to estimate

so ignoring the structure and estimating all parameters is not a good idea, and we

now move on to

structured approaches

so as a reminder, this is the sensing model

and let's suppose, at one extreme, that we know the entire sensing structure, which means that i know the subspace

in which the observations lie

okay the underlying one dimensional subspace

so a natural thing to do in this case is, why not project everything down to the one dimensional subspace

treat it as a scalar learning based classification problem

estimate all the parameters

in that reduced one dimensional problem using the data you have, via maximum likelihood estimates, and

see what happens

okay

that leads you to what i call the projected empirical fisher rule

and that's the exact expression shown here; as i said, the exact expression is not very important

but the idea is that you

know the sensing subspace, project everything down to it, and reduce it to a one dimensional problem
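a minimal sketch of that projection idea, assuming the sensing direction H is known and unit norm (a one-dimensional subspace, so projecting is just an inner product); the scalar estimates below are the obvious choices and may differ in detail from the slide's expression

```python
import numpy as np

def projected_fisher_fit(X_pos, X_neg, H):
    # Project the P-dimensional observations onto the known sensing direction H.
    y_pos, y_neg = X_pos @ H, X_neg @ H
    # Scalar ML estimates in the reduced one-dimensional problem.
    return y_pos.mean(), y_neg.mean()

def projected_fisher_predict(x, H, m_pos, m_neg):
    # With equal priors and equal variances, decide by the nearer projected class mean.
    y = x @ H
    return 1 if abs(y - m_pos) <= abs(y - m_neg) else -1
```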

and the probability of error is shown here

asymptotically as the number of samples goes to infinity

it turns out, not surprisingly again, that

if you keep the difficulty level of the problem fixed and

take the number of samples to infinity

the probability of error goes to the bayes error probability, which is the optimum thing

you can do

now this is to be expected because

you know that there is a latent one dimensional structure in this

problem and you know it exactly, so when you project down to that problem

the actual ambient dimension of the data is irrelevant

so P doesn't appear in this equation at all

you have a scalar classification problem, and we know that when you do maximum likelihood

estimation with an increasing number of samples you can asymptotically get

optimal performance

when the data dimension is fixed

so in this case effectively the dimensionality reduction

takes into account the inherent low dimensional element of this problem

now

but the point is that in general we don't even know the sensing structure

okay, we don't know the sensing subspace, so one might want to estimate the sensing subspace from the data

you have

so what would be one approach to estimate the sensing subspace

well, what we know is that if we look at the difference in the class conditional means, delta

it's actually aligned with H

okay

so a natural thing to do is to use the maximum likelihood estimate

of delta, which was done before

and use that as a proxy for H

and then project things down to that delta hat

and then you're back to the previous situation

and again you get a projected empirical fisher rule, except that the direction onto which

you project things is not H, because it's not known to you, but the estimated H
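a minimal sketch of that naive variant: the normalized difference of sample means plays the role of the ML estimate of delta and is used as a proxy for H; projected_fisher_fit is the hypothetical helper from the sketch above

```python
import numpy as np

def estimate_direction(X_pos, X_neg):
    # ML-style estimate of delta (difference of sample means), normalized to unit norm.
    delta_hat = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    return delta_hat / np.linalg.norm(delta_hat)

# H_hat = estimate_direction(X_pos, X_neg)                  # proxy for the unknown H
# m_pos, m_neg = projected_fisher_fit(X_pos, X_neg, H_hat)  # then project as before
```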

what do you expect to get here

it turns out that if you analyse the probability of misclassification error

as the samples per dimension goes to zero

and the difficulty level is fixed

the probability of classification error goes to half

which means that even though you knew that there was an underlying one dimensional sensing structure and you knew that

delta was aligned with it

trying to estimate it using a maximum likelihood kind of estimate

didn't

doesn't do the job

okay, you are no better than random guessing asymptotically

this also suggests that you need additional sensing structure to exploit here

now, although this was not presented in our icassp paper, since then we have been able to show that this is

fundamental, meaning that

for this particular problem we are analysing here

without any additional structure on H

it's impossible for any learning algorithm

to do any better than random guessing asymptotically

so that was not presented at icassp and will be appearing elsewhere, but it's actually a fundamental lower

bound on the error probability, which actually goes to half

if you don't make any assumptions on the sensing structure

so that motivates the need for additional structure on H

and one of the structures we would like to study is of course a popular thing

these days

which is sparsity, okay

so

let's say that the signal, that subspace direction vector, is sparse, meaning that

the energy in H

is localised in a few components

compared to the number of dimensions

so in particular let's look at the tail energy of the vector H: the magnitudes

of the components of the vector

and there are P components

and let's pick a truncation point D and look at the energy beyond this truncation point, in the

tail of the

H vector here

as D and P go to infinity you want this tail energy to go to zero

so that is essentially a statement about the sparsity

a simple characterisation of the sparsity of the signal
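as a quick illustration of this tail-energy condition, here is a hypothetical check for a polynomially decaying H; the decay exponent and the truncation rule are illustrative choices only

```python
import numpy as np

def tail_energy(H, D):
    # Fraction of the energy of H lying beyond the truncation point D.
    return np.sum(H[D:] ** 2) / np.sum(H ** 2)

for P in (10**3, 10**4, 10**5):
    H = np.arange(1, P + 1, dtype=float) ** -1.0   # polynomial decay (illustrative)
    D = int(P ** 0.5)                              # truncation point growing slower than P
    print(P, D, tail_energy(H, D))                 # tail energy shrinks as D and P grow
```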

so in this case what is a natural thing to do? earlier we used the

maximum likelihood estimate delta hat of delta

and that didn't work

but now you know something more about H, namely that its tail energy goes to zero, so one interesting

thing you can try is, why not truncate that estimator

keep only some components of the estimator

and use that as a proxy for H instead

the idea is to keep the estimated delta hat entries only for components below some truncation parameter T

and then set to zero everything beyond that

so that leads to a truncation based estimate of the direction along H

and we analyse

how things behave
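a minimal sketch of that truncation idea; here the kept components are simply the first T coordinates, which matches the polynomial-decay picture where the energy sits in the leading entries, and estimate_direction and projected_fisher_fit are the hypothetical helpers from the earlier sketches

```python
import numpy as np

def truncated_direction(X_pos, X_neg, T):
    # Naive ML-style estimate of the direction, as before.
    delta_hat = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    # Keep the first T components, zero out everything beyond, renormalize.
    delta_trunc = np.zeros_like(delta_hat)
    delta_trunc[:T] = delta_hat[:T]
    return delta_trunc / np.linalg.norm(delta_trunc)

# H_hat = truncated_direction(X_pos, X_neg, T=20)   # T grows slower than the sample size
# m_pos, m_neg = projected_fisher_fit(X_pos, X_neg, H_hat)
```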

we can show that as the dimension, the number of samples, and the truncation point go to infinity

where the truncation point is chosen in such a way

that it grows slower than the number of samples

then

asymptotically we can estimate

the signal subspace perfectly, meaning that in the mean square sense the error between the truncated estimate and

the true direction goes to zero; we can asymptotically estimate the one dimensional subspace, and of

course if we can estimate the subspace perfectly asymptotically, it's unsurprising then that

as things scale and you keep the difficulty level fixed

the probability of classification error goes to the bayes probability of error

in other words, what this shows is that the sensing structure

plus additional sparsity assumptions or some additional structural information

can asymptotically give you the bayes probability of error

here is a little simulation that reinforces some of these insights

so here we have fixed the bayes probability of error, the difficulty, to be point one; it is fixed throughout

as the dimension scales

the energy of H is fixed to some value, and here are some parameters chosen in the model

and the number of samples grows slower than the ambient dimension, as shown here

um

the truncation point

is chosen to grow slower than the number of samples, as shown here

and here we assume a polynomial decay for H

and shown here, for example, the green line is H

or rather one particular realisation of H

and then

the red line is actually the noisy maximum likelihood estimate

delta hat

they are normalized to have unit energy

shown here

and the blue one is a truncated version of the red one

the truncation point here is at exactly twenty or so

on the right side is the probability of error on the vertical axis versus the ambient dimension

so as the dimension scales

the unstructured approach, where you don't know anything about the sensing structure and you try to estimate all the

parameters using maximum likelihood estimates

approaches an error probability of

one half

on the other hand, if you knew that there was a sensing subspace but you estimated it naively

using

simply delta hat

which is the maximum likelihood estimate

then also you go to half

but if you use the truncation based estimate

you approach the bayes optimal performance

so to conclude my talk

the

take-away points are that

there are possibly many problems where you encounter situations where the number of samples is far fewer than the

ambient data dimension

in addition, there often exists a latent low dimensional sensing structure which can be exploited

if you totally ignore the sensing structure and naively try to estimate everything using maximum likelihood estimates

you would probably be no better than random guessing in many scenarios

and even having general knowledge of the sensing structure, like knowing that it's a one dimensional signal H but i

don't know what it is

and trying to estimate it naively

may not do the job

but if you have the general sensing structure plus some additional structure on H

then you can often recover the optimum

asymptotically optimal classification performance

that brings me to the end of my talk