This work was supported in part by U.S. government grants, including from the National Science Foundation.

So the topic of the talk is classification. In model-based classification, as you all know, we are given a prior distribution on the classes and the likelihood function of the observations given each class, and from these two ingredients we can derive the minimum-probability-of-error decision rule, namely the maximum a posteriori rule, which simplifies to the maximum-likelihood rule when the classes are equally likely. So that's model-based classification: once the model is fully specified, you can in principle write down the optimum decision rule.

In contrast, in what is known as learning-based classification, everything is data-driven: you are only given examples from the two classes, say, and you want to come up with an algorithm that separates the classes. The challenge in this scenario is that you very often encounter situations with high-dimensional data. For example, you have surveillance video, you have hyperspectral images, you have synthetic aperture radar images, and so forth. So you have high-dimensional data on the one hand, and very few examples compared to the dimensionality of the data on the other.

Now you might say, why not just use a generic dimensionality-reduction technique such as PCA, LLE, or Isomap? Well, on the one hand these are really generic methods which are not optimized for the classification problem; they optimize other generic criteria, such as preserving pairwise distances and so forth. On the other hand, they have not been designed with a view to the regime of high dimensionality and few examples. So our approach is to exploit what I shall call the latent low-dimensional sensing structure.

To make this clear, let's take a cartoon example. Suppose you are given examples from each class, only two classes here, and a learning-based classification algorithm, such as an SVM or a kernel SVM, simply takes the data and learns a classification rule, completely ignoring whether any sensing structure was present or not. In contrast to this, what I would call sensing-aware classification is the setting where we know that these observations came from some sensing process, say for example a blurring operator, about which we may have either full or partial information, followed by some noise. The question is: can we exploit the knowledge that the observations came from some underlying sensing structure to improve the classification performance?

This naturally leads to the study of the fundamental asymptotic limits of classification in the regime of high-dimensional observations and very few samples. To make things more concrete, we assume that the data dimension, and possibly the number of samples, goes to infinity while the number of samples per dimension goes to zero, so this captures the regime of very few samples of very high-dimensional data. In contrast to a number of studies in the literature which have focused on asymptotically easy situations, we want to fix the problem difficulty, meaning that even as the dimension increases to infinity, the problem is not going to become easy to classify; what this essentially means is fixing the signal-to-noise ratio as the problem scales. We consider this the mathematically more fundamental setting, and the question we ask is about the asymptotic classification performance: does the probability of error go to one half, which means the classifier is no better than random guessing, or does it go to the optimum Bayes probability of error, which, because we have fixed the problem difficulty, is equal neither to one half nor to zero, or to something in between?
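In symbols, with $n$ training samples per class and ambient dimension $p$ (this is my own shorthand for what was just said, not notation taken from the slides), the regime is

$$
n \to \infty, \qquad p \to \infty, \qquad \frac{n}{p} \to 0,
$$

and, with the difficulty held fixed, the question is whether the error probability of a given learning rule satisfies

$$
P_e \to \tfrac{1}{2} \ \ \text{(no better than random guessing)} \qquad \text{or} \qquad P_e \to P_e^{\mathrm{Bayes}} \in \bigl(0, \tfrac{1}{2}\bigr),
$$

or converges to something in between.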
Now, to make things more concrete, I have to introduce a model; the rest of the talk is based on this specific model, because to understand the rudiments of these issues we decided on a very simple model. The model is simple in that each observation is made up of a mean location lying in a one-dimensional sensing subspace — think of the vector H as spanning the sensing subspace, with the two classes sitting at plus and minus H — plus a scalar Gaussian perturbation along the H axis, followed by a vector Gaussian noise perturbation which takes you outside this subspace into the ambient P-dimensional space. So that is the sensing model: the class-conditional means are different, we know that the means are aligned with the subspace, and there is a scalar perturbation component along the subspace followed by the Gaussian perturbation that takes you outside the subspace. The goal is this: you are given n P-dimensional vectors from each class, and you want to come up with a classifier and understand the asymptotic classification performance in different scenarios.

Now, we chose the model to be simple to keep things tractable — we are after a theoretical understanding — but even though it is fairly simple, it is not a model that makes no sense. For example, take a sensor-network application: P, the dimension of the observation in the previous slide, is the number of sensors, each component being a sensor. The sensors observe some kind of weak signal: for one class you are observing H, which is the signal, plus noise, and for the other class you observe the negative of H plus noise. The problem, of course, is that you are given n observations of the weak signal across the sensors from each class, and the question is how to come up with a classifier that decides whether the next observation belongs to the positive class or the negative class.
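Writing out my reading of that spoken description (the exact parameterization — in particular whether the along-subspace perturbation multiplies H, and the variance symbols — is my own assumption, not taken from the slides), the observation for a class label $c \in \{+1, -1\}$ looks something like

$$
x \;=\; c\,H \;+\; u\,H \;+\; w, \qquad u \sim \mathcal{N}(0, \sigma_u^2), \quad w \sim \mathcal{N}(0, \sigma^2 I_P),
$$

so the class-conditional means $\pm H$ lie on the one-dimensional sensing subspace spanned by $H$, the scalar Gaussian term $uH$ perturbs the observation along that subspace, and the isotropic noise $w$ takes it off the subspace into the ambient $P$-dimensional space. In the sensor-network reading, coordinate $i$ of $x$ is simply what sensor $i$ records.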
Moving ahead, these are the kinds of classifiers we will consider for the rest of the talk. First, we look at the baseline classifier, which is the fully informed one: we know everything about the model, so what is the rule that implements the optimum? This mainly gets us familiar with the notation. Then we look at what I call the unstructured classifier: I know that the observations are conditionally Gaussian, but I do not know the means or the variances, so I have to estimate everything using maximum-likelihood estimates — how does that perform? Then we look at structure-based classification. In the first case, we know the exact sensing subspace, and we ask how things behave. In the second case, a structured maximum-likelihood approach, I know that the means lie in a low-dimensional subspace, but I do not know the subspace itself. Finally, you will see that we have negative results in that case, and that motivates a structured sparsity model.

For the baseline, you can write down the likelihood ratio test, turn the crank, and come up with the optimum decision rule. It is a linear discriminant rule based on the parameters delta, mu, and Sigma; it is not important to know the exact expressions, but delta stands for the difference of the class-conditional means, mu is the average of the class-conditional means, and Sigma is the covariance of the observations. So the decision rule depends on these parameters, and the minimum probability of error can be written in closed form: it is given by the Q function — which is nothing but the tail probability of a standard normal — evaluated at a quantity depending on these same parameters. The important point is that we want to fix the difficulty of the problem as the dimension scales, which means fixing the argument of the Q function, and that essentially amounts to fixing almost everything here, in particular the energy of the sensing vector H. So we keep the norm of H fixed as things scale, and that is an important part of this work. That is what the fully informed baseline looks like.

Next, the unstructured case: we know the observations are conditionally Gaussian, and the Bayes classifier has the form above, but I do not know the model, so I have to estimate all of these parameters from the data I am given. One natural approach is a plug-in estimator: estimate all the parameters from the given data and plug them into the optimum decision rule. What you get is known as the empirical Fisher rule, and you can analyze its probability of error — you can get a closed-form expression and look at what happens as the number of samples per dimension goes to zero, the dimension goes to infinity, and the difficulty level is fixed. It turns out, not surprisingly, that the probability of error goes to one half, which means no better than random guessing. This is not surprising, because you are trying to estimate far more parameters than you have data for, so asymptotically you never catch up with the amount of information needed to estimate them. So ignoring the structure and estimating all the parameters is not a good idea, and that leads us to structured approaches.
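To make the plug-in idea concrete, here is a minimal numerical sketch — my own illustration, with my own function names and parameter values, not code from the paper. It draws data from a two-class Gaussian model of the kind described above (class means at plus and minus h with isotropic noise; I drop the along-subspace perturbation to keep the sketch short), forms the sample means and pooled sample covariance, and plugs them into the standard Fisher linear discriminant. With n much smaller than p, the test error drifts toward one half, which is the behavior the negative result describes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(mean, sigma, n, rng):
    """Draw n observations: class mean plus isotropic Gaussian noise."""
    return mean + sigma * rng.standard_normal((n, mean.shape[0]))

def empirical_fisher_rule(Xpos, Xneg):
    """Plug-in (empirical) Fisher discriminant: estimate delta, mu, Sigma
    from the training data and substitute them into the optimal linear rule."""
    delta_hat = Xpos.mean(axis=0) - Xneg.mean(axis=0)
    mu_hat = 0.5 * (Xpos.mean(axis=0) + Xneg.mean(axis=0))
    centered = np.vstack([Xpos - Xpos.mean(axis=0), Xneg - Xneg.mean(axis=0)])
    Sigma_hat = centered.T @ centered / centered.shape[0]
    w = np.linalg.pinv(Sigma_hat) @ delta_hat      # pseudo-inverse: Sigma_hat is singular when 2n < p
    return lambda X: np.sign((X - mu_hat) @ w)     # decide +1 / -1

p, n, sigma = 500, 20, 1.0                         # many dimensions, few samples (illustrative values)
h = np.zeros(p); h[:10] = 1.0 / np.sqrt(10)        # unit-energy sensing vector (fixed norm = fixed difficulty)

Xpos, Xneg = sample_class(+h, sigma, n, rng), sample_class(-h, sigma, n, rng)
rule = empirical_fisher_rule(Xpos, Xneg)

Tpos, Tneg = sample_class(+h, sigma, 2000, rng), sample_class(-h, sigma, 2000, rng)
test_error = 0.5 * (np.mean(rule(Tpos) != +1) + np.mean(rule(Tneg) != -1))
print(f"plug-in empirical Fisher rule, n={n}, p={p}: error ~ {test_error:.3f} (Bayes for this sketch ~ 0.16)")
```

The pseudo-inverse is only there because the pooled sample covariance is singular whenever 2n < p; how exactly the covariance is regularized does not change the qualitative picture in this regime.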
Again, this is the sensing model, and let us suppose, at one extreme, that we know the sensing structure: I know the subspace in which the means lie, the underlying one-dimensional subspace. The natural thing to do in this case is to project everything down to that one-dimensional subspace. You then have a scalar learning-based classification problem: estimate all the parameters of that reduced one-dimensional problem from the data you have, using maximum-likelihood estimates, and see what happens. That leads to what I call the projected empirical Fisher rule; I will not dwell on the exact expression, it is not very important, but the idea is that you know the sensing subspace, you project everything down to it, and you have reduced things to a one-dimensional problem. The probability of error, analyzed asymptotically as the number of samples goes to infinity with the difficulty level of the problem kept fixed, turns out — not surprisingly — to converge to the Bayes probability of error, which is the optimum you can achieve. This is to be expected: you know there is a latent one-dimensional structure in this problem, and you know it exactly, so once you project down to it, the ambient dimension of the data becomes irrelevant — P does not appear in the expression at all. You have a scalar classification problem, and we know that maximum-likelihood estimation with a growing number of samples and a fixed data dimension is asymptotically optimal. So in this case the dimensionality reduction effectively takes full advantage of the low-dimensional structure inherent in the problem.

The trouble, of course, is that in general we do not know the sensing structure — we do not know the sensing subspace — so one might want to estimate the sensing subspace from the data. What would be a natural approach? We know that the difference of the class-conditional means, delta, is aligned with H. So a natural thing to do is to take the maximum-likelihood estimate of delta, which we computed before, use it as a proxy for H, project everything down onto that direction, and then we are back to the previous situation. Again you get a projected classifier, except that the direction you are projecting onto is not H itself, because it is not known to you, but the estimated direction. What would you expect to get? It turns out that if you analyze the probability of misclassification as the number of samples per dimension goes to zero with the difficulty level fixed, the probability of classification error goes to one half. So even though you knew there was an underlying one-dimensional sensing structure, and you knew that delta was aligned with it, trying to estimate it with this maximum-likelihood type of estimate does not do the job: you are no better than random guessing asymptotically. This also suggests that you need additional sensing structure to exploit.

Although this was not presented in our ICASSP paper, since then we have been able to show that this is fundamental, meaning that for the particular problem we are analyzing here, without any additional structure on H it is impossible for any learning algorithm to do better than random guessing. That result is not presented at ICASSP and will be appearing elsewhere, but it is a fundamental lower bound on the misclassification probability, which does go to one half if you make no assumptions about the sensing structure.
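For contrast, here is the same kind of sketch for the two projection rules just discussed — the oracle rule that projects onto the true direction H, and the naive structured rule that projects onto the normalized difference of sample means, delta-hat. Again, the function names and parameter values are mine and purely illustrative; in the regime where n/p is small, the estimated direction is dominated by noise, so the naive rule sits far from the oracle and drifts toward an error of one half.

```python
import numpy as np

rng = np.random.default_rng(1)

def projected_rule(Xpos, Xneg, direction):
    """Project the data onto a single direction and apply a scalar Fisher rule there."""
    u = direction / np.linalg.norm(direction)
    zp, zn = Xpos @ u, Xneg @ u                    # one-dimensional projected training data
    threshold = 0.5 * (zp.mean() + zn.mean())
    orient = np.sign(zp.mean() - zn.mean())        # which side of the threshold is the + class
    return lambda X: np.sign(orient * (X @ u - threshold))

p, n, sigma = 2000, 40, 1.0                        # illustrative: n much smaller than p
h = np.zeros(p); h[:10] = 1.0 / np.sqrt(10)        # unit-energy sensing vector

Xpos = +h + sigma * rng.standard_normal((n, p))
Xneg = -h + sigma * rng.standard_normal((n, p))

delta_hat = Xpos.mean(axis=0) - Xneg.mean(axis=0)  # noisy estimate of the direction (proxy for H)
naive = projected_rule(Xpos, Xneg, delta_hat)      # project onto the *estimated* direction
oracle = projected_rule(Xpos, Xneg, h)             # project onto the *known* sensing subspace

Tpos = +h + sigma * rng.standard_normal((5000, p))
Tneg = -h + sigma * rng.standard_normal((5000, p))
for name, rule in [("naive delta-hat projection", naive), ("oracle subspace projection", oracle)]:
    err = 0.5 * (np.mean(rule(Tpos) != +1) + np.mean(rule(Tneg) != -1))
    print(f"{name:28s} error = {err:.3f}")
```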
So this motivates the need for additional structure on H, and one structure we would like to study is, of course, the popular one these days: sparsity. Let us say that the sensing direction is sparse, meaning that the energy of H lies mostly in a few components compared to the number of dimensions. In particular, consider the tail energy of the vector H — the vector has P components — pick a truncation point d, and look at the energy of H beyond this truncation point, the tail of the H vector. As n and P go to infinity, we want this tail energy to go to zero. That is essentially a statement about sparsity, a weak form of sparsity of the signal.

In this case, a natural thing to try is the following. So far we have used the maximum-likelihood estimate of delta as a proxy for H, and that did not work; but now we know something more about H, namely that its tail energy goes to zero. So why not truncate that estimator: keep the components of the estimate at indices below some truncation parameter d, and set everything beyond it to zero. That leads to what we call the truncation-based estimate of the direction along H, and without going through exactly how things behave, we can show that as the dimension, the number of samples, and the truncation point all go to infinity, with the truncation point chosen to grow more slowly than the number of samples, we can asymptotically estimate the signal subspace perfectly, meaning that in the mean-square sense the angle between the truncated estimate and the true direction goes to zero. So we can asymptotically estimate the one-dimensional subspace. And if we can estimate the subspace perfectly, it is unsurprising that, as things scale with the difficulty level kept fixed, the probability of classification error goes to the Bayes probability of error. So knowing the general sensing structure plus an additional sparsity assumption — some additional structural information — can asymptotically give you the Bayes probability of error.

Here is a little simulation that reinforces some of these insights. We have fixed the Bayes probability of error — the difficulty — to 0.1 throughout as the dimension scales; the energy of H is fixed to some value; some parameters of the model are chosen; the number of samples grows more slowly than the dimension, as shown here; the truncation point is chosen to grow more slowly than the number of samples, as shown here; and we assume a polynomial decay for H. In the plot on the left, one curve is H, one particular realization of it; the red curve is the noisy maximum-likelihood estimate, delta-hat, normalized to have unit energy; and the blue curve is the truncated version of the red one, with the truncation point at exactly twenty or so. On the right side is the probability of error on the vertical axis versus the ambient dimension. As the ambient dimension scales, the unstructured approach — where you know nothing about the sensing structure and try to estimate all the parameters with maximum-likelihood estimates — approaches a probability of error of one half. If you knew there was a sensing subspace but estimated it naively, simply using delta-hat, the maximum-likelihood estimate, you also approach one half. But if you use the truncation-based estimate, you approach the Bayes-optimal performance.
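In the same hedged spirit, here is a small sketch of the truncation-based estimate — my own decay profile, growth rates, and constants, chosen only to mimic the qualitative setup just described, not the exact values used in the talk's simulation. H has polynomially decaying components with fixed total energy, the number of samples grows more slowly than the dimension, the truncation point grows more slowly than the number of samples, and delta-hat is truncated to its first d coordinates before being used in the projected rule.

```python
import numpy as np

rng = np.random.default_rng(2)

def projected_rule(Xpos, Xneg, direction):
    """Project onto one direction and apply a scalar Fisher rule (as in the previous sketch)."""
    u = direction / np.linalg.norm(direction)
    zp, zn = Xpos @ u, Xneg @ u
    threshold = 0.5 * (zp.mean() + zn.mean())
    orient = np.sign(zp.mean() - zn.mean())
    return lambda X: np.sign(orient * (X @ u - threshold))

def test_error(rule, Tpos, Tneg):
    return 0.5 * (np.mean(rule(Tpos) != +1) + np.mean(rule(Tneg) != -1))

sigma = 1.0
for p in [500, 2000, 8000]:                        # ambient dimension grows ...
    n = int(np.ceil(p ** 0.5))                     # ... samples grow more slowly than p ...
    d = int(np.ceil(n ** 0.5))                     # ... truncation point grows more slowly than n
    h = np.arange(1, p + 1, dtype=float) ** -1.0   # polynomially decaying components
    h /= np.linalg.norm(h)                         # fixed (unit) energy, i.e. fixed difficulty

    Xpos = +h + sigma * rng.standard_normal((n, p))
    Xneg = -h + sigma * rng.standard_normal((n, p))
    delta_hat = Xpos.mean(axis=0) - Xneg.mean(axis=0)
    truncated = delta_hat.copy()
    truncated[d:] = 0.0                            # keep the first d components, zero the tail

    Tpos = +h + sigma * rng.standard_normal((1000, p))
    Tneg = -h + sigma * rng.standard_normal((1000, p))
    print(f"p={p:5d}  n={n:3d}  d={d:2d}   "
          f"naive: {test_error(projected_rule(Xpos, Xneg, delta_hat), Tpos, Tneg):.3f}   "
          f"truncated: {test_error(projected_rule(Xpos, Xneg, truncated), Tpos, Tneg):.3f}")
```

As p grows, the naive column degrades toward one half while the truncated column stays close to the Bayes error of the model used in this sketch.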
To conclude my talk, the take-away points are these. For many problems of practical interest you encounter situations where the number of samples is far fewer than the ambient data dimension, and in addition there often exists a latent low-dimensional sensing structure which can be exploited. If you totally ignore the sensing structure and try to estimate everything using maximum-likelihood estimates, you will be no better than random guessing in many scenarios. Even having general knowledge of the sensing structure — knowing that there is a one-dimensional signal H, but not knowing what it is — and trying to estimate it naively cannot do the job. But if you have the general sensing structure plus some additional structure on H, then you can often recover the asymptotically optimum classification performance.