Thanks. So we have a title that might not make it obvious what the talk is about, but hopefully by the end it will be. I'm from the IBM T. J. Watson Research Center. The topic of the paper is sensor networks. Just to get everyone on the same page: these are collections of spatially distributed nodes that take measurements and communicate, and we're interested in them for detection or classification tasks, for example in environmental monitoring or surveillance. The one question all of us are interested in this week is whether the Iceland volcano is still erupting or not. The measurements are of various types, whether that's temperature, sound pressure, vibration, et cetera, and these measurements exhibit spatial correlation. Also, the sensor nodes are usually resource constrained. Now, detection and classification are basically the same problem, but I'll point out the differences. In the detection problem, we use measurements to make decisions about binary hypotheses; in this case the likelihood functions of the measurements and the prior probabilities are known, and the likelihood ratio test is optimal. The version of this problem for sensor networks, the distributed detection problem, has been studied quite widely. In the supervised classification problem, we likewise use measurements to make decisions about binary hypotheses, but in this case the likelihood functions of the measurements and the prior probabilities are not known. What we are given instead is a labeled training set, that is, measurement vectors together with their true hypotheses, and machine learning algorithms are applied. The prime example is linear discriminant analysis, which I'll talk about in more detail; it's a very classical technique.
The main difference between detection and classification is this idea of overfitting: you can overfit to the labeled training vectors you're given if you match the training set too closely. Also, distributed supervised classification doesn't have as much prior work as the distributed detection problem. Practically speaking, oftentimes we don't have knowledge of the probability densities a priori, and it's much easier to imagine a situation in which training samples can be acquired; that's why we consider the classification problem. That's just the high-level setting. In this talk, specifically, we're not going to really consider network constraints on communication et cetera, but more or less what you can learn from data in sensor networks. So the outline of the talk is: first a few words about the supervised classification problem and about statistical learning theory; then linear discriminant analysis, with a particular focus on approximations to its generalization error that have been developed; then a Gauss-Markov random field sensor model that we're applying; we derive the Mahalanobis distance for that sensor model and use it to derive a generalization error expression for sensor networks; we provide a geometric probability approximation to the Mahalanobis distance in order to simplify things; and then we give some simulation results. First, let me just go over the notation a bit. We have a true hypothesis or class label Y, which is either plus one or minus one. We have a noisy measurement vector X, which is in R^p, and we have a joint probability density function f(x, y), where the x's are the measurements and y is the label. We want to learn a decision rule ŷ(x), which is a mapping from the space of measurements to plus one and minus one.
We learn this decision rule from n training samples (x_1, y_1) through (x_n, y_n), but we apply it not to the training samples but to unseen samples x drawn from the same distribution f(x, y). The thing we're interested in characterizing is the generalization error: the probability that the decision rule will be wrong on new, unseen samples from that same distribution. The training error is something we can measure empirically on the training set itself: how many decisions are wrong on the training samples. The idea of overfitting, and the structural risk minimization principle, is that as the complexity of the decision rule increases, the training error can be driven to zero, but due to overfitting, the generalization error, the thing we actually want to minimize, is optimized at some intermediate complexity level. In general, statistical learning theory characterizations of generalization error are usually very loose. Let me just read a quick quote which gives this idea: "One should not be concerned about the quantitative value of the bound, nor about its fundamental form, but rather about the terms that appear in the bound. In that respect, a useful bound is one which allows us to understand which quantities are involved in the learning process. As a result, performance bounds should be used for what they are good for; they should not be used to actually predict the value of the expected error," et cetera. But what if we actually are interested in developing approximations or bounds for generalization error that we can use as an optimization criterion? We could use modern statistical learning theory results, things like VC dimension, Rademacher complexity, et cetera, and we want to do this for sensor network design optimization: we want some sort of expression that we can optimize for generalization. Instead, we turn to the former Soviet Union literature.
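The gap between training error and generalization error described above is easy to see numerically. A small Python sketch on toy Gaussian data (not the sensor model from the talk): a maximally complex rule, one-nearest-neighbor memorization of the training set, achieves zero training error while the held-out error stays well away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: two overlapping Gaussian classes in R^5 (illustrative only).
n, p = 50, 5
X_train = np.vstack([rng.normal(0, 1, (n, p)), rng.normal(1, 1, (n, p))])
y_train = np.hstack([-np.ones(n), np.ones(n)])
X_test = np.vstack([rng.normal(0, 1, (5000, p)), rng.normal(1, 1, (5000, p))])
y_test = np.hstack([-np.ones(5000), np.ones(5000)])

def nn_rule(X_tr, y_tr, X):
    # 1-nearest-neighbor: a maximally complex rule that memorizes the training set
    d = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(1)]

train_err = np.mean(nn_rule(X_train, y_train, X_train) != y_train)
test_err = np.mean(nn_rule(X_train, y_train, X_test) != y_test)
print(train_err, test_err)  # training error is exactly 0; test error is not
```

Driving training error to zero this way is exactly the overfitting regime that structural risk minimization warns about.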
That literature developed very high quality approximations to the generalization error of, specifically, linear discriminant analysis and some of its variations; the main contributor was Raudys. The expressions developed there can be used as optimization criteria, as we'll see in some of the simulation results later, and they are very tight approximations. Specifically, linear discriminant analysis is the decision rule I have on the screen. This decision rule ŷ(x) is based on the following idea: if you had Gaussian distributions for the two classes, with different means and the same covariance, the Bayes rule would be a linear discriminant; but since we don't know the true distributions, the sample means and the sample covariance are plugged in. When the true likelihood functions are Gaussian and the true prior probabilities are equal, and we learn ŷ from the n training samples, the generalization error approximation given by Raudys et al. is the expression on the slide: the generalization error is approximately Φ of a quantity, where Φ is the CDF of the Gaussian distribution, and that quantity involves δ², the Mahalanobis distance, which I'll get to in a second; p, the dimensionality, which in our case will be the number of sensors; and n, the number of training samples. It's not a simple expression, but it is very easy to evaluate, and we'll see how close it is. The δ² here is the squared Mahalanobis distance: μ₊ and μ₋ are the means of the two classes, and J₋ and J₊ are the inverse covariance matrices; that distance is one of the terms in the expression. We want to use this generalization error approximation to analyze sensor networks with spatial correlation.
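As a concrete illustration of the plug-in rule just described, here is a minimal LDA sketch in Python: sample means, a pooled sample covariance, and an equal-priors midpoint threshold. The toy data, dimensions, and function names are made up for the example.

```python
import numpy as np

def lda_fit(X_plus, X_minus):
    # Plug-in linear discriminant: sample means and pooled sample covariance
    mu_p, mu_m = X_plus.mean(0), X_minus.mean(0)
    Xc = np.vstack([X_plus - mu_p, X_minus - mu_m])
    sigma = Xc.T @ Xc / (len(Xc) - 2)        # pooled covariance estimate
    w = np.linalg.solve(sigma, mu_p - mu_m)  # discriminant direction
    b = -w @ (mu_p + mu_m) / 2               # midpoint threshold (equal priors)
    return w, b

def lda_predict(w, b, X):
    return np.where(X @ w + b >= 0, 1, -1)

rng = np.random.default_rng(1)
n, p = 200, 4
Xp = rng.normal(1, 1, (n, p))   # class +1: mean all-ones, identity covariance
Xm = rng.normal(0, 1, (n, p))   # class -1: mean zero, identity covariance
w, b = lda_fit(Xp, Xm)
err = np.mean(np.hstack([lda_predict(w, b, Xp) != 1,
                         lda_predict(w, b, Xm) != -1]))
print(err)
```

With identity covariance and an all-ones mean difference in R^4, δ = 2, so the error should sit near Φ(-1), about 0.16, which is the Bayes-rule benchmark the plug-in rule approaches as n grows.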
As I said before, we're not going to be concerned in this talk with network or communication constraints. So within the sensor network setup specifically: we have p sensors, each taking a scalar-valued measurement, X_1 through X_p, and as a combined measurement vector we have this big X, which is in R^p. The sensors are deployed randomly on the plane according to some distribution f_{U,V}(u, v), and the support of f_{U,V} is a square with area p, so the region grows as more sensors are placed. The data-generating likelihood functions are Gaussian, as I said before, with μ₋ and μ₊ as the means and Σ₋ and Σ₊ as the class covariances. We model the covariance structure, the spatial correlation between sensor measurements, using a Gauss-Markov random field model, which I'll get to in a second. For simplicity we'll let μ₋ be the all-zeros vector and μ₊ be the all-ones vector, and we'll let the two covariances be equal; call them Σ, and take the inverse covariance matrix to be J. The particular Gauss-Markov random field that we use has a nearest-neighbor dependency. We construct a Euclidean, undirected nearest-neighbor graph between the sensors: there is an edge between sensor i and sensor j if i is the nearest neighbor of j or if j is the nearest neighbor of i, and we denote the edge set of this graph by E. Then we use this graph to define the Markov random field covariance, in three parts. First, the diagonal elements of Σ are all equal to σ². Second, the elements of Σ corresponding to edges in the nearest-neighbor graph are σ² times g(d_ij), where d_ij is the distance between sensor i and sensor j, and g is a monotonically decreasing function that encodes correlation decay: the farther apart two sensors are, the less correlated they are. This g plays the role of a semivariogram and is used in geostatistics.
Third, the off-diagonal elements of J corresponding to non-edges in the nearest-neighbor graph are all zero. This is a model that has been used, for example, by Anandkumar and Swami in some of their work as well. So now we want to get this δ², the Mahalanobis distance. With these simplifying assumptions, μ₋ being all zeros, μ₊ being all ones, and the two inverse covariance matrices being equal, it turns out that the squared Mahalanobis distance is just the sum of the entries of the inverse covariance matrix J. If we substitute the expressions for the covariance matrices and do a bit of algebra, we find that

δ² = p/σ² − (2/σ²) Σ_{(i,j)∈E} g(d_ij) / (1 + g(d_ij)),

that is, p, the number of sensors, over σ², minus 2/σ² times a sum over the edge set of g over one plus g, where the arguments to g are the distances between the sensors. Now that we have this δ² expression, we can plug it into the previous expression for the generalization error, which gives us, not an exact expression, but an approximation with everything defined. However, if we look at this expression for δ², it depends on the actual realization of the sensor locations (u_i, v_i). We'd like to integrate that out, to understand what happens in general without a specific instantiation of the sensor placement. Since δ² depends on the particular realization of the random deployment, we want to characterize some sort of average behavior of δ² across realizations of f_{U,V}.
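The δ² expression above is straightforward to compute from a set of sensor locations. A Python sketch, using the mutual nearest-neighbor edge rule from the model; as an assumption borrowed from the simulation section later in the talk, it takes g(d) = e^{−d/2}.

```python
import numpy as np

def mahalanobis_sq(locs, sigma2=1.0, g=lambda d: np.exp(-d / 2)):
    # delta^2 = P/sigma^2 - (2/sigma^2) * sum_{(i,j) in E} g(d_ij)/(1 + g(d_ij))
    P = len(locs)
    d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(1)  # index of each sensor's nearest neighbor
    # undirected edge (i, j) if j is i's nearest neighbor or i is j's
    edges = {tuple(sorted((i, int(nn[i])))) for i in range(P)}
    s = sum(g(d[i, j]) / (1 + g(d[i, j])) for i, j in edges)
    return P / sigma2 - 2 * s / sigma2

rng = np.random.default_rng(2)
P = 100
locs = rng.uniform(0, np.sqrt(P), (P, 2))  # uniform placement on a square of area P
d2 = mahalanobis_sq(locs)
print(d2)  # strictly between 0 and P/sigma^2, since each g/(1+g) < 1/2
```

Because g is positive and each edge term g/(1+g) is below one half, the correlation penalty can never push δ² below zero; independent sensors would give exactly p/σ².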
It turns out that the average behavior of functionals of the nearest-neighbor graph can be characterized using functionals on homogeneous Poisson point processes, through limit theory developed by Penrose and Yukich: as p, the number of sensors, goes to infinity, a functional defined on the nearest-neighbor graph edge distances converges to a corresponding expression over a homogeneous Poisson point process. For us, the φ function on the left-hand side is g/(1+g), and if we let the right-hand side of the limit be ζ/2, then the δ² expression simplifies to

δ² ≈ (p/σ²)(1 − ζ),

and this ζ can be approximated very easily using Monte Carlo simulation, because a Poisson point process is easy to generate. All right, so now we have this new expression for δ², which we can again plug into the generalization error expression. If we substitute the Mahalanobis distance approximation into the generalization error approximation, we get the expression up here, and then the question arises: what is the optimal number of sensors for a given number of training samples? We can find the p that minimizes our approximation: if we differentiate with respect to p and set the derivative equal to zero, we find that the optimal p is n/2, and it doesn't actually depend on σ² or ζ. What that tells us is that it's not always beneficial to throw down more sensors with a fixed number of training samples, and the reason is the phenomenon of overfitting. The minimum generalization error, when we set p equal to n/2, is the other expression on the slide. That minimum generalization error is monotonically increasing in σ², which is expected: you'd have more errors when there's more noise.
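Both steps above can be sketched in Python: ζ estimated by Monte Carlo over a unit-rate homogeneous Poisson process on a square, then a search over p for a fixed n. Since the talk does not spell out the Raudys expression itself, `gen_error` below uses a hypothetical simplified stand-in, Φ(−(δ/2)·sqrt(1 − p/n)) with δ² = (p/σ²)(1 − ζ), chosen only because it reproduces the qualitative properties stated here (minimum at p = n/2 independent of σ² and ζ, increasing in σ² and ζ, decreasing in n); it is not the actual approximation from the paper.

```python
import math
import numpy as np

def edge_phi_sum(locs, g=lambda d: np.exp(-d / 2)):
    # sum of g/(1+g) over the mutual nearest-neighbor graph edges
    d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(1)
    edges = {tuple(sorted((i, int(nn[i])))) for i in range(len(locs))}
    return sum(g(d[i, j]) / (1 + g(d[i, j])) for i, j in edges)

def estimate_zeta(trials=20, area=400.0, seed=3):
    # Monte Carlo over a unit-rate homogeneous Poisson process on a square:
    # number of points ~ Poisson(area), locations i.i.d. uniform on the square
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        P = rng.poisson(area)
        locs = rng.uniform(0, math.sqrt(area), (P, 2))
        vals.append(2 * edge_phi_sum(locs) / P)
    return float(np.mean(vals))

def gen_error(P, n, sigma2, zeta):
    # Hypothetical stand-in for the Raudys-style approximation (NOT the
    # paper's expression): Phi(-(delta/2) * sqrt(1 - P/n)), with
    # delta^2 = (P/sigma2) * (1 - zeta). Shares the stated properties.
    delta = math.sqrt(P * (1 - zeta) / sigma2)
    z = -(delta / 2) * math.sqrt(1 - P / n)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

zeta = estimate_zeta()
n = 100
errs = {P: gen_error(P, n, 1.0, zeta) for P in range(1, n)}
P_star = min(errs, key=errs.get)
print(zeta, P_star)  # P_star is n/2 = 50 for this functional form
```

Any error expression whose argument is proportional to sqrt(p(1 − p/n)) is minimized at p = n/2 regardless of the constants, which is exactly the independence from σ² and ζ noted above.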
It is also monotonically decreasing in n, which is expected: as the number of training samples increases, the error decreases. The interesting thing to note is that it's monotonically increasing in ζ. Since ζ only depends on the placement distribution of the sensors, that tells us we should choose f_{U,V} in order to minimize ζ. So we have that choice when deploying a sensor system: should all the sensors be clustered in the middle of the square, should they be at the edges, should they be uniformly placed, or any question like that. Now let me show you some simulations. The overall message of the simulations is that these are really good approximations that can be used for system design. The semivariogram function g that we use is g(d) = e^{−d/2}. The sensor location distribution that we consider, again supported over the square with area p, is an appropriately scaled and shifted beta distribution, i.i.d. in both spatial dimensions, with both parameters of the beta distribution taken to be equal. If beta is one, that's the uniform distribution over the square; if beta is greater than one, the sensors are concentrated in the middle of the square; and if beta is less than one, the sensors are concentrated at the edges of the square. We do this for different values of p, the number of sensors; twenty realizations of the sensor location placements; ten realizations of the training set per realization of the sensor locations; different values of the number of training samples; and one hundred thousand test samples per training set. All right, so this is the Mahalanobis distance approximation as a function of the number of sensors. There are two lines here, but you can't see them both because the approximation is so good: there's a blue line and a black line, and the black line is the approximation.
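The placement distribution just described is simple to sample. A Python sketch of the scaled Beta(β, β) deployment, with an illustrative spread statistic (mean coordinate distance from the center, an assumption of this example, not a quantity from the talk) confirming that β > 1 concentrates sensors toward the middle and β < 1 toward the edges:

```python
import numpy as np

def beta_placement(P, beta, rng):
    # each coordinate: i.i.d. Beta(beta, beta), scaled to the side length
    # sqrt(P) of a square with area P
    return np.sqrt(P) * rng.beta(beta, beta, size=(P, 2))

rng = np.random.default_rng(4)
spreads = {}
for b in (0.5, 1.0, 2.0):
    locs = beta_placement(1000, b, rng)
    # mean coordinate distance from the center line: large values mean the
    # sensors are pushed to the edges, small values mean clustered in the middle
    spreads[b] = float(np.abs(locs - np.sqrt(1000) / 2).mean())
print(spreads)  # spread shrinks as beta grows
```

Feeding these placements into the δ² and error computations above is how one would reproduce the ζ-versus-β comparison from the talk.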
As I didn't point out earlier, here we see that δ² is a linear function of p, and the empirical Mahalanobis distance is essentially indistinguishable from the approximation, the one from geometric probability, the Poisson point process approximation. Then we plug that approximation into the other approximation, the Raudys et al. approximation for the generalization error of linear discriminant analysis. Here the red line is the empirical generalization error, the empirical test error actually, and the black line is the approximation. This is for n equal to one hundred, so the number of training samples is one hundred. We see, first of all, that the error is minimized when p is approximately fifty, which is what we expected, since we saw that the optimum is n/2, and the approximation is extremely good here as well: the red line and the black line are almost the same. This is the same plot for n equal to two hundred; again the minimum is at p equal to one hundred, as expected, and the approximation is quite good. And here's the case for n equal to one thousand, so one thousand training samples. If we wanted to empirically get these very tiny error values, these 10^-10 error probabilities, we would have to run our simulations for a long, long time. This is a case where it's helpful to actually have the approximation in order to understand the performance in these low-error regimes as well; we don't need to do the empirical evaluation, we can just use that black line in general for optimizing the sensors. The other question was about ζ, which depends on the parameter beta of the sensor distribution: beta equal to one, again, was uniform placement; beta less than one was clustered at the edges; and beta greater than one was clustered in the middle.
isn't it it was the middle so here this uh uh they it as a function of bit are we see that one where and that we have this uniform distribution is best and that's also um uh what minimize as the generalization error so in this case we want the sensors to be placed uniformly when we have this nearest neighbor dependency um so in conclusion a so time is always a limited resource so uh training sets are always uh a finite um it's all optimal to use process we have the uh number of sensors as training samples in a this uh sensor network with local gauss marco dependency and the fact that um there's a finite rather than infinite number of sensors that's optimal follows from the from an of overfitting uh we drive a generalization error approximation that involves um a obvious distance and we give an exact statement of that model a little was just as for are are gas mark of sensor measurements um and we approximate it using a a a a a to a a probability expression and uh we saw that the uh combined of approximations both um the uh did you much probability one and generalization error one uh together closely matched impair pair coal results and we saw that uniform sensor placement is a a good in this so i'll be happy to take uh some questions uh_huh yeah using i and it is to yeah uh_huh i what i see uh_huh make it uh_huh it she i she she two ish so i uh i you assuming that a all these uh my D V not a if not like its effect on the generalization right so as i mentioned um but for um so this approximation uh this generalization error approximation is true when the uh actual generating data that things that distributions to generating the data are actually gaussian and they have equal prior probability um so that's what we used um in following a but um if the true uh likelihood functions we're not gaussian then there would be a different expression for the generalisation error um and usually that's very hard to can come up with something that matches the actual empirical test 
In the Soviet Union literature I mentioned, there are a few variations on this, but usually the assumption is that the generating distributions are Gaussian; otherwise it's very hard to come up with expressions, and people have been trying to find such expressions for thirty or forty years.

[Audience question about the assumption of equal covariances.]

No, we don't have to make that assumption; it's just to simplify notation. I think this paper, or an extended version of it, has the form where the two covariances are different. Thank you.