0:00:14 | Thank you.

0:00:15 | The title might not make it obvious what this talk is about, but hopefully by the end it will be.

0:00:21 | I'm from the IBM T. J. Watson Research Center.

0:00:27 | The topic of the paper is sensor networks.

0:00:32 | Just to get everyone on the same page: these are collections of spatially distributed nodes that take measurements and communicate, and we are interested in using them for detection or classification tasks.

0:00:44 | For example, environmental monitoring or surveillance; the one question all of us are interested in this week is whether the Iceland volcano is still erupting or not.

0:00:56 | The measurements are of various types, whether temperature, sound, pressure, vibration, et cetera. These measurements exhibit spatial correlation, and the sensor nodes are usually resource constrained.

0:01:11 | Detection and classification are basically the same problem, but let me point out the differences.

0:01:18 | In the detection problem, we use measurements to make decisions about binary hypotheses. In this case the likelihood functions of the measurements and the prior probabilities are known, and the likelihood ratio test is optimal.

0:01:33 | For the version of this problem with sensor networks, the distributed detection problem, this has been studied quite widely.

0:01:38 | In the classification problem, specifically the supervised classification problem, we again use measurements to make decisions about binary classes.

0:01:47 | In this case, the likelihood functions of the measurements and the prior probabilities are not known. What we are given instead is a labeled training set: measurement vectors with their true class labels.

0:01:58 | Machine learning algorithms are applied here; the example I will talk about in more detail is linear discriminant analysis, a very classical technique.

0:02:08 | The key difference between detection and supervised classification is the idea of overfitting: you do not want to fit the labeled training set you are given too closely.

0:02:22 | Distributed supervised classification does not have as much prior work as the distributed detection problem.

0:02:28 | Practically, we often do not have a priori knowledge of the probability densities, and it is much easier to imagine a situation in which training samples can be acquired; that is why we consider the classification problem.

0:02:43 | This is a high-level point about what I am going to talk about: we are not going to consider network constraints on communication, et cetera, but more or less what you can learn from data in a sensor network.

0:02:59 | The outline of the talk: first a few words about the supervised classification problem and about statistical learning theory.

0:03:09 | I will go through linear discriminant analysis, and especially focus on approximations to its generalization error that have been developed.

0:03:18 | Then I will go over the Gauss-Markov random field sensor model that we are applying, derive the Mahalanobis distance for that sensor model, and use it to derive a generalization error expression for sensor networks.

0:03:33 | Finally, I will provide a geometric probability approximation to the Mahalanobis distance in order to simplify things, and then give some simulation results.

0:03:42 | Let me go over the notation a bit. We have a true hypothesis or class label y, which is either +1 or -1, for example whether a phenomenon of interest is present or not.

0:03:54 | We have a noisy measurement vector x, which is in R^P, and a joint probability density function f(x, y), where the x's are the measurements and y is the label.

0:04:05 | We want to learn a decision rule y-hat, which is a mapping from the space of measurements to {+1, -1}.

0:04:12 | The decision rule is learned from n training samples (x_1, y_1), ..., (x_n, y_n), but we do not apply it to the training samples; we apply it to unseen samples drawn from the same distribution f(x, y).

0:04:23 | The quantity we are interested in characterizing is the generalization error: the probability that the decision rule will be wrong on new, unseen samples from the same distribution.

0:04:31 | The training error, in contrast, is something we can measure empirically on the training set itself: the fraction of training samples on which the rule is wrong.

0:04:41 | The idea of overfitting, and the structural risk minimization principle, is that as the complexity of the decision rule increases, the training error can be driven to zero, but due to overfitting the generalization error, the quantity we actually want to minimize, is optimized at some intermediate complexity level.

0:04:58 | In general, statistical learning theory analyses, at least small-sample and large-sample characterizations of the generalization error, are usually very loose.

0:05:07 | Let me read a quick quote which gives this idea: one should not be concerned about the quantitative value of the bound, nor about its fundamental form, but rather about the terms that appear in the bound; in that respect a useful bound is one which allows us to understand which quantities are involved in the learning process. As a result, performance bounds should be used for what they are good for; they should not be used to actually predict the value of the expected error, et cetera.

0:05:31 | But what if we actually are interested in developing approximations or bounds for the generalization error that we can use as an optimization criterion?

0:05:41 | We could try modern statistical learning theory results, things like the VC dimension, Rademacher complexity, et cetera.

0:05:47 | We want to do this for sensor network design optimization, so we want some sort of expression that we can optimize for generalization.

0:05:58 | So we turn to the former Soviet Union literature: they developed very high-quality approximations to the generalization error, specifically of linear discriminant analysis and some of its variations.

0:06:11 | This work was motivated by Kolmogorov, and the main contributor was Šarūnas Raudys.

0:06:17 | The expressions developed in this literature can be used as optimization criteria, as we will see in some of the simulation results later; they are very tight approximations.

0:06:31 | Specifically, linear discriminant analysis is the decision rule I have on the screen: y-hat(x) is based on what you would do to get the Bayes rule if you had Gaussian distributions for the two classes, with different means and the same covariance, except that here the sample means and the sample covariance are plugged in.
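The plug-in rule just described, the Bayes rule for two Gaussians with the sample means and pooled sample covariance substituted in, can be sketched in Python. This is a minimal illustration, not the speaker's code; the equal-covariance linear form is assumed.

```python
import numpy as np

def fit_lda(X_plus, X_minus):
    """Fit the plug-in LDA rule from labeled samples of the two classes.

    X_plus, X_minus: arrays of shape (n_c, P) holding training vectors
    for classes +1 and -1. Returns (w, b) for y_hat(x) = sign(w @ x + b).
    """
    mu_p = X_plus.mean(axis=0)
    mu_m = X_minus.mean(axis=0)
    n_p, n_m = len(X_plus), len(X_minus)
    # Pooled sample covariance (the equal-covariance model assumed here).
    S = ((X_plus - mu_p).T @ (X_plus - mu_p)
         + (X_minus - mu_m).T @ (X_minus - mu_m)) / (n_p + n_m - 2)
    w = np.linalg.solve(S, mu_p - mu_m)   # Sigma^{-1} (mu_+ - mu_-)
    b = -w @ (mu_p + mu_m) / 2.0          # threshold at the midpoint of the means
    return w, b

def predict(w, b, X):
    return np.sign(X @ w + b)

# Toy data matching the talk's later setup: mu_- = 0, mu_+ = all ones.
rng = np.random.default_rng(0)
P, n = 5, 200
X_minus = rng.normal(0.0, 1.0, size=(n // 2, P))
X_plus = rng.normal(0.0, 1.0, size=(n // 2, P)) + 1.0
w, b = fit_lda(X_plus, X_minus)
```

Because the rule is learned from finitely many samples, its error on fresh data (the generalization error discussed next) exceeds the Bayes error.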

0:06:57 | When the true likelihood functions are Gaussian and the true prior probabilities are equal, but we do not know this and instead learn y-hat from the n training samples, then the generalization error approximation given by Raudys et al. is the one shown here.

0:07:14 | The generalization error is approximately Φ of an expression, where Φ is the CDF of the standard Gaussian distribution, and we have terms involving δ², which is the Mahalanobis distance (which I will get to in a second), P, which is the dimensionality, in our case the number of sensors, and n, which is the number of training samples.

0:07:36 | This is not a simple expression, but it is very easy to evaluate, and we will see how close it is.

0:07:43 | This δ² is the Mahalanobis distance: μ_+ and μ_- are the means of the two classes, and J_+ and J_- are the inverse covariance matrices. That is one of the terms in the expression.
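In code, this distance is a simple quadratic form. Below is a minimal Python sketch (illustrative, not from the talk) which also checks the simplification used later in the talk: with μ_- = 0, μ_+ = all ones, and a common inverse covariance J, δ² reduces to the sum of all entries of J.

```python
import numpy as np

def mahalanobis_sq(mu_plus, mu_minus, J):
    """delta^2 = (mu_+ - mu_-)^T J (mu_+ - mu_-), with J the common inverse covariance."""
    d = np.asarray(mu_plus, dtype=float) - np.asarray(mu_minus, dtype=float)
    return float(d @ J @ d)

# Example precision matrix (symmetric, diagonally dominant, hence valid).
J = np.array([[2.0, -0.5, 0.0],
              [-0.5, 2.0, -0.5],
              [0.0, -0.5, 2.0]])

# With mu_- = 0 and mu_+ = all ones, delta^2 = 1^T J 1 = sum of entries of J.
d2 = mahalanobis_sq(np.ones(3), np.zeros(3), J)
```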

0:07:59 | We want to use this generalization error approximation to analyze sensor networks with spatial correlation, and as I said before, in this talk we are not concerned with network communication constraints.

0:08:13 | Within the sensor network setup specifically, we have P sensors, each taking a scalar-valued measurement, X_1 through X_P, so the combined measurement vector is X, which is in R^P.

0:08:27 | The sensors are deployed randomly in the plane according to some distribution f(u, v), which is supported on a square with area P, so the region grows as more sensors are placed.

0:08:38 | The generating likelihood functions are Gaussian, as I said before, with μ_- and μ_+ as the means and Σ_- and Σ_+ as the class covariances.

0:08:49 | We model the covariance structure, that is, the spatial correlation between sensor measurements, using a Gauss-Markov random field model, which I will get to in a second.

0:08:57 | For simplicity, we let μ_- be the zero vector and μ_+ be the all-ones vector, and we let the two covariances be equal; call them Σ, and define the inverse covariance matrix to be J.

0:09:14 | The actual Markov random field that we have is one with nearest-neighbor dependency: we construct a Euclidean undirected nearest-neighbor graph among the sensors.

0:09:25 | There is an edge between sensor i and sensor j if i is the nearest neighbor of j, or if j is the nearest neighbor of i, and we denote the edge set of this graph as E.

0:09:35 | We then use this graph to define the Markov random field covariance, in three parts. First, the diagonal elements of Σ are all equal to σ².

0:09:47 | Second, the elements of Σ corresponding to edges in the nearest-neighbor graph are σ² times g(d_ij), where d_ij is the distance between sensor i and sensor j. This g is a monotonically decreasing function that encodes correlation decay, so the farther apart two sensors are, the less correlated they are; it is known as a semivariogram in geostatistics.

0:10:14 | Third, the off-diagonal elements of J corresponding to non-edges in the nearest-neighbor graph are all zero.

0:10:22 | This is a model that has been used, for example, by Anandkumar, Tong, and Swami in some of their work as well.
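As a small illustration, here is one way to build the undirected nearest-neighbor edge set described above. This is a sketch in Python; the talk itself gives no code, and a brute-force distance matrix is used for clarity.

```python
import numpy as np

def nn_edge_set(locations):
    """Undirected nearest-neighbor graph: edge {i, j} whenever i is the
    nearest neighbor of j or j is the nearest neighbor of i.
    Returns the edge set and the pairwise distance matrix."""
    pts = np.asarray(locations, dtype=float)
    P = len(pts)
    # Pairwise Euclidean distances, with the diagonal masked out.
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)  # index of each sensor's nearest neighbor
    edges = {tuple(sorted((i, int(nn[i])))) for i in range(P)}
    return edges, dist

# Example: four sensors on a line at x = 0, 1, 3, 7.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
E, dist = nn_edge_set(pts)
```

Note that the relation is not symmetric (sensor 3's nearest neighbor is 2, but sensor 2's nearest neighbor is 1), which is why the edge set takes the union of both directions.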

0:10:29 | um |

0:10:30 | so now we wanna get this dealt to square this my an this distance um |

0:10:35 | so |

0:10:36 | when we have these simplifying assumptions of um you minus being zero mean plus being one and uh the took |

0:10:42 | the inverse covariance matrices being equal |

0:10:44 | and it turns out the the mall in just stuff the squared is just uh the sum of the um |

0:10:50 | this inverse covariance matrix a |

0:10:52 | and trees |

0:10:53 | and uh if we substitute the covariance the expressions for the covariance uh matrices and it as a bit of |

0:10:59 | algebra than we find that uh |

0:11:01 | a small distance system to square is equal to up P over sigma squared um so P gone as the |

0:11:06 | number of sensors |

0:11:07 | mine is uh to over sigma squared times |

0:11:09 | the sum of the edge set uh |

0:11:12 | and uh we have the something of G over one plus G |

0:11:15 | where the arguments to G again or um |

0:11:17 | the distances between the sensor |
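The derived expression can be evaluated directly from sensor positions. The following hypothetical sketch computes δ² = P/σ² - (2/σ²) Σ_{(i,j) in E} g(d_ij)/(1 + g(d_ij)), using the exponential semivariogram that the talk mentions later in the simulations.

```python
import numpy as np

def delta_sq(locations, sigma_sq, g):
    """delta^2 = P/sigma^2 - (2/sigma^2) * sum over nearest-neighbor-graph
    edges of g(d)/(1 + g(d)), for the simplified GMRF sensor model."""
    pts = np.asarray(locations, dtype=float)
    P = len(pts)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)
    edges = {tuple(sorted((i, int(nn[i])))) for i in range(P)}
    s = sum(g(dist[i, j]) / (1.0 + g(dist[i, j])) for i, j in edges)
    return P / sigma_sq - 2.0 * s / sigma_sq

g = lambda d: np.exp(-d / 2.0)  # semivariogram used in the talk's simulations

# Widely separated sensors: correlations vanish, so delta^2 -> P / sigma^2.
far = delta_sq([(0, 0), (100, 0), (0, 100), (100, 100)], 1.0, g)
# Closely spaced sensors: correlation reduces delta^2 below P / sigma^2.
near = delta_sq([(0.0, 0.0), (1.0, 0.0)], 1.0, g)
```

The second example illustrates the intuition: correlated (nearby) sensors carry less combined discriminating information than independent ones.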

0:11:21 | Now that we have this δ² expression, we can plug it into the previous expression I had for the generalization error. This gives us, not an exact result, but an approximation, with everything defined, for the generalization error.

0:11:39 | But note that this expression for δ² depends on the actual realization of the sensor locations (u_i, v_i). We would like to somehow integrate that out, to understand what happens in general without a specific instantiation of the sensor network.

0:12:02 | As I said, δ² depends on the particular realization of the random sensor deployment, so we want to characterize some sort of average behavior of δ² across realizations drawn from f(u, v).

0:12:13 | It turns out that the average behavior of functionals of the nearest-neighbor graph can be described using the average behavior of functionals on homogeneous Poisson point processes, in work developed by Penrose and Yukich.

0:12:25 | As P, the number of sensors, goes to infinity, a functional defined on the nearest-neighbor graph edge distances converges to an expression based on a homogeneous Poisson point process.

0:12:42 | For us, the function on the left-hand side is g/(1 + g), and if we let the right-hand side of the limit be ζ/2, then the δ² expression simplifies to (P/σ²)(1 - ζ).

0:12:58 | This ζ can be approximated very easily using Monte Carlo simulation, because a homogeneous Poisson point process is easy to generate.
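A minimal Monte Carlo sketch of this ζ estimate might look as follows. The window size, trial count, and boundary handling here are my assumptions, not the paper's exact procedure; the idea is only that the edge sum over the nearest-neighbor graph of a unit-rate Poisson process, divided by the number of points, approximates ζ/2.

```python
import numpy as np

def estimate_zeta(g, side=30.0, trials=20, seed=0):
    """Monte Carlo estimate of zeta: draw a unit-rate homogeneous Poisson
    process on a side x side window, build the nearest-neighbor graph, and
    use  sum over edges of g(d)/(1+g(d))  ~  (number of points) * zeta / 2.
    Boundary effects shrink as the window grows."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        npts = rng.poisson(side * side)          # Poisson number of points
        pts = rng.uniform(0.0, side, size=(npts, 2))
        diff = pts[:, None, :] - pts[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)
        nn = dist.argmin(axis=1)
        edges = {tuple(sorted((i, int(nn[i])))) for i in range(npts)}
        s = sum(g(dist[i, j]) / (1.0 + g(dist[i, j])) for i, j in edges)
        vals.append(2.0 * s / npts)
    return float(np.mean(vals))

zeta = estimate_zeta(lambda d: np.exp(-d / 2.0))
```

Once ζ is estimated, δ² ≈ (P/σ²)(1 - ζ) no longer depends on a particular sensor deployment.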

0:13:09 | Now we have this new expression for δ², which we can again plug into the generalization error expression.

0:13:23 | So basically, if we substitute the Mahalanobis distance approximation into the generalization error approximation, we get the expression up here.

0:13:32 | Then the question might arise: what is the optimal number of sensors for a given number of training samples?

0:13:41 | We can find the P that minimizes our approximation: if we differentiate it with respect to P and set the derivative equal to zero, we find that the optimal P is n/2, and it does not actually depend on σ² or ζ.
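This optimum can be checked numerically. Since the talk does not display the exact formula, the sketch below assumes a hypothesized Raudys-style form, ε ≈ Φ(-(δ/2)·sqrt((1 - P/n)/(1 + 4P/(nδ²)))), together with δ² = (P/σ²)(1 - ζ); the paper's exact expression may differ, but this form reproduces the claimed minimizer P = n/2, independent of σ² and ζ.

```python
import numpy as np
from math import erf, sqrt

def gen_error(P, n, sigma_sq, zeta):
    """Hypothesized Raudys-style generalization error approximation
    with delta^2 = (P / sigma^2) * (1 - zeta). As n -> infinity this
    tends to Phi(-delta/2), the Bayes error for this Gaussian setup."""
    delta_sq = (P / sigma_sq) * (1.0 - zeta)
    arg = -(sqrt(delta_sq) / 2.0) * sqrt((1.0 - P / n)
                                         / (1.0 + 4.0 * P / (n * delta_sq)))
    return 0.5 * (1.0 + erf(arg / sqrt(2.0)))  # standard Gaussian CDF

def best_P(n, sigma_sq, zeta):
    """Brute-force scan for the error-minimizing number of sensors (P < n)."""
    Ps = np.arange(1, n)
    errs = [gen_error(int(P), n, sigma_sq, zeta) for P in Ps]
    return int(Ps[int(np.argmin(errs))])
```

With δ² linear in P, the magnitude of the Φ argument is proportional to sqrt(P(1 - P/n)), which is maximized exactly at P = n/2, matching the derivative computation in the talk.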

0:13:58 | What this tells us is that it is not always beneficial to put down more sensors with a fixed number of training samples, and the reason is that the phenomenon of overfitting occurs.

0:14:10 | The minimum generalization error, when we set P equal to n/2, is the other expression on the slide.

0:14:17 | That minimum generalization error expression is monotonically increasing in σ², which is expected: you have more errors when there is more noise. It is also monotonically decreasing in n, which is expected as well: as the number of training samples increases, the error decreases.

0:14:36 | The interesting thing to note is that it is monotonically increasing in ζ. Since ζ depends only on the placement distribution of the sensors, we should choose f(u, v) in order to minimize ζ.

0:14:49 | So we have that choice when deploying a sensor system: should all the sensors be clustered in the middle of the square, should they be at the edges, should they be uniformly placed, and so on.

0:15:00 | Let me now show you some simulations. The overall message of the simulations is that these are really good approximations that can be used for system design, and you will see that.

0:15:13 | The semivariogram function g that we use is g(d) = e^(-d/2).

0:15:18 | The sensor location distribution that we consider, again supported on the square with area P, is an appropriately scaled and shifted beta distribution, i.i.d. in both of its spatial dimensions, with both parameters of the beta distribution taken to be equal.

0:15:32 | If the beta parameter is one, this is the uniform distribution over the square; if it is greater than one, the sensors are concentrated in the middle of the square; and if it is less than one, the sensors are concentrated at the edges of the square.
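The placement model can be sketched as follows; this is hypothetical code, with the function name and defaults chosen for illustration.

```python
import numpy as np

def sample_sensor_locations(P, beta_param, seed=0):
    """Draw P sensor locations i.i.d. on a square of area P (side sqrt(P)):
    each coordinate is Beta(b, b) scaled to [0, sqrt(P)].
    b = 1 gives uniform placement; b > 1 clusters sensors in the middle;
    b < 1 pushes them toward the edges of the square."""
    rng = np.random.default_rng(seed)
    side = np.sqrt(P)
    return side * rng.beta(beta_param, beta_param, size=(P, 2))

pts_uniform = sample_sensor_locations(100, 1.0)  # uniform over the square
pts_center = sample_sensor_locations(100, 5.0)   # clustered toward the middle
```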

0:15:48 | We do this for different values of P, the number of sensors; twenty realizations of the sensor location placements; ten realizations of the training set per realization of the sensor locations, for different values of the number n of training samples; and one hundred thousand test samples per training set.

0:16:06 | This is the Mahalanobis distance approximation as a function of the number of sensors. There are two lines here, a blue line and a black line, but you can hardly tell, because the approximation is so good; the black line is the approximation.

0:16:19 | As I did not actually point out earlier, δ² here is a linear function of P, and the empirical Mahalanobis distance is essentially indistinguishable from the approximation.

0:16:41 | That was the geometric probability, Poisson point process approximation. Next we plug that approximation into the other approximation, the Raudys et al. approximation for the generalization error of linear discriminant analysis.

0:16:54 | Here the red line is the empirical generalization error, that is, the empirical test error, and the black line is the approximation.

0:17:02 | This is for n equal to one hundred, so the number of training samples is one hundred. We see first of all that the error is minimized when P is approximately fifty, which is what we expected, since the optimum is n/2.

0:17:19 | The approximation is extremely good here as well; the red line and the black line are almost the same.

0:17:27 | um |

0:17:29 | this is the same plot for an equals two hundred um |

0:17:32 | so again the minimum is that uh people's one hundred as we |

0:17:36 | but |

0:17:37 | and the approximation as |

0:17:38 | a quite |

0:17:40 | Here is the case for n equal to one thousand, so one thousand training samples. If we wanted to measure these very tiny error values empirically, error probabilities on the order of ten to the minus ten, we would have to run our simulations for a very, very long time.

0:17:57 | So this is a case where it is actually helpful to have the approximation, in order to understand performance in these low-error regimes as well; we do not need huge empirical experiments, we can just use the black line in general for optimizing the sensor network.

0:18:18 | The other question was about ζ, which depends on the beta parameter of the sensor placement distribution: a parameter of one was uniform placement, less than one was clustered at the edges, and greater than one was clustered in the middle.

0:18:34 | Here is ζ as a function of the beta parameter. We see that a parameter of one, the uniform distribution, is best, and that is also what minimizes the generalization error; so in this case we want the sensors to be placed uniformly when we have this nearest-neighbor dependency.

0:18:50 | In conclusion: time is always a limited resource, so training sets are always finite.

0:18:58 | It is optimal to use half as many sensors as training samples in this sensor network with local Gauss-Markov dependency, and the fact that a finite rather than infinite number of sensors is optimal follows from the phenomenon of overfitting.

0:19:13 | We applied a generalization error approximation that involves the Mahalanobis distance, and we gave an exact expression for that Mahalanobis distance for our Gauss-Markov sensor measurements.

0:19:24 | um |

0:19:25 | and we approximate it using a a a a a to a a probability expression and uh |

0:19:30 | we saw that the uh combined of approximations both um |

0:19:34 | the uh |

0:19:35 | did you much probability one and generalization error one |

0:19:38 | uh together closely matched impair pair coal results and we saw that uniform sensor placement is |

0:19:42 | a a good in this |

0:19:44 | so i'll be happy to take uh some questions |

0:20:03 | [Audience question, partially inaudible: you are assuming that the underlying distributions are Gaussian; if they are not, what is the effect on the generalization error?]

0:20:31 | Right, so as I mentioned, this generalization error approximation is valid when the distributions actually generating the data are Gaussian and have equal prior probabilities. That is what we assumed in what followed.

0:20:56 | If the true likelihood functions were not Gaussian, there would be a different expression for the generalization error, and usually it is very hard to come up with something that matches the actual empirical test error well.

0:21:09 | In the Soviet Union literature I mentioned, there are a few variations on this, but usually the assumption is that the data-generating distributions are Gaussian; otherwise it is very hard to come up with expressions. People have been trying to find such expressions for thirty or forty years, and it is very difficult.

0:21:30 | [Audience question: and you are making an assumption of equal covariances?]

0:21:32 | Yeah, we do not have to do that; that was just to simplify the notation. This paper, or an extended version of it, has formulas for the case where the two covariances are different.

0:21:45 | Thank you.