or thanks

um um

so we have a title might not be obvious of what the stock is about but hopefully by the end

it will be um

so i'm from the uh B M and T J watson research center um

alright so

the topic of the paper is uh dealing with the sensor networks

so uh so

just to get everyone on the same page a um so these are collections of spatially distributed uh nodes to

take measurements and communicate

and we're interested in them for uh detection or classification task

so um

for example and environmental monitoring or surveillance so

the one question all of us are interested in this week "'cause" whether the uh iceland volcano as that stuff

no erupting or not

um

the measurements that um are of various types whatever that temperature sound pressure vibration that cetera

and um these measurements exhibit spatial correlation um

and the the sensor nodes are usually resource constrained

or um so in detection or classification now um so

they're basically the same problem but i one point out the differences um

so in the detection problem uh we're gonna use measurements to make decisions about binary hypotheses and in this case

the likelihood functions of the measurements and the prior probabilities are known

and uh the likelihood ratio test just be up in this case

um

for the problem when you have sensor and works so the distributed detection problem

so been studied quite widely

um with didn't with the classification problem that's specifically with supervised classification problem look can we won use measurements to

make decisions about binary

the C

um in this case the uh the likelihood functions of the measurements and the prior probabilities are known

uh what we are given a L is a labeled training set us so measurement vectors with their true that

C is

and uh machine learning algorithms are applied so as P or an example of linear discriminant analysis was don't talk

about in more detail um this

a very classical technique

um the difference kind of um between detections price classification is uh this idea of over so

you know one of fit do year um your training vectors

that you're given labeled training set too much um

and

uh this distributed suppose classification doesn't have as much brat previous work "'cause" the distributed detection problem

um so practically a um up times we don't have knowledge of that the pair

uh of the probability density a priori

and it's much easier to imagine a situation one which uh training samples can be acquired so that's why we

can sit

uh the classification problem

um so this is just kind of a high level thing in specific what i'm gonna talk but we're not

gonna really considered too much about uh network constraints or on communication or et cetera

um just more or less

the um what you can uh

what you can learn from uh data

and sensor

um so the outline of the talk is that um

or still a few words about the supervised classification problem and uh some words but more in to just learning

theory

um L

go through a linear discriminant analysis and and especially focus on

so approximations to the generalization error that have been developed um

and go over a a gauss markov random field sensor model that we're applying

um

a drive them a all an all this distance for that uh sensor model um and

use that to drive a generalization error um expression for sensor net

um uh provided a metric probability approximation to them in a the all this distance in order to um simplifies

and things and then uh get some simulation results the and

or is just going been going over the notation a bit um so we have a true hypothesis or class

label Y which is either plus one or minus one

so that you i some a noise either or in or not um

we have a noisy measurement vector at which is in R P

um and then we have a joint probability density function F of X Y

um so the axes are the measurements and wise label

uh we wanna learn a decision rule why white had i which is a mapping from the space of measurements

to close one and minus one now where we uh learned this

so a decision rule is in training samples

and one why want to X and Y and

but we don't apply to the training samples and but to unseen samples that X from the same distribution of

of X Y

um the thing were interested in characterising is the generalization error are um the probability that the decision rule will

be wrong on new unseen samples from the same distribution

uh the training air um

is uh we can measure empirically based on the training so um

which is a how many are wrong on the training set itself

um this idea of overfitting and the structural risk minimization principle is that um

as the complexity of the decision rule increases uh the training or we can take to zero but that over

fit of the generalization error the thing we wanna minimize

optimized for

some intermediate complexity level

um and in general uh what's just a school learning theory analysis at least small and out characterisations of a

a generalization error or or are very loose usually

so just one read this a quick quote um which gives this ideas so um one should not be concerned

about the quantitative value of the boundary been about its fundamental farm

but rather about the terms that appear in the bound and in that respect to useful bound is one which

allows us to understand which can be involved in the learning process

and as a result performance bounds would be used for what they're are good for the should not be used

to actually predict the value of the expected error

et cetera

so um

what if we actually are are interested in

developing some um approximations are bounds for generalisation there that we can use an as an optimization criteria

so we can use modern just learning theory results some things like uh V C dimension right a market complexity

et cetera

um and we wanna do this for uh

sensor network design optimization so we wanna have some sort of expression that we can optimize

uh for generalization

um so we turn to the uh former soviet union literature um so they have uh develop very high quality

approximations to generalisation error of uh specifically of linear discriminant analysis

and some of its variations

so um this work was motivated by a or of and the main contributor was uh true honest route

and these expressions that we're developed in this literature are can be used as optimization criterion as we'll see in

some of the simulation results later um

they're very very close a very tight approximations

um

so specifically linear discriminant analysis is this is uh decision rule as i have on the screen um

so why had of X this decision rule um is based on um basically

uh if you had a gaussian distribution um for the two um for the do classes with the different means

and the same or different covariance as well

i what you would do to get the bayes a little but in this case these their sample means a

sample covariance

um um

so

uh one the true likelihood functions are gaussian and the true prior prior probabilities are equal will um

but we don't know this so one we're learning why that from the N training samples than the generalization error

approximation given by row out that all of is um

so the generalization there um

is approximately uh so this five function is the uh C D F of the gaussian distribution and then

we have these terms involving a delta to which is uh the mahalanobis distance which a get to a second

P is the dimensionality um

and then our case it'll be the number of sensors and is the number of training samples

so this is a

is i mean it's not simple but it is very easy to evaluate evaluate

a expression and we'll see how close

um so this the the squared is the money not mahalanobis distance um

so basically a um U plus and me minus or are the means of the two classes uh J minus

in J plus are the inverse covariance matrices

um so that's what we is one of the terms in this expression

and that we want to use uh this

the generalization X uh i error

a approximation channel sensor networks with a special correlation

and as i said before we're not can be concerned in this talk about that were communication or hours

so i within the sensor network set specifically um we have P sensors each a measuring scalar valued measurements X

one through X P

so as a combined measurement vector we have this uh a big X which is an R P

a the sensors are deployed randomly on the plane according to some distribution F se via of V

so the sets of you V is supported on a square with area P

so the region grows as more sensors are place

um the at the generating a likely it functions are gaussian as i said before with um you minus and

um you plus as the means and signal minus and signal classes

yeah covariance

um we model a covariance structure um

so the spatial correlation between sensor measurements using a gauss markov random field model

um and uh got to that in a second um for simplicity will let them you minus P zero at

this you a vector and you plus P the all ones factor

and will like the two covariance is uh be equal um so

call them sit my than the inverse covariance matrices to be

J

so the uh actual um um mark a random feel that we have this with nearest neighbor dependency um

so we construct a euclidean under directed uh and nearest neighbor graph between the

uh sensors

so there's no between sensor I and sensor J or if is then your sabre of J or if J

is the nearest neighbor

i

and uh we do not the edge set of this graph as

E

so um then we use this uh um

graph to do that uh defined a mark a random field covariance um

so that's in three parts uh the diagonal elements of signal or are all equal to a little sigma square

uh the elements of uh signal corresponding to edges in the nearest neighbor graph are are um

basically this a little signal squared times uh G you have and

D which is the distance between sensor I and sensor J

so this G is a monotonically decreasing function to encode a a correlation decay

a and so the farther uh two sensors are the more the less correlated they are so this is known

as a semi variogram gram and

a used to just it

um and then the off diagonal elements of J um corresponding to non edges in the nearest neighbor graph are

all zero

and uh this is a model that has been used for example by uh i don't come are on and

so me and uh some of their work as well

um

so now we wanna get this dealt to square this my an this distance um

so

when we have these simplifying assumptions of um you minus being zero mean plus being one and uh the took

the inverse covariance matrices being equal

and it turns out the the mall in just stuff the squared is just uh the sum of the um

this inverse covariance matrix a

and trees

and uh if we substitute the covariance the expressions for the covariance uh matrices and it as a bit of

algebra than we find that uh

a small distance system to square is equal to up P over sigma squared um so P gone as the

number of sensors

mine is uh to over sigma squared times

the sum of the edge set uh

and uh we have the something of G over one plus G

where the arguments to G again or um

the distances between the sensor

so um

this is an that so we have the stop the squared expression we can plug that in to

the previous expression i had a pure for generalisation error so um

a gives is an exact um

uh a not exact but an approximation in a without the everything defined a for the generalisation there

but if we note is um it this expression for a the squared uh it depends on the actual realisation

of uh the locations of the sensors of you i and V J

um so we wanted kind of do something to integrate that out or to a kind of

i understand what's happening in gen um when we have

without specific we having an instantiation of the sensor

so um

as i said it depends on the particular realisation of the random deployment of sensor location so we want wanna

characterise some sort of average behavior of delta uh across the realisations we want a V P

um

so it turns out that the average behavior functionals of the nearest neighbor graph can be described a using a

average behavior of a margin as on point processes

in work got developed by pen rows and you kids

so as P goes to infinity the number of sensors is that goes to infinity um

a functional uh defined on the edge uh

um based on the nearest neighbor graph for distances uh converges to this other expression but

uh a much as point process not up a poisson point process

so um for us that five function on the left hand side is uh G over one plus G

and um

so if we let the right hand side of the limit be as a eight over to then

the delta squared expression simplifies to pure over sigma square times one minus it

and this uh as they a um can be approximated a using monte carlo simulation very easily because a poisson

point process

easy to generate

alright so um

we now we um have this other new expression for adult the squared which um

we can again can plug into the generalization error expression and get uh

a certain value out um

so basically if we substitute in that a mahalanobis distance approximation in the generalization error approximation we get this expression

up here

and

so um then the question might arise so was the optimal number of sensors for a given number of uh

of training samples let's say

so um we can find P that minimize

uh uh now um approximation so

uh we if we differentiate this with respect to P and set it equal to zero we find that uh

the optimal P is an over two

and it doesn't actually depend on sigma squared and data

um

so that what it tells us that it's not always beneficial to throw down more sensors uh with a fixed

number of training samples

and the reason for that is that uh

the uh phenomena of overfitting happen

um and the minimum minimum generalization error one we set people to over to is uh the other expression on

the slide

um so that a minimum generalization error or um

uh that that expression is monotonically increasing in sigma squared which is expected you'd have more errors there's more noise

monotonically decreasing in and uh as well which is expected so um

as the number of training samples increases the uh

the air yeah decreases

and the interesting thing to note is that it's monotonically increasing in say to

so what that tells us than Z the only depends on the place distribution of the sensors so um we

should choose the set of V of that uh in order to minimize data

so we have that choice one

uh deploying a sensor uh system should all the sensors be clustered in middle of the square should they be

at the edges should they be uniformly placed or

any question like that

um so do not me show you some simulations um

so the overall message that the simulations present is that these are a really good approximations that can be used

for system position

um

and you'll see that

uh the semi very gram G function that we use is a have a you to the minus

distance over two

uh the sensor location distribution that we consider is uh

again support it over the square with area P is um

appropriately scaled and shifted beta distribution that's i D and both of its uh uh spatial dimensions

and both parameters of the beta distribution are taken to be equal um so if you may eight was one

that's a uniform distribution over the square

if bit is greater than one then the sensors are on straight in the middle of the square and if

beta as less than one than at the sensors will be construed at the edges of the square

um so we do this for different values of P the number of sensors uh twenty realisations for of the

sensor location placements uh ten realisations of training set for realisation of the um

sensor locations for different values of the number of training samples and we have

a hundred thousand test samples per training

right so this is the uh mahalanobis distance approximation as a function of the number of sensors so there's two

lines here but

you can see them because the approximation is so good there's a blue line and a black line so the

black line is the approximation and

that um as i i actually didn't point out but um

here we see don't the squared is a linear function of P O

so can um so we see that the uh the approximation is a linear function of P and the uh

empirical a how this distance

is indistinguishable essentially so the approximation is uh

uh the so this is from the geometric probability the poisson point process approximation

uh then we plug that the approximation to the other approximation which was the uh are out a set all

approximation for a generalization error of uh a linear discriminant analysis

so here and the red line is the impure goal general say yeah the empirical that test are actually and

the black as the approximation

so um this is for and "'cause" one hundred us so the number of training samples is one hundred

so we see not first of all that uh

the air is minimised one a a P is a are approximately fifty which is what we wanted it uh

which we saw that it's uh and over two

and um the approximation is extremely good to here as well um

the red line and the black line or uh

almost the same

um

this is the same plot for an equals two hundred um

so again the minimum is that uh people's one hundred as we

but

and the approximation as

a quite

um here's the case for any equals one thousand uh so this is a a one thousand training samples um

if we wanted to empirically get these but a very tiny error um values that ten to the minus ten

error probabilities i would have to run our simulations for a long long time um

so this is the case where it's helpful to actually have the approximation um

in order to understand the uh the performance and these low uh there regimes as well and um

we have don't need to do a you um and P cool things uh

we can just use that black line in general for optimising a sensors

so that we have

um the other question was um

this is a to this um which depends on the beta to which was the uh

distribution of the sensors so it equals one again was a uniform placement beta

less than one was a a clustered at the at isn't it it

was

the middle

so here this uh uh they it as a function of bit are we see that one where and that

we have this uniform distribution is best and that's also um

uh what minimize as the generalization error so

in this case we want the sensors to be placed uniformly when we have this nearest neighbor dependency

um so in conclusion a so time is always a limited resource so uh training sets are always uh a

finite

um it's all optimal to use process we have the uh number of sensors as training samples in a this

uh sensor network with local gauss marco dependency

and the fact that um there's a finite rather than infinite number of sensors that's optimal follows from the from

an of overfitting

uh we drive a generalization error approximation that involves um a obvious distance and we give an exact statement of

that model a little was just as for

are are gas mark of sensor measurements

um

and we approximate it using a a a a a to a a probability expression and uh

we saw that the uh combined of approximations both um

the uh

did you much probability one and generalization error one

uh together closely matched impair pair coal results and we saw that uniform sensor placement is

a a good in this

so i'll be happy to take uh some questions

uh_huh

yeah

using

i

and it is to yeah

uh_huh

i

what i see uh_huh

make it

uh_huh it

she

i

she she two ish

so i uh

i you assuming that

a all these uh

my D V not a if not

like its effect on the generalization

right so as i mentioned um

but for um

so this approximation uh this generalization error approximation is true when

the uh actual generating data that things that distributions to generating the data are actually gaussian and they have equal

prior probability

um so

that's what we used um in following a but um

if the true uh likelihood functions we're not gaussian then there would be a different expression for the generalisation error

um and usually that's very hard to can come up with something that matches the actual empirical test or very

well

um so the soviet union but need should i mention there are a few variations on this but

usually the assumption is that the actual generating data is gaussian otherwise it's very hard to come but expressions

and people been trying to find these expressions for thirty forty years and it's very hard to

uh find this expression

all and making an exemption of E coli

comedians yeah we don't have to do that um that's just further simplify notation um i think you you know

this paper or or an extended version has for um where the two covariance as are different

thank

i