0:00:13 Okay, so, I'm V. In this talk, what we'll do is cover the definition of entropy, maximum entropy methods, and also the alternative methods to what we are going to present in terms of entropy estimation. Then we will show how the principle of maximum entropy can be applied to produce approximations that, as we can show, converge with high probability to the true entropy. So what this paper is about is two entropy estimators and a comparison of their performance.
0:01:14 As everybody knows, the entropy of a continuous random variable, the differential entropy, can be written as H(p) = -E[log p(X)], the negative expected value of the log of the probability density function. The problem we have is simple: we have n samples from that pdf p(x), and the goal is to estimate H(p). Of course, there are a couple of issues. First, the setting is nonparametric: we have no idea what family of distributions p belongs to, so to some extent we have no parametric machinery available for the approximation. The other aspect is that the estimate is based on samples, and of course resampling, or different samples, will give you different values, so due to sampling issues there will be variability in the estimate.
0:02:10 Entropy estimation has been applied in a variety of applications; probably in this conference alone all of these are represented, including image analysis, anomaly detection, and source separation. I don't want to go too deep into how entropy is applied in these methods; as you'll see, this paper is very focused on how to estimate the entropy.
0:02:37 Now, existing methods for entropy estimation. One of the classical methods is called the plug-in method, where the expectation is simply replaced by an average: basically, the density p is replaced by some plugged-in density estimate. For example, one can use a kernel density estimate of p and then average over the samples in order to get an approximation of the expectation.
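A minimal sketch of that kind of plug-in estimator (my own illustration, not the authors' code), assuming scikit-learn's KernelDensity with a hand-picked Gaussian bandwidth:

import numpy as np
from sklearn.neighbors import KernelDensity

def plugin_entropy(samples, bandwidth=0.3):
    """Plug-in entropy estimate: fit a KDE for p, then average -log p_hat over
    the same samples (resubstitution; a leave-one-out variant would be less biased)."""
    x = np.asarray(samples, dtype=float).reshape(-1, 1)   # 1-D data as a column vector
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(x)
    log_p = kde.score_samples(x)                          # log p_hat(x_i) at each sample
    return -np.mean(log_p)                                # H_hat = -(1/n) sum_i log p_hat(x_i)

# example: N(0,1) samples; the true differential entropy is 0.5*log(2*pi*e), about 1.42
rng = np.random.default_rng(0)
print(plugin_entropy(rng.standard_normal(2000)))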
0:03:14 Another alternative method, which is mostly applicable in one dimension, is sample spacing: basically, you use the distribution of the differences between nearest ordered samples in order to approximate the density.
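A sketch of a one-dimensional m-spacing (Vasicek-style) estimator, with the usual clamping of the order-statistic indices at the boundaries; again my own illustration of the idea, not the talk's code:

import numpy as np

def spacing_entropy(samples, m=None):
    """m-spacing entropy estimate for 1-D samples:
    H_hat = (1/n) * sum_i log( n/(2m) * (x_(i+m) - x_(i-m)) )."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    if m is None:
        m = max(1, int(np.sqrt(n)))                    # a common heuristic for the spacing order
    upper = x[np.minimum(np.arange(n) + m, n - 1)]     # x_(i+m), clamped at x_(n)
    lower = x[np.maximum(np.arange(n) - m, 0)]         # x_(i-m), clamped at x_(1)
    return float(np.mean(np.log(n / (2.0 * m) * (upper - lower))))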
0:03:31 Then there are the nearest-neighbor methods: Kozachenko and Leonenko initially presented a one-nearest-neighbor method for estimating the entropy, and in their work they showed convergence results for that estimate to the true entropy.
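One common form of the Kozachenko-Leonenko k-nearest-neighbor estimator, sketched with SciPy (my illustration; the talk gives no code):

import numpy as np
from scipy.special import gamma, digamma
from scipy.spatial import cKDTree

def knn_entropy(samples, k=1):
    """k-NN entropy estimate (one common form):
    H_hat = (d/n) * sum_i log r_i + log V_d + digamma(n) - digamma(k),
    where r_i is the distance from x_i to its k-th nearest neighbor and
    V_d is the volume of the d-dimensional unit ball."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                                 # treat 1-D input as a single column
    n, d = x.shape
    dist, _ = cKDTree(x).query(x, k=k + 1)             # k+1: the nearest point to x_i is x_i itself
    r = dist[:, -1]
    log_unit_ball = (d / 2.0) * np.log(np.pi) - np.log(gamma(d / 2.0 + 1))
    return d * np.mean(np.log(r)) + log_unit_ball + digamma(n) - digamma(k)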
0:03:50 Of course, another method is the histogram approach; I'll skip through that.
0:03:55 So what is the idea we present here? First of all, we are going to apply the maximum entropy principle to estimate the entropy, and the maximum entropy principle works in the following way. The problem is that in most methods the entropy is estimated nonparametrically, and we were wondering: is it possible to estimate it in a parametric form? The approach is as follows. Suppose you have m basis functions, phi_1 through phi_m; those could be, for example, functions you apply to your data, such as the samples themselves or their squares. By taking the expected values of these functions, these moments or phi functions, and setting them to the values measured from the data, you get a set of constraints on the distribution. In other words, you want a distribution whose mean is, for example, the average found in the data, or whose variance is the sample variance of the data; but in general the constraints could be very different from these.
0:05:04 The argument is, of course, that if you set up the optimization in the principle of maximum entropy, maximizing the entropy subject to these constraints plus, of course, the sum-to-one constraint on the distribution, you end up with the well-known maximum entropy model, which is the exponential model: e to the sum of lambda_j phi_j minus a constant, basically a normalizing constant in terms of the lambdas.
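In symbols (my notation, not from the slides), the maximum entropy problem and its solution are:

    \max_{p} \; -\int p(x)\,\log p(x)\,dx
    \quad \text{s.t.} \quad \int p(x)\,\phi_j(x)\,dx = \alpha_j, \;\; j = 1,\dots,m, \qquad \int p(x)\,dx = 1,

    p_\lambda(x) \;=\; \exp\!\Big(\sum_{j=1}^{m} \lambda_j\,\phi_j(x) \;-\; \log Z(\lambda)\Big),
    \qquad Z(\lambda) \;=\; \int \exp\!\Big(\sum_{j=1}^{m} \lambda_j\,\phi_j(x)\Big)\,dx,

where the \alpha_j are the sample moments (1/n)\sum_i \phi_j(X_i) measured from the data.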
0:05:29 The idea here is that we want to use this density estimate the way other plug-in methods use theirs: now that we have a method for finding the density, we want to use that density as part of our entropy estimate. So what is the advantage of this method? The first advantage is that when you have a parametric representation of the density, you can potentially integrate p log p analytically and end up with the entropy in closed form; in that case, up to the normalizing integral Z(lambda), we have the entropy in closed form.
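Concretely, for the exponential model above the plug-in entropy has a closed form in \lambda and Z(\lambda) (a short calculation, assuming the moment constraints are matched exactly):

    H(p_\lambda) \;=\; -\int p_\lambda(x)\,\log p_\lambda(x)\,dx
    \;=\; \log Z(\lambda) \;-\; \sum_{j=1}^{m} \lambda_j\,\mathbb{E}_{p_\lambda}[\phi_j(X)]
    \;=\; \log Z(\lambda) \;-\; \sum_{j=1}^{m} \lambda_j\,\alpha_j,

so once \lambda and the normalizer Z(\lambda) are known, no further integration of \log p is required.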
0:06:08 The other difference between this approach and many other methods is that most methods focus on local estimates of the density, for example estimating the density based on a very small neighborhood. In this case you can argue that the basis functions don't have to be localized, and you can use global functions to estimate the entropy.
0:06:31 So what is the advantage here? If we look at the KL divergence between p and p_lambda, the approximation of p, we can always show that the true entropy is upper bounded by the estimate we are proposing. To some extent that is good news: we know we will never estimate the entropy with a number smaller than the true entropy. Then the question, of course, is that maybe this upper bound is not very tight; so how can we verify that the upper bound is tight?
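The upper-bound argument is a one-line calculation (my sketch, in the notation above): if p_\lambda matches the moments \alpha_j of p, then

    0 \;\le\; D(p \,\|\, p_\lambda)
    \;=\; -H(p) \;-\; \mathbb{E}_{p}\Big[\sum_j \lambda_j\,\phi_j(X) - \log Z(\lambda)\Big]
    \;=\; -H(p) \;+\; \log Z(\lambda) \;-\; \sum_j \lambda_j\,\alpha_j
    \;=\; H(p_\lambda) \;-\; H(p),

so H(p) \le H(p_\lambda) for any feasible \lambda; how tight the bound is depends on how well the basis captures \log p.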
0:07:02 In order to show that the approximation for the density presented on the previous slide can produce very tight estimates of the entropy, we consider the following theorem. Through Weierstrass's theorem we can approximate functions uniformly, to a given accuracy, using only polynomials; that was one of the original papers, and later on there were some generalizations of it.
0:07:29 The idea here is that we are going to think of the approximation that the maximum entropy framework gives us as an approximation of the log of the probability density function, rather than of the probability density function itself. If you look at a lot of the existing estimators, they go directly at estimating the probability density function; in this case, we are estimating the log of the probability. As you can see, one can basically think of representing the log of the probability, to the desired precision, as an integral of phi(x; theta) against a measure over theta, and we can view the approximation that maximum entropy gives us as a discretized version of this integral, basically using basis functions, or point masses in place of the measure, to approximate log p. That is the idea, and the argument is that through the Stone-Weierstrass-type theorem we know this approximation converges uniformly to log p; that is the motivation behind using maximum entropy.
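Schematically (my notation), the representation being described is

    \log p(x) \;\approx\; c \;+\; \int_{\Theta} \phi(x;\theta)\, d\mu(\theta)
    \;\;\approx\;\; c \;+\; \sum_{j=1}^{m} \lambda_j\, \phi(x;\theta_j),

where the measure \mu is replaced by point masses at \theta_1,\dots,\theta_m with weights \lambda_j; the uniform-approximation result cited in the talk is what guarantees that such finite sums can approximate \log p to any given accuracy as m grows.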
0:08:44 The two estimators we propose based on this are, first, the brute-force estimator, and the one I will present next, the greedy m-term approximation. With the brute-force estimator, the idea is as follows: you consider the values of lambda and you select, jointly, the basis functions phi as well as the coefficients along those phis, such that they minimize the entropy estimate. Of course, the optimization with respect to lambda is convex, but with respect to the choice of basis it is not, so this creates an intractable optimization problem. However, this approach can at least be analyzed, and it gives us an upper bound on performance, so we can compare the second method that we present against the performance of this one.
0:09:39 In the paper we present a theorem that describes the accuracy of the method. Without going through the details of the derivation, the idea is that the error can be broken into an approximation error component and an estimation error component. The approximation error component in this particular theorem is obtained from the paper by Andrew Barron, which basically shows that the maximum entropy framework can be used to approximate any distribution. At the same time, we are considering the effect of sampling, and using Hoeffding's inequality one can show that the estimation error is bounded as given on the slide. When you set m to the square root of n, you get the corollary, which says that the error, with probability 1 minus delta, is bounded by a constant times the square root of log n over n, which is not too far from the classical parametric estimation error of one over the square root of n. So it is quite encouraging to know that this method could in principle achieve this kind of accuracy.
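Written out (my paraphrase of the corollary on the slide, constants suppressed): with m \approx \sqrt{n} basis terms, with probability at least 1 - \delta,

    \big|\widehat{H} - H(p)\big| \;\le\; C(\delta)\,\sqrt{\frac{\log n}{n}},

which is only a \sqrt{\log n} factor away from the classical parametric rate 1/\sqrt{n}.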
0:10:49 The alternative to this method, developed primarily because the brute-force version is fairly theoretical and its optimization is not tractable, is a method that achieves similar performance without having to solve that optimization. The idea was to use a greedy estimator, a greedy m-term approximation. The idea is very simple and well known, for example as basis pursuit and related methods in different fields. Here you incrementally approximate log p by adding one term, one basis function, at a time. You start with zero, and in step one you take f_1 to be (1 minus alpha) times the zeroth approximation plus alpha times a new basis function; as you keep going you keep adding basis functions, so after m iterations of the procedure you get an m-term approximation. This procedure, as opposed to the previous one, only involves optimization with respect to two parameters at a time: a single basis function of the data and a single coefficient, basically the equivalent of one lambda at a time.
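To make the procedure concrete, here is a rough one-dimensional sketch of how such a greedy step could look; this is my reading of the description, with a grid-based normalizer and a coarse search over (basis element, mixing weight, coefficient), not the authors' implementation:

import numpy as np
from scipy.special import logsumexp

def greedy_mterm_entropy(samples, basis, m_terms=9,
                         alphas=np.linspace(0.05, 1.0, 20),
                         betas=np.linspace(-5.0, 5.0, 41)):
    """Greedy m-term sketch (1-D): build f_k(x) ~ log p(x) one term at a time,
        f_k = (1 - a) * f_{k-1} + a * b * phi,
    choosing (phi, a, b) at each step to minimize the plug-in entropy estimate
        H_hat(f) = log Z(f) - (1/n) * sum_i f(x_i),  with p_f = exp(f) / Z(f)
    and Z(f) computed numerically on a truncated grid."""
    x = np.asarray(samples, dtype=float)
    pad = 0.5 * (x.max() - x.min())
    grid = np.linspace(x.min() - pad, x.max() + pad, 2000)
    dx = grid[1] - grid[0]

    def entropy_of(f_grid, f_samp):
        log_z = logsumexp(f_grid) + np.log(dx)        # stable log of the Riemann sum for Z(f)
        return log_z - np.mean(f_samp)

    f_grid = np.zeros_like(grid)                      # current approximation of log p on the grid
    f_samp = np.zeros_like(x)                         # ... and at the sample points
    for _ in range(m_terms):
        best_h, best_state = np.inf, None
        for phi in basis:
            pg, ps = phi(grid), phi(x)
            for a in alphas:
                for b in betas:
                    cand = ((1 - a) * f_grid + a * b * pg,
                            (1 - a) * f_samp + a * b * ps)
                    h = entropy_of(*cand)
                    if h < best_h:
                        best_h, best_state = h, cand
        f_grid, f_samp = best_state
    return entropy_of(f_grid, f_samp)

With a polynomial or trigonometric dictionary (see the basis-function question in the Q&A below), each iteration only searches over one basis element and two scalars, which is where the tractability gain over the brute-force estimator comes from.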
0:12:06 To illustrate the idea, here is an example of the procedure. What you have to remember in this particular example is that we have log p(x), but we are not approximating log p(x) knowing p(x); we are approximating log p(x) from its samples, so the samples may not necessarily correspond well to the density. Basically, by adding one term at a time, you provide a more and more accurate approximation of the function; in this case I think we took nine iterations, and you can see how this is done from samples, which gives you a flavor of the method. Now, if you look at the density: one of the nice aspects of this method is that not only do we estimate the entropy, we also get a density estimate. One of the characteristics we noticed about this density estimate is that it often presents a noise floor; basically, it does not allow the density to drop very low in places where you do not have a lot of samples, which in many cases is a problematic artifact of other methods.
0:13:16 For this method we were able to show that the estimation error for the entropy, using the greedy m-term estimator, again behaves like a constant times the square root of log n over n. The performance is very similar; the constants are somewhat different, but the rate of the error is of the same kind as with the brute-force estimator that optimizes all of the parameters jointly.
0:13:48 We tried this method and compared it with other estimators: the kernel density estimator, the histogram, sample spacing, k-nearest neighbors, and our method. As you can see, this was a first attempt to see how we compare against some of the simpler methods, and our method is fairly close to all the others but not particularly better than all of them. We then tried, just to capture the flavor of the approximation, setting up a sample density that is a truncated mixture of Gaussians, hoping that the basis approach to estimating the density would work better, and as you can see here, it seems to outperform all the others.
0:14:34 Of course, I would like to say that these results are anecdotal, because you can always pick a distribution where one method will outperform the others. But the point is to show that the theorems are not incorrect, and that we can actually get accurate estimates, at least to a degree.
0:14:56 Another thing we looked at is the density estimate itself. Once again, the estimate we get for log p, for the density (this was the mixture of five Gaussians), seems to approximate the density well for high values of the density and to present a noise floor for low values of the density; in other words, there are not enough samples to approximate the density in those poorly sampled regions.
0:15:22 We also tested it on an intruder detection dataset provided by another lab, and we wanted to see whether this works on real data, whether we can use it for real data. We simply compared the entropy against an indicator that shows whenever there is an intruding router, and the idea is that entropy can be used as a feature for detecting intruders: basically, whenever the entropy is high, relatively speaking, relative to the other values of the entropy, it correlates with the fact that there is an intruder router present. So we applied this method to that data, primarily to see that it works and that the method is fairly efficient computationally.
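As a purely illustrative sketch of how an entropy trace can be turned into such an indicator (the windowing and the threshold rule here are my assumptions, not details from the talk):

import numpy as np

def flag_high_entropy_windows(entropy_per_window, z_thresh=2.0):
    """Flag time windows whose estimated entropy is high relative to the rest
    of the trace (simple z-score rule) as candidate intruder events."""
    h = np.asarray(entropy_per_window, dtype=float)
    z = (h - h.mean()) / (h.std() + 1e-12)
    return np.flatnonzero(z > z_thresh)                # indices of suspicious windows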
0:16:14 To summarize: we proposed the framework of maximum entropy estimation for estimating entropies. One of the added bonuses of this method is that we also get a density estimate. We showed that using the m-term approximation approach we get an error that is of the order of the square root of log n over n, which is only a square root of log n factor away from the classical parametric estimation error of one over the square root of n. We were able to show through simulations that the method is competitive with other nonparametric entropy estimators. One of the motivations for starting to develop this was to extend it to density estimation, to use this method as a method for density estimation. Going beyond that, the method can be generalized to estimating densities within a family: instead of having one dataset corresponding to one density, one can imagine having n datasets, each corresponding to a different density, all belonging to the same family, and the hope is that this method will allow us to find the basis functions that describe that family.
0:17:37 I realize this is not the focus of the presentation, but could you comment on the choice of basis functions?
0:17:45 Oh, I see. So, I think what makes this work in practice is the following. In this experiment we actually used polynomials and trigonometric functions. In practice we had to go with a finite set, since it is hard to evaluate over an infinite one, but what we did is take a fairly large set of basis functions, and the idea is that in this approximation method, rather than going one basis element at a time, phi_1, then phi_2, phi_3 and so on, you get to choose the phi that best approximates the data at each step. So those are the bases here, but, for example, I have seen other presentations here that consider basis functions obtained with PCA: you take the entire dataset you have and run PCA to obtain, say, ten basis functions, or use Laplacian eigenmaps or other techniques to obtain basis functions, and then you can apply this procedure to estimate the density using that basis.
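As a small illustration of what such a finite dictionary might look like (my example, matching the polynomials-plus-trigonometric description and compatible with the greedy sketch shown earlier):

import numpy as np

def make_dictionary(max_degree=6, max_freq=4):
    """A finite dictionary of global basis functions: low-order monomials plus
    sines and cosines at a few frequencies; the greedy step picks whichever
    element best reduces the entropy estimate at each iteration."""
    basis = [lambda t, k=k: t ** k for k in range(1, max_degree + 1)]
    for w in range(1, max_freq + 1):
        basis.append(lambda t, w=w: np.sin(w * t))
        basis.append(lambda t, w=w: np.cos(w * t))
    return basis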
0:18:43 Actually, a follow-up on that: we have also estimated entropy using the maximum entropy principle, using a number of measuring functions and estimating the parameters, and with that parametrization you can also incorporate some prior information into the model; for example, you do not really need to estimate the density very accurately as long as you are close enough, so this is very much related.

0:19:17 Okay, I will also talk to you afterwards about this, and you can give me the reference. I think the focus here, though, is the greedy m-term approximation, which, rather than going with a fixed basis and estimating the coefficients for all the elements of the basis, essentially looks at a sparse decomposition along that basis.

0:19:35 Right, and for the estimation in ICA it does not need to be that accurate either; the choice of measuring functions is also where prior information can be incorporated.

0:19:59 Okay.