Okay, so in this talk what we'll do is cover the definition of entropy, maximum entropy methods, and alternative methods for entropy estimation. We will then show how the maximum entropy principle can be applied to form approximations that, as we can show, converge with high probability to the true entropy. So what this paper is about is two entropy estimators and a comparison of their performance.

As everybody knows, the entropy of a continuous random variable, the differential entropy, can be written as the negative of the expected value of the log of the probability density function, H(P) = -E[log p(X)]. The problem we have is simple: we have n samples from that pdf p, and the goal is to estimate H(P). Of course there are a couple of issues. First, the problem is nonparametric: we have no idea what family of distributions p belongs to, so to some extent we have no parametric handle on the approximation. Second, we estimate from samples, and resampling, different samples, will give different values, so we have to deal with sampling variability.

Entropy estimation has been applied in a variety of applications; probably in this conference alone all of these are represented, especially in computer vision and image analysis, anomaly detection, and source separation. I don't want to go too deep into how entropy is applied in those settings; as you will see, this paper is very focused on how to estimate H(P).

Among existing entropy estimation methods, one classical approach is the plug-in method, where the expectation is simply replaced by an average and the density p is replaced by some plug-in density estimate. For example, one can use a kernel density estimate of p and then average over the samples to approximate the expectation. Another alternative, more applicable in one dimension, is sample spacing, which uses the distribution of differences between nearest samples to approximate the density. Among the nearest-neighbor methods, Kozachenko and Leonenko initially presented a one-nearest-neighbor estimator of the entropy, and in their work they show convergence of that estimate to the true entropy. Another method, of course, is the histogram approach. I'll skip through these.
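For reference, and not part of the paper's contribution, a minimal sketch of the classical KDE plug-in estimator described above could look like the following; the 1-D samples and scipy's Gaussian kernel density estimator are my own illustrative assumptions.

```python
# For reference only: the classical plug-in baseline described above,
# not the method proposed in the talk. Assumes 1-D samples and scipy.
import numpy as np
from scipy.stats import gaussian_kde

def plugin_entropy_kde(samples):
    """Plug-in estimate of H(P) = -E[log p(X)]: replace p with a kernel
    density estimate and the expectation with a sample average."""
    kde = gaussian_kde(samples)      # plug-in density estimate p_hat
    return -np.mean(np.log(kde(samples)))

# Sanity check: N(0, 1) has differential entropy 0.5*log(2*pi*e) ~ 1.419
rng = np.random.default_rng(0)
print(plugin_entropy_kde(rng.normal(size=2000)))
```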
So what is the idea we present here? First of all, we are going to apply the maximum entropy principle to estimate the entropy. In most methods the entropy is estimated nonparametrically, and our thinking was: is it possible to estimate it parametrically? The approach is as follows. Suppose you have m basis functions φ1 through φm; applied to your data, those could be, for example, polynomials or other simple functions such as squares. By taking the expected values of these moments, or φ functions, and setting them to the values you measure from the data, you obtain a set of constraints on the distribution. In other words, you ask for a distribution whose mean matches the sample average, or whose variance matches the sample variance, although in general the constraints could be quite different from that. If you then set up the optimization, following the maximum entropy principle, to maximize the entropy subject to these constraints, plus of course the sum-to-one constraint on the distribution, you end up with the well-known maximum entropy model, which is the exponential model p_λ(x) = exp(Σ_j λ_j φ_j(x) - log Z(λ)), where log Z(λ) is basically a normalizing constant in terms of λ.

The idea here is that we want to use this density estimate just as other plug-in methods do: we now have a method for finding a density, and we want to use that density as part of our entropy estimate. What is the advantage of this method? First, when you have a parametric representation of the density, you can potentially integrate the log and end up with the entropy in closed form; in this case H = log Z(λ) - Σ_j λ_j α_j, where the α_j are the constraint values, so we have the entropy in closed form. The difference between this approach and many other methods is that most methods focus on local estimates of the density, for example estimating the density from a very small neighborhood; in this case the basis functions do not have to be localized, and you can use global functions to estimate the entropy.

Another advantage: if we look at the KL divergence between p and p_λ, the approximation of p, we can show that the true entropy is always upper bounded by the estimate we propose. That, to some extent, is good news: we know we will never estimate the entropy with a smaller number than the true entropy. The question, of course, is whether this upper bound is tight, and how we can verify that it is.

To show that the density approximation presented on the previous slide can produce very tight estimates of the entropy, we consider the following theorem. Through the Stone-Weierstrass theorem we can approximate functions uniformly to a given accuracy; that is one of the original results, and later there were some generalizations of it. The idea here is that we are going to think of the approximation that the maximum entropy framework gives us as an approximation of the log of the probability density function, rather than of the probability density function itself. If you look at most of the existing estimators, they go directly at estimating the density; in this case we estimate the log of the density. One can think of representing log p(x) as an integral of basis functions, roughly log p(x) = ∫ φ(x; θ) dμ(θ) for some measure μ over the parameters, and the maximum entropy approximation can be viewed as a discretized version of this, basically using a finite set of basis functions, point masses in place of the measure μ, to approximate log p. The argument is that, through the Stone-Weierstrass-type theorems, this approximation converges uniformly to log p, and that is the motivation behind using maximum entropy.
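To make the closed-form entropy point concrete, here is a small 1-D sketch of my own, not code from the paper: for an exponential-family fit p_λ(x) = exp(Σ_j λ_j φ_j(x) - log Z(λ)), the entropy is log Z(λ) - Σ_j λ_j α_j, with the α_j taken as the empirical moments of the basis functions. The bounded support, the simple Riemann sum for Z(λ), and the hand-picked λ in the example are assumptions of the sketch.

```python
# Illustrative sketch only: entropy of a max-ent / exponential-family model
# in closed form, H = log Z(lam) - sum_j lam_j * alpha_j, with alpha_j the
# empirical moments of the basis functions. Support [a, b] and the
# Riemann-sum normalizer are simplifying assumptions.
import numpy as np

def maxent_entropy(samples, basis, lam, a=-5.0, b=5.0, grid_size=10_000):
    xs = np.linspace(a, b, grid_size)
    f = sum(l * phi(xs) for l, phi in zip(lam, basis))    # sum_j lam_j * phi_j(x)
    log_Z = np.log(np.sum(np.exp(f)) * (xs[1] - xs[0]))   # normalizer
    alpha = [np.mean(phi(samples)) for phi in basis]      # empirical moments
    return log_Z - np.dot(lam, alpha)

# Example with basis {x, x^2}; lam is fixed by hand here (a fitted lam would
# match the model moments to the empirical ones).
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
print(maxent_entropy(x, [lambda t: t, lambda t: t**2], lam=[0.0, -0.5]))
```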
So the two estimators we propose based on this are, first, a brute-force estimator and, second, a greedy m-term approximation.

With the brute-force estimator the idea is as follows: you jointly select the basis functions φ as well as the coefficients λ along those φ's such that they minimize the entropy estimate. Of course, the optimization with respect to λ is convex, but with respect to the φ's it is not, so this creates an intractable optimization problem. However, we can at least analyze this approach, and it gives us an upper bound on performance against which we can compare the second method we present.

In the paper we present a theorem that describes the accuracy of this method. Without going through the details of the derivation, the idea is that the error can be broken into an approximation error component and an estimation error component. The approximation error component in this particular theorem is obtained from the paper by Andrew Barron, which basically shows that the maximum entropy framework can be used to approximate any distribution. At the same time we take account of sampling, and using Hoeffding's inequality one can bound the estimation error. When you set m to the square root of n, you get a corollary which says that the error, with probability 1 - δ, is bounded by a constant times the square root of log n over n, which is not far from the classical parametric estimation error of one over the square root of n. So it is quite encouraging to know that this method can in principle achieve that kind of accuracy.

The alternative to this method, primarily because the brute-force estimator is fairly theoretical and its optimization is not tractable, is a method that achieves similar performance without that joint optimization. The idea is to use a greedy m-term approximation; it is very simple and well known, related to basis pursuit and similar methods in different fields. You incrementally approximate log p by adding one term, one basis function, at a time. You start with f_0 = 0, and in step one you take f_1 = (1 - α_1) f_0 + α_1 times a new basis function, and as you keep going you keep adding basis functions, so after m iterations of the procedure you get an m-term approximation. This procedure, as opposed to the previous one, only involves optimization with respect to a single basis function and a single λ-like coefficient at a time.

To illustrate the idea, in this particular example we have log p(x), and we are not approximating log p(x) knowing p(x); we are approximating log p(x) from its samples, so the samples may not necessarily correspond well to the density. By adding one term at a time you provide a more and more accurate approximation of the function; in this case I think we took nine iterations, and you can see how this is done from samples, to get a flavor of the method.
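A rough sketch of the greedy idea, under strong simplifications of my own: 1-D data, a tiny hand-picked dictionary, a coarse grid search for each new coefficient, and a term simply appended to the log-density rather than the (1 - α)·f_{t-1} + α·(new term) update described above. It is meant only to convey the "one basis function at a time" flavor, not to reproduce the paper's procedure.

```python
# Greedy m-term flavor only; not the paper's algorithm. At each step, every
# (basis function, coefficient) pair from a coarse grid is tried, and the one
# that most reduces the closed-form entropy estimate of the current
# exponential-family model is appended to the log-density.
import numpy as np

def entropy_of_model(lam, basis, samples, xs):
    """H_hat = log Z(lam) - sum_j lam_j * (empirical moment of phi_j)."""
    f = np.zeros_like(xs)
    for l, phi in zip(lam, basis):
        f += l * phi(xs)
    log_Z = np.log(np.sum(np.exp(f)) * (xs[1] - xs[0]))
    return log_Z - sum(l * np.mean(phi(samples)) for l, phi in zip(lam, basis))

def greedy_m_term(samples, dictionary, m, a=-5.0, b=5.0, grid_size=4000):
    xs = np.linspace(a, b, grid_size)
    basis, lam = [], []
    coeff_grid = np.linspace(-3.0, 3.0, 61)          # coarse coefficient search
    for _ in range(m):
        best = None
        for phi in dictionary:                       # candidate basis function
            for c in coeff_grid:                     # candidate coefficient
                h = entropy_of_model(lam + [c], basis + [phi], samples, xs)
                if best is None or h < best[0]:
                    best = (h, phi, c)
        _, phi_star, c_star = best
        basis.append(phi_star)                       # add one term to log p_hat
        lam.append(c_star)
    return entropy_of_model(lam, basis, samples, xs)

dictionary = [lambda t: t, lambda t: t**2, lambda t: np.sin(t), lambda t: np.cos(t)]
rng = np.random.default_rng(0)
print(greedy_m_term(rng.normal(size=1000), dictionary, m=3))  # ~1.42 for N(0,1)
```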
Now, if you look at the density: one of the nice aspects of this method is that not only do we estimate the entropy, we also get a density estimate. One characteristic we noticed of this density estimate is that it often presents a kind of noise floor: it does not allow the density to drop to very low values in places where you do not have a lot of samples, which in many cases is a problematic artifact of other methods. For this method we were able to show that the estimation error for the entropy, using the greedy m-term estimator, again behaves like a constant times the square root of log n over n, a very similar performance; the constants differ, but the rate of the error is of the same kind as with the brute-force estimator that optimizes all the parameters jointly.

We tried this method and compared it with other estimators: the kernel density estimator, the histogram, sample spacing, k-nearest neighbor, and our method. This was a first attempt to see how it applies, and on some of the simpler densities our method is fairly close to all the others but not particularly better than any of them. We then ran an experiment designed to capture the flavor of the approximation, by setting up a sample density which is a truncated mixture of Gaussians, hoping that the basis approach to estimating the density would work better, and as we can see here it seems to outperform all the others. Of course I would like to say that these results are anecdotal, because you can always pick a distribution where one method outperforms the others; the point is to show that the theorems are not incorrect and that we can actually get accurate estimates, at least in these cases.

Another thing we looked at is an application. Once again, the estimate we get for log p, for the density, here a mixture of five Gaussians, seems to approximate the density well at high values of the density and to present a noise floor at low values of the density; in other words, there are not enough samples to approximate the density in those poorly sampled regions.

We also tested it on an intruder-detection dataset provided by a lab, and we wanted to see whether this works on real data, whether we can use it for real data. We simply compared the entropy against an indicator that shows whenever there is an intruder. The idea is that entropy can be used as a feature for detecting intruders: whenever the entropy is high, relatively speaking, relative to the other values of the entropy, it correlates with the fact that there is an intruder router there. We tried to apply this method to the data primarily to confirm that it works and that the method is fairly efficient computationally.

To summarize, we proposed a maximum entropy framework to estimate entropies. One of the added values we get from this method is that we also obtain a density estimate. We showed that using the m-term approximation approach we get an error that is of order square root of log n over n, which is only a square root of log n factor away from the classical parametric estimation error of one over the square root of n. We were also able to show through simulations that the method is competitive with other nonparametric entropy estimators.
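Not one of the paper's experiments: just a toy illustration of how such a comparison can be set up, using a target whose differential entropy is known in closed form. The histogram baseline mentioned above is used here; the other baselines (KDE, sample spacing, k-NN) could be dropped into the same loop.

```python
# Toy benchmark only: histogram entropy estimate vs. the known differential
# entropy of N(0,1), H = 0.5*log(2*pi*e) ~ 1.419. The differential entropy is
# recovered from the discrete bin entropy with the usual + log(bin width) term.
import numpy as np

def histogram_entropy(samples, bins=50):
    counts, edges = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    nz = p > 0                                    # skip empty bins in the sum
    return -np.sum(p[nz] * np.log(p[nz])) + np.log(edges[1] - edges[0])

rng = np.random.default_rng(0)
true_H = 0.5 * np.log(2 * np.pi * np.e)
for n in (100, 1000, 10000):
    print(n, round(histogram_entropy(rng.normal(size=n)), 3), round(true_H, 3))
```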
One of the motivations for starting to develop this was to extend it to density estimation, to use this method as a method for density estimation, and going beyond that, the method can also be generalized to estimating densities within a family. That is, instead of having one dataset corresponding to one density, imagine you have N datasets, each corresponding to a different density, all belonging to the same family; the hope is that this method will allow you to find the basis functions that describe that family. But that is not the focus of this presentation.

Q: Could you comment on the choice of basis functions?

A: Sure. What makes this work in practice: in these experiments we actually used polynomials and trigonometric functions. In practice we have to work with a finite dictionary, even though the result holds over an infinite one, so what we did is take a fairly large set of basis functions, and the idea in this approximation method is that rather than going through the basis one element at a time, φ1, then φ2, then φ3, and so on, at each step you get to choose the φ that best approximates the data. So those are the bases we used. But, for example, I have seen other presentations here that consider basis functions obtained differently; some use PCA, where you take the entire dataset you have and run PCA to get, say, ten basis functions, or Laplacian eigenmaps or other techniques to obtain basis functions, and then you can apply this procedure to estimate the density using that basis.

Q: Actually, a follow-up on that. We have also estimated entropy using a maximum entropy principle, with a number of measuring functions and fitting their parameters, and through that choice of measuring functions you can also incorporate some prior information about the model, and so on. For example, you do not really need to estimate the density very accurately as long as you are close enough, so this seems very much related.

A: Okay, I should talk to you afterwards about this, and you can give me the reference. I think the difference here, though, is that the focus is the greedy m-term approximation: rather than fixing a basis and estimating the coefficients for all the elements of that basis, you are essentially looking for a sparse decomposition over that basis.

Q: Right, and I liked that. We wanted it for entropy estimation in ICA, where it does not need to be very accurate; and the choice of measuring functions is also where you can incorporate prior information.

A: Okay. Thank you.
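To make the dictionary mentioned in the last answer concrete, here is a minimal sketch of a polynomial-plus-trigonometric dictionary for 1-D data; the particular degrees and frequencies are illustrative assumptions, not the ones used in the talk's experiments.

```python
# Illustrative dictionary of candidate basis functions (polynomials plus
# sines/cosines); a greedy procedure would pick from such a set one term
# at a time. Degrees/frequencies here are arbitrary choices for the sketch.
import numpy as np

def make_dictionary(max_degree=4, max_frequency=4):
    basis = []
    for d in range(1, max_degree + 1):            # monomials t, t^2, ..., t^d
        basis.append(lambda t, d=d: t**d)
    for k in range(1, max_frequency + 1):         # trigonometric terms
        basis.append(lambda t, k=k: np.sin(k * np.pi * t))
        basis.append(lambda t, k=k: np.cos(k * np.pi * t))
    return basis

print(len(make_dictionary()))   # 4 monomials + 2*4 trig terms = 12 candidates
```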