0:00:13 Okay, so, I'm V. In this talk, what we'll do is cover the definition of entropy, maximum entropy methods, and also the alternative methods to what we are going to present in terms of entropy estimation. Then we will show how the principle of maximum entropy can be applied to produce approximations that, as we can show, converge with high probability to the true entropy. So what this paper is about is two entropy estimators and a comparison of their performance.
0:01:14 As everybody knows, the entropy of a continuous random variable, the differential entropy, can be written as H(p) = -E[log p(X)], the negative expected value of the log of the probability density function. The problem we have is simple: we have n samples from that pdf p(x), and the goal is to estimate H(p). Of course, there are a couple of issues. First, the setting is nonparametric: we have no idea what family of distributions p belongs to, so to some extent we have no parametric machinery available for the approximation. The other aspect is that the estimate is based on samples, and of course resampling, or different samples, will give you different values, so due to sampling issues there will be variability in the estimate.
0:02:10 Entropy estimation has been applied in a variety of applications; probably in this conference alone all of these are represented, including image analysis, anomaly detection, and source separation. I don't want to go too deep into how entropy is applied in these methods; as you'll see, this paper is very focused on how to estimate the entropy.
0:02:37 Now, existing methods for entropy estimation. One of the classical methods is called the plug-in method, where the expectation is simply replaced by an average: basically, the density p is replaced by some plugged-in density estimate. For example, one can use a kernel density estimate of p and then average over the samples in order to get an approximation of the expectation.
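A minimal sketch of that kind of plug-in estimator (my own illustration, not the authors' code), assuming scikit-learn's KernelDensity with a hand-picked Gaussian bandwidth:

import numpy as np
from sklearn.neighbors import KernelDensity

def plugin_entropy(samples, bandwidth=0.3):
    """Plug-in entropy estimate: fit a KDE for p, then average -log p_hat over
    the same samples (resubstitution; a leave-one-out variant would be less biased)."""
    x = np.asarray(samples, dtype=float).reshape(-1, 1)   # 1-D data as a column vector
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(x)
    log_p = kde.score_samples(x)                          # log p_hat(x_i) at each sample
    return -np.mean(log_p)                                # H_hat = -(1/n) sum_i log p_hat(x_i)

# example: N(0,1) samples; the true differential entropy is 0.5*log(2*pi*e), about 1.42
rng = np.random.default_rng(0)
print(plugin_entropy(rng.standard_normal(2000)))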
0:03:14 Another alternative method, which is mostly applicable in one dimension, is sample spacing: basically, you use the distribution of the differences between nearest ordered samples in order to approximate the density.
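A sketch of a one-dimensional m-spacing (Vasicek-style) estimator, with the usual clamping of the order-statistic indices at the boundaries; again my own illustration of the idea, not the talk's code:

import numpy as np

def spacing_entropy(samples, m=None):
    """m-spacing entropy estimate for 1-D samples:
    H_hat = (1/n) * sum_i log( n/(2m) * (x_(i+m) - x_(i-m)) )."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    if m is None:
        m = max(1, int(np.sqrt(n)))                    # a common heuristic for the spacing order
    upper = x[np.minimum(np.arange(n) + m, n - 1)]     # x_(i+m), clamped at x_(n)
    lower = x[np.maximum(np.arange(n) - m, 0)]         # x_(i-m), clamped at x_(1)
    return float(np.mean(np.log(n / (2.0 * m) * (upper - lower))))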
0:03:31 Then there are the nearest-neighbor methods: Kozachenko and Leonenko initially presented a one-nearest-neighbor method for estimating the entropy, and in their work they showed convergence results for that estimate to the true entropy.
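One common form of the Kozachenko-Leonenko k-nearest-neighbor estimator, sketched with SciPy (my illustration; the talk gives no code):

import numpy as np
from scipy.special import gamma, digamma
from scipy.spatial import cKDTree

def knn_entropy(samples, k=1):
    """k-NN entropy estimate (one common form):
    H_hat = (d/n) * sum_i log r_i + log V_d + digamma(n) - digamma(k),
    where r_i is the distance from x_i to its k-th nearest neighbor and
    V_d is the volume of the d-dimensional unit ball."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                                 # treat 1-D input as a single column
    n, d = x.shape
    dist, _ = cKDTree(x).query(x, k=k + 1)             # k+1: the nearest point to x_i is x_i itself
    r = dist[:, -1]
    log_unit_ball = (d / 2.0) * np.log(np.pi) - np.log(gamma(d / 2.0 + 1))
    return d * np.mean(np.log(r)) + log_unit_ball + digamma(n) - digamma(k)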
0:03:50 Of course, another method is the histogram approach; I'll skip through that.
0:03:55 So what is the idea we present here? First of all, we are going to apply the maximum entropy principle to estimate the entropy, and the maximum entropy principle works in the following way. The problem is that in most methods the entropy is estimated nonparametrically, and we were wondering: is it possible to estimate it in a parametric form? The approach is as follows. Suppose you have m basis functions, phi_1 through phi_m; those could be, for example, functions you apply to your data, such as the samples themselves or their squares. By taking the expected values of these functions, these moments or phi functions, and setting them to the values measured from the data, you get a set of constraints on the distribution. In other words, you want a distribution whose mean is, for example, the average found in the data, or whose variance is the sample variance of the data; but in general the constraints could be very different from these.
0:05:04 The argument is, of course, that if you set up the optimization in the principle of maximum entropy, maximizing the entropy subject to these constraints plus, of course, the sum-to-one constraint on the distribution, you end up with the well-known maximum entropy model, which is the exponential model: e to the sum of lambda_j phi_j minus a constant, basically a normalizing constant in terms of the lambdas.
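In symbols (my notation, not from the slides), the maximum entropy problem and its solution are:

    \max_{p} \; -\int p(x)\,\log p(x)\,dx
    \quad \text{s.t.} \quad \int p(x)\,\phi_j(x)\,dx = \alpha_j, \;\; j = 1,\dots,m, \qquad \int p(x)\,dx = 1,

    p_\lambda(x) \;=\; \exp\!\Big(\sum_{j=1}^{m} \lambda_j\,\phi_j(x) \;-\; \log Z(\lambda)\Big),
    \qquad Z(\lambda) \;=\; \int \exp\!\Big(\sum_{j=1}^{m} \lambda_j\,\phi_j(x)\Big)\,dx,

where the \alpha_j are the sample moments (1/n)\sum_i \phi_j(X_i) measured from the data.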
0:05:29 The idea here is that we want to use this density estimate the way other plug-in methods use theirs: now that we have a method for finding the density, we want to use that density as part of our entropy estimate. So what is the advantage of this method? The first advantage is that when you have a parametric representation of the density, you can potentially integrate p log p analytically and end up with the entropy in closed form; in that case, up to the normalizing integral Z(lambda), we have the entropy in closed form.
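Concretely, for the exponential model above the plug-in entropy has a closed form in \lambda and Z(\lambda) (a short calculation, assuming the moment constraints are matched exactly):

    H(p_\lambda) \;=\; -\int p_\lambda(x)\,\log p_\lambda(x)\,dx
    \;=\; \log Z(\lambda) \;-\; \sum_{j=1}^{m} \lambda_j\,\mathbb{E}_{p_\lambda}[\phi_j(X)]
    \;=\; \log Z(\lambda) \;-\; \sum_{j=1}^{m} \lambda_j\,\alpha_j,

so once \lambda and the normalizer Z(\lambda) are known, no further integration of \log p is required.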
0:06:08 The other difference between this approach and many other methods is that most methods focus on local estimates of the density, for example estimating the density based on a very small neighborhood. In this case you can argue that the basis functions don't have to be localized, and you can use global functions to estimate the entropy.
0:06:31 So what is the advantage here? If we look at the KL divergence between p and p_lambda, the approximation of p, we can always show that the true entropy is upper bounded by the estimate we are proposing. To some extent that is good news: we know we will never estimate the entropy with a number smaller than the true entropy. Then the question, of course, is that maybe this upper bound is not very tight; so how can we verify that the upper bound is tight?
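The upper-bound argument is a one-line calculation (my sketch, in the notation above): if p_\lambda matches the moments \alpha_j of p, then

    0 \;\le\; D(p \,\|\, p_\lambda)
    \;=\; -H(p) \;-\; \mathbb{E}_{p}\Big[\sum_j \lambda_j\,\phi_j(X) - \log Z(\lambda)\Big]
    \;=\; -H(p) \;+\; \log Z(\lambda) \;-\; \sum_j \lambda_j\,\alpha_j
    \;=\; H(p_\lambda) \;-\; H(p),

so H(p) \le H(p_\lambda) for any feasible \lambda; how tight the bound is depends on how well the basis captures \log p.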
0:07:02 In order to show that the approximation for the density presented on the previous slide can produce very tight estimates of the entropy, we consider the following theorem. Through Weierstrass's theorem we can approximate functions uniformly, to a given accuracy, using only polynomials; that was one of the original papers, and later on there were some generalizations of it.
0:07:29 The idea here is that we are going to think of the approximation that the maximum entropy framework gives us as an approximation of the log of the probability density function, rather than of the probability density function itself. If you look at a lot of the existing estimators, they go directly at estimating the probability density function; in this case, we are estimating the log of the probability. As you can see, one can basically think of representing the log of the probability, to the desired precision, as an integral of phi(x; theta) against a measure over theta, and we can view the approximation that maximum entropy gives us as a discretized version of this integral, basically using basis functions, or point masses in place of the measure, to approximate log p. That is the idea, and the argument is that through the Stone-Weierstrass-type theorem we know this approximation converges uniformly to log p; that is the motivation behind using maximum entropy.
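Schematically (my notation), the representation being described is

    \log p(x) \;\approx\; c \;+\; \int_{\Theta} \phi(x;\theta)\, d\mu(\theta)
    \;\;\approx\;\; c \;+\; \sum_{j=1}^{m} \lambda_j\, \phi(x;\theta_j),

where the measure \mu is replaced by point masses at \theta_1,\dots,\theta_m with weights \lambda_j; the uniform-approximation result cited in the talk is what guarantees that such finite sums can approximate \log p to any given accuracy as m grows.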
0:08:44 The two estimators we propose based on this are, first, the brute-force estimator, and the one I will present next, the greedy m-term approximation. With the brute-force estimator, the idea is as follows: you consider the values of lambda and you select, jointly, the basis functions phi as well as the coefficients along those phis, such that they minimize the entropy estimate. Of course, the optimization with respect to lambda is convex, but with respect to the choice of basis it is not, so this creates an intractable optimization problem. However, this approach can at least be analyzed, and it gives us an upper bound on performance, so we can compare the second method that we present against the performance of this one.
0:09:39 In the paper we present a theorem that describes the accuracy of the method. Without going through the details of the derivation, the idea is that the error can be broken into an approximation error component and an estimation error component. The approximation error component in this particular theorem is obtained from the paper by Andrew Barron, which basically shows that the maximum entropy framework can be used to approximate any distribution. At the same time, we are considering the effect of sampling, and using Hoeffding's inequality one can show that the estimation error is bounded as given on the slide. When you set m to the square root of n, you get the corollary, which says that the error, with probability 1 minus delta, is bounded by a constant times the square root of log n over n, which is not too far from the classical parametric estimation error of one over the square root of n. So it is quite encouraging to know that this method could in principle achieve this kind of accuracy.
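Written out (my paraphrase of the corollary on the slide, constants suppressed): with m \approx \sqrt{n} basis terms, with probability at least 1 - \delta,

    \big|\widehat{H} - H(p)\big| \;\le\; C(\delta)\,\sqrt{\frac{\log n}{n}},

which is only a \sqrt{\log n} factor away from the classical parametric rate 1/\sqrt{n}.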
0:10:49 The alternative to this method, developed primarily because the brute-force version is fairly theoretical and its optimization is not tractable, is a method that achieves similar performance without having to solve that optimization. The idea was to use a greedy estimator, a greedy m-term approximation. The idea is very simple and well known, for example as basis pursuit and related methods in different fields. Here you incrementally approximate log p by adding one term, one basis function, at a time. You start with zero, and in step one you take f_1 to be (1 minus alpha) times the zeroth approximation plus alpha times a new basis function; as you keep going you keep adding basis functions, so after m iterations of the procedure you get an m-term approximation. This procedure, as opposed to the previous one, only involves optimization with respect to two parameters at a time: a single basis function of the data and a single coefficient, basically the equivalent of one lambda at a time.
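To make the procedure concrete, here is a rough one-dimensional sketch of how such a greedy step could look; this is my reading of the description, with a grid-based normalizer and a coarse search over (basis element, mixing weight, coefficient), not the authors' implementation:

import numpy as np
from scipy.special import logsumexp

def greedy_mterm_entropy(samples, basis, m_terms=9,
                         alphas=np.linspace(0.05, 1.0, 20),
                         betas=np.linspace(-5.0, 5.0, 41)):
    """Greedy m-term sketch (1-D): build f_k(x) ~ log p(x) one term at a time,
        f_k = (1 - a) * f_{k-1} + a * b * phi,
    choosing (phi, a, b) at each step to minimize the plug-in entropy estimate
        H_hat(f) = log Z(f) - (1/n) * sum_i f(x_i),  with p_f = exp(f) / Z(f)
    and Z(f) computed numerically on a truncated grid."""
    x = np.asarray(samples, dtype=float)
    pad = 0.5 * (x.max() - x.min())
    grid = np.linspace(x.min() - pad, x.max() + pad, 2000)
    dx = grid[1] - grid[0]

    def entropy_of(f_grid, f_samp):
        log_z = logsumexp(f_grid) + np.log(dx)        # stable log of the Riemann sum for Z(f)
        return log_z - np.mean(f_samp)

    f_grid = np.zeros_like(grid)                      # current approximation of log p on the grid
    f_samp = np.zeros_like(x)                         # ... and at the sample points
    for _ in range(m_terms):
        best_h, best_state = np.inf, None
        for phi in basis:
            pg, ps = phi(grid), phi(x)
            for a in alphas:
                for b in betas:
                    cand = ((1 - a) * f_grid + a * b * pg,
                            (1 - a) * f_samp + a * b * ps)
                    h = entropy_of(*cand)
                    if h < best_h:
                        best_h, best_state = h, cand
        f_grid, f_samp = best_state
    return entropy_of(f_grid, f_samp)

With a polynomial or trigonometric dictionary (see the basis-function question in the Q&A below), each iteration only searches over one basis element and two scalars, which is where the tractability gain over the brute-force estimator comes from.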
0:12:06 To illustrate the idea, here is an example of the procedure. What you have to remember in this particular example is that we have log p(x), but we are not approximating log p(x) knowing p(x); we are approximating log p(x) from its samples, so the samples may not necessarily correspond well to the density. Basically, by adding one term at a time, you provide a more and more accurate approximation of the function; in this case I think we took nine iterations, and you can see how this is done from samples, which gives you a flavor of the method. Now, if you look at the density: one of the nice aspects of this method is that not only do we estimate the entropy, we also get a density estimate. One of the characteristics we noticed about this density estimate is that it often presents a noise floor; basically, it does not allow the density to drop very low in places where you do not have a lot of samples, which in many cases is a problematic artifact of other methods.
0:13:16 For this method we were able to show that the estimation error for the entropy, using the greedy m-term estimator, again behaves like a constant times the square root of log n over n. The performance is very similar; the constants are somewhat different, but the rate of the error is of the same kind as with the brute-force estimator that optimizes all of the parameters jointly.
0:13:48 We tried this method and compared it with other estimators: the kernel density estimator, the histogram, sample spacing, k-nearest neighbors, and our method. As you can see, this was a first attempt to see how we compare against some of the simpler methods, and our method is fairly close to all the others but not particularly better than all of them. We then tried, just to capture the flavor of the approximation, setting up a sample density that is a truncated mixture of Gaussians, hoping that the basis approach to estimating the density would work better, and as you can see here, it seems to outperform all the others.
0:14:34 Of course, I would like to say that these results are anecdotal, because you can always pick a distribution where one method will outperform the others. But the point is to show that the theorems are not incorrect, and that we can actually get accurate estimates, at least to a degree.
0:14:56 Another thing we looked at is the density estimate itself. Once again, the estimate we get for log p, for the density (this was the mixture of five Gaussians), seems to approximate the density well for high values of the density and to present a noise floor for low values of the density; in other words, there are not enough samples to approximate the density in those poorly sampled regions.
0:15:22 We also tested it on an intruder detection dataset provided by another lab, and we wanted to see whether this works on real data, whether we can use it for real data. We simply compared the entropy against an indicator that shows whenever there is an intruding router, and the idea is that entropy can be used as a feature for detecting intruders: basically, whenever the entropy is high, relatively speaking, relative to the other values of the entropy, it correlates with the fact that there is an intruder router present. So we applied this method to that data, primarily to see that it works and that the method is fairly efficient computationally.
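As a purely illustrative sketch of how an entropy trace can be turned into such an indicator (the windowing and the threshold rule here are my assumptions, not details from the talk):

import numpy as np

def flag_high_entropy_windows(entropy_per_window, z_thresh=2.0):
    """Flag time windows whose estimated entropy is high relative to the rest
    of the trace (simple z-score rule) as candidate intruder events."""
    h = np.asarray(entropy_per_window, dtype=float)
    z = (h - h.mean()) / (h.std() + 1e-12)
    return np.flatnonzero(z > z_thresh)                # indices of suspicious windows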
0:16:14 To summarize: we proposed the framework of maximum entropy estimation for estimating entropies. One of the added bonuses of this method is that we also get a density estimate. We showed that using the m-term approximation approach we get an error that is of the order of the square root of log n over n, which is only a square root of log n factor away from the classical parametric estimation error of one over the square root of n. We were able to show through simulations that the method is competitive with other nonparametric entropy estimators. One of the motivations for starting to develop this was to extend it to density estimation, to use this method as a method for density estimation. Going beyond that, the method can be generalized to estimating densities within a family: instead of having one dataset corresponding to one density, one can imagine having n datasets, each corresponding to a different density, all belonging to the same family, and the hope is that this method will allow us to find the basis functions that describe that family.
0:17:37 I realize this is not the focus of the presentation, but could you comment on the choice of basis functions?
0:17:45 Oh, I see. So, I think what makes this work in practice is the following. In this experiment we actually used polynomials and trigonometric functions. In practice we had to go with a finite set, since it is hard to evaluate over an infinite one, but what we did is take a fairly large set of basis functions, and the idea is that in this approximation method, rather than going one basis element at a time, phi_1, then phi_2, phi_3 and so on, you get to choose the phi that best approximates the data at each step. So those are the bases here, but, for example, I have seen other presentations here that consider basis functions obtained with PCA: you take the entire dataset you have and run PCA to obtain, say, ten basis functions, or use Laplacian eigenmaps or other techniques to obtain basis functions, and then you can apply this procedure to estimate the density using that basis.
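As a small illustration of what such a finite dictionary might look like (my example, matching the polynomials-plus-trigonometric description and compatible with the greedy sketch shown earlier):

import numpy as np

def make_dictionary(max_degree=6, max_freq=4):
    """A finite dictionary of global basis functions: low-order monomials plus
    sines and cosines at a few frequencies; the greedy step picks whichever
    element best reduces the entropy estimate at each iteration."""
    basis = [lambda t, k=k: t ** k for k in range(1, max_degree + 1)]
    for w in range(1, max_freq + 1):
        basis.append(lambda t, w=w: np.sin(w * t))
        basis.append(lambda t, w=w: np.cos(w * t))
    return basis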
0:18:43 Actually, a follow-up on that: we have also estimated entropy using the maximum entropy principle, using a number of measuring functions and estimating the parameters, and with that parametrization you can also incorporate some prior information into the model; for example, you do not really need to estimate the density very accurately as long as you are close enough, so this is very much related.

0:19:17 Okay, I will also talk to you afterwards about this, and you can give me the reference. I think the focus here, though, is the greedy m-term approximation, which, rather than going with a fixed basis and estimating the coefficients for all the elements of the basis, essentially looks at a sparse decomposition along that basis.

0:19:35 Right, and for the estimation in ICA it does not need to be that accurate either; the choice of measuring functions is also where prior information can be incorporated.

0:19:59 Okay.