
The definition of differential (or continuous) entropy is problematic: differential entropy can be negative, can diverge, and is not invariant with respect to linear transformations of coordinates. As far as I know, differential entropy is obtained by analogy with the Shannon entropy, replacing a summation by an integration. So what is the exact meaning of differential entropy? Is there an alternative formulation which is non-negative and does not suffer from the weaknesses mentioned above?

To be more precise, consider the differential entropy defined as \begin{equation} h(f)=\lim_{\Delta \to 0} (H^{\Delta} + \log \Delta)=-\int_{-\infty}^{+\infty} f(x) \log f(x)\, dx \end{equation} where \begin{equation} H^{\Delta}=-\sum_{j=-\infty}^{+\infty} f(x_{j})\, \Delta \log(f(x_{j})\, \Delta). \end{equation} Since $H^{\Delta}$ is the Shannon entropy of the discretised variable, its limit as $\Delta \to 0$ would be the natural definition of continuous entropy. However, the term $\log \Delta$ diverges as $\Delta \to 0$, so the differential entropy defined above differs from that limit by an infinite offset.

I have the impression that this formulation of differential entropy has no solid physical foundation and was derived mathematically without concern for its meaning. The fact that the entropy can become infinitely large puzzles me: if that were possible, I would infer that a continuous distribution can carry an infinite amount of information, which does not seem possible to me. And this is not the only problem with the definition. If we accept this expression, the entropy of a random variable with a continuous distribution can be negative, whereas the entropy of a discrete distribution is always positive or zero. Moreover, the entropy of a continuous system is not preserved under a change of the coordinate system, in contrast to the discrete case.

In conclusion, given the problems above, it seems strange to me that there is no other expression for the differential entropy that solves them. For example, if we consider densities $f(x)$ that vanish outside a finite support $[a,b]$, it is possible to redefine the differential entropy as \begin{equation} h'(f)=-\int_{a}^{b} f(x) \log((b-a) f(x))\, dx. \end{equation} This formulation solves the previously mentioned problems, but is valid only when $a$ and $b$ are finite and $b > a$.
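To make the comparison concrete, here is a minimal numerical sketch (Python with numpy; the triangular density and the factor-5 rescaling are just illustrative choices of mine): under the linear map $y = kx$ the differential entropy $h$ shifts by $\log k$, while the support-normalised $h'$ does not.

```python
import numpy as np

def h_and_hprime(f, a, b, n=200_000):
    """Midpoint Riemann sums for h(f) = -int f log f dx and
    h'(f) = -int f log((b-a) f) dx on the support [a, b]."""
    dx = (b - a) / n
    x = np.linspace(a, b, n, endpoint=False) + dx / 2
    fx = f(x)
    fx = fx[fx > 0]                        # 0 * log(0) contributes nothing
    h = -np.sum(fx * np.log(fx)) * dx
    hprime = -np.sum(fx * np.log((b - a) * fx)) * dx
    return h, hprime

# hand-rolled triangular density on [0, 1] with mode 0.3 (illustrative example)
a, b, c = 0.0, 1.0, 0.3
tri = lambda x: np.where(x < c, 2 * x / ((b - a) * (c - a)),
                         2 * (b - x) / ((b - a) * (b - c)))

# the same shape stretched onto [0, 5] by the linear map y = 5 x
k = 5.0
tri_k = lambda y: tri(y / k) / k

print(h_and_hprime(tri, a, b))              # h = h' here since b - a = 1
print(h_and_hprime(tri_k, k * a, k * b))    # h shifts by log k = log 5; h' does not
```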

@hyportnex: First of all, thanks for the suggestion. Jaynes, in his Brandeis lectures, subtracts the infinite term, making the remaining function finite and invariant. This approach is equivalent to the one I used when I multiplied $f(x)$ by $(b-a)$ and replaced $\log f(x)$ with $\log((b-a) f(x))$. However, I am not at all convinced that this approach is correct. For example, consider a triangular probability density function defined on the support $[a,b]$ with mode $c$. Its differential entropy is \begin{equation} \frac{1}{2}-\log(2)+\log(b-a) \end{equation} (see, for example, Triangular PDF). With the Jaynes formulation the entropy is indeed preserved under a change of coordinates, but only because it reduces to a constant independent of the support $[a,b]$: every triangular distribution then has the same entropy.
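To check this point numerically, here is a small Python/scipy sketch (the three supports are arbitrary examples): the differential entropy of every triangular density matches $\frac{1}{2}-\log(2)+\log(b-a)$, so subtracting $\log(b-a)$ always leaves the same constant $\frac{1}{2}-\log(2)\approx -0.19$, whatever $a$, $b$ and $c$ are.

```python
import numpy as np
from scipy.stats import triang

def diff_entropy(pdf, a, b, n=400_000):
    """Midpoint Riemann sum of -int f log f over [a, b]."""
    dx = (b - a) / n
    x = np.linspace(a, b, n, endpoint=False) + dx / 2
    fx = pdf(x)
    fx = fx[fx > 0]
    return -np.sum(fx * np.log(fx)) * dx

for a, b, c in [(0.0, 1.0, 0.3), (2.0, 10.0, 7.0), (-4.0, 4.0, 0.0)]:
    pdf = lambda x: triang.pdf(x, c=(c - a) / (b - a), loc=a, scale=b - a)
    h = diff_entropy(pdf, a, b)
    print(h,                                 # numerical differential entropy
          0.5 - np.log(2) + np.log(b - a),   # closed form quoted above
          h - np.log(b - a))                 # Jaynes-style h': always 1/2 - log 2
```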

@dgamma: The term $\log \Delta x$ is nothing more than the term I called $\log \Delta$. From what you wrote I conclude that entropy is always defined relative to something. For example, relative entropy reduces to entropy when the function $q(x)$ is equal to one, i.e. in the case of a uniform distribution. The fact that entropy is a relative measure is not in itself a problem. However, if entropy is a measure of information, then information itself must be relative in nature. I find this rather surprising, for the following reason. Information about an object is conveyed by knowledge of its properties, but also by the absence of properties (which implies absence with respect to something else), and I have trouble reconciling this interpretation with the one proposed by Kolmogorov. Kolmogorov's theory focuses on the description of an object and tries to define the minimal description of that object. In other words, Kolmogorov, in determining the number of bits in the final compressed version, considers only the object itself, regardless of how the object was generated. Shannon, on the other hand, considers only the properties of the random source of which the object is one of the possible outcomes. There is a connection between Shannon's and Kolmogorov's descriptions: it can be shown that the entropy of a random variable $X$ equals, up to an additive constant, the expected value of its Kolmogorov complexity. We can then write \begin{equation} E(K)=h(X)+\log{c} \end{equation} where $c$ is a positive constant. Thus, the expected value of the Kolmogorov complexity can take arbitrarily large positive or negative values, depending on the choice of $c$. If the entropy $H$ is defined with respect to a uniform distribution ($q(x)=1$), I infer that there must be a relative Kolmogorov complexity, similar to the KL divergence, such that when $q(x)=1$ the expression above holds. Can you tell me if such an expression exists? Moreover, the fact that entropy diverges would lead to the existence of objects of infinite complexity, which seems rather odd to me.

Upax

2 Answers


Recall that the definition of differential entropy is: $$ H=-\int \rho\ln\rho\, dx $$ with $\rho$ a pdf, normalised as: $$ \int\rho\, dx=1 $$ For a finite domain, you should see it through the KL divergence with respect to the uniform distribution: $$ \begin{align} H'&= -\int \rho\ln(\rho |\Omega|)\,dx \\ &= H-\ln |\Omega| \end{align} $$ with $|\Omega|=\int_\Omega dx$ the measure of the domain. Here $-H'$ is exactly the KL divergence from the uniform distribution, so it is always non-negative, with equality iff $\rho$ is the uniform measure. A general coordinate change would change the uniform measure, which is why the KL divergence changes as well. Dimensional analysis also shows that there is an issue in the formula for $H$, because $\rho$ is a dimensionful quantity inside the logarithm.
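As a small numerical illustration (a Python/scipy sketch; the beta densities are arbitrary examples), on $\Omega=[0,1]$ the quantity $H'=H-\ln|\Omega|$ vanishes for the uniform density and is strictly negative otherwise, since $-H'$ is the KL divergence to the uniform measure:

```python
import numpy as np
from scipy.stats import beta

def H(pdf, a, b, n=400_000):
    """-int rho ln rho over the domain [a, b], by midpoint Riemann sum."""
    dx = (b - a) / n
    x = np.linspace(a, b, n, endpoint=False) + dx / 2
    p = pdf(x)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) * dx

a, b = 0.0, 1.0                                  # domain with |Omega| = b - a = 1
densities = {
    "uniform":   lambda x: np.ones_like(x),
    "beta(2,5)": lambda x: beta.pdf(x, 2, 5),
    "beta(2,2)": lambda x: beta.pdf(x, 2, 2),
}
for name, rho in densities.items():
    Hprime = H(rho, a, b) - np.log(b - a)        # H' = H - ln|Omega|
    print(name, Hprime)                          # 0 for uniform, negative otherwise
```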

Typically, the significance of the uniform measure can be justified physically. For example, in Hamiltonian mechanics, the uniform measure in canonical coordinates (the same for all canonical coordinate systems, since the Jacobian of a canonical transformation is unity) is preserved by time evolution. It is therefore a natural reference measure.

In the case of an infinite domain, there is no natural normalization of the reference uniform distribution, so the entropy is defined only up to an additive constant. This should be reminiscent of physics, where only differences of entropy have physical significance. More generally, physically relevant quantities appear as KL divergences (or rather their exponentials, since there is the ambiguity of the base of the logarithm).

As in the discrete case, one motivation would be the asymptotic equipartition theorem (check out Shannon's original article, for example). Physically, this is still relevant, as the underlying mathematical concept is large deviations, which is also useful when considering the thermodynamic limit. To keep things simple, consider iid random variables whose common pdf is $\rho$. The joint distribution is the product: $$ \rho_n(x)=\prod_{i=1}^n \rho(x_i) $$ The idea is that in the product space, the distribution gets concentrated in a small region where it is approximately uniform. The common value of the joint pdf is set by the differential entropy: $$ -\frac{1}{n}\ln\rho_n\to H $$ Once again, the dependence on the coordinate system is to be expected. A generic coordinate change would change the effective domain, and in particular its volume, and the differential entropy changes accordingly. What is truly remarkable is that, at logarithmic leading order, the distribution is uniform no matter the coordinate system. This may seem paradoxical at first, but the differences become visible in the subleading terms.
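Here is a quick Monte Carlo sketch of this concentration (Python; the exponential distribution with rate $\lambda=2$ is an arbitrary choice): $-\frac{1}{n}\ln\rho_n$ evaluated on a single iid sample converges to the differential entropy $H = 1-\ln\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                 # Exp(lam), pdf lam * exp(-lam * x)
H_exact = 1 - np.log(lam)                 # its differential entropy

for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(scale=1 / lam, size=n)   # one iid sample of length n
    log_rho_n = np.sum(np.log(lam) - lam * x)    # log of the joint pdf rho_n(x)
    print(n, -log_rho_n / n, "->", H_exact)
```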

Take for example the case of standard normal distributions: $$ \rho(x)=\frac{e^{-x^2/2}}{\sqrt{2\pi}} $$ The joint distribution is a multivariate standard gaussian. Even though the variance is constant, the increase in dimension causes a dilution of the effective domain. Quantitatively, the squared distance from the origin follows the $\chi^2$ distribution, so any fixed quantile of it grows asymptotically as $\sim n$, in accordance with the law of large numbers. The multivariate gaussian's effective support is therefore a hyperball of radius $r_n = \sqrt{n}$. Its volume is: $$ \begin{align} V_n &= \frac{\pi^{n/2}}{(n/2)!}r_n^n \\ &\asymp \sqrt{2\pi e}^n \end{align} $$ If you look at the value of the pdf, at logarithmic leading order it is indeed constant: the fluctuations of the $x$-dependent term $\|x\|^2/2$ about its mean $n/2$ are $O(\sqrt{n})$, which is subdominant compared to the $O(n)$ leading behaviour.
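To check the scaling numerically (a scipy sketch; the 99% quantile is an arbitrary cutoff), the radius containing 99% of the standard multivariate gaussian's mass grows like $\sqrt{n}$, and the volume of the corresponding ball satisfies $V_n^{1/n}\to\sqrt{2\pi e}\approx 4.13$:

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

for n in (10, 100, 1_000, 10_000):
    r = np.sqrt(chi2.ppf(0.99, df=n))            # radius containing 99% of the mass
    # log volume of the n-ball of radius r: (n/2) ln(pi) - ln((n/2)!) + n ln(r)
    log_V = (n / 2) * np.log(np.pi) - gammaln(n / 2 + 1) + n * np.log(r)
    print(n, r / np.sqrt(n), np.exp(log_V / n), np.sqrt(2 * np.pi * np.e))
```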

The multivariate gaussian distribution is therefore effectively uniformly distributed in a ball. Checking against the entropy of the standard normal distribution, $$ H = \ln\sqrt{2\pi e} $$ this is all consistent with the general equipartition theorem. On a side note, from this formula it would be tempting to claim that the distribution is concentrated on a hypercube of side length $\sqrt{2\pi e}$. On the contrary, the probability inside a cube of any fixed side length goes to zero exponentially. You can reconcile the two pictures by noticing that in large dimensions the inscribed ball has a negligible hypervolume: you still need to increase the side of the cube to contain a big enough ball, but most of this inflated cube's volume (near the vertices) contributes almost nothing to the total gaussian measure.
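Here is a numerical sketch of this last point (Python/scipy; the factor $1.2$ in the ball radius is an arbitrary margin): the gaussian mass inside the fixed cube of side $\sqrt{2\pi e}$ decays exponentially with $n$, while a ball of radius slightly larger than $\sqrt{n}$ captures essentially all of it.

```python
import numpy as np
from scipy.stats import norm, chi2

s = np.sqrt(2 * np.pi * np.e)                    # side of the naive cube
p_axis = norm.cdf(s / 2) - norm.cdf(-s / 2)      # mass per coordinate, about 0.961

for n in (10, 100, 1_000):
    p_cube = p_axis ** n                         # gaussian mass in the cube [-s/2, s/2]^n
    p_ball = chi2.cdf(1.2 * n, df=n)             # mass in the ball of radius sqrt(1.2 n)
    print(n, p_cube, p_ball)
```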

In this example, by changing the variance, the entropy can be made positive or negative. This goes back to the previous remark that there is no natural scale, so only ratios of volumes are significant; in terms of entropy, this translates into differences.

Hope this helps.

LPZ

This puzzled me too when I was first thinking about it. You're right; Shannon just kind of wrote down the differential entropy without really thinking about invariance or anything like that, and everything seemed to work out okay, except for the wrinkle that $S$ can be negative, which was weird but not a dealbreaker.

Negative entropy is weird because we're used to thinking of the entropy $S$ as the number of bits needed to specify the precise state of the system or, in other words, the missing information about the system. By that definition, it can't be negative.

On the other hand, for a continuous distribution you always get $S = \infty$, which is even less useful. Intuitively, a die in an unknown state has $\log_2(6)$ bits of entropy, and a 52-sided die has $\log_2(52)$ bits. An infinitely-sided die (a marble, basically) has $\log_2(\infty) = \infty$ bits of entropy, because it takes that many bits to specify exactly which "face" the marble is resting on, and there are infinitely many "faces" to a marble.

Anything with an infinite number of configurations is going to have $S = \infty$, so you might instead want to ask whether there's something else we can define to describe the amount of "missing information" that at least gives us a useful reference for comparison.

Formally, taking the entropy from the discrete case to the continuous case is described by going from the finite definition of $S$

$$ S = - \sum_i p_i \log p_i. $$

to

$$ S = -\sum_i \rho(x_i)\log \rho(x_i)\,\Delta x - \log\Delta x $$

(you can already see where the divergence comes from) and taking the limit of small $\Delta x$,

$$ S = -\int dx \rho(x) \log \rho(x) + S^\text{bin}. $$

You have to do a little trick to make the dimensions work out, but it doesn't change the main idea. The point is, you get two contributions to the entropy: the first part is the differential entropy $S^\text{diff} = - \int dx \rho \log \rho$, and the second part is the "entropy of the binning," $S^\text{bin} = - \log(\Delta x)$, or the "bintropy" as I like to call it. The finer the bins, the more bits of information you need to say specifically which bin you're in, and so the "bintropy" diverges.
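Here is a numerical sketch of that split (Python/scipy; the gaussian with $\sigma=2$ and the bin width $\Delta x = 10^{-3}$ are arbitrary choices): the Shannon entropy of the binned distribution is, to a good approximation, $S^\text{diff} + S^\text{bin}$.

```python
import numpy as np
from scipy.stats import norm

sigma, dx = 2.0, 1e-3
x = np.arange(-12 * sigma, 12 * sigma, dx)       # fine grid covering the support
p = norm.pdf(x, scale=sigma) * dx                # probability assigned to each bin
p = p[p > 0]

S_binned = -np.sum(p * np.log(p))                # discrete Shannon entropy of the bins
S_diff = np.log(sigma * np.sqrt(2 * np.pi)) + 0.5
S_bin = -np.log(dx)                              # the "bintropy"
print(S_binned, S_diff + S_bin)                  # agree up to O(dx) discretisation error
```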

The sum of these two parts is always positive, but clearly $S^\text{diff}$ on its own can be negative without messing anything up. Still, it obeys the same principle that a bigger $S^\text{diff}$ means more missing information. For example, the differential entropy of a gaussian is

$$ S^\text{diff} = \log \left( \sigma \sqrt{2\pi} \right) + \frac{1}{2} $$

which means that a sharper distribution (smaller $\sigma$) has less entropy and a broader one (larger $\sigma$) has more entropy. Unlike the discrete entropy which is bounded from below by zero, the differential entropy is bounded below by $-\infty$, but the idea is the same.
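A quick check of this monotonic behaviour (a Python sketch with arbitrary values of $\sigma$): the gaussian's differential entropy grows with $\sigma$ and goes negative once $\sigma < 1/\sqrt{2\pi e}\approx 0.24$.

```python
import numpy as np

# S_diff = log(sigma * sqrt(2*pi)) + 1/2 changes sign at sigma = 1/sqrt(2*pi*e) ~ 0.24
for sigma in (0.05, 0.2, 1.0, 5.0):
    print(sigma, np.log(sigma * np.sqrt(2 * np.pi)) + 0.5)
```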

Interestingly, if you have two distributions binned the same way, the difference of their total entropies (bintropy included) is equal to the difference of their differential entropies,

$$ S_a - S_b = S_a^\text{diff} - S_b^\text{diff}. $$

I think this may help you understand the usefulness of $S^\text{diff}$ — that when you use the differential entropy you're implicitly comparing it to some other reference distribution with the same binning scheme.
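Here is a sketch of that cancellation (Python/scipy; the gaussian-versus-exponential pair is my own example): with the same bin width, the difference of the binned Shannon entropies matches the difference of the differential entropies.

```python
import numpy as np
from scipy.stats import norm, expon

def binned_entropy(pdf, x, dx):
    """Shannon entropy of a density discretised on bins of width dx."""
    p = pdf(x) * dx
    p = p[p > 0]
    return -np.sum(p * np.log(p))

dx = 1e-3
x = np.arange(-40.0, 40.0, dx)
S_a = binned_entropy(lambda t: norm.pdf(t, scale=2.0), x, dx)   # gaussian, sigma = 2
S_b = binned_entropy(lambda t: expon.pdf(t), x, dx)             # exponential, rate 1

S_a_diff = np.log(2.0 * np.sqrt(2 * np.pi)) + 0.5
S_b_diff = 1.0
print(S_a - S_b, S_a_diff - S_b_diff)     # the bintropy cancels in the difference
```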

As an aside, and something to look further into, comparing one entropy to another has the same flavor as "relative entropy"

$$ S^\text{rel} = - \int dx \rho(x) \log \frac{\rho(x)}{q(x)} $$

which I leave to you to prove is convergent in both the discrete and continuous cases. You can then choose a $q(x)$ to perhaps define a better continuous analogue of the entropy. Give it a try.
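As a sketch of why this is a better-behaved object (Python/scipy; the two gaussians and the $y=\sinh x$ reparametrisation are my own choices): the relative entropy is finite and, unlike $S^\text{diff}$, unchanged when the same nonlinear change of variables is applied to both densities.

```python
import numpy as np
from scipy.stats import norm

def rel_entropy(p, q, x, dx):
    """S_rel = -int p log(p/q) dx by midpoint Riemann sum (minus the KL divergence)."""
    px, qx = p(x), q(x)
    m = px > 0
    return -np.sum(px[m] * np.log(px[m] / qx[m])) * dx

p = lambda t: norm.pdf(t, loc=0.0, scale=1.0)
q = lambda t: norm.pdf(t, loc=1.0, scale=3.0)

dx = 1e-3
x = np.arange(-30.0, 30.0, dx) + dx / 2
print(rel_entropy(p, q, x, dx))           # closed form: -(ln 3 + 1/9 - 1/2) ~ -0.71

# push both densities through the same monotone map y = sinh(x);
# each transforms as f(arcsinh(y)) / sqrt(1 + y^2)
def pushforward(f):
    return lambda y: f(np.arcsinh(y)) / np.sqrt(1.0 + y ** 2)

y = np.arange(-300.0, 300.0, dx) + dx / 2
print(rel_entropy(pushforward(p), pushforward(q), y, dx))   # unchanged, unlike S_diff
```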

dgamma