The definition of differential (or continuous) entropy is problematic. Indeed, differential entropy can be negative, can diverge, and is not invariant under linear transformations of the coordinates. As far as I know, differential entropy is obtained by analogy with the Shannon entropy, replacing a summation with an integration. So what is the exact meaning of differential entropy? Is there an alternative formulation that is non-negative and does not suffer from the weaknesses mentioned above?

To be more precise, consider the differential entropy defined as
\begin{equation}
h(f)=\lim_{\Delta \to 0} \left(H^{\Delta} + \log \Delta\right)=-\int_{-\infty}^{+\infty} f(x) \log f(x)\, dx
\end{equation}
where
\begin{equation}
H^{\Delta}=-\sum_{j=-\infty}^{+\infty} f(x_{j})\, \Delta\, \log\!\big(f(x_{j})\, \Delta\big)
\end{equation}
is the Shannon entropy of the variable quantized with bin width $\Delta$. Since $H^{\Delta}$ represents the Shannon entropy of the quantization, its limit as $\Delta \to 0$ would be the natural definition of continuous entropy. However, the term $\log \Delta$ diverges as $\Delta \to 0$, so the differential entropy, according to the above definition, differs from that limit by an infinite offset.

I have the impression that this formulation of differential entropy has no solid physical foundation and was derived mathematically without concern for its meaning. The fact that the entropy can become infinitely large puzzles me: if that were possible, I would infer that a continuous distribution can carry an infinite amount of information, which does not seem possible to me. This is not the only problem with the definition. If we accept this expression, the entropy of a random variable with a continuous distribution can be negative, whereas the entropy of a discrete distribution is always non-negative. Moreover, the entropy of a continuous system is not preserved under a change of coordinates, in contrast to the discrete case.

In conclusion, it seems strange to me that, given the problems I mentioned, no other expression for the differential entropy has been adopted that avoids them. For example, if we consider densities $f(x)$ that vanish outside a finite support $[a,b]$, it is possible to redefine the differential entropy as
\begin{equation}
h'(f)=-\int_{a}^{b} f(x) \log\!\big((b-a) f(x)\big)\, dx
\end{equation}
This formulation solves the problems mentioned above, but it is valid only when $a$ and $b$ are finite and $b > a$.
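To illustrate the divergent offset I am describing, here is a minimal numerical sketch of my own (the choice of a standard Gaussian density and all function names are just for illustration): it quantizes $f$ with bin width $\Delta$, computes the Shannon entropy $H^{\Delta}$ of the quantized variable, and compares $H^{\Delta}+\log\Delta$ with the known value $h(f)=\tfrac12\log(2\pi e)$.

```python
import numpy as np

def discrete_entropy(delta, x_max=20.0):
    """Shannon entropy (in nats) of a standard Gaussian quantized into bins of width delta."""
    edges = np.arange(-x_max, x_max + delta, delta)
    x = 0.5 * (edges[:-1] + edges[1:])                   # bin centres x_j
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) * delta   # p_j ~ f(x_j) * delta
    p = p[p > 0]
    p /= p.sum()                                          # renormalise the discretisation
    return -np.sum(p * np.log(p))

h_f = 0.5 * np.log(2 * np.pi * np.e)  # differential entropy of N(0, 1), about 1.419 nats

for delta in (1.0, 0.1, 0.01, 0.001):
    H_delta = discrete_entropy(delta)
    # H_delta grows without bound as delta -> 0, while H_delta + log(delta)
    # converges to the finite differential entropy h(f).
    print(f"delta={delta:7.3f}  H_delta={H_delta:8.4f}  "
          f"H_delta+log(delta)={H_delta + np.log(delta):8.4f}  h(f)={h_f:8.4f}")
```

The last column should stabilise near $h(f)\approx 1.42$ nats, while $H^{\Delta}$ itself keeps growing like $-\log\Delta$, which is exactly the infinite offset in the definition above.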
@hyportnex: First of all, thanks for the suggestion. Jaynes, in his Brandeis lectures, subtracts the infinite term, making the remaining functional finite and invariant. This approach is equivalent to the one I used when I multiplied $f(x)$ by $(b-a)$ and replaced $\log f(x)$ with $\log((b-a) f(x))$. However, I am not at all convinced that this approach is correct. For example, consider a triangular probability density function defined on the support $[a,b]$ with mode $c$. Its entropy is
\begin{equation}
\frac{1}{2}-\log(2)+\log(b-a)
\end{equation}
See, for example, Triangular PDF. With the Jaynes formulation the entropy is indeed preserved under a change of coordinates, but only because it reduces to a constant independent of the support $[a,b]$: every triangular distribution has the same entropy.
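To spell out why the support drops out (my own short check, using only the definition of $h'$ and the triangular entropy quoted above, with the same logarithm base throughout):
\begin{equation}
h'(f)=-\int_{a}^{b} f(x)\log f(x)\, dx-\log(b-a)\int_{a}^{b} f(x)\, dx=h(f)-\log(b-a)=\frac{1}{2}-\log(2)
\end{equation}
which is indeed independent of $a$, $b$ and the mode $c$.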
@dgamma: The term $\log \Delta x$ is nothing other than the term I called $\log \Delta$. From what you wrote I conclude that entropy is always defined relative to something; for example, relative entropy reduces (up to sign) to the entropy when the reference function $q(x)$ is equal to one, i.e. for a uniform reference distribution. The fact that entropy is a relative measure is not in itself a problem. However, if entropy is a measure of information, then information itself must be relative in nature. I find this rather surprising, for the following reason. Information about an object is conveyed by knowledge of its properties, but also by the absence of properties (which implies absence with respect to something else), and I have trouble reconciling this interpretation with the one proposed by Kolmogorov. Kolmogorov's theory focuses on the description of an object and tries to define the minimal description of that object. In other words, Kolmogorov, in determining the number of bits of the final compressed version, considers only the object itself, regardless of how the object was generated. Shannon, on the other hand, considers only the properties of the random source of which the object is one possible outcome. There is a connection between Shannon's and Kolmogorov's descriptions: it can be shown that the probabilistic entropy of a random variable $X$ equals, up to an additive constant, the expected value of the Kolmogorov complexity. We can then write
\begin{equation}
E(K)=h(X)+\log{c}
\end{equation}
where $c$ is a positive constant. Thus the expected value of the Kolmogorov complexity can take arbitrarily large positive or negative values, depending on the choice of $c$. If the entropy is defined with respect to a uniform distribution ($q(x)=1$), I infer that there should be a relative Kolmogorov complexity, analogous to the KL divergence, such that the expression above holds when $q(x)=1$. Can you tell me whether such an expression exists? Moreover, the fact that the entropy can diverge would imply the existence of objects of infinite complexity, which seems rather odd to me.
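In case it helps to make the "relative" reading concrete, here is my own rewriting of the finite-support definition $h'$ from my question in terms of the KL divergence to the uniform reference density $u(x)=1/(b-a)$ on $[a,b]$:
\begin{equation}
D(f\,\|\,u)=\int_{a}^{b} f(x)\log\frac{f(x)}{1/(b-a)}\, dx=\int_{a}^{b} f(x)\log\!\big((b-a)f(x)\big)\, dx=-h'(f)
\end{equation}
so $h'$ is exactly minus the relative entropy with respect to the uniform distribution on $[a,b]$, which is what I mean when I say that entropy seems to be defined relative to something.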