6

Gibbs entropy is written as

$$ S = -k \sum_i p_i \ln p_i $$

Here $p_i$ is, if I understand correctly, the probability that the system is in microstate $i$.

This looks exactly like the expected value:

$$ E[X] = \sum_i x_i f(x_i) $$

So in this sense the logarithm plays the role of the function: $f(p_i) = - \ln p_i$.

In other words, entropy is:

$$ S = k\,E[-\ln X] = k\,\langle -\ln X\rangle, $$

where $X$ takes the value $p_i$ with probability $p_i$.

My question is: why is it the weighted average of the logarithm of the $p_i$? What is the clear intuition here? My guess would be that it is related to partitioning phase space into $n^m$ microstates, where the exponent suggests a logarithm.

In information theory it has a clear interpretation: it is the average number of yes/no questions you need to ask to fully determine the state of the system, measured in bits.
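To make this concrete, here is a minimal Python sketch (the `huffman_code_lengths` helper and the dyadic example distribution are illustrative choices, not part of the question): an optimal yes/no questioning strategy corresponds to a Huffman code, and for a dyadic distribution the average number of questions equals the entropy in bits exactly.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Build a Huffman code and return each symbol's codeword length.

    The codeword length is the number of optimally chosen yes/no
    questions needed to identify that symbol.
    """
    # Heap entries: (probability, unique tie-breaker, list of (symbol, depth))
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        # Merging two subtrees pushes all their leaves one level deeper.
        merged = [(s, d + 1) for s, d in leaves1 + leaves2]
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    lengths = dict(heap[0][2])
    return [lengths[i] for i in range(len(probs))]

probs = [0.5, 0.25, 0.125, 0.125]   # dyadic distribution, so the match is exact
lengths = huffman_code_lengths(probs)
avg_questions = sum(p * L for p, L in zip(probs, lengths))
entropy_bits = -sum(p * math.log2(p) for p in probs)
print(avg_questions, entropy_bits)  # both 1.75 for this distribution
```

For probabilities that are not powers of $1/2$, the average number of questions lies between the entropy and the entropy plus one bit, which is the content of Shannon's source-coding bound.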

EDIT: There is another way to see it as an expected value:

$$ S = \langle S_i \rangle = \sum_i p_i S_i, $$ where $S_i = k \ln \Omega_i$ and $\Omega_i = 1/p_i$.

See this youtube video: https://youtu.be/s4ARd68lkco?t=4913
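A minimal numerical sketch of both of these readings (the example probabilities are an arbitrary, normalized choice; only the identity matters):

```python
import math

k = 1.380649e-23  # Boltzmann constant in J/K (any positive constant works for the identity)
p = [0.5, 0.3, 0.15, 0.05]  # an arbitrary, normalized set of microstate probabilities

S_gibbs = -k * sum(pi * math.log(pi) for pi in p)          # -k sum_i p_i ln p_i
S_expected = sum(pi * (-k * math.log(pi)) for pi in p)     # weighted average of -k ln p_i
S_surprise = sum(pi * k * math.log(1.0 / pi) for pi in p)  # <S_i> with S_i = k ln(Omega_i), Omega_i = 1/p_i

print(math.isclose(S_gibbs, S_expected), math.isclose(S_gibbs, S_surprise))  # True True
```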

bananenheld
  • 2,022
    It should be $f(p_i)=-\ln(p_i)$. To your last sentence: It is the average minimum number of yes-no questions. What's wrong with the interpretation of information theory applied to statistical mechanics? – Tobias Fünke Jul 09 '22 at 07:14
  • when $p_k \propto e^{-c x_k^2}$ then $S \propto \sum_k p_k x_k^2$ and $S$ is just the mean square value of the distribution. – hyportnex Jul 09 '22 at 13:21
  • @hyportnex Thank you for your comment, but I don't really understand it. That suggests that entropy is like a 'second moment' and related to the variance. Could you maybe expand on it in an answer? It seems quite interesting. Also, what exactly does a Gaussian have to do with it? – bananenheld Jul 09 '22 at 18:22
  • You have asked for "intuition" and/or "interpretation". My special example gives an $S$ that you already know; if you change the distribution you get a different, likely unfamiliar, quantity, but in all cases $S$ is a measure of the spread of the distribution. – hyportnex Jul 09 '22 at 19:02
  • The first equation is Shannon entropy - neither Gibbs, nor Boltzmann. While in some cases these three are the same, this is not always the case (and they are differently defined). See Jaynes' article linked in this answer – Roger V. Jul 12 '22 at 07:46
  • related: https://physics.stackexchange.com/a/389714/226902 – Quillo Sep 10 '23 at 14:16

4 Answers

6

It is exactly like in information theory; I give the corresponding physics terminology in square brackets [ ].

In information theory you have letters [the individual micro-states a physical system can occupy] making up an alphabet [the set of occupiable micro-states for the system], denoted $\Omega = \{ a,b,c\dots \}$. Now one can create a sequence of letters that makes up a string/message, e.g. $bdaacbac\dots$ [the time evolution of the system, cycling through individual micro-states]. Now suppose the probability of choosing the $j$-th letter [the probability of finding the system in the $j$-th micro-state] is $p_j$, and in total you send $N$ letters. When $N$ is large, by the law of large numbers the letter $j$ appears $\approx Np_j$ times within the string. So one has a message with $N$ letters, of which $\approx Np_j$ are the $j$-th letter [a macro-state for the system in equilibrium].

The log of the number of possible different strings [the number of micro-state sequences that make up the macro-state, i.e. the multiplicity of the macro-state] is then \begin{equation} \log\left(\frac{N!}{(Np_1)! \dots (Np_k)!}\right) \approx N \left( -\sum_{j=1}^k p_j\log(p_j) \right) =: N \cdot S, \end{equation} whose interpretation is "how many bits" one needs in order to specify which message has been sent [the exact sequence of micro-states making up the macro-state]. So per letter one needs on average $S$ bits (with the log taken base 2; nats for the natural log) to cover the message space. The larger this is, the lower the probability of guessing the correct message, i.e. the higher the ignorance about the message space.
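A minimal Python sketch of this count (the letter probabilities and the message length $N$ are arbitrary choices), comparing the exact log of the multinomial coefficient with the approximation $N \cdot S$:

```python
import math

p = [0.5, 0.3, 0.2]  # letter probabilities (an arbitrary example)
N = 100_000          # message length; the counts N*p_j happen to be integers here
counts = [round(N * pj) for pj in p]

# Exact log of the multinomial coefficient N! / ((N p_1)! ... (N p_k)!), via ln(n!) = lgamma(n + 1)
log_multiplicity = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)

# Large-N approximation: N * S with S = -sum_j p_j ln p_j
S = -sum(pj * math.log(pj) for pj in p)

print(log_multiplicity / N, S)  # per-letter values agree to a few parts in 10^4
```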

I don't think there is a deeper intuition behind the exact form of the formula than this computational result.

As a remark, though, it is nice how, from this viewpoint, the entropy maximization principle ties in with maximal ignorance about the system.

1

My question is: why is it the weighted average of the logarithm of the $p_i$? What is the clear intuition here?

The intuition here is difficult to make clear. However, I think the best way to clarify is to say that $$ S=-k\sum p_i \log(p_i)\;, \qquad (1) $$ is the only reasonable measure of "uncertainty" of a probability distribution that can be used to make sure the assignment of probabilities is done "fairly" in light of the available information.

You can contrast $S$ with some other, less good, potential measures of broadness or uncertainty that we could attempt to maximize. For example $$ S' = \sum_i p_i^2\;, \qquad {(2)\quad(BAD!)} $$ could be used as a measure of "uncertainty" in a distribution. Etc. Etc. Etc. There are an infinite number of possible measures of uncertainty, but (1) is the best.


Section 11.3 of Jaynes's textbook "Probability Theory: The Logic of Science" gives an argument for why the "Information Entropy" must take on the form it does.

Basically, if the "amount of uncertainty" $S$ is to be continuous, consistent with common sense in that more possibilities are "more uncertain" than fewer, and consistent in that every valid way of computing it must give the same result, then it has to have the form: $$ S = -k\sum_i p_i \log(p_i)\;. $$


You can understand where this form comes from by considering that the fundamental equation $S$ must satisfy to be a consistent measure of probability uncertainty is: $$ S(p_1,p_2,p_3) = S(p_1, p_2 + p_3) + (p_2 + p_3)S\left(\frac{p_2}{p_2+p_3},\frac{p_3}{p_2+p_3}\right)\;, \qquad(3) $$ which says that if we first resolve whether or not event 1 occurs, we remove the uncertainty $S(p_1, p_2 + p_3)$, but a fraction $p_2 + p_3$ of the time we are left in the non-event-1 case and still have to distinguish between the remaining possibilities.

Eq. (3) can be generalized to: $$ S(p_1,\ldots,p_N) = S(w_1,\ldots,w_m)+w_1S(p_1/w_1,\ldots,p_n/w_1)+\ldots +w_mS(p_{N-n+1}/w_m,\ldots, p_N/w_m)\;, $$ where $m$ and $N$ are integers with $m<N$, and $w_j$ is the total probability of the $j$-th group (e.g. $w_1 = p_1 + \ldots + p_n$). Here we have written our fundamental equation as if we have partitioned the $p_i$ into $m$ subsets of $n$ terms each, where $N=nm$, but we could partition into unequal groups too if we would like.

As an example, we would want: $$ S(1/2, 1/3, 1/6) = S(1/2, 1/2) + \frac{1}{2}S(2/3, 1/3) $$ And, as another example of partitioning, we would want: $$ S(1/2, 1/3, 1/6) = S(5/6, 1/6) + \frac{5}{6}S(3/5,6/15) $$
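A minimal numerical check of these two partitionings, taking $k = 1$:

```python
import math

def H(*probs):
    """Uncertainty measure -sum p ln p (taking k = 1)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

lhs = H(1/2, 1/3, 1/6)

# Group {1/3, 1/6} together: w = 1/2, conditional probabilities 2/3 and 1/3.
rhs1 = H(1/2, 1/2) + (1/2) * H(2/3, 1/3)

# Group {1/2, 1/3} together: w = 5/6, conditional probabilities 3/5 and 6/15.
rhs2 = H(5/6, 1/6) + (5/6) * H(3/5, 6/15)

print(math.isclose(lhs, rhs1), math.isclose(lhs, rhs2))  # True True
```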

And, as another example, we would want: $$ S(1/4, 1/4, 1/4, 1/4) = S(1/2, 1/2) + \frac{1}{2}S(1/2, 1/2) + \frac{1}{2}S(1/2,1/2) $$

And, as another example, we would want: $$ S(1/N, 1/N, \ldots, 1/N) = S(w_1/N, w_2/N,\ldots) + \frac{w_1}{N}S(1/w_1,1/w_1,\ldots,1/w_1) + \frac{w_2}{N}S(1/w_2,\ldots) + \ldots\;, $$ where now the $w_i$ are integer group sizes with $\sum_i w_i = N$.

Defining a new set of probabilities $q_i = w_i/N$ and using the above equation, we have, in general, $$ S(q_1, q_2, \ldots) = S(1/N, 1/N, \ldots) - \sum_j q_j S(1/w_j, 1/w_j, \ldots)\;. $$

Now specialize to the case of all of the $q_i$ being equal, so we have $q_i = n/N$ (and $w_i=n$ and $N=nm$) to find: $$ S(1/m, 1/m, \ldots) = S(1/N, 1/N, \ldots) - S(1/n, 1/n, \ldots) $$

Or, with $s(n) \equiv S(1/n, 1/n, \ldots, 1/n)$ we can write: $$ s(m) = s(nm) - s(n) \qquad (4) $$

The unique monotonically increasing function that satisfies Eq. (4) is $$ s(m) = k\log(m)\;, $$ for some constant $k>0$.

Therefore: $$ S(q_1, q_2,\ldots) = k\log(N) - k\sum_i q_i \log(w_i) = -k\sum_i q_i \log(q_i) $$

Of course, we can rename the $q$'s to $p$'s and we can write: $$ S(p_1, p_2,\ldots) = -k\sum_i p_i \log(p_i)\;. $$

hft
  • 19,536
0

My question is: why is it the weighted average of the logarithm of the $p_i$? What is the clear intuition here?

This is my second answer. As before, I make the caveat that "clear intuition" is difficult. But, seeing a reason from another perspective may be helpful.


If two systems are statistically independent, this means that the combined probability of finding the first system in state $A$ and the second system in state $B$ is the product of the two individual probabilities: $$ p(AB) = p(A)p(B) $$

Therefore, the log is additive: $$ \log(p(AB)) = \log(p(A)) + \log(p(B))\;. $$

However, following Landau and Lifshitz, "Statistical Physics" (3rd edition, Part 1), Sections 1, 2, 4 and 7: since $\log(p)$ is additive, it must be a linear combination of the additive integrals of the motion, and there are only seven independent ones: the energy, the three components of momentum, and the three components of angular momentum. If we ignore overall translations and overall rotations, we only have one additive integral of the motion to consider: the energy. So we know that $$ \log(p(A)) = a + bE_A\;, \qquad (1) $$ for some constants $a$ and $b$.

We also know that $$ \sum_A p_A = 1 = \int dE \frac{dN}{dE}p(E) $$

And, for a macroscopic body in thermal equilibrium, we know that $\frac{dN}{dE}p(E)$ has a very sharp maximum near the average energy $\bar E$. So we can write $$ \Delta N\, p(\bar E) \approx 1\;, \qquad (2) $$ where $\Delta N = \Delta E \frac{dN}{dE}(\bar E)$ is the number of states within the energy width $\Delta E$ of the peak.

We define the thermodynamic entropy $S$ as: $$ S = \log(\Delta N)\;. $$

But, from Eq. 2 we have $$ S = -\log(p(\bar E)) $$

But, from Eq. (1) we know that $\log(p(E))$ is linear in the energy, so its average equals its value at the average energy, and we can write: $$ S = -\overline{\log(p(E))} = -\sum_A p_A \log(p_A) $$
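A minimal numerical illustration (the energy levels and the value of $\beta$ are arbitrary choices, with $k=1$): for $p_A \propto e^{-\beta E_A}$, the log-probability is linear in the energy, so $-\overline{\log p} = \log Z + \beta \bar E$, which is exactly the Gibbs entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.uniform(0.0, 5.0, size=50)  # arbitrary energy levels (illustrative units, k = 1)
beta = 1.3                          # an assumed inverse temperature

Z = np.sum(np.exp(-beta * E))
p = np.exp(-beta * E) / Z           # p_A = exp(a + b E_A) with a = -ln Z, b = -beta

S_gibbs = -np.sum(p * np.log(p))    # -<ln p>, averaged over the distribution
E_bar = np.sum(p * E)
S_linear = np.log(Z) + beta * E_bar # -(a + b*E_bar) = -ln p evaluated at the mean energy

print(np.isclose(S_gibbs, S_linear))  # True: the average passes through the (linear) log
```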

hft
  • 19,536
-1

It doesn't actually look like the expected value; there is a crucial difference. The expected value mixes the two symbols, so it should actually read $$E[f(X)]=\sum_i p_i~f(x_i).$$ The problem is that $\ln(p_i)$ does not look like $\ln(x_i)$, and this means that it resists an easy description in the random-variable syntax that you'd want to use.

Instead, for intuition's sake, you may wish to just look at a noninteracting composition of two systems. When you do this you want the entropy to be additive, but the probabilities are multiplicative: $$ P_{(i,j)} = p^1_i~p^2_j,$$ where $p^1$ are the probabilities for subsystem 1 and $p^2$ are the probabilities for subsystem 2. Plug and chug: $$ S=-k_B\sum_{i,j} \left(p^1_i~p^2_j\ln p^1_i +p^1_i~p^2_j\ln p^2_j\right) \\ =-k_B\left(\sum_{i} p^1_i\ln p^1_i +\sum_j p^2_j\ln p^2_j\right) \\ = s^1+s^2. $$ The logarithm is seen to be necessary to convert a multiplication into an addition, while the $p_i$ factor in front is seen to be necessary to remove the summation over the irrelevant subsystem. (This may look deceptive at first; I know it made me scratch my head during my undergrad. It looks like maybe plain $\sum_i\ln p_i$ would work, so that there should be a big family of solutions. But there isn't: if the sum in one case went over $N$ states and the other went over $M$, you would have had $N\ln\dots+M\ln\dots$, and you'd get the form $f_1 s^2+f_2s^1$ for composition, which looks weird and is no longer additive.)
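A minimal numerical check of this additivity (the two subsystem distributions are arbitrary choices, with $k_B = 1$):

```python
import numpy as np

def S(p, k=1.0):
    """Gibbs/Shannon entropy -k sum p ln p of a normalized probability array."""
    return -k * np.sum(p * np.log(p))

p1 = np.array([0.6, 0.3, 0.1])         # subsystem 1 (arbitrary example distribution)
p2 = np.array([0.7, 0.2, 0.05, 0.05])  # subsystem 2 (arbitrary example distribution)

P = np.outer(p1, p2)                   # joint probabilities P_(i,j) = p^1_i * p^2_j
print(np.isclose(S(P), S(p1) + S(p2)))  # True: entropy is additive for independent systems
```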

CR Drost
  • 37,682
  • In response to the first part: couldn't you have a random variable $X$ whose outcomes are $p_i$, i.e. $P(X=p_i)=p_i$? A bit of an awkward construction maybe but it means you could still write it as an expectation value. – AccidentalTaylorExpansion Jul 09 '22 at 22:11