17

I'm starting to study information theory, and the concept of entropy is not clear to me. I have already checked other posts in this category, but none of them seems to answer my questions, so here they are:

  1. Is the Shannon entropy equivalent to the thermodynamic entropy? I have read both answers, yes and no, and still can't decide; many people say "yes", but the Shannon entropy is related to symbols, while the thermodynamic entropy is related to micro-states, and these are related to temperature (and Shannon entropy couldn't care less about temperature). However, the Gibbs entropy formula is exactly the same as the Shannon entropy formula, so I could not reach a consensus in my head about the topic.

  2. What exactly is the difference between the Boltzmann, Shannon, Gibbs and von Neumann concepts of entropy? I have read in one topic here that Shannon entropy measures the "minimum amount of yes/no questions to fully specify a system", but how could a physical system obey this? For example, if the entropy of a volume of gas is $x$, what questions could I ask to "fully specify" this gas?

  3. If these entropies are related, how could I convert J/K (the thermodynamic unit) to bits (the Shannon unit)? And if one uses $\ln$ instead of $\log_{2}$, the unit would be nats; I understand that information is a way to measure differences between things, and it is clear to me that a bit is the minimum amount of information, since it distinguishes between 2 things, but what would a nat measure in this case? If a bit distinguishes between 2 things, a nat would distinguish between 2.718 things (I can't understand that).

I've already searched in many books and sites, and questioned my professor, but I still don't have a clue about these topics, so any hint will be much appreciated...

valerio
  • 16,231
donut
  • 383
  • 5
    It would be better if each post asks one specific question. Answering all questions at the same time could be difficult, and hence your chance of obtaining answer is reduced. – Shing Dec 23 '17 at 09:32
  • 2
    I agree with Shing. Interesting questions, but too many of them for a single post. – valerio Dec 23 '17 at 10:03
  • 3
    I think the questions are logically related enough that it makes sense to have them in one post. (See my answer.) – N. Virgo Dec 23 '17 at 10:48
  • @Nathaniel I disagree. Question 1 has already been asked . Question 2 may be a unique question, or it was asked under different wording. Question 3 seems to be a duplicate as well. Honestly, this is why we shouldn't be answering broad questions. Although you were able to answer them in one post; it doesn't really mean you should. – JMac Dec 25 '17 at 15:57
  • @Nathaniel Because now we have multiple resources for these questions. It becomes hard to determine which question someone should read, because it's covered in multiple places. In this case, if someone had a question about 1 of these 3 questions, they would have to sort through all this information until they found what they needed. It's a lot neater if we can keep individual questions separate to be addressed separately. For example, what if one of your 3 answers was wrong? That becomes harder to address with voting when you are giving 3 different answers. – JMac Dec 25 '17 at 16:00
  • @JMac I don't really care about any of those points. I enjoy explaining this stuff, and I hope people will find it useful. Happy Christmas! – N. Virgo Dec 25 '17 at 16:35
  • @Nathaniel That's how the site is supposed to work though. By doing this, you're only making it more fragmented and less of a coherent source of information. It's not to say that you can't answer their questions. But you should answer them where they were originally asked and point people in the right direction when they ask duplicates. There's no use in duplicating information in multiple locations. The point is to be helpful to the physics community as a whole. Having multiple questions and answers scattered makes this more difficult. – JMac Dec 25 '17 at 16:43
  • 1
    @JMac did I not say I don't care? I don't believe I'm doing anyone a disservice by passing on my knowledge, and I have no wish to discuss it further. Once again, best wishes for the season of goodwill. – N. Virgo Dec 26 '17 at 02:06

2 Answers

18

I hope that my answers below will all be helpful.

  1. There is more than one way to think about this, but the one I find most helpful is to think of thermodynamic entropy as a specific instance of Shannon entropy. Shannon entropy is defined by the formula $$ H = -\sum_i p_i \log p_i, $$ but this formula has many different applications, and the symbols $p_i$ have different meanings depending on what the formula is used for. Shannon thought of them as the probabilities of different messages or symbols being sent over a communication channel, but Shannon's formula has since found plenty of other applications as well. One specific thing you can apply it to is the microscopic states of a physical system. If the probabilities $p_i$ represent the equilibrium probabilities for a thermodynamic system to be in microscopic states $i$, then you have the thermodynamic entropy. (Very often it is multiplied by Boltzmann's constant in this case, to put it into units of $JK^{-1}$ --- see below.) If they represent something else (such as, for example, a non-equilibrium ensemble) then you just have a different instance of the Shannon entropy. So in short, the thermodynamic entropy is a Shannon entropy, but not necessarily vice versa.

    (One should note, though, that this isn't the way it developed historically --- the formula was in use in physics before Shannon realised that it could be generalised, and the entropy was a known quantity before that formula was invented. For a very good overview of the historical development of information theory and physics, see Jaynes' paper "Where do we stand on maximum entropy?" It is very long, and quite old, but well worth the effort.)

  2. The paper linked above will also help with this. Essentially, the Shannon entropy is the formula quoted above; the Gibbs entropy is that same formula applied to the microscopic states of a physical system (so that sometimes it's called the Gibbs-Shannon entropy); the Boltzmann entropy is $\log W$, which is a special case of the Gibbs-Shannon entropy that was historically discovered first; and the von Neumann entropy is the quantum version of the Gibbs-Shannon entropy.

  3. This is straightforward. The physical definition of the entropy is $$ S = -k_B \sum_i p_i \log p_i, $$ where the logarithms have base $e$, and $k_B \approx 1.38\times 10^{-23} JK^{-1}$ is Boltzmann's constant. Physicists generally consider $\log p_i$ to be unitless (rather than having units of nats), so the expression has units of $JK^{-1}$ overall. Comparing this to the definition of $H$ above (with units of nats) we have $$ 1\,\mathrm{nat} = k_B\,JK^{-1}, $$ i.e. the conversion factor is just Boltzmann's constant.

    If we want to express $H$ in bits then we have to change the base of the logarithm from $e$ to 2, which we do by dividing by $\ln 2$: $$ H_\text{bit} = -\sum_i p_i \log_2 p_i = -\sum_i p_i \frac{\ln p_i}{\ln 2} = \frac{H_\text{nat}}{\ln 2}. $$ So we have $$ 1\,\mathrm{bit} = \ln 2\,\,\mathrm{nat}, $$ and therefore $$ 1\,\mathrm{bit} = k_B\ln 2\,JK^{-1} \approx 9.57\times 10^{-24} JK^{-1}. $$

    You will see this conversion factor, for example, in Landauer's principle, in which erasing one bit requires $k_B T \ln 2$ joules of energy. This is really just saying that deleting a bit (and therefore lowering the entropy by one bit) requires raising the entropy of the heat bath by one bit, or $k_B \ln 2$. For a heat bath of temperature $T$ this can be done by raising its energy by $k_B T \ln 2\,\, J$.

    As for the intuitive interpretation of nats, this is indeed a little tricky. The reason nats are used is that they're mathematically more convenient. (If you take the derivative you won't get factors of $\ln 2$ appearing all the time.) But it doesn't make nice intuitive sense to think of distinguishing between 2.718 things, so it's probably better just to think of a nat as $\frac{1}{\ln 2}$ bits, and remember that it's defined that way for mathematical convenience.
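The arithmetic in points 1 and 3 can be checked numerically. Below is a minimal Python sketch (the function names are my own, not from any library): it computes the Shannon entropy of a toy distribution in nats and bits, then verifies the bit/nat/$JK^{-1}$ conversion factors and the Landauer cost at room temperature.

```python
import math

k_B = 1.380649e-23  # Boltzmann's constant, J/K

def shannon_entropy(p, base=math.e):
    """H = -sum_i p_i log p_i; base e gives nats, base 2 gives bits."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A toy "physical system" with four microstates at these probabilities:
p = [0.4, 0.3, 0.2, 0.1]
H_nats = shannon_entropy(p)               # Shannon entropy in nats
H_bits = shannon_entropy(p, base=2)       # the same quantity in bits
S_thermo = k_B * H_nats                   # thermodynamic entropy, in J/K

# Changing base divides by ln 2, so 1 bit = ln 2 nat:
assert abs(H_bits - H_nats / math.log(2)) < 1e-12

bit_in_JK = k_B * math.log(2)             # 1 bit ~ 9.57e-24 J/K

# Landauer's principle: erasing one bit at temperature T costs k_B*T*ln 2 J
def landauer_cost(T):
    return k_B * T * math.log(2)

E_room = landauer_cost(300.0)             # roughly 2.9e-21 J near room temperature
```

This is just the conversion chain from the answer restated in code; the probabilities are arbitrary illustrative numbers.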

N. Virgo
  • 33,913
7

Question 1:

The Boltzmann entropy $S_B=k_B\ln\Omega(E)$ is valid only for the microcanonical ensemble. In the microcanonical ensemble, all accessible microstates (accessible = they have energy $E$, at least to within some $\delta E$ uncertainty) have equal probability. So if $r$ is an index that labels microstates, we have $$ p_r=C, $$ where $C$ is a constant. Normalization means that $C=1/\Omega(E)$, where $\Omega(E)$ is the number of accessible microstates.

Let us define a more general entropy as $$ S_S=-k_B\sum_r p_r\ln(p_r), $$ valid for an arbitrary probability distribution $p_r$. What happens if $p_r=C$? Then we have $$ S_S=-k_B\Omega(E)C\ln(C)=-k_B\ln\left(\frac{1}{\Omega(E)}\right)=k_B\ln(\Omega(E))=S_B. $$ The only things left to explain are the factor of $k_B$ and the use of $\ln$ instead of $\log_2$.
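The reduction $S_S = S_B$ for the uniform distribution is easy to check numerically; here is a small sketch (with $k_B=1$, using nothing beyond the two formulas above):

```python
import math

Omega = 1000                       # number of accessible microstates
p = [1.0 / Omega] * Omega          # microcanonical: p_r = C = 1/Omega

S_S = -sum(pr * math.log(pr) for pr in p)  # general (Shannon/Gibbs) form
S_B = math.log(Omega)                      # Boltzmann form, ln(Omega)

# The two forms agree to floating-point precision for the even distribution
assert abs(S_S - S_B) < 1e-9
```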

The point is, the multiplicative constant doesn't really matter, and different logarithms are related by multiplicative constants. What we have is that in the microcanonical ensemble, temperature is defined as $$ \frac{1}{T}=\frac{\partial S_B}{\partial E}. $$ Early phenomenological thermodynamicists, on the other hand, had no clue about what temperature actually is, so they invented a unit, $K$, for it. From the perspective of equilibrium statistical mechanics, it is far more natural to measure temperature in units of energy rather than kelvin. So, whatever multiplicative constants happen to be in the formula for entropy, they essentially act as conversion factors between units of temperature and units of energy. And aside from multiplicative factors, $S_S$ is the Shannon entropy. So they are essentially the same, with the understanding that Boltzmann's entropy is a special case for microcanonical ensembles.

Interesting tidbit: Consider the Shannon entropy $S_S$ as a functional of the probability distribution: $$ S_S[p]=-\sum_r p_r\ln(p_r). $$ Here I set $k_B=1$. What are the critical points of this functional? We can do calculus of variations, but we only vary over probability distributions, so we need to enforce $\sum_r p_r=1$. The functional to be varied is then $$ F[p]=-\sum_r p_r\ln(p_r)-\gamma\left(\sum_r p_r-1\right), $$ where $\gamma$ is a Lagrange multiplier. After variation we get $$ \delta F[p]=-\sum_r\left(\delta p_r\ln(p_r)+p_r\frac{1}{p_r}\delta p_r+\gamma\delta p_r\right). $$ Setting this to 0 gives $$ \ln(p_r)=-(1+\gamma)\Rightarrow p_r=e^{-1-\gamma}=C, $$ where $C$ can be determined from normalization.

So basically, the microcanonical ensemble is precisely the one which maximizes entropy.
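The maximization can also be checked by brute force: sample random distributions over $n$ states and confirm that none exceeds the entropy of the even one. A minimal sketch (again with $k_B=1$):

```python
import math
import random

def entropy(p):
    """S = -sum_r p_r ln(p_r), with the convention 0*ln(0) = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

n = 5
S_uniform = entropy([1.0 / n] * n)   # = ln(n), the claimed maximum

random.seed(0)
for _ in range(10_000):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    p = [x / total for x in w]       # a random normalized distribution
    # no random distribution beats the even one
    assert entropy(p) <= S_uniform + 1e-12
```

This is of course not a proof, only a numerical illustration of the variational result above.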

Question 2: I am not sure which one is the Gibbs entropy (probably the "modified" Shannon entropy?), but they are basically all the same in different formulations and with different conventions for temperature. The von Neumann entropy is of course quantum mechanical, but it reduces to the usual entropy if you diagonalize the density matrix.

If you are curious about the meaning of entropy, I think you should drop strict information theory and just look at probability theory. It is probably simpler to consider the negative of entropy, $I=\sum_r p_r\ln(p_r)$. It can be seen that this essentially measures how much knowledge you'd gain if you were to learn which state the system is in. Assume the probability distribution is such that only one state has nonzero probability, so $p_r=1$ for a specific $r$ and 0 for the rest. Then the entropy is zero. And indeed, since that state is the only realizable state, you gain absolutely no information if somebody tells you what the state is. On the other hand, if all states are equiprobable, then you have absolutely no basis to "guess" the state of the system without knowing anything about it. If someone tells you the state of the system, you gain quite a lot of information. If the probability distribution is "spiky", then the entropy is lower than if it were even, because if you "guess" that the state of the system is in the "spiky" domain, you'd be right more often than not.

So I somewhat retract my statement and say that it isn't so much about how much knowledge one gains if told the state of the system (but clearly, it is related), but rather, how likely it is that you can guess which state of the system is realized, just by knowing the distribution. For a "spiky" distribution, the system is likely near the spike, so it is pretty guessable. For a system that is evenly distributed, your guess is worthless. Entropy is a measure of "spikiness", a measure of how evenly the system is distributed over its accessible states.

Question 3: I cannot really answer this directly, mainly because my knowledge of information theory isn't that deep, so I'll only repeat what I already said in 1: the multiplicative constant of units $J/K$ is only needed to make contact with what phenomenological thermodynamicists of old defined as temperature. In the microcanonical ensemble, the entropy is given by the logarithm of the number of accessible microstates, which depends on the energy. Inverse temperature is the response of entropy to a change in energy, so temperature should have dimensions of energy. With that said, if you defined entropy in the microcanonical ensemble as $$ S=\log_2(\Omega(E)), $$ then temperature would have units of $J/\text{bit}$, if you'd like. And if you further defined $k_B$ to carry units of energy, then the unit of temperature would be $1/\text{bit}$.

Edit - clarifications:

I cannot shake the nagging feeling that I did not answer this question satisfyingly, so I'd like to clarify certain points.

Related to Question 3, I think (but I might be wrong, as I am not an expert in this field) that relating temperature to information is somewhat futile, at least beyond superficialities. Temperature is only defined in a meaningful way for equilibrium systems. Specifically, temperature is only defined for microcanonical ensembles. Realize that it is not meaningful to talk about microcanonical ensembles that do not describe equilibrium systems. Non-equilibrium systems have time-dependent probability distributions, but a microcanonical ensemble is in a very specific distribution (the even distribution), so you cannot have time evolution if this evenness is to be kept. For all other ensembles, temperature is defined by being in equilibrium with another system such that, together, they form a microcanonical ensemble.

On the other hand entropy/information is meaningful as soon as you got a probability distribution.

Related to the interpretation of entropy, I think it is probably best not to think about it in the context of either information theory or thermodynamics. Even if those two fields were the main inspiration for the concept of entropy, it is a concept in probability theory. Both information theorists and thermodynamicists use entropy for their own nefarious purposes, so it is best to abstract it away.

Entropy is simply a number associated to a probability distribution. I thought things through and I think I can give a better account of what it means than I did in the main answer. Instead of considering $S=-\sum_r p_r\ln(p_r)$, let us consider $I_r=-\ln(p_r)$, where $p_r$ is the probability of a specific state. Let us call this the "information" of the state $r$.

Since $p_r$ may take on values between $0$ and $1$, and $I_r$ is a monotonically decreasing function of $p_r$, we need to consider only the limiting cases, 0 and 1.

If $p_r=1$, $I_r=0$. In this case, the system is not probabilistic, but deterministic. Thus, there is no information to be gained if a wizard suddenly told us that the system is in $r$. It is trivial. No information content.

On the other hand, if $p_r=0$, $I_r=\infty$. This case is singular, so it is difficult to interpret. Basically, if a wizard told us that the system is in $r$, he'd be lying. But if we consider the case where $p_r=\epsilon$ is very small but nonzero, $I_r$ becomes very large. If a wizard told us that the system is in $r$, a very unlikely state, we'd be surprised. It would, in some sense, net us a great deal of information, since it is very unlikely that the system is in $r$.

Entropy is then $$ S=\sum_r p_rI_r=\left< I\right>, $$ the expectation value of information. So it is a kind of "average information content" of the distribution. If the distribution is even, we know very little about the state of the system, since it can be in any. If the distribution is spiky, we pretty much know that the system is near the spike. Entropy parametrizes our ignorance about the system, if we only know the distribution and nothing else.
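This "expected surprisal" reading is easy to play with numerically; here is a sketch (the names are mine, and the two distributions are arbitrary examples):

```python
import math

def surprisal(p_r):
    """I_r = -ln(p_r): information gained on learning the state is r."""
    return -math.log(p_r)

def entropy(p):
    """S = <I> = sum_r p_r * I_r, the expectation value of information."""
    return sum(p_r * surprisal(p_r) for p_r in p if p_r > 0)

spiky = [0.97, 0.01, 0.01, 0.01]  # the state is almost guessable
even  = [0.25, 0.25, 0.25, 0.25]  # a guess is worthless

# The even distribution has the larger entropy: more ignorance to resolve
assert entropy(spiky) < entropy(even)
```

The even case gives $S=\ln 4$, the maximum for four states, matching the "spikiness" discussion above.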

Bence Racskó
  • 10,948
  • PS: Since I know jack all about information theory, some terms in question 3 might be off. For example, reading both the question (more carefully) and the other answer (which preceded mine, but I started writing mine before it was posted), I realized I have no clue what a "nat" is. With that said, I think my answer is definitely correct in terms of "spirit", since my main point is that multiplicative constants are irrelevant as they can be absorbed by thermodynamic quantities. – Bence Racskó Dec 23 '17 at 11:16
  • 1
    There is not such a big divide between information theory and probability theory - they are more or less the same thing. I guess people start to call it information theory once you start to bring the logarithms in. Although you say one should forget information theory and just use probability theory, the explanations you give are perfectly good information-theoretic ones. (This is a minor comment, it's a good answer, +1.) – N. Virgo Dec 24 '17 at 03:55
  • 2
    Nats, by the way, are just a word people sometimes use for the "units" of entropy when the natural logarithm is used. So $-\sum_i p_i\log_2 p_i$ is measured in bits and $-\sum_i p_i\ln p_i$ is measured in nats. – N. Virgo Dec 24 '17 at 03:57