6

I am not sure which SE site is the right place for this question, but since I learnt about Shannon entropy in the context of statistical physics, I am asking it here.

In Shannon's information theory, the information $I$ associated with the $i^{\text{th}}$ event is defined as

$$ I_i = -\log_2 P_i \qquad \qquad \forall\, i=1,\dots,n. $$

Based on this definition we further define Shannon Entropy as average information,

$$ S_\text{Shannon} =\langle I\rangle = -\sum\limits_{i=1}^n P_i\log_2P_i .$$
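
For a concrete sense of these formulas: a fair coin with $P_1 = P_2 = 1/2$ gives $I_1 = I_2 = 1$ bit and $S_\text{Shannon} = 1$ bit, whereas a heavily biased coin with $P_\text{heads} = 0.9$ gives $S_\text{Shannon} \approx 0.47$ bits.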

My question is: what is the motivation behind defining information as a quantity that is inversely related to probability? My professor told me that the lower the probability of an event, the more information it carries, but I am still not convinced of this.

Secondly, what is the reason for choosing the logarithmic function in this definition? Are there contexts in which this definition of information is abandoned?

Alex Nelson
  • 2,885
user35952
  • 2,965
  • 2
  • @ChrisWhite : Thanks. This question is more about the physical grounds for the mathematical definition of information in Shannon's theory; I understand the relationship between these two entropies. – user35952 May 03 '14 at 20:18
  • Your professor is incorrect, at least in the field of information theory. The probability of 100 '1's in a row is rather low, but it carries far less information than a string of 100 bits whose value toggles a lot. You can get some good "layman's interpretation" info in Simon Singh's "The Code Book." – Carl Witthoft May 03 '14 at 21:24
  • @CarlWitthoft : So then, what is the justification for this definition of information? – user35952 May 03 '14 at 21:28
  • Here's an answer of mine. Maybe that helps. http://physics.stackexchange.com/a/64597/3998 – Siva May 04 '14 at 00:37
  • @Siva : Thank you so much +1, it almost answers my question, but it explains things only through an example. I would be really glad to see an answer that is more generic, without having to rely on a specific example to explain itself. – user35952 May 04 '14 at 01:32
  • Have a look at my explanation below. – Siva May 04 '14 at 03:32
  • Consider $N$ events with equal probability. What is the information? It would be $-\sum (1/N)\ln(1/N)$. There are $N$ terms in the sum, so it would be just $\ln(N)$. If it were, say, deterministic --- so we knew $P_{j}=1$ for some $j$, then the information would be $-(0+0+0+\dots+1\ln(1)+0+\dots+0)=-\ln(1)=0$. – Alex Nelson May 04 '14 at 03:52
  • Shannon's original paper is very readable and gives a really excellent answer to this question, in section 6 (which may be read independently of the rest of the paper). – N. Virgo May 04 '14 at 04:35

2 Answers

2

This is in addition to my answer posted elsewhere, since the OP wanted a more general answer. That example captured the essence by showing how information can be encoded -- it is a somewhat constructive argument in spirit.

Another way of thinking about the amount of information is as follows:

  • If an event that is very probable happens, then you cannot get much information out of it. It was going to happen anyway. On the other hand, if something unusual happens, then that should give you something to think about. Such an event carries more "information". (For example, the occurrence of an event that is certain conveys no information.)

  • If two independent events happen, then the information you glean from them must "add up". Since their probabilities multiply to give the probability of the combined event, the information gleaned from each event must be proportional to the $\log$ of its probability.

In typical treatments, one solves a functional equation for the dependence of the information on the probability -- subject to the two conditions mentioned above. The latter (additivity) gives the logarithmic form $a \log[\cdot] + b$, while the former (a certain event conveys no information) fixes the additive constant $b$ to zero. The scale factor $a$ depends on the base to which you take the logarithm.
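
Concretely, a sketch of that functional-equation argument (writing $I(p)$ for the information carried by an event of probability $p$): require

$$ I(pq) = I(p) + I(q) \qquad \text{for independent events}, $$

with $I$ continuous and decreasing on $(0,1]$. Substituting $p = e^{-x}$ and setting $g(x) = I(e^{-x})$ turns this into Cauchy's equation $g(x+y) = g(x) + g(y)$, whose continuous solutions are $g(x) = a x$. Hence

$$ I(p) = -a \ln p, \qquad a > 0, $$

so a certain event automatically carries $I(1) = 0$, and choosing $a = 1/\ln 2$ (i.e. taking the logarithm to base 2) measures information in bits. Averaging over the distribution then gives back $S = -\sum_i P_i \log_2 P_i$.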

Siva
  • 6,068
  • +1, especially for the second point !! Thanks. Will take some more time to analyse though – user35952 May 04 '14 at 09:15
  • Can you give a case or example of how one solves a functional equation for the dependence of the entropy on the probability distribution? – user35952 May 05 '14 at 18:35
0

Shannon performed his work in the context of communication engineering -- considering a communication system that can send a sequence of symbols from some alphabet $\mathcal A$ to communicate information. Wireless communication systems today do this by modulating amplitude/phase/frequency/code or some combination of the above. A link to his original paper, which starts off by giving the historical reasoning for the logarithmic information measure, is given here.

If an information source has a lot of redundancy, the same message can be conveyed without transmitting the literal message from the source. For instance, if the message was a string consisting of 50 copies of the letter "A", instead of transmitting "AAAAAAAA...." you could just design a source coding scheme that transmits "A" with some metadata implying "repeat 50 times", and the receiver reconstructs the original message with much less overhead. Incidentally, Shannon's theory is also the basis for data compression theory today.
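
As a toy sketch of that idea (the function names below are just illustrative, not from any standard library), a run-length coder removes exactly this kind of redundancy:

```python
# Minimal run-length coding sketch: send (symbol, run_length) pairs
# instead of the literal message when the source is highly redundant.

def rle_encode(message):
    """Encode a string as a list of (symbol, run_length) pairs."""
    encoded = []
    for symbol in message:
        if encoded and encoded[-1][0] == symbol:
            encoded[-1] = (symbol, encoded[-1][1] + 1)
        else:
            encoded.append((symbol, 1))
    return encoded

def rle_decode(pairs):
    """Reconstruct the original string from (symbol, run_length) pairs."""
    return "".join(symbol * count for symbol, count in pairs)

message = "A" * 50
code = rle_encode(message)          # [('A', 50)] -- one pair instead of 50 symbols
assert rle_decode(code) == message  # the receiver recovers the message exactly
```

Practical lossless compressors (Huffman coding, arithmetic coding, LZ-family schemes) push the same idea much further, approaching the entropy of the source.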

Robert L.
  • 151