I am not sure which SE site this question belongs on, but since I learnt about Shannon entropy in the context of statistical physics, I am asking it here.
In Shannon's information theory, the information $I_i$ associated with the $i^{th}$ event is defined as
$$ I_i = -\ln P_i \qquad \qquad \forall\, i = 1, \dots, n. $$
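For concreteness, here is how I understand this definition numerically (my own example, so please correct me if I am misreading it): an almost certain event with $P_i = 0.99$ has $I_i = -\ln 0.99 \approx 0.01$, while a rare event with $P_i = 0.01$ has $I_i = -\ln 0.01 \approx 4.6$, so the rarer the event, the larger the information assigned to it.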
Based on this definition, the Shannon entropy is then defined as the average information,
$$ S_\text{Shannon} = \langle I\rangle = -\sum\limits_{i=1}^n P_i \ln P_i ,$$
where I have used the natural logarithm throughout to match the definition of $I_i$ above (using $\log_2$ instead would only change the units from nats to bits).
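Again as a concrete example of my own (assuming I am applying the formula correctly): for a fair coin with $P_1 = P_2 = \tfrac12$,
$$ S_\text{Shannon} = -\tfrac12\ln\tfrac12 - \tfrac12\ln\tfrac12 = \ln 2 \approx 0.69, $$
whereas for a heavily biased coin with $P_1 = 0.99$, $P_2 = 0.01$,
$$ S_\text{Shannon} = -0.99\ln 0.99 - 0.01\ln 0.01 \approx 0.056, $$
so the average information is largest when the outcomes are equally likely, even though the rare outcome individually carries a lot of information.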
My question is: what is the motivation behind defining information (and hence entropy) as a quantity that is inversely related to the probability? My professor told me that the lower the probability of an event, the more information it carries, but I am still not convinced of this.
Secondly, what is the reason for choosing a logarithmic function in this definition? Are there situations where this definition of information is abandoned?