What is Information theory?
Information theory is the scientific study of how to quantify information in things like events, random variables, and distributions, for the purposes of storage and communication.
It is a subfield of mathematics concerned with transmitting data across a noisy channel. The field was proposed and developed by Claude Shannon while he was working at Bell Labs, the research laboratory of the US Bell telephone system.
Information theory is concerned with representing data in a compact form, as well as with transmitting and storing it in a way that is robust to errors.
In information theory, it is important to have a way of quantifying how much information there is in a message. Measures of the information in a message are widely used in artificial intelligence and machine learning, for example in feature selection, in the construction of decision trees and other classification models, and in the optimization of classifier models. The following are some other fields where information theory is needed:
A key measure in information theory is entropy.
How to measure information?
To measure the information content of a message quantitatively, we first need an intuitive notion of the amount of information.
Consider the following examples: A trip to Dinajpur in the winter time during evening hours,
The amount of information received is obviously different for these messages:
Intuitively, then, given the knowledge that an event has occurred, what can be said about the amount of information conveyed?
It is related to the probability of occurrence of the event. The message associated with the event least likely to occur contains the most information.
How to calculate the Information for an Event?
The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Events that are rare (low probability) are more surprising and therefore carry more information than events that are common (high probability). That is, an unlikely event that has occurred is more informative than a likely one.
We can calculate the amount of information in an event from the probability of the event. This is called "Shannon information", "self-information", or simply "information", and can be calculated for a discrete event x as follows:
information(x) = -log( p(x) )
where log() is the base-2 logarithm and p(x) is the probability of the event x.
The choice of the base-2 logarithm means that the units of the information measure are bits (binary digits). This can be directly interpreted in the information processing sense as the number of bits required to represent the event.
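As a rough check of this interpretation, the sketch below compares the information of an event with the number of binary digits needed to label it. It assumes n equally likely outcomes numbered 0 to n - 1, which is a setup chosen here for illustration and is not part of the examples in this article.

```python
from math import log2

# Assumption for illustration: n equally likely outcomes, each with p = 1/n,
# labelled with the integers 0 .. n-1.
for n in (2, 4, 8, 16):
    p = 1 / n
    h = -log2(p)
    # Number of binary digits needed to write the largest label, n - 1.
    digits = (n - 1).bit_length()
    print('outcomes=%d, information: %.2f bits, label width: %d binary digits' % (n, h, digits))
```

For these power-of-two cases the two numbers agree. For other probabilities the information is generally not a whole number, so it is better read as an idealized cost than as a literal digit count.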
The calculation of information is often written as h(); for example:
h(x) = -log( p(x) )
The information is zero when the probability of an event is 1.0, i.e. the event is a certainty and there is no surprise.
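The formula and the zero-information case can be wrapped in a small helper. This is only a sketch: the function name information() and the decision to reject probabilities outside (0, 1] are choices made here, not something the formula prescribes.

```python
from math import log2

def information(p):
    """Return the Shannon information, in bits, of an event with probability p."""
    # Hypothetical validation: log2 is undefined at 0, and p > 1 is not a probability.
    if not 0.0 < p <= 1.0:
        raise ValueError('p must be in the interval (0, 1]')
    # -log2(1.0) evaluates to the float -0.0; adding 0.0 normalizes it to 0.0.
    return -log2(p) + 0.0

print('certain event: %.2f bits' % information(1.0))
print('fair coin: %.2f bits' % information(0.5))
```

Without the normalization step, a certain event would print as -0.00 bits, which is numerically equal to zero but reads oddly.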
Example:
Consider a toss of a single fair coin. The probability of heads (and of tails) is 0.5. The information for a result of heads is (using Python code):
from math import log2
p = 0.5
h = -log2(p)
print('p(x)=%.2f, information: %.2f bits' % (p, h))
p(x)=0.50, information: 1.00 bits
If the coin is not fair and the probability of a head is 10% (0.1), then the event is rarer and carries more than 3 bits of information.
from math import log2
p = 0.1
h = -log2(p)
print('p(x)=%.2f, information: %.2f bits' % (p, h))
p(x)=0.10, information: 3.32 bits
Now consider rolling a fair six-sided die. The probability of each number appearing is 1/6. What is the information in rolling a 6?
from math import log2
p = 1/6
h = -log2(p)
print('p(x)=%.2f, information: %.2f bits' % (p, h))
p(x)=0.17, information: 2.58 bits
If the base of the logarithm is 2, the information is measured in bits. If the base is e, it is measured in nats. If the base is 10, it is measured in hartleys (also called decits).
So it can be concluded that the higher the probability of occurrence, the less information the event carries. To make this clear, we can calculate the information for probabilities between 0 and 1 and plot the corresponding information for each.
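The three units differ only in the base of the logarithm, so the same event can be expressed in each of them. A minimal sketch, reusing the fair-coin probability from the first example (the variable names are illustrative):

```python
from math import log, log2, log10

p = 0.5  # probability of heads for a fair coin

bits = -log2(p)       # base 2
nats = -log(p)        # base e (natural logarithm)
hartleys = -log10(p)  # base 10

print('p(x)=%.2f: %.4f bits, %.4f nats, %.4f hartleys' % (p, bits, nats, hartleys))
```

Converting between units is a matter of multiplying by a constant; for example, the value in nats equals the value in bits multiplied by ln(2).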
from math import log2
from matplotlib import pyplot
# list of possible probabilities
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# calculate the information (in bits) for each probability
info = [-log2(p) for p in probs]
# plot probability vs information
pyplot.plot(probs, info, marker='.')
pyplot.title('Probability vs Information')
pyplot.xlabel('Probability')
pyplot.ylabel('Information')
pyplot.show()

Why unify information theory and machine learning?
The state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning. That is why the two fields are described as two sides of the same coin.