

What is information theory?

Information theory is the scientific study of how to quantify, store, and communicate the information carried by events, random variables, and probability distributions.

It is a subfield of mathematics concerned, among other things, with transmitting data across a noisy channel. The field was proposed and developed by Claude Shannon while he was working at Bell Labs, the research laboratory of the US Bell telephone system.

Information theory is concerned with representing data in a compact form, as well as with transmitting and storing it in a way that is robust to errors.

In information theory, it is important to be able to quantify how much information a message contains. Measures of information are widely used in artificial intelligence and machine learning, for example in feature selection, in the construction of decision trees and other classification models, and in the optimization of classifier models. Information theory is also needed in the following fields:

Statistical Inference
Cryptography
Neurobiology
Perception
Linguistics
Evolution and function of molecular codes (bioinformatics)
Thermal physics
Molecular dynamics
Quantum computing
Black holes
Information retrieval
Plagiarism detection
Pattern recognition
Anomaly detection

A key measure in information theory is entropy.

How to measure information?

To measure the information content of a message quantitatively, we first need an intuitive notion of the amount of information.

Consider the following forecasts for a trip to Dinajpur on a winter evening:

It is a cold day
It is a cloudy day
Possible snow flurries

The amount of information received clearly differs among these messages:

The forecast of a ‘cold day’ contains very little information, since the weather in the city is cold for most of the winter season.
The forecast of a ‘cloudy day’ contains more information, since it is not an event that occurs often.
In contrast, the forecast of ‘snow flurries’ conveys even more information, since the occurrence of snow in the city is a rare event.

Intuitively, then, once we know that an event has occurred, what can be said about the amount of information conveyed?

It is related to the probability of occurrence of the event. The message associated with the event least likely to occur contains the most information.

How to calculate the information for an event?

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Events that are rare (low probability) are more surprising and therefore carry more information than events that are common (high probability). In other words, an unlikely event that has occurred is more informative than a likely one.

Low Probability Event: High Information (surprising)
High Probability Event: Low Information (unsurprising).

We can calculate the amount of information in an event from the probability of the event. This quantity is called "Shannon information", "self-information", or simply "information", and can be calculated for a discrete event x as follows:

information(x) = -log( p(x) )
where log() is the base-2 logarithm and p(x) is the probability of the event x.

The choice of the base-2 logarithm means that the unit of the information measure is the bit (binary digit). In the information processing sense, this can be interpreted directly as the number of bits required to represent the event.
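The "number of bits" interpretation can be made concrete with a small sketch. It assumes a hypothetical source with 8 equally likely outcomes (the value 8 and the names n_outcomes and codewords are illustrative choices, not part of the original text). Each outcome then carries 3 bits of information, which matches the length of a fixed-width binary label for it.

```python
from math import log2

n_outcomes = 8                 # hypothetical number of equally likely outcomes
p = 1 / n_outcomes             # probability of any single outcome
bits = -log2(p)                # information of one outcome, in bits
print('p(x)=%.3f, information: %.2f bits' % (p, bits))

# a fixed-length binary label needs that many binary digits per outcome
width = int(round(bits))
codewords = [format(i, '0%db' % width) for i in range(n_outcomes)]
print(codewords)
```

For probabilities that are not powers of 1/2 the information is not a whole number, so the correspondence with label length is only approximate in that case.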

The calculation of information is often written as h(); for example:

h(x) = -log( p(x) )

Information is zero when the probability of an event is 1.0, i.e. when the event is a certainty and there is no surprise.
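As a minimal sketch, the formula can be wrapped in a small helper that also rejects values that are not valid probabilities. The function name self_information and the choice to raise ValueError are assumptions made for this illustration.

```python
from math import log2

def self_information(p):
    """Return the Shannon information, in bits, of an event with probability p."""
    # the logarithm is undefined at p = 0, and p > 1 is not a probability
    if not 0.0 < p <= 1.0:
        raise ValueError('p must be in the interval (0, 1]')
    # adding 0.0 turns the -0.0 produced for p = 1.0 into a plain 0.0
    return -log2(p) + 0.0

print(self_information(1.0))   # certain event, prints 0.0
print(self_information(0.5))   # event with probability one half, prints 1.0
```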

Example:

Consider a toss of a single fair coin. The probability of heads (and of tails) is 0.5. The information for a result of heads can be computed with the following Python code:

from math import log2

p = 0.5
h = -log2(p)
print('p(x)=%.2f, information: %.2f bits' % (p, h))

p(x)=0.50, information: 1.00 bits

If the coin is not fair and the probability of heads is 10% (0.1), then heads is a rarer event and carries more than 3 bits of information.

from math import log2

p = 0.1
h = -log2(p)
print('p(x)=%.2f, information: %.2f bits' % (p, h))

p(x)=0.10, information: 3.32 bits

Now consider rolling a fair six-sided die. The probability of each number appearing is 1/6. What is the information in rolling a 6?

from math import log2

p = 1/6
h = -log2(p)
print('p(x)=%.2f, information: %.2f bits' % (p, h))

p(x)=0.17, information: 2.58 bits
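The same calculation can be applied to the weather forecasts from the earlier example. The probabilities below are hypothetical values chosen only to reflect the ordering described there (cold is common, cloudy is less common, snow is rare). They are not measured data.

```python
from math import log2

# hypothetical probabilities for each forecast (illustrative only)
forecasts = {
    'cold day': 0.9,
    'cloudy day': 0.3,
    'snow flurries': 0.01,
}

# information of each message, in bits
info = {message: -log2(p) for message, p in forecasts.items()}

for message, p in forecasts.items():
    print('%-14s p(x)=%.2f, information: %.2f bits' % (message, p, info[message]))
```

Under these assumed values the rarest message, snow flurries, carries the most information and the most common one, a cold day, the least.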

If the base of the logarithm is 2, the information is measured in bits. If the base is e, it is measured in nats. If the base is 10, it is measured in hartleys (also called decits).
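A short sketch of the three units, using the fair-coin probability from the first example. Changing the base only rescales the result by a constant factor, so the three numbers describe the same amount of information.

```python
from math import log, log2, log10

p = 0.5                  # probability of heads for a fair coin

bits = -log2(p)          # base 2
nats = -log(p)           # base e (natural logarithm)
hartleys = -log10(p)     # base 10

print('%.3f bits, %.3f nats, %.3f hartleys' % (bits, nats, hartleys))

# converting from bits: multiply by ln(2) for nats, by log10(2) for hartleys
print('%.3f nats, %.3f hartleys' % (bits * log(2), bits * log10(2)))
```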

We can conclude that the higher the probability of an event, the less information it carries. To make this clear, we can calculate the information for probabilities between 0 and 1 and plot the result for each.


from math import log2
from matplotlib import pyplot
# list of possible probabilities
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# calculating information in array
info = [-log2(p) for p in probs]

# plot probability vs information
pyplot.plot(probs, info, marker='.')
pyplot.title('Probability vs Information')
pyplot.xlabel('Probability')
pyplot.ylabel('Information')
pyplot.show()
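The plot starts at a probability of 0.1, but the information keeps growing as the probability approaches zero. A minimal sketch of this behaviour, using probabilities chosen for illustration so that each one is half of the previous one: every halving adds exactly one bit.

```python
from math import log2

# probabilities 1/2, 1/4, 1/8, ... each half of the one before
probs = [1 / 2**k for k in range(1, 7)]

# information for each probability, in bits
info = [-log2(p) for p in probs]

for p, h in zip(probs, info):
    print('p(x)=%.6f, information: %.2f bits' % (p, h))
```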

Why unify information theory and machine learning?

State-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning. That is why information theory and machine learning are often described as two sides of the same coin.


Statlearner
