
Information content

In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.

The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.[1]

The information content can be expressed in various units of information, of which the most common is the "bit" (more correctly called the shannon), as explained below.

Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:

  1. An event with probability 100% is perfectly unsurprising and yields no information.
  2. The less probable an event is, the more surprising it is and the more information it yields.
  3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number $b > 1$ and an event $x$ with probability $P$, the information content is defined as follows:

$\mathrm{I}(x) = -\log_b[\Pr(x)] = -\log_b(P).$

The base b corresponds to the scaling factor above. Different choices of b correspond to different units of information: when b = 2, the unit is the shannon (symbol Sh), often called a 'bit'; when b = e, the unit is the natural unit of information (symbol nat); and when b = 10, the unit is the hartley (symbol Hart).

Formally, given a discrete random variable $X$ with probability mass function $p_X(x)$, the self-information of measuring $X$ as outcome $x$ is defined as[2]

$\operatorname{I}_X(x) = -\log[p_X(x)] = \log\left(\frac{1}{p_X(x)}\right).$

The use of the notation $I_X(x)$ for self-information above is not universal. Since the notation $I(X;Y)$ is also often used for the related quantity of mutual information, many authors use a lowercase $h_X(x)$ for self-entropy instead, mirroring the use of the capital $H(X)$ for the entropy.
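
As a concrete numerical illustration (a sketch added here, not part of the original article), the definition can be evaluated directly. The helper name self_information below is purely illustrative; it shows the same probability expressed in shannons, nats and hartleys by changing the base b.

    import math

    def self_information(p, base=2):
        """Self-information -log_b(p) of an outcome with probability p."""
        return -math.log(p, base)

    p = 0.25
    print(self_information(p, base=2))        # 2.0 shannons (bits)
    print(self_information(p, base=math.e))   # ~1.386 nats
    print(self_information(p, base=10))       # ~0.602 hartleys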

Properties

Monotonically decreasing function of probability

For a given probability space, the measurement of rarer events is intuitively more "surprising", and yields more information content, than the measurement of more common values. Thus, self-information is a strictly decreasing monotonic function of the probability, sometimes called an "antitonic" function.

While standard probabilities are represented by real numbers in the interval $[0, 1]$, self-informations are represented by extended real numbers in the interval $[0, \infty]$. In particular, we have the following, for any choice of logarithmic base:

  • If a particular event has a 100% probability of occurring, then its self-information is $-\log(1) = 0$: its occurrence is "perfectly non-surprising" and yields no information.
  • If a particular event has a 0% probability of occurring, then its self-information is $-\log(0) = \infty$: its occurrence is "infinitely surprising".

From this, we can get a few general properties:

  • Intuitively, more information is gained from observing an unexpected event—it is "surprising".
    • For example, if there is a one-in-a-million chance of Alice winning the lottery, her friend Bob will gain significantly more information from learning that she won than that she lost on a given day. (See also Lottery mathematics.)
  • This establishes an implicit relationship between the self-information of a random variable and its variance.

Relationship to log-odds

The Shannon information is closely related to the log-odds. In particular, given some event $x$, suppose that $p(x)$ is the probability of $x$ occurring, and that $p(\lnot x) = 1 - p(x)$ is the probability of $x$ not occurring. Then we have the following definition of the log-odds:

$\text{log-odds}(x) = \log\left(\frac{p(x)}{p(\lnot x)}\right)$

This can be expressed as a difference of two Shannon informations:

$\text{log-odds}(x) = \mathrm{I}(\lnot x) - \mathrm{I}(x)$

In other words, the log-odds can be interpreted as the level of surprise when the event doesn't happen, minus the level of surprise when the event does happen.
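
As a quick numerical check (illustrative only; the probability 0.8 is arbitrary and the helper info mirrors the notation I(·) above), the log-odds and the difference of the two surprisals agree:

    import math

    def info(p):                    # self-information I(.) in shannons
        return -math.log2(p)

    p = 0.8                         # probability of the event x (chosen arbitrarily)
    print(math.log2(p / (1 - p)))   # log-odds of x: 2.0
    print(info(1 - p) - info(p))    # I(not x) - I(x): also 2.0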

Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and sigma additivity in particular in measure and probability theory. Consider two independent random variables $X, Y$ with probability mass functions $p_X(x)$ and $p_Y(y)$ respectively. The joint probability mass function is

$p_{X,Y}(x, y) = \Pr(X = x,\, Y = y) = p_X(x)\,p_Y(y)$

because $X$ and $Y$ are independent. The information content of the outcome $(X, Y) = (x, y)$ is

$\operatorname{I}_{X,Y}(x, y) = -\log_2[p_{X,Y}(x, y)] = -\log_2[p_X(x)\,p_Y(y)] = -\log_2[p_X(x)] - \log_2[p_Y(y)] = \operatorname{I}_X(x) + \operatorname{I}_Y(y).$
See § Two independent, identically distributed dice below for an example.

The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.
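
Additivity is likewise easy to verify numerically. The following sketch (the two probabilities are arbitrary; info again mirrors I(·)) confirms that the surprisal of a joint outcome of two independent events equals the sum of the individual surprisals:

    import math

    def info(p):                     # self-information I(.) in shannons
        return -math.log2(p)

    p_x, p_y = 1/6, 1/2              # probabilities of two independent events (arbitrary)
    print(info(p_x * p_y))           # surprisal of both occurring: ~3.585 Sh
    print(info(p_x) + info(p_y))     # the same value, by additivity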

Relationship to entropy

The Shannon entropy of the random variable $X$ above is defined as

$\mathrm{H}(X) = \sum_x -p_X(x)\log p_X(x) = \sum_x p_X(x)\operatorname{I}_X(x) \overset{\mathrm{def}}{=} \operatorname{E}[\operatorname{I}_X(X)],$

by definition equal to the expected information content of measurement of $X$.[3]: 11 [4]: 19–20  The expectation is taken over the discrete values of its support.
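
For instance, the entropy of a biased coin can be computed as the probability-weighted average of the self-informations; the sketch below is illustrative only and the bias of 0.9 is chosen arbitrarily.

    import math

    def info(p):                     # self-information I(.) in shannons
        return -math.log2(p)

    pmf = {"H": 0.9, "T": 0.1}       # a biased coin, chosen for illustration
    entropy = sum(p * info(p) for p in pmf.values())
    print(entropy)                   # ~0.469 Sh: the expected self-information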

Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies $\mathrm{H}(X) = \operatorname{I}(X;X)$, where $\operatorname{I}(X;X)$ is the mutual information of $X$ with itself.[5]

For continuous random variables the corresponding concept is differential entropy.

Notes

This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was coined by Myron Tribus in his 1961 book Thermostatics and Thermodynamics.[6][7]

When the event is a random realization (of a variable), the self-information of the variable is defined as the expected value of the self-information of the realization.

Self-information is an example of a proper scoring rule.

Examples

Fair coin toss

Consider the Bernoulli trial of tossing a fair coin $X$. The probabilities of the events of the coin landing as heads $\text{H}$ and tails $\text{T}$ (see fair coin and obverse and reverse) are one half each, $p_X(\text{H}) = p_X(\text{T}) = \tfrac{1}{2} = 0.5$. Upon measuring the variable as heads, the associated information gain is

$\operatorname{I}_X(\text{H}) = -\log_2 p_X(\text{H}) = -\log_2\tfrac{1}{2} = 1,$

so the information gain of a fair coin landing as heads is 1 shannon.[2] Likewise, the information gain of measuring tails $\text{T}$ is

$\operatorname{I}_X(\text{T}) = -\log_2 p_X(\text{T}) = -\log_2\tfrac{1}{2} = 1 \text{ Sh}.$

Fair die roll

Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable $X \sim \mathrm{DU}[1, 6]$ with probability mass function

$p_X(k) = \begin{cases} \frac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise.} \end{cases}$

The probability of rolling a 4 is $p_X(4) = \tfrac{1}{6}$, as for any other valid roll. The information content of rolling a 4 is thus

$\operatorname{I}_X(4) = -\log_2 p_X(4) = -\log_2\tfrac{1}{6} \approx 2.585 \text{ Sh}$

of information.
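
Both of the preceding examples reduce to a single base-2 logarithm; a minimal Python check (illustrative only) is:

    import math

    print(-math.log2(1/2))   # fair coin: 1.0 Sh per outcome
    print(-math.log2(1/6))   # fair six-sided die: ~2.585 Sh per outcome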

Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables $X, Y \sim \mathrm{DU}[1, 6]$, each corresponding to an independent fair 6-sided die roll. The joint distribution of $X$ and $Y$ is

$p_{X,Y}(x, y) = \Pr(X = x,\, Y = y) = p_X(x)\,p_Y(y) = \begin{cases} \frac{1}{36}, & x, y \in [1, 6] \cap \mathbb{N} \\ 0, & \text{otherwise.} \end{cases}$

The information content of the random variate $(X, Y) = (2, 4)$ is

$\operatorname{I}_{X,Y}(2, 4) = -\log_2[p_{X,Y}(2, 4)] = \log_2 36 = 2\log_2 6 \approx 5.169925 \text{ Sh},$

and can also be calculated by additivity of events:

$\operatorname{I}_{X,Y}(2, 4) = -\log_2[p_{X,Y}(2, 4)] = -\log_2[p_X(2)] - \log_2[p_Y(4)] = 2\log_2 6 \approx 5.169925 \text{ Sh}.$
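
A short numerical sketch (illustrative only) confirms that the surprisal computed from the joint pmf matches the sum of the marginal surprisals:

    import math

    p_joint = (1/6) * (1/6)                    # Pr(X = 2, Y = 4) for independent fair dice
    print(-math.log2(p_joint))                 # ~5.1699 Sh, from the joint pmf
    print(-math.log2(1/6) - math.log2(1/6))    # the same value, by additivity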

Information from frequency of rolls

If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables

$C_k = \delta_k(X) + \delta_k(Y) = \begin{cases} 0, & \neg(X = k \vee Y = k) \\ 1, & X = k \,\veebar\, Y = k \\ 2, & X = k \wedge Y = k \end{cases}$

for $k \in \{1, 2, 3, 4, 5, 6\}$; then $\sum_{k=1}^{6} C_k = 2$ and the counts have the multinomial distribution

$f(c_1, \ldots, c_6) = \Pr(C_1 = c_1 \text{ and } \dots \text{ and } C_6 = c_6) = \begin{cases} \dfrac{1}{18} \cdot \dfrac{1}{c_1! \cdots c_6!}, & \text{when } \sum_{i=1}^{6} c_i = 2 \\ 0, & \text{otherwise} \end{cases} = \begin{cases} \dfrac{1}{18}, & \text{when two of the } c_k \text{ are } 1 \\ \dfrac{1}{36}, & \text{when exactly one } c_k = 2 \\ 0, & \text{otherwise.} \end{cases}$

To verify this, the 6 outcomes $(X, Y) \in \{(k, k)\}_{k=1}^{6} = \{(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)\}$ correspond to the event $C_k = 2$ and a total probability of 1/6. These are the only events that are faithfully preserved with identity of which die rolled which outcome, because the outcomes are the same. Without knowledge to distinguish the dice rolling the other numbers, the other $\binom{6}{2} = 15$ combinations correspond to one die rolling one number and the other die rolling a different number, each having probability 1/18. Indeed, $6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1$, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for example the events $A_k = \{(X, Y) = (k, k)\}$ and $B_{j,k} = \{c_j = 1\} \cap \{c_k = 1\}$ for $j \neq k$, $1 \leq j, k \leq 6$. For example, $A_2 = \{X = 2 \text{ and } Y = 2\}$ and $B_{3,4} = \{(3, 4), (4, 3)\}$.

The information contents are

$\operatorname{I}(A_2) = -\log_2\tfrac{1}{36} \approx 5.169925 \text{ Sh}$
$\operatorname{I}(B_{3,4}) = -\log_2\tfrac{1}{18} \approx 4.169925 \text{ Sh}$

Let $\text{Same} = \bigcup_{i=1}^{6} A_i$ be the event that both dice rolled the same value and $\text{Diff} = \overline{\text{Same}}$ be the event that the dice differed. Then $\Pr(\text{Same}) = \tfrac{1}{6}$ and $\Pr(\text{Diff}) = \tfrac{5}{6}$. The information contents of the events are

$\operatorname{I}(\text{Same}) = -\log_2\tfrac{1}{6} \approx 2.5849625 \text{ Sh}$
$\operatorname{I}(\text{Diff}) = -\log_2\tfrac{5}{6} \approx 0.2630344 \text{ Sh}$
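
These probabilities and information contents can be checked by brute-force enumeration of the 36 equally likely ordered rolls. The sketch below is illustrative only; the helper info is just a label chosen here, and each event is passed as a predicate on an ordered pair (x, y).

    import math
    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely ordered (x, y) rolls

    def info(event):
        """Self-information, in shannons, of an event given as a predicate on (x, y)."""
        p = sum(1 for o in outcomes if event(o)) / len(outcomes)
        return -math.log2(p)

    print(info(lambda o: o == (2, 2)))        # A_2: both dice show 2, ~5.1699 Sh
    print(info(lambda o: set(o) == {3, 4}))   # B_{3,4}: one 3 and one 4, ~4.1699 Sh
    print(info(lambda o: o[0] == o[1]))       # Same: ~2.5850 Sh
    print(info(lambda o: o[0] != o[1]))       # Diff: ~0.2630 Sh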

Information from sum of die

The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair 6-sided dice rolls, the random variable $Z = X + Y$ has probability mass function $p_Z(z) = p_X(x) * p_Y(y) = \frac{6 - |z - 7|}{36}$, where $*$ represents the discrete convolution. The outcome $Z = 5$ has probability $p_Z(5) = \tfrac{4}{36} = \tfrac{1}{9}$. Therefore, the information asserted is

$\operatorname{I}_Z(5) = -\log_2\tfrac{1}{9} = \log_2 9 \approx 3.169925 \text{ Sh}.$
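
The same figure can be recovered by enumerating the 36 equally likely rolls, which amounts to computing the discrete convolution; the sketch below is illustrative only.

    import math
    from collections import Counter
    from itertools import product

    # pmf of Z = X + Y via enumeration of the 36 equally likely rolls (discrete convolution)
    counts = Counter(x + y for x, y in product(range(1, 7), repeat=2))
    p5 = counts[5] / 36          # 4/36 = 1/9
    print(-math.log2(p5))        # ~3.1699 Sh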

General discrete uniform distribution

Generalizing the § Fair die roll example above, consider a general discrete uniform random variable (DURV) $X \sim \mathrm{DU}[a, b]$ for $a, b \in \mathbb{Z}$ with $b \geq a$. For convenience, define $N := b - a + 1$. The probability mass function is

$p_X(k) = \begin{cases} \frac{1}{N}, & k \in [a, b] \cap \mathbb{Z} \\ 0, & \text{otherwise.} \end{cases}$

In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable.[2] The information gain of any observation $X = k$ is

$\operatorname{I}_X(k) = -\log_2\tfrac{1}{N} = \log_2 N \text{ Sh}.$

Special case: constant random variable

If $b = a$ above, $X$ degenerates to a constant random variable with probability distribution deterministically given by $X = b$ and probability measure the Dirac measure $p_X(k) = \delta_b(k)$. The only value $X$ can take is deterministically $b$, so the information content of any measurement of $X$ is

$\operatorname{I}_X(b) = -\log_2(1) = 0.$
In general, there is no information gained from measuring a known value.[2]

Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable with support $\mathcal{S} = \{s_i\}_{i=1}^{N}$ and probability mass function given by

$p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise.} \end{cases}$

For the purposes of information theory, the values $s \in \mathcal{S}$ do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure $p$. Without loss of generality, we can assume the categorical distribution is supported on the set $[N] = \{1, 2, \dots, N\}$; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well.

The information of the outcome $X = x$ is given by

$\operatorname{I}_X(x) = -\log_2(p_X(x)).$

From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.
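
For example, a categorical variable over arbitrary, non-numeric labels is handled exactly as above. In the sketch below the support and the probabilities are invented purely for illustration:

    import math

    pmf = {"red": 0.5, "green": 0.25, "blue": 0.125, "yellow": 0.125}   # illustrative pmf

    def info(outcome):
        """Self-information, in shannons, of observing the given category."""
        return -math.log2(pmf[outcome])

    print(info("red"))      # 1.0 Sh
    print(info("green"))    # 2.0 Sh
    print(info("blue"))     # 3.0 Sh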

Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin:

Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.[8]

Assuming that one does not reside near the polar regions, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event, $\omega_n$, depends only on the probability of that event.

$\operatorname{I}(\omega_n) = f(\operatorname{P}(\omega_n))$

for some function $f(\cdot)$ to be determined below. If $\operatorname{P}(\omega_n) = 1$, then $\operatorname{I}(\omega_n) = 0$. If $\operatorname{P}(\omega_n) < 1$, then $\operatorname{I}(\omega_n) > 0$.

Further, by definition, the measure of self-information is nonnegative and additive. If a message informing of event $C$ is the intersection of two independent events $A$ and $B$, then the information of event $C$ occurring is that of the compound message of both independent events $A$ and $B$ occurring. The quantity of information of compound message $C$ would be expected to equal the sum of the amounts of information of the individual component messages $A$ and $B$ respectively:

$\operatorname{I}(C) = \operatorname{I}(A \cap B) = \operatorname{I}(A) + \operatorname{I}(B).$

Because of the independence of events $A$ and $B$, the probability of event $C$ is

$\operatorname{P}(C) = \operatorname{P}(A \cap B) = \operatorname{P}(A) \cdot \operatorname{P}(B).$

However, applying function $f(\cdot)$ results in

$\operatorname{I}(C) = \operatorname{I}(A) + \operatorname{I}(B)$
$f(\operatorname{P}(C)) = f(\operatorname{P}(A)) + f(\operatorname{P}(B)) = f\big(\operatorname{P}(A) \cdot \operatorname{P}(B)\big).$

Thanks to work on Cauchy's functional equation, the only monotone functions $f(\cdot)$ having the property such that

$f(x \cdot y) = f(x) + f(y)$

are the logarithm functions $\log_b(x)$. The only operational difference between logarithms of different bases is that of different scaling constants, so we may assume

$f(x) = K \log(x),$

where $\log$ is the natural logarithm. Since the probabilities of events are always between 0 and 1 and the information associated with these events must be nonnegative, that requires that $K < 0$.

Taking into account these properties, the self-information $\operatorname{I}(\omega_n)$ associated with outcome $\omega_n$ with probability $\operatorname{P}(\omega_n)$ is defined as:

$\operatorname{I}(\omega_n) = -\log(\operatorname{P}(\omega_n)) = \log\left(\frac{1}{\operatorname{P}(\omega_n)}\right).$

The smaller the probability of event $\omega_n$, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of $\operatorname{I}(\omega_n)$ is the shannon. This is the most common practice. When using the natural logarithm of base $e$, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 shannons (probability 1/16), and the information content associated with getting a result other than the one specified would be ~0.09 shannons (probability 15/16). See above for detailed examples.
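
Both figures follow directly from the formula above; a minimal check (illustrative only) is:

    import math

    p_four_heads = (1/2) ** 4                # probability 1/16 of four heads in a row
    print(-math.log2(p_four_heads))          # 4.0 Sh
    print(-math.log2(1 - p_four_heads))      # ~0.093 Sh for "not four heads" (probability 15/16)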

See also

  • Kolmogorov complexity
  • Surprisal analysis

References

  1. ^ Jones, D. S. (1979). Elementary Information Theory. Oxford: Clarendon Press. pp. 11–15.
  2. ^ a b c d McMahon, David M. (2008). Quantum Computing Explained. Hoboken, NJ: Wiley-Interscience. ISBN 9780470181386. OCLC 608622533.
  3. ^ Borda, Monica (2011). Fundamentals in Information Theory and Coding. Springer. ISBN 978-3-642-20346-6.
  4. ^ Han, Te Sun; Kobayashi, Kingo (2002). Mathematics of Information and Coding. American Mathematical Society. ISBN 978-0-8218-4256-0.
  5. ^ Thomas M. Cover, Joy A. Thomas; Elements of Information Theory; p. 20; 1991.
  6. ^ R. B. Bernstein and R. D. Levine (1972) "Entropy and Chemical Change. I. Characterization of Product (and Reactant) Energy Distributions in Reactive Molecular Collisions: Information and Entropy Deficiency", The Journal of Chemical Physics 57, 434–449.
  7. ^ Myron Tribus (1961) Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications (D. Van Nostrand, New York), pp. 64–66.
  8. ^ "A quote by George Carlin". www.goodreads.com. Retrieved 2021-04-01.

Further reading

  • C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, Vol. 27, pp. 379–423 (Part I), 1948.

External links

  • Examples of surprisal measures
  • Surprisal entry in a glossary of molecular information theory
  • Bayesian Theory of Surprise
