
Kullback–Leibler divergence

In mathematical statistics, the Kullback–Leibler (KL) divergence (also called relative entropy and I-divergence[1]), denoted $D_\text{KL}(P \parallel Q)$, is a type of statistical distance: a measure of how one probability distribution P is different from a second, reference probability distribution Q.[2][3] A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model instead of P when the actual distribution is P. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions (in contrast to variation of information), and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence,[4] a generalization of squared distance, and for certain classes of distributions (notably an exponential family), it satisfies a generalized Pythagorean theorem (which applies to squared distances).[5]

A relative entropy of 0 indicates that the two distributions in question are identical. Relative entropy is a nonnegative function of two distributions or measures. It has diverse applications, both theoretical, such as characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience and bioinformatics.

Introduction and context

Consider two probability distributions P and Q. Usually, P represents the data, the observations, or a measured probability distribution. Distribution Q represents instead a theory, a model, a description or an approximation of P. The Kullback–Leibler divergence $D_\text{KL}(P \parallel Q)$ is then interpreted as the average difference of the number of bits required for encoding samples of P using a code optimized for Q rather than one optimized for P. Note that the roles of P and Q can be reversed in some situations where that is easier to compute, such as with the expectation–maximization algorithm (EM) and evidence lower bound (ELBO) computations.

Etymology

The relative entropy was introduced by Solomon Kullback and Richard Leibler in Kullback & Leibler (1951) as "the mean information for discrimination between $H_1$ and $H_2$ per observation from $\mu_1$",[6] where one is comparing two probability measures $\mu_1, \mu_2$, and $H_1, H_2$ are the hypotheses that one is selecting from measure $\mu_1, \mu_2$ (respectively). They denoted this by $I(1:2)$, and defined the "'divergence' between $\mu_1$ and $\mu_2$" as the symmetrized quantity $J(1,2) = I(1:2) + I(2:1)$, which had already been defined and used by Harold Jeffreys in 1948.[7] In Kullback (1959), the symmetrized form is again referred to as the "divergence", and the relative entropies in each direction are referred to as "directed divergences" between two distributions;[8] Kullback preferred the term discrimination information.[9] The term "divergence" is in contrast to a distance (metric), since the symmetrized divergence does not satisfy the triangle inequality.[10] Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given in Kullback (1959, pp. 6–7, §1.3 Divergence). The asymmetric "directed divergence" has come to be known as the Kullback–Leibler divergence, while the symmetrized "divergence" is now referred to as the Jeffreys divergence.

Definition

For discrete probability distributions P and Q defined on the same sample space $\mathcal{X}$, the relative entropy from Q to P is defined[11] to be

$$D_\text{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log\frac{P(x)}{Q(x)}$$

which is equivalent to

$$D_\text{KL}(P \parallel Q) = -\sum_{x \in \mathcal{X}} P(x) \log\frac{Q(x)}{P(x)}$$

In other words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P.

Relative entropy is only defined in this way if, for all x, $Q(x) = 0$ implies $P(x) = 0$ (absolute continuity). Otherwise, it is often defined as $+\infty$,[1] but the value $+\infty$ is possible even if $Q(x) \neq 0$ everywhere,[12][13] provided that $\mathcal{X}$ is infinite in extent. Analogous comments apply to the continuous and general measure cases defined below.
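The discrete definition and these conventions are straightforward to implement; the following is a minimal sketch in Python with NumPy (the function name kl_divergence and its base argument are choices made for this illustration, not a standard API):

import numpy as np

def kl_divergence(p, q, base=np.e):
    """D_KL(P || Q) for discrete distributions given as arrays of probabilities.

    Terms with P(x) == 0 contribute zero; if Q(x) == 0 somewhere that P(x) > 0,
    absolute continuity fails and the divergence is taken to be +infinity.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

Passing base=2 gives the result in bits rather than nats.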

Whenever $P(x)$ is zero the contribution of the corresponding term is interpreted as zero because

$$\lim_{x \to 0^{+}} x \log x = 0.$$

For distributions P and Q of a continuous random variable, relative entropy is defined to be the integral[14]

$$D_\text{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log\frac{p(x)}{q(x)}\, dx$$

where p and q denote the probability densities of P and Q.

More generally, if P and Q are probability measures on a measurable space $\mathcal{X}$ and P is absolutely continuous with respect to Q, then the relative entropy from Q to P is defined as

$$D_\text{KL}(P \parallel Q) = \int_{x \in \mathcal{X}} \log\frac{P(dx)}{Q(dx)}\, P(dx)$$

where $\frac{P(dx)}{Q(dx)}$ is the Radon–Nikodym derivative of P with respect to Q, i.e. the unique Q-almost-everywhere defined function r on $\mathcal{X}$ such that $P(dx) = r(x)\,Q(dx)$, which exists because P is absolutely continuous with respect to Q. We also assume that the expression on the right-hand side exists. Equivalently (by the chain rule), this can be written as

$$D_\text{KL}(P \parallel Q) = \int_{x \in \mathcal{X}} \frac{P(dx)}{Q(dx)} \log\frac{P(dx)}{Q(dx)}\, Q(dx)$$

which is the entropy of P relative to Q. Continuing in this case, if $\mu$ is any measure on $\mathcal{X}$ for which densities p and q with $P(dx) = p(x)\,\mu(dx)$ and $Q(dx) = q(x)\,\mu(dx)$ exist (meaning that P and Q are both absolutely continuous with respect to $\mu$), then the relative entropy from Q to P is given as

$$D_\text{KL}(P \parallel Q) = \int_{x \in \mathcal{X}} p(x) \log\frac{p(x)}{q(x)}\, \mu(dx)$$

Note that such a measure $\mu$ for which densities can be defined always exists, since one can take $\mu = \frac{1}{2}(P + Q)$, although in practice it will usually be one that is natural in the context, such as counting measure for discrete distributions, or Lebesgue measure (or a convenient variant thereof, like Gaussian measure, the uniform measure on the sphere, or Haar measure on a Lie group) for continuous distributions. The logarithms in these formulae are usually taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving relative entropy hold regardless of the base of the logarithm.

Various conventions exist for referring to $D_\text{KL}(P \parallel Q)$ in words. Often it is referred to as the divergence between P and Q, but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of P from Q or as the divergence from Q to P. This reflects the asymmetry in Bayesian inference, which starts from a prior Q and updates to the posterior P. Another common way to refer to $D_\text{KL}(P \parallel Q)$ is as the relative entropy of P with respect to Q or the information gain from P over Q.

Basic example

Kullback[3] gives the following example (Table 2.1, Example 2.1). Let P and Q be the distributions shown in the table and figure. P is the distribution on the left side of the figure, a binomial distribution with $N = 2$ and $p = 0.4$. Q is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes $x =$ 0, 1, 2 (i.e. $\mathcal{X} = \{0, 1, 2\}$), each with probability $p = 1/3$.

Two distributions to illustrate relative entropy

x                      0        1        2
Distribution $P(x)$    9/25     12/25    4/25
Distribution $Q(x)$    1/3      1/3      1/3

Relative entropies $D_\text{KL}(P \parallel Q)$ and $D_\text{KL}(Q \parallel P)$ are calculated as follows. This example uses the natural log with base e, designated ln to get results in nats (see units of information):

$$D_\text{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \ln\frac{P(x)}{Q(x)} = \frac{9}{25}\ln\frac{9/25}{1/3} + \frac{12}{25}\ln\frac{12/25}{1/3} + \frac{4}{25}\ln\frac{4/25}{1/3} = \frac{1}{25}\left(32\ln 2 + 55\ln 3 - 50\ln 5\right) \approx 0.0852996$$

$$D_\text{KL}(Q \parallel P) = \sum_{x \in \mathcal{X}} Q(x) \ln\frac{Q(x)}{P(x)} = \frac{1}{3}\ln\frac{1/3}{9/25} + \frac{1}{3}\ln\frac{1/3}{12/25} + \frac{1}{3}\ln\frac{1/3}{4/25} = \frac{1}{3}\left(-4\ln 2 - 6\ln 3 + 6\ln 5\right) \approx 0.097455$$
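These two values can be checked numerically; a sketch using SciPy, whose scipy.special.rel_entr computes the elementwise terms $P(x)\ln(P(x)/Q(x))$ in nats:

import numpy as np
from scipy.special import rel_entr

p = np.array([9/25, 12/25, 4/25])   # binomial distribution with N = 2, p = 0.4
q = np.array([1/3, 1/3, 1/3])       # discrete uniform on {0, 1, 2}

print(rel_entr(p, q).sum())         # ~0.0852996 nats: D_KL(P || Q)
print(rel_entr(q, p).sum())         # ~0.097455 nats: D_KL(Q || P)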

Interpretations

Statistics

In the field of statistics, the Neyman–Pearson lemma states that the most powerful way to distinguish between the two distributions P and Q based on an observation Y (drawn from one of them) is through the log of the ratio of their likelihoods: $\log P(Y) - \log Q(Y)$. The KL divergence is the expected value of this statistic if Y is actually drawn from P. Kullback motivated the statistic as an expected log likelihood ratio.[15]
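This interpretation can be checked by simulation; the sketch below estimates the expected log-likelihood ratio by sampling from P and compares it with the closed form for two univariate normals given in the Examples section (the particular distributions and sample size are arbitrary choices for this illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# P = N(0, 1), Q = N(1, 2), chosen arbitrarily for the example
y = rng.normal(loc=0.0, scale=1.0, size=200_000)            # samples drawn from P
log_ratio = norm.logpdf(y, 0.0, 1.0) - norm.logpdf(y, 1.0, 2.0)
print(log_ratio.mean())                                     # Monte Carlo estimate of D_KL(P || Q)
print(np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5)          # closed form, ~0.4431 nats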

Coding

In the context of coding theory, $D_\text{KL}(P \parallel Q)$ can be constructed by measuring the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P.

Inference

In the context of machine learning, $D_\text{KL}(P \parallel Q)$ is often called the information gain achieved if P were used instead of Q, which is currently used. By analogy with information theory, it is called the relative entropy of P with respect to Q.

Expressed in the language of Bayesian inference, $D_\text{KL}(P \parallel Q)$ is a measure of the information gained by revising one's beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P.[16]

Information geometry

In applications, P typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q typically represents a theory, model, description, or approximation of P. In order to find a distribution Q that is closest to P, we can minimize the KL divergence and compute an information projection.

While it is a statistical distance, it is not a metric, the most familiar type of distance, but instead it is a divergence.[4] While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_\text{KL}(P \parallel Q)$ does not equal $D_\text{KL}(Q \parallel P)$, and the asymmetry is an important part of the geometry.[4] The infinitesimal form of relative entropy, specifically its Hessian, gives a metric tensor that equals the Fisher information metric; see § Fisher information metric. Relative entropy satisfies a generalized Pythagorean theorem for exponential families (geometrically interpreted as dually flat manifolds), and this allows one to minimize relative entropy by geometric means, for example by information projection and in maximum likelihood estimation.[5]

The relative entropy is the Bregman divergence generated by the negative entropy, but it is also of the form of an f-divergence. For probabilities over a finite alphabet, it is unique in being a member of both of these classes of statistical divergences.

Finance (game theory)

Consider a growth-optimizing investor in a fair game with mutually exclusive outcomes (e.g. a “horse race” in which the official odds add up to one). The rate of return expected by such an investor is equal to the relative entropy between the investor's believed probabilities and the official odds.[17] This is a special case of a much more general connection between financial returns and divergence measures.[18]

Financial risks are connected to $D_\text{KL}$ via information geometry.[19] Investors' views, the prevailing market view, and risky scenarios form triangles on the relevant manifold of probability distributions. The shape of the triangles determines key financial risks (both qualitatively and quantitatively). For instance, obtuse triangles in which investors' views and risk scenarios appear on "opposite sides" relative to the market describe negative risks, acute triangles describe positive exposure, and the right-angled situation in the middle corresponds to zero risk.

Motivation

[Figure: Illustration of the relative entropy for two normal distributions. The typical asymmetry is clearly visible.]

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities X can be seen as representing an implicit probability distribution $q(x_i) = 2^{-\ell_i}$ over X, where $\ell_i$ is the length of the code for $x_i$ in bits. Therefore, relative entropy can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P: it is the excess entropy.

$$D_\text{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} p(x) \log\frac{1}{q(x)} - \sum_{x \in \mathcal{X}} p(x) \log\frac{1}{p(x)} = \mathrm{H}(P, Q) - \mathrm{H}(P)$$

where $\mathrm{H}(P, Q)$ is the cross entropy of P and Q, and $\mathrm{H}(P)$ is the entropy of P (which is the same as the cross-entropy of P with itself).
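This decomposition is easy to see numerically; a short sketch in base-2 logarithms (bits), with an arbitrary example pair of distributions:

import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # "true" distribution P
q = np.array([0.25, 0.25, 0.25, 0.25])    # coding distribution Q

entropy_p     = -np.sum(p * np.log2(p))   # H(P): optimal average code length
cross_entropy = -np.sum(p * np.log2(q))   # H(P, Q): average length using a code built for Q
kl            =  np.sum(p * np.log2(p / q))

print(cross_entropy - entropy_p, kl)      # both 0.25 bits: the expected extra message length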

The relative entropy $D_\text{KL}(P \parallel Q)$ can be thought of geometrically as a statistical distance, a measure of how far the distribution Q is from the distribution P. Geometrically it is a divergence: an asymmetric, generalized form of squared distance. The cross-entropy $\mathrm{H}(P, Q)$ is itself such a measurement (formally a loss function), but it cannot be thought of as a distance, since $\mathrm{H}(P, P) = \mathrm{H}(P)$ is not zero. This can be fixed by subtracting $\mathrm{H}(P)$ to make $D_\text{KL}(P \parallel Q)$ agree more closely with our notion of distance, as the excess loss. The resulting function is asymmetric, and while this can be symmetrized (see § Symmetrised divergence), the asymmetric form is more useful. See § Interpretations for more on the geometric interpretation.

Relative entropy relates to "rate function" in the theory of large deviations.[20][21]

Arthur Hobson proved that relative entropy is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy.[22] Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.

Properties

  • Relative entropy is always non-negative,
     $$D_\text{KL}(P \parallel Q) \geq 0,$$
    a result known as Gibbs' inequality, with $D_\text{KL}(P \parallel Q)$ equal to zero if and only if $P = Q$ as measures.

In particular, if $P(dx) = p(x)\,\mu(dx)$ and $Q(dx) = q(x)\,\mu(dx)$, then $p(x) = q(x)$ $\mu$-almost everywhere. The entropy $\mathrm{H}(P)$ thus sets a minimum value for the cross-entropy $\mathrm{H}(P, Q)$, the expected number of bits required when using a code based on Q rather than P; and the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value x drawn from X, if a code is used corresponding to the probability distribution Q, rather than the "true" distribution P.

  • No upper bound exists for the general case. However, it is shown that if P and Q are two discrete probability distributions built by distributing the same discrete quantity, then the maximum value of $D_\text{KL}(P \parallel Q)$ can be calculated.[23]
  • Relative entropy remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable x to variable $y(x)$, then, since $P(dx) = p(x)\,dx = \tilde{p}(y)\,dy = \tilde{p}(y(x))\left|\tfrac{dy}{dx}(x)\right|dx$ and $Q(dx) = q(x)\,dx = \tilde{q}(y)\,dy = \tilde{q}(y(x))\left|\tfrac{dy}{dx}(x)\right|dx$, where $\left|\tfrac{dy}{dx}(x)\right|$ is the absolute value of the derivative or more generally of the Jacobian, the relative entropy may be rewritten:
     $$D_\text{KL}(P \parallel Q) = \int_{x_a}^{x_b} p(x)\log\frac{p(x)}{q(x)}\,dx = \int_{x_a}^{x_b} \tilde{p}(y(x))\left|\frac{dy}{dx}(x)\right|\log\frac{\tilde{p}(y(x))\left|\frac{dy}{dx}(x)\right|}{\tilde{q}(y(x))\left|\frac{dy}{dx}(x)\right|}\,dx = \int_{y_a}^{y_b} \tilde{p}(y)\log\frac{\tilde{p}(y)}{\tilde{q}(y)}\,dy$$
    where $y_a = y(x_a)$ and $y_b = y(x_b)$. Although it was assumed that the transformation was continuous, this need not be the case. This also shows that the relative entropy produces a dimensionally consistent quantity, since if x is a dimensioned variable, $p(x)$ and $q(x)$ are also dimensioned, since e.g. $P(dx) = p(x)\,dx$ is dimensionless. The argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory[24] (such as self-information or Shannon entropy), which can become undefined or negative for non-discrete probabilities.
  • Relative entropy is additive for independent distributions in much the same way as Shannon entropy. If $P_1, P_2$ are independent distributions, and $P(dx, dy) = P_1(dx)P_2(dy)$, and likewise $Q(dx, dy) = Q_1(dx)Q_2(dy)$ for independent distributions $Q_1, Q_2$, then
     $$D_\text{KL}(P \parallel Q) = D_\text{KL}(P_1 \parallel Q_1) + D_\text{KL}(P_2 \parallel Q_2)$$
    (see the numerical sketch after this list).
  • Relative entropy $D_\text{KL}(P \parallel Q)$ is convex in the pair of probability measures $(P, Q)$, i.e. if $(P_1, Q_1)$ and $(P_2, Q_2)$ are two pairs of probability measures, then
     $$D_\text{KL}\left(\lambda P_1 + (1-\lambda)P_2 \parallel \lambda Q_1 + (1-\lambda)Q_2\right) \leq \lambda D_\text{KL}(P_1 \parallel Q_1) + (1-\lambda) D_\text{KL}(P_2 \parallel Q_2) \quad \text{for } 0 \leq \lambda \leq 1.$$
  • The Taylor expansion is $D_\text{KL}(P \parallel Q) = \frac{1}{2}\sum_x \frac{(P(x)-Q(x))^2}{P(x)} + \frac{1}{3}\sum_x \frac{(P(x)-Q(x))^3}{P(x)^2} + \cdots$.
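Two of these properties, additivity over independent distributions and the leading term of the Taylor expansion, are simple to verify numerically; a sketch with arbitrary example distributions (the helper kl is defined only for this check):

import numpy as np

def kl(p, q):
    # discrete KL divergence in nats (assumes all probabilities are positive)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

# additivity: D_KL(P1 x P2 || Q1 x Q2) = D_KL(P1 || Q1) + D_KL(P2 || Q2)
p1, q1 = np.array([0.3, 0.7]), np.array([0.5, 0.5])
p2, q2 = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])
print(kl(np.outer(p1, p2).ravel(), np.outer(q1, q2).ravel()),
      kl(p1, q1) + kl(p2, q2))                       # equal

# leading Taylor term: roughly (1/2) sum (P - Q)^2 / P when Q is close to P
p = np.array([0.25, 0.25, 0.5])
q = p + np.array([0.01, -0.005, -0.005])
print(kl(p, q), 0.5 * np.sum((p - q) ** 2 / p))      # approximately equal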

Duality formula for variational inference

The following result, due to Donsker and Varadhan,[25] is known as Donsker and Varadhan's variational formula.

Theorem [Duality Formula for Variational Inference] — Let $\Theta$ be a set endowed with an appropriate $\sigma$-field $\mathcal{F}$, and two probability measures P and Q, which formulate two probability spaces $(\Theta, \mathcal{F}, P)$ and $(\Theta, \mathcal{F}, Q)$, with $Q \ll P$. ($Q \ll P$ indicates that Q is absolutely continuous with respect to P.) Let h be a real-valued integrable random variable on $(\Theta, \mathcal{F}, P)$. Then the following equality holds

$$\log E_P[\exp h] = \sup_{Q \ll P}\left\{ E_Q[h] - D_\text{KL}(Q \parallel P) \right\}$$

Further, the supremum on the right-hand side is attained if and only if it holds

$$\frac{Q(d\theta)}{P(d\theta)} = \frac{\exp h(\theta)}{E_P[\exp h]}$$

almost surely with respect to probability measure P, where $\frac{Q(d\theta)}{P(d\theta)}$ denotes the Radon–Nikodym derivative of Q with respect to P.

Proof

For a short proof assuming integrability of $\exp(h)$ with respect to P, let $Q^*$ have P-density $\frac{\exp h(\theta)}{E_P[\exp h]}$, i.e. $Q^*(d\theta) = \frac{\exp h(\theta)}{E_P[\exp h]}\,P(d\theta)$. Then

$$D_\text{KL}(Q \parallel Q^*) = D_\text{KL}(Q \parallel P) - E_Q[h] + \log E_P[\exp h].$$

Therefore,

$$E_Q[h] - D_\text{KL}(Q \parallel P) = \log E_P[\exp h] - D_\text{KL}(Q \parallel Q^*) \leq \log E_P[\exp h],$$

where the last inequality follows from $D_\text{KL}(Q \parallel Q^*) \geq 0$, for which equality occurs if and only if $Q = Q^*$. The conclusion follows.

For an alternative proof using measure theory, see [26].
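On a finite space the duality can be checked directly, since the supremum is attained at the exponential tilting $Q^*(\theta) \propto \exp(h(\theta))\,P(\theta)$; a sketch with an arbitrary P and h:

import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(5))          # an arbitrary probability measure P on 5 points
h = rng.normal(size=5)                 # an arbitrary (bounded) function h

lhs = np.log(np.sum(P * np.exp(h)))    # log E_P[exp h]

Q_star = P * np.exp(h)                 # optimal Q*: exponential tilting of P by h
Q_star /= Q_star.sum()
rhs = np.sum(Q_star * h) - np.sum(Q_star * np.log(Q_star / P))   # E_Q*[h] - D_KL(Q* || P)
print(lhs, rhs)                        # equal, as the duality formula states

Q = rng.dirichlet(np.ones(5))          # any other Q gives a smaller value
print(np.sum(Q * h) - np.sum(Q * np.log(Q / P)) <= lhs)          # True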

Examples

Multivariate normal distributions

Suppose that we have two multivariate normal distributions, with means $\mu_0, \mu_1$ and with (non-singular) covariance matrices $\Sigma_0, \Sigma_1$. If the two distributions have the same dimension, k, then the relative entropy between the distributions is as follows:[27]

$$D_\text{KL}(\mathcal{N}_0 \parallel \mathcal{N}_1) = \frac{1}{2}\left(\operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) - k + \left(\mu_1 - \mu_0\right)^\mathsf{T}\Sigma_1^{-1}\left(\mu_1 - \mu_0\right) + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\right)$$

The logarithm in the last term must be taken to base e since all terms apart from the last are base-e logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by $\ln 2$ yields the divergence in bits.

In a numerical implementation, it is helpful to express the result in terms of the Cholesky decompositions $L_0, L_1$ such that $\Sigma_0 = L_0 L_0^\mathsf{T}$ and $\Sigma_1 = L_1 L_1^\mathsf{T}$. Then with M and y solutions to the triangular linear systems $L_1 M = L_0$, and $L_1 y = \mu_1 - \mu_0$,

$$D_\text{KL}(\mathcal{N}_0 \parallel \mathcal{N}_1) = \frac{1}{2}\left(\sum_{i,j=1}^{k} M_{ij}^2 - k + \|y\|^2 + 2\sum_{i=1}^{k}\ln\frac{(L_1)_{ii}}{(L_0)_{ii}}\right)$$

A special case, and a common quantity in variational inference, is the relative entropy between a diagonal multivariate normal, and a standard normal distribution (with zero mean and unit variance):

$$D_\text{KL}\left(\mathcal{N}\left((\mu_1, \ldots, \mu_k)^\mathsf{T}, \operatorname{diag}(\sigma_1^2, \ldots, \sigma_k^2)\right) \parallel \mathcal{N}(\mathbf{0}, \mathbf{I})\right) = \frac{1}{2}\sum_{i=1}^{k}\left(\sigma_i^2 + \mu_i^2 - 1 - \ln\sigma_i^2\right)$$

For two univariate normal distributions p and q the above simplifies to[28]

$$D_\text{KL}(p \parallel q) = \log\frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^2 + (\mu_0 - \mu_1)^2}{2\sigma_1^2} - \frac{1}{2}$$

In the case of co-centered normal distributions with $k = \sigma_1/\sigma_0$, this simplifies[29] to:

$$D_\text{KL}(p \parallel q) = \log_2 k + \frac{k^{-2} - 1}{2\ln 2}\ \mathrm{bits}$$
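These closed forms can be sketched in a few lines of Python (NumPy/SciPy); the function names and the random test data below are choices made for this illustration:

import numpy as np
from scipy.linalg import solve_triangular

def kl_mvn(mu0, S0, mu1, S1):
    """Closed-form D_KL(N(mu0, S0) || N(mu1, S1)) in nats."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - k + diff @ S1_inv @ diff
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def kl_mvn_chol(mu0, S0, mu1, S1):
    """The same quantity computed from Cholesky factors, as described above."""
    k = len(mu0)
    L0, L1 = np.linalg.cholesky(S0), np.linalg.cholesky(S1)
    M = solve_triangular(L1, L0, lower=True)          # solves L1 M = L0
    y = solve_triangular(L1, mu1 - mu0, lower=True)   # solves L1 y = mu1 - mu0
    return 0.5 * (np.sum(M**2) - k + y @ y
                  + 2 * np.sum(np.log(np.diag(L1) / np.diag(L0))))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
S0, S1 = A @ A.T + 3 * np.eye(3), B @ B.T + 3 * np.eye(3)   # arbitrary positive-definite covariances
mu0, mu1 = rng.normal(size=3), rng.normal(size=3)
print(kl_mvn(mu0, S0, mu1, S1), kl_mvn_chol(mu0, S0, mu1, S1))   # agree

# diagonal Gaussian versus standard normal: check the simplified formula above
mu, s2 = np.array([0.5, -1.0, 2.0]), np.array([0.8, 1.5, 0.3])
print(0.5 * np.sum(s2 + mu**2 - 1 - np.log(s2)),
      kl_mvn(mu, np.diag(s2), np.zeros(3), np.eye(3)))           # agree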

Uniform distributions

Consider two uniform distributions, with the support of $p = [A, B]$ enclosed within $q = [C, D]$ ($C \leq A < B \leq D$). Then the information gain is:

$$D_\text{KL}(p \parallel q) = \log\frac{D - C}{B - A}$$

Intuitively,[29] the information gain to a k times narrower uniform distribution contains $\log_2 k$ bits. This connects with the use of bits in computing, where $\log_2 k$ bits would be needed to identify one element of a k long stream.
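A minimal numerical illustration of this formula and of the bits interpretation (the interval endpoints are arbitrary):

import numpy as np

A, B = 2.0, 3.0     # support of p
C, D = 0.0, 8.0     # support of q; q is k = 8 times wider than p

kl_nats = np.log((D - C) / (B - A))
print(kl_nats, kl_nats / np.log(2))   # ~2.079 nats, which is exactly 3 bits = log2(8)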

Relation to metrics

While relative entropy is a statistical distance, it is not a metric on the space of probability distributions, but instead it is a divergence.[4] While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_\text{KL}(P \parallel Q)$ does not equal $D_\text{KL}(Q \parallel P)$, and while this can be symmetrized (see § Symmetrised divergence), the asymmetry is an important part of the geometry.[4]

It generates a topology on the space of probability distributions. More concretely, if $P_1, P_2, \ldots$ is a sequence of distributions such that

$$\lim_{n \to \infty} D_\text{KL}(P_n \parallel Q) = 0,$$

then it is said that

$$P_n \xrightarrow{D} Q.$$

Pinsker's inequality entails that

$$P_n \xrightarrow{D} P \Rightarrow P_n \xrightarrow{TV} P,$$

where the latter stands for the usual convergence in total variation.
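Pinsker's inequality states that the total variation distance $\delta(P, Q) = \tfrac{1}{2}\sum_x |P(x) - Q(x)|$ is bounded by $\sqrt{D_\text{KL}(P \parallel Q)/2}$ (with the divergence in nats); a quick sketch checking this on random discrete distributions:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    kl = np.sum(p * np.log(p / q))        # D_KL(P || Q) in nats
    tv = 0.5 * np.sum(np.abs(p - q))      # total variation distance
    print(tv <= np.sqrt(kl / 2))          # True each time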

Fisher information metric

Relative entropy is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions P and Q are both parameterized by some (possibly multi-dimensional) parameter $\theta$. Consider then two close by values of $P = P(\theta)$ and $Q = P(\theta_0)$ so that the parameter $\theta$ differs by only a small amount from the parameter value $\theta_0$. Specifically, up to first order one has (using the Einstein summation convention)

$$P(\theta) = P(\theta_0) + \Delta\theta_j P_j(\theta_0) + \cdots$$

with $\Delta\theta_j = (\theta - \theta_0)_j$ a small change of $\theta$ in the j direction, and $P_j(\theta_0) = \frac{\partial P}{\partial\theta_j}(\theta_0)$ the corresponding rate of change in the probability distribution. Since relative entropy has an absolute minimum 0 for $P = Q$, i.e. $\theta = \theta_0$, it changes only to second order in the small parameters $\Delta\theta_j$. More formally, as for any minimum, the first derivatives of the divergence vanish

$$\left.\frac{\partial}{\partial\theta_j}\right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) = 0,$$

and by the Taylor expansion one has up to second order

$$D_\text{KL}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2}\,\Delta\theta_j\,\Delta\theta_k\, g_{jk}(\theta_0) + \cdots$$

where the Hessian matrix of the divergence

$$g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0))$$

must be positive semidefinite. Letting $\theta_0$ vary (and dropping the subindex 0) the Hessian $g_{jk}(\theta)$ defines a (possibly degenerate) Riemannian metric on the θ parameter space, called the Fisher information metric.
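For a one-parameter family this says $D_\text{KL}(P(\theta) \parallel P(\theta_0)) \approx \tfrac{1}{2} g(\theta_0)\,\Delta\theta^2$, with $g$ the Fisher information. A sketch for the Bernoulli family, whose Fisher information is $1/(\theta(1-\theta))$:

import numpy as np

def kl_bernoulli(t, t0):
    # D_KL(Bernoulli(t) || Bernoulli(t0)) in nats
    return t * np.log(t / t0) + (1 - t) * np.log((1 - t) / (1 - t0))

theta0 = 0.3
fisher = 1.0 / (theta0 * (1 - theta0))     # Fisher information at theta0

for d in [0.1, 0.01, 0.001]:
    print(d, kl_bernoulli(theta0 + d, theta0), 0.5 * fisher * d**2)
    # the quadratic term matches the divergence better and better as d -> 0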

Fisher information metric theorem

When $p(x, \rho)$ satisfies the following regularity conditions:

$\frac{\partial \log p}{\partial \rho}, \frac{\partial^2 \log p}{\partial \rho^2}, \frac{\partial^3 \log p}{\partial \rho^3}$ exist,

$$\begin{aligned} \left|\frac{\partial p}{\partial \rho}\right| &< F(x), \quad \int_{x=0}^{\infty} F(x)\,dx < \infty, \\ \left|\frac{\partial^2 p}{\partial \rho^2}\right| &< G(x), \quad \int_{x=0}^{\infty} G(x)\,dx < \infty, \\ \left|\frac{\partial^3 \log p}{\partial \rho^3}\right| &< H(x), \quad \int_{x=0}^{\infty} p(x, 0)\,H(x)\,dx < \xi < \infty \end{aligned}$$

where ξ is independent of ρ

$$\left.\int_{x=0}^{\infty} \frac{\partial p(x, \rho)}{\partial \rho}\right|_{\rho=0} dx = \left.\int_{x=0}^{\infty} \frac{\partial^2 p(x, \rho)}{\partial \rho^2}\right|_{\rho=0} dx = 0$$

then:

$$\mathcal{D}(p(x, 0) \parallel p(x, \rho)) = \frac{c\rho^2}{2} + \mathcal{O}(\rho^3) \quad \text{as } \rho \to 0.$$

Variation of information

Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.

Relation to other quantities of information theory

Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.

Self-information

The self-information, also known as the information content of a signal, random variable, or event is defined as the negative logarithm of the probability of the given outcome occurring.

When applied to a discrete random variable, the self-information can be represented as[citation needed]

$$\operatorname{I}(m) = D_\text{KL}\left(\delta_{im} \parallel \{p_i\}\right),$$

is the relative entropy of the probability distribution $P(i)$ from a Kronecker delta representing certainty that $i = m$ — i.e. the number of extra bits that must be transmitted to identify i if only the probability distribution $P(i)$ is available to the receiver, not the fact that $i = m$.

Mutual information

The mutual information,

$$\begin{aligned}\operatorname{I}(X; Y) &= D_\text{KL}\left(P(X, Y) \parallel P(X)P(Y)\right) \\ &= \operatorname{E}_X\left[D_\text{KL}\left(P(Y \mid X) \parallel P(Y)\right)\right] \\ &= \operatorname{E}_Y\left[D_\text{KL}\left(P(X \mid Y) \parallel P(X)\right)\right]\end{aligned}$$

is the relative entropy of the joint probability distribution $P(X, Y)$ from the product $P(X)P(Y)$ of the two marginal probability distributions — i.e. the expected number of extra bits that must be transmitted to identify X and Y if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability $P(X, Y)$ is known, it is the expected number of extra bits that must on average be sent to identify Y if the value of X is not already known to the receiver.
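A sketch computing mutual information as this KL divergence for a small joint distribution (the joint table is an arbitrary example):

import numpy as np

pxy = np.array([[0.10, 0.20, 0.15],      # joint distribution P(X, Y) on a 2x3 table
                [0.25, 0.05, 0.25]])

px = pxy.sum(axis=1, keepdims=True)      # marginal P(X)
py = pxy.sum(axis=0, keepdims=True)      # marginal P(Y)

mi = np.sum(pxy * np.log2(pxy / (px * py)))   # I(X; Y) = D_KL(P(X, Y) || P(X) P(Y)), in bits
print(mi)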

Shannon entropy

The Shannon entropy,

$$\mathrm{H}(X) = \operatorname{E}\left[\operatorname{I}_X(x)\right] = \log N - D_\text{KL}\left(p_X(x) \parallel P_U(X)\right)$$

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the relative entropy of the uniform distribution on the random variates of X, $P_U(X)$, from the true distribution $P(X)$ — i.e. less the expected number of bits saved, which would have had to be sent if the value of X were coded according to the uniform distribution $P_U(X)$ rather than the true distribution $P(X)$. This definition of Shannon entropy forms the basis of E.T. Jaynes's alternative generalization to continuous distributions, the limiting density of discrete points (as opposed to the usual differential entropy), which defines the continuous entropy as

$$\lim_{N \to \infty} H_N(X) = \log N - \int p(x) \log\frac{p(x)}{m(x)}\,dx,$$

which is equivalent to:

$$\log N - D_\text{KL}\left(p(x) \parallel m(x)\right).$$

Conditional entropy

The conditional entropy[30],

$$\begin{aligned}\mathrm{H}(X \mid Y) &= \log N - D_\text{KL}\left(P(X, Y) \parallel P_U(X)P(Y)\right) \\ &= \log N - D_\text{KL}\left(P(X, Y) \parallel P(X)P(Y)\right) - D_\text{KL}\left(P(X) \parallel P_U(X)\right) \\ &= \mathrm{H}(X) - \operatorname{I}(X; Y) \\ &= \log N - \operatorname{E}_Y\left[D_\text{KL}\left(P(X \mid Y) \parallel P_U(X)\right)\right]\end{aligned}$$

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the relative entropy of the product distribution $P_U(X)P(Y)$ from the true joint distribution $P(X, Y)$ — i.e. less the expected number of bits saved which would have had to be sent if the value of X were coded according to the uniform distribution $P_U(X)$ rather than the conditional distribution $P(X \mid Y)$ of X given Y.

Cross entropy

When we have a set of possible events, coming from the distribution p, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g.: the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution p in advance, we can devise an encoding that would be optimal (e.g.: using Huffman coding), meaning that the messages we encode will have the shortest length on average (assuming the encoded events are sampled from p), which will be equal to Shannon's entropy of p (denoted as $\mathrm{H}(p)$). However, if we use a different probability distribution (q) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new (larger) number is measured by the cross entropy between p and q.

The cross entropy between two probability distributions (p and q) measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus defined as follows.

$$\mathrm{H}(p, q) = \operatorname{E}_p[-\log q] = -\sum_{x} p(x)\,\log q(x).$$

For explicit derivation of this, see the Motivation section above.

Under this scenario, relative entropy (KL divergence) can be interpreted as the extra number of bits, on average, that are needed (beyond $\mathrm{H}(p)$) for encoding the events because of using q for constructing the encoding scheme instead of p.

Bayesian updating

In Bayesian statistics, relative entropy can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: $p(x) \to p(x \mid I)$. If some new fact $Y = y_1$ is discovered, it can be used to update the posterior distribution for X from $p(x \mid I)$ to a new posterior distribution $p(x \mid y_1, I)$ using Bayes' theorem:

$$p(x \mid y_1, I) = \frac{p(y_1 \mid x, I)\,p(x \mid I)}{p(y_1 \mid I)}$$

This distribution has a new entropy:

$$\mathrm{H}\left(p(x \mid y_1, I)\right) = -\sum_x p(x \mid y_1, I)\,\log p(x \mid y_1, I),$$

which may be less than or greater than the original entropy $\mathrm{H}(p(x \mid I))$. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on $p(x \mid I)$ instead of a new code based on $p(x \mid y_1, I)$ would have added an expected number of bits:

$$D_\text{KL}\left(p(x \mid y_1, I) \parallel p(x \mid I)\right) = \sum_x p(x \mid y_1, I)\,\log\frac{p(x \mid y_1, I)}{p(x \mid I)}$$

to the message length. This therefore represents the amount of useful information, or information gain, about X, that has been learned by discovering $Y = y_1$.

If a further piece of data, $Y_2 = y_2$, subsequently comes in, the probability distribution for x can be updated further, to give a new best guess $p(x \mid y_1, y_2, I)$. If one reinvestigates the information gain for using $p(x \mid y_1, I)$ rather than $p(x \mid I)$, it turns out that it may be either greater or less than previously estimated:

  may be ≤ or > than  

and so the combined information gain does not obey the triangle inequality:

  $D_\text{KL}\left(p(x \mid y_1, y_2, I) \parallel p(x \mid I)\right)$ may be <, = or > than $D_\text{KL}\left(p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)\right) + D_\text{KL}\left(p(x \mid y_1, I) \parallel p(x \mid I)\right)$

All one can say is that on average, averaging using  , the two sides will average out.
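As a concrete sketch of information gain from a single Bayesian update, consider estimating a coin's bias on a discrete grid of hypotheses (the grid, prior, and data below are arbitrary choices for this example):

import numpy as np

theta = np.linspace(0.05, 0.95, 19)           # grid of hypotheses for the coin bias
prior = np.full_like(theta, 1 / len(theta))   # uniform prior

heads, flips = 7, 10                          # observed data
likelihood = theta**heads * (1 - theta)**(flips - heads)

posterior = prior * likelihood                # Bayes' theorem (up to normalisation)
posterior /= posterior.sum()

info_gain = np.sum(posterior * np.log2(posterior / prior))   # D_KL(posterior || prior), in bits
print(info_gain)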

Bayesian experimental design

A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior.[31] When posteriors are approximated to be Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.

Discrimination information

Relative entropy $D_\text{KL}\left(p(x \mid H_1) \parallel p(x \mid H_0)\right)$ can also be interpreted as the expected discrimination information for $H_1$ over $H_0$: the mean information per sample for discriminating in favor of a hypothesis $H_1$ against a hypothesis $H_0$, when hypothesis $H_1$ is true.[32] Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for $H_1$ over $H_0$ to be expected from each sample.

The expected weight of evidence for $H_1$ over $H_0$ is not the same as the information gain expected per sample about the probability distribution $p(H)$ of the hypotheses,

 

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information

The idea of relative entropy as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution f should be chosen which is as hard to discriminate from the original distribution $f_0$ as possible; so that the new data produces as small an information gain $D_\text{KL}(f \parallel f_0)$ as possible.

For example, if one had a prior distribution $p(x, a)$ over x and a, and subsequently learnt the true distribution of a was $u(a)$, then the relative entropy between the new joint distribution for x and a, $q(x \mid a)\,u(a)$, and the earlier prior distribution would be:

$$D_\text{KL}\left(q(x \mid a)\,u(a) \parallel p(x, a)\right) = \operatorname{E}_{u(a)}\left[D_\text{KL}\left(q(x \mid a) \parallel p(x \mid a)\right)\right] + D_\text{KL}\left(u(a) \parallel p(a)\right),$$

i.e. the sum of the relative entropy of $p(a)$, the prior distribution for a, from the updated distribution $u(a)$, plus the expected value (using the probability distribution $u(a)$) of the relative entropy of the prior conditional distribution $p(x \mid a)$ from the new conditional distribution $q(x \mid a)$. (Note that often the latter expected value is called the conditional relative entropy (or conditional Kullback–Leibler divergence) and denoted by $D_\text{KL}\left(q(x \mid a) \parallel p(x \mid a)\right)$.[3][30]) This is minimized if $q(x \mid a) = p(x \mid a)$ over the whole support of $u(a)$; and we note that this result incorporates Bayes' theorem, if the new distribution $u(a)$ is in fact a δ function representing certainty that a has one particular value.
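A sketch verifying this decomposition (the chain rule for relative entropy) on a small discrete joint distribution; the tables are arbitrary example data:

import numpy as np

p_xa = np.array([[0.10, 0.15],
                 [0.20, 0.15],
                 [0.25, 0.15]])          # prior joint p(x, a); columns indexed by a
u_a = np.array([0.7, 0.3])               # newly learnt marginal distribution for a
q_x_given_a = np.array([[0.2, 0.5],
                        [0.3, 0.2],
                        [0.5, 0.3]])     # new conditional q(x | a); each column sums to 1

p_a = p_xa.sum(axis=0)                   # prior marginal p(a)
p_x_given_a = p_xa / p_a                 # prior conditional p(x | a)
new_joint = q_x_given_a * u_a            # q(x | a) u(a)

lhs = np.sum(new_joint * np.log(new_joint / p_xa))
cond = np.sum(u_a * np.sum(q_x_given_a * np.log(q_x_given_a / p_x_given_a), axis=0))
marg = np.sum(u_a * np.log(u_a / p_a))
print(lhs, cond + marg)                  # equal, as the decomposition states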

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the relative entropy continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising relative entropy from m to p with respect to m is equivalent to minimizing the cross-entropy of p and m, since

$$D_\text{KL}(p \parallel m) = \mathrm{H}(p, m) - \mathrm{H}(p),$$

which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is attempting to optimise by minimising $D_\text{KL}(p \parallel m)$ subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be $D_\text{KL}(p \parallel m)$, rather than $\mathrm{H}(p, m)$.[citation needed]

Relationship to available work

 
Pressure versus volume plot of available work from a mole of argon gas relative

kullback, leibler, divergence, mathematical, statistics, kullback, leibler, divergence, also, called, relative, entropy, divergence, denoted, displaystyle, text, parallel, type, statistical, distance, measure, probability, distribution, different, from, second. In mathematical statistics the Kullback Leibler KL divergence also called relative entropy and I divergence 1 denoted D KL P Q displaystyle D text KL P parallel Q is a type of statistical distance a measure of how one probability distribution P is different from a second reference probability distribution Q 2 3 A simple interpretation of the KL divergence of P from Q is the expected excess surprise from using Q as a model instead of P when the actual distribution is P While it is a measure of how different two distributions are and in some sense is thus a distance it is not actually a metric which is the most familiar and formal type of distance In particular it is not symmetric in the two distributions in contrast to variation of information and does not satisfy the triangle inequality Instead in terms of information geometry it is a type of divergence 4 a generalization of squared distance and for certain classes of distributions notably an exponential family it satisfies a generalized Pythagorean theorem which applies to squared distances 5 A relative entropy of 0 indicates that the two distributions in question are identical Relative entropy is a nonnegative function of two distributions or measures It has diverse applications both theoretical such as characterizing the relative Shannon entropy in information systems randomness in continuous time series and information gain when comparing statistical models of inference and practical such as applied statistics fluid mechanics neuroscience and bioinformatics Contents 1 Introduction and context 2 Etymology 3 Definition 4 Basic example 5 Interpretations 5 1 Statistics 5 2 Coding 5 3 Inference 5 4 Information geometry 5 5 Finance game theory 6 Motivation 7 Properties 8 Duality formula for variational inference 9 Examples 9 1 Multivariate normal distributions 9 2 Uniform distributions 10 Relation to metrics 10 1 Fisher information metric 10 1 1 Fisher information metric theorem 10 2 Variation of information 11 Relation to other quantities of information theory 11 1 Self information 11 2 Mutual information 11 3 Shannon entropy 11 4 Conditional entropy 11 5 Cross entropy 12 Bayesian updating 12 1 Bayesian experimental design 13 Discrimination information 13 1 Principle of minimum discrimination information 14 Relationship to available work 15 Quantum information theory 16 Relationship between models and reality 17 Symmetrised divergence 18 Relationship to other probability distance measures 19 Data differencing 20 See also 21 References 22 External linksIntroduction and context editConsider two probability distributions P and Q Usually P represents the data the observations or a measured probability distribution Distribution Q represents instead a theory a model a description or an approximation of P The Kullback Leibler divergence D KL P Q displaystyle D text KL P parallel Q nbsp is then interpreted as the average difference of the number of bits required for encoding samples of P using a code optimized for Q rather than one optimized for P Note that the roles of P and Q can be reversed in some situations where that is easier to compute such as with the expectation maximization algorithm EM and evidence lower bound ELBO computations Etymology editThe relative entropy was introduced by 
Solomon Kullback and Richard Leibler in Kullback amp Leibler 1951 as the mean information for discrimination between H 1 displaystyle H 1 nbsp and H 2 displaystyle H 2 nbsp per observation from m 1 displaystyle mu 1 nbsp 6 where one is comparing two probability measures m 1 m 2 displaystyle mu 1 mu 2 nbsp and H 1 H 2 displaystyle H 1 H 2 nbsp are the hypotheses that one is selecting from measure m 1 m 2 displaystyle mu 1 mu 2 nbsp respectively They denoted this by I 1 2 displaystyle I 1 2 nbsp and defined the divergence between m 1 displaystyle mu 1 nbsp and m 2 displaystyle mu 2 nbsp as the symmetrized quantity J 1 2 I 1 2 I 2 1 displaystyle J 1 2 I 1 2 I 2 1 nbsp which had already been defined and used by Harold Jeffreys in 1948 7 In Kullback 1959 the symmetrized form is again referred to as the divergence and the relative entropies in each direction are referred to as a directed divergences between two distributions 8 Kullback preferred the term discrimination information 9 The term divergence is in contrast to a distance metric since the symmetrized divergence does not satisfy the triangle inequality 10 Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given in Kullback 1959 pp 6 7 1 3 Divergence The asymmetric directed divergence has come to be known as the Kullback Leibler divergence while the symmetrized divergence is now referred to as the Jeffreys divergence Definition editFor discrete probability distributions P and Q defined on the same sample space X displaystyle mathcal X nbsp the relative entropy from Q to P is defined 11 to be D KL P Q x X P x log P x Q x displaystyle D text KL P parallel Q sum x in mathcal X P x log left frac P x Q x right nbsp which is equivalent to D KL P Q x X P x log Q x P x displaystyle D text KL P parallel Q sum x in mathcal X P x log left frac Q x P x right nbsp In other words it is the expectation of the logarithmic difference between the probabilities P and Q where the expectation is taken using the probabilities P Relative entropy is only defined in this way if for all x Q x 0 displaystyle Q x 0 nbsp implies P x 0 displaystyle P x 0 nbsp absolute continuity Otherwise it is often defined as displaystyle infty nbsp 1 but the value displaystyle infty nbsp is possible even if Q x 0 displaystyle Q x neq 0 nbsp everywhere 12 13 provided that X displaystyle mathcal X nbsp is infinite in extent Analogous comments apply to the continuous and general measure cases defined below Whenever P x displaystyle P x nbsp is zero the contribution of the corresponding term is interpreted as zero because lim x 0 x log x 0 displaystyle lim x to 0 x log x 0 nbsp For distributions P and Q of a continuous random variable relative entropy is defined to be the integral 14 D KL P Q p x log p x q x d x displaystyle D text KL P parallel Q int infty infty p x log left frac p x q x right mathrm d x nbsp where p and q denote the probability densities of P and Q More generally if P and Q are probability measures on a measurable space X displaystyle mathcal X nbsp and P is absolutely continuous with respect to Q then the relative entropy from Q to P is defined as D KL P Q x X log P d x Q d x P d x displaystyle D text KL P parallel Q int x in mathcal X log left frac P mathrm d x Q mathrm d x right P mathrm d x nbsp where P d x Q d x displaystyle frac P mathrm d x Q mathrm d x nbsp is the Radon Nikodym derivative of P with respect to Q i e the unique Q almost everywhere defined function r on X displaystyle mathcal X nbsp such that 
P d x r x Q d x displaystyle P mathrm d x r x Q mathrm d x nbsp which exists because P is absolutely continuous with respect to Q Also we assume the expression on the right hand side exists Equivalently by the chain rule this can be written as D KL P Q x X P d x Q d x log P d x Q d x Q d x displaystyle D text KL P parallel Q int x in mathcal X frac P mathrm d x Q mathrm d x log left frac P mathrm d x Q mathrm d x right Q mathrm d x nbsp which is the entropy of P relative to Q Continuing in this case if m displaystyle mu nbsp is any measure on X displaystyle mathcal X nbsp for which densities p and q with P d x p x m d x displaystyle P mathrm d x p x mu mathrm d x nbsp and Q d x q x m d x displaystyle Q mathrm d x q x mu mathrm d x nbsp exist meaning that P and Q are both absolutely continuous with respect to m displaystyle mu nbsp then the relative entropy from Q to P is given as D KL P Q x X p x log p x q x m d x displaystyle D text KL P parallel Q int x in mathcal X p x log left frac p x q x right mu mathrm d x nbsp Note that such a measure m displaystyle mu nbsp for which densities can be defined always exists since one can take m 1 2 P Q displaystyle mu frac 1 2 left P Q right nbsp although in practice it will usually be one that in the context like counting measure for discrete distributions or Lebesgue measure or a convenient variant thereof like Gaussian measure or the uniform measure on the sphere Haar measure on a Lie group etc for continuous distributions The logarithms in these formulae are usually taken to base 2 if information is measured in units of bits or to base e if information is measured in nats Most formulas involving relative entropy hold regardless of the base of the logarithm Various conventions exist for referring to D KL P Q displaystyle D text KL P parallel Q nbsp in words Often it is referred to as the divergence between P and Q but this fails to convey the fundamental asymmetry in the relation Sometimes as in this article it may be described as the divergence of P from Q or as the divergence from Q to P This reflects the asymmetry in Bayesian inference which starts from a prior Q and updates to the posterior P Another common way to refer to D KL P Q displaystyle D text KL P parallel Q nbsp is as the relative entropy of P with respect to Q or the information gain from P over Q Basic example editKullback 3 gives the following example Table 2 1 Example 2 1 Let P and Q be the distributions shown in the table and figure P is the distribution on the left side of the figure a binomial distribution with N 2 displaystyle N 2 nbsp and p 0 4 displaystyle p 0 4 nbsp Q is the distribution on the right side of the figure a discrete uniform distribution with the three possible outcomes x displaystyle x nbsp 0 1 2 i e X 0 1 2 displaystyle mathcal X 0 1 2 nbsp each with probability p 1 3 displaystyle p 1 3 nbsp nbsp Two distributions to illustrate relative entropy x 0 1 2 Distribution P x displaystyle P x nbsp 9 25 displaystyle frac 9 25 nbsp 12 25 displaystyle frac 12 25 nbsp 4 25 displaystyle frac 4 25 nbsp Distribution Q x displaystyle Q x nbsp 1 3 displaystyle frac 1 3 nbsp 1 3 displaystyle frac 1 3 nbsp 1 3 displaystyle frac 1 3 nbsp Relative entropies D KL P Q displaystyle D text KL P parallel Q nbsp and D KL Q P displaystyle D text KL Q parallel P nbsp are calculated as follows This example uses the natural log with base e designated ln to get results in nats see units of information D KL P Q x X P x ln P x Q x 9 25 ln 9 25 1 3 12 25 ln 12 25 1 3 4 25 ln 4 25 1 3 1 25 32 
ln 2 55 ln 3 50 ln 5 0 0852996 displaystyle begin aligned D text KL P parallel Q amp sum x in mathcal X P x ln left frac P x Q x right amp frac 9 25 ln left frac 9 25 1 3 right frac 12 25 ln left frac 12 25 1 3 right frac 4 25 ln left frac 4 25 1 3 right amp frac 1 25 left 32 ln 2 55 ln 3 50 ln 5 right approx 0 0852996 end aligned nbsp D KL Q P x X Q x ln Q x P x 1 3 ln 1 3 9 25 1 3 ln 1 3 12 25 1 3 ln 1 3 4 25 1 3 4 ln 2 6 ln 3 6 ln 5 0 097455 displaystyle begin aligned D text KL Q parallel P amp sum x in mathcal X Q x ln left frac Q x P x right amp frac 1 3 ln left frac 1 3 9 25 right frac 1 3 ln left frac 1 3 12 25 right frac 1 3 ln left frac 1 3 4 25 right amp frac 1 3 left 4 ln 2 6 ln 3 6 ln 5 right approx 0 097455 end aligned nbsp Interpretations editStatistics edit In the field of statistics the Neyman Pearson lemma states that the most powerful way to distinguish between the two distributions P and Q based on an observation Y drawn from one of them is through the log of the ratio of their likelihoods log P Y log Q Y displaystyle log P Y log Q Y nbsp The KL divergence is the expected value of this statistic if Y is actually drawn from P Kullback motivated the statistic as an expected log likelihood ratio 15 Coding edit In the context of coding theory D KL P Q displaystyle D text KL P parallel Q nbsp can be constructed by measuring the expected number of extra bits required to code samples from P using a code optimized for Q rather than the code optimized for P Inference edit In the context of machine learning D KL P Q displaystyle D text KL P parallel Q nbsp is often called the information gain achieved if P would be used instead of Q which is currently used By analogy with information theory it is called the relative entropy of P with respect to Q Expressed in the language of Bayesian inference D KL P Q displaystyle D text KL P parallel Q nbsp is a measure of the information gained by revising one s beliefs from the prior probability distribution Q to the posterior probability distribution P In other words it is the amount of information lost when Q is used to approximate P 16 Information geometry edit In applications P typically represents the true distribution of data observations or a precisely calculated theoretical distribution while Q typically represents a theory model description or approximation of P In order to find a distribution Q that is closest to P we can minimize the KL divergence and compute an information projection While it is a statistical distance it is not a metric the most familiar type of distance but instead it is a divergence 4 While metrics are symmetric and generalize linear distance satisfying the triangle inequality divergences are asymmetric and generalize squared distance in some cases satisfying a generalized Pythagorean theorem In general D KL P Q displaystyle D text KL P parallel Q nbsp does not equal D KL Q P displaystyle D text KL Q parallel P nbsp and the asymmetry is an important part of the geometry 4 The infinitesimal form of relative entropy specifically its Hessian gives a metric tensor that equals the Fisher information metric see Fisher information metric Relative entropy satisfies a generalized Pythagorean theorem for exponential families geometrically interpreted as dually flat manifolds and this allows one to minimize relative entropy by geometric means for example by information projection and in maximum likelihood estimation 5 The relative entropy is the Bregman divergence generated by the negative entropy but it is also of the form 
of an f divergence For probabilities over a finite alphabet it is unique in being a member of both of these classes of statistical divergences Finance game theory edit Consider a growth optimizing investor in a fair game with mutually exclusive outcomes e g a horse race in which the official odds add up to one The rate of return expected by such an investor is equal to the relative entropy between the investor s believed probabilities and the official odds 17 This is a special case of a much more general connection between financial returns and divergence measures 18 Financial risks are connected to D KL displaystyle D text KL nbsp via information geometry 19 Investors views the prevailing market view and risky scenarios form triangles on the relevant manifold of probability distributions The shape of the triangles determines key financial risks both qualitatively and quantitatively For instance obtuse triangles in which investors views and risk scenarios appear on opposite sides relative to the market describe negative risks acute triangles describe positive exposure and the right angled situation in the middle corresponds to zero risk Motivation edit nbsp Illustration of the relative entropy for two normal distributions The typical asymmetry is clearly visible In information theory the Kraft McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x i displaystyle x i nbsp out of a set of possibilities X can be seen as representing an implicit probability distribution q x i 2 ℓ i displaystyle q x i 2 ell i nbsp over X where ℓ i displaystyle ell i nbsp is the length of the code for x i displaystyle x i nbsp in bits Therefore relative entropy can be interpreted as the expected extra message length per datum that must be communicated if a code that is optimal for a given wrong distribution Q is used compared to using a code based on the true distribution P it is the excess entropy D KL P Q x X p x log 1 q x x X p x log 1 p x H P Q H P displaystyle begin aligned D text KL P parallel Q amp sum x in mathcal X p x log frac 1 q x sum x in mathcal X p x log frac 1 p x 5pt amp mathrm H P Q mathrm H P end aligned nbsp where H P Q displaystyle mathrm H P Q nbsp is the cross entropy of P and Q and H P displaystyle mathrm H P nbsp is the entropy of P which is the same as the cross entropy of P with itself The relative entropy D KL P Q displaystyle D text KL P parallel Q nbsp can be thought of geometrically as a statistical distance a measure of how far the distribution Q is from the distribution P Geometrically it is a divergence an asymmetric generalized form of squared distance The cross entropy H P Q displaystyle H P Q nbsp is itself such a measurement formally a loss function but it cannot be thought of as a distance since H P P H P displaystyle H P P H P nbsp is not zero This can be fixed by subtracting H P displaystyle H P nbsp to make D KL P Q displaystyle D text KL P parallel Q nbsp agree more closely with our notion of distance as the excess loss The resulting function is asymmetric and while this can be symmetrized see Symmetrised divergence the asymmetric form is more useful See Interpretations for more on the geometric interpretation Relative entropy relates to rate function in the theory of large deviations 20 21 Arthur Hobson proved that relative entropy is the only measure of difference between probability distributions that satisfies some desired properties which are the canonical extension to those appearing in a commonly used 
characterization of entropy 22 Consequently mutual information is the only measure of mutual dependence that obeys certain related conditions since it can be defined in terms of Kullback Leibler divergence Properties editRelative entropy is always non negative D KL P Q 0 displaystyle D text KL P parallel Q geq 0 nbsp a result known as Gibbs inequality with D KL P Q displaystyle D text KL P parallel Q nbsp equals zero if and only if P Q displaystyle P Q nbsp as measures In particular if P d x p x m d x displaystyle P dx p x mu dx nbsp and Q d x q x m d x displaystyle Q dx q x mu dx nbsp then p x q x displaystyle p x q x nbsp m displaystyle mu nbsp almost everywhere The entropy H P displaystyle mathrm H P nbsp thus sets a minimum value for the cross entropy H P Q displaystyle mathrm H P Q nbsp the expected number of bits required when using a code based on Q rather than P and the Kullback Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value x drawn from X if a code is used corresponding to the probability distribution Q rather than the true distribution P No upper bound exists for the general case However it is shown that if P and Q are two discrete probability distributions built by distributing the same discrete quantity then the maximum value of D KL P Q displaystyle D text KL P parallel Q nbsp can be calculated 23 Relative entropy remains well defined for continuous distributions and furthermore is invariant under parameter transformations For example if a transformation is made from variable x to variable y x displaystyle y x nbsp then since P d x p x d x p y d y p y x d y d x x d x displaystyle P dx p x dx tilde p y dy tilde p y x tfrac dy dx x dx nbsp and Q d x q x d x q y d y q y d y d x x d x displaystyle Q dx q x dx tilde q y dy tilde q y tfrac dy dx x dx nbsp where d y d x x displaystyle tfrac dy dx x nbsp is the absolute value of the derivative or more generally of the Jacobian the relative entropy may be rewritten D KL P Q x a x b p x log p x q x d x x a x b p y x d y d x x log p y x d y d x x q y x d y d x x d x y a y b p y log p y q y d y displaystyle begin aligned D text KL P parallel Q amp int x a x b p x log left frac p x q x right dx 6pt amp int x a x b tilde p y x frac dy dx x log left frac tilde p y x frac dy dx x tilde q y x frac dy dx x right dx amp int y a y b tilde p y log left frac tilde p y tilde q y right dy end aligned nbsp where y a y x a displaystyle y a y x a nbsp and y b y x b displaystyle y b y x b nbsp Although it was assumed that the transformation was continuous this need not be the case This also shows that the relative entropy produces a dimensionally consistent quantity since if x is a dimensioned variable p x displaystyle p x nbsp and q x displaystyle q x nbsp are also dimensioned since e g P d x p x d x displaystyle P dx p x dx nbsp is dimensionless The argument of the logarithmic term is and remains dimensionless as it must It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory 24 such as self information or Shannon entropy which can become undefined or negative for non discrete probabilities Relative entropy is additive for independent distributions in much the same way as Shannon entropy If P 1 P 2 displaystyle P 1 P 2 nbsp are independent distributions and P d x d y P 1 d x P 2 d y displaystyle P dx dy P 1 dx P 2 dy nbsp and likewise Q d x d y Q 1 d x Q 2 d y displaystyle Q dx dy Q 1 dx Q 2 dy nbsp for independent 
distributions Q 1 Q 2 displaystyle Q 1 Q 2 nbsp then D KL P Q D KL P 1 Q 1 D KL P 2 Q 2 displaystyle D text KL P parallel Q D text KL P 1 parallel Q 1 D text KL P 2 parallel Q 2 nbsp Relative entropy D KL P Q displaystyle D text KL P parallel Q nbsp is convex in the pair of probability measures P Q displaystyle P Q nbsp i e if P 1 Q 1 displaystyle P 1 Q 1 nbsp and P 2 Q 2 displaystyle P 2 Q 2 nbsp are two pairs of probability measures then D KL l P 1 1 l P 2 l Q 1 1 l Q 2 l D KL P 1 Q 1 1 l D KL P 2 Q 2 for 0 l 1 displaystyle D text KL lambda P 1 1 lambda P 2 parallel lambda Q 1 1 lambda Q 2 leq lambda D text KL P 1 parallel Q 1 1 lambda D text KL P 2 parallel Q 2 text for 0 leq lambda leq 1 nbsp The Taylor expansion is D KL P Q 1 2 x P x Q x 2 P x 1 3 x P x Q x 3 P x 2 displaystyle D text KL P parallel Q frac 1 2 sum x frac P x Q x 2 P x frac 1 3 sum x frac P x Q x 3 P x 2 cdots nbsp Duality formula for variational inference editThe following result due to Donsker and Varadhan 25 is known as Donsker and Varadhan s variational formula Theorem Duality Formula for Variational Inference Let 8 displaystyle Theta nbsp be a set endowed with an appropriate s displaystyle sigma nbsp field F displaystyle mathcal F nbsp and two probability measures P and Q which formulate two probability spaces 8 F P displaystyle Theta mathcal F P nbsp and 8 F Q displaystyle Theta mathcal F Q nbsp with Q P displaystyle Q ll P nbsp Q P displaystyle Q ll P nbsp indicates that Q is absolutely continuous with respect to P Let h be a real valued integrable random variable on 8 F P displaystyle Theta mathcal F P nbsp Then the following equality holds log E P exp h sup Q P E Q h D KL Q P displaystyle log E P exp h text sup Q ll P E Q h D text KL Q parallel P nbsp Further the supremum on the right hand side is attained if and only if it holds Q d 8 P d 8 exp h 8 E P exp h displaystyle frac Q d theta P d theta frac exp h theta E P exp h nbsp almost surely with respect to probability measure P where Q d 8 P d 8 displaystyle frac Q d theta P d theta nbsp denotes the Radon Nikodym derivative of Q with respect to P Proof For a short proof assuming integrability of exp h displaystyle exp h nbsp with respect to P let Q displaystyle Q nbsp have P density exp h 8 E P exp h displaystyle frac exp h theta E P exp h nbsp i e Q d 8 exp h 8 E P exp h P d 8 displaystyle Q d theta frac exp h theta E P exp h P d theta nbsp Then D KL Q Q D KL Q P E Q h log E P exp h displaystyle D text KL Q parallel Q D text KL Q parallel P E Q h log E P exp h nbsp Therefore E Q h D KL Q P log E P exp h D KL Q Q log E P exp h displaystyle E Q h D text KL Q parallel P log E P exp h D text KL Q parallel Q leq log E P exp h nbsp where the last inequality follows from D KL Q Q 0 displaystyle D text KL Q parallel Q geq 0 nbsp for which equality occurs if and only if Q Q displaystyle Q Q nbsp The conclusion follows For alternative proof using measure theory see 26 Examples editMultivariate normal distributions edit Suppose that we have two multivariate normal distributions with means m 0 m 1 displaystyle mu 0 mu 1 nbsp and with non singular covariance matrices S 0 S 1 displaystyle Sigma 0 Sigma 1 nbsp If the two distributions have the same dimension k then the relative entropy between the distributions is as follows 27 D KL N 0 N 1 1 2 tr S 1 1 S 0 k m 1 m 0 T S 1 1 m 1 m 0 ln det S 1 det S 0 displaystyle D text KL left mathcal N 0 parallel mathcal N 1 right frac 1 2 left operatorname tr left Sigma 1 1 Sigma 0 right k left mu 1 mu 0 right mathsf T Sigma 1 1 left mu 
1 mu 0 right ln left frac det Sigma 1 det Sigma 0 right right nbsp The logarithm in the last term must be taken to base e since all terms apart from the last are base e logarithms of expressions that are either factors of the density function or otherwise arise naturally The equation therefore gives a result measured in nats Dividing the entire expression above by ln 2 displaystyle ln 2 nbsp yields the divergence in bits In a numerical implementation it is helpful to express the result in terms of the Cholesky decompositions L 0 L 1 displaystyle L 0 L 1 nbsp such that S 0 L 0 L 0 T displaystyle Sigma 0 L 0 L 0 T nbsp and S 1 L 1 L 1 T displaystyle Sigma 1 L 1 L 1 T nbsp Then with M and y solutions to the triangular linear systems L 1 M L 0 displaystyle L 1 M L 0 nbsp and L 1 y m 1 m 0 displaystyle L 1 y mu 1 mu 0 nbsp D KL N 0 N 1 1 2 i j 1 k M i j 2 k y 2 2 i 1 k ln L 1 i i L 0 i i displaystyle D text KL left mathcal N 0 parallel mathcal N 1 right frac 1 2 left sum i j 1 k M ij 2 k y 2 2 sum i 1 k ln frac L 1 ii L 0 ii right nbsp A special case and a common quantity in variational inference is the relative entropy between a diagonal multivariate normal and a standard normal distribution with zero mean and unit variance D KL N m 1 m k T diag s 1 2 s k 2 N 0 I 1 2 i 1 k s i 2 m i 2 1 ln s i 2 displaystyle D text KL left mathcal N left left mu 1 ldots mu k right mathsf T operatorname diag left sigma 1 2 ldots sigma k 2 right right parallel mathcal N left mathbf 0 mathbf I right right 1 over 2 sum i 1 k left sigma i 2 mu i 2 1 ln left sigma i 2 right right nbsp For two univariate normal distributions p and q the above simplifies to 28 D KL p q log s 1 s 0 s 0 2 m 0 m 1 2 2 s 1 2 1 2 displaystyle D text KL left mathcal p parallel mathcal q right log frac sigma 1 sigma 0 frac sigma 0 2 mu 0 mu 1 2 2 sigma 1 2 frac 1 2 nbsp In the case of co centered normal distributions with k s 1 s 0 displaystyle k sigma 1 sigma 0 nbsp this simplifies 29 to D KL p q log 2 k k 2 1 2 ln 2 b i t s displaystyle D text KL left mathcal p parallel mathcal q right log 2 k k 2 1 2 ln 2 mathrm bits nbsp Uniform distributions edit Consider two uniform distributions with the support of p A B displaystyle p A B nbsp enclosed within q C D displaystyle q C D nbsp C A lt B D displaystyle C leq A lt B leq D nbsp Then the information gain is D KL p q log D C B A displaystyle D text KL left mathcal p parallel mathcal q right log frac D C B A nbsp Intuitively 29 the information gain to a k times narrower uniform distribution contains log 2 k displaystyle log 2 k nbsp bits This connects with the use of bits in computing where log 2 k displaystyle log 2 k nbsp bits would be needed to identify one element of a k long stream Relation to metrics editWhile relative entropy is a statistical distance it is not a metric on the space of probability distributions but instead it is a divergence 4 While metrics are symmetric and generalize linear distance satisfying the triangle inequality divergences are asymmetric in general and generalize squared distance in some cases satisfying a generalized Pythagorean theorem In general D KL P Q displaystyle D text KL P parallel Q nbsp does not equal D KL Q P displaystyle D text KL Q parallel P nbsp and while this can be symmetrized see Symmetrised divergence the asymmetry is an important part of the geometry 4 It generates a topology on the space of probability distributions More concretely if P 1 P 2 displaystyle P 1 P 2 ldots nbsp is a sequence of distributions such that lim n D KL P n Q 0 
Uniform distributions

Consider two uniform distributions, with the support of $p = [A, B]$ enclosed within $q = [C, D]$ ($C \le A < B \le D$). Then the information gain is:

$$D_\text{KL}(p \parallel q) = \log \frac{D - C}{B - A}.$$

Intuitively,[29] the information gain to a k times narrower uniform distribution contains $\log_2 k$ bits. This connects with the use of bits in computing, where $\log_2 k$ bits would be needed to identify one element of a k-long stream.

Relation to metrics

While relative entropy is a statistical distance, it is not a metric on the space of probability distributions, but instead it is a divergence.[4] While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_\text{KL}(P \parallel Q)$ does not equal $D_\text{KL}(Q \parallel P)$, and while this can be symmetrized (see § Symmetrised divergence), the asymmetry is an important part of the geometry.[4]

It generates a topology on the space of probability distributions. More concretely, if $\{P_1, P_2, \ldots\}$ is a sequence of distributions such that

$$\lim_{n \to \infty} D_\text{KL}(P_n \parallel Q) = 0,$$

then it is said that $P_n \xrightarrow{D} Q$. Pinsker's inequality entails that $P_n \xrightarrow{D} P \Rightarrow P_n \xrightarrow{TV} P$, where the latter stands for the usual convergence in total variation.

Fisher information metric

Relative entropy is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions P and Q are both parameterized by some (possibly multi-dimensional) parameter $\theta$. Consider then two close-by values of $P = P(\theta)$ and $Q = P(\theta_0)$, so that the parameter $\theta$ differs by only a small amount from the parameter value $\theta_0$. Specifically, up to first order one has (using the Einstein summation convention)

$$P(\theta) = P(\theta_0) + \Delta\theta_j P_j(\theta_0) + \cdots,$$

with $\Delta\theta_j = (\theta - \theta_0)_j$ a small change of $\theta$ in the j direction, and $P_j(\theta_0) = \frac{\partial P}{\partial \theta_j}(\theta_0)$ the corresponding rate of change in the probability distribution. Since relative entropy has an absolute minimum 0 for $P = Q$, i.e. $\theta = \theta_0$, it changes only to second order in the small parameters $\Delta\theta_j$. More formally, as for any minimum, the first derivatives of the divergence vanish,

$$\left. \frac{\partial}{\partial \theta_j} \right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) = 0,$$

and by the Taylor expansion one has, up to second order,

$$D_\text{KL}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2} \Delta\theta_j \Delta\theta_k g_{jk}(\theta_0) + \cdots,$$

where the Hessian matrix of the divergence

$$g_{jk}(\theta_0) = \left. \frac{\partial^2}{\partial \theta_j \, \partial \theta_k} \right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0))$$

must be positive semidefinite. Letting $\theta_0$ vary (and dropping the subindex 0), the Hessian $g_{jk}(\theta)$ defines a (possibly degenerate) Riemannian metric on the $\theta$ parameter space, called the Fisher information metric.

Fisher information metric theorem

When $p_{(x, \rho)}$ satisfies the following regularity conditions:

$$\frac{\partial \log(p)}{\partial \rho}, \quad \frac{\partial^2 \log(p)}{\partial \rho^2}, \quad \frac{\partial^3 \log(p)}{\partial \rho^3} \quad \text{exist},$$

$$\begin{aligned}
\left| \frac{\partial p}{\partial \rho} \right| &< F(x): \quad \int_{x=0}^{\infty} F(x) \, dx < \infty, \\
\left| \frac{\partial^2 p}{\partial \rho^2} \right| &< G(x): \quad \int_{x=0}^{\infty} G(x) \, dx < \infty, \\
\left| \frac{\partial^3 \log(p)}{\partial \rho^3} \right| &< H(x): \quad \int_{x=0}^{\infty} p(x, 0) H(x) \, dx < \xi < \infty,
\end{aligned}$$

where $\xi$ is independent of $\rho$, and

$$\left. \int_{x=0}^{\infty} \frac{\partial p(x, \rho)}{\partial \rho} \right|_{\rho = 0} dx = \left. \int_{x=0}^{\infty} \frac{\partial^2 p(x, \rho)}{\partial \rho^2} \right|_{\rho = 0} dx = 0,$$

then

$$\mathcal{D}(p(x, 0) \parallel p(x, \rho)) = \frac{c \rho^2}{2} + \mathcal{O}(\rho^3) \quad \text{as } \rho \to 0.$$
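To illustrate the second-order behaviour numerically, one can compare $D_\text{KL}(P(\theta) \parallel P(\theta_0))$ with $\tfrac{1}{2} \Delta\theta^2 g(\theta_0)$ in a family where the Fisher information is known in closed form. The sketch below assumes a Bernoulli family, for which $g(\theta) = 1/(\theta(1 - \theta))$; all names are illustrative:

```python
import numpy as np

def kl_bernoulli(t, t0):
    """D_KL(Bernoulli(t) || Bernoulli(t0)) in nats."""
    return t * np.log(t / t0) + (1 - t) * np.log((1 - t) / (1 - t0))

theta0 = 0.3
g = 1.0 / (theta0 * (1.0 - theta0))      # Fisher information of the Bernoulli family

for dt in [1e-1, 1e-2, 1e-3]:
    exact = kl_bernoulli(theta0 + dt, theta0)
    quad = 0.5 * g * dt**2               # second-order approximation
    print(dt, exact, quad, exact / quad) # ratio tends to 1 as dt -> 0
```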
Variation of information

Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.

Relation to other quantities of information theory

Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.

Self-information

Main article: Information content

The self-information, also known as the information content of a signal, random variable, or event, is defined as the negative logarithm of the probability of the given outcome occurring. When applied to a discrete random variable, the self-information can be represented as[citation needed]

$$\operatorname{I}(m) = D_\text{KL}\left( \delta_{im} \parallel \{ p_i \} \right),$$

the relative entropy of the probability distribution $P(i)$ from a Kronecker delta representing certainty that $i = m$, i.e. the number of extra bits that must be transmitted to identify i if only the probability distribution $P(i)$ is available to the receiver, not the fact that $i = m$.

Mutual information

The mutual information,

$$\begin{aligned}
\operatorname{I}(X; Y) &= D_\text{KL}(P(X, Y) \parallel P(X) P(Y)) \\
&= \operatorname{E}_X \{ D_\text{KL}(P(Y \mid X) \parallel P(Y)) \} \\
&= \operatorname{E}_Y \{ D_\text{KL}(P(X \mid Y) \parallel P(X)) \},
\end{aligned}$$

is the relative entropy of the joint probability distribution $P(X, Y)$ from the product $P(X) P(Y)$ of the two marginal probability distributions, i.e. the expected number of extra bits that must be transmitted to identify X and Y if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability $P(X, Y)$ is known, it is the expected number of extra bits that must on average be sent to identify Y if the value of X is not already known to the receiver.

Shannon entropy

The Shannon entropy,

$$\begin{aligned}
\mathrm{H}(X) &= \operatorname{E}[\operatorname{I}_X(x)] \\
&= \log N - D_\text{KL}\left( p_X(x) \parallel P_U(X) \right),
\end{aligned}$$

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the relative entropy of the uniform distribution on the random variates of X, $P_U(X)$, from the true distribution $P(X)$, i.e. less the expected number of bits saved, which would have had to be sent if the value of X were coded according to the uniform distribution $P_U(X)$ rather than the true distribution $P(X)$.

This definition of Shannon entropy forms the basis of E. T. Jaynes's alternative generalization to continuous distributions, the limiting density of discrete points (as opposed to the usual differential entropy), which defines the continuous entropy as

$$\lim_{N \to \infty} H_N(X) = \log N - \int p(x) \log \frac{p(x)}{m(x)} \, dx,$$

which is equivalent to

$$\log N - D_\text{KL}(p(x) \parallel m(x)).$$
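The identities above are easy to check numerically. The sketch below (an assumed 3-by-2 joint table; all names are illustrative) verifies that the mutual information equals the relative entropy of the joint from the product of the marginals, and that $\mathrm{H}(X) = \log N - D_\text{KL}(p_X \parallel P_U(X))$:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# A small joint distribution P(X, Y) on a 3 x 2 grid (values are illustrative).
Pxy = np.array([[0.10, 0.20],
                [0.25, 0.05],
                [0.15, 0.25]])
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)

# Mutual information as the relative entropy of the joint from the product.
I_xy = kl(Pxy.ravel(), np.outer(Px, Py).ravel())
print(I_xy)

# Shannon entropy as log N minus the relative entropy from the uniform distribution.
N = len(Px)
H_x = -np.sum(Px * np.log(Px))
print(np.isclose(H_x, np.log(N) - kl(Px, np.full(N, 1.0 / N))))
```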
Conditional entropy

The conditional entropy,[30]

$$\begin{aligned}
\mathrm{H}(X \mid Y) &= \log N - D_\text{KL}\left( P(X, Y) \parallel P_U(X) P(Y) \right) \\
&= \log N - D_\text{KL}\left( P(X, Y) \parallel P(X) P(Y) \right) - D_\text{KL}\left( P(X) \parallel P_U(X) \right) \\
&= \mathrm{H}(X) - \operatorname{I}(X; Y) \\
&= \log N - \operatorname{E}_Y\left[ D_\text{KL}\left( P(X \mid Y) \parallel P_U(X) \right) \right],
\end{aligned}$$

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the relative entropy of the product distribution $P_U(X) P(Y)$ from the true joint distribution $P(X, Y)$, i.e. less the expected number of bits saved which would have had to be sent if the value of X were coded according to the uniform distribution $P_U(X)$ rather than the conditional distribution $P(X \mid Y)$ of X given Y.

Cross entropy

When we have a set of possible events, coming from the distribution p, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g. the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution p in advance, we can devise an encoding that would be optimal (e.g. using Huffman coding): the messages we encode will have the shortest length on average (assuming the encoded events are sampled from p), which will be equal to Shannon's entropy of p, denoted $\mathrm{H}(p)$. However, if we use a different probability distribution q when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new, larger number is measured by the cross-entropy between p and q.

The cross-entropy between two probability distributions p and q measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the true distribution p. The cross-entropy for two distributions p and q over the same probability space is thus defined as follows:

$$\mathrm{H}(p, q) = \operatorname{E}_p[-\log q] = \mathrm{H}(p) + D_\text{KL}(p \parallel q).$$

For an explicit derivation of this, see the Motivation section above.

Under this scenario, the relative entropy (KL divergence) can be interpreted as the extra number of bits, on average, that are needed (beyond $\mathrm{H}(p)$) for encoding the events, because of using q for constructing the encoding scheme instead of p.
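A small numerical check of the decomposition $\mathrm{H}(p, q) = \mathrm{H}(p) + D_\text{KL}(p \parallel q)$, reusing the (1/2, 1/4, 1/4) example above together with an assumed mismatched coding distribution q:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])      # true distribution (events A, B, C)
q = np.array([0.25, 0.5, 0.25])      # mismatched coding distribution (assumed)

H_p = -np.sum(p * np.log2(p))        # Shannon entropy of p, in bits
H_pq = -np.sum(p * np.log2(q))       # cross-entropy H(p, q), in bits
D_pq = np.sum(p * np.log2(p / q))    # D_KL(p || q), in bits

print(H_p, H_pq, D_pq)
print(np.isclose(H_pq, H_p + D_pq))  # H(p, q) = H(p) + D_KL(p || q)
```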
Bayesian updating

In Bayesian statistics, relative entropy can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: $p(x) \to p(x \mid I)$. If some new fact $Y = y$ is discovered, it can be used to update the posterior distribution for X from $p(x \mid I)$ to a new posterior distribution $p(x \mid y, I)$ using Bayes' theorem:

$$p(x \mid y, I) = \frac{p(y \mid x, I) \, p(x \mid I)}{p(y \mid I)}.$$

This distribution has a new entropy,

$$\mathrm{H}\big( p(x \mid y, I) \big) = -\sum_x p(x \mid y, I) \log p(x \mid y, I),$$

which may be less than or greater than the original entropy $\mathrm{H}(p(x \mid I))$. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on $p(x \mid I)$ instead of a new code based on $p(x \mid y, I)$ would have added an expected number of bits,

$$D_\text{KL}\big( p(x \mid y, I) \parallel p(x \mid I) \big) = \sum_x p(x \mid y, I) \log \frac{p(x \mid y, I)}{p(x \mid I)},$$

to the message length. This therefore represents the amount of useful information, or information gain, about X, that has been learned by discovering $Y = y$.

If a further piece of data, $Y_2 = y_2$, subsequently comes in, the probability distribution for x can be updated further, to give a new best guess $p(x \mid y_1, y_2, I)$. If one reinvestigates the information gain for using $p(x \mid y_1, I)$ rather than $p(x \mid I)$, it turns out that it may be either greater or less than previously estimated:

$$\sum_x p(x \mid y_1, y_2, I) \log \frac{p(x \mid y_1, y_2, I)}{p(x \mid I)} \quad \text{may be less than or greater than} \quad \sum_x p(x \mid y_1, I) \log \frac{p(x \mid y_1, I)}{p(x \mid I)},$$

and so the combined information gain does not obey the triangle inequality:

$$D_\text{KL}\big( p(x \mid y_1, y_2, I) \parallel p(x \mid I) \big) \quad \text{may be less than or greater than} \quad D_\text{KL}\big( p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I) \big) + D_\text{KL}\big( p(x \mid y_1, I) \parallel p(x \mid I) \big).$$

All one can say is that on average, averaging using $p(y_2 \mid y_1, x, I)$, the two sides will average out.

Bayesian experimental design

A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior.[31] When posteriors are approximated to be Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.
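A short sketch of the information gain from a single Bayesian update (the three-hypothesis prior and the likelihoods of the observed y are illustrative assumptions):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits for discrete distributions with full support."""
    return float(np.sum(p * np.log2(p / q)))

prior = np.array([0.5, 0.3, 0.2])    # p(x | I) over three hypotheses
lik_y = np.array([0.8, 0.4, 0.1])    # p(y | x, I) for the observed y

posterior = prior * lik_y
posterior /= posterior.sum()         # Bayes' theorem: p(x | y, I)

# Information gain about X from learning Y = y, in bits.
print(kl(posterior, prior))
```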
Discrimination information

Relative entropy $D_\text{KL}\bigl( p(x \mid H_1) \parallel p(x \mid H_0) \bigr)$ can also be interpreted as the expected discrimination information for $H_1$ over $H_0$: the mean information per sample for discriminating in favor of a hypothesis $H_1$ against a hypothesis $H_0$, when hypothesis $H_1$ is true.[32] Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for $H_1$ over $H_0$ to be expected from each sample.

The expected weight of evidence for $H_1$ over $H_0$ is not the same as the information gain expected per sample about the probability distribution $p(H)$ of the hypotheses:

$$D_\text{KL}(p(x \mid H_1) \parallel p(x \mid H_0)) \neq IG = D_\text{KL}(p(H \mid x) \parallel p(H \mid I)).$$

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate, but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty: coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous, infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

Principle of minimum discrimination information

The idea of relative entropy as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution f should be chosen which is as hard to discriminate from the original distribution $f_0$ as possible, so that the new data produces as small an information gain $D_\text{KL}(f \parallel f_0)$ as possible.

For example, if one had a prior distribution $p(x, a)$ over x and a, and subsequently learnt the true distribution of a was $u(a)$, then the relative entropy between the new joint distribution for x and a, $q(x \mid a) u(a)$, and the earlier prior distribution would be

$$D_\text{KL}(q(x \mid a) u(a) \parallel p(x, a)) = \operatorname{E}_{u(a)}\left[ D_\text{KL}(q(x \mid a) \parallel p(x \mid a)) \right] + D_\text{KL}(u(a) \parallel p(a)),$$

i.e. the sum of the relative entropy of $p(a)$, the prior distribution for a, from the updated distribution $u(a)$, plus the expected value (using the probability distribution $u(a)$) of the relative entropy of the prior conditional distribution $p(x \mid a)$ from the new conditional distribution $q(x \mid a)$. (Note that the latter expected value is often called the conditional relative entropy, or conditional Kullback–Leibler divergence, and denoted by $D_\text{KL}(q(x \mid a) \parallel p(x \mid a))$.[3][30]) This is minimized if $q(x \mid a) = p(x \mid a)$ over the whole support of $u(a)$, and we note that this result incorporates Bayes' theorem if the new distribution $u(a)$ is in fact a δ function representing certainty that a has one particular value.

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E. T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the relative entropy continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE), or Minxent for short. Minimising the relative entropy from m to p with respect to m is equivalent to minimizing the cross-entropy of p and m, since

$$\mathrm{H}(p, m) = \mathrm{H}(p) + D_\text{KL}(p \parallel m),$$

which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is attempting to optimise by minimising $D_\text{KL}(p \parallel m)$ subject to some constraint.
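For instance, here is a minimal sketch of such a constrained minimisation, assuming a fixed discrete reference measure m, a mean constraint $E_p[f] = c$, and SciPy's brentq root-finder; it uses the standard fact that the minimiser is an exponential tilt of m (all values and names are illustrative):

```python
import numpy as np
from scipy.optimize import brentq

# Fixed reference measure m and a constraint E_p[f] = c (values are illustrative).
m = np.array([0.4, 0.3, 0.2, 0.1])
f = np.array([0.0, 1.0, 2.0, 3.0])
c = 1.5

def tilt(lam):
    """Exponentially tilted distribution m(x) exp(lam f(x)) / Z(lam)."""
    w = m * np.exp(lam * f)
    return w / w.sum()

# The minimiser of D_KL(p || m) under the mean constraint is a tilt of m;
# solve for the multiplier lam that matches the constraint.
lam = brentq(lambda l: tilt(l) @ f - c, -50.0, 50.0)
p = tilt(lam)
print(p, p @ f)   # p @ f equals c up to numerical tolerance
```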
This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be $D_\text{KL}(p \parallel m)$, rather than $\mathrm{H}(p, m)$.[citation needed]

Relationship to available work

[Figure: Pressure versus volume plot of available work from a mole of argon gas relative ...]
