Wikipedia

Exponential family

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, since its algebraic properties allow expectations and covariances to be calculated by differentiation, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family",[1] as is the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinctive because its members all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

The concept of exponential families is credited to[2] E. J. G. Pitman,[3] G. Darmois,[4] and B. O. Koopman[5] in 1935–1936. Exponential families of distributions provide a general framework for selecting a possible alternative parameterisation of a parametric family of distributions, in terms of natural parameters, and for defining useful sample statistics, called the natural sufficient statistics of the family.

Nomenclature difficulty

The terms "distribution" and "family" are often used loosely: Specifically, an exponential family is a set of distributions, where the specific distribution varies with the parameter;[a] however, a parametric family of distributions is often referred to as "a distribution" (like "the normal distribution", meaning "the family of normal distributions"), and the set of all exponential families is sometimes loosely referred to as "the" exponential family.

Definition

Most of the commonly used distributions form an exponential family or subset of an exponential family, listed in the subsection below. The subsections following it are a sequence of increasingly more general mathematical definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

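Examples of exponential family distributions

Exponential families include many of the most common distributions. Among many others, exponential families include the following:[6]

  • normal
  • exponential
  • gamma
  • chi-squared
  • beta
  • Dirichlet
  • Bernoulli
  • categorical
  • Poisson
  • Wishart
  • inverse Wishart
  • geometric

A number of common distributions are exponential families, but only when certain parameters are fixed and known. For example:

  • binomial (with fixed number of trials)
  • multinomial (with fixed number of trials)
  • negative binomial (with fixed number of failures)

Note that in each case, the parameters which must be fixed are those that set a limit on the range of values that can possibly be observed.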

Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions when the bounds are not fixed. See the section below on examples for more discussion.

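Scalar parameter

The value of $\theta$ is called the parameter of the family.

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form

$$f_X(x \mid \theta) = h(x)\,\exp\bigl[\eta(\theta) \cdot T(x) - A(\theta)\bigr]$$

where $T(x)$, $h(x)$, $\eta(\theta)$, and $A(\theta)$ are known functions. The function $h(x)$ must be non-negative.

An alternative, equivalent form often given is

$$f_X(x \mid \theta) = h(x)\,g(\theta)\,\exp\bigl[\eta(\theta) \cdot T(x)\bigr]$$

or equivalently

$$f_X(x \mid \theta) = \exp\bigl[\eta(\theta) \cdot T(x) - A(\theta) + B(x)\bigr].$$

Note that $g(\theta) = e^{-A(\theta)}$ and $h(x) = e^{B(x)}$.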
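As a minimal sketch of the scalar-parameter form, the Poisson distribution with rate $\lambda$ can be written with $h(x) = 1/x!$, $\eta(\theta) = \log\lambda$, $T(x) = x$ and $A(\theta) = \lambda$. The function names below are illustrative only; the code simply checks that the exponential-family forms agree with the textbook pmf.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson pmf: lam^x * e^(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, lam):
    """Same pmf written as h(x) * exp(eta(theta) * T(x) - A(theta))."""
    h = 1.0 / math.factorial(x)   # base measure h(x)
    eta = math.log(lam)           # eta(theta) = log(lambda)
    T = x                         # sufficient statistic T(x) = x
    A = lam                       # A(theta) = lambda
    return h * math.exp(eta * T - A)

def poisson_expfam_g(x, lam):
    """Alternative form h(x) * g(theta) * exp(eta(theta) * T(x)), with g = exp(-A)."""
    g = math.exp(-lam)
    return (1.0 / math.factorial(x)) * g * math.exp(math.log(lam) * x)

for x in range(5):
    print(x, poisson_pmf(x, 2.5), poisson_expfam(x, 2.5), poisson_expfam_g(x, 2.5))
```

The three columns printed for each $x$ should match up to floating-point rounding.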

Support must be independent of θ

Importantly, the support of $f_X(x \mid \theta)$ (all the possible $x$ values for which $f_X(x \mid \theta)$ is greater than $0$) is required to not depend on $\theta$.[7] This requirement can be used to exclude a parametric family of distributions from being an exponential family.

For example: The Pareto distribution has a pdf which is defined for $x \geq x_m$ (the minimum value, $x_m$, being the scale parameter) and its support, therefore, has a lower limit of $x_m$. Since the support of $f_{\alpha, x_m}(x)$ is dependent on the value of the parameter, the family of Pareto distributions does not form an exponential family of distributions (at least when $x_m$ is unknown).

Another example: Bernoulli-type distributions – binomial, negative binomial, geometric distribution, and similar – can only be included in the exponential class if the number of Bernoulli trials, $n$, is treated as a fixed constant – excluded from the free parameter(s) $\theta$ – since the allowed number of trials sets the limits for the number of "successes" or "failures" that can be observed in a set of trials.
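The support condition can be illustrated with a short sketch for the Pareto density $\alpha x_m^{\alpha} / x^{\alpha+1}$ on $x \geq x_m$. The helper `pareto_pdf` is a hypothetical name used for illustration; the point is only that the set of $x$ with positive density moves when $x_m$ changes.

```python
def pareto_pdf(x, alpha, x_m):
    """Pareto density: alpha * x_m**alpha / x**(alpha + 1) for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)

# The point x = 1.5 lies inside the support when x_m = 1 but outside it when x_m = 2,
# so the support depends on the parameter x_m.
print(pareto_pdf(1.5, alpha=3.0, x_m=1.0) > 0)   # True
print(pareto_pdf(1.5, alpha=3.0, x_m=2.0) > 0)   # False
```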

Vector valued x and θ

Often $x$ is a vector of measurements, in which case $T(x)$ may be a function from the space of possible values of $x$ to the real numbers.

More generally, $\eta(\theta)$ and $T(x)$ can each be vector-valued such that $\eta(\theta) \cdot T(x)$ is real-valued. However, see the discussion below on vector parameters, regarding the curved exponential family.

Canonical formulation

If $\eta(\theta) = \theta$ then the exponential family is said to be in canonical form. By defining a transformed parameter $\eta = \eta(\theta)$ it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since $\eta(\theta)$ can be multiplied by any nonzero constant, provided that $T(x)$ is multiplied by that constant's reciprocal, or a constant $c$ can be added to $\eta(\theta)$ and $h(x)$ multiplied by $\exp\bigl[-c \cdot T(x)\bigr]$ to offset it. In the special case that $\eta(\theta) = \theta$ and $T(x) = x$, the family is called a natural exponential family.

Even when $x$ is a scalar, and there is only a single parameter, the functions $\eta(\theta)$ and $T(x)$ can still be vectors, as described below.

The function $A(\theta)$, or equivalently $g(\theta)$, is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of $\eta$, even when $\eta(\theta)$ is not a one-to-one function, i.e. two or more different values of $\theta$ map to the same value of $\eta(\theta)$, and hence $\eta(\theta)$ cannot be inverted. In such a case, all values of $\theta$ mapping to the same $\eta(\theta)$ will also have the same value for $A(\theta)$ and $g(\theta)$.
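A small sketch of the canonical form, using the Bernoulli distribution as an assumed example: with $\eta = \log\bigl(p/(1-p)\bigr)$, $T(x) = x$ and $h(x) = 1$, normalization over $x \in \{0, 1\}$ forces $A(\eta) = \log(1 + e^{\eta})$. The function names are illustrative.

```python
import math

def bernoulli_pmf(x, p):
    """Bernoulli pmf in the usual parameter p."""
    return p if x == 1 else 1.0 - p

def eta_of_p(p):
    """Transformed (natural) parameter eta = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def log_partition(eta):
    """A(eta) = log(1 + e^eta), determined by normalization over x in {0, 1}."""
    return math.log1p(math.exp(eta))

def bernoulli_canonical(x, eta):
    """Canonical form exp(eta * x - A(eta)) with h(x) = 1 and T(x) = x."""
    return math.exp(eta * x - log_partition(eta))

p = 0.3
eta = eta_of_p(p)
print(bernoulli_pmf(1, p), bernoulli_canonical(1, eta))
# The canonical form sums to one (up to rounding) because A is the normalizer.
print(bernoulli_canonical(0, eta) + bernoulli_canonical(1, eta))
```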

Factorization of the variables involved

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:

$$f(x),\quad g(\theta),\quad c^{f(x)},\quad c^{g(\theta)},\quad [f(x)]^{c},\quad [g(\theta)]^{c},\quad [f(x)]^{g(\theta)},\quad [g(\theta)]^{f(x)},\quad [f(x)]^{h(x)\,g(\theta)},\quad \text{or}\quad [g(\theta)]^{h(x)\,j(\theta)},$$

where $f$ and $h$ are arbitrary functions of $x$, the observed statistical variable; $g$ and $j$ are arbitrary functions of $\theta$, the fixed parameters defining the shape of the distribution; and $c$ is any arbitrary constant expression (i.e. a number or an expression that does not change with either $x$ or $\theta$).

There are further restrictions on how many such factors can occur. For example, the two expressions:

$$[f(x)\,g(\theta)]^{h(x)\,j(\theta)}, \qquad [f(x)]^{h(x)\,j(\theta)}\,[g(\theta)]^{h(x)\,j(\theta)},$$

are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,

$$[f(x)\,g(\theta)]^{h(x)\,j(\theta)} = [f(x)]^{h(x)\,j(\theta)}\,[g(\theta)]^{h(x)\,j(\theta)} = e^{[h(x) \log f(x)]\,j(\theta) + h(x)\,[j(\theta) \log g(\theta)]},$$

it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.[citation needed])

To see why an expression of the form

$$[f(x)]^{g(\theta)}$$

qualifies,

$$[f(x)]^{g(\theta)} = e^{g(\theta) \log f(x)}$$

and hence factorizes inside of the exponent. Similarly,

$$[f(x)]^{h(x)\,g(\theta)} = e^{h(x)\,g(\theta) \log f(x)} = e^{[h(x) \log f(x)]\,g(\theta)}$$

and again factorizes inside of the exponent.

A factor consisting of a sum where both types of variables are involved (e.g. a factor of the form $1 + f(x)\,g(\theta)$) cannot be factorized in this fashion (except in some cases where occurring directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.
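As a worked illustration of these rules (an example chosen here, using the shape-rate parameterization of the gamma density, which is an assumption not fixed by the text above), each factor of the gamma density is of an allowed form and the whole density collapses into a single exponent:

```latex
\begin{aligned}
f(x \mid \alpha, \beta)
  &= \underbrace{\frac{\beta^{\alpha}}{\Gamma(\alpha)}}_{g(\theta)}
     \;\underbrace{x^{\alpha - 1}}_{[f(x)]^{g(\theta)}}
     \;\underbrace{e^{-\beta x}}_{[f(x)]^{g(\theta)}} \\
  &= \exp\bigl[(\alpha - 1)\log x \;-\; \beta x \;-\; \bigl(\log\Gamma(\alpha) - \alpha \log\beta\bigr)\bigr],
\end{aligned}
```

so that one may take $T(x) = (\log x,\; x)$, $\eta(\theta) = (\alpha - 1,\; -\beta)$, $h(x) = 1$ and $A(\theta) = \log\Gamma(\alpha) - \alpha\log\beta$. By contrast, the Cauchy density contains the factor $\bigl[1 + (x - x_0)^2/\gamma^2\bigr]^{-1}$, a sum mixing $x$ with the parameters, which admits no such rewriting.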

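Vector parameter

The definition in terms of one real-number parameter can be extended to one real-vector parameter

$$\boldsymbol\theta \equiv \left[\theta_1, \theta_2, \ldots, \theta_s\right]^{\mathsf T}.$$

A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

$$f_X(x \mid \boldsymbol\theta) = h(x)\,\exp\left(\sum_{i=1}^{s} \eta_i(\boldsymbol\theta)\,T_i(x) - A(\boldsymbol\theta)\right),$$

or in a more compact form,

$$f_X(x \mid \boldsymbol\theta) = h(x)\,\exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(x) - A(\boldsymbol\theta)\Bigr).$$

This form writes the sum as a dot product of vector-valued functions $\boldsymbol\eta(\boldsymbol\theta)$ and $\mathbf{T}(x)$.

An alternative, equivalent form often seen is

$$f_X(x \mid \boldsymbol\theta) = h(x)\,g(\boldsymbol\theta)\,\exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(x)\Bigr).$$

As in the scalar valued case, the exponential family is said to be in canonical form if

$$\eta_i(\boldsymbol\theta) = \theta_i \quad \forall i.$$

A vector exponential family is said to be curved if the dimension of

$$\boldsymbol\theta \equiv \left[\theta_1, \theta_2, \ldots, \theta_d\right]^{\mathsf T}$$

is less than the dimension of the vector

$$\boldsymbol\eta(\boldsymbol\theta) \equiv \left[\eta_1(\boldsymbol\theta), \eta_2(\boldsymbol\theta), \ldots, \eta_s(\boldsymbol\theta)\right]^{\mathsf T}.$$

That is, if the dimension, d, of the parameter vector is less than the number of functions, s, of the parameter vector in the above representation of the probability density function. Most common distributions in the exponential family are not curved, and many algorithms designed to work with any exponential family implicitly or explicitly assume that the distribution is not curved.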
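A standard illustration of a curved family (an example added here, not taken from the text above) is the normal distribution whose variance is tied to its mean, $X \sim \mathcal{N}(\mu, \mu^2)$ with $\mu \neq 0$. Expanding the square in the density gives:

```latex
\begin{aligned}
f_X(x \mid \mu)
  &= \frac{1}{\sqrt{2\pi\mu^{2}}}\,\exp\!\left(-\frac{(x - \mu)^{2}}{2\mu^{2}}\right) \\
  &= \frac{1}{\sqrt{2\pi}}\,
     \exp\!\left(\frac{1}{\mu}\,x \;-\; \frac{1}{2\mu^{2}}\,x^{2}
       \;-\; \Bigl(\tfrac{1}{2} + \log|\mu|\Bigr)\right).
\end{aligned}
```

Here $\boldsymbol\eta(\mu) = \bigl(1/\mu,\; -1/(2\mu^{2})\bigr)$ and $\mathbf{T}(x) = (x,\; x^{2})$, so the parameter has dimension $d = 1$ while the natural parameter has dimension $s = 2$: the image of $\boldsymbol\eta$ is a one-dimensional curve inside the two-dimensional natural parameter space.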

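Just as in the case of a scalar-valued parameter, the function $A(\boldsymbol\theta)$, or equivalently $g(\boldsymbol\theta)$, is automatically determined by the normalization constraint, once the other functions have been chosen. Even if $\boldsymbol\eta(\boldsymbol\theta)$ is not one-to-one, functions $A(\boldsymbol\eta)$ and $g(\boldsymbol\eta)$ can be defined by requiring that the distribution is normalized for each value of the natural parameter $\boldsymbol\eta$. This yields the canonical form

$$f_X(x \mid \boldsymbol\eta) = h(x)\,\exp\Bigl(\boldsymbol\eta \cdot \mathbf{T}(x) - A(\boldsymbol\eta)\Bigr),$$

or equivalently

$$f_X(x \mid \boldsymbol\eta) = h(x)\,g(\boldsymbol\eta)\,\exp\Bigl(\boldsymbol\eta \cdot \mathbf{T}(x)\Bigr).$$

The above forms may sometimes be seen with $\boldsymbol\eta^{\mathsf T}\mathbf{T}(x)$ in place of $\boldsymbol\eta \cdot \mathbf{T}(x)$. These are exactly equivalent formulations, merely using different notation for the dot product.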

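Vector parameter, vector variable

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar x replaced by the vector

$$\mathbf{x} = \left(x_1, x_2, \cdots, x_k\right)^{\mathsf T}.$$

The dimension k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential family) the dimension s of the natural parameter $\boldsymbol\eta$ and sufficient statistic T(x).

The distribution in this case is written as

$$f_X(\mathbf{x} \mid \boldsymbol\theta) = h(\mathbf{x})\,\exp\left(\sum_{i=1}^{s} \eta_i(\boldsymbol\theta)\,T_i(\mathbf{x}) - A(\boldsymbol\theta)\right)$$

Or more compactly as

$$f_X(\mathbf{x} \mid \boldsymbol\theta) = h(\mathbf{x})\,\exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\theta)\Bigr)$$

Or alternatively as

$$f_X(\mathbf{x} \mid \boldsymbol\theta) = g(\boldsymbol\theta)\,h(\mathbf{x})\,\exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x})\Bigr)$$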

Measure-theoretic formulation

We use cumulative distribution functions (CDF) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to $dH(\mathbf{x})$ are integrals with respect to the reference measure of the exponential family generated by H.

Any member of that exponential family has cumulative distribution function

$$dF(\mathbf{x} \mid \boldsymbol\theta) = \exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\theta)\Bigr)\,dH(\mathbf{x}).$$

H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density $f(x)$ with respect to a reference measure $dx$ (typically Lebesgue measure), one can write $dF(x) = f(x)\,dx$. In this case, H is also absolutely continuous and can be written $dH(x) = h(x)\,dx$, so the formulas reduce to that of the previous paragraphs. If F is discrete, then H is a step function (with steps on the support of F).

Alternatively, we can write the probability measure directly as

$$P(d\mathbf{x} \mid \boldsymbol\theta) = \exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\theta)\Bigr)\,\mu(d\mathbf{x})$$

for some reference measure $\mu$.

Interpretation

In the definitions above, the functions T(x), η(θ), and A(η) were arbitrary. However, these functions have important interpretations in the resulting probability distribution.

  • T(x) is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that holds all information the data x provides with regard to the unknown parameter values. This means that, for any data sets $x$ and $y$, the likelihood ratio is the same, that is $\frac{f(x;\theta_1)}{f(x;\theta_2)} = \frac{f(y;\theta_1)}{f(y;\theta_2)}$, if T(x) = T(y). This is true even if x and y are not equal to each other. The dimension of T(x) equals the number of parameters of θ and encompasses all of the information regarding the data related to the parameter θ. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters). (This important property is discussed further below.)
  • η is called the natural parameter. The set of values of η for which the function $f_X(x \mid \eta)$ is integrable is called the natural parameter space. It can be shown that the natural parameter space is always convex.
  • A(η) is called the log-partition function[b] because it is the logarithm of a normalization factor, without which $f_X(x \mid \theta)$ would not be a probability distribution:

$$A(\eta) = \log\left(\int_X h(x)\,\exp\bigl(\eta(\theta) \cdot T(x)\bigr)\,dx\right)$$

The function A is important in its own right, because the mean, variance and other moments of the sufficient statistic T(x) can be derived simply by differentiating A(η). For example, because log(x) is one of the components of the sufficient statistic of the gamma distribution, $\operatorname{E}[\log x]$ can be easily determined for this distribution using A(η). Technically, this is true because

$$K(u \mid \eta) = A(\eta + u) - A(\eta),$$

is the cumulant generating function of the sufficient statistic.
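The two interpretations above can be checked numerically. The sketch below assumes two simple cases chosen for illustration: i.i.d. Bernoulli data, where the sufficient statistic is the sum of the observations, and the Poisson family in canonical form, where $A(\eta) = e^{\eta}$ and $T(x) = x$. Derivatives of $A$ are approximated by finite differences, so the results are approximate.

```python
import math

# --- T(x) as a sufficient statistic (i.i.d. Bernoulli data, T = sum of observations) ---
def bernoulli_likelihood(data, p):
    """Joint likelihood of i.i.d. Bernoulli(p) observations."""
    return math.prod(p if xi == 1 else 1.0 - p for xi in data)

def likelihood_ratio(data, p1, p2):
    """Likelihood ratio f(data; p1) / f(data; p2)."""
    return bernoulli_likelihood(data, p1) / bernoulli_likelihood(data, p2)

x = [1, 0, 1, 0, 0]
y = [0, 0, 0, 1, 1]   # different data, same statistic T = 2
print(likelihood_ratio(x, 0.2, 0.7), likelihood_ratio(y, 0.2, 0.7))

# --- A(eta) as a generator of moments (Poisson: A(eta) = exp(eta), T(x) = x) ---
def A(eta):
    """Log-partition function of the Poisson family in canonical form."""
    return math.exp(eta)

def mean_and_variance_of_T(eta, step=1e-4):
    """Central finite differences approximating A'(eta) and A''(eta)."""
    first = (A(eta + step) - A(eta - step)) / (2 * step)
    second = (A(eta + step) - 2 * A(eta) + A(eta - step)) / step ** 2
    return first, second

lam = 3.0
# For a Poisson variable both the mean and the variance equal lam.
print(mean_and_variance_of_T(math.log(lam)))
```

The two likelihood ratios agree because they depend on the data only through the sum, and the two finite-difference values should both be close to 3.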

Properties

Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases, it can be shown that only exponential families have these properties. Examples:

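Given an exponential family defined by $f_X(x \mid \theta) = h(x)\,\exp\bigl[\theta \cdot T(x) - A(\theta)\bigr]$, where $\Theta$ is the parameter space, such that $\theta \in \Theta \subset \mathbb{R}^k$. Then

  • If $\Theta$ has nonempty interior in $\mathbb{R}^k$, then given any IID samples $X_1, \ldots, X_n \sim f_X$, the statistic $T(X_1, \ldots, X_n) := \sum_{i=1}^{n} T(X_i)$ is a complete statistic for $\theta$.[9][10]
  • $T$ is a minimal statistic for $\theta$ iff for all $\theta_1, \theta_2 \in \Theta$, and $x_1, x_2$ in the support of $X$, if $(\theta_1 - \theta_2) \cdot \bigl(T(x_1) - T(x_2)\bigr) = 0$, then $\theta_1 = \theta_2$ or $x_1 = x_2$.[11]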

Examples

It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, ALAAM, von Mises, and von Mises-Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound $x_m$ forms an exponential family. The families of binomial and multinomial distributions with fixed number of trials n but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) r is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family.

As mentioned above, as a general rule, the support of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families: in all of these cases, the parameter in question affects the support (in particular, by changing the minimum or maximum possible value). For similar reasons, neither the discrete uniform distribution nor the continuous uniform distribution is an exponential family when one or both bounds vary.

The Weibull distribution with fixed shape parameter k is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (k appears in the exponent of an exponent).
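A minimal sketch of the fixed-shape Weibull case, assuming the scale-shape density $(k/\lambda)(x/\lambda)^{k-1} e^{-(x/\lambda)^k}$: with $k$ held fixed one can take $h(x) = x^{k-1}$, $T(x) = x^k$, $\eta = -1/\lambda^k$ and $A(\eta) = -\log(-\eta) - \log k$. The function names are illustrative.

```python
import math

def weibull_pdf(x, lam, k):
    """Weibull density with scale lam and shape k, for x > 0."""
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def weibull_expfam(x, lam, k):
    """Same density with k fixed: h(x) * exp(eta * T(x) - A(eta))."""
    h = x ** (k - 1)                      # base measure uses the known shape k
    T = x ** k                            # sufficient statistic
    eta = -1.0 / lam ** k                 # natural parameter (always negative)
    A = -math.log(-eta) - math.log(k)     # log-partition function
    return h * math.exp(eta * T - A)

print(weibull_pdf(1.3, lam=2.0, k=1.5), weibull_expfam(1.3, lam=2.0, k=1.5))
```

Note that $T(x) = x^k$ is only a fixed function of $x$ because $k$ is known; if $k$ were a free parameter, $x^k$ would mix data and parameter inside the exponent.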

In general, distributions that result from a finite or infinite mixture of other distributions, e.g. mixture model densities and compound probability distributions, are not exponential families. Examples are typical Gaussian mixture models as well as many heavy-tailed distributions that result from compounding (i.e. infinitely mixing) a distribution with a prior distribution over one of its parameters, e.g. the Student's t-distribution (compounding a normal distribution over a gamma-distributed precision prior), and the beta-binomial and Dirichlet-multinomial distributions. Other examples of distributions that are not exponential families are the F-distribution, Cauchy distribution, hypergeometric distribution and logistic distribution.

The following are some detailed examples of the representation of some useful distributions as exponential families.

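Normal distribution: unknown mean, known variance

As a first example, consider a random variable distributed normally with unknown mean μ and known variance σ². The probability density function is then

$$f_\sigma(x; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x - \mu)^2 / (2\sigma^2)}.$$

This is a single-parameter exponential family, as can be seen by setting

$$h_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2 / (2\sigma^2)}, \qquad T_\sigma(x) = \frac{x}{\sigma}, \qquad A_\sigma(\mu) = \frac{\mu^2}{2\sigma^2}, \qquad \eta_\sigma(\mu) = \frac{\mu}{\sigma}.$$

If σ = 1 this is in canonical form, as then η(μ) = μ.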
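The decomposition can be checked numerically. The sketch below assumes one standard choice of the four functions, namely $h(x) = e^{-x^2/(2\sigma^2)}/\sqrt{2\pi\sigma^2}$, $T(x) = x/\sigma$, $\eta(\mu) = \mu/\sigma$ and $A(\mu) = \mu^2/(2\sigma^2)$; the function names are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_known_var_expfam(x, mu, sigma):
    """Same density as h(x) * exp(eta(mu) * T(x) - A(mu)) with sigma known."""
    h = math.exp(-(x ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    T = x / sigma                     # sufficient statistic
    eta = mu / sigma                  # reduces to eta = mu when sigma = 1
    A = mu ** 2 / (2 * sigma ** 2)    # log-partition in terms of mu
    return h * math.exp(eta * T - A)

print(normal_pdf(0.7, mu=1.2, sigma=2.0), normal_known_var_expfam(0.7, mu=1.2, sigma=2.0))
```

All dependence on the data that does not involve μ is absorbed into $h(x)$, which is why σ must be known for this single-parameter form.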

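Normal distribution: unknown mean and unknown variance

Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y - \mu)^2 / (2\sigma^2)}.$$

This is an exponential family which can be written in canonical form by defining

$$h(y) = \frac{1}{\sqrt{2\pi}}, \qquad \boldsymbol\eta = \left[\frac{\mu}{\sigma^2},\; -\frac{1}{2\sigma^2}\right], \qquad T(y) = \left(y,\; y^2\right)^{\mathsf T}, \qquad A(\boldsymbol\eta) = \frac{\mu^2}{2\sigma^2} + \log|\sigma| = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\log\left|\frac{1}{2\eta_2}\right|.$$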

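Binomial distribution

As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

$$f(x) = \binom{n}{x}\,p^x\,(1 - p)^{n - x}, \quad x \in \{0, 1, 2, \ldots, n\}.$$

This can equivalently be written as

$$f(x) = \binom{n}{x}\,\exp\left(x \log\left(\frac{p}{1 - p}\right) + n \log(1 - p)\right),$$

which shows that the binomial distribution is an exponential family, whose natural parameter is

$$\eta = \log\frac{p}{1 - p}.$$

This function of p is known as logit.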
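A short sketch of this rewriting, with illustrative function names: taking $h(x) = \binom{n}{x}$, $T(x) = x$ and $\eta = \operatorname{logit}(p)$, the term $n \log(1 - p)$ becomes $-A(\eta)$ with $A(\eta) = n \log(1 + e^{\eta})$.

```python
import math

def binomial_pmf(x, n, p):
    """Binomial pmf in the usual parameter p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def logit(p):
    """Natural parameter eta = log(p / (1 - p))."""
    return math.log(p / (1 - p))

def binomial_expfam(x, n, eta):
    """h(x) * exp(eta * T(x) - A(eta)) with h(x) = C(n, x), T(x) = x."""
    A = n * math.log1p(math.exp(eta))   # equals -n * log(1 - p)
    return math.comb(n, x) * math.exp(eta * x - A)

n, p = 10, 0.35
for x in (0, 3, 10):
    print(x, binomial_pmf(x, n, p), binomial_expfam(x, n, logit(p)))
```

The number of trials $n$ appears in both $h(x)$ and $A(\eta)$, which is consistent with it having to be known rather than estimated.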

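Table of distributions

The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards[12] for main exponential families.

For a scalar variable and scalar parameter, the form is as follows:

$$f_X(x \mid \theta) = h(x)\,\exp\Bigl(\eta(\theta)\,T(x) - A(\eta)\Bigr)$$

For a scalar variable and vector parameter:

$$f_X(x \mid \boldsymbol\theta) = h(x)\,\exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(x) - A(\boldsymbol\eta)\Bigr)$$
$$f_X(x \mid \boldsymbol\theta) = h(x)\,g(\boldsymbol\theta)\,\exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(x)\Bigr)$$

For a vector variable and vector parameter:

$$f_X(\mathbf{x} \mid \boldsymbol\theta) = h(\mathbf{x})\,\exp\Bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf{T}(\mathbf{x}) - A(\boldsymbol\eta)\Bigr)$$

The above formulas choose the functional form of the exponential family with a log-partition function $A(\boldsymbol\eta)$. The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter $\boldsymbol\theta$ instead of the natural parameter, and/or using a factor $g(\boldsymbol\eta)$ outside of the exponential. The relation between the latter and the former is:

$$A(\boldsymbol\eta) = -\log g(\boldsymbol\eta)$$
$$g(\boldsymbol\eta) = e^{-A(\boldsymbol\eta)}$$

To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.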
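As one illustration of such a conversion, the sketch below uses the normal distribution with unknown mean and variance, assuming the canonical form $h(y) = 1/\sqrt{2\pi}$, $T(y) = (y, y^2)$, $\eta = \bigl(\mu/\sigma^2,\, -1/(2\sigma^2)\bigr)$ and $A(\eta) = -\eta_1^2/(4\eta_2) + \tfrac12\log\lvert 1/(2\eta_2)\rvert$. The function names are illustrative.

```python
import math

def natural_params(mu, sigma2):
    """Map the usual parameters (mu, sigma^2) to natural parameters (eta1, eta2)."""
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def inverse_map(eta1, eta2):
    """Inverse parameter mapping: recover (mu, sigma^2) from (eta1, eta2)."""
    return (-eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2))

def log_partition(eta1, eta2):
    """A(eta) = -eta1^2 / (4 eta2) + (1/2) log|1 / (2 eta2)|."""
    return -eta1 ** 2 / (4.0 * eta2) + 0.5 * math.log(abs(1.0 / (2.0 * eta2)))

def normal_canonical(y, eta1, eta2):
    """Density h(y) * exp(eta . T(y) - A(eta)) with T(y) = (y, y^2)."""
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(eta1 * y + eta2 * y ** 2 - log_partition(eta1, eta2))

eta1, eta2 = natural_params(mu=1.5, sigma2=4.0)
print(inverse_map(eta1, eta2))            # round trip back to (mu, sigma^2)
print(normal_canonical(0.3, eta1, eta2))  # density at y = 0.3
```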

Distribution | Parameter(s) θ | Natural parameter(s) η | Inverse parameter mapping | Base measure h(x) | Sufficient statistic T(x) | Log-partition A(η) | Log-partition A(θ)
Bernoulli distribution
binomial distribution with known number of trials n
Poisson distribution
negative binomial distribution with known number of failures r
exponential distribution
Pareto distribution with known minimum value x_m
Weibull distribution with known shape k
Laplace distribution with known mean μ
chi-squared distribution
normal distribution, known variance
continuous Bernoulli distribution
normal distribution
log-normal distribution
inverse Gaussian distribution
gamma distribution
inverse gamma distribution
generalized inverse Gaussian distribution
scaled inverse chi-squared distribution
beta distribution (variant 1)
beta distribution (variant 2)
multivariate normal distribution
categorical distribution (variant 1)
categorical distribution (variant 2)
categorical distribution (variant 3)
multinomial distribution (variant 1) with known number of trials n
exponential, family, confused, with, exponential, distribution, natural, parameter, redirects, here, this, term, differential, geometry, natural, parametrization, probability, statistics, exponential, family, parametric, probability, distributions, certain, fo. Not to be confused with the exponential distribution Natural parameter redirects here For use of this term in differential geometry see Natural parametrization In probability and statistics an exponential family is a parametric set of probability distributions of a certain form specified below This special form is chosen for mathematical convenience including the enabling of the user to calculate expectations covariances using differentiation based on some useful algebraic properties as well as for generality as exponential families are in a sense very natural sets of distributions to consider The term exponential class is sometimes used in place of exponential family 1 or the older term Koopman Darmois family Sometimes loosely referred to as the exponential family this class of distributions is distinct because they all possess a variety of desirable properties most importantly the existence of a sufficient statistic The concept of exponential families is credited to 2 E J G Pitman 3 G Darmois 4 and B O Koopman 5 in 1935 1936 Exponential families of distributions provide a general framework for selecting a possible alternative parameterisation of a parametric family of distributions in terms of natural parameters and for defining useful sample statistics called the natural sufficient statistics of the family Contents 1 Nomenclature difficulty 2 Definition 2 1 Examples of exponential family distributions 2 2 Scalar parameter 2 2 1 Support must be independent of 8 2 2 2 Vector valued x and 8 2 2 3 Canonical formulation 2 3 Factorization of the variables involved 2 4 Vector parameter 2 5 Vector parameter vector variable 2 6 Measure theoretic formulation 3 Interpretation 4 Properties 5 Examples 5 1 Normal 
distribution unknown mean known variance 5 2 Normal distribution unknown mean and unknown variance 5 3 Binomial distribution 6 Table of distributions 7 Moments and cumulants of the sufficient statistic 7 1 Normalization of the distribution 7 2 Moment generating function of the sufficient statistic 7 2 1 Differential identities for cumulants 7 2 2 Example 1 7 2 3 Example 2 7 2 4 Example 3 8 Entropy 8 1 Relative entropy 8 2 Maximum entropy derivation 9 Role in statistics 9 1 Classical estimation sufficiency 9 2 Bayesian estimation conjugate distributions 9 3 Unbiased estimation 9 4 Hypothesis testing uniformly most powerful tests 9 5 Generalized linear models 10 See also 11 Footnotes 12 References 12 1 Citations 12 2 Sources 13 Further reading 14 External linksNomenclature difficulty editThe terms distribution and family are often used loosely Specifically an exponential family is a set of distributions where the specific distribution varies with the parameter a however a parametric family of distributions is often referred to as a distribution like the normal distribution meaning the family of normal distributions and the set of all exponential families is sometimes loosely referred to as the exponential family Definition editMost of the commonly used distributions form an exponential family or subset of an exponential family listed in the subsection below The subsections following it are a sequence of increasingly more general mathematical definitions of an exponential family A casual reader may wish to restrict attention to the first and simplest definition which corresponds to a single parameter family of discrete or continuous probability distributions Examples of exponential family distributions edit Exponential families include many of the most common distributions Among many others exponential families includes the following 6 normal exponential gamma chi squared beta Dirichlet Bernoulli categorical Poisson Wishart inverse Wishart geometric A number of common 
distributions are exponential families but only when certain parameters are fixed and known For example binomial with fixed number of trials multinomial with fixed number of trials negative binomial with fixed number of failures Note that in each case the parameters which must be fixed are those that set a limit on the range of values that can possibly be observed Examples of common distributions that are not exponential families are Student s t most mixture distributions and even the family of uniform distributions when the bounds are not fixed See the section below on examples for more discussion Scalar parameter edit The value of 8 displaystyle theta nbsp is called the parameter of the family A single parameter exponential family is a set of probability distributions whose probability density function or probability mass function for the case of a discrete distribution can be expressed in the form fX x 8 h x exp h 8 T x A 8 displaystyle f X left x big theta right h x exp bigl eta theta cdot T x A theta bigr nbsp where T x displaystyle T x nbsp h x displaystyle h x nbsp h 8 displaystyle eta theta nbsp and A 8 displaystyle A theta nbsp are known functions The function h x displaystyle h x nbsp must be non negative An alternative equivalent form often given is fX x 8 h x g 8 exp h 8 T x displaystyle f X left x big theta right h x g theta exp bigl eta theta cdot T x bigr nbsp or equivalently fX x 8 exp h 8 T x A 8 B x displaystyle f X left x big theta right exp bigl eta theta cdot T x A theta B x bigr nbsp Note that g 8 e A 8 displaystyle quad g theta e A theta quad nbsp and h x eB x displaystyle quad h x e B x nbsp Support must be independent of 8 edit Importantly the support of fX x 8 displaystyle f X left x big theta right nbsp all the possible x displaystyle x nbsp values for which fX x 8 displaystyle f X left x big theta right nbsp is greater than 0 displaystyle 0 nbsp is required to not depend on 8 displaystyle theta nbsp 7 This requirement can be used to 
exclude a parametric family distribution from being an exponential family For example The Pareto distribution has a pdf which is defined for x xm displaystyle x geq x mathsf m nbsp the minimum value xm displaystyle x m nbsp being the scale parameter and its support therefore has a lower limit of xm displaystyle x mathsf m nbsp Since the support of fa xm x displaystyle f alpha x m x nbsp is dependent on the value of the parameter the family of Pareto distributions does not form an exponential family of distributions at least when xm displaystyle x m nbsp is unknown Another example Bernoulli type distributions binomial negative binomial geometric distribution and similar can only be included in the exponential class if the number of Bernoulli trials n displaystyle n nbsp is treated as a fixed constant excluded from the free parameter s 8 displaystyle theta nbsp since the allowed number of trials sets the limits for the number of successes or failures that can be observed in a set of trials Vector valued x and 8 edit Often x displaystyle x nbsp is a vector of measurements in which case T x displaystyle T x nbsp may be a function from the space of possible values of x displaystyle x nbsp to the real numbers More generally h 8 displaystyle eta theta nbsp and T x displaystyle T x nbsp can each be vector valued such that h 8 T x displaystyle eta theta cdot T x nbsp is real valued However see the discussion below on vector parameters regarding the curved exponential family Canonical formulation edit If h 8 8 displaystyle eta theta theta nbsp then the exponential family is said to be in canonical form By defining a transformed parameter h h 8 displaystyle eta eta theta nbsp it is always possible to convert an exponential family to canonical form The canonical form is non unique since h 8 displaystyle eta theta nbsp can be multiplied by any nonzero constant provided that T x displaystyle T x nbsp is multiplied by that constant s reciprocal or a constant c can be added to h 8 
displaystyle eta theta nbsp and h x displaystyle h x nbsp multiplied by exp c T x displaystyle exp bigl c cdot T x bigr nbsp to offset it In the special case that h 8 8 displaystyle eta theta theta nbsp and T x x displaystyle T x x nbsp then the family is called a natural exponential family Even when x displaystyle x nbsp is a scalar and there is only a single parameter the functions h 8 displaystyle eta theta nbsp and T x displaystyle T x nbsp can still be vectors as described below The function A 8 displaystyle A theta nbsp or equivalently g 8 displaystyle g theta nbsp is automatically determined once the other functions have been chosen since it must assume a form that causes the distribution to be normalized sum or integrate to one over the entire domain Furthermore both of these functions can always be written as functions of h displaystyle eta nbsp even when h 8 displaystyle eta theta nbsp is not a one to one function i e two or more different values of 8 displaystyle theta nbsp map to the same value of h 8 displaystyle eta theta nbsp and hence h 8 displaystyle eta theta nbsp cannot be inverted In such a case all values of 8 displaystyle theta nbsp mapping to the same h 8 displaystyle eta theta nbsp will also have the same value for A 8 displaystyle A theta nbsp and g 8 displaystyle g theta nbsp Factorization of the variables involved edit What is important to note and what characterizes all exponential family variants is that the parameter s and the observation variable s must factorize can be separated into products each of which involves only one type of variable either directly or within either part the base or exponent of an exponentiation operation Generally this means that all of the factors constituting the density or mass function must be of one of the following forms f x g 8 cf x cg 8 f x c g 8 c f x g 8 g 8 f x f x h x g 8 or g 8 h x j 8 displaystyle f x g theta c f x c g theta f x c g theta c f x g theta g theta f x f x h x g theta mathsf or g 
theta h x j theta nbsp where f displaystyle f nbsp and h displaystyle h nbsp are arbitrary functions of x displaystyle x nbsp the observed statistical variable g displaystyle g nbsp and j displaystyle j nbsp are arbitrary functions of 8 displaystyle theta nbsp the fixed parameters defining the shape of the distribution and c displaystyle c nbsp is any arbitrary constant expression i e a number or an expression that does not change with either x displaystyle x nbsp or 8 displaystyle theta nbsp There are further restrictions on how many such factors can occur For example the two expressions f x g 8 h x j 8 f x h x j 8 g 8 h x j 8 displaystyle f x g theta h x j theta qquad f x h x j theta g theta h x j theta nbsp are the same i e a product of two allowed factors However when rewritten into the factorized form f x g 8 h x j 8 f x h x j 8 g 8 h x j 8 e h x log f x j 8 h x j 8 log g 8 displaystyle f x g theta h x j theta f x h x j theta g theta h x j theta e h x log f x j theta h x j theta log g theta nbsp it can be seen that it cannot be expressed in the required form However a form of this sort is a member of a curved exponential family which allows multiple factorized terms in the exponent citation needed To see why an expression of the form f x g 8 displaystyle f x g theta nbsp qualifies f x g 8 eg 8 log f x displaystyle f x g theta e g theta log f x nbsp and hence factorizes inside of the exponent Similarly f x h x g 8 eh x g 8 log f x e h x log f x g 8 displaystyle f x h x g theta e h x g theta log f x e h x log f x g theta nbsp and again factorizes inside of the exponent A factor consisting of a sum where both types of variables are involved e g a factor of the form 1 f x g 8 displaystyle 1 f x g theta nbsp cannot be factorized in this fashion except in some cases where occurring directly in an exponent this is why for example the Cauchy distribution and Student s t distribution are not exponential families Vector parameter edit The definition in terms of one real 
number parameter can be extended to one real vector parameter 8 81 82 8s T displaystyle boldsymbol theta equiv left theta 1 theta 2 ldots theta s right mathsf T nbsp A family of distributions is said to belong to a vector exponential family if the probability density function or probability mass function for discrete distributions can be written as fX x 8 h x exp i 1shi 8 Ti x A 8 displaystyle f X x mid boldsymbol theta h x exp left sum i 1 s eta i boldsymbol theta T i x A boldsymbol theta right nbsp or in a more compact form fX x 8 h x exp h 8 T x A 8 displaystyle f X x mid boldsymbol theta h x exp Big boldsymbol eta boldsymbol theta cdot mathbf T x A boldsymbol theta Big nbsp This form writes the sum as a dot product of vector valued functions h 8 displaystyle boldsymbol eta boldsymbol theta nbsp and T x displaystyle mathbf T x nbsp An alternative equivalent form often seen is fX x 8 h x g 8 exp h 8 T x displaystyle f X x mid boldsymbol theta h x g boldsymbol theta exp Big boldsymbol eta boldsymbol theta cdot mathbf T x Big nbsp As in the scalar valued case the exponential family is said to be in canonical form if hi 8 8i i displaystyle quad eta i boldsymbol theta theta i quad forall i nbsp A vector exponential family is said to be curved if the dimension of 8 81 82 8d T displaystyle boldsymbol theta equiv left theta 1 theta 2 ldots theta d right mathsf T nbsp is less than the dimension of the vector h 8 h1 8 h2 8 hs 8 T displaystyle boldsymbol eta boldsymbol theta equiv left eta 1 boldsymbol theta eta 2 boldsymbol theta ldots eta s boldsymbol theta right mathsf T nbsp That is if the dimension d of the parameter vector is less than the number of functions s of the parameter vector in the above representation of the probability density function Most common distributions in the exponential family are not curved and many algorithms designed to work with any exponential family implicitly or explicitly assume that the distribution is not curved Just as in the case of 
a scalar-valued parameter, the function $A(\boldsymbol\theta)$, or equivalently $g(\boldsymbol\theta)$, is automatically determined by the normalization constraint once the other functions have been chosen. Even if $\boldsymbol\eta(\boldsymbol\theta)$ is not one-to-one, functions $A(\boldsymbol\eta)$ and $g(\boldsymbol\eta)$ can be defined by requiring that the distribution is normalized for each value of the natural parameter $\boldsymbol\eta$. This yields the canonical form

$$f_X(x \mid \boldsymbol\eta) = h(x)\,\exp\Big(\boldsymbol\eta \cdot \mathbf T(x) - A(\boldsymbol\eta)\Big),$$

or equivalently

$$f_X(x \mid \boldsymbol\eta) = h(x)\,g(\boldsymbol\eta)\,\exp\Big(\boldsymbol\eta \cdot \mathbf T(x)\Big).$$

The above forms may sometimes be seen with $\boldsymbol\eta^{\mathsf T}\mathbf T(x)$ in place of $\boldsymbol\eta \cdot \mathbf T(x)$. These are exactly equivalent formulations, merely using different notation for the dot product.

Vector parameter, vector variable

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable, with each occurrence of the scalar x replaced by the vector

$$\mathbf x = \left(x_1, x_2, \cdots, x_k\right)^{\mathsf T}.$$

The dimension k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential family) the dimension s of the natural parameter $\boldsymbol\eta$ and sufficient statistic $\mathbf T(\mathbf x)$. The distribution in this case is written as

$$f_X(\mathbf x \mid \boldsymbol\theta) = h(\mathbf x)\,\exp\left(\sum_{i=1}^{s} \eta_i(\boldsymbol\theta)\,T_i(\mathbf x) - A(\boldsymbol\theta)\right),$$

or more compactly as

$$f_X(\mathbf x \mid \boldsymbol\theta) = h(\mathbf x)\,\exp\Big(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf T(\mathbf x) - A(\boldsymbol\theta)\Big),$$

or alternatively as

$$f_X(\mathbf x \mid \boldsymbol\theta) = g(\boldsymbol\theta)\,h(\mathbf x)\,\exp\Big(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf T(\mathbf x)\Big).$$

Measure-theoretic formulation

We use cumulative distribution functions (CDF) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to $\mathrm d H(\mathbf x)$ are integrals with respect to the reference measure of the exponential family generated by H.

Any member of that exponential family has cumulative distribution function

$$\mathrm d F(\mathbf x \mid \boldsymbol\theta) = \exp\bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf T(\mathbf x) - A(\boldsymbol\theta)\bigr)\,\mathrm d H(\mathbf x).$$

H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density $f(x)$ with respect to a reference measure $\mathrm d x$ (typically Lebesgue measure), one can write $\mathrm d F(x) = f(x)\,\mathrm d x$. In this case, H is also absolutely continuous and can be written $\mathrm d H(x) = h(x)\,\mathrm d x$, so the formulas reduce to those of the previous paragraphs. If F is discrete, then H is a step function (with steps on the support of F).

Alternatively, we can write the probability measure directly as

$$P(\mathrm d\mathbf x \mid \boldsymbol\theta) = \exp\bigl(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf T(\mathbf x) - A(\boldsymbol\theta)\bigr)\,\mu(\mathrm d\mathbf x)$$

for some reference measure $\mu$.

Interpretation

In the definitions above, the functions T(x), η(θ), and A(η) were arbitrary. However, these functions have important interpretations in the resulting probability distribution.

T(x) is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that holds all information the data x provides with regard to the unknown parameter values. This means that, for any data sets $x$ and $y$, the likelihood ratio is the same, that is

$$\frac{f(x;\theta_1)}{f(x;\theta_2)} = \frac{f(y;\theta_1)}{f(y;\theta_2)} \quad \text{if } T(x) = T(y).$$

This is true even if x and y are not equal to each other. The dimension of T(x) equals the number of parameters of θ and encompasses all of the information regarding the data related to the parameter θ. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters). (This important property is discussed further below.)

η is called the natural parameter. The set of values of η for which the function $f_X(x;\eta)$ is integrable is called the natural parameter space. It can be shown that the natural parameter space is always convex.

A(η) is called the log-partition function[b] because it is the logarithm of a normalization factor, without which $f_X(x;\theta)$ would not be a probability distribution:

$$A(\eta) = \log\left(\int_X h(x)\,\exp\bigl(\eta(\theta) \cdot T(x)\bigr)\,\mathrm d x\right).$$

The function A is important in its own right, because the mean, variance and other moments of the sufficient statistic T(x) can be derived simply by
differentiating A(η). For example, because log(x) is one of the components of the sufficient statistic of the gamma distribution, $\operatorname{E}[\log x]$ can be easily determined for this distribution using A(η). Technically, this is true because

$$K(u \mid \eta) = A(\eta + u) - A(\eta)$$

is the cumulant generating function of the sufficient statistic.

Properties

Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases, it can be shown that only exponential families have these properties. Examples:

- Exponential families are the only families with sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values (Pitman–Koopman–Darmois theorem).
- Exponential families have conjugate priors, an important property in Bayesian statistics.
- The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form (provided that the normalizing factor of the exponential-family distribution can itself be written in closed form).[c]
- In the mean-field approximation in variational Bayes (used for approximating the posterior distribution in large Bayesian networks), the best approximating posterior distribution of an exponential-family node (a node is a random variable in the context of Bayesian networks) with a conjugate prior is in the same family as the node.[8]

Consider an exponential family defined by

$$f_X(x \mid \theta) = h(x)\,\exp\bigl(\theta \cdot T(x) - A(\theta)\bigr),$$

where $\Theta$ is the parameter space, such that $\theta \in \Theta \subset \mathbb R^k$. Then:

- If $\Theta$ has nonempty interior in $\mathbb R^k$, then given any IID samples $X_1, \ldots, X_n \sim f_X$, the statistic $T(X_1, \ldots, X_n) := \sum_{i=1}^{n} T(X_i)$ is a complete statistic for $\theta$.[9][10]
- $T$ is a minimal statistic for $\theta$ iff for all $\theta_1, \theta_2 \in \Theta$ and $x_1, x_2$ in the support of $X$, if $(\theta_1 - \theta_2) \cdot (T(x_1) - T(x_2)) = 0$, then $\theta_1 = \theta_2$ or $x_1 = x_2$.[11]

Examples

It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, ALAAM, von Mises, and von Mises–Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound $x_m$ forms an exponential family. The families of binomial and multinomial distributions with fixed number of trials n but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) r is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family.

As mentioned above, as a general rule, the support of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families: in all of the cases, the parameter in question affects the support (particularly, changing the minimum or maximum possible value). For similar
reasons, neither the discrete uniform distribution nor the continuous uniform distribution is an exponential family, as one or both bounds vary.

The Weibull distribution with fixed shape parameter k is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (k appears in the exponent of an exponent).

In general, distributions that result from a finite or infinite mixture of other distributions, e.g. mixture model densities and compound probability distributions, are not exponential families. Examples are typical Gaussian mixture models as well as many heavy-tailed distributions that result from compounding (i.e. infinitely mixing) a distribution with a prior distribution over one of its parameters, e.g. the Student's t-distribution (compounding a normal distribution over a gamma-distributed precision prior), and the beta-binomial and Dirichlet-multinomial distributions. Other examples of distributions that are not exponential families are the F-distribution, Cauchy distribution, hypergeometric distribution and logistic distribution.

Following are some detailed examples of the representation of some useful distributions as exponential families.

Normal distribution: unknown mean, known variance

As a first example, consider a random variable distributed normally with unknown mean μ and known variance σ². The probability density function is then

$$f_\sigma(x;\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/(2\sigma^2)}.$$

This is a single-parameter exponential family, as can be seen by setting

$$\begin{aligned}
h_\sigma(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-x^2/(2\sigma^2)} \\
T_\sigma(x) &= \frac{x}{\sigma} \\
A_\sigma(\mu) &= \frac{\mu^2}{2\sigma^2} \\
\eta_\sigma(\mu) &= \frac{\mu}{\sigma}.
\end{aligned}$$

If σ = 1 this is in canonical form, as then η(μ) = μ.
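The decomposition of the normal density with known variance can be checked numerically. The following is a minimal sketch in Python, not part of the original article: all function names (`normal_pdf`, `h`, `T`, `eta`, `A`, `exp_family_pdf`, `dA_deta`) are illustrative assumptions. It writes the log-partition function in terms of the natural parameter, A(η) = η²/2, which follows from substituting μ = ση into A_σ(μ) = μ²/(2σ²), and it uses a central finite difference to illustrate that differentiating A with respect to η recovers the mean of the sufficient statistic, E[T(x)] = μ/σ.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density written in its usual form."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def h(x, sigma):
    """Base measure h_sigma(x); it does not depend on mu."""
    return math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def T(x, sigma):
    """Sufficient statistic T_sigma(x) = x / sigma."""
    return x / sigma


def eta(mu, sigma):
    """Natural parameter eta_sigma(mu) = mu / sigma."""
    return mu / sigma


def A(eta_value):
    """Log-partition function in terms of eta: mu^2 / (2 sigma^2) = eta^2 / 2."""
    return eta_value ** 2 / 2


def exp_family_pdf(x, mu, sigma):
    """Same density assembled as h(x) * exp(eta * T(x) - A(eta))."""
    e = eta(mu, sigma)
    return h(x, sigma) * math.exp(e * T(x, sigma) - A(e))


def dA_deta(eta_value, step=1e-5):
    """Central finite-difference approximation of dA/d(eta)."""
    return (A(eta_value + step) - A(eta_value - step)) / (2 * step)


if __name__ == "__main__":
    x, mu, sigma = 0.7, 1.5, 2.0  # arbitrary illustrative values
    print("usual form:            ", normal_pdf(x, mu, sigma))
    print("exponential-family form:", exp_family_pdf(x, mu, sigma))
    print("dA/d(eta):", dA_deta(eta(mu, sigma)), " E[T] = mu/sigma:", mu / sigma)
```

The two density values should agree up to floating-point rounding, and the finite-difference derivative should be close to μ/σ. A finite difference is used here only to keep the sketch free of dependencies; an analytic derivative or an automatic-differentiation library would serve equally well.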

Normal distribution: unknown mean and unknown variance

Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

$$f(y;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y-\mu)^2/(2\sigma^2)}.$$

This is an exponential family which can be written in canonical form by defining

$$\begin{aligned}
\boldsymbol\eta &= \left[\,\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\,\right] \\
h(y) &= \frac{1}{\sqrt{2\pi}} \\
T(y) &= \left(y,\ y^2\right)^{\mathsf T} \\
A(\boldsymbol\eta) &= \frac{\mu^2}{2\sigma^2} + \log\lvert\sigma\rvert = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\log\left\lvert\frac{1}{2\eta_2}\right\rvert.
\end{aligned}$$

Binomial distribution

As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

$$f(x) = \binom{n}{x}\,p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}.$$

This can equivalently be written as

$$f(x) = \binom{n}{x}\,\exp\left(x \log\left(\frac{p}{1-p}\right) + n \log(1-p)\right),$$

which shows that the binomial distribution is an exponential family whose natural parameter is

$$\eta = \log\frac{p}{1-p}.$$

This function of p is known as the logit.

Table of distributions

The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards[12] for the main exponential families.

For a scalar variable and scalar parameter, the form is as follows:

$$f_X(x \mid \theta) = h(x)\,\exp\Big(\eta(\theta)\,T(x) - A(\eta)\Big)$$

For a scalar variable and vector parameter:

$$f_X(x \mid \boldsymbol\theta) = h(x)\,\exp\Big(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf T(x) - A(\boldsymbol\eta)\Big)$$

$$f_X(x \mid \boldsymbol\theta) = h(x)\,g(\boldsymbol\theta)\,\exp\Big(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf T(x)\Big)$$

For a vector variable and vector parameter:

$$f_X(\mathbf x \mid \boldsymbol\theta) = h(\mathbf x)\,\exp\Big(\boldsymbol\eta(\boldsymbol\theta) \cdot \mathbf T(\mathbf x) - A(\boldsymbol\eta)\Big)$$

The above formulas choose the functional form of the exponential family with a log-partition function $A(\boldsymbol\eta)$. The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter $\boldsymbol\theta$ instead of the natural parameter, and/or using a factor $g(\boldsymbol\eta)$ outside of the exponential. The relation between the latter and the former is:

$$A(\boldsymbol\eta) = -\log g(\boldsymbol\eta)$$

$$g(\boldsymbol\eta) = e^{-A(\boldsymbol\eta)}$$

To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.

| Distribution | Parameter(s) $\boldsymbol\theta$ | Natural parameter(s) $\boldsymbol\eta$ | Inverse parameter mapping | Base measure $h(x)$ | Sufficient statistic $T(x)$ | Log-partition $A(\boldsymbol\eta)$ | Log-partition $A(\boldsymbol\theta)$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bernoulli distribution | $p$ | $\log\frac{p}{1-p}$ (This is the logit function.) | $\frac{1}{1+e^{-\eta}} = \frac{e^\eta}{1+e^\eta}$ (This is the logistic function.) | $1$ | $x$ | $\log(1+e^\eta)$ | $-\log(1-p)$ |
| binomial distribution with known number of trials $n$ | $p$ | $\log\frac{p}{1-p}$ | $\frac{1}{1+e^{-\eta}} = \frac{e^\eta}{1+e^\eta}$ | $\binom{n}{x}$ | $x$ | $n\log(1+e^\eta)$ | $-n\log(1-p)$ |
| Poisson distribution | $\lambda$ | $\log\lambda$ | $e^\eta$ | $\frac{1}{x!}$ | $x$ | $e^\eta$ | $\lambda$ |
| negative binomial distribution with known number of failures $r$ | $p$ | $\log(1-p)$ | $1-e^\eta$ | $\binom{x+r-1}{x}$ | $x$ | $-r\log(1-e^\eta)$ | $-r\log p$ |
| exponential distribution | $\lambda$ | $-\lambda$ | $-\eta$ | $1$ | $x$ | $-\log(-\eta)$ | $-\log\lambda$ |
| Pareto distribution with known minimum value $x_m$ | $\alpha$ | $-\alpha-1$ | $-1-\eta$ | $1$ | $\log x$ | $-\log(-1-\eta) + (1+\eta)\log x_{\mathrm m}$ | $-\log\alpha - \alpha\log x_{\mathrm m}$ |
| Weibull distribution with known shape $k$ | $\lambda$ | $-\frac{1}{\lambda^k}$ | $(-\eta)^{-1/k}$ | $x^{k-1}$ | $x^k$ | $-\log(-\eta) - \log k$ | $k\log\lambda - \log k$ |
| Laplace distribution with known mean $\mu$ | $b$ | $-\frac{1}{b}$ | $-\frac{1}{\eta}$ | $1$ | $\lvert x-\mu\rvert$ | $\log\left(-\frac{2}{\eta}\right)$ | $\log 2b$ |
| chi-squared distribution | $\nu$ | $\frac{\nu}{2}-1$ | $2(\eta+1)$ | $e^{-x/2}$ | $\log x$ | $\log\Gamma(\eta+1) + (\eta+1)\log 2$ | $\log\Gamma\left(\frac{\nu}{2}\right) + \frac{\nu}{2}\log 2$ |
| normal distribution, known variance | $\mu$ | $\frac{\mu}{\sigma}$ | $\sigma\eta$ | $\frac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi}\,\sigma}$ | $\frac{x}{\sigma}$ | $\frac{\eta^2}{2}$ | $\frac{\mu^2}{2\sigma^2}$ |
| continuous Bernoulli distribution | $\lambda$ | $\log\frac{\lambda}{1-\lambda}$ | $\frac{e^\eta}{1+e^\eta}$ | $1$ | $x$ | $\log\frac{e^\eta-1}{\eta}$ | $\log\left(\frac{1-2\lambda}{(1-\lambda)\log\left(\frac{1-\lambda}{\lambda}\right)}\right)$ |
| normal distribution | $\mu,\ \sigma^2$ | $\begin{bmatrix}\dfrac{\mu}{\sigma^2} \\ -\dfrac{1}{2\sigma^2}\end{bmatrix}$ | $\begin{bmatrix}-\dfrac{\eta_1}{2\eta_2} \\ -\dfrac{1}{2\eta_2}\end{bmatrix}$ | $\frac{1}{\sqrt{2\pi}}$ | $\begin{bmatrix}x \\ x^2\end{bmatrix}$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | $\frac{\mu^2}{2\sigma^2} + \log\sigma$ |
| log-normal distribution | $\mu,\ \sigma^2$ | $\begin{bmatrix}\dfrac{\mu}{\sigma^2} \\ -\dfrac{1}{2\sigma^2}\end{bmatrix}$ | $\begin{bmatrix}-\dfrac{\eta_1}{2\eta_2} \\ -\dfrac{1}{2\eta_2}\end{bmatrix}$ | $\frac{1}{\sqrt{2\pi}\,x}$ | $\begin{bmatrix}\log x \\ (\log x)^2\end{bmatrix}$ | $-\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | $\frac{\mu^2}{2\sigma^2} + \log\sigma$ |
| inverse Gaussian distribution | $\mu,\ \lambda$ | $\begin{bmatrix}-\dfrac{\lambda}{2\mu^2} \\ -\dfrac{\lambda}{2}\end{bmatrix}$ | $\begin{bmatrix}\sqrt{\dfrac{\eta_2}{\eta_1}} \\ -2\eta_2\end{bmatrix}$ | $\frac{1}{\sqrt{2\pi}\,x^{3/2}}$ | $\begin{bmatrix}x \\ \dfrac{1}{x}\end{bmatrix}$ | $-2\sqrt{\eta_1\eta_2} - \frac{1}{2}\log(-2\eta_2)$ | $-\frac{\lambda}{\mu} - \frac{1}{2}\log\lambda$ |
| gamma distribution | $\alpha,\ \beta$ | $\begin{bmatrix}\alpha-1 \\ -\beta\end{bmatrix}$ | $\begin{bmatrix}\eta_1+1 \\ -\eta_2\end{bmatrix}$ | $1$ | $\begin{bmatrix}\log x \\ x\end{bmatrix}$ | $\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ |
| gamma distribution | $k,\ \theta$ | $\begin{bmatrix}k-1 \\ -\dfrac{1}{\theta}\end{bmatrix}$ | $\begin{bmatrix}\eta_1+1 \\ -\dfrac{1}{\eta_2}\end{bmatrix}$ | $1$ | $\begin{bmatrix}\log x \\ x\end{bmatrix}$ | $\log\Gamma(\eta_1+1) - (\eta_1+1)\log(-\eta_2)$ | $\log\Gamma(k) + k\log\theta$ |
| inverse gamma distribution | $\alpha,\ \beta$ | $\begin{bmatrix}-\alpha-1 \\ -\beta\end{bmatrix}$ | $\begin{bmatrix}-\eta_1-1 \\ -\eta_2\end{bmatrix}$ | $1$ | $\begin{bmatrix}\log x \\ \frac{1}{x}\end{bmatrix}$ | $\log\Gamma(-\eta_1-1) - (-\eta_1-1)\log(-\eta_2)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ |
| generalized inverse Gaussian distribution | $p,\ a,\ b$ | $\begin{bmatrix}p-1 \\ -a/2 \\ -b/2\end{bmatrix}$ | $\begin{bmatrix}\eta_1+1 \\ -2\eta_2 \\ -2\eta_3\end{bmatrix}$ | $1$ | $\begin{bmatrix}\log x \\ x \\ \frac{1}{x}\end{bmatrix}$ | $\log 2K_{\eta_1+1}\left(\sqrt{4\eta_2\eta_3}\right) - \frac{\eta_1+1}{2}\log\frac{\eta_2}{\eta_3}$ | $\log 2K_p\left(\sqrt{ab}\right) - \frac{p}{2}\log\frac{a}{b}$ |
| scaled inverse chi-squared distribution | $\nu,\ \sigma^2$ | $\begin{bmatrix}-\dfrac{\nu}{2}-1 \\ -\dfrac{\nu\sigma^2}{2}\end{bmatrix}$ | $\begin{bmatrix}-2(\eta_1+1) \\ \dfrac{\eta_2}{\eta_1+1}\end{bmatrix}$ | $1$ | $\begin{bmatrix}\log x \\ \frac{1}{x}\end{bmatrix}$ | $\log\Gamma(-\eta_1-1) - (-\eta_1-1)\log(-\eta_2)$ | $\log\Gamma\left(\frac{\nu}{2}\right) - \frac{\nu}{2}\log\frac{\nu\sigma^2}{2}$ |
| beta distribution (variant 1) | $\alpha,\ \beta$ | $\begin{bmatrix}\alpha \\ \beta\end{bmatrix}$ | $\begin{bmatrix}\eta_1 \\ \eta_2\end{bmatrix}$ | $\frac{1}{x(1-x)}$ | $\begin{bmatrix}\log x \\ \log(1-x)\end{bmatrix}$ | $\log\Gamma(\eta_1) + \log\Gamma(\eta_2) - \log\Gamma(\eta_1+\eta_2)$ | $\log\Gamma(\alpha) + \log\Gamma(\beta) - \log\Gamma(\alpha+\beta)$ |
| beta distribution (variant 2) | $\alpha,\ \beta$ | $\begin{bmatrix}\alpha-1 \\ \beta-1\end{bmatrix}$ | $\begin{bmatrix}\eta_1+1 \\ \eta_2+1\end{bmatrix}$ | $1$ | $\begin{bmatrix}\log x \\ \log(1-x)\end{bmatrix}$ | $\log\Gamma(\eta_1+1) + \log\Gamma(\eta_2+1) - \log\Gamma(\eta_1+\eta_2+2)$ | $\log\Gamma(\alpha) + \log\Gamma(\beta) - \log\Gamma(\alpha+\beta)$ |
| multivariate normal distribution | $\boldsymbol\mu,\ \boldsymbol\Sigma$ | $\begin{bmatrix}\boldsymbol\Sigma^{-1}\boldsymbol\mu \\ -\frac{1}{2}\boldsymbol\Sigma^{-1}\end{bmatrix}$ | $\begin{bmatrix}-\frac{1}{2}\boldsymbol\eta_2^{-1}\boldsymbol\eta_1 \\ -\frac{1}{2}\boldsymbol\eta_2^{-1}\end{bmatrix}$ | $(2\pi)^{-\frac{k}{2}}$ | $\begin{bmatrix}\mathbf x \\ \mathbf x\mathbf x^{\mathsf T}\end{bmatrix}$ | $-\frac{1}{4}\boldsymbol\eta_1^{\mathsf T}\boldsymbol\eta_2^{-1}\boldsymbol\eta_1 - \frac{1}{2}\log\left\lvert -2\boldsymbol\eta_2\right\rvert$ | $\frac{1}{2}\boldsymbol\mu^{\mathsf T}\boldsymbol\Sigma^{-1}\boldsymbol\mu + \frac{1}{2}\log\lvert\boldsymbol\Sigma\rvert$ |
| categorical distribution (variant 1) | $p_1, \ldots, p_k$ where $\sum_{i=1}^{k} p_i = 1$ | $\begin{bmatrix}\log p_1 \\ \vdots \\ \log p_k\end{bmatrix}$ | $\begin{bmatrix}e^{\eta_1} \\ \vdots \\ e^{\eta_k}\end{bmatrix}$ where $\sum_{i=1}^{k} e^{\eta_i} = 1$ | $1$ | $\begin{bmatrix}[x=1] \\ \vdots \\ [x=k]\end{bmatrix}$, where $[x=i]$ is the Iverson bracket | $0$ | $0$ |
| categorical distribution (variant 2) | $p_1, \ldots, p_k$ where $\sum_{i=1}^{k} p_i = 1$ | $\begin{bmatrix}\log p_1 + C \\ \vdots \\ \log p_k + C\end{bmatrix}$ | $\begin{bmatrix}\dfrac{1}{C}e^{\eta_1} \\ \vdots \\ \dfrac{1}{C}e^{\eta_k}\end{bmatrix} = \begin{bmatrix}\dfrac{e^{\eta_1}}{\sum_{i=1}^{k} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_k}}{\sum_{i=1}^{k} e^{\eta_i}}\end{bmatrix}$ where $\sum_{i=1}^{k} e^{\eta_i} = C$ | $1$ | $\begin{bmatrix}[x=1] \\ \vdots \\ [x=k]\end{bmatrix}$, where $[x=i]$ is the Iverson bracket | $0$ | $0$ |
| categorical distribution (variant 3) | $p_1, \ldots, p_k$ where $p_k = 1 - \sum_{i=1}^{k-1} p_i$ | $\begin{bmatrix}\log\dfrac{p_1}{p_k} \\ \vdots \\ \log\dfrac{p_{k-1}}{p_k} \\ 0\end{bmatrix} = \begin{bmatrix}\log\dfrac{p_1}{1-\sum_{i=1}^{k-1} p_i} \\ \vdots \\ \log\dfrac{p_{k-1}}{1-\sum_{i=1}^{k-1} p_i} \\ 0\end{bmatrix}$ (This is the inverse softmax function, a generalization of the logit function.) | $\begin{bmatrix}\dfrac{e^{\eta_1}}{\sum_{i=1}^{k} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_k}}{\sum_{i=1}^{k} e^{\eta_i}}\end{bmatrix} = \begin{bmatrix}\dfrac{e^{\eta_1}}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \\ \vdots \\ \dfrac{e^{\eta_{k-1}}}{1+\sum_{i=1}^{k-1} e^{\eta_i}} \\ \dfrac{1}{1+\sum_{i=1}^{k-1} e^{\eta_i}}\end{bmatrix}$ (This is the softmax function, a generalization of the logistic function.) | $1$ | $\begin{bmatrix}[x=1] \\ \vdots \\ [x=k]\end{bmatrix}$, where $[x=i]$ is the Iverson bracket | $\log\left(\sum_{i=1}^{k} e^{\eta_i}\right) = \log\left(1+\sum_{i=1}^{k-1} e^{\eta_i}\right)$ | $-\log p_k = -\log\left(1-\sum_{i=1}^{k-1} p_i\right)$ |
| multinomial distribution (variant 1) with known number of trials $n$ | $p_1, \ldots, p_k$ where $\sum_{i=1}^{k} p_i = 1$ | $\begin{bmatrix}\log p_1 \\ \vdots \\ \log p_k\end{bmatrix}$ | $\begin{bmatrix}e^{\eta_1} \\ \vdots \\ e^{\eta_k}\end{bmatrix}$ where $\sum_{i=1}^{k} e^{\eta_i} = 1$ | $\frac{n!}{\prod_{i=1}^{k} x_i!}$ | $\begin{bmatrix}x_1 \\ \vdots \\ x_k\end{bmatrix}$ | | |
