
Maximum likelihood estimation

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.[1] The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference.[2][3][4]

If the likelihood function is differentiable, the derivative test for finding maxima can be applied. In some cases, the first-order conditions of the likelihood function can be solved analytically; for instance, the ordinary least squares estimator for a linear regression model maximizes the likelihood when the random errors are assumed to have normal distributions with the same variance.[5]

From the perspective of Bayesian inference, MLE is generally equivalent to maximum a posteriori (MAP) estimation with uniform prior distributions (or a normal prior distribution with a standard deviation of infinity). In frequentist inference, MLE is a special case of an extremum estimator, with the objective function being the likelihood.

Principles

We model a set of observations as a random sample from an unknown joint probability distribution which is expressed in terms of a set of parameters. The goal of maximum likelihood estimation is to determine the parameters for which the observed data have the highest joint probability. We write the parameters governing the joint distribution as a vector $\theta = \left[\theta_1,\,\theta_2,\,\ldots,\,\theta_k\right]^{\mathsf T}$ so that this distribution falls within a parametric family $\{f(\cdot\,;\theta)\mid\theta\in\Theta\}$, where $\Theta$ is called the parameter space, a finite-dimensional subset of Euclidean space. Evaluating the joint density at the observed data sample $\mathbf y = (y_1, y_2, \ldots, y_n)$ gives a real-valued function,

$\mathcal L_n(\theta) = \mathcal L_n(\theta;\mathbf y) = f_n(\mathbf y;\theta),$

which is called the likelihood function. For independent and identically distributed random variables, $f_n(\mathbf y;\theta)$ will be the product of univariate density functions:

$f_n(\mathbf y;\theta) = \prod_{k=1}^n f_k^{\mathrm{univar}}(y_k;\theta).$

The goal of maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function over the parameter space,[6] that is

$\hat\theta = \underset{\theta\in\Theta}{\operatorname{arg\,max}}\;\mathcal L_n(\theta;\mathbf y).$

Intuitively, this selects the parameter values that make the observed data most probable. The specific value $\hat\theta = \hat\theta_n(\mathbf y)\in\Theta$ that maximizes the likelihood function $\mathcal L_n$ is called the maximum likelihood estimate. Further, if the function $\hat\theta_n\colon\mathbb R^n\to\Theta$ so defined is measurable, then it is called the maximum likelihood estimator. It is generally a function defined over the sample space, i.e. taking a given sample as its argument. A sufficient but not necessary condition for its existence is for the likelihood function to be continuous over a parameter space $\Theta$ that is compact.[7] For an open $\Theta$ the likelihood function may increase without ever reaching a supremum value.

In practice, it is often convenient to work with the natural logarithm of the likelihood function, called the log-likelihood:

$\ell(\theta;\mathbf y) = \ln\mathcal L_n(\theta;\mathbf y).$

Since the logarithm is a monotonic function, the maximum of $\ell(\theta;\mathbf y)$ occurs at the same value of $\theta$ as does the maximum of $\mathcal L_n$.[8] If $\ell(\theta;\mathbf y)$ is differentiable in $\Theta$, the necessary conditions for the occurrence of a maximum (or a minimum) are

$\frac{\partial\ell}{\partial\theta_1} = 0,\quad \frac{\partial\ell}{\partial\theta_2} = 0,\quad \ldots,\quad \frac{\partial\ell}{\partial\theta_k} = 0,$

known as the likelihood equations. For some models, these equations can be explicitly solved for $\widehat\theta,$ but in general no closed-form solution to the maximization problem is known or available, and an MLE can only be found via numerical optimization. Another problem is that in finite samples, there may exist multiple roots for the likelihood equations.[9] Whether the identified root $\widehat\theta$ of the likelihood equations is indeed a (local) maximum depends on whether the matrix of second-order partial and cross-partial derivatives, the so-called Hessian matrix

$\mathbf H\bigl(\widehat\theta\bigr) = \begin{bmatrix} \left.\dfrac{\partial^2\ell}{\partial\theta_1^2}\right|_{\theta=\widehat\theta} & \left.\dfrac{\partial^2\ell}{\partial\theta_1\,\partial\theta_2}\right|_{\theta=\widehat\theta} & \dots & \left.\dfrac{\partial^2\ell}{\partial\theta_1\,\partial\theta_k}\right|_{\theta=\widehat\theta} \\ \left.\dfrac{\partial^2\ell}{\partial\theta_2\,\partial\theta_1}\right|_{\theta=\widehat\theta} & \left.\dfrac{\partial^2\ell}{\partial\theta_2^2}\right|_{\theta=\widehat\theta} & \dots & \left.\dfrac{\partial^2\ell}{\partial\theta_2\,\partial\theta_k}\right|_{\theta=\widehat\theta} \\ \vdots & \vdots & \ddots & \vdots \\ \left.\dfrac{\partial^2\ell}{\partial\theta_k\,\partial\theta_1}\right|_{\theta=\widehat\theta} & \left.\dfrac{\partial^2\ell}{\partial\theta_k\,\partial\theta_2}\right|_{\theta=\widehat\theta} & \dots & \left.\dfrac{\partial^2\ell}{\partial\theta_k^2}\right|_{\theta=\widehat\theta} \end{bmatrix},$

is negative semi-definite at $\widehat\theta$, as this indicates local concavity. Conveniently, most common probability distributions – in particular the exponential family – are logarithmically concave.[10][11]
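
When the likelihood equations have no explicit solution, the maximization is carried out numerically, as described further below under "Iterative procedures". The following is a minimal Python sketch of that approach, assuming NumPy and SciPy are available; the gamma model and the simulated data are illustrative, chosen because the gamma shape parameter has no closed-form maximum likelihood estimate.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Simulated observations; in practice y would be the observed sample.
rng = np.random.default_rng(0)
y = rng.gamma(shape=2.5, scale=1.3, size=500)

def neg_log_likelihood(log_params):
    # Optimize over log-parameters so shape and scale stay positive.
    shape, scale = np.exp(log_params)
    return -np.sum(gamma.logpdf(y, a=shape, scale=scale))

# Maximizing the likelihood = minimizing the negative log-likelihood.
result = minimize(neg_log_likelihood, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
shape_hat, scale_hat = np.exp(result.x)
print(shape_hat, scale_hat)  # close to the generating values (2.5, 1.3) for large samples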

Restricted parameter space

While the domain of the likelihood function—the parameter space—is generally a finite-dimensional subset of Euclidean space, additional restrictions sometimes need to be incorporated into the estimation process. The parameter space can be expressed as

$\Theta = \left\{\theta : \theta\in\mathbb R^k,\; h(\theta) = 0\right\},$

where $h(\theta) = \left[h_1(\theta), h_2(\theta), \ldots, h_r(\theta)\right]$ is a vector-valued function mapping $\mathbb R^k$ into $\mathbb R^r.$ Estimating the true parameter $\theta$ belonging to $\Theta$ then, as a practical matter, means to find the maximum of the likelihood function subject to the constraint $h(\theta) = 0.$

Theoretically, the most natural approach to this constrained optimization problem is the method of substitution, that is "filling out" the restrictions $h_1, h_2, \ldots, h_r$ to a set $h_1, h_2, \ldots, h_r, h_{r+1}, \ldots, h_k$ in such a way that $h^\ast = \left[h_1, h_2, \ldots, h_k\right]$ is a one-to-one function from $\mathbb R^k$ to itself, and reparameterize the likelihood function by setting $\phi_i = h_i(\theta_1, \theta_2, \ldots, \theta_k).$[12] Because of the equivariance of the maximum likelihood estimator, the properties of the MLE apply to the restricted estimates also.[13] For instance, in a multivariate normal distribution the covariance matrix $\Sigma$ must be positive-definite; this restriction can be imposed by replacing $\Sigma = \Gamma^{\mathsf T}\Gamma,$ where $\Gamma$ is a real upper triangular matrix and $\Gamma^{\mathsf T}$ is its transpose.[14]

In practice, restrictions are usually imposed using the method of Lagrange which, given the constraints as defined above, leads to the restricted likelihood equations

$\frac{\partial\ell}{\partial\theta} - \frac{\partial h(\theta)^{\mathsf T}}{\partial\theta}\lambda = 0$ and $h(\theta) = 0,$

where $\lambda = \left[\lambda_1, \lambda_2, \ldots, \lambda_r\right]^{\mathsf T}$ is a column-vector of Lagrange multipliers and $\frac{\partial h(\theta)^{\mathsf T}}{\partial\theta}$ is the k × r Jacobian matrix of partial derivatives.[12] Naturally, if the constraints are not binding at the maximum, the Lagrange multipliers should be zero.[15] This in turn allows for a statistical test of the "validity" of the constraint, known as the Lagrange multiplier test.
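
As a small illustration of the substitution approach, the following Python sketch (assuming NumPy and SciPy; the normal model and data are illustrative) imposes the restriction σ > 0 by reparameterizing with σ = exp(φ) and maximizing over φ; by the equivariance noted above, transforming the unrestricted maximizer back gives the restricted estimate.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=3.0, size=1000)

def neg_log_likelihood(params):
    # Reparameterization: sigma = exp(phi) enforces the restriction sigma > 0.
    mu, phi = params
    sigma = np.exp(phi)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="BFGS")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
# By equivariance, (mu_hat, sigma_hat) is the MLE of the original (mu, sigma).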

Nonparametric Maximum Likelihood Estimation

Nonparametric maximum likelihood estimation can be performed using the empirical likelihood.

Properties

A maximum likelihood estimator is an extremum estimator obtained by maximizing, as a function of θ, the objective function $\widehat\ell(\theta;x)$. If the data are independent and identically distributed, then we have

$\widehat\ell(\theta;x) = \frac1n\sum_{i=1}^n \ln f(x_i\mid\theta),$

this being the sample analogue of the expected log-likelihood $\ell(\theta) = \operatorname{\mathbb E}\bigl[\ln f(x_i\mid\theta)\bigr]$, where this expectation is taken with respect to the true density.

Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter-value.[16] However, like other estimation methods, maximum likelihood estimation possesses a number of attractive limiting properties: As the sample size increases to infinity, sequences of maximum likelihood estimators have these properties:

  • Consistency: the sequence of MLEs converges in probability to the value being estimated.
  • Invariance: If $\hat\theta$ is the maximum likelihood estimator for $\theta$, and if $g(\theta)$ is any transformation of $\theta$, then the maximum likelihood estimator for $\alpha = g(\theta)$ is $\hat\alpha = g(\hat\theta)$. This property is less commonly known as functional equivariance. The invariance property holds for an arbitrary transformation $g$, although the proof simplifies if $g$ is restricted to one-to-one transformations.
  • Efficiency, i.e. it achieves the Cramér–Rao lower bound when the sample size tends to infinity. This means that no consistent estimator has lower asymptotic mean squared error than the MLE (or other estimators attaining this bound), which also means that MLE has asymptotic normality.
  • Second-order efficiency after correction for bias.

Consistency

Under the conditions outlined below, the maximum likelihood estimator is consistent. The consistency means that if the data were generated by $f(\cdot\,;\theta_0)$ and we have a sufficiently large number of observations n, then it is possible to find the value of θ0 with arbitrary precision. In mathematical terms this means that as n goes to infinity the estimator $\widehat\theta$ converges in probability to its true value:

$\widehat\theta_{\mathrm{mle}} \xrightarrow{\text{p}} \theta_0.$

Under slightly stronger conditions, the estimator converges almost surely (or strongly):

$\widehat\theta_{\mathrm{mle}} \xrightarrow{\text{a.s.}} \theta_0.$

In practical applications, data is never generated by $f(\cdot\,;\theta_0)$. Rather, $f(\cdot\,;\theta_0)$ is a model, often in idealized form, of the process that generated the data. It is a common aphorism in statistics that all models are wrong. Thus, true consistency does not occur in practical applications. Nevertheless, consistency is often considered to be a desirable property for an estimator to have.

To establish consistency, the following conditions are sufficient.[17]

  1. Identification of the model:
     $\theta\neq\theta_0 \quad\Leftrightarrow\quad f(\cdot\mid\theta)\neq f(\cdot\mid\theta_0).$

    In other words, different parameter values θ correspond to different distributions within the model. If this condition did not hold, there would be some value θ1 such that θ0 and θ1 generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data—these parameters would have been observationally equivalent.

    The identification condition is absolutely necessary for the ML estimator to be consistent. When this condition holds, the limiting likelihood function ℓ(θ|·) has a unique global maximum at θ0.
  2. Compactness: the parameter space Θ of the model is compact.

    The identification condition establishes that the log-likelihood has a unique global maximum. Compactness implies that the likelihood cannot approach the maximum value arbitrarily closely at some other point.

    Compactness is only a sufficient condition and not a necessary condition. Compactness can be replaced by some other conditions, such as:

    • both concavity of the log-likelihood function and compactness of some (nonempty) upper level sets of the log-likelihood function, or
    • existence of a compact neighborhood N of θ0 such that outside of N the log-likelihood function is less than the maximum by at least some ε > 0.
  3. Continuity: the function ln f(x | θ) is continuous in θ for almost all values of x:
     $\operatorname{\mathbb P}\Bigl[\,\ln f(x\mid\theta) \in C^0(\Theta)\,\Bigr] = 1.$
    The continuity here can be replaced with a slightly weaker condition of upper semi-continuity.
  4. Dominance: there exists D(x) integrable with respect to the distribution f(x | θ0) such that
     $\Bigl|\ln f(x\mid\theta)\Bigr| < D(x) \quad\text{for all } \theta\in\Theta.$
    By the uniform law of large numbers, the dominance condition together with continuity establish the uniform convergence in probability of the log-likelihood:
     $\sup_{\theta\in\Theta}\left|\widehat\ell(\theta\mid x) - \ell(\theta)\right| \xrightarrow{\text{p}} 0.$

The dominance condition can be employed in the case of i.i.d. observations. In the non-i.i.d. case, the uniform convergence in probability can be checked by showing that the sequence $\widehat\ell(\theta\mid x)$ is stochastically equicontinuous. If one wants to demonstrate that the ML estimator $\widehat\theta$ converges to θ0 almost surely, then a stronger condition of uniform convergence almost surely has to be imposed:

$\sup_{\theta\in\Theta}\left|\widehat\ell(\theta\mid x) - \ell(\theta)\right| \xrightarrow{\text{a.s.}} 0.$

Additionally, if (as assumed above) the data were generated by $f(\cdot\,;\theta_0)$, then under certain conditions, it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution. Specifically,[18]

$\sqrt n\left(\widehat\theta_{\mathrm{mle}} - \theta_0\right) \xrightarrow{d} \mathcal N\left(0,\, I^{-1}\right),$

where I is the Fisher information matrix.
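
Consistency can be illustrated by simulation. The following minimal Python sketch (assuming NumPy; the exponential model is illustrative) uses the fact that the maximum likelihood estimator of an exponential rate is the reciprocal of the sample mean, and shows the estimate approaching the true value as the sample size grows.

import numpy as np

rng = np.random.default_rng(2)
theta0 = 2.0  # true rate parameter

for n in (10, 100, 10_000, 1_000_000):
    x = rng.exponential(scale=1 / theta0, size=n)
    theta_hat = 1 / x.mean()      # maximum likelihood estimate of the rate
    print(n, theta_hat)           # converges in probability to theta0 as n grows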

Functional invariance

The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators, as the corresponding component of the MLE of the complete parameter. Consistent with this, if $\widehat\theta$ is the MLE for $\theta$, and if $g(\theta)$ is any transformation of $\theta$, then the MLE for $\alpha = g(\theta)$ is by definition[19]

$\widehat\alpha = g\bigl(\widehat\theta\bigr).$

It maximizes the so-called profile likelihood:

$\bar L(\alpha) = \sup_{\theta :\, \alpha = g(\theta)} L(\theta).$

The MLE is also equivariant with respect to certain transformations of the data. If $y = g(x)$ where $g$ is one to one and does not depend on the parameters to be estimated, then the density functions satisfy

$f_Y(y) = \frac{f_X(x)}{|g'(x)|},$

and hence the likelihood functions for $X$ and $Y$ differ only by a factor that does not depend on the model parameters.

For example, the MLE parameters of the log-normal distribution are the same as those of the normal distribution fitted to the logarithm of the data.
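
A minimal Python sketch of this last point, assuming NumPy and SciPy (the simulated data are illustrative): maximizing the log-normal likelihood numerically returns the same parameter values as the closed-form normal MLE applied to the logarithm of the data.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import lognorm

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.5, sigma=0.8, size=2000)

# Normal MLE fitted to log(x): closed form (note the 1/n normalization of the variance).
mu_hat = np.log(x).mean()
sigma_hat = np.log(x).std()

# Direct numerical MLE of the log-normal parameters for comparison.
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(lognorm.logpdf(x, s=np.exp(log_sigma), scale=np.exp(mu)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
# result.x[0] is close to mu_hat and exp(result.x[1]) to sigma_hat, as equivariance predicts.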

Efficiency

As assumed above, if the data were generated by $f(\cdot\,;\theta_0),$ then under certain conditions, it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution. It is $\sqrt n$-consistent and asymptotically efficient, meaning that it reaches the Cramér–Rao bound. Specifically,[18]

$\sqrt n\left(\widehat\theta_{\text{mle}} - \theta_0\right) \xrightarrow{d} \mathcal N\left(0,\ \mathcal I^{-1}\right),$

where $\mathcal I$ is the Fisher information matrix:

$\mathcal I_{jk} = \operatorname{\mathbb E}\left[-\frac{\partial^2\ln f_{\theta_0}(X_t)}{\partial\theta_j\,\partial\theta_k}\right].$

In particular, it means that the bias of the maximum likelihood estimator is equal to zero up to the order $1/\sqrt n$.

Second-order efficiency after correction for bias

However, when we consider the higher-order terms in the expansion of the distribution of this estimator, it turns out that $\widehat\theta_{\mathrm{mle}}$ has a bias of order 1/n. This bias is equal to (componentwise)[20]

$b_h \equiv \operatorname{\mathbb E}\Bigl[\bigl(\widehat\theta_{\mathrm{mle}} - \theta_0\bigr)_h\Bigr] = \frac1n \sum_{i,j,k=1}^m \mathcal I^{hi}\,\mathcal I^{jk}\left(\tfrac12 K_{ijk} + J_{j,ik}\right),$

where $\mathcal I^{jk}$ (with superscripts) denotes the (j,k)-th component of the inverse Fisher information matrix $\mathcal I^{-1}$, and

$\tfrac12 K_{ijk} + J_{j,ik} = \operatorname{\mathbb E}\left[\frac12\frac{\partial^3\ln f_{\theta_0}(X_t)}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k} + \frac{\partial\ln f_{\theta_0}(X_t)}{\partial\theta_j}\,\frac{\partial^2\ln f_{\theta_0}(X_t)}{\partial\theta_i\,\partial\theta_k}\right].$

Using these formulae it is possible to estimate the second-order bias of the maximum likelihood estimator, and correct for that bias by subtracting it:

$\widehat\theta^{\,*}_{\text{mle}} = \widehat\theta_{\text{mle}} - \widehat b.$

This estimator is unbiased up to the terms of order 1/n, and is called the bias-corrected maximum likelihood estimator.

This bias-corrected estimator is second-order efficient (at least within the curved exponential family), meaning that it has minimal mean squared error among all second-order bias-corrected estimators, up to the terms of the order 1/n². It is possible to continue this process, that is to derive the third-order bias-correction term, and so on. However, the maximum likelihood estimator is not third-order efficient.[21]

Relation to Bayesian inference

A maximum likelihood estimator coincides with the most probable Bayesian estimator given a uniform prior distribution on the parameters. Indeed, the maximum a posteriori estimate is the parameter θ that maximizes the probability of θ given the data, given by Bayes' theorem:

$\operatorname{\mathbb P}(\theta\mid x_1, x_2, \ldots, x_n) = \frac{f(x_1, x_2, \ldots, x_n\mid\theta)\,\operatorname{\mathbb P}(\theta)}{\operatorname{\mathbb P}(x_1, x_2, \ldots, x_n)},$

where $\operatorname{\mathbb P}(\theta)$ is the prior distribution for the parameter θ and where $\operatorname{\mathbb P}(x_1, x_2, \ldots, x_n)$ is the probability of the data averaged over all parameters. Since the denominator is independent of θ, the Bayesian estimator is obtained by maximizing $f(x_1, x_2, \ldots, x_n\mid\theta)\,\operatorname{\mathbb P}(\theta)$ with respect to θ. If we further assume that the prior $\operatorname{\mathbb P}(\theta)$ is a uniform distribution, the Bayesian estimator is obtained by maximizing the likelihood function $f(x_1, x_2, \ldots, x_n\mid\theta)$. Thus the Bayesian estimator coincides with the maximum likelihood estimator for a uniform prior distribution $\operatorname{\mathbb P}(\theta)$.
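
A small numerical sketch of this equivalence (assuming NumPy and SciPy; the binomial example is illustrative): on a grid of parameter values, the unnormalized posterior under a uniform prior and the likelihood are maximized at the same point.

import numpy as np
from scipy.stats import binom

theta = np.linspace(0.001, 0.999, 999)      # grid over the parameter space
likelihood = binom.pmf(49, n=80, p=theta)   # likelihood of 49 heads in 80 tosses
prior = np.ones_like(theta)                 # uniform prior
posterior = likelihood * prior              # unnormalized posterior

print(theta[np.argmax(posterior)])   # maximum a posteriori estimate
print(theta[np.argmax(likelihood)])  # maximum likelihood estimate (identical under the uniform prior)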

Application of maximum-likelihood estimation in Bayes decision theory

In many practical applications in machine learning, maximum-likelihood estimation is used as the model for parameter estimation.

Bayes decision theory is about designing a classifier that minimizes total expected risk; in particular, when the costs (the loss function) associated with different decisions are equal, the classifier minimizes the error over the whole distribution.[22]

Thus, the Bayes Decision Rule is stated as

"decide $w_1$ if $\operatorname{\mathbb P}(w_1\mid x) > \operatorname{\mathbb P}(w_2\mid x)$; otherwise decide $w_2$,"

where $w_1, w_2$ are predictions of different classes. From a perspective of minimizing error, it can also be stated as

$w = \underset{w}{\operatorname{arg\,min}} \int_{-\infty}^{\infty} \operatorname{\mathbb P}(\text{error}\mid x)\,\operatorname{\mathbb P}(x)\,\operatorname dx,$

where

$\operatorname{\mathbb P}(\text{error}\mid x) = \operatorname{\mathbb P}(w_1\mid x)$

if we decide $w_2$, and $\operatorname{\mathbb P}(\text{error}\mid x) = \operatorname{\mathbb P}(w_2\mid x)$ if we decide $w_1$.

By applying Bayes' theorem

$\operatorname{\mathbb P}(w_i\mid x) = \frac{\operatorname{\mathbb P}(x\mid w_i)\,\operatorname{\mathbb P}(w_i)}{\operatorname{\mathbb P}(x)},$

and if we further assume the zero-or-one loss function, which assigns the same loss to all errors, the Bayes Decision rule can be reformulated as:

$h_{\text{Bayes}} = \underset{w}{\operatorname{arg\,max}}\,\bigl[\operatorname{\mathbb P}(x\mid w)\,\operatorname{\mathbb P}(w)\bigr],$

where $h_{\text{Bayes}}$ is the prediction and $\operatorname{\mathbb P}(w)$ is the prior probability.

Relation to minimizing Kullback–Leibler divergence and cross entropy

Finding $\hat\theta$ that maximizes the likelihood is asymptotically equivalent to finding the $\hat\theta$ that defines a probability distribution ($Q_{\hat\theta}$) that has a minimal distance, in terms of Kullback–Leibler divergence, to the real probability distribution from which our data were generated (i.e., generated by $P_{\theta_0}$).[23] In an ideal world, P and Q are the same (and the only thing unknown is $\theta$ that defines P), but even if they are not and the model we use is misspecified, still the MLE will give us the "closest" distribution (within the restriction of a model Q that depends on $\hat\theta$) to the real distribution $P_{\theta_0}$.[24]
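
A minimal sketch of this connection (assuming NumPy and SciPy; the normal location model is illustrative): the average negative log-likelihood is an empirical cross-entropy between the data and the model, and its minimizer over a parameter grid coincides with the maximum likelihood estimate.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=1.0, size=5000)   # data generated by P(theta_0)

mu_grid = np.linspace(-1.0, 3.0, 401)
# Empirical cross-entropy between the data distribution and the model Q(mu):
# the average negative log-likelihood.
cross_entropy = np.array([-norm.logpdf(x, loc=mu, scale=1.0).mean() for mu in mu_grid])

mu_mle = mu_grid[np.argmin(cross_entropy)]
print(mu_mle, x.mean())  # both close to 1.0: minimizing cross-entropy = maximizing likelihood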

Examples

Discrete uniform distribution

Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution); thus, the sample size is 1. If n is unknown, then the maximum likelihood estimator $\widehat n$ of n is the number m on the drawn ticket. (The likelihood is 0 for n < m, 1/n for n ≥ m, and this is greatest when n = m. Note that the maximum likelihood estimate of n occurs at the lower extreme of possible values {m, m + 1, ...}, rather than somewhere in the "middle" of the range of possible values, which would result in less bias.) The expected value of the number m on the drawn ticket, and therefore the expected value of $\widehat n$, is (n + 1)/2. As a result, with a sample size of 1, the maximum likelihood estimator for n will systematically underestimate n by (n − 1)/2.

Discrete distribution, finite parameter space

Suppose one wishes to determine just how biased an unfair coin is. Call the probability of tossing a 'head' p. The goal then becomes to determine p.

Suppose the coin is tossed 80 times: i.e. the sample might be something like x1 = H, x2 = T, ..., x80 = T, and the count of the number of heads "H" is observed.

The probability of tossing tails is 1 − p (so here p is θ above). Suppose the outcome is 49 heads and 31 tails, and suppose the coin was taken from a box containing three coins: one which gives heads with probability p = 1/3, one which gives heads with probability p = 1/2 and another which gives heads with probability p = 2/3. The coins have lost their labels, so which one it was is unknown. Using maximum likelihood estimation, the coin that has the largest likelihood can be found, given the data that were observed. By using the probability mass function of the binomial distribution with sample size equal to 80, number of successes equal to 49 but for different values of p (the "probability of success"), the likelihood function (defined below) takes one of three values:

$\operatorname{\mathbb P}\bigl[\mathrm H = 49 \mid p = \tfrac13\bigr] = \binom{80}{49}\bigl(\tfrac13\bigr)^{49}\bigl(1 - \tfrac13\bigr)^{31} \approx 0.000,$
$\operatorname{\mathbb P}\bigl[\mathrm H = 49 \mid p = \tfrac12\bigr] = \binom{80}{49}\bigl(\tfrac12\bigr)^{49}\bigl(1 - \tfrac12\bigr)^{31} \approx 0.012,$
$\operatorname{\mathbb P}\bigl[\mathrm H = 49 \mid p = \tfrac23\bigr] = \binom{80}{49}\bigl(\tfrac23\bigr)^{49}\bigl(1 - \tfrac23\bigr)^{31} \approx 0.054.$

The likelihood is maximized when p = 2/3, and so this is the maximum likelihood estimate for p.
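
These three values can be checked with a short computation (a sketch assuming SciPy is available):

from scipy.stats import binom

# Likelihood of 49 heads in 80 tosses under each candidate coin.
for p in (1/3, 1/2, 2/3):
    print(p, binom.pmf(49, n=80, p=p))
# p = 2/3 gives the largest value (about 0.054), matching the calculation above.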

Discrete distribution, continuous parameter space

Now suppose that there was only one coin but its p could have been any value 0 ≤ p ≤ 1. The likelihood function to be maximised is

$L(p) = f_D(\mathrm H = 49\mid p) = \binom{80}{49}\, p^{49}(1 - p)^{31},$

and the maximisation is over all possible values 0 ≤ p ≤ 1 .

 
Likelihood function for proportion value of a binomial process (n = 10)

One way to maximize this function is by differentiating with respect to p and setting to zero:

$0 = \frac{\partial}{\partial p}\left[\binom{80}{49} p^{49}(1 - p)^{31}\right] = 49 p^{48}(1 - p)^{31} - 31 p^{49}(1 - p)^{30} = p^{48}(1 - p)^{30}\bigl[49(1 - p) - 31 p\bigr] = p^{48}(1 - p)^{30}(49 - 80 p).$

This is a product of three terms. The first term is 0 when p = 0. The second is 0 when p = 1. The third is zero when p = 49/80. The solution that maximizes the likelihood is clearly p = 49/80 (since p = 0 and p = 1 result in a likelihood of 0). Thus the maximum likelihood estimator for p is 49/80.

This result is easily generalized by substituting a letter such as s in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields s/n, which is the maximum likelihood estimator for any sequence of n Bernoulli trials resulting in s 'successes'.
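
The closed-form result can also be verified numerically (a sketch assuming NumPy and SciPy):

import numpy as np
from scipy.optimize import minimize_scalar

s, n = 49, 80  # observed successes and number of trials

def neg_log_likelihood(p):
    # Binomial log-likelihood up to an additive constant that does not depend on p.
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, s / n)  # the numerical maximizer agrees with the closed form s/n = 0.6125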

Continuous distribution, continuous parameter space

For the normal distribution $\mathcal N(\mu, \sigma^2)$, which has probability density function

$f(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

$f(x_1, \ldots, x_n\mid\mu,\sigma^2) = \prod_{i=1}^n f(x_i\mid\mu,\sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}\right).$

This family of distributions has two parameters: θ = (μ, σ); so we maximize the likelihood, $\mathcal L(\mu,\sigma^2) = f(x_1, \ldots, x_n\mid\mu,\sigma^2)$, over both parameters simultaneously, or if possible, individually.

Since the logarithm function itself is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm (the log-likelihood itself is not necessarily strictly increasing). The log-likelihood can be written as follows:

$\log\bigl(\mathcal L(\mu,\sigma^2)\bigr) = -\frac n2 \log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$

(Note: the log-likelihood is closely related to information entropy and Fisher information.)

We now compute the derivatives of this log-likelihood as follows.

$0 = \frac{\partial}{\partial\mu}\log\bigl(\mathcal L(\mu,\sigma^2)\bigr) = \frac{2n(\bar x - \mu)}{2\sigma^2},$

where $\bar x$ is the sample mean. This is solved by

$\widehat\mu = \bar x = \sum_{i=1}^n \frac{x_i}{n}.$

This is indeed the maximum of the function, since it is the only turning point in μ and the second derivative is strictly less than zero. Its expected value is equal to the parameter μ of the given distribution,

$\operatorname{\mathbb E}\bigl[\widehat\mu\bigr] = \mu,$

which means that the maximum likelihood estimator $\widehat\mu$ is unbiased.

Similarly we differentiate the log-likelihood with respect to σ and equate to zero:

$0 = \frac{\partial}{\partial\sigma}\log\bigl(\mathcal L(\mu,\sigma^2)\bigr) = -\frac n\sigma + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \mu)^2,$

which is solved by

$\widehat\sigma^2 = \frac1n \sum_{i=1}^n (x_i - \mu)^2.$

Inserting the estimate $\mu = \widehat\mu$ we obtain

$\widehat\sigma^2 = \frac1n \sum_{i=1}^n (x_i - \bar x)^2 = \frac1n \sum_{i=1}^n x_i^2 - \frac1{n^2}\sum_{i=1}^n \sum_{j=1}^n x_i x_j.$

To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables (statistical error) $\delta_i \equiv \mu - x_i$. Expressing the estimate in these variables yields

$\widehat\sigma^2 = \frac1n \sum_{i=1}^n (\mu - \delta_i)^2 - \frac1{n^2}\sum_{i=1}^n \sum_{j=1}^n (\mu - \delta_i)(\mu - \delta_j).$

Simplifying the expression above, utilizing the facts that $\operatorname{\mathbb E}[\delta_i] = 0$ and $\operatorname{\mathbb E}[\delta_i^2] = \sigma^2$, allows us to obtain

$\operatorname{\mathbb E}\bigl[\widehat\sigma^2\bigr] = \frac{n-1}{n}\sigma^2.$

This means that the estimator $\widehat\sigma^2$ is biased for $\sigma^2$. It can also be shown that $\widehat\sigma$ is biased for $\sigma$, but that both $\widehat\sigma^2$ and $\widehat\sigma$ are consistent.

Formally we say that the maximum likelihood estimator for $\theta = (\mu, \sigma^2)$ is

$\widehat\theta = \bigl(\widehat\mu, \widehat\sigma^2\bigr).$

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.

The normal log-likelihood at its maximum takes a particularly simple form:

$\log\bigl(\mathcal L(\widehat\mu, \widehat\sigma)\bigr) = -\frac n2 \bigl(\log(2\pi\widehat\sigma^2) + 1\bigr).$

This maximum log-likelihood can be shown to be the same for more general least squares, even for non-linear least squares. This is often used in determining likelihood-based approximate confidence intervals and confidence regions, which are generally more accurate than those using the asymptotic normality discussed above.
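
The closed-form estimates above translate directly into code; a minimal sketch assuming NumPy (the simulated data are illustrative):

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=1000)

mu_hat = x.mean()                        # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of sigma^2 (biased, 1/n normalization)
sigma2_unbiased = x.var(ddof=1)          # bias-corrected version (1/(n - 1))

# Maximized log-likelihood, matching the simple form given above.
max_loglik = -0.5 * len(x) * (np.log(2 * np.pi * sigma2_hat) + 1)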

Non-independent variables

It may be the case that variables are correlated, that is, not independent. Two random variables $y_1$ and $y_2$ are independent only if their joint probability density function is the product of the individual probability density functions, i.e.

$f(y_1, y_2) = f(y_1)\, f(y_2).$

Suppose one constructs an order-n Gaussian vector out of random variables $(y_1, \ldots, y_n)$, where each variable has means given by $(\mu_1, \ldots, \mu_n)$. Furthermore, let the covariance matrix be denoted by $\mathit\Sigma$. The joint probability density function of these n random variables then follows a multivariate normal distribution given by:

$f(y_1, \ldots, y_n) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\mathit\Sigma)}}\exp\left(-\frac12 \left[y_1 - \mu_1, \ldots, y_n - \mu_n\right]\mathit\Sigma^{-1}\left[y_1 - \mu_1, \ldots, y_n - \mu_n\right]^{\mathrm T}\right).$

In the bivariate case, the joint probability density function is given by:

$f(y_1, y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\exp\left[-\frac{1}{2(1 - \rho^2)}\left(\frac{(y_1 - \mu_1)^2}{\sigma_1^2} - \frac{2\rho(y_1 - \mu_1)(y_2 - \mu_2)}{\sigma_1\sigma_2} + \frac{(y_2 - \mu_2)^2}{\sigma_2^2}\right)\right].$

In this and other cases where a joint density function exists, the likelihood function is defined as above, in the section "principles," using this density.

Example

$X_1,\ X_2, \ldots,\ X_m$ are counts in cells / boxes 1 up to m; each box has a different probability (think of the boxes being bigger or smaller) and we fix the number of balls that fall to be $n$: $x_1 + x_2 + \cdots + x_m = n$. The probability of each box is $p_i$, with a constraint: $p_1 + p_2 + \cdots + p_m = 1$. This is a case in which the $X_i$'s are not independent; the joint probability of a vector $x_1,\ x_2, \ldots, x_m$ is called the multinomial and has the form:

$f(x_1, x_2, \ldots, x_m\mid p_1, p_2, \ldots, p_m) = \frac{n!}{\prod_i x_i!}\prod_i p_i^{x_i} = \binom{n}{x_1, x_2, \ldots, x_m}\, p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m}.$

Each box taken separately against all the other boxes is a binomial and this is an extension thereof.

The log-likelihood of this is:

$\ell(p_1, p_2, \ldots, p_m) = \log n! - \sum_{i=1}^m \log x_i! + \sum_{i=1}^m x_i \log p_i.$

The constraint has to be taken into account by using Lagrange multipliers:

$\ell'(p_1, p_2, \ldots, p_m, \lambda) = \ell(p_1, p_2, \ldots, p_m) + \lambda\left(1 - \sum_{i=1}^m p_i\right).$

Setting all the derivatives to 0, the most natural estimate is derived:

$\hat p_i = \frac{x_i}{n}.$

Maximizing the log-likelihood, with and without constraints, may have no closed-form solution; in that case, iterative procedures have to be used.
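
A minimal sketch of the constrained estimate (assuming NumPy and SciPy; the counts are illustrative):

import numpy as np
from scipy.stats import multinomial

counts = np.array([12, 30, 58])      # observed x_i, with n = counts.sum()
p_hat = counts / counts.sum()        # constrained MLE: p_i = x_i / n

# Any other probability vector gives a lower likelihood.
p_other = np.array([0.2, 0.3, 0.5])
print(multinomial.logpmf(counts, n=counts.sum(), p=p_hat))
print(multinomial.logpmf(counts, n=counts.sum(), p=p_other))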

Iterative procedures

Except for special cases, the likelihood equations

$\frac{\partial\ell(\theta;\mathbf y)}{\partial\theta} = 0$

cannot be solved explicitly for an estimator $\widehat\theta = \widehat\theta(\mathbf y)$. Instead, they need to be solved iteratively: starting from an initial guess of $\theta$ (say $\widehat\theta_1$), one seeks to obtain a convergent sequence $\bigl\{\widehat\theta_r\bigr\}$. Many methods for this kind of optimization problem are available,[26][27] but the most commonly used ones are algorithms based on an updating formula of the form

$\widehat\theta_{r+1} = \widehat\theta_r + \eta_r\, \mathbf d_r\bigl(\widehat\theta\bigr),$

where the vector $\mathbf d_r\bigl(\widehat\theta\bigr)$ indicates the descent direction of the rth "step," and the scalar $\eta_r$ captures the "step length,"[28][29] also known as the learning rate.[30]

Gradient descent method

(Note: here it is a maximization problem, so the sign before the gradient is flipped.)

$\eta_r \in \mathbb R^+$ that is small enough for convergence and $\mathbf d_r\bigl(\widehat\theta\bigr) = \nabla\ell\bigl(\widehat\theta_r;\mathbf y\bigr).$

The gradient descent method requires calculating the gradient at the rth iteration, but does not require calculating the inverse of the second-order derivative, i.e., the Hessian matrix. Therefore, it is computationally faster than the Newton–Raphson method.
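
A minimal sketch of this update for the rate of a Poisson model, assuming NumPy (the data, starting value and step length are illustrative); the closed-form MLE, the sample mean, serves as a check.

import numpy as np

rng = np.random.default_rng(6)
y = rng.poisson(lam=4.0, size=1000)

theta = 1.0   # initial guess
eta = 1e-3    # step length (learning rate), chosen small enough for convergence
for _ in range(5000):
    gradient = np.sum(y / theta - 1.0)  # derivative of the Poisson log-likelihood
    theta = theta + eta * gradient      # ascend the log-likelihood (sign flipped)

print(theta, y.mean())  # the iteration converges to the closed-form MLE, the sample mean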

Newton–Raphson method

$\eta_r = 1$ and $\mathbf d_r\bigl(\widehat\theta\bigr) = -\mathbf H_r^{-1}\bigl(\widehat\theta\bigr)\, \mathbf s_r\bigl(\widehat\theta\bigr),$

where $\mathbf s_r(\widehat\theta)$ is the score and $\mathbf H_r^{-1}(\widehat\theta)$ is the inverse of the Hessian matrix of the log-likelihood function, both evaluated at the rth iteration.[31][32] But because the calculation of the Hessian matrix is computationally costly, numerous alternatives have been proposed. The popular Berndt–Hall–Hall–Hausman algorithm approximates the Hessian with the outer product of the expected gradient, such that

$\mathbf d_r\bigl(\widehat\theta\bigr) = \left[\sum_{t=1}^n \frac{\partial\ln f(y_t\mid\theta)}{\partial\theta}\,\frac{\partial\ln f(y_t\mid\theta)}{\partial\theta^{\mathsf T}}\right]^{-1} \mathbf s_r\bigl(\widehat\theta\bigr).$
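
A minimal sketch of the Newton–Raphson iteration for the coin example above (plain Python; the starting value is illustrative):

s, n = 49, 80  # observed successes and trials
p = 0.5        # initial guess

for _ in range(6):
    score = s / p - (n - s) / (1 - p)             # first derivative of the log-likelihood
    hessian = -s / p**2 - (n - s) / (1 - p) ** 2  # second derivative
    p = p - score / hessian                       # Newton-Raphson update

print(p, s / n)  # converges to the closed-form MLE s/n = 0.6125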

 

Quasi-Newton methods

Other quasi-Newton methods use more elaborate secant updates to give an approximation of the Hessian matrix.

Davidon–Fletcher–Powell formula

The DFP formula finds a solution that is symmetric, positive-definite and closest to the current approximate value of the second-order derivative:

$\mathbf H_{k+1} = \left(I - \gamma_k y_k s_k^{\mathsf T}\right)\mathbf H_k\left(I - \gamma_k s_k y_k^{\mathsf T}\right) + \gamma_k y_k y_k^{\mathsf T},$

where

$y_k = \nabla\ell(x_k + s_k) - \nabla\ell(x_k),$
$\gamma_k = \frac{1}{y_k^{\mathsf T} s_k},$
$s_k = x_{k+1} - x_k.$

Broyden–Fletcher–Goldfarb–Shanno algorithm

BFGS also gives a solution that is symmetric and positive-definite:

maximum, likelihood, estimation, this, article, about, statistical, techniques, computer, data, storage, partial, response, maximum, likelihood, statistics, maximum, likelihood, estimation, method, estimating, parameters, assumed, probability, distribution, gi. This article is about the statistical techniques For computer data storage see partial response maximum likelihood In statistics maximum likelihood estimation MLE is a method of estimating the parameters of an assumed probability distribution given some observed data This is achieved by maximizing a likelihood function so that under the assumed statistical model the observed data is most probable The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate 1 The logic of maximum likelihood is both intuitive and flexible and as such the method has become a dominant means of statistical inference 2 3 4 If the likelihood function is differentiable the derivative test for finding maxima can be applied In some cases the first order conditions of the likelihood function can be solved analytically for instance the ordinary least squares estimator for a linear regression model maximizes the likelihood when the random errors are assumed to have normal distributions with the same variance 5 From the perspective of Bayesian inference MLE is generally equivalent to maximum a posteriori MAP estimation with uniform prior distributions or a normal prior distribution with a standard deviation of infinity In frequentist inference MLE is a special case of an extremum estimator with the objective function being the likelihood Contents 1 Principles 1 1 Restricted parameter space 1 2 Nonparametric Maximum Likelihood Estimation 2 Properties 2 1 Consistency 2 2 Functional invariance 2 3 Efficiency 2 4 Second order efficiency after correction for bias 2 5 Relation to Bayesian inference 2 5 1 Application of maximum likelihood estimation in Bayes decision theory 2 6 Relation to minimizing Kullback Leibler divergence and cross entropy 3 Examples 3 1 Discrete uniform distribution 3 2 Discrete distribution finite parameter space 3 3 Discrete distribution continuous parameter space 3 4 Continuous distribution continuous parameter space 4 Non independent variables 4 1 Example 5 Iterative procedures 5 1 Gradient descent method 5 2 Newton Raphson method 5 3 Quasi Newton methods 5 3 1 Davidon Fletcher Powell formula 5 3 2 Broyden Fletcher Goldfarb Shanno algorithm 5 3 3 Fisher s scoring 6 History 7 See also 7 1 Related concepts 7 2 Other estimation methods 8 References 9 Further reading 10 External linksPrinciples editWe model a set of observations as a random sample from an unknown joint probability distribution which is expressed in terms of a set of parameters The goal of maximum likelihood estimation is to determine the parameters for which the observed data have the highest joint probability We write the parameters governing the joint distribution as a vector 8 8 1 8 2 8 k T displaystyle theta left theta 1 theta 2 ldots theta k right mathsf T nbsp so that this distribution falls within a parametric family f 8 8 8 displaystyle f cdot theta mid theta in Theta nbsp where 8 displaystyle Theta nbsp is called the parameter space a finite dimensional subset of Euclidean space Evaluating the joint density at the observed data sample y y 1 y 2 y n displaystyle mathbf y y 1 y 2 ldots y n nbsp gives a real valued function L n 8 L n 8 y f n y 8 displaystyle mathcal L n theta mathcal L n theta mathbf y f n mathbf y theta 
nbsp which is called the likelihood function For independent and identically distributed random variables f n y 8 displaystyle f n mathbf y theta nbsp will be the product of univariate density functions f n y 8 k 1 n f k u n i v a r y k 8 displaystyle f n mathbf y theta prod k 1 n f k mathsf univar y k theta nbsp The goal of maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function over the parameter space 6 that is 8 a r g m a x 8 8 L n 8 y displaystyle hat theta underset theta in Theta operatorname arg max mathcal L n theta mathbf y nbsp Intuitively this selects the parameter values that make the observed data most probable The specific value 8 8 n y 8 displaystyle hat theta hat theta n mathbf y in Theta nbsp that maximizes the likelihood function L n displaystyle mathcal L n nbsp is called the maximum likelihood estimate Further if the function 8 n R n 8 displaystyle hat theta n mathbb R n to Theta nbsp so defined is measurable then it is called the maximum likelihood estimator It is generally a function defined over the sample space i e taking a given sample as its argument A sufficient but not necessary condition for its existence is for the likelihood function to be continuous over a parameter space 8 displaystyle Theta nbsp that is compact 7 For an open 8 displaystyle Theta nbsp the likelihood function may increase without ever reaching a supremum value In practice it is often convenient to work with the natural logarithm of the likelihood function called the log likelihood ℓ 8 y ln L n 8 y displaystyle ell theta mathbf y ln mathcal L n theta mathbf y nbsp Since the logarithm is a monotonic function the maximum of ℓ 8 y displaystyle ell theta mathbf y nbsp occurs at the same value of 8 displaystyle theta nbsp as does the maximum of L n displaystyle mathcal L n nbsp 8 If ℓ 8 y displaystyle ell theta mathbf y nbsp is differentiable in 8 displaystyle Theta nbsp the necessary conditions for the occurrence of a maximum or a minimum are ℓ 8 1 0 ℓ 8 2 0 ℓ 8 k 0 displaystyle frac partial ell partial theta 1 0 quad frac partial ell partial theta 2 0 quad ldots quad frac partial ell partial theta k 0 nbsp known as the likelihood equations For some models these equations can be explicitly solved for 8 displaystyle widehat theta nbsp but in general no closed form solution to the maximization problem is known or available and an MLE can only be found via numerical optimization Another problem is that in finite samples there may exist multiple roots for the likelihood equations 9 Whether the identified root 8 displaystyle widehat theta nbsp of the likelihood equations is indeed a local maximum depends on whether the matrix of second order partial and cross partial derivatives the so called Hessian matrix H 8 2 ℓ 8 1 2 8 8 2 ℓ 8 1 8 2 8 8 2 ℓ 8 1 8 k 8 8 2 ℓ 8 2 8 1 8 8 2 ℓ 8 2 2 8 8 2 ℓ 8 2 8 k 8 8 2 ℓ 8 k 8 1 8 8 2 ℓ 8 k 8 2 8 8 2 ℓ 8 k 2 8 8 displaystyle mathbf H left widehat theta right begin bmatrix left frac partial 2 ell partial theta 1 2 right theta widehat theta amp left frac partial 2 ell partial theta 1 partial theta 2 right theta widehat theta amp dots amp left frac partial 2 ell partial theta 1 partial theta k right theta widehat theta left frac partial 2 ell partial theta 2 partial theta 1 right theta widehat theta amp left frac partial 2 ell partial theta 2 2 right theta widehat theta amp dots amp left frac partial 2 ell partial theta 2 partial theta k right theta widehat theta vdots amp vdots amp ddots amp vdots left frac 
partial 2 ell partial theta k partial theta 1 right theta widehat theta amp left frac partial 2 ell partial theta k partial theta 2 right theta widehat theta amp dots amp left frac partial 2 ell partial theta k 2 right theta widehat theta end bmatrix nbsp is negative semi definite at 8 displaystyle widehat theta nbsp as this indicates local concavity Conveniently most common probability distributions in particular the exponential family are logarithmically concave 10 11 Restricted parameter space edit Not to be confused with restricted maximum likelihood While the domain of the likelihood function the parameter space is generally a finite dimensional subset of Euclidean space additional restrictions sometimes need to be incorporated into the estimation process The parameter space can be expressed as 8 8 8 R k h 8 0 displaystyle Theta left theta theta in mathbb R k h theta 0 right nbsp where h 8 h 1 8 h 2 8 h r 8 displaystyle h theta left h 1 theta h 2 theta ldots h r theta right nbsp is a vector valued function mapping R k displaystyle mathbb R k nbsp into R r displaystyle mathbb R r nbsp Estimating the true parameter 8 displaystyle theta nbsp belonging to 8 displaystyle Theta nbsp then as a practical matter means to find the maximum of the likelihood function subject to the constraint h 8 0 displaystyle h theta 0 nbsp Theoretically the most natural approach to this constrained optimization problem is the method of substitution that is filling out the restrictions h 1 h 2 h r displaystyle h 1 h 2 ldots h r nbsp to a set h 1 h 2 h r h r 1 h k displaystyle h 1 h 2 ldots h r h r 1 ldots h k nbsp in such a way that h h 1 h 2 h k displaystyle h ast left h 1 h 2 ldots h k right nbsp is a one to one function from R k displaystyle mathbb R k nbsp to itself and reparameterize the likelihood function by setting ϕ i h i 8 1 8 2 8 k displaystyle phi i h i theta 1 theta 2 ldots theta k nbsp 12 Because of the equivariance of the maximum likelihood estimator the properties of the MLE apply to the restricted estimates also 13 For instance in a multivariate normal distribution the covariance matrix S displaystyle Sigma nbsp must be positive definite this restriction can be imposed by replacing S G T G displaystyle Sigma Gamma mathsf T Gamma nbsp where G displaystyle Gamma nbsp is a real upper triangular matrix and G T displaystyle Gamma mathsf T nbsp is its transpose 14 In practice restrictions are usually imposed using the method of Lagrange which given the constraints as defined above leads to the restricted likelihood equations ℓ 8 h 8 T 8 l 0 displaystyle frac partial ell partial theta frac partial h theta mathsf T partial theta lambda 0 nbsp and h 8 0 displaystyle h theta 0 nbsp where l l 1 l 2 l r T displaystyle lambda left lambda 1 lambda 2 ldots lambda r right mathsf T nbsp is a column vector of Lagrange multipliers and h 8 T 8 displaystyle frac partial h theta mathsf T partial theta nbsp is the k r Jacobian matrix of partial derivatives 12 Naturally if the constraints are not binding at the maximum the Lagrange multipliers should be zero 15 This in turn allows for a statistical test of the validity of the constraint known as the Lagrange multiplier test Nonparametric Maximum Likelihood Estimation edit Nonparametric Maximum likelihood estimation can be performed using the empirical likelihood Properties editA maximum likelihood estimator is an extremum estimator obtained by maximizing as a function of 8 the objective function ℓ 8 x displaystyle widehat ell theta x nbsp If the data are independent 
and identically distributed then we have ℓ 8 x 1 n i 1 n ln f x i 8 displaystyle widehat ell theta x frac 1 n sum i 1 n ln f x i mid theta nbsp this being the sample analogue of the expected log likelihood ℓ 8 E ln f x i 8 displaystyle ell theta operatorname mathbb E ln f x i mid theta nbsp where this expectation is taken with respect to the true density Maximum likelihood estimators have no optimum properties for finite samples in the sense that when evaluated on finite samples other estimators may have greater concentration around the true parameter value 16 However like other estimation methods maximum likelihood estimation possesses a number of attractive limiting properties As the sample size increases to infinity sequences of maximum likelihood estimators have these properties Consistency the sequence of MLEs converges in probability to the value being estimated Invariance If 8 displaystyle hat theta nbsp is the maximum likelihood estimator for 8 displaystyle theta nbsp and if g 8 displaystyle g theta nbsp is any transformation of 8 displaystyle theta nbsp then the maximum likelihood estimator for a g 8 displaystyle alpha g theta nbsp is a g 8 displaystyle hat alpha g hat theta nbsp This property is less commonly known as functional equivariance The invariance property holds for arbitrary transformation g displaystyle g nbsp although the proof simplifies if g displaystyle g nbsp is restricted to one to one transformations Efficiency i e it achieves the Cramer Rao lower bound when the sample size tends to infinity This means that no consistent estimator has lower asymptotic mean squared error than the MLE or other estimators attaining this bound which also means that MLE has asymptotic normality Second order efficiency after correction for bias Consistency edit Under the conditions outlined below the maximum likelihood estimator is consistent The consistency means that if the data were generated by f 8 0 displaystyle f cdot theta 0 nbsp and we have a sufficiently large number of observations n then it is possible to find the value of 80 with arbitrary precision In mathematical terms this means that as n goes to infinity the estimator 8 displaystyle widehat theta nbsp converges in probability to its true value 8 m l e p 8 0 displaystyle widehat theta mathrm mle xrightarrow text p theta 0 nbsp Under slightly stronger conditions the estimator converges almost surely or strongly 8 m l e a s 8 0 displaystyle widehat theta mathrm mle xrightarrow text a s theta 0 nbsp In practical applications data is never generated by f 8 0 displaystyle f cdot theta 0 nbsp Rather f 8 0 displaystyle f cdot theta 0 nbsp is a model often in idealized form of the process generated by the data It is a common aphorism in statistics that all models are wrong Thus true consistency does not occur in practical applications Nevertheless consistency is often considered to be a desirable property for an estimator to have To establish consistency the following conditions are sufficient 17 Identification of the model 8 8 0 f 8 f 8 0 displaystyle theta neq theta 0 quad Leftrightarrow quad f cdot mid theta neq f cdot mid theta 0 nbsp In other words different parameter values 8 correspond to different distributions within the model If this condition did not hold there would be some value 81 such that 80 and 81 generate an identical distribution of the observable data Then we would not be able to distinguish between these two parameters even with an infinite amount of data these parameters would have been observationally 
equivalent The identification condition is absolutely necessary for the ML estimator to be consistent When this condition holds the limiting likelihood function ℓ 8 has unique global maximum at 80 Compactness the parameter space 8 of the model is compact nbsp The identification condition establishes that the log likelihood has a unique global maximum Compactness implies that the likelihood cannot approach the maximum value arbitrarily close at some other point as demonstrated for example in the picture on the right Compactness is only a sufficient condition and not a necessary condition Compactness can be replaced by some other conditions such as both concavity of the log likelihood function and compactness of some nonempty upper level sets of the log likelihood function orexistence of a compact neighborhood N of 8 0 such that outside of N the log likelihood function is less than the maximum by at least some e gt 0 Continuity the function ln f x 8 is continuous in 8 for almost all values of x P ln f x 8 C 0 8 1 displaystyle operatorname mathbb P Bigl ln f x mid theta in C 0 Theta Bigr 1 nbsp The continuity here can be replaced with a slightly weaker condition of upper semi continuity Dominance there exists D x integrable with respect to the distribution f x 80 such that ln f x 8 lt D x for all 8 8 displaystyle Bigl ln f x mid theta Bigr lt D x quad text for all theta in Theta nbsp By the uniform law of large numbers the dominance condition together with continuity establish the uniform convergence in probability of the log likelihood sup 8 8 ℓ 8 x ℓ 8 p 0 displaystyle sup theta in Theta left widehat ell theta mid x ell theta right xrightarrow text p 0 nbsp The dominance condition can be employed in the case of i i d observations In the non i i d case the uniform convergence in probability can be checked by showing that the sequence ℓ 8 x displaystyle widehat ell theta mid x nbsp is stochastically equicontinuous If one wants to demonstrate that the ML estimator 8 displaystyle widehat theta nbsp converges to 80 almost surely then a stronger condition of uniform convergence almost surely has to be imposed sup 8 8 ℓ 8 x ℓ 8 a s 0 displaystyle sup theta in Theta left widehat ell theta mid x ell theta right xrightarrow text a s 0 nbsp Additionally if as assumed above the data were generated by f 8 0 displaystyle f cdot theta 0 nbsp then under certain conditions it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution Specifically 18 n 8 m l e 8 0 d N 0 I 1 displaystyle sqrt n left widehat theta mathrm mle theta 0 right xrightarrow d mathcal N left 0 I 1 right nbsp where I is the Fisher information matrix Functional invariance edit The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability or probability density in the continuous case If the parameter consists of a number of components then we define their separate maximum likelihood estimators as the corresponding component of the MLE of the complete parameter Consistent with this if 8 displaystyle widehat theta nbsp is the MLE for 8 displaystyle theta nbsp and if g 8 displaystyle g theta nbsp is any transformation of 8 displaystyle theta nbsp then the MLE for a g 8 displaystyle alpha g theta nbsp is by definition 19 a g 8 displaystyle widehat alpha g widehat theta nbsp It maximizes the so called profile likelihood L a sup 8 a g 8 L 8 displaystyle bar L alpha sup theta alpha g theta L theta nbsp The MLE is also equivariant with 
respect to certain transformations of the data If y g x displaystyle y g x nbsp where g displaystyle g nbsp is one to one and does not depend on the parameters to be estimated then the density functions satisfy f Y y f X x g x displaystyle f Y y frac f X x g x nbsp and hence the likelihood functions for X displaystyle X nbsp and Y displaystyle Y nbsp differ only by a factor that does not depend on the model parameters For example the MLE parameters of the log normal distribution are the same as those of the normal distribution fitted to the logarithm of the data Efficiency edit As assumed above if the data were generated by f 8 0 displaystyle f cdot theta 0 nbsp then under certain conditions it can also be shown that the maximum likelihood estimator converges in distribution to a normal distribution It is n consistent and asymptotically efficient meaning that it reaches the Cramer Rao bound Specifically 18 n 8 mle 8 0 d N 0 I 1 displaystyle sqrt n left widehat theta text mle theta 0 right xrightarrow d mathcal N left 0 mathcal I 1 right nbsp where I displaystyle mathcal I nbsp is the Fisher information matrix I j k E 2 ln f 8 0 X t 8 j 8 k displaystyle mathcal I jk operatorname mathbb E biggl frac partial 2 ln f theta 0 X t partial theta j partial theta k biggr nbsp In particular it means that the bias of the maximum likelihood estimator is equal to zero up to the order 1 n Second order efficiency after correction for bias edit However when we consider the higher order terms in the expansion of the distribution of this estimator it turns out that 8mle has bias of order 1 n This bias is equal to componentwise 20 b h E 8 m l e 8 0 h 1 n i j k 1 m I h i I j k 1 2 K i j k J j i k displaystyle b h equiv operatorname mathbb E biggl left widehat theta mathrm mle theta 0 right h biggr frac 1 n sum i j k 1 m mathcal I hi mathcal I jk left frac 1 2 K ijk J j ik right nbsp where I j k displaystyle mathcal I jk nbsp with superscripts denotes the j k th component of the inverse Fisher information matrix I 1 displaystyle mathcal I 1 nbsp and 1 2 K i j k J j i k E 1 2 3 ln f 8 0 X t 8 i 8 j 8 k ln f 8 0 X t 8 j 2 ln f 8 0 X t 8 i 8 k displaystyle frac 1 2 K ijk J j ik operatorname mathbb E biggl frac 1 2 frac partial 3 ln f theta 0 X t partial theta i partial theta j partial theta k frac partial ln f theta 0 X t partial theta j frac partial 2 ln f theta 0 X t partial theta i partial theta k biggr nbsp Using these formulae it is possible to estimate the second order bias of the maximum likelihood estimator and correct for that bias by subtracting it 8 mle 8 mle b displaystyle widehat theta text mle widehat theta text mle widehat b nbsp This estimator is unbiased up to the terms of order 1 n and is called the bias corrected maximum likelihood estimator This bias corrected estimator is second order efficient at least within the curved exponential family meaning that it has minimal mean squared error among all second order bias corrected estimators up to the terms of the order 1 n 2 It is possible to continue this process that is to derive the third order bias correction term and so on However the maximum likelihood estimator is not third order efficient 21 Relation to Bayesian inference edit A maximum likelihood estimator coincides with the most probable Bayesian estimator given a uniform prior distribution on the parameters Indeed the maximum a posteriori estimate is the parameter 8 that maximizes the probability of 8 given the data given by Bayes theorem P 8 x 1 x 2 x n f x 1 x 2 x n 8 P 8 P x 1 x 2 x n 
displaystyle operatorname mathbb P theta mid x 1 x 2 ldots x n frac f x 1 x 2 ldots x n mid theta operatorname mathbb P theta operatorname mathbb P x 1 x 2 ldots x n nbsp where P 8 displaystyle operatorname mathbb P theta nbsp is the prior distribution for the parameter 8 and where P x 1 x 2 x n displaystyle operatorname mathbb P x 1 x 2 ldots x n nbsp is the probability of the data averaged over all parameters Since the denominator is independent of 8 the Bayesian estimator is obtained by maximizing f x 1 x 2 x n 8 P 8 displaystyle f x 1 x 2 ldots x n mid theta operatorname mathbb P theta nbsp with respect to 8 If we further assume that the prior P 8 displaystyle operatorname mathbb P theta nbsp is a uniform distribution the Bayesian estimator is obtained by maximizing the likelihood function f x 1 x 2 x n 8 displaystyle f x 1 x 2 ldots x n mid theta nbsp Thus the Bayesian estimator coincides with the maximum likelihood estimator for a uniform prior distribution P 8 displaystyle operatorname mathbb P theta nbsp Application of maximum likelihood estimation in Bayes decision theory edit In many practical applications in machine learning maximum likelihood estimation is used as the model for parameter estimation The Bayesian Decision theory is about designing a classifier that minimizes total expected risk especially when the costs the loss function associated with different decisions are equal the classifier is minimizing the error over the whole distribution 22 Thus the Bayes Decision Rule is stated as decide w 1 displaystyle w 1 nbsp if P w 1 x gt P w 2 x displaystyle operatorname mathbb P w 1 x gt operatorname mathbb P w 2 x nbsp otherwise decide w 2 displaystyle w 2 nbsp where w 1 w 2 displaystyle w 1 w 2 nbsp are predictions of different classes From a perspective of minimizing error it can also be stated as w a r g m a x w P error x P x d x displaystyle w underset w operatorname arg max int infty infty operatorname mathbb P text error mid x operatorname mathbb P x operatorname d x nbsp where P error x P w 1 x displaystyle operatorname mathbb P text error mid x operatorname mathbb P w 1 mid x nbsp if we decide w 2 displaystyle w 2 nbsp and P error x P w 2 x displaystyle operatorname mathbb P text error mid x operatorname mathbb P w 2 mid x nbsp if we decide w 1 displaystyle w 1 nbsp By applying Bayes theorem P w i x P x w i P w i P x displaystyle operatorname mathbb P w i mid x frac operatorname mathbb P x mid w i operatorname mathbb P w i operatorname mathbb P x nbsp and if we further assume the zero or one loss function which is a same loss for all errors the Bayes Decision rule can be reformulated as h Bayes a r g m a x w P x w P w displaystyle h text Bayes underset w operatorname arg max bigl operatorname mathbb P x mid w operatorname mathbb P w bigr nbsp where h Bayes displaystyle h text Bayes nbsp is the prediction and P w displaystyle operatorname mathbb P w nbsp is the prior probability Relation to minimizing Kullback Leibler divergence and cross entropy edit Finding 8 displaystyle hat theta nbsp that maximizes the likelihood is asymptotically equivalent to finding the 8 displaystyle hat theta nbsp that defines a probability distribution Q 8 displaystyle Q hat theta nbsp that has a minimal distance in terms of Kullback Leibler divergence to the real probability distribution from which our data were generated i e generated by P 8 0 displaystyle P theta 0 nbsp 23 In an ideal world P and Q are the same and the only thing unknown is 8 displaystyle theta nbsp that defines P but even 
if they are not and the model we use is misspecified still the MLE will give us the closest distribution within the restriction of a model Q that depends on 8 displaystyle hat theta nbsp to the real distribution P 8 0 displaystyle P theta 0 nbsp 24 Proof For simplicity of notation let s assume that P Q Let there be n i i d data samples y y 1 y 2 y n displaystyle mathbf y y 1 y 2 ldots y n nbsp from some probability y P 8 0 displaystyle y sim P theta 0 nbsp that we try to estimate by finding 8 displaystyle hat theta nbsp that will maximize the likelihood using P 8 displaystyle P theta nbsp then 8 a r g m a x 8 L P 8 y a r g m a x 8 P 8 y a r g m a x 8 P y 8 a r g m a x 8 i 1 n P y i 8 a r g m a x 8 i 1 n log P y i 8 a r g m a x 8 i 1 n log P y i 8 i 1 n log P y i 8 0 a r g m a x 8 i 1 n log P y i 8 log P y i 8 0 a r g m a x 8 i 1 n log P y i 8 P y i 8 0 a r g m i n 8 i 1 n log P y i 8 0 P y i 8 a r g m i n 8 1 n i 1 n log P y i 8 0 P y i 8 a r g m i n 8 1 n i 1 n h 8 y i n a r g m i n 8 E h 8 y a r g m i n 8 P 8 0 y h 8 y d y a r g m i n 8 P 8 0 y log P y 8 0 P y 8 d y a r g m i n 8 D KL P 8 0 P 8 displaystyle begin aligned hat theta amp underset theta operatorname arg max L P theta mathbf y underset theta operatorname arg max P theta mathbf y underset theta operatorname arg max P mathbf y mid theta amp underset theta operatorname arg max prod i 1 n P y i mid theta underset theta operatorname arg max sum i 1 n log P y i mid theta amp underset theta operatorname arg max left sum i 1 n log P y i mid theta sum i 1 n log P y i mid theta 0 right underset theta operatorname arg max sum i 1 n left log P y i mid theta log P y i mid theta 0 right amp underset theta operatorname arg max sum i 1 n log frac P y i mid theta P y i mid theta 0 underset theta operatorname arg min sum i 1 n log frac P y i mid theta 0 P y i mid theta underset theta operatorname arg min frac 1 n sum i 1 n log frac P y i mid theta 0 P y i mid theta amp underset theta operatorname arg min frac 1 n sum i 1 n h theta y i quad underset n to infty longrightarrow quad underset theta operatorname arg min E h theta y amp underset theta operatorname arg min int P theta 0 y h theta y dy underset theta operatorname arg min int P theta 0 y log frac P y mid theta 0 P y mid theta dy amp underset theta operatorname arg min D text KL P theta 0 parallel P theta end aligned nbsp Where h 8 x log P x 8 0 P x 8 displaystyle h theta x log frac P x mid theta 0 P x mid theta nbsp Using h helps see how we are using the law of large numbers to move from the average of h x to the expectancy of it using the law of the unconscious statistician The first several transitions have to do with laws of logarithm and that finding 8 displaystyle hat theta nbsp that maximizes some function will also be the one that maximizes some monotonic transformation of that function i e adding multiplying by a constant Since cross entropy is just Shannon s entropy plus KL divergence and since the entropy of P 8 0 displaystyle P theta 0 nbsp is constant then the MLE is also asymptotically minimizing cross entropy 25 Examples editDiscrete uniform distribution edit Main article German tank problem Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random see uniform distribution thus the sample size is 1 If n is unknown then the maximum likelihood estimator n displaystyle widehat n nbsp of n is the number m on the drawn ticket The likelihood is 0 for n lt m 1 n for n m and this is greatest when n m Note that the maximum likelihood 
estimate of n occurs at the lower extreme of possible values m m 1 rather than somewhere in the middle of the range of possible values which would result in less bias The expected value of the number m on the drawn ticket and therefore the expected value of n displaystyle widehat n nbsp is n 1 2 As a result with a sample size of 1 the maximum likelihood estimator for n will systematically underestimate n by n 1 2 Discrete distribution finite parameter space edit Suppose one wishes to determine just how biased an unfair coin is Call the probability of tossing a head p The goal then becomes to determine p Suppose the coin is tossed 80 times i e the sample might be something like x1 H x2 T x80 T and the count of the number of heads H is observed The probability of tossing tails is 1 p so here p is 8 above Suppose the outcome is 49 heads and 31 tails and suppose the coin was taken from a box containing three coins one which gives heads with probability p 1 3 one which gives heads with probability p 1 2 and another which gives heads with probability p 2 3 The coins have lost their labels so which one it was is unknown Using maximum likelihood estimation the coin that has the largest likelihood can be found given the data that were observed By using the probability mass function of the binomial distribution with sample size equal to 80 number successes equal to 49 but for different values of p the probability of success the likelihood function defined below takes one of three values P H 49 p 1 3 80 49 1 3 49 1 1 3 31 0 000 P H 49 p 1 2 80 49 1 2 49 1 1 2 31 0 012 P H 49 p 2 3 80 49 2 3 49 1 2 3 31 0 054 displaystyle begin aligned operatorname mathbb P bigl mathrm H 49 mid p tfrac 1 3 bigr amp binom 80 49 tfrac 1 3 49 1 tfrac 1 3 31 approx 0 000 6pt operatorname mathbb P bigl mathrm H 49 mid p tfrac 1 2 bigr amp binom 80 49 tfrac 1 2 49 1 tfrac 1 2 31 approx 0 012 6pt operatorname mathbb P bigl mathrm H 49 mid p tfrac 2 3 bigr amp binom 80 49 tfrac 2 3 49 1 tfrac 2 3 31 approx 0 054 end aligned nbsp The likelihood is maximized when p 2 3 and so this is the maximum likelihood estimate for p Discrete distribution continuous parameter space edit Now suppose that there was only one coin but its p could have been any value 0 p 1 The likelihood function to be maximised is L p f D H 49 p 80 49 p 49 1 p 31 displaystyle L p f D mathrm H 49 mid p binom 80 49 p 49 1 p 31 nbsp and the maximisation is over all possible values 0 p 1 nbsp Likelihood function for proportion value of a binomial process n 10 One way to maximize this function is by differentiating with respect to p and setting to zero 0 p 80 49 p 49 1 p 31 0 49 p 48 1 p 31 31 p 49 1 p 30 p 48 1 p 30 49 1 p 31 p p 48 1 p 30 49 80 p displaystyle begin aligned 0 amp frac partial partial p left binom 80 49 p 49 1 p 31 right 8pt 0 amp 49p 48 1 p 31 31p 49 1 p 30 8pt amp p 48 1 p 30 left 49 1 p 31p right 8pt amp p 48 1 p 30 left 49 80p right end aligned nbsp This is a product of three terms The first term is 0 when p 0 The second is 0 when p 1 The third is zero when p 49 80 The solution that maximizes the likelihood is clearly p 49 80 since p 0 and p 1 result in a likelihood of 0 Thus the maximum likelihood estimator for p is 49 80 This result is easily generalized by substituting a letter such as s in the place of 49 to represent the observed number of successes of our Bernoulli trials and a letter such as n in the place of 80 to represent the number of Bernoulli trials Exactly the same calculation yields s n which is the maximum likelihood estimator 
Continuous distribution, continuous parameter space

For the normal distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

$$f(x_1,\ldots,x_n \mid \mu, \sigma^2) = \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{\sum_{i=1}^n (x_i-\mu)^2}{2\sigma^2}\right).$$

This family of distributions has two parameters: θ = (μ, σ); so we maximize the likelihood, $\mathcal{L}(\mu,\sigma^2) = f(x_1,\ldots,x_n \mid \mu, \sigma^2)$, over both parameters simultaneously, or if possible, individually.

Since the logarithm function itself is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm (the log-likelihood itself is not necessarily strictly increasing). The log-likelihood can be written as follows:

$$\log\bigl(\mathcal{L}(\mu,\sigma^2)\bigr) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2.$$

(Note: the log-likelihood is closely related to information entropy and Fisher information.)

We now compute the derivatives of this log-likelihood as follows:

$$0 = \frac{\partial}{\partial\mu}\log\bigl(\mathcal{L}(\mu,\sigma^2)\bigr) = 0 - \frac{-2n(\bar{x}-\mu)}{2\sigma^2},$$

where $\bar{x}$ is the sample mean. This is solved by

$$\widehat{\mu} = \bar{x} = \sum_{i=1}^n \frac{x_i}{n}.$$

This is indeed the maximum of the function, since it is the only turning point in μ and the second derivative is strictly less than zero. Its expected value is equal to the parameter μ of the given distribution,

$$\mathbb{E}\bigl[\widehat{\mu}\bigr] = \mu,$$

which means that the maximum likelihood estimator $\widehat{\mu}$ is unbiased.

Similarly we differentiate the log-likelihood with respect to σ and equate to zero:

$$0 = \frac{\partial}{\partial\sigma}\log\bigl(\mathcal{L}(\mu,\sigma^2)\bigr) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i-\mu)^2,$$

which is solved by

$$\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2.$$

Inserting the estimate $\mu = \widehat{\mu}$ we obtain

$$\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n x_i x_j.$$

To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables (statistical error) $\delta_i \equiv \mu - x_i$. Expressing the estimate in these variables yields

$$\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (\mu-\delta_i)^2 - \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n (\mu-\delta_i)(\mu-\delta_j).$$

Simplifying the expression above, utilizing the facts that $\mathbb{E}[\delta_i] = 0$ and $\mathbb{E}[\delta_i^2] = \sigma^2$, allows us to obtain

$$\mathbb{E}\bigl[\widehat{\sigma}^2\bigr] = \frac{n-1}{n}\sigma^2.$$

This means that the estimator $\widehat{\sigma}^2$ is biased for $\sigma^2$. It can also be shown that $\widehat{\sigma}$ is biased for $\sigma$, but that both $\widehat{\sigma}^2$ and $\widehat{\sigma}$ are consistent.

Formally we say that the maximum likelihood estimator for $\theta = (\mu, \sigma^2)$ is

$$\widehat{\theta} = \left(\widehat{\mu}, \widehat{\sigma}^2\right).$$

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.

The normal log-likelihood at its maximum takes a particularly simple form:

$$\log\bigl(\mathcal{L}(\widehat{\mu},\widehat{\sigma})\bigr) = \frac{-n}{2}\bigl(\log(2\pi\widehat{\sigma}^2) + 1\bigr).$$

This maximum log-likelihood can be shown to be the same for more general least squares, even for non-linear least squares. This is often used in determining likelihood-based approximate confidence intervals and confidence regions, which are generally more accurate than those using the asymptotic normality discussed above.
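A short numerical sketch of the closed-form results above (illustrative only; the sample is synthetic and the parameter values are arbitrary): $\widehat{\mu}$ is the sample mean, $\widehat{\sigma}^2$ is the average squared deviation with denominator n, and rescaling by n/(n − 1) gives the familiar unbiased variance estimator.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mu_true, sigma_true, n = 5.0, 2.0, 1_000
x = rng.normal(loc=mu_true, scale=sigma_true, size=n)

mu_hat = x.mean()                        # MLE of mu: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of sigma^2: denominator n (biased)
sigma2_unbiased = x.var(ddof=1)          # denominator n - 1 (unbiased)

print(f"mu_hat       = {mu_hat:.3f}")
print(f"sigma2_hat   = {sigma2_hat:.3f}   (expectation is (n-1)/n * sigma^2)")
print(f"s^2 (ddof=1) = {sigma2_unbiased:.3f}  (equals sigma2_hat * n/(n-1))")
```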
Non-independent variables

It may be the case that variables are correlated, that is, not independent. Two random variables $y_1$ and $y_2$ are independent only if their joint probability density function is the product of the individual probability density functions, i.e.

$$f(y_1, y_2) = f(y_1)\, f(y_2).$$

Suppose one constructs an order-n Gaussian vector out of random variables $(y_1,\ldots,y_n)$, where each variable has means given by $(\mu_1,\ldots,\mu_n)$. Furthermore, let the covariance matrix be denoted by $\mathit{\Sigma}$. The joint probability density function of these n random variables then follows a multivariate normal distribution given by:

$$f(y_1,\ldots,y_n) = \frac{1}{(2\pi)^{n/2}\sqrt{\det(\mathit{\Sigma})}} \exp\left(-\frac{1}{2}\left[y_1-\mu_1,\ldots,y_n-\mu_n\right]\mathit{\Sigma}^{-1}\left[y_1-\mu_1,\ldots,y_n-\mu_n\right]^{\mathrm{T}}\right).$$

In the bivariate case, the joint probability density function is given by:

$$f(y_1,y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(y_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(y_1-\mu_1)(y_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(y_2-\mu_2)^2}{\sigma_2^2}\right)\right].$$

In this and other cases where a joint density function exists, the likelihood function is defined as above, in the section "Principles", using this density.
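When observations are correlated, the likelihood is evaluated through the joint density rather than a product of marginals. The sketch below (illustrative; the parameter values are arbitrary) computes the multivariate normal log-density directly from the mean vector and covariance matrix, matching the formula above, with the bivariate case as the worked input.

```python
import numpy as np

def mvn_log_density(y: np.ndarray, mu: np.ndarray, cov: np.ndarray) -> float:
    """Log of the multivariate normal density f(y_1, ..., y_n) for a given mean vector and covariance matrix."""
    k = len(mu)
    diff = y - mu
    _, logdet = np.linalg.slogdet(cov)        # numerically stable log-determinant
    quad = diff @ np.linalg.solve(cov, diff)  # (y - mu)^T Sigma^{-1} (y - mu)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + quad)

# Bivariate example with correlation rho = 0.8 (arbitrary illustrative values).
mu = np.array([0.0, 1.0])
sigma1, sigma2, rho = 1.0, 2.0, 0.8
cov = np.array([[sigma1**2,           rho * sigma1 * sigma2],
                [rho * sigma1 * sigma2, sigma2**2]])

y = np.array([0.5, 1.5])
print(mvn_log_density(y, mu, cov))  # log f(y_1, y_2) under the bivariate normal above
```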
Example

$X_1, X_2, \ldots, X_m$ are counts in cells/boxes 1 up to m; each box has a different probability (think of the boxes being bigger or smaller) and we fix the number of balls that fall to be $n$: $x_1 + x_2 + \cdots + x_m = n$. The probability of each box is $p_i$, with a constraint: $p_1 + p_2 + \cdots + p_m = 1$. This is a case in which the $X_i$'s are not independent. The joint probability of a vector $(x_1, x_2, \ldots, x_m)$ is called the multinomial and has the form:

$$f(x_1,x_2,\ldots,x_m \mid p_1,p_2,\ldots,p_m) = \frac{n!}{\prod x_i!} \prod p_i^{x_i} = \binom{n}{x_1,x_2,\ldots,x_m} p_1^{x_1} p_2^{x_2} \cdots p_m^{x_m}.$$

Each box taken separately against all the other boxes is a binomial, and this is an extension thereof.

The log-likelihood of this is:

$$\ell(p_1,p_2,\ldots,p_m) = \log n! - \sum_{i=1}^m \log x_i! + \sum_{i=1}^m x_i \log p_i.$$

The constraint has to be taken into account, so we use the Lagrange multipliers:

$$L(p_1,p_2,\ldots,p_m,\lambda) = \ell(p_1,p_2,\ldots,p_m) + \lambda\left(1 - \sum_{i=1}^m p_i\right).$$

Setting all the derivatives to 0 yields the most natural estimate:

$$\hat{p}_i = \frac{x_i}{n}.$$

Maximizing the log-likelihood, with and without constraints, can be an unsolvable problem in closed form; then we have to use iterative procedures.

Iterative procedures

Except for special cases, the likelihood equations

$$\frac{\partial\ell(\theta;\mathbf{y})}{\partial\theta} = 0$$

cannot be solved explicitly for an estimator $\widehat{\theta} = \widehat{\theta}(\mathbf{y})$. Instead, they need to be solved iteratively: starting from an initial guess of θ (say $\widehat{\theta}_1$), one seeks to obtain a convergent sequence $\{\widehat{\theta}_r\}$. Many methods for this kind of optimization problem are available,[26][27] but the most commonly used ones are algorithms based on an updating formula of the form

$$\widehat{\theta}_{r+1} = \widehat{\theta}_r + \eta_r\, \mathbf{d}_r\bigl(\widehat{\theta}\bigr),$$

where the vector $\mathbf{d}_r\bigl(\widehat{\theta}\bigr)$ indicates the descent direction of the rth "step," and the scalar $\eta_r$ captures the "step length,"[28][29] also known as the learning rate.[30]

Gradient descent method

(Note: here it is a maximization problem, so the sign before the gradient is flipped.)

$$\eta_r \in \mathbb{R}^+,$$ small enough for convergence, and

$$\mathbf{d}_r\bigl(\widehat{\theta}\bigr) = \nabla\ell\bigl(\widehat{\theta}_r;\mathbf{y}\bigr).$$

The gradient descent method requires calculating the gradient at the rth iteration, but there is no need to calculate the inverse of the second-order derivative, i.e., the Hessian matrix. Therefore, it is computationally faster than the Newton–Raphson method.

Newton–Raphson method

$$\eta_r = 1 \quad\text{and}\quad \mathbf{d}_r\bigl(\widehat{\theta}\bigr) = -\mathbf{H}_r^{-1}\bigl(\widehat{\theta}\bigr)\, \mathbf{s}_r\bigl(\widehat{\theta}\bigr),$$

where $\mathbf{s}_r\bigl(\widehat{\theta}\bigr)$ is the score and $\mathbf{H}_r^{-1}\bigl(\widehat{\theta}\bigr)$ is the inverse of the Hessian matrix of the log-likelihood function, both evaluated at the rth iteration.[31][32] But because the calculation of the Hessian matrix is computationally costly, numerous alternatives have been proposed. The popular Berndt–Hall–Hall–Hausman algorithm approximates the Hessian with the outer product of the expected gradient, such that

$$\mathbf{d}_r\bigl(\widehat{\theta}\bigr) = \left[\frac{1}{n}\sum_{t=1}^n \frac{\partial\ell(\theta;\mathbf{y})}{\partial\theta} \left(\frac{\partial\ell(\theta;\mathbf{y})}{\partial\theta}\right)^{\mathsf{T}}\right]^{-1} \mathbf{s}_r\bigl(\widehat{\theta}\bigr).$$
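As an illustration of this update rule, the sketch below (not from the article) applies the Newton–Raphson step with $\eta_r = 1$ and $d_r = -H_r^{-1} s_r$ to the location parameter of a Cauchy model, a standard case in which the likelihood equation has no closed-form solution. The score and Hessian expressions are derived here for this example and are assumptions of the sketch, not formulas from the article.

```python
import numpy as np

def score(theta: float, x: np.ndarray) -> float:
    """First derivative of the Cauchy(theta, 1) log-likelihood in theta."""
    u = x - theta
    return np.sum(2 * u / (1 + u**2))

def hessian(theta: float, x: np.ndarray) -> float:
    """Second derivative of the Cauchy(theta, 1) log-likelihood in theta."""
    u = x - theta
    return np.sum(2 * (u**2 - 1) / (1 + u**2) ** 2)

rng = np.random.default_rng(seed=1)
x = rng.standard_cauchy(size=200) + 3.0  # true location parameter is 3.0

theta = np.median(x)                     # reasonable starting value
for r in range(20):                      # Newton-Raphson: eta_r = 1, d_r = -H^{-1} s
    step = -score(theta, x) / hessian(theta, x)
    theta += step
    if abs(step) < 1e-10:
        break

print(f"MLE of the location parameter: {theta:.4f}")
```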
Quasi-Newton methods

Other quasi-Newton methods use more elaborate secant updates to give an approximation of the Hessian matrix.

Davidon–Fletcher–Powell formula

The DFP formula finds a solution that is symmetric, positive-definite and closest to the current approximate value of the second-order derivative:

$$\mathbf{H}_{k+1} = \left(I - \gamma_k y_k s_k^{\mathsf{T}}\right) \mathbf{H}_k \left(I - \gamma_k s_k y_k^{\mathsf{T}}\right) + \gamma_k y_k y_k^{\mathsf{T}},$$

where

$$y_k = \nabla\ell(x_k + s_k) - \nabla\ell(x_k),$$
$$\gamma_k = \frac{1}{y_k^{\mathsf{T}} s_k},$$
$$s_k = x_{k+1} - x_k.$$

Broyden–Fletcher–Goldfarb–Shanno algorithm

BFGS also gives a solution that is symmetric and positive-definite:

$$B_{k+1} = B_k + \frac{y_k y_k^{\mathsf{T}}}{y_k^{\mathsf{T}} s_k} - \frac{B_k s_k s_k^{\mathsf{T}} B_k^{\mathsf{T}}}{s_k^{\mathsf{T}} B_k s_k},$$

with $y_k$ and $s_k$ defined as in the DFP update.
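In practice these quasi-Newton updates are rarely coded by hand; general-purpose optimizers implement them. The sketch below (illustrative, not from the article) uses SciPy's BFGS method to obtain the normal MLE by minimizing the negative log-likelihood; the log-sigma parametrization is an assumption made here to keep σ positive.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=2)
x = rng.normal(loc=5.0, scale=2.0, size=500)  # synthetic data with known parameters

def negative_log_likelihood(params: np.ndarray) -> float:
    """Negative normal log-likelihood; the log-sigma parametrization keeps sigma positive."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return 0.5 * len(x) * np.log(2 * np.pi * sigma**2) + np.sum((x - mu) ** 2) / (2 * sigma**2)

result = minimize(negative_log_likelihood, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")  # close to the sample mean and the MLE of sigma
```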
