
Metropolis–Hastings algorithm

In statistics and statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution from which direct sampling is difficult. This sequence can be used to approximate the distribution (e.g. to generate a histogram) or to compute an integral (e.g. an expected value). Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, there are usually other methods (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and these are free from the problem of autocorrelated samples that is inherent in MCMC methods.

The proposal distribution Q proposes the next point to which the random walk might move.

History

The algorithm is named in part for Nicholas Metropolis, the first coauthor of a 1953 paper, entitled Equation of State Calculations by Fast Computing Machines, with Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller and Edward Teller. For many years the algorithm was known simply as the Metropolis algorithm.[1][2] The paper proposed the algorithm for the case of symmetrical proposal distributions, but in 1970, W.K. Hastings extended it to the more general case.[3] The generalized method was eventually identified by both names, although the first use of the term "Metropolis-Hastings algorithm" is unclear.

Some controversy exists with regard to credit for development of the Metropolis algorithm. Metropolis, who was familiar with the computational aspects of the method, had coined the term "Monte Carlo" in an earlier article with Stanisław Ulam, and led the group in the Theoretical Division that designed and built the MANIAC I computer used in the experiments in 1952. However, prior to 2003 there was no detailed account of the algorithm's development. Shortly before his death, Marshall Rosenbluth attended a 2003 conference at LANL marking the 50th anniversary of the 1953 publication. At this conference, Rosenbluth described the algorithm and its development in a presentation titled "Genesis of the Monte Carlo Algorithm for Statistical Mechanics".[4] Further historical clarification is made by Gubernatis in a 2005 journal article[5] recounting the 50th anniversary conference. Rosenbluth makes it clear that he and his wife Arianna did the work, and that Metropolis played no role in the development other than providing computer time.

This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the 1953 article worked together for "days (and nights)".[6] In contrast, the detailed account by Rosenbluth credits Teller with a crucial but early suggestion to "take advantage of statistical mechanics and take ensemble averages instead of following detailed kinematics". This, says Rosenbluth, started him thinking about the generalized Monte Carlo approach – a topic which he says he had discussed often with John von Neumann. Arianna Rosenbluth recounted (to Gubernatis in 2003) that Augusta Teller started the computer work, but that Arianna herself took it over and wrote the code from scratch. In an oral history recorded shortly before his death,[7] Rosenbluth again credits Teller with posing the original problem, himself with solving it, and Arianna with programming the computer.

Intuition

The Metropolis–Hastings algorithm can draw samples from any probability distribution with probability density $P(x)$, provided that we know a function $f(x)$ proportional to the density $P$ and the values of $f(x)$ can be calculated. The requirement that $f(x)$ must only be proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because calculating the necessary normalization factor is often extremely difficult in practice.

The Metropolis–Hastings algorithm generates a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution. These sample values are produced iteratively, with the distribution of the next sample depending only on the current sample value, thus making the sequence of samples a Markov chain. Specifically, at each iteration the algorithm picks a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted, in which case the candidate value is used in the next iteration, or rejected, in which case the candidate value is discarded and the current value is reused in the next iteration. The probability of acceptance is determined by comparing the values of the function $f(x)$ at the current and candidate sample values with respect to the desired distribution.

For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below.

Metropolis algorithm (symmetric proposal distribution)

Let $f(x)$ be a function that is proportional to the desired probability density function $P(x)$ (a.k.a. a target distribution)[a].

  1. Initialization: Choose an arbitrary point $x_t$ to be the first observation in the sample and choose an arbitrary probability density $g(x \mid y)$ (sometimes written $Q(x \mid y)$) that suggests a candidate for the next sample value $x$, given the previous sample value $y$. In this section, $g$ is assumed to be symmetric; in other words, it must satisfy $g(x \mid y) = g(y \mid x)$. A usual choice is to let $g(x \mid y)$ be a Gaussian distribution centered at $y$, so that points closer to $y$ are more likely to be visited next, making the sequence of samples into a random walk[b]. The function $g$ is referred to as the proposal density or jumping distribution.
  2. For each iteration $t$:
    • Generate a candidate $x'$ for the next sample by picking from the distribution $g(x' \mid x_t)$.
    • Calculate the acceptance ratio $\alpha = f(x')/f(x_t)$, which will be used to decide whether to accept or reject the candidate[c]. Because $f$ is proportional to the density of $P$, we have that $\alpha = f(x')/f(x_t) = P(x')/P(x_t)$.
    • Accept or reject:
      • Generate a uniform random number $u \in [0, 1]$.
      • If $u \le \alpha$, then accept the candidate by setting $x_{t+1} = x'$,
      • If $u > \alpha$, then reject the candidate and set $x_{t+1} = x_t$ instead.
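
For concreteness, the following is a minimal Python sketch of the procedure above. The unnormalized target density and the Gaussian proposal width used here are illustrative assumptions, not part of the algorithm itself.

    import numpy as np

    def metropolis(f, x0, n_samples, scale=1.0, rng=None):
        # Sample from a density proportional to f using a symmetric
        # Gaussian random-walk proposal g(x' | x) = Normal(x, scale^2).
        rng = np.random.default_rng() if rng is None else rng
        x = x0
        samples = np.empty(n_samples)
        for t in range(n_samples):
            candidate = x + rng.normal(0.0, scale)  # draw x' from g(x' | x_t)
            alpha = f(candidate) / f(x)             # acceptance ratio f(x') / f(x_t)
            if rng.uniform() <= alpha:              # accept with probability min(1, alpha)
                x = candidate
            samples[t] = x                          # on rejection, x_t is reused
        return samples

    # Example: draw from an unnormalized standard normal density.
    draws = metropolis(lambda x: np.exp(-0.5 * x**2), x0=0.0, n_samples=10_000)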

This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio $\alpha$ indicates how probable the new proposed sample is with respect to the current sample, according to the distribution whose density is $P(x)$. If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of $P(x)$, corresponding to an $\alpha > 1 \ge u$), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the larger the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of $P(x)$, while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works and returns samples that follow the desired distribution with density $P(x)$.

Compared with an algorithm like adaptive rejection sampling[8] that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:

  • The samples are correlated. Even though over the long term they do correctly follow $P(x)$, a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that effective sample sizes can be significantly lower than the number of samples actually taken, leading to large errors.
  • Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary,[9] where an initial number of samples are thrown away.

On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, does not have this problem to such a degree, and these methods are thus often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from hierarchical Bayesian models and other high-dimensional statistical models used nowadays in many disciplines.

In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding a suitable jumping distribution can be difficult, as the individual dimensions behave in very different ways and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. That way, the problem of sampling from a potentially high-dimensional space is reduced to a collection of low-dimensional sampling problems.[10] This is especially applicable when the multivariate distribution is composed of a set of individual random variables in which each variable is conditioned on only a small number of other variables, as is the case in most typical hierarchical models. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are the adaptive rejection sampling methods,[8] the adaptive rejection Metropolis sampling algorithm,[11] a simple one-dimensional Metropolis–Hastings step (sketched below), or slice sampling.
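
As an illustration of this idea, here is a minimal Python sketch of Metropolis-within-Gibbs, in which each coordinate is updated in turn by a one-dimensional Metropolis step conditioned on the current values of all the others. The target log-density, starting point, and per-coordinate step sizes are illustrative placeholders.

    import numpy as np

    def metropolis_within_gibbs(log_f, x0, n_sweeps, scales, rng=None):
        # One sweep updates each coordinate in turn with a one-dimensional
        # Metropolis step, holding all other coordinates at their current values.
        rng = np.random.default_rng() if rng is None else rng
        x = np.asarray(x0, dtype=float).copy()
        samples = np.empty((n_sweeps, x.size))
        for t in range(n_sweeps):
            for i in range(x.size):
                proposal = x.copy()
                proposal[i] += rng.normal(0.0, scales[i])  # 1-D symmetric proposal
                # Symmetric proposal, so the log acceptance ratio is log f(x') - log f(x).
                if np.log(rng.uniform()) <= log_f(proposal) - log_f(x):
                    x = proposal
            samples[t] = x
        return samples

    # Example: a correlated 2-D Gaussian target, updated coordinate by coordinate.
    log_f = lambda v: -0.5 * (v[0]**2 - 1.6 * v[0] * v[1] + v[1]**2) / (1 - 0.8**2)
    draws = metropolis_within_gibbs(log_f, x0=[0.0, 0.0], n_sweeps=5_000, scales=[1.0, 1.0])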

Formal derivation

The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution $P(x)$. To accomplish this, the algorithm uses a Markov process, which asymptotically reaches a unique stationary distribution $\pi(x)$ such that $\pi(x) = P(x)$.[12]

A Markov process is uniquely defined by its transition probabilities $P(x' \mid x)$, the probability of transitioning from any given state $x$ to any other given state $x'$. It has a unique stationary distribution $\pi(x)$ when the following two conditions are met:[12]

  1. Existence of stationary distribution: there must exist a stationary distribution $\pi(x)$. A sufficient but not necessary condition is detailed balance, which requires that each transition $x \to x'$ is reversible: for every pair of states $x, x'$, the probability of being in state $x$ and transitioning to state $x'$ must be equal to the probability of being in state $x'$ and transitioning to state $x$, $\pi(x)\, P(x' \mid x) = \pi(x')\, P(x \mid x')$.
  2. Uniqueness of stationary distribution: the stationary distribution $\pi(x)$ must be unique. This is guaranteed by ergodicity of the Markov process, which requires that every state must (1) be aperiodic—the system does not return to the same state at fixed intervals; and (2) be positive recurrent—the expected number of steps for returning to the same state is finite.

The Metropolis–Hastings algorithm involves designing a Markov process (by constructing transition probabilities) that fulfills the two above conditions, such that its stationary distribution $\pi(x)$ is chosen to be $P(x)$. The derivation of the algorithm starts with the condition of detailed balance:

$P(x' \mid x)\, P(x) = P(x \mid x')\, P(x'),$

which is re-written as

$\frac{P(x' \mid x)}{P(x \mid x')} = \frac{P(x')}{P(x)}.$

The approach is to separate the transition into two sub-steps: the proposal and the acceptance-rejection. The proposal distribution $g(x' \mid x)$ is the conditional probability of proposing a state $x'$ given $x$, and the acceptance distribution $A(x', x)$ is the probability of accepting the proposed state $x'$. The transition probability can be written as their product:

$P(x' \mid x) = g(x' \mid x)\, A(x', x).$

Inserting this relation into the previous equation, we have

$\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)}.$

The next step in the derivation is to choose an acceptance ratio that fulfills the condition above. One common choice is the Metropolis choice:

$A(x', x) = \min\left(1, \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)}\right).$

For this Metropolis acceptance ratio $A$, either $A(x', x) = 1$ or $A(x, x') = 1$ and, either way, the condition is satisfied.
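
To see this, suppose (without loss of generality) that $P(x')\, g(x \mid x') \le P(x)\, g(x' \mid x)$. The Metropolis choice then gives

$A(x', x) = \frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)} \le 1, \qquad A(x, x') = 1,$

so the ratio $A(x', x)/A(x, x')$ equals $\frac{P(x')}{P(x)}\, \frac{g(x \mid x')}{g(x' \mid x)}$, which is exactly the condition derived above; the opposite case follows by exchanging the roles of $x$ and $x'$.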

The Metropolis–Hastings algorithm can thus be written as follows:

  1. Initialise
    1. Pick an initial state $x_0$.
    2. Set $t = 0$.
  2. Iterate
    1. Generate a random candidate state $x'$ according to $g(x' \mid x_t)$.
    2. Calculate the acceptance probability $A(x', x_t) = \min\left(1, \frac{P(x')}{P(x_t)}\, \frac{g(x_t \mid x')}{g(x' \mid x_t)}\right)$.
    3. Accept or reject:
      1. generate a uniform random number $u \in [0, 1]$;
      2. if $u \le A(x', x_t)$, then accept the new state and set $x_{t+1} = x'$;
      3. if $u > A(x', x_t)$, then reject the new state, and copy the old state forward: $x_{t+1} = x_t$.
    4. Increment: set $t = t + 1$.
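
The following is a minimal Python sketch of this general update, including the Hastings correction factor $g(x_t \mid x')/g(x' \mid x_t)$ for an asymmetric proposal. The multiplicative (log-normal) random walk and the Gamma-shaped target are illustrative assumptions; for this particular proposal the correction factor reduces to $x'/x_t$.

    import numpy as np

    def metropolis_hastings(log_f, x0, n_samples, scale=0.5, rng=None):
        # Full M-H update for a positive-valued target with an asymmetric
        # multiplicative (log-normal) proposal: x' = x * exp(scale * z), z ~ N(0, 1).
        rng = np.random.default_rng() if rng is None else rng
        x = x0
        samples = np.empty(n_samples)
        for t in range(n_samples):
            candidate = x * np.exp(scale * rng.normal())
            # log A = log f(x') - log f(x) + log g(x_t | x') - log g(x' | x_t);
            # for this log-normal proposal the g-ratio reduces to x' / x_t.
            log_alpha = log_f(candidate) - log_f(x) + np.log(candidate) - np.log(x)
            if np.log(rng.uniform()) <= log_alpha:
                x = candidate                 # accept the new state
            samples[t] = x                    # otherwise copy the old state forward
        return samples

    # Example: draw from an unnormalized Gamma(3, 1) density, f(x) = x^2 exp(-x).
    draws = metropolis_hastings(lambda x: 2 * np.log(x) - x, x0=1.0, n_samples=10_000)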

Provided that specified conditions are met, the empirical distribution of saved states $x_0, \ldots, x_T$ will approach $P(x)$. The number of iterations ($T$) required to effectively estimate $P(x)$ depends on a number of factors, including the relationship between $P(x)$ and the proposal distribution and the desired accuracy of estimation.[13] For distributions on discrete state spaces, it has to be of the order of the autocorrelation time of the Markov process.[14]

It is important to notice that, in a general problem, it is not clear which distribution $g(x' \mid x)$ one should use or how many iterations are necessary for proper estimation; both are free parameters of the method, which must be adjusted to the particular problem at hand.

Use in numerical integration

Main article: Monte Carlo integration

A common use of the Metropolis–Hastings algorithm is to compute an integral. Specifically, consider a space $\Omega \subset \mathbb{R}$ and a probability distribution $P(x)$ over $\Omega$, $x \in \Omega$. Metropolis–Hastings can estimate an integral of the form

$P(E) = \int_\Omega A(x)\, P(x)\, dx,$

where $A(x)$ is a (measurable) function of interest.

For example, consider a statistic $E(x)$ and its probability distribution $P(E)$, which is a marginal distribution. Suppose that the goal is to estimate $P(E)$ for $E$ on the tail of $P(E)$. Formally, $P(E)$ can be written as

$P(E) = \int_\Omega P(E \mid x)\, P(x)\, dx = \int_\Omega \delta\big(E - E(x)\big)\, P(x)\, dx = E\big(P(E \mid X)\big),$

and, thus, estimating $P(E)$ can be accomplished by estimating the expected value of the indicator function $A_E(x) \equiv \mathbf{1}_E(x)$, which is 1 when $E(x) \in [E, E + \Delta E]$ and zero otherwise. Because $E$ is on the tail of $P(E)$, the probability of drawing a state $x$ with $E(x)$ on the tail of $P(E)$ is proportional to $P(E)$, which is small by definition. The Metropolis–Hastings algorithm can be used here to sample (rare) states more likely and thus increase the number of samples used to estimate $P(E)$ on the tails. This can be done e.g. by using a sampling distribution $\pi(x)$ to favor those states (e.g. $\pi(x) \propto e^{aE}$ with $a > 0$).
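
As a minimal illustration (using the plain estimator, without the tail-favoring sampling distribution just described), the following sketch estimates a tail probability as the empirical average of an indicator function over Metropolis draws. The standard-normal target and the threshold are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    log_f = lambda x: -0.5 * x**2            # unnormalized standard normal log-density

    x, draws = 0.0, []
    for _ in range(100_000):
        candidate = x + rng.normal(0.0, 1.0)             # symmetric proposal
        if np.log(rng.uniform()) <= log_f(candidate) - log_f(x):
            x = candidate
        draws.append(x)

    # Average the indicator 1{x > 2} over the chain, after discarding burn-in.
    estimate = np.mean(np.array(draws[1_000:]) > 2.0)    # exact value: 1 - Phi(2) ~ 0.0228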

Step-by-step instructions

Suppose that the most recent value sampled is $x_t$. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state $x'$ with probability density $g(x' \mid x_t)$ and calculate a value

$a = a_1 a_2,$

where

$a_1 = \frac{P(x')}{P(x_t)}$

is the probability (e.g., Bayesian posterior) ratio between the proposed sample $x'$ and the previous sample $x_t$, and

$a_2 = \frac{g(x_t \mid x')}{g(x' \mid x_t)}$

is the ratio of the proposal density in two directions (from $x_t$ to $x'$ and conversely). This is equal to 1 if the proposal density is symmetric. Then the new state $x_{t+1}$ is chosen according to the following rules.

If $a \ge 1$:
  $x_{t+1} = x'$;
else:
  $x_{t+1} = x'$ with probability $a$, or $x_{t+1} = x_t$ with probability $1 - a$.

The Markov chain is started from an arbitrary initial value $x_0$, and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as burn-in. The remaining set of accepted values of $x$ represents a sample from the distribution $P(x)$.

The algorithm works best if the proposal density matches the shape of the target distribution $P(x)$, from which direct sampling is difficult; that is, $g(x' \mid x_t) \approx P(x')$. If a Gaussian proposal density $g$ is used, the variance parameter $\sigma^2$ has to be tuned during the burn-in period. This is usually done by calculating the acceptance rate, which is the fraction of proposed samples that is accepted in a window of the last $N$ samples. The desired acceptance rate depends on the target distribution; however, it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is about 50%, decreasing to about 23% for an $N$-dimensional Gaussian target distribution.[15] These guidelines can work well when sampling from sufficiently regular Bayesian posteriors, as they often follow a multivariate normal distribution, as can be established using the Bernstein–von Mises theorem.[16]

If $\sigma^2$ is too small, the chain will mix slowly (i.e., the acceptance rate will be high, but successive samples will move around the space slowly, and the chain will converge only slowly to $P(x)$). On the other hand, if $\sigma^2$ is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so $a_1$ will be very small, and again the chain will converge very slowly. One typically tunes the proposal distribution so that the algorithm accepts on the order of 30% of all samples – in line with the theoretical estimates mentioned in the previous paragraph.
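
A sketch of one common tuning heuristic consistent with this guidance: run short windows of updates during burn-in, measure the acceptance rate in each window, and nudge $\sigma$ toward a target rate. The multiplicative update rule below is an illustrative heuristic, not a prescribed schedule.

    import numpy as np

    def tune_sigma(log_f, x0, sigma=1.0, target_rate=0.3, window=100, n_windows=50, rng=None):
        # Run short windows of Metropolis updates, measure the acceptance rate
        # in each window, and scale sigma up or down toward the target rate.
        rng = np.random.default_rng() if rng is None else rng
        x = x0
        for _ in range(n_windows):
            accepted = 0
            for _ in range(window):
                candidate = x + rng.normal(0.0, sigma)
                if np.log(rng.uniform()) <= log_f(candidate) - log_f(x):
                    x, accepted = candidate, accepted + 1
            rate = accepted / window
            sigma *= np.exp(rate - target_rate)  # widen if accepting too often, narrow if too rarely
        # Adaptation must end here: samples kept for inference should be drawn
        # with sigma held fixed, or the chain no longer targets P(x).
        return sigma, x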

 
The result of three Markov chains running on the 3D Rosenbrock function using the Metropolis–Hastings algorithm. The algorithm samples from regions where the posterior probability is high, and the chains begin to mix in these regions. The approximate position of the maximum has been illuminated. The red points are the ones that remain after the burn-in process. The earlier ones have been discarded.

Bayesian Inference

Main article: Bayesian inference

 
Flowchart of the Metropolis–Hastings (M–H) algorithm for parameter estimation using the Markov chain Monte Carlo (MCMC) approach.

MCMC can be used to draw samples from the posterior distribution of a statistical model. The acceptance probability is given by: $P_{\mathrm{acc}}(\theta_i \to \theta') = \min\left(1, \frac{\mathcal{L}(y \mid \theta')\, P(\theta')}{\mathcal{L}(y \mid \theta_i)\, P(\theta_i)}\, \frac{Q(\theta_i \mid \theta')}{Q(\theta' \mid \theta_i)}\right),$ where $\mathcal{L}$ is the likelihood, $P(\theta)$ the prior probability density and $Q$ the (conditional) proposal probability.
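
As a minimal sketch, the following samples the posterior of a Gaussian mean $\theta$ with known unit variance under a Gaussian prior, using a symmetric random-walk proposal so that the $Q$-ratio cancels. The simulated data, prior scale, and step size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(loc=3.0, scale=1.0, size=50)     # simulated observations (assumption)

    def log_posterior(theta):
        log_likelihood = -0.5 * np.sum((y - theta) ** 2)  # N(theta, 1) likelihood
        log_prior = -0.5 * theta**2 / 10.0                # N(0, 10) prior (assumption)
        return log_likelihood + log_prior

    theta, chain = 0.0, []
    for _ in range(20_000):
        proposal = theta + rng.normal(0.0, 0.3)     # symmetric proposal: Q-ratio cancels
        if np.log(rng.uniform()) <= log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        chain.append(theta)

    posterior_mean = np.mean(chain[2_000:])         # summarize after discarding burn-in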

See also

  • Detailed balance
  • Genetic algorithms
  • Gibbs sampling
  • Hamiltonian Monte Carlo
  • Mean-field particle methods
  • Metropolis-adjusted Langevin algorithm
  • Metropolis light transport
  • Multiple-try Metropolis
  • Parallel tempering
  • Preconditioned Crank–Nicolson algorithm
  • Sequential Monte Carlo
  • Simulated annealing

References

  1. ^ Kalos, Malvin H.; Whitlock, Paula A. (1986). Monte Carlo Methods Volume I: Basics. New York: Wiley. pp. 78–88.
  2. ^ Tierney, Luke (1994). "Markov chains for exploring posterior distributions". The Annals of Statistics. 22 (4): 1701–1762.
  3. ^ Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika. 57 (1): 97–109. Bibcode:1970Bimka..57...97H. doi:10.1093/biomet/57.1.97. JSTOR 2334940. Zbl 0219.65008.
  4. ^ M.N. Rosenbluth (2003). "Genesis of the Monte Carlo Algorithm for Statistical Mechanics". AIP Conference Proceedings. 690: 22–30. Bibcode:2003AIPC..690...22R. doi:10.1063/1.1632112.
  5. ^ J.E. Gubernatis (2005). "Marshall Rosenbluth and the Metropolis Algorithm". Physics of Plasmas. 12 (5): 057303. Bibcode:2005PhPl...12e7303G. doi:10.1063/1.1887186.
  6. ^ Teller, Edward. Memoirs: A Twentieth-Century Journey in Science and Politics. Perseus Publishing, 2001, p. 328
  7. ^ Rosenbluth, Marshall. "Oral History Transcript". American Institute of Physics
  8. ^ a b Gilks, W. R.; Wild, P. (1992-01-01). "Adaptive Rejection Sampling for Gibbs Sampling". Journal of the Royal Statistical Society. Series C (Applied Statistics). 41 (2): 337–348. doi:10.2307/2347565. JSTOR 2347565.
  9. ^ Gelman, Andrew (2004). Bayesian Data Analysis (2nd ed.). Boca Raton, Fla.: Chapman & Hall/CRC. ISBN 978-1584883883. OCLC 51991499.
  10. ^ Lee, Se Yoon (2021). "Gibbs sampler and coordinate ascent variational inference: A set-theoretical review". Communications in Statistics - Theory and Methods. 51 (6): 1549–1568. arXiv:2008.01006. doi:10.1080/03610926.2021.1921214. S2CID 220935477.
  11. ^ Gilks, W. R.; Best, N. G.; Tan, K. K. C. (1995-01-01). "Adaptive Rejection Metropolis Sampling within Gibbs Sampling". Journal of the Royal Statistical Society. Series C (Applied Statistics). 44 (4): 455–472. doi:10.2307/2986138. JSTOR 2986138.
  12. ^ a b Robert, Christian; Casella, George (2004). Monte Carlo Statistical Methods. Springer. ISBN 978-0387212395.
  13. ^ Raftery, Adrian E., and Steven Lewis. "How Many Iterations in the Gibbs Sampler?" In Bayesian Statistics 4. 1992.
  14. ^ Newman, M. E. J.; Barkema, G. T. (1999). Monte Carlo Methods in Statistical Physics. USA: Oxford University Press. ISBN 978-0198517979.
  15. ^ Roberts, G.O.; Gelman, A.; Gilks, W.R. (1997). "Weak convergence and optimal scaling of random walk Metropolis algorithms". Ann. Appl. Probab. 7 (1): 110–120. CiteSeerX 10.1.1.717.2582. doi:10.1214/aoap/1034625254.
  16. ^ Schmon, Sebastian M.; Gagnon, Philippe (2022-04-15). "Optimal scaling of random walk Metropolis algorithms using Bayesian large-sample asymptotics". Statistics and Computing. 32 (2): 28. doi:10.1007/s11222-022-10080-8. ISSN 0960-3174. PMC 8924149. PMID 35310543.

Notes

  1. ^ In the original paper by Metropolis et al. (1953), $f$ was taken to be the Boltzmann distribution as the specific application considered was Monte Carlo integration of equations of state in physical chemistry; the extension by Hastings generalized to an arbitrary distribution $f$.
  2. ^ In the original paper by Metropolis et al. (1953), $g(x \mid y)$ was suggested to be a random translation with uniform density over some prescribed range.
  3. ^ In the original paper by Metropolis et al. (1953), $f$ was actually the Boltzmann distribution, as it was applied to physical systems in the context of statistical mechanics (e.g., a maximal-entropy distribution of microstates for a given temperature at thermal equilibrium). Consequently, the acceptance ratio was itself an exponential of the difference in the parameters of the numerator and denominator of this ratio.

Further reading

  • Bernd A. Berg. Markov Chain Monte Carlo Simulations and Their Statistical Analysis. Singapore, World Scientific, 2004.
  • Siddhartha Chib and Edward Greenberg: "Understanding the Metropolis–Hastings Algorithm". American Statistician, 49(4), 327–335, 1995
  • David D. L. Minh and Do Le Minh. "Understanding the Hastings Algorithm." Communications in Statistics - Simulation and Computation, 44:2 332-349, 2015
  • Bolstad, William M. (2010) Understanding Computational Bayesian Statistics, John Wiley & Sons ISBN 0-470-04609-0
