Wikipedia

Prior probability

A prior probability distribution of an uncertain quantity, often simply called the prior, is its assumed probability distribution before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.

In Bayesian statistics, Bayes' rule prescribes how to update the prior with new information to obtain the posterior probability distribution, which is the conditional distribution of the uncertain quantity given new data. Historically, the choice of priors was often constrained to a conjugate family of a given likelihood function, because this yields a tractable posterior of the same family. The widespread availability of Markov chain Monte Carlo methods, however, has made this less of a concern.

There are many ways to construct a prior distribution.[1] In some cases, a prior may be determined from past information, such as previous experiments. A prior can also be elicited from the purely subjective assessment of an experienced expert.[2][3] When no information is available, an uninformative prior may be adopted as justified by the principle of indifference.[4][5] In modern applications, priors are also often chosen for their mechanical properties, such as regularization and feature selection.[6][7][8]

The prior distributions of model parameters will often depend on parameters of their own. Uncertainty about these hyperparameters can, in turn, be expressed as hyperprior probability distributions. For example, if one uses a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:

  • p is a parameter of the underlying system (Bernoulli distribution), and
  • α and β are parameters of the prior distribution (beta distribution); hence hyperparameters.
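
As a rough illustration of the beta and Bernoulli example above, the following Python sketch updates the hyperparameters α and β after observing some Bernoulli outcomes. The function names `update_beta` and `beta_mean`, the starting values and the data are assumptions made for this sketch; the update rule itself (add successes to α and failures to β) is the standard conjugate result.

```python
def update_beta(alpha, beta, observations):
    """Return the posterior (alpha, beta) after observing 0/1 Bernoulli outcomes."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior hyperparameters and data (6 successes, 2 failures).
prior = (2.0, 2.0)
data = [1, 0, 1, 1, 0, 1, 1, 1]

posterior = update_beta(*prior, data)
print("posterior hyperparameters:", posterior)
print("posterior mean of p:", beta_mean(*posterior))
```

Here p is never set directly: only the hyperparameters of its distribution change as data arrive.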

In principle, priors can be decomposed into many conditional levels of distributions, so-called hierarchical priors.[9]

Informative priors

An informative prior expresses specific, definite information about a variable. An example is a prior distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior a normal distribution with expected value equal to today's noontime temperature, with variance equal to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that day of the year.

This example has a property in common with many priors, namely, that the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature). Pre-existing evidence which has already been taken into account is part of the prior. As more evidence accumulates, the posterior is determined largely by the evidence rather than by any original assumption, provided that the original assumption admitted the possibility of what the evidence is suggesting. The terms "prior" and "posterior" are generally relative to a specific datum or observation.
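
The way a posterior becomes the next prior can be sketched with a normal prior and normally distributed observations of known variance. This is only an illustration under those assumptions: the function name `normal_update` and all numbers are hypothetical, and the formula is the standard conjugate update for a normal mean.

```python
def normal_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate update of a normal prior on a mean, given one observation
    with known observation variance. Returns (posterior_mean, posterior_var)."""
    post_precision = 1.0 / prior_var + 1.0 / obs_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var


# Hypothetical prior and observations (degrees Fahrenheit).
observations = [60.0, 62.0, 58.0]
obs_var = 25.0

# Sequential use: each posterior serves as the prior for the next observation.
seq_mean, seq_var = 50.0, 100.0
for obs in observations:
    seq_mean, seq_var = normal_update(seq_mean, seq_var, obs, obs_var)

# Batch use: one update with the sample mean, whose variance is obs_var / n.
n = len(observations)
batch_mean, batch_var = normal_update(
    50.0, 100.0, sum(observations) / n, obs_var / n
)

print(seq_mean, seq_var)
print(batch_mean, batch_var)
```

Both routes give the same posterior, which is why the labels "prior" and "posterior" only make sense relative to a particular observation.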

Weakly informative priors

A weakly informative prior expresses partial information about a variable. An example, when setting the prior distribution for the temperature at noon tomorrow in St. Louis, is a normal distribution with mean 50 degrees Fahrenheit and standard deviation 40 degrees, which very loosely constrains the temperature to the range (10 degrees, 90 degrees) with a small chance of being below -30 degrees or above 130 degrees. The purpose of a weakly informative prior is regularization, that is, to keep inferences in a reasonable range.
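
To make "very loosely constrains" concrete, the probabilities implied by this normal prior can be computed directly. The sketch below uses only the Python standard library; the helper name `normal_cdf` is an assumption of this example.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


mean, sd = 50.0, 40.0  # prior from the St. Louis example, degrees Fahrenheit

# Prior mass within one standard deviation: between 10 and 90 degrees.
p_within = normal_cdf(90.0, mean, sd) - normal_cdf(10.0, mean, sd)

# Prior mass beyond two standard deviations: below -30 or above 130 degrees.
p_extreme = normal_cdf(-30.0, mean, sd) + (1.0 - normal_cdf(130.0, mean, sd))

print(f"P(10 < T < 90)        = {p_within:.4f}")
print(f"P(T < -30 or T > 130) = {p_extreme:.4f}")
```

About 68% of the prior mass lies in (10, 90) and about 4.6% lies outside (-30, 130), so the prior rules out little but still discourages absurd values.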

Uninformative priors

An uninformative, flat, or diffuse prior expresses vague or general information about a variable.[4] The term "uninformative prior" is somewhat of a misnomer. Such a prior might also be called a not very informative prior, or an objective prior, i.e. one that's not subjectively elicited.

Uninformative priors can express "objective" information such as "the variable is positive" or "the variable is less than some limit". The simplest and oldest rule for determining a non-informative prior is the principle of indifference, which assigns equal probabilities to all possibilities. In parameter estimation problems, the use of an uninformative prior typically yields results which are not too different from conventional statistical analysis, as the likelihood function often yields more information than the uninformative prior.

Some attempts have been made at finding a priori probabilities, i.e. probability distributions in some sense logically required by the nature of one's state of uncertainty; these are a subject of philosophical controversy, with Bayesians being roughly divided into two schools: "objective Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians" who believe that in practice priors usually represent subjective judgements of opinion that cannot be rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism were given by Edwin T. Jaynes, based mainly on the consequences of symmetries and on the principle of maximum entropy.

As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a ball has been hidden under one of three cups, A, B, or C, but no other information is available about its location. In this case a uniform prior of p(A) = p(B) = p(C) = 1/3 seems intuitively like the only reasonable choice. More formally, we can see that the problem remains the same if we swap around the labels ("A", "B" and "C") of the cups. It would therefore be odd to choose a prior for which a permutation of the labels would cause a change in our predictions about which cup the ball will be found under; the uniform prior is the only one which preserves this invariance. If one accepts this invariance principle then one can see that the uniform prior is the logically correct prior to represent this state of knowledge. This prior is "objective" in the sense of being the correct choice to represent a particular state of knowledge, but it is not objective in the sense of being an observer-independent feature of the world: in reality the ball exists under a particular cup, and it only makes sense to speak of probabilities in this situation if there is an observer with limited knowledge about the system.[10]

As a more contentious example, Jaynes published an argument based on the invariance of the prior under a change of parameters that suggests that the prior representing complete uncertainty about a probability should be the Haldane prior $p^{-1}(1-p)^{-1}$.[11] The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior[12] gives by far the most weight to $p = 0$ and $p = 1$, indicating that the sample will either dissolve every time or never dissolve, with equal probability. However, if one has observed samples of the chemical to dissolve in one experiment and not to dissolve in another experiment then this prior is updated to the uniform distribution on the interval [0, 1]. This is obtained by applying Bayes' theorem to the data set consisting of one observation of dissolving and one of not dissolving, using the above prior. The Haldane prior is an improper prior distribution (meaning that it has an infinite mass). Harold Jeffreys devised a systematic way of designing uninformative priors, e.g., the Jeffreys prior $p^{-1/2}(1-p)^{-1/2}$ for the Bernoulli random variable.
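
The update from the Haldane prior to the uniform distribution can be written out in one line. The symbols $s$ and $f$ (numbers of observed successes and failures) are notation introduced here for illustration. With one observation of dissolving and one of not dissolving, the likelihood is $p(1-p)$, so

```latex
p(p \mid \text{data}) \;\propto\; \underbrace{p^{-1}(1-p)^{-1}}_{\text{Haldane prior}} \times \underbrace{p\,(1-p)}_{\text{likelihood}} \;=\; 1, \qquad 0 \le p \le 1 .
```

More generally, $s$ successes and $f$ failures give a posterior proportional to $p^{s-1}(1-p)^{f-1}$, which is a proper beta distribution only when both $s \ge 1$ and $f \ge 1$.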

Priors can be constructed which are proportional to the Haar measure if the parameter space X carries a natural group structure which leaves invariant our Bayesian state of knowledge.[11] This can be seen as a generalisation of the invariance principle used to justify the uniform prior over the three cups in the example above. For example, in physics we might expect that an experiment will give the same results regardless of our choice of the origin of a coordinate system. This induces the group structure of the translation group on X, which determines the prior probability as a constant improper prior. Similarly, some measurements are naturally invariant to the choice of an arbitrary scale (e.g., whether centimeters or inches are used, the physical results should be equal). In such a case, the scale group is the natural group structure, and the corresponding prior on X is proportional to 1/x. It sometimes matters whether we use the left-invariant or right-invariant Haar measure. For example, the left and right invariant Haar measures on the affine group are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct choice.
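
As a short check of the scale-invariance claim, consider a change of units $y = cx$ with a constant $c > 0$ (the symbols $y$ and $c$ are introduced here for illustration). The prior proportional to $1/x$ keeps the same form in the new units:

```latex
p_Y(y) \;=\; p_X\!\left(\frac{y}{c}\right)\left|\frac{dx}{dy}\right| \;\propto\; \frac{c}{y}\cdot\frac{1}{c} \;=\; \frac{1}{y} .
```

Equivalently, the prior mass assigned to an interval $[a, b]$ is proportional to $\log(b/a)$, which does not change when $a$ and $b$ are both multiplied by $c$.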

Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum entropy (MAXENT). The motivation is that the Shannon entropy of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on X, one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and unit variance is the standard normal distribution. The principle of minimum cross-entropy generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitable constraints in the maximum-entropy sense.
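
The discrete case can be checked numerically. The sketch below compares the Shannon entropy of a uniform distribution on four states with two less uniform alternatives; the function name `shannon_entropy` and the example distributions are assumptions of this illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability states contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
point_mass = [1.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform), ("skewed", skewed), ("point mass", point_mass)]:
    print(f"{name:10s} entropy = {shannon_entropy(dist):.4f}")
```

The uniform distribution attains the maximum, log 4, while the point mass has zero entropy, matching the reading of entropy as an inverse measure of information.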

A related idea, reference priors, was introduced by José-Miguel Bernardo. Here, the idea is to maximize the expected Kullback–Leibler divergence of the posterior distribution relative to the prior. This maximizes the expected posterior information about X when the prior density is p(x); thus, in some sense, p(x) is the "least informative" prior about X. The reference prior is defined in the asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points goes to infinity. In the present case, the KL divergence between the prior and posterior distributions is given by

$$KL = \int p(t) \int p(x \mid t) \log \frac{p(x \mid t)}{p(x)} \, dx \, dt$$

Here, $t$ is a sufficient statistic for some parameter $x$. The inner integral is the KL divergence between the posterior $p(x \mid t)$ and prior $p(x)$ distributions and the result is the weighted mean over all values of $t$. Splitting the logarithm into two parts, reversing the order of integrals in the second part and noting that $\log [p(x)]$ does not depend on $t$ yields

$$KL = \int p(t) \int p(x \mid t) \log [p(x \mid t)] \, dx \, dt \; - \int \log [p(x)] \int p(t) \, p(x \mid t) \, dt \, dx$$

The inner integral in the second part is the integral over $t$ of the joint density $p(x, t)$. This is the marginal distribution $p(x)$, so we have

$$KL = \int p(t) \int p(x \mid t) \log [p(x \mid t)] \, dx \, dt \; - \int p(x) \log [p(x)] \, dx$$

Now we use the concept of entropy which, in the case of probability distributions, is the negative expected value of the logarithm of the probability mass or density function, or $H(x) = -\int p(x) \log [p(x)] \, dx$. Using this in the last equation yields

$$KL = -\int p(t) \, H(x \mid t) \, dt + H(x)$$

In words, KL is the negative expected value over $t$ of the entropy of $x$ conditional on $t$ plus the marginal (i.e. unconditional) entropy of $x$. In the limiting case where the sample size tends to infinity, the Bernstein-von Mises theorem states that the distribution of $x$ conditional on a given observed value of $t$ is normal with a variance equal to the reciprocal of the Fisher information at the 'true' value of $x$. The entropy of a normal density function is equal to half the logarithm of $2 \pi e v$ where $v$ is the variance of the distribution. In this case therefore

$$H = \log \sqrt{\frac{2 \pi e}{N I(x^*)}}$$

where $N$ is the arbitrarily large sample size (to which Fisher information is proportional) and $x^*$ is the 'true' value. Since this does not depend on $t$ it can be taken out of the integral, and as this integral is over a probability space it equals one. Hence we can write the asymptotic form of KL as

$$KL = -\log \left( \frac{1}{\sqrt{k I(x^*)}} \right) - \int p(x) \log [p(x)] \, dx$$

where $k$ is proportional to the (asymptotically large) sample size. We do not know the value of $x^*$. Indeed, the very idea goes against the philosophy of Bayesian inference in which 'true' values of parameters are replaced by prior and posterior distributions. So we remove $x^*$ by replacing it with $x$ and taking the expected value of the normal entropy, which we obtain by multiplying by $p(x)$ and integrating over $x$. This allows us to combine the logarithms yielding

$$KL = -\int p(x) \log \left[ \frac{p(x)}{\sqrt{k I(x)}} \right] dx$$

This is the negative of a quasi-KL divergence ("quasi" in the sense that the square root of the Fisher information may be the kernel of an improper distribution). Because of the minus sign, the quasi-KL divergence must be minimised in order to maximise the KL divergence with which we started. Its minimum occurs where the two distributions in the logarithm argument, improper or not, do not diverge. This in turn occurs when the prior distribution is proportional to the square root of the Fisher information of the likelihood function. Hence in the single parameter case, reference priors and Jeffreys priors are identical, even though Jeffreys has a very different rationale.
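
As a worked instance of "proportional to the square root of the Fisher information", consider a single Bernoulli observation $y \in \{0, 1\}$ with success probability $p$ (the symbol $y$ is notation introduced here for illustration). The log-likelihood is $\ell(p) = y \log p + (1 - y) \log (1 - p)$, so

```latex
I(p) \;=\; \mathbb{E}\!\left[-\frac{\partial^2 \ell}{\partial p^2}\right]
      \;=\; \mathbb{E}\!\left[\frac{y}{p^2} + \frac{1-y}{(1-p)^2}\right]
      \;=\; \frac{1}{p} + \frac{1}{1-p}
      \;=\; \frac{1}{p\,(1-p)},
\qquad
\sqrt{I(p)} \;=\; p^{-1/2}(1-p)^{-1/2} .
```

This recovers the Jeffreys prior for the Bernoulli random variable.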

Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., Jeffreys' rule) may result in priors with problematic behavior.

Objective prior distributions may also be derived from other principles, such as information or coding theory (see e.g. minimum description length) or frequentist statistics (so-called probability matching priors).[13] Such methods are used in Solomonoff's theory of inductive inference. Methods for constructing objective priors have recently been introduced in bioinformatics, especially for inference in cancer systems biology, where sample size is limited and a vast amount of prior knowledge is available. These methods use an information-theoretic criterion, such as the KL divergence or the log-likelihood function, for binary supervised learning problems[14] and mixture model problems.[15]

Philosophical problems with uninformative priors arise from the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, which is proportional to the reciprocal of his speed. These are very different priors, but it is not clear which is to be preferred. Jaynes' method of transformation groups can answer this question in some situations.[16]

Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely, and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely, the logarithmic prior, which is the uniform prior on the logarithm of proportion. The Jeffreys prior attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion $p$ is $p^{-1/2}(1-p)^{-1/2}$, which differs from Jaynes' recommendation.

Priors based on notions of algorithmic probability are used in inductive inference as a basis for induction in very general settings.

Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper. Another issue of importance is that if an uninformative prior is to be used routinely, i.e., with many different data sets, it should have good frequentist properties. Normally a Bayesian would not be concerned with such issues, but it can be important in this situation. For example, one would want any decision rule based on the posterior distribution to be admissible under the adopted loss function. Unfortunately, admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with hierarchical Bayes models; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.

Improper priors

Let events $A_1, A_2, \ldots, A_n$ be mutually exclusive and exhaustive. If Bayes' theorem is written as

$$P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{\sum_j P(B \mid A_j) \, P(A_j)}$$

then it is clear that the same result would be obtained if all the prior probabilities $P(A_i)$ and $P(A_j)$ were multiplied by a given constant; the same would be true for a continuous random variable. If the summation in the denominator converges, the posterior probabilities will still sum (or integrate) to 1 even if the prior values do not, and so the priors may only need to be specified in the correct proportion. Taking this idea further, in many cases the sum or integral of the prior values may not even need to be finite to get sensible answers for the posterior probabilities. When this is the case, the prior is called an improper prior. However, the posterior distribution need not be a proper distribution if the prior is improper.[17] This is clear from the case where event $B$ is independent of all of the $A_j$.
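
The point that only the proportions of the prior values matter can be checked with a small discrete example. The function name `posterior` and the numbers below are assumptions made for this sketch.

```python
def posterior(prior_weights, likelihoods):
    """Bayes' theorem for mutually exclusive, exhaustive events A_j.
    prior_weights need not sum to 1; only their proportions matter."""
    joint = [w * l for w, l in zip(prior_weights, likelihoods)]
    total = sum(joint)  # denominator: sum over j of P(B | A_j) * weight_j
    return [j / total for j in joint]


# Hypothetical likelihoods P(B | A_j) for three events.
likelihoods = [0.9, 0.5, 0.1]

normalized = posterior([0.2, 0.3, 0.5], likelihoods)  # prior sums to 1
rescaled = posterior([20.0, 30.0, 50.0], likelihoods)  # same prior times 100

print(normalized)
print(rescaled)
```

Both calls return the same posterior probabilities, which sum to 1 although the rescaled weights do not.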

Statisticians sometimes use improper priors as uninformative priors.[18] For example, if they need a prior distribution for the mean and variance of a random variable, they may assume p(m, v) ~ 1/v (for v > 0) which would suggest that any value for the mean is "equally likely" and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the danger of over-interpreting those priors since they are not probability densities. The only relevance they have is found in the corresponding posterior, as long as it is well-defined for all observations. (The Haldane prior is a typical counterexample.)

By contrast, likelihood functions do not need to be integrated, and a likelihood function that is uniformly 1 corresponds to the absence of data (all models are equally likely, given no data): Bayes' rule multiplies a prior by the likelihood, and an empty product is just the constant likelihood 1. However, without starting with a prior probability distribution, one does not end up getting a posterior probability distribution, and thus cannot integrate or compute expected values or loss. See Likelihood function § Non-integrability for details.

Examples

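Examples of improper priors include:

  • the uniform distribution on an infinite interval (i.e., a half-line or the entire real line);
  • Beta(0,0), the beta distribution for α = 0, β = 0 (uniform distribution on the log-odds scale);
  • the logarithmic prior on the positive reals (uniform distribution on the log scale).

These functions, interpreted as uniform distributions, can also be interpreted as the likelihood function in the absence of data, but are not proper priors.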

Prior probability in statistical mechanics

While in Bayesian statistics the prior probability is used to represent initial beliefs about an uncertain parameter, in statistical mechanics the a priori probability is used to describe the initial state of a system.[19] The classical version is defined as the ratio of the number of elementary events (e.g. the number of times a die is thrown) to the total number of events—and these considered purely deductively, i.e. without any experimenting. In the case of the die if we look at it on the table without throwing it, each elementary event is reasoned deductively to have the same probability—thus the probability of each outcome of an imaginary throwing of the (perfect) die or simply by counting the number of faces is 1/6. Each face of the die appears with equal probability—probability being a measure defined for each elementary event. The result is different if we throw the die twenty times and ask how many times (out of 20) the number 6 appears on the upper face. In this case time comes into play and we have a different type of probability depending on time or the number of times the die is thrown. On the other hand, the a priori probability is independent of time—you can look at the die on the table as long as you like without touching it and you deduce the probability for the number 6 to appear on the upper face is 1/6.

In statistical mechanics, e.g. that of a gas contained in a finite volume $V$, both the spatial coordinates $q_i$ and the momentum coordinates $p_i$ of the individual gas elements (atoms or molecules) are finite in the phase space spanned by these coordinates. In analogy to the case of the die, the a priori probability is here (in the case of a continuum) proportional to the phase space volume element $\Delta q \, \Delta p$ divided by $h$, and is the number of standing waves (i.e. states) therein, where $\Delta q$ is the range of the variable $q$ and $\Delta p$ is the range of the variable $p$ (here for simplicity considered in one dimension). In 1 dimension (length $L$) this number or statistical weight or a priori weighting is $L \, \Delta p / h$. In customary 3 dimensions (volume $V$) the corresponding number can be calculated to be $V \, 4 \pi p^2 \, \Delta p / h^3$.[20] In order to understand this quantity as giving a number of states in quantum (i.e. wave) mechanics, recall that in quantum mechanics every particle is associated with a matter wave which is the solution of a Schrödinger equation. In the case of free particles (of energy $\epsilon = \mathbf{p}^2 / 2m$) like those of a gas in a box of volume $V = L^3$ such a matter wave is explicitly

$$\psi \propto \sin \left( \frac{l \pi x}{L} \right) \sin \left( \frac{m \pi y}{L} \right) \sin \left( \frac{n \pi z}{L} \right),$$

where $l, m, n$ are integers. The number of different $(l, m, n)$ values and hence states in the region between $p$ and $p + dp$, with $p^2 = \mathbf{p}^2$, is then found to be the above expression $V \, 4 \pi p^2 \, dp / h^3$ by considering the area covered by these points. Moreover, in view of the uncertainty relation, which in 1 spatial dimension is

$$\Delta q \, \Delta p \geq h,$$

these states are indistinguishable (i.e. these states do not carry labels). An important consequence is a result known as Liouville's theorem, i.e. the time independence of this phase space volume element and thus of the a priori probability. A time dependence of this quantity would imply known information about the dynamics of the system, and hence would not be an a priori probability.[21] Thus the region

$$\Omega := \frac{\Delta q \, \Delta p}{\int \Delta q \, \Delta p}, \qquad \int \Delta q \, \Delta p = \text{const.},$$

when differentiated with respect to time $t$ yields zero (with the help of Hamilton's equations): The volume at time $t$ is the same as at time zero. One describes this also as conservation of information.

In the full quantum theory one has an analogous conservation law. In this case, the phase space region is replaced by a subspace of the space of states expressed in terms of a projection operator $P$, and instead of the probability in phase space, one has the probability density

$$\Sigma := \frac{P}{\operatorname{Tr}(P)}, \qquad N = \operatorname{Tr}(P) = \text{const.},$$

where $N$ is the dimensionality of the subspace. The conservation law in this case is expressed by the unitarity of the S-matrix. In either case, the considerations assume a closed isolated system. This closed isolated system is a system with (1) a fixed energy $E$ and (2) a fixed number of particles $N$ in (3) a state of equilibrium. If one considers a huge number of replicas of this system, one obtains what is called a microcanonical ensemble. It is for this system that one postulates in quantum statistics the "fundamental postulate of equal a priori probabilities of an isolated system." This says that the isolated system in equilibrium occupies each of its accessible states with the same probability. This fundamental postulate therefore allows us to equate the a priori probability to the degeneracy of a system, i.e. to the number of different states with the same energy.

Example

The following example illustrates the a priori probability (or a priori weighting) in (a) classical and (b) quantal contexts.

(a) Classical a priori probability

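Consider the rotational energy $E$ of a diatomic molecule with moment of inertia $I$ in spherical polar coordinates $\theta, \phi$ (this means $q$ above is here $\theta, \phi$), i.e.

$$E = \frac{1}{2I} \left( p_\theta^2 + \frac{p_\phi^2}{\sin^2 \theta} \right).$$

The $(p_\theta, p_\phi)$-curve for constant $E$ and $\theta$ is an ellipse of area

$$\oint dp_\theta \, dp_\phi = \pi \sqrt{2IE} \, \sqrt{2IE} \sin \theta = 2 \pi I E \sin \theta.$$

By integrating over $\theta$ and $\phi$ the total volume of phase space covered for constant energy $E$ is

$$\int_0^{\phi = 2\pi} \int_0^{\theta = \pi} 2 \pi I E \sin \theta \, d\theta \, d\phi = 8 \pi^2 I E = \oint dp_\theta \, dp_\phi \, d\theta \, d\phi,$$

and hence the classical a priori weighting in the energy range $dE$ is

$$\Omega \propto (\text{phase space volume at } E + dE) - (\text{phase space volume at } E) = 8 \pi^2 I \, dE.$$

(b) Quantum a priori probability

Assuming that the number of quantum states in a range $\Delta q \, \Delta p$ for each direction of motion is given, per element, by a factor $\Delta q \, \Delta p / h$, the number of states in the energy range $dE$ is, as seen under (a), $8 \pi^2 I \, dE / h^2$ for the rotating diatomic molecule. From wave mechanics it is known that the energy levels of a rotating diatomic molecule are given by

$$E_n = \frac{n (n + 1) h^2}{8 \pi^2 I},$$

each such level being $(2n+1)$-fold degenerate. By evaluating $dn / dE_n = 1 / (dE_n / dn)$ one obtains

$$\frac{dn}{dE_n} = \frac{8 \pi^2 I}{(2n + 1) h^2}, \qquad (2n + 1) \, dn = \frac{8 \pi^2 I}{h^2} \, dE_n.$$

Thus by comparison with $\Omega$ above, one finds that the approximate number of states in the range $dE$ is given by the degeneracy, i.e.

$$\Sigma \propto (2n + 1) \, dn.$$

Thus the a priori weighting in the classical context (a) corresponds to the a priori weighting here in the quantal context (b). In the case of the one-dimensional simple harmonic oscillator of natural frequency $\nu$ one finds correspondingly: (a) $\Omega \propto dE / \nu$, and (b) $\Sigma \propto dn$ (no degeneracy). Thus in quantum mechanics the a priori probability is effectively a measure of the degeneracy, i.e. the number of states having the same energy.

In the case of the hydrogen atom or Coulomb potential (where the evaluation of the phase space volume for constant energy is more complicated) one knows that the quantum mechanical degeneracy is $n^2$ with $E \propto 1 / n^2$. Thus in this case $\Sigma \propto n^2 \, dn$.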
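
The correspondence between the classical and quantal weightings for the rotating diatomic molecule can be checked by counting states. The sketch below is an illustration only: it assumes units in which h²/(8π²I) = 1, so that the energy levels are E_n = n(n + 1) and the classical phase space volume 8π²IE divided by h² is simply E. The function names are hypothetical.

```python
def quantum_states_up_to(n_max):
    """Number of rotor states with level index n <= n_max,
    counting each level as (2n + 1)-fold degenerate."""
    return sum(2 * n + 1 for n in range(n_max + 1))


def classical_states_up_to(n_max):
    """Classical phase space volume 8*pi^2*I*E divided by h^2, evaluated at
    E = E_{n_max}, in units where h^2 / (8*pi^2*I) = 1, so E_n = n(n + 1)."""
    return n_max * (n_max + 1)


for n_max in (10, 100, 1000):
    quantum = quantum_states_up_to(n_max)
    classical = classical_states_up_to(n_max)
    print(n_max, quantum, classical, quantum / classical)
```

The ratio of the two counts tends to 1 as the level index grows, which is the sense in which the degeneracy (2n + 1) dn reproduces the classical weighting.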

A priori probability and distribution functions

In statistical mechanics (see any book) one derives the so-called distribution functions $f$ for various statistics. In the case of Fermi–Dirac statistics and Bose–Einstein statistics these functions are respectively

$$f_i^{FD} = \frac{1}{e^{(\epsilon_i - \epsilon_0)/kT} + 1}, \qquad f_i^{BE} = \frac{1}{e^{(\epsilon_i - \epsilon_0)/kT} - 1}.$$

These functions are derived for (1) a system in dynamic equilibrium (i.e. under steady, uniform conditions) with (2) total (and huge) number of particles $N = \sum_i n_i$ (this condition determines the constant $\epsilon_0$), and (3) total energy $E = \sum_i n_i \epsilon_i$, i.e. with each of the $n_i$ particles having the energy $\epsilon_i$. An important aspect in the derivation is the taking into account of the indistinguishability of particles and states in quantum statistics, i.e. there particles and states do not have labels. In the case of fermions, like electrons, obeying the Pauli principle (only one particle per state or none allowed), one has therefore

$$0 \leq f_i^{FD} \leq 1, \qquad \text{whereas} \qquad 0 \leq f_i^{BE} \leq \infty.$$

Thus $f_i^{FD}$ is a measure of the fraction of states actually occupied by electrons at energy $\epsilon_i$ and temperature $T$. On the other hand, the a priori probability $g_i$ is a measure of the number of wave mechanical states available. Hence

$$n_i = f_i \, g_i.$$
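
As a small numerical illustration, the two distribution functions and the occupation number can be evaluated directly. The function names, the choice of units (energies and kT in the same arbitrary unit) and the sample values are assumptions made for this sketch.

```python
import math


def fermi_dirac(eps, eps0, kT):
    """Fermi-Dirac distribution function at energy eps."""
    return 1.0 / (math.exp((eps - eps0) / kT) + 1.0)


def bose_einstein(eps, eps0, kT):
    """Bose-Einstein distribution function; meaningful only for eps > eps0."""
    return 1.0 / (math.exp((eps - eps0) / kT) - 1.0)


def occupation(f, g):
    """Occupation number: distribution function times a priori weighting g."""
    return f * g


eps0, kT = 1.0, 0.5  # hypothetical values in arbitrary energy units
for eps in (1.1, 1.5, 2.0, 3.0):
    fd = fermi_dirac(eps, eps0, kT)
    be = bose_einstein(eps, eps0, kT)
    print(f"eps={eps:.1f}  f_FD={fd:.4f}  f_BE={be:.4f}  n_FD(g=4)={occupation(fd, 4):.4f}")
```

The Fermi–Dirac values stay between 0 and 1, whereas the Bose–Einstein values grow without bound as the energy approaches the constant from above.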

Since $n_i$ is constant under uniform conditions (as many particles as flow out of a volume element also flow in steadily, so that the situation in the element appears static), i.e. independent of time $t$, and $g_i$ is also independent of time $t$ as shown earlier, we obtain

$$\frac{df_i}{dt} = 0, \qquad f_i = f_i(t, \mathbf{v}_i, \mathbf{r}_i).$$

Expressing this equation in terms of its partial derivatives, one obtains the Boltzmann transport equation. How do coordinates $\mathbf{r}$ etc. appear here suddenly? Above no mention was made of electric or other fields. Thus with no such fields present we have the Fermi-Dirac distribution as above. But with such fields present we have this additional dependence of $f$.

Notes

  1. ^ Robert, Christian (1994). "From Prior Information to Prior Distributions". The Bayesian Choice. New York: Springer. pp. 89–136. ISBN 0-387-94296-3.
  2. ^ Chaloner, Kathryn (1996). "Elicitation of Prior Distributions". In Berry, Donald A.; Stangl, Dalene (eds.). Bayesian Biostatistics. New York: Marcel Dekker. pp. 141–156. ISBN 0-8247-9334-X.
  3. ^ Mikkola, Petrus; et al. (2023). "Prior Knowledge Elicitation: The Past, Present, and Future". Bayesian Analysis. Forthcoming. doi:10.1214/23-BA1381. hdl:11336/183197. S2CID 244798734.
  4. ^ a b Zellner, Arnold (1971). "Prior Distributions to Represent 'Knowing Little'". An Introduction to Bayesian Inference in Econometrics. New York: John Wiley & Sons. pp. 41–53. ISBN 0-471-98165-6.
  5. ^ Price, Harold J.; Manson, Allison R. (2001). "Uninformative priors for Bayes' theorem". AIP Conf. Proc. 617: 379–391. doi:10.1063/1.1477060.
  6. ^ Piironen, Juho; Vehtari, Aki (2017). "Sparsity information and regularization in the horseshoe and other shrinkage priors". Electronic Journal of Statistics. 11 (2): 5018–5051. doi:10.1214/17-EJS1337SI.
  7. ^ Simpson, Daniel; et al. (2017). "Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors". Statistical Science. 32 (1): 1–28. arXiv:1403.4630. doi:10.1214/16-STS576. S2CID 88513041.
  8. ^ Fortuin, Vincent (2022). "Priors in Bayesian Deep Learning: A Review". International Statistical Review. 90 (3): 563–591. doi:10.1111/insr.12502. hdl:20.500.11850/547969. S2CID 234681651.
  9. ^ Congdon, Peter D. (2020). "Regression Techniques using Hierarchical Priors". Bayesian Hierarchical Models (2nd ed.). Boca Raton: CRC Press. pp. 253–315. ISBN 978-1-03-217715-1.
  10. ^ Florens, Jean-Pierre; Mouchart, Michael; Rolin, Jean-Marie (1990). "Invariance Arguments in Bayesian Statistics". Economic Decision-Making: Games, Econometrics and Optimisation. North-Holland. pp. 351–367. ISBN 0-444-88422-X.
  11. ^ a b Jaynes, Edwin T. (Sep 1968). "Prior Probabilities" (PDF). IEEE Transactions on Systems Science and Cybernetics. 4 (3): 227–241. doi:10.1109/TSSC.1968.300117.
  12. ^ This prior was proposed by J.B.S. Haldane in "A note on inverse probability", Mathematical Proceedings of the Cambridge Philosophical Society 28, 55–61, 1932, doi:10.1017/S0305004100010495. See also J. Haldane, "The precision of observed values of small frequencies", Biometrika, 35:297–300, 1948, doi:10.2307/2332350, JSTOR 2332350.
  13. ^ Datta, Gauri Sankar; Mukerjee, Rahul (2004). Probability Matching Priors: Higher Order Asymptotics. Springer. ISBN 978-0-387-20329-4.
  14. ^ Esfahani, M. S.; Dougherty, E. R. (2014). "Incorporation of Biological Pathway Knowledge in the Construction of Priors for Optimal Bayesian Classification". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 11 (1): 202–18. doi:10.1109/TCBB.2013.143. PMID 26355519. S2CID 10096507.
  15. ^ Boluki, Shahin; Esfahani, Mohammad Shahrokh; Qian, Xiaoning; Dougherty, Edward R (December 2017). "Incorporating biological prior knowledge for Bayesian learning via maximal knowledge-driven information priors". BMC Bioinformatics. 18 (S14): 552. doi:10.1186/s12859-017-1893-4. ISSN 1471-2105. PMC 5751802. PMID 29297278.
  16. ^ Jaynes (1968), p. 17, see also Jaynes (2003), chapter 12. Note that chapter 12 is not available in the online preprint but can be previewed via Google Books.
  17. ^ Dawid, A. P.; Stone, M.; Zidek, J. V. (1973). "Marginalization Paradoxes in Bayesian and Structural Inference". Journal of the Royal Statistical Society. Series B (Methodological). 35 (2): 189–233. JSTOR 2984907.
  18. ^ Christensen, Ronald; Johnson, Wesley; Branscum, Adam; Hanson, Timothy E. (2010). Bayesian Ideas and Data Analysis : An Introduction for Scientists and Statisticians. Hoboken: CRC Press. p. 69. ISBN 9781439894798.
  19. ^ Iba, Y. (1989). "Bayesian Statistics and Statistical Mechanics". In Takayama, H. (ed.). Cooperative Dynamics in Complex Physical Systems. Springer Series in Synergetics. Vol. 43. Berlin: Springer. doi:10.1007/978-3-642-74554-6_60.
  20. ^ Müller-Kirsten, H. J. W. (2013). Basics of Statistical Physics (2nd ed.). Singapore: World Scientific. Chapter 6.
  21. ^ Ben-Naim, A. (2007). Entropy Demystified. Singapore: World Scientific.

References

  • Bauwens, Luc; Lubrano, Michel; Richard, Jean-François (1999). "Prior Densities for the Regression Model". Bayesian Inference in Dynamic Econometric Models. Oxford University Press. pp. 94–128. ISBN 0-19-877313-7.
  • Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Rubin, Donald B. (2003). Bayesian Data Analysis (2nd ed.). Boca Raton: Chapman & Hall/CRC. ISBN 978-1-58488-388-3. MR 2027492.
  • Berger, James O. (1985). Statistical decision theory and Bayesian analysis. Berlin: Springer-Verlag. ISBN 978-0-387-96098-2. MR 0804611.
  • Berger, James O.; Strawderman, William E. (1996). "Choice of hierarchical priors: admissibility in estimation of normal means". Annals of Statistics. 24 (3): 931–951. doi:10.1214/aos/1032526950. MR 1401831. Zbl 0865.62004.
  • Bernardo, José M. (1979). "Reference Posterior Distributions for Bayesian Inference". Journal of the Royal Statistical Society, Series B. 41 (2): 113–147. JSTOR 2985028. MR 0547240.
  • Berger, James O.; Bernardo, José M.; Sun, Dongchu (2009). "The formal definition of reference priors". Annals of Statistics. 37 (2): 905–938. arXiv:0904.0156. Bibcode:2009arXiv0904.0156B. doi:10.1214/07-AOS587. S2CID 3221355.
  • Jaynes, Edwin T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. ISBN 978-0-521-59271-0.
  • Williamson, Jon (2010). (PDF). Philosophia Mathematica. 18 (1): 130–135. doi:10.1093/philmat/nkp019. Archived from the original (PDF) on 2011-06-09. Retrieved 2010-07-02.

Ben Naim A 2007 Entropy Demystified Singapore World Scientific References EditBauwens Luc Lubrano Michel Richard Jean Francois 1999 Prior Densities for the Regression Model Bayesian Inference in Dynamic Econometric Models Oxford University Press pp 94 128 ISBN 0 19 877313 7 Rubin Donald B Gelman Andrew John B Carlin Stern Hal 2003 Bayesian Data Analysis 2nd ed Boca Raton Chapman amp Hall CRC ISBN 978 1 58488 388 3 MR 2027492 Berger James O 1985 Statistical decision theory and Bayesian analysis Berlin Springer Verlag ISBN 978 0 387 96098 2 MR 0804611 Berger James O Strawderman William E 1996 Choice of hierarchical priors admissibility in estimation of normal means Annals of Statistics 24 3 931 951 doi 10 1214 aos 1032526950 MR 1401831 Zbl 0865 62004 Bernardo Jose M 1979 Reference Posterior Distributions for Bayesian Inference Journal of the Royal Statistical Society Series B 41 2 113 147 JSTOR 2985028 MR 0547240 James O Berger Jose M Bernardo Dongchu Sun 2009 The formal definition of reference priors Annals of Statistics 37 2 905 938 arXiv 0904 0156 Bibcode 2009arXiv0904 0156B doi 10 1214 07 AOS587 S2CID 3221355 Jaynes Edwin T 2003 Probability Theory The Logic of Science Cambridge University Press ISBN 978 0 521 59271 0 Williamson Jon 2010 review of Bruno di Finetti Philosophical Lectures on Probability PDF Philosophia Mathematica 18 1 130 135 doi 10 1093 philmat nkp019 Archived from the original PDF on 2011 06 09 Retrieved 2010 07 02 Retrieved from https en wikipedia org w index php title Prior probability amp oldid 1178672509, wikipedia, wiki, book, books, library,
