
Probability distribution fitting

Probability distribution fitting or simply distribution fitting is the fitting of a probability distribution to a series of data concerning the repeated measurement of a variable phenomenon. The aim of distribution fitting is to predict the probability or to forecast the frequency of occurrence of the magnitude of the phenomenon in a certain interval.

There are many probability distributions (see list of probability distributions) of which some can be fitted more closely to the observed frequency of the data than others, depending on the characteristics of the phenomenon and of the distribution. The distribution giving a close fit is supposed to lead to good predictions. In distribution fitting, therefore, one needs to select a distribution that suits the data well.

Selection of distribution

 
Different shapes of the symmetrical normal distribution depending on mean μ and variance σ²

The selection of the appropriate distribution depends on the presence or absence of symmetry of the data set with respect to the central tendency.

Symmetrical distributions

When the data are symmetrically distributed around the mean while the frequency of occurrence of data farther away from the mean diminishes, one may for example select the normal distribution, the logistic distribution, or the Student's t-distribution. The first two are very similar, while the last, with one degree of freedom, has "heavier tails" meaning that the values farther away from the mean occur relatively more often (i.e. the kurtosis is higher). The Cauchy distribution is also symmetric.

Skew distributions to the right

 
Skewness to left and right

When the larger values tend to be farther away from the mean than the smaller values, one has a skew distribution to the right (i.e. there is positive skewness), and one may for example select the log-normal distribution (i.e. the log values of the data are normally distributed), the log-logistic distribution (i.e. the log values of the data follow a logistic distribution), the Gumbel distribution, the exponential distribution, the Pareto distribution, the Weibull distribution, the Burr distribution, or the Fréchet distribution. The last four distributions are bounded to the left.

Skew distributions to the left

When the smaller values tend to be farther away from the mean than the larger values, one has a skew distribution to the left (i.e. there is negative skewness), and one may for example select the square-normal distribution (i.e. the normal distribution applied to the square of the data values),[1] the inverted (mirrored) Gumbel distribution,[1] the Dagum distribution (mirrored Burr distribution), or the Gompertz distribution, which is bounded to the left.

Techniques of fitting

The following techniques of distribution fitting exist:[2]

  • Parametric methods, by which the parameters of the distribution are calculated from the data series.[3] The parametric methods are:
    • Method of moments
    • Maximum spacing estimation
    • Method of L-moments[4]
    • Maximum likelihood method[5]

For example, the parameter μ (the expectation) can be estimated by the mean of the data and the parameter σ² (the variance) can be estimated from the standard deviation of the data. The mean is found as m = ΣX/n, where X is the data value and n the number of data, while the standard deviation is calculated as s = √(Σ(X − m)²/(n − 1)). With these parameters many distributions, e.g. the normal distribution, are completely defined.
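As an illustration, the moment estimates m and s can be computed directly; this is a minimal sketch in plain Python, with an invented data set:

```python
import math

def fit_normal_moments(data):
    """Method-of-moments estimates for a normal distribution:
    the sample mean estimates mu, the sample standard deviation estimates sigma."""
    n = len(data)
    m = sum(data) / n                                          # m = sum(X)/n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (n - 1))   # sample standard deviation
    return m, s

# Hypothetical repeated measurements of a variable phenomenon:
mu_hat, sigma_hat = fit_normal_moments([2.1, 2.9, 3.4, 3.8, 4.1, 4.6])
```

With these two numbers the fitted normal distribution is completely defined.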
 
Cumulative Gumbel distribution fitted to maximum one-day October rainfalls in Suriname by the regression method, with added confidence band, using CumFreq
  • Regression method, using a transformation of the cumulative distribution function so that a linear relation is found between the cumulative probability and the values of the data, which may also need to be transformed, depending on the selected probability distribution. In this method the cumulative probability is estimated by the plotting position.

For example, the cumulative Gumbel distribution can be linearized to Y = aX + b, where X is the data variable and Y = −ln(−ln P), with P being the cumulative probability, i.e. the probability that the data value is less than X. Thus, using the plotting position for P, one finds the parameters a and b from a linear regression of Y on X, and the Gumbel distribution is fully defined.
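A sketch of this regression method for the Gumbel case. The Weibull plotting position P_i = i/(n+1) is assumed here (other plotting positions exist), and the data values are invented:

```python
import math

def fit_gumbel_regression(data):
    """Linearize the Gumbel CDF: with Y = -ln(-ln P), the model is Y = aX + b,
    so a and b follow from an ordinary least-squares fit of Y on the sorted data."""
    xs = sorted(data)
    n = len(xs)
    # plotting positions P_i = i/(n+1) for the i-th smallest value (assumption)
    ys = [-math.log(-math.log(i / (n + 1))) for i in range(1, n + 1)]
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

a, b = fit_gumbel_regression([12.0, 18.5, 21.0, 25.3, 30.1, 41.7])
```

Once a and b are known, the cumulative probability of any value X follows from P = exp(−exp(−(aX + b))).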

Generalization of distributions

It is customary to transform data logarithmically to fit symmetrical distributions (like the normal and logistic) to data obeying a distribution that is positively skewed (i.e. skew to the right, with mean > mode, and with a right hand tail that is longer than the left hand tail), see lognormal distribution and the loglogistic distribution. A similar effect can be achieved by taking the square root of the data.

To fit a symmetrical distribution to data obeying a negatively skewed distribution (i.e. skewed to the left, with mean < mode, and with a right hand tail that is shorter than the left hand tail) one could use the squared values of the data to accomplish the fit.

More generally one can raise the data to a power p in order to fit symmetrical distributions to data obeying a distribution of any skewness, whereby p < 1 when the skewness is positive and p > 1 when the skewness is negative. The optimal value of p is to be found by a numerical method. The numerical method may consist of assuming a range of p values, then applying the distribution fitting procedure repeatedly for all the assumed p values, and finally selecting the value of p for which the sum of squares of deviations of calculated probabilities from measured frequencies (chi squared) is minimum, as is done in CumFreq.
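The p-search described above can be sketched as a grid search. The skewness-based score below is an invented stand-in for the chi-squared criterion the text mentions, chosen only to keep the example self-contained:

```python
import math

def sample_skewness(values):
    """Moment-based sample skewness (used here only as a toy fit score)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum(((v - m) / s) ** 3 for v in values) / n

def best_power(data, score, powers=None):
    """Grid-search the power p: transform the data as x**p, score the fit of a
    symmetrical distribution to the transformed values, and keep the p with the
    smallest score. `score` is a caller-supplied goodness-of-fit measure."""
    if powers is None:
        powers = [0.1 * k for k in range(1, 31)]  # p from 0.1 to 3.0
    return min(powers, key=lambda p: score([x ** p for x in data]))

# Positively skewed data, so the selected p should be below 1:
p_opt = best_power([1, 1, 1, 2, 2, 3, 5, 9], lambda v: abs(sample_skewness(v)))
```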

The generalization enhances the flexibility of probability distributions and increases their applicability in distribution fitting.[6]

The versatility of generalization makes it possible, for example, to fit approximately normally distributed data sets to a large number of different probability distributions,[7] while negatively skewed data sets can be fitted to square-normal and mirrored Gumbel distributions.[8]

Inversion of skewness

 
(A) Gumbel probability distribution skew to right and (B) Gumbel mirrored skew to left

Skewed distributions can be inverted (or mirrored) by replacing the cumulative distribution function F in its mathematical expression by its complement, F′ = 1 − F, obtaining the complementary distribution function (also called the survival function), which gives a mirror image. In this manner, a distribution that is skewed to the right is transformed into a distribution that is skewed to the left, and vice versa.

Example. The F-expression of the positively skewed Gumbel distribution is F = exp[−exp{−(X − u)/0.78s}], where u is the mode (i.e. the value occurring most frequently) and s is the standard deviation. The Gumbel distribution can be transformed using F′ = 1 − exp[−exp{−(X − u)/0.78s}]. This transformation yields the inverse, mirrored, or complementary Gumbel distribution that may fit a data series obeying a negatively skewed distribution.
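The mirroring step can be written down directly from the F-expression above; a small sketch, using the 0.78s scale from the text's parametrization:

```python
import math

def gumbel_cdf(x, u, s):
    """Cumulative Gumbel distribution F = exp(-exp(-(x - u)/(0.78 s))),
    with mode u and standard deviation s, as parametrized in the text."""
    return math.exp(-math.exp(-(x - u) / (0.78 * s)))

def mirrored_gumbel_cdf(x, u, s):
    """Complement F' = 1 - F, the complementary (survival) function,
    giving the mirror image used for negatively skewed data."""
    return 1.0 - gumbel_cdf(x, u, s)
```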

The technique of skewness inversion increases the number of probability distributions available for distribution fitting and enlarges the distribution fitting opportunities.

Shifting of distributions

Some probability distributions, like the exponential, do not support negative data values (X). Yet, when negative data are present, such distributions can still be used by replacing X with Y = X − Xm, where Xm is the minimum value of X. This replacement represents a shift of the probability distribution in the positive direction, i.e. to the right, because Xm is negative. After completing the distribution fitting of Y, the corresponding X-values are found from X = Y + Xm, which represents a back-shift of the distribution in the negative direction, i.e. to the left.
The technique of distribution shifting augments the chance to find a properly fitting probability distribution.
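The shift-and-back-shift bookkeeping is simple to encode. In this sketch the `fit` argument stands in for any fitting routine for a nonnegative-support distribution; here it is illustrated with the mean of the shifted data, which is the maximum-likelihood estimate of the exponential scale:

```python
def shift_fit(data, fit):
    """Shift the data to the nonnegative range (Y = X - Xm), fit there,
    and return the fitted parameters together with Xm, so predictions
    can later be shifted back via X = Y + Xm."""
    xm = min(data)
    shifted = [x - xm for x in data]   # all values >= 0
    return fit(shifted), xm

# Mean of the shifted data as exponential scale estimate (illustrative data):
scale, xm = shift_fit([-2.0, 0.0, 1.0, 3.0], lambda y: sum(y) / len(y))
```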

Composite distributions

 
Composite (discontinuous) distribution with confidence belt[9]

The option exists to use two different probability distributions, one for the lower data range and one for the higher data range, as for example in the Laplace distribution. The ranges are separated by a break-point. The use of such composite (discontinuous) probability distributions can be opportune when the data of the phenomenon studied were obtained under two different sets of conditions.[6]

Uncertainty of prediction

 
Uncertainty analysis with confidence belts using the binomial distribution[10]

Predictions of occurrence based on fitted probability distributions are subject to uncertainty, which arises from the following conditions:

  • The true probability distribution of events may deviate from the fitted distribution, as the observed data series may not be totally representative of the real probability of occurrence of the phenomenon due to random error
  • The occurrence of events in another situation or in the future may deviate from the fitted distribution as this occurrence can also be subject to random error
  • A change of environmental conditions may cause a change in the probability of occurrence of the phenomenon
 
Variations of nine return-period curves of 50-year samples from a theoretical 1000-year record (base line), data from Benson[11]

An estimate of the uncertainty in the first and second case can be obtained with the binomial probability distribution using for example the probability of exceedance Pe (i.e. the chance that the event X is larger than a reference value Xr of X) and the probability of non-exceedance Pn (i.e. the chance that the event X is smaller than or equal to the reference value Xr, this is also called cumulative probability). In this case there are only two possibilities: either there is exceedance or there is non-exceedance. This duality is the reason that the binomial distribution is applicable.

With the binomial distribution one can obtain a prediction interval. Such an interval also estimates the risk of failure, i.e. the chance that the predicted event still remains outside the confidence interval. The confidence or risk analysis may include the return period T=1/Pe as is done in hydrology.
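The binomial reasoning above can be made concrete. The sketch below computes the probability of each possible number of exceedances over a planning horizon, and the risk of at least one exceedance; the Pe value and horizon are invented:

```python
from math import comb

def exceedance_probabilities(pe, n):
    """Binomial probabilities of observing k = 0..n exceedances of a
    reference value Xr in n independent events, given per-event
    exceedance probability Pe."""
    return [comb(n, k) * pe**k * (1 - pe) ** (n - k) for k in range(n + 1)]

# A 10-year return period (T = 1/Pe, so Pe = 0.1) observed over 10 years:
probs = exceedance_probabilities(0.1, 10)
risk_at_least_one = 1 - probs[0]   # 1 - (1 - Pe)^n, about 0.65
```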

Variance of Bayesian fitted probability functions

A Bayesian approach can be used for fitting a model P(x|θ) having a prior distribution P(θ) for the parameter θ. When one has samples X that are independently drawn from the underlying distribution, one can derive the so-called posterior distribution P(θ|X). This posterior can be used to update the probability mass function for a new sample x given the observations X; one obtains

Pθ(x|X) = ∫ P(x|θ) P(θ|X) dθ.

The variance of the newly obtained probability mass function can also be determined. The variance for a Bayesian probability mass function can be defined as

σ²(Pθ(x|X)) = ∫ (P(x|θ) − Pθ(x|X))² P(θ|X) dθ.

This expression for the variance can be substantially simplified (assuming independently drawn samples). Defining the "self probability mass function" as

Pθ(x|X∪x) = ∫ P(x|θ) P(θ|X∪x) dθ,

one obtains for the variance [12]

σ²(Pθ(x|X)) = Pθ(x|X) [Pθ(x|X∪x) − Pθ(x|X)].

The expression for variance involves an additional fit that includes the sample   of interest.
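For a discrete parameter θ the integrals become sums, and the variance identity can be checked numerically. A minimal sketch; the two-valued coin-bias prior is an invented toy example:

```python
from math import prod

def predictive(prior, likelihood, samples, x):
    """P_theta(x|X) = sum over theta of P(x|theta) P(theta|X),
    the discrete analogue of the integral in the text."""
    weights = {t: p * prod(likelihood(s, t) for s in samples)
               for t, p in prior.items()}
    z = sum(weights.values())
    return sum(likelihood(x, t) * w / z for t, w in weights.items())

def predictive_variance(prior, likelihood, samples, x):
    """Variance via the 'self probability mass function' identity:
    sigma^2 = P(x|X) * (P(x|X and x) - P(x|X)), where the second
    term refits with the sample x itself included."""
    p = predictive(prior, likelihood, samples, x)
    p_self = predictive(prior, likelihood, samples + [x], x)
    return p * (p_self - p)

prior = {0.3: 0.5, 0.7: 0.5}                       # two candidate biases
like = lambda outcome, theta: theta if outcome == 1 else 1 - theta
var = predictive_variance(prior, like, [1, 1, 0], 1)
```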

 
List of probability distributions ranked by goodness of fit.[13]
 
Histogram and probability density of a data set fitting the GEV distribution

Goodness of fit

By ranking the goodness of fit of various distributions one can get an impression of which distribution is acceptable and which is not.

Histogram and density function

From the cumulative distribution function (CDF) one can derive a histogram and the probability density function (PDF).
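One way to do this numerically is by differencing the CDF over bins: the CDF increments give the expected histogram frequencies, and dividing by the bin width approximates the density. A sketch with an invented Gumbel parametrization:

```python
import math

def pdf_from_cdf(cdf, edges):
    """Approximate the density on each bin [a, b] as (CDF(b) - CDF(a)) / (b - a);
    the increments themselves are the expected histogram frequencies."""
    bins = list(zip(edges, edges[1:]))
    return [(cdf(b) - cdf(a)) / (b - a) for a, b in bins]

# Gumbel CDF as in the text (location and scale are invented):
cdf = lambda x: math.exp(-math.exp(-(x - 20.0) / 5.0))
dens = pdf_from_cdf(cdf, [10, 15, 20, 25, 30, 35])
```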

See also

  • Curve fitting
  • Density estimation
  • Mixture distribution
  • Product distribution

References

  1. ^ a b Left (negatively) skewed frequency histograms can be fitted to square Normal or mirrored Gumbel probability functions. On line: [1]
  2. ^ Frequency and Regression Analysis. Chapter 6 in: H.P.Ritzema (ed., 1994), Drainage Principles and Applications, Publ. 16, pp. 175–224, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. ISBN 9070754339. Free download from the webpage [2] under nr. 12, or directly as PDF : [3]
  3. ^ H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)
  4. ^ Hosking, J.R.M. (1990). "L-moments: analysis and estimation of distributions using linear combinations of order statistics". Journal of the Royal Statistical Society, Series B. 52 (1): 105–124. JSTOR 2345653.
  5. ^ Aldrich, John (1997). "R. A. Fisher and the making of maximum likelihood 1912–1922". Statistical Science. 12 (3): 162–176. doi:10.1214/ss/1030037906. MR 1617519.
  6. ^ a b c Software for Generalized and Composite Probability Distributions. International Journal of Mathematical and Computational Methods, 4, 1-9 [4] or [5]
  7. ^ Example of an approximately normally distributed data set to which a large number of different probability distributions can be fitted, [6]
  8. ^ Left (negatively) skewed frequency histograms can be fitted to square normal or mirrored Gumbel probability functions. [7]
  9. ^ Intro to composite probability distributions
  10. ^ Frequency predictions and their binomial confidence limits. In: International Commission on Irrigation and Drainage, Special Technical Session: Economic Aspects of Flood Control and non-Structural Measures, Dubrovnik, Yugoslavia, 1988. On line
  11. ^ Benson, M.A. 1960. Characteristics of frequency curves based on a theoretical 1000 year record. In: T.Dalrymple (Ed.), Flood frequency analysis. U.S. Geological Survey Water Supply Paper, 1543-A, pp. 51-71.
  12. ^ Pijlman; Linnartz (2023). "Variance of Likelihood of data". SITB 2023 Proceedings: 34.
  13. ^ Software for probability distribution fitting
