
Statistical model

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process.[1] When referring specifically to probabilities, the corresponding term is probabilistic model. All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).[2]

Introduction

Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, treating the two dice as independent, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: e.g. (1 and 2) or (3 and 3) or (5 and 6). The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability of both dice coming up 5: 1/8 × 1/8 = 1/64. We cannot, however, calculate the probability of any other nontrivial event, as the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model because, with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model because, with the assumption alone, we cannot calculate the probability of every event. In the example above, with the first assumption, calculating the probability of an event is easy. With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable: the calculation does not need to be practicable, just theoretically possible.
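The contrast between the two assumptions can be sketched in a few lines of Python. This is only an illustration: the helper `event_probability` and the dictionary representation of a die are assumptions made for the sketch, and the two dice are treated as independent, which is what multiplying 1/6 by 1/6 presupposes.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

# First assumption: every face of each die has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in FACES}

# Alternative assumption: only P(face = 5) = 1/8 is stated; nothing else is known.
partial_die = {5: Fraction(1, 8)}


def event_probability(event, die):
    """Sum the probabilities of all ordered outcomes (a, b) for which event(a, b) holds.

    Assumes the two dice are independent and share the same face probabilities.
    """
    return sum(die[a] * die[b] for a, b in product(FACES, repeat=2) if event(a, b))


both_five_fair = event_probability(lambda a, b: a == 5 and b == 5, fair_die)

# The event from the text, read as ordered outcomes: (1, 2) or (3, 3) or (5, 6).
listed_event = event_probability(
    lambda a, b: (a, b) in {(1, 2), (3, 3), (5, 6)}, fair_die
)

# Under the alternative assumption, "both dice show 5" is still computable ...
both_five_weighted = event_probability(lambda a, b: a == 5 and b == 5, partial_die)

# ... but an event involving any other face is not, because its probability is unspecified.
try:
    event_probability(lambda a, b: a == 1 and b == 1, partial_die)
    other_event_computable = True
except KeyError:
    other_event_computable = False
```

The `KeyError` is simply how this sketch surfaces the missing information: the alternative assumption does not say enough to assign a probability to every event, so it does not constitute a statistical model.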

Formal definition

In mathematical terms, a statistical model is usually thought of as a pair $(S, \mathcal{P})$, where $S$ is the set of possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$.[3] The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose $\mathcal{P}$ to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution. Note that we do not require that $\mathcal{P}$ contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality";[4] hence the saying "all models are wrong". The set $\mathcal{P}$ is almost always parameterized: $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. The set $\Theta$ defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. $F_{\theta_1} = F_{\theta_2} \Rightarrow \theta_1 = \theta_2$ must hold (in other words, the map $\theta \mapsto F_\theta$ must be injective). A parameterization that meets the requirement is said to be identifiable.[3]
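To make identifiability concrete, the sketch below contrasts two hypothetical parameterizations of Gaussian families. The function names and the grid search are assumptions made for illustration; a Gaussian distribution is represented here by its (mean, standard deviation) pair, which determines it uniquely. Note that a check over a finite grid can only expose a failure of injectivity, it cannot prove identifiability.

```python
from itertools import product


def family_mean_sd(theta):
    """theta = (mu, sigma) maps to the Gaussian with mean mu and standard deviation sigma."""
    mu, sigma = theta
    return (mu, sigma)


def family_sum_of_offsets(theta):
    """theta = (a, b) maps to the Gaussian with mean a + b and standard deviation 1.

    Only the sum a + b affects the distribution, so this parameterization is redundant.
    """
    a, b = theta
    return (a + b, 1.0)


def is_injective_on(grid, to_distribution):
    """Return (True, None) if no two grid points share a distribution,
    otherwise (False, (theta_1, theta_2)) for the first clash found."""
    seen = {}
    for theta in grid:
        dist = to_distribution(theta)
        if dist in seen:
            return False, (seen[dist], theta)
        seen[dist] = theta
    return True, None


ok_mean_sd, _ = is_injective_on(product([0, 1, 2], [1, 2, 3]), family_mean_sd)
ok_offsets, clash = is_injective_on(product([0, 1, 2], [0, 1, 2]), family_sum_of_offsets)
```

In the second family, the parameter values (0, 1) and (1, 0) give the same distribution, so $F_{\theta_1} = F_{\theta_2}$ does not imply $\theta_1 = \theta_2$ and the parameterization is not identifiable.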

An example

Suppose that we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to the age: e.g. when we know that a child is of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in a linear regression model, like this: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line ($\text{height}_i = b_0 + b_1 \text{age}_i$) cannot be the equation for a model of the data, unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be included in the equation, so that the model is consistent with all the data points. To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$. For instance, we might assume that the $\varepsilon_i$ are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution. We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$ determines a distribution on $S$; denote that distribution by $F_\theta$. If $\Theta$ is the set of all possible values of $\theta$, then $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to $\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they are required to do.
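As a rough illustration of this model, the following sketch draws data from one member $F_\theta$ of the family and then recovers $\theta = (b_0, b_1, \sigma^2)$ by ordinary least squares. The numerical values (intercept 0.75 m, slope 0.06 m per year, error standard deviation 0.05 m, ages between 2 and 12) are invented for the example and are not taken from any data; the function names are likewise hypothetical.

```python
import random


def simulate_heights(n, b0, b1, sigma, rng):
    """Draw n (age, height) pairs: ages uniform, heights linear in age plus Gaussian error."""
    ages = [rng.uniform(2.0, 12.0) for _ in range(n)]
    heights = [b0 + b1 * age + rng.gauss(0.0, sigma) for age in ages]
    return ages, heights


def fit_line(x, y):
    """Least-squares estimates of (b0, b1) and the maximum-likelihood estimate of sigma^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / n  # divides by n, not n - 2
    return b0, b1, sigma2


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ages, heights = simulate_heights(2000, b0=0.75, b1=0.06, sigma=0.05, rng=rng)
b0_hat, b1_hat, sigma2_hat = fit_line(ages, heights)
print(b0_hat, b1_hat, sigma2_hat)
```

With a sample of this size the three estimates should land close to the values used in the simulation, which is the sense in which the data identify a single distribution within $\mathcal{P}$.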

General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the above example with children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic. Statistical models are often used even when the data-generating process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process). Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses. Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".[5]

There are three purposes for a statistical model, according to Konishi & Kitagawa.[6]

  • Predictions
  • Extraction of information
  • Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer: prediction, estimation, description.[7] The three purposes correspond with the three kinds of logical reasoning: deductive reasoning, inductive reasoning, abductive reasoning.

Dimension of a model

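Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{F_\theta : \theta \in \Theta\}$. In notation, we write that $\Theta \subseteq \mathbb{R}^k$ where $k$ is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can be used, in principle). Here, $k$ is called the dimension of the model. The model is said to be parametric if $\Theta$ has finite dimension. As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

$$\mathcal{P} = \left\{ F_{\mu,\sigma}(x) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) : \mu \in \mathbb{R},\ \sigma > 0 \right\}.$$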

In this example, the dimension, k, equals 2. As another example, suppose that the data consists of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note the set of all possible lines has dimension 2, even though geometrically, a line has dimension 1.)
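The univariate Gaussian family can be written down directly as a function of $x$ and a parameter vector, which makes the dimension visible as the length of that vector. The sketch below is illustrative only: the helper names, the chosen parameter values, and the crude midpoint-rule integration are assumptions made for the example.

```python
import math


def gaussian_density(x, theta):
    """Density of the member of the Gaussian family indexed by theta = (mu, sigma)."""
    mu, sigma = theta
    if sigma <= 0:
        raise ValueError("sigma must be positive for theta to lie in the parameter set")
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def midpoint_integral(f, lo, hi, steps=20000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


theta_gaussian = (1.0, 2.0)         # (mu, sigma): an arbitrary point of the parameter set
theta_regression = (0.75, 0.06, 0.0025)  # (b0, b1, sigma^2): arbitrary values for the line model

k_gaussian = len(theta_gaussian)      # dimension of the Gaussian model
k_regression = len(theta_regression)  # dimension of the straight-line model

# Each theta in the parameter set indexes a genuine probability distribution:
# integrating over mu +/- 10 sigma captures essentially all of the mass.
total_mass = midpoint_integral(lambda x: gaussian_density(x, theta_gaussian), -19.0, 21.0)
```

Here `k_gaussian` is 2 and `k_regression` is 3, matching the dimensions given above.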

Although formally $\theta \in \Theta$ is a single parameter that has dimension $k$, it is sometimes regarded as comprising $k$ separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters: the mean and the standard deviation. A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $k$ is the dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have $k \rightarrow \infty$ as $n \rightarrow \infty$. If $k/n \rightarrow 0$ as $n \rightarrow \infty$, then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".[8]

Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model

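$$y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

has, nested within it, the linear model

$$y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2),$$

obtained by constraining the parameter $b_2$ to equal 0.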

In both those examples, the first model has a higher dimension than the second model (for the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.
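The nesting of the linear model inside the quadratic model can be illustrated numerically. The sketch below fits both by least squares on simulated data; the data-generating values, the seed, and the helper functions (`solve`, `predict`, `fit_polynomial`) are all assumptions made for the example. Because the linear model is the quadratic model with $b_2$ constrained to 0, the quadratic fit can never have a larger residual sum of squares on the same data.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def predict(coeffs, x):
    """Evaluate b0 + b1 x + b2 x^2 + ... for coefficients [b0, b1, b2, ...]."""
    return sum(b * x ** p for p, b in enumerate(coeffs))


def fit_polynomial(x, y, degree):
    """Least-squares coefficients via the normal equations, plus the residual sum of squares."""
    powers = range(degree + 1)
    gram = [[sum(xi ** (i + j) for xi in x) for j in powers] for i in powers]
    moments = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in powers]
    coeffs = solve(gram, moments)
    rss = sum((yi - predict(coeffs, xi)) ** 2 for xi, yi in zip(x, y))
    return coeffs, rss


rng = random.Random(1)  # fixed seed so the sketch is reproducible
x = [rng.uniform(-2.0, 2.0) for _ in range(500)]
y = [1.0 + 2.0 * xi + 0.5 * xi ** 2 + rng.gauss(0.0, 0.3) for xi in x]

quad_coeffs, quad_rss = fit_polynomial(x, y, degree=2)
lin_coeffs, lin_rss = fit_polynomial(x, y, degree=1)

# A linear model is the quadratic model with b2 fixed at 0: same predictions.
constrained_quadratic = lin_coeffs + [0.0]
```

Evaluating `predict(constrained_quadratic, 1.5)` and `predict(lin_coeffs, 1.5)` gives the same value, which is the constraint $b_2 = 0$ in action.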

Comparing models

Comparing statistical models is fundamental for much of statistical inference. Konishi & Kitagawa (2008, p. 75) state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models." Common criteria for comparing models include the following: $R^2$, Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
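As one concrete instance of such a criterion, the sketch below computes the Akaike information criterion, $\mathrm{AIC} = 2k - 2\ln\hat{L}$, for least-squares fits with i.i.d. zero-mean Gaussian errors, where $k$ counts the regression coefficients plus the error variance. The residual sums of squares and the sample size are made-up numbers chosen only to show the trade-off between fit and dimension; the function names are hypothetical.

```python
import math


def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a least-squares fit with i.i.d. zero-mean Gaussian errors.

    The error variance is set to its maximum-likelihood estimate, rss / n.
    """
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)


def aic(rss, n, k):
    """Akaike information criterion: 2k minus twice the maximized log-likelihood."""
    return 2.0 * k - 2.0 * gaussian_log_likelihood(rss, n)


n = 100  # hypothetical sample size

# Case 1: the quadratic term reduces the residual sum of squares substantially.
aic_linear = aic(rss=130.0, n=n, k=3)     # b0, b1, sigma^2
aic_quadratic = aic(rss=100.0, n=n, k=4)  # b0, b1, b2, sigma^2
preferred = "quadratic" if aic_quadratic < aic_linear else "linear"

# Case 2: the quadratic term barely helps, so the extra parameter is not worth its penalty.
aic_linear_close = aic(rss=100.5, n=n, k=3)
preferred_close = "quadratic" if aic_quadratic < aic_linear_close else "linear"
```

A lower AIC is preferred. In the first case the improvement in fit outweighs the penalty for the extra parameter; in the second it does not, so the nested linear model is retained.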

Another way of comparing two statistical models is through the notion of deficiency introduced by Lucien Le Cam.[9]

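See also

  • All models are wrong
  • Blockmodel
  • Conceptual model
  • Design of experiments
  • Deterministic model
  • Effective theory
  • Predictive model
  • Response modeling methodology
  • Scientific model
  • Statistical inference
  • Statistical model specification
  • Statistical model validation
  • Statistical theory
  • Stochastic process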

Notes

  1. Cox 2006, p. 178
  2. Adèr 2008, p. 280
  3. McCullagh 2002
  4. Burnham & Anderson 2002, §1.2.5
  5. Cox 2006, p. 197
  6. Konishi & Kitagawa 2008, §1.1
  7. Friendly & Meyer 2016, §11.6
  8. Cox 2006, p. 2
  9. Le Cam, Lucien (1964). "Sufficiency and Approximate Sufficiency". Annals of Mathematical Statistics. 35 (4). Institute of Mathematical Statistics: 1429. doi:10.1214/aoms/1177700372.

References

  • Adèr, H. J. (2008), "Modelling", in Adèr, H. J.; Mellenbergh, G. J. (eds.), Advising on Research Methods: A consultant's companion, Huizen, The Netherlands: Johannes van Kessel Publishing, pp. 271–304.
  • Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference (2nd ed.), Springer-Verlag.
  • Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press.
  • Friendly, M.; Meyer, D. (2016), Discrete Data Analysis with R, Chapman & Hall.
  • Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
  • McCullagh, P. (2002), "What is a statistical model?" (PDF), Annals of Statistics, 30 (5): 1225–1310, doi:10.1214/aos/1035844977.

Further reading

  • Davison, A. C. (2008), Statistical Models, Cambridge University Press.
  • Drton, M.; Sullivant, S. (2007), "Algebraic statistical models", Statistica Sinica, 17: 1273–1297.
  • Freedman, D. A. (2009), Statistical Models, Cambridge University Press.
  • Helland, I. S. (2010), Steps Towards a Unified Basis for Scientific Models and Methods, World Scientific.
  • Kroese, D. P.; Chan, J. C. C. (2014), Statistical Modeling and Computation, Springer.
  • Shmueli, G. (2010), "To explain or to predict?", Statistical Science, 25 (3): 289–310, arXiv:1101.0891, doi:10.1214/10-STS330, S2CID 15900983.