
Ordinary least squares

In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the input dataset and the output of the (linear) function of the independent variable.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.

The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect multicollinearity (rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments.[1] By the Gauss–Markov theorem it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator and outperforms any non-linear unbiased estimator.

Linear model

 
Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.

Suppose the data consists of n observations {x_i, y_i}_{i=1}^{n}. Each observation i includes a scalar response y_i and a column vector x_i of p parameters (regressors), i.e., x_i = (x_{i1}, x_{i2}, …, x_{ip})^T. In a linear regression model, the response variable, y_i, is a linear function of the regressors:

y_i = β_1 x_{i1} + β_2 x_{i2} + ⋯ + β_p x_{ip} + ε_i,

or in vector form,

y_i = x_i^T β + ε_i,

where x_i, as introduced previously, is a column vector of the i-th observation of all the explanatory variables; β is a p×1 vector of unknown parameters; and the scalar ε_i represents unobserved random variables (errors) of the i-th observation. ε_i accounts for the influences upon the responses y_i from sources other than the explanatory variables x_i. This model can also be written in matrix notation as

y = Xβ + ε,

where y and ε are n×1 vectors of the response variables and the errors of the n observations, and X is an n×p matrix of regressors, also sometimes called the design matrix, whose row i is x_i^T and contains the i-th observations on all the explanatory variables.

Typically, a constant term is included in the set of regressors X, say, by taking x_{i1} = 1 for all i = 1, …, n. The coefficient β_1 corresponding to this regressor is called the intercept. Without the intercept, the fitted line is forced to cross the origin when x_i = 0.

Regressors do not have to be independent: there can be any desired relationship between the regressors (so long as it is not a linear relationship). For instance, we might suspect the response depends linearly both on a value and its square; in which case we would include one regressor whose value is just the square of another regressor. In that case, the model would be quadratic in the second regressor, but nonetheless is still considered a linear model because the model is still linear in the parameters (β).

Matrix/vector formulation

Consider an overdetermined system

Σ_{j=1}^{p} x_{ij} β_j = y_i,   (i = 1, 2, …, n)

of n linear equations in p unknown coefficients, β_1, β_2, …, β_p, with n > p. This can be written in matrix form as

Xβ = y,

where

X = [ X_11 X_12 ⋯ X_1p ; X_21 X_22 ⋯ X_2p ; ⋮ ; X_n1 X_n2 ⋯ X_np ],   β = (β_1, β_2, …, β_p)^T,   y = (y_1, y_2, …, y_n)^T.

(Note: for a linear model as above, not all elements in X contain information on the data points. The first column is populated with ones, X_{i1} = 1. Only the other columns contain actual data. So here p is equal to the number of regressors plus one.)

Such a system usually has no exact solution, so the goal is instead to find the coefficients β which fit the equations "best", in the sense of solving the quadratic minimization problem

β̂ = arg min_β S(β),

where the objective function S is given by

S(β) = Σ_{i=1}^{n} | y_i − Σ_{j=1}^{p} X_{ij} β_j |² = ‖y − Xβ‖².

A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the p columns of the matrix X are linearly independent, given by solving the so-called normal equations:

(X^T X) β̂ = X^T y.

The matrix X^T X is known as the normal matrix or Gram matrix and the matrix X^T y is known as the moment matrix of regressand by regressors.[2] Finally, β̂ is the coefficient vector of the least-squares hyperplane, expressed as

β̂ = (X^T X)^{−1} X^T y,

or

β̂ = β + (X^T X)^{−1} X^T ε.

Estimation

Suppose b is a "candidate" value for the parameter vector β. The quantity y_i − x_i^T b, called the residual for the i-th observation, measures the vertical distance between the data point (x_i, y_i) and the hyperplane y = x^T b, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))[3] is a measure of the overall model fit:

S(b) = Σ_{i=1}^{n} ( y_i − x_i^T b )² = (y − Xb)^T (y − Xb),

where T denotes the matrix transpose, and the rows of X, denoting the values of all the independent variables associated with a particular value of the dependent variable, are X_i = x_i^T. The value of b which minimizes this sum is called the OLS estimator for β. The function S(b) is quadratic in b with positive-definite Hessian, and therefore this function possesses a unique global minimum at b = β̂, which can be given by the explicit formula:[4][proof]

β̂ = arg min_{b ∈ R^p} S(b) = (X^T X)^{−1} X^T y.

The product N = X^T X is a Gram matrix and its inverse, Q = N^{−1}, is the cofactor matrix of β,[5][6][7] closely related to its covariance matrix, C_β. The matrix (X^T X)^{−1} X^T = Q X^T is called the Moore–Penrose pseudoinverse matrix of X. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
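
As an illustrative sketch on synthetic data (the variable names and numbers here are arbitrary), the closed-form estimator above can be computed with NumPy, either by solving the normal equations or, preferably, with a least-squares solver:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column of ones = intercept
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Normal-equations solution: beta_hat = (X'X)^{-1} X'y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Numerically preferable equivalent via a least-squares solver
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(beta_hat, beta_lstsq)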

After we have estimated β, the fitted values (or predicted values) from the regression will be

ŷ = Xβ̂ = Py,

where P = X(X^T X)^{−1}X^T is the projection matrix onto the space V spanned by the columns of X. This matrix P is also sometimes called the hat matrix because it "puts a hat" onto the variable y. Another matrix, closely related to P, is the annihilator matrix M = I_n − P; this is a projection matrix onto the space orthogonal to V. Both matrices P and M are symmetric and idempotent (meaning that P² = P and M² = M), and relate to the data matrix X via the identities PX = X and MX = 0.[8] The matrix M creates the residuals from the regression:

ε̂ = y − ŷ = y − Xβ̂ = My = M(Xβ + ε) = (MX)β + Mε = Mε.

Using these residuals we can estimate the value of σ² using the reduced chi-squared statistic:

s² = (ε̂^T ε̂)/(n − p) = (My)^T(My)/(n − p) = y^T M^T M y/(n − p) = y^T M y/(n − p) = S(β̂)/(n − p),   σ̂² = ((n − p)/n)·s².

The denominator, n − p, is the statistical degrees of freedom. The first quantity, s², is the OLS estimate for σ², whereas the second, σ̂², is the MLE estimate for σ². The two estimators are quite similar in large samples; the first estimator is always unbiased, while the second estimator is biased but has a smaller mean squared error. In practice s² is used more often, since it is more convenient for hypothesis testing. The square root of s² is called the regression standard error,[9] standard error of the regression,[10][11] or standard error of the equation.[8]
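
A minimal sketch, on synthetic data, of how the two variance estimates above differ:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 2
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.4, size=n)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat                 # epsilon_hat = My
    s2 = resid @ resid / (n - p)             # unbiased OLS estimate of sigma^2
    sigma2_mle = resid @ resid / n           # biased maximum likelihood estimate
    se_regression = np.sqrt(s2)              # standard error of the regression
    print(s2, sigma2_mle, se_regression)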

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto X. The coefficient of determination R² is defined as a ratio of "explained" variance to the "total" variance of the dependent variable y, in the cases where the regression sum of squares equals the sum of squares of residuals:[12]

R² = Σ(ŷ_i − ȳ)² / Σ(y_i − ȳ)² = (y^T P^T L P y)/(y^T L y) = 1 − (y^T M y)/(y^T L y) = 1 − RSS/TSS,

where TSS is the total sum of squares for the dependent variable, L = I_n − (1/n)J_n, and J_n is an n×n matrix of ones. (L is a centering matrix which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for R² to be meaningful, the matrix X of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, R² will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
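
A short illustrative computation of R² from the residual and total sums of squares, on synthetic data that include the required constant column:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 60
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant column required for R^2
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat
    rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)        # total (centred) sum of squares
    r_squared = 1.0 - rss / tss
    print(r_squared)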

The variance in the prediction of the independent variable as a function of the dependent variable is given in the article Polynomial least squares.

Simple linear regression model

If the data matrix X contains only two variables, a constant and a scalar regressor x_i, then this is called the "simple regression model". This case is often considered in the beginner statistics classes, as it provides much simpler formulas even suitable for manual calculation. The parameters are commonly denoted as (α, β):

y_i = α + β x_i + ε_i.

The least squares estimates in this case are given by simple formulas

β̂ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)²,    α̂ = ȳ − β̂ x̄.
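
A small sketch of these formulas on made-up numbers:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha_hat = y.mean() - beta_hat * x.mean()
    print(alpha_hat, beta_hat)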

Alternative derivations

In the previous section the least squares estimator β̂ was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: β̂ = (X^T X)^{−1} X^T y; the only difference is in how we interpret this result.

Projection

 
OLS estimation can be viewed as a projection onto the linear space spanned by the regressors. (Here each of X_1 and X_2 refers to a column of the data matrix.)

For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations Xβ ≈ y, where β is the unknown. Assuming the system cannot be solved exactly (the number of equations n is much larger than the number of unknowns p), we are looking for a solution that could provide the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies

β̂ = arg min_β ‖y − Xβ‖,

where ‖·‖ is the standard L² norm in the n-dimensional Euclidean space R^n. The predicted quantity Xβ is just a certain linear combination of the vectors of regressors. Thus, the residual vector y − Xβ will have the smallest length when y is projected orthogonally onto the linear subspace spanned by the columns of X. The OLS estimator β̂ in this case can be interpreted as the coefficients of the vector decomposition of ŷ = Py along the basis of X.

In other words, the gradient equations at the minimum can be written as:

(y − Xβ̂)^T X = 0.

A geometrical interpretation of these equations is that the vector of residuals, y − Xβ̂, is orthogonal to the column space of X, since the dot product (y − Xβ̂)·Xv is equal to zero for any conformal vector, v. This means that y − Xβ̂ is the shortest of all possible vectors y − Xβ, that is, the variance of the residuals is the minimum possible. This is illustrated at the right.
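
The orthogonality property is easy to verify numerically; the sketch below uses synthetic data and checks that X^T times the residual vector is (numerically) zero:

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 40, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = rng.normal(size=n)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    print(X.T @ resid)   # approximately zero: residuals are orthogonal to every column of X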

Introducing γ̂ and a matrix K with the assumption that a matrix [X K] is non-singular and K^T X = 0 (cf. Orthogonal projections), the residual vector should satisfy the following equation:

r̂ = y − Xβ̂ = Kγ̂.

The equation and solution of linear least squares are thus described as follows:

y = [X K] [β̂; γ̂]   ⇒   [β̂; γ̂] = [X K]^{−1} y = [ (X^T X)^{−1} X^T ; (K^T K)^{−1} K^T ] y.

Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset.[13] Although this way of calculation is more computationally expensive, it provides a better intuition on OLS.

Maximum likelihood

The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.[14][proof] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson.[15] From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.[16]

Generalized method of moments

In the iid case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions

E[ x_i ( y_i − x_i^T β ) ] = 0.

These moment conditions state that the regressors should be uncorrelated with the errors. Since xi is a p-vector, the number of moment conditions is equal to the dimension of the parameter vector β, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.

Note that the original strict exogeneity assumption E[ε_i | x_i] = 0 implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function ƒ, the moment condition E[ ƒ(x_i)·ε_i ] = 0 will hold. However it can be shown using the Gauss–Markov theorem that the optimal choice of function ƒ is to take ƒ(x) = x, which results in the moment equation posted above.

Properties

Assumptions

There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors xi are random and sampled together with the yi's from some population, as in an observational study. This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on X. All results stated in this article are within the random design framework.

Classical linear regression model

The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS, and in which the number of observations is allowed to grow to infinity.

  • Correct specification. The linear functional form must coincide with the form of the actual data-generating process.
  • Strict exogeneity. The errors in the regression should have conditional mean zero:[17]
      E[ ε | X ] = 0.
     The immediate consequence of the exogeneity assumption is that the errors have mean zero: E[ε] = 0 (by the law of total expectation), and that the regressors are uncorrelated with the errors: E[X^T ε] = 0.
    The exogeneity assumption is critical for the OLS theory. If it holds then the regressor variables are called exogenous. If it doesn't, then those regressors that are correlated with the error term are called endogenous,[18] and the OLS estimator becomes biased. In such case the method of instrumental variables may be used to carry out inference.
  • No linear dependence. The regressors in X must all be linearly independent. Mathematically, this means that the matrix X must have full column rank almost surely:[19]
      Pr[ rank(X) = p ] = 1.
     Usually, it is also assumed that the regressors have finite moments up to at least the second moment. Then the matrix Q_xx = E[X^T X / n] is finite and positive semi-definite.
    When this assumption is violated the regressors are called linearly dependent or perfectly multicollinear. In such case the value of the regression coefficient β cannot be learned, although prediction of y values is still possible for new values of the regressors that lie in the same linearly dependent subspace.
  • Spherical errors:[19]
      Var[ ε | X ] = σ² I_n,
     where I_n is the identity matrix in dimension n, and σ² is a parameter which determines the variance of each observation. This σ² is considered a nuisance parameter in the model, although usually it is also estimated. If this assumption is violated then the OLS estimates are still valid, but no longer efficient.
    It is customary to split this assumption into two parts:
    • Homoscedasticity: E[ ε_i² | X ] = σ², which means that the error term has the same variance σ² in each observation. When this requirement is violated this is called heteroscedasticity; in such case a more efficient estimator would be weighted least squares. If the errors have infinite variance then the OLS estimates will also have infinite variance (although by the law of large numbers they will nonetheless tend toward the true values so long as the errors have zero mean). In this case, robust estimation techniques are recommended.
    • No autocorrelation: the errors are uncorrelated between observations: E[ ε_i ε_j | X ] = 0 for i ≠ j. This assumption may be violated in the context of time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies. In such cases generalized least squares provides a better alternative than OLS. Another expression for autocorrelation is serial correlation.
  • Normality. It is sometimes additionally assumed that the errors have a normal distribution conditional on the regressors:[20]
      ε | X ~ N(0, σ² I_n).
     This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established in the case when it does hold (especially in the area of hypothesis testing). Also when the errors are normal, the OLS estimator is equivalent to the maximum likelihood estimator (MLE), and therefore it is asymptotically efficient in the class of all regular estimators. Importantly, the normality assumption applies only to the error terms; contrary to a popular misconception, the response (dependent) variable is not required to be normally distributed.[21]

Independent and identically distributed (iid)

In some applications, especially with cross-sectional data, an additional assumption is imposed — that all observations are independent and identically distributed. This means that all observations are taken from a random sample which makes all the assumptions listed earlier simpler and easier to interpret. Also this framework allows one to state asymptotic results (as the sample size n → ∞), which are understood as a theoretical possibility of fetching new independent observations from the data generating process. The list of assumptions in this case is:

  • iid observations: (x_i, y_i) is independent from, and has the same distribution as, (x_j, y_j) for all i ≠ j;
  • no perfect multicollinearity: Q_xx = E[ x_i x_i^T ] is a positive-definite matrix;
  • exogeneity: E[ ε_i | x_i ] = 0;
  • homoscedasticity: Var[ ε_i | x_i ] = σ².

Time series model

Finite sample properties

First of all, under the strict exogeneity assumption the OLS estimators β̂ and s² are unbiased, meaning that their expected values coincide with the true values of the parameters:[23][proof]

E[ β̂ | X ] = β,   E[ s² | X ] = σ².

If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.

The variance-covariance matrix (or simply covariance matrix) of β̂ is equal to[24]

Var[ β̂ | X ] = σ² (X^T X)^{−1}.

In particular, the standard error of each coefficient β̂_j is equal to the square root of the j-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity σ² with its estimate s². Thus,

s.e.(β̂_j) = √( s² [ (X^T X)^{−1} ]_{jj} ).
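
An illustrative computation of the estimated covariance matrix and the resulting standard errors, on synthetic data:

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.7, size=n)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    cov_beta = s2 * np.linalg.inv(X.T @ X)     # estimated Var[beta_hat | X]
    std_errors = np.sqrt(np.diag(cov_beta))
    print(std_errors)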

It can also be easily shown that the estimator β̂ is uncorrelated with the residuals from the model:[24]

Cov[ β̂, ε̂ | X ] = 0.

The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator β̂ is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood as if we were to find some other estimator β̃ which would be linear in y and unbiased, then[24]

Var[ β̃ | X ] − Var[ β̂ | X ] ≥ 0

in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ε, other, non-linear estimators may provide better results than OLS.

Assuming normality

The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if you are willing to assume that the normality assumption holds (that is, that ε ~ N(0, σ2In)), then additional properties of the OLS estimators can be stated.

The estimator β̂ is normally distributed, with mean and variance as given before:[25]

β̂ ~ N( β, σ² (X^T X)^{−1} ).

This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators.[16] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.

The estimator s² will be proportional to the chi-squared distribution:[26]

s² ~ ( σ² / (n − p) ) · χ²_{n−p}.

The variance of this estimator is equal to 2σ⁴/(n − p), which does not attain the Cramér–Rao bound of 2σ⁴/n. However it was shown that there are no unbiased estimators of σ² with variance smaller than that of the estimator s².[27] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be σ̃² = SSR / (n − p + 2), which even beats the Cramér–Rao bound in the case when there is only one regressor (p = 1).[28]

Moreover, the estimators β̂ and s² are independent,[29] a fact which comes in useful when constructing the t- and F-tests for the regression.

Influential observations

As was mentioned before, the estimator β̂ is linear in y, meaning that it represents a linear combination of the dependent variables y_i. The weights in this linear combination are functions of the regressors X, and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific j-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for β will be equal to[30]

β̂^{(j)} − β̂ = − (1/(1 − h_j)) (X^T X)^{−1} x_j ε̂_j,

where h_j = x_j^T (X^T X)^{−1} x_j is the j-th diagonal element of the hat matrix P, and x_j is the vector of regressors corresponding to the j-th observation. Similarly, the change in the predicted value for the j-th observation resulting from omitting that observation from the dataset will be equal to[30]

ŷ_j^{(j)} − ŷ_j = x_j^T β̂^{(j)} − x_j^T β̂ = − ( h_j / (1 − h_j) ) ε̂_j.

From the properties of the hat matrix, 0 ≤ h_j ≤ 1, and they sum up to p, so that on average h_j ≈ p/n. These quantities h_j are called the leverages, and observations with high h_j are called leverage points.[31] Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
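
The leverages h_j can be computed without forming the full hat matrix; the sketch below (synthetic data) also confirms that they sum to p:

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 30, 2
    X = np.column_stack([np.ones(n), rng.normal(size=n)])

    # Diagonal of P = X (X'X)^{-1} X', computed row by row instead of forming P explicitly
    XtX_inv = np.linalg.inv(X.T @ X)
    leverages = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    print(leverages.sum(), p)           # leverages sum to p, so the average leverage is p/n
    print(np.argsort(leverages)[-3:])   # indices of the three highest-leverage observations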

Partitioned regression

Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form

y = X_1 β_1 + X_2 β_2 + ε,

where X1 and X2 have dimensions n×p1, n×p2, and β1, β2 are p1×1 and p2×1 vectors, with p1 + p2 = p.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals ε̂ and the OLS estimate β̂_2 will be numerically identical to the residuals and the OLS estimate for β_2 in the following regression:[32]

M_1 y = M_1 X_2 β_2 + η,

where M1 is the annihilator matrix for regressors X1.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
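
A numerical illustration of this consequence, using synthetic data: the slope from the full regression with a constant equals the slope from regressing the de-meaned y on the de-meaned regressor without a constant.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 100
    x = rng.normal(size=n)
    y = 3.0 + 2.0 * x + rng.normal(size=n)

    # Full regression on [constant, x]
    X = np.column_stack([np.ones(n), x])
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

    # FWL: regress de-meaned y on de-meaned x without a constant
    slope_fwl = np.linalg.lstsq((x - x.mean()).reshape(-1, 1), y - y.mean(), rcond=None)[0]

    print(beta_full[1], slope_fwl[0])   # the two slope estimates coincide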

Constrained estimation

Suppose it is known that the coefficients in the regression satisfy a system of linear equations

A: Q^T β = c,

where Q is a p×q matrix of full rank, and c is a q×1 vector of known constants, where q < p. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint A. The constrained least squares (CLS) estimator can be given by an explicit formula:[33]

β̂^c = β̂ − (X^T X)^{−1} Q ( Q^T (X^T X)^{−1} Q )^{−1} ( Q^T β̂ − c ).

This expression for the constrained estimator is valid as long as the matrix XTX is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, β will not be identifiable. However it may happen that adding the restriction A makes β identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to [34]

 

where R is a p×(p − q) matrix such that the matrix [Q R] is non-singular, and RTQ = 0. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first in case when XTX is invertible.[34]

Large sample properties

The least squares estimators are point estimates of the linear regression model parameters β. However, generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the interval estimates.

Since we haven't made any assumption about the distribution of the error term ε_i, it is impossible to infer the distribution of the estimators β̂ and σ̂². Nevertheless, we can apply the central limit theorem to derive their asymptotic properties as the sample size n goes to infinity. While the sample size is necessarily finite, it is customary to assume that n is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.

We can show that under the model assumptions, the least squares estimator for β is consistent (that is, β̂ converges in probability to β) and asymptotically normal:[proof]

√n ( β̂ − β )  →d  N( 0, σ² Q_xx^{−1} ),

where Q_xx = E[ x_i x_i^T ].

Intervals

Using this asymptotic distribution, approximate two-sided confidence intervals for the j-th component of the vector β̂ can be constructed as

    β_j ∈ [ β̂_j − q^{N(0,1)}_{1−α/2} √( σ̂² [Q_xx^{−1}]_{jj} / n ),  β̂_j + q^{N(0,1)}_{1−α/2} √( σ̂² [Q_xx^{−1}]_{jj} / n ) ]   at the 1 − α confidence level,

where q denotes the quantile function of standard normal distribution, and [·]jj is the j-th diagonal element of a matrix.
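
An illustrative computation of such intervals on synthetic data; it uses SciPy's standard normal quantile and the usual plug-in standard errors s²[(X^T X)^{−1}]_{jj}, which is the common practical shortcut rather than the literal asymptotic expression above:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(7)
    n, p = 500, 2
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 0.8]) + rng.normal(size=n)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

    alpha = 0.05
    z = norm.ppf(1 - alpha / 2)                     # standard normal quantile
    lower, upper = beta_hat - z * se, beta_hat + z * se
    print(np.column_stack([lower, upper]))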

Similarly, the least squares estimator for σ² is also consistent and asymptotically normal (provided that the fourth moment of ε_i exists) with limiting distribution

√n ( σ̂² − σ² )  →d  N( 0, E[ε_i⁴] − σ⁴ ).

These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose x_0 is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity x_0^T β, whereas the predicted response is ŷ_0 = x_0^T β̂. Clearly the predicted response is a random variable; its distribution can be derived from that of β̂:

√n ( ŷ_0 − x_0^T β )  →d  N( 0, σ² x_0^T Q_xx^{−1} x_0 ),

which allows confidence intervals for the mean response x_0^T β to be constructed:

    x_0^T β ∈ [ x_0^T β̂ − q^{N(0,1)}_{1−α/2} √( σ̂² x_0^T Q_xx^{−1} x_0 / n ),  x_0^T β̂ + q^{N(0,1)}_{1−α/2} √( σ̂² x_0^T Q_xx^{−1} x_0 / n ) ]   at the 1 − α confidence level.

Hypothesis testing

Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The null hypothesis of no explanatory value of the estimated regression is tested using an F-test. If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the alternative hypothesis, that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.

Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero—that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's t-statistic, as the ratio of the coefficient estimate to its standard error. If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.

In addition, the Chow test is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.

Example with real data

The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

Height (m)  1.47  1.50  1.52  1.55  1.57  1.60  1.63  1.65  1.68  1.70  1.73  1.75  1.78  1.80  1.83
Weight (kg) 52.21 53.12 54.48 55.84 57.20 58.57 59.93 61.29 63.11 64.47 66.28 68.10 69.92 72.19 74.46

Scatterplot of the data; the relationship is slightly curved but close to linear.

When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT2. The regression model then becomes a multiple linear model:

 
 
w_i = β_1 + β_2 h_i + β_3 h_i² + ε_i

Fitted regression: ŵ = 128.8128 − 143.1620 h + 61.9603 h²
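
For reference, the fit can be reproduced with a short NumPy computation on the tabulated data (a sketch; only the data above are used):

    import numpy as np

    height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                       1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
    weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                       63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

    X = np.column_stack([np.ones_like(height), height, height ** 2])
    beta_hat = np.linalg.lstsq(X, weight, rcond=None)[0]
    print(beta_hat)   # close to the tabulated values 128.8128, -143.1620, 61.9603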

The output from most popular statistical packages will look similar to this:

Method Least squares
Dependent variable WEIGHT
Observations 15

Parameter Value Std error t-statistic p-value

Const    128.8128  16.3083   7.8986  0.0000
Height  –143.1620  19.8332  –7.2183  0.0000
Height²   61.9603   6.0084  10.3122  0.0000

R2 0.9989 S.E. of regression 0.2516
Adjusted R2 0.9987 Model sum-of-sq. 692.61
Log-likelihood 1.0890 Residual sum-of-sq. 0.7595
Durbin–Watson stat. 2.1013 Total sum-of-sq. 693.37
Akaike criterion 0.2548 F-statistic 5471.2
Schwarz criterion 0.3964 p-value (F-stat) 0.0000

In this table:

  • The Value column gives the least squares estimates of parameters β_j
  • The Std error column shows standard errors of each coefficient estimate: σ̂_j = √( s² [ (X^T X)^{−1} ]_{jj} )
  • The t-statistic and p-value columns are testing whether any of the coefficients might be equal to zero. The t-statistic is calculated simply as t = β̂_j / σ̂_j. If the errors ε follow a normal distribution, t follows a Student-t distribution. Under weaker conditions, t is asymptotically normal. Large values of t indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The second column, p-value, expresses the results of the hypothesis test as a significance level. Conventionally, p-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.
  • R-squared is the coefficient of determination indicating goodness-of-fit of the regression. This statistic will be equal to one if fit is perfect, and to zero when regressors X have no explanatory power whatsoever. This is a biased estimate of the population R-squared, and will never decrease if additional regressors are added, even if they are irrelevant.
  • Adjusted R-squared is a slightly modified version of R², designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than R², can decrease as new regressors are added, and even be negative for poorly fitting models:
      R̄² = 1 − (1 − R²) (n − 1)/(n − p)
  • Log-likelihood is calculated under the assumption that errors follow normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests.
  • Durbin–Watson statistic tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, the value smaller than 2 will be an evidence of positive correlation.
  • Akaike information criterion and Schwarz criterion are both used for model selection. Generally when comparing two alternative models, smaller values of one of these criteria will indicate a better model.[35]
  • Standard error of regression is an estimate of σ, standard error of the error term.
  • Total sum of squares, model sum of squares, and residual sum of squares tell us how much of the initial variation in the sample was explained by the regression.
  • F-statistic tries to test the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has F(p–1,n–p) distribution under the null hypothesis and normality assumption, and its p-value indicates probability that the hypothesis is indeed true. Note that when errors are not normal this statistic becomes invalid, and other tests such as Wald test or LR test should be used.
 
Residuals plot

Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots:

  • Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggests possible heteroscedasticity.
  • Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
  • Residuals against the fitted values, ŷ.
  • Residuals against the preceding residual. This plot may identify serial correlations in the residuals.

An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.

Sensitivity to rounding

This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm this is not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:

Const Height Height2
Converted to metric with rounding. 128.8128 −143.162 61.96033
Converted to metric without rounding. 119.0205 −131.5076 58.5046
 
Residuals to a quadratic fit for correctly and incorrectly converted data.

Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.
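
A quick check of this comparison, as a sketch using only the two coefficient sets quoted in the table above:

    import numpy as np

    h = 1.6764   # 5 ft 6 in, in metres
    coef_rounded = np.array([128.8128, -143.1620, 61.96033])
    coef_unrounded = np.array([119.0205, -131.5076, 58.5046])

    x0 = np.array([1.0, h, h ** 2])
    print(coef_rounded @ x0, coef_unrounded @ x0)   # both predictions are close to 63 kg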

While this may look innocuous in the middle of the data range it could become significant at the extremes or in the case where the fitted model is used to project outside the data range (extrapolation).

This highlights a common error: this example is an abuse of OLS which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious the error in the estimation will depend upon relative size of the x and y errors.

Another example with less real data

Problem statement

We can use the least squares mechanism to figure out the equation of a two-body orbit in polar base co-ordinates. The equation typically used is r(θ) = p / (1 − e·cos(θ)), where r(θ) is the distance of the object from one of the bodies. In the equation the parameters e and p are used to determine the path of the orbit. We have measured the following data.

θ (in degrees)   43      45      52      93      108     116
r(θ)             4.7126  4.5542  4.0419  2.2187  1.8910  1.7599

We need to find the least-squares approximation of e and p for the given data.

Solution

First we need to represent e and p in a linear form. So we are going to rewrite the equation   as  . Now we can use this form to represent our observational data as:

  where   is   and   is   and   is constructed by the first column being the coefficient of   and the second column being the coefficient of   and   is the values for the respective   so   and  

On solving we get  

so   and  
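
The linearized system used in the original solution is not reproduced above, so the sketch below assumes the conic-section form r(θ) = p/(1 − e·cos θ) stated in the problem and rewrites it as r = p + e·(r·cos θ), which is linear in the unknowns p and e; that rewrite is an assumption of this sketch, not necessarily the article's exact formulation.

    import numpy as np

    theta = np.radians([43, 45, 52, 93, 108, 116])
    r = np.array([4.7126, 4.5542, 4.0419, 2.2187, 1.8910, 1.7599])

    # r = p/(1 - e*cos(theta))  ==>  r = p + e*(r*cos(theta)), linear in (p, e)
    A = np.column_stack([np.ones_like(r), r * np.cos(theta)])
    p_hat, e_hat = np.linalg.lstsq(A, r, rcond=None)[0]
    print(p_hat, e_hat)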

See also

References

  1. ^ "What is a complete list of the usual assumptions for linear regression?". Cross Validated. Retrieved 2022-09-28.
  2. ^ Goldberger, Arthur S. (1964). "Classical Linear Regression". Econometric Theory. New York: John Wiley & Sons. pp. 158. ISBN 0-471-31101-4.
  3. ^ Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 15.
  4. ^ Hayashi (2000, page 18)
  5. ^ Ghilani, Charles D.; Wolf, Paul R. (12 June 2006). Adjustment Computations: Spatial Data Analysis. ISBN 9780471697282.
  6. ^ Hofmann-Wellenhof, Bernhard; Lichtenegger, Herbert; Wasle, Elmar (20 November 2007). GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more. ISBN 9783211730171.
  7. ^ Xu, Guochang (5 October 2007). GPS: Theory, Algorithms and Applications. ISBN 9783540727156.
  8. ^ a b Hayashi (2000, page 19)
  9. ^ Julian Faraway (2000), Practical Regression and Anova using R
  10. ^ Kenney, J.; Keeping, E. S. (1963). Mathematics of Statistics. van Nostrand. p. 187.
  11. ^ Zwillinger, D. (1995). Standard Mathematical Tables and Formulae. Chapman&Hall/CRC. p. 626. ISBN 0-8493-2479-3.
  12. ^ Hayashi (2000, page 20)
  13. ^ Akbarzadeh, Vahab (7 May 2014). "Line Estimation".
  14. ^ Hayashi (2000, page 49)
  15. ^ "Least Squares Introduction | Massachusetts Institute of Technology - KeepNotes". keepnotes.com. Retrieved 2023-09-25.
  16. ^ a b Hayashi (2000, page 52)
  17. ^ Hayashi (2000, page 7)
  18. ^ Hayashi (2000, page 187)
  19. ^ a b Hayashi (2000, page 10)
  20. ^ Hayashi (2000, page 34)
  21. ^ Williams, M. N; Grajales, C. A. G; Kurkiewicz, D (2013). "Assumptions of multiple regression: Correcting two misconceptions". Practical Assessment, Research & Evaluation. 18 (11).
  22. ^ "Memento on EViews Output" (PDF). Retrieved 28 December 2020.
  23. ^ Hayashi (2000, pages 27, 30)
  24. ^ a b c Hayashi (2000, page 27)
  25. ^ Amemiya, Takeshi (1985). Advanced Econometrics. Harvard University Press. p. 13. ISBN 9780674005600.
  26. ^ Amemiya (1985, page 14)
  27. ^ Rao, C. R. (1973). Linear Statistical Inference and its Applications (Second ed.). New York: J. Wiley & Sons. p. 319. ISBN 0-471-70823-2.
  28. ^ Amemiya (1985, page 20)
  29. ^ Amemiya (1985, page 27)
  30. ^ a b Davidson, Russell; MacKinnon, James G. (1993). Estimation and Inference in Econometrics. New York: Oxford University Press. p. 33. ISBN 0-19-506011-3.
  31. ^ Davidson & MacKinnon (1993, page 36)
  32. ^ Davidson & MacKinnon (1993, page 20)
  33. ^ Amemiya (1985, page 21)
  34. ^ a b Amemiya (1985, page 22)
  35. ^ Burnham, Kenneth P.; David Anderson (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. ISBN 0-387-95364-7.

Further reading

  • Dougherty, Christopher (2002). Introduction to Econometrics (2nd ed.). New York: Oxford University Press. pp. 48–113. ISBN 0-19-877643-8.
  • Gujarati, Damodar N.; Porter, Dawn C. (2009). Basic Econometrics (Fifth ed.). Boston: McGraw-Hill Irwin. pp. 55–96. ISBN 978-0-07-337577-9.
  • Heij, Christiaan; Boer, Paul; Franses, Philip H.; Kloek, Teun; van Dijk, Herman K. (2004). Econometric Methods with Applications in Business and Economics (1st ed.). Oxford: Oxford University Press. pp. 76–115. ISBN 978-0-19-926801-6.
  • Hill, R. Carter; Griffiths, William E.; Lim, Guay C. (2008). Principles of Econometrics (3rd ed.). Hoboken, NJ: John Wiley & Sons. pp. 8–47. ISBN 978-0-471-72360-8.
  • Wooldridge, Jeffrey (2008). "The Simple Regression Model". Introductory Econometrics: A Modern Approach (4th ed.). Mason, OH: Cengage Learning. pp. 22–67. ISBN 978-0-324-58162-1.

ordinary, least, squares, statistics, ordinary, least, squares, type, linear, least, squares, method, choosing, unknown, parameters, linear, regression, model, with, fixed, level, effects, linear, function, explanatory, variables, principle, least, squares, mi. In statistics ordinary least squares OLS is a type of linear least squares method for choosing the unknown parameters in a linear regression model with fixed level one effects of a linear function of a set of explanatory variables by the principle of least squares minimizing the sum of the squares of the differences between the observed dependent variable values of the variable being observed in the input dataset and the output of the linear function of the independent variable Geometrically this is seen as the sum of the squared distances parallel to the axis of the dependent variable between each data point in the set and the corresponding point on the regression surface the smaller the differences the better the model fits the data The resulting estimator can be expressed by a simple formula especially in the case of a simple linear regression in which there is a single regressor on the right side of the regression equation The OLS estimator is consistent for the level one fixed effects when the regressors are exogenous and forms perfect colinearity rank condition consistent for the variance estimate of the residuals when regressors have finite fourth moments 1 and by the Gauss Markov theorem optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated Under these conditions the method of OLS provides minimum variance mean unbiased estimation when the errors have finite variances Under the additional assumption that the errors are normally distributed with zero mean OLS is the maximum likelihood estimator that outperforms any non linear unbiased estimator Contents 1 Linear model 1 1 Matrix vector formulation 2 Estimation 2 1 Simple linear regression model 3 Alternative derivations 3 1 Projection 3 2 Maximum likelihood 3 3 Generalized method of moments 4 Properties 4 1 Assumptions 4 1 1 Classical linear regression model 4 1 2 Independent and identically distributed iid 4 1 3 Time series model 4 2 Finite sample properties 4 2 1 Assuming normality 4 2 2 Influential observations 4 2 3 Partitioned regression 4 2 4 Constrained estimation 4 3 Large sample properties 4 3 1 Intervals 4 3 2 Hypothesis testing 5 Example with real data 5 1 Sensitivity to rounding 6 Another example with less real data 6 1 Problem statement 6 2 Solution 7 See also 8 References 9 Further readingLinear model EditMain article Linear regression model nbsp Okun s law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate Here the ordinary least squares method is used to construct the regression line describing this law Suppose the data consists of n displaystyle n nbsp observations x i y i i 1 n displaystyle left mathbf x i y i right i 1 n nbsp Each observation i displaystyle i nbsp includes a scalar response y i displaystyle y i nbsp and a column vector x i displaystyle mathbf x i nbsp of p displaystyle p nbsp parameters regressors i e x i x i 1 x i 2 x i p T displaystyle mathbf x i left x i1 x i2 dots x ip right operatorname T nbsp In a linear regression model the response variable y i displaystyle y i nbsp is a linear function of the regressors y i b 1 x i 1 b 2 x i 2 b p x i p e i displaystyle y i beta 1 x i1 beta 2 x i2 cdots beta p x ip 
varepsilon i nbsp or in vector form y i x i T b e i displaystyle y i mathbf x i operatorname T boldsymbol beta varepsilon i nbsp where x i displaystyle mathbf x i nbsp as introduced previously is a column vector of the i displaystyle i nbsp th observation of all the explanatory variables b displaystyle boldsymbol beta nbsp is a p 1 displaystyle p times 1 nbsp vector of unknown parameters and the scalar e i displaystyle varepsilon i nbsp represents unobserved random variables errors of the i displaystyle i nbsp th observation e i displaystyle varepsilon i nbsp accounts for the influences upon the responses y i displaystyle y i nbsp from sources other than the explanatory variables x i displaystyle mathbf x i nbsp This model can also be written in matrix notation as y X b e displaystyle mathbf y mathbf X boldsymbol beta boldsymbol varepsilon nbsp where y displaystyle mathbf y nbsp and e displaystyle boldsymbol varepsilon nbsp are n 1 displaystyle n times 1 nbsp vectors of the response variables and the errors of the n displaystyle n nbsp observations and X displaystyle mathbf X nbsp is an n p displaystyle n times p nbsp matrix of regressors also sometimes called the design matrix whose row i displaystyle i nbsp is x i T displaystyle mathbf x i operatorname T nbsp and contains the i displaystyle i nbsp th observations on all the explanatory variables Typically a constant term is included in the set of regressors X displaystyle mathbf X nbsp say by taking x i 1 1 displaystyle x i1 1 nbsp for all i 1 n displaystyle i 1 dots n nbsp The coefficient b 1 displaystyle beta 1 nbsp corresponding to this regressor is called the intercept Without the intercept the fitted line is forced to cross the origin when x i 0 displaystyle x i vec 0 nbsp Regressors do not have to be independent there can be any desired relationship between the regressors so long as it is not a linear relationship For instance we might suspect the response depends linearly both on a value and its square in which case we would include one regressor whose value is just the square of another regressor In that case the model would be quadratic in the second regressor but none the less is still considered a linear model because the model is still linear in the parameters b displaystyle boldsymbol beta nbsp Matrix vector formulation Edit Consider an overdetermined system j 1 p x i j b j y i i 1 2 n displaystyle sum j 1 p x ij beta j y i i 1 2 dots n nbsp of n displaystyle n nbsp linear equations in p displaystyle p nbsp unknown coefficients b 1 b 2 b p displaystyle beta 1 beta 2 dots beta p nbsp with n gt p displaystyle n gt p nbsp This can be written in matrix form as X b y displaystyle mathbf X boldsymbol beta mathbf y nbsp where X X 11 X 12 X 1 p X 21 X 22 X 2 p X n 1 X n 2 X n p b b 1 b 2 b p y y 1 y 2 y n displaystyle mathbf X begin bmatrix X 11 amp X 12 amp cdots amp X 1p X 21 amp X 22 amp cdots amp X 2p vdots amp vdots amp ddots amp vdots X n1 amp X n2 amp cdots amp X np end bmatrix qquad boldsymbol beta begin bmatrix beta 1 beta 2 vdots beta p end bmatrix qquad mathbf y begin bmatrix y 1 y 2 vdots y n end bmatrix nbsp Note for a linear model as above not all elements in X displaystyle mathbf X nbsp contains information on the data points The first column is populated with ones X i 1 1 displaystyle X i1 1 nbsp Only the other columns contain actual data So here p displaystyle p nbsp is equal to the number of regressors plus one Such a system usually has no exact solution so the goal is instead to find the coefficients b 
displaystyle boldsymbol beta nbsp which fit the equations best in the sense of solving the quadratic minimization problem b a r g m i n b S b displaystyle hat boldsymbol beta underset boldsymbol beta operatorname arg min S boldsymbol beta nbsp where the objective function S displaystyle S nbsp is given by S b i 1 n y i j 1 p X i j b j 2 y X b 2 displaystyle S boldsymbol beta sum i 1 n left y i sum j 1 p X ij beta j right 2 left mathbf y mathbf X boldsymbol beta right 2 nbsp A justification for choosing this criterion is given in Properties below This minimization problem has a unique solution provided that the p displaystyle p nbsp columns of the matrix X displaystyle mathbf X nbsp are linearly independent given by solving the so called normal equations X T X b X T y displaystyle left mathbf X operatorname T mathbf X right hat boldsymbol beta mathbf X operatorname T mathbf y nbsp The matrix X T X displaystyle mathbf X operatorname T mathbf X nbsp is known as the normal matrix or Gram matrix and the matrix X T y displaystyle mathbf X operatorname T mathbf y nbsp is known as the moment matrix of regressand by regressors 2 Finally b displaystyle hat boldsymbol beta nbsp is the coefficient vector of the least squares hyperplane expressed as b X T X 1 X T y displaystyle hat boldsymbol beta left mathbf X operatorname T mathbf X right 1 mathbf X operatorname T mathbf y nbsp or b b X T X 1 X T e displaystyle hat boldsymbol beta boldsymbol beta left mathbf X operatorname T mathbf X right 1 mathbf X operatorname T boldsymbol varepsilon nbsp Estimation EditSuppose b is a candidate value for the parameter vector b The quantity yi xiTb called the residual for the i th observation measures the vertical distance between the data point xi yi and the hyperplane y xTb and thus assesses the degree of fit between the actual data and the model The sum of squared residuals SSR also called the error sum of squares ESS or residual sum of squares RSS 3 is a measure of the overall model fit S b i 1 n y i x i T b 2 y X b T y X b displaystyle S b sum i 1 n y i x i operatorname T b 2 y Xb operatorname T y Xb nbsp where T denotes the matrix transpose and the rows of X denoting the values of all the independent variables associated with a particular value of the dependent variable are Xi xiT The value of b which minimizes this sum is called the OLS estimator for b The function S b is quadratic in b with positive definite Hessian and therefore this function possesses a unique global minimum at b b displaystyle b hat beta nbsp which can be given by the explicit formula 4 proof b argmin b R p S b X T X 1 X T y displaystyle hat beta operatorname argmin b in mathbb R p S b X operatorname T X 1 X operatorname T y nbsp The product N XT X is a Gram matrix and its inverse Q N 1 is the cofactor matrix of b 5 6 7 closely related to its covariance matrix Cb The matrix XT X 1 XT Q XT is called the Moore Penrose pseudoinverse matrix of X This formulation highlights the point that estimation can be carried out if and only if there is no perfect multicollinearity between the explanatory variables which would cause the gram matrix to have no inverse After we have estimated b the fitted values or predicted values from the regression will be y X b P y displaystyle hat y X hat beta Py nbsp where P X XTX 1XT is the projection matrix onto the space V spanned by the columns of X This matrix P is also sometimes called the hat matrix because it puts a hat onto the variable y Another matrix closely related to P is the annihilator matrix M In P 
this is a projection matrix onto the space orthogonal to V Both matrices P and M are symmetric and idempotent meaning that P2 P and M2 M and relate to the data matrix X via identities PX X and MX 0 8 Matrix M creates the residuals from the regression e y y y X b M y M X b e M X b M e M e displaystyle hat varepsilon y hat y y X hat beta My M X beta varepsilon MX beta M varepsilon M varepsilon nbsp Using these residuals we can estimate the value of s2 using the reduced chi squared statistic s 2 e T e n p M y T M y n p y T M T M y n p y T M y n p S b n p s 2 n p n s 2 displaystyle s 2 frac hat varepsilon mathrm T hat varepsilon n p frac My mathrm T My n p frac y mathrm T M mathrm T My n p frac y mathrm T My n p frac S hat beta n p qquad hat sigma 2 frac n p n s 2 nbsp The denominator n p is the statistical degrees of freedom The first quantity s2 is the OLS estimate for s2 whereas the second s 2 displaystyle scriptstyle hat sigma 2 nbsp is the MLE estimate for s2 The two estimators are quite similar in large samples the first estimator is always unbiased while the second estimator is biased but has a smaller mean squared error In practice s2 is used more often since it is more convenient for the hypothesis testing The square root of s2 is called the regression standard error 9 standard error of the regression 10 11 or standard error of the equation 8 It is common to assess the goodness of fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto X The coefficient of determination R2 is defined as a ratio of explained variance to the total variance of the dependent variable y in the cases where the regression sum of squares equals the sum of squares of residuals 12 R 2 y i y 2 y i y 2 y T P T L P y y T L y 1 y T M y y T L y 1 R S S T S S displaystyle R 2 frac sum hat y i overline y 2 sum y i overline y 2 frac y mathrm T P mathrm T LPy y mathrm T Ly 1 frac y mathrm T My y mathrm T Ly 1 frac rm RSS rm TSS nbsp where TSS is the total sum of squares for the dependent variable L I n 1 n J n textstyle L I n frac 1 n J n nbsp and J n textstyle J n nbsp is an n n matrix of ones L displaystyle L nbsp is a centering matrix which is equivalent to regression on a constant it simply subtracts the mean from a variable In order for R2 to be meaningful the matrix X of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept In that case R2 will always be a number between 0 and 1 with values close to 1 indicating a good degree of fit The variance in the prediction of the independent variable as a function of the dependent variable is given in the article Polynomial least squares Simple linear regression model Edit Main article Simple linear regression If the data matrix X contains only two variables a constant and a scalar regressor xi then this is called the simple regression model This case is often considered in the beginner statistics classes as it provides much simpler formulas even suitable for manual calculation The parameters are commonly denoted as a b y i a b x i e i displaystyle y i alpha beta x i varepsilon i nbsp The least squares estimates in this case are given by simple formulas b i 1 n x i x y i y i 1 n x i x 2 a y b x displaystyle begin aligned widehat beta amp frac sum i 1 n x i bar x y i bar y sum i 1 n x i bar x 2 2pt widehat alpha amp bar y widehat beta bar x end aligned nbsp Alternative derivations EditIn the previous section the least squares estimator b 
displaystyle hat beta nbsp was obtained as a value that minimizes the sum of squared residuals of the model However it is also possible to derive the same estimator from other approaches In all cases the formula for OLS estimator remains the same b XTX 1XTy the only difference is in how we interpret this result Projection Edit nbsp OLS estimation can be viewed as a projection onto the linear space spanned by the regressors Here each of X 1 displaystyle X 1 nbsp and X 2 displaystyle X 2 nbsp refers to a column of the data matrix This section may need to be cleaned up It has been merged from Linear least squares mathematics For mathematicians OLS is an approximate solution to an overdetermined system of linear equations Xb y where b is the unknown Assuming the system cannot be solved exactly the number of equations n is much larger than the number of unknowns p we are looking for a solution that could provide the smallest discrepancy between the right and left hand sides In other words we are looking for the solution that satisfies b a r g min b y X b displaystyle hat beta rm arg min beta lVert mathbf y mathbf X boldsymbol beta rVert nbsp where is the standard L2 norm in the n dimensional Euclidean space Rn The predicted quantity Xb is just a certain linear combination of the vectors of regressors Thus the residual vector y Xb will have the smallest length when y is projected orthogonally onto the linear subspace spanned by the columns of X The OLS estimator b displaystyle hat beta nbsp in this case can be interpreted as the coefficients of vector decomposition of y Py along the basis of X In other words the gradient equations at the minimum can be written as y X b X 0 displaystyle mathbf y mathbf X hat boldsymbol beta top mathbf X 0 nbsp A geometrical interpretation of these equations is that the vector of residuals y X b displaystyle mathbf y X hat boldsymbol beta nbsp is orthogonal to the column space of X since the dot product y X b X v displaystyle mathbf y mathbf X hat boldsymbol beta cdot mathbf X mathbf v nbsp is equal to zero for any conformal vector v This means that y X b displaystyle mathbf y mathbf X boldsymbol hat beta nbsp is the shortest of all possible vectors y X b displaystyle mathbf y mathbf X boldsymbol beta nbsp that is the variance of the residuals is the minimum possible This is illustrated at the right Introducing g displaystyle hat boldsymbol gamma nbsp and a matrix K with the assumption that a matrix X K displaystyle mathbf X mathbf K nbsp is non singular and KT X 0 cf Orthogonal projections the residual vector should satisfy the following equation r y X b K g displaystyle hat mathbf r mathbf y mathbf X hat boldsymbol beta mathbf K hat boldsymbol gamma nbsp The equation and solution of linear least squares are thus described as follows y X K b g b g X K 1 y X X 1 X K K 1 K y displaystyle begin aligned mathbf y amp begin bmatrix mathbf X amp mathbf K end bmatrix begin bmatrix hat boldsymbol beta hat boldsymbol gamma end bmatrix Rightarrow begin bmatrix hat boldsymbol beta hat boldsymbol gamma end bmatrix amp begin bmatrix mathbf X amp mathbf K end bmatrix 1 mathbf y begin bmatrix left mathbf X top mathbf X right 1 mathbf X top left mathbf K top mathbf K right 1 mathbf K top end bmatrix mathbf y end aligned nbsp Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset 13 Although this way of calculation is more computationally expensive it provides a better 
Maximum likelihood

The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.[14] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson.[15] From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.[16]

Generalized method of moments

In the iid case the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions

\mathrm{E}\big[\, x_i \left( y_i - x_i^{\operatorname T}\beta \right) \big] = 0.

These moment conditions state that the regressors should be uncorrelated with the errors. Since x_i is a p-vector, the number of moment conditions is equal to the dimension of the parameter vector β, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix. Note that the original strict exogeneity assumption E[ε_i | x_i] = 0 implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function f, the moment condition E[f(x_i)·ε_i] = 0 will hold. However, it can be shown using the Gauss–Markov theorem that the optimal choice of function f is to take f(x) = x, which results in the moment equation posted above.
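As a quick illustration (with synthetic data), the sample analogue of these exactly identified moment conditions, \frac{1}{n}\sum_i x_i(y_i - x_i^{\mathrm T}b) = 0, has the OLS formula as its unique solution:

```python
import numpy as np

# Sample analogue of the moment conditions E[x_i (y_i - x_i' b)] = 0:
# (1/n) X'(y - X b) = 0, whose solution is exactly the OLS estimator.
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_gmm = np.linalg.solve(X.T @ X / n, X.T @ y / n)   # solves the sample moment equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_gmm, beta_ols))                  # True
```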
Properties

Assumptions

See also: Linear regression § Assumptions

There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data in hand, and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors x_i are random and sampled together with the y_i's from some population, as in an observational study. This approach allows for a more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on X. All results stated in this article are within the random design framework.

Classical linear regression model

The classical model focuses on the "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS, and in which the number of observations is allowed to grow to infinity.

Correct specification. The linear functional form must coincide with the form of the actual data-generating process.

Strict exogeneity. The errors in the regression should have conditional mean zero:[17]

\operatorname{E}[\,\varepsilon \mid X\,] = 0.

The immediate consequence of the exogeneity assumption is that the errors have mean zero (E[ε] = 0, by the law of total expectation), and that the regressors are uncorrelated with the errors: E[Xᵀε] = 0. The exogeneity assumption is critical for the OLS theory. If it holds then the regressor variables are called exogenous. If it doesn't, then those regressors that are correlated with the error term are called endogenous,[18] and the OLS estimator becomes biased. In such a case the method of instrumental variables may be used to carry out inference.

No linear dependence. The regressors in X must all be linearly independent. Mathematically, this means that the matrix X must have full column rank almost surely:[19]

\Pr\big[\operatorname{rank}(X) = p\big] = 1.

Usually, it is also assumed that the regressors have finite moments up to at least the second moment. Then the matrix Q_{xx} = E[XᵀX / n] is finite and positive semi-definite. When this assumption is violated the regressors are called linearly dependent, or perfectly multicollinear (a numerical sketch of this case appears after the assumptions below). In such a case the value of the regression coefficient β cannot be learned, although prediction of y values is still possible for new values of the regressors that lie in the same linearly dependent subspace.

Spherical errors:[19]

\operatorname{Var}[\,\varepsilon \mid X\,] = \sigma^2 I_n,

where I_n is the identity matrix in dimension n, and σ² is a parameter which determines the variance of each observation. This σ² is considered a nuisance parameter in the model, although usually it is also estimated. If this assumption is violated then the OLS estimates are still valid, but no longer efficient. It is customary to split this assumption into two parts:

Homoscedasticity: E[ε_i² | X] = σ², which means that the error term has the same variance σ² in each observation. When this requirement is violated this is called heteroscedasticity; in such a case a more efficient estimator would be weighted least squares. If the errors have infinite variance then the OLS estimates will also have infinite variance (although by the law of large numbers they will nonetheless tend toward the true values so long as the errors have zero mean). In this case, robust estimation techniques are recommended.

No autocorrelation: the errors are uncorrelated between observations, E[ε_i ε_j | X] = 0 for i ≠ j. This assumption may be violated in the context of time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies. In such cases generalized least squares provides a better alternative than OLS. Another expression for autocorrelation is serial correlation.

Normality. It is sometimes additionally assumed that the errors have a normal distribution conditional on the regressors:[20]

\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_n).

This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established in the case when it does hold (especially in the area of hypothesis testing). Also, when the errors are normal, the OLS estimator is equivalent to the maximum likelihood estimator (MLE), and therefore it is asymptotically efficient in the class of all regular estimators. Importantly, the normality assumption applies only to the error terms; contrary to a popular misconception, the response (dependent) variable is not required to be normally distributed.[21]
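The following is a small, hedged sketch of the "no linear dependence" assumption, using made-up data: an exactly collinear column makes XᵀX singular, so β is not identifiable, although the fitted values remain well defined.

```python
import numpy as np

# Sketch of perfect multicollinearity: X'X is singular, so the normal
# equations have no unique solution, but the fitted values X @ b are the
# same for every minimizer b.
rng = np.random.default_rng(3)
n = 40
x1 = rng.normal(size=n)
x2 = 3.0 * x1                              # exact linear dependence on x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.1, size=n)

print(np.linalg.matrix_rank(X))            # 2 < 3: X lacks full column rank
beta_any, *_ = np.linalg.lstsq(X, y, rcond=None)   # one of infinitely many minimizers
shift = np.array([0.0, 3.0, -1.0])         # direction with X @ shift == 0
print(np.allclose(X @ beta_any, X @ (beta_any + shift)))   # True: same fitted values
```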
Independent and identically distributed (iid)

In some applications, especially with cross-sectional data, an additional assumption is imposed: that all observations are independent and identically distributed. This means that all observations are taken from a random sample, which makes all the assumptions listed earlier simpler and easier to interpret. Also this framework allows one to state asymptotic results (as the sample size n → ∞), which are understood as a theoretical possibility of fetching new independent observations from the data-generating process. The list of assumptions in this case is:

iid observations: (x_i, y_i) is independent from, and has the same distribution as, (x_j, y_j) for all i ≠ j;
no perfect multicollinearity: Q_{xx} = E[x_i x_iᵀ] is a positive-definite matrix;
exogeneity: E[ε_i | x_i] = 0;
homoscedasticity: Var[ε_i | x_i] = σ².

Time series model

The stochastic process {x_i, y_i} is stationary and ergodic; if {x_i, y_i} is nonstationary, OLS results are often spurious unless {x_i, y_i} is co-integrating.[22]
The regressors are predetermined: E[x_i ε_i] = 0 for all i = 1, …, n;
The p×p matrix Q_{xx} = E[x_i x_iᵀ] is of full rank, and hence positive-definite;
{x_i ε_i} is a martingale difference sequence, with a finite matrix of second moments Q_{xxε²} = E[ε_i² x_i x_iᵀ].

Finite sample properties

First of all, under the strict exogeneity assumption the OLS estimators \hat{\beta} and s² are unbiased, meaning that their expected values coincide with the true values of the parameters:[23]

\operatorname{E}[\,\hat{\beta} \mid X\,] = \beta, \qquad \operatorname{E}[\,s^2 \mid X\,] = \sigma^2.

If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.

The variance–covariance matrix (or simply covariance matrix) of \hat{\beta} is equal to[24]

\operatorname{Var}[\,\hat{\beta} \mid X\,] = \sigma^2 (X^{\operatorname T}X)^{-1} = \sigma^2 Q.

In particular, the standard error of each coefficient \hat{\beta}_j is equal to the square root of the j-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity σ² with its estimate s². Thus,

\widehat{\operatorname{s.e.}}(\hat{\beta}_j) = \sqrt{s^2 \left[(X^{\operatorname T}X)^{-1}\right]_{jj}}.

It can also be easily shown that the estimator \hat{\beta} is uncorrelated with the residuals from the model:[24]

\operatorname{Cov}[\,\hat{\beta}, \hat{\varepsilon} \mid X\,] = 0.

The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator \hat{\beta} is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood as follows: if we were to find some other estimator \tilde{\beta} which would be linear in y and unbiased, then[24]

\operatorname{Var}[\,\tilde{\beta} \mid X\,] - \operatorname{Var}[\,\hat{\beta} \mid X\,] \geq 0

in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ε, other, non-linear estimators may provide better results than OLS.
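These finite-sample formulas can be checked by simulation. A minimal sketch (synthetic fixed design; names are illustrative): compute the standard errors from s²(XᵀX)⁻¹ and verify unbiasedness and the covariance formula by Monte Carlo.

```python
import numpy as np

# Sketch of the finite-sample formulas: coefficient standard errors from
# s^2 (X'X)^{-1}, plus a small Monte Carlo check of unbiasedness.
rng = np.random.default_rng(4)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # fixed design
beta = np.array([1.0, -2.0, 0.5])
sigma = 0.7

def ols(y):
    return np.linalg.solve(X.T @ X, X.T @ y)

y = X @ beta + rng.normal(scale=sigma, size=n)
b = ols(y)
resid = y - X @ b
s2 = resid @ resid / (n - p)                        # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # estimated standard errors

draws = np.array([ols(X @ beta + rng.normal(scale=sigma, size=n)) for _ in range(5000)])
print(draws.mean(axis=0))   # close to beta: unbiasedness
print(np.cov(draws.T))      # close to sigma^2 (X'X)^{-1}
print(se)
```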
Assuming normality

The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if you are willing to assume that the normality assumption holds (that is, that ε ~ N(0, σ²I_n)), then additional properties of the OLS estimators can be stated.

The estimator \hat{\beta} is normally distributed, with mean and variance as given before:[25]

\hat{\beta} \sim \mathcal{N}\big(\beta,\ \sigma^2 (X^{\mathrm T}X)^{-1}\big).

This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators.[16] Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.

The estimator s² will be proportional to the chi-squared distribution:[26]

s^2 \sim \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p}.

The variance of this estimator is equal to 2σ⁴/(n − p), which does not attain the Cramér–Rao bound of 2σ⁴/n. However, it was shown that there are no unbiased estimators of σ² with variance smaller than that of the estimator s².[27] If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of mean squared error) estimator in this class will be \tilde{\sigma}^2 = \mathrm{SSR}/(n - p + 2), which even beats the Cramér–Rao bound in the case when there is only one regressor (p = 1).[28]

Moreover, the estimators \hat{\beta} and s² are independent,[29] a fact which comes in useful when constructing the t- and F-tests for the regression.

Influential observations

Main article: Influential observation
See also: Leverage (statistics)

As was mentioned before, the estimator \hat{\beta} is linear in y, meaning that it represents a linear combination of the dependent variables y_i. The weights in this linear combination are functions of the regressors X, and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific j-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for β will be equal to[30]

\hat{\beta}^{(j)} - \hat{\beta} = -\frac{1}{1-h_j}\,(X^{\mathrm T}X)^{-1} x_j \hat{\varepsilon}_j,

where h_j = x_j^{\mathrm T}(X^{\mathrm T}X)^{-1}x_j is the j-th diagonal element of the hat matrix P, and x_j is the vector of regressors corresponding to the j-th observation. Similarly, the change in the predicted value for the j-th observation resulting from omitting that observation from the dataset will be equal to[30]

\hat{y}_j^{(j)} - \hat{y}_j = x_j^{\mathrm T}\hat{\beta}^{(j)} - x_j^{\operatorname T}\hat{\beta} = -\frac{h_j}{1-h_j}\,\hat{\varepsilon}_j.

From the properties of the hat matrix, 0 ≤ h_j ≤ 1, and they sum up to p, so that on average h_j ≈ p/n. These quantities h_j are called the leverages, and observations with high h_j are called leverage points.[31] Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
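The leave-one-out identities above are exact and easy to verify numerically. A brief sketch with synthetic data: compute the leverages as the diagonal of the hat matrix, drop the highest-leverage observation, and compare the actual change in the coefficients with the closed-form expression.

```python
import numpy as np

# Verify the leverage and leave-one-out formulas on synthetic data.
rng = np.random.default_rng(5)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)       # leverages: diag of X (X'X)^{-1} X'
print(h.sum())                                    # equals p

j = int(np.argmax(h))                             # highest-leverage observation
delta_formula = -XtX_inv @ X[j] * resid[j] / (1 - h[j])
beta_drop = np.linalg.lstsq(np.delete(X, j, axis=0), np.delete(y, j), rcond=None)[0]
print(np.allclose(beta_drop - beta_hat, delta_formula))   # True
```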
Partitioned regression

Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form

y = X_1\beta_1 + X_2\beta_2 + \varepsilon,

where X_1 and X_2 have dimensions n×p_1 and n×p_2, and β_1, β_2 are p_1×1 and p_2×1 vectors, with p_1 + p_2 = p.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals \hat{\varepsilon} and the OLS estimate \hat{\beta}_2 will be numerically identical to the residuals and the OLS estimate for β_2 in the following regression:[32]

M_1 y = M_1 X_2 \beta_2 + \eta,

where M_1 is the annihilator matrix for the regressors X_1.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor, and then running the regression for the de-meaned variables but without the constant term.
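A quick numerical check of the theorem (synthetic data, illustrative names): regressing M₁y on M₁X₂ reproduces the β₂ block of the full regression exactly.

```python
import numpy as np

# Numerical check of the Frisch-Waugh-Lovell theorem.
rng = np.random.default_rng(6)
n, p1, p2 = 50, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, p2))
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([-1.0, 2.0]) + rng.normal(size=n)

X = np.hstack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)     # annihilator for X1
beta2_fwl, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)

print(np.allclose(beta_full[p1:], beta2_fwl))   # True
```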
Constrained estimation

Main article: Ridge regression

Suppose it is known that the coefficients in the regression satisfy a system of linear equations

A\colon\quad Q^{\operatorname T}\beta = c,

where Q is a p×q matrix of full rank, and c is a q×1 vector of known constants, with q < p. In this case least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint A. The constrained least squares (CLS) estimator can be given by an explicit formula:[33]

\hat{\beta}^{c} = \hat{\beta} - (X^{\operatorname T}X)^{-1} Q \Big( Q^{\operatorname T}(X^{\operatorname T}X)^{-1} Q \Big)^{-1} \big( Q^{\operatorname T}\hat{\beta} - c \big).

This expression for the constrained estimator is valid as long as the matrix XᵀX is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, β will not be identifiable. However, it may happen that adding the restriction A makes β identifiable, in which case one would like to find the formula for the estimator. The estimator is equal to[34]

\hat{\beta}^{c} = R (R^{\operatorname T}X^{\operatorname T}XR)^{-1} R^{\operatorname T}X^{\operatorname T}y + \Big( I_p - R (R^{\operatorname T}X^{\operatorname T}XR)^{-1} R^{\operatorname T}X^{\operatorname T}X \Big) Q (Q^{\operatorname T}Q)^{-1} c,

where R is a p×(p − q) matrix such that the matrix [Q R] is non-singular, and RᵀQ = 0. Such a matrix can always be found, although generally it is not unique. The second formula coincides with the first in the case when XᵀX is invertible.[34]

Large sample properties

The least squares estimators are point estimates of the linear regression model parameters β. However, generally we also want to know how close those estimates might be to the true values of the parameters. In other words, we want to construct the interval estimates.

Since we haven't made any assumption about the distribution of the error term ε_i, it is impossible to infer the distribution of the estimators \hat{\beta} and \hat{\sigma}^2. Nevertheless, we can apply the central limit theorem to derive their asymptotic properties as the sample size n goes to infinity. While the sample size is necessarily finite, it is customary to assume that n is "large enough" so that the true distribution of the OLS estimator is close to its asymptotic limit.

We can show that under the model assumptions, the least squares estimator for β is consistent (that is, \hat{\beta} converges in probability to β) and asymptotically normal:

(\hat{\beta} - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\ \sigma^2 Q_{xx}^{-1}\big),

where Q_{xx} = X^{\operatorname T}X.

Intervals

Main articles: Confidence interval and Prediction interval

Using this asymptotic distribution, approximate two-sided confidence intervals for the j-th component of the vector \hat{\beta} can be constructed as

\beta_j \in \bigg[\ \hat{\beta}_j \pm q^{\mathcal{N}(0,1)}_{1-\frac{\alpha}{2}} \sqrt{\hat{\sigma}^2 \left[Q_{xx}^{-1}\right]_{jj}}\ \bigg]

at the 1 − α confidence level, where q denotes the quantile function of the standard normal distribution, and [·]_{jj} is the j-th diagonal element of a matrix.

Similarly, the least squares estimator for σ² is also consistent and asymptotically normal (provided that the fourth moment of ε_i exists), with limiting distribution

(\hat{\sigma}^2 - \sigma^2)\ \xrightarrow{d}\ \mathcal{N}\big(0,\ \operatorname{E}[\varepsilon_i^4] - \sigma^4\big).

These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example, consider the problem of prediction. Suppose x_0 is some point within the domain of distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity y_0 = x_0^{\mathrm T}\beta, whereas the predicted response is \hat{y}_0 = x_0^{\mathrm T}\hat{\beta}. Clearly the predicted response is a random variable; its distribution can be derived from that of \hat{\beta}:

(\hat{y}_0 - y_0)\ \xrightarrow{d}\ \mathcal{N}\big(0,\ \sigma^2 x_0^{\mathrm T} Q_{xx}^{-1} x_0\big),

which allows confidence intervals for the mean response y_0 to be constructed:

y_0 \in \bigg[\ x_0^{\mathrm T}\hat{\beta} \pm q^{\mathcal{N}(0,1)}_{1-\frac{\alpha}{2}} \sqrt{\hat{\sigma}^2 x_0^{\mathrm T} Q_{xx}^{-1} x_0}\ \bigg]

at the 1 − α confidence level.

Hypothesis testing

Main article: Hypothesis testing

Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The null hypothesis of no explanatory value of the estimated regression is tested using an F-test. If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the alternative hypothesis, that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.

Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero, that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's t-statistic, as the ratio of the coefficient estimate to its standard error. If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.
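The following is a minimal sketch of these two tests and of the coefficient intervals above, on synthetic data. It uses the unbiased s² in place of \hat{\sigma}^2 (the difference is negligible for moderate n) and the normal quantile q_{0.975} ≈ 1.96 as the asymptotic approximation; the F-statistic assumes an intercept is included.

```python
import numpy as np

# t-statistics, ~95% confidence intervals, and the overall F-statistic.
rng = np.random.default_rng(7)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # includes an intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
rss = resid @ resid
s2 = rss / (n - p)

se = np.sqrt(s2 * np.diag(XtX_inv))           # standard errors of the coefficients
t_stats = beta_hat / se                       # t-statistics for H0: beta_j = 0

q = 1.959964                                  # 0.975 quantile of N(0, 1)
ci = np.column_stack([beta_hat - q * se, beta_hat + q * se])   # ~95% intervals

ess = np.sum((X @ beta_hat - y.mean()) ** 2)  # explained (model) sum of squares
F = (ess / (p - 1)) / (rss / (n - p))         # F-test of all slopes equal to zero

print(t_stats, ci, F)
```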
In addition, the Chow test is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.

Example with real data

See also: Simple linear regression § Example, and Linear least squares § Example

The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

Height (m):  1.47   1.50   1.52   1.55   1.57   1.60   1.63   1.65   1.68   1.70   1.73   1.75   1.78   1.80   1.83
Weight (kg): 52.21  53.12  54.48  55.84  57.20  58.57  59.93  61.29  63.11  64.47  66.28  68.10  69.92  72.19  74.46

(Figure: Scatterplot of the data; the relationship is slightly curved but close to linear.)

When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT². The regression model then becomes a multiple linear model:

w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i.

(Figure: Fitted regression.)

The output from most popular statistical packages will look similar to this:

Method: Least squares
Dependent variable: WEIGHT
Observations: 15

Parameter   Value       Std error   t-statistic   p-value
β1           128.8128   16.3083       7.8986      0.0000
β2          −143.1620   19.8332      −7.2183      0.0000
β3            61.9603    6.0084      10.3122      0.0000

R²                    0.9989      S.E. of regression    0.2516
Adjusted R²           0.9987      Model sum of sq.      692.61
Log-likelihood        1.0890      Residual sum of sq.   0.7595
Durbin–Watson stat.   2.1013      Total sum of sq.      693.37
Akaike criterion      0.2548      F-statistic           5471.2
Schwarz criterion     0.3964      p-value (F-stat)      0.0000
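A short sketch that reproduces the fit above directly from the tabulated data (the printed values should be close to the coefficient table and summary statistics shown):

```python
import numpy as np

# Quadratic OLS fit of weight on height for the data set above.
height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68,
                   1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

X = np.column_stack([np.ones_like(height), height, height ** 2])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)   # ~ (128.81, -143.16, 61.96)

resid = weight - X @ beta
n, p = X.shape
s2 = resid @ resid / (n - p)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors of the coefficients
R2 = 1 - (resid @ resid) / np.sum((weight - weight.mean()) ** 2)

print(beta, se, np.sqrt(s2), R2)   # compare with the package output shown above
```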
In this table:

The Value column gives the least squares estimates of the parameters β_j.

The Std error column shows standard errors of each coefficient estimate: \hat{\sigma}_j = \big(\hat{\sigma}^2 [Q_{xx}^{-1}]_{jj}\big)^{1/2}.

The t-statistic and p-value columns are testing whether any of the coefficients might be equal to zero. The t-statistic is calculated simply as t = \hat{\beta}_j / \hat{\sigma}_j. If the errors ε follow a normal distribution, t follows a Student t-distribution. Under weaker conditions, t is asymptotically normal. Large values of t indicate that the null hypothesis can be rejected and that the corresponding coefficient is not zero. The second column, p-value, expresses the results of the hypothesis test as a significance level. Conventionally, p-values smaller than 0.05 are taken as evidence that the population coefficient is nonzero.

R-squared is the coefficient of determination indicating the goodness of fit of the regression. This statistic will be equal to one if the fit is perfect, and to zero when the regressors X have no explanatory power whatsoever. This is a biased estimate of the population R-squared, and will never decrease if additional regressors are added, even if they are irrelevant.

Adjusted R-squared is a slightly modified version of R², designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than R², can decrease as new regressors are added, and can even be negative for poorly fitting models:

\overline{R}^2 = 1 - \frac{n-1}{n-p}(1 - R^2)

Log-likelihood is calculated under the assumption that the errors follow a normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests.

Durbin–Watson statistic tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value smaller than 2 will be evidence of positive correlation.

Akaike information criterion and Schwarz criterion are both used for model selection. Generally, when comparing two alternative models, smaller values of one of these criteria will indicate a better model.[35]

Standard error of regression is an estimate of σ, the standard error of the error term.

Total sum of squares, model sum of squares, and residual sum of squares tell us how much of the initial variation in the sample was explained by the regression.

F-statistic tries to test the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an F(p−1, n−p) distribution under the null hypothesis and normality assumption, and its p-value indicates the probability that the hypothesis is indeed true. Note that when errors are not normal, this statistic becomes invalid, and other tests such as the Wald test or the LR test should be used.

(Figure: Residuals plot.)

Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. These are some of the common diagnostic plots:

Residuals against the explanatory variables in the model. A non-linear relation between these variables suggests that the linearity of the conditional mean function may not hold. Different levels of variability in the residuals for different levels of the explanatory variables suggests possible heteroscedasticity.
Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
Residuals against the fitted values, \hat{y}.
Residuals against the preceding residual. This plot may identify serial correlations in the residuals.

An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.

Sensitivity to rounding

Main article: Errors-in-variables models
See also: Quantization error model

This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm, this is not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:

                                        Const       Height      Height²
Converted to metric with rounding       128.8128    −143.162     61.96033
Converted to metric without rounding    119.0205    −131.5076    58.5046

(Figure: Residuals to a quadratic fit for correctly and incorrectly converted data.)

Using either of these equations to predict the weight of a 5'6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.
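The rounding comparison can be reproduced with a few lines of code: recover the original whole-inch heights, redo the metric conversion without rounding, and refit the quadratic model (the printed coefficients should be close to the two rows of the table above).

```python
import numpy as np

# Sensitivity to rounding: refit with and without the rounded metric conversion.
height_rounded = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                           1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

inches = np.round(height_rounded / 0.0254)      # original heights in whole inches
height_exact = inches * 0.0254                  # exact metric conversion, no rounding

def quad_fit(h, w):
    X = np.column_stack([np.ones_like(h), h, h ** 2])
    return np.linalg.lstsq(X, w, rcond=None)[0]

print(quad_fit(height_rounded, weight))   # ~ (128.81, -143.16, 61.96)
print(quad_fit(height_exact, weight))     # ~ (119.02, -131.51, 58.50)
```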
While this may look innocuous in the middle of the data range, it could become significant at the extremes, or in the case where the fitted model is used to project outside the data range (extrapolation). This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the x and y errors.

Another example with less real data

Problem statement

We can use the least squares mechanism to figure out the equation of a two-body orbit in polar base co-ordinates. The equation typically used is

r(\theta) = \frac{p}{1 - e\cos(\theta)},

where r(θ) is the radius of how far the object is from one of the bodies. In the equation the parameters p and e are used to determine the path of the orbit. We have measured the following data:

θ (in degrees):  43       45       52       93       108      116
r(θ):            4.7126   4.5542   4.0419   2.2187   1.8910   1.7599

We need to find the least-squares approximation of e and p for the given data.

Solution

First we need to represent e and p in a linear form. So we are going to rewrite the equation r(θ) as

\frac{1}{r(\theta)} = \frac{1}{p} - \frac{e}{p}\cos(\theta).

Now we can use this form to represent our observational data as

A^{\mathrm T}A \binom{x}{y} = A^{\mathrm T}b,

where x is \frac{1}{p} and y is \frac{e}{p}, A is constructed with the first column being the coefficient of \frac{1}{p} and the second column being the coefficient of \frac{e}{p}, and b is the values for the respective \frac{1}{r(\theta)}, so

A = \begin{bmatrix} 1 & -0.731354 \\ 1 & -0.707107 \\ 1 & -0.615661 \\ 1 & 0.052336 \\ 1 & 0.309017 \\ 1 & 0.438371 \end{bmatrix}

and

b = \begin{bmatrix} 0.21220 \\ 0.21958 \\ 0.24741 \\ 0.45071 \\ 0.52883 \\ 0.56820 \end{bmatrix}.

On solving we get

\binom{x}{y} = \binom{0.43478}{0.30435},

so p = \frac{1}{x} = 2.3000 and e = p \cdot y = 0.70001.
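The same solution can be obtained numerically. A short sketch: build the design matrix with columns [1, −cos θ], solve the normal equations, and recover p and e.

```python
import numpy as np

# Orbit fit: solve A'A [x, y]' = A'b and recover p and e.
theta_deg = np.array([43, 45, 52, 93, 108, 116])
r = np.array([4.7126, 4.5542, 4.0419, 2.2187, 1.8910, 1.7599])

A = np.column_stack([np.ones_like(r), -np.cos(np.radians(theta_deg))])
b = 1.0 / r

x, y = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations
p = 1.0 / x
e = p * y
print(p, e)                                # ~ 2.3000 and ~ 0.70001
```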
See also

Bayesian least squares
Fama–MacBeth regression
Nonlinear least squares
Numerical methods for linear least squares
Nonlinear system identification

References

1. "What is a complete list of the usual assumptions for linear regression?". Cross Validated. Retrieved 2022-09-28.
2. Goldberger, Arthur S. (1964). "Classical Linear Regression". Econometric Theory. New York: John Wiley & Sons. p. 158. ISBN 0-471-31101-4.
3. Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 15.
4. Hayashi (2000), p. 18.
5. Ghilani, Charles D.; Wolf, Paul R. (12 June 2006). Adjustment Computations: Spatial Data Analysis. ISBN 9780471697282.
6. Hofmann-Wellenhof, Bernhard; Lichtenegger, Herbert; Wasle, Elmar (20 November 2007). GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more. ISBN 9783211730171.
7. Xu, Guochang (5 October 2007). GPS: Theory, Algorithms and Applications. ISBN 9783540727156.
8. Hayashi (2000), p. 19.
9. Faraway, Julian (2000). Practical Regression and Anova using R.
10. Kenney, J.; Keeping, E. S. (1963). Mathematics of Statistics. van Nostrand. p. 187.
11. Zwillinger, D. (1995). Standard Mathematical Tables and Formulae. Chapman & Hall/CRC. p. 626. ISBN 0-8493-2479-3.
12. Hayashi (2000), p. 20.
13. Akbarzadeh, Vahab (7 May 2014). "Line Estimation".
14. Hayashi (2000), p. 49.
15. "Least Squares: Introduction". Massachusetts Institute of Technology / KeepNotes (keepnotes.com). Retrieved 2023-09-25.
16. Hayashi (2000), p. 52.
17. Hayashi (2000), p. 7.
18. Hayashi (2000), p. 187.
19. Hayashi (2000), p. 10.
20. Hayashi (2000), p. 34.
21. Williams, M. N.; Grajales, C. A. G.; Kurkiewicz, D. (2013). "Assumptions of multiple regression: Correcting two misconceptions". Practical Assessment, Research & Evaluation. 18 (11).
22. "Memento on EViews Output" (PDF). Retrieved 28 December 2020.
23. Hayashi (2000), pp. 27–30.
24. Hayashi (2000), p. 27.
25. Amemiya, Takeshi (1985). Advanced Econometrics. Harvard University Press. p. 13. ISBN 9780674005600.
26. Amemiya (1985), p. 14.
27. Rao, C. R. (1973). Linear Statistical Inference and its Applications (2nd ed.). New York: J. Wiley & Sons. p. 319. ISBN 0-471-70823-2.
28. Amemiya (1985), p. 20.
29. Amemiya (1985), p. 27.
30. Davidson, Russell; MacKinnon, James G. (1993). Estimation and Inference in Econometrics. New York: Oxford University Press. p. 33. ISBN 0-19-506011-3.
31. Davidson & MacKinnon (1993), p. 36.
32. Davidson & MacKinnon (1993), p. 20.
33. Amemiya (1985), p. 21.
34. Amemiya (1985), p. 22.
35. Burnham, Kenneth P.; Anderson, David (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. ISBN 0-387-95364-7.

Further reading

Dougherty, Christopher (2002). Introduction to Econometrics (2nd ed.). New York: Oxford University Press. pp. 48–113. ISBN 0-19-877643-8.
Gujarati, Damodar N.; Porter, Dawn C. (2009). Basic Econometrics (5th ed.). Boston: McGraw-Hill Irwin. pp. 55–96. ISBN 978-0-07-337577-9.
Heij, Christiaan; Boer, Paul; Franses, Philip H.; Kloek, Teun; van Dijk, Herman K. (2004). Econometric Methods with Applications in Business and Economics (1st ed.). Oxford: Oxford University Press. pp. 76–115. ISBN 978-0-19-926801-6.
Hill, R. Carter; Griffiths, William E.; Lim, Guay C. (2008). Principles of Econometrics (3rd ed.). Hoboken, NJ: John Wiley & Sons. pp. 8–47. ISBN 978-0-471-72360-8.
Wooldridge, Jeffrey (2008). "The Simple Regression Model". Introductory Econometrics: A Modern Approach (4th ed.). Mason, OH: Cengage Learning. pp. 22–67. ISBN 978-0-324-58162-1.
