Logistic regression

In statistics, the logistic model (or logit model) is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression[1] (or logit regression) is estimating the parameters of a logistic model (the coefficients in the linear combination). Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling;[2] the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names. See § Background and § Definition for formal mathematics, and § Example for a worked example.

Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See § Example for worked details.

Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning, of a patient being healthy, etc. (see § Applications), and the logistic model has been the most commonly used model for binary regression since about 1970.[3] Binary variables can be generalized to categorical variables when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and the binary logistic regression generalized to multinomial logistic regression. If the multiple categories are ordered, one can use the ordinal logistic regression (for example the proportional odds ordinal logistic model[4]). See § Extensions for further extensions. The logistic regression model itself simply models probability of output in terms of input and does not perform statistical classification (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other; this is a common way to make a binary classifier.

Analogous linear models for binary variables with a different sigmoid function instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the probit model; see § Alternatives. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio. More abstractly, the logistic function is the natural parameter for the Bernoulli distribution, and in this sense is the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions of the data being modeled; see § Maximum entropy.

The parameters of a logistic regression are most commonly estimated by maximum-likelihood estimation (MLE). This does not have a closed-form expression, unlike linear least squares; see § Model fitting. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by ordinary least squares (OLS) plays for scalar responses: it is a simple, well-analyzed baseline model; see § Comparison with linear regression for discussion. The logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson,[5] beginning in Berkson (1944), where he coined "logit"; see § History.

Applications

General

Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression.[6] Many other medical scales used to assess severity of a patient have been developed using logistic regression.[7][8][9][10] Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).[11][12] Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party, based on age, income, sex, race, state of residence, votes in previous elections, etc.[13] The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product.[14][15] It is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc.[16] In economics, it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing. Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes.[17][18][19] These models help in the development of reliable disaster management plans and safer design for the built environment.

Supervised machine learning

Logistic regression is a supervised machine learning algorithm widely used for binary classification tasks, such as identifying whether an email is spam or not and diagnosing diseases by assessing the presence or absence of specific conditions based on patient test results. This approach utilizes the logistic (or sigmoid) function to transform a linear combination of input features into a probability value ranging between 0 and 1. This probability indicates the likelihood that a given input corresponds to one of two predefined categories. The essential mechanism of logistic regression is grounded in the logistic function's ability to model the probability of binary outcomes accurately. With its distinctive S-shaped curve, the logistic function effectively maps any real-valued number to a value within the 0 to 1 interval. This feature renders it particularly suitable for binary classification tasks, such as sorting emails into "spam" or "not spam". By calculating the probability that the dependent variable will be categorized into a specific group, logistic regression provides a probabilistic framework that supports informed decision-making.[20]
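As a concrete illustration, the short sketch below fits such a binary classifier with scikit-learn; the library, feature names, and data are illustrative assumptions and not part of the article.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical toy data: two features per email
    # (number of links, number of exclamation marks) and a spam label.
    X = np.array([[1, 0], [8, 5], [0, 1], [7, 9], [2, 1], [9, 7], [1, 2], [6, 6]])
    y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 1 = spam, 0 = not spam

    model = LogisticRegression()
    model.fit(X, y)

    # Predicted probability of spam for a new email with 5 links and 3 exclamation marks,
    # followed by the class label obtained with the default 0.5 cutoff.
    print(model.predict_proba([[5, 3]])[0, 1])
    print(model.predict([[5, 3]]))

The fitted object exposes the estimated intercept and slopes (model.intercept_, model.coef_), which are the log-odds coefficients discussed throughout this article.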

Example

Problem

As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question:

A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?

The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not cardinal numbers. If the problem was changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple regression analysis could be used.

The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

Hours (xk) 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Pass (yk) 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

We wish to fit a logistic function to the data consisting of the hours studied (xk) and the outcome of the test (yk = 1 for pass, 0 for fail). The data points are indexed by the subscript k, which runs from k = 1 to k = K = 20. The x variable is called the "explanatory variable", and the y variable is called the "categorical variable" consisting of two categories: "pass" or "fail", corresponding to the categorical values 1 and 0 respectively.

Model

Graph of a logistic regression curve fitted to the (xk, yk) data. The curve shows the probability of passing an exam versus hours studying.

The logistic function is of the form:

p(x) = \frac{1}{1 + e^{-(x - \mu)/s}}

where μ is a location parameter (the midpoint of the curve, where p(μ) = 1/2) and s is a scale parameter. This expression may be rewritten as:

p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}

where β0 = −μ/s and is known as the intercept (it is the vertical intercept or y-intercept of the line y = β0 + β1 x), and β1 = 1/s (inverse scale parameter or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, μ = −β0/β1 and s = 1/β1.

Fit

The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given xk and yk, write pk = p(xk). The pk are the probabilities that the corresponding yk will equal one and 1 − pk are the probabilities that they will be zero (see Bernoulli distribution). We wish to find the values of β0 and β1 which give the "best fit" to the data. In the case of linear regression, the sum of the squared deviations of the fit from the data points (yk), the squared error loss, is taken as a measure of the goodness of fit, and the best fit is obtained when that function is minimized.

The log loss for the k-th point, ℓk, is:

\ell_k = \begin{cases} -\ln p_k & \text{if } y_k = 1 \\ -\ln(1 - p_k) & \text{if } y_k = 0 \end{cases}

The log loss can be interpreted as the "surprisal" of the actual outcome yk relative to the prediction pk, and is a measure of information content. Log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when pk = 1 and yk = 1, or pk = 0 and yk = 0), and approaches infinity as the prediction gets worse (i.e., when yk = 1 and pk → 0 or yk = 0 and pk → 1), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any points, since yk is either 0 or 1, but 0 < pk < 1.

These can be combined into a single expression:

\ell_k = -y_k \ln p_k - (1 - y_k) \ln(1 - p_k)

This expression is more formally known as the cross-entropy of the predicted distribution (pk, 1 − pk) from the actual distribution (yk, 1 − yk), as probability distributions on the two-element space of (pass, fail).

The sum of these, the total loss, is the overall negative log-likelihood −ℓ, and the best fit is obtained for those choices of β0 and β1 for which −ℓ is minimized.

Alternatively, instead of minimizing the loss, one can maximize its negative, the (positive) log-likelihood:

\ell = \sum_{k: y_k = 1} \ln p_k + \sum_{k: y_k = 0} \ln(1 - p_k) = \sum_{k=1}^{K} \left( y_k \ln p_k + (1 - y_k) \ln(1 - p_k) \right)

or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:

L = \prod_{k: y_k = 1} p_k \; \prod_{k: y_k = 0} (1 - p_k)

This method is known as maximum likelihood estimation.

Parameter estimation

Since ℓ is nonlinear in β0 and β1, determining their optimum values will require numerical methods. One method of maximizing ℓ is to require the derivatives of ℓ with respect to β0 and β1 to be zero:

0 = \frac{\partial \ell}{\partial \beta_0} = \sum_{k=1}^{K} (y_k - p_k)

0 = \frac{\partial \ell}{\partial \beta_1} = \sum_{k=1}^{K} (y_k - p_k) x_k

and the maximization procedure can be accomplished by solving the above two equations for β0 and β1, which, again, will generally require the use of numerical methods.
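One concrete way to carry out this numerical maximization is Newton–Raphson iteration; the short NumPy sketch below (an illustrative choice of method and code, not prescribed by the article) applies it to the hours/pass data from the table above and reproduces the estimates quoted next.

    import numpy as np

    # Hours studied (x_k) and pass/fail outcomes (y_k) from the table above.
    x = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
    y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

    X = np.column_stack([np.ones_like(x), x])    # design matrix with an intercept column
    beta = np.zeros(2)                           # initial guess for (beta_0, beta_1)

    for _ in range(25):                          # Newton-Raphson iterations on the log-likelihood
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities p_k
        grad = X.T @ (y - p)                     # gradient of the log-likelihood
        hess = -X.T @ np.diag(p * (1 - p)) @ X   # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(hess, grad)

    print(beta)                                  # approximately [-4.08, 1.50]
    print(-beta[0] / beta[1], 1 / beta[1])       # mu ≈ 2.7, s ≈ 0.67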

The values of β0 and β1 which maximize ℓ and L using the above data are found to be:

\beta_0 \approx -4.1

\beta_1 \approx 1.5

which yields a value for μ and s of:

\mu = -\beta_0 / \beta_1 \approx 2.7

s = 1 / \beta_1 \approx 0.67

Predictions

The β0 and β1 coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam.

For example, for a student who studies 2 hours, entering the value x = 2 into the equation gives the estimated probability of passing the exam of 0.25:

t = \beta_0 + 2\beta_1 \approx -4.1 + 2 \cdot 1.5 = -1.1

p = \frac{1}{1 + e^{-t}} \approx 0.25 = \text{Probability of passing exam}

Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87:

t = \beta_0 + 4\beta_1 \approx -4.1 + 4 \cdot 1.5 = 1.9

p = \frac{1}{1 + e^{-t}} \approx 0.87 = \text{Probability of passing exam}

This table shows the estimated probability of passing the exam for several values of hours studying.

Hours of study (x)    Log-odds (t)    Odds (e^t)        Probability (p)
1                     −2.57           0.076 ≈ 1:13.1    0.07
2                     −1.07           0.34 ≈ 1:2.91     0.26
μ ≈ 2.7               0               1                 0.50
3                     0.44            1.55              0.61
4                     1.94            6.96              0.87
5                     3.45            31.4              0.97
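The probabilities in this table follow directly from the fitted coefficients; a quick check (a sketch using the rounded values β0 ≈ −4.1 and β1 ≈ 1.5, so small discrepancies from the table are expected):

    import numpy as np

    beta0, beta1 = -4.1, 1.5           # rounded estimates from the fit above
    for hours in [1, 2, 3, 4, 5]:
        t = beta0 + beta1 * hours      # log-odds
        p = 1 / (1 + np.exp(-t))       # probability of passing
        print(hours, round(t, 2), round(np.exp(t), 3), round(p, 2))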

Model evaluation

The logistic regression analysis gives the following output.

Coefficient Std. Error z-value p-value (Wald)
Intercept (β0) −4.1 1.8 −2.3 0.021
Hours (β1) 1.5 0.6 2.4 0.017

By the Wald test, the output indicates that hours studying is significantly associated with the probability of passing the exam (p = 0.017). Rather than the Wald method, the recommended method[21] to calculate the p-value for logistic regression is the likelihood-ratio test (LRT), which for these data gives p ≈ 0.00064 (see § Deviance and likelihood ratio tests below).

Generalizations

This simple model is an example of binary logistic regression, and has one explanatory variable and a binary categorical variable which can assume one of two categorical values. Multinomial logistic regression is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories.

Background

Figure 1. The standard logistic function σ(t); σ(t) ∈ (0, 1) for all t.

Definition of the logistic function

An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input t, and outputs a value between zero and one.[2] For the logit, this is interpreted as taking input log-odds and having output probability. The standard logistic function σ : ℝ → (0, 1) is defined as follows:

\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}

A graph of the logistic function on the t-interval (−6,6) is shown in Figure 1.

Let us assume that t is a linear function of a single explanatory variable x (the case where t is a linear combination of multiple explanatory variables is treated similarly). We can then express t as follows:

t = \beta_0 + \beta_1 x

And the general logistic function p : ℝ → (0, 1) can now be written as:

p(x) = \sigma(t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}

In the logistic model, p(x) is interpreted as the probability of the dependent variable Y equaling a success/case rather than a failure/non-case. It is clear that the response variables Yi are not identically distributed: P(Yi = 1 | X) differs from one data point Xi to another, though they are independent given design matrix X and shared parameters β.[11]

Definition of the inverse of the logistic function

We can now define the logit (log odds) function as the inverse g = σ⁻¹ of the standard logistic function. It is easy to see that it satisfies:

g(p(x)) = \sigma^{-1}(p(x)) = \operatorname{logit} p(x) = \ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x

and equivalently, after exponentiating both sides we have the odds:

\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}

Interpretation of these terms

In the above equations, the terms are as follows:

  • g is the logit function. The equation for g(p(x)) illustrates that the logit (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression.
  • ln denotes the natural logarithm.
  • p(x) is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for p(x) illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability p(x) ranges between 0 and 1.
  • β0 is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero).
  • β1x is the regression coefficient multiplied by some value of the predictor.
  • base e denotes the exponential function.
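The logit and the standard logistic function are inverses of each other, which is easy to verify numerically; a minimal sketch using scipy.special (an assumed dependency, not mentioned in the article):

    import numpy as np
    from scipy.special import expit, logit   # expit is the standard logistic function

    p = expit(0.8)                 # probability corresponding to log-odds t = 0.8
    print(p, logit(p))             # 0.6899..., 0.8  (round trip back to the log-odds)

    # The odds p/(1-p) equal e^t, matching the exponentiated form above.
    print(p / (1 - p), np.exp(0.8))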

Definition of the odds

The odds of the dependent variable equaling a case (given some linear combination   of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression and the logit is easily converted back into the odds.[2]

So we define odds of the dependent variable equaling a case (given some linear combination x of the predictors) as follows:

\text{odds} = e^{\beta_0 + \beta_1 x}

The odds ratio

For a continuous independent variable the odds ratio can be defined as:

\mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{\left(\dfrac{p(x+1)}{1 - p(x+1)}\right)}{\left(\dfrac{p(x)}{1 - p(x)}\right)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}

In simple terms, if we hypothetically get an odds ratio of 2 to 1, we can say: "For every one-unit increase in hours studied, the odds of passing (group 1) versus failing (group 0) are (expectedly) 2 to 1" (Denis, 2019).

This exponential relationship provides an interpretation for β1: The odds multiply by e^β1 for every 1-unit increase in x.[22]

For a binary independent variable the odds ratio is defined as ad/bc, where a, b, c and d are cells in a 2×2 contingency table.[23]
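Both forms are easy to check numerically; a small sketch (the 2×2 counts are made up, and the continuous-variable check reuses the exam coefficients β0 ≈ −4.1, β1 ≈ 1.5 from above):

    import numpy as np

    # Continuous predictor: the odds multiply by e^{beta_1} per extra hour studied.
    beta0, beta1 = -4.1, 1.5

    def odds(x):
        return np.exp(beta0 + beta1 * x)

    print(odds(3) / odds(2), np.exp(beta1))   # both ≈ 4.48

    # Binary predictor: odds ratio ad/bc from a hypothetical 2x2 contingency table
    #                 exposed   unexposed
    #   outcome yes     a=20       b=10
    #   outcome no      c=30       d=40
    a, b, c, d = 20, 10, 30, 40
    print(a * d / (b * c))                    # ≈ 2.67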

Multiple explanatory variables

If there are multiple explanatory variables, the above expression β0 + β1x can be revised to β0 + β1x1 + β2x2 + ⋯ + βmxm. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a multiple regression with m explanators; the parameters βj for all j = 0, 1, 2, ..., m are all estimated.

Again, the more traditional equations are:

\log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m

and

p = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m)}}

where usually b = e.

Definition

A dataset contains N points. Each point i consists of a set of m input variables x1,i ... xm,i (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable Yi (also known as a dependent variable, response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable.

As in linear regression, the outcome variables Yi are assumed to depend on the explanatory variables x1,i ... xm,i.

Explanatory variables

The explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables and discrete variables.

(Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".)
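For example, dummy coding a three-level categorical predictor can be done by hand or with a library such as pandas (an illustrative assumption; the variable and values here are hypothetical):

    import pandas as pd

    # Hypothetical categorical predictor with three levels.
    df = pd.DataFrame({"blood_type": ["A", "B", "O", "A", "O"]})

    # One indicator (dummy) column per level; each row contains a single 1.
    print(pd.get_dummies(df["blood_type"], prefix="blood_type"))

    # To avoid redundancy with the intercept, one level is usually dropped
    # and serves as the reference category.
    print(pd.get_dummies(df["blood_type"], prefix="blood_type", drop_first=True))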

Outcome variables

Formally, the outcomes Yi are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability pi that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

Y_i \mid x_{1,i}, \ldots, x_{m,i} \ \sim \ \operatorname{Bernoulli}(p_i)

\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}] = p_i

\Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) = \begin{cases} p_i & \text{if } y = 1 \\ 1 - p_i & \text{if } y = 0 \end{cases}

\Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) = p_i^{\,y} (1 - p_i)^{1-y}

The meanings of these four lines are:

  1. The first line expresses the probability distribution of each Yi : conditioned on the explanatory variables, it follows a Bernoulli distribution with parameter pi, the probability of the outcome of 1 for trial i. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success pi is not observed, only the outcome of an individual Bernoulli trial using that probability.
  2. The second line expresses the fact that the expected value of each Yi is equal to the probability of success pi, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success pi, then take the average of all the 1 and 0 outcomes, then the result would be close to pi. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.
  3. The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.
  4. The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that Yi can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either pi or 1 − pi, as in the previous line.

Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability pi using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear predictor function f(i) for a particular data point i is written as:

f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},

where β0, β1, ..., βm are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.

The model is usually put into a more compact form as follows:

  • The regression coefficients β0, β1, ..., βm are grouped into a single vector β of size m + 1.
  • For each data point i, an additional explanatory pseudo-variable x0,i is added, with a fixed value of 1, corresponding to the intercept coefficient β0.
  • The resulting explanatory variables x0,i, x1,i, ..., xm,i are then grouped into a single vector Xi of size m + 1.

This makes it possible to write the linear predictor function as follows:

f(i) = \boldsymbol{\beta} \cdot \mathbf{X}_i,

using the notation for a dot product between two vectors.

This is an example of an SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).

Many explanatory variables, two categories

The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables x1, x2, ... and any number of categorical values y = 0, 1, 2, ....

To begin with, we may consider a logistic model with M explanatory variables, x1, x2, ..., xM and, as in the example above, two categorical values (y = 0 and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds (also called logit) of the event that y = 1. This linear relationship may be extended to the case of M explanatory variables:

t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_M x_M

where t is the log-odds and βi are parameters of the model. An additional generalization has been introduced in which the base of the model (b) is not restricted to the Euler number e. In most applications, the base b of the logarithm is usually taken to be e. However, in some cases it can be easier to communicate results by working in base 2 or base 10.

For a more compact notation, we will specify the explanatory variables and the β coefficients as (M + 1)-dimensional vectors:

\boldsymbol{x} = \{x_0, x_1, x_2, \dots, x_M\}

\boldsymbol{\beta} = \{\beta_0, \beta_1, \beta_2, \dots, \beta_M\}

with an added explanatory variable x0 = 1. The logit may now be written as:

t = \sum_{m=0}^{M} \beta_m x_m

Solving for the probability p that y = 1 yields:

p(\boldsymbol{x}) = \frac{1}{1 + b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}} = S_b(\boldsymbol{\beta} \cdot \boldsymbol{x}),

where S_b is the sigmoid function with base b. The above formula shows that once the βm are fixed, we can easily compute either the log-odds that y = 1 for a given observation, or the probability that y = 1 for a given observation. The main use-case of a logistic model is to be given an observation x, and estimate the probability p that y = 1. The optimum beta coefficients may again be found by maximizing the log-likelihood. For K measurements, defining xk as the explanatory vector of the k-th measurement, and yk as the categorical outcome of that measurement, the log likelihood may be written in a form very similar to the simple single-variable case above:

\ell = \sum_{k=1}^{K} \left( y_k \ln p(\boldsymbol{x}_k) + (1 - y_k) \ln\left(1 - p(\boldsymbol{x}_k)\right) \right)

As in the simple example above, finding the optimum β parameters will require numerical methods. One useful technique is to equate the derivatives of the log likelihood with respect to each of the β parameters to zero, yielding a set of equations which will hold at the maximum of the log likelihood:

\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^{K} \left( y_k - p(\boldsymbol{x}_k) \right) x_{mk}

where xmk is the value of the xm explanatory variable from the k-th measurement.

Consider an example with M = 2 explanatory variables, base b = 10, and coefficients β0 = −3, β1 = 1, and β2 = 2 which have been determined by the above method. To be concrete, the model is:

t = \log_{10} \frac{p}{1-p} = -3 + x_1 + 2 x_2

p = \frac{1}{1 + 10^{-(-3 + x_1 + 2 x_2)}},

where p is the probability of the event that y = 1. This can be interpreted as follows (a numerical check follows the list):

  • β0 = −3 is the y-intercept. It is the log-odds of the event that y = 1, when the predictors x1 = x2 = 0. By exponentiating, we can see that when x1 = x2 = 0 the odds of the event that y = 1 are 1-to-1000, or 10^−3. Similarly, the probability of the event that y = 1 when x1 = x2 = 0 can be computed as 1/(1000 + 1) = 1/1001.
  • β1 = 1 means that increasing x1 by 1 increases the log-odds by 1. So if x1 increases by 1, the odds that y = 1 increase by a factor of 10^1 = 10. The probability of y = 1 has also increased, but it has not increased by as much as the odds have increased.
  • β2 = 2 means that increasing x2 by 1 increases the log-odds by 2. So if x2 increases by 1, the odds that y = 1 increase by a factor of 10^2 = 100. Note how the effect of x2 on the log-odds is twice as great as the effect of x1, but the effect on the odds is 10 times greater. But the effect on the probability of y = 1 is not as much as 10 times greater, it's only the effect on the odds that is 10 times greater.
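A quick numerical check of these statements, using the base-10 model above:

    # Model from above: log10(p/(1-p)) = -3 + x1 + 2*x2, i.e. base b = 10.
    def prob(x1, x2):
        t = -3 + x1 + 2 * x2              # log-odds (base 10)
        return 1 / (1 + 10 ** (-t))

    p0 = prob(0, 0)
    print(p0, p0 / (1 - p0))              # ≈ 0.000999 and odds = 10**-3 = 0.001

    # Increasing x1 by 1 multiplies the odds by 10; increasing x2 by 1 multiplies them by 100.
    p1, p2 = prob(1, 0), prob(0, 1)
    print((p1 / (1 - p1)) / (p0 / (1 - p0)))   # ≈ 10
    print((p2 / (1 - p2)) / (p0 / (1 - p0)))   # ≈ 100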

Multinomial logistic regression: Many explanatory variables and many categories

In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: The probability that the outcome was in category 1 was given by p(x) and the probability that the outcome was in category 0 was given by 1 − p(x). The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup.

In general, if we have M + 1 explanatory variables (including x0) and N + 1 categories, we will need N + 1 separate probabilities, one for each category, indexed by n, which describe the probability that the categorical outcome y will be in category y = n, conditional on the vector of covariates x. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base e, these probabilities are:

p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n \cdot \boldsymbol{x}}}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}} \quad \text{for } n = 1, 2, \dots, N

p_0(\boldsymbol{x}) = 1 - \sum_{n=1}^{N} p_n(\boldsymbol{x}) = \frac{1}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}}

Each of the probabilities except p0(x) will have their own set of regression coefficients βn. It can be seen that, as required, the sum of the pn(x) over all categories n is 1. The selection of p0(x) to be defined in terms of the other probabilities is artificial. Any of the probabilities could have been selected to be so defined. This special value of n is termed the "pivot index", and the log-odds (tn) are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:

t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}

Note also that for the simple case of N = 1, the two-category case is recovered, with p(x) = p1(x) and p0(x) = 1 − p1(x).

The log-likelihood that a particular set of K measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by k, let the k-th set of measured explanatory variables be denoted by xk and their categorical outcomes be denoted by yk, which can be equal to any integer in [0, N]. The log-likelihood is then:

\ell = \sum_{k=1}^{K} \sum_{n=0}^{N} \Delta(n, y_k) \, \ln\left(p_n(\boldsymbol{x}_k)\right)

where Δ(n, yk) is an indicator function which equals 1 if yk = n and zero otherwise. In the two-category case above, this indicator function was simply yk when n = 1 and 1 − yk when n = 0. This was convenient, but not necessary.[24] Again, the optimum beta coefficients may be found by maximizing the log-likelihood function generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients:

\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^{K} \left( \Delta(n, y_k) - p_n(\boldsymbol{x}_k) \right) x_{mk}

where βnm is the m-th coefficient of the βn vector and xmk is the m-th explanatory variable of the k-th measurement. Once the beta coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories.
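As an illustration of the pivot formulation, the sketch below computes the N + 1 category probabilities for a single observation with made-up coefficient vectors (category 0 is the pivot, so it has no coefficients of its own):

    import numpy as np

    x = np.array([1.0, 0.5, -1.2])           # observation, with x0 = 1 for the intercept
    betas = np.array([[0.3, 1.1, -0.4],      # beta_1 (category 1)
                      [-0.2, 0.8, 0.9]])     # beta_2 (category 2); category 0 is the pivot

    t = betas @ x                            # log-odds t_n relative to the pivot category
    denom = 1 + np.sum(np.exp(t))
    p = np.concatenate(([1 / denom], np.exp(t) / denom))   # [p_0, p_1, p_2]
    print(p, p.sum())                        # the probabilities sum to 1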

Interpretations

There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations.

As a generalized linear model

The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

\operatorname{logit}\left(\operatorname{E}[Y_i \mid x_{1,i}, \ldots, x_{m,i}]\right) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}

Written using the more compact notation described above, this is:

\operatorname{logit}\left(\operatorname{E}[Y_i \mid \mathbf{X}_i]\right) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \boldsymbol{\beta} \cdot \mathbf{X}_i

This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.

The intuition for transforming using the logit function (the natural log of the odds) was explained above. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over (−∞, +∞), thereby matching the potential range of the linear prediction function on the right side of the equation.

Both the probabilities pi and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.[25]
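In practice this is what most fitting routines do. For instance, scikit-learn's LogisticRegression (one possible library, mentioned here only as an illustration) applies an L2 penalty by default, with C the inverse regularization strength, and optimizes by L-BFGS:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                      # synthetic explanatory variables
    y = (X @ np.array([1.0, -2.0, 0.5])                # linear predictor
         + rng.logistic(size=200) > 0).astype(int)     # logistic noise, thresholded at 0

    # The L2 penalty corresponds to a zero-mean Gaussian prior on the coefficients;
    # smaller C means stronger regularization.
    model = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs")
    model.fit(X, y)
    print(model.intercept_, model.coef_)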

The interpretation of the βj parameter estimates is as the additive effect on the log of the odds for a unit change in the j-th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, e^β is the estimate of the odds of having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:

\operatorname{E}[Y_i \mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol{\beta} \cdot \mathbf{X}_i) = \frac{1}{1 + e^{-\boldsymbol{\beta} \cdot \mathbf{X}_i}}

The formula can also be written as a probability distribution (specifically, using a probability mass function):

\Pr(Y_i = y \mid \mathbf{X}_i) = p_i^{\,y} (1 - p_i)^{1-y} = \left(\frac{e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{y} \left(\frac{1}{1 + e^{\boldsymbol{\beta} \cdot \mathbf{X}_i}}\right)^{1-y}

As a latent-variable model

The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.

Imagine that, for each trial i, there is a continuous latent variable Yi* (i.e. an unobserved random variable) that is distributed as follows:

Y_i^{\ast} = \boldsymbol{\beta} \cdot \mathbf{X}_i + \varepsilon_i

where

\varepsilon_i \sim \operatorname{Logistic}(0, 1)

i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.

Then Yi can be viewed as an indicator for whether this latent variable is positive:

Y_i = \begin{cases} 1 & \text{if } Y_i^{\ast} > 0, \text{ i.e. } -\varepsilon_i < \boldsymbol{\beta} \cdot \mathbf{X}_i \\ 0 & \text{otherwise} \end{cases}

The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for Yi* regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of Yi* will be smaller by a factor of s than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same Yi choice.

(This predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)

It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables. This can be shown as follows, using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.

\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)

Then:

\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{\ast} > 0 \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol{\beta} \cdot \mathbf{X}_i + \varepsilon_i > 0) \\ &= \Pr(\varepsilon_i > -\boldsymbol{\beta} \cdot \mathbf{X}_i) \\ &= \Pr(\varepsilon_i < \boldsymbol{\beta} \cdot \mathbf{X}_i) && \text{(by symmetry of the logistic distribution)} \\ &= \operatorname{logit}^{-1}(\boldsymbol{\beta} \cdot \mathbf{X}_i) = p_i \end{aligned}

This formulation—which is standard in discrete choice models—makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).
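The equivalence can also be checked by simulation: drawing the latent variable with standard logistic noise and thresholding at zero reproduces the logistic-model probability (a sketch with made-up coefficients and one observation):

    import numpy as np

    rng = np.random.default_rng(1)
    beta = np.array([-1.0, 2.0])
    X_i = np.array([1.0, 0.7])                 # one observation, leading 1 for the intercept

    # Simulate the latent variable Y* = beta.X + epsilon with standard logistic noise.
    eps = rng.logistic(loc=0.0, scale=1.0, size=200_000)
    empirical = (beta @ X_i + eps > 0).mean()  # empirical Pr(Y = 1)

    p = 1 / (1 + np.exp(-(beta @ X_i)))        # logistic-model probability
    print(empirical, p)                        # both ≈ 0.60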

Two-way latent-variable model

Yet another formulation uses two separate latent variables:

Y_i^{0\ast} = \boldsymbol{\beta}_0 \cdot \mathbf{X}_i + \varepsilon_0

Y_i^{1\ast} = \boldsymbol{\beta}_1 \cdot \mathbf{X}_i + \varepsilon_1

where

\varepsilon_0 \sim \operatorname{EV}_1(0,1), \qquad \varepsilon_1 \sim \operatorname{EV}_1(0,1)

where EV1(0,1) is a standard type-1 extreme value distribution: i.e.

\Pr(\varepsilon_0 = x) = \Pr(\varepsilon_1 = x) = e^{-x} e^{-e^{-x}}

Then

Y_i = \begin{cases} 1 & \text{if } Y_i^{1\ast} > Y_i^{0\ast} \\ 0 & \text{otherwise} \end{cases}

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.
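One way to see that the mathematics works out is by simulation: with independent standard Gumbel (type-1 extreme value) errors on the two utilities, the empirical choice probability matches the logistic form (made-up coefficients, one observation):

    import numpy as np

    rng = np.random.default_rng(2)
    X_i = np.array([1.0, 0.5])                 # one observation, leading 1 for the intercept
    beta0 = np.array([0.2, -0.3])              # coefficients for outcome 0
    beta1 = np.array([1.0, 0.8])               # coefficients for outcome 1

    n = 200_000
    u0 = beta0 @ X_i + rng.gumbel(size=n)      # latent utility of outcome 0
    u1 = beta1 @ X_i + rng.gumbel(size=n)      # latent utility of outcome 1
    empirical = (u1 > u0).mean()               # Pr(Y = 1): choose the larger utility

    beta = beta1 - beta0                       # only the difference of coefficients matters
    print(empirical, 1 / (1 + np.exp(-(beta @ X_i))))   # both ≈ 0.79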

It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

\boldsymbol{\beta} = \boldsymbol{\beta}_1 - \boldsymbol{\beta}_0

\varepsilon = \varepsilon_1 - \varepsilon_0

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values; this effectively removes one degree of freedom. Another critical fact is that the difference of two type-1 extreme-value-distributed variables is a logistic distribution, i.e. ε = ε1 − ε0 ∼ Logistic(0, 1). We can demonstrate the equivalence as follows:

\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{1\ast} > Y_i^{0\ast} \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol{\beta}_1 \cdot \mathbf{X}_i + \varepsilon_1 > \boldsymbol{\beta}_0 \cdot \mathbf{X}_i + \varepsilon_0) \\ &= \Pr((\boldsymbol{\beta}_1 - \boldsymbol{\beta}_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) \\ &= \Pr(\boldsymbol{\beta} \cdot \mathbf{X}_i + \varepsilon > 0) \\ &= \Pr(\varepsilon > -\boldsymbol{\beta} \cdot \mathbf{X}_i) \\ &= \Pr(\varepsilon < \boldsymbol{\beta} \cdot \mathbf{X}_i) && \text{(by symmetry of the logistic distribution)} \\ &= \operatorname{logit}^{-1}(\boldsymbol{\beta} \cdot \mathbf{X}_i) = p_i \end{aligned}

Example

As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada). We would then use three latent variables, one for each choice. Then, in accordance with utility theory, we can interpret the latent variables as expressing the utility that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility, or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility since he or she is likely to own companies, which will have a harder time doing business in such an environment and probably lose money.

These intuitions can be expressed as follows:

Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables
                  Center-right    Center-left    Secessionist
High-income       strong +        strong −       strong −
Middle-income     moderate +      weak +         none
Low-income        none            strong +       none

This clearly shows that

  1. Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.
  2. Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable. Either it needs to be directly split up into ranges, or higher powers of income need to be added so that polynomial regression on income is effectively done.

As a "log-linear" model

Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.

Here, instead of writing the logit of the probabilities pi as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

\ln \Pr(Y_i = 0) = \boldsymbol{\beta}_0 \cdot \mathbf{X}_i - \ln Z

\ln \Pr(Y_i = 1) = \boldsymbol{\beta}_1 \cdot \mathbf{X}_i - \ln Z

Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations take a form that writes the logarithm of the associated probability as a linear predictor, with an extra term −ln Z at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

\Pr(Y_i = 0) = \frac{1}{Z} e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i}

\Pr(Y_i = 1) = \frac{1}{Z} e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}

In this form it is clear that the purpose of Z is to ensure that the resulting distribution over Yi is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply the sum of all un-normalized probabilities, and by dividing each probability by Z, the probabilities become "normalized". That is:

Z = e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}

and the resulting equations are

\Pr(Y_i = 0) = \frac{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}

\Pr(Y_i = 1) = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}

Or generally:

\Pr(Y_i = c) = \frac{e^{\boldsymbol{\beta}_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol{\beta}_h \cdot \mathbf{X}_i}}

This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit. This general formulation is exactly the softmax function as in

\Pr(Y_i = c) = \operatorname{softmax}(c, \boldsymbol{\beta}_0 \cdot \mathbf{X}_i, \boldsymbol{\beta}_1 \cdot \mathbf{X}_i, \dots)

To see that this is equivalent to the previous model, note that the above model is overspecified, in that Pr(Yi = 0) and Pr(Yi = 1) cannot be independently specified: rather Pr(Yi = 0) + Pr(Yi = 1) = 1, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of β0 and β1 will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:

\Pr(Y_i = 1) = \frac{e^{(\boldsymbol{\beta}_1 + \mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol{\beta}_0 + \mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol{\beta}_1 + \mathbf{C}) \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i} \, e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} \, e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i} \, e^{\mathbf{C} \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set β0 = 0. Then,

e^{\boldsymbol{\beta}_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1

and so

\Pr(Y_i = 1) = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol{\beta}_1 \cdot \mathbf{X}_i}} = p_i

which recovers the logistic form of the probability and shows that this formulation is indeed equivalent to the previous ones.
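The softmax form above is also straightforward to compute directly; the sketch below evaluates it for a three-outcome toy model and confirms that adding a common constant vector to every coefficient vector leaves the probabilities unchanged (made-up numbers):

    import numpy as np

    def softmax(scores):
        e = np.exp(scores - scores.max())      # subtract the max for numerical stability
        return e / e.sum()

    X_i = np.array([1.0, 2.0])                 # observation with leading 1 for the intercept
    betas = np.array([[0.0, 0.0],              # outcome 0 (pivot, set to zero)
                      [0.5, 0.3],              # outcome 1
                      [-1.0, 0.9]])            # outcome 2

    p = softmax(betas @ X_i)
    print(p, p.sum())                          # category probabilities, summing to 1

    # Adding the same constant vector C to every beta leaves the probabilities unchanged.
    C = np.array([5.0, -2.0])
    print(softmax((betas + C) @ X_i))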

logistic, regression, logit, model, redirects, here, confused, with, logit, function, statistics, logistic, model, logit, model, statistical, model, that, models, odds, event, linear, combination, more, independent, variables, regression, analysis, logistic, r. Logit model redirects here Not to be confused with Logit function In statistics the logistic model or logit model is a statistical model that models the log odds of an event as a linear combination of one or more independent variables In regression analysis logistic regression 1 or logit regression is estimating the parameters of a logistic model the coefficients in the linear combination Formally in binary logistic regression there is a single binary dependent variable coded by an indicator variable where the two values are labeled 0 and 1 while the independent variables can each be a binary variable two classes coded by an indicator variable or a continuous variable any real value The corresponding probability of the value labeled 1 can vary between 0 certainly the value 0 and 1 certainly the value 1 hence the labeling 2 the function that converts log odds to probability is the logistic function hence the name The unit of measurement for the log odds scale is called a logit from logistic unit hence the alternative names See Background and Definition for formal mathematics and Example for a worked example Example graph of a logistic regression curve fitted to data The curve shows the estimated probability of passing an exam binary dependent variable versus hours studying scalar independent variable See Example for worked details Binary variables are widely used in statistics to model the probability of a certain class or event taking place such as the probability of a team winning of a patient being healthy etc see Applications and the logistic model has been the most commonly used model for binary regression since about 1970 3 Binary variables can be generalized to categorical variables when there are more than two possible values e g whether an image is of a cat dog lion etc and the binary logistic regression generalized to multinomial logistic regression If the multiple categories are ordered one can use the ordinal logistic regression for example the proportional odds ordinal logistic model 4 See Extensions for further extensions The logistic regression model itself simply models probability of output in terms of input and does not perform statistical classification it is not a classifier though it can be used to make a classifier for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class below the cutoff as the other this is a common way to make a binary classifier Analogous linear models for binary variables with a different sigmoid function instead of the logistic function to convert the linear combination to a probability can also be used most notably the probit model see Alternatives The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate with each independent variable having its own parameter for a binary dependent variable this generalizes the odds ratio More abstractly the logistic function is the natural parameter for the Bernoulli distribution and in this sense is the simplest way to convert a real number to a probability In particular it maximizes entropy minimizes added information and in this sense makes the fewest assumptions of the data 
being modeled see Maximum entropy The parameters of a logistic regression are most commonly estimated by maximum likelihood estimation MLE This does not have a closed form expression unlike linear least squares see Model fitting Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by ordinary least squares OLS plays for scalar responses it is a simple well analyzed baseline model see Comparison with linear regression for discussion The logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson 5 beginning in Berkson 1944 where he coined logit see History Contents 1 Applications 1 1 General 1 2 Supervised machine learning 2 Example 2 1 Problem 2 2 Model 2 3 Fit 2 4 Parameter estimation 2 5 Predictions 2 6 Model evaluation 2 7 Generalizations 3 Background 3 1 Definition of the logistic function 3 2 Definition of the inverse of the logistic function 3 3 Interpretation of these terms 3 4 Definition of the odds 3 5 The odds ratio 3 6 Multiple explanatory variables 4 Definition 4 1 Many explanatory variables two categories 4 2 Multinomial logistic regression Many explanatory variables and many categories 5 Interpretations 5 1 As a generalized linear model 5 2 As a latent variable model 5 3 Two way latent variable model 5 3 1 Example 5 4 As a log linear model 5 5 As a single layer perceptron 5 6 In terms of binomial data 6 Model fitting 6 1 Maximum likelihood estimation MLE 6 2 Iteratively reweighted least squares IRLS 6 3 Bayesian 6 4 Rule of ten 7 Error and significance of fit 7 1 Deviance and likelihood ratio test a simple case 7 2 Goodness of fit summary 7 2 1 Deviance and likelihood ratio tests 7 2 2 Pseudo R squared 7 2 3 Hosmer Lemeshow test 7 3 Coefficient significance 7 3 1 Likelihood ratio test 7 3 2 Wald statistic 7 3 3 Case control sampling 8 Discussion 9 Maximum entropy 9 1 Proof 9 2 Other approaches 10 Comparison with linear regression 11 Alternatives 12 History 13 Extensions 14 See also 15 References 16 Sources 17 External linksApplications editGeneral edit Logistic regression is used in various fields including machine learning most medical fields and social sciences For example the Trauma and Injury Severity Score TRISS which is widely used to predict mortality in injured patients was originally developed by Boyd et al using logistic regression 6 Many other medical scales used to assess severity of a patient have been developed using logistic regression 7 8 9 10 Logistic regression may be used to predict the risk of developing a given disease e g diabetes coronary heart disease based on observed characteristics of the patient age sex body mass index results of various blood tests etc 11 12 Another example might be to predict whether a Nepalese voter will vote Nepali Congress or Communist Party of Nepal or Any Other Party based on age income sex race state of residence votes in previous elections etc 13 The technique can also be used in engineering especially for predicting the probability of failure of a given process system or product 14 15 It is also used in marketing applications such as prediction of a customer s propensity to purchase a product or halt a subscription etc 16 In economics it can be used to predict the likelihood of a person ending up in the labor force and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage Conditional random fields an extension of logistic regression to sequential data are used 
in natural language processing Disaster planners and engineers rely on these models to predict decision take by householders or building occupants in small scale and large scales evacuations such as building fires wildfires hurricanes among others 17 18 19 These models help in the development of reliable disaster managing plans and safer design for the built environment Supervised machine learning edit Logistic regression is a supervised machine learning algorithm widely used for binary classification tasks such as identifying whether an email is spam or not and diagnosing diseases by assessing the presence or absence of specific conditions based on patient test results This approach utilizes the logistic or sigmoid function to transform a linear combination of input features into a probability value ranging between 0 and 1 This probability indicates the likelihood that a given input corresponds to one of two predefined categories The essential mechanism of logistic regression is grounded in the logistic function s ability to model the probability of binary outcomes accurately With its distinctive S shaped curve the logistic function effectively maps any real valued number to a value within the 0 to 1 interval This feature renders it particularly suitable for binary classification tasks such as sorting emails into spam or not spam By calculating the probability that the dependent variable will be categorized into a specific group logistic regression provides a probabilistic framework that supports informed decision making 20 Example editProblem edit As a simple example we can use a logistic regression with one explanatory variable and two categories to answer the following question A group of 20 students spends between 0 and 6 hours studying for an exam How does the number of hours spent studying affect the probability of the student passing the exam The reason for using logistic regression for this problem is that the values of the dependent variable pass and fail while represented by 1 and 0 are not cardinal numbers If the problem was changed so that pass fail was replaced with the grade 0 100 cardinal numbers then simple regression analysis could be used The table shows the number of hours each student spent studying and whether they passed 1 or failed 0 Hours xk 0 50 0 75 1 00 1 25 1 50 1 75 1 75 2 00 2 25 2 50 2 75 3 00 3 25 3 50 4 00 4 25 4 50 4 75 5 00 5 50 Pass yk 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 We wish to fit a logistic function to the data consisting of the hours studied xk and the outcome of the test yk 1 for pass 0 for fail The data points are indexed by the subscript k which runs from k 1 displaystyle k 1 nbsp to k K 20 displaystyle k K 20 nbsp The x variable is called the explanatory variable and the y variable is called the categorical variable consisting of two categories pass or fail corresponding to the categorical values 1 and 0 respectively Model edit nbsp Graph of a logistic regression curve fitted to the xm ym data The curve shows the probability of passing an exam versus hours studying The logistic function is of the form p x 1 1 e x m s displaystyle p x frac 1 1 e x mu s nbsp where m is a location parameter the midpoint of the curve where p m 1 2 displaystyle p mu 1 2 nbsp and s is a scale parameter This expression may be rewritten as p x 1 1 e b 0 b 1 x displaystyle p x frac 1 1 e beta 0 beta 1 x nbsp where b 0 m s displaystyle beta 0 mu s nbsp and is known as the intercept it is the vertical intercept or y intercept of the line y b 0 b 1 x displaystyle y 
beta 0 beta 1 x nbsp and b 1 1 s displaystyle beta 1 1 s nbsp inverse scale parameter or rate parameter these are the y intercept and slope of the log odds as a function of x Conversely m b 0 b 1 displaystyle mu beta 0 beta 1 nbsp and s 1 b 1 displaystyle s 1 beta 1 nbsp Fit edit The usual measure of goodness of fit for a logistic regression uses logistic loss or log loss the negative log likelihood For a given xk and yk write p k p x k displaystyle p k p x k nbsp The p k displaystyle p k nbsp are the probabilities that the corresponding y k displaystyle y k nbsp will equal one and 1 p k displaystyle 1 p k nbsp are the probabilities that they will be zero see Bernoulli distribution We wish to find the values of b 0 displaystyle beta 0 nbsp and b 1 displaystyle beta 1 nbsp which give the best fit to the data In the case of linear regression the sum of the squared deviations of the fit from the data points yk the squared error loss is taken as a measure of the goodness of fit and the best fit is obtained when that function is minimized The log loss for the k th point ℓ k displaystyle ell k nbsp is ℓ k ln p k if y k 1 ln 1 p k if y k 0 displaystyle ell k begin cases ln p k amp text if y k 1 ln 1 p k amp text if y k 0 end cases nbsp The log loss can be interpreted as the surprisal of the actual outcome y k displaystyle y k nbsp relative to the prediction p k displaystyle p k nbsp and is a measure of information content Log loss is always greater than or equal to 0 equals 0 only in case of a perfect prediction i e when p k 1 displaystyle p k 1 nbsp and y k 1 displaystyle y k 1 nbsp or p k 0 displaystyle p k 0 nbsp and y k 0 displaystyle y k 0 nbsp and approaches infinity as the prediction gets worse i e when y k 1 displaystyle y k 1 nbsp and p k 0 displaystyle p k to 0 nbsp or y k 0 displaystyle y k 0 nbsp and p k 1 displaystyle p k to 1 nbsp meaning the actual outcome is more surprising Since the value of the logistic function is always strictly between zero and one the log loss is always greater than zero and less than infinity Unlike in a linear regression where the model can have zero loss at a point by passing through a data point and zero loss overall if all points are on a line in a logistic regression it is not possible to have zero loss at any points since y k displaystyle y k nbsp is either 0 or 1 but 0 lt p k lt 1 displaystyle 0 lt p k lt 1 nbsp These can be combined into a single expression ℓ k y k ln p k 1 y k ln 1 p k displaystyle ell k y k ln p k 1 y k ln 1 p k nbsp This expression is more formally known as the cross entropy of the predicted distribution p k 1 p k displaystyle big p k 1 p k big nbsp from the actual distribution y k 1 y k displaystyle big y k 1 y k big nbsp as probability distributions on the two element space of pass fail The sum of these the total loss is the overall negative log likelihood ℓ displaystyle ell nbsp and the best fit is obtained for those choices of b 0 displaystyle beta 0 nbsp and b 1 displaystyle beta 1 nbsp for which ℓ displaystyle ell nbsp is minimized Alternatively instead of minimizing the loss one can maximize its inverse the positive log likelihood ℓ k y k 1 ln p k k y k 0 ln 1 p k k 1 K y k ln p k 1 y k ln 1 p k displaystyle ell sum k y k 1 ln p k sum k y k 0 ln 1 p k sum k 1 K left y k ln p k 1 y k ln 1 p k right nbsp or equivalently maximize the likelihood function itself which is the probability that the given data set is produced by a particular logistic function L k y k 1 p k k y k 0 1 p k displaystyle L prod k y k 1 p k prod k y k 0 
1 p k nbsp This method is known as maximum likelihood estimation Parameter estimation edit Since ℓ is nonlinear in b 0 displaystyle beta 0 nbsp and b 1 displaystyle beta 1 nbsp determining their optimum values will require numerical methods One method of maximizing ℓ is to require the derivatives of ℓ with respect to b 0 displaystyle beta 0 nbsp and b 1 displaystyle beta 1 nbsp to be zero 0 ℓ b 0 k 1 K y k p k displaystyle 0 frac partial ell partial beta 0 sum k 1 K y k p k nbsp 0 ℓ b 1 k 1 K y k p k x k displaystyle 0 frac partial ell partial beta 1 sum k 1 K y k p k x k nbsp and the maximization procedure can be accomplished by solving the above two equations for b 0 displaystyle beta 0 nbsp and b 1 displaystyle beta 1 nbsp which again will generally require the use of numerical methods The values of b 0 displaystyle beta 0 nbsp and b 1 displaystyle beta 1 nbsp which maximize ℓ and L using the above data are found to be b 0 4 1 displaystyle beta 0 approx 4 1 nbsp b 1 1 5 displaystyle beta 1 approx 1 5 nbsp which yields a value for m and s of m b 0 b 1 2 7 displaystyle mu beta 0 beta 1 approx 2 7 nbsp s 1 b 1 0 67 displaystyle s 1 beta 1 approx 0 67 nbsp Predictions edit The b 0 displaystyle beta 0 nbsp and b 1 displaystyle beta 1 nbsp coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam For example for a student who studies 2 hours entering the value x 2 displaystyle x 2 nbsp into the equation gives the estimated probability of passing the exam of 0 25 t b 0 2 b 1 4 1 2 1 5 1 1 displaystyle t beta 0 2 beta 1 approx 4 1 2 cdot 1 5 1 1 nbsp p 1 1 e t 0 25 Probability of passing exam displaystyle p frac 1 1 e t approx 0 25 text Probability of passing exam nbsp Similarly for a student who studies 4 hours the estimated probability of passing the exam is 0 87 t b 0 4 b 1 4 1 4 1 5 1 9 displaystyle t beta 0 4 beta 1 approx 4 1 4 cdot 1 5 1 9 nbsp p 1 1 e t 0 87 Probability of passing exam displaystyle p frac 1 1 e t approx 0 87 text Probability of passing exam nbsp This table shows the estimated probability of passing the exam for several values of hours studying Hoursof study x Passing exam Log odds t Odds et Probability p 1 2 57 0 076 1 13 1 0 07 2 1 07 0 34 1 2 91 0 26 m 2 7 displaystyle mu approx 2 7 nbsp 0 1 1 2 displaystyle tfrac 1 2 nbsp 0 50 3 0 44 1 55 0 61 4 1 94 6 96 0 87 5 3 45 31 4 0 97 Model evaluation edit The logistic regression analysis gives the following output Coefficient Std Error z value p value Wald Intercept b0 4 1 1 8 2 3 0 021 Hours b1 1 5 0 6 2 4 0 017 By the Wald test the output indicates that hours studying is significantly associated with the probability of passing the exam p 0 017 displaystyle p 0 017 nbsp Rather than the Wald method the recommended method 21 to calculate the p value for logistic regression is the likelihood ratio test LRT which for these data give p 0 00064 displaystyle p approx 0 00064 nbsp see Deviance and likelihood ratio tests below Generalizations edit This simple model is an example of binary logistic regression and has one explanatory variable and a binary categorical variable which can assume one of two categorical values Multinomial logistic regression is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories Background edit nbsp Figure 1 The standard logistic function s t displaystyle sigma t nbsp s t 0 1 displaystyle sigma t in 0 1 nbsp for all t displaystyle t nbsp Definition of the logistic function edit 
Background

Figure 1: The standard logistic function $\sigma(t)$; note that $\sigma(t) \in (0,1)$ for all $t$.

Definition of the logistic function

An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input $t$ and outputs a value between zero and one.[2] For the logit, this is interpreted as taking input log-odds and having output probability. The standard logistic function $\sigma : \mathbb{R} \to (0,1)$ is defined as follows:

$$\sigma(t) = \frac{e^t}{e^t + 1} = \frac{1}{1 + e^{-t}}$$

A graph of the logistic function on the t-interval (−6, 6) is shown in Figure 1.

Let us assume that $t$ is a linear function of a single explanatory variable $x$ (the case where $t$ is a linear combination of multiple explanatory variables is treated similarly). We can then express $t$ as follows:

$$t = \beta_0 + \beta_1 x$$

And the general logistic function $p : \mathbb{R} \to (0,1)$ can now be written as:

$$p(x) = \sigma(t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

In the logistic model, $p(x)$ is interpreted as the probability of the dependent variable $Y$ equaling a success/case rather than a failure/non-case. It is clear that the response variables $Y_i$ are not identically distributed: $P(Y_i = 1 \mid X)$ differs from one data point $X_i$ to another, though they are independent given the design matrix $X$ and shared parameters $\beta$.[11]

Definition of the inverse of the logistic function

We can now define the logit (log-odds) function as the inverse $g = \sigma^{-1}$ of the standard logistic function. It is easy to see that it satisfies:

$$g(p(x)) = \sigma^{-1}(p(x)) = \operatorname{logit} p(x) = \ln\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x,$$

and equivalently, after exponentiating both sides, we have the odds:

$$\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}$$

Interpretation of these terms

In the above equations, the terms are as follows:

- $g$ is the logit function. The equation for $g(p(x))$ illustrates that the logit (i.e. log-odds, or natural logarithm of the odds) is equivalent to the linear regression expression.
- $\ln$ denotes the natural logarithm.
- $p(x)$ is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for $p(x)$ illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability $p(x)$ ranges between 0 and 1.
- $\beta_0$ is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero).
- $\beta_1 x$ is the regression coefficient multiplied by some value of the predictor.
- base $e$ denotes the exponential function.
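As a quick numerical illustration of the two definitions above, the following short Python sketch (illustrative, not from the article) implements $\sigma$ and its inverse and checks that the logit undoes the logistic function, using the coefficients from the worked example.

    import math

    def sigmoid(t):
        """Standard logistic function: maps log-odds t to a probability in (0, 1)."""
        return 1.0 / (1.0 + math.exp(-t))

    def logit(p):
        """Inverse of the logistic function: maps a probability to log-odds."""
        return math.log(p / (1.0 - p))

    beta0, beta1 = -4.1, 1.5       # coefficients from the worked example above
    x = 2.0
    t = beta0 + beta1 * x          # linear predictor (log-odds)
    p = sigmoid(t)
    print(round(p, 2))                        # ~0.25
    print(round(logit(p), 2), round(t, 2))    # both ~-1.1: logit(sigmoid(t)) == t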
Definition of the odds

The odds of the dependent variable equaling a case (given some linear combination $x$ of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression, and the logit is easily converted back into the odds.[2]

So we define the odds of the dependent variable equaling a case (given some linear combination $x$ of the predictors) as follows:

$$\text{odds} = e^{\beta_0 + \beta_1 x}.$$

The odds ratio

For a continuous independent variable the odds ratio can be defined as:

$$\mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{\left(\dfrac{p(x+1)}{1 - p(x+1)}\right)}{\left(\dfrac{p(x)}{1 - p(x)}\right)} = \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$

In simple terms, if we hypothetically obtain an odds ratio of 2, we can say: "for every one-unit increase in hours studied, the odds of passing (group 1) versus failing (group 0) are expected to double" (Denis, 2019). This exponential relationship provides an interpretation for $\beta_1$: the odds multiply by $e^{\beta_1}$ for every 1-unit increase in x.[22]

For a binary independent variable the odds ratio is defined as $\frac{ad}{bc}$, where a, b, c and d are cells in a 2×2 contingency table.[23]

Multiple explanatory variables

If there are multiple explanatory variables, the above expression $\beta_0 + \beta_1 x$ can be revised to

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m = \beta_0 + \sum_{i=1}^{m} \beta_i x_i.$$

Then, when this is used in the equation relating the log-odds of a success to the values of the predictors, the linear regression will be a multiple regression with m explanators; the parameters $\beta_j$ for all $j = 0, 1, 2, \dots, m$ are all estimated.

Again, the more traditional equations are:

$$\log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$$

and

$$p = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m)}},$$

where usually $b = e$.
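Before moving on, here is a short Python sketch (illustrative only) of the odds-ratio identity derived above: it computes odds(x + 1)/odds(x) directly and compares it with $e^{\beta_1}$, using the coefficients from the worked example.

    import math

    def odds(x, beta0, beta1):
        """Odds of a 'case' under the one-variable logistic model."""
        return math.exp(beta0 + beta1 * x)

    beta0, beta1 = -4.1, 1.5          # coefficients from the worked example
    for x in (1.0, 2.0, 3.0):
        ratio = odds(x + 1, beta0, beta1) / odds(x, beta0, beta1)
        print(round(ratio, 3))        # same value at every x ...
    print(round(math.exp(beta1), 3))  # ... and equal to e**beta1 (~4.48)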
Definition

A dataset contains N points. Each point i consists of a set of m input variables $x_{1,i}, \dots, x_{m,i}$ (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable $Y_i$ (also known as a dependent variable, response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable.

As in linear regression, the outcome variables $Y_i$ are assumed to depend on the explanatory variables $x_{1,i}, \dots, x_{m,i}$.

Explanatory variables

The explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables and discrete variables. Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables); that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".

Outcome variables

Formally, the outcomes $Y_i$ are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability $p_i$ that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

$$\begin{aligned} Y_i \mid x_{1,i}, \ldots, x_{m,i} &\sim \operatorname{Bernoulli}(p_i) \\ \operatorname{\mathbb{E}}[Y_i \mid x_{1,i}, \ldots, x_{m,i}] &= p_i \\ \Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) &= \begin{cases} p_i & \text{if } y = 1 \\ 1 - p_i & \text{if } y = 0 \end{cases} \\ \Pr(Y_i = y \mid x_{1,i}, \ldots, x_{m,i}) &= p_i^{y}(1 - p_i)^{1-y} \end{aligned}$$

The meanings of these four lines are:

- The first line expresses the probability distribution of each $Y_i$: conditioned on the explanatory variables, it follows a Bernoulli distribution with parameter $p_i$, the probability of the outcome of 1 for trial i. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success $p_i$ is not observed, only the outcome of an individual Bernoulli trial using that probability.
- The second line expresses the fact that the expected value of each $Y_i$ is equal to the probability of success $p_i$, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success $p_i$, then take the average of all the 1 and 0 outcomes, the result would be close to $p_i$. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.
- The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.
- The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that $Y_i$ can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either $p_i$ or $1 - p_i$, as in the previous line.

Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability $p_i$ using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear predictor function $f(i)$ for a particular data point i is written as:

$$f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},$$

where $\beta_0, \ldots, \beta_m$ are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.
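A tiny Python sketch (illustrative only, using placeholder values rather than real data) of the compact Bernoulli mass function and the linear predictor just defined, evaluated for a single data point:

    def bernoulli_pmf(y, p):
        """Pr(Y = y) = p**y * (1 - p)**(1 - y), valid for y in {0, 1}."""
        return p ** y * (1 - p) ** (1 - y)

    def linear_predictor(beta, x):
        """f(i) = beta_0 + beta_1*x_1 + ... + beta_m*x_m for one data point's covariates x."""
        return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

    p = 0.3
    print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))   # 0.3 and 0.7
    print(linear_predictor([-4.1, 1.5], [2.0]))       # the worked example's log-odds at x = 2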
The model is usually put into a more compact form as follows:

- The regression coefficients $\beta_0, \beta_1, \dots, \beta_m$ are grouped into a single vector $\boldsymbol\beta$ of size m + 1.
- For each data point i, an additional explanatory pseudo-variable $x_{0,i}$ is added, with a fixed value of 1, corresponding to the intercept coefficient $\beta_0$.
- The resulting explanatory variables $x_{0,i}, x_{1,i}, \dots, x_{m,i}$ are then grouped into a single vector $\mathbf{X}_i$ of size m + 1.

This makes it possible to write the linear predictor function as follows:

$$f(i) = \boldsymbol\beta \cdot \mathbf{X}_i,$$

using the notation for a dot product between two vectors.

(Figure: an example of SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).)

Many explanatory variables, two categories

The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables x1, x2, ... and any number of categorical values $y = 0, 1, 2, \dots$.

To begin with, we may consider a logistic model with M explanatory variables, x1, x2, ..., xM, and, as in the example above, two categorical values (y = 0 and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds (also called logit) of the event that $y = 1$. This linear relationship may be extended to the case of M explanatory variables:

$$t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_M x_M,$$

where t is the log-odds and $\beta_i$ are parameters of the model. An additional generalization has been introduced in which the base of the model (b) is not restricted to the Euler number e. In most applications, the base $b$ of the logarithm is usually taken to be e. However, in some cases it can be easier to communicate results by working in base 2 or base 10.

For a more compact notation, we will specify the explanatory variables and the β coefficients as (M + 1)-dimensional vectors:

$$\boldsymbol{x} = \{x_0, x_1, x_2, \dots, x_M\}$$

$$\boldsymbol{\beta} = \{\beta_0, \beta_1, \beta_2, \dots, \beta_M\}$$

with an added explanatory variable x0 = 1. The logit may now be written as:

$$t = \sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot \boldsymbol{x}.$$

Solving for the probability p that $y = 1$ yields:

$$p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1 + b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{1}{1 + b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}} = S_b(t),$$

where $S_b$ is the sigmoid function with base $b$. The above formula shows that once the $\beta_m$ are fixed, we can easily compute either the log-odds that $y = 1$ for a given observation, or the probability that $y = 1$ for a given observation. The main use case of a logistic model is to be given an observation $\boldsymbol{x}$ and estimate the probability $p(\boldsymbol{x})$ that $y = 1$. The optimum beta coefficients may again be found by maximizing the log-likelihood. For K measurements, defining $\boldsymbol{x}_k$ as the explanatory vector of the k-th measurement, and $y_k$ as the categorical outcome of that measurement, the log-likelihood may be written in a form very similar to the simple $M = 1$ case above:

$$\ell = \sum_{k=1}^{K} y_k \log_b(p(\boldsymbol{x}_k)) + \sum_{k=1}^{K} (1 - y_k)\log_b(1 - p(\boldsymbol{x}_k)).$$
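The following Python sketch (illustrative, not from the article) implements the base-b probability $p(\boldsymbol{x})$ and the log-likelihood ℓ above; the design matrix X (with a leading column of ones for x0) and the outcome array y are assumed to be supplied by the caller.

    import numpy as np

    def prob(X, beta, b=np.e):
        """p(x) = 1 / (1 + b**(-beta.x)) for each row of X (X includes the x0 = 1 column)."""
        t = X @ beta                     # log-odds in base b
        return 1.0 / (1.0 + b ** (-t))

    def log_likelihood(X, y, beta, b=np.e):
        """l = sum_k [ y_k * log_b p(x_k) + (1 - y_k) * log_b (1 - p(x_k)) ]."""
        p = prob(X, beta, b)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) / np.log(b)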
As in the simple example above, finding the optimum β parameters will require numerical methods. One useful technique is to equate the derivatives of the log-likelihood with respect to each of the β parameters to zero, yielding a set of equations which will hold at the maximum of the log-likelihood:

$$\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^{K} y_k x_{mk} - \sum_{k=1}^{K} p(\boldsymbol{x}_k)\, x_{mk},$$

where $x_{mk}$ is the value of the $x_m$ explanatory variable from the k-th measurement.

Consider an example with $M = 2$ explanatory variables, $b = 10$, and coefficients $\beta_0 = -3$, $\beta_1 = 1$, and $\beta_2 = 2$, which have been determined by the above method. To be concrete, the model is:

$$t = \log_{10} \frac{p}{1-p} = -3 + x_1 + 2x_2$$

$$p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1 + b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},$$

where p is the probability of the event that $y = 1$. This can be interpreted as follows:

- $\beta_0 = -3$ is the y-intercept. It is the log-odds of the event that $y = 1$ when the predictors $x_1 = x_2 = 0$. By exponentiating, we can see that when $x_1 = x_2 = 0$ the odds of the event that $y = 1$ are 1-to-1000, or $10^{-3}$. Similarly, the probability of the event that $y = 1$ when $x_1 = x_2 = 0$ can be computed as $1/(1000 + 1) = 1/1001$.
- $\beta_1 = 1$ means that increasing $x_1$ by 1 increases the log-odds by 1. So if $x_1$ increases by 1, the odds that $y = 1$ increase by a factor of $10^1$. The probability of $y = 1$ has also increased, but not by as much as the odds have increased.
- $\beta_2 = 2$ means that increasing $x_2$ by 1 increases the log-odds by 2. So if $x_2$ increases by 1, the odds that $y = 1$ increase by a factor of $10^2$. Note how the effect of $x_2$ on the log-odds is twice as great as the effect of $x_1$, but the effect on the odds is 10 times greater. The effect on the probability of $y = 1$, however, is not as much as 10 times greater; it is only the effect on the odds that is 10 times greater.
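A small Python sketch (illustrative only) of the base-10, two-variable model just described, checking the odds of 1:1000 at the origin and the factor-of-10 change in odds per unit increase in $x_1$:

    def p_base10(x1, x2, b0=-3.0, b1=1.0, b2=2.0):
        """Probability that y = 1 under t = log10(p/(1-p)) = b0 + b1*x1 + b2*x2."""
        t = b0 + b1 * x1 + b2 * x2
        return 1.0 / (1.0 + 10.0 ** (-t))

    def odds(p):
        return p / (1.0 - p)

    p00 = p_base10(0, 0)
    print(p00, odds(p00))                                 # ~1/1001 and odds ~0.001 = 1:1000
    print(odds(p_base10(1, 0)) / odds(p_base10(0, 0)))    # ~10:   one unit of x1 multiplies the odds by 10**1
    print(odds(p_base10(0, 1)) / odds(p_base10(0, 0)))    # ~100:  one unit of x2 multiplies the odds by 10**2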
Multinomial logistic regression: many explanatory variables and many categories

Main article: Multinomial logistic regression

In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: the probability that the outcome was in category 1 was given by $p(\boldsymbol{x})$, and the probability that the outcome was in category 0 was given by $1 - p(\boldsymbol{x})$. The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup.

In general, if we have $M + 1$ explanatory variables (including x0) and $N + 1$ categories, we will need $N + 1$ separate probabilities, one for each category, indexed by n, which describe the probability that the categorical outcome y will be in category y = n, conditional on the vector of covariates x. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base e, these probabilities are:

$$p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n \cdot \boldsymbol{x}}}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}} \quad \text{for } n = 1, 2, \dots, N$$

$$p_0(\boldsymbol{x}) = 1 - \sum_{n=1}^{N} p_n(\boldsymbol{x}) = \frac{1}{1 + \sum_{u=1}^{N} e^{\boldsymbol{\beta}_u \cdot \boldsymbol{x}}}$$

Each of the probabilities except $p_0(\boldsymbol{x})$ will have its own set of regression coefficients $\boldsymbol{\beta}_n$. It can be seen that, as required, the sum of the $p_n(\boldsymbol{x})$ over all categories n is 1. The selection of $p_0(\boldsymbol{x})$ to be defined in terms of the other probabilities is artificial: any of the probabilities could have been selected to be so defined. This special value of n is termed the "pivot index", and the log-odds ($t_n$) are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:

$$t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}.$$

Note also that for the simple case of $N = 1$ the two-category case is recovered, with $p(\boldsymbol{x}) = p_1(\boldsymbol{x})$ and $p_0(\boldsymbol{x}) = 1 - p_1(\boldsymbol{x})$.

The log-likelihood that a particular set of K measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by k, let the k-th set of measured explanatory variables be denoted by $\boldsymbol{x}_k$ and their categorical outcomes be denoted by $y_k$, which can be equal to any integer in [0, N]. The log-likelihood is then:

$$\ell = \sum_{k=1}^{K} \sum_{n=0}^{N} \Delta(n, y_k)\, \ln(p_n(\boldsymbol{x}_k)),$$

where $\Delta(n, y_k)$ is an indicator function which equals 1 if $y_k = n$ and zero otherwise. In the two-category case above, this indicator function was defined as $y_k$ when n = 1 and $1 - y_k$ when n = 0. This was convenient, but not necessary.[24]

Again, the optimum β coefficients may be found by maximizing the log-likelihood function, generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each β coefficient equal to zero and solve for the β coefficients:

$$\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^{K} \Delta(n, y_k)\, x_{mk} - \sum_{k=1}^{K} p_n(\boldsymbol{x}_k)\, x_{mk},$$

where $\beta_{nm}$ is the m-th coefficient of the $\boldsymbol{\beta}_n$ vector and $x_{mk}$ is the m-th explanatory variable of the k-th measurement. Once the β coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories.
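The category probabilities above are easy to compute directly. The following Python sketch (illustrative, not from the article; the coefficient values are hypothetical) returns the vector $(p_0, p_1, \dots, p_N)$ for a single observation, using category 0 as the pivot whose coefficient vector is implicitly zero.

    import numpy as np

    def multinomial_probs(x, betas):
        """x: covariate vector (with x0 = 1); betas: array of shape (N, M+1),
        one coefficient row per non-pivot category n = 1..N."""
        scores = betas @ x                        # beta_n . x for n = 1..N
        denom = 1.0 + np.sum(np.exp(scores))      # 1 + sum_u exp(beta_u . x)
        p_rest = np.exp(scores) / denom           # p_1 .. p_N
        p0 = 1.0 / denom                          # pivot-category probability
        return np.concatenate(([p0], p_rest))     # sums to 1

    # Hypothetical example: three categories (N = 2), two covariates plus intercept.
    betas = np.array([[0.5, 1.0, -2.0],
                      [-1.0, 0.3, 0.8]])
    x = np.array([1.0, 2.0, 0.5])                 # x0 = 1, x1 = 2, x2 = 0.5
    print(multinomial_probs(x, betas), multinomial_probs(x, betas).sum())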
Interpretations

There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models and allow different generalizations.

As a generalized linear model

The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

$$\operatorname{logit}(\operatorname{\mathbb{E}}[Y_i \mid x_{1,i}, \ldots, x_{m,i}]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$$

Written using the more compact notation described above, this is:

$$\operatorname{logit}(\operatorname{\mathbb{E}}[Y_i \mid \mathbf{X}_i]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i$$

This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.

The intuition for transforming using the logit function (the natural log of the odds) was explained above. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over $(-\infty, +\infty)$, thereby matching the potential range of the linear prediction function on the right side of the equation.

Both the probabilities pi and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.[25]
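Fitting is typically done numerically, as noted above. The following is a minimal Python sketch of iteratively reweighted least squares (IRLS) for the unregularized model, written from the standard Newton step for logistic regression rather than taken from the article; X is assumed to include a column of ones for the intercept, and y is an array of 0/1 outcomes.

    import numpy as np

    def fit_irls(X, y, n_iter=25, tol=1e-10):
        """Unregularized logistic regression via IRLS (Newton-Raphson updates)."""
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # current fitted probabilities
            W = p * (1.0 - p)                        # working weights (Hessian diagonal)
            gradient = X.T @ (y - p)                 # score vector
            hessian = X.T @ (X * W[:, None])         # X^T W X
            step = np.linalg.solve(hessian, gradient)
            beta += step
            if np.max(np.abs(step)) < tol:           # stop once the update is negligible
                break
        return beta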
The interpretation of the βj parameter estimates is as the additive effect on the log of the odds for a unit change in the j-th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, $e^{\beta}$ is the estimate of the odds of having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:

$$\operatorname{\mathbb{E}}[Y_i \mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1 + e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}$$

The formula can also be written as a probability distribution (specifically, using a probability mass function):

$$\Pr(Y_i = y \mid \mathbf{X}_i) = p_i^{y}(1 - p_i)^{1-y} = \left(\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y} \left(1 - \frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1-y} = \frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i \cdot y}}{1 + e^{\boldsymbol\beta \cdot \mathbf{X}_i}}$$

As a latent-variable model

The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.

Imagine that, for each trial i, there is a continuous latent variable $Y_i^{\ast}$ (i.e. an unobserved random variable) that is distributed as follows:

$$Y_i^{\ast} = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i,$$

where

$$\varepsilon_i \sim \operatorname{Logistic}(0, 1),$$

i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.

Then $Y_i$ can be viewed as an indicator for whether this latent variable is positive:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^{\ast} > 0, \text{ i.e. } -\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i, \\ 0 & \text{otherwise.} \end{cases}$$

The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for $Y_i^{\ast}$ regardless of the settings of the explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of $Y_i^{\ast}$ will be smaller by a factor of s than in the former case, for all sets of explanatory variables, but critically it will always remain on the same side of 0, and hence lead to the same $Y_i$ choice. (This suggests that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)

It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables. This can be shown as follows, using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.

$$\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x).$$

Then:

$$\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{\ast} > 0 \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i > 0) \\ &= \Pr(\varepsilon_i > -\boldsymbol\beta \cdot \mathbf{X}_i) \\ &= \Pr(\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i) && \text{(because the logistic distribution is symmetric)} \\ &= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\ &= p_i && \text{(see above)} \end{aligned}$$
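A quick simulation (illustrative only, with an arbitrary value chosen for the linear predictor) makes the equivalence concrete: drawing the latent error from a standard logistic distribution and thresholding at zero reproduces, empirically, the probability $\operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i)$.

    import numpy as np

    rng = np.random.default_rng(0)
    beta_dot_x = 0.8                                   # some fixed value of the linear predictor
    eps = rng.logistic(loc=0.0, scale=1.0, size=1_000_000)
    y = (beta_dot_x + eps > 0).astype(float)           # latent-variable rule: Y = 1 iff Y* > 0
    print(y.mean())                                    # empirical Pr(Y = 1), ~0.69
    print(1.0 / (1.0 + np.exp(-beta_dot_x)))           # logistic-model probability, ~0.69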
This formulation, which is standard in discrete choice models, makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).

Two-way latent-variable model

Yet another formulation uses two separate latent variables:

$$\begin{aligned} Y_i^{0\ast} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \\ Y_i^{1\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 \end{aligned}$$

where

$$\begin{aligned} \varepsilon_0 &\sim \operatorname{EV}_1(0, 1) \\ \varepsilon_1 &\sim \operatorname{EV}_1(0, 1) \end{aligned}$$

where $\operatorname{EV}_1(0,1)$ is a standard type-1 extreme value distribution; i.e.

$$\Pr(\varepsilon_0 = x) = \Pr(\varepsilon_1 = x) = e^{-x} e^{-e^{-x}}$$

Then

$$Y_i = \begin{cases} 1 & \text{if } Y_i^{1\ast} > Y_i^{0\ast}, \\ 0 & \text{otherwise.} \end{cases}$$

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.

It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

$$\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0$$

$$\varepsilon = \varepsilon_1 - \varepsilon_0$$

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values; this effectively removes one degree of freedom. Another critical fact is that the difference of two type-1 extreme-value-distributed variables is a logistic distribution, i.e.

$$\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0, 1).$$
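This last fact is easy to check by simulation. The following Python sketch (illustrative only) draws two independent standard Gumbel (type-1 extreme value) variables and compares the empirical distribution of their difference against the standard logistic CDF at a few points.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000
    eps0 = rng.gumbel(loc=0.0, scale=1.0, size=n)    # EV1(0, 1) errors for outcome 0
    eps1 = rng.gumbel(loc=0.0, scale=1.0, size=n)    # EV1(0, 1) errors for outcome 1
    diff = eps1 - eps0                               # should follow Logistic(0, 1)

    for x in (-2.0, 0.0, 1.0):
        empirical = (diff < x).mean()                # empirical CDF of the difference
        logistic_cdf = 1.0 / (1.0 + np.exp(-x))      # standard logistic CDF at x
        print(x, round(empirical, 3), round(logistic_cdf, 3))   # the two columns agree closely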
We can demonstrate the equivalence as follows:

$$\begin{aligned} \Pr(Y_i = 1 \mid \mathbf{X}_i) &= \Pr(Y_i^{1\ast} > Y_i^{0\ast} \mid \mathbf{X}_i) \\ &= \Pr(Y_i^{1\ast} - Y_i^{0\ast} > 0 \mid \mathbf{X}_i) \\ &= \Pr(\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0) > 0) \\ &= \Pr((\boldsymbol\beta_1 \cdot \mathbf{X}_i - \boldsymbol\beta_0 \cdot \mathbf{X}_i) + (\varepsilon_1 - \varepsilon_0) > 0) \\ &= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) \\ &= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + \varepsilon > 0) && \text{(substitute } \varepsilon \text{ as above)} \\ &= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) && \text{(substitute } \boldsymbol\beta \text{ as above)} \\ &= \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) && \text{(now the same as the model above)} \\ &= \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) \\ &= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\ &= p_i \end{aligned}$$

Example

As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada). We would then use three latent variables, one for each choice. Then, in accordance with utility theory, we can interpret the latent variables as expressing the utility that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility, or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or a moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility, since they are likely to own companies, which will have a harder time doing business in such an environment and will probably lose money.

These intuitions can be expressed as follows:
Estimated strength of regression coefficient for different outcomes (party choices) and different values of explanatory variables:

 | Center-right | Center-left | Secessionist
High-income | strong + | strong − | strong −
Middle-income | moderate + | weak + | none
Low-income | none | strong + | none

This clearly shows that:

- Separate sets of regression coefficients need to exist for each choice. When phrased in terms of utility, this can be seen very easily. Different choices have different effects on net utility; furthermore, the effects vary in complex ways that depend on the characteristics of each individual, so there need to be separate sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.
- Even though income is a continuous variable, its effect on utility is too complex for it to be treated as a single variable. Either it needs to be directly split up into ranges, or higher powers of income need to be added so that polynomial regression on income is effectively done.

As a "log-linear" model

Yet another formulation combines the two-way latent-variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.

Here, instead of writing the logit of the probabilities pi as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

$$\begin{aligned} \ln \Pr(Y_i = 0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\ \ln \Pr(Y_i = 1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z \end{aligned}$$

Two separate sets of regression coefficients have been introduced, just as in the two-way latent-variable model, and the two equations appear in a form that writes the logarithm of the associated probability as a linear predictor, with an extra term $-\ln Z$ at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

$$\begin{aligned} \Pr(Y_i = 0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\ \Pr(Y_i = 1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \end{aligned}$$

In this form it is clear that the purpose of Z is to ensure that the resulting distribution over Yi is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply the sum of all un-normalized probabilities, and by dividing each probability by Z, the probabilities become "normalized". That is:

$$Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i},$$

and the resulting equations are

$$\begin{aligned} \Pr(Y_i = 0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\ \Pr(Y_i = 1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \end{aligned}$$

Or generally:

$$\Pr(Y_i = c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}$$

This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit. This general formulation is exactly the softmax function, as in

$$\Pr(Y_i = c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots).$$
In order to prove that this is equivalent to the previous model, note that the above model is overspecified, in that $\Pr(Y_i = 0)$ and $\Pr(Y_i = 1)$ cannot be independently specified: rather, $\Pr(Y_i = 0) + \Pr(Y_i = 1) = 1$, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of β0 and β1 will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:

$$\begin{aligned} \Pr(Y_i = 1) &= \frac{e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 + \mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}} \\ &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\ &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \left(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}\right)} \\ &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \end{aligned}$$

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set $\boldsymbol\beta_0 = \mathbf{0}$. Then

$$e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1,$$

and so

$$\Pr(Y_i = 1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}},$$

which shows that this formulation is indeed equivalent to the previous one, with $\boldsymbol\beta = \boldsymbol\beta_1$.
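A short Python sketch (illustrative only; the coefficient and covariate values are hypothetical) of the identifiability point above: adding the same constant vector C to both coefficient vectors leaves the softmax probabilities unchanged, and fixing β0 = 0 recovers the ordinary logistic form.

    import numpy as np

    def two_class_softmax(x, beta0, beta1):
        """Pr(Y = 1) under the log-linear (softmax) formulation with two outcomes."""
        scores = np.array([beta0 @ x, beta1 @ x])
        e = np.exp(scores - scores.max())        # subtract the max for numerical stability
        return e[1] / e.sum()

    x = np.array([1.0, 0.5, -2.0])               # example covariate vector (x0 = 1)
    beta0 = np.array([0.2, -1.0, 0.3])
    beta1 = np.array([1.0, 0.4, -0.1])
    C = np.array([5.0, -2.0, 7.0])               # arbitrary shift applied to both coefficient vectors

    p = two_class_softmax(x, beta0, beta1)
    p_shifted = two_class_softmax(x, beta0 + C, beta1 + C)
    p_pivot = 1.0 / (1.0 + np.exp(-(beta1 - beta0) @ x))   # after setting the pivot coefficients to zero
    print(p, p_shifted, p_pivot)                  # all three values agree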
