Why Regular Regression Does NOT Work

It is an error to use regular least-squares regression when the dependent Y-variable is binary (takes on values of 0 or 1 only). Note: see the previous article When to Use Logistic Regression for a discussion of dichotomous and binary dependent Y-variables.

Note: If you want to skip the discussion, jump straight to the pictures below.

Recall that in least-squares regression, we model the Y-variable as a linear function of the X-variables plus a random error that is assumed to have a normal distribution. That is, the i-th Y observation is assumed to have been generated by the following equation:

$$Y_i = \alpha + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i$$

where:
$\beta_1, \ldots, \beta_p$ are the regression slope coefficients,
$X_{i1}, \ldots, X_{ip}$ are the corresponding values of the X-variables,
$\alpha$ is the intercept, and
$\varepsilon_i$ is the random error term.
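
As a concrete illustration, here is a minimal sketch of this model (in Python with numpy and statsmodels, which are my assumed tools here; the article itself does not use code). It simulates data from the equation with one X-variable and recovers the coefficients by least squares:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Illustrative parameter values (assumptions for this sketch, not from the article)
alpha, beta1, sigma = 2.0, 0.5, 1.0
n = 100

# Y = alpha + beta1 * X + epsilon, with epsilon ~ Normal(0, sigma)
x = rng.uniform(0, 10, size=n)
y = alpha + beta1 * x + rng.normal(0, sigma, size=n)

# Ordinary least squares; add_constant supplies the intercept column
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)   # estimates close to [alpha, beta1]
```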

The key point is that in regular least-squares regression, the error term $\varepsilon_i$ has a normal distribution with mean 0 and standard deviation $\sigma$. This means that $\varepsilon_i$ must be a continuous variable (i.e., one that takes values on an interval), and therefore $Y_i$, which for fixed X values is just $\varepsilon_i$ plus a constant, must be continuous as well. A binary variable (one taking only the values 0 and 1) cannot arise this way: for fixed X values, a binary $Y_i$ would force $\varepsilon_i$ to take on only two values, which is impossible for a normally distributed error.

Thus, if you use regular least-squares regression when you have a binary dependent variable (and should be using logistic regression), you are violating the least-squares requirement that the regression errors have a normal distribution.

When the assumptions that underlie the least-squares regression model are violated, you can no longer rely on the statistical inference (e.g., which regression coefficients are significant) or predictions that are made based on the least-squares model.
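
To see the violation concretely, here is a minimal sketch (again Python with numpy and statsmodels as assumed tools, on simulated data, not the article's): with a binary Y, the residuals at any fixed X value can take on only two values, so they cannot be normally distributed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated binary data: P(Y = 1) rises with X (illustrative numbers only)
x = rng.integers(0, 11, size=200).astype(float)
y = rng.binomial(1, 1 / (1 + np.exp(-(x - 5))))

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# At X = 5 the residual takes exactly two values: one when y = 0, one when y = 1.
# A normally distributed error could never be two-valued like this.
print(np.unique(np.round(resid[x == 5], 6)))
```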

What all of the above means can be seen easily with a few pictures. These pictures are for a simple regression (when there is only one X) and they plot Y vs. X.

The first figure shows the kind of data that is appropriate for regular least-squares regression:

In this figure, you can see that the Y-variable takes on continuous values.

The figure below shows data that is appropriate for logistic regression:

In this figure, you can see that the Y-variable takes on only two values, 0 and 1. This means the data appear to lie on two parallel horizontal lines, one at 0 and the other at 1. If you look carefully, you can also see that the probability that Y is 1 increases as the value of X increases.

You can also see from this picture that the least-squares regression equation above must give silly predictions for Y when Y takes on only binary values. The equation is linear: for any regression coefficient that is positive, increasing the corresponding X value causes the prediction for Y to increase, and you can make the predicted value of Y as large as you want just by moving the X value far enough. Thus, there will be X values for which the predicted Y value far exceeds 1. Similarly, there will be other X values for which the predicted Y value is negative and far below 0. Such predictions make no sense when the only values the Y-variable can take on are 0 and 1.
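
Here is a short sketch of this point (same assumed simulated setup as above; the particular numbers are mine and will not match the article's figures exactly):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Simulated binary data: P(Y = 1) increases with X (illustrative numbers)
x = rng.uniform(0, 10, size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(x - 5))))

line = sm.OLS(y, sm.add_constant(x)).fit()

# Extrapolate the straight line: the "predicted Y" escapes the [0, 1] range
x_new = np.array([-5.0, 15.0])
print(line.predict(sm.add_constant(x_new)))   # roughly one value below 0, one above 1
```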

To show this, I have added the least-squares regression line to the two figures shown above. Here is the first one, which was for data appropriate for least-squares regression.

For this first figure, the regression line makes perfect sense and gives very reasonable predictions.

Next, the figure for the data with the binary Y-variable is shown.

As you can see, the least-squares regression line gives predictions that make no sense. For example, at an X value of 8, the least-squares regression line predicts that Y will be above 1.5, but Y can only take on the values 0 and 1.

In summary, you cannot use least-squares regression when your Y-variable takes on only binary values. The assumptions that underlie least-squares regression are violated and, as a result, any inferences made from the fit are likely to be wrong. In addition, predictions about Y made using the least-squares line will likely make no sense. It is necessary to use a method designed for binary Y-variables, such as logistic regression.
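
For completeness, here is a minimal logistic-regression sketch on the same kind of simulated data (statsmodels' Logit is my choice of tool here; the article does not prescribe one):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# The same kind of simulated binary data as above
x = rng.uniform(0, 10, size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(x - 5))))

logit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

# Predicted probabilities stay between 0 and 1, even for extreme X values
x_new = np.array([-5.0, 8.0, 15.0])
print(logit.predict(sm.add_constant(x_new)))
```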

Any questions or comments on this article are welcome!


4 Responses to Why Regular Regression Does NOT Work

  1. Michael Clayton says:

    Title should be WHEN not WHY….will not work….etc.

  2. Avinash S says:

    Please clarify how you concluded that the Y-variable must be a continuous variable in linear regression, given that the error is assumed to have a normal distribution with mean zero. I mean, how does the distribution of the error give that idea? Thanks.

    • Tejas Pancholi says:

      Hello Avinash,
      I don’t think he said that the error decides whether Y is continuous or discrete. The nature of Y can be seen from the values it can take on: win or lose, ad shown or not shown, product bought or not bought, etc. In all these cases there are only two possible outcomes and nothing else, so Y is discrete. If Y instead takes values on the real scale (0, 1, 0.9, 3.14, 1.99999, etc.), it is a continuous variable, e.g., house price or vehicle mileage.

    • Ravishankar Chandrasekar says:

      Hi Avinash,

      The error is nothing but Y − Ŷ. If the error follows a normal distribution (a linear-regression assumption), it takes continuous values. Since the error comes from the prediction of Y, if the error takes continuous values, Y must also have taken continuous values. Hence, you cannot use linear regression on binary dependent variables, since that violates a basic assumption of the linear-regression model.
