It is an error to use regular least-squares regression when the dependent Y-variable is binary (takes on values of 0 or 1 only). Note: see the previous article When to Use Logistic Regression for a discussion of dichotomous and binary dependent Y-variables.
Recall that in least-squares regression, we model the Y-variable as a linear function of the X-variables plus a random error that is assumed to have a normal distribution. That is, the i'th Y observation is assumed to have been generated by the following equation:

Y_i = b_0 + b_1*X_1i + b_2*X_2i + ... + b_k*X_ki + e_i

where

b_1, ..., b_k are the regression slope coefficients,
X_1i, ..., X_ki are the corresponding values of the X-variables,
b_0 is the intercept, and
e_i is the random error term.
The key point is that in regular least-squares regression, the error term is assumed to have a normal distribution with mean 0 and some standard deviation sigma. This means the Y-variable must be a continuous variable (i.e., one that takes values on an interval), not a binary variable (i.e., a variable taking only the values 0 and 1).
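The model above can be sketched in a few lines of code. This is a minimal illustration, not part of the original article: the sample size, the true coefficients (intercept 1.0, slope 2.0), and the noise level 0.5 are all made-up values chosen just to show a continuous Y generated exactly as the equation describes, and then recovered by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
b0_true, b1_true, sigma = 1.0, 2.0, 0.5   # illustrative values only

x = rng.uniform(0, 10, n)
eps = rng.normal(0.0, sigma, n)           # normal errors: mean 0, sd sigma
y = b0_true + b1_true * x + eps           # continuous Y, as OLS assumes

# Fit by ordinary least squares: minimize ||A*beta - y||^2.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)   # estimated (intercept, slope) should be close to (1.0, 2.0)
```

Because the errors really are normal here, the fitted coefficients land close to the true ones, and the usual inference about them is trustworthy.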
Thus, if you use regular least-squares regression when you have a binary dependent variable (when you should be using logistic regression), you are violating the least-squares requirement that the regression errors have a normal distribution.
When the assumptions that underlie the least-squares regression model are violated, you can no longer rely on the statistical inference (e.g., which regression coefficients are significant) or predictions that are made based on the least-squares model.
What all of the above means can be seen easily with a few pictures. These pictures are for a simple regression (when there is only one X) and they plot Y vs. X.
The first figure shows the kind of data that is appropriate for regular least-squares regression:
In this figure, you can see that the Y-variable takes on continuous values.
The figure below shows data that is appropriate for logistic regression:
In this figure, you can see that the Y-variable only takes on two values, 0 and 1. This means that the data appear to be on two horizontal parallel lines, one at 0 and the other at 1. If you look carefully, you can see that the probability that Y is 1 increases as the value of X increases.
You can also see by looking at this picture that the equation above for the least-squares regression must give silly predictions for Y when Y takes on only binary values. The equation is linear. For any regression coefficient that is positive, increasing the corresponding X value will cause the prediction for Y to increase. You can make the predicted value of Y as large as you want just by moving the X value far enough. Thus, there will be X values for which the predicted Y value will far exceed 1. Similarly, there will be other X values for which the predicted Y value will be negative and far below 0. Such predictions make no sense when the only values that the Y-variable can take on are 0 and 1.
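The point above is easy to demonstrate numerically. In this sketch (my own illustration, with simulated data: the probability that Y is 1 rises with X, roughly as in the figure), a least-squares line is fitted to binary 0/1 outcomes, and evaluating that line at X values far enough left or right produces "predictions" below 0 and above 1.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
prob = 1.0 / (1.0 + np.exp(-(x - 5.0)))   # P(Y=1) increases with x
y = rng.binomial(1, prob)                 # binary Y: only 0s and 1s

# Fit a least-squares line to the 0/1 data anyway.
A = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# The line is unbounded: move x far enough and predictions leave [0, 1].
print(b0 + b1 * (-10.0))   # negative, far below 0
print(b0 + b1 * 20.0)      # well above 1
```

Neither value is a possible outcome for a variable that can only be 0 or 1, which is exactly the problem described above.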
To show this, I have added the least-squares regression line to the two figures shown above. Here is the first one, which was for data that is appropriate for least-squares regression.
For this first figure, the regression line makes perfect sense and gives very reasonable predictions.
Next, the figure for the data with the binary Y-variable is shown.
As you can see, the least-squares regression line gives predictions that make no sense. For example, for an X value of 8, the least-squares regression line predicts that Y will be above 1.5. But Y can only take on values of 0 and 1.
In summary, you cannot use least-squares regression when your Y-variable only takes on binary values. The assumptions that underlie least-squares regression are violated and, as a result, any inferences that are made are likely to be wrong. In addition, predictions about Y made using the least-squares line will likely make no sense. It is necessary to use a method designed for binary Y-variables, such as logistic regression.
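To close the loop, here is a sketch of the logistic alternative, again my own illustration rather than anything from the article. Logistic regression models P(Y=1 given x) as 1/(1 + exp(-(b_0 + b_1*x))), so every prediction is a probability between 0 and 1; the fit below uses simple gradient ascent on the log-likelihood, with made-up data and arbitrary learning-rate/iteration settings.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x - 5.0))))   # binary Y

# Fit P(Y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood.
b0, b1 = 0.0, 0.0
lr = 0.05                                 # illustrative settings, not tuned
for _ in range(10000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))   # current fitted probabilities
    b0 += lr * np.mean(y - p)                  # gradient w.r.t. intercept
    b1 += lr * np.mean((y - p) * x)            # gradient w.r.t. slope

# Unlike the least-squares line, predictions always stay inside (0, 1).
p_low = 1.0 / (1.0 + np.exp(-(b0 + b1 * 0.0)))
p_high = 1.0 / (1.0 + np.exp(-(b0 + b1 * 20.0)))
print(p_low, p_high)
```

Even at extreme X values, the fitted curve returns a probability near 0 or near 1 rather than an impossible value like -1.5 or 2.5.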
Any questions or comments on this article are welcome!