It is an error to use regular least-squares regression when the dependent *Y*-variable is binary (takes on values of 0 or 1 only). Note: see the previous article When to Use Logistic Regression for a discussion of dichotomous and binary dependent *Y*-variables.


Recall that in least-squares regression, we model the *Y*-variable as a linear function of the *X*-variables plus a random error that is assumed to have a normal distribution. That is, the *i*'th *Y* observation is assumed to have been generated by the following equation:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i$$

where,

$\beta_1, \beta_2, \ldots, \beta_k$ are the regression slope coefficients,

$X_{i1}, X_{i2}, \ldots, X_{ik}$ are the corresponding values of the *X*-variables,

$\beta_0$ is the intercept, and

$\varepsilon_i$ is the random error term.

The key point is that in regular least-squares regression, the error term $\varepsilon_i$ has a normal distribution with mean 0 and standard deviation $\sigma$. This means that *Y* must be a continuous variable (i.e., one that takes values on an interval), not a binary variable (i.e., a variable taking only the values 0 and 1).
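The model above can be sketched in code. Here is a minimal simulation (pure Python, one *X*-variable; the "true" coefficient values and sample size are invented purely for illustration) that generates continuous *Y* data from the linear-plus-normal-error equation and then recovers the coefficients with the closed-form least-squares formulas:

```python
import random

random.seed(0)

beta0, beta1, sigma = 2.0, 0.5, 1.0   # illustrative "true" values
n = 1000

x = [random.uniform(0, 10) for _ in range(n)]
# Y is a linear function of X plus a normal(0, sigma) error term
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# Closed-form least-squares estimates for simple regression
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

print(b0, b1)  # estimates land near the true 2.0 and 0.5
```

Because the errors really are normal here, the estimates come out close to the values used to generate the data.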

Thus, if you use regular least-squares regression when you have a binary dependent variable (and should be using logistic regression) you are violating the least-squares requirement that the regression errors have a normal distribution.

When the assumptions that underlie the least-squares regression model are violated, you can no longer rely on the statistical inference (e.g., which regression coefficients are significant) or predictions that are made based on the least-squares model.

What all of the above means can be seen easily with a few pictures. These pictures are for a simple regression (when there is only one X) and they plot *Y* vs. *X*.

The first figure shows the kind of data that is appropriate for regular least-squares regression:

In this figure, you can see that the *Y*-variable takes on continuous values.

The figure below shows data that is appropriate for logistic regression:

In this figure, you can see that the *Y*-variable only takes on two values, 0 and 1. This means that the data appear to be on two horizontal parallel lines, one at 0 and the other at 1. If you look carefully, you can see that the probability that *Y* is 1 increases as the value of *X* increases.

You can also see by looking at this picture that the equation above for the least-squares regression must give silly predictions for *Y* when *Y* takes on only binary values. The equation is linear. For any regression coefficient that is positive, increasing the corresponding *X* value will cause the prediction for *Y* to increase. You can make the predicted value of *Y* as large as you want just by moving the *X* value far enough. Thus, there will be *X* values for which the predicted *Y* value will far exceed 1. Similarly, there will be other *X* values for which the predicted *Y* value will be negative and far below 0. Such predictions make no sense when the only values that the *Y*-variable can take on are 0 and 1.
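This is easy to demonstrate with a small sketch (pure Python; the threshold-plus-noise rule generating the 0/1 data, and every constant in it, are made up for illustration). Fitting an ordinary least-squares line to binary data and evaluating it at the extremes produces exactly the nonsensical predictions described above:

```python
import random

random.seed(1)

# Simulated binary data: P(Y = 1) increases with X
# (an arbitrary threshold-plus-noise rule, chosen just for illustration)
x = [random.uniform(0, 10) for _ in range(500)]
y = [1 if xi + random.gauss(0, 1.5) > 5 else 0 for xi in x]

# Ordinary least-squares line fitted to the 0/1 responses
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

predict = lambda xv: b0 + b1 * xv
print(predict(0), predict(10))  # predictions escape the [0, 1] range
```

With this simulated data, the fitted line predicts a *Y* value below 0 at the low end of the *X* range and above 1 at the high end, even though *Y* can only be 0 or 1.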

To show this, I have added the least squares regression line to the two figures shown above. Here is the first one which was for data that is appropriate for least-squares regression.

For this first figure, the regression line makes perfect sense and gives very reasonable predictions.

Next, the figure for the data with the binary *Y*-variable is shown.

As you can see, the least-squares regression line gives predictions that make no sense. For example, for an *X* value of 8, the least-squares regression line predicts that *Y* will be above 1.5. But *Y* can only take on values of 0 and 1.

In summary, you cannot use least-squares regression when your *Y*-variable only takes on binary values. The assumptions that underlie least-squares regression are violated and, as a result, any inferences that are made are likely to be wrong. In addition, predictions about *Y* made using the least-squares line will likely make no sense. It is necessary to use a method designed for binary *Y* variables such as logistic regression.
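To round out the summary, here is a minimal sketch of logistic regression fitted by plain gradient descent on the log-loss (pure Python on the same kind of simulated 0/1 data as above; in practice you would use a statistics package, and the learning rate, iteration count, and data-generating rule here are all invented for illustration). Unlike the least-squares line, the fitted curve maps every *X* to a value strictly between 0 and 1:

```python
import math
import random

random.seed(2)

# Simulated binary data: P(Y = 1) increases with X (illustrative rule)
x = [random.uniform(0, 10) for _ in range(500)]
y = [1 if xi + random.gauss(0, 1.5) > 5 else 0 for xi in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit P(Y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent on log-loss
b0, b1 = 0.0, 0.0
lr = 0.01
for _ in range(3000):
    g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        err = sigmoid(b0 + b1 * xi) - yi   # gradient of the log-loss
        g0 += err
        g1 += err * xi
    b0 -= lr * g0 / len(x)
    b1 -= lr * g1 / len(x)

predict = lambda xv: sigmoid(b0 + b1 * xv)
print(predict(0), predict(10))  # both always inside (0, 1)
```

Because the sigmoid function is bounded between 0 and 1, the predicted probabilities can never run off to the silly values the straight least-squares line produces.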

Any questions or comments on this article are welcome!

Title should be WHEN not WHY….will not work….etc.

Please clarify how you concluded that the *Y*-variable must be a continuous variable in linear regression, given that the error is assumed to have a normal distribution with mean zero. I mean, how does the distribution of the error give that idea? Thanks.

Hello Avinash,

I don't think he has said that the error is what decides whether *Y* will be continuous or discrete. I think the nature of *Y* is determined by the values it can take on, like win or lose, ad shown or not shown, product bought or not bought, etc., where there are only two possible outcomes and nothing else. In all these cases *Y* is discrete. However, if *Y* takes values on the real scale (0, 1, 0.9, 3.14, 1.99999, etc.), then it is a continuous variable, e.g., house price, mileage of a vehicle, etc.

Hi Avinash,

The error is nothing but *Y* minus *Y*-hat. Under the linear regression assumption, the errors follow a normal distribution, and a normally distributed variable takes continuous values. Since the error comes from the prediction of *Y*, if the error takes continuous values then *Y* must also have taken continuous values. Hence, you cannot use linear regression on binary dependent variables, since that violates this very basic assumption of the linear regression model.