In logistic regression, as in least-squares regression, you often want to assess the effects of the independent $X$-variables on the dependent $Y$-variable. In logistic regression, however, things are somewhat more complicated than in least-squares regression, as I explain in this Article. This Article is Part 3 of a 5-part series focused on understanding and interpreting the logistic regression coefficient output table. Click here to see a brief summary of the focus of each of the 5 parts.
Coefficient Table Use #2: Assessing the Effects of the X-Variables
As I discussed in Part 3 of the parallel series discussing the least-squares regression coefficient table, in regular least-squares regression, assessing the effects of the $X$-variables is relatively straightforward. In least-squares regression, because the equation that is fit to the data is completely linear, the effect of changing an $X$-variable is the same everywhere.
Things are not so simple in logistic regression. In logistic regression, we are not fitting a linear equation to the dependent variable, but rather are fitting a linear equation to the log odds:

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

where $p$ is the probability of “success” (i.e., the probability that $Y = 1$). Things would be just as simple as for linear least-squares regression if the log odds were an intuitive quantity that we were directly interested in. If that were the case, then the interpretation of the regression coefficient would be the same as for least-squares regression: The regression coefficient shows the effect on the log odds of a one-unit change in the corresponding $X$-variable.
I don’t know about you, but I don’t normally think about log odds. As a result, knowing the effect of an $X$-variable on the log odds does not communicate much to me. Truth be known, what I think about are probabilities, so what I usually really want to know about is the effect of changing an $X$-variable on the probability of success (the probability that the dependent variable is 1).
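The problem is that in logistic regression, unlike regular least-squares regression, the effect of changing an $X$-variable on the probability is not the same everywhere. Rather, it depends on the values of the other $X$-variables, as the following equation for $p$ shows:

$$p = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}$$

This equation may be a little hard to read, but the exponent of $e$ is the linear function $\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$. Another, equivalent, way to write the equation above is as follows:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}$$

Here, the exponent of $e$ is the negative of the linear function: $-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)$.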
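To make the non-linearity concrete, here is a minimal Python sketch. The coefficient `b1` and the baseline log-odds values are hypothetical numbers chosen only for illustration; they are not taken from the Kid Creative regression. The sketch applies the same one-unit increase in an $X$-variable at several different starting points and prints how much the probability changes each time.

```python
import math

def logistic(z):
    """Convert a log-odds value z into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficient, used only for illustration.
b1 = 1.0

# Hypothetical baseline log odds, i.e. the value of the linear function
# before the X-variable is increased by one unit.
baselines = (-4.0, -2.0, 0.0, 2.0)

for baseline in baselines:
    p_before = logistic(baseline)
    p_after = logistic(baseline + b1)  # same one-unit change every time
    print(f"log odds {baseline:+.1f}: p goes {p_before:.3f} -> {p_after:.3f} "
          f"(change {p_after - p_before:+.3f})")
```

The change in the log odds is the same on every row (it is always `b1`), but the change in the probability differs from row to row, which is the complication this Article is about.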
Now, at this point, I do not want you to worry at all about these equations, which show you the non-linear relationship between the $X$’s and $p$. I have shown you these equations only for their SHOCK VALUE. I want you to go “eeewwww” or “yuck” or “OMG” or whatever, and feel free to tune them out of your mind for now. But I do want you to understand that the relationship between the probability $p$ and the $X$-variables is not linear and that things are more complicated than in regular least-squares regression.
While you and I might both ultimately want to understand the effect of changing an $X$-variable on $p$, we can make an intermediate step in that direction by first trying to understand the effect on the odds, rather than the log odds. Odds are not totally outside of the realm of our intuitive understanding. To interpret the effect of changing an $X$-variable on the odds, we consider the odds ratio, a statistic that is generally output as a part of the coefficient table output in logistic regression.
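If it helps to see how probability, odds, and log odds relate to one another, the following small Python sketch converts between them. The function names and the example probability of 0.10 are my own choices for illustration; the only facts used are that the odds equal $p/(1-p)$ and that the log odds are the natural logarithm of the odds.

```python
import math

def odds_from_probability(p):
    """Odds of success, e.g. p = 0.10 gives odds of 1 to 9 (about 0.111)."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Invert the conversion: odds of 1 to 9 give p = 0.10."""
    return odds / (1.0 + odds)

p = 0.10                          # illustrative probability of success
odds = odds_from_probability(p)   # 1 to 9, written as a single number
log_odds = math.log(odds)         # the quantity the linear equation is fit to

print(f"p = {p:.2f}, odds = {odds:.4f}, log odds = {log_odds:.4f}")
```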
So in the remainder of this Article, I am going to first show you how the reported odds ratio is calculated. Then I will go on to explain why this odds ratio is useful and how to interpret it. I will derive the odds ratio in a separate Article. To begin, however, I would like to have a specific example in front of you as a basis for discussion. So here again is the coefficient table output from the Kid Creative logistic regression.
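So how are the odds ratios in the last column calculated? The odds ratio is simply the exponential of the corresponding regression coefficient. That is:

$$\text{Odds Ratio}_j = e^{\beta_j}$$

where $\beta_j$ is the regression coefficient for the $X$-variable in question.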
Once again, I am not going to derive this equation here. If you want to see “the math,” click here.
Let’s check this formula just to make sure that it is completely clear. For example, the logistic regression coefficient for the variable IsFemale is 1.646000. Typing this into a calculator and hitting the $e^x$ button or EXP() button (depending on the calculator), or typing “=exp(1.646000)” into Excel, gives 5.1861935. This matches the corresponding number in the “Odds Ratio” column. So the odds ratios are easy to calculate from the logistic regression coefficients by using the exponential function.
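The same check can be done in Python instead of a calculator or Excel. This is just a sketch of the arithmetic; the variable names are mine, and the coefficient is the IsFemale value quoted above.

```python
import math

coef_is_female = 1.646000                 # IsFemale coefficient from the table
odds_ratio = math.exp(coef_is_female)     # exponential of the coefficient

print(odds_ratio)   # approximately 5.186, matching the Odds Ratio column

# Going the other way: the natural log of the odds ratio recovers the coefficient.
print(math.log(odds_ratio))
```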
But what does the “Odds Ratio” mean? The numbers in the Odds Ratio column show how the odds change for a one-unit change in the $X$-variable. The odds ratio shows this change in a multiplicative fashion rather than as a difference.
Again consider the IsFemale variable as an example. Suppose that we have two people who are identical with respect to all of the other variables except that one is male and one is female. Since the IsFemale variable is coded as 0 for males and 1 for females, “changing” from male to female is a one-unit change in the IsFemale variable. The Odds Ratio value for this variable is 5.1861935. This means that we expect the odds that the female buys to be about 5.2 times the odds that the “equivalent” male would buy. So if the odds that the male buys happen to be 1 to 9, then the odds that the female buys are 5.2 to 9 (or 26 to 45). If the odds that a particular male buys happen to be 2 to 19, then the odds that the equivalent female buys would be 10.4 to 19 (or 52 to 95). What I have done in these examples is to take the first number in the odds (that is, the x in x to y) and multiply it by the odds ratio. Then, in the parentheses, I have converted the odds to integers, as is customary for odds reported as x to y.
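The arithmetic in these two examples can be reproduced with a short Python sketch. The helper `scale_odds` is a hypothetical function written for this illustration; it uses exact fractions and the rounded odds ratio of 5.2 so that the integer odds come out the same as in the text.

```python
from fractions import Fraction

def scale_odds(x, y, odds_ratio):
    """Multiply odds of 'x to y' by an odds ratio given as a decimal string.

    Returns the new odds as a pair of integers in lowest terms.
    """
    scaled = Fraction(x, y) * Fraction(odds_ratio)
    return scaled.numerator, scaled.denominator

# Rounded IsFemale odds ratio, as used in the worked examples.
print(scale_odds(1, 9, "5.2"))    # (26, 45)
print(scale_odds(2, 19, "5.2"))   # (52, 95)
```

Passing the odds ratio as a string keeps the arithmetic exact; a float such as `5.2` cannot be represented exactly in binary, which would produce unwieldy integers.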
Let’s look at another example that does not involve an indicator (dummy) variable. Consider ResidenceLength. The Odds Ratio for ResidenceLength is 1.025. This means that for each additional year of residence (a 1-unit change in the $X$), the odds of buying are multiplied by 1.025. Thus, if the odds of a customer buying are 1 to 9, then if they have an additional year of residence the odds will be 1.025 to 9 (or 41 to 360, obtained by multiplying both by 40). Similarly, if the customer had one less year of residence length, the odds would be 1/1.025=0.976 to 9 (or 122 to 1125, obtained by multiplying both by 125).
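Here is the ResidenceLength example as a Python sketch using exact fractions (the variable names are mine). One thing it brings out: the figure 122 to 1125 comes from rounding 1/1.025 to 0.976. If the reciprocal is kept exact, the odds for one less year come out as 40 to 369, which is very nearly the same number.

```python
from fractions import Fraction

odds_ratio = Fraction("1.025")   # ResidenceLength odds ratio, exactly 41/40
baseline = Fraction(1, 9)        # baseline odds of 1 to 9

one_more_year = baseline * odds_ratio   # multiply for a one-unit increase
one_less_year = baseline / odds_ratio   # divide for a one-unit decrease

print(one_more_year)   # 41/360, i.e. 41 to 360
print(one_less_year)   # 40/369, i.e. 40 to 369

# Compare with the rounded version in the text (122 to 1125).
print(float(one_less_year), 122 / 1125)
```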
My examples have shown you how the Odds Ratio numbers are used to adjust the odds for a one-unit change in the $X$-variable. Showing you such examples is necessary in order for you to really understand what the odds ratio means, but they are a bit more complicated than where one typically stops in the interpretation of odds ratios in logistic regression.
Let me show you how the Odds Ratio is typically interpreted using the odds ratio for ResidenceLength which is 1.025. What is typically done is to look past the leading 1 to the decimal (0.025 in this case) and convert it to a percent. So in this case, each additional year results in a 2.5% increase in the odds of buying.
So here is the punchline: If you subtract 1 from the odds ratio and multiply by 100 (that is, $(\text{Odds Ratio} - 1) \times 100$), this shows the percentage change in the odds for a 1-unit change in the $X$. Thus, the odds ratios allow us to see what the effects of the $X$-variables are on the odds. As summarized in the odds ratio, the effect of changing an $X$-variable is multiplicative. It is usually best thought of in terms of the percentage change in the odds for a one-unit change in the $X$. This approach is somewhat intuitive (or at least you get used to it after a while).
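As a quick sketch of the punchline in Python (the function name is hypothetical), here is the percentage change computed for the two odds ratios discussed in this Article. Note that the “look past the leading 1” shortcut only works when the odds ratio is between 1 and 2; the subtract-1-and-multiply-by-100 formula works for any odds ratio, and gives a negative percentage when the odds ratio is below 1.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds for a one-unit change in the X-variable."""
    return (odds_ratio - 1.0) * 100.0

print(percent_change_in_odds(1.025))       # about 2.5 (ResidenceLength)
print(percent_change_in_odds(5.1861935))   # about 418.6 (IsFemale)
print(percent_change_in_odds(0.976))       # about -2.4 (an odds ratio below 1)
```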
In this article, I will not try to make the next step of interpreting the effects of changes in the $X$-variables on the probabilities. I will address this in a future article. I also have not derived the formula that shows that the odds ratio for an $X$-variable in logistic regression is just the exponential of the corresponding regression coefficient. To see that derivation, click here.
Before closing, I want to remind you that regressions generally show association, not causation. While throughout this article I have allowed myself to talk about “the effects” of changing an $X$-variable, this is really not correct. What would be correct is to talk about a one-unit change in an $X$-variable being associated with a percentage change in the odds. Constantly talking about an “associated change” rather than “the effect” is not natural, so I have taken the incorrect liberty of possibly implying a causal relationship.
I also want to remind you that the effect of an $X$-variable captured by a regression coefficient is conditional on the values of the other variables (i.e., they are held constant). This fact also has important implications with respect to interpretation. These two issues (causation and conditional effect) are discussed further in the last half of the article corresponding to this one in the 5-part review of least squares regression (click here).
In the next article (Part 4) in this series, I will continue to parallel the discussion in the review series on regular least-squares regression, and will discuss making predictions.