In logistic regression, as in least-squares regression, you often want to assess the effects of the independent X-variables on the dependent Y-variable. In logistic regression, however, things are somewhat more complicated than in least-squares regression, as I explain in this Article. This Article is Part 3 of a 5-part series focused on understanding and interpreting the logistic regression coefficient output table. Here is a brief summary of the focus of each of the 5 parts:
- Part 1: How the logistic regression coefficient table compares to the corresponding table from least-squares regression.
- Part 2: Coefficient table use #1 — determining what variables in the logistic regression matter.
- Part 3: This part. Coefficient table use #2 — assessing the impact of each of the X-variables on the dependent variable.
- Part 4: Coefficient table use #3 — predicting the probability that the dependent variable is 1.
- Part 5: Coefficient table use #4 — assessing the uncertainty in the regression coefficients.
Coefficient Table Use #2: Assessing the Effects of the X-Variables
As I discussed in Part 3 of the parallel series on the least-squares regression coefficient table, in regular least-squares regression, assessing the effects of the X-variables is relatively straightforward. In least-squares regression, because the equation that is fit to the data is completely linear, the effect of changing an X-variable by one unit is simply the corresponding regression coefficient, no matter what the values of the other X-variables are.
Things are not so simple in logistic regression. In logistic regression, we are not fitting a linear equation to the dependent variable, but rather are fitting a linear equation to the log odds:

ln(p/(1 − p)) = β0 + β1X1 + β2X2 + … + βkXk

where p is the probability of “success” (i.e., the probability that Y = 1). Things would be just as simple as for linear least-squares regression if the log odds were an intuitive quantity that we were directly interested in. If that were the case, then the interpretation of the regression coefficients would be the same as for least-squares regression: each regression coefficient shows the effect on the log odds of a one-unit change in the corresponding X-variable.
I don’t know about you, but I don’t normally think about log odds. As a result, knowing the effect of an X-variable on the log odds does not communicate much to me. Truth be known, what I think about are probabilities, so what I usually really want to know about is the effect of changing an X-variable on the probability of success (the probability that the dependent variable is 1).
The problem is that in logistic regression, unlike regular least-squares regression, the effect of changing an X-variable is not the same everywhere. Rather, it depends on the values of all of the X-variables. Solving the log-odds equation above for p gives:

p = e^(β0 + β1X1 + … + βkXk) / (1 + e^(β0 + β1X1 + … + βkXk))

This equation may be a little hard to read, but the exponent of e is the linear function β0 + β1X1 + … + βkXk. Another, equivalent, way to write the equation above is as follows:

p = 1 / (1 + e^−(β0 + β1X1 + … + βkXk))

Here, the exponent of e is the negative of the linear function: −(β0 + β1X1 + … + βkXk).
Now, at this point, I do not want you to worry at all about these equations, which show you the non-linear relationship between the X’s and p. I have shown you these equations only for their SHOCK VALUE. I want you to go “eeewwww” or “yuck” or “OMG” or whatever, and feel free to tune them out of your mind for now. But I do want you to understand that the relationship between the probability p and the X-variables is not linear and that things are more complicated than in regular least-squares regression.
While you and I might both ultimately want to understand the effect of changing an X-variable on p, we can make an intermediate step in that direction by first trying to understand the effect on the odds, rather than the log odds. Odds are not totally outside the realm of our intuitive understanding. To interpret the effect of changing an X-variable on the odds, we consider the odds ratio, a statistic that is generally output as part of the coefficient table in logistic regression.
So in the remainder of this Article, I am going to first show you how the reported odds ratio is calculated. Then I will go on to explain why this odds ratio is useful and how to interpret it. I will derive the odds ratio in a separate Article. To begin, however, I would like to have a specific example in front of you as a basis for discussion. So here again is the coefficient table output from the Kid Creative logistic regression.
So how are the odds ratios in the last column calculated? The odds ratio is simply the exponential of the corresponding regression coefficient. That is:

odds ratio = exp(β)
Once again, I am not going to derive this equation here. If you want to see “the math,” click here.
Let’s check this formula just to make sure that it is completely clear. For example, the logistic regression coefficient for the variable IsFemale is 1.646000. Typing this into a calculator and hitting the e^x button (or EXP() button, depending on the calculator), or typing “=exp(1.646000)” into Excel, gives 5.1861935. This matches the corresponding number in the “Odds Ratio” column. So the odds ratios are easy to calculate from the logistic regression coefficients by using the exponential function.
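The same check can be done in a couple of lines of Python (a minimal sketch; the variable names are mine):

```python
import math

# The logistic regression coefficient for IsFemale from the table
beta_is_female = 1.646000

# Odds ratio = exp(beta)
odds_ratio = math.exp(beta_is_female)
print(round(odds_ratio, 4))  # 5.1862, matching the Odds Ratio column
```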
But what does the “Odds Ratio” mean? The numbers in the Odds Ratio column show how the odds change for a one-unit change in the X-variable. The odds ratio shows this change in a multiplicative fashion rather than as a difference.
Again consider the IsFemale variable as an example. Suppose that we have two people who are identical with respect to all of the other variables except that one is male and one is female. Since the IsFemale variable is coded as 0 for males and 1 for females, “changing” from male to female is a one-unit change in the IsFemale variable. The Odds Ratio value for this variable is 5.1861935. This means that we expect the odds that the female buys to be about 5.2 times the odds that the “equivalent” male would buy. So if the odds that the male buys happen to be 1 to 9, then the odds that the female buys are 5.2 to 9 (or 26 to 45). If the odds that a particular male buys happen to be 2 to 19, then the odds that the equivalent female buys would be 10.4 to 19 (or 52 to 95). So what I have done in these examples is to take the first number in the odds (that is, the x in x to y) and multiply it by the odds ratio. Then, in the parentheses, I have converted the odds to integers, as is customary for odds reported as x to y.
Let’s look at another example that does not involve an indicator (dummy) variable. Consider ResidenceLength. The Odds Ratio for ResidenceLength is 1.025. This means that for each additional year of residence (a one-unit change in the X-variable), the odds of buying are multiplied by 1.025. Thus, if the odds of a customer buying are 1 to 9, then if they have an additional year of residence the odds will be 1.025 to 9 (or 41 to 360, obtained by multiplying both by 40). Similarly, if the customer had one less year of residence, the odds would be 1/1.025 = 0.976 to 9 (or 122 to 1125, obtained by multiplying both by 125).
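These odds adjustments are easy to verify numerically. Here is a small sketch (the helper name `apply_odds_ratio` is my own, not from any library):

```python
# Apply an odds ratio to odds expressed as "x to y": multiply the first
# number by the odds ratio and leave the second alone.
def apply_odds_ratio(x, y, odds_ratio):
    return x * odds_ratio, y

# IsFemale example: odds of 1 to 9 become 5.2 to 9 (i.e., 26 to 45)
print(apply_odds_ratio(1, 9, 5.2))
# ResidenceLength example: one extra year turns 1 to 9 into 1.025 to 9
print(apply_odds_ratio(1, 9, 1.025))
```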
My examples have shown you how the Odds Ratio numbers are used to adjust the odds for a one-unit change in the X-variable. Showing you such examples is necessary for you to really understand what the odds ratio means, but they go a bit further than where one typically stops in the interpretation of odds ratios in logistic regression.
Let me show you how the Odds Ratio is typically interpreted using the odds ratio for ResidenceLength which is 1.025. What is typically done is to look past the leading 1 to the decimal (0.025 in this case) and convert it to a percent. So in this case, each additional year results in a 2.5% increase in the odds of buying.
So here is the punchline: If you subtract 1 from the odds ratio and multiply by 100 (that is, (odds ratio − 1) × 100), this shows the percentage change in the odds for a one-unit change in the X-variable. Thus, the odds ratios allow us to see what the effects of the X-variables are on the odds. As summarized in the odds ratio, the effect of changing an X-variable is multiplicative. It is usually best thought of in terms of the percentage change in the odds for a one-unit change in the X-variable. This approach is somewhat intuitive (or at least you get used to it after a while).
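The punchline fits in a one-line function (a sketch; the function name is mine):

```python
# Percentage change in the odds for a one-unit change in the X-variable
def pct_change_in_odds(odds_ratio):
    return (odds_ratio - 1) * 100

print(round(pct_change_in_odds(1.025), 1))      # 2.5 (% increase per year of residence)
print(round(pct_change_in_odds(5.1861935), 1))  # 418.6 (% increase for IsFemale)
```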
In this article, I will not try to make the next step of interpreting the effects of changes in the X-variables on the probabilities. I will address this in a future article. I also have not derived the formula that shows that the odds ratio for an X-variable in logistic regression is just the exponential of the corresponding regression coefficient. To see that derivation, click here.
Before closing, I want to remind you that regressions generally show association, not causation. While throughout this article I have allowed myself to talk about “the effects” of changing an X-variable, this is really not correct. What would be correct is to talk about a one-unit change in an X-variable being associated with a percentage change in the odds. Constantly talking about an “associated change” rather than “the effect” is not natural, so I have taken the incorrect liberty of possibly implying a causal relationship.
I also want to remind you that the effect of an X-variable captured by a regression coefficient is conditional on the values of the other variables (i.e., they are held constant). This fact also has important implications with respect to interpretation. These two issues (causation and conditional effect) are discussed further in the last half of the article corresponding to this one in the 5-part review of least-squares regression (click here).
In the next article (Part 4) in this series, I will continue to parallel the discussion in the review series on regular least-squares regression, and will discuss making predictions.
Questions or Comments?
Any questions or comments? Please feel free to comment below. I am always wanting to improve this material.
I have had to implement a very simple “captcha” field because of spam comments, so be a bit careful about that. Enter your answer as a number (not a word). Also, save your comment before you submit it in case you make a mistake. You can use ctrl-A and then ctrl-C to save a copy on the clipboard. Particularly if your comment is long, I would hate for you to lose it.
I am studying the relationship between diseases and surgical complications, via retrospective database analysis. The main question is, of course, whether or not a certain disease increases the risk of surgical complications. Most of the factors we look for are dichotomous in nature, such as “disease yes/no” or “complicated vs non-complicated disease”, plotted against individual complications such as “bleeding yes/no”, “infection yes/no”, “any complication yes/no” etc.
So, when comparing only two binary variables, such as “disease yes/no”, against a surgical complication, such as “bleeding yes/no” – does it make any sense to use logistic regression analysis? I mean, I can easily deduct for instance the odds ratios by just using four numbers (exposed positives, exposed negatives, non-exposed positives and non-exposed negatives), right?
So – is it only relevant to use the analysis when you have multiple variates which you want to calculate against the binary y-variable (for instance, other possibly non-binary confounding factors such as age, BMI etc.)? I.e., would it in this case only be useful for “adjusting” the relationship, taking confounders into account? Otherwise, I don’t understand where the actual regression analysis comes into play (and I have been told that the data I got in a large Excel file was calculated using logistic regression analysis).
You are absolutely correct in what you say. If you have a dichotomous Y variable and only one dichotomous X variable, then you can compute the odds ratios from the estimated probabilities directly. You can also test whether or not the X-variable is significantly related to the Y variable using a two-sample test of proportions (taught in most basic stats courses and in essentially every basic stats book).
Things change, however, when you add in another X variable, even if it is also dichotomous. The reason is that the probabilities are not additive in the same way that expectations are in regular linear least-squares regression. For example, suppose that the 0 and 1 conditions of X1 correspond to sample proportions of complications of 1% and 3%, for a change of 2% when this X-variable changes from low to high. Suppose that the 0 and 1 conditions of a second variable X2 correspond to sample proportions of 2% and 5%, for a change of 3% as this variable changes from low to high. Finally, suppose that when X1 = 0 and X2 = 1, the sample proportion is 2.5%. What would you predict for the case that X1 = 0 and X2 = 0? The naive approach (which would work with regular least-squares regression) would be to subtract 3% from 2.5% to get a probability of -0.5%. That does not work.
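Here is the arithmetic of that naive additive prediction, just to make the failure concrete:

```python
# Sample proportions from the example above
p_when_x1_0_x2_1 = 0.025    # observed proportion when X1=0, X2=1
effect_of_x2 = 0.05 - 0.02  # "change" as X2 goes from 0 to 1 (3 points)

# Naive additive prediction for X1=0, X2=0: subtract the X2 effect
naive_p = p_when_x1_0_x2_1 - effect_of_x2
print(round(naive_p, 4))  # -0.005: a negative "probability", which is impossible
```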
So the effects of the X-variables on the Y variable cannot be additive or linear in the same way that they are in regular least-squares regression. This leads to transforming P(Y=1) using the logistic transform to obtain a linear relationship. Only then do the effects of the X-variable add in the way that you would generally like them to. Without using the logistic approach, things would get even more complicated if there were interactions between the X-variables.
So, I think you will be much better off if you go ahead and use the logistic regression set up. In addition to dealing with the additivity problem illustrated by the example I just gave, the logistic regression approach will allow you to use non-dichotomous X-variables, add interactions, and so on. And there is well-developed “machinery” for testing hypotheses, assessing how well the model fits, etc.
One final comment. The approach(es) you can take to the sort of problem you describe (surgical complications) depends on exactly what you are trying to do. If you really are only interested in predicting the complication rates and are not interested in statistical testing of which variables are important, etc., then, if you have enough data, you could use a non-parametric approach such as classification trees. Logistic regression, however, is very appropriate for this problem and it has the advantage over some of the “big data” methods in that you can do “traditional” hypothesis testing to assess which variables appear to have a relationship with the dependent Y variable.
Your project sounds interesting and important. I hope my discussion helps.
Regards,
StatsProf
Hello Professor, I would like to congratulate you for these explanations on issues that you can hardly find in textbooks but are the basis for a lot of problems you have in the modelling process.
My question is: why is the punchline to subtract one from the odds ratio? Why not 2 or 3?
And my next question is: do we have any hint on how to interpret the odds ratio change as a probability impact? I know that you mentioned you are planning to create a new article on that, but a hint would help me much!
Thank you very much for your time!
Marios,
Thanks for your question!
We subtract 1 because an odds ratio of something like 1.15 means a 15% increase in the odds for a one-unit increase in the X-variable. Similarly, an odds ratio of 0.95 means a 5% decrease in the odds for a one-unit increase in the X-variable. We are just converting a factor (like 1.15) into a percent increase (like 15%).
I do plan an article on interpreting the odds ratio in terms of a change in the probability. I will have it complete within a month from now. I have recently been very busy teaching, so I have not had much time to develop material for this web site. But my teaching is lightening up, and this is a priority.
The hint you ask for, however, is that you have to use the logistic equation to predict the change in p. Because the equation is non-linear, the amount of change in p depends not only on the change in the X-variable for which you are considering the odds ratio, but also on the values of the other X-variables.
I hope this helps.
StatsProf
Thank you very much for your answer.
One last question to make it clear for me. If we have an odds ratio of 15, which means that a one-unit change in the regressor X multiplies the odds of the event by 15, how can we interpret this in terms of subtracting the one? 1400%??
Thank you again for your responses!
Marios,
Yes. You are correct. It is a 1400% increase in the odds.
This would be like going from a p of 0.205 to a p of 0.795.
Why? (I think you understand this, but I am going to explain it anyway for other readers.)
Well, the odds corresponding to a p of 0.205 are 0.205/(1-0.205) = 0.258.
Similarly, the odds corresponding to a p of 0.795 are 0.795/(1-0.795) = 3.878.
Finally, 3.878/0.258=15.0 so the odds corresponding to 0.795 (which is 3.878) are 15 times bigger than the odds for 0.205 (which is 0.258).
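The same calculation takes only a few lines of Python (small differences from the numbers above are just rounding):

```python
p0 = 0.205
odds0 = p0 / (1 - p0)     # about 0.258
odds1 = 15 * odds0        # apply the odds ratio of 15
p1 = odds1 / (1 + odds1)  # convert the new odds back to a probability
print(round(odds0, 3), round(p1, 3))  # 0.258 0.795
```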
Of course, in doing the above calculation, I could start at a p other than 0.205 and then see what the new probability would be for a change corresponding to an odds ratio of 15. I have made a graph that shows what would happen for all “starting” probabilities between 0 and 1.
The x-axis in the graph is the “starting” probability. The orange-brown trace shows the “ending” probability after the change corresponding to an odds-ratio of 15. The vertical difference between the orange-brown trace and the blue trace shows the change in the probability.
So what I am planning to do when I have a chance is create a page that will do these calculations for the user and also make a graph similar to the one above for whatever odds ratio the user inputs.
I hope this helps.
Regards,
StatsProf
Hello! On the punchline thing, this only works for variables with no indicator?
So the “punchline” is that (odds ratio − 1) × 100 shows the percentage change in the odds for a one-unit change in X. Here the “odds ratio” is from the logistic regression coefficient table output for the variable X, and it is calculated as odds ratio = exp(beta), where beta is the slope coefficient for X.
I think that what you are asking is whether this works for X’s that are indicator (dummy) variables that only take on the values 0 or 1. The answer is basically yes, but there is a small caveat. If your indicator variable X is zero (X=0), then it makes perfect sense to ask what the effect would be if you change X from 0 to 1 (X=1).
But if your indicator variable is already 1 (X=1), it does not really make sense to ask what the effect is if you change it one unit to 2 (X=2) since X=2 is not a possible value of the variable. But it is, of course, reasonable to ask what the effect would be if you went the other way and changed it from X=1 to X=0.
So, for example, suppose I have an X-variable that is an indicator variable with a value X=1, and the odds ratio says that a one-unit increase in the variable would give me a 20% increase in the odds (that is, the “odds ratio” is 1.2). Then I know that switching it the other way (from X=1 to X=0) would divide the odds by 1.2, which is about a 17% decrease in the odds (since 1/1.2 ≈ 0.833).
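A quick numeric check of the reversal factor (note that dividing the odds by 1.2 is a 16.7% decrease, not a 20% one):

```python
odds_ratio = 1.2
reverse_factor = 1 / odds_ratio  # factor applied when going from X=1 back to X=0
pct_change = (reverse_factor - 1) * 100
print(round(pct_change, 1))  # -16.7: roughly a 17% decrease in the odds
```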
I hope this helps.
Regards, StatsProf
Thank you very much for answering.
I want to confirm though, if X is an indicator variable with an odds ratio of 5.2, does that still mean that one unit increase in X gives 20% increase in the odds?
I get confused whenever the odds ratio starts from 2.0 -_-“
Since X is an indicator variable, it takes on values 0 or 1. Suppose the p0 is the probability predicted by the model when X=0 and p1 is the probability predicted by the model when X=1. (If you have more than one X, you have to keep everything else constant.) An odds ratio of 5.2 means that:
5.2 * p0/(1-p0) = p1/(1-p1)
This is an increase (when you change X from 0 to 1) of 420% in the odds. A 420% increase corresponds to a factor of 5.2.
One can solve this equation to see that p1 = 5.2*p0/(1+p0*(5.2-1)) and can then experiment with the probability p0 to see what kind of change in the probability a particular odds ratio gives.
An odds ratio of 5.2 corresponds to the following:
A change from p0 = 0.304845 to p1 = 0.695155 (this is symmetric around 0.5)
A change from p0 = 0.1 to p1 = 0.366197
A change from p0 = 0.9 to p1 = 0.979079
You can use the equation I gave you above to calculate the change from p0 to p1 for any starting p0. If you have an odds ratio that is something other than 5.2, just substitute it in for the 5.2.
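That substitution is easy to do as a small function; here is a sketch that reproduces the three changes above (the function name is mine):

```python
# Apply an odds ratio to a starting probability p0 and return the new
# probability p1 = OR * p0 / (1 + p0 * (OR - 1)).
def new_probability(p0, odds_ratio):
    return odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))

print(round(new_probability(0.1, 5.2), 6))       # 0.366197
print(round(new_probability(0.9, 5.2), 6))       # 0.979079
print(round(new_probability(0.304845, 5.2), 4))  # 0.6952 (the symmetric case)
```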
This illustrates the difficulty of interpreting the effects of changes in the X-variables in logistic regression. The effect of the odds ratio on the probability is different depending on the starting value p0. The logistic regression model is highly non-linear, which makes interpretation difficult.
One of the things on my “to do” list is to create an automatic calculator that lets you experiment with the odds ratio and probability p0 to see what happens.
Again, thanks for your question and I hope I have answered it.
StatsProf
Thank you for the reply, StatsProf! It’s helped me a lot with my academic paper.
But now I have a problem with my significant independent variables. I have one with 11 categories (wowzers). In SPSS, only 10 of these are shown with their odds ratios because the 11th becomes the reference category. I am wondering how to compute for the odds ratio of this reference category T_T
Your question has caused me to think hard about what one might want to do in assessing categorical variables such as one that is coded with a dummy variable. I think I will write a post to address this, but I will give you a hint about what I am thinking here. (I can’t use real math notation here in replies to comments.)
First, I will use only 4 categories for simplicity. You will have to extend this to 11.
Suppose the categories are C1, C2, C3, and C4. There will only be 3 dummy variables. I will choose C1 to be the base case. Thus, there will be dummy variables corresponding to C2, C3, and C4, which I will call D2, D3, and D4. Each of these dummy variables will take on a value of 1 for its category and 0 otherwise. “D1” is not used because the C1 category corresponds to D2, D3, and D4 all taking the value 0.
Now odds ratios are about what happens to the odds (i.e., p/(1-p)) when the X-variable is changed by one unit. This doesn’t quite apply in the case of dummy variables because of the special role of the base case (as your question points out). So what to do?
Well, it seems to me that in the case of a categorical variable, it does not make a lot of sense to think about changing the X by one unit. But it does make sense to think about what happens to the odds when you change the category.
So I am going to think about trying to summarize what happens to p/(1-p) when the category changes. I am going to think about summarizing this with a matrix that shows what happens for all possible changes. The left-hand side (rows) will be the “From” side and the top (columns) will be the “To” side. Like this:
Odds Ratios for Category Changes:

From\To    C1    C2    C3    C4
C1          1     ?     ?     ?
C2          ?     1     ?     ?
C3          ?     ?     1     ?
C4          ?     ?     ?     1

The diagonals will be 1 because changing from CX to CX does not change p/(1-p).
Now the odds ratios given by your computer output give you the first row, because they represent what happens when you change from the base case (C1 here) to each of the other cases. So the matrix becomes:

From\To    C1    C2        C3        C4
C1          1    OR(D2)    OR(D3)    OR(D4)
C2          ?    1         ?         ?
C3          ?    ?         1         ?
C4          ?    ?         ?         1
Here OR(D2) denotes the odds ratio corresponding to the dummy variable D2.
The first column is easy: since going from C1 to C2 is a factor of OR(D2), going the other way (from C2 back to C1) is a factor of 1/OR(D2), and similarly for C3 and C4.

So here is the matrix so far:

From\To    C1          C2        C3        C4
C1          1          OR(D2)    OR(D3)    OR(D4)
C2          1/OR(D2)   1         ?         ?
C3          1/OR(D3)   ?         1         ?
C4          1/OR(D4)   ?         ?         1
Now I can think of the remainder of the elements of the matrix as follows. If I want to change from C2 to C3, then I first go back to C1 and then from there to C3. So the odds ratio is OR(D3)/OR(D2).

So here is the final matrix that shows the odds ratios for all possible category changes:

From\To    C1          C2              C3              C4
C1          1          OR(D2)          OR(D3)          OR(D4)
C2          1/OR(D2)   1               OR(D3)/OR(D2)   OR(D4)/OR(D2)
C3          1/OR(D3)   OR(D2)/OR(D3)   1               OR(D4)/OR(D3)
C4          1/OR(D4)   OR(D2)/OR(D4)   OR(D3)/OR(D4)   1
This matrix should give you the kind of thing you are looking for.
Just to be sure you know how to read it, if I want the odds ratio for changing from C2 to C4, I use row 2 and column 4 and see that the odds ratio is OR(D4)/OR(D2). This is the multiplier that will show the impact on the odds p/(1-p) when I change from category C2 to C4.
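The whole matrix can be generated mechanically from the reported odds ratios. Here is a sketch with made-up odds-ratio values (OR(D2)=2, OR(D3)=3, OR(D4)=4; the base case C1 gets an implicit odds ratio of 1):

```python
# Odds ratio for each category's dummy variable; the base case C1 is 1.
# These numbers are hypothetical, purely for illustration.
odds_ratios = {"C1": 1.0, "C2": 2.0, "C3": 3.0, "C4": 4.0}

# Entry (from_cat, to_cat) of the matrix is OR(D_to) / OR(D_from)
matrix = {(f, t): odds_ratios[t] / odds_ratios[f]
          for f in odds_ratios for t in odds_ratios}

print(matrix[("C2", "C4")])  # OR(D4)/OR(D2) = 4/2 = 2.0
print(matrix[("C3", "C3")])  # diagonal entries are always 1.0
```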
Note: I have never seen this done before, but your question stimulated me to think hard about what you probably actually want.
You should try to check the algebra and make sure you agree.
I think that this is an interesting enough idea that I will turn it into a post where I can use real math notation.
Regards,
StatsProf