The z-value is the regression coefficient divided by its standard error. It is also sometimes called the z-statistic. It is usually given in the third column of the logistic regression coefficient table output. Thus, in the example below, the z-value for the regression coefficient for ResidenceLength is 1.79.
If the z-value is too big in magnitude (i.e., either too positive or too negative), it indicates that the corresponding true regression coefficient is not 0 and the corresponding X-variable matters. A good rule of thumb is to use a cut-off value of 2, which approximately corresponds to a two-sided hypothesis test with a significance level of 0.05. So, for the ResidenceLength variable, the z-value is 1.79, which is not large enough to provide strong evidence that ResidenceLength matters.
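The computation is just a division; here is a short sketch using the ResidenceLength coefficient and standard error from this example:

```python
# Sketch: computing the z-value for a logistic regression coefficient.
# Numbers are the ResidenceLength row of the example coefficient table.
coef = 0.024680      # estimated regression coefficient
se = 0.013800        # its standard error
z = coef / se
print(round(z, 2))   # 1.79 -- below the rule-of-thumb cut-off of 2
print(abs(z) > 2.0)  # False
```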
Note: The relationship between the regression coefficient, its standard error, the z-value, and the p-value is virtually identical in both logistic regression and regular least-squares regression. So if you understand this in regular regression, you also understand it in logistic regression.
In statistics, the letter “Z” is often used to refer to a random variable that has a standard normal distribution. A standard normal distribution is a normal distribution with expectation 0 and standard deviation 1. This is the normal distribution that is generally tabulated in the back of any basic statistics book.
Because of this, the term “z-value” is often used to refer to the value of a statistic that has a standard normal distribution. Sometimes it is also used to refer to percentile points from the standard normal distribution that are used to compare to the value of a statistic. For example, one might refer to “the z-value corresponding to a 95% confidence interval” (which would be 1.96).
In basic univariate statistics, z-statistics and z-values usually come about as a result of standardizing a statistic such as the sample mean \(\bar{X}\) or sample proportion \(\hat{p}\). Standardizing a statistic means subtracting its expected value and then dividing by its standard error (the standard error of a statistic is its standard deviation). The leading example of this from basic statistics would be a z-statistic derived from the sample mean:

\[ z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
Here \(\bar{X}\) is the statistic to be standardized (the sample mean), \(\mu\) is its expectation (which, for the sample mean, is the same as the population mean), and the standard deviation of the statistic is \(\sigma/\sqrt{n}\). Here \(\sigma\) is the population standard deviation, and the formula comes about because of the relationship between the standard deviation of a sample mean and the population standard deviation. Finally, the statistic has a normal distribution as the sample size gets large because of the central limit theorem.
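To make this concrete, here is a small sketch with made-up numbers (population mean 50, population standard deviation 10, and a sample of n = 100 with sample mean 52 — all hypothetical):

```python
import math

# Sketch: standardizing a sample mean into a z-statistic.
# All numbers here are hypothetical, for illustration only.
xbar, mu, sigma, n = 52.0, 50.0, 10.0, 100
z = (xbar - mu) / (sigma / math.sqrt(n))
print(z)  # 2.0
```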
Hopefully, the above reminds you about enough of your basic statistics that using these ideas in the context of logistic regression will make sense.
So where do z-values come about in logistic regression? They primarily come about as a result of standardizing the logistic regression coefficients when testing whether or not the individual X-variables are related to the Y-variable. For example, consider the coefficient table output from the logistic regression in the “Kid Creative” example I discussed in the post Understanding Logistic Regression Output: Part 2 — Which Variables Matter.
You will see that the z-values are given in the third column of numbers in the table. These z-values are computed as the test statistic for the hypothesis test that the true corresponding regression coefficient is 0. (Note: The p-values computed from the z-values are given in the 4th column of numbers in the regression coefficient output table. I generally do not look at the z-values, but rather use the p-values.)
More specifically, suppose we want to determine if an X-variable matters (that is, has a significant relationship to the Y-variable). We determine this by testing the null hypothesis that the corresponding regression coefficient \(\beta_i\) is 0. In hypothesis testing, we assume the null hypothesis is true, and then see if the data provide evidence against it. So in this case, we assume \(\beta_i\) is 0. That is, we assume the expectation of the fitted regression coefficient \(\hat{\beta}_i\) is 0. So we standardize the regression coefficient as follows:

\[ z = \frac{\hat{\beta}_i - 0}{SE(\hat{\beta}_i)} = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \]
Note that there is no closed-form formula for \(SE(\hat{\beta}_i)\). It is computed as part of the solution to a non-linear system of equations.
So, for example, consider the ResidenceLength regression coefficient in the coefficient output table above. For this variable \(\hat{\beta} = 0.024680\) and \(SE(\hat{\beta}) = 0.013800\), so the z-value is \(0.024680/0.013800 = 1.788\). So the value in the third column of numbers is 1.788.
How do we interpret the z-values? As a rough rule of thumb, if the absolute value of the z-value is bigger than 2.0, the variable is significant (which means that there is statistical evidence that it is related to the Y-variable). This gives a rough hypothesis test with a significance level of about 0.05.
More precisely, in the hypothesis test, select a significance level such as \(\alpha = 0.05\). Determine the corresponding critical value for the test. This will depend on whether the hypothesis test is one-sided or two-sided. If it is one-sided, the critical value will be the upper \(\alpha\) percentage point of the standard normal distribution (generally referred to as \(z_{\alpha}\)). If it is a two-sided test (most common), then the critical value is the upper \(\alpha/2\) percentage point (generally referred to as \(z_{\alpha/2}\)). The absolute value of the z-value is then compared to the appropriate critical value to determine if the test is significant. That is, the regression coefficient is significantly different from 0 if:

\[ |z| > z_{\alpha} \quad \text{(one-sided test)} \]

or

\[ |z| > z_{\alpha/2} \quad \text{(two-sided test)} \]
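As a quick check of these critical values, here is a sketch using only Python's standard library (the z-value 1.79 is the ResidenceLength example from above):

```python
from statistics import NormalDist

# Sketch: normal critical values at significance level alpha = 0.05.
alpha = 0.05
z_one_sided = NormalDist().inv_cdf(1 - alpha)      # upper alpha point
z_two_sided = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point
print(round(z_one_sided, 3))  # 1.645
print(round(z_two_sided, 3))  # 1.96

z = 1.79  # the ResidenceLength z-value
print(abs(z) > z_two_sided)   # False: not significant in a two-sided test
```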
In this article, I address how much data you need in order to run standard logistic regression analyses.
As is the case in essentially all statistical analysis, the amount of data you need in logistic regression depends on the number of parameters that you are trying to estimate. In regression (both logistic regression and regular least-squares regression), the number of parameters you are trying to estimate depends on the number of independent X-variables in the model. That is, there is a relationship between the number of observations and the number of X-variables. So the sample size issue can be viewed in two ways. The first is: given a number of X-variables, how many observations do I need? The second is: given a number of observations, how many X-variables can I use?
In logistic regression (in comparison to regular least-squares regression), there is an additional issue created by the fact that the dependent Y-variable is dichotomous. In regular least-squares regression, when we talk about required sample size it is in the context of a situation where all of the Y-values will be different. This has to be the case in least-squares regression if the error term really follows a normal distribution. In practice, you may have some stacking up of values (because of rounding, for example), but almost all of the Y-values will be different.
In logistic regression, since the Y-values only take on the values 0 or 1, there obviously will be a great deal of stacking up of the values. Thus, how many X-variables I can fit also involves how “spread out” the Y-values are. This is captured by looking at the minimum of the number of 0’s and the number of 1’s.
Specifically, suppose \(n_1\) is the number of observations with \(Y = 1\) (the “successes”) and \(n_0\) is the number of observations with \(Y = 0\) (the “failures”).
What we need in logistic regression is for \(\min(n_0, n_1)\) to be about 10 times the number of parameters.
In logistic regression, the number of parameters is one more than the number of X’s (because there is one additional parameter for the intercept). So if there are \(k\) X-variables, you need

\[ \min(n_0, n_1) \ge 10(k + 1). \]

Please note that this is a rule of thumb. As such, it is an approximation which may not work in all circumstances.
Here is an example. Suppose that I have 436 observations and 78 were “successes” with the rest “failures” (i.e., 358 failures). Then \(\min(n_0, n_1) = \min(358, 78) = 78\). This means that I can fit \(78/10 - 1 = 6.8\) X-variables. So my model can have 6 to 7 X-variables.
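Here is the rule of thumb as a short sketch, using the numbers from this example:

```python
# Sketch of the sample-size rule of thumb: min(#0's, #1's) should be at
# least about 10 times the number of parameters (k X-variables + intercept).
n, successes = 436, 78
failures = n - successes        # 358
m = min(successes, failures)    # 78
max_x_vars = m / 10 - 1         # solve m >= 10 * (k + 1) for k
print(failures, m)              # 358 78
print(round(max_x_vars, 1))     # 6.8, so 6 to 7 X-variables
```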
For more information, see the book Applied Logistic Regression (Wiley Series in Probability and Statistics). This topic is addressed in Chapter 10 (“Special Topics”) in Section 10.5 (“Sample Size Issues When Fitting Logistic Regression Models”).
This article discusses using the logistic regression coefficient table output to assess uncertainty. It is the last part of a 5-part series focused on understanding and interpreting the logistic regression coefficient table. Click here to see a brief summary of the focus of each of the 5 parts.
In logistic regression, assessing the uncertainty in the estimated coefficients is virtually the same as for least-squares regression (click here for a review of assessing uncertainty in least-squares regression). In both logistic regression and least-squares regression, the regression coefficient table will include a column for the regression coefficients followed by a column of standard errors, then by a column of test statistics, and finally a column of p-values. The table below shows the coefficient table output for the Kid Creative regression (recall that what we are modeling is the probability of buying the Kid Creative magazine).
The standard errors can be used to construct confidence intervals for the regression coefficients. It is not my intention to repeat a course on basic statistics here, so I am not going to either derive or be precise about these confidence intervals. However, roughly speaking, going plus or minus 2 times the standard error from the regression coefficient gives approximately a 95% confidence interval for the coefficient.
For example, for Residence Length, the regression coefficient is 0.024680. The next column gives the standard error of the regression coefficient, which is 0.013800. Thus an approximate 95% confidence interval for the Residence Length regression coefficient is:

\[ 0.024680 \pm 2 \times 0.013800 = (-0.00292,\ 0.05228) \]
This means that the regression coefficient for Residence Length could be anywhere from \(-0.00292\) to \(0.05228\) (with 95% confidence).
As I explained in Part 3 of this series, we often use the odds-ratio, which is the exponential of the regression coefficient (i.e., \(e^{\hat{\beta}}\)), to help to interpret the meaning of the regression coefficient. The odds-ratio for the Residence Length coefficient, as shown in the coefficient table, is 1.0250. This means that there is a 2.5% increase in the odds of buying the Kid Creative magazine associated with each additional year of residence.
We can also compute the odds-ratios corresponding to the ends of the confidence interval. These odds-ratios will give us an equivalent confidence interval for the odds ratio. So continuing the example using Residence Length, the odds ratios corresponding to the ends of the confidence interval are \(e^{-0.00292} = 0.99708\) and \(e^{0.05228} = 1.05367\).
Thus, the interval \((0.99708,\ 1.05367)\) is an approximate 95% confidence interval for the odds ratio. This means that there could be anywhere from a 0.292% decrease to a 5.367% increase in the odds of buying the Kid Creative magazine associated with each additional year of residence.
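The confidence-interval arithmetic above can be sketched in a few lines:

```python
import math

# Sketch: approximate 95% CI for the ResidenceLength coefficient and the
# corresponding interval for the odds ratio (numbers from the example table).
coef, se = 0.024680, 0.013800
lo, hi = coef - 2 * se, coef + 2 * se
print(round(lo, 5), round(hi, 5))                      # -0.00292 0.05228
print(round(math.exp(lo), 5), round(math.exp(hi), 5))  # 0.99708 1.05367
```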
I have now discussed the main way that the logistic regression coefficient table output is used to assess uncertainty. You may recall that when I discussed the same issue (assessing uncertainty) in my review of least-squares regression, I briefly touched on computing the uncertainty of predictions (prediction intervals). If you want to review that discussion you can see it by clicking here.
In logistic regression, it does not make sense to create a prediction interval for a new observation in the same way that it does in regular least-squares regression. The reason is simply that we know that the outcome of the dependent Y-variable is either 0 or 1 (and not any other value) for any set of values of the X-variables. This, of course, is because the Y-variable is binary in logistic regression. What we don’t know is the probability. We could think about computing a confidence interval for this probability (which could be done), but not really for the outcome of Y (or at least not in any kind of a way that parallels what is done in least-squares regression).
The point of this last paragraph is simply this: In logistic regression analysis, we don’t have to worry about prediction intervals. People do not try to compute anything when using logistic regression that is an analog to or resembles prediction intervals in regular least-squares regression. Further, logistic regression software generally does not have any way to output prediction-type intervals.
This article concludes my discussion of interpreting and using the coefficient table output that is produced by logistic regression software. The coefficient table is the most important and useful part of the logistic regression output. But we do have another huge topic to address, one that is almost as big as the coefficient table. That is assessing how well the logistic regression model fits the data. This issue is called “goodness-of-fit.” It is probably the most important remaining topic that I will address as I continue to develop the major ideas of logistic regression.
This article discusses making predictions using logistic regression. It is part 4 of a five-part series focused on understanding and interpreting the logistic regression coefficient output table. Click here to see a brief summary of the focus of each of the 5 parts.
Making predictions in logistic regression is very similar to making predictions in least-squares regression (click here for a review of prediction in least-squares regression). All you do is plug the X-variables into the logistic regression equation specified by the regression coefficient estimates that you get from the coefficient table output. There are two differences, however. First, the prediction given by logistic regression is of the probability that the dependent variable is 1 (that is, \(p = P(Y = 1)\)). This is slightly different from predicting the value of \(Y\), since \(Y\) is either 0 or 1 and the probability will be a number in the middle like, for example, 0.75. Second, the prediction equation in logistic regression is more complicated than for regular linear least-squares regression.
As discussed previously, logistic regression fits a linear equation to the log odds:

\[ \log\left(\frac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k \]
Here \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\) are the estimates of the regression coefficients from the coefficient table output. So if we want to calculate \(p\) directly, we need to solve this equation for \(p\). The solution is:

\[ p = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}} \]
I am not going to derive this equation here, but if you want all of the details, you can see them here. Otherwise, just accept that this equation does show how to calculate the predicted probability \(p\) from the X-values together with the estimates of the regression coefficients.
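As a sketch of how this prediction equation can be computed in code — the coefficients and X-values below are made up for illustration, not the actual Kid Creative estimates:

```python
import math

# Sketch: predicted probability from logistic regression coefficients.
def predict_prob(betas, xs):
    """p = e^L / (1 + e^L), where L = b0 + b1*x1 + ... + bk*xk."""
    L = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(L) / (1 + math.exp(L))

# Hypothetical coefficients [b0, b1, b2] and X-values [x1, x2]:
print(round(predict_prob([-1.0, 0.5, 2.0], [1.0, 0.8]), 3))  # 0.75
```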
I am now going to show you an example using the “Kid Creative” data. First, we will need the regression coefficients from the coefficient output table shown below (from the column labeled “Estimate”). Using these estimated logistic regression coefficients, the prediction equation is
Now suppose I wanted to predict the probability that the following person buys the Kid Creative magazine subscription:
It turns out that the variable values that I used in making this prediction correspond to observation number 184 in the KidCreative data set. This particular person happened to buy the magazine, but with the odds of buying around 60-40 (if the prediction is correct), it certainly could have gone the other way.
So now I have explained how to use the output from logistic regression to make predictions about the probability of “success.” Such predictions are extremely useful as they are often the key “ingredient” in many “data mining,” machine learning, marketing analytics, and other “big data” problems where the analysis and prediction are automated. But, of course, such predictions are also very useful in “small data” problems as well.
In the final part of this five-part series, I will discuss assessing the uncertainty of the regression coefficients and odds ratios.
Any questions or comments? Please feel free to comment below. I am always wanting to improve this material.
I have had to implement a very simple “captcha” field because of spam comments, so be a bit careful about that. Enter your answer as a number (not a word). Also, save your comment before you submit it in case you make a mistake. You can use cntrl-A and then cntrl-C to save a copy on the clipboard. Particularly if your comment is long, I would hate for you to lose it.
As discussed previously, in logistic regression the log odds is modeled as a linear function of the X-variables. That is,

\[ \log\left(\frac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k \]
To solve this equation for \(p\), we first apply the exponential function to both sides of the equation:

\[ e^{\log\left(\frac{p}{1-p}\right)} = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} \]
Applying the exponential function leaves the right hand side of the above equation as

\[ e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} \]
Also remember that “log” is the natural logarithm, so the exponential function is its inverse (i.e., \(e^{\log(x)} = x\)). Thus, the left hand side is

\[ e^{\log\left(\frac{p}{1-p}\right)} = \frac{p}{1-p} \]
Thus, after exponentiating both sides, the logistic regression equation becomes:

\[ \frac{p}{1-p} = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} \]
Next multiply both sides by \(1-p\),

\[ p = (1-p)\, e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} \]
and then “break up” the \((1-p)\) term,

\[ p = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} - p\, e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} \]
Now move the \(p\, e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}\) term (the last term on the right-hand side) over to the left-hand side by adding it to both sides:

\[ p + p\, e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} \]
Next, factor out the \(p\),

\[ p\left(1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}\right) = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k} \]
Finally, divide both sides by \(1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}\) to get \(p\):

\[ p = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}} \]
This is the equation for \(p\) that you will see in multiple articles on this website.
There is one other form of this equation that is commonly used. It is obtained by multiplying the top and bottom of the right hand side of the equation for \(p\) by \(e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k)}\). Since \(e^{x} e^{-x} = 1\), this gives

\[ p = \frac{1}{e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k)} + 1} \]
The terms in the denominator are customarily written in the opposite order. So the second form of the equation for \(p\) is

\[ p = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k)}} \]
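If you want to convince yourself that the two forms agree, a quick numerical check (with an arbitrary value standing in for the linear part) is:

```python
import math

# Sketch: checking numerically that the two forms of the equation for p agree.
L = 0.73  # any value of the linear part b0 + b1*x1 + ... + bk*xk
p1 = math.exp(L) / (1 + math.exp(L))  # first form
p2 = 1 / (1 + math.exp(-L))           # second form
print(abs(p1 - p2) < 1e-12)           # True
```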
I have probably put in too many steps in the derivation above, but I wanted it to be accessible to almost everyone, even if your algebra skills are a little rusty.
In this article, I am going to show you why the odds ratio for an X-variable in a logistic regression is simply the exponential of the regression coefficient. This article just contains “the math” and no interpretation. A discussion of interpretation of the odds ratio in logistic regression can be found here. A more basic discussion, that includes definitions of “odds” and “odds ratio,” can be found here.
Before I begin, I want to remind you of a basic property of exponents from algebra. Specifically, when two like terms are multiplied together, the exponents add. For example,

\[ e^{a} e^{b} = e^{a+b} \]
In what follows, I will actually be going the other way: \(e^{a+b} = e^{a} e^{b}\).
As I expect you know, in logistic regression, the log odds is a linear function of the X’s:

\[ \log(\text{odds}) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k \]
The odds, then, can be found by exponentiating both sides:

\[ \text{odds} = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k} \]
What we want to know is what happens to the odds when we add 1 unit to one of the X’s. We will assess the effect of this change using the odds ratio.
So let’s suppose that we are considering a 1-unit change in \(X_1\). In that case, the new odds are given by

\[ \text{odds}_{\text{new}} = e^{\hat{\beta}_0 + \hat{\beta}_1 (X_1 + 1) + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k} \]
Note that I have used a subscript of “new” on the odds to show that these are the “new” odds, the odds that correspond to adding 1 unit to \(X_1\).
I am now going to rearrange the right hand side of this equation, first by multiplying out \(\hat{\beta}_1 (X_1 + 1)\) and then by pulling out the \(e^{\hat{\beta}_1}\) term using the exponent formula given above. Thus,

\[ \text{odds}_{\text{new}} = e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k} = e^{\hat{\beta}_1}\, e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k} \]
We can now use this formula to compute the odds ratio:

\[ \text{odds ratio} = \frac{\text{odds}_{\text{new}}}{\text{odds}} = \frac{e^{\hat{\beta}_1}\, e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}}{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}} \]
Now notice that the \(e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}\) in both the top and the bottom divides out, so we are left with

\[ \text{odds ratio} = e^{\hat{\beta}_1} \]
Thus, the odds ratio corresponding to a 1-unit change in an X-variable is just the exponential of that X-variable’s regression coefficient.
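A quick numerical check of this derivation, using made-up coefficients and X-values:

```python
import math

# Sketch: verify that adding 1 to X1 multiplies the odds by e^b1.
# All coefficients and X-values here are hypothetical.
b0, b1, b2 = -2.0, 0.4, 1.1
x1, x2 = 3.0, 0.5

odds = math.exp(b0 + b1 * x1 + b2 * x2)
odds_new = math.exp(b0 + b1 * (x1 + 1) + b2 * x2)
print(abs(odds_new / odds - math.exp(b1)) < 1e-12)  # True
```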
In logistic regression, as in least-squares regression, you often want to try to assess the effects of the independent X-variables on the dependent Y-variable. In logistic regression, however, things are somewhat more complicated than in least-squares regression, as I explain in this article. This article is Part 3 of a 5-part series focused on understanding and interpreting the logistic regression coefficient output table. Click here to see a brief summary of the focus of each of the 5 parts.
As I discussed in Part 3 of the parallel series discussing the least-squares regression coefficient table, in regular least-squares regression, assessing the effects of the X-variables is relatively straightforward. In least-squares regression, because the equation that is fit to the data is completely linear, the effect of changing an X-variable by one unit is the same everywhere and is given directly by its regression coefficient.
Things are not so simple in logistic regression. In logistic regression, we are not fitting a linear equation to the dependent variable, but rather are fitting a linear equation to the log odds:

\[ \log\left(\frac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k \]
where \(p\) is the probability of “success” (i.e., the probability that \(Y = 1\)). Things would be just as simple as for linear least-squares regression if the log odds were an intuitive quantity that we were directly interested in. If that were the case, then the interpretation of the regression coefficient would be the same as for least-squares regression: The regression coefficient shows the effect on the log odds of a one-unit change in the corresponding X-variable.
I don’t know about you, but I don’t normally think about log odds. As a result, knowing the effect of an X-variable on the log odds does not communicate much to me. Truth be known, what I think about are probabilities, so what I usually really want to know about is the effect of changing an X-variable on the probability of success (the probability that the dependent variable is 1).
The problem is that in logistic regression, unlike regular least-squares regression, the effect of changing an X-variable is not the same everywhere. Rather, it depends on the values of the other X-variables. This is because the probability \(p\) is a non-linear function of the X-variables:

\[ p = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k}} \]
This equation may be a little hard to read, but the exponent of \(e\) is the linear function \(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k\). Another, equivalent, way to write the equation above is as follows:

\[ p = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k)}} \]
Here, the exponent of \(e\) is the negative of the linear function: \(-(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_k X_k)\).
Now, at this point, I do not want you to worry at all about these equations which show you the non-linear relationship between the X’s and \(p\). I have shown you these equations only for their SHOCK VALUE. I want you to go “eeewwww” or “yuck” or “OMG” or whatever, and feel free to tune them out of your mind for now. But I do want you to understand that the relationship between the probability and the X-variables is not linear and that things are more complicated than in regular least-squares regression.
While you and I might both ultimately want to understand the effect of changing an X-variable on \(p\), we can make an intermediate step in that direction by first trying to understand the effect on the odds, rather than the log odds. Odds are not totally outside of the realm of our intuitive understanding. To interpret the effect of changing an X-variable on the odds, we consider the odds ratio, a statistic that is generally output as a part of the coefficient table output in logistic regression.
So in the remainder of this article, I am going to first show you how the reported odds ratio is calculated. Then I will go on to explain why this odds ratio is useful and how to interpret it. I will derive the odds ratio in a separate article. To begin, however, I would like to have a specific example in front of you as a basis for discussion. So here again is the coefficient table output from the Kid Creative logistic regression.
So how are the odds ratios in the last column calculated? The odds ratio is simply the exponential of the corresponding regression coefficient. That is:

\[ \text{odds ratio} = e^{\hat{\beta}_i} \]
Once again, I am not going to derive this equation here. If you want to see “the math,” click here.
Let’s check this formula just to make sure that it is completely clear. For example, the logistic regression coefficient for the variable IsFemale is 1.646000. Typing this into a calculator and hitting the \(e^x\) button or EXP() button (depending on the calculator), or typing “=EXP(1.646000)” into Excel, gives 5.1861935. This matches the corresponding number in the “Odds Ratio” column. So the odds ratios are easy to calculate from the logistic regression coefficients by using the exponential function.
But what does the “Odds Ratio” mean? The numbers in the Odds Ratio column show how the odds change for a one-unit change in the X-variable. The odds ratio shows this change in a multiplicative fashion rather than as a difference.
Again consider the IsFemale variable as an example. Suppose that we have two people that are identical with respect to all of the other variables except that one is male and one is female. Since the IsFemale variable is coded as 0 for males and 1 for females, “changing” from male to female is a one-unit change in the IsFemale variable. The Odds Ratio value for this variable is 5.1861935. This means that we expect the odds that the female buys are about 5.2 times the odds that the “equivalent” male would buy. So if the odds that the male buys happen to be 1 to 9, then the odds that the female buys are 5.2 to 9 (or 26 to 45). If the odds that a particular male buys happen to be 2 to 19, then the odds that the equivalent female buys would be 10.4 to 19 (or 52 to 95). So what I have done in these examples is to take the first number in the odds (that is, the x in x to y) and multiplied it by the odds ratio. Then, in the parentheses, I have converted the odds to integers as is customary for odds reported as x to y.
Let’s look at another example that does not involve an indicator (dummy) variable. Consider the ResidenceLength variable. The Odds Ratio for ResidenceLength is 1.025. This means that for each additional year of residence (a 1-unit change in the X-variable), the odds of buying are multiplied by 1.025. Thus, if the odds of a customer buying are 1 to 9, then if they have an additional year of residence the odds will be 1.025 to 9 (or 41 to 360, obtained by multiplying both by 40). Similarly, if the customer had one less year of residence length, the odds would be 1/1.025 = 0.976 to 9 (or 122 to 1125, obtained by multiplying both by 125).
My examples have shown you how the Odds Ratio numbers are used to adjust the odds for a one-unit change in the X-variable. Showing you such examples is necessary in order for you to really understand what the odds ratio means, but they are a bit more complicated than where one typically stops in the interpretation of odds ratios in logistic regression.
Let me show you how the Odds Ratio is typically interpreted using the odds ratio for ResidenceLength which is 1.025. What is typically done is to look past the leading 1 to the decimal (0.025 in this case) and convert it to a percent. So in this case, each additional year results in a 2.5% increase in the odds of buying.
So here is the punchline: If you subtract 1 from the odds ratio and multiply by 100 (that is, \((\text{odds ratio} - 1) \times 100\)), this shows the percentage change in the odds for a 1-unit change in the X-variable. Thus, the odds ratios allow us to see what the effect of the X-variables is on the odds. As summarized in the odds ratio, the effect of changing an X-variable is multiplicative. It is usually best thought of in terms of the percentage change in the odds for a one-unit change in the X-variable. This approach is somewhat intuitive (or at least you get used to it after a while).
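As a sketch, here is the punchline computed for the two coefficients discussed above:

```python
import math

# Sketch: from coefficient to odds ratio to percent change in the odds,
# using the IsFemale and ResidenceLength coefficients from the example.
for name, coef in [("IsFemale", 1.646000), ("ResidenceLength", 0.024680)]:
    odds_ratio = math.exp(coef)
    pct_change = (odds_ratio - 1) * 100
    print(name, round(odds_ratio, 4), round(pct_change, 1))
# IsFemale 5.1862 418.6
# ResidenceLength 1.025 2.5
```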
In this article, I will not try to make the next step of interpreting the effects of changes in the X-variables on the probabilities. I will address this in a future article. I also have not derived the formula that shows that the odds ratio for an X-variable in logistic regression is just the exponential of the corresponding regression coefficient. To see that derivation, click here.
Before closing, I want to remind you that regressions generally show association, not causation. While throughout this article I have allowed myself to talk about “the effects” of changing an X-variable, this is really not correct. What would be correct is to talk about a one-unit change in an X-variable being associated with a percentage change in the odds. Constantly talking about an “associated change” rather than “the effect” is not natural, so I have taken the incorrect liberty of possibly implying a causal relationship.
I also want to remind you that the effect of an X-variable captured by a regression coefficient is conditional on the values of the other variables (i.e., they are held constant). This fact also has important implications with respect to interpretation. These two issues (causation and conditional effect) are discussed further in the last half of the article corresponding to this one in the 5-part review of least-squares regression (click here).
In the next article (Part 4) in this series, I will continue to parallel the discussion in the review series on regular least-squares regression, and will discuss making predictions.
Most logistic regression software outputs (or can be asked to output) odds ratios along with the regression coefficients. These odds ratios are the exponential of the corresponding regression coefficient:

\[ \text{odds ratio} = e^{\hat{\beta}} \]
For example, if the logistic regression coefficient is \(\hat{\beta} = 0.25\), the odds ratio is \(e^{0.25} = 1.28\).
The odds ratio is the multiplier that shows how the odds change for a one-unit increase in the value of the X-variable. Continuing the example above, if the odds are 1 to 4 or 0.25, then increasing the X-variable by 1 unit will change the odds to \(0.25 \times 1.28 = 0.32\), or pretty close to 1 to 3.
Note: Do not confuse the odds of 0.25 with a probability — the corresponding probability is 0.20. Similarly, the odds of 0.32 correspond to a probability of 0.242.
Another way to try to interpret the odds ratio is to look at the fractional part and interpret it as a percentage change. For example, the odds ratio of 1.28 corresponds to a 28% increase in the odds for a 1-unit increase in the corresponding X-variable.
The formula is:

\[ \text{percentage change in odds} = (\text{odds ratio} - 1) \times 100 \]
As a final example, if the odds ratio is 0.94, then there is a 6% decrease in the odds for a 1-unit increase in the corresponding X-variable.
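The arithmetic in this article can be sketched in a few lines, using the example coefficient of 0.25:

```python
import math

# Sketch: coefficient -> odds ratio -> percent change in the odds.
coef = 0.25
odds_ratio = math.exp(coef)
print(round(odds_ratio, 2))           # 1.28
print(round((odds_ratio - 1) * 100))  # 28 (% increase in the odds)

# Starting odds of 0.25 (1 to 4) become:
new_odds = 0.25 * odds_ratio
print(round(new_odds, 2))                   # 0.32
print(round(new_odds / (1 + new_odds), 3))  # the corresponding probability
```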
You can determine which variables matter in logistic regression by looking at the p-values of the coefficients. This is done in exactly the same way as for regular least-squares regression, as discussed in Part 2 of the review of regular least-squares regression.
The coefficient table output for the Kid Creative logistic regression is shown below. The p-values are in the fourth numerical column, which is labeled Pr(>|z|). X-variables with p-values that are less than 5% would generally be considered to be significant, meaning that there is statistical evidence that they affect the probability that the Y-variable is 1 (i.e., that the customer buys the Kid Creative magazine). More generally, for a given significance level \(\alpha\), a variable is significant at the \(\alpha\) level of significance if the p-value is less than \(\alpha\).
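The comparison of p-values to a significance level is easy to script. In this sketch the p-values are hypothetical stand-ins, not the actual Kid Creative output:

```python
# Sketch: selecting significant variables by comparing p-values to alpha.
# These p-values are hypothetical stand-ins for the Pr(>|z|) column.
p_values = {"IsFemale": 1e-9, "ResidenceLength": 0.074, "Income": 0.002}
alpha = 0.05
significant = [name for name, p in p_values.items() if p < alpha]
print(significant)  # ['IsFemale', 'Income']
```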
If we examine the p-values in the logistic regression output above, we see that the following variables are significant at the 5% level of significance:
If we relax the significance level a bit (that is, consider p-values somewhat greater than 5%), we see that the following additional variables may also matter:
There is no statistical evidence that the following variables matter.
Before closing this article, I want to remind you of a couple of things. First, to say that an X-variable does not matter means that the corresponding regression coefficient (the corresponding \(\beta\)) is 0. Second, remember that regression generally shows association, not causation, and that the effect captured by each coefficient is conditional on the values of the other variables.
The next part of this series (Part 3) discusses assessing the impact of each of the variables.
Once the logistic regression has been “run,” the software will calculate estimates for the regression coefficients \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\).
Notes: