This article discusses making predictions using logistic regression. It is part 4 of a five-part series focused on understanding and interpreting the logistic regression coefficient output table. A brief summary of the focus of each of the five parts follows.

**Understanding Logistic Regression Coefficient Output** (5-part series)

- Part 1: How the logistic regression coefficient table compares to the corresponding table from least-squares regression.
- Part 2: Coefficient table use #1 — determining what variables in the logistic regression matter.
- Part 3: Coefficient table use #2 — assessing the impact of each of the X-variables on the dependent variable.
- Part 4: Coefficient table use #3 — predicting the probability that the dependent variable is 1.
- Part 5: Coefficient table use #4 — assessing the uncertainty in the regression coefficients.

## Coefficient Table Use #3: Making Predictions

Making predictions in logistic regression is very similar to making predictions in least-squares regression (click here for a review of prediction in least-squares regression). All you do is plug the X-variables into the logistic regression equation specified by the regression coefficient estimates that you get from the coefficient table output. There are two differences, however. First, the prediction given by logistic regression is of the probability that the dependent variable is 1 (that is, $\Pr(Y=1)$). This is slightly different from predicting the value of $Y$, since $Y$ is either 0 or 1 and the probability will be a number in between, such as 0.75. Second, the prediction equation in logistic regression is more complicated than for regular linear least-squares regression.

As discussed previously, logistic regression fits a linear equation to the log odds:

$$\ln\left(\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\right) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$$

Here $b_0, b_1, \ldots, b_k$ are the estimates of the regression coefficients from the coefficient table output. So if we want to calculate $\Pr(Y=1)$ directly, we need to solve this equation for $\Pr(Y=1)$. The solution is:

$$\Pr(Y=1) = \frac{e^{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k}}{1 + e^{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k}}$$

I am not going to derive this equation here, but if you want all of the details, you can see them here. Otherwise, just accept that this equation shows how to calculate the predicted probability from the X-values together with the estimates of the regression coefficients.
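To make the calculation concrete, here is a minimal Python sketch of the prediction step: form the linear combination of the X-values (the log odds), then convert it to a probability. The function name `predict_probability` and the coefficient values are hypothetical and chosen only for illustration; they are not the Kid Creative estimates.

```python
import math

def predict_probability(intercept, coefs, x):
    """Return Pr(Y = 1) from coefficient estimates and X-values.

    coefs and x are dicts keyed by variable name (hypothetical layout).
    """
    # Log odds: b0 + b1*x1 + ... + bk*xk
    log_odds = intercept + sum(coefs[name] * x[name] for name in coefs)
    # Pr(Y = 1) = exp(z) / (1 + exp(z)); the two branches are algebraically
    # identical and are split only to avoid overflow for large |z|.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    ez = math.exp(log_odds)
    return ez / (1.0 + ez)

# Hypothetical coefficients, for illustration only
intercept = -1.0
coefs = {"x1": 0.5, "x2": 2.0}
x = {"x1": 2.0, "x2": 0.0}

p = predict_probability(intercept, coefs, x)
print(p)  # log odds = -1.0 + 0.5*2.0 + 2.0*0.0 = 0, so p = 0.5
```

With real output, `intercept` and `coefs` would come from the "Estimate" column of the coefficient table, and `x` would hold the values for the person being predicted.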

I am now going to show you an example using the “Kid Creative” data. First, we will need the regression coefficients from the coefficient output table shown below (from the column labeled “Estimate”):

Using these estimated logistic regression coefficients, the prediction equation is

Now suppose I wanted to predict the probability that the following person buys the Kid Creative magazine subscription:

- Income: Income = 58000
- Gender Female: IsFemale = 1
- Married: IsMarried = 1
- College Educated: HasCollege = 1
- Not a Professional: IsProfessional = 0
- Not Retired: IsRetired = 0
- Employed: Unemployed = 0
- Eight years of Residency in Current City: ResLength = 8
- Dual Income: Dual = 1
- Does not have Children: Minors = 0
- Owns Home: Own = 1
- Lives in a house: House = 1
- Race is white: White = 1
- First language is English: English = 1
- Has not previously purchased a children’s magazine: PrevChild = 0
- Has previously purchased a parenting magazine: PrevParent = 1

It turns out that the variable values that I used in making this prediction correspond to observation number 184 in the KidCreative data set. This particular person happened to buy the magazine, but with the odds of buying around 60-40 (if the prediction is correct), it certainly could have gone the other way.
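The link between a predicted probability and the corresponding odds can be sketched in a few lines of Python. The value 0.60 below is an assumption: it reads "around 60-40" as a predicted probability of roughly 0.60, since the exact predicted value is not reproduced here.

```python
def odds_from_probability(p):
    """Convert a probability into odds in favor, p / (1 - p)."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Convert odds in favor back into a probability."""
    return odds / (1.0 + odds)

p_buy = 0.60  # assumed illustrative value, not the exact model output
odds_buy = odds_from_probability(p_buy)

print(f"odds in favor of buying: {odds_buy:.2f} to 1")
print(f"back to probability: {probability_from_odds(odds_buy):.2f}")
```

Odds of 1.5 to 1 are the same thing as 60-40, which is why a person with this predicted probability could quite plausibly have gone either way.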

So now I have explained how to use the output from logistic regression to make predictions about the probability of “success.” Such predictions are extremely useful, as they are often the key “ingredient” in many “data mining,” machine learning, marketing analytics, and other “big data” problems where the analysis and prediction are automated. But, of course, such predictions are also very useful in “small data” problems.

In the final part of this five-part series, I will discuss assessing the uncertainty of the regression coefficients and odds ratios.

**Questions or Comments?**

Any questions or comments? Please feel free to comment below. I am always wanting to improve this material.

Hi Prof,

Sir, I want to predict the gold price using logistic regression. I have gold price data from 2 years ago to today, and I want to know what the gold price was yesterday.

Wow, I found this really easy to follow. Thank you for that. 🙂 I do have some questions, though.

1. How do I interpret a negative probability when I’m using a dummy variable? For example, suppose in your example income and isfemale are the only variables on the right-hand side and I get a probability of -0.23. Is it accurate to say that “a female is less likely to buy the magazine by 0.23”?

2. If I want to see how an increase in income will affect the probability, is it correct to plug in another value for income (e.g., from 58000 to 65000)? Do I just keep other values constant?

3. To get partial effects at the average of, say, residencelength, can I simply plug in the sample means of all the x’s in the equation?

Hope you can clarify. Cheers!

Hi Professor,

Thanks for sharing this article since it’s really helpful. I have a question regarding how to calculate probability with logistic regression including categorical variables:

1. As illustrated in the example above, Pr (Y=1) = Exp(f(x))/[1+Exp(f(x))]

2. What if one or more of the variables are categorical? For example, what if the variable “income”, instead of holding actual income, has 4 categories: 1. <$50k; 2. 50k – 70k; 3. 70k – 90k; 4. 90k+?

Thanks,

Rayna

Hi Rayna,

There is a kind of logistic regression called ordered logistic regression (or sometimes ordinal logistic regression) that is specifically designed for the situation that you describe. For example, it is appropriate for analyzing responses that are on a 5 or 7 point Likert scale.

I did a quick web search for good references and did not find any that I would really recommend. You might start with the Wikipedia page (https://en.wikipedia.org/wiki/Ordered_logit), but it is rather dense and I suspect will not be very helpful. At least now you have the terms to search for on your own. Eventually I will write about ordered logistic regression on this web site, but that is some time off.

Regards,

StatsProf

Dear Prof,

Most of your articles have been extremely helpful to me in understanding logistic regression better and in simpler terms. Thank you very much!

I had a question on this specific article – In order to predict Y=1, I notice that you’ve considered all coefficients in your example math, regardless of whether they are statistically significant or not. Did you do this just for illustration purposes, or is it really that one should ignore significance while predicting an outcome?

Chetan,

Thanks for your kind comment about my articles. I am glad the material has been useful to you.

Your question is an excellent one. The short answer as to why I ran the regression using all of the X-variables whether or not the coefficients were significant is that I have yet to write articles about variable selection and model building.

Variable selection and model building is a large topic and exactly what should be done is not settled. Also, the approaches to use vary with the context of the problem and what the model will be used for. For example, if you have a large amount of data (as in “data science” contexts), then selecting the best model can (and probably should) be done with a hold-out sample (cross-validation). If I am using the model for prediction (where I am not trying to change any x-values) I am probably less concerned about the significance of the regression coefficients than if I am trying to use a particular regression coefficient from the model as a basis for changing the value of some x-variable (e.g., something like public policy arguments for or against rent-control based on the coefficient of a dummy variable).

I hope to get back to writing some of these more advanced articles soon. But here is the short answer about how to do model selection. In traditional statistics contexts (relatively small sample sizes and not too many x-variables), use the AICc or AIC measure of fit. Small values are better, and the AIC number is only useful to compare models to each other (it does not have a useful interpretation). In “big data” contexts with large numbers of observations and/or large numbers of x-variables, use hold-out samples and/or cross-validation.
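For readers who want to see how the AIC comparison works mechanically, here is a rough Python sketch using the standard definitions AIC = 2k - 2 ln(L) and AICc = AIC + 2k(k+1)/(n - k - 1), where k is the number of estimated parameters (including the intercept), n is the sample size, and ln(L) is the maximized log-likelihood. The two candidate models and their log-likelihoods are made-up numbers for illustration only.

```python
def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical fits of two candidate models to the same n = 200 observations
n = 200
model_a = {"log_likelihood": -120.0, "k": 4}
model_b = {"log_likelihood": -118.5, "k": 7}

aicc_a = aicc(model_a["log_likelihood"], model_a["k"], n)
aicc_b = aicc(model_b["log_likelihood"], model_b["k"], n)

# Smaller is better; only the comparison between models is meaningful
best = "A" if aicc_a < aicc_b else "B"
print(f"AICc A = {aicc_a:.3f}, AICc B = {aicc_b:.3f}, prefer model {best}")
```

In this made-up case the larger model fits slightly better but not by enough to justify its three extra coefficients, so the smaller model has the lower AICc.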

By the way, I have two videos on YouTube about model overfitting and cross-validation: Video Part 1 and Video Part 2

Dear Prof,

I have a similar question. When I did the binary logistic regression, I found the significance of a predictor could be affected by other independent predictor(s). For example, when I used predictors A, B and C to build the logistic regression model for the probability of an injury, all three of these predictors were statistically significant. I also did this using predictors A, B and D, and all of them were significant. But when I used A, B, C and D in the same logistic regression, only A and B were significant; the significance of C and D was gone. Could you help to explain this? Do the insignificant results in the last logistic regression suggest that C and D have no significant effects on the probability of the injury?

Thanks,

Dear StatsProf,

I work in fundraising and have developed a logistic regression model to predict the likelihood of a constituent making a gift above a certain level. The first question my coworkers asked is what the time frame is for the predicted probability. In other words, if the model suggests John Smith has a 65% chance of making a gift, they want to know if that’s within the next 2 years, 5 years, or what. The predictor variables contain very little information about time, so I don’t think I have any basis to make this qualification.

The following approach has been suggested: If we want to say someone has a probability of giving within the next 3 years, we should rerun the model but restrict the data to events within the last 3 years. Likewise, if we use data from the last 2 years, then we’d be able to say someone has a probability of giving within the next 2 years.

The event we’re modeling is already pretty rare so I’d be concerned about dropping data, but apart from that, I just don’t see the logic in what was suggested to me. Does this sound like a reasonable approach to you?

Any suggestions on other ways to handle the question of time would be much appreciated. It seems like what my coworkers want is a kind of survival analysis predicting the event of making a big gift, but I’ve never done that type of analysis, so that’s just a guess.

Thanks for your time,

DC

Thanks Prof. for your time and interest. I’ll go through the material and give you feedback later.

Hi Prof,

Assuming the prediction for observation # 184 was incorrect yielding a low probability (while in fact that person happened to buy the magazine), how will you interpret the result?

In fact, I ran a regression to investigate the effect of mother’s age, weight, history of hypertension, urinary irritation and smoking behaviors on baby’s weight at birth. And after making my predictions, I realized that there were many mothers who delivered babies with low weights but the predicted probabilities of occurrence were small.

Does that mean that the model was not good fit?

How do you assess whether a logistic regression model fitted data well?

Thanks

Hi Eddy,

Your question is an excellent question. The whole issue of assessing how well the model fits generally goes under the name of “goodness of fit” in statistics. I have not yet addressed goodness of fit issues in this web site. It is on my “To Do” list, but it will be no earlier than next winter before I will be able to do much in this regard. This is mostly due to how busy I am right now, but also because “goodness of fit” in logistic regression is relatively complicated and it takes quite a bit of thought to figure out how to make it accessible to a wide audience.

So I am not going to be able to give you a very complete answer to your question. But let me see if I can perhaps give you some direction. First, you have to decide how you want to view the problem. If you are using logistic regression as a basis for classification, then you might want to assess the performance in the same way that classifiers are assessed. This would mean that you would calculate things like the sensitivity (also called recall), specificity, and precision. You can find definitions for all of these by searching. They are commonly used in medical contexts to assess the quality of medical tests. You might also use the ROC curve.
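As a rough sketch of the classification view, the following Python turns predicted probabilities into 0/1 classifications and then computes sensitivity, specificity, and precision. The 0.5 cutoff, the function name, and the small data set are all assumptions made up for illustration.

```python
def classification_metrics(y_true, p_hat, threshold=0.5):
    """Sensitivity, specificity, and precision at a given cutoff."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_hat):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1  # true positive
        elif predicted == 1 and y == 0:
            fp += 1  # false positive
        elif predicted == 0 and y == 0:
            tn += 1  # true negative
        else:
            fn += 1  # false negative
    nan = float("nan")
    return {
        # Share of actual 1s that were predicted as 1 (recall)
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Share of actual 0s that were predicted as 0
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # Share of predicted 1s that were actually 1
        "precision": tp / (tp + fp) if (tp + fp) else nan,
    }

# Made-up outcomes and predicted probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
p_hat = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]

metrics = classification_metrics(y_true, p_hat)
print(metrics)
```

Changing `threshold` trades sensitivity against specificity, and sweeping it across all values is what traces out the ROC curve.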

Now if you want to keep the perspective of logistic regression (i.e., you are predicting the probability of an event instead of a classification), then what you do depends quite a lot on how much data you have and what you intend to do with it. One approach is called the Hosmer-Lemeshow test. It basically works by looking at the predicted probabilities, sorting them, dividing them into groups (usually deciles, but this would depend on the sample size; if you have more data, you can take a finer look), and testing whether the actual proportion of “successes” is consistent with the predicted proportions using a chi-squared test. You can find a discussion of this in most serious books on logistic regression such as Applied Logistic Regression. I do not know how accessible this will be to you (depends on your background).
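The grouping idea can be sketched in Python as follows. This is a simplified illustration, not a replacement for a statistics package: it computes only the test statistic, splits the sorted observations into roughly equal-sized groups, and leaves the p-value to be looked up from a chi-squared distribution, conventionally with (number of groups - 2) degrees of freedom. The function name and the tiny data set are made up for illustration.

```python
def hosmer_lemeshow_statistic(y_true, p_hat, groups=10):
    """Sum over groups of (observed - expected)^2 / (n_g * pbar * (1 - pbar))."""
    # Sort observations by predicted probability
    pairs = sorted(zip(p_hat, y_true))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        # Slice out the g-th group of roughly equal size
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        observed = sum(y for _, y in chunk)  # actual number of successes
        expected = sum(p for p, _ in chunk)  # predicted number of successes
        mean_p = expected / n_g
        variance = n_g * mean_p * (1.0 - mean_p)
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    return statistic

# Made-up data: predictions that match the observed proportions exactly
p_hat = [0.25] * 4 + [0.75] * 4
y_good = [0, 0, 0, 1, 0, 1, 1, 1]
y_bad = [0] * 8  # no successes at all, so the predictions are too high

print(hosmer_lemeshow_statistic(y_good, p_hat, groups=2))
print(hosmer_lemeshow_statistic(y_bad, p_hat, groups=2))
```

A statistic near zero means the observed and predicted proportions agree within each group; a large value relative to the chi-squared reference distribution suggests a poor fit.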

With respect to your example, I will say that low probability events can occur, so just because a mother who is predicted with high probability to have a normal birth weight baby actually has a low birth weight baby, it does not mean that the model is wrong. What matters is whether, among mothers with similar predicted probabilities, the observed proportion of normal birth weight babies reasonably matches those predicted probabilities.

The next question, then, is what to do about it if the model really is not fitting the data very well. This is also a complicated question. One thing that you might do is look for additional explanatory variables. Or you might try including interactions between the variables you have or polynomial terms. One could even fit a piecewise linear function in one or more of the variables. All of these approaches are more complicated, however.

Another thing that occurs to me is that your Y variable could be continuous since you have the babies’ weights. Instead of modeling whether or not the baby is low birth weight (which is what I am assuming you are doing), you could use regular regression to try to predict the weight directly. Note that you would probably want to model some transformed version of the weight (sqrt(weight) or log(weight)) in order to get the weights to look like they are from a normal distribution.

So I am sorry that I have not given you a more direct and helpful answer. Adding the material you are requesting to this web site is an objective of mine, but it will be time consuming.

Regards,

StatsProf