**From Odds to Probability**

We have all used the word “odds” to describe the probability of something. “Odds are it will rain tomorrow” means that there is a greater than 50% chance of rain. Horse racing fans and other gamblers will likely be familiar with odds and may know exactly how the term “odds” relates to “probability,” but many others do not. So I am now going to explain what it means when someone says something like “The odds are 20 to 1 …”

To be a bit more specific, suppose that I say that the odds are 20 to 1 that my car will start tomorrow morning when I try to start it to go to work. What I mean is that it is 20 times more likely that the car will start than not. Thus, for every 20 chances that the car will start there is one chance that the car will not start. This means that the probability that the car will start is $20/(20+1) = 20/21 \approx 0.952$ (the number of chances of starting over the total number of chances). And a probability of 0.952 means that, in the long run, the car will start 95.2% of the time; that is, 952 starts in 1000 tries on average.

Recognizing the pattern in the example above allows you to write down a formula that converts odds to probability. Specifically, if I say that the odds of a thing happening are $a$ to $b$ (sometimes written $a:b$), this means that the probability $p$ that the thing will happen is:

$$p = \frac{a}{a+b}$$

Note that there are many ways to state the odds that give the same probability. For example, to say that the odds that my car starts tomorrow are 20 to 1 is the same as saying the odds are 40 to 2.

Similarly, it is the same thing to say that the odds are 5 to 2 as to say that they are 2.5 to 1. To complete this second example, the probability corresponding to 5 to 2 odds is $5/(5+2) = 5/7 \approx 0.714$. This, of course, is the same as the probability corresponding to 2.5 to 1 odds: $2.5/(2.5+1) = 2.5/3.5 \approx 0.714$.
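As a quick check of these equivalences, here is a minimal Python sketch of the odds-to-probability conversion. The function name `odds_to_probability` is my own choice for illustration; it simply evaluates $a/(a+b)$.

```python
def odds_to_probability(a: float, b: float) -> float:
    """Convert odds of `a` to `b` (in favor) into a probability, a / (a + b)."""
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("odds must be non-negative and not both zero")
    return a / (a + b)


# 20 to 1 and 40 to 2 describe the same probability.
print(round(odds_to_probability(20, 1), 3))   # 0.952
print(round(odds_to_probability(40, 2), 3))   # 0.952

# 5 to 2 and 2.5 to 1 also agree.
print(round(odds_to_probability(5, 2), 3))    # 0.714
print(round(odds_to_probability(2.5, 1), 3))  # 0.714
```

Scaling both numbers in the odds by the same factor leaves the result unchanged, which is why the different ways of stating the odds give the same probability.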

It is customary, however, to state odds as a ratio of integers with all of the common factors divided out. So one would typically hear a statement like “the odds of the horse winning the race are 7 to 2” ($7:2$), rather than “the odds are 3.5 to 1” (not both integers) or “the odds are 14 to 4” (common factor of 2 which can be divided out of the ratio).

**From Probability to Odds**

So at this point, we know how to convert an odds to a probability. You can also, of course, go the other way and convert a probability to an odds.

Suppose that the probability of rain tomorrow is 15%. Following the logic of “odds” above, this means that there are 15 chances in 100 of it raining and 85 chances in 100 of it not raining. The odds, then, are 15 to 85 (or 3 to 17 if written in the conventional form).

Recognizing the pattern, if the probability of something is $p$, the odds are $p$ to $1-p$. Since $p$ and $1-p$ will generally not be integers, to write this in conventional form, we will need to scale both $p$ and $1-p$ (that is, multiply them by some number) so that they are integers and have no common factors.

Some examples will illustrate:

If $p = 0.75$, the odds are 0.75 to 0.25 or 3 to 1 in conventional form (multiplying by 4).

If $p = 0.12$, the odds are 0.12 to 0.88 or 3 to 22 in conventional form (multiplying by 25). Note that I figured this out by first multiplying by 100 (odds are 12 to 88) and then finding the common factor of 4 which I then divided out (12/4 = 3 and 88/4 = 22).
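The scaling and factoring steps in these examples can be sketched in Python with the standard library's `fractions.Fraction`, which stores a value as a ratio of integers in lowest terms. The function name `probability_to_odds` is hypothetical, and the probability is passed as a string because a float such as `0.12` is not stored exactly.

```python
from fractions import Fraction


def probability_to_odds(p: str) -> tuple:
    """Return conventional odds (a, b) for a probability given as a decimal string."""
    frac = Fraction(p)  # exact, and already in lowest terms, e.g. "0.12" -> 3/25
    if not 0 <= frac <= 1:
        raise ValueError("a probability must lie between 0 and 1")
    # Odds are p to (1 - p); multiplying both by the denominator gives integers.
    return frac.numerator, frac.denominator - frac.numerator


print(probability_to_odds("0.75"))  # (3, 1)
print(probability_to_odds("0.12"))  # (3, 22)
print(probability_to_odds("0.15"))  # (3, 17)
```

Because the fraction is in lowest terms, the two integers returned share no common factor, so the result is already in conventional form.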

Note that as far as “the math” is concerned, it is not necessary to convert odds to conventional form. So in terms of any calculations that you do, it does not matter if the odds are expressed as 5 to 2 or 2.5 to 1.

In converting probabilities to odds, it is also helpful to recognize the decimal rounding of common fractions. For example, $p = 0.33$ means odds of 33 to 67 (there are no common factors). But if we recognize that 0.33 probably is 1/3 rounded to two decimals, then the odds are 1/3 to 2/3 or 1 to 2 (multiplying by 3). An odds of 1 to 2 somehow seems simpler than an odds of 33 to 67 even though there is not likely to be any difference that matters in any calculation based on these two pairs of odds.

I will give one more example that illustrates this issue. If $p = 0.778$, the odds are 778 to 222 or 389 to 111. The number 389 is prime (only factors are 1 and 389), so it is not possible to simplify the odds 389 to 111 further.

But if you recognize that 0.778 is 7/9 rounded to 3 decimal places, then the odds become 7/9 to 2/9 or 7 to 2. This is much more appealing.
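Recognizing a rounded fraction can also be sketched in code. The snippet below uses `Fraction.limit_denominator` from the Python standard library to find the closest fraction with a small denominator. The function name `simplified_odds` and the default cutoff of 10 are my own assumptions for illustration; a different cutoff can give a different answer.

```python
from fractions import Fraction


def simplified_odds(p: str, max_denominator: int = 10) -> tuple:
    """Odds (a, b) for the simple fraction closest to the decimal string `p`."""
    exact = Fraction(p)                                # e.g. "0.778" -> 389/500
    approx = exact.limit_denominator(max_denominator)  # closest fraction with a small denominator
    return approx.numerator, approx.denominator - approx.numerator


print(simplified_odds("0.33"))   # (1, 2)
print(simplified_odds("0.778"))  # (7, 2)

# With a large cutoff nothing is simplified and the exact odds come back.
print(simplified_odds("0.778", max_denominator=1000))  # (389, 111)
```

Whether the simpler odds are appropriate depends on whether the decimal really was a rounded fraction, which the code cannot know.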

## Odds Ratios

So you are now familiar with how odds and probability are related. In order to understand the output of a logistic regression analysis, you also need to have some understanding of odds ratios.

Suppose that we are considering two random things. To make this concrete, suppose that we are considering the probability that my car starts tomorrow morning (the odds of which, in the example above, were 20 to 1) and the probability that your car starts tomorrow morning. Let's suppose that the odds that your car starts are 29 to 2.

The odds ratio is simply the ratio of these two odds:

$$\frac{20/1}{29/2} = \frac{40}{29} \approx 1.38$$

What this means is that the odds of my car starting are 38% higher than the odds of your car starting. Note that this is *not the same* as saying that the probability of my car starting is 38% higher than the probability of your car starting. I will return to this below.

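So if the odds of thing 1 are $a_1$ to $b_1$ and the odds of thing 2 are $a_2$ to $b_2$, then the odds ratio is:

$$\text{odds ratio} = \frac{a_1/b_1}{a_2/b_2} = \frac{a_1 b_2}{a_2 b_1}$$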
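A minimal Python sketch of this ratio, using exact fractions so that no rounding is involved. The function name `odds_ratio` and the argument names `a1`, `b1`, `a2`, `b2` are my own labels for the two pairs of odds, and the sketch assumes the odds are given as integers.

```python
from fractions import Fraction


def odds_ratio(a1: int, b1: int, a2: int, b2: int) -> Fraction:
    """Odds ratio of thing 1 (a1 to b1) relative to thing 2 (a2 to b2)."""
    return Fraction(a1, b1) / Fraction(a2, b2)


ratio = odds_ratio(20, 1, 29, 2)  # my car versus your car
print(ratio)                      # 40/29
print(round(float(ratio), 2))     # 1.38
```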

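We can also write the odds ratio in terms of probabilities. Using the formula given above to calculate the probabilities for odds of $a_1$ to $b_1$ and $a_2$ to $b_2$ gives:

$$p_1 = \frac{a_1}{a_1+b_1}, \qquad p_2 = \frac{a_2}{a_2+b_2}$$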

Since $a/b = p/(1-p)$, the odds ratio expressed in terms of the probabilities $p_1$ and $p_2$ of the two things is:

$$\text{odds ratio} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}$$
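The identity behind this step follows directly from the conversion formula $p = a/(a+b)$ given earlier. Writing it out (with $a$, $b$, and $p$ as in that formula):

$$1 - p = \frac{b}{a+b} \quad\Longrightarrow\quad \frac{p}{1-p} = \frac{a/(a+b)}{b/(a+b)} = \frac{a}{b}$$

So the odds $a$ to $b$, taken as the ratio $a/b$, can always be recovered from the probability alone.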

I will now compute the odds ratio using probabilities for the car starting example we have been discussing.

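Since the odds of my car starting are 20 to 1 and the odds of your car starting are 29 to 2, the probabilities (writing $p_1$ for my car and $p_2$ for yours) are

$$p_1 = \frac{20}{20+1} = 0.952381, \qquad p_2 = \frac{29}{29+2} = 0.935484$$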

Thus the odds ratio is

$$\frac{0.952381/(1-0.952381)}{0.935484/(1-0.935484)} = \frac{20.00}{14.50} \approx 1.38$$

(Note I used so many decimals in the probabilities in order to make sure that the answer did not differ from the odds ratio above due to rounding.)
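The same calculation can be checked in a few lines of Python, where floating-point arithmetic carries far more decimals than are needed here. The function name `odds_ratio_from_probabilities` is hypothetical.

```python
def odds_ratio_from_probabilities(p1: float, p2: float) -> float:
    """Odds ratio of thing 1 relative to thing 2, from their probabilities."""
    odds1 = p1 / (1 - p1)
    odds2 = p2 / (1 - p2)
    return odds1 / odds2


p_mine = 20 / 21   # odds of 20 to 1
p_yours = 29 / 31  # odds of 29 to 2
print(round(odds_ratio_from_probabilities(p_mine, p_yours), 3))  # 1.379
```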

As I indicated above, the correct interpretation of the odds ratio of 1.38 is that the odds of my car starting are 38% higher than the odds of your car starting. Note that we have already calculated the probabilities corresponding to the odds of 20 to 1 and 29 to 2 and they came out to be $0.952$ and $0.935$. Since

$$\frac{0.952}{0.935} \approx 1.018$$

we can see that the probability that my car starts is 2% higher than the probability that your car starts. Thus it would be very wrong to interpret the odds ratio as providing the percentage change in the probabilities. It must be interpreted in terms of the change in the odds.
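To make the contrast concrete, the sketch below computes both relative changes side by side for the car example. The function name `relative_changes` is my own; it returns the percentage change in the odds and the percentage change in the probability.

```python
def relative_changes(p1: float, p2: float) -> tuple:
    """Return (relative change in odds, relative change in probability) of p1 versus p2."""
    odds1 = p1 / (1 - p1)
    odds2 = p2 / (1 - p2)
    return odds1 / odds2 - 1, p1 / p2 - 1


odds_change, prob_change = relative_changes(20 / 21, 29 / 31)
print(f"odds:        {odds_change:.1%}")  # odds:        37.9%
print(f"probability: {prob_change:.1%}")  # probability: 1.8%
```

The two numbers differ by a factor of about 20 here because both probabilities are close to 1, where a small change in probability corresponds to a large change in odds.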

**Questions or Comments?**

As always, please leave questions or comments below. I will try to answer them and I would like to improve this material.


Hello, my name is Shah Fahad, and I am doing an MS in Economics. I am using a logistic regression model. After running the model, I am facing a multicollinearity problem in one of my most important independent variables. Can you please help me out? Someone told me to use the stepwise method, but I don’t know how to proceed with it. Can someone teach me how to solve the multicollinearity problem?

Thank you for your question.

Multicollinearity is a near linear relationship between a set of x-variables. This causes the regression fit to be unstable in certain directions (in the linear space defined by the independent variables), which results in an inflation of the estimated variances of the coefficients of the x-variables involved.

When you say that you have multicollinearity in one of your most important independent variables, you must mean that it is multicollinear with some other variable in the model. You don’t say what you mean by “important.” Is it important in the theory (for example, it is your variable of interest) or important in the regression (i.e., it has high explanatory power)? I am assuming you mean it is a variable of interest.

In any case, when you were told to use stepwise regression, what you were really being told to do is include only the most important variable among the set that is causing the multicollinearity.

Stepwise regression builds regression models either by starting with all the variables in the model and dropping them one by one, or by starting with no variables in the model and adding the most important one at each step. The idea is to end up with a model that only contains the important variables, with the unrelated variables omitted.

Stepwise regression has the tendency to overfit the data (and make you think that there are some variables that matter when they do not). Measures, such as the AIC, can be used to pick the best model (smallest is best).

You will have to understand which variables are causing the multicollinearity. Then think hard about what they mean. Multicollinearity often occurs because two or more variables are measuring the same abstract idea. Therefore the variables are closely related. Dropping one of the variables that is causing the multicollinearity may work. Other options are to average them or to look at their difference.

Multicollinearity is sometimes also caused just by some error in the creation of your x-variables. So begin by checking that the data you have in your model really is what you intended.
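One simple first check for which variables are involved is to look at pairwise correlations among the x-variables. The sketch below does this in plain Python. The data, the variable names, and the 0.9 cutoff are all made up for illustration, and a pairwise screen only catches near-linear relationships between two variables, not ones involving three or more.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(columns: dict, threshold: float = 0.9) -> list:
    """List pairs of variables whose absolute correlation reaches the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical data: income and spending move almost in lockstep; age does not.
data = {
    "income":   [30, 42, 55, 61, 78, 90],
    "spending": [28, 40, 50, 60, 75, 88],
    "age":      [45, 23, 36, 52, 29, 41],
}
for name_a, name_b, r in collinear_pairs(data):
    print(name_a, name_b, r)  # only the income/spending pair is flagged
```

For a flagged pair, the remedies mentioned above (dropping one variable, averaging the two, or using their difference) are the natural things to try next.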

Good luck.

In my analysis, the regression between a continuous variable (crop production) and a categorical variable (electricity connection) found an OR of 1.39. Thus, I wrote that households with higher crop production (OR 1.39, p=0.05) have a higher chance (39%) of electricity connection.

Do you think, this interpretation is correct? then can you refer some of the article which has followed same style

I love your notes and they are helping me greatly with my Logistical regression project.

I just want to inform you of a mistake in the paragraph about Odds.

You mention that if the odds of your car starting tomorrow are 20 to 1 then the probability of the car starting should be equal to b/(a+b) which would give 1/(20+1) = 0.0476

Just wanted to make this clear as I can see some people getting very confused over this.

Apologies, I have just realised that I am using betting odds which use the formula b/(a+b) . Everybody else seems to use the formula a/(a+b)

James,

Thanks for checking things out and I am glad that you resolved your question. If you do find any errors, please let me know.

Regards,

StatsProf

Hi there I was wondering if your services were available for hire as I require some assistance with a project. Regards rick

Rick,

Thanks for asking. I am available for consulting work. I will contact you by e-mail. I have been intending to put up a form to contact me on the web site, but have not gotten around to it yet.

StatsProf

are you available for consulting work?

Marta,

Thanks for asking. I am available for consulting work. I will contact you by e-mail.

StatsProf

I am using regression to analyze the probabilities of events in sporting matches. I can use logistic regression to find the percentage chance that a particular player will or won’t score a goal in a soccer match for example. I can also predict how many goals he/she is expected to score in a game or season. While I can do this for each player I’m curious as to what the best way of calculating the chance of each player being the first goal scorer in a match? In this case only one player can get this result and the probabilities of each player should add up to 1 (assuming a player will score)

Thank you for your question. It sounds like you are applying logistic regression in very interesting ways!

The question you ask me about, calculating the probability that a player is the first to score in a game, does not really fit into the logistic regression framework, as I think you have recognized (but see below for more discussion). The statistical models you seek go under names like “rank models” (i.e., statistical models that predict rankings) or “choice models” (statistical models that predict choices). Note that the problem you pose is exactly the same (from a statistical point of view) as calculating the probability that a horse is first in a horse race.

These models work by conceptualizing an outcome variable (not observed) for each player, and then computing the probability that a particular player’s outcome is the minimum. So, what you would model is the time until each player scores a goal in a game. The probability that the player is the first to score a goal, then, is the probability that the player has the minimum time. In horse racing, the unobserved variable is the time the horse takes to run the race. The winner, then, is the horse with the shortest time, so the probability that a particular horse wins the race is the same as the probability that its time is the minimum. In choice models (like which brand of cereal to buy), each brand generates a utility for the customer (unobserved). The customer then selects the brand with the highest utility (the maximum in this case). Note that all you observe is who wins or what brand a customer buys, so the variables capturing the times or utility are not observed (latent).
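To illustrate the "minimum time" idea in its simplest form, here is a sketch under a strong assumption that is mine, not part of the models referred to above: each player's time until scoring is independent and exponentially distributed with its own rate. In that special case the probability that a given player has the minimum time is that player's rate divided by the sum of all rates. The player names and rates are hypothetical.

```python
def first_scorer_probabilities(rates: dict) -> dict:
    """P(player has the minimum time), assuming independent exponential scoring times."""
    total = sum(rates.values())
    if total <= 0:
        raise ValueError("at least one rate must be positive")
    return {player: rate / total for player, rate in rates.items()}


# Hypothetical scoring rates in goals per 90 minutes.
rates = {"Player A": 0.6, "Player B": 0.3, "Player C": 0.3}
probs = first_scorer_probabilities(rates)
print({name: round(p, 2) for name, p in probs.items()})
# {'Player A': 0.5, 'Player B': 0.25, 'Player C': 0.25}
```

The probabilities sum to one by construction. Other distributional assumptions for the latent times lead to different formulas, which is where the rank and choice model literature comes in.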

So the most compatible method for the problem you describe is not logistic regression. Thus, these models are beyond the scope of this web site. I have given you the hints you need, however, to find appropriate methods. A search for “statistical models horse racing” on Google is likely to be very productive. This will yield quite a number of potentially interesting papers. Also searches for “statistical rank models” or “statistical choice models” should yield useful stuff. Because your question raised my curiosity about these models, I just bought the following book from Amazon: Modeling Ordered Choices: A Primer. It seems like it should deal with the correct topics. But, please note that, since I have not read this book yet, I can’t be certain that it has what you need.

One final comment. If you are not worried about your models being “theoretically” correct, then there are a lot of things you can do with logistic regression. For example, if you are interested in predicting who will score the first goal, then you can create a logistic regression model for each player (1 if they score the first goal and 0 otherwise), run it for each player in a game, and predict the player who will score the first goal by selecting the one with the highest probability. If you really need the probabilities of scoring the first goal to sum to one, then you can take the predictions from the logistic regression models, and force them to sum to one. These methods may work very well in spite of the fact that they are entirely heuristic (and would be very hard to justify as “correct”). Please note that such heuristics are commonly used in fields such as machine learning where the theoretical correctness of the approaches used is not valued as highly as if the methods seem to work in practice.
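The "force them to sum to one" step of this heuristic can be sketched as a simple rescaling. The player names and predicted probabilities below are made up; in practice they would come from the separate per-player logistic regression models.

```python
def normalize_to_one(predictions: dict) -> dict:
    """Rescale predicted probabilities so that they sum to one."""
    total = sum(predictions.values())
    if total <= 0:
        raise ValueError("predictions must have a positive sum")
    return {name: p / total for name, p in predictions.items()}


# Hypothetical first-goal predictions from separate per-player models.
raw = {"Player A": 0.20, "Player B": 0.10, "Player C": 0.10}
scaled = normalize_to_one(raw)

most_likely = max(scaled, key=scaled.get)
print(most_likely)                     # Player A
print(round(sum(scaled.values()), 6))  # 1.0
```

Rescaling does not change which player has the highest predicted probability; it only makes the set of numbers behave like a probability distribution.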

I hope this helps. Great question!

Regards,

StatsProf

Thank you for your very detailed reply.

The heuristic approach you described was what I was considering trying.

I have come across statistical modelling for horse races but had initially discounted these methods as goal scoring is not a race. But I guess when I want to estimate first goal scorer it is of course a race. The other problem is while I have data of some leagues which include exact time of goal, other data sets only include how many goals each game for each player. Though if a player were to average -for example- 3 goals in 90 minutes, that player would be expected to score every 30 minutes… If I look at my data (without exact times) this way I guess I can still use statistical rank models.

I think It will be interesting to play with both statistical rank models and the heuristic logistic regression-forced to equal 1- and to compare results.

Thanks again

In epidemiology, risk is a probability or proportion (e.g., the number who become infected divided by the number exposed). To estimate it, one must have an accurate denominator. Consider a cohort study, where a population of size N with exposure E is followed for time t and the outcome variable is occurrence of disease X. Then the risk can be calculated because a denominator N is known. In cross-sectional and case control studies the size of the population exposed is unknown, so a risk cannot be calculated. However, and here I leave out the mathematical details, in the calculation of odds ratios for the development of disease the size of the population at risk will cancel out, so it is not necessary to know the size of the population. This is generally mathematically demonstrated using simple 2×2 tables, one representing a cohort study and one representing a case control study.
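A small numerical sketch of the 2×2 table argument, using made-up counts. The first table is a hypothetical cohort; the second keeps every case but only one in ten of the non-cases, as a stylized case-control sample might. The function name `odds_ratio_2x2` is my own.

```python
from fractions import Fraction


def odds_ratio_2x2(exp_cases: int, exp_noncases: int,
                   unexp_cases: int, unexp_noncases: int) -> Fraction:
    """Odds ratio of disease for exposed versus unexposed, from a 2x2 table."""
    return Fraction(exp_cases * unexp_noncases, exp_noncases * unexp_cases)


# Hypothetical cohort: 1000 exposed and 1000 unexposed people, all followed.
cohort = (40, 960, 10, 990)
# Stylized case-control sample: all cases, but only 10% of the non-cases.
case_control = (40, 96, 10, 99)

print(odds_ratio_2x2(*cohort))        # 33/8
print(odds_ratio_2x2(*case_control))  # 33/8

# The apparent "risk" among the exposed is not preserved by the sampling.
print(Fraction(40, 40 + 960))  # 1/25
print(Fraction(40, 40 + 96))   # 5/17
```

The sampling fraction for non-cases multiplies both the numerator and the denominator of the odds ratio, so it cancels, while the proportion of cases among the exposed changes from 4% to about 29%.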

David,

Thanks so much for your comment. Since I do not work in epidemiology, I was unaware of the situation you describe. Very interesting that focusing on the odds-ratio means that you do not have to know the population size.

Because your comment piqued my interest, I spent some time researching the various things you mention in your comment. Great material for a future article on this web site on relative risk, odds ratios, and cohort, cross-sectional, and case-control studies, which could include working out the math for the 2×2 table examples you mention.

Thanks again for your comment.

StatsProf

StatProf

You’re very welcome. I will look forward to your future article.

I ended up at your website while attempting to find discussions of sample size determinations for a 3-sample test of proportions. The literature is out there, but little in the way of discussion suited for non-PhD’s (just my opinion, as an MS level statistician, albeit one who went all the way to ABD before moving on), or practical examples. Happily, my old mentor sent me something on the Wald statistic approach that made sense. Perhaps one day you can delve into sample size determinations.

David

I am doing logistic regression to determine a diagnosis of malignant or benign based on cell features (i.e. area, smoothness, texture, and concavity). My professor asked me to determine the accuracy of our logistic regression equation on our preliminary data, which includes 569 images, each with the predictor variables (cell features) and their diagnosis of either malignant (1) or benign (0). He said this should be straightforward and easy but I’m getting stuck on how to accomplish this. Thanks in advance.

I am sorry that I did not reply to your message sooner. The e-mail link that notifies me of comments was not working, so I was not aware of your question until much later. The e-mail link is working now!

One approach is to create a table of predicted success/failure versus actual success/failure. This is generally called a “confusion matrix” in machine learning. If the logistic regression model gives a predicted probability of success greater than 50%, predict a 1. If the predicted probability is less than 50%, predict a 0. Then see what the outcomes are and create a 2×2 matrix that shows the counts or proportions in each of the four cells:

| | Outcome 1 | Outcome 0 |
|---|---|---|
| Predicted 1 | $p_{11}$ | $p_{10}$ |
| Predicted 0 | $p_{01}$ | $p_{00}$ |

Observations on the diagonal are correctly classified. Observations on the off-diagonal are incorrectly classified.
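A minimal sketch of this tabulation in Python. The predicted probabilities and outcomes below are made up, and the function name `confusion_matrix` is my own. Note one assumption: a predicted probability of exactly 50% is treated as a predicted 0 here, since the rule above does not say how to handle that case.

```python
def confusion_matrix(predicted_probs, outcomes, threshold=0.5):
    """Count observations by (predicted class, actual outcome)."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for p, y in zip(predicted_probs, outcomes):
        predicted = 1 if p > threshold else 0  # exactly 0.5 counts as 0 (assumption)
        counts[(predicted, y)] += 1
    return counts


# Hypothetical predicted probabilities and actual outcomes.
probs = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
actual = [1, 1, 1, 0, 0, 0]

cm = confusion_matrix(probs, actual)
accuracy = (cm[(1, 1)] + cm[(0, 0)]) / len(actual)  # share on the diagonal
print(cm)                  # {(1, 1): 2, (1, 0): 1, (0, 1): 1, (0, 0): 2}
print(round(accuracy, 3))  # 0.667
```

The accuracy is the proportion of observations on the diagonal; dividing each count by the total gives the proportions version of the table.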

A related approach is to divide the predicted probabilities into ranges such as each decile. That is find all observations with a predicted probability of between 0.0 and 0.1, all observations with a predicted probability of between 0.1 and 0.2, and so on. For each class, then determine the actual sample proportion of successes. These proportions can then be compared to the predicted probabilities (e.g., 0.05 for the interval from 0.0 to 0.1). Various statistical tests can then be used to assess the fit.
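The binning step can be sketched as follows, again with made-up data. The function name `calibration_by_bin` is hypothetical; it groups observations into equal-width ranges of predicted probability and reports, for each non-empty range, its midpoint, the number of observations, and the observed proportion of successes.

```python
def calibration_by_bin(predicted_probs, outcomes, n_bins=10):
    """Return (bin midpoint, count, observed proportion of 1s) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted_probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # a probability of 1.0 goes in the last bin
        bins[index].append(y)
    rows = []
    for i, ys in enumerate(bins):
        if ys:
            midpoint = (i + 0.5) / n_bins
            rows.append((midpoint, len(ys), sum(ys) / len(ys)))
    return rows


# Hypothetical predicted probabilities and actual outcomes.
probs = [0.05, 0.08, 0.12, 0.55, 0.58, 0.93, 0.97]
actual = [0, 0, 0, 1, 0, 1, 1]

for midpoint, count, observed in calibration_by_bin(probs, actual):
    print(f"predicted ~{midpoint:.2f}  n={count}  observed={observed:.2f}")
```

With real data each range would need enough observations for the observed proportion to be meaningful; the handful of points here is only meant to show the mechanics.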

I am going to leave my response to the above at this point. Later, as I continue to develop this web site, I will return to the issue of assessing model fit in detail. But the above should give you some hints as to where to look to get more information addressing your question.

Regards,

StatsProf

This is a question i would like you to help me answer.

My question goes this way.

Why is it that in cross-sectional and case-control studies, you cannot use logistic regression to predict risk but you can estimate the odds ratio?

I am not sure that I fully understand your question. So if you could give me a more detailed description of the situation you are considering, I could try to give you a more specific answer.

I will take a bit of a stab at an answer, however. Risk (at least in financial situations) is measured by standard deviation (of the return on the investment). For example, the risk of an investment generally refers to its standard deviation. Estimating the standard deviation of the dependent (binary) variable in a logistic regression can be done, but it does not relate to risk in the usual sense.

It is possible that you are using the term “risk” in a way that is different from the fields I work in. That is where additional details would be helpful.

In any case, I hope this short answer helps a bit. If not, feel free to follow up.