*Y* variable takes on only two values. Such a variable is referred to as “binary” or “dichotomous.” “Dichotomous” basically means two categories such as yes/no, defective/non-defective, success/failure, and so on. “Binary” refers to 0’s and 1’s.

In logistic regression, the *Y* variable is generally binary. That is, it takes on the values 0 or 1 only. If the original variable was dichotomous (e.g., “yes” or “no”), then the categories are coded as 0 and 1. You get to choose which of the dichotomous categories is coded as 1.

In regression, in addition to your dependent variable (your *Y* variable), you also have explanatory variables (your *X*-variables). Your goal is to understand the relationship between the explanatory *X*-variables and the *Y*-variable.

For example, you might be interested in factors that influence (or explain) whether or not a person in the U.S. owns a U.S.-made or foreign-made (non-U.S.) car. It would be natural to code owning a U.S. car as a 1 and owning a foreign car as a 0. So the *Y*-variable in the logistic regression is whether or not a person owns a U.S. car coded as a 1 if he or she does and a 0 otherwise.
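The coding step can be sketched in a few lines of Python. This is only an illustration: the survey answers and the helper name `code_binary` are made up for the example and do not come from a real data set.

```python
# Hypothetical survey answers (illustrative values only).
answers = ["US", "foreign", "US", "US", "foreign"]


def code_binary(values, one_label):
    """Return 1 where a value equals one_label and 0 otherwise."""
    return [1 if v == one_label else 0 for v in values]


# We choose "US" as the category coded 1; choosing "foreign" would work equally well.
y = code_binary(answers, one_label="US")
print(y)
```

Whichever category is coded as 1 is the one whose probability the logistic regression will model.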

You might then want to study how various factors or variables influence whether or not a person owns a U.S. car. Variables you might consider are income, age, gender, marital status, children, political affiliation, and so on. These variables are the *X*-variables in the logistic regression that you will use to try to explain or predict the value of the *Y*-variable. For example, you might want to know if gender matters: are men or women more likely to own U.S. cars?

In logistic regression, the *X*-variables are used to build a mathematical equation that predicts the probability that the *Y*-variable takes on a value of 1. Thus, we use logistic regression when it is plausible that whether or not the *Y*-variable is 0 or 1 is like a flip of a coin where the probability of getting “heads” depends on the *X*-variables. That is, unlike flipping a regular coin, the probability of getting “heads” is not always 50/50, but rather depends on the values taken by the *X*-variables.
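To make the coin-flip picture concrete, here is a minimal Python sketch. It assumes a single *X*-variable (income in thousands of dollars) and uses the standard logistic function to turn a linear combination into a probability. The coefficients `b0` and `b1` are invented for illustration; they are not estimates from any data.

```python
import math
import random


def logistic(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_us_car(income_thousands, b0=-1.0, b1=0.02):
    """Probability that Y = 1 (owns a U.S. car).

    b0 and b1 are made-up values for this sketch, not fitted estimates.
    """
    return logistic(b0 + b1 * income_thousands)


def flip(p, rng):
    """Flip a coin that comes up 1 ("heads") with probability p."""
    return 1 if rng.random() < p else 0


rng = random.Random(0)  # fixed seed so the simulation is repeatable
for income in (20, 50, 100):
    p = prob_us_car(income)
    print(income, round(p, 3), flip(p, rng))
```

Each person gets their own coin: the probability of “heads” changes with income, but the outcome for any one person is still random.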

In summary then, we use logistic regression when:

- We have a binary or dichotomous *Y* variable.
- We have explanatory *X*-variables that we think are related to the *Y*-variable.
- It is reasonable to think that the value the *Y*-variable takes on is like a coin flip where the probability of getting a 1 (“heads”) depends on the explanatory variables.

Comments or questions are welcome! I want to keep improving this material.

Hi Dear StatsProf

Dear Prof. I would like to have your comment or suggestion on my situation.

I have collected the data: there are 300 non-injury and only 17 injury cases. Four categorical variables are significant according to the chi-square test, so I then used multiple logistic regression for the significant variables. Three of them are significant again. Does this make any sense? I would like to know whether I can use multiple logistic regression, because only 17 of the 317 respondents were injured.

I used SPSS to analyze the data.

If I cannot run it, what should I do? Is there any way to solve it?

I appreciate all your help and support; it’s been a great encouragement to me

Dear statsprof?

Your website is extremely useful.

I am in a dilemma with my project. I have collected data on patients’ admission route to hospital and whether they brought their own medication in with them. Therefore I have figures for the number of patients who came via each route and whether they brought their own medication in. I then facilitated staff education for ambulance staff and new medication bags in all ambulances. I collected the same data again afterwards – patients’ route of admission to hospital (e.g., ambulance from home, ambulance from GP surgery, own transport from home, etc.) and whether they brought their medication in with them. I want to carry out a statistical analysis on whether there is a link between the route of admission and whether patients brought their meds into hospital, and also on the difference before and after staff education. (I am hoping for statistical significance for patients admitted via ambulance, as these are the staff I educated.) Many thanks in advance

Joanne

Your materials are good for obtaining knowledge, but I am developing my proposal, so I want materials that show methods of sample size determination for a study on the impact of credit on small farmers. Thank you.

Major apologies…my post should read like this:

I have a dichotomous DV (retained, not retained) and an IV which is continuous, and the score is a value of their ability to perform (so 1 is a really good performer and 20 is a really bad performer). The idea is to predict retention or non-retention in a program based on the score they receive on the IV. I am trying to figure out at what point the score (predictor) on the IV becomes significantly more likely than the scores before it to no longer be retained. For example, let’s say the DV is coded (not retained = 0 and retained = 1) and the IV scores range from 1 to 20. Do the scores above a certain score indicate non-retention? However, not all scores might be in use, so say out of 100 people the IV scores range between 1 and 8 only. So my question is, am I better off grouping the scores and using the IV as a categorical variable in a logistic regression?

Its been a long day….

James

James,

Thanks for your question. I do not think I would group the scores for the independent variable. If you do so, you are throwing away some of the information that the 1 to 20 point scale gives. I do not think it matters much that only scores 1 to 8 are used out of the 1 to 20 point scale as long as scoring is done in the same way for all of the employees.

What the logistic regression will give you is a function that predicts the probability of retention as a function of the IV’s. You can then use this estimated probability to classify the observations into retained and not retained groups if you want. Note that logistic regression is not exactly the same as a classifier, although it is often used that way. In many cases people use the 50% probability point as a cut-off point, so that observations are classified depending on which category is more probable according to the estimated probability.

You could “turn this around” and use a 50% probability to divide the IV (assuming there is only one) into two regions corresponding to most likely to be retained and most likely to not be retained.
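A minimal Python sketch of both ideas, assuming a single IV and a fitted model of the form log(p/(1-p)) = b0 + b1*score. The values of `b0` and `b1` below are made up for illustration (b1 is negative because higher scores mean worse performers); in practice they would come from the fitted logistic regression.

```python
import math


def retention_probability(score, b0, b1):
    """Estimated probability of retention (Y = 1) at a given IV score."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * score)))


def classify(score, b0, b1, cutoff=0.5):
    """Classify as retained (1) if the estimated probability reaches the cutoff."""
    return 1 if retention_probability(score, b0, b1) >= cutoff else 0


def score_at_half(b0, b1):
    """IV value where the estimated probability is exactly 0.5.

    At p = 0.5 the log odds are zero, so b0 + b1*score = 0.
    """
    return -b0 / b1


b0, b1 = 3.0, -0.5  # hypothetical coefficients, not estimates from data
print(score_at_half(b0, b1))
for score in (2, 6, 8):
    print(score, round(retention_probability(score, b0, b1), 3), classify(score, b0, b1))
```

With these made-up coefficients, scores on one side of the dividing value are “most likely to be retained” and scores on the other side are “most likely to not be retained.”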

Finally, what I would do might be influenced by the number of observations (people) I had in the data set. If the number of observations is very large, then I might consider additional approaches such as CART (classification and regression trees). CART works by dividing the IV’s into categories and using a simple estimate (the sample proportion) on each region formed. This is done automatically and works well when there are a very large number of observations. But, again, in a typical situation, I would generally not group the scores.

Regards,

StatsProf

Thanks for your prompt reply.

The reason I was worried about not all scores being utilised was because of the restriction of range problem but that might only be applicable if I was using a correlation analysis.

There is only 1 IV and I like the sound of splitting it into two regions by a 50% probability. Just to be certain of what you are suggesting, are you saying I should find the 50% cumulative frequency point on the IV, make the IV dichotomous, and then run the LR?

Thanks again …

Yes. You have interpreted my suggestion correctly. Good luck with your project.

I need to make a correction:

It should have said:

I have a dichotomous DV (retained, not retained) and an IV which is continuous. I am trying to figure out at what point the score on the IV becomes significantly more likely than the scores before it to no longer be retained. For example, lets say the DV is (Not retained =0 and retained =1) and the IV the scores range from 1 to 20. However, not all scores might be in use, so say out of 100 people the IV scores range between 1 and 8 only. So my question is, am I better off grouping the scores and using the IV as a categorical variable in a logistic regression?

I have a dichotomous DV (retained, not retained) and an IV which is continuous and the score is a value of their ability to perform (so 1 is a really good performer and 20 is a really bad performer. The idea is to predict retention or non-retention in a program based on the score they receive on the IV. I am trying to figure out at what point the score (predictor) on the IV becomes significantly more likely than the scores before it to no longer be retained. For example, lets say the DV is (Not retained =0 and retained =1) and the DV the scores range from 1 to 20. Do the scores above a certain score indicate non-retention. However, not all scores might be in use, so say out of 100 people the DV scores range between 1 and 8 only. So my question is, am I better off grouping the scores and using the IV as a categorical variable in a logistic regression?

Your help would be greatly appreciated!

James

Thank you for putting together such an informative site. I very much appreciate it! I have a situation where the y variable has 3 non-ordered categories (3 different career choices). I have about 15 predictor variables of mixed scales. Some are interval, others are ordinal, some are categorical. So, neither logistic nor ordered logistic fits. I am at a loss as to how best to analyze the data, and would greatly appreciate your advice on it.

Thank you very much!

Thank you, StatsProf.

I figured it out: Multinomial logistic regression. I totally blanked out on that.

Regards,

Justin

Justin,

I am glad you answered your own question! You are right, multinomial logistic regression is what you want to use. Eventually, I will develop material about multinomial logistic regression on this web site, but I haven’t gotten to it yet.

Regards,

StatsProf

Dear StatsProf,

I am studying the outcome of a disease (Y variable is outcome/no outcome). Some patients get the outcome and some do not. All of my X variables are binary as well (they consist of clinical characteristics with a Yes/No). Is such a multivariate logistic regression viable? My main difficulty, though, is time. I know this is not a survival analysis and time is not taken into account, but in real life clinical characteristics change as time goes by. Each patient has different responses to my X variables through time. So, which point in time is the best to use in my analysis? My best guess is that for those who experienced the outcome, the ideal time is near the time they experienced the outcome, and for those who did not experience the outcome it is their latest follow-up. What are your views on this?

I am grateful for this wonderful website, helping us in the difficult task of statistical analysis.

kind regards,

Alex

Alex,

Logistic regression is appropriate in the situation you describe. There is certainly no problem with the X-variables being binary. In fact, when you have multiple observations at the unique combinations of X-variables, it makes it easier to check how well the logistic regression model fits (I have not yet written about this on the web site).

Based on your description, I think you do need to use the same time period to determine the outcome (the Y). So, for example, your Y variable needs to represent whether or not the person gets the disease within, say, one year of the measurement of the predictors (X-variables). In this case, you would only be able to use the results for people that you have been able to “follow” for a year after the measurement of the X’s to determine if they get the disease. Obviously, the one-year time frame in my example here is arbitrary. You could use a shorter or longer time frame. But it needs to be the same for all of the observations (people).
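One way to picture the fixed-window rule is the small Python sketch below. The function name, the 365-day window, and the argument layout are all assumptions made for illustration. It also assumes that a person whose disease onset is observed inside the window counts as Y = 1 even if follow-up stopped afterwards; that is one reading of the rule, not the only possible one.

```python
WINDOW_DAYS = 365  # arbitrary window, the same for every person


def outcome_within_window(followup_days, onset_day, window=WINDOW_DAYS):
    """Return Y for one person, or None if the person cannot be used.

    followup_days: days the person was followed after the X's were measured.
    onset_day: day of disease onset after measurement, or None if not observed.
    """
    if onset_day is not None and onset_day <= window:
        return 1  # disease observed inside the window
    if followup_days >= window:
        return 0  # followed for the whole window with no onset inside it
    return None   # not followed long enough to know the outcome


# Hypothetical people: (followup_days, onset_day)
people = [(400, 100), (400, None), (200, None), (800, 500)]
print([outcome_within_window(f, o) for f, o in people])
```

People who return `None` would be left out of the logistic regression because their outcome within the window is unknown.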

There is another approach that you could take that is not based on logistic regression. Rather, it is based on statistical methods that are generally referred to as either reliability analysis or survival analysis. In this case, your Y variable is the time between the measuring of the X’s and when the person gets the disease. The X’s are now predicting the average time to onset of the disease. In any particular study (which will have a fixed time frame), there will be patients that do not get the disease within the time frame. This is OK and is “handled” by the statistical methods. Observations for which the event (onset of the disease) does not occur within the time frame of the study are called “censored” observations. Censoring can also occur for other reasons (such as the patient dying of something else, so all you know is that the time to onset of this disease was longer than the time at which they died of whatever killed them).
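Survival methods usually take each observation as a pair: a time and a flag saying whether the event was actually observed. The Python sketch below shows only that data layout; the function name and the example values are hypothetical, and no survival model is fitted here.

```python
def to_survival_record(onset_day, last_followup_day):
    """Return (time, event_observed) for one patient.

    If onset was observed, the time is the onset day and the flag is 1.
    Otherwise the observation is censored: the time is the last day the
    patient was known to be disease-free and the flag is 0.
    """
    if onset_day is not None:
        return (onset_day, 1)
    return (last_followup_day, 0)


# Hypothetical patients: (onset_day, last_followup_day)
patients = [(120, 120), (None, 365), (None, 90)]
records = [to_survival_record(onset, last) for onset, last in patients]
print(records)
```

A censored record still carries information: it says the time to onset was longer than the recorded time.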

If you want to take this approach (which may address the time-related issues you raise), you should look for books and references on reliability analysis or survival analysis. “Reliability” books tend to be more focused on reliability of machines and so on. “Survival” books tend to focus on the survival of living things. The statistics and math are the same in both cases, but I think there is a more natural fit between survival analysis and the problem you describe. If you want a recommendation for a good textbook (which I have used before), you might try Applied Survival Analysis: Regression Modeling of Time to Event Data (Wiley Series in Probability and Statistics).

I hope this helps.

Regards,

StatsProf

Thank you very much for your detailed answer. Looking forward for new posts.

Alex

Dear statsprof,

can you please advice on which type of analysis can be applied on this kind of data? My aim is to calculate either OR (odds ratios) or RR (risk ratios) (This is an example of questionnaire to a group of all patients afflicted with a lung disease). Thank you.

-Are you male or female? Male=120; Female=80

-Are you using daily inhaled corticosteroids? Yes=150; No=50

-Which inhaled medication use? Salbut=24; Simbic=30; Flexot= 75 (which makes less than 150; data lost)

-Severity of disease (Mild; Moderate; Severe): Mi=125; Mo=60; Se=15

-etc.

Hello Max,

Thanks for your question. I am assuming that the response variable (the Y) is the severity of the disease. This variable is categorized into three categories: mild, moderate, and severe.

When you have multiple categories and they are ordered, there is a technique that is an extension of logistic regression called ordered logistic regression. In ordered logistic regression (for three categories), we would have three probabilities: pMi, pMo, and pSe. One of these we would define by subtracting from 1: for example, pSe = 1 – (pMi+pMo).

The other two probabilities are modeled by two logistic regression equations:

log(pMi/(1-pMi)) = alpha1 + beta1*X1 + beta2*X2 + … + betap*Xp

log((pMi+pMo)/(1-pMi-pMo)) = alpha2 + beta1*X1 + beta2*X2 + … + betap*Xp

where alpha2 > alpha1. These are two parallel lines because they only differ by the intercept.

Notice that the odds in these two equations build up from the smallest category: each is based on the cumulative probability up to the category under consideration.

These equations can then be solved for the probabilities and the parameters estimated using maximum likelihood techniques.
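As a rough illustration of solving the two equations for the three probabilities, here is a minimal Python sketch. It assumes alpha2 > alpha1 and takes the shared part beta1*X1 + … + betap*Xp as a single number. The function names and the numeric values are made up for the example; they are not estimates from any data.

```python
import math


def logistic(z):
    """Inverse of the log odds: map z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_probs(linear_part, alpha1, alpha2):
    """Return (pMi, pMo, pSe) for one observation.

    linear_part stands for beta1*X1 + ... + betap*Xp, shared by both equations.
    """
    if alpha2 <= alpha1:
        raise ValueError("alpha2 must be greater than alpha1")
    cum_mi = logistic(alpha1 + linear_part)  # pMi
    cum_mo = logistic(alpha2 + linear_part)  # pMi + pMo (cumulative)
    p_mi = cum_mi
    p_mo = cum_mo - cum_mi
    p_se = 1.0 - cum_mo
    return p_mi, p_mo, p_se


# Hypothetical intercepts and linear part, chosen only for illustration.
p_mi, p_mo, p_se = ordered_probs(linear_part=0.0, alpha1=-1.0, alpha2=1.0)
print(round(p_mi, 3), round(p_mo, 3), round(p_se, 3))
```

Requiring alpha2 > alpha1 is what keeps the middle probability positive, since the cumulative probability for the second equation must exceed that of the first.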

I did a quick search on the internet to see if I could find a good reference for you. I did not find one. I thought the Wikipedia page was almost useless.

The book I use most often as a reference for logistic regression-related things is Applied Logistic Regression (Wiley Series in Probability and Statistics). This book has half a chapter on ordinal logistic regression.

There are a variety of other books on Amazon as well — just search for ordered logistic regression.

Eventually, I will write about ordinal logistic regression on this web site, but it will be quite a while because I still have a lot to cover on regular logistic regression.

I hope this helps.

Best Regards,

StatsProf

Dear Prof

Thanks for your website, it’s amazingly helpful.

For my project (assessing tumour response to various treatments), I was looking into the impact of various factors (explanatory factors, or X) such as gender, tumour group, treatment group, age, and performance status on the response outcome (non-responder or responder; a binary outcome, Y).

Do you think log regression makes sense for my data ?

I was also looking into progression-free survival data and the impact of various variables. Which statistical test would be suitable?

Thanks

Doc,

Thanks for your comment and questions.

If your response variable is binary (like responder or non-responder), then logistic regression is appropriate.

I will make the following comment, however. In general, if you have a continuous variable as an outcome, one that measures the degree of the effect, then it is better to use that variable rather than a categorical variable (like responder or non-responder). Thus, if you have a variable that measures the degree of response (percent tumor shrinkage, for example), it will be better to use than a variable that just indicates whether or not there was a response.

The reason, which is intuitive, is that the binary variable (Was there a response?) does not contain as much information as the continuous variable (How much response). So, in general, continuous variables are better to use (if you can) than categorical variables.

However, continuous variables are subject to the presence and effects of outliers in a way that categorical variables generally are not. So you have to be careful about the effects of outliers.

It would be very reasonable to do both an analysis using the degree of response as well as another analysis using logistic regression with the binary variable (response or not) as a check.

Progression-free survival time would also be a very useful variable to consider. You would use the methods of survival analysis to analyze these times.

Notes to other readers:

So, in summary:

Is this model appropriate if you purposefully select participants who only have y=1? In my case, y=0 is possible, but not relevant…

What your question implies is correct. The logistic regression model will not be useful unless Y takes on both 0 and 1 values. Logistic regression is trying to model the probability of success (i.e., Y=1) as a function of some explanatory variables (the X’s). If all of the Y’s are 1, there is no variation for the X’s to explain.
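A small Python sketch can show what goes wrong. It uses the simplest possible model, with only an intercept `b0` and no X’s, which is an assumption made to keep the example short. When every Y is 1, the log-likelihood keeps rising as `b0` grows, so there is no finite best estimate; with a mix of 0’s and 1’s it peaks at a finite value.

```python
import math


def log_likelihood(b0, ys):
    """Log-likelihood of an intercept-only logistic model for 0/1 outcomes."""
    p = 1.0 / (1.0 + math.exp(-b0))  # same probability of Y = 1 for everyone
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in ys)


all_ones = [1, 1, 1, 1]  # hypothetical data where Y never equals 0
mixed = [1, 1, 1, 0]     # hypothetical data with both outcomes

for b0 in (0.0, 2.0, 4.0, 8.0):
    print(b0, round(log_likelihood(b0, all_ones), 4), round(log_likelihood(b0, mixed), 4))
```

The same thing happens when X’s are included: with no 0’s in the data, nothing in the X’s can distinguish the two outcomes.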

If you want to outline in more detail the nature of your data, I may be able to suggest other techniques that might be useful to analyze it.

Thanks for your question.

StatsProf