Just to be sure that you have a clear idea of what a data set that is appropriate for logistic regression analysis looks like, I am providing an example in this article. As I indicated in the previous article, a multivariate logistic regression data set is essentially the same as a multivariate regular linear regression data set except that the dependent variable is binary.

## The Context – A Data Mining Example

In this example, a magazine reseller is trying to decide what magazines to market to customers. In the “old days,” this might have involved trying to decide which customers to send advertisements to via regular mail. In the context of today and the “web,” this might involve deciding what recommendations to make to a customer viewing a web page about other items that the customer might be interested in and therefore want to buy. The two problems are essentially the same.

In this example, the website MZines4You.com (a fictitious name I made up which was unclaimed when this article was written) wants to decide what magazines to include in e-mails to customers as a part of an e-mail marketing campaign. All of the e-mails that will be sent will go to customers who have previously bought a magazine subscription at MZines4You.com and who have not opted out of receiving e-mails.

The magazines advertised in each e-mail will be automatically selected specifically for each customer when the e-mail is generated in order to maximize the probability that the customer will buy. MZines4You.com will only include ads for three magazines in each e-mail in a row at the top of the message because management believes that including more ads is ineffective. MZines4You.com also believes that including only three ads makes it much more likely that the ads will appear in the recipient’s e-mail preview and therefore actually be viewed (without the recipient actually having to open the e-mail).

## The Sample – Obtaining The Data

Because all of the recipients of the e-mails have previously made a purchase at MZines4You.com, the company can match the data collected when the customer made their previous purchase with third party data (which can be purchased from data sources such as the credit scoring agencies) so they have quite a lot of information about each customer. For example, they have data such as income, number of people in the household, and so on. This kind of merging of data from multiple sources to assemble a remarkably rich “profile” of each customer is becoming increasingly common.

Here are the variables that MZines4You.com has on each customer from third-party sources:

- Household Income (Income; rounded to the nearest $1,000.00)
- Gender (IsFemale = 1 if the person is female, 0 otherwise)
- Marital Status (IsMarried = 1 if married, 0 otherwise)
- College Educated (HasCollege = 1 if has one or more years of college education, 0 otherwise)
- Employed in a Profession (IsProfessional = 1 if employed in a profession, 0 otherwise)
- Retired (IsRetired = 1 if retired, 0 otherwise)
- Not employed (Unemployed = 1 if not employed, 0 otherwise)
- Length of Residency in Current City (ResLength; in years)
- Dual Income if Married (Dual = 1 if dual income, 0 otherwise)
- Children (Minors = 1 if children under 18 are in the household, 0 otherwise)
- Home ownership (Own = 1 if own residence, 0 otherwise)
- Resident type (House = 1 if residence is a single family house, 0 otherwise)
- Race (White = 1 if race is white, 0 otherwise)
- Language (English = 1 if the primary language in the household is English, 0 otherwise)

So how might MZines4You.com decide what magazines to market to each person; that is, what ads to put in each e-mail? One way would be to develop an equation (this is where multivariate logistic regression comes in) that predicts the probability that a customer will buy a particular magazine based on the data that the company has about the customer. Such an equation would be developed for each magazine that the company sells.

If MZines4You.com has such a model for each magazine that they sell, they can calculate the probability that the customer will buy for each one of the magazines they offer. Then they can put the top three magazines in the e-mail (that is, the three that the model predicts the customer is most likely to buy). Note: MZines4You.com might do more complicated things than just look at the predicted probabilities (such as looking at the expected profit from the sale), but for simplicity let’s just assume that the goal is to put ads in the e-mail for the three magazines that the customer is most likely to buy.
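The selection step described above can be sketched in a few lines of Python. This is only an illustration: the function name `top_three_magazines`, the magazine names other than “Kid Creative,” and all of the probabilities are made up, and the sketch assumes that a predicted purchase probability is already available for every magazine.

```python
from typing import Dict, List


def top_three_magazines(purchase_probs: Dict[str, float], n_ads: int = 3) -> List[str]:
    """Return the n_ads magazines with the highest predicted purchase probability."""
    ranked = sorted(purchase_probs, key=purchase_probs.get, reverse=True)
    return ranked[:n_ads]


# Hypothetical predicted probabilities for ONE customer (invented numbers).
example_probs = {
    "Kid Creative": 0.12,
    "Magazine B": 0.05,
    "Magazine C": 0.30,
    "Magazine D": 0.08,
    "Magazine E": 0.01,
}

print(top_three_magazines(example_probs))
# ['Magazine C', 'Kid Creative', 'Magazine D']
```

Ranking on expected profit instead would only change the sort key (probability times profit per sale), not the structure of the selection.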

In order to be able to develop an equation that predicts the probability that a customer will buy a particular magazine, the company will need to run an experiment in order to collect data on customer purchase behavior. One way to do this is to randomly select some customers from the customer database and then send them e-mails with randomly selected ads. Whether or not these customers buy the advertised magazines can provide the data necessary to estimate the equations that will be used to predict the probability that a customer purchases a particular magazine.

If you have a large number of magazines that you sell, you may need to send out a large number of e-mails in order to get useful prediction equations. Making sure that you end up with enough data for each magazine to end up with a useful equation for predicting the probability of purchase can be a bit complicated (and require a large number of e-mails in the experiment), but I am not going to delve into these issues in this example.

So the problem of deciding what magazine ads to place in each e-mail boils down to developing an equation for each magazine that predicts the probability that a customer will buy. We are now going to focus on the issue of developing such an equation for one magazine (“Kid Creative”) whose target audience is children between the ages of 9 and 12. In the process of sending out the “experimental” e-mails, the ad for “Kid Creative” was shown in 673 e-mails to customers and the purchase behavior was recorded.

In addition to the variables for each customer listed above (the ones obtained from third-party sources), MZines4You.com has the following variables from their own databases:

- Previously purchased a parenting magazine (PrevParent = 1 if previously purchased a parenting magazine, 0 otherwise)
- Previously purchased a children’s magazine (PrevChild = 1 if previously purchased a children’s magazine, 0 otherwise)

The dependent variable comes from the “experiment”; that is, from the 673 e-mails to customers containing the ad for “Kid Creative” and whether or not the customer purchased the magazine. That is, the dependent variable is

- Purchased “Kid Creative” (Buy = 1 if purchased “Kid Creative,” 0 otherwise)

## The Data

So here is what the data looks like (with some columns and rows omitted so that the table fits on the page):

Note that the variable, “Buy,” is binary (i.e., 0 or 1) as is required for a logistic regression. The independent variables can be binary or not (for example, Income is not a binary variable), just as in a regular least-squares regression.
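As a rough illustration of this layout, here is a minimal Python sketch that checks whether the dependent variable is coded as 0 or 1 in every row. The rows are invented for illustration and are not taken from the actual data set; only the column names come from the variable lists above, and the helper name `check_logistic_ready` is hypothetical.

```python
from typing import Dict, List


def check_logistic_ready(rows: List[Dict[str, float]], dependent: str = "Buy") -> bool:
    """Return True if every row has the dependent variable coded as exactly 0 or 1."""
    return all(row.get(dependent) in (0, 1) for row in rows)


# Invented rows (NOT the real data): one binary dependent variable ("Buy")
# plus a mix of continuous (Income) and binary independent variables.
rows = [
    {"Buy": 0, "Income": 24000, "IsFemale": 1, "Minors": 0, "PrevChild": 0},
    {"Buy": 1, "Income": 75000, "IsFemale": 1, "Minors": 1, "PrevChild": 1},
    {"Buy": 0, "Income": 46000, "IsFemale": 0, "Minors": 0, "PrevChild": 0},
]

print(check_logistic_ready(rows))  # True
```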

I hope that this example makes it very clear what sort of data set is suitable for logistic regression. If you have any questions or comments, please let me know below.

I have a question – I have a dataset for 2001-2013 which provides details like:

PatientID (Eg: P1000132)

Hospital Name (Eg. Thomas Mount Hospital)

Hospital Type (Eg. Public/Private/Clinics)

Ward Type (Eg: A, B, C, D etc)

Diagnosis (Eg: Prostate Cancer, Accidental Fracture etc)

Hospital Bill (Eg. USD5,617.00)

I need to predict if a patient is likely to get admitted in 2014. It requires a result in Yes/No, Yes for likely to get admitted, No for not likely.

How would you suggest I prepare my dataset in this case if I need to apply logistic regression?

Logistic regression analysis is very appropriate for this type of problem if the time-frame used to capture admission is constant.

From your question, I cannot tell exactly what the purpose of the analysis is supposed to be. It might be that you are trying to forecast “demand” in 2014 from “repeat customers.” I suspect, however, that your question may be related to forecasting hospital “readmission” rates which are used in the calculation of reimbursement under the Affordable Care Act (Obamacare) in the U.S. Implementation of Obamacare is driving a lot of questions right now. I don’t know if you are from the U.S., but I am going to start by answering the question from that point of view (as it is a useful discussion anyway). If this is not what you are trying to do, please ask me again with a little more detail and I will give you an answer.

Under the Affordable Care Act, a hospital is penalized if a patient is readmitted within 30 days of discharge. Thus, there is a great incentive to try to predict which patients are likely to be readmitted (according to this criterion) so that the hospital can intervene to try to prevent that readmission (for example, by not discharging the patient as soon). In this situation, your data consists of all of the X-variables (independent variables) that you are using to try to explain the probability of readmission. The variables you have listed in your question are the kind of variables that you might use as X-variables.

For this situation, your Y-variable is 1 for readmission within 30 days of discharge and 0 if readmission did not occur in that time frame. Thus, you can get a data point for each patient that was discharged up until one month (30 days) ago. Patients discharged less than 30 days ago cannot be used because you can’t yet tell if they will be readmitted within the 30 days and thus cannot determine the value of their Y-variable.
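The coding of the Y-variable described above might be sketched as follows. This is a minimal illustration under stated assumptions: the function name `readmission_label` and its arguments are hypothetical, and it assumes you know each patient's discharge date and the date of their next admission (if any).

```python
from datetime import date, timedelta
from typing import Optional

WINDOW = timedelta(days=30)


def readmission_label(discharge: date, next_admission: Optional[date], today: date) -> Optional[int]:
    """Return 1 or 0 for readmitted / not readmitted within 30 days of discharge.

    Returns None when the patient was discharged less than 30 days before
    `today`, because the Y-variable cannot be determined yet.
    """
    if today - discharge < WINDOW:
        return None  # the 30-day window is still open; do not use this patient
    if next_admission is not None and discharge <= next_admission <= discharge + WINDOW:
        return 1
    return 0


today = date(2014, 3, 1)
print(readmission_label(date(2014, 1, 10), date(2014, 1, 25), today))  # 1
print(readmission_label(date(2014, 1, 10), None, today))               # 0
print(readmission_label(date(2014, 2, 20), None, today))               # None
```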

To predict the probability that a patient who has just been discharged will be readmitted within 30 days, you build a logistic regression model based on your data, and then you just plug the new patient’s X-variable values into the logistic regression equation. The equation will give you a predicted probability. If you really need to just classify the patient as “likely” or “not likely” (as you indicate in your question), you will need to decide on a criterion for the probability such as the predicted probability > 0.6 means “likely.”
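A minimal sketch of this “plug in and classify” step is shown below. The intercept, slope coefficients, and X-values are invented for illustration; in practice they would come from the fitted model in your software. The 0.6 cutoff is the example criterion mentioned above, not a recommended value.

```python
import math
from typing import Sequence


def predicted_probability(alpha: float, betas: Sequence[float], x: Sequence[float]) -> float:
    """Plug one patient's X-values into the logistic regression equation."""
    lin_fun = alpha + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-lin_fun))  # same as exp(z) / (1 + exp(z))


def classify(prob: float, cutoff: float = 0.6) -> str:
    """Turn a predicted probability into a Yes/No style label."""
    return "likely" if prob > cutoff else "not likely"


# Hypothetical fitted coefficients and one new patient's X-values.
alpha, betas = -2.0, [0.5, 1.0]
new_patient = [3.0, 1.0]

p = predicted_probability(alpha, betas, new_patient)
print(round(p, 4), classify(p))  # 0.6225 likely
```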

Now if you are trying to do something else, like forecast the probability that a patient admitted in 2013 will be admitted again in 2014, things are more complicated. The reason is that 2014 represents different relative time periods depending on when the patient was discharged. For example, if a patient was discharged in Jan 2013, then 2014 represents months 12 to 23 after discharge. But for a patient discharged in December of 2013, 2014 represents months 1 to 12 after discharge. Thus, this is a more complicated problem than the Obamacare readmission problem.

What you would do in this case would depend on how much data you have and how accurate an answer you need. If you have lots of data, you could build a separate logistic regression model for each time horizon that you would need to forecast. So for patients discharged in January 2013, you could build a model of the probability of being readmitted 12 to 23 months out. For patients discharged in February 2013, you could build a model for being readmitted 11 to 22 months out, and so on. Each one of these logistic regression problems would look like the first problem I described (readmission during 30 days after discharge) with the 1 and 0 coding of readmission based on the time period for the model.
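The month arithmetic behind these horizons can be sketched as a small helper. The function name `months_ahead_window` is hypothetical, and the sketch assumes the “monthly” granularity used in this discussion.

```python
from typing import Tuple


def months_ahead_window(discharge_year: int, discharge_month: int, target_year: int) -> Tuple[int, int]:
    """Return (first, last) months after discharge that the target calendar year covers."""
    first = (target_year - discharge_year) * 12 + (1 - discharge_month)
    last = first + 11
    return first, last


print(months_ahead_window(2013, 1, 2014))   # (12, 23)
print(months_ahead_window(2013, 2, 2014))   # (11, 22)
print(months_ahead_window(2013, 12, 2014))  # (1, 12)
```

Each distinct window would then get its own 1 and 0 coding of readmission and its own logistic regression model.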

Note: In this discussion, I have assumed that “monthly” granularity is adequate. In principle, if you had enough data, you could do this using “daily” granularity.

I hope this discussion helps. If you have additional questions (which seems like it might be quite likely), do not hesitate to ask (and/or provide more details about your problem).

Regards,

StatsProf

Thanks Professor. Your explanation indeed is of help.

However, I am from India and not working on “ObamaCare”. I forgot to mention 2 X variables: date of admission and date of discharge. This means I have the duration of stay. And as you mentioned, I have “daily” granularity.

I have one question (disclaimer – my understanding might be wrong) – I understand that logit regression takes the dependent variable to be binary (0 and 1) only. So in my dataset, if I consider an X-var which is a continuous variable, my understanding is that it takes 0 if the value is 0, it takes 1 if the value is 1, and in case the value >1 it treats it as 0 (unless I recode them appropriately). Based on my understanding – in your dataset, when you have Income as a continuous variable, how does the logit function handle it?

(Your site has very useful content, and it is even more useful that you attend to our queries. I wish we had a linearregressionanalysis.com.)

Thanks for the clarification.

In logistic regression, the X-variables can be continuous or not, just like in regular least-squares regression. Logistic regression “converts” a linear function of the X-variables into a probability using the S-shaped logistic function. As you point out, the Y-variable has to be 0 or 1.

Suppose that we call the linear function of the X-variables LinFun. So LinFun = alpha + beta1*X1 + beta2*X2 + ... + betaP*XP. Then the probability associated with these X’s is p = exp(LinFun)/(1+exp(LinFun)). This is the logistic function.

So (if I understand your question), it is not generally the case that “it takes 0 if the value is 0, it takes 1 if the value is 1 and in case the value >1 it treats it as 0.” How a value of X=0, X=1, or X>1 affects the probability depends on the logistic equation and the values of the estimated coefficients beta and the intercept alpha. (I have used better math notation in my explanation in my previous post, What is logistic regression?)
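To make this concrete, here is a small Python sketch with a single continuous X-variable. The values of alpha and beta are invented for illustration; with real data they would be estimated by your software. The point is only that every value of X, including values greater than 1, maps to its own probability.

```python
import math


def logistic(lin_fun: float) -> float:
    """p = exp(LinFun) / (1 + exp(LinFun))"""
    return math.exp(lin_fun) / (1.0 + math.exp(lin_fun))


# Hypothetical intercept and slope for one continuous X-variable.
alpha, beta = -3.0, 0.04

x_values = [0, 1, 2, 50, 100]
probs = [logistic(alpha + beta * x) for x in x_values]

for x, p in zip(x_values, probs):
    print(x, round(p, 3))
```

Because beta is positive in this made-up example, the predicted probability rises smoothly as X increases; nothing is recoded to 0 or 1.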

Note that in my previous explanations, I have not told you how the slope coefficients (the betas) and the intercept (the alpha) are computed by the software you are using. This may be confusing you. It is probably better to just take the estimation procedure that is used to compute the coefficients and intercept as a “black box” and not worry about exactly how this is done. What is important is to understand what the coefficients mean, and their meaning is really contained in the logistic equation. Again, see my previous post (What is logistic regression?).

Having taught basic least-squares regression to business students for years, I have found that they do not really understand how the least-squares coefficients are computed. They have some intuition that the software is minimizing the squared residuals, but that is all. What is important is that they know how to use and interpret the coefficients. The same thing applies for logistic regression. For most logistic regression users, it is not particularly important to understand how the coefficients are estimated, just what they mean and how to use them. This is the logic behind how I have tried to explain logistic regression in the post I wrote reviewing least-squares background and the posts that build on the review that explain the logistic regression coefficient table (see the main page under Least Squares Regression Background and Understanding the Logistic Regression Table).

If you want some language that motivates how the logistic regression coefficients are estimated by the software, what is done is to find the values of the slope coefficients and the intercept that make the actual values that occurred in your data set as likely or probable as possible. This technique is called maximum likelihood, and finding the actual values that maximize the likelihood requires non-linear numerical optimization. But again, this last paragraph probably contains little or no useful information for most people who want to run and use logistic regression.
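For readers who do want to peek inside the “black box,” the following is a rough sketch of maximum likelihood for a model with one X-variable. It uses plain gradient ascent, which is a stand-in chosen for brevity; real statistical software typically uses faster methods, and nothing here should be read as how any particular package works. The toy data, step size, and iteration count are all invented for illustration.

```python
import math
from typing import Sequence, Tuple


def _prob(alpha: float, beta: float, x: float) -> float:
    """Logistic probability for a single X-value."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def log_likelihood(alpha: float, beta: float, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Log of the probability of observing the actual 0/1 values in the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = _prob(alpha, beta, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_by_gradient_ascent(xs: Sequence[float], ys: Sequence[int],
                           step: float = 0.1, iterations: int = 5000) -> Tuple[float, float]:
    """Climb the log-likelihood surface starting from alpha = beta = 0."""
    alpha, beta = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        grad_alpha = sum(y - _prob(alpha, beta, x) for x, y in zip(xs, ys))
        grad_beta = sum((y - _prob(alpha, beta, x)) * x for x, y in zip(xs, ys))
        alpha += step * grad_alpha / n
        beta += step * grad_beta / n
    return alpha, beta


# Invented toy data: larger X tends to go with Y = 1, but not perfectly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]

alpha_hat, beta_hat = fit_by_gradient_ascent(xs, ys)
print(log_likelihood(0.0, 0.0, xs, ys), log_likelihood(alpha_hat, beta_hat, xs, ys))
```

The second log-likelihood printed is higher than the first, which is all that “maximizing the likelihood” means: the fitted coefficients make the observed 0s and 1s more probable than the starting values did.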

I suspect that my response has drifted away from the main thrust of your question, so feel free to ask follow up questions.

Regards,

Stats Prof

Do you still check this site? I have a question to ask.

Yes. I am notified via e-mail when a comment is posted.

Please ask your question and I will see if I can answer it.

StatsProf

I want to point out a mistake in your data set: 15 X variables are described here, but the data set has 16. For home ownership there are two columns, one each for Own and House. Kindly answer it, and please add more examples. Thanks

Thank you for pointing this out!

I have added a description of the 16th variable. The variable is “House” which takes on a value of 1 if the residence is a single family house and 0 for everything else (apartment, condominium, duplex etc.).

Hello,

I’m a student in the UK, and I’m doing a dissertation about dividend policy based on life-cycle theory. I run a logistic regression with panel data for 266 firms over 5 years. To handle firm clusters and timing effects, I use dummy variables rather than a fixed-effects or random-effects model. I do this because I want to consider the timing effect of the financial crisis. However, I don’t know how to run it in STATA. I have adjusted all variables. The dependent variable, dividend in the current year, is available and has been transformed into a binary variable. Do I have to calculate the probability of this variable before running the model, or can STATA calculate it itself?

Your website is simple to understand. Thank you so much.

Thanks for your question. Your research sounds very interesting.

A quick comment before I try to answer your question. If you are using dummy variables to address the issue of firm clusters, it sounds to me as if you are taking a fixed effects approach.
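As a rough illustration of what “using dummy variables” for firms or years means in terms of data preparation, here is a minimal Python sketch. The helper name `dummy_columns` and the firm and year labels are hypothetical. One category is left out as the reference category so that the dummies are not perfectly collinear with the intercept.

```python
from typing import Dict, List, Sequence


def dummy_columns(values: Sequence[str], prefix: str) -> List[Dict[str, int]]:
    """Build 0/1 dummy columns for every category except the first (the reference)."""
    categories = sorted(set(values))[1:]  # drop the first category as the reference
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]


# Invented panel: 3 firms observed in 2 years each.
firms = ["A", "A", "B", "B", "C", "C"]
years = ["2008", "2009", "2008", "2009", "2008", "2009"]

firm_dummies = dummy_columns(firms, "firm")
year_dummies = dummy_columns(years, "year")

for f, y in zip(firm_dummies, year_dummies):
    print({**f, **y})
```

These dummy columns would simply be added to the other X-variables; the dependent variable stays a plain 0 or 1 column.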

The logistic regression model will predict the probability that the dependent variable is 1. Thus, if I understand your question correctly (and I may not), you do not need to calculate the probability of the dependent variable in advance. By this, I am assuming that you mean computing the sample proportion for a group of dependent variables.

If you want to provide me more details, I would be happy to try to give you a more specific and more definite answer.

Good luck with your research.

StatsProf

Hello,

I’m an Italian student of economics.

My thesis concerns logistic regression but I can’t find an appropriate dataset.

Can you help me?

Thanks for your question. I am not exactly sure what level of problem your thesis requires. But here are some suggestions.

Predicting company bankruptcies is a classic area where logistic regression can be applied. Perhaps you have access to financial databases that provide financial data along with whether or not a company filed for bankruptcy. The idea is to use prior financial data to predict whether or not the company files for bankruptcy in the following year.

Similar analysis can be performed to try to predict mergers and acquisitions.

For more general help looking for data, the following page gives links to many datasets: http://www.statsci.org/datasets

If you want a specific example that might work, here is a data set from a Portuguese bank: Bank Marketing

I hope this helps. Good luck.

thanks for detailed article…

I want to run a logistic model but confused about option of data preparation.

I have survey responders, who are a fairly small % of my customer base. Among those responders there are customers who don’t like my company (called detractors) and loyal customers (called promoters), and now I want to predict who is a detractor or promoter for my company. So the problem is how to select a development and validation data set. In the responder population 60% are promoters and the remaining 40% are detractors, so if I use only the responder data set for development and validation, the results will be bad as 60% of the population is supposed to be promoters. Please help.

reg

So if I understand your question correctly, you are concerned that there is an important difference between the people that respond to the survey and your company’s entire customer base. If you only use the data from the respondents, then you are concerned that you will get many more “detractors” than you should. This concern would make sense, because unhappy people might be more likely to respond in order to express their dissatisfaction. I was a little confused by the very end part of your question, so if I have not interpreted it correctly, please let me know.

Your concern is valid. Non-response can cause a very important difference between the results you get from your sample and the actual behavior of the population. So what can you do?

As an aside, you have another logistic regression that could be done (and might be useful) before you get to the analysis of the respondents. Specifically, you could look at what variables you know about all the customers who received the survey (both respondents and non-respondents) and then create a logistic regression model to predict who responds to the survey. If you cannot find variables that predict who responds to the survey, then it might make sense to assume that the respondents are just a random selection (in which case, there is no non-response bias).

The non-response bias may not be that important, depending on what you are trying to do with your analysis. Specifically, if you are not really concerned about the probability that a customer is a detractor or promoter, but rather just want to rank them so that you know what customers to target, then the logistic regression model that you create just using the respondents may do a pretty good (or even very good) job of that. In many real-world situations, what you really care about is not the probability, but the rank because you can then use the rank to identify the most likely customers to be dissatisfied, or defect, or purchase or whatever. These customers can then be targeted for intervention.

Mathematically, you can show that if the detractors in your sample are a random sample of all detractors (with an unknown sampling probability), and if the promoters in your sample are also a random sample of all promoters (with another unknown probability), then the logistic regression slope coefficients will be “correct.” The intercept term in the model will be completely wrong, and you will not know what it should be unless you know the two probabilities.

In data mining situations where the amount of data is large and the data will be divided into estimation (training), validation, and test data sets, the estimation (training) data set is often selected so that it has an equal number of 1s and 0s. The slope coefficients in the regression that is built from this “balanced” sample will be correct, but the intercept will be wrong. In this circumstance, you know the probability used to select the 1s (because you randomly selected them from the large data set), and also the probability used to select the 0s (same reason), so you can correct the intercept term using a formula. In data mining, this approach generally goes by the term “balanced logistic regression.” In statistics it generally goes by the term “case-control study.” Eventually, I will explain all of this in detail on this web site, but it will be a while as this is an advanced topic. In the meantime, you have the key words to do more research on your own if you want.
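The formula is not spelled out above, so the following sketch uses the standard result for this sampling scheme as an assumption: if 1s are kept with probability p1 and 0s with probability p0, the intercept estimated from the sample is too large by ln(p1/p0), so the correction subtracts that amount. The function name and the numbers are hypothetical.

```python
import math


def corrected_intercept(estimated_intercept: float,
                        p_select_ones: float,
                        p_select_zeros: float) -> float:
    """Undo the shift in the intercept caused by sampling 1s and 0s at different rates.

    Assumes 1s were kept with probability p_select_ones and 0s with
    probability p_select_zeros, each by simple random selection.
    """
    return estimated_intercept - math.log(p_select_ones / p_select_zeros)


# Hypothetical example: every 1 was kept, but only 10% of the 0s.
alpha_balanced = -0.4  # invented intercept from the "balanced" sample
alpha_fixed = corrected_intercept(alpha_balanced, p_select_ones=1.0, p_select_zeros=0.1)
print(round(alpha_fixed, 4))  # -2.7026
```

The slope coefficients are left untouched; only the intercept moves.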

My point here is that there is some reason to believe that the regression coefficients in your logistic regression model built only from the respondents may not be too bad. You will not know what the intercept really should be, but with only the slope coefficients, you can rank the customers in order of the probability that they are detractors. Such a ranking does not depend on the intercept. From a practical point of view, a ranking may be enough for determining who you try to approach with customer recovery efforts or some other intervention.
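A short sketch shows why the ranking does not depend on the intercept: the logistic function is increasing in its input, so adding the same constant to every customer's linear score changes the probabilities but not their order. The slope coefficients and customer rows below are invented for illustration.

```python
import math
from typing import List, Sequence


def rank_by_probability(alpha: float, betas: Sequence[float],
                        customers: Sequence[Sequence[float]]) -> List[int]:
    """Return customer indices ordered from highest to lowest predicted probability."""
    probs = [
        1.0 / (1.0 + math.exp(-(alpha + sum(b * x for b, x in zip(betas, row)))))
        for row in customers
    ]
    return sorted(range(len(customers)), key=lambda i: probs[i], reverse=True)


# Hypothetical slope coefficients and four customers with two X-variables each.
betas = [0.8, -0.5]
customers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Two very different (made-up) intercepts give the same ordering.
print(rank_by_probability(-5.0, betas, customers))  # [0, 2, 3, 1]
print(rank_by_probability(2.0, betas, customers))   # [0, 2, 3, 1]
```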

My response to your question is already much too long, but I did want to say a bit about how non-response is generally handled. Usually people are trying to argue that their sample is OK and that non-response does not matter. They generally do this by arguing that their group of respondents is similar to the non-respondents by comparing variables that they do have for both groups and showing that these variables are not statistically different. For example, they try to show that the ages, genders, incomes, etc. of the respondents and non-respondents are not different. If you try to build a logistic regression for who responds (as I suggested in the “aside” above) and cannot find any variables that predict who responds, then you can use this to argue that your respondents and non-respondents are not different.

The second approach is to look at the sequencing or time order of responses. The general idea is that the survey is sent out and a group responds. Later, a reminder is sent out, and some more people respond. A third iteration might occur, or there may be some follow-up with phone interviews. Now, the people who responded late were almost non-respondents, so if they are not different from the group of early respondents, then you can argue that the results you obtain from your sample would not be different if you were able to eventually get a response from everyone.

You may be able to use this idea with your data if you can follow up with a reminder to the non-respondents, or have other time-of-response information.

Hopefully, my answer has been on-target with respect to your question. If not, please let me know.

Regards,

StatsProf