Just to be sure that you have a clear idea of what a data set that is appropriate for logistic regression analysis looks like, I am providing an example in this article. As I indicated in the previous article, a multivariate logistic regression data set is essentially the same as a multivariate regular linear regression data set except that the dependent variable is binary.
The Context – A Data Mining Example
In this example, a magazine reseller is trying to decide what magazines to market to customers. In the “old days,” this might have involved trying to decide which customers to send advertisements to via regular mail. In the context of today and the “web,” this might involved deciding what recommendations to make to a customer viewing a web page about other items that the customer might be interested in and therefore want to buy. The two problem are essentially the same.
In this example, the website MZines4You.com (a fictious name I made up which was unclaimed when this article was written) wants to decide what magazines to include in e-mails to customers as a part of an e-mail marketing campaign. All of the e-mails that will be sent will go to customers that have previously bought a magazine subscription at MZines4You.com and who have not opted out of receiving e-mails.
The magazines advertised in each e-mail will be automatically selected specifically for each customer when the e-mail is generated in order to maximize the probability that the customer will buy. MZines4You.com will only include ads for three magazines in each e-mail in a row at the top of the message because management believes that including more ads is ineffective. MZines4You.com also believes that including only three ads makes it much more likely that the ads will appear in the recipient’s e-mail preview and therefore actually be viewed (without the recipient actually having to open the e-mail).
The Sample – Obtaining The Data
Because all of the recipients of the e-mails have previously made a purchase at MZines4You.com, the company can match the data collected when the customer made their previous purchase with third party data (which can be purchased from data sources such as the credit scoring agencies) so they have quite a lot of information about each customer. For example, they have data such as income, number of people in the household, and so on. This kind of merging of data from multiple sources to assemble a remarkably rich “profile” of each customer is becoming increasingly common.
Here are the variables that MZines4You.com has on each customer from third-party sources:
- Household Income (Income; rounded to the nearest $1,000.00)
- Gender (IsFemale = 1 if the person is female, 0 otherwise)
- Marital Status (IsMarried = 1 if married, 0 otherwise)
- College Educated (HasCollege = 1 if has one or more years of college education, 0 otherwise)
- Employed in a Profession (IsProfessional = 1 if employed in a profession, 0 otherwise)
- Retired (IsRetired = 1 if retired, 0 otherwise)
- Not employed (Unemployed = 1 if not employed, 0 otherwise)
- Length of Residency in Current City (ResLength; in years)
- Dual Income if Married (Dual = 1 if dual income, 0 otherwise)
- Children (Minors = 1 if children under 18 are in the household, 0 otherwise)
- Home ownership (Own = 1 if own residence, 0 otherwise)
- Resident type (House = 1 if residence is a single family house, 0 otherwise)
- Race (White = 1 if race is white, 0 otherwise)
- Language (English = 1 is the primary language in the household is English, 0 otherwise)
So how might MZines4You.com decide what magazines to market to each person; that is, what ads to put in each e-mail? One way would be to develop an equation (this is where multivariate logistic regression comes in) that predicts the probability that a customer will buy a particular magazine based on the data that the company has about the customer. Such an equation would be developed for each magazine that the company sells.
If Mzines4You.com has such a model for each magazine that they sell, they can calculate the probability that the customer will buy for each one of the magazines they offer. Then they can put the top three magazines in the e-mail (that is, the three that the model predicts the customer is most likely to buy). Note: MZines4You.com might do more complicated things than just look at the predicted probabilities (such as looking at the expected profit from the sale), but for simplicity let’s just assume that the goals is to put ads in the e-mail for the three magazines that the customer is most likely to buy.
In order to be able to develop an equation that predicts the probability that a customer will buy a particular magazine, the company will need to run an experiment in order to collect data on customer purchase behavior. One way to do this is to randomly select some customers from the customer database and then send them e-mails with randomly selected ads. Whether or not these customer buy the advertised magazines can provide the data necessary estimate the equations that will be used to predict the probability that a customer purchases a particular magazine.
If you have a large number of magazines that you sell, you may need to send out a large number of e-mails in order to get useful prediction equations. Making sure that you end up with enough data for each magazine to end up with a useful equation for predicting the probability of purchase can be a bit complicated (and require a large number of e-mails in the experiment), but I am not going to delve into these issues in this example.
So the problem of deciding what magazine ads to place in each e-mail boils down to developing an equation for each magazine that predicts the probability that a customer will buy. We are now going to focus on the issue of developing such an equation for one magazine (“Kid Creative”) whose target audience are children between the ages of 9 and 12. In the process of sending out the “experimental” e-mails, the ad for “Kid Creative” was shown in 673 e-mails to customers and the purchase behavior recorded.
In addition to the variables for each customer listed above (the ones obtained from 3rd party sources), Mzines4You.com has the following variables from their own databases:
- Previously purchased a parenting magazine (PrevParent = 1 if previously purchased a parenting magazine, 0 otherwise).
- Previously purchased a children’s magazine (PrevChild = 1 if previously purchased a children’s magazine)
The dependent variable comes from the “experiment;” that is, from the 763 e-mails to customers containing the ad for “Kid Creative” and whether or not the the customer purchased the magazine. That is, the dependent variable is
- Purchased “Kid Creative” (Buy = 1 if purchased “Kid Creative,” 0 otherwise)
So here is what the data looks like (with some columns and rows omitted so that the table fits on the page):
Note that the variable, “Buy,” is binary (i.e., 0 or 1) as is required for a logistic regression. The independent variables can be binary or not (for example, Income is not a binary variable), just as in a regular least-squares regression.
I hope that this example makes it very clear what sort of data set is suitable for logistic regression. If you have any questions or comments, please let me know below.