How Big a Sample? How Many X-Variables?

This article addresses how much data you need in order to run a standard logistic regression analysis.

As is the case in essentially all statistical analysis, the amount of data you need in logistic regression depends on the number of parameters that you are trying to estimate. In regression (both logistic regression and regular least-squares regression), the number of parameters you are trying to estimate depends on the number of independent X-variables in the model. That is, there is a relationship between the number of observations and the number of X-variables. So the sample size issue can be viewed in two ways. The first is, given a number of X-variables, how many observations do I need? The second is, given a number of observations, how many X-variables can I use?

In logistic regression (in comparison to regular least-squares regression), there is an additional issue created by the fact that the dependent Y-variable is dichotomous. In regular least-squares regression, when we talk about required sample size, it is in the context of a situation where all of the Y-values will be different. This has to be the case in least-squares regression if the error term really follows a normal distribution. In practice, you may have some stacking up of values (because of rounding, for example), but almost all of the Y-values will be different.

In logistic regression, since the Y-values only take on the values 0 or 1, there obviously will be a great deal of stacking up of the values. Thus, how many X-variables I can fit also depends on how “spread out” the Y-values are. This is captured by looking at the smaller of two counts: the number of 0’s and the number of 1’s.

Specifically, suppose:

n_0 = the number of 0’s
n_1 = the number of 1’s
n_{\text{min}} = \min(n_0,n_1)

What we need in logistic regression is for n_{\text{min}} to be about 10 times the number of parameters.

In logistic regression, the number of parameters is one more than the number of X’s (because there is one additional parameter for the intercept). So if there are p X-variables, you need

n_{\text{min}} \ge 10(p+1).

Please note that this is a rule of thumb. As such, it is an approximation which may not work in all circumstances.
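The first way of viewing the question (given a number of X-variables, how many observations do I need?) can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above, not a formal power calculation. The function names and the `minority_fraction` argument are assumptions introduced here: the rule itself only constrains n_{\text{min}}, so turning it into a total sample size requires guessing what share of the observations will fall in the rarer outcome.

```python
import math


def required_n_min(p, per_param=10):
    """Smallest n_min allowed by the rule of thumb n_min >= per_param * (p + 1).

    p is the number of X-variables; the +1 accounts for the intercept.
    """
    if p < 0:
        raise ValueError("p must be non-negative")
    return per_param * (p + 1)


def required_total_n(p, minority_fraction, per_param=10):
    """Rough total sample size so the expected count of the rarer outcome
    reaches required_n_min(p).

    minority_fraction is a hypothetical input: the proportion of observations
    you expect to fall in the rarer of the two outcomes (between 0 and 0.5).
    """
    if not 0 < minority_fraction <= 0.5:
        raise ValueError("minority_fraction must be in (0, 0.5]")
    return math.ceil(required_n_min(p, per_param) / minority_fraction)


# Five X-variables need at least 10 * (5 + 1) = 60 of the rarer outcome.
print(required_n_min(5))            # 60
# If about a quarter of observations are the rarer outcome, 60 / 0.25 = 240.
print(required_total_n(5, 0.25))    # 240
```

Because the rule is an approximation, the results are best read as a starting point rather than a guarantee.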

Here is an example. Suppose that I have 436 observations, of which 78 were “successes” and the rest “failures” (i.e., 436-78=358 failures). Then n_{\text{min}}=\min(78,358)=78. Solving 78 \ge 10(p+1) for p gives p \le (78/10) - 1 = 6.8. Taken strictly, that allows 6 X-variables, but since this is only a rule of thumb, my model can have 6 to 7 X-variables.
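The arithmetic in this example can be checked with a short Python sketch. The function name `max_predictors` is made up for illustration; it simply rearranges n_{\text{min}} \ge 10(p+1) to solve for p.

```python
import math


def max_predictors(n_zeros, n_ones, per_param=10):
    """Upper limit on the number of X-variables under the rule of thumb
    n_min >= per_param * (p + 1), i.e. p <= n_min / per_param - 1."""
    n_min = min(n_zeros, n_ones)
    return n_min / per_param - 1


n_total = 436
n_success = 78
n_failure = n_total - n_success                 # 358 failures

limit = max_predictors(n_failure, n_success)    # based on n_min = 78
print(f"n_min = {min(n_failure, n_success)}")   # n_min = 78
print(f"limit on p = {limit:.1f}")              # limit on p = 6.8
print(f"strict whole-number limit = {math.floor(limit)}")  # 6
```

The fractional limit of 6.8 is why the example above speaks of 6 to 7 X-variables: rounding down is the conservative reading, and rounding up stretches the rule only slightly.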

For more information, see the book Applied Logistic Regression (Wiley Series in Probability and Statistics). This topic is addressed in Chapter 10 (“Special Topics”) in Section 10.5 (“Sample Size Issues When Fitting Logistic Regression Models”).


One Response to How Big a Sample? How Many X-Variables?

  1. John theis says:

    I am trying to use a logistic regression to approximate a Cox proportional hazards model. How can I interpret the coefficients in the model? I have already adjusted for the significance of the model.
