What are Z-Values in Logistic Regression?

Very Short Answer

The z-value is the regression coefficient divided by its standard error. It is also sometimes called the z-statistic. It is usually given in the third column of the logistic regression coefficient table output. Thus, in the example below, the z-value for the regression coefficient for ResidenceLength is 0.024680/0.013800 = 1.79.

If the z-value is too big in magnitude (i.e., either too positive or too negative), it indicates that the corresponding true regression coefficient is not 0 and the corresponding X-variable matters. A good rule of thumb is to use a cut-off value of 2 which approximately corresponds to a two-sided hypothesis test with a significance level of \alpha=0.05. So, for the ResidenceLength variable, the z-value is 1.79 which is not large enough to provide strong evidence that ResidenceLength matters.
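As a quick check of the arithmetic, here is a minimal Python sketch. The two numbers are the ResidenceLength values quoted above; the variable names and the constant `RULE_OF_THUMB_CUTOFF` are just illustrative choices.

```python
# Values quoted in the text for the ResidenceLength coefficient.
coefficient = 0.024680
standard_error = 0.013800

# The z-value is the coefficient divided by its standard error.
z_value = coefficient / standard_error

# Rule of thumb from the text: |z| > 2 roughly matches a two-sided
# test at a significance level of 0.05.
RULE_OF_THUMB_CUTOFF = 2.0
is_significant = abs(z_value) > RULE_OF_THUMB_CUTOFF

print(f"z-value = {z_value:.2f}")
print(f"significant by the rule of thumb: {is_significant}")
```

The comparison uses the absolute value because a large negative z-value is just as much evidence against a zero coefficient as a large positive one.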

Note: The relationship between the regression coefficient, its standard error, the z-value, and the p-value is virtually identical in both logistic regression and regular least-squares regression. So if you understand this in regular regression, you also understand it in logistic regression.

Detailed Explanation

In statistics, the letter “Z” is often used to refer to a random variable that has a standard normal distribution. A standard normal distribution is a normal distribution with expectation 0 and standard deviation 1. This is the normal distribution that is generally tabulated in the back of any basic statistics book.

Because of this, the term “z-value” is often used to refer to the value of a statistic that has a standard normal distribution. Sometimes it is also used to refer to percentile points from the standard normal distribution that are compared to the value of a statistic. For example, one might refer to “the z-value corresponding to a 95% confidence interval” (which would be 1.96).
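If you do not have a table handy, the percentile points can be looked up in software. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_for_confidence` is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: expectation 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def z_for_confidence(confidence_level):
    """Return the z-value that brackets the central `confidence_level`
    share of the standard normal distribution."""
    tail = (1.0 - confidence_level) / 2.0  # probability left in each tail
    return standard_normal.inv_cdf(1.0 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_for_confidence(level):.3f}")
```

For a 95% confidence level this gives the familiar value of about 1.96.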

In basic univariate statistics, z-statistics and z-values usually come about as a result of standardizing a statistic such as the sample mean \bar X or sample proportion \hat p. Standardizing a statistic means subtracting its expected value \mu_\text{stat} and then dividing by its standard error \sigma_\text{stat} (the standard error of a statistic is its standard deviation). The leading example of this from basic statistics would be a z-statistic derived from the sample mean:

    \[Z = { \bar X - \mu \over \sigma/\sqrt{n} }\]

Here \bar X is the statistic to be standardized (the sample mean), \mu is its expectation (which, for the sample mean, is the same as the population mean), and the standard deviation of the statistic \sigma_\text{stat} is \sigma_{\bar X} = \sigma/\sqrt{n}. Here \sigma is the population standard deviation, and the formula \sigma_{\bar X} = \sigma/\sqrt{n} comes about because of the relationship between the standard deviation of a sample mean and the population standard deviation. Finally, the statistic has an approximately normal distribution as the sample size gets large because of the central limit theorem.
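To make the standardization concrete, here is a small Python sketch. The sample, the hypothesized mean `mu_0`, and the population standard deviation `sigma` are all invented for illustration; they do not come from any real data set.

```python
from math import sqrt
from statistics import mean


def z_statistic_for_mean(sample, population_mean, population_sd):
    """Standardize the sample mean: subtract its expectation and
    divide by its standard error sigma / sqrt(n)."""
    standard_error = population_sd / sqrt(len(sample))
    return (mean(sample) - population_mean) / standard_error


# Hypothetical inputs (assumptions for this example only).
sample = [52.0, 48.5, 51.0, 53.5, 50.0, 49.0, 54.0, 52.5, 51.5]
mu_0 = 50.0   # assumed population mean
sigma = 3.0   # assumed known population standard deviation

z = z_statistic_for_mean(sample, mu_0, sigma)
print(f"sample mean = {mean(sample):.3f}, z = {z:.3f}")
```

With nine observations and sigma = 3, the standard error is exactly 1, so the z-statistic is simply the distance of the sample mean from 50.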

Hopefully, the above reminds you about enough of your basic statistics that using these ideas in the context of logistic regression will make sense.

So where do z-values come about in logistic regression? They primarily come about as a result of standardizing the logistic regression coefficients when testing whether or not the individual X-variables are related to the Y-variables. For example, consider the coefficient table output from the logistic regression in the “Kid Creative” example I discussed in the post Understanding Logistic Regression Output: Part 2 — Which Variables Matter.

You will see that the z-values are given in the third column of numbers in the table. These z-values are computed as the test statistic for the hypothesis test that the true corresponding regression coefficient \beta is 0. (Note: The p-values computed from the z-values are given in the fourth column of numbers in the regression coefficient output table. I generally do not look at the z-values, but rather use the p-values.)

More specifically, suppose we want to determine if an X-variable matters (that is, has a significant relationship to the Y variable). We determine this by testing the null hypothesis that the corresponding regression coefficient \beta is 0. In hypothesis testing, we assume the null hypothesis is true, and then see if the data provide evidence against it. So in this case, we assume \beta is 0. That is, we assume the expectation of the fitted regression coefficient \hat\beta is 0. So we standardize the regression coefficient as follows:

    \[Z = {\hat\beta - 0 \over \hat\sigma_{\hat\beta} } = \hat\beta/\hat\sigma_{\hat\beta}\]

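Note that there is no closed-form formula for \hat\sigma_{\hat\beta}. The fitted coefficients are the solution to a non-linear system of equations that is solved iteratively, and the standard errors are then computed numerically from the inverse of the information matrix evaluated at those fitted coefficients.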
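To give a feel for what the software does behind the scenes, here is a minimal pure-Python sketch, assuming a model with one X-variable plus an intercept. It fits the coefficients by Newton-Raphson and takes the standard errors from the inverse of the information matrix. The data and the function names are hypothetical, and real statistical packages add numerical safeguards that are omitted here.

```python
from math import exp, sqrt


def score_and_covariance(b0, b1, x, y):
    """Return the score vector and the inverse information matrix
    (as v00, v01, v11) for intercept b0 and slope b1."""
    p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
    w = [pi * (1.0 - pi) for pi in p]
    # Score: derivative of the log-likelihood with respect to b0 and b1.
    u0 = sum(yi - pi for yi, pi in zip(y, p))
    u1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    # Information matrix entries (2 x 2, symmetric).
    i00 = sum(w)
    i01 = sum(wi * xi for wi, xi in zip(w, x))
    i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = i00 * i11 - i01 * i01
    # Inverse of a 2 x 2 symmetric matrix.
    return (u0, u1), (i11 / det, -i01 / det, i00 / det)


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit; returns coefficients and standard errors."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        (u0, u1), (v00, v01, v11) = score_and_covariance(b0, b1, x, y)
        b0 += v00 * u0 + v01 * u1
        b1 += v01 * u0 + v11 * u1
    # Standard errors: square roots of the diagonal of the inverse
    # information matrix, evaluated at the final estimates.
    _, (v00, _, v11) = score_and_covariance(b0, b1, x, y)
    return (b0, b1), (sqrt(v00), sqrt(v11))


# Hypothetical data (not from the Kid Creative example).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

(b0, b1), (se0, se1) = fit_logistic(x, y)
print(f"slope = {b1:.4f}, standard error = {se1:.4f}, z = {b1 / se1:.2f}")
```

The last line divides the fitted slope by its standard error, which is exactly the z-value that appears in the coefficient table.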

So, for example, consider the ResidenceLength regression coefficient in the coefficient output table above. For this variable \hat\beta=0.024680 and \hat\sigma_{\hat\beta}=0.013800, so the Z-value is \hat\beta/\hat\sigma_{\hat\beta} = 0.024680/0.013800 = 1.79. So the value in the third column of numbers is Z=1.79.
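The p-value reported next to the z-value can be recovered from it. The sketch below assumes a two-sided test and uses `NormalDist` from Python's standard `statistics` module; the function name `two_sided_p_value` is just an illustrative choice.

```python
from statistics import NormalDist


def two_sided_p_value(z_value):
    """Probability that a standard normal variable is at least as far
    from 0 as the observed z-value, in either direction."""
    upper_tail = 1.0 - NormalDist().cdf(abs(z_value))
    return 2.0 * upper_tail


# ResidenceLength values quoted in the text.
z_value = 0.024680 / 0.013800
p_value = two_sided_p_value(z_value)

print(f"z = {z_value:.2f}, two-sided p-value = {p_value:.3f}")
```

For ResidenceLength the p-value comes out somewhat above 0.05, which agrees with the z-value falling short of the cut-off of 2.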

How do we interpret the Z-values? As a rough rule of thumb, if the absolute value of the Z-value is bigger than 2.0, the variable is significant (which means that there is statistical evidence that it is related to the Y variable). This gives a rough hypothesis test with a significance level of about \alpha=0.05.

More precisely, in the hypothesis test, select a significance level such as \alpha=0.05. Determine the corresponding critical value for the test. This will depend on whether the hypothesis test is one-sided or two-sided. If it is one-sided, the critical value will be the upper \alpha percentage point of the standard normal distribution (generally referred to as z_\alpha). If it is a two-sided test (most common), then the critical value is the upper \alpha/2 percentage point (generally referred to as z_{\alpha/2}). The absolute value of the Z-value is then compared to the appropriate critical value to determine if the test is significant. That is, the regression coefficient is significantly different from 0 if:

    \[|\text{Z-value}| \ge z_\alpha \;\;\;\text{(one-sided)}\]

or

    \[|\text{Z-value}| \ge z_{\alpha/2} \;\;\;\text{(two-sided)}\]
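The two decision rules above can be sketched in a few lines of Python. The function names `critical_value` and `is_significant` are hypothetical, and the code follows the formulas exactly as written here, comparing the absolute value of the Z-value to the critical value in both cases.

```python
from statistics import NormalDist


def critical_value(alpha, two_sided=True):
    """Upper alpha (one-sided) or alpha/2 (two-sided) percentage
    point of the standard normal distribution."""
    tail = alpha / 2.0 if two_sided else alpha
    return NormalDist().inv_cdf(1.0 - tail)


def is_significant(z_value, alpha=0.05, two_sided=True):
    """Apply the rule |Z-value| >= critical value."""
    return abs(z_value) >= critical_value(alpha, two_sided)


# ResidenceLength Z-value from the text.
z_value = 1.79
print("two-sided critical value:", round(critical_value(0.05), 3))
print("one-sided critical value:", round(critical_value(0.05, two_sided=False), 3))
print("significant (two-sided):", is_significant(z_value))
print("significant (one-sided):", is_significant(z_value, two_sided=False))
```

Note that the answer for ResidenceLength depends on the kind of test: 1.79 falls below the two-sided critical value of about 1.96 but above the one-sided critical value of about 1.645.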


4 Responses to What are Z-Values in Logistic Regression?

  1. Pingback: A gentle introduction to logistic regression and lasso regularisation using R | Eight to Late

  2. Sai Potturi says:

    this was really useful. Thank you.

  3. AEM says:

    This is really useful, thanks. Maybe a more general/basic question. If z-scores are so correlated to p-values and we actually attribute a cut-off of 2 for hypothesis testing, what does it brings in addition to the p-value information?

    Cheers!

  4. gangan says:

    I think your address is very useful to us.I also appreciate for you had written your post. long time no update. Wish you can back to this blog.
