Least-Squares Background Part 3: Assessing the Effects of the X-Variables

In this article, I discuss the next use of the regression coefficients, namely to try to assess the impact of each of the X-variables. This article is the third part of a five-part series providing a brief review of least-squares regression, which will serve as background for understanding the logistic regression output. In Part 1, the Kid Creative data was used as the basis of a least-squares regression example and the regression output was given. Part 2 outlined the uses of the regression coefficient table and then discussed using the p-values to determine which variables matter.

Coefficient Table Use #2: Determining the Effects of the X-Variables

In a regular least-squares regression, the effect of each X-variable is generally assessed by looking at the sign and magnitude of the corresponding regression coefficient. The table below shows just the regression coefficient estimates pulled from the coefficient table of the Kid Creative least-squares regression example (to see the entire regression output, click here):

Recall that in this example, Household Income was the Y-variable that was regressed on the X-variables listed in the table. (To review the set-up for this example, see Part 1).

Suppose we are interested in the relationship between gender and household income. If we look in the table above at the regression coefficient for IsFemale we see it is –3997.272. This would generally be interpreted as indicating that being female is associated with having about $4,000.00 less in household income (other things being constant). Note that I have carefully used the words “is associated” (rather than a word like “causes”) and included the parenthetical “other things being constant” in this interpretation as I will discuss further below.

Similarly, the coefficient of IsProfessional (\beta=11432.073) would generally be interpreted as indicating that being a professional is associated with an increased household income of about $11,400.00 (again, other things being constant).

In this way, the least-squares regression coefficients can be used to try to assess the impact of each of the X-variables on the Y-variable. What the regression coefficient shows is the estimate of the expected change in Y for a one-unit change in the corresponding X-variable.

In the two examples above, the variables IsFemale and IsProfessional are indicator or “dummy” variables that indicate the presence or absence of a condition (e.g., “1” means female, “0” means not female). A change of one unit for this kind of X-variable therefore means a change of category (e.g., from male to female). Thus, the regression coefficients of these dummy variables show the expected difference in the Y-variable between the corresponding categories.
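The dummy-variable interpretation is easiest to see in the simplest case, a regression with a single dummy X-variable, where the least-squares slope is exactly the difference between the two group means. The following is a minimal sketch of that special case only. The six incomes are made up for illustration and are not the Kid Creative data, and the variable names are my own.

```python
from statistics import mean

# Made-up incomes for illustration only; NOT the Kid Creative data.
is_female = [0, 0, 0, 1, 1, 1]
income = [30000.0, 34000.0, 38000.0, 28000.0, 30000.0, 32000.0]

x_bar = mean(is_female)
y_bar = mean(income)

# Closed-form least-squares slope and intercept for a single X-variable.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(is_female, income))
s_xx = sum((x - x_bar) ** 2 for x in is_female)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Group means computed directly from the data.
mean_not_female = mean([y for x, y in zip(is_female, income) if x == 0])
mean_female = mean([y for x, y in zip(is_female, income) if x == 1])
group_difference = mean_female - mean_not_female

print(f"slope (dummy coefficient): {slope:.2f}")
print(f"difference in group means: {group_difference:.2f}")
print(f"intercept (mean of the 0 group): {intercept:.2f}")
```

With these made-up numbers the slope and the difference in group means are both -4000, and the intercept is the mean of the “0” group. In a regression with several X-variables, such as the Kid Creative example, the dummy coefficient is instead the expected difference between categories with the other X-variables held fixed, so it will generally not equal the raw difference in group means.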

There are a number of important issues to note here. First, because linear regression (with no interaction terms) assumes a linear relationship between the dependent Y-variable and the independent X-variables, the “effect” of changing a variable is the same no matter what the values of the other variables are. For example, according to the linear regression model, being female is associated with about $4,000 less household income no matter what the values of the other variables are. That is, being female costs you $4,000.00 whether you are a professional or not, whether you are retired or not, and so on.

It is a property of linear equations that the effect of changing an X-variable is the same no matter what the values of the other variables are. Loosely speaking, this means that the interpretation of the regression coefficient is the same “everywhere” (no matter what the other X-variables are). This is one of the reasons that linear functions are so appealing and why we would like to use linear functions to fit our data whenever possible. Linear functions are relatively easy to understand and interpret. In contrast, non-linear functions are often very difficult to understand and interpret, as the effects of changing the X-variables depend on where on the surface of the function you are located.
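The “same everywhere” property can be checked with a few lines of arithmetic. The sketch below uses the IsFemale and IsProfessional coefficients quoted in this article. The intercept is a hypothetical placeholder (it cancels in every difference), the other X-variables are left out because they are held fixed and cancel as well, and the interaction coefficient in the second model is invented purely to show what a non-constant effect looks like.

```python
# Coefficients quoted in the article's Kid Creative table.
B_IS_FEMALE = -3997.272
B_IS_PROFESSIONAL = 11432.073

# Hypothetical values, for illustration only.
B0 = 30000.0              # placeholder intercept; cancels in every difference
B_INTERACTION = -2000.0   # invented interaction coefficient, not from the article


def predict_additive(is_female, is_professional):
    """Linear model with no interaction term."""
    return B0 + B_IS_FEMALE * is_female + B_IS_PROFESSIONAL * is_professional


def predict_with_interaction(is_female, is_professional):
    """Same model plus a hypothetical IsFemale x IsProfessional term."""
    return (predict_additive(is_female, is_professional)
            + B_INTERACTION * is_female * is_professional)


def female_gap(predict, is_professional):
    """Predicted difference (female minus not female), other variable fixed."""
    return predict(1, is_professional) - predict(0, is_professional)


# Gap for non-professionals (0) and professionals (1) under each model.
gap_additive = [female_gap(predict_additive, p) for p in (0, 1)]
gap_interaction = [female_gap(predict_with_interaction, p) for p in (0, 1)]

print("additive model:   ", [round(g, 3) for g in gap_additive])
print("interaction model:", [round(g, 3) for g in gap_interaction])
```

In the additive model the gap is the IsFemale coefficient in both cases. Once the invented interaction term is added, the gap depends on whether the person is a professional, which is the kind of “it depends where you are” behavior that makes non-additive and non-linear models harder to interpret.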

It is also important to understand that the regression coefficient shows the effect of changing the X-variable conditional on the values of the other X-variables (which means they stay fixed). While this statement sounds simple enough, there is more to this idea than might be immediately obvious. It may well be impossible in the real world to actually change just the X-value of one variable.

It is frequently the case that the X-variables in a regression are related to each other (perhaps causally). Thus, it might not make a lot of sense to think that you can change one X-variable without changing the others. For example, suppose I am looking at the household income for a high-school educated, non-professional person. It makes little sense to think about the household income of an otherwise identical person who is a professional. Being a professional almost certainly requires more than a high-school education, so it makes little or no sense to think about changing their professional status without also changing their level of education.

The fact that the regression coefficients show the effects of each variable conditional on the other variables can make their interpretation non-intuitive. This is because what a regression coefficient for a particular X-variable really shows is the effect above and beyond the effects of the other variables. To say it another way, the regression coefficient shows the effect of the variable once the effects of all of the other variables have been removed. So what exactly does this mean? The Kid Creative regression output provides an example which may help to clarify things.

In the regression output, notice that the regression coefficient for “English” is negative (\beta=-4273.756). This is counter-intuitive. In the United States, I would have expected that having English as your primary language would be associated with a higher, not lower, income. This counter-intuitive result led me to think a little about what might be going on.

First, I looked at what I considered to be a related variable, namely race (“White” or not). As expected, the coefficient for “White” is positive (\beta=7259.981). As I thought about it, I began to suspect that virtually all of the “White” survey respondents would also be English speakers. If this is true, then there would be virtually no variation in the value of the English variable for white people (the idea is that if all white people speak English, if White is 1, then English is also 1).

Now if all of the variation in the English variable is occurring for non-white people, then the English variable may be capturing income differences between non-white racial groups. For example, it may be capturing the difference between income for African Americans (English speakers) and Hispanics (non-English speakers) or possibly Asians (also non-English speakers). Since it is at least plausible that African Americans might make less than Hispanics or Asians, it becomes plausible that the variable English might be associated with a decrease in household income.

All of this is speculative, as I have not done the detailed analysis necessary to see exactly what is going on. I did, however, as a sanity check, calculate the average income for English speakers (English = 1) and non-English speakers (English = 0). The average income for non-English speakers is $29,819.67. The average income for English speakers is $35,602.94. Thus, English speakers earn on average $5,783.27 more than non-English speakers. This difference is positive as expected. Comparing this $5,783.27 to the negative regression coefficient of –$4,273.76 illustrates why the conditional nature of regression coefficients can make their interpretation difficult.
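The gap between a raw difference in averages and a conditional regression coefficient can be reproduced with a tiny synthetic data set. The sketch below is not the Kid Creative data. The three groups, their sizes, and their incomes are invented so that every “White” respondent is also an English speaker, mirroring the speculation above, and NumPy's `np.linalg.lstsq` is used to fit the regression.

```python
import numpy as np

# Synthetic cells: (white, english, income, number of respondents).
# All numbers are invented for illustration; NOT the Kid Creative data.
cells = [
    (1, 1, 40000.0, 6),  # white English speakers
    (0, 1, 26000.0, 2),  # non-white English speakers
    (0, 0, 30000.0, 4),  # non-white non-English speakers
]

# Expand the cells into one row per respondent.
rows = [(w, e, y) for w, e, y, n in cells for _ in range(n)]
white = np.array([r[0] for r in rows], dtype=float)
english = np.array([r[1] for r in rows], dtype=float)
income = np.array([r[2] for r in rows], dtype=float)

# Raw comparison: average income of English vs. non-English speakers.
raw_difference = income[english == 1].mean() - income[english == 0].mean()

# Least-squares fit of income on an intercept, White, and English.
X = np.column_stack([np.ones_like(white), white, english])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
intercept, b_white, b_english = coef

print(f"raw difference in means (English - not): {raw_difference:.2f}")
print(f"coefficient of White:   {b_white:.2f}")
print(f"coefficient of English: {b_english:.2f}")
```

In this synthetic data the raw difference is +6,500 while the coefficient of English is -4,000. Because all of the variation in English occurs among non-white respondents, the English coefficient only compares the two non-white groups, and the White coefficient absorbs the higher incomes of the white English speakers. Whether this is what actually happens in the Kid Creative data remains speculation.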

Finally, it is very important to keep in mind that most statistical analysis of non-experimental data shows association, not causation. Association is very different from causation for a lot of reasons. This is a huge topic that I cannot expand on fully here, but I will give one quick example.

The $4,000.00 “effect” of gender on household income could be due to many things. It might be due to discrimination or it might be due to any of a virtually unlimited number of variables that are not in the regression model. For example, females may work fewer hours than males due to family demands or for a host of other reasons such as a desire to have a more balanced life. Gender does not “cause” this difference. Gender may be associated with lower income in this data set, but this is very different from having identified a cause.

In quite a few instances in my discussion above, I have talked about the regression coefficient showing the “effect” of changing the X-variable. This is, in fact, how people tend to talk about the regression coefficients, but it is usually wrong because the regression shows association, not causation. Frankly, it is hard to talk about the interpretation of regression coefficients without using causal language. Whether causal language is used or not, it is very important not to forget that regressions generally show association, not causation. Interpreting them as showing causation is usually very, very risky.

As a final note, there is a relationship between the topic discussed here (the effects of the X-variables) and the topic discussed in Part 2 of this series (determining which X-variables matter). If there is no statistical evidence that a regression coefficient matters, then it really does not make a great deal of sense to interpret its impact as described in this article even though the corresponding estimated regression coefficient is not 0. In fact, in this article I have ignored the issue of our uncertainty about the true regression coefficient values. I will return to this in Part 5.

In summary, one of the common and important uses of regression coefficients is to try to assess the impact of the X variables. While this is commonly done, often with a causal relationship implied, causal interpretation should be done with great caution.

In the next part of the background series, I will discuss another very important use of the regression coefficient table, namely making predictions. Click here to proceed to Part 4.
