By James Kwak
Last week, the Washington Post summarized a draft paper by Jonathan Rothwell of Gallup on the demographic correlates of support for Donald Trump. As various people have noted, the headline was a bit over-the-top:
The “widespread theory,” of course, is the idea that Trump supporters are, at least in part, motivated by economic anxiety, an idea that sophisticated columnists such as Matt Yglesias like to make fun of, as I discussed recently.
The article itself is considerably more circumspect than its headline. (Note to those who don’t know: Headlines are written by editors, not the people on the byline.) This is the summary near the top of the article:
According to this new analysis, those who view Trump favorably have not been disproportionately affected by foreign trade or immigration, compared with people with unfavorable views of the Republican presidential nominee. The results suggest that his supporters, on average, do not have lower incomes than other Americans, nor are they more likely to be unemployed. [Actually, according to the paper, they are more likely to be unemployed, but that’s not particularly important.]
Yet while Trump’s supporters might be comparatively well off themselves, they come from places where their neighbors endure other forms of hardship. In their communities, white residents are dying younger, and it is harder for young people who grow up poor to get ahead.
The paper itself is more circumspect still. Here’s an excerpt:
Higher household income predicts a greater likelihood of Trump support overall and among whites, though not among white non-Hispanic Republicans. In other words, compared to all non-supporters or even other whites, Trump supporters earn more than non-supporters, conditional on these factors, but this is partly because Republicans, in general, earn higher incomes, and the difference is no longer significant when restricted to this group. …
On the other hand, workers in blue collar occupations (defined as production, construction, installation, maintenance, and repair, or transportation) are far more likely to support Trump, as are those with less education. … Since blue collar and less educated workers have faced greater economic distress in recent years, this provides some evidence that economic hardship and lower-socio-economic status boost Trump’s popularity.
Before we go further, let’s make sure we understand exactly what this paper does and does not show. For the most part, it’s based on a probit regression of the likelihood a person will support Trump (that’s the dependent, or left-side variable) on a long list of variables for that person (employment status, religion, etc.) and a long list of variables measured for the area in which that person lives (share with BA degree, share of manufacturing jobs, etc.). For each variable, there is a regression coefficient that shows the impact of that variable on the likelihood of supporting Trump, and then an indication of whether that variable is statistically significant. For example, in model 1, looking at all people, being unemployed increases the chances that someone will support Trump by about 5%, which is significant at the 99% level.
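For readers who want the model in symbols, here is the standard textbook form of a probit regression. It is a generic sketch, not a formula quoted from the paper; the symbols y_i (1 if person i supports Trump), x_i (that person's individual and regional variables), and β (the coefficients) are notation assumed here for illustration.

```latex
\Pr(y_i = 1 \mid x_i) = \Phi\!\left(x_i^{\top}\beta\right),
\qquad
\frac{\partial \Pr(y_i = 1 \mid x_i)}{\partial x_{ij}} = \phi\!\left(x_i^{\top}\beta\right)\beta_j
```

Here Φ is the standard normal cumulative distribution function and φ is its density. The second expression is the marginal effect of variable j on the probability of support. A figure like "about 5%" is best read on that probability scale rather than as a raw coefficient, though that reading is an assumption here.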
There are two reasons why this paper says less than readers might think. The first is that many of the right-side (explanatory) variables are highly correlated. When you have highly correlated explanatory variables, the coefficients on individual variables can be wildly inaccurate. Let’s say you are trying to figure out what factors determine the number of words in a child’s vocabulary. In your model, you include age, since kids learn more words as they get older. You also include grade in school, since they learn more words the longer they spend in school. Do you see the problem? Age and grade are almost perfectly correlated; you’re basically using two variables when there is only one in real life, so the estimated coefficients will be highly volatile.
You might find that age is significant but not grade; or vice-versa; or that both are significant. If both are significant, you might conclude that both have a positive impact on vocabulary: that is, fourth graders know more words than third graders, but within any grade, older kids know more words than younger kids. That sounds plausible, but it would be a mistake. When explanatory variables are highly correlated, results are extremely sensitive to outliers. If you have one older kid in fourth grade who knows lots of words, you could get a positive coefficient on age; but if that one kid doesn’t know very many words, you could get a negative coefficient.
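The vocabulary example can be simulated to see how unstable the coefficients get. The sketch below is illustrative only, and every number in it is an assumption: ages run from 6 to 12, grade is treated as a continuous number that tracks age almost exactly, and the "true" world adds 1,000 words per year of age while grade contributes nothing on its own. It uses ordinary least squares rather than the paper's probit, purely to keep the code short.

```python
import random
import statistics


def slope_one(x, y):
    """OLS slope for y ~ 1 + x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def slopes_two(x1, x2, y):
    """OLS slopes for y ~ 1 + x1 + x2, from the centered normal equations."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def simulate(trials=500, n_kids=50, seed=0):
    """Refit the model on many fresh samples and record how the slopes vary."""
    rng = random.Random(seed)
    age_alone, age_joint, grade_joint = [], [], []
    for _ in range(trials):
        # Hypothetical data: ages 6 to 12; grade tracks age almost exactly.
        age = [rng.uniform(6, 12) for _ in range(n_kids)]
        grade = [a - 5 + rng.gauss(0, 0.3) for a in age]
        # Assumed truth: 1,000 words per year of age plus noise; grade adds nothing.
        vocab = [1000 * a + rng.gauss(0, 2000) for a in age]
        age_alone.append(slope_one(age, vocab))
        b_age, b_grade = slopes_two(age, grade, vocab)
        age_joint.append(b_age)
        grade_joint.append(b_grade)
    return {
        "sd_age_alone": statistics.stdev(age_alone),
        "sd_age_joint": statistics.stdev(age_joint),
        "share_age_negative": sum(b < 0 for b in age_joint) / trials,
        "share_grade_negative": sum(b < 0 for b in grade_joint) / trials,
    }


results = simulate()
for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, the age coefficient should swing far more from sample to sample when grade is also in the model than when age is used alone, and in some samples it should come out negative even though the assumed true effect is positive. The overall fit is barely affected; only the split between the two variables is arbitrary.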
How does this apply to this paper? The individual explanatory variables include, among other things: employment status (e.g., self-employed); religion; “works for government”; sex; marital status; “works in blue collar occupation”; union member, non-government; race and ethnicity; highest degree; and household income. The regional explanatory variables include: share of college graduates; share of manufacturing jobs; median income; share of white people; and white mortality rate. All of those variables are obviously correlated with income, particularly highest degree. So we have the same problem described above—too many variables for the amount of variance in our sample—which produces arbitrary results. (One way to think about this is that you could use a bunch of those variables to predict household income pretty accurately, at which point the household income variable itself becomes unnecessary.)
The Washington Post writeup remained blissfully unaware of this problem:
After statistically controlling factors such as education, age and gender, Rothwell was able to determine which traits distinguished those who favored Trump from those who did not, even among people who appeared to be similar in other respects.
This is the argument that the statistical significance of the income coefficient means that, among people who are otherwise identical, higher income does have an effect (pro-Trump, in this case). But as explained above, that’s a fallacy. Multicollinearity, as this statistical problem is called, means that individual coefficients are unreliable. The model as a whole may predict support for Trump pretty well, but you have no way of knowing which variables are doing the predicting.
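The textbook result for ordinary linear regression makes the problem concrete. It is offered as intuition only: the paper uses a probit model, where this expression does not hold exactly, and the symbols are notation assumed here rather than taken from the paper.

```latex
\operatorname{Var}\!\left(\hat{\beta}_j\right)
  = \frac{\sigma^2}{\left(1 - R_j^2\right)\sum_i \left(x_{ij} - \bar{x}_j\right)^2}
```

Here σ² is the error variance and R_j² is the R-squared from regressing explanatory variable j on all the other explanatory variables. If the other variables predict household income well, R_j² for income is close to 1, the denominator shrinks toward zero, and the uncertainty around the income coefficient grows accordingly.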
That’s the first problem with this paper: we can’t trust the coefficients. The second problem is one of interpretation. Even if we accept for a moment the coefficients on the explanatory variables, the paper says nothing about why people actually support Trump; it’s just a long list of correlations.
So imagine this simple world. There are 100 people. 50 are poor and 50 are rich. In each group, one half (25 people) vote based on their feelings, such as economic anxiety. The other half vote based on their interests. So the electorate falls into four boxes of 25 people each: Feelings/Poor, Feelings/Rich, Interests/Poor, and Interests/Rich.
Of the people who vote their feelings, let’s say economic anxiety does increase support for Trump. So Trump gets 15 of the people in the Feelings/Poor box but only 10 people in the Feelings/Rich box. For people who vote their interests, however, income is positively correlated with Trump support, since he has promised to cut their taxes. So Trump gets 20 of the people in the Interests/Rich box but only 5 people in the Interests/Poor box (because the other 20 realize that Hillary Clinton’s policies will be better for them).
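Now our exit poll looks like this: Trump gets 20 of the 50 poor voters (15 who vote their feelings plus 5 who vote their interests) and 30 of the 50 rich voters (10 plus 20). That is, Trump gets only 40% of the poor voters, but 60% of the rich voters.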
That’s what the Gallup paper shows, and that’s what the Washington Post editors used as their headline: rich people prefer Trump, so economic anxiety is a myth. But I constructed this outcome using a model that explicitly incorporated economic anxiety as a factor (in the Feelings row, Trump does better with poor people). In other words, the economic anxiety story is consistent with a study showing that, on average, rich people prefer Trump.
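The arithmetic of this toy electorate is easy to check. The sketch below uses only the numbers given in the example; the names `trump`, `CELL_SIZE`, and `trump_share` are made up for illustration.

```python
# Trump supporters in each box of 25 voters, taken from the example above.
CELL_SIZE = 25
trump = {
    ("feelings", "poor"): 15,
    ("feelings", "rich"): 10,
    ("interests", "poor"): 5,
    ("interests", "rich"): 20,
}


def trump_share(motive=None, income=None):
    """Trump's share among voters matching the given motive and/or income."""
    cells = [
        votes
        for (m, i), votes in trump.items()
        if motive in (None, m) and income in (None, i)
    ]
    return sum(cells) / (CELL_SIZE * len(cells))


# What an exit poll sees: income only.
print("poor:", trump_share(income="poor"))  # 20 of 50
print("rich:", trump_share(income="rich"))  # 30 of 50

# What it cannot see: the split inside the Feelings row.
print("feelings, poor:", trump_share("feelings", "poor"))  # 15 of 25
print("feelings, rich:", trump_share("feelings", "rich"))  # 10 of 25
```

The income-only view shows Trump doing better among the rich, while the Feelings row shows him doing better among the poor. Both are true of the same 100 people.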
The lesson is very simple, and it’s one that everyone knows before becoming a poll-reading pundit: People make decisions for different reasons. Something can be an important factor—here, it gives Trump a 20-point advantage among half the population—but get outweighed by some other important factor. Or, to put it in sophisticated language, you can’t use income as an instrument for economic anxiety, because income affects Trump support through other channels (in this example, because some rich people realize that Trump’s tax cuts will be good for them). This is really just the same mistake that Matt Yglesias made yesterday with race and age.
As should be obvious, I think that economic anxiety is a reason why some people support Donald Trump. I can’t prove it from poll breakouts, or from the Gallup paper, because this type of hypothesis can’t be proven or disproven with that type of data. That’s the one thing you should remember the next time someone argues that some demographic statistics show why some politician is popular.