By Robert J. Rietz
A very long time ago, at Michigan State University, I took a senior-level statistics course. The professor demonstrated how polls can produce statistically valid findings from a surprisingly small number of responses. For example, a recent state poll reported Candidate A leading Candidate B by 2.0 percentage points, with a margin of error of 3.6 points, based on answers from 1,181 registered voters. I revisited the professor’s proof and discovered two potential disconnects between statistical theory and polling practice: collecting response data and processing it. A third issue arises in how a poll’s results are explained to the public: the “margin of error.”
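The textbook calculation behind that kind of poll is the standard sampling-error formula for a proportion: with n responses, the 95% margin of error is about 1.96·√(p(1−p)/n). A minimal sketch, using the poll's 1,181 respondents from above, gives the conservative textbook figure; published polls often report a somewhat larger number, in part to account for the design effects of weighting.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a 95% confidence interval for a sample proportion.

    p = 0.5 maximizes p*(1-p), giving the most conservative bound;
    z = 1.96 is the critical value for 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Textbook margin of error for 1,181 respondents
print(f"±{margin_of_error(1181):.1%}")  # ±2.9%
```

The textbook value of about ±2.9% is smaller than the reported ±3.6 points, which is typical once a pollster adjusts for weighting and nonresponse.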
I’m convinced that pollsters coined the phrase “margin of error” so the public could understand a poll’s conclusions. Rather than saying the outcomes satisfy a 95% or 99% confidence level under a statistical test, such as the t-test or the F-test, they describe the expected variability of the results as the “margin of error.” However, the professor’s proof, and the polling profession, rely on a rather dubious assumption: that if pollsters put the question to random members of the underlying population many times over, the resulting sample proportions would follow a Gaussian (normal) distribution.
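What that assumption claims can be checked by simulation. The sketch below uses invented numbers, a true support of 51% and 1,181 respondents per poll, draws many hypothetical repeat polls, and compares the spread of the sample shares with the standard deviation that normal theory predicts, √(p(1−p)/n).

```python
import random
import statistics

random.seed(7)
true_support = 0.51   # hypothetical true share for Candidate A
n = 1181              # respondents per poll
polls = 2000          # number of simulated repeat polls

# Each simulated poll: n Bernoulli responses, record the sample share.
shares = [
    sum(random.random() < true_support for _ in range(n)) / n
    for _ in range(polls)
]

mean = statistics.mean(shares)
sd = statistics.stdev(shares)
theoretical_sd = (true_support * (1 - true_support) / n) ** 0.5
print(f"mean {mean:.3f}, sd {sd:.4f} vs theory {theoretical_sd:.4f}")
```

With genuinely random sampling, the simulated spread matches the theoretical value closely; the whole dispute is whether real-world response data behaves like these idealized draws.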
Is the Gaussian distribution of responses a valid assumption for pollsters to use? George Gallup contacted random people by telephone in the summer and fall of 1948 and asked if the home’s residents would vote for Harry Truman or Thomas Dewey. Those answers led Gallup, and almost all other pollsters, to predict Dewey would beat the incumbent President Truman. The Chicago Daily Tribune published an early headline, “DEWEY DEFEATS TRUMAN,” and a grinning re-elected President Truman lifted the newspaper over his head for the iconic photograph.
What happened? Gallup unknowingly introduced a bias by conducting this poll entirely by telephone. In 1948, phone service was concentrated in urban areas and more widely available among the higher economic classes—groups that tended to support Republicans in the 1940s. Farmers and low-income families, less likely to own telephones, voted Democratic back then and voted heavily for Truman. Modern pollsters attempt to avoid a similar bias by weighting their respondents’ demographics to mirror the underlying population, but issues can arise with this approach.
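The weighting that modern pollsters apply can be sketched with a toy version of the 1948 problem. All numbers below are invented for illustration: the sample over-represents urban telephone owners, and each cell’s weight is simply its population share divided by its sample share, the simplest form of post-stratification.

```python
# Hypothetical cells: the sample over-represents urban phone owners.
population_share = {"urban": 0.45, "rural": 0.55}  # assumed true electorate
sample_share     = {"urban": 0.70, "rural": 0.30}  # who actually answered

# Cell weight = population share / sample share
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Unweighted vs. weighted support for Candidate A (invented cell results)
support = {"urban": 0.40, "rural": 0.60}
unweighted = sum(sample_share[g] * support[g] for g in support)
weighted = sum(sample_share[g] * weights[g] * support[g] for g in support)
print(f"unweighted {unweighted:.1%}, weighted {weighted:.1%}")  # 46.0% vs 51.0%
```

In this made-up example the raw sample shows Candidate A losing at 46%, while the reweighted estimate shows them winning at 51%, which is exactly the kind of correction that might have saved the 1948 pollsters.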
Pollsters have many variables at their disposal for weighting respondents’ answers to reflect the target population. Age, gender, marital status, ethnicity, religion, income, education, homeownership, and party affiliation are some of the more common categories. They can also examine how the poll’s results vary by a variable such as respondents’ age, though the margin of error for any subgroup increases significantly because the subgroup sample is smaller. Reflecting more variables requires larger data sets, increasing the cost of the poll, and a sophisticated poll risks getting lost in the plethora of simpler, less expensive, and less reliable polls.
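The subgroup caveat follows directly from the square-root law: the margin of error grows as 1/√n, so a subgroup of, say, 150 respondents (a hypothetical figure) is far noisier than the full sample. A quick sketch using the conservative 1.96·√(p(1−p)/n) approximation:

```python
import math

def moe(n, z=1.96, p=0.5):
    # Conservative 95% margin of error for a sample proportion
    return z * math.sqrt(p * (1 - p) / n)

full = 1181       # whole sample
subgroup = 150    # e.g., respondents aged 18-29 (hypothetical size)
print(f"full sample: ±{moe(full):.1%}, subgroup: ±{moe(subgroup):.1%}")
```

Shrinking the sample from 1,181 to 150 respondents roughly triples the margin of error, from about ±2.9% to about ±8.0%, which is why subgroup breakdowns deserve skepticism.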
Pollsters can survey three different populations: all adults over age 18, registered voters, and likely voters. Would a poll of each group produce the same outcome? Would a registered voter lie to a pollster and describe themselves as a likely voter? Can respondents be influenced by the question itself? A slanted question, exaggerated here for effect, such as “Are you more likely to vote for strong, patriotic Candidate A or Candidate B, who’s just another sniveling politician?” would push responses in Candidate A’s favor.
Other, subtler biases can be present. Suppose supporters of Candidate A are more reluctant to disclose their voting intentions than supporters of Candidate B. Supporters of Candidate A might decline to state a preference, or respond as undecided. Would the results of such a poll accurately represent Candidate A’s strength?
Pollsters are human beings and care about how the public views their work. The results of many polling organizations tend to cluster. A pollster with rogue results may question its data and methods, or adjust them, before publishing its work.
Turnout, who will vote and who will stay home, is polling’s most critical variable, and modeling it is the hardest part of the job. The demographics of actual voters, and how they differ from those of nonvoters, can’t be determined until after the election. Increased turnout among a subgroup favorable to Candidate A, like farmers for Harry Truman, can swing an election and contradict earlier polls. As with the stock market, past performance (turnout) of any demographic subgroup is no guarantee of future turnout.
I suggest you ignore all the polls before Election Day, and participate in the only poll that matters—the one where you mark your ballot.