Bad Data

By James Kwak

To make a vast generalization, we live in a society where quantitative data are becoming more and more important. Some of this is because of the huge increase in the availability of data, which is itself largely due to computers. Some is because of the matching increase in the capacity to process data, which is also largely due to computers. Think about Hans Rosling’s TED Talks, or the rise of sabermetrics (the “Moneyball” phenomenon) not only in baseball but in many other sports, or the importance of standardized test scores in K-12 education, or Karl Rove’s use of data mining to identify likely supporters, or the FiveThirtyEight revolution in electoral forecasting, or the quantification of the financial markets, or zillions of other examples. I believe one of my professors has written a book about this phenomenon.

But this comes with a problem: we do not currently collect and scrub good enough data to support this recent fascination with numbers, and on top of that our brains are not wired to understand data. And when a lot is riding on bad data that is poorly understood, people will distort the data or find other ways to game the system to their advantage.

Readers of this blog will all be familiar with the phenomenon of rating subprime mortgage-backed securities and their structured offspring using data exclusively from a period of rising house prices — because those were the only data that were available. But the same issue crops up in many different stories covering different aspects of society.

