24 March 2015

big data is very often bad data

In my research I study the issues that make big datasets messy: missing data, corrupted data, uncalibrated sensors, biased human participants, inaccurate information about where or when a measurement was taken, measuring X when you really wanted to know Y, and so on. A lot of big data proponents act like it's easy peasy to milk every last bit of information out of these datasets. It's not!
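
To make the "missing data" point concrete, here is a minimal sketch (entirely made up, not from any dataset I work with) of how data that go missing *not* at random can bias even a plain average computed on the records you do observe:

```python
# Hypothetical sketch: when data are missing not at random, the average of
# the observed records can be badly biased. Numbers and the "exercise
# minutes" scenario are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

# True quantity of interest: mean daily exercise minutes in a population.
true_minutes = rng.normal(loc=30, scale=15, size=100_000).clip(min=0)

# Assume people who exercise more are more likely to report it
# (a made-up response mechanism).
report_prob = 0.2 + 0.6 * (true_minutes / true_minutes.max())
observed = true_minutes[rng.random(true_minutes.size) < report_prob]

print(f"true mean:     {true_minutes.mean():.1f} minutes")
print(f"observed mean: {observed.mean():.1f} minutes")  # noticeably higher
```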

What better time to learn this lesson than during March Madness, via a winning prediction algorithm that does so well because it figured out which data were the best to use. This is called "variable selection" in statistics, and we try to do it in an automated way, but often (as was the case here) it's really domain expertise that allows one to figure out which data are the best for inference.
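
The post doesn't say what the bracket model actually used, so purely as a generic illustration of automated variable selection, here is a toy lasso example: the penalty shrinks the coefficients of uninformative features to zero, keeping only a few. (The feature counts and coefficients below are invented.)

```python
# Toy illustration of automated variable selection via the lasso.
# All numbers here are made up; this is not the March Madness model.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 200, 50                      # 200 "games", 50 candidate features
X = rng.normal(size=(n, p))
# Only 3 of the 50 features actually drive the outcome in this toy setup.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.8 * X[:, 7] + rng.normal(scale=0.5, size=n)

model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(np.abs(model.coef_) > 1e-3)
print("features kept by the lasso:", selected)   # ideally [0, 3, 7]
```

Automated procedures like this work well when the informative features are in the candidate set and the signal is strong; domain expertise is what tells you which candidate features to collect in the first place.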

One of my favorite lists of problems with big data can be found here -- including the fact that "although big data is very good at detecting correlations, ... it never tells us which correlations are meaningful." Here is another nice article from last summer on the limitations of big data, seen through the OkCupid and Facebook user experiments. And of course there is my earlier blog post that gives props to IEEE for discussing the same.
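
One way to see why "detecting correlations" is the cheap part: with lots of variables and few observations, some pairs will look strongly correlated purely by chance. A quick sketch (sizes chosen arbitrarily for illustration):

```python
# With many variables and few observations, large correlations appear in
# pure noise -- none of them are meaningful.
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_vars = 30, 500
X = rng.normal(size=(n_obs, n_vars))    # pure noise, no real relationships

corr = np.corrcoef(X, rowvar=False)     # 500 x 500 correlation matrix
off_diag = np.abs(corr[np.triu_indices(n_vars, k=1)])
print(f"largest |correlation| among noise variables: {off_diag.max():.2f}")
```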

Every statistical inference procedure -- from simply calculating p-values, to predicting class labels with an SVM, to estimating system dynamics with filters -- rests on assumptions that may or may not hold in practice. Understanding the implications of those assumptions is crucial to figuring out how to use big data well.
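
As one concrete example of an assumption quietly failing (my own toy simulation, not from the post): the usual one-sample t-test p-value assumes independent observations, and on autocorrelated data like many sensor streams the false-positive rate can blow up far past the nominal 5%.

```python
# The t-test assumes independent observations. Feed it AR(1) noise (where
# the null hypothesis of zero mean is actually true) and watch the
# false-positive rate climb well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def autocorrelated_series(n=200, rho=0.9):
    """AR(1) noise with mean zero -- the null hypothesis holds."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal()
    return x

false_positives = sum(
    stats.ttest_1samp(autocorrelated_series(), 0.0).pvalue < 0.05
    for _ in range(1000)
)
print(f"false-positive rate: {false_positives / 1000:.2f}  (nominal: 0.05)")
```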