What better time to learn this lesson than during March Madness, via a winning prediction algorithm that does so well because it figured out which data were best to use? This is called "variable selection" in statistics, and we try to do it in an automated way, but often (as was the case here) it's really domain expertise that allows one to figure out which data are best for inference.
One of my favorite lists of problems with big data can be found here, including the fact that "although big data is very good at detecting correlations, ... it never tells us which correlations are meaningful." Here is another nice article from last summer on the limitations of big data, as seen through the OkCupid and Facebook user experiments. And of course there is my earlier blog post that gives props to IEEE for discussing the same.
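To make the point about meaningless correlations concrete, here is a minimal sketch of my own (it is not taken from any of the linked articles). It generates a random target plus a pile of pure-noise variables and reports the strongest correlation found. The sample size, the number of variables, and the function names `pearson` and `strongest_noise_correlation` are all assumptions I picked for illustration.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_noise_correlation(n_samples=20, n_variables=1000, seed=0):
    """Largest |correlation| between a random target and unrelated noise variables.

    Every variable is drawn independently of the target, so any
    correlation found here is due to chance alone.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(noise, target)))
    return best


if __name__ == "__main__":
    print(strongest_noise_correlation())
```

With only 20 observations and 1000 candidate variables, the strongest correlation is typically large even though nothing here is related to anything else. The data can surface the correlation, but it takes something outside the data to say whether it means anything.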
Every statistical inference procedure -- from simply calculating p-values, to predicting class labels with SVM, to estimating system dynamics with filters -- has assumptions that may or may not hold in practice. Understanding what happens when those assumptions fail is crucial to figuring out how to use big data.
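As one small illustration of an assumption quietly failing, here is a hedged sketch, again my own and under assumptions I chose. It runs a one-sample t-test (which assumes independent observations) on zero-mean data and counts how often it wrongly rejects at the nominal 5% level, first for independent samples and then for autocorrelated AR(1) samples. The series length, the AR coefficient of 0.8, and the hard-coded critical value are illustrative choices.

```python
import random

# Approximate two-sided 5% critical value of Student's t with 49 degrees
# of freedom; it matches the default series length n=50 used below.
T_CRIT = 2.01


def t_statistic(xs):
    """One-sample t statistic for the null hypothesis that the mean is zero."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean / (var / n) ** 0.5


def ar1_series(n, phi, rng):
    """Zero-mean AR(1) series x[t] = phi * x[t-1] + noise; phi=0 gives i.i.d. data."""
    # Start from the stationary distribution so there is no warm-up transient.
    x = rng.gauss(0, 1) / (1 - phi ** 2) ** 0.5
    series = []
    for _ in range(n):
        series.append(x)
        x = phi * x + rng.gauss(0, 1)
    return series


def false_positive_rate(phi, n=50, trials=500, seed=0):
    """Fraction of trials in which the t-test rejects a null that is actually true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        if abs(t_statistic(ar1_series(n, phi, rng))) > T_CRIT:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    print("independent samples:   ", false_positive_rate(0.0))
    print("autocorrelated samples:", false_positive_rate(0.8))
```

The true mean is zero in both cases, so every rejection is a false positive. With independent data the rate should sit near the nominal 5%, while with autocorrelated data it should come out far higher, because the test's independence assumption no longer holds and its p-values stop meaning what they claim to mean.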