05 March 2013

Context for Big Data

As part of the Internet Multi-Resolution Analysis (MRA) long program at IPAM a few years ago, I became acutely aware of the need for context when analyzing big data. Our most infamous example was the claim, made by some, that the internet was extremely vulnerable to targeted attacks. The reasoning in the paper behind that claim suffered from two faults: one of logic and one of missing context. The actual internet topology is very different from the one suggested in the paper, as was discussed in this article by some of the organizers of the MRA program.

A few weeks back, this blog entry from the New York Times also stressed the need for context. It tells a story of using sensors to collect data on elevator and stair usage, where after a few days of collection the conclusion was that students use the stairs more at night. That seemed an interesting finding until a security guard supplied some needed context: the elevators had been breaking down at night. So of course people were taking the stairs!

Missing context and missing data can be as important as confounding factors in data collection, if not more so. As more and more data is collected and analyzed for decision-making by governments, corporations, and industry, in both the private and public domains, I believe that understanding the potential pitfalls of missing data and uncertainty will be central to actually getting good use out of that data.
