C&B Notes

Harnessing Big Data

Cheap access to mountains of data that were previously uncollectible or unmeasurable will lead to transformative insights.  Simply having greater amounts of information, however, does not allow users to ignore statistical foundations.  Instead, it requires that they be more watchful for all the biases and errors that a data avalanche may obscure.

But the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast.  Google Flu Trends was built on found data and it’s this sort of data that interests me here.  Such data sets can be even bigger than the LHC data — Facebook’s is — but just as noteworthy is the fact that they are cheap to collect relative to their size, they are a messy collage of data points collected for disparate purposes and they can be updated in real time.  As our communication, leisure and commerce have moved to the internet and the internet has moved into our phones, our cars and even our glasses, life can be recorded and quantified in a way that would have been hard to imagine just a decade ago.

Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.

Unfortunately, these four articles of faith are at best optimistic oversimplifications.  At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks.  Absolute nonsense.”


Four years after the original Nature paper was published, Nature News had sad tidings to convey: the latest flu outbreak had claimed an unexpected victim, Google Flu Trends.  After reliably providing a swift and accurate account of flu outbreaks for several winters, the theory-free, data-rich model had lost its nose for where flu was going.  Google’s model pointed to a severe outbreak but when the slow-and-steady data from the CDC arrived, they showed that Google’s estimates of the spread of flu-like illnesses were overstated by almost a factor of two.

The problem was that Google did not know — could not begin to know — what linked the search terms with the spread of flu. Google’s engineers weren’t trying to figure out what caused what.  They were merely finding statistical patterns in the data.  They cared about correlation rather than causation.  This is common in big data analysis.  Figuring out what causes what is hard (impossible, some say).  Figuring out what is correlated with what is much cheaper and easier.  That is why, according to Viktor Mayer-Schönberger and Kenneth Cukier’s book, Big Data, “causality won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning”.  But a theory-free analysis of mere correlations is inevitably fragile.  If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.
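The fragility is easy to reproduce with a toy model.  The numbers below are invented purely for illustration (this is not Google's actual method): a predictor is calibrated on winters where search volume tracks real flu cases, then the behavioral link shifts — say, media coverage doubles searches without any extra flu — and the correlation-only model overshoots, much as Flu Trends did.

```python
# Toy illustration (hypothetical numbers): a correlation learned in one
# regime breaks when the mechanism behind it changes.

# Training winters: searches happen to run at twice the case count.
train_cases = [10, 20, 30, 40]
train_searches = [2 * c for c in train_cases]

# "Model": learn the search-to-case ratio from the historical pattern.
ratio = sum(train_searches) / sum(train_cases)  # = 2.0

# New winter: media panic doubles search volume, cases unchanged.
actual_cases = 30
observed_searches = 2 * actual_cases * 2  # inflated by coverage, not flu

predicted = observed_searches / ratio
print(predicted, actual_cases)  # prediction overshoots by a factor of two
```

Nothing in the fitted ratio tells you *why* searches tracked cases, so nothing warns you when that link snaps.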


Opinion polls are based on samples of the voting population at large.  This means that opinion pollsters need to deal with two issues: sample error and sample bias.  Sample error reflects the risk that, purely by chance, a randomly chosen sample of opinions does not reflect the true views of the population.  The “margin of error” reported in opinion polls reflects this risk and the larger the sample, the smaller the margin of error.  A thousand interviews is a large enough sample for many purposes and Mr. Gallup is reported to have conducted 3,000 interviews.  But if 3,000 interviews were good, why weren’t 2.4 million far better?  The answer is that sampling error has a far more dangerous friend: sampling bias.  Sampling error is when a randomly chosen sample doesn’t reflect the underlying population purely by chance; sampling bias is when the sample isn’t randomly chosen at all.  George Gallup took pains to find an unbiased sample because he knew that was far more important than finding a big one.
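A small simulation makes Gallup's point concrete.  All figures here are hypothetical (a 55 per cent true support level, a reachable subgroup at 40 per cent): a random sample's error shrinks roughly as one over the square root of its size, but a biased sample converges ever more precisely on the wrong answer, no matter how many interviews you pile up.

```python
import random

random.seed(0)

TRUE_SUPPORT = 0.55  # hypothetical: 55% of voters back candidate A

def simulate_poll(n, support):
    """Interview n voters, each backing A with probability `support`."""
    return sum(random.random() < support for _ in range(n)) / n

# Sampling error: random samples, error shrinks roughly as 1/sqrt(n).
small_poll = simulate_poll(1_000, TRUE_SUPPORT)
big_poll = simulate_poll(1_000_000, TRUE_SUPPORT)

# Sampling bias: suppose pollsters can only reach a subgroup in which
# support runs at 40%.  More interviews just pin down the wrong number.
biased_huge = simulate_poll(2_400_000, 0.40)

print(abs(small_poll - TRUE_SUPPORT))   # small, and shrinks with n
print(abs(big_poll - TRUE_SUPPORT))    # smaller still
print(abs(biased_huge - TRUE_SUPPORT))  # stuck near 0.15 despite 2.4m interviews
```

The 2.4 million biased "interviews" echo the Literary Digest problem: size buys precision, not accuracy.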


Statisticians are scrambling to develop new methods to seize the opportunity of big data.  Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.

Recall big data’s four articles of faith.  Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor.  The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it.  The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count.  As for the idea that “with enough data, the numbers speak for themselves” — that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.  “Big data” has arrived, but big insights have not.  The challenge now is to solve new problems and gain new answers — without making the same old statistical mistakes on a grander scale than ever.
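The "spurious patterns" problem can be demonstrated with pure noise.  In this invented sketch, an outcome series and 200 candidate predictors are all random; screening every predictor against the outcome at a conventional significance threshold still yields a handful of "strong" correlations, every one of them meaningless — and the more variables a data set contains, the more such ghosts it produces.

```python
import random
import statistics

random.seed(1)

N_OBS = 100         # e.g. weekly observations of an outcome (pure noise here)
N_PREDICTORS = 200  # candidate predictors, also pure noise

outcome = [random.gauss(0, 1) for _ in range(N_OBS)]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys) * len(xs))

# Screen every predictor, keeping "significant" correlations.
# With n = 100, |r| > 0.2 corresponds to roughly p < 0.05 two-sided.
hits = sum(
    abs(corr([random.gauss(0, 1) for _ in range(N_OBS)], outcome)) > 0.2
    for _ in range(N_PREDICTORS)
)

print(hits)  # typically around ten "discoveries" — all of them spurious
```

With enough columns, the numbers always "speak"; the question is whether they are saying anything true.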