Lies, damn lies and … big data!

Editor Neil McNaughton argues that ‘big data’ is old hat! In the fifty or so years since numerical classification methods were trialed by geologists, some learnings appear to have been forgotten. A timely issue of Nature addresses the dangers of ‘point and click’ statistical software.

Speaking at ECIM (more on page 6), Teradata’s Niall O’Doherty opined that one of the myths of big data is that it is something new. His words came back to me as I sat in on the analytics/big data track at the Society of Petroleum Engineers ATCE in Houston a couple of weeks later (more next month). For yes, big data and its accompanying statistical toolset are almost as old as computing itself, as any user of the venerable SPSS* package will tell you.

My own first exposure to what was then ‘numerical taxonomy’ dates back almost 50 years. At the time I was studying geology, not computer science. Moreover I was not a very good student. This was the late 1960s and there was other stuff to do.

Having said that, this bad student who, a long time ago, was studying a different subject in a half-hearted manner, did grasp a couple of things that seem to have subsequently been forgotten. First, numerical taxonomy does not always work very well. Fiddling with input parameters makes it possible to classify everything as either ‘all the same’ or ‘all different.’ This is due in part to the fact that it is hard to find a sufficient number of independent, non-correlated input measurements (dimensions). The other problem is that the technique often failed to produce statistically significant results because there were not enough numbers to crunch. Or, in modern parlance, the input matrices were too ‘sparse.’

To be charitable, data today is ‘bigger’ so, hopefully, sparseness should be less of an issue. But the risk of correlated inputs appears to have escaped some of today’s big data protagonists. In the dash to apply analytics to shale prospectivity, one SPE paper proposes using no fewer than 39 attributes (dimensions) derived from a post stack seismic dataset. Now some of these might be just a little correlated, no?

A query of a philosophical nature which I like to ask seismologists is, given a 3D post stack volume of 4 pretty independent dimensions (x, y, z/t, amplitude), what is the maximum number of truly independent ‘attributes’ that can be computed? I am sure that it is less than 39. I suspect that you cannot add any truly non-correlated inputs to the mix at all but such reasoning is above my pay grade.
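The worry about correlated attributes can be illustrated with a toy calculation. The sketch below is not taken from the SPE paper or from any seismic package. It assumes a purely synthetic random ‘trace’ and two hypothetical attributes derived from it (absolute amplitude and energy), plus one genuinely independent series for comparison. The names `pearson` and `correlated_pairs`, and the 0.8 threshold, are illustrative choices only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(attrs, threshold=0.8):
    """Return (name_a, name_b, r) for every attribute pair with |r| >= threshold."""
    names = sorted(attrs)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(attrs[a], attrs[b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    return pairs


rng = random.Random(0)

# Synthetic stand-in for a seismic trace (assumption: not real data).
amplitude = [rng.gauss(0.0, 1.0) for _ in range(2000)]

attributes = {
    # Two hypothetical attributes computed from the same amplitudes.
    "abs_amplitude": [abs(a) for a in amplitude],
    "energy": [a * a for a in amplitude],
    # A series drawn independently of the trace, for comparison.
    "independent_noise": [rng.gauss(0.0, 1.0) for _ in range(2000)],
}

pairs = correlated_pairs(attributes)
for name_a, name_b, r in pairs:
    print(f"{name_a} vs {name_b}: r = {r}")
```

In this toy case the two derived attributes flag each other as strongly correlated while the independent series does not, which is the kind of check one would hope to see before 39 attributes are fed into a classifier. Pairwise correlation only catches linear dependence, so it is a first screen rather than an answer to the question of how many independent attributes exist.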

I am not sure that I am a better student today than I was then, but over the years I have developed a sneaky way of learning. I avoid stuff that seems too hard on the (probably unjustified) assumption that the author doesn’t really know what he or she is talking about. And I home in on what seem like insightful comment and summary. Thus I learned on page 110 of David Hand’s excellent ‘Very short introduction to statistics’ something of how ‘statistical computing’ really works. An insight, by the way, that no end of presentations on the business benefits of ‘big data’ have provided. Hand explains how data is split into a training data set from which some statistical relationship is derived. This is then tested on the remainder of the data that was not used for training.

Of course you have heard this before, but, as Hand points out, it is only a part of the story. The real application of big data is that the process is repeated multiple times inside the data set. Randomly selected subsets are used for computation and compared with the remaining data to provide ‘an overall measure of likely future performance.’
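The repeated split-and-test procedure described above can be sketched in a few lines. This is a minimal illustration, not Hand’s own code: it assumes synthetic data, a straight-line model fitted by least squares, and root-mean-square error as the performance measure. The function names (`fit_line`, `rmse`, `repeated_holdout`) and the choice of 200 repeats with a 30% test fraction are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root-mean-square prediction error of a fitted line on (xs, ys)."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


def repeated_holdout(xs, ys, n_repeats=200, test_fraction=0.3, seed=0):
    """Fit on a random subset, score on the rest, and repeat many times.

    Returns the mean and sample standard deviation of the test scores.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(1, int(len(xs) * test_fraction))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(rmse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    mean = sum(scores) / len(scores)
    spread = math.sqrt(sum((s - mean) ** 2 for s in scores) / (len(scores) - 1))
    return mean, spread


# Synthetic data (assumption): a noisy straight line.
data_rng = random.Random(1)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 1.0) for x in xs]

mean_rmse, spread_rmse = repeated_holdout(xs, ys)
print(f"test RMSE over repeated splits: {mean_rmse:.2f} +/- {spread_rmse:.2f}")
```

The point of the repetition is the spread: a single split yields one number that may be lucky or unlucky, whereas many random splits give both an average score and a sense of how much it wobbles.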

Curiously, oil and gas practitioners prefer to do this manually, using half the data to derive the statistics and then testing, in a suck-it-and-see fashion, on the other half. I suspect that this falls short of statistical best practice even if it gives a warm feeling when it comes up trumps. It should be better to plug all the data in at the start, to have as little sparseness as possible and let the machine tell you how good the forecast is with some hands-off measure.

Speaking of warm feelings, statistics and experiments, there is a great section in a recent edition of Nature which deals exactly with such questions and what to do about them. Nature’s feature, titled ‘Fooling ourselves,’ by Regina Nuzzo, discusses how cognitive bias, our desire to find results that confirm our preconceptions, plagues scientific research and publishing. This usually comes to light when other teams of researchers have access to the same data set and fail to reproduce the findings. Nature reports on a study by Stanford’s ‘Meta-Research’ Institute which found that only around one-third of 100 psychological studies proved reproducible. Elsewhere, a measly 6 out of 53 ‘landmark’ studies in cancer research proved good. One problem is the ‘widespread use of point and click data analysis software that has made it easy for researchers to sift through massive data sets without understanding the methods.’

Reading the article made me think back to my seismic interpreting days. Seeking confirmation for a preconceived geological paradigm was exactly what we did all day long! For a refreshing insight into such thinking, check out our article on page 4 for an example of cognitive bias in interpretation from GeoTeric.

Nature suggests that one way of avoiding bias is for publishers to publish packages of both analysis and the original data so that reproducibility can be tested. Seemingly, in particle physics, ‘blind’ data analysis is de rigueur, with researchers doing their stats on jumbled data that is only unscrambled after the calculations.

A special issue of the Journal of Librarianship and Scholarly Communication addresses these very issues in an attempt to ‘map the landscape of research data.’ A new role is proposed for activist librarians who, ‘as experienced knowledge workers [are] vital players in the research data management enterprise.’

* Statistical package for the social sciences – first released in 1968. Now owned by IBM.

@neilmcn

© Oil IT Journal - all rights reserved.