How big is big data? And how different is it from ‘small’ data?

Neil McNaughton editorializes over a serendipitously special ‘big data’ issue of Oil IT Journal. He offers a sideways look at the Hadoop movement and brings together some loose ‘big data’ strands from Teradata, SAS and OSIsoft, as well as new terabyte fiber and computer tomography datasets.

People often ask us for our ‘editorial calendar.’ I find this a curious request in that it suggests that news and views can be predicted months ahead of time. Of course, if your publication is more interested in carrying the marketing department’s message than in getting at the truth (well, the news at least), then your position may be different. This month, though, we seem to have stumbled into a thematic issue of sorts, that of ‘big data in oil and gas.’ Being a believer in the ‘nothing new under the sun’ theory of IT (and, for that matter, of the world), my sketchy understanding of ‘big data’ is as follows.

As the likes of Google and Yahoo came into being, they were faced with the problem of analyzing very large volumes of ‘click stream’ data coming into their server farms. This kind of problem was not exactly new. It is very similar to any stream of data such as, for instance, banking transactions. These arrive in such volumes that they are likely stored in a system that may or may not be amenable to the kind of analytics that the business may later need. Thus we have had ‘data warehouses,’ online analytical processing (OLAP) systems and what have you.

A good example of prior ‘big data’ art is evidenced in our 2006 report from an earlier PNEC on Wal-Mart’s IT. Here time series data from cash registers around the world is captured and then flipped into a format that is more amenable for analysis. This likely involves turning it into a ‘data cube’ (a matrix to us scientists) that can be queried and analyzed statistically with ‘R’ or toolsets like Spotfire and SAS.
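For readers who have not met a ‘data cube,’ the following is a minimal sketch of the idea in plain Python. It is not Wal-Mart’s system. The records, the three dimensions (store, product, day) and the helper names `build_cube` and `roll_up` are all made up for illustration. Amounts are held in integer cents to keep the sums exact.

```python
from collections import defaultdict

# Hypothetical cash-register records: (store, product, day, amount in cents).
transactions = [
    ("store_a", "milk", "2014-05-01", 250),
    ("store_a", "bread", "2014-05-01", 120),
    ("store_b", "milk", "2014-05-01", 250),
    ("store_a", "milk", "2014-05-02", 500),
    ("store_b", "bread", "2014-05-02", 240),
]


def build_cube(records):
    """Flip a stream of transactions into a cube keyed by (store, product, day)."""
    cube = defaultdict(int)
    for store, product, day, cents in records:
        cube[(store, product, day)] += cents
    return dict(cube)


def roll_up(cube, keep):
    """Sum the cube over every axis not listed in `keep`.

    Axes are numbered 0 = store, 1 = product, 2 = day.
    """
    result = defaultdict(int)
    for key, cents in cube.items():
        result[tuple(key[axis] for axis in keep)] += cents
    return dict(result)


cube = build_cube(transactions)
sales_by_product = roll_up(cube, keep=(1,))        # collapse store and day
sales_by_store_day = roll_up(cube, keep=(0, 2))    # collapse product

print(sales_by_product)
print(sales_by_store_day)
```

Once the stream has been re-keyed in this way, each question put to the data becomes a sum along one or more axes of the matrix, which is the sort of thing that statistical toolsets are built for.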

Getting back to the Googles and Yahoos, their next obvious step would have been to buy an enormous data warehouse. Except that this is not what happened, for an obvious reason. Just think of the bill if Google were running Teradata! No, if you are fresh out of school and faced with such a problem, you roll up your sleeves and code. BTW, do not take this too literally. For every 1000 who do this, probably 999 are still sitting in front of their computers, surrounded by discarded pizza boxes, struggling to make that killer app. But I digress.

Natural selection being what it is, some of this intense coding effort will produce results. A savvy entrepreneurial type can then further leverage such by a) giving the code base away so that they get a more or less free ride from the open source community and b) working hard to consolidate the business. The latter may involve building massive server farms, diverting rivers and so on (another digression). Google’s data warehouse saved it from paying for a zillion commercial licenses and begat the ‘big data’ movement, a suite of technologies centered on Hadoop. So far so good. But what does all this mean to a vertical like oil and gas?

Reordering time series data and performing matrix operations sounds a lot like seismic processing. Indeed this is one of the putative use cases (page 12) from Hortonworks, the company that has been anointed, just as Red Hat was for Linux, as the torch bearer for ‘commercial’ Hadoop. Seismic data is indeed ‘big,’ but decades of R&D and relatively little pressure to save on costs mean that any new technology is going to have a hard time battling the highly tuned installed base. In a short chat with some Hortonworks representatives I was assured that a Hadoop ‘data lake’ and schema-on-query (as opposed to schema-on-load) were real differentiators. I was also told that Hadoop is already deployed in a production environment at oil and gas companies. Which companies these are could not be revealed because of the ‘commercial advantage’ it bestows. I intimated that I had ‘heard that one before’ and was accused of cynicism and ‘cautioned,’ although I’m not sure against what. I guess I’ll find out now!

Steve Holdaway’s book on ‘big data’ (review on page 3) is interesting in that it makes no reference to Hadoop at all. Holdaway’s focus is on data mining with statistical tools. The omission reminds me of a conversation I had with one practitioner who was wide-eyed when I asked if the compute resources were stretched. I was thinking of a big cluster. He was thinking of pattern matching and fuzzy logic running on a PC. Much key oil and gas data (especially historical production) is actually quite small!

Our lead this month on Teradata at Statoil is another interesting ‘big data’ use case. Also sans Hadoop incidentally, and actually more like the Wal-Mart deployment above.

Our ‘overflow’ report from Intelligent Energy (page 7) includes Maersk’s trials of running Schlumberger’s Eclipse reservoir simulator in the Amazon cloud. OK, it’s not Hadoop either, but it does show that big data in the broadest sense can be shifted around and processed sans infrastructure. Another presentation at IE described how a 150 terabyte data set could be acquired in a couple of days of distributed temperature sensing observation. That’s ‘big’ alright! As indeed are the multi-terabyte data sets that BP’s digital rocks effort generates in a matter of hours (page 12). Another use case put forward for big data is all that stuff that is (or soon will be) streaming in from the digital oilfield. Here again, the technology space is pretty well occupied by domain-specific tools, notably from OSIsoft (see our report on page 6).

If your problem set involves fault tolerant clusters and the sort of re-ordering before query operations that the data warehouse provides, and if you are looking, in an entrepreneurial fashion, to deploy such economically, then maybe the Hadoop approach is the way forward. Perhaps in a couple of years we will all be thinking ‘map reduce’ instead of SQL. If you would like to hear more on big data in oil and gas and you are in London in July, you might like to consider attending the SMi ‘Big data in oil and gas’ conference.
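As a footnote to the ‘map reduce’ versus SQL remark above, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop or any of its APIs. The well names, the volumes and the function names are hypothetical. The SQL equivalent, assuming a made-up `production` table, would be something like `SELECT well, SUM(barrels) FROM production GROUP BY well`.

```python
from collections import defaultdict

# Hypothetical daily production records: (well, barrels).
records = [
    ("well_1", 100),
    ("well_2", 250),
    ("well_1", 120),
    ("well_3", 80),
    ("well_2", 240),
]


def map_phase(record):
    """Emit a (key, value) pair for one input record."""
    well, barrels = record
    return well, barrels


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the values seen for one key into a single result."""
    return key, sum(values)


def map_reduce(data):
    """Run map, shuffle and reduce in sequence on one machine."""
    grouped = shuffle(map_phase(record) for record in data)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


totals = map_reduce(records)
print(totals)
```

On a real cluster the map and reduce calls would be spread across many machines and the shuffle would move data over the network. The sketch only shows the shape of the thinking.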

