Big data, Hadoop, the hype … and a compelling use case explained

Neil McNaughton reports from Digital Energy on ‘big data,’ where his skunk-works detector was showing red. Can Hadoop best years of established technology and confer a real ‘competitive advantage’? A conversation with GE shows what Hadoop is good for and where the hype takes over.

Judging by the plethora of ‘big data’ presentations at the SPE Digital Energy (DE) conference this month, the industry is about to be taken by storm by Hadoop. Not just our industry but, according to a new book from the Open Technology Institute, ‘How we live, work, and think.’ Vendors at DE intimated that several majors had Hadoop trials running on seismic processing and real-time data analysis, and indeed that these were so successful that the majors in question were not speaking about the tests because of the major ‘competitive advantage’ they were gaining. My hype radar was beeping loudly. After all, when nothing leaks from a project, it could be because it is a skunk-works flop.

IDC’s Jill Feblowitz did a good job of fanning the flames of the Hadoop blaze with some chapter and verse: a ‘Seismic Hadoop’ project initiated by Cloudera, and oil and gas data mining R&D performed at Stavanger University. Neither, from a cursory look, would appear to offer much ‘competitive advantage.’

So what is all the fuss about? Some good introductory material is to be had in a recent paper from Atos, which offers a short history of the technology’s evolution from Google’s Big Table, a reasoned taxonomy of big data solutions, an explanation of what Hadoop, MapReduce and NoSQL are really about, and a league table of other parts of the ecosystem.

On the hype side of the equation, the mantra is the four ‘Vs,’ i.e. data volume, velocity, variety and … value. Volume and velocity are fairly self-evident: think lots of high-bandwidth real-time data. Variety is the promise of being able to process across multiple sources, documents and databases alike. The ‘value’ part of the equation is more subtle. For some it refers to the high value that the ecosystem promises. For others, the ‘big data’ movement is about deriving information from low-value data. Think of all those log files that accumulate on your systems before being deleted unread, or of stuff disappearing down the ‘data exhaust’ of an offshore platform (OITJ 04/2011).

From the Atos paper it appears that the technology is best at doing fairly dumb stuff, but on very large sets of somewhat inaccessible data. Seismic data sorting is one use case that has been suggested. What is curious here is that, for Hadoop to bring a competitive advantage, decades of geophysical research targeting this exact problem would have to have been bested by a serendipitous, generic approach. This, to my mind, seems unlikely.

In the same vein, I wondered how decades of research into the data historian could be bested by big data. After his talk on Hadoop at the SMi E&P data conference last month, I put this question to GE’s Brian Courtney, who is well placed to answer as GE provides both a data historian (Proficy) and a new Hadoop-based solution. He said, ‘Historians are great at storing and retrieving time-series data. Proficy can capture millions of tags per second. Historians are also excellent for operational queries, provided you are asking for tags in time-sequence order.

Historians are not so good at ad-hoc queries such as ‘have I seen this five-second start-up pattern before?’ Here you would need to query across all your data, looking for the key tags on the same equipment. In a historian this can a) return more data than you can handle and b) take a very long time.

Hadoop is more like a warehouse than a historian. Moving data from the historian to Hadoop is very time-consuming, so typically you only send data once it is clean and complete; a typical time frame might be every three days. Hadoop takes the data and spreads pieces of it over lots of computers in the cluster. Note that if you had to update a record, you would need to delete the entire archive. Updates are very expensive in Hadoop.

The other challenge with Hadoop is that it takes maybe thirty seconds or more to parse a query, figure out where the data was put, move the query to the right nodes, process it, concatenate the results and return them. In other words, Hadoop is not good for rapidly changing data or for querying small amounts of data in near real-time.

What Hadoop excels at is a query such as ‘for all temperature settings for all turbines, determine the average temperature setting five minutes before and after a particular type of alarm.’ Again, you couldn’t do this with a typical historian. Hadoop can do it very quickly as it can decompose the query, send it to hundreds or thousands of nodes in the cluster and process the request in parallel. We have run a query like this that returned 12 signals from a 4 terabyte data set in 2½ minutes. The same query running on a historian crashed our clients and the server!

Hadoop is great as a data warehouse and allows you to do deeper, more meaningful analytics, data mining and discovery. Historians are best used as operational data stores, storing data in near real-time and querying trending data.’
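
Out of curiosity, here is a minimal sketch, in Python, of how such a query decomposes into a map step and a reduce step. This is emphatically not GE’s code: the record layout, tag names and ‘OVERSPEED’ alarm type are invented for illustration, and the little in-memory driver merely stands in for the shuffle that Hadoop would run over HDFS on a real cluster.

# Toy map/reduce sketch: average temperature within five minutes of an alarm,
# per turbine. Assumed (invented) record layout: turbine_id,timestamp_seconds,tag,value
from collections import defaultdict
from statistics import mean

WINDOW = 300                 # five minutes, in seconds
ALARM_TYPE = "OVERSPEED"     # hypothetical alarm type

def map_record(line):
    # Map step: key every record by turbine so the shuffle groups them together.
    turbine, ts, tag, value = line.strip().split(",")
    return turbine, (float(ts), tag, value)

def reduce_turbine(turbine, records):
    # Reduce step: per turbine, average temperatures within +/- WINDOW of an alarm.
    alarms = [ts for ts, tag, value in records if tag == "alarm" and value == ALARM_TYPE]
    temps = [(ts, float(value)) for ts, tag, value in records if tag == "temperature"]
    near = [v for ts, v in temps if any(abs(ts - a) <= WINDOW for a in alarms)]
    return turbine, (mean(near) if near else None)

if __name__ == "__main__":
    # In-memory stand-in for the cluster-wide shuffle, just to show the flow.
    sample = [
        "T1,1000,temperature,412.0",
        "T1,1100,alarm,OVERSPEED",
        "T1,1300,temperature,418.5",
        "T1,9999,temperature,300.0",   # outside the window, ignored
        "T2,1000,temperature,395.0",   # no alarm on turbine T2
    ]
    groups = defaultdict(list)
    for line in sample:
        key, value = map_record(line)
        groups[key].append(value)
    for turbine, recs in sorted(groups.items()):
        print(reduce_turbine(turbine, recs))

The point, as Courtney explains, is that the expensive part of the job, grouping every record by turbine across terabytes of history, is exactly what the shuffle phase distributes over the cluster.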

Well, that cleared things up for me, I must say. Looking back over the four ‘Vs,’ it would appear that the volume part is well handled by Hadoop, which is after all based on a highly scalable file system. Velocity? Courtney’s analysis is rather nuanced here. The data may be streaming into the historian at great speed, but the ‘warehouse’ approach means that data mining is an off-line process, although once you have found your critical pattern, it could be embedded in a real-time monitoring process inside the historian. Variety? Not in the GE use case, which is dealing with standard historian data. And Value? This is a good example of adding value to relatively low-value data.

So what about seismics? A lot of the Map/Reduce data manipulation sounds rather like the sorting and re-ordering performed in seismic imaging (see the sketch below). Does Hadoop offer anything new here? If you know the answer, we’d love to hear from you.
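
For readers unfamiliar with the analogy, the classic shot-to-CMP sort maps quite naturally onto the same pattern. The toy below is purely illustrative, with invented field names and in-memory records standing in for SEG-Y traces on HDFS, and makes no claim about how any production imaging system works.

# Toy map/reduce sketch of a shot-to-CMP sort: key traces by common midpoint,
# let the shuffle group them, then order each gather by offset.
from collections import defaultdict

def map_trace(trace):
    # Map step: the CMP number becomes the key, so the shuffle does the sort.
    return trace["cmp"], trace

def reduce_gather(cmp_no, traces):
    # Reduce step: traces arrive grouped by CMP; order by offset to form a gather.
    return cmp_no, sorted(traces, key=lambda t: t["offset"])

if __name__ == "__main__":
    shots = [
        {"cmp": 101, "offset": 400, "samples": [0.1, 0.2]},
        {"cmp": 100, "offset": 200, "samples": [0.3, 0.1]},
        {"cmp": 101, "offset": 100, "samples": [0.0, 0.4]},
    ]
    gathers = defaultdict(list)
    for t in shots:
        key, value = map_trace(t)
        gathers[key].append(value)
    for cmp_no in sorted(gathers):
        print(reduce_gather(cmp_no, gathers[cmp_no]))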

@neilmcn


© Oil IT Journal - all rights reserved.