Book review—Hadoop: The Definitive Guide

Tom White’s guide to Hadoop from O’Reilly, now in its third edition, is a 650-page introduction to the deployment of ‘big data’ applications. But what is Hadoop? And what’s in it for oil and gas?

Hadoop is one of those IT things where the more you hear about it, the less you understand, and where hype blurs reality. We decided to go to the horse’s mouth with a quick spin through Tom White’s ‘Hadoop: The Definitive Guide’ (HTDG). Well, a quick spin is not quite right for a 650-page manual, but here goes.

Hadoop is a data storage and analysis platform that was developed for ‘big data’ such as that generated by web traffic at Google and Yahoo, where the tool originated. Analyzing multi-terabyte log files was proving hard with the technology of the day for several reasons. The relational database (briefly re-baptized by White, sans explanation, as the ‘rational’ database) requires data to be read in a random fashion from many disk locations. Such ‘seek’ activity is slow; it is better to stream data from disk, leveraging all available bandwidth. Hadoop was also designed to be deployed and to scale across the massive commodity clusters used by Yahoo and Google, and to tolerate hardware failure, which, when you have 100,000 nodes, is a relatively frequent event. Hadoop adopts a write-once, read-many approach to minimize data movement and is said to be suited to processing large, unstructured data sets. Interestingly for geophysicists, Hadoop confronts the same problem set as high performance computing environments, i.e. network bandwidth. Hadoop is said to offer a simpler programming model than MPI. Hadoop’s design means that if a problem is running slowly, you just add more nodes to the cluster. Hadoop is inherently parallel.

But how does it work? In chapter 2, HTDG walks through a typically Hadoop-esque problem, analyzing a data set from the US National Climatic Data Center. ASCII data is first analyzed with a typical Unix approach using awk, taking 42 minutes on a single EC2 instance. To speed things up, parallelization is required, but at the expense of considerable programmer effort. Enter MapReduce, the essence of Hadoop. MapReduce is a little hard to grasp, although the explanation appears simple enough. Like special relativity, one would not want to be tested on it after a first read. But to cut to the chase, the Hadoop program scales transparently and ran in six minutes on 10 EC2 nodes.
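For readers who want a feel for the model, here is a minimal, single-machine sketch in Python of the two MapReduce stages applied to a weather-style data set. This is not the book’s Java code and it involves no cluster; the comma-separated ‘year,temperature’ record format and the sample values are our own simplification for illustration.

# Toy sketch of the MapReduce idea: map each record to a (year, temperature)
# pair, then reduce by taking the maximum temperature per year. The record
# format is a simplified, hypothetical stand-in for the NCDC data.
from collections import defaultdict

def map_phase(lines):
    # Map stage: emit (year, temp) pairs from 'YYYY,temp' records.
    for line in lines:
        year, temp = line.strip().split(",")
        yield year, int(temp)

def reduce_phase(pairs):
    # 'Shuffle': group all temperatures under their year key.
    grouped = defaultdict(list)
    for year, temp in pairs:
        grouped[year].append(temp)
    # Reduce stage: keep the maximum temperature per year.
    return {year: max(temps) for year, temps in grouped.items()}

if __name__ == "__main__":
    records = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
    print(reduce_phase(map_phase(records)))  # {'1949': 111, '1950': 22}

The appeal, as the book presents it, is that the programmer writes only the map and reduce functions; the framework handles splitting the input, shuffling intermediate pairs and re-running failed tasks across the cluster, which is why the same program scales transparently from one node to many.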

We’ll skip the next 600 pages, replete with code and Hadoop-derived projects, to check out what Hadoop is used for. A contribution from Facebook outlines the use of Hive, an open source data warehouse and SQL front end to Hadoop which has displaced Oracle for some tasks. Hive allows SQL programmers to play with Hadoop without the ‘complexity of MapReduce.’

So what is Hadoop’s potential in oil and gas? It is tempting to see application in seismic processing (lots of sorting there) and maybe in real-time data. The question is who is going to write the MapReduce for seismics, and whether the results will better the years of effort already spent on these problems by the imaging community. Speaking at the EAGE workshop on open source software in geophysics, BG Group’s Chris Jones made a passing reference to running Seismic Un*x on Hadoop (page 7). Maybe we’re onto something.

This article originally appeared in Oil IT Journal 2012 Issue # 7.
