They say that there is no such thing as a free lunch. There is probably no such thing as a free book either. IBM’s oeuvre, titled ‘Understanding Big Data’ (UBD), is a savvy blend of marketing puffery for the company’s InfoSphere BigInsights and some insight into what the Hadoop/Big Data hype is about. One intriguing use case is the oil field, where, apparently, a well can have ‘20,000 sensors generating multiple terabytes per day,’ a statement that might surprise some. This and other examples are shoehorned into a FUD-ish description of overwhelming data volumes, either ‘at rest’ (in a database) or ‘in motion’ (streaming from real time data ‘exhausts’). Skipping to page 55, we learn that Hadoop is a distributed file system that stores data and indexes redundantly across large clusters. The system, an open source Apache project inspired by Google’s published work on its own file system and MapReduce, was designed to provide extensible, resilient infrastructure on commodity hardware. Is Hadoop just a massive RAID system?
Well, perhaps not. UBD states that ‘MapReduce is the heart of Hadoop’ and goes on to explain the workings of the map and reduce approach to dealing with large amounts of sparse data. Here the brief explanation is at once simple and frustratingly opaque. To understand what is really happening, we needed a quick trip to Wikipedia, where we learn that MapReduce, rather than being ‘at the heart of Hadoop,’ is ‘a free and open source implementation of Hadoop.’ All very confusing.
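For readers as puzzled as we were, the map and reduce idea can be sketched in a few lines of Python. This is our own illustration, not code from UBD or from Hadoop itself. The well names, the pressure readings and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are all hypothetical, and everything runs in a single process, whereas a real cluster would spread each phase across many machines.

```python
from collections import defaultdict

# Hypothetical sensor records: (well_id, pressure_reading). Illustrative data only.
READINGS = [
    ("well-A", 310.0),
    ("well-B", 295.5),
    ("well-A", 312.5),
    ("well-B", 301.0),
    ("well-A", 308.0),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for well_id, pressure in records:
        yield well_id, pressure


def shuffle_phase(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values to one result (here, the maximum)."""
    return {key: max(values) for key, values in grouped.items()}


def run_job(records):
    """Chain the three phases in a single process."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(run_job(READINGS))
```

The point of the split is that the map step looks at each record in isolation and the reduce step looks at each key in isolation, so both can, in principle, be farmed out across a cluster without the workers needing to talk to each other.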
UBD moves on to describe various Hadoop-based projects such as Hive, a SQL front end, an ETL adaptor, Flume, and others. Amongst these is HBase, a column-oriented database for sparse data sets. Although HBase only gets a short paragraph, it is arguably the non-relational approach of sparseness and de-normalization that makes up the big data difference. At page 81 we return to the BigInsights sales pitch. For IBM, Hadoop is not designed to replace your data warehouse, but rather to complement it. If you have sweated blood getting your business intelligence up and running, the thought of a big data ‘complement’ may make you feel faint. A good question is the extent to which the technology is really just a rehash of business intelligence (data at rest) and complex event processing (data in motion) for the distributed file system.
IBM states that its BigInsights implementation is, and will always be, based on the core Hadoop distribution. This raises the question as to why you would want to use anything other than the free version. I guess it depends on how adventurous you are. All in all, UBD probably provides more insight into IBM’s marketing strategy than it does into Hadoop. But this is certainly not without interest, especially if you are, or if you are considering becoming, a client. For our part, consider this a first stab at what will likely be several attempts to come to terms with the new paradigm and to see whether it really does have potential for application to oil and gas data, whether at rest or in motion.
© Oil IT Journal - all rights reserved.