‘Big data and analytics?’ Pick one!

Neil McNaughton observes that much technological progress is concerned with dull infrastructure and scaling issues. To sell, it needs spicing up with a neat idea. Examples? Google’s PageRank and the (apocryphal) beer and diapers tale of business intelligence success. Buzz at this year’s PNEC data management conference suggests that Hadoop’s promise remains elusive.

Way back, a major vendor released an update to its flagship database, let’s call it ‘Founder,’ with the promise of a major new feature. This got a lukewarm reception from the community since the ‘new’ feature was the reason that they had bought the software in the first place. IT has always been great at promising a lot and delivering... rather less, rather late, or as they say in the trade, ‘real soon now.’

There are good reasons for this. Much of what appears to be straightforward in the data and IT space is, in fact, rather hard to achieve. This is frequently due to the problem of scale. If you have a table of production values on your desktop, a simple click is all you need to order them. If you have data coming into different systems all over the world, things are rather different. The whole history of major enterprise systems like ERP is one long game of catching up with users’ prior expectations.

Meanwhile, industry has to maintain excitement and interest while it gets on with the grunt work of scaling and debugging the ‘next big thing.’

I’m not 100% sure of this, but I suspect that Google adopted this strategy back in the early days of search. At the time there were a few search engines about (Google, AltaVista and others) and, for the end user, there was not a great deal to separate them in terms of results. As the web grew though, the problem for the search engines became the boring old one of scale. Keeping up the humongous indexes and tracking the clicks became an arms race.

As I said, scale is not a strong selling point. Google never said, ‘We will be the biggest and therefore the best.’ Instead it came up with a nice story, PageRank, and some cute math to back it up. Google may have used PageRank in the very early days but after a few months it must have amassed its own, much more accurate clickstream data for its recommendations. This has never stopped PageRank being duly trotted out by Wikipedia and others since then to explain Google’s genius. Success was more likely due to having mastered the tricky task of drinking from the clickstream firehose and, of course, to monetizing the results. Google’s approach to its big data is arguably more about doing dumb stuff at very large scale than smarty-pants data-driven analytics.

The marketing ploy at work here is to camouflage what’s boring with a great story. The big data movement today has two sides to it. The first, the boring one, is monitoring. The second, the exciting one, is ‘analytics.’ Analytics is exciting because of the notion that today, our data has gotten so big that it just has to contain hidden gems of information that have so far eluded us. A great story indeed.

The relative importance of these two facets of big data is frequently misunderstood by end users and is misrepresented, probably quite innocently, by the marketing folks. In fact in the literature, both scientific and marketing, ‘big data’ is almost always conflated with ‘analytics’ and sometimes acronymized into ‘BDA.’

The first time we reported on what later became ‘big data’ was at PNEC in 2006 when we heard Nancy Stewart on how Wal-Mart used powerful computing to track in-store sales in near real-time. I confess that when I thought back to this talk I ‘remembered’ the old retail business intelligence chestnut, the tale of nappies/diapers and beer. This has it that Wal-Mart’s business intelligence ‘discovered’ that men who buy nappies also buy beer. This oft-cited, unexpected retail information gem is a cute tale, but is it true?

I checked our Technology Watch report from the 2006 PNEC. Stewart made no mention of this story, which is almost certainly an urban legend. I don’t doubt that analytics play a role in Wal-Mart’s use of high end computing but, just as when the major oils use SAP, the main use case is monitoring and situational awareness. It is as hard to track worldwide sales in real time as it is to track a major’s worldwide oil and gas production.

If, like many, you are embarking on a big data project (as pretty well any IT project seems to be these days) you might like to reflect on the fact that most of the trendy technology that is deployed around big data was developed for monitoring and situational awareness rather than for analytics.

At PNEC (see page 6) I caught the following snippets from different speakers. ‘The big data movement has left people behind in the quest for some machine learning future.’ ‘80-90% of our data scientists’ time is spent on data prep.’ ‘There are pockets of what passes as analytics but this is actually reporting.’ ‘We have Hadoop and we are trying to figure out what to do with it!’ ‘Today we have data lakes which are a lot like the old data warehouse!’

The good thing about the big data movement is its focus on data. Maybe ‘big data’ will succeed where ‘data management’ has failed. The whole data management problem is essentially one of scale and a lot of the new stuff, though probably not all of it, will undoubtedly help.

This is not going to happen, though, if it is being sold and bought with a predominantly analytical focus. Companies pouring money into analytics-driven initiatives are likely to be disappointed as a) the grunt work of data management is neglected and b) the ‘analytics’ fail to turn up much that we did not already know, for instance when applied in a field like imaging ‘big’ seismic data. There again, maybe you know something that I don’t. If you have a true version of the beer and nappies story I’m all ears.



© Oil IT Journal - all rights reserved.