O’Reilly’s formula for success is a) identify a good subject, b) find a subject matter expert and c) perform a brain dump. The Bad Data Handbook* (BDH) succeeds with step a) but that’s about it. BDH is a hotchpotch of essays from nineteen authors. Some manage to stay on message. The chapters on the cloud, on social media and on caring for machine learning experts less so.
Kevin Fink provides an interesting peek (with code) at processing web log data. Paul Murrell offers advice on getting data out of ‘awkward’ formats like Excel (use XLConnect) and processing it with ‘R’.
We enjoyed Josh Levy’s chapter on ‘bad data in plain text’, which gives an authoritative account of character encodings and text processing in Python. Adam Laiacano’s chapter on scraping data from web pages does a good job of showing what an ugly task this can be. For one website using Flash, this meant running Matlab scripts to extract text from screen grabs! Jacob Perkins’ ‘detecting liars on the web’ describes how Python’s NLTK library for natural language processing is used to classify movie reviews. Interesting but, again, somewhat off topic!
A problem with BDH is that the subject means different things to different people. Phil Janert’s chapter covers defect reduction in manufacturing, analyzing call center data and making the most of data with statistics-based hypothesis testing. BDH is very much in the modern world of NoSQL, file databases and the web. The topics of database integrity and naming conventions are not covered, even though these are key routes to clean data.
Ethan McCallum makes a brave attempt to tie all this together but his is less of an editor’s role, more one of an applier of lipstick to the pig. Again, the problem with BDH is the subject and the fact that the book is mostly about making sense of data as it is found on the web. The issue of how to avoid creating bad data in the first place is not covered, which is a shame as this is arguably more important.
* by Ethan McCallum. O’Reilly 2013. ISBN 9781449321888.
This article originally appeared in Oil IT Journal 2013 Issue # 1.