Drew Conway and John Myles White’s credentials are impeccable: Conway used to work in intelligence, and White is a psychologist, statistician and maintainer of several ‘R’ packages. Their new book, ‘Machine Learning for Hackers’* (ML4H), uses the successful O’Reilly mix of code and conversation to take the reader through various sample tasks.
ML4H uses the open source ‘R’ programming language throughout and assumes some familiarity with command-line work. R provides data formatting, visualization, statistics and analysis in a moderately terse syntax. According to ML4H, R has seen a ‘meteoric rise’ in the data sciences and machine learning communities, making it the ‘de facto lingua franca’ for analytics.
R provides tools for massaging ugly real-world data into a shape ready for visualization. John Tukey, the father of data analysis, is credited with the distinction between exploratory and confirmatory data analysis: the first is used to identify trends, the second to see if they are statistically significant. R does both. A few lines of code turn a ragged public domain ‘big data’ set into an almost SharePoint-esque dashboard.
Then comes the smarts. Key to real machine learning is Bayesian statistics, which combines data-driven probabilities with prior knowledge. Examples such as determining a person’s gender from ‘her height and weight’ (a she maybe?) have obvious analogs in more pertinent domains, such as deriving lithology from seismic attributes and logs, or root cause analysis of maintenance data.
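To make the idea of combining data-driven probabilities with a prior concrete, here is a minimal sketch in Python rather than the book's R. It is not the authors' code. The training values, the class labels and the helper names (`fit`, `posterior`) are all made up for illustration, and it assumes each feature is normally distributed and independent within a class (a Gaussian naive Bayes classifier).

```python
import math

# Hypothetical training data: (height_cm, weight_kg) pairs per class.
# These values are invented for illustration only.
TRAINING = {
    "female": [(160.0, 55.0), (165.0, 60.0), (158.0, 52.0), (170.0, 65.0)],
    "male": [(178.0, 80.0), (183.0, 85.0), (175.0, 72.0), (188.0, 92.0)],
}

# Prior knowledge: how likely each class is before seeing any measurement.
PRIORS = {"female": 0.5, "male": 0.5}


def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def gaussian_log_pdf(x, mean, variance):
    """Log density of a normal distribution evaluated at x."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def fit(training):
    """Estimate a (mean, variance) pair per feature for each class."""
    model = {}
    for label, rows in training.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        model[label] = [mean_and_variance(col) for col in columns]
    return model


def posterior(model, priors, sample):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    log_scores = {}
    for label, params in model.items():
        log_score = math.log(priors[label])
        for x, (mean, variance) in zip(sample, params):
            log_score += gaussian_log_pdf(x, mean, variance)
        log_scores[label] = log_score
    # Normalize in a numerically stable way so the posteriors sum to one.
    top = max(log_scores.values())
    unnormalized = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


MODEL = fit(TRAINING)
```

Calling `posterior(MODEL, PRIORS, (162.0, 56.0))` returns a probability for each class. Changing `PRIORS` shifts the answer even when the measurements stay the same, which is the point of the Bayesian approach. The same pattern would apply to lithology classes and log readings, given suitable training data.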
R’s capability for text analytics is demonstrated in the Bayesian spam detector. A single line of code removes ‘stop words’; further natural language processing performs word counts and more statistics to separate the spam from the ham. Again, industrial use cases of such techniques abound. Also of interest is the section on multi-dimensional ‘spatial’ analytics, which uses matrix operations like distance metrics and multi-dimensional scaling to put US senators’ voting patterns on the map.
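The spam detection steps described above (stop word removal, word counts, then a Bayesian decision) can be sketched as follows. This is an illustrative Python sketch, not the book's R code: the stop word list, the four training messages and the function names are all hypothetical, and add-one (Laplace) smoothing is an assumed choice for handling words unseen in a class.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (real lists are far longer).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "for",
              "you", "your", "at", "in", "on"}

# Invented training messages, labeled by hand for illustration.
TRAINING = [
    ("spam", "Win a free prize now"),
    ("spam", "Free money, claim your prize"),
    ("ham", "Meeting at noon to review the well logs"),
    ("ham", "Please review the seismic report for the meeting"),
]


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def train(messages):
    """Count words per class and the number of messages per class."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the class with the highest naive Bayes log score."""
    vocabulary = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Log prior: the fraction of training messages in this class.
        score = math.log(doc_counts[label] / total_docs)
        # Add-one smoothing keeps unseen words from zeroing the likelihood.
        denominator = sum(counts.values()) + len(vocabulary)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


WORD_COUNTS, DOC_COUNTS = train(TRAINING)
```

With this toy training set, `classify("claim your free prize", WORD_COUNTS, DOC_COUNTS)` scores the message against both classes and returns the more probable label. A realistic detector would need thousands of labeled messages and a proper stop word list.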
A couple of niggles: there are some unnecessary neologisms (what the heck is a ‘gization’?) and some graphs could do with better labeling. The authors warn against using R on really big data, suggesting that such work should be recoded in C. This is not necessarily the only option. A press release from Teradata this month announced an R interpreter for its high-end data appliance. R is also available in Spotfire’s statistics services layer.
* O’Reilly press 2012. ISBN 9781449303716.
This article originally appeared in Oil IT Journal 2012 Issue # 3.