Michel Lutz, group data officer with Total, provided the keynote to the 2017 EAGE workshop on data science for geosciences. Before joining Total, Lutz was a researcher at France’s LIMOS research unit and author of a book* on data science. Lutz traced the history of artificial intelligence and machine learning from their origins in 19th Century statistics to the current buzz driven by their use by Amazon, Google and others. Reinforcement learning was illustrated by Google’s DeepMind learning to play the Atari game Breakout.
Much of AI was developed in the 1960s and 70s. Today techniques such as neural nets and decision trees are being ‘democratized’ with open source software and accessible big data and compute resources that are ‘revealing the power of these methods.’ The GAFA** web giants have put these techniques at the heart of their business, boosting scientific and technology development. The challenge today is to ‘learn more from less data,’ extending these techniques into fields like geosciences, where labeled data may be scarce. In parallel there is a shift from learning to ‘understanding,’ as exemplified by the work of Stanford’s Karpathy on ‘deep visual-semantic alignments for generating image descriptions.’ But today these techniques require considerable data preparation and model tuning, and learning is limited to a specific task. More autonomy is required of ML. Caution is also required, as statistical learning may embed bias present in the input data.
AI is a broad field, based on old statistical foundations. Machine learning on big data sets is something new. Potential areas of application include textual competitor analysis, integrated reporting, imagery (cores, thin sections), and real time and static structured data. Total’s own efforts include a ‘competitors cruncher’ that mashes data from IHS, Rigzone and DrillingInfo. Another app predicts production in shale wells from decline curve analysis, augmented with data-driven analytics of well parameters and location. This has demonstrated predictions with an R²*** of over 0.8, said to be ‘exceptional’ for shale. Other trials include semantic analysis of technical documentation, seismic trace classification, biomarkers, HSE incident analysis, and real time analytics of blowout/kicks and rotating machinery. IBM Watson was trialed on an analysis of biomarker reports.
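For readers unfamiliar with the metric, the coefficient of determination compares a model's squared prediction errors with the spread of the observations around their mean: a value of 1 is a perfect fit, a value of 0 is no better than predicting the mean. The sketch below is a minimal Python illustration only. The function name `r_squared` and the production figures are hypothetical and have nothing to do with Total's data or model.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical monthly production figures (made up for illustration)
observed = [100.0, 80.0, 65.0, 55.0, 48.0]
predicted = [95.0, 82.0, 68.0, 52.0, 50.0]
print(round(r_squared(observed, predicted), 3))
```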
Total’s data science ecosystem is a smorgasbord of open source tools (RStudio, Python, Grafana, Bootstrap, Kibana, Hadoop, Spark...) alongside commercial products like Power BI, Spotfire and Tableau. The next challenge is to put algorithms into production and make the data science hype ‘disappear.’ On a closing note Lutz observed that ‘a high level of expertise in data management and governance is required to enable data science across the business.’
In the Q&A, Lutz was quizzed on the use of open source software, specifically on the likelihood of Total contributing open data to the community. Lutz admitted that today, Total is a consumer but recognizes the importance of giving back. The problem of convincing users of the validity of non-physics-based models was also raised. Lutz agreed that this is a big challenge, ‘not just in oil and gas.’
Alan Smith (Luchelan/Ovation) related a failed attempt to use Hadoop for seismic data management (DM). Conventional seismic DM holds an index in Oracle and data on disk. This causes problems when extracting data to build multi-client data sets, which requires manual intervention, ‘tape monkeys’ and so on. Hadoop has been presented as a panacea for manipulating data. Tests performed at Ovation Data found that data could be recovered fast, if one could accept unordered traces. When data order is important, ‘everything slows down.’ The conclusion: ‘don’t use Hadoop, it is not wonderful!’ Instead, Smith advocates putting the data into a NoSQL database and ‘using the principles behind Hadoop’ to speed up retrieval by a couple of orders of magnitude. Parallel workflows leverage multiple 10GB links into the cluster. The study was funded by InnovateUK; ICL provided Hadoop support.
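The talk as reported does not detail Smith's design. Purely as an illustration of the general idea (partitioned storage, parallel scans, then a merge that restores trace order), here is a minimal Python sketch. Everything in it is an assumption: the in-memory dictionaries stand in for a key-value store, the (inline, crossline) key is a hypothetical choice, and `make_partitions`, `scan_partition` and `fetch_inline` are made-up names.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def make_partitions(n_inlines, n_crosslines, n_partitions):
    """Build hypothetical partitions mapping (inline, crossline) -> trace."""
    partitions = [dict() for _ in range(n_partitions)]
    for il in range(n_inlines):
        for xl in range(n_crosslines):
            # Round-robin placement scatters neighbouring traces
            target = (il * n_crosslines + xl) % n_partitions
            partitions[target][(il, xl)] = f"trace-{il}-{xl}"
    return partitions


def scan_partition(partition, inline):
    """Return one partition's traces for an inline, sorted by key."""
    return sorted((k, v) for k, v in partition.items() if k[0] == inline)


def fetch_inline(partitions, inline):
    """Scan all partitions in parallel, then merge the sorted runs."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        runs = list(pool.map(lambda p: scan_partition(p, inline), partitions))
    # Each run is already sorted, so a k-way merge restores global order
    return list(heapq.merge(*runs))


partitions = make_partitions(n_inlines=4, n_crosslines=6, n_partitions=3)
ordered = fetch_inline(partitions, inline=2)
print([key for key, _ in ordered])
```

The point of the sketch is that ordering is restored by merging small sorted runs rather than by sorting the whole result after the fact. Whether this resembles the Ovation implementation is not known from the report.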
Turning to the sharp end of the big data spectrum, Hadi Jamali-Rad presented Shell’s wireless internet of things (IoT) applications. Shell has standardized its land seismic and pipeline sensor networks on LoRaWAN, a cheap, low power, long range solution that connects into the cloud for ‘world wide accessibility.’ LoRa has undergone extensive testing on the 40x40 km Groningen microseismic interferometer network. One test sent LoRa data via a balloon and on through four service providers over a 354 km range.
Teradata’s Duncan Irving traced the split, a couple of decades ago, between IT and operations technology (OT). Upstream users live in the OT field, which is characterized by point solutions, big data silos and ‘the rise of application users as opposed to scientists.’ Meanwhile business-at-large has captured IT with at-scale analytics on Hadoop. Geosciences’ HPC architectures were designed for physics, not analytics, and there is a cultural resistance to big data. Formats like SEG-Y/D are ‘hard to crack open and use by the data scientist.’ Jane McConnell took over to advocate a subsurface data lake providing access to geoscience data. But, ‘you still need a data model,’ leveraging PPDM and Energistics concepts. Another litany of open source tools was proposed (NiFi, Kylo, PostgreSQL, MySQL, MariaDB…) along with the Teradata ‘Think Big’ data science lab used at the Hackathon (page 9).
Matt Hall (Agile Geosciences) is ‘fostering a high impact ML ecosystem for subsurface and engineering.’ Hall forecasts a data revolution ‘akin to stacking or seismic stratigraphy.’ DeepMind/AlphaGo have shown the way forward, but the bandwidth and scale of geoscience data are ‘unusual’ and ‘expensive.’ Figuring the 4D history of the planet ‘is harder than predicting a film preference on Netflix.’ Another caveat: AlphaGo is ‘not open source, not reproducible.’ Hall plugged the Cornell arXiv.org as the preferred outlet for open source/reproducible publishing. Big data also has ‘endemic problems’ with data quality and units of measure. In Canada there is ‘unpleasant litigation’ over ownership of seismic data and what is/should be in the public domain. Elsewhere, the MNIST/Iris datasets are ‘very useful.’
More in our next issue.
* Data Science: Fondamentaux et études de cas, Eyrolles.
** Google, Apple, Facebook, Amazon.
*** Coefficient of determination.
© Oil IT Journal - all rights reserved.