Common Data Access’ unstructured data challenge

Open source software’s field day at UK upstream linguistic analysis testbed.

The results of the unstructured data challenge run by Common Data Access (CDA), the UK’s joint industry oil and gas data body, were presented late last year. Nine companies took up the challenge of ‘using data and linguistic analytics’ on a heterogeneous, multi-terabyte North Sea dataset.

Both New Digital Business and Venture IM teamed with Cray on an analytics pipeline that embedded Cray’s Graph Engine running on a Cray Urika-GX supercomputer. In fact Cray went all-in on this effort, providing a comprehensive open source stack that used Apache Tika to parse the plethora of file formats. Tika’s optical character recognition function was used to read scanned images. In all some 3.5 terabytes of data were ingested and indexed. Cray is very keen to promote its graph technology in this context, ‘graph will be a significant part of any similar projects because G & G data is so interconnected,’ and is offering ‘unlimited free access’ to its Urika-GX for further research.

For most, analyzing this multi-discipline data set involved some sort of taxonomic analysis to align the different terminologies used. Flare Solutions, a specialist in the field, showed how its technology is used to build a synthetic well file by classifying documents against the taxonomy, identifying key industry information products. Flare noted that ‘having some structure in unstructured information supports text analytics.’ The CDA data set ‘stress tested’ current classifications and represented a learning opportunity as many additional synonyms were found, highlighting terminology variations across organizations. Flare is also planning a move to a graph model.
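
To illustrate the general idea of classifying documents against a taxonomy with synonyms, here is a minimal Python sketch. It is not Flare’s technology: the taxonomy entries, the synonym lists and the function names are all hypothetical, and a simple phrase count stands in for whatever scoring a commercial classifier applies.

```python
import re
from collections import Counter

# Hypothetical mini-taxonomy: preferred term -> synonyms used by different organizations.
TAXONOMY = {
    "final well report": ["end of well report", "well completion report"],
    "composite log": ["composite well log", "completion log"],
    "core analysis report": ["core report", "scal report"],
}


def build_patterns(taxonomy):
    """Compile one case-insensitive pattern per preferred term, covering its synonyms."""
    patterns = {}
    for preferred, synonyms in taxonomy.items():
        terms = sorted([preferred, *synonyms], key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in terms)
        patterns[preferred] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def classify(text, patterns):
    """Return the preferred term whose phrases occur most often, or None if none occur."""
    normalized = " ".join(text.split())  # collapse line breaks and repeated spaces
    counts = Counter()
    for preferred, pattern in patterns.items():
        hits = len(pattern.findall(normalized))
        if hits:
            counts[preferred] = hits
    return counts.most_common(1)[0][0] if counts else None


patterns = build_patterns(TAXONOMY)
label = classify("END OF WELL REPORT\nThis end of well report covers the drilling phase.", patterns)
```

Each newly discovered synonym only needs to be appended to the relevant list, which is one way a test data set can feed back into the classification.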

Hampton Data Services teamed with Zorroa, whose convolutional neural net technology was used to extract information from scanned images and classify report types prior to OCR. Iterative fuzzy search and guided machine learning provided a mechanism for improving classification with use.
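
The notion of fuzzy search that improves with use can be sketched as follows. This is an illustrative assumption, not the Hampton or Zorroa implementation: it uses Python’s standard difflib module in place of their search technology, and the class name, report types and cutoff value are hypothetical.

```python
import difflib


class ReportTypeMatcher:
    """Fuzzy-match noisy OCR titles to report types, learning from user confirmations."""

    def __init__(self, report_types, cutoff=0.8):
        # Map each known (lower-cased) spelling to its canonical report type.
        self.known = {name.lower(): name for name in report_types}
        self.cutoff = cutoff

    @staticmethod
    def _normalize(text):
        return " ".join(text.lower().split())

    def match(self, ocr_text):
        """Return the best matching report type, or None if nothing is close enough."""
        key = self._normalize(ocr_text)
        if key in self.known:
            return self.known[key]
        close = difflib.get_close_matches(key, list(self.known), n=1, cutoff=self.cutoff)
        return self.known[close[0]] if close else None

    def confirm(self, ocr_text, report_type):
        """Record a user-confirmed spelling so later lookups resolve directly."""
        self.known[self._normalize(ocr_text)] = report_type


matcher = ReportTypeMatcher(["Geological Report", "Drilling Report"])
first_guess = matcher.match("Geo1ogical Rep0rt")   # mild OCR noise
unknown = matcher.match("G3ol Rpt")                # too garbled for the cutoff
matcher.confirm("G3ol Rpt", "Geological Report")   # guided correction by a user
learned = matcher.match("G3ol Rpt")
```

Here the ‘guided’ step is simply a human confirming a match; a production system would presumably retrain a model rather than extend a lookup table.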

Independent Data Services also pitched in with an open source-based solution, using Tesseract OCR and the OpenStack private cloud that embeds Elasticsearch (web-based text search), Logstash (document processing) and Kibana (data exploration). The stack enables mining of structured and unstructured data.
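
As a rough indication of how OCR output might be handed to such a stack, the sketch below builds a request body in the newline-delimited JSON format that Elasticsearch’s bulk API accepts (an action line followed by a document line). It is an assumption for illustration only, not Independent Data Services’ pipeline: the index name, document IDs and field names are hypothetical, and nothing is actually sent to a server.

```python
import json


def build_bulk_body(index_name, documents):
    """Build a newline-delimited JSON body: one action line, then one source line, per document."""
    lines = []
    for doc_id, doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents: text as it might come out of an OCR step.
ocr_documents = [
    ("doc-1", {"source_file": "scan_0001.tif", "text": "End of well report"}),
    ("doc-2", {"source_file": "scan_0002.tif", "text": "Composite log, scale 1:500"}),
]

bulk_body = build_bulk_body("cda-documents", ocr_documents)
```

The resulting string would be posted to the cluster’s bulk endpoint, after which the indexed text becomes searchable and can be explored in Kibana.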

Agile Data Decisions demoed its iQC tool, observing that it is better to maximize the use of what structured database information is available rather than trying to extract information from unstructured documents. Again, open source software predominated with Python-based machine learning and a Hadoop ecosystem.

Schlumberger used Wipro Holmes and Solr to derive value from unstructured data, although the slide set is rather light on the details!

CDA project manager Dan Brown summarized the outcome of the exercise for Oil IT Journal: ‘We have incorporated an analytics program in our business plan, building on what we’ve learned from the challenge, and are considering working with other industry bodies to deliver a second challenge for 2017, in the seismic domain. We are continuing to share lessons learned from the challenge with a second ECIM workshop in Stavanger and we will be presenting the results at upcoming conferences. The key data management lesson is that modern analytical techniques depend upon access to large quantities of high quality, well organized data, for training and tool evaluation, as well as problem solving. There is a clear role for national data repositories in facilitating this.’

Read the CDA Challenge presentations here.


Although this is unrelated to the CDA challenge, those interested in such matters may like to follow the ‘TextExt’ DBpedia Open Text Extraction Challenge.

© Oil IT Journal - all rights reserved.