At a recent EAGE event, Jérôme Massot (now with Schlumberger) reported on his work for former employer Total on semantic AI and knowledge extraction in geosciences. The project, carried out in Total’s embedded location in Google’s TC3 building in Sunnyvale, CA, set out to leverage natural language processing (NLP) to help Total’s carbon capture and storage (CCS) researchers exploit the ‘tsunami’ of papers. In 2020, a search for CCS on the Science Direct website retrieved some 40,000 papers.
A straightforward Google search is a good starting point when looking for specific information. At a general level, knowledge management tools can improve search with a degree of natural language understanding, intelligent ranking and summarization (à la Wikipedia). But for cutting-edge topics like CCS, these generalist strategies fall short.
Enter the conceptual ‘bibliography smart assistant for geosciences’. This works across a document lake, providing content extraction, understanding and NL-generated output. For an oil company, the data lake will span private data, made up of documents of high business value and quality. Alongside is the public data lake. This is bigger, but may be locked: ‘science is not free’. There are licensing issues and doubts as to document veracity. On archive.org, ‘anything goes, there is no peer review’.
Starting in 2018*, Total set out to create a geoscience-specific equivalent of Google Assistant. Content extraction proved very challenging. Text was extracted from PDF documents with Apache Tika, PDF2Text and Grobid, a machine learning package for extracting information from scholarly documents.
Extracted text is then processed into different ‘representations’ (words, sentences, paragraphs), cleansed and tokenized into ‘n-grams’, i.e. groups of adjacent words. Embeddings assign a numeric vector to each word based on the contexts in which it appears. Massot’s team developed embeddings for geoscience using the FastText open source library. With some tuning, the system could return word neighbors. A search for ‘migration’ could automatically return related concepts such as sandstone, beach or reef. No ‘fastidious’ annotation of documents is required.
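The tokenize, n-gram and word-neighbor steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Total's pipeline: the corpus is invented, all function names are hypothetical, and simple co-occurrence counts stand in for trained FastText vectors so that the example runs with the Python standard library alone.

```python
import math
import re
from collections import Counter, defaultdict

# Toy corpus: invented sentences, NOT taken from the project's documents.
CORPUS = [
    "CO2 migration through the sandstone reservoir was monitored",
    "Fluid migration through the sandstone aquifer was modelled",
    "The caprock seal above the reservoir limits leakage",
    "The caprock seal above the aquifer limits leakage",
]


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def ngrams(tokens, n):
    """Return every contiguous group of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it.

    Assumption: these sparse count vectors are a stand-in for the dense
    vectors that a trained embedding model would provide.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighbors(word, vectors, k=3):
    """Return the k words whose context vectors are most similar to `word`."""
    target = vectors.get(word)
    if target is None:
        return []  # word never seen in the corpus
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    print(ngrams(tokenize(CORPUS[0]), 2))
    print(neighbors("reservoir", cooccurrence_vectors(CORPUS)))
```

On this toy corpus, ‘aquifer’ ranks as the nearest neighbor of ‘reservoir’ because the two words occur in near-identical contexts. No annotation is involved: the neighbors fall out of the raw text, which is the property the article attributes to the embedding approach.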
However, Google’s ‘BERT’ language model covers a general knowledge domain and is not fit for purpose on such specialist documents. Other communities (finance with FinBERT, biology with BioBERT) have retrained the model on smaller domain-specific corpora. This has proved harder for geoscience, as labeled data is generally lacking. ‘There is no CCUSBert!’ Trials with OpenAI’s GPT-2 model produced nice-sounding, meaningless text.
Massot believes that there is a need for a shared open source geoscience ontology and corpus of academic papers and technical documents. He sees a role for the EAGE here. Solutions need to be developed by geoscientists in collaboration with data scientists.
* See the 2018 release.
Those interested in geo-ontology should read Paul Cleverley’s blog where he presents his work dating back to 2016 and points to a somewhat plagiaristic 2019 Schlumberger paper on ‘GilBERT’, Geologically informed language modeling with BERT.
© Oil IT Journal - all rights reserved.