Neo4j extends graph database with NLP and AI functionality

User Caterpillar reports 27 million parsed and tagged phrases in document repository.

A new release of the Neo4j graph database embeds natural language processing functionality used by flagship customer Caterpillar. Neo4j 3.5 now offers artificial intelligence (AI) and machine learning (ML) functionality. A talk by Caterpillar’s Ryan Chandler at the 2017 Neo4j GraphConnect event in New York showed how the graph database can serve as a foundation for enterprise AI applications, capturing facts and relationships among people, processes, applications, data and machines.

Chandler, Caterpillar’s chief data scientist, has applied natural language processing to a 100,000-document repository covering Caterpillar’s supply, maintenance and repair operations. The new functionality in Neo4j enables large-scale analysis of text for meaning representation and automatic reading. According to The Data Warehousing Institute, around half of all enterprise data is unstructured. This is the knowledge that Cat wants to tap into. Language processing has two schools, one built on ‘dependency’ structures and the other on ‘constituency’ structures. Both are graphs, so the overriding principle is, ‘parse your text into a graph’.
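
The ‘parse your text into a graph’ principle can be sketched in a few lines of Python. This is an illustration only, not Caterpillar’s pipeline: the sentence, the hand-written (head, relation, dependent) triples and the helper names are all assumptions, and a real system would obtain the triples from a dependency parser.

```python
from collections import defaultdict

# Hand-written dependency triples (head, relation, dependent) for the
# sentence "Caterpillar shipped trucks to Asia". The relation labels are
# illustrative; a real parser would supply them.
TRIPLES = [
    ("shipped", "nsubj", "Caterpillar"),
    ("shipped", "dobj", "trucks"),
    ("shipped", "nmod", "Asia"),
    ("Asia", "case", "to"),
]


def build_graph(triples):
    """Return an adjacency map: head word -> list of (relation, dependent)."""
    graph = defaultdict(list)
    for head, relation, dependent in triples:
        graph[head].append((relation, dependent))
    return dict(graph)


def dependents(graph, head, relation):
    """List the words attached to `head` by the given relation."""
    return [dep for rel, dep in graph.get(head, []) if rel == relation]


graph = build_graph(TRIPLES)
print(dependents(graph, "shipped", "dobj"))
```

Once the sentence is held as nodes and labeled edges, a question such as ‘what was shipped?’ becomes a graph lookup rather than a string search.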

Document repositories grow constantly: in business intelligence systems there is always that ‘next report’. Caterpillar ties different documents together by linking on part number, facility identifier and so on. One use case is the development of a natural language dialog system that allows queries such as ‘how many of this particular part were shipped to Asia?’ Semantic analysis splits text into components: numerical counts, nouns (truck), verbs (sold) and dates matched by regular expressions. Queries can expand to ‘how many trucks were manufactured around this date and shipped to Asia?’ This requires a dictionary of synonyms (build, produce, manufacture …), all built into the graph. The Google speech-to-text API allows queries in spoken natural language.
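
A minimal sketch of this kind of query decomposition follows. It assumes a tiny hand-made lexicon held in Python dictionaries (the talk describes synonyms stored in the graph) and ISO-formatted dates; the names `VERB_SYNONYMS`, `NOUNS`, `REGIONS` and `parse_query` are hypothetical.

```python
import re

# Hypothetical mini-lexicon mapping a canonical term to its surface forms.
VERB_SYNONYMS = {
    "manufacture": {"manufactured", "built", "produced"},
    "ship": {"shipped", "sent", "delivered"},
    "sell": {"sold"},
}
NOUNS = {"truck": {"truck", "trucks"}, "part": {"part", "parts"}}
REGIONS = {"asia", "europe"}
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO dates only, by assumption


def canonical_terms(words, table):
    """Map surface forms to canonical terms, keeping first-seen order."""
    found = []
    for word in words:
        for key, forms in table.items():
            if word in forms and key not in found:
                found.append(key)
    return found


def parse_query(text):
    """Split a question into count flag, nouns, verbs, regions and dates."""
    lowered = text.lower()
    words = re.findall(r"[a-z]+", lowered)
    return {
        "count": "how many" in lowered,
        "nouns": canonical_terms(words, NOUNS),
        "verbs": canonical_terms(words, VERB_SYNONYMS),
        "regions": [w for w in words if w in REGIONS],
        "dates": DATE_RE.findall(text),
    }


parsed = parse_query(
    "How many trucks were manufactured around 2017-10-24 and shipped to Asia?"
)
print(parsed)
```

Mapping ‘built’, ‘produced’ and ‘manufactured’ to one canonical verb is what lets differently worded questions resolve to the same query.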

The next step is to ‘read at scale’ to extract more meaning, especially from warranty documents, an ‘excellent primary source’. So if a document reports ‘engine knocking’, an oil test can be initiated, and the root cause and recommended solution identified.

Caterpillar has parsed and tagged some 27 million phrases in its repository. A pipeline comprising a Python NLP toolkit, an ML classifier and ‘R’ leverages the WordNet lexical data and Stanford’s ‘S NLP’ dependency analyzer. Half of Cat’s items were already tagged and were used to train a classifier that tagged the other half. S NLP was found to be a great improvement over a naive ‘broken is bad’, ‘bucket is equipment’ approach. For example, the proximity of words may reveal that it was the bucket adapter that was broken.
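
The difference between the naive approach and a dependency-based one can be illustrated as follows. This is a sketch under stated assumptions: the word lists and the dependency triples for ‘the bucket adapter was broken’ are hand-written stand-ins for what a lexicon and a dependency analyzer would provide, and the function names are hypothetical.

```python
NEGATIVE_STATES = {"broken", "cracked", "leaking"}  # hypothetical word list
EQUIPMENT = {"bucket", "adapter", "engine"}         # hypothetical word list


def naive_tags(tokens):
    """Tag each word in isolation: 'broken is bad', 'bucket is equipment'."""
    tags = []
    for token in tokens:
        if token in NEGATIVE_STATES:
            tags.append((token, "bad"))
        elif token in EQUIPMENT:
            tags.append((token, "equipment"))
    return tags


# Hand-written (head, relation, dependent) triples for
# "the bucket adapter was broken"; a dependency analyzer would supply these.
TRIPLES = [
    ("broken", "nsubjpass", "adapter"),
    ("broken", "auxpass", "was"),
    ("adapter", "compound", "bucket"),
    ("adapter", "det", "the"),
]


def faulty_component(triples):
    """Return the noun phrase that is the subject of a negative state."""
    for head, relation, dependent in triples:
        if head in NEGATIVE_STATES and relation in ("nsubj", "nsubjpass"):
            modifiers = [
                d for h, r, d in triples if h == dependent and r == "compound"
            ]
            return " ".join(modifiers + [dependent])
    return None


print(naive_tags("the bucket adapter was broken".split()))
print(faulty_component(TRIPLES))
```

The naive tagger reports two separate pieces of equipment and a fault without saying which one failed, whereas following the dependency edges attaches the fault to the whole phrase ‘bucket adapter’.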

The approach is computationally expensive, especially with 27 million phrases. But even shallow parsing can extract meaning at scale. The open source WordNet was a ‘free’ bonus for fine-tuning and constraining definitions. The next step is to add in some VR with the Oculus (and the Unity game engine) and to filter on time for diachronic document search. Cat is now working with the NCSA on a theoretical model for graph/text/analysis and more ‘semantic analysis at scale’.

For its part, Neo4j has added full-text search to the graph database, enabling text-intensive graph applications such as knowledge graphs, metadata management and bill of materials, along with AI extensions and support for the ‘Go’ programming language.
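
As a rough illustration of what full-text search in the graph looks like, the Python snippet below only assembles the Cypher statements; it does not connect to a database. The `db.index.fulltext.createNodeIndex` and `db.index.fulltext.queryNodes` procedures are those documented for Neo4j 3.5, while the index name `documentText`, the `Document` label, the `text` property and the helper `search_request` are assumptions made for the example.

```python
# Cypher to create a full-text index over the `text` property of
# `Document` nodes (index name, label and property are hypothetical).
CREATE_INDEX = (
    "CALL db.index.fulltext.createNodeIndex("
    '"documentText", ["Document"], ["text"])'
)

# Cypher to query that index; $terms is a query parameter.
SEARCH = (
    'CALL db.index.fulltext.queryNodes("documentText", $terms) '
    "YIELD node, score "
    "RETURN node.text AS text, score LIMIT 10"
)


def search_request(terms):
    """Return the Cypher statement and parameters for a full-text search."""
    return SEARCH, {"terms": terms}


query, params = search_request("engine knocking")
print(query)
print(params)
```

Running these statements requires a Neo4j 3.5 instance and a driver session, neither of which is shown here.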

More from Neo4J. Watch Chandler on YouTube.

© Oil IT Journal - all rights reserved.