BP’s data lake

Open source software returns in force to BP as the ‘R’ programming language, Hadoop, Lucene and Solr are deployed in a service-oriented information architecture. Move to Amazon cloud in progress.

BP presentations at the SPE’s digital energy (DE) event and PPDM’s annual Houston event demonstrate the growing role of open source software in high-end upstream information management. At DE, Mohamed Sidahmed presented work that uses the ‘R’ statistical programming language to ‘augment’ operations monitoring by mining unstructured drilling reports. Unstructured textual data, hitherto the ‘missing link’ in the information workflow, contains valuable information on the root causes of deviations from plan and helps address ‘inadequate reaction to real time changes.’

R-based text analytics leverage the collective knowledge stored in BP’s Well Advisor, looking for interesting patterns. Visual representations (word clouds) integrate with existing surveillance systems and can provide early warning of, for instance, pump failure. More sophisticated techniques such as ‘latent Dirichlet allocation’ help identify precursor events hidden in the data. Reports with similar content can then be attached to the root causes of non-productive time and rarer high-impact events. Data-driven learning is now embedded in BP’s CoRE real time environment.
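
The term counts behind a word cloud can be sketched in a few lines. BP’s work was done in R; the Python below is only an illustration. The stop-word list, the function name and the three sample reports are all invented for the example and are not BP data.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "at", "of", "to", "in", "on",
              "was", "with", "after"}


def term_frequencies(reports):
    """Count how often each non-stop-word term appears across all reports."""
    counts = Counter()
    for text in reports:
        # Lower-case the text and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


# Invented drilling-report snippets, for illustration only.
REPORTS = [
    "Pump pressure dropped at 3200 m, pump restarted after inspection.",
    "Lost circulation observed; pump pressure unstable.",
    "Stuck pipe event, circulation restored after jarring.",
]

if __name__ == "__main__":
    # The most frequent terms are the ones a word cloud would draw largest.
    for term, n in term_frequencies(REPORTS).most_common(5):
        print(term, n)
```

A word cloud simply scales each term by its count. Topic models such as latent Dirichlet allocation start from the same kind of term counts but go further, grouping terms that tend to occur together.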

Meanwhile, at the PPDM Houston data management symposium, Meena Sundaram presented a ‘self service’ architecture deployed at BP’s Lower 48/Gulf of Mexico unit. BP’s service-oriented architecture is now up and running. Applications and data sources are exposed as REST endpoints, ‘providing scalability and adaptability to technology innovation.’ The infrastructure stack builds on a Cloudera ‘data lake’ of over 35 domain-specific data sources. These feed into business intelligence and descriptive analytics applications. The system also supports enterprise-level activities from production accounting to budgets and reserves reporting, along with bespoke ‘on demand’ business scenarios and ad hoc queries.
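
What exposing data sources as REST endpoints might look like can be sketched with Python’s standard WSGI interface. This is an illustration under stated assumptions, not BP’s implementation: the URL layout (`/sources`, `/sources/<name>`), the source names and the records are all hypothetical.

```python
import json

# Hypothetical registry of data sources; names and records are invented.
DATA_SOURCES = {
    "wells": [{"id": "W-001", "status": "producing"}],
    "production": [{"well": "W-001", "oil_bbl": 1200}],
}


def app(environ, start_response):
    """Minimal WSGI app: GET /sources lists the sources,
    GET /sources/<name> returns that source's records."""
    parts = [p for p in environ.get("PATH_INFO", "").split("/") if p]
    if environ.get("REQUEST_METHOD") != "GET":
        status, body = "405 Method Not Allowed", {"error": "GET only"}
    elif parts == ["sources"]:
        status, body = "200 OK", {"sources": sorted(DATA_SOURCES)}
    elif len(parts) == 2 and parts[0] == "sources" and parts[1] in DATA_SOURCES:
        status, body = "200 OK", {"records": DATA_SOURCES[parts[1]]}
    else:
        status, body = "404 Not Found", {"error": "unknown resource"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def call(path, method="GET"):
    """Invoke the app in-process and return (status, decoded JSON body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app({"PATH_INFO": path, "REQUEST_METHOD": method}, start_response)
    return captured["status"], json.loads(b"".join(chunks))


if __name__ == "__main__":
    print(call("/sources"))
    print(call("/sources/wells"))
```

Because every source sits behind the same URL scheme, a new data source becomes available to all client applications once it is added to the registry. To try the sketch over HTTP, the app can be served with `wsgiref.simple_server.make_server` from the standard library.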

Sundaram qualifies enterprise-level data access as a ‘chicken and egg problem. Do you clean the data or show the data?’ BP has opted for the ‘show’ option, along with governance and data improvement with use. Today the data lake uses Cloudera MapReduce/HDFS, Voyager GIS data discovery and Amazon CloudSearch. ‘Big data’ tools including Solr and Lucene are also used. The toolset is now evolving to offer prescriptive analytics. BP’s near-term goal is to move the supporting infrastructure to the Amazon Web Services cloud. More from the PPDM Houston conference and from SPE Digital Energy in the next edition of Oil IT Journal.

© Oil IT Journal - all rights reserved.