PNEC 2020 Online

Ecopetrol deploys Kadme Whereoil cloud platform. PPDM professionalizes the data managers. Troika on seismic formats, SEG and OSDU. Rice University curriculum evolves towards data science. ExxonMobil at forefront of seismic data management. Total moots hybrid, on-premise/cloud data solution, downplays OSDU. Denodo data warehouse for Oxy/Anadarko. Apache Airflow data ingestion for Schlumberger’s Delfi. Noble Energy cleans data with InnerLogix. Bluware on the true costs and gotchas of data in the cloud. Shared data enhances Woodmac Analytics Lab model. LEK Consulting compares digital maturity across industries.

Gustavo Londono presented NOC Ecopetrol’s cloud-based upstream data platform that leverages Kadme’s Whereoil* technology to provide ‘fast and complete’ data access and advanced natural language search. Ecopetrol’s decade-old legacy system had ‘reached its limits’ and the time spent looking for geoscience data was back up to ‘around 80%’. Two years ago, the company started work on a transformation and contracted with Kadme for the map-based, in-cloud solution. The automated data consolidation platform includes an NLP capability across 2.5 million documents and 800k logs. Data enrichment adds quality flags, deduping, text from PDF/scans and georeferenced documents for an area of interest (AOI) search. Data migration began in October 2019 and the system launched to 800 concurrent users a year later. Ecopetrol’s legacy systems have been decommissioned and the company is preparing for ML/AI models in a mature data lake. The solution also manages check-out and return of physical data objects. Teams collaborate via the cloud.

* Kadme’s Whereoil is also used by YPF as we reported last year.

Cynthia Schwendeman (BP) and Patrick Meroney (Katalyst Data Management) outlined work done in PPDM’s Professional Development Committee. The PDC has surveyed some 500 data managers across industry, finding ‘high variability’ in job descriptions. PPDM has taken it upon itself to standardize these and provide guidance on roles and responsibilities, from business data owner/steward, business analyst and project manager to data scientist, with ‘six simple role descriptions’. Subject to board approval, the PDC approach is to extend into midstream/downstream and renewables.

Jill Lewis (Troika) observed that the standards from organizations such as the IOGP, SEG and Energistics are all free. The navigation standards from the IOGP’s ‘great mathematical minds’ have earned world-wide recognition. These are now shared positioning standards across IOGP, SEG and Energistics, with a special mention for the latter’s units of measure work. The question now arises as to why the industry at large has not ‘moved with us’. Many still use SEG-Y Rev 0! Rev 1 was released in 2000 but not much used. In 2017, Rev 2 came along with a machine-readable implementation suited for automation and machine learning applications. Turning to OSDU, Lewis observed that OSDU release 2 supports both SEG-Y Rev 2 and (the somewhat competing) OpenVDS format. Overall, take-up for Rev 2 has been ‘pretty poor’ but support is coming from some NDRs with, notably, its inclusion in the NPD Yellow Book. In-country regulations may also mandate storage in a neutral format such as SEG-Y. Lewis expressed disappointment that OSDU was not engaging more with the SEG standards committee.
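For the curious, the revision a SEG-Y file claims can be checked programmatically. The sketch below, a hedged illustration rather than a full reader, pulls the revision field from the 400-byte binary header (file bytes 3501–3502, i.e. offset 300 within the binary header) where Rev 1 writes 0x0100 and Rev 2 writes major/minor as two one-byte fields. The synthetic header is invented for the example.

```python
# Read the SEG-Y revision field from a binary header. Offset 300 within
# the 400-byte binary header holds major/minor revision bytes per the
# SEG-Y Rev 1/Rev 2 layouts. Header content here is synthetic.
import struct

def segy_revision(binary_header: bytes):
    """Return (major, minor) revision from a 400-byte binary header."""
    major, minor = struct.unpack_from(">BB", binary_header, 300)
    return major, minor

hdr = bytearray(400)
hdr[300:302] = b"\x02\x00"  # a Rev 2.0 file
print(segy_revision(bytes(hdr)))  # → (2, 0)
```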

Dagmar Beck (Rice University) gave the Tuesday keynote presentation on how university education is evolving to meet workforce needs. These center on a need for data scientists that are ready to work in the ‘era of big data that encompasses a wide range of industries and businesses’. Rice University’s professional science master’s programs (PSM) prepare post grads for management in technology-based companies, governmental agencies and not-for-profits. These programs combine study in science or mathematics with coursework in management, policy, or law. The science and business programs were initiated in 1997 by the Sloan Foundation and Rice was one of the first to get Sloan funding for its subsurface geosciences program and others. The PSM programs evolve as per industry desiderata; today this means energy data management, data science and data governance. All of which is now bundled into the Rice Energy Data Management Certificate program. The data science component is provided by the Rice D2K Initiative.

Yuriy Gubanov put ExxonMobil ‘at the forefront’ of revolutionizing data management practices. Today ExxonMobil is developing a cloud infrastructure to move its seismic data from storage ‘in salt mine caves’ to long-term geoscience archives in the cloud. ‘Blob storage has replaced boxes of tapes and hard copies’. Exxon’s seismic data is marked for ‘indefinite retention’, stored on tape, but often in legacy formats that can be hard to read. A large amount of data is ‘underutilized’. Gubanov advocates digitizing everything in open formats to avoid proprietary lock-in.

The biggest challenge for cloud storage is up/download in this complicated architecture. Data egress charges can be high and security is an issue. But the cloud offers a lot in terms of APIs and a virtual desktop infrastructure. Exxon has evolved a ‘cloud first’ architecture with data access via an API. Tapes go into cloud blobs (binary large objects). The cloud is making data ‘ready for ML’.

Having said that, Exxon is ‘not yet done’. There are challenges. Some divisions hold back on cloud adoption. The cloud is not free. Data sovereignty can be an issue in some jurisdictions where it may be necessary to run an application in a local data center. There are also additional use cases coming out of nowhere which may challenge assumptions economically and technically. Today, Exxon has migrated terabytes, but there are ‘petabytes left!’ QC needs more work and third party integration needs fine tuning.

And (speaking of additional use cases) there is OSDU which ‘prompted us to look at our architecture’. There is overlap of several components, but much is complementary and ‘we may make our APIs OSDU compatible’. Gubanov concluded that the move to the cloud is both an enabler and an ‘inspirational goal’. But one that ‘both IT and the business are committed to’.

In the Q&A Gubanov elaborated on the relationship with OSDU. Ingestion and cloud services overlap, but workflows are complementary as is the API. Some stuff that is not compatible will need some refactoring, ‘mostly on our side’. The main geoscience workflows are not cloud based which makes for data egress charges. Exxon is experimenting with ‘express style’ connections to the cloud to optimize up/down load. Exxon is also migrating complete workflows to the cloud to minimize data movement.

Hilal Mentes presented Total’s ‘innovative approach’ to seismic data management. Total has developed in-house, on-premise data stores for seismic (DMS), well data (DMW) and interpretation results data storage (IRDS)*. The current system works well but is said to scale poorly and is ‘not amenable to AI/ML’. The ‘to be’ situation envisages ‘more autonomy, collaboration and standardization’. In other words, ‘a data lake’. The idea is for a hybrid, on-premise/cloud solution as a future replacement for DMS.

Mentes cited the work of OSDU in the context of a cloud-based, single source of the truth. The OSDU API promises a path to learning and sharing best practices for a cloud migration. However, the OSDU plan in Total has been cancelled**. The future is less clear. ‘We will probably start using a cloud solution, it may be OSDU’. More work is needed to develop ‘an efficient data platform that will open the way for the digital transformation’.

* The databases are a component of Total’s Sismage-CIG (Geoscience and Reservoir Integrated Platform)

** One issue cited by Mentes is whether an OSDU-based cloud system will be able to handle real-world seismic data loads and what the performance would be compared to Total’s current systems.

Ravi Shankar (Denodo) presented a case study of Denodo’s work for Oxy/Anadarko evolving a ‘traditional’ data warehouse into a ‘logical’ data warehouse. The old data warehouse is not so good at capturing unstructured and other novel data types. Attempts to create data lakes in Hadoop have led to more data silos. The ‘single source of truth’ remained elusive. ‘Only 17% of Hadoop deployments are in production’. Enter the logical data warehouse (LDW), a recognition that not all data can reside in a single location*. The data lake and data warehouse are complementary and the LDW overlays both. Denodo’s LDW/data virtualization allows for data abstraction and per-user/persona-based access. Queries are run across original data in situ and are said to be faster than Hadoop. A multi-cloud potential is claimed.

More on Denodo’s work with Anadarko here.

* The LDW has echoes of many earlier virtual data stores and data virtualization efforts such as OVS, Petris (now Schlumberger), Tibco and others.

In a similar vein, Anubhav Kohli (Schlumberger) reported on the shift from on-premises to cloud-based data warehouses and unstructured data lakes. These have created problems for data managers trying to combine existing workflows with the new platforms. Enter Apache Airflow, an open-source workflow manager originally developed by Airbnb and now used by Adobe, Google, ING and others. In Airflow, workflows are represented as directed acyclic graphs (DAGs), collections of different tasks that can be run sequentially or in parallel. DAGs are suited to modern deployment mechanisms such as Docker and Kubernetes.

A ‘data to Delfi’ workflow showed how the Airflow GUI provides click-through access to tasks and code. A DAG is a collection of tasks, dependencies and operators. These can be assembled into a workflow using Python, with metrics on performance etc. Airflow has been selected as the main workflow engine for OSDU, the Open Subsurface Data Universe.
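The execution model behind Airflow can be sketched without Airflow itself: a DAG is just tasks plus dependencies, run in dependency order. The task names below are invented for illustration and the scheduling is simplified to a topological sort; real Airflow DAGs wrap each task in an operator (e.g. PythonOperator) and add scheduling, retries and monitoring.

```python
# A minimal DAG in the Airflow sense: tasks with upstream dependencies,
# executed so that dependencies always run before dependents.
from graphlib import TopologicalSorter

# task: {tasks it depends on} — an ingestion-style chain, names illustrative
tasks = {
    "extract_las": set(),
    "parse_curves": {"extract_las"},
    "qc_checks": {"parse_curves"},
    "load_to_lake": {"qc_checks"},
}

def run(name):
    # in Airflow this would be an operator's execute(); here, a stub
    return f"ran {name}"

order = list(TopologicalSorter(tasks).static_order())
results = [run(t) for t in order]
print(order)  # → ['extract_las', 'parse_curves', 'qc_checks', 'load_to_lake']
```

Independent branches of a DAG (tasks with no mutual dependency) can be dispatched in parallel, which is where Docker/Kubernetes deployment pays off.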

In the Q&A, Kohli was pressed on Delfi data ingestion. It turns out that the example shown was more of a demonstration of Airflow functionality. Creating pipelines from various sources and technologies can become quite complex. Schlumberger’s Delfi exposes its own ingestion services for different file formats like CSV, LAS, DLIS, documents and SEGY. But Schlumberger does use Airflow and has built multiple end-to-end Airflow pipelines using Google Composer inside the Google cloud. These connect with third party data sources, adding asynchronous retry mechanisms, load balancing and alerts for job completion. Schlumberger’s DAGs ingest nearly 50 million records and have produced ‘an overall 64% reduction in man-days’.

Ankur Agarwal described Noble Energy’s use of Schlumberger’s InnerLogix data cleansing toolset to automate data loading. Noble has data stored in multiple applications including WellView and ProSource. The question is, which is the system of record? All give a different answer! Noble is addressing the problem by adding rule-based data quality management from InnerLogix to its data in ProSource. The system applies business rules such as ‘no well without a UWI, lat/long’ across applications from different vendors, taking tops from Petra into WellView and synchronizing Petrel Studio with Petra and other applications.
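Rule-based data quality management of this kind reduces to running a set of predicate checks over well records and flagging violations. The sketch below is a hedged illustration of the idea, assuming invented field names and rules; it is not the InnerLogix API.

```python
# Rule-based well data QC sketch: each rule is a (label, predicate) pair
# that fires when a record breaks the rule. Fields and rules are
# illustrative, in the spirit of 'no well without a UWI, lat/long'.
RULES = [
    ("missing UWI", lambda w: not w.get("uwi")),
    ("missing lat/long", lambda w: w.get("lat") is None or w.get("lon") is None),
    ("lat out of range", lambda w: w.get("lat") is not None and abs(w["lat"]) > 90),
]

def qc(wells):
    """Return a list of (well id, issue) for every rule violation."""
    issues = []
    for w in wells:
        for label, broken in RULES:
            if broken(w):
                issues.append((w.get("uwi") or "<no uwi>", label))
    return issues

wells = [
    {"uwi": "42-123-00001", "lat": 31.9, "lon": -102.1},  # clean
    {"uwi": "", "lat": None, "lon": -101.5},              # two violations
]
print(qc(wells))  # → [('<no uwi>', 'missing UWI'), ('<no uwi>', 'missing lat/long')]
```

Synchronizing several systems of record then becomes a matter of running the same rules against each source and reconciling the flagged records.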

Andy James (Bluware) asked rhetorically, ‘If cloud storage is so cheap, why isn't everyone moving their petabytes of seismic data to the cloud?’ Talk is of the economies of scale associated with cloud storage, but few oils are ‘jumping on the cloud storage bandwagon’. Moving petabytes of seismic data to the cloud is difficult, figuring the true cost of managing large data sets in the cloud is hard and the exercise does not, in itself, add business value.

In a large oil company, seismic data can be 85% of the total data volume and is likely stored across multiple locations and formats including tape. So how much would a move to the cloud cost? Key to the cloud is the object store which is cheap but differs from regular file storage. A petabyte-month on AWS costs something like $24k in hot storage, $4.2k in cold. Then there is the question of the user experience for a typical geophysical workload which conventionally likes data to be close to compute resources. This can be achieved by moving the workstation into the cloud. Indeed, virtual workstations ‘have been around for a while’. On the desktop, 60 frames/second in the browser ‘is no problem’.
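The arithmetic behind the hot/cold trade-off is simple but worth making explicit. The sketch below uses the per-petabyte-month figures quoted in the talk ($24k hot, $4.2k cold); these are the article’s figures, not a current AWS price list.

```python
# Back-of-envelope cloud storage cost, using the rates cited in the talk
# (per petabyte-month; not a live AWS price list).
HOT_PER_PB_MONTH = 24_000   # 'hot' object storage, USD
COLD_PER_PB_MONTH = 4_200   # 'cold' archival tier, USD

def annual_cost_usd(petabytes, rate_per_pb_month):
    return petabytes * rate_per_pb_month * 12

for pb in (1, 10, 50):  # archives of various sizes
    print(pb, annual_cost_usd(pb, HOT_PER_PB_MONTH),
          annual_cost_usd(pb, COLD_PER_PB_MONTH))
# e.g. 10 PB: $2.88M/year hot vs $504k/year cold
```

The gap widens with scale, which is why tiering matters, but as the talk goes on to argue, archival cost is not the whole story.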

But geoscience apps don’t work with the cloud’s object store, they work with files - SEGY etc. Unfortunately, file systems in the cloud are very expensive. On Amazon, a NetApp file server in the cloud can be $400k/PB/month. You should not fixate on the cost of archival in the cloud, the true cost is for usage in the cloud. One solution is to abandon the file format and use Bluware’s OpenVDS and VDS formats. These bricked formats, which work with pre and post stack data, leverage the cheap object storage and offer fast encode/decode to application-specific formats. Data is read from cheap, scalable object storage and streamed with Bluware’s ‘Fast’ decoder into an application running on a virtual machine in the cloud.
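The appeal of a bricked layout is that a region of interest maps to a small number of object-store keys rather than a scan of a monolithic file. The sketch below illustrates the addressing idea only; the brick size and key scheme are assumptions for the example, not the actual OpenVDS layout.

```python
# 'Bricked' volume addressing sketch: the seismic cube is split into
# fixed-size bricks, each stored as its own object, so reading a small
# region touches few objects. Brick size and key naming are illustrative.
BRICK = 64  # samples per brick edge (assumed)

def brick_key(i, j, k):
    """Object-store key for the brick containing sample (i, j, k)."""
    return f"bricks/{i // BRICK}_{j // BRICK}_{k // BRICK}"

# A 128^3 region of interest touches at most 3x3x3 = 27 bricks,
# not the whole survey.
print(brick_key(100, 5, 700))  # → bricks/1_0_10
```

Each brick can then be fetched from cheap object storage and decoded on the fly into whatever in-memory layout the application expects, which is the streaming pattern described above.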

In the Q&A, James was pressed on cloud costs. He acknowledged that there is a lot of uncertainty as to how much this will cost in the long run. There is a ‘fixation’ on archive cost, but the cloud vendors charge more for data egress which should be avoided*. Another issue is the relationship between VDS and the SEG’s standards. Jill Lewis (Troika and SEG Standards Committee) invited Bluware to ‘join us in making SEG-Y cloud ready. OpenVDS is not an open format as it sits behind an API that hides the actual data that is being written’.

* A similar situation existed (and probably still exists) with physical document and tape data storage. Vendors take your data off your hands for relatively little. But getting it back is a different price point!

At PNEC 2019, Wood Mackenzie presented its Analytics Lab*, a cross-industry offering that encourages companies to build data consortiums. This year, Woodmac’s James Valentine presented the results of a pilot project in the Bakken shale that demonstrated the value obtained from analytics across a ‘broader, higher fidelity dataset than is available publicly or to any single operator’.

To combat disconnects between data scientists and subject matter experts, analytics-derived data models need to be explainable. Selecting the best predictive features is crucial. X/Y (lat/long) features are ‘prime offenders’ in the analytical model that ‘bring all sorts of variables along for the ride, like geology and operators’. Variables that don’t add value need to be removed. ‘Kitchen-sink style’ ML is unsuitable for decision making as it is extremely easy to overfit the data, resulting in million-dollar failures. Explainability comes from interaction with the model. What happens when you hold out one operator, or hold out last year’s data? Observe the changes in the output and see the limits of the model as error rates go way up. Data sharing between operators (even of a small data set) may provide very different and valuable experimental results.
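The ‘hold out one operator’ check can be sketched in a few lines: fit on all operators but one, score on the held-out one, and watch the error climb when the model has merely memorized operator or location effects. The data and the toy mean-predictor model below are invented for illustration.

```python
# Hold-one-operator-out validation sketch. If error on the held-out
# operator is much worse than in-sample, the model is leaning on
# operator/location effects rather than transferable signal.
# Records are (operator, production metric) — synthetic data.
from statistics import mean

records = [
    ("OpA", 10.0), ("OpA", 12.0), ("OpB", 11.0),
    ("OpB", 13.0), ("OpC", 30.0), ("OpC", 28.0),  # OpC behaves differently
]

def holdout_error(held_out):
    train = [y for op, y in records if op != held_out]
    test = [y for op, y in records if op == held_out]
    prediction = mean(train)  # toy model: predict the training mean
    return mean(abs(y - prediction) for y in test)

for op in ("OpA", "OpB", "OpC"):
    print(op, round(holdout_error(op), 2))
# OpC's error is far higher: the 'model' does not transfer to it
```

The same probe generalizes to holding out a year of data, or any other slice whose transferability you want to test.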

Very large data sets are needed to train a reliable model. Enter the Woodmac/Verisk Data Consortium. The Bakken proof of concept trial was based on some 250 million data points and was ‘a real eye opener’ for participants, showing a huge data quality opportunity to homogenize frac fluid volumes and units of measure between public data and operator data. The final model included commercial information on well costs from other Woodmac units.

* As we reported earlier in Oil IT Journal.

Houston-based LEK Consulting has surveyed a broad range of industries, including the upstream, with regard to digital success and how companies can ‘stay ahead of the curve’. Across the board, leaders are pressing their advantage mostly in two areas: ways of working (automation, remote monitoring) and digitized operations (planning, procurement, supply chain). In both areas, the difference between good and bad is huge, and the gap is likely to widen. The imperative is particularly great in the upstream, a ‘challenging environment’ that is ‘behind the curve’. That said, the upstream is ‘about on par with heavy industry’ and the situation is less dire when the level of complexity and the high cost of failure are taken into account. ‘Applying AI is hard, development is done relatively infrequently. The upstream is not your average industry. Upstream is making steady progress given its problem set’. One difficulty stems from decades of previous E&P digital initiatives that have ‘not much to show for them’. LEK’s Gujral cited BP as a digital success, with its rapid prototyping and deployment of sensors to detect fugitive emissions and a partnership with an AI startup to optimize workover frequency. Suncor also got a shout-out for ‘managing to a specific digital P&L’. Download the LEK Survey here.

More from the PNEC Conferences home page.


© Oil IT Journal - all rights reserved.