NASA’s IM/data science. A quick comment on COP21

In our action-packed last issue of 2015, we bring you an exclusive report from the SPE Petroleum Data Driven Analytics dinner where Richard Doyle explained how NASA manages its (very) big observational systems data. Somewhat surprised by the deliberations of COP21, editor Neil McNaughton struggles to understand what its ‘success’ actually means.

As we enter 2016, my notebook is bulging with unused but interesting material. So this month, in place of an editorial, I bring you our report from the SPE Petroleum Data Driven Analytics (PD2A) technical section dinner held during the 2015 Houston ATCE, where Richard Doyle, manager of information and data science (I&DS) at the national space technology program of the Jet Propulsion Laboratory (JPL), gave the keynote address. The I&DS program’s ‘punchline’ is to take a lifecycle perspective of the context and data from NASA’s observational systems. Examples include NASA’s Mars Rover missions, but also the US Department of Defense’s intelligence gathering and even the monitoring of an infant in pediatric care. In all of these situations, locally acquired data is pushed to a remote location. The data pipeline can be long and may involve data loss or corruption. Archived data may be compromised. Doyle observed that oil and gas shows a similar pattern, with data captured at remote locations where transfer and management present similar architectural challenges.

So what are the challenges for big/observational data? It may arrive too fast for communications systems to transport. As an example, when a spacecraft lands on Mars, the whole process is over in less time than it takes for a radio communication to make the round trip to Earth. Remote control becomes impossible and an autonomous capability is required. Elsewhere, the Square Kilometre Array radio telescope, currently under construction, will collect too much data for practical storage and archival. In such situations it is necessary to make up-front decisions as to which observations are worth keeping. This will involve some kind of ‘quick-look’ analysis and early stage data triage. There is likewise the risk of a similar capacity shortfall all along the data pipeline. Multiple formats and standards make it hard to collate data from different systems.
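Doyle did not describe how such quick-look triage is implemented, so what follows is only a minimal sketch of the idea in Python. It assumes that observations arrive as short windows of numeric samples, that a cheap score (the largest deviation from the window mean, in standard deviations) is an acceptable stand-in for ‘interesting,’ and that storage is a fixed budget of windows. The names quick_look_score and triage are hypothetical and the scoring rule is an illustration, not NASA’s method.

```python
import heapq
from statistics import mean, pstdev


def quick_look_score(window):
    """Cheap 'interest' score: largest deviation from the window mean,
    expressed in standard deviations. A flat window scores 0.0."""
    mu = mean(window)
    sigma = pstdev(window)
    if sigma == 0:
        return 0.0
    return max(abs(x - mu) for x in window) / sigma


def triage(windows, budget):
    """Keep only the `budget` highest-scoring windows (hypothetical storage
    limit), returned in their original arrival order."""
    scored = [(quick_look_score(w), i) for i, w in enumerate(windows)]
    keep = sorted(i for _, i in heapq.nlargest(budget, scored))
    return [windows[i] for i in keep]


if __name__ == "__main__":
    windows = [
        [1, 1, 1, 1],                    # flat, nothing to see
        [1, 1, 9, 1],                    # one spike
        [2, 2, 2, 2],                    # flat
        [0, 0, 0, 0, 0, 0, 0, 0, 50],    # strong transient
    ]
    for w in triage(windows, budget=2):
        print(w)
```

With a budget of two, the two flat windows are discarded and the two windows containing a spike are retained. A real system would score data as it streams in rather than holding every window in memory first, but the up-front keep-or-discard decision is the same.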

If big data is the challenge, data science is the solution. Data science includes the whole gamut of artificial intelligence and machine learning but also data curation, a.k.a. ‘data wrangling.’ A seminal 2013 book, Frontiers in Massive Data Analysis, describes many of the techniques used and in particular emphasizes the need for the reproducibility of published results. Reproducibility is emerging as a ‘critical path challenge’ in the big data movement. A study from the US National Institutes of Health found that 80% of big data derived results are not reproducible! Implementing data science requires ‘cross-over’ people. Folks with competency in science, math and IT are worth their weight in gold. Data provenance and confidence are important, but reproducibility means a complete description of how results were achieved. NASA’s data pipeline traditionally involved a linear flow from operations, through data science and into a data archive. A key component here is the 12 petabyte data archive managed by NASA’s Earth Science division. This is built atop the NASA-developed object oriented data technology (Oodt) store, now an open source Apache project.
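To make ‘a complete description of how results were achieved’ a little more concrete, here is a minimal Python sketch of a provenance record. It is an illustration under stated assumptions and has nothing to do with the actual Oodt API: each processing step is assumed to be a deterministic function of its input bytes and named parameters, and the helpers fingerprint, provenance_record, verify and the toy scale step are all hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 digest used to identify a dataset by its content."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(input_bytes, step_name, params, output_bytes):
    """Record what went in, what was done to it, and what came out."""
    return {
        "step": step_name,
        "params": params,
        "input_sha256": fingerprint(input_bytes),
        "output_sha256": fingerprint(output_bytes),
    }


def verify(record, input_bytes, rerun):
    """Re-run the step on the same input with the recorded parameters and
    check that the output matches the recorded fingerprint."""
    if fingerprint(input_bytes) != record["input_sha256"]:
        return False  # not the data the result was derived from
    output = rerun(input_bytes, **record["params"])
    return fingerprint(output) == record["output_sha256"]


def scale(raw: bytes, factor: int) -> bytes:
    """Toy processing step: multiply each byte by `factor`, modulo 256."""
    return bytes((b * factor) % 256 for b in raw)


if __name__ == "__main__":
    raw = b"\x01\x02\x03"
    record = provenance_record(raw, "scale", {"factor": 2}, scale(raw, factor=2))
    print(verify(record, raw, scale))              # same input, same step
    print(verify(record, b"\x01\x02\x04", scale))  # altered input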

However, such a linear workflow actually compromises a holistic, end-to-end view of systems whose complexity and richness may not be completely captured. NASA is looking beyond these workflows into the future of data science. NASA manages multiple collections of data from its various missions (water, ocean, CO2 etc.). These leverage the Oodt framework to capture, analyze and integrate data. Examples include the rapid identification of signals in time series data from the airborne visible/infrared spectrometer program, Aviris. The techniques have spilled over into healthcare where they are used in histology. Another is the Palomar transient factory, which uses optical astronomical observations to look for large objects (meteorites, comets) that are heading for Earth! Data visualization has shown spectacular drought-related subsidence (70cm in 7 years) in California’s central valley, where voluntary restriction on water use means that a ‘brown (lawn) is the new green.’ Climate scientists and CO2 modelers also use Oodt. The detection of Mars ‘dust devils’ in real time is an example of the use of up-front/quick look processing to reduce communication bandwidth.

Graph-based tracking/logging of the data path is used to suppress unimportant details. Credit assignment and goal-based reasoning provide context to interpretations. The Planetary data system supports all NASA missions with a petabyte archive of data. The Oodt information model that describes data and use cases has been key to success. Model-based engineering is used to derive specific use cases. This involves sitting down with the subject matter experts, even though they’ll think you are wasting their time! You need to build an ontology, although using the term itself can be a turn-off. Enter the PDS4 standard that is to be used on the 2017 NASA/ESA BepiColombo mission to Mercury. JPL has teamed with its Caltech parent to offer a massive open online course (Mooc) on distributed data analytics which has been watched by 16,000 people.


I know that I said last month that I would comment on the outcome of the COP21 deliberations. This is harder than I thought as, apart from its universally acclaimed ‘success,’ it is hard to see exactly what has been achieved. Kevin Anderson provides an interesting take in Nature. It is about CCS but not about fossil! But to my mind, the most illuminating commentary appeared in a cartoon in the local satirical magazine, the Canard Enchaîné. One dude says, ‘They just reached agreement on global warming! And gasoline prices are down.’ His buddy replies, ‘Ain’t life good!’


© Oil IT Journal - all rights reserved.