Review: Reproducibility and Replicability in Science

The 250-page report from the National Academies Press investigates the reproducibility of scientific experiments and publications, citing an eminent geophysicist as the father of the reproducibility movement and pointing to new boosts to reproducibility from Docker, the cloud and the Jupyter notebook.

A new publication, Reproducibility and Replicability in Science* (R&RiS), from the US National Academies Press investigates the reproducibility of scientific experiments and publications. The overarching theme is that, often, single experiments are taken as demonstrating some finding or other which subsequent studies fail to confirm. In the introduction, the to-and-fro over the benefits or otherwise of margarine as a healthier alternative to butter is cited, along with changing advice from the medical profession on the merits of daily doses of baby aspirin to reduce the risk of heart attack. R&RiS is a ‘consensus study’ from an impressive collection of Foundations and Academies, with backing from Congress and support from the Alfred P. Sloan Foundation.

What caught our attention in the report were the references to preeminent geophysicist Jon Claerbout, whose work on seismic processing led to his launch of the reproducibility movement (still current with Madagascar). With special reference to data and compute-intensive scientific work, Claerbout observed that minor mistakes in code can lead to serious errors in interpretation and in reported results, and proposed that both data and code should be openly shared so that results could be reproduced. Claerbout is quoted as saying, ‘An article about computational science [...] is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.’

As an example, an image processing scientific workflow may involve user interaction that a subsequent investigator may not be able to replicate. Reproducibility zealots thus eschew interactive programs ‘unless they include the ability to arrive in any previous state by means of a script’. Others likewise deprecate the ubiquitous spreadsheet as prone to non-reproducible results. ‘The use of spreadsheet software impairs reproducibility because spreadsheets conflate input, output, code, and presentation. Spreadsheets inhibit one’s ability to make a record of all steps taken to construct a full analysis of the data, and they are notoriously hard to debug’.
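By way of illustration (ours, not the report’s), the following minimal Python sketch shows the kind of scripted analysis the purists have in mind: every step from raw input to derived numbers lives in re-runnable code rather than in spreadsheet cells. The file and column names here are hypothetical.

```python
# Minimal sketch of a scripted, re-runnable analysis. The input file and the
# 'porosity' column are hypothetical stand-ins for whatever the study used.
import csv
import statistics

def load_measurements(path):
    # Read the raw CSV exactly as received; no manual edits along the way.
    with open(path, newline="") as f:
        return [float(row["porosity"]) for row in csv.DictReader(f)]

def summarize(values):
    # Every derived number is computed here, not pasted into a cell.
    return {"n": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values)}

if __name__ == "__main__":
    raw = load_measurements("raw_measurements.csv")
    print(summarize(raw))  # re-running the script reproduces the result
```

Unlike a spreadsheet, the record of every transformation is the script itself, which can be placed under version control alongside the data.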

There are better ways of capturing the scientific workflow. Workflow management systems such as that developed by CERN for its investigations with the Large Hadron Collider can capture and store data and workflow provenance automatically. These systems link results with the computational processes that derived them. The Chimera system (developed for the Sloan Digital Sky Survey) likewise captures and automates a complex pipeline of transformations on the data by external software. The Open Science Framework, developed by the Center for Open Science, is a cloud-based project management tool that emerged from efforts to replicate psychological research and is now used in other fields.
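To make the provenance idea concrete, here is a toy Python sketch (not CERN’s or Chimera’s actual machinery) in which each workflow step is run through a wrapper that records its inputs, output and timestamp, so a result can always be traced back to the computation that produced it. The step names and log format are our own assumptions.

```python
# Toy provenance capture: each transformation is run through run_step(),
# which logs hashes of inputs and output so results stay linked to the
# computation that derived them. Purely illustrative, not a real system.
import hashlib
import json
import time

provenance_log = []

def _digest(obj):
    # Short content hash used to identify an input or output.
    return hashlib.sha256(repr(obj).encode()).hexdigest()[:12]

def run_step(name, func, *inputs):
    # Execute one transformation and record its provenance.
    output = func(*inputs)
    provenance_log.append({
        "step": name,
        "inputs": [_digest(i) for i in inputs],
        "output": _digest(output),
        "timestamp": time.time(),
    })
    return output

if __name__ == "__main__":
    cleaned = run_step("clean", lambda xs: [x for x in xs if x is not None], [1, None, 3])
    total = run_step("sum", sum, cleaned)
    print(total)
    print(json.dumps(provenance_log, indent=2))
```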

The misunderstanding and misuse of statistical significance testing is a particular source of non-reproducibility. As recently as 2016, the American Statistical Association, noting that in its 177 years of existence it had never previously taken a stance on a matter of statistical practice, published its six principles on the use of the P-value test, ‘in the hopes that they would “shed light on an aspect of our field that is too often misunderstood and misused in the broader research community.”’ This year the ASA published a special edition of its official journal, The American Statistician, titled ‘Statistical Inference in the 21st Century: A World Beyond P < 0.05.’
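The point is easy to demonstrate. The short simulation below (our illustration, not the ASA’s) runs a thousand t-tests on pure noise; around five per cent come out ‘significant’ at the 0.05 level, which is why a single p < 0.05 result may well fail to replicate.

```python
# Illustrative simulation: testing many null hypotheses at the 0.05 level
# yields roughly 5% false positives, even though no real effect exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, false_positives = 1000, 0

for _ in range(n_tests):
    a = rng.normal(size=30)  # both samples drawn from the same
    b = rng.normal(size=30)  # distribution, so any 'effect' is noise
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} null comparisons were 'significant'")
```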

There are now also tools for reproducing research results, notably ReproZip, which creates a reproducible package of the whole computational sequence for execution in the cloud without additional software. More generally, virtual machines that encapsulate an entire computational environment, from the operating system up, can enable reproducibility, so long as the source code is made public. The combination of virtual machines and public cloud has proved valuable for reproducibility in several domains, such as microbial ecology and bioinformatics. Docker containers are another route to reproducibility, as witnessed by the Software Sustainability Institute’s June 2017 workshop on Docker Containers for Reproducible Research.
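Short of packaging a whole virtual machine or container, even a modest habit helps: recording the computational environment alongside the results. The Python snippet below (our sketch, far lighter-weight than ReproZip or Docker) dumps the interpreter version, platform and installed package versions to a JSON file that can be published with the analysis.

```python
# Lightweight environment capture: record interpreter, platform and package
# versions next to the results. A sketch only; it does not capture data
# files, system libraries or the OS the way a container or VM image would.
import json
import platform
import sys
from importlib import metadata

environment = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
```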

Jupyter interactive computational notebooks are another technology supporting reproducible research, enabling researchers to fully narrate their analysis with text and multimedia content. Notebooks can be shared with other researchers to reproduce computations. According to R&RiS, ‘scientists are increasingly adopting Jupyter for their exploratory computing, sharing knowledge within their communities, and publishing alongside traditional academic papers’. The gravitational wave LIGO team published Jupyter notebooks that reproduced the analysis of the data and displayed the signature of a binary black-hole merger.
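A shared notebook only supports reproducibility if it actually re-runs from top to bottom. One way to check, sketched here assuming the nbformat and nbclient packages and a hypothetical analysis.ipynb, is to execute the notebook headlessly:

```python
# Re-execute a shared notebook from a clean state; raises if any cell fails.
# 'analysis.ipynb' is a hypothetical file name.
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("analysis.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600, kernel_name="python3")
client.execute()
nbformat.write(nb, "analysis_executed.ipynb")  # save the freshly executed copy
```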

Several institutions have contributed guidelines to the open data movement: the FAIR (findable, accessible, interoperable, and reusable) data principles from the Lorentz Center in the Netherlands, the Transparency and Openness Promotion guidelines, and those from the Association for Computing Machinery.

R&RiS dives into reproducibility in geoscience, a rather harder task and a contentious one when applied to the forecasting of natural hazards and of ‘notoriously difficult to predict’ extreme events of low probability but high consequence. Here, scientific forecasts are expressed as probabilities, arrived at by iterating forecasting models over many cycles of data gathering, model calibration, verification, simulation, and testing.

In conclusion, R&RiS advises that members of the public and policy makers have a role to play to improve reproducibility and replicability. When reports of a new discovery are made in the media, one needs to ask about the uncertainties associated with the results and what other evidence exists that the discovery might be weighed against. Anyone making personal or policy decisions based on scientific evidence should be wary of making a serious decision based on the results, no matter how promising, of a single study. Similarly, no one should take a new, single contrary study as refutation of scientific conclusions supported by multiple lines of previous evidence.

Curiously there is no mention (outside of the copious list of references) of big data, artificial intelligence or machine learning, fields which would all merit close inspection as to their ‘reproducibility’. There is no mention either of ‘fake news’. But both AI and fake news make up the subtext that dare not speak its name, politely hiding inside R&RiS’ exhortations.

* National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. ISBN: 978-0-309-48616-3.
