The Case for Open Source in Research
and what we’re doing with eilmer

Rowan J. Gollan

10 March 2016

Outline

The Scientific Method

a method or procedure that has characterized natural science since the 17th century consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypothesis

Oxford English Dictionary

Science is the systematic enterprise of gathering knowledge about the universe and organizing and condensing that knowledge into testable laws and theories. The success and credibility of science are anchored in their willingness to expose their ideas and results to independent testing and replication by other scientists. This requires the complete and open exchange of data, procedures and materials.

American Physical Society

The Mertonian Norms (1942)

Communalism
all scientists should have equal access to scientific goods (intellectual property) and there should be a sense of common ownership in order to promote collective collaboration, secrecy is the opposite of this norm.
Universalism
all scientists can contribute to science regardless of race, nationality, culture, or gender.
Disinterestedness
according to which scientists are supposed to act for the benefit of a common scientific enterprise, rather than for personal gain.
Organized Skepticism
skepticism means that scientific claims must be exposed to critical scrutiny before being accepted.

Here shown in the form given by Ziman (2000).

Branches of science

The motivation for using the scientific method is to route out error.

Two established branches of science:

Deductive branch
concept of proof
Empirical branch
hypothesis testing, structured communication of methods and protocols

One or two emerging branches of science:

Computational science
process of discovery involves (large-scale) computer simulation
Data-driven science
???



Computational science: a new branch of science?

What might distinguish computational science from other branches?

A historical diversion: publishing in journals

What has worked to disseminate science in the empirical branch?

Publishing in archival journals with a detailed description of Materials and Methods sufficient for replication.

Note: The first scientific journal, the Transactions of the Royal Society London, was created primarily to establish precedence of discoveries. This also became a means of dissemination and had a peer-review. After a few issues, it was realised that the transactions were also a great archive.

That first journal did what we still think of as the role of journals today: registration, dissemination, peer-review, and archival record.

The standard for reporting


that the person I addressed them to might, without mistake, and with as little trouble as possible, be able to repeat such unusual experiments.

New Experiments, 1660, Robert Boyle

The problem in computational science

an article about computational science in a scientific publication is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete software development environment and the complete set of instructions that generated the figures.

Jon Clarebout, paraphrased by Donoho (1998)

My thesis: computational science has lost its way with regards to the scientific method. Our work often fails to be replicable, it lacks transparency of communication, and is unverifiable. This hinders or makes it impossible to root out error.

The extent of the problem

…but we define our algorithms formally with mathematics in computational fluid dynamics. We must be safe from these errors, right?

The problem closer to home: CFD


Kleb and Wood (2004) surveyed 49 new CFD models in IJNMF and JCP. They found that only 22% of the models were published with component level data.

In other words, for 78% of those models (algorithms) there’s no easy means to verify that the algorithm is implemented correctly!

The solution: in the large scale

Augment our traditional means of dissemination when dealing with computational science: publish the simulation code as open source.

Our view is that we have reached the point that, with some exceptions, anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility.

Ince, Hatton & Graham-Cumming (2011), Nature

Barriers to publishing open source

A survey of attendees (1008 university researchers) at a Machine Learning conference were asked about their perceived barriers to sharing code. The top 10 reasons not to share code.

Table 10 from [Stodden, 2010]
The time it takes to clean up and document for release 77.78%
Dealing with questions from users about the code 51.85%
The possibility that your code may be used without citation 44.78%
The possibility of patents or other IP constraints 40.00%
Legal barriers, such as copyright 33.72%
Competitors may get an advantage 31.85%
The potential loss of future publications using this code 31.11%
The code might be used in commercial applications 28.15%
Availability of other code that might substitute for your own 21.64%
Whether you put in a large amount of work building the code 20.00%
Technical limitations, ie. webspace platform space constraints 20.00%

The Solution: for the scientific community

The Solution: tools to help computational reproducibility

The Solution: in our own back garden

What we’re doing towards reproducible computational science with eilmer

Standards we are aiming for

As an example, publication in the Journal of Open Research Software requires the following:

1. Is the software in a suitable repository?
2. Does the software have a suitable open licence?
3. If the Archive section is filled out, is the link in the form of a persistent identifier, e.g. a DOI? Can you download the software from this link?
4. If the Code Repository section is filled out, does the identifier link to the appropriate place to download the source code? Can you download the source code from this link?
5. Is the software license included in the software in the repository? Is it included in the source code?
6. Is sample input and output data provided with the software?
7. Is the code adequately documented? Can a reader understand how to build/deploy/install/run the software, and identify whether the software is operating as expected?
8. Does the software run on the systems specified?
9. Is it obvious what the support mechanisms for the software are?

The “Gold” standards

From the blog of Cameron Neylon in 2010 on the launch of the journal Open Research Computation:

The submission criteria for ORC Software Articles are stringent. The source code must be available, on an appropriate public repository under an OSI compliant license. Running code, in the form of executables, or an instance of a service must be made available. Documentation of the code will be expected to a very high standard, consistent with best practice in the language and research domain, and it must cover all public methods and classes. Similarly code testing must be in place covering, by default, 100% of the code. Finally all the claims, use cases, and figures in the paper must have associated with them test data, with examples of both input data and the outputs expected.

The primary consideration for publication in ORC is that your code must be capable of being used, re-purposed, understood, and efficiently built on. You work must be reproducible. In short, we expect the computational work published in ORC to deliver at the level that is expected in experimental research.

Spoiler alert: This journal never published a single article. It did, however, morph into Source Code for Biology and Medicine.

Direct reproducibility

Revision ID is built into the executable and reported in logfile

> e4shared --job=cone20 --prep
Eilmer4 compressible-flow simulation code.
Revision: e1a4f47b18c8 384 default tip
Begin preparation stage for a simulation.

High-level user input is converted to low-level verbose input for simulation. The verbose input is a complete record of input.

Unit testing

version(ideal_gas_test) {
   int main() {
      auto gm = new IdealGas();
      auto gd = new GasState(gm, 100.0e3, 300.0);
      gm.update_thermo_from_pT(gd);
      assert(approxEqual(gd.rho, 1.16109, 1.0e-4), failedUnitTest());
      return 0;
   }
}

Integrated testing

Running the integration tests.

> cd dgd/examples/eilmer
> tclsh eilmer-test.tcl

Best practices for developers

Presently we work with students locally. We have had no external development of the new eilmer code yet. We try to encourage a set of development practices that have served myself and Peter Jacobs very well over a large number of years.

I see this as a large challenge if we move to a model of encouraging external development on the code. But a challenge I think we should embrace.

Final best practice

Don’t commit code after 5pm on a Friday.


Conclusion

BACKUP SLIDES

Examples of changing disclosure practices

Asking for source code release sounds like asking for a lot in terms of changing how we disseminate our work. There are examples in other areas of science where the community has come together and changed disclosure practices: