ReproNim module for data processing

Lesson 1: Core concepts using an analysis example

Overview

Teaching: 30 min
Exercises: 60 min
Questions
  • What are the different considerations for reproducible analysis?

Objectives
  • Learn core elements of reproducible analysis

  • Familiarize with various technology terms

You can skip this lesson if you can already answer these questions.

This lesson uses the Simple Workflow GitHub repository to illustrate core concepts of reproducible analysis and the pitfalls associated with complex data, software, and computing environments. The complete Simple Workflow paper is available here.

Lesson outline

Lesson requirements

It is helpful to have an understanding of:

Overview of workflow

This basic workflow extracts a collection of brain images and associated phenotypic information (e.g., age) from a spreadsheet, then runs a Nipype workflow that performs some simple anatomical image processing on the brain images. The workflow can take a while to run, depending on the power of your machine or cluster. For learning purposes, and to minimize run time, you can run the workflow on a single participant.

Hands on exercise:

Can you rerun the analysis in the simple workflow example?

Solution

Follow the README in the repo and rerun the analysis with the Docker example.

Element 1: Storing data and metadata

To ensure reproducibility, all data and metadata must be accessible, and preferably machine accessible.

Machine accessibility

A typical approach to describing research is to write a document that one shares with colleagues and collaborators. However, such information requires significant human effort to interpret and translate into code. An alternative approach is to encode the metadata using structured markup (e.g., RDF, JSON, XML). Such markup can often be standardized to provide machine accessibility.

In this example the data and metadata are stored in a Google spreadsheet. The phenotypic information is stored as literals, and the imaging data are stored as pointers to files in the NITRC XNAT repository. However, this particular example does not have any semantic or type information associated with the input file.

The column headers can be described in detail in a JSON document using JSON-LD, a format that supports semantic annotation. The annotation provides information about the data contained in each column and allows harmonizing the information with other similar tables. For example, the JSON-LD metadata could tell us that the URLs correspond to anatomical T1-weighted images of the human brain and that the age of participants is in years.
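As a minimal, hypothetical sketch, a JSON-LD context for two such spreadsheet columns might look like the following; the column names and term URLs are placeholders, not the vocabulary used by the Simple Workflow project:

```python
import json

# A hypothetical JSON-LD context for the spreadsheet columns; the column
# names and term URLs are placeholders, not the project's actual vocabulary.
context = {
    "@context": {
        "age": {
            "@id": "http://example.org/terms/ageAtScan",
            "description": "participant age in years",
        },
        "anat_url": {
            "@id": "http://example.org/terms/T1WeightedImage",
            "@type": "@id",  # values are URLs, not literals
            "description": "URL of the anatomical T1-weighted image",
        },
    }
}
print(json.dumps(context, indent=2))
```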

Most datasets are instead described in PDFs or other human-readable documents. Using consistent and self-describing data structures makes the data more accessible.

Lesson 2 covers different aspects of data annotation, harmonization, cleaning, storage, and sharing.

Hands on exercise:

What types of output files are created by the workflow?

Solution

There are four types of output files created:

  1. Compressed NIfTI files corresponding to the different segmentations
  2. A JSON file containing the volume measures and voxel counts of the brain segmentations (loaded in the sketch after this list)
  3. VTK surface files for each subcortical surface
  4. A provenance file in Turtle syntax capturing information about the run
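The JSON output can be inspected directly; in this sketch the file name is illustrative, not the repository's actual layout:

```python
import json

# Peek at the volume measures in the workflow's JSON output;
# the file name 'segstats.json' is illustrative.
with open('output/segstats.json') as f:
    stats = json.load(f)

for label, value in stats.items():
    print(f"{label}: {value}")
```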

What are some of the drawbacks of the form in which the input and output are represented?

Solution

The input Google spreadsheet does not provide a key describing each column, and the output JSON file does not describe its keys anywhere. This makes it difficult for a human or a machine to interpret the values, and it prevents harmonization with other data.

Element 2: Creating a reproducible execution environment

The second component of this example is a setup script and a Docker container that create the necessary computational environment for the analysis.

Problems with creating environments

  1. The default script assumes that you have access to certain software, such as bash and FSL, on your system. This means you have to run it on a Unix-like system such as Linux or macOS.
  2. All other software, Python libraries, and their dependencies are installed by the script. The script itself does not check whether these installations conflict with your existing software environment.

Alternatives that reproduce environments with minimal software dependencies include virtual machines (VirtualBox, VMware, NITRC-CE), containers (Docker, Singularity), and installers (e.g., Vagrant, Packer). These can be very useful for replicating existing environments and therefore simplify the installation problem significantly. However, at present some of these technologies are not installed by default on the computing clusters you may have access to.

Lesson 3 covers container technologies and how to create, use, modify, and reuse containers.

Exercise:

What should a reproducible computational environment contain?

Solution

A reproducible computational environment must contain all the necessary data (any inputs or other internal software package data, such as brain templates), environment variables, and software needed to carry out the prescribed computation. Ideally, such an environment should itself be reproducible.
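One small piece of this is recording exactly what the environment contains so it can be compared or rebuilt later. A minimal sketch, assuming Python 3.8+ and treating the output format as an illustrative choice rather than a standard:

```python
import json
import platform
import importlib.metadata as md

# Record the platform and installed Python package versions so the
# environment can be compared or rebuilt later; the output format is
# an illustrative choice, not a standard.
snapshot = {
    "platform": platform.platform(),
    "python": platform.python_version(),
    "packages": {d.metadata["Name"]: d.version
                 for d in md.distributions() if d.metadata["Name"]},
}
print(json.dumps(snapshot, indent=2, sort_keys=True))
```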

Element 3: Running analysis and storing expected results and provenance

Once the environment is set up, one can execute the analysis. Each time the analysis is run, the provenance of the workflow is captured and stored using a PROV model for workflows. All of this happens inside a single executable script. The script uses the Nipype dataflow framework to ensure a consistent representation of the execution graph, with Python as the dataflow language, and therefore benefits from all the advantages a dataflow framework brings to analyses.

Using dataflow technologies for analysis instead of shell scripts

There are many dataflow platforms available. These typically enable a compact, abstract, graph-based representation of a dataflow, allowing reuse and consistent execution. They also enable running the same dataflow in different computing environments, without requiring the user to keep track of complex data dependencies across nodes. While Nipype was used in this example, other brain imaging dataflow systems include Automated Analysis, PSOM, and FASTR.
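To make the idea concrete, here is a minimal Nipype dataflow sketch; the node functions are toy examples, not steps from the Simple Workflow:

```python
# A minimal Nipype dataflow sketch (assumes nipype is installed);
# the functions are toy examples, not steps from the Simple Workflow.
from nipype import Node, Workflow
from nipype.interfaces.utility import Function

def add_one(x):
    return x + 1

def double(x):
    return x * 2

node_a = Node(Function(input_names=['x'], output_names=['out'],
                       function=add_one), name='add_one')
node_b = Node(Function(input_names=['x'], output_names=['out'],
                       function=double), name='double')

wf = Workflow(name='demo', base_dir='demo_workdir')
node_a.inputs.x = 1
wf.connect(node_a, 'out', node_b, 'x')
wf.run()  # Nipype executes the graph and caches results under base_dir
```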

Running the analysis is only one part of reproducibility. It is important to also capture the output necessary for scientific hypothesis testing or exploration. In this example, the volumes of the subcortical structures and of the different brain tissue classes are extracted and stored in a JSON document. A specific run of this workflow on a specific platform was used to create the provenance document and the expected output data. When another user runs this workflow, their output can be compared to the expected output.
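A hedged sketch of that comparison, assuming both runs produce a flat JSON mapping of measure names to values (file names and tolerance are illustrative):

```python
import json

# Compare a new run's measures to the expected output within a relative
# tolerance; file names and the tolerance are illustrative.
with open('expected_output/segstats.json') as f:
    expected = json.load(f)
with open('output/segstats.json') as f:
    observed = json.load(f)

TOL = 1e-4
for key, exp_val in expected.items():
    obs_val = observed.get(key)
    if obs_val is None or abs(obs_val - exp_val) > TOL * max(abs(exp_val), 1.0):
        print(f"diverged: {key}: expected {exp_val}, observed {obs_val}")
```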

Lesson 4 covers data flow technologies, specifically how to create analysis pipelines and applications and capture provenance when running these pipelines.

Exercise:

What are some advantages of using dataflow technologies?

Solution

  1. Compact and structured representation of an analysis
  2. Reuse with minimal changes
  3. Many dataflow tools can be used across different environments
  4. The tools take care of data management
  5. Many dataflow tools support distributed execution of steps

Element 4: Checking output consistency using continuous integration

Once the data and environment are set up appropriately and the analysis has been run, it would be good to know whether the same results, within some threshold, are obtained when a similar dataset or a similar workflow is used. Such checks can be carried out using continuous integration services, such as Travis CI, CircleCI, and Jenkins, which allow an analysis and a test comparison to be executed automatically as versions of data or software change.

Continuous integration testing

In typical brain imaging analyses there is a complex interaction between data, software, and scientific hypothesis testing results. Continuous integration services ensure that such results can be obtained consistently and provide a framework to evaluate when results diverge. While typically used for software testing, continuous integration has become essential for process management by creating a complete test cycle.
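For example, the output comparison from Element 3 can be expressed as a test that a CI service runs on every change. This is a hedged sketch using pytest-style assertions, with illustrative file names and tolerance:

```python
import json
import math

# A pytest-style check a CI service could run on every change to data or
# code; file names and the tolerance are illustrative.
def test_volumes_match_expected():
    with open('expected_output/segstats.json') as f:
        expected = json.load(f)
    with open('output/segstats.json') as f:
        observed = json.load(f)
    for key, exp_val in expected.items():
        assert key in observed, f"missing measure: {key}"
        assert math.isclose(observed[key], exp_val, rel_tol=1e-4), \
            f"{key}: expected {exp_val}, observed {observed[key]}"
```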

In brain imaging, published results rarely come with the data and code needed to retest the outcome when software versions change or when a new dataset becomes available. The intent of this simple workflow framework is to move the community towards such comprehensive data preservation and testing integration.

Lesson 5 covers how to use continuous integration services like Travis CI and CircleCI, as well as how container technologies can be used to run your own integration testing.

Exercise:

How does setting up continuous integration testing help with research?

Solution

During the lifetime of a project, the data and software may change. Setting up testing environments allows you to ensure that changes in results can be attributed to changes in the data or software. Results that remain the same across different software and similar datasets are more generalizable than results that are specific to a particular dataset and a particular software environment.

Results from running the workflow

It turns out that this workflow is not reproducible across different versions of software and operating systems. The observed inconsistencies (see Table 1) point to issues of randomization and/or initialization within the algorithms that are run. While it is easy to detect deviation of execution in different environments, it is harder to determine the cause of the deviation. This is where rich provenance capture can help establish where along an execution graph an analysis diverged and help zero in on the possible culprits.
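As a hedged sketch, the captured provenance graph can be loaded and queried with rdflib; the file name is illustrative, and the query simply counts the PROV activities recorded in a TriG file:

```python
import rdflib

# Load the captured provenance graph (TriG syntax) and count the
# recorded PROV activities; the file name is illustrative.
g = rdflib.ConjunctiveGraph()
g.parse('workflow_provenance.trig', format='trig')

query = """
SELECT (COUNT(?a) AS ?n)
WHERE { ?a a <http://www.w3.org/ns/prov#Activity> }
"""
for row in g.query(query):
    print('activities recorded:', row.n)
```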

Lesson 6 covers details of how provenance can be captured.

Hands on exercise:

Submit only your JSON output and provenance files as a pull request.

Solution

  1. See an example here
  2. Using your Unix skills (find, tar), extract only the JSON files, keeping the directory structure intact (a Python alternative is sketched after this list)
  3. Add the provenance files (trig, provn)
  4. Fork the simple_workflow repo
  5. Add your outputs to a new folder under other_outputs
  6. Commit the changes, push to your repo, and send a pull request
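If you prefer Python to find/tar, this sketch copies only the JSON and provenance files while preserving the directory layout; the source and destination paths are illustrative:

```python
import shutil
from pathlib import Path

# Copy only the JSON and provenance outputs into a new folder while
# preserving the directory layout; the paths are illustrative.
src = Path('output')
dst = Path('other_outputs/my_run')
for pattern in ('*.json', '*.trig', '*.provn'):
    for f in src.rglob(pattern):
        target = dst / f.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, target)
```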

What do the results indicate about neuroimaging software?

Solution

Researchers have to be careful about variations arising from numerical software. Software engineers have to test their software for numerical variation across different operating systems and software environments. The only practical way to scale such testing is to use continuous integration approaches.

Key Points