ReproNim module for data processing

Lesson 3: Create and maintain reproducible computational environments

Overview

Teaching: 60 min
Exercises: 120 min
Questions
  • Why and how to use containers and Virtual Machines?

Objectives
  • Determine the requirements of your analysis software for reproducibility

  • Use and create containers for reproducible research

You can skip this lesson if you can answer these questions.

This lesson is an introduction to reusable computational environments. It focuses on components relevant to brain imaging analyses.

Lesson outline

Lesson requirements

It is essential to have a basic understanding of:

Element 1: Overview

Carrying out reproducible data processing requires understanding the details of the software environment used in the analysis. An analysis is the outcome of data, analysis scripts, and the analysis environment (software and hardware). Modern operating systems are complex, and any analysis package may depend on many different components of the system. These include:

Given the complexity of the environment, it is imperative to get a clear understanding of:

  1. How to capture the details of the software and hardware requirements of an analysis
  2. How to determine which of these components can have a direct impact on results
  3. How to recreate the environment in which the analysis was done
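
As a rough illustration of the first point, the following shell sketch records a few facts about the current machine in a text file. The file name environment_report.txt and the choice of fields are assumptions made for this example; a real analysis would likely need more detail, such as the versions of the imaging packages themselves.

```shell
#!/bin/sh
# Record basic facts about the current environment in a plain-text file.
# The file name is an arbitrary choice for this sketch.
report=environment_report.txt

{
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "kernel: $(uname -srm)"
    echo "cpus: $(getconf _NPROCESSORS_ONLN 2>/dev/null)"
    # /etc/os-release exists on most modern Linux distributions, but not everywhere.
    if [ -r /etc/os-release ]; then
        echo "os: $(. /etc/os-release && echo "$PRETTY_NAME")"
    fi
    # Record the Python version and installed packages if Python is present.
    if command -v python3 >/dev/null 2>&1; then
        echo "python: $(python3 --version 2>&1)"
        echo "python packages:"
        python3 -m pip freeze 2>/dev/null | sed 's/^/  /'
    fi
} > "$report"

echo "Wrote $report"
```

Keeping such a report next to the analysis outputs makes it easier to tell later which parts of the environment differed between two runs.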

There are many ways to re-create a complete environment. Usually it is done via construction of environment “images” or “containers”.

Element 2: Understanding container technologies

Container technologies provide a mechanism to encapsulate analysis environments for redistribution. Many of these technologies also allow creating an executable environment from a script, which makes analysis environments easier to reproduce. Some popular virtual machine and container technologies are:

1. Virtual Machines and Vagrant

A Virtual Machine (VM) is an emulation of a computer system. Virtual machines are based on computer architectures and provide the functionality of a physical computer. Vagrant is a tool that simplifies the process of building and managing virtual machine environments.

Follow the instructions to install Vagrant.

2. Amazon AMIs

 a. [NITRC Computational Environment: NITRC-CE](http://www.nitrc.org/plugins/mwiki/index.php/nitrc:User_Guide_-_NITRC_Computational_Environment)

3. Docker

Docker provides a slew of services to make building and distributing containers easier. This includes integration with GitHub and the ability to pull pre-built containers from Docker Hub. In addition, Docker containers can be orchestrated together with docker-compose to generate interacting services.

Unfortunately, Docker cannot easily be used on traditional HPC resources. One of the main reasons is privilege escalation via Docker, i.e. users can get root access to the host system.

There are various versions of Docker depending on the system you’re using; read more here and follow the instructions to install an appropriate version.

4. Singularity

Singularity offers an alternative to Docker on HPC clusters. Creating Singularity containers requires root privileges on a machine, virtual or physical. However, using Singularity does not. The Singularity User Guide describes how to use the different components of Singularity. One of the big advantages of Singularity is support for native drivers and libraries (e.g., GPU, MPI, etc.).

If you have Linux, you can install and run Singularity directly on your OS. For other systems, you should use Vagrant and a VM; follow the instructions for Mac or Windows.

You can use each of the technologies above to set up analysis environments for brain imaging, but there are important technical differences between them.

Question:

Which container technologies can be used in HPC centers?

Answer

  • VM with Vagrant

  • Singularity

Element 3: Using pre-built containers for brain imaging

1. Vagrant: A neuroimaging environment based on NeuroDebian can be initialized quickly using a single command:

   vagrant init hlaubish/NeuroDebian_64; vagrant up --provider virtualbox

You can read more about the NeuroDebian VM here.

A general introduction to Vagrant is available here and a video tutorial is below.

Vagrant uses VirtualBox as its default provider. Another example of a reusable environment is this virtual machine, which can be used to reproduce the analyses from this paper.

2. NITRC-CE: TODO

3. Docker:

There are many existing images available on Docker Hub. You can find images for Ubuntu as well as images that contain more specific software, e.g. Nipype. Simple examples of how to pull and run an image can be found in this presentation.

A set of example brain imaging Docker images can also be found as part of the BIDS-Apps project. The project provides a basic tutorial to get started with the app.
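
Pulling and running an image can be sketched with two small shell helpers. The function names list_python_packages and run_with_output are hypothetical, and the sketch assumes that the image has python and pip on its PATH and that its entrypoint passes the given command through.

```shell
# Hypothetical helper: list the Python packages installed in a Docker image.
# Assumes the image provides python and pip.
list_python_packages() {
    image=$1
    docker pull "$image" &&
    docker run --rm "$image" python -m pip list
}

# Hypothetical helper: run a command in an image with a host directory
# mounted at /output, so that results survive after the container exits.
# /output is an arbitrary mount point chosen for this sketch.
run_with_output() {
    image=$1
    outdir=$2
    shift 2
    mkdir -p "$outdir" &&
    # Docker bind mounts need an absolute host path.
    docker run --rm -v "$(cd "$outdir" && pwd)":/output "$image" "$@"
}

# Example calls (need Docker and network access; the image name is a placeholder):
#   list_python_packages some/image
#   run_with_output some/image ./results ls /output
```

The --rm flag removes the container once the command finishes, so only files written to the mounted directory are kept.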

4. Singularity:

Singularity is useful in HPC centers where Docker is not allowed. Any Docker image can be pulled in as a Singularity container. Therefore, you can retrieve any of the BIDS-Apps above as a Singularity image using the command:

   singularity shell docker://bids/freesurfer

Singularity also has an online registry for images, Singularity Hub. You can pull images directly from either Docker Hub or Singularity Hub; you can find more about the pull command here.

You can find more details on how to run an image here.
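
Besides opening an interactive shell, a single command can be run inside a Docker Hub image through Singularity. The helper name run_in_singularity is hypothetical, and the sketch assumes a Singularity version that accepts docker:// URIs for exec.

```shell
# Hypothetical helper: run one command inside a Docker Hub image via Singularity.
run_in_singularity() {
    image=$1
    shift
    # docker:// tells Singularity to fetch the image layers from Docker Hub.
    singularity exec "docker://$image" "$@"
}

# Example call (needs Singularity and network access):
#   run_in_singularity bids/freesurfer cat /etc/os-release
```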

Question:

Can you run a Docker image in HPC centers?

Answer

Yes, if you have Singularity. Singularity can run both Singularity and Docker images.

Hands on exercise:

Pull the satra/nih-workshop-2017 Docker image and check which Python packages are installed.

Hands on exercise:

Repeat the previous exercise using Singularity.

Hands on exercise:

Follow the [Simple Workflow README](https://github.com/ReproNim/simple_workflow) and run run_demo_workflow.py for one subject. Be sure to mount the directory to save your output.

Hands on exercise:

Repeat the previous exercise using Singularity.

Element 4: Creating reproducible computational environments

1. Creating a Vagrant VM for distribution

Vagrant supports VirtualBox and VMware virtual machines. Using Vagrant with VirtualBox is a matter of creating a Vagrantfile and using it to download and configure an execution environment. As an example, one can consider how to create an image with NeuroDebian and install FSL tools into it:

vagrant init ubuntu/trusty64
vagrant up

vagrant ssh -c /bin/sh <<EOF
   wget -O- http://neuro.debian.net/lists/trusty.us-nh.full | sudo tee /etc/apt/sources.list.d/neurodebian.sources.list
   sudo apt-key adv --recv-keys --keyserver hkp://pgp.mit.edu:80 0xA5D32F012649A5A9
   sudo apt-get update
   sudo apt-get -y install fsl-complete
EOF
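
The same provisioning steps could instead be recorded in the Vagrantfile itself, so that vagrant up alone rebuilds the environment. The following is a minimal sketch that writes such a file; the directory name fsl-vm is an assumption for this example, and the provisioning commands simply mirror the ones above.

```shell
# Write a Vagrantfile that provisions the VM on its first "vagrant up".
# The directory name fsl-vm is an arbitrary choice for this sketch.
mkdir -p fsl-vm
cat > fsl-vm/Vagrantfile <<'EOF'
# -*- mode: ruby -*-
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  # The inline shell provisioner runs as root inside the VM by default.
  config.vm.provision "shell", inline: <<-SHELL
    wget -O- http://neuro.debian.net/lists/trusty.us-nh.full > /etc/apt/sources.list.d/neurodebian.sources.list
    apt-key adv --recv-keys --keyserver hkp://pgp.mit.edu:80 0xA5D32F012649A5A9
    apt-get update
    apt-get -y install fsl-complete
  SHELL
end
EOF

# To build the VM (needs Vagrant and VirtualBox):
#   cd fsl-vm && vagrant up
```

Keeping the steps in the Vagrantfile means the file alone can be shared or version-controlled to recreate the VM.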

2. Create a Docker image

To create a Docker image, you should write a Dockerfile. You can find a simple example of writing a Dockerfile and building an image here.
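
For orientation, a very small Dockerfile might look like the one written by the sketch below. The base image, the installed package, and the directory and image names (docker-example, my-analysis-env) are all assumptions chosen for illustration, not a recommended neuroimaging setup.

```shell
# Write a minimal Dockerfile into its own build directory.
mkdir -p docker-example
cat > docker-example/Dockerfile <<'EOF'
# Pin a specific tag so that rebuilds start from the same base layers.
FROM ubuntu:22.04
# Install packages and remove the apt cache in the same layer to keep the image small.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
# Directories intended as mount points for input data and results.
RUN mkdir -p /data /output
CMD ["/bin/bash"]
EOF

# To build and try the image (needs Docker):
#   docker build -t my-analysis-env docker-example
#   docker run --rm -it my-analysis-env
```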

If you want to create a new image for neuroimaging, you should check the Neurodocker project, which lets you generate custom Dockerfiles and minify existing Docker images. Neurodocker not only simplifies writing a new Dockerfile, but also incorporates best practices for installing software. You can compare a simple script to create a Docker image with FSL to the Dockerfile itself, which contains many more details of proper installation and cleanup. Neurodocker can also easily be used to include Python and any Python libraries that can be installed using conda or pip. This is a simple example of a neurodocker command and Dockerfile. More examples can be found here.

3. Create a Singularity image

To create an empty image or import layers from a Docker image, you don’t need root privileges. You can do it using the create and import commands.

However, if you want to install additional software or create an image from scratch, you need root privileges on a machine. It doesn’t have to be a physical machine: if you’re using an HPC account, you can use Vagrant to create a virtual machine with Singularity that can in turn be used to create a new image.

This is similar to running Singularity on Mac or Windows. You can follow the instructions from the previous part or try to build a Vagrant box from scratch:

vagrant init ubuntu/trusty64
vagrant up

vagrant ssh -c /bin/sh <<EOF
   sudo apt-get update
   sudo apt-get -y install build-essential curl git sudo man vim autoconf libtool
   git clone https://github.com/singularityware/singularity.git
   cd singularity
   ./autogen.sh
   ./configure --prefix=/usr/local
   make
   sudo make install
EOF

If you want to create a new image that you’re going to share, the recommended practice is to create a bootstrap file. You can still start from a Docker image, but you can easily add environment variables, additional software, etc. You can find a short recipe for bootstrap files here; for best practices and more details, check this.
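
A bootstrap file that starts from a Docker image could look roughly like the one written below. The base image, packages, and directory names are assumptions for illustration, and the section names follow the Singularity definition-file format as documented in the Singularity User Guide.

```shell
# Write a minimal Singularity bootstrap (definition) file.
mkdir -p singularity-example
cat > singularity-example/Singularity <<'EOF'
Bootstrap: docker
From: ubuntu:16.04

%post
    # Commands in %post run inside the image while it is being built.
    apt-get update
    apt-get -y install git vim
    mkdir -p /data /output

%environment
    # Variables exported here are set whenever the container runs.
    export LC_ALL=C
EOF
```

The command that turns this file into an image depends on the Singularity version (older releases use create followed by bootstrap, newer ones use build) and needs root privileges, so check the user guide for the version you have installed.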

For more information about creating a Singularity image, changing its size, etc., you should check the Singularity website, which contains nice short videos.

If you want to change an already existing image (e.g. for testing purposes), you can mount the image using the --writable option (yes, you need to have root privileges).

Question:

Which Singularity commands require root privileges?

Answer

  • bootstrap

  • most commands with the --writable option

Hands on exercise:

Create a Docker image using Neurodocker with a specific version of FSL and a Python 3.6 conda environment.

Hands on exercise:

Create a Singularity image by importing the previously created Docker image.

Hands on exercise:

Create a bootstrap file that starts from your Docker image. In addition to FSL, install git and your favourite text editor (e.g. emacs). Create directories for your data and output.

Element 5: Capturing the essential pieces needed for an analysis

Many packages, such as FSL, FreeSurfer, and AFNI, perform many different kinds of computation. Not all of their components are necessary for a specific analysis, and definitely not for repeating the exact analysis steps. ReproZip allows users to package the essential pieces necessary for an analysis. ReproZip can be used inside all of the above container technologies to create a minimal package for repeating an analysis. As such, the ReproZip bundle contains the necessary software components for recreating the exact software environment used in an analysis.
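
A typical ReproZip session traces one run of the analysis and then packs everything that run touched. The helper name pack_analysis and the bundle and script names below are hypothetical; the sketch assumes reprozip is installed on a Linux machine.

```shell
# Hypothetical helper: trace a command with ReproZip, then pack what it used.
pack_analysis() {
    bundle=$1
    shift
    # --overwrite discards any earlier trace stored in the current directory.
    reprozip trace --overwrite "$@" &&
    reprozip pack "$bundle"
}

# Example call (file names are placeholders):
#   pack_analysis my_analysis.rpz python run_analysis.py
#
# The bundle can later be unpacked and rerun, for example with Docker:
#   reprounzip docker setup my_analysis.rpz my_analysis_dir
#   reprounzip docker run my_analysis_dir
```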

Key Points