ReproNim Reproducible Basics Module

Version control systems

Overview

Teaching: 300 min
Exercises: 40 min
Questions
  • How do version control systems help reproducibility, and which systems should be used?

Objectives
  • Become familiar with version control systems for code and data, as well as relevant tools based on them

  • Learn how to use version control systems to obtain, maintain and share code and data

  • Review available third party services and workflows that could be used to help to guarantee reproducibility of results

You can skip this lesson if you can answer these questions:

What is a “version control system” (VCS)?

We all probably do some level of version control with our files, documents, and even data files, but without a version control system (VCS) we do it in an ad-hoc manner:

“A Story Told in File Names” by Jorge Cham: http://www.phdcomics.com/comics/archive_print.php?comicid=1323

So, in general, VCSs help track versions of digital artifacts such as code (scripts, source files), configuration files, images, documents, and data – both original and generated (as an outcome of the analysis). With proper annotation of changes, a VCS becomes the lab notebook for changing content in the digital world. Since all versions are stored, a VCS makes it possible to retrieve any previous version at any later point in time. You can thus see how it can be important for reproducing previous results – if your work’s history is stored in a VCS, you just need to check out a previous version of your materials and carry out the analysis using it. You can also recover a file you mistakenly removed, since a previous version is retained within your VCS, so no more excuses like “the cat ate my source code”. For these features alone, it is worth placing any materials you produce and care about under an appropriate VCS.

Besides tracking changes, another main function of a VCS is collaboration. VCSs are typically used not only locally but across multiple hosts, and any modern VCS supports transfer and aggregation of versions of (or changes to) your work among collaborators. By using a public hosting service (such as GitHub) you can also make your repositories available to other online services (such as Travis CI) that can be configured to react to any new change you introduce and to perform prescribed actions. Integration with such services, which allows data to be automatically reanalyzed and the results verified against expectations, provides another big benefit for guaranteeing correct computations and reproducibility.

In this module we will first learn about Git, and then about tools and services built on top of it (third-party services such as continuous integration, git-annex, and DataLad).

Git

External teaching materials

To gain a good general working knowledge of VCSs and Git, please go through the following lessons and tutorial:

It is recommended to configure Git so that your commits carry appropriate author information:

% git config --global user.name "FirstName LastName"
% git config --global user.email "ideally@real.email"
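
To verify that the configuration took effect, you can list your global Git settings:

% git config --global --list    # should now include the user.name and user.email entries set above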

Exercise: a basic Git/GitHub workflow

Goal: submit a pull request (PR) suggesting a change to the https://github.com/ReproNim/simple_workflow analysis. You should submit an initial PR with one of the changes, then improve it with additional commits and see how the PR gets automatically updated (the generic fork-and-PR mechanics are sketched below). Possible changes for the first commit to initiate a PR:

Then proceed to enact a more meaningful change:
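
Whichever change you pick, the mechanics of the fork-and-PR workflow are roughly as follows. This is a generic sketch, assuming you have already created a fork of ReproNim/simple_workflow under your GitHub account; YOURLOGIN, the branch name, and the commit message are placeholders:

% git clone https://github.com/YOURLOGIN/simple_workflow   # clone your fork
% cd simple_workflow
% git checkout -b my-fix                                   # work on a dedicated branch
% # ... edit the file(s) you chose to change ...
% git add CHANGED_FILE
% git commit -m "Describe the change you made"
% git push -u origin my-fix                                # push the branch to your fork
% # then open a PR against ReproNim/simple_workflow via the GitHub web interface;
% # any further commits pushed to the same branch will update the PR automatically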

Exercise: exploiting git history

Goal: determine how the estimate for the Left-Amygdala changed for the AnnArbor_sub16960 subject from release 1.0.0 to 1.1.0.

Answer

git diff allows us to see the differences between any two points in the git history (here, the release tags 1.0.0 and 1.1.0) and to optionally restrict the comparison to specific file(s); git tag and git grep can be used to discover the tag names and the file holding the estimate (see the example after the diff):

% git diff 1.0.0..1.1.0 -- expected_output/AnnArbor_sub16960/segstats.json
...
     "Left-Amygdala": [
-        619,
-        742.80002951622009
+        608,
+        729.60002899169922
     ],
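
If you do not remember the exact tag names or which file carries the estimate, git tag and git grep can help to locate them first (the grep pattern below is just an illustration):

% git tag                                                  # list available release tags (1.0.0, 1.1.0, ...)
% git grep -l Left-Amygdala 1.0.0 -- expected_output/      # find files in release 1.0.0 mentioning Left-Amygdala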

Third-party services

As you have learned in the Remotes in GitHub section of the Software Carpentry Git course, the GitHub website provides you with public (or private) storage for your Git repositories on the web. The GitHub website also allows third-party websites to interact with your repositories to provide additional services, typically in response to your submission of new changes to your repositories. Visit the GitHub Marketplace for an overview of the vast collection of such services. Some are free, some are “pay-for-service”. Students can benefit from obtaining a Student Developer Pack to gain free access to some services which otherwise would require a fee.

Continuous integration

There is a growing number of online services providing continuous integration (CI). Although the free tier is unlikely to provide sufficient resources to carry out entire analyses on your data, you are encouraged to use CI services: they can help verify correct execution of your code as well as the reproducibility of your results. CI can be run on a set of unit tests using “toy”/simulated data or on a subset of the real dataset. For example, see the simple_workflow code for a very simple, re-executable neuroimaging publication.

Travis CI

Travis CI was one of the first free continuous integration services integrated with GitHub. It is free for projects available publicly on GitHub.
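
To enable it for a repository, you sign in to Travis CI with your GitHub account, switch the repository on, and commit a .travis.yml file describing how to build and test your project. Below is a minimal sketch only, assuming a Python project whose dependencies are listed in requirements.txt and whose tests run with pytest; adjust the language, versions, and commands to your project:

% cat << EOF > .travis.yml
language: python
python:
  - "3.6"
install:
  - pip install -r requirements.txt
script:
  - pytest
EOF
% git add .travis.yml
% git commit -m "Enable Travis CI"
% git push    # Travis will now run the tests for every push and pull request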

External teaching materials

Circle-CI

External teaching materials

External review materials

Exercise

Adjust simple_workflow to execute the sample analysis on another subject.

git-annex

git-annex is a tool that allows a user to manage data files within a git repository without committing the (large) content of those data files directly to git. In a nutshell, git-annex moves the content of an annexed file under .git/annex/objects, checks into git only a symlink pointing to that content, and records availability information (which repositories hold a copy of which file) in a dedicated git-annex branch.

Later on, if you have access to a clone of the repository that has a copy of the file, you can easily get it (which will download/copy the content under .git/annex/objects) or drop it (which will remove the content from .git/annex/objects).

As a result of git not containing the actual content of those large files, but instead containing just the symlinks and the information within the git-annex branch, it becomes possible to keep the git repository itself small and fast to clone, to share it via services (such as GitHub) that would not accept large files, and to keep on any given machine only the content you actually need.
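
A minimal sketch of the basic workflow (big_file.nii.gz is a hypothetical large data file in an existing git repository):

% git annex init                   # enable git-annex in this repository (creates the git-annex branch)
% git annex add big_file.nii.gz    # move the content under .git/annex/objects and stage a symlink instead
% git commit -m "Add anatomical image under git-annex"
% ls -l big_file.nii.gz            # -> a symlink pointing into .git/annex/objects/...
% git annex drop big_file.nii.gz   # remove the local copy (refuses unless another copy is known to exist)
% git annex get big_file.nii.gz    # re-obtain the content from a remote that has it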

Note

Never manually git merge the git-annex branch. git-annex uses a special merge algorithm to merge data availability information, so use the git annex merge or git annex sync commands to merge the git-annex branch correctly.
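
Instead of merging that branch by hand, a single command keeps repositories in sync; for example:

% git annex sync              # commit, fetch, merge (handling the git-annex branch correctly), and push
% git annex sync --content    # additionally transfer annexed file content to/from remotes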

External teaching materials

How can we get data files controlled by git-annex?

Using git/git-annex commands:

  1. “Download” a BIDS dataset from https://github.com/datalad/ds000114
  2. Get all non-preprocessed T1w anatomicals
  3. Try (and fail) to get all T1.mgz files
  4. Knowing that yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114 is also available via http at http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git, get those T1.mgz files

Answer

% git clone https://github.com/datalad/ds000114   # 1.
% cd ds000114
% git annex get sub-*/anat/sub-*_T1w.nii.gz       # 2.
% git annex get derivatives/freesurfer/sub-*/mri/T1.mgz  # 3. (should fail)
% git remote add datalad http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git
% git fetch datalad
% git annex get derivatives/freesurfer/sub-*/mri/T1.mgz  # 4. (should succeed)
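
To see which remotes hold the content of a given file (and therefore where get could obtain it from), you can ask git-annex, e.g. for the files from this exercise:

% git annex whereis derivatives/freesurfer/sub-*/mri/T1.mgz   # lists the repositories known to have each file's content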

How can we add the file a.txt directly under git, and file b.dat under git-annex?

Simple method (initial invocation)

Use git add to add files directly under git, and git annex add to add files under git-annex:

% git add a.txt
% git annex add b.dat

Advanced method (for all future git annex add calls)

If you want to automate such “decision making” based on files’ extensions and/or their sizes, you can specify the rules within a .gitattributes file (which in turn also needs to be git add-ed), e.g.

% cat << EOF > .gitattributes
* annex.largefiles=(not(mimetype=text/*))
*.dat annex.largefiles=anything
EOF

would instruct the git annex add command to add to git-annex all non-text files (according to the auto-detected MIME type of their content) as well as all files with the .dat extension, and to add the rest directly to git:

% git add .gitattributes     # add the new .gitattributes file itself to git
% git annex add a.txt b.dat
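
To verify where each file ended up, you can inspect the working tree and ask git-annex (a quick check; the exact output depends on your git-annex version):

% ls -l a.txt b.dat        # a.txt is a regular file (in git); b.dat is a symlink into .git/annex/objects
% git annex whereis b.dat  # confirms b.dat is managed by git-annex and shows where its content resides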

DataLad

The DataLad project relies on Git and git-annex to establish an integrated data monitoring, management and distribution environment. As a sample data distribution, assembled by a number of “data crawlers” for existing data portals, it provides unified access to over 10 TB of neural data from various initiatives (such as CRCNS, OpenfMRI, etc.).
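
For instance, the collection can be explored by installing the top-level DataLad superdataset and searching its metadata. This is only a sketch: the target directory name and the query term are arbitrary examples, and no data content is downloaded until you explicitly get it:

% datalad install -s /// datalad-datasets   # install the superdataset from datasets.datalad.org
% cd datalad-datasets
% datalad search amygdala                   # search dataset metadata for a term of interest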

External teaching materials

What DataLad command assists in recording the “effect” of running a command?

% datalad run COMMAND PARAMETERS

Please see datalad run --help for more details.
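
For example, a (hypothetical) skull-stripping call could be recorded in the dataset’s history like this; DataLad will commit both the command that was run and any files it modified:

% datalad run -m "Skull-strip sub-01 T1w with FSL bet" \
    bet sub-01/anat/sub-01_T1w.nii.gz sub-01/anat/sub-01_T1w_brain.nii.gz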

How can we create a new sub-dataset, populate it with derivative data, and share it?

Using DataLad commands, and starting with your existing clone of ds000114 from the preceding exercise:

  1. Create sub-dataset derivatives/demo-bet
  2. Using a skull-stripping tool (e.g. bet from the FSL suite), produce, for each original subject, a skull-stripped anatomical under its corresponding subdirectory of derivatives/demo-bet. You are encouraged to use the datalad run command (available in DataLad 0.9 or later) to leave a record of the action you took
  3. Publish your work to your “fork” of the repository on GitHub, while uploading data files to any data host you have available (ssh/http server, box.com, dropbox, etc)

Answer

% cd ds000114
% datalad create -d . derivatives/demo-bet   # 1.
% # a somewhat long, but fully automated solution, “protocoled” by datalad run:
% datalad run 'for f in sub-*/anat/sub-*_T1w.nii.gz; do d=$(dirname $f); od=derivatives/demo-bet/$d; mkdir -p $od; bet $f derivatives/demo-bet/$f; done'  # 2.
% # establish a folder on box.com, access to which would be shared within the group
% export WEBDAV_USERNAME=secret WEBDAV_PASSWORD=secret
% cd derivatives/demo-bet
% # see https://git-annex.branchable.com/special_remotes for more supported git-annex special remotes
% git annex initremote box.com type=webdav url=https://dav.box.com/dav/team/ds000114--demo-bet chunk=50mb encryption=none
% datalad create-sibling-github --publish-depends box.com --access-protocol https ds000114--demo-bet
% datalad publish --to github sub*
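
Once published, a collaborator could obtain the derivative dataset and its content with DataLad. In this sketch YOURLOGIN stands for your GitHub username, the file path is just an example, and fetching content from the box.com special remote requires the WebDAV credentials shared within the group:

% datalad install https://github.com/YOURLOGIN/ds000114--demo-bet
% cd ds000114--demo-bet
% datalad get sub-01/anat/sub-01_T1w.nii.gz   # content is retrieved from box.com via git-annex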

Additional relevant helpers

Neuroimaging ad-hoc “versioning”

Key Points