Overview

Teaching: 300 min
Exercises: 40 min

Questions
- How do version control systems help reproducibility, and which systems should be used?

Objectives
- Become familiar with version control systems for code and data, as well as relevant tools based on them
- Learn how to use version control systems to obtain, maintain, and share code and data
- Review available third-party services and workflows that can help guarantee reproducibility of results
You can skip this lesson if you can answer these questions:
- How do you keep all your scripts, configuration, notes and data under version control systems and shareable with your collaborators?
- How do you establish and use continuous integration systems to verify the correctness of reproduced results (where feasible)?
- What exactly did you do in your sample data analysis project X on date Y?
We all probably do some level of version control with our files, documents, and even data files, but without a version control system (VCS) we do it in an ad-hoc manner: renaming files to mark versions (think `analysis-final-v2.doc`), keeping dated copies of directories, or emailing snapshots back and forth.
So, in general, VCSs help track versions of digital artifacts such as code (scripts, source files), configuration files, images, documents, and data – both original and generated (as an outcome of the analysis). With proper annotation of changes, a VCS becomes the lab notebook for changing content in the digital world. Since all versions are stored, a VCS makes it possible to retrieve any previous version at any later point in time. You can thus see how important it can be for reproducing previous results – if your work's history is stored in a VCS, you just need to check out a previous version of your materials and carry out the analysis using it. You can also recover a file you mistakenly removed, since a previous version is contained within your VCS, so no more excuses like “the cat ate my source code”. For those features alone it is worth placing any materials you produce and care about under an appropriate VCS.
Besides tracking changes, the other main function of a VCS is collaboration. VCSs are typically used not only locally but across multiple hosts. Any modern VCS supports the transfer and aggregation of versions of (or changes to) your work among collaborators. By using public services (such as GitHub) you can also make your work available to other online services (such as Travis CI) that can be configured to react to any new change you introduce and to perform prescribed actions. Integration with such services, which allows data to be automatically reanalyzed and verified against expected results, provides another big benefit for guaranteeing correct computations and reproducibility.
In this module we will first learn about Git, one of the most widely used distributed VCSs, and then about tools and services built on top of it – GitHub, continuous integration services, git-annex, and DataLad.
External teaching materials
To gain good general working knowledge of VCSs and Git, please go through the following lessons and a tutorial:
- Software Carpentry: Version Control with Git (full: 2:30h, familiarize: 20m) – a thorough lesson on the main git commands and workflows. Please complete the lesson up to at least the `Licensing` submodule, which will be a separate topic.
- Curious Git: A Curious Tale (full: 30m, familiarize: 10m) – a useful read if you feel that “git internals” look like a black box to you. This example guides you through the principles of Git without talking about Git.
- (very optional, since this module is Git-based) Software Carpentry: Version Control with Mercurial (full: 4h)
On a new host, what is the recommended first step to set up Git?
It is recommended to configure Git so that commits have appropriate author information.
```
% git config --global user.name "FirstName LastName"
% git config --global user.email "email@example.com"
```
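As a quick sanity check (optional), you can list the resulting global settings:

```
% git config --global --list   # shows user.name, user.email, and any other global settings
```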
Exercise: a basic Git/GitHub workflow
Goal: submit a pull request (PR) suggesting a change to the https://github.com/ReproNim/simple_workflow analysis. You should submit an initial PR with one of the changes, then improve it with subsequent commits and see how the PR gets updated automatically. Possible changes for the first commit to initiate a PR:
- a completely dummy change to README.md (0 points)
- a typo fix to README.md (10 points)
Then proceed to enact a more meaningful change:

- modify `circle.yml` to run the analysis (look for the line with `run_demo_workflow.py`) on just a single subject (it currently uses `-n 2` to run on two subjects); see the sketch below for one way to make this change locally
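One way to make that change in your local clone of your fork before opening the PR – a minimal sketch, assuming the `-n 2` string occurs exactly once in `circle.yml` (verify with `grep` first; on macOS use `sed -i ''`):

```
% grep -n 'run_demo_workflow' circle.yml    # locate the invocation line first
% sed -i 's/-n 2/-n 1/' circle.yml          # switch from two subjects to one
% git commit -m "Run analysis on a single subject" circle.yml
% git push origin HEAD                      # then open/update the PR on GitHub
```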
Exercise: exploiting git history
Goal: determine how the estimate for the Left-Amygdala changed in subject AnnArbor_sub16960 from release 1.0.0 to 1.1.0.
`git diff` allows us to see the differences between points in the git history and optionally to restrict the output to specific file(s), so the answer to the challenge is:
```
% git diff 1.0.0..1.1.0 -- expected_output/AnnArbor_sub16960/segstats.json
...
     "Left-Amygdala": [
-        619,
-        742.80002951622009
+        608,
+        729.60002899169922
     ],
```
As you have learned in the Remotes in GitHub section of the Software Carpentry Git course, the GitHub website provides you with public (or private) storage for your Git repositories on the web. The GitHub website also allows third-party websites to interact with your repositories to provide additional services, typically in response to your submission of new changes to your repositories. Visit the GitHub Marketplace for an overview of the vast collection of such additional services. Some services are free, some are “pay-for-service”. Students can benefit from obtaining a Student Developer Pack to gain free access to some services that would otherwise require a fee.
There is a growing number of online services providing continuous integration (CI). Although the free tier is unlikely to provide sufficient resources to carry out entire data analyses on your data, you are encouraged to use CI services: they can help verify that your code executes correctly and reproducibly, and that your results are reproducible. CI can run a set of unit tests using “toy”/simulated data, or operate on a subset of the real dataset. For example, see the simple workflow code for a very simple, re-executable neuroimaging publication.
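As a sketch of what such a setup can look like, here is a minimal, hypothetical `.travis.yml` for a Python-based analysis; the file names (`requirements.txt`, `run_demo_workflow.py`, `check_output.py`) are placeholders, and the real `simple_workflow` configuration differs:

```yaml
language: python
python:
  - "3.6"
install:
  - pip install -r requirements.txt           # install analysis dependencies
script:
  - python run_demo_workflow.py -n 1          # re-run the analysis on a single subject
  - python check_output.py expected_output/   # hypothetical check against stored results
```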
External teaching materials
- A quick Travis CI Tutorial for Node.js developers (full: 20m) – a good description of all the steps needed to enable Travis CI for your GitHub project. Although tuned for Node.js projects, the same principles apply to other platforms/languages.
- Shablona - A template for small scientific python projects (review: 5m, optional) – a template for scientific Python projects. Review its `.travis.yml` for an example of a typical setup for a Python-based project.
- Travis CI Documentation (familiarize: 10m, canonical reference) – the ultimate documentation for Travis CI. Review the sections relevant to your language/platform.
External teaching materials
- CircleCI 1.0 Documentation (familiarize: 10m, canonical reference) – the ultimate documentation for CircleCI. Review the sections relevant to your language/platform.
External review materials
- Continuous Integration in the Cloud: Comparing Travis, Circle and Codeship (review: 10m) – having become acquainted with the basics of these two CI offerings, review the differences among them
- Side-by-side comparison of CI services (review: 5m)
Exercise: enable one of the CI services on your own fork of `simple_workflow` to execute the sample analysis on another subject.
Upon `git annex add`, instead of committing the file content directly to git, git-annex moves it under `.git/annex/objects`, into a file typically named according to the checksum of the file's content, and in its place creates a symlink pointing to that new location. Within a separate `git-annex` branch, information is recorded regarding where (on which machine/clone or web URL) that data file is available from. Later on, if you have access to clones of the repository that have a copy of the file, you can easily `git annex get` it (which will download/copy that file into your local clone) or `git annex drop` it (which will remove that file's content from your local clone, as long as git-annex knows of another copy). As a result of git not containing the actual content of those large files, but instead containing just symlinks and information within the `git-annex` branch, it becomes possible to clone and work with such repositories quickly and cheaply, fetching only the content you actually need; even a fresh clone carries the `git-annex` branch, which contains the relevant availability information.
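A minimal sketch you can try in a scratch directory to see this mechanism in action (the repository and file names are arbitrary):

```
% git init annex-demo && cd annex-demo
% git annex init "my laptop"                      # enable git-annex in this clone
% dd if=/dev/urandom of=big.dat bs=1M count=10    # create a sample "large" file
% git annex add big.dat                           # content moves under .git/annex/objects
% ls -l big.dat                                   # big.dat is now a symlink
% git commit -m "Add big.dat"
% git annex whereis big.dat                       # which clones have the content
```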
Note: do not try to manually `git merge` a `git-annex` branch. git-annex uses a special merge algorithm to merge data availability information, and you should use the `git annex merge` or `git annex sync` commands to merge the `git-annex` branch correctly.
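For example, after fetching changes from a remote, something like the following keeps the availability information consistent (a sketch; `origin` stands for whatever remote you use):

```
% git fetch origin    # brings in the remote's git-annex branch, among others
% git annex merge     # merges the git-annex branch using its special algorithm
% git annex sync      # or: commit, merge, and push/pull across all remotes in one step
```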
External teaching materials
- git-annex walkthrough from a Cog Neuroscientist (full: 30 min): an IPython/Jupyter notebook. Please go through all the items, either by rerunning the notebook cells (if you have IPython/Jupyter available) or by copy/pasting them into a terminal
- git-annex walkthrough (full: 10 min): the original git-annex walkthrough. Go through all the sections to see which aspects the previous walkthrough did not cover.
- Another walk-through on a typical use-case for sync’ing, etc. (optional)
How can we get data files controlled by git-annex?
Using git/git-annex commands:

- “Download” (clone) a BIDS dataset from https://github.com/datalad/ds000114
- `get` all non-preprocessed T1w anatomicals
- Try (and fail) to `get` all FreeSurfer-preprocessed T1 volumes (`derivatives/freesurfer/sub-*/mri/T1.mgz`)
- Knowing that `yoh@falkor:/srv/datasets.datalad.org/www/workshops/nipype-2017/ds000114` is available via http from http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git, get those files
```
% git clone https://github.com/datalad/ds000114                  # 1.
% cd ds000114
% git annex get sub-*/anat/sub-*_T1w.nii.gz                      # 2.
% git annex get derivatives/freesurfer/sub-*/mri/T1.mgz          # 3. (should fail)
% git remote add datalad http://datasets.datalad.org/workshops/nipype-2017/ds000114/.git
% git fetch datalad
% git annex get derivatives/freesurfer/sub-*/mri/T1.mgz          # 4. (should succeed)
```
How can we add file `a.txt` directly under git, and file `b.dat` under git-annex?
Simple method (initial invocation)
Use `git add` for adding files under git, and `git annex add` to add files under annex:

```
% git add a.txt
% git annex add b.dat
```
Advanced method (for all future `git annex add` calls)
If you want to automate such “decision making” based on files' extensions and/or their sizes, you can specify those rules within a `.gitattributes` file (which in turn also needs to be `git add`-ed), e.g.

```
% cat << EOF > .gitattributes
* annex.largefiles=(not(mimetype=text/*))
*.dat annex.largefiles=anything
EOF
```
would instruct the `git annex add` command to add all non-text files (according to the auto-detected MIME type of their content) and all files with the `.dat` extension to git-annex, and the rest to git:

```
% git add .gitattributes   # add the new .gitattributes to git
% git annex add a.txt b.dat
```
The DataLad project relies on Git and git-annex to establish an integrated data monitoring, management, and distribution environment. As a sample data distribution, based on a number of “data crawlers” for existing data portals, it provides unified access to over 10TB of neural data from various initiatives (such as CRCNS, OpenfMRI, etc.).
External teaching materials
- DataLad lecture and demo (Full: 55 min) This lecture describes the goals and basic principles of DataLad, and presents the first of the demos on discovery and installation of the datasets.
- DataLad demos of the features (Full: 30 min, review: 10 min) provides an asciinema (and shell script versions) introduction to major features of DataLad
- This example of a DataLad collaborative workflow (Full: 10 min) presents a simple workflow using DataLad to fully “version-control” data and code in a project and collaborate efficiently
What DataLad command assists in recording the “effect” of running a command?
```
% datalad run COMMAND PARAMETERS
```
Please see `datalad run --help` for more details.
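For instance, a hypothetical single-step record of a skull-stripping call (the input path and output name here are placeholders) could look like:

```
% datalad run -m "Skull-strip sub-01" bet sub-01/anat/sub-01_T1w.nii.gz sub-01_brain
% git log -1    # the commit records the exact command; datalad rerun can replay it later
```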
How can we create a new sub-dataset, populate it with derivative data, and share it?
Using DataLad commands, and starting with your existing clone of `ds000114` from the preceding exercise:

- Create sub-dataset `derivatives/demo-bet`
- Using a skull-stripping tool (e.g. `bet` from the FSL suite), for each original subject, produce a skull-stripped anatomical under its corresponding subdirectory of `derivatives/demo-bet`. You are encouraged to use the `datalad run` command (available in DataLad 0.9 or later) to leave a record of the action you took
- Publish your work to your “fork” of the repository on GitHub, while uploading data files to any data host you have available (ssh/http server, box.com, dropbox, etc.)
```
% cd ds000114
% datalad create -d . derivatives/demo-bet                     # 1.
% # a somewhat long, but fully automated and "protocoled" by run, solution:
% datalad run 'for f in sub-*/anat/sub-*_T1w.nii.gz; do d=$(dirname $f); od=derivatives/demo-bet/$d; mkdir -p $od; bet $f derivatives/demo-bet/$f; done'   # 2.
% # establish a folder on box.com, access to which would be shared in the group
% export WEBDAV_USERNAME=secret WEBDAV_PASSWORD=secret
% cd derivatives/demo-bet
% # see https://git-annex.branchable.com/special_remotes for more supported git-annex special remotes
% git annex initremote box.com type=webdav url=https://dav.box.com/dav/team/ds000114--demo-bet chunk=50mb encryption=none
% datalad create-sibling-github --publish-depends box.com --access-protocol https ds000114--demo-bet
% datalad publish --to github sub*
```
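Once published, a collaborator could obtain both the code and (on demand) the data with something like the following sketch; the GitHub URL is hypothetical and depends on your account name:

```
% datalad install https://github.com/YOURUSER/ds000114--demo-bet
% cd ds000114--demo-bet
% datalad get sub-01/anat/sub-01_T1w.nii.gz   # fetches content from the box.com remote
```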
- sumatra – manages and tracks projects based on numerical simulation or analysis, with the aim of supporting reproducible research. It can be thought of as an “automated electronic lab notebook” for simulation/analysis projects.
- noworkflow – captures a variety of provenance information and provides some analyses such as graph-based visualization, differencing over provenance trails, and inference queries.
- etckeeper – a helper tool for anyone administering a Linux-based system, which stores and automatically commits any changes within `/etc` into a VCS of your choice. With its help you can track changes in your system configuration, which can be indispensable when investigating a system malfunction.
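A minimal sketch of putting a machine's configuration under control with etckeeper (requires root; the commit message is arbitrary):

```
% sudo etckeeper init                                       # start tracking /etc in the chosen VCS
% sudo etckeeper commit "Snapshot before nginx upgrade"     # record the current state of /etc
```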
- Using a VCS not only improves sharing and collaboration, but is also integral to assisting and guaranteeing reproducibility
- A VCS can be used directly or can serve as a foundation for domain-specific tools