Overview
Teaching: 30 min
Exercises: 60 min
Questions
- What are the different considerations for reproducible analysis?
Objectives
- Learn core elements of reproducible analysis
- Become familiar with various technology terms
You can skip this lesson if you can answer these questions:
- What are the four core elements of a reproducible analysis?
- Why should you annotate your data in a machine accessible manner?
- Why should you use version control for data and code?
- What are some ways to share your analysis environment with others?
- What does continuous integration help you with?
This lesson uses the Simple Workflow GitHub repository to illustrate core concepts of reproducible analysis and the pitfalls associated with complex data, software, and computing environments. The complete simple workflow paper is available here.
It is helpful to have an understanding of:
This basic workflow extracts a collection of brain images and associated phenotypic information (e.g., age) from a spreadsheet, and runs a Nipype workflow that takes the anatomical brain images and performs some simple anatomical image processing. The workflow itself can take a bit of time to run, depending on the power of your machine or cluster. For learning purposes, and to minimize the run time, you can run the workflow on one participant.
Hands on exercise:
Can you rerun the analysis in the simple workflow example?
Follow the README in the repo and rerun the analysis with the Docker example.
To ensure reproducibility, all data and metadata must be accessible, and preferably machine accessible.
A typical approach to describing research is to write a document that one shares with colleagues and collaborators. However, such information requires significant human resources to interpret and translate into code. An alternate approach is to encode the metadata using structured markup (e.g., RDF, JSON, XML). Often such markup can be standardized to provide machine accessibility.
In this example, the data and metadata are stored in a Google spreadsheet. The phenotypic information is stored as literals. The imaging data are stored as pointers to files in the NITRC XNAT repository. However, this particular example does not have any semantic or type information associated with the input file.
The column headers can be described in detail in a JSON document using JSON-LD, a format that supports semantic annotation. The annotation provides information about the data contained in each column and allows harmonizing the information with other similar tables. For example, the JSON-LD metadata key could tell us that the URLs correspond to anatomical T1-weighted images of the human brain and that the age of participants is in years.
Most datasets use PDFs or other human-readable documents. Using consistent and self-describing data structures makes the data more accessible.
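As a rough illustration of the kind of annotation described above, the following Python sketch builds a small JSON-LD-style document for two columns. The column identifiers (`age`, `t1_url`) and the `http://example.org/vocab#` vocabulary are hypothetical placeholders, not the names used in the actual spreadsheet or in any published vocabulary.

```python
import json

# Hypothetical vocabulary and column names, used only for illustration.
column_annotations = {
    "@context": {
        "ex": "http://example.org/vocab#",
        "description": "ex:description",
        "unit": "ex:unit",
        "datatype": "ex:datatype",
    },
    "@graph": [
        {
            "@id": "ex:column/age",
            "description": "Age of the participant",
            "unit": "years",
            "datatype": "float",
        },
        {
            "@id": "ex:column/t1_url",
            "description": "URL of an anatomical T1-weighted image of the human brain",
            "datatype": "URL",
        },
    ],
}


def describe(column_id, annotations=column_annotations):
    """Return the annotation record for a column, or None if it is not annotated."""
    for record in annotations["@graph"]:
        if record["@id"] == column_id:
            return record
    return None


# Serialize the key so it can be stored next to the spreadsheet.
print(json.dumps(column_annotations, indent=2))
```

With a key like this, a script (or a person) can look up `describe("ex:column/age")` and learn the unit without guessing from the column header.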
Lesson 2 covers different aspects of data annotation, harmonization, cleaning, storage, and sharing.
Hands on exercise:
What types of output files are created by the workflow?
There are four types of output files created:
What are some of the drawbacks of the form in which the input and output are represented?
The input Google spreadsheet does not provide a key to annotate each column. The output JSON file also does not describe its keys anywhere. These omissions make it difficult for a human or a machine to interpret the values, and they prevent harmonization of the data.
The second component of this example is a setup script and a Docker container that creates the necessary computational environment for analysis.
Problems with creating environments
- The default script assumes that you have access to certain software, such as bash and FSL, on your system. This means you have to run it on a Unix-like system such as Linux or macOS.
- All other software, Python libraries, and their dependencies are installed by the script. The script itself does not check whether these installs conflict with your existing software environment.
Alternatives that reproduce environments with minimal software dependencies are technologies like virtual machines (VirtualBox, VMware, NITRC CE), containers (Docker, Singularity), and installers (e.g., Vagrant, Packer). These can be very useful for replicating existing environments and therefore simplify the installation problem significantly. However, at present some of these technologies are not installed by default on the computing clusters you may have access to.
Lesson 3 covers container technologies and how to create, use, modify, and reuse containers.
What should a reproducible computational environment contain?
A reproducible computational environment must contain all the necessary data (any inputs or other internal software package data such as brain templates), environment variables, and software necessary for carrying out the ascribed computation. Ideally, such an environment itself should be reproducible.
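One lightweight complement to a container is to record what the environment actually contained when the analysis ran. The sketch below is a minimal example using only the Python standard library; the package names and environment variables passed in at the bottom are assumptions chosen for illustration, not a list taken from the simple workflow repository.

```python
import json
import os
import platform
import sys
from importlib import metadata


def describe_environment(packages, env_vars):
    """Collect OS, Python, package versions, and selected environment variables."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "packages": versions,
        "env": {name: os.environ.get(name) for name in env_vars},
    }


if __name__ == "__main__":
    # Illustrative choices; adjust to whatever your analysis depends on.
    snapshot = describe_environment(["nipype", "numpy"], ["FSLDIR", "PATH"])
    print(json.dumps(snapshot, indent=2))
```

Storing such a snapshot next to the outputs does not make the environment reproducible by itself, but it makes differences between two runs much easier to spot.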
Once the environment is set up, one can execute the analysis. Each time the analysis is run, the provenance of the workflow is captured and stored using a PROV model for workflows. All of this happens inside a single executable script. The script itself uses the Nipype dataflow framework to ensure a consistent representation of the execution graph, using Python as the dataflow language, and therefore benefits from all the advantages a dataflow framework brings to analyses.
Using dataflow technologies for analysis instead of shell scripts
There are many dataflow platforms out there. These typically enable a compact, abstract, graph-based representation of a dataflow, allowing reuse and consistent execution. They also make it possible to run the same dataflow in different computing environments, and they relieve the user of keeping track of complex data dependencies across nodes. While Nipype was used in this example, other brain imaging dataflow systems include Automated Analysis, PSOM, and FASTR.
Running the analysis is one part of reproducibility. It is important to also capture the output necessary for scientific hypothesis testing or exploration. In this example, the volumes of subcortical structures and of the different brain tissue classes are extracted and stored in a JSON document. A specific run of this workflow on a specific platform was used to create the provenance document and the expected output data. When another user runs this workflow, their output can be compared to the expected output.
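To make the idea of a graph-based dataflow concrete, here is a toy sketch in plain Python. It is not Nipype and does not use Nipype's API; it only shows the core idea that the user declares steps and their dependencies, and the framework works out the execution order. The step names and the list standing in for an image are made up for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_dataflow(nodes, dependencies, inputs):
    """Run each node once, after all the nodes it depends on.

    nodes: name -> function taking a dict of upstream results
    dependencies: name -> set of upstream node names
    inputs: name -> value for nodes that are plain inputs
    """
    results = dict(inputs)
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # an input that needs no computation
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = nodes[name](upstream)
    return results


# Toy stand-ins for image-processing steps; real steps would call tools such as FSL.
nodes = {
    "brain": lambda up: [v for v in up["image"] if v > 0],
    "volume": lambda up: len(up["brain"]),
    "mean": lambda up: sum(up["brain"]) / len(up["brain"]),
    "report": lambda up: {"volume": up["volume"], "mean": up["mean"]},
}
dependencies = {
    "brain": {"image"},
    "volume": {"brain"},
    "mean": {"brain"},
    "report": {"volume", "mean"},
}

results = run_dataflow(nodes, dependencies, {"image": [0, 2, 4, 0, 6]})
print(results["report"])  # {'volume': 3, 'mean': 4.0}
```

Because the dependencies are explicit, a real framework can also decide which steps may run in parallel or be skipped when their inputs have not changed.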
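The comparison against expected output can be sketched in a few lines of Python. This example assumes the output is a flat JSON object mapping structure names to volumes; the key names and numbers below are invented for illustration and are not results from the actual workflow, whose JSON layout may differ.

```python
import json
import math


def compare_outputs(expected, observed, rel_tol=1e-6):
    """Return keys whose values are missing or differ beyond the relative tolerance."""
    mismatches = {}
    for key, expected_value in expected.items():
        observed_value = observed.get(key)
        if observed_value is None or not math.isclose(
            expected_value, observed_value, rel_tol=rel_tol
        ):
            mismatches[key] = (expected_value, observed_value)
    return mismatches


# Made-up numbers standing in for the expected and newly computed outputs.
expected = json.loads('{"left_hippocampus": 4100.0, "gray_matter": 650000.0}')
observed = json.loads('{"left_hippocampus": 4100.0, "gray_matter": 651300.0}')

print(compare_outputs(expected, observed, rel_tol=1e-3))
# {'gray_matter': (650000.0, 651300.0)}
```

The tolerance is a scientific choice rather than a technical one: a looser `rel_tol` hides small numerical differences, while a strict one flags them.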
Lesson 4 covers data flow technologies, specifically how to create analysis pipelines and applications and capture provenance when running these pipelines.
What are some advantages of using dataflow technologies?
- Compact and structured representation of analysis
- Can be reused with minimal changes
- Many dataflow tools can be used across different environments.
- The tools take care of data management.
- Many dataflow tools support distributed execution of steps.
Once the data and environment are set up appropriately and the analysis is run, it would be good to know whether the same results, within some threshold, are obtained when a dataset containing similar data, or a similar workflow, is used. These comparisons can be carried out using continuous integration services, such as Travis, CircleCI, and Jenkins, which allow executing an analysis and performing a test comparison automatically as versions of data or software change.
Continuous integration testing
In typical brain imaging analyses there is a complex interaction between data, software, and scientific hypothesis testing results. Continuous integration services ensure that such results can be obtained consistently and provide a framework to evaluate when results diverge. While typically used for software testing, continuous integration has become essential for process management by creating a complete test cycle.
In brain imaging, published results rarely come with the data and code needed to retest the outcome when the software versions change or when a new dataset becomes available. The intent of this simple workflow framework is to move the community towards such comprehensive data preservation and testing integration.
Lesson 5 covers how to use continuous integration services like Travis and CircleCI, but also how container technologies can be used to run your own integration testing.
How does setting up continuous integration testing help with research?
During the lifetime of a project, the data and software may change. Setting up testing environments makes it possible to attribute changes in results to specific changes in data and software. Results that remain the same across different software and similar datasets are more generalizable than results that are specific to a particular dataset and a particular software environment.
It turns out that this workflow is not reproducible across different versions of software and operating systems. The observed inconsistencies (see Table 1) point to issues of randomization and/or initialization within the algorithms that are run. While it's easy to detect a deviation between executions in different environments, it is harder to determine the cause of the deviation. This is where rich provenance capture can help establish where along an execution graph an analysis diverged and help zero in on the possible culprits.
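The idea of using provenance to locate where two runs diverged can be sketched as follows. This is a simplified stand-in, not the PROV-based tooling used by the workflow: it assumes each run is available as a dictionary of per-step outputs, and the step names and values are hypothetical. A step is reported when its output differs between runs even though everything it depends on agrees.

```python
import hashlib
from graphlib import TopologicalSorter  # Python 3.9+


def digest(value):
    """Hash a node's output; repr() is a stand-in for hashing real output files."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


def first_divergent_nodes(dependencies, run_a, run_b):
    """Return nodes whose outputs differ although all of their upstream outputs agree."""
    differs = {}
    culprits = []
    for name in TopologicalSorter(dependencies).static_order():
        differs[name] = digest(run_a[name]) != digest(run_b[name])
        upstream_agrees = not any(differs[dep] for dep in dependencies.get(name, ()))
        if differs[name] and upstream_agrees:
            culprits.append(name)
    return culprits


# Hypothetical execution graph and per-step outputs from two environments.
dependencies = {
    "register": {"image"},
    "segment": {"register"},
    "volumes": {"segment"},
}
run_a = {"image": "sub-01.nii", "register": 1.00, "segment": 0.52, "volumes": 4100}
run_b = {"image": "sub-01.nii", "register": 1.00, "segment": 0.53, "volumes": 4180}

print(first_divergent_nodes(dependencies, run_a, run_b))  # ['segment']
```

Here the final volumes differ, but the sketch points at the segmentation step as the earliest place where the two runs stopped agreeing, which is where one would start looking for randomization or initialization effects.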
Lesson 6 covers details of how provenance can be captured.
Hands on exercise:
Submit only your JSON output and provenance files as a pull request.
- See an example here
- Using your Unix skills (find, tar) extract only the json files keeping the directory structure intact.
- Add the provenance files (trig, provn)
- Fork the simple_workflow repo
- Add your outputs to a new folder under other_outputs.
- Commit the changes, push to your repo, and send a pull-request.
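The exercise asks for `find` and `tar`, but the extraction step can also be sketched in Python if you prefer. The function below is one possible approach using only the standard library; the directory and archive names in the usage comment are hypothetical, so substitute the paths from your own run.

```python
import tarfile
from pathlib import Path


def archive_json_outputs(output_dir, archive_path):
    """Add every .json file under output_dir to a gzipped tar, keeping relative paths."""
    output_dir = Path(output_dir)
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(output_dir.rglob("*.json")):
            if not path.is_file():
                continue  # skip directories that happen to end in .json
            relative = path.relative_to(output_dir)
            # arcname keeps the directory structure relative to output_dir
            archive.add(path, arcname=relative.as_posix())
            added.append(relative.as_posix())
    return added


# Example usage with hypothetical paths:
# archive_json_outputs("output", "json_outputs.tar.gz")
```

Unpacking the resulting archive inside your new folder under other_outputs recreates the same directory layout with only the JSON files in it.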
What do the results indicate about neuroimaging software?
Researchers have to be careful about variations coming from numerical software. Software engineers have to test their software for numerical variation across different operating systems and software environments. The only way to scale this is using continuous integration approaches.
Reproducible analysis is technologically possible
Learning these technologies can help produce more reliable research output
Using such frameworks provides a better way to communicate information to colleagues and collaborators