Compute
On the computational infrastructure, install django-crunch so you can use the crunch client.
Environment Variables
The crunch client needs to know how to access both the cloud server and the data storage. These settings can be provided with the following environment variables:
CRUNCH_URL
The URL of the cloud server. This can also be set with the command-line option --url URL.
CRUNCH_TOKEN
The access token for a user on the cloud server. This can be found in the admin section of the cloud server. This variable can also be given to the client with the command-line option --token TOKEN.
CRUNCH_STORAGE_SETTINGS
The path to a file containing the parameters needed to instantiate the Django file storage class, exactly as it is configured on the cloud server. This file can be in TOML or JSON format, and the values should match those in the cloud server's Django settings. This variable can also be given to the client with the command-line option --storage-settings PATH/TO/SETTINGS.
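As a purely hypothetical illustration, a storage settings file for an Amazon S3 backend provided by django-storages might look like the following in TOML. The keys are determined entirely by the storage class configured on the cloud server, so yours will differ:

```toml
# Hypothetical example assuming an S3 backend via django-storages.
# Every key and value must mirror the corresponding storage setting
# in the cloud server's Django settings.
AWS_ACCESS_KEY_ID = "your-access-key-id"
AWS_SECRET_ACCESS_KEY = "your-secret-access-key"
AWS_STORAGE_BUCKET_NAME = "your-bucket-name"
```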
If any of these values are not provided as command-line options or environment variables, the client will prompt for them.
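For example, the three variables can be exported in the shell before running the client. All values below are placeholders to be replaced with your own server URL, token and settings path:

```shell
# Placeholder values only; substitute your own server details.
export CRUNCH_URL="https://crunch.example.com"
export CRUNCH_TOKEN="your-access-token"
export CRUNCH_STORAGE_SETTINGS="$HOME/crunch/storage-settings.toml"
```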
Client Commands
To process a particular dataset, you can use the following command:
crunch run DATASET-SLUG
To simply process the next dataset on the cloud server:
crunch next
To process the next dataset for a specific project:
crunch next --project PROJECT-SLUG
To process all the remaining datasets on the cloud server:
crunch loop
To process all the remaining datasets for a particular project:
crunch loop --project PROJECT-SLUG
Other command-line options are available. See the command-line reference in the documentation or the crunch client's help listings, e.g.
crunch --help
Stages
Processing a dataset goes through three stages: Setup, Workflow and Upload.
Each of these stages will send status updates to the cloud server.
The status updates are for one of the following three states: Start, Success or Fail.
A typical successful processing run for a dataset will send 6 status updates: Setup Start, Setup Success, Workflow Start, Workflow Success, Upload Start, Upload Success.
If any of the stages fail, then a Fail status will be sent and the processing job will stop.
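The lifecycle above can be sketched as a simple sequence of (stage, state) pairs. This is illustrative only; the names mirror the stages and states described here, not crunch's internal API:

```python
STAGES = ["Setup", "Workflow", "Upload"]

def status_updates(fail_at=None):
    """Sequence of (stage, state) updates a run sends; stops after the first Fail."""
    updates = []
    for stage in STAGES:
        updates.append((stage, "Start"))
        if stage == fail_at:
            updates.append((stage, "Fail"))
            break
        updates.append((stage, "Success"))
    return updates

# A fully successful run produces the six updates described above.
```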
Setup
This involves:
- Copying the initial data from storage
- Saving the MD5 checksums for all the initial data in .crunch/setup_md5_checksums.json
- Saving the metadata for the dataset in .crunch/dataset.json
- Saving the metadata for the project in .crunch/project.json
- Creating the script to run the workflow (either a bash script or a Snakefile for Snakemake)
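The checksum step can be pictured with a short sketch using Python's standard library. The exact layout of the files crunch writes may differ; this only shows the idea of hashing every file and keying the result by relative path:

```python
import hashlib
import json
from pathlib import Path

def md5_checksums(directory):
    """Return a dict mapping each file's relative path to its MD5 hex digest."""
    checksums = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            relative = path.relative_to(directory).as_posix()
            checksums[relative] = hashlib.md5(path.read_bytes()).hexdigest()
    return checksums

# Written in the same spirit as .crunch/setup_md5_checksums.json, e.g.:
# Path(".crunch/setup_md5_checksums.json").write_text(
#     json.dumps(md5_checksums("."), indent=2)
# )
```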
Workflow
Runs the workflow on a dataset that has been set up.
This involves running a bash script as a subprocess or running Snakemake with a Snakefile.
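The bash case amounts to running the generated script as a subprocess. A minimal sketch, where the inline echo stands in for the workflow script that setup generates:

```python
import subprocess

# The inline command is a stand-in for the workflow script generated at setup.
result = subprocess.run(
    ["bash", "-c", "echo workflow complete"],
    capture_output=True,
    text=True,
)

# A non-zero return code is the kind of failure that would trigger
# a Workflow Fail status and stop the processing job.
print(result.returncode, result.stdout.strip())
```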
Upload
Uploads new or modified files to the storage for the dataset.
It also creates the following files:
- .crunch/upload_md5_checksums.json, which lists the MD5 checksums for all files after the dataset has finished processing.
- .crunch/deleted.txt, which lists all files that were present after setup but were deleted as the workflow ran.
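Conceptually, the deleted-file listing is a set difference between the files recorded in the setup checksums and those still present after the workflow. A hypothetical sketch:

```python
def deleted_files(setup_checksums, current_files):
    """Relative paths recorded at setup that no longer exist after the workflow."""
    return sorted(set(setup_checksums) - set(current_files))

# e.g. the contents of .crunch/deleted.txt could be derived as:
# deleted_files(setup_checksums, paths_on_disk_after_workflow)
```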
Note
Currently crunch does not delete files from the remote storage if they were deleted during the workflow; they are only listed in .crunch/deleted.txt.
Warning
The Django storage class being used may not overwrite modified files but may instead change their names slightly. For this reason, it is best not to modify existing files in the workflow but to create new files instead. This behaviour may change in future versions of crunch.