Making History with Minus80 datasets

a month ago

Previous versions of Minus80 (up to v0.3.3) focused on adding persistence to biological datasets, especially objects such as gene networks or reference genomes that require a lot of computational power to create. This philosophy of "build once, reuse many" allows researchers to quickly use large, complex datasets in their everyday analyses. Like pulling a stock solution from the deep freeze, it takes just a moment to pull up a frozen dataset backed by Minus80 and get going with the analysis you want to perform.

This is great for mature or clear-cut datasets, but with many biological datasets, a lot of effort and iteration goes into getting a dataset to the point where persistence is useful. In other words, while persistence and reusability are the end goals, datasets first and foremost need to be stable and reproducible.

Functionality added to v1.0.0 of Minus80

Through ongoing support of a Mozilla Science Mini-Grant, we've taken big steps in adding features enabling minus80 to help researchers track, manage, and share core genetic datasets. A short summary of the change-log for v1.0.0 to date includes:

A sneak peek into our latest development version (v1.0.0-dev) showcases some of the latest features incorporated into Minus80.

Brand New CLI

In addition to major changes behind the scenes, Minus80 has a new, improved CLI. Let's check it out:

The new and improved minus80 CLI

The --help flag prints out the available commands. Assuming a fresh install, nothing should be tracked by Minus80 yet.

The minus80 list command (no data available)

Let's create a trackable dataset using the new Project data type.

Tracking data directories with Projects

Starting in v1.0.0, you can track your own custom data directories using minus80 Projects. The init command creates a project directory and registers it with minus80 for tracking. Here is a basic workflow:

Listing available minus80 datasets

First, we use the list command to see what's currently in Minus80. [Nothing here yet] indicates that no Minus80 datasets are available, so let's make one. The init command creates a new Minus80 Project by default, called foobar here. When we run the list command again, we can see an entry for foobar under the Project heading. We can also see that a new directory was made with the same name. Minus80 is now tracking the contents of this directory!

Version history through freezing tagged datasets

Any data in the foobar/ directory can now be indexed and tracked by Minus80. Note that Minus80 is not meant to manage your raw data; it is better suited to smaller, day-to-day, curated datasets. Suppose, for instance, you processed your raw RNA-Seq reads (perhaps hundreds of GB of data) into a (much smaller) gene expression matrix. This resultant dataset is a perfect candidate for Minus80, as most downstream analyses require a lot of pivoting, iterating, and analyzing.

Suppose you just read about a fancy new normalization method you'd like to try on your data. However, you want to make a checkpoint you can go back to in case things don't pan out. You can freeze the current state of your minus80 Project using the freeze command.

Example usage of the freeze command

Here we first run list to see the Project directory called foobar from above. We create a file called foobar/data.txt and put some data in it (1234). We then use the freeze command to create a snapshot in time and "tag" it with the string version_1. Let's take a closer look at the --help entry for the freeze command to better understand what is going on.

$ minus80 freeze --help                                               
Usage: minus80 freeze [OPTIONS] <slug>

  Freeze a minus80 dataset

Options:
  --help  Show this message and exit.
The help entry for the freeze command

The freeze command takes a single positional argument called <slug>. Since minus80 can track more than just Project objects, a more verbose notation is needed to let minus80 know what you'd like to freeze. The syntax is <dtype>.<name>:<tag>, where dtype is a data type supported by minus80 (Project in this case), name is the name of the dataset, and tag is a short, user-defined label that identifies what the snapshot contains. Here, we use the string "version_1", so the full slug is Project.foobar:version_1.
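To make the slug syntax concrete, here is a minimal Python sketch that splits a slug into its three parts. The Slug type and the parse_slug function are hypothetical names made up for illustration; they are not part of minus80's API, and minus80 may validate slugs differently.

```python
from typing import NamedTuple


class Slug(NamedTuple):
    dtype: str  # dataset type, e.g. "Project"
    name: str   # dataset name, e.g. "foobar"
    tag: str    # user-defined tag, e.g. "version_1"


def parse_slug(slug: str) -> Slug:
    """Split '<dtype>.<name>:<tag>' into its parts (hypothetical helper)."""
    # Everything after the first ':' is treated as the tag.
    head, colon, tag = slug.partition(":")
    if not colon or not tag:
        raise ValueError(f"missing ':<tag>' in slug: {slug!r}")
    # Everything before the first '.' is treated as the data type.
    dtype, dot, name = head.partition(".")
    if not dot or not dtype or not name:
        raise ValueError(f"expected '<dtype>.<name>' before the tag: {slug!r}")
    return Slug(dtype, name, tag)


print(parse_slug("Project.foobar:version_1"))
```

Splitting on the first ":" and the first "." is a design choice of this sketch; it keeps the parser small and rejects slugs that are missing any of the three parts.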

We can see what tags a dataset has using the --tags flag in the list command.

$ minus80 list --tags                                                 
Project
  └──foobar
   └──version_1 1a0e22dcdd (11:40AM - Sep 17, 2019)
Showing tags with the list command

Under the foobar heading is the version_1 tag, along with a checksum designator (1a0e22dcdd) and a timestamp.

Let's make some changes to our data.txt file, representing some sort of analysis (e.g. a new normalization technique).

Add some more data to data.txt and freeze the results

First, we cat the current data file showing 1234. Next we append the string abcd to the data file and freeze it with the tag version_2. We can see the updated tag using the minus80 list --tags command. In addition to the updated timestamp, we can see a different checksum digest.  
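The reason the digest changes can be shown with a small Python sketch. It assumes a SHA-256 hash over each file's relative path and bytes, truncated to 10 hex characters to match the length of the designator shown by list --tags. This is only an assumption for illustration: minus80 may compute its checksum differently, and directory_digest is a hypothetical helper, not a minus80 function.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_digest(root: Path, length: int = 10) -> str:
    """Return a short hex digest covering every file under root.

    Assumption: SHA-256 over relative paths and file bytes, truncated to
    `length` characters. minus80's real checksum may be computed differently.
    """
    hasher = hashlib.sha256()
    # Sort the files so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        hasher.update(path.relative_to(root).as_posix().encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(path.read_bytes())
    return hasher.hexdigest()[:length]


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "foobar"
    project.mkdir()
    data = project / "data.txt"

    # State corresponding to version_1: the file holds only 1234.
    data.write_text("1234\n")
    digest_v1 = directory_digest(project)

    # State corresponding to version_2: abcd has been appended.
    with data.open("a") as handle:
        handle.write("abcd\n")
    digest_v2 = directory_digest(project)

    print(digest_v1, digest_v2, digest_v1 != digest_v2)
```

Because the digest depends only on the directory's contents, any edit to data.txt produces a different designator, which is what makes the two tags distinguishable at a glance.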

Suppose you were not happy with this new modification and wanted to revert back to the way things were in version_1. We can do just that using the thaw command!

Using the thaw command to revert to previously tagged datasets.

Here, we can again see that our data.txt file contains both 1234 and abcd. We can thaw our previous version, which reverts the data files to the state they were in when that version was frozen. Like the freeze command, thaw takes a slug of the form <dtype>.<name>:<tag>, in this case Project.foobar:version_1, since we want to go back to version_1.

We see the SUCCESS! output indicating our operation worked, and when we look at what is inside data.txt we see only 1234.  We're making history!
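The freeze and thaw round trip above can be mimicked with a toy Python sketch that copies the project directory into a snapshot folder and later restores it. This is not how minus80 is implemented; the freeze and thaw functions below, the snapshots directory, and the choice to reject a reused tag are all assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def freeze(project: Path, store: Path, tag: str) -> None:
    """Copy the project's current contents into store/<tag> (toy snapshot)."""
    snapshot = store / tag
    if snapshot.exists():
        raise FileExistsError(f"tag already exists: {tag}")
    shutil.copytree(project, snapshot)


def thaw(project: Path, store: Path, tag: str) -> None:
    """Replace the project's contents with the snapshot stored under tag."""
    snapshot = store / tag
    if not snapshot.is_dir():
        raise FileNotFoundError(f"no such tag: {tag}")
    shutil.rmtree(project)
    shutil.copytree(snapshot, project)


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "foobar"
    store = Path(tmp) / "snapshots"
    project.mkdir()
    store.mkdir()
    data = project / "data.txt"

    # Freeze the initial data as version_1.
    data.write_text("1234\n")
    freeze(project, store, "version_1")

    # Append more data and freeze the result as version_2.
    with data.open("a") as handle:
        handle.write("abcd\n")
    freeze(project, store, "version_2")

    # Revert to version_1; data.txt holds only 1234 again.
    thaw(project, store, "version_1")
    restored = data.read_text()
    print(restored)
```

Copying whole directories is the simplest possible snapshot strategy and is only practical for small datasets, which fits the advice above to keep raw data out of a tracked project.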

What's next?

Data becomes exceedingly useful when it can be shared, whether with yourself (perhaps on a new computer), with colleagues, or with the rest of the world.

Similar to the relationship between git and GitHub or Docker and Docker Hub, the next step for minus80 will be connecting it to the cloud, where you can push and pull your datasets or share them with friends via a URL. Some of this functionality is already available in the v1.0.0-dev version of Minus80, but the details will have to wait for another blog update. Check out the rest of the development on GitHub or connect on Twitter.

Rob Schaefer

Published a month ago