Previous versions of Minus80 (up to v0.3.3) focused on incorporating persistence to biological datasets, especially instances such as gene networks or reference genomes that require a lot of computational power to create. This philosophy of "build once, reuse many" allows researchers to quickly use large, complex datasets in their everyday analyses. Like pulling stock solution from the deep freeze, it takes just a moment to get going with the analysis you want to perform. Similarly, with datasets backed by Minus80, it's easy to pull up a frozen instance and start analyzing your data.
This is great for mature or clear-cut datasets, but many biological datasets take a lot of effort and many iterations to reach a point where persistence is useful. In other words, while persistence and reusability are the end goals, datasets first and foremost need to be stable and reproducible.
Through ongoing support of a Mozilla Science Mini-Grant, we've taken big steps in adding features that enable minus80 to help researchers track, manage, and share core genetic datasets. A short summary of the change-log for v1.0.0 to date includes:
- Updated command line interface
- Simpler data management and tracking through Projects
- Version history through freezing tagged datasets
A sneak peek into our latest development version (v1.0.0-dev) showcases some of the latest developments incorporated into Minus80.
Brand New CLI
In addition to major changes behind the scenes, Minus80 has a new, improved CLI. Let's check it out:
The --help flag prints out the available commands. Assuming a fresh install, nothing should be tracked by Minus80 yet.
Let's create a trackable dataset using the new Project data type.
Tracking data directories with Projects
In v1.0.0, you can track your own custom data directories using minus80 Projects. Using the init command, a project directory is created and then tracked by minus80. Here is a basic workflow:
First, we use the list command to see what's currently in Minus80.
[Nothing here yet] indicates no available Minus80 datasets, so let's make one. The init command defaults to a new Minus80 Project, named foobar here. When we list the projects again using the list command, we can see an entry for foobar under the Project heading. We can also see that a new directory was made with the same name. Minus80 is now tracking the contents of this directory!
Version history through freezing tagged datasets
Any data in the foobar/ directory can now be indexed and tracked by Minus80. Note that Minus80 is not meant to manage your raw data; it is better suited to smaller, day-to-day, curated datasets. Suppose, for instance, you processed your raw RNA-Seq reads (perhaps hundreds of GB of data), resulting in a (much smaller) gene expression matrix. This resultant dataset is a perfect candidate for minus80, as most downstream analyses require a lot of pivoting, iterating, and analyzing.
Suppose you just read about a fancy new normalization method you'd like to try on your data. However, you want to make a checkpoint you can go back to in case things don't pan out. You can freeze the current state of your minus80 Project using the freeze command.
Here we first list to see the Project directory called foobar from above. We create a file called foobar/data.txt and put some data in it (1234). We use the freeze command to create a snapshot in time and "tag" it with the string version_1. Let's take a closer look at the --help entry for the freeze command to better understand what is going on.
The freeze command takes a single positional argument called <slug>. Since minus80 can track more than just Project objects, a more verbose notation is needed to let minus80 know what you'd like to freeze. The syntax is <dtype>.<name>:<tag>, where dtype is the data type supported by minus80 (Project in this case), name is the name of the dataset, and tag is a short, user-defined tag used to differentiate what the snapshot contains. Here, we use the string "version_1".
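To make the slug notation concrete, here is a minimal Python sketch that splits a slug into its three parts. The Slug type and parse_slug helper are hypothetical names made up for illustration; they are not part of minus80, and the real CLI may validate slugs differently.

```python
from typing import NamedTuple


class Slug(NamedTuple):
    dtype: str  # data type, e.g. "Project"
    name: str   # dataset name, e.g. "foobar"
    tag: str    # short user-defined tag, e.g. "version_1"


def parse_slug(slug: str) -> Slug:
    """Split '<dtype>.<name>:<tag>' into its parts (hypothetical helper)."""
    # Everything after the first ':' is the tag.
    head, colon, tag = slug.partition(":")
    # Everything before the first '.' is the data type.
    dtype, dot, name = head.partition(".")
    if not (colon and dot and dtype and name and tag):
        raise ValueError(f"expected <dtype>.<name>:<tag>, got {slug!r}")
    return Slug(dtype, name, tag)


print(parse_slug("Project.foobar:version_1"))
# prints: Slug(dtype='Project', name='foobar', tag='version_1')
```

Putting the data type first means the same notation could address other kinds of datasets without changing the command itself.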
We can see what tags a dataset has using the --tags flag in the list command. Under the foobar heading is a tag, version_1, along with a checksum designator (1a0e22dcdd) and a timestamp.
Let's make some changes to our data.txt file, representing some sort of analysis (e.g. a new normalization technique).
We cat the current data file, showing 1234. Next we append the string abcd to the data file and freeze it with the tag version_2. We can see the updated tag using the minus80 list --tags command. In addition to the updated timestamp, we can see a different checksum digest.
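The reason the digest changes can be illustrated with a short Python sketch that hashes a directory's contents before and after an edit. This is only an illustration of content checksums in general: the directory_digest helper is hypothetical, and the choice of SHA-256 and a 10-character digest are assumptions, since the post does not say how minus80 computes its checksum.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_digest(root: Path, length: int = 10) -> str:
    """Hash each file's relative path and bytes; return a short hex digest."""
    hasher = hashlib.sha256()  # assumed algorithm, not necessarily minus80's
    # Sort the files so the digest does not depend on traversal order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        hasher.update(path.relative_to(root).as_posix().encode())
        hasher.update(path.read_bytes())
    return hasher.hexdigest()[:length]


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "foobar"
    project.mkdir()
    data = project / "data.txt"
    data.write_text("1234\n")
    before = directory_digest(project)

    # Append to the file, mirroring the edit made in the walkthrough.
    with data.open("a") as fh:
        fh.write("abcd\n")
    after = directory_digest(project)

    print(before, after, before != after)
```

Because the digest depends only on file contents, two snapshots with identical data would share a checksum even if their timestamps differ.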
Suppose you were not happy with this new modification and wanted to revert to the way things were in version_1. We can do just that using the thaw command.
Here, we can again see that our data.txt file contains both 1234 and abcd. We can thaw our previous version, which reverts the data files to the state they were in when it was frozen. Like the freeze command, thaw takes in a slug of the form <dtype>.<name>:<tag>, in this case Project.foobar:version_1, since we want to go back to version_1. We see the SUCCESS! output indicating our operation worked, and when we look at what is inside data.txt we see only 1234. We're making history!
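The freeze and thaw round trip above can be mimicked in a few lines of Python. This is a rough sketch of the idea under stated assumptions, not how minus80 is implemented: the freeze and thaw functions below are hypothetical, and storing each tag as a full copy in a snapshots directory is an assumed layout chosen for simplicity.

```python
import shutil
import tempfile
from pathlib import Path


def freeze(project: Path, store: Path, tag: str) -> None:
    """Copy the project's current contents to store/<tag> (assumed layout)."""
    target = store / tag
    if target.exists():
        shutil.rmtree(target)  # re-freezing a tag overwrites the old snapshot
    shutil.copytree(project, target)


def thaw(project: Path, store: Path, tag: str) -> None:
    """Replace the project's contents with the snapshot saved under the tag."""
    source = store / tag
    if not source.is_dir():
        raise KeyError(f"no snapshot tagged {tag!r}")
    shutil.rmtree(project)
    shutil.copytree(source, project)


with tempfile.TemporaryDirectory() as tmp:
    project, store = Path(tmp) / "foobar", Path(tmp) / "snapshots"
    project.mkdir()
    data = project / "data.txt"

    data.write_text("1234\n")
    freeze(project, store, "version_1")

    with data.open("a") as fh:
        fh.write("abcd\n")
    freeze(project, store, "version_2")

    # Go back to the first checkpoint; the appended line disappears.
    thaw(project, store, "version_1")
    restored = data.read_text()
    print(restored)
```

Keeping version_2 in the store after thawing means the newer state is not lost and could be restored later in the same way.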
Data becomes exceedingly useful when it can be shared, whether with yourself (perhaps on a new computer), with colleagues, or with the rest of the world.
Similar to the relationship between
dockerhub data, next steps for minus80 will be connecting it to the cloud, where you can pull your datasets or share them with friends via a URL. We currently have some functionality available in the v1.0.0-dev version of Minus80; however, the contents of that will be for another blog update. Check out the rest of the development on GitHub or connect on Twitter.