Storing big biological data: build once, reuse many.


Big data is a hot topic right now, especially in the biological sciences. Despite the lack of a precise definition of what counts as "big," we can safely assume that many biological datasets fit the category.

Also, data doesn't necessarily need to be big to be cumbersome. When talking about data analysis, I like to use the analogy of moving furniture. It is not a hard thing to describe: the couch needs to be moved up two flights of stairs and into the living room. I'm not sure about you, but when the couch finally does clear the door frame (by millimeters, and only in a certain orientation) I immediately wonder who, if anyone, enforces standardized door sizes, and whether couch engineers mathematically model couch dimensions before manufacturing them. Perhaps, also like me, you don't have the foresight to consider whether your couch will fit into your apartment before signing a lease and loading it into the moving truck. It turns out this isn't an uncommon problem, especially when moving newer couches into older apartments[1].

Moving couches is a "do once, reuse many" problem. You go through the stress and pain of hauling that giant piece of furniture to your living space one time in order to lounge and relax on it, hopefully, many times. As scientists, we often sign the lease on our data before we have to move it in. When you finally have to process half a terabyte of DNA sequence reads, it can feel like you are the only one trying to haul a king-sized mattress up ten flights of stairs. However, with enough person power and the right tools[2], moving furniture and analyzing data can both be made much more convenient.

Furniture and Freezers

How about another analogy? Bear with me. Let's talk data. Bases, not bits. Not furniture, but freezers. Most labs have a minus 80°C freezer, and despite what a fancy database is capable of, it pales in comparison to the amount of data stored in a minus 80.

Raw samples are prepped into working stocks and stored in the freezer. When an analysis needs to be done, samples are generated from the frozen stock and the assay is carried out. The same concept should apply to digital data analysis. Raw data is processed and "frozen". When an analysis needs to be done, data are generated from the frozen data and the analysis is run.
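To make the pattern concrete, here is a minimal sketch of the freeze/thaw idea using nothing but the Python standard library. This is an illustration of the workflow (all names are made up), not minus80's actual implementation:

```python
import os
import pickle
import tempfile

# "Raw" data gets processed once (imagine this step is expensive)...
raw_reads = ["ACGT", "TTGA", "GGCA"]
processed = {seq: seq.count("G") + seq.count("C") for seq in raw_reads}

# ...then frozen to disk, so the expensive step happens one time.
frozen_path = os.path.join(tempfile.mkdtemp(), "gc_counts.pkl")
with open(frozen_path, "wb") as f:
    pickle.dump(processed, f)

# Later analyses thaw the frozen stock instead of re-processing raw data.
with open(frozen_path, "rb") as f:
    stock = pickle.load(f)
```

The point is the shape of the workflow: process once, freeze, and let every downstream analysis start from the frozen stock.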

Minus80 is a Python library built around this analogy: raw data is processed into a frozen working stock, and data objects are pulled from that stock to be used in an analysis.

Minus80 implements a few key concepts around freezing and thawing data. Here is a diagram showing some of these relationships:

Multiple sources of raw data go into a frozen dataset. From this, two different data analysis datasets are built. Manipulating data from the data-analysis stage doesn't influence the frozen dataset it was built from (you don't normally add a solution back to a stock). However, the output from the data analysis can be frozen itself.
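The data flow in the diagram can be sketched in plain Python (an illustration of the concept, not minus80 code): thawing produces an independent copy, so mutating an analysis dataset leaves the frozen stock untouched.

```python
import copy

# One frozen dataset, built from multiple raw sources.
frozen = {"geno": [0, 1, 2], "pheno": [5.1, 4.9, 6.0]}

# Two analysis datasets thawed from the same frozen stock.
analysis_a = copy.deepcopy(frozen)
analysis_b = copy.deepcopy(frozen)

# Manipulating an analysis dataset doesn't touch the frozen stock
# (you don't normally add a solution back to a stock)...
analysis_a["geno"].append(1)

# ...but the output of an analysis can itself be frozen as a new dataset.
frozen_output = copy.deepcopy(analysis_a)
```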

Minus80 comes with two main components. First, it implements the machinery necessary to freeze data; this will be covered in a different post. Second, minus80 comes with two Python objects that utilize this functionality. Let's take a look at those here.

Accessions and Cohorts

An accession is an experimental entry for a sample along with metadata about its collection. The term distinguishes between "samples" in cases where collections differ in time or space. For example, an experiment containing a single sample measured over the course of ten time points yields ten experimental accessions. Likewise, an experiment containing a single sample but ten different tissues (or other spatially differentiated components) would again contain ten experimental accessions.
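One way to picture this bookkeeping: an accession is roughly a (sample, context) pair. A tiny sketch, with hypothetical names:

```python
from itertools import product

# Hypothetical experiment: one sample, collected at ten time points.
samples = ["Sample1"]
timepoints = [f"t{i}" for i in range(1, 11)]

# Each (sample, timepoint) pair is its own accession.
accessions = [f"{s}_{t}" for s, t in product(samples, timepoints)]
```

One sample, ten contexts, ten accessions.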

An Accession object comes ready to use in minus80, and its function signature is straightforward. Start up IPython:

import minus80 as m80
m80.Accession?

Init signature: m80.Accession(name, files=None, **kwargs)
From google: Definition (noun): a new item added to an existing collection of books, paintings, or artifacts.  

An Accession is an item that exists in an experimental collection. 

Most of the time an accession is interchangeable with a *sample*. However,
the term sample can become confusing when an experiment has multiple
samplings from the same sample, e.g. timecourse or different tissues. 

Init docstring:
Create a new accession.

name : str
    The name of the accession
files : iterable of str
    Files associated with the accession
**kwargs : keyword arguments
    Any number of key=value arguments that
    contain metadata.

An accession object

We just need a name for the accession. Optionally, we can provide a list of data files and any number of key-value pairs of metadata. Let's create three accessions from a hypothetical time course dataset:

t1 = m80.Accession('Sample1_t1',files=['raw_t1.txt'],time='t1')
t2 = m80.Accession('Sample1_t2',files=['raw_t2.txt'],time='t2')
t3 = m80.Accession('Sample1_t3',files=['raw_t3.txt'],time='t3')

Now that we are happy with our small set of accessions, let's freeze them into a Cohort. This preserves the state of the Accessions and gives them a specific context by assigning them a name. Let's call this group of accessions "Sample1_Timecourse". Instances of Accession objects do not persist across Python sessions; instances of Cohorts can be reused many times across different sessions. Accessions are "use once" objects; Cohorts are "reuse many" objects.

The Cohort object has a method to create a new Cohort based on a collection of Accessions:


m80.Cohort.from_accessions?

Signature: m80.Cohort.from_accessions(name, accessions)
Create a Cohort from an iterable of Accessions.

name : str
    The name of the Cohort
accessions : iterable of Accessions
    The accessions that will be frozen in the cohort
    under the given name

A Cohort object

Let's create the Cohort object:

c = m80.Cohort.from_accessions('Sample1_Timecourse',[t1,t2,t3])

Exiting our IPython session will cause us to lose the Accession objects, but since the Cohort is freezable, we can recover them in the next session.

# A new IPython session
import minus80 as m80
c = m80.Cohort('Sample1_Timecourse')

Accessions can be recreated by indexing the cohort by Accession name:

t1 = c['Sample1_t1']
# Output:
Accession(Sample1_t1,files=['raw_t1.txt'],{'time': 't1', 'AID': 1})

Note: Accessions created from the Cohort object don't influence the Cohort they came from. But, as the data model above shows, new Cohorts can be created from Accession objects that come from other Cohorts. For instance:

t1 = c['Sample1_t1']
t2 = c['Sample1_t2']
# Create a new Cohort from Accessions pulled from c
c2 = m80.Cohort.from_accessions('Sample1_subset',[t1,t2])

To learn more about other functionality Minus80 offers, check out the docs here.


The minus80 Python library implements functionality to freeze and unfreeze biological data, mimicking the way biological samples are stored in minus 80°C freezers. Access to the data follows a "build once, reuse many times" philosophy. Approaching these problems pragmatically, with the right tools, will hopefully help you feel like you aren't just pushing furniture around.

Header Photo Credit: Igor Ovsyannykov on Unsplash

  1. And you can make a nice profit from being a good couch mover ↩︎

  2. This is a game changer ↩︎

Rob Schaefer

Published 5 years ago