
Freezing Python objects with Minus80


5 years ago


The LinkageIO package minus80 [https://github.com/LinkageIO/Minus80] has two faces. The first provides several Python objects (Accession and Cohort) for storing data and metadata for biological samples. You can read more about those objects in this blog post and, more formally, in the minus80 documentation. While these objects are useful, they only scratch the surface of what is possible using minus80.

The other side of minus80 is the Freezable Python class, which powers the persistence functionality used by the Cohort object. The idea closely mimics the utility of an actual -80 freezer. The Accession and Cohort objects included with minus80 showcase this freezable functionality. An Accession object represents a single biological sample. A group of them is frozen together using a Cohort object, which is just a collection of Accession objects. Cohorts persist across Python sessions: they live in a frozen state.

Frozen Cohort objects can be used to produce Accession objects, which can then be used in an analysis. As with a bench-top experiment, you could technically prep your samples from scratch each time you wanted to perform an assay or an analysis. More likely, you have a master mix or have prepped your samples in a form that can be stored in the freezer. From this concoction, you can produce individual, single-use samples for an analysis.

OK. What does this look like in the lab? Suppose you collected some samples from the field. Your protocol depends on what type of experiment you are running. It's different for DNA than it is for RNA, but the core concept remains the same: isolate the product you need from the sample, and stabilize it so that you can freeze it. While you do this, you also organize it to make it more convenient. Perhaps you plan on performing both DNA and RNA experiments from the raw tissue you collected, so your freezer has multiple tubes from the same sample in it. They exist in different contexts and serve different purposes.

Maybe you've run into a similar bioinformatics workflow. You are going to sequence a genome, so you generate a bunch of short reads. You submit the sequence and get a fastq file back. You map the reads to a reference genome to produce a bam file. Maybe you run a variant calling program to produce a vcf file. Each of these output files represents a natural frozen time point. You did a bunch of things to the fastq file(s) to make a bam file. Storing your data in a bam or vcf file is space efficient, but it is still cumbersome to access those data stores, especially through Python.

minus80 allows you to control how these files are frozen and gives you a quick interface so you can slice and dice these data up for an analysis or to make a plot. So, what makes a Cohort freezable and not an Accession? What makes anything freezable?

Underneath the hood

The code that creates an Accession or Cohort object is written as a Python class. Classes are collections of code that dictate an object's data and provide a variety of methods (or functions) used to interact with that data. They also come with some built-in methods for creating new instances of that object type. Programming based on classes allows for a paradigm called inheritance, where you can create a new, specialized class from a broader, more general parent class. Since this is not a tutorial on class structure, we're going to assume that you have a basic familiarity with how classes work and the basics of object oriented programming[1] (OOP).

In addition to the usual suspects when it comes to OOP, there are some design patterns that allow for some very interesting relationships to be built. Abstract Base Classes (ABCs) are classes that are only meant to be inherited from, and are never meant to be instantiated themselves. If you've programmed in Python before, you've almost certainly run into these abstract data types. Within Python there is a ton of vocabulary that revolves around the idea of being iterable, yet you never create an instance of iterable itself. Instead, iterable is an abstract base that other objects inherit from in order to ensure they act a certain way. It's a guarantee about the properties of an object. If an object inherits from iterable, you can expect it to have the properties and the methods of an iterable object.
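The abstract base class idea can be sketched with the standard library's abc module. This is a minimal illustration, not code from minus80: the Sample and LeafSample classes are hypothetical names made up for the example, while Iterable is the real abstract base from collections.abc.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable


class Sample(ABC):
    """Hypothetical abstract base: every subclass must provide describe()."""

    @abstractmethod
    def describe(self):
        ...


class LeafSample(Sample):
    """A concrete subclass that fulfils the contract set by Sample."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"leaf tissue from {self.name}"


# Built-in containers satisfy the Iterable ABC, yet nobody instantiates Iterable.
list_is_iterable = isinstance([1, 2, 3], Iterable)
int_is_iterable = isinstance(42, Iterable)

# Trying to instantiate the abstract base directly raises a TypeError.
try:
    Sample()
    abstract_instantiated = True
except TypeError:
    abstract_instantiated = False

leaf = LeafSample("plant_01")
print(list_is_iterable, int_is_iterable, abstract_instantiated, leaf.describe())
```

The subclass gets to be used anywhere a Sample is expected, and Python refuses to build a bare Sample because its required method is missing.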

Similarly, a freezable object inherits from the Freezable abstract base class. In order to inherit from the Freezable ABC, the interface requires that the object have a name as well as a datatype (which is customizable, but defaults to the object's class). Also, when an object becomes freezable, it gets access to a slew of convenience methods and to the underlying minus80 infrastructure. This includes access to a relational database (sqlite) as well as methods to store columnar data using bcolz and a rudimentary key/value storage system (backed by sqlite). Being freezable also gives you some convenience methods for creating temp files. Read in depth about these methods in the minus80 docs.
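To make the shape of that interface concrete, here is a toy base class that gives subclasses a name, a datatype, and a sqlite-backed key/value store. It is only a sketch under stated assumptions: ToyFreezable, Experiment, set_value, get_value, and the file naming scheme are all invented for this example and are not the minus80 API.

```python
import sqlite3
import tempfile
from abc import ABC
from pathlib import Path


class ToyFreezable(ABC):
    """Illustrative stand-in; NOT the minus80 Freezable class."""

    def __init__(self, name, basedir, dtype=None):
        self.name = name
        # The datatype is customizable but defaults to the class name.
        self.dtype = dtype or type(self).__name__
        # Hypothetical layout: one sqlite file per (datatype, name) pair.
        db_path = Path(basedir) / f"{self.dtype}.{self.name}.db"
        self._db = sqlite3.connect(str(db_path))
        with self._db:
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, val TEXT)"
            )

    def set_value(self, key, val):
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT OR REPLACE INTO kv (key, val) VALUES (?, ?)", (key, val)
            )

    def get_value(self, key, default=None):
        row = self._db.execute(
            "SELECT val FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else row[0]

    def close(self):
        self._db.close()


class Experiment(ToyFreezable):
    """Subclasses inherit the storage helpers without writing any SQL."""


basedir = tempfile.mkdtemp()
exp = Experiment("maize_2019", basedir)
exp.set_value("tissue", "root")
stored = exp.get_value("tissue")
missing = exp.get_value("genotype", default="unknown")
db_files = sorted(p.name for p in Path(basedir).iterdir())
exp.close()
print(exp.dtype, stored, missing)
```

The point of the design is that Experiment contains no storage code at all; everything it needs arrives through inheritance.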

The act of freezing is not automatic. Packages like pickle exist to create persistent objects, but this process is pretty opaque. By default, you have little influence on what gets pickled; it is just in and out. Minus80 doesn't do any of the packaging for you: data must be frozen when the data is processed. The easiest way to handle this is with @classmethods, which are methods bound to the class rather than to an instance and are commonly used as alternate constructors that create objects from different data sources. Let's look at the code:

class Cohort(Freezable):
    '''
        A Cohort is a named set of accessions. Once cohorts are
        created, they are persistent as they are stored on
        disk by minus80.
    '''

    def __init__(self,name):
        super().__init__(name)
        self.name = name
        self._initialize_tables()
        
    def add_accession(self,accession):
        '''
            Add a sample to the Database
        '''
        with self.bulk_transaction() as cur:
            # When a name is added, it is automatically assigned an ID 
            cur.execute(''' 
                INSERT OR IGNORE INTO accessions (name) VALUES (?)
            ''',(accession.name,))
            # Fetch that ID
            AID = self._get_AID(accession)
            # Populate the metadata and files tables
            cur.executemany('''
                INSERT OR REPLACE INTO metadata (AID,key,val) 
                VALUES (?,?,?)
            ''',((AID,k,v) for k,v in accession.metadata.items())
            )
            cur.executemany('''
                INSERT OR REPLACE INTO files (AID,path) VALUES (?,?)
            ''',((AID,file) for file in accession.files)
            )
        return self[accession]
        
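    @classmethod
    def from_accessions(cls,name,accessions):
        # Alternate constructor: build a Cohort and populate it in one call.
        # add_accessions (plural) is not shown in this excerpt.
        self = cls(name)
        self.add_accessions(accessions)
        return self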

Examining the source code for the Cohort class, you can see the class definition inherits from Freezable, which is defined in its own file. From the __init__ method, we can see that when a Cohort is created, it is initially empty. Accessions can be added to the Cohort using either the add_accession method or the from_accessions class method, which uses the same code under the hood.
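The alternate-constructor pattern behind from_accessions can be shown in isolation. The sketch below is a hypothetical SampleSet class (not part of minus80) with two class methods, each building the same kind of object from a different data source.

```python
import csv
import io


class SampleSet:
    """Hypothetical container used only to illustrate @classmethod constructors."""

    def __init__(self, name):
        # Like Cohort.__init__, a new object starts out empty.
        self.name = name
        self.samples = {}

    def add(self, sample_name, metadata):
        self.samples[sample_name] = dict(metadata)

    @classmethod
    def from_dict(cls, name, records):
        """Build a SampleSet from an in-memory mapping."""
        self = cls(name)
        for sample_name, metadata in records.items():
            self.add(sample_name, metadata)
        return self

    @classmethod
    def from_csv(cls, name, handle):
        """Build a SampleSet from an open CSV file with a 'name' column."""
        self = cls(name)
        for row in csv.DictReader(handle):
            sample_name = row.pop("name")
            self.add(sample_name, row)
        return self


from_memory = SampleSet.from_dict("pilot", {"s1": {"tissue": "leaf"}})

csv_text = "name,tissue\ns1,leaf\ns2,root\n"
from_file = SampleSet.from_csv("pilot", io.StringIO(csv_text))
print(sorted(from_file.samples))
```

Both constructors funnel into the same add method, which mirrors how from_accessions reuses the code that adds accessions one at a time.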

The Cohort.add_accession method utilizes some of the inherited freezable machinery mentioned above. You can see the call to self.bulk_transaction returning a cursor to the sqlite database, which is where the accessions are added. When the Freezable object is created for the first time, sqlite tables are made for it in a centralized directory holding all the database files for minus80 (the default is ~/.minus80/). In subsequent Python sessions, the Cohort object can be created by name (see the __init__ method) and accessions can be created from the Cohort object using Python indexing (see the full Cohort.py source code). In addition to making something freezable, the developer is also responsible for getting objects in and out of the databases, and for providing an interface with which to interact with the data.
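The "create it once, reopen it by name, index into it" behaviour can be imitated with a few lines of sqlite3. This is a toy under stated assumptions: ToyCohort, its single-table schema, and the file name are invented for the example, and a temporary directory stands in for the centralized minus80 directory. Two objects opened one after the other stand in for two separate Python sessions.

```python
import sqlite3
import tempfile
from pathlib import Path


class ToyCohort:
    """Illustrative only; the real Cohort has a richer schema and API."""

    def __init__(self, name, basedir):
        self.name = name
        self._db = sqlite3.connect(str(Path(basedir) / f"Cohort.{name}.db"))
        # Safe to run every session: the table is created only the first time.
        with self._db:
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS metadata ("
                "accession TEXT, key TEXT, val TEXT, "
                "UNIQUE(accession, key))"
            )

    def add_accession(self, accession_name, metadata):
        with self._db:  # one transaction for all rows of this accession
            self._db.executemany(
                "INSERT OR REPLACE INTO metadata (accession, key, val) "
                "VALUES (?, ?, ?)",
                ((accession_name, k, v) for k, v in metadata.items()),
            )

    def __getitem__(self, accession_name):
        # Indexing "thaws" one accession's metadata out of the database.
        rows = self._db.execute(
            "SELECT key, val FROM metadata WHERE accession = ?",
            (accession_name,),
        ).fetchall()
        if not rows:
            raise KeyError(accession_name)
        return dict(rows)

    def close(self):
        self._db.close()


basedir = tempfile.mkdtemp()

# "Session one": create the cohort and freeze an accession into it.
session_one = ToyCohort("field_2019", basedir)
session_one.add_accession("plant_01", {"tissue": "leaf", "genotype": "B73"})
session_one.close()

# "Session two": the same name finds the same data on disk.
session_two = ToyCohort("field_2019", basedir)
thawed = session_two["plant_01"]
session_two.close()
print(thawed)
```

Note that the class author wrote both directions by hand: add_accession puts data in and __getitem__ takes it out, which is the developer responsibility described above.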

So, unlike pickle, which is concerned with simply storing and retrieving Python objects, minus80 gives Python objects access to mechanisms for storing data. This allows for a little more flexibility in how data is stored and how it is accessed in the future. You can create methods that retrieve only a small subset of the data, or provide an iterable interface so that data can be processed efficiently. Freezable objects just inherit the machinery to do this in a convenient way.
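As a rough illustration of subset retrieval and an iterable interface, the sketch below stores a handful of made-up variant records in sqlite and exposes them two ways. ToyVariants, in_region, and the table layout are hypothetical names for this example, and an in-memory database is used only to keep it short.

```python
import sqlite3


class ToyVariants:
    """Hypothetical store showing subset queries and lazy iteration."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE variants (chrom TEXT, pos INTEGER, alt TEXT)"
        )

    def add_many(self, records):
        with self._db:
            self._db.executemany(
                "INSERT INTO variants VALUES (?, ?, ?)", records
            )

    def in_region(self, chrom, start, end):
        """Return only the rows that fall inside one region."""
        cur = self._db.execute(
            "SELECT chrom, pos, alt FROM variants "
            "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
            (chrom, start, end),
        )
        return cur.fetchall()

    def __iter__(self):
        """Yield rows one at a time instead of loading the whole table."""
        cur = self._db.execute(
            "SELECT chrom, pos, alt FROM variants ORDER BY chrom, pos"
        )
        for row in cur:
            yield row


variants = ToyVariants()
variants.add_many([("chr1", 100, "A"), ("chr1", 5000, "T"), ("chr2", 42, "G")])

subset = variants.in_region("chr1", 1, 1000)
n_rows = sum(1 for _ in variants)
print(subset, n_rows)
```

With pickle the whole object would have to be loaded before any of this filtering could happen; here the database does the filtering and only the requested rows reach Python.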

Conclusions

The minus80 package comes with an abstract base class that gives Python objects access to a shared interface so they can store their data. Freezable objects gain access to databases that store both relational and columnar data. Objects backed by minus80 are stored in a centralized location on disk, which makes it easier to integrate and compare different data types.

Cover Photo by Pierre Gui


  1. If not, there are many, many great tutorials on how this works and why it's a great idea. ↩︎

Rob Schaefer

Published 5 years ago