Upgrading LocusPocus genomic coordinates with Python Data Classes

Python 3.7 introduced DataClasses to the standard library. Check out how we are using this new feature to upgrade Locus objects....

10 months ago

by Rob Schaefer

Have you ever written code like this?

class SomeCoolClass(object):
	def __init__(self, a, b, c, x, y, z):
		self.a = a
		self.b = b
		self.c = c
		self.x = x
		self.y = y
		self.z = z
The definition of boilerplate code

Here, the class attributes are essentially defined by the parameters of its initialization function. This leads to a very simple __init__ function, which sets up the internal data structures when the object is initialized. Each input parameter gets bound to its respective variable in the object. There are certainly cases where the __init__ function serves a deeper purpose, especially where the data are not independent or some transformation/check needs to occur on an input:

class SomeCoolClass(object):
	def __init__(self, x, y, z=0):
		self.x_y = (x,y)  # store x and y together
		self.z = max(z,0) # ensure z is never less than 0
x and y are represented internally as a tuple and z needs to be checked

However, in cases where your class represents some collection of data pieces, the code needed to store those data in an object amounts to unnecessary boilerplate. In addition to the __init__ function, other common class comparison functions suffer from unnecessary boilerplate when the class is a mere container for data. For example, to implement the == operator to do things like:

x = SomeCoolClass(1,2,3,4,5,6)
y = SomeCoolClass(1,2,3,4,5,6)

# See if x and y are equal to each other
if x == y:
	print("x and y are equal")
Python allows you to compare class objects using the "==" operator

You need to implement the Python "dunder method" __eq__:

class SomeCoolClass(object):
    def __init__(self, a, b, c, x, y, z):
        self.a = a
        self.b = b
        self.c = c
        self.x = x
        self.y = y
        self.z = z
        
    def __eq__(self, other):
        ''' Check to see if self equals other '''
        if ( self.a == other.a and 
             self.b == other.b and
             self.c == other.c and
             self.x == other.x and
             self.y == other.y and
             self.z == other.z
        ):
            return True
        else:
            return False

The same goes for many other relatively straightforward operators like < or >, and for more complex functions such as repr, sorted or hash, which depend on other dunder methods. Implementing all of these "dunder" methods turns into a lot of boilerplate.

Python DataClasses

Python 3.7 included a new feature called DataClasses to handle cases where the class definition and behavior are primarily dictated by the data they contain. DataClasses are implemented using a class decorator. If you're not familiar with decorators, they are a way to wrap a function (or, here, a class) in another function, and they typically provide some sort of pre- or post-processing. In DataClasses, boilerplate "dunder" methods are auto-magically added. Let's see them in action:

from dataclasses import dataclass

@dataclass
class SomeCoolClass:
    a : int
    b : int
    c : int
    x : float
    y : float
    z : float
    

A couple of things to notice here. The @dataclass notation is the Python decorator; it indicates that the class being defined is a DataClass. This means that the code defining our SomeCoolClass is wrapped by a function in the dataclasses package. This is where the additional "dunder" methods such as __init__, __repr__ and __eq__ are defined. The ordering operators (>, <, etc.) can be generated as well, but only when you ask for them with @dataclass(order=True).

The other thing to notice here are the type annotations (see PEP526). Since this isn't a post about type annotations, all we need to know here is that the notation, a : int, means that the variable a should be a Python int. NOTE: currently type annotations are not strictly enforced and are mainly for programmer understanding and some meta-tooling such as linters, syntax highlighters and some IDEs.
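To make the "not strictly enforced" point concrete, here is a minimal sketch. The Point class is hypothetical and only for illustration; it shows that a value of the "wrong" type is accepted at runtime, while the declared types remain available to tooling through dataclasses.fields.

```python
from dataclasses import dataclass, fields


@dataclass
class Point:  # hypothetical class, for illustration only
    x: int
    y: int


# The annotation says int, but nothing stops us from passing a string.
p = Point("not a number", 2)
print(type(p.x).__name__)

# The declared types are still recorded and can be inspected by tools.
declared = {f.name: f.type for f in fields(Point)}
print(declared)
```

If you do want runtime checks, they have to come from somewhere else, such as your own validation code or a static type checker run before the code ships.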

If this is a bit of a brain stretch, don't worry. This is just some added Python syntax: a class marked (i.e. decorated) with @dataclass gets an automatically generated __init__ function that assigns the variables defined under the class label to self. So in the above case, a new instance can be created like so:

x = SomeCoolClass(1,2,3,4,5,6)
assert x.a == 1
assert x.b == 2
assert x.c == 3
# etc...
After it is defined, a DataClass works like any other class.

You can read more about all the specific "dunder" methods that get added to a typical DataClass in either its PEP or its documentation in the standard library.
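As a quick check of what you get out of the box, the sketch below compares two hypothetical classes (Plain and Ordered are names made up for this example). It assumes nothing beyond the standard library: one class uses the default decorator, the other asks for ordering methods with order=True.

```python
from dataclasses import dataclass


@dataclass
class Plain:  # hypothetical example class
    a: int
    b: int


@dataclass(order=True)
class Ordered:  # same fields, but ordering methods requested
    a: int
    b: int


# __init__, __repr__ and __eq__ are generated by default.
print(Plain(1, 2))
print(Plain(1, 2) == Plain(1, 2))

# Ordering is opt-in: without order=True, "<" raises a TypeError.
try:
    Plain(1, 2) < Plain(1, 3)
    plain_is_orderable = True
except TypeError:
    plain_is_orderable = False

# With order=True, instances compare like tuples of their fields.
ordered_result = Ordered(1, 2) < Ordered(1, 3)
print(plain_is_orderable, ordered_result)
```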

Implementing a Locus object with DataClasses

NOTE: This blog post refers to version v1.0.2 of locuspocus.

The Locus object in locuspocus uses the @dataclass wrapper to define its components. The dataclass definition for Locus is described below:

from dataclasses import dataclass, field

# LocusAttrs and SubLoci are defined elsewhere in locuspocus
@dataclass()
class Locus:
    chromosome: str
    start: int
    end: int

    source: str = 'locuspocus'
    feature_type: str = 'locus'
    strand: str = '+'
    frame: int = None
    name: str = None

    # Extra locus stuff
    attrs: LocusAttrs = field(default_factory=LocusAttrs)
    subloci: SubLoci = field(default_factory=SubLoci)
The dataclass attributes for the Locus object.

Here we can see that a locus is a combination of several positional arguments: chromosome, start and end.

It also takes several keyword arguments, each with a default value: source, feature_type, strand, frame and name.

Finally we can see two more fields being defined in the data class, attrs and subloci, which we can come back to below.

In addition to a straight-forward representation of the data that go in the object, you also get the type annotations to indicate what kinds of data should populate the Locus object. For example, a chromosome should be a str.

A Locus is just a combination of these pieces of data, and by defining Locus using a dataclass, we get the added benefit of things like comparison operators. For example:

import locuspocus as lp

# Define a locus object - note we use the defaults for the
# remaining fields
x = lp.Locus('chr1', 1, 100)

# Define another object and see if they are equal
y = lp.Locus('chr1', 1, 100)

assert x == y
With DataClasses, you get free added functionality!

What about the other comparison methods such as > or <? DataClasses do not generate these by default. When you ask for them with @dataclass(order=True), instances are compared as if they were tuples of their fields, in the order the fields are defined. Since that's not what we want here, we can simply define our own comparison methods that rely only on the chromosome and start class attributes.

    def __lt__(self,locus):
        if self.chromosome == locus.chromosome:
            return self.start < locus.start
        else:
            return self.chromosome < locus.chromosome

    def __le__(self,locus):
        if (self.chromosome,self.coor) == (locus.chromosome,locus.coor):
            return True
        else:
            return self < locus
Some dunder methods still need to be tweaked for certain behavior
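Here is a minimal, self-contained sketch of the same idea. Region is a hypothetical stand-in, not the actual locuspocus class, and it orders objects by chromosome and then start using a tuple comparison rather than the if/else shown above.

```python
from dataclasses import dataclass


@dataclass
class Region:  # hypothetical stand-in for Locus, not the locuspocus class
    chromosome: str
    start: int
    end: int

    def __lt__(self, other):
        # Order by chromosome first, then by start; end is ignored.
        return (self.chromosome, self.start) < (other.chromosome, other.start)


regions = [
    Region("chr2", 5, 50),
    Region("chr1", 200, 300),
    Region("chr1", 10, 500),
]

# sorted() only needs __lt__, so the custom method is enough here.
ordered = sorted(regions)
starts = [(r.chromosome, r.start) for r in ordered]
print(starts)
```

Note that chromosome names are compared as plain strings in this sketch, so "chr10" sorts before "chr2". Whether that matters depends on your data.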

Of course, after a dataclass is processed by the @dataclass wrapper, it is still just a normal Python class. You can inherit from it and add additional methods just like any other class.
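For instance, a plain subclass can add behavior on top of the generated methods. The sketch below uses hypothetical Span and MeasuredSpan classes, and the length calculation assumes inclusive coordinates, which is an assumption of this example only.

```python
from dataclasses import dataclass


@dataclass
class Span:  # hypothetical base dataclass
    start: int
    end: int


class MeasuredSpan(Span):
    # A regular subclass: it inherits the generated __init__, __repr__ and __eq__.
    def __len__(self):
        return self.end - self.start + 1  # assumes inclusive coordinates


s = MeasuredSpan(1, 100)
print(s, len(s))
```

One detail worth knowing: the generated __eq__ only treats two objects as equal when they are of exactly the same class, so a MeasuredSpan never equals a Span, even with identical fields.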

Is this shoehorning?

Don't get me wrong, dataclasses are cool. However, it turns out that a Locus object may not be the best scenario to use them. Just because something is cool doesn't necessarily mean you should always use it. It may be that a Locus object is better defined using the older class definition.

On the one hand, by using dataclasses, we get the added benefit of automatically generated __init__ and __eq__ functions as well as the free use of repr() and str(). On the other hand, if you check out the source code, most of the basic operators like >, >=, etc. needed to be overloaded anyway because the logic for performing those operations does not fit the default one implemented by dataclasses.

Once push came to shove, a Locus object needed a little more than just its data fields to fully define its behavior. Above, I put off talking about attrs and subloci since they are more complicated than the default use case for DataClasses. In these cases, we need to use the field function, included in the dataclasses library, to define the instances of attrs and subloci. It turns out it's possible to get the behavior I needed with Locus objects using dataclasses; however, it is far from the best example. All in all, I think it makes the code slightly more readable, and it does indicate to any potential readers that a Locus object is primarily for storing data.
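To show why field is needed for fields like these, here is a small sketch. Record is a hypothetical class, with dict and list standing in for LocusAttrs and SubLoci. It shows that default_factory builds a fresh container per instance, and that dataclasses refuse a bare mutable default.

```python
from dataclasses import dataclass, field


@dataclass
class Record:  # hypothetical; dict and list stand in for LocusAttrs and SubLoci
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


a = Record("a")
b = Record("b")

# Each instance gets its own fresh containers.
a.attrs["color"] = "red"
a.children.append(b)
print(a.attrs, b.attrs)

# A bare mutable default such as [] is rejected when the class is created.
try:
    @dataclass
    class Broken:
        items: list = []
    mutable_default_allowed = True
except ValueError:
    mutable_default_allowed = False
print(mutable_default_allowed)
```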

What's next?

This has been a short, inside scoop into how some of the newest Python features make their way into our code base. I hope it was helpful for seeing how features such as DataClasses can allow you to write cleaner, more programmer-friendly code. Please drop by and check out our project on GitHub, leave a comment below, and stay tuned for more posts!

Acknowledgements

This post and the work that went into implementing the LocusPocus code were partially supported by a Mozilla Science Mini-Grant. I'd also like to acknowledge the cover art for this post, which came from Scott Webb on Unsplash.

Rob Schaefer

Published 10 months ago