Have you ever written code like this?
class SomeCoolClass(object):
    def __init__(self, a, b, c, x, y, z):
        self.a = a
        self.b = b
        self.c = c
        self.x = x
        self.y = y
        self.z = z
We can see here that the instance attributes are essentially defined by the parameters of the initialization function. This leads to a very simple __init__ function, which defines the internal data structures when the object is initialized. Each input parameter gets bound to its respective attribute on the object. There are certainly cases where the __init__ function serves a deeper purpose, especially where the data are not independent or some transformation or check needs to occur on an input:
class SomeCoolClass(object):
    def __init__(self, x, y, z=0):
        self.x_y = (x, y)   # store x and y together
        self.z = max(z, 0)  # ensure z is never less than 0
However, in cases where your class represents some collection of data pieces, the code needed to store those data in an object amounts to unnecessary boilerplate. In addition to the __init__ function, common comparison functions also suffer from unnecessary boilerplate when the class is a mere container for data. For example, to implement the == operator to do things like:
x = SomeCoolClass(1,2,3,4,5,6)
y = SomeCoolClass(1,2,3,4,5,6)
# See if x and y are equal to each other
if x == y:
    print("x and y are equal")
You need to implement the Python "dunder method" __eq__:
class SomeCoolClass(object):
    def __init__(self, a, b, c, x, y, z):
        self.a = a
        self.b = b
        self.c = c
        self.x = x
        self.y = y
        self.z = z

    def __eq__(self, other):
        ''' Check to see if self equals other '''
        if (self.a == other.a and
            self.b == other.b and
            self.c == other.c and
            self.x == other.x and
            self.y == other.y and
            self.z == other.z
        ):
            return True
        else:
            return False
Additionally, for many other relatively straightforward operators like < or >, or even more complex functions such as repr, sort, or hash, which depend on other basic operators, implementing all of these "dunder" methods turns into a lot of boilerplate.
Python DataClasses
Python 3.7 introduced a new feature called DataClasses to handle cases where the class definition and behavior are primarily dictated by the data the class contains. DataClasses are implemented using a class decorator. If you're not familiar with decorators, they are a way to wrap a function or class in another function, typically to provide some sort of pre- or post-processing. With DataClasses, boilerplate "dunder" methods are added automagically. Let's see them in action:
from dataclasses import dataclass

@dataclass
class SomeCoolClass:
    a : int
    b : int
    c : int
    x : float
    y : float
    z : float
A couple of things to notice here. The @dataclass notation is the Python decorator; it indicates that the class being defined is a DataClass. This means that the code defining our SomeCoolClass is wrapped by a function in the dataclasses module. This is where additional "dunder" methods such as __init__, __repr__, and __eq__ are defined. The ordering operators (>, <, etc.) can be generated as well, but only when requested with @dataclass(order=True).
The other thing to notice here is the type annotations (see PEP 526). Since this isn't a post about type annotations, all we need to know here is that the notation a : int means that the variable a should be a Python int. NOTE: type annotations are currently not enforced at runtime and are mainly for programmer understanding and meta-tooling such as linters, syntax highlighters, and some IDEs.
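To make the point about enforcement concrete, here is a minimal sketch. The Point class is hypothetical and exists only for illustration. It shows that a dataclass will happily store a value that contradicts its annotation, while the declared types remain available to tooling through dataclasses.fields:

```python
from dataclasses import dataclass, fields


@dataclass
class Point:  # hypothetical class, for illustration only
    x: int
    y: int


# The annotation says int, but nothing stops us from passing a string
p = Point("not an int", 2)

# The declared types are still recorded, which is what linters and
# other tools rely on
declared = {f.name: f.type for f in fields(Point)}
```

If you want runtime checks, you have to write them yourself or reach for a third-party validation library.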
If this is a bit of a brain stretch, don't worry. This is just some added Python syntax so that a class marked (i.e. decorated) with @dataclass assigns the variables defined under the class label to self with an automatically generated __init__ function. So in the above case, a new instance can be created like so:
x = SomeCoolClass(1,2,3,4,5,6)
assert x.a == 1
assert x.b == 2
assert x.c == 3
# etc...
You can read more about all the specific "dunder" methods that get added to a typical DataClass in either its PEP or its documentation in the standard library.
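As a quick taste of what you get by default, the sketch below uses a hypothetical Pair class (not part of any library) to show the generated __repr__ and __eq__. It also shows that ordering is not generated unless you ask for it with order=True:

```python
from dataclasses import dataclass


@dataclass
class Pair:  # hypothetical class, for illustration only
    a: int
    b: float


p = Pair(1, 2.5)
q = Pair(1, 2.5)

text = repr(p)          # generated __repr__ lists every field by name
same_value = (p == q)   # generated __eq__ compares field by field
same_object = (p is q)  # still two distinct objects

# Without order=True, no __lt__ is generated, so < raises TypeError
try:
    p < q
    ordering_supported = True
except TypeError:
    ordering_supported = False
```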
Implementing a Locus object with DataClasses
NOTE: This blog post refers to version v1.0.2 of locuspocus.
The Locus object in locuspocus uses the @dataclass wrapper to define its components. The dataclass definition for Locus is shown below:
@dataclass()
class Locus:
    chromosome: str
    start: int
    end: int

    source: str = 'locuspocus'
    feature_type: str = 'locus'
    strand: str = '+'
    frame: int = None
    name: str = None

    # Extra locus stuff
    attrs: LocusAttrs = field(default_factory=LocusAttrs)
    subloci: SubLoci = field(default_factory=SubLoci)
Here we can see that a locus is a combination of several positional arguments:
chromosome
start
end
As well as several keyword arguments:
source (default: locuspocus)
feature_type (default: locus)
strand (default: +)
frame (default: None)
name (default: None)
Finally, we can see two more fields being defined in the data class, attrs and subloci, which we will come back to below.
In addition to a straightforward representation of the data that go in the object, you also get the type annotations to indicate what kinds of data should populate the Locus object. For example, a chromosome should be a str.
A Locus is just a combination of these pieces of data, and by defining Locus using a dataclass, we get the added benefit of things like comparison operators. For example:
import locuspocus as lp
# Define a locus object - note we use the defaults for the
# remaining fields
x = lp.Locus('chr1', 1, 100)
# Define another object and see if they are equal
y = lp.Locus('chr1', 1, 100)
assert x == y
What about the other comparison methods such as > or <? By default, DataClasses do not generate these at all. With @dataclass(order=True), they compare two objects as if they were tuples of their fields, in the order the fields are defined. Since that's not what we want here, we can leave order off and define our own comparison methods that rely only on the chromosome and start attributes.
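def __lt__(self, locus):
    # Order by chromosome first, then by start position
    if self.chromosome == locus.chromosome:
        return self.start < locus.start
    else:
        return self.chromosome < locus.chromosome

def __le__(self, locus):
    # NOTE: coor is not one of the fields listed above; it is
    # presumably defined elsewhere on the Locus class
    if (self.chromosome, self.coor) == (locus.chromosome, locus.coor):
        return True
    else:
        return self < locus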
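To see why the generated ordering is not a good fit, here is a minimal sketch comparing the two approaches. Both OrderedLocus and MiniLocus are hypothetical, trimmed-down stand-ins and not the real locuspocus classes. With order=True every field takes part in the comparison, while the hand-written __lt__ looks only at chromosome and start:

```python
from dataclasses import dataclass


@dataclass(order=True)
class OrderedLocus:  # hypothetical: ordering generated by @dataclass
    chromosome: str
    start: int
    end: int
    name: str = ""


@dataclass
class MiniLocus:  # hypothetical: ordering written by hand
    chromosome: str
    start: int
    end: int
    name: str = ""

    def __lt__(self, other):
        # Compare on (chromosome, start) only
        return (self.chromosome, self.start) < (other.chromosome, other.start)


# Same chromosome and start; only end and name differ
a = OrderedLocus("chr1", 100, 500, name="b")
b = OrderedLocus("chr1", 100, 200, name="a")
generated_says_less = b < a  # the tie is broken by end, then name

c = MiniLocus("chr1", 100, 500)
d = MiniLocus("chr1", 100, 200)
custom_says_less = d < c     # chromosome and start tie, so not less

# sorted() only needs __lt__, so the custom ordering is enough
loci = [
    MiniLocus("chr2", 5, 10),
    MiniLocus("chr1", 300, 400),
    MiniLocus("chr1", 100, 200),
]
ordered = sorted(loci)
```

Note that defining __lt__ yourself only works when order is left at its default of False; combining order=True with a hand-written __lt__ raises a TypeError when the class is created.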
Of course, after a class is processed by the @dataclass wrapper, it is still just a normal Python class. You can inherit from it and add additional methods just like any other class.
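A small sketch of that idea, using hypothetical Interval and NamedInterval classes that are not part of locuspocus: the base dataclass carries an ordinary method, and the subclass adds both a field and a method of its own.

```python
from dataclasses import dataclass


@dataclass
class Interval:  # hypothetical base class
    start: int
    end: int

    def __len__(self):
        # Ordinary methods work as in any other class
        return self.end - self.start


@dataclass
class NamedInterval(Interval):  # subclass adds a field and a method
    name: str = "unnamed"

    def label(self):
        return f"{self.name}:{self.start}-{self.end}"


iv = NamedInterval(10, 25, name="geneA")
length = len(iv)
text = iv.label()
```

The subclass is decorated too so that its new field is picked up by the generated __init__; inherited fields come first in the argument list.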
Is this shoehorning?
Don't get me wrong, dataclasses are cool; however, it turns out that a Locus object may not be the best scenario in which to use them. Just because something is cool doesn't necessarily mean you should always use it. It may be that a Locus object is better defined using the older class definition.
On the one hand, by using dataclasses, we get the added benefit of automatically generated __init__ and __eq__ functions as well as the free use of repr() and str(). On the other hand, if you check out the source code, most of the basic operators like >, >=, etc. needed to be written by hand anyway because the logic for performing those operations does not fit the default one implemented by dataclasses.
Once push came to shove, a Locus object needed a little more than just its data fields to fully define its behavior. Above, I put off talking about attrs and subloci since they are more complicated than the default use case for DataClasses. In these cases, we need to use the field function, included in the dataclasses library, so that each Locus gets its own fresh instances of attrs and subloci. It turns out that it's possible to get the behavior I needed with Locus objects using dataclasses; however, it is far from the best example. All in all, I think it makes the code slightly more readable, and it does indicate to any potential readers that a Locus object is primarily for storing data.
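For readers who have not used field before, here is a minimal sketch of why default_factory is needed. The Record class is hypothetical, and plain list and dict stand in for LocusAttrs and SubLoci, whose definitions are not shown in this post. A factory is called once per new object, so instances never share the same mutable default:

```python
from dataclasses import dataclass, field


@dataclass
class Record:  # hypothetical stand-in for Locus
    name: str
    # default_factory is called for every new instance
    tags: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)


r1 = Record("first")
r2 = Record("second")
r1.tags.append("x")  # only r1 is affected

# Writing the default directly is rejected when the class is created
try:
    @dataclass
    class Broken:
        tags: list = []
    mutable_default_rejected = False
except ValueError:
    mutable_default_rejected = True
```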
What's next?
This has been a short, inside scoop into how some of the newest Python features make their way into our code base. I hope it was helpful for seeing how features such as DataClasses can allow you to write cleaner, more programmer-friendly code. Please drop by and check out our project on GitHub, leave a comment below, and stay tuned for more posts!
Acknowledgements
This post and the work that went into implementing the LocusPocus code was partially supported by a Mozilla Science Mini-Grant. I'd also like to acknowledge the cover art for this post, which came from Scott Webb on Unsplash.