Have you ever written code like this?
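A minimal sketch of the kind of class in question, reusing the SomeCoolClass name and attribute names from the example further down:

```python
class SomeCoolClass(object):
    def __init__(self, a, b, c, x, y, z):
        # Every parameter is simply copied onto the instance
        self.a = a
        self.b = b
        self.c = c
        self.x = x
        self.y = y
        self.z = z
```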
Here, the class attributes are defined entirely by the parameters of its initialization function. This leads to a very simple __init__ function, which sets up the internal data structures when the object is initialized. Each input parameter gets bound to its respective variable in the object. There are certainly cases where the __init__ function serves a deeper purpose, especially where the data are not independent or some transformation or check needs to occur on an input:
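As a sketch of such a case, the hypothetical Interval class below (not part of any library discussed here) coerces and checks its inputs, and derives one attribute from the others:

```python
class Interval:
    """Hypothetical example: the inputs are checked and transformed, not just stored."""

    def __init__(self, start, end):
        # Transformation: coerce the inputs to integers
        start, end = int(start), int(end)
        # Check: the two values are not independent of each other
        if end < start:
            raise ValueError(f"end ({end}) must not be smaller than start ({start})")
        self.start = start
        self.end = end
        # Derived attribute that is not an input parameter at all
        self.length = end - start


iv = Interval("10", 25)
print(iv.length)  # 15
```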
However, in cases where your class represents a collection of data pieces, the code needed to store those data in an object amounts to unnecessary boilerplate. In addition to the __init__ function, other common comparison functions suffer from unnecessary boilerplate when the class is a mere container for data. For example, to use the == operator to compare two instances, you need to implement the Python "dunder method" __eq__:
    class SomeCoolClass(object):
        def __init__(self, a, b, c, x, y, z):
            self.a = a
            self.b = b
            self.c = c
            self.x = x
            self.y = y
            self.z = z

        def __eq__(self, other):
            ''' Check to see if self equals other '''
            # Two instances are equal when every attribute matches
            return (
                self.a == other.a and
                self.b == other.b and
                self.c == other.c and
                self.x == other.x and
                self.y == other.y and
                self.z == other.z
            )
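To see why this method is needed at all: without __eq__, Python falls back to comparing object identity, so two instances holding the same data are not considered equal. A small sketch, with hypothetical two-attribute classes standing in for SomeCoolClass:

```python
class WithoutEq:
    def __init__(self, a, b):
        self.a = a
        self.b = b


class WithEq(WithoutEq):
    def __eq__(self, other):
        # Compare the stored data rather than object identity
        return self.a == other.a and self.b == other.b


print(WithoutEq(1, 2) == WithoutEq(1, 2))  # False: two different objects
print(WithEq(1, 2) == WithEq(1, 2))        # True: same data
```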
The same goes for other relatively straightforward operators like > and for more complex functions such as hash, which depend on other basic operators: implementing all of these "dunder" methods turns into a lot of boilerplate.
Python 3.7 included a new feature called DataClasses to handle cases where the class definition and behavior are primarily dictated by the data the class contains. DataClasses are implemented using a class decorator. If you're not familiar with decorators, they are a way to wrap a function or class in another function, and they typically provide some sort of pre- or post-processing. In DataClasses, boilerplate "dunder" methods are auto-magically added. Let's see them in action:
    from dataclasses import dataclass

    @dataclass
    class SomeCoolClass:
        a : int
        b : int
        c : int
        x : float
        y : float
        z : float
A couple of things to notice here. The @dataclass notation is the Python decorator; it indicates that the class being defined is a DataClass. This means that the code defining our SomeCoolClass is processed by a function in the dataclasses module. This is where the additional "dunder" methods such as __init__, __repr__ and __eq__ are defined (ordering operators such as < are only added when you ask for them with order=True).
The other thing to notice here is the type annotations (see PEP 526). Since this isn't a post about type annotations, all we need to know here is that the notation a : int means that the variable a should be a Python int. NOTE: type annotations are currently not enforced at runtime and are mainly for programmer understanding and for meta-tooling such as linters, syntax highlighters and some IDEs.
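A quick way to convince yourself that the annotations are not checked at runtime, using a cut-down, hypothetical class:

```python
from dataclasses import dataclass


@dataclass
class Pair:
    a: int
    x: float


# No error is raised even though neither value matches its annotation
p = Pair(a="not an int", x=None)
print(p)  # Pair(a='not an int', x=None)
```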
If this is a bit of a brain stretch, don't worry. This is just some added Python syntax so that classes marked (i.e. decorated) with @dataclass assign the variables defined under the class label to self with an automatically generated __init__ function. So in the above case, a new instance can be created like so:
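A sketch of what that looks like, repeating the class definition so the snippet runs on its own (the values are arbitrary):

```python
from dataclasses import dataclass


@dataclass
class SomeCoolClass:
    a: int
    b: int
    c: int
    x: float
    y: float
    z: float


# The generated __init__ accepts positional or keyword arguments
obj = SomeCoolClass(1, 2, 3, x=1.0, y=2.0, z=3.0)

print(obj.a, obj.z)  # 1 3.0
print(obj)           # SomeCoolClass(a=1, b=2, c=3, x=1.0, y=2.0, z=3.0)
print(obj == SomeCoolClass(1, 2, 3, 1.0, 2.0, 3.0))  # True
```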
Implementing a Locus object with DataClasses
NOTE: This blog post refers to version
The Locus object in locuspocus uses the @dataclass decorator to define its components. The dataclass definition for Locus is described below:
Here we can see that a locus is a combination of several positional arguments:
As well as several keyword arguments:
Finally we can see two more
fields being defined in the data class,
subloci, which we can come back to below.
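The general shape of such a definition can be sketched as follows. This is not the locuspocus source: apart from chromosome and start, which the post mentions, every field name and default below is a placeholder chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocusSketch:
    # Fields without defaults become required arguments, in this order
    chromosome: str
    start: int
    end: int  # placeholder name

    # Fields with defaults become optional keyword arguments
    strand: str = "+"           # placeholder name and default
    name: Optional[str] = None  # placeholder name and default


locus = LocusSketch("chr1", 100, 200, name="gene_a")
print(locus)
# LocusSketch(chromosome='chr1', start=100, end=200, strand='+', name='gene_a')
```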
In addition to a straightforward representation of the data that go in the object, you also get the type annotations to indicate what kinds of data should populate the
Locus object. For example, a chromosome should be a
A Locus is just a combination of these pieces of data, and by defining Locus using a dataclass, we get the added benefit of things like comparison operators. For example:
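A sketch of that comparison, using a pared-down stand-in for Locus (the end field is an assumption, not taken from the locuspocus source):

```python
from dataclasses import dataclass


@dataclass
class MiniLocus:
    chromosome: str
    start: int
    end: int  # assumed field


a = MiniLocus("chr1", 100, 200)
b = MiniLocus("chr1", 100, 200)
c = MiniLocus("chr2", 100, 200)

print(a == b)  # True: every field matches
print(a == c)  # False: chromosome differs
print(a is b)  # False: still two distinct objects
```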
What about the other comparison methods such as <? When a class is decorated with @dataclass(order=True), DataClasses compare two instances as if they were tuples of their fields, in the order the fields are defined. With the default, order=False, no ordering methods are generated at all. However, since the tuple comparison is not what we want here, we can leave order at its default and define our own comparison method that relies only on the
start class attributes.
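A sketch of such an override, on a stand-in class rather than the real Locus. Comparing chromosome first and start second is an assumption about the intended ordering, not a copy of the locuspocus implementation:

```python
from dataclasses import dataclass


@dataclass  # order is left at its default (False), so no __lt__ is generated
class MiniLocus:
    chromosome: str
    start: int
    end: int  # assumed field, deliberately ignored when ordering

    def __lt__(self, other):
        # Order by position only: chromosome first, then start
        return (self.chromosome, self.start) < (other.chromosome, other.start)


loci = [
    MiniLocus("chr2", 5, 50),
    MiniLocus("chr1", 300, 310),
    MiniLocus("chr1", 100, 900),
]
print([(l.chromosome, l.start) for l in sorted(loci)])
# [('chr1', 100), ('chr1', 300), ('chr2', 5)]
```

Passing order=True together with a hand-written __lt__ raises a TypeError, which is why the decorator is left at its default here.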
Of course, after a dataclass is processed by the
@dataclass wrapper, it is still just a normal Python class. You can inherit from it and add additional methods just like any other class.
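For instance (both classes here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


class PointWithNorm(Point):
    # A plain subclass: it inherits the generated __init__ and __eq__
    def norm_squared(self):
        return self.x ** 2 + self.y ** 2


p = PointWithNorm(3.0, 4.0)
print(p.norm_squared())      # 25.0
print(isinstance(p, Point))  # True
```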
Is this shoehorning?
Don't get me wrong, dataclasses are cool. However, it turns out that a Locus object may not be the best scenario in which to use them. Just because something is cool doesn't necessarily mean you should always use it. It may be that a Locus object is better defined using the older class definition.
On the one hand, by using dataclasses, we get the added benefit of an automatically generated __eq__ function as well as the free use of str(). On the other hand, if you check out the source code, most of the basic operators like >=, etc. needed to be overloaded anyway, because the logic for performing those operations does not fit the default one implemented by dataclasses.
When push came to shove, a Locus object needed a little more than just its data fields to fully define its behavior. Above, I put off talking about subloci since they are more complicated than the default use case for DataClasses. In these cases, we need to use the field function, included in the dataclasses library, to define the instances of subloci. It turns out that it's possible to get the behavior I needed from Locus objects using dataclasses; however, it is far from the best example. All in all, I think it makes the code slightly more readable, and it does indicate to any potential readers that a Locus object is primarily for storing data.
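As a sketch of why field is needed: a mutable default such as a list cannot be written directly in the class body (dataclasses raises a ValueError for it), so the default is supplied through default_factory instead. Using a plain list for subloci is an assumption made for illustration; the actual Locus may store them differently.

```python
from dataclasses import dataclass, field


@dataclass
class MiniLocus:
    chromosome: str
    start: int
    # default_factory is called once per instance, so each locus gets its own list
    subloci: list = field(default_factory=list)


parent = MiniLocus("chr1", 100)
other = MiniLocus("chr1", 500)
parent.subloci.append(MiniLocus("chr1", 120))

print(len(parent.subloci))  # 1
print(len(other.subloci))   # 0: the list is not shared between instances
```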
This has been a short, inside scoop into how some of the newest Python features make their way into our code base. I hope it was helpful for seeing how features such as DataClasses can allow you to write cleaner, more programmer-friendly code. Please drop by and check out our project on GitHub, leave a comment below, and stay tuned for more posts!
This post and the work that went into implementing the LocusPocus code were partially supported by a Mozilla Science Mini-Grant. I'd also like to acknowledge the cover art for this post, which came from Scott Webb on Unsplash.