Most of the tools at Linkage are written in Python. It has been my daily driver since around 2012, after switching from a combination of R (for plotting and number crunching) and Perl (for text manipulation). I immediately fell in love with the wonderful syntax, incredible developer tools, top-notch class system, and a package ecosystem that's only gotten better over time. For most of our purposes, Python sat at the perfect abstraction level for quickly developing new tools with reasonable speed and performance. By leaning heavily on packages such as NumPy and SciPy for number crunching, Pandas for tabular data, and Matplotlib for plotting, code seemed to just pour out of my fingertips.
That's not to say things were perfect. When a problem didn't fit into the neat abstractions provided by these great packages, execution slowed to a crawl. For the most part, code performance bottlenecks follow the Pareto principle: roughly 80% of a program's execution time comes from about 20% of its code. I'd say our case was even better, since about 90% of the time there was an optimized solution in NumPy, SciPy, or Pandas – each backed by implementations in C, C++, or Fortran. And in the worst case, when a problem didn't have a nice NumPy/SciPy solution, it was relatively easy to implement the critical loop in Cython, which yielded huge performance gains without getting too deep into the weeds of writing piles of C code.
The second major hurdle in Python was parallel processing. For the uninitiated, Python has a global interpreter lock (GIL) that allows only one thread within a process to execute Python code at a time. For us, this was easily bypassed by running several Python processes at once for tasks that were "embarrassingly parallel". This did the trick; however, it was incredibly wasteful, since every instance of Python held its own copy of sometimes very large data structures (our networks contain hundreds of millions of elements). Sure, we could pull the same Cython magic as above and switch to C threads, which bypasses the GIL, but that would mean banishing a large amount of code to a weird place between Python and C, losing many of the niceties that Python offers in the process.
In short, writing Python code is wonderful until it isn't. It sits at a nice level of abstraction where you never have to think about the dangerous aspects of lower-level languages – but at a cost. And most of the time, this cost isn't realized until after you're done implementing something. You write a large chunk of idiomatic Python code, start running it, and realize it isn't going to scale. And while there are work-arounds with the aforementioned libraries, at the end of the day you're tempted to re-write larger and larger chunks of code in Cython. But, NO! The whole reason you're writing Python is because it lets you focus on solving the real problems and not segmentation faults. It's because the developer tools are amazing. It's because of the expressiveness of comprehensions and generators! At least, that's what you tell yourself as you watch your program waste countless CPU cycles checking dynamic variable types.
"Have you thought about re-writing it in Rust?"
I've thought about exploring other languages. I wrote my fair share of C and C++ in college (which ultimately drove me to learn Perl – no judgements!). I've heard great things about Haskell and Go. Hell, I've even heard that the newest versions of Java are making a comeback with the cool kids these days. Yet, after years of casually looking for greener pastures, nothing fully sold me. Then, one day, I heard about Rust.
At first, I was hesitant about Rust; it seemed more suited to "systems"-level applications such as operating systems or file systems. However, after hearing about Mozilla using Rust to achieve "fearless concurrency" in Firefox Quantum – a web browser – I decided to take a closer look.
Several things really stood out to me about Rust. It is a young language, which means it isn't tied to design decisions about syntax and paradigms made decades ago. It combines expressive syntax from "higher-level" languages with the performance of lower-level ones, thanks to some stern choices enforced by the compiler. These compiler enforcements don't just prevent segmentation faults and data races between threads; they also push you a long way toward program "correctness" (though, of course, nobody is immune to logic errors). Rust ships with awesome developer tools and a slick packaging system, and it already has a vibrant, engaging community.
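To make "fearless concurrency" a bit more concrete, here's a minimal sketch (invented for this post, not code from our actual tools) of the pattern that addresses the memory-waste problem described earlier: several threads sharing one copy of a large read-only structure through `Arc`, with the compiler rejecting any attempt at unsynchronized mutation.

```rust
use std::sync::Arc;
use std::thread;

// Sum a large read-only dataset across `workers` threads.
// `Arc` (atomic reference counting) lets every thread share the
// same allocation; cloning the Arc copies a pointer, not the data.
fn parallel_sum(data: Arc<Vec<u64>>, workers: usize) -> u64 {
    let chunk = (data.len() + workers - 1) / workers; // ceiling division
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let data = Arc::clone(&data); // cheap pointer copy, not a data copy
            thread::spawn(move || {
                let start = i * chunk;
                let end = usize::min(start + chunk, data.len());
                // Trying to mutate `data` here would be a compile error,
                // not a runtime race.
                data[start..end].iter().sum::<u64>()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // One large structure, shared by all workers -- no per-process copies.
    let network: Arc<Vec<u64>> = Arc::new((0..1_000_000).collect());
    println!("total = {}", parallel_sum(network, 4));
}
```

Contrast this with the multi-process Python approach above, where each worker holds its own full copy of the network.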
That's not to say there aren't downsides. I'm only just getting started learning Rust, but several things stick out to me. At least in the genetics community, there is already a heavy burden in fast-track teaching scientists how to program. Rust's strong type system forces you to think about the number of bits in the mantissa when you really just want to think about the numbers. Compared to duck typing in Python or R, where you can haphazardly hack something together, Rust's enums force you to know what you want, and how you want it, upfront. And while Rust's Option helps you avoid null pointers and exception handling, it is an abstraction that is, well... pretty abstract. Finally, let's face it: even seasoned programmers fight the borrow checker.
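As a small illustration of that upfront-ness – the `Strand` enum and `lookup_strand` function below are invented for this example, not from any real library – an enum pins down the possible values at compile time, and `Option` forces the caller to handle the missing case before the program will even build:

```rust
// A sketch of how Rust's enums force decisions upfront.
// `Strand` is a hypothetical example type for this post.
#[derive(Debug, PartialEq)]
enum Strand {
    Forward,
    Reverse,
}

// Option<T> replaces null: the caller *must* handle the None case.
fn lookup_strand(gene: &str) -> Option<Strand> {
    match gene {
        "GENE_A" => Some(Strand::Forward), // made-up entries for illustration
        "GENE_B" => Some(Strand::Reverse),
        _ => None,
    }
}

fn main() {
    // The compiler will not let us use the result without unpacking it.
    match lookup_strand("GENE_A") {
        Some(strand) => println!("found: {:?}", strand),
        None => println!("gene not found"),
    }

    // Or fall back to a default instead of risking a null-pointer crash.
    let strand = lookup_strand("UNKNOWN").unwrap_or(Strand::Forward);
    println!("defaulted: {:?}", strand);
}
```

In Python, a function returning `None` will happily flow onward until something blows up at runtime; here, forgetting the `None` arm is a compile error.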
For the most part, we've been able to use Python to really flesh out our problem. We've just hit a wall in terms of performance. So the choice is either to implement more of our code in something like Cython, or to try something fresh. In this case, Rust seems like a great choice.
Colorado Gold Rust
This past week there was an amazing Rust conference that just so happened to be located right in my backyard. I was interested in meeting people who were using Rust for production projects, and I was blown away by the amazing things people were working on. The conference started off with a hands-on workshop where I learned about async frameworks in Rust using Tokio (luckily they have strong parallels in Python 👍) as well as interacting with databases using Diesel. Day 2 came with incredible talks: people using Rust to create distributed networks that are robust to nefarious interference, a wonderful cupcake-filled analogy talk for Rust's…, and a stunning demonstration of creating real-time graphics, along with other talks covering IDEs, filesystems, and much more.
Overall, seeing all these cool projects from amazing people really reaffirmed my interest in using Rust for major projects in the near future. While I still intend to continue developing our Python-based libraries and programs, I am keen to further explore what Rust has to offer, especially in the realm of scientific computing.
Attending the Colorado Gold Rust conference was in part supported by a Mozilla Science Lab Mini-Grant. Mozilla supports development of the Rust programming language; however, it had no influence on the content or creative decisions of this blog post.