I'm super excited to announce that we will be teaming up with the great people over at Mozilla to launch a new project over the course of the next 6 months. We got a Mozilla Science Mini-Grant! The main focus of the project is a pilot program in which we will launch a "hub" where biomedical researchers can find, share, track, and re-use core genetic datasets.
What's the BIG problem?
Data in biology is BIG. And it's getting bigger. A recent perspective in PLoS Biology has projected that through 2025, the field of biology will generate as much, or more, "big data" than other scientific fields, such as astronomy, as well as other big tech industries like YouTube and Twitter. To put this in perspective, raw DNA sequence repositories such as NCBI and EMBL-EBI now house more than 4.1 trillion and 11 quadrillion base pairs respectively. These are databases that contain stable, raw data and store enough unprocessed data that they are only sustainable through large government entities (or are at least outside the scope of a mini-grant!).
However, not only is the size of datasets growing, so is the number of studies producing new and interesting data. Most studies in genetics operate on smaller, curated datasets generated by combining these massive raw datasets with specialized computational analyses, workflows, and pipelines. In essence:
raw_data + analysis = curated_data
For example, a lab could be studying muscular disease in horses. They've generated tens of terabytes of raw sequence data. After setting up their computational and statistical experiments, they are left with a small set of genomic features, or loci, associated with their trait of interest. These loci include common genomic features such as genes, but may also contain other features such as polymorphisms (SNPs) or non-coding features such as lncRNA.
These sets of loci associated with a trait of interest evolve as studies progress. Perhaps a new algorithm or statistical method comes out, or additional samples are produced, or previously identified loci are validated. As studies evolve and are revised, these loci are typically updated and only the latest data are retained. They are often published and distributed only at the end of a study, and then only as unstructured plain text or spreadsheets.
Due to incentives to keep results private until they are formally published, the iterative process of discovering and refining the loci associated with a trait of interest is largely lost. Final results are presented as nice, refined, "shiny" products, while the methods themselves are reduced to short descriptions in the methods section of a manuscript, leading to overly simplified reports of inherently messy scientific processes. This lack of scientific "version control" not only hides the research process, it also covers up valuable lessons learned from trying various methods during scientific discovery. The problem is analogous to software where only the "major releases" are published, while descriptions and annotations of "bug fixes" and improvements are lost completely.
With the help of Mozilla, over the next 6 months we will be working on launching a prototype hub for genetics datasets called "LinkageHub". Similar to how other hub-based management systems such as GitHub and Docker Hub work, scientists will be able to create, track, and manage datasets related to their genetics experiments. A typical LinkageHub workflow will look like the following:
- Analyze raw sequence data using your current, in-lab, experimental workflow resulting in a set of genomic features associated with your trait
- Import your genomic features into locuspocus, a software library we developed that stores and analyzes genomic features
- Once your data is at a certain "stage", tag your locuspocus datasets, similar to a
- Push your tagged datasets to
- Rinse and repeat
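To make the tag-and-repeat idea a bit more concrete, here is a minimal sketch in plain Python. It does not use the real locuspocus or LinkageHub APIs: `Locus`, `TaggedStore`, and `digest` are hypothetical names made up for illustration, the store lives only in memory, and the loci are invented.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Locus:
    """A genomic feature: chromosome, start, end, and an optional name."""
    chrom: str
    start: int
    end: int
    name: str = ""


def digest(loci):
    """Order-independent SHA-256 of a set of loci, so equal sets hash equally."""
    rows = sorted((l.chrom, l.start, l.end, l.name) for l in loci)
    return hashlib.sha256(json.dumps(rows).encode()).hexdigest()


@dataclass
class TaggedStore:
    """Hypothetical in-memory stand-in for a hub repository of tagged loci sets."""
    tags: dict = field(default_factory=dict)

    def tag(self, label, loci):
        """Freeze the current loci under a label and return their content hash."""
        if label in self.tags:
            raise ValueError(f"tag {label!r} already exists")
        snapshot = frozenset(loci)
        self.tags[label] = snapshot
        return digest(snapshot)

    def pull(self, label):
        """Return the frozen set of loci stored under a label."""
        return self.tags[label]


# Made-up loci standing in for the output of an in-lab workflow.
store = TaggedStore()
v1 = [Locus("chr1", 100, 200, "geneA"), Locus("chr2", 50, 80, "snp1")]
h1 = store.tag("v1-initial", v1)

# Later, the workflow changes and one more locus is found.
v2 = v1 + [Locus("chr3", 10, 20, "geneB")]
h2 = store.tag("v2-new-model", v2)

print(h1[:12], h2[:12])
```

Freezing each tagged set and refusing to overwrite an existing label is one way to keep earlier stages recoverable; the content hash gives a quick check that two people are looking at the same version.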
As your experiment evolves, you can pull down, compare, and cross-analyze tagged datasets. Suppose a researcher is interested in how a new statistical model or algorithm changes the candidate genes related to her study. She can tag her current locuspocus datasets and perform the new analysis, generating a new set of genes based on her updated workflow. She can then compare the output from the two "branches" of her workflow and decide if the new workflow improved her output. If it happens that weeks later a major flaw is found in the new analysis, she can reset by pulling down older, tagged versions of her dataset and, if needed, revert to previous stages. This workflow mimics the fork/branch workflow in version control systems such as git, where you merge "branches" and compare "diffs" between code commits.
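As a rough illustration of what a "diff" between two tagged versions could look like, here is a small sketch using plain Python sets. The function name `diff_loci` and the example loci are hypothetical and are not part of locuspocus or LinkageHub.

```python
def diff_loci(old, new):
    """Return the loci added, removed, and kept between two tagged versions."""
    old, new = set(old), set(new)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        "kept": sorted(old & new),
    }


# Loci are written as (chromosome, start, end, name) tuples; values are made up.
baseline = {("chr1", 100, 200, "geneA"), ("chr2", 50, 80, "snp1")}
candidate = {("chr1", 100, 200, "geneA"), ("chr3", 10, 20, "geneB")}

changes = diff_loci(baseline, candidate)
for kind, loci in changes.items():
    print(kind, loci)
```

This only compares exact matches. A real comparison would probably also need to handle overlapping or shifted coordinates, which is left out here.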
Other workflows are also possible via access to LinkageHub. Tagged datasets can be "pushed" to and "pulled" from the cloud, enabling a distributed "hub" of datasets. This eases sharing of datasets among collaborators and enables access from any internet-connected computer. You could parallelize your workflow by giving each node access to your LinkageHub repository. This encourages good practices such as data reproducibility (especially when combined with workflow management systems and containers). And when studies are ready for publication, tagged datasets can be referenced in publications either by URL or by a DOI by integrating with services such as FigShare or OSF.
Who is involved?
Linkage Analytics has a focus within agricultural genetics. Our mission is:
To develop useful, sustainable, free and open source software tools for the biological sciences. We aim to work closely with scientists who generate data in order to create code using the best and most computationally effective methods possible. We believe that code thrives in an open source environment and are committed to releasing our work under FOSS guidelines.
Throughout this project we will be working closely with leading scientists within plant and animal genomics. The project will take place in two stages. The first is a small pilot involving two current collaborating labs in order to map out use cases and deliver a prototype product. We will then solicit the help of several other leading genetics labs to take part in a larger pilot study!
On the animal side, we are working closely with Dr. Molly McCue who leads the Equine Genetics and Genomics Laboratory at the University of Minnesota. Dr. McCue is an expert in the domain of veterinary medicine and her lab is a global leader in generating biological datasets and developing cutting-edge bioinformatics approaches in the horse.
On the plant side, we will be working with Dr. Ivan Baxter who leads a research team at the Donald Danforth Plant Science Center. Dr. Baxter is an expert in the domain of plant genetics and his laboratory is leading the field in discovering the biological drivers of elemental composition (e.g. phosphorus, potassium, iron) in plants with the goal of improving efficiency and sustainability while also meeting the food and biofuel demands of a growing human population.
And of course, the project is made possible in large part by Mozilla, who will be supporting the deployment of this project to the cloud.
Stay tuned. We are just getting started. As the project progresses, we will be looking for more participants for the larger pilot program. If you're interested, you can reach me on Twitter (@CSciBio) or by email (firstname.lastname@example.org). Watch our GitHub page for updates on the tools being used in this project, or better yet, get involved and contribute!
I will also be blogging regularly about my experience and the progress of the project. Check back here for updates and more stories about LinkageHub.