Precision medicine
For several years, scientists and researchers have been working to advance Precision Medicine, the customization of medical practices and treatments to individual patients, and its techniques for the betterment of human health. As discussed in the workshop, 90% of what affects our health is determined not by medical care but by our environment, socio-economic circumstances, lifestyle choices, and genetics [3]. In January 2015, U.S. President Obama called the country’s attention to the importance of Precision Medicine during his State of the Union Address, when he announced the launch of The Precision Medicine Initiative. The Initiative is detailed on the White House website with the following mission statement:
To enable a new era of medicine through research, technology, and policies that empower patients, researchers, and providers to work together toward development of individualized care [4].
The White House allocated a total of $215 million to the Department of Health and Human Services (DHHS) in support of the initiative in 2016. Also part of President Obama’s State of the Union Address was the National Cancer Moonshot initiative, a collaborative effort to end cancer. On February 1, 2016, the White House announced a $1 billion initiative to jumpstart the program and established “a new Cancer Moonshot Task Force – to be led by the Vice President – to focus on making the most of Federal investments, targeted incentives, private sector efforts from industry and philanthropy, patient engagement initiatives, and other mechanisms to support cancer research and enable progress in treatment and care” [5]. These federal initiatives bring together experts from a variety of fields to work toward the common goal of improving quality of life for all of us. Doing so requires collaboration among experts around the world and increases the reliance on dependable data-sharing practices.
The move toward Precision Medicine in healthcare models is accelerating the need for computational infrastructure that supports collaborations among a wide range of professionals. The infrastructure must support the secure exchange of data between disparate groups. Currently, the biggest barrier is uncertainty about the regulations covering access to data collected from patients and research participants, namely Protected Health Information (PHI) as governed by the Federal Information Security Management Act (FISMA) and the Health Insurance Portability and Accountability Act (HIPAA). Advances in the private cloud space are beginning to address some of these issues. Technologists are experimenting with container and enclave technologies, but the main issues with private cloud use in this context continue to be data security and costs (data ingress/egress and computation for large-scale analysis).
In order to provide accurate and timely care tailored to the individual, much work remains to standardize data collection, input methods, coding, storage, access, and analysis. Medical data is increasingly stored in Electronic Health Records (EHRs), but not all providers use a common data format, limiting the ability to apply common analysis tools across electronic data warehouses. Additionally, healthcare information varies in nature and can be difficult to collect and enter into existing systems (e.g., symptoms, interactions, lab reports, and photographs). The systems may not code these datasets correctly, resulting in unstructured data in the record. Unstructured data, such as clinical notes, contains valuable care information, but that information is difficult to extract, which limits machine understanding and analysis. Examples of projects that use natural language processing to convert unstructured data into structured data were discussed, but this remains an evolving process.
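The sketch below is illustrative only and is not drawn from any project discussed at the workshop: it uses simple pattern matching to pull a few coded fields out of a free-text note. Real clinical systems rely on full natural language processing pipelines, but the goal is the same: turning unstructured notes into structured, machine-readable records. The note text and field names are hypothetical.

```python
import re

# Hypothetical free-text clinical note (not real patient data).
NOTE = """
Patient reports intermittent chest pain for 3 days.
BP 142/91 mmHg, HR 88 bpm, Temp 37.2 C.
Current meds: lisinopril 10 mg daily, metformin 500 mg BID.
"""

def extract_vitals(note: str) -> dict:
    """Pull blood pressure, heart rate, and temperature out of note text."""
    patterns = {
        "systolic_bp":   r"BP\s+(\d{2,3})/\d{2,3}",
        "diastolic_bp":  r"BP\s+\d{2,3}/(\d{2,3})",
        "heart_rate":    r"HR\s+(\d{2,3})\s*bpm",
        "temperature_c": r"Temp\s+(\d{2}(?:\.\d)?)\s*C",
    }
    vitals = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, note)
        if match:
            vitals[field] = float(match.group(1))
    return vitals

if __name__ == "__main__":
    print(extract_vitals(NOTE))
    # {'systolic_bp': 142.0, 'diastolic_bp': 91.0, 'heart_rate': 88.0, 'temperature_c': 37.2}
```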
Data structure standards and easier access to reliable raw data were important topics of discussion during the workshop. Many arguments were raised for developing an accepted standard for data structure. According to the National Human Genome Research Institute, a division of the National Institutes of Health, the human genome consists of about three billion base pairs [6], yet about 99.5% of our DNA is the same as that of all other humans [7]. A standard method of separating out the small fraction (roughly 0.5%) of the human genome that varies between individuals would significantly reduce data size and help ease management, transfer, and storage constraints between collaborators and analysis sites. Standardizing data structure would also reduce the time scientists spend going back to raw data for analysis and increase the opportunities for machine learning and cross-provider databases. It should be understood, however, that going back to raw data remains necessary in some cases, depending on the research question for which the raw data and metadata were originally generated during sequencing. A particular data structure can also act as a barrier for scientists who work from the structured data rather than the raw data. This is especially true when scientists from different focus areas wish to compare datasets to identify similarities or differences between organisms (e.g., comparing a human genome to a plant genome). To this point, workshop discussions repeatedly highlighted the need for easier community access to reliable raw data and the metadata of previously sequenced samples.
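As a rough illustration of the storage argument above, the following sketch records only the positions where a sample differs from a shared reference, in the spirit of variant-call formats such as VCF. It is a simplified assumption-laden example (sequences already aligned, single-base substitutions only, toy sequences), not a description of any tool discussed at the workshop.

```python
from typing import List, Tuple

# (position, reference_base, sample_base)
Variant = Tuple[int, str, str]

def call_simple_variants(reference: str, sample: str) -> List[Variant]:
    """Return single-base differences between a sample and the reference.

    Assumes the two sequences are already aligned and equal in length;
    real pipelines also handle insertions, deletions, and alignment.
    """
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]

def reconstruct(reference: str, variants: List[Variant]) -> str:
    """Rebuild the sample sequence from the reference plus its variants."""
    bases = list(reference)
    for pos, _ref, alt in variants:
        bases[pos] = alt
    return "".join(bases)

if __name__ == "__main__":
    ref    = "ACGTACGTACGT"   # toy reference sequence
    sample = "ACGTACCTACGA"   # toy individual sequence
    variants = call_simple_variants(ref, sample)
    print(variants)           # [(6, 'G', 'C'), (11, 'T', 'A')]
    assert reconstruct(ref, variants) == sample
```

Because individual genomes are nearly identical, storing only the list of differences plus one shared reference is far smaller than shipping every full sequence between collaborators and analysis sites.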
Metagenomics
Metagenomics is defined as the direct genetic analysis of genomes contained within an environmental sample [8] and applies to any category of organisms. Over the past several years, the study of metagenomics has provided critical insight into the functions and relationships of the human body and the world we live in, and it is becoming the focus of an increasing number of research projects and scientific organizations. One such project is the White House Microbiome Initiative, announced in May 2016. Although the initiative was announced after this workshop was held, it highlights the importance of expanding this field of study. The initiative has three specific goals, developed over the course of a year-long fact-finding process:
1. Supporting interdisciplinary research to answer fundamental questions about microbiomes in diverse ecosystems.
2. Developing platform technologies that will generate insights and help share knowledge of microbiomes in diverse ecosystems and enhance access to microbiome data.
3. Expanding the microbiome workforce through citizen science and educational opportunities [9].
The increasing focus on the field of metagenomics, combined with rapid advances in technology, is creating massive amounts of data at rates that are difficult to predict. Close coordination and regular dialogue between research scientists, technology experts, and policymakers are therefore crucial.
Understanding the underlying functions, relationships, and interactions between parts of the same organism and between organisms relies not only on significant computational power but also on the ability to share datasets with other scientists and benefit from their expertise. This happens not only within a single discipline but across disciplines. Examples presented in the workshop showed how cross-disciplinary studies can lead to the creation of extensive and powerful systems biology models. Intersections between such models can be found at great evolutionary distances, including between trees and humans, demonstrating the strongly conserved nature of many basic biological functions [10]. Researchers are also studying microbiomes in the earth’s atmosphere and how their distribution patterns relate to climate patterns. Big data approaches are fundamentally changing the way biological research is performed. As such, the need for data access, data repositories, and means of performing rapid, large-scale bulk transfers was clear.
Many examples of scientific discoveries that resulted from sharing datasets between researchers and laboratories were discussed. However, two primary barriers currently prevent scientists from using one another’s datasets on a larger scale: the difficulty of locating targeted datasets in a timely fashion and the time required to verify a dataset’s quality and integrity. The combination of these issues often results in a researcher starting from scratch and sequencing their own samples, made easier by the prevalence of low-cost sequencers; this cycle compounds the problem of scattered datasets for further analysis. Still, it was noted that structured databases do exist and are heavily utilized in the community. The Department of Energy’s Joint Genome Institute (JGI) was founded in 1997 and has been a leader in large-scale, sequence-based science ever since. External data sources play a key role in metagenomics efforts at JGI; these sources include external collaborators and community archives, such as the National Center for Biotechnology Information (NCBI) Sequence Read Archive. Biological data made available by JGI, NCBI, the European Bioinformatics Institute (EMBL-EBI), and many other data repositories are invaluable resources, and their best practices and lessons learned should be used as a baseline in the development of additional databases serving the genomics community.
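On the first barrier, locating datasets, community archives such as NCBI already expose programmatic search interfaces. The minimal sketch below assumes the Biopython package is installed and uses NCBI’s E-utilities to search the Sequence Read Archive; the email address and search term are placeholders for illustration, not examples taken from the workshop.

```python
from Bio import Entrez

# NCBI asks clients of the E-utilities to identify themselves (placeholder).
Entrez.email = "your.name@example.org"

def search_sra(term: str, max_results: int = 5) -> list:
    """Return NCBI SRA record IDs matching a free-text search term."""
    handle = Entrez.esearch(db="sra", term=term, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

if __name__ == "__main__":
    # Illustrative query: publicly available soil metagenome runs.
    print(search_sra("soil metagenome[Organism]"))
```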