A new gene variant dataset is a step toward the future of personalized medicine, researchers report.
Each person has about 4 million sequence differences in their genome relative to the reference human genome. These differences are known as variants. A central goal in precision medicine is understanding which of these variants contribute to disease in a particular patient. Therefore, much of the human genome annotation effort is devoted to developing resources to help interpret the relative contribution of human variants to different observable phenotypes—i.e., determining variant impact.
The new study generated a large, organized dataset, called EN-TEx, from four individual donors using high-quality genome sequencing to identify all the variants and many different assays to determine their effect on molecular phenotypes in 25 different tissues. The team reports its findings in the journal Cell.
“Our work helps provide a better annotation of the genome and a better understanding of variant impact,” says Mark Gerstein, professor of biomedical informatics and member of the Yale University Section of Biomedical Informatics & Data Science. “An average person’s personal genome has variants in 4 million places. We’re trying to figure out which of these lead to meaningful differences.”
“This work represents the type of innovative large-scale data mining and teamwork that Yale is well-poised to create, coordinate, or participate in,” says Lucila Ohno-Machado, professor of medicine and of biomedical informatics and data science, and chair of the section.
The team used long-read sequencing technologies to determine the diploid genomes of four donors with high accuracy. Everyone has a diploid genome: two copies of each of the 22 autosomes plus two sex chromosomes, with one set of chromosomes inherited from our mother and one from our father. “Now, for each position on the genome, we can look for the differences between mom and dad in many different functional assays in a perfectly balanced way, allowing us to accurately ascertain variant effect in many tissues,” says Gerstein.
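To give a concrete sense of what a phased diploid genome makes possible, the sketch below tests whether reads from a functional assay are split evenly between the maternal and paternal alleles at a heterozygous variant. It is a minimal illustration in Python, not the consortium's pipeline; the read counts and the simple binomial test are assumptions made for this example.

```python
# Minimal sketch of an allelic-imbalance test at a heterozygous variant,
# assuming phased maternal/paternal read counts from a functional assay
# (e.g., RNA-seq or ChIP-seq in one tissue). Illustrative only; this is
# not the EN-TEx analysis pipeline.
from scipy.stats import binomtest

def allelic_imbalance(maternal_reads: int, paternal_reads: int, alpha: float = 0.05):
    """Test whether assay reads deviate from the 50/50 split expected
    when both parental alleles behave identically."""
    total = maternal_reads + paternal_reads
    result = binomtest(maternal_reads, n=total, p=0.5)
    return {
        "maternal_fraction": maternal_reads / total,
        "p_value": result.pvalue,
        "imbalanced": result.pvalue < alpha,
    }

# Hypothetical read counts at one heterozygous site in one tissue assay.
print(allelic_imbalance(maternal_reads=72, paternal_reads=28))
```

A variant that consistently skews the signal toward one parental allele across tissues would be a candidate for having a real molecular effect.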
The team developed a variety of statistical and deep learning approaches to leverage the dataset for practical applications. In particular, they built statistical models that identify subsets of regulatory regions in the human genome highly associated with disease variants. They also found many new linkages between variants and changes in nearby gene expression, connecting impactful but uncharacterized variants to genes with known function. This considerably expands previously determined catalogs, especially in many hard-to-assay tissues.
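As a rough illustration of what "regulatory regions highly associated with disease variants" means in practice, the sketch below runs a simple enrichment test asking whether disease-linked variants fall inside a set of regulatory regions more often than expected by chance. The counts and the Fisher's exact test are assumptions for illustration, not the statistical models built in the study.

```python
# Minimal sketch of a variant-enrichment test: are disease-associated
# variants over-represented inside a set of regulatory regions compared
# with the rest of the genome? Counts are hypothetical.
from scipy.stats import fisher_exact

# 2x2 contingency table (hypothetical counts):
#                       in regulatory regions   elsewhere
# disease variants              120                 880
# background variants         4_000              95_000
table = [[120, 880], [4_000, 95_000]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, one-sided p = {p_value:.2e}")
```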
More fundamentally, the team developed a deep learning model that was able to predict whether a variant would disrupt a binding site for a regulatory factor—a protein that binds to specific sequences in the genome to turn nearby genes on or off. Interestingly, they found that to accurately predict this, they needed to look beyond just the binding site itself and consider a large genomic region around the site. The key to whether a binding site would be affected was the presence of nearby binding sequences for other regulatory factors. “Think of regulatory factors as the legs of the Lunar Module,” says Gerstein. “If it has four legs and one leg doesn’t work, the three other legs can anchor the defective leg.” Similarly, the anchoring of other regulatory factors might stabilize the disrupted binding site and make it less sensitive to variants.
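A toy version of this idea, a classifier that sees a wide sequence window around the variant rather than only the binding motif itself, might look like the small convolutional network sketched below. The architecture, window size, and labels are assumptions chosen for illustration and are not the deep learning model described in the paper.

```python
# Toy sketch of a classifier that predicts whether a variant disrupts a
# transcription-factor binding site from a wide sequence window around it,
# so that nearby binding sequences for other regulatory factors are visible
# to the model. Architecture and window size are illustrative assumptions.
import torch
import torch.nn as nn

WINDOW = 1000  # bases of context around the variant (assumed)

class BindingDisruptionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=15, padding=7),  # scan for motif-like patterns
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=9, padding=4),  # combine nearby motifs
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                      # strongest signal in the window
        )
        self.classifier = nn.Linear(64, 1)                # disrupted vs. not

    def forward(self, one_hot_seq):
        # one_hot_seq: (batch, 4, WINDOW) one-hot encoded A/C/G/T window
        features = self.conv(one_hot_seq).squeeze(-1)
        return torch.sigmoid(self.classifier(features))

# Hypothetical batch of 8 one-hot encoded windows centered on variants.
model = BindingDisruptionNet()
dummy = torch.zeros(8, 4, WINDOW)
print(model(dummy).shape)  # torch.Size([8, 1])
```

In practice such a model would be trained on variants labeled by their observed allele-specific binding behavior; here it is run only on dummy input to show the shapes involved.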
One limitation of the resource is that it profiles only four individuals, all of European descent. The team would like to eventually enlarge the study to encompass hundreds of individuals with more diverse ancestries.
Overall, these advances will allow researchers and clinicians to better interpret potential disease-causing variants in an individual, connecting them to regulatory sites, nearby genes, and their tissue of action. “We’ve provided a consistent, beautiful data set and annotation resource for making these interpretations,” says Gerstein.
The global team was assembled by the National Human Genome Research Institute (NHGRI) within NIH, as part of NHGRI’s ENCODE consortium, which aims to functionally annotate the genome.
Source: Isabella Backman for Yale University