Representation Learning for Biological Sequence Data
Abstract:
Biology is governed to a large extent by sequence information. Billions of nucleotide characters encode the instructions of the human genome, and millions of amino acid characters encode the instructions of the human proteome. While modern sequencing methods make this information increasingly available, we do not yet understand exactly what these strings of characters mean or how they interact with each other to regulate biological processes. Many processes related to biological sequences involve very long-range interactions and highly structured output spaces, and have few labeled examples. We argue that this data is too complex for humans to interpret. Inspired by the successes of deep learning in other fields, we hypothesize that deep learning methods are well positioned not only to learn the function of sequences, but also to aid our understanding of how biology works. The goal of this proposal is to answer a twofold question. First, can we develop models that accurately represent and predict functional properties of biological sequences? Second, if so, can we interpret the results of these models to gain biological insights? In particular, we focus on two tasks: (1) regulatory profile and gene expression prediction from genomic sequences, and (2) protein-protein interaction prediction from protein sequences. Finally, we will interpret what the models have learned to understand how these functional processes occur.
Committee:
- Vicente Ordóñez Román, Committee Chair (Department of Computer Science, SEAS, UVA)
- Yanjun Qi, Advisor (Department of Computer Science, SEAS, UVA)
- Yangfeng Ji (Department of Computer Science, SEAS, UVA)
- Clint Miller (Public Health Sciences, School of Medicine, UVA)
- Casey Greene (Biochemistry and Molecular Genetics, School of Medicine, University of Colorado)