Naegle Lab Software

ASPEN: Reconstructing protein divergence histories

Understanding the divergence history of proteins has been important to furthering our knowledge of protein function. Many methods exist to infer the history of protein duplications, which give rise to paralogs, and the passage of those proteins on to descendant species, which gives rise to orthologs.

However, all methods suffer from two fundamental issues: 1) the true model of evolution is hidden from us, so gauging the accuracy of a model for real protein sequences is impossible, and 2) model inference is extremely sensitive to the selection of input sequences (orthologs in living species) and to the alignment method used.

We hypothesized that, despite the variance introduced by using different subsamples of the ortholog sequences in tree reconstruction, consistent relationships across an ensemble of trees would indicate true signal. ASPEN is a method that utilizes this hypothesis to reconstruct trees that are most consistent with observations in an ensemble of trees. We found that reconstructed ASPEN trees are more accurate than the traditional approach of creating one tree from one alignment of all sequences.
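The idea of looking for relationships that stay consistent across an ensemble can be illustrated with a minimal Python sketch. This is not ASPEN's algorithm, which is described in the preprint and implemented in the GitHub code. It only assumes, for illustration, that each tree in the ensemble has been reduced to a rooted topology over the same paralog labels, written as nested tuples. The function names `clades` and `clade_support` and the labels A to D are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Collect the leaf set under every internal node except the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):  # a leaf is just its label
            return frozenset([node])
        members = frozenset().union(*(visit(child, False) for child in node))
        if not is_root:
            found.add(members)
        return members

    visit(tree, True)
    return found


def clade_support(ensemble):
    """Fraction of trees in the ensemble that contain each clade."""
    counts = Counter()
    for tree in ensemble:
        counts.update(clades(tree))
    return {clade: n / len(ensemble) for clade, n in counts.items()}


# Hypothetical ensemble: three topologies, each standing in for a tree
# built from a different subsample of ortholog sequences.
ensemble = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
]

support = clade_support(ensemble)
for clade, frac in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), round(frac, 2))
```

Groupings that appear in most or all of the trees are the ones the hypothesis treats as likely signal; groupings that appear in only a few trees reflect the variance introduced by subsampling.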

Additionally, using a technique to measure the reproducibility of a tree from all sequences (Precision), we find that Precision is a direct proxy for the likely accuracy of the all-sequence tree topology. In conclusion, we find that subsampling from available ortholog sequences is a powerful technique for identifying the likely accuracy of an all-sequence tree, reconstructing more accurate trees, and indicating the number of likely, but diverse, models of evolution one should consider in subsequent analysis or interpretation.
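This page does not give the formula for Precision, so the following Python sketch uses a hypothetical stand-in: the average fraction of groupings in the all-sequence tree that are recovered by trees built from subsamples. It is meant only to show what measuring the reproducibility of a tree could look like. The tree encoding (nested tuples over shared labels) and the names `clades` and `reproducibility` are assumptions, not part of ASPEN.

```python
def clades(tree):
    """Collect the leaf set under every internal node except the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):  # a leaf is just its label
            return frozenset([node])
        members = frozenset().union(*(visit(child, False) for child in node))
        if not is_root:
            found.add(members)
        return members

    visit(tree, True)
    return found


def reproducibility(reference, ensemble):
    """Mean fraction of the reference tree's clades found in each ensemble tree.

    Stand-in score only; the actual Precision measure is defined in the preprint.
    """
    ref = clades(reference)
    if not ref or not ensemble:
        return 0.0
    per_tree = [len(ref & clades(tree)) / len(ref) for tree in ensemble]
    return sum(per_tree) / len(per_tree)


# Hypothetical all-sequence tree and subsampled trees.
all_sequence_tree = ((("A", "B"), "C"), "D")
subsampled_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
]

score = reproducibility(all_sequence_tree, subsampled_trees)
print(f"reproducibility: {score:.2f}")
```

A score near 1 means the subsampled trees almost always recover the groupings in the all-sequence tree; a low score means the all-sequence topology is not reproduced when the input sequences change.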

Preprint is available via bioRxiv

Code is available via GitHub

Simulation materials are available via Figshare



Our Research Areas

  • Databases and resources for proteome-level PTM information

    A foundation of our work is the ability to have proteome information at our fingertips. This includes the current knowledge of tyrosine phosphorylation, quantitative measurements made at those sites, and related protein annotations. In enabling this research for our own lab, we also construct tools that can be used by the broader research community, with a focus on extensibility and reproducibility.

  • Inferring biological insight from high-dimensional data

    In her Ph.D. work, Kristen Naegle developed ensemble approaches to clustering biological data, demonstrating that the function of tyrosine phosphorylation can be inferred from quantitative measurements of dynamic changes in network phosphorylation in cells responding to growth factor stimulation. During her postdoctoral work, Dr. Naegle went on to show that robustness in clustering was predictive of protein interactions, and she inferred novel interactions in the epidermal growth factor receptor network.

  • SH2 domain binding

    A major piece of ongoing work in the lab is to develop methods that will allow us to identify which phosphotyrosines will be recognized by a binding domain. Specifically, we hope to extend this research to predict the relative competition between domains for phosphotyrosine sequences and between phosphotyrosine sequences for domains. This information will enable us to begin to predict the consequences of context differences between cells responding to the same extracellular cue. We will consider this effort successful when these predictions can be used to explain complex network phenomena.

  • Engineering enzymatic interactions

    A major barrier to the study of protein phosphorylation is the difficulty of creating phosphorylated proteins for in vitro study. The Naegle lab has been developing a cheap and fast method for producing phosphorylated proteins that capitalizes on observed enzymatic specificity.