
MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization

  1. Author:
    Cai, Tian
    Lim, Hansaim
    Abbu, Kyra Alyssa
    Qiu, Yue
    Nussinov, Ruth
    Xie, Lei
  2. Author Address

    CUNY, Grad Ctr, PhD Program Comp Sci, New York, NY 10016 USA.
    CUNY, Grad Ctr, PhD Program Biochem, New York, NY 10016 USA.
    CUNY Hunter Coll, Dept Comp Sci, New York, NY 10065 USA.
    CUNY, Grad Ctr, PhD Program Biol, New York, NY 10016 USA.
    Frederick Natl Lab Canc Res, Computat Struct Biol Sect, Basic Sci Program, Frederick, MD 21702 USA.
    Tel Aviv Univ, Sackler Sch Med, Dept Human Mol Genet & Biochem, IL-69978 Tel Aviv, Israel.
    Cornell Univ, Helen & Robert Appel Alzheimers Dis Res Inst, Feil Family Brain & Mind Res Inst, Weill Cornell Med, New York, NY 10021 USA.
    1. Year: 2021
    2. Date: Apr 26
  1. Journal: Journal of Chemical Information and Modeling
  2. Publisher: AMER CHEMICAL SOC
    1. Volume: 61
    2. Issue: 4
    3. Pages: 1570-1582
  3. Type of Article: Article
  4. ISSN: 1549-9596
  1. Abstract:

    Small molecules play a critical role in modulating biological systems. Knowledge of chemical-protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein, but they mostly fail when sequence homology between an unannotated protein and those with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionarily divergent unannotated proteins, whose ligands cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning of unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignments (MSAs) to capture functional relationships between proteins without knowledge of their structure or function. Following DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical-protein interactions. In the benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSA, sequence distillation, and triplet pretraining contribute critically to the success of DISAE. The interpretability analysis of DISAE suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein-coupled receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.

External Sources

  1. DOI: 10.1021/acs.jcim.0c01285
  2. PMID: 33757283
  3. PMCID: PMC8154251
  4. WOS: 000644728600008

Library Notes

  1. Fiscal Year: FY2020-2021
NCI at Frederick
