Authors:

Po-Yu Liang, Xueting Huang, Tibo Duran, Andrew J. Wiemer, Jun Bai

Paper:

https://arxiv.org/abs/2408.08341

Introduction

Peptides play a crucial role in various biological processes and have significant applications in drug discovery and biotechnology. Traditional methods for peptide discovery, such as sequence-based and structure-based approaches, often require extensive datasets, which can limit their effectiveness. This study proposes a novel method that utilizes autoencoder-shaped models to explore the protein embedding space and generate novel peptide analogs using protein language models. This method requires only a single sequence of interest, avoiding the need for large datasets.

Related Research

Lab Experiment Based Method

Traditional methods for identifying and optimizing peptide-based inhibitors include rational design, phage display, and directed evolution. These methods have limitations such as requiring extensive sequence and structure information, significant time and effort, and resources for screening.

Deep Learning Based Method

Recent studies have explored generating peptides with desired properties using deep learning methods. These methods focus on amino acid sequences and three-dimensional structures, utilizing models such as conditional variational autoencoders, generative adversarial networks, and Markov chain Monte Carlo methods.

Method

Hypothesis

The study hypothesizes that peptides with similar embeddings are likely to share higher property similarities, even if their sequence expressions differ. This hypothesis is inspired by word embedding studies, where vector abstract feature representations learned from deep learning models capture semantic meaning.

Definition

The proposed method employs an autoencoder-shaped model to learn the feature embedding. The dataset is defined as \(X = \{x_0, \dots, x_i, \dots, x_n\}\), where \(x_i\) is the amino acid sequence of a protein. The method is defined as \(\hat{y}_\tau = g(f(x_i) + \delta_\tau)\), where \(\hat{y}_\tau\) is the generated amino acid sequence of the protein analog at step \(\tau\), \(f(\cdot)\) is a model projecting a protein sequence into the latent space, \(\delta_\tau\) represents the noise added to the protein embedding at step \(\tau\), and \(g(\cdot)\) projects the noised embedding back to a sequence.

Overview of the Proposed Method

The proposed method involves three main steps, sketched in code after the list below: embedding, noise introduction, and decoding.

  1. Embedding: Projects peptide sequences from a discrete space into a continuous latent space.
  2. Noise Introduction: Introduces noise into the embeddings to explore the latent space systematically.
  3. Decoding: Converts the noised embeddings back into peptide sequences.
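
Taken together, the three steps compose into a single generation loop following the definition \(\hat{y}_\tau = g(f(x_i) + \delta_\tau)\). The following is a minimal Python sketch of that loop; `embed`, `decode`, the step count, and the fixed noise magnitude are placeholders assumed for illustration, not the authors' implementation.

```python
import numpy as np

def generate_analogs(sequence, embed, decode, n_steps=10, noise_scale=0.1, seed=0):
    """Sketch of the embed -> noise -> decode loop described above.

    `embed` and `decode` stand in for f(.) and g(.) (e.g., a ProtT5 or ESM-2
    encoder and a matching decoder); the fixed uniform-noise magnitude is an
    illustrative placeholder for the paper's adaptive scheme.
    """
    rng = np.random.default_rng(seed)
    z = embed(sequence)                      # step 1: sequence -> latent embedding, f(x_i)
    analogs = []
    for _ in range(n_steps):
        delta = rng.uniform(-noise_scale, noise_scale, size=z.shape)  # step 2: delta_tau
        analogs.append(decode(z + delta))    # step 3: back to a sequence, g(f(x_i) + delta_tau)
    return analogs
```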

To validate the method, two embedding models were utilized: ProtT5 and ESM-2.

Embedding

Two state-of-the-art models, ProtT5 and ESM-2, were used to embed peptide sequences. Both were used with their published pre-trained weights, leveraging their capabilities in representing protein sequences; a usage sketch follows the list below.

  1. ProtT5 Embedding: Utilized the "ProtT5-XL-UniRef50" model, which produces embeddings of size 1,024.
  2. ESM-2 Embedding: Utilized the ESM-2 model with 150 million parameters, which produces embeddings of size 640.
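
Both encoders are available as public checkpoints on Hugging Face. The sketch below shows one way to obtain the embeddings with the transformers library, assuming the "Rostlab/prot_t5_xl_uniref50" and "facebook/esm2_t30_150M_UR50D" checkpoints correspond to the models described; the toy peptide is for illustration only.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel, AutoTokenizer, EsmModel

device = "cuda" if torch.cuda.is_available() else "cpu"
peptide = "ACDEFGHIK"  # toy peptide, not from the paper

# ProtT5: expects space-separated residues, with rare amino acids mapped to X
t5_tok = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
t5 = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").to(device).eval()
spaced = " ".join(re.sub(r"[UZOB]", "X", peptide))
t5_in = t5_tok(spaced, return_tensors="pt").to(device)
with torch.no_grad():
    t5_emb = t5(**t5_in).last_hidden_state   # (1, seq_len + 1, 1024)

# ESM-2 (150M parameters): raw sequence in, 640-dimensional embeddings out
esm_tok = AutoTokenizer.from_pretrained("facebook/esm2_t30_150M_UR50D")
esm = EsmModel.from_pretrained("facebook/esm2_t30_150M_UR50D").to(device).eval()
esm_in = esm_tok(peptide, return_tensors="pt").to(device)
with torch.no_grad():
    esm_emb = esm(**esm_in).last_hidden_state  # (1, seq_len + 2, 640)
```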

Noise

Noise drawn from a uniform distribution was introduced into the peptide embeddings to explore the latent space, and its magnitude was adjusted adaptively to balance generating sequences that remain similar to the original against sequences that diverge from it.
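
The section above does not spell out a single noise schedule, so the snippet below only illustrates the idea: uniform noise whose scale is adapted based on how similar the decoded analog is to the original. The thresholds and scaling factors are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise(shape, scale):
    """Draw uniform noise delta_tau in [-scale, scale] to add to an embedding."""
    return rng.uniform(-scale, scale, size=shape)

def update_scale(scale, similarity, low=0.6, high=0.9, grow=1.5, shrink=0.5):
    """Illustrative adaptive rule: widen the noise when generated analogs stay
    too similar to the original sequence, narrow it when they drift too far.
    Thresholds and factors are assumptions, not the paper's values."""
    if similarity > high:
        return scale * grow
    if similarity < low:
        return scale * shrink
    return scale
```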

Decoder

Different decoding functions were employed for ProtT5 and ESM-2, reflecting their distinct architectures; a sketch of an ESM-2 decoder follows the list below.

  1. ProtT5 Decoder: Utilized the model's pre-trained decoder to project the noised embeddings back into sequences.
  2. ESM-2 Decoder: Trained a new decoder module, designed to be symmetric with the encoder’s architecture.
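
The exact architecture of the trained ESM-2 decoder is not reproduced here. Below is a minimal PyTorch sketch of what such a decoder head could look like; the depth, head count, and vocabulary size are assumptions made for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Esm2Decoder(nn.Module):
    """Illustrative decoder mapping 640-dim ESM-2 token embeddings back to
    amino-acid logits. Depth, heads, and vocabulary size are assumptions."""

    def __init__(self, embed_dim=640, num_layers=6, num_heads=8, vocab_size=33):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_logits = nn.Linear(embed_dim, vocab_size)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, 640) noised ESM-2 representations
        hidden = self.blocks(embeddings)
        return self.to_logits(hidden)  # (batch, seq_len, vocab_size)

# Training would minimize cross-entropy between the predicted logits and the
# original residue tokens, so that decode(embed(x)) reconstructs x.
```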

Data & Experiment Setup

Data Source and Filtering

The BioLiP dataset was used to test the method. After filtering, the final dataset contained 4,758 unique peptide sequences. The UniProtKB/Swiss-Prot dataset was used to train the ESM-2 decoder module.

Baseline Models

Two baseline approaches were compared with the proposed method: randomly generated sequences and sequences generated with BLOSUM substitution matrices.
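
As a sketch of how such baselines could be constructed (the sampling scheme, in particular the exponential BLOSUM62 weighting, is an illustrative assumption rather than the paper's exact procedure):

```python
import random
from Bio.Align import substitution_matrices

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BLOSUM62 = substitution_matrices.load("BLOSUM62")
rng = random.Random(0)

def random_baseline(sequence):
    """Baseline 1: replace every position with a uniformly random amino acid."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in sequence)

def blosum_baseline(sequence):
    """Baseline 2: substitute each residue with probability weighted by its
    BLOSUM62 score (the exponential weighting is an illustrative choice)."""
    out = []
    for aa in sequence:
        weights = [2.0 ** BLOSUM62[aa, b] for b in AMINO_ACIDS]
        out.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(out)
```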

Evaluation Metrics

Three different indicators were used to evaluate the similarity between original and generated peptide sequences: Morgan fingerprints, RDKit descriptors, and QSAR descriptors.
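
Morgan fingerprints and RDKit descriptors can be computed directly from peptide sequences with RDKit, as in the sketch below; the fingerprint radius and bit length are common defaults assumed here, and the specific QSAR descriptor set used in the paper is not reproduced.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

def morgan_similarity(seq_a, seq_b, radius=2, n_bits=2048):
    """Tanimoto similarity between Morgan fingerprints of two peptides.
    Radius and bit length are common defaults, assumed for illustration."""
    mol_a = Chem.MolFromSequence(seq_a)  # build a molecule from a peptide sequence
    mol_b = Chem.MolFromSequence(seq_b)
    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)

def rdkit_descriptors(seq):
    """Compute the full set of RDKit molecular descriptors for a peptide."""
    mol = Chem.MolFromSequence(seq)
    return {name: fn(mol) for name, fn in Descriptors.descList}

print(morgan_similarity("ACDEFG", "ACDEFA"))  # toy sequences for illustration
```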

Comparative Analysis

The method was applied to peptide ligands of the TIGIT receptor, identified through wet-lab experiments. Molecular Dynamics (MD) simulations were employed for further validation.

Result and Discussion

Overall Result

The proposed method outperformed the baseline models in average similarity, both for newly generated sequences overall and across peptides of different lengths. ProtT5 achieved higher average RDKit descriptor similarity, while ESM-2 achieved higher average Morgan fingerprint and sequence QSAR similarities.

Physics Modeling Validation

MD simulations were used to validate the method on sequences identified through wet-lab experiments. The generated analogs exhibited behavior similar to the original sequences, with comparable or improved binding affinity.

Conclusion

The proposed method addresses the challenge of generating peptides with desired properties by leveraging autoencoder models to explore the protein embedding space. The method significantly outperforms baseline models and demonstrates robustness through MD simulations. Future work will focus on testing the method in actual wet-lab experiments to further validate its effectiveness.

Code:

https://github.com/LabJunBMI/Latent-Space-Peptide-Analogues-Generation
