BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction

Authors:

Yifei Yang, Runhan Shi, Zuchao Li, Shu Jiang, Bao-Liang Lu, Yang Yang, Hai Zhao

Paper:

Introduction

Retrosynthesis analysis is a cornerstone of synthetic chemistry, particularly in drug discovery and organic chemistry. It involves identifying a set of precursor molecules that can be used to synthesize a target molecule. Despite the development of various computational tools over the past decade, AI-based systems often struggle to generalize across diverse reaction types and explore alternative synthetic pathways. This paper introduces BatGPT-Chem, a large language model with 15 billion parameters, designed to enhance retrosynthesis prediction. By integrating chemical tasks through a unified framework of natural language and SMILES notation, BatGPT-Chem aims to improve the efficiency and creativity of retrosynthetic analysis.

Related Work

Computer-Aided Synthesis Planning (CASP)

Over recent decades, various computer-aided synthesis planning (CASP) methods have been developed to address the challenges of retrosynthesis. These methods can be broadly categorized into three types:

Template-Based Methods: These methods use reaction templates, which are subgraph patterns that illustrate changes in atoms and bonds between a product molecule and its reactants. While they offer high interpretability, their scope is limited by the template library.
Template-Free Methods: These methods treat retrosynthesis as a sequential generation problem, transforming products into potential precursors in an end-to-end fashion.
Semi-Template Methods: These methods partition retrosynthesis into two phases: identifying the reaction center to generate intermediate molecules (synthons) and then augmenting these synthons to form precursors.

Limitations of Existing AI Models

Despite significant progress, existing AI models for retrosynthesis face several limitations:
1. Deficiency in Molecular and Chemical Reaction Knowledge: Traditional AI models often lack comprehensive knowledge from chemical literature.
2. Neglect of Reaction Conditions: Current models often leave out substances that do not contribute atoms to the product, such as catalysts and solvents, which reduces the interpretability and reliability of their predictions.
3. Limited Zero-Shot Prediction Capability: Typical AI models struggle with out-of-distribution predictions, particularly in zero-shot retrosynthesis tasks.
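
To make the second limitation concrete: a reaction SMILES has the form `reactants>agents>products`, where the middle field holds catalysts, solvents, and other species that are not incorporated into the product. The sketch below splits such a string into its three fields. The helper name `split_reaction` is hypothetical, and the esterification example is illustrative rather than taken from the paper.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    reactants: list[str]
    agents: list[str]
    products: list[str]


def split_reaction(rxn_smiles: str) -> Reaction:
    """Split 'reactants>agents>products' into three lists of molecule SMILES."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # Molecules within one field are separated by '.'; an empty field gives [].
    fields = [[m for m in part.split(".") if m] for part in parts]
    return Reaction(*fields)


# Illustrative example (not from the paper): acid-catalysed esterification.
rxn = split_reaction("CC(=O)O.CCO>OS(=O)(=O)O>CC(=O)OCC")
print(rxn.reactants)  # ['CC(=O)O', 'CCO']
print(rxn.agents)     # ['OS(=O)(=O)O']
print(rxn.products)   # ['CC(=O)OCC']
```

A model trained only on the first and last fields never sees the sulfuric acid catalyst, which is the kind of omission this limitation describes.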

Research Methodology

Unified Modeling Approach

BatGPT-Chem leverages the widely-used SMILES notation as a specialized chemical language, integrating it with natural language through Large Language Models (LLMs). The model combines open-source and closed-source datasets into a larger-scale instruction-tuning dataset, utilizing prompt templates to facilitate instruction-tuning. This approach allows BatGPT-Chem to capture a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions and strong zero-shot capabilities.
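
As a rough illustration of how prompt templates can turn reaction records into instruction-tuning examples, the sketch below pairs a natural-language instruction with SMILES strings. The template wording, the dictionary keys, and the function `build_example` are assumptions made for illustration; the paper's actual templates are not reproduced here.

```python
import random

# Hypothetical templates; the wording is an assumption, not the paper's.
RETRO_TEMPLATES = [
    "Propose reactants that could be used to synthesize {product}.",
    "Which precursors would yield the product {product}?",
]


def build_example(product: str, reactants: list[str], rng: random.Random) -> dict:
    """Turn one (product, reactants) record into an instruction/response pair."""
    template = rng.choice(RETRO_TEMPLATES)
    return {
        "instruction": template.format(product=product),
        # Reactants are joined with '.', the SMILES separator between molecules.
        "response": ".".join(reactants),
    }


example = build_example("CC(=O)OCC", ["CC(=O)O", "CCO"], random.Random(0))
print(example["instruction"])
print(example["response"])  # CC(=O)O.CCO
```

Sampling among several phrasings of the same task is one common way to keep an instruction-tuned model from overfitting to a single prompt wording.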

Training Techniques

BatGPT-Chem is trained with both autoregressive and bidirectional objectives on more than one hundred million instances. It builds on the previously developed BatGPT-15B model: the base model's vocabulary is expanded with specialized chemical terms, and the model is then refined through instruction tuning.

Experimental Design

Benchmark Datasets

To evaluate BatGPT-Chem’s performance, eight datasets with various reaction types were collected and organized to establish a new benchmark for retrosynthesis prediction. These datasets include:

Suzuki-Miyaura (SM) Dataset: Contains 5,760 Suzuki-Miyaura cross-coupling reactions.
High-Throughput Experiments Buchwald-Hartwig (HTE BH) Dataset: Contains 3,955 Pd-catalyzed Buchwald-Hartwig C-N cross-coupling reactions.
Electronic Laboratory Notebooks Buchwald-Hartwig (ELN BH) Dataset: Contains 551 Pd-catalyzed Buchwald-Hartwig C-N cross-coupling reactions.
Asymmetric Allylic Alkylation with Amine (AAAA) Dataset: Contains 273 reactions.
Denmark Dataset: Contains 1,075 asymmetric N,S-acetal formation reactions using chiral phosphoric acid (CPA) catalysts.
Asymmetric Hydrogenation of Olefins (AHO) Dataset: Contains 10,268 reactions.
Metabolites and Biochemical Reactions (BioChem) Dataset: Contains 33,687 reactions.
USPTO-100 Dataset: Contains 100 reactions randomly sampled from the USPTO dataset.
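
The dataset sizes differ by more than two orders of magnitude, which is worth keeping in mind when reading aggregate scores. The short sketch below tallies the counts listed above; the dictionary keys are abbreviations chosen here, and whether every reaction in each dataset is used for evaluation is not stated in this summary.

```python
# Reaction counts as listed above.
DATASET_SIZES = {
    "SM": 5760,
    "HTE BH": 3955,
    "ELN BH": 551,
    "AAAA": 273,
    "Denmark": 1075,
    "AHO": 10268,
    "BioChem": 33687,
    "USPTO-100": 100,
}

total = sum(DATASET_SIZES.values())
largest = max(DATASET_SIZES, key=DATASET_SIZES.get)

print(total)  # 55669
print(f"{largest}: {DATASET_SIZES[largest] / total:.1%}")  # BioChem: 60.5%
```

BioChem alone accounts for roughly 60% of the listed reactions, so per-dataset scores are more informative than a single pooled number.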

Evaluation Metrics

To thoroughly evaluate retrosynthetic models, several metrics were introduced:
1. Coverage of Reactants: Indicates whether the true reactant molecules are covered by the model outputs.
2. Intersection of Reaction Conditions: Measures whether the model predicts any of the true reaction condition molecules.
3. MaxFrag: Assesses the ability to identify principal transformations for classical retrosynthesis.
4. Validity: Measures the fraction of SMILES strings predicted by the model that are well formed and do not violate chemical principles.
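
The four metrics can be sketched as follows. This is one reading of the one-line descriptions above, not the paper's evaluation code, and it rests on several assumptions: SMILES strings are compared as plain text (so they should be canonicalized beforehand), coverage requires every true reactant to be predicted, the largest fragment is approximated by the longest SMILES string, and validity is delegated to a caller-supplied check because real validation needs a cheminformatics toolkit. All function names are hypothetical.

```python
from typing import Callable, Iterable


def coverage(true_reactants: Iterable[str], predicted: Iterable[str]) -> bool:
    """True if every ground-truth reactant appears among the predictions."""
    return set(true_reactants) <= set(predicted)


def condition_intersection(true_conditions: Iterable[str],
                           predicted: Iterable[str]) -> bool:
    """True if at least one ground-truth condition molecule is predicted."""
    return bool(set(true_conditions) & set(predicted))


def largest_fragment(molecules: Iterable[str]) -> str:
    """Crude proxy: treat the longest SMILES string as the largest fragment."""
    return max(molecules, key=len)


def max_frag_match(true_reactants: Iterable[str],
                   predicted: Iterable[str]) -> bool:
    """True if the largest true reactant equals the largest predicted one."""
    return largest_fragment(true_reactants) == largest_fragment(predicted)


def validity(predicted: list[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of predicted strings accepted by the supplied validity check."""
    if not predicted:
        return 0.0
    return sum(1 for s in predicted if is_valid(s)) / len(predicted)


# Illustrative data (not from the paper).
truth = ["CC(=O)O", "CCO"]
pred_reactants = ["CCO", "CC(=O)O"]
pred_conditions = ["OS(=O)(=O)O"]

# Toy stand-in for a real parser: only checks that parentheses are balanced.
toy_check = lambda s: s.count("(") == s.count(")")

print(coverage(truth, pred_reactants))                          # True
print(condition_intersection(["OS(=O)(=O)O"], pred_conditions)) # True
print(max_frag_match(truth, pred_reactants))                    # True
print(validity(["CCO", "CC(=O"], toy_check))                    # 0.5
```

In practice the `is_valid` argument would wrap a SMILES parser, and the fragment size would be measured in atoms rather than characters.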

Results and Analysis

Retrosynthesis Prediction Benchmark

BatGPT-Chem achieved state-of-the-art MaxFrag scores on most datasets; on the ELN BH dataset it was surpassed by ChemDFM. The model also achieved high coverage scores, indicating that it can accurately predict the remaining reactants in addition to the principal one.

Reaction Condition Prediction

BatGPT-Chem outperformed the other methods in predicting reaction conditions, which helps it produce complete retrosynthesis routes rather than reactant lists alone.

Diversity in Retrosynthesis Routes

BatGPT-Chem excelled at generating multiple viable retrosynthesis routes for a target molecule, including reactions that go beyond those recorded in the benchmark datasets.

Validity of Outputs

BatGPT-Chem consistently achieved high validity rates, confirming its strong grasp of chemical language, including the cis-trans and chiral information encoded in SMILES.

Overall Conclusion

BatGPT-Chem sets new benchmarks for effective and dependable AI-driven retrosynthesis planning. By integrating specialized chemical language with advanced instruction-tuning techniques via a powerful LLM, BatGPT-Chem enhances the accuracy and robustness of retrosynthesis analysis. The model’s ability to predict reaction conditions, generate diverse retrosynthesis routes, and maintain high validity rates underscores its potential to revolutionize computational chemistry. Future improvements will likely require collective efforts from the scientific community to address data quality and expand the scope of chemical languages covered.
