University of Southampton Institutional Repository

Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier



Handsel, Jennifer, Matthews, Brian, Knight, Nicola J. and Coles, Simon J. (2021) Translating the InChI: adapting neural machine translation to predict IUPAC names from a chemical identifier. Journal of Cheminformatics, 13 (1), [79]. (doi:10.1186/s13321-021-00535-x).

Record type: Article

Abstract

We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine’s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data. [Figure not available: see fulltext.]
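
The character-by-character processing described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the special tokens (`<pad>`, `<sos>`, `<eos>`), the function names, and the choice of ethanol as the example pair are all assumptions made for illustration. It only shows how an InChI and an IUPAC name might each be turned into a sequence of integer ids, one per character, before being fed to an encoder-decoder model.

```python
# Hypothetical character-level tokenizer; all names here are illustrative
# assumptions, not taken from the paper's implementation.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(strings):
    """Map the special tokens and every character seen in `strings` to an id."""
    chars = sorted({ch for s in strings for ch in s})
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}


def encode(text, vocab):
    """Return one id per character, wrapped in start and end markers."""
    return [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]


def decode(ids, vocab):
    """Invert `encode`, dropping the special tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, SOS, EOS))


# Example pair: ethanol (standard InChI and IUPAC name).
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
name = "ethanol"

src_vocab = build_vocab([inchi])  # source side: InChI characters
tgt_vocab = build_vocab([name])   # target side: IUPAC name characters

src_ids = encode(inchi, src_vocab)
tgt_ids = encode(name, tgt_vocab)
```

Separate source and target vocabularies are used here because the two strings draw on different character sets. In a real pipeline the vocabularies would be built from the whole training set rather than a single pair.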

Text
s13321-021-00535-x - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 20 July 2021
Published date: 7 October 2021
Additional Information: We thank the SCARF team in Scientific Computing for providing access to high performance computing clusters. JH thanks Keith Butler for helping with batch scripts for the computing cluster, and Tom Allam for helping with the commercial software testing.
Funding Information: This work was funded by the Physical Sciences Data-science Service under EPSRC grant number EP/S020357/1.
Publisher Copyright: © 2021, The Author(s).
Copyright: Copyright 2021 Elsevier B.V., All rights reserved.
Keywords: Attention, GPU, InChI, IUPAC, seq2seq, Transformer

Identifiers

Local EPrints ID: 453346
URI: http://eprints.soton.ac.uk/id/eprint/453346
ISSN: 1758-2946
PURE UUID: e32b214d-a70a-442a-868a-195493ce851f
ORCID for Nicola J. Knight: orcid.org/0000-0001-8286-3835
ORCID for Simon J. Coles: orcid.org/0000-0001-8414-9272

Catalogue record

Date deposited: 13 Jan 2022 17:52
Last modified: 18 Mar 2024 03:52

Contributors

Author: Jennifer Handsel
Author: Brian Matthews
Author: Nicola J. Knight
Author: Simon J. Coles
