The University of Southampton
University of Southampton Institutional Repository

Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents

Blakey, Michael, Pearman-Kanza, Samantha and Frey, Jeremy G. (2024) Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents. Journal of Cheminformatics, 16 (1), [42]. (doi:10.1186/s13321-024-00831-2).

Record type: Article

Abstract

Purpose: Wiswesser Line Notation (WLN) is an old line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been surpassed by SMILES and InChI, distribution of WLN during its active years was extensive. In the context of modernising chemical data, we present a comprehensive WLN parser developed using the OpenBabel toolkit, capable of translating WLN strings into the various formats supported by the library. Furthermore, we have devised a specialised Finite State Machine, constructed from the rules of WLN, enabling the recognition and extraction of chemical strings from large bodies of text. Open-access WLN data with corresponding SMILES or InChI notation is rare; however, ChEMBL, ChemSpider and PubChem all contain WLN records, which were used for conversion scoring. Our investigation revealed a notable proportion of inaccuracies within the database entries, and we have taken steps to rectify these errors whenever feasible.

Scientific contribution: Tools for both the extraction and conversion of WLN from chemical documents have been successfully developed. Both the Deterministic Finite Automaton (DFA) and parser handle the majority of WLN rules officially endorsed in the three major WLN manuals, with the parser showing a clear jump in accuracy and chemical coverage over previous submissions. The GitHub repository can be found here: https://github.com/Mblakey/wiswesser.

Text
Zombie_Cheminformatics___Extraction_and_Conversion_of_Wiswesser_Line_Notation__WLN__from_Chemical_Documents_Version_2 - Accepted Manuscript
Available under License Creative Commons Attribution.
Download (571kB)
Text
s13321-024-00831-2 - Version of Record
Available under License Creative Commons Attribution.
Download (2MB)

More information

Accepted/In Press date: 23 March 2024
Published date: 15 April 2024
Additional Information: Publisher Copyright: © The Author(s) 2024.
Keywords: Chemical compounds, Chemical line notation, SMILES, Text parsing, WLN

Identifiers

Local EPrints ID: 489426
URI: http://eprints.soton.ac.uk/id/eprint/489426
ISSN: 1758-2946
PURE UUID: 332b118c-e01c-4ca0-96bc-d20621d06059
ORCID for Samantha Pearman-Kanza: orcid.org/0000-0002-4831-9489
ORCID for Jeremy G. Frey: orcid.org/0000-0003-0842-4302

Catalogue record

Date deposited: 24 Apr 2024 16:30
Last modified: 22 May 2024 01:54

Contributors

Author: Michael Blakey
Author: Samantha Pearman-Kanza
Author: Jeremy G. Frey
