The University of Southampton
University of Southampton Institutional Repository

Wiswesser line notation in modern cheminformatics; Implementations for parsing, conversion and compression of chemical entities

University of Southampton
Blakey, Michael
dd52ac5f-a5f5-4698-8099-712f410fa92e
Frey, Jeremy
ba60c559-c4af-44f1-87e6-ce69819bf23f
Pearman-Kanza, Samantha
b73bcf34-3ff8-4691-bd09-aa657dcff420

Blakey, Michael (2024) Wiswesser line notation in modern cheminformatics; Implementations for parsing, conversion and compression of chemical entities. University of Southampton, Doctoral Thesis, 329pp.

Record type: Thesis (Doctoral)

Abstract

Wiswesser Line Notation (WLN) is an older line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been superseded by SMILES and InChI, distribution of WLN during its active years was extensive. This thesis explores the potential of WLN as a modern notation system or, alternatively, whether some of its design fundamentals can be carried forward into newer ideas. The compactness and fragmented nature of WLN may offer advantages in handling and managing vast chemical datasets if the rule set can be codified properly. It also seems reasonable to state that, since the notation was designed at a time when computer memory was both scarce and expensive, its design would fundamentally focus on efficiency. Compactness of a notation is certainly advantageous; however, exponential growth of chemical data demands more than just an efficient standard representation. It calls for algorithms and procedures aimed at maximising storage efficiency, particularly in terms of data compression. Lossless text compression techniques, adapted and optimised for chemical data, will soon be required in order for researchers to handle large-scale curated sources. A novel approach could involve examining a chemical notation system purely in terms of its compressibility. In this regard, WLN appears to offer a promising starting point. The objective of this thesis is two-fold. The first aim is to develop algorithms for conversion between WLN and other line notations, such as SMILES and InChI, which are commonly used in modern cheminformatics. The second is to test the compressibility of WLN. Compression schemes require a large corpus of data in order to give accurate assessments; therefore, any conversion tools will have to be robust enough to convert and encode millions of compounds in order to create the required data.
Once substantial datasets are accessible, domain-specific compression schemes built on WLN can be used to assess whether a notational structure based on fragments and scaffolds can in fact save space.
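
As an illustration of what examining a notation "purely in terms of its compressibility" could look like in practice, the following minimal Python sketch measures the ratio of compressed size to raw size for a corpus of line-notation strings. It is not the method used in the thesis: the function name `compression_ratio`, the choice of the general-purpose zlib (DEFLATE) compressor, and the tiny repeated SMILES corpus are all assumptions made for illustration. A real assessment would use millions of distinct records and would run the same measurement on the equivalent WLN strings.

```python
import zlib


def compression_ratio(records: list[str], level: int = 9) -> float:
    """Return compressed size / raw size for newline-joined records.

    Hypothetical helper: a smaller value means the corpus compresses better.
    """
    raw = "\n".join(records).encode("ascii")
    if not raw:
        raise ValueError("corpus is empty")
    return len(zlib.compress(raw, level)) / len(raw)


# Illustrative corpus only (ethanol, acetone, phenol as SMILES), repeated
# so that the compressor has something to work with. Substitute a real
# corpus, and its WLN equivalent, to compare the two notations.
smiles_corpus = ["CCO", "CC(=O)C", "c1ccccc1O"] * 1000
ratio = compression_ratio(smiles_corpus)
print(f"compressed/raw size ratio: {ratio:.4f}")
```

Comparing the raw sizes and the ratios for the same set of compounds written in two notations gives a first, rough indication of which representation stores more compactly, both before and after general-purpose compression.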

Text
mkb_pure_submission_pdf3a
Restricted to Repository staff only until 1 October 2025.
Available under License University of Southampton Thesis Licence.
Text
Final-thesis-submission-Examination-Mr-Michael-Blakey
Restricted to Repository staff only

More information

Published date: 2024

Identifiers

Local EPrints ID: 495257
URI: http://eprints.soton.ac.uk/id/eprint/495257
PURE UUID: 60b5367c-7079-405f-b4d8-d982bc3aaca6
ORCID for Jeremy Frey: orcid.org/0000-0003-0842-4302
ORCID for Samantha Pearman-Kanza: orcid.org/0000-0002-4831-9489

Catalogue record

Date deposited: 05 Nov 2024 17:30
Last modified: 06 Nov 2024 02:56


Contributors

Author: Michael Blakey
Thesis advisor: Jeremy Frey
Thesis advisor: Samantha Pearman-Kanza

