Wiswesser line notation in modern cheminformatics; Implementations for parsing, conversion and compression of chemical entities
University of Southampton
Blakey, Michael
dd52ac5f-a5f5-4698-8099-712f410fa92e
2024
Frey, Jeremy
ba60c559-c4af-44f1-87e6-ce69819bf23f
Pearman-Kanza, Samantha
b73bcf34-3ff8-4691-bd09-aa657dcff420
Blakey, Michael (2024) Wiswesser line notation in modern cheminformatics; Implementations for parsing, conversion and compression of chemical entities. University of Southampton, Doctoral Thesis, 329pp.
Record type: Thesis (Doctoral)
Abstract
Wiswesser Line Notation (WLN) is an older line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been surpassed by SMILES and InChI, WLN was distributed extensively during its active years. This thesis explores the potential of WLN as a modern notation system or, alternatively, whether some of its design fundamentals can be carried forward into newer ideas. The compactness and fragmented nature of WLN may offer advantages in handling and managing vast chemical datasets if the rule set can be codified properly. It also seems reasonable to suppose that, since the notation was designed at a time when computer memory was both scarce and expensive, its design would fundamentally focus on efficiency. Compactness of a notation is certainly advantageous; however, the exponential growth of chemical data demands more than just an efficient standard representation. It calls for algorithms and procedures designed to maximise storage efficiency, particularly through data compression. Lossless text compression techniques, adapted and optimised for chemical data, will soon be required for researchers to handle large-scale curated sources. A novel approach could involve examining a chemical notation system purely in terms of its compressibility. In this regard, WLN appears to offer a promising starting point. The objective of this thesis is two-fold: first, to develop algorithms for conversion between WLN and other line notations such as SMILES and InChI, which are commonly used in modern cheminformatics; second, to test the compressibility of WLN. Compression schemes require a large corpus of data to be assessed accurately, so any conversion tools will have to be robust enough to convert and encode millions of compounds to create the required data. Once substantial datasets are accessible, domain-specific compression schemes built on WLN can be used to assess whether a notational structure based on fragments and scaffolds can in fact save space.
Text: mkb_pure_submission_pdf3a (restricted to Repository staff only until 1 October 2025)
Text: Final-thesis-submission-Examination-Mr-Michael-Blakey (restricted to Repository staff only)
More information
Published date: 2024
Identifiers
Local EPrints ID: 495257
URI: http://eprints.soton.ac.uk/id/eprint/495257
PURE UUID: 60b5367c-7079-405f-b4d8-d982bc3aaca6
Catalogue record
Date deposited: 05 Nov 2024 17:30
Last modified: 06 Nov 2024 02:56
Contributors
Author: Michael Blakey