Insights from heterogeneous data through transitive semantic relationships and text analytics
University of Southampton
Ralph, David
ea363a70-b796-4912-89c5-e256c5dc1282
2022
Green, Nicolas
d9b47269-c426-41fd-a41d-5f4579faa581
Ralph, David
(2022)
Insights from heterogeneous data through transitive semantic relationships and text analytics.
University of Southampton, Doctoral Thesis, 148pp.
Record type:
Thesis
(Doctoral)
Abstract
Many organisations are finding that the volume of information they need to analyse to make effective decisions is increasing. An important element in effective decision making is the ability to prioritise information quickly and accurately from a variety of sources. Technology tools are widely used to aid decision making through analysis and visualisation of numeric data, leveraging structured knowledge as in expert systems, or identifying items based on known existing relationships and content information as in recommender systems. However, producing similar insights from unstructured text documents of varying formats, intents, and domains, with little prior knowledge or labelling, remains an open problem. This thesis takes the approach of using machine understanding of natural language text and the semantic content of documents as the basis for downstream tasks of recommendation, visualisation, summarisation, clustering, and topic naming to highlight key areas of interest in large heterogeneous datasets. The approach builds on both traditional techniques and recent advances in machine learning and natural language processing, combining and supplementing them to address issues including sparse labelling, the cold-start problem, and the explainability of results. A novel recommendation algorithm, Transitive Semantic Relationships (TSR), is proposed to address challenging cases of the cold-start problem and is demonstrated as an effective tool for identifying supply chain relationships using company descriptions and a small number of known relationships. For the more general problem of finding meaning in large collections of unstructured text, this thesis proposes and demonstrates a methodology for combining several existing text analytics techniques to produce an overview of the distribution and typical content of key topics present in the data.
This method is demonstrated on varied examples, including a survey of experts' concerns regarding the COVID-19 pandemic in the United Kingdom, the descriptions of businesses on the Isle of Wight, and the descriptions of 2500 TED talks. A web-based tool, the Text Insights Pipeline (TIP), is presented, enabling non-experts to apply this approach to other collections of unstructured text. This thesis concludes that semantic understanding of text through deep learning, coupled with explainable downstream algorithms, is an effective basis for producing explainable insights and representative overviews of large unstructured text datasets. The contributions of this thesis have already seen adoption in industry, government, and research, and have the potential to open previously indigestible datasets to analysis by aiding in the presentation and organisation of unstructured text data.
Text
David Ralph PhD Thesis
- Version of Record
Text
PTD_Thesis_Ralph-SIGNED
Restricted to Repository staff only
More information
Published date: 2022
Identifiers
Local EPrints ID: 470732
URI: http://eprints.soton.ac.uk/id/eprint/470732
PURE UUID: 4d57d524-e59d-4b00-85c3-20497b40e385
Catalogue record
Date deposited: 18 Oct 2022 17:37
Last modified: 17 Mar 2024 02:59
Contributors
Author:
David Ralph
Thesis advisor:
Nicolas Green