The University of Southampton
University of Southampton Institutional Repository

Insights from heterogeneous data through transitive semantic relationships and text analytics

University of Southampton
Ralph, David
Green, Nicolas

Ralph, David (2022) Insights from heterogeneous data through transitive semantic relationships and text analytics. University of Southampton, Doctoral Thesis, 148pp.

Record type: Thesis (Doctoral)

Abstract

Many organisations are finding that the volume of information they need to analyse to make effective decisions is increasing. An important element in effective decision making is the ability to prioritise information quickly and accurately from a variety of sources. Technology tools are widely used to aid decision making by analysing and visualising numeric data, leveraging structured knowledge as in expert systems, or identifying items based on known existing relationships and content information as in recommender systems. However, producing similar insights from unstructured text documents of varying formats, intents, and domains, with little prior knowledge or labelling, remains an open problem. This thesis uses machine understanding of natural language text and the semantic content of documents as the basis for the downstream tasks of recommendation, visualisation, summarisation, clustering, and topic naming, in order to highlight key areas of interest in large heterogeneous datasets. The approach builds on both traditional techniques and recent advances in machine learning and natural language processing, combining and supplementing them to address issues including sparse labelling, the cold-start problem, and the explainability of results. A novel recommendation algorithm, Transitive Semantic Relationships (TSR), is proposed to address challenging cases of the cold-start problem and is demonstrated as an effective tool for identifying supply chain relationships using company descriptions and a small number of known relationships. For the more general problem of finding meaning in large collections of unstructured text, this thesis proposes and demonstrates a methodology for combining several existing text analytics techniques to produce an overview of the distribution and typical content of key topics present in the data.
This method is demonstrated on varied examples, including a survey of experts' concerns regarding the COVID-19 pandemic in the United Kingdom, the descriptions of businesses on the Isle of Wight, and the descriptions of 2500 TED talks. A web-based tool, the Text Insights Pipeline (TIP), is presented, enabling non-experts to apply this approach to other collections of unstructured text. This thesis concludes that semantic understanding of text through deep learning, coupled with explainable downstream algorithms, is an effective basis for producing explainable insights and representative overviews of large unstructured text datasets. The contributions of this thesis have already seen adoption in industry, government, and research, and have the potential to open previously indigestible datasets to analysis by aiding the presentation and organisation of unstructured text data.

Text
David Ralph PhD Thesis - Version of Record
Available under License University of Southampton Thesis Licence.
Download (6MB)
Text
PTD_Thesis_Ralph-SIGNED
Restricted to Repository staff only

More information

Published date: 2022

Identifiers

Local EPrints ID: 470732
URI: http://eprints.soton.ac.uk/id/eprint/470732
PURE UUID: 4d57d524-e59d-4b00-85c3-20497b40e385
ORCID for David Ralph: orcid.org/0000-0003-3385-9295
ORCID for Nicolas Green: orcid.org/0000-0001-9230-4455

Catalogue record

Date deposited: 18 Oct 2022 17:37
Last modified: 17 Mar 2024 02:59

Contributors

Author: David Ralph
Thesis advisor: Nicolas Green
