Insights from heterogeneous data through transitive semantic relationships and text analytics

Ralph, David (2022) Insights from heterogeneous data through transitive semantic relationships and text analytics. University of Southampton, Doctoral Thesis, 148pp.

Record type: Thesis (Doctoral)

Abstract

Many organisations are finding that the volume of information they need to analyse to make effective decisions is increasing. An important element in effective decision making is the ability to prioritise information quickly and accurately from a variety of sources. Technology tools are widely used to aid decision making through analysis and visualisation of numeric data, leveraging structured knowledge as in expert systems, or identifying items based on known existing relationships and content information as in recommender systems. However, producing similar insights from unstructured text documents of varying formats, intents, and domains, with little prior knowledge or labelling, remains an open problem. This thesis takes the approach of using machine understanding of natural language text and the semantic content of documents as the basis for downstream tasks of recommendation, visualisation, summarisation, clustering, and topic naming to highlight key areas of interest in large heterogeneous datasets. The approach builds on both traditional techniques and recent advances in machine learning and natural language processing, and combines and supplements them to address issues including sparse labelling, the cold-start problem, and the explainability of results. A novel recommendation algorithm, Transitive Semantic Relationships (TSR), is proposed to address challenging cases of the cold-start problem and is demonstrated as an effective tool for identifying supply chain relationships using company descriptions and a small number of known relationships. For the more general problem of finding meaning in large collections of unstructured text, this thesis proposes and demonstrates a methodology for combining several existing text analytics techniques to produce an overview of the distribution and typical content of key topics present in the data.
This method is demonstrated for varied examples including a survey of experts' concerns regarding the COVID-19 pandemic in the United Kingdom, the descriptions of businesses on the Isle of Wight, and the descriptions of 2500 TED talks. A web-based tool, the Text Insights Pipeline (TIP), is presented, enabling non-experts to make use of this approach for analysis of other collections of unstructured text. This thesis concludes that semantic understanding of text through deep learning coupled with explainable downstream algorithms is an effective basis for producing explainable insights and representative overviews of large unstructured text datasets. The contributions of this thesis have already seen adoption in industry, government, and research, and have the potential for making previously indigestible datasets open to analysis by aiding in the presentation and organisation of unstructured text data.

Text

David Ralph PhD Thesis - Version of Record

Available under License University of Southampton Thesis Licence.

