The University of Southampton
University of Southampton Institutional Repository

Learning and evaluation of topics via distributional semantics

Augustin, Alexandry
dca1be1e-909c-471a-ba63-19da670b095a
Hare, Jonathon
65ba2cda-eaaf-4767-a325-cd845504e5a9

Augustin, Alexandry (2020) Learning and evaluation of topics via distributional semantics. University of Southampton, Doctoral Thesis, 176pp.

Record type: Thesis (Doctoral)

Abstract

Written language is a means of communication. It not only shapes our thoughts but also helps us communicate information. As the amount of digital text available keeps growing, it becomes increasingly difficult to locate and keep track of specific information of interest. This observation has fuelled the search for sophisticated representations of written text, and for methods for learning meaning. In particular, topic identification has grown in importance in recent years as an approach to summarise, organise and understand text. Underpinning modern topic identification methods is the framework of distributional semantics, which is based on the assumption that meaning is associated with use and, more specifically, that meaning can be learned by examining the contexts in which words occur. Motivated by this, we look in this thesis at the broad field of topic identification in text learned via state-of-the-art distributional semantics models. As such, we provide new answers to the complex question of how meaning is used to derive abstract concepts like topics, and how non-expert humans evaluate such abstract concepts generated from artificial processes. In more detail, we address three key problems. First, we tackle the problem of evaluating the output of topic models (a particular kind of topic identification method) on large text corpora by leveraging non-expert annotators to assess the relevance of topics to a set of documents. Second, we develop a new method to assist in the interpretation of topics by providing additional context. Our solution learns topics as collections of sentences extracted from a large corpus of unstructured documents. Finally, we identify and track the topic of text collected over time, focusing on text-based dialogues, which often consist of short utterances covering a variety of topics.

Text
Alexandry Augustin Thesis
Available under License University of Southampton Thesis Licence.

More information

Published date: December 2020

Identifiers

Local EPrints ID: 447272
URI: http://eprints.soton.ac.uk/id/eprint/447272
PURE UUID: a0c87c00-9b01-4a52-8463-2aa809cec885
ORCID for Alexandry Augustin: orcid.org/0000-0003-0285-9444
ORCID for Jonathon Hare: orcid.org/0000-0003-2921-4283

Catalogue record

Date deposited: 08 Mar 2021 17:31
Last modified: 13 Dec 2021 02:57

Contributors

Author: Alexandry Augustin
Thesis advisor: Jonathon Hare
