The University of Southampton
University of Southampton Institutional Repository

A simplified topological representation of text for local and global context


Sami, Ishrat and Farrahi, Katayoun (2017) A simplified topological representation of text for local and global context. In MM '17 Proceedings of the 2017 ACM on Multimedia Conference. ACM. pp. 1451-1456. (doi:10.1145/3123266.3123330).

Record type: Conference or Workshop Item (Paper)

Abstract

Topological data analysis (TDA) is a branch of mathematics that analyzes the shape of high-dimensional data sets using geometry and algebra. TDA is used for data visualization, representing the relationships among elements as a network. Traditionally, TDA is quadratic in complexity and is not commonly used for natural language processing. In this research, we visualize the relationships among words in a text block, words in a corpus, and text blocks in a corpus. A text block is a unit of a corpus, such as a web page in a web corpus, a chapter or section in a book corpus, or a document in a media corpus. This research proposes a circular topology for representing words in both the Local Context (LC) and the Global Context (GC). Each text block is a set of sentences forming the LC. We found that featured words are extracted successfully by our LC analysis. The occurrences of the extracted featured words across the corpus form the GC. We evaluate the proposed simplified topological analysis on three corpora: a single-book corpus, a book corpus consisting of 7 books with 6020 narrations, and a web corpus consisting of 990 web pages. The peripheral nature of the LC reduced the vocabulary size of the corpus significantly in O(nm) time, where n is the number of text blocks and m is the number of nouns in a sentence. GC analysis of featured words revealed useful properties of featured-word movement that can be used to analyze topic evolution. GC analysis of text block points aims to find closely related text blocks within a radius; this produced interesting results that need further supervised investigation. Research on topology-driven natural language processing is in its infancy. This article contributes to the field by introducing a method, motivated by TDA, to represent and visualize the peripheral nature of text blocks and corpora, by achieving dimensionality reduction through local analysis, and by simplifying complex topological analysis through localization.

Full text not available from this repository.

More information

Accepted/In Press date: 2 July 2017
e-pub ahead of print date: 23 October 2017
Published date: October 2017

Identifiers

Local EPrints ID: 419464
URI: http://eprints.soton.ac.uk/id/eprint/419464
PURE UUID: b6beeafe-f713-4708-af38-f92572b11578

Catalogue record

Date deposited: 12 Apr 2018 16:30
Last modified: 06 Oct 2020 23:55


