The University of Southampton
University of Southampton Institutional Repository

Comparing "parallel passages" in Digital Archives


Levene, Dan, Levene, Mark, Martyn, Harris and Zhang, Dell (2019) Comparing "parallel passages" in Digital Archives. Journal of Documentation. (doi:10.1108/JD-10-2018-0175).

Record type: Article

Abstract

The term “parallel passage” refers to identical or approximate text patterns of variable length that could be regarded as semantically equivalent. “Parallel passages” are alternative surface representations that either exhibit identical wording, as in reported speech and direct quotations, or show some small variation in grammatical structure or vocabulary choice as a result of paraphrasing. On the one hand, differences in vocabulary choice may be the result of synonymy, or of hyperonymy, where a general or higher-level concept has been selected [Madnani and Dorr, 2010]. On the other hand, paraphrasing on the part of the author may provide evidence of text reuse, or intertextuality [Fairclough, 1992], where the author has summarised the main concepts, or meaning, encoded by one or more texts that preceded it. Further differences between passages may arise from a shift in authorship, dialect, the natural evolution of language over time [Büchler, 2010], and errors introduced by optical character recognition (OCR) during digitisation. Comparing equivalent or similar shared text patterns across text corpora stored in digital archives has become increasingly challenging and time-consuming, because the current scale of digital text data makes comparison across multiple documents practically impossible to do manually. Identifying parallel passages, such as those exemplified by paraphrases, also supports a range of natural language tasks, including text generation, information retrieval and extraction, and summarisation. This paper presents an overview of the text mining tools developed to compare parallel passages, which were deployed in Samtla (Search And Mining Tools for Language Archives), a system developed to support research on historic and cultural heritage collections of documents stored in digital archives.
The paper is organised as follows. In Section 1, we review the related work. Section 3 describes the corpora used as test cases to explore the results generated by our proposed approach. In Section 4, we describe the model used as a basis for extracting and scoring the contents of documents according to their shared text patterns. In Section 5, we describe the approach used for identifying related documents according to our proposed model, where we measure the similarity of pairs of documents based on their character-level n-gram probability distributions. Section 6 presents an approach for visualising local similarities between related documents in the form of variable-length parallel passages extracted from the document content. In Section 7, we briefly discuss the motivation behind the user interface, and some of the language- and corpus-dependent issues that the document comparison tool addresses, to demonstrate the flexibility of the approach across different domains, languages, authors, and time periods. We conclude the paper in Section 9 with a summary of the work and directions for future research and development.
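
The two comparison steps named in the abstract can be illustrated with a minimal sketch. It is not the Samtla implementation: the abstract does not say which measure compares the n-gram distributions, so Jensen-Shannon divergence is an assumption here, as are the trigram order, the function names, the minimum passage length, and the sample sentences. Exact shared runs found with Python's standard `difflib` stand in for the variable-length parallel passages, and approximate matches are not handled.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import log2


def char_ngram_distribution(text, n=3):
    """Relative frequency of each character n-gram in text (n=3 is an assumption)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:  # text shorter than n characters
        return {}
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence: 0 for identical distributions, 1 for disjoint ones."""
    divergence = 0.0
    for gram in set(p) | set(q):
        p_g, q_g = p.get(gram, 0.0), q.get(gram, 0.0)
        m_g = 0.5 * (p_g + q_g)  # mixture of the two distributions
        if p_g > 0:
            divergence += 0.5 * p_g * log2(p_g / m_g)
        if q_g > 0:
            divergence += 0.5 * q_g * log2(q_g / m_g)
    return divergence


def shared_passages(a, b, min_length=10):
    """Exact character runs of at least min_length (hypothetical threshold) shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks()
            if m.size >= min_length]


# Illustrative sentences only; they are not taken from the paper's corpora.
doc_a = "In the beginning God created the heaven and the earth."
doc_b = "In the beginning God made the heavens and the earth."

print(jensen_shannon_divergence(char_ngram_distribution(doc_a),
                                char_ngram_distribution(doc_b)))
print(shared_passages(doc_a, doc_b))
```

A lower divergence suggests that two documents are more closely related, and the shared runs indicate where in the text the overlap occurs.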

This record has no associated files available for download.

More information

Accepted/In Press date: 17 July 2019
e-pub ahead of print date: 2 September 2019
Keywords: archives, Data mining, Humanities

Identifiers

Local EPrints ID: 436352
URI: http://eprints.soton.ac.uk/id/eprint/436352
PURE UUID: 94d9b080-8edc-436e-830e-1dca3c4fa91b

Catalogue record

Date deposited: 06 Dec 2019 17:30
Last modified: 16 Mar 2024 02:54

Contributors

Author: Dan Levene
Author: Mark Levene
Author: Harris Martyn
Author: Dell Zhang
